Seriously kick-ass MCMC sampling
Once upon a time, there was an algorithm. This algorithm became very popular but it turned out that there were better ones. Some folks came up with a better one and some other folks implemented it in Python. It's a little crazy how much it rocks.
emcee is an extensible, pure-Python implementation of of Goodman & Weare's Affine Invariant Markov chain Monte Carlo (MCMC) Ensemble sampler. It's designed for Bayesian parameter estimation and it's really sweet!
Do one of the following:
sudo pip install emcee orsudo python setup.py install
in the unzipped directory.
Then run the tests to make sure that all went as planned:
python -c 'import emcee; emcee.test()'
import numpy as np
import emcee
def lnprob(x, ivar):
return -0.5*np.sum(ivar * x**2)
ndim, nwalkers = 10, 100
ivar = 1./np.random.rand(ndim)
p0 = [np.random.rand(ndim) for i in xrange(nwalkers)]
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, args=[ivar])
sampler.run_mcmc(p0, 1000)
...and it's blazingly fast!
chain property is a different shape
(nwalkers, nsteps, ndim) instead of
(nwalkers, ndim, nsteps).
In principle, adding multiprocessing
support is as simple as instantiating the sampler
object with the threads argument set to
an integer greater than 1.
In practice, this only works when your likelihood function is pickleable.
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, args=[means, icov], threads=15)
The annotated source code is an excellent place to get information about how this code works. Most of the core functionality is in the EnsembleSampler object that inherits from the Sampler abstract base class. There is also a simple Metropolis-Hastings sampler for comparison. You can also take a look at the unit tests.
An abstract sampler object that implements various helper functions
dim (int): Number of dimensions in the parameter space.lnpostfn (callable): A function that takes a vector in the parameter space as input and returns the natural logarithm of the posterior probability for that position.args (list): Optional list of extra arguments for lnpostfn. lnpostfn will be called with the sequence lnpostfn(p, *args).acceptance_fraction: An array (length: k) of the fraction of steps accepted for each
walker.acor: The autocorrelation time of each parameter in the chain (length:
dim) as estimated by the acor module.chain: A pointer to the Markov chain itself. The shape of this array is
(k, dim, iterations).flatchain: A shortcut for accessing chain flattened along the zeroth (walker)
axis.lnprobability: A pointer to the matrix of the value of lnprobfn produced at each
step for each walker. The shape is (k, iterations).random_state: The state of the internal random number generator. In practice, it's
the result of calling get_state() on a
numpy.random.mtrand.RandomState object. You can try to set this
property but be warned that if you do this and it fails, it will do
so silently.An alias for reset kept for backwards compatibility.
Return the log-probability at the given position.
Clear chain, lnprobability and the bookkeeping parameters.
Iterate sample for N iterations and return the result. The arguments
are passed directly to sample so see the parameter details given in
sample.
A generalized Ensemble sampler that uses 2 ensembles for parallelization.
This is a subclass of the Sampler object. See the Sampler object
for more details about the inherited properties.
k (int): The number of Goodman & Weare "walkers".dim (int): Number of dimensions in the parameter space.lnpostfn (callable): A function that takes a vector in the parameter space as input and returns the natural logarithm of the posterior probability for that position.a (float): The proposal scale parameter. (default: 2.0)args (list): Optional list of extra arguments for lnpostfn. lnpostfn will be called with the sequence lnpostfn(p, *args).postargs (list): Alias of args for backwards compatibility.threads (int): The number of threads to use for parallelization. If threads == 1, then the multiprocessing module is not used but if threads > 1, then a Pool object is created and calls to lnpostfn are run in parallel. (default: 1)pool (multiprocessing.Pool): An alternative method of using the parallelized algorithm. If pool is not None, the value of threads is ignored and the provided Pool is used for all parallelization. (default: None)AssertionError: If k < 2*dim or if k is not even.acceptance_fraction: An array (length: k) of the fraction of steps accepted for each
walker.acor: The autocorrelation time of each parameter in the chain (length:
dim) as estimated by the acor module.chain: A pointer to the Markov chain itself. The shape of this array is
(k, dim, iterations).flatchain: A shortcut for accessing chain flattened along the zeroth (walker)
axis.lnprobability: A pointer to the matrix of the value of lnprobfn produced at each
step for each walker. The shape is (k, iterations).random_state: The state of the internal random number generator. In practice, it's
the result of calling get_state() on a
numpy.random.mtrand.RandomState object. You can try to set this
property but be warned that if you do this and it fails, it will do
so silently.An alias for reset kept for backwards compatibility.
Return the log-probability at the given position.
Clear chain, lnprobability and the bookkeeping parameters.
Iterate sample for N iterations and return the result. The arguments
are passed directly to sample so see the parameter details given in
sample.
Advance the chain iterations steps as an iterator.
pos0 (numpy.ndarray): A list of the initial positions of the walkers in the parameter space. The shape is (k, dim).lnprob0 (numpy.ndarray): The list of log posterior probabilities for the walkers at positions given by p0. If lnprob is None, the initial values are calculated. The shape is (k, dim).rstate0 (tuple): The state of the random number generator. See the Sampler.random_state property for details.iterations (int): The number of steps to run. (default: 1)pos (numpy.ndarray): A list of the current positions of the walkers in the parameter space. The shape is (k, dim).lnprob (numpy.ndarray): The list of log posterior probabilities for the walkers at positions given by pos. The shape is (k, dim).rstate (tuple): The state of the random number generator.A sub-ensemble object that actually does the heavy lifting of the likelihood calculations and proposals of a new position.
sampler (Sampler): The sampler object that this sub-ensemble should be connected to.Calculate the vector of log-probability for the walkers.
pos (numpy.ndarray): The position vector in parameter space where the probability should be calculated. This defaults to the current position unless a different one is provided.lnprob (numpy.ndarray): A vector of log-probabilities with one entry for each walker in this sub-ensemble.Propose a new position for another ensemble given the current positions
ensemble (Ensemble): The ensemble to be advanced.q (numpy.array): The new proposed positions for the walkers in ensemble.newlnprob (numpy.ndarray): The vector of log-probabilities at the positions given by q.accept (numpy.ndarray): A vector of bools indicating whether or not the proposed position for each walker should be accepted.Walkers are the members of the ensemble. They are almost like separate Metropolis-Hastings chains but, of course, the proposal distribution for a given walker depends on the positions of all the other walkers in the ensemble. See Goodman & Weare (2010) for more details.
The best technique seems to be to start in a small ball around the a priori preferred position. Don't worry, the walkers quickly branch out and explore the rest of the space.
Double the number of walkers. If that doesn't work, double it again. And again. Until you run out of RAM. At that point, I don't know!
Try increasing the number of walkers. If that doesn't work, you can try pruning using a clustering algorithm like the one found in arxiv:1104.2612.