# STMcmc¶

class STMcmc(inTrees, bigT=None, modelName='SR2008_rf_aZ', beta=1.0, spaQ=0.5, stRFCalc='purePython1', runNum=0, sampleInterval=100, checkPointInterval=None, useSplitSupport=False)

An MCMC for making supertrees from a set of input trees.

This week, it implements the Steel and Rodrigo 2008 model, with the alpha calculation using the approximation in Bryant and Steel 2009.

Arguments

inTrees
A list of p4 tree objects. You could just use var.trees.
modelName

The SR2008 models implemented here are based on the Steel and Rodrigo 2008 description of a likelihood model, “Maximum likelihood supertrees” Syst. Biol. 57(2):243–250, 2008. At the moment, they are all SR2008_rf, meaning that they use Robinson-Foulds distances.

SR2008_rf_ia

Here ‘ia’ means ‘ignore alpha’. The alpha values are not calculated at all, as they are presumed (erroneously, but not too badly) to cancel out.

SR2008_rf_aZ

This uses the approximation for Z_T = alpha^{-1} as described in Equation 30 in the Bryant and Steel paper “Computing the distribution of a tree metric” in IEEE/ACM Transactions on computational biology and bioinformatics, VOL. 6, 2009.

SR2008_rf_aZ_fb

This is as SR2008_rf_aZ above, but additionally it allows beta to be a free parameter, and it is sampled. Samples are written to mcmc_prams* files.
beta
This only applies to SR2008. The beta is the weight as given in Steel and Rodrigo 2008. By default it is 1.0.

stRFCalc

There are three ways to calculate the RF distances and likelihood, for these SR2008_rf models above — all giving the same answer.

1. purePython1. Slow.
2. bitarray, using the bitarray module. About twice as fast as purePython1
3. fastReducedRF, written in C++ using boost and ublas. About 10 times faster than purePython1, but perhaps a bit of a bother to get going. It needs the fastReducedRF module, included in the p4 source code.

It is under control of the argument stRFCalc, which can be one of ‘purePython1’, ‘bitarray’, and ‘fastReducedRF’. By default it is purePython1, so you may want to at least install bitarray.

runNum

You may want to do more than one ‘run’ in the same directory, to facilitate convergence testing. The first runNum would be 0, and samples, likelihoods, and checkPoints are written to files with that number.

sampleInterval

Interval at which the chain is sampled, including writing a tree, and the logLike. Plan to get perhaps 1000 samples; so if you are planning to make a run of 10000 generations then you might set sampleInterval=10.

checkPointInterval

Interval at which checkpoints are made. If set to None (the default) it means don’t make checkpoints. My taste is to aim to make perhaps 2 to 4 per run. So if you are planning to start out with a run of 10000 generations, you could set checkPointInterval=5000, which will give you 2 checkpoints. See more about checkpointing below.

To prepare for a run, instantiate an Mcmc object, for example:

m = STMcmc(treeList, modelName='SR2008_rf_aZ_fb', stRFCalc='fastReducedRF', sampleInterval=10)


To start it running, do this:

# Tell it the number of generations to do
m.run(10000)


As it runs, it saves trees and likelihoods at sampleInterval intervals (actually whenever the current generation number is evenly divisible by the sampleInterval).

CheckPoints

Whenever the current generation number is evenly divisible by the checkPointInterval it will write a checkPoint file. A checkPoint file is the whole MCMC, pickled. Using a checkPoint, you can re-start an STMcmc from the point you left off. Or, in the event of a crash, you can restart from the latest checkPoint. But the most useful thing about them is that you can query checkPoints to get information about how the chain has been running, and about convergence diagnostics.

In order to restart the MCMC from the end of a previous run:

# read the last checkPoint file
m = func.unPickleStMcmc(0)  # runNum 0
m.run(20000)


Its that easy if your previous run finished properly. However, if your previous run has crashed and you want to restart it from a checkPoint, then you will need to repair the sample output files to remove samples that were taken after the last checkPoint, but before the crash. Fix the trees, likelihoods, prams, and sims. (You probably do not need to beware of confusing gen (eg 9999) and gen+1 (eg 10000) issues.) When you remove trees from the tree files be sure to leave the ‘end;’ at the end– p4 needs it, and will deal with it.

The checkPoints can help with convergence testing. To help with that, you can use the STMcmcCheckPointReader class. It will print out a table of average standard deviations of split supports between 2 runs, or between 2 checkPoints from the same run. It will print out tables of proposal acceptances to show whether they change over the course of the MCMC.

Making a consensus tree

See TreePartitions.

checkPoint()
run(nGensToDo, verbose=True)

Start the STMcmc running.

writeProposalAcceptances()

Pretty-print the proposal acceptances.

writeProposalIntendedProbs()

Tabulate the intended proposal probabilities.

writeProposalProbs()

(Another) Pretty-print the proposal probabilities.

writeSwapMatrix()