Various functions.

charsets(names, lens, fName=None)[source]

Write a nice NEXUS sets block, given partition names and lengths.

For example:

geneNames = 'coi cytb nad5'.split()
lens = []
for geneName in geneNames:
    read('%s.nex' % geneName)
func.charsets(geneNames, lens)

which writes:


begin sets;
  charset coi = 1 - 131;  [nChar = 131]
  charset cytb = 132 - 352;  [nChar = 221]
  charset nad5 = 353 - 521;  [nChar = 169]
  charpartition p1 =  coi:coi, cytb:cytb, nad5:nad5 ;
  [partition p1 = 3:coi, cytb, nad5;]
chiSquaredProb(xSquared, dof)[source]

Returns the probability of observing X^2.

compareSplitsBetweenTreePartitions(treePartitionsList, precision=3, linewidth=120)[source]

Pairwise ASDOSS (ASDSF) and MaxDiff calculations

Output is verbose. Shows - average standard deviation of split frequencies (or supports), like MrBayes - maximum difference between split supports from each pair of checkpoints, like PhyloBayes



dirichlet1(inSeq, alpha, theMin, theMax=None, u=None)[source]

Modify inSeq with a draw from a Dirichlet distribution with a single alpha value.

inSeq is a list, and a copy is made, modified, normalized so that it sums to 1.0, and returned.

This function uses the function random.gammavariate(x, 1.0), which takes draws from a gamma distribution (not gamma function). Now random.gammavariate(x, 1.0) can return zero for x less than about 0.001. Eg here are results for 1000 draws from different x values

x=1.0000000000 min= 0.000319714 mean= 0.96978 x=0.1000000000 min= 1.65643e-38 mean= 0.10074 x=0.0100000000 min=4.03328e-309 mean= 0.013386 x=0.0010000000 min= 0 mean= 0.0026256 x=0.0001000000 min= 0 mean= 1.625e-09 x=0.0000100000 min= 0 mean=8.8655e-15 x=0.0000010000 min= 0 mean=1.0435e-268 x=0.0000001000 min= 0 mean= 0

Assuming we do not want zeros, a big alpha would help, but only by however big alpha is, so for example if alpha is 100 then we can still usually get a non-zero from inSeq vals of 0.0001. One hack that is adopted here is to possibly add a constant u that might be 0.1 or more to x (depending on alpha), so that the arg x is always big enough to return a non-zero. Stick that in the arg u, which is by default None

dirichlet2(inSeq, outSeq, alpha, theMin)[source]

Modify inSeq with a draw from a Dirichlet distribution with a single alpha value.

Args inSeq and outSeq are both numpy arrays. Arg inSeq is not modified itself; the modification is placed in outSeq (and nothing is returned). The result is normalized to 1.0

This is about 20% slower than func.dirichlet1() – not quite sure why.


A top-level dump of p4 trees, files, sequenceLists, and alignments.

effectiveSampleSize(data, mean)[source]

As done in Tracer v1.4, by Drummond and Rambaut. Thanks guys!

But see p4.func.summarizeMcmcPrams(), which gives ESSs.


Return n!

Its fast for n up to 30, cuz it uses a dictionary, rather than doing the computations. Not needed for newer Pythons – its in the math module.

getSplitKeyFromTaxNames(allTaxNames, someTaxNames)[source]

Make a long int binary split key from a list of taxNames.

allTaxNames -> an ordered list, nTax long someTaxNames -> list of taxNames on one side of the split.

The split key that is returned will always be even. For example, assuming

allTaxNames ['A', 'B', 'C', 'D']
someTaxNames ['B', 'D']

The bits for all the taxa are:

A   B   C   D
1   2   4   8

So the split key for ['B', 'D'] will be 2 + 8 = 10


func.getSplitKeyFromTaxNames(['A', 'B', 'C', 'D'], ['B', 'D'])
# returns 10

However, if the splitKey is odd, it is bit-flipped. So if someTaxNames = ['A', 'D'], the raw split key for ['A','D'] will be 1 + 8 = 9, binary ‘1001’, which is then xor’d with 1111, giving 6.


getSplitKeyFromTaxNames(['A', 'B', 'C', 'D'], ['A', 'D'])
returns 6
getSplitStringFromKey(theKey, nTax, escaped=False)[source]

Convert a long int binary split key to dot-star notation.

gsl_meanVariance(seq, mean=None, variance=None)[source]

Use gsl to compute both the mean and variance.

Arg seq can be a list or a numpy array.

Returns a 2-tuple of single-item NumPy arrays. To save a little time, you can pass the mean and variance to this function, in the form of zero-dimensional, single item NumPy arrays (eg mean = numpy.array(0.0))

The numpy built-in variance function does not use n-1 weighting. This one (from gsl) does use n-1 weighting.

gsl_ran_dirichlet(alpha, theta)[source]

Make a random draw from a dirichlet distribution.


and theta (alpha) – both the same length (more than 1). The length is the dimension of the dirichlet. The contents of theta are over-written (without being used). The draw ends up in theta. It is normalized so that it sums to 1.0.

gsl_ran_gamma(a, b, seed=None)[source]

See also random.gammavariate()


Attempts to determinine the data type by the composition.

Returns 1 for DNA, 2 for RNA, and 0 for protein. Or so it thinks.

It only works for lowercase symbol letters.


Like the shell ls

maskFromNexusCharacterList(nexusCharListString, maskLength, invert=0)[source]

Returns a mask string, converted from a Nexus char list.

Convert a Nexus characters list to a mask string composed of zeros and 1’s Eg char list r'1 2-4 6-10' (don’t forget the ‘r’ if you have any backslash characters) becomes (for a maskLength of 16) 1111011111000000 And r'10-.' results in 0000000001111111 (for maskLength 16). Note that Nexus char lists are 1-based.

Invert inverts the zeros and 1’s.

matrixLowerTriangleToUpperTriangle(ltList, dim)[source]

Rearrange matrix from a lower triangle to an upper triangle list

  • ltList (list) – a lower triangle of a matrix in the form of a list

  • dim (int) – the dimension of the (square) matrix


the same items rearranged as the upper triangle, in the form of a list.

If your matrix is like this:

- - - -
A - - -
B D - -
C E F -

where the dim is 4, then the lower triangle is the list [A,B,D,C,E,F]. This function rearranges that so that it is the upper triangle:

- A B C
- - D E
- - - F
- - - -

and returns the list [A,B,C,D,E,F].

This is useful for user-specified empirical protein rate matrices where the rate matrix is given as a PAML-style lower triangle, but you need a p4-style upper triangle:

pamlStyleRates = [190 rates]
upTriangle = func.matrixLowerTriangleToUpperTriangle(pamlStyleRates, 20)
# then when you specify your model ...
t.newRMatrix(free=0, spec='specified', val=upTriangle)

Simple, pure-python mean. For big lists, use something better.

nChooseK(n, k)[source]

Get the number of all possible len k subsets from range(n).


Returns a list T(n,m) with m from zero to n-1

n is the number of leaves m is the number of internal nodes

The first number in the returned list will always be zero, and the second number will always be 1. See Felsenstein, page 27. So for example, for nTaxa = 8 (as in the example), this function returns [0, 1, 246, 6825, 56980, 190575, 270270, 135135].


Returns a list T(n,m) with m from zero to n-2

n is the number of leaves m is the number of internal nodes

The first number in the returned list will always be zero, and the second number will always be 1. See Felsenstein, page 27. So for example, for nTaxa = 9 (as in the example), this function returns [0, 1, 246, 6825, 56980, 190575, 270270, 135135].

newEmptyAlignment(dataType=None, symbols=None, taxNames=None, length=None)[source]

Make de novo and return an Alignment object, made of gaps.

It is not placed in var.alignments.

newtonRaftery94_eqn16(logLikes, delta=0.1, verbose=False)[source]

Importance sampling, as in Newton and Raftery 1994, equation 16


Check to see if theName conforms to Nexus standards

See page 597 of Maddison, Swofford, and Maddison 1997, where they say “Names are single NEXUS words; they cannot consist entirely of digits (e.g., a taxon called 123 is illegal).

The all-digit name restriction can be relaxed in p4 by setting var.nexus_allowAllDigitNames.

Single digit names are prohibited, regardless.

nexusFixNameIfQuotesAreNeeded(theName, verbose=0)[source]

Add quotes if needed, for writing to a Nexus file.

Returns a possibly modified theName. Usually it will not need quoting, so it just returns theName unchanged.

If theName is None, or if it starts with a single quote, just return it.

If it has (nexus-defined) punctuation, or spaces, then put quotes around it before returning it. If there are internal single quotes, then double them, nexus-style. Except if there are any already doubled single quotes, don’t double them.


Deal with underscores and quotes. Returns theName

If theName is not quoted, convert any underscores to spaces. If theName is quoted, remove the outside quotes, and convert any cases of 2 single quotes in a row to 1 single quote.

This does not appear to be used in the rest of p4.


Deal with quotes. Returns theName

If theName is not quoted, just return it. If theName is quoted, remove the outside quotes, and convert any cases of 2 single quotes in a row to 1 single quote.


Convert a coord in polar coords to usual (Cartesian? square?) coords.

Input is a list composed of the angle in radians and the length (ie from the origin). A list of [x,y] is returned.

randomTree(taxNames=None, nTax=None, name='random', seed=None, biRoot=0, randomBrLens=1, constraints=None, randomlyReRoot=True)[source]

Make a simple random Tree.

You can supply a list of taxNames, or simply specify nTax. In the latter case the specified number of (boringly-named) leaves will be made.

The default is to have ‘randomBrLens’, where internal nodes get brLens of 0.02 - 0.05, and terminal nodes get brLens of 0.02 - 0.5. Branch lengths are all 0.1 if randomBrLens is turned off.

This method starts with a star tree and keeps adding nodes until it is fully resolved. If ‘biRoot’ is set, it adds one more node, and roots on that node, to make a bifurcating root.

Repeated calls will give different random trees, without having to do any seed setting. If for some reason you want to make identical random trees, set the seed to some positive integer, or zero.

If you want the tree to have some topological constraints, say so with a Constraints object.

Returns a tree.


Read in data, trees, or python code from a file or from a string.

For example:





read('((A, (B, C)), D, (E, F));')

This is meant to be the main way to get phylogenetic stuff from files into p4. The argument stuff can be a file name, or filenames described with wildcards (a ‘glob’, eg *.nex), or it can be a string. If you are specifying a literal file name or a glob, you will of course want to put it in quotes.

If you want it to read a file, and you mis-specify the file name, it will try to read the bad file name as a string (and generally fail, of course).

This method will recognize these data files –

  • nexus

  • fasta

  • gde (not gde flat files)

  • clustalw (*.aln)

  • phylip (sequential or interleaved)

and these tree files –

  • nexus

  • phylip/newick

If there is a suffix, one of nex, nexus, aln, phy, phylip, gde, fasta, fas, fsa, or py, it will use that to make a choice of what kind of file it is. (So don’t give a fasta file the suffix ‘phy’ or it will fail for sure.)

Fortunately the various kinds of files (including Python) are almost mutually exclusive. A fasta file must start with either a ‘>’ or a ‘;’ as the first character. A gde file has ‘{‘ as the first character. A phylip data file must have the two integers as the first non-whitespace things on the first line. In a nexus file, the first non-whitespace character must be either a ‘#’ or a ‘[‘. (Python files might start with a ‘#’ also.)

It is easy to fool. For example, start any old nonsense file with two integers, and this function will think that it is a phylip data file! So have a little bit of care what you feed this function.

If you have trouble and want to know what it is thinking as it attempts to read your file, set var.verboseRead=True.

Does anybody use GDE anymore? This only reads parts of a GDE formatted alignment. It reads the name, type, and sequence. It does not read the Sequence-ID, creation date, direction, strandedness, nor the comments. It correctly handles offset. It fills out the ends of the sequences to make an alignment. In gde files, MASK sequences are changed into Nexus CharSet instances. The charsets made from the masks are of course printed out if you print out the alignment as a Nexus file.


Read in simple stuff, pop the single object from var lists, and return it.

The stuff to be read in must be convertible to a single object, one of Alignment, SequenceList, or Tree. When that is read, the stuff as usual goes into one of var.alignments, var.sequenceLists, or var.trees. The single object is popped from where it ends up, and returned.


If its a data or tree file, read it. If its python code, exec it.

readJplace(fName, verbose=False)[source]

Read *.jplace files for phylogenetic placement of short reads

It returns a tuple — (tree, heatSum). The tree is decorated with a ‘heat’ attribute on each The heatSum is the sum of all the heats.


Reminders, suggestions, multi-step methods…

These should be in the Sphinx docs also.


Set a new seed for the c-language random() function.

For those things in the C-language that use random(), this re-seeds the randomizer. Reseed to different integers to make duplicate runs of something come out different. This is not for the GSL random stuff. Confusing to have 2 systems, innit? And then there is the 3rd system that Python uses in the random module. Sorry!

If you wanted for example to repeat an MCMC, there are four random number generators to contend with. You could do something like this (where I am using a seed of zero for all – you could use your own seed, and they do not need to be the same):

# for C random() function, used in Brent-Powell optimization

# for gsl, used in simulations (amongst other places)
var.gsl_rng = pf.gsl_rng_get()
pf.gsl_rng_set(var.gsl_rng, 0)

# for the Python random library, used a lot in python code

# for Numpy and Scipy.  Used in func.dirichlet2()
sortListOfListsOnListElementNumber(aListOfLists, elementNumber)[source]

Returns a new sorted list.

sortListOfObjectsOn2Attributes(aListOfObjects, attributeString1, attributeString2)[source]

Returns a new sorted list.

sortListOfObjectsOnAttribute(aListOfObjects, attributeString)[source]

Returns a new sorted list.

spaceDelimitedToTabDelimited(fName, outFName=None)[source]

Convert space-delimited data files to tab-delimited.

The outfilename by default is the infilename with .tabbed stuck on the end.


Print a splash screen for p4.

splash2(outFile=None, verbose=True)[source]

Another splash, showing things like version, git hash, and date

If verbose is set, it gets printed to sys.stdout.

If you set an outFile, it will also be appended to that file.

It also returns the info as a list of strings.


Convert usual square coords to polar.

Arg is [x,y], and [angle, length] is returned.

studentsTStat(seq1, seq2)[source]

Returns Student’s t statistic for 2 lists or tuples.

Mean of seq1 - mean of seq2, divided by the _stdErrorOfTheDifferenceBetweenTwoMeans(seq1, seq2)

studentsTTest1(seq, mu=0.0, verbose=True)[source]

Test whether a sample differs from mu.

From wikipedia.

Arg ‘seq’ is a list of numbers. Internally it is converted to a numpy array of floats, so the input seq need not be floats, and need not be a numpy array, although it does not hurt to be either.

Arg ‘mu’ is by default zero.

Returns the p-value.

summarizeMcmcPrams(skip=0, run=- 1, theDir='.', makeDict=False)[source]

Find the mean, variance, and ess of mcmc parameters.

Ess is effective sample size, as in Tracer by Drummond and Rambaut.

The numbers are found in mcmc_prams_N, N=0, 1, etc. If arg ‘run’ is set to -1, the default, then all runs are done. Alternatively you can set the run to a specific run number, and that is the only one that is done.

The ‘profile’, with the names of the parameters, and the number of each, is found in (N=0,1,2 …). It is not essential, but it gives names to the parameters.

tailAreaProbability(theStat, theDistribution, verbose=1)[source]

Calculate the tail area probability of theStat.

That is the number of items in theDistribution that are greater than or equal to theStat. theDistribution need not be sorted.

unPickleMcmc(runNum, theData, verbose=True)[source]

Unpickle a checkpoint, return an Mcmc ready to go.

unPickleSTMcmc(runNum, verbose=True)[source]

Unpickle a STMcmc checkpoint, return an STMcmc ready to go.


Uninstall the p4 package.


Returns an open file and filename, modified from file_name.

If the file_name is, then the new filename will start with foo and end with .bar, with a bit of unique nonsense in between. With a simple file_name input the new file is made in current directory, but by supplying a file_name including a path, you can create the new file elsewhere.

Don’t forget to close the file.


This would not be good for a lot of data. n - 1 weighted.

which(what, verbose=0)[source]

Asks if an auxiliary program is available.

This uses the shell command ‘which’ to find whether a program (as given by the argument ‘what’) is in the path. It returns 0 or 1. If verbose is turned on, it speaks the path, if it exists.


Find an executable

writeInColor(theString, colour='blue')
writeInColour(theString, colour='blue')[source]

Calculate the X^2 statistic from an R x C table.

Arg observed is a 2D R x C table. This stat is sometimes called Chi-squared.