
P4 — HOWTOs and Comments

2024-12-12.

To collapse nodes in a tree

To collapse nodes, don’t do it this way

for n in t.iterInternalsNoRoot():
    t.collapseNode(n)

because it modifies the tree as it iterates. Do it this way instead

# First make a list of nodes that you want to collapse
toCollapse = [n for n in t.iterInternalsNoRoot()]

# Then collapse them
for n in toCollapse:
    t.collapseNode(n)

Or perhaps more realistically

toCollapse = [n for n in t.iterInternalsNoRoot() if n.br.support < 0.7]
for n in toCollapse:
    t.collapseNode(n)

Combine supports from two analyses onto one tree

You will often analyse the same data with two different methods, each of which gives you a tree with support values, but you may want to summarize the results of both trees in a single tree figure. The trees may differ in topology, or they may be the same. Even if they have the same topology, the branch lengths may differ. One approach is to consider one tree to be the main tree, giving the figure its topology and branch lengths, and then combine the split supports from the other tree onto that main tree.

Consider these two trees

          +-------------- E
   +------|0.84
   |      +----------- B
   |
   |-------- G
   |
   |                      +-------- C
   |                      |
   |               +------|0.90          +-------------- D
   |               |      |      +-------|0.95
   |               |      |      |       +----------- F
   |               |      +------|0.87
   |       +-------|0.95         |       +------------ H
   |       |       |             +-------|0.88
   |       |       |                     +-------------------- A
   +-------|0.59   |
           |       +-------------- I
           |
           +--------------- J

           +--------------- D
   +-------|73
   |       +------------- F
   |
   |       +-------------- H
   |-------|77
   |       +---------------------- A
   |
   |               +-------- C
   |       +-------|86
   |       |       +---------------- I
   |       |
   +-------|88      +---------------- J
           |        |
           +--------|58             +--------------- E
                    |       +-------|98
                    +-------|99     +------------ B
                            |
                            +-------- G

These two trees are both fully resolved, and mostly have the same splits – there is only one NNI difference (can you see that?). If we make the first tree the main tree, it is a fairly easy job to merge the split supports from the second tree onto that first tree, as shown below.

Doing the combining “by hand” is an option, but for bigger trees (which may share only some of the same splits) the task becomes onerous, time-consuming, and error-prone. The process can be automated, and one way to do that is shown here. It involves assigning a number to each split in both trees; those numbers are a numerical equivalent of the “dot-star” notation for splits (for example, .**.** might be binary 011011). Since the taxa are the same in both trees, and in the same order, each split in the main tree can use its number to find the same split (if it exists) in the other tree. Then the supports can be combined.
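
To illustrate, a dot-star string can be turned into such a number by substituting 0 for each dot and 1 for each star and reading the result as binary. This is only a sketch of the idea; p4’s makeSplitKeys() uses its own bit-ordering convention, which may differ.

def splitKeyFromDotStar(dotStar):
    # '.' means the taxon is absent from the split, '*' means present.
    return int(dotStar.replace('.', '0').replace('*', '1'), 2)

print(splitKeyFromDotStar('.**.**'))   # 27, which is binary 011011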

To start, we need a valid list of taxNames, which we might get from an alignment

a = func.readAndPop('myData.nex')
# Now a.taxNames has the list we want

We read in and name our two trees, and make sure they both have the same taxNames

tMain = func.readAndPop('myMainTree.nex')
tSecondary = func.readAndPop('mySecondaryTree.nex')

tMain.taxNames = a.taxNames
tSecondary.taxNames = a.taxNames

At this point we are set up and good to go

# Split keys are numerical versions of the 'dot-star' split notation.
# The same split on the two trees would have the same split key.
tMain.makeSplitKeys()
tSecondary.makeSplitKeys()

# Make a dictionary, so that we can fish out nodes in the secondary tree
# given a split key.  Split keys are found on node branches, here
# n.br.
myDict = {}
for n in tSecondary.iterInternalsNoRoot():
    myDict[n.br.splitKey] = n

for nM in tMain.iterInternalsNoRoot():
    # Given a split key in the main tree, we can find the
    # corresponding node in the secondary tree, using the split key with
    # the dictionary.
    nS = myDict.get(nM.br.splitKey)
    # If there was none, then nS is None
    if nS:
        nM.name = '%s/%s' % (nM.name, nS.name)
    else:
        nM.name = '%s/-' % nM.name
    # print(nM.name)
tMain.writeNexus('combinedSupportsTree.nex')

There is an example of this in share/Examples/F_picture/E_combiningSplitSupports.

MRP — Matrix representation with parsimony, for supertrees

Let’s say that you have a number of trees with overlapping taxon sets, and you would like to do an MRP analysis on them to find a supertree. The first step is to make an input matrix suitable for a parsimony program like PAUP or TNT. For this you can use the function mrp(). The input for that function is a list of p4 tree objects, so you might do something like

read('myTrees.phy')
a = mrp(var.trees)
a.writeNexus('mr.nex')

The file mr.nex is a matrix representation of the input trees, and is suitable for input to your parsimony search program.

Although it is not needed for the parsimony analysis, each representation of an input tree in mr.nex has a NEXUS charset showing which sites in the alignment correspond to which input trees. This allows reconstruction of the input trees from the matrix, as described in reverseMrp().
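
To make the coding concrete, here is a toy illustration (my sketch, not p4’s mrp(); the taxa and splits are invented) of the Baum/Ragan coding that MRP uses. Each internal split of each input tree becomes one binary character, where taxa inside the split get 1, other taxa in that tree get 0, and taxa absent from that tree get ?.

# A toy illustration of MRP (Baum/Ragan) coding, independent of p4.
allTaxa = ['A', 'B', 'C', 'D', 'E']

# Each entry is (taxa in the input tree, taxa on one side of a split).
# These splits are invented, just for illustration.
splits = [
    (set('ABCD'), set('AB')),   # a split from a first input tree
    (set('BCDE'), set('DE')),   # a split from a second input tree
]

matrix = {tx: [] for tx in allTaxa}
for taxaInTree, taxaInSplit in splits:
    for tx in allTaxa:
        if tx not in taxaInTree:
            matrix[tx].append('?')   # taxon absent from this input tree
        elif tx in taxaInSplit:
            matrix[tx].append('1')
        else:
            matrix[tx].append('0')

for tx in allTaxa:
    print(tx, ''.join(matrix[tx]))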

Kosiol’s AIS, almost invariant sets

This is for grouping amino acids into sets where there is a high probability of exchange within a set but a lower probability of exchange between sets. The method is described in Kosiol et al. (2004) J. Theoret. Biol. 228: 97–106. The AIS program is available at Goldman’s site.

The program wants the equilibrium frequencies for the model, the Q-matrix (bigQ in p4), and the eigenvectors. Here is an example using the WAG model, with WAG frequencies.

# The easiest way to get the bigQ and such from p4 is to just set up
# the appropriate protein model as usual, and then calculate a
# likelihood.

# Read in some data.  You could use your own data if you wanted to use
# empirical comps.
seqs = """ 2 20
one       arndcqeghilkmfpstwyv
two       rndcqeghilkmfpstwyva
"""

# a tree, with the corresponding taxa.
ts = "(one, two);"

read(seqs)
t = func.readAndPop(ts)
d = Data()
t.data = d
t.newComp(partNum=0, free=0, spec='wag')    # or maybe you want empirical for your own data?
t.newRMatrix(partNum=0, free=0, spec='wag')
t.setNGammaCat(partNum=0, nGammaCat=1)
t.setPInvar(partNum=0, free=0, val=0.0)
t.calcLogLike()
#t.model.dump()

# Write the aa freqs
f = open('equi', 'w')
f.write('20\n')
for i in range(20):
    f.write("%f\n" % t.model.parts[0].comps[0].val[i])
f.close()

bigQ = t.model.getBigQ()

# Write the bigQ
f = open('q', 'w')
f.write('20\n')
for i in range(20):
    for j in range(20):
        f.write("%5g  " % bigQ[i][j])
    f.write('\n')
f.close()

# Get the eigensystem
import numpy
import numpy.linalg
evals, evecs = numpy.linalg.eig(bigQ)

# look, if you want
if 0:
    numpy.set_printoptions(precision=4, linewidth=300)
    print(evals)
    print(evecs)

# According to the web page, "The right eigenvectors should be ordered
# according to the absolute value of their eigenvalues."  The output
# from numpy, which uses lapack, is not so ordered, so do it here.
# The Q-matrix eigenvalues are real and non-positive, so sorting them
# in decreasing value order puts them in increasing absolute value order.
sorter = numpy.argsort(evals)
sorter = sorter[::-1]   # reverse, so decreasing value order
# print(sorter)

f = open('evec', 'w')
f.write('20\n')
f.write('20\n')
for colNum in sorter:
    for rowNum in range(20):
        f.write("%5g\n" % evecs[rowNum][colNum])
f.close()
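
As an optional sanity check, not part of the original recipe, the eigensystem should reconstruct bigQ, since \(Q = V \Lambda V^{-1}\) where the columns of \(V\) are the right eigenvectors

# Reconstruct Q from the eigensystem and compare with the original.
Lam = numpy.diag(evals)
Q2 = numpy.dot(evecs, numpy.dot(Lam, numpy.linalg.inv(evecs)))
assert numpy.allclose(numpy.array(bigQ), Q2)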

Kosiol’s program ais asks for the three files made above, and for the number of groups that you want, and it suggests a grouping.

When I did that with Dayhoff 78, with D78 composition, I got these groups —

- Set 0 = { R N D Q E H K T }
- Set 1 = { C V }
- Set 2 = { A G P S }
- Set 3 = { I L M }
- Set 4 = { W }
- Set 5 = { F Y }

While the groups from the log odds table are

1. c
2. stpag
3. ndeq
4. hrk
5. milv
6. fyw

These groupings are similar but not identical. Perhaps they differ because the log odds groups are from a PAM250 matrix, which would correspond to highly diverged sequences.

Optimizing likelihood

Optimizing the log likelihood of a phylogenetic tree with a parameter-rich, tree-heterogeneous model is not as easy as for a simple tree-homogeneous model. None of the methods that I have implemented are perfect, and they can fail for difficult optimizations. I know this because it is possible to repeat the optimization, and sometimes the second optimization will find a better likelihood.

The method is part of the Tree class, Tree.optLogLike(), and the defaults are

t.optLogLike(verbose=1, method='BOBYQA', optBrLens=True)

Confusingly, the Tree method optLogLike() has a parameter that is itself called method. It is a string, one of

- BOBYQA, from the nlopt library
- allBrentPowell, the Brent-Powell method for model parameters and branch lengths
- newtAndBrentPowell, the Brent-Powell method for model parameters, but Newton’s method for branch lengths
- newtAndBOBYQA, Newton’s method for branch lengths, and BOBYQA for model parameters

The optimization methods differ in their computational approach, in their speed, and in how robust they are at optimizing difficult likelihoods. BOBYQA is the default because it is fast and appears to be fairly robust. However, it is not perfect, and for difficult optimizations it can stop short. As a practical solution, for difficult or important optimizations you can do something like

t.optLogLike(method="BOBYQA")
t.optLogLike(method="newtAndBrentPowell")

The idea is that the first optimization will be fast and will generally optimize fully. However, if it does not, the second method, taking a completely different approach, may finish the job. For bigger, more parameter-rich, or more important likelihood calculations you may even want to optimize again. Optimizations after the first should be fairly fast.
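
As a sketch of that idea, one might alternate methods until the improvement becomes negligible. This assumes that the optimized log likelihood is available afterwards as t.logLike, and the tolerance of 0.01 log units is an arbitrary choice.

t.optLogLike(method="BOBYQA")
previous = t.logLike
for rep in range(5):
    t.optLogLike(method="newtAndBrentPowell")
    if t.logLike - previous < 0.01:   # no meaningful improvement
        break
    previous = t.logLike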

Ancestral character state reconstruction

Ancestral state reconstruction is a probabilistic question. The way p4 does it is to make draws from the posterior probability distribution of ancestral states.

Ziheng Yang has a very good description of ancestral state reconstruction in his “Molecular Evolution: A Statistical Approach” (2014).

The user interface method is Tree.ancestralStateDraw() which needs a tree object with attached model and data. It returns a string, which is a single draw from the inferred ancestral states.

There are examples on how to use it included in the share/Examples/P_ancestralStates directory in the source. The first example uses an ML optimized model, and the second example makes draws from the posterior distribution of a Bayesian analysis.
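
Since each call returns a single draw, one simple use, sketched here, is to tally many draws to approximate the posterior probabilities of whole reconstructions. This assumes t is a tree with its model and data attached, set up as described above.

from collections import Counter

# Tally repeated draws; the frequency of a reconstruction approximates
# its posterior probability.
nDraws = 1000
tally = Counter()
for i in range(nDraws):
    tally[t.ancestralStateDraw()] += 1
bestDraw, count = tally.most_common(1)[0]
print(bestDraw, count / nDraws)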

Site entropy

Entropy for a site is given by

\[ S(\pi) = -\sum_i \pi_i \log(\pi_i) \]

where \(\pi_i\) is the proportion of each character state in the site.

For a constant site, the entropy would be zero. For a site in a DNA alignment with all four character states in equal proportion, the entropy (using base-2 logarithms) would be 2.0.
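
That number is easy to check directly, independent of p4

import math

# Four DNA states in equal proportion; entropy in bits (log base 2).
pi = [0.25] * 4
S = -sum(p * math.log2(p) for p in pi)
print(S)   # 2.0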

The method for this is Alignment.entropyMeanAndVariance(), which returns the mean and the variance of the per-site entropies. For example,

m, v = a.entropyMeanAndVariance()

Composition tests described in Foster 2004

A compositional homogeneity test using simulations

This test was described implicitly in Figure 1 of Foster (2004) and the associated text. By “described implicitly” I mean that I described the Type-II error of the Chi-squared test, and described how to fix it using simulations. I did not actually use this test in Foster 2004. It was described as a motivator for the TAMBCFT (the tree- and model-based composition fit test, described below); however, it is not the complete TAMBCFT. It could be considered a special case of the TAMBCFT, used with a model that has a homogeneous empirical composition, so \(X^2_m = X^2\). It was not named in Foster 2004. A short name might be “Composition test using simulations” or “Compositional homogeneity test using simulations”. A long name might be “A compositional homogeneity test using simulations to assess significance”.

It is not original to me, as the idea was “in the air” when I was in Swofford’s lab 1997–1999. I discussed it with Dave Swofford and Jack Sullivan, back then. However, I did implement it, and described it reasonably well. There is not much to it.

It is implemented in p4 as a method of the Tree class, Tree.compoChiSquaredTestUsingSimulations(). The docstring for that method is fairly good. It is not required, but you would generally use a tree with ML-optimized branch lengths and model parameters. It is fast compared to the TAMBCFT described below because it does not need ML optimizations of the simulation reps. It is meant to actually be used, so the implementation is fairly robust, and it can handle things like multi-partition data and models, ambiguities, and the blank sequences that you often see in concatenated datasets. It generally uses the \(X^2\) statistic, but it does not require that statistic; one could use something else. I would say that it is not a Chi-squared test, as it does not use \(\chi^2\) to assess significance. It is described in this old pdf in the source code for p4. There it is described as an “improved” Chi-squared test, but I don’t think that is a good characterization; since it does not use \(\chi^2\), it is not a Chi-squared test.
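
In outline, the logic of the test is to compare the observed statistic against a null distribution made by simulating data on the tree under the fitted model. Here is a minimal sketch of that logic, not p4’s implementation; xSquared() is a hypothetical helper that computes the composition \(X^2\) statistic for an alignment, and t is assumed to have its model and data attached, with parameters already optimized.

# Build a null distribution for X^2 by simulating on the tree + model.
# xSquared() is hypothetical; t.simulate() overwrites the attached
# data with data simulated under the current tree and model.
nSims = 100
observed = xSquared(t.data.alignments[0])
nGreaterOrEqual = 0
for rep in range(nSims):
    t.simulate()
    if xSquared(t.data.alignments[0]) >= observed:
        nGreaterOrEqual += 1
pValue = nGreaterOrEqual / nSims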

TAMBCFT, the tree- and model-based composition fit test

This is a novel test that was introduced and described in Foster 2004. It is described further in this pdf in the p4 source code. It is for assessing the fit of data that may be compositionally heterogeneous over the tree to a model that may itself be compositionally heterogeneous over the tree.

This is not the specific “compositional homogeneity test using simulations” described above; it is more general. As mentioned above, that test could be considered a special case of the TAMBCFT, with a homogeneous model where the model has empirical composition, so that \(X^2_m = X^2\) and optimizations are not needed.

The TAMBCFT would generally need a tree and model ML optimization for each simulation, so it is slow. Again, it is not a Chi-squared test, as it does not use \(\chi^2\) to assess significance. The way it is implemented in p4 is that the user makes the reps and optimizations first, as this is the time-consuming part, and once those are done the test can be completed.

Posterior predictive simulations

Posterior predictive simulation is a standard Bayesian method. Using posterior predictive simulations with \(X^2\) as the test quantity in Bayesian phylogenetics was described previously by Bollback and Huelsenbeck. It is a model fit test, and it is implemented in p4. It uses the \(X^2\) statistic, but it is otherwise not related to Chi-squared tests, and \(X^2\) is not required; you could use something else that shows off compositional heterogeneity over the tree. (A few years later, to settle an argument, I implemented another posterior predictive test quantity to measure composition, not mentioned in Foster 2004, in order to check the results using \(X^2\). It was under-used, so I pulled the documentation from the source code.)
