Tree

class Tree

A phylogenetic tree.

Some instance variables

  • fName, if the Tree was read from a file
  • name, the name of the tree.
  • root, the root node
  • nodes, a list of nodes.
  • preOrder and postOrder, lists of node numbers
  • recipWeight, the weight, if it exists, is usually 1/something, so the reciprocal looks nicer ...
  • nexusSets, if it exists, a NexusSets object.

Properties

  • taxNames, a list of names. Usually the order is important!
  • data, a Data.Data object
  • model, a Model.Model object
  • nTax, the number of taxa
  • nInternalNodes, the number of non-leaf nodes

The node method

You often will want to refer to a node in a tree. This can be done via its name, or its nodeNum, or as an object, via the method node(). For example, from a Tree t you can get node number 3 via:

n = t.node(3)

and you can get the node that is the parent of the Mastodon via:

n = t.node('Mastodon').parent

For many methods that require specifying a node, the method argument is nodeSpecifier, eg:

t.reRoot(23)

reRoots‘s the tree to node number 23.

Describe, draw, and get information about the tree

dump Print rubbish about self.
draw Draw the tree to the screen.
textDrawList
tv Tree Viewer.
btv Big Tree Viewer.
isFullyBifurcating Returns True if the tree is fully bifurcating.
taxSetIsASplit Asks whether a nexus taxset is a split in the tree.
getAllLeafNames Returns a list of the leaf names of all children
getChildrenNums Returns a list of nodeNums of children of the specified node.
getDegree
getLen Return the sum of all br.len’s.
getNodeNumsAbove Gets a list of nodeNums, in postOrder, above nodeSpecifier.
getPreAndPostOrderAbove Returns 2 lists of node numbers, preOrder and postOrder.
getPreAndPostOrderAboveRoot Sets self.preOrder and self.postOrder.
getSeqNumsAbove Gets a list of seqNums above nodeSpecifier.
subTreeIsFullyBifurcating Is theNode and everything above it (or below it) bifurcating?
summarizeModelThingsNNodes Summarize nNodes for all modelThings if isHet
verifyIdentityWith For MCMC debugging.

Write

write This writes out the Newick tree description to sys.stdout.
writeNewick Write the tree in Newick, aka Phylip, format.
writeNexus Write the tree out in Nexus format, in a trees block.
writePhylip Write the tree in Phylip or Newick format.
tPickle Pickle self to a file with a ‘p4_tPickle’ suffix.

See also Trees methods p4.trees.Trees.writeNexus() and p4.trees.Trees.writeNewick() for doing trees by the bunch.

Iteration over the nodes

Sometimes you don’t want to just iterate over the self.nodes list, because after some manipulations a node might be in self.nodes but not actually in the tree; using these ‘iter’ methods takes care of that, skipping such nodes.

iterInternals Internal node generator.
iterInternalsNoRoot Internal node generator, skipping the root.
iterInternalsNoRootPostOrder Internal post order node generator, skipping the root.
iterInternalsNoRootPreOrder Internal post order node generator, skipping the root.
iterInternalsPostOrder Internal post order node generator.
iterLeavesNoRoot Leaf node generator, skipping the root.
iterLeavesPostOrder
iterLeavesPreOrder
iterNodes Node generator, in preOrder.
iterNodesNoRoot Node generator, skipping the root.
iterPostOrder Node generator.
iterPreOrder Node generator.
nextNode Get next node cycling around a hub node.

See also Node methods that do similar things starting from a given node.

Copy

dupe Duplicates self, but with no c-pointers.
copyToTree
dupeSubTree Makes and returns a new Tree object, duping part of self.

In combination with Data and Model

calcLogLike Calculate the likelihood of the tree, without optimization.
optLogLike Calculate the likelihood of the tree, with optimization.
simulate Simulate into the attached data.
getSiteLikes Likelihoods, not log likes.
getSiteRates Get posterior mean site rate, and gamma category.
bigXSquaredSubM Calculate the X^2_m stat
compStatFromCharFreqs Calculate a statistic from observed and model character frequencies.
compoTestUsingSimulations Compositional homogeneity test using a null distribution from simulations.
modelFitTests Do model fit tests on the data.
modelSanityCheck Check that the tree, data, and model specs are good to go.
simsForModelFitTests Do simulations for model fit tests.
getEuclideanDistanceFromSelfDataToExpectedComposition Calculate the c_E stat between self.data and model expected composition.

Setting a model

newComp Make, attach, and return a new Comp object.
newRMatrix Make, attach, and return a new RMatrix instance.
newGdasrv
setPInvar
setRelRate
setModelThing
setModelThingsRandomly Place model things (semi-)randomly on the tree.
setModelThingsNNodes Set nNodes for all modelThings
summarizeModelThingsNNodes Summarize nNodes for all modelThings if isHet
setNGammaCat
setTextDrawSymbol

Tree manipulation

addLeaf Add a leaf to a tree.
addNodeBetweenNodes Add a node between 2 exisiting nodes, which should be parent-child.
addSibLeaf Add a leaf to a tree as a sibling, by specifying its parent.
addSubTree Add a subtree to a tree.
allBiRootedTrees Returns a Trees object containing all possible bi-rootings of self.
collapseNode Collapse the specified node to make a polytomy, and remove it from the tree.
ladderize Rotate nodes for a staircase effect.
lineUpLeaves Make the leaves line up, as in a cladogram.
nni Simple nearest-neighbor interchange.
pruneSubTreeWithoutParent Remove and return a node, together with everything above it.
randomSpr Do a random spr move.
randomizeTopology
reRoot Re-root the tree to the node described by the specifier.
reconnectSubTreeWithoutParent Attach subtree stNode to the rest of the tree at newParent.
removeEverythingExceptCladeAtNode Like it says.
removeNode Remove a node, together with everything above it.
removeAboveNode Remove everything above an internal node, making it a leaf, and so needing a new name.
removeRoot Removes the root if self.root is mono- or bifurcating.
renameForPhylip Rename with phylip-friendly short boring names.
restoreDupeTaxa Restore previously removed duplicate taxa from a dict file.
restoreNamesFromRenameForPhylip Given the dictionary file, restore proper names.
rotateAround Rotate a clade around a node.
spr Subtree pruning and reconnection.
stripBrLens Sets all node.br.len’s to 0.1, the default in p4.

Misc

checkDupedTaxonNames
checkSplitKeys
checkTaxNames Check that all taxNames are in the tree, and vice versa.
checkThatAllSelfNodesAreInTheTree Check that all self.nodes are actually part of the tree.
inputTreesToSuperTreeDistances Return the topology distance between input trees and pruned self.
makeSplitKeys Make long integer-valued split keys.
readBipartitionsFromPaupLogFile Assigns support to the tree, from the PAUP bipartitions table.
recalculateSplitKeysOfNodeFromChildren
setNexusSets Set self.nexusSets from var.nexusSets.
topologyDistance Compares the topology of self with tree2.
tvTopologyCompare Graphically show topology differences.
patristicDistanceMatrix Matrix of distances along tree path.
inputTreesToSuperTreeDistances Return the topology distance between input trees and pruned self.
addLeaf(attachmentNode, taxName)

Add a leaf to a tree.

The leaf is added to the branch leading from the specified node. A new node is made on that branch, so actually 2 nodes are added to the tree. The new leaf node is returned.

addNodeBetweenNodes(specifier1, specifier2)

Add a node between 2 exisiting nodes, which should be parent-child.

The specifier can be a nodeNum, name, or node object.

Returns the new node object.

addSibLeaf(attachmentNode, taxName)

Add a leaf to a tree as a sibling, by specifying its parent.

The leaf is added so that its parent is the specified node (ie attachmentNode), adding the node as a rightmost child to that parent. The attachmentNode should not be a leaf – it must have children nodes, to which the new leaf can be added as a sibling.

The new node is returned.

addSubTree(selfNode, theSubTree, subTreeTaxNames=None)

Add a subtree to a tree.

The nodes from theSubTree are added to self.nodes, and theSubTree is deleted.

If subTreeTaxNames is provided, fine, but if not this method can find them. Providing them saves a bit of time, I assume.

allBiRootedTrees()

Returns a Trees object containing all possible bi-rootings of self.

Self should have a root node of degree > 2, but need not be fully resolved.

Self needs a taxNames.

bigXSquaredSubM(verbose=False)

Calculate the X^2_m stat

This can handle gaps and ambiguities.

Column zeros in the observed is not a problem with this stat, as we are dividing by the expected composition, and that comes from the model, which does not allow compositions with values of zero.

btv()

Big Tree Viewer. Show the tree in a gui window.

This is for looking at big trees. The viewer has 2 panels – one for an overall view of the whole tree, and one for a zoomed view, controlled by a selection rectangle on the whole tree view.

Needs Tkinter.

If you have nexus taxsets defined, you can show them.

calcLogLike(verbose=1, resetEmpiricalComps=True)

Calculate the likelihood of the tree, without optimization.

checkDupedTaxonNames()
checkSplitKeys(useOldName=False, glitch=True, verbose=True)
checkTaxNames()

Check that all taxNames are in the tree, and vice versa.

checkThatAllSelfNodesAreInTheTree(verbose=False, andRemoveThem=False)

Check that all self.nodes are actually part of the tree.

Arg andRemoveThem will remove those nodes, renumber the nodes, and reset pre- and postOrder, and return None

If andRemoveThem is not set (the default is not set) then this method returns the list of nodes that are in self.nodes but not in the tree.

collapseNode(specifier)

Collapse the specified node to make a polytomy, and remove it from the tree.

Arg specifier, as usual, can be a node, node number, or node name.

The specified node remains in self.nodes, and is returned.

compStatFromCharFreqs(verbose=False)

Calculate a statistic from observed and model character frequencies.

Call it c_m, little c sub m.

It is calculated from observed character frequencies and character frequencies expected from the (possibly tree-heterogeneous) model.

It would be the sum of abs(obs-exp)/exp

compoTestUsingSimulations(nSims=100, doIndividualSequences=0, doChiSquare=0, verbose=1)

Compositional homogeneity test using a null distribution from simulations.

This does a compositional homogeneity test on each data partition. The statistic used here is X^2, obtained via Data.compoChiSquaredTest().

The null distribution of the stat is made using simulations, so of course you need to provide a tree with a model, with optimized branch lengths and model parameters. This is a comp homogeneity test, so the model should be tree-homogeneous.

The analysis usually tests all sequences in the data partition together (like paup), but you can also ‘doIndividualSequences’ (like puzzle). Beware that the latter is a multiple simultaneous stats test, and so the power may be compromized.

For purposes of comparison, this test can also do compo tests in the style of PAUP and puzzle, using chi-square to assess significance. Do this by turning ‘doChiSquare’ on. The compo test in PAUP tests all sequences together, while the compo test in puzzle tests all sequences separately. There are advantages and disadvantages to the latter– doing all sequences separately allows you to identify the worst offenders, but suffers due to the problems of multiple simultaneous stats tests. There are slight differences between the computation of the Chi-square in PAUP and puzzle and the p4 version. The compo test in PAUP (basefreq) does the chi-squared test, but if sequences are blank it still counts them in the degrees of freedom; p4 does not count blank sequences in the degrees of freedom. Puzzle simply uses the row sums, ie the contributions of each sequence to the total X-squared, and assesses significance with chi-squared using the number of symbols minus 1 as the degrees of freedom. Ie for DNA dof=3, for protein dof=19. Puzzle correctly gets the composition from sequences with gaps, but does not do the right thing for sequences with ambiguities like r, y, and so on. P4 does calculate the composition correctly when there are such ambiguities. So p4 will give you the same numbers as paup and puzzle for the chi-squared part as long as you don’t have blank sequences or ambiguities like r and y.

This uses the Data.compoChiSquaredTest() method to get the stats. See the doc string for that method, where it describes how zero column sums (ie some character is absent) can be dealt with. Here, when that method is invoked, ‘skipColumnZeros’ is turned on, so that the analysis is robust against data with zero or low values for some characters.

For example:

# First, do a homog opt, and pickle the optimized tree.
# Here I use a bionj tree, but you could use whatever.
read('d.nex')
a = var.alignments[0]
dm = a.pDistances()
t = dm.bionj()
d = Data()
t.data = d
t.newComp(free=1, spec='empirical')
t.newRMatrix(free=1, spec='ones')
t.setNGammaCat(nGammaCat=4)
t.newGdasrv(free=1, val=0.5)
t.setPInvar(free=0, val=0.0)
t.optLogLike()
t.name = 'homogOpt'
t.tPickle()

# Then, do the test ...
read('homogOpt.p4_tPickle')
t = var.trees[0]
read('d.nex')
d = Data()
t.data = d
t.compoTestUsingSimulations()

# Output would be something like ...
# Composition homogeneity test using simulations.
# P-values are shown.

#             Part Num       0
#            Part Name      all
# --------------------    --------
#        All Sequences     0.0000

# Or using more sims for more precision, and also doing the
# Chi-square test for contrast ...

t.compoTestUsingSimulations(nSims=1000, doChiSquare=True)

# Output might be something like ...
# Composition homogeneity test using simulations.
# P-values are shown.
# (P-values from Chi-Square are shown in parens.)

#             Part Num       0
#            Part Name      all
# --------------------    --------
#        All Sequences     0.0140
#   (Chi-Squared Prob)    (0.9933)

It is often the case, as above, that this test will show significance while the Chi-square test does not.

copyToTree(otherTree)
data
deleteCStuff()

Deletes c-pointers from nodes, self, and model, but not the data.

draw(showInternalNodeNames=1, addToBrLen=0.2, width=None, showNodeNums=1, partNum=0, model=None)

Draw the tree to the screen.

This method makes a text drawing of the tree and writes it to sys.stdout.

Arg addToBrLen adds, by default 0.2, to each branch length, to make the tree more legible. If you want the branch lengths more realistic, you can set it to zero, or better, use vector graphics for drawing the trees.

Setting arg model aids in drawing trees with tree-hetero models. If the model characteristic (usually composition or rMatrix) differs over the tree, this method can draw it for you.

See also Tree.Tree.textDrawList(), which returns the drawing as a list of strings.

See the method Tree.Tree.setTextDrawSymbol(), which facilitates drawing different branches with different symbols.

dump(tree=0, node=0, model=0, all=0)

Print rubbish about self.

tree
is the default, showing basic info about the tree.
node
shows info about all the nodes.
model
shows which modelThing number goes on which node. (which you can also get by drawing the tree)

(If you want the info about the model itself, do a aTree.model.dump() instead.)

dupe()

Duplicates self, but with no c-pointers. And no data object.

If there is a model, it is duped.

Returns a copy of self.

dupeSubTree(dupeNodeSpecifier, up, doBrLens=True, doSupport=True)

Makes and returns a new Tree object, duping part of self.

The dupeNodeSpecifier can be a node name, node number, or node object.

Arg ‘up’ should be True or False.

The returned subtree has a root-on-a-stick.

BrLens are not duped – brLens are default in the new subtree.

So if the tree is like this:

         +---------2:A
+--------1
|        +---------3:B
|
0--------4:C
|
|        +---------6:D
+--------5:dupeNode
         |         +--------8:E
         +---------7
                   +--------9:F

Then the subtree from node 5 up is:

                     +---------2:D
subTreeRoot:0--------1:dupeNode
                     |         +--------4:E
                     +---------3
                               +--------5:F

and the subtree from node 5 down is:

                            +--------2:A
                  +---------1
dupeNode:0--------5         +--------3:B
                  |
                  +---------4:C
eps(fName=None, width=500, putInternalNodeNamesOnBranches=0)

Make a basic eps drawing of self.

The ‘width’ is in postscript points.

By default, internal node names label the node, where the node name goes on the right of the node. You can make the node name label the branch by setting ‘putInternalNodeNamesOnBranches’.

getAllLeafNames(specifier)

Returns a list of the leaf names of all children

getChildrenNums(specifier)

Returns a list of nodeNums of children of the specified node.

See also Node.getNChildren()

getDegree(nodeSpecifier)
getEuclideanDistanceFromSelfDataToExpectedComposition()

Calculate the c_E stat between self.data and model expected composition.

The expected composition comes from the current tree (self) and model. There is an expected composition of each sequence in each part, and is obtained via pf.p4_expectedComposition(cTree). In non-stationary evolution, the expected composition of sequences approach the model composition asymptotically as the branch increases.

I am calling the Euclidean distance from the actual sequence composition to the expected composition c_E.

Returns:A list of lists — the c_E for each sequence, for each part. Order of the sequences is as in the Data.
getLen()

Return the sum of all br.len’s.

getNodeNumsAbove(nodeSpecifier, leavesOnly=0)

Gets a list of nodeNums, in postOrder, above nodeSpecifier.

The node specified is not included.

getPreAndPostOrderAbove(nodeSpecifier)

Returns 2 lists of node numbers, preOrder and postOrder.

This uses a stack, not recursion, so it should work for large trees without bumping into the recursion limit. The 2 lists are relative to the node specified, and include the node specified. PreOrder starts from theNode and goes to the tips; postOrder starts from the tips and goes to theNode.

getPreAndPostOrderAboveRoot()

Sets self.preOrder and self.postOrder.

This uses a stack, not recursion, so it should work for large trees without bumping into the recursion limit. PreOrder starts from the root and goes to the tips; postOrder starts from the tips and goes to the root.

getSeqNumsAbove(nodeSpecifier)

Gets a list of seqNums above nodeSpecifier.

getSiteLikes()

Likelihoods, not log likes. Placed in self.siteLikes, a list.

getSiteRates()

Get posterior mean site rate, and gamma category.

This says two things – 1. The posterior mean site rate, calculated like PAML 2. Which GDASRV category contributes most to the likelihood.

The posterior mean site rate calculation requires that there be only one gdasrv over the tree, which will usually be the case.

For placement in categories, if its a tie score, then it is placed in the first one.

The list of site rates, and the list of categories, both with one value for each site, are put into separate numpy arrays, returned as a list, ie [siteRatesArray, categoriesArray]

There is one of these lists for each data partition, and the results as a whole are returned as a list. So if you only have one data partition, then you get a 1-item list, and that single item is a list with 2 numpy arrays. Ie [[siteRatesArray, categoriesArray]]

If nGammaCat for a partition is 1, it will give that partition an array of ones for the site rates and zeros for the categories.

getWeightCommandComment(tok)
inputTreesToSuperTreeDistances(inputTrees, doSd=True, doScqdist=True)

Return the topology distance between input trees and pruned self.

Either or both of two distances, RF (‘sd’) and quartet (‘scqdist’) distances, are returned.

See also the Trees method of the same name that does a bunch of them.

isFullyBifurcating(verbose=False)

Returns True if the tree is fully bifurcating. Else False.

iterInternals()

Internal node generator. PreOrder. Including the root, if it is internal.

iterInternalsNoRoot()

Internal node generator, skipping the root. PreOrder

iterInternalsNoRootPostOrder()

Internal post order node generator, skipping the root. Assumes preAndPostOrderAreValid.

iterInternalsNoRootPreOrder()

Internal post order node generator, skipping the root. Assumes preAndPostOrderAreValid.

iterInternalsPostOrder()

Internal post order node generator. Assumes preAndPostOrderAreValid.

iterLeavesNoRoot()

Leaf node generator, skipping the root. PreOrder.

iterLeavesPostOrder()
iterLeavesPreOrder()
iterNodes()

Node generator, in preOrder. Assumes preAndPostOrderAreValid.

iterNodesNoRoot()

Node generator, skipping the root. PreOrder.

iterPostOrder()

Node generator. Assumes preAndPostOrderAreValid.

iterPreOrder()

Node generator. Assumes preAndPostOrderAreValid.

ladderize(biggerGroupsOnBottom=True)

Rotate nodes for a staircase effect.

This method, in its default biggerGroupsOnBottom way, will take a tree like this:

                            +---------4:A
                   +--------3
         +---------2        +---------5:B
         |         |
         |         +--------6:C
+--------1
|        |         +--------8:D
|        +---------7
|                  +--------9:E
0
|--------10:F
|
|        +---------12:G
+--------11
         +---------13:H

and rearranges it so that it is like ...

+--------10:F
|
|        +---------12:G
|--------11
|        +---------13:H
0
|                  +--------8:D
|        +---------7
|        |         +--------9:E
+--------1
         |         +--------6:C
         +---------2
                   |        +---------4:A
                   +--------3
                            +---------5:B

Note that for each node, the more populated group is on the bottom, the secondmost populated second, and so on.

To get it with the bigger groups on top, set biggerGroupsOnBottom=False. I made the default with the bigger groups on the bottom so that it often makes room for a scale bar.

The setting biggerGroupsOnBottom, the default here, would equivalent to set torder=right in paup; torder=left puts the bigger groups on the top.

lineUpLeaves(rootToLeaf=1.0, overWriteBrLens=True)

Make the leaves line up, as in a cladogram.

This makes the rootToLeaf distance the same for all leaves.

If overWriteBrLens is set, then the newly calculated br.lens replace the original br.lens. If it is not set, then the new br.lens are placed in br.lenL, and does not over-write the original br.lens.

makeSplitKeys(makeNodeForSplitKeyDict=False)

Make long integer-valued split keys.

This needs to have self.taxNames set.

We make 2 kinds of split keys– rawSplitKeys and splitKeys. Both are attributes of node.br, so we have eg node.br.splitKey.

Raw split keys for terminal nodes are 2**n, where n is the index of the taxon name. Eg for the first taxon, the rawSplitKey will be 1, for the 3rd taxon the rawSplitKey will be 4.

RawSplitKeys for internal nodes are the rawSplitKey’s for the children, bitwise OR’ed together.

SplitKeys, cf rawSplitKeys, are in ‘standard form’, where the numbers are even, ie do not contain the 1-bit. Having it in standard form means that you can compare splits among trees. If the rawSplitKey is even, then the splitKey is simply that, unchanged. If, however, the rawSplitKey is odd, then the splitKey is the rawSplitKey bit-flipped. For example, if there are 5 taxa, and one of the rawSplitKeys is 9 (odd), we can calculate the splitKey by bit-flipping, as:

01001 =  9   rawSplitKey
10110 = 22   splitKey

(Bit-flipping is done by exclusive-or’ing (xor) with 11111.)

The splitKey is readily converted to a splitString for display, as 22 becomes ‘.**.*’ (note the ‘1’ bit is now on the left). It is conventional that the first taxon, on the left, is always a dot. (I don’t know where the convention comes from.)

The root has no rawSplitKey or splitKey.

For example, the tree:

    +-------2:B (rawSplitKey = 2)
+---1
|   +---------3:C  (rawSplitKey = 4)
|
0-------------4:E  (rawSplitKey = 16)
|
|    +-----6:A  (rawSplitKey = 1)
+----5
     +-----------7:D  (rawSplitKey = 8)

has 2 internal splits, on nodes 1 and 5.

Node      n.br.rawSplitKey     n.br.splitKey
1             6                    6
5             9                   22

There should be no duplicated rawSplitKeys, but if the tree has a bifurcating root then there will be a duped splitKey.

This method will fail for trees with internal nodes that have only one child, because that will make duplicated splits.

If arg makeNodeForSplitKeyDict is set, then it will make a dictionary nodeForSplitKeyDict where the keys are the splitKeys and the values are the corresponding nodes.

model
modelFitTests(fName='model_fit_tests_out', writeRawStats=0)

Do model fit tests on the data.

The two tests are the Goldman-Cox test, and the tree- and model- based composition fit test. Both require simulations with optimizations in order to get a null distribution, and those simulations need to be done before this method. The simulations should be done with the simsForModelFitTests() method.

Self should have a data and a model attached, and be optimized.

The Goldman-Cox test (Goldman 1993. Statistical tests of models of DNA substitution. J Mol Evol 36: 182-198.) is a test for overall fit of the model to the data. It does not work if the data have gaps or ambiguities.

The tree- and model-based composition test asks the question: ‘Does the composition implied by the model fit the data?’ If the model is homogeneous and empirical comp is used, then this is the same as the chi-square test except that the null distribution comes from simulations, not from the chi-square distribution. In that case only the question is, additionally, ‘Are the data homogeneous in composition?’, ie the same question asked by the chi-square test. However, the data might be heterogeneous, and the model might be heterogeneous over the tree; the tree- and model-based composition fit test can ask whether the heterogeneous model fits the heterogeneous data. The composition is tested in each data partition, separately. The test is done both overall, ie for all the sequences together, and for individual sequences.

If you just want a compo homogeneity test with empirical homogeneous comp, try the compoTestUsingSimulations() method– its way faster, because there are not optimizations in the sims part.

Output is verbose, to a file.

modelSanityCheck(resetEmpiricalComps=True)

Check that the tree, data, and model specs are good to go.

Complain and exit if there is anything wrong that might prevent a likelihood evaluation from being done. We are assuming that a data object exists and is attached, and that model stuff has been set.

Check that each part has at least 1 each from comps, rMatrices, and gdasrvs (if nGammaCat is > 1).

If it is not a mixture model for a particular part, check that each node has a comp, rMatrix, and gdasr. Check that all comps, rMatrices, gdasrvs are used on a node somewhere.

Here relRate, ie the relative rate of each data partition, is adjusted based on the size of the data partitions.

newRelRate_p = oldRelRate_p * (Sum_p[oldRelRate_i * partLen_i] / Sum[partLen_i])

That ensures that Sum(relRate_i * partLen_i) = totalDataLength, ie that the weighted mean of the rates is 1.0.

This method also tallies up the number of free prams in the whole model, and sets self.model.nFreePrams.

nInternalNodes
nTax
newComp(partNum=0, free=0, spec='empirical', val=None, symbol=None)

Make, attach, and return a new Comp object.

The arg spec should be a string, one of:

'equal'          no val
'empirical'      no val
'specified'      val=[aList]
'wag', etc       no val
   (ie one of the empirical protein models, including
   cpREV, d78, jtt, mtREV24, mtmam, wag, etc)

If spec=’specified’, then you specify dim or dim-1 values in a list as the ‘val’ arg.

This method returns a Comp object, which you can ignore if it is a tree-homogeneous model. However, if it is a tree-hetero model then you may want to get that Comp object so that you can place it on the tree explicitly with setModelThing(), like this:

c0 = newComp(partNum=0, free=1, spec='empirical')
c1 = newComp(partNum=0, free=1, spec='empirical')
myTree.setModelThing(c0, node=myTree.root, clade=1)
myTree.setModelThing(c1, node=5, clade=1)
myTree.setModelThing(c1, node=18, clade=0)

Alternatively, you can simply let p4 place them randomly:

newComp(partNum=0, free=1, spec='empirical')
newComp(partNum=0, free=1, spec='empirical')
myTree.setModelThingsRandomly()

Calculation of probability matrices for likelihood calcs etc are wrong when there are any comp values that are zero, so that is not allowed. Any zeros are converted to var.PIVEC_MIN, which is 1e-18 this week. Hopefully close enough to zero for you.

newGdasrv(partNum=0, free=0, val=None, symbol=None)
newRMatrix(partNum=0, free=0, spec='ones', val=None, symbol=None)

Make, attach, and return a new RMatrix instance.

spec should be one of:

  • ‘ones’ - for JC, poisson, F81
  • ‘2p’ - for k2p and hky
  • ‘specified’
  • ‘cpREV’
  • ‘d78’
  • ‘jtt’
  • ‘mtREV24’
  • ‘mtmam’
  • ‘wag’
  • ‘rtRev’
  • ‘tmjtt94’
  • ‘tmlg99’
  • ‘lg’
  • ‘blosum62’
  • ‘hivb’
  • ‘mtart’
  • ‘mtzoa’

You do not set the ‘val’ arg unless the spec is ‘specified’ or ‘2p’. If spec=‘2p’, then you set val to kappa.

If the spec is ‘specified’, you specify all the numerical values in a list given as the ‘val’ arg. The length of that list will be (((dim * dim) - dim) / 2) - 1, so for DNA, where dim=4, you would specify a list containing 5 numbers.

nextNode(spokeSpecifier, hubSpecifier)

Get next node cycling around a hub node.

A bit of a hack to make a p4 Node behave sorta like a Felsenstein node. Imagine cycling around the branches emanating from a node like spokes on a hub, starting from anywhere, with no end.

The hub node would usually be the parent of the spoke, or the spoke would be the hub itself. Usually self.nextNode(spoke,hub) delivers spoke.sibling. What happens when the siblings run out is that self.nextNode(rightmostSibling, hub) delivers hub itself, and of course its branch (spoke) points toward the hub.parent. (Unless hub is the root, of course, in which case self.nextNode(rightmostSibling, hub) delivers hub.leftChild.) In the usual case of the hub not being the root, the next node to be delivered by nextNode(spokeIsHub, hub) is usually the leftChild of the hub. Round and round, clockwise.

nni(upperNodeSpec=None)

Simple nearest-neighbor interchange.

You specify an ‘upper’ node, via an upperNodeSpec, which as usual can be a node name, node number, or node object. If you don’t specify something, a random node will be chosen for you. (This latter option might be a little slow if you are doing many of them, as it uses iterInternalsNoRoot(), but mostly it should be fast enough).

The upper node has a parent – the ‘lower’ node. One subtree from the upper node and one subtree from the lower node are exchanged. Both subtrees are chosen randomly.

This works on biRooted trees also, preserving the biRoot.

node(specifier)

Get a node based on a specifier.

The specifier can be a nodeNum, name, or node object.

optLogLike(verbose=1, newtAndBrentPowell=1, allBrentPowell=0, simplex=0)

Calculate the likelihood of the tree, with optimization.

There are 3 optimization methods– choose one. I’ve made ‘newtAndBrentPowell’ the default, as it is fast and seems to be working. The ‘allBrentPowell’ optimizer used to be the default, as it seems to be the most robust, although it is slow. It would be good for checking important calculations. The simplex optimizer is the slowest, and will sometimes find better optima for difficult data, but often fails to optimize (with no warning).

optTest()
parseNewick(flob, translationHash, doModelComments=0)

Parse Newick tree descriptions.

This is stack-based, and does not use recursion.

parseNexus(flob, translationHash=None, doModelComments=0)

Start parsing nexus format newick tree description.

From just after the command word ‘tree’, to the first paren of the Newick part of the tree.

Parameters:
  • flob – an open file or file-like object
  • translationHash (dict) – associates short names or numbers with long proper taxon names
  • doModelComments (bool) – whether to parse p4-specific model command comments in the tree description
Returns:

None

patristicDistanceMatrix()

Matrix of distances along tree path.

This method sums the branch lengths between each pair of taxa, and puts the result in a DistanceMatrix object, which is returned.

Self.taxNames is required.

pruneSubTreeWithoutParent(specifier, allowSingleChildNode=False)

Remove and return a node, together with everything above it.

Arg specifier can be a nodeNum, name, or node object.

By default, the arg allowSingleChildNode is turned off, and is for those cases where the parent of the node has more than 2 children. So when the subTree is removed, the parent node that is left behind has more than one child.

The stuff that is removed is returned. The nodes are left in self; the idea being that the subTree will be added back to the tree again (via reconnectSubTreeWithoutParent()).

randomSpr()

Do a random spr move.

randomizeTopology(randomBrLens=True)
reRoot(specifier, moveInternalName=True, stompRootName=True, checkBiRoot=True, fixRawSplitKeys=False)

Re-root the tree to the node described by the specifier.

The specifier can be a node.nodeNum, node.name, or node object.

Here is a potential problem. Lets say you start with this tree, with split support as shown:

(((A, B)99, C)70, D, E);

                   +--------3:A
         +---------2:99
+--------1:70      +--------4:B
|        |
|        +---------5:C
0
|--------6:D
|
+--------7:E

Now we want to reRoot it to node 2. So we do that. Often, node.name’s are really there to label the branch, not the node; you may be using node.name’s to label your branches (splits), eg with support values. If that is the case, then you want to keep the node name with the branch (not the node) as you reRoot(). Ie you want node labels to behave like branch labels. If that is the case, we set moveInternalName=True; that is the default. When that is done in the example above, we get:

(A, B, (C, (D, E)70)99);

+--------3:A
|
|--------4:B
2
|        +---------5:C
+--------1:99
         |         +--------6:D
         +---------0:70
                   +--------7:E

Another possibility is that the node names actually are there to name the node, not the branch, and you want to keep the node name with the node during the re-rooting process. That can be done by setting moveInternalName=False, and the tree below is what happens when you do that. Each internal node.name stays with its node in the re-rooting process:

(A, B, (C, (D, E))70)99;

   +--------3:A
   |
   |--------4:B
99:2
   |        +---------5:C
   +--------1:70
            |         +--------6:D
            +---------0
                      +--------7:E

Now if you had the default moveInternalName=True and you had a node name on the root, that would not work (draw it out to convince yourself ...). So in that case you probably want stompRootName=True as well — Both are default. If you set stompRootName=True, it gives you a little warning as it does it. If you set stompRootName=2, it will do it silently. If moveInternalName is not set, then stompRootName is not used.

You probably do not want to reRoot() if the current root is bifurcating. If you do that, you will get a node in the tree with a single child, which is strictly ok, but useless and confusing. So by default I checkBiRoot=True, and throw a P4Error if there is one. If you want to draw such a pathological tree with a node with a single child, set checkBiRoot=False, and it will allow it.

readBipartitionsFromPaupLogFile(thePaupLogFileName)

Assigns support to the tree, from the PAUP bipartitions table.

This needs to have self.taxNames set.

This is useful if you want to make a consensus tree using PAUP, and get the support values. When you make a cons tree with PAUP, the support values, usually bootstrap values, are unfortunately not saved with the tree. That information is in the Bipartitions table, which can be saved to a PAUP log file so that p4 can get it. This method will read thePaupLogFileName and extract the split (tree bipartition) supports, and assign those supports to self as node.br.support’s (as a float, not a string).

It also returns a hash with the split strings as keys and the split support as values, if you need it.

recalculateSplitKeysOfNodeFromChildren(aNode, allOnes)
reconnectSubTreeWithoutParent(stNode, newParent, beforeNode=None)

Attach subtree stNode to the rest of the tree at newParent.

The beforeNode is by default None, and then the subtree is reconnected as the rightmost child of the new parent. However, if you want it somewhere else, for example as the leftmost child, or between two existing child nodes, specify a beforeNode (specified as usual as a node, nodeNumber, or node name) and the subtree will be inserted there.

removeAboveNode(specifier, newName)

Remove everything above an internal node, making it a leaf, and so needing a new name.

removeEverythingExceptCladeAtNode(specifier)

Like it says. Leaves a tree with a root-on-a-stick.

removeNode(specifier, alsoRemoveSingleChildParentNode=True, alsoRemoveBiRoot=True, alsoRemoveSingleChildRoot=True)

Remove a node, together with everything above it.

Arg specifier can be a nodeNum, name, or node object.

So lets say that we have a tree like this:

+-------1:A
0
|       +--------3:B
+-------2
        +--------4:C

and we remove node 4. When it is removed, node 2 ends up having only one child. Generally you would want to remove it as well (so that the parent of node 3 is node 0), so the option alsoRemoveSingleChildParentNode is turned on by default. If alsoRemoveSingleChildParentNode is turned off, nodes like node 2 get left in the tree.

Removal of a node might cause the creation of a bifurcating root. I assume that is not desired, so alsoRemoveBiRoot is turned on by default.

In the example above, if I were to remove node 4, by default node 2 would also disappear, but by default node 0 would also disappear because it would then be a tree with a bifurcating root node. So starting with a 5-node tree, by removing 1 node you would end up with a 2-node tree, with 1 branch.

In the example here, if I were to simply remove node 1:

+--------1:A
|
0        +---------3:B
+--------2
         |         +--------5:C
         +---------4
                   +--------6:D

Then node 0 would remain, as:

         +---------2:B
0--------1
         |         +--------4:C
         +---------3
                   +--------5:D

Presumably that is not wanted, so the arg alsoRemoveSingleChildRoot is turned on by default. When the root is removed, we are left with a bi-root, which (if the arg alsoRemoveBiRoot is set) would also be removed. The resulting tree would be:

+-------0:B
|
1-------2:C
|
+-------3:D

The deleted nodes are really deleted, and do not remain in self.nodes.

removeRoot()

Removes the root if self.root is mono- or bifurcating.

This removes the root node if the tree is rooted on a terminal node, or if the tree is rooted on a bifurcating node. Otherwise, it refuses to do anything.

In the usual case of removing a bifurcating root, the branch length of one fork of the bifurcation is added to the other fork, so the tree length is preserved.

In the unusual case of removing a monofurcating root (a root that is a terminal node, a tree-on-a-stick) then its branch length disappears.

renameForPhylip(dictFName='p4_renameForPhylip_dict.py')

Rename with phylip-friendly short boring names.

It saves the old names (together with the new) in a python dictionary, in a file by default named p4_renameForPhylip_dict.py

If self does not have taxNames set, it does not write originalNames to that file– which may cause problems restoring names. If you want to avoid that, be sure to set self.taxNames before you do this method.

This method does not deal with internal node names, at all. They are silently ignored. If they are too long for phylip, they are still silently ignored, which might cause problems.

restoreDupeTaxa(dictFileName='p4DupeSeqRenameDict.py', asMultiNames=True)

Restore previously removed duplicate taxa from a dict file.

The usual story would be like this: You read in your alignment and p4 tells you that you have duplicate sequences. So you use the Alignment method checkForDuplicateSequences() to remove them, which makes a dictionary file, by default ‘p4DupeSeqRenameDict.py’, to facilitate restoration of the names. You do your analysis on the reduced alignment, and get a tree. Then you use the dictionary file with this method to restore all the taxon names.

If asMultiNames is turned on, the default, then the leaf nodes are not replicated, and the name is changed to be a long multi-name.

If asMultiNames is turned off, then the restored taxa are made to be siblings, and the branch lengths are set to zero.

restoreNamesFromRenameForPhylip(dictFName='p4_renameForPhylip_dict.py')

Given the dictionary file, restore proper names.

The renaming is done by the Alignment method renameForPhylip(), which makes the dictionary file. The dictionary file is by default named p4_renameForPhylip_dict.py

rotateAround(specifier)

Rotate a clade around a node.

The specifier can be a nodeNum, name, or node object.

setCStuff()

Transfer info about self to c-language stuff.

Transfer relationships among nodes, the root position, branch lengths, model usage info (ie what model attributes apply to what nodes), and pre- and post-order.

setEmpiricalComps()

Set any empirical model comps to the comp of the data.

This is done by self.modelSanityCheck(), but sometimes you may want to do it at other times. For example, do this after exchanging Data objects, or after simulating. In those cases there does not seem to be a reasonable way to do it automatically.

setModelThing(theModelThing, node=None, clade=1)
setModelThingsNNodes()

Set nNodes for all modelThings

setModelThingsRandomly(forceRepresentation=2)

Place model things (semi-)randomly on the tree.

For example, if there are 2 compositions in model part partNum, this method will decorate each node of the tree with zeros and ones, randomly. The actual thing set is node.parts[partNum].compNum. If the model thing is homogeneous, it will just put zeros in all the nodes.

We want to have each model thing on the tree somewhere, and so it is not really randomly set. If the model thing numbers were assigned randomly on the tree, it may occur that some model thing numbers by chance would not be represented. This is not allowed, and you can set forceRepresentation to some positive integer, 1 or more. That number will be the lower limit allowed on the number of nodes that get assigned the model thing number. For example, if forceRepresentation is set to 2, then each model thing must get assigned to at least 2 nodes.

setNGammaCat(partNum=0, nGammaCat=1)
setNexusSets()

Set self.nexusSets from var.nexusSets.

A deepcopy is made of var.nexusSets, only if it exists. If var.nexusSets does not yet exist, a new blank one is not made (cf this method in Alignment class, where it would be made).

Important! This method depends on a correct taxNames.

setPInvar(partNum=0, free=0, val=0.0)
setPreAndPostOrder()

Sets or re-sets self.preOrder and self.postOrder lists of node numbers.

PreOrder starts from the root and goes to the tips; postOrder starts from the tips and goes to the root.

setRelRate(partNum=0, val=0.0)
setRjComp(partNum=0, val=True)
setRjRMatrix(partNum=0, val=True)
setTextDrawSymbol(theSymbol='-', node=None, clade=1)
simsForModelFitTests(reps=10, seed=None)

Do simulations for model fit tests.

The model fit tests are the Goldman-Cox test, and the tree- and model-based composition fit test. Both of those tests require simulations, optimization of the tree and model parameters on the simulated data, and extraction of statistics for use in the null distribution. So might as well do them together. The Goldman-Cox test is not possible if there are any gaps or ambiguities, and in that case Goldman-Cox simulation stats are not collected.

Doing the simulations is therefore the time-consuming part, and so this method facilitates doing that job in sections. If you do that, set the random number seed to different numbers. If the seed is not set, the process id is used. (So obviously you should explicitly set the seed if you are doing several runs in the same process.) Perhaps you may want to do the simulations on different machines in a cluster. The stats are saved to files. The output files have the seed number attached to the end, so that different runs of this method will have different output file names. Hopefully.

When your model uses empirical comps, simulation uses the empirical comp of the original data for simulation (good), then the optimization part uses the empirical comp of the newly-simulated data (also good, I think). In that case, if it is tree-homogeneous, the X^2_m statistic would be identical to the X^2 statistic.

You would follow this method with the modelFitTests() method, which uses all the stats files to make null distributions to assess significance of the same stats from self.

simulate(calculatePatterns=True, resetSequences=True, resetNexusSetsConstantMask=True, refTree=None)

Simulate into the attached data.

The tree self needs to have a data and model attached.

This week, generation of random numbers uses the C language random function, which is in stdlib on Linux. It will use the same series of random numbers over and over, unless you tell it otherwise. That means that (unless you tell it otherwise) it will generate the same simulated data if you run it twice. To reset the randomizer, you can use func.reseedCRandomizer(), eg

func.reseedCRandomizer(os.getpid())

The usual way to simulate does not use reference data. An unsual way to simulate comes from (inspired by?) PhyloBayes, where the simulation is conditional on the original data. It uses conditional likelihoods of that reference data at the root. To turn that on, set refTree to the tree+model+data that you would like to use. Calculate a likelihood with that refTree before using it, so that conditional likelihoods are set. The tree and model for refTree should be identical to the tree and model for self.

Parameters:
  • calculatePatterns (bool) – True by default. Whether to “compress” the newly simulated data to facilitate a faster likelihood calculation.
  • resetSequences (bool) – True by default. whether to bring the simulated sequences in C back into Python
  • resetNexusSetsConstantMask (bool) – True by default. When simulations are made, the constant mask in any associated nexus sets will get out of sync. Setting this to True makes a new mask and sets it.
  • refTree (Tree) – None by default. If supplied, a tree+model+data which has had its likelihood calculated, where the tree+model is identical to self.
spr(pruneNode=None, above=True, graftNode=None)

Subtree pruning and reconnection.

See also the Tree.randomSpr() method. It uses this method to do a random spr move.

This only works on fully bifurcating trees. Doing spr moves would tend to break up polytomies anyway; pruning subtrees from a polytomy would require creation of new nodes.

The subtree to be pruned might be pointing up or pointing down from a specified node. If the subtree is pointing up, the subtree to be pruned is specified by the appropriate child of the root of the subtree; the subtree would have a root-on-a-stick (Is monofurcating a proper word?) with the subtree root’s single child being the specified node. If the subtree is pointing down, then the tree is re-rooted to the specified node to allow pruning of the subtree, now above the specified node, with the specified node as the root, including the subtree with the pre-re-rooting parent of the specified node.

I’ll draw that out. Lets say we want to prune the subtree below node 2 in this tree. That would include nodes 0, 1, 2, and 7.

+--------1:A
|
|        +---------3:B
|--------2
0        |         +--------5:C
|        +---------4
|                  +--------6:D
|
+--------7:E

The way it is done in this method is to re-root at node 2, which is the specified node. Then the subtree including the pre-re-rooting parent of the specified node, ie node 0, is pruned.

+--------3:B
|
|        +--------5:C
2--------4
|        +--------6:D
|
|        +--------1:A
+--------0
         +--------7:E
stripBrLens()

Sets all node.br.len’s to 0.1, the default in p4.

Then, if you were to write it out in Nexus or Newick format, no branch lengths would be printed.

subTreeIsFullyBifurcating(theNode, up=True)

Is theNode and everything above it (or below it) bifurcating?

Arg up says whether its above or below.

summarizeModelThingsNNodes()

Summarize nNodes for all modelThings if isHet

tPickle(fName=None)

Pickle self to a file with a ‘p4_tPickle’ suffix.

If there is an attached Data object, it is not pickled. If there is an attached model object, it is pickled. Pointers to c-structs are not pickled.

If fName is supplied, the file name becomes fName.p4_tPickle, unless fName already ends with .p4_tPickle. If fName is not supplied, self.name is used in fName’s place. If neither is supplied, the pid is used as fName.

If a file with the chosen name already exists, it is silently over-written!

p4 can read a p4_tPickle file from the command line or using the read() function, as usual.

(This would not be a good option for long-term storage, because if you upgrade p4 and the p4 Classes change a lot then it may become impossible to unpickle it. If that happens, you can use the old version of p4 to unpickle.)

taxNames
taxSetIsASplit(taxSetName)

Asks whether a nexus taxset is a split in the tree.

Parameters:taxSetName (str) – The name of the taxset. Case does not matter.
Returns:the node in self if the taxset is a split, or else None
Return type:Node
textDrawList(showInternalNodeNames=1, addToBrLen=0.2, width=None, autoIncreaseWidth=True, showNodeNums=1, partNum=0, model=False)
topologyDistance(tree2, metric='sd', resetSplitKeySet=False)

Compares the topology of self with tree2.

The complete list of metrics is given in var.topologyDistanceMetrics

For most metrics using this method, taxNames needs to be set, to the same in the two trees. If the taxa differ, this method simply returns -1

The ‘metric’ can be one of ‘sd’ (symmetric difference), ‘wrf’ (weighted Robinson-Foulds), ‘bld’ (Felsenstein’s branch- length distance), or ‘diffs’. The unweighted Robinson-Foulds metric would be the same as the symmetric difference.

There is also an experimental scqdist, but that needs the scqdist.so module, in the QDist directory.

See Felsenstein 2004 Inferring Phylogenies, Pg 529.

The default metric is the very simple ‘sd’, symmetric difference. Using this metric, if the 2 trees share the same set of splits, they are deemed to be the same topology; branch lengths are not compared. This method returns the number of splits that are in self that are not in tree2 plus the number of splits that are in tree2 that are not in self. So it would return 0 for trees that are the same.

The ‘wrf’ and ‘bld’ metrics take branch lengths into account. Bifurcating roots complicate things, so they are not allowed for weighted distance calculations.

In the unweighted case (ie metric=’sd’), whether the trees compared have bifurcating roots or not is ignored. So the trees (A,B,(C,D)) and ((A,B),(C,D)) will be deemed to have the same topology, since they have the same splits.

The measurement ‘diffs’, which returns a tuple of 2 numbers – both are set differences. The first is the number of splits in self that are not in tree2, and the second is the number of splits in tree2 that are not in self. (Consider it as the the symmetric difference split into its 2 parts.)

If you calculate a distance and then make a topology change, a subsequent sd topologyDistance calculation will be wrong, as it uses previous splits. So then you need to ‘resetSplitKeySet’.

The ‘scqdist’ metric also gives quartet distances. It was written by Anders Kabell Kristensen for his Masters degree at Aarhus University, 2010. http://www.cs.au.dk/~dalko/thesis/ It has two versions – a pure Python version (that needs scipy) that I do not include here, and a fast C++ version, that I wrapped in python. Its speedy – the ‘sc’ in ‘scqdist’ is for ‘sub-cubic’, ie better than O(n^3).

tv()

Tree Viewer. Show the tree in a gui window.

Needs Tkinter.

If you have nexus taxsets defined, you can show them.

tvTopologyCompare(treeB)

Graphically show topology differences.

The taxNames need to be set, and need to be the same for both trees.

(If the red lines don’t show up right away, try adjusting the size of the windows slightly.)

verifyIdentityWith(otherTree, doSplitKeys)

For MCMC debugging. Verifies that two trees are identical.

write()

This writes out the Newick tree description to sys.stdout.

writeNewick(fName=None, withTranslation=0, translationHash=None, doMcmcCommandComments=0, toString=False, append=False, spaceAfterComma=True)

Write the tree in Newick, aka Phylip, format.

This is done in a Nexus-oriented way. If taxNames have spaces or odd characters, they are single-quoted. There is no restriction on the length of the taxon names. A translationHash can be used.

fName may also be an open file object.

If ‘toString’ is turned on, then ‘fName’ should be None, and a Newick representation of the tree is returned as a string.

The method ‘writePhylip()’ is the same as this, with fewer arguments.

writeNexus(fName=None, append=0, writeTaxaBlockIfTaxNamesIsSet=1, message=None)

Write the tree out in Nexus format, in a trees block.

If fName is None, the default, it is written to sys.stdout.

#NEXUS is written unless we are appending– set append=1.

If you want to write with a translation, use a Trees object.

writePhylip(fName=None, withTranslation=0, translationHash=None, doMcmcCommandComments=0)

Write the tree in Phylip or Newick format.

(This method is just a dupe of writeNewick(). Without the ‘toString’ or ‘append’ args.)

fName may also be an open file object.