
P4 — essentials

2024-10-07.

Help

Help is available in various places:

Tutorial
This tutorial (that you are reading now) is an introductory overview, not a comprehensive reference — it skips a lot of detail.
Examples
The examples are part of the source code, and should have been installed along with the rest of p4, in a share directory in the source tree. The splash screen, the info you see when you first start up p4, may tell you where they are. They should be current and work as advertised, even if the API docs are shamefully not up-to-date. Copy them to a working directory so that you can modify them without changing the originals. If any do not work, please let me know.
Recipes
When in interactive p4, you can call func.recipes() to get examples to use as starting points for your own scripts.
API
The class and module (API) descriptions. These are generated automatically from the Python source. They should be fairly comprehensive and definitive.
  • The descriptions in the API are also available in the Python help() and dir() functions. For example, try help(Alignment) from within interactive p4 to read all the docstrings in the Alignment class. The dir() function lists the attributes.

Import p4

P4 is a Python package. You can import p4 into your Python as usual

import p4

or (and I prefer this, but you might not) you can use the p4 script. The p4 script can read data files at the command line. The p4 script loads the package as

from p4 import *

so all the p4 names go to the top level in Python, which you may not like.

This week, p4 will recognize and read data files in the form of Nexus files, fasta files, Phylip (sequential or interleaved), and clustalw aln files. P4 will recognize trees in Nexus and Phylip or raw Newick format. And of course p4 recognizes Python files. P4 understands Nexus data, taxa, characters, trees, and sets blocks, but it does not understand all the commands within those blocks.

Q: So why is it called p4?
A: It's an acronym: pppp
Q: So what's it short for?
A: So it's easy to type.

Quickstart

Here we provide a taster for some of the toolkit aspects of p4. These examples are very easy, and (except for the likelihood calculation) can be done using p4 interactively. You will probably find that for anything bigger writing a script would be easier.

These quickstart examples can be found in the Examples, in the A_quickStart directory. I suggest that you make a copy of the Examples, then change directory to A_quickStart in your copy, and then cd to directories therein.

Draw a tree on the screen

Let's say you have a file that has a tree description in Newick or Nexus format, and you want to draw that tree to the screen to see what it looks like. Say, to your shell

p4 -d t.nex

(or whatever your file name is)

Convert between file formats

Let's say that you have an alignment in Phylip format and you want to convert it to Nexus format. This is a small job, so we can use p4 interactively. First read in the data, then give the alignment a name, and then tell the alignment to write itself in Nexus format to a file

$ p4 data.phy                # Read the data at the command line
p4> a = var.alignments[0]    # Give the alignment a name
p4> a.writeNexus('data.nex') # Tell it to write itself to a file
p4>                          # Control-d to quit.

Extract a charset from an alignment

Let's say you have a multi-gene alignment, and you want to extract one of the genes and put it in an alignment file by itself. Here, the placement of the various genes in the alignment is defined in a Nexus sets block, in this example in the file sets.nex. Do something like this

$ p4 d.phy sets.nex                   # Read everything
p4> a = var.alignments[0]             # Name the alignment
p4> b = a.subsetUsingCharSet('gene1') # b is a new alignment
p4> b.writeNexus('gene1.nex')         # Tell b to write itself
p4>                                   # Quit with control-d

Tabulate the character frequencies in the data

$ p4 d.nex             # Read the data
p4> d=Data()           # Make a Data object
p4> d.compoSummary()   # Ask for a table of composition

Compare the topology of two trees

Let's say that you have two best trees from two different phylogenetic analyses using the same data. The trees might have the same topology, but the trees might not look much like each other, because they are rooted on different nodes, and many branches are rotated on their stems. However, ignoring branch lengths, the trees might still have the same topology, regardless of the various permutations. You can quickly find out how different they are by doing something like the following

$ p4 tree1.nex tree2.nex           # Read in both trees
p4> t1 = var.trees[0]              # Name the first one
p4> t2 = var.trees[1]              # Name the second one
p4> print(t1.topologyDistance(t2)) # Symmetric difference
0                                  # Zero means they are the same

The default metric for the Tree.topologyDistance() method is the symmetric difference, aka the unweighted Robinson-Foulds distance, which is the number of splits in one tree that are not in the other, plus the number of splits in the other tree that are not in the one. In this example, the trees are the same, and so the difference is zero. If the two trees had only one difference, the symmetric difference would be 2.

A very simple likelihood calculation

This example is a bit more involved, and is not well suited to interactive use. The usual way to use p4 would be to make a script, and that is what we do here. Make a file with the following, and save it as s.py

read(""" 2 2
one
ac
two
gt
""")
read('(one,two);')
t = var.trees[0]
t.data = Data()
t.newComp()
t.newRMatrix()
t.setPInvar()
t.calcLogLike()

Usually p4 scripts refer to other files for the data and the tree, but here it is all in the one script. Sequence data, a tree, and a model are described and then the likelihood is calculated without optimization. To make the script happen, say, to your command line

p4 s.py

Using p4

Using the p4 script

As usual for a Python package, you can import p4 from within Python, by import p4. However, my favourite way of using it is with the p4 script (which should have been installed, hopefully somewhere in your path). The p4 script does a from p4 import *, so your top-level namespace becomes instantly cluttered, but then you do not need to precede everything with p4. The p4 script allows you to read in phylogenetic data and tree files at the command line. To see if it is installed and working, say

p4 --help

Some peculiarities, bugs, and “features”

  • P4 doesn’t do searches with ML. It only evaluates explicit trees. It will do tree searches with Bayesian analysis.
  • Arrays and such in Nexus files are 1-based, while in Python and p4 they are zero-based. Nexus files are case-insensitive in p4, as they should be; but elsewhere in p4, as in Python, they are case-sensitive.
  • Nexus trees do not require a Taxa or Data block. Counter to the Nexus standard, having different trees with different taxa in the same trees block is acceptable in p4. However, if you do supply a taxa block, p4 will use it to enforce those taxon names in the trees.
  • Often the results are not written automatically: you have to tell p4 to write the result. However, you can tell it how you would like the output formatted. Since the interface is a programming language you can do whatever you want with the numbers.
  • You give starting values for optimization. If the params are already optimized, then it shouldn’t take long for re-optimization.
  • The default branch length in trees is 0.1.
  • When reading Nexus files, p4 recognizes datatype DNA, but not RNA. If you give it datatype nucleotide, p4 will assume you mean DNA.
  • The Nexus spec does not allow names that are all numerals, and by default p4 follows that rule. You can (eg temporarily) override that behaviour and allow all-digit names by True-ing the variable var.nexus_allowAllDigitNames

    var.nexus_allowAllDigitNames = True
    
  • The symbols -, ?, and . are the only ones allowed for gaps, unknown, and matchchars, respectively.

The var object, the global bucket

When you import p4, or start the p4 script, you import var, which is an object that holds things like lists and flags that are generally useful to you and p4.

With the p4 script you can read things from the command line, via

p4 aFile anotherFile

If those files contain phylogenetic data, they get put in

  • var.sequenceLists (a list)
  • var.alignments (a list)
  • var.trees (a list)
  • var.nexusSets (not a list; if it exists it is a NexusSets object)

The read() function

The usual way to get phylogenetic data and trees into p4 is with the read() function. This function is used in scripts that you might write; it is also used within the p4 script to read in files from the command line.

Nexus files or files containing trees often contain multiple things which would be turned into multiple Python objects, so a command like

myAlignment = readNexus('myData.nex') # Doesn't work

would not always work. The read() function does not return anything; rather, when you read in alignments from files, Alignment objects are made and stuffed into var.alignments, a list, as explained above in the section The var object, the global bucket. So a typical script might start out by reading in data and trees, like this

read('myAlignment.nex')
a = var.alignments[0]
read('myTreeFile.nex')
t = var.trees[0]

There might be several trees in myTreeFile.nex, and when the file is read in they are all converted to Tree objects and put in the var.trees list. So if you want one of those trees, you have to say which one. To get the first one, say something like

firstTree = var.trees[0]

When you read in files from the command line using the p4 script, you can use file name globbing, ie use wildcards, as

p4 *.phy

In that case the shell does the file name expansion. You can also do the same sort of thing with read(), as in

read('*.phy')

If you want to make p4 forget the trees that it has read, say

var.trees = []

You can do the same for alignments and sequenceLists. To make p4 forget any Nexus sets info that it has read, you can do

var.nexusSets = None

If you are sure that the file that you are trying to read has only one thing (Tree, Alignment, SequenceList), then you can use func.readAndPop(). This is handy if for example you are reading in a bunch of sequence list files and one of them happens to have all its sequences the same length – so it gets promoted to an Alignment object and gets put in var.alignments. If you use func.readAndPop() then you would not need to check.

    # awkward using read()
    for fName in myFileList:
        var.alignments = []
        var.sequenceLists = []
        read(fName)
        try:
            sl = var.sequenceLists[0]
        except IndexError:
            sl = var.alignments[0]

    # easier using func.readAndPop()
    for fName in myFileList:
        sl = func.readAndPop(fName)   # SequenceList or Alignment

The dump() function and methods

The function func.dump() gives a quick summary of files that you have read, and objects that have been made and placed in var.trees, var.alignments, and so on. It does not know about alignments and such that are not in var.*.

Several classes have dump() methods as well. For example, to see inside trees in fine detail, you can use the Tree.dump() method, for example

t.dump()

or

t.dump(all=True)

To see details about models, use Model.dump(), for example

t.model.dump()

Alignments and data

You can read in an alignment from a file either with the p4 script with the file name as a command-line argument, or using the read() function within p4. Any alignments that are made are put in var.alignments, a list. (If it's a fasta file and the sequences are not all the same length, it is not an alignment; it is a SequenceList, and it is put in var.sequenceLists, another list.) So a lot of p4 interactive sessions might start out like this

$ p4 myData.nex           # in your shell
a = var.alignments[0]     # in interactive python+p4

and a lot of p4 scripts might start out like this

read('myData.nex')
a = var.alignments[0]

Saying read() here reads the file, makes an Alignment object from it, and puts that object in var.alignments. Then you give it the name a. Then you can do useful things with a.

The function func.readAndPop() is similar, but it only works for files that have exactly one tree, alignment, or sequence list, and the object is popped from the corresponding var list. So for example if myData.nex has one alignment, then when I do

a = func.readAndPop('myData.nex')   # in interactive python+p4

then the Alignment object will not be in var.alignments.

Alignment, Part, and Data classes

Alignment
A SequenceList subclass, with a lot of functionality in Python.
Part
A data partition, with the sequence data in the C language, for likelihood calculations. Each alignment has at least one part, maybe more.
Data
All the partitions that you want to use. It is at least one Alignment (but perhaps more), with at least one Part (at least one Part for each Alignment, but perhaps more). A Data object is needed for likelihood calculations. It has a C-language representation of the sequences as a list of partitions.

An alignment of phylogenetic data in a file can be read in to make an Alignment object, which is good for simple tasks that you might want to do with alignments. However, if you want to calculate the likelihood of the data, then things get more complicated because then we transfer the alignment into the pf module, written in the C-language. A Part object encapsulates an alignment (or an alignment partition) in C. A Data object contains all the Part objects that you want to use, together with the alignments that the parts came from.

When you do likelihood calculations, your data might be simple– they might be in just one alignment, with only one data partition. In that case your Data object has only one Alignment object, and one Part object. Or you might have 2 alignments, each with a single part, and so your Data object would have 2 alignments and 2 parts. Or perhaps you have a single alignment which is partitioned into 5 genes, in which case your Data object would have one alignment and 5 parts.

To reiterate, the way to get phylogenetic data into p4 is with the read() function. When you read in alignments from files, Alignment objects are made and stuffed into var.alignments, a list. So a typical script to do something simple (ie not likelihood) with an alignment might start out by reading in a file, like this

read('myAlignment.nex')
a = var.alignments[0]

For likelihood calculations you need a Data object, so a typical script for that might start out like this

read('myAlignment.nex')
d = Data()

In either case when you read in the file, an Alignment object is made and deposited in var.alignments. It is there whether or not you give it a name, and whether or not you use it to make a Data object.

If you need to partition your alignment, then you might do something like

read("myAlignment.nex")
read("mySets.nex")            # with the partition info
a = var.alignments[0]
a.setCharPartition('myCharPartitionName')
d = Data()  # now with more Parts than if the Alignment had not been partitioned

Part objects are made when you make a Data object. It is done behind the scenes so you may never need to be concerned with Part objects.

The sequence data that you want to use might be in more than one alignment, and that is perfectly acceptable to p4. In fact, it is essential if you are using more than one datatype (eg the DNA sequence for one gene, analysed simultaneously with the protein sequence for another gene). A MrBayes-style data matrix with more than one datatype is not allowed by p4 (nor is it Nexus compliant). So you might do something like the following to read in two alignments

read('myFirstAlignment.nex')
read('mySecondAlignment.nex')
d = Data()

P4 analyses data partition-by-partition. Separate alignments necessarily make separate data partitions, but you can also split up single alignments into separate partitions–that would be done using Nexus sets blocks. To do this you might do something like

read('myAlignment.nex')
a = var.alignments[0]
read('myNexusSetsBlock.nex') # Defines cp1
a.setCharPartition('cp1')    # Applies cp1 to the alignment
d = Data()  # now with more Parts than if the Alignment had not been partitioned

In order to get a Data object from one alignment that you want to partition into 2 parts together with one unpartitioned alignment, you might do something like the following

read('myFirstAlignment.nex')
a = var.alignments[0]
read('myNexusSetsBlock.nex') # defines cp1
a.setCharPartition('cp1')
read('mySecondAlignment.nex')
d = Data()

The d Data instance now contains 3 Part objects (in d.parts) and 2 Alignment objects (in d.alignments).

If you are going to use a Data object together with a tree, eg for ML evaluations, Bayesian analysis, or simulation, you need to attach the Data object to the tree, as

t.data = d

Nexus charpartitions

This week, p4 understands Nexus charset and charpartition commands within Nexus sets blocks. When a Nexus sets block is read in, its contents are added to var.nexusSets, a NexusSets object which can hold charsets and charpartitions. When you want to use a charpartition on an alignment, you would say something like

a = var.alignments[0]
a.setCharPartition(theCharPartitionName)

That attaches a copy of var.nexusSets to the alignment. To un-set a charpartition from an alignment, you can say a.setCharPartition(None).

Setting a charPartition does not actually do anything to the sequences in an Alignment; rather, a charPartition only comes into play when you make a Data object. When you make a Data object from your partitioned alignments, you can check that you are indeed dealing with partitioned data by calling the dump() method on your Data object.
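
For example, a minimal sketch (the file and charpartition names are placeholders):

read('myAlignment.nex')
read('myNexusSetsBlock.nex')   # defines the charpartition cp1
a = var.alignments[0]
a.setCharPartition('cp1')
d = Data()
d.dump()                       # should show one Part per subset of cp1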

Subsetting alignments

By subsetting I mean extracting a part of an alignment into its own new alignment. You can subset alignments based on charsets or mask strings.

If you do not have a Nexus sets block, you can still use the Nexus character list format (eg 1 3-5 7 11 below, and note that it is 1-based numbering) to make a mask on the fly using the func.maskFromNexusCharacterList() function, for example

p4> print(func.maskFromNexusCharacterList("1 3-5 7 11", 12, invert=1))
010001011101
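
Having made a mask, you can use it to extract the chosen sites into a new alignment. A minimal sketch, assuming the Alignment method for this is named subsetUsingMask() (check its doc string):

read('d.nex')
a = var.alignments[0]
mask = func.maskFromNexusCharacterList("1 3-5 7 11", a.length)
b = a.subsetUsingMask(mask)    # assumed method name; b is a new Alignment
b.writeNexus('subset.nex')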

Nexus standard datatype

P4 supports Nexus standard datatype.

Something I have been looking into is recoding protein alignments into amino acid groups, for example the 6 Dayhoff groups, for analysis. That uses the Alignment.recodeDayhoff() or the Alignment.recodeProteinIntoGroups() method. Here the amino acid symbols are recoded as numerals. For example in Dayhoff recoding, since C (Cysteine) is in a group by itself, it is simply recoded as 1, but the amino acids S, T, P, A, and G are all recoded as 2, as they are all in the same Dayhoff group. It is rather like recoding DNA into transversions to decrease the effects of saturation. It is based on the notion that changes within groups might be saturated, biased, and difficult to model, while changes between groups are less problematic. Certainly there is loss of information, but (hopefully) the information that is retained is of higher quality.
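
For example, a minimal sketch, assuming recodeDayhoff() recodes the alignment in place (the file names are placeholders):

read('myProteinAlignment.nex')
a = var.alignments[0]
a.recodeDayhoff()              # assumed in-place: C -> 1; S, T, P, A, G -> 2; and so on
a.writeNexus('myRecodedAlignment.nex')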

(Non-parametric) bootstrapping of data

Bootstrapping an alignment is trivially easy, and lots of programs do it, p4 included. It is a method in the Alignment class.

However, bootstrapping of partitioned data is also supported by p4. It is a method of the Data class. The alignment and partition structure is maintained in the bootstrapping process.
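
A minimal sketch of both; the method name bootstrap() and its returning a new object are assumptions here, so check the doc strings:

# a single alignment, via the Alignment method
read('d.nex')
a = var.alignments[0]
b = a.bootstrap()              # assumed to return a new, bootstrapped Alignment

# partitioned data, via the Data method, preserving the partition structure
d = Data()
bd = d.bootstrap()             # assumed to return a new Data object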

Dealing with duplicate sequences

There is no point in doing a deep phylogenetic analysis with duplicated sequences. (There may be a reason for popgen level work.) If you have any duplicated sequences, p4 will point them out when you read in your data file. P4 will remove the sequences if you tell it to. But often you want those dupes in your final result - generally a tree. So you will want to remove the duplicated sequences before you do your analysis, and then restore the taxa to the resulting tree afterwards. For example, let's say that you read in your data like this

$ p4 d3.nex         # at your shell

and then you get a long message complaining about duplicated sequences, starting with

Alignment from file 'd3.nex'
This alignment has duplicate sequences!
Sequence numbers below are 1-based.
   sequence 1 (A) is the same as sequence 3 (C).
   sequence 1 (A) is the same as sequence 8 (H).
   sequence 2 (B) is the same as sequence 5 (E).
   sequence 2 (B) is the same as sequence 7 (G).
   ...

To remove the duplicated sequences and make a Python dictionary file that will be useful later for restoring those removed taxa, you can do something like this

read('d3.nex')
a = var.alignments[0]
a.checkForDuplicateSequences(removeDupes=True, makeDict=True)
a.writePhylip(fName='d3_noDupes.phy')

(Above I wrote the de-duplicated alignment as a Phylip file, but it need not be so; you could write it in Nexus or fasta format as you choose.)

That makes a dictionary file, by default named p4DupeSeqRenameDict.py. Don’t lose it, because it is needed when you restore the deleted duped taxa to the resulting tree.

It removes all but one of the duplicate sequences, and re-names the remaining sequence with a name like p4Dupe1 or p4Dupe2.

Then you can do your analysis, which will go faster because you don’t have duped sequences. At the end, you will generally have a tree. You can restore the removed taxa to the tree using the dictionary file made above. You would do something like

read('myTree.nex')
t = var.trees[0]
t.restoreDupeTaxa()
t.writeNexus(fName='myRestoredTree.nex')

The restored taxa will by default have compound names.

Trees

Tree, Node, and Trees classes

Trees in p4 are described by a list of nodes, one of which is the root. All trees have a root, but this has nothing to do with the number of children that the root has.

In phylogenetics when we talk about rooted trees we usually mean that we think that one node, the root, is the ancestor of all the other nodes; however the root in p4 usually carries no such implication. The root in p4 is often just a node that is a starting place for the rest of the nodes.

Can the root ever be a leaf? — Not usually, but it is possible.

Nodes are the vertexes of the tree, where the branches meet. (My dictionary tells me that both vertices and vertexes are correct plurals of vertex.) P4 describes relationships among nodes with 3 pointers: we have a pointer to the parent node, to a child node (which I imagine to be always on the left, so it is the leftChild), and to a sibling node (which I imagine to always be on the right). Any of these pointers might be None; for example the root node's parent is None, the leftChild of any tip node is None, and the sibling of the rightmost child of a node is None.

leftChild
         \
          \
           Node --- sibling
           |
           |
           parent

All the nodes except the root in a p4 Tree also have a NodeBranch attribute, br, which embodies the branch leading to the node from its parent. The branch holds information such as its length, and so for a node n we have n.br.len.

The usual way to get around trees is by recursive traversal, and p4 provides a few avenues for that, both as Node methods and as Tree methods. See Knuth’s The Art of Computer Programming for a good explanation of preorder and postorder tree traversal.
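
For example, a minimal sketch; the method name iterPreOrder() and the Node attribute isLeaf are assumptions here, so check the doc strings:

read('t.nex')
t = var.trees[0]
for n in t.iterPreOrder():     # assumed Tree method: parents before children
    if n.isLeaf:
        print(n.nodeNum, n.name)
    else:
        print(n.nodeNum)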

Sometimes it is natural to deal with trees by the bunch, and so we have the Trees class. It is this class that is able to write Nexus tree files that use translations, and it is this class that interfaces with consel to compare several trees.

This week, p4 will read trees in Nexus format, or in Phylip or raw Newick format. The Nexus format is described in the 1997 paper in Systematic Biology (MadSwofMad97). I am not sure where the Phylip/Newick format is described. Older Phylip tree files began with a number, indicating the number of trees in the file, but it appears that newer Phylip tree files do not have or need that number. When did it change?

P4 reads both Nexus and Phylip trees with Nexus-like rules, so if you have punctuation or spaces in your taxon names then you need to put the name in single quotes. P4 fails to read multi-word Phylip taxon names, so for example p4 will gag on the tree

((A, B), (C, Homo sapiens));

because one of the taxa has a space in it. That sort of taxon name is perfectly acceptable if

  1. you put it inside single quotes, or
  2. you True the flag var.newick_allowSpacesInNames

The tree files output by Tree-Puzzle have modifications to the Phylip/Newick format, where [comments bounded by square brackets] precede the tree description. P4 can handle those.

Branch lengths go after colons in the tree description, as usual. You can label internal nodes, including the root, immediately after a closing parenthesis, as in the following

p4> read("((A, B)the_AB_clade:0.2, (C, 'Homo sapiens')98:0.3)theRoot;")
p4> t=var.trees[0]
p4> t.draw()
                 +------2:A
        +--------1:the_AB_clade
        |        +------3:B
theRoot:0
        |          +------5:C
        +----------4:98
                   +------6:Homo sapiens

P4 distinguishes Node objects (vertices) and NodeBranch objects (edges).

It is the branch that holds the length of the branch. Branches and branch lengths can be accessed by

myNode.br
myNode.br.len

After doing a consensus tree, it is the branch that holds the support values. That makes the branch supports immune to re-rooting. But it also means that in order to see them and save them in Newick or Nexus format you will need to convert them to node.names. This has the advantage that you can format the conversion to your liking — you can make it a percent, or a value from zero to 1.0 with however many significant figures you wish

tp = TreePartitions('myTrees.nex')
t = tp.consensus()

# list the support values
for n in t.iterInternalsNoRoot():
    print("node %3i branch support is %f" % (n.nodeNum, n.br.support))

# perhaps collapse nodes where the support is too small for you
toCollapse = [n for n in t.iterInternalsNoRoot() if n.br.support < 0.7]
for n in toCollapse:
    t.collapseNode(n)

# optionally re-root it
t.reRoot(t.node('myOutgroupTaxon').parent)

# Make the internal node names the percent support values
for n in t.iterInternalsNoRoot():
    n.name = '%.0f' % (100. * n.br.support)

# Drawings get a bit messy with both node nums and support
t.draw(showNodeNums=False)
t.writeNexus('consTreeWithSupport.nex')

The Nexus/Newick format can hold node names and branch lengths, but it is awkward and non-standard to make it hold more information (eg split support, branch colour, model usage information, rooting information, and so on). Perhaps an XML-based format would be useful here, but these are early days for XML in phylogenetics, and XML files can get very big very quickly. Update: NeXML looks interesting (files are still big, tho).

As an alternative, you can store information-rich trees by pickling them, a standard Python thing to do to archive objects to files. Trees, nodes, and models may have pointers to c-structures; these should not be archived, nor should you archive data objects along with the trees. To facilitate pickling trees you can use the Tree method tPickle(), which strips out the c-pointers and data before pickling the tree. Trees so pickled (with the p4_tPickle suffix) are autorecognized by p4 with the read() function or at the command line.
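
A minimal sketch; that tPickle() takes a base file name is an assumption, so check its doc string:

t = var.trees[0]
t.tPickle('myTree')            # assumed to write myTree.p4_tPickle

# later, perhaps in another session; the suffix is auto-recognized
read('myTree.p4_tPickle')
t2 = var.trees[0]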

Tree diagrams

You can make a text diagram of a tree at the terminal with the Tree.draw() method. It provides some control over the presentation, for example the width, whether node numbers are displayed, and so on.
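
For example (showNodeNums appears elsewhere in this tutorial; the width argument is an assumption, so check the doc string):

t = var.trees[0]
t.draw()                                 # defaults
t.draw(width=100, showNodeNums=False)    # assumed arg names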

Topology distance

You can compare tree topologies such that the root of the tree and any rotations of clades do not matter. For example, these 2 trees have identical topologies (by this definition), even though they do not look like each other

     +--------1:A
     |
     |        +--------3:B
     0--------2
     |        +--------4:C
     |
     |        +--------6:D
     +--------5
              +--------7:E

     +--------1:E
     |
     |--------2:D
     0
     |                  +--------5:C
     |        +---------4
     +--------3         +--------6:B
              |
              +---------7:A

With the Tree.topologyDistance() method you can compare topologies without taking branch length differences into account, or you can use metrics that do take branch lengths into account. The default metric is the symmetric difference, aka the unweighted Robinson-Foulds distance, which ignores branch lengths. The Tree.topologyDistance() method also provides the weighted Robinson-Foulds distance and the branch length distance, which take branch lengths into account. These are described in Felsenstein's book. To do several trees at once, you can use the Trees.topologyDistanceMatrix() method.
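
A minimal sketch; the metric keyword and the values 'wrf' and 'bld' are assumptions based on the descriptions above, so check the Tree.topologyDistance() doc string:

t1 = var.trees[0]
t2 = var.trees[1]
print(t1.topologyDistance(t2))                 # symmetric difference, the default
print(t1.topologyDistance(t2, metric='wrf'))   # assumed: weighted Robinson-Foulds
print(t1.topologyDistance(t2, metric='bld'))   # assumed: branch length distance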

Patristic distances

This distance is just the length along the tree path between all pairs of nodes. The method Tree.patristicDistanceMatrix() returns a DistanceMatrix object, which you probably want to write to a file. For example, you might say

t = var.trees[0]
dm = t.patristicDistanceMatrix()
dm.writeNexus('patristic.nex')

Models and likelihood

In order to calculate the likelihood of a tree, you need to specify a model. Since the model depends on the datatype and data partition structure, before you can specify a model you need to specify the data.

Having specified a model, you can calculate the likelihood without optimization with Tree.calcLogLike(); You can optimize any free parameters with the Tree.optLogLike() method.

t = var.trees[0]
t.data = Data()   # specify the data
<... specify the model ...>
t.calcLogLike()
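
To optimize the free parameters rather than simply evaluate, a minimal sketch (that the result is stored in t.logLike is an assumption):

t.optLogLike()      # optimize the free parameters
print(t.logLike)    # assumed attribute holding the optimized log likelihood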

Describing models

You need to specify

  • a composition,
  • a rate matrix, and
  • the proportion of invariant sites.
  • Optionally you can specify gamma distributed among-site rate variation (gdasrv), with the number of discrete gamma classes, nGammaCat.

Here is a simple one, for an F81 model, with no among-site rate variation

t.newComp(free=1, spec='empirical')
t.newRMatrix(free=0, spec='ones')
t.setNGammaCat(nGammaCat=1)       # no 'gamma'
t.setPInvar(free=0, val=0.0)      # no proportion of invariant sites

(Note that while the F81 is a DNA model, the model above would also work for other datatypes.)

Parameters of the model may or may not be free (adjustable or optimizable), and so that needs to be specified. Depending on the spec (ie model specification), some model parameter numerical values may need to be specified as a val argument. For example, to define a composition for the third data partition (partNum=2) for which you want the model composition to be fixed to certain values (ie not free), you might say

t.newComp(partNum=2, free=0, spec='specified', val=[0.4, 0.3, 0.2])

A simple Poisson or Jukes-Cantor-like model can be described for single partition data by the following

t.newComp(free=0, spec='equal')
t.newRMatrix(free=0, spec='ones')

Here the spec for the composition is equal, meaning that the character state frequencies are all equal; since they are equal you do not need to specify the numbers. In specifying the rate matrix, the spec='ones' arg means that the rates of change from one base to another are all the same. Here the model parameters are not free, and so they would not change in a likelihood optimization or in a Bayesian MCMC analysis.

If all the parameters on the model above were set to be free (ie by setting free=1) then it would become a GTR model. If the comp was set free but the rMatrix remained fixed then it would be an F81 model.

K2P and HKY models are specified by setting the rate matrix spec to '2p'. Empirical protein models are specified by setting the rate matrix spec to, for example, 'd78', 'jtt', 'wag', or 'mtrev24'. The list of available rMatrix specs is given by var.rMatrixSpecs
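
For example, a sketch based on the specs named above (one rMatrix per data partition):

# K2P/HKY-type rate matrix, for DNA
t.newRMatrix(free=1, spec='2p')

# or, for protein data, an empirical rate matrix, eg WAG
# t.newRMatrix(free=0, spec='wag')

print(var.rMatrixSpecs)    # the available rMatrix specs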

Multi-partition models for multi-partition data

The model can differ over the data, and if you have more than one data partition you need to specify the partNum when you specify the components of a model

pNum = 0    # First partition, F81
t.newComp(partNum=pNum, free=1, spec='empirical')
t.newRMatrix(partNum=pNum, free=0, spec='ones')
t.setNGammaCat(partNum=pNum, nGammaCat=1)
t.setPInvar(partNum=pNum, free=0, val=0.0)
t.setRelRate(partNum=pNum, val=0.9)

pNum = 1   # second partition, F81+G
t.newComp(partNum=pNum, free=1, spec='empirical')
t.newRMatrix(partNum=pNum, free=0, spec='ones')
t.setNGammaCat(partNum=pNum, nGammaCat=4)
t.newGdasrv(partNum=pNum, free=1, val=0.5)
t.setPInvar(partNum=pNum, free=0, val=0.0)
t.setRelRate(partNum=pNum, val=1.1)

If your model is heterogeneous over the data, you can optionally specify the relative rates of the different partitions, for example

t.setRelRate(partNum=0, val=0.5)
t.setRelRate(partNum=1, val=1.5)

If you want to make these free parameters, you can say

t.model.relRatesAreFree = 1

Tree-heterogeneous models

The composition or rate matrix of the model can differ over the tree. (The gdasrv can also differ over the tree, but this seems to be less useful.)

To specify it exactly (which is sometimes useful, but it would not generally be useful at the start of an MCMC), first you specify the model attribute, and then you put the model attribute on the tree, for example, if we start with this tree

         +--------2:one
+--------1
|        +--------3:two
0
|--------4:three
|
+--------5:four

and put 2 compositions on it, like this

A = t.newComp(spec='empirical', symbol='A')
B = t.newComp(spec='empirical', symbol='B')
t.setModelComponentOnNode(A, node=t.root, clade=1)
t.setModelComponentOnNode(B, node=1, clade=1)
t.draw(model=1)

then we end up with a tree like this

         +BBBBBBBB2:one
+BBBBBBBB1
|        +BBBBBBBB3:two
0
|AAAAAAAA4:three
|
+AAAAAAAA5:four
Part 0 comps
    0   A
    1   B
    root (node 0) has comp 0, symbol A

Here I have specified 2 compositions, A and B. We place A on the root node, but because we specify clade=1 that composition is applied over the entire tree. Then we place composition B on node 1, also clade-wise, and in that part of the tree B displaces (ie over-rides) A.

As an alternative to placing model components on the tree explicitly as above, you can use the Tree.setModelComponentsOnNodesRandomly() method.

The unconstrained (multinomial) likelihood

The multinomial or unconstrained likelihood is a property of the data only, and does not need a tree or model. It can only be calculated if there are no gaps or ambiguities in the data. There are 2 ways to calculate it: ignoring the data partitions, or partition-by-partition. The former uses the Data method Data.calcUnconstrainedLogLikelihood1(); the result is placed in the data attribute unconstrainedLogLikelihood. The Data method Data.calcUnconstrainedLogLikelihood2() calculates the unconstrained log like of each data partition separately. Note that the unconstrained log like of the combined data is not the sum of the unconstrained log likes of the separate partitions.
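
For example, a minimal sketch using the two methods named above:

read('d.nex')
d = Data()
d.calcUnconstrainedLogLikelihood1()     # the combined data, ignoring partitions
print(d.unconstrainedLogLikelihood)
d.calcUnconstrainedLogLikelihood2()     # partition-by-partition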

Switching Data and the Model

You will often want to analyse the same dataset and tree with different models, or perhaps analyse the same tree and model with different datasets.

It is straightforward to switch datasets.

a = func.readAndPop("data_a.nex")
dA = Data([a])
b = func.readAndPop("data_b.nex")
dB = Data([b])

t.data = dA
# set the model ...

t.data = dB
# continue with the same model...

When you switch model, first you set the model to None

a = func.readAndPop("data_a.nex")
dA = Data([a])

t.data = dA
# set the model ...

t.model = None
# set the new model

Bayesian analysis with MCMC

P4 has a basic MCMC for doing Bayesian analyses. It implements the models in p4, including some models that are Bayesian-only.

The Mcmc is started with a p4.Tree object, with model defined and data attached. Tree rearrangement with eTBR and LOCAL is implemented. If the tree root is bifurcating, the Mcmc will preserve a bifurcating root. Topology constraints, including rooting, can be enforced.

It can use the Metropolis-coupled MCMC, or MCMCMC, aka parallel tempering, to help mixing. To start an MCMC, you first set up your data, tree, and model as usual. Then you instantiate an Mcmc object, and run the chain, like this

# t is the tree, with model and data attached
m = Mcmc(t, nChains=4, runNum=0, sampleInterval=1000, checkPointInterval=250000, simulate=2)
m.run(1000000) # one million generations

Here nChains says to make 4 MCMC chains in parallel in an MCMCMC, where 1 chain is the cold chain that is sampled, and 3 are heated chains.

As the MCMC runs, it saves trees to a tree file, saves likelihoods to another file, and saves model parameters to another file. These samples need to be digested or summarized later.

It has a full implementation of Paul Lewis et al’s polytomy proposal (P. O. Lewis, M. T. Holder and K. E. Holsinger, 2005. Polytomies and Bayesian phylogenetic inference, Syst Biol 54(2): 241–253). P4 writes checkpoint files during the MCMC run, so that if you want to continue a run after it has ended, or if you want to recover from a crash or power outage, it is possible to restart from a checkpoint. You can specify topological constraints in the trees.

A simple MCMC

# You will usually start with a data file, and a random tree.
read("d.nex")
d = Data()
t = func.randomTree(taxNames=d.taxNames)
t.data = d

# Now define a model.  This one is GTR+IG.
t.newComp(free=1, spec='empirical')
t.newRMatrix(free=1, spec='ones')
t.setNGammaCat(nGammaCat=4)
t.newGdasrv(free=1, val=0.5)
t.setPInvar(free=1, val=0.2)

# Now instantiate an Mcmc object.  This one has 4 MCMCMC chains, is set
# to do a million generations (below, in run()), collect 2000 samples
# (every 500 generations) and is set to take 4 checkPoints.

m = Mcmc(t, nChains=4, runNum=0, sampleInterval=500, checkPointInterval=250000)

# Here you can adjust the proposal probabilities
m.prob.polytomy = 1.0       # Default is zero
m.prob.brLen = 0.001        # Default is zero

# This starts the run, specifying the number of gens you want.  You can
# restart it from a checkPoint if you later find you need more.
m.run(1000000)

Chain swapping and temperature tuning in the MC3

In the MCMCMC, we have a few MCMC chains that run in parallel. All but one of the chains are heated or tempered, which makes proposal acceptance easier and allows better mixing. The unheated chain is the cold chain, and is the only chain that is sampled. The amount of heating, to continue the metaphor, is given by the temperature. Chains can swap states, and so for example the cold chain can swap states with the lowest-temperature heated chain, and after the swap the previously heated chain would be sampled. This allows the heated chains, with their mixing advantage, to inform the sampled chain.

The way p4 implements this is that by default only adjacent chains are proposed to swap. That is, the cold chain may swap with the lowest-temperature heated chain, but not with any of the other chains.

MrBayes implements swapping in a similar way, but chain-swapping proposals are made between any pairs of chains, not only between adjacent chains. I have noticed that chain-swapping acceptance is often low for non-adjacent chains, and so by default p4 uses a swap vector that only allows swapping between adjacent chains rather than a swap matrix that allows swapping between any two chains. This is under the control of

Mcmc.swapVector          # eg m.swapVector=True;  True by default

The amount of swapping between chains is related to the temperature difference, with a low temperature difference tending to have a high acceptance and a high temperature difference tending to have low swap proposal acceptance. For this reason the chain temperatures are actively tuned during the MCMC run to try to attain an acceptance of 0.1 – 0.25 (for the swap vector). Whether tuning is done at all is controlled by m.swapTunerDoTuning and it is by default True.

The Mcmc object stores information about the swaps, and for swap vectors this is accessible by the Mcmc method writeSwapVector(). This is conveniently done on Mcmc checkpoints, and so if you save Mcmc checkpoints you can do

cpr = McmcCheckPointReader(theGlob='*')
cpr.writeSwapVectors()

Below is the output of one such checkpoint

Mcmc.writeSwapVector(). Chain swapping, for 250000 gens, from gens 250000 to 499999, inclusive.
    swapVector is on, so swaps occur only between adjacent chains.
    last chainTemps       0.000   0.554   3.399   8.739
    last chainTempDiffs   0.554   2.845   5.340

chains          0--1      1--2      2--3
              ------    ------    ------
proposed       83882     82894     83224
accepted       15006     15459     15815
accepted    0.178894  0.186491  0.190029

Proposal probabilities

Proposals to change the model parameters or tree topology are made with default probabilities. These default probabilities are calculated based on the number of parameters involved. You can adjust these proposal probabilities, or perhaps turn a proposal type off altogether, by setting attributes of Mcmc.prob. These are user-specifiable relative probabilities, with default being 1; they are not proper probabilities, and do not sum to 1.

For an Mcmc object m, I can show these current relative probs by saying

print(m.prob)

You can see the real intended rates (the kind that add up to 1.0) that are calculated from these relative probs after a run, or on checkpoints

m.writeProposalProbs()

For my Mcmc object m, I could for example turn off the tree topology proposals by setting local and eTBR to zero.

m = Mcmc(t, ...)
m.prob.local = 0.0
m.prob.eTBR = 0.0
m.run(...)

These relative m.prob values affect proposal probabilities, but not in a linear way. For changes with inherently low proposal probabilities, such as relRate, doubling the relative prob approximately doubles the final proposal probability. As I mentioned, you can check the final calculated proposal probabilities with m.writeProposalProbs().

The polytomy proposal

This is an implementation of the move described in Lewis et al (2005) “Polytomies and Bayesian phylogenetic inference” Syst Biol 54:241-253. This move is turned off by default, but you can turn it on for an Mcmc object m by saying

m.prob.polytomy = 1.0
m.prob.brLen = 0.001

(If you have local turned on, which you probably do, you will also want to turn on the brLen proposal in addition, for those cases when the current tree is a star tree; the local proposal needs 3 contiguous edges, and so the local proposal does not work on a star tree. In that case, the chain uses the brLen proposal instead of local.)

This move tends to decrease spurious high support for short internal nodes. The prior probability favouring increased multifurcations or increased resolution can be adjusted. Actually, as in the paper, there are two ways to do this; you can use the polytomy prior, or you can use the resolution class prior. If you want to use the resolution class prior, you can turn it on, for an Mcmc object m, by

m.tunings.doPolytomyResolutionClassPrior = True  # False by default

but if it is left as False then you use the polytomy prior. In both cases, there is a ’C’ constant to set, that determines how much multifurcations are favoured. You can set it via its log, for example for an Mcmc object m

m.tunings.polytomyPriorLogBigC = 1.0   # Default is zero

By setting the logBigC to 1.0, above, I am setting the C to e as in the paper. Assuming the default of using polytomy priors, leaving logBigC set to zero (ie C=1) means that every tree (of any resolution) has the same prior probability. Setting it to higher numbers favours polytomies.

As in the paper, you can use the resolution class priors, and set the C constant to 1.1 (actually 10/9 = 1.111111), 2, or 10, for an Mcmc object m, like this

import math
m.tunings.doPolytomyResolutionClassPrior = True
m.tunings.polytomyPriorLogBigC = math.log(10./9.)   # Or 2, or 10, or ...

Topology constraints

You can specify constraints on the tree topology with a Constraints object. To make a Constraints object, you need a list of taxNames (in the same order as your data, trees, and so on), and you also need a (usually partly-resolved) tree object

myTaxNames = myData.taxNames
cTree = func.readAndPop('myConstraintsTreeFile.phy')
constr = Constraints(myTaxNames, constraintTree=cTree, rooting=True)

You can pass a Constraints object to func.randomTree() and Mcmc() to enforce constraints. If you are starting an Mcmc with a random tree, then the random tree should have the same constraints as you pass to the Mcmc

t = func.randomTree(taxNames=myTaxNames, constraints=constr, biRoot=True)
<...>
m = Mcmc(t, ..., constraints=constr)

Checkpoints

When running an MCMC, you can write checkpoint files at intervals. These files are the state of the Mcmc at that time. Mcmc runs can be restarted using checkpoints. Also, you can do diagnostics on checkpoints like you can on live Mcmc objects, but with checkpoints you can do the diagnostics after the run has finished (and the Mcmc object no longer exists) or during the run (querying finished checkpoints), and you can do the diagnostics on several checkpoints to see how the diagnostics change over time.

To tell your Mcmc to make checkpoints, say (among other args when you start up an Mcmc) for example

m = Mcmc(t, ..., checkPointInterval=250000, ...)

To tell it not to do checkpoints, set it to zero or None

m = Mcmc(t, ..., checkPointInterval=None, ...)

The checkpoint interval should divide the number of generations requested by run() evenly, so for example

m = Mcmc(t, ..., checkPointInterval=250000, ...)
m.run(1000000)

will work, but

m = Mcmc(t, ..., checkPointInterval=250000, ...)
m.run(900000)

will not work.

I generally aim to collect perhaps 4 or 5 checkpoints in my runs. If you collect more checkpoints (by collecting them more often) then they will each contain fewer samples, and so the estimates might be a bit more noisy than if you take checkpoints at bigger intervals representing more samples.

There is a class, McmcCheckPointReader(), that is good for reading and digesting checkpoints. When you start it up like this

cpr = McmcCheckPointReader()

then it will read in all the checkpoint files in the current directory. There are other ways to start it up - read the class doc string. Having got a McmcCheckPointReader object (eg cpr here), then you can ask it to

cpr.writeProposalAcceptances()

or

cpr.writeSwapVectors()

which calls those methods on all the checkpoints. When you read in checkpoints, they are full Mcmc objects (except that they do not have data attached), and so you can ask questions of them as you would ask of an Mcmc object, as

m = cpr.mm[0]    # get the first Mcmc object
m.writeProposalAcceptances()

Using the McmcCheckPointReader is useful for seeing how things, for example acceptance rates, change over the run. Perhaps the most useful thing that a McmcCheckPointReader can do is compare the split supports between runs, using the average standard deviation of split support (or split frequency). When you do an Mcmc analysis it is good practice to do more than one run, and comparing split supports between runs gives a measure of topological agreement between them.

You can compare split supports between two different checkpoints, or between all pairs, as

cpr.compareSplits(0,1)    # Compare checkpoint 0 with 1
cpr.compareSplitsAll()    # Compare all checkpoints to each other

Restarting from checkpoints

You can restart an Mcmc from a checkpoint. If the previous run finished normally (ie was not a crash) then it is easy. A checkpoint is a full Mcmc object without the data - so you need to give it the data to get going. You can use the function

func.unPickleMcmc(runNum, theData, verbose=True)

to get the Mcmc object from the checkpoint. That function will get the last checkpoint from the specified run (runNum) and return an Mcmc object. So for example you might restart runNum 0 by

read("../d.nex")
d = Data()
m = func.unPickleMcmc(0, d)
m.run(1000000)

which tells it to continue on in the same way as it was, and do another million generations. It would be possible to make changes to the Mcmc before continuing the run, as

m = func.unPickleMcmc(0, d)
< make changes here ...>
m.run(1000000)

If the Mcmc crashed, you can restart from a checkpoint as above, but first you will want to repair the output files to get rid of output lines that were written after the last checkpoint but before the crash.

Output

Trees sampled in the MCMC are placed in a file, and the log likelihoods of those sample trees are found in another file. Model parameters are put in another file.

You can make a consensus tree like this

tp = TreePartitions('trees.nex', skip=1000)  # skip burnin
t = tp.consensus()

You will often want to transfer the node support values to internal node names – see the doc string for TreePartitions.

You can do a quick-and-dirty convergence test by plotting the log likelihood values found in the output file. If it reaches a plateau, then it is assumed to have converged. However, this method is unreliable. Certainly if the log likes have not reached a plateau then it has not converged, but the reverse cannot be assumed. Comparing split supports as described above (see Checkpoints) offers better convergence diagnostics.

You can also look at the sampled model parameters, found in the params files (mcmc_prams_N, N = 0, 1, …), using

func.summarizeMcmcPrams()

You can set it to skip a burnin.
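
For example (the skip argument name is an assumption, so check the doc string):

func.summarizeMcmcPrams(skip=500)    # assumed arg; ignore the first 500 samples as burnin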

Posterior predictive simulations

To help assess model fit in an MCMC you can set up the MCMC to simulate data sets based on the current tree and model at every writeInterval. When a simulation is made, a test quantity is extracted from it and written to a file.

You can turn on simulations during an MCMC when you instantiate the Mcmc object. You set the simulate arg to a number from 1-31. For example, if you set it to 1, you get the unconstrained or multinomial likelihood, if 2 you get X2, and if 3 you get both. The available test quantities are

1   multinomial log likelihood
2   composition X^2
4   mean nChars per site
8   c_m, compStatFromCharFreqs (2 numbers per part; sim and curTree)
16  constant sites count (not proportion)
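
The numbers add, bitmask-style. For example, to collect both the multinomial log likelihood and X^2 (1 + 2 = 3):

m = Mcmc(t, nChains=4, runNum=0, sampleInterval=1000, checkPointInterval=250000, simulate=3)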

If you want to do the simulations after the MCMC is finished (or perhaps from a MrBayes run), see the PosteriorSamples class.

The idea is that you have a single test quantity from your data (or data partition) and you want to compare it to the range that is generated from the posterior distribution, to see whether the model that you are using might have plausibly generated your original data. A usual way to do that comparison is to use func.tailAreaProbability() (or the same within the Numbers class Numbers.tailAreaProbability()). Here is an example from the source code

# Get the test quantity, X^2, from the original data.
read("../../K_thermus/noTRuberNoGapsNoAmbiguities.nex")
d = Data()
ret = d.compoChiSquaredTest()
#print(ret)
originalStat = ret[0][0]

# Get the sim stats
n = Numbers('mcmc_sims_0', col=1, skip=500)

# Evaluate the tail area probability
n.tailAreaProbability(originalStat)

From which the output is something like

Part 0: Chi-square = 47.836914, (dof=12) P = 0.000003
# The stat is 47.8369140221
# The distribution has 500 items
# The distribution goes from 0.598552 to 10.158933
# Items in the distribution were >= theStat 0 times.
# The tail-area probability is 0.000000

In this example, the model does not fit, and could not have plausibly generated the data from which the original test quantity was 47.8.

Simulating data

The ability to simulate data is useful in many contexts. You can only simulate into an existing Data object, and so if you want to simulate a new data set, you need to make a blank Data object to simulate into. You can make one or more new empty alignments and make a Data object like this

a = func.newEmptyAlignment(dataType='dna', taxNames=myTaxNames, length=200)
d = Data([a])

Then you can attach that Data object to a tree, define a model, and simulate with the Tree simulate() method.
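
Putting that together, a minimal sketch (the taxon names and model specs are placeholders, following earlier examples):

myTaxNames = ['A', 'B', 'C', 'D']
a = func.newEmptyAlignment(dataType='dna', taxNames=myTaxNames, length=200)
d = Data([a])
t = func.randomTree(taxNames=myTaxNames)
t.data = d

# a simple model, as in earlier examples
t.newComp(free=0, spec='equal')
t.newRMatrix(free=0, spec='ones')
t.setNGammaCat(nGammaCat=1)
t.setPInvar(free=0, val=0.0)

t.simulate()                               # fill the alignment with simulated data
d.alignments[0].writeNexus('simData.nex')  # write the simulated alignment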

Generation of random numbers uses the GSL random number generator. The state is held in var.gsl_rng, which is None by default. If you do a simulation using this method, it will use var.gsl_rng if it exists, or make it if it does not exist yet. When it makes it, it seeds the state based on the current time. That should give you lots of variation in the simulations that follow.

If on the other hand you want to make simulations that are the same you can reseed the randomizer with the same seed whenever you do it, like this

if not var.gsl_rng:
    var.gsl_rng = pf.gsl_rng_get()
# unusually, set the seed with each simulation
mySeed = 23    # your chosen int seed
pf.gsl_rng_set(var.gsl_rng, mySeed)

Customizing

P4 is meant to be customized.

  • You can change variables found in the Var class
  • You can semi-permanently add your own code into files that p4 reads on startup. By doing that you can extend or over-ride the behaviour of p4.

Setting var variables

You can change the default behaviour by setting or un-setting some variables in var (which is an instance of the Var class). For example, when p4 reads files it is by default not verbose, not describing what it is doing as it goes. This is good everyday behaviour, but not very good for debugging when something goes wrong. You can make it verbose by setting

var.verboseRead=True

Generally you will get your phylogenetic stuff into p4 from files, using the read() function. The read() function also reads strings, which is occasionally handy. If you hand the read() function a string, it will first ask if there is a file by that name, and if so it will attempt to read it. If there is no file, it will try it as a string. The problem is that if you mis-spell a file name, you are liable to get a bunch of confusing error messages as it attempts to read the file name as some sort of phylogenetic data. So the default behaviour when it can’t find a file is to give a little warning, and you can turn that warning off by setting

var.warnReadNoFile=False

There are several other things in the Var class that you might want to know about.

You can make these settings to var, either

  • temporarily
    • in interactive p4
    • at the top of individual scripts
  • or in startup scripts
    • possibly in your ~/.p4 directory

Add your own code

You can of course add your own code in your p4 script to extend p4 - perhaps to add your own function, or to extend a p4 Class with an additional method or two. You might put that part of the code in a separate file in your working directory and ask your script to load that module. But if you find your new code generally useful you might want to semi-permanently add it to your p4, so that it is read in for any script in any directory, without you needing to explicitly ask it to do so. To do that, you can put your code in a directory known to p4, and p4 will read it on startup. One such directory is ~/.p4. So if you put python files in there, p4 will find them and read them on startup. Other directories can be specified by the environment variable P4_STARTUP_DIRS. That variable is a colon-separated list of directory names.

So for example when p4 starts up, by default it prints out a splash screen. Whether that is done or not is controlled by var.doSplash. When you get tired of seeing the splash screen, you can turn it off by setting var.doSplash=False. But wait! By the time you have the opportunity to set that variable, it's already too late: the splash screen has already displayed itself. The solution is to put var.doSplash=False in a file ending in .py in a .p4 directory in your home directory, or in another of the P4_STARTUP_DIRS. Any file ending in .py in one of those directories is read on startup. It's a good place to put var.warnReadNoFile=False, for example.

Since it is Python, you can add methods to existing classes. For example, if you have this in your script

def greet(self):
    print("Howdy!")
Alignment.greet = greet
del(greet)

and you had an Alignment instance a, you could then do

     p4> a.greet()
     Howdy!
     p4>

You can over-ride existing methods that way also. If you don’t like the way p4 works, change it to suit yourself.

A good place to put these new methods and functions is in files in the ~/.p4 directory, or in one of your specified P4_STARTUP_DIRS. You can easily turn off files, but allow them to remain there for later, by changing the file name so that it no longer ends in .py, or just put them in a sub-directory.

One special file that is read in after the others, but only if you are interactive, is the ~/.p4/interactive file, and note that while it is a Python file, it does not end in .py. I like to instantiate a few blank objects in there, for quick access to the doc strings via completion (see below). Here is my current interactive file, for what it’s worth

#print('Here is the interactive startup file.')

# Make some blank objects.
if 1:
    a0 = Alignment()
    d0 = Data([])
    t0 = Tree()
    n0 = Node()
    #tt0 = Trees()

# Give single Alignments, Trees, or SequenceLists names.
if pyFileCount == 0:
    if len(var.alignments) == 1:
        print("There is one alignment in var.alignments-- I am naming it 'a'")
        a = var.alignments[0]
    if len(var.trees) == 1:
        print("There is one tree in var.trees-- I am naming it 't'")
        t = var.trees[0]
    if len(var.sequenceLists) == 1:
        print("There is one sequenceList in var.sequenceLists-- I am naming it 'sl'")
        sl = var.sequenceLists[0]

Compositional homogeneity tests

The usual, poor, way to do a compositional homogeneity test is to do a Chi-square test on the data. Paup does this on the data overall, and Tree-Puzzle does it on individual sequences. In the latter, since there are many simultaneous comparisons, the power of the conclusions might be compromised. P4 does the test both ways, using the method Data.compoChiSquaredTest().

This test uses the X2 statistic. It is well-known that this sort of test suffers from a high probability of type II error, because the chi-square curve is not an appropriate null distribution by which to assess significance of the X2 statistic.

A better null distribution can be obtained by simulating data on the tree and model in question, and using X2 statistics extracted from those simulations to make a null distribution appropriate to the problem at hand. This is done using the Tree method compoTestUsingSimulations().
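
A minimal sketch of both versions of the test (the file names are placeholders; the model is set up as usual):

read('myData.nex')
read('myTree.nex')
d = Data()
t = var.trees[0]
t.data = d
# ... specify the model, as usual ...

d.compoChiSquaredTest()          # the quick chi-square version
t.compoTestUsingSimulations()    # null distribution from simulations on the tree and model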

Model fit tests

For Bayesian model fit assessment you can do posterior predictive simulations during the MCMC.

P4 implements two ML-based model fit tests– the tree- and model-based composition fit test, and Goldman’s version of the Cox test. Both rely on simulations that generally need to be followed by optimizations, so they are expensive tests. In both tests the simulation/optimization part is the time-consuming part, and since both tests can use the same simulations, in p4 they are done together. First, the simulations are done using the Tree method simsForModelFitTests(). After that, the simulation results are digested with the Tree method modelFitTests().
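
A minimal sketch; the nSims argument name is an assumption, so check the doc strings:

t.simsForModelFitTests(nSims=100)   # the slow simulation/optimization part (assumed arg name)
t.modelFitTests()                   # digest the results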
