P4 — essentials
2024-10-07.
Help
Help is available in various places:
- Tutorial
  - This tutorial (the one you are reading now) is an introductory overview, not a comprehensive reference; it skips a lot of detail.
- Examples
  - The examples are part of the source code, and should have been installed along with the rest of p4, in a share directory in the source tree. The splash screen, the info you see when you first start up p4, may tell you where they are. They should be current and work as advertised, even if the API docs are shamefully not up-to-date. Copy them to a working directory so that you can modify them without changing the originals. If any do not work, please let me know.
- Recipes
  - When in interactive p4, you can call func.recipes() to get examples to use as starting points for your own scripts.
- API
  - The class and module (API) descriptions. These are generated automatically from the Python source, and should be fairly comprehensive and definitive.
  - The descriptions in the API are also available via the Python help() and dir() functions. For example, try help(Alignment) from within interactive p4 to read all the docstrings in the Alignment class. The dir() function lists the attributes.
Import p4
P4 is a Python package. You can import p4 into your Python as usual

import p4

or (and I prefer this, but you might not) you can use the p4 script. The p4 script can read data files at the command line. The p4 script loads the package as

from p4 import *

so all the p4 names go to the top level in Python, which you may not like.
This week, p4 will recognize and read data files in the form of Nexus files, fasta files, Phylip (sequential or interleaved), and clustalw aln files. P4 will recognize trees in Nexus and Phylip or raw Newick format. And of course p4 recognizes Python files. P4 understands Nexus data, taxa, characters, trees, and sets blocks, but it does not understand all the commands within those blocks.
Q: So why is it called p4?
A: It's an acronym: pppp
Q: So what's it short for?
A: So it's easy to type.
Quickstart
Here we provide a taster for some of the toolkit aspects of p4. These examples are very easy, and (except for the likelihood calculation) can be done using p4 interactively. You will probably find that for anything bigger writing a script would be easier.
These quickstart examples can be found in the Examples, in the A_quickStart directory. I suggest that you make a copy of the Examples, then change directory to A_quickStart in your copy, and then cd to directories therein.
Draw a tree on the screen
Let's say you have a file that has a tree description in Newick or Nexus format, and you want to draw that tree to the screen to see what it is. Say, to your shell
p4 -d t.nex
(or whatever your file name is)
Convert between file formats
Let's say that you have an alignment in Phylip format and you want to convert it to Nexus format. This is a small job, so we can use p4 interactively. First, read in the data, then give the alignment a name, and then tell the alignment to write itself in Nexus format to a file
$ p4 data.phy                  # Read the data at the command line
p4> a = var.alignments[0]      # Give the alignment a name
p4> a.writeNexus('data.nex')   # Tell it to write itself to a file
p4>                            # Control-d to quit.
Extract a charset from an alignment
Let's say you have a multi-gene alignment, and you want to extract one of the genes and put it in an alignment file by itself. Here, the placement of the various genes in the alignment is defined in a Nexus sets block, in this example in the file sets.nex. Do something like this
$ p4 d.phy sets.nex                      # Read everything
p4> a = var.alignments[0]                # Name the alignment
p4> b = a.subsetUsingCharSet('gene1')    # b is a new alignment
p4> b.writeNexus('gene1.nex')            # Tell b to write itself
p4>                                      # Quit with control-d
Tabulate the character frequencies in the data
$ p4 d.nex             # Read the data
p4> d = Data()         # Make a Data object
p4> d.compoSummary()   # Ask for a table of composition
Compare the topology of two trees
Let's say that you have two best trees from two different phylogenetic analyses using the same data. The trees might have the same topology, but they might not look much like each other, because they are rooted on different nodes, and many branches are rotated on their stems. However, ignoring branch lengths, the trees might still have the same topology, regardless of the various permutations. You can quickly find out how different they are by doing something like the following
$ p4 tree1.nex tree2.nex             # Read in both trees
p4> t1 = var.trees[0]                # Name the first one
p4> t2 = var.trees[1]                # Name the second one
p4> print(t1.topologyDistance(t2))   # Symmetric difference
0                                    # Zero means they are the same
The default metric for the
Tree.topologyDistance()
method is the symmetric difference, aka the
unweighted Robinson-Foulds distance, which is the number of splits in
one tree that are not in the other, plus the number of splits in the
other tree that are not in the one. In this example, the trees are
the same, and so the difference is zero. If the two trees had only
one difference, the symmetric difference would be 2.
A very simple likelihood calculation
This example is a bit more involved, and is not well suited to
interactive use. The usual way to use p4
would be to make a script, and
that is what we do here. Make a file with the following, and save it as
s.py
read(""" 2 2 one ac two gt """) read('(one,two);') t = var.trees[0] t.data = Data() t.newComp() t.newRMatrix() t.setPInvar() t.calcLogLike()
Usually p4
scripts refer to other files for the data and the tree, but
here it is all in the one script. Sequence data, a tree, and a model
are described and then the likelihood is calculated without
optimization. To make the script happen, say, to your command line
p4 s.py
Using p4
Using the p4 script
As usual for a Python package, you can import p4 from within Python, by import p4. However, my favourite way of using it is with the p4 script (which should have been installed, hopefully somewhere in your path). The p4 script does a from p4 import *, so your top-level namespace becomes instantly cluttered, but then you do not need to precede everything with p4. The p4 script allows you to read in phylogenetic data and tree files at the command line. To see if it is installed and working, say
p4 --help
Some peculiarities, bugs, and “features”
- P4 doesn't do searches with ML. It only evaluates explicit trees. It will do tree searches with Bayesian analysis.
- Arrays and such in Nexus files are 1-based, while in Python and p4 they are zero-based.
- Nexus files are case-insensitive in p4, as they should be; but elsewhere in p4, as in Python, things are case-sensitive.
- Nexus trees do not require a Taxa or Data block. Counter to the Nexus standard, having different trees with different taxa in the same trees block is acceptable in p4. However, if you do supply a taxa block, p4 will use it to enforce those taxon names in the trees.
- Often the results are not written automatically: you have to tell p4 to write the result. However, you can tell it how you would like the output formatted. Since the interface is a programming language, you can do whatever you want with the numbers.
- You give starting values for optimization. If the params are already optimized, then re-optimization shouldn't take long.
- The default branch length in trees is 0.1.
- When reading Nexus files, p4 recognizes datatype DNA, but not RNA. If you give it datatype nucleotide, p4 will assume you mean DNA.
- The Nexus spec does not allow names that are all numerals, and by default p4 follows that rule. You can (eg temporarily) override that behaviour and allow all-digit names by setting the variable var.nexus_allowAllDigitNames to True

  var.nexus_allowAllDigitNames = True

- The symbols -, ?, and . are the only ones allowed for gaps, unknown characters, and matchchars, respectively.
The var object, the global bucket
When you import p4, or start the p4 script, you import var, which is an object that holds things like lists and flags that are generally useful to you and p4.
With the p4 script you can read things from the command line, via
p4 aFile anotherFile
If those files contain phylogenetic data, they get put in
- var.sequenceLists (a list)
- var.alignments (a list)
- var.trees (a list)
- var.nexusSets (not a list; if it exists, it is a NexusSets object)
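For example, after reading a couple of files you can check what landed where (the file names here are hypothetical)

read('myData.nex')           # an alignment file
read('myTrees.nex')          # a tree file
print(len(var.alignments))   # how many alignments were made
print(len(var.trees))        # how many trees were made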
The read() function
The usual way to get phylogenetic data and trees into p4 is with the read() function. This function is used in scripts that you might write; it is also used within the p4 script to read in files from the command line.
Nexus files or files containing trees often contain multiple things which would be turned into multiple Python objects, so a command like
myAlignment = readNexus('myData.nex') # Doesn't work
would not always work. The read() function does not return anything; rather, when you read in alignments from files, Alignment objects are made and stuffed into var.alignments, a list, as explained above in the section The var object, the global bucket. So a typical script might start out by reading in data and trees, like this
read('myAlignment.nex')
a = var.alignments[0]
read('myTreeFile.nex')
t = var.trees[0]
There might be several trees in myTreeFile.nex, and when the file is read in they are all converted to Tree objects and put in the var.trees list. So if you want one of those trees, you have to say which one. To get the first one, say something like
firstTree = var.trees[0]
When you read in files from the command line using the p4 script, you can use file name globbing, ie use wildcards, as
p4 *.phy
In that case the shell does the file name expansion. You can also do the same sort of thing with read(), as in
read('*.phy')
If you want to make p4 forget the trees that it has read, say
var.trees = []
You can do the same for alignments and sequenceLists. To make p4 forget any Nexus sets info that it has read, you can do
var.nexusSets = None
If you are sure that the file that you are trying to read has only one thing (Tree, Alignment, SequenceList), then you can use func.readAndPop(). This is handy if, for example, you are reading in a bunch of sequence list files and one of them happens to have all its sequences the same length – so it gets promoted to an Alignment object and gets put in var.alignments. If you use func.readAndPop() then you would not need to check.
# awkward using read()
for fName in myFileList:
    var.alignments = []
    var.sequenceLists = []
    read(fName)
    try:
        sl = var.sequenceLists[0]
    except IndexError:
        sl = var.alignments[0]

# easier using func.readAndPop()
for fName in myFileList:
    sl = func.readAndPop(fName)   # SequenceList or Alignment
The dump() function and methods
The function func.dump() gives a quick summary of files that you have read, and objects that have been made and placed in var.trees, var.alignments, and so on. It does not know about alignments and such that are not in var.*.
Several classes have dump() methods as well. For example, to see inside trees in fine detail, you can use the Tree.dump() method, for example
t.dump()
or
t.dump(all=True)
To see details about models, use Model.dump(), for example
, for example
t.model.dump()
Alignments and data
You can read in an alignment from a file either with the p4 script with the file name as a command-line argument, or using the read() function within p4. Any alignments that are made are put in var.alignments, a list. (If it's a fasta file and the sequences are not all the same length, it is not an alignment, it is a SequenceList, and it is put in var.sequenceLists, another list.) So a lot of p4 interactive sessions might start out like this
$ p4 myData.nex         # in your shell
a = var.alignments[0]   # in interactive python+p4
and a lot of p4 scripts might start out like this
read('myData.nex')
a = var.alignments[0]
Saying read() here reads the file, makes an Alignment object from it, and puts that object in var.alignments. Then you give it the name a. Then you can do useful things with a.
The function func.readAndPop() is similar, but it only works for files that have one tree, alignment, or sequence list, and the object is popped from the var list it would otherwise go to. So for example, if myData.nex has one alignment, then when I do
a = func.readAndPop('myData.nex') # in interactive python+p4
then the Alignment object will not be in var.alignments.
Alignment, Part, and Data classes
- Alignment
  - A SequenceList subclass, with a lot of functionality in Python.
- Part
  - A data partition, with the sequence data in the C language, for likelihood calculations. Each alignment has at least one part, maybe more.
- Data
  - All the partitions that you want to use. It is at least one Alignment (but perhaps more), with at least one Part (at least one Part for each Alignment, but perhaps more). A Data object is needed for likelihood calculations. It has a C-language representation of the sequences as a list of partitions.
An alignment of phylogenetic data in a file can be read in to make an
Alignment object, which is good for simple tasks that you might want
to do with alignments. However, if you want to calculate the
likelihood of the data, then things get more complicated because then
we transfer the alignment into the pf
module, written in the
C-language. A Part object encapsulates an alignment (or an alignment
partition) in C. A Data object contains all the Part objects that you
want to use, together with the alignments that the parts came from.
When you do likelihood calculations, your data might be simple: they might be in just one alignment, with only one data partition. In that case your Data object has only one Alignment object, and one Part object. Or you might have 2 alignments, each with a single part, and so your Data object would have 2 alignments and 2 parts. Or perhaps you have a single alignment which is partitioned into 5 genes, in which case your Data object would have one alignment and 5 parts.
To reiterate, the way to get phylogenetic data into p4 is with the read() function.
When you read in alignments from files, Alignment
objects are made and stuffed into var.alignments
, a list. So a typical
script to do something simple (ie not likelihood) with an alignment
might start out by reading in a file, like this
read('myAlignment.nex')
a = var.alignments[0]
For likelihood calculations you need a Data object, so a typical script for that might start out like this
read('myAlignment.nex')
d = Data()
In either case when you read in the file, an Alignment object is made
and deposited in var.alignments
. It is there whether or not you give it
a name, and whether or not you use it to make a Data object.
If you need to partition your alignment, then you might do something like
read("myAlignment.nex") read("mySets.nex") # with the partition info a = var.alignments[0] a.setCharPartition('myCharPartitionName') d = Data() # now with more Parts than if the Alignment had not been partitioned
Part objects are made when you make a Data object. It is done behind the scenes so you may never need to be concerned with Part objects.
The sequence data that you want to use might be in more than one alignment, and
that is perfectly acceptable to p4
. In fact, it is essential if you
are using more than one datatype (eg the DNA sequence for one gene,
analysed simultaneously with the protein sequence for another gene). A
MrBayes-style data matrix with more than one datatype is not allowed by
p4 (nor is it Nexus compliant). So you might do something like the
following to read in two alignments
read('myFirstAlignment.nex')
read('mySecondAlignment.nex')
d = Data()
P4 analyses data partition-by-partition. Separate alignments necessarily make separate data partitions, but you can also split up single alignments into separate partitions; that would be done using Nexus sets blocks. To do this you might do something like
read('myAlignment.nex')
a = var.alignments[0]
read('myNexusSetsBlock.nex')   # Defines cp1
a.setCharPartition('cp1')      # Applies cp1 to the alignment
d = Data()   # now with more Parts than if the Alignment had not been partitioned
In order to get a Data object from one alignment that you want to partition into 2 parts together with one unpartitioned alignment, you might do something like the following
read('myFirstAlignment.nex')
a = var.alignments[0]
read('myNexusSetsBlock.nex')   # defines cp1
a.setCharPartition('cp1')
read('mySecondAlignment.nex')
d = Data()
The d Data instance now contains 3 Part objects (in d.parts) and 2 Alignment objects (in d.alignments).
If you are going to use a Data object together with a tree, eg for ML evaluations, Bayesian analysis, or simulation, you need to attach the Data object to the tree, as
t.data = d
Nexus charpartitions
This week, p4 understands Nexus charset and charpartition commands within Nexus sets blocks. When a Nexus sets block is read in, its contents are added to var.nexusSets, a NexusSets object which can hold charsets and charpartitions. When you want to use a charpartition on an alignment, you would say something like
a = var.alignments[0]
a.setCharPartition(theCharPartitionName)
That attaches a copy of var.nexusSets
to the alignment. To un-set a
charpartition from an alignment, you can say a.setCharPartition(None)
.
Setting a charPartition does not actually do anything to the sequences in an Alignment; rather, a charPartition only comes into play when you make a Data object. When you make a Data object from your partitioned alignments, you can check that you are indeed dealing with partitioned data by calling the dump() method on your Data object.
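Putting that together, a minimal sketch (the file and charpartition names here are hypothetical)

read('myAlignment.nex')
read('mySets.nex')          # a Nexus sets block defining the charpartition
a = var.alignments[0]
a.setCharPartition('cp1')   # hypothetical charpartition name
d = Data()
d.dump()                    # confirm the expected number of Parts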
Subsetting alignments
By subsetting I mean extracting a part of an alignment into its own new alignment. You can subset alignments based on charsets or mask strings.
If you do not have a Nexus sets block, you can still use the Nexus
character list format (eg 1 3-5 7 11
below, and note that it is
1-based numbering) to make a mask on the fly using the
func.maskFromNexusCharacterList()
function, for example
p4> print(func.maskFromNexusCharacterList("1 3-5 7 11", 12, invert=1))
010001011101
Nexus standard datatype
P4 supports Nexus standard datatype.
Something I have been looking into is recoding protein alignments
into amino acid groups, for example the 6 Dayhoff groups, for
analysis. That uses the Alignment.recodeDayhoff()
or the
Alignment.recodeProteinIntoGroups()
method. Here the amino acid
symbols are recoded to the numerals. For example in Dayhoff recoding,
since C (Cysteine) is in a group by itself, it is simply recoded as 1,
but the amino acids S, T, P, A, and G are all recoded as 2, as they
are all in the same Dayhoff group. It is rather like recoding DNA
into transversions to decrease the effects of saturation. It is based
on the notion that changes within groups might be saturated, biased,
and difficult to model, while changes between groups are less
problematic. Certainly there is loss of information, but (hopefully)
the information that is retained is of higher quality.
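As a sketch of how that might look, assuming the recoding methods take no required arguments and recode the alignment in place (check the Alignment API for the exact signatures)

read('myProteins.nex')   # hypothetical protein alignment file
a = var.alignments[0]
a.recodeDayhoff()        # amino acid symbols become Dayhoff group numerals
a.writeNexus('dayhoff.nex')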
(Non-parametric) bootstrapping of data
Bootstrapping an alignment is trivially easy, and lots of programs do it, p4 included. It is a method in the Alignment class.
However, bootstrapping of partitioned data is also supported by p4. It is a method of the Data class. The alignment and partition structure is maintained in the bootstrapping process.
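A minimal sketch, assuming the method in both classes is named bootstrap() and returns a new object (check the Alignment and Data APIs)

read('myAlignment.nex')
a = var.alignments[0]
bootA = a.bootstrap()       # assumed Alignment method name

read('mySets.nex')          # hypothetical sets block, for partitioned data
a.setCharPartition('cp1')   # hypothetical charpartition name
d = Data()
bootD = d.bootstrap()       # assumed Data method name; keeps the partition structure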
Dealing with duplicate sequences
There is no point in doing a deep phylogenetic analysis with duplicated sequences. (There may be a reason for popgen-level work.) If you have any duplicated sequences, p4 will point them out when you read in your data file. P4 will remove the sequences if you tell it to. But often you want those dupes in your final result, generally a tree. So you will want to remove the duplicated sequences before you do your analysis, and then restore the taxa to the resulting tree afterwards. For example, let's say that you read in your data like this
$ p4 d3.nex # at your shell
and then you get a long message complaining about duplicated sequences, starting with
Alignment from file 'd3.nex'
This alignment has duplicate sequences!
Sequence numbers below are 1-based.
    sequence 1 (A) is the same as sequence 3 (C).
    sequence 1 (A) is the same as sequence 8 (H).
    sequence 2 (B) is the same as sequence 5 (E).
    sequence 2 (B) is the same as sequence 7 (G).
    ...
To remove the duplicated sequences and make a Python dictionary file that will be useful later for restoring those removed taxa, you can do something like this
read('d3.nex')
a = var.alignments[0]
a.checkForDuplicateSequences(removeDupes=True, makeDict=True)
a.writePhylip(fName='d3_noDupes.phy')
(Above I wrote the de-duped alignment as a Phylip file, but that need not be so; you could write it in Nexus or fasta format as you choose.)
That makes a dictionary file, by default named p4DupeSeqRenameDict.py. Don't lose it, because it is needed when you restore the deleted duped taxa to the resulting tree. The method removes all but one of each set of duplicate sequences, and renames each remaining sequence with a name like p4Dupe1 or p4Dupe2.
Then you can do your analysis, which will go faster because you don’t have duped sequences. At the end, you will generally have a tree. You can restore the removed taxa to the tree using the dictionary file made above. You would do something like
read('myTree.nex')
t = var.trees[0]
t.restoreDupeTaxa()
t.writeNexus(fName='myRestoredTree.nex')
The restored taxa will by default have compound names.
Trees
Tree, Node, and Trees classes
Trees in p4 are described by a list of nodes, one of which is the root. All trees have a root, but this has nothing to do with the number of children that the root has. In phylogenetics, when we talk about rooted trees we usually mean that we think that one node, the root, is the ancestor of all the other nodes; however, the root in p4 usually carries no such implication. The root in p4 is often just a node that is a starting place for the rest of the nodes.
Can the root ever be a leaf? — Not usually, but it is possible.
Nodes are the vertexes of the tree, where the branches meet. (My dictionary tells me that both vertices and vertexes are correct plurals of vertex.) P4 describes relationships among nodes with 3 pointers: we have a pointer to the parent node, to a child node (which I imagine to be always on the left, so it is the leftChild), and to a sibling node (which I imagine to always be on the right). Any of these pointers might be None; for example the root node's parent is None, the leftChild of any tip node is None, and the sibling of the rightmost child of a node is None

  leftChild
      \
       \
        Node --- sibling
          |
          |
        parent
All the nodes except the root in a p4 Tree also have NodeBranch attributes, which embody the branch leading to the node from its parent. The branch has information about, for example, its length, and so for a node n we have n.br.len.
The usual way to get around trees is by recursive traversal, and p4
provides a few avenues for that, both as Node methods and as Tree
methods. See Knuth’s The Art of Computer Programming for a good
explanation of preorder and postorder tree traversal.
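For illustration, a small sketch, assuming the Tree iterator methods iterPreOrder() and iterPostOrder() (check the Tree and Node class APIs for the full set of traversal methods)

read('((A, B), (C, D));')
t = var.trees[0]
for n in t.iterPreOrder():    # parents visited before their children
    print(n.nodeNum, n.name)
for n in t.iterPostOrder():   # children visited before their parents
    print(n.nodeNum, n.name)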
Sometimes it is natural to deal with trees by the bunch, and so we have the Trees class. It is this class that is able to write Nexus tree files that use translations, and it is this class that interfaces with consel to compare several trees.
This week, p4 will read trees in Nexus format, or in Phylip or raw Newick format. The Nexus format is described in the 1997 paper in Systematic Biology (MadSwofMad97). I am not sure where the Phylip/Newick format is described. Older Phylip tree files began with a number, indicating the number of trees in the file, but it appears that newer Phylip tree files do not have or need that number. When did it change?
P4 reads both Nexus and Phylip trees with Nexus-like rules, so if you have punctuation or spaces in your taxon names then you need to put the name in single quotes. P4 fails to read multi-word Phylip taxon names, so for example p4 will gag on the tree
((A, B), (C, Homo sapiens));
because one of the taxa has a space in it. That sort of taxon name is perfectly acceptable if
- you put it inside single quotes, or
- you set the flag var.newick_allowSpacesInNames to True
The tree files output by Tree-Puzzle have modifications to the Phylip/Newick format, where [comments bounded by square brackets] precede the tree description. P4 can handle those.
Branch lengths go after colons in the tree description, as usual. You can label internal nodes, including the root, immediately after closing parentheses, as in the following
p4> read("((A, B)the_AB_clade:0.2, (C, 'Homo sapiens')98:0.3)theRoot;")
p4> t = var.trees[0]
p4> t.draw()

          +------2:A
+--------1:the_AB_clade
|         +------3:B
theRoot:0
|          +------5:C
+----------4:98
           +------6:Homo sapiens
P4 distinguishes Node objects (vertices) and NodeBranch objects (edges). It is the branch that holds the length of the branch. Branches and branch lengths can be accessed by
myNode.br
myNode.br.len
After doing a consensus tree, it is the branch that holds the support values. That makes the branch supports immune to re-rooting. But it also means that in order to see them and save them in Newick or Nexus format you will need to convert them to node names. This has the advantage that you can format the conversion to your liking: you can make it a percent, or a value from zero to 1.0 with however many significant figures you wish
tp = TreePartitions('myTrees.nex')
t = tp.consensus()

# list the support values
for n in t.iterInternalsNoRoot():
    print("node %3i branch support is %f" % (n.nodeNum, n.br.support))

# perhaps collapse nodes where the support is too small for you
toCollapse = [n for n in t.iterInternalsNoRoot() if n.br.support < 0.7]
for n in toCollapse:
    t.collapseNode(n)

# optionally re-root it
t.reRoot(t.node('myOutgroupTaxon').parent)

# Make the internal node names the percent support values
for n in t.iterInternalsNoRoot():
    n.name = '%.0f' % (100. * n.br.support)

# Drawings get a bit messy with both node nums and support
t.draw(showNodeNums=False)
t.writeNexus('consTreeWithSupport.nex')
The Nexus/Newick format can hold node names and branch lengths, but it is awkward and non-standard to make it hold more information (eg split support, branch colour, model usage information, rooting information, and so on). Perhaps an XML-based format would be useful here, but these are early days for XML in phylogenetics, and XML files can get very big very quickly. Update: NeXML looks interesting (files are still big, tho).
As an alternative, you can store information-rich trees by pickling them, a standard Python thing to do to archive objects to files. Trees, nodes, and models may have pointers to C structures; these should not be archived, nor should you archive data objects along with the trees. To facilitate pickling trees you can use the Tree method tPickle(), which strips out the C pointers and data before pickling the tree. Trees so pickled (with the p4_tPickle suffix) are autorecognized by p4 with the read() function or at the command line.
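A minimal sketch, assuming tPickle() accepts a base file name (check the Tree API)

t.tPickle('myTree')         # assumed to write myTree.p4_tPickle
read('myTree.p4_tPickle')   # autorecognized by read()
t2 = var.trees[0]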
Tree diagrams
You can make a text diagram of a tree at the terminal with the Tree.draw() method. It provides some control over the presentation, for example the width, whether node numbers are displayed, and so on.
Topology distance
You can compare tree topologies such that the root of the tree and any rotations of clades do not matter. For example, these 2 trees have identical topologies (by this definition), even though they do not look like each other
+--------1:A
|
|        +--------3:B
0--------2
|        +--------4:C
|
|        +--------6:D
+--------5
         +--------7:E


+--------1:E
|
|--------2:D
0
|                 +--------5:C
|        +--------4
|        |        +--------6:B
+--------3
         |
         +--------7:A
With the Tree.topologyDistance() method you can compare topologies without taking branch length differences into account, or you can use metrics that do take branch lengths into account. The default metric is the symmetric difference, aka the unweighted Robinson-Foulds distance, which ignores branch lengths. The Tree.topologyDistance() method also provides the weighted Robinson-Foulds distance, and the branch length distance, which take branch lengths into account. These are described in Felsenstein's book. To do several trees at once, you can use the Trees.topologyDistanceMatrix() method.
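For example, a sketch; the metric argument and its values here ('wrf', 'bld') are assumptions, so check the Tree.topologyDistance() doc string for the exact names

d0 = t1.topologyDistance(t2)                 # symmetric difference (the default)
d1 = t1.topologyDistance(t2, metric='wrf')   # weighted Robinson-Foulds (assumed name)
d2 = t1.topologyDistance(t2, metric='bld')   # branch length distance (assumed name)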
Patristic distances
This distance is just the length along the tree path between all pairs of nodes.
The method Tree.patristicDistanceMatrix() returns a DistanceMatrix object, which you probably want to write to a file. For example, you might say
t = var.trees[0]
dm = t.patristicDistanceMatrix()
dm.writeNexus('patristic.nex')
Models and likelihood
In order to calculate the likelihood of a tree, you need to specify a model. Since the model depends on the datatype and data partition structure, before you can specify a model you need to specify the data.
Having specified a model, you can calculate the likelihood without optimization with Tree.calcLogLike(); you can optimize any free parameters with the Tree.optLogLike() method.
t = var.trees[0]
t.data = Data()   # specify the data
<... specify the model ...>
t.calcLogLike()
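To optimize rather than just evaluate, the call is Tree.optLogLike(); a sketch, assuming the result is stored in t.logLike

t.optLogLike()     # optimize free model parameters and branch lengths
print(t.logLike)   # assumed attribute holding the resulting log likelihood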
Describing models
You need to specify
- a composition,
- a rate matrix, and
- the proportion of invariant sites.
- Optionally you can specify gamma distributed among-site rate variation (gdasrv), with the number of discrete gamma classes, nGammaCat.
Here is a simple one, for an F81 model, with no among-site rate variation
t.newComp(free=1, spec='empirical')
t.newRMatrix(free=0, spec='ones')
t.setNGammaCat(nGammaCat=1)    # no 'gamma'
t.setPInvar(free=0, val=0.0)   # no proportion of invariant sites
(Note that while the F81 is a DNA model, the model above would also work for other datatypes.)
Parameters of the model may or may not be free (adjustable or optimizable), and so that needs to be specified. Depending on the spec (ie model specification), some model parameter numerical values may need to be specified as a val argument. For example, to define a composition for the third data partition (partNum=2) for which you want the model composition to be fixed to certain values (ie not free), you might say
t.newComp(partNum=2, free=0, spec='specified', val=[0.4, 0.3, 0.2])
A simple Poisson or Jukes-Cantor-like model can be described for single partition data by the following
t.newComp(free=0, spec='equal')
t.newRMatrix(free=0, spec='ones')
Here the spec for the composition is equal, meaning that the character state frequencies are all equal; since they are equal you do not need to specify the numbers. In specifying the rate matrix, the spec='ones' arg means that the rates of change from one base to another are all the same. Here the model parameters are not free, and so they would not change in a likelihood optimization or in a Bayesian MCMC analysis.
If all the parameters on the model above were set to be free (ie by setting free=1) then it would become a GTR model. If the comp was set free but the rMatrix remained fixed then it would be an F81 model.
K2P and HKY models are specified by setting the rate matrix spec to '2p'. Empirical protein models are specified by setting the rate matrix spec to, for example, 'd78', 'jtt', 'wag', or 'mtrev24'. The current list of rMatrix specs is given by var.rMatrixSpecs.
Multi-partition models for multi-partition data
The model can differ over the data, and if you have more than one data partition you need to specify the partNum when you specify the components of a model
pNum = 0   # First partition, F81
t.newComp(partNum=pNum, free=1, spec='empirical')
t.newRMatrix(partNum=pNum, free=0, spec='ones')
t.setNGammaCat(partNum=pNum, nGammaCat=1)
t.setPInvar(partNum=pNum, free=0, val=0.0)
t.setRelRate(partNum=pNum, val=0.9)

pNum = 1   # second partition, F81+G
t.newComp(partNum=pNum, free=1, spec='empirical')
t.newRMatrix(partNum=pNum, free=0, spec='ones')
t.setNGammaCat(partNum=pNum, nGammaCat=4)
t.newGdasrv(partNum=pNum, free=1, val=0.5)
t.setPInvar(partNum=pNum, free=0, val=0.0)
t.setRelRate(partNum=pNum, val=1.1)
If your model is heterogeneous over the data, you can optionally specify the relative rates of the different partitions, for example
t.setRelRate(partNum=0, val=0.5)
t.setRelRate(partNum=1, val=1.5)
If you want to make these free parameters, you can say
t.model.relRatesAreFree = 1
Tree-heterogeneous models
The composition or rate matrix of the model can differ over the tree. (The gdasrv can also differ over the tree, but this seems to be less useful.)
To specify it exactly (which is sometimes useful, but it would not generally be useful at the start of an MCMC), first you specify the model attribute, and then you put the model attribute on the tree, for example, if we start with this tree
         +--------2:one
+--------1
|        +--------3:two
0
|--------4:three
|
+--------5:four
and put 2 compositions on it, like this
A = t.newComp(spec='empirical', symbol='A')
B = t.newComp(spec='empirical', symbol='B')
t.setModelComponentOnNode(A, node=t.root, clade=1)
t.setModelComponentOnNode(B, node=1, clade=1)
t.draw(model=1)
then we end up with a tree like this
         +BBBBBBBB2:one
+BBBBBBBB1
|        +BBBBBBBB3:two
0
|AAAAAAAA4:three
|
+AAAAAAAA5:four

Part 0 comps
    0   A
    1   B
root (node 0) has comp 0, symbol A
Here I have specified 2 compositions, A and B. We place A on the root
node, but because we specify clade=1
that composition is applied over
the entire tree. Then we place composition B on node 1, also
clade-wise, and in that part of the tree B displaces (ie over-rides) A.
As an alternative to placing model components on the tree explicitly as above, you can place them randomly with Tree.setModelComponentsOnNodesRandomly().
The multinomial or unconstrained likelihood is a property of the data only, and does not need a tree or model. It can only be calculated if there are no gaps or ambiguities in the data. There are 2 ways to calculate it: you can either take the data partitions into account, or not. The former uses the Data method Data.calcUnconstrainedLogLikelihood1(); the result is placed in the data attribute unconstrainedLogLikelihood. The Data method Data.calcUnconstrainedLogLikelihood2() calculates the unconstrained log like of each data partition. Note that the unconstrained log like of the combined data is not the sum of the unconstrained log likes of the separate partitions.
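A minimal sketch, using the method and attribute names given above (the file name is hypothetical)

read('myData.nex')                    # data with no gaps or ambiguities
d = Data()
d.calcUnconstrainedLogLikelihood1()   # the partition-aware version
print(d.unconstrainedLogLikelihood)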
Switching Data and the Model
You will often want to analyse the same dataset and tree with different models, or perhaps analyse the same tree and model with different datasets.
It is straightforward to switch datasets.
a = func.readAndPop("data_a.nex")
dA = Data([a])
b = func.readAndPop("data_b.nex")
dB = Data([b])
t.data = dA
# set the model ...
t.data = dB
# continue with the same model ...
When you switch models, first set the old model to None
a = func.readAndPop("data_a.nex")
dA = Data([a])
t.data = dA
# set the model ...
t.model = None
# set the new model ...
Bayesian analysis with MCMC
P4 has a basic MCMC for doing Bayesian analyses. It implements the models in p4, including some models that are Bayesian-only. The Mcmc is started with a p4.Tree object, with the model defined and the data attached. Tree rearrangement with eTBR and LOCAL is implemented. If the tree root is bifurcating, the Mcmc will preserve a bifurcating root. Topology constraints, including rooting, can be enforced.
It can use the Metropolis-coupled MCMC, or MCMCMC, aka parallel tempering, to help mixing. To start an MCMC, you first set up your data, tree, and model as usual. Then you instantiate an Mcmc object, and run the chain, like this
# t is the tree, with model and data attached
m = Mcmc(t, nChains=4, runNum=0, sampleInterval=1000,
         checkPointInterval=250000, simulate=2)
m.run(1000000)   # one million generations
Here nChains says to make 4 MCMC chains in parallel in an MCMCMC, where 1 chain is the cold chain that is sampled, and 3 are heated chains.
As the MCMC runs, it saves trees to a tree file, saves likelihoods to another file, and saves model parameters to another file. These samples need to be digested or summarized later.
It has a full implementation of Paul Lewis et al’s polytomy proposal (P. O. Lewis, M. T. Holder and K. E. Holsinger, 2005. Polytomies and Bayesian phylogenetic inference, Syst Biol 54(2): 241–253). P4 writes checkpoint files during the MCMC run, so that if you want to continue a run after it has ended, or if you want to recover from a crash or power outage, it is possible to restart from a checkpoint. You can specify topological constraints in the trees.
A simple MCMC
# You will usually start with a data file, and a random tree.
read("d.nex")
d = Data()
t = func.randomTree(taxNames=d.taxNames)
t.data = d

# Now define a model.  This one is GTR+IG.
t.newComp(free=1, spec='empirical')
t.newRMatrix(free=1, spec='ones')
t.setNGammaCat(nGammaCat=4)
t.newGdasrv(free=1, val=0.5)
t.setPInvar(free=1, val=0.2)

# Now instantiate an Mcmc object.  This one has 4 MCMCMC chains, is set
# to do a million generations (below, in run()), collect 2000 samples
# (every 500 generations) and is set to take 4 checkPoints.
m = Mcmc(t, nChains=4, runNum=0, sampleInterval=500,
         checkPointInterval=250000)

# Here you can adjust the proposal probabilities
m.prob.polytomy = 1.0   # Default is zero
m.prob.brLen = 0.001    # Default is zero

# This starts the run, specifying the number of gens you want.  You can
# restart it from a checkPoint if you later find you need more.
m.run(1000000)
Chain swapping and temperature tuning in the MC3
In the MCMCMC, we have a few MCMC chains that run in parallel. All but one of the chains are heated or tempered, which makes proposal acceptances easier and allows better mixing. The unheated chain is the cold chain, and is the only chain that is sampled. The amount of heating, to continue the metaphor, is given by the temperature. Chains can swap states, and so for example the cold chain can swap states with the lowest-temperature heated chain, and so after the swap the previously heated chain would be sampled. This allows the heated chains, with their mixing advantage, to inform the sampled chain.
The way p4 implements this is that by default only adjacent chains are proposed to swap. That is, the cold chain may swap with the lowest-temperature heated chain, but not with any of the other chains. MrBayes implements swapping in a similar way, but chain-swapping proposals are made between any pairs of chains, not only between adjacent chains. I have noticed that chain-swapping acceptance is often low for non-adjacent chains, and so by default p4 uses a swap vector that only allows swapping between adjacent chains rather than a swap matrix that allows swapping between any two chains. This is under the control of
Mcmc.swapVector # eg m.swapVector=True; True by default
The amount of swapping between chains is related to the temperature difference, with a low temperature difference tending to have a high acceptance and a high temperature difference tending to have low swap proposal acceptance. For this reason the chain temperatures are actively tuned during the MCMC run to try to attain an acceptance of 0.1 – 0.25 (for the swap vector). Whether tuning is done at all is controlled by m.swapTunerDoTuning, which is True by default.
The Mcmc object stores information about the swaps, and for swap vectors this is accessible by the Mcmc method writeSwapVector().
This is conveniently done on Mcmc checkpoints, and so if you save Mcmc checkpoints you can do
cpr = McmcCheckPointReader(theGlob='*')
cpr.writeSwapVectors()
Below is the output of one such checkpoint
Mcmc.writeSwapVector().  Chain swapping, for 250000 gens, from gens 250000 to 499999, inclusive.
swapVector is on, so swaps occur only between adjacent chains.

last chainTemps        0.000    0.554    3.399    8.739
last chainTempDiffs    0.554    2.845    5.340

chains         0--1       1--2       2--3
             ------     ------     ------
proposed      83882      82894      83224
accepted      15006      15459      15815
accepted   0.178894   0.186491   0.190029
Proposal probabilities
Proposals to change the model parameters or tree topology are made with default probabilities. These default probabilities are calculated based on the number of parameters involved. You can adjust these proposal probabilities, or perhaps turn a proposal type off altogether, by setting attributes of Mcmc.prob. These are user-specifiable relative probabilities, with default being 1; they are not proper probabilities, and do not sum to 1.
For an Mcmc object m, I can show these current relative probs by saying
print(m.prob)
You can see the real intended rates (the kind that add up to 1.0) that are calculated from these relative probs after a run, or on checkpoints
m.writeProposalProbs()
For my Mcmc object m, I could for example turn off the tree topology proposals by setting local and eTBR to zero.
m = Mcmc(t, ...)
m.prob.local = 0.0
m.prob.eTBR = 0.0
m.run(...)
These relative m.prob values affect proposal probabilities, but not in a linear way. For changes with inherently low proposal probabilities, such as relRate, doubling the relative prob approximately doubles the final proposal probability. As I mentioned, you can check the final calculated proposal probabilities with m.writeProposalProbs().
The polytomy proposal
This is an implementation of the move described in Lewis et al (2005) “Polytomies and Bayesian phylogenetic inference” Syst Biol 54:241-253. This move is turned off by default, but you can turn it on for an Mcmc object m by saying
m.prob.polytomy = 1.0
m.prob.brLen = 0.001
(If you have local turned on, which you probably do, you will also want to turn on the brLen proposal, for those cases when the current tree is a star tree; the local proposal needs 3 contiguous edges, and so it does not work on a star tree. In that case, the chain uses the brLen proposal instead of local.)
This move tends to decrease spurious high support for short internal nodes. The prior probability favouring increased multifurcations or increased resolution can be adjusted. Actually, as in the paper, there are two ways to do this; you can use the polytomy prior, or you can use the resolution class prior. If you want to use the resolution class prior, you can turn it on, for an Mcmc object m, by
m.tunings.doPolytomyResolutionClassPrior = True # False by default
but if it is left as False then you use the polytomy prior. In both cases, there is a 'C' constant to set, which determines how much multifurcations are favoured. You can set it via its log, for example for an Mcmc object m
m.tunings.polytomyPriorLogBigC = 1.0 # Default is zero
By setting the logBigC to 1.0, above, I am setting the C to e, as in the paper. Assuming the default of using polytomy priors, leaving logBigC set to zero (ie C=1) means that every tree (of any resolution) has the same prior probability. Setting it to higher numbers favours polytomies.
As in the paper, you can use the resolution class priors, and set the C constant to 1.1 (actually 10/9 = 1.111111), 2, or 10, for an Mcmc object m, like this
import math
m.tunings.doPolytomyResolutionClassPrior = True
m.tunings.polytomyPriorLogBigC = math.log(10./9.)   # Or 2, or 10, or ...
Topology constraints
You can specify constraints on the tree topology with a Constraints object. To make a Constraints object, you need a list of taxNames (in the same order as your data, trees, and so on), and you also need a (usually partly-resolved) tree object
myTaxNames = myData.taxNames
cTree = func.readAndPop('myConstraintsTreeFile.phy')
constr = Constraints(myTaxNames, constraintTree=cTree, rooting=True)
You can pass a Constraints object to func.randomTree() and Mcmc() to enforce constraints. If you are starting an Mcmc with a random tree, then the tree should have the same constraints as you pass to the Mcmc
t = func.randomTree(taxNames=myTaxNames, constraints=constr, biRoot=True)
<...>
m = Mcmc(t, ..., constraints=constr)
Checkpoints
When running an MCMC, you can write checkpoint files at intervals. These files are the state of the Mcmc at that time. Mcmc runs can be restarted using checkpoints. Also, you can do diagnostics on checkpoints like you can on live Mcmc objects; but with checkpoints you can do the diagnostics after the run has finished (when the Mcmc object no longer exists), or during the run (querying finished checkpoints). You can also do the diagnostics on several checkpoints to see how the diagnostics change over time.
To tell your Mcmc to make checkpoints, say (among other args when you start up an Mcmc) for example
m = Mcmc(t, ..., checkPointInterval=250000, ...)
To tell it not to do checkpoints, set it to zero or None
m = Mcmc(t, ..., checkPointInterval=None, ...)
The checkpoint interval should divide the number of generations requested by run() evenly, so for example
m = Mcmc(t, ..., checkPointInterval=250000, ...) m.run(1000000)
will work, but
m = Mcmc(t, ..., checkPointInterval=250000, ...) m.run(900000)
will not work.
I generally aim to collect perhaps 4 or 5 checkpoints in my runs. If you collect more checkpoints (by collecting them more often) then they will each contain fewer samples, and so the estimates might be a bit more noisy than if you take checkpoints at bigger intervals representing more samples.
There is a class, McmcCheckPointReader(), that is good for reading and digesting checkpoints. When you start it up like this
cpr = McmcCheckPointReader()
then it will read in all the checkpoint files in the current directory.
There are other ways to start it up - read the class doc string.
Having got a McmcCheckPointReader object (eg cpr here), you can ask it to
cpr.writeProposalAcceptances()
or
cpr.writeSwapVectors()
which calls those methods on all the checkpoints. When you read in checkpoints, they are full Mcmc objects (except that they do not have data attached), and so you can ask questions of them as you would ask of an Mcmc object, as
m = cpr.mm[0]   # get the first Mcmc object
m.writeProposalAcceptances()
Using the McmcCheckPointReader is useful for seeing how things, for example acceptance rates, change over the run. Perhaps the most useful thing that a McmcCheckPointReader can do is compare the split supports between runs, using the average standard deviation of split support (or split frequency). When you do an Mcmc analysis it is good practice to do more than one run, and comparing split supports between runs gives a measure of topological agreement between runs.
You can compare split supports between two different checkpoints, or between all pairs, as
cpr.compareSplits(0, 1)   # Compare checkpoint 0 with 1
cpr.compareSplitsAll()    # Compare all checkpoints to each other
Restarting from checkpoints
You can restart an Mcmc from a checkpoint. If the previous run finished normally (ie was not a crash) then it is easy. A checkpoint is a full Mcmc object without the data - so you need to give it the data to get going. You can use the function
func.unPickleMcmc(runNum, theData, verbose=True)
to get the Mcmc object from the checkpoint. That function will get the last checkpoint from the specified run (runNum) and return an Mcmc object. So for example you might restart runNum 0 by
read("../d.nex") d = Data() m = func.unPickleMcmc(0, d) m.run(1000000)
which tells it to continue on in the same way as it was, and do another million generations. It would be possible to make changes to the Mcmc before continuing the run, as
m = func.unPickleMcmc(0, d)
<... make changes here ...>
m.run(1000000)
If the Mcmc crashed, you can restart from a checkpoint as above, but first you will want to repair the output files to get rid of output lines that were written after the last checkpoint but before the crash.
Output
Trees sampled in the MCMC are placed in a file, and the log likelihoods of those sample trees are found in another file. Model parameters are put in another file.
You can make a consensus tree like this
tp = TreePartitions('trees.nex', skip=1000)   # skip burnin
t = tp.consensus()
You will often want to transfer the node support values to internal node names – see the doc string for TreePartitions.
You can do a quick-and-dirty convergence test by plotting the log likelihood values found in the output file. If it reaches a plateau, then it is assumed to have converged. However, this method is unreliable. Certainly if the log likes have not reached a plateau then it has not converged, but the reverse cannot be assumed. Comparing split supports as described above (see Checkpoints) offers better convergence diagnostics.
You can also look at the sampled model parameters, found in the mcmcpramsN files (N = 0, 1, ...), using
func.summarizeMcmcPrams()
You can set it to skip a burnin.
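For example (the skip argument name here is an assumption; see the doc string)

func.summarizeMcmcPrams(skip=1000)   # ignore the burnin samples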
Posterior predictive simulations
To help assess model fit in a MCMC you can set up the MCMC to simulate data sets based on the current tree and model every writeInterval. When a simulation is made, a test quantity is extracted from it and written to a file.
You can turn on simulations during an MCMC when you instantiate the Mcmc object, by setting the simulate arg to a number from 1-31. For example, if you set it to 1, you get the unconstrained or multinomial likelihood, if 2 you get X2, and if 3 you get both. The available test quantities are
 1   multinomial log likelihood
 2   composition X^2
 4   mean nChars per site
 8   c_m, compStatFromCharFreqs (2 numbers per part; sim and curTree)
16   constant sites count (not proportion)
If you want to do the simulations after the MCMC is finished (or perhaps from a MrBayes run), see the PosteriorSamples class.
The idea is that you have a single test quantity from your data (or
data partition) and you want to compare it to the range that is
generated from the posterior distribution, to see whether the model
that you are using might have plausibly generated your original data.
A usual way to do that comparison is to use func.tailAreaProbability() (or the same within the Numbers class, Numbers.tailAreaProbability()). Here is an example from the source code
# Get the test quantity, X^2, from the original data.
read("../../K_thermus/noTRuberNoGapsNoAmbiguities.nex")
d = Data()
ret = d.compoChiSquaredTest()
#print(ret)
originalStat = ret[0][0]

# Get the sim stats
n = Numbers('mcmc_sims_0', col=1, skip=500)

# Evaluate the tail area probability
n.tailAreaProbability(originalStat)
From which the output is something like
Part 0: Chi-square = 47.836914, (dof=12)   P = 0.000003
# The stat is 47.8369140221
# The distribution has 500 items
# The distribution goes from 0.598552 to 10.158933
# Items in the distribution were >= theStat 0 times.
# The tail-area probability is 0.000000
In this example, the model does not fit, and could not have plausibly generated the data from which the original test quantity was 47.8.
Simulating data
The ability to simulate data is useful in many contexts. You can only simulate into an existing Data object, and so if you want to simulate a new data set, you need to make a blank Data object to simulate into. You can make one or more new empty alignments and make a Data object like this
a = func.newEmptyAlignment(dataType='dna', taxNames=myTaxNames, length=200)
d = Data([a])
Then you can attach that Data object to a tree, define a model, and simulate with the Tree simulate() method.
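A minimal sketch, using a Poisson-like model as described earlier; the final Data write call is an assumption (you could instead write the Alignment)

a = func.newEmptyAlignment(dataType='dna', taxNames=myTaxNames, length=200)
d = Data([a])
t = func.randomTree(taxNames=myTaxNames)
t.data = d
t.newComp(free=0, spec='equal')
t.newRMatrix(free=0, spec='ones')
t.setNGammaCat(nGammaCat=1)
t.setPInvar(free=0, val=0.0)
t.simulate()
d.writeNexus('simData.nex')   # assumed Data method; see the Data API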
Generation of random numbers uses the GSL random number generator. The state is held in var.gsl_rng, which is None by default. If you do a simulation using this method, it will use var.gsl_rng if it exists, or make it if it does not exist yet. When it makes it, it seeds the state based on the current time. That should give you lots of variation in the simulations that follow.
If on the other hand you want to make simulations that are the same you can reseed the randomizer with the same seed whenever you do it, like this
if not var.gsl_rng:
    var.gsl_rng = pf.gsl_rng_get()
# unusually, set the seed with each simulation
mySeed = 23   # your chosen int seed
pf.gsl_rng_set(var.gsl_rng, mySeed)
Customizing
P4 is meant to be customized.
- You can change variables found in the Var class.
- You can semi-permanently add your own code into files that p4 reads on startup. By doing that you can extend or over-ride the behaviour of p4.
Setting var variables
You can change the default behaviour by setting or un-setting some variables in var (which is an instance of the Var class). For example, when p4 reads files it is by default not verbose, not describing what it is doing as it goes. This is good everyday behaviour, but not very good for debugging when something goes wrong. You can make it verbose by setting
var.verboseRead=True
Generally you will get your phylogenetic stuff into p4 from files, using the read() function. The read() function also reads strings, which is occasionally handy. If you hand the read() function a string, it will first ask if there is a file by that name, and if so it will attempt to read it. If there is no file, it will try it as a string. The problem is that if you mis-spell a file name, you are liable to get a bunch of confusing error messages as it attempts to read the file name as some sort of phylogenetic data. So the default behaviour when it can't find a file is to give a little warning, and you can turn that warning off by setting
var.warnReadNoFile=False
There are several other things in the Var class that you might want to know about.
You can make these settings to var, either
- temporarily
  - in interactive p4
  - at the top of individual scripts
- or in startup scripts
  - possibly in your ~/.p4 directory
Add your own code
You can of course add your own code in your p4 script to extend p4, perhaps to add your own function, or to extend a p4 class with an additional method or two. You might put that part of the code in a separate file in your working directory and ask your script to load that module. But if you find your new code generally useful you might want to semi-permanently add it to your p4, so that it is read in for any script in any directory, without you needing to explicitly ask it to do so. To do that, you can put your code in a directory known to p4, and p4 will read it on startup. One such directory is ~/.p4. So if you put Python files in there, p4 will find them and read them on startup. Other directories can be specified by the environment variable P4_STARTUP_DIRS. That variable is a colon-separated list of directory names.
So for example when p4 starts up, by default it prints out a splash screen. Whether that is done or not is controlled by var.doSplash. When you get tired of seeing the splash screen, you can turn it off by setting var.doSplash=False. But wait! By the time you have the opportunity to set that variable, it's already too late: the splash screen has already displayed itself. The solution is to put var.doSplash=False in a file ending in .py in a .p4 directory in your home directory, or in another of the P4_STARTUP_DIRS. Any file ending in .py in one of those directories is read on startup. It's a good place to put var.warnReadNoFile=False, for example.
Since it is Python, you can add methods to existing classes. For example, if you have this in your script
def greet(self):
    print("Howdy!")

Alignment.greet = greet
del(greet)
and you had an Alignment instance a, you could then do
p4> a.greet()
Howdy!
p4>
You can over-ride existing methods that way also. If you don't like the way p4 works, change it to suit yourself.
A good place to put these new methods and functions is in files in the ~/.p4 directory, or in one of your specified P4_STARTUP_DIRS.
You can easily turn off files, but allow them to remain there for later, by changing the file name so that it no longer ends in .py, or just put them in a sub-directory.
One special file that is read in after the others, but only if you are interactive, is the ~/.p4/interactive file; note that while it is a Python file, it does not end in .py. I like to instantiate a few blank objects in there, for quick access to the doc strings via completion (see below). Here is my current interactive file, for what it's worth
#print('Here is the interactive startup file.')

# Make some blank objects.
if 1:
    a0 = Alignment()
    d0 = Data([])
    t0 = Tree()
    n0 = Node()
    #tt0 = Trees()

# Give single Alignments, Trees, or SequenceLists names.
if pyFileCount == 0:
    if len(var.alignments) == 1:
        print("There is one alignment in var.alignments-- I am naming it 'a'")
        a = var.alignments[0]
    if len(var.trees) == 1:
        print("There is one tree in var.trees-- I am naming it 't'")
        t = var.trees[0]
    if len(var.sequenceLists) == 1:
        print("There is one sequenceList in var.sequenceLists-- I am naming it 'sl'")
        sl = var.sequenceLists[0]
Compositional homogeneity tests
The usual, poor, way to do a compositional homogeneity test is to do a Chi-square test on the data. Paup does this on the data overall, and Tree-Puzzle does it on individual sequences. In the latter, since there are many simultaneous comparisons, the power of the conclusions might be compromised. P4 does the test both ways, using the method Data.compoChiSquaredTest().
This test uses the X2 statistic. It is well-known that this sort of test suffers from a high probability of type II error, because the chi-square curve is not an appropriate null distribution by which to assess significance of the X2 statistic.
A better null distribution can be obtained by simulating data on the tree and model in question, and using X2 statistics extracted from those simulations to make a null distribution appropriate to the problem at hand. This is done using the Tree method compoTestUsingSimulations().
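A sketch of how that might look; the nSims argument name is an assumption, so check the doc string

read('myData.nex')                         # hypothetical data file
d = Data()
t = func.randomTree(taxNames=d.taxNames)   # or, better, your estimated tree
t.data = d
# ... specify the model ...
t.compoTestUsingSimulations(nSims=100)     # assumed arg name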
Model fit tests
For Bayesian model fit assessment you can do posterior predictive simulations during the MCMC.
P4 implements two ML-based model fit tests: the tree- and model-based composition fit test, and Goldman's version of the Cox test. Both rely on simulations that generally need to be followed by optimizations, so they are expensive tests. In both tests the simulation/optimization part is the time-consuming part, and since both tests can use the same simulations, in p4 they are done together.
First, the simulations are done using the Tree method simsForModelFitTests(). After that, the simulation results are digested with the Tree method modelFitTests().
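In outline, a hedged sketch (the argument to simsForModelFitTests() is assumed to be the number of simulations; check the doc strings)

# after setting t.data and the model, and optimizing ...
t.simsForModelFitTests(100)   # the expensive simulation step (arg assumed)
t.modelFitTests()             # digest the simulation results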