Models and likelihood

In order to calculate the likelihood of a tree, you need to specify a model. Since the model depends on the datatype and data partition structure, before you can specify a model you need to specify the data.

Having specified a model, you can calculate the likelihood without optimization with tree.Tree.calcLogLike(); You can optimize any free parameters with the p4.tree.Tree.optLogLike() method.

t = var.trees[0] = Data()   # specify the data
<... specify the model ...>

Describing models

You need to specify

  • a composition,

  • a rate matrix, and

  • the proportion of invariant sites.

  • Optionally you can specify gamma distributed among-site rate variation (gdasrv), with the number of discrete gamma classes, nGammaCat.

Here is a simple one, for an F81 model, with no among-site rate variation:

t.newComp(free=1, spec='empirical')
t.newRMatrix(free=0, spec='ones')
t.setNGammaCat(nGammaCat=1)       # no 'gamma'
t.setPInvar(free=0, val=0.0)      # no proportion of invariant sites

(Note that while the F81 is a DNA model, the model above would also work for other datatypes.)

Parameters of the model may or may not be free (ie adjustable or optimizable), and so that needs to specified. Depending on the spec (ie model specification), some model parameter numerical values may need to be specified as a val argument. For example, to define the a composition for the third data partition (partNum=2) for which you want the model composition to be fixed to certain values (ie not free), you might say:

t.newComp(partNum=2, free=0, spec='specified', val=[0.4, 0.3, 0.2])

A simple Poisson or Jukes-Cantor-like model can be described for single partition data by the following:

t.newComp(free=0, spec='equal')
t.newRMatrix(free=0, spec='ones')

Here the spec for the composition is equal, meaning that the character state frequencies are all equal; since they are equal you do not need to specify the numbers. In specifying the rate matrix, the spec='ones' arg means that the rates of change from one base to another are all the same. Here the model parameters are not free, and so they would not change in a likelihood optimization or in a Bayesian MCMC analysis.

If all the parameters on the model above were set to be free (ie by setting free=1) then it would become a GTR model. If the comp was set free but the rMatrix remained fixed then it would be an F81 model.

K2P and HKY models are specified by setting the rate matrix spec to ‘2p’. Empirical protein models are specified by setting the rate matrix spec to ‘d78’, ‘jtt’, ‘wag’, or ‘mtrev24’.

Multi-partition models for mult-partition data

The model can differ over the data, and if you have more than one data partition you need to specify the partNum when you specify the components of a model:

pNum = 0    # First partition, F81
t.newComp(partNum=pNum, free=1, spec='empirical')
t.newRMatrix(partNum=pNum, free=0, spec='ones')
t.setNGammaCat(partNum=pNum, nGammaCat=1)
t.setPInvar(partNum=pNum, free=0, val=0.0)
t.setRelRate(partNum=pNum, val=0.9)

pNum = 1   # second partition, F81+G
t.newComp(partNum=pNum, free=1, spec='empirical')
t.newRMatrix(partNum=pNum, free=0, spec='ones')
t.setNGammaCat(partNum=pNum, nGammaCat=4)
t.newGdasrv(partNum=pNum, free=1, val=0.5)
t.setPInvar(partNum=pNum, free=0, val=0.0)
t.setRelRate(partNum=pNum, val=1.1)

If your model is heterogeneous over the data, you can optionally specify the relative rates of the different partitions, for example:

t.setRelRate(partNum=0, val=0.5)
t.setRelRate(partNum=1, val=1.5)

If you want to make these free parameters, you can say:

t.model.relRatesAreFree = 1

Tree-heterogeneous models

The composition or rate matrix of the model can differ over the tree. (The gdasrv can also differ over the tree, but this seems to be less useful.)

To specify it exactly (which is sometimes useful, but it would not generally be useful at the start of an MCMC), first you specify the model attribute, and then you put the model attribute on the tree, for example, if we start with this tree:

|        +--------3:two

and put 2 compositions on it, like this:

A = t.newComp(spec='empirical', symbol='A')
B = t.newComp(spec='empirical', symbol='B')
t.setModelComponentOnNode(A, node=t.root, clade=1)
t.setModelComponentOnNode(B, node=1, clade=1)

then we end up with a tree like this:

|        +BBBBBBBB3:two
Part 0 comps
    0   A
    1   B
    root (node 0) has comp 0, symbol A

Here I have specified 2 compositions, A and B. We place A on the root node, but because we specify clade=1 that composition is applied over the entire tree. Then we place composition B on node 1, also clade-wise, and in that part of the tree B displaces (ie over-rides) A.

An alternative to placing model components on the tree explicitly as above, you can also Tree.Tree.setModelComponentsOnNodesRandomly().

The multinomial or unconstrained likelihood is a property of the data only, and does not need a tree or model. It can only be calculated if there are no gaps or ambiguities in the data. There are 2 ways to calculate it– you can either take the data partitions into account, or not. The former uses the Data method Data.Data.calcUnconstrainedLogLikelihood1(). The result is placed in the data attribute unconstrainedLogLikelihood. The Data method Data.Data.calcUnconstrainedLogLikelihood2() calculates the unconstrained log like of each data partition. Note that the unconstrained log like of the combined data is not the sum of the unconstrained log likes of the separate partitions.