Data
- class Data(alignments=None)
All the alignments that you want to work with, in one place.
Initialize this with one of: nothing (or None), a list of Alignment objects, or a single Alignment object.
If you initialize with nothing (or None), then all alignments in var.alignments are used. If you initialize with a list of alignments, then that is used. You can initialize with an empty list to get an empty Data object.
- bootstrap()
Returns a new data object, filled with bootstrapped data.
It is a non-parametric bootstrap. Data partitions are handled properly: if your data has a charpartition, the bootstrap has the same charpartition, and sites are sampled only from the appropriate charpartition subset.
Generation of random numbers uses the GSL random number generator. The state is held in var.gsl_rng, which is None by default. If you do a bootstrap using this method, it will use var.gsl_rng if it exists, or make it if it does not exist yet. When it makes it, it seeds the state based on the current time. That should give you lots of variation.
If on the other hand you want to make a series of bootstraps that are the same as a previous series, you can reseed the randomizer with the same seed before you do it, like this:
    if not var.gsl_rng:
        var.gsl_rng = pf.gsl_rng_get()
    # unusually, set the seed
    mySeed = 23   # your chosen int seed
    pf.gsl_rng_set(var.gsl_rng, mySeed)
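The partition-aware resampling and the effect of reseeding can be sketched in plain Python. This is an illustration only, not p4's implementation: it uses Python's `random` module rather than the GSL generator, and the function name `bootstrap_columns` and the representation of a charpartition as lists of column indices are assumptions made for the example.

```python
import random

def bootstrap_columns(partition_columns, rng):
    """Resample column indices with replacement, separately within each subset.

    partition_columns is a hypothetical stand-in for a charpartition:
    a list of lists of column indices, one inner list per subset.
    """
    resampled = []
    for columns in partition_columns:
        # Each subset keeps its size and draws only from its own columns.
        resampled.append([rng.choice(columns) for _ in columns])
    return resampled

partition_columns = [[0, 1, 2, 3], [4, 5, 6, 7, 8, 9]]

# Reseeding with the same seed reproduces the same bootstrap sample.
first = bootstrap_columns(partition_columns, random.Random(23))
second = bootstrap_columns(partition_columns, random.Random(23))
```

Because both generators start from the seed 23, `first` and `second` hold the same resampled columns, which mirrors what reseeding var.gsl_rng is meant to achieve.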
- calcUnconstrainedLogLikelihood1()
Calculate likelihood under the multinomial model.
This calculates the unconstrained (multinomial) log likelihood without regard to character partitions. The result is placed in the data variable unconstrainedLogLikelihood. If there is more than one partition, it makes a new temporary alignment and puts all the sequences in one part in that alignment, so it ultimately only works on one data partition. If there is more than one alignment, there is possibly more than one datatype, and so this method will refuse to do it. Note that the unconstrained log likelihood of the combined data is not the sum of the unconstrained log likelihoods of the separate partitions.
See also calcUnconstrainedLogLikelihood2
- calcUnconstrainedLogLikelihood2()
Calculate likelihood under the multinomial model.
This calculates the unconstrained log likelihood of each data partition and places the sum in the Data (self) variable unconstrainedLogLikelihood. Note that the unconstrained log likelihood of the combined data is not the sum of the unconstrained log likelihoods of the separate partitions. See also calcUnconstrainedLogLikelihood1
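The difference between the two methods can be illustrated with a small pure-Python sketch. It assumes the usual multinomial form, the sum over distinct site patterns of n * ln(n / N), where n is the count of a pattern and N the number of sites. The function name `unconstrained_log_like` and the toy sequences are made up for the example; this is not p4's code.

```python
import math
from collections import Counter

def unconstrained_log_like(sequences):
    """Multinomial log likelihood: sum over site patterns of n * ln(n / N)."""
    n_sites = len(sequences[0])
    # A site pattern is the column of characters at one site.
    patterns = Counter(tuple(seq[i] for seq in sequences) for i in range(n_sites))
    return sum(n * math.log(n / n_sites) for n in patterns.values())

part_a = ["AAAC", "AAAC"]   # toy partition with two site patterns
part_b = ["GGTT", "GGTT"]   # toy partition with two site patterns
combined = [a + b for a, b in zip(part_a, part_b)]

# Analogous to summing per-partition values (calcUnconstrainedLogLikelihood2)
sum_of_parts = unconstrained_log_like(part_a) + unconstrained_log_like(part_b)
# Analogous to treating everything as one part (calcUnconstrainedLogLikelihood1)
all_in_one = unconstrained_log_like(combined)
```

Here `all_in_one` is lower than `sum_of_parts`, because the same pattern counts are divided by the larger combined number of sites.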
- compoChiSquaredTest(verbose=1, skipColumnZeros=0, useConstantSites=1, skipTaxNums=None, getRows=0)
A chi square composition test for each data partition.
So you could do, for example:
    read('myData.nex')
    # Calling Data() with no args tells it to make a Data object
    # using all the alignments in var.alignments
    d = Data()
    # Do the test.  By default it is verbose, and prints results.
    # Additionally, a list of lists is returned
    ret = d.compoChiSquaredTest()
    # With verbose on, it might print something like ---
    # Part 0: Chi-square = 145.435278, (dof=170) P = 0.913995
    print(ret)
    # The list of lists that it returns might be something like ---
    # [[145.43527849758556, 170, 0.91399521077908041]]
    # which has the same numbers as above, with one
    # inner list for each data partition.
If your data has more than one partition:
    read('first.nex')
    read('second.nex')
    d = Data()
    d.compoChiSquaredTest()
    # Output something like ---
    # Part 0: Chi-square = 200.870463, (dof=48) P = 0.000000
    # Part 1: Chi-square = 57.794704, (dof=80) P = 0.971059
    # [[200.87046313430443, 48, 0.0], [57.794704451018163, 80, 0.97105866938683427]]
where the last line is returned. With verbose turned off, the Part N lines are not printed.
This method returns a list of lists, one for each data partition. If getRows is off, the default, then it is a list of 3-item lists, and if getRows is turned on then it is a list of 4-item lists. In each inner list, the first item is the X-squared statistic, the second is the degrees of freedom, and the third is the probability from chi-squared. (The expected values come from the data.) If getRows is turned on, the 4th item is a list of X-squared contributions from individual rows (i.e. individual taxa), which together sum to the X-squared for the whole partition as found in the first item. This latter way is the way that Tree-Puzzle does it.
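As a rough illustration of how the statistic and its per-row contributions relate, here is a pure-Python sketch of a composition X-squared for one partition. It is not p4's implementation: the function name `compo_x_squared`, the DNA alphabet, and the degrees of freedom taken as (rows - 1) * (columns - 1) are assumptions for the example, and the chi-squared probability is left out. Characters outside the alphabet are simply not counted.

```python
from collections import Counter

def compo_x_squared(sequences, alphabet="ACGT"):
    """Return (X-squared, dof, per-row contributions) for one partition."""
    observed = []
    for seq in sequences:
        counts = Counter(seq)
        observed.append([counts[ch] for ch in alphabet])
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    row_contributions = []
    for row, row_total in zip(observed, row_totals):
        contribution = 0.0
        for obs, col_total in zip(row, col_totals):
            # Expected count comes from the data: row total * column total / N.
            expected = row_total * col_total / grand_total
            contribution += (obs - expected) ** 2 / expected
        row_contributions.append(contribution)

    dof = (len(sequences) - 1) * (len(alphabet) - 1)
    return sum(row_contributions), dof, row_contributions

x_sq, dof, rows = compo_x_squared(["AACCGGTT", "AAACGGTT", "AACCGGGT"])
```

The list `rows` plays the role of the 4th item returned when getRows is on: one contribution per taxon, summing to `x_sq`.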
Note that this ostensibly tests whether the data are homogeneous in composition, but it does not work on sequences that are related. That is, testing whether the X^2 stat is significant using the chi^2 curve has a high probability of type II error for phylogenetic sequences.
However, the X-squared stat can be used in valid ways. You can simulate data under the tree and model, and so generate a valid null distribution of X^2 values from the simulations, by which to assess the significance of the original X^2. You can use this method to generate X^2 values.
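Assessing the original X^2 against simulated values can be sketched as a simple tail-area calculation. The function name `tail_area_probability` and the numbers below are made up for illustration; in practice the simulated values would come from data simulated under your tree and model.

```python
def tail_area_probability(original_stat, simulated_stats):
    """Fraction of simulated statistics at least as large as the original."""
    n_as_large = sum(1 for s in simulated_stats if s >= original_stat)
    return n_as_large / len(simulated_stats)

# Made-up X^2 values standing in for results from ten simulated data sets.
simulated = [12.1, 15.8, 9.4, 20.3, 11.0, 14.2, 17.9, 10.5, 13.3, 16.6]
p_value = tail_area_probability(17.0, simulated)
```

With these numbers two of the ten simulated values are at least 17.0, so `p_value` is 0.2.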
A problem arises when a composition of a character is zero. If that happens, we can’t calculate X-squared because there will be a division by zero. If skipColumnZeros is set to 1, then those columns are simply skipped. They are silently skipped unless verbose is turned on.
So let's say that your original data have all characters, but one of them has a very low value. That is reflected in the model, and when you do simulations based on the model you occasionally get zeros for that character. Here it is up to you: you could say that the data containing the zeros are validly part of the possibilities and so should be included, or you could say that the data containing the zeros are not valid and should be excluded. You choose between these by setting skipColumnZeros. Note that if you do not set skipColumnZeros, and then you analyse a partition that has column zeros, the result is None for that partition.
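The two behaviours can be sketched on a table of counts (taxa in rows, characters in columns). This is only an illustration of the idea, not p4's code; the function name `x_squared_from_counts` and its `skip_column_zeros` argument are hypothetical stand-ins for the skipColumnZeros option.

```python
def x_squared_from_counts(observed, skip_column_zeros=False):
    """X-squared for a taxa-by-character count table, or None if undefined."""
    col_totals = [sum(col) for col in zip(*observed)]
    if 0 in col_totals:
        if not skip_column_zeros:
            return None  # an expected count would be zero: division by zero
        # Drop the all-zero columns and carry on with the rest.
        keep = [i for i, total in enumerate(col_totals) if total != 0]
        observed = [[row[i] for i in keep] for row in observed]
        col_totals = [col_totals[i] for i in keep]
    row_totals = [sum(row) for row in observed]
    grand_total = sum(row_totals)
    x_sq = 0.0
    for row, row_total in zip(observed, row_totals):
        for obs, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand_total
            x_sq += (obs - expected) ** 2 / expected
    return x_sq

# The third character is absent from every taxon.
counts = [[5, 3, 0, 2], [4, 4, 0, 2]]
without_skipping = x_squared_from_counts(counts)
with_skipping = x_squared_from_counts(counts, skip_column_zeros=True)
```

Without skipping, the result is None; with skipping, the statistic is the same as for the table with the empty column removed.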
Another problem occurs when a partition is completely missing a sequence. Of course that sequence does not contribute to the stat. However, in any simulations that you might do, that sequence will be there, and will contribute to the stat. So you will want to skip that sequence when you do your calcs from the simulation. You can do that with the skipTaxNums arg, which is a list of lists. The outer list is nParts long, and each inner list is a list of taxNums to exclude.
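One way to build such a list of lists is sketched below. The helper `all_missing_tax_nums` is hypothetical and not part of p4; it assumes zero-based taxon numbers, that each partition is given as a list of sequence strings in the same taxon order, and that '-' and '?' mark missing data.

```python
def all_missing_tax_nums(parts, missing="-?"):
    """For each partition, list the taxon numbers whose sequence is all missing."""
    skip = []
    for sequences in parts:
        skip.append([tax_num for tax_num, seq in enumerate(sequences)
                     if all(ch in missing for ch in seq)])
    return skip

parts = [
    ["ACGT", "ACGA", "ACGG"],   # part 0: every taxon has data
    ["TTGA", "----", "TTGC"],   # part 1: taxon 1 is completely missing
]
skip_tax_nums = all_missing_tax_nums(parts)
```

The result has one inner list per partition, here an empty list for part 0 and a list holding taxon 1 for part 1, which is the shape that the skipTaxNums arg is described as expecting.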
- meanNCharsPerSite()
Mean number of different characters per site
Constant sites are not ignored. Ambiguities and gaps are ignored.
This is implemented in C, allowing multiple parts. It is also implemented in pure Python in the Alignment class, for single parts (which also optionally gives you a distribution in addition to the mean); see Alignment.Alignment.meanNCharsPerSite().
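The quantity itself can be sketched in a few lines of plain Python for a single part. This is an illustration, not the C or Alignment-class implementation: the function name `mean_n_chars_per_site` and the DNA alphabet are assumptions, and a site made up entirely of gaps or ambiguities is counted here as having zero characters, which the text above does not specify.

```python
def mean_n_chars_per_site(sequences, alphabet="ACGT"):
    """Mean number of distinct unambiguous characters per site."""
    n_sites = len(sequences[0])
    total = 0
    for i in range(n_sites):
        # Gaps and ambiguity codes are not in the alphabet, so they are ignored.
        total += len({seq[i] for seq in sequences if seq[i] in alphabet})
    return total / n_sites

aligned = ["ACGT", "ACG-", "AAGN", "ATGT"]
mean_chars = mean_n_chars_per_site(aligned)
```

The four sites have 1, 3, 1 and 1 distinct characters, so `mean_chars` is 1.5. The two constant unambiguous sites still contribute to the mean.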
- simpleBigXSquared()
No frills calculation of bigXSquared.
As in Data.Data.compoChiSquaredTest(), but with no options, and hopefully faster. It can’t handle gaps or ambiguities. It should be ok for simulations. It returns a list of bigXSquared numbers, one for each data partition.
If a character happens to not be there, then a column will be zero, and so it can’t be calculated. In that case -1.0 is returned for that part.
- simpleConstantSitesCount()
No frills constant sites count.
It can’t handle gaps or ambiguities. It should be ok for simulations. It returns a list of constant sites counts, one for each data partition.
For each part, of the sites that are not all gaps+ambigs, if the sites that are not gaps or ambigs are all the same, then it is considered here to be a constant site.
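The rule in the last sentence can be sketched for a single part as follows. The function name `constant_sites_count` and the DNA alphabet are assumptions for the example, and this is not the code that p4 runs.

```python
def constant_sites_count(sequences, alphabet="ACGT"):
    """Count sites whose unambiguous characters are all the same."""
    count = 0
    for column in zip(*sequences):
        states = {ch for ch in column if ch in alphabet}
        # Sites that are all gaps or ambiguities have no states and are skipped.
        if len(states) == 1:
            count += 1
    return count

n_constant = constant_sites_count(["AC-T", "AG-T", "A?-N"])
```

Here the first and last sites are counted as constant, the second is variable, and the third is all gaps, so `n_constant` is 2.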
- writeNexus(fName=None, writeDataBlock=0, interleave=0, flat=0, append=0)
Write all the alignments in self to a Nexus file.
If writeDataBlock=1, then taxa and characters are written to a ‘data’ block, rather than the default, which is to write separate ‘taxa’ and ‘characters’ blocks.
Arg ‘flat’ gives sequences all on one line. Arg ‘append’, if 0, writes #NEXUS first. If 1, does not write #NEXUS.