# Alignment¶

class Alignment

A SequenceList where all the sequences are the same length and dataType.

From SequenceList, this inherits

Alignment objects additionally have the length, which has a synonym nChar, and is also accessible via len(self).

Also:

• nTax, the number of sequences
• taxNames, a list of the names of the sequences
• dataType, a NEXUS-centric string — ‘dna’, ‘protein’, or ‘standard’
• symbols, the character symbols, not including equates or ambiguities
• equates, a dictionary of NEXUS-style ambiguities, eg 'r':['a','g'] for DNA
• dim, the dimension, ie the number of symbols
• nexusSets, a NexusSets.NexusSets object, usually made from a copy of var.nexusSets, but made specific to self

 p4.sequencelist.SequenceList.checkNamesForDupes checkForAllGapColumns Check for alignment columns that are made of only gap or ? chars. checkForBlankSequences Like it says, with verbose output. checkForDuplicateSequences Like it says, with verbose output. checkLengthsAndTypes Last checks after reading an Alignment.

The checks above are under the control of some variables in Var, always available to you as the instance var.

• var.doCheckForDuplicateSequenceNames
• var.doRepairDupedTaxonNames
• var.doCheckForAllGapColumns
• var.doCheckForBlankSequences
• var.doCheckForDuplicateSequences
Writing
 writeNexus Write self in Nexus format. writePhylip Write the alignment in Phylip format. p4.sequencelist.SequenceList.writeFasta Write out the sequences in Fasta format. writeMolphy Another phylogenetics program, another format.
Copy self
 dupe Duplicates self, with no c-pointers. noGapsOrAmbiguitiesCopy Returns a new Alignment with sites with gaps or ambiguities removed. bootstrap Returns a new Alignment, a bootstrap resampling of self. p4.data.Data.bootstrap Returns a new data object, filled with bootstrapped data. sequenceSlice Returns a list composed of the characters from the alignment at position pos.
Masks are strings that are nChar long, usually just 0s and 1s
 constantMask Returns a mask string with 1 at constant sites, and 0 otherwise. simpleConstantMask Returns a mask with 1 at simple constant sites, and 0 otherwise. gappedMask Returns a mask string with 1 at positions with any gaps, and 0 otherwise. getAllGapsMask If its all gaps in a site, its 1, else 0. getEnoughCharsMask Mask out sites that have too many gaps. getLikelihoodTopologyInformativeSitesMask Make and return a mask for those sites that are likelihood informative about the topology. getMaskForAutapomorphies Return a mask for autapomorphies getMaskForCharDiversity Return a mask for a given character diversity

You can also make masks with p4.func.maskFromNexusCharacterList(). You can combine masks bitwise with the Alignment methods andMasks() and orMasks()

CharSets and CharPartitions

A NEXUS-style charSet or charPartition is usually defined using a NEXUS sets block, as explained in p4.nexussets.NexusSets. A NEXUS sets block might be something like:

begin sets;
charset pos1 = 1 - .;
charset pos2 = 2 - .;
charset pos3 = 3 - .;
end;


or

begin sets;
charset gene1 = 1 - 103;
charset gene2 = 104 - 397;
charpartition by_gene = gene1:gene1, gene2:gene2;
end;


However, you can also make new charSets from an Alignment.nexusSets with a mask using p4.nexussets.NexusSets.newCharSet()

CharPartitions must cover the entire length of self, without overlap of the character partition subsets.

Whenever you do something like Alignment.subsetUsingCharSet() that requires self.nexusSets, if self.nexusSets does not exist yet then Alignment.setNexusSets() is automatically called to make it — so usually you do not need to do Alignment.setNexusSets().

 Alignment.excludeCharSet Exclude a CharSet. Alignment.setCharPartition Partition self into Parts based on charPartitionName. Alignment.setGBlocksCharSet Find conserved regions of an alignment using Gblocks Alignment.setNexusSets Set self.nexusSets from var.nexusSets. p4.nexussets.NexusSets.newCharSet p4.nexussets.NexusSets.dupeCharSet
Extracting subsets
 Alignment.subsetUsingCharSet Return a subset of self based on a charSet. Alignment.subsetUsingMask Returns a subset of self based on a mask.
Recoding the data into groups
 Alignment.recodeDayhoff Recode protein data into Dayhoff groups, in place. Alignment.recodeProteinIntoGroups Recode protein data into user-specified groups, in place. Alignment.recodeRY Recode DNA data into purines and pyrimidines, in place.
Translating DNA to protein
 Alignment.translate Returns a protein alignment from self, a DNA alignment. Alignment.checkTranslation Check that self translates to theProteinAlignment.
andMasks(maskA, maskB)

Given two masks, this logically and’s the string chars.

Only zero and ‘1’ chars are allowed, and returned.

Eg. 0010 with 1010 will return 0010. and 0010 with 1000 will return 0000.

bluntEndLigate(alig, allowDifferentDataTypes=False)

Attaches alig to the end of self.

Various checks are made. If the sequences are in a different order in the two alignments then it will not work.

See the method Alignment.concatenate(), which can handle missing and out-of-order sequences.

bootstrap()

Returns a new Alignment, a bootstrap resampling of self.

This is done only in Python (ie no cParts involved), and does not handle partitioned data. If you want to bootstrap partitioned data, use the p4.data.Data.bootstrap() method.

changeDataTypeTo(newDataType, newSymbols=None, newEquates=None)

Coerce the alignment to be a new datatype.

This would be good for pathological cases where eg DNA with lots of ambigs is mistaken for protein.

It is not sufficient to simply change the dataType – we must change the symbols and equates as well. And the dataType for all the sequences in self. If you are changing to ‘standard’ dataType, then you need to specify the symbols and the equates. The symbols is a string, and the equates is a dictionary (see eg yourAlignment.equates for a DNA alignment to see the format).

checkForAllGapColumns(returnMask=False)

Check for alignment columns that are made of only gap or ? chars.

By default, p4 does this on new alignments that are read. It is under the control of var.doCheckForAllGapColumns.

If there are all gap columns then a verbose output is printed and a P4Error is raised, unless returnMask is set, in which case no output is printed, no P4Error is raised, but the mask is returned.

checkForBlankSequences(removeBlanks=False, includeN=True, listSeqNumsOfBlanks=False)

Like it says, with verbose output.

If the p4 variable var.doCheckForBlankSequences is set, this method, with removeBlanks=False, is called automatically every time an alignment is read in. If you would rather it not do that, turn var.doCheckForBlankSequences off.

When these automatic checks are done, if any blank sequences are found then a verbose P4Error is raised, inviting you to run it again with removeBlanks turned on.

Sometimes you just want to know what sequences are blank; get the sequence numbers by turning arg listSeqNumsOfBlanks on. That returns a list of sequence numbers, without removing blanks or raising a P4Error.

If removeBlanks is set, the blank sequences are removed. This option is not set by default.

If this method removes sequences, it returns the number of blank sequences removed.

Blank sequences are defined as sequences wholly composed of ‘?’ and ‘-‘. If includeN is turned on (which it is by default) then ‘n’ is included for DNA, and ‘x’ for protein. If that is done, that means, for DNA, that it is blank if it is wholly composed of ‘?’, ‘-‘, and ‘n’.

checkForDuplicateSequences(removeDupes=False, makeDict=True, dictFileName='p4DupeSeqRenameDict.py', dupeBaseName='p4Dupe')

Like it says, with verbose output.

If the p4 variable var.doCheckForDuplicateSequences is set, this method, with removeDupes=False, is called automatically every time an alignment is read in. If you would rather it not do that, turn var.doCheckForDuplicateSequences off.

When these automatic checks are done, if any dupe sequences are found then a verbose warning is issued, inviting you to run it again with removeDupes turned on.

If removeDupes is set, the duplicate sequences after the first are removed. This option is not set by default.

If both removeDupes and makeDict are set, then it will rename the first sequence to p4Dupe1 (or p4Dupe2, and so on — the dupeBaseName is ‘p4Dupe’ by default, but it can be set as an arg) and make a dictionary to hold the other names, and write that dictionary to a file (by default p4DupeSeqRenameDict.py). The option makeDict is set by default, but it won’t happen unless removeDupes is also set, and there are dupes to be removed.

checkLengthsAndTypes()

Last checks after reading an Alignment.

Make sure the sequence lengths and dataType are all the same. Set self.length, self.dataType, self.symbols, self.dim, and self.equates

checkTranslation(theProteinAlignment, transl_table=1, checkStarts=False)

Check that self translates to theProteinAlignment.

Self should be a DNA alignment. It is translated using p4.geneticcode.GeneticCode.translate() (so it should handle ambiguities) and compared against theProteinAlignment. The theProteinAlignment sequence order, names, and gap pattern should be the same as in the DNA alignment. The default transl_table is the standard (or so-called universal) genetic code.

Other available translation tables, this week:

if transl_table == 1:   # standard
elif transl_table == 2: # vertebrate mito
elif transl_table == 4: # Mold, Protozoan,
# and Coelenterate Mitochondrial Code
# and the Mycoplasma/Spiroplasma Code
elif transl_table == 5: # invertebrate mito
elif transl_table == 9: # echinoderm mito

# and now 6, 10, 11, 12, 13, 14, 21.


(These are found in GeneticCode)

If the arg checkStarts is turned on (by default it is not turned on) then this method checks whether the first codon is a start codon.

composition(sequenceNumberList=None)

Returns a list of compositions.

This returns a list of floats, the composition of the sequence(s) in the sequenceNumberList. If the sequenceNumberList=None (the default), then the overall composition is given, which is the mean of the individual sequences (the sequence comps are not weighted by the sequence length, ie the non-gap sequence length). For DNA, the order is acgt. For protein, the order is arndcqeghilkmfpstwyv. For standard, its the order in symbols. Gaps and questionmarks are ignored. Equates are handled properly, iterating to the final comp.

compositionEuclideanDistanceMatrix()

This returns a DistanceMatrix based on composition.

The formula is as given in Lockhart et al 94, the logDet paper. Its equation 4 there, page 608. One pairwise distance is the square root of the sum of the squares of the differences between the frequencies of character states of one pair of sequences.

concatenate(alig, sNames)

Attaches alig to the end of self.

You need to provide a list, argument sNames, of taxon names that are found in self and alig. This will determine the order of the taxa in self, the result.

It will still work if the sequences are in a different order in the two alignments.

The order of the taxa in sNames need not be the same order as either alignment. So this method can change the order of the sequences in self.

It will still work if there are missing sequences in either self or alig. It will add blank sequences as needed.

Parameters: alig (Alignment) – The other Alignment object. sNames (list) – A list of all the taxon names found in self and alig
constantMask(invert=None)

Returns a mask string with 1 at constant sites, and 0 otherwise.

See also simpleConstantMask()

Constant sites are defined in a PAUP-like manner. If, when all possibilities of ambiguities are tried, the site could possibly be constant, then it is constant.

So if a site is all gaps or missing (or a combination), or if all the non-gap chars are the same, then it is a constant site.

If a DNA site contains both r and y, it could not possibly be constant. A site containing only a and r is constant, but a site containing only a and y cannot be constant.

constantSitesCount()

Counts the sites that potentially have the same thing in each sequence.

Constant sites are defined in a PAUP-like manner. See the doc string for Alignment.constantMask().

constantSitesProportion()

Returns the proportion of the alignment that have possible constant sites.

It uses constantSitesCount()

covarionStats(listA, listB, verbose=True)

Calculates some covarion statistics.

After Pete Lockhart’s covarion paper. Reference? Looks at the two groups of sequences defined by listA and listB, and determines the number of sites that are constant overall, constant in both groups but in a different character, constant in one but variable in the other, variable in one and constant in the other, and variable overall. The lists can be either sequence numbers (zero-based) or names. If verbose, (the default) it prints a nice summary, with a tiny explanation. Returns a tuple.

dataType = None

Nexus-centric data type. One of dna, rna, protein, or standard.

dim = None

The number of symbols, eg 4 for DNA.

dump()

dupe()

Duplicates self, with no c-pointers. And no parts

equates = None

A dictionary of NEXUS-style equates, eg r=[a,g] in DNA

excludeCharSet(theCharSetName)

Exclude a CharSet.

gappedMask(invert=None)

Returns a mask string with 1 at positions with any gaps, and 0 otherwise.

getAllGapsMask(andMissing=True)

If its all gaps in a site, its 1, else 0.

If ‘andMissing’ is turned on, then if a site is all gaps or ‘?’, then the mask is 1. If ‘andMissing’ is turned off, then it is strictly gaps.

getCharDiversityDistribution()

Distribution of different chars per site, as a distribution.

Gaps and ambiguities are ignored.

A tuple is returned, composed of

index
0 num of autapomorphies
1 num of simple constant sites
2 num of sites with 2 kinds of char
3 num of sites with 3 kinds of char
... ...
nSymbols num of sites with some of each char

Autapomorphies are sites with two kinds of characters, where there is only one of one of the characters.

So if there are 10 sites with 2 kinds of char, and 4 of those are autapomorphies, 30 constant sites, and 20 sites with 3 kinds of character, we would have a distro like this — (4, 30, 10, 20, ...)

This is in pure Python, and can be used for Alignments with one part.

Returns: a tuple of ints, showing counts of sites with different diversities.
getEnoughCharsMask(perCent=85)

Mask out sites that have too many gaps.

This returns a mask – a string of 1’s and 0’s.

If perCent or more fraction of chars in a site are good, its a 1. Else 0.

Good means not gaps, missing, or ambigs. Good means in self.symbols.

getKosiolAISGroups(tree, n_bins, remove_files=False, verbose=True)

An interface for Kosiol’s program AIS, for grouping amino acids.

Writes three files: equi, q, and evec that are required as input to the software AIS (Almost Invariant Sets).

• tree = tree object with optimised model compatible with the alignment
• nBins = the number of groups required

Kosiol, C., Goldman, N. and Buttimore, N. (2004). A new criterion and method for amino acid classification. J Theo Biol 228: 97-106.

From share/examples/kosiol_ais.html This writes files.

getLikelihoodTopologyInformativeSitesMask()

Make and return a mask for those sites that are likelihood informative about the topology.

Mostly this means no constant and no autapomorphic sites. Autapomorphies are sites that are all one character except for one taxon that has another character.

The rules, this week–

Its not informative if:

• If there are no ambigs or gaps, then constants and autapomorphies are not informative.

• If there are gaps but no ambigs,

• if there are only 2 characters or fewer – not informative
• if constant + gaps – not informative
• if autapomorphy + gaps – not informative
• If there are ambigs but no gaps,

• if constant except for a single ambig, then not informative
• (an autapomorphy plus a single ambig can sometimes be informative)

Otherwise, I’m saying that it is informative. That includes

• sites with gaps plus ambigs, and
• sites with more than one ambig

I am conservatively calling some sites as informative, eg sites with more than one ambiguity, even though sites like that are often not topologically informative, as they might be under some models– I have not checked thoroughly.

Because of the above, we cannot be sure that each site indicated by a 1 in the mask is informative, but we can be sure that each 0 in the mask is uninformative.

getMaskForAutapomorphies()

Returns a string of zeros and 1s, where the 1s are autapomorphic sites.

Gaps and ambiguities are ignored.

In an alignment site, if there are two kinds of character states (ie the character diversity is 2) and one of the character states occurs once only, then it will be called an autapomorphy (or singleton).

Returns: a string nChar long, of zeros and 1s.
getMaskForCharDiversity(diversity=1)

Return a mask for a given character diversity

Gaps and ambiguities are ignored.

This makes a mask string with zeros and ones, with 1’s showing the sites with the chosen diversity, and zero for other sites.

So for example, if diversity is 1, that means constant sites, and this method returns a simple constant sites mask, with gaps and ambigs ignored.

If the diversity is 2, it will return a mask showing sites with two kinds of characters. And so on.

Returns: a string nChar long, of zeros and 1s.
getMinmaxChiSqGroups(percent_cutoff=0.05, min_bins=2, max_bins=20, n_choices=1000, seed=42, verbose=False)

An interface for Susko and Roger’s minmax-chisq program.

Susko, E. and Roger, A.J. (2007). On reduced amino acid alphabets for phylogenetic inference. Mol. Biol. Evol. 24:2139-2150.

• percent_cutoff - P-value percentage above which chi-squared test is considered homogeneous. Default is 0.05.
• min_bins - The minimum number of bins to consider. The default value is 2.
• max_bins - The maximum number of bins to consider. The default value is 20.
• n_choices - number. The number of random choices of bins to consider for each bin size. The default is 1000.
• seed - An integer giving the starting seed for any random generation done by the program. By default this is 42.

All in memory – no files written.

getSimpleBigF(iA, iB)

Make and return a simple bigF between two sequences.

Eg:

seqA acgtacgt
seqB acgtcgta

bigF =
[[ 1.  1.  0.  0.]
[ 0.  1.  1.  0.]
[ 0.  0.  1.  1.]
[ 1.  0.  0.  1.]]


Sites with gaps and ambigs are ignored – thats what makes it ‘simple’.

It returns a 2-D Numpy array of floats.

hasGapsOrAmbiguities()

Asks whether self has any gaps or ambiguities.

initDataParts()
logDet(correction='TK02', doPInvarOfConstants=True, pInvar=None, pInvarOfConstants=None, missingCharacterStrategy='fudge', minCompCount=1, nonPositiveDetStrategy='invert')

logDet calculations. Returns a DistanceMatrix object.

Heavily influenced by paup and by LDDist. Thanks to Swofford and Thollesson, the authors.

The basis of the log det is the Fxy matrix, the matrix of observed changes between 2 sequences. It is from the Fxy that the determinant is calculated. The negative of the log of the determinant (‘logdet’) is related to the evolutionary distance separating the 2 sequences.

Corrections

The logdet can be corrected so that it expresses the distance between 2 sequences in the usual terms of mutations per site. There are 2 different corrections– that from Lockhart94 Eqn3 (pg 606), and Tamura and Kumar 2002. Which one is given by the arg ‘correction’, which should be one of ‘L94’ or ‘TK02’. The former is used in paup, and the latter is used in LDDist.

Ambiguities

In sequence pairs, when there are 2 N-like chars (n, gap, ?), they are completely ignored. Other combinations are not ignored. Consider this alignment:

acgtr
acgta


Now you might think that r-a would mean either a-a or g-a, but in the present implementation it is slightly different– the ambiguity is added to Fxy in proportion to the observed unambiguous counts. So in this case there was an observed a-a, but no g-a, so r-a is considered to be wholly a-a. Now consider the alignment:

aagcgtr
aaacgta


Here r-a is considered to be both a-a and g-a, in proportion to their occurrence, ie 2/3 a-a and 1/3 g-a.

pInvar

We can deal with pInvar in a couple of ways. In both ways, we subtract some numbers from the diagonal of the un-normalized Fxy matrix, in proportion to the composition of the constant sites. The paup-like way is simply to remove a pInvar * nSites * comp[i] from each un-normalized Fii. The LDDist-like way is to identify which sites are constant, and then to remove some pInvarOfConstants from the Fii. Do the former by setting doPInvarOfConstants=False, and do the latter by setting doPInvarOfConstants=True. To explain the difference, consider this alignment of 10 chars and 3 sequences:

A acgtacgtta
B acgtacgttg
C acgtacgtag
++++++++--  constant


The un-normalized Fxy matrix for the A-B pair is:

[[ 2.  0.  1.  0.]
[ 0.  2.  0.  0.]
[ 0.  0.  2.  0.]
[ 0.  0.  0.  3.]]


There are 8 constant sites, and those constant sites have equal base freqs. Say I want pInvar=0.8, and I use a paup-like strategy by setting doPInvarOfConstantSites=False. So I remove 0.8 * 10 * 0.25 = 2. from each un-normalized Fii. That leaves zeros on the diagonal and so it fails. I can have pInvar=0.79 and that works, but if I try pInvar=0.8 or more than that then it fails.

Now for more realistic sequences the Fii will be more than the pInvar*nSites*comp[i] because there will be some Fii contribution from non-constant sites; so often the Fii won’t go to zero. When that is the case, you will be able to successfully set pInvar to be equal to or even more than the proportion of constant sites. It seems strange to me to be able to set the pInvar higher than the proportion of constant sites, but paup allows you to do that, and p4 imitates it when doPInvarOfConstants=False.

The LDDist-like strategy is to first identify the constant sites in the alignment, and only allow removal of that many counts from the un-normalized Fii. I do that by setting doPInvarOfConstants=True, and I set pInvarOfConstants from 0-1.0. So in the alignment above, only if I set pInvarOfConstants=1.0 I will get some Fii = 0, but if pInvarOfConstants is anything less than 1.0 its ok. (For more realistic alignments there should be no trouble with pInvarOfConstants=1.0, as some Fii will come from non-constant sites).

Constant sites

Here constant sites are defined in a very simple way, different from the Alignment.constantSites*() methods. Here, a site is constant if it is constant and unambiguous; if it has any ambigs or gaps then it is not constant. In the Alignment.constantSites*() methods, a site is constant if it could possibly be constant in any resolution of ambigs or gaps.

It appears that paup has a slightly different definition of a constant site, which comes in to play when there are ambigs. It also appears that paup has a different way of calculating the composition of those constant sites.

Dealing with missing (or low-frequency) characters

If any character in any pairwise comparison in either sequence is missing then the calculation fails because of the correction. (The correction involves the log of the product of all the compositions; if any comp is zero then the product will be zero, and the log will be undefined — Boom!). There are 3 strategies implemented for dealing with missing chars–

1. refuse

This is the way paup does it. Refuse to calculate. Make up a ‘big distance’ later. By default in paup, the ‘big distance’ is twice the biggest defined distance (ie defined elsewhere in the distance matrix, from some other pairwise comparison). Of course this will fail if all the pairs are refused, leaving no defined distances.

2. fudge

This is the way that LDDist does it. LDDist replaces zeros with 0.5 in the un-normalized Fxy martrix. The present implementation replaces zeros on the diagonal of Fxy with either 0.5 or half of the smallest positive Fii, whichever is smaller.

3. reduce

Robert Hirt’s idea is that if the comp of any char in either of the pair of sequences is zero or very small, then probably all of the Fxy data that involve that particular character are going to unreliable, and so should not be used. So in this case the entire row and entire column for that character is removed if the comp is less than the arg ‘minCompCount’ (default is 1). If a reduced Fxy matrix, made by removal of a certain row-column, is needed for one pairwise comparison, then that row-column is removed from all pairwise compares in the alignment, even if the char in that row-column is in high frequency in other sequence pairs.

Non-positive determinants

Sometimes for very diverged sequences the determinant of the Fxy matrix is zero or less, which means that the logarithm is undefined. (This can happen even if all Fii are positive). There are a couple of strategies for dealing with that.

1. refuse :: This is the way that paup does it. Same as above.

2. invert :: This is the way that LDDist does it. If you get a

negative determinant, the sign is simply inverted. This is a hack, without any reason that I can see except to make the distance calculable. Fortunately when the det goes below zero, it only goes a little below zero, and so it does not matter much. Now this strategy of course won’t do anything for zero dets (for which LDDist throws an error), so for zero dets I simply make it a very small positive number.

matchedPairsTests(doProbs=True)

Get all Ababneh et al 2006 matched pairs stats, and, optionally, probabilies.

Returns 3 (or, if doProbs is turned on, 6) DistanceMatrix objects. If not doProbs, returns QB, QS, and QR (QR=Internal) matrices. If doProbs is turned on (the default), returns QB, QS, QR, PB, PS, PR.

For example:

a = var.alignments[0]
QB, QS, QR, PB, PS, PR = a.matchedPairsTests()


or:

QB, QS, QR = a.matchedPairsTests(doProbs=False)

meanNCharsPerSite(showDistribution=True)

Mean number of different chars per site, only of the variable sites.

Constant sites are ignored. Gaps and ambiguities are ignored.

This is in pure Python, and can be used for Alignments with one part. It is also implemented in c in the Data class, which allows more than one part (but no distribution).

mrpSlice(pos, zeroBasedNumbering=True)

Pretty-print a mrp site, with no ‘?’ positions.

Zero-based numbering, unless arg zeroBasedNumbering is set to False.

nChar

A synonym for Alignment.length

nEquates

A property – the number of equates

nTax

A property – the number of sequences

nexusSets = None

A NexusSets object, perhaps copied from var.nexusSets and made specific to self. You can do a p4.nexussets.NexusSets.dump() on it to see what is in there.

noGapsOrAmbiguitiesCopy()

Returns a new Alignment with sites with gaps or ambiguities removed.

This makes another Alignment instance by copying over the sequences site by site as long as there are no gaps or ambiguities.

orMasks(maskA, maskB)

Given two masks, this logically or’s the string chars.

Only zero and ‘1’ chars are allowed, and returned.
Eg. 0010 with 1000 will return 1010.
and 0010 with 1010 will return 1010.
pDistances(ignoreGaps=True, divideByNPositionsCompared=True)

Returns a DistanceMatrix of mean character distances or pDistances.

The default behaviour is that only pairwise positions are compared in which both sequences have a character. The sum of differences is then divided by the number of positions compared.

Its not the same as paup p-dists when there are gaps or ambiguities. For example, the pDistances between the sequences ‘acgaa’ and ‘aaa–’ is 0.4 in paup, but 0.667 here. Paup would call these p4 distances ‘mean character distances’ (dset distance=mean).

Ambiguities are not handled intelligently. Ambigs are just another character. So r-a is considered to be different, and so contributes to the distance.

If ignoreGaps is set, then any position in a sequence pair that contains a gap in either sequence is ignored. If ignoreGaps is set to False, then those positions are looked at, and gap-a would be considered a difference.

If divideByNPositionsCompared is turned off, then the number of differences is divided by the length of the alignment.

parts = None

A list of Part objects, encapsulating data partitions in Data objects. There would be one or more parts in an Alignment.

putGaps(theDnaSequenceList)

Insert gaps in theDnaSequenceList based on gaps in self.

Like James O. McInerney’s ‘putgaps’ program. Self should be a protein alignment. The DNA is input as a SequenceList object, ‘theDnaSequenceList’. It creates and returns a new DNA alignment.

For example:
read(‘myProteinAlignment.nex’) pAlign = var.alignments[0] read(‘myUnalignedDna.fasta’) sl = var.sequenceLists[0] newDnaAlign = pAlign.putGaps(sl) newDnaAlign.writeNexus(‘d.nex’)
readOpenPhylipFile(flob, nTax, nChar)

Read flob to get data in phylip format.

The user would generally not need to call this method directly. It is called by read() etc.

recodeDayhoff(firstLetter=False)

Recode protein data into Dayhoff groups, in place.

1. c
2. stpag
3. ndeq
4. hrk
5. milv
6. fyw

The ambiguity character ‘x’ is recoded as a gap.

It does not make a new alignment– it does the re-coding ‘in-place’.

If arg firstLetter is set, then the character is recoded as the first letter of its group rather than as a number. Eg k would be recoded as h rather than as 4.

recodeProteinIntoGroups(groups, firstLetter=False)

Recode protein data into user-specified groups, in place.

A generalization of Alignment.Alignment.recodeDayhoff()

The arg groups should be a list of strings indicating the groupings, with all AAs present. Case does not matter.

The ambiguity character ‘x’ is recoded as a gap.

It does not make a new alignment– it does the re-coding ‘in-place’.

If arg firstLetter is set, then the character is recoded as the first letter of its group rather than as a number.

recodeRY(ambigsBecomeGaps=True)

Recode DNA data into purines and pyrimidines, in place.

So A and G becomes R, and C and T becomes Y. Gaps remain gaps, missing (?) remains missing, and Rs and Ys remain as they are. Depending on the setting of ambigsBecomeGaps, Ns may also remain unmodified, but any other characters (DNA ambiguity characters) become either gaps or N, depending on the setting of ambigsBecomeGaps. So if ambigsBecomeGaps is turned on, as it is by default, then N, S, M, and so on become ‘-‘ ie gap characters. If ambigsBecomeGaps is turned off, then they all become Ns. If ambigsBecomeGaps is turned off, then the alignment gets an equate, of N for R or Y.

The dataType becomes ‘standard’ and the dim becomes 2.

It does not make a new alignment– it does the re-coding ‘in-place’.

resetPartsContentFromSequences()

Reset Part.cPart sequences from self.sequences.

It then makes patterns, and sets the global invariant sites array.

resetSequencesFromParts()

Gets the sequences from Part.cPart, and installs them in self.

sequenceSlice(pos)

Returns a list composed of the characters from the alignment at position pos.

Parameters: pos (int) – Zero-based position A list of the characters at that position
setCharPartition(charPartitionName)

Partition self into Parts based on charPartitionName.

You need to do this before you ask for a Data object.

You can also un-partition an already partitioned alignment by feeding this charPartitionName = None.

setGBlocksCharSet(b1=None, b2=None, b3=8, b4=10, b5='n', pathToGBlocks='Gblocks', verbose=False, deleteFiles=True)

Find conserved regions of an alignment using Gblocks

Gblocks <http://molevol.ibmb.csic.es/Gblocks.html>

This method sets a charset for the gblocks-included sites. And it makes a duplicate charset. Both are put in self.nexusSets, so they are written if you write self in nexus format.

‘b1’ Minimum number of sequences for a conserved position.
Default- 50% of number of sequences + 1
‘b2’ Minimum number of sequences for a flank position
Default- 85% of the number of sequences
‘b3’ Maximum Number Of Contiguous Nonconserved Positions
Default 8
‘b4’ Minimum Length Of A Block
Default 10
‘b5’ Allowed Gap Positions (None, With Half, All) n,h,a
Default ‘n’

Gblocks removes all-gap sites from the alignment (before doing anything, I think), and the mask that it delivers is based on the (possibly) shortened alignment. The present method modifies that mask to make it full-length again, so that it applies to self correctly.

The default b5 of n is generally too conservative, and you may rather want b5=’h’ for half, or even b5=’a’.

A duplicate charset is also made and attached, so that you can start with a gblocks mask and modify it from there, while keeping the gblocks charset.

So you might use it like this:

a = func.readAndPop('myAlignment.fasta')
a.setGBlocksCharSet(b5='h')
a.writeNexus('align_withGblocksCharset.nex')  # Nexus format to get sets block


Then read align_withGblocksCharset.nex with Seaview, pull down the Sites menu, and note that there are 2 charsets – gblocks and myblocks. They are identical. You will often find that the gblocks alignment is not quite what you want, but I assume you want to keep it intact for reference, so set and modify the myblocks char set in Seaview. Then save it in Seaview as align_myblocks_handEdited.nex. Then, back in p4:

a = func.readAndPop('align_myblocks_handEdited.nex')
b = a.subsetUsingCharSet('myblocks')
b.writePhylip('align_goodSitesOnly.phy')

setNexusSets()

Set self.nexusSets from var.nexusSets.

A deepcopy is made of var.nexusSets, and then attached to self. Sometimes other Nexus-set related methods trigger this.

If var.nexusSets does not yet exist, a new blank one is made.

simpleCharCounts(seqNum=None)

Counts of chars that are symbols, only.

Returns a Numpy array of floats.

Gaps or ambigs are not counted.

If seqNum is given, then the counts are for that sequence only. Otherwise the counts are for the whole alignment.

simpleConstantMask(ignoreGapQ=True, invert=False)

Returns a mask with 1 at simple constant sites, and 0 otherwise.

See also constantMask()

A simple constant site is a site that has only 1 kind of symbol, and no ambiguities.

If ignoreGapQ (ignore ‘-‘ and ‘?’) is turned on, then a site is a simple constant site if it has gaps and Q-marks, but otherwise only one kind of symbol. If ignoreGapQ is turned off, then the presence of any ‘-‘ and ‘?’ makes it a non-constant site. If it is all gaps or Q-marks, then this method throws a P4Error.

seq1  ac---cgt-
seq2  acgtacgt-
seq3  rcgtacgta


All-gap columns are really undefined as to whether they are constant or not. It gets worse if we invert to ask for potentially variable sites – if an all-gap site is not constant, then it becomes potentially a variable site, which does not make sense to me. Best to strip out all-gap sites first.

subsetUsingCharSet(charSetName, inverse=0)

Return a subset of self based on a charSet.

A charset has a mask, composed of zeros and ones, which is used to subset self. Returns an alignment.

For example:

read("myAlignment.nex")
a = var.alignments[0]
read('myNexusSets.nex') # with a charset named 'foo'
b = a.subsetUsingCharSet('foo')
b.writeNexus('myFooSubset.nex')

subsetUsingMask(theMask, theMaskChar='1', inverse=0)

Returns a subset of self based on a mask.

This makes a copy of the alignment based on theMask. Arg theMask is a string, the same length as the alignment. The character theMaskChar determines which positions are included in the copy. If the character in theMask is theMaskChar, that position is included in the copy. Otherwise, no. If inverse is set, then theMaskChar determines which positions are excluded, and all other positions are included.

symbols = None

Lowercase string, eg ‘acgt’. The order is all-important.

taxNames

A list of the names of the Sequences.

testOverallFromAbabnehEtAl2006()

Marginal symmetry (Stuarts’s) test for more than two matched sequences

As in Ababneh et al 2006 Page 1227, calculated as in the R function Testoverall.

A tuple composed of the Ts stat, the degrees of freedom, and the probability, is returned.

translate(transl_table=1, checkStarts=False, nnn_is_gap=False)

Returns a protein alignment from self, a DNA alignment.

Self is translated using GeneticCode.GeneticCode.translate(), so it handles ambiguities. At the moment, we can only do translations where the frame of the codon is 123, ie the first sequence position is the first position of the codon. The default transl_table is the standard (or so-called universal) genetic code, but you can change it.

Other available translation tables, this week:

if transl_table == 1: # standard
elif transl_table == 2: # vertebrate mito
elif transl_table == 4: # Mold, Protozoan,
# and Coelenterate Mitochondrial Code
# and the Mycoplasma/Spiroplasma Code
elif transl_table == 5: # invertebrate mito
elif transl_table == 9: # echinoderm mito

and now 6, 10, 11, 12, 13, 14, 21.


(These are found in GeneticCode.GeneticCode)

See also Alignment.Alignment.checkTranslation().

If the arg checkStarts is turned on (by default it is not turned on) then this method checks whether the first codon is a start codon.

Arg nnn_is_gap is for odd alignments where there are long stretches of ‘nnn’ codons, which probably should be gaps. Probably best to correct those elsewise.

writeMolphy(fName=None)

Another phylogenetics program, another format.

Does anybody use this anymore?

writeNexus(fName=None, writeDataBlock=0, interleave=0, flat=0, append=0, userText='')

Write self in Nexus format.

If writeDataBlock=1, then a data block is written, rather than the default, which is to write a taxa and a characters block.

Flat gives sequences all on one line. Append, if 0, writes #NEXUS first. If 1, does not write #NEXUS.

userText is anything, eg a comment or another Nexus block, that you might want to add. It goes at the end.

writePhylip(fName=None, interleave=True, whitespaceSeparatesNames=True, flat=False)

Write the alignment in Phylip format.

If interleave is turned off, then sequences are written sequentially.

Phylip and phylip-like formats are too varied. The strict Phylip format has a set number of spaces for the taxon name, and there may not necessarily be a space between the name and the sequence. The name size is commonly 10 spaces, but it need not be – it is a compile-time option in Phylip.

Other programs, eg phyml and PAML, use a phylip-like format where the tax name is set off from the sequence by whitespace. There is no set number of spaces that the sequence needs to occupy, and there may not be spaces in the tax name.

This method used to write strict, real phylip format by default, where there is a set number of spaces for the taxon name, and where there may not necessarily be a space between the name and the sequence. The name size is commonly 10 spaces, but it need not be– it is set by var.phylipDataMaxNameLength (default 10).

However, it no longer does that by default, as people use the format too loosely. So now ‘whitespaceSeparatesNames’ is turned on by default. It accommodates names longer than 10 chars.

If you want to write strict Phylip format, turn ‘whitespaceSeparatesNames’ off. Note that in that format, described here, it is ok to have blanks in the sequence, and so p4 puts a blank at the beginning of the sequence when writing this way, even though the meaning of ‘whitespaceSeparatesNames’ would suggest not. Here the main effect of whitespaceSeparatesNames=False is that it is assumed that whitespace will not be used to trigger the end of the name in the downstream program. Rather it is assumed that in the downstream program the end of the name will triggered by reading in var.phylipDataMaxNameLength characters. If I were to set whitespaceSeparatesNames=True when I write (because I expect that the downstream program that will read it expects that) then I check for internal blank spaces in the names (which would be disallowed), and also I then allow more than var.phylipDataMaxNameLength characters. However, in both cases there is at least one blank after the name.

Arg ‘flat’ puts it all on one line. This is not compatible with ‘interleave’, of course.