`Alignment`¶

class Alignment[source]¶

A SequenceList where all the sequences are the same length and dataType.

From SequenceList, this inherits

sequences, a list of Sequence objects
fName, the file name, by default None
sequenceForNameDict, a dictionary which allows you to get a Sequence given its name. This week it is not made by default — you need to make it explicitly with the inherited method makeSequenceForNameDict().
and the other methods from SequenceList

Alignment objects additionally have the length, which has a synonym nChar, and is also accessible via len(self).

Also:

nTax, the number of sequences
taxNames, a list of the names of the sequences
dataType, a NEXUS-centric string — ‘dna’, ‘protein’, or ‘standard’
symbols, the character symbols, not including equates or ambiguities
equates, a dictionary of NEXUS-style ambiguities, eg 'r':['a','g'] for DNA
dim, the dimension, ie the number of symbols
nexusSets, a NexusSets object, usually made from a copy of var.nexusSets, but made specific to self

Various checks are made when alignments are read in from files

p4.sequencelist.SequenceList.checkNamesForDupes

checkForAllGapColumns

Check for alignment columns that are made of only gap or ? chars.

checkForBlankSequences

Like it says, with verbose output.

checkForDuplicateSequences

Like it says, with verbose output.

checkLengthsAndTypes

Last checks after reading an Alignment.

The checks above are under the control of some variables in Var, always available to you as the instance var.

var.doCheckForDuplicateSequenceNames
var.doRepairDupedTaxonNames
var.doCheckForAllGapColumns
var.doCheckForBlankSequences
var.doCheckForDuplicateSequences

Writing

`writeNexus`	Write self in Nexus format.
`writePhylip`	Write the alignment in Phylip format.
`p4.sequencelist.SequenceList.writeFasta`	Write out the sequences in Fasta format.
`writeMolphy`	Another phylogenetics program, another format.

Copy self

`dupe`	Duplicates self, with no c-pointers.
`noGapsOrAmbiguitiesCopy`	Returns a new Alignment with sites with gaps or ambiguities removed.
`bootstrap`	Returns a new Alignment, a bootstrap resampling of self.
`p4.data.Data.bootstrap`	Returns a new data object, filled with bootstrapped data.
`sequenceSlice`	Returns a list composed of the characters from the alignment at position pos.

Masks are strings that are nChar long, usually just 0s and 1s

`constantMask`	Returns a mask string with 1 at constant sites, and 0 otherwise.
`simpleConstantMask`	Returns a mask with 1 at simple constant sites, and 0 otherwise.
`gappedMask`	Returns a mask string with 1 at positions with any gaps, and 0 otherwise.
`getAllGapsMask`	If its all gaps in a site, its 1, else 0.
`getEnoughCharsMask`	Mask out sites that have too many gaps.
`getLikelihoodTopologyInformativeSitesMask`	Make and return a mask for those sites that are likelihood informative about the topology.
`getMaskForAutapomorphies`	Return a mask for autapomorphies
`getMaskForCharDiversity`	Return a mask for a given character diversity

You can also make masks with p4.func.maskFromNexusCharacterList(). You can combine masks bitwise with the Alignment methods andMasks() and orMasks()

CharSets and CharPartitions

A NEXUS-style charSet or charPartition is usually defined using a NEXUS sets block, as explained in p4.nexussets.NexusSets. A NEXUS sets block might be something like:

begin sets;
    charset pos1 = 1 - .;
    charset pos2 = 2 - .;
    charset pos3 = 3 - .;
end;

or

begin sets;
    charset gene1 = 1 - 103;
    charset gene2 = 104 - 397;
    charpartition by_gene = gene1:gene1, gene2:gene2;
end;

However, you can also make new charSets from an Alignment.nexusSets with a mask using p4.nexussets.NexusSets.newCharSet()

CharPartitions must cover the entire length of self, without overlap of the character partition subsets.

Whenever you do something like Alignment.subsetUsingCharSet() that requires self.nexusSets, if self.nexusSets does not exist yet then Alignment.setNexusSets() is automatically called to make it — so usually you do not need to do Alignment.setNexusSets().

Alignment.excludeCharSet

Exclude a CharSet.

Alignment.setCharPartition

Partition self into Parts based on charPartitionName.

Alignment.setGBlocksCharSet

Find conserved regions of an alignment using Gblocks

Alignment.setNexusSets

Set self.nexusSets from var.nexusSets.

p4.nexussets.NexusSets.newCharSet

p4.nexussets.NexusSets.dupeCharSet

Extracting subsets

`Alignment.subsetUsingCharSet`	Return a subset of self based on a charSet.
`Alignment.subsetUsingMask`	Returns a subset of self based on a mask.

Recoding the data into groups

`Alignment.recodeDayhoff`	Recode protein data into Dayhoff groups, in place.
`Alignment.recodeProteinIntoGroups`	Recode protein data into user-specified groups, in place.
`Alignment.recodeRY`	Recode DNA data into purines and pyrimidines, in place.

Translating DNA to protein

`Alignment.translate`	Returns a protein alignment from self, a DNA alignment.
`Alignment.checkTranslation`	Check that self translates to theProteinAlignment.

andMasks(maskA, maskB)¶

Given two masks, this logically and’s the string chars.

Only zero and ‘1’ chars are allowed, and returned.

Eg. 0010 with 1010 will return 0010. and 0010 with 1000 will return 0000.

bluntEndLigate(alig, allowDifferentDataTypes=False)¶

Attaches alig to the end of self.

Various checks are made. If the sequences are in a different order in the two alignments then it will not work.

See the method Alignment.concatenate(), which can handle missing and out-of-order sequences.

bootstrap()¶

Returns a new Alignment, a bootstrap resampling of self.

This is done only in Python (ie no cParts involved), and does not handle partitioned data. If you want to bootstrap partitioned data, use the p4.data.Data.bootstrap() method.

To make this exactly repeatable, set the seed for the Python random module:

import random
random.seed(mySeed)

changeDataTypeTo(newDataType, newSymbols=None, newEquates=None)[source]¶

Coerce the alignment to be a new datatype.

This would be good for pathological cases where eg DNA with lots of ambigs is mistaken for protein.

It is not sufficient to simply change the dataType – we must change the symbols and equates as well. And the dataType for all the sequences in self. If you are changing to ‘standard’ dataType, then you need to specify the symbols and the equates. The symbols is a string, and the equates is a dictionary (see eg yourAlignment.equates for a DNA alignment to see the format).

checkForAllGapColumns(returnMask=False)[source]¶

Check for alignment columns that are made of only gap or ? chars.

By default, p4 does this on new alignments that are read. It is under the control of var.doCheckForAllGapColumns.

If there are all gap columns then a verbose output is printed and a P4Error is raised, unless returnMask is set, in which case no output is printed, no P4Error is raised, but the mask is returned.

checkForBlankSequences(removeBlanks=False, includeN=True, listSeqNumsOfBlanks=False)[source]¶

Like it says, with verbose output.

If the p4 variable var.doCheckForBlankSequences is set, this method, with removeBlanks=False, is called automatically every time an alignment is read in. If you would rather it not do that, turn var.doCheckForBlankSequences off.

When these automatic checks are done, if any blank sequences are found then a verbose P4Error is raised, inviting you to run it again with removeBlanks turned on.

Sometimes you just want to know what sequences are blank; get the sequence numbers by turning arg listSeqNumsOfBlanks on. That returns a list of sequence numbers, without removing blanks or raising a P4Error.

If removeBlanks is set, the blank sequences are removed. This option is not set by default.

If this method removes sequences, it returns the number of blank sequences removed.

Blank sequences are defined as sequences wholly composed of ‘?’ and ‘-‘. If includeN is turned on (which it is by default) then ‘n’ is included for DNA, and ‘x’ for protein. If that is done, that means, for DNA, that it is blank if it is wholly composed of ‘?’, ‘-‘, and ‘n’.

checkForDuplicateSequences(removeDupes=False, makeDict=True, dictFileName='p4DupeSeqRenameDict.py', dupeBaseName='p4Dupe')[source]¶

Like it says, with verbose output.

If the p4 variable var.doCheckForDuplicateSequences is set, this method, with removeDupes=False, is called automatically every time an alignment is read in. If you would rather it not do that, turn var.doCheckForDuplicateSequences off.

When these automatic checks are done, if any dupe sequences are found then a verbose warning is issued, inviting you to run it again with removeDupes turned on.

If removeDupes is set, the duplicate sequences after the first are removed. This option is not set by default.

If both removeDupes and makeDict are set, then it will rename the first sequence to p4Dupe1 (or p4Dupe2, and so on — the dupeBaseName is ‘p4Dupe’ by default, but it can be set as an arg) and make a dictionary to hold the other names, and write that dictionary to a file (by default p4DupeSeqRenameDict.py). The option makeDict is set by default, but it won’t happen unless removeDupes is also set, and there are dupes to be removed.

checkLengthsAndTypes()[source]¶

Last checks after reading an Alignment.

Make sure the sequence lengths and dataType are all the same. Set self.length, self.dataType, self.symbols, self.dim, and self.equates

checkTranslation(theProteinAlignment, transl_table=1, checkStarts=False)¶

Check that self translates to theProteinAlignment.

Self should be a DNA alignment. It is translated using p4.geneticcode.GeneticCode.translate() (so it should handle ambiguities) and compared against theProteinAlignment. The theProteinAlignment sequence order, names, and gap pattern should be the same as in the DNA alignment. The default transl_table is the standard (or so-called universal) genetic code.

Other available translation tables, this week:

if transl_table == 1:   # standard
elif transl_table == 2: # vertebrate mito
elif transl_table == 4: # Mold, Protozoan,
                        # and Coelenterate Mitochondrial Code
                        # and the Mycoplasma/Spiroplasma Code
elif transl_table == 5: # invertebrate mito
elif transl_table == 9: # echinoderm mito

# and now 6, 10, 11, 12, 13, 14, 21.

(These are found in GeneticCode)

If the arg checkStarts is turned on (by default it is not turned on) then this method checks whether the first codon is a start codon.

composition(sequenceNumberList=None)[source]¶

Returns a list of compositions.

This returns a list of floats, the composition of the sequence(s) in the sequenceNumberList. If the sequenceNumberList=None (the default), then the overall composition is given, which is the mean of the individual sequences (the sequence comps are not weighted by the sequence length, ie the non-gap sequence length). For DNA, the order is acgt. For protein, the order is arndcqeghilkmfpstwyv. For standard, its the order in symbols. Gaps and questionmarks are ignored. Equates are handled properly, iterating to the final comp.

compositionEuclideanDistanceMatrix()¶

This returns a DistanceMatrix based on composition.

The formula is as given in Lockhart et al 94, the logDet paper. Its equation 4 there, page 608. One pairwise distance is the square root of the sum of the squares of the differences between the frequencies of character states of one pair of sequences.

concatenate(alig, sNames)¶

Attaches alig to the end of self.

You need to provide a list, argument sNames, of taxon names that are found in self and alig. This will determine the order of the taxa in self, the result.

It will still work if the sequences are in a different order in the two alignments.

The order of the taxa in sNames need not be the same order as either alignment. So this method can change the order of the sequences in self.

It will still work if there are missing sequences in either self or alig. It will add blank sequences as needed.

Parameters

alig (Alignment) – The other Alignment object.
sNames (list) – A list of all the taxon names found in self and alig

constantMask(invert=None)¶

Returns a mask string with 1 at constant sites, and 0 otherwise.

See also getMaskForAutapomorphies()

getMinmaxChiSqGroups(percent_cutoff=0.05, min_bins=2, max_bins=20, n_choices=1000, seed=42, verbose=False)¶

An interface for Susko and Roger’s minmax-chisq program.

Susko, E. and Roger, A.J. (2007). On reduced amino acid alphabets for phylogenetic inference. Mol. Biol. Evol. 24:2139-2150.

percent_cutoff - P-value percentage above which chi-squared test is considered homogeneous. Default is 0.05.
min_bins - The minimum number of bins to consider. The default value is 2.
max_bins - The maximum number of bins to consider. The default value is 20.
n_choices - number. The number of random choices of bins to consider for each bin size. The default is 1000.
seed - An integer giving the starting seed for any random generation done by the program. By default this is 42.

All in memory – no files written.

getSimpleBigF(iA, iB)¶

Make and return a simple bigF between two sequences.

Eg:

seqA acgtacgt
seqB acgtcgta

bigF =
[[ 1.  1.  0.  0.]
 [ 0.  1.  1.  0.]
 [ 0.  0.  1.  1.]
 [ 1.  0.  0.  1.]]

Sites with gaps and ambigs are ignored – thats what makes it ‘simple’.

It returns a 2-D Numpy array of floats.

hasGapsOrAmbiguities()¶: Asks whether self has any gaps or ambiguities.

initDataParts()¶

matchedPairsTests(mostSignificantOnly=False)¶

Get all Ababneh et al 2006 matched pairs stats and probabilies.

Parameters: mostSignificantOnly (bool) – False by default, which gives the full matrices. Setting this to True returns the three most significant values only, as a tuple.
Returns: By default it returns six DistanceMatrix objects. QB, QS, and QR (QR=Internal) matrices contain the statistics, and PB, PS, and PR contain the P-values.

For example:

a = var.alignments[0]
QB, QS, QR, PB, PS, PR = a.matchedPairsTests()

The tests are pairwise on all pairs of sequences. Note that it may fail to do the calculations for a pair. If so it will return None for that test for that pair, and that will end up in DistanceMatrix objects that are returned.

The tests are pair-wize on all pairs of sequences. Note that it may fail to do the calculations for a pair. If so it will return None for that test for that pair, and that will end up in DistanceMatrix objects that are returned.

meanNCharsPerSite(includeConstantSites=True, showDistribution=True)¶

Mean number of different chars per site.

Constant sites can optionally be ignored. Gaps and ambiguities are ignored.

This is in pure Python, and can be used for Alignments with one part. It is also implemented in c in the Data class, which allows more than one part (but no distribution).

mrpSlice(pos, zeroBasedNumbering=True)¶

Pretty-print a mrp site, with no ‘?’ positions.

Zero-based numbering, unless arg zeroBasedNumbering is set to False.

property nChar¶: Return the length of the alignment

property nEquates¶: Return the number of equates

property nTax¶: Return the number of sequences

nexusSets¶: A NexusSets object, perhaps copied from var.nexusSets and made specific to self. You can do a p4.nexussets.NexusSets.dump() on it to see what is in there.

noGapsOrAmbiguitiesCopy()¶

Returns a new Alignment with sites with gaps or ambiguities removed.

This makes another Alignment instance by copying over the sequences site by site as long as there are no gaps or ambiguities.

orMasks(maskA, maskB)¶: Given two masks, this logically or’s the string chars.

Only zero and ‘1’ chars are allowed, and returned.

Eg. 0010 with 1000 will return 1010.

and 0010 with 1010 will return 1010.

pDistances(ignoreGaps=True, divideByNPositionsCompared=True)¶

Returns a DistanceMatrix of mean character distances or pDistances.

The default behaviour is that only pairwise positions are compared in which both sequences have a character. The sum of differences is then divided by the number of positions compared.

Its not the same as paup p-dists when there are gaps or ambiguities. For example, the pDistances between the sequences ‘acgaa’ and ‘aaa–’ is 0.4 in paup, but 0.667 here. Paup would call these p4 distances ‘mean character distances’ (dset distance=mean).

Ambiguities are not handled intelligently. Ambigs are just another character. So r-a is considered to be different, and so contributes to the distance.

If ignoreGaps is set, then any position in a sequence pair that contains a gap in either sequence is ignored. If ignoreGaps is set to False, then those positions are looked at, and gap-a would be considered a difference.

If divideByNPositionsCompared is turned off, then the number of differences is divided by the length of the alignment.

parts¶: A list of Part objects, encapsulating data partitions in Data objects. There would be one or more parts in an Alignment.

putGaps(theDnaSequenceList)¶

Insert gaps in theDnaSequenceList based on gaps in self.

Like James O. McInerney’s ‘putgaps’ program. Self should be a protein alignment. The DNA is input as a SequenceList object, ‘theDnaSequenceList’. It creates and returns a new DNA alignment.

For example:

read('myProteinAlignment.nex')
pAlign = var.alignments[0]
read('myUnalignedDna.fasta')
sl = var.sequenceLists[0]
newDnaAlign = pAlign.putGaps(sl)
newDnaAlign.writeNexus('d.nex')

Parameters: DNA SequenceList object (a) –
Returns: a new Alignment object

readOpenPhylipFile(flob, nTax, nChar)¶

Read flob to get data in phylip format.

The user would generally not need to call this method directly. It is called by read() etc.

recodeDayhoff(firstLetter=False)¶

Recode protein data into Dayhoff groups, in place.

c
stpag
ndeq
hrk
milv
fyw

The ambiguity character ‘x’ is recoded as a gap.

It does not make a new alignment– it does the re-coding ‘in-place’.

If arg firstLetter is set, then the character is recoded as the first letter of its group rather than as a number. Eg k would be recoded as h rather than as 4.

recodeProteinIntoGroups(groups, firstLetter=False)¶

Recode protein data into user-specified groups, in place.

A generalization of p4.alignment.Alignment.recodeDayhoff()

The arg groups should be a list of strings indicating the groupings, with all AAs present. Case does not matter.

The ambiguity character ‘x’ is recoded as a gap.

It does not make a new alignment– it does the re-coding ‘in-place’.

If arg firstLetter is set, then the character is recoded as the first letter of its group rather than as a number.

recodeRY(ambigsBecomeGaps=True, use01=False)¶

Recode DNA data into purines and pyrimidines, in place.

So A and G becomes R, and C and T becomes Y. Gaps remain gaps, missing (?) remains missing, and Rs and Ys remain as they are. Depending on the setting of ambigsBecomeGaps, Ns may also remain unmodified, but any other characters (DNA ambiguity characters) become either gaps or N, depending on the setting of ambigsBecomeGaps. So if ambigsBecomeGaps is turned on, as it is by default, then N, S, M, and so on become ‘-‘ ie gap characters. If ambigsBecomeGaps is turned off, then they all become Ns. If ambigsBecomeGaps is turned off, then the alignment gets an equate, of N for R or Y.

If use01 is turned on (it is off by default) then we use 0 and 1 for characters, rather than R and Y. This is a bit buggy because any existing R or Y characters are ignored, and other ambigs are encoded as N.

The dataType becomes ‘standard’ and the dim becomes 2.

It does not make a new alignment– it does the re-coding ‘in-place’.

resetPartsContentFromSequences()¶

Reset Part.cPart sequences from self.sequences.

It then makes patterns, and sets the global invariant sites array.

resetSequencesFromParts()¶: Gets the sequences from Part.cPart, and installs them in self.

sequenceSlice(pos)¶

Returns a list composed of the characters from the alignment at position pos.

Parameters: pos (int) – Zero-based position
Returns: A list of the characters at that position

setCharPartition(charPartitionName)[source]¶

Partition self into Parts based on charPartitionName.

You need to do this before you ask for a Data object.

You can also un-partition an already partitioned alignment by feeding this charPartitionName = None.

setGBlocksCharSet(b1=None, b2=None, b3=8, b4=10, b5='n', pathToGBlocks='Gblocks', verbose=False, deleteFiles=True)¶

Find conserved regions of an alignment using Gblocks

Gblocks <http://molevol.ibmb.csic.es/Gblocks.html>

This method sets a charset for the gblocks-included sites. And it makes a duplicate charset. Both are put in self.nexusSets, so they are written if you write self in nexus format.

‘b1’ Minimum number of sequences for a conserved position.
Default- 50% of number of sequences + 1

‘b2’ Minimum number of sequences for a flank position
Default- 85% of the number of sequences

‘b3’ Maximum Number Of Contiguous Nonconserved Positions
Default 8

‘b4’ Minimum Length Of A Block
Default 10

‘b5’ Allowed Gap Positions (None, With Half, All) n,h,a
Default ‘n’

Gblocks removes all-gap sites from the alignment (before doing anything, I think), and the mask that it delivers is based on the (possibly) shortened alignment. The present method modifies that mask to make it full-length again, so that it applies to self correctly.

The default b5 of n is generally too conservative, and you may rather want b5=’h’ for half, or even b5=’a’.

A duplicate charset is also made and attached, so that you can start with a gblocks mask and modify it from there, while keeping the gblocks charset.

So you might use it like this:

a = func.readAndPop('myAlignment.fasta')
a.setGBlocksCharSet(b5='h')
a.writeNexus('align_withGblocksCharset.nex')  # Nexus format to get sets block

Then read align_withGblocksCharset.nex with Seaview, pull down the Sites menu, and note that there are 2 charsets – gblocks and myblocks. They are identical. You will often find that the gblocks alignment is not quite what you want, but I assume you want to keep it intact for reference, so set and modify the myblocks char set in Seaview. Then save it in Seaview as align_myblocks_handEdited.nex. Then, back in p4:

a = func.readAndPop('align_myblocks_handEdited.nex')
b = a.subsetUsingCharSet('myblocks')
b.writePhylip('align_goodSitesOnly.phy')

setNexusSets()[source]¶

Set self.nexusSets from var.nexusSets.

A deepcopy is made of var.nexusSets, and then attached to self. Sometimes other Nexus-set related methods trigger this.

If var.nexusSets does not yet exist, a new blank one is made.

simpleCharCounts(seqNum=None)¶

Counts of chars that are symbols, only.

Returns a Numpy array of floats.

Gaps or ambigs are not counted.

If seqNum is given, then the counts are for that sequence only. Otherwise the counts are for the whole alignment.

simpleConstantMask(ignoreGapQ=True, invert=False)¶

Returns a mask with 1 at simple constant sites, and 0 otherwise.

See also constantMask()

A simple constant site is a site that has only 1 kind of symbol, and no ambiguities.

If ignoreGapQ (ignore ‘-‘ and ‘?’) is turned on, then a site is a simple constant site if it has gaps and Q-marks, but otherwise only one kind of symbol. If ignoreGapQ is turned off, then the presence of any ‘-‘ and ‘?’ makes it a non-constant site. If it is all gaps or Q-marks, then this method throws a P4Error.

seq1  ac---cgt-
seq2  acgtacgt-
seq3  rcgtacgta
mask  011111111   if ignoreGapQ=True
mask  010001110   if ignoreGapQ=False

All-gap columns are really undefined as to whether they are constant or not. It gets worse if we invert to ask for potentially variable sites – if an all-gap site is not constant, then it becomes potentially a variable site, which does not make sense to me. Best to strip out all-gap sites first.

subsetUsingCharSet(charSetName, inverse=0)[source]¶

Return a subset of self based on a charSet.

A charset has a mask, composed of zeros and ones, which is used to subset self. Returns an alignment.

For example:

read("myAlignment.nex")
a = var.alignments[0]
read('myNexusSets.nex') # with a charset named 'foo'
b = a.subsetUsingCharSet('foo')
b.writeNexus('myFooSubset.nex')

subsetUsingMask(theMask, theMaskChar='1', inverse=0)[source]¶

Returns a subset of self based on a mask.

This makes a copy of the alignment based on theMask. Arg theMask is a string, the same length as the alignment. The character theMaskChar determines which positions are included in the copy. If the character in theMask is theMaskChar, that position is included in the copy. Otherwise, no. If inverse is set, then theMaskChar determines which positions are excluded, and all other positions are included.

symbols¶: Lowercase string, eg ‘acgt’. The order is all-important.

symtestAsInIQTreeNaserKhdour(verbose=True)¶

Matched-pairs tests of one pair, as in IQTree

This has appeared in IQTree betas from about 1.7beta onwards, and is part of iqtree2. (There was a small bug in --symtest that was fixed in v 2.0.6, June 2020.)

See Naser-Khdour et al GBE 2019 https://doi.org/10.1093/gbe/evz193

This is my attempt to replicate it. That implementation chooses the sequence pair with the biggest divergence, and only reports stats for that pair. This does not necessarily show the smallest stats.

If it is verbose, it speaks the results to stdout.

It returns the 3 probabilities (not stats) – PB, PS, PR, which are Bowker’s, Stuart’s and Ababneh’s.

property taxNames¶: A list of the names of the Sequences.

testOverallFromAbabnehEtAl2006()¶

Marginal symmetry (Stuarts’s) test for more than two matched sequences

As in Ababneh et al 2006 Page 1227, calculated as in the R function Testoverall.

A tuple composed of the Ts stat, the degrees of freedom, and the probability, is returned.

translate(transl_table=1, checkStarts=False, nnn_is_gap=False)¶

Returns a protein alignment from self, a DNA alignment.

Self is translated using GeneticCode.GeneticCode.translate(), so it handles ambiguities. At the moment, we can only do translations where the frame of the codon is 123, ie the first sequence position is the first position of the codon. The default transl_table is the standard (or so-called universal) genetic code, but you can change it.

Other available translation tables, this week:

if transl_table == 1: # standard
elif transl_table == 2: # vertebrate mito
elif transl_table == 4: # Mold, Protozoan,
                        # and Coelenterate Mitochondrial Code
                        # and the Mycoplasma/Spiroplasma Code
elif transl_table == 5: # invertebrate mito
elif transl_table == 9: # echinoderm mito

and now 6, 10, 11, 12, 13, 14, 21.

(These are found in p4.geneticcode.GeneticCode)

If the arg checkStarts is turned on (by default it is not turned on) then this method checks whether the first codon is a start codon.

Arg nnn_is_gap is for odd alignments where there are long stretches of ‘nnn’ codons, which probably should be gaps. Probably best to correct those elsewise.

writeMolphy(fName=None)¶

Another phylogenetics program, another format.

Does anybody use this anymore?

writeNexus(fName=None, writeDataBlock=0, interleave=0, flat=0, append=0, userText='')¶

Write self in Nexus format.

If writeDataBlock=1, then a data block is written, rather than the default, which is to write a taxa and a characters block.

Flat gives sequences all on one line. Append, if 0, writes #NEXUS first. If 1, does not write #NEXUS.

userText is anything, eg a comment or another Nexus block, that you might want to add. It goes at the end.

writePhylip(fName=None, interleave=True, whitespaceSeparatesNames=True, flat=False, offset=1, uppercase=False)¶

Write the alignment in Phylip format.

Arg fName can be None, the default, in which case the alignment is written to stdout, or the fName can be an open file-like object, or fName can be a string, the name of a file to write to. If it is a file-like object it will be re-wound to the beginning, but it is is up to the user to open and close it.

If interleave is turned off (it is on by default), then sequences are written sequentially.

Phylip and phylip-like formats are too varied. The strict Phylip format has a set number of spaces for the taxon name, and there may not necessarily be a space between the name and the sequence. The name size is commonly 10 spaces, but it need not be – it is a compile-time option in Phylip.

Other programs, eg phyml and PAML, use a phylip-like format where the tax name is set off from the sequence by whitespace. There is no set number of spaces that the sequence needs to occupy, and there may not be spaces in the tax name. ‘offset’ is the number of spaces from the end of the name to the beginning of the sequence.

This method used to write strict, real phylip format by default, where there is a set number of spaces for the taxon name, and where there may not necessarily be a space between the name and the sequence. The name size is commonly 10 spaces, but it need not be– it is set by var.phylipDataMaxNameLength (default 10).

However, it no longer does that by default, as people use the format too loosely. So now ‘whitespaceSeparatesNames’ is turned on by default. It accommodates names longer than 10 chars.

If you want to write strict Phylip format, turn ‘whitespaceSeparatesNames’ off. Note that in that format, described here, it is ok to have blanks in the sequence, and so p4 puts a blank at the beginning of the sequence when writing this way, even though the meaning of ‘whitespaceSeparatesNames’ would suggest not. Here the main effect of whitespaceSeparatesNames=False is that it is assumed that whitespace will not be used to trigger the end of the name in the downstream program. Rather it is assumed that in the downstream program the end of the name will triggered by reading in var.phylipDataMaxNameLength characters. If I were to set whitespaceSeparatesNames=True when I write (because I expect that the downstream program that will read it expects that) then I check for internal blank spaces in the names (which would be disallowed), and also I then allow more than var.phylipDataMaxNameLength characters. However, in both cases there is at least one blank after the name.

Arg ‘flat’ puts it all on one line. This is not compatible with ‘interleave’, of course.

index
0	num of autapomorphies
1	num of simple constant sites
2	num of sites with 2 kinds of char
3	num of sites with 3 kinds of char
…	…
nSymbols	num of sites with some of each char

p4

Navigation

Related Topics

`Alignment`¶

`p4.sequencelist.SequenceList.checkNamesForDupes`
`checkForAllGapColumns`	Check for alignment columns that are made of only gap or ? chars.
`checkForBlankSequences`	Like it says, with verbose output.
`checkForDuplicateSequences`	Like it says, with verbose output.
`checkLengthsAndTypes`	Last checks after reading an Alignment.

`Alignment.excludeCharSet`	Exclude a CharSet.
`Alignment.setCharPartition`	Partition self into Parts based on charPartitionName.
`Alignment.setGBlocksCharSet`	Find conserved regions of an alignment using Gblocks
`Alignment.setNexusSets`	Set self.nexusSets from var.nexusSets.
`p4.nexussets.NexusSets.newCharSet`
`p4.nexussets.NexusSets.dupeCharSet`

Alignment¶

`Alignment`¶