Alignment

class Alignment[source]

A SequenceList where all the sequences are the same length and dataType.

From SequenceList, this inherits

  • sequences, a list of Sequence objects

  • fName, the file name, by default None

  • sequenceForNameDict, a dictionary which allows you to get a Sequence given its name. This week it is not made by default — you need to make it explicitly with the inherited method makeSequenceForNameDict().

  • and the other methods from SequenceList

Alignment objects additionally have the length, which has a synonym nChar, and is also accessible via len(self).

Also:

  • nTax, the number of sequences

  • taxNames, a list of the names of the sequences

  • dataType, a NEXUS-centric string — ‘dna’, ‘protein’, or ‘standard’

  • symbols, the character symbols, not including equates or ambiguities

  • equates, a dictionary of NEXUS-style ambiguities, eg 'r':['a','g'] for DNA

  • dim, the dimension, ie the number of symbols

  • nexusSets, a NexusSets object, usually made from a copy of var.nexusSets, but made specific to self

Various checks are made when alignments are read in from files

p4.sequencelist.SequenceList.checkNamesForDupes

checkForAllGapColumns

Check for alignment columns that are made of only gap or ? chars.

checkForBlankSequences

Like it says, with verbose output.

checkForDuplicateSequences

Like it says, with verbose output.

checkLengthsAndTypes

Last checks after reading an Alignment.

The checks above are under the control of some variables in Var, always available to you as the instance var.

  • var.doCheckForDuplicateSequenceNames

  • var.doRepairDupedTaxonNames

  • var.doCheckForAllGapColumns

  • var.doCheckForBlankSequences

  • var.doCheckForDuplicateSequences

Writing

writeNexus

Write self in Nexus format.

writePhylip

Write the alignment in Phylip format.

p4.sequencelist.SequenceList.writeFasta

Write out the sequences in Fasta format.

writeMolphy

Another phylogenetics program, another format.

Copy self

dupe

Duplicates self, with no c-pointers.

noGapsOrAmbiguitiesCopy

Returns a new Alignment with sites with gaps or ambiguities removed.

bootstrap

Returns a new Alignment, a bootstrap resampling of self.

p4.data.Data.bootstrap

Returns a new data object, filled with bootstrapped data.

sequenceSlice

Returns a list composed of the characters from the alignment at position pos.

Masks are strings that are nChar long, usually just 0s and 1s

constantMask

Returns a mask string with 1 at constant sites, and 0 otherwise.

simpleConstantMask

Returns a mask with 1 at simple constant sites, and 0 otherwise.

gappedMask

Returns a mask string with 1 at positions with any gaps, and 0 otherwise.

getAllGapsMask

If its all gaps in a site, its 1, else 0.

getEnoughCharsMask

Mask out sites that have too many gaps.

getLikelihoodTopologyInformativeSitesMask

Make and return a mask for those sites that are likelihood informative about the topology.

getMaskForAutapomorphies

Return a mask for autapomorphies

getMaskForCharDiversity

Return a mask for a given character diversity

You can also make masks with p4.func.maskFromNexusCharacterList(). You can combine masks bitwise with the Alignment methods andMasks() and orMasks()

CharSets and CharPartitions

A NEXUS-style charSet or charPartition is usually defined using a NEXUS sets block, as explained in p4.nexussets.NexusSets. A NEXUS sets block might be something like:

begin sets;
    charset pos1 = 1 - .;
    charset pos2 = 2 - .;
    charset pos3 = 3 - .;
end;

or

begin sets;
    charset gene1 = 1 - 103;
    charset gene2 = 104 - 397;
    charpartition by_gene = gene1:gene1, gene2:gene2;
end;

However, you can also make new charSets from an Alignment.nexusSets with a mask using p4.nexussets.NexusSets.newCharSet()

CharPartitions must cover the entire length of self, without overlap of the character partition subsets.

Whenever you do something like Alignment.subsetUsingCharSet() that requires self.nexusSets, if self.nexusSets does not exist yet then Alignment.setNexusSets() is automatically called to make it — so usually you do not need to do Alignment.setNexusSets().

Alignment.excludeCharSet

Exclude a CharSet.

Alignment.setCharPartition

Partition self into Parts based on charPartitionName.

Alignment.setGBlocksCharSet

Find conserved regions of an alignment using Gblocks

Alignment.setNexusSets

Set self.nexusSets from var.nexusSets.

p4.nexussets.NexusSets.newCharSet

p4.nexussets.NexusSets.dupeCharSet

Extracting subsets

Alignment.subsetUsingCharSet

Return a subset of self based on a charSet.

Alignment.subsetUsingMask

Returns a subset of self based on a mask.

Recoding the data into groups

Alignment.recodeDayhoff

Recode protein data into Dayhoff groups, in place.

Alignment.recodeProteinIntoGroups

Recode protein data into user-specified groups, in place.

Alignment.recodeRY

Recode DNA data into purines and pyrimidines, in place.

Translating DNA to protein

Alignment.translate

Returns a protein alignment from self, a DNA alignment.

Alignment.checkTranslation

Check that self translates to theProteinAlignment.

andMasks(maskA, maskB)

Given two masks, this logically and’s the string chars.

Only zero and ‘1’ chars are allowed, and returned.

Eg. 0010 with 1010 will return 0010. and 0010 with 1000 will return 0000.

bluntEndLigate(alig, allowDifferentDataTypes=False)

Attaches alig to the end of self.

Various checks are made. If the sequences are in a different order in the two alignments then it will not work.

See the method Alignment.concatenate(), which can handle missing and out-of-order sequences.

bootstrap()

Returns a new Alignment, a bootstrap resampling of self.

This is done only in Python (ie no cParts involved), and does not handle partitioned data. If you want to bootstrap partitioned data, use the p4.data.Data.bootstrap() method.

To make this exactly repeatable, set the seed for the Python random module:

import random
random.seed(mySeed)
changeDataTypeTo(newDataType, newSymbols=None, newEquates=None)[source]

Coerce the alignment to be a new datatype.

This would be good for pathological cases where eg DNA with lots of ambigs is mistaken for protein.

It is not sufficient to simply change the dataType – we must change the symbols and equates as well. And the dataType for all the sequences in self. If you are changing to ‘standard’ dataType, then you need to specify the symbols and the equates. The symbols is a string, and the equates is a dictionary (see eg yourAlignment.equates for a DNA alignment to see the format).

checkForAllGapColumns(returnMask=False)[source]

Check for alignment columns that are made of only gap or ? chars.

By default, p4 does this on new alignments that are read. It is under the control of var.doCheckForAllGapColumns.

If there are all gap columns then a verbose output is printed and a P4Error is raised, unless returnMask is set, in which case no output is printed, no P4Error is raised, but the mask is returned.

checkForBlankSequences(removeBlanks=False, includeN=True, listSeqNumsOfBlanks=False)[source]

Like it says, with verbose output.

If the p4 variable var.doCheckForBlankSequences is set, this method, with removeBlanks=False, is called automatically every time an alignment is read in. If you would rather it not do that, turn var.doCheckForBlankSequences off.

When these automatic checks are done, if any blank sequences are found then a verbose P4Error is raised, inviting you to run it again with removeBlanks turned on.

Sometimes you just want to know what sequences are blank; get the sequence numbers by turning arg listSeqNumsOfBlanks on. That returns a list of sequence numbers, without removing blanks or raising a P4Error.

If removeBlanks is set, the blank sequences are removed. This option is not set by default.

If this method removes sequences, it returns the number of blank sequences removed.

Blank sequences are defined as sequences wholly composed of ‘?’ and ‘-‘. If includeN is turned on (which it is by default) then ‘n’ is included for DNA, and ‘x’ for protein. If that is done, that means, for DNA, that it is blank if it is wholly composed of ‘?’, ‘-‘, and ‘n’.

checkForDuplicateSequences(removeDupes=False, makeDict=True, dictFileName='p4DupeSeqRenameDict.py', dupeBaseName='p4Dupe')[source]

Like it says, with verbose output.

If the p4 variable var.doCheckForDuplicateSequences is set, this method, with removeDupes=False, is called automatically every time an alignment is read in. If you would rather it not do that, turn var.doCheckForDuplicateSequences off.

When these automatic checks are done, if any dupe sequences are found then a verbose warning is issued, inviting you to run it again with removeDupes turned on.

If removeDupes is set, the duplicate sequences after the first are removed. This option is not set by default.

If both removeDupes and makeDict are set, then it will rename the first sequence to p4Dupe1 (or p4Dupe2, and so on — the dupeBaseName is ‘p4Dupe’ by default, but it can be set as an arg) and make a dictionary to hold the other names, and write that dictionary to a file (by default p4DupeSeqRenameDict.py). The option makeDict is set by default, but it won’t happen unless removeDupes is also set, and there are dupes to be removed.

checkLengthsAndTypes()[source]

Last checks after reading an Alignment.

Make sure the sequence lengths and dataType are all the same. Set self.length, self.dataType, self.symbols, self.dim, and self.equates

checkTranslation(theProteinAlignment, transl_table=1, checkStarts=False)

Check that self translates to theProteinAlignment.

Self should be a DNA alignment. It is translated using p4.geneticcode.GeneticCode.translate() (so it should handle ambiguities) and compared against theProteinAlignment. The theProteinAlignment sequence order, names, and gap pattern should be the same as in the DNA alignment. The default transl_table is the standard (or so-called universal) genetic code.

Other available translation tables, this week:

if transl_table == 1:   # standard
elif transl_table == 2: # vertebrate mito
elif transl_table == 4: # Mold, Protozoan,
                        # and Coelenterate Mitochondrial Code
                        # and the Mycoplasma/Spiroplasma Code
elif transl_table == 5: # invertebrate mito
elif transl_table == 9: # echinoderm mito

# and now 6, 10, 11, 12, 13, 14, 21.

(These are found in GeneticCode)

See also p4.alignment.Alignment.translate()

If the arg checkStarts is turned on (by default it is not turned on) then this method checks whether the first codon is a start codon.

composition(sequenceNumberList=None)[source]

Returns a list of compositions.

This returns a list of floats, the composition of the sequence(s) in the sequenceNumberList. If the sequenceNumberList=None (the default), then the overall composition is given, which is the mean of the individual sequences (the sequence comps are not weighted by the sequence length, ie the non-gap sequence length). For DNA, the order is acgt. For protein, the order is arndcqeghilkmfpstwyv. For standard, its the order in symbols. Gaps and questionmarks are ignored. Equates are handled properly, iterating to the final comp.

compositionEuclideanDistanceMatrix()

This returns a DistanceMatrix based on composition.

The formula is as given in Lockhart et al 94, the logDet paper. Its equation 4 there, page 608. One pairwise distance is the square root of the sum of the squares of the differences between the frequencies of character states of one pair of sequences.

concatenate(alig, sNames)

Attaches alig to the end of self.

You need to provide a list, argument sNames, of taxon names that are found in self and alig. This will determine the order of the taxa in self, the result.

It will still work if the sequences are in a different order in the two alignments.

The order of the taxa in sNames need not be the same order as either alignment. So this method can change the order of the sequences in self.

It will still work if there are missing sequences in either self or alig. It will add blank sequences as needed.

Parameters
  • alig (Alignment) – The other Alignment object.

  • sNames (list) – A list of all the taxon names found in self and alig

constantMask(invert=None)

Returns a mask string with 1 at constant sites, and 0 otherwise.

See also simpleConstantMask()

Constant sites are defined in a PAUP-like manner. If, when all possibilities of ambiguities are tried, the site could possibly be constant, then it is constant.

So if a site is all gaps or missing (or a combination), or if all the non-gap chars are the same, then it is a constant site.

If a DNA site contains both r and y, it could not possibly be constant. A site containing only a and r is constant, but a site containing only a and y cannot be constant.

constantSitesCount()

Counts the sites that potentially have the same thing in each sequence.

Constant sites are defined in a PAUP-like manner. See the doc string for Alignment.constantMask().

constantSitesProportion()

Returns the proportion of the alignment that have possible constant sites.

It uses constantSitesCount()

covarionStats(listA, listB, verbose=True)

Calculates some covarion statistics.

After Pete Lockhart’s covarion paper. Reference? Looks at the two groups of sequences defined by listA and listB, and determines the number of sites that are constant overall, constant in both groups but in a different character, constant in one but variable in the other, variable in one and constant in the other, and variable overall. The lists can be either sequence numbers (zero-based) or names. If verbose, (the default) it prints a nice summary, with a tiny explanation. Returns a tuple.

dataType

Nexus-centric data type. One of dna, rna, protein, or standard.

dim

The number of symbols, eg 4 for DNA.

dump()[source]

Print rubbish about self.

dupe()

Duplicates self, with no c-pointers. And no parts

equates

A dictionary of NEXUS-style equates, eg r=[a,g] in DNA

excludeCharSet(theCharSetName)

Exclude a CharSet.

gappedMask(invert=None)

Returns a mask string with 1 at positions with any gaps, and 0 otherwise.

getAllGapsMask(andMissing=True)

If its all gaps in a site, its 1, else 0.

If ‘andMissing’ is turned on, then if a site is all gaps or ‘?’, then the mask is 1. If ‘andMissing’ is turned off, then it is strictly gaps.

getCharDiversityDistribution()

Distribution of different chars per site, as a distribution.

Gaps and ambiguities are ignored.

A tuple is returned, composed of

index

0

num of autapomorphies

1

num of simple constant sites

2

num of sites with 2 kinds of char

3

num of sites with 3 kinds of char

nSymbols

num of sites with some of each char

Autapomorphies are sites with two kinds of characters, where there is only one of one of the characters.

So if there are 10 sites with 2 kinds of char, and 4 of those are autapomorphies, 30 constant sites, and 20 sites with 3 kinds of character, we would have a distro like this — (4, 30, 10, 20, …)

This is in pure Python, and can be used for Alignments with one part.

Returns

a tuple of ints, showing counts of sites with different diversities.

See also getMaskForCharDiversity()

getEnoughCharsMask(perCent=85)

Mask out sites that have too many gaps.

This returns a mask – a string of 1’s and 0’s.

If perCent or more fraction of chars in a site are good, its a 1. Else 0.

Good means not gaps, missing, or ambigs. Good means in self.symbols.

getKosiolAISGroups(tree, n_bins, remove_files=False, verbose=True)

An interface for Kosiol’s program AIS, for grouping amino acids.

Writes three files: equi, q, and evec that are required as input to the software AIS (Almost Invariant Sets).

  • tree = tree object with optimised model compatible with the alignment

  • nBins = the number of groups required

Kosiol, C., Goldman, N. and Buttimore, N. (2004). A new criterion and method for amino acid classification. J Theo Biol 228: 97-106.

From share/examples/kosiol_ais.html This writes files.

getLikelihoodTopologyInformativeSitesMask()

Make and return a mask for those sites that are likelihood informative about the topology.

Mostly this means no constant and no autapomorphic sites. Autapomorphies are sites that are all one character except for one taxon that has another character.

The rules, this week–

Its not informative if:

  • If there are no ambigs or gaps, then constants and autapomorphies are not informative.

  • If there are gaps but no ambigs,

    • if there are only 2 characters or fewer – not informative

    • if constant + gaps – not informative

    • if autapomorphy + gaps – not informative

  • If there are ambigs but no gaps,

    • if constant except for a single ambig, then not informative

    • (an autapomorphy plus a single ambig can sometimes be informative)

Otherwise, I’m saying that it is informative. That includes

  • sites with gaps plus ambigs, and

  • sites with more than one ambig

I am conservatively calling some sites as informative, eg sites with more than one ambiguity, even though sites like that are often not topologically informative, as they might be under some models– I have not checked thoroughly.

Because of the above, we cannot be sure that each site indicated by a 1 in the mask is informative, but we can be sure that each 0 in the mask is uninformative.

Returns

a mask string

getMaskForAutapomorphies()

Return a mask for autapomorphies

Returns a string of zeros and 1s, where the 1s are autapomorphic sites.

Gaps and ambiguities are ignored.

In an alignment site, if there are two kinds of character states (ie the character diversity is 2) and one of the character states occurs once only, then it will be called an autapomorphy (or singleton).

Returns

a string nChar long, of zeros and 1s.

See also getMaskForCharDiversity() and getLikelihoodTopologyInformativeSitesMask()

getMaskForCharDiversity(diversity=1)

Return a mask for a given character diversity

Gaps and ambiguities are ignored.

This makes a mask string with zeros and ones, with 1’s showing the sites with the chosen diversity, and zero for other sites.

So for example, if diversity is 1, that means constant sites, and this method returns a simple constant sites mask, with gaps and ambigs ignored.

If the diversity is 2, it will return a mask showing sites with two kinds of characters. And so on.

Returns

a string nChar long, of zeros and 1s.

See also getMaskForAutapomorphies()

getMinmaxChiSqGroups(percent_cutoff=0.05, min_bins=2, max_bins=20, n_choices=1000, seed=42, verbose=False)

An interface for Susko and Roger’s minmax-chisq program.

Susko, E. and Roger, A.J. (2007). On reduced amino acid alphabets for phylogenetic inference. Mol. Biol. Evol. 24:2139-2150.

  • percent_cutoff - P-value percentage above which chi-squared test is considered homogeneous. Default is 0.05.

  • min_bins - The minimum number of bins to consider. The default value is 2.

  • max_bins - The maximum number of bins to consider. The default value is 20.

  • n_choices - number. The number of random choices of bins to consider for each bin size. The default is 1000.

  • seed - An integer giving the starting seed for any random generation done by the program. By default this is 42.

All in memory – no files written.

getSimpleBigF(iA, iB)

Make and return a simple bigF between two sequences.

Eg:

seqA acgtacgt
seqB acgtcgta

bigF =
[[ 1.  1.  0.  0.]
 [ 0.  1.  1.  0.]
 [ 0.  0.  1.  1.]
 [ 1.  0.  0.  1.]]

Sites with gaps and ambigs are ignored – thats what makes it ‘simple’.

It returns a 2-D Numpy array of floats.

hasGapsOrAmbiguities()

Asks whether self has any gaps or ambiguities.

initDataParts()
matchedPairsTests(mostSignificantOnly=False)

Get all Ababneh et al 2006 matched pairs stats and probabilies.

Parameters

mostSignificantOnly (bool) – False by default, which gives the full matrices. Setting this to True returns the three most significant values only, as a tuple.

Returns

By default it returns six DistanceMatrix objects. QB, QS, and QR (QR=Internal) matrices contain the statistics, and PB, PS, and PR contain the P-values.

For example:

a = var.alignments[0]
QB, QS, QR, PB, PS, PR = a.matchedPairsTests()

The tests are pairwise on all pairs of sequences. Note that it may fail to do the calculations for a pair. If so it will return None for that test for that pair, and that will end up in DistanceMatrix objects that are returned.

The tests are pair-wize on all pairs of sequences. Note that it may fail to do the calculations for a pair. If so it will return None for that test for that pair, and that will end up in DistanceMatrix objects that are returned.

meanNCharsPerSite(includeConstantSites=True, showDistribution=True)

Mean number of different chars per site.

Constant sites can optionally be ignored. Gaps and ambiguities are ignored.

This is in pure Python, and can be used for Alignments with one part. It is also implemented in c in the Data class, which allows more than one part (but no distribution).

mrpSlice(pos, zeroBasedNumbering=True)

Pretty-print a mrp site, with no ‘?’ positions.

Zero-based numbering, unless arg zeroBasedNumbering is set to False.

property nChar

Return the length of the alignment

property nEquates

Return the number of equates

property nTax

Return the number of sequences

nexusSets

A NexusSets object, perhaps copied from var.nexusSets and made specific to self. You can do a p4.nexussets.NexusSets.dump() on it to see what is in there.

noGapsOrAmbiguitiesCopy()

Returns a new Alignment with sites with gaps or ambiguities removed.

This makes another Alignment instance by copying over the sequences site by site as long as there are no gaps or ambiguities.

orMasks(maskA, maskB)

Given two masks, this logically or’s the string chars.

Only zero and ‘1’ chars are allowed, and returned.
Eg. 0010 with 1000 will return 1010.
and 0010 with 1010 will return 1010.
pDistances(ignoreGaps=True, divideByNPositionsCompared=True)

Returns a DistanceMatrix of mean character distances or pDistances.

The default behaviour is that only pairwise positions are compared in which both sequences have a character. The sum of differences is then divided by the number of positions compared.

Its not the same as paup p-dists when there are gaps or ambiguities. For example, the pDistances between the sequences ‘acgaa’ and ‘aaa–’ is 0.4 in paup, but 0.667 here. Paup would call these p4 distances ‘mean character distances’ (dset distance=mean).

Ambiguities are not handled intelligently. Ambigs are just another character. So r-a is considered to be different, and so contributes to the distance.

If ignoreGaps is set, then any position in a sequence pair that contains a gap in either sequence is ignored. If ignoreGaps is set to False, then those positions are looked at, and gap-a would be considered a difference.

If divideByNPositionsCompared is turned off, then the number of differences is divided by the length of the alignment.

parts

A list of Part objects, encapsulating data partitions in Data objects. There would be one or more parts in an Alignment.

putGaps(theDnaSequenceList)

Insert gaps in theDnaSequenceList based on gaps in self.

Like James O. McInerney’s ‘putgaps’ program. Self should be a protein alignment. The DNA is input as a SequenceList object, ‘theDnaSequenceList’. It creates and returns a new DNA alignment.

For example:

read('myProteinAlignment.nex')
pAlign = var.alignments[0]
read('myUnalignedDna.fasta')
sl = var.sequenceLists[0]
newDnaAlign = pAlign.putGaps(sl)
newDnaAlign.writeNexus('d.nex')
Parameters

DNA SequenceList object (a) –

Returns

a new Alignment object

readOpenPhylipFile(flob, nTax, nChar)

Read flob to get data in phylip format.

The user would generally not need to call this method directly. It is called by read() etc.

recodeDayhoff(firstLetter=False)

Recode protein data into Dayhoff groups, in place.

  1. c

  2. stpag

  3. ndeq

  4. hrk

  5. milv

  6. fyw

The ambiguity character ‘x’ is recoded as a gap.

It does not make a new alignment– it does the re-coding ‘in-place’.

If arg firstLetter is set, then the character is recoded as the first letter of its group rather than as a number. Eg k would be recoded as h rather than as 4.

recodeProteinIntoGroups(groups, firstLetter=False)

Recode protein data into user-specified groups, in place.

A generalization of p4.alignment.Alignment.recodeDayhoff()

The arg groups should be a list of strings indicating the groupings, with all AAs present. Case does not matter.

The ambiguity character ‘x’ is recoded as a gap.

It does not make a new alignment– it does the re-coding ‘in-place’.

If arg firstLetter is set, then the character is recoded as the first letter of its group rather than as a number.

recodeRY(ambigsBecomeGaps=True, use01=False)

Recode DNA data into purines and pyrimidines, in place.

So A and G becomes R, and C and T becomes Y. Gaps remain gaps, missing (?) remains missing, and Rs and Ys remain as they are. Depending on the setting of ambigsBecomeGaps, Ns may also remain unmodified, but any other characters (DNA ambiguity characters) become either gaps or N, depending on the setting of ambigsBecomeGaps. So if ambigsBecomeGaps is turned on, as it is by default, then N, S, M, and so on become ‘-‘ ie gap characters. If ambigsBecomeGaps is turned off, then they all become Ns. If ambigsBecomeGaps is turned off, then the alignment gets an equate, of N for R or Y.

If use01 is turned on (it is off by default) then we use 0 and 1 for characters, rather than R and Y. This is a bit buggy because any existing R or Y characters are ignored, and other ambigs are encoded as N.

The dataType becomes ‘standard’ and the dim becomes 2.

It does not make a new alignment– it does the re-coding ‘in-place’.

resetPartsContentFromSequences()

Reset Part.cPart sequences from self.sequences.

It then makes patterns, and sets the global invariant sites array.

resetSequencesFromParts()

Gets the sequences from Part.cPart, and installs them in self.

sequenceSlice(pos)

Returns a list composed of the characters from the alignment at position pos.

Parameters

pos (int) – Zero-based position

Returns

A list of the characters at that position

setCharPartition(charPartitionName)[source]

Partition self into Parts based on charPartitionName.

You need to do this before you ask for a Data object.

You can also un-partition an already partitioned alignment by feeding this charPartitionName = None.

setGBlocksCharSet(b1=None, b2=None, b3=8, b4=10, b5='n', pathToGBlocks='Gblocks', verbose=False, deleteFiles=True)

Find conserved regions of an alignment using Gblocks

Gblocks <http://molevol.ibmb.csic.es/Gblocks.html>

This method sets a charset for the gblocks-included sites. And it makes a duplicate charset. Both are put in self.nexusSets, so they are written if you write self in nexus format.

‘b1’ Minimum number of sequences for a conserved position.

Default- 50% of number of sequences + 1

‘b2’ Minimum number of sequences for a flank position

Default- 85% of the number of sequences

‘b3’ Maximum Number Of Contiguous Nonconserved Positions

Default 8

‘b4’ Minimum Length Of A Block

Default 10

‘b5’ Allowed Gap Positions (None, With Half, All) n,h,a

Default ‘n’

Gblocks removes all-gap sites from the alignment (before doing anything, I think), and the mask that it delivers is based on the (possibly) shortened alignment. The present method modifies that mask to make it full-length again, so that it applies to self correctly.

The default b5 of n is generally too conservative, and you may rather want b5=’h’ for half, or even b5=’a’.

A duplicate charset is also made and attached, so that you can start with a gblocks mask and modify it from there, while keeping the gblocks charset.

So you might use it like this:

a = func.readAndPop('myAlignment.fasta')
a.setGBlocksCharSet(b5='h')
a.writeNexus('align_withGblocksCharset.nex')  # Nexus format to get sets block

Then read align_withGblocksCharset.nex with Seaview, pull down the Sites menu, and note that there are 2 charsets – gblocks and myblocks. They are identical. You will often find that the gblocks alignment is not quite what you want, but I assume you want to keep it intact for reference, so set and modify the myblocks char set in Seaview. Then save it in Seaview as align_myblocks_handEdited.nex. Then, back in p4:

a = func.readAndPop('align_myblocks_handEdited.nex')
b = a.subsetUsingCharSet('myblocks')
b.writePhylip('align_goodSitesOnly.phy')
setNexusSets()[source]

Set self.nexusSets from var.nexusSets.

A deepcopy is made of var.nexusSets, and then attached to self. Sometimes other Nexus-set related methods trigger this.

If var.nexusSets does not yet exist, a new blank one is made.

simpleCharCounts(seqNum=None)

Counts of chars that are symbols, only.

Returns a Numpy array of floats.

Gaps or ambigs are not counted.

If seqNum is given, then the counts are for that sequence only. Otherwise the counts are for the whole alignment.

simpleConstantMask(ignoreGapQ=True, invert=False)

Returns a mask with 1 at simple constant sites, and 0 otherwise.

See also constantMask()

A simple constant site is a site that has only 1 kind of symbol, and no ambiguities.

If ignoreGapQ (ignore ‘-‘ and ‘?’) is turned on, then a site is a simple constant site if it has gaps and Q-marks, but otherwise only one kind of symbol. If ignoreGapQ is turned off, then the presence of any ‘-‘ and ‘?’ makes it a non-constant site. If it is all gaps or Q-marks, then this method throws a P4Error.

seq1  ac---cgt-
seq2  acgtacgt-
seq3  rcgtacgta
mask  011111111   if ignoreGapQ=True
mask  010001110   if ignoreGapQ=False

All-gap columns are really undefined as to whether they are constant or not. It gets worse if we invert to ask for potentially variable sites – if an all-gap site is not constant, then it becomes potentially a variable site, which does not make sense to me. Best to strip out all-gap sites first.

subsetUsingCharSet(charSetName, inverse=0)[source]

Return a subset of self based on a charSet.

A charset has a mask, composed of zeros and ones, which is used to subset self. Returns an alignment.

For example:

read("myAlignment.nex")
a = var.alignments[0]
read('myNexusSets.nex') # with a charset named 'foo'
b = a.subsetUsingCharSet('foo')
b.writeNexus('myFooSubset.nex')
subsetUsingMask(theMask, theMaskChar='1', inverse=0)[source]

Returns a subset of self based on a mask.

This makes a copy of the alignment based on theMask. Arg theMask is a string, the same length as the alignment. The character theMaskChar determines which positions are included in the copy. If the character in theMask is theMaskChar, that position is included in the copy. Otherwise, no. If inverse is set, then theMaskChar determines which positions are excluded, and all other positions are included.

symbols

Lowercase string, eg ‘acgt’. The order is all-important.

symtestAsInIQTreeNaserKhdour(verbose=True)

Matched-pairs tests of one pair, as in IQTree

This has appeared in IQTree betas from about 1.7beta onwards, and is part of iqtree2. (There was a small bug in --symtest that was fixed in v 2.0.6, June 2020.)

See Naser-Khdour et al GBE 2019 https://doi.org/10.1093/gbe/evz193

This is my attempt to replicate it. That implementation chooses the sequence pair with the biggest divergence, and only reports stats for that pair. This does not necessarily show the smallest stats.

If it is verbose, it speaks the results to stdout.

It returns the 3 probabilities (not stats) – PB, PS, PR, which are Bowker’s, Stuart’s and Ababneh’s.

property taxNames

A list of the names of the Sequences.

testOverallFromAbabnehEtAl2006()

Marginal symmetry (Stuarts’s) test for more than two matched sequences

As in Ababneh et al 2006 Page 1227, calculated as in the R function Testoverall.

A tuple composed of the Ts stat, the degrees of freedom, and the probability, is returned.

translate(transl_table=1, checkStarts=False, nnn_is_gap=False)

Returns a protein alignment from self, a DNA alignment.

Self is translated using GeneticCode.GeneticCode.translate(), so it handles ambiguities. At the moment, we can only do translations where the frame of the codon is 123, ie the first sequence position is the first position of the codon. The default transl_table is the standard (or so-called universal) genetic code, but you can change it.

Other available translation tables, this week:

if transl_table == 1: # standard
elif transl_table == 2: # vertebrate mito
elif transl_table == 4: # Mold, Protozoan,
                        # and Coelenterate Mitochondrial Code
                        # and the Mycoplasma/Spiroplasma Code
elif transl_table == 5: # invertebrate mito
elif transl_table == 9: # echinoderm mito

and now 6, 10, 11, 12, 13, 14, 21.

(These are found in p4.geneticcode.GeneticCode)

See also p4.alignment.Alignment.checkTranslation().

If the arg checkStarts is turned on (by default it is not turned on) then this method checks whether the first codon is a start codon.

Arg nnn_is_gap is for odd alignments where there are long stretches of ‘nnn’ codons, which probably should be gaps. Probably best to correct those elsewise.

writeMolphy(fName=None)

Another phylogenetics program, another format.

Does anybody use this anymore?

writeNexus(fName=None, writeDataBlock=0, interleave=0, flat=0, append=0, userText='')

Write self in Nexus format.

If writeDataBlock=1, then a data block is written, rather than the default, which is to write a taxa and a characters block.

Flat gives sequences all on one line. Append, if 0, writes #NEXUS first. If 1, does not write #NEXUS.

userText is anything, eg a comment or another Nexus block, that you might want to add. It goes at the end.

writePhylip(fName=None, interleave=True, whitespaceSeparatesNames=True, flat=False, offset=1, uppercase=False)

Write the alignment in Phylip format.

Arg fName can be None, the default, in which case the alignment is written to stdout, or the fName can be an open file-like object, or fName can be a string, the name of a file to write to. If it is a file-like object it will be re-wound to the beginning, but it is is up to the user to open and close it.

If interleave is turned off (it is on by default), then sequences are written sequentially.

Phylip and phylip-like formats are too varied. The strict Phylip format has a set number of spaces for the taxon name, and there may not necessarily be a space between the name and the sequence. The name size is commonly 10 spaces, but it need not be – it is a compile-time option in Phylip.

Other programs, eg phyml and PAML, use a phylip-like format where the tax name is set off from the sequence by whitespace. There is no set number of spaces that the sequence needs to occupy, and there may not be spaces in the tax name. ‘offset’ is the number of spaces from the end of the name to the beginning of the sequence.

This method used to write strict, real phylip format by default, where there is a set number of spaces for the taxon name, and where there may not necessarily be a space between the name and the sequence. The name size is commonly 10 spaces, but it need not be– it is set by var.phylipDataMaxNameLength (default 10).

However, it no longer does that by default, as people use the format too loosely. So now ‘whitespaceSeparatesNames’ is turned on by default. It accommodates names longer than 10 chars.

If you want to write strict Phylip format, turn ‘whitespaceSeparatesNames’ off. Note that in that format, described here, it is ok to have blanks in the sequence, and so p4 puts a blank at the beginning of the sequence when writing this way, even though the meaning of ‘whitespaceSeparatesNames’ would suggest not. Here the main effect of whitespaceSeparatesNames=False is that it is assumed that whitespace will not be used to trigger the end of the name in the downstream program. Rather it is assumed that in the downstream program the end of the name will triggered by reading in var.phylipDataMaxNameLength characters. If I were to set whitespaceSeparatesNames=True when I write (because I expect that the downstream program that will read it expects that) then I check for internal blank spaces in the names (which would be disallowed), and also I then allow more than var.phylipDataMaxNameLength characters. However, in both cases there is at least one blank after the name.

Arg ‘flat’ puts it all on one line. This is not compatible with ‘interleave’, of course.