Chapter 8 Cookbook -- Cool things to do with it
8.1 PubMed
8.1.1 Sending a query to PubMed
If you are in the Medical field or interested in human issues (and many times even if you are not!), PubMed (http://www.ncbi.nlm.nih.gov/PubMed/) is an excellent source of all kinds of goodies. So like other things, we'd like to be able to grab information from it and use it in python scripts.
Querying PubMed using Biopython is extremely painless. To get all of the article ids for articles having to do with orchids (see section 2.3 for our motivation), we only need the following three lines of code:
from Bio import PubMed
search_term = 'orchid'
orchid_ids = PubMed.search_for(search_term)
This returns a python list containing all of the orchid ids
['11070358', '11064040', '11028023', '10947239', '10938351', '10936520',
'10905611', '10899814', '10856762', '10854740', '10758893', '10716342',
...
With this list of ids we are ready to start retrieving the records, so follow on ahead to the next section.
8.1.2 Retrieving a PubMed record
The previous section described how to get a bunch of article ids. Now that we've got them, we obviously want to get the corresponding Medline records and extract the information from them.
The interface for retrieving records from PubMed should be very intuitive to python programmers -- it models a python dictionary. To set up this interface, we need to set up a parser that will parse the results that we retrieve. The following lines of code get everything set up:
from Bio import PubMed
from Bio import Medline
rec_parser = Medline.RecordParser()
medline_dict = PubMed.Dictionary(parser = rec_parser)
What we've done is create a dictionary like object medline_dict
. To get an article we access it like medline_dict[id_to_get]
. What this does is connect with PubMed, get the article you ask for, parse it into a record object, and return it. Very cool!
Now let's look at how to use this nice dictionary to print out some information about some ids. We just need to loop through our ids (orchid_ids
from the previous section) and print out the information we are interested in:
for oid in orchid_ids[0:5]:
cur_record = medline_dict[oid]
print 'title:', cur_record.title.rstrip()
print 'authors:', cur_record.authors
print 'source:', cur_record.source.strip()
print
The output for this looks like:
title: Sex pheromone mimicry in the early spider orchid (ophrys sphegodes):
patterns of hydrocarbons as the key mechanism for pollination by sexual
deception [In Process Citation]
authors: ['Schiestl FP', 'Ayasse M', 'Paulus HF', 'Lofstedt C', 'Hansson BS',
'Ibarra F', 'Francke W']
source: J Comp Physiol [A] 2000 Jun;186(6):567-74
Especially interesting to note is the list of authors, which is returned as a standard python list. This makes it easy to manipulate and search using standard python tools. For instance, we could loop through a whole bunch of entries searching for a particular author with code like the following:
search_author = 'Waits T'
for our_id in our_id_list:
cur_record = medline_dict[our_id]
if search_author in cur_record.authors:
print "Author %s found: %s" % (search_author,
cur_record.source.strip())
The PubMed and Medline interfaces are very mature and nice to work with -- hopefully this section gave you an idea of the power of the interfaces and how they can be used.
8.2 GenBank
The GenBank record format is a very popular method of holding information about sequences, sequence features, and other associated sequence information. The format is a good way to get information from the NCBI databases at http://www.ncbi.nlm.nih.gov/.
8.2.1 Retrieving GenBank entries from NCBI
One very nice feature of the GenBank libraries is the ability to automate retrieval of entries from GenBank. This is very convenient for creating scripts that automate a lot of your daily work. In this example we'll show how to query the NCBI databases, and to retrieve the records from the query - something touched on in Section 4.2.1.
First, we want to make a query and find out the ids of the records to retrieve. Here we'll do a quick search for our favorite organism, Opuntia. We can do quick search and get back the GIs (GenBank identifiers) for all of the corresponding records:
from Bio import GenBank
gi_list = GenBank.search_for("Opuntia AND rpl16")
gi_list
will be a list of all of the GenBank identifiers that match our query:
["6273291", "6273290", "6273289", "6273287", "6273286", "6273285", "6273284"]
Now that we've got the GIs, we can use these to access the NCBI database through a dictionary interface. For instance, to retrieve the information for the first GI, we'll first have to create a dictionary that accesses NCBI:
ncbi_dict = GenBank.NCBIDictionary("nucleotide", "genbank")
Now that we've got this, we do the retrieval:
gb_record = ncbi_dict[gi_list[0]]
In this case, gb_record
will be GenBank formatted record:
LOCUS AF191665 902 bp DNA PLN 07-NOV-1999
DEFINITION Opuntia marenae rpl16 gene; chloroplast gene for chloroplast
product, partial intron sequence.
ACCESSION AF191665
VERSION AF191665.1 GI:6273291
...
In this case, we are just getting the raw records. We can also pass these records directly into a parser and return the parsed record. For instance, if we wanted to get back SeqRecord objects with the GenBank file parsed into SeqFeature objects we would need to create the dictionary with the GenBank FeatureParser:
record_parser = GenBank.FeatureParser()
ncbi_dict = GenBank.NCBIDictionary("nucleotide", "genbank",
parser = record_parser)
Now retrieving a record will give you a SeqRecord object instead of the raw record:
>>> gb_seqrecord = ncbi_dict[gi_list[0]]
>>> print gb_seqrecord
<Bio.SeqRecord.SeqRecord instance at 0x102f9404>
For more information of formats you can parse GenBank records into, please see section 8.2.2.
Using these automated query retrieval functionality is a big plus over doing things by hand. Additionally, the retrieval has nice built in features like a time-delay, which will prevent NCBI from getting mad at you and blocking your access.
8.2.2 Parsing GenBank records
While GenBank files are nice and have lots of information, at the same time you probably only want to extract a small amount of that information at a time. The key to doing this is parsing out the information. Biopython provides GenBank parsers which help you accomplish this task. Right now the GenBank module provides the following parsers:
-
RecordParser -- This parses the raw record into a GenBank specific Record object. This object models the information in a raw record very closely, so this is good to use if you are just interested in GenBank records themselves.
- FeatureParser -- This parses the raw record in a SeqRecord object with all of the feature table information represented in SeqFeatures (see section 9.1 for more info on these objects). This is best to use if you are interested in getting things in a more standard format. If you use
Bio.SeqIO
(Chapter 4) to read a GenBank file, it will call this FeatureParser for you.
Depending on the type of GenBank files you are interested in, they will either contain a single record, or multiple records. Each record will start with a LOCUS line, various other header lines, a list of features, and finally the sequence data, ending with a // line.
Dealing with a GenBank file containing a single record is very easy. For example, let's use a small bacterial genome, Nanoarchaeum equitans Kin4-M (RefSeq NC_005213, GenBank AE017199) which can be downloaded from the NCBI here (only 1.15 MB):
from Bio import GenBank
feature_parser = GenBank.FeatureParser()
gb_record = feature_parser.parse(open("AE017199.gbk"))
# now do something with the record
print "Name %s, %i features" % (gb_record.name, len(gb_record.features))
print repr(gb_record.seq)
Or, using Bio.SeqIO
instead (see Chapter 4):
from Bio import SeqIO
gb_record = SeqIO.read(open("AE017199.gbk"), "genbank")
print "Name %s, %i features" % (gb_record.name, len(gb_record.features))
print repr(gb_record.seq)
Either should give the following output:
Name AE017199, 1107 features
Seq('TCTCGCAGAGTTCTTTTTTGTATTAACAAACCCAAAACCCATAGAATTTAATGA...TTA', IUPACAmbiguousDNA())
8.2.3 Iterating over GenBank records
For multi-record GenBank files, the most common usage will be creating an iterator, and parsing through the file record by record. Doing this is very similar to how things are done in other formats, as the following code demonstrates, using an example file cor6_6.gb which is included in the BioPython source code under the Tests/GenBank/ directory:
from Bio import GenBank
feature_parser = GenBank.FeatureParser()
gb_iterator = GenBank.Iterator(open("cor6_6.gb"), feature_parser)
for cur_record in gb_iterator :
print "Name %s, %i features" % (cur_record.name, len(cur_record.features))
print repr(cur_record.seq)
Or, using Bio.SeqIO
instead (see Chapter 4):
from Bio import SeqIO
for cur_record in SeqIO.parse(open("cor6_6.gb"), "genbank") :
print "Name %s, %i features" % (cur_record.name, len(cur_record.features))
print repr(cur_record.seq)
This just iterates over a GenBank file, parsing it into SeqRecord and SeqFeature objects, and prints out the Seq objects representing the sequences in the record.
As with other formats, you have lots of tools for dealing with GenBank records. This should make it possible to do whatever you need to with GenBank.
8.2.4 Making your very own GenBank database
One very cool thing that you can do is set up your own personal GenBank database and access it like a dictionary (this can be extra cool because you can also allow access to these local databases over a network using BioCorba -- see the BioCorba documentation for more information).
Note - this is only worth doing if your GenBank file contains more than one record.
Making a local database first involves creating an index file, which will allow quick access to any record in the file. To do this, we use the index file function.
Again, this example uses the file cor6_6.gb
which is included in the BioPython source code under the Tests/GenBank/ directory:
>>> from Bio import GenBank
>>> dict_file = "cor6_6.gb"
>>> index_file = "cor6_6.idx"
>>> GenBank.index_file(dict_file, index_file)
This will create a directory called cor6_6.idx
containing the index files. Now, we can use this index to create a dictionary object that allows individual access to every record. Like the Iterator and NCBIDictionary interfaces, we can either get back raw records, or we can pass the dictionary a parser that will parse the records before returning them. In this case, we pass a FeatureParser
so that when we get a record, then we retrieve a SeqRecord object.
Setting up the dictionary is as easy as one line:
>>> gb_dict = GenBank.Dictionary(index_file, GenBank.FeatureParser())
Now we can deal with this like a dictionary. For instance:
>>> len(gb_dict)
6
>>> gb_dict.keys()
['L31939', 'AJ237582', 'X62281', 'AF297471', 'M81224', 'X55053']
Finally, we retrieve objects using subscripting:
>>> gb_dict['AJ237582']
<Bio.SeqRecord.SeqRecord instance at 0x102fdd8c>
>>> print len(gb_dict['X55053'].features)
3
8.3 Dealing with alignments
It is often very useful to be able to align particular sequences. I do this quite often to get a quick and dirty idea of relationships between sequences. Consequently, it is very nice to be able to quickly write up a python script that does an alignment and gives you back objects that are easy to work with. The alignment related code in Biopython is meant to allow python-level access to alignment programs so that you can run alignments quickly from within scripts.
Clustalx (http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html) is a very nice program for doing multiple alignments. Biopython offers access to alignments in clustal format (these normally have a *.aln
extension) that are produced by Clustalx. It also offers access to clustalw, which the is command line version of clustalx.
We'll need some sequences to align, such as opuntia.fasta (also available online here) which is a small FASTA file containing seven orchid gene DNA sequences, which you can also from Doc/examples/
in the Biopython source distribution.
The first step in interacting with clustalw is to set up a command line you want to pass to the program. Clustalw has a ton of command line options, and if you set a lot of parameters, you can end up typing in a huge ol' command line quite a bit. This command line class models the command line by making all of the options be attributes of the class that can be set. A few convenience functions also exist to set certain parameters, so that some error checking on the parameters can be done.
To create a command line object to do a clustalw multiple alignment we do the following:
import os
from Bio.Clustalw import MultipleAlignCL
cline = MultipleAlignCL(os.path.join(os.curdir, "opuntia.fasta"))
cline.set_output("test.aln")
First we import the MultipleAlignCL
object, which models running a multiple alignment from clustalw. We then initialize the command line, with a single argument of the fasta file that we are going to be using for the alignment. The initialization function also takes an optional second argument which specifies the location of the clustalw
executable. By default, the commandline will just be invoked with 'clustalw,' assuming that you've got it somewhere on your PATH
.
The second argument sets the output to go to the file test.aln
. The MultipleAlignCL
object also has numerous other parameters to specify things like output format, gap costs, etc.
We can look at the command line we have generated by invoking the __str__
member attribute of the MultipleAlignCL
class. This is done by calling str(cline)
or simple by printing out the command line with print cline
. In this case, doing this would give the following output:
clustalw ./opuntia.fasta -OUTFILE=test.aln
Now that we've set up a simple command line, we now want to run the commandline and collect the results so we can deal with them. This can be done using the do_alignment
function of Clustalw
as follows:
from Bio import Clustalw
alignment = Clustalw.do_alignment(cline)
What happens when you run this if that Biopython executes your command line and runs clustalw with the given parameters. It then grabs the output, and if it is in a format that Biopython can parse (currently only clustal format), then it will parse the results and return them as an alignment object of the appropriate type. So in this case since we are getting results in the default clustal format, the returned alignment
object will be a ClustalAlignment
type.
Once we've got this alignment, we can do some interesting things with it such as get seq_record
objects for all of the sequences involved in the alignment:
all_records = alignment.get_all_seqs()
print "description:", all_records[0].description
print "sequence:", all_records[0].seq
This prints out the description and sequence object for the first sequence in the alignment:
description: gi|6273285|gb|AF191659.1|AF191
sequence: Seq('TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT
...', IUPACAmbiguousDNA())
You can also calculate the maximum length of the alignment with:
length = alignment.get_alignment_length()
Finally, to write out the alignment object in the original format, we just need to access the __str__
function. So doing a print alignment
gives:
CLUSTAL X (1.81) multiple sequence alignment
gi|6273285|gb|AF191659.1|AF191 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
gi|6273284|gb|AF191658.1|AF191 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
...
This makes it easy to write your alignment back into a file with all of the original info intact.
If you want to do more interesting things with an alignment, the best thing to do is to pass the alignment to an alignment information generating object, such as the SummaryInfo object, described in section 8.3.2.
8.3.2 Calculating summary information
Once you have an alignment, you are very likely going to want to find out information about it. Instead of trying to have all of the functions that can generate information about an alignment in the alignment object itself, we've tried to separate out the functionality into separate classes, which act on the alignment.
Getting ready to calculate summary information about an object is quick to do. Let's say we've got an alignment object called alignment
. All we need to do to get an object that will calculate summary information is:
from Bio.Align import AlignInfo
summary_align = AlignInfo.SummaryInfo(alignment)
The summary_align
object is very useful, and will do the following neat things for you:
-
Calculate a quick consensus sequence -- see section 8.3.3
- Get a position specific score matrix for the alignment -- see section 8.3.4
- Calculate the information content for the alignment -- see section 8.3.5
- Generate information on substitutions in the alignment -- section 8.4 details using this to generate a substitution matrix.
8.3.3 Calculating a quick consensus sequence
The SummaryInfo
object, described in section 8.3.2, provides functionality to calculate a quick consensus of an alignment. Assuming we've got a SummaryInfo
object called summary_align
we can calculate a consensus by doing:
consensus = summary_align.dumb_consensus()
As the name suggests, this is a really simple consensus calculator, and will just add up all of the residues at each point in the consensus, and if the most common value is higher than some threshold value (the default is .3) will add the common residue to the consensus. If it doesn't reach the threshold, it adds an ambiguity character to the consensus. The returned consensus object is Seq object whose alphabet is inferred from the alphabets of the sequences making up the consensus. So doing a print consensus
would give:
consensus Seq('TATACATNAAAGNAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT
...', IUPACAmbiguousDNA())
You can adjust how dumb_consensus
works by passing optional parameters:
-
the threshold
- This is the threshold specifying how common a particular residue has to be at a position before it is added. The default is .7.
- the ambiguous character
- This is the ambiguity character to use. The default is 'N'.
- the consensus alphabet
- This is the alphabet to use for the consensus sequence. If an alphabet is not specified than we will try to guess the alphabet based on the alphabets of the sequences in the alignment.
8.3.4 Position Specific Score Matrices
Position specific score matrices (PSSMs) summarize the alignment information in a different way than a consensus, and may be useful for different tasks. Basically, a PSSM is a count matrix. For each column in the alignment, the number of each alphabet letters is counted and totaled. The totals are displayed relative to some representative sequence along the left axis. This sequence may be the consesus sequence, but can also be any sequence in the alignment. For instance for the alignment,
GTATC
AT--C
CTGTC
the PSSM is:
G A T C
G 1 1 0 1
T 0 0 3 0
A 1 1 0 0
T 0 0 2 0
C 0 0 0 3
Let's assume we've got an alignment object called c_align
. To get a PSSM with the consensus sequence along the side we first get a summary object and calculate the consensus sequence:
summary_align = AlignInfo.SummaryInfo(c_align)
consensus = summary_align.dumb_consensus()
Now, we want to make the PSSM, but ignore any N
ambiguity residues when calculating this:
my_pssm = summary_align.pos_specific_score_matrix(consensus,
chars_to_ignore = ['N'])
Two notes should be made about this:
-
To maintain strictness with the alphabets, you can only include characters along the top of the PSSM that are in the alphabet of the alignment object. Gaps are not included along the top axis of the PSSM.
- The sequence passed to be displayed along the left side of the axis does not need to be the consensus. For instance, if you wanted to display the second sequence in the alignment along this axis, you would need to do:
second_seq = alignment.get_seq_by_num(1)
my_pssm = summary_align.pos_specific_score_matrix(second_seq
chars_to_ignore = ['N'])
The command above returns a PSSM
object. To print out the PSSM as we showed above, we simply need to do a print my_pssm
, which gives:
A C G T
T 0.0 0.0 0.0 7.0
A 7.0 0.0 0.0 0.0
T 0.0 0.0 0.0 7.0
A 7.0 0.0 0.0 0.0
C 0.0 7.0 0.0 0.0
A 7.0 0.0 0.0 0.0
T 0.0 0.0 0.0 7.0
T 1.0 0.0 0.0 6.0
...
You can access any element of the PSSM by subscripting like your_pssm[sequence_number][residue_count_name]
. For instance, to get the counts for the 'A' residue in the second element of the above PSSM you would do:
>>> print my_pssm[1]["A"]
7.0
The structure of the PSSM class hopefully makes it easy both to access elements and to pretty print the matrix.
8.3.5 Information Content
A potentially useful measure of evolutionary conservation is the information ceontent of a sequence.
A useful introduction to information theory targetted towards molecular biologists can be found at http://www.lecb.ncifcrf.gov/~toms/paper/primer/. For our purposes, we will be looking at the information content of a consesus sequence, or a portion of a consensus sequence. We calculate information content at a particular column in a multiple sequence alignment using the following formula:
where:
-
ICj -- The information content for the jth column in an alignment.
- Na -- The number of letters in the alphabet.
- Pij -- The frequency of a particular letter in the column (i. e. if G occured 3 out of 6 times in an aligment column, this would be 0.5)
- Qi -- The expected frequency of a letter. This is an
optional argument, usage of which is left at the user's
discretion. By default, it is automatically assigned to 0.05 for a
protein alphabet, and 0.25 for a nucleic acid alphabet. This is for
geting the information content without any assumption of prior
distribtions. When assuming priors, or when using a non-standard
alphabet, user should supply the values for Qi.
Well, now that we have an idea what information content is being calculated in Biopython, let's look at how to get it for a particular region of the alignment.
First, we need to use our alignment to get a alignment summary object, which we'll assume is called summary_align
(see section 8.3.2) for instructions on how to get this. Once we've got this object, calculating the information content for a region is as easy as:
info_content = summary_align.information_content(5, 30,
chars_to_ignore = ['N'])
Wow, that was much easier then the formula above made it look! The variable info_content
now contains a float value specifying the information content over the specified region (from 5 to 30 of the alignment). We specifically ignore the ambiguity residue 'N' when calculating the information content, since this value is not included in our alphabet (so we shouldn't be interested in looking at it!).
As mentioned above, we can also calculate relative information content by supplying the expected frequencies:
expect_freq = {
'A' : .3,
'G' : .2,
'T' : .3,
'C' : .2}
The expected should not be passed as a raw dictionary, but instead by passed as a SubsMat.FreqTable
object (see section 9.4.2 for more information about FreqTables). The FreqTable object provides a standard for associating the dictionary with an Alphabet, similar to how the Biopython Seq class works.
To create a FreqTable object, from the frequency dictionary you just need to do:
from Bio.Alphabet import IUPAC
from Bio.SubsMat import FreqTable
e_freq_table = FreqTable.FreqTable(expect_freq, FreqTable.FREQ,
IUPAC.unambiguous_dna)
Now that we've got that, calculating the relative information content for our region of the alignment is as simple as:
info_content = summary_align.information_content(5, 30,
e_freq_table = e_freq_table,
chars_to_ignore = ['N'])
Now, info_content
will contain the relative information content over the region in relation to the expected frequencies.
The value return is calculated using base 2 as the logarithm base in the formula above. You can modify this by passing the parameter log_base
as the base you want:
info_content = summary_align.information_content(5, 30, log_base = 10
chars_to_ignore = ['N'])
Well, now you are ready to calculate information content. If you want to try applying this to some real life problems, it would probably be best to dig into the literature on information content to get an idea of how it is used. Hopefully your digging won't reveal any mistakes made in coding this function!
8.3.6 Translating between Alignment formats
One thing that you always end up having to do is convert between different formats. Biopython does this using a FormatConverter class for alignment objects. First, let's say we have just parsed an alignment from clustal format into a ClustalAlignment
object:
import os
from Bio import Clustalw
alignment = Clustalw.parse_file(os.path.join(os.curdir, "test.aln"))
Now, let's convert this alignment into FASTA format. First, we create a converter object:
from Bio.Align.FormatConvert import FormatConverter
converter = FormatConverter(alignment)
We pass the converter the alignment that we want to convert. Now, to get this in FASTA alignment format, we simply do the following:
fasta_align = converter.to_fasta()
Looking at the newly created fasta_align
object using print fasta_align
gives:
>gi|6273285|gb|AF191659.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAATATATA----
------ATATATTTCAAATTTCCTTATATACCCAAATATAAAAATATCTAATAAATTAGA
...
The conversion process will, of course, lose information specific to a particular alignment format. Howerver, most of the basic information about the alignment will be retained.
As more formats are added the converter will be beefed up to read and write all of these different formats.
8.4 Substitution Matrices
Substitution matrices are an extremely important part of everyday bioinformatics work. They provide the scoring terms for classifying how likely two different residues are to substitute for each other. This is essential in doing sequence comparisons. The book ``Biological Sequence Analysis'' by Durbin et al. provides a really nice introduction to Substitution Matrices and their uses. Some famous substitution matrices are the PAM and BLOSUM series of matrices.
Biopython provides a ton of common substitution matrices, and also provides functionality for creating your own substitution matrices.
8.4.1 Using common substitution matrices
8.4.2 Creating your own substitution matrix from an alignment
A very cool thing that you can do easily with the substitution matrix
classes is to create your own substitution matrix from an
alignment. In practice, this is normally done with protein
alignments. In this example, we'll first get a biopython alignment
object and then get a summary object to calculate info about the
alignment. The file containing protein.aln
(also available online
here)
contains the Clustalw alignment output.
from Bio import Clustalw
from Bio.Alphabet import IUPAC
from Bio.Align import AlignInfo
# get an alignment object from a Clustalw alignment output
c_align = Clustalw.parse_file("protein.aln", IUPAC.protein)
summary_align = AlignInfo.SummaryInfo(c_align)
Sections 8.3.1 and 8.3.2 contain
more information on doing this.
Now that we've got our summary_align
object, we want to use it
to find out the number of times different residues substitute for each
other. To make the example more readable, we'll focus on only amino
acids with polar charged side chains. Luckily, this can be done easily
when generating a replacement dictionary, by passing in all of the
characters that should be ignored. Thus we'll create a dictionary of
replacements for only charged polar amino acids using:
replace_info = summary_align.replacement_dictionary(["G", "A", "V", "L", "I",
"M", "P", "F", "W", "S",
"T", "N", "Q", "Y", "C"])
This information about amino acid replacements is represented as a
python dictionary which will look something like:
{('R', 'R'): 2079.0, ('R', 'H'): 17.0, ('R', 'K'): 103.0, ('R', 'E'): 2.0,
('R', 'D'): 2.0, ('H', 'R'): 0, ('D', 'H'): 15.0, ('K', 'K'): 3218.0,
('K', 'H'): 24.0, ('H', 'K'): 8.0, ('E', 'H'): 15.0, ('H', 'H'): 1235.0,
('H', 'E'): 18.0, ('H', 'D'): 0, ('K', 'D'): 0, ('K', 'E'): 9.0,
('D', 'R'): 48.0, ('E', 'R'): 2.0, ('D', 'K'): 1.0, ('E', 'K'): 45.0,
('K', 'R'): 130.0, ('E', 'D'): 241.0, ('E', 'E'): 3305.0,
('D', 'E'): 270.0, ('D', 'D'): 2360.0}
This information gives us our accepted number of replacements, or how
often we expect different things to substitute for each other. It
turns out, amazingly enough, that this is all of the information we
need to go ahead and create a substitution matrix. First, we use the
replacement dictionary information to create an Accepted Replacement
Matrix (ARM):
from Bio import SubsMat
my_arm = SubsMat.SeqMat(replace_info)
With this accepted replacement matrix, we can go right ahead and
create our log odds matrix (i. e. a standard type Substitution Matrix):
my_lom = SubsMat.make_log_odds_matrix(my_arm)
The log odds matrix you create is customizable with the following
optional arguments:
-
exp_freq_table
-- You can pass a table of expected
frequencies for each alphabet. If supplied, this will be used
instead of the passed accepted replacement matrix when calculate
expected replacments.
logbase
- The base of the logarithm taken to create the
log odd matrix. Defaults to base 10.
factor
- The factor to multiply each matrix entry
by. This defaults to 10, which normally makes the matrix numbers
easy to work with.
round_digit
- The digit to round to in the matrix. This
defaults to 0 (i. e. no digits).
Once you've got your log odds matrix, you can display it prettily
using the function print_mat
. Doing this on our created matrix
gives:
>>> my_lom.print_mat()
D 6
E -5 5
H -15 -13 10
K -31 -15 -13 6
R -13 -25 -14 -7 7
D E H K R
Very nice. Now we've got our very own substitution matrix to play with!
8.5 BioRegistry -- automatically finding sequence sources
A consistently annoying problem in bioinformatics is easily finding a
sequence and making it available to your program. Sequences are
available from a ton of standard locations like NCBI and EMBL. as well
as from non-standard locations such as local databases or web servers.
To make this problem easier, Biopython (as well as the other open-bio
projects) is working towards a standard mechanism to allow specification
of the locations of resources. Once locations are specified, your code
using Biopython can readily retrieve sequences without having to worry
about the specifics of where the sequence came from.
This transparency of retrieval has a number of advantages for your code.
If a single web service is down (ie. NCBI is too busy
and is refusing connections), backup locations can be tried without
having any effect on the code that you wrote. Similary, you can have
local repositories of sequences that you use often, and then if these
repositories are off-line, switch to a web based service. Third, it
keeps the details of retrieval out of your code, allowing you to focus
on your biological problem, instead of focusing on boring retrieval
details. Finally, it's just a very cool idea.
This section deals with the specifics of setting up and using this
system of automatically retrieving sequences. The first section deals
with the interoperable configuration file method, while the second talks
about a similar Biopython-specific method. The configuration file method
is definately the way to go, unless you have specific needs it won't
give you.
8.5.1 Finding resources using a configuration file
8.5.1.1 Writing a configuration file
8.5.1.2 Sequence retrieval using the configuration file
8.5.2 Finding resources through a biopython specific interface
Biopython has also developed a proprietary mechanism for retrieval that
is Biopython only. This is only a good choice to use if the standard
configuration file system doesn't give you everything you want, since
this method is not compatible with other open-bio projects.
8.5.2.1 Retrieving sequences
By default, Biopython is configured to allow retrieval of sequences from
a number of standard locations. This makes it useable immediately
without knowing much about the system itself. To retrieve a Registry of
databases, all you need to do is:
>>> from Bio import db
You can readily view all of the different databases that retrieval is
possible be either printing the object and examining them, or
programmatically through the keys() function of object:
>>> print db
DBRegistry, exporting 'embl', 'embl-dbfetch-cgi', 'embl-ebi-cgi',
'embl-fast', 'embl-xembl-cgi', 'interpro-ebi-cgi',
'nucleotide-dbfetch-cgi', 'nucleotide-genbank-cgi', 'pdb',
'pdb-ebi-cgi', 'pdb-rcsb-cgi', 'prodoc-expasy-cgi',
'prosite-expasy-cgi', 'protein-genbank-cgi', 'swissprot',
'swissprot-expasy-cgi'
>>> db.keys()
['embl-dbfetch-cgi', 'embl-fast', 'embl', 'prosite-expasy-cgi',
'swissprot-expasy-cgi', 'nucleotide-genbank-cgi', 'pdb-ebi-cgi',
'interpro-ebi-cgi', 'embl-ebi-cgi', 'embl-xembl-cgi',
'protein-genbank-cgi', 'pdb', 'prodoc-expasy-cgi',
'nucleotide-dbfetch-cgi', 'swissprot', 'pdb-rcsb-cgi']
Now, let's say we want to retrieve a swissprot record for one of our
orchid chalcone synthases. First, we get the swissprot connection, then
we retrieve an record of interest:
>>> sp = db["swissprot"]
>>> sp
<Bio.DBRegistry.DBGroup instance at 0x82fdb2c>
record_handle = sp['O23729']
>>> print record_handle.read()[:200]
ID CHS3_BROFI STANDARD; PRT; 394 AA.
AC O23729;
DT 15-JUL-1999 (Rel. 38, Created)
DT 15-JUL-1999 (Rel. 38, Last sequence update)
DT 15-JUL-1999 (Rel. 38, Last annotation update)
This retrieval method is nice for a number of reasons. First, we didn't
have to worry about where exactly swissprot records were being retrieved
from -- we only ask for an object that will give us any swissprot record
we can get. Secondly, once we get the swissprot object, we don't need to
worry about how we are getting our sequence -- we just ask for it by id
and don't worry about the implementation details.
The default biopython database registry object can be used similarly to
retrieve sequences from EMBL, prosite, PDB, interpro, GenBank and XEMBL.
8.5.2.2 Registering and Grouping databases
The basic registry objects are nice in that they provide basic
functionality, but if you have a more advanced system it is nice to be
able to specify new databases. This is a more advanced topic, but is
very possible with the current system.
This example describes adding a local CGI script serving out GenBank
(ie. if you had something like a local mirror of GenBank), and then
registering this and the normal NCBI GenBank as a single group to
retrieve from. This would allow you to normally get things from a local
mirror and then switch over to the main GenBank server if your server
goes down, all without adjusting your retrieval code.
First, we need to describe the CGI script to retrieve from. This example
uses a CGI script, but we eventually hope to handle other sources such
as Applications, databases, or CORBA servers (XXX, should have an
example once this is in place). We describe the CGI script as follows:
from Bio.sources import CGI
local_cgi = CGI(name = "local_cgi",
delay = 0.0,
cgi = "http://www.myserver.org/cgi-bin/my_local.cgi",
url = "http://www.myserver.org/cgi_documentation.html",
doc = "Query a local databases",
failure_cases = [])
Now that we have specified the details for connecting to the CGI script,
we are ready to register this CGI script. We just need one more detail
-- specifying what the script returns upon failure to find a sequence.
We do this using Martel regular expressions:
import Martel
my_failures = [
(Martel.Str("Sequence not available"), "No sequence found")]
Now we've got everything we need, and can register the database:
from Bio import register_db
register_db(name = "nucleotide-genbank-local",
key = "uid",
source = local_cgi,
failure = my_failures)
This makes the database available as before, so if we print the keys of
the database, we'll see "nucleotide-genbank-local" available. Now that
we've got it registered, we'd like to link all of the genbank databases
together. We do this, using a group_db
command. First, we need to
create a group named "genbank" to retrieve things from the database:
register_db(name = "genbank", behavior = "concurrent")
The behavior
argument specifies how the group will try to
retrieve things from the various databases registered with it.
concurrent
tells it to try to retrieve from all databases at
once, and then just take whatever sequence record comes back first. You
can also specify serial
behavior, in which the retriever will
connect to one database at a time until something gets retrieved.
Now that we've got the group, we want to register our local GenBank and
the NCBI GenBank with this command:
group_db("genbank", "nucleotide-genbank-local")
group_db("genbank", "nucleotide-genbank-cgi")
Now we've got our database access set up, and the database registry
contains our genbank and nucleotide-genbank-local entries:
['embl-dbfetch-cgi', 'embl-fast', 'embl', 'prosite-expasy-cgi',
'swissprot-expasy-cgi', 'nucleotide-genbank-cgi', 'pdb-ebi-cgi',
'genbank', 'nucleotide-genbank-local', 'interpro-ebi-cgi',
'embl-ebi-cgi', 'embl-xembl-cgi', 'protein-genbank-cgi', 'pdb',
'prodoc-expasy-cgi', 'nucleotide-dbfetch-cgi', 'swissprot',
'pdb-rcsb-cgi']
Cool, now we can add our own databases to the registry and make use of
the simplified retrieval scheme!
8.6 BioSQL -- storing sequences in a relational database
8.7 BioCorba
Biocorba does some cool stuff with CORBA. Basically, it allows you to easily interact with code written in other languages, including Perl and Java. This is all done through an interface which is very similar to the standard biopython interface. Much work has been done to make it easy to use knowing only very little about CORBA. You should check out the biocorba specific documentation, which describes in detail how to use it.
8.8 Going 3D: The PDB module
Biopython also allows you to explore the extensive realm of macromolecular structure.
Biopython comes with a PDBParser class that produces a Structure object. The Structure object
can be used to access the atomic data in the file in a convenient manner.
8.8.1 Structure representation
A macromolecular structure is represented using a structure, model chain,
residue, atom (or SMCRA) hierarchy. Fig. 8.8.1 shows a UML
class diagram of the SMCRA data structure. Such a data structure is not
necessarily best suited for the representation of the macromolecular content of
a structure, but it is absolutely necessary for a good interpretation of the
data present in a file that describes the structure (typically a PDB or MMCIF
file). If this hierarchy cannot represent the contents of a structure file, it
is fairly certain that the file contains an error or at least does not describe
the structure unambiguously. If a SMCRA data structure cannot be generated,
there is reason to suspect a problem. Parsing a PDB file can thus be used to
detect likely problems. We will give several examples of this in section
8.8.5.1.
Structure, Model, Chain and Residue are all subclasses of the Entity base class.
The Atom class only (partly) implements the Entity interface (because an Atom
does not have children).
For each Entity subclass, you can extract a child by using a unique id for that
child as a key (e.g. you can extract an Atom object from a Residue object by
using an atom name string as a key, you can extract a Chain object from a Model
object by using its chain identifier as a key).
Disordered atoms and residues are represented by DisorderedAtom and DisorderedResidue
classes, which are both subclasses of the DisorderedEntityWrapper base class.
They hide the complexity associated with disorder and behave exactly as Atom
and Residue objects.
In general, a child Entity object (i.e. Atom, Residue, Chain, Model) can be
extracted from its parent (i.e. Residue, Chain, Model, Structure, respectively)
by using an id as a key.
child_entity=parent_entity[child_id]
You can also get a list of all child Entities of a parent Entity object. Note
that this list is sorted in a specific way (e.g. according to chain identifier
for Chain objects in a Model object).
child_list=parent_entity.get_list()
You can also get the parent from a child.
parent_entity=child_entity.get_parent()
At all levels of the SMCRA hierarchy, you can also extract a full id.
The full id is a tuple containing all id's starting from the top object (Structure)
down to the current object. A full id for a Residue object e.g. is something
like:
full_id=residue.get_full_id()
print full_id
("1abc", 0, "A", ("", 10, "A"))
This corresponds to:
-
The Structure with id "1abc"
- The Model with id 0
- The Chain with id "A"
- The Residue with id (" ", 10, "A").
The Residue id indicates that the residue is not a hetero-residue (nor a water)
because it has a blanc hetero field, that its sequence identifier is 10 and
that its insertion code is "A".
Some other useful methods:
# get the entity's id
entity.get_id()
# check if there is a child with a given id
entity.has_id(entity_id)
# get number of children
nr_children=len(entity)
It is possible to delete, rename, add, etc. child entities from a parent entity,
but this does not include any sanity checks (e.g. it is possible to add two
residues with the same id to one chain). This really should be done via a nice
Decorator class that includes integrity checking, but you can take a look at
the code (Entity.py) if you want to use the raw interface.
8.8.1.1 Structure
The Structure object is at the top of the hierarchy. Its id is a user given
string. The Structure contains a number of Model children. Most crystal structures
(but not all) contain a single model, while NMR structures typically consist
of several models. Disorder in crystal structures of large parts of molecules
can also result in several models.
8.8.1.1.1 Constructing a Structure object
A Structure object is produced by a PDBParser object:
from Bio.PDB.PDBParser import PDBParser
p=PDBParser(PERMISSIVE=1)
structure_id="1fat"
filename="pdb1fat.ent"
s=p.get_structure(structure_id, filename)
The PERMISSIVE flag indicates that a number of common problems (see 8.8.5.1)
associated with PDB files will be ignored (but note that some atoms and/or residues
will be missing). If the flag is not present a PDBConstructionException
will be generated during the parse operation.
8.8.1.1.2 Header and trailer
You can extract the header and trailer (simple lists of strings) of the PDB
file from the PDBParser object with the get_header and get_trailer
methods.
8.8.1.2 Model
The id of the Model object is an integer, which is derived from the position
of the model in the parsed file (they are automatically numbered starting from
0). The Model object stores a list of Chain children.
8.8.1.2.1 Example
Get the first model from a Structure object.
first_model=structure[0]
8.8.1.3 Chain
The id of a Chain object is derived from the chain identifier in the structure
file, and can be any string. Each Chain in a Model object has a unique id. The
Chain object stores a list of Residue children.
8.8.1.3.1 Example
Get the Chain object with identifier ``A'' from a Model object.
chain_A=model["A"]
8.8.1.4 Residue
Unsurprisingly, a Residue object stores a set of Atom children. In addition,
it also contains a string that specifies the residue name (e.g. ``ASN'')
and the segment identifier of the residue (well known to X-PLOR users, but not
used in the construction of the SMCRA data structure).
The id of a Residue object is composed of three parts: the hetero field (hetfield),
the sequence identifier (resseq) and the insertion code (icode).
The hetero field is a string : it is ``W'' for waters, ``H_'' followed
by the residue name (e.g. ``H_FUC'') for other hetero residues and blank
for standard amino and nucleic acids. This scheme is adopted for reasons described
in section 8.8.3.1.
The second field in the Residue id is the sequence identifier, an integer describing
the position of the residue in the chain.
The third field is a string, consisting of the insertion code. The insertion
code is sometimes used to preserve a certain desirable residue numbering scheme.
A Ser 80 insertion mutant (inserted e.g. between a Thr 80 and an Asn 81 residue)
could e.g. have sequence identifiers and insertion codes as followed: Thr 80
A, Ser 80 B, Asn 81. In this way the residue numbering scheme stays in tune
with that of the wild type structure.
Let's give some examples. Asn 10 with a blank insertion code would have residue
id ('' '', 10, '' ''). Water 10 would have residue id (``W``, 10, `` ``).
A glucose molecule (a hetero residue with residue name GLC) with sequence identifier
10 would have residue id (''H_GLC'', 10, '' ''). In this way, the three
residues (with the same insertion code and sequence identifier) can be part
of the same chain because their residue id's are distinct.
In most cases, the hetflag and insertion code fields will be blank, e.g. ('' '', 10, '' '').
In these cases, the sequence identifier can be used as a shortcut for the full
id:
# use full id
res10=chain[("", 10, "")]
# use shortcut
res10=chain[10]
Each Residue object in a Chain object should have a unique id. However, disordered
residues are dealt with in a special way, as described in section 8.8.2.3.2.
A Residue object has a number of additional methods:
r.get_resname() # return residue name, e.g. "ASN"
r.get_segid() # return the SEGID, e.g. "CHN1"
8.8.1.5 Atom
The Atom object stores the data associated with an atom, and has no children.
The id of an atom is its atom name (e.g. ``OG'' for the side chain oxygen
of a Ser residue). An Atom id needs to be unique in a Residue. Again, an exception
is made for disordered atoms, as described in section 8.8.2.2.
In a PDB file, an atom name consists of 4 chars, typically with leading and
trailing spaces. Often these spaces can be removed for ease of use (e.g. an
amino acid C a atom is labeled ``.CA.'' in a PDB file, where
the dots represent spaces). To generate an atom name (and thus an atom id) the
spaces are removed, unless this would result in a name collision in a Residue
(i.e. two Atom objects with the same atom name and id). In the latter case,
the atom name including spaces is tried. This situation can e.g. happen when
one residue contains atoms with names ``.CA.'' and ``CA..'', although
this is not very likely.
The atomic data stored includes the atom name, the atomic coordinates (including
standard deviation if present), the B factor (including anisotropic B factors
and standard deviation if present), the altloc specifier and the full atom name
including spaces. Less used items like the atom element number or the atomic
charge sometimes specified in a PDB file are not stored.
An Atom object has the following additional methods:
a.get_name() # atom name (spaces stripped, e.g. "CA")
a.get_id() # id (equals atom name)
a.get_coord() # atomic coordinates
a.get_bfactor() # B factor
a.get_occupancy() # occupancy
a.get_altloc() # alternative location specifie
a.get_sigatm() # std. dev. of atomic parameters
a.get_siguij() # std. dev. of anisotropic B factor
a.get_anisou() # anisotropic B factor
a.get_fullname() # atom name (with spaces, e.g. ".CA.")
To represent the atom coordinates, siguij, anisotropic B factor and sigatm Numpy
arrays are used.
8.8.2.1 General approach
Disorder should be dealt with from two points of view: the atom and the residue
points of view. In general, we have tried to encapsulate all the complexity that
arises from disorder. If you just want to loop over all C a atoms,
you do not care that some residues have a disordered side chain. On the other
hand it should also be possible to represent disorder completely in the data
structure. Therefore, disordered atoms or residues are stored in special objects
that behave as if there is no disorder. This is done by only representing a
subset of the disordered atoms or residues. Which subset is picked (e.g. which
of the two disordered OG side chain atom positions of a Ser residue is used)
can be specified by the user.
8.8.2.2 Disordered atoms
Disordered atoms are represented by ordinary Atom objects, but all Atom objects
that represent the same physical atom are stored in a DisorderedAtom object.
Each Atom object in a DisorderedAtom object can be uniquely indexed using its
altloc specifier. The DisorderedAtom object forwards all uncaught method calls
to the selected Atom object, by default the one that represents the atom with
with the highest occupancy. The user can of course change the selected Atom
object, making use of its altloc specifier. In this way atom disorder is represented
correctly without much additional complexity. In other words, if you are not
interested in atom disorder, you will not be bothered by it.
Each disordered atom has a characteristic altloc identifier. You can specify
that a DisorderedAtom object should behave like the Atom object associated with
a specific altloc identifier:
atom.disordered\_select("A") # select altloc A atom
print atom.get_altloc()
"A"
atom.disordered_select("B") # select altloc B atom
print atom.get_altloc()
"B"
8.8.2.3 Disordered residues
8.8.2.3.1 Common case
The most common case is a residue that contains one or more disordered atoms.
This is evidently solved by using DisorderedAtom objects to represent the disordered
atoms, and storing the DisorderedAtom object in a Residue object just like ordinary
Atom objects. The DisorderedAtom will behave exactly like an ordinary atom (in
fact the atom with the highest occupancy) by forwarding all uncaught method
calls to one of the Atom objects (the selected Atom object) it contains.
8.8.2.3.2 Point mutations
A special case arises when disorder is due to a point mutation, i.e. when two
or more point mutants of a polypeptide are present in the crystal. An example
of this can be found in PDB structure 1EN2.
Since these residues belong to a different residue type (e.g. let's say Ser
60 and Cys 60) they should not be stored in a single Residue object as in the
common case. In this case, each residue is represented by one Residue object,
and both Residue objects are stored in a DisorderedResidue object.
The DisorderedResidue object forwards all uncaught methods to the selected Residue
object (by default the last Residue object added), and thus behaves like an
ordinary residue. Each Residue object in a DisorderedResidue object can be uniquely
identified by its residue name. In the above example, residue Ser 60 would have
id ``SER'' in the DisorderedResidue object, while residue Cys 60 would
have id ``CYS''. The user can select the active Residue object in a DisorderedResidue
object via this id.
8.8.3 Hetero residues
8.8.3.1 Associated problems
A common problem with hetero residues is that several hetero and non-hetero
residues present in the same chain share the same sequence identifier (and insertion
code). Therefore, to generate a unique id for each hetero residue, waters and
other hetero residues are treated in a different way.
Remember that Residue object have the tuple (hetfield, resseq, icode) as id.
The hetfield is blank (`` ``) for amino and nucleic acids, and a string
for waters and other hetero residues. The content of the hetfield is explained
below.
8.8.3.2 Water residues
The hetfield string of a water residue consists of the letter ``W''. So
a typical residue id for a water is (``W'', 1, `` ``).
8.8.3.3 Other hetero residues
The hetfield string for other hetero residues starts with ``H_'' followed
by the residue name. A glucose molecule e.g. with residue name ``GLC''
would have hetfield ``H_GLC''. It's residue id could e.g. be (``H_GLC'',
1, `` ``).
8.8.4 Some random usage examples
Parse a PDB file, and extract some Model, Chain, Residue and Atom objects.
from PDBParser import PDBParser
parser=PDBParser()
structure=parser.get_structure("test", "1fat.pdb")
model=structure[0]
chain=model["A"]
residue=chain[1]
atom=residue["CA"]
Extract a hetero residue from a chain (e.g. a glucose (GLC) moiety with resseq
10).
residue_id=("H_GLC", 10, " ")
residue=chain[residue_id]
Print all hetero residues in chain.
for residue in chain.get_list():
residue_id=residue.get_id()
hetfield=residue_id[0]
if hetfield[0]=="H":
print residue_id
Print out the coordinates of all CA atoms in a structure with B factor greater
than 50.
for model in structure.get_list():
for chain in model.get_list():
for residue in chain.get_list():
if residue.has_id("CA"):
ca=residue["CA"]
if ca.get_bfactor()>50.0:
print ca.get_coord()
Print out all the residues that contain disordered atoms.
for model in structure.get_list()
for chain in model.get_list():
for residue in chain.get_list():
if residue.is_disordered():
resseq=residue.get_id()[1]
resname=residue.get_resname()
model_id=model.get_id()
chain_id=chain.get_id()
print model_id, chain_id, resname, resseq
Loop over all disordered atoms, and select all atoms with altloc A (if present).
This will make sure that the SMCRA data structure will behave as if only the
atoms with altloc A are present.
for model in structure.get_list()
for chain in model.get_list():
for residue in chain.get_list():
if residue.is_disordered():
for atom in residue.get_list():
if atom.is_disordered():
if atom.disordered_has_id("A"):
atom.disordered_select("A")
Suppose that a chain has a point mutation at position 10, consisting of a Ser
and a Cys residue. Make sure that residue 10 of this chain behaves as the Cys
residue.
residue=chain[10]
residue.disordered_select("CYS")
8.8.5 Common problems in PDB files
8.8.5.1 Examples
The PDBParser/Structure class was tested on about 800 structures (each belonging
to a unique SCOP superfamily). This takes about 20 minutes, or on average 1.5
seconds per structure. Parsing the structure of the large ribosomal subunit
(1FKK), which contains about 64000 atoms, takes 10 seconds on a 1000 MHz PC.
Three exceptions were generated in cases where an unambiguous data structure
could not be built. In all three cases, the likely cause is an error in the
PDB file that should be corrected. Generating an exception in these cases
is much better than running the chance of incorrectly describing
the structure in a data structure.
8.8.5.1.1 Duplicate residues
One structure contains two amino acid residues in one chain with the same sequence
identifier (resseq 3) and icode. Upon inspection it was found that this chain
contains the residues Thr A3, ..., Gly A202, Leu A3, Glu A204. Clearly,
Leu A3 should be Leu A203. A couple of similar situations exist for structure
1FFK (which e.g. contains Gly B64, Met B65, Glu B65, Thr B67, i.e. residue Glu
B65 should be Glu B66).
8.8.5.1.2 Duplicate atoms
Structure 1EJG contains a Ser/Pro point mutation in chain A at position 22.
In turn, Ser 22 contains some disordered atoms. As expected, all atoms belonging
to Ser 22 have a non-blank altloc specifier (B or C). All atoms of Pro 22 have
altloc A, except the N atom which has a blank altloc. This generates an exception,
because all atoms belonging to two residues at a point mutation should have
non-blank altloc. It turns out that this atom is probably shared by Ser and
Pro 22, as Ser 22 misses the N atom. Again, this points to a problem in the
file: the N atom should be present in both the Ser and the Pro residue, in both
cases associated with a suitable altloc identifier.
8.8.5.2 Automatic correction
Some errors are quite common and can be easily corrected without much risk of
making a wrong interpretation. These cases are listed below.
8.8.5.2.1 A blank altloc for a disordered atom
Normally each disordered atom should have a non-blanc altloc identifier. However,
there are many structures that do not follow this convention, and have a blank
and a non-blank identifier for two disordered positions of the same atom. This
is automatically interpreted in the right way.
8.8.5.2.2 Broken chains
Sometimes a structure contains a list of residues belonging to chain A, followed
by residues belonging to chain B, and again followed by residues belonging to
chain A, i.e. the chains are ``broken''. This is correctly interpreted.
8.8.5.3 Fatal errors
Sometimes a PDB file cannot be unambiguously interpreted. Rather than guessing
and risking a mistake, an exception is generated, and the user is expected to
correct the PDB file. These cases are listed below.
8.8.5.3.1 Duplicate residues
All residues in a chain should have a unique id. This id is generated based
on:
-
The sequence identifier (resseq).
- The insertion code (icode).
- The hetfield string (``W'' for waters and ``H_'' followed by the
residue name for other hetero residues)
- The residue names of the residues in the case of point mutations (to store the
Residue objects in a DisorderedResidue object).
If this does not lead to a unique id something is quite likely wrong, and an
exception is generated.
8.8.5.3.2 Duplicate atoms
All atoms in a residue should have a unique id. This id is generated based on:
-
The atom name (without spaces, or with spaces if a problem arises).
- The altloc specifier.
If this does not lead to a unique id something is quite likely wrong, and an
exception is generated.
8.8.6 Other features
There are also some tools to analyze a crystal structure. Tools
exist to superimpose two coordinate sets (SVDSuperimposer), to extract
polypeptides from a structure (Polypeptide), to perform neighbor lookup
(NeighborSearch) and to write out PDB files (PDBIO). The neighbor lookup
is done using a KD tree module written in C++. It is very fast and also
includes a fast method to find all point pairs within a certain distance
of each other.
A Polypeptide object is simply a UserList of Residue objects. You can
construct a list of Polypeptide objects from a Structure object as follows:
model_nr=1
polypeptide_list=build_peptides(structure, model_nr)
for polypeptide in polypeptide_list:
print polypeptide
The Polypeptide objects are always created from a single
Model (in this case model 1).
8.9 Bio.PopGen: Population genetics
Bio.PopGen is a new Biopython module supporting population genetics,
available in Biopython 1.44 onwards.
The medium term objective for the module is to support widely used data
formats, applications and databases. This module is currently under intense
development and support for new features should appear at a rather fast pace.
Unfortunately this might also entail some instability on the API, especially
if you are using a CVS version. APIs that are made available on public
versions should be much stabler.
GenePop (http://genepop.curtin.edu.au/) is a popular population
genetics software package supporting Hardy-Weinberg tests, linkage
desiquilibrium, population diferentiation, basic statistics, Fst and
migration estimates, among others. GenePop does not supply sequence
based statistics as it doesn't handle sequence data.
The GenePop file format is supported by a wide range of other population
genetic software applications, thus making it a relevant format in the
population genetics field.
Bio.PopGen provides a parser and generator of GenePop file format.
Utilities to manipulate the content of a record are also provided.
Here is an example on how to read a GenePop file (you can find
example GenePop data files in the Test/PopGen directory of Biopython):
from Bio.PopGen import GenePop
handle = open("example.gen")
rec = GenePop.parse(handle)
handle.close()
This will read a file called example.gen and parse it. If you
do print rec, the record will be output again, in GenePop format.
The most important information in rec will be the loci names and
population information (but there is more -- use help(GenePop.Record)
to check the API documentation). Loci names can be found on rec.loci_list.
Population information can be found on rec.populations.
Populations is a list with one element per population. Each element is itself
a list of individuals, each individual is a pair composed by individual
name and a list of alleles (2 per marker), here is an example for
rec.populations:
[
[
('Ind1', [(1, 2), (3, 3), (200, 201)],
('Ind2', [(2, None), (3, 3), (None, None)],
],
[
('Other1', [(1, 1), (4, 3), (200, 200)],
]
]
So we have two populations, the first with two individuals, the
second with only one. The first individual of the first
population is called Ind1, allelic information for each of
the 3 loci follows. Please note that for any locus, information
might be missing (see as an example, Ind2 above).
A few utility functions to manipulate GenePop records are made
available, here is an example:
from Bio.PopGen import GenePop
#Imagine that you have loaded rec, as per the code snippet above...
rec.remove_population(pos)
#Removes a population from a record, pos is the population position in
# rec.populations, remember that it starts on position 0.
# rec is altered.
rec.remove_locus_by_position(pos)
#Removes a locus by its position, pos is the locus position in
# rec.loci_list, remember that it starts on position 0.
# rec is altered.
rec.remove_locus_by_name(name)
#Removes a locus by its name, name is the locus name as in
# rec.loci_list. If the name doesn't exist the function fails
# silently.
# rec is altered.
rec_loci = rec.split_in_loci()
#Splits a record in loci, that is, for each loci, it creates a new
# record, with a single loci and all populations.
# The result is returned in a dictionary, being each key the locus name.
# The value is the GenePop record.
# rec is not altered.
rec_pops = rec.split_in_pops(pop_names)
#Splits a record in populations, that is, for each population, it creates
# a new record, with a single population and all loci.
# The result is returned in a dictionary, being each key
# the population name. As population names are not available in GenePop,
# they are passed in array (pop_names).
# The value of each dictionary entry is the GenePop record.
# rec is not altered.
GenePop does not support population names, a limitation which can be
cumbersome at times. Functionality to enable population names is currently
being planned for Biopython. These extensions won't break compatibility in
any way with the standard format. In the medium term, we would also like to
support the GenePop web service.
8.9.2 Coalescent simulation
A coalescent simulation is a backward model of population genetics with relation to
time. A simulation of ancestry is done until the Most Recent Common Ancestor (MRCA) is found.
This ancestry relationship starting on the MRCA and ending on the current generation
sample is sometimes called a genealogy. Simple cases assume a population of constant
size in time, haploidy, no population structure, and simulate the alleles of a single
locus under no selection pressure.
Coalescent theory is used in many fields like selection detection, estimation of
demographic parameters of real populations or disease gene mapping.
The strategy followed in the Biopython implementation of the coalescent was not
to create a new, built-in, simulator from scratch but to use an existing one,
SIMCOAL2 (http://cmpg.unibe.ch/software/simcoal2/). SIMCOAL2 allows for,
among others, population structure, multiple demographic events, simulation
of multiple types of loci (SNPs, sequences, STRs/microsatellites and RFLPs)
with recombination, diploidy multiple chromosomes or ascertainment bias. Notably
SIMCOAL2 doesn't support any selection model. We recommend reading SIMCOAL2's
documentation, available in the link above.
The input for SIMCOAL2 is a file specifying the desired demography and genome,
the output is a set of files (typically around 1000) with the simulated genomes
of a sample of individuals per subpopulation. This set of files can be used
in many ways, like to compute confidence intervals where which certain
statistics (e.g., Fst or Tajima D) are expected to lie. Real population
genetics datasets statistics can then be compared to those confidence intervals.
Biopython coalescent code allows to create demographic scenarios and genomes and
to run SIMCOAL2.
8.9.2.1 Creating scenarios
Creating a scenario involves both creating a demography and a chromosome structure.
In many cases (e.g. when doing Approximate Bayesian Computations -- ABC) it is
important to test many parameter variations (e.g. vary the effective population size,
Ne, between 10, 50, 500 and 1000 individuals). The code provided allows for
the simulation of scenarios with different demographic parameters very easily.
Below we see how we can create scenarios and then how simulate them.
8.9.2.1.1 Demography
A few predefined demographies are built-in, all have two shared parameters: sample size
(called sample_size on the template, see below for its use) per deme and deme size, i.e.
subpopulation size (pop_size). All demographies are available as templates where all
parameters can be varied, each template has a system name. The prefedined
demographies/templates are:
-
Single population, constant size
- The standard parameters are enough to specifity
it. Template name: simple.
- Single population, bottleneck
- As seen on figure 8.9.2.1.1. The parameters
are current population size (pop_size on template ne3 on figure), time of expansion,
given as the generation in the past when it occured (expand_gen),
effective population size during bottleneck (ne2), time of contraction
(contract_gen) and original size in the remote past (ne3). Template name: bottle.
- Island model
- The typical island model. The total number of demes is specified
by total_demes and the migration rate by mig. Template name island.
- Stepping stone model - 1 dimension
- The stepping stone model in 1 dimension,
extremes disconnected. The total number of demes is total_demes, migration rate
is mig. Template name is ssm_1d.
- Stepping stone model - 2 dimensions
- The stepping stone model in 2 dimensions,
extremes disconnected. The parameters are x for the horizontal dimension and y
for the vertical (being the total number of demes x times y), migration rate is mig.
Template name is ssm_2d.
In our first example, we will generate a template for a single population, constant size
model with a sample size of 30 and a deme size of 500. The code for this is:
from Bio.PopGen.SimCoal.Template import generate_simcoal_from_template
generate_simcoal_from_template('simple',
[(1, [('SNP', [24, 0.0005, 0.0])])],
[('sample_size', [30]),
('pop_size', [100])])
Executing this code snippet will generate a file on the current directory called
simple_100_300.par this file can be given as input to SIMCOAL2 to simulate the
demography (below we will see how Biopython can take care of calling SIMCOAL2).
This code consists of a single function call, lets discuss it paramter by parameter.
The first parameter is the template id (from the list above). We are using the id
'simple' which is the template for a single population of constant size along time.
The second parameter is the chromosome structure. Please ignore it for now, it will be
explained in the next section.
The third parameter is a list of all required parameters (recall that the simple model
only needs sample_size and pop_size) and possible values (in this case each
parameter only has a possible value).
Now, lets consider an example where we want to generate several island models, and we
are interested in varying the number of demes: 10, 50 and 100 with a migration
rate of 1%. Sample size and deme
size will be the same as before. Here is the code:
from Bio.PopGen.SimCoal.Template import generate_simcoal_from_template
generate_simcoal_from_template('island',
[(1, [('SNP', [24, 0.0005, 0.0])])],
[('sample_size', [30]),
('pop_size', [100]),
('mig', [0.01]),
('total_demes', [10, 50, 100])])
In this case, 3 files will be generated: island_100_0.01_100_30.par,
island_10_0.01_100_30.par and island_50_0.01_100_30.par. Notice the
rule to make file names: template name, followed by parameter values in
reverse order.
A few, arguably more esoteric template demographies exist (please check the
Bio/PopGen/SimCoal/data directory on Biopython source tree). Furthermore it is possible
for the user to create new templates. That functionality will be discussed in a future
version of this document.
8.9.2.1.2 Chromosome structure
We strongly recommend reading SIMCOAL2 documentation to understand the full potential
available in modeling chromosome structures. In this subsection we only discuss how
to implement chromosome structures using the Biopython interface, not the underlying
SIMCOAL2 capabilities.
We will start by implementing a single chromosome, with 24 SNPs with
a recombination rate immediately on the right of each locus of 0.0005 and a
minimum frequency of the minor allele of 0. This will be specified by the
following list (to be passed as second parameter to the function
generate_simcoal_from_template):
[(1, [('SNP', [24, 0.0005, 0.0])])]
This is actually the chromosome structure used in the above examples.
The chromosome structure is represented by a list of chromosomes,
each chromosome (i.e., each element in the list)
is composed by a tuple (a pair): the first element
is the number of times the chromosome is to be repeated (as there
might be interest in repeating the same chromosome many times).
The second element is a list of the actual components of the chromosome.
Each element is again a pair, the first member is the locus type and
the second element the parameters for that locus type. Confused?
Before showing more examples lets review the example above: We have
a list with one element (thus one chromosome), the chromosome is
a single instance (therefore not to be repeated), it is composed
of 24 SNPs, with a recombination rate of 0.0005 between each
consecutive SNP, the minimum frequency of the minor allele is
0.0 (i.e, it can be absent from a certain population).
Lets see a more complicated example:
[
(5, [
('SNP', [24, 0.0005, 0.0])
]
),
(2, [
('DNA', [10, 0.0, 0.00005, 0.33]),
('RFLP', [1, 0.0, 0.0001]),
('MICROSAT', [1, 0.0, 0.001, 0.0, 0.0])
]
)
]
We start by having 5 chromosomes with the same structure as
above (i.e., 24 SNPs). We then have 2 chromosomes which
have a DNA sequence with 10 nucleotides, 0.0 recombination rate,
0.0005 mutation rate, and a transition rate of 0.33. Then we
have an RFLP with 0.0 recombination rate to the next locus and
a 0.0001 mutation rate. Finally we have a microsatellite (or STR),
with 0.0 recombination rate to the next locus (note, that as this
is a single microsatellite which has no loci following, this
recombination rate here is irrelevant), with a mutation rate
of 0.001, geometric paramater of 0.0 and a range constraint
of 0.0 (for information about this parameters please consult
the SIMCOAL2 documentation, you can use them to simulate
various mutation models, including the typical -- for microsatellites --
stepwise mutation model among others).
8.9.2.2 Running SIMCOAL2
We now discuss how to run SIMCOAL2 from inside Biopython. It is required
that the binary for SIMCOAL2 is called simcoal2 (or simcoal2.exe on Windows
based platforms), please note that the typical name when downloading the
program is in the format simcoal2_x_y. As such renaming of the binary
after download is needed.
It is possible to run SIMCOAL2 on files that were not generated using the method
above (e.g., writing a parameter file by hand), but we will show an
example by creating a model using the framework presented above.
from Bio.PopGen.SimCoal.Template import generate_simcoal_from_template
from Bio.PopGen.SimCoal.Controller import SimCoalController
generate_simcoal_from_template('simple',
[
(5, [
('SNP', [24, 0.0005, 0.0])
]
),
(2, [
('DNA', [10, 0.0, 0.00005, 0.33]),
('RFLP', [1, 0.0, 0.0001]),
('MICROSAT', [1, 0.0, 0.001, 0.0, 0.0])
]
)
],
[('sample_size', [30]),
('pop_size', [100])])
ctrl = SimCoalController('.')
ctrl.run_simcoal('simple_100_30.par', 50)
The lines of interest are the last two (plus the new import).
Firstly a controller for the
application is created. The directory where the binary is located has
to be specified.
The simulator is then run on the last line: we know, from the rules explained
above, that the input file name is simple_100_30.par for the
simulation parameter file created. We then specify
that we want to run 50 independent simulations, by default Biopython
requests a simulation of diploid data, but a third parameter can
be added to simulate haploid data (adding as a parameter the
string '0'). SIMCOAL2 will now run (please
note that this can take quite a lot of time) and will create a directory
with the simulation results. The results can now be analysed (typically
studying the data with Arlequin3). In the future Biopython might support
reading the Arlequin3 format and thus allowing for the analysis of SIMCOAL2
data inside Biopython.
8.9.3 Other applications
Here we discuss interfaces and utilities to deal with population genetics'
applications which arguably have a smaller user base.
8.9.3.1 FDist: Detecting selection and molecular adaptation
FDist is a selection detection application suite based on computing
(i.e. simulating) a ``neutral'' confidence interval based on Fst and
heterozygosity. Markers (which can be SNPs, microsatellites, AFLPs
among others) which lie outside the ``neutral'' interval are to be
considered as possible candidates for being under selection.
FDist is mainly used when the number of markers is considered enough
to estimate an average Fst, but not enough to either have outliers
calculated from the dataset directly or, with even more markers for
which the relative positions in the genome are known, to use
approaches based on, e.g., Extended Haplotype Heterozygosity (EHH).
The typical usage pattern for FDist is as follows:
-
Import a dataset from an external format into FDist format.
- Compute average Fst. This is done by datacal inside FDist.
- Simulate ``neutral'' markers based on the
average Fst and expected number of total populations.
This is the core operation, done by fdist inside FDist.
- Calculate the confidence interval, based on the desired
confidence boundaries (typically 95% or 99%). This is done by
cplot and is mainly used to plot the interval.
- Assess each marker status against the simulation ``neutral''
confidence interval. Done
by pv. This is used to detect the outlier status of each marker
against the simulation.
We will now discuss each step with illustrating example code
(for this example to work FDist binaries have to be on the
executable PATH).
The FDist data format is application specific and is not used at
all by other applications, as such you will probably have to convert
your data for use with FDist. Biopython can help you do this.
Here is an example converting from GenePop format to FDist format
(along with imports that will be needed on examples further below):
from Bio.PopGen import GenePop
from Bio.PopGen import FDist
from Bio.PopGen.FDist import Controller
from Bio.PopGen.FDist.Utils import convert_genepop_to_fdist
gp_rec = GenePop.parse(open("example.gen"))
fd_rec = convert_genepop_to_fdist(gp_rec)
in_file = open("infile", "w")
in_file.write(str(fd_rec))
in_file.close()
In this code we simply parse a GenePop file and convert it to a FDist
record.
Printing an FDist record will generate
a string that can be directly saved to a file and supplied to FDist. FDist
requires the input file to be called infile, therefore we save the record on
a file with that name.
The most important fields on a FDist record are: num_pops, the number of
populations; num_loci, the number of loci and loci_data with the marker
data itself. Most probably the details of the record are of no interest
to the user, as the record only purpose is to be passed to FDist.
The next step is to calculate the average Fst of the dataset (along
with the sample size):
ctrl = Controller.FDistController()
fst, samp_size = ctrl.run_datacal()
On the first line we create an object to control the call of FDist
suite, this object will be used further on in order to call other
suite applications.
On the second line we call the datacal application which computes the
average Fst
and the sample size. It is worth noting that the Fst computed by
datacal is a variation of Weir and Cockerham's q.
We can now call the main fdist application in order to simulate neutral
markers.
sim_fst = ctrl.run_fdist(npops = 15, nsamples = fd_rec.num_pops, fst = fst,
sample_size = samp_size, mut = 0, num_sims = 40000)
-
npops
- Number of populations existing in nature. This is really a
``guestimate''. Has to be lower than 100.
- nsamples
- Number of populations sampled, has to be lower than npops.
- fst
- Average Fst.
- sample_size
- Average number of individuals sampled on each population.
- mut
- Mutation model: 0 - Infinite alleles; 1 - Stepwise mutations
- num_sims
- Number of simulations to perform. Typically a number around
40000 will be OK, but if you get a confidence interval that looks sharp
(this can be detected when plotting the confidence interval computed
below) the value can be increased (a suggestion would be steps of 10000
simulations).
The confusion in wording between number of samples and sample size
stems from the original application.
A file named out.dat will be created with the simulated heterozygosities
and Fsts, it will have as many lines as the number of simulations
requested.
Note that fdist returns the average Fst that it was capable of
simulating, for more details about this issue please read below the paragraph
on approximating the desired average Fst.
The next (optional) step is to calculate the confidence interval:
cpl_interval = ctrl.run_cplot(ci=0.99)
You can only call cplot after having run fdist.
This will calculate the confidence intervals (99% in this case)
for a previous fdist run. A list of quadruples is returned. The
first element represents the heterozygosity, the second the lower
bound of Fst confidence interval for that heterozygosity,
the third the average and the fourth the upper bound. This can
be used to trace the confidence interval contour. This list
is also written to a file, out.cpl.
The main purpose of this step is return a set of points which can
be easily used to plot a confidence interval. It can be skipped
if the objective is only to assess the status of each marker against
the simulation, which is the next step...
pv_data = ctrl.run_pv()
You can only call cplot after having run datacal and fdist.
This will use the simulated markers to assess the status of each
individual real marker. A list, in the same order than the loci_list
that is on the FDist record (which is in the same order that the GenePop
record) is returned. Each element in the list is a quadruple, the
fundamental member of each quadruple is the last element (regarding the
other elements, please refer to the pv documentation -- for the
sake of simplicity we will not discuss them here) which returns the
probability of the simulated Fst being lower than the marker Fst.
Higher values would indicate a stronger candidate for positive selection,
lower values a candidate for balancing selection, and intermediate values
a possible neutral marker. What is ``higher'', ``lower'' or ``intermediate''
is really a subjective issue, but taking a ``confidence interval'' approach
and considering a 95% confidence interval, ``higher'' would be between 0.95
and 1.0, ``lower'' between 0.0 and 0.05 and ``intermediate'' between 0.05 and
0.95.
8.9.3.1.1 Approximating the desired average Fst
Fdist tries to approximate the desired average Fst by doing a
coalescent simulation using migration rates based on the formula
This formula assumes a few premises like an infinite number of populations.
In practice, when the number of populations is low, the mutation model
is stepwise and the sample size increases, fdist will not be able to
simulate an acceptable approximate average Fst.
To address that, a function is provided to iteratively approach the desired
value by running several fdists in sequence. This approach is computationally
more intensive than running a single fdist run, but yields good results.
The following code runs fdist approximating the desired Fst:
sim_fst = ctrl.run_fdist_force_fst(npops = 15, nsamples = fd_rec.num_pops,
fst = fst, sample_size = samp_size, mut = 0, num_sims = 40000,
limit = 0.05)
The only new optional parameter, when comparing with run_fdist, is limit
which is the desired maximum error. run_fdist can (and probably should)
be safely replaced with run_fdist_force_fst.
8.9.3.1.2 Final notes
The process to determine the average Fst can be more sophisticated than
the one presented here. For more information we refer you to the FDist
README file. Biopython's code can be used to implement more sophisticated
approaches.
8.9.4 Future Developments
The most desired future developments would be the ones you add yourself ;) .
That being said, already existing fully functional code is currently being
incorporated in Bio.PopGen, that code covers the applications FDist and
SimCoal2, the HapMap and UCSC Table Browser databases and some simple statistics
like Fst, or allele counts.
8.10 InterPro
The Bio.InterPro
module works with files from the
InterPro database, which can be obtained from the InterPro project:
http://www.ebi.ac.uk/interpro/.
The Bio.InterPro.Record
contains all the information stored in
an InterPro record. Its string representation also is a valid InterPro
record, but it is NOT guaranteed to be equivalent to the record
from which it was produced.
The Bio.InterPro.Record
contains:
-
Database
Accession
Name
Dates
Type
Parent
Process
Function
Component
Signatures
Abstract
Examples
References
Database links