Bioinformatics with Python Cookbook

A small introduction to my book with advanced content on Python for Bioinformatics. A Caveat, the book assumes that you know intermediate Python.

Getting FASTA output from PLINK files using Biopython

Here is a bit of code that you can use to extract FASTA files from PLINK genotype calls. I am doing a few assumptions:

  • There is a reference genome to compare against
  • If no data is available from the genotyping we will infer the reference call. This might be acceptable for some Whole Genome Sequencing data sources (and even so...), but if you are using something less dense then you might want to mark it as missing data. The version without the reference genome would be much easier to code
  • Adding to the reference genome, your genes of interest will be on a BED file
  • I am doing this inside IPython Notebook. I only make use of a magic, which can be trivially changed by os.system
  • Your PLINK is binary encoded

Lets start with the import code, plus some variables that you need to change:

import gzip
from Bio.Seq import Seq
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

indiv_file = 'indivs.txt'

genome = '/data/atroparvus/tra/ag/ref/pest3.fa.gz'
my_bed = 'my_genes.bed'
plink_pref = '/data/gambiae/phase1/AR3'

You will need to change the genome file (a gzipped reference fasta), the bed file with your positions on interest) and the plink prefix.

Next, lets do a simple function to parse BED files, this will be coded as a generator:

def get_features(bed_file):
    for l in open(bed_file):
        toks = l.rstrip().split('\t')
        chrom = toks[0]
        start = int(toks[1])
        end = int(toks[2])
        gene = toks[3]
        strand = toks[4]
        yield chrom, start, end, gene, strand
Be very careful with the reference coordinates that you use. Biopython and BED are typically 0-based. PLINK is typically 1-based. Serious pain and suffering (of the kind that is difficult to detect) can be caused by this.

Now lets get the sequence for a certain area of interest:

def get_seq(fasta_name, chrom, start, end, determine_chrom=lambda x: x.description.split(':')[2]):
    recs = SeqIO.parse(gzip.open(fasta_name, 'rt', encoding='utf8'), 'fasta')
    for rec in recs:
        if determine_chrom(rec) == chrom:
            return rec.seq[start:end + 1]

This will take your reference genome file and return a Biopython sequence with your area of interest. Notice the parameter to determine the chromosome from the FASTA description. Different FASTA files can have completely different descriptor lines, so we abstract that away. In my concrete case, the description looks like this:

>2L chromosome:AgamP3:2L:1:49364325:1

So, I take the description, split it by : and get the 3rd element (2L in this case). "But I could have used the ID", I hear you say. Yes I could, but I wanted to illustrate code for strange description lines...

OK, now lets get the SNPs from the area of interest in the genome from our PLINK file. The file is in binary format (you can easily change this for text or even VCF as long as you use PLINK 1.9+) and the output is is text format (so that we can parse it). We are using IPython magic here (os.system, if you prefer):

def get_snps(plink_prefix_in, plink_prefix_out,
             indiv_file, chrom, start, end):
    !plink --bfile $plink_prefix_in --chr $chrom --from-bp $start --to-bp $end\
    --keep $indiv_file --allow-extra-chr --recode --out $plink_prefix_out

OK, now the work-horse of all of this, a function to read a PLINK file and return sequences based on the reference genome:

def get_plink_seqs(ref_seq, plink_pref, start):
    f = open(plink_pref + '.map')
    poses = []
    for l in f:
        pos = l.rstrip().split('\t')[-1]
        poses.append(int(pos))
    f = open(plink_pref + '.ped')
    for l in f:
        toks = l.rstrip().replace('\t', ' ').split(' ')
        fam = toks[0]
        ind = toks[1]
        alleles = toks[6:]
        seq1 = list(ref_seq)
        seq2 = list(ref_seq)
        for i, pos in enumerate(poses):
            ref_pos = pos - start
            ref_allele = ref_seq[ref_pos]
            a1 = alleles[2 * i]
            a2 = alleles[2 * i + 1]
            if ref_allele == ref_allele.lower():
                #maintain case of the ref
                a1 = a1.lower()
                a2 = a2.lower()
            seq1[ref_pos] = a1
            seq2[ref_pos] = a2
            #if a1 != ref_allele or a2 != ref_allele:
            #    print(fam, ind, i, ref_pos, pos, ref_allele, a1, a2)
        yield fam, ind, ''.join(seq1), ''.join(seq2)

So, the input a reference sequence (only the tiny slice of the genome corresponding to your area of interest), the PLINK prefix of your PLINK files and the start position. The code starts by loading the PLINK positions from the MAP file (notice that I ignore the chromosome: you should only have your chromosome, plus PLINK can do strange things to chromosome names - I will not detail that here). Most of the code involves offsetting from the absolute coordinates in the PLINK file to the relative coordinates in our reference sequence. Also note that we maintain the capitalization as in the reference genome (which normally has a meaning, like soft-masking).

Finally, lets extract the sequences proper to FASTA files: these will be file named by the feature/gene:

for feature in get_features(my_bed):
    chrom, start, end, gene, strand = feature
    ref_seq = get_seq(genome, chrom, start, end)
    print(chrom, start + 1, end + 1, gene)
    get_snps(plink_pref + '/' + chrom, gene, indiv_file,
             chrom, start + 1, end + 1)# 1 based
    my_records = []
    my_records.append(SeqRecord(seq=ref_seq, id=gene, description='Reference'))
    for fam, ind, seq1, seq2 in get_plink_seqs(ref_seq, gene, start + 1):
        my_records.append(SeqRecord(seq=Seq(seq1), id=gene, description='%s_%s_1' % (fam, ind)))
        my_records.append(SeqRecord(seq=Seq(seq2), id=gene, description='%s_%s_2' % (fam, ind)))
    SeqIO.write(my_records, '%s.fa' % gene, 'fasta')

We iterate through all our features on the BED file: we get the corresponding reference sequence, get the SNPs from PLINK and generate the FASTA sequences. We use SeqIO for writing.

A few important notes
  • The PLINK files are not phased, so you cannot assume correct phasing of the FASTA files
  • I am assuming diploid data here
  • The code could be made more efficient by getting the correct reference sequence only once (it can be quite heavy). This can be easily achieved by retrieving the BED features by chromosome order (many times that have that order, anyway) and only loading the reference sequence when the chromosome/scafold changes<

I hope this code is useful to you. As usual comments and criticism are very welcome.

A Docker container for Biopython testing using Buildbot

This is the last in a series of posts related to creating a testing container for Biopython using Docker.

In the previous post we developed a container that allows you to have a complete, fully functional environment for Biopython.

Our objective now is different: We want to have a container to help do integration testing for Biopython. That is a completely different kind of animal:

  • It should be able to connect to a buildbot server (remember buildbot is a continuous integration framework). Biopython has a buildbot server.
  • It is a server/executable container. You do not log-in into it (unless something strange is happening). When you start it, it becomes and independent agent (we are helping to build SkyNet here): it connects to the builbdot server and does whatever testing tasks are required from it.

Buildbot installation

Buildbot installation can be done by adding something like this:

RUN apt-get install -y buildbot-slave RUN buildslave create-slave biopython testing.open-bio.org:9989 CHANGEUSER CHANGEPASS

note

Notice that you need to change the username and password on the code (CHANGEUSER, CHANGEPASS). This will have to be agreed beforehand with the buildbot system administrator. This also means that you cannot use the Dockerfile from the Internet directly. You have to download and manually edit it. Alternative approaches would be appreciated, BTW.

Server (executable) mode

This kind of container is different from the previous one, where everything was prepared for a user to login and do interactive work inside the container. Here, when you fire up the container it goes on to its business (testing).

In Docker that is achieved by creating an ENTRYPOINT:

ENTRYPOINT bash entrypoint.sh

entrypoint.sh should include everything needed to start up the server (for example, the database servers that on the previous post were started on .bashrc, should here go into entrypoint.sh). Importantly, if you have a daemon server (i.e., that goes into the background) you have to keep the entrypoint running or the container will terminate. So, in our case we will have the following file:

service postgresql start
service mysql start
export DIALIGN2_DIR=/tmp
buildslave start biopython
tail -f biopython/twistd.log #To hold things

Final dispositions

This series of tutorials was made by a Docker newbie. While I hope this will help others with their docker installations, there might be sub-optimal solutions here (I would appreciate if you point me in the right direction).

The Biopython docker work will continue here. Updates can be found there.

Remember, while you can directly run the container from the previous blog post, the testing container will require you to a) download the container file b) edit username and password c) then you can run it.

A Docker container for Biopython

In this post we will create a docker container for Biopython. Our final objective is to have a container to test Biopython (a different kind of beast compared with what we are doing here), but this one might actually be interesting for a lot more people. For this we will use Docker.

A Caveat: Docker is undergoing intense development thus some of the suggestions below might break with time. If you find such a case, please inform me and I will amend this post. I will assume that you have installed Docker and that your user has group permissions to interact with Docker (if not, then just sudo most of the commands below).

For the impatient
Install docker. Remember: depending on your installation you might need to add sudo to the commands below.
docker build -t biopython https://raw.githubusercontent.com/tiagoantao/my-containers/master/biopython/Biopython3
#Grab a coffee, wait a bit
docker run -t -i biopython /bin/bash

Creating a Docker file

Basic stuff

We will use Ubuntu, most specifically Ubuntu Saucy. Why Saucy? For no specific reason, but we want to make sure that the environment is stable, so we pick a recent-but-not-bleeding-edge distro. So, our file starts with:

FROM ubuntu:saucy

Which simple uses Saucy (downloading the image if necessary)

We now add all the ubuntu standard packages needed for Biopython:

#We need this for phylip
RUN echo 'deb http://archive.ubuntu.com/ubuntu precise multiverse' >> /etc/apt/sources.list
RUN apt-get update
RUN apt-get install -y git python-numpy wget gcc python-dev
RUN apt-get install -y python-matplotlib python-reportlab python-rdflib
RUN apt-get install -y clustalw fasttree t-coffee
RUN apt-get install -y bwa ncbi-blast+ emboss clustalo phylip mafft muscle
RUN apt-get instally -y embassy-phylip samtools phyml wise raxml
# For BioSQL
RUN apt-get install -y mysql-server python-mysqldb postgresql python-psycopg2

Notice the change of repositories and all support packages (git, gcc, ...)

Non-standard packages

There are several pieces of software that require manual installation. It is an ongoing task, but it is mostly simple grunt work, for example:

#reportlab fonts
RUN wget http://www.reportlab.com/ftp/fonts/pfbfer.zip
WORKDIR cd /usr/lib/python2.7/dist-packages/reportlab
RUN  mkdir fonts
WORKDIR cd /usr/lib/python2.7/dist-packages/reportlab/fonts
RUN unzip /pfbfer.zip
WORKDIR /
RUN rm pfbfer.zip
#genepop
RUN mkdir genepop
WORKDIR /genepop
RUN wget http://kimura.univ-montp2.fr/~rousset/sources.tar.gz
RUN tar zxf sources.tar.gz
RUN g++ -DNO_MODULES -o Genepop GenepopS.cpp -O3
RUN cp Genepop /usr/bin
WORKDIR /
RUN rm -rf genepop

Not much more than a sequence of bash commands in all the cases that I have done (download stuff, compile, copy, cleanup, ...).

Configuring and starting services (DBs)

Here we need to configure the databases needed for BioSQL (PostgreSQL and MySQL - sqlite is ready). The configuration looks like this:

RUN echo "host    all             all             ::1/128                 trust" > /etc/postgresql/9.1/main/pg_hba.conf
RUN echo "service postgresql start" > .bashrc
RUN echo "service mysql start" >> .bashrc

We the need to configure permissions access to the postgreSQL server. Notice that the address is a IPv6 one. Something in the system (I did not research what) is doing IPv6 first (localhost has both a v4 and v6 address). Modern: yes, welcome: yes, expected: no. So, if something based on localhost seems to be failing check if it is using IPv6.

The Database servers are started in .bashrc. This solution is, in my view, sub-optimal (for instance you can run a container without starting with bash, and there goes database server initialization). If you know of a better way, please say...

Preparing Biopython

It is actually quite easy:

#Biopython
RUN git clone https://github.com/biopython/biopython.git
WORKDIR /biopython
RUN python setup.py install

Running and getting the Docker file

If you want to run this do, on your machine (with docker, preferably with the sudo issue resolved):

docker build -t biopython https://raw.github.com/tiagoantao/my-containers/master/Biopython
docker run -i -t biopython /bin/bash

You will see a few errors related to database startup, but these are not important in this context.

You can now do, for example:

root@dc9d8c3c48f8:/biopython# cd Tests/
root@dc9d8c3c48f8:/biopython/Tests# python run_tests.py --offline

Grab the docker file here, if you want to look at it.

Next steps

Next step will be the creation of a buildbot docker for Biopython. Also finalize the list of dependencies (almost done).