Memory hierarchy and computing performance

We discuss code optimization from the point of view of the memory hierarchy. You are probably aware of the disk/RAM hierarchy and the tradeoff between memory and speed. With certain classes of algorithms you can easily take advantage of the cache/RAM hierarchy just by being careful on how you process data. I show a very simple example involving matrix manipulation where you can easily get speedups of up to 40%. And I am pretty sure there are plenty situations where you can do better.

The uncanny similarities between modern JavaScript and Python.

Modern JavaScript way more similar to Python than > you would imagine

Bioinformatics with Python Cookbook

A small introduction to my book with advanced content on Python for Bioinformatics. A Caveat, the book assumes that you know intermediate Python.

Getting FASTA output from PLINK files using Biopython

Here is a bit of code that you can use to extract FASTA files from PLINK genotype calls. I am doing a few assumptions:

  • There is a reference genome to compare against
  • If no data is available from the genotyping we will infer the reference call. This might be acceptable for some Whole Genome Sequencing data sources (and even so...), but if you are using something less dense then you might want to mark it as missing data. The version without the reference genome would be much easier to code
  • Adding to the reference genome, your genes of interest will be on a BED file
  • I am doing this inside IPython Notebook. I only make use of a magic, which can be trivially changed by os.system
  • Your PLINK is binary encoded

Lets start with the import code, plus some variables that you need to change:

import gzip
from Bio.Seq import Seq
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord

indiv_file = 'indivs.txt'

genome = '/data/atroparvus/tra/ag/ref/pest3.fa.gz'
my_bed = 'my_genes.bed'
plink_pref = '/data/gambiae/phase1/AR3'

You will need to change the genome file (a gzipped reference fasta), the bed file with your positions on interest) and the plink prefix.

Next, lets do a simple function to parse BED files, this will be coded as a generator:

def get_features(bed_file):
    for l in open(bed_file):
        toks = l.rstrip().split('\t')
        chrom = toks[0]
        start = int(toks[1])
        end = int(toks[2])
        gene = toks[3]
        strand = toks[4]
        yield chrom, start, end, gene, strand
Be very careful with the reference coordinates that you use. Biopython and BED are typically 0-based. PLINK is typically 1-based. Serious pain and suffering (of the kind that is difficult to detect) can be caused by this.

Now lets get the sequence for a certain area of interest:

def get_seq(fasta_name, chrom, start, end, determine_chrom=lambda x: x.description.split(':')[2]):
    recs = SeqIO.parse(, 'rt', encoding='utf8'), 'fasta')
    for rec in recs:
        if determine_chrom(rec) == chrom:
            return rec.seq[start:end + 1]

This will take your reference genome file and return a Biopython sequence with your area of interest. Notice the parameter to determine the chromosome from the FASTA description. Different FASTA files can have completely different descriptor lines, so we abstract that away. In my concrete case, the description looks like this:

>2L chromosome:AgamP3:2L:1:49364325:1

So, I take the description, split it by : and get the 3rd element (2L in this case). "But I could have used the ID", I hear you say. Yes I could, but I wanted to illustrate code for strange description lines...

OK, now lets get the SNPs from the area of interest in the genome from our PLINK file. The file is in binary format (you can easily change this for text or even VCF as long as you use PLINK 1.9+) and the output is is text format (so that we can parse it). We are using IPython magic here (os.system, if you prefer):

def get_snps(plink_prefix_in, plink_prefix_out,
             indiv_file, chrom, start, end):
    !plink --bfile $plink_prefix_in --chr $chrom --from-bp $start --to-bp $end\
    --keep $indiv_file --allow-extra-chr --recode --out $plink_prefix_out

OK, now the work-horse of all of this, a function to read a PLINK file and return sequences based on the reference genome:

def get_plink_seqs(ref_seq, plink_pref, start):
    f = open(plink_pref + '.map')
    poses = []
    for l in f:
        pos = l.rstrip().split('\t')[-1]
    f = open(plink_pref + '.ped')
    for l in f:
        toks = l.rstrip().replace('\t', ' ').split(' ')
        fam = toks[0]
        ind = toks[1]
        alleles = toks[6:]
        seq1 = list(ref_seq)
        seq2 = list(ref_seq)
        for i, pos in enumerate(poses):
            ref_pos = pos - start
            ref_allele = ref_seq[ref_pos]
            a1 = alleles[2 * i]
            a2 = alleles[2 * i + 1]
            if ref_allele == ref_allele.lower():
                #maintain case of the ref
                a1 = a1.lower()
                a2 = a2.lower()
            seq1[ref_pos] = a1
            seq2[ref_pos] = a2
            #if a1 != ref_allele or a2 != ref_allele:
            #    print(fam, ind, i, ref_pos, pos, ref_allele, a1, a2)
        yield fam, ind, ''.join(seq1), ''.join(seq2)

So, the input a reference sequence (only the tiny slice of the genome corresponding to your area of interest), the PLINK prefix of your PLINK files and the start position. The code starts by loading the PLINK positions from the MAP file (notice that I ignore the chromosome: you should only have your chromosome, plus PLINK can do strange things to chromosome names - I will not detail that here). Most of the code involves offsetting from the absolute coordinates in the PLINK file to the relative coordinates in our reference sequence. Also note that we maintain the capitalization as in the reference genome (which normally has a meaning, like soft-masking).

Finally, lets extract the sequences proper to FASTA files: these will be file named by the feature/gene:

for feature in get_features(my_bed):
    chrom, start, end, gene, strand = feature
    ref_seq = get_seq(genome, chrom, start, end)
    print(chrom, start + 1, end + 1, gene)
    get_snps(plink_pref + '/' + chrom, gene, indiv_file,
             chrom, start + 1, end + 1)# 1 based
    my_records = []
    my_records.append(SeqRecord(seq=ref_seq, id=gene, description='Reference'))
    for fam, ind, seq1, seq2 in get_plink_seqs(ref_seq, gene, start + 1):
        my_records.append(SeqRecord(seq=Seq(seq1), id=gene, description='%s_%s_1' % (fam, ind)))
        my_records.append(SeqRecord(seq=Seq(seq2), id=gene, description='%s_%s_2' % (fam, ind)))
    SeqIO.write(my_records, '%s.fa' % gene, 'fasta')

We iterate through all our features on the BED file: we get the corresponding reference sequence, get the SNPs from PLINK and generate the FASTA sequences. We use SeqIO for writing.

A few important notes
  • The PLINK files are not phased, so you cannot assume correct phasing of the FASTA files
  • I am assuming diploid data here
  • The code could be made more efficient by getting the correct reference sequence only once (it can be quite heavy). This can be easily achieved by retrieving the BED features by chromosome order (many times that have that order, anyway) and only loading the reference sequence when the chromosome/scafold changes<

I hope this code is useful to you. As usual comments and criticism are very welcome.

Automatic code changing in Python with the ast module

Most language compilers and interpreters create Abstract Syntax Trees (ASTs) to represent your code before translation or execution. Python makes these trees accessible to the developer via the ast module. This allows for pretty powerful program analysis and even transformation. This article is a primer on Python's ast module. After reading this you will understand the basics of automated code traversing and manipulation by exploring a illustrative example dealing with converting Python 2 to 3 code.