Constructing bacterial pan-genomes

Pan-genome construction notebook

31/10/2012

The 65-genome alignment took 9 hours for the nucmer alignments, and the MugsyWGA step is still running (~8 hours elapsed); with about 4Mb of pan-genome available so far, I figure it's less than 50% of the way through.

On to the visualisation on some test data:

  • Align some WGS reads and metagenomics reads against pan-genome
  • Write R script to display as heatmap (colour = coverage)
  • Then think about how to align regular whole-genomes
    • MUMmer
    • BLAT (psl2bam.py to simplify)
  • Try fixed length blocks and variable length blocks

Example input to R:

Block   Sample   MeanDepth  PercentCoveredByReads
Block1  Sample1  40         99
Block2  Sample2  30         98
Block3  Sample3  40         99

Just try it and see how it works? Exclude gapped characters from % covered by reads.
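
As a sketch of the depth-counting step, the per-block numbers in the table above could be computed like this (`block_stats` is a hypothetical helper, not part of any script here; gap characters are excluded from the denominator as planned):

```python
def block_stats(block_seq, depths):
    """Mean depth and percent-covered for one pan-genome block.

    block_seq may contain '-' gap characters, which are excluded from
    both statistics; depths is the per-position read depth, same length
    as block_seq. (Hypothetical helper for illustration only.)"""
    ungapped = [d for base, d in zip(block_seq, depths) if base != "-"]
    if not ungapped:
        return 0.0, 0.0
    mean_depth = sum(ungapped) / float(len(ungapped))
    pct_covered = 100.0 * sum(1 for d in ungapped if d > 0) / len(ungapped)
    return mean_depth, pct_covered
```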

Options for depth-counting:

  • GATK DepthOfCoverage
  • samtools depth
  • BEDtools
  • VarScan2

Try piping samtools depth into a Python script first.

Script: depth.py
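
The real depth.py isn't reproduced here, but a minimal sketch of what it might do with `samtools depth` output (three whitespace-separated columns: block, position, depth) is below. `summarise` and the sample label are my invention; block lengths are assumed to come from the pan-genome FASTA index, since `samtools depth` by default omits zero-coverage positions:

```python
#!/usr/bin/env python
"""Hypothetical sketch of depth.py: digest `samtools depth` output
into per-block mean depth and percent covered by reads."""
from collections import defaultdict

def summarise(depth_lines, block_lengths, sample="Sample1"):
    """Return (Block, Sample, MeanDepth, PercentCoveredByReads) rows,
    matching the example R input table. block_lengths maps block name
    to length, e.g. parsed from the samtools faidx .fai file."""
    total = defaultdict(int)    # summed depth per block
    covered = defaultdict(int)  # positions with >=1 read per block
    for line in depth_lines:
        block, _pos, depth = line.split()
        total[block] += int(depth)
        if int(depth) > 0:
            covered[block] += 1
    return [(b, sample, total[b] / float(n), 100.0 * covered[b] / n)
            for b, n in sorted(block_lengths.items())]

# intended use: samtools depth my_output_file.sorted.bam | python depth.py
```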

Mugsy finally finished running on the 65 genomes set, final pan-genome size 25Mb, 113k blocks. (is this reasonable?)

Now run all the metagenomics samples across the test pan-genome and depth script and get visualisation working.

Issues:

  • the pan-genome sequences contain hyphens (alignment gap characters), probably best to remove these as reads won't map well across them?
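
A minimal sketch of the hyphen-stripping idea (plain Python; the function name is mine, and a one-liner like `sed '/^>/!s/-//g'` would do the same). Note that removing gaps shifts coordinates within each block, which any depth bookkeeping would need to allow for:

```python
def degap_fasta(lines):
    """Strip alignment gap characters ('-') from FASTA sequence lines,
    leaving header lines untouched. Yields cleaned lines."""
    for line in lines:
        line = line.rstrip("\n")
        yield line if line.startswith(">") else line.replace("-", "")
```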

A simple guide to variant calling with BWA, samtools, VarScan2

Mapping and variant calling

Just as we seem to have got on top of this outbreak, cases of the same sequence type are being reported in France. By now, a reference genome for the German outbreak strain has been nearly completed and is available at reference_genomes/280_flxplus.fna

Using SNPs we would like to determine how related the strains are, and how the two outbreaks may be related. Is there evidence of chains of transmission between outbreaks?

Indexing our reference for first use

First we need to index our reference file:

bwa index -a is reference_genomes/280_flxplus.fna
samtools faidx reference_genomes/280_flxplus.fna

Perform an alignment

Then we do an alignment. We will use bwasw to do this (-t 4 uses 4 CPUs):

bwa bwasw -t 4 reference_genomes/280_flxplus.fna data/French_1.fastq data/French_2.fastq -f my_output_file.sam

What is the calculated insert size and standard deviation?

Analysing SAM format

Have a look at the SAM file which has been produced, e.g.

less my_output_file.sam

Have a look at the first mapping:

  • How much of the read mapped to the reference?
  • What orientation is the read in?
  • What is the reported mapping quality?
  • How many mismatches are there?

    Hint: refer to http://genome.sph.umich.edu/wiki/SAM
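
If it helps with the questions above, the relevant SAM columns can be pulled apart with a few lines of Python (a sketch; `sam_summary` is a made-up helper). Orientation is bit 0x10 of the FLAG, mapping quality is column 5, the CIGAR string shows how much of the read aligned, and mismatches are usually reported in the NM tag:

```python
def sam_summary(sam_line):
    """Extract orientation, mapping quality, CIGAR and mismatch count
    (NM tag, if present) from one SAM alignment line."""
    f = sam_line.rstrip("\n").split("\t")
    flag, mapq = int(f[1]), int(f[4])
    nm = next((t.split(":")[2] for t in f[11:] if t.startswith("NM:")), None)
    return {
        "reverse_strand": bool(flag & 0x10),  # 0x10 = reverse-complemented
        "mapq": mapq,
        "cigar": f[5],  # e.g. soft-clipping (S) means part didn't align
        "mismatches": None if nm is None else int(nm),
    }
```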

In order to do SNP calling we have to convert the SAM alignments to binary (BAM) format and sort them:

samtools view -bS my_output_file.sam -o my_output_file.bam
samtools sort my_output_file.bam my_output_file.sorted

We can look at the consensus sequence using the samtools mpileup command.

samtools mpileup -f reference_genomes/280_flxplus.fna my_output_file.sorted.bam | less
  • What do you notice about the output?
  • By eyeballing the data, what is the depth of coverage for your sample?
  • Is the coverage consistent across the genome? Does it vary? Are there places where it varies more?
  • Can you spot any potential SNPs?

Let’s try and call some SNPs. We’ll do this by passing the output of mpileup through to VarScan, which will perform some basic filtering for us; we will customise this later.

samtools mpileup -f reference_genomes/280_flxplus.fna my_output_file.sorted.bam | java -jar bin/VarScan.jar mpileup2snp --output-vcf --strand-filter 0

Examine the output of VarScan.

  • How many SNPs are called?
  • How many of them fail a test? Which ones?

Load the Savant browser

Savant.sh

Have a look at a selection of SNPs. How do those which fail the filters differ from those that pass? What is the difference between HET and HOM SNPs? What does this mean for a bacterial strain? How could we get rid of HET SNPs?

mpileup2snp

This command calls SNPs from an mpileup file based on user-defined parameters:

	USAGE: java -jar VarScan.jar mpileup2snp [mpileup file] OPTIONS
	mpileup file - The SAMtools mpileup file

	OPTIONS:
		--min-coverage      Minimum read depth at a position to make a call [8]
		--min-reads2        Minimum supporting reads at a position to call variants [2]
		--min-avg-qual      Minimum base quality at a position to count a read [15]
		--min-var-freq      Minimum variant allele frequency threshold [0.01]
		--min-freq-for-hom  Minimum frequency to call homozygote [0.75]
		--p-value           Default p-value threshold for calling variants [99e-02]
		--strand-filter     Ignore variants with >90% support on one strand [1]
		--output-vcf        If set to 1, outputs in VCF format
		--variants          Report only variant (SNP/indel) positions (mpileup2cns only) [0]

In the Grad 2011 paper they used the following criteria for SNP filtering:

Potential SNPs from the Illumina sequences were called by GATK Unified Genotyper (22), filtering the data according to the following parameters: >90% agreement among reads; at least five unambiguously mapped reads; no greater than 50% mapping ambiguity; insertions and deletions were ignored. Over 97% of the bases in the genome of each outbreak isolate fulfilled these criteria.

Can you reproduce these methods using VarScan? Try setting different options:

samtools mpileup -f reference_genomes/280_flxplus.fna my_output_file.sorted.bam | java -jar bin/VarScan.jar mpileup2snp --output-vcf modifiers_to_varscan_output
  • How many SNPs do you have now?
  • Are any of these SNPs dubious?

Examine the SNPs in Savant and check you are happy with them all.

Advanced: look for indels with mpileup2indel

samtools mpileup -f reference_genomes/280_flxplus.fna my_output_file.sorted.bam | java -jar bin/VarScan.jar mpileup2indel --output-vcf --strand-filter 0

As a group we can decide on some reasonable filtering settings, then we will do a large mpileup of all the available reads, and review the results in PhyloViz at the end.

UNIX recipes, tips and tricks

Since switching to Ubuntu I’ve been driven crazy by having cut-and-pasted commands not appear in my bash history. This seems to be the fix: http://superuser.com/questions/392134/when-i-paste-a-command-on-my-bash-prompt-it-is-not-in-history-how-can-i-add-it

Set HISTCONTROL to something other than ignorespace or ignoreboth.
Setting HISTCONTROL=ignoredups will work.
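
For reference, the corresponding line in ~/.bashrc would be something like:

```shell
# ~/.bashrc -- ignorespace/ignoreboth drop commands that begin with a
# space (as pasted lines often do) from history; ignoredups only drops
# consecutive duplicate commands
export HISTCONTROL=ignoredups
```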

Constructing bacterial pan-genomes

Pan-genome construction notebook

29/10/2012

Plan:

Using Mugsy to generate LCBs for all E. coli / Shigella sequences. Extract LCBs to create a pan-genome and use as a reference alignment database for short-read mapping of metagenomics data. By plotting presence/absence of mapping reads against each individual LCB, hope to create some kind of visual “barcode” for the E. coli / Shigella strains in a metagenomics sample (obviously this could be applied to other species). Compare with other genomes, e.g. E. coli O157, E. coli ETEC etc. Sort blocks for obviously recognisable elements, e.g. the E. coli core genome, Shiga-toxin phage, plasmids, etc.

Steps:

Download all E. coli and Shigella from Genbank (https://gist.github.com/3974107) - 65 genomes. Place paths in inputfiles.

Needed to edit Mugsy's mugsyenv.sh file, replacing ‘mapping’ with ‘MUMmer3.20’.

Seemed like Mugsy required the first file to be a single replicon, but this might have been my mistake.

30/10/2012

Create pan-genome for all E. coli/Shigella

source mugsyenv.sh
cat inputfiles | xargs mugsy --directory . --prefix all_ecoli_shig
67 genomes
Starting Nucmer: Tue Oct 30 09:53:42 GMT 2012
..
..
cat: .//Escherichia_coli_W_uid162101: No such file or directory
cat: .//Escherichia_coli_K_12_substr__W3110_uid161931: No such file or directory
[the same two "No such file or directory" errors repeated over and over]

Getting weird errors, not sure what’s happening here.

Try rewriting all the headers to just the accession code.

find ../refs/ecoli/Bacteria/ -wholename "*Escherichia_coli*.fasta" -or -wholename "*Shigella*.fasta" | while read i ; do python rewriteheaders.py "$i" refs/`basename "$i"` ; done

rewriteheaders.py:

from Bio import SeqIO
import sys

# Rewrite each FASTA header to just the accession, i.e. field 4 of the
# NCBI "gi|<gi>|ref|<accession>|" identifier format.
fh = open(sys.argv[2], "w")

for rec in SeqIO.parse(open(sys.argv[1]), "fasta"):
    rec.id = rec.id.split("|")[3]
    rec.description = ''
    SeqIO.write([rec], fh, "fasta")

fh.close()

Seems to work for 20 genomes now.

Alignment file to pan-genome

Create pan-genome from MAF file. Try Biopython MAF branch (http://biopython.org/wiki/Multiple_Alignment_Format). Read each alignment and take the non-gapped characters. Perhaps need to remove any Ns or very small blocks. What are the size of the blocks and number of species per block? Plot.
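
As a stopgap before the Biopython MAF branch, block lengths and species counts for the plot could be pulled out with a minimal hand-rolled reader (a sketch; only "a" and "s" lines are handled, and the function name is mine):

```python
def maf_block_stats(handle):
    """Yield (alignment_length, n_species) for each alignment block in
    a MAF file. 'a' lines open a block; each 's' line is one aligned
    sequence row, with the alignment text in the 7th field."""
    length, n_seqs = 0, 0
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "a":          # new alignment block
            if n_seqs:
                yield length, n_seqs
            length, n_seqs = 0, 0
        elif fields[0] == "s":        # one aligned sequence row
            n_seqs += 1
            length = len(fields[6])
    if n_seqs:                        # emit the final block
        yield length, n_seqs
```

Writing the yielded pairs out as two columns would give the sizes.txt used by the ggplot snippet below.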

Needed to modify Biopython to get it to read Mugsy files; sent the diff to the author:

diff --git a/Bio/AlignIO/MafIO.py b/Bio/AlignIO/MafIO.py
index 6eda0ca..4bb1407 100644
--- a/Bio/AlignIO/MafIO.py
+++ b/Bio/AlignIO/MafIO.py
@@ -178,7 +178,7 @@ def MafIterator(handle, seq_count = None, alphabet = single_letter_alphabet):
 
             annotations = dict([x.split("=") for x in line.strip().split()[1:]])
 
-            if len([x for x in annotations.keys() if x not in ("score", "pass")]) > 0:
+            if len([x for x in annotations.keys() if x not in ("score", "pass", "label", "mult")]) > 0:
                 raise ValueError("Error parsing alignment - invalid key in 'a' line")
         elif line.startswith("#"):
             # ignore comments

Script: pangenome.py - extracts sizes and lengths of alignments from MAF file

library(ggplot2)
sizes <- read.table("sizes.txt", header=FALSE)
ggplot(sizes, aes(x=V1, y=V2)) + geom_point()

How to deal with alignment blocks of different lengths for each species? Could take the longest, but there may be important differences between blocks (e.g. phage insertion) which would then not be captured. Could split blocks at long alignment gaps into sub-blocks. Or - easy - split the blocks into sub-blocks of ~1000 base pairs (overlapping or not?). Then if 90% of the block is mapped, consider it present, otherwise absent. Blocks would be tied together to help with layout.
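
The "easy" fixed-length option might look like this (a sketch; the ~1000 bp size and 90% threshold are just the figures mooted above, and the function name is mine):

```python
def subblock_presence(depths, size=1000, threshold=0.9):
    """Split one block's per-position depth vector into fixed-length,
    non-overlapping sub-blocks and call each one present (True) if at
    least `threshold` of its positions have mapped reads."""
    calls = []
    for start in range(0, len(depths), size):
        chunk = depths[start:start + size]
        frac = sum(1 for d in chunk if d > 0) / float(len(chunk))
        calls.append(frac >= threshold)
    return calls
```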

Test with 3 genomes and TY2482

Test with 3 genome sequences. Just extracted the sequence in each LCB alignment with the fewest gapped characters to create a pan-genome. Mapping TY2482 reads: 85% mapped, quite promising. Test again with 20 genomes.
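
The selection rule used here - take the row of each LCB alignment with the fewest gap characters and degap it - is essentially a one-liner (a sketch; pangenome.py itself is not shown):

```python
def least_gapped(rows):
    """From one LCB's aligned rows (strings of equal length, possibly
    containing '-'), return the row with the fewest gap characters,
    with its gaps stripped out."""
    best = min(rows, key=lambda s: s.count("-"))
    return best.replace("-", "")
```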

Test with 20 genomes

Pan-genome about 7.7Mb, 2301 LCBs.

python pangenome.py ecoli.maf > test_pangenome.fasta
bwa index test_pangenome.fasta
bwa bwasw -t8 test_pangenome.fasta ~nick/sbtm12/data/TY2482_30x_1.fastq.gz ~nick/sbtm12/data/TY2482_30x_2.fastq.gz > test_ty2482.sam
samtools view -bS test_ty2482.sam > test_ty2482.bam
samtools sort test_ty2482.bam test_ty2482.sorted
samtools index test_ty2482.sorted.bam

samtools flagstat test_ty2482.sorted.bam
3214331 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
2968019 + 0 mapped (92.34%:-nan%)
2968019 + 0 paired in sequencing
1483936 + 0 read1
1484083 + 0 read2
2552550 + 0 properly paired (86.00%:-nan%)
2905151 + 0 with itself and mate mapped
62868 + 0 singletons (2.12%:-nan%)
283060 + 0 with mate mapped to a different chr
215455 + 0 with mate mapped to a different chr (mapQ>=5)

92% not bad, although perhaps not as high as it should be. Compare with reference genome.

Run all 65 genomes overnight …

All the cool kids are on arXiv and Haldane's Sieve .. why you should be too

Something very exciting has happened in recent weeks on arXiv, the preprint server which many biologists believe is the preserve of angry physicists, beardy mathematicians and unwashed computer scientists (joke!!!).

Not any longer. I first felt a disturbance in the force in September when a few high-profile human genomicists started making pledges to send all their manuscripts to arXiv first, including the angriest man in biology, Michael Eisen. I'm pretty sure genomics wunder-kinds Daniel MacArthur and Joe Pickrell are also planning on sending their manuscripts there first. The venerable Ewan Birney is also thinking of getting in on the action, tweeting back in August:

 "Ah. Bugger. Scooped by George Church on arbitrary DNA storage. Our paper is in review <sigh>. (wish we had posted on arXive now"

Things are happening!

Those working in human population genetics and paleogenomics are already posting fascinating, high-impact manuscripts, see for example "The Date of Interbreeding between Neandertals and Modern Humans"  from a team including  Svante Pääbo. Joe Pickrell and crew posted a detailed study on "The Genetic Prehistory of Southern Africa", focusing on groups which speak using click-consonants.

But as interesting and inspiring as these papers are, I am more interested in bioinformatics, microbial genomics, evolution and ecology, so these papers don't really impact my day job. But recently things have got even more interesting .. by which I mean microbial. Witness:

Posted 15th October 2012: Species Identification and Unbiased Profiling of Complex Microbial Communities Using Shotgun Illumina Sequencing of 16S rRNA Amplicon Sequences (Haldane's Sieve, arXiv). This paper describes a novel method of using shotgun sequencing of long 16S amplicons to permit species-level assignments.

Posted 28th September 2012: Horizontal gene transfer may explain variation in θs (Haldane's Sieve) - a paper from Lenski no less, which gives a possible explanation of the intriguing findings of potential mutational "hotspots" in the E. coli genome published in Nature by Inigo Martincorena and Nicholas Luscombe. That paper suggested an "evolutionary risk management strategy" which challenges our fundamental understanding of genetic mutations being acquired randomly and subsequently selected for (demonstrated beautifully by Luria and Delbrück in 1943). Lenski, using data from his long-term E. coli evolution experiment, suggests that undetected recombination is in fact the likely cause of these mutational hotspots.

Posted 13 Oct 2012: A 454 survey of the community composition and core microbiome of the common bed bug, Cimex lectularius, reveals significant microbial community structure across an urban landscape (http://arxiv.org/abs/1210.3707). Notable for being one of the first microbial ecology studies published on arXiv, and obviously bed-bugs are kind of cool/gross.

Posted 19 Sep 2012: Maximum Likelihood Estimation of Frequencies of Known Haplotypes from Pooled Sequence Data (Haldane's Sieve) - a really interesting piece of software which may be useful for haplotype reconstruction in metagenomics and pooled sequencing experiments. I plan to give this a whirl and feed my findings back to the authors.

Posted 1 Oct 2012: Best Practices for Scientific Computing (Haldane's Sieve) - not specifically microbial or biological, but a useful treatise on how to write better code which would be useful for those writing bioinformatics software and pipelines, or even just doing analysis.

So, a little taster there.

Make no mistake - these are all quality manuscripts, they aren't being dumped there because they couldn't get past the PLoS ONE reviewers, or something equally banal.

Why are they there? My interpretation is that these authors get it and are posting to arXiv to take advantage of the particular benefits of using a preprint server. Firstly, the immediacy of getting your work out there - simply submit the PDF and it's available to everyone. Your manuscript then gets a permanent home with a citable DOI. Submitting to arXiv may help with establishing priority.

For me, even more useful than these things are the benefits of publishing to a self-selected audience who are genuinely interested in this subject, and actively wish to read and critique such papers out of professional curiosity, not just because they are lucky/unlucky enough to be selected as peer reviewers. On arXiv, the "vibe" seems much different to that of the now-closed Nature Precedings, which sometimes honestly did feel like a dumping ground for unloved or hurried manuscripts.

A potential worry for these authors is that although they have deposited in arXiv, the community as a whole may not be looking there (arXiv is not archived by PubMed), and so their work may not be routinely cited by others because it simply wasn't seen. Hence this blog post, a small attempt to draw attention to this exciting development!

So I've talked a lot about arXiv - where does Haldane's Sieve come in? This is simply a blog site run by Graham Coop, Bryan Howie and Joe Pickrell. It is important because arXiv provides no facilities for commenting on manuscripts, preferring that individual communities figure out the best way to discuss articles (and sensibly recognising this may not be a single place, something that even the open-access publishers can't really understand).

In maths and physics this is usually done on listservs, but in genomics and biology I guess we are more comfortable with the blog format for discussion hence the choice of WordPress. Haldane's Sieve finds new postings on arXiv, mainly in the field of population genetics, and then posts summary articles for you to comment on. It may be in the future we need a similar site for microbial genomics and ecology, but for now it's not so busy that this nascent community needs splitting up. Another place to find links is Twitter, e.g. by following me (shameless link).

It seems to be working; the discussion of Lenski's paper has already generated a vigorous response from Inigo Martincorena, the likes of which you are unlikely to see in a published journal, and all the better, in my opinion, for its frankness and energy.

So, in summary, you should add Haldane's Sieve and the arXiv qBio category (http://arxiv.org/archive/q-bio) to your feed reader if you want to spot exciting new articles and comment on them, and why not think about sending your next manuscript to arXiv first? (No, it doesn't prevent you publishing in peer-reviewed journals)