Generating MLST profiles from short-read data

There are now several available options if you want to call MLST profiles from whole-genome data.

[Image: Result from the DTU MLST web server]

DTU MLST Server

The web server at the Center for Genomic Epidemiology at the Technical University of Denmark is probably the easiest option, with the advantage that it will accept both raw read files and assemblies. It worked well when I tried it; however, it was quite slow to return results, and if you are uploading large read datasets it will take some time, particularly if you are analysing a large number of samples. It also does not have all of the MLST databases listed (I wanted to use C. albicans).

BIGSdb

BIGSdb is a powerful and flexible web server software that can be installed on your local PC or server. It offers the ability to call MLST profiles from assembled genome data, as well as setting up your own typing schemes based on other epidemiologically informative marker genes. But non-bioinformaticians may find it a little tricky to set up.

Update: There is also a hosted version of BIGSdb which lets you cut-and-paste your de novo assembly into the sequence query form and get profiles out, available for a certain subset of the MLST databases (more available on request to Keith Jolley).

SRST

SRST comes from Kat Holt's group in Melbourne. It runs on your local machine and is notable because it calls profiles from short-read data without prior de novo assembly. It gives a confidence score to assignments. As it has some dependencies (BWA, samtools, BLAST) and runs as a Python script it is probably best run on a Linux machine or a Mac.

I found it works quite well on the Illumina data I tried, however there are a few tips for getting it running that are probably worth documenting for other users.

  • The allele files should be named gene.fas, and gene should match the gene name used in the FASTA header lines in the file, as well as the column names in the STs file.
  • The alleles in the allele file should be named gene-N, where N is the number of the allele. Note that if you have a separator other than a hyphen you can specify this with the --name-sep argument, but having no separator is not allowed (as I think is the case with the Cork E. coli database).
  • You need an older version of samtools to run this properly; I used samtools-0.1.12a. Newer versions don't work.
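To make the naming convention concrete, here is a tiny illustrative Python check (my own sketch, not part of SRST) that a gene.fas file follows the gene-N header pattern, with the separator configurable in the same spirit as --name-sep:

```python
import re

def check_allele_names(fasta_text, gene, sep="-"):
    """Return any FASTA headers that do not follow the gene<sep>N
    convention expected by SRST (e.g. >adk-1, >adk-2 in adk.fas)."""
    pattern = re.compile(r"^%s%s(\d+)$" % (re.escape(gene), re.escape(sep)))
    bad = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            name = line[1:].strip()
            if not pattern.match(name):
                bad.append(name)
    return bad

good = ">adk-1\nATGC\n>adk-2\nATGA\n"
print(check_allele_names(good, "adk"))             # [] - all headers conform
print(check_allele_names(">adk1\nATGC\n", "adk"))  # ['adk1'] - no separator
```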

Roll your own (suggested by Anthony Underwood, HPA)

Of course what many people do is first perform a de novo assembly, perhaps with Velvet, and then BLAST the contigs against the MLST allele database. You can then inspect the results manually, or write a little script to collect the results into a profile. If you have one you'd like to share, please post the link in the comments below. Here's my Python script for what it's worth ...
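The script itself isn't reproduced above, so here is a hedged sketch of the general approach (the gene names, the assumption of BLAST -outfmt 6 tabular output, and the 100% identity filter are mine, not necessarily the original script's): parse the tabular hits from searching your contigs against the allele database, keep the best perfect hit per gene, and report the allele numbers as a profile.

```python
def call_profile(blast_lines, genes):
    """Pick one allele per MLST gene from BLAST -outfmt 6 lines
    (query, subject, %identity, aln_length, ...).  Assumes allele names
    follow the gene-N convention; keeps only 100%-identity hits and,
    among those, the longest alignment."""
    best = {}
    for line in blast_lines:
        fields = line.strip().split("\t")
        subject, ident, length = fields[1], float(fields[2]), int(fields[3])
        gene, _, allele = subject.rpartition("-")
        if gene not in genes or ident < 100.0:
            continue
        if gene not in best or length > best[gene][1]:
            best[gene] = (allele, length)
    # 'NA' marks genes with no perfect hit - worth inspecting by hand
    return {g: best[g][0] if g in best else "NA" for g in genes}

hits = [
    "contig1\tadk-3\t100.00\t536\t0\t0\t1\t536\t1\t536\t0.0\t990",
    "contig2\tfumC-12\t99.80\t469\t1\t0\t1\t469\t1\t469\t0.0\t860",
]
print(call_profile(hits, ["adk", "fumC"]))  # {'adk': '3', 'fumC': 'NA'}
```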

Seqbench: A useful meta-database of sequence reads from multiple platforms

Little and often, little and often. This is my new blogging mantra.

A little project that I mentioned at UKNGS2012 is one Lex Nederbragt and I have been kicking around for a while, called Seqbench, and I've only just found the time to write about it on the blog.

It's quite a simple idea: a meta-database containing information and links to sequencing datasets with known reference sequences. It's not a sequence archive like NCBI SRA or ENA. It simply contains what we consider to be the most useful meta-data for a sequencing run, as well as direct links to the FASTQ or SFF files, where available (often via SRA/ENA).

To start with we have focused on E. coli which is probably the most sequenced bacterial species (assuming you don't count mitochondria). The two strains we have focused on are K-12 MG1655 and the O104:H4 STEC from last year's Germany outbreak. So far we have collected almost 150 runs from a huge diversity of instruments including Illumina GA2, HiSeq, MiSeq, 454 GS Junior, 454 GS FLX, 454 GS FLX+, Ion PGM (314, 316, 318 chips) and PacBio. You've also got a variety of read lengths from 100 bases up to 15kb, with insert libraries between 180 bases and 8000 bases!

We couldn't just use SRA because a) many of the datasets are not in there and b) the metadata is often incomplete and inconsistently encoded. Just getting this far has been a fair amount of work, trawling the short-read archives, the literature and manufacturers' websites for datasets. It is not yet complete.

The data is available in a public Google Fusion Table, which permits easy export to CSV.

[iframe src="https://www.google.com/fusiontables/embedviz?viz=GVIZ&t=TABLE&containerId=gviz_canvas&q=select+col0%2C+col1%2C+col2%2C+col3%2C+col4%2C+col5%2C+col6%2C+col7%2C+col8%2C+col9%2C+col10%2C+col11%2C+col12%2C+col13%2C+col14%2C+col15%2C+col16%2C+col17%2C+col18%2C+col19+from+1-RvsucyVWJBXtvszrPKMzzA7laSwUfFxMDuzHu8" width=800 height=300 scrolling=yes]

You can even access this via a SQL interface using the Google Fusion Table API, e.g. with a command-line interface like Pfusion.
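As an illustration (a sketch assuming the Fusion Tables API v1 query endpoint; the live service may also require an API key), you can build a SQL query URL against the table embedded above in a few lines of Python:

```python
from urllib.parse import urlencode

# Table ID taken from the Fusion Table embedded above
TABLE_ID = "1-RvsucyVWJBXtvszrPKMzzA7laSwUfFxMDuzHu8"

def fusion_query_url(sql):
    """Build a Fusion Tables API v1 query URL for a SQL statement."""
    base = "https://www.googleapis.com/fusiontables/v1/query"
    return base + "?" + urlencode({"sql": sql})

print(fusion_query_url("SELECT * FROM %s LIMIT 5" % TABLE_ID))
```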

The dataset is under active curation by Lex and myself. When it is stable I plan to deposit it with a service that assigns a DOI to it, perhaps Figshare or GigaDB so it can be cited easily.

What's the point of such a dataset? Well, we find it extremely helpful for training purposes. For example, Lex and I used it to give a course on de novo assembly. We got the students to assemble different datasets and combinations of datasets and compare their results live, putting their results into a live edited Google Doc (very cool).

I used it again to help give a course on de novo assembly, alignment and SNP calling at SBTM12 (more on this in a forthcoming post, course booklet here).

I also think it would be an awesome resource if you were building bioinformatics software or pipelines, particularly assemblers, aligners or variant callers as the strains have "known truth" reference sequences (although they may have an error or two). You could also use it to do platform comparisons.

No doubt you could think of some other things to use it for as well!

Comments, thoughts on the dataset and its usability welcome below please.

Some thoughts on today's Ion World announcements

A few significant announcements from Ion World today (sourced from the press release and #ionworld Tweets) which I've summarised here:

Proton III and Avalanche

The Ion Proton is now shipping and Life Tech plan to ship 100 instruments to customers in September.

The initial chip, Proton I ("PI") will do 60-80m reads and "up to 10Gb" of output from its 110 million sensors. The read length is "between 100 and 200 bases". No mention of 400 base kit for Proton at this time. Proton II (PII) we know has 650 million sensors. The big news is the announcement of a third chip, predictably named Proton III. This will have 1.2bn sensors and could theoretically generate 256Gb of data. It seems this will be enabled by a new emulsion PCR-free technology they are calling Avalanche. Is this related to the long-talked about Wildfire protocol for SOLiD?

The PIII will require Avalanche, which may suggest that the chip will not have beads and wells but will instead move to some kind of solid-surface technology as used in Illumina sequencing. In turn this may suggest that they have run out of real estate at the PII chip density, and so need to exploit the space between wells to gain further throughput.  But this is pure speculation on my part. Available "within 18 months".

Yes, Chef!

One of the hassles with PGM sequencing right now is that there is a good deal of hands-on work to go from a genomic DNA sample to a sequence-ready loaded chip, even with the OneTouch. So it is great news that the Ion Chef (pictured) has been announced, which claims to go from library to a loaded chip with only minutes of hands-on time. It presumably replaces the OneTouch (OneTouch 2 for Proton) and is scheduled for some time in the first half of 2013.  As for the pun-tastic name: awful. Update 14/09: According to Dale Yuzuki's blog, the Ion Chef will be available mid-2013 and will cost about twice as much as the OneTouch, so around $40,000 I think.

Platform Accuracy

Ion Torrent Suite 3.0 is supposed to bring dramatic improvements in base-calling. Ion Torrent on Twitter quote Shawn Levy at HudsonAlpha saying: "Concerns about insertion deletion errors are now largely solved". Chad Nusbaum is quoted as saying: "Indel mismatches significantly improved in the new bioinformatics software Torrent suite 3.0" (tweet). This is great news. Certainly the recent E. coli 300bp dataset put out on the Ion Torrent website looks a great deal better than what we got in summer 2011.

On the Ion Proton, apparently "consensus accuracy is comparable to that of the Ion PGM™ sequencer", which is a little vague. Apparently Baylor (not the Broad - thanks NJL_Broad for the correction) have performed 274 runs on the Proton (tweet), which is amazing, but it's a shame that no data has been released.

How long?!

There seems to be some confusion about how long it takes to sequence on the Proton. Most of the Twitter PR says two hours, but the press release states "two to four hours". Certainly the PGM run-time has been variable depending on the chip type, leading to the development of a run-time calculator.

Well, in summary, this is a pretty exciting set of announcements from Life Tech, focusing on some real platform issues (accuracy and hands-on time). The Proton with the Proton II chip and Ion Chef, if it all works and delivers high-quality data, is a serious prospect. But of course, the proof of the pudding is in the eating!

It would be great if some early access users would post some data.

Current Ion Roadmap

There seems to be a little confusion as to what is available now. Here is the roadmap I figured out based on the announcements today:

September 2012 - Ion Proton ships to first customers

end 2012 - 400 base kit available for PGM

March 2013 - PII chip available (assuming 6 months from September 2012)

Mid-2013 - Ion Chef launches (for PGM and Proton)

March 2014 - Avalanche and PIII chip available (assuming 18 months from September 2012)


Sequencing low diversity libraries on Illumina MiSeq

After its launch in 2005, the 454 rapidly became the go-to technology if you wanted to sample diversity in amplicon libraries, whether a cancer panel, a viral quasispecies or microbial community profiling. It is not difficult to see why. Compared to Sanger sequencing the 454 offered massive throughput, being able to produce over a million reads per run at the relatively modest price of $10,000. This was an order of magnitude less than Sanger sequencing. And crucially, combining the instrument's high-throughput with barcode multiplexing permitted large numbers of samples to be interrogated on a single run at high coverage depth.

In microbiology and ecology, deep sequencing of 16S amplicon libraries using 454 is now the dominant method for phylogenetic profiling of microbes. Of the 2,210 publications listed on the 454.com website, 839 are in the category “Metagenomics and Microbial Diversity”. The “rare biosphere” in our bodies in health and disease was revealed for the first time. Environmental ecologists used the technology to interrogate hugely diverse environmental niches. Hundreds of new OTUs, often representing hitherto uncultured microbes were revealed for the first time.

Move over 454

Sadly, the pace of development of the 454 platform has stagnated in recent years following the Titanium upgrade in 2008. The long-promised upgrade to GS FLX+ “1kb reads” was late and under-delivered with reads more like 700-800 bases, and some users have reported dissatisfaction with the upgrade. Disappointingly the long read protocol is not supported when running unidirectional Lib-A sequencing, dramatically limiting its potential market. Nor is it available on the benchtop 454 GS Junior, although this may change in future.

But most critical is the apparent blind spot of Roche management to the rapidly dropping costs of sequencing on competitor platforms. The 454 has simply priced itself out of the market by being one to two orders of magnitude more expensive when costed per megabase compared to the Illumina and Life Technologies platforms.

New Platforms for Amplicon Sequencing

So for microbiologists wishing to do 16S sequencing, whether they are driven by cost-cutting, or by a desire to sequence more samples more deeply, it is now time to look around at alternatives. The MiSeq and the PGM are both promising platforms for 16S analysis given their competitive price points, and increasingly long reads (MiSeq 2x150bp, PGM 200bp - going to 2x250bp and 400bp respectively by the end of the year).

Sequencing low diversity libraries on Illumina MiSeq

We are moving to the Illumina MiSeq locally for 16S sequencing. For about £750 we generate over 5 million reads per run. By using paired-end sequencing at 150 bases we can design experiments which generate amplicons a little less than 300 bases and overlap them to generate long pseudo-reads. The error model is favourable compared to 454 as it does not suffer from frequent indel errors, meaning there is less need for expensive denoising steps such as PyroNoise.
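As a toy illustration of that overlap step (my own sketch; real merging tools also handle mismatches and quality scores), a pair can be joined by scanning for an exact overlap between read 1 and the reverse complement of read 2:

```python
def revcomp(seq):
    """Reverse-complement a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def merge_pair(r1, r2, min_overlap=10):
    """Merge a read pair into one pseudo-read by finding the longest
    exact overlap between r1 and the reverse complement of r2.
    Returns None if no overlap of at least min_overlap bases exists."""
    r2rc = revcomp(r2)
    for olen in range(min(len(r1), len(r2rc)), min_overlap - 1, -1):
        if r1[-olen:] == r2rc[:olen]:
            return r1 + r2rc[olen:]
    return None

# Two 20-base reads from a 30-base fragment, overlapping by 10 bases
frag = "ATCGGATTACAGTCCGTAAGCTGATCGTTA"
r1 = frag[:20]
r2 = revcomp(frag[-20:])      # read 2 is sequenced from the opposite end
print(merge_pair(r1, r2))     # reconstructs the full 30-base fragment
```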

However, there is a fly in the ointment. Amplicon sequencing on the Illumina platform has traditionally been problematic when sequencing so-called "low diversity" libraries such as 16S, resulting in low yields and lower per-base quality scores compared to sequencing more random libraries, e.g. from genomic DNA.

The good folks of Seqanswers have discussed this at length, and various work-arounds have been suggested. One commonly used approach is to spike in a genomic, higher-diversity sample, e.g. PhiX. The more PhiX spiked in, the better the results, but at the expense of the number of amplicon sequences generated. A second option is to add a sequence of N bases upstream of the 16S primer, resulting in the generation of random sequences. This however reduces the effective read length.

Solving the problem

We have been very fortunate in the past few weeks to welcome Josh Quick to our lab. He previously worked as an integration engineer at Illumina but has now decided to hone his skills as a bioinformatician. There's not much he doesn't know about Illumina sequencing, and he quickly introduced me to some tips for improving amplicon sequencing performance that were so impressive I asked him to share them here.

Over to you, Josh ...

There are 3 main areas in which low-diversity samples can cause you problems on MiSeq:

1) Focusing (every cycle) - the MiSeq focuses on the T channel with a fall-back to the C channel; in practice, as long as all the signal is not in the G channel you will be fine.  All other issues aside, a very small PhiX spike-in (~5%) is enough to prevent any focusing issues regardless of the composition of your library.

2) Template building (cycles 1 to 4) and registration (every cycle) - RTA uses images from the first 4 cycles to detect the positions of all the clusters.  You need some signal present in each channel for RTA to do template generation and registration properly.  Again, a small PhiX spike-in (~5%) is usually enough to prevent problems here, provided density is <=700k.

3) Phasing/matrix estimation (cycles 1 - 12) - RTA estimates the average colour matrix over the first 4 cycles and the phasing over the first 12 cycles.  Low-diversity samples can cause problems with both, as the intensity is not evenly distributed across all channels as it is with genomic libraries.  Because these estimates are used to perform corrections, a bad estimate here can cause your quality to start high then rapidly fall away; in these cases you might need a large PhiX spike-in (~50%) to solve the problem.

Control lanes - on the GA/HiSeq, 1) and 2) were still considerations (although each instrument focuses differently), but the use of a PhiX control lane eliminated problem 3).  On the MiSeq, having only a single lane means a control lane isn’t possible, but there is a method for using ‘control’ conditions by modifying the RTA configuration file.

In my experience the most likely thing to go wrong is the phasing estimator: it will give a spuriously high phasing or prephasing number of >1%, which means your quality starts off good then rapidly falls away.  However, you can use a value based on a previous PhiX run; for example, ours would be 0.0015/0.003.

The way to use ‘control’ matrix/phasing on the MiSeq:

(DISCLAIMER - this is not a configuration supported by Illumina so use it at your own risk)

Our MiSeq is running:

  • MiSeq Control Software 1.2.3
  • RTA 1.14.23

Locate your RTA configuration, ours is at:

C:\Illumina\RTA\Configs\MiSeq.Configuration.xml

Locate your control phasing and matrix files (previous PhiX run is ideal):

D:\Illumina\MiSeqTemp\RunFolder\Data\Intensities\BaseCalls\(Phasing|Matrix)\s_1(phasing|matrix).txt

Use a text editor to put the matrix and phasing values from these files into the MiSeq.Configuration.xml below the other options like this:

<HardCodedPhasing>
  <float>0.0015</float>
</HardCodedPhasing>
<HardCodedPrePhasing>
  <float>0.003</float>
</HardCodedPrePhasing>
<HardCodedColorMatrix>
  <ArrayOfFloat>
    <float>0.9339278</float>
    <float>0.07252103</float>
    <float>0</float>
    <float>0</float>
    <float>1.458246</float>
    <float>1.399187</float>
    <float>0</float>
    <float>0</float>
    <float>0</float>
    <float>0</float>
    <float>0.8679092</float>
    <float>0.03415901</float>
    <float>0</float>
    <float>0</float>
    <float>0.5764247</float>
    <float>0.988043</float>
  </ArrayOfFloat>
</HardCodedColorMatrix>

You need one float/ArrayOfFloat per read, so the above would set the phasing, prephasing and matrix for a single-read run, while the below would set just the phasing for a dual-index paired-end run with four reads:

<HardCodedPhasing>
  <float>0.0015</float>
  <float>0.0015</float>
  <float>0.0015</float>
  <float>0.0015</float>
</HardCodedPhasing>

When the run starts check the RTA configuration file in your run folder to make sure it accepted the settings:

D:\Illumina\MiSeqTemp\RunFolder\Data\Intensities\RTAConfiguration.xml

In most cases this will enable you to use a significantly smaller PhiX spike-in, though you will still need a 5% minimum to prevent problems arising from 1) and 2), and do not run at high density for amplicon work - 700k is the upper limit for difficult low-diversity samples.  It is also possible to save the images for re-running RTA offline, which lets you try different settings to find what works best.  The MiSeq.Configuration.xml setting for this is:

<CopyImages>true</CopyImages>

Good luck!

Update 6th September 2012: Some of the example values in the original post were wrong and have been corrected. However these were just illustrative, you should use the values from a test run on your local machine for this approach to be useful.

Update 31st October 2012: In the latest release of RTA (1.16) you no longer need to modify your RTAConfiguration.xml, instead save a copy of your control phasing/matrix files described above in the root of the RTA directory as phasing.txt and matrix.txt. RTA will fall back to the values in these files if it detects a low diversity sample.

UK Next Gen Sequencing Meeting 2012 - Loads of good talks


Shameless plug alert, I promise I'll get back to more in-depth blogging sometime soon!

If you haven't already, please register for the UK NGS Meeting 2012 held in Nottingham, UK on August 28-30. This is the third incarnation of this relaxed meeting which has a diverse cross-section of NGS talks. There are particularly strong talks in the Pathogens strand this year, which I am helping organise. A few talks I am particularly looking forward to with a microbial theme include:

Dave Rasko, University of Maryland; "Application of whole genome sequencing to microbial forensics and diagnostic development".

Jennifer Gardy, British Columbia Centre for Disease Control, "The field guide to pathogens: using next generation sequencing to improve public health microbiology".

Zamin Iqbal, University of Oxford, "High-throughput population genomics using de novo variant assembly: examples from pathogen and human genomics"

The meeting will also have talks from Chris Quince (of PyroNoise/AmpliconNoise fame), Mike Quail (WTSI), Matt Berriman (WTSI), Ed Feil (Bath) .. and even myself! :)

Should be a fun couple of days, and of course this meeting is renowned for spilling out into Nottingham's city centre which features a number of excellent hostelries, including "England's oldest pub", Ye Olde Trip to Jerusalem. Neil Hall, the CEO of SHTseq, has said the first round is on him!

The Twitter hashtag is #UKNGS2012.

Hope you can join us!