First look at Ion Torrent data: De novo assembly

Well we've had our Ion Torrent for nearly 6 weeks now, but we haven't yet run it. A combination of factors have impeded our progress. After waiting for installation there have been various changes to the protocols and reagents, plus we needed to buy some more equipment to make libraries (a Bioruptor, for shearing). Our training is now scheduled for next week. We're not complaining too much, we ideally need the 316 chips which produce >100Mb to start doing the work we have pencilled in for it (genomic epidemiology of Acinetobacter).

However, I couldn't help but be intruiged when I saw that Ion Torrent posted an E. coli K-12 DH10B dataset (registration required) on their Torrent Dev community site. This is from a single run of a 314 chip and consists of 49Mb of raw sequence - way above the minimum spec (10Mb).

An accompanying Ion Torrent application note is a useful initial guide to these reads. And in fact whilst I was preparing this post Keith Robison recently posted a more detailed analysis of the quality scores.

However, I am interested in a particular application: de novo assembly, where the genome sequence is predicted without a reference sequence. This is crucial for novel gene discovery and detection of large-scale insertions and deletions, both important for the kind of bacterial genomic epidemiology work we want to use the instrument for.

Read Analysis

Firstly, as is recommended when assembling de novo, I assessed the read data for quality.

The Ion Torrent server makes the reads available in both FASTQ and SFF format, which makes life easy for feeding to various pipelines.

I ran it through the excellent PRINSEQ to get some quality reports (I could have equally used FastQC).

There are a total of 522,099 sequences and a total of 49,040,224 bases, giving a mean read length of 93 bases.

Figure 1. Read length distribution.

Figures 2a and 2b. Base qualities

Base qualities start around Q30 and then fall towards the 3' end of the read, in common with both 454 and Illumina. The tails are reportedly quite low quality (Q10 = 1/10 chance of base being wrong). But it ain't Pac Bio! :)

Figure 3. GC distribution, looks pretty good and consistent with E. coli K-12's ~50% GC.

So the initial reports look fine. I might be inclined to trim some of the low quality sequence but for now I have left all the sequences in.

De novo Assembly

Aligning these reads to a reference is fairly trivial, and Ion Torrent supply some software called TMAP, written by Nils Homer (author of BFAST) to achieve this. Early reports on the Ion Torrent community forum are that it is both fast and accurate.

However Ion Torrent do not have an in-house assembler (yet), so it is an open question which software to use, and how well the process works.

We know E. coli K-12 DH10B has a 5Mb genome, so 49Mb of data means should give approximately 10x coverage, assuming all the reads are incorporated into the assembly. This is a little on the low side - if it was 454 data I'd prefer 15x or even 20x - but it is still sufficient to make it worthwhile proceeding. Many of the original Sanger-sequenced genomes made do with less than this.

Ion Torrent have partnered with CLC Bio and DNAStar as bioinformatics partners and so CLC Genomics Workbench and SeqMan NGen already have Ion Torrent support built-in, so it makes sense to try them.

We expect Ion Torrent reads to have similar homopolymer and carry-forward errors as the 454, so it makes sense to try assemblers that work well on 454 data. So I thought would try Newbler (Roche's 454 assembler) and MIRA3 (a good option for 454 sequences). I also chucked-in Ray because it has native support for SFF files. CAP3 is an old stalwart assembler. And for giggles I thought I'd try a short-read assembler, in this case Velvet without any real expectation that it would work correctly (I used k=31).

Assembler Version Method Open source License
CAP3 10/15/07 Overlap-layout-consensus Yes Free to non-profit
CLC Genomics Workbench 4.6.1 Hybrid No Commercial
MIRA 3.2.1 Overlap-layout-consensus Yes Free
Newbler 2.5.3 Overlap-layout-consensus No Free for academic use
SeqMan NGen 3.0.4 (10) Overlap-layout-consensus No Commercial
Ray 1.3.0 OpenAssembler (de Bruijn) Yes Free
Velvet 1.1.03 de Bruijn Yes Free

Table 1. Assemblers used in comparison (some data from Kumar, Blaxter)

I was gratified that getting all these packages installed was both quick and relatively painless. For the desktop programs I ran the analysis on a Samsung RF510 (i5 2.66Ghz, 8Gb RAM) running 64-bit Win7 and for the command-line programs I used a Gentoo 64-bit Linux box with 8 x 3.0Ghz Xeon processors and 32Gb of RAM. In each case I used the default parameters except with SeqMan NGen which I told to expect 10x coverage for repeat resolution. For Newbler, MIRA and Ray I gave it input as if it was 454 data (from SFF file).

Assembler Time taken (hh:mm:ss)
CLC Genomics Workbench 00:01:33 (desktop)
SeqMan NGen 00:07:50 (desktop)
Velvet ~00:15:00 (server)
Ray 00:16:03 (server)
MIRA 00:21:00 (server)
CAP3 ~01:00:00 (server)
Newbler 4 days (server)

Table 2. Run times for assemblers

I was shocked by just how quick the commercial offerings were - particularly CLC - despite running on a laptop less powerful than the server. CLC beat its advertised speed of 2 minutes, quite impressive.

An honourable mention for Newbler which took an incredible 4 days to run. I wasn't sure it was going to finish in fact. It's a little unfair as of course Newbler is not designed for Ion Torrent data and is pretty quick when you feed it 454 files for this kind of sized assembly. I assume the short reads triggered some kind of weird brute-force alignment mode. However, it did complete and produce results, so well done!

</col> </col> </col>
Number of contigs Min Max Total
CLC 899 200 34562 4473313
Newbler 1041 100 42786 4450039
MIRA 3.2.1 2766 80 17257 4527768
Ray 3222 124 7703 4373659
Seqman Ngen 3516 78 11547 4569381
CAP3 7593 42 8743 4857173

Table 2. Contig stats from assembly ordered by number of contigs (best first)

Velvet produced 1.3m contigs, so I had to exclude it from the test.

Quite a range of values there. It is worth noting that adding up the lengths of all the contigs produces values much less than 5mb. This is expected because perfect or near-perfect repeats longer than the read length often end up in a single "consensus contig". But there is quite a size range there. Note that comparing numbers of contigs is not strictly like-for-like because each assembler uses a different minimum length cut-off.

</col> </col> </col> </col> </col>
Avg N50 N75 N90
CLC 4975 8629 4560 2455
Newbler 5507 8207 4615 2575
MIRA 1636 2900 1613 797
Seqman Ngen 1299 2016 1160 619
Ray 1357 1884 1164 697
CAP3 639 973 559 310

Table 3. Contig stats from assembly ordered by N50 (best first)

N50 is a much better guide to comparing "contiguity", and CLC Genomics Workbench and Newbler give the highest values here. MIRA, Ray and Seqman Ngen are in the middle of the pack. CAP3 didn't do too well at all which is not a great surprise as it doesn't use the quality values, plus its default parameters may not be suitable for these reads.

Before we go any further - this isn't intended as an assembly competition and I am not recommending or dissing any of these assemblers. This is just a very basic experiment to give me a feel for how well de novo assembly performs on these data. But I thought it interesting enough to share with you guys.

So - the results themselves. They aren't great - if you compare them with a 454 Titanium run or an Illumina paired-end run - but they are pretty good if you consider this is the first dataset available from the Ion Torrent, running with the lowest specification chip. In pure sequencing costs, it's a $500 bacterial whole-genome and the sequencing itself only takes 2 hours. So it's quite exciting.

And we must remember much of the software is not yet properly optimised for these data.

The assemblies are certainly good enough to predict and detect most of the genes in K-12. What I have not assessed yet is assembly quality, so the assemblers with the longest N50s may or may not be the best choice if you need accuracy (and you probably do).

I intend to take a close look at the extent of misassemblies, erroneous consensus base calls and issues with homopolymeric tracts in a future blog post.

So for now, the take home messages are:

Your comments are always appreciated!  And do drop me a line if you are at ASM this year as I will be presenting on Sunday's "Genome Epidemiology" session.