First look at Ion Torrent data: De novo assembly

10 May 2011

Well we've had our Ion Torrent for nearly 6 weeks now, but we haven't yet run it. A combination of factors have impeded our progress. After waiting for installation there have been various changes to the protocols and reagents, plus we needed to buy some more equipment to make libraries (a Bioruptor, for shearing). Our training is now scheduled for next week. We're not complaining too much, we ideally need the 316 chips which produce >100Mb to start doing the work we have pencilled in for it (genomic epidemiology of Acinetobacter).

However, I couldn't help but be intruiged when I saw that Ion Torrent posted an E. coli K-12 DH10B dataset (registration required) on their Torrent Dev community site. This is from a single run of a 314 chip and consists of 49Mb of raw sequence - way above the minimum spec (10Mb).

An accompanying Ion Torrent application note is a useful initial guide to these reads. And in fact whilst I was preparing this post Keith Robison recently posted a more detailed analysis of the quality scores.

However, I am interested in a particular application: de novo assembly, where the genome sequence is predicted without a reference sequence. This is crucial for novel gene discovery and detection of large-scale insertions and deletions, both important for the kind of bacterial genomic epidemiology work we want to use the instrument for.

Read Analysis

Firstly, as is recommended when assembling de novo, I assessed the read data for quality.

The Ion Torrent server makes the reads available in both FASTQ and SFF format, which makes life easy for feeding to various pipelines.

I ran it through the excellent PRINSEQ to get some quality reports (I could have equally used FastQC).

There are a total of 522,099 sequences and a total of 49,040,224 bases, giving a mean read length of 93 bases.

Figure 1. Read length distribution.

Figures 2a and 2b. Base qualities

Base qualities start around Q30 and then fall towards the 3' end of the read, in common with both 454 and Illumina. The tails are reportedly quite low quality (Q10 = 1/10 chance of base being wrong). But it ain't Pac Bio! :)

Figure 3. GC distribution, looks pretty good and consistent with E. coli K-12's ~50% GC.

So the initial reports look fine. I might be inclined to trim some of the low quality sequence but for now I have left all the sequences in.

De novo Assembly

Aligning these reads to a reference is fairly trivial, and Ion Torrent supply some software called TMAP, written by Nils Homer (author of BFAST) to achieve this. Early reports on the Ion Torrent community forum are that it is both fast and accurate.

However Ion Torrent do not have an in-house assembler (yet), so it is an open question which software to use, and how well the process works.

We know E. coli K-12 DH10B has a 5Mb genome, so 49Mb of data means should give approximately 10x coverage, assuming all the reads are incorporated into the assembly. This is a little on the low side - if it was 454 data I'd prefer 15x or even 20x - but it is still sufficient to make it worthwhile proceeding. Many of the original Sanger-sequenced genomes made do with less than this.

Ion Torrent have partnered with CLC Bio and DNAStar as bioinformatics partners and so CLC Genomics Workbench and SeqMan NGen already have Ion Torrent support built-in, so it makes sense to try them.

We expect Ion Torrent reads to have similar homopolymer and carry-forward errors as the 454, so it makes sense to try assemblers that work well on 454 data. So I thought would try Newbler (Roche's 454 assembler) and MIRA3 (a good option for 454 sequences). I also chucked-in Ray because it has native support for SFF files. CAP3 is an old stalwart assembler. And for giggles I thought I'd try a short-read assembler, in this case Velvet without any real expectation that it would work correctly (I used k=31).

Assembler	Version	Method	Open source	License
CAP3	10/15/07	Overlap-layout-consensus	Yes	Free to non-profit
CLC Genomics Workbench	4.6.1	Hybrid	No	Commercial
MIRA	3.2.1	Overlap-layout-consensus	Yes	Free
Newbler	2.5.3	Overlap-layout-consensus	No	Free for academic use
SeqMan NGen	3.0.4 (10)	Overlap-layout-consensus	No	Commercial
Ray	1.3.0	OpenAssembler (de Bruijn)	Yes	Free
Velvet	1.1.03	de Bruijn	Yes	Free

Table 1. Assemblers used in comparison (some data from Kumar, Blaxter)

I was gratified that getting all these packages installed was both quick and relatively painless. For the desktop programs I ran the analysis on a Samsung RF510 (i5 2.66Ghz, 8Gb RAM) running 64-bit Win7 and for the command-line programs I used a Gentoo 64-bit Linux box with 8 x 3.0Ghz Xeon processors and 32Gb of RAM. In each case I used the default parameters except with SeqMan NGen which I told to expect 10x coverage for repeat resolution. For Newbler, MIRA and Ray I gave it input as if it was 454 data (from SFF file).

Assembler	Time taken (hh:mm:ss)
CLC Genomics Workbench	00:01:33 (desktop)
SeqMan NGen	00:07:50 (desktop)
Velvet	~00:15:00 (server)
Ray	00:16:03 (server)
MIRA	00:21:00 (server)
CAP3	~01:00:00 (server)
Newbler	4 days (server)

Table 2. Run times for assemblers

I was shocked by just how quick the commercial offerings were - particularly CLC - despite running on a laptop less powerful than the server. CLC beat its advertised speed of 2 minutes, quite impressive.

An honourable mention for Newbler which took an incredible 4 days to run. I wasn't sure it was going to finish in fact. It's a little unfair as of course Newbler is not designed for Ion Torrent data and is pretty quick when you feed it 454 files for this kind of sized assembly. I assume the short reads triggered some kind of weird brute-force alignment mode. However, it did complete and produce results, so well done!

</col> </col> </col>

	Number of contigs	Min	Max	Total
CLC	899	200	34562	4473313
Newbler	1041	100	42786	4450039
MIRA 3.2.1	2766	80	17257	4527768
Ray	3222	124	7703	4373659
Seqman Ngen	3516	78	11547	4569381
CAP3	7593	42	8743	4857173

Table 2. Contig stats from assembly ordered by number of contigs (best first)

Velvet produced 1.3m contigs, so I had to exclude it from the test.

Quite a range of values there. It is worth noting that adding up the lengths of all the contigs produces values much less than 5mb. This is expected because perfect or near-perfect repeats longer than the read length often end up in a single "consensus contig". But there is quite a size range there. Note that comparing numbers of contigs is not strictly like-for-like because each assembler uses a different minimum length cut-off.

</col> </col> </col> </col> </col>

	Avg	N50	N75	N90
CLC	4975	8629	4560	2455
Newbler	5507	8207	4615	2575
MIRA	1636	2900	1613	797
Seqman Ngen	1299	2016	1160	619
Ray	1357	1884	1164	697
CAP3	639	973	559	310

Table 3. Contig stats from assembly ordered by N50 (best first)

N50 is a much better guide to comparing "contiguity", and CLC Genomics Workbench and Newbler give the highest values here. MIRA, Ray and Seqman Ngen are in the middle of the pack. CAP3 didn't do too well at all which is not a great surprise as it doesn't use the quality values, plus its default parameters may not be suitable for these reads.

Before we go any further - this isn't intended as an assembly competition and I am not recommending or dissing any of these assemblers. This is just a very basic experiment to give me a feel for how well de novo assembly performs on these data. But I thought it interesting enough to share with you guys.

So - the results themselves. They aren't great - if you compare them with a 454 Titanium run or an Illumina paired-end run - but they are pretty good if you consider this is the first dataset available from the Ion Torrent, running with the lowest specification chip. In pure sequencing costs, it's a $500 bacterial whole-genome and the sequencing itself only takes 2 hours. So it's quite exciting.

And we must remember much of the software is not yet properly optimised for these data.

The assemblies are certainly good enough to predict and detect most of the genes in K-12. What I have not assessed yet is assembly quality, so the assemblers with the longest N50s may or may not be the best choice if you need accuracy (and you probably do).

I intend to take a close look at the extent of misassemblies, erroneous consensus base calls and issues with homopolymeric tracts in a future blog post.

So for now, the take home messages are:

bacterial de novo assembly is possible with Ion Torrent data, even with data from a single 314 chip
CLC Genomics Workbench is the fastest assembler and produces some of the longest contigs with default settings (but costs $5000 for a license)
Newbler also produces long contigs but takes days to run for some reason
MIRA and Ray are both promising open-source options

Your comments are always appreciated! And do drop me a line if you are at ASM this year as I will be presenting on Sunday's "Genome Epidemiology" session.

Loman Labs

First look at Ion Torrent data: De novo assembly

Read Analysis

De novo Assembly

Related Posts

Balti and Bioinformatics: 14th November 2019 14 Oct 2019

Food in Birmingham 11 Aug 2019

How to generate consensus sequences using nanopolish 21 Dec 2018