Nanopore R9 rapid run data release

30 Jul 2016

R9 data

A long promised addition to the nanopore sequencing repertoire is the rapid sequencing kit. This kit significantly reduces the effort required to make a sequencing library - down from 2-3 hours to a few minutes. We’ve actually played with this kit several times before, once very early on in the MAP (I think using R7 chemistry as long ago as July 2014). More recently, Matt Loose and I tried it out in a hotel room before a famous genomics conference in February of this year. We can both vouch for how easy it is to use - no specialist equipment is required other than pipettes and a source of heat to neutralise the transposase after a short incubation at room temperature. The recommended starting DNA input is 500ng. In our hotel room we used a freshly brewed cup of coffee which provided the required 70 degrees.

The #agbt16 secret sequencing lounge is open ;) pic.twitter.com/iIYqNI5MUD
— Nick Loman (@pathogenomenick) February 10, 2016

However, until recently this kit was really mainly a curiosity rather than a serious proposition because it only produces so-called “1D” data. To remind you, 1D data is when only the template strand of the double-stranded molecule is read. With the 1D kit because there is no hairpin ligation the complement strand does not pass through the pore.

And for R7.3 data this was a significant drawback: sequence accuracy on the template strand is in the low 70s, accuracy-wise, which makes basic tasks like de novo assembly and variant calling computationally very difficult (although probably not impossible, and assemblers like Canu can cope, with a bit of tweaking). It also makes polishing extremely slow.

The release a few months back of the R9 chemistry has changed the game – it’s a game-changer! – and suddenly made 1D reads very usable. This is ascribed to the more discriminatory read head of the CsgG pore employed, where fewer nucleotides in the pore abrogate the flow of ions across the membrane. The spread of electrical current levels is about twice as wide as seen in R7. However it is hard to know exactly how much of the improved accuracy is caused by the pore as this coincided with the introduction of a new style of basecaller that employs ‘deep learning’ (technically a recurrent neural network) rather than the Hidden Markov Model of before. A third change is the introduction of ‘fast mode’, currently running at 250 bases / second, or four times the translocation speed employed with the R7 chemistry. Because all these changes were introduced at once, it is hard to know the relative contribution of each. However, our early access experiences with R7.3 demonstrated that ‘fast mode’ did not seem to have a significant detrimental effect on quality. In fact, the theory is it may improve handling of long homopolymeric tracts by introducing more signal into the ‘dwell’ times.

Other changes: Notably, the sequencing files now record raw current sample data (at 5kHz) by default, and the previous process of linearising the signal into ‘events’ is now performed by the cloud base caller Metrichor rather than MinKNOW on the laptop. Excitingly there are now three local basecallers available - one is built into MinKNOW 1.0.0 (the next release). There is also a separate download called nanonet (available to MAPpers). We tried out nanonet during the ZiBRA bus trip and it worked well, albeit it could not quite keep up with data generation on a standard laptop. Jared Simpson and Matei David also have an open source basecaller called nanocall.

We’ve done two runs of this protocol. The first was on a flowcell that was delivered, erroneously frozen for 36 hours at -10 degrees in our Stores, and then left at room temperature for a week or so (we’d assumed it was completely knackered). We thought we’d just try it out for fun and to our surprise it actually generated a decent yield of data, around 600mb. Data here is from a second flowcell that was correctly stored at fridge temperature.

The final new thing here is that this is a SpotON flowcell; which means the total volume loaded onto the flowcell is halved, and you in fact ‘drip, drip’ the library straight onto the flowcell surface via a small hole that is protected by a plastic clip. What difference this makes to performance is currently unknown:

Freaky new minion loading technique straight on to the flowcell - pic.twitter.com/imEIBrvNrO
— Nick Loman (@pathogenomenick) July 28, 2016

The results from the better flowcell are presented here with links to data at the bottom:

E. coli stats

stats

Type	Total Reads	Base Pairs	Mean	Median	Min	Max	N25	N50	N75
pass:template	164472	1.48Gb	9009	5944	117	131969	25244	14891	8074
fail:template	74465	467Mb	6271	3544	5	328471	21903	12033	6047

This is the highest yielding flowcell we’ve ever had, with just shy of 2Gb of base called sequence, and 1.48Gb in the pass bin. Over 99% of the reads map to the reference, meaning the goodput is equivalent to the output.

Read length

The transpososome method gives a very different size distribution to the Gaussian distribution expected with the traditional Covaris G-tube fragmentation. There are more shorter reads, but the N50 is improved to nearly 15kb (from around 8kb). The maximum length read in this dataset is 131kb and aligns completely to the reference genome at 85% identity.

Read length (greater than 50kb)

Zooming into this plot it is obvious there are plenty of super long reads - 953 of the passing reads are greater than 50kb comprising 57.5Mb of sequence.

Read length (greater than 50kb)

Gratifyingly the data gives a single contig assembly with miniasm and Canu without any custom parameterisation. We’ll pass it over to Jared to see what kind of consensus accuracy he can get out of nanopolish which now has alpha support for R9 data.

Accuracy

The 1D accuracy is a quantum leap from previous pores, with mean read accuracy at 83%.

We’ll do more analysis on this dataset and hope to write it up as a manuscript in future, but are releasing the dataset for the community to play with.

E. coli 2D kit data

We’ve also previously generated 2D data and this is available below.

Stats

668Mb of passing 2D data (template+complement) results in 244mb of 2D data.

`pass` stats

Type	Total Reads	Base Pairs	Mean	Median	Min	Max	N25	N50	N75
template	50277	328543190	6534.66	6448	9	78622	11688	9063	6665
complement	50277	340285012	6768.2	6427	5	144661	12555	9280	6732
twodirections	31858	244275647	7667.64	7603	99	64218	11754	9244	7135

ipython notebook

I have posted up the IPython notebook detailing the commands to reproduce this analysis.

Credits

Josh Quick did the laboratory work and sequencing. We are grateful to John Tyson for supplying his tuning scripts for the 1D R9 run.

Conflict of interests

I have received an honorarium to speak at an Oxford Nanopore meeting, and travel and accommodation to attend London Calling 2015 and 2016. I have ongoing research collaborations with ONT although I am not financially compensated for this and hold no stocks, shares or options. ONT have supplied free-of-charge reagents as part of the MinION Access Programme and also generously supported our infectious disease surveillance projects with reagents.

Loman Labs

Nanopore R9 rapid run data release

R9 data

E. coli stats

stats

Links

E. coli 2D kit data

Stats

`pass` stats

Links

ipython notebook

Credits

Conflict of interests

Loman Labs

Nanopore R9 rapid run data release

R9 data

E. coli stats

stats

Links

E. coli 2D kit data

Stats

pass stats

Links

ipython notebook

Credits

Conflict of interests

Related Posts

Balti and Bioinformatics: 14th November 2019 14 Oct 2019

Food in Birmingham 11 Aug 2019

How to generate consensus sequences using nanopolish 21 Dec 2018

`pass` stats