VIB NGS Meeting 2015

Whilst following the #PAGXXIII tweets I noticed that livetweets - whilst offering intriguing nuggets of information - are really hard to interpret without context. At best this is annoying, at worst it can result in countless clarifications and circular discussions.

I suggested a potentially different way of doing livetweeting which is to use a Google Doc or Etherpad to scribble notes, and then to tweet links to it with salient quotes.

Anyway, I thought I would give it a try for the conference I am attending and speaking at - the VIB NGS 2015 conference in Leuven, Belgium.

First off, it is trivially easy to set up an Etherpad (this one hosted by the folk at mozilla.org) and embed it in this permalink. That way when the conference is over I will simply cut-and-paste the text and turn it into a permanent record, but for now my updates are live and subject to change. And if you are at the conference please feel free to add your own notes or clarifications; your text will come up in a different colour.

REVOLUTIONIZING NEXT-GENERATION SEQUENCING: TOOLS AND TECHNOLOGIES
VIB, Leuven, Belgium 15-16 January 2015
http://www.vibconferences.be/event/revolutionizing-next-generation-sequencing-tools-and-technologies

David Jaffe, Broad Institute, Personal Genome Assembly 

Traditional resequencing can miss loci unique to an individual. Define a “personal genome assembly” for $10k/genome. Process: 0.5 microgram DNA input, TruSeq PCR-free, one rapid run flow cell on HiSeq 2500. Output: 60x coverage. Variants detected with DISCOVAR (Nature Genetics). DISCOVAR contig N50 ~126kb (several times better than traditional de novo assemblers). Validation dataset produced using 100 Fosmids (4Mb) to finished quality (sequenced by Illumina and PacBio). Thread haploid sequence through assembly graph. If a path is found then assembly is good. Noticed errors rare: ~50kb between them, usually from long homopolymers.
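The threading check can be caricatured in a few lines. This is my own toy simplification, not DISCOVAR code: a finished sequence "threads through" if every one of its k-mers is present in the assembly's k-mer set (real graph threading additionally requires those k-mers to be connected as a path).

```python
def threads_through(finished_seq, k, graph_kmers):
    """Toy check: does every k-mer of a finished-quality sequence
    appear in the assembly's k-mer set? (Real threading also needs
    the k-mers to form a connected path through the graph.)"""
    needed = {finished_seq[i:i + k] for i in range(len(finished_seq) - k + 1)}
    return needed <= graph_kmers

# a 'graph' built from a toy assembled contig
contig = "ACGTACGGT"
graph = {contig[i:i + 4] for i in range(len(contig) - 3)}
```

If a validation sequence fails this check, the assembly is missing (or has mangled) part of that locus - which is the spirit of the Fosmid comparison described above.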

Demonstrated visualising assembly graphs to find variation, e.g. nebulin - a 10kb x 3 copy repeat with evidence of variation between paralogous copies. Can’t call any SNPs there with traditional alignment methods. Looking at graph complexity within tumours. Powerful method for detecting mutations that would previously be missed, although haplotyping is still difficult. Plan to use this method to look at various experiments: trio studies, cancer, differences between tissue types, differences between cells and cell lines, and ideally a world reference population graph.

Thinking about future methods to improve contiguity: mate-pairs, genome maps (BioNano genomics: read positions of 7-mer nick sites across several hundred kb molecules). Aiming for $20K reference quality human genomes with HiSeq data plus one of these new datasets.
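A genome map of the kind described is essentially the list of motif positions along each long molecule. A minimal sketch of that signal - the 7-mer here is GCTCTTC, Nt.BspQI's recognition site (the nicking enzyme I believe BioNano use, though the talk didn't name one):

```python
def nick_site_positions(molecule, motif="GCTCTTC"):
    """Start positions of a nicking-enzyme recognition motif along a
    molecule - the positional 'barcode' an optical genome map records."""
    m = len(motif)
    return [i for i in range(len(molecule) - m + 1)
            if molecule[i:i + m] == motif]
```

Matching those position patterns between molecules and an assembly is what lets the maps order and orient contigs over hundreds of kb.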

A few thoughts: Why not just use PacBio or Nanopore for long reads? Could DISCOVAR be used for bacterial genomes or metagenomes? Is the graph visualiser easy to play with? The visualisations shown are nice. It’s curious to talk about $10k and $20k human genomes from the Broad when we are fed a diet of the $1000 genome from the HiSeq X Ten, but that is an apples and oranges comparison.

Further reading:

Comprehensive variation discovery in single human genomes
http://www.nature.com/ng/journal/v46/n12/full/ng.3121.html

DISCOVAR online demo
http://broad.io/disco-demo

DISCOVAR blog
http://www.broadinstitute.org/software/discovar/blog/

Max Van Min, Cergentis

Targeted Locus Amplification – one primer pair (2x20bp) is required to enrich a region. Uses physical proximity as the basis for selection. Cross-link DNA. Loci in genes are in close physical proximity compared with the rest of the genome. Inverse primers generate 2kb amplicons (not sure why 2kb, because >2kb can be picked out). Can use paired-end sequencing for haplotyping (or long-read sequencing, as it is compatible with any NGS workflow). Demonstrated ability to pick up inter-chromosomal gene fusions, e.g. in cancers. One primer per direction (?). Can multiplex lots of sequences. The more you sequence, the more coverage of the genome you will get. Coverage ~50-60kb per primer pair. Total time to do the protocol is about 2 days. Whole-cell product available to buy; DNA & FFPE protocols in testing. Input requirement 10 micrograms, not sure how much you get out for sequencing.

A few thoughts: This seems potentially really useful for a number of microbial applications, e.g. pulling down plasmids or other regions of interest from a very mixed sample. Judicious use of multiplexed primers might allow for nearly whole-genome recovery.

Further reading:

Cergentis website
http://www.cergentis.com/tla-technology

Targeted sequencing by proximity ligation for comprehensive variant detection and local haplotyping
http://www.nature.com/nbt/journal/v32/n10/full/nbt.2959.html


Evan Eichler, University of Washington

Human genome has many duplications that are absent from the reference sequence.  Duplications are often unique to individuals. Segmental duplications are often missing or misassembled, particularly in those from short-read assemblers (e.g. YH). Structural variation largely measured indirectly with short reads (e.g. by read depth analysis, pair analysis, split read analysis). 

Enter long-read sequencing on PacBio. Generated 45x sequence coverage of CHM1 (a hydatidiform mole - haploid genome). P5/C3 chemistry, ~15% error. Did read-based detection of structural variants (Chaisson 2014 Nature). Closed 40 gaps and extended 50 more, adding 1.1Mbp. Resulted in 20 additional exons in 12 gene models. 92% of insertions and 60% of deletions are novel c.f. the 1000 Genomes Project. 

About 0.4% of human euchromatin still can’t be assembled with PacBio shotgun WGS. Used 32 BACs for PacBio sequencing and assembly, which added 416kbp of missing reference sequence and also eliminated 856kb of sequence. Falcon and MHAP assembly N50 is ~5 Mbp. Cost is $50-80,000 for a PacBio human reference genome. 
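N50 gets quoted a couple of times in these notes; for the record, it is the contig length at which contigs that size or larger contain at least half the total assembly. A quick sketch:

```python
def n50(contig_lengths):
    """Smallest contig length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

So an assembly of contigs [10, 20, 30, 40] kb has an N50 of 30kb: the 40kb and 30kb contigs together make up more than half the 100kb total.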

Shotgun sequence assembly and recent segmental duplications within the human genome
http://www.nature.com/nature/journal/v431/n7011/abs/nature03062.html

Genome structural variation discovery and genotyping
http://www.nature.com/nrg/journal/v12/n5/full/nrg2958.html

Resolving the complexity of the human genome using single-molecule sequencing
http://www.nature.com/nature/journal/vaop/ncurrent/full/nature13907.html
http://eichlerlab.gs.washington.edu/publications/chm1-structural-variation/


Paul Schaffer, Roche Diagnostics

I was curious to see whether Roche would finally announce any products after a recent buying spree of interesting genomics companies, coupled with a large marketing agreement with Pacific Biosciences for clinical diagnostic assays. Roche now own Bina, ABvitro, Stratos, Genia, KAPA, Foundation Medicine etc. Sadly, in a bid to “underpromise and overdeliver”, Paul offered “no details of timescales, specifications or features for new platforms”. A leading question from me about the possibility of a benchtop PacBio did not reveal any new information. I do wonder about the marketing strategy of giving such talks with no actual technical information in them, especially given Roche’s still-raw humiliation over their handling of the once-great 454 platform. Also notable for Americanisms such as “extremely laser focused” and “from soup to nuts” (http://en.wikipedia.org/wiki/Soup_to_nuts).


Mark Akeson, UC Santa Cruz Genomics Institute

Mark Akeson also identifies as a MinION ‘fanboy’! Good on him. Starts with an introduction to nanopore sequencing. David Deamer and Dan Branton are credited with the initial idea. Deamer used to challenge physicists who claimed nanopore sequencing wouldn’t work. They said: “Impossible? No! It’s just too hard!”. 1) Got to get DNA through a 2nm hole. 2) Need 5 angstrom control of nucleotides. 3) Requires an exquisitely sensitive sensor. 

It works because there are ~10^20 particles per mm^2 per second – like a lightning bolt. Amplify 10^6-10^7 ions per nucleotide (not sure I follow exactly here).

Several problems to solve:
1) Capture and translocate DNA: Kasianowicz & Deamer proved that DNA would translocate through a pore by assessing movement between two compartments. 
2) Need the right size pore so only a few bases are in contact with the pore, otherwise the signal cannot be deconvoluted: Bayley figured out a way with alpha-haemolysin and Gundlach with MspA. 
3) Rate control: initially used Klenow fragment but still too fast, then Akeson discovered Phi29 polymerase keeps a ‘dwell time’ of 12.5 sec.

Partnership between Akeson and Gundlach labs to test Phi29 DNA pol & mspA. Sequence CAT motif in the ‘CAT’ experiment.

Characterization of individual polynucleotide molecules using a membrane channel
http://www.pnas.org/content/93/24/13770.full

Enhanced translocation of single DNA molecules through α-hemolysin nanopores by manipulation of internal charge
http://www.pnas.org/content/105/50/19720

Nanopore DNA sequencing with MspA
http://www.pnas.org/content/107/37/16060.full

Automated Forward and Reverse Ratcheting of DNA in a Nanopore at Five Angstrom Precision
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3408072/

Now onto Oxford Nanopore history. AGBT 2012 quote: “Without data, how do we know this is not cold fusion?” That gem from Jonathan Rothberg (where is he now?). The MinION weighs 90 grams. It functions as a multiplexed axopatch – each amplifier can assay one of 4 wells at a time (so actually 2000 pores per MinION, but only 500 addressed simultaneously).

Another challenge: typical bilayers rupture at low voltage and are notoriously hard to reproduce. Relates to Intel’s problem manufacturing chips in the early 70s: on some days they just couldn’t make chips that worked, which they eventually tracked down to chemicals from crop dusters dusting the apricot harvest in Silicon Valley.

Solved problem of membrane instability via milling complex ‘wells’ using photolithography and a triblock polymer to form film. Resilient to shipping via FedEx.

Another achievement, making 2D reads. Hairpin and motor permits two direction reads – high-quality reads. 

Shows a single 48kb long read at 87% identity and 90% coverage of the phage reference genome – reads this long are rare, more usually around 10kb.

(http://figshare.com/articles/UCSC_Full_Length_Lambda_2D_Read/1209636 – thanks Lex!)

You might be confused reading the literature about accuracy, as frequent changes in chemistry mean publications differ. Akeson’s results: 66% accuracy with R6, 70% with R7, 85% with R7.3. Now focusing on R7.3 – they get 184-450Mb of bases per run, of which 17-55Mb are full 2D reads (i.e. ~10% high-quality 2D reads).

Showed that different aligners show different performance indel/mismatch wise (e.g. BLASR – high indel rate, LAST high mismatch rate). Devised EM algorithm to tune alignments and unify results. 99% of 2D reads map to reference.

Multimers of a particular nucleotide are hard to sequence. The confusion matrix shows that some bases are more commonly missed than others: A <-> T confusion is uncommon, G <-> C more common.
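Tallying a confusion matrix like the one shown is straightforward once you have aligned (reference, called) base pairs; a toy sketch (function name my own):

```python
from collections import Counter

def base_confusion(aligned_pairs):
    """Tally a 4x4 confusion matrix from (reference_base, called_base)
    pairs taken from an alignment; off-diagonal cells are substitutions,
    diagonal cells are correctly called bases."""
    counts = Counter(aligned_pairs)  # Counter returns 0 for absent pairs
    bases = "ACGT"
    return {r: {c: counts[(r, c)] for c in bases} for r in bases}

# toy aligned pairs: two G->C substitutions, one A->T, rest matches
matrix = base_confusion([("G", "C"), ("G", "C"), ("A", "T"),
                         ("A", "A"), ("T", "T"), ("C", "C")])
```

Scanning the off-diagonal counts is exactly how you would spot that some substitution classes (like G <-> C) dominate.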

Very long reads (36-42kb) covered an unassemblable, highly repetitive gap in an X-chromosome region. Shorter reads (10kb) suggest 8 CT47 gene copies.

Now planning on modelling modified bases. 5 known base modifications, sometimes differ by just 2 hydrogen atoms. Evidence that discrimination between modified and unmodified bases is possible.

Error rates for nanopore discrimination among cytosine, methylcytosine, and hydroxymethylcytosine along individual DNA strands
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3839712/

Working on protein sequencing – pulled a 700 amino acid protein (S2-GT) through a nanopore. Pull the protein through and see features (not single amino acids).

Unfoldase-mediated protein translocation through an α-hemolysin nanopore
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3772521/

Q: What limits read length? Entropy: DNA is a ball at that length, so it is hard to capture it into the pore.
Q: Is there a lower limit on read length? Can do 150bp.
Q: What do we do with the non-2D reads? You can still use them, but work is ongoing to increase the relative yield of 2D reads so you don’t need to.



Balti and Bioinformatics On Air: 21st January 2015

The plan this year for the triumphant Balti and Bioinformatics series is to alternate between virtual, “on-air” meetings (where sadly you will need to provide your own balti curry) and real-life ones, which will mainly be held in Birmingham but may be in other places in England or Wales. Ideally I plan to run 6 meetings a year.

So … to kick us off:

Balti and Bioinformatics On-Air

This meeting’s theme is open data and reproducible bioinformatics.

Wednesday 21st January, 4pm GMT (=11am EST, =8am PST, 00:00 China)

20 minute talks each (interactive Q&A through Google Hangouts enabled)

Draft schedule:

+0m C. Titus Brown, UC Davis: Self-interest: can it be a strategy for convincing scientists to share pre-publication data in a useful way?

+30m Scott Edmunds, GigaScience: New models for open data publishing

+50m Jane Landolin, Pacific Biosciences: Open Pacific Biosciences data for model organisms

+70m Michael Barton, JGI: nucleotid.es for de novo assembly benchmarking and Docker

+90m Nick Loman, U. Birmingham: Nanopore data updates and the “poreathon”.

+100m Dave Lunt, University of Hull: ReproPhylo - Reproducible phylogenetics.

+110m Discussion (are we on the right track? Challenges? Containers and VMs - beneficial or the wrong direction?)

What to do with a problem like .. Omicsmaps?

The start of a New Year always leaves me wanting to get organised, to fix broken things, to perhaps ditch some projects that have started to smell – well, a bit whiffy – and to generally get better control of my life. Whilst I realised some time ago that new year’s resolutions are a self-defeating waste of time, I do like to channel this positive energy while it’s there.

Which brings me to http://omicsmaps.com

This project, which aimed to document the deployment of next-generation sequencers out there in the wide world, was started by my chum James Hadfield over at the Cambridge Cancer Centre back in the mists of time. I cannot actually remember when that was, but it was in the days when the delivery of a new 454 GS FLX was still a cause for celebration. I got involved by offering to build a ‘proper’ website, and have ended up hosting and developing it - on and off - ever since.

But now the project is tiring me out. Not really the actual doing of anything - more the weight of it hanging over me - the acknowledgement that I haven’t done any work on it for a while – I haven’t even done any moderating of new entries for many months (I had to introduce moderation because the site was attracting armies of spambots, posting fake entries). And to add insult to injury, the site had actually been down over the Xmas break due to Amazon rebooting the server automatically.

So I need to do something about it urgently. Firstly, the site is costing me (indirectly) $100/month due to an utterly unmanaged Amazon EC2 account that I can’t seem to get control of. The main server I use just serves omicsmaps.com and a couple of WordPress blogs. Nowadays you can get WordPress hosting for about $1/month, so there’s an immediate incentive there.

I would just close the site down, but a quick Twitter poll indicates there’s still some love for it, and I do receive a slow but steady stream of interest in it. I am fairly sure people use it for finding local sequencing providers, and particularly to find the more esoteric sequencing providers.

I know for example that Dawn Field has featured it (complete with picture!) in her new book BioCode and I would hate for people to look it up and find it isn’t there any more.

It does provide a kind of historical document, and also through the accumulated statistics shows the trends in sequencing instruments.

But I also think it needs a good lick of paint, and some new sequencers (e.g. drop the 454 now, add Oxford Nanopore and the HiSeq X Ten).

I would love it if someone could take on some development of the site (if you want get in touch). But in the meantime, I think I will do the following:

  • Move the site to Heroku (this requires updating the code to the latest version of Django and porting the site to use Postgres instead of SQLite but this should be relatively minor)

  • As Heroku uses Git, also take the opportunity to open up the code on Github.

  • Add some basic Akismet anti-spam protection and then open it up as unmoderated, in order to reduce the admin burden.

Then, after doing that I am very happy to receive and incorporate any pull requests that might improve the site, with the advantage these can be automagically pushed to Heroku.

Examples of things I would like if someone did:

  • Move to Google Maps API v3

  • Add support for other technologies than sequencers, e.g. proteomics and metabolomics and imaging platforms, also perhaps for services e.g. ‘single cell genomics’

  • Remove the outer frame and make it all snazzy looking with one big map, with panels on top

Sounds like a plan?!

Update 3rd Jan 2015:

Well, on investigation it seems that most of my $100/month was going on Amazon storage and a load balancer, and that server costs are just $35/month. So it may not be worth the effort to move to Heroku.

But I’ve pushed the source and database to https://github.com/nickloman/omicsmaps now.

I’ve also disabled the moderation queue (post-Akismet check) so updates are processed automatically.

If you want to get admin rights on the site and help keep it up to date, please do get in touch.

Correction: A reference bacterial genome dataset generated on the MinION™ portable single molecule nanopore sequencer

Recently we noticed that we could not reproduce Figure 3 while analysing this data as part of a new experiment (1). This was due to a modification to the script used for extracting sequence alignment profiles from the dataset (2). On further investigation we found that an error in this script had been reported and a fix supplied by an anonymous online contributor on October 27th 2014 via Github (3). The error prevented mismatches from being properly counted in the calculation of read accuracy (insertions and deletions were). We therefore present updated versions of Figures 3 and 4 generated by the corrected script. We are grateful to the anonymous contributor who noticed this error. The manuscript text and tables are unaffected.
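To illustrate the class of bug (this is a caricature, not the actual count-errors.py code): read accuracy should count mismatches in the error tally alongside insertions and deletions.

```python
def accuracy_corrected(matches, mismatches, insertions, deletions):
    """Read accuracy over all aligned columns, counting mismatches."""
    return matches / (matches + mismatches + insertions + deletions)

def accuracy_with_bug(matches, mismatches, insertions, deletions):
    """The reported bug, in caricature: mismatches are silently dropped
    from the tally, so accuracy is overstated whenever mismatches > 0."""
    return matches / (matches + insertions + deletions)
```

The buggy form always reports an accuracy at least as high as the corrected one, which is why the published figures needed updating.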

We would also like to correct the spelling of Minh Duc Cao in the Acknowledgements section.

We apologise for any inconvenience caused by this error.

1. Quick, J., Quinlan, A.R., Loman, N.J.: A reference bacterial genome dataset generated on the MinION™ portable single-molecule nanopore sequencer. GigaScience 3, 22 (2014)
2. Quinlan, A.R.: https://github.com/arq5x/nanopore-scripts/blob/master/count-errors.py. Github (2014)
3. Anon.: https://github.com/arq5x/nanopore-scripts/pull/1. Github (2014)

The formal correction including the updated figures submitted to GigaScience is available to view as a preprint.

Best methods for taxonomic assignment from shotgun metagenomics

This is turning into a frequently-asked-question on Twitter, so here is my $0.02 for the best methods to try:

  • Metaphlan2 – relies on mapping to taxon-defining genes. Works very well for well-characterised environments and species, e.g. human gut and pathogens. Search database is small so it is quite rapid. High specificity, potentially low sensitivity in poorly-characterised environments, e.g. soil. https://bitbucket.org/biobakery/metaphlan2

  • Kraken – http://ccb.jhu.edu/software/kraken/, even faster than Metaphlan2. Lowest common ancestor (LCA) approach combined with k-mer matching, hence being rapid. As it is k-mer based it may suffer from lower specificity, particularly with reference genomes containing erroneous k-mers, so I recommend filtering the results. Also see http://www.onecodex.com for a web-based, ultra-fast version.

  • DIAMOND combined with MEGAN for LCA – DIAMOND is Daniel Huson’s fast replacement for BLASTX. Other alternatives include RAPSearch2. Searching against Genbank non-redundant proteins (nr) is probably the highest sensitivity method for metagenomics assignments but it needs to be combined with an LCA approach to give appropriate specificity (if you don’t believe me, try taking E. coli and shredding it into 100-mers and BLASTXing it and see how many taxa you retrieve). http://ab.inf.uni-tuebingen.de/software/diamond/ http://ab.inf.uni-tuebingen.de/software/megan5/

  • If you are boldly exploring strange new worlds to seek out new life, a phylogenetic approach may be more suitable. These tend to be slow, and rely on the presence of one or more marker genes in your sample, but have the advantage of giving a feel for the phylogenetic position of species in your sample. If you are doing this kind of thing I recommend trying Phylosift http://phylosift.wordpress.com/ or MOCAT http://vm-lux.embl.de/~kultima/MOCAT/
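For what it's worth, the LCA idea that both Kraken and MEGAN lean on is simple: walk each hit's taxonomy lineage up to the root and report the deepest taxon shared by all hits. A minimal sketch with a made-up toy taxonomy (real tools use NCBI taxonomy IDs):

```python
def lineage(taxon, parent):
    """Path from a taxon up to the root, via child -> parent links."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path

def lowest_common_ancestor(taxa, parent):
    """Deepest taxon present in every hit's lineage."""
    paths = [lineage(t, parent) for t in taxa]
    shared = set(paths[0]).intersection(*map(set, paths[1:]))
    for taxon in paths[0]:  # paths run leaf -> root, so first shared is deepest
        if taxon in shared:
            return taxon

# toy taxonomy: child -> parent
parent = {
    "E. coli": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Salmonella": "Enterobacteriaceae",
    "Enterobacteriaceae": "Bacteria",
}
```

This is also why the shredded E. coli exercise above behaves as it does: a 100-mer hitting several related genomes equally well gets pushed up to their common ancestor rather than being assigned to any one species.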