EHEC Genome Assembly

Keep track of the genomic analysis of the EHEC strains on our Github Wiki.

BGI have released 5 runs of Ion Torrent data for the German EHEC/VTEC outbreak strain. I hope it is released with no specific restrictions on use for the benefit of the entire community, but the site doesn't make that entirely clear. Thanks to the BGI for putting it up!

Shall we crowd-source some analysis? This comes at a very timely moment, as I am currently helping to organise the Applied Bioinformatics & Public Health conference in Hinxton (#ABPH11), where we are discussing the use of whole-genome sequencing in epidemiology. The problem is I don't have much time to dig into the data.

But I've put up a first-pass de novo assembly, made with MIRA (3.2.1.17_dev), here. It has 3,057 contigs, 5,491,032 total bases and an N50 of 3,675. If you want the alignment files etc., get the big file here (282Mb).

Parameters are: mira --job=denovo,genome,accurate,iontor -GE:not=1
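
If you want to check summary figures like the contig count, total bases and N50 quoted above against your own assembly, something along these lines should do it. This is only a minimal sketch in Python, not part of MIRA or any pipeline mentioned here. The function names are my own, and it assumes a plain FASTA file of contigs.

```python
import sys


def read_fasta_lengths(lines):
    """Yield the length of each sequence found in FASTA-formatted lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Return the N50: contigs of this length or longer hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # no contigs at all


def assembly_stats(lines):
    """Summarise an assembly as contig count, total bases and N50."""
    lengths = list(read_fasta_lengths(lines))
    return {"contigs": len(lengths), "total_bases": sum(lengths), "n50": n50(lengths)}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python assembly_stats.py contigs.fasta
    with open(sys.argv[1]) as handle:
        print(assembly_stats(handle))
```

Different tools define N50 in slightly different ways (for example by applying a minimum contig length first), so small disagreements with an assembler's own report are to be expected.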

Update 3/6/11 09:15 GMT+1

Marina Manrique has run the assembly through their BG7 bacterial genome annotation pipeline, results are here.

Torsten Seemann and Simon Gladman from the Victorian Bioinformatics Consortium have sent me the results of their in-house annotation pipeline. Results are available: contigs reordered according to E. coli EAEC 55989 and TWEC.

NCBI have also posted a preliminary assembly (of a different isolate - LB226692) - although it is not a true de novo assembly. The approach is a bit different. "Reads were mapped with TMAP against the publicly available E. coli 55989 chromosome (CU928145.2) and the derived consensus was split into contigs at zero-coverage regions. These contigs were used as a 'backbone' for mapping of reads, followed by de novo assembly of unmapped reads with the MIRA assembler (v 3.2.1). A small number of de novo and consensus contigs were merged using CAP3."
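
To make the "split into contigs at zero-coverage regions" step in that description concrete, here is a rough Python sketch. It is not NCBI's code. It assumes you already have a consensus string and a per-base read depth list of the same length (for example, computed from a read mapping); the function name and the min_length parameter are my own inventions. The TMAP mapping, MIRA and CAP3 steps are not shown.

```python
def split_at_zero_coverage(consensus, coverage, min_length=1):
    """Split a consensus sequence into contigs wherever read depth drops to zero.

    Returns a list of (start, sequence) tuples, with 0-based start positions
    on the reference-derived consensus.
    """
    if len(consensus) != len(coverage):
        raise ValueError("consensus and coverage must be the same length")

    contigs = []
    start = None  # start of the covered block currently being extended
    for position, depth in enumerate(coverage):
        if depth > 0:
            if start is None:
                start = position
        elif start is not None:
            # Coverage has dropped to zero: close the current block.
            if position - start >= min_length:
                contigs.append((start, consensus[start:position]))
            start = None

    # Close a block that runs to the end of the consensus.
    if start is not None and len(consensus) - start >= min_length:
        contigs.append((start, consensus[start:]))
    return contigs


if __name__ == "__main__":
    toy_consensus = "ACGTACGTAC"
    toy_depth = [3, 2, 0, 0, 5, 5, 1, 0, 4, 4]
    for start, sequence in split_at_zero_coverage(toy_consensus, toy_depth):
        print(start, sequence)
```

The resulting pieces are what the quoted description calls the "backbone" contigs. Anything in the outbreak strain that is absent from the 55989 reference will not appear in them, which is why the unmapped reads are assembled de novo separately.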

Update 3/6/11 16:50 GMT+1

There are now two O104 isolates sequenced from this outbreak. The first, named TY2482, was sequenced by BGI in collaboration with University Medical Centre Hamburg-Eppendorf. The second, called LB226692, was sequenced in-house by Life Tech in collaboration with the University of Muenster. So opportunities for comparison exist now.

In summary: TY2482 assembly (BGI reads, my assembly), LB226692 assembly (Life Tech reads, assembly).

Mike the Mad Biologist has looked at the TY2482 assembly and concludes it is ST678 (or closely related) which agrees with the original molecular typing release from the Robert Koch Institute.

I've heard from another group they are planning on sequencing another isolate. I am going to try and find a place where the latest information can be collated to aid in further crowd-sourcing analysis.

Update 3/6/11 19:50 GMT+1

BGI have just released two more 314 chips' worth of data and their own assembly of TY2482. I don't have any details on the program used or its parameters just yet, but I've enquired.

Who will take on the challenge of building a whole-genome phylogeny?

Update 4/6/11 16:15 GMT+1

A few notable updates.

Kat Holt has picked up the gauntlet of doing some whole-genome SNP comparisons of the strains. Results here.

David Studholme has looked for strain-specific genes in TY2482 and found some, including a class A beta-lactamase.

BGI have published some more analysis of the genomes and have suggested people use their assembly for further comparison. However, I still don't have any details on how that assembly was done (I have asked), which seems important.

Some more useful discussion from Phylogeo about the novelty of this strain. I think the consensus is now that this strain has been seen and subsequently typed in the past (hence ST678 - not a new sequence type), but before now we did not have a genome sequence for this particular strain. More discussion over at Aetiology.

Marina Manrique has set up a Github repository and Wiki for this EHEC crowd-sourcing project. I am going to have a play around with this and hopefully we can start keeping all our crowd-sourced data here in a logical format.

Some RAST annotations are available, see the comments thread.

Update 6/6/11 10:54 GMT+1

Keep track of the genomic analysis of the EHEC strains on our Github Wiki.