Seqbench: A useful meta-database of sequence reads from multiple platforms

Little and often, little and often. This is my new blogging mantra.

A little project that I mentioned at UKNGS2012 is one that Lex Nederbragt and I have been kicking around for a while, called Seqbench, but I've only just found time to write about it on the blog.

It's quite a simple idea: a meta-database containing information and links to sequencing datasets with known reference sequences. It's not a sequence archive like NCBI SRA or ENA. It simply contains what we consider to be the most useful meta-data for a sequencing run, as well as direct links to the FASTQ or SFF files, where available (often via SRA/ENA).

To start with we have focused on E. coli which is probably the most sequenced bacterial species (assuming you don't count mitochondria). The two strains we have focused on are K-12 MG1655 and the O104:H4 STEC from last year's Germany outbreak. So far we have collected almost 150 runs from a huge diversity of instruments including Illumina GA2, HiSeq, MiSeq, 454 GS Junior, 454 GS FLX, 454 GS FLX+, Ion PGM (314, 316, 318 chips) and PacBio. You've also got a variety of read lengths from 100 bases up to 15kb, with insert libraries between 180 bases and 8000 bases!

We couldn't just use SRA because a) many of the datasets are not in there and b) the metadata is often incomplete and inconsistently encoded. Just getting this far has taken a fair amount of work, trawling the short-read archives, the literature and manufacturers' websites for datasets. It is not yet complete.

The data is available in a public Google Fusion Table, which permits easy export to CSV.

[iframe src="https://www.google.com/fusiontables/embedviz?viz=GVIZ&t=TABLE&containerId=gviz_canvas&q=select+col0%2C+col1%2C+col2%2C+col3%2C+col4%2C+col5%2C+col6%2C+col7%2C+col8%2C+col9%2C+col10%2C+col11%2C+col12%2C+col13%2C+col14%2C+col15%2C+col16%2C+col17%2C+col18%2C+col19+from+1-RvsucyVWJBXtvszrPKMzzA7laSwUfFxMDuzHu8" width=800 height=300 scrolling=yes]

You can even access this via a SQL interface using the Google Fusion Table API, e.g. with a command-line interface like Pfusion.
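To give a feel for working with the exported data, here is a minimal Python sketch that filters runs by platform from a CSV export. The column names and rows below are invented for illustration only; check the actual Fusion Table headers before relying on them.

```python
import csv
import io

# Hypothetical excerpt of a Seqbench CSV export -- the real column
# names and values may differ from those in the actual table.
sample_csv = """strain,platform,read_length,insert_size,url
K-12 MG1655,Illumina MiSeq,150,300,http://example.org/run1.fastq.gz
K-12 MG1655,454 GS FLX+,700,8000,http://example.org/run2.sff
O104:H4,Ion PGM 318,200,180,http://example.org/run3.fastq.gz
"""

def runs_for_platform(csv_text, platform_substring):
    """Return rows whose platform field contains the given substring."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if platform_substring in row["platform"]]

for run in runs_for_platform(sample_csv, "MiSeq"):
    print(run["strain"], run["url"])
```

The same filtering could of course be expressed as a `select ... where` query against the Fusion Table API directly, but a local CSV keeps things simple for teaching.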

The dataset is under active curation by Lex and me. When it is stable I plan to deposit it with a service that assigns a DOI to it, perhaps Figshare or GigaDB, so it can be cited easily.

What's the point of such a dataset? Well, we find it extremely helpful for training purposes. For example, Lex and I used it to give a course on de novo assembly. We got the students to assemble different datasets and combinations of datasets, then compare their results in real time via a collaboratively edited Google Doc (very cool).

I used it again to help give a course on de novo assembly, alignment and SNP calling at SBTM12 (more on this in a forthcoming post, course booklet here).

I also think it would be an awesome resource if you were building bioinformatics software or pipelines, particularly assemblers, aligners or variant callers, as the strains have "known truth" reference sequences (although they may have an error or two). You could also use it to do platform comparisons.

No doubt you could think of some other things to use it for as well!

Comments and thoughts on the dataset and its usability are welcome below.