Introducing Hyde

Hyde is a brazen two-column Jekyll theme that pairs a prominent sidebar with uncomplicated content. It’s based on Poole, the Jekyll butler.

Built on Poole

Poole is the Jekyll Butler, serving as an upstanding and effective foundation for Jekyll themes by @mdo. Poole, and every theme built on it (like Hyde here), includes the following:

  • Complete Jekyll setup included (layouts, config, 404, RSS feed, posts, and example page)
  • Mobile friendly design and development
  • Easily scalable text and component sizing with rem units in the CSS
  • Support for a wide gamut of HTML elements
  • Related posts (time-based, because Jekyll) below each post
  • Syntax highlighting, courtesy of Pygments (the Python-based code snippet highlighter)

Hyde features

In addition to the features of Poole, Hyde adds the following:

  • Sidebar includes support for textual modules and a dynamically generated navigation with active link support
  • Two orientations for content and sidebar, default (left sidebar) and reverse (right sidebar), available via <body> classes
  • Eight optional color schemes, available via <body> classes

Head to the readme to learn more.

Browser support

Hyde is by preference a forward-thinking project. In addition to the latest versions of Chrome, Safari (mobile and desktop), and Firefox, it is only compatible with Internet Explorer 9 and above.

Download

Hyde is developed on and hosted with GitHub. Head to the GitHub repository for downloads, bug reports, and feature requests.

Thanks!

The biggest genome sequencing projects: the uber-list!

I am just writing a short presentation for a meeting in Hinxton. I wanted to demonstrate the profound effect that whole-genome sequencing is having on the study of biology, and the size and scope of recent studies.

So I thought it would be fun to catalogue the largest - in terms of samples - genome projects that have been published so far.

A few things are notable here. As expected, many of the biggest studies in terms of numbers are bacterial, enabled partly by their smaller genome size.

Update: My attention has just been drawn to a study of 2,007 C. elegans genomes!

I found it interesting that all the bacterial studies listed hail from the UK; we are clearly blazing a trail in this field of study!

A PhD for sequencing a gene? A single genome? A hundred genomes? How about a thousand genomes? A million?

So, what's coming up that could potentially knock these studies off their perch?

Did I miss a study? Please drop a comment below.

Rules for inclusion:

  • whole-genome sequencing >10X average per sample (no exome, target capture)
  • at least one library per sample (e.g. no pooled species, quasispecies)
  • not a meta-analysis, fresh data for the paper
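
The inclusion rules above can be read as a simple filter followed by a sort on sample count. Here is a minimal sketch of that idea; the `Study` fields, the function names and the example studies are all hypothetical and made up purely for illustration, not taken from any of the papers in the list.

```python
from dataclasses import dataclass


@dataclass
class Study:
    # All field names are illustrative assumptions, not an established schema.
    name: str
    samples: int
    mean_coverage: float      # average sequencing depth per sample, in X
    whole_genome: bool        # False for exome or target capture
    library_per_sample: bool  # False for pooled species or quasispecies
    meta_analysis: bool       # True if the paper only reuses published data


def qualifies(study: Study) -> bool:
    """Apply the three inclusion rules to a single study."""
    return (
        study.whole_genome
        and study.mean_coverage > 10
        and study.library_per_sample
        and not study.meta_analysis
    )


def uber_list(studies):
    """Keep qualifying studies and rank them by number of samples."""
    kept = [s for s in studies if qualifies(s)]
    return sorted(kept, key=lambda s: s.samples, reverse=True)


# Hypothetical candidates, invented for this example.
candidates = [
    Study("Study A", 1200, 35.0, True, True, False),
    Study("Study B", 5000, 4.0, True, True, False),   # coverage too low
    Study("Study C", 800, 60.0, False, True, False),  # exome only
    Study("Study D", 2500, 20.0, True, True, False),
]

ranked = uber_list(candidates)
for position, study in enumerate(ranked, start=1):
    print(position, study.name, study.samples)
```

Note that the biggest candidate by raw numbers ("Study B") drops out because it fails the coverage rule, which is exactly why the rules matter for the ranking.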

Thanks to: Casey Bergman, Scott Edmunds, Prashant, Liz Batty, Craig Duffy, Cui Yujun, Lex Nederbragt for suggestions!

Update 10-02-2014: Added Chewapreecha et al, Casali et al, now occupying positions 1 and 4 respectively in the uber-list!
Update 15-04-2014: Added Nasser et al, new position 1!
Update 29-05-2014: Added 3,000 rice genome project, new position 3!

Name | Number | Reference
S. pyogenes | 3,615 | Nasser et al. 2014
S. pneumoniae | 3,085 | Chewapreecha et al. 2014
Rice (Oryza sativa) | 3,000 | The 3,000 rice genomes project
C. elegans | 2,007 | Thompson et al. 2013
Clostridium difficile | 1,250 | Eyre et al. 2013
Human (1000 Genomes Project) | 1,092 | 1000 Genomes Project Consortium, 2013
Mycobacterium tuberculosis | 1,000 | Casali et al. 2014
Plasmodium falciparum | 825 | Miotto et al. 2013
Streptococcus pneumoniae | 616 | Croucher et al. 2013
Mycobacterium tuberculosis | 390 | Walker et al. 2013
Salmonella in cattle and humans | 373 | Mather et al. 2013
Shigella sonnei | 263 | Holt et al. 2013
Mycobacterium tuberculosis | 259 | Comas et al. 2013
Streptococcus pneumoniae | 240 | Croucher et al. 2011
Methicillin-resistant Staphylococcus aureus | 193 | Holden et al. 2013
Campylobacter jejuni | 192 | Sheppard et al. 2013
Mycobacterium abscessus in CF | 170 | Bryant et al. 2013

Beatles and Bioinformatics: Our best meeting yet

Wow, so last Wednesday we held the fourth instalment of our Balti and Bioinformatics series, which was a brilliant success, attracting over 100 participants. The idea of this meeting is to bring those developing cutting-edge bioinformatics methods together with those who actually use them. Thanks to the generous sponsorship of the Centre for Genomic Research at the University of Liverpool, the Medical Research Council and the BBSRC, we were able to change up a gear, inviting two incredible international speakers: Sébastien Boisvert and Daniel Huson. We were also able to afford a proper lunch for everyone - of course this was the traditional Liverpool dish 'scouse'.

One thing that gave the meeting a little extra edge was that we did a live 'webcast' for the very first time through YouTube's live events system. Apart from a few issues with the sound right at the beginning, this was a great success and the YouTube statistics told me that 461 playbacks were made during the broadcast. There was also a flurry of Twitter activity on the #BeatlesAndBioinformatics hashtag (see the Storify by Surya Saha here) and we even managed to take a question over Twitter.

The great thing about the YouTube live events is that it also saves a copy, and so we are able to record the event for posterity. I've had a few people ask about how to set up such a webcast themselves; we will try to write a short guide for the blog at some point.

A great meeting! I am incredibly grateful to our speakers: Séb, Daniel, Chris Quince, Susannah Salter, Sujai Kumar, Mike Cox, Rebecca Gladstone and Chris Hayman. I am also massively grateful to the team at Liverpool for helping to organise: Neil Hall, Christiane Hertz-Fowler and especially Lesley Parsons. Also thanks to Christina Bronowski and Ian Goodhead and the Free State Kitchen for help with the evening catering. And finally to Barbara Myers, Paul Loman and Josh Quick for organising the live video webcast.

Finally, we ended up in the Cavern Pub where we were entertained by the guitar antics of The Amazing Kappa.

Check out the webcast!

http://www.youtube.com/watch?v=tSul_qDwvN4

You can even jump directly to a talk, thanks to the tags that Sébastien Boisvert has put in.

13.00 – KEYNOTE: Sébastien Boisvert, Université Laval, Québec, Canada – “Ray and Ray Cloud Browser for Metagenomics” 5:46

13.50 – Chris Stewart, University of Northumbria at Newcastle – “Development of the Gut Microbiome in Preterm Infants at Risk of Necrotising Enterocolitis and Sepsis” 59:48

14.10 – Chris Quince, University of Glasgow – “CONCOCT: Clustering cONtigs on COverage and ComposiTion” 1:15:32

14.30 – Susannah Salter, Wellcome Trust Sanger Institute – “What’s lurking in your kits?” 1:39:00

15.10 – KEYNOTE: Daniel Huson – Center for Bioinformatics, University of Tübingen – “Identifying Organisms from a Stream of DNA Sequences” 2:32:30

16.00 – Sujai Kumar, University of Oxford – “Blobology: exploring raw (meta)genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots” 3:10:50

16.20 – Mike Cox, Imperial College – “Copy number correction in 16S analysis” 3:33:16

16.40 – Rebecca Gladstone, University of Southampton / Wellcome Trust Sanger Institute – “Managing hundreds and thousands of bacterial genome sequences” 3:48:30
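
If you want to share a link that jumps straight to one of these talks, the timestamps can be converted to seconds and appended to the video URL. This is a minimal sketch, assuming YouTube's `t` query parameter (a start offset in seconds) behaves as it usually does; the function names are my own.

```python
from urllib.parse import urlencode

# Video ID taken from the webcast link above.
VIDEO_ID = "tSul_qDwvN4"


def to_seconds(stamp: str) -> int:
    """Convert 'M:SS' or 'H:MM:SS' into a number of seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def talk_url(stamp: str) -> str:
    """Build a watch URL that starts playback at the given timestamp."""
    query = urlencode({"v": VIDEO_ID, "t": f"{to_seconds(stamp)}s"})
    return f"http://www.youtube.com/watch?{query}"


# Example: the CONCOCT talk listed at 1:15:32.
print(talk_url("1:15:32"))
```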

Course Advert: NERC Workshop on Population and Metagenomics Analysis

I help out with a fair few workshops on genomic analysis, but most are restricted to just a day or two, hardly enough time to cover more than the basics. So the 10-day NERC Population and Metagenomics Analysis course, organised by the fantastic Konrad Paszkiewicz from the University of Exeter, is a quite amazing opportunity. The teaching will be from some of the best people in the field (and me!) and of the highest quality. Incredibly, if you are a UK-funded researcher the costs of the course are entirely waived, with accommodation and meals also included! Spaces are limited but I would definitely urge you to sign up. Full details are below, and please head over to the course workshop page to register. Hurry!

Workshop Overview

A ten-day workshop taking place between 25 February - 6 March 2014 providing detailed hands-on training for population and meta-genomics analysis for researchers with little or no background in mathematics or computing.

Venue: Dartington Hall, Totnes, Devon (nearest train station - Totnes)

Times: 25 February - 6th March 2014.

Arrival evening of Tuesday 25 February 2014. Departure morning of 6th March 2014. The course itself will take place 9am-12pm, 2pm-5pm and on some evenings 7pm-10pm every day 26 February-5th March. Students are expected to attend the entire course.

Contact: research-events@exeter.ac.uk

Registration

The course itself is free of charge and is funded by a Professional Postgraduate Development Award from NERC.

A total of 30 funded places are available which cover the costs of accommodation and food, but not the cost of transportation to/from the venue.

An additional 10 places are available for participants from industry. The cost of accommodation and meals will need to be covered by the participants.

You should register your interest by 31 December 2013. Participants will be informed by 10th January 2014 as to whether they have been selected. Please note that preference will be given to researchers funded by NERC.

Accommodation and Transport:

For UK-based academic researchers:

The course is free of charge for up to 30 academic researchers working at recognised UK HEIs and research institutes. Accommodation at Dartington Hall is included and includes breakfast, lunch and dinner. Transportation to/from Dartington Hall is NOT included.

For all other participants:

Whilst course fees will be waived, the cost of accommodation and meals will NOT be included. If selected, you will need to book accommodation with Dartington Hall separately. A special rate of £102 per night plus VAT has been negotiated (this includes breakfast, lunch and dinner).

Requirements:


Selected participants must bring their own laptops to the course. This needs to be a modern laptop (Windows, Mac OSX or Linux) with a full-size screen and wireless (please do not bring netbooks etc). Ethernet sockets may not be available during the course so please plan accordingly.

 

Draft Programme

An outline of the short course is given below with lead instructors in brackets. Please note that this is subject to change.

Tuesday 25 February 2014:

Arrival at Dartington Hall

Evening welcome buffet

Wednesday 26 February 2014:

Breakfast

Morning session 9am-12pm:

Introduction to the course

Hands-on workshop: Introduction to Amazon EC2 cloud (Konrad Paszkiewicz)

Lunch

Afternoon session 2pm-5pm:

Hands-on workshop: Introduction to Linux (Konrad Paszkiewicz and Julian Catchen)

Dinner

Evening session 7pm-10pm:

Hands-on workshop: Introduction to Linux (Konrad Paszkiewicz and Julian Catchen)

Thursday 27 February 2014:

Breakfast

Morning session 9am-12pm:

Hands-on workshop: Introduction to Linux (Konrad Paszkiewicz and Julian Catchen)

Lunch

Afternoon session 2pm-5pm:

Lecture: Introduction to genomics and bioinformatics (David Studholme)

Dinner

Evening session 7pm-10pm:

Hands-on workshop: Short read genomics (Konrad Paszkiewicz and David Studholme)

Friday 28 February 2014:

Breakfast

Morning session 9am-12pm:

Lecture: Introduction to RAD-seq (William Cresko and Julian Catchen)

Lunch

Afternoon session 2pm-5pm:

Hands-on workshop: RAD-seq (William Cresko and Julian Catchen)

Dinner

Evening session 7pm-10pm:

Hands-on workshop: RAD-seq (William Cresko and Julian Catchen)

Saturday 1 March 2014:

Breakfast

Morning session 9am-12pm:

Hands-on workshop: RAD-seq (William Cresko and Julian Catchen)

Lunch

Afternoon session 2pm-5pm:

Participant presentations

Dinner

Evening session 7pm-10pm:

 

Sunday 2 March 2014:

Free day with organised activities

Monday 3 March 2014:

Breakfast

Morning session 9am-12pm:

Lecture: Marker-based metagenomics introduction (Jose Clemente)

Lecture: Statistical challenges in metagenomics analysis (Chris Quince)

Lunch

Afternoon session 2pm-5pm:

Hands-on workshop: Introduction to QIIME (Jose Clemente, Daniel McDonald)

Dinner

Evening session 7pm-10pm:

Hands-on workshop: Introduction to QIIME (Jose Clemente, Daniel McDonald)

 

Tuesday 4 March 2014:

Breakfast

Morning session 9am-12pm:

Lecture: Whole genome metagenomics introduction (Nick Loman)

Lunch

Afternoon session 2pm-5pm:

Hands-on workshop: Whole genome metagenomics (Nick Loman, Chris Quince)

Dinner

Evening session 7pm-10pm:

Hands-on workshop: Whole genome metagenomics (Nick Loman, Chris Quince)

Wednesday 5 March 2014:

Breakfast

Morning session 9am-12pm:

Hands-on workshop: Free session (bring your own data)

Lunch

Afternoon session 2pm-5pm:

Hands-on workshop: Free session (bring your own data)

Dinner

Evening session 7pm-10pm:

End of workshop party

Thursday 6 March 2014:

Breakfast

Departure

Full event information

Population and metagenomics analysis are fields which have developed rapidly in recent years and have opened up new methodologies to researchers in ecology, systematics, evolutionary development and ecotoxicology. However, the software which has been developed to analyse these types of data is typically non-graphical and complex to master for researchers in biological sciences who have not been specifically trained in bioinformatics. In this short course the Amazon EC2 cloud will be used for training using laptops.

One of the most significant recent developments in population genomics is Restriction-site Associated DNA sequencing (RAD-seq) [1]. This technique uses high-throughput sequencing to simultaneously sequence and genotype organisms at tens of thousands of loci. The number of markers generated makes analysis much more sensitive than traditional microsatellite-based approaches, enabling resolution between very closely related individuals who belong to the same microsatellite type. It also requires comparatively less development and optimization time since the number of markers is proportional to the number of fragments digested. It can be applied even in the absence of a reference genome and can assist with genome assembly as well as provide functional information [1,2].

Metagenomics involves the study of communities of microbial organisms in particular environments. The combination of culture-independent techniques and high-throughput sequencing technology has made possible a comprehensive characterization of whole communities for a fraction of the cost. This enables studies of particular environmental niches over time or changing conditions at different resolution levels [3,8].

We have arranged for leading population genomic and metagenomic experts in the US and the UK to serve as instructors for this short-course. The US instructors below have either developed the molecular methods, the theory behind the analysis, or have actively developed the relevant software to perform the analysis.

 

Workshop instructors

  1. Professor William Cresko is a Principal Investigator and Director of the Institute of Ecology and Evolution at the University of Oregon. He is a pioneer of the RAD-seq technique, and has used the approach extensively to perform genetic mapping of stickleback fish phenotypic variation [1] as well as the evolutionary genomics of pipefishes and seahorses (http://creskolab.uoregon.edu/).
  2. Dr Julian Catchen is a Postdoctoral Research Fellow at the University of Oregon Institute of Ecology and Evolution. He is the author of Stacks – the most popular software package designed to process and analyse RAD-seq data.
  3. Prof. Jose Clemente is based at the Icahn School of Medicine at Mount Sinai, New York. He is a contributing author to the QIIME (Quantitative Insights Into Microbial Ecology) [4] software package, which is one of the most popular software tools for performing metagenomic analysis. His lab at Mount Sinai is particularly focused on characterizing the mechanisms of action of the microbiome in IBD.
  4. Dr Nick Loman is an MRC Special Training Research Fellow currently working at the University of Birmingham. His research program focuses on the genomic and metagenomic analysis of microbial sequence data in a clinical context.
  5. Daniel McDonald is a graduate student in the Interdisciplinary Quantitative Biology program in the BioFrontiers Institute at the University of Colorado, and a part of Prof. Rob Knight's lab, a recognized leader in microbiome research. Daniel is a contributing author of QIIME and a core software developer on the project.
  6. Dr David Studholme is a Senior Lecturer in Bioinformatics at the University of Exeter. His research interests encompass applications of genomics, transcriptomics and metagenomics to plant-pathogen interactions. His recent projects have focussed on tree-pathogens Chalara fraxinea and Phytophthora ramorum as well as bacterial pathogens of banana, enset, tomato and other crops.
  7. Dr Konrad Paszkiewicz is the Director of the Wellcome Trust Biomedical Informatics hub. He is responsible for the provision of training for PhD students and researchers as well as bioinformatics facilities and capabilities within the University of Exeter.
  8. Prof. Peter Kille is the director of Bio-Initiatives at the University of Cardiff. His primary research expertise lies in the application of molecular techniques such as proteomics and genomics to eco-toxicology. His research interests encompass the effect on biological systems of the release of heavy metals into the environment.
  9. Dr Christopher Quince is a Reader at the University of Glasgow. He leads the Computational Microbial genomics group which focuses on the development of novel algorithms to aid the analysis of microbial community structures. The group also develop engineering systems using microbial communities including microbial fuel cells and filtration systems. He is the author of PyroNoise and AmpliconNoise which are integral to the analysis of many high-throughput metagenomic datasets.

All of the instructors have extensive experience teaching in a short-course/workshop environment. These include the National Evolutionary Synthesis Centre Workshops (NESCent) in Next Generation Sequencing 2011 and 2012 held at Duke University, North Carolina, USA and the Evomics workshop held every January in Český Krumlov, Czech Republic. The QIIME group hold regular workshops in the US and worldwide. Much of the teaching material for the proposed short course has already been produced, delivered and tested with a large number of audiences. Amazon cloud images for each section of the short course have already been produced by the instructors and have been extensively tested in previous workshops.

The Amazon and Linux training will be based on a modified version of the ‘Unix & Perl for Biologists’ [7] course adapted for use on the cloud with an extensive EC2 tutorial. Pre-existing in-house workshop materials will be used to teach the basics of remapping, assembly and variant calling. The Stacks software suite will be used in conjunction with R to teach RAD-seq analysis. We will use the QIIME [4], MEGAN [5] and MetaPhlAn [6] software packages to teach various aspects of marker-based and shotgun metagenomics.

 

References:

  1. Hohenlohe PA, Bassham S, Etter PD, Stiffler N, Cresko WA. Population Genomics of Parallel Adaptation in Threespine Stickleback using Sequenced RAD Tags. PLoS Genetics 2010;6.
  2. Catchen J, Hohenlohe P, Bassham S, Amores A, Cresko W. Stacks: an analysis tool set for population genomics. Molecular Ecology 2013.
  3. Rousk J, Bååth E, Brookes PC, Lauber CL, Lozupone C, Caporaso JG, Knight R. Soil bacterial and fungal communities across a pH gradient in an arable soil. The ISME Journal 4 (10), 1340-1351.
  4. Caporaso J et al. QIIME allows analysis of high-throughput community sequencing data. Nature Methods 7, 335-336 (2010).
  5. Huson DH, Mitra S, Weber N, Ruscheweyh H, Schuster SC. Integrative analysis of environmental sequences using MEGAN4. Genome Research 21:1552-1560 (2011).
  6. Segata N et al. Metagenomic microbial community profiling using unique clade-specific marker genes. Nature Methods 9, 811-814 (2012).
  7. Bates S et al. Global biogeography of highly diverse protistan communities in soil. ISME J 7: 652-659; Dec. 2012. http://korflab.ucdavis.edu/Unix_and_Perl/

Diagnosing problems with phasing and pre-phasing on Illumina platforms

 

If you do Illumina sequencing you probably hear the words 'phasing' and 'pre-phasing' pretty regularly, but what do they mean exactly and why are they important? Well, with MiSeq read lengths now at 300 and HiSeq high-throughput mode soon to be 125, keeping phasing and pre-phasing levels under control will become increasingly important. Nothing can bring down a long read like high phasing or pre-phasing, and throwing higher densities into the mix only makes the problem worse. Here is a quick guide to troubleshooting high phasing and pre-phasing issues.

What is phasing?

In sequencing-by-synthesis chemistry like Illumina (sorry, Solexa!), phasing is the rate at which single molecules within a cluster lose sync with each other. Phasing is falling behind, pre-phasing is going ahead, and together they describe how well the chemistry is performing.

Low numbers are better! Values of 0.10/0.10 mean that 0.10% of the molecules in your cluster are falling behind AND 0.10% are running ahead at EACH cycle. In other words, 0.20% of the true signal is lost each cycle and will therefore contribute to noise. As another example, 0.20/0.20 means that 0.4% of the true signal is lost per cycle, so after 250 cycles (without correction) the accumulated loss would be 250 × 0.4% = 100%, i.e. your noise would be equal to your original signal.
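
The arithmetic above adds up the per-cycle loss linearly. As a rough illustration, here is a minimal sketch that compares that linear estimate with a compounding one. Both are toy models of my own, assuming a fixed fraction of the remaining in-phase molecules is lost at every cycle and never recovers; this is not how RTA models phasing internally.

```python
def lost_per_cycle(phasing_pct: float, prephasing_pct: float) -> float:
    """Fraction of in-phase signal lost at each cycle (percent values in)."""
    return (phasing_pct + prephasing_pct) / 100.0


def linear_loss(phasing_pct: float, prephasing_pct: float, cycles: int) -> float:
    """Cumulative loss if the per-cycle loss simply adds up, capped at 1.0."""
    return min(1.0, lost_per_cycle(phasing_pct, prephasing_pct) * cycles)


def in_phase_fraction(phasing_pct: float, prephasing_pct: float, cycles: int) -> float:
    """Remaining in-phase fraction if the loss compounds cycle by cycle."""
    return (1.0 - lost_per_cycle(phasing_pct, prephasing_pct)) ** cycles


for phasing, prephasing in [(0.10, 0.10), (0.20, 0.20)]:
    linear = linear_loss(phasing, prephasing, 250)
    remaining = in_phase_fraction(phasing, prephasing, 250)
    print(f"{phasing:.2f}/{prephasing:.2f}: "
          f"linear loss {linear:.0%}, compounded in-phase signal {remaining:.1%}")
```

Under the compounding assumption a 0.20/0.20 run still has roughly 37% of its molecules in phase after 250 cycles, rather than none, but either way most of the signal has turned into noise, which is why the correction matters so much for long reads.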

The reason it is calculated is so RTA can apply the correct level of phasing correction, which is why you can sequence for 250 bases without making random basecalls! This works by artificially pushing signal in or out of each channel based on basecalls before or after it and is an essential process in the Illumina basecaller.

Historically, phasing and pre-phasing were estimated over the first 12 cycles of each read and then applied to all subsequent cycles. However, with MCS 2.4 on the MiSeq we see a new algorithm called empirical phasing correction, which optimises the phasing correction at every cycle by trying a range of corrections and selecting the one that results in the highest chastity (signal purity). This has major benefits: the correction no longer assumes a linear phasing correction for the whole read, and it does not rely on an accurate estimate over the first 12 cycles (better for low-diversity samples). The only cost of this is computational, which is why it is not yet available for HiSeq. The new algorithm stores a new text file in the phasing folder:

D:\Illumina\MiSeqAnalysis31118_M00875_0072_000000000-A6B08\Data\Intensities\BaseCalls\Phasing\EmpiricalPhasingCorrection_1_1_1101.txt

Plotting this can help diagnose problems. Shown below are a good run and a bad run - can you tell which is which?! In the bottom run the pre-phasing was so bad it actually reached the maximum allowable pre-phasing correction of 0.6. As these are cumulative values, the actual phasing per cycle is the gradient of the line (approximately 0.1% in the good run).

[Figure: emp_phasing1]

[Figure: emp_phasing2]
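
If you would rather put a number on those plots, the per-cycle rate is just the slope of the cumulative correction against cycle number. Below is a minimal sketch using a least-squares fit. The two example series are synthetic values I made up, I am assuming the corrections are stored as fractions (so 0.001 per cycle is 0.1%), and parsing the text file itself is left out because its layout is not described here.

```python
MAX_CORRECTION = 0.6  # maximum allowable correction mentioned above


def slope(values):
    """Least-squares gradient of values against cycle number (1, 2, 3, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two cycles to fit a gradient")
    cycles = range(1, n + 1)
    mean_x = sum(cycles) / n
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(cycles, values))
    denominator = sum((x - mean_x) ** 2 for x in cycles)
    return numerator / denominator


def hits_ceiling(values, ceiling=MAX_CORRECTION):
    """True if the cumulative correction ever reaches the allowed maximum."""
    return max(values) >= ceiling


# Synthetic cumulative corrections for 250 cycles (hypothetical data).
good_run = [0.001 * cycle for cycle in range(1, 251)]
bad_run = [min(MAX_CORRECTION, 0.004 * cycle) for cycle in range(1, 251)]

for label, run in [("good", good_run), ("bad", bad_run)]:
    print(f"{label} run: {slope(run):.2%} per cycle, "
          f"ceiling reached: {hits_ceiling(run)}")
```

A run that flattens out at the ceiling will show a misleadingly low fitted slope overall, so it is worth checking for saturation before trusting the gradient.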

How to recognise high phasing/pre-phasing

It's hard to say what phasing values you 'should have' because it depends on many variables, so how do you recognise if you have a problem? Here are a few questions you might ask yourself:

  • Were the phasing/pre-phasing values higher than usual?
  • Do the quality scores look low?
  • Have you run this sample before without issue?
  • Did the instrument complete without error?
  • Do the thumbnail images look normal?
  • Do the intensity and %base plots look normal?
  • Is there an excessive phasing/pre-phasing gradient down the lane visible on the heatmap?

If the answer to most of these questions is 'Yes!' then you may suspect a phasing/pre-phasing problem. So how do you know which it is? A simplified explanation is that phasing is caused by enzyme kinetics, while pre-phasing is caused by either inadequate flushing of the flowcell or inadvertent reagent mixing. Here is a representative (but by no means exhaustive) list:

Cause of phasing | Comment
High GC content | Extreme GC should result in quite high phasing, this is normal
Bad lot number | Reagents were manufactured incorrectly
Peltier calibrated low | Even one or two degrees can affect the enzyme kinetics
Chiller calibrated high | Chiller temperature should not exceed 6°C
Fluidics problem | Reagents were under-delivered
Shipping problem | Reagents should not thaw until use, double Mylar wrapping should be unbroken
Improper storage | Reagents should be stored at -20°C
Improper handling | Reagents should be thawed in lukewarm water and used immediately

 

Cause of pre-phasing | Comment
Fluidics problem | Worn valve, PR2 was under-delivered
Bad lot number | Reagents were manufactured incorrectly
Common line or manifold | Common cause of pre-phasing problems
Instrument not washed | Wash instrument with 0.5% TWEEN in DI water immediately following run
Shipping problem | As above
Improper storage | As above
Improper handling | As above

One last point: if you are running amplicons or other low-diversity samples, for which the phasing estimation is inaccurate, the PhiX error rate can sometimes be useful for diagnosing problems.

We are very lucky to have a fantastic FAS, Helen, who keeps our instruments running very smoothly, but if we do have a problem we tend to send her all the information we can to save time. Hopefully this will help you do the same.