BIOM25: 16S Practical

In this practical we will analyse datasets from several studies, some very important, others perhaps just a little silly.

First, we will go through a dataset together. It comes from a pioneering paper:

  • The Human Microbiome in Space and Time.

After that, in groups, we will analyse one of three different datasets:

  • CSI: Microbiome. Can you determine who has been using a keyboard from the microbiome that is left behind? Do keyboards have a core microbiome?
  • The microbiome of restroom surfaces (toilets!)
  • Development of the infant gut microbiome.

Please watch this video for a useful demonstration of how principal component analysis works: https://www.youtube.com/watch?v=BfTMmoDFXyE
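As a companion to the video, here is a minimal sketch of the idea behind PCA in plain Python. It is not what QIIME runs (the plots in this practical are PCoA on UniFrac distances); it only illustrates finding the direction of greatest variance and projecting samples onto it. The function name, the use of power iteration, and the toy data are assumptions made for illustration.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component.

    rows: list of equal-length numeric tuples, one per sample.
    Uses power iteration on the covariance matrix; assumes the data
    are not all identical and that the start vector is not orthogonal
    to the leading component.
    """
    n = len(rows)
    d = len(rows[0])
    # Centre each variable on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalise.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centred sample onto the component.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    return v, scores


# Toy data (made up): four samples lying exactly on the line y = 2x.
toy = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, scores = first_principal_component(toy)
print(direction)
print(scores)
```

Because the toy samples lie on a single line, one component captures all of the variation; real microbiome data need several axes, which is why the Emperor plots show three.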

General questions

Q: What is the difference between alpha- and beta-diversity?

Human microbiome in space and time

Paper: http://www.microbesng.uk/filedist/16stutorial/spacetime/nihms245011.pdf

Supplementary material: http://www.sciencemag.org/cgi/content/full/1177486/DC1

Let’s have a look at the results.

Fields of importance: HOST_INDIVIDUAL, SEX, BODY_HABITAT, BODY_SITE, COMMON_NAME

Results:

Alpha diversity: http://www.microbesng.uk/filedist/16stutorial/spacetime/core/arare_max500/alpha_rarefaction_plots/rarefaction_plots.html

Bar plots by sample site: http://www.microbesng.uk/filedist/16stutorial/spacetime/core/taxa_plots_COMMON_SAMPLE_SITE/taxa_summary_plots/bar_charts.html

PCoA analysis: http://www.microbesng.uk/filedist/16stutorial/spacetime/core/bdiv_even500/unweighted_unifrac_emperor_pcoa_plot/index.html

Q: Is there evidence of natural clusters being formed?

Q: Do samples cluster by individual? If not, how do they cluster?

Q: What are the most dominant taxa in stool, skin, urine? Look at different taxonomic levels down to genus.

Q: Are these sites similar or different? What are the major differences in taxonomic profile between these three sites?

## CSI: Microbiome

Original paper: http://www.microbesng.uk/filedist/16stutorial/keyboard/keyboard_paper.pdf

Q: Skim read the introduction of the paper to get a feel for what they are trying to find out.

Q: Look at the Methods section and put the primer selection into TestPrime:

http://www.arb-silva.de/search/testprime

Results: http://www.microbesng.uk/filedist/16stutorial/keyboard/core2/

Important metadata fields for this project:

  • Description_duplicate - the key from any keyboard
  • HOST_SUBJECT_ID - the person each keyboard belongs to

Hint: M1, M2 and M9 are the three participants referred to in the paper.

Q: What are the most abundant taxa?

Q: Check the PCoA plots. Do samples cluster by key or by subject? (Hint: colour by Description_duplicate and by HOST_SUBJECT_ID.)

Q: Go back to the taxa bar plots. Can you figure out which taxa are driving the variation that produces the grouping?

Q: Which of these taxa are part of the normal skin microbiome? Are any out of place? Where might they come from?

Q: Do you think this technique will really be usable for forensics? What are the challenges? What other techniques might work better for studying the microbiome?

Q: Now, read the paper in more detail and prepare a short summary to present the context for the study, the methods employed and the results found.

## Restroom surfaces

Paper: http://www.microbesng.uk/filedist/16stutorial/restrooms/pone.0028132.pdf

Q: Skim read the introduction of the paper to get a feel for what they are trying to find out.

Q: Look at the Methods section and put the primer selection into TestPrime: http://www.arb-silva.de/search/testprime

Now, look at the output of QIIME:

Results: http://www.microbesng.uk/filedist/16stutorial/restrooms/core/

Fields of importance: Floor, Level, SURFACE, BUILDING

Q: What surfaces have the greatest amount of diversity? Is this expected?

Q: What do the profiles of stool, etc. look like?

Q: Are there any natural looking clusters in the data?

Q: Which sources of samples are most similar to others?

Q: Is there any clustering between different floors of the building?

Q: Compare the weighted and unweighted UniFrac results. Do the clusters look more natural in one or the other?

Q: Which surfaces have the most diversity? Least?

Q: Now, read the paper in more detail and prepare a short summary to present to the whole group. Consider: the context for the study, the methods that were employed and the results found. What did you think? What are the limitations of the study?

Infant gut metagenome

Paper: http://www.microbesng.uk/filedist/16stutorial/infant_time_series/PNAS-2011-Koenig-4578-85.pdf

Q: Skim read the introduction of the paper to get a feel for what they are trying to find out.

Q: Look at the Methods section and put the primer selection into TestPrime: http://www.arb-silva.de/search/testprime

Now, look at the output of QIIME:

Results: http://www.microbesng.uk/filedist/16stutorial/infant_time_series/core/

Fields of importance:

  • SampleID - age in days of infant
  • SOLIDFOOD
  • FORMULA
  • COWMILK
  • BREASTMILK
  • COLORDESCRIPTION
  • HOST_SUBJECT_ID

Q: Is there any evidence of a gradient? (Key: use SampleID and turn gradient colours on)

Q: How do the taxa change over time?

Q: Which infant samples does the maternal stool most resemble?

Q: Is the colour of stools associated with their bacterial diversity?

Q: Now, read the paper in more detail and prepare a short summary to present to the whole group. Consider: the context for the study, the methods that were employed and the results found. What did you think? What are the limitations of the study?

## Instructor notes on building this tutorial

  • Download from QIIME db site or the BEAST
  • Get greengenes tree file
  • core_diversity_analyses.py -i study_1335_closed_reference_otu_table.biom -o core -m study_1335_mapping_file.txt -e 1000 -t ../gg.tree -c "GENDER,FLOOR,BUILDING,SURFACE"
  • core_diversity_analyses.py -i study_232_closed_reference_otu_table.biom -o core2 -m study_232_mapping_file.txt -e 1000 -t gg.tree -c "HOST_SUBJECT_ID,Description_duplicate"
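In the commands above, -e 1000 sets the even sampling depth: each sample is randomly subsampled to 1000 sequences before diversity is compared, and samples with fewer sequences are dropped. The following is a minimal sketch of that idea in plain Python, not QIIME's own code; the function name, the dict-based sample representation and the toy counts are assumptions for illustration.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample a sample to exactly `depth` sequences without replacement.

    counts: dict mapping OTU id -> observed count (hypothetical format).
    Returns a new dict of subsampled counts, or None if the sample has
    fewer than `depth` sequences and would therefore be dropped.
    """
    total = sum(counts.values())
    if total < depth:
        return None
    # Expand to one entry per sequence, then draw `depth` of them.
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    picked = rng.sample(pool, depth)
    rarefied = {}
    for otu in picked:
        rarefied[otu] = rarefied.get(otu, 0) + 1
    return rarefied


# Toy sample (made-up counts) with 2000 sequences in total.
sample = {"OTU_a": 1500, "OTU_b": 400, "OTU_c": 100}
print(rarefy(sample, 1000))
print(rarefy({"OTU_a": 10}, 1000))  # too shallow, so None
```

Choosing the depth is a trade-off: a higher value keeps more sequences per sample but discards more of the shallow samples.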

First SQK-MAP-006 experiment

We’ve just finally found the time to break open the new SQK-MAP-006 kits from Oxford Nanopore. These kits are notable because they introduce the first really major changes to the chemistry for some time.

  • First up, the speed has more than doubled, from ~30 bp/s to ~75 bp/s. The assumption is this will increase yields, but it will be interesting to see what - if any - effect it has on the quality profile. The worry would be that increased speeds would increase the chance of missing events (transitions between signal levels), which would manifest as deletions after basecalling.
  • Secondly, the previous hairpin-motor complex (which enabled 2D reads and also stalled the complement strand) has been jettisoned to return to a simpler setup. As I understand it, the hairpin remains (and is now biotinylated and pulled down by beads to ensure very high 2D yields) but the second motor has gone. The new motor I assume is clever enough to be able to stall both the template and complement strand. It will be interesting to compare translocation times of the two strands (in SQK-MAP-005 the complement strand went through the pore more slowly, as it was retarded by two enzymes).

The new chemistry is accompanied by a new Metrichor basecaller workflow specific to SQK-MAP-006.

A notable change, looking at the returned FAST5 files, is that the model now considers signal levels for each of the 4^6 possible 6-mers when doing basecalling. Previously, 5-mers were used. Does this mean that the ionic flux through the nanopore is in fact affected by 6 or more bases, rather than the 5 that we initially assumed? Or was 5 simply chosen to simplify the analysis? If the latter - and this seems likely - this may help with basecalling accuracy, and it will be interesting to see if it resolves any previously difficult-to-sequence motifs (we looked at such under-represented sequences in the context of 5-mers in our recent paper: http://www.nature.com/nmeth/journal/v12/n8/full/nmeth.3444.html).
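To get a feel for what the move from 5-mers to 6-mers means for the size of the model, here is a small sketch in Python that enumerates the possible k-mers and shows how a sequence breaks into overlapping k-mers. The function names are made up for illustration and nothing here reflects how Metrichor is implemented.

```python
from itertools import product


def all_kmers(k, alphabet="ACGT"):
    """Every possible k-mer over the alphabet (4**k of them for DNA)."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def kmers_in(sequence, k):
    """Overlapping k-mers of a sequence, shifting one base at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(len(all_kmers(5)), len(all_kmers(6)))  # 1024 4096
print(kmers_in("ACGTACG", 6))                # ['ACGTAC', 'CGTACG']
```

So a 6-mer model has four times as many signal levels to parameterise as a 5-mer model.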

It does not seem to be supported to call older, pre-SQK-MAP-006 data with the new 6-mer model basecaller.

So far we have done four SQK-MAP-006 runs. Two were generated with natural DNA, and two were generated with the low-input library that includes a PCR step.

Each of the files below is an archive of a run following base calling with Metrichor. We also provide a subset of one of the runs in ‘raw’ format which has the individual signal measurements (i.e. before event detection is carried out).

The raw files are available via the ENA FTP site

| Run | Basecalled data | 2D pass FASTA |
|---|---|---|
| MAP-006-1 | Basecalled | Pass FASTA |
| MAP-006-2 | Basecalled and raw | Pass FASTA |
| MAP-006-PCR-1 | Basecalled | Pass FASTA |
| MAP006-PCR-2 | Basecalled and raw | Pass FASTA |

Head over to Jared Simpson’s blog to see some early results of using these data for assembly polishing.

Enjoy!

As always, thanks to Josh Quick for his masterful library preparation technique.

Calling haploid consensus sequence

For some reason, calling a haploid consensus sequence from a VCF seems harder than it needs to be.

I’ve experimented with samtools mpileup and bcftools call/consensus with much frustration and little success, as it always wants to call heterozygous positions which I don’t want.

In the end, the easiest way I have found to do this is to use freebayes:

freebayes -f ref.fa -p 1 aln.sorted.bam > vcffile

And then use vcf2fasta from vcflib to call a consensus:

vcf2fasta -f ref.fa -P 1 vcffile

This will spit out a file with the consensus sequence.
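For intuition, the core of what a haploid consensus caller does can be sketched in a few lines of Python: walk the VCF records and substitute each ALT allele into the reference. This is an illustrative sketch only, not how vcf2fasta is implemented. It assumes a single reference sequence, handles only single-base substitutions (indels and multi-allelic records are skipped), and the function name is made up.

```python
def apply_haploid_snps(reference, vcf_lines):
    """Return the reference with single-base ALT alleles substituted in.

    reference: the reference sequence as a string (one contig assumed).
    vcf_lines: iterable of VCF text lines; header lines are ignored.
    """
    seq = list(reference)
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        pos, ref, alt = int(fields[1]), fields[3], fields[4]
        # Skip anything that is not a simple one-base substitution.
        if len(ref) != 1 or len(alt) != 1 or alt == ".":
            continue
        # VCF positions are 1-based; check the record matches the reference.
        if seq[pos - 1].upper() != ref.upper():
            raise ValueError("REF allele does not match reference at %d" % pos)
        seq[pos - 1] = alt
    return "".join(seq)


# Toy example with made-up records.
toy_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample",
    "chr\t2\t.\tC\tT\t50\t.\t.\tGT\t1",
    "chr\t5\t.\tA\tG\t50\t.\t.\tGT\t1",
]
print(apply_haploid_snps("ACGTACGT", toy_vcf))  # ATGTGCGT
```

The REF check is the part that tends to fail when the VCF comes from a tool with different conventions, which is one reason to stick with a matched caller and consensus tool.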

Of course, given that the VCF format is not really a format, trying to use vcf2fasta on VCFs produced by tools other than FreeBayes (VarScan, in my case) didn’t work for me.

Real time genomic surveillance of Ebola outbreak 2014-2015

The current Ebola outbreak in West Africa is the largest ever recorded, with over 26,500 cases reported resulting in an estimated 11,000 deaths. Yet genomic surveillance of this outbreak has been patchy, hampered by understandable but vexing logistical, social, political and technical obstacles in securing and transporting samples for processing.

We wanted to help address the gaps in our knowledge of viral evolution and to generate data for epidemiological use. So, in April, Josh Quick from my group went to Conakry, Guinea to establish proof-of-principle for portable nanopore sequencing. This was the most practical way we could rapidly establish a local sequencing lab in order to generate real-time information.

His travels have been documented in several recent news articles. For background I would recommend reading Erika Hayden’s report over at Nature News, the BMC On Biology blog and this recent GenomeWeb article (registration free for academic subscribers).

In the two weeks he was there, he sequenced 14 genomes while based at Donka Hospital in Conakry. However, the surveillance sequencing has continued, thanks to the hard work of Sophie Duraffour in Coyah under the auspices of the European Mobile Laboratory project. Sophie has been working around the clock in the laboratory generating the real-time genome data, uploading it to Birmingham for analysis and then distributing it to WHO central coordination. We have had early feedback that the data has been extremely useful for the epidemiologists on the ground.

As is often the case in outbreaks, genomic data production and sharing has been patchy and uncoordinated. However, an exciting new development is under way to try and address this. Andrew Rambaut, author of essential phylogenetics software such as BEAST and FigTree and viral genome maven, has taken on a kind of unofficial role of coordinating genome sequence data, which is distributed through his website and forum Virological.org.

His personal database of Ebola genomes sits at nearly 1000 sequences and he has been privately sharing some wonderful integrated phylogenetic analyses covering the entire Ebola outbreak. However, until recently the sharing has been limited by access to public data. At a recent conference at the Institut Pasteur, I met him and his colleague Richard Neher and discussed ways to improve sharing. Together with Trevor Bedford, Richard is a developer of the nextflu website, which aims to track real-time evolution of flu.

I said that we needed this for Ebola, and of course they had already thought of this and had started building something. I said that we would contribute our nanopore sequencing dataset to this project in real-time, and those with large datasets to compare also contributed theirs.

So it is a real thrill to see the website up and running now and available to use at ebola.nextflu.org. On this website you can explore Ebola evolution during this outbreak, using controls to scroll through time, and restricting analysis to particular locations or laboratories. You can also zoom into particular clades, and see frequency distributions of specific mutations.

One particularly notable result of the data integration is that, when our surveillance data from Guinea are compared with Ian Goodfellow’s recently produced surveillance data from Sierra Leone, the two extant Guinean lineages overlap with cases from close to the Guinean border in Sierra Leone. This makes sense, and suggests that cross-country transmission may be frequently occurring.

We will be updating this website with new sequences generated by the EMLab until the end of the outbreak. We have decided to leave a one-week delay before public release so that WHO central coordination can see the data first, and the released data is limited to prefecture-level information without more specific locations.

ECCMID 2015

I am at the incredibly impressive and huge ECCMID meeting in Copenhagen.

I’ve given a talk already on “So I have sequenced my organism .. what do I do now?” (organisers’ title!). It is viewable here:

http://www.eccmidlive.org/resources/so-i-have-sequenced-my-organism-what-do-i-do-now

Tomorrow I am doing a “Meet-the-Expert” session about what tools to use for bacterial genome analysis. Feel free to look at my slides in advance and ask some questions (even if you aren’t at ECCMID!):

http://eccmidlive.org/resources/expert--33?slide_deck=1

One thing that is noticeable about this conference is how incredibly high-tech the conference website is. Talks are posted in near-real-time after they are given.

Here are some you should definitely check out!

Matt Holden, Whole genome sequencing for microepidemiological investigations: can person-to-person transmission be identified? http://www.eccmidlive.org/resources/whole-genome-sequencing-for-microepidemiological-investigations-can-person-to-person-transmission-be-identified

Ed Feil, Whole genome sequencing and public health: how can high-risk clones be identified and what can be learned for prevention and control? http://www.eccmidlive.org/resources/whole-genome-sequencing-and-public-health-how-can-high-risk-clones-be-identified-and-what-can-be-learned-for-prevention-and-control

Frank Aarestrup, Bacterial genome sequencing for outbreak detection http://www.eccmidlive.org/resources/bacterial-genome-sequencing-for-outbreak-detection

Diversity of P. aeruginosa in CF airways http://t.co/9yo1iDEHdR

The Gut and Lung Microbiota http://t.co/5v6bP20Iny

Implications of microbiome alterations due to pro/antibiotics http://t.co/JJfl6HMiz3

Culturomics http://t.co/kDlE7ltc3u

Benefits of microbiome manipulation in reducing resistance http://eccmidlive.org/resources/benefits-of-microbiome-manipulation-in-reducing-resistance

There’s lots more at http://www.eccmidlive.org