BIOM25: 16S Practical

16S Lecture Slides

http://nickloman.github.io/static/BIOM25 16S Lecture.pptx

16S Practical

In this practical we will analyse datasets from several studies, some very important, others perhaps just a little silly.

The datasets we have are:

  • CSI: Microbiome. Can you determine who has been using a keyboard from the microbiome that is left behind? Do keyboards have a core microbiome?
  • The microbiome of restroom surfaces (toilets!)
  • The Human Microbiome in Space and Time.
  • Development of the infant gut microbiome.

General questions

Q: What is the difference between alpha- and beta-diversity?
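As a point of reference when answering: alpha diversity is within-sample diversity, beta diversity is between-sample dissimilarity. A minimal Python sketch of one metric of each kind (Shannon index and Bray-Curtis dissimilarity; these two are just our choice of illustration, QIIME offers many others):

```python
import math

def shannon(counts):
    """Alpha diversity: Shannon index of a single sample's taxon counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def bray_curtis(a, b):
    """Beta diversity: Bray-Curtis dissimilarity between two samples
    (counts for the same taxa in the same order; 0 = identical, 1 = disjoint)."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 1 - 2 * shared / (sum(a) + sum(b))

# Two hypothetical samples, counts for the same four taxa
even_sample = [10, 10, 10, 10]   # evenly spread -> higher alpha diversity
skewed_sample = [37, 1, 1, 1]    # one dominant taxon -> lower alpha diversity

print(shannon(even_sample))                     # 1.386... (= ln 4)
print(shannon(skewed_sample))                   # ~0.35
print(bray_curtis(even_sample, skewed_sample))  # 0.675
```

Note that alpha diversity is a property of one sample, while beta diversity always compares two (and a matrix of such pairwise values feeds the PCoA plots used below).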

##CSI: Microbiome

Original paper: http://pathogenomics.bham.ac.uk/filedist/16stutorial/keyboard/keyboard_paper.pdf

Results: http://pathogenomics.bham.ac.uk/filedist/16stutorial/keyboard/core2/

Important metadata fields for this project:

  • Description_duplicate - the key from any keyboard
  • HOST_SUBJECT_ID - the person each keyboard belongs to

Hint: M1, M2 and M9 are the three participants referred to in the paper.

Q: What are the most abundant taxa?

Q: Check the PCoA plots: do samples cluster by key, or by subject? (hint: HOST_SUBJECT_ID)

Q: Go back to the taxa bar plots. Can you work out which taxa are driving the variation that produces this grouping?

Q: Which of these taxa are part of the normal skin microbiome? Are any out of place? Where might they come from?

Q: Do you think this technique will really be usable for forensics? What are the challenges? What other techniques might work better for studying the microbiome?

##Restroom surfaces

Paper: http://pathogenomics.bham.ac.uk/filedist/16stutorial/restrooms/pone.0028132.pdf

Results: http://pathogenomics.bham.ac.uk/filedist/16stutorial/restrooms/core/

Fields of importance: Floor, Level, SURFACE, BUILDING

Q: What surfaces have the greatest amount of diversity? Is this expected?

Q: What do the profiles of stool, etc. look like?

Q: Are there any natural looking clusters in the data?

Q: Which sources of samples are most similar to others?

Q: Is there any clustering between different floors of the building?

Q: Compare the weighted vs unweighted UniFrac results: do the clusters look more natural in one or the other?

Q: Which surfaces have the most diversity? Least?

Human microbiome in space and time

Paper: http://pathogenomics.bham.ac.uk/filedist/16stutorial/spacetime/nihms245011.pdf

Fields of importance: HOST_INDIVIDUAL, SEX, Description_duplicate, COMMON_NAME

Results:

Alpha diversity: http://pathogenomics.bham.ac.uk/filedist/16stutorial/spacetime/core/arare_max500/alpha_rarefaction_plots/rarefaction_plots.html

Bar plots: http://pathogenomics.bham.ac.uk/filedist/16stutorial/spacetime/core/taxa_plots/taxa_summary_plots/bar_charts.html

Bar plots by sample site: http://pathogenomics.bham.ac.uk/filedist/16stutorial/spacetime/core/taxa_plots_COMMON_SAMPLE_SITE/taxa_summary_plots/bar_charts.html

PCoA analysis: http://pathogenomics.bham.ac.uk/filedist/16stutorial/spacetime/core/bdiv_even500/unweighted_unifrac_emperor_pcoa_plot/index.html

All results: http://pathogenomics.bham.ac.uk/filedist/16stutorial/spacetime/core/

Q: Is there evidence of natural clusters?

Q: Do samples cluster by individual?

Q: What are the most dominant taxa in stool, skin, urine? How do they differ?

Infant gut metagenome

Paper: http://pathogenomics.bham.ac.uk/filedist/16stutorial/infant_time_series/PNAS-2011-Koenig-4578-85.pdf

Results: http://pathogenomics.bham.ac.uk/filedist/16stutorial/infant_time_series/core/

Fields of importance:

  • SampleID - age in days of infant
  • SOLIDFOOD
  • FORMULA
  • COWMILK
  • BREASTMILK
  • COLORDESCRIPTION
  • HOST_SUBJECT_ID

Q: Is there any evidence of a gradient? (Hint: use SampleID and turn gradient colours on)

Q: How do the taxa change over time?

Q: Which infant samples do the maternal stool most look like?

Q: How does diversity change over time?

##Instructor notes on building this tutorial

  • Download from QIIME db site or the BEAST
  • Get greengenes tree file
  • core_diversity_analyses.py -i study_1335_closed_reference_otu_table.biom -o core -m study_1335_mapping_file.txt -e 1000 -t ../gg.tree -c "GENDER,FLOOR,BUILDING,SURFACE"
  • core_diversity_analyses.py -i study_232_closed_reference_otu_table.biom -o core2 -m study_232_mapping_file.txt -e 1000 -t gg.tree -c "HOST_SUBJECT_ID,Description_duplicate"

NERC Exeter: Metagenomics three hour practical

##Whole-genome shotgun metagenomics learning objectives

  • Know how to subsample reads from a FASTQ file without losing paired-information (20 minutes)
  • Know how to perform basic taxonomic assignment using a marker-gene method: MetaPhlan (10 minutes)
  • Assess the impact sampling has on discovery of taxa (20 minutes)
  • Know how to use MEGAN to visualise taxonomic information (30 minutes)
  • Understand how the Lowest Common Ancestor parameters affect sensitivity and specificity of taxonomic assignments (20 minutes)
  • Know how to use MEGAN to visualise functional information (30 minutes)
  • Understand how to compare multiple samples in MEGAN (30 minutes)

###Optional, more advanced objectives (if you finish, or when back at home)

  • Know how to generate a heatmap of taxa from Metaphlan results (20 minutes)
  • Assemble a metagenome with Velvet (30 minutes)
  • Know how to perform basic taxonomic and functional assignment using a sequence similarity method: BLAST/RapSearch2 (10 minutes)

#Part I: Multivariate statistics tutorial

Login to Amazon cloud and start a new Terminal

Download the tutorial archive onto your instance:

wget http://nickloman.github.io/static/MultivariateStats.tar.gz

Go into Tutorials, untar the archive and cd into it:

tar xvfz MultivariateStats.tar.gz

cd MultivariateStats

Download the tutorial slides from http://nickloman.github.io/static/MultivariateStatsTutorial.pdf and follow the steps

Part II: Taxonomic assignments from metagenomics data

##Datasets for this practical

For this practical we will use some real clinical metagenomics data from the E. coli outbreak in 2011, published in this JAMA article: http://jama.jamanetwork.com/article.aspx?articleid=1677374

List the contents of the ~/shotgun_metagenomics/ directory (remember ~ is just a short-cut for your home directory, e.g. /home/ubuntu, so this is the same as /home/ubuntu/shotgun_metagenomics).

Q1. What format are the reads, and what can you deduce about the sequencing from them?

Q2. How many reads are in the biggest and the smallest files?

Initially, perform your work on the 2638-N12-STEC dataset.

Then, when you have run through the tutorial once, do it again but choose a dataset at random (or one you think might be particularly interesting!).

##Subsample reads with seqtk

seqtk is a useful ‘Swiss army knife’ for basic sequence manipulation.

Test it works:

$ seqtk

Usage:   seqtk <command> <arguments>

Command: seq       common transformation of FASTA/Q
         comp      get the nucleotide composition of FASTA/Q
         sample    subsample sequences
         subseq    extract subsequences from FASTA/Q
         trimfq    trim FASTQ using the Phred algorithm
         hety      regional heterozygosity
         mutfa     point mutate FASTA at specified positions
         mergefa   merge two FASTA/Q files
         randbase  choose a random base from hets
         cutN      cut sequence at long N
         listhet   extract the position of each het

####Subsampling reads

This command shows how to subsample 100,000 read pairs from two large paired FASTQ files (remember to use the same random seed in both commands to keep the pairing intact). seqtk is smart enough to detect whether a file is compressed, so you don’t need to unzip the files first, but it will always output uncompressed data.

For two hypothetical files, pair1.fastq.gz and pair2.fastq.gz:

seqtk sample -s100 pair1.fastq.gz 100000 > <MYREADS>_1.fastq
seqtk sample -s100 pair2.fastq.gz 100000 > <MYREADS>_2.fastq

Replace <MYREADS> with an informative name: if you are dealing with sample 2638 and have subsampled 10,000 reads, then 2638_10000_1.fastq would be sensible.

You can read more about seqtk at C. Titus Brown’s tutorial page here: http://ged.msu.edu/angus/tutorials-2013/seqtk_tools.html
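The same-seed trick works because seqtk draws the same pseudo-random record indices from both files when given identical -s values. A rough Python analogue of the idea (in-memory records instead of real FASTQ parsing; subsample_pairs is a name of our own invention):

```python
import random

def subsample_pairs(reads1, reads2, n, seed=100):
    """Keep pairing intact by choosing the same n record indices from
    both files; the fixed seed plays the role of seqtk's -s flag."""
    assert len(reads1) == len(reads2)
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(reads1)), n))
    return [reads1[i] for i in keep], [reads2[i] for i in keep]

# Hypothetical in-memory records standing in for full FASTQ entries
r1 = ["@read%d/1" % i for i in range(10)]
r2 = ["@read%d/2" % i for i in range(10)]
sub1, sub2 = subsample_pairs(r1, r2, 4)
print(list(zip(sub1, sub2)))  # mates stay matched, position for position
```

Subsampling each file independently with different seeds would scramble the pairing, which is exactly what the shared -s100 above avoids.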

Taxonomic assignments with Metaphlan

The Metaphlan home page is here: http://huttenhower.sph.harvard.edu/metaphlan. Metaphlan is already installed on the Amazon Machine Image.

###Running Metaphlan

Metaphlan can use either BLAST or Bowtie2 for assignments. Bowtie2 is significantly faster, so we will use that. It should be already on your path.

Metaphlan doesn’t need paired-end information, so you can just join your two files together to make use of all the data:

cat <MYREADS>_1.fastq <MYREADS>_2.fastq > <MYREADS>.fastq

metaphlan.py <MYREADS>.fastq --bowtie2db ~/software/metaphlan/bowtie2db/mpa --bt2_ps sensitive-local --nproc 8

The output is sent to stdout, so you need to redirect it to a file; remember you can do this:

metaphlan.py <MYREADS>.fastq --bowtie2db ~/software/metaphlan/bowtie2db/mpa --bt2_ps sensitive-local --nproc 8 > <MYREADS>_results.txt

Hint: Each time you run Metaphlan it produces an intermediate file named after the input file, called <MYREADS>.bowtie2out.txt. Metaphlan will not run if that file already exists, so ensure you use a different input file name each time, or simply delete the intermediate file first.

Run metaphlan for subsamplings of 1000, 10000, 100000 and 1000000 reads.

##Questions

Q3. How long did each file take to run? (hint: prepend time to your command)

Examine the output of metaphlan (i.e. <MYREADS>_results.txt).

Q4. What are the relative proportions of the most abundant phyla? What about species?

Update the Google Docs spreadsheet here with your results: http://tinyurl.com/evomics-metaphlan

Q5. How many different genera were detected at each sampling level?

##Advanced usage

There are other useful values of -t you can use other than rel_ab, the default:

  • rel_ab: profiling a metagenome in terms of relative abundances
  • reads_map: mapping from reads to clades (only reads hitting a marker)
  • clade_profiles: normalized marker counts for clades with at least one non-null marker
  • marker_ab_table: normalized marker counts (only when > 0.0, and normalized by metagenome size if --nreads is specified)
  • marker_pres_table: list of markers present in the sample (thresholded at 1.0 unless specified otherwise with --pres_th)

Now, go back to the beginning, choosing a new, uncharacterised sample. Add the results to the Google Docs file.

If you finish this before the recap, then try yet another sample and then go on to the heatmap generation exercise under the Advanced section.

Visualizing results

MEGAN

MEGAN is both taxonomic analysis software and a visualization tool. It is very helpful when ‘mining’ your dataset.

MEGAN is quite memory-hungry and should be set up to use as much memory as you can afford. We have done this for you already, but instructions for doing this are in the Advanced section.

MEGAN takes the results of read alignments as input. This alignment information typically comes from BLAST-format files, but can also come from SAM files produced by short-read aligners such as Bowtie2 and BWA, or the aln format from RAPSearch2.

Because generating the BLAST-format files takes so long, we have precomputed them for you and generated MEGAN5-compatible files; see the bottom of the exercise for details on how to generate your own files for your own data.

Start by downloading the results file for a random subsampling of 250,000 reads from the 2638-H dataset:

Load this file into MEGAN (MEGAN is available under the “Other” menu on the Amazon Machine Image desktop).

For more information on using the various functionality in MEGAN, see the user manual: http://ab.inf.uni-tuebingen.de/data/software/megan5/download/manual.pdf

Please see this videocast for a recap of the MEGAN functionality:

http://www.youtube.com/watch?v=R8dpD_lj6Ts&feature=em-upload_owner

Have a look at the assignments.

Q6. How do they compare to your Metaphlan results? Are more taxa predicted or fewer?

Q7. Which taxon has the most assignments made to it?

Q8. What taxonomic level does this taxon belong to?

Q9. Why are so many reads being assigned at this level? Is it reasonable?

Hint: Inspect the read alignments by using right-mouse click (secondary click on Mac) on the nodes and choosing “Inspect reads”.

Q10. Are there species in there that are unexpected?

Q11. What do the alignments look like? Are they good quality? Are they full-length?

Can you change the LCA parameters to make the results more specific?

Try it.

Does that make the results “better” or “worse” ?

Experiment more.

How about now?

Q12. Are there remaining species that don’t make sense?

Q13. Inspect some taxa. Are there any you feel confident calling as present? Are there any you don’t feel confident about? Why?
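The LCA behaviour you have been probing can be sketched in a few lines of Python (toy taxonomy paths, not MEGAN’s actual implementation): a read with hits to several taxa is placed on the deepest node shared by all of its hits, so reads matching conserved sequence drift up the tree.

```python
def lca(paths):
    """Lowest common ancestor of taxonomy paths (root -> leaf):
    the deepest rank on which all of a read's hits agree."""
    prefix = []
    for ranks in zip(*paths):
        if len(set(ranks)) == 1:
            prefix.append(ranks[0])
        else:
            break
    return prefix[-1] if prefix else "root"

# A read hitting two E. coli strains stays species-specific...
hits = [
    ["Bacteria", "Proteobacteria", "Enterobacteriaceae", "Escherichia", "E. coli"],
    ["Bacteria", "Proteobacteria", "Enterobacteriaceae", "Escherichia", "E. coli"],
]
print(lca(hits))  # E. coli

# ...but one extra Salmonella hit pushes the assignment up to the family,
# which is why conserved reads pile up at higher taxonomic levels.
hits.append(["Bacteria", "Proteobacteria", "Enterobacteriaceae", "Salmonella", "S. enterica"])
print(lca(hits))  # Enterobacteriaceae
```

MEGAN’s min-score, top-percent and min-support parameters all change which hits enter this calculation, and hence how deep the assignments land.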

Try looking at the functional mode by clicking on the KEGG icon. Explore the different pathways and their presence/absence.

Q14. Does this sample have the Shiga-toxin genes in it? (hint: look under the Human Diseases category)

Now download some more files; you can choose one, or several, from this list! If you relate the file names to the original paper we published, you could even put together a hypothesis to test (note that the diagnosis is in the file name).

Try out the MEGAN comparison mode by opening a few files.

Q15. Which sample (of the ones you picked) has the most Escherichia ?

If you get this far before the recap (or in your spare time), try assembling one of the files with Velvet or one of the other assemblers you used last week.

##Advanced

Metaphlan

Heatmaps of multiple samples

Metaphlan comes with some plotting scripts which permit the generation of heatmaps.

You need to merge the results you generated from Metaphlan for each sample:

~/software/metaphlan/utils/merge_metaphlan_tables.py <table1> <table2> <table3> <tableN..> > merged_table.txt

Top 10 genera:

~/software/metaphlan/plotting_scripts/metaphlan_hclust_heatmap.py --in merged_table.txt --out test.png --tax_lev g --top 10

Open the test.png file after you have created it.

A more complex example (from the Metaphlan tutorial https://bitbucket.org/nsegata/metaphlan/wiki/MetaPhlAn_Pipelines_Tutorial)

/home/ubuntu/software/metaphlan/plotting_scripts/metaphlan_hclust_heatmap.py -c bbcry --top 25 \
   --minv 0.1 -s log \
   --in merged_abundance_table.txt \
   --out output_images/abundance_heatmap.png

Here the settings mean:

  • abundances for species only (default --tax_lev s)
  • in logarithmic scale (-s log)
  • reporting only the 25 most abundant clades (--top 25)
  • according to the 90th percentile of the abundances in each clade (default --perc 90)
  • with a custom colour map (-c bbcry)

In this example, the clustering is performed with “average” linkage (default -m average), using “Bray-Curtis” distance for clades (default -d braycurtis) and “correlation” for samples (default -f correlation).

Importing Metaphlan data into QIIME

QIIME is capable of analysing any tabulated data, not just 16S amplicon OTUs or taxa. It can therefore be adapted for the analysis of whole-genome shotgun data, for example taxonomic data from Metaphlan or functional pathway data from HUMANn/KEGG.

Once you have merged your Metaphlan tables, it is quite straightforward to analyse the data in QIIME:

/home/ubuntu/software/metaphlan/utils/merge_metaphlan_tables.py *.txt >real_data.txt
biom convert -i real_data.txt -o real_data.biom --table-type "taxon table"

Generate beta-diversity tables:

beta_diversity.py -i real_data.biom -m bray_curtis -o betadiv

Generate PCoA output:

principal_coordinates.py -i betadiv/bray_curtis_real_data.txt -o betadiv/bray_curtis_real_data.pc

Generate PCoA plots with EMPeror:

make_emperor.py -i betadiv/bray_curtis_real_data.pc -m qiime_metadata.tsv -o pca	
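Under the hood, principal_coordinates.py performs classical multidimensional scaling on the distance matrix. A minimal sketch of that computation (hypothetical three-sample distance matrix; assumes numpy is available):

```python
import numpy as np

def pcoa(dist):
    """Classical PCoA (metric MDS): embed samples so that Euclidean
    distances between the returned coordinates approximate `dist`."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    b = -0.5 * j @ (d ** 2) @ j              # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1]           # largest eigenvalues first
    vals, vecs = vals[order], vecs[:, order]
    keep = vals > 1e-10                      # drop zero/negative axes
    return vecs[:, keep] * np.sqrt(vals[keep])

# Three hypothetical samples: A and B similar, C far from both
d = np.array([[0.0, 0.1, 0.9],
              [0.1, 0.0, 0.8],
              [0.9, 0.8, 0.0]])
coords = pcoa(d)
print(coords[:, 0])  # PC1 separates sample C from A and B
```

Non-Euclidean dissimilarities such as Bray-Curtis can produce negative eigenvalues; dropping them, as above, is the usual pragmatic fix.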

High-throughput functional assignments with KEGG

These are the steps to reproduce the functional assignments at home; because it takes some time, you may not want to do this during the workshop (although you could do it during free/open study).

wget --http-user kegg --http-passwd ggek http://huttenhower.sph.harvard.edu/kegg/kegg.reduced.fasta.tar.bz2
tar xvfj kegg.reduced.fasta.tar.bz2

Install RAPSearch2 and add the binaries to your path.

prerapsearch -d kegg.reduced.fasta -n kegg.reduced.fasta.rap
rapsearch -q ../taxchris/"$tag".fastq -d kegg.reduced.fasta.rap -o "$tag" -z 16 -v 20 -b 1 -t q -a t

Analysis with HUMANn

Installing HUMANn

wget https://bitbucket.org/biobakery/humann/downloads/humann-v0.99.tar.gz
tar xvfz humann-v0.99.tar.gz
sudo apt-get install scons

Configuring HUMANn

By default, HUMANn looks for BLAST files which end in ‘.txt’ (or gzipped). So you need to copy the files ending in .m8 from RAPSearch2 into the input/ directory from the HUMANn installation and rename them to .txt.

Running HUMANn

scons

Analysing HUMANn output

The relevant files are the pathway files which start with 04 in the output directory.

Creating BLAST files for MEGAN

If you are interested in creating BLAST files for MEGAN so you can run it yourself, please see the reference material on my blog at:

http://pathogenomics.bham.ac.uk/blog/2013/08/methods-for-taxonomic-assignment-of-shotgun-whole-genome-metagenomics-reads/

####MEGAN Memory Requirement

By default, MEGAN uses 2Gb of RAM on 64-bit Linux. This isn’t always quite enough, so we will make it use 6Gb instead (you could use more, but never more than the physical memory available on the machine). To do this you need to edit the /opt/megan/MEGAN.vmoptions file and change the content from:

-Xmx2000m

to

-Xmx6000m

This needs to be done as sudo. If you are lazy, just run:

sudo bash -c 'echo "-Xmx6000m" > /opt/megan/MEGAN.vmoptions'

Check it is correct by running cat on the file:

cat /opt/megan/MEGAN.vmoptions

An outsider's guide to bacterial genome sequencing on the Pacific Biosciences RS

It had to happen eventually. My Twitter feed in recent times had become unbearable, with the insufferably smug PacBio mafia (that's you, Keith, Lex, Adam and David) crowing about their PacBio-completed bacterial genomes. So, if you can't beat 'em, join 'em. Right now we have a couple of bacterial genomic epidemiology projects that would benefit from a complete reference genome. In these cases our chosen reference genomes differ markedly in gene content and gene organisation, and their nucleotide identities are divergent enough that we worry about how much of the genome is truly mappable. And in terms of presenting the work for publication, there is a certain aesthetic appeal to having a complete genome.

And so, after several false starts relating to getting the strains selected and enough pure DNA isolated, we finally sent some DNA off to Lex Nederbragt at the end of last year for Pacific Biosciences RS (PacBio) sequencing!

This week I received the data back, and I thought it would be interesting to document a few things I have learnt about PacBio sequencing during this intriguing process.

It’s all about the library, stupid

The early PacBio publications in 2011 showing the results of de novo assembly of PacBio data weren't great, giving N50 results not materially better than 454 sequencing. This was despite the vastly longer read lengths achieved by the instrument; even then, mean read lengths of 2kb were achievable. Since then, incremental improvements to all aspects of the sequencing workflow have resulted in dramatic improvements to assembly performance, such that single-contig bacterial genome assemblies can now be expected routinely. This is probably best illustrated by the HGAP paper published last year, where single-contig assemblies for three different bacterial species, including E. coli, were demonstrated. HGAP is PacBio's assembly pipeline.
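N50 itself is simple to compute: it is the length of the sequence at which the longest sequences, taken together, first cover half of the total bases. A quick sketch with made-up contig sets:

```python
def n50(lengths):
    """N50: walk the lengths from longest to shortest and return the
    length at which the running total first reaches half the assembly."""
    target = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length

# Two hypothetical 45kb assemblies with very different contiguity
fragmented = [5000] * 9            # many mid-sized contigs
long_read = [30000, 10000, 5000]   # one contig dominates

print(n50(fragmented))  # 5000
print(n50(long_read))   # 30000
```

Same total assembly size, sixfold difference in N50, which is why the metric rewards the long-read assemblies discussed here.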

The main improvements have been:

You need a lot of DNA (like, a lot)

There is a trade-off between the amount of input DNA and which library prep you can do: 5 micrograms for 10kb libraries, and ideally 10 micrograms for 20kb libraries. That is not always a trivial amount to get your hands on, even for fast-growing bacteria. This is one of the things that would limit our use of PacBio for metagenomic pathogen discovery right now, because this amount of microbial DNA from a culture-free sample is basically impossible to get.

However, we did in fact manage to get a library made from 2.6ug of DNA, although in this case the BluePippin size cut-off had to be dropped to 4kb (from 7kb).

| Input DNA | Library prep | BluePippin cut-off | Number of reads | Average read length | Mbases |
|-----------|--------------|--------------------|-----------------|---------------------|--------|
| 2.6ug     | 10kb         | 4kb                | 36,696          | 5,529               | 202.9  |
|           |              |                    | 70,123          | 5,125               | 359.4  |
| >5ug      | 10kb         | 7kb                | 49,970          | 6,898               | 334.7  |
|           |              |                    | 58,755          | 6,597               | 387.6  |
| >10ug     | 20kb         | 7kb                | 42,431          | 6,829               | 289.9  |
|           |              |                    | 59,156          | 7,093               | 419.6  |

Table 1. PacBio library construction parameters and accompanying output statistics (per SMRTcell, 180 minute movie)

(An aside: I wonder if Oxford Nanopore's shorter-than-expected read lengths, reported in David Jaffe's AGBT talk, may just be a question of feeding the right length fragments into the workflow. Short fragments in equals short reads out.)

Choose the right polymerase

For bacterial de novo assemblies, all the experts I spoke to recommended the P4-C2 enzyme. This polymerase doesn't generate the very longest reads, but is recommended for de novo assembly because the newer P5-C3 chemistry has systematic error issues with homopolymers (as presented by Keith Robison at AGBT). P5-C3 is therefore recommended for scaffolding assemblies, or could be used in conjunction with P4-C2.

Longer reads may mean reduced loading efficiency

You want long fragments, but we were warned by several centres that 20kb libraries load less efficiently than 10kb libraries, meaning throughput is reduced. It was suggested we would need 3 SMRTcells to get 100x coverage for an E. coli-sized genome of 5Mb. However, that didn't really seem to be the case for us (Table 1).

Shop around for best prices

As you almost certainly can't afford your own PacBio, and even if you could your floor wouldn't support its weight, you will probably be using an external provider like I did. Prices vary, but the quotes I had were around £350 per SMRTcell and around £400 for a 10kb library prep, with 20kb libraries being more expensive. In the end I went with Lex Nederbragt and the Oslo CEES - not the very cheapest, but I know and trust Lex not to screw up my project and to communicate well, an important consideration (see Mick Watson's guide to choosing a sequencing provider). In the UK, the University of Liverpool CGR have just acquired a PacBio and would also be worth a try. TGAC also provide PacBio sequencing. In the US, Duke provide a useful online quote generator and the prices seem keen.

What language you speaking?

It's both refreshing and a bit unnerving to be doing something as familiar as bacterial assembly but having to wrap your head around a bunch of new nomenclature. Specifically, the terms you need to understand are the following:

  • Polymerase reads: these are basically just 'raw reads'
  • Subreads: aka 'reads of insert'. This is, I think, the sequence between the adaptors, factoring in that the PacBio library has a hairpin permitting reading of a fragment and its reverse strand. This term also relates to the now-obsolete circular consensus sequencing mode. Lex has a description here: http://flxlexblog.wordpress.com/2013/06/19/longing-for-the-longest-reads-pacbio-and-bluepippin/
  • Seeds: in the HGAP assembly process, these are the long reads which will be corrected using the shorter reads
  • Pre-assembled reads: a slightly confusing term. These are the seeds after correction; they are only 'assembled' in the sense that they are consensus sequences built by aligning the shorter reads to the long reads (PacBio uses an acyclic graph to generate the consensus)
  • Draft assembly: the result of the Celera assembler, before polishing with the read set

The key parameter for HGAP assembly is the Mean Seed Cutoff

[Figure: the seed length cutoff selects the set of longest reads which together give >30x coverage]

This parameter is critical: it defines how many of the longer, corrected reads go into the draft Celera assembly process. The default is to try to get 30x coverage from the longest reads in the dataset. This is calculated from the genome size you specify, which ideally you would know in advance. If the calculated cutoff drops below 6000 then 6000 will be used instead. You can also specify the mean seed cutoff manually. According to Jason Chin, the trade-off here is simply the time taken to correct the reads versus the coverage going into the assembly; I am not clear whether there is also a quality trade-off. Tuning this value did seem to make important differences to the assembly (a lower cut-off gave better results). The HGAP2 tutorial (for it is this version you want) is helpful on tuneable parameters for assembly.
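The default cutoff calculation can be sketched as follows (a toy reimplementation of the idea described above, not HGAP's actual code; the simulated read lengths are invented):

```python
import random

def seed_length_cutoff(read_lengths, genome_size, target_cov=30, floor=6000):
    """Sketch of the default HGAP seed cutoff: take reads from longest
    to shortest until they sum to target_cov * genome_size of sequence,
    and return the length of the last read taken (never below 6 kb)."""
    needed = target_cov * genome_size
    covered = 0
    cutoff = floor
    for length in sorted(read_lengths, reverse=True):
        covered += length
        cutoff = length
        if covered >= needed:
            break
    return max(cutoff, floor)

# Hypothetical read set: 40,000 reads averaging ~7 kb, for a 5 Mb genome
random.seed(1)
reads = [max(500, int(random.gauss(7000, 3000))) for _ in range(40000)]
print(seed_length_cutoff(reads, 5_000_000))
```

Everything longer than the returned cutoff becomes a seed; lowering the cutoff trades extra correction time for more reads entering the assembly.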

SMRTportal is cool, but flakey

I used Amazon m2.2xlarge (32Gb RAM) instances with the latest SMRTportal 2.1.0 AMI. About half the assembly jobs I started failed, with different errors, despite my doing the same thing each time. Sometimes a job worked with the same settings. I am not sure why this should be; maybe my VM was dodgy.

HGAP is slooooooow

Being used to smashing out a Velvet assembly in a matter of minutes, the glacial speed of the HGAP pipeline is a bit of a shock; on the Amazon AMI instance, assemblies were taking well over 24 hours. According to Keith Robison on Twitter this is because much of the pipeline is single-threaded, with multi-threading only occurring on a per-contig basis. So if you are expecting a single contig, you are bottlenecked onto a single processor. We therefore chose the m2.2xlarge instance type, because the high-memory instances have the fastest serial performance of the available instance types. This speed actually matters in a clinical context. Gene Myers (yes, THAT Gene Myers) presented at AGBT 2014 to say that he has a new assembler which can do an E. coli-sized genome in 30 minutes; it can't come soon enough as far as I'm concerned.

Single contig assemblies are cool

[Figure: a very long contig, yesterday]

Well, my first few attempts have given me two contigs, but that is cool enough. And it is pretty damn cool. If money was no object (and locally we are looking at a 20:1 cost ratio for PacBio sequencing over Illumina) then I would get them every time. As it is, for now, we will probably confine our use to when we really need to generate a quality reference sequence to map Illumina reads against, for example when investigating an outbreak without a good reference. Open pan-genome species like Pseudomonas and E. coli are good potential applications for this technology, where you have a reasonable expectation of large scale genome differences between unrelated genomes. Our Pseudomonas genomes went from 1000 contigs to 2 contigs, which does make a huge difference to alignments. As far as I can see it is pointless to use PacBio for monomorphic organisms, unless you are interested in the methylation patterns. Keith Robison wrote recently and eloquently predicting the demise of draft genome sequencing, but whilst the price differential remains I think this is premature.

Polished assemblies still need fixing

Inspecting the alignment of Illumina reads back to the polished assemblies reveals that errors remain; these are typically single-base insertions relative to the reference, which need further polishing (Torsten Seemann's Nesoni would be a good choice for this).

The Norwegian Sequencing Centre rocks

I'm very grateful to Lex Nederbragt, Ave Tooming-Klunderud and the rest of the staff of the Norwegian Sequencing Centre in Oslo for their help with our projects; they have been very helpful and I recommend them highly. Send them your samples and first-born!

Also many thanks to those on Twitter who have answered my stupid questions about PacBio, particularly Keith Robison, Jason Chin, Adam Phillippy and Torsten Seemann.

When I have more time I will dig into the assemblies produced and look a bit more at what they mean for both the bioinformatics analysis and the biology.

BIOM25: MEGAN analysis of metagenomes

##Whole-genome shotgun metagenomics learning objectives

  • Know how to use MEGAN to visualise taxonomic information (30 minutes)
  • Understand how the Lowest Common Ancestor parameters affect sensitivity and specificity of taxonomic assignments (20 minutes)
  • Know how to use MEGAN to visualise functional information (30 minutes)
  • Understand how to compare multiple samples in MEGAN (30 minutes)

##Datasets for this practical

For this practical we will use some real clinical metagenomics data from the E. coli outbreak in 2011, published in this JAMA article: http://jama.jamanetwork.com/article.aspx?articleid=1677374

MEGAN

MEGAN is both taxonomic analysis software and a visualization tool. It is very helpful when ‘mining’ your dataset.

MEGAN takes the results of read alignments as input. This alignment information typically comes from BLAST-format files, but can also come from SAM files produced by short-read aligners such as Bowtie2 and BWA, or the aln format from RAPSearch2.

Because generating the BLAST format files takes so long, we have precomputed them for you and generated MEGAN5-compatible files.

Start by downloading the results file for a random subsampling of 250,000 reads from the 2638-H dataset:

Installing MEGAN

Download MEGAN from this link:

Also download the MEGAN license file (right-click on link to Save link as…):

Install MEGAN into your Z: drive.

When asked, choose 3,000 megabytes for MEGAN max memory usage.

Run MEGAN.

Tell MEGAN the location of the MEGAN academic license file.

Load the file you downloaded into MEGAN.

For more information on using the various functionality in MEGAN, see the user manual: http://ab.inf.uni-tuebingen.de/data/software/megan5/download/manual.pdf

Please see this videocast for a recap of the MEGAN functionality:

http://www.youtube.com/watch?v=R8dpD_lj6Ts&feature=em-upload_owner

Have a look at the assignments.

Q1. Which taxon has the most assignments made to it?

Q2. What taxonomic level does this taxon belong to? (Hint: http://en.wikipedia.org/wiki/Taxonomic_rank)

Q3. Why are so many reads being assigned at this level? Is it reasonable?

Hint: Inspect the read alignments by using right-mouse click (secondary click on Mac) on the nodes and choosing “Inspect reads”.

Q4. Are there species in there that are thought to be pathogens? Any that are unexpected?

Can you change the Lowest Common Ancestor parameters to make the results more specific?

Try it.

Does that make the results “better” or “worse” ?

Experiment more. How about now?

Q5. Look in more detail at the alignments for some taxa. Are there any you feel confident calling as present? Are there any you don't feel confident about? Why?

Functional analysis

Try looking at the functional mode by clicking on the KEGG icon. Explore the different pathways and their presence/absence.

Q6. Which metabolic pathways does this sample definitely contain?

Q7. Are all the genes present in essential metabolic pathways, e.g. glycolysis? Is this expected? Why?

Q8. Does this sample contain the Shiga-toxin genes? (hint: look under the Human Diseases category) Is it missing any expected virulence factors?

Comparative analysis

Now, download another file:

Q9. How does this sample differ from the first in terms of composition?

Now, choose one of the following samples:

Q10. What are the most abundant taxa?

Q11. Is the Shiga toxin present in this sample?

Q12. Is there anything notable about this sample?

If you have time, try another sample!

Once several windows are open, you can try out the MEGAN comparison mode.

Q13. Which sample (of the ones you picked) has the most Escherichia present?

Evomics 2014: Metagenomics three hour practical

##Whole-genome shotgun metagenomics learning objectives

  • Know how to subsample reads from a FASTQ file without losing pairing information (20 minutes)
  • Know how to perform basic taxonomic assignment using a marker-gene method: MetaPhlan (10 minutes)
  • Assess the impact sampling has on discovery of taxa (20 minutes)
  • Know how to use MEGAN to visualise taxonomic information (30 minutes)
  • Understand how the Lowest Common Ancestor (LCA) parameters affect sensitivity and specificity of taxonomic assignments (20 minutes)
  • Know how to use MEGAN to visualise functional information (30 minutes)
  • Understand how to compare multiple samples in MEGAN (30 minutes)

###Optional, more advanced objectives (if you finish, or when back at home)

  • Know how to generate a heatmap of taxa from Metaphlan results (20 minutes)
  • Assemble a metagenome with Velvet (30 minutes)
  • Know how to perform basic taxonomic and functional assignment using a sequence similarity method: BLAST/RapSearch2 (10 minutes)

##Datasets for this practical

For this practical we will use some real clinical metagenomics data from the E. coli outbreak in 2011, published in this JAMA article: http://jama.jamanetwork.com/article.aspx?articleid=1677374

List the contents of the ~/shotgun_metagenomics/ directory (remember ~ is just a short-cut for your home directory, e.g. /home/ubuntu, so this is the same as /home/ubuntu/shotgun_metagenomics).

Q1. What format are the reads, and what can you deduce about the sequencing from them?

Q2. How many reads are in the biggest and the smallest files?
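Both questions can be attacked from the command line. A minimal sketch, using a toy one-read gzipped FASTQ as a stand-in for the real files in ~/shotgun_metagenomics/:

```shell
# Toy gzipped FASTQ; substitute a real file from the directory.
printf '@r1\nACGTACGT\n+\nIIIIIIII\n' | gzip > toy.fastq.gz

# Peek at the first record: the header style, read length and quality
# encoding all hint at the sequencing platform.
zcat toy.fastq.gz | head -4

# Each FASTQ record is exactly 4 lines, so reads = lines / 4.
echo $(( $(zcat toy.fastq.gz | wc -l) / 4 ))
```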

Initially, perform your work on the 2638-N12-STEC dataset.

Then, when you have run through the tutorial once, do it again but choose a dataset at random (or one you think might be particularly interesting!).

##Subsample reads with seqtk

seqtk is a useful ‘Swiss army knife’ for basic sequence manipulation.

Test it works:

$ seqtk

Usage:   seqtk <command> <arguments>

Command: seq       common transformation of FASTA/Q
         comp      get the nucleotide composition of FASTA/Q
         sample    subsample sequences
         subseq    extract subsequences from FASTA/Q
         trimfq    trim FASTQ using the Phred algorithm
         hety      regional heterozygosity
         mutfa     point mutate FASTA at specified positions
         mergefa   merge two FASTA/Q files
         randbase  choose a random base from hets
         cutN      cut sequence at long N
         listhet   extract the position of each het

####Subsampling reads

These commands show how to subsample 100,000 read pairs from two large paired FASTQ files (remember to use the same random seed for both files to keep pairing). Seqtk is smart enough to detect whether a file is zipped, so you don’t need to unzip it first, but it will always output uncompressed data.

seqtk sample -s100 pair1.fastq.gz 100000 > <MYREADS>_1.fastq
seqtk sample -s100 pair2.fastq.gz 100000 > <MYREADS>_2.fastq

Replace <MYREADS> with an informative name, so if you are dealing with sample 2638 and have subsampled 10000 reads then perhaps 2638_10000_1.fastq would be sensible.
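A quick sanity check after subsampling: if the same seed (-s) was used for both seqtk runs, the two output files contain the same number of reads. A sketch with toy files standing in for your <MYREADS>_1.fastq and <MYREADS>_2.fastq:

```shell
# Toy paired files standing in for <MYREADS>_1.fastq / <MYREADS>_2.fastq.
printf '@a/1\nACGT\n+\nIIII\n' > toy_1.fastq
printf '@a/2\nTTGG\n+\nIIII\n' > toy_2.fastq

# A FASTQ record is 4 lines, so divide the line count by 4.
n1=$(( $(wc -l < toy_1.fastq) / 4 ))
n2=$(( $(wc -l < toy_2.fastq) / 4 ))

# Matching counts are a necessary (not sufficient) sign pairing survived.
[ "$n1" -eq "$n2" ] && echo "pairing preserved: $n1 read(s) per file"
```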

You can read more about seqtk at C. Titus Brown’s tutorial page here: http://ged.msu.edu/angus/tutorials-2013/seqtk_tools.html

##Taxonomic assignments with Metaphlan

The Metaphlan home page is here: http://huttenhower.sph.harvard.edu/metaphlan. Metaphlan is already installed on the Amazon Machine Image.

###Running Metaphlan

Metaphlan can use either BLAST or Bowtie2 for assignments. Bowtie2 is significantly faster, so we will use that. It should already be on your path.

Metaphlan doesn’t need paired-end information, so you can just join your two files together to make use of all the data:

cat <MYREADS>_1.fastq <MYREADS>_2.fastq > <MYREADS>.fastq

metaphlan.py <MYREADS>.fastq --bowtie2db ~/software/metaphlan/bowtie2db/mpa --bt2_ps sensitive-local --nproc 8

The output will be sent to stdout, so you need to redirect it to a file; remember you can do this:

metaphlan.py <MYREADS>.fastq --bowtie2db ~/software/metaphlan/bowtie2db/mpa --bt2_ps sensitive-local --nproc 8 > <MYREADS>_results.txt

Hint: Each time you run Metaphlan it will produce an intermediate output file named after the input file. Metaphlan will not run if that file already exists, so ensure you use a different input file each time, or simply delete the file it produces first: it is named <MYREADS>.bowtie2out.txt.

Run Metaphlan for subsamplings of 1000, 10000, 100000 and 1000000 reads.
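One way to drive the four runs is a small loop. The sketch below only echoes each command (a dry run) so you can check the assumed file names — e.g. 2638_1000.fastq — before executing anything; drop the echo and its quotes to run for real.

```shell
# Dry run: print the Metaphlan command for each subsampling depth.
# File names like 2638_1000.fastq are assumptions; match your own naming.
for n in 1000 10000 100000 1000000; do
    echo "time metaphlan.py 2638_${n}.fastq --bowtie2db ~/software/metaphlan/bowtie2db/mpa --bt2_ps sensitive-local --nproc 8 > 2638_${n}_results.txt"
done
```

Remember to remove any leftover <MYREADS>.bowtie2out.txt between runs, or Metaphlan will refuse to start.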

##Questions

Q3. How long did each file take to run? (Hint: prepend time to your command.)

Examine the output of metaphlan (i.e. <MYREADS>_results.txt).

Q4. What are the relative proportions of the most abundant phyla? What about species?
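Metaphlan’s output is a two-column, tab-separated table (clade, relative abundance). A hedged sketch, using a toy results file in place of <MYREADS>_results.txt, of pulling out phylum-level rows — those containing p__ but no class-level c__ label:

```shell
# Toy results file standing in for <MYREADS>_results.txt.
printf 'k__Bacteria|p__Proteobacteria\t80.0\nk__Bacteria|p__Firmicutes\t20.0\n' > toy_results.txt

# Keep phylum-level rows (p__ present, c__ absent), most abundant first.
grep 'p__' toy_results.txt | grep -v 'c__' | sort -k2 -nr
```

The same idea helps with Q5: genus-level rows contain g__ but no s__, so grep 'g__' <MYREADS>_results.txt | grep -vc 's__' counts the genera detected.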

Update the Google Docs spreadsheet here with your results: http://tinyurl.com/evomics-metaphlan

Q5. How many different genera were detected at each sampling level?

##Advanced usage

There are other useful values of -t you can use other than rel_ab, the default:

  • rel_ab: profiling a metagenome in terms of relative abundances
  • reads_map: mapping from reads to clades (only reads hitting a marker)
  • clade_profiles: normalized marker counts for clades with at least one non-null marker
  • marker_ab_table: normalized marker counts (only when > 0.0, and normalized by metagenome size if --nreads is specified)
  • marker_pres_table: list of markers present in the sample (thresholded at 1.0 unless otherwise specified with --pres_th)

Now, go back to the beginning, choosing a new, uncharacterised sample. Add the results to the Google Docs file.

If you finish this before the recap, then try yet another sample and then go on to the heatmap generation exercise under the Advanced section.

##Visualizing results

###MEGAN

MEGAN is both a taxonomic analysis tool and a visualization tool. It is very helpful for ‘mining’ your dataset.

####MEGAN Memory Requirement

By default, MEGAN uses 2Gb of RAM on 64-bit Linux. This isn’t always quite enough, so we will make it use 6Gb instead (you could use more, but never more than the physical memory available on the machine). To do this, edit the /opt/megan/MEGAN.vmoptions file and change the content from:

-Xmx2000m

to

-Xmx6000m

This needs to be done as sudo. If you are lazy, just run:

sudo bash -c 'echo "-Xmx6000m" > /opt/megan/MEGAN.vmoptions'

Check it is correct by running cat on the file:

cat /opt/megan/MEGAN.vmoptions

MEGAN takes the results of read alignments as input. This alignment information typically comes from BLAST-format files, but can also come from SAM files produced by short-read aligners such as Bowtie2 and BWA, or the aln format from RAPSearch2.

Because generating the BLAST-format files takes so long, we have precomputed them for you and generated MEGAN5-compatible files; see the bottom of the exercise for details on how to generate your own files for your data.

Start by downloading the results file for a random subsampling of 250,000 reads from the 2638-H dataset:

Load this file into MEGAN (MEGAN is available under the “Other” menu on the Amazon Machine Image desktop).

For more information on using the various functionality in MEGAN, see the user manual: http://ab.inf.uni-tuebingen.de/data/software/megan5/download/manual.pdf

Please see this videocast for a recap of the MEGAN functionality:

http://www.youtube.com/watch?v=R8dpD_lj6Ts&feature=em-upload_owner

Have a look at the assignments.

Q6. How do they compare to your Metaphlan results? Are more taxa predicted or fewer?

Q7. Which taxon has the most assignments made to it?

Q8. What taxonomic level does this taxon belong to?

Q9. Why are so many reads being assigned at this level? Is it reasonable?

Hint: Inspect the read alignments by using right-mouse click (secondary click on Mac) on the nodes and choosing “Inspect reads”.

Q10. Are there species in there that are unexpected?

Q11. What do the alignments look like? Are they good quality? Are they full-length?

Can you change the LCA parameters to make the results more specific?

Try it.

Does that make the results “better” or “worse” ?

Experiment more.

How about now?

Q12. Are there remaining species that don’t make sense?

Q13. Inspect some taxa. Are there any you feel confident calling as present? Are there any you don’t feel confident about? Why?

Try looking at the functional mode by clicking on the KEGG icon. Explore the different pathways and their presence/absence.

Q14. Does this sample contain the Shiga-toxin genes? (Hint: look under the Human Diseases category.)

Now download some more files; you can choose one or several from this list! If you relate the file names to the original paper we published, you could even put together a hypothesis to test (note that the diagnosis is in the file name).

Try out the MEGAN comparison mode by opening a few files.

Q15. Which sample (of the ones you picked) has the most Escherichia?

If you get this far before the recap (or in your spare time), try assembling one of the files with Velvet or one of the other assemblers you used last week.

##Advanced

###Metaphlan

####Heatmaps of multiple samples

Metaphlan comes with some plotting scripts which permit the generation of heatmaps.

You need to merge the results you generated from Metaphlan for each sample:

~/software/metaphlan/utils/merge_metaphlan_tables.py <table1> <table2> <table3> <tableN..> > merged_table.txt

Top 10 genera:

~/software/metaphlan/plotting_scripts/metaphlan_hclust_heatmap.py --in merged_table.txt --out test.png --tax_lev g --top 10

Open the test.png file after you have created it.

A more complex example (from the Metaphlan tutorial: https://bitbucket.org/nsegata/metaphlan/wiki/MetaPhlAn_Pipelines_Tutorial):

/home/ubuntu/software/metaphlan/plotting_scripts/metaphlan_hclust_heatmap.py -c bbcry --top 25 \
   --minv 0.1 -s log \
   --in merged_abundance_table.txt \
   --out output_images/abundance_heatmap.png

Here the settings mean:

  • abundances for species only (default --tax_lev s)
  • in logarithmic scale (-s log)
  • reporting only the 25 most abundant clades (--top 25),
  • according to the 90th percentile of the abundances in each clade (default --perc 90)
  • with a custom color map (-c bbcry).

In this example, the clustering is performed with “average” linkage (default -m average), using “Bray-Curtis” distance for clades (default -d braycurtis) and “correlation” for samples (default -f correlation).

###Creating BLAST files for MEGAN

If you are interested in creating BLAST files for MEGAN so you can run it yourself, please see the reference material on my blog at:

http://pathogenomics.bham.ac.uk/blog/2013/08/methods-for-taxonomic-assignment-of-shotgun-whole-genome-metagenomics-reads/