Evomics 2014: Metagenomics three hour practical

##Whole-genome shotgun metagenomics learning objectives

###Optional, more advanced objectives (if you finish, or when back at home)

##Datasets for this practical

For this practical we will use some real clinical metagenomics data from the E. coli outbreak in 2011, published in this JAMA article: http://jama.jamanetwork.com/article.aspx?articleid=1677374

List the contents of the the ~/shotgun_metagenomics/ directory (remember ~ is just a short-cut for your home directory, e.g. /home/ubuntu, so this is the same as /home\/ubuntu/shotgun_metagenomics).

Q1. What format are the reads, and what can you deduce about the sequencing from them?

Q2. How many reads are in the biggest and the smallest files?

Initially, perform your work on the 2638-N12-STEC dataset.

Then, when you have run through the tutorial once, do it again but choose a dataset at random (or one you think might be particularly interesting!).

##Subsample reads with seqtk

seqtk is a useful ‘Swiss army knife’ for basic sequence manipulation.

Test it works:

$ seqtk

Usage:   seqtk <command> <arguments>

Command: seq       common transformation of FASTA/Q
         comp      get the nucleotide composition of FASTA/Q
     	 sample    subsample sequences
    	 subseq    extract subsequences from FASTA/Q
     	 trimfq    trim FASTQ using the Phred algorithm
     	 hety      regional heterozygosity
     	 mutfa     point mutate FASTA at specified positions
     	 mergefa   merge two FASTA/Q files
     	 randbase  choose a random base from hets
     	 cutN      cut sequence at long N
     	 listhet   extract the position of each het

####Subsampling reads

This command shows how to subsample 100,000 read pairs from two large paired FASTQ files (remember to use the same random seed to keep pairing). Seqtk is smart enough to understand whether a file is zipped or not, so you don’t need to unzip them first, but it will always output uncompressed files.

seqtk sample -s100 pair1.fastq.gz 100000 > <MYREADS>_1.fastq
seqtk sample -s100 pair2.fastq.gz 100000 > <MYREADS>_2.fastq

Replace <MYREADS> with an informative name, so if you are dealing with sample 2638 and have subsampled 10000 reads then perhaps 2638_10000_1.fastq would be sensible.

You can read more about seqtk at C. Titus Brown’s tutorial page here: http://ged.msu.edu/angus/tutorials-2013/seqtk_tools.html

Taxonomic assignments with Metaphlan

The Metaphlan home page is here: http://huttenhower.sph.harvard.edu/metaphlan. Metaphlan is already installed on the Amazon Machine Image.

###Running Metaphlan

Metaphlan can use either BLAST or Bowtie2 for assignments. Bowtie2 is significantly faster, so we will use that. It should be already on your path.

Metaphlan doesn’t need paired-end information, so you can just join your two files together to make use of all the data:

cat <MYREADS>_1.fastq <MYREADS>_2.fastq > <MYREADS>.fastq

metaphlan.py <MYREADS>.fastq --bowtie2db ~/software/metaphlan/bowtie2db/mpa --bt2_ps sensitive-local --nproc 8

The output will be sent to stdout, so you need to redirect it to a file, remember you can do this:

metaphlan.py <MYREADS>.fastq --bowtie2db ~/software/metaphlan/bowtie2db/mpa --bt2_ps sensitive-local --nproc 8 > <MYREADS>_results.txt

Hint: Each time you run Metaphlan it will produce an output file named according to the input file. Metaphlan will not run if that file already exists, so ensure you use a different input file each time, or simply delete the file it produces first: it is is named <MYREADS>.bowtie2out.txt.

Run metaphlan for subsamplings of 1000, 10000, 100000 and 1000000 reads.


Q3. How long did each file take to run? (hint prepend time to your command)

Examine the output of metaphlan (i.e. <MYREADS>_results.txt).

Q4. What are the relative proportions of the most abundant phyla? What about species?

Update the Google Docs spreadsheet here with your results: http://tinyurl.com/evomics-metaphlan

Q5. How many different genera were detected at each sampling level?

##Advanced usage There are other useful values of -t you can use other than rel_ab, the default:

Now, go back to the beginning, choosing a new, uncharacterised sample. Add the results to the Google Docs file.

If you finish this before the recap, then try yet another sample and then go on to the heatmap generation exercise under the Advanced section.

Visualizing results


MEGAN is both a taxonomic analysis software, and a visualization tool. It is very helpful when ‘mining’ your dataset.

####MEGAN Memory Requirement

By default, MEGAN uses 2Gb of RAM on 64-bit Linux. This isn’t quite enough sometimes, we will make it use 6Gb instead (you could use more, but never more than the physical memory available on the machine). To do this you need to edit the /opt/megan/MEGAN.vmoptions file and change the content from:




This needs to be done as sudo. If you are lazy, just run:

sudo bash -c 'echo "-Xmx6000m" > /opt/megan/MEGAN.vmoptions'

Check it is correct by cating the file:

cat /opt/megan/MEGAN.vmoptions

MEGAN takes the results of read alignments as input. This alignment information typically comes from BLAST-format files, but can also come from SAM files produced by short-read aligners such as Bowtie2 and BWA, or the aln format from RAPSearch2.

Because generating the BLAST format files takes so long, we have precomputed them for you and generated MEGAN5-compatible files, but see at the bottom of the exercise for details on how to generate your own files for your data:

Start by downloading the results file for a random subsampling of 250,000 reads from the 2638-H dataset:

Load this file into MEGAN (MEGAN is available under the “Other” menu on the Amazon Machine Image desktop).

For more information on using the various functionality in MEGAN, see the user manual: http://ab.inf.uni-tuebingen.de/data/software/megan5/download/manual.pdf

Please see this videocast for a recap of the MEGAN functionality:


Have a look at the assignments.

Q6. How do they compare to your Metaphlan results? Are more taxa predicted or fewer?

Q7. Which taxon has the most assignments made to it?

Q8. What taxonomic level does this taxon belong to?

Q9. Why are so many reads being assigned at this level? Is it reasonable?

Hint: Inspect the read alignments by using right-mouse click (secondary click on Mac) on the nodes and chosing “Inspect reads”.

Q10. Are there species in there that are unexpected?

Q11. What do the alignments look like? Are they good quality? Are they full-length?

Can you change the LCA parameters to make the results more specific?

Try it.

Does that make the results “better” or “worse” ?

Experiment more.

How about now?

Q12. Are there remaining species that don’t make sense?

Q13. Inspect some taxa. Are there any you feel confident calling as present? Are there any you don’t feel confident about? Why?

Try looking at the functional mode by clicking on the KEGG icon. Explore the different pathways and their presence/absence.

Q14. Does this sample have the Shiga-toxin genes in it? (hint: look under Human Diseases category) ?

Now, download some more files, you can choose one, or several from this list! If you relate the file names to the original paper we published you could even put together a hypothesis to test (note the diagnosis is in the file name).

Try out the MEGAN comparison mode by opening a few files.

Q15. Which sample (of the ones you picked) has the most Escherichia ?

If you get this far before the recap (or in your spare time), try assembling one of the files with Velvet or one of the other assemblers you use last week.



Heatmaps of multiple samples

Metaphlan comes with some plotting scripts which permit the generation of heatmaps.

You need to merge the results you generated from Metaphlan each sample:

~/software/metaphlan/utils/merge_metaphlan_tables.py <table1> <table2> <table3> <tableN..> > merged_table.txt

Top 10 genera:

~/software/metaphlan/plotting_scripts/metaphlan_hclust_heatmap.py --in merged_table.txt --out test.png --tax_lev g --top 10

Open the test.png file after you have created it.

A more complex example (from the Metaphlan tutorial https://bitbucket.org/nsegata/metaphlan/wiki/MetaPhlAn_Pipelines_Tutorial)

/home/ubuntu/software/metaphlan/plotting_scripts/metaphlan_hclust_heatmap.pymetaphlan_hclust_heatmap.py -c bbcry --top 25 \
   --minv 0.1 -s log
   --in merged_abundance_table.txt
   --out output_images/abundance_heatmap.png

Here the settings mean:

In this example, the clustering is performed with “average” linkage (default -m average), using “Bray-Curtis” distance for clades (default -d braycurtis) and “correlation” for samples (default -f correlation).

Creating BLAST files for MEGAN

If you are interested in creating BLAST files for MEGAN so you can run it yourself, please see the reference material on my blog at: