Bacterial genomics course

Practicals hand-out

##Conventions in this document

This is normal text

This is text describing a unix command, e.g. grep

This is a command you need to enter on the command line

This command has one word in bold that you need to change

For example, this might be the name of the file that will contain the input for a command

Below is a table where you can fill in the answers to the question(s)

Question Answer
What’s your name?  

Software/services used in this practical

  • FastQC: a read quality analysis software
  • seqtk: a read quality trimmer
  • Velvet: a de novo assembler
  • xBASE Annotation: an online service for annotation
  • Artemis Comparison Tool

##Setting up the environment for the first time

First, copy the files from the USB stick to your Desktop.

Drag and drop the entire folder entitled NGS into your home directory.

We’ll be running some of the practical on the Linux cluster. After logging-in, change directory to the NGS folder:

cd NGS

And then set up the paths to executables:

source env.sh

This step needs to be repeated any time you open a new terminal window. This makes it so commands can be executed without putting the full path in each time.

Practical 1 - Understanding reads, preparing and QC of sequence data

Learning points:

  • Understand read file formats
  • How to judge a QC report
  • How to filter/trim Illumina reads

Let’s have a look at some reads. We have some E. coli sequence data from the German outbreak in 2011, sequenced on the MiSeq instrument.

less data/Sample280.fastq
Question Your answer
What is the read length?  
What specific instrument did this come from?  
Is this sequence data paired-end?  

(Hint: read up on http://en.wikipedia.org/wiki/FASTQ_format)

We will be using a program called FastQC to assess the reads further. The program is available with a graphical user interface, or as a command-line only version. We will use the graphical one. It takes a single fastq file (the file can be compressed with gzip) as input.

For the first part of the practical we will use the file data/Sample280.fastq (from now on, all paths will be relative to the NGS directory which should be your current directory. To see what directory you are in you can type the command pwd).

FastQC has a GUI interface, but you can also run it as a script:

fastqc data/Sample280.fastq

The program will tell you how far it has come, and should finish in a minute or two. Check that it finished without error messages.

The results will be put into a ZIP file called data/Sample280_fastqc.zip

Transfer this file to your Windows PC (if you have problems I have created a direct link at http://static.xbase.ac.uk/files/results/nick/Sample280_fastqc.zip

Study the results. The plot called “Per base sequence quality” shows an overview of the range of quality scores across all based at each position in the fastq file. The y-axis shows quality scores and the x-axis shows the read position. For each read position, a boxplot is used to show the distribution of quality scores for all reads. The yellow boxes represent quality scores within the inter-quartile range (25% - 75%). The upper and lower whiskers represent 10% and 90% point. The central red line shows the median of the quality values and the blue line shows the mean of the quality values.

A rule of thumb is that a quality score of 30 indicates a 1 in 1000 probability of error (99.9% accuracy) and a quality score of 20 indicates a 1 in 100 probability of error (see the Wikipedia page on the fastq format at http://en.wikipedia.org/wiki/Fastq). Therefore, the higher the score the better the base call. You will see from the plots that the quality of the base calling deteriorates along the read (as is usually the case with Illumina sequencing).

As a rule of thumb, a minimum requirement for Per Base Sequence Quality is that the first 50 bases should have a median and mean quality score over 20, otherwise there may have been a problem with the sequencing run.

Now, answer these questions:

Question Your answer  
How many reads were there in total in the sequence file?    
How many bases were there in total in the file?  
Which part(s) of the reads would you say are of low quality?    
Does this run meet the minimum requirement?    
What, expressed as a p-value is the mean accuracy of the last base in the sequence?    

Let’s trim these reads using seqtk

seqtk trimfq data/Sample280.fastq > my_trimmed_file.fastq

Load your newly-trimmed file into FastQC. How does the quality plot differ?

http://static.xbase.ac.uk/files/results/nick/Sample280_trimmed_fastqc.zip

Question Your answer
How many reads were there in total in the sequence file ?  
How many bases were there in total in the file?  
What, expressed as a p-value is the mean accuracy of the last base in the sequence?  

Practical part 2: de novo Assembly

###Assembling short-reads with Velvet

We will use a de Bruijn graph assembler called Velvet to assemble Illumina reads.

We will assemble the E. coli O104:H4 280 strain from a traveller who returned to the UK from Germany, where they acquired the infection from eating salad. The strain was sequenced on an Illumina MiSeq, it is the same reads we analysed earlier. From the sequence data we want to characterise the multi-locus sequence type and gain clues as to the pathogen’s biology.

Sequencing centres may supply paired-end reads in one of two formats:

  • Separate (each pair in its own file)
  • Interleaved (one pair follows its mate in the file)

For simplicity we have supplied interleaved files. If you have separate files then you have to pass different parameters to Velvet.

Velvet requires an index file to be built before the assembly takes place. Longer k-mers result in a more stringent assembly with fewer overlaps, at the expense of coverage. There is no definitive value of k for any given project. However, there are several rules - k must be less than the read length and it should be an odd number.

Firstly we are going to run Velvet in single-end mode, ignoring the pairing information. Later on we will incorporate this information.

Pick a value of k between 21 and 99 as a starting point (don’t everyone choose the same value!) and run velveth to build the hash index.

(You should choose a better, more descriptive name than my_assembly_directory)

velveth my_assembly_directory value_of_k -shortPaired -fastq \
data/Sample280_trimmed.fastq

Note, the \ character permits commands to be entered over more than one line.

Now look in my_assembly_directory, you should see the following files:

Log
Roadmaps
Sequences

Have a look at the file Log - this is a useful reminder of what commands you typed to get this assembly result, useful for reproducing results later on.

Then we run velvetg to create contigs:

velvetg my_assembly_directory

This will take a minute or two.

Look again at my_assembly_directory, you should see the following extra files;

contigs.fa
Graph
LastGraph
PreGraph
stats.txt

The important files are:

  • contigs.fa - the assembly itself
  • Graph - a textual representation of the contig graph
  • stats.txt - a file containing statistics on each contig
Question Answer
What k-mer did you use?  
What is the N50 of your assembly according to Velvet?  
What is the largest contig size?  
What is the total assembly size?  

Group task: Fill out the Google Document with the results for your k-mer.

http://tinyurl.com/kmersheet

If you are running ahead, try some other values of k too.

An optimised assembly

Now re-run Velvet with the k-value the group likes best.

Now we will look at the effect of paired-end reads on the assembly quality.

For this to work well we need to set a new parameter, exp_cov. This parameter tells Velvet the mean coverage for non-repetitive parts of the genome you are assembling. With this information it can decide whether ambiguous parts in the network graph can be confidently traversed, or whether the sequence needs to be broken into contigs.

Run Velvet again but let it decide the values automatically

velvetg assemblies/280_miseq -cov_cutoff auto -exp_cov auto

Analysing the assembly

Let’s take a look at this assembly in more detail, and see if the information we have can reveal anything about the organism’s biology:

bin/contigs_stats.pl -t Velvet assemblies/280_miseq/contigs.fa -plot

hello

Whole genome comparison

So it turns out that this sequence type has been seen before. In fact there is an E. coli with the same sequence type in Genbank, although it is not in the MLST database! It is called E. coli 55989. Interestingly this strain is from the enteroaggregative lineage. Let’s compare our sequence with this one and see what the differences are. One easy way of doing this is with the xBASE annotation interface:

http://xbase.ac.uk/annotation/

  1. Select your reference sequence (Escherichia coli 55989)
  2. Upload your contigs from your assembly
  3. Choose a strain identifier
  4. Add your email address
  5. Submit

It should take 10-20 minutes but you don’t have to wait as I have already run the file, we have got the results for you already.

http://static.xbase.ac.uk/results/annotation//1V5o7OQSIFVIJ3Zi0KQz1vZbzsT3IrpC/

And the files for loading into the Artemis Comparison Tool are in assemblies

Now select “Launch ACT” from the web page. Open the act.jnlp file that opens. Choose Run.

When ACT has loaded, choose Open from the file menu.

Load the following files (they are on the USB stick):

Sequence file 1: annotations/act_reference_file.txt Sequence file 2: annotations/act_crunch_file.txt Sequence file 3: annotations/act_sequence_file.txt

Have a browse around ACT which permits the comparison of two genomes. Play with some of the options: GC track.

Question Answer
What is the typical average nucleotide identity for aligned regions?  
Find 5 examples of genes which are in E. coli 55989 but not in E. coli O104:H4 STEC 280  

Group task: finding novel regions

Find the coding sequences which are in the E. coli O104:H4 STEC strain but not in E. coli 55989. Between you, “crowd-source” the function of these genes, using BLAST, Pfam or whatever you like. Do it in batches for speed.

Post your results here as a group:

http://tinyurl.com/genecrowd

What genes are potentially responsible for the severe presentation of this outbreak?

Advanced

Advanced; if you are ahead of time, try this:

We estimate the value of exp_cov by using the app velvet-estimate-exp_cov.pl script.

velvet-estimate-exp_cov.pl my_assembly_directory/stats.txt less

We can also estimate another useful parameter, called cov_cutoff. Cov_cutoff removes low-frequency k-mers in the graph which may represent sequence errors or contamination or polymorphisms to help detangle the graph and improve the assembly.

Try running velvetg again with your estimates for good parameters

velvetg assemblies/280_miseq -cov_cutoff my_value_cov_cutoff -exp_cov my_value_exp_cov

If you have time, try varying the value of cov_cutoff and see how it affects the assembly.

What is the N50 of your assembly? What is the largest contig size? What is the total assembly size? How did the automatic values compare to your predictions?

Group task: finding novel regions

Find the coding sequences which are in the E. coli O104:H4 STEC strain but not in E. coli 55989. Between you, “crowd-source” the function of these genes, using BLAST, Pfam or whatever you like. Do it in batches for speed.

Post your results here as a group: https://docs.google.com/spreadsheet/ccc?key=0AkNPpmDaw5GhdHdYbXp5ZXpZNnJ4UzdWek5MMWlMUXc What genes are potentially responsible for the severe presentation of this outbreak?

Advanced:

MLST typing of your assembly

Now we have an assembly it would be nice to find out something about this strain to get clues as to its origin. Let us see if it is of a known ST.

We are going to retrieve the MLST type directly from the contigs. We could use BIGSDb for this. There is also the option of the web server at DTU. Visit it’s website http://cge.cbs.dtu.dk/services/MLST/

Select “Escherichia coli #1” (Achtman E. coli scheme) Select “Assembled genomes/contigs” Upload the file

Carry on with the next part while you wait for the results. What is the MLST type? Where was this sequence type previously isolated, and when? What disease did it cause?

Hint: Use the Achtman MLST database at http://mlst.ucc.ie/mlst/dbs/Ecoli/GetTableInfo_html it is around record 2,200

Mappin

Circular genome image generators

  • CGView
  • CGView Comparison Tool (http://stothard.afns.ualberta.ca/downloads/CCT/)

  • BRIG (uses CGView)
  • BLASTatlas (SOAP Web service with perl script)
  • Circos

Notes on CGView Comparison Tool

  • To use draft genomes, contigs need to be concatenated first

COST training school: Bioinformatics for microbial community analysis

I will be helping teach one day of this course which is fully-funded if you are a student from a COST-participating country... see below for details:

Deadline is next week, so hurry if you are interested.

COST training school ES1103: Bioinformatics for microbial community analysis

Dates: December 11th- 14th (the school will begin at 1pm on Tuesday and finish at 5pm on Friday)

Location: University of Liverpool, Centre for Genomic Research (CGR), Life Sciences Building, Liverpool L69 7ZB

Organisers: Dr Christopher Quince (Christopher.quince@glasgow.ac.uk), Dr Christiane Hertz-Fowler (chf@liv.ac.uk)

Lecturers: Dr Christopher Quince (Christopher.quince@glasgow.ac.uk), Dr Nick Loman (n.j.loman@bham.ac.uk) and Dr Martin Hartmann (martin.hartmann@microbiome.ch)

Description: The aim of this three and a half day workshop will be to give students an overview of the tools and bioinformatics techniques available for the analysis of next generation sequence data from microbial communities. The emphasis during the first three days will be on the analysis of amplicon sequences, for example 16S rRNA or fungal ITS, generated using next generation sequencing platforms: principally 454 but also Illumina and Ion Torrent. The entire process from initial sequence data through filtering, noise removal, OTU generation and taxonomic classification will be addressed during the first day and a half. Detailed instruction on performing noise-removal, using AmpliconNoise, and the use of the Mothur pipeline will be provided. The emphasis on the third day will be on multivariate statistics for the analysis and ecological interpretation of the resulting data sets using R. The final day of the workshop will consist of an introduction to microbial genomics, covering de novo sequence assembly, and gene annotation. Metagenome analysis will be discussed but not in great detail. The format will comprise a mixture of lectures and hands-on tutorials where students will process example data sets in real-time. Students will also be encouraged to bring their own data for analysis.

Application procedure: Funded places exist through the COST action ES1103 for students from participating member states (see http://www.cost.eu/domains_actions/essem/Actions/ES1103?parties). Applications should be sent by e-mail to Dr Christopher Quince (Christopher.quince@glasgow.ac.uk). Applications should consist of one paragraph describing the student’s motivation for attending the course and level of bioinformatics experience. The latter is purely to allow the correct pitching of content and all levels of prior knowledge will be catered for. However, basic Linux and sequence analysis skills would be helpful. Applications received before Wednesday November 28th will be considered and students selected who will benefit most from the training. Students from any career level from under-graduate to professorial can attend but this is a hands-on workshop and preference will be given to people who will analyse their own data at some point in the future.

Funding: COST will reimburse each student up to 1000 EUs to cover registration, travel, hotel and food. Hotel reservations have been made in a block booking at the nearby Liner Hotel (75 EU per night). If you do not want to stay at this hotel please notify us in your application. There will be an 80EU registration fee payable in advance by the student. This will cover lunches during the course.

Understanding insert size calculations from various software

Debugging some weird insert size distributions on recent MiSeq / Nextera sequencing runs which don’t seem to relate to the BioAnalyzer plots:

Weird insert sizes

Mapped to reference with BWASW, image produced by Qualimap.

Two problems here;

  • The distribution is properly weird (bimodal)
  • The sequencing lab thinks the median insert size is 800bp from their BioAnalyzer plot

I’ve seen weirdness in insert size distributions before. From vague memories of online discussions I had a half-remembrance that not all aligners calculate insert sizes the same way, e.g. some report the distances between reads, and some report the total fragment length.

I would consider the correct definition of the insert size to be the total length of the original fragment put into sequencing. Therefore it would include the read length and the distance between fragments. This should be consistent with the definition of column 9 in the SAM format specification: http://samtools.sourceforge.net/SAM1.pdf

TLEN: signed observed Template LENgth. If all segments are mapped to the same reference, the unsigned observed template length equals the number of bases from the leftmost mapped base to the rightmost mapped base. The leftmost segment has a plus sign and the rightmost has a minus sign. The sign of segments in the middle is unde ned. It is set as 0 for single-segment template or when the information is unavailable

A quick test:

Extract 10kb of E. coli for test reference sequence

from Bio import SeqIO
import sys
rec = SeqIO.parse(open(sys.argv[1]), "fasta").next()
SeqIO.write(rec[0:int(sys.argv[2])], sys.stdout, "fasta")
view raw extractseq.py hosted with ❤ by GitHub
rsync -av rsync://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_O157_H7_EDL933_uid57831/NC_002655.fna .
python extractseq.py NC_002655.fna 10000 > ecoli_10k.fa
bwa index ecoli_10k.fa
bowtie2-build ecoli_10k.fa ecoli_10k.fa

Generate 500bp inserts, 100bp reads

Just a little script to pull out perfect reads from the reference file, with defined distance apart:

from Bio import SeqIO
import sys
import random
rec = SeqIO.parse(open(sys.argv[1]), "fasta").next()
insert_size = int(sys.argv[2])
rec_length = len(rec)
read_length = int(sys.argv[3])
number_seqs = int(sys.argv[4])
prefix = sys.argv[5]
fh1 = open("%s_1.fastq" % (prefix,), "w")
fh2 = open("%s_2.fastq" % (prefix,), "w")
for n in xrange(0, number_seqs):
random_pos = random.randint( 0 + insert_size, rec_length - insert_size )
insert = rec[random_pos : random_pos + insert_size]
fh = open("%s_1.fastq" % (prefix,))
print >>fh1, "@read%s/1" % (n,)
print >>fh1, str(insert[0 : read_length].seq)
print >>fh1, "+"
print >>fh1, "A" * read_length
print >>fh2, "@read%s/2" % (n,)
print >>fh2, str(insert[0-read_length :].reverse_complement().seq)
print >>fh2, "+"
print >>fh2, "A" * read_length
view raw extractpairs.py hosted with ❤ by GitHub
python extractpairs.py ecoli_10k.fa 500 100 1000 ec500
bwa bwasw ecoli_10k.fa ec500_1.fastq ec500_2.fastq
bowtie2 -I 0 -X 1000 -x ecoli_10k.fa -1 ec500_1.fastq -2 ec500_2.fastq

Result:

  • Bowtie2 and BWASW both report 500, -500 for each pair as the insert size in the SAM output (column 9). As expected.
  • Qualimap reports 500 as the insert size.

All as expected here. Perhaps the problem is with overlapping inserts?

Generate 300bp inserts, 250bp reads

python extractpairs.py ecoli_10k.fa 300 250 1000 ec300
bwa bwasw ecoli_10k.fa ec300_1.fastq ec300_2.fastq
bowtie2 -I 0 -X 1000 -x ecoli_10k.fa -1 ec300_1.fastq -2 ec300_2.fastq

Result:

  • BWASW reports 300, 200 for each pair. This is clearly wrong. And Qualimap run on this output calls the insert size 250 as a result!
  • Bowtie2 correctly reports 300, -300 for each pair.

Generate 200bp inserts, 150bp reads

  • BWASW reports 100, 200 for each pair. Again, wrong. Is it calculating the insert length as 2 * read length - insert_size ??
  • Bowtie2 correctly reports 200, -200

OK, so looks like BWASW might be the problem (0.6.2).

Let’s try BWA regular:

bwa aln ecoli_10k.fa ec300_1.fastq > ec300_1.aln
bwa aln ecoli_10k.fa ec300_2.fastq > ec300_2.aln
bwa sampe -a 1000 ecoli_10k.fa ec300_1.aln ec300_2.aln ec300_1.fastq ec300_2.fastq
  • BWA in aln/samse correctly reports 200, -200

OK, so the first problem we know is the BWASW is reporting the insert sizes wrong. I will let Heng Li know.

But why are the BioAnalyzer plots still in such disagreement?

Let’s try again with Bowtie2.

Weird insert sizes resolved

OK, maybe this makes a bit more sense now. But is the drop off at 250 bases real (due to correct clean-up), or an artifact of the read length?

Midlands Sequencing and Bioinformatics Meeting: First Meeting Tuesday, 11th December

Getting to the meeting

The meeting is in the Centre for Systems Biology, 3rd Floor Haworth Building, University of Birmingham. Full details on how to get here. Note you will need to ring the bell to gain entrance.

If you haven't made up your mind how to travel, I'd strongly suggest taking a train, the University has a campus railway station called "University (Birmingham)" and the Centre for Systems Biology is a 2 minute walk down-hill from there. The University station is 5 minutes from Birmingham New Street (take a train in the direction of Redditch, Hereford or Longbridge).

If you are coming by car, park in either the North or the South Car Park which are for visitors. It's pay and display but only a few pounds. Don't try to get onto campus as even if you get past security you will not likely find a space. Further details on parking and public transport here ...

Lunch won't be provided, but we'll have some snacks at half-time (authentic Birmingham samosas).

Going for a balti afterwards: we'll take taxis to the Balti Triangle and then back to New Street. We'll aim to be back in town by 8pm so that people can get trains. Those that can stay out later may want to get a drink at the German Market with us.

Introduction

I posted previously about setting up a regular, informal sequencing and bioinformatics meeting in the Midlands area (open to all-comers). Things are coming together nicely with about 30 people from other institutions expressing interest, and so I'm pleased to announce the date of the first meeting will be the afternoon of Tuesday, 11th December. We have some great guest speakers lined up already.

Please register for the meeting (the first one is free) via the Doodle poll. If you want to be notified about future meetings, or discuss how future meetings could work please sign up to the Google group.

The loose theme for this meeting is metagenomics and assembly (and metagenomics assembly).

Lastly, if you want to give a short talk at this meeting, please get in touch with me directly. The schedule is now full!

Agenda

Venue: Centre for Systems Biology, 3rd Floor, Haworth Building, University of Birmingham. How to get here.

12.45 Registration

13.00 Introduction (Nick Loman, Birmingham)

13.10 - 13.50 Sequence Analysis using Compressed Data Structures (Jared Simpson, Wellcome Trust Sanger Institute. Jared is the author of the ABySS and SGA assemblers.)

13.50 - 14.15 Metagenomics for Diagnosis of Bacterial Infection: Bioinformatics Challenges (Nick Loman, University of Birmingham)

14.15 - 14.30 The Population Structure and Genomic Architecture of S. aureus (Richard Everitt, University of Reading)

14.30 - 15.00 Refreshments

15.00 - 15.40 Compressive computing for next-generation sequencing data: making assembly fast, easy, and sensitive (C. Titus Brown, Michigan State University. Make sure to read his blog. Titus is responsible for an exciting new approach to genome assembly he calls 'digital normalization'.)

15.40 - 16.00 High-throughput microbial population genomics using cortex_var (Zamin Iqbal, University of Oxford)

16.00 - 16.10 Hardcore MiSeq Action: run optimisation, dealing with low-diversity samples and some tasty 16S recipes (Josh Quick, University of Birmingham. Josh was one of the systems integration engineers on the MiSeq before joining our group, and there's very little he doesn't know about this machine).

16.10 - 16.30 The bioinformatics clinic
Format: bring a single slide introducing a problem you would like to discuss with the audience.

16.30 Ends - all welcome to come for a Brummy balti with the speakers.

20.00 (very latest) Return to Birmingham New Street