A simple guide to variant calling with BWA, samtools, VarScan2

31 Oct 2012

Mapping and variant calling

Just as soon as we have got on top of this outbreak, cases are being reported in France which are also of the same sequence type. By now, a reference genome for the German outbreak strain has been nearly completed and is available at reference_genomes/280_flxplus.fna

Using SNPs we would like to determine how related the strains are, and how the two outbreaks may be related. Is there evidence of chains of transmission between outbreaks?

Indexing our reference for first use

First we need to index our reference file:

bwa index -a is reference_genomes/280_flxplus.fna
samtools faidx reference_genomes/280_flxplus.fna

Perform an alignment

Then we do an alignment. We will use bwasw to do this (-t 4 uses 4 CPUs):

bwa bwasw -t 4 reference_genomes/280_flxplus.fna data/French_1.fastq data/French_2.fastq -f my_output_file.sam

What is the calculated insert size and standard deviation?

Analysing SAM format

Have a look at the SAM file which has been produced, e.g.

less my_output_file.sam

Have a look at the first mapping:

How much of the read mapped to the reference?
What orientation is the read in?
What is the reported mapping quality?
How many mismatches are there?

Hint: refer to http://genome.sph.umich.edu/wiki/SAM

In order to do SNP calling we have to convert the reads to binary format and sort them:

samtools view -bS my_output_file.sam -o my_output_file.bam
samtools sort my_output_file.bam my_output_file.sorted

We can look at the consensus sequence using the samtools mpileup command.

samtools mpileup -f reference_genomes/280_flxplus.fna my_output_file.sorted.bam | less

What do you notice about the output?
By eyeballing the data, what is the depth of coverage for your sample?
Is the coverage consistent across the genome? Does it vary? Are there places where it varies more?
Can you spot any potential SNPs?

Let’s try and call some SNPs. We.ll do this by passing the output of mpileup through to VarScan which will perform some basic filtering for us, that we will customise later.

samtools mpileup -f reference_genomes/280_flxplus.fna my_output_file.sorted.bam | java -jar bin/VarScan.jar mpileup2snp --output-vcf --strand-filter 0

Examine the output of VarScan.

How many SNPs are called?
How many of them fail a test? Which ones?

Load the Savant browser

Savant.sh

Have a look at a selection of SNPs. How do those which fail the filters differ from those that pass? What is the difference between HET and HOM SNPs? What does this mean for a bacterial strain? How could we get rid of HET SNPs?

mpileup2snp
This command calls SNPs from an mpileup file based on user-defined parameters:
	USAGE: java -jar VarScan.jar mpileup2snp [mpileup file] OPTIONS
	mpileup file - The SAMtools mpileup file
 
	OPTIONS:
		--min-coverage          Minimum read depth at a position to make a call [8]
		--min-reads2            Minimum supporting reads at a position to call variants [2]
		--min-avg-qual          Minimum base quality at a position to count a read [15]
		--min-var-freq          Minimum variant allele frequency threshold [0.01]
		--min-freq-for-hom  Minimum frequency to call homozygote [0.75]
		--p-value Default p-value threshold for calling variants [99e-02]
		--strand-filter         Ignore variants with >90% support on one strand [1]
		--output-vcf            If set to 1, outputs in VCF format
		--variants		Report only variant (SNP/indel) positions (mpileup2cns only) [0]

In the Grad 2011 paper they used the following criteria for SNP filtering:

Potential SNPs from the Illumina sequences were called by GATK Unified Genotyper (22), filtering the data according to the following parameters: >90% agreement among reads; at least five unambiguously mapped reads; no greater than 50% mapping ambiguity; insertions and deletions were ignored. Over 97% of the bases in the genome of each outbreak isolate fulfilled these criteria.

Can you reproduce these methods using VarScan? Trying setting different options:

samtools mpileup -f reference_genomes/280_flxplus.fna my_output_file.sorted.bam | java -jar bin/VarScan.jar mpileup2snp --output-vcf modifiers_to_varscan_output

How many SNPs do you have now?
Are any of these SNPs dubious?

Examine the SNPs in Savant and check you are happy with them all.

Advanced: look for indels with mpileup2indel

samtools mpileup -f reference_genomes/280_flxplus.fna my_output_file.sorted.bam | java -jar bin/VarScan.jar mpileup2indel --output-vcf --strand-filter 0

As a group we can decide on some reasonable filtering settings, then we will do a large mpileup of all the available reads, and review the results in PhyloViz at the end.

Loman Labs

A simple guide to variant calling with BWA, samtools, VarScan2

Mapping and variant calling

Indexing our reference for first use

Perform an alignment

Analysing SAM format

Related Posts

Balti and Bioinformatics: 14th November 2019 14 Oct 2019

Food in Birmingham 11 Aug 2019

How to generate consensus sequences using nanopolish 21 Dec 2018