Adaptor trim or die: Experiences with Nextera libraries

One of the first posts I did on this blog, way back in September 2009, was about my experiences with filtering and trimming Illumina sequences, and it proved rather popular. To date, it has been viewed a whopping 8,560 times!

But funnily enough, since that post was written my attitude towards filtering Illumina data has slowly changed. I was increasingly finding that aggressive quality trimming made little to no difference to my de novo assemblies, and in some cases actually made them worse. The likely explanation is the evolution of Illumina base-calling accuracy. At the time I was dealing with early GA2 data, which had serious 3' quality drop-off issues, even with reads as short as 36 cycles (that's the true definition of short-read sequencing!). Nowadays, with the MiSeq and HiSeq instruments, read qualities still decline at the 3' end but mostly remain usable. Increasingly I found that base quality was not the major determinant of a good de novo assembly; rather, it came down to the usual old chestnuts of read length, coverage depth and insert size (for paired-end sequencing).

However, a recent analysis of a troublesome dataset has led me to revise my thoughts on the need to trim sequences routinely. This is due to the introduction of Nextera library preparations. Nextera is a really useful technique: it uses a transpososome (a transposase-transposon end complex) to fragment the genome (semi-)randomly while adding a specific sequence to the fragment ends. An additional round of PCR then incorporates the sequencing adaptors and multiplexing barcodes. Simples, and no need for physical shearing methods. You clean up short fragments with AMPure beads, and there is an optional size-selection step which many people opt to skip (of which more later in this post).

However, there is a fly in the ointment, and it is becoming a major problem now that the MiSeq is targeting ever-longer read lengths. The libraries we have seen often have a short median fragment size, sometimes less than 200 bases. Combined with the Illumina MiSeq V2 500-cycle kits run in paired-end 2 x 250 base mode, this means you will frequently be reading straight through the adaptor into some kind of crazy void-space (does this have a name?). Unless you routinely size-select your fragments, the read length will often exceed the fragment length: sequence a 180-base fragment with a 250-cycle read, for example, and the last ~70 cycles are adaptor and whatever lies beyond it.

Other than being a waste of sequencing reagents, this proves surprisingly fatal to de novo assembly, presumably because the adaptor sequences form highly connected nodes in the assembly graph which prevent contigs from forming. It's not such a big deal for mapping applications, as most short-read aligners will happily soft-clip the unmapped 3' bases off the read.

I was recently asked to look at a dataset which exhibited such an extreme case of this phenomenon that I thought it might be helpful to share it with others. The data come from a bacterial whole-genome shotgun sequencing project where assembly of the reads was producing particularly terrible results.

How terrible? Well, here's a Velvet assembly of the raw, untrimmed reads, using some arbitrary settings (k=43, exp_cov auto, cov_cutoff auto). The Velvet commands were:

velvet_1.2.07/velveth out 43 -shortPaired -fastq -separate fastqfile1 fastqfile2
velvet_1.2.07/velvetg out -exp_cov auto -cov_cutoff auto
..
Final graph has 3778954 nodes and n50 of 30, max 540, total 51121254, using 170538/3413828 reads

Bear in mind this is a fairly typical, GC-rich bacterial genome. The stats tell us that from the >3.4m 250-base reads in the dataset we end up with nearly 3.8 million contigs, with an N50 of 30! We don't need the Assemblathon in this case to know that this is … bad. A few things are notable here: despite having >3.4m reads, Velvet reports a median coverage depth of just 1.0, and only ~170,000 of the reads were actually used in the assembly. Something ain't right, clearly.

Of course we have been naughty and done an assembly without QCing our data first. Here's the FastQC plot for the first read of each pair from the untrimmed dataset:


Corrrr ... don't like the look of that very much.

Now, after going and letting off steam at your sequencing centre, you might be tempted to trim this dataset down and try and salvage something from it, right?

Heng Li's seqtk is as good as anything for a quick and dirty trim. Its trimfq command uses the Phred ("modified Mott") trimming algorithm, which finds the maximum-scoring subsequence of each read, where every base contributes a score of the error-rate threshold minus that base's error probability. The default error-rate threshold is 0.05.

seqtk trimfq fastqfile1 > fastqfile1_trimmed
seqtk trimfq fastqfile2 > fastqfile2_trimmed
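
If you want to trim more (or less) aggressively, the error-rate threshold can be changed with trimfq's -q option. The value below is just an illustration of the knob, not what I actually ran here:

seqtk trimfq -q 0.01 fastqfile1 > fastqfile1_trimmed_strict
seqtk trimfq -q 0.01 fastqfile2 > fastqfile2_trimmed_strict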

Let's check out the FastQC plots now. They look nicer, right?

So, we'd expect better results from Velvet, right? Let's re-run the fun:

 

 Final graph has 3778929 nodes and n50 of 30, max 540, total 51120953, using 170555/3413828 reads

COMPUTER SAYS NO.

OK, what's going on? There are a few things we could do to prove this is adaptor contamination. We could try to estimate the insert size by mapping to a reference sequence; that would give a reasonable hint that adaptor read-through is the problem. But let's pretend we don't have a reference; we can't even assemble the sequences to map them back!
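
For the record, if a reference were to hand, that sanity check only takes a couple of commands. Something along these lines (bwa and samtools are simply my tool choices here, and reference.fasta is a placeholder) would spit out the average insert size:

bwa index reference.fasta
bwa mem reference.fasta fastqfile1 fastqfile2 | samtools sort -o mapped.bam -
samtools stats mapped.bam | grep "insert size"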

Is there a clue in the FastQC plots?

Looking at the regular FastQC k-mer plot, there is some wackiness going on:

 

There's a definite enrichment for some k-mers early in the read, but this doesn't really give us enough information to answer the question.

Luckily, we can ask FastQC to check for k-mers of length 10 instead (I had to increase heap memory to get Java to play ball here):

fastqc -k 10 fastqfile1
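
Exactly how you give Java more memory depends on your install. One possibility, assuming the standard fastqc wrapper script is on your PATH (an assumption about your setup, not an official FastQC option), is to find the heap setting baked into the wrapper and edit it upwards:

grep -n "Xmx" "$(which fastqc)"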

OK

Well, it's now fairly clear that many of the enriched k-mers form part of the Nextera adaptor sequence (I have not put this in the post as Illumina are a bit funny about them, but you can find them easily enough from a Google search), and they occur as early as 35 bases into the read.

There is also enrichment for two other k-mers later on in the read. Guessing, I would think these are k-mers from the barcode for this sample.
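
A crude cross-check is simply to count reads that contain the adaptor verbatim; here ADAPTORSEQ is a stand-in for the Nextera adaptor sequence, which I am still not reproducing:

# grep -c counts matching lines; the adaptor only turns up on sequence lines,
# so this is a decent approximation of the number of affected reads
grep -c ADAPTORSEQ fastqfile1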

OK, let's trim those suckers off. I like to use Trimmomatic. Some people like Scythe (I haven't tried it, but it's by the awesome Vince Buffalo so it's bound to be good). There used to be a lovely Wiki page comparing all the trimmers, but I can't find it right now (please post in the comments!)
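
For what it's worth, an adaptor-clipping Trimmomatic run on paired-end Nextera data looks something like the sketch below. The jar filename, the adapter file (NexteraPE-PE.fa ships in the adapters directory of recent Trimmomatic releases, and you will need its full path) and the 2:30:10 clipping thresholds are lifted from the Trimmomatic documentation rather than from what was actually run on this dataset:

java -jar trimmomatic-0.32.jar PE -phred33 \
  fastqfile1 fastqfile2 \
  fastqfile1_paired fastqfile1_unpaired \
  fastqfile2_paired fastqfile2_unpaired \
  ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 MINLEN:50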

Re-run Velvet, one last time, on the adaptor-trimmed dataset:
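
The Velvet invocation is the same as before, just pointing at the paired output files from the adaptor trimmer (the filenames here are placeholders for whatever your trimmer wrote out):

velvet_1.2.07/velveth out_clipped 43 -shortPaired -fastq -separate fastqfile1_paired fastqfile2_paired
velvet_1.2.07/velvetg out_clipped -exp_cov auto -cov_cutoff auto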

Median coverage depth = 34.625060
Final graph has 851 nodes and n50 of 64229, max 227743, total 7908057, using 2978347/3133288 reads

That's a bit more like it! Of course much better contig stats would have been achieved had the median insert size been greater than 250 bases, ideally greater than 450 bases.

*Many thanks to Ruth Miller of the British Columbia CDC for allowing me to share this instructive dataset on the blog!*