Nanopore Sequencing Book: DNA extraction and purification methods

DNA extraction strategies for nanopore sequencing

Joshua Quick and Nicholas J. Loman

Institute of Microbiology and Infection, School of Biosciences, University of Birmingham, B15 2TT

This is the author proof of a chapter for the upcoming textbook edited by Dave Deamer and Daniel Branton entitled: Nanopore Sequencing: An Introduction published by World Scientific. We are grateful to have been permitted to retain the copyright for our chapter and it is reproduced in full. Please consider requesting an inspection copy and purchasing the book for your course!

Table of Contents

DNA extraction strategies for nanopore sequencing

Joshua Quick and Nicholas J. Loman

Introduction

Choosing a DNA extraction strategy

Basics of DNA extraction

DNA extraction kits

Spin column kits

Gravity flow columns

Magnetic beads

Manual techniques

Sample pre-processing

Cell lysis

Digestion with Proteinase K

The phenol-chloroform method

Ethanol precipitation

Dialysis

Megabase sized DNA

Input requirements for ultra-long reads

Quality control of DNA samples

Fragment size assessment

Absorbance ratios

Fluorescence spectroscopy

Size-selection with SPRI beads

Buffer exchange with SPRI beads

Size selection by gel electrophoresis

Repairing damaged DNA

Storage of HMW DNA

Handling HMW DNA

References


As far as we can tell, read lengths in nanopore sequencing are dependent on the library preparation rather than any limitation of the sequencing chemistry. Long reads are useful for many applications, but in particular for de novo assembly. This is because long reads span repeats in the genome, resulting in longer assembled contiguous sequences (contigs)[1]. The longest reads generated using nanopore sequencing now exceed 1 megabase in length (1.2 Mbp at time of publishing[2]), but even longer reads will likely be achievable with further improvements in DNA extraction and library preparation methods. Such long reads will be extremely helpful for assembling difficult regions of the genome such as eukaryotic centromeres and telomeres. It may one day even be possible to sequence entire bacterial chromosomes, or even eukaryotic chromosomes, in a single read! Possibly the only limit to read length is the rate of naturally occurring single-strand breaks in DNA.

This chapter will discuss the most useful extraction techniques for nanopore sequencing, focusing on best practices for routine work, experimental design and quality control. Finally, we will discuss ongoing efforts to generate ‘ultra-long reads’.

Choosing a DNA extraction strategy

While it may be tempting always to pick a strategy that optimises for high molecular weight DNA, this comes at a significant cost in terms of time and labour (Figure 1). Sample input, read length and cost are all highly interdependent factors, and designing a good experiment first requires an understanding of how these relate. If the goal is to assemble a bacterial genome (for example, to produce a reference sequence), we know that obtaining reads above the ‘golden threshold’ of 7 kilobases (the length of the ribosomal RNA operon) will in most cases result in a finished genome (meaning circularised with no gaps)[3]. The importance of the ribosomal RNA operon is that it is typically the longest repetitive region in a bacterial genome, so having reads longer than this threshold enables these repeats to be ‘anchored’ to unique parts of the genome, permitting their assembly. Therefore, for many bacterial genomes, a simple spin column extraction (typically yielding fragments of up to 60 kilobases) would be appropriate, as fragment sizes will be sufficient to generate the read length required.

If, however, you are sequencing a complex metagenome with a mix of closely related species or strains (an extremely challenging assembly problem), then longer reads will be important for strain reconstruction using assembly. Similarly, complex genomes such as the human genome will benefit from the longest possible reads due to long repeats such as those in the centromeres, some of which still remain largely unassembled 15 years after the announcement of the first human reference genome. In these cases, cellular material is not limiting, so it is reasonable to attempt a high molecular weight DNA extraction.

Other applications may be limited by input quantity. Many clinical and environmental samples have intrinsically low biomass, so in order to extract sufficient DNA for sequencing these will need to be extracted with high-recovery methods such as magnetic beads or spin columns. An understanding of the biological question at hand and of the limitations of the available sample type is therefore key to designing a good sequencing experiment.

Figure 1

Figure 1. Approximate average sizes of DNA fragments isolated by the different methods discussed in this chapter.

Basics of DNA extraction

Hundreds of DNA extraction methods have been described in the literature. They are often developed for specific cell or sample types; however, they usually share some common steps: cell lysis, purification and elution/precipitation. Here we will describe some of the routine methods used in DNA extraction.

DNA extraction kits

The simplest way to get started is to use a DNA extraction kit. These kits offer a high level of consistency and excel for low-input sample types. They are, however, more expensive than manual methods, typically costing around $5 per sample, and fragment length will be limited to around 60 Kb.

Spin column kits

This is the most common type of DNA extraction kit you will encounter in a laboratory. Spin columns are so called because reagents are added to the top of the tube and then forced through the binding matrix when spun in a centrifuge. In some cases, columns include cell lysis reagents. Binding, washing and eluting the DNA can be done rapidly in this way, with the whole process taking around an hour. In addition, you can perform many extractions in parallel by using more positions in the centrifuge rotor. Spin columns are based on chemistry developed in the 1990s[4, 5] using either silica or anion exchange resins to reversibly bind DNA, allowing it to be separated from cellular proteins and polysaccharides.

It is worth understanding how spin columns work, both to appreciate why they are so effective at purifying and recovering DNA from a wide range of samples and to recognise their weaknesses. Most kits use high concentrations of guanidinium hydrochloride in the lysis buffer[6]. Guanidinium hydrochloride is a chaotropic agent that disrupts the hydrophobic interactions between water and other molecules. This is a good choice because it both lyses cells by denaturing membrane proteins and precipitates DNA by disrupting its hydration shell, which maintains its solubility in aqueous conditions. Under these conditions DNA binds to the binding matrix in the column, allowing proteins and other contaminants to pass through. The DNA bound to the silica resin membrane can be washed using 70% ethanol to remove contaminating proteins and salts, including the lysis buffer itself. DNA is eluted off the column by adding a low ionic strength buffer such as 10 mM Tris and incubating for a few minutes. The DNA resolubilizes in the aqueous solution and the purified DNA is eluted from the column by centrifugation. DNA is sheared during binding and elution as it is forced through the porous resin under the large physical forces of centrifugation.

For common Gram-negative bacteria (such as E. coli), fragments of >60 Kb can be extracted using a spin column kit in less than 30 minutes. Spin columns have a binding capacity of about 5-10 µg and can be run in batches, making them suitable for extracting large numbers of samples.

Gravity flow columns

Gravity flow columns (a common example is the Qiagen Genomic-tip)[7] employ the same binding technology as spin columns but come in larger sizes, the largest of which has a binding capacity of 500 µg (the 500/G tip, also known as a ‘MaxiPrep’). These are not placed in a centrifuge but left in a rack, allowing the lysate and wash solutions to drip through by gravity. They can be used to recover DNA with an average size of 100-200 Kb due to the gentle handling of the sample, but are much slower. Unlike spin columns, the DNA is eluted from the column in a large volume and then precipitated with isopropanol to concentrate it. DNA produced using this method should be of higher quality than that produced using a spin column. Gravity flow columns are especially useful for isolating large quantities of DNA and may be an appropriate choice for many nanopore applications.

Magnetic beads

Magnetic beads have many uses in molecular biology as they can be made with a wide variety of functional groups on the surface[8, 9]. Beads used for isolation of genomic DNA are uniform polystyrene and magnetite microspheres with a carboxyl coating. In the presence of a chaotropic agent, DNA transitions from solution to a condensed ‘ball-like’ state in which it is attracted to the beads[10]. This allows the DNA to be purified by washing with ethanol before being eluted in a low ionic-strength solution. The negative charges of the carboxyl groups help to repel the similarly charged DNA off the beads. The main advantage of magnetic beads is speed of processing, as DNA binding occurs very quickly in solution. Such techniques are also amenable to automated handling and are used in many commercial high-throughput robotic platforms.

Manual techniques

Figure 2

Figure 2. The order of steps required for DNA extraction, with optional sample pre-processing and fragmentation.

Sample pre-processing

Certain sample types, particularly plant and animal tissue, must be ground up before lysis in a process called homogenization to increase the surface area for cell lysis. This is usually done by freezing with liquid nitrogen and then grinding in a Dounce homogenizer or pestle and mortar[11]. The liquid nitrogen serves a dual purpose: it makes the sample very brittle for efficient grinding and also inhibits nuclease activity which would otherwise degrade the DNA.

Spheroplasting is the process of digesting away the cell wall while keeping the cell intact, by keeping the cells in a sucrose buffer to protect them from osmotic shock[12]. The name spheroplast derives from the spherical shape cells adopt once the cell wall is digested. This process allows even cells with thick cell walls, such as yeast, to be lysed easily by the addition of detergent.

Cell lysis

Cell lysis is the process of breaking open cells to release DNA, usually performed using detergents, enzymes or physical methods. Bacteria, yeast, plants and animals have very different cellular structures and therefore different lysis methods are employed. Commonly used detergents include sodium dodecyl sulfate (SDS)[13] for bacterial and mammalian cells, and cetyltrimethylammonium bromide (CTAB) for plants[14]. Strong detergents like SDS also serve to protect DNA from degradation by inactivating nucleases. Many Gram-positive bacteria are too tough to lyse with detergents due to their peptidoglycan cell wall, so lysis solutions may also incorporate additional enzymes. Lysozyme breaks down the cell wall by catalyzing the hydrolysis of specific bonds in peptidoglycan. Specialist enzymes are used for Staphylococcus (lysostaphin) and Streptomyces (mutanolysin), where lysozyme is ineffective. Yeast cell walls are composed of two layers of ß-glucan, which require lyticase and zymolyase to break down. Some spore-forming bacteria and fungi may have additional layers of peptidoglycan or chitin, making them extremely resistant to enzymatic or chemical lysis, so mechanical methods may be needed. The most common is bead beating, in which ‘beads’ of various sizes made from materials like glass or zirconium are vigorously shaken with the sample in a homogenizer, disrupting tissues or smashing open cells. Bead beating is very effective at releasing DNA from cells; however, due to its vigorous nature it also causes a lot of DNA shearing, making it unsuitable for isolating high molecular weight DNA. It may be necessary to refer to the literature to determine the most appropriate lysis method for your specific sample type.

Digestion with Proteinase K

Proteinase K is a serine protease which cleaves peptide bonds in proteins. It is often added to lysis buffers as it remains highly active in the presence of SDS[15] and chaotropic salts and at elevated temperature (50°C), conditions which help unfold proteins and make them more accessible for digestion. It is also useful for inactivating nucleases. These properties make it very useful for extracting high molecular weight DNA.

The phenol-chloroform method

Phenol was first used to purify nucleic acids by Kirby in 1956, initially to isolate RNA[16] and later DNA[17]. It is an organic compound that acts as a solvent for proteins and is able to separate them from DNA. Phenol is slightly water-soluble but has a higher specific gravity, so a mixture of the two can be separated by centrifugation into two phases. Adding chloroform as an additional organic solvent helps prevent phenol carry-over, as phenol is more soluble in chloroform than in water. DNA with an average size of 150 Kb or even much larger can be isolated using the phenol-chloroform method if performed carefully, partly due to the reduced physical forces compared to column-based techniques[18]. It is also very effective at removing nucleases. This method was once the standard approach for DNA extraction but has largely fallen out of favor, partly due to its toxicity (it requires special handling procedures) as well as the easy availability of column-based methods. However, this approach is now seeing a resurgence for nanopore sequencing due to its effectiveness in generating long fragments; we have generated read datasets with an N50 of >100 kb and a maximum read length of >1 megabase using this method[2].

To perform phenol-chloroform purification, an equal volume of phenol or phenol-chloroform is added to the aqueous solution of lysed cells. These are mixed on a rotator until a fine emulsion forms. After centrifugation the two phases separate, with the aqueous phase on top and the denser organic phase below. At pH 8.0, DNA and RNA partition into the aqueous phase whereas proteins move into the organic phase, purifying the DNA. Between them a white precipitate of proteins usually forms, known as the interphase. This process is often repeated a few times to ensure the complete removal of proteins before precipitating the DNA.

Ethanol precipitation

Following deproteinisation with phenol-chloroform, DNA can be purified and concentrated by ethanol precipitation. By adding salt and ethanol, DNA can be precipitated and washed before being re-suspended in a small volume. This allows high concentrations to be produced. Ethanol is much less polar than water and above a certain concentration it disrupts the hydration shells surrounding the DNA. This allows the cations from the salt to form ionic bonds with the phosphates of the nucleic acids, resulting in precipitation. A variety of salts can be used to provide the cations: sodium acetate and ammonium acetate are commonly used. If DNA precipitates in large enough quantities it appears out of the solution like a spider-web with bubbles trapped in it (an effect caused by the outgassing of ethanol). In some cases it can then be hooked out in one piece or ‘spooled’ on a glass rod[19]. If the quantity is insufficient, or if it breaks up when spooled, it can be pelleted by spinning in a centrifuge. In both cases the DNA needs to be thoroughly washed in 70% ethanol to remove residual salts before being resuspended in a low ionic concentration buffer, ideally at pH 8.0 (see storage of HMW DNA).
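As a worked example of the proportions involved, a commonly cited recipe is 1/10 volume of 3 M sodium acetate followed by 2-2.5 volumes of cold ethanol. A minimal sketch of the arithmetic (the function name and exact figures are illustrative, not from this chapter; follow your own protocol):

```python
def ethanol_precipitation_volumes(sample_ul: float, ethanol_x: float = 2.5) -> dict:
    """Typical proportions: 1/10 volume of 3 M sodium acetate, then
    `ethanol_x` volumes of ethanol relative to the salted sample."""
    salt_ul = sample_ul / 10
    ethanol_ul = (sample_ul + salt_ul) * ethanol_x
    return {"3M_NaOAc_ul": salt_ul, "ethanol_ul": ethanol_ul}

# For a 500 ul DNA solution:
print(ethanol_precipitation_volumes(500))
# -> {'3M_NaOAc_ul': 50.0, 'ethanol_ul': 1375.0}
```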

Figure 3

Figure 3. DNA prepared using the phenol-chloroform method being hooked out of the solution using a glass rod.


Dialysis

Dialysis is a technique commonly used in protein purification, but it can also be used to remove impurities from DNA and is preferable to phenol-chloroform when isolating large DNA fragments due to its even gentler handling. In molecular biology, dialysis is a method for separating molecules by their rate of diffusion through a semi-permeable membrane. Ions in solution diffuse from areas of high concentration (the sample) to areas of low concentration (the dialysis buffer) until equilibrium is reached, but the larger DNA molecules cannot pass through the membrane and so are retained. Dialysis is performed either by putting the sample inside dialysis tubing and submerging it in a large volume of buffer or, for smaller volumes, by pipetting the sample onto a membrane floating on the buffer, so-called ‘drop dialysis’[20]. A useful side effect of this method is that the DNA becomes concentrated over time as water moves out of the sample. If a higher concentration is required, the dialysis can be performed for longer.

Megabase sized DNA

Isolating megabase-sized DNA requires significantly more time and effort than the other techniques described. In order to keep DNA molecules intact they must be protected from hydrodynamic forces. This is achieved by embedding the cells in agarose blocks known as plugs[21]. The extraction is then performed on the cells in situ by placing the plugs successively in lysis buffer, digestion buffer and wash buffer. The DNA can be analysed by inserting the plugs directly into a gel for pulsed-field gel electrophoresis (PFGE), or released from the gel using agarase, which cleaves agarose into smaller subunits that can no longer gel at room temperature. DNA released from agarose plugs requires further purification by dialysis, but this may not yield concentrations high enough for nanopore sequencing. The method is therefore promising but requires further development.

Input requirements for ultra-long reads

One of the main impediments to generating ultra-long reads is having sufficient input material. If you are able to grow cells in culture then this is less of a problem. However, if the sample is limited in quantity it may be pragmatic to consider another approach. The approximate number of cells required to generate ultra-long reads (15 µg in our hands) are given below (for phenol-chloroform extractions).

Table 1

Table 1. Input requirements based on requiring a minimum of 15 µg DNA for ultra-long library preparation.
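Such input requirements can be approximated from genome size. A back-of-envelope sketch, assuming an average of ~650 g/mol per base pair of dsDNA and one genome copy per cell (illustrative values only, not the figures from Table 1):

```python
# Back-of-envelope estimate of cells needed for a given DNA mass.
AVOGADRO = 6.022e23          # molecules per mole
BP_MASS_G = 650 / AVOGADRO   # grams per base pair of dsDNA

def cells_needed(genome_bp: float, target_ug: float, copies_per_cell: float = 1) -> float:
    """Approximate number of cells required to yield target_ug of DNA."""
    mass_per_cell_g = genome_bp * BP_MASS_G * copies_per_cell
    return (target_ug * 1e-6) / mass_per_cell_g

# E. coli (~4.6 Mb genome) and diploid human (~6.4 Gb) cells for 15 ug:
print(f"E. coli: {cells_needed(4.6e6, 15):.1e} cells")   # ~3e9 cells
print(f"human:   {cells_needed(6.4e9, 15):.1e} cells")   # ~2e6 cells
```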

Quality control of DNA samples

Performing the appropriate QC on DNA extractions is vital to avoid disappointment when sequencing! The most commonly performed QC procedures are fragment size assessment, absorbance spectrophotometry and fluorometric quantification.

Fragment size assessment

The TapeStation 2200 (Agilent) is a gel electrophoresis system used for fragment size assessment, although other instruments or conventional gel electrophoresis could also be used. One useful metric generated by the instrument analysis software is the DNA integrity number (DIN), which can be used to estimate the level of DNA degradation. A DNA sample with the majority of the DNA >60 Kb and little to no short fragments will have a DIN value of >9. If the sample shows a smear of short fragments, a sign of degradation, it will have a low DIN value. For all MinION library types a DIN value >9 is preferred; lower values will result in more short reads. A 0.4x SPRI cleanup (see ‘Size selection with SPRI beads’) can remove fragments below 1500 bp. A better solution is to begin with high-integrity DNA and then shear it down to the desired size, resulting in a tight fragment distribution with very few short fragments.

Absorbance ratios

Another important metric for DNA quality assessment is the absorbance measured by a spectrophotometer such as the NanoDrop. This instrument measures the UV and visible light absorbance of the DNA sample, which permits quantification both of DNA and of common impurities.

The commonly used absorbance ratios for assessing DNA purity are 260/280 (absorbance at 260 nm / 280 nm) and 260/230. The 260/280 ratio is generally ~1.8 for pure DNA; a lower value could indicate protein, phenol or guanidine hydrochloride contamination. The 260/230 ratio is a secondary metric and is generally 2.0-2.2 for pure DNA; a lower value may indicate phenol contamination. However, correct interpretation depends on the extraction method: after a spin column extraction, guanidine hydrochloride is the most likely contaminant, whereas after a phenol-chloroform extraction, SDS or phenol contamination is more likely. Changes in sample pH can also affect 260/280 ratios, so the instrument should be blanked using the same buffer that the DNA is in. Each nucleotide absorbs differently, so the base composition of the DNA also affects the 260/280 ratio: AT-rich samples will have slightly higher 260/280 ratios than GC-rich samples.
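The rules of thumb above can be expressed as a simple check on the three absorbance readings. A sketch (the function name, threshold choices and example readings are ours, for illustration only):

```python
# Rough purity check from spectrophotometer readings, following the
# ~1.8 (260/280) and ~2.0-2.2 (260/230) rules of thumb for pure DNA.
def purity_flags(a230: float, a260: float, a280: float) -> dict:
    r280 = a260 / a280
    r230 = a260 / a230
    return {
        "260/280": round(r280, 2),
        "260/230": round(r230, 2),
        "protein/phenol_suspected": r280 < 1.8,
        "salt/phenol_suspected": r230 < 2.0,
    }

# A clean genomic DNA prep might read something like:
print(purity_flags(a230=0.50, a260=1.00, a280=0.55))
# -> {'260/280': 1.82, '260/230': 2.0,
#     'protein/phenol_suspected': False, 'salt/phenol_suspected': False}
```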


Checking that absorbance ratios are consistent with pure DNA is an important QC step prior to nanopore sequencing. If there is a problem at this stage it is best to repeat the DNA extraction to confirm that the ratios are repeatable. We have had excellent sequencing results with the DNA in Figure 4, which has higher ratios than expected for pure DNA. The NanoDrop is mainly useful for DNA purity assessment but less so for quantification, as absorbance is less accurate than fluorometry.

Figure 4

Figure 4. Absorbance spectra between 220 and 350 nm as measured by the NanoDrop instrument. This was the DNA sample used to generate the ultra-long reads for the MinION human genome sequencing project. It was extracted from the NA12878 cell line using the phenol method.

Fluorescence spectroscopy

Fluorescence spectroscopy is an important technique for DNA quantification. It relies on the fact that nucleic acid stains such as SYBR Green I fluoresce when intercalated into DNA. The dye is excited by blue light and re-emits green light at a longer wavelength. The level of fluorescence is proportional to the DNA concentration, which can be calculated by comparison with standards of known concentration. The Qubit (Life Technologies) is a convenient fluorescence spectrophotometer for single samples, and different kits are available for different sample types and concentration ranges. The most useful for preparing nanopore libraries is the dsDNA HS Assay (Life Technologies), which measures concentrations between 0.01 and 100 ng/µl.
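In principle, the conversion from fluorescence to concentration is a two-point standard curve. A minimal sketch (the function name and the fluorescence readings are illustrative, not taken from any instrument's actual output):

```python
def concentration_from_fluorescence(f_sample: float,
                                    f_std_low: float, f_std_high: float,
                                    c_std_low: float = 0.0,
                                    c_std_high: float = 10.0) -> float:
    """Linear interpolation between two standards of known
    concentration (ng/ul); default standard concentrations are
    illustrative."""
    slope = (c_std_high - c_std_low) / (f_std_high - f_std_low)
    return c_std_low + slope * (f_sample - f_std_low)

# If a blank (0 ng/ul) standard reads 50 units and a 10 ng/ul standard
# reads 1050 units, a sample reading 550 units is halfway between:
print(concentration_from_fluorescence(550, 50, 1050))  # -> 5.0
```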

Size-selection with SPRI beads

DNA extractions with evidence of short fragments can be improved by performing size selection. A commonly used technique is solid-phase reversible immobilization (SPRI) beads. DNA binds to the beads in the presence of the bead buffer, which contains a crowding agent, PEG (polyethylene glycol), and a high concentration of sodium chloride. In these conditions the DNA transitions from solution to a condensed ‘ball-like’ state in which it is attracted to the beads[10]. Size selection is controlled by altering the bead-to-sample volume ratio, with ratios between 0.4x and 1.8x commonly used. SPRI is an easy way of removing short fragments but is only effective up to around 1500 base pairs at the lowest ratio of 0.4x.
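The bead-to-sample ratio is simply the volume of bead suspension added relative to the sample volume. A trivial helper makes this explicit (the function and its bounds are our illustration of the 0.4x-1.8x range mentioned above):

```python
def spri_bead_volume(sample_ul: float, ratio: float) -> float:
    """Volume of SPRI bead suspension to add for a given bead:sample
    ratio. Lower ratios remove short fragments more stringently."""
    if not 0.4 <= ratio <= 1.8:
        raise ValueError("ratios outside 0.4x-1.8x are rarely used")
    return sample_ul * ratio

# A stringent 0.4x cleanup of a 50 ul sample (removes fragments below
# roughly 1500 bp, per the text above):
print(spri_bead_volume(50, 0.4))  # -> 20.0
```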

Buffer exchange with SPRI beads

SPRI beads can be used to clean up DNA prior to library preparation. This makes them useful for reworking DNA samples that have failed quality control, e.g. by absorbance spectra or fragment distribution. If the absorbance spectra suggest salt contamination you might decide to do a 1x SPRI clean-up to remove the salt. A final example is if you wish to buffer exchange a sample into EB. Many extraction kits use Tris-EDTA (TE) as the elution buffer, which contains 0.1 or 1 mM EDTA to protect DNA against nuclease activity; it does this by sequestering the metal ions that nuclease enzymes use as cofactors. However, if the EDTA concentration is too high it will also inhibit the transposase enzyme used for library preparation. If you do not know what buffer a DNA sample is in, or suspect it is the wrong one, you can use a 1.0x SPRI clean-up to exchange the sample into EB.

Size selection by gel electrophoresis

Agarose gel electrophoresis is used to separate DNA fragments by size[22]. As DNA is negatively charged it migrates towards the anode when exposed to an electric field. Typical gels are made with 0.5 – 2.0% (w/v) agarose, with lower percentage gels giving better resolution for long fragments as they have a larger pore size. However, low concentration agarose gels are very fragile, and HMW (high molecular weight) DNA cannot be resolved, with all sizes moving together. PFGE, on the other hand, can separate fragments up to 10 Mb using a field which changes direction, forcing the DNA to migrate through the gel in a zigzag motion. The ability of PFGE to separate long fragments is exploited by instruments such as the BluePippin and SageHLS to perform size selection of genomic DNA. The most useful mode for nanopore sequencing is selecting the longest fragments in a DNA sample after g-TUBE or needle shearing, known as a high-pass size selection. Up to four samples can be size selected at once with the BluePippin agarose cassette, with the fifth lane used for the ladder. The DNA migrates through the gel by PFGE until the shorter, unwanted fragments have run past the collection channel. At this point the anode is switched so the remaining fragments are electroeluted into buffer in the collection chamber. The point at which to switch is determined by the ladder running past a detector beneath the cartridge.

Repairing damaged DNA

When sequenced read lengths do not match the known size distribution, DNA damage may be to blame. A common source of damage is single-strand nicks: breaks in the DNA where there is no phosphodiester bond between two adjacent nucleotides in the strand. These occur due to enzymatic activity or chemical damage to the DNA molecule. As the DNA strand is sequenced, any nicks will cause premature termination of the read, as there is no second strand to stabilise the nicked strand. Single-strand nicks will not be detected by standard gel electrophoresis but can be detected on a formamide denaturing gel.

Single-strand breaks can be repaired using repair mixtures such as PreCR Repair Mix or FFPE DNA Repair Mix (New England Biolabs). These enzyme cocktails are designed to repair a variety of DNA damage, including single-strand breaks, and can reduce sequencing errors and improve read lengths, especially for old or damaged DNA samples. As an extreme example, ancient DNA (hundreds or thousands of years old) will contain an excess of abasic sites, deaminated cytosines, oxidized bases and nicks, all of which should be reduced by FFPE DNA Repair Mix.

Storage of HMW DNA

After expending so much care and love on a high molecular weight extraction, a little extra care should be taken to ensure that good work is not undone during storage. HMW DNA should be resuspended in elution buffer (EB; 10 mM Tris-HCl pH 8.0) or Tris-EDTA buffer (TE; 10 mM Tris-HCl pH 8.0, 1 mM EDTA). TE will provide protection against nuclease activity by chelating any Mg2+ ions but may be incompatible with downstream enzymatic reactions. Both will keep the pH at 8.0 which is optimal for DNA storage as nucleases are less active at this pH. DNA should always be stored in the fridge at 5°C as freezing will result in physical shearing[23]. We have found DNA is stable for a year or more at this temperature if free from nucleases.

Handling HMW DNA

DNA is a rigid molecule due to the electrostatic repulsion between its negatively charged phosphates[24]. This makes it vulnerable to double-strand breaks from the hydrodynamic forces in moving fluids, e.g. when pipetting. These forces can be minimised by pouring rather than pipetting where possible, and by stirring when mixing. Maintaining high concentrations may also help to reduce shearing, as concentrated DNA solutions are more viscous. Keeping DNA in a condensed form by adding PEG or polyamines such as spermidine also reduces the likelihood of shearing.


  1. Jain, M., Koren, S., Miga, K.H., Quick, J., Rand, A.C., et al., Nanopore sequencing and assembly of a human genome with ultra-long reads. Nature Biotechnology, 2018.
  2. Loose, M. 2018; Available from:
  3. Koren, S. and A.M. Phillippy, One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Curr Opin Microbiol, 2015. 23: p. 110-20.
  4. Boom, R., et al., Rapid and simple method for purification of nucleic acids. J Clin Microbiol, 1990. 28(3): p. 495-503.
  5. Carter, M.J. and I.D. Milton, An inexpensive and simple method for DNA purifications on silica particles. Nucleic Acids Res, 1993. 21(4): p. 1044.
  6. Chomczynski, P. and N. Sacchi, Single-step method of RNA isolation by acid guanidinium thiocyanate-phenol-chloroform extraction. Anal Biochem, 1987. 162(1): p. 156-9.
  7. QIAGEN QIAGEN Genomic DNA Handbook. 2001.
  8. Hultman, T., et al., Direct Solid-Phase Sequencing of Genomic and Plasmid DNA Using Magnetic Beads as Solid Support. Nucleic Acids Research, 1989. 17(13): p. 4937-4946.
  9. Uhlen, M., Magnetic Separation of DNA. Nature, 1989. 340(6236): p. 733-734.
  10. Lis, J.T., Fractionation of DNA fragments by polyethylene glycol induced precipitation. Methods Enzymol, 1980. 65(1): p. 347-53.
  11. Dounce, A.L., et al., A Method for Isolating Intact Mitochondria and Nuclei from the Same Homogenate, and the Influence of Mitochondrial Destruction on the Properties of Cell Nuclei. Journal of Biophysical and Biochemical Cytology, 1955. 1(2): p. 139-153.
  12. Hill, R.A. and M.N. Sillence, Improved membrane isolation in the purification of beta(2)-adrenoceptors from transgenic Escherichia coli. Protein Expression and Purification, 1997. 10(1): p. 162-167.
  13. Kay, E.R.M., N.S. Simmons, and A.L. Dounce, An Improved Preparation of Sodium Desoxyribonucleate. Journal of the American Chemical Society, 1952. 74(7): p. 1724-1726.
  14. Doyle, J.J., A rapid DNA isolation procedure for small quantities of fresh leaf tissue. 1987.
  15. Gross-Bellard, M., P. Oudet, and P. Chambon, Isolation of high-molecular-weight DNA from mammalian cells. Eur J Biochem, 1973. 36(1): p. 32-8.
  16. Kirby, K.S., A new method for the isolation of ribonucleic acids from mammalian tissues. Biochem J, 1956. 64(3): p. 405-8.
  17. Kirby, K.S., A new method for the isolation of deoxyribonucleic acids; evidence on the nature of bonds between deoxyribonucleic acid and protein. Biochem J, 1957. 66(3): p. 495-504.
  18. Sambrook, J. and D.W. Russell, Isolation of High-molecular-weight DNA from Mammalian Cells Using Proteinase K and Phenol. CSH Protoc, 2006. 2006(1).
  19. Bowtell, D.D., Rapid isolation of eukaryotic DNA. Anal Biochem, 1987. 162(2): p. 463-5.
  20. Marusyk, R. and A. Sergeant, A simple method for dialysis of small-volume samples. Anal Biochem, 1980. 105(2): p. 403-4.
  21. Schwartz, D.C. and C.R. Cantor, Separation of yeast chromosome-sized DNAs by pulsed field gradient gel electrophoresis. Cell, 1984. 37(1): p. 67-75.
  22. Sharp, P.A., B. Sugden, and J. Sambrook, Detection of two restriction endonuclease activities in Haemophilus parainfluenzae using analytical agarose–ethidium bromide electrophoresis. Biochemistry, 1973. 12(16): p. 3055-63.
  23. Ross, K.S., N.E. Haites, and K.F. Kelly, Repeated freezing and thawing of peripheral blood and DNA in suspension: effects on DNA yield and integrity. J Med Genet, 1990. 27(9): p. 569-70.
  24. Sambrook, J. and Russell, D., Molecular Cloning: A Laboratory Manual. 2001.

Thar she blows! Ultra long read method for nanopore sequencing

tl;dr version

  • Ultra long reads (up to 882 kb and indeed higher) can be achieved on the Oxford Nanopore MinION using traditional DNA extraction techniques and minor changes to the library preparation protocol, without the need for size selection
  • The protocol is available here; it involves a modified Sambrook phenol-chloroform extraction/purification, DNA QC, minimal pipetting steps, high input into the rapid kit, and MinKNOW 1.4
  • We have tested it on E. coli and human so far with good results; data is of course available

Ultra-long reads: background

What if you could sequence E. coli in just one read? This was the challenge I set Josh. And why can’t we do that, if nanopore sequencing really has no read length limit?

Well actually: we’re not quite there yet, but we did manage to sequence 1/6th of the whole genome in a single read last week. Here’s how we (well, he) did it. As usual we like to release our protocols openly and early to encourage the community to test and improve them. Please let us know about any tweaks you find helpful! The community seems very excited by this judging by my Twitter feed and email inbox, so we have rush released the protocol. The tweets have also inspired commentaries by Keith Robison and James Hadfield, thanks guys!

First … a bit of background, and the importance of working with moles not mass. This line of thinking was triggered during the Zika sequencing project, when we noticed our yields when sequencing amplicons were never as good as with genomic DNA. Why was that?

We decided a possible reason is that nanopore sequencing protocols are usually expressed in terms of starting mass (typically 1 microgram for the ligation protocols). But of course 1 microgram of 300 bp fragments contains a lot more (>25x more) DNA fragments than 1 microgram of 8000 base fragments. By not factoring this into the library prep, we were likely not making an efficient library, because the protocol had not been scaled up 25 times to account for the difference. It stands to reason that it's the molarity that's important when loading the flowcell, rather than the total mass of DNA. If you could load some imaginary single molecule of DNA with mass 1000 ng (bear with me), the chances of it interacting with a pore would still be quite low. More molecules means more potential interactions with the pore, meaning more potential yield.
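To put numbers on this: assuming an average molecular weight of ~650 g/mol per base pair of double-stranded DNA (a standard approximation; the helper function below is ours, purely for illustration), a quick back-of-the-envelope conversion shows the >25x difference in molecule counts:

```python
# Convert a DNA mass into picomoles of fragments, assuming ~650 g/mol
# per base pair of double-stranded DNA (a standard approximation).
DSDNA_G_PER_MOL_PER_BP = 650.0

def dsdna_pmol(mass_ng, mean_length_bp):
    """Picomoles of dsDNA fragments in mass_ng nanograms at a given mean length."""
    grams = mass_ng * 1e-9
    grams_per_mol = mean_length_bp * DSDNA_G_PER_MOL_PER_BP
    return grams / grams_per_mol * 1e12  # mol -> pmol

amplicons = dsdna_pmol(1000, 300)   # 1 ug of 300 bp amplicons
genomic = dsdna_pmol(1000, 8000)    # 1 ug of 8 kb genomic fragments
print(f"{amplicons:.2f} pmol vs {genomic:.2f} pmol "
      f"({amplicons / genomic:.0f}x more molecules)")
```

For a fixed mass, the molar amount scales inversely with fragment length, which is exactly why a short-fragment library needs the adapter and barcode quantities scaled up to stay efficient.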

We calculated the desired starting molarity as 0.2 pM based on the length assumptions in the ONT protocol (in practice you load about 40% less after losses from library construction). So by increasing the amounts of barcodes and adapters, as we do in our Zika protocol, we can compensate for this.

That solves the short read problem, but we started thinking about how it would work in the other direction. What if you wanted to get the longest reads possible: what would this mean in terms of mass? The rather silly idea was: if you wanted reads sufficiently long to cover a whole bacterial chromosome in a single read, what would the starting DNA concentration need to be?

The math here is simple; you just need to scale the starting DNA by 500x. But this would mean putting ~500ug of DNA into the library preparation!
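Spelled out, using the ~8 kb fragment length and 1 ug input assumed by the standard ligation protocol (strictly the factor comes out nearer 575x; 500x is the round number):

```python
# Keeping the molar input constant while increasing the mean fragment length
# means the input mass must scale linearly with length.
standard_input_ug = 1.0       # typical ligation-protocol input
assumed_length_bp = 8_000     # fragment length the protocol is pitched at
target_length_bp = 4_600_000  # one E. coli chromosome per read

scale = target_length_bp / assumed_length_bp
required_ug = standard_input_ug * scale
print(f"scale ~{scale:.0f}x, so ~{required_ug:.0f} ug of input DNA")
```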

500ug of DNA is… quite a lot. And practically there are several problems with this idea:

  • you would need a lot of cells to start with (perhaps not such a problem with bacterial cultures but certainly restrictive for some applications)
  • what volume do you elute in? DNA starts to get viscous and thick as concentrations increase; at some point you just won't be able to pipette it any more
  • how do you deposit that much DNA into the flow cell?

So we slightly scaled down our ambitions and decided that it could be practical to scale up the protocol 10-fold, which could still result in average 80 kb reads, a significant improvement over the ~8 kb typically seen with the standard protocol.

We’d already been using the Sambrook protocol (from the classic Molecular Cloning - over 173,000 citations!) for our human genome extractions, which reliably gives very high molecular weight DNA that can be recovered with a shepherd’s crook fashioned from a glass rod. Previously Dominik Handler demonstrated that HMW extractions with careful pipetting could generate long reads with the rapid kit. So we did a new Sambrook extraction using an overnight culture of E. coli K-12 MG1655 and generated something that was very pure (260:280 of 2.0) and very high molecular weight (>60kb by TapeStation - the limit of the instrument). In fact the DNA is so long that you can’t really size it without employing a pulsed-field gel electrophoresis setup. Sadly we don’t have a working one in the department, so infrequently are they used these days. So we were flying blind in terms of the true length of the fragments.

Scaling up the rapid kit was relatively straightforward for inputs up to 2 ug: you get DNA at a concentration of 250 ng/ul, then add the maximum 7.5 ul. However, inputs of 10 ug require concentrations of 1 ug/ul, where things start to get tricky. The library is so viscous that the loading beads start to clump together and it becomes harder to get the library through the SpotON port on the flowcell. Not satisfied with 10 ug either, we pushed on towards 20 ug, which required making a double-volume library and adjusting the dilution downstream. We eventually settled on a protocol which could reliably give read N50s over 100 kb (i.e. half of the dataset in reads of 100 kb or longer) with a tail stretching out to 500 kb, or sometimes beyond…

The final piece of the puzzle was something we were already aware of: the nanopore control software, as of version 1.3, does periodic ‘global voltage flicks’, meaning that the voltage is reversed across the flow cell every 10 minutes. The aim of this is to prevent strands or proteins blocking up the pores, by a rapid change in the direction of the ionic current. However, the problem with a 10 minute flicking interval is that it intrinsically limits the longest read on the system to 150kb (with the 250 base/s chemistry) or 270kb (with the 450 base/s chemistry). In MinKNOW 1.3 you could change the script parameters (stored in a YAML file) to remove this flick, but luckily in MinKNOW 1.4 it has been dispensed with entirely in favour of a much smarter system that dynamically unblocks individual pores on demand.
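The arithmetic behind that read-length cap is just translocation speed multiplied by the flick interval:

```python
# A strand can only keep translocating while the voltage stays forward-biased,
# so a periodic global flick caps the longest possible read.
FLICK_INTERVAL_S = 10 * 60  # MinKNOW <= 1.3 reversed the voltage every 10 minutes

def max_read_length(bases_per_second, interval_s=FLICK_INTERVAL_S):
    """Longest read possible between two voltage flicks."""
    return bases_per_second * interval_s

print(max_read_length(250))  # 150000 bases with the 250 b/s chemistry
print(max_read_length(450))  # 270000 bases with the 450 b/s chemistry
```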

So … how does it look after all that’s been done?

We ran E. coli K-12 MG1655 on a standard FLO-MIN106 (R9.4) flowcell.

E. coli stats

  • Total bases: 5,014,576,373 (~5 Gb)
  • Number of reads: 150,604
  • N50: 63,747
  • Mean: 33,296.44
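For anyone computing these numbers themselves, N50 is the length L such that reads of length at least L contain half of all sequenced bases; a minimal sketch:

```python
def n50(lengths):
    """Smallest length L such that reads of length >= L contain
    at least half of all sequenced bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

print(n50([1, 2, 3, 4]))  # -> 3 (the 4 and the 3 together pass the halfway mark)
```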

Read length stats

Ewan Birney suggested this would be more interpretable as a log10 scale, and by golly he was right!

Alignment stats

Wow! The longest 10 reads in this dataset are:

1113805 916705 790987 778219 771232 671130 646480 629747 614903 603565


But hold your horses. As Keith Robison likes to say, and Mark Akeson as well, it’s not a read unless it maps to a reference. Or as Sir Lord Alan Sugar might say, “squiggles are for vanity, basecalls are sanity, but alignments are reality”.

Are these reads actually real, then?

Just judging by the distribution it’s clear that this is not all spurious channel noise.

Let’s align all the reads…

Alignment issues

This dataset poses a few challenges for aligners. BWA-MEM works, but is incredibly slow; it goes much faster if you split the reads into 50kb chunks first, but this is a bit annoying.
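The chunking itself is trivial; as an illustration only (this helper is ours, not the tool we actually used), tagging each chunk name with its offset lets you map alignments back to original read coordinates afterwards:

```python
def chunk_reads(records, chunk_size=50_000):
    """Split (name, sequence) pairs into pieces of at most chunk_size bases,
    naming each chunk with its start offset in the original read."""
    for name, seq in records:
        for start in range(0, len(seq), chunk_size):
            yield f"{name}:{start}", seq[start:start + chunk_size]

# A 120 kb read becomes three chunks of 50 kb, 50 kb and 20 kb:
chunks = list(chunk_reads([("read1", "A" * 120_000)]))
print([(name, len(seq)) for name, seq in chunks])
```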

I decided to use GraphMap, which has a couple of useful features: it tries to make an end-to-end alignment, and it has a circular alignment mode, which is useful as we would expect many of these reads to cross the origin of replication at position 0.

Another problem! The SAM output will not convert to BAM successfully, so I’ve written the alignments out using the BLAST -m5 format for ease of parsing. The SAM/BAM developers are working on this (CRAM is fine).

After a solid couple of days of alignment, here are the results:

So we lose a few of the really long reads here which are obviously noise (the 1Mb read is just repetitive sequence and probably represents something stuck in a pore, and the 900Kb read is not a full-length alignment), but otherwise there is an excellent correlation between the reads and alignments.

So, the longest alignments in the dataset are:

778217 771227 671129 646399 603564 559415 553029 494330 487836 470664

That’s theoretical 1x coverage of the 4.6Mb chromosome of E. coli in just the 7 longest reads!

95.47% of the bases in the dataset map to the reference, and the mean alignment length is slightly higher at 34.7kb.

A few other notable things:

  • The 790kb read that didn’t align full-length is interesting. On inspection it is actually two reads - the template and complement strand of the same starting molecule, separated by an open pore signal. This gives us a clue as to how the proposed 1D^2 technology (which is replacing 2D reads) could work. Calling the two reads together (thanks, Chris Wright) gives a 95% accuracy read!
  • We’ve started using the Albacore basecaller for this, rather than uploading to Metrichor. Albacore seems to keep up with basecalling a live run when using 60 cores.


So we would like to claim at least four world records here!

  • Longest mappable DNA strand sequence**
  • Longest mappable DNA strand & complement sequence
  • Highest nanopore run N50 (not sure about other platforms?)
  • Highest nanopore run mean read length

(**) we’ve actually beaten that record already with another run, but that’s a subject for another post

An interesting exercise for the reader is to figure out the minimum number of reads that can be taken from this dataset to produce a contiguous E. coli assembly! My first attempt found a set of 43 reads which covers 92% of the genome, but you can do better!
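One way in, if you fancy the exercise, is a standard greedy interval cover over the alignment coordinates; a sketch under simplifying assumptions (linear coordinates, ignoring the circular chromosome, reads already parsed into (start, end) intervals):

```python
def min_read_cover(intervals, genome_length):
    """Greedy minimum set of (start, end) alignment intervals covering
    [0, genome_length); returns None if there is an uncoverable gap."""
    intervals = sorted(intervals)
    chosen, covered, i = [], 0, 0
    while covered < genome_length:
        best = None
        # Among reads starting within the covered prefix, take the one
        # reaching furthest to the right.
        while i < len(intervals) and intervals[i][0] <= covered:
            if best is None or intervals[i][1] > best[1]:
                best = intervals[i]
            i += 1
        if best is None or best[1] <= covered:
            return None  # no read extends past the covered region
        chosen.append(best)
        covered = best[1]
    return chosen

# Toy genome of length 100, covered by three of the four reads:
print(min_read_cover([(0, 40), (10, 20), (30, 80), (70, 100)], 100))
```

The greedy choice (always take the read reaching furthest while still overlapping the covered prefix) is provably optimal for this kind of interval covering.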

Where now? Well, readers will notice that a real landmark is in sight - the first megabase read. We’ve been running this protocol for a bit over a week and a new hobby is ‘whale spotting’ for the largest reads we can see.

We haven’t quite yet worked out a systematic naming scheme for whales, but perhaps Google has the answer.

So in that case, we’ve in the past few days hit our first narwhal (an 882kb read from a different run, which translates to a 950kb fragment judged against the reference).

How can we go longer? Well, it might be possible to increase the DNA input some more, but we hit issues with viscosity which may prevent pipetting onto the flowcell. Pipette shearing forces are presumably also an issue at these concentrations.

The general consensus is that we will need to employ solid-phase DNA extractions and library construction, e.g. in agarose plugs. The SageHLS instrument also looks quite interesting.

Data availability

Hosting courtesy of CLIMB:


The nanopore squad, John Tyson and Matt Loose provided much helpful advice and input during the development of this protocol. Matt Loose came up with the whale naming scheme.

Thanks to ONT for technical support with particular thanks to Clive Brown, Chris Wright, David Stoddart and Graham Hall for advice and information.

Conflicts of interest

I have received an honorarium to speak at an Oxford Nanopore meeting, and travel and accommodation to attend London Calling 2015 and 2016. I have ongoing research collaborations with ONT although I am not financially compensated for this and hold no stocks, shares or options. ONT have supplied free-of-charge reagents as part of the MinION Access Programme and also generously supported our infectious disease surveillance projects with reagents.

BIOM25: Metagenomics practical

Scratchpad for session:

Normal human microbiome

The MetaHIT project studied healthy volunteers, as well as people with diabetes and inflammatory bowel disease to characterise their microbiomes:

Have a read of how the study was designed:

Q: Take 10 samples at random and look at their taxonomic distribution. Tabulate the top 3 phyla present and their relative abundances.

E. coli outbreak

Our paper describing the outbreak:

Our paper describing use of whole-genome shotgun metagenomics to diagnose the outbreak:

The data website:

Q. Pick some samples at random. For each sample, look at the taxonomic distributions.

Q. Do any samples look abnormal, compared to the ‘normal’ microbiome?

Q. Are any toxins present? Which ones? What is the significance of this toxin and how might it cause disease?

Antibiotic resistance

Here is a report from one of the outbreak metagenomes using a different analysis pipeline:

Q. What antibiotic resistance genes are present? (Hint: check the AMR report)

Q. What antibiotics might the outbreak strain be resistant to?

Q. How could we prove that the outbreak strain is resistant to these antibiotics?

Non-human environment

Now, choose a non-human environment to study, and to present to the group:





Q. What did the study set out to find?

Q. How did they sample their environment? How many samples did they look at?

Q. How does this environment compare taxonomically with the human gut? Is it more or less diverse? Are the set of organisms present similar or different?

Q. How does this environment compare functionally with the human gut? Can you explain these findings in the context of the environment?

BIOM25: 16S Practical

BIOM25: 16S Practical

In this practical we will analyse datasets from several studies, some very important, others perhaps just a little silly.

At first, we will go through a dataset together, this is from a pioneering paper:

  • The Human Microbiome in Space and Time.

After that, in groups, we will analyse one of three different datasets:

  • CSI: Microbiome. Can you determine who has been using a keyboard from the microbiome that is left behind? Do keyboards have a core microbiome??
  • The microbiome of restroom surfaces (toilets!)
  • Development of the infant gut microbiome.

Please watch this video for a useful demonstration of how principal component analysis works:

General questions

Q: What is the difference between alpha- and beta-diversity?

Human microbiome in space and time


Supplementary material:

Let’s have a look at the results.



Alpha diversity:

Bar plots by sample site:

PCoA analysis:

Q: Is there evidence of natural clusters being formed?

Q: Do samples cluster by individual? If not, how do they cluster?

Q: What are the most dominant taxa in stool, skin, urine? Look at different taxonomic levels down to genus.

Q: Are these sites similar or different? What are the major differences in taxonomic profile between these three sites?

CSI: Microbiome

Original paper:

Q: Skim read the introduction of the paper to get a feel for what they are trying to find out.

Q: Look at the Methods section and put the primer selection into TestPrime:


Important metadata fields for this project:

  • Description_duplicate - the key from any keyboard
  • HOST_SUBJECT_ID - the person each keyboard belongs to

Hint: M1, M2 and M9 are the three participants referred to in the paper.

Q: What are the most abundant taxa?

Q: Check the PCA plots: do samples cluster by key, or by subject (hint: HOST_SUBJECT_ID)?

Q: Go back to the taxa barplots: can you figure out which taxa are driving the variation that produces the grouping?

Q: Which of these taxa are part of the normal skin microbiome? Are any out of place? Where might they come from?

Q: Do you think this technique will really be usable for forensics? What are the challenges? What other techniques might work better for studying the microbiome?

Q: Now, read the paper in more detail and prepare a short summary to present the context for the study, the methods employed and the results found.

Restroom surfaces


Q: Skim read the introduction of the paper to get a feel for what they are trying to find out.

Q: Look at the Methods section and put the primer selection into TestPrime:

Now, look at the output of QIIME:


Fields of importance: Floor, Level, SURFACE, BUILDING

Q: What surfaces have the greatest amount of diversity? Is this expected?

Q: What do the profiles of stool, etc. look like?

Q: Are there any natural looking clusters in the data?

Q: Which sources of samples are most similar to others?

Q: Is there any clustering between different floors of the building?

Q: Compare the weighted vs unweighted Unifrac results: do the clusters look more natural in one or the other?

Q: Which surfaces have the most diversity? Least?

Q: Now, read the paper in more detail and prepare a short summary to present to the whole group. Consider: the context for the study, the methods that were employed and the results found. What did you think? What are the limitations of the study?

Infant gut metagenome


Q: Skim read the introduction of the paper to get a feel for what they are trying to find out.

Q: Look at the Methods section and put the primer selection into TestPrime:

Now, look at the output of QIIME:


Fields of importance:

  • SampleID - age in days of infant

Q: Is there any evidence of a gradient? (Key: use SampleID and turn gradient colours on)

Q: How do the taxa change over time?

Q: Which infant samples do the maternal stool most look like?

Q: Is the colour of stools associated with their bacterial diversity?

Q: Now, read the paper in more detail and prepare a short summary to present to the whole group. Consider: the context for the study, the methods that were employed and the results found. What did you think? What are the limitations of the study?

Instructor notes on building this tutorial

  • Download from QIIME db site or the BEAST
  • Get greengenes tree file
  • -i study_1335_closed_reference_otu_table.biom -o core -m study_1335_mapping_file.txt -e 1000 -t ../gg.tree -c "GENDER,FLOOR,BUILDING,SURFACE"
  • -i study_232_closed_reference_otu_table.biom -o core2 -m study_232_mapping_file.txt -e 1000 -t gg.tree -c "HOST_SUBJECT_ID,Description_duplicate"

2016: The Loman Lab year in Tweets


“The past is a foreign country” – well, that’s how I feel about January 2016 looking back today. Definitely some things happened in January, but I can’t remember them. So I’m using Twitter Analytics to remind me.

Oh! This was the month that #researchparasites came out, to the horror and amusement of the genomics field:


Was a good month. Our paper on the Ebola real-time genomic surveillance work came out, and it looked like the Ebola epidemic was well and truly over.

There was also fun to be had at AGBT.


Just as we thought we had left Ebola behind, there was a flare-up in Guinea that spread to Liberia.

Phylogenetic analysis showed that the new cases were very closely related to an Ebola genome sequenced 500 days previously, as can be seen from this NextStrain tree.

Independently, the epidemiologists identified a survivor who had been infected some 500 days previously, the very same individual.

This was a remarkable demonstration of the power of genomics, working in synergy with the epidemiologists on the ground.

Around the same time, we learnt we had been funded by MRC/Wellcome Trust/Newton to receive funds as part of the emergency response to Zika. Remarkably, the outcome was known just a few weeks after submitting the application, and we had the money just days after that. If only all grant funding could be like this …


We wasted no time getting started. Josh flew to Sao Paulo to Ester Sabino’s laboratory to start testing out sequencing protocols for Zika.


Oxford Nanopore released the R9 pore and it was something of a relief to see it was working well:

We launched the ZiBRA project, a road trip around North-East Brazil to investigate the genomic epidemiology of Zika cases in this region, the most heavily hit by cases of microcephaly in newborns.


We hit the road for the ZiBRA project and started generating Zika genomes working in collaboration with the local public health laboratories. Lots of diaries and blog posts are on the Zibra project website if you want to read more about this trip.

We made lots of new lifelong friends in Brazil, and we didn’t die even though our bus caught on fire at one point, although we hit some technical obstacles with sequencing very very low abundance samples.

The year seemed to be going pretty well. Until the Brexit vote …

Not good.


We launched the CLIMB cyberinfrastructure for microbial genomics to the public. Sign up for your own CLIMB account at the Bryn website. There are videos from the launch available, including this CLIMB demo.

So far over 150 research groups in the UK have signed up for our virtual machine infrastructure which runs across three sites (Birmingham, Warwick and Cardiff) with Swansea to launch in 2017. Particular props to Radoslaw Poplawski, Tom Connor, Andy Smith, Marius Bakke and Matt Bull on the technical side who helped get this launched - just in time!


We sweated over the Zika sequencing protocol and eventually by the end of summer Josh Quick nailed something that worked well on samples with very low viral copy numbers.

In August we just about had time to fit in a week in Cornwall to teach Porecamp with Konrad and his crew.

Pablo, Emily, Jennie and Andy really got MicrobesNG motoring (over 7500 genomes sequenced with a median wait time of 6 weeks!), with insert sizes bolstered with a nice new Nextera XT protocol.


The manuscript describing March’s Ebola flare-up was published.

Zika genomes were coming out thick and fast thanks to the new protocol, our Brazilian collaborators, Sarah Hill and Alli Black. Josh’s third trip of the year. A picture of Zika diversity was now starting to be built (beautifully visualised by Trevor and Richard’s wonderful Nextstrain site - vote for them to win the Open Science prize!).

A stunning new preprint from Andrew Rambaut, Gytis Dudas and the whole cast of Ebola sequencing collaborators was posted:

We only managed one Balti and Bioinformatics in 2016 but it was a good one:

Christiane and the crew put on a good show at Genome Science.


I met Bill Gates and Nathan Myhrvold and gave a presentation at a “learning session” for Bill about using NGS to fight infection: it was incredible.

All the depressing news on Twitter got too much for me, and I took a break for 3 weeks. After 24 hours of extreme withdrawal symptoms, it was actually quite nice to do something else with my time, like imagining what people were saying on Twitter.

They did, though.

(By the way Zam you were wrong).


Donald Trump - PEOTUS.

Not a dream though.


A relaxed end to the year – we did a bit of Beach (well, Leith) sequencing with Andrew Rambaut and Tom Little in Edinburgh; look out for more BeachSeq action early in 2017.

And we even managed to release data from a 30x human genome on MinION working collaboratively on the sequencing with Nottingham, UCSC, UBC and Norwich, a mere 39 flowcells for that (assembly N50 - 3Mb!):

Nanopore got named one of Science’s 10 breakthroughs of the year and we got a little name check.

Finally in December, we heard the great news that the Ebola ça suffit! trial reported 100% efficacy for the Ebola vaccine. Well done Stephan, Miles, Sophie and all the others who worked on this.

A few changes in 2017

The MicrobesNG team sees a change in 2017 - we are sad to say goodbye to Andy Smith - our database programmer on the MicrobesNG project. He’s done an amazing job building the MicrobesNG website, our LIMS, the CLIMB Bryn site, and even had time to help out with the Zibra Project database and the Primal Scheme site. We cannot be too annoyed that he only spent a year with us – he’s had the once in a lifetime opportunity to become a trainee pilot with Aer Lingus, a lifetime dream for Andy.

There have been some changes in Birmingham too – it’s been really nice to have Alan McNally join the IMI as a new Senior Lecturer. And we are really excited that Willem van Schaik is joining the IMI later in April, Brexit be damned!

Politically we are in uncharted territory, so we enter 2017 with some trepidation about what is to happen to the scientific environment, but we also hope that the awesome wins out.

Happy New Year to all friends and collaborators from the Loman Lab!