Single molecule real-time sequencing with the PacBio RS

I attended a talk by Dr Stephen Turner, the founder and Chief Technology Officer of Pacific Biosciences, promoting PacBio’s SMRT (Single Molecule Real Time) sequencing platform. While I’d first heard of the “next-next-generation” of sequencing technologies at least 18 months ago, this was the first time I’d paid much attention to them. What sets the “next-next-gen” platforms apart from the “next-gen” ones (why won’t this terminology die already?) is that, rather than sequencing a cluster (Illumina) or a bead (SOLiD) of amplified, identical molecules, they sequence a single molecule. As Pushkarev et al. explain, “Single-molecule sequencing enabled analysis of human genomic information without the need for cloning, amplification or ligation”. And there are many other cool things you can do with single-molecule sequencing, as Stephen did a great job of showing.

The RS is PacBio’s first commercial kit and I believe only a handful of customers currently have access to the machines. The PacBio technology also uses polymerase, which is different to existing technologies (I need to read up on this to understand it better), and is built on the concept of a zero-mode waveguide that allows sequencing at 2-3bp per second. Now, while this may sound slow, there are 75,000 such processes being performed in parallel on a single plate on the RS. Thus, Stephen claimed a sequencing time of less than one day for small-ish genomes.
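To put those numbers in perspective, here is a back-of-the-envelope sketch in Python. It is only an upper bound: it assumes every one of the 75,000 waveguides produces usable sequence continuously, which is surely optimistic, and it takes the midpoint of the quoted 2-3bp per second. The 4.6Mbp genome size (roughly E. coli) and the 30x coverage target are my own illustrative assumptions, not figures from the talk.

```python
# Back-of-the-envelope throughput estimate. All inputs are idealised.
ZMW_COUNT = 75_000          # parallel reactions per plate (figure from the talk)
BASES_PER_SECOND = 2.5      # midpoint of the quoted 2-3bp per second
GENOME_SIZE_BP = 4_600_000  # roughly E. coli sized (my assumption)
TARGET_COVERAGE = 30        # illustrative coverage target (my assumption)


def raw_bases_per_hour(zmw_count, bases_per_second):
    """Idealised raw yield if every waveguide sequenced non-stop."""
    return zmw_count * bases_per_second * 3600


def hours_for_coverage(genome_size_bp, coverage, bases_per_hour):
    """Hours needed to reach the given average coverage at that yield."""
    return genome_size_bp * coverage / bases_per_hour


if __name__ == "__main__":
    per_hour = raw_bases_per_hour(ZMW_COUNT, BASES_PER_SECOND)
    hours = hours_for_coverage(GENOME_SIZE_BP, TARGET_COVERAGE, per_hour)
    print(f"Idealised yield: {per_hour:,.0f} bases per hour")
    print(f"Hours to {TARGET_COVERAGE}x of a {GENOME_SIZE_BP:,}bp genome: {hours:.2f}")
```

Even if the real yield is a small fraction of this idealised figure, a small genome in under a day does not look unreasonable.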

The current read lengths generated by the RS are 2,500-2,700bp, though up to 22,000bp has been achieved. This is long! The gold-standard Sanger sequencing produces reads of about 800bp. Of course, the longer the reads the greater the error rate, and the current 13-14% error rate has me concerned. Stephen downplayed these error rates and provided an argument as to why the RS is still more accurate than current “next-gen” platforms (an argument based on the RS’s superior read length, though I didn’t follow it). The read length/error rate trade-off will work out favourably for some applications and limit the RS’s use in others. Another limiting factor (at least in the human world I work in) is the throughput of the machine. All the examples focused on small genomes (e.g. E. coli, Vibrio cholerae) or targeted applications, and I think it will be a while before we see mammalian-sized genomes sequenced on a regular basis with the RS. I don’t mean this as a knock on the technology, just to give some perspective on where it will be useful.

Now to the examples… (not all of which are achievable yet, but which are at least possible in theory):

  • Consensus sequencing: In some cases the read length may be longer than the DNA fragment you are trying to sequence. In such cases a hairpin adapter can be used to circularise the fragment and the sequencer reads around the circular fragment multiple times to produce a consensus sequence. Neat!
  • Whole-gene RNA-seq: Sequence the RNA of a gene in a single read, which allows you to simply “read off” the splice sites and get a precise count of the polyA tail length (up to around 20bp polyA tails). Current technologies really struggle to produce precise quantitative counts of homopolymer runs.
  • Gene fusion detection: No more inferring gene fusion boundaries from split reads, just read them off the sequence.
  • Sequencing “difficult” regions: GC-rich regions of the genome are notoriously difficult to sequence. The PCR cycles necessary for most experiments introduce bias whereby GC-rich and GC-poor regions of the genome are sequenced less often than those made up of ~50% GC. The RS does not appear to suffer from this bias, likely because no PCR amplification is necessary. Similarly “difficult” regions are the CGG repeats of central importance in Fragile X syndrome.
  • Finishing microbial genomes: Many of the published microbial genomes are not finished or complete in the same sense as the human genome. For example, genomes known to be composed of one or two chromosomes may be assembled as tens of separate contigs. The long read lengths of the RS enable these contigs to be joined by sequencing across the contig boundaries.
  • Mapping disease outbreaks and sources: The RS was used in establishing the origin of the Haitian cholera outbreak and the German E. coli outbreak of 2011.
  • Epigenetics: This had me particularly interested. Not only can the RS sequence your ordinary nucleotides, it can directly sequence DNA modifications. And not just the “simple” modifications like 5mC, but more exotic types such as 5-hmC, 4-mC, 6-mA, base J, glucosyl-5-hmC, DNA damage, dU and ribonucleotides. I think there were some twenty-four in total, of which all but three were new to me. Very exciting stuff, and who knows what will be found once we can perform this kind of analysis as standard. The beauty of this is that no modification of the sample is required to read out these DNA modifications, unlike, for example, the bisulfite conversion necessary for studying 5mC with current technologies (which is also unable to distinguish between 5mC and 5-hmC). See Flusberg et al. (Nature Methods, 2010) for a proof-of-principle study using the PacBio technology to detect DNA methylation. The hairpin adapters also allow for experiments designed to detect hemi-methylation vs. full methylation.
  • Direct RNA sequencing: Rather than converting the RNA to cDNA and then sequencing, PacBio are developing a method that uses reverse transcriptase rather than the polymerase so they can directly sequence the RNA molecules. This would then allow them to search for RNA modifications, analogous to the DNA modifications described above!
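The consensus sequencing idea in the first bullet can be made concrete with a toy calculation. The sketch below is my own simplification, not anything presented in the talk: it assumes each pass around the circular fragment miscalls a given base independently with the same probability, that the consensus is a simple majority vote over an odd number of passes, and (pessimistically) that all wrong calls agree on the same wrong base. Real errors include insertions and deletions, which this model ignores entirely.

```python
from math import comb


def consensus_error(per_pass_error, passes):
    """Chance that a majority vote over `passes` independent reads of one
    base is wrong, under the simplified model described above."""
    if passes < 1 or passes % 2 == 0:
        raise ValueError("passes must be a positive odd number")
    majority = passes // 2 + 1
    # Binomial tail: probability that at least `majority` passes are wrong.
    return sum(
        comb(passes, k)
        * per_pass_error ** k
        * (1 - per_pass_error) ** (passes - k)
        for k in range(majority, passes + 1)
    )


if __name__ == "__main__":
    single_pass_error = 0.135  # midpoint of the quoted 13-14% error rate
    for n in (1, 3, 5, 7, 9):
        rate = consensus_error(single_pass_error, n)
        print(f"{n} passes: consensus error about {rate:.4%}")
```

Under these assumptions the per-base error falls quickly as the number of passes grows, which at least suggests how a high single-pass error rate could still be workable when the fragment is short relative to the read length.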

All in all, this was a very exciting talk and gave a peek into what will be possible with these “next-next-gen” technologies. It was of course a sales pitch, but the technology and the science being done were well worth hearing about.