Tim Fallon

A crowdfunded biology project byTim Fallon


Good News! PacBio long-read sequencing data received

Lab Note #13
Oct 28, 2016

Dear firefly genome fans,

Happy Friday!  In addition to that excellent news, we have some other news we thought you might enjoy as well! We’ve just received the firefly genome sequence data back from the 16 PacBio RSII SMRT DNA sequencing cells that were enabled by your generous support of our firefly genome crowdfunding campaign.  This sequencing utilized the high-molecular-weight DNA previously extracted from 4 adult male Photinus pyralis fireflies collected from Lawrenceville New Jersey this past summer.

The quantity of data received comes out to about 20 times coverage of the male Photinus pyralis genome, meaning that every base (all those 'A','T','C', or 'G's) in the 436 million base pair firefly genome should have been sequenced on average 20 times!

Remember that this is so-called "shotgun" sequencing data, where the genome is physically broken into many many pieces, each piece is sequenced, and then needs to be put back together into the original information with a powerful computer (or a room of powerful computers) .  Although the firefly genomic DNA we extracted started as many many copies of 10 linear DNA molecules (the chromosomes of Photinus pyralis), the PacBio data came back in 1,114,185 pieces (aka reads)!  

So just how long are these reads? Take a look below for the actual sequence information for one of the longer reads we obtained, at 16,933 base pairs long:

Whoa!!  That is definitely a long read.  Reads as long as this will be great for assembling the firefly genome back together.  But what does the length distribution of the rest of the reads look like?

Histogram of the sizes of the received PacBio DNA sequencing reads

Pretty long!  There is even a significant proportion of reads that are longer than the 16,933 base pair long read shown above.  Attentive readers might note that the Y-axis (the # of reads), seems a bit lower than the 1,114,185 total PacBio reads I cited before.  This plot represents the data from a single SMRT cell, whereas the 1,114,185 reads are from the 16 SMRT cells combined.

How does this "long-read" data compare to the standard "short-read" DNA sequencing data, from the sequencing company Illumina?  Let's take a look:

Two typical 100x100 paired-end Illumina DNA sequencing reads

As you can see, compared to the PacBio long-read, these reads are tiny.  With two reads at 100bp long, these reads form what is called an "Illumina read pair" that represent sequencing of the same DNA molecule, from opposite ends, with an unsequenced gap ~400 base pairs long in the middle between them (see this webpage for more information on what I mean).  

Lets compare these PacBio and Illumina reads head to head, and see what they look like:

A genome browser view of the Photinus pyralis mitochondrial genomes, plus some example reads

What you see above is a genome browser, which is a software program that lets you view genomes and the data that relates to them in a graphical way.  The area that says "16 kb" above is actually a linear representation of our draft 16.341 kb (kilo-base pairPhotinus pyralis mitochondrial genome.  You may recall that mitochondria are the "power-houses of the cell", but beyond being power-houses they also have their own circular genome independent from the main linear genome contained in the chromosomes!  If you were to zoom in close enough, the genome browser would show the actual sequence information, but from this zoomed out level only the length bar is shown.  What I've done above is "mapped" the PacBio & Illumina reads previously mentioned to the draft mitochondrial genome (I cheated a bit, I chose those reads specifically *because* they aligned to the mitochondrial genome).  The reads are now shown lined up to the location on the mitochondrial genome that they are most similar to in DNA sequence, and therefore is the location they most likely physically came from before getting sequenced.

You'll notice that the short reads (Illumina) are totally gray, which means that all 200 base pairs match what is found on the mitochondrial genome exactly.  On the other hand, the long read (PacBio) looks more festive, with a lot of color.  Each small colored line in the figure above represents a single base change from the mitochondrial genome, so where the mitochondrial genome might be "ATCG", the PacBio read may be "ATTG", a single base change away.  Some of these differences might be due to genetic differences (the draft mitochondrial genome & the PacBio data came from different individuals), or from sequencing errors. 

Let's take a look at *all* the mapped reads and see if we can figure out what is going on:

The whole mitochondrial genome, now with more reads!

Okay!  So this plot seems a lot busier, but it is fundamentally the same things: reads mapped to the mitochondrial genome.  Now there are just lots and lots more reads.  Perhaps you're starting to get a feel for how genomic data falls into the "big data" category? From a quick glance, the top reads (Illumina), are shorter, and a lot less colorful, representing few changes from the draft mitochondrial genome.  This is due to the lower error rate of Illumina reads, at around 0.2% (2 errors per 1000).  On the other hand, for the bottom reads (PacBio), the error rate is around 13% (~1 error per 10).  This error rate seems pretty high (and it is), but in practice, PacBio errors are totally random, meaning that if you have 20X coverage, the probability of having an error in every single read for the same nucleotide would be 0.13^(20) or, 1.9*10^(-19), or, pretty small!  Therefore the "consensus accuracy" of PacBio sequencing ends up being the best of any sequencing technology.

Just as a reminder, "coverage" is the number of reads that map to a particular nucleotide (it is also known as "sequencing depth").  In the case of the Illumina data, from the data above you can see that we have an incredible ~20,000X coverage over the whole mitochondrial genome, whereas in the case of the PacBio data there is a much more reasonable, but still quite high, ~1500X coverage.  You might recall that way back at the beginning of this post I said the PacBio data was ~20X coverage of the firefly genome, so how did the mitochondrial genome get 1500X coverage?  Recall that there are ~hundreds of mitochondria per cell, each carrying ~1-10 copies of its own mitochondrial genome.  Compare that to the lonely 2 copies of the Photinus pyralis genome that is contained in the chromosomes of the nucleus of the cell, and it starts to become clear why we ended up with so much data from the mitochondria genome.  There are roughly 1000 times more copies of mitochondrial DNA to start with!

But look carefully: there is something else going on in the PacBio data; it looks like sometimes the same base shows up in almost all the reads!  With 1500X coverage, the chance of this being a random sequencing error is 0, and we can interpret these differences as real genetic changes between the mitochondrial draft genome and the fireflies used for the PacBio sequencing. 

So, even by just looking at the mitochondrial genome, we can learn something about the genetic diversity of fireflies, which would help us understand how Photinus pyralis fireflies spread across North America, determine how robust their populations are, and help advise conservation efforts.

Sounds nice, you might be thinking. But what about the reference quality firefly genome?  That is the goal of the project right?  Indeed!  In fact, the Photinus pyralis mitochondrial genome demonstrates the kind of assembly issues we have run into with our existing Photinus pyralis short-read datasets, and why we needed your support to obtain long-read PacBio sequencing!  Read on to find out more:

Unlike a fair puzzle, genomes (even tiny mitochondrial genomes) contain repetitive elements which hamper reassembly.  Instead of being a nice linear puzzle (like putting back together a shredded book), genomes in practice turn out to be more like hairballs, as with short reads many different places in the genome look quite similar (or exactly the same).  The trick is that the very long PacBio reads are like using very large puzzle pieces, so instead of pieces being repetitive, they actually become unique, making the puzzle problem that much easier.  Imagine that instead of shredding a book into individual words (impossible to reassemble), it was shredded into overlapping paragraphs.

Let's start with the mitochondrial genome.  As the ~17.5 kilo base-pair Photinus pyralis mitochondrial genome only represents ~0.0004% of the 436 million base pair main firefly genome, and is circular and self-contained, it should be straightforward to completely assemble with short-reads alone right?  Not exactly.  Team firefly tried to assemble the firefly mitochondrial genome with our existing short read data, and we were able to produce something that was correct, but incomplete.  Despite our best efforts, we couldn't turn it into a full-length mitochondrial genome.  

A genome browser view: Zoomed into the left side of the mitochondrial genome

First off, we knew there was potentially some trouble with our draft mitochondrial genome by comparison to existing data.  Our draft mitochondrial genome is 16,341 base pairs, whereas the most closely related previously sequenced firefly mitochondrial genome is 17,739 base pairs, suggesting that our draft might be missing ~1000 base pairs.  

Secondly, if you look at the "Illumina sequencing coverage" plot above, you see something strange starts to happen as you get closer to the left side.  You'll notice there is a sudden jump up in the coverage, and then it continues to increases to ~2x times the coverage of the rest of the mitochondrial genome.  This is a bad sign, as coverage shouldn't jump (this signals a technical issue, such as having the wrong sequence), and the fact that it increases to ~2X the rest of the coverage signals that the region below it may be repetitive, and was artificially compressed into a single region.

If we look at the PacBio data, although the coverage doesn't change much, there starts to be a lot of purple indicators in that region.  Those purple indicators signal the presence of insertions, which says there is additional sequence data contained in the PacBio reads which isn't present in the existing mitochondrial genome assembly.  The interpretation is clear: the information that was missing in the short reads is contained in the PacBio reads!  With a little bit of tweaking, this set of PacBio data will allow us to produce a reference quality Photinus pyralis mitochondrial genome.

As for the rest of the firefly genome?  This PacBio data will help there too, but in that case we don't have 1500X coverage, and will have consider how we utilize the data more carefully. Team firefly is currently working away considering many strategies to use this data and bring the firefly genome to life.  Stay tuned for more!

Please wait...