Firefly genome update: Mitochondrial genome deciphered!
Dear firefly genome fans,
Happy Friday! To celebrate, team firefly would like to share our latest progress on the firefly genome with you.
In our last lab note, we took our first look at the real data from the long-read PacBio sequencing that your support enabled. In that note, we were working with a draft of the Photinus pyralis mitochondrial genome, and although we knew our draft mitochondrial genome was incomplete, it was clear the PacBio data contained the information we needed to complete the mitochondrial genome that the short reads just didn't have. So how did it turn out? I'm happy to report that the Photinus pyralis mitochondrial genome is now fully deciphered! What does it look like? Take a look below!

So, what are you looking at here exactly? Recall that unlike the linear chromosomes in the nucleus of the cell, the mitochondrial genome is actually a continuous circle of DNA. The plot above shows all 17,082 base pairs in that native circular orientation. Within those 17,082 base pairs, there are 13 protein-coding genes (in green), 2 ribosomal RNA genes (in blue), and 23 transfer RNA genes (in orange). But what are all those mitochondrial genes doing?
Remember how the mitochondrion is the powerhouse of the cell? It seems like a cute analogy, but it has real truth! Did you know there are actual electrical currents and spinning mechanical motors in real mitochondria? These currents and pumps are called the electron transport chain, and this chain is what harnesses the energy from the food that we eat, in the form of the energy storage molecule ATP, every second of every day.
Fireflies are the same way: they need that mitochondrial energy to get through the day. More than that, the bioluminescent chemical reaction that fireflies control to produce light uses ATP directly! So, given the important role the mitochondria play, perhaps it's not surprising that the genes in the mitochondrial genome are "powerhouse" genes; that is, they code for the molecular machines and wires that make up the electron transport chain.
So, "nad" in the plot above stands for NADH dehydrogenase (part of Complex I of the electron transport chain), "cox" stands for cytochrome c oxidase (part of Complex IV), and "cob" stands for cytochrome b (part of Complex III). Together, the multiple proteins in Complexes I, II, III, and IV make up the wire for the electrons removed from the food that we eat, and "atp" (ATP synthase) is the mechanical motor that makes ATP from the energy released by those flowing electrons (check out this video to see it in action!). But does it take only 13 proteins to make the mitochondria a powerhouse? Not even close. A mitochondrion is made up of thousands of different proteins, and the genes for the majority of them are stored in the main genome in the cell nucleus (attentive readers may have noted that none of the Complex II genes were listed as present in the mitochondrial genome). But a select few of those genes are finicky enough that they can't bear to be away from the powerhouse, and therefore are still stored right next to the action. These are the 13 genes we see in the firefly mitochondrial genome.
But what are those features in gray? It turns out there are two non-gene regions: the AT-rich region (so named because it has a lot of 'A' & 'T' base pairs), which is believed to be where DNA replication initiates for the mitochondrial genome, and the "tandem repeat unit", or TRU, which is a repetitive element that has been reported in other firefly mitochondrial genomes. Between you and me, the TRU of Photinus pyralis (but not other fireflies) seems to be a duplication of the tryptophan tRNA, but at this point it is unclear what that means, if anything.
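To make "AT-rich" a bit more concrete, here is a minimal Python sketch of how one might scan a circular genome for its most AT-rich stretch. The function names and the toy sequence are made up for illustration (this is not firefly data, and not necessarily how the annotation tools find the region); the one idea borrowed from the note is that the genome is a circle, so a window is allowed to wrap around the end.

```python
def at_fraction(seq):
    """Fraction of bases in seq that are A or T."""
    seq = seq.upper()
    return sum(base in "AT" for base in seq) / len(seq)


def most_at_rich_window(genome, window):
    """Return (start, AT fraction) of the most AT-rich window on a circular genome."""
    n = len(genome)
    # Append the first window-1 bases so windows can wrap past the end of the circle.
    circular = genome + genome[:window - 1]
    best_start, best_frac = 0, -1.0
    for start in range(n):
        frac = at_fraction(circular[start:start + window])
        if frac > best_frac:  # strict ">" keeps the earliest window on ties
            best_start, best_frac = start, frac
    return best_start, best_frac


# Toy example (hypothetical sequence): an AT-only stretch flanked by GC.
toy_genome = "GCGCGCATATATATGCGCGC"
print(most_at_rich_window(toy_genome, 8))  # (6, 1.0)
```

On a real genome you would pick a much larger window, and the right size is a judgment call.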

This TRU repetitive element consists of 12 copies of a 76 bp repeat, plus 36 bp of a partial repeat, making an 871 bp long repetitive element altogether! Remember from the last lab note the problematic repetitive region in the draft mitochondrial genome? The one that the short reads couldn't assemble? Yep, turns out that problematic region was the TRU. How did the PacBio reads handle it? Perfectly. It turns out that there was a 3733 bp long PacBio read which spanned the whole TRU region. See data from the actual read below. If you look carefully, you can actually see the repetitive TRU region!
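Why can a long read solve a repeat that short reads can't? A rough way to think about it: to place a repeat unambiguously, a single read has to cover the whole repeat plus a bit of unique sequence on each side. Here is a tiny sketch of that check. The 871 bp and 3733 bp figures come from this note; the 300 bp short-read length and the 25 bp anchor are assumptions chosen only for illustration, and the function name is hypothetical.

```python
def spans_repeat(read_length, repeat_length, min_anchor):
    """True if a read can cover the whole repeat plus min_anchor unique bases on each side."""
    return read_length >= repeat_length + 2 * min_anchor


TRU_LENGTH = 871         # bp, from the note
CCS_READ_LENGTH = 3733   # bp, from the note
SHORT_READ_LENGTH = 300  # bp, assumed short-read length (illustrative)
MIN_ANCHOR = 25          # bp of unique flanking sequence per side (assumed)

print(spans_repeat(SHORT_READ_LENGTH, TRU_LENGTH, MIN_ANCHOR))  # False
print(spans_repeat(CCS_READ_LENGTH, TRU_LENGTH, MIN_ANCHOR))    # True
```

Real assemblers are far more sophisticated than this, but the length argument is the heart of it.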

By replacing the incorrect TRU region in the draft mitochondrial genome with that PacBio read, a complete circular mitochondrial genome was produced. But wait, you might ask. Didn't I say that PacBio sequencing has a 13% error rate? Over 3733 base pairs in the read above, wouldn't that mean ~500 errors would be introduced? In fact, yes, that was a concern. But, there is a clever trick of PacBio sequencing called "circular consensus sequencing", which solves this error problem. I'll explain:

In a nutshell, all PacBio sequencing reads are actually coming off the circular DNA molecules like the one shown above. The DNA polymerase, shown as the bubbles in gray, is the "magic" tool that produces the sequence data (this video explains how the magic actually works).
The yellow parts represent one strand of the DNA double helix, whereas the purple parts represent the complementary DNA strand. Together the yellow and purple strands are the actual DNA that was once sitting inside a firefly cell, but through the science of DNA extraction & "PacBio library preparation", that DNA was converted into the circular form above, with single-stranded DNA adaptor sequences (shown in green) joining the ends. In this form the DNA is ready to go on the PacBio instrument. Once on the instrument, the DNA polymerase goes around this circle in a single direction, producing data from a single strand at a time. The "polymerase read" represents the first version of the data coming off the instrument, and includes information on single strands (yellow and purple), as well as adaptors (green).
The trick is, for the data in the polymerase read, the error rate is ~13%, which isn't too great. However, the polymerase might keep going around the circle and pass over the yellow and purple strands multiple times, and since the adaptor sequence is known, each pass can be separated out of the polymerase read as a "subread". Although the single-pass error rate is ~13%, the errors are random, so by combining the overlapping subreads they largely cancel out, and you can get very low consensus error rates.
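If you'd like to see the "random errors cancel out" idea in action, here is a toy Python simulation. It is only a sketch under loud assumptions: errors are modeled as substitutions only (real PacBio errors are mostly insertions and deletions), the adaptor is read without errors, the adaptor sequence is made up, and the consensus is a simple column-by-column majority vote rather than PacBio's actual CCS algorithm. All function names are hypothetical.

```python
import random
from collections import Counter

# Made-up adaptor sequence for illustration only (not the real PacBio adaptor).
ADAPTOR = "TCTCTCAACAACAACGGAGGAGGAGGAAAAGAGAGAGA"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def add_substitution_errors(seq, rate, rng):
    """Replace each base with a different random base with probability `rate`."""
    noisy = []
    for base in seq:
        if rng.random() < rate:
            noisy.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            noisy.append(base)
    return "".join(noisy)


def simulate_polymerase_read(insert, passes, rate, rng):
    """Alternate strands around the circle, joined by (error-free) adaptors."""
    pieces = []
    for i in range(passes):
        strand = insert if i % 2 == 0 else reverse_complement(insert)
        pieces.append(add_substitution_errors(strand, rate, rng))
    return ADAPTOR.join(pieces)


def split_subreads(polymerase_read):
    """Cut at the known adaptor and flip every second subread back to the first strand."""
    raw = polymerase_read.split(ADAPTOR)
    return [s if i % 2 == 0 else reverse_complement(s) for i, s in enumerate(raw)]


def consensus(subreads):
    """Column-by-column majority vote (works here because all subreads have equal length)."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*subreads))


def count_mismatches(a, b):
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)
insert = "".join(rng.choice("ACGT") for _ in range(3733))  # random stand-in, not firefly DNA
read = simulate_polymerase_read(insert, passes=9, rate=0.13, rng=rng)
subreads = split_subreads(read)

print("errors in first subread:", count_mismatches(subreads[0], insert))
print("errors in consensus:   ", count_mismatches(consensus(subreads), insert))
```

With 9 passes at a 13% substitution rate, the first subread should carry hundreds of errors while the consensus carries very few, which is the whole point of going around the circle more than once.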
In this case, the read I used above to solve the TRU repetitive region in the firefly mitochondrial genome went around the circle 5 times (~9.5 subreads), which when integrated using PacBio's "circular consensus sequencing" methodology, gave the whole "CCS" read a negligible error rate. Out of the 3733 bases of the final CCS read, there are no errors. For comparison, at the 0.2% error rate of Illumina sequencing, an equivalently long Illumina read would have ~7 errors. So this PacBio CCS read is over 10 times longer than the longest Illumina read, with a lower error rate. Not bad, right?
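For anyone who wants to check the back-of-the-envelope numbers, the expected number of errors in a read is just its length times the per-base error rate. A minimal sketch, using the read length and error rates quoted in this note (the function name is hypothetical):

```python
def expected_errors(read_length, error_rate):
    """Expected number of erroneous bases, assuming a uniform per-base error rate."""
    return read_length * error_rate


READ_LENGTH = 3733  # bp, length of the CCS read from the note

print(expected_errors(READ_LENGTH, 0.13))   # single PacBio pass: roughly 485
print(expected_errors(READ_LENGTH, 0.002))  # Illumina-like rate: roughly 7.5
```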
But before you go off telling your friends how amazing PacBio sequencing is, keep in mind that reads like this are a very small proportion of the data in a 20 kb+ long-insert library (like the one we prepared here). By my calculation, something like 1/10,000th of the data. But in this case, this lucky CCS read turned out to be the key tool to help us decipher the firefly mitochondrial genome.
That's all for now, hope you enjoyed this update! Team firefly still has a lot of work on the nuclear genome. In fact, we've ordered 30X more PacBio sequencing to give us the best chance at a reference quality nuclear genome. Stay tuned!
(mitochondrial gene annotation performed by the MITOS server: http://mitos.bioinf.uni-leipzi...)
(mitochondrial figure drawn with CIRCOS: http://circos.ca and Inkscape https://inkscape.org/en/)
(Full sequence, annotation, and description of assembly method of the Photinus pyralis mitochondrial genome will be published with the main firefly genome manuscript. Those who would like to work with the pre-publication data are encouraged to get in touch with any of the team firefly members to discuss)