Into a Massive Amount of Data
We finally got the genome skimming data back, and I have started working with it. The first pass I am doing is called "reference guided assembly." My advisor for this project helped me download existing gene sequences for Mammillaria that had been uploaded to GenBank by earlier projects. GenBank is an amazing resource where scientists upload their genetic data, freely available to all. A couple of prior Mammillaria molecular phylogeny projects had used PCR to sequence the following plastid genome markers: matK, rpl16 and psbA-trnH.
![](https://d3t9s8cdqyboc5.cloudfront.net/images?path=126831/qZE1M0SoMpqF81fx4Asl_1bb01b8ee4547740888172bdbe7d2c9d--techno-studio-gear.jpg&width=650&height=)
About two years ago I downloaded all of that data, aligned the sequences and created an interleaved file of all three markers. We recovered preliminary phylogenies, and those results are what led us to pursue a much more exhaustive molecular analysis of Cochemiea and Mammillaria.
Reference guided assembly involves taking the raw reads from the genome skimming (up to 4.5 million reads per sample!) and mapping them against a reference copy of the gene one is interested in. So I am using matK, rpl16 and psbA-trnH from Mammillaria capensis from the GenBank data as the reference sequences, and mapping the reads from each of my new genome skimming samples against those to generate new consensus sequences for my samples. It's an automated process using a very powerful software package called Geneious.
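To give a feel for the idea, here is a toy Python sketch of reference guided consensus calling. It is only an illustration under simplifying assumptions and is not how Geneious works internally: reads are placed by a naive ungapped search for the offset with the fewest mismatches, and the consensus is a simple majority vote per position. The function names, the `max_mismatches` cutoff, and the short sequences are all made up for the example.

```python
from collections import Counter

def best_offset(read, reference):
    """Return (offset, mismatches) for the ungapped placement with the fewest mismatches."""
    best, best_mm = 0, len(read) + 1
    for off in range(len(reference) - len(read) + 1):
        window = reference[off:off + len(read)]
        mm = sum(a != b for a, b in zip(read, window))
        if mm < best_mm:
            best, best_mm = off, mm
    return best, best_mm

def consensus_from_reads(reads, reference, max_mismatches=2):
    """Pile reads onto the reference and take a majority vote per column."""
    piles = [Counter() for _ in reference]
    for read in reads:
        if len(read) > len(reference):
            continue  # cannot place a read longer than the reference
        off, mm = best_offset(read, reference)
        if mm > max_mismatches:
            continue  # hypothetical cutoff: read probably is not from this marker
        for i, base in enumerate(read):
            piles[off + i][base] += 1
    # Positions that no read covers are reported as N
    return "".join(p.most_common(1)[0][0] if p else "N" for p in piles)

# Made-up reference and reads; the sample differs from the reference at one site
reference = "ATGGCTAGCTAGGATCC"
reads = ["ATGGCTTGC", "GCTTGCTAGG", "TTGCTAGGAT", "CCCCCCCCCC"]
consensus = consensus_from_reads(reads, reference)
print(consensus)
```

The last read matches nowhere and is dropped, and the two positions at the end that no read covers come out as N. Real mappers also handle gaps, base quality scores and millions of reads, which is why dedicated software is needed for the actual data.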
After consensus sequences for all three markers are generated for all of my samples, they get automatically aligned with the existing GenBank alignments I have for these three markers. Then I'll create an interleaved file of the new sequences from the samples we are working with, and start to analyze the data using (probably) parsimony analysis as well as maximum likelihood. This approach has not appeared in much published research and has never been applied to Mammillaria, so it should be fascinating to see how the sequences derived from reference guided assembly compare to the GenBank sequences generated through PCR, and how the phylogenetic reconstructions and the bootstrap values for the branches compare.
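For the curious, bootstrap values come from rebuilding the tree many times on resampled versions of the alignment. The sketch below shows only the resampling step, assuming a small made-up alignment stored as a Python dictionary; the sample names, sequences and the count of 100 replicates are placeholders, and the tree building itself would be done in a phylogenetics program.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Build one pseudo-replicate by sampling alignment columns with replacement."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    # Apply the same column choices to every sequence so columns stay intact
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

# Made-up aligned sequences of equal length (placeholders, not real data)
alignment = {
    "sample_A": "ATGGCTTGCT",
    "sample_B": "ATGGCTAGCT",
    "sample_C": "ATGACTAGCA",
}

rng = random.Random(42)  # fixed seed so the example is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

A tree is then inferred from each replicate, and the bootstrap value for a branch is the percentage of replicate trees in which that branch appears.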
After this first pass using reference guided assembly, we will move on to whole genome approaches, so that the next steps will be more truly genomic analysis.
Meanwhile, the RADseq plate of Cochemiea halei population samples awaits processing and I'm digging deeply into the literature and talking with other researchers on how to best analyze the data that will come back from that.
It's exciting to be working with new technology to gain a deeper understanding of these endangered plants. Thank you again for supporting this project!