Bioinformatic analyses and population controls
Following aDNA sequencing, the raw reads must be processed computationally before they can be interpreted. Bioinformatic analyses piece together the sequenced aDNA fragments by mapping individual reads to a reference genome. The nucleotide sequences obtained via shotgun sequencing can also be identified by matching them against existing sequence databases, a procedure commonly called ‘blasting’ after the Basic Local Alignment Search Tool (BLAST). Reference genomes from related species are, however, required to detect endogenous DNA and exclude contaminating sequences. A more closely related reference not only recovers more sequences for analysis but also gives a more complete picture of the ancient genome, because it avoids a bias against highly diverged regions. Conversely, when no comparative sequences from close relatives are available, the value of a genome project on an extinct species is limited, as sequence comparison is restricted to genomic regions that are sufficiently conserved for ancient DNA sequences to be detected reliably.
Given the influence of contamination on the reliability of DNA sequences obtained from ancient samples, appropriate analytical and population controls are required when analysing extracted aDNA sequences. Analytical controls include software such as mapDamage (Jónsson et al., 2013), which computes nucleotide misincorporation and fragmentation patterns from next-generation sequencing (NGS) reads mapped against a reference genome, and EAGER, which performs quality control, mapping, authentication, contamination estimation and genotyping of NGS data. The EAGER pipeline incorporates methods for paired-end read merging, duplicate removal and mapping that are tailored to improve the analysis output of aDNA projects. The PALEOMIX pipeline likewise supports the quantification of post-mortem DNA damage through standard misincorporation and fragmentation patterns. When several genomes are available, PALEOMIX can also reconstruct maximum likelihood phylogenomic trees and so reveal the phylogenetic relationships among taxa. Finally, metaBIT is an integrative and automated metagenomic pipeline for analysing microbial profiles from high-throughput sequencing (HTS) shotgun data. It can also be used to monitor laboratory contamination and to detect microbial species, including pathogens.
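The misincorporation patterns mentioned above can be illustrated with a minimal Python sketch. It is not the mapDamage implementation: it assumes each read has already been aligned to the reference without gaps, and the function name, parameters and toy sequences are hypothetical. It tallies, for each position from the 5' end, the fraction of reference cytosines that appear as thymine in the read, which is the signal typically elevated at the ends of authentic aDNA fragments.

```python
from collections import Counter


def ct_frequency_by_position(pairs, max_pos=5):
    """Return the C->T mismatch frequency per position from the 5' end.

    `pairs` is an iterable of (read, reference) strings of equal length,
    assumed to be gapless alignments (an assumption of this sketch).
    Positions at which no reference C was observed yield None.
    """
    c_total = Counter()  # reference C count per position
    c_to_t = Counter()   # reference C read as T, per position
    for read, ref in pairs:
        for i, (read_base, ref_base) in enumerate(zip(read[:max_pos], ref[:max_pos])):
            if ref_base == "C":
                c_total[i] += 1
                if read_base == "T":
                    c_to_t[i] += 1
    return [c_to_t[i] / c_total[i] if c_total[i] else None for i in range(max_pos)]


# Toy, made-up alignments: (read, reference)
toy_pairs = [
    ("TTGA", "CTGA"),
    ("TAGC", "CAGC"),
    ("CAGT", "CAGT"),
    ("GCTA", "GCTA"),
]
ct_profile = ct_frequency_by_position(toy_pairs, max_pos=4)
print(ct_profile)
```

A profile in which the frequency is highest at the first position and falls off further into the read is consistent with post-mortem deamination, whereas a flat profile would be more typical of modern contaminating DNA. Real tools additionally handle gaps, strand orientation and the complementary G-to-A signal at the 3' end.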
Population controls involve the comparison of aDNA sequences to databases specific to the geographic region or temporal period from which the aDNA derives. Because one would expect to recover aDNA from a specific set of indigenous vertebrate and plant species, validating the sequence reads requires population controls tailored to the local (i.e. southern African) environment. This can be achieved via comparison with databases such as the International Barcode of Life (IBOL) and the African Centre for DNA Barcoding (ACDB). African and non-African human single-nucleotide polymorphism (SNP) data are available from the 1000 Genomes Project website (Auton et al., 2015) and the Online Ancient Genome Repository (OAGR). These resources enable the comparison of human aDNA sequences with known San hunter-gatherer and Bantu-speaking agro-pastoralist sequences. Some alleles have been linked to the San lifestyle, such as a VDR allele associated with higher bone mineral density, UGT1A3 (associated with increased metabolism of endo- and xenobiotics) and ACTN3 (associated with increased sprint and power performance). Others, such as the Bantu-speaker Duffy null (DARC) malaria-resistance allele, the European-derived lactase persistence allele and the SLC24A5 allele (associated with light-coloured skin), are indicative of geographically foreign human populations.
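As a rough illustration of a faunal population control, the Python sketch below splits taxon assignments into those found on a list of regionally expected species and those that are not. Everything here is an assumption made for illustration: the function name, the expected-species list and the read counts are hypothetical, and in practice the list would be compiled from regional reference databases such as those named above.

```python
def flag_unexpected_taxa(read_assignments, expected_taxa):
    """Split per-taxon read counts into expected and unexpected groups.

    `read_assignments` maps a taxon name to the number of reads assigned
    to it; `expected_taxa` is a set of names considered plausible for the
    region and period (a hypothetical, user-supplied list).
    """
    expected, unexpected = {}, {}
    for taxon, n_reads in read_assignments.items():
        target = expected if taxon in expected_taxa else unexpected
        target[taxon] = n_reads
    return expected, unexpected


# Made-up read counts for illustration only
toy_assignments = {
    "Syncerus caffer": 120,
    "Connochaetes taurinus": 45,
    "Gallus gallus": 8,
}
# Hypothetical list of regionally expected taxa
toy_expected = {"Syncerus caffer", "Connochaetes taurinus"}

expected_hits, unexpected_hits = flag_unexpected_taxa(toy_assignments, toy_expected)
print("expected:", expected_hits)
print("needs scrutiny:", unexpected_hits)
```

An unexpected taxon is not proof of contamination; it marks reads that warrant closer inspection, for example by checking their damage patterns.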