Making Mitochondrial Haplogroup and DNA Sequence Predictions from Low-Density Genotyping data

Drummond, Emma and Clancy, David and Knight, Joanne (2021) Making Mitochondrial Haplogroup and DNA Sequence Predictions from Low-Density Genotyping data. Masters thesis, Lancaster University.


The mitochondrial genome (mtDNA) is inherited differently from, and mutates more frequently than, the genetic material residing in the cell’s nucleus. Whilst the mitochondrial genome is small, at only 16.5 kilobases, it contains key components of the metabolic chain, and must communicate in a precise and timely way with the genes in the nuclear genome and sense the minute-to-minute needs of its host cell. MtDNA is an underexplored place to search for health-related variants. Unlike whole genome sequencing, which is time-consuming and expensive, genotyping examines only selected positions in the genome, allowing imputation of the other variants typically linked to these positions. Current imputation methods, which use nuclear genome data to model their predictions, are not tailored to take advantage of the different inheritance patterns of the mtDNA. I present a novel two-stage method, referred to as in silico genotyping and barcode matching, which uses an open-source library of fully sequenced mtDNA samples with manually assigned haplogroups to take genotyping data and predict the other variants present in the sample’s mtDNA sequence. The method has been assessed for performance on a test data set to explore inconsistencies across the mitochondrial genome and the human mtDNA phylogeny. The first use of in silico genotyping and barcode matching is presented, extending the use of UKBiobank’s data [22]. The UKBiobank data set is not only rich in detail but also covers a large population of individuals, aged between 51 and 84 in 2021. The phenotypic data is health-focussed and includes general health records, which are being augmented by new diagnoses or events in the participants’ medical history. Extensive use is being made of the data in UKBiobank, with the exception of the mitochondrial DNA (mtDNA). The scale of the phenotypic data collected by the UKBiobank is proving a valuable resource, valued all the more because of the difficulty and expense of its collection.
Making further use of this phenotyping by extending potential associations into the mtDNA is vital, and likely to offer substantial rewards. Using the method described below to transform genotypes into predicted mtDNA sequence opens the door for mitochondrial variation to be put to considerable use too. The introduction presents evidence that: (a) the mitochondrion is essential for cell and organism function, (b) mtDNA can harbour variations associated with phenotypes, and (c) the current methods of mtDNA imputation can be improved upon. The method presented mimics any genotyping microarray to produce a library of data transformed to appear as if it had been genotyped by the physical array. The effectiveness and accuracy of this transformation have been investigated and the results are presented. Finally, the transformed library is used to make predictions from the UKBiobank participant data, greatly extending a data set with huge reserves of potential, especially for mitochondrial data. My development of in silico genotyping and barcode matching has allowed me to make weighted predictions for test samples, predicting their haplogroups and the variants they carry. Whilst I acknowledge that there is significant potential to improve the algorithms, the overall accuracy of these predictions is high enough to search for links between UKBiobank samples and their phenotypic data in a GWAS-style search.
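
The two stages described above can be illustrated with a minimal sketch. This is not the thesis implementation: the toy library, the array positions, the function names, the exact-match rule and the frequency-based weighting are all assumptions made for illustration. Stage one reduces each fully sequenced library sample to the alleles a hypothetical array would report (in silico genotyping). Stage two looks up a genotyped sample's barcode and weights each haplogroup and allele by its frequency among the matching library samples (barcode matching).

```python
from collections import Counter, defaultdict

# Hypothetical toy library: sample id -> (haplogroup, {position: allele}).
# Positions, alleles and haplogroup labels are invented for illustration.
LIBRARY = {
    "lib1": ("H", {73: "A", 150: "C", 263: "G", 750: "G"}),
    "lib2": ("H", {73: "A", 150: "C", 263: "G", 750: "A"}),
    "lib3": ("U", {73: "G", 150: "T", 263: "G", 750: "G"}),
    "lib4": ("J", {73: "G", 150: "C", 263: "A", 750: "G"}),
}

# Positions probed by a hypothetical low-density array (assumption).
ARRAY_POSITIONS = (73, 150)


def in_silico_genotype(sequence, positions):
    """Stage 1: keep only the alleles the array would report ("N" if missing)."""
    return tuple(sequence.get(pos, "N") for pos in positions)


def build_barcode_index(library, positions):
    """Group library samples by their in silico barcode."""
    index = defaultdict(list)
    for sample_id, (_haplogroup, sequence) in library.items():
        index[in_silico_genotype(sequence, positions)].append(sample_id)
    return index


def predict(barcode, index, library):
    """Stage 2: weight haplogroups and alleles by frequency among exact matches.

    Returns None when no library sample shares the barcode. Exact matching
    and frequency weighting are simplifying assumptions of this sketch.
    """
    matches = index.get(barcode, [])
    if not matches:
        return None
    n = len(matches)
    hap_counts = Counter(library[s][0] for s in matches)
    haplogroup_weights = {hap: count / n for hap, count in hap_counts.items()}
    allele_counts = defaultdict(Counter)
    for s in matches:
        for pos, allele in library[s][1].items():
            allele_counts[pos][allele] += 1
    variant_weights = {
        pos: {allele: count / n for allele, count in counts.items()}
        for pos, counts in allele_counts.items()
    }
    return haplogroup_weights, variant_weights


index = build_barcode_index(LIBRARY, ARRAY_POSITIONS)
print(predict(("A", "C"), index, LIBRARY))
```

In this toy setting, a barcode shared by several library samples yields a certain call at positions where they agree and a split weight where they differ. A real pipeline would also need to handle missing calls, near matches and the structure of the mtDNA phylogeny, none of which this sketch attempts.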

Item Type:
Thesis (Masters)
Keywords:
mitochondria, mtDNA, genotypes, imputation, R software
Deposited On:
06 Aug 2021 08:20
Last Modified:
16 Jul 2024 05:56