Computational analysis of genetic variation

Field, Matthew Arnell

Date

2015

Authors

Field, Matthew Arnell

Abstract

High-throughput sequencers are generating increasingly detailed catalogues of genetic variation, both in human disease and within the larger population. To make full use of this rich data set, the discipline requires robust, flexible, and reproducible analysis pipelines capable of accurately detecting and prioritising variants. While data-specific computational algorithms aimed at deriving accurate data from these technologies have reached maturity, two major challenges remain before we can realise the goal of elucidating the underlying genetic causes of disease as a means of developing custom treatment options. The first challenge is the creation of high-throughput variant detection pipelines able to reliably detect sample variation from a variety of sequence data types. Such a system needs to be scalable, flexible, robust, highly automated, and reproducible in order to support both default and custom variant detection workflows. The second challenge is the effective prioritisation of the huge number of variants detected in each sample, a task required to reduce the large search space for causal variants down to variant lists suitable for manual interrogation. This thesis presents six publications describing components of the larger informatics framework I have developed over the last four years to address these challenges, a framework designed from the outset to manage and process large data sets effectively, with the end goal of using computational analysis of sequence data to further understand the relationship between genetic variation and human disease. The first publication, “Reliably detecting clinically important variants requires both combined variant calls and optimized filtering strategies”, describes a variant detection strategy designed to minimise false negatives, as is desired when using patient variation data in the clinic.
The next four publications describe custom workflows developed for detecting variants in sequence data from different sample types, namely paired cancer samples (“Tumour procurement, DNA extraction, coverage analysis and optimisation of mutation-calling algorithms for human melanoma genomes”), pedigrees (“Reducing the search space for causal genetic variants with VASP: Variant Analysis of Sequenced Pedigrees”), mixed cell populations containing ultra-rare mutations (“DeepSNVMiner: A sequence analysis tool to detect emergent, rare mutations in sub-sets of cell populations”), and mouse exome data containing ENU-induced mutations (“Massively parallel sequencing of the mouse exome to accurately identify rare, induced mutations: an immediate source for thousands of new mouse models”). The last publication, “Comparison of predicted and actual consequences of missense mutations”, focuses on the validation of computational tools that predict the functional impact of missense mutations, and further attempts to explain why many missense mutations predicted to be damaging do not result in the observable phenotype that might be expected. Collectively, these publications detail efforts to reliably detect and prioritise variants across a wide variety of data types, all built on the underlying software framework I have developed to better elucidate the link between genetic variation and disease.