Data as of 13 March 2019
ISBN 978-3-8439-2709-3, series: Informatik (Computer Science)
Disease risk prediction in genome-wide association studies
130 pages, dissertation, Eberhard-Karls-Universität Tübingen (2016), softcover, A5
Genome-wide association studies (GWAS) have introduced a completely new approach for discovering genetic variants that are associated with observed phenotypes. In contrast to monogenic disorders, which are caused by mutations with large effects in a single gene, the underlying genetic cause of complex diseases is much more difficult to discover. When a disease is influenced by a large number of variants, each with a small effect size, much larger sample sizes are needed to identify associations with the phenotype using statistical methods.
Besides the identification of novel risk factors and the deciphering of underlying genetic mechanisms, the prediction of individual disease risk is gaining importance. This thesis focuses on such predictions using machine learning techniques.
To this end, a general machine learning workflow for creating and evaluating predictive models was adapted for use with GWAS data and implemented. This approach, based on the support vector machine (SVM), was then applied to multiple GWAS data sets of different diseases, including Parkinson's disease and type 1 diabetes. In addition to the predictability of individual disease risk, the role of rare and uncommon single nucleotide polymorphisms (SNPs) was investigated on these data sets and in a study using simulated data sets. Even though the predictive performance was not yet sufficient for practical use, the results remained stable when the models were validated on external data sets.
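As a rough illustration of what such a workflow can look like, the following sketch trains and evaluates an SVM on simulated genotypes. It is not the implementation from the thesis: the use of scikit-learn, the linear kernel, the AUC metric, the five-fold cross-validation, and the helper names `simulate_genotypes` and `cross_validated_auc` are all assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def simulate_genotypes(n_samples=200, n_snps=50, n_causal=5, seed=0):
    """Simulate minor-allele counts (0, 1, 2) and a binary case/control label.

    Hypothetical toy model: the first `n_causal` SNPs each add a small,
    equal amount of risk on the logit scale.
    """
    rng = np.random.default_rng(seed)
    maf = rng.uniform(0.05, 0.5, size=n_snps)           # minor allele frequencies
    X = rng.binomial(2, maf, size=(n_samples, n_snps))  # genotypes as allele counts
    effects = np.zeros(n_snps)
    effects[:n_causal] = 0.8
    score = X @ effects
    prob = 1.0 / (1.0 + np.exp(-(score - score.mean())))
    y = rng.binomial(1, prob)                           # 1 = case, 0 = control
    return X.astype(float), y


def cross_validated_auc(X, y, n_splits=5, seed=0):
    """Return the per-fold ROC AUC of a linear SVM (assumed settings)."""
    model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc")


if __name__ == "__main__":
    X, y = simulate_genotypes()
    aucs = cross_validated_auc(X, y)
    print("mean AUC over folds:", aucs.mean())
```

Scaling inside the pipeline keeps the preprocessing within each training fold, so the held-out fold does not leak into model fitting. Validation on an external data set, as described above, would additionally require fitting on one cohort and scoring on another.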
The use of SNP data with these algorithms requires a numerical encoding, and different encoding schemes are possible. Each encoding scheme implies a different genetic risk model, which can influence the performance of the machine learning algorithms. In an extensive study, we compared the effect of different feature encodings on seven different supervised learning algorithms using various disease data sets. A statistical evaluation using non-parametric tests showed a clear advantage of the additive encoding across all data sets and all tested algorithms.
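To make the idea of an encoding scheme concrete, here is a minimal sketch of how a genotype, given as the number of minor alleles (0, 1, or 2), might be mapped to a numerical feature. Only the additive encoding is named in the abstract; the dominant and recessive schemes are common alternatives included here as an assumption, and the function names are hypothetical.

```python
def encode_genotype(minor_allele_count, scheme="additive"):
    """Map a minor-allele count (0, 1, or 2) to a numerical feature.

    additive:  risk grows with each copy of the minor allele -> 0, 1, 2
    dominant:  one copy is enough to carry the risk          -> 0, 1, 1
    recessive: two copies are needed to carry the risk       -> 0, 0, 1
    """
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor_allele_count must be 0, 1, or 2")
    if scheme == "additive":
        return minor_allele_count
    if scheme == "dominant":
        return int(minor_allele_count >= 1)
    if scheme == "recessive":
        return int(minor_allele_count == 2)
    raise ValueError(f"unknown encoding scheme: {scheme!r}")


def encode_samples(genotypes, scheme="additive"):
    """Encode a list of samples, each a list of minor-allele counts."""
    return [[encode_genotype(g, scheme) for g in sample] for sample in genotypes]


if __name__ == "__main__":
    samples = [[0, 1, 2], [2, 2, 0]]
    for scheme in ("additive", "dominant", "recessive"):
        print(scheme, encode_samples(samples, scheme))
```

The three schemes differ only in how the heterozygous genotype (one minor allele) is treated, which is how each one implies a different genetic risk model.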
Using this encoding, we then compared the predictive performance of the algorithms themselves. While only two groups of algorithms with similar performance could be identified, the results suggest that the difference between algorithms partially depends on the number of available features.