Data as of 12 August 2022

ISBN 978-3-8439-2725-3, series Statistik

Silke Janitza
Resampling Approaches in Biometrical Applications: Developments in Random Forests and in Bootstrap-based Procedures

185 pages, dissertation, Ludwig-Maximilians-Universität München (2016), softcover, A5

The first two parts of this thesis present new developments for random forests, a resampling-based method. A random forest is an ensemble of classification or regression trees, each built from a sample drawn either with or without replacement from the original data. While classification and regression problems have been investigated extensively with random forest methodology, there seems to be a lack of literature on ordinal regression problems, that is, problems in which the response categories have an inherent ordering. The first part of this thesis investigates whether incorporating the ordering information in random forests improves prediction and variable selection. The second part concerns random forest variable importance measures: the researcher faces the problem that there is no natural cutoff for importance scores that differentiates important from non-important variables. To address this, the thesis introduces a computationally fast heuristic variable importance test for high-dimensional data settings.

Other resampling approaches, which are based on the bootstrap, are investigated in the third and fourth parts of this thesis. These address, for example, stability investigations. Repeating the same analysis on a large number of data samples from the same data-generating process allows one to draw conclusions about how stable the results are under data perturbations. Since the data-generating process is unknown in practical applications, several authors have proposed using the bootstrap instead. However, applying the data analysis to bootstrap samples as if they were samples drawn from the true distribution might be misleading if the analysis includes hypothesis testing, or model selection steps that use information criteria or data splitting approaches. These two cases are addressed in the third and fourth parts of this thesis, respectively, and promising solutions are investigated.
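
To make the idea of a bootstrap-based stability investigation concrete, the following is a minimal sketch in Python. It is not taken from the thesis. The toy "analysis" (picking the predictor most correlated with the response), the synthetic data, and all function names are assumptions made purely for illustration. It shows the naive procedure, in which each bootstrap sample is treated as if it came from the true distribution. The abstract cautions that this can be misleading when the analysis involves hypothesis tests or model selection.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def select_predictor(rows):
    """Hypothetical toy analysis: name of the predictor most correlated with y."""
    y = [r["y"] for r in rows]
    names = [k for k in rows[0] if k != "y"]
    return max(names, key=lambda k: abs(pearson([r[k] for r in rows], y)))


def bootstrap_selection_frequencies(rows, n_boot=500, seed=0):
    """Repeat the analysis on n_boot bootstrap samples and report how often
    each predictor is selected (a simple measure of stability)."""
    rng = random.Random(seed)
    n = len(rows)
    counts = {k: 0 for k in rows[0] if k != "y"}
    for _ in range(n_boot):
        # Draw n observations with replacement from the original data.
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        counts[select_predictor(sample)] += 1
    return {k: c / n_boot for k, c in counts.items()}


# Synthetic data (assumption): y depends on x1 only, x2 is pure noise.
data_rng = random.Random(1)
data = []
for _ in range(100):
    x1, x2 = data_rng.gauss(0, 1), data_rng.gauss(0, 1)
    data.append({"x1": x1, "x2": x2, "y": x1 + data_rng.gauss(0, 1)})

freqs = bootstrap_selection_frequencies(data)
print(freqs)
```

A selection frequency close to 1 for one predictor suggests that the result of this toy analysis is stable under resampling. Frequencies spread across predictors would indicate instability.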