Evaluating statistical and machine learning methods to predict risk of in-hospital child mortality in Uganda


Author(s): Nguyen, Grant
Contributor(s)

Flaxman, Abraham D

Date(s)

22/09/2016

01/08/2016

Abstract

Thesis (Master's)--University of Washington, 2016-08

Building on existing research on child mortality in Uganda, we used data from a six-hospital malaria surveillance system, including signs and symptoms of all admitted patients, testing results, treatments provided, diagnoses at admission and discharge, and patient outcomes. We tested the relative performance of five statistical and machine learning methods in predicting in-hospital mortality and extracted variable importance scores relating signs and symptoms, treatment, testing, and diagnosis to in-hospital mortality. To assess how well each method predicted in-hospital child mortality, we applied all of the methods within 10 repetitions of 10-fold cross-validation. Based on the predictions of each method on the held-out folds of the dataset, we used the area under the curve (AUC) to judge relative performance. We extracted variable importance scores for logistic regression, random forests, and gradient boosting machines, and ranked variable importance using inclusion and model coefficients for conditional inference trees and logistic regression. Overall, logistic regression, random forests, and gradient boosting machines significantly outperformed decision trees and conditional inference trees in predicting in-hospital mortality. Using only variables present at admission, logistic regression, random forests, and gradient boosting machines had AUC values of 0.83 (0.80-0.85), 0.82 (0.79-0.84), and 0.80 (0.78-0.83) respectively, compared to AUC values of 0.72 (0.70-0.75) and 0.72 (0.69-0.75) for decision trees and conditional inference trees. The top three methods by AUC correctly categorized 80% of in-hospital deaths while misclassifying 35% or fewer of eventual non-deaths as high-risk.
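The evaluation described above (repeated 10-fold cross-validation, AUC on held-out predictions, and the share of non-deaths flagged as high-risk when 80% of deaths are caught) can be illustrated with a minimal, dependency-free sketch. It assumes the AUC refers to the area under the ROC curve; the function names, the shuffling scheme, and the threshold search are illustrative assumptions and are not taken from the thesis.

```python
import random


def repeated_kfold_indices(n, k=10, repeats=10, seed=0):
    """Yield (repeat, fold, test_indices) for repeated k-fold cross-validation.

    Hypothetical helper: each repeat reshuffles the n row indices and
    deals them into k disjoint held-out folds.
    """
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        for f in range(k):
            yield r, f, idx[f::k]


def auc(labels, scores):
    """ROC AUC: probability that a random death (label 1) receives a higher
    score than a random non-death (label 0); ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def fpr_at_sensitivity(labels, scores, target=0.80):
    """Lowest false-positive rate among thresholds whose sensitivity
    (share of deaths scored at or above the threshold) reaches target."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    best = 1.0
    for t in sorted(set(scores)):
        tpr = sum(s >= t for s in pos) / len(pos)
        fpr = sum(s >= t for s in neg) / len(neg)
        if tpr >= target:
            best = min(best, fpr)
    return best
```

In such a setup, a model would be refit on the rows outside each held-out fold, and `auc` and `fpr_at_sensitivity` would be computed on the held-out predictions before summarizing across the 100 folds.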
Considering only variables available at admission, the following variables were important predictors of mortality across four or more methods: treatment at admission with paracetamol, admission with severe malaria, age, inability to sit, inability to drink, hospital site, deep breathing, number of diagnoses at admission, and difficulty breathing. While the top 10-15 variables were highly ranked across multiple methods, many lower-ranked variables were highlighted as important by only one or two methods. This study highlights the relative strength of logistic regression, random forests, and gradient boosting machines in predicting child mortality using a high-dimensional dataset. The variable importance scores largely confirm the results of previous studies on symptoms and signs related to mortality, and point to interesting relationships for future investigation and research. However, divergences in variable importance underscore the usefulness of applying multiple methods to identify variables that remain important across methods. Future directions for this work include applying ensemble models, further exploring key predictor variables, and extending this analytical framework to other clinical prediction settings.
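The idea of keeping variables that are important "across four or more methods" can be sketched as a simple tally over per-method rankings. This is only an illustration: the cutoff `top_k`, the function name, and the placeholder rankings are hypothetical and do not reproduce the thesis's data or its exact selection rule.

```python
from collections import Counter


def consensus_variables(rankings, top_k, min_methods):
    """Return variables that appear in the top_k of at least min_methods rankings.

    rankings maps a method name to its variables ordered from most to
    least important (an assumed input format).
    """
    counts = Counter()
    for ranked in rankings.values():
        counts.update(set(ranked[:top_k]))
    return sorted(v for v, c in counts.items() if c >= min_methods)


# Placeholder rankings with made-up variable names, for illustration only.
example_rankings = {
    "method_1": ["var_a", "var_b", "var_c"],
    "method_2": ["var_b", "var_a", "var_d"],
    "method_3": ["var_a", "var_c", "var_b"],
}
shared = consensus_variables(example_rankings, top_k=2, min_methods=2)
```

With five methods, setting `min_methods=4` would mirror the "four or more methods" criterion mentioned in the abstract.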

Format

application/pdf

Identifier

Nguyen_washington_0250O_16437.pdf

http://hdl.handle.net/1773/36994

Language(s)

en_US

Relation

Appendix.pdf; pdf; Appendix.

Keywords #child mortality #machine learning #risk score #Uganda #Public health #Biostatistics #global health
Type

Thesis