2 resultados para learning to program
em CaltechTHESIS
Resumo:
In the first part of the thesis we explore three fundamental questions that arise naturally when we conceive a machine learning scenario where the training and test distributions can differ. Contrary to conventional wisdom, we show that in fact mismatched training and test distribution can yield better out-of-sample performance. This optimal performance can be obtained by training with the dual distribution. This optimal training distribution depends on the test distribution set by the problem, but not on the target function that we want to learn. We show how to obtain this distribution in both discrete and continuous input spaces, as well as how to approximate it in a practical scenario. Benefits of using this distribution are exemplified in both synthetic and real data sets.
In order to apply the dual distribution in the supervised learning scenario where the training data set is fixed, it is necessary to use weights to make the sample appear as if it came from the dual distribution. We explore the negative effect that weighting a sample can have. The theoretical decomposition of the use of weights regarding its effect on the out-of-sample error is easy to understand but not actionable in practice, as the quantities involved cannot be computed. Hence, we propose the Targeted Weighting algorithm that determines if, for a given set of weights, the out-of-sample performance will improve or not in a practical setting. This is necessary as the setting assumes there are no labeled points distributed according to the test distribution, only unlabeled samples.
Finally, we propose a new class of matching algorithms that can be used to match the training set to a desired distribution, such as the dual distribution (or the test distribution). These algorithms can be applied to very large datasets, and we show how they lead to improved performance in a large real dataset such as the Netflix dataset. Their computational complexity is the main reason for their advantage over previous algorithms proposed in the covariate shift literature.
In the second part of the thesis we apply Machine Learning to the problem of behavior recognition. We develop a specific behavior classifier to study fly aggression, and we develop a system that allows analyzing behavior in videos of animals, with minimal supervision. The system, which we call CUBA (Caltech Unsupervised Behavior Analysis), allows detecting movemes, actions, and stories from time series describing the position of animals in videos. The method summarizes the data, as well as it provides biologists with a mathematical tool to test new hypotheses. Other benefits of CUBA include finding classifiers for specific behaviors without the need for annotation, as well as providing means to discriminate groups of animals, for example, according to their genetic line.
Resumo:
The first chapter of this thesis deals with automating data gathering for single cell microfluidic tests. The programs developed saved significant amounts of time with no loss in accuracy. The technology from this chapter was applied to experiments in both Chapters 4 and 5.
The second chapter describes the use of statistical learning to prognose if an anti-angiogenic drug (Bevacizumab) would successfully treat a glioblastoma multiforme tumor. This was conducted by first measuring protein levels from 92 blood samples using the DNA-encoded antibody library platform. This allowed the measure of 35 different proteins per sample, with comparable sensitivity to ELISA. Two statistical learning models were developed in order to predict whether the treatment would succeed. The first, logistic regression, predicted with 85% accuracy and an AUC of 0.901 using a five protein panel. These five proteins were statistically significant predictors and gave insight into the mechanism behind anti-angiogenic success/failure. The second model, an ensemble model of logistic regression, kNN, and random forest, predicted with a slightly higher accuracy of 87%.
The third chapter details the development of a photocleavable conjugate that multiplexed cell surface detection in microfluidic devices. The method successfully detected streptavidin on coated beads with 92% positive predictive rate. Furthermore, chambers with 0, 1, 2, and 3+ beads were statistically distinguishable. The method was then used to detect CD3 on Jurkat T cells, yielding a positive predictive rate of 49% and false positive rate of 0%.
The fourth chapter talks about the use of measuring T cell polyfunctionality in order to predict whether a patient will succeed an adoptive T cells transfer therapy. In 15 patients, we measured 10 proteins from individual T cells (~300 cells per patient). The polyfunctional strength index was calculated, which was then correlated with the patient's progress free survival (PFS) time. 52 other parameters measured in the single cell test were correlated with the PFS. No statistical correlator has been determined, however, and more data is necessary to reach a conclusion.
Finally, the fifth chapter talks about the interactions between T cells and how that affects their protein secretion. It was observed that T cells in direct contact selectively enhance their protein secretion, in some cases by over 5 fold. This occurred for Granzyme B, Perforin, CCL4, TNFa, and IFNg. IL- 10 was shown to decrease slightly upon contact. This phenomenon held true for T cells from all patients tested (n=8). Using single cell data, the theoretical protein secretion frequency was calculated for two cells and then compared to the observed rate of secretion for both two cells not in contact, and two cells in contact. In over 90% of cases, the theoretical protein secretion rate matched that of two cells not in contact.