7 results for Nonparametric discriminant analysis
at Duke University
Abstract:
We study the problem of supervised linear dimensionality reduction, taking an information-theoretic viewpoint. The linear projection matrix is designed by maximizing the mutual information between the projected signal and the class label. By harnessing a recent theoretical result on the gradient of mutual information, the above optimization problem can be solved directly using gradient descent, without requiring simplification of the objective function. Theoretical analysis and empirical comparison are made between the proposed method and two closely related methods, and comparisons are also made with a method in which Rényi entropy is used to define the mutual information (in this case the gradient may be computed simply, under a special parameter setting). Relative to these alternative approaches, the proposed method achieves promising results on real datasets.
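As a concrete illustration of the objective (though not of the paper's analytic-gradient method), the sketch below maximizes a kNN-based estimate of I(w^T x; y) over a unit-norm projection w using finite-difference gradient ascent; the dataset, step sizes, and MI estimator are illustrative assumptions.

```python
# Minimal sketch (not the paper's algorithm): maximize a kNN-based estimate
# of I(w^T x; y) over a unit-norm direction w by finite-difference gradient
# ascent. The paper instead evaluates the mutual-information gradient
# analytically; dataset and step sizes here are arbitrary.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

def mi_of_projection(w, X, y):
    z = X @ w                      # 1-D projection of the signal
    return mutual_info_classif(z.reshape(-1, 1), y, random_state=0)[0]

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
w = rng.standard_normal(X.shape[1])
w /= np.linalg.norm(w)

step, eps = 0.5, 1e-2
for _ in range(50):
    grad = np.array([(mi_of_projection(w + eps * e, X, y)
                      - mi_of_projection(w - eps * e, X, y)) / (2 * eps)
                     for e in np.eye(X.shape[1])])
    w += step * grad
    w /= np.linalg.norm(w)         # stay on the unit sphere

print("direction:", np.round(w, 3), " MI estimate:", mi_of_projection(w, X, y))
```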
Abstract:
BACKGROUND: Nonparametric Bayesian techniques have been developed recently to extend the sophistication of factor models, allowing one to infer the number of appropriate factors from the observed data. We consider such techniques for sparse factor analysis, with application to gene-expression data from three virus challenge studies. Particular attention is placed on employing the Beta Process (BP), the Indian Buffet Process (IBP), and related sparseness-promoting techniques to infer a proper number of factors. The posterior density function on the model parameters is computed using Gibbs sampling and variational Bayesian (VB) analysis. RESULTS: Time-evolving gene-expression data are considered for respiratory syncytial virus (RSV), rhinovirus, and influenza, using blood samples from healthy human subjects. These data were acquired in three challenge studies, each executed after receiving institutional review board (IRB) approval from Duke University. Comparisons are made between several alternative means of performing nonparametric factor analysis on these data, with comparisons as well to sparse-PCA and Penalized Matrix Decomposition (PMD), closely related non-Bayesian approaches. CONCLUSIONS: Applying the Beta Process to the factor scores, or to the singular values of a pseudo-SVD construction, the proposed algorithms infer the number of factors in gene-expression data. For real data the "true" number of factors is unknown; in our simulations we consider a range of noise variances, and the proposed Bayesian models inferred the number of factors accurately relative to other methods in the literature, such as sparse-PCA and PMD. We have also identified a "pan-viral" factor of importance for each of the three viruses considered in this study. We have identified a set of genes associated with this pan-viral factor, of interest for early detection of such viruses based upon the host response, as quantified via gene-expression data.
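For intuition about how such priors let the data determine the number of factors, here is a minimal sketch of drawing a binary factor-activation matrix from the Indian Buffet Process; the concentration parameter alpha and the customer/dish framing follow the standard construction and are not code from the study.

```python
# Sketch of the Indian Buffet Process: the number of active factors
# (columns of the binary mask Z) is not fixed a priori but grows with
# the data, which is how the prior "infers" a proper number of factors.
import numpy as np

def sample_ibp(n_customers, alpha, rng):
    dishes = []  # one list per dish (factor); entries mark which customers took it
    for i in range(1, n_customers + 1):
        for col in dishes:
            col.append(rng.random() < sum(col) / i)  # popular dishes re-sampled w.p. m_k/i
        for _ in range(rng.poisson(alpha / i)):      # plus Poisson(alpha/i) new dishes
            dishes.append([False] * (i - 1) + [True])
    Z = np.zeros((n_customers, len(dishes)), dtype=int)
    for k, col in enumerate(dishes):
        Z[:, k] = col
    return Z

Z = sample_ibp(20, alpha=2.0, rng=np.random.default_rng(1))
print("binary loading mask shape:", Z.shape)  # column count = sampled factor count
```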
Abstract:
As more diagnostic testing options become available to physicians, it becomes more difficult to combine various types of medical information together in order to optimize the overall diagnosis. To improve diagnostic performance, here we introduce an approach to optimize a decision-fusion technique to combine heterogeneous information, such as from different modalities, feature categories, or institutions. For classifier comparison we used two performance metrics: the area under the receiver operating characteristic (ROC) curve (AUC) and the normalized partial area under the curve (pAUC). This study used four classifiers: linear discriminant analysis (LDA), an artificial neural network (ANN), and two variants of our decision-fusion technique, AUC-optimized (DF-A) and pAUC-optimized (DF-P) decision fusion. We applied each of these classifiers with 100-fold cross-validation to two heterogeneous breast cancer data sets: one of mass lesion features and a much more challenging one of microcalcification lesion features. For the calcification data set, DF-A outperformed the other classifiers in terms of AUC (p < 0.02) and achieved AUC = 0.85 +/- 0.01. The DF-P surpassed the other classifiers in terms of pAUC (p < 0.01) and reached pAUC = 0.38 +/- 0.02. For the mass data set, DF-A outperformed both the ANN and the LDA (p < 0.04) and achieved AUC = 0.94 +/- 0.01. Although for this data set there were no statistically significant differences among the classifiers' pAUC values (pAUC = 0.57 +/- 0.07 to 0.67 +/- 0.05, p > 0.10), the DF-P did significantly improve specificity versus the LDA at both 98% and 100% sensitivity (p < 0.04). In conclusion, decision fusion directly optimized clinically significant performance measures, such as AUC and pAUC, and sometimes outperformed two well-known machine-learning techniques when applied to two different breast cancer data sets.
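The fusion idea can be sketched compactly. The toy below combines the scores of an LDA and an ANN with a convex weight chosen by grid search to maximize AUC, then reports a partial AUC; the data are synthetic, the weight search merely stands in for the paper's DF-A/DF-P optimization, and sklearn's max_fpr-based pAUC is McClish-standardized, which may differ from the paper's normalization.

```python
# Hedged sketch of score-level decision fusion optimized for AUC.
# In practice the fusion weight would be selected inside cross-validation,
# as in the paper's 100-fold setup; here it is chosen on one split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

s_lda = LinearDiscriminantAnalysis().fit(Xtr, ytr).predict_proba(Xte)[:, 1]
s_ann = MLPClassifier(max_iter=2000, random_state=0).fit(Xtr, ytr).predict_proba(Xte)[:, 1]

auc, w = max((roc_auc_score(yte, w * s_lda + (1 - w) * s_ann), w)
             for w in np.linspace(0, 1, 101))
fused = w * s_lda + (1 - w) * s_ann
print("fused AUC %.3f at weight %.2f; pAUC(FPR<0.1) %.3f"
      % (auc, w, roc_auc_score(yte, fused, max_fpr=0.1)))
```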
Abstract:
BACKGROUND: To our knowledge, the antiviral activity of pegylated interferon alfa-2a has not been studied in participants with untreated human immunodeficiency virus type 1 (HIV-1) infection but without chronic hepatitis C virus (HCV) infection. METHODS: Untreated HIV-1-infected volunteers without HCV infection received 180 microg of pegylated interferon alfa-2a weekly for 12 weeks. Changes in plasma HIV-1 RNA load, CD4+ T cell counts, pharmacokinetics, pharmacodynamic measurements of 2',5'-oligoadenylate synthetase (OAS) activity, and induction levels of interferon-inducible genes (IFIGs) were measured. Nonparametric statistical analysis was performed. RESULTS: Eleven participants completed 12 weeks of therapy. The median plasma viral load decrease and change in CD4+ T cell counts at week 12 were 0.61 log10 copies/mL (90% confidence interval [CI], 0.20-1.18 log10 copies/mL) and -44 cells/microL (90% CI, -95 to 85 cells/microL), respectively. There was no correlation between plasma viral load decreases and concurrent pegylated interferon plasma concentrations. However, participants with larger increases in OAS level exhibited greater decreases in plasma viral load at weeks 1 and 2 (r = -0.75 [90% CI, -0.93 to -0.28] and r = -0.61 [90% CI, -0.87 to -0.09], respectively; estimated Spearman rank correlation). Participants with higher baseline IFIG levels had smaller week 12 decreases in plasma viral load (0.66 log10 copies/mL [90% CI, 0.06-0.91 log10 copies/mL]), whereas those with larger IFIG induction levels exhibited larger decreases in plasma viral load (-0.74 log10 copies/mL [90% CI, -0.93 to -0.21 log10 copies/mL]). CONCLUSION: Pegylated interferon alfa-2a was well tolerated and exhibited statistically significant anti-HIV-1 activity in HIV-1-monoinfected patients. The anti-HIV-1 effect correlated with OAS protein levels (weeks 1 and 2) and IFIG induction levels (week 12) but not with pegylated interferon concentrations.
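As a rough illustration of the kind of nonparametric analysis reported (not the study's data or code), the following computes a Spearman rank correlation with a bootstrap 90% confidence interval on synthetic placeholder values.

```python
# Illustration only: Spearman rank correlation with a bootstrap 90% CI,
# on synthetic placeholders (n = 11, matching the cohort size), not the
# study's measurements.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
oas_change = rng.normal(size=11)                               # stand-in OAS changes
vl_drop = -0.7 * oas_change + rng.normal(scale=0.5, size=11)   # stand-in viral-load changes

r = spearmanr(oas_change, vl_drop)[0]
idx = rng.integers(0, 11, size=(5000, 11))                     # bootstrap resamples
boots = [spearmanr(oas_change[i], vl_drop[i])[0] for i in idx]
lo, hi = np.percentile(boots, [5, 95])
print(f"Spearman r = {r:.2f}, 90% CI [{lo:.2f}, {hi:.2f}]")
```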
Abstract:
A shearing quotient (SQ) is a way of quantitatively representing the Phase I shearing edges on a molar tooth. Ordinary or phylogenetic least squares regression is fit to data on log molar length (independent variable) and log sum of measured shearing crests (dependent variable). The derived linear equation is used to generate an 'expected' shearing crest length from molar length of included individuals or taxa. Following conversion of all variables to real space, the expected value is subtracted from the observed value for each individual or taxon. The result is then divided by the expected value and multiplied by 100. SQs have long been the metric of choice for assessing dietary adaptations in fossil primates. Not all studies using SQ have used the same tooth position or crests, nor have all computed regression equations using the same approach. Here we focus on re-analyzing the data of one recent study to investigate the magnitude of effects of variation in 1) shearing crest inclusion, and 2) details of the regression setup. We assess the significance of these effects by the degree to which they improve or degrade the association between computed SQs and diet categories. Though altering regression parameters for SQ calculation has a visible effect on plots, numerous iterations of statistical analyses vary surprisingly little in the success of the resulting variables for assigning taxa to dietary preference. This is promising for the comparability of patterns (if not casewise values) in SQ between studies. We suggest that differences in apparent dietary fidelity of recent studies are attributable principally to tooth position examined.
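The SQ recipe described above is simple to state in code. The sketch below uses ordinary least squares on made-up illustrative measurements; the phylogenetic regression variant and the study's actual data are not reproduced.

```python
# Worked sketch of the shearing-quotient computation: regress log crest sum
# on log molar length, predict an expected crest length, back-transform to
# real space, then take the percent deviation. Values are illustrative.
import numpy as np

log_len = np.log(np.array([4.1, 5.0, 5.8, 6.7, 7.3]))      # log molar length (mm)
log_shear = np.log(np.array([6.0, 7.6, 8.3, 10.4, 10.9]))  # log summed crest length

slope, intercept = np.polyfit(log_len, log_shear, 1)        # ordinary least squares

observed = np.exp(log_shear)                   # back to real space
expected = np.exp(intercept + slope * log_len)
sq = 100.0 * (observed - expected) / expected  # shearing quotient per individual

for o, e, q in zip(observed, expected, sq):
    print(f"obs {o:5.2f}  exp {e:5.2f}  SQ {q:+6.2f}")
```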
Abstract:
This paper uses dynamic impulse response analysis to investigate the interrelationships among stock price volatility, trading volume, and the leverage effect. Dynamic impulse response analysis is a technique for analyzing the multi-step-ahead characteristics of a nonparametric estimate of the one-step conditional density of a strictly stationary process. The technique is the generalization to a nonlinear process of Sims-style impulse response analysis for linear models. In this paper, we refine the technique and apply it to a long panel of daily observations on the price and trading volume of four stocks actively traded on the NYSE: Boeing, Coca-Cola, IBM, and MMM.
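To give a flavor of the technique in a deliberately simplified form, the sketch below estimates a one-step conditional mean nonparametrically and differences multi-step profiles from a shocked and a baseline state; the paper's method iterates the full one-step conditional density of a multivariate process, which this univariate conditional-mean toy does not capture.

```python
# Stylized sketch of nonlinear impulse response analysis on a simulated
# univariate series: Nadaraya-Watson estimate of the one-step conditional
# mean, iterated forward from a baseline and a shocked initial state.
import numpy as np

rng = np.random.default_rng(0)
x = np.zeros(2000)                        # simulate a nonlinear AR(1)
for t in range(1, 2000):
    x[t] = 0.9 * np.tanh(x[t - 1]) + 0.3 * rng.standard_normal()

def cond_mean(x_prev, data, h=0.3):
    """Nadaraya-Watson estimate of E[x_t | x_{t-1} = x_prev]."""
    w = np.exp(-0.5 * ((data[:-1] - x_prev) / h) ** 2)
    return np.sum(w * data[1:]) / np.sum(w)

def profile(x0, steps=10):
    """Iterate the estimated conditional mean forward from state x0."""
    path = [x0]
    for _ in range(steps):
        path.append(cond_mean(path[-1], x))
    return np.array(path)

irf = profile(1.0) - profile(0.0)         # response to a unit shock at time 0
print(np.round(irf, 3))
```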
Abstract:
Bayesian nonparametric models, such as the Gaussian process and the Dirichlet process, have been extensively applied for target kinematics modeling in various applications including environmental monitoring, traffic planning, endangered species tracking, dynamic scene analysis, autonomous robot navigation, and human motion modeling. As shown by these successful applications, Bayesian nonparametric models are able to adjust their complexity adaptively from data as necessary, and are resistant to overfitting or underfitting. However, most existing works assume that the sensor measurements used to learn the Bayesian nonparametric target kinematics models are obtained a priori or that the target kinematics can be measured by the sensor at any given time throughout the task. Little work has been done on controlling a sensor with a bounded field of view to obtain measurements of mobile targets that are most informative for reducing the uncertainty of the Bayesian nonparametric models. To present a systematic sensor planning approach to learning Bayesian nonparametric models, the Gaussian process target kinematics model is introduced first; it is capable of describing time-invariant spatial phenomena, such as ocean currents, temperature distributions, and wind velocity fields. The Dirichlet process-Gaussian process target kinematics model is subsequently discussed for modeling mixtures of mobile targets, such as pedestrian motion patterns.
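As a small illustration of the first model class, the sketch below fits a Gaussian process to a synthetic one-dimensional, time-invariant field; the library, kernel, and data are illustrative assumptions, not the dissertation's implementation.

```python
# Sketch: Gaussian process regression on a synthetic time-invariant
# spatial field (e.g., current speed as a function of position).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
pos = rng.uniform(0, 10, size=(30, 1))                        # measurement locations
speed = np.sin(pos).ravel() + 0.1 * rng.standard_normal(30)  # noisy field values

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0) + WhiteKernel(0.01))
gp.fit(pos, speed)

query = np.linspace(0, 10, 5).reshape(-1, 1)
mean, std = gp.predict(query, return_std=True)
print(np.round(mean, 2), np.round(std, 2))  # posterior mean and uncertainty
```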
Novel information-theoretic functions are developed for these Bayesian nonparametric target kinematics models to represent the expected utility of measurements as a function of sensor control inputs and random environmental variables. A Gaussian process expected Kullback-Leibler (KL) divergence is developed as the expectation of the KL divergence between the current (prior) and posterior Gaussian process target kinematics models with respect to the future measurements. This approach is then extended to develop a new information value function that can be used to estimate target kinematics described by a Dirichlet process-Gaussian process mixture model. A theorem is proposed that shows the novel information-theoretic functions are bounded. Based on this theorem, efficient estimators of the new information-theoretic functions are designed, which are proved to be unbiased, with the variance of the resultant approximation error decreasing linearly as the number of samples increases. Computational complexities for optimizing the novel information-theoretic functions under sensor dynamics constraints are studied, and are proved to be NP-hard. A cumulative lower bound is then proposed to reduce the computational complexity to polynomial time.
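The building block of the GP expected KL divergence can be sketched directly: the KL divergence between two multivariate Gaussians has a closed form, evaluated here for a toy GP prior versus its posterior after one hypothetical measurement; the outer expectation over future measurements, and the sensor model, are omitted.

```python
# Closed-form KL between multivariate Gaussians, applied to a toy GP
# prior vs. posterior at three query points after one measurement.
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    """KL( N(mu0,S0) || N(mu1,S1) )."""
    k = len(mu0)
    S1_inv = np.linalg.inv(S1)
    d = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + d @ S1_inv @ d - k
                  + np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S0)[1])

def rbf(a, b, ls=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

xs = np.array([0.0, 0.5, 1.0])          # query points for the field
xm = np.array([0.4])                    # candidate measurement location
K_ss = rbf(xs, xs) + 1e-9 * np.eye(3)   # prior covariance at query points
K_sm = rbf(xs, xm)
K_mm = rbf(xm, xm) + 0.1 * np.eye(1)    # measurement noise variance 0.1

y = np.array([0.7])                     # hypothetical measurement value
mu_post = K_sm @ np.linalg.solve(K_mm, y)
S_post = K_ss - K_sm @ np.linalg.solve(K_mm, K_sm.T)

print("KL(posterior || prior) =", kl_gauss(mu_post, S_post, np.zeros(3), K_ss))
```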
Three sensor planning algorithms are developed according to the assumptions on the target kinematics and the sensor dynamics. For problems where the control space of the sensor is discrete, a greedy algorithm is proposed. The efficiency of the greedy algorithm is demonstrated by a numerical experiment with ocean-current data obtained by moored buoys. A sweep line algorithm is developed for applications where the sensor control space is continuous and unconstrained. Synthetic simulations as well as physical experiments with ground robots and a surveillance camera are conducted to evaluate the performance of the sweep line algorithm. Moreover, a lexicographic algorithm is designed based on the cumulative lower bound of the novel information-theoretic functions, for the scenario where the sensor dynamics are constrained. Numerical experiments with real data collected from indoor pedestrians by a commercial pan-tilt camera are performed to examine the lexicographic algorithm. Results from both the numerical simulations and the physical experiments show that the three sensor planning algorithms proposed in this dissertation based on the novel information-theoretic functions are superior at learning the target kinematics with little or no prior knowledge.
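A minimal sketch of the greedy strategy for a discrete control space follows; info_value is a hypothetical stand-in for the estimated information-theoretic functions developed above, not the dissertation's estimator.

```python
# Hedged sketch of greedy sensor planning over a discrete control space:
# at each step, evaluate every admissible control and take the best one.
import numpy as np

def greedy_plan(controls, horizon, info_value):
    """Pick one control per step, greedily maximizing information value."""
    plan = []
    for _ in range(horizon):
        best = max(controls, key=lambda u: info_value(u, plan))
        plan.append(best)
    return plan

# Toy example: controls are pan angles for a pan-tilt camera.
controls = list(np.linspace(-60, 60, 7))

def info_value(u, plan):
    # hypothetical stand-in: diminishing returns near previously chosen angles
    return min([abs(u - v) for v in plan], default=90.0)

print(greedy_plan(controls, horizon=3, info_value=info_value))
```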