7 results for MULTIFACTOR-DIMENSIONALITY REDUCTION

at Duke University


Relevance:

80.00%

Publisher:

Abstract:

We study the problem of supervised linear dimensionality reduction, taking an information-theoretic viewpoint. The linear projection matrix is designed by maximizing the mutual information between the projected signal and the class label. By harnessing a recent theoretical result on the gradient of mutual information, the above optimization problem can be solved directly using gradient descent, without requiring simplification of the objective function. Theoretical analysis and empirical comparison are made between the proposed method and two closely related methods, and comparisons are also made with a method in which Rényi entropy is used to define the mutual information (in this case the gradient may be computed simply, under a special parameter setting). Relative to these alternative approaches, the proposed method achieves promising results on real datasets. Copyright 2012 by the author(s)/owner(s).
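
As a rough illustration of the approach described above, the sketch below learns a one-dimensional projection by maximizing an empirical mutual-information estimate between the projected signal and the class label. It uses a simple histogram plug-in estimate of mutual information and a derivative-free optimizer instead of the analytic mutual-information gradient used in the paper; the function names, toy data, and settings are illustrative assumptions, not the authors' implementation.

# Minimal sketch, assuming a histogram MI estimate stands in for the
# paper's gradient-based objective; everything here is illustrative.
import numpy as np
from scipy.optimize import minimize

def histogram_mi(z, y, bins=16):
    """Plug-in estimate of I(Z; Y) for a 1-D projection z and discrete labels y."""
    z_binned = np.digitize(z, np.histogram_bin_edges(z, bins=bins))
    joint = np.zeros((bins + 2, len(np.unique(y))))
    for zi, yi in zip(z_binned, y):
        joint[zi, yi] += 1
    joint /= joint.sum()
    pz = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (pz @ py)[nz])))

def learn_projection(X, y, bins=16, seed=0):
    """Find a unit vector w maximizing the MI estimate between X @ w and y."""
    rng = np.random.default_rng(seed)
    w0 = rng.normal(size=X.shape[1])
    obj = lambda w: -histogram_mi(X @ (w / np.linalg.norm(w)), y, bins)
    res = minimize(obj, w0, method="Nelder-Mead")
    return res.x / np.linalg.norm(res.x)

# Toy usage: two Gaussian classes separated along the first coordinate.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 5)), rng.normal([2, 0, 0, 0, 0], 1, (200, 5))])
y = np.repeat([0, 1], 200)
w = learn_projection(X, y)
print("learned direction:", np.round(w, 2))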

Relevance:

80.00%

Publisher:

Abstract:

Many modern applications fall into the category of "large-scale" statistical problems, in which both the number of observations n and the number of features or parameters p may be large. Many existing methods focus on point estimation, despite the continued relevance of uncertainty quantification in the sciences, where the number of parameters to estimate often exceeds the sample size even after the huge increases in the value of n typically seen in many fields. Thus, the tendency in some areas of industry to dispense with traditional statistical analysis on the basis that "n=all" is of little relevance outside of certain narrow applications. The main result of the Big Data revolution in most fields has instead been to make computation much harder without reducing the importance of uncertainty quantification. Bayesian methods excel at uncertainty quantification, but often scale poorly relative to alternatives. This conflict between the statistical advantages of Bayesian procedures and their substantial computational disadvantages is perhaps the greatest challenge facing modern Bayesian statistics, and is the primary motivation for the work presented here.

Two general strategies for scaling Bayesian inference are considered. The first is the development of methods that lend themselves to faster computation, and the second is design and characterization of computational algorithms that scale better in n or p. In the first instance, the focus is on joint inference outside of the standard problem of multivariate continuous data that has been a major focus of previous theoretical work in this area. In the second area, we pursue strategies for improving the speed of Markov chain Monte Carlo algorithms, and characterizing their performance in large-scale settings. Throughout, the focus is on rigorous theoretical evaluation combined with empirical demonstrations of performance and concordance with the theory.

One topic we consider is modeling the joint distribution of multivariate categorical data, often summarized in a contingency table. Contingency table analysis routinely relies on log-linear models, with latent structure analysis providing a common alternative. Latent structure models lead to a reduced rank tensor factorization of the probability mass function for multivariate categorical data, while log-linear models achieve dimensionality reduction through sparsity. Little is known about the relationship between these notions of dimensionality reduction in the two paradigms. In Chapter 2, we derive several results relating the support of a log-linear model to nonnegative ranks of the associated probability tensor. Motivated by these findings, we propose a new collapsed Tucker class of tensor decompositions, which bridge existing PARAFAC and Tucker decompositions, providing a more flexible framework for parsimoniously characterizing multivariate categorical data. Taking a Bayesian approach to inference, we illustrate empirical advantages of the new decompositions.
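
For readers unfamiliar with the latent class (PARAFAC) representation referred to above, the sketch below builds the probability tensor of a few categorical variables from latent-class parameters and checks that it is a valid pmf. The dimensions and parameter values are assumptions chosen for illustration; this is the standard PARAFAC form, not the collapsed Tucker decomposition proposed in Chapter 2.

# Minimal sketch of a latent class (PARAFAC) pmf for categorical data,
# with illustrative dimensions and randomly drawn parameters.
import numpy as np

rng = np.random.default_rng(0)
p, k, d = 3, 2, 4                              # 3 variables, 2 latent classes, 4 levels each

nu = rng.dirichlet(np.ones(k))                 # class weights nu_h
lam = rng.dirichlet(np.ones(d), size=(p, k))   # lam[j, h] = pmf of variable j within class h

# Build the full probability tensor: P(c1, c2, c3) = sum_h nu_h * prod_j lam[j, h, c_j]
P = np.zeros((d,) * p)
for h in range(k):
    factors = [lam[j, h] for j in range(p)]
    P += nu[h] * np.einsum("a,b,c->abc", *factors)

print("tensor sums to one:", np.isclose(P.sum(), 1.0))
print("nonnegative rank is at most k =", k)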

Latent class models for the joint distribution of multivariate categorical data, such as the PARAFAC decomposition, play an important role in the analysis of population structure. In this context, the number of latent classes is interpreted as the number of genetically distinct subpopulations of an organism, an important factor in the analysis of evolutionary processes and conservation status. Existing methods focus on point estimates of the number of subpopulations, and lack robust uncertainty quantification. Moreover, whether the number of latent classes in these models is even an identified parameter is an open question. In Chapter 3, we show that when the model is properly specified, the correct number of subpopulations can be recovered almost surely. We then propose an alternative method for estimating the number of latent subpopulations that provides good quantification of uncertainty, and provide a simple procedure for verifying that the proposed method is consistent for the number of subpopulations. The performance of the model in estimating the number of subpopulations and other common population structure inference problems is assessed in simulations and a real data application.

In contingency table analysis, sparse data is frequently encountered for even modest numbers of variables, resulting in non-existence of maximum likelihood estimates. A common solution is to obtain regularized estimates of the parameters of a log-linear model. Bayesian methods provide a coherent approach to regularization, but are often computationally intensive. Conjugate priors ease computational demands, but the conjugate Diaconis-Ylvisaker priors for the parameters of log-linear models do not give rise to closed form credible regions, complicating posterior inference. In Chapter 4 we derive the optimal Gaussian approximation to the posterior for log-linear models with Diaconis-Ylvisaker priors, and provide convergence rate and finite-sample bounds for the Kullback-Leibler divergence between the exact posterior and the optimal Gaussian approximation. We demonstrate empirically in simulations and a real data application that the approximation is highly accurate, even in relatively small samples. The proposed approximation provides a computationally scalable and principled approach to regularized estimation and approximate Bayesian inference for log-linear models.
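
A minimal sketch of the general idea of approximating a log-linear posterior by a Gaussian is given below. It uses a Laplace-style approximation (posterior mode and curvature) for a Poisson log-linear model under an assumed Gaussian prior, not the KL-optimal Gaussian approximation under Diaconis-Ylvisaker priors derived in Chapter 4; the toy table and design are illustrative.

# Laplace-style Gaussian approximation to a Poisson log-linear posterior;
# a stand-in sketch, not the thesis's KL-optimal construction.
import numpy as np

def gaussian_approx_loglinear(X, y, prior_prec=1.0, iters=25):
    """Poisson log-linear model y_i ~ Poisson(exp(x_i' theta)) with a
    N(0, prior_prec^{-1} I) prior. Returns the mode and covariance of a
    Gaussian approximation to the posterior."""
    theta = np.linalg.lstsq(X, np.log(y + 0.5), rcond=None)[0]   # rough starting value
    for _ in range(iters):                                       # Newton-Raphson on the log posterior
        mu = np.exp(X @ theta)
        grad = X.T @ (y - mu) - prior_prec * theta
        hess = -(X.T * mu) @ X - prior_prec * np.eye(X.shape[1])
        theta = theta - np.linalg.solve(hess, grad)
    mu = np.exp(X @ theta)
    hess = -(X.T * mu) @ X - prior_prec * np.eye(X.shape[1])
    return theta, np.linalg.inv(-hess)

# Toy 2x2 contingency table with an independence (intercept + row + column) design.
y = np.array([18.0, 7.0, 5.0, 30.0])                             # cell counts
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
mode, cov = gaussian_approx_loglinear(X, y)
print("approximate posterior mode:", np.round(mode, 2))
print("approximate posterior sd:  ", np.round(np.sqrt(np.diag(cov)), 2))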

Another challenging and somewhat non-standard joint modeling problem is inference on tail dependence in stochastic processes. In applications where extreme dependence is of interest, data are almost always time-indexed. Existing methods for inference and modeling in this setting often cluster extreme events or choose window sizes with the goal of preserving temporal information. In Chapter 5, we propose an alternative paradigm for inference on tail dependence in stochastic processes with arbitrary temporal dependence structure in the extremes, based on the idea that the information on strength of tail dependence and the temporal structure in this dependence are both encoded in waiting times between exceedances of high thresholds. We construct a class of time-indexed stochastic processes with tail dependence obtained by endowing the support points in de Haan's spectral representation of max-stable processes with velocities and lifetimes. We extend Smith's model to these max-stable velocity processes and obtain the distribution of waiting times between extreme events at multiple locations. Motivated by this result, a new definition of tail dependence is proposed that is a function of the distribution of waiting times between threshold exceedances, and an inferential framework is constructed for estimating the strength of extremal dependence and quantifying uncertainty in this paradigm. The method is applied to climatological, financial, and electrophysiology data.
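
The summary statistic at the heart of this paradigm, waiting times between exceedances of a high threshold, is straightforward to compute; the sketch below does so for a simulated autocorrelated series. The series, threshold, and summaries are illustrative assumptions and do not reproduce the max-stable velocity process model of Chapter 5.

# Waiting times between threshold exceedances in a simulated AR(1) series;
# illustrative only, not the thesis's inferential framework.
import numpy as np

rng = np.random.default_rng(0)
n, phi = 20000, 0.7
x = np.zeros(n)
for t in range(1, n):                           # AR(1) with heavy-tailed (t_3) noise
    x[t] = phi * x[t - 1] + rng.standard_t(df=3)

u = np.quantile(x, 0.99)                        # high threshold
exceed_times = np.flatnonzero(x > u)            # indices of threshold exceedances
waits = np.diff(exceed_times)                   # waiting times between exceedances

# Clustering of extremes shows up as an excess of very short waiting times
# relative to the roughly geometric waits an independent series would give.
print("number of exceedances:", exceed_times.size)
print("mean wait:", waits.mean().round(1), " median wait:", np.median(waits))
print("share of waits <= 5 steps:", np.mean(waits <= 5).round(3))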

The remainder of this thesis focuses on posterior computation by Markov chain Monte Carlo (MCMC), the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov transition kernel. Comparatively little attention has been paid to convergence and estimation error in these approximating Markov chains. In Chapter 6, we propose a framework for assessing when to use approximations in MCMC algorithms, and how much error in the transition kernel should be tolerated to obtain optimal estimation performance with respect to a specified loss function and computational budget. The results require only ergodicity of the exact kernel and control of the kernel approximation accuracy. The theoretical framework is applied to approximations based on random subsets of data, low-rank approximations of Gaussian processes, and a novel approximating Markov chain for discrete mixture models.
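
As a concrete example of an approximating transition kernel of the kind this framework covers, the sketch below runs a random-walk Metropolis step in which the log-likelihood is replaced by a scaled estimate from a random subset of the data. The model (normal mean with flat prior), subsample size, and step size are illustrative assumptions, not settings from Chapter 6.

# Approximate Metropolis-Hastings with a subsampled log-likelihood;
# an illustrative approximate kernel, not an exact-posterior sampler.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=100_000)     # N(theta, 1) observations, flat prior on theta

def subsampled_loglik(theta, subset):
    # Log-likelihood estimated from a random subset and scaled up to the full data size.
    return (data.size / subset.size) * np.sum(-0.5 * (subset - theta) ** 2)

def approx_mh(n_iter=5000, subsample=1000, step=0.01):
    theta, chain = 0.0, []
    for _ in range(n_iter):
        subset = rng.choice(data, size=subsample, replace=False)
        prop = theta + step * rng.normal()
        # Accept/reject using the same noisy subsampled log-likelihood for both
        # states; this defines an approximate transition kernel.
        if np.log(rng.uniform()) < subsampled_loglik(prop, subset) - subsampled_loglik(theta, subset):
            theta = prop
        chain.append(theta)
    return np.array(chain)

chain = approx_mh()
print("approximate posterior mean for theta:", chain[1000:].mean().round(3))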

Data augmentation Gibbs samplers are arguably the most popular class of algorithm for approximately sampling from the posterior distribution for the parameters of generalized linear models. The truncated Normal and Polya-Gamma data augmentation samplers are standard examples for probit and logit links, respectively. Motivated by an important problem in quantitative advertising, in Chapter 7 we consider the application of these algorithms to modeling rare events. We show that when the sample size is large but the observed number of successes is small, these data augmentation samplers mix very slowly, with a spectral gap that converges to zero at a rate at least proportional to the reciprocal of the square root of the sample size up to a log factor. In simulation studies, moderate sample sizes result in high autocorrelations and small effective sample sizes. Similar empirical results are observed for related data augmentation samplers for multinomial logit and probit models. When applied to a real quantitative advertising dataset, the data augmentation samplers mix very poorly. Conversely, Hamiltonian Monte Carlo and a type of independence chain Metropolis algorithm show good mixing on the same dataset.
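
For reference, the sketch below implements the standard truncated-normal data augmentation Gibbs sampler for probit regression (Albert and Chib, 1993) on simulated rare-event data, where the slow mixing described above shows up as high autocorrelation in the intercept chain. The simulation settings and prior are illustrative assumptions, not those of Chapter 7 or the advertising dataset.

# Truncated-normal data augmentation Gibbs sampler for probit regression,
# run on simulated data with few successes to mimic the rare-event regime.
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-2.5, 0.5])                       # rare-event intercept
y = (X @ beta_true + rng.normal(size=n) > 0).astype(int)

XtX_inv = np.linalg.inv(X.T @ X)                        # flat prior on beta
chol = np.linalg.cholesky(XtX_inv)

beta = np.zeros(2)
draws = []
for it in range(2000):
    # 1. Sample latent utilities z_i ~ N(x_i' beta, 1), truncated by y_i.
    mu = X @ beta
    lo = np.where(y == 1, -mu, -np.inf)
    hi = np.where(y == 1, np.inf, -mu)
    z = truncnorm.rvs(lo, hi, loc=mu, scale=1.0, random_state=rng)
    # 2. Sample beta | z from its Gaussian full conditional.
    beta = XtX_inv @ (X.T @ z) + chol @ rng.normal(size=2)
    draws.append(beta.copy())

draws = np.array(draws)
# Slow mixing shows up as high lag-1 autocorrelation in the intercept chain.
b0 = draws[500:, 0]
print("lag-1 autocorrelation of intercept draws:", np.corrcoef(b0[:-1], b0[1:])[0, 1].round(3))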

Relevance:

20.00%

Publisher:

Abstract:

AIM: To examine whether smokers who reduce their quantity of cigarettes smoked between two periods are more or less likely to quit subsequently. STUDY DESIGN: Data come from the Health and Retirement Study, a nationally representative survey of older Americans aged 51-61 in 1991, followed every 2 years from 1992 to 1998. The 2064 participants smoking at baseline and the first follow-up comprise the main sample. MEASUREMENTS: Smoking cessation by 1996 is examined as the primary outcome. A secondary outcome is relapse by 1998. Spontaneous changes in smoking quantity between the first two waves make up the key predictor variables. Control variables include gender, age, education, race, marital status, alcohol use, psychiatric problems, acute or chronic health problems and smoking quantity. FINDINGS: Large (over 50%) and even moderate (25-50%) reductions in quantity smoked between 1992 and 1994 prospectively predict an increased likelihood of cessation in 1996 compared to no change in quantity (OR 2.96, P<0.001 and OR 1.61, P<0.01, respectively). Additionally, those who reduced and then quit were somewhat less likely to relapse by 1998 than those who did not reduce in the 2 years prior to quitting. CONCLUSIONS: Successfully reducing the quantity of cigarettes smoked appears to have a beneficial effect on future cessation likelihood, even after controlling for initial smoking level and other variables known to impact smoking cessation. These results indicate that the harm reduction strategy of reduced smoking warrants further study.

Relevance:

20.00%

Publisher:

Abstract:

Men who have sex with men (MSM) represent more than half of all new HIV infections in the United States. Utilizing a collaborative, community based approach, a brief risk reduction intervention was developed and pilot tested among newly HIV-diagnosed MSM receiving HIV care in a primary care setting. Sixty-five men, within 3 months of diagnosis, were randomly assigned to the experimental condition or control condition and assessed at baseline, 3-month, and 6-month follow-up. Effect sizes were calculated to explore differences between conditions and over time. Results demonstrated the potential effectiveness of the intervention in reducing risk behavior, improving mental health, and increasing use of ancillary services. Process evaluation data demonstrated the acceptability of the intervention to patients, clinic staff, and administration. The results provide evidence that a brief intervention can be successfully integrated into HIV care services for newly diagnosed MSM and should be evaluated for efficacy.

Relevance:

20.00%

Publisher:

Abstract:

Bycatch reduction technology (BRT) modifies fishing gear to increase selectivity and avoid capture of non-target species, or to facilitate their non-lethal release. As a solution to fisheries-related mortality of non-target species, BRT is an attractive option; effectively implemented, BRT presents a technical 'fix' that can reduce pressure for politically contentious and economically detrimental interventions, such as fisheries closures. While a number of factors might contribute to effective implementation, our review of BRT literature finds that research has focused on technical design and experimental performance of individual technologies. In contrast, and with a few notable exceptions, research on the human and institutional context of BRT, and more specifically on how fishers respond to BRT, is limited. This is not to say that fisher attitudes are ignored or overlooked, but that incentives for fisher uptake of BRT are usually assumed rather than assessed or demonstrated. Three assumptions about fisher incentives dominate: (1) economic incentives will generate acceptance of BRT; (2) enforcement will generate compliance with BRT; and (3) 'participation' by fishers will increase acceptance and compliance, and overall support for BRT. In this paper, we explore evidence for and against these assumptions and situate our analysis in the wider social science literature on fisheries. Our goal is to highlight the need and suggest focal areas for further research. © Inter-Research 2008.

Relevance:

20.00%

Publisher:

Abstract:

Indoor residual spraying (IRS) has become an increasingly popular method of insecticide use for malaria control, and many recent studies have reported on its effectiveness in reducing malaria burden in a single community or region. There is a need for systematic review and integration of the published literature on IRS and the contextual determining factors of its success in controlling malaria. This study reports the findings of a meta-regression analysis based on 13 published studies, which were chosen from more than 400 articles through a systematic search and selection process. The summary relative risk for reducing malaria prevalence was 0.38 (95% confidence interval = 0.31-0.46), which indicated a risk reduction of 62%. However, an excessive degree of heterogeneity was found between the studies. The meta-regression analysis indicates that IRS is more effective with high initial prevalence, multiple rounds of spraying, use of DDT, and in regions with a combination of Plasmodium falciparum and P. vivax malaria.
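
As background on how a summary relative risk of this kind is formed, the sketch below pools a handful of made-up study-level relative risks with a DerSimonian-Laird random-effects model. The inputs are not the 13 studies analysed above, and the full analysis is a meta-regression with study-level covariates rather than this simple pooling step.

# DerSimonian-Laird random-effects pooling of log relative risks;
# the study values below are invented purely for illustration.
import numpy as np

log_rr = np.log(np.array([0.30, 0.45, 0.55, 0.25, 0.40]))   # per-study relative risks (made up)
se = np.array([0.15, 0.20, 0.25, 0.18, 0.22])               # per-study standard errors (made up)

# Between-study variance tau^2 via the DerSimonian-Laird moment estimator.
w_fixed = 1.0 / se**2
mu_fixed = np.sum(w_fixed * log_rr) / np.sum(w_fixed)
q = np.sum(w_fixed * (log_rr - mu_fixed) ** 2)
c = np.sum(w_fixed) - np.sum(w_fixed**2) / np.sum(w_fixed)
tau2 = max(0.0, (q - (len(log_rr) - 1)) / c)

# Random-effects summary RR and 95% confidence interval.
w = 1.0 / (se**2 + tau2)
mu = np.sum(w * log_rr) / np.sum(w)
se_mu = np.sqrt(1.0 / np.sum(w))
rr = np.exp(mu)
ci = np.exp([mu - 1.96 * se_mu, mu + 1.96 * se_mu])
print(f"summary RR = {rr:.2f} (95% CI {ci[0]:.2f}-{ci[1]:.2f}); risk reduction = {1 - rr:.0%}")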

Relevance:

20.00%

Publisher:

Abstract:

Vein grafting results in the development of intimal hyperplasia with accompanying changes in guanine nucleotide-binding (G) protein expression and function. Several serum mitogens that act through G protein-coupled receptors, such as lysophosphatidic acid, stimulate proliferative pathways that are dependent on the G protein betagamma subunit (Gbetagamma)-mediated activation of p21ras. This study examines the role of Gbetagamma signaling in intimal hyperplasia by targeting a gene encoding a specific Gbetagamma inhibitor in an experimental rabbit vein graft model. This inhibitor, the carboxyl terminus of the beta-adrenergic receptor kinase (betaARK(CT)), contains a Gbetagamma-binding domain. Vein graft intimal hyperplasia was significantly reduced by 37% (P<0.01), and physiological studies demonstrated that the normal alterations in G protein coupling phenotypically seen in this model were blocked by betaARK(CT) treatment. Thus, it appears that Gbetagamma-mediated pathways play a major role in intimal hyperplasia and that targeting inhibitors of Gbetagamma signaling offers novel intraoperative therapeutic modalities to inhibit the development of vein graft intimal hyperplasia and subsequent vein graft failure.