893 resultados para High dimensional regression


Relevância:

90.00% 90.00%

Publicador:

Resumo:

Thesis (Ph.D.)--University of Washington, 2016-08

Relevância:

90.00% 90.00%

Publicador:

Resumo:

This thesis is concerned with change point analysis for time series, i.e. with detection of structural breaks in time-ordered, random data. This long-standing research field regained popularity over the last few years and is still undergoing, as statistical analysis in general, a transformation to high-dimensional problems. We focus on the fundamental »change in the mean« problem and provide extensions of the classical non-parametric Darling-Erdős-type cumulative sum (CUSUM) testing and estimation theory within highdimensional Hilbert space settings. In the first part we contribute to (long run) principal component based testing methods for Hilbert space valued time series under a rather broad (abrupt, epidemic, gradual, multiple) change setting and under dependence. For the dependence structure we consider either traditional m-dependence assumptions or more recently developed m-approximability conditions which cover, e.g., MA, AR and ARCH models. We derive Gumbel and Brownian bridge type approximations of the distribution of the test statistic under the null hypothesis of no change and consistency conditions under the alternative. A new formulation of the test statistic using projections on subspaces allows us to simplify the standard proof techniques and to weaken common assumptions on the covariance structure. Furthermore, we propose to adjust the principal components by an implicit estimation of a (possible) change direction. This approach adds flexibility to projection based methods, weakens typical technical conditions and provides better consistency properties under the alternative. In the second part we contribute to estimation methods for common changes in the means of panels of Hilbert space valued time series. We analyze weighted CUSUM estimates within a recently proposed »high-dimensional low sample size (HDLSS)« framework, where the sample size is fixed but the number of panels increases. We derive sharp conditions on »pointwise asymptotic accuracy« or »uniform asymptotic accuracy« of those estimates in terms of the weighting function. Particularly, we prove that a covariance-based correction of Darling-Erdős-type CUSUM estimates is required to guarantee uniform asymptotic accuracy under moderate dependence conditions within panels and that these conditions are fulfilled, e.g., by any MA(1) time series. As a counterexample we show that for AR(1) time series, close to the non-stationary case, the dependence is too strong and uniform asymptotic accuracy cannot be ensured. Finally, we conduct simulations to demonstrate that our results are practically applicable and that our methodological suggestions are advantageous.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

In these last years a great effort has been put in the development of new techniques for automatic object classification, also due to the consequences in many applications such as medical imaging or driverless cars. To this end, several mathematical models have been developed from logistic regression to neural networks. A crucial aspect of these so called classification algorithms is the use of algebraic tools to represent and approximate the input data. In this thesis, we examine two different models for image classification based on a particular tensor decomposition named Tensor-Train (TT) decomposition. The use of tensor approaches preserves the multidimensional structure of the data and the neighboring relations among pixels. Furthermore the Tensor-Train, differently from other tensor decompositions, does not suffer from the curse of dimensionality making it an extremely powerful strategy when dealing with high-dimensional data. It also allows data compression when combined with truncation strategies that reduce memory requirements without spoiling classification performance. The first model we propose is based on a direct decomposition of the database by means of the TT decomposition to find basis vectors used to classify a new object. The second model is a tensor dictionary learning model, based on the TT decomposition where the terms of the decomposition are estimated using a proximal alternating linearized minimization algorithm with a spectral stepsize.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

The study of random probability measures is a lively research topic that has attracted interest from different fields in recent years. In this thesis, we consider random probability measures in the context of Bayesian nonparametrics, where the law of a random probability measure is used as prior distribution, and in the context of distributional data analysis, where the goal is to perform inference given avsample from the law of a random probability measure. The contributions contained in this thesis can be subdivided according to three different topics: (i) the use of almost surely discrete repulsive random measures (i.e., whose support points are well separated) for Bayesian model-based clustering, (ii) the proposal of new laws for collections of random probability measures for Bayesian density estimation of partially exchangeable data subdivided into different groups, and (iii) the study of principal component analysis and regression models for probability distributions seen as elements of the 2-Wasserstein space. Specifically, for point (i) above we propose an efficient Markov chain Monte Carlo algorithm for posterior inference, which sidesteps the need of split-merge reversible jump moves typically associated with poor performance, we propose a model for clustering high-dimensional data by introducing a novel class of anisotropic determinantal point processes, and study the distributional properties of the repulsive measures, shedding light on important theoretical results which enable more principled prior elicitation and more efficient posterior simulation algorithms. For point (ii) above, we consider several models suitable for clustering homogeneous populations, inducing spatial dependence across groups of data, extracting the characteristic traits common to all the data-groups, and propose a novel vector autoregressive model to study of growth curves of Singaporean kids. Finally, for point (iii), we propose a novel class of projected statistical methods for distributional data analysis for measures on the real line and on the unit-circle.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

This paper addresses robust model-order reduction of a high dimensional nonlinear partial differential equation (PDE) model of a complex biological process. Based on a nonlinear, distributed parameter model of the same process which was validated against experimental data of an existing, pilot-scale BNR activated sludge plant, we developed a state-space model with 154 state variables in this work. A general algorithm for robustly reducing the nonlinear PDE model is presented and based on an investigation of five state-of-the-art model-order reduction techniques, we are able to reduce the original model to a model with only 30 states without incurring pronounced modelling errors. The Singular perturbation approximation balanced truncating technique is found to give the lowest modelling errors in low frequency ranges and hence is deemed most suitable for controller design and other real-time applications. (C) 2002 Elsevier Science Ltd. All rights reserved.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

In this paper a complex-order van der Pol oscillator is considered. The complex derivative Dα±ȷβ , with α,β∈R + is a generalization of the concept of integer derivative, where α=1, β=0. By applying the concept of complex derivative, we obtain a high-dimensional parameter space. Amplitude and period values of the periodic solutions of the two versions of the complex-order van der Pol oscillator are studied for variation of these parameters. Fourier transforms of the periodic solutions of the two oscillators are also analyzed.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Many learning problems require handling high dimensional datasets with a relatively small number of instances. Learning algorithms are thus confronted with the curse of dimensionality, and need to address it in order to be effective. Examples of these types of data include the bag-of-words representation in text classification problems and gene expression data for tumor detection/classification. Usually, among the high number of features characterizing the instances, many may be irrelevant (or even detrimental) for the learning tasks. It is thus clear that there is a need for adequate techniques for feature representation, reduction, and selection, to improve both the classification accuracy and the memory requirements. In this paper, we propose combined unsupervised feature discretization and feature selection techniques, suitable for medium and high-dimensional datasets. The experimental results on several standard datasets, with both sparse and dense features, show the efficiency of the proposed techniques as well as improvements over previous related techniques.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Dissertação apresentada como requisito parcial para obtenção do grau de Mestre em Ciência e Sistemas de Informação Geográfica

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Feature discretization (FD) techniques often yield adequate and compact representations of the data, suitable for machine learning and pattern recognition problems. These representations usually decrease the training time, yielding higher classification accuracy while allowing for humans to better understand and visualize the data, as compared to the use of the original features. This paper proposes two new FD techniques. The first one is based on the well-known Linde-Buzo-Gray quantization algorithm, coupled with a relevance criterion, being able perform unsupervised, supervised, or semi-supervised discretization. The second technique works in supervised mode, being based on the maximization of the mutual information between each discrete feature and the class label. Our experimental results on standard benchmark datasets show that these techniques scale up to high-dimensional data, attaining in many cases better accuracy than existing unsupervised and supervised FD approaches, while using fewer discretization intervals.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

In machine learning and pattern recognition tasks, the use of feature discretization techniques may have several advantages. The discretized features may hold enough information for the learning task at hand, while ignoring minor fluctuations that are irrelevant or harmful for that task. The discretized features have more compact representations that may yield both better accuracy and lower training time, as compared to the use of the original features. However, in many cases, mainly with medium and high-dimensional data, the large number of features usually implies that there is some redundancy among them. Thus, we may further apply feature selection (FS) techniques on the discrete data, keeping the most relevant features, while discarding the irrelevant and redundant ones. In this paper, we propose relevance and redundancy criteria for supervised feature selection techniques on discrete data. These criteria are applied to the bin-class histograms of the discrete features. The experimental results, on public benchmark data, show that the proposed criteria can achieve better accuracy than widely used relevance and redundancy criteria, such as mutual information and the Fisher ratio.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Major in Competition and Regulation

Relevância:

80.00% 80.00%

Publicador:

Resumo:

In Portugal, about 20% of full-time workers are employed under a fixed-term contract. Using a rich longitudinal matched employer-employee dataset for Portugal, with more than 20 million observations and covering the 2002-2012 period, we confirm the common idea that fixed-term contracts are not desirable when compared to permanent ones, by estimating a conditional wage gap of -1.7 log points. Then, we evaluate the sources of that wage penalty by combining a three way high-dimensional fixed effects model with the decomposition of Gelbach (2014), in which the three dimensions considered are the worker’s unobserved ability, the firm’s compensation wage policy and the job title effect. It is shown that the average worker with a fixed-term contract is less productive than his/her permanent counterparts, explaining -3.92 log points of the FTC wage penalty. Additionally, the sorting of workers into lower-paid job titles is also responsible for -0.59 log points of the wage gap. Surprisingly, we found that the allocation of workers among firms mitigates the existing wage penalty (in 4.23 log points), as fixed-term workers are concentrated into firms with a more generous compensation policy. Finally, following Figueiredo et al. (2014), we further control for the worker-firm match characteristics and reach the conclusion that fixed-term employment relationships have an overrepresentation of low quality worker-firm matches, explaining 0.65 log points of the FTC wage penalty.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

We develop methods for Bayesian inference in vector error correction models which are subject to a variety of switches in regime (e.g. Markov switches in regime or structural breaks). An important aspect of our approach is that we allow both the cointegrating vectors and the number of cointegrating relationships to change when the regime changes. We show how Bayesian model averaging or model selection methods can be used to deal with the high-dimensional model space that results. Our methods are used in an empirical study of the Fisher effect.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

We develop methods for Bayesian inference in vector error correction models which are subject to a variety of switches in regime (e.g. Markov switches in regime or structural breaks). An important aspect of our approach is that we allow both the cointegrating vectors and the number of cointegrating relationships to change when the regime changes. We show how Bayesian model averaging or model selection methods can be used to deal with the high-dimensional model space that results. Our methods are used in an empirical study of the Fisher e ffect.