48 results for High dimensional regression
at Universidad Politécnica de Madrid
Abstract:
Pragmatism is the leading motivation for regularization. We can understand regularization as a modification of the maximum-likelihood estimator so that a reasonable answer can be given in an unstable or ill-posed situation. To mention some typical examples, this happens when fitting parametric or non-parametric models with more parameters than data, or when estimating large covariance matrices. Regularization is also commonly used to improve the bias-variance tradeoff of an estimation. The definition of regularization is therefore quite general and, although the introduction of a penalty is probably the most popular type, it is just one of multiple forms of regularization. In this dissertation, we focus on the applications of regularization for obtaining sparse or parsimonious representations, where only a subset of the inputs is used. A particular form of regularization, L1-regularization, plays a key role in reaching sparsity. Most of the contributions presented here revolve around L1-regularization, although other forms of regularization are explored (also pursuing sparsity in some sense). In addition to presenting a compact review of L1-regularization and its applications in statistics and machine learning, we devise methodology for regression, supervised classification and structure induction of graphical models. Within the regression paradigm, we focus on kernel smoothing learning, proposing techniques for kernel design that are suitable for high-dimensional settings and sparse regression functions. We also present an application of regularized regression techniques for modeling the response of biological neurons. Supervised classification advances deal, on the one hand, with the application of regularization for obtaining a naïve Bayes classifier and, on the other hand, with a novel algorithm for brain-computer interface design that uses group regularization in an efficient manner.
Finally, we present a heuristic for inducing structures of Gaussian Bayesian networks using L1-regularization as a filter.
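As an aside for readers unfamiliar with the technique, the sparsifying effect of the L1 penalty that this dissertation revolves around can be sketched in a few lines; the data, the penalty strength, and the use of scikit-learn's Lasso below are illustrative choices, not the author's methodology:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 50                      # many candidate inputs
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [4.0, -3.0, 2.0]         # only 3 of the 50 inputs are truly relevant
y = X @ beta + 0.1 * rng.standard_normal(n)

model = Lasso(alpha=0.1).fit(X, y)  # alpha controls the L1 penalty strength
support = np.flatnonzero(model.coef_)
print(support)                      # the lasso zeroes out most coefficients
```

With a strong enough penalty, the estimated coefficient vector is sparse and the selected support recovers the relevant inputs.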
Abstract:
Multi-dimensional classification (MDC) is the supervised learning problem where an instance is associated with multiple classes, rather than with a single class, as in traditional classification problems. Since these classes are often strongly correlated, modeling the dependencies between them allows MDC methods to improve their performance – at the expense of an increased computational cost. In this paper we focus on the classifier chains (CC) approach for modeling dependencies, one of the most popular and highest-performing methods for multi-label classification (MLC), a particular case of MDC which involves only binary classes (i.e., labels). The original CC algorithm makes a greedy approximation, and is fast but tends to propagate errors along the chain. Here we present novel Monte Carlo schemes, both for finding a good chain sequence and performing efficient inference. Our algorithms remain tractable for high-dimensional data sets and obtain the best predictive performance across several real data sets.
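The chain-sequence search can be sketched as a naive Monte Carlo loop over random label orders; scikit-learn's ClassifierChain and a synthetic multi-label dataset stand in here for the paper's algorithms and benchmarks, and the number of sampled orders is arbitrary:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain

X, Y = make_multilabel_classification(n_samples=300, n_classes=5,
                                      n_labels=2, random_state=0)
Xtr, Xte, Ytr, Yte = train_test_split(X, Y, random_state=0)

rng = np.random.default_rng(0)
best_order, best_acc = None, -1.0
for _ in range(10):                          # sample random chain sequences
    order = rng.permutation(Y.shape[1]).tolist()
    cc = ClassifierChain(LogisticRegression(max_iter=1000), order=order)
    cc.fit(Xtr, Ytr)
    acc = (cc.predict(Xte) == Yte).mean()    # per-label held-out accuracy
    if acc > best_acc:
        best_order, best_acc = order, acc
print(best_order, best_acc)
```

The paper's Monte Carlo schemes are considerably more refined (they also address inference), but the loop above shows why the chain order is a search space in its own right.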
Abstract:
The Self-Organizing Map (SOM) is a neural network model that performs an ordered projection of a high-dimensional input space onto a low-dimensional topological structure. The process by which such a mapping is formed is defined by the SOM algorithm, which is a competitive, unsupervised and nonparametric method, since it does not make any assumption about the input data distribution. The feature maps provided by this algorithm have been successfully applied to vector quantization, clustering and high-dimensional data visualization. However, the initialization of the network topology and the selection of the SOM training parameters are two difficult tasks because the distribution of the input signals is unknown. A misconfiguration of these parameters can generate a low-quality feature map, so it is necessary to have some measure of the degree of adaptation of the SOM network to the input data model. Topology preservation is the most common concept used to implement this measure. Several qualitative and quantitative methods have been proposed for measuring the degree of SOM topology preservation, particularly for Kohonen's model. In this work, two methods for measuring the topology preservation of the Growing Cell Structures (GCS) model are proposed: the topographic function and the topology preserving map.
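A minimal NumPy sketch of the SOM algorithm the abstract describes (Kohonen-style, with illustrative grid size, learning-rate and neighbourhood schedules; the GCS model and the proposed topology-preservation measures are not implemented here):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((500, 3))              # 3-D inputs mapped onto a 2-D grid
rows, cols = 8, 8
W = rng.random((rows, cols, 3))          # codebook (weight) vectors
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                            indexing="ij"), axis=-1)

n_iter = 2000
for t, x in enumerate(data[rng.integers(0, len(data), n_iter)]):
    frac = t / n_iter
    lr = 0.5 * (1 - frac)                # learning rate decays over time
    sigma = 3.0 * (1 - frac) + 0.5       # neighbourhood radius shrinks
    d = np.linalg.norm(W - x, axis=-1)
    bmu = np.unravel_index(np.argmin(d), d.shape)   # best-matching unit
    h = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=-1)
               / (2 * sigma ** 2))       # neighbourhood function on the grid
    W += lr * h[..., None] * (x - W)     # pull the BMU and its neighbours
```

After training, neighbouring units hold similar codebook vectors; it is exactly this grid/input-space agreement that topology-preservation measures try to quantify.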
Abstract:
Evolutionary search algorithms have become an essential asset in the algorithmic toolbox for solving high-dimensional optimization problems across a broad range of bioinformatics applications. Genetic algorithms, the most well-known and representative evolutionary search technique, have been the subject of the major part of such applications. Estimation of distribution algorithms (EDAs) offer a novel evolutionary paradigm that constitutes a natural and attractive alternative to genetic algorithms. They make use of a probabilistic model, learnt from the promising solutions, to guide the search process. In this paper, we set out a basic taxonomy of EDA techniques, underlining the nature and complexity of the probabilistic model of each EDA variant. We review a set of innovative works that make use of EDA techniques to solve challenging bioinformatics problems, emphasizing the EDA paradigm's potential for further research in this domain.
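The EDA loop itself, sample from a probabilistic model, select the promising solutions, re-learn the model, can be sketched with the simplest continuous variant (a univariate Gaussian model) on a toy problem; all sizes and schedules below are illustrative:

```python
import numpy as np

def sphere(x):
    """Toy objective to minimize: sum of squares, optimum at the origin."""
    return np.sum(x ** 2, axis=1)

rng = np.random.default_rng(0)
dim, pop, elite = 10, 100, 30
mu, sigma = np.ones(dim), np.ones(dim)       # initial Gaussian model

for gen in range(80):
    sample = rng.normal(mu, sigma, size=(pop, dim))   # sample from model
    fitness = sphere(sample)
    best = sample[np.argsort(fitness)[:elite]]        # promising solutions
    mu = best.mean(axis=0)                            # re-learn the model
    sigma = best.std(axis=0) + 1e-12

print(sphere(mu[None])[0])    # the model mean approaches the optimum
```

Richer EDA variants differ mainly in the model: multivariate Gaussians, Bayesian networks, or Markov random fields replace the independent univariate distributions used here.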
Abstract:
Probabilistic modeling is the defining characteristic of estimation of distribution algorithms (EDAs), determining their behavior and performance in optimization. Regularization is a well-known statistical technique used for obtaining an improved model by reducing the generalization error of estimation, especially in high-dimensional problems. L1-regularization is a type of this technique with the appealing variable selection property, which results in sparse model estimations. In this thesis, we study the use of regularization techniques for model learning in EDAs. Several methods for regularized model estimation in continuous domains based on a Gaussian distribution assumption are presented, and analyzed from different aspects when used for optimization in a high-dimensional setting, where the population size of the EDA scales logarithmically with the number of variables. The optimization results obtained for a number of continuous problems with an increasing number of variables show that the proposed EDA based on regularized model estimation performs a more robust optimization, and is able to achieve significantly better results for larger dimensions than other Gaussian-based EDAs. We also propose a method for learning a marginally factorized Gaussian Markov random field model using regularization techniques and a clustering algorithm. The experimental results show notable optimization performance on continuous additively decomposable problems when using this model estimation method. Our study also covers multi-objective optimization, and we propose joint probabilistic modeling of variables and objectives in EDAs based on Bayesian networks, specifically models inspired by multi-dimensional Bayesian network classifiers. It is shown that with this approach to modeling, two new types of relationships are encoded in the estimated models in addition to the variable relationships captured in other EDAs: objective-variable and objective-objective relationships.
An extensive experimental study shows the effectiveness of this approach for multi- and many-objective optimization. With the proposed joint variable-objective modeling, in addition to the Pareto set approximation, the algorithm is also able to obtain an estimation of the multi-objective problem structure. Finally, the study of multi-objective optimization based on joint probabilistic modeling is extended to noisy domains, where the noise in objective values is represented by intervals. A new version of the Pareto dominance relation for ordering the solutions in these problems, namely α-degree Pareto dominance, is introduced and its properties are analyzed. We show that ranking methods based on this dominance relation can result in competitive performance of EDAs with respect to the quality of the approximated Pareto sets. This dominance relation is then used together with a method for joint probabilistic modeling based on L1-regularization for multi-objective feature subset selection in classification, where six different measures of accuracy are considered as objectives with interval values. The individual assessment of the proposed joint probabilistic modeling and solution ranking methods on datasets of small to medium dimensionality, using two different Bayesian classifiers, shows that comparable or better Pareto sets of feature subsets are approximated in comparison to standard methods.
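The core difficulty the thesis addresses can be reproduced in a few lines: with fewer samples (the EDA population) than variables, the sample covariance matrix is singular, while an L1-regularized estimator still yields a well-defined sparse Gaussian model. scikit-learn's GraphicalLasso is used here purely as a stand-in for the thesis's regularized estimators:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
n, p = 30, 50                               # fewer samples than variables
X = rng.standard_normal((n, p))

S = np.cov(X, rowvar=False)
print(np.linalg.matrix_rank(S))             # < p: S is singular, not invertible

gl = GraphicalLasso(alpha=0.5).fit(X)       # L1-penalized precision estimation
off_diag = gl.precision_ - np.diag(np.diag(gl.precision_))
print(np.count_nonzero(off_diag))           # sparse: most entries forced to zero
```

The penalized estimate is both invertible and sparse, which is what makes model learning feasible when the population size grows only logarithmically with the number of variables.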
Abstract:
Many existing engineering works model the statistical characteristics of the entities under study as normal distributions. These models are eventually used for decision making, which in practice requires defining the classification region corresponding to the desired confidence level. Surprisingly enough, however, a great number of computer vision works using multidimensional normal models leave confidence regions unspecified, or fail to establish them correctly, due to misconceptions about the features of Gaussian functions or to wrong analogies with the unidimensional case. The resulting regions incur deviations that can be unacceptable in high-dimensional models. Here we provide a comprehensive derivation of the optimal confidence regions for multivariate normal distributions of arbitrary dimensionality. To this end, we first derive the condition for region optimality of general continuous multidimensional distributions, and then apply it to the widespread case of the normal probability density function. The obtained results are used to analyze the confidence error incurred by previous works related to vision research, showing that deviations caused by wrong regions may become unacceptable as dimensionality increases. To support the theoretical analysis, a quantitative example is given in the context of moving object detection by means of background modeling.
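The central point can be checked numerically. For a d-dimensional normal, the optimal confidence region is the ellipsoid where the squared Mahalanobis distance falls below a chi-square quantile with d degrees of freedom; reusing a one-dimensional radius (e.g. 1.96 for 95%) produces a region whose true coverage collapses as d grows. A short sketch of that comparison (not the paper's derivation):

```python
from scipy.stats import chi2

for d in (1, 2, 5, 10):
    r2 = chi2.ppf(0.95, df=d)     # correct squared Mahalanobis radius in d dims
    wrong = 1.96 ** 2             # naive 1-D radius reused in d dimensions
    # true coverage obtained when the wrong (1-D) radius is used:
    print(d, round(r2, 3), round(chi2.cdf(wrong, df=d), 3))
```

In one dimension the naive radius gives exactly 95% coverage, but by d = 10 the same radius covers only a few percent of the probability mass, which illustrates the unacceptable deviations the abstract refers to.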
Abstract:
This paper describes a novel approach to phonotactic LID where, instead of using soft-counts based on phoneme lattices, we use posteriograms to obtain n-gram counts. The high-dimensional vectors of counts are reduced to low-dimensional units, for which we adapted the commonly used term i-vectors. The reduction is based on multinomial subspace modeling and is designed to work in the total-variability space. The proposed technique was tested on the NIST 2009 LRE set with better results than a system based on soft-counts (Cavg on 30s: 3.15% vs 3.43%), and with very good results when fused with an acoustic i-vector LID system (Cavg on 30s: 2.4% acoustic-only vs 1.25% fused). The proposed technique is also compared with another low-dimensional projection system based on PCA. In comparison with the original soft-counts, the proposed technique provides better results, reduces the problems due to sparse counts, and avoids the pruning step when creating the lattices.
Abstract:
The scientific community uses virtual reality (VR) techniques to understand data and draw conclusions from it in an accessible way. However, these techniques are not used frequently for analyzing large amounts of data in the life sciences, particularly in genomics, due to the high complexity of the data (curse of dimensionality). Nevertheless, new approaches that bring out the truly important characteristics of the data raise the possibility of constructing VR spaces to understand its intrinsic nature visually. The benefits of representing high-dimensional data in three-dimensional spaces by means of dimensionality reduction and transformation techniques, complemented with a strong component of interaction methods, are well known. Thus, a novel framework, designed to help visualize and interact with data about diseases, is presented. In this paper, the framework is applied to the Van't Veer breast cancer dataset, and oncologists from La Paz Hospital (Madrid) interact with the obtained results. That is, we present a first attempt to generate a visually tangible model of breast cancer in order to support the experience of oncologists.
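The high-dimensional-to-3D step such a framework relies on can be illustrated with any dimensionality-reduction method; PCA and the random matrix below are stand-ins, not the framework's actual pipeline or the Van't Veer data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# stand-in for a genomics matrix: samples x gene-expression values
expr = rng.standard_normal((70, 5000))

# project onto 3 components: one point per sample, placeable in a VR scene
coords3d = PCA(n_components=3).fit_transform(expr)
print(coords3d.shape)
```

Each row of `coords3d` is then a position in the virtual space, to which interaction methods (selection, brushing, linking back to the original variables) can be attached.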
Abstract:
Non-parametric belief propagation (NBP) is a well-known message-passing method for cooperative localization in wireless networks. However, due to the over-counting problem in networks with loops, NBP's convergence is not guaranteed, and its estimates are typically less accurate. One solution to this problem is non-parametric generalized belief propagation based on a junction tree. However, this method is intractable in large-scale networks due to the high complexity of the junction-tree formation and the high dimensionality of the particles. Therefore, in this article, we propose non-parametric generalized belief propagation based on a pseudo-junction tree (NGBP-PJT). The main difference from the standard method is the formation of the pseudo-junction tree, which represents an approximated junction tree based on a thin graph. In addition, in order to decrease the number of high-dimensional particles, we use a more informative importance density function and reduce the dimensionality of the messages. As a by-product, we also propose NBP based on a thin graph (NBP-TG), a cheaper variant of NBP, which runs on the same graph as NGBP-PJT. According to our simulation and experimental results, NGBP-PJT outperforms NBP and NBP-TG in terms of accuracy and computational and communication cost in reasonably sized networks.
Abstract:
Markerless video-based human pose estimation algorithms face a high-dimensional problem that is frequently broken down into several lower-dimensional ones by estimating the pose of each limb separately. However, in order to do so they need to reliably locate the torso, for which they typically rely on time coherence and tracking algorithms. Losing track usually results in catastrophic failure of the process, requiring human intervention and thus precluding their use in real-time applications. We propose a very fast rough pose estimation scheme based on global shape descriptors built on 3D Zernike moments. Using an articulated model that we configure in many poses, a large database of descriptor/pose pairs can be computed off-line. Thus, the only steps that must be done on-line are the extraction of the descriptors for each input volume and a search against the database to get the most likely poses. While the result of such a process is not a fine pose estimation, it can help more sophisticated algorithms to regain track, or to make more educated guesses when creating new particles in particle-filter-based tracking schemes. We have achieved a performance of about ten fps on a single computer using a database of about one million entries.
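The on-line lookup step can be sketched as a nearest-neighbour search of a query descriptor against the precomputed descriptor/pose database; random vectors stand in here for real 3D Zernike moments and joint-angle vectors, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
db_desc = rng.random((100_000, 64))   # off-line: descriptors for known poses
db_pose = rng.random((100_000, 30))   # matching joint-configuration vectors

# descriptor extracted from an input volume (here: a known entry plus noise)
query = db_desc[1234] + 0.001 * rng.standard_normal(64)

d = np.linalg.norm(db_desc - query, axis=1)
k_best = np.argsort(d)[:5]            # indices of the most likely poses
candidate_poses = db_pose[k_best]     # rough estimates handed to the tracker
```

A brute-force scan is shown for clarity; at the database sizes the abstract mentions, an approximate nearest-neighbour index would be the natural choice.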
Abstract:
Multi-label classification (MLC) is the supervised learning problem where an instance may be associated with multiple labels. Modeling dependencies between labels allows MLC methods to improve their performance at the expense of an increased computational cost. In this paper we focus on the classifier chains (CC) approach for modeling dependencies. On the one hand, the original CC algorithm makes a greedy approximation, and is fast but tends to propagate errors down the chain. On the other hand, a recent Bayes-optimal method improves the performance, but is computationally intractable in practice. Here we present a novel double-Monte Carlo scheme (M2CC), both for finding a good chain sequence and performing efficient inference. The M2CC algorithm remains tractable for high-dimensional data sets and obtains the best overall accuracy, as shown on several real data sets with input dimension as high as 1449 and up to 103 labels.
Abstract:
This dissertation presents the results of research on the inerting of fly ash from urban solid waste and its subsequent encapsulation in different mortar matrixes. The inerting of this toxic and hazardous waste has been achieved, as well as its valorization as a by-product. In this way, a new "raw material" is available at low cost through a simple process, a toxic and hazardous waste is eliminated and, consequently, alternative natural resources are conserved. Chemical characterization of the analyzed ashes shows that they present high concentrations of soluble chlorides, Zn and Pb.
An inerting process of the fly ash with sodium bicarbonate (NaHCO3) has been developed which reduces the chloride content by 99% and maintains the pH at optimal values, so that the concentration of heavy metals in the leachate is minimal, due to their stabilization in the form of insoluble carbonates. Mortars with four different types of cement (CEM-I, CEM-II, CAC and CSA) have been prepared by adding the inertized fly ash, in the form of carbonates, in a proportion of 10% by weight of the aggregate used. The samples tested include different proportions both in the use of aggregates of different sizes (0/2 and 0/4) and in the cement/aggregate ratio (1/1 and 1/3). The physical and mechanical properties of these mortars have been studied through workability, dimensional stability, carbonation, porosity and mechanical strength tests. Leaching tests of Zn, Pb, Cu and Cd ions have also been performed on monolithic samples of the mortars with the best physical/mechanical behavior, and the content of leached heavy-metal ions has been analyzed by anodic stripping voltammetry. The conclusions drawn are that all the tested mortars are technically acceptable, with the CAC and CSA cement mortars presenting much better behavior than those made with CEM-I and CEM-II cement. The results are especially remarkable for the CAC cement mortars, which improve the compressive strength of the reference mortars by more than 48%, and the flexural strength by more than 67%. Leaching tests confirm that the Zn ions are completely encapsulated and that the leaching of Pb ions is mitigated within the mortar matrixes. For the above reasons, it can be concluded that mortars made with calcium aluminate or calcium sulfoaluminate cements incorporating the treated fly ash may be perfectly valid for uses requiring a fast-setting product with high early strength and shrinkage compensation with high dimensional stability.
Based on this, the material could be used as a repair mortar for structures, roads and industrial pavements requiring high performance, such as industrial floors, runways, parking lots, etc. Alternatively, it could also be used in the manufacture of prefabricated elements without structural reinforcement, given its high flexural strength.
Abstract:
Traffic flow time series data are usually high-dimensional and very complex. They are also sometimes imprecise and distorted due to malfunctions of the data collection sensors. Additionally, events like congestion caused by traffic accidents add more uncertainty to real-time traffic conditions, making traffic flow forecasting a complicated task. This article presents a new data preprocessing method targeting multidimensional time series with a very high number of dimensions and shows its application to real traffic flow time series from the California Department of Transportation (PEMS web site). The proposed method consists of three main steps. First, based on mTESL, a language for defining events in multidimensional time series, we identify a number of event types in the time series that correspond to either incorrect data or data with interference. Second, each event type is restored using an original method that combines real observations, locally forecasted values and historical data. Third, an exponential smoothing procedure is applied globally to eliminate noise interference and other random errors, so as to provide good-quality source data for future work.
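The third step can be sketched with textbook simple exponential smoothing on a synthetic one-dimensional flow series; the smoothing factor and the data are illustrative, and the paper's mTESL-based event detection and restoration steps are not reproduced:

```python
import numpy as np

def exp_smooth(x, alpha=0.3):
    """Simple exponential smoothing: s[t] = alpha*x[t] + (1-alpha)*s[t-1]."""
    s = np.empty_like(x, dtype=float)
    s[0] = x[0]
    for t in range(1, len(x)):
        s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    return s

rng = np.random.default_rng(0)
# stand-in for one day of 5-minute traffic counts: trend plus sensor noise
flow = 100 + 10 * np.sin(np.linspace(0, 6, 288)) + rng.normal(0, 5, 288)
smooth = exp_smooth(flow)
```

The smoothed series keeps the slow daily trend while damping the sample-to-sample noise, which is the role this step plays before the data are handed to a forecasting model.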
Abstract:
To date, although much attention has been paid to the estimation and modeling of the voice source (i.e., the glottal airflow volume velocity), the measurement and characterization of the supraglottal pressure wave have been much less studied. Some previous results have revealed that the supraglottal pressure wave has spectral resonances similar to those of the voice pressure wave. This makes the supraglottal wave partially intelligible. Although the explanation for such an effect seems to be clearly related to the reflected pressure wave traveling upstream along the vocal tract, the influence that nonlinear source-filter interaction has on it is not as clear. This article provides an insight into this issue by comparing the acoustic analyses of measured and simulated supraglottal and voice waves. Simulations have been performed using a high-dimensional discrete vocal fold model. Results of such a comparative analysis indicate that spectral resonances in the supraglottal wave are mainly caused by the regressive pressure wave that travels upstream along the vocal tract, and not by source-tract interaction. On the contrary, according to simulation results, source-tract interaction does play a role in the loss of intelligibility that happens in the supraglottal wave with respect to the voice wave. This loss of intelligibility mainly corresponds to spectral differences at frequencies above 1500 Hz.
Abstract:
Axisymmetric shells are analyzed by means of one-dimensional continuum elements, using the analogy between the bending of shells and the bending of beams on an elastic foundation. The mathematical model is formulated in the frequency domain. Because the solutions of the governing equations for the vibration of beams are exact, the spatial discretization depends only on geometrical or material considerations. In some situations, for example under high-frequency excitation, this approach may be more convenient than conventional alternatives such as the finite element method.
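The analogy the abstract invokes rests on the classical beam-on-elastic-foundation (Winkler) model; for time-harmonic motion at angular frequency ω, a standard textbook form of the frequency-domain governing equation (not necessarily the paper's exact formulation) is, with bending stiffness EI, foundation modulus k, mass per unit length m, transverse deflection w and load q:

```latex
EI \, \frac{\mathrm{d}^4 w(x)}{\mathrm{d}x^4} + \left(k - m\,\omega^2\right) w(x) = q(x)
```

Because this ordinary differential equation admits exact solutions over segments with uniform properties, a single element can span an entire uniform segment, which is why the spatial discretization depends only on geometrical or material changes rather than on the wavelength of the response.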