965 resultados para Maximum-entropy probability density
Performance Tuning Non-Uniform Sampling for Sensitivity Enhancement of Signal-Limited Biological NMR
Resumo:
Non-uniform sampling (NUS) has been established as a route to obtaining true sensitivity enhancements when recording indirect dimensions of decaying signals in the same total experimental time as traditional uniform incrementation of the indirect evolution period. Theory and experiments have shown that NUS can yield up to two-fold improvements in the intrinsic signal-to-noise ratio (SNR) of each dimension, while even conservative protocols can yield 20-40 % improvements in the intrinsic SNR of NMR data. Applications of biological NMR that can benefit from these improvements are emerging, and in this work we develop some practical aspects of applying NUS nD-NMR to studies that approach the traditional detection limit of nD-NMR spectroscopy. Conditions for obtaining high NUS sensitivity enhancements are considered here in the context of enabling H-1,N-15-HSQC experiments on natural abundance protein samples and H-1,C-13-HMBC experiments on a challenging natural product. Through systematic studies we arrive at more precise guidelines to contrast sensitivity enhancements with reduced line shape constraints, and report an alternative sampling density based on a quarter-wave sinusoidal distribution that returns the highest fidelity we have seen to date in line shapes obtained by maximum entropy processing of non-uniformly sampled data.
Resumo:
Mendelian models can predict who carries an inherited deleterious mutation of known disease genes based on family history. For example, the BRCAPRO model is commonly used to identify families who carry mutations of BRCA1 and BRCA2, based on familial breast and ovarian cancers. These models incorporate the age of diagnosis of diseases in relatives and current age or age of death. We develop a rigorous foundation for handling multiple diseases with censoring. We prove that any disease unrelated to mutations can be excluded from the model, unless it is sufficiently common and dependent on a mutation-related disease time. Furthermore, if a family member has a disease with higher probability density among mutation carriers, but the model does not account for it, then the carrier probability is deflated. However, even if a family only has diseases the model accounts for, if the model excludes a mutation-related disease, then the carrier probability will be inflated. In light of these results, we extend BRCAPRO to account for surviving all non-breast/ovary cancers as a single outcome. The extension also enables BRCAPRO to extract more useful information from male relatives. Using 1500 familes from the Cancer Genetics Network, accounting for surviving other cancers improves BRCAPRO’s concordance index from 0.758 to 0.762 (p = 0.046), improves its positive predictive value from 35% to 39% (p < 10−6) without impacting its negative predictive value, and improves its overall calibration, although calibration slightly worsens for those with carrier probability < 10%. Copyright c 2000 John Wiley & Sons, Ltd.
Resumo:
Marshall's (1970) lemma is an analytical result which implies root-n-consistency of the distribution function corresponding to the Grenander (1956) estimator of a non-decreasing probability density. The present paper derives analogous results for the setting of convex densities on [0,\infty).
Resumo:
We present a novel approach to the inference of spectral functions from Euclidean time correlator data that makes close contact with modern Bayesian concepts. Our method differs significantly from the maximum entropy method (MEM). A new set of axioms is postulated for the prior probability, leading to an improved expression, which is devoid of the asymptotically flat directions present in the Shanon-Jaynes entropy. Hyperparameters are integrated out explicitly, liberating us from the Gaussian approximations underlying the evidence approach of the maximum entropy method. We present a realistic test of our method in the context of the nonperturbative extraction of the heavy quark potential. Based on hard-thermal-loop correlator mock data, we establish firm requirements in the number of data points and their accuracy for a successful extraction of the potential from lattice QCD. Finally we reinvestigate quenched lattice QCD correlators from a previous study and provide an improved potential estimation at T2.33TC.
Resumo:
RESUMEN El apoyo a la selección de especies a la restauración de la vegetación en España en los últimos 40 años se ha basado fundamentalmente en modelos de distribución de especies, también llamados modelos de nicho ecológico, que estiman la probabilidad de presencia de las especies en función de las condiciones del medio físico (clima, suelo, etc.). Con esta tesis se ha intentado contribuir a la mejora de la capacidad predictiva de los modelos introduciendo algunas propuestas metodológicas adaptadas a los datos disponibles actualmente en España y enfocadas al uso de los modelos en la selección de especies. No siempre se dispone de datos a una resolución espacial adecuada para la escala de los proyectos de restauración de la vegetación. Sin embrago es habitual contar con datos de baja resolución espacial para casi todas las especies vegetales presentes en España. Se propone un método de recalibración que actualiza un modelo de regresión logística de baja resolución espacial con una nueva muestra de alta resolución espacial. El método permite obtener predicciones de calidad aceptable con muestras relativamente pequeñas (25 presencias de la especie) frente a las muestras mucho mayores (más de 100 presencias) que requería una estrategia de modelización convencional que no usara el modelo previo. La selección del método estadístico puede influir decisivamente en la capacidad predictiva de los modelos y por esa razón la comparación de métodos ha recibido mucha atención en la última década. Los estudios previos consideraban a la regresión logística como un método inferior a técnicas más modernas como las de máxima entropía. Los resultados de la tesis demuestran que esa diferencia observada se debe a que los modelos de máxima entropía incluyen técnicas de regularización y la versión de la regresión logística usada en las comparaciones no. Una vez incorporada la regularización a la regresión logística usando penalización, las diferencias en cuanto a capacidad predictiva desaparecen. La regresión logística penalizada es, por tanto, una alternativa más para el ajuste de modelos de distribución de especies y está a la altura de los métodos modernos con mejor capacidad predictiva como los de máxima entropía. A menudo, los modelos de distribución de especies no incluyen variables relativas al suelo debido a que no es habitual que se disponga de mediciones directas de sus propiedades físicas o químicas. La incorporación de datos de baja resolución espacial proveniente de mapas de suelo nacionales o continentales podría ser una alternativa. Los resultados de esta tesis sugieren que los modelos de distribución de especies de alta resolución espacial mejoran de forma ligera pero estadísticamente significativa su capacidad predictiva cuando se incorporan variables relativas al suelo procedente de mapas de baja resolución espacial. La validación es una de las etapas fundamentales del desarrollo de cualquier modelo empírico como los modelos de distribución de especies. Lo habitual es validar los modelos evaluando su capacidad predictiva especie a especie, es decir, comparando en un conjunto de localidades la presencia o ausencia observada de la especie con las predicciones del modelo. Este tipo de evaluación no responde a una cuestión clave en la restauración de la vegetación ¿cuales son las n especies más idóneas para el lugar a restaurar? Se ha propuesto un método de evaluación de modelos adaptado a esta cuestión que consiste en estimar la capacidad de un conjunto de modelos para discriminar entre las especies presentes y ausentes de un lugar concreto. El método se ha aplicado con éxito a la validación de 188 modelos de distribución de especies leñosas orientados a la selección de especies para la restauración de la vegetación en España. Las mejoras metodológicas propuestas permiten mejorar la capacidad predictiva de los modelos de distribución de especies aplicados a la selección de especies en la restauración de la vegetación y también permiten ampliar el número de especies para las que se puede contar con un modelo que apoye la toma de decisiones. SUMMARY During the last 40 years, decision support tools for plant species selection in ecological restoration in Spain have been based on species distribution models (also called ecological niche models), that estimate the probability of occurrence of the species as a function of environmental predictors (e.g., climate, soil). In this Thesis some methodological improvements are proposed to contribute to a better predictive performance of such models, given the current data available in Spain and focusing in the application of the models to selection of species for ecological restoration. Fine grained species distribution data are required to train models to be used at the scale of the ecological restoration projects, but this kind of data are not always available for every species. On the other hand, coarse grained data are available for almost every species in Spain. A recalibration method is proposed that updates a coarse grained logistic regression model using a new fine grained updating sample. The method allows obtaining acceptable predictive performance with reasonably small updating sample (25 occurrences of the species), in contrast with the much larger samples (more than 100 occurrences) required for a conventional modeling approach that discards the coarse grained data. The choice of the statistical method may have a dramatic effect on model performance, therefore comparisons of methods have received much interest in the last decade. Previous studies have shown a poorer performance of the logistic regression compared to novel methods like maximum entropy models. The results of this Thesis show that the observed difference is caused by the fact that maximum entropy models include regularization techniques and the versions of logistic regression compared do not. Once regularization has been added to the logistic regression using a penalization procedure, the differences in model performance disappear. Therefore, penalized logistic regression may be considered one of the best performing methods to model species distributions. Usually, species distribution models do not consider soil related predictors because direct measurements of the chemical or physical properties are often lacking. The inclusion of coarse grained soil data from national or continental soil maps could be a reasonable alternative. The results of this Thesis suggest that the performance of the models slightly increase after including soil predictors form coarse grained soil maps. Model validation is a key stage of the development of empirical models, such as species distribution models. The usual way of validating is based on the evaluation of model performance for each species separately, i.e., comparing observed species presences or absence to predicted probabilities in a set of sites. This kind of evaluation is not informative for a common question in ecological restoration projects: which n species are the most suitable for the environment of the site to be restored? A method has been proposed to address this question that estimates the ability of a set of models to discriminate among present and absent species in a evaluation site. The method has been successfully applied to the validation of 188 species distribution models used to support decisions on species selection for ecological restoration in Spain. The proposed methodological approaches improve the predictive performance of the predictive models applied to species selection in ecological restoration and increase the number of species for which a model that supports decisions can be fitted.
Resumo:
In this paper we present a global overview of the recent study carried out in Spain for the new hazard map, which final goal is the revision of the Building Code in our country (NCSE-02). The study was carried our for a working group joining experts from The Instituto Geografico Nacional (IGN) and the Technical University of Madrid (UPM) , being the different phases of the work supervised by an expert Committee integrated by national experts from public institutions involved in subject of seismic hazard. The PSHA method (Probabilistic Seismic Hazard Assessment) has been followed, quantifying the epistemic uncertainties through a logic tree and the aleatory ones linked to variability of parameters by means of probability density functions and Monte Carlo simulations. In a first phase, the inputs have been prepared, which essentially are: 1) a project catalogue update and homogenization at Mw 2) proposal of zoning models and source characterization 3) calibration of Ground Motion Prediction Equations (GMPE’s) with actual data and development of a local model with data collected in Spain for Mw < 5.5. In a second phase, a sensitivity analysis of the different input options on hazard results has been carried out in order to have criteria for defining the branches of the logic tree and their weights. Finally, the hazard estimation was done with the logic tree shown in figure 1, including nodes for quantifying uncertainties corresponding to: 1) method for estimation of hazard (zoning and zoneless); 2) zoning models, 3) GMPE combinations used and 4) regression method for estimation of source parameters. In addition, the aleatory uncertainties corresponding to the magnitude of the events, recurrence parameters and maximum magnitude for each zone have been also considered including probability density functions and Monte Carlo simulations The main conclusions of the study are presented here, together with the obtained results in terms of PGA and other spectral accelerations SA (T) for return periods of 475, 975 and 2475 years. The map of the coefficient of variation (COV) are also represented to give an idea of the zones where the dispersion among results are the highest and the zones where the results are robust.
Resumo:
A relation between Cost Of Energy, COE, maximum allowed tip speed, and rated wind speed, is obtained for wind turbines with a given goal rated power. The wind regime is characterised by the corresponding parameters of the probability density function of wind speed. The non-dimensional characteristics of the rotor: number of blades, the blade radial distributions of local solidity, twist angle, and airfoil type, play the role of parameters in the mentioned relation. The COE is estimated using a cost model commonly used by the designers. This cost model requires basic design data such as the rotor radius and the ratio between the hub height and the rotor radius. Certain design options, DO, related to the technology of the power plant, tower and blades are also required as inputs. The function obtained for the COE can be explored to �nd those values of rotor radius that give rise to minimum cost of energy for a given wind regime as the tip speed limitation changes. The analysis reveals that iso-COE lines evolve parallel to iso-radius lines for large values of limit tip speed but that this is not the case for small values of the tip speed limits. It is concluded that, as the tip speed limit decreases, the optimum decision for keeping minimum COE values can be: a) reducing the rotor radius for places with high weibull scale parameter or b) increasing the rotor radius for places with low weibull scale parameter
Resumo:
In this paper, we propose a novel filter for feature selection. Such filter relies on the estimation of the mutual information between features and classes. We bypass the estimation of the probability density function with the aid of the entropic-graphs approximation of Rényi entropy, and the subsequent approximation of the Shannon one. The complexity of such bypassing process does not depend on the number of dimensions but on the number of patterns/samples, and thus the curse of dimensionality is circumvented. We show that it is then possible to outperform a greedy algorithm based on the maximal relevance and minimal redundancy criterion. We successfully test our method both in the contexts of image classification and microarray data classification.
Resumo:
The interface between a Pt(111) electrode and a room temperature ionic liquid, 1-ethyl-2,3-dimethylimidazolium bis(trifluoromethylsulfonyl)imide, was investigated with the laser-induced temperature jump method. In this technique, the temperature of the interface is suddenly increased by applying short laser pulses. The change of the electrode potential caused by the thermal perturbation is measured under coulostatic conditions during the subsequent temperature relaxation. This change is mainly related to the reorganization of the solvent components near the electrode surface. The sign of the potential transient depends on the potential of the experiment. At high potential values, positive transients indicate a higher density of anions than cations close the surface, contributing negatively to the potential of the electrode. Decreasing the applied potential to sufficiently low values, the transient becomes negative, meaning that the density of cations becomes then higher at the surface of the electrode. The potential dependence of the interfacial response shows a marked hysteresis depending on the direction in which the applied potential is changed.
Resumo:
We investigate the dependence of Bayesian error bars on the distribution of data in input space. For generalized linear regression models we derive an upper bound on the error bars which shows that, in the neighbourhood of the data points, the error bars are substantially reduced from their prior values. For regions of high data density we also show that the contribution to the output variance due to the uncertainty in the weights can exhibit an approximate inverse proportionality to the probability density. Empirical results support these conclusions.
Resumo:
Mixture Density Networks (MDNs) are a well-established method for modelling the conditional probability density which is useful for complex multi-valued functions where regression methods (such as MLPs) fail. In this paper we extend earlier research of a regularisation method for a special case of MDNs to the general case using evidence based regularisation and we show how the Hessian of the MDN error function can be evaluated using R-propagation. The method is tested on two data sets and compared with early stopping.
Resumo:
Principal component analysis (PCA) is one of the most popular techniques for processing, compressing and visualising data, although its effectiveness is limited by its global linearity. While nonlinear variants of PCA have been proposed, an alternative paradigm is to capture data complexity by a combination of local linear PCA projections. However, conventional PCA does not correspond to a probability density, and so there is no unique way to combine PCA models. Previous attempts to formulate mixture models for PCA have therefore to some extent been ad hoc. In this paper, PCA is formulated within a maximum-likelihood framework, based on a specific form of Gaussian latent variable model. This leads to a well-defined mixture model for probabilistic principal component analysers, whose parameters can be determined using an EM algorithm. We discuss the advantages of this model in the context of clustering, density modelling and local dimensionality reduction, and we demonstrate its application to image compression and handwritten digit recognition.
Resumo:
Principal component analysis (PCA) is a ubiquitous technique for data analysis and processing, but one which is not based upon a probability model. In this paper we demonstrate how the principal axes of a set of observed data vectors may be determined through maximum-likelihood estimation of parameters in a latent variable model closely related to factor analysis. We consider the properties of the associated likelihood function, giving an EM algorithm for estimating the principal subspace iteratively, and discuss the advantages conveyed by the definition of a probability density function for PCA.
Resumo:
Principal component analysis (PCA) is a ubiquitous technique for data analysis and processing, but one which is not based upon a probability model. In this paper we demonstrate how the principal axes of a set of observed data vectors may be determined through maximum-likelihood estimation of parameters in a latent variable model closely related to factor analysis. We consider the properties of the associated likelihood function, giving an EM algorithm for estimating the principal subspace iteratively, and discuss the advantages conveyed by the definition of a probability density function for PCA.
Resumo:
Mixture Density Networks are a principled method to model conditional probability density functions which are non-Gaussian. This is achieved by modelling the conditional distribution for each pattern with a Gaussian Mixture Model for which the parameters are generated by a neural network. This thesis presents a novel method to introduce regularisation in this context for the special case where the mean and variance of the spherical Gaussian Kernels in the mixtures are fixed to predetermined values. Guidelines for how these parameters can be initialised are given, and it is shown how to apply the evidence framework to mixture density networks to achieve regularisation. This also provides an objective stopping criteria that can replace the `early stopping' methods that have previously been used. If the neural network used is an RBF network with fixed centres this opens up new opportunities for improved initialisation of the network weights, which are exploited to start training relatively close to the optimum. The new method is demonstrated on two data sets. The first is a simple synthetic data set while the second is a real life data set, namely satellite scatterometer data used to infer the wind speed and wind direction near the ocean surface. For both data sets the regularisation method performs well in comparison with earlier published results. Ideas on how the constraint on the kernels may be relaxed to allow fully adaptable kernels are presented.