951 results for Spectral linear mixture model
Abstract:
The composition and abundance of algal pigments provide information on phytoplankton community characteristics such as photoacclimation, overall biomass and taxonomic composition. In particular, pigments play a major role in photoprotection and in the light-driven part of photosynthesis. Most phytoplankton pigments can be measured by high-performance liquid chromatography (HPLC) techniques applied to filtered water samples. This method, like other laboratory analyses, is time consuming and therefore limits the number of samples that can be processed in a given time. To obtain information on phytoplankton pigment composition at higher temporal and spatial resolution, we have developed a method to assess pigment concentrations from continuous optical measurements. The method applies an empirical orthogonal function (EOF) analysis to remote-sensing reflectance data derived from ship-based hyperspectral underwater radiometry and from multispectral satellite data (using the Medium Resolution Imaging Spectrometer - MERIS - Polymer product developed by Steinmetz et al., 2011, doi:10.1364/OE.19.009783) measured in the Atlantic Ocean. Subsequently, we developed multiple linear regression models with measured (collocated) pigment concentrations as the response variable and EOF loadings as predictor variables. The model results show that surface concentrations of a suite of pigments and pigment groups can be predicted well from the ship-based reflectance measurements, even when only a multispectral resolution is chosen (i.e., eight bands, similar to those used by MERIS). Based on the MERIS reflectance data, concentrations of total and monovinyl chlorophyll a and of the groups of photoprotective and photosynthetic carotenoids can be predicted with high accuracy. As a demonstration of the utility of the approach, the fitted model based on satellite reflectance data as input was applied to one month of MERIS Polymer data to predict the concentration of those pigment groups for the whole eastern tropical Atlantic. Bootstrapping explorations of cross-validation error indicate that the method can produce reliable predictions from relatively small data sets (e.g., fewer than 50 collocated values of reflectance and pigment concentration). The method allows time series of various pigment groups to be derived from continuous reflectance data in various regions, which can be used to study variability and change in phytoplankton composition and photophysiology.
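For concreteness, below is a minimal sketch of the EOF-plus-regression pipeline described in this abstract, using synthetic stand-in data; the variable names, normalisation and dimensions are illustrative assumptions, not the paper's actual processing chain.

```python
# A minimal sketch of the EOF-plus-regression idea: PCA of reflectance
# spectra, then multiple linear regression of pigment concentration on the
# per-station scores. All data below are synthetic stand-ins.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: 60 stations x 8 reflectance bands (MERIS-like),
# and a collocated pigment concentration for each station.
n_stations, n_bands = 60, 8
reflectance = rng.lognormal(sigma=0.3, size=(n_stations, n_bands))
pigment = reflectance[:, 3] / reflectance[:, 1] + 0.1 * rng.normal(size=n_stations)

# EOF analysis of the (standardised) reflectance spectra: PCA applied to the
# station x band matrix; each station's scores on the leading modes play the
# role of the EOF loadings used as predictors.
spectra = np.log(reflectance)            # a common normalisation choice
spectra -= spectra.mean(axis=0)
eof = PCA(n_components=4).fit(spectra)
scores = eof.transform(spectra)

# Multiple linear regression: pigment concentration ~ EOF scores, with
# cross-validation to mimic the paper's error exploration.
model = LinearRegression().fit(scores, pigment)
cv_r2 = cross_val_score(model, scores, pigment, cv=5, scoring="r2")
print(f"cross-validated R^2: {cv_r2.mean():.2f}")
```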
Abstract:
Many modern applications fall into the category of "large-scale" statistical problems, in which both the number of observations n and the number of features or parameters p may be large. Many existing methods focus on point estimation, even though uncertainty quantification remains essential in the sciences, where the number of parameters to estimate often exceeds the sample size despite the huge increases in n typically seen in many fields. The tendency in some areas of industry to dispense with traditional statistical analysis on the basis that "n = all" is therefore of little relevance outside of certain narrow applications. The main result of the Big Data revolution in most fields has instead been to make computation much harder without reducing the importance of uncertainty quantification. Bayesian methods excel at uncertainty quantification, but often scale poorly relative to alternatives. This conflict between the statistical advantages of Bayesian procedures and their substantial computational disadvantages is perhaps the greatest challenge facing modern Bayesian statistics, and it is the primary motivation for the work presented here.
Two general strategies for scaling Bayesian inference are considered. The first is the development of methods that lend themselves to faster computation, and the second is the design and characterization of computational algorithms that scale better in n or p. In the first instance, the focus is on joint inference outside of the standard problem of multivariate continuous data that has been a major focus of previous theoretical work in this area. In the second, we pursue strategies for improving the speed of Markov chain Monte Carlo (MCMC) algorithms and for characterizing their performance in large-scale settings. Throughout, the focus is on rigorous theoretical evaluation combined with empirical demonstrations of performance and concordance with the theory.
One topic we consider is modeling the joint distribution of multivariate categorical data, often summarized in a contingency table. Contingency table analysis routinely relies on log-linear models, with latent structure analysis providing a common alternative. Latent structure models lead to a reduced rank tensor factorization of the probability mass function for multivariate categorical data, while log-linear models achieve dimensionality reduction through sparsity. Little is known about the relationship between these notions of dimensionality reduction in the two paradigms. In Chapter 2, we derive several results relating the support of a log-linear model to nonnegative ranks of the associated probability tensor. Motivated by these findings, we propose a new collapsed Tucker class of tensor decompositions, which bridge existing PARAFAC and Tucker decompositions, providing a more flexible framework for parsimoniously characterizing multivariate categorical data. Taking a Bayesian approach to inference, we illustrate empirical advantages of the new decompositions.
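As a brief illustration of the PARAFAC/latent-class connection discussed here, the following numpy sketch builds the joint pmf tensor induced by a k-class latent structure model; all dimensions and parameter values are illustrative.

```python
# A k-class latent structure model induces a nonnegative rank-k (PARAFAC)
# factorization of the joint pmf tensor of multivariate categorical data.
import numpy as np

rng = np.random.default_rng(1)
n_classes, levels = 3, (4, 5, 2)        # k latent classes; 3 categorical variables

# Class weights and per-class conditional pmfs (each row sums to one).
nu = rng.dirichlet(np.ones(n_classes))
lam = [rng.dirichlet(np.ones(d), size=n_classes) for d in levels]

# Joint pmf: P(x1=i, x2=j, x3=k) = sum_h nu_h * lam1[h,i] * lam2[h,j] * lam3[h,k]
pmf = np.einsum("h,hi,hj,hk->ijk", nu, lam[0], lam[1], lam[2])

assert np.isclose(pmf.sum(), 1.0)       # a valid probability tensor
print("nonnegative PARAFAC rank <=", n_classes)
```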
Latent class models for the joint distribution of multivariate categorical data, such as the PARAFAC decomposition, play an important role in the analysis of population structure. In this context, the number of latent classes is interpreted as the number of genetically distinct subpopulations of an organism, an important factor in the analysis of evolutionary processes and conservation status. Existing methods focus on point estimates of the number of subpopulations and lack robust uncertainty quantification. Moreover, whether the number of latent classes in these models is even an identified parameter is an open question. In Chapter 3, we show that when the model is properly specified, the correct number of subpopulations can be recovered almost surely. We then propose an alternative method for estimating the number of latent subpopulations that provides good quantification of uncertainty, and we give a simple procedure for verifying that the proposed method is consistent for the number of subpopulations. The performance of the model in estimating the number of subpopulations and in other common population structure inference problems is assessed in simulations and a real data application.
In contingency table analysis, sparse data are frequently encountered even for modest numbers of variables, resulting in non-existence of maximum likelihood estimates. A common solution is to obtain regularized estimates of the parameters of a log-linear model. Bayesian methods provide a coherent approach to regularization but are often computationally intensive. Conjugate priors ease computational demands, but the conjugate Diaconis-Ylvisaker priors for the parameters of log-linear models do not give rise to closed-form credible regions, complicating posterior inference. In Chapter 4, we derive the optimal Gaussian approximation to the posterior for log-linear models with Diaconis-Ylvisaker priors, and we provide convergence rates and finite-sample bounds for the Kullback-Leibler divergence between the exact posterior and the optimal Gaussian approximation. We demonstrate empirically, in simulations and a real data application, that the approximation is highly accurate, even in relatively small samples. The proposed approximation provides a computationally scalable and principled approach to regularized estimation and approximate Bayesian inference for log-linear models.
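The chapter's KL-optimal Gaussian approximation is not reproduced here, but the flavour of Gaussian posterior approximation for log-linear-type models can be sketched with a standard Laplace approximation (posterior mode plus inverse Hessian) for a Poisson regression with a Gaussian prior; note that both the prior and the approximation differ from those in Chapter 4.

```python
# A hedged illustration: the Laplace approximation to a Poisson-regression
# posterior under an independent Gaussian prior (NOT the Diaconis-Ylvisaker
# prior or the KL-optimal Gaussian of Chapter 4). Data are synthetic.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
beta_true = np.array([0.5, -0.3, 0.2])
y = rng.poisson(np.exp(X @ beta_true))

tau2 = 10.0  # prior variance of each coefficient

def neg_log_post(beta):
    eta = X @ beta
    return -(y @ eta - np.exp(eta).sum()) + beta @ beta / (2 * tau2)

fit = minimize(neg_log_post, np.zeros(3), method="BFGS")
mean = fit.x
# Hessian of the negative log posterior at the mode: X' W X + I / tau2
W = np.exp(X @ mean)
hess = X.T @ (W[:, None] * X) + np.eye(3) / tau2
cov = np.linalg.inv(hess)
print("approximate posterior mean:", mean.round(3))
print("approximate posterior sd:", np.sqrt(np.diag(cov)).round(3))
```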
Another challenging and somewhat non-standard joint modeling problem is inference on tail dependence in stochastic processes. In applications where extreme dependence is of interest, data are almost always time-indexed. Existing methods for inference and modeling in this setting often cluster extreme events or choose window sizes with the goal of preserving temporal information. In Chapter 5, we propose an alternative paradigm for inference on tail dependence in stochastic processes with arbitrary temporal dependence structure in the extremes, based on the idea that the information on strength of tail dependence and the temporal structure in this dependence are both encoded in waiting times between exceedances of high thresholds. We construct a class of time-indexed stochastic processes with tail dependence obtained by endowing the support points in de Haan's spectral representation of max-stable processes with velocities and lifetimes. We extend Smith's model to these max-stable velocity processes and obtain the distribution of waiting times between extreme events at multiple locations. Motivated by this result, a new definition of tail dependence is proposed that is a function of the distribution of waiting times between threshold exceedances, and an inferential framework is constructed for estimating the strength of extremal dependence and quantifying uncertainty in this paradigm. The method is applied to climatological, financial, and electrophysiology data.
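For reference, a standard statement of the Smith model within de Haan's spectral representation, which the chapter takes as a starting point (the velocity and lifetime extension of Chapter 5 is not reproduced here):

```latex
Let $\{(\xi_i, s_i)\}$ be the points of a Poisson process on
$(0,\infty)\times\mathbb{R}^d$ with intensity $\xi^{-2}\,d\xi\,ds$, and let
$\varphi(\cdot\,;\Sigma)$ denote the $d$-variate Gaussian density. Then
\[
  Z(x) \;=\; \max_i \, \xi_i \,\varphi(x - s_i;\Sigma)
\]
is a stationary max-stable process with unit Fr\'echet margins, since
\[
  \Pr\{Z(x) \le z\}
  = \exp\!\Big(-\tfrac{1}{z}\int_{\mathbb{R}^d}\varphi(x - s;\Sigma)\,ds\Big)
  = e^{-1/z}.
\]
```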
The remainder of this thesis focuses on posterior computation by Markov chain Monte Carlo (MCMC), the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov transition kernel, but comparatively little attention has been paid to convergence and estimation error in the resulting approximating Markov chains. In Chapter 6, we propose a framework for assessing when to use approximations in MCMC algorithms, and how much error in the transition kernel should be tolerated to obtain optimal estimation performance with respect to a specified loss function and computational budget. The results require only ergodicity of the exact kernel and control of the kernel approximation accuracy. The theoretical framework is applied to approximations based on random subsets of data, low-rank approximations of Gaussian processes, and a novel approximating Markov chain for discrete mixture models.
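A minimal sketch of one approximation family mentioned above, a Metropolis-Hastings kernel whose acceptance ratio is computed on a random subset of the data (subset log-likelihood rescaled by n/m), is given below; the model and tuning constants are illustrative, and the resulting kernel is only an approximation to the exact one, which is precisely the trade-off the chapter studies.

```python
# Approximate MH for the mean of a normal model with known variance, using a
# fresh random subset of the data in each acceptance step (flat prior).
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=1.0, size=10_000)      # y_i ~ N(theta, 1), true theta = 1
n, m = data.size, 500                        # full data size, subset size

theta, samples = 0.0, []
for _ in range(5_000):
    prop = theta + 0.05 * rng.normal()       # random-walk proposal
    batch = rng.choice(data, size=m, replace=False)
    # Subset log-likelihoods, rescaled by n/m:
    ll_cur = (n / m) * np.sum(-0.5 * (batch - theta) ** 2)
    ll_prop = (n / m) * np.sum(-0.5 * (batch - prop) ** 2)
    if np.log(rng.uniform()) < ll_prop - ll_cur:
        theta = prop                         # approximate MH acceptance
    samples.append(theta)

print("approximate posterior mean:", np.mean(samples[1_000:]).round(3))
```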
Data augmentation Gibbs samplers are arguably the most popular class of algorithms for approximately sampling from the posterior distribution of the parameters of generalized linear models. The truncated normal and Pólya-Gamma data augmentation samplers are standard examples for probit and logit links, respectively. Motivated by an important problem in quantitative advertising, in Chapter 7 we consider the application of these algorithms to modeling rare events. We show that when the sample size is large but the observed number of successes is small, these data augmentation samplers mix very slowly, with a spectral gap that converges to zero at a rate at least proportional to the reciprocal of the square root of the sample size, up to a log factor. In simulation studies, moderate sample sizes result in high autocorrelations and small effective sample sizes. Similar empirical results are observed for related data augmentation samplers for multinomial logit and probit models. When applied to a real quantitative advertising dataset, the data augmentation samplers mix very poorly, whereas Hamiltonian Monte Carlo and a type of independence chain Metropolis algorithm show good mixing on the same dataset.
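The probit case can be sketched directly: below is a minimal version of the truncated-normal (Albert-Chib) data augmentation sampler for an intercept-only probit model in the rare-event regime described above, where the slow mixing shows up as a lag-1 autocorrelation close to one. Sample sizes and iteration counts are illustrative.

```python
# Albert-Chib data augmentation for an intercept-only probit model with a
# flat prior: z_i | beta is N(beta, 1) truncated by the sign implied by y_i,
# and beta | z is N(mean(z), 1/n). Rare events: n large, few successes.
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(4)
n, n_success = 10_000, 20                 # large n, very few successes
beta, draws = 0.0, []
for _ in range(2_000):
    # z_i | beta: truncated to (0, inf) if y_i = 1, (-inf, 0] otherwise
    z_pos = truncnorm.rvs(-beta, np.inf, loc=beta, size=n_success,
                          random_state=rng)
    z_neg = truncnorm.rvs(-np.inf, -beta, loc=beta, size=n - n_success,
                          random_state=rng)
    # beta | z for the intercept-only design: N(zbar, 1/n)
    zbar = (z_pos.sum() + z_neg.sum()) / n
    beta = rng.normal(zbar, 1.0 / np.sqrt(n))
    draws.append(beta)

d = np.asarray(draws[200:])
lag1 = np.corrcoef(d[:-1], d[1:])[0, 1]
print(f"lag-1 autocorrelation: {lag1:.3f}")   # near 1: very slow mixing
```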
Abstract:
Otto-von-Guericke-Universität Magdeburg, Faculty of Mathematics, dissertation, 2016
Abstract:
One of the most challenging tasks underlying many hyperspectral imagery applications is spectral unmixing, which decomposes a mixed pixel into a collection of reflectance spectra, called endmember signatures, and their corresponding fractional abundances. Independent Component Analysis (ICA) has recently been proposed as a tool to unmix hyperspectral data. The basic goal of ICA is to find a linear transformation that recovers independent sources (abundance fractions) given only sensor observations that are unknown linear mixtures of the unobserved independent sources. In hyperspectral imagery, the sum of the abundance fractions associated with each pixel is constant due to physical constraints in the data acquisition process; thus, the sources cannot be independent. This paper addresses hyperspectral data source dependence and its impact on ICA performance. The study considers simulated and real data. In simulated scenarios, hyperspectral observations are described by a generative model that takes into account the degradation mechanisms normally found in hyperspectral applications. We conclude that ICA does not correctly unmix all sources; this conclusion is based on a study of the mutual information. Nevertheless, some sources may be well separated, mainly when the number of sources is large and the signal-to-noise ratio (SNR) is high.
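A minimal simulated version of this experiment can be written in a few lines: abundance sources drawn on the simplex (hence dependent, because they sum to one), mixed linearly by endmember spectra, then handed to FastICA. Endmembers, noise level and dimensions are illustrative assumptions rather than the paper's generative model.

```python
# Linear mixing with sum-to-one abundances, unmixed by ICA; the sum-to-one
# constraint makes the sources dependent, so ICA cannot fully separate them.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(5)
n_pixels, n_sources, n_bands = 5_000, 3, 20

abund = rng.dirichlet(np.ones(n_sources), size=n_pixels)   # rows sum to 1
endmembers = rng.uniform(0.1, 0.9, size=(n_sources, n_bands))
noise = 0.01 * rng.normal(size=(n_pixels, n_bands))
mixed = abund @ endmembers + noise                         # linear mixture model

est = FastICA(n_components=n_sources, random_state=0).fit_transform(mixed)

# Best-case absolute correlation between each true source and any estimate:
corr = np.abs(np.corrcoef(abund.T, est.T)[:n_sources, n_sources:])
print("per-source best |correlation|:", corr.max(axis=1).round(2))
```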
Abstract:
BACKGROUND: Regional differences in physician supply are found in many health care systems, regardless of their organizational and financial structure. A theoretical model is developed for physicians' decisions on office location, covering demand-side factors and a consumption time function. METHODS: To test the propositions following from the theoretical model, generalized linear models were estimated to explain differences across 412 German districts. Various factors found in the literature were included to control for physicians' regional preferences. RESULTS: Evidence in favor of the first three propositions of the theoretical model was found. Specialists show a stronger association with more highly populated districts than GPs do. Although indicators of regional preferences are significantly correlated with physician density, their coefficients are not as high as that of population density. CONCLUSIONS: If regional disparities are to be addressed by political action, the focus should be on counteracting those parameters representing physicians' preferences in over- and undersupplied regions.
Abstract:
A spectral aging test was developed to estimate the photochemical damage of oil, acrylic and gouache paints exposed to permanent lighting. The paints were irradiated at seven different wavelengths in the optical range to control and evaluate their spectral behaviour. To this end, boxes with isolated aging cells were built, and in each box one LED of a different wavelength and one photodiode were installed. Inside the boxes, the temperature of an exhibition area was recreated by means of a thermocouple sensor that controlled the temperature using a fan; the heat produced by the LED was dissipated by a thermal radiator. Moreover, to evaluate the dependence of the damage on irradiation level and exposure time, the test was performed at two different irradiation levels in ten exposure series. After each series, the spectral reflectance was measured, and the data collected for each paint and wavelength were used to develop a model of the damage produced by the interaction between the spectral radiant exposure and the paint.
Abstract:
We treat the problem of existence of a location-then-price equilibrium in the circle model with a linear-quadratic transportation cost function, which can be either convex or concave. We show the existence of a unique perfect equilibrium for the concave case when the linear and quadratic terms are equal, and of a unique perfect equilibrium for the convex case when the linear term is equal to zero. Aside from these two cases, there are feasible locations for the firms at which no equilibrium in the price subgame exists. Finally, we provide a full taxonomy of the price equilibrium regions in terms of the weights of the linear and quadratic terms in the cost function.
Abstract:
In this thesis, new classes of models for multivariate linear regression, defined by finite mixtures of seemingly unrelated contaminated normal regression models and seemingly unrelated contaminated normal cluster-weighted models, are illustrated. The main difference between these families is that the covariates are treated as fixed in the former class of models and as random in the latter; thus, in the cluster-weighted models the assignment of the data points to the unknown groups of observations also depends on the covariates. These classes extend mixture-based regression analysis to the modelling of multivariate and correlated responses in the presence of mild outliers, and they allow a different vector of regressors to be specified for the prediction of each response. Expectation-conditional maximisation algorithms for computing the maximum likelihood estimates of the model parameters have been derived. As the number of free parameters increases quadratically with the number of responses and covariates, analyses based on the proposed models can become unfeasible in practical applications. These problems have been overcome by introducing constraints on the elements of the covariance matrices, following an approach based on the eigen-decomposition of the covariance matrices. The performance of the new models has been studied by simulations and on real datasets, in comparison with other models. To gain additional flexibility, mixtures of seemingly unrelated contaminated normal regression models have also been specified so as to allow the mixing proportions to be expressed as functions of concomitant covariates. An illustration of the new models with concomitant variables and a study on housing tension in the municipalities of the Emilia-Romagna region, based on different types of multivariate linear regression models, have been carried out.
Abstract:
In acquired immunodeficiency syndrome (AIDS) studies it is quite common to observe viral load measurements collected irregularly over time. Moreover, these measurements can be subject to upper and/or lower detection limits, depending on the quantification assay. A complication arises when these continuous repeated measures have heavy-tailed behavior. For such data structures, we propose a robust censored linear model based on the multivariate Student's t-distribution. To account for the autocorrelation among irregularly observed measures, a damped exponential correlation structure is employed. An efficient expectation-maximization (EM) type algorithm is developed for computing the maximum likelihood estimates, obtaining as a by-product the standard errors of the fixed effects and the log-likelihood function. The proposed algorithm uses closed-form expressions at the E-step that rely on formulas for the mean and variance of a truncated multivariate Student's t-distribution. The methodology is illustrated through an application to a Human Immunodeficiency Virus (HIV)-AIDS study and several simulation studies.
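The damped exponential correlation (DEC) structure mentioned above has a simple closed form, corr(y_j, y_k) = phi1^(|t_j - t_k|^phi2); a minimal sketch with illustrative parameter values follows.

```python
# DEC correlation matrix for irregularly spaced measurement times.
import numpy as np

def dec_corr(times, phi1, phi2):
    """Corr(y_j, y_k) = phi1 ** (|t_j - t_k| ** phi2), 0 <= phi1 < 1, phi2 >= 0.

    phi2 = 1 recovers a continuous-time AR(1) structure, while phi2 = 0
    gives compound symmetry (equal off-diagonal correlations).
    """
    lags = np.abs(np.subtract.outer(times, times))
    R = phi1 ** (lags ** phi2)
    np.fill_diagonal(R, 1.0)   # guards the phi2 = 0 edge case (0**0 == 1)
    return R

times = np.array([0.0, 0.5, 2.0, 4.5])     # irregular visit times
print(dec_corr(times, phi1=0.8, phi2=0.5).round(3))
```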
Abstract:
A method using the ring-oven technique for pre-concentration in filter paper discs, combined with near-infrared hyperspectral imaging, is proposed to identify four detergent and dispersant additives and to determine their concentration in gasoline. Different approaches were used to select the best image data processing to gather the relevant spectral information. This was attained by selecting the pixels of the region of interest (ROI), using a pre-calculated threshold value of the PCA scores arranged as histograms, to select the spectra set; summing up the selected spectra to achieve representativeness; and compensating for the superimposed filter paper spectral information, also supported by score histograms for each individual sample. The best classification model was achieved using linear discriminant analysis with a genetic algorithm (LDA/GA), whose correct classification rate in the external validation set was 92%. Prior classification of the type of additive present in the gasoline is necessary to define the PLS model required for its quantitative determination. Considering that two of the additives studied show high spectral similarity, a single PLS regression model was constructed to predict their content in gasoline, while two additional models were used for the remaining additives. External validation of these regression models showed a mean percentage error of prediction ranging from 5 to 15%.
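The quantification step can be sketched with a generic PLS regression from spectra to concentration; the synthetic data below are stand-ins for the paper's hyperspectral measurements, and the component count is an illustrative choice.

```python
# PLS regression mapping NIR-like spectra to additive concentration, with a
# held-out set to mimic the external validation and its mean % error.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
n_samples, n_wavelengths = 80, 120
concentration = rng.uniform(50, 500, size=n_samples)        # e.g. mg/kg
signature = np.sin(np.linspace(0, 3, n_wavelengths))        # pseudo additive band
spectra = (np.outer(concentration, signature)
           + 5.0 * rng.normal(size=(n_samples, n_wavelengths)))

X_tr, X_te, y_tr, y_te = train_test_split(spectra, concentration, random_state=0)
pls = PLSRegression(n_components=3).fit(X_tr, y_tr)
pred = pls.predict(X_te).ravel()
mpe = 100 * np.mean(np.abs(pred - y_te) / y_te)             # mean % error
print(f"mean percentage error of prediction: {mpe:.1f}%")
```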
Abstract:
Remote sensing data are increasingly available and can be used to monitor the vegetative development of major agricultural crops, such as Arabica coffee in Brazil, provided that the relationship between spectral and agronomic data is well understood. The main objective of this work was therefore to assess the use of QuickBird satellite images to estimate biophysical parameters of the coffee crop. The test area comprised 25 coffee fields located between the cities of Ribeirão Corrente, Franca and Cristais Paulista (SP), Brazil, and the biophysical parameters used were row spacing, spacing between plants, plant height, LAI, canopy diameter, percentage of vegetation cover, roughness and biomass. The spectral data were the reflectance values of four QuickBird bands and the values of four vegetation indices (NDVI, GVI, SAVI and RVI) derived from the same satellite. All these data were analyzed using linear and nonlinear regression methods to generate models for estimating the biophysical parameters. Regression models based on nonlinear equations were more appropriate for estimating parameters such as the LAI and the percentage of biomass, which are important indicators of coffee crop productivity.
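Three of the four vegetation indices named above have standard closed forms; a minimal sketch follows (GVI is omitted because its tasseled-cap-style greenness coefficients are sensor-specific), with illustrative band values.

```python
# Standard vegetation index formulas from red and near-infrared reflectance.
import numpy as np

def ndvi(nir, red):
    return (nir - red) / (nir + red)

def savi(nir, red, L=0.5):                 # L: soil-adjustment factor
    return (1 + L) * (nir - red) / (nir + red + L)

def rvi(nir, red):
    return nir / red

red = np.array([0.05, 0.08, 0.12])         # red-band reflectance
nir = np.array([0.45, 0.40, 0.30])         # NIR-band reflectance
print(ndvi(nir, red).round(3), savi(nir, red).round(3), rvi(nir, red).round(2))
```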
Abstract:
PURPOSE: To evaluate the sensitivity and specificity of machine learning classifiers (MLCs) for glaucoma diagnosis using spectral-domain OCT (SD-OCT) and standard automated perimetry (SAP). METHODS: Observational cross-sectional study. Sixty-two glaucoma patients and 48 healthy individuals were included. All patients underwent a complete ophthalmologic examination, achromatic standard automated perimetry (SAP) and retinal nerve fiber layer (RNFL) imaging with SD-OCT (Cirrus HD-OCT; Carl Zeiss Meditec Inc., Dublin, California). Receiver operating characteristic (ROC) curves were obtained for all SD-OCT parameters and global indices of SAP. Subsequently, the following MLCs were tested using parameters from SD-OCT and SAP: Bagging (BAG), Naive Bayes (NB), Multilayer Perceptron (MLP), Radial Basis Function (RBF), Random Forest (RAN), Ensemble Selection (ENS), Classification Tree (CTREE), AdaBoost M1 (ADA), Support Vector Machine Linear (SVML) and Support Vector Machine Gaussian (SVMG). Areas under the receiver operating characteristic curves (aROC) obtained for isolated SAP and OCT parameters were compared with those of MLCs using OCT+SAP data. RESULTS: Combining OCT and SAP data, the MLCs' aROCs varied from 0.777 (CTREE) to 0.946 (RAN). The best OCT+SAP aROC, obtained with RAN (0.946), was significantly larger than that of the best single OCT parameter (p<0.05), but was not significantly different from the aROC obtained with the best single SAP parameter (p=0.19). CONCLUSION: Machine learning classifiers trained on OCT and SAP data can successfully discriminate between healthy and glaucomatous eyes. The combination of OCT and SAP measurements improved diagnostic accuracy compared with OCT data alone.
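The comparison design can be sketched generically: several off-the-shelf classifiers trained on combined structural and functional features and ranked by cross-validated aROC. The features below are synthetic stand-ins, and only a subset of the listed classifiers has a direct scikit-learn counterpart.

```python
# Cross-validated aROC comparison of several classifiers on synthetic
# stand-in features (110 subjects, as in the study; 12 arbitrary features).
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n_patients, n_features = 110, 12           # 62 glaucoma + 48 healthy
y = np.r_[np.ones(62), np.zeros(48)]
X = rng.normal(size=(n_patients, n_features)) + 0.6 * y[:, None]

classifiers = {
    "BAG": BaggingClassifier(random_state=0),
    "NB": GaussianNB(),
    "RAN": RandomForestClassifier(random_state=0),
    "SVML": SVC(kernel="linear", probability=True, random_state=0),
    "SVMG": SVC(kernel="rbf", probability=True, random_state=0),
}
for name, clf in classifiers.items():
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: aROC = {auc:.3f}")
```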