12 results for Sparsity
at Duke University
Abstract:
Time-dependent density functional theory (TDDFT) has broad application in the study of electronic response, excitation and transport. To extend such application to large and complex systems, we develop a reformulation of TDDFT equations in terms of non-orthogonal localized molecular orbitals (NOLMOs). NOLMO is the most localized representation of electronic degrees of freedom and has been used in ground state calculations. In atomic orbital (AO) representation, the sparsity of NOLMO is transferred to the coefficient matrix of molecular orbitals (MOs). Its novel use in TDDFT here leads to a very simple form of time propagation equations which can be solved with linear-scaling effort. We have tested the method for several long-chain saturated and conjugated molecular systems within the self-consistent charge density-functional tight-binding method (SCC-DFTB) and demonstrated its accuracy. This opens up pathways for TDDFT applications to large bio- and nano-systems.
Abstract:
In this paper, we propose generalized sampling approaches for measuring a multi-dimensional object using a compact compound-eye imaging system called thin observation module by bound optics (TOMBO). This paper presents the proposed system model, physical examples, and simulations to verify TOMBO imaging using generalized sampling. In the system, an object is modulated and multiplied by a weight distribution with physical coding, and the coded optical signal is integrated onto a detector array. A numerical estimation algorithm employing a sparsity constraint is used for object reconstruction.
Abstract:
We discuss a general approach to dynamic sparsity modeling in multivariate time series analysis. Time-varying parameters are linked to latent processes that are thresholded to induce zero values adaptively, providing natural mechanisms for dynamic variable inclusion/selection. We discuss Bayesian model specification, analysis and prediction in dynamic regressions, time-varying vector autoregressions, and multivariate volatility models using latent thresholding. Application to a topical macroeconomic time series problem illustrates some of the benefits of the approach in terms of statistical and economic interpretations as well as improved predictions. Supplementary materials for this article are available online. © 2013 Copyright Taylor and Francis Group, LLC.
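The thresholding idea in this abstract can be illustrated with a minimal sketch. It assumes a single coefficient whose latent process follows a stationary AR(1) model and is set to zero whenever its absolute value falls below a fixed threshold. The function name, the AR(1) form, and all parameter values are illustrative assumptions, not details taken from the abstract.

```python
import random

def simulate_latent_threshold(n_steps, mu=0.0, phi=0.95, sigma=0.1,
                              threshold=0.3, seed=0):
    """Simulate a latent AR(1) process and its thresholded coefficient.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    beta = mu
    latent, effective = [], []
    for _ in range(n_steps):
        # Latent AR(1) evolution around the mean mu.
        beta = mu + phi * (beta - mu) + rng.gauss(0.0, sigma)
        # Thresholding: the effective coefficient is exactly zero
        # whenever the latent value is small in magnitude.
        b = beta if abs(beta) >= threshold else 0.0
        latent.append(beta)
        effective.append(b)
    return latent, effective

latent, effective = simulate_latent_threshold(200)
n_zero = sum(1 for b in effective if b == 0.0)
print(f"{n_zero} of {len(effective)} time points have the coefficient switched off")
```

In a dynamic regression, the thresholded series would multiply a predictor, so the predictor drops out of the model during the periods when the coefficient is zero.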
Abstract:
A framework for adaptive and non-adaptive statistical compressive sensing is developed, where a statistical model replaces the standard sparsity model of classical compressive sensing. We propose within this framework optimal task-specific sensing protocols specifically and jointly designed for classification and reconstruction. A two-step adaptive sensing paradigm is developed, where online sensing is applied to detect the signal class in the first step, followed by a reconstruction step adapted to the detected class and the observed samples. The approach is based on information theory, here tailored for Gaussian mixture models (GMMs), where an information-theoretic objective relationship between the sensed signals and a representation of the specific task of interest is maximized. Experimental results using synthetic signals, Landsat satellite attributes, and natural images of different sizes and with different noise levels show the improvements achieved using the proposed framework when compared to more standard sensing protocols. The underlying formulation can be applied beyond GMMs, at the price of higher mathematical and computational complexity. © 1991-2012 IEEE.
Abstract:
In regression analysis of counts, a lack of simple and efficient algorithms for posterior computation has made Bayesian approaches appear unattractive and thus underdeveloped. We propose a lognormal and gamma mixed negative binomial (NB) regression model for counts, and present efficient closed-form Bayesian inference; unlike conventional Poisson models, the proposed approach has two free parameters to include two different kinds of random effects, and allows the incorporation of prior information, such as sparsity in the regression coefficients. By placing a gamma distribution prior on the NB dispersion parameter r, and connecting a log-normal distribution prior with the logit of the NB probability parameter p, efficient Gibbs sampling and variational Bayes inference are both developed. The closed-form updates are obtained by exploiting conditional conjugacy via both a compound Poisson representation and a Polya-Gamma distribution based data augmentation approach. The proposed Bayesian inference can be implemented routinely, while being easily generalizable to more complex settings involving multivariate dependence structures. The algorithms are illustrated using real examples. Copyright 2012 by the author(s)/owner(s).
Abstract:
Learning multiple tasks across heterogeneous domains is a challenging problem since the feature space may not be the same for different tasks. We assume the data in multiple tasks are generated from a latent common domain via sparse domain transforms and propose a latent probit model (LPM) to jointly learn the domain transforms and the shared probit classifier in the common domain. To learn meaningful task relatedness and avoid over-fitting in classification, we introduce sparsity in the domain transform matrices, as well as in the common classifier. We derive theoretical bounds for the estimation error of the classifier in terms of the sparsity of domain transforms. An expectation-maximization algorithm is derived for learning the LPM. The effectiveness of the approach is demonstrated on several real datasets.
Abstract:
We introduce a dynamic directional model (DDM) for studying brain effective connectivity based on intracranial electrocorticographic (ECoG) time series. The DDM consists of two parts: a set of differential equations describing neuronal activity of brain components (state equations), and observation equations linking the underlying neuronal states to observed data. When applied to functional MRI or EEG data, DDMs usually have complex formulations and thus can accommodate only a few regions, due to limitations in spatial resolution and/or temporal resolution of these imaging modalities. In contrast, we formulate our model in the context of ECoG data. The combined high temporal and spatial resolution of ECoG data results in a much simpler DDM, allowing investigation of complex connections between many regions. To identify functionally segregated sub-networks, a form of biologically economical brain networks, we propose the Potts model for the DDM parameters. The neuronal states of brain components are represented by cubic spline bases and the parameters are estimated by minimizing a log-likelihood criterion that combines the state and observation equations. The Potts model is converted to the Potts penalty in the penalized regression approach to achieve sparsity in parameter estimation, for which a fast iterative algorithm is developed. The methods are applied to an auditory ECoG dataset.
Abstract:
Histopathology is the clinical standard for tissue diagnosis. However, histopathology has several limitations including that it requires tissue processing, which can take 30 minutes or more, and requires a highly trained pathologist to diagnose the tissue. Additionally, the diagnosis is qualitative, and the lack of quantitation leads to possible observer-specific diagnosis. Taken together, it is difficult to diagnose tissue at the point of care using histopathology.
Several clinical situations could benefit from more rapid and automated histological processing, which could reduce the time and the number of steps required between obtaining a fresh tissue specimen and rendering a diagnosis. For example, there is need for rapid detection of residual cancer on the surface of tumor resection specimens during excisional surgeries, which is known as intraoperative tumor margin assessment. Additionally, rapid assessment of biopsy specimens at the point-of-care could enable clinicians to confirm that a suspicious lesion is successfully sampled, thus preventing an unnecessary repeat biopsy procedure. Rapid and low cost histological processing could also be potentially useful in settings lacking the human resources and equipment necessary to perform standard histologic assessment. Lastly, automated interpretation of tissue samples could potentially reduce inter-observer error, particularly in the diagnosis of borderline lesions.
To address these needs, high quality microscopic images of the tissue must be obtained in rapid timeframes, in order for a pathologic assessment to be useful for guiding the intervention. Optical microscopy is a powerful technique to obtain high-resolution images of tissue morphology in real-time at the point of care, without the need for tissue processing. In particular, a number of groups have combined fluorescence microscopy with vital fluorescent stains to visualize micro-anatomical features of thick (i.e. unsectioned or unprocessed) tissue. However, robust methods for segmentation and quantitative analysis of heterogeneous images are essential to enable automated diagnosis. Thus, the goal of this work was to obtain high resolution imaging of tissue morphology through employing fluorescence microscopy and vital fluorescent stains and to develop a quantitative strategy to segment and quantify tissue features in heterogeneous images, such as nuclei and the surrounding stroma, which will enable automated diagnosis of thick tissues.
To achieve these goals, three specific aims were proposed. The first aim was to develop an image processing method that can differentiate nuclei from background tissue heterogeneity and enable automated diagnosis of thick tissue at the point of care. A computational technique called sparse component analysis (SCA) was adapted to isolate features of interest, such as nuclei, from the background. SCA has been used previously in the image processing community for image compression, enhancement, and restoration, but has never been applied to separate distinct tissue types in a heterogeneous image. In combination with a high resolution fluorescence microendoscope (HRME) and a contrast agent acriflavine, the utility of this technique was demonstrated through imaging preclinical sarcoma tumor margins. Acriflavine localizes to the nuclei of cells where it reversibly associates with RNA and DNA. Additionally, acriflavine shows some affinity for collagen and muscle. SCA was adapted to isolate acriflavine positive features or APFs (which correspond to RNA and DNA) from background tissue heterogeneity. The circle transform (CT) was applied to the SCA output to quantify the size and density of overlapping APFs. The sensitivity of the SCA+CT approach to variations in APF size, density and background heterogeneity was demonstrated through simulations. Specifically, SCA+CT achieved the lowest errors for higher contrast ratios and larger APF sizes. When applied to tissue images of excised sarcoma margins, SCA+CT correctly isolated APFs and showed consistently increased density in tumor and tumor + muscle images compared to images containing muscle. Next, variables were quantified from images of resected primary sarcomas and used to optimize a multivariate model. The sensitivity and specificity for differentiating positive from negative ex vivo resected tumor margins were 82% and 75%, respectively.
The utility of this approach was further tested by imaging the in vivo tumor cavities from 34 mice after resection of a sarcoma with local recurrence as a benchmark. When applied prospectively to images from the tumor cavity, the sensitivity and specificity for differentiating local recurrence were 78% and 82%, respectively. The results indicate that SCA+CT can accurately delineate APFs in heterogeneous tissue, which is essential to enable automated and rapid surveillance of tissue pathology.
Two primary challenges were identified in the work in aim 1. First, while SCA can be used to isolate features, such as APFs, from heterogeneous images, its performance is limited by the contrast between APFs and the background. Second, while it is feasible to create mosaics by scanning a sarcoma tumor bed in a mouse, which is on the order of 3-7 mm in any one dimension, it is not feasible to evaluate an entire human surgical margin. Thus, improvements to the microscopic imaging system were made to (1) improve image contrast through rejecting out-of-focus background fluorescence and to (2) increase the field of view (FOV) while maintaining the sub-cellular resolution needed for delineation of nuclei. To address these challenges, a technique called structured illumination microscopy (SIM) was employed in which the entire FOV is illuminated with a defined spatial pattern rather than scanning a focal spot, such as in confocal microscopy.
Thus, the second aim was to improve image contrast and increase the FOV through employing wide-field, non-contact structured illumination microscopy and optimize the segmentation algorithm for the new imaging modality. Both image contrast and FOV were increased through the development of a wide-field fluorescence SIM system. Clear improvement in image contrast was seen in structured illumination images compared to uniform illumination images. Additionally, the FOV is over 13X larger than the fluorescence microendoscope used in aim 1. Initial segmentation results of SIM images revealed that SCA is unable to segment large numbers of APFs in the tumor images. Because the FOV of the SIM system is over 13X larger than the FOV of the fluorescence microendoscope, dense collections of APFs commonly seen in tumor images could no longer be sparsely represented, and the fundamental sparsity assumption associated with SCA was no longer met. Thus, an algorithm called maximally stable extremal regions (MSER) was investigated as an alternative approach for APF segmentation in SIM images. MSER was able to accurately segment large numbers of APFs in SIM images of tumor tissue. In addition to optimizing MSER for SIM image segmentation, an optimal frequency of the illumination pattern used in SIM was carefully selected because the image signal-to-noise ratio (SNR) is dependent on the grid frequency. A grid frequency of 31.7 mm⁻¹ led to the highest SNR and lowest percent error associated with MSER segmentation.
Once MSER was optimized for SIM image segmentation and the optimal grid frequency was selected, a quantitative model was developed to diagnose mouse sarcoma tumor margins that were imaged ex vivo with SIM. Tumor margins were stained with acridine orange (AO) in aim 2 because AO was found to stain the sarcoma tissue more brightly than acriflavine. Both acriflavine and AO are intravital dyes, which have been shown to stain nuclei, skeletal muscle, and collagenous stroma. A tissue-type classification model was developed to differentiate localized regions (75x75 µm) of tumor from skeletal muscle and adipose tissue based on the MSER segmentation output. Specifically, a logistic regression model was used to classify each localized region. The logistic regression model yielded an output in terms of probability (0-100%) that tumor was located within each 75x75 µm region. The model performance was tested using a receiver operating characteristic (ROC) curve analysis that revealed 77% sensitivity and 81% specificity. For margin classification, the whole margin image was divided into localized regions and this tissue-type classification model was applied. In a subset of 6 margins (3 negative, 3 positive), it was shown that with a tumor probability threshold of 50%, 8% of all regions from negative margins exceeded this threshold, while over 17% of all regions exceeded the threshold in the positive margins. Thus, 8% of regions in negative margins were considered false positives. These false positive regions are likely due to the high density of APFs present in normal tissues, which clearly demonstrates a challenge in implementing this automatic algorithm based on AO staining alone.
Thus, the third aim was to improve the specificity of the diagnostic model through leveraging other sources of contrast. Modifications were made to the SIM system to enable fluorescence imaging at a variety of wavelengths. Specifically, the SIM system was modified to enable imaging of red fluorescent protein (RFP) expressing sarcomas, which were used to delineate the location of tumor cells within each image. Initial analysis of AO stained panels confirmed that there was room for improvement in tumor detection, particularly with regard to false positive regions that were negative for RFP. One approach for improving the specificity of the diagnostic model was to investigate using a fluorophore that was more specific to staining tumor. Specifically, tetracycline was selected because it appeared to specifically stain freshly excised tumor tissue in a matter of minutes, and was non-toxic and stable in solution. Results indicated that tetracycline staining has promise for increasing the specificity of tumor detection in SIM images of a preclinical sarcoma model and further investigation is warranted.
In conclusion, this work presents the development of a combination of tools that is capable of automated segmentation and quantification of micro-anatomical images of thick tissue. When compared to the fluorescence microendoscope, wide-field multispectral fluorescence SIM imaging provided improved image contrast, a larger FOV with comparable resolution, and the ability to image a variety of fluorophores. MSER was an appropriate and rapid approach to segment dense collections of APFs from wide-field SIM images. Variables that reflect the morphology of the tissue, such as the density, size, and shape of nuclei and nucleoli, can be used to automatically diagnose SIM images. The clinical utility of SIM imaging and MSER segmentation to detect microscopic residual disease has been demonstrated by imaging excised preclinical sarcoma margins. Ultimately, this work demonstrates that fluorescence imaging of tissue micro-anatomy combined with a specialized algorithm for delineation and quantification of features is a means for rapid, non-destructive and automated detection of microscopic disease, which could improve cancer management in a variety of clinical scenarios.
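The region-level classification described for aim 2 (a logistic model per 75x75 µm region, a 50% probability threshold, and sensitivity and specificity as summary metrics) can be sketched as follows. This is a minimal illustration, not the thesis implementation: the two features (APF density and mean APF size), the weights, the bias, and the example data are all hypothetical.

```python
import math

def tumor_probability(features, weights, bias):
    """Logistic model: probability that a region contains tumor."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def fraction_above_threshold(regions, weights, bias, threshold=0.5):
    """Fraction of regions whose tumor probability exceeds the threshold."""
    probs = [tumor_probability(f, weights, bias) for f in regions]
    return sum(1 for p in probs if p > threshold) / len(probs)

def sensitivity_specificity(predicted, actual):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical regions: (APF density, mean APF size), both scaled to [0, 1].
regions = [(0.9, 0.4), (0.2, 0.1), (0.7, 0.8), (0.1, 0.3)]
weights, bias = (4.0, 2.0), -3.0  # made-up coefficients

flagged_fraction = fraction_above_threshold(regions, weights, bias)
predicted = [tumor_probability(f, weights, bias) > 0.5 for f in regions]
actual = [True, False, False, False]  # made-up ground truth
sens, spec = sensitivity_specificity(predicted, actual)
```

Applied to a negative margin, the flagged fraction corresponds to the false positive rate reported in the abstract; raising the threshold trades sensitivity for specificity.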
Abstract:
PURPOSE: X-ray computed tomography (CT) is widely used, both clinically and preclinically, for fast, high-resolution anatomic imaging; however, compelling opportunities exist to expand its use in functional imaging applications. For instance, spectral information combined with nanoparticle contrast agents enables quantification of tissue perfusion levels, while temporal information details cardiac and respiratory dynamics. The authors propose and demonstrate a projection acquisition and reconstruction strategy for 5D CT (3D+dual energy+time) which recovers spectral and temporal information without substantially increasing radiation dose or sampling time relative to anatomic imaging protocols. METHODS: The authors approach the 5D reconstruction problem within the framework of low-rank and sparse matrix decomposition. Unlike previous work on rank-sparsity constrained CT reconstruction, the authors establish an explicit rank-sparse signal model to describe the spectral and temporal dimensions. The spectral dimension is represented as a well-sampled time and energy averaged image plus regularly undersampled principal components describing the spectral contrast. The temporal dimension is represented as the same time and energy averaged reconstruction plus contiguous, spatially sparse, and irregularly sampled temporal contrast images. Using a nonlinear, image-domain filtration approach, which the authors refer to as rank-sparse kernel regression, the authors transfer image structure from the well-sampled time and energy averaged reconstruction to the spectral and temporal contrast images. This regularization strategy strictly constrains the reconstruction problem while approximately separating the temporal and spectral dimensions. Separability results in a highly compressed representation for the 5D data in which projections are shared between the temporal and spectral reconstruction subproblems, enabling substantial undersampling.
The authors solved the 5D reconstruction problem using the split Bregman method and GPU-based implementations of backprojection, reprojection, and kernel regression. Using a preclinical mouse model, the authors apply the proposed algorithm to study myocardial injury following radiation treatment of breast cancer. RESULTS: Quantitative 5D simulations are performed using the MOBY mouse phantom. Twenty data sets (ten cardiac phases, two energies) are reconstructed with 88 μm, isotropic voxels from 450 total projections acquired over a single 360° rotation. In vivo 5D myocardial injury data sets acquired in two mice injected with gold and iodine nanoparticles are also reconstructed with 20 data sets per mouse using the same acquisition parameters (dose: ∼60 mGy). For both the simulations and the in vivo data, the reconstruction quality is sufficient to perform material decomposition into gold and iodine maps to localize the extent of myocardial injury (gold accumulation) and to measure cardiac functional metrics (vascular iodine). Their 5D CT imaging protocol represents a 95% reduction in radiation dose per cardiac phase and energy and a 40-fold decrease in projection sampling time relative to their standard imaging protocol. CONCLUSIONS: Their 5D CT data acquisition and reconstruction protocol efficiently exploits the rank-sparse nature of spectral and temporal CT data to provide high-fidelity reconstruction results without increased radiation dose or sampling time.
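Split Bregman solvers for sparsity-regularized problems typically rely on a soft-thresholding (shrinkage) step for the L1 subproblem. The sketch below shows only that generic operator on a plain list of numbers. It is an illustration of the standard technique, not the authors' 5D reconstruction code, and the function name and example values are assumptions.

```python
def soft_threshold(values, lam):
    """Shrink each value toward zero by lam.

    Values with magnitude at most lam become exactly zero, which is how
    an L1 penalty produces sparse solutions.
    """
    out = []
    for v in values:
        if v > lam:
            out.append(v - lam)
        elif v < -lam:
            out.append(v + lam)
        else:
            out.append(0.0)
    return out

# Example: small entries are zeroed, large entries are shrunk by lam.
shrunk = soft_threshold([3.0, -0.5, 0.2, -2.0], 1.0)
```

In a full solver this step alternates with a data-fidelity update and a Bregman variable update; those parts depend on the forward model and are omitted here.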
Abstract:
Many modern applications fall into the category of "large-scale" statistical problems, in which both the number of observations n and the number of features or parameters p may be large. Many existing methods focus on point estimation, despite the continued relevance of uncertainty quantification in the sciences, where the number of parameters to estimate often exceeds the sample size, despite huge increases in the value of n typically seen in many fields. Thus, the tendency in some areas of industry to dispense with traditional statistical analysis on the basis that "n=all" is of little relevance outside of certain narrow applications. The main result of the Big Data revolution in most fields has instead been to make computation much harder without reducing the importance of uncertainty quantification. Bayesian methods excel at uncertainty quantification, but often scale poorly relative to alternatives. This conflict between the statistical advantages of Bayesian procedures and their substantial computational disadvantages is perhaps the greatest challenge facing modern Bayesian statistics, and is the primary motivation for the work presented here.
Two general strategies for scaling Bayesian inference are considered. The first is the development of methods that lend themselves to faster computation, and the second is design and characterization of computational algorithms that scale better in n or p. In the first instance, the focus is on joint inference outside of the standard problem of multivariate continuous data that has been a major focus of previous theoretical work in this area. In the second area, we pursue strategies for improving the speed of Markov chain Monte Carlo algorithms, and characterizing their performance in large-scale settings. Throughout, the focus is on rigorous theoretical evaluation combined with empirical demonstrations of performance and concordance with the theory.
One topic we consider is modeling the joint distribution of multivariate categorical data, often summarized in a contingency table. Contingency table analysis routinely relies on log-linear models, with latent structure analysis providing a common alternative. Latent structure models lead to a reduced rank tensor factorization of the probability mass function for multivariate categorical data, while log-linear models achieve dimensionality reduction through sparsity. Little is known about the relationship between these notions of dimensionality reduction in the two paradigms. In Chapter 2, we derive several results relating the support of a log-linear model to nonnegative ranks of the associated probability tensor. Motivated by these findings, we propose a new collapsed Tucker class of tensor decompositions, which bridge existing PARAFAC and Tucker decompositions, providing a more flexible framework for parsimoniously characterizing multivariate categorical data. Taking a Bayesian approach to inference, we illustrate empirical advantages of the new decompositions.
Latent class models for the joint distribution of multivariate categorical data, such as the PARAFAC decomposition, play an important role in the analysis of population structure. In this context, the number of latent classes is interpreted as the number of genetically distinct subpopulations of an organism, an important factor in the analysis of evolutionary processes and conservation status. Existing methods focus on point estimates of the number of subpopulations, and lack robust uncertainty quantification. Moreover, whether the number of latent classes in these models is even an identified parameter is an open question. In Chapter 3, we show that when the model is properly specified, the correct number of subpopulations can be recovered almost surely. We then propose an alternative method for estimating the number of latent subpopulations that provides good quantification of uncertainty, and provide a simple procedure for verifying that the proposed method is consistent for the number of subpopulations. The performance of the model in estimating the number of subpopulations and other common population structure inference problems is assessed in simulations and a real data application.
In contingency table analysis, sparse data is frequently encountered for even modest numbers of variables, resulting in non-existence of maximum likelihood estimates. A common solution is to obtain regularized estimates of the parameters of a log-linear model. Bayesian methods provide a coherent approach to regularization, but are often computationally intensive. Conjugate priors ease computational demands, but the conjugate Diaconis--Ylvisaker priors for the parameters of log-linear models do not give rise to closed form credible regions, complicating posterior inference. In Chapter 4 we derive the optimal Gaussian approximation to the posterior for log-linear models with Diaconis--Ylvisaker priors, and provide convergence rate and finite-sample bounds for the Kullback-Leibler divergence between the exact posterior and the optimal Gaussian approximation. We demonstrate empirically in simulations and a real data application that the approximation is highly accurate, even in relatively small samples. The proposed approximation provides a computationally scalable and principled approach to regularized estimation and approximate Bayesian inference for log-linear models.
Another challenging and somewhat non-standard joint modeling problem is inference on tail dependence in stochastic processes. In applications where extreme dependence is of interest, data are almost always time-indexed. Existing methods for inference and modeling in this setting often cluster extreme events or choose window sizes with the goal of preserving temporal information. In Chapter 5, we propose an alternative paradigm for inference on tail dependence in stochastic processes with arbitrary temporal dependence structure in the extremes, based on the idea that the information on strength of tail dependence and the temporal structure in this dependence are both encoded in waiting times between exceedances of high thresholds. We construct a class of time-indexed stochastic processes with tail dependence obtained by endowing the support points in de Haan's spectral representation of max-stable processes with velocities and lifetimes. We extend Smith's model to these max-stable velocity processes and obtain the distribution of waiting times between extreme events at multiple locations. Motivated by this result, a new definition of tail dependence is proposed that is a function of the distribution of waiting times between threshold exceedances, and an inferential framework is constructed for estimating the strength of extremal dependence and quantifying uncertainty in this paradigm. The method is applied to climatological, financial, and electrophysiology data.
The remainder of this thesis focuses on posterior computation by Markov chain Monte Carlo (MCMC). MCMC is the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov transition kernel. Comparatively little attention has been paid to convergence and estimation error in these approximating Markov chains. In Chapter 6, we propose a framework for assessing when to use approximations in MCMC algorithms, and how much error in the transition kernel should be tolerated to obtain optimal estimation performance with respect to a specified loss function and computational budget. The results require only ergodicity of the exact kernel and control of the kernel approximation accuracy. The theoretical framework is applied to approximations based on random subsets of data, low-rank approximations of Gaussian processes, and a novel approximating Markov chain for discrete mixture models.
Data augmentation Gibbs samplers are arguably the most popular class of algorithm for approximately sampling from the posterior distribution for the parameters of generalized linear models. The truncated Normal and Polya-Gamma data augmentation samplers are standard examples for probit and logit links, respectively. Motivated by an important problem in quantitative advertising, in Chapter 7 we consider the application of these algorithms to modeling rare events. We show that when the sample size is large but the observed number of successes is small, these data augmentation samplers mix very slowly, with a spectral gap that converges to zero at a rate at least proportional to the reciprocal of the square root of the sample size up to a log factor. In simulation studies, moderate sample sizes result in high autocorrelations and small effective sample sizes. Similar empirical results are observed for related data augmentation samplers for multinomial logit and probit models. When applied to a real quantitative advertising dataset, the data augmentation samplers mix very poorly. Conversely, Hamiltonian Monte Carlo and a type of independence chain Metropolis algorithm show good mixing on the same dataset.
Abstract:
The advances in three related areas of state-space modeling, sequential Bayesian learning, and decision analysis are addressed, with the statistical challenges of scalability and associated dynamic sparsity. The key theme that ties the three areas is Bayesian model emulation: solving challenging analysis/computational problems using creative model emulators. This idea defines theoretical and applied advances in non-linear, non-Gaussian state-space modeling, dynamic sparsity, decision analysis and statistical computation, across linked contexts of multivariate time series and dynamic networks studies. Examples and applications in financial time series and portfolio analysis, macroeconomics and internet studies from computational advertising demonstrate the utility of the core methodological innovations.
Chapter 1 summarizes the three areas/problems and the key idea of emulation in those areas. Chapter 2 discusses the sequential analysis of latent threshold models using emulating models that allow for analytical filtering to enhance the efficiency of posterior sampling. Chapter 3 examines the emulator model in decision analysis, or the synthetic model, that is equivalent to the loss function in the original minimization problem, and shows its performance in the context of sequential portfolio optimization. Chapter 4 describes a method for modeling streaming count data observed on a large network, which relies on emulating the whole, dependent network model by independent, conjugate sub-models customized to each set of flows. Chapter 5 reviews those advances and makes concluding remarks.