179 results for Factorization
Abstract:
Many modern applications fall into the category of "large-scale" statistical problems, in which both the number of observations n and the number of features or parameters p may be large. Many existing methods focus on point estimation, despite the continued relevance of uncertainty quantification in the sciences, where the number of parameters to estimate often exceeds the sample size, despite huge increases in the value of n typically seen in many fields. Thus, the tendency in some areas of industry to dispense with traditional statistical analysis on the basis that "n=all" is of little relevance outside of certain narrow applications. The main result of the Big Data revolution in most fields has instead been to make computation much harder without reducing the importance of uncertainty quantification. Bayesian methods excel at uncertainty quantification, but often scale poorly relative to alternatives. This conflict between the statistical advantages of Bayesian procedures and their substantial computational disadvantages is perhaps the greatest challenge facing modern Bayesian statistics, and is the primary motivation for the work presented here.
Two general strategies for scaling Bayesian inference are considered. The first is the development of methods that lend themselves to faster computation, and the second is design and characterization of computational algorithms that scale better in n or p. In the first instance, the focus is on joint inference outside of the standard problem of multivariate continuous data that has been a major focus of previous theoretical work in this area. In the second area, we pursue strategies for improving the speed of Markov chain Monte Carlo algorithms, and characterizing their performance in large-scale settings. Throughout, the focus is on rigorous theoretical evaluation combined with empirical demonstrations of performance and concordance with the theory.
One topic we consider is modeling the joint distribution of multivariate categorical data, often summarized in a contingency table. Contingency table analysis routinely relies on log-linear models, with latent structure analysis providing a common alternative. Latent structure models lead to a reduced rank tensor factorization of the probability mass function for multivariate categorical data, while log-linear models achieve dimensionality reduction through sparsity. Little is known about the relationship between these notions of dimensionality reduction in the two paradigms. In Chapter 2, we derive several results relating the support of a log-linear model to nonnegative ranks of the associated probability tensor. Motivated by these findings, we propose a new collapsed Tucker class of tensor decompositions, which bridge existing PARAFAC and Tucker decompositions, providing a more flexible framework for parsimoniously characterizing multivariate categorical data. Taking a Bayesian approach to inference, we illustrate empirical advantages of the new decompositions.
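The reduced-rank (PARAFAC) form of a latent class model mentioned above can be illustrated with a minimal sketch. The number of classes `k`, the category counts in `levels`, and the Dirichlet draws are illustrative assumptions, not values from the thesis; the sketch only shows how the probability tensor is assembled from class weights and within-class category distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 2               # number of latent classes (assumed for illustration)
levels = (3, 2, 4)  # categories per variable (assumed for illustration)

# Mixing weights over the latent classes.
nu = rng.dirichlet(np.ones(k))

# lam[j][h] is the category distribution of variable j within class h.
lam = [rng.dirichlet(np.ones(d), size=k) for d in levels]

# PARAFAC form: pi[a, b, c] = sum_h nu[h] * lam0[h, a] * lam1[h, b] * lam2[h, c]
pi = np.einsum("h,ha,hb,hc->abc", nu, lam[0], lam[1], lam[2])

# Any matrix unfolding of pi has rank at most k.
unfold_rank = np.linalg.matrix_rank(pi.reshape(levels[0], -1))
```

Here `pi` is a valid probability mass function over the 3 x 2 x 4 contingency table, and its nonnegative rank is at most `k` by construction.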
Latent class models for the joint distribution of multivariate categorical data, such as the PARAFAC decomposition, play an important role in the analysis of population structure. In this context, the number of latent classes is interpreted as the number of genetically distinct subpopulations of an organism, an important factor in the analysis of evolutionary processes and conservation status. Existing methods focus on point estimates of the number of subpopulations, and lack robust uncertainty quantification. Moreover, whether the number of latent classes in these models is even an identified parameter is an open question. In Chapter 3, we show that when the model is properly specified, the correct number of subpopulations can be recovered almost surely. We then propose an alternative method for estimating the number of latent subpopulations that provides good quantification of uncertainty, and provide a simple procedure for verifying that the proposed method is consistent for the number of subpopulations. The performance of the model in estimating the number of subpopulations and other common population structure inference problems is assessed in simulations and a real data application.
In contingency table analysis, sparse data is frequently encountered for even modest numbers of variables, resulting in non-existence of maximum likelihood estimates. A common solution is to obtain regularized estimates of the parameters of a log-linear model. Bayesian methods provide a coherent approach to regularization, but are often computationally intensive. Conjugate priors ease computational demands, but the conjugate Diaconis--Ylvisaker priors for the parameters of log-linear models do not give rise to closed form credible regions, complicating posterior inference. In Chapter 4 we derive the optimal Gaussian approximation to the posterior for log-linear models with Diaconis--Ylvisaker priors, and provide convergence rate and finite-sample bounds for the Kullback-Leibler divergence between the exact posterior and the optimal Gaussian approximation. We demonstrate empirically in simulations and a real data application that the approximation is highly accurate, even in relatively small samples. The proposed approximation provides a computationally scalable and principled approach to regularized estimation and approximate Bayesian inference for log-linear models.
Another challenging and somewhat non-standard joint modeling problem is inference on tail dependence in stochastic processes. In applications where extreme dependence is of interest, data are almost always time-indexed. Existing methods for inference and modeling in this setting often cluster extreme events or choose window sizes with the goal of preserving temporal information. In Chapter 5, we propose an alternative paradigm for inference on tail dependence in stochastic processes with arbitrary temporal dependence structure in the extremes, based on the idea that the information on strength of tail dependence and the temporal structure in this dependence are both encoded in waiting times between exceedances of high thresholds. We construct a class of time-indexed stochastic processes with tail dependence obtained by endowing the support points in de Haan's spectral representation of max-stable processes with velocities and lifetimes. We extend Smith's model to these max-stable velocity processes and obtain the distribution of waiting times between extreme events at multiple locations. Motivated by this result, a new definition of tail dependence is proposed that is a function of the distribution of waiting times between threshold exceedances, and an inferential framework is constructed for estimating the strength of extremal dependence and quantifying uncertainty in this paradigm. The method is applied to climatological, financial, and electrophysiology data.
The remainder of this thesis focuses on posterior computation by Markov chain Monte Carlo. The Markov Chain Monte Carlo method is the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov transition kernel. Comparatively little attention has been paid to convergence and estimation error in these approximating Markov Chains. In Chapter 6, we propose a framework for assessing when to use approximations in MCMC algorithms, and how much error in the transition kernel should be tolerated to obtain optimal estimation performance with respect to a specified loss function and computational budget. The results require only ergodicity of the exact kernel and control of the kernel approximation accuracy. The theoretical framework is applied to approximations based on random subsets of data, low-rank approximations of Gaussian processes, and a novel approximating Markov chain for discrete mixture models.
Data augmentation Gibbs samplers are arguably the most popular class of algorithm for approximately sampling from the posterior distribution for the parameters of generalized linear models. The truncated Normal and Polya-Gamma data augmentation samplers are standard examples for probit and logit links, respectively. Motivated by an important problem in quantitative advertising, in Chapter 7 we consider the application of these algorithms to modeling rare events. We show that when the sample size is large but the observed number of successes is small, these data augmentation samplers mix very slowly, with a spectral gap that converges to zero at a rate at least proportional to the reciprocal of the square root of the sample size up to a log factor. In simulation studies, moderate sample sizes result in high autocorrelations and small effective sample sizes. Similar empirical results are observed for related data augmentation samplers for multinomial logit and probit models. When applied to a real quantitative advertising dataset, the data augmentation samplers mix very poorly. Conversely, Hamiltonian Monte Carlo and a type of independence chain Metropolis algorithm show good mixing on the same dataset.
Abstract:
Continuous variables are one of the major data types collected by survey organizations. They can be incomplete, so that the data collectors need to fill in the missing values, or they can contain sensitive information that needs protection from re-identification. One approach to protecting continuous microdata is to sum them up according to different cells of features. In this thesis, I present novel methods of multiple imputation (MI) that can be applied to impute missing values and synthesize confidential values for continuous and magnitude data.
The first method is for limiting the disclosure risk of the continuous microdata whose marginal sums are fixed. The motivation for developing such a method comes from the magnitude tables of non-negative integer values in economic surveys. I present approaches based on a mixture of Poisson distributions to describe the multivariate distribution so that the marginals of the synthetic data are guaranteed to sum to the original totals. At the same time, I present methods for assessing disclosure risks in releasing such synthetic magnitude microdata. The illustration on a survey of manufacturing establishments shows that the disclosure risks are low while the information loss is acceptable.
The second method is for releasing synthetic continuous microdata by a nonstandard MI method. Traditionally, MI fits a model on the confidential values and then generates multiple synthetic datasets from this model. Its disclosure risk tends to be high, especially when the original data contain extreme values. I present a nonstandard MI approach conditioned on the protective intervals. Its basic idea is to estimate the model parameters from these intervals rather than the confidential values. The encouraging results of simple simulation studies suggest the potential of this new approach in limiting the posterior disclosure risk.
The third method is for imputing missing values in continuous and categorical variables. It extends a hierarchically coupled mixture model with local dependence. However, the new method separates the variables into non-focused (e.g., almost-fully-observed) and focused (e.g., missing-a-lot) ones. The sub-model structure of focused variables is more complex than that of non-focused ones. At the same time, their cluster indicators are linked together by tensor factorization and the focused continuous variables depend locally on non-focused values. The model properties suggest that moving the strongly associated non-focused variables to the side of focused ones can help to improve estimation accuracy, which is examined in several simulation studies. The method is also applied to data from the American Community Survey.
Abstract:
The first long-term aerosol sampling and chemical characterization results from measurements at the Cape Verde Atmospheric Observatory (CVAO) on the island of São Vicente are presented and are discussed with respect to air mass origin and seasonal trends. In total, 671 samples were collected using a high-volume PM10 sampler on quartz fiber filters from January 2007 to December 2011. The samples were analyzed for their aerosol chemical composition, including their ionic and organic constituents. Back trajectory analyses showed that the aerosol at CVAO was strongly influenced by emissions from Europe and Africa, with the latter often responsible for high mineral dust loading. Sea salt and mineral dust dominated the aerosol mass, together making up about 80% of it. The 5-year PM10 mean was 47.1 ± 55.5 µg/m³, while the mineral dust and sea salt means were 27.9 ± 48.7 and 11.1 ± 5.5 µg/m³, respectively. Non-sea-salt (nss) sulfate made up 62% of the total sulfate and originated from both long-range transport from Africa or Europe and marine sources. Strong seasonal variation was observed for the aerosol components. While nitrate showed no clear seasonal variation with an annual mean of 1.1 ± 0.6 µg/m³, the aerosol mass, OC (organic carbon) and EC (elemental carbon) showed strong winter maxima due to strong influence of African air mass inflow. Additionally, during summer, elevated concentrations of OM were observed originating from marine emissions. A summer maximum was observed for non-sea-salt sulfate and was connected to periods when air mass inflow was predominantly of marine origin, indicating that marine biogenic emissions were a significant source. Ammonium showed a distinct maximum in spring and coincided with ocean surface water chlorophyll a concentrations.
Good correlations were also observed between nss-sulfate and oxalate during the summer and winter seasons, indicating a likely photochemical in-cloud processing of the marine and anthropogenic precursors of these species. High temporal variability was observed in both chloride and bromide depletion, differing significantly within the seasons, air mass history and Saharan dust concentration. Chloride (bromide) depletion varied from 8.8 ± 8.5% (62 ± 42%) in Saharan-dust-dominated air masses to 30 ± 12% (87 ± 11%) in polluted European air masses. During summer, bromide depletion often reached 100% in marine as well as in polluted continental samples. In addition to the influence of the aerosol acidic components, photochemistry was one of the main drivers of halogenide depletion during the summer, while during dust events, displacement reaction with nitric acid was found to be the dominant mechanism. Positive matrix factorization (PMF) analysis identified three major aerosol sources: sea salt, aged sea salt and long-range transport. The ionic budget was dominated by the first two of these factors, while the long-range transport factor could only account for about 14% of the total observed ionic mass.
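To give a feel for how a nonnegative factorization separates a samples-by-species matrix into source contributions and source profiles, here is a minimal sketch. It uses plain unweighted NMF with multiplicative updates, which is a simplification: PMF as used in source apportionment additionally weights each entry by its measurement uncertainty. The function `nmf`, the synthetic data, and the choice of three factors are illustrative assumptions, not the study's procedure.

```python
import numpy as np

def nmf(X, rank, n_iter=1000, eps=1e-9, seed=0):
    """Unweighted NMF by multiplicative updates: X ~ W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank)) + eps   # source contributions per sample
    H = rng.random((rank, m)) + eps   # source profiles per species
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic "samples x species" matrix built from 3 hypothetical sources.
rng = np.random.default_rng(1)
true_G = rng.random((40, 3))
true_F = rng.random((3, 8))
X = true_G @ true_F

W, H = nmf(X, rank=3)
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

The recovered factors are only determined up to scaling and permutation, so in practice each row of `H` has to be interpreted against known chemical profiles.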
Abstract:
This work outlines the theoretical advantages of multivariate methods in biomechanical data, validates the proposed methods and outlines new clinical findings relating to knee osteoarthritis that were made possible by this approach. New techniques were based on existing multivariate approaches, Partial Least Squares (PLS) and Non-negative Matrix Factorization (NMF), and validated using existing data sets. The new techniques, PCA-PLS-LDA (Principal Component Analysis – Partial Least Squares – Linear Discriminant Analysis), PCA-PLS-MLR (Principal Component Analysis – Partial Least Squares – Multiple Linear Regression) and Waveform Similarity (based on NMF), were developed to address the challenging characteristics of biomechanical data: variability and correlation. As a result, these new structure-seeking techniques revealed new clinical findings. The first new clinical finding relates to the relationship between pain, radiographic severity and mechanics. Simultaneous analysis of pain and radiographic severity outcomes, a first in biomechanics, revealed that the knee adduction moment’s relationship to radiographic features is mediated by pain in subjects with moderate osteoarthritis. The second clinical finding was quantifying the importance of neuromuscular patterns in brace effectiveness for patients with knee osteoarthritis. I found that brace effectiveness was more related to the patient’s unbraced neuromuscular patterns than it was to mechanics, and that these neuromuscular patterns were more complicated than simply increased overall muscle activity, as previously thought.
Abstract:
A primary goal of context-aware systems is delivering the right information at the right place and right time to users in order to enable them to make effective decisions and improve their quality of life. There are three key requirements for achieving this goal: determining what information is relevant, personalizing it based on the users’ context (location, preferences, behavioral history etc.), and delivering it to them in a timely manner without an explicit request from them. These requirements create a paradigm that we term “Proactive Context-aware Computing”. Most of the existing context-aware systems fulfill only a subset of these requirements. Many of these systems focus only on personalization of the requested information based on users’ current context. Moreover, they are often designed for specific domains. In addition, most of the existing systems are reactive: the users request some information and the system delivers it to them. These systems are not proactive, i.e. they cannot anticipate users’ intent and behavior and act proactively without an explicit request from them. In order to overcome these limitations, we need to conduct a deeper analysis and enhance our understanding of context-aware systems that are generic, universal, proactive and applicable to a wide variety of domains. To support this dissertation, we explore several directions. Clearly the most significant sources of information about users today are smartphones. A large amount of users’ context can be acquired through them and they can be used as an effective means to deliver information to users. In addition, social media such as Facebook, Flickr and Foursquare provide a rich and powerful platform to mine users’ interests, preferences and behavioral history. We employ the ubiquity of smartphones and the wealth of information available from social media to address the challenge of building proactive context-aware systems.
We have implemented and evaluated a few approaches, including some as part of the Rover framework, to achieve the paradigm of Proactive Context-aware Computing. Rover is a context-aware research platform which has been evolving for the last 6 years. Since location is one of the most important contexts for users, we have developed ‘Locus’, an indoor localization, tracking and navigation system for multi-story buildings. Other important dimensions of users’ context include the activities that they are engaged in. To this end, we have developed ‘SenseMe’, a system that leverages the smartphone and its multiple sensors in order to perform multidimensional context and activity recognition for users. As part of the ‘SenseMe’ project, we also conducted an exploratory study of privacy, trust, risks and other concerns of users with smartphone-based personal sensing systems and applications. To determine what information would be relevant to users’ situations, we have developed ‘TellMe’, a system that employs a new, flexible and scalable approach based on Natural Language Processing techniques to perform bootstrapped discovery and ranking of relevant information in context-aware systems. In order to personalize the relevant information, we have also developed an algorithm and system for mining a broad range of users’ preferences from their social network profiles and activities. For recommending new information to the users based on their past behavior and context history (such as visited locations, activities and time), we have developed a recommender system and approach for performing multi-dimensional collaborative recommendations using tensor factorization. For timely delivery of personalized and relevant information, it is essential to anticipate and predict users’ behavior.
To this end, we have developed a unified infrastructure, within the Rover framework, and implemented several novel approaches and algorithms that employ various contextual features and state of the art machine learning techniques for building diverse behavioral models of users. Examples of generated models include classifying users’ semantic places and mobility states, predicting their availability for accepting calls on smartphones and inferring their device charging behavior. Finally, to enable proactivity in context-aware systems, we have also developed a planning framework based on HTN planning. Together, these works provide a major push in the direction of proactive context-aware computing.
Abstract:
The ultimate problem considered in this thesis is modeling a high-dimensional joint distribution over a set of discrete variables. For this purpose, we consider classes of context-specific graphical models and the main emphasis is on learning the structure of such models from data. Traditional graphical models compactly represent a joint distribution through a factorization justified by statements of conditional independence which are encoded by a graph structure. Context-specific independence is a natural generalization of conditional independence that only holds in a certain context, specified by the conditioning variables. We introduce context-specific generalizations of both Bayesian networks and Markov networks by including statements of context-specific independence which can be encoded as a part of the model structures. For the purpose of learning context-specific model structures from data, we derive score functions, based on results from Bayesian statistics, by which the plausibility of a structure is assessed. To identify high-scoring structures, we construct stochastic and deterministic search algorithms designed to exploit the structural decomposition of our score functions. Numerical experiments on synthetic and real-world data show that the increased flexibility of context-specific structures can more accurately emulate the dependence structure among the variables and thereby improve the predictive accuracy of the models.
Abstract:
We present efficient algorithms for solving Legendre equations over Q (equivalently, for finding rational points on rational conics) and parametrizing all solutions. Unlike existing algorithms, no integer factorization is required, provided that the prime factors of the discriminant are known.
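The idea of parametrizing all solutions once a single solution is known can be sketched with the classical line-through-a-point construction. This is not the paper's algorithm for finding the first solution; it only illustrates the parametrization step, and the function name `conic_point` and the example conic x² + y² − z² = 0 are assumptions chosen for illustration.

```python
from functools import reduce
from math import gcd

def conic_point(a, b, c, p, d):
    """Second intersection of a*x^2 + b*y^2 + c*z^2 = 0 with the line through p and d.

    p must be an integer solution; d is any integer vector (a projective direction).
    With Q the quadratic form and B its bilinear form, the point is Q(d)*p - 2*B(p, d)*d.
    """
    q_d = a * d[0] ** 2 + b * d[1] ** 2 + c * d[2] ** 2
    b_pd = a * p[0] * d[0] + b * p[1] * d[1] + c * p[2] * d[2]
    pt = tuple(q_d * pi - 2 * b_pd * di for pi, di in zip(p, d))
    g = reduce(gcd, pt)
    # Remove the common factor so the projective point is in lowest terms.
    return tuple(v // g for v in pt) if g else pt

# Example: x^2 + y^2 - z^2 = 0 with known solution (1, 0, 1).
points = [conic_point(1, 1, -1, (1, 0, 1), (1, t, 0)) for t in range(1, 5)]
```

Every rational point on the conic other than `p` lies on exactly one line through `p`, which is why varying `d` reaches all solutions.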
Abstract:
An extended formulation of a polyhedron P is a linear description of a polyhedron Q together with a linear map π such that π(Q)=P. These objects are of fundamental importance in polyhedral combinatorics and optimization theory, and the subject of a number of studies. Yannakakis’ factorization theorem (Yannakakis in J Comput Syst Sci 43(3):441–466, 1991) provides a surprising connection between extended formulations and communication complexity, showing that the smallest size of an extended formulation of P equals the nonnegative rank of its slack matrix S. Moreover, Yannakakis also shows that the nonnegative rank of S is at most 2^c, where c is the complexity of any deterministic protocol computing S. In this paper, we show that the latter result can be strengthened when we allow protocols to be randomized. In particular, we prove that the base-2 logarithm of the nonnegative rank of any nonnegative matrix equals the minimum complexity of a randomized communication protocol computing the matrix in expectation. Using Yannakakis’ factorization theorem, this implies that the base-2 logarithm of the smallest size of an extended formulation of a polytope P equals the minimum complexity of a randomized communication protocol computing the slack matrix of P in expectation. We show that allowing randomization in the protocol can be crucial for obtaining small extended formulations. Specifically, we prove that for the spanning tree and perfect matching polytopes, small variance in the protocol forces large size in the extended formulation.
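For readers unfamiliar with the objects named above, the standard definitions can be written out as follows. The symbols A, b, v_j, T, U and xc(P) are notation assumed here for illustration; the abstract itself does not fix them.

```latex
% Polytope given by inequalities and by vertices (assumed notation):
%   P = \{x : Ax \le b\} = \operatorname{conv}\{v_1, \dots, v_n\}, \quad A \in \mathbb{R}^{m \times d}.
% Slack matrix: entry (i, j) is the slack of vertex v_j in inequality i.
S_{ij} = b_i - A_i v_j \ \ge 0, \qquad 1 \le i \le m,\ 1 \le j \le n.

% Nonnegative rank: the smallest inner dimension of a nonnegative factorization.
\operatorname{rank}_+(S) = \min\{\, r : S = TU,\ T \in \mathbb{R}_{\ge 0}^{m \times r},\ U \in \mathbb{R}_{\ge 0}^{r \times n} \,\}.

% Yannakakis' factorization theorem, with xc(P) the smallest size of an extended formulation:
\operatorname{xc}(P) = \operatorname{rank}_+(S).
```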
Abstract:
Surface ozone is formed in the presence of NOx (NO + NO2) and volatile organic compounds (VOCs) and is hazardous to human health. A better understanding of these precursors is needed for developing effective policies to improve air quality. To evaluate the year-to-year changes in source contributions to total VOCs, Positive Matrix Factorization (PMF) was used to perform source apportionment using available hourly observations from June through August at a Photochemical Assessment Monitoring Station (PAMS) in Essex, MD for each year from 2007-2015. Results suggest that while gasoline and vehicle exhaust emissions have fallen, the contribution of natural gas sources to total VOCs has risen. To investigate this increasing natural gas influence, ethane measurements from PAMS sites in Essex, MD and Washington, D.C. were examined. Following a period of decline, daytime ethane concentrations have increased significantly after 2009. This trend appears to be linked with the rapid shale gas production in upwind, neighboring states, especially Pennsylvania and West Virginia. Back-trajectory analyses similarly show that ethane concentrations at these monitors were significantly greater if air parcels had passed through counties containing a high density of unconventional natural gas wells. In addition to VOC emissions, the compressors and engines involved with hydraulic fracturing operations also emit NOx and particulate matter (PM). The Community Multi-scale Air Quality (CMAQ) Model was used to simulate air quality for the Eastern U.S. in 2020, including emissions from shale gas operations in the Appalachian Basin. Predicted concentrations of ozone and PM show the largest decreases when these natural gas resources are hypothetically used to convert coal-fired power plants, despite the increased emissions from hydraulic fracturing operations expanded into all possible shale regions in the Appalachian Basin. 
While not as clean as burning natural gas, emissions of NOx from coal-fired power plants can be reduced by utilizing post-combustion controls. However, even though capital investment has already been made, these controls are not always operated at optimal rates. CMAQ simulations for the Eastern U.S. in 2018 show ozone concentrations decrease by ~5 ppb when controls on coal-fired power plants limit NOx emissions to historically best rates.
Abstract:
The transverse momentum dependent parton distribution/fragmentation functions (TMDs) are essential in the factorization of a number of processes like Drell-Yan scattering, vector boson production, semi-inclusive deep inelastic scattering, etc. We provide a comprehensive study of unpolarized TMDs at next-to-next-to-leading order, which includes an explicit calculation of these TMDs and an extraction of their matching coefficients onto their integrated analogues, for all flavor combinations. The obtained matching coefficients are important for any kind of phenomenology involving TMDs. In the present study each individual TMD is calculated without any reference to a specific process. We recover the known results for parton distribution functions and provide new results for the fragmentation functions. The results for the gluon transverse momentum dependent fragmentation functions are presented for the first time at one and two loops. We also discuss the structure of singularities of TMD operators and TMD matrix elements, crossing relations between TMD parton distribution functions and TMD fragmentation functions, and renormalization group equations. In addition, we consider the behavior of the matching coefficients at threshold and make a conjecture on their structure to all orders in perturbation theory.
Abstract:
Multivariate orthogonal polynomials in D real dimensions are considered from the perspective of the Cholesky factorization of a moment matrix. The approach allows for the construction of corresponding multivariate orthogonal polynomials, associated second kind functions, Jacobi type matrices and associated three term relations and also Christoffel-Darboux formulae. The multivariate orthogonal polynomials, their second kind functions and the corresponding Christoffel-Darboux kernels are shown to be quasi-determinants as well as Schur complements of bordered truncations of the moment matrix; quasi-tau functions are introduced. It is proven that the second kind functions are multivariate Cauchy transforms of the multivariate orthogonal polynomials. Discrete and continuous deformations of the measure lead to Toda type integrable hierarchy, being the corresponding flows described through Lax and Zakharov-Shabat equations; bilinear equations are found. Varying size matrix nonlinear partial difference and differential equations of the 2D Toda lattice type are shown to be solved by matrix coefficients of the multivariate orthogonal polynomials. The discrete flows, which are shown to be connected with a Gauss-Borel factorization of the Jacobi type matrices and its quasi-determinants, lead to expressions for the multivariate orthogonal polynomials and their second kind functions in terms of shifted quasi-tau matrices, which generalize to the multidimensional realm, those that relate the Baker and adjoint Baker functions to ratios of Miwa shifted tau-functions in the 1D scenario. In this context, the multivariate extension of the elementary Darboux transformation is given in terms of quasi-determinants of matrices built up by the evaluation, at a poised set of nodes lying in an appropriate hyperplane in R^D, of the multivariate orthogonal polynomials. 
The multivariate Christoffel formula for the iteration of m elementary Darboux transformations is given as a quasi-determinant. It is shown, using congruences in the space of semi-infinite matrices, that the discrete and continuous flows are intimately connected and determine nonlinear partial difference-differential equations that involve only one site in the integrable lattice behaving as a Kadomtsev-Petviashvili type system. Finally, a brief discussion of measures with a particular linear isometry invariance and some of its consequences for the corresponding multivariate polynomials is given. In particular, it is shown that the Toda times that preserve the invariance condition lie in a secant variety of the Veronese variety of the fixed point set of the linear isometry.
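The link between a Cholesky factorization of the moment matrix and orthogonal polynomials can be seen in a small numerical sketch. It is restricted to the univariate case (D = 1) with the uniform measure on [-1, 1], which is an assumption made for brevity; the paper treats the multivariate setting. The names `moment`, `G`, `L` and `C` are illustrative.

```python
import numpy as np

n = 4  # work with polynomials of degree 0..3

def moment(k):
    """k-th moment of the uniform (Lebesgue) measure on [-1, 1]."""
    return 2.0 / (k + 1) if k % 2 == 0 else 0.0

# Moment (Gram) matrix of the monomials 1, x, x^2, x^3: G[i, j] = <x^i, x^j>.
G = np.array([[moment(i + j) for j in range(n)] for i in range(n)])

# Cholesky factorization G = L @ L.T with L lower triangular.
L = np.linalg.cholesky(G)

# Row i of C = L^{-1} holds the monomial coefficients of the i-th orthonormal polynomial.
C = np.linalg.inv(L)

# Gram matrix of the new polynomials; orthonormality means this is the identity.
gram = C @ G @ C.T
```

For this measure the rows of `C` are the normalized Legendre polynomials, for example the second row represents sqrt(3/2)·x.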
Abstract:
Matrix factorization (MF) has evolved into one of the better practices for handling sparse data in the field of recommender systems. Funk singular value decomposition (SVD) is a variant of MF and a state-of-the-art method that enabled winning the Netflix Prize competition. The method is widely used, with modifications, in present-day recommender systems research. Because the number of data points can grow at very high velocity, it is prudent to devise newer methods that handle such data more accurately and more efficiently than Funk-SVD. In view of the growing number of data points, I propose a latent factor model that caters to both accuracy and efficiency by reducing the number of latent features of either users or items, making it less complex than Funk-SVD, where the numbers of latent features of users and items are equal and often larger. A comprehensive empirical evaluation on two publicly available datasets, Amazon and ml-100k, shows that the proposed methods achieve accuracy comparable to Funk-SVD with lower complexity.
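As background for the comparison, a minimal sketch of the Funk-SVD baseline is given here: user and item latent vectors of equal length `k` are fitted by stochastic gradient descent on the observed ratings only. The function names, hyperparameters and toy ratings are illustrative assumptions, and the sketch does not implement the reduced-feature model proposed in the abstract.

```python
import numpy as np

def funk_svd(ratings, n_users, n_items, k=2, lr=0.01, reg=0.02, epochs=300, seed=0):
    """Fit r_ui ~ P[u] . Q[i] by SGD over observed (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, k))  # user latent features
    Q = 0.1 * rng.standard_normal((n_items, k))  # item latent features
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]
            p_old = P[u].copy()  # use the pre-update user vector for the item step
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return P, Q

def rmse(ratings, P, Q):
    """Root mean squared error on the given triples."""
    return float(np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in ratings])))

# Toy sparse ratings: (user, item, rating); most user-item pairs are unobserved.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 2.0), (2, 2, 5.0), (3, 0, 1.0), (3, 2, 4.0)]

P0, Q0 = funk_svd(ratings, n_users=4, n_items=3, epochs=0)  # untrained baseline
P, Q = funk_svd(ratings, n_users=4, n_items=3)
```

A missing rating for user `u` and item `i` would then be predicted as `P[u] @ Q[i]`.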
Abstract:
Social interactions have been the focus of social science research for a century, but their study has recently been revolutionized by novel data sources and by methods from computer science, network science, and complex systems science. The study of social interactions is crucial for understanding complex societal behaviours. Social interactions are naturally represented as networks, which have emerged as a unifying mathematical language to understand structural and dynamical aspects of socio-technical systems. Networks are, however, highly dimensional objects, especially when considering the scales of real-world systems and the need to model the temporal dimension. Hence the study of empirical data from social systems is challenging both from a conceptual and a computational standpoint. A possible approach to tackling such a challenge is to use dimensionality reduction techniques that represent network entities in a low-dimensional feature space, preserving some desired properties of the original data. Low-dimensional vector space representations, also known as network embeddings, have been extensively studied, also as a way to feed network data to machine learning algorithms. Network embeddings were initially developed for static networks and then extended to incorporate temporal network data. We focus on dimensionality reduction techniques for time-resolved social interaction data modelled as temporal networks. We introduce a novel embedding technique that models the temporal and structural similarities of events rather than nodes. Using empirical data on social interactions, we show that this representation captures information relevant for the study of dynamical processes unfolding over the network, such as epidemic spreading. We then turn to another large-scale dataset on social interactions: a popular Web-based crowdfunding platform. 
We show that tensor-based representations of the data and dimensionality reduction techniques such as tensor factorization allow us to uncover the structural and temporal aspects of the system and to relate them to geographic and temporal activity patterns.
Abstract:
The main purpose of this thesis is to go beyond two usual assumptions that accompany theoretical analysis in spin-glasses and inference: the i.i.d. (independently and identically distributed) hypothesis on the noise elements and the finite rank regime. The first has been present since the birth of spin-glasses. The second instead concerns the inference viewpoint. Disordered systems and Bayesian inference have a well-established relation, evidenced by their continuous cross-fertilization. The thesis makes use of techniques coming both from the rigorous mathematical machinery of spin-glasses, such as the interpolation scheme, and from Statistical Physics, such as the replica method. The first chapter contains an introduction to the Sherrington-Kirkpatrick and spiked Wigner models. The first is a mean field spin-glass where the couplings are i.i.d. Gaussian random variables. The second instead amounts to establishing the information theoretical limits in the reconstruction of a fixed low rank matrix, the “spike”, blurred by additive Gaussian noise. In chapters 2 and 3 the i.i.d. hypothesis on the noise is broken by assuming a noise with inhomogeneous variance profile. In spin-glasses this leads to multi-species models. The inferential counterpart is called spatial coupling. All the previous models are usually studied in the Bayes-optimal setting, where everything is known about the generating process of the data. In chapter 4 instead we study the spiked Wigner model where the prior on the signal to reconstruct is ignored. In chapter 5 we analyze the statistical limits of a spiked Wigner model where the noise is no longer Gaussian, but drawn from a random matrix ensemble, which makes its elements dependent. The thesis ends with chapter 6, where the challenging problem of high-rank probabilistic matrix factorization is tackled. Here we introduce a new procedure called "decimation" and we show that it is theoretically possible to perform matrix factorization through it.
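The rank-one spiked Wigner setting introduced in the first chapter can be simulated in a few lines. This is a minimal sketch under assumed conventions: a ±1 signal, the scaling Y = sqrt(snr/N)·x xᵀ + Z with symmetric Gaussian noise Z, and a plain spectral estimator. The thesis may use a different normalization and studies Bayes-optimal and mismatched estimators rather than this one.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 400
snr = 4.0  # signal-to-noise ratio (assumed value for illustration)

# Hidden rank-one "spike": a vector of +/-1 entries.
x = rng.choice([-1.0, 1.0], size=N)

# Symmetric Gaussian noise with unit-variance off-diagonal entries.
A = rng.standard_normal((N, N))
Z = (A + A.T) / np.sqrt(2.0)

# Observation: low-rank signal blurred by additive Gaussian noise.
Y = np.sqrt(snr / N) * np.outer(x, x) + Z

# Spectral estimate: leading eigenvector of the rescaled observation.
# eigh returns eigenvalues in ascending order, so the last column is the top one.
eigvals, eigvecs = np.linalg.eigh(Y / np.sqrt(N))
v = eigvecs[:, -1]

# Squared overlap between the estimate and the true spike, in [0, 1].
overlap_sq = float((v @ x) ** 2 / N)
```

In this scaling the noise spectrum fills roughly [-2, 2]. When snr exceeds 1, the top eigenvalue separates from that bulk and the squared overlap is expected to approach 1 − 1/snr as N grows.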