996 results for compositional models
Abstract:
One of the tantalising remaining problems in compositional data analysis lies in how to deal with data sets in which there are components which are essential zeros. By an essential zero we mean a component which is truly zero, not something recorded as zero simply because the experimental design or the measuring instrument has not been sufficiently sensitive to detect a trace of the part. Such essential zeros occur in many compositional situations, such as household budget patterns, time budgets, palaeontological zonation studies, and ecological abundance studies. Devices such as nonzero replacement and amalgamation are almost invariably ad hoc and unsuccessful in such situations. From consideration of such examples it seems sensible to build up a model in two stages, the first determining where the zeros will occur and the second how the unit available is distributed among the non-zero parts. In this paper we suggest two such models, an independent binomial conditional logistic normal model and a hierarchical dependent binomial conditional logistic normal model. The compositional data in such modelling consist of an incidence matrix and a conditional compositional matrix. Interesting statistical problems arise, such as the question of estimability of parameters, the nature of the computational process for the estimation of both the incidence and compositional parameters caused by the complexity of the subcompositional structure, the formation of meaningful hypotheses, and the devising of suitable testing methodology within a lattice of such essential zero-compositional hypotheses. The methodology is illustrated by application to both simulated and real compositional data.
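The two objects these models operate on can be built from a raw data matrix in a few lines. A minimal numpy sketch (the budget-style rows are hypothetical, not from the paper):

```python
import numpy as np

def incidence_and_conditional(X):
    """Split compositions with essential zeros into an incidence
    matrix (which parts are present) and a conditional compositional
    matrix (the unit sum redistributed over the non-zero parts)."""
    X = np.asarray(X, dtype=float)
    Z = (X > 0).astype(int)                  # incidence matrix
    C = np.where(X > 0, X, 0.0)
    C = C / C.sum(axis=1, keepdims=True)     # renormalise non-zero parts
    return Z, C

# toy household-budget-style rows; the zeros are essential (structural)
X = np.array([[0.5, 0.3, 0.2, 0.0],
              [0.4, 0.0, 0.4, 0.2]])
Z, C = incidence_and_conditional(X)
```

The first-stage model would then act on Z and the second-stage (conditional logistic normal) model on the non-zero parts of C.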
Abstract:
The first discussion of compositional data analysis is attributable to Karl Pearson, in 1897. However, notwithstanding the recent developments on the algebraic structure of the simplex, more than twenty years after Aitchison’s idea of log-transformations of closed data, the scientific literature is again full of statistical treatments of this type of data using traditional methodologies. This is particularly true in environmental geochemistry where, besides the problem of closure, the spatial structure (dependence) of the data has to be considered. In this work we propose the use of log-contrast values, obtained by a simplicial principal component analysis, as indicators of given environmental conditions. The investigation of the log-contrast frequency distributions allows us to point out the statistical laws able to generate the values and to govern their variability. The changes, if compared, for example, with the mean values of the random variables assumed as models, or other reference parameters, allow us to define monitors for assessing the extent of possible environmental contamination. A case study on running and ground waters from Chiavenna Valley (Northern Italy), using Na+, K+, Ca2+, Mg2+, HCO3-, SO42- and Cl- concentrations, is illustrated.
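A simplicial principal component of the kind used here can be sketched as a log-contrast extracted from clr-transformed data; the loading vector sums to zero, which is what makes the score a log-contrast. The water compositions below are hypothetical placeholders, not the Chiavenna Valley data:

```python
import numpy as np

def clr(X):
    """Centred log-ratio transform of closed data (rows sum to 1)."""
    L = np.log(X)
    return L - L.mean(axis=1, keepdims=True)

def first_log_contrast(X):
    """First simplicial principal component: a log-contrast a.log(x)
    with sum(a) = 0, from the SVD of the centred clr data."""
    Yc = clr(X)
    Yc = Yc - Yc.mean(axis=0)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    a = Vt[0]                    # loading vector; its entries sum to ~0
    return a, Yc @ a             # scores: the indicator values

# hypothetical three-part water-chemistry compositions (closed rows)
X = np.array([[0.6, 0.3, 0.1],
              [0.5, 0.3, 0.2],
              [0.4, 0.4, 0.2],
              [0.3, 0.5, 0.2]])
a, scores = first_log_contrast(X)
```

The empirical distribution of such scores is what the abstract proposes to compare against reference statistical laws.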
Abstract:
We consider two fundamental properties in the analysis of two-way tables of positive data: the principle of distributional equivalence, one of the cornerstones of correspondence analysis of contingency tables, and the principle of subcompositional coherence, which forms the basis of compositional data analysis. For an analysis to be subcompositionally coherent, it suffices to analyse the ratios of the data values. The usual approach to dimension reduction in compositional data analysis is to perform principal component analysis on the logarithms of ratios, but this method does not obey the principle of distributional equivalence. We show that by introducing weights for the rows and columns, the method achieves this desirable property. This weighted log-ratio analysis is theoretically equivalent to spectral mapping, a multivariate method developed almost 30 years ago for displaying ratio-scale data from biological activity spectra. The close relationship between spectral mapping and correspondence analysis is also explained, as well as their connection with association modelling. The weighted log-ratio methodology is applied here to frequency data in linguistics and to chemical compositional data in archaeology.
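A rough sketch of the weighted log-ratio analysis described above: log-transform, double-centre with row and column masses, then a weighted SVD for the biplot coordinates. The table is invented for illustration:

```python
import numpy as np

def weighted_lra(N):
    """Weighted log-ratio analysis (spectral mapping) of a positive
    table: row/column masses act as weights in the double-centring
    and in the SVD, giving distributional equivalence."""
    P = N / N.sum()
    r = P.sum(axis=1)                         # row masses
    c = P.sum(axis=0)                         # column masses
    L = np.log(P)
    S = L - (L @ c)[:, None] - (r @ L)[None, :] + r @ L @ c
    U, s, Vt = np.linalg.svd(np.sqrt(r)[:, None] * S * np.sqrt(c)[None, :])
    F = U[:, :2] * s[:2] / np.sqrt(r)[:, None]   # row principal coordinates
    G = Vt[:2].T / np.sqrt(c)[:, None]           # column standard coordinates
    return F, G

# hypothetical 4x3 frequency table
N = np.array([[10., 20., 5.],
              [ 8., 15., 7.],
              [20., 10., 3.],
              [ 5., 25., 9.]])
F, G = weighted_lra(N)
```

A useful check of the weighting: the mass-weighted average of the row coordinates sits at the origin.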
Abstract:
The singular value decomposition and its interpretation as a linear biplot has proved to be a powerful tool for analysing many forms of multivariate data. Here we adapt biplot methodology to the specific case of compositional data consisting of positive vectors each of which is constrained to have unit sum. These relative variation biplots have properties relating to special features of compositional data: the study of ratios, subcompositions and models of compositional relationships. The methodology is demonstrated on a data set consisting of six-part colour compositions in 22 abstract paintings, showing how the singular value decomposition can achieve an accurate biplot of the colour ratios and how possible models interrelating the colours can be diagnosed.
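The construction of a relative variation biplot can be sketched directly: double-centre the log data and take its SVD; at full rank the link between two column markers reproduces the centred log-ratio of those two parts exactly. The colour compositions below are hypothetical, not the 22-painting data set:

```python
import numpy as np

def relative_variation_biplot(X):
    """Row and column markers from the SVD of the double-centred log
    matrix of a composition. At full rank, F @ G.T reproduces the
    double-centred log data, hence every centred pairwise log-ratio."""
    L = np.log(X)
    Z = L - L.mean(axis=1, keepdims=True)    # centre rows (clr)
    Z = Z - Z.mean(axis=0)                   # centre columns
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    F = U * s            # row markers (principal coordinates)
    G = Vt.T             # column markers (standard coordinates)
    return F, G, Z

# hypothetical 5x4 colour compositions (rows sum to 1)
X = np.array([[0.4, 0.3, 0.2, 0.1],
              [0.3, 0.3, 0.2, 0.2],
              [0.5, 0.2, 0.2, 0.1],
              [0.2, 0.4, 0.3, 0.1],
              [0.3, 0.2, 0.4, 0.1]])
F, G, Z = relative_variation_biplot(X)
# the link between column markers 0 and 1 recovers the centred
# log-ratio of parts 1 and 2 across all samples:
link = F @ (G[0] - G[1])
```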
Abstract:
A comment on the article “Local sensitivity analysis for compositional data with application to soil texture in hydrologic modelling” written by L. Loosvelt and co-authors. The present comment centres on three specific points. The first one relates to the fact that the authors avoid the use of ilr-coordinates. The second one refers to some generalization of sensitivity analysis when input parameters are compositional. The third tries to show that the role of the Dirichlet distribution in the sensitivity analysis is irrelevant.
Abstract:
Aim: Modelling species at the assemblage level is required to make effective forecasts of global change impacts on diversity and ecosystem functioning. Community predictions may be achieved using macroecological properties of communities (MEM), or by stacking of individual species distribution models (S-SDMs). To obtain more realistic predictions of species assemblages, the SESAM framework suggests applying successive filters to the initial species source pool, by combining different modelling approaches and rules. Here we provide a first test of this framework in mountain grassland communities. Location: The western Swiss Alps. Methods: Two implementations of the SESAM framework were tested: a "Probability ranking" rule based on species richness predictions and raw probabilities from SDMs, and a "Trait range" rule that uses the predicted upper and lower bounds of the community-level distribution of three different functional traits (vegetative height, specific leaf area and seed mass) to constrain a pool of environmentally filtered species from binary SDM predictions. Results: We showed that all independent constraints contributed, as expected, to reducing species richness overprediction. Only the "Probability ranking" rule slightly but significantly improved predictions of community composition. Main conclusion: We tested various ways to implement the SESAM framework by integrating macroecological constraints into S-SDM predictions, and report one that is able to improve compositional predictions. We discuss possible improvements, such as refining the causality and precision of environmental predictors, using other assembly rules and testing other types of ecological or functional constraints.
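The "Probability ranking" rule admits a compact sketch: per site, retain the k species with the highest SDM probabilities, where k is the predicted species richness for that site. All values below are invented for illustration:

```python
import numpy as np

def probability_ranking(probs, richness):
    """SESAM 'probability ranking' rule: for each site, keep the
    top-k species by SDM probability, with k the predicted richness."""
    probs = np.asarray(probs, dtype=float)
    out = np.zeros_like(probs, dtype=int)
    for i, k in enumerate(richness):
        top = np.argsort(probs[i])[::-1][:k]   # k most probable species
        out[i, top] = 1
    return out

# hypothetical SDM probabilities for 2 sites x 5 species
probs = np.array([[0.9, 0.2, 0.7, 0.4, 0.1],
                  [0.3, 0.8, 0.6, 0.2, 0.5]])
richness = [2, 3]   # predicted richness per site (from a MEM, say)
assemblages = probability_ranking(probs, richness)
```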
Abstract:
This analysis was stimulated by the real data analysis problem of household expenditure data. The full dataset contains expenditure data for a sample of 1224 households. The expenditure is broken down at 2 hierarchical levels: 9 major levels (e.g. housing, food, utilities etc.) and 92 minor levels. There are also 5 factors and 5 covariates at the household level. Not surprisingly, there are a small number of zeros at the major level, but many zeros at the minor level. The question is how best to model the zeros. Clearly, models that try to add a small amount to the zero terms are not appropriate in general as at least some of the zeros are clearly structural, e.g. alcohol/tobacco for households that are teetotal. The key question then is how to build suitable conditional models. For example, is the sub-composition of spending excluding alcohol/tobacco similar for teetotal and non-teetotal households? In other words, we are looking for sub-compositional independence. Also, what determines whether a household is teetotal? Can we assume that it is independent of the composition? In general, whether teetotal will clearly depend on the household level variables, so we need to be able to model this dependence. The other tricky question is that with zeros on more than one component, we need to be able to model dependence and independence of zeros on the different components. Lastly, while some zeros are structural, others may not be, for example, for expenditure on durables, it may be chance as to whether a particular household spends money on durables within the sample period. This would clearly be distinguishable if we had longitudinal data, but may still be distinguishable by looking at the distribution, on the assumption that random zeros will usually be for situations where any non-zero expenditure is not small. 
While this analysis is based on economic data, the ideas carry over to many other situations, including geological data, where minerals may be missing for structural reasons (similar to alcohol), or missing because they occur only in random regions which may be missed in a sample (similar to the durables).
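One way to probe the sub-compositional independence question raised above: split households by the alcohol zero pattern and compare the re-closed sub-compositions (excluding alcohol) in log-ratio coordinates. A minimal sketch with invented budget shares, not the survey data:

```python
import numpy as np

def clr(X):
    """Centred log-ratio transform of closed rows."""
    L = np.log(X)
    return L - L.mean(axis=1, keepdims=True)

# hypothetical budget shares: [housing, food, utilities, alcohol]
X = np.array([[0.50, 0.30, 0.20, 0.00],
              [0.45, 0.35, 0.20, 0.00],
              [0.40, 0.30, 0.20, 0.10],
              [0.42, 0.28, 0.18, 0.12]])
teetotal = X[:, 3] == 0.0

# sub-composition excluding alcohol, re-closed to unit sum
S = X[:, :3] / X[:, :3].sum(axis=1, keepdims=True)

# difference of group means in clr coordinates: values near zero are
# consistent with sub-compositional independence of the zero pattern
diff = clr(S[teetotal]).mean(axis=0) - clr(S[~teetotal]).mean(axis=0)
```

A formal test would compare this mean difference against its sampling variability (e.g. by permutation), which is omitted here.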
Abstract:
A compositional time series is obtained when a compositional data vector is observed at different points in time. Inherently, then, a compositional time series is a multivariate time series with important constraints on the variables observed at any instance in time. Although this type of data frequently occurs in situations of real practical interest, a trawl through the statistical literature reveals that research in the field is very much in its infancy and that many theoretical and empirical issues still remain to be addressed. Any appropriate statistical methodology for the analysis of compositional time series must take into account the constraints, which are not allowed for by the usual statistical techniques available for analysing multivariate time series. One general approach to analysing compositional time series consists in the application of an initial transform to break the positive and unit sum constraints, followed by the analysis of the transformed time series using multivariate ARIMA models. In this paper we discuss the use of the additive log-ratio, centred log-ratio and isometric log-ratio transforms. We also present results from an empirical study designed to explore how the selection of the initial transform affects subsequent multivariate ARIMA modelling as well as the quality of the forecasts.
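The three transforms discussed can be sketched as follows; the pivot (balance) basis used for ilr is one common choice among many, and a single composition stands in for one time point:

```python
import numpy as np

def alr(x):
    """Additive log-ratio: log of the first D-1 parts over the last."""
    return np.log(x[:-1] / x[-1])

def clr(x):
    """Centred log-ratio: log parts minus their mean (sums to zero)."""
    L = np.log(x)
    return L - L.mean()

def ilr(x):
    """Isometric log-ratio via the standard pivot (balance) basis."""
    D = len(x)
    y = np.empty(D - 1)
    for i in range(1, D):
        g = np.exp(np.mean(np.log(x[:i])))   # geometric mean of first i parts
        y[i - 1] = np.sqrt(i / (i + 1.0)) * np.log(g / x[i])
    return y

x = np.array([0.5, 0.3, 0.2])   # one composition at one time point
```

Because ilr is an isometry, the Euclidean norm of the ilr coordinates equals the norm of the clr vector, which is one way to check an implementation.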
Abstract:
A joint distribution of two discrete random variables with finite support can be displayed as a two-way table of probabilities adding to one. Assume that this table has n rows and m columns and all probabilities are non-null. This kind of table can be seen as an element in the simplex of n · m parts. In this context, the marginals are identified as compositional amalgams, and conditionals (rows or columns) as subcompositions. Also, simplicial perturbation appears as Bayes' theorem. However, the Euclidean elements of the Aitchison geometry of the simplex can also be translated into the table of probabilities: subspaces, orthogonal projections, distances. Two important questions are addressed: a) given a table of probabilities, what is the nearest independent table to the initial one? b) what is the largest orthogonal projection of a row onto a column? or, equivalently, what is the information in a row explained by a column, thus explaining the interaction? To answer these questions three orthogonal decompositions are presented: (1) by columns and a row-wise geometric marginal, (2) by rows and a column-wise geometric marginal, (3) by independent two-way tables and fully dependent tables representing row-column interaction. An important result is that the nearest independent table is the product of the two (row- and column-wise) geometric marginal tables. A corollary is that, in an independent table, the geometric marginals conform with the traditional (arithmetic) marginals. These decompositions can be compared with standard log-linear models. Key words: balance, compositional data, simplex, Aitchison geometry, composition, orthonormal basis, arithmetic and geometric marginals, amalgam, dependence measure, contingency table
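The main result — the nearest independent table, in the Aitchison sense, is the closed product of the two geometric marginals — is easy to sketch, along with the corollary that in an independent table the geometric and arithmetic marginals coincide. The probability table is invented:

```python
import numpy as np

def closure(T):
    """Rescale a positive array to total 1."""
    return T / T.sum()

def geometric_marginals(P):
    """Closed row-wise and column-wise geometric marginals."""
    gr = closure(np.exp(np.log(P).mean(axis=1)))   # geometric means over columns
    gc = closure(np.exp(np.log(P).mean(axis=0)))   # geometric means over rows
    return gr, gc

def nearest_independent(P):
    """Nearest independent table (Aitchison geometry): the closed
    outer product of the two geometric marginals."""
    gr, gc = geometric_marginals(P)
    return closure(np.outer(gr, gc))

# a hypothetical 2x2 table of probabilities with all cells non-null
P = closure(np.array([[0.2, 0.1],
                      [0.1, 0.6]]))
Q = nearest_independent(P)
```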
Abstract:
The composition of the labour force is an important economic factor for a country. Often the changes in proportions of the different groups are of interest. In this paper we study a monthly compositional time series from the Swedish Labour Force Survey from 1994 to 2005. Three models are studied: the ILR-transformed series, the ILR-transformation of the compositionally differenced series of order 1, and the ILR-transformation of the compositionally differenced series of order 12. For each of the three models a VAR model is fitted based on the data 1994-2003. We predict the time series 15 steps ahead and calculate 95% prediction regions. The predictions of the three models are compared with actual values using MAD and MSE, and the prediction regions are compared graphically in a ternary time series plot. We conclude that the first, and simplest, model possesses the best predictive power of the three models.
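Compositional differencing is an ordinary difference in ilr coordinates, which is how the order-1 and order-12 differenced models can be formed. A sketch with invented labour-force shares (pivot ilr basis assumed):

```python
import numpy as np

def ilr(x):
    """Pivot-coordinate ilr of one composition."""
    D = len(x)
    return np.array([np.sqrt(i / (i + 1.0)) *
                     np.log(np.exp(np.mean(np.log(x[:i]))) / x[i])
                     for i in range(1, D)])

def compositional_difference(series, lag=1):
    """Compositional differencing of order `lag`: the perturbation
    difference x_t (-) x_{t-lag}, which in ilr coordinates is an
    ordinary difference of the transformed series."""
    Y = np.array([ilr(x) for x in series])
    return Y[lag:] - Y[:-lag]

# hypothetical monthly shares: (employed, unemployed, inactive)
series = np.array([[0.60, 0.10, 0.30],
                   [0.62, 0.09, 0.29],
                   [0.63, 0.08, 0.29]])
D1 = compositional_difference(series, lag=1)
```

A VAR model (e.g. statsmodels' VAR) would then be fitted to the resulting coordinate series; that step is omitted here.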
Abstract:
Evolution of compositions in time, space, temperature or other covariates is frequent in practice. For instance, the radioactive decomposition of a sample changes its composition with time. Some of the involved isotopes decompose into other isotopes of the sample, thus producing a transfer of mass from some components to other ones, but preserving the total mass present in the system. This evolution is traditionally modelled as a system of ordinary differential equations for the mass of each component. However, this kind of evolution can be decomposed into a compositional change, expressed in terms of simplicial derivatives, and a mass evolution (constant in this example). A first result is that the simplicial system of differential equations is non-linear, even though some subcompositions behave linearly. The goal is to study the characteristics of such simplicial systems of differential equations, such as linearity and stability. This is performed by extracting the compositional differential equations from the mass equations. Then, simplicial derivatives are expressed in coordinates of the simplex, thus reducing the problem to the standard theory of systems of differential equations, including stability. The characterisation of stability of these non-linear systems relies on the linearisation of the system of differential equations at the stationary point, if any. The eigenvalues of the linearised matrix and the associated behaviour of the orbits are the main tools. For a three-component system, these orbits can be plotted both in coordinates of the simplex and in a ternary diagram. A characterisation of processes with transfer of mass in closed systems in terms of stability is thus concluded. Two examples are presented for illustration; one of them is a radioactive decay.
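A minimal numeric sketch of the closed-system mass evolution described above: a two-component transfer (component 1 decays into component 2) whose rate matrix has columns summing to zero, so total mass is conserved while the composition evolves. The decay rate is hypothetical; the eigenvalues of the rate matrix are the quantities the stability analysis works with:

```python
import numpy as np

# transfer of mass from component 1 to component 2 at rate k;
# columns of A sum to zero, so d(total mass)/dt = 0
k = 0.5
A = np.array([[-k, 0.0],
              [ k, 0.0]])

def evolve(m0, A, t):
    """Mass evolution m(t) = exp(A t) m0 via eigendecomposition."""
    w, V = np.linalg.eig(A)
    return (V @ np.diag(np.exp(w * t)) @ np.linalg.inv(V) @ m0).real

m0 = np.array([1.0, 1.0])       # initial masses
m1 = evolve(m0, A, t=2.0)
x1 = m1 / m1.sum()              # compositional part of the evolution
```

The decomposition the abstract describes is visible here: the total mass m1.sum() stays constant while the closed vector x1 traces the compositional orbit.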
Abstract:
The self-organizing map (Kohonen 1997) is a type of artificial neural network developed to explore patterns in high-dimensional multivariate data. The conventional version of the algorithm involves the use of the Euclidean metric in the process of adaptation of the model vectors, thus rendering, in theory, the whole methodology incompatible with non-Euclidean geometries. In this contribution we explore the two main aspects of the problem: 1. Whether the conventional approach using the Euclidean metric can yield valid results with compositional data. 2. Whether a modification of the conventional approach replacing vectorial sum and scalar multiplication by the canonical operators in the simplex (i.e. perturbation and powering) can converge to an adequate solution. Preliminary tests showed that both methodologies can be used on compositional data. However, the modified version of the algorithm performs worse than the conventional version, in particular when the data are pathological. Moreover, the conventional approach converges faster to a solution when the data are "well-behaved". Key words: self-organizing map; artificial neural networks; compositional data
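The canonical simplex operators named above, and the SOM-style model-vector update they induce, can be sketched as follows (learning rate and vectors are hypothetical, and this is one step, not the full algorithm):

```python
import numpy as np

def closure(x):
    """Rescale a positive vector to unit sum."""
    return x / x.sum()

def perturb(x, y):
    """Perturbation x (+) y: the simplex analogue of vector addition."""
    return closure(x * y)

def power(x, a):
    """Powering a (.) x: the simplex analogue of scalar multiplication."""
    return closure(x ** a)

# one SOM-style update in the simplex: move model vector m towards
# sample x by learning rate lr, i.e. m (+) lr (.) (x (-) m)
m = np.array([0.5, 0.3, 0.2])
x = np.array([0.2, 0.5, 0.3])
lr = 0.1
m_new = perturb(m, power(closure(x / m), lr))
```

With lr = 1 the update lands exactly on the sample, mirroring the Euclidean case m + 1·(x − m) = x.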
Abstract:
The aim of this study was to investigate the effects of numerous milk compositional factors on milk coagulation properties using Partial Least Squares (PLS). Milk from herds of Jersey and Holstein-Friesian cattle was collected across the year and blended (n=55), to maximize variation in composition and coagulation. The milk was analysed for casein, protein, fat, titratable acidity, lactose, Ca2+, urea content, micelle size, fat globule size, somatic cell count and pH. Milk coagulation properties were defined as coagulation time, curd firmness and curd firmness rate measured by a controlled strain rheometer. The models derived from PLS had higher predictive power than previous models, demonstrating the value of measuring more milk components. In addition to the well-established relationships with casein and protein levels, casein micelle size and fat globule size were found to have a strong impact on all three models. The study also found a positive impact of fat on milk coagulation properties and a strong relationship between lactose and curd firmness, and between urea and curd firmness rate, all of which warrant further investigation due to the current lack of knowledge of the underlying mechanisms. These findings demonstrate the importance of using a wider range of milk compositional variables for the prediction of milk coagulation properties, and hence as indicators of milk suitability for cheese making.
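PLS regression itself can be sketched with a minimal single-response NIPALS implementation; this is a generic PLS1, not the authors' fitted models, and the milk-style predictor values and response are invented:

```python
import numpy as np

def pls1(X, y, ncomp=2):
    """Minimal PLS1 (NIPALS): returns coefficients B so that
    y_hat = (X - xm) @ B + ym, extracting `ncomp` latent components."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    xm, ym = X.mean(axis=0), y.mean()
    Xc, yc = X - xm, y - ym
    W, P, Q = [], [], []
    for _ in range(ncomp):
        w = Xc.T @ yc                       # covariance direction
        w /= np.linalg.norm(w)
        t = Xc @ w                          # scores
        p = Xc.T @ t / (t @ t)              # X loadings
        q = yc @ t / (t @ t)                # y loading
        Xc = Xc - np.outer(t, p)            # deflate X
        yc = yc - q * t                     # deflate y
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    B = W @ np.linalg.inv(P.T @ W) @ Q      # regression coefficients
    return B, xm, ym

# hypothetical predictors (e.g. protein, casein, lactose) and a
# hypothetical curd-firmness response
X = np.array([[3.2, 2.6, 4.1],
              [3.5, 2.8, 4.0],
              [3.1, 2.5, 4.3],
              [3.8, 3.0, 3.9],
              [3.4, 2.7, 4.2]])
y = np.array([30.0, 34.0, 29.0, 38.0, 32.0])
B, xm, ym = pls1(X, y, ncomp=2)
y_hat = (X - xm) @ B + ym
```

PLS is preferred over ordinary regression here because milk components are strongly collinear; the latent components absorb that collinearity.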