990 resultados para COMPOSITIONAL DATA
Resumo:
Many techniques in information retrieval produce counts from a sample, and it is common to analyse these counts as proportions of the whole - term frequencies are a familiar example. Proportions carry only relative information and are not free to vary independently of one another: for the proportion of one term to increase, one or more others must decrease. These constraints are hallmarks of compositional data. While there has long been discussion in other fields of how such data should be analysed, to our knowledge, Compositional Data Analysis (CoDA) has not been considered in IR. In this work we explore compositional data in IR through the lens of distance measures, and demonstrate that common measures, naïve to compositions, have some undesirable properties which can be avoided with composition-aware measures. As a practical example, these measures are shown to improve clustering. Copyright 2014 ACM.
Resumo:
Compositional data analysis usually deals with relative information between parts where the total (abundances, mass, amount, etc.) is unknown or uninformative. This article addresses the question of what to do when the total is known and is of interest. Tools used in this case are reviewed and analysed, in particular the relationship between the positive orthant of D-dimensional real space, the product space of the real line times the D-part simplex, and their Euclidean space structures. The first alternative corresponds to data analysis taking logarithms on each component, and the second one to treat a log-transformed total jointly with a composition describing the distribution of component amounts. Real data about total abundances of phytoplankton in an Australian river motivated the present study and are used for illustration.
Resumo:
The Irish and UK governments, along with other countries, have made a commitment to limit the concentrations of greenhouse gases in the atmosphere by reducing emissions from the burning of fossil fuels. This can be achieved (in part) through increasing the sequestration of CO2 from the atmosphere including monitoring the amount stored in vegetation and soils. A large proportion of soil carbon is held within peat due to the relatively high carbon density of peat and organic-rich soils. This is particularly important for a country such as Ireland, where some 16% of the land surface is covered by peat. For Northern Ireland, it has been estimated that the total amount of carbon stored in vegetation is 4.4Mt compared to 386Mt stored within peat and soils. As a result it has become increasingly important to measure and monitor changes in stores of carbon in soils. The conservation and restoration of peat covered areas, although ongoing for many years, has become increasingly important. This is summed up in current EU policy outlined by the European Commission (2012) which seeks to assess the relative contributions of the different inputs and outputs of organic carbon and organic matter to and from soil. Results are presented from the EU-funded Tellus Border Soil Carbon Project (2011 to 2013) which aimed to improve current estimates of carbon in soil and peat across Northern Ireland and the bordering counties of the Republic of Ireland.
Historical reports and previous surveys provide baseline data. To monitor change in peat depth and soil organic carbon, these historical data are integrated with more recently acquired airborne geophysical (radiometric) data and ground-based geochemical data generated by two surveys, the Tellus Project (2004-2007: covering Northern Ireland) and the EU-funded Tellus Border project (2011-2013) covering the six bordering counties of the Republic of Ireland, Donegal, Sligo, Leitrim, Cavan, Monaghan and Louth. The concept being applied is that saturated organic-rich soil and peat attenuate gamma-radiation from underlying soils and rocks. This research uses the degree of spatial correlation (coregionalization) between peat depth, soil organic carbon (SOC) and the attenuation of the radiometric signal to update a limited sampling regime of ground-based measurements with remotely acquired data. To comply with the compositional nature of the SOC data (perturbations of loss on ignition [LOI] data), a compositional data analysis approach is investigated. Contemporaneous ground-based measurements allow corroboration for the updated mapped outputs. This provides a methodology that can be used to improve estimates of soil carbon with minimal impact to sensitive habitats (like peat bogs), but with maximum output of data and knowledge.
Resumo:
Statistics are regularly used to make some form of comparison between trace evidence or deploy the exclusionary principle (Morgan and Bull, 2007) in forensic investigations. Trace evidence are routinely the results of particle size, chemical or modal analyses and as such constitute compositional data. The issue is that compositional data including percentages, parts per million etc. only carry relative information. This may be problematic where a comparison of percentages and other constraint/closed data is deemed a statistically valid and appropriate way to present trace evidence in a court of law. Notwithstanding an awareness of the existence of the constant sum problem since the seminal works of Pearson (1896) and Chayes (1960) and the introduction of the application of log-ratio techniques (Aitchison, 1986; Pawlowsky-Glahn and Egozcue, 2001; Pawlowsky-Glahn and Buccianti, 2011; Tolosana-Delgado and van den Boogaart, 2013) the problem that a constant sum destroys the potential independence of variances and covariances required for correlation regression analysis and empirical multivariate methods (principal component analysis, cluster analysis, discriminant analysis, canonical correlation) is all too often not acknowledged in the statistical treatment of trace evidence. Yet the need for a robust treatment of forensic trace evidence analyses is obvious. This research examines the issues and potential pitfalls for forensic investigators if the constant sum constraint is ignored in the analysis and presentation of forensic trace evidence. Forensic case studies involving particle size and mineral analyses as trace evidence are used to demonstrate the use of a compositional data approach using a centred log-ratio (clr) transformation and multivariate statistical analyses.
Resumo:
Conventional practice in Regional Geochemistry includes as a final step of any geochemical campaign the generation of a series of maps, to show the spatial distribution of each of the components considered. Such maps, though necessary, do not comply with the compositional, relative nature of the data, which unfortunately make any conclusion based on them sensitive
to spurious correlation problems. This is one of the reasons why these maps are never interpreted isolated. This contribution aims at gathering a series of statistical methods to produce individual maps of multiplicative combinations of components (logcontrasts), much in the flavor of equilibrium constants, which are designed on purpose to capture certain aspects of the data.
We distinguish between supervised and unsupervised methods, where the first require an external, non-compositional variable (besides the compositional geochemical information) available in an analogous training set. This external variable can be a quantity (soil density, collocated magnetics, collocated ratio of Th/U spectral gamma counts, proportion of clay particle fraction, etc) or a category (rock type, land use type, etc). In the supervised methods, a regression-like model between the external variable and the geochemical composition is derived in the training set, and then this model is mapped on the whole region. This case is illustrated with the Tellus dataset, covering Northern Ireland at a density of 1 soil sample per 2 square km, where we map the presence of blanket peat and the underlying geology. The unsupervised methods considered include principal components and principal balances
(Pawlowsky-Glahn et al., CoDaWork2013), i.e. logcontrasts of the data that are devised to capture very large variability or else be quasi-constant. Using the Tellus dataset again, it is found that geological features are highlighted by the quasi-constant ratios Hf/Nb and their ratio against SiO2; Rb/K2O and Zr/Na2O and the balance between these two groups of two variables; the balance of Al2O3 and TiO2 vs. MgO; or the balance of Cr, Ni and Co vs. V and Fe2O3. The largest variability appears to be related to the presence/absence of peat.
Resumo:
These notes have been prepared as support to a short course on compositional data analysis. Their aim is to transmit the basic concepts and skills for simple applications, thus setting the premises for more advanced projects
Resumo:
We take stock of the present position of compositional data analysis, of what has been achieved in the last 20 years, and then make suggestions as to what may be sensible avenues of future research. We take an uncompromisingly applied mathematical view, that the challenge of solving practical problems should motivate our theoretical research; and that any new theory should be thoroughly investigated to see if it may provide answers to previously abandoned practical considerations. Indeed a main theme of this lecture will be to demonstrate this applied mathematical approach by a number of challenging examples
Resumo:
One of the tantalising remaining problems in compositional data analysis lies in how to deal with data sets in which there are components which are essential zeros. By an essential zero we mean a component which is truly zero, not something recorded as zero simply because the experimental design or the measuring instrument has not been sufficiently sensitive to detect a trace of the part. Such essential zeros occur in many compositional situations, such as household budget patterns, time budgets, palaeontological zonation studies, ecological abundance studies. Devices such as nonzero replacement and amalgamation are almost invariably ad hoc and unsuccessful in such situations. From consideration of such examples it seems sensible to build up a model in two stages, the first determining where the zeros will occur and the second how the unit available is distributed among the non-zero parts. In this paper we suggest two such models, an independent binomial conditional logistic normal model and a hierarchical dependent binomial conditional logistic normal model. The compositional data in such modelling consist of an incidence matrix and a conditional compositional matrix. Interesting statistical problems arise, such as the question of estimability of parameters, the nature of the computational process for the estimation of both the incidence and compositional parameters caused by the complexity of the subcompositional structure, the formation of meaningful hypotheses, and the devising of suitable testing methodology within a lattice of such essential zero-compositional hypotheses. The methodology is illustrated by application to both simulated and real compositional data
Resumo:
One of the disadvantages of old age is that there is more past than future: this, however, may be turned into an advantage if the wealth of experience and, hopefully, wisdom gained in the past can be reflected upon and throw some light on possible future trends. To an extent, then, this talk is necessarily personal, certainly nostalgic, but also self critical and inquisitive about our understanding of the discipline of statistics. A number of almost philosophical themes will run through the talk: search for appropriate modelling in relation to the real problem envisaged, emphasis on sensible balances between simplicity and complexity, the relative roles of theory and practice, the nature of communication of inferential ideas to the statistical layman, the inter-related roles of teaching, consultation and research. A list of keywords might be: identification of sample space and its mathematical structure, choices between transform and stay, the role of parametric modelling, the role of a sample space metric, the underused hypothesis lattice, the nature of compositional change, particularly in relation to the modelling of processes. While the main theme will be relevance to compositional data analysis we shall point to substantial implications for general multivariate analysis arising from experience of the development of compositional data analysis…
Resumo:
Modern methods of compositional data analysis are not well known in biomedical research. Moreover, there appear to be few mathematical and statistical researchers working on compositional biomedical problems. Like the earth and environmental sciences, biomedicine has many problems in which the relevant scienti c information is encoded in the relative abundance of key species or categories. I introduce three problems in cancer research in which analysis of compositions plays an important role. The problems involve 1) the classi cation of serum proteomic pro les for early detection of lung cancer, 2) inference of the relative amounts of di erent tissue types in a diagnostic tumor biopsy, and 3) the subcellular localization of the BRCA1 protein, and it's role in breast cancer patient prognosis. For each of these problems I outline a partial solution. However, none of these problems is \solved". I attempt to identify areas in which additional statistical development is needed with the hope of encouraging more compositional data analysts to become involved in biomedical research
Resumo:
The application of compositional data analysis through log ratio trans- formations corresponds to a multinomial logit model for the shares themselves. This model is characterized by the property of Independence of Irrelevant Alter- natives (IIA). IIA states that the odds ratio in this case the ratio of shares is invariant to the addition or deletion of outcomes to the problem. It is exactly this invariance of the ratio that underlies the commonly used zero replacement procedure in compositional data analysis. In this paper we investigate using the nested logit model that does not embody IIA and an associated zero replacement procedure and compare its performance with that of the more usual approach of using the multinomial logit model. Our comparisons exploit a data set that com- bines voting data by electoral division with corresponding census data for each division for the 2001 Federal election in Australia