30 resultados para Methods: Data Analysis
Resumo:
Presentation in CODAWORK'03, session 4: Applications to archeometry
Resumo:
First discussion on compositional data analysis is attributable to Karl Pearson, in 1897. However, notwithstanding the recent developments on algebraic structure of the simplex, more than twenty years after Aitchison’s idea of log-transformations of closed data, scientific literature is again full of statistical treatments of this type of data by using traditional methodologies. This is particularly true in environmental geochemistry where besides the problem of the closure, the spatial structure (dependence) of the data have to be considered. In this work we propose the use of log-contrast values, obtained by a simplicial principal component analysis, as LQGLFDWRUV of given environmental conditions. The investigation of the log-constrast frequency distributions allows pointing out the statistical laws able to generate the values and to govern their variability. The changes, if compared, for example, with the mean values of the random variables assumed as models, or other reference parameters, allow defining monitors to be used to assess the extent of possible environmental contamination. Case study on running and ground waters from Chiavenna Valley (Northern Italy) by using Na+, K+, Ca2+, Mg2+, HCO3-, SO4 2- and Cl- concentrations will be illustrated
Resumo:
R from http://www.r-project.org/ is ‘GNU S’ – a language and environment for statistical computing and graphics. The environment in which many classical and modern statistical techniques have been implemented, but many are supplied as packages. There are 8 standard packages and many more are available through the cran family of Internet sites http://cran.r-project.org . We started to develop a library of functions in R to support the analysis of mixtures and our goal is a MixeR package for compositional data analysis that provides support for operations on compositions: perturbation and power multiplication, subcomposition with or without residuals, centering of the data, computing Aitchison’s, Euclidean, Bhattacharyya distances, compositional Kullback-Leibler divergence etc. graphical presentation of compositions in ternary diagrams and tetrahedrons with additional features: barycenter, geometric mean of the data set, the percentiles lines, marking and coloring of subsets of the data set, theirs geometric means, notation of individual data in the set . . . dealing with zeros and missing values in compositional data sets with R procedures for simple and multiplicative replacement strategy, the time series analysis of compositional data. We’ll present the current status of MixeR development and illustrate its use on selected data sets
Resumo:
The low levels of unemployment recorded in the UK in recent years are widely cited as evidence of the country’s improved economic performance, and the apparent convergence of unemployment rates across the country’s regions used to suggest that the longstanding divide in living standards between the relatively prosperous ‘south’ and the more depressed ‘north’ has been substantially narrowed. Dissenters from these conclusions have drawn attention to the greatly increased extent of non-employment (around a quarter of the UK’s working age population are not in employment) and the marked regional dimension in its distribution across the country. Amongst these dissenters it is generally agreed that non-employment is concentrated amongst older males previously employed in the now very much smaller ‘heavy’ industries (e.g. coal, steel, shipbuilding). This paper uses the tools of compositiona l data analysis to provide a much richer picture of non-employment and one which challenges the conventional analysis wisdom about UK labour market performance as well as the dissenters view of the nature of the problem. It is shown that, associated with the striking ‘north/south’ divide in nonemployment rates, there is a statistically significant relationship between the size of the non-employment rate and the composition of non-employment. Specifically, it is shown that the share of unemployment in non-employment is negatively correlated with the overall non-employment rate: in regions where the non-employment rate is high the share of unemployment is relatively low. So the unemployment rate is not a very reliable indicator of regional disparities in labour market performance. Even more importantly from a policy viewpoint, a significant positive relationship is found between the size of the non-employment rate and the share of those not employed through reason of sickness or disability and it seems (contrary to the dissenters) that this connection is just as strong for women as it is for men
Resumo:
Usually, psychometricians apply classical factorial analysis to evaluate construct validity of order rank scales. Nevertheless, these scales have particular characteristics that must be taken into account: total scores and rank are highly relevant
Resumo:
First application of compositional data analysis techniques to Australian election data
Resumo:
Isotopic data are currently becoming an important source of information regarding sources, evolution and mixing processes of water in hydrogeologic systems. However, it is not clear how to treat with statistics the geochemical data and the isotopic data together. We propose to introduce the isotopic information as new parts, and apply compositional data analysis with the resulting increased composition. Results are equivalent to downscale the classical isotopic delta variables, because they are already relative (as needed in the compositional framework) and isotopic variations are almost always very small. This methodology is illustrated and tested with the study of the Llobregat River Basin (Barcelona, NE Spain), where it is shown that, though very small, isotopic variations comp lement geochemical principal components, and help in the better identification of pollution sources
Resumo:
Functional Data Analysis (FDA) deals with samples where a whole function is observed for each individual. A particular case of FDA is when the observed functions are density functions, that are also an example of infinite dimensional compositional data. In this work we compare several methods for dimensionality reduction for this particular type of data: functional principal components analysis (PCA) with or without a previous data transformation and multidimensional scaling (MDS) for diferent inter-densities distances, one of them taking into account the compositional nature of density functions. The difeerent methods are applied to both artificial and real data (households income distributions)
Resumo:
Modern methods of compositional data analysis are not well known in biomedical research. Moreover, there appear to be few mathematical and statistical researchers working on compositional biomedical problems. Like the earth and environmental sciences, biomedicine has many problems in which the relevant scienti c information is encoded in the relative abundance of key species or categories. I introduce three problems in cancer research in which analysis of compositions plays an important role. The problems involve 1) the classi cation of serum proteomic pro les for early detection of lung cancer, 2) inference of the relative amounts of di erent tissue types in a diagnostic tumor biopsy, and 3) the subcellular localization of the BRCA1 protein, and it's role in breast cancer patient prognosis. For each of these problems I outline a partial solution. However, none of these problems is \solved". I attempt to identify areas in which additional statistical development is needed with the hope of encouraging more compositional data analysts to become involved in biomedical research
Resumo:
Several eco-toxicological studies have shown that insectivorous mammals, due to their feeding habits, easily accumulate high amounts of pollutants in relation to other mammal species. To assess the bio-accumulation levels of toxic metals and their in°uence on essential metals, we quantified the concentration of 19 elements (Ca, K, Fe, B, P, S, Na, Al, Zn, Ba, Rb, Sr, Cu, Mn, Hg, Cd, Mo, Cr and Pb) in bones of 105 greater white-toothed shrews (Crocidura russula) from a polluted (Ebro Delta) and a control (Medas Islands) area. Since chemical contents of a bio-indicator are mainly compositional data, conventional statistical analyses currently used in eco-toxicology can give misleading results. Therefore, to improve the interpretation of the data obtained, we used statistical techniques for compositional data analysis to define groups of metals and to evaluate the relationships between them, from an inter-population viewpoint. Hypothesis testing on the adequate balance-coordinates allow us to confirm intuition based hypothesis and some previous results. The main statistical goal was to test equal means of balance-coordinates for the two defined populations. After checking normality, one-way ANOVA or Mann-Whitney tests were carried out for the inter-group balances
Resumo:
The statistical analysis of compositional data should be treated using logratios of parts, which are difficult to use correctly in standard statistical packages. For this reason a freeware package, named CoDaPack was created. This software implements most of the basic statistical methods suitable for compositional data. In this paper we describe the new version of the package that now is called CoDaPack3D. It is developed in Visual Basic for applications (associated with Excel©), Visual Basic and Open GL, and it is oriented towards users with a minimum knowledge of computers with the aim at being simple and easy to use. This new version includes new graphical output in 2D and 3D. These outputs could be zoomed and, in 3D, rotated. Also a customization menu is included and outputs could be saved in jpeg format. Also this new version includes an interactive help and all dialog windows have been improved in order to facilitate its use. To use CoDaPack one has to access Excel© and introduce the data in a standard spreadsheet. These should be organized as a matrix where Excel© rows correspond to the observations and columns to the parts. The user executes macros that return numerical or graphical results. There are two kinds of numerical results: new variables and descriptive statistics, and both appear on the same sheet. Graphical output appears in independent windows. In the present version there are 8 menus, with a total of 38 submenus which, after some dialogue, directly call the corresponding macro. The dialogues ask the user to input variables and further parameters needed, as well as where to put these results. The web site http://ima.udg.es/CoDaPack contains this freeware package and only Microsoft Excel© under Microsoft Windows© is required to run the software. Kew words: Compositional data Analysis, Software
Resumo:
In a seminal paper, Aitchison and Lauder (1985) introduced classical kernel density estimation techniques in the context of compositional data analysis. Indeed, they gave two options for the choice of the kernel to be used in the kernel estimator. One of these kernels is based on the use the alr transformation on the simplex SD jointly with the normal distribution on RD-1. However, these authors themselves recognized that this method has some deficiencies. A method for overcoming these dificulties based on recent developments for compositional data analysis and multivariate kernel estimation theory, combining the ilr transformation with the use of the normal density with a full bandwidth matrix, was recently proposed in Martín-Fernández, Chacón and Mateu- Figueras (2006). Here we present an extensive simulation study that compares both methods in practice, thus exploring the finite-sample behaviour of both estimators
Resumo:
Pounamu (NZ jade), or nephrite, is a protected mineral in its natural form following the transfer of ownership back to Ngai Tahu under the Ngai Tahu (Pounamu Vesting) Act 1997. Any theft of nephrite is prosecutable under the Crimes Act 1961. Scientific evidence is essential in cases where origin is disputed. A robust method for discrimination of this material through the use of elemental analysis and compositional data analysis is required. Initial studies have characterised the variability within a given nephrite source. This has included investigation of both in situ outcrops and alluvial material. Methods for the discrimination of two geographically close nephrite sources are being developed. Key Words: forensic, jade, nephrite, laser ablation, inductively coupled plasma mass spectrometry, multivariate analysis, elemental analysis, compositional data analysis
Resumo:
Planners in public and private institutions would like coherent forecasts of the components of age-specic mortality, such as causes of death. This has been di cult to achieve because the relative values of the forecast components often fail to behave in a way that is coherent with historical experience. In addition, when the group forecasts are combined the result is often incompatible with an all-groups forecast. It has been shown that cause-specic mortality forecasts are pessimistic when compared with all-cause forecasts (Wilmoth, 1995). This paper abandons the conventional approach of using log mortality rates and forecasts the density of deaths in the life table. Since these values obey a unit sum constraint for both conventional single-decrement life tables (only one absorbing state) and multiple-decrement tables (more than one absorbing state), they are intrinsically relative rather than absolute values across decrements as well as ages. Using the methods of Compositional Data Analysis pioneered by Aitchison (1986), death densities are transformed into the real space so that the full range of multivariate statistics can be applied, then back-transformed to positive values so that the unit sum constraint is honoured. The structure of the best-known, single-decrement mortality-rate forecasting model, devised by Lee and Carter (1992), is expressed in compositional form and the results from the two models are compared. The compositional model is extended to a multiple-decrement form and used to forecast mortality by cause of death for Japan
Resumo:
Factor analysis as frequent technique for multivariate data inspection is widely used also for compositional data analysis. The usual way is to use a centered logratio (clr) transformation to obtain the random vector y of dimension D. The factor model is then y = Λf + e (1) with the factors f of dimension k < D, the error term e, and the loadings matrix Λ. Using the usual model assumptions (see, e.g., Basilevsky, 1994), the factor analysis model (1) can be written as Cov(y) = ΛΛT + ψ (2) where ψ = Cov(e) has a diagonal form. The diagonal elements of ψ as well as the loadings matrix Λ are estimated from an estimation of Cov(y). Given observed clr transformed data Y as realizations of the random vector y. Outliers or deviations from the idealized model assumptions of factor analysis can severely effect the parameter estimation. As a way out, robust estimation of the covariance matrix of Y will lead to robust estimates of Λ and ψ in (2), see Pison et al. (2003). Well known robust covariance estimators with good statistical properties, like the MCD or the S-estimators (see, e.g. Maronna et al., 2006), rely on a full-rank data matrix Y which is not the case for clr transformed data (see, e.g., Aitchison, 1986). The isometric logratio (ilr) transformation (Egozcue et al., 2003) solves this singularity problem. The data matrix Y is transformed to a matrix Z by using an orthonormal basis of lower dimension. Using the ilr transformed data, a robust covariance matrix C(Z) can be estimated. The result can be back-transformed to the clr space by C(Y ) = V C(Z)V T where the matrix V with orthonormal columns comes from the relation between the clr and the ilr transformation. Now the parameters in the model (2) can be estimated (Basilevsky, 1994) and the results have a direct interpretation since the links to the original variables are still preserved. The above procedure will be applied to data from geochemistry. Our special interest is on comparing the results with those of Reimann et al. (2002) for the Kola project data