41 results for CATEGORICAL-DATA ANALYSIS


Relevance:

100.00% 100.00%

Publisher:

Abstract:

R (http://www.r-project.org/) is ‘GNU S’, a language and environment for statistical computing and graphics. Many classical and modern statistical techniques have been implemented in the environment, but many are supplied as packages. There are 8 standard packages, and many more are available through the CRAN family of Internet sites (http://cran.r-project.org). We have started to develop a library of functions in R to support the analysis of mixtures. Our goal is a MixeR package for compositional data analysis that provides support for: operations on compositions (perturbation and power multiplication, subcomposition with or without residuals, centering of the data, computing Aitchison’s, Euclidean and Bhattacharyya distances, compositional Kullback-Leibler divergence, etc.); graphical presentation of compositions in ternary diagrams and tetrahedrons with additional features (barycenter, geometric mean of the data set, percentile lines, marking and coloring of subsets of the data set and their geometric means, notation of individual data in the set, ...); dealing with zeros and missing values in compositional data sets, with R procedures for simple and multiplicative replacement strategies; and the time series analysis of compositional data. We will present the current status of MixeR development and illustrate its use on selected data sets.
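The operations on compositions named in this abstract can be illustrated with a minimal sketch. It is written in Python rather than R and is not the MixeR API, which the abstract does not describe; all function names and the two example compositions are assumptions made for illustration.

```python
import math

def closure(x, total=1.0):
    """Rescale strictly positive parts so that they sum to `total`."""
    s = sum(x)
    return [total * v / s for v in x]

def perturb(x, y):
    """Perturbation: component-wise product followed by closure."""
    return closure([a * b for a, b in zip(x, y)])

def power(x, alpha):
    """Power multiplication: raise each part to `alpha`, then close."""
    return closure([a ** alpha for a in x])

def clr(x):
    """Centered logratio: log of each part minus the mean of the logs."""
    logs = [math.log(a) for a in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def aitchison_distance(x, y):
    """Aitchison distance: Euclidean distance between clr vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))

# Two made-up 3-part compositions (illustration only).
x = closure([1.0, 2.0, 7.0])
y = closure([3.0, 3.0, 4.0])

print(perturb(x, y))
print(power(x, 2.0))
print(aitchison_distance(x, y))
```

Perturbing `x` by `power(x, -1.0)` returns the barycenter of the simplex, and the Aitchison distance is unchanged when both compositions are perturbed by the same vector, which is a quick way to sanity-check such helpers.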

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The low levels of unemployment recorded in the UK in recent years are widely cited as evidence of the country’s improved economic performance, and the apparent convergence of unemployment rates across the country’s regions is used to suggest that the longstanding divide in living standards between the relatively prosperous ‘south’ and the more depressed ‘north’ has been substantially narrowed. Dissenters from these conclusions have drawn attention to the greatly increased extent of non-employment (around a quarter of the UK’s working age population are not in employment) and the marked regional dimension in its distribution across the country. Amongst these dissenters it is generally agreed that non-employment is concentrated amongst older males previously employed in the now very much smaller ‘heavy’ industries (e.g. coal, steel, shipbuilding). This paper uses the tools of compositional data analysis to provide a much richer picture of non-employment, one which challenges the conventional wisdom about UK labour market performance as well as the dissenters’ view of the nature of the problem. It is shown that, associated with the striking ‘north/south’ divide in non-employment rates, there is a statistically significant relationship between the size of the non-employment rate and the composition of non-employment. Specifically, it is shown that the share of unemployment in non-employment is negatively correlated with the overall non-employment rate: in regions where the non-employment rate is high, the share of unemployment is relatively low. So the unemployment rate is not a very reliable indicator of regional disparities in labour market performance.
Even more importantly from a policy viewpoint, a significant positive relationship is found between the size of the non-employment rate and the share of those not employed by reason of sickness or disability, and it seems (contrary to the dissenters) that this connection is just as strong for women as it is for men.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The main instrument used in psychological measurement is the self-report questionnaire. One of its major drawbacks, however, is its susceptibility to response biases. A known strategy to control these biases has been the use of so-called ipsative items. Ipsative items are items that require the respondent to make between-scale comparisons within each item. The selected option determines to which scale the weight of the answer is attributed. Consequently, in questionnaires consisting only of ipsative items, every respondent is allotted an equal amount, i.e. the total score, that each can distribute differently over the scales. Therefore this type of response format yields data that can be considered compositional from its inception. Methodologically oriented psychologists have heavily criticized this type of item format, since the resulting data are also marked by the associated unfavourable statistical properties. Nevertheless, clinicians have kept using these questionnaires to their satisfaction. This investigation therefore aims to evaluate both positions and addresses the similarities and differences between the two data collection methods. The ultimate objective is to formulate a guideline on when to use which type of item format. The comparison is based on data obtained with both an ipsative and a normative version of three psychological questionnaires, which were administered to 502 first-year students in psychology according to a balanced within-subjects design. Previous research only compared the direct ipsative scale scores with the derived ipsative scale scores. The use of compositional data analysis techniques also enables one to compare derived normative score ratios with direct normative score ratios. The addition of the second comparison not only offers the advantage of a better-balanced research strategy; in principle it also allows for parametric testing in the evaluation.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Usually, psychometricians apply classical factorial analysis to evaluate construct validity of order rank scales. Nevertheless, these scales have particular characteristics that must be taken into account: total scores and rank are highly relevant

Relevance:

100.00% 100.00%

Publisher:

Abstract:

First application of compositional data analysis techniques to Australian election data

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In any discipline, where uncertainty and variability are present, it is important to have principles which are accepted as inviolate and which should therefore drive statistical modelling, statistical analysis of data and any inferences from such an analysis. Despite the fact that two such principles have existed over the last two decades and from these a sensible, meaningful methodology has been developed for the statistical analysis of compositional data, the application of inappropriate and/or meaningless methods persists in many areas of application. This paper identifies at least ten common fallacies and confusions in compositional data analysis with illustrative examples and provides readers with necessary, and hopefully sufficient, arguments to persuade the culprits why and how they should amend their ways

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Isotopic data are currently becoming an important source of information regarding sources, evolution and mixing processes of water in hydrogeologic systems. However, it is not clear how to treat the geochemical data and the isotopic data together statistically. We propose to introduce the isotopic information as new parts, and to apply compositional data analysis to the resulting enlarged composition. Results are equivalent to downscaling the classical isotopic delta variables, because they are already relative (as needed in the compositional framework) and isotopic variations are almost always very small. This methodology is illustrated and tested with the study of the Llobregat River Basin (Barcelona, NE Spain), where it is shown that, though very small, isotopic variations complement geochemical principal components and help in the better identification of pollution sources.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In an earlier investigation (Burger et al., 2000) five sediment cores near the Rodrigues Triple Junction in the Indian Ocean were studied applying classical statistical methods (fuzzy c-means clustering, linear mixing model, principal component analysis) for the extraction of endmembers and evaluating the spatial and temporal variation of geochemical signals. Three main factors of sedimentation were expected by the marine geologists: a volcano-genetic, a hydro-hydrothermal and an ultra-basic factor. The display of fuzzy membership values and/or factor scores versus depth provided consistent results for two factors only; the ultra-basic component could not be identified. The reason for this may be that only traditional statistical methods were applied, i.e. the untransformed components were used, with the cosine-theta coefficient as similarity measure. During the last decade considerable progress in compositional data analysis was made and many case studies were published using new tools for exploratory analysis of these data. Therefore it makes sense to check whether the application of suitable data transformations, reduction of the D-part simplex to two or three factors and visual interpretation of the factor scores would lead to a revision of earlier results and to answers to open questions. In this paper we follow the lines of a paper by R. Tolosana-Delgado et al. (2005), starting with a problem-oriented interpretation of the biplot scattergram, extracting compositional factors, ilr-transformation of the components and visualization of the factor scores in a spatial context: the compositional factors will be plotted versus depth (time) of the core samples in order to facilitate the identification of the expected sources of the sedimentary process. Keywords: compositional data analysis, biplot, deep sea sediments

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Many multivariate methods that are apparently distinct can be linked by introducing one or more parameters in their definition. Methods that can be linked in this way are correspondence analysis, unweighted or weighted logratio analysis (the latter also known as "spectral mapping"), nonsymmetric correspondence analysis, principal component analysis (with and without logarithmic transformation of the data) and multidimensional scaling. In this presentation I will show how several of these methods, which are frequently used in compositional data analysis, may be linked through parametrizations such as power transformations, linear transformations and convex linear combinations. Since the methods of interest here all lead to visual maps of data, a "movie" can be made where the linking parameter is allowed to vary in small steps: the results are recalculated "frame by frame" and one can see the smooth change from one method to another. Several of these "movies" will be shown, giving a deeper insight into the similarities and differences between these methods.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Several eco-toxicological studies have shown that insectivorous mammals, due to their feeding habits, easily accumulate high amounts of pollutants in relation to other mammal species. To assess the bio-accumulation levels of toxic metals and their influence on essential metals, we quantified the concentration of 19 elements (Ca, K, Fe, B, P, S, Na, Al, Zn, Ba, Rb, Sr, Cu, Mn, Hg, Cd, Mo, Cr and Pb) in bones of 105 greater white-toothed shrews (Crocidura russula) from a polluted (Ebro Delta) and a control (Medas Islands) area. Since chemical contents of a bio-indicator are mainly compositional data, conventional statistical analyses currently used in eco-toxicology can give misleading results. Therefore, to improve the interpretation of the data obtained, we used statistical techniques for compositional data analysis to define groups of metals and to evaluate the relationships between them, from an inter-population viewpoint. Hypothesis testing on the appropriate balance-coordinates allows us to confirm intuition-based hypotheses and some previous results. The main statistical goal was to test equal means of balance-coordinates for the two defined populations. After checking normality, one-way ANOVA or Mann-Whitney tests were carried out for the inter-group balances.
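A balance-coordinate between two groups of parts can be sketched as follows. This uses the standard definition of a balance (a normalized log ratio of the geometric means of the two groups); the grouping of elements and the concentrations below are made up for illustration and are not taken from the study.

```python
import math

def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def balance(composition, numerator, denominator):
    """Balance between two disjoint groups of parts.

    b = sqrt(r*s / (r+s)) * ln(g(numerator parts) / g(denominator parts)),
    where r and s are the group sizes and g is the geometric mean.
    """
    r, s = len(numerator), len(denominator)
    g_num = geometric_mean([composition[k] for k in numerator])
    g_den = geometric_mean([composition[k] for k in denominator])
    return math.sqrt(r * s / (r + s)) * math.log(g_num / g_den)

# Hypothetical concentrations for one specimen (illustration only).
sample = {"Pb": 2.0, "Cd": 0.5, "Hg": 0.25, "Zn": 120.0, "Cu": 8.0}

# Hypothetical partition: toxic metals against essential metals.
b = balance(sample, ["Pb", "Cd", "Hg"], ["Zn", "Cu"])
print(b)
```

Because a balance depends only on ratios of parts, it does not change when all concentrations are multiplied by a common factor, so the unit of measurement does not affect the subsequent ANOVA or Mann-Whitney comparison of the two populations.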

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Factor analysis, a frequently used technique for multivariate data inspection, is also widely used for compositional data analysis. The usual way is to use a centered logratio (clr) transformation to obtain the random vector y of dimension D. The factor model is then y = Λf + e (1), with the factors f of dimension k < D, the error term e, and the loadings matrix Λ. Using the usual model assumptions (see, e.g., Basilevsky, 1994), the factor analysis model (1) can be written as Cov(y) = ΛΛ^T + ψ (2), where ψ = Cov(e) has a diagonal form. The diagonal elements of ψ as well as the loadings matrix Λ are estimated from an estimate of Cov(y), given observed clr-transformed data Y as realizations of the random vector y. Outliers or deviations from the idealized model assumptions of factor analysis can severely affect the parameter estimation. As a way out, robust estimation of the covariance matrix of Y will lead to robust estimates of Λ and ψ in (2), see Pison et al. (2003). Well-known robust covariance estimators with good statistical properties, like the MCD or the S-estimators (see, e.g., Maronna et al., 2006), rely on a full-rank data matrix Y, which is not the case for clr-transformed data (see, e.g., Aitchison, 1986). The isometric logratio (ilr) transformation (Egozcue et al., 2003) solves this singularity problem. The data matrix Y is transformed to a matrix Z by using an orthonormal basis of lower dimension. Using the ilr-transformed data, a robust covariance matrix C(Z) can be estimated. The result can be back-transformed to the clr space by C(Y) = V C(Z) V^T, where the matrix V with orthonormal columns comes from the relation between the clr and the ilr transformation. Now the parameters in the model (2) can be estimated (Basilevsky, 1994) and the results have a direct interpretation, since the links to the original variables are still preserved. The above procedure will be applied to data from geochemistry.
Our special interest is in comparing the results with those of Reimann et al. (2002) for the Kola project data.
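The clr/ilr relation and the back-transformation C(Y) = V C(Z) V^T can be sketched as follows. Two assumptions are made here: the basis V is one particular Helmert-type choice (the abstract does not say which basis is used), and the classical sample covariance stands in for the robust estimator (MCD or S-estimator), which is not implemented. The data are made up.

```python
import math

def clr(x):
    """Centered logratio transform of one composition."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def ilr_basis(D):
    """D x (D-1) matrix V with orthonormal, zero-sum columns (Helmert type)."""
    V = [[0.0] * (D - 1) for _ in range(D)]
    for j in range(1, D):
        norm = math.sqrt(j / (j + 1))
        for i in range(j):
            V[i][j - 1] = norm / j
        V[j][j - 1] = -norm
    return V

def ilr(x, V):
    """ilr coordinates z = V^T clr(x)."""
    y = clr(x)
    return [sum(V[i][j] * y[i] for i in range(len(y))) for j in range(len(V[0]))]

def covariance(rows):
    """Classical sample covariance (placeholder for a robust estimator)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]

def back_transform(C_z, V):
    """C(Y) = V C(Z) V^T, mapping an ilr covariance back to clr space."""
    D, k = len(V), len(V[0])
    VC = [[sum(V[i][a] * C_z[a][b] for a in range(k)) for b in range(k)]
          for i in range(D)]
    return [[sum(VC[i][b] * V[j][b] for b in range(k)) for j in range(D)]
            for i in range(D)]

# Made-up 3-part compositions (illustration only).
data = [[0.2, 0.3, 0.5], [0.1, 0.6, 0.3], [0.4, 0.4, 0.2],
        [0.25, 0.25, 0.5], [0.6, 0.1, 0.3]]
V = ilr_basis(3)
Z = [ilr(x, V) for x in data]
C_z = covariance(Z)            # full rank, dimension (D-1) x (D-1)
C_y = back_transform(C_z, V)   # singular, dimension D x D
print(C_y)
```

With the classical estimator the back-transformed matrix coincides with the covariance of the clr data computed directly; the point of the detour through ilr coordinates is that a robust estimator requiring full-rank data can be substituted in the `covariance` step.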

Relevance:

100.00% 100.00%

Publisher:

Abstract:

This paper addresses the application of PCA to categorical data prior to diagnosing a patient data set using a Case-Based Reasoning (CBR) system. The particularity is that standard PCA techniques are designed to deal with numerical attributes, but our medical data set contains many categorical data, so alternative methods such as RS-PCA are required. Thus, we propose to hybridize RS-PCA (Regular Simplex PCA) and a simple CBR. Results show that, when diagnosing a medical data set, the hybrid system produces results similar to those obtained when using the original attributes. These results are quite promising, since they allow diagnosis with less computational effort and memory storage.

Relevance:

90.00% 90.00%

Publisher:

Abstract:

Modern methods of compositional data analysis are not well known in biomedical research. Moreover, there appear to be few mathematical and statistical researchers working on compositional biomedical problems. Like the earth and environmental sciences, biomedicine has many problems in which the relevant scientific information is encoded in the relative abundance of key species or categories. I introduce three problems in cancer research in which analysis of compositions plays an important role. The problems involve 1) the classification of serum proteomic profiles for early detection of lung cancer, 2) inference of the relative amounts of different tissue types in a diagnostic tumor biopsy, and 3) the subcellular localization of the BRCA1 protein, and its role in breast cancer patient prognosis. For each of these problems I outline a partial solution. However, none of these problems is "solved". I attempt to identify areas in which additional statistical development is needed, with the hope of encouraging more compositional data analysts to become involved in biomedical research.

Relevance:

90.00% 90.00%

Publisher:

Abstract:

This analysis was stimulated by the real data analysis problem of household expenditure data. The full dataset contains expenditure data for a sample of 1224 households. The expenditure is broken down at 2 hierarchical levels: 9 major levels (e.g. housing, food, utilities etc.) and 92 minor levels. There are also 5 factors and 5 covariates at the household level. Not surprisingly, there are a small number of zeros at the major level, but many zeros at the minor level. The question is how best to model the zeros. Clearly, models that try to add a small amount to the zero terms are not appropriate in general as at least some of the zeros are clearly structural, e.g. alcohol/tobacco for households that are teetotal. The key question then is how to build suitable conditional models. For example, is the sub-composition of spending excluding alcohol/tobacco similar for teetotal and non-teetotal households? In other words, we are looking for sub-compositional independence. Also, what determines whether a household is teetotal? Can we assume that it is independent of the composition? In general, whether teetotal will clearly depend on the household level variables, so we need to be able to model this dependence. The other tricky question is that with zeros on more than one component, we need to be able to model dependence and independence of zeros on the different components. Lastly, while some zeros are structural, others may not be, for example, for expenditure on durables, it may be chance as to whether a particular household spends money on durables within the sample period. This would clearly be distinguishable if we had longitudinal data, but may still be distinguishable by looking at the distribution, on the assumption that random zeros will usually be for situations where any non-zero expenditure is not small. 
While this analysis is based on economic data, the ideas carry over to many other situations, including geological data, where minerals may be missing for structural reasons (similar to alcohol), or missing because they occur only in random regions which may be missed in a sample (similar to the durables).

Relevance:

90.00% 90.00%

Publisher:

Abstract:

As stated in Aitchison (1986), a proper study of relative variation in a compositional data set should be based on logratios, and dealing with logratios excludes dealing with zeros. Nevertheless, it is clear that zero observations might be present in real data sets, either because the corresponding part is completely absent –essential zeros– or because it is below detection limit –rounded zeros. Because the second kind of zeros is usually understood as “a trace too small to measure”, it seems reasonable to replace them by a suitable small value, and this has been the traditional approach. As stated, e.g. by Tauber (1999) and by Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2000), the principal problem in compositional data analysis is related to rounded zeros. One should be careful to use a replacement strategy that does not seriously distort the general structure of the data. In particular, the covariance structure of the involved parts –and thus the metric properties– should be preserved, as otherwise further analysis on subpopulations could be misleading. Following this point of view, a non-parametric imputation method is introduced in Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2000). This method is analyzed in depth by Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2003) where it is shown that the theoretical drawbacks of the additive zero replacement method proposed in Aitchison (1986) can be overcome using a new multiplicative approach on the non-zero parts of a composition. The new approach has reasonable properties from a compositional point of view. In particular, it is “natural” in the sense that it recovers the “true” composition if replacement values are identical to the missing values, and it is coherent with the basic operations on the simplex. This coherence implies that the covariance structure of subcompositions with no zeros is preserved. 
As a generalization of the multiplicative replacement, a substitution method for missing values in compositional data sets is introduced in the same paper.
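The multiplicative replacement of rounded zeros can be sketched as follows, in the form it is usually stated: each zero part is set to a small value δ, and each non-zero part is multiplied by (1 − total imputed mass / c), where c is the closure constant. Using a single δ for all parts, and the example values, are simplifying assumptions for illustration.

```python
def multiplicative_replacement(x, delta, total=1.0):
    """Replace zeros by `delta` and shrink non-zero parts multiplicatively.

    `x` is a composition summing to `total`; `delta` is the imputed value,
    assumed here to be the same for every zero part.
    """
    zero_count = sum(1 for v in x if v == 0)
    imputed_mass = delta * zero_count
    shrink = 1.0 - imputed_mass / total
    return [delta if v == 0 else v * shrink for v in x]

# Made-up 4-part composition with one rounded zero (illustration only).
x = [0.0, 0.2, 0.3, 0.5]
r = multiplicative_replacement(x, delta=0.01)
print(r)
print(sum(r))
```

Since every non-zero part is multiplied by the same factor, the ratios between them are unchanged, which is why the covariance structure of subcompositions without zeros is preserved.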