852 resultados para Initial data problem
Resumo:
The application of compositional data analysis through log ratio trans- formations corresponds to a multinomial logit model for the shares themselves. This model is characterized by the property of Independence of Irrelevant Alter- natives (IIA). IIA states that the odds ratio in this case the ratio of shares is invariant to the addition or deletion of outcomes to the problem. It is exactly this invariance of the ratio that underlies the commonly used zero replacement procedure in compositional data analysis. In this paper we investigate using the nested logit model that does not embody IIA and an associated zero replacement procedure and compare its performance with that of the more usual approach of using the multinomial logit model. Our comparisons exploit a data set that com- bines voting data by electoral division with corresponding census data for each division for the 2001 Federal election in Australia
Resumo:
This analysis was stimulated by the real data analysis problem of household expenditure data. The full dataset contains expenditure data for a sample of 1224 households. The expenditure is broken down at 2 hierarchical levels: 9 major levels (e.g. housing, food, utilities etc.) and 92 minor levels. There are also 5 factors and 5 covariates at the household level. Not surprisingly, there are a small number of zeros at the major level, but many zeros at the minor level. The question is how best to model the zeros. Clearly, models that try to add a small amount to the zero terms are not appropriate in general as at least some of the zeros are clearly structural, e.g. alcohol/tobacco for households that are teetotal. The key question then is how to build suitable conditional models. For example, is the sub-composition of spending excluding alcohol/tobacco similar for teetotal and non-teetotal households? In other words, we are looking for sub-compositional independence. Also, what determines whether a household is teetotal? Can we assume that it is independent of the composition? In general, whether teetotal will clearly depend on the household level variables, so we need to be able to model this dependence. The other tricky question is that with zeros on more than one component, we need to be able to model dependence and independence of zeros on the different components. Lastly, while some zeros are structural, others may not be, for example, for expenditure on durables, it may be chance as to whether a particular household spends money on durables within the sample period. This would clearly be distinguishable if we had longitudinal data, but may still be distinguishable by looking at the distribution, on the assumption that random zeros will usually be for situations where any non-zero expenditure is not small. While this analysis is based on around economic data, the ideas carry over to many other situations, including geological data, where minerals may be missing for structural reasons (similar to alcohol), or missing because they occur only in random regions which may be missed in a sample (similar to the durables)
Resumo:
As stated in Aitchison (1986), a proper study of relative variation in a compositional data set should be based on logratios, and dealing with logratios excludes dealing with zeros. Nevertheless, it is clear that zero observations might be present in real data sets, either because the corresponding part is completely absent –essential zeros– or because it is below detection limit –rounded zeros. Because the second kind of zeros is usually understood as “a trace too small to measure”, it seems reasonable to replace them by a suitable small value, and this has been the traditional approach. As stated, e.g. by Tauber (1999) and by Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2000), the principal problem in compositional data analysis is related to rounded zeros. One should be careful to use a replacement strategy that does not seriously distort the general structure of the data. In particular, the covariance structure of the involved parts –and thus the metric properties– should be preserved, as otherwise further analysis on subpopulations could be misleading. Following this point of view, a non-parametric imputation method is introduced in Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2000). This method is analyzed in depth by Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2003) where it is shown that the theoretical drawbacks of the additive zero replacement method proposed in Aitchison (1986) can be overcome using a new multiplicative approach on the non-zero parts of a composition. The new approach has reasonable properties from a compositional point of view. In particular, it is “natural” in the sense that it recovers the “true” composition if replacement values are identical to the missing values, and it is coherent with the basic operations on the simplex. This coherence implies that the covariance structure of subcompositions with no zeros is preserved. As a generalization of the multiplicative replacement, in the same paper a substitution method for missing values on compositional data sets is introduced
Resumo:
First discussion on compositional data analysis is attributable to Karl Pearson, in 1897. However, notwithstanding the recent developments on algebraic structure of the simplex, more than twenty years after Aitchison’s idea of log-transformations of closed data, scientific literature is again full of statistical treatments of this type of data by using traditional methodologies. This is particularly true in environmental geochemistry where besides the problem of the closure, the spatial structure (dependence) of the data have to be considered. In this work we propose the use of log-contrast values, obtained by a simplicial principal component analysis, as LQGLFDWRUV of given environmental conditions. The investigation of the log-constrast frequency distributions allows pointing out the statistical laws able to generate the values and to govern their variability. The changes, if compared, for example, with the mean values of the random variables assumed as models, or other reference parameters, allow defining monitors to be used to assess the extent of possible environmental contamination. Case study on running and ground waters from Chiavenna Valley (Northern Italy) by using Na+, K+, Ca2+, Mg2+, HCO3-, SO4 2- and Cl- concentrations will be illustrated
Resumo:
The low levels of unemployment recorded in the UK in recent years are widely cited as evidence of the country’s improved economic performance, and the apparent convergence of unemployment rates across the country’s regions used to suggest that the longstanding divide in living standards between the relatively prosperous ‘south’ and the more depressed ‘north’ has been substantially narrowed. Dissenters from these conclusions have drawn attention to the greatly increased extent of non-employment (around a quarter of the UK’s working age population are not in employment) and the marked regional dimension in its distribution across the country. Amongst these dissenters it is generally agreed that non-employment is concentrated amongst older males previously employed in the now very much smaller ‘heavy’ industries (e.g. coal, steel, shipbuilding). This paper uses the tools of compositiona l data analysis to provide a much richer picture of non-employment and one which challenges the conventional analysis wisdom about UK labour market performance as well as the dissenters view of the nature of the problem. It is shown that, associated with the striking ‘north/south’ divide in nonemployment rates, there is a statistically significant relationship between the size of the non-employment rate and the composition of non-employment. Specifically, it is shown that the share of unemployment in non-employment is negatively correlated with the overall non-employment rate: in regions where the non-employment rate is high the share of unemployment is relatively low. So the unemployment rate is not a very reliable indicator of regional disparities in labour market performance. Even more importantly from a policy viewpoint, a significant positive relationship is found between the size of the non-employment rate and the share of those not employed through reason of sickness or disability and it seems (contrary to the dissenters) that this connection is just as strong for women as it is for men
Resumo:
Compositional random vectors are fundamental tools in the Bayesian analysis of categorical data. Many of the issues that are discussed with reference to the statistical analysis of compositional data have a natural counterpart in the construction of a Bayesian statistical model for categorical data. This note builds on the idea of cross-fertilization of the two areas recommended by Aitchison (1986) in his seminal book on compositional data. Particular emphasis is put on the problem of what parameterization to use
Resumo:
The statistical analysis of literary style is the part of stylometry that compares measurable characteristics in a text that are rarely controlled by the author, with those in other texts. When the goal is to settle authorship questions, these characteristics should relate to the author’s style and not to the genre, epoch or editor, and they should be such that their variation between authors is larger than the variation within comparable texts from the same author. For an overview of the literature on stylometry and some of the techniques involved, see for example Mosteller and Wallace (1964, 82), Herdan (1964), Morton (1978), Holmes (1985), Oakes (1998) or Lebart, Salem and Berry (1998). Tirant lo Blanc, a chivalry book, is the main work in catalan literature and it was hailed to be “the best book of its kind in the world” by Cervantes in Don Quixote. Considered by writters like Vargas Llosa or Damaso Alonso to be the first modern novel in Europe, it has been translated several times into Spanish, Italian and French, with modern English translations by Rosenthal (1996) and La Fontaine (1993). The main body of this book was written between 1460 and 1465, but it was not printed until 1490. There is an intense and long lasting debate around its authorship sprouting from its first edition, where its introduction states that the whole book is the work of Martorell (1413?-1468), while at the end it is stated that the last one fourth of the book is by Galba (?-1490), after the death of Martorell. Some of the authors that support the theory of single authorship are Riquer (1990), Chiner (1993) and Badia (1993), while some of those supporting the double authorship are Riquer (1947), Coromines (1956) and Ferrando (1995). For an overview of this debate, see Riquer (1990). Neither of the two candidate authors left any text comparable to the one under study, and therefore discriminant analysis can not be used to help classify chapters by author. By using sample texts encompassing about ten percent of the book, and looking at word length and at the use of 44 conjunctions, prepositions and articles, Ginebra and Cabos (1998) detect heterogeneities that might indicate the existence of two authors. By analyzing the diversity of the vocabulary, Riba and Ginebra (2000) estimates that stylistic boundary to be near chapter 383. Following the lead of the extensive literature, this paper looks into word length, the use of the most frequent words and into the use of vowels in each chapter of the book. Given that the features selected are categorical, that leads to three contingency tables of ordered rows and therefore to three sequences of multinomial observations. Section 2 explores these sequences graphically, observing a clear shift in their distribution. Section 3 describes the problem of the estimation of a suden change-point in those sequences, in the following sections we propose various ways to estimate change-points in multinomial sequences; the method in section 4 involves fitting models for polytomous data, the one in Section 5 fits gamma models onto the sequence of Chi-square distances between each row profiles and the average profile, the one in Section 6 fits models onto the sequence of values taken by the first component of the correspondence analysis as well as onto sequences of other summary measures like the average word length. In Section 7 we fit models onto the marginal binomial sequences to identify the features that distinguish the chapters before and after that boundary. Most methods rely heavily on the use of generalized linear models
Resumo:
This paper sets out to identify the initial positions of the different decision makers who intervene in a group decision making process with a reduced number of actors, and to establish possible consensus paths between these actors. As a methodological support, it employs one of the most widely-known multicriteria decision techniques, namely, the Analytic Hierarchy Process (AHP). Assuming that the judgements elicited by the decision makers follow the so-called multiplicative model (Crawford and Williams, 1985; Altuzarra et al., 1997; Laininen and Hämäläinen, 2003) with log-normal errors and unknown variance, a Bayesian approach is used in the estimation of the relative priorities of the alternatives being compared. These priorities, estimated by way of the median of the posterior distribution and normalised in a distributive manner (priorities add up to one), are a clear example of compositional data that will be used in the search for consensus between the actors involved in the resolution of the problem through the use of Multidimensional Scaling tools
Resumo:
In an earlier investigation (Burger et al., 2000) five sediment cores near the Rodrigues Triple Junction in the Indian Ocean were studied applying classical statistical methods (fuzzy c-means clustering, linear mixing model, principal component analysis) for the extraction of endmembers and evaluating the spatial and temporal variation of geochemical signals. Three main factors of sedimentation were expected by the marine geologists: a volcano-genetic, a hydro-hydrothermal and an ultra-basic factor. The display of fuzzy membership values and/or factor scores versus depth provided consistent results for two factors only; the ultra-basic component could not be identified. The reason for this may be that only traditional statistical methods were applied, i.e. the untransformed components were used and the cosine-theta coefficient as similarity measure. During the last decade considerable progress in compositional data analysis was made and many case studies were published using new tools for exploratory analysis of these data. Therefore it makes sense to check if the application of suitable data transformations, reduction of the D-part simplex to two or three factors and visual interpretation of the factor scores would lead to a revision of earlier results and to answers to open questions . In this paper we follow the lines of a paper of R. Tolosana- Delgado et al. (2005) starting with a problem-oriented interpretation of the biplot scattergram, extracting compositional factors, ilr-transformation of the components and visualization of the factor scores in a spatial context: The compositional factors will be plotted versus depth (time) of the core samples in order to facilitate the identification of the expected sources of the sedimentary process. Kew words: compositional data analysis, biplot, deep sea sediments
Resumo:
Self-organizing maps (Kohonen 1997) is a type of artificial neural network developed to explore patterns in high-dimensional multivariate data. The conventional version of the algorithm involves the use of Euclidean metric in the process of adaptation of the model vectors, thus rendering in theory a whole methodology incompatible with non-Euclidean geometries. In this contribution we explore the two main aspects of the problem: 1. Whether the conventional approach using Euclidean metric can shed valid results with compositional data. 2. If a modification of the conventional approach replacing vectorial sum and scalar multiplication by the canonical operators in the simplex (i.e. perturbation and powering) can converge to an adequate solution. Preliminary tests showed that both methodologies can be used on compositional data. However, the modified version of the algorithm performs poorer than the conventional version, in particular, when the data is pathological. Moreover, the conventional ap- proach converges faster to a solution, when data is \well-behaved". Key words: Self Organizing Map; Artificial Neural networks; Compositional data
Resumo:
Our essay aims at studying suitable statistical methods for the clustering of compositional data in situations where observations are constituted by trajectories of compositional data, that is, by sequences of composition measurements along a domain. Observed trajectories are known as “functional data” and several methods have been proposed for their analysis. In particular, methods for clustering functional data, known as Functional Cluster Analysis (FCA), have been applied by practitioners and scientists in many fields. To our knowledge, FCA techniques have not been extended to cope with the problem of clustering compositional data trajectories. In order to extend FCA techniques to the analysis of compositional data, FCA clustering techniques have to be adapted by using a suitable compositional algebra. The present work centres on the following question: given a sample of compositional data trajectories, how can we formulate a segmentation procedure giving homogeneous classes? To address this problem we follow the steps described below. First of all we adapt the well-known spline smoothing techniques in order to cope with the smoothing of compositional data trajectories. In fact, an observed curve can be thought of as the sum of a smooth part plus some noise due to measurement errors. Spline smoothing techniques are used to isolate the smooth part of the trajectory: clustering algorithms are then applied to these smooth curves. The second step consists in building suitable metrics for measuring the dissimilarity between trajectories: we propose a metric that accounts for difference in both shape and level, and a metric accounting for differences in shape only. A simulation study is performed in order to evaluate the proposed methodologies, using both hierarchical and partitional clustering algorithm. The quality of the obtained results is assessed by means of several indices
Resumo:
In this paper we examine the problem of compositional data from a different starting point. Chemical compositional data, as used in provenance studies on archaeological materials, will be approached from the measurement theory. The results will show, in a very intuitive way that chemical data can only be treated by using the approach developed for compositional data. It will be shown that compositional data analysis is a particular case in projective geometry, when the projective coordinates are in the positive orthant, and they have the properties of logarithmic interval metrics. Moreover, it will be shown that this approach can be extended to a very large number of applications, including shape analysis. This will be exemplified with a case study in architecture of Early Christian churches dated back to the 5th-7th centuries AD
Resumo:
Factor analysis as frequent technique for multivariate data inspection is widely used also for compositional data analysis. The usual way is to use a centered logratio (clr) transformation to obtain the random vector y of dimension D. The factor model is then y = Λf + e (1) with the factors f of dimension k < D, the error term e, and the loadings matrix Λ. Using the usual model assumptions (see, e.g., Basilevsky, 1994), the factor analysis model (1) can be written as Cov(y) = ΛΛT + ψ (2) where ψ = Cov(e) has a diagonal form. The diagonal elements of ψ as well as the loadings matrix Λ are estimated from an estimation of Cov(y). Given observed clr transformed data Y as realizations of the random vector y. Outliers or deviations from the idealized model assumptions of factor analysis can severely effect the parameter estimation. As a way out, robust estimation of the covariance matrix of Y will lead to robust estimates of Λ and ψ in (2), see Pison et al. (2003). Well known robust covariance estimators with good statistical properties, like the MCD or the S-estimators (see, e.g. Maronna et al., 2006), rely on a full-rank data matrix Y which is not the case for clr transformed data (see, e.g., Aitchison, 1986). The isometric logratio (ilr) transformation (Egozcue et al., 2003) solves this singularity problem. The data matrix Y is transformed to a matrix Z by using an orthonormal basis of lower dimension. Using the ilr transformed data, a robust covariance matrix C(Z) can be estimated. The result can be back-transformed to the clr space by C(Y ) = V C(Z)V T where the matrix V with orthonormal columns comes from the relation between the clr and the ilr transformation. Now the parameters in the model (2) can be estimated (Basilevsky, 1994) and the results have a direct interpretation since the links to the original variables are still preserved. The above procedure will be applied to data from geochemistry. Our special interest is on comparing the results with those of Reimann et al. (2002) for the Kola project data
Resumo:
This paper proposes a pose-based algorithm to solve the full SLAM problem for an autonomous underwater vehicle (AUV), navigating in an unknown and possibly unstructured environment. The technique incorporate probabilistic scan matching with range scans gathered from a mechanical scanning imaging sonar (MSIS) and the robot dead-reckoning displacements estimated from a Doppler velocity log (DVL) and a motion reference unit (MRU). The proposed method utilizes two extended Kalman filters (EKF). The first, estimates the local path travelled by the robot while grabbing the scan as well as its uncertainty and provides position estimates for correcting the distortions that the vehicle motion produces in the acoustic images. The second is an augment state EKF that estimates and keeps the registered scans poses. The raw data from the sensors are processed and fused in-line. No priory structural information or initial pose are considered. The algorithm has been tested on an AUV guided along a 600 m path within a marina environment, showing the viability of the proposed approach
Resumo:
Los niños que presentan dolor abdominal agudo representan una de las principales demandas de atención en los servicios de Urgencias Pediátricas, convirtiéndose en un reto para quien realiza la valoración inicial la decisión de que paciente amerita realizar estudios adicionales, para descartar una patología quirúrgica. Considerando que la apendicitis es la primera emergencia quirúrgica en niños, y que sus complicaciones son un problema frecuente, es imprescindible que el clínico conozca cual es la utilidad y beneficio que le brindan las herramientas diagnósticas disponibles para realizar un diagnóstico más certero. Objetivo: Describir el manejo de los pacientes menores de 18 años con sospecha de apendicitis en la FCI y determinar la sensibilidad y especificidad de la ecografía en esta población. Métodos: Estudio de Prueba Diagnóstica. Revisión expedita de los niños valorados en la FCI con sospecha clínica de apendicitis aguda durante un periodo de 4 meses. Se tomaron datos de hallazgos ecográficos para apendicitis (positivos, negativos o indeterminados) y del desenlace de tener o no apendicitis. Posteriormente se realizó una inferencia mediante una tabla de contingencia, teniendo en cuenta intervalos de confianza para determinar la sensibilidad, especificidad, VPP, VPN y exactitud de la prueba basados en el diagnóstico final. Resultados: En 52% de pacientes que consultaron por dolor abdominal se sospecho patología quirúrgica, grupo en el que se utilizó la ecografía abdominal como herramienta diagnóstica para descartar apendicitis encontrando: sensibilidad: 63%(IC 95%, 48.6-75.5), especificidad: 82,7%(IC 95%, 76- 87.8) y exactitud: 78,2%(IC 95%, 72-83,4). La prevalencia de la enfermedad fue 22,8%, con probabilidad postprueba positiva de 88.3%(IC 95%, 82,1-92.6). Conclusiones: Los resultados de este estudio evidencian un menor rendimiento de la ecografía como prueba diagnóstica para apendicitis en pediatría en nuestro medio comparado al descrito en la literatura. Dicha discordancia probablemente se encuentra determinada por los sesgos propios de un estudio histórico como este, en el que básicamente la falta de uniformidad en el registro de datos en la historia clínica por los observadores (clínicos y radiólogos), así como las limitaciones para determinar el seguimiento de todos los pacientes que no fueron operados, pueden alterar la capacidad operativa real de la prueba. De ahí la importancia de realizar a futuro un estudio concurrente.