69 resultados para methods: data analysis
Resumo:
In this paper we examine the problem of compositional data from a different startingpoint. Chemical compositional data, as used in provenance studies on archaeologicalmaterials, will be approached from the measurement theory. The results will show, in avery intuitive way that chemical data can only be treated by using the approachdeveloped for compositional data. It will be shown that compositional data analysis is aparticular case in projective geometry, when the projective coordinates are in thepositive orthant, and they have the properties of logarithmic interval metrics. Moreover,it will be shown that this approach can be extended to a very large number ofapplications, including shape analysis. This will be exemplified with a case study inarchitecture of Early Christian churches dated back to the 5th-7th centuries AD
Resumo:
This analysis was stimulated by the real data analysis problem of householdexpenditure data. The full dataset contains expenditure data for a sample of 1224 households. The expenditure is broken down at 2 hierarchical levels: 9 major levels (e.g. housing, food, utilities etc.) and 92 minor levels. There are also 5 factors and 5 covariates at the household level. Not surprisingly, there are a small number of zeros at the major level, but many zeros at the minor level. The question is how best to model the zeros. Clearly, models that tryto add a small amount to the zero terms are not appropriate in general as at least some of the zeros are clearly structural, e.g. alcohol/tobacco for households that are teetotal. The key question then is how to build suitable conditional models. For example, is the sub-composition of spendingexcluding alcohol/tobacco similar for teetotal and non-teetotal households?In other words, we are looking for sub-compositional independence. Also, what determines whether a household is teetotal? Can we assume that it is independent of the composition? In general, whether teetotal will clearly depend on the household level variables, so we need to be able to model this dependence. The other tricky question is that with zeros on more than onecomponent, we need to be able to model dependence and independence of zeros on the different components. Lastly, while some zeros are structural, others may not be, for example, for expenditure on durables, it may be chance as to whether a particular household spends money on durableswithin the sample period. This would clearly be distinguishable if we had longitudinal data, but may still be distinguishable by looking at the distribution, on the assumption that random zeros will usually be for situations where any non-zero expenditure is not small.While this analysis is based on around economic data, the ideas carry over tomany other situations, including geological data, where minerals may be missing for structural reasons (similar to alcohol), or missing because they occur only in random regions which may be missed in a sample (similar to the durables)
Resumo:
As stated in Aitchison (1986), a proper study of relative variation in a compositional data set should be based on logratios, and dealing with logratios excludes dealing with zeros. Nevertheless, it is clear that zero observations might be present in real data sets, either because the corresponding part is completelyabsent –essential zeros– or because it is below detection limit –rounded zeros. Because the second kind of zeros is usually understood as “a trace too small to measure”, it seems reasonable to replace them by a suitable small value, and this has been the traditional approach. As stated, e.g. by Tauber (1999) and byMartín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2000), the principal problem in compositional data analysis is related to rounded zeros. One should be careful to use a replacement strategy that does not seriously distort the general structure of the data. In particular, the covariance structure of the involvedparts –and thus the metric properties– should be preserved, as otherwise further analysis on subpopulations could be misleading. Following this point of view, a non-parametric imputation method isintroduced in Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2000). This method is analyzed in depth by Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2003) where it is shown that thetheoretical drawbacks of the additive zero replacement method proposed in Aitchison (1986) can be overcome using a new multiplicative approach on the non-zero parts of a composition. The new approachhas reasonable properties from a compositional point of view. In particular, it is “natural” in the sense thatit recovers the “true” composition if replacement values are identical to the missing values, and it is coherent with the basic operations on the simplex. This coherence implies that the covariance structure of subcompositions with no zeros is preserved. As a generalization of the multiplicative replacement, in thesame paper a substitution method for missing values on compositional data sets is introduced
Resumo:
In a seminal paper, Aitchison and Lauder (1985) introduced classical kernel densityestimation techniques in the context of compositional data analysis. Indeed, they gavetwo options for the choice of the kernel to be used in the kernel estimator. One ofthese kernels is based on the use the alr transformation on the simplex SD jointly withthe normal distribution on RD-1. However, these authors themselves recognized thatthis method has some deficiencies. A method for overcoming these dificulties based onrecent developments for compositional data analysis and multivariate kernel estimationtheory, combining the ilr transformation with the use of the normal density with a fullbandwidth matrix, was recently proposed in Martín-Fernández, Chacón and Mateu-Figueras (2006). Here we present an extensive simulation study that compares bothmethods in practice, thus exploring the finite-sample behaviour of both estimators
Resumo:
The quantitative estimation of Sea Surface Temperatures from fossils assemblages is afundamental issue in palaeoclimatic and paleooceanographic investigations. TheModern Analogue Technique, a widely adopted method based on direct comparison offossil assemblages with modern coretop samples, was revised with the aim ofconforming it to compositional data analysis. The new CODAMAT method wasdeveloped by adopting the Aitchison metric as distance measure. Modern coretopdatasets are characterised by a large amount of zeros. The zero replacement was carriedout by adopting a Bayesian approach to the zero replacement, based on a posteriorestimation of the parameter of the multinomial distribution. The number of modernanalogues from which reconstructing the SST was determined by means of a multipleapproach by considering the Proxies correlation matrix, Standardized Residual Sum ofSquares and Mean Squared Distance. This new CODAMAT method was applied to theplanktonic foraminiferal assemblages of a core recovered in the Tyrrhenian Sea.Kew words: Modern analogues, Aitchison distance, Proxies correlation matrix,Standardized Residual Sum of Squares
Resumo:
Hydrogeological research usually includes some statistical studies devised to elucidate mean background state, characterise relationships among different hydrochemical parameters, and show the influence of human activities. These goals are achieved either by means of a statistical approach or by mixing modelsbetween end-members. Compositional data analysis has proved to be effective with the first approach, but there is no commonly accepted solution to the end-member problem in a compositional framework.We present here a possible solution based on factor analysis of compositions illustrated with a case study.We find two factors on the compositional bi-plot fitting two non-centered orthogonal axes to the most representative variables. Each one of these axes defines a subcomposition, grouping those variables thatlay nearest to it. With each subcomposition a log-contrast is computed and rewritten as an equilibrium equation. These two factors can be interpreted as the isometric log-ratio coordinates (ilr) of three hiddencomponents, that can be plotted in a ternary diagram. These hidden components might be interpreted as end-members.We have analysed 14 molarities in 31 sampling stations all along the Llobregat River and its tributaries, with a monthly measure during two years. We have obtained a bi-plot with a 57% of explained totalvariance, from which we have extracted two factors: factor G, reflecting geological background enhanced by potash mining; and factor A, essentially controlled by urban and/or farming wastewater. Graphicalrepresentation of these two factors allows us to identify three extreme samples, corresponding to pristine waters, potash mining influence and urban sewage influence. To confirm this, we have available analysisof diffused and widespread point sources identified in the area: springs, potash mining lixiviates, sewage, and fertilisers. Each one of these sources shows a clear link with one of the extreme samples, exceptfertilisers due to the heterogeneity of their composition.This approach is a useful tool to distinguish end-members, and characterise them, an issue generally difficult to solve. It is worth note that the end-member composition cannot be fully estimated but only characterised through log-ratio relationships among components. Moreover, the influence of each endmember in a given sample must be evaluated in relative terms of the other samples. These limitations areintrinsic to the relative nature of compositional data
Resumo:
Background: The objective of the present study was to compare three different sampling and questionnaire administration methods used in the international KIDSCREEN study in terms of participation, response rates, and external validity. Methods: Children and adolescents aged 8–18 years were surveyed in 13 European countries using either telephone sampling and mail administration, random sampling of school listings followed by classroom or mail administration, or multistage random sampling of communities and households with self-administration of the survey materials at home. Cooperation, completion, and response rates were compared across countries and survey methods. Data on non-respondents was collected in 8 countries. The population fraction (PF, respondents in each sex-age, or educational level category, divided by the population in the same category from Eurostat census data) and population fraction ratio (PFR, ratio of PF) and their corresponding 95% confidence intervals were used to analyze differences by country between the KIDSCREEN samples and a reference Eurostat population. Results: Response rates by country ranged from 18.9% to 91.2%. Response rates were highest in the school-based surveys (69.0%–91.2%). Sample proportions by age and gender were similar to the reference Eurostat population in most countries, although boys and adolescents were slightly underrepresented (PFR <1). Parents in lower educational categories were less likely to participate (PFR <1 in 5 countries). Parents in higher educational categories were overrepresented when the school and household sampling strategies were used (PFR = 1.78–2.97). Conclusion: School-based sampling achieved the highest overall response rates but also produced slightly more biased samples than the other methods. The results suggest that the samples were sufficiently representative to provide reference population values for the KIDSCREEN instrument.
Resumo:
The aim of this talk is to convince the reader that there are a lot of interesting statisticalproblems in presentday life science data analysis which seem ultimately connected withcompositional statistics.Key words: SAGE, cDNA microarrays, (1D-)NMR, virus quasispecies
Resumo:
Subcompositional coherence is a fundamental property of Aitchison s approach to compositional data analysis, and is the principal justification for using ratios of components. We maintain, however, that lack of subcompositional coherence, that is incoherence, can be measured in an attempt to evaluate whether any given technique is close enough, for all practical purposes, to being subcompositionally coherent. This opens up the field to alternative methods, which might be better suited to cope with problems such as data zeros and outliers, while being only slightly incoherent. The measure that we propose is based on the distance measure between components. We show that the two-part subcompositions, which appear to be the most sensitive to subcompositional incoherence, can be used to establish a distance matrix which can be directly compared with the pairwise distances in the full composition. The closeness of these two matrices can be quantified using a stress measure that is common in multidimensional scaling, providing a measure of subcompositional incoherence. The approach is illustrated using power-transformed correspondence analysis, which has already been shown to converge to log-ratio analysis as the power transform tends to zero.
Resumo:
In this paper we present a Bayesian image reconstruction algorithm with entropy prior (FMAPE) that uses a space-variant hyperparameter. The spatial variation of the hyperparameter allows different degrees of resolution in areas of different statistical characteristics, thus avoiding the large residuals resulting from algorithms that use a constant hyperparameter. In the first implementation of the algorithm, we begin by segmenting a Maximum Likelihood Estimator (MLE) reconstruction. The segmentation method is based on using a wavelet decomposition and a self-organizing neural network. The result is a predetermined number of extended regions plus a small region for each star or bright object. To assign a different value of the hyperparameter to each extended region and star, we use either feasibility tests or cross-validation methods. Once the set of hyperparameters is obtained, we carried out the final Bayesian reconstruction, leading to a reconstruction with decreased bias and excellent visual characteristics. The method has been applied to data from the non-refurbished Hubble Space Telescope. The method can be also applied to ground-based images.
Resumo:
Aim: To describe changes in leisure time and occupational physical activity status in an urban Mediterranean population-based cohort, and to evaluate sociodemographic, health-related and lifestyle correlates of such changes. Methods: Data for this study come from the Cornellè Health Interview Survey Follow-Up Study, a prospective cohort study of a representative sample (n¿=¿2500) of the population. Participants in the analysis reported here include 1246 subjects (567 men and 679 women) who had complete data on physical activity at the 1994 baseline survey and at the 2002 follow-up. We fitted Breslow-Cox regression models to assess the association between correlates of interest and changes in physical activity. Results: Regarding leisure time physical activity, 61.6% of cohort members with ¿sedentary¿ habits in 1994 changed their status to ¿light/moderate¿ physical activity in 2002, and 70% who had ¿light/moderate¿ habits in 1994 did not change their activity level. Regarding occupational physical activity, 74.4% of cohort members who were ¿active¿ did not change their level of activity, and 64.3% of participants with ¿sedentary¿ habits in 1994 changed to ¿active¿ occupational physical activity. No clear correlates of change in physical activity were identified in multivariate analyses. Conclusion: While changes in physical activity are evident in this population-based cohort, no clear determinants of such changes were recognised. Further longitudinal studies including other potential individual and contextual determinants are needed to better understand determinants of changes in physical activity at the population level.
Resumo:
Objective: To describe the methodology of Confirmatory Factor Analyis for categorical items and to apply this methodology to evaluate the factor structure and invariance of the WHO-Disability Assessment Schedule (WHODAS-II) questionnaire, developed by the World HealthOrganization.Methods: Data used for the analysis come from the European Study of Mental Disorders(ESEMeD), a cross-sectional interview to a representative sample of the general population of 6 european countries (n=8796). Respondents were administered a modified version of theWHODAS-II, that measures functional disability in the previous 30 days in 6 differentdimensions: Understanding and Communicating; Self-Care, Getting Around, Getting Along withOthers, Life Activities and Participation. The questionnaire includes two types of items: 22severity items (5 points likert) and 8 frequency items (continuous). An Exploratory factoranalysis (EFA) with promax rotation was conducted on a random 50% of the sample. Theremaining half of the sample was used to perform a Confirmatory Factor Analysis (CFA) inorder to compare three different models: (a) the model suggested by the results obtained in theEFA; (b) the theoretical model suggested by the WHO with 6 dimensions; (c) a reduced modelequivalent to model b where 4 of the frequency items are excluded. Moreover, a second orderfactor was also evaluated. Finally, a CFA with covariates was estimated in order to evaluatemeasurement invariance of the items between Mediterranean and non-mediterranean countries.Results: The solution that provided better results in the EFA was that containing 7 factors. Twoof the frequency items presented high factor loadings in the same factor, and one of thempresented factor loadings smaller than 0.3 with all the factors. With regard to the CFA, thereduced model (model c) presented the best goodness of fit results (CFI=0.992,TLI=0.996,RMSEA=0.024). The second order factor structure presented adequate goodness of fit (CFI=0.987,TLI=0.991, RMSEA=0.036). Measurement non-invariance was detected for one of the items of thequestionnaire (FD20 ¿ Embarrassment due to health problems).Conclusions: AFC confirmed the initial hypothesis about the factorial structure of the WHODAS-II in 6factors. The second order factor supports the existence of a global dimension of disability. The use of 4of the frequency items is not recommended in the scoring of the corresponding dimensions.
Resumo:
After a rockfall event, a usual post event survey includes qualitative volume estimation, trajectory mapping and determination of departing zones. However, quantitative measurements are not usually made. Additional relevant quantitative information could be useful in determining the spatial occurrence of rockfall events and help us in quantifying their size. Seismic measurements could be suitable for detection purposes since they are non invasive methods and are relatively inexpensive. Moreover, seismic techniques could provide important information on rockfall size and location of impacts. On 14 February 2007 the Avalanche Group of the University of Barcelona obtained the seismic data generated by an artificially triggered rockfall event at the Montserrat massif (near Barcelona, Spain) carried out in order to purge a slope. Two 3 component seismic stations were deployed in the area about 200 m from the explosion point that triggered the rockfall. Seismic signals and video images were simultaneously obtained. The initial volume of the rockfall was estimated to be 75 m3 by laser scanner data analysis. After the explosion, dozens of boulders ranging from 10¿4 to 5 m3 in volume impacted on the ground at different locations. The blocks fell down onto a terrace, 120 m below the release zone. The impact generated a small continuous mass movement composed of a mixture of rocks, sand and dust that ran down the slope and impacted on the road 60 m below. Time, time-frequency evolution and particle motion analysis of the seismic records and seismic energy estimation were performed. The results are as follows: 1 ¿ A rockfall event generates seismic signals with specific characteristics in the time domain; 2 ¿ the seismic signals generated by the mass movement show a time-frequency evolution different from that of other seismogenic sources (e.g. earthquakes, explosions or a single rock impact). This feature could be used for detection purposes; 3 ¿ particle motion plot analysis shows that the procedure to locate the rock impact using two stations is feasible; 4 ¿ The feasibility and validity of seismic methods for the detection of rockfall events, their localization and size determination are comfirmed.
Resumo:
The correlation between the species composition of pasture communities and soil properties in Plana de Vic has been studied using two multivariate methods, Correspondence Analysis (CA) for the vegetation data and Principal Component Analysis (PCA) for the soil data. To analyse the pastures, we took 144 vegetation relevés (comprising 201 species) that have been classified into 10 phytocoenological communities elsewhere. Most of these communities are almost entirely built up by perennials, ranging from xerophilous, clearly Mediterranean, to mesophilous, related to medium-European pastures, but a few occurring in shallow soils are dominated by therophytes. As for the soil properties, we analysed texture, pH, depth, bulk density, organic matter, C/N ratio and the carbonates content of 25 samples, correspondingto representative relevés of the communities studied.
Resumo:
The present study proposes a modification in one of the most frequently applied effect size procedures in single-case data analysis the percent of nonoverlapping data. In contrast to other techniques, the calculus and interpretation of this procedure is straightforward and it can be easily complemented by visual inspection of the graphed data. Although the percent of nonoverlapping data has been found to perform reasonably well in N = 1 data, the magnitude of effect estimates it yields can be distorted by trend and autocorrelation. Therefore, the data correction procedure focuses on removing the baseline trend from data prior to estimating the change produced in the behavior due to intervention. A simulation study is carried out in order to compare the original and the modified procedures in several experimental conditions. The results suggest that the new proposal is unaffected by trend and autocorrelation and can be used in case of unstable baselines and sequentially related measurements.