Digital Library

15 results for Point Data

in Consorci de Serveis Universitaris de Catalunya (CSUC), Spain

Exploration of geological variability and possible processes through the use of compositional data analysis: an example using Scottish metamorphosed

Relevance: 30.00%

Abstract:

Developments in the statistical analysis of compositional data over the last two decades have made possible a much deeper exploration of the nature of variability, and the possible processes associated with compositional data sets from many disciplines. In this paper we concentrate on geochemical data sets. First we explain how hypotheses of compositional variability may be formulated within the natural sample space, the unit simplex, including useful hypotheses of subcompositional discrimination and specific perturbational change. Then we develop through standard methodology, such as generalised likelihood ratio tests, statistical tools to allow the systematic investigation of a complete lattice of such hypotheses. Some of these tests are simple adaptations of existing multivariate tests but others require special construction. We comment on the use of graphical methods in compositional data analysis and on the ordination of specimens. The recent development of the concept of compositional processes is then explained together with the necessary tools for a staying-in-the-simplex approach, namely compositional singular value decompositions. All these statistical techniques are illustrated for a substantial compositional data set, consisting of 209 major-oxide and rare-element compositions of metamorphosed limestones from the Northeast and Central Highlands of Scotland. Finally we point out a number of unresolved problems in the statistical analysis of compositional processes.
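As a rough illustration of what working "within the unit simplex" can look like, the sketch below implements closure, perturbation and the centred log-ratio (clr) transform in plain Python. The function names and the three-part example composition are assumptions made for illustration; they are not taken from the paper.

```python
import math

def closure(parts, total=1.0):
    """Rescale strictly positive parts so that they sum to `total`."""
    s = sum(parts)
    return [total * p / s for p in parts]

def perturb(x, p):
    """Perturbation: component-wise product of two compositions, then closure."""
    return closure([xi * pi for xi, pi in zip(x, p)])

def clr(x):
    """Centred log-ratio: log of each part relative to the geometric mean."""
    logs = [math.log(xi) for xi in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

# Hypothetical three-part composition and perturbation (illustrative values only).
x = closure([48.0, 32.0, 20.0])
p = closure([1.0, 2.0, 1.0])
y = perturb(x, p)

print(y)        # perturbed composition, still summing to 1
print(clr(y))   # equals clr(x) + clr(p) component-wise
```

In clr coordinates a perturbation acts as an ordinary translation, which is one reason perturbational change is a natural hypothesis to test in this setting.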

Markov chain Monte Carlo method applied to rounding zeros of compositional data: first approach

Relevance: 30.00%

Abstract:

As stated in Aitchison (1986), a proper study of relative variation in a compositional data set should be based on logratios, and dealing with logratios excludes dealing with zeros. Nevertheless, it is clear that zero observations might be present in real data sets, either because the corresponding part is completely absent (essential zeros) or because it is below detection limit (rounded zeros). Because the second kind of zeros is usually understood as “a trace too small to measure”, it seems reasonable to replace them by a suitable small value, and this has been the traditional approach. As stated, e.g., by Tauber (1999) and by Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2000), the principal problem in compositional data analysis is related to rounded zeros. One should be careful to use a replacement strategy that does not seriously distort the general structure of the data. In particular, the covariance structure of the involved parts, and thus the metric properties, should be preserved, as otherwise further analysis on subpopulations could be misleading. Following this point of view, a non-parametric imputation method is introduced in Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2000). This method is analyzed in depth by Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2003), where it is shown that the theoretical drawbacks of the additive zero replacement method proposed in Aitchison (1986) can be overcome using a new multiplicative approach on the non-zero parts of a composition. The new approach has reasonable properties from a compositional point of view. In particular, it is “natural” in the sense that it recovers the “true” composition if replacement values are identical to the missing values, and it is coherent with the basic operations on the simplex. This coherence implies that the covariance structure of subcompositions with no zeros is preserved. As a generalization of the multiplicative replacement, in the same paper a substitution method for missing values on compositional data sets is introduced.
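The multiplicative replacement described above can be sketched in a few lines of Python. This is a minimal sketch assuming a single replacement value `delta` for every rounded zero and a composition closed to `total`. The function name, the value 0.005 and the example composition are illustrative assumptions, not values from the cited papers.

```python
def multiplicative_replacement(x, delta, total=1.0):
    """Replace each zero by `delta` and shrink the non-zero parts by a common
    factor, so the composition still sums to `total`."""
    n_zeros = sum(1 for value in x if value == 0)
    shrink = 1.0 - n_zeros * delta / total
    if shrink <= 0:
        raise ValueError("delta is too large for the number of zeros")
    return [delta if value == 0 else value * shrink for value in x]

# Hypothetical composition with one rounded zero (illustrative values only).
x = [0.6, 0.0, 0.3, 0.1]
r = multiplicative_replacement(x, delta=0.005)

print(r)             # zero replaced, other parts slightly reduced
print(r[0] / r[2])   # ratio between non-zero parts is unchanged (about 2)
```

Because every non-zero part is scaled by the same factor, all ratios among the non-zero parts are unchanged, which is why the covariance structure of zero-free subcompositions survives the replacement.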

Some last thoughts on compositional data analysis

Relevance: 30.00%

Abstract:

One of the disadvantages of old age is that there is more past than future: this, however, may be turned into an advantage if the wealth of experience and, hopefully, wisdom gained in the past can be reflected upon and throw some light on possible future trends. To an extent, then, this talk is necessarily personal, certainly nostalgic, but also self-critical and inquisitive about our understanding of the discipline of statistics. A number of almost philosophical themes will run through the talk: search for appropriate modelling in relation to the real problem envisaged, emphasis on sensible balances between simplicity and complexity, the relative roles of theory and practice, the nature of communication of inferential ideas to the statistical layman, the inter-related roles of teaching, consultation and research. A list of keywords might be: identification of sample space and its mathematical structure, choices between transform and stay, the role of parametric modelling, the role of a sample space metric, the underused hypothesis lattice, the nature of compositional change, particularly in relation to the modelling of processes. While the main theme will be relevance to compositional data analysis we shall point to substantial implications for general multivariate analysis arising from experience of the development of compositional data analysis…

Evaluating predictive densities of U.S. output growth and inflation in a large macroeconomic data set

Relevance: 30.00%

Abstract:

We evaluate conditional predictive densities for U.S. output growth and inflation using a number of commonly used forecasting models that rely on a large number of macroeconomic predictors. More specifically, we evaluate how well conditional predictive densities based on the commonly used normality assumption fit actual realizations out-of-sample. Our focus on predictive densities acknowledges the possibility that, although some predictors can improve or deteriorate point forecasts, they might have the opposite effect on higher moments. We find that normality is rejected for most models in some dimension according to at least one of the tests we use. Interestingly, however, combinations of predictive densities appear to be correctly approximated by a normal density: the simple, equal average when predicting output growth and Bayesian model average when predicting inflation.

An analysis of the accounting principles applied by the European Farm Accountancy Data Network

Relevance: 30.00%

Abstract:

In spite of its relative importance in the economy of many countries and its growing interrelationships with other sectors, agriculture has traditionally been excluded from accounting standards. Nevertheless, to support its Common Agricultural Policy, for years the European Commission has been making an effort to obtain standardized information on the financial performance and condition of farms. Through the Farm Accountancy Data Network (FADN), every year data are gathered from a rotating sample of 60,000 professional farms across all member states. FADN data collection is not structured as an accounting cycle but as an extensive questionnaire. This questionnaire refers to assets, liabilities, revenues and expenses, and seems to try to obtain a "true and fair view" of the financial performance and condition of the farms it surveys. However, the definitions used in the questionnaire and the way data are aggregated often appear flawed from an accounting perspective. The objective of this paper is to contrast the accounting principles implicit in the FADN questionnaire with generally accepted accounting principles, particularly those found in the IVth Directive of the European Union, on the one hand, and those recently proposed by the International Accounting Standards Committee’s Steering Committee on Agriculture in its Draft Statement of Principles, on the other hand. There are two reasons why this is useful. First, it allows suggestions to be made on how the information provided by FADN could be brought more into accordance with the accepted accounting framework and become a more valuable tool for policy makers, farmers, and other stakeholders. Second, it helps assess the suitability of FADN to become the starting point for a European accounting standard on agriculture.

Random assignment of intervention points in two phase single-case designs: data-division-specific distributions

Relevance: 30.00%

Abstract:

The present study explores the statistical properties of a randomization test based on the random assignment of the intervention point in a two-phase (AB) single-case design. The focus is on randomization distributions constructed with the values of the test statistic for all possible random assignments and used to obtain p-values. The shape of those distributions is investigated for each specific data division defined by the moment at which the intervention is introduced. Another aim of the study was to test the detection of nonexistent effects (i.e., production of false alarms) in autocorrelated data series, in which the assumption of exchangeability between observations may be untenable. In this way, it was possible to compare nominal and empirical Type I error rates in order to obtain evidence on the statistical validity of the randomization test for each individual data division. The results suggest that when either of the two phases has considerably fewer measurement times, Type I errors may be too probable and, hence, the decision-making process to be carried out by applied researchers may be jeopardized.
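The construction of a randomization distribution over all admissible intervention points can be sketched as follows. This is a minimal sketch under stated assumptions: the test statistic (mean of phase B minus mean of phase A), the minimum phase length of three measurements and the data series are all hypothetical choices for illustration, not the settings used in the study.

```python
def mean(values):
    return sum(values) / len(values)

def ab_statistic(series, start_b):
    """Mean of phase B minus mean of phase A; phase B starts at index start_b."""
    return mean(series[start_b:]) - mean(series[:start_b])

def randomization_p_value(series, observed_start, min_phase_len=3):
    """One-sided p-value: share of admissible intervention points whose
    statistic is at least as large as the one actually observed."""
    candidates = range(min_phase_len, len(series) - min_phase_len + 1)
    if observed_start not in candidates:
        raise ValueError("observed_start is not an admissible intervention point")
    observed = ab_statistic(series, observed_start)
    distribution = [ab_statistic(series, start) for start in candidates]
    as_extreme = sum(1 for value in distribution if value >= observed)
    return as_extreme / len(distribution)

# Hypothetical 12-point series with the intervention introduced at index 6.
series = [3, 4, 3, 5, 4, 4, 8, 9, 8, 9, 10, 9]
p = randomization_p_value(series, observed_start=6)
print(p)  # 1/7: the observed division gives the largest of the 7 statistics
```

With only seven admissible intervention points the smallest attainable p-value is 1/7, which illustrates why the number and position of possible data divisions matter for the validity of the test.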

GSVA: gene set variation analysis for microarray and RNA-seq data

Relevance: 30.00%

Abstract:

Gene set enrichment (GSE) analysis is a popular framework for condensing information from gene expression profiles into a pathway or signature summary. The strengths of this approach over single gene analysis include noise and dimension reduction, as well as greater biological interpretability. As molecular profiling experiments move beyond simple case-control studies, robust and flexible GSE methodologies are needed that can model pathway activity within highly heterogeneous data sets. To address this challenge, we introduce Gene Set Variation Analysis (GSVA), a GSE method that estimates variation of pathway activity over a sample population in an unsupervised manner. We demonstrate the robustness of GSVA in a comparison with current state-of-the-art sample-wise enrichment methods. Further, we provide examples of its utility in differential pathway activity and survival analysis. Lastly, we show how GSVA works analogously with data from both microarray and RNA-seq experiments. GSVA provides increased power to detect subtle pathway activity changes over a sample population in comparison to corresponding methods. While GSE methods are generally regarded as end points of a bioinformatic analysis, GSVA constitutes a starting point to build pathway-centric models of biology. Moreover, GSVA contributes to the current need of GSE methods for RNA-seq data. GSVA is an open source software package for R which forms part of the Bioconductor project and can be downloaded at http://www.bioconductor.org.

Modeling the Emergence of Social Structure from a Phylogenetic Point of View

Relevance: 30.00%

Abstract:

Based on previous work (Hemelrijk 1998; Puga-González, Hildenbrant & Hemelrijk 2009), we have developed an agent-based model and software, called A-KinGDom, which allows us to simulate the emergence of the social structure in a group of non-human primates. The model includes dominance and affiliative interactions and incorporates two main innovations (preliminary dominance interactions and a kinship factor), which allow us to define four different attack and affiliative strategies. In accordance with these strategies, we compared the data obtained under four simulation conditions with the results obtained in a previous study (Dolado & Beltran 2012) involving empirical observations of a captive group of mangabeys (Cercocebus torquatus).

Kernel-PCA data integration with enhanced interpretability

Relevance: 30.00%

Abstract:

Background: Combining different sources of information to improve the available biological knowledge is a current challenge in bioinformatics. Kernel-based methods are among the most powerful approaches for integrating heterogeneous data types. Kernel-based data integration approaches consist of two basic steps: first, the right kernel is chosen for each data set; second, the kernels from the different data sources are combined to give a complete representation of the available data for a given statistical task. Results: We analyze the integration of data from several sources of information using kernel PCA, from the point of view of reducing dimensionality. Moreover, we improve the interpretability of kernel PCA by adding to the plot the representation of the input variables that belong to any dataset. In particular, for each input variable or linear combination of input variables, we can represent the direction of maximum growth locally, which allows us to identify those samples with higher/lower values of the variables analyzed. Conclusions: The integration of different datasets and the simultaneous representation of samples and variables together give us a better understanding of biological knowledge.

RiskDiff: A web tool for the analysis of the difference due to risk and demographic factors for incidence or mortality data.

Relevance: 30.00%

Abstract:

Background: Analysing the observed differences in incidence or mortality of a particular disease between two different situations (such as time points, geographical areas, gender or other social characteristics) can be useful for both scientific and administrative purposes. From an epidemiological and public health point of view, it is of great interest to assess the effect of demographic factors on these observed differences in order to elucidate the effect of the risk of developing a disease or dying from it. The method proposed by Bashir and Estève, which splits the observed variation into three components (risk, population structure and population size), is a common choice in practice. Results: A web-based application, called RiskDiff, has been implemented (available at http://rht.iconcologia.net/riskdiff.htm) to perform this kind of statistical analysis, providing text and graphical summaries. Code from the implemented functions in R is also provided. An application to cancer mortality data from Catalonia is used for illustration. Conclusions: Combining epidemiological with demographic factors is crucial for analysing incidence or mortality from a disease, especially if the population pyramids show substantial differences. The tool implemented may serve to promote and disseminate the use of this method to give advice for epidemiologic interpretation and decision making in public health.

Short communication. Persistence of point crop yield subjective estimates

Relevance: 30.00%

Abstract:

This work investigates the persistence of subjective point estimates of annual crop yields made by a large group of farmers. Persistence over time is a necessary condition for the coherence and reliability of subjective estimates of random variables. The interviewees estimated point values of annual crop yields (mean, highest, lowest and most frequent yields). Only minor relative differences were found in all variables except the lowest yields, which show high dispersion. The results are of interest for assessing the suitability of subjective probability estimation techniques for use in decision support systems in agriculture.

Another Look at the Null of Stationary Real Exchange Rates: Panel Data with Structural Breaks and Cross-section Dependence

Relevance: 30.00%

Abstract:

This paper re-examines the null of stationarity of the real exchange rate for a panel of seventeen developed OECD countries during the post-Bretton Woods era. Our analysis simultaneously considers both cross-section dependence and multiple structural breaks, which have not received much attention in previous panel methods of long-run PPP. Empirical results indicate that there is little evidence in favor of the PPP hypothesis when the analysis does not account for structural breaks. This conclusion is reversed when structural breaks are considered in the computation of the panel statistics. We also compute point estimates of the half-life separately for the idiosyncratic and common factor components and find that it is always below one year.
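The abstract does not state how the half-life is computed. A common convention, assumed here, treats the deviation from PPP as an AR(1) process with persistence rho, so that the half-life is ln(0.5)/ln(rho) periods. The sketch below uses that convention; the function name and the rho values are illustrative assumptions, not estimates from the paper.

```python
import math

def half_life(rho, periods_per_year=1):
    """Half-life (in years) of a shock under an assumed AR(1) process
    y_t = rho * y_{t-1} + e_t, observed `periods_per_year` times a year."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    periods = math.log(0.5) / math.log(rho)
    return periods / periods_per_year

# Hypothetical persistence values (illustrative only).
print(half_life(0.5))                        # 1.0 period
print(half_life(0.9, periods_per_year=12))   # monthly data: about 0.55 years
```

Dividing by the observation frequency matters: the same persistence of 0.9 means a half-life of roughly 6.6 periods, which is under one year with monthly data but over six years with annual data.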

Extending the roughness of the data via transitive closures of similarity indexes

Relevance: 30.00%

Abstract:

One main assumption in the theory of rough sets applied to information tables is that elements exhibiting the same information are indiscernible (similar) and form blocks that can be understood as elementary granules of knowledge about the universe. We propose a variant of this concept that defines a measure of similarity between the elements of the universe, so that two objects can be considered indiscernible even though they do not share all attribute values, because the knowledge is partial or uncertain. The set of similarities defines the matrix of a fuzzy relation that satisfies reflexivity and symmetry but not transitivity, so a partition of the universe is not attained. This problem can be solved by calculating its transitive closure, which ensures a partition for each level belonging to the unit interval [0,1]. This procedure allows the theory of rough sets to be generalized depending on the minimum level of similarity accepted. This new point of view increases the rough character of the data because it enlarges the set of indiscernible objects. Finally, we apply our results to an artificial example in order to highlight the differences and improvements between this methodology and the classical one.
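The closure step can be sketched with the usual max-min composition, which is assumed here because the abstract does not name the composition used. The similarity matrix, the function names and the chosen levels are hypothetical and serve only to show how each level of the closed relation yields a partition.

```python
def max_min_compose(r, s):
    """Max-min composition of two square fuzzy relations."""
    n = len(r)
    return [[max(min(r[i][k], s[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]

def transitive_closure(relation):
    """Repeat R <- max(R, R o R) until the relation stops changing."""
    current = [row[:] for row in relation]
    while True:
        composed = max_min_compose(current, current)
        updated = [[max(a, b) for a, b in zip(row_cur, row_comp)]
                   for row_cur, row_comp in zip(current, composed)]
        if updated == current:
            return current
        current = updated

def partition_at_level(closed, alpha):
    """Blocks of objects whose closed similarity is at least alpha."""
    blocks, assigned = [], set()
    for i in range(len(closed)):
        if i in assigned:
            continue
        block = [j for j in range(len(closed)) if closed[i][j] >= alpha]
        assigned.update(block)
        blocks.append(block)
    return blocks

# Hypothetical reflexive, symmetric but non-transitive similarity matrix.
similarity = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.6],
    [0.2, 0.6, 1.0],
]
closed = transitive_closure(similarity)
print(closed[0][2])                     # 0.6, raised from 0.2 by the closure
print(partition_at_level(closed, 0.7))  # [[0, 1], [2]]
print(partition_at_level(closed, 0.5))  # [[0, 1, 2]]
```

Lowering the accepted similarity level merges blocks, which is the sense in which the set of indiscernible objects grows.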

Comparison of a deterministic and a data driven model to describe MBR fouling

Relevance: 30.00%

Abstract:

Membrane bioreactors (MBRs) are a combination of activated sludge bioreactors and membrane filtration, enabling high quality effluent with a small footprint. However, they can be beset by fouling, which causes an increase in transmembrane pressure (TMP). Modelling and simulation of changes in TMP could be useful to describe fouling through the identification of the most relevant operating conditions. Using experimental data from an MBR pilot plant operated for 462 days, two different models were developed: a deterministic model using Activated Sludge Model No. 2d (ASM2d) for the biological component and a resistance-in-series model for the filtration component, as well as a data-driven model based on multivariable regressions. Once validated, these models were used to describe membrane fouling (as changes in TMP over time) under different operating conditions. The deterministic model performed better at higher temperatures (>20 °C), constant operating conditions (DO set-point, membrane air-flow, pH and ORP), and high mixed liquor suspended solids (>6.9 g L⁻¹) and flux changes. At low pH (<7) or periods with higher pH changes, the data-driven model was more accurate. Changes in the DO set-point of the aerobic reactor that affected the TMP were also better described by the data-driven model. By combining the use of both models, a better description of fouling can be achieved under different operating conditions.