924 results for Automatic Analysis of Multivariate Categorical Data Sets
Abstract:
Presents constructs from classification theory and relates them to the study of hashtags and other forms of tags in social media data. Argues that these constructs are useful to the study of the intersectionality of race, gender, and sexuality. Closes with an introduction to a historical case study from Amazon.com.
Abstract:
A compositional multivariate approach is used to analyse regional-scale soil geochemical data obtained as part of the Tellus Project generated by the Geological Survey of Northern Ireland (GSNI). The multi-element total concentration data presented comprise XRF analyses of 6862 rural soil samples collected at 20 cm depth on a non-aligned grid at one site per 2 km². Censored data were imputed using published detection limits. Using these imputed values for 46 elements (including LOI), each soil sample site was assigned to the regional geology map provided by GSNI, initially using the dominant lithology for the map polygon. Northern Ireland includes a diversity of geology representing a stratigraphic record from the Mesoproterozoic up to and including the Palaeogene. However, the advance of ice sheets and their meltwaters over the last 100,000 years has left at least 80% of the bedrock covered by superficial deposits, including glacial till and post-glacial alluvium and peat. The question is to what extent the soil geochemistry reflects the underlying geology or the superficial deposits. To address this, the geochemical data were transformed using centred log ratios (clr) to observe the requirements of compositional data analysis and avoid closure issues. Following this, compositional multivariate techniques, including compositional principal component analysis (PCA) and minimum/maximum autocorrelation factor (MAF) analysis, were used to determine the influence of underlying geology on the soil geochemistry signature. PCA showed that 72% of the variation was captured by the first four principal components (PCs), implying “significant” structure in the data. Analysis of variance showed that only 10 PCs were necessary to classify the soil geochemical data. To improve on PCA by exploiting the spatial relationships of the data, a classification based on MAF analysis was undertaken using the first 6 dominant factors.
Understanding the relationship between soil geochemistry and superficial deposits is important for environmental monitoring of fragile ecosystems such as peat. To explore whether peat cover could be predicted from the classification, the lithology designation was adapted to include the presence of peat, based on GSNI superficial deposit polygons, and linear discriminant analysis (LDA) was undertaken. Prediction accuracy for the LDA classification improved from 60.98% based on PCA using 10 principal components to 64.73% using MAF based on the 6 most dominant factors. The misclassification of peat may reflect degradation of peat-covered areas since the creation of the superficial deposit classification. Further work will examine the influence of underlying lithologies on elemental concentrations in peat composition and the effect of this on the classification analysis.
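The core of the workflow described above, a clr transformation followed by PCA, can be sketched in a few lines. This is a minimal illustration on simulated positive concentrations; the sample size, number of elements and values are made up, not the Tellus/GSNI data:

```python
import numpy as np

# Simulated positive concentration data: rows are soil samples, columns are
# elements (illustrative only, not the Tellus measurements).
rng = np.random.default_rng(0)
X = rng.lognormal(mean=2.0, sigma=0.5, size=(200, 8))

# Centred log-ratio (clr) transform: log of each part relative to the
# geometric mean of its composition, removing the closure constraint.
gmean = np.exp(np.log(X).mean(axis=1, keepdims=True))
clr = np.log(X / gmean)

# PCA via SVD of the column-centred clr matrix.
Z = clr - clr.mean(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / (s**2).sum()   # proportion of variance per component
scores = U * s                     # principal component scores per sample
```

Because each clr row sums to zero, one singular value is numerically zero, so a D-part composition yields at most D-1 informative components.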
Abstract:
This paper applies two measures to assess spillovers across markets: the Diebold and Yilmaz (2012) spillover index and the Hafner and Herwartz (2006) analysis of multivariate GARCH models using volatility impulse response functions (VIRFs). We use two data sets: daily realized volatility estimates taken from the Oxford-Man RV library, running from the beginning of 2000 to October 2016, for the S&P500 and the FTSE, plus ten years of daily returns for the New York Stock Exchange Index and the FTSE 100 index, from 3 January 2005 to 31 January 2015. Both data sets capture both the Global Financial Crisis (GFC) and the subsequent European Sovereign Debt Crisis (ESDC). The spillover index captures the transmission of volatility to and from markets, plus net spillovers. The key difference between the measures is that the spillover index captures an average of spillovers over a period, whilst VIRFs have to be calibrated to conditional volatility estimated at a particular point in time. The VIRFs provide information about the impact of independent shocks on volatility. In the latter analysis, we explore the impact of three different shocks: the onset of the GFC, which we date as 9 August 2007 (GFC1); the point at which the crisis came to a head a year later, on 15 September 2008 (GFC2); and the ESDC shock of 9 May 2010. Our modelling includes leverage and asymmetric effects, undertaken in the context of a multivariate GARCH model and analysed using both BEKK and diagonal BEKK (DBEKK) specifications. A key result is that the impact of negative shocks is larger, in terms of the effects on variances and covariances, but shorter in duration, in this case a difference between three and six months.
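For intuition, the total spillover index is built from a VAR's generalized forecast-error variance decomposition (GFEVD): each market's forecast-error variance is apportioned between own shocks and shocks from other markets, and the off-diagonal share is the spillover. The following numpy sketch uses a bivariate VAR(1) with made-up coefficients, not estimates from the S&P500/FTSE data, and does not reproduce the paper's VIRF machinery:

```python
import numpy as np

A = np.array([[0.5, 0.2],
              [0.1, 0.4]])          # illustrative VAR(1) coefficient matrix
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])      # illustrative residual covariance

H = 10                              # forecast horizon
k = A.shape[0]

# Moving-average coefficients A_h = A^h for a VAR(1).
Ah = [np.linalg.matrix_power(A, h) for h in range(H)]

num = np.zeros((k, k))
den = np.zeros(k)
for h in range(H):
    P = Ah[h] @ Sigma
    num += (P**2) / np.diag(Sigma)  # (e_i' A_h Sigma e_j)^2 / sigma_jj
    den += np.diag(P @ Ah[h].T)     # e_i' A_h Sigma A_h' e_i

theta = num / den[:, None]
theta = theta / theta.sum(axis=1, keepdims=True)  # row-normalise (DY 2012)

# Total spillover index: percentage of H-step forecast-error variance
# attributable to shocks originating in other variables.
spillover = 100.0 * (theta.sum() - np.trace(theta)) / k
```

Row sums of `theta` equal one by construction, so directional "to" and "from" spillovers fall out of the same table as column and row off-diagonal sums.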
Abstract:
3rd SMTDA Conference Proceedings, 11-14 June 2014, Lisbon, Portugal.
Abstract:
When continuous data are coded to categorical variables, two types of coding are possible: crisp coding in the form of indicator, or dummy, variables with values either 0 or 1; or fuzzy coding, where each observation is transformed to a set of "degrees of membership" between 0 and 1, using so-called membership functions. It is well known that the correspondence analysis of crisp coded data, namely multiple correspondence analysis, yields principal inertias (eigenvalues) that considerably underestimate the quality of the solution in a low-dimensional space. Since the crisp data only code the categories to which each individual case belongs, an alternative measure of fit is simply to count how well these categories are predicted by the solution. Another approach is to consider multiple correspondence analysis equivalently as the analysis of the Burt matrix (i.e., the matrix of all two-way cross-tabulations of the categorical variables), and then perform a joint correspondence analysis to fit just the off-diagonal tables of the Burt matrix; the measure of fit is then computed as the quality of explaining these tables only. The correspondence analysis of fuzzy coded data, called "fuzzy multiple correspondence analysis", suffers from the same problem, albeit attenuated. Again, one can count how many correct predictions are made of the categories with the highest degree of membership. But here one can also defuzzify the results of the analysis to obtain estimated values of the original data, and then calculate a measure of fit in the familiar percentage form, thanks to the resultant orthogonal decomposition of variance. Furthermore, if one thinks of fuzzy multiple correspondence analysis as explaining the two-way associations between variables, a fuzzy Burt matrix can be computed and the same strategy as in the crisp case can be applied to analyse the off-diagonal part of this matrix.
In this paper, these alternative measures of fit are defined and applied to a data set of continuous meteorological variables, which are coded crisply and fuzzily into three categories. Measuring the fit is further discussed when the data set consists of a mixture of discrete and continuous variables.
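The contrast between crisp and fuzzy coding into three categories, and the defuzzification step that recovers the original values, can be illustrated with piecewise-linear (triangular) membership functions. The hinge values below are arbitrary, not those used for the meteorological data:

```python
import numpy as np

def fuzzy_code(x, low, mid, high):
    """Fuzzy-code a continuous variable into three categories using
    triangular membership functions hinged at low, mid and high; the
    three degrees of membership of each observation sum to 1."""
    x = np.asarray(x, dtype=float)
    m = np.zeros((x.size, 3))
    m[:, 0] = np.clip((mid - x) / (mid - low), 0.0, 1.0)   # 'low'
    m[:, 2] = np.clip((x - mid) / (high - mid), 0.0, 1.0)  # 'high'
    m[:, 1] = 1.0 - m[:, 0] - m[:, 2]                      # 'medium'
    return m

def crisp_code(x, low, mid, high):
    """Crisp (indicator/dummy) coding: a single 1 in the category with
    the highest degree of membership, 0 elsewhere."""
    m = fuzzy_code(x, low, mid, high)
    out = np.zeros_like(m, dtype=int)
    out[np.arange(m.shape[0]), m.argmax(axis=1)] = 1
    return out

# Defuzzification: membership-weighted hinge values recover x exactly
# for any x inside [low, high].
x = np.array([1.0, 3.0, 4.5, 6.0])
M = fuzzy_code(x, 1.0, 3.5, 6.0)
x_hat = M @ np.array([1.0, 3.5, 6.0])
```

The exact recovery of `x` by defuzzification is the property that makes a percentage-of-variance fit measure possible in the fuzzy case, whereas crisp coding discards the within-category information.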
Abstract:
A biplot, which is the multivariate generalization of the two-variable scatterplot, can be used to visualize the results of many multivariate techniques, especially those that are based on the singular value decomposition. We consider data sets consisting of continuous-scale measurements, their fuzzy coding and the biplots that visualize them, using a fuzzy version of multiple correspondence analysis. Of special interest is the way the quality of fit of the biplot is measured, since it is well known that regular (i.e., crisp) multiple correspondence analysis seriously underestimates this measure. We show how the results of fuzzy multiple correspondence analysis can be defuzzified to obtain estimated values of the original data, and prove that this implies an orthogonal decomposition of variance. This permits a measure of fit to be calculated in the familiar form of a percentage of explained variance, which is directly comparable to the corresponding fit measure used in principal component analysis of the original data. The approach is motivated initially by its application to a simulated data set, showing how the fuzzy approach can lead to diagnosing nonlinear relationships, and finally it is applied to a real set of meteorological data.
Abstract:
This article gives an overview of the methods used in the low-level analysis of gene expression data generated using DNA microarrays. This type of experiment makes it possible to determine relative levels of nucleic acid abundance in a set of tissues or cell populations for thousands of transcripts or loci simultaneously. Careful statistical design and analysis are essential to improve the efficiency and reliability of microarray experiments throughout the data acquisition and analysis process. This includes the design of probes, the experimental design, the image analysis of scanned microarray images, the normalization of fluorescence intensities, the assessment of the quality of microarray data and the incorporation of quality information in subsequent analyses, the combination of information across arrays and across sets of experiments, the discovery and recognition of patterns in expression at the single-gene and multiple-gene levels, and the assessment of the significance of these findings, given the substantial noise, and hence random features, in the data. For all of these components, access to a flexible and efficient statistical computing environment is essential.
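As one concrete example of the normalization step, quantile normalization is a common choice for making fluorescence intensities comparable across arrays. The article surveys methods generally; this particular method and the toy values are an illustrative pick, not the article's prescription:

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalise the columns of X (one column per array): every
    array is forced to share the same empirical intensity distribution,
    namely the mean of the sorted values across arrays at each rank."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)   # within-column ranks
    means = np.sort(X, axis=0).mean(axis=1)             # mean value per rank
    return means[ranks]

# Toy intensities: 3 probes x 2 arrays (illustrative values only).
X = np.array([[5.0, 4.0],
              [2.0, 1.0],
              [3.0, 4.5]])
Q = quantile_normalize(X)
```

After normalization every column has exactly the same set of values; only the probe-to-value assignment, given by each array's own ranking, differs.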
Abstract:
Many seemingly disparate approaches for marginal modeling have been developed in recent years. We demonstrate that many current approaches for marginal modeling of correlated binary outcomes produce likelihoods that are equivalent to those of the copula-based models proposed herein. These general copula models of underlying latent threshold random variables yield likelihood-based models for marginal fixed effects estimation and interpretation in the analysis of correlated binary data. Moreover, we propose a nomenclature and a set of model relationships that substantially elucidate the complex area of marginalized models for binary data. A diverse collection of didactic mathematical and numerical examples is given to illustrate the concepts.
Abstract:
Historically, few articles have addressed the use of district-level mill production data for analysing the effect of varietal change on sugarcane productivity trends. This appears to be due to a lack of compiled district data sets and of appropriate methods by which to analyse these data. Recently, varietal data on tonnes of sugarcane per hectare (TCH), sugar content (CCS), and their product, tonnes of sugar content per hectare (TSH), on a district basis have been compiled. This study was conducted to develop a methodology for regular analysis of such data from mill districts to assess productivity trends over time, accounting for variety and variety × environment interaction effects for 3 mill districts (Mulgrave, Babinda, and Tully) from 1958 to 1995. Restricted maximum likelihood methodology was used to analyse the district-level data, and best linear unbiased predictors for random effects and best linear unbiased estimates for fixed effects were computed in a mixed model analysis. In the combined analysis over districts, Q124 was the top-ranking variety for TCH, and Q120 was top-ranking for both CCS and TSH. Overall production for TCH increased over the 38-year period investigated. Some of this increase can be attributed to varietal improvement, although the predictors for TCH have shown little progress since the introduction of Q99 in 1976. Although smaller gains have been made in varietal improvement for CCS, overall production for CCS decreased over the 38 years due to non-varietal factors. Varietal improvement in TSH appears to have peaked in the mid-1980s. Overall production for TSH remained stable over time due to the varietal increase in TCH and the non-varietal decrease in CCS.
Abstract:
Cork stopper manufacturing includes an operation, known as stabilisation, by which humid cork slabs are extensively colonised by fungi. The effects of fungal growth on cork are yet to be completely understood and are considered to be involved in the so-called “cork taint” of bottled wine. It is essential to identify the environmental constraints which define the appearance of the colonising fungal species and to trace their origin to the forest and/or to residents of the manufacturing space. The present article correlates two sets of data from systematic biological sampling of two manufacturing units, located in the North and South of Portugal, in consecutive years and the same season. Dominance of Chrysonilia sitophila was identified, followed by a high diversity of Penicillium species. Penicillium glabrum, found in all samples, was the most frequently isolated species. P. glabrum intra-species variability was investigated using DNA fingerprinting techniques, revealing highly discriminative polymorphic markers in the genome. Cluster analysis of the P. glabrum data was discussed in relation to the geographical location of strains, and the results suggest that P. glabrum arises predominantly from the manufacturing space, although cork-resident fungi can also contribute.
Abstract:
Meta-analysis of genome-wide association studies (GWASs) has led to the discovery of many common variants associated with complex human diseases. There is a growing recognition that identifying "causal" rare variants also requires large-scale meta-analysis. The fact that association tests with rare variants are performed at the gene level rather than at the variant level poses unprecedented challenges in the meta-analysis. First, different studies may adopt different gene-level tests, so the results are not compatible. Second, gene-level tests require multivariate statistics (i.e., components of the test statistic and their covariance matrix), which are difficult to obtain. To overcome these challenges, we propose to perform gene-level tests for rare variants by combining the results of single-variant analysis (i.e., p values of association tests and effect estimates) from participating studies. This simple strategy is possible because of the insight that multivariate statistics can be recovered from single-variant statistics together with the correlation matrix of the single-variant test statistics, which can be estimated from one of the participating studies or from a publicly available database. We show both theoretically and numerically that the proposed meta-analysis approach provides accurate control of the type I error and is as powerful as joint analysis of individual participant data. This approach accommodates any disease phenotype and any study design and produces all commonly used gene-level tests. An application to the GWAS summary results of the Genetic Investigation of ANthropometric Traits (GIANT) consortium reveals rare and low-frequency variants associated with human height. The relevant software is freely available.
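The key insight, that a gene-level statistic can be recovered from single-variant summary statistics plus their correlation matrix, can be illustrated with a simple burden-style test. The z-scores, weights and correlation matrix below are made-up illustrations, not the GIANT results, and this is only one of the gene-level tests the approach covers:

```python
import math
import numpy as np

z = np.array([1.8, 2.1, -0.4])       # single-variant z-statistics
R = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 1.0]])      # correlation of the test statistics
w = np.ones_like(z)                  # equal burden weights

# Combined burden statistic: the weighted sum of z-scores, standardised
# by its variance w' R w implied by the correlation matrix. No
# individual-level genotype data are needed.
z_gene = (w @ z) / math.sqrt(w @ R @ w)
p_gene = math.erfc(abs(z_gene) / math.sqrt(2.0))   # two-sided p-value
```

Ignoring the correlations (i.e., taking R as the identity) would overstate the effective variance reduction and inflate the type I error, which is why the correlation matrix from a participating study or reference database is essential.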
Abstract:
Empirical phylogeographic studies have progressively sampled greater numbers of loci over time, in part motivated by theoretical papers showing that estimates of key demographic parameters improve as the number of loci increases. Recently, next-generation sequencing has been applied to questions about organismal history, with the promise of revolutionizing the field. However, no systematic assessment of how phylogeographic data sets have changed over time with respect to overall size and information content has been performed. Here, we quantify the changing nature of these genetic data sets over the past 20 years, focusing on papers published in Molecular Ecology. We found that the number of independent loci, the total number of alleles sampled and the total number of single nucleotide polymorphisms (SNPs) per data set have improved over time, with particularly dramatic increases within the past 5 years. Interestingly, uniparentally inherited organellar markers (e.g. animal mitochondrial and plant chloroplast DNA) continue to represent an important component of phylogeographic data. Single-species studies (cf. comparative studies) that focus on vertebrates (particularly fish and, to some extent, birds) represent the gold standard of phylogeographic data collection. Based on the current trajectory seen in our survey data, forecast modelling indicates that the median number of SNPs per data set for studies published by the end of the year 2016 may approach ~20,000. This survey provides baseline information for understanding the evolution of phylogeographic data sets and underscores the fact that development of analytical methods for handling very large genetic data sets will be critical for facilitating growth of the field.
Abstract:
We propose a general framework for the analysis of animal telemetry data through the use of weighted distributions. It is shown that several interpretations of resource selection functions arise when they are constructed from the ratio of a use distribution and an availability distribution. Through the proposed general framework, several popular resource selection models are shown to be special cases of the general model, obtained by making assumptions about animal movement and behavior. The weighted distribution framework is easily extended to account for telemetry data that are highly autocorrelated, as is typical of animal relocations collected with new technology such as global positioning systems. An analysis of simulated data using several models constructed within the proposed framework is also presented to illustrate the possible gains from the flexible modeling framework. The proposed model is applied to a brown bear data set from southeast Alaska.
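The weighted-distribution idea can be sketched numerically: the use distribution is the availability distribution re-weighted by a resource selection function, here the common exponential form w(x) = exp(βx). The covariate, coefficient and seed below are made up for illustration, not drawn from the brown bear analysis:

```python
import numpy as np

# Covariate values at a large sample of "available" locations
# (illustrative: standard-normal habitat covariate).
rng = np.random.default_rng(1)
beta = 1.2
avail = rng.normal(0.0, 1.0, size=100_000)

# Resource selection weights w(x) = exp(beta * x), normalised so they
# define the use distribution over the available sample.
w = np.exp(beta * avail)
w = w / w.sum()

# Mean covariate value under the use distribution, via importance
# weighting of the availability sample.
use_mean = np.sum(w * avail)
```

With a normal availability distribution and exponential weights, the use distribution is a normal shifted by β, so `use_mean` lands near `beta`; this mirrors how selection pulls use away from availability.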
Abstract:
The Gaia space mission is a major project for the European astronomical community. As challenging as it is, the processing and analysis of the huge data flow incoming from Gaia is the subject of thorough study and preparatory work by the DPAC (Data Processing and Analysis Consortium), in charge of all aspects of the Gaia data reduction. This PhD thesis was carried out in the framework of the DPAC, within the team based in Bologna. The task of the Bologna team is to define the calibration model and to build a grid of spectro-photometric standard stars (SPSS) suitable for the absolute flux calibration of the Gaia G-band photometry and the BP/RP spectrophotometry. Such a flux calibration can be performed by repeatedly observing each SPSS during the lifetime of the Gaia mission and by comparing the observed Gaia spectra to the spectra obtained by our ground-based observations. Due to both the different observing sites involved and the huge number of frames expected (≃100,000), it is essential to maintain the maximum homogeneity in data quality, acquisition and treatment, and particular care has to be taken to test the capabilities of each telescope/instrument combination (through the “instrument familiarization plan”) and to devise methods to keep under control, and where necessary correct for, the typical instrumental effects that can affect the high precision required for the Gaia SPSS grid (a few % with respect to Vega). I contributed to the ground-based survey of Gaia SPSS in many respects: the observations, the instrument familiarization plan, the data reduction and analysis activities (both photometry and spectroscopy), and the maintenance of the data archives. However, the field I was personally responsible for was photometry, and in particular relative photometry for the production of short-term light curves.
In this context I defined and tested a semi-automated pipeline which allows for the pre-reduction of imaging SPSS data and the production of aperture photometry catalogues ready to be used for further analysis. A series of semi-automated quality control criteria are included in the pipeline at various levels, from pre-reduction, to aperture photometry, to light-curve production and analysis.
Abstract:
Aims: The reported rate of stent thrombosis (ST) after drug-eluting stent (DES) implantation varies among registries. To investigate differences in baseline characteristics and clinical outcome between European and Japanese all-comers registries, we performed a pooled analysis of patient-level data. Methods and results: The j-Cypher registry (JC) is a multicentre observational study conducted in Japan, including 12,824 patients undergoing sirolimus-eluting stent (SES) implantation. From the Bern-Rotterdam registry (BR), enrolled at two academic hospitals in Switzerland and the Netherlands, 3,823 patients with SES were included in the current analysis. Patients in BR were younger, more frequently smokers, and presented more frequently with ST-elevation myocardial infarction (MI). Conversely, JC patients more frequently had diabetes and hypertension. At five years, the definite ST rate was significantly lower in JC than in BR (JC 1.6% vs. BR 3.3%, p<0.001), while the unadjusted mortality tended to be lower in BR than in JC (BR 13.2% vs. JC 14.4%, log-rank p=0.052). After adjustment, the j-Cypher registry was associated with a significantly lower risk of all-cause mortality (HR 0.56, 95% CI: 0.49-0.64) as well as of definite stent thrombosis (HR 0.46, 95% CI: 0.35-0.61). Conclusions: The baseline characteristics of the two large registries were different. After statistical adjustment, JC was associated with lower mortality and ST.