909 resultados para exploratory data analysis
Resumo:
In an earlier investigation (Burger et al., 2000) five sediment cores near the Rodrigues Triple Junction in the Indian Ocean were studied applying classical statistical methods (fuzzy c-means clustering, linear mixing model, principal component analysis) for the extraction of endmembers and evaluating the spatial and temporal variation of geochemical signals. Three main factors of sedimentation were expected by the marine geologists: a volcano-genetic, a hydro-hydrothermal and an ultra-basic factor. The display of fuzzy membership values and/or factor scores versus depth provided consistent results for two factors only; the ultra-basic component could not be identified. The reason for this may be that only traditional statistical methods were applied, i.e. the untransformed components were used and the cosine-theta coefficient as similarity measure. During the last decade considerable progress in compositional data analysis was made and many case studies were published using new tools for exploratory analysis of these data. Therefore it makes sense to check if the application of suitable data transformations, reduction of the D-part simplex to two or three factors and visual interpretation of the factor scores would lead to a revision of earlier results and to answers to open questions . In this paper we follow the lines of a paper of R. Tolosana- Delgado et al. (2005) starting with a problem-oriented interpretation of the biplot scattergram, extracting compositional factors, ilr-transformation of the components and visualization of the factor scores in a spatial context: The compositional factors will be plotted versus depth (time) of the core samples in order to facilitate the identification of the expected sources of the sedimentary process. Kew words: compositional data analysis, biplot, deep sea sediments
Resumo:
Background: The validity of ensemble averaging on event-related potential (ERP) data has been questioned, due to its assumption that the ERP is identical across trials. Thus, there is a need for preliminary testing for cluster structure in the data. New method: We propose a complete pipeline for the cluster analysis of ERP data. To increase the signalto-noise (SNR) ratio of the raw single-trials, we used a denoising method based on Empirical Mode Decomposition (EMD). Next, we used a bootstrap-based method to determine the number of clusters, through a measure called the Stability Index (SI). We then used a clustering algorithm based on a Genetic Algorithm (GA)to define initial cluster centroids for subsequent k-means clustering. Finally, we visualised the clustering results through a scheme based on Principal Component Analysis (PCA). Results: After validating the pipeline on simulated data, we tested it on data from two experiments – a P300 speller paradigm on a single subject and a language processing study on 25 subjects. Results revealed evidence for the existence of 6 clusters in one experimental condition from the language processing study. Further, a two-way chi-square test revealed an influence of subject on cluster membership.
Resumo:
P>The use of seven domains for the Oral Health Impact Profile (OHIP)-EDENT was not supported for its Brazilian version, making data interpretation in clinical settings difficult. Thus, the aim of this study was to assess patients` responses for the translated OHIP-EDENT in a group of edentulous subjects and to develop factor scales for application in future studies. Data from 103 conventional and implant-retained complete denture wearers (36 men, mean age of 69 center dot 1 +/- 10 center dot 3 years) were assessed using the Brazilian version of the OHIP-EDENT. Oral health-related quality of life domains were identified by factor analysis using principal component analysis as the extraction method, followed by varimax rotation. Factor analysis identified four factors that accounted for 63% of the 19 items total variance, named masticatory discomfort and disability (four items), psychological discomfort and disability (five items), social disability (five items) and oral pain and discomfort (five items). Four factors/domains of the Brazilian OHIP-EDENT version represent patient-important aspects of oral health-related quality of life.
Resumo:
We examine bivariate extensions of Aït-Sahalia’s approach to the estimation of univariate diffusions. Our message is that extending his idea to a bivariate setting is not straightforward. In higher dimensions, as opposed to the univariate case, the elements of the Itô and Fokker-Planck representations do not coincide; and, even imposing sensible assumptions on the marginal drifts and volatilities is not sufficient to obtain direct generalisations. We develop exploratory estimation and testing procedures, by parametrizing the drifts of both component processes and setting restrictions on the terms of either the Itô or the Fokker-Planck covariance matrices. This may lead to highly nonlinear ordinary differential equations, where the definition of boundary conditions is crucial. For the methods developed, the Fokker-Planck representation seems more tractable than the Itô’s. Questions for further research include the design of regularity conditions on the time series dependence in the data, the kernels actually used and the bandwidths, to obtain asymptotic properties for the estimators proposed. A particular case seems promising: “causal bivariate models” in which only one of the diffusions contributes to the volatility of the other. Hedging strategies which estimate separately the univariate diffusions at stake may thus be improved.
Resumo:
Data visualization techniques are powerful in the handling and analysis of multivariate systems. One such technique known as parallel coordinates was used to support the diagnosis of an event, detected by a neural network-based monitoring system, in a boiler at a Brazilian Kraft pulp mill. Its attractiveness is the possibility of the visualization of several variables simultaneously. The diagnostic procedure was carried out step-by-step going through exploratory, explanatory, confirmatory, and communicative goals. This tool allowed the visualization of the boiler dynamics in an easier way, compared to commonly used univariate trend plots. In addition it facilitated analysis of other aspects, namely relationships among process variables, distinct modes of operation and discrepant data. The whole analysis revealed firstly that the period involving the detected event was associated with a transition between two distinct normal modes of operation, and secondly the presence of unusual changes in process variables at this time.
Resumo:
Multiple regression analysis is a complex statistical method with many potential uses. It has also become one of the most abused of all statistical procedures since anyone with a data base and suitable software can carry it out. An investigator should always have a clear hypothesis in mind before carrying out such a procedure and knowledge of the limitations of each aspect of the analysis. In addition, multiple regression is probably best used in an exploratory context, identifying variables that might profitably be examined by more detailed studies. Where there are many variables potentially influencing Y, they are likely to be intercorrelated and to account for relatively small amounts of the variance. Any analysis in which R squared is less than 50% should be suspect as probably not indicating the presence of significant variables. A further problem relates to sample size. It is often stated that the number of subjects or patients must be at least 5-10 times the number of variables included in the study.5 This advice should be taken only as a rough guide but it does indicate that the variables included should be selected with great care as inclusion of an obviously unimportant variable may have a significant impact on the sample size required.
Resumo:
ABSTRACT Researchers frequently have to analyze scales in which some participants have failed to respond to some items. In this paper we focus on the exploratory factor analysis of multidimensional scales (i.e., scales that consist of a number of subscales) where each subscale is made up of a number of Likert-type items, and the aim of the analysis is to estimate participants' scores on the corresponding latent traits. We propose a new approach to deal with missing responses in such a situation that is based on (1) multiple imputation of non-responses and (2) simultaneous rotation of the imputed datasets. We applied the approach in a real dataset where missing responses were artificially introduced following a real pattern of non-responses, and a simulation study based on artificial datasets. The results show that our approach (specifically, Hot-Deck multiple imputation followed of Consensus Promin rotation) was able to successfully compute factor score estimates even for participants that have missing data.
Resumo:
The general purpose of this work is to describe and analyse the financing phenomenon of crowdfunding and to investigate the relations among crowdfunders, project creators and crowdfunding websites. More specifically, it also intends to describe the profile differences between major crowdfunding platforms, such as Kickstarter and Indiegogo. The findings are supported by literature, gathered from different scientific research papers. In the empirical part, data about Kickstarter and Indiegogo was collected from their websites and also complemented with further data from other statistical websites. For finding out specific information, such as satisfaction of entrepreneurs from both platforms, a satisfaction survey was applied among 200 entrepreneurs from different countries. To identify the profile of users of the Kickstarter and of the Indiegogo platforms, a multivariate analysis was performed, using a Hierarchical Clusters Analysis for each platform under study. Descriptive analysis was used for exploring information about popularity of platforms, average cost and the most popular area of projects, profile of users and future opportunities of platforms. To assess differences between groups, association between variables, and answering to the research hypothesis, an inferential analysis it was applied. The results showed that the Kickstarter and Indiegogo are one of the most popular crowdfunding platforms. Both of them have thousands of users and they are generally satisfied. Each of them uses individual approach for crowdfunders. Despite this, they both could benefit from further improving their services. Furthermore, according the results it was possible to observe that there is a direct and positive relationship between the money needed for the projects and the money collected from the investors for the projects, per platform.
Resumo:
Background: The inherent complexity of statistical methods and clinical phenomena compel researchers with diverse domains of expertise to work in interdisciplinary teams, where none of them have a complete knowledge in their counterpart's field. As a result, knowledge exchange may often be characterized by miscommunication leading to misinterpretation, ultimately resulting in errors in research and even clinical practice. Though communication has a central role in interdisciplinary collaboration and since miscommunication can have a negative impact on research processes, to the best of our knowledge, no study has yet explored how data analysis specialists and clinical researchers communicate over time. Methods/Principal Findings: We conducted qualitative analysis of encounters between clinical researchers and data analysis specialists (epidemiologist, clinical epidemiologist, and data mining specialist). These encounters were recorded and systematically analyzed using a grounded theory methodology for extraction of emerging themes, followed by data triangulation and analysis of negative cases for validation. A policy analysis was then performed using a system dynamics methodology looking for potential interventions to improve this process. Four major emerging themes were found. Definitions using lay language were frequently employed as a way to bridge the language gap between the specialties. Thought experiments presented a series of ""what if'' situations that helped clarify how the method or information from the other field would behave, if exposed to alternative situations, ultimately aiding in explaining their main objective. Metaphors and analogies were used to translate concepts across fields, from the unfamiliar to the familiar. Prolepsis was used to anticipate study outcomes, thus helping specialists understand the current context based on an understanding of their final goal. Conclusion/Significance: The communication between clinical researchers and data analysis specialists presents multiple challenges that can lead to errors.
Resumo:
The performance of three analytical methods for multiple-frequency bioelectrical impedance analysis (MFBIA) data was assessed. The methods were the established method of Cole and Cole, the newly proposed method of Siconolfi and co-workers and a modification of this procedure. Method performance was assessed from the adequacy of the curve fitting techniques, as judged by the correlation coefficient and standard error of the estimate, and the accuracy of the different methods in determining the theoretical values of impedance parameters describing a set of model electrical circuits. The experimental data were well fitted by all curve-fitting procedures (r = 0.9 with SEE 0.3 to 3.5% or better for most circuit-procedure combinations). Cole-Cole modelling provided the most accurate estimates of circuit impedance values, generally within 1-2% of the theoretical values, followed by the Siconolfi procedure using a sixth-order polynomial regression (1-6% variation). None of the methods, however, accurately estimated circuit parameters when the measured impedances were low (<20 Omega) reflecting the electronic limits of the impedance meter used. These data suggest that Cole-Cole modelling remains the preferred method for the analysis of MFBIA data.
Resumo:
The identification, modeling, and analysis of interactions between nodes of neural systems in the human brain have become the aim of interest of many studies in neuroscience. The complex neural network structure and its correlations with brain functions have played a role in all areas of neuroscience, including the comprehension of cognitive and emotional processing. Indeed, understanding how information is stored, retrieved, processed, and transmitted is one of the ultimate challenges in brain research. In this context, in functional neuroimaging, connectivity analysis is a major tool for the exploration and characterization of the information flow between specialized brain regions. In most functional magnetic resonance imaging (fMRI) studies, connectivity analysis is carried out by first selecting regions of interest (ROI) and then calculating an average BOLD time series (across the voxels in each cluster). Some studies have shown that the average may not be a good choice and have suggested, as an alternative, the use of principal component analysis (PCA) to extract the principal eigen-time series from the ROI(s). In this paper, we introduce a novel approach called cluster Granger analysis (CGA) to study connectivity between ROIs. The main aim of this method was to employ multiple eigen-time series in each ROI to avoid temporal information loss during identification of Granger causality. Such information loss is inherent in averaging (e.g., to yield a single ""representative"" time series per ROI). This, in turn, may lead to a lack of power in detecting connections. The proposed approach is based on multivariate statistical analysis and integrates PCA and partial canonical correlation in a framework of Granger causality for clusters (sets) of time series. We also describe an algorithm for statistical significance testing based on bootstrapping. By using Monte Carlo simulations, we show that the proposed approach outperforms conventional Granger causality analysis (i.e., using representative time series extracted by signal averaging or first principal components estimation from ROIs). The usefulness of the CGA approach in real fMRI data is illustrated in an experiment using human faces expressing emotions. With this data set, the proposed approach suggested the presence of significantly more connections between the ROIs than were detected using a single representative time series in each ROI. (c) 2010 Elsevier Inc. All rights reserved.
Resumo:
Qualitative data analysis (QDA) is often a time-consuming and laborious process usually involving the management of large quantities of textual data. Recently developed computer programs offer great advances in the efficiency of the processes of QDA. In this paper we report on an innovative use of a combination of extant computer software technologies to further enhance and simplify QDA. Used in appropriate circumstances, we believe that this innovation greatly enhances the speed with which theoretical and descriptive ideas can be abstracted from rich, complex, and chaotic qualitative data. © 2001 Human Sciences Press, Inc.
Resumo:
27th Annual Conference of the European Cetacean Society. Setúbal, Portugal, 8-10 April 2013.
Resumo:
Controlled fires in forest areas are frequently used in most Mediterranean countries as a preventive technique to avoid severe wildfires in summer season. In Portugal, this forest management method of fuel mass availability is also used and has shown to be beneficial as annual statistical reports confirm that the decrease of wildfires occurrence have a direct relationship with the controlled fire practice. However prescribed fire can have serious side effects in some forest soil properties. This work shows the changes that occurred in some forest soils properties after a prescribed fire action. The experiments were carried out in soil cover over a natural site of Andaluzitic schist, in Gramelas, Caminha, Portugal, that had not been burn for four years. The composed soil samples were collected from five plots at three different layers (0-3cm, 3-6cm and 6-18cm) during a three-year monitoring period after the prescribed burning. Principal Component Analysis was used to reach the presented conclusions.