36 resultados para exploratory data analysis
em Biblioteca Digital da Produção Intelectual da Universidade de São Paulo (BDPI/USP)
Resumo:
P>The use of seven domains for the Oral Health Impact Profile (OHIP)-EDENT was not supported for its Brazilian version, making data interpretation in clinical settings difficult. Thus, the aim of this study was to assess patients` responses for the translated OHIP-EDENT in a group of edentulous subjects and to develop factor scales for application in future studies. Data from 103 conventional and implant-retained complete denture wearers (36 men, mean age of 69 center dot 1 +/- 10 center dot 3 years) were assessed using the Brazilian version of the OHIP-EDENT. Oral health-related quality of life domains were identified by factor analysis using principal component analysis as the extraction method, followed by varimax rotation. Factor analysis identified four factors that accounted for 63% of the 19 items total variance, named masticatory discomfort and disability (four items), psychological discomfort and disability (five items), social disability (five items) and oral pain and discomfort (five items). Four factors/domains of the Brazilian OHIP-EDENT version represent patient-important aspects of oral health-related quality of life.
Resumo:
Background: The inherent complexity of statistical methods and clinical phenomena compel researchers with diverse domains of expertise to work in interdisciplinary teams, where none of them have a complete knowledge in their counterpart's field. As a result, knowledge exchange may often be characterized by miscommunication leading to misinterpretation, ultimately resulting in errors in research and even clinical practice. Though communication has a central role in interdisciplinary collaboration and since miscommunication can have a negative impact on research processes, to the best of our knowledge, no study has yet explored how data analysis specialists and clinical researchers communicate over time. Methods/Principal Findings: We conducted qualitative analysis of encounters between clinical researchers and data analysis specialists (epidemiologist, clinical epidemiologist, and data mining specialist). These encounters were recorded and systematically analyzed using a grounded theory methodology for extraction of emerging themes, followed by data triangulation and analysis of negative cases for validation. A policy analysis was then performed using a system dynamics methodology looking for potential interventions to improve this process. Four major emerging themes were found. Definitions using lay language were frequently employed as a way to bridge the language gap between the specialties. Thought experiments presented a series of ""what if'' situations that helped clarify how the method or information from the other field would behave, if exposed to alternative situations, ultimately aiding in explaining their main objective. Metaphors and analogies were used to translate concepts across fields, from the unfamiliar to the familiar. Prolepsis was used to anticipate study outcomes, thus helping specialists understand the current context based on an understanding of their final goal. Conclusion/Significance: The communication between clinical researchers and data analysis specialists presents multiple challenges that can lead to errors.
Resumo:
The identification, modeling, and analysis of interactions between nodes of neural systems in the human brain have become the aim of interest of many studies in neuroscience. The complex neural network structure and its correlations with brain functions have played a role in all areas of neuroscience, including the comprehension of cognitive and emotional processing. Indeed, understanding how information is stored, retrieved, processed, and transmitted is one of the ultimate challenges in brain research. In this context, in functional neuroimaging, connectivity analysis is a major tool for the exploration and characterization of the information flow between specialized brain regions. In most functional magnetic resonance imaging (fMRI) studies, connectivity analysis is carried out by first selecting regions of interest (ROI) and then calculating an average BOLD time series (across the voxels in each cluster). Some studies have shown that the average may not be a good choice and have suggested, as an alternative, the use of principal component analysis (PCA) to extract the principal eigen-time series from the ROI(s). In this paper, we introduce a novel approach called cluster Granger analysis (CGA) to study connectivity between ROIs. The main aim of this method was to employ multiple eigen-time series in each ROI to avoid temporal information loss during identification of Granger causality. Such information loss is inherent in averaging (e.g., to yield a single ""representative"" time series per ROI). This, in turn, may lead to a lack of power in detecting connections. The proposed approach is based on multivariate statistical analysis and integrates PCA and partial canonical correlation in a framework of Granger causality for clusters (sets) of time series. We also describe an algorithm for statistical significance testing based on bootstrapping. By using Monte Carlo simulations, we show that the proposed approach outperforms conventional Granger causality analysis (i.e., using representative time series extracted by signal averaging or first principal components estimation from ROIs). The usefulness of the CGA approach in real fMRI data is illustrated in an experiment using human faces expressing emotions. With this data set, the proposed approach suggested the presence of significantly more connections between the ROIs than were detected using a single representative time series in each ROI. (c) 2010 Elsevier Inc. All rights reserved.
Resumo:
In this paper, we present an algorithm for cluster analysis that integrates aspects from cluster ensemble and multi-objective clustering. The algorithm is based on a Pareto-based multi-objective genetic algorithm, with a special crossover operator, which uses clustering validation measures as objective functions. The algorithm proposed can deal with data sets presenting different types of clusters, without the need of expertise in cluster analysis. its result is a concise set of partitions representing alternative trade-offs among the objective functions. We compare the results obtained with our algorithm, in the context of gene expression data sets, to those achieved with multi-objective Clustering with automatic K-determination (MOCK). the algorithm most closely related to ours. (C) 2009 Elsevier B.V. All rights reserved.
Resumo:
The TCABR data analysis and acquisition system has been upgraded to support a joint research programme using remote participation technologies. The architecture of the new system uses Java language as programming environment. Since application parameters and hardware in a joint experiment are complex with a large variability of components, requirements and specification solutions need to be flexible and modular, independent from operating system and computer architecture. To describe and organize the information on all the components and the connections among them, systems are developed using the extensible Markup Language (XML) technology. The communication between clients and servers uses remote procedure call (RPC) based on the XML (RPC-XML technology). The integration among Java language, XML and RPC-XML technologies allows to develop easily a standard data and communication access layer between users and laboratories using common software libraries and Web application. The libraries allow data retrieval using the same methods for all user laboratories in the joint collaboration, and the Web application allows a simple graphical user interface (GUI) access. The TCABR tokamak team in collaboration with the IPFN (Instituto de Plasmas e Fusao Nuclear, Instituto Superior Tecnico, Universidade Tecnica de Lisboa) is implementing this remote participation technologies. The first version was tested at the Joint Experiment on TCABR (TCABRJE), a Host Laboratory Experiment, organized in cooperation with the IAEA (International Atomic Energy Agency) in the framework of the IAEA Coordinated Research Project (CRP) on ""Joint Research Using Small Tokamaks"". (C) 2010 Elsevier B.V. All rights reserved.
Resumo:
This work presents a novel approach in order to increase the recognition power of Multiscale Fractal Dimension (MFD) techniques, when applied to image classification. The proposal uses Functional Data Analysis (FDA) with the aim of enhancing the MFD technique precision achieving a more representative descriptors vector, capable of recognizing and characterizing more precisely objects in an image. FDA is applied to signatures extracted by using the Bouligand-Minkowsky MFD technique in the generation of a descriptors vector from them. For the evaluation of the obtained improvement, an experiment using two datasets of objects was carried out. A dataset was used of characters shapes (26 characters of the Latin alphabet) carrying different levels of controlled noise and a dataset of fish images contours. A comparison with the use of the well-known methods of Fourier and wavelets descriptors was performed with the aim of verifying the performance of FDA method. The descriptor vectors were submitted to Linear Discriminant Analysis (LDA) classification method and we compared the correctness rate in the classification process among the descriptors methods. The results demonstrate that FDA overcomes the literature methods (Fourier and wavelets) in the processing of information extracted from the MFD signature. In this way, the proposed method can be considered as an interesting choice for pattern recognition and image classification using fractal analysis.
Resumo:
This paper presents the groundwater favorability mapping on a fractured terrain in the eastern portion of Sao Paulo State, Brazil. Remote sensing, airborne geophysical data, photogeologic interpretation, geologic and geomorphologic maps and geographic information system (GIS) techniques have been used. The results of cross-tabulation between these maps and well yield data allowed groundwater prospective parameters in a fractured-bedrock aquifer. These prospective parameters are the base for the favorability analysis whose principle is based on the knowledge-driven method. The mutticriteria analysis (weighted linear combination) was carried out to give a groundwater favorabitity map, because the prospective parameters have different weights of importance and different classes of each parameter. The groundwater favorability map was tested by cross-tabulation with new well yield data and spring occurrence. The wells with the highest values of productivity, as well as all the springs occurrence are situated in the excellent and good favorabitity mapped areas. It shows good coherence between the prospective parameters and the well yield and the importance of GIS techniques for definition of target areas for detail study and wells location. (c) 2008 Elsevier B.V. All rights reserved.
Resumo:
Molecular orbital calculations were carried out on a set of 28 non-imidazole H(3) antihistamine compounds using the Hartree-Fock method in order to investigate the possible relationships between electronic structural properties and binding affinity for H3 receptors (pK(i)). It was observed that the frontier effective-for-reaction molecular orbital (FERMO) energies were better correlated with pK(i) values than highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) energy values. Exploratory data analysis through hierarchical cluster (HCA) and principal component analysis (PCA) showed a separation of the compounds in two sets, one grouping the molecules with high pK(i) values, the other gathering low pK(i) value compounds. This separation was obtained with the use of the following descriptors: FERMO energies (epsilon(FERMO)), charges derived from the electrostatic potential on the nitrogen atom (N(1)), electronic density indexes for FERMO on the N(1) atom (Sigma((FERMO))c(i)(2)). and electrophilicity (omega`). These electronic descriptors were used to construct a quantitative structure-activity relationship (QSAR) model through the partial least-squares (PLS) method with three principal components. This model generated Q(2) = 0.88 and R(2) = 0.927 values obtained from a training set and external validation of 23 and 5 molecules, respectively. After the analysis of the PLS regression equation and the values for the selected electronic descriptors, it is suggested that high values of FERMO energies and of Sigma((FERMO))c(i)(2), together with low values of electrophilicity and pronounced negative charges on N(1) appear as desirable properties for the conception of new molecules which might have high binding affinity. 2010 Elsevier Inc. All rights reserved.
Resumo:
This paper aims to find relations between the socioeconomic characteristics, activity participation, land use patterns and travel behavior of the residents in the Sao Paulo Metropolitan Area (SPMA) by using Exploratory Multivariate Data Analysis (EMDA) techniques. The variables influencing travel pattern choices are investigated using: (a) Cluster Analysis (CA), grouping and characterizing the Traffic Zones (17), proposing the independent variable called Origin Cluster and, (b) Decision Tree (DT) to find a priori unknown relations among socioeconomic characteristics, land use attributes of the origin TZ and destination choices. The analysis was based on the origin-destination home-interview survey carried out in SPMA in 1997. The DT application revealed the variables of greatest influence on the travel pattern choice. The most important independent variable considered by DT is car ownership, followed by the Use of Transportation ""credits"" for Transit tariff, and, finally, activity participation variables and Origin Cluster. With these results, it was possible to analyze the influence of a family income, car ownership, position of the individual in the family, use of transportation ""credits"" for transit tariff (mainly for travel mode sequence choice), activities participation (activity sequence choice) and Origin Cluster (destination/travel distance choice). (c) 2010 Elsevier Ltd. All rights reserved.
Wavelet correlation between subjects: A time-scale data driven analysis for brain mapping using fMRI
Resumo:
Functional magnetic resonance imaging (fMRI) based on BOLD signal has been used to indirectly measure the local neural activity induced by cognitive tasks or stimulation. Most fMRI data analysis is carried out using the general linear model (GLM), a statistical approach which predicts the changes in the observed BOLD response based on an expected hemodynamic response function (HRF). In cases when the task is cognitively complex or in cases of diseases, variations in shape and/or delay may reduce the reliability of results. A novel exploratory method using fMRI data, which attempts to discriminate between neurophysiological signals induced by the stimulation protocol from artifacts or other confounding factors, is introduced in this paper. This new method is based on the fusion between correlation analysis and the discrete wavelet transform, to identify similarities in the time course of the BOLD signal in a group of volunteers. We illustrate the usefulness of this approach by analyzing fMRI data from normal subjects presented with standardized human face pictures expressing different degrees of sadness. The results show that the proposed wavelet correlation analysis has greater statistical power than conventional GLM or time domain intersubject correlation analysis. (C) 2010 Elsevier B.V. All rights reserved.
Resumo:
This paper proposes a regression model considering the modified Weibull distribution. This distribution can be used to model bathtub-shaped failure rate functions. Assuming censored data, we consider maximum likelihood and Jackknife estimators for the parameters of the model. We derive the appropriate matrices for assessing local influence on the parameter estimates under different perturbation schemes and we also present some ways to perform global influence. Besides, for different parameter settings, sample sizes and censoring percentages, various simulations are performed and the empirical distribution of the modified deviance residual is displayed and compared with the standard normal distribution. These studies suggest that the residual analysis usually performed in normal linear regression models can be straightforwardly extended for a martingale-type residual in log-modified Weibull regression models with censored data. Finally, we analyze a real data set under log-modified Weibull regression models. A diagnostic analysis and a model checking based on the modified deviance residual are performed to select appropriate models. (c) 2008 Elsevier B.V. All rights reserved.
Resumo:
In this study, regression models are evaluated for grouped survival data when the effect of censoring time is considered in the model and the regression structure is modeled through four link functions. The methodology for grouped survival data is based on life tables, and the times are grouped in k intervals so that ties are eliminated. Thus, the data modeling is performed by considering the discrete models of lifetime regression. The model parameters are estimated by using the maximum likelihood and jackknife methods. To detect influential observations in the proposed models, diagnostic measures based on case deletion, which are denominated global influence, and influence measures based on small perturbations in the data or in the model, referred to as local influence, are used. In addition to those measures, the local influence and the total influential estimate are also employed. Various simulation studies are performed and compared to the performance of the four link functions of the regression models for grouped survival data for different parameter settings, sample sizes and numbers of intervals. Finally, a data set is analyzed by using the proposed regression models. (C) 2010 Elsevier B.V. All rights reserved.
Resumo:
In this paper a new parametric method to deal with discrepant experimental results is developed. The method is based on the fit of a probability density function to the data. This paper also compares the characteristics of different methods used to deduce recommended values and uncertainties from a discrepant set of experimental data. The methods are applied to the (137)Cs and (90)Sr published half-lives and special emphasis is given to the deduced confidence intervals. The obtained results are analyzed considering two fundamental properties expected from an experimental result: the probability content of confidence intervals and the statistical consistency between different recommended values. The recommended values and uncertainties for the (137)Cs and (90)Sr half-lives are 10,984 (24) days and 10,523 (70) days, respectively. (C) 2009 Elsevier B.V. All rights reserved.
Resumo:
Due to the imprecise nature of biological experiments, biological data is often characterized by the presence of redundant and noisy data. This may be due to errors that occurred during data collection, such as contaminations in laboratorial samples. It is the case of gene expression data, where the equipments and tools currently used frequently produce noisy biological data. Machine Learning algorithms have been successfully used in gene expression data analysis. Although many Machine Learning algorithms can deal with noise, detecting and removing noisy instances from the training data set can help the induction of the target hypothesis. This paper evaluates the use of distance-based pre-processing techniques for noise detection in gene expression data classification problems. This evaluation analyzes the effectiveness of the techniques investigated in removing noisy data, measured by the accuracy obtained by different Machine Learning classifiers over the pre-processed data.
Resumo:
Objetivou-se verificar a prevalência de deficiência auditiva referida pela população urbana de quatro localidades do Estado de São Paulo, Brasil, e estudar as causas atribuídas e variáveis sócio-demográficas. Foi realizado um estudo transversal de base populacional com dados referentes à população com 12 anos ou mais residente nas quatro localidades, em 2001 e 2002. Participaram 5.250 sujeitos selecionados por amostragem probabilística, estratificada e selecionada por conglomerados, em dois estágios. A análise dos dados foi exploratória, incluindo análise bivariada e regressão logística múltipla. A prevalência de deficiência auditiva foi 5,21%, mais acentuada nas faixas etárias acima de 59 anos (18,7%), que referiram doenças nos 15 dias anteriores à entrevista (8,4%), com transtorno mental comum (8,85%) e que fizeram uso de medicamentos nos últimos 3 dias (8,45%). O estudo dos fatores que se associam à deficiência auditiva direcionam intervenções de saúde para que atendam as reais necessidades da população, principalmente na atenção primária. Há necessidade de mais estudos populacionais com enfoque na audição, visto que esta é uma área escassa de publicações no Brasil.