15 results for data analysis: algorithms and implementation
in the Biblioteca Digital da Produção Intelectual da Universidade de São Paulo (BDPI/USP)
Abstract:
Background: The inherent complexity of statistical methods and clinical phenomena compels researchers with diverse domains of expertise to work in interdisciplinary teams, where none of them has complete knowledge of their counterparts' fields. As a result, knowledge exchange may often be characterized by miscommunication leading to misinterpretation, ultimately resulting in errors in research and even clinical practice. Although communication has a central role in interdisciplinary collaboration and miscommunication can have a negative impact on research processes, to the best of our knowledge, no study has yet explored how data analysis specialists and clinical researchers communicate over time. Methods/Principal Findings: We conducted a qualitative analysis of encounters between clinical researchers and data analysis specialists (epidemiologist, clinical epidemiologist, and data mining specialist). These encounters were recorded and systematically analyzed using a grounded theory methodology for extraction of emerging themes, followed by data triangulation and analysis of negative cases for validation. A policy analysis was then performed using a system dynamics methodology, looking for potential interventions to improve this process. Four major emerging themes were found. Definitions using lay language were frequently employed as a way to bridge the language gap between the specialties. Thought experiments presented a series of "what if" situations that helped clarify how the method or information from the other field would behave if exposed to alternative situations, ultimately aiding in explaining their main objective. Metaphors and analogies were used to translate concepts across fields, from the unfamiliar to the familiar. Prolepsis was used to anticipate study outcomes, thus helping specialists understand the current context based on an understanding of their final goal.
Conclusion/Significance: The communication between clinical researchers and data analysis specialists presents multiple challenges that can lead to errors.
Abstract:
In this paper, we present an algorithm for cluster analysis that integrates aspects from cluster ensemble and multi-objective clustering. The algorithm is based on a Pareto-based multi-objective genetic algorithm, with a special crossover operator, which uses clustering validation measures as objective functions. The proposed algorithm can deal with data sets presenting different types of clusters, without the need for expertise in cluster analysis. Its result is a concise set of partitions representing alternative trade-offs among the objective functions. We compare the results obtained with our algorithm, in the context of gene expression data sets, to those achieved with Multi-Objective Clustering with automatic K-determination (MOCK), the algorithm most closely related to ours. (C) 2009 Elsevier B.V. All rights reserved.
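The idea of keeping only partitions that represent alternative trade-offs can be illustrated with a plain Pareto filter. The sketch below is not the authors' genetic algorithm; it only shows the dominance test such methods rely on, assuming every objective is to be minimized. The score tuples are hypothetical values invented for illustration.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every objective and strictly better on one.

    Assumption: all objectives are minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(scores: List[Sequence[float]]) -> List[int]:
    """Return the indices of candidates that no other candidate dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical validation scores (lower is better) for four candidate partitions.
scores = [(0.2, 0.9), (0.5, 0.5), (0.6, 0.6), (0.9, 0.1)]
front = pareto_front(scores)  # the third candidate is dominated by the second
```

Each surviving index corresponds to one trade-off among the objective functions; none of the survivors is better than another on every measure.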
A bivariate regression model for matched paired survival data: local influence and residual analysis
Abstract:
The use of bivariate distributions plays a fundamental role in survival and reliability studies. In this paper, we consider a location scale model for bivariate survival times based on the proposal of a copula to model the dependence of bivariate survival data. For the proposed model, we consider inferential procedures based on maximum likelihood. Gains in efficiency from bivariate models are also examined in the censored data setting. For different parameter settings, sample sizes and censoring percentages, various simulation studies are performed and compared to the performance of the bivariate regression model for matched paired survival data. Sensitivity analysis methods such as local and total influence are presented and derived under three perturbation schemes. The martingale marginal and the deviance marginal residual measures are used to check the adequacy of the model. Furthermore, we propose a new measure which we call modified deviance component residual. The methodology in the paper is illustrated on a lifetime data set for kidney patients.
Abstract:
The TCABR data analysis and acquisition system has been upgraded to support a joint research programme using remote participation technologies. The architecture of the new system uses the Java language as its programming environment. Since application parameters and hardware in a joint experiment are complex, with a large variability of components, requirements and specification solutions need to be flexible and modular, independent of operating system and computer architecture. To describe and organize the information on all the components and the connections among them, systems are developed using the Extensible Markup Language (XML) technology. The communication between clients and servers uses remote procedure call (RPC) based on XML (RPC-XML technology). The integration of Java, XML and RPC-XML technologies makes it easy to develop a standard data and communication access layer between users and laboratories using common software libraries and a Web application. The libraries allow data retrieval using the same methods for all user laboratories in the joint collaboration, and the Web application allows simple graphical user interface (GUI) access. The TCABR tokamak team, in collaboration with the IPFN (Instituto de Plasmas e Fusao Nuclear, Instituto Superior Tecnico, Universidade Tecnica de Lisboa), is implementing these remote participation technologies. The first version was tested at the Joint Experiment on TCABR (TCABRJE), a Host Laboratory Experiment, organized in cooperation with the IAEA (International Atomic Energy Agency) in the framework of the IAEA Coordinated Research Project (CRP) on "Joint Research Using Small Tokamaks". (C) 2010 Elsevier B.V. All rights reserved.
Abstract:
Information on fruit and vegetable consumption in Brazil at the three levels of dietary data was analyzed and compared. Data about national supply came from Food Balance Sheets compiled by the FAO; household availability information was obtained from the Brazilian National Household Budget Survey (HBS); and actual intake information came from a large individual dietary intake survey that was representative of the adult population of São Paulo city. All sources of information were collected between 2002 and 2003. A subset of the HBS, representative of São Paulo city, was used in our analysis in order to improve the quality of the comparison with actual intake data. The ratio of national supply to household availability of fruits and vegetables was 2.6, while the ratio of national supply to actual intake was 4.0. The discrepancy ratio in the comparison between household availability and actual intake was smaller, 1.6. While the use of supply and availability data has advantages, such as lower cost, it must be taken into account that these sources tend to overestimate actual intake of fruits and vegetables.
Abstract:
The identification, modeling, and analysis of interactions between nodes of neural systems in the human brain have become the focus of many studies in neuroscience. The complex neural network structure and its correlations with brain functions have played a role in all areas of neuroscience, including the comprehension of cognitive and emotional processing. Indeed, understanding how information is stored, retrieved, processed, and transmitted is one of the ultimate challenges in brain research. In this context, in functional neuroimaging, connectivity analysis is a major tool for the exploration and characterization of the information flow between specialized brain regions. In most functional magnetic resonance imaging (fMRI) studies, connectivity analysis is carried out by first selecting regions of interest (ROI) and then calculating an average BOLD time series (across the voxels in each cluster). Some studies have shown that the average may not be a good choice and have suggested, as an alternative, the use of principal component analysis (PCA) to extract the principal eigen-time series from the ROI(s). In this paper, we introduce a novel approach called cluster Granger analysis (CGA) to study connectivity between ROIs. The main aim of this method is to employ multiple eigen-time series in each ROI to avoid temporal information loss during identification of Granger causality. Such information loss is inherent in averaging (e.g., to yield a single "representative" time series per ROI). This, in turn, may lead to a lack of power in detecting connections. The proposed approach is based on multivariate statistical analysis and integrates PCA and partial canonical correlation in a framework of Granger causality for clusters (sets) of time series. We also describe an algorithm for statistical significance testing based on bootstrapping.
By using Monte Carlo simulations, we show that the proposed approach outperforms conventional Granger causality analysis (i.e., using representative time series extracted by signal averaging or first principal components estimation from ROIs). The usefulness of the CGA approach in real fMRI data is illustrated in an experiment using human faces expressing emotions. With this data set, the proposed approach suggested the presence of significantly more connections between the ROIs than were detected using a single representative time series in each ROI. (c) 2010 Elsevier Inc. All rights reserved.
Abstract:
This work presents a novel approach to increase the recognition power of Multiscale Fractal Dimension (MFD) techniques when applied to image classification. The proposal uses Functional Data Analysis (FDA) to enhance the precision of the MFD technique by achieving a more representative descriptor vector, capable of recognizing and characterizing objects in an image more precisely. FDA is applied to signatures extracted with the Bouligand-Minkowski MFD technique to generate a descriptor vector from them. To evaluate the improvement obtained, an experiment using two datasets of objects was carried out: a dataset of character shapes (26 characters of the Latin alphabet) carrying different levels of controlled noise, and a dataset of fish image contours. A comparison with the well-known Fourier and wavelet descriptor methods was performed to verify the performance of the FDA method. The descriptor vectors were submitted to the Linear Discriminant Analysis (LDA) classification method, and we compared the classification correctness rates among the descriptor methods. The results demonstrate that FDA outperforms the methods from the literature (Fourier and wavelets) in the processing of information extracted from the MFD signature. The proposed method can therefore be considered an interesting choice for pattern recognition and image classification using fractal analysis.
Abstract:
This paper presents groundwater favorability mapping on a fractured terrain in the eastern portion of Sao Paulo State, Brazil. Remote sensing, airborne geophysical data, photogeologic interpretation, geologic and geomorphologic maps and geographic information system (GIS) techniques have been used. The results of cross-tabulation between these maps and well yield data allowed groundwater prospective parameters to be defined for a fractured-bedrock aquifer. These prospective parameters are the basis for the favorability analysis, whose principle is based on the knowledge-driven method. A multicriteria analysis (weighted linear combination) was carried out to produce a groundwater favorability map, because the prospective parameters have different weights of importance and each parameter has different classes. The groundwater favorability map was tested by cross-tabulation with new well yield data and spring occurrences. The wells with the highest productivity values, as well as all spring occurrences, are situated in the areas mapped as excellent and good favorability. This shows good coherence between the prospective parameters and the well yield, and the importance of GIS techniques for the definition of target areas for detailed study and well location. (c) 2008 Elsevier B.V. All rights reserved.
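A weighted linear combination of the kind mentioned above can be sketched as follows. This is a minimal illustration, not the authors' GIS workflow: the parameter names, weights, and class scores are hypothetical, since the abstract does not list them, and normalizing by the total weight is an assumption.

```python
from typing import Dict


def favorability(class_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted linear combination: sum of weight * class score, divided by total weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * class_scores[name] for name in weights) / total


# Hypothetical parameters and weights (not taken from the paper).
weights = {"lineament_density": 3.0, "lithology": 2.0, "slope": 1.0}
# Hypothetical class scores in [0, 1] for a single map cell.
cell = {"lineament_density": 0.8, "lithology": 0.5, "slope": 0.2}
score = favorability(cell, weights)
```

Applying the same function to every cell of a raster would yield a favorability surface that can then be cut into classes such as "excellent" or "good".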
Abstract:
Optimization of photo-Fenton degradation of copper phthalocyanine blue was achieved by response surface methodology (RSM) constructed with the aid of a sequential injection analysis (SIA) system coupled to a homemade photo-reactor. The highest degradation percentage was obtained under the following conditions: [H(2)O(2)]/[phthalocyanine] = 7, [H(2)O(2)]/[FeSO(4)] = 10, pH = 2.5, and stopped flow time in the photo-reactor = 30 s. The SIA system was designed to prepare a monosegment containing the reagents and sample, to pump it toward the photo-reactor for the specified time and to send the products to a flow-through spectrophotometer for monitoring the color reduction of the dye. Changes in parameters such as reagent molar ratios, residence time and pH were made by modifications in the software commanding the SI system, without the need for physical reconfiguration of reagents around the selection valve. The proposed procedure and system fed the statistical program with degradation data for fast construction of response surface plots. After optimization, 97% of the dye was degraded. (C) 2009 Elsevier B.V. All rights reserved.
Abstract:
This work presents the use of sequential injection analysis (SIA) and the response surface methodology as a tool for optimization of Fenton-based processes. Alizarin red S dye (C.I. 58005) was used as a model compound for the anthraquinone family, whose pigments are widely used in the coatings industry. The following factors were considered: [H(2)O(2)]:[Alizarin] and [H(2)O(2)]:[FeSO(4)] ratios and pH. The SIA system was designed to add reagents to the reactor and to perform on-line sampling of the reaction medium, sending the samples to a flow-through spectrophotometer for monitoring the color reduction of the dye. The proposed system fed the statistical program with degradation data for fast construction of response surface plots. After optimization, 99.7% of the dye was degraded and the TOC content was reduced to 35% of the original value. Low reagent consumption and high sampling throughput were the remarkable features of the SIA system. (C) 2008 Published by Elsevier B.V.
Abstract:
This paper describes a chemotaxonomic analysis of a database of triterpenoid compounds from the Celastraceae family using principal component analysis (PCA). The numbers of occurrences of thirty types of triterpene skeleton in different tribes of the family were used as variables. The study shows that PCA applied to chemical data can contribute to an intrafamilial classification of Celastraceae, since some questionable taxa affinities were observed from chemotaxonomic inferences about genera, and these are in agreement with the phylogeny previously proposed. The inclusion of Hippocrateaceae within Celastraceae is supported by the triterpene chemistry.
Abstract:
This paper aims to investigate the influence of some dissolved air flotation (DAF) process variables (specifically, the hydraulic detention time in the contact zone and the supplied dissolved air concentration) and the pH values, as pretreatment chemical variables, on the micro-bubble size distribution (BSD) in a DAF contact zone. This work was carried out in a pilot plant where bubbles were measured by an appropriate non-intrusive image acquisition system. The results show that the obtained diameter ranges were in agreement with values reported in the literature (10-100 μm), quite independently of the investigated conditions. The linear average diameter varied from 20 to 30 μm, or equivalently, the Sauter (d(3,2)) diameter varied from 40 to 50 μm. In all investigated conditions, D(50) was between 75% and 95%. However, the BSD might present a different profile (with a bimodal curve trend) when analyzing the volumetric frequency distribution (in some cases with the appearance of peaks in diameters ranging from 90 to 100 μm). Regarding volumetric frequency analysis, all the investigated parameters can modify the BSD in the DAF contact zone after the release point, thus potentially causing changes in DAF kinetics. This finding prompts further research in order to verify the effect of these BSD changes on solid particle removal efficiency by DAF.
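The two averages reported above differ because the Sauter diameter d(3,2) weights large bubbles more heavily: it is the ratio of the sum of cubed diameters to the sum of squared diameters. A minimal sketch follows, using hypothetical bubble diameters in any consistent length unit; it is not the authors' image-analysis code.

```python
from typing import Sequence


def linear_mean_diameter(diameters: Sequence[float]) -> float:
    """Arithmetic (number-weighted) mean diameter."""
    return sum(diameters) / len(diameters)


def sauter_mean_diameter(diameters: Sequence[float]) -> float:
    """Sauter mean d(3,2) = sum(d^3) / sum(d^2)."""
    return sum(d ** 3 for d in diameters) / sum(d ** 2 for d in diameters)


# Hypothetical sample of measured bubble diameters.
sample = [10.0, 20.0, 30.0]
d_lin = linear_mean_diameter(sample)
d_32 = sauter_mean_diameter(sample)
```

For any sample with unequal diameters the Sauter mean exceeds the linear mean, which is consistent with the larger d(3,2) range quoted in the abstract.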
Abstract:
For a fixed family F of graphs, an F-packing in a graph G is a set of pairwise vertex-disjoint subgraphs of G, each isomorphic to an element of F. Finding an F-packing that maximizes the number of covered edges is a natural generalization of the maximum matching problem, which is just F = {K(2)}. In this paper we provide new approximation algorithms and hardness results for the K(r)-packing problem where K(r) = {K(2), K(3), ..., K(r)}. We show that already for r = 3 the K(r)-packing problem is APX-complete, and, in fact, we show that it remains so even for graphs with maximum degree 4. On the positive side, we give an approximation algorithm with approximation ratio at most 2 for every fixed r. For r = 3, 4, 5 we obtain better approximations. For r = 3 we obtain a simple 3/2-approximation, achieving a known ratio that follows from a more involved algorithm of Halldorsson. For r = 4, we obtain a (3/2 + epsilon)-approximation, and for r = 5 we obtain a (25/14 + epsilon)-approximation. (C) 2008 Elsevier B.V. All rights reserved.
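The objective of the packing problem can be made concrete with a small checker. The sketch below is not one of the paper's approximation algorithms; it only validates a proposed packing of cliques of size 2 to r and counts the edges it covers. The example graph and the function names are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Set

Edge = FrozenSet[int]


def is_clique(vertices: Iterable[int], edges: Set[Edge]) -> bool:
    """True if every pair of the given vertices is joined by an edge."""
    return all(frozenset(p) in edges for p in combinations(sorted(vertices), 2))


def covered_edges(packing: List[Set[int]], edges: Set[Edge], r: int) -> int:
    """Validate a packing of cliques K(2)..K(r) and return the number of covered edges."""
    used: Set[int] = set()
    total = 0
    for part in packing:
        k = len(part)
        if not 2 <= k <= r:
            raise ValueError("each part must have between 2 and r vertices")
        if used & part:
            raise ValueError("parts must be pairwise vertex-disjoint")
        if not is_clique(part, edges):
            raise ValueError("each part must induce a complete subgraph")
        used |= part
        total += k * (k - 1) // 2  # a clique on k vertices covers k choose 2 edges
    return total


# Hypothetical graph: a triangle 1-2-3 with a path 3-4-5 attached.
edges = {frozenset(e) for e in [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]}
with_triangle = covered_edges([{1, 2, 3}, {4, 5}], edges, r=3)
matching_only = covered_edges([{1, 2}, {3, 4}], edges, r=3)
```

On this graph, allowing a triangle covers more edges than any matching can, which is what makes the problem a strict generalization of maximum matching.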
Abstract:
This paper describes the development and evaluation of a sequential injection method to automate the determination of methyl parathion by square wave adsorptive cathodic stripping voltammetry, exploiting the concept of monosegmented flow analysis to perform in-line sample conditioning and standard addition. Accumulation and stripping steps are made in the sample medium conditioned with 40 mmol L(-1) Britton-Robinson buffer (pH 10) in 0.25 mol L(-1) NaNO3. The homogenized mixture is injected at a flow rate of 10 μL s(-1) toward the flow cell, which is adapted to the capillary of a hanging drop mercury electrode. After a suitable deposition time, the flow is stopped and the potential is scanned from -0.3 to -1.0 V versus Ag/AgCl at a frequency of 250 Hz and a pulse height of 25 mV. The linear dynamic range is observed for methyl parathion concentrations between 0.010 and 0.50 mg L(-1), with detection and quantification limits of 2 and 7 μg L(-1), respectively. The sampling throughput is 25 h(-1) if the in-line standard addition and sample conditioning protocols are followed, but this frequency can be increased up to 61 h(-1) if the sample is conditioned off-line and quantified using an external calibration curve. The method was applied to the determination of methyl parathion in spiked water samples, and the accuracy was evaluated either by comparison to high performance liquid chromatography with UV detection or by the recovery percentages. Although no evidence of statistically significant differences was observed between the expected and obtained concentrations, because of the susceptibility of the method to interference by other pesticides (e.g., parathion, dichlorvos) and natural organic matter (e.g., fulvic and humic acids), isolation of the analyte may be required when more complex sample matrices are encountered. (C) 2007 Elsevier B.V. All rights reserved.
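Quantification by standard addition, as mentioned above, is commonly done by fitting a straight line to signal versus added concentration and extrapolating to zero signal. The sketch below shows that calculation under simplifying assumptions (a linear response and negligible dilution); it is not the authors' software, and the data points are hypothetical.

```python
from typing import Sequence


def standard_addition(added: Sequence[float], signal: Sequence[float]) -> float:
    """Estimate the original analyte concentration as intercept / slope
    of the least-squares line signal = intercept + slope * added."""
    n = len(added)
    mean_x = sum(added) / n
    mean_y = sum(signal) / n
    sxx = sum((x - mean_x) ** 2 for x in added)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(added, signal))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept / slope


# Hypothetical data: response of 2 signal units per mg/L, true concentration 0.05 mg/L.
added = [0.0, 0.1, 0.2, 0.3]
signal = [0.1, 0.3, 0.5, 0.7]
estimate = standard_addition(added, signal)
```

Because the calibration is built in the sample itself, this approach compensates for matrix effects on the slope, at the cost of the lower sampling throughput noted in the abstract.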