17 resultados para Spatial Data mining
em Biblioteca Digital da Produção Intelectual da Universidade de São Paulo (BDPI/USP)
Resumo:
Geographic Data Warehouses (GDW) are one of the main technologies used in decision-making processes and spatial analysis, and the literature proposes several conceptual and logical data models for GDW. However, little effort has been focused on studying how spatial data redundancy affects SOLAP (Spatial On-Line Analytical Processing) query performance over GDW. In this paper, we investigate this issue. Firstly, we compare redundant and non-redundant GDW schemas and conclude that redundancy is related to high performance losses. We also analyze the issue of indexing, aiming at improving SOLAP query performance on a redundant GDW. Comparisons of the SB-index approach, the star-join aided by R-tree and the star-join aided by GiST indicate that the SB-index significantly improves the elapsed time in query processing from 25% up to 99% with regard to SOLAP queries defined over the spatial predicates of intersection, enclosure and containment and applied to roll-up and drill-down operations. We also investigate the impact of the increase in data volume on the performance. The increase did not impair the performance of the SB-index, which highly improved the elapsed time in query processing. Performance tests also show that the SB-index is far more compact than the star-join, requiring only a small fraction of at most 0.20% of the volume. Moreover, we propose a specific enhancement of the SB-index to deal with spatial data redundancy. This enhancement improved performance from 80 to 91% for redundant GDW schemas.
Resumo:
Melanoma is a highly aggressive and therapy resistant tumor for which the identification of specific markers and therapeutic targets is highly desirable. We describe here the development and use of a bioinformatic pipeline tool, made publicly available under the name of EST2TSE, for the in silico detection of candidate genes with tissue-specific expression. Using this tool we mined the human EST (Expressed Sequence Tag) database for sequences derived exclusively from melanoma. We found 29 UniGene clusters of multiple ESTs with the potential to predict novel genes with melanoma-specific expression. Using a diverse panel of human tissues and cell lines, we validated the expression of a subset of three previously uncharacterized genes (clusters Hs.295012, Hs.518391, and Hs.559350) to be highly restricted to melanoma/melanocytes and named them RMEL1, 2 and 3, respectively. Expression analysis in nevi, primary melanomas, and metastatic melanomas revealed RMEL1 as a novel melanocytic lineage-specific gene up-regulated during melanoma development. RMEL2 expression was restricted to melanoma tissues and glioblastoma. RMEL3 showed strong up-regulation in nevi and was lost in metastatic tumors. Interestingly, we found correlations of RMEL2 and RMEL3 expression with improved patient outcome, suggesting tumor and/or metastasis suppressor functions for these genes. The three genes are composed of multiple exons and map to 2q12.2, 1q25.3, and 5q11.2, respectively. They are well conserved throughout primates, but not other genomes, and were predicted as having no coding potential, although primate-conserved and human-specific short ORFs could be found. Hairpin RNA secondary structures were also predicted. Concluding, this work offers new melanoma-specific genes for future validation as prognostic markers or as targets for the development of therapeutic strategies to treat melanoma.
Resumo:
This work proposes a method based on both preprocessing and data mining with the objective of identify harmonic current sources in residential consumers. In addition, this methodology can also be applied to identify linear and nonlinear loads. It should be emphasized that the entire database was obtained through laboratory essays, i.e., real data were acquired from residential loads. Thus, the residential system created in laboratory was fed by a configurable power source and in its output were placed the loads and the power quality analyzers (all measurements were stored in a microcomputer). So, the data were submitted to pre-processing, which was based on attribute selection techniques in order to minimize the complexity in identifying the loads. A newer database was generated maintaining only the attributes selected, thus, Artificial Neural Networks were trained to realized the identification of loads. In order to validate the methodology proposed, the loads were fed both under ideal conditions (without harmonics), but also by harmonic voltages within limits pre-established. These limits are in accordance with IEEE Std. 519-1992 and PRODIST (procedures to delivery energy employed by Brazilian`s utilities). The results obtained seek to validate the methodology proposed and furnish a method that can serve as alternative to conventional methods.
Resumo:
Background: The inherent complexity of statistical methods and clinical phenomena compel researchers with diverse domains of expertise to work in interdisciplinary teams, where none of them have a complete knowledge in their counterpart's field. As a result, knowledge exchange may often be characterized by miscommunication leading to misinterpretation, ultimately resulting in errors in research and even clinical practice. Though communication has a central role in interdisciplinary collaboration and since miscommunication can have a negative impact on research processes, to the best of our knowledge, no study has yet explored how data analysis specialists and clinical researchers communicate over time. Methods/Principal Findings: We conducted qualitative analysis of encounters between clinical researchers and data analysis specialists (epidemiologist, clinical epidemiologist, and data mining specialist). These encounters were recorded and systematically analyzed using a grounded theory methodology for extraction of emerging themes, followed by data triangulation and analysis of negative cases for validation. A policy analysis was then performed using a system dynamics methodology looking for potential interventions to improve this process. Four major emerging themes were found. Definitions using lay language were frequently employed as a way to bridge the language gap between the specialties. Thought experiments presented a series of ""what if'' situations that helped clarify how the method or information from the other field would behave, if exposed to alternative situations, ultimately aiding in explaining their main objective. Metaphors and analogies were used to translate concepts across fields, from the unfamiliar to the familiar. Prolepsis was used to anticipate study outcomes, thus helping specialists understand the current context based on an understanding of their final goal. Conclusion/Significance: The communication between clinical researchers and data analysis specialists presents multiple challenges that can lead to errors.
Resumo:
Visualization of high-dimensional data requires a mapping to a visual space. Whenever the goal is to preserve similarity relations a frequent strategy is to use 2D projections, which afford intuitive interactive exploration, e. g., by users locating and selecting groups and gradually drilling down to individual objects. In this paper, we propose a framework for projecting high-dimensional data to 3D visual spaces, based on a generalization of the Least-Square Projection (LSP). We compare projections to 2D and 3D visual spaces both quantitatively and through a user study considering certain exploration tasks. The quantitative analysis confirms that 3D projections outperform 2D projections in terms of precision. The user study indicates that certain tasks can be more reliably and confidently answered with 3D projections. Nonetheless, as 3D projections are displayed on 2D screens, interaction is more difficult. Therefore, we incorporate suitable interaction functionalities into a framework that supports 3D transformations, predefined optimal 2D views, coordinated 2D and 3D views, and hierarchical 3D cluster definition and exploration. For visually encoding data clusters in a 3D setup, we employ color coding of projected data points as well as four types of surface renderings. A second user study evaluates the suitability of these visual encodings. Several examples illustrate the framework`s applicability for both visual exploration of multidimensional abstract (non-spatial) data as well as the feature space of multi-variate spatial data.
Resumo:
Most multidimensional projection techniques rely on distance (dissimilarity) information between data instances to embed high-dimensional data into a visual space. When data are endowed with Cartesian coordinates, an extra computational effort is necessary to compute the needed distances, making multidimensional projection prohibitive in applications dealing with interactivity and massive data. The novel multidimensional projection technique proposed in this work, called Part-Linear Multidimensional Projection (PLMP), has been tailored to handle multivariate data represented in Cartesian high-dimensional spaces, requiring only distance information between pairs of representative samples. This characteristic renders PLMP faster than previous methods when processing large data sets while still being competitive in terms of precision. Moreover, knowing the range of variation for data instances in the high-dimensional space, we can make PLMP a truly streaming data projection technique, a trait absent in previous methods.
Resumo:
Objetivou-se com este trabalho utilizar regras de associação para identificar forças de mercado que regem a comercialização de touros com avaliação genética pelo programa Nelore Brasil. Essas regras permitem evidenciar padrões implícitos nas transações de grandes bases de dados, indicando causas e efeitos determinantes da oferta e comercialização de touros. Na análise foram considerados 19.736 registros de touros comercializados, 17 fazendas e 15 atributos referentes às diferenças esperadas nas progênies dos reprodutores, local e época da venda. Utilizou-se um sistema com interface gráfica usuário-dirigido que permite geração e seleção interativa de regras de associação. Análise de Pareto foi aplicada para as três medidas objetivas (suporte, confiança e lift) que acompanham cada uma das regras de associação, para validação das mesmas. Foram geradas 2.667 regras de associação, 164 consideradas úteis pelo usuário e 107 válidas para lift ≥ 1,0505. As fazendas participantes do programa Nelore Brasil apresentam especializações na oferta de touros, segundo características para habilidade materna, ganho de peso, fertilidade, precocidade sexual, longevidade, rendimento e terminação de carcaça. Os perfis genéticos dos touros são diferentes para as variedades padrão e mocho. Algumas regiões brasileiras são nichos de mercado para touros sem registro genealógico. A análise de evolução de mercado sugere que o mérito genético total, índice oficial do programa Nelore Brasil, tornou-se um importante índice para comercialização dos touros. Com o uso das regras de associação, foi possível descobrir forças do mercado e identificar combinações de atributos genéticos, geográficos e temporais que determinam a comercialização de touros no programa Nelore Brasil.
Resumo:
Age-related changes in running kinematics have been reported in the literature using classical inferential statistics. However, this approach has been hampered by the increased number of biomechanical gait variables reported and subsequently the lack of differences presented in these studies. Data mining techniques have been applied in recent biomedical studies to solve this problem using a more general approach. In the present work, we re-analyzed lower extremity running kinematic data of 17 young and 17 elderly male runners using the Support Vector Machine (SVM) classification approach. In total, 31 kinematic variables were extracted to train the classification algorithm and test the generalized performance. The results revealed different accuracy rates across three different kernel methods adopted in the classifier, with the linear kernel performing the best. A subsequent forward feature selection algorithm demonstrated that with only six features, the linear kernel SVM achieved 100% classification performance rate, showing that these features provided powerful combined information to distinguish age groups. The results of the present work demonstrate potential in applying this approach to improve knowledge about the age-related differences in running gait biomechanics and encourages the use of the SVM in other clinical contexts. (C) 2010 Elsevier Ltd. All rights reserved.
Resumo:
Despite modern weed control practices, weeds continue to be a threat to agricultural production. Considering the variability of weeds, a classification methodology for the risk of infestation in agricultural zones using fuzzy logic is proposed. The inputs for the classification are attributes extracted from estimated maps for weed seed production and weed coverage using kriging and map analysis and from the percentage of surface infested by grass weeds, in order to account for the presence of weed species with a high rate of development and proliferation. The output for the classification predicts the risk of infestation of regions of the field for the next crop. The risk classification methodology described in this paper integrates analysis techniques which may help to reduce costs and improve weed control practices. Results for the risk classification of the infestation in a maize crop field are presented. To illustrate the effectiveness of the proposed system, the risk of infestation over the entire field is checked against the yield loss map estimated by kriging and also with the average yield loss estimated from a hyperbolic model.
Resumo:
The productivity associated with commonly available disassembly methods today seldomly makes disassembly the preferred end-of-life solution for massive take back product streams. Systematic reuse of parts or components, or recycling of pure material fractions are often not achievable in an economically sustainable way. In this paper a case-based review of current disassembly practices is used to analyse the factors influencing disassembly feasibility. Data mining techniques were used to identify major factors influencing the profitability of disassembly operations. Case characteristics such as involvement of the product manufacturer in the end-of-life treatment and continuous ownership are some of the important dimensions. Economic models demonstrate that the efficiency of disassembly operations should be increased an order of magnitude to assure the competitiveness of ecologically preferred, disassembly oriented end-of-life scenarios for large waste of electric and electronic equipment (WEEE) streams. Technological means available to increase the productivity of the disassembly operations are summarized. Automated disassembly techniques can contribute to the robustness of the process, but do not allow to overcome the efficiency gap if not combined with appropriate product design measures. Innovative, reversible joints, collectively activated by external trigger signals, form a promising approach to low cost, mass disassembly in this context. A short overview of the state-of-the-art in the development of such self-disassembling joints is included. (c) 2008 CIRP.
Resumo:
Several aspects of photoperception and light signal transduction have been elucidated by studies with model plants. However, the information available for economically important crops, such as Fabaceae species, is scarce. In order to incorporate the existing genomic tools into a strategy to advance soybean research, we have investigated publicly available expressed sequence tag ( EST) sequence databases in order to identify Glycine max sequences related to genes involved in light-regulated developmental control in model plants. Approximately 38,000 sequences from open-access databases were investigated, and all bona fide and putative photoreceptor gene families were found in soybean sequence databases. We have identified G. max orthologs for several families of transcriptional regulators and cytoplasmic proteins mediating photoreceptor-induced responses, although some important Arabidopsis phytochrome-signaling components are absent. Moreover, soybean and Arabidopsis gene-family homologs appear to have undergone a distinct expansion process in some cases. We propose a working model of light perception, signal transduction and response-eliciting in G. max, based on the identified key components from Arabidopsis. These results demonstrate the power of comparative genomics between model systems and crop species to elucidate several aspects of plant physiology and metabolism.
Resumo:
Study Design: Data mining of single nucleotide polymorphisms (SNPs) in gene pathways related to spinal cord injury (SCI). Objectives: To identify gene polymorphisms putatively implicated with neuronal damage evolution pathways, potentially useful to SCI study. Setting: Departments of Psychiatry and Orthopedics, Faculdade de Medicina, Universidade de Sao Paulo, Brazil. Methods: Genes involved with processes related to SCI, such as apoptosis, inflammatory response, axonogenesis, peripheral nervous system development and axon ensheathment, were determined by evaluating the `Biological Process` annotation of Gene Ontology (GO). Each gene of these pathways was mapped using MapViewer, and gene coordinates were used to identify their polymorphisms in the SNP database. As a proof of concept, the frequency of subset of SNPs, located in four genes (ALOX12, APOE, BDNF and NINJ1) was evaluated in the DNA of a group of 28 SCI patients and 38 individuals with no SC lesions. Results: We could identify a total of 95 276 SNPs in a set of 588 genes associated with the selected GO terms, including 3912 nucleotide alterations located in coding regions of genes. The five non-synonymous SNPs genotyped in our small group of patients, showed a significant frequency, reinforcing their potential use for the investigation of SCI evolution. Conclusion: Despite the importance of SNPs in many aspects of gene expression and protein activity, these gene alterations have not been explored in SCI research. Here we describe a set of potentially useful SNPs, some of which could underlie the genetic mechanisms involved in the post trauma spinal cord damage.
Resumo:
Tick-borne zoonoses (TBZ) are emerging diseases worldwide. A large amount of information (e.g. case reports, results of epidemiological surveillance, etc.) is dispersed through various reference sources (ISI and non-ISI journals, conference proceedings, technical reports, etc.). An integrated database-derived from the ICTTD-3 project (http://www.icttd.nl)-was developed in order to gather TBZ records in the (sub-)tropics, collected both by the authors and collaborators worldwide. A dedicated website (http://www.tickbornezoonoses.org) was created to promote collaboration and circulate information. Data collected are made freely available to researchers for analysis by spatial methods, integrating mapped ecological factors for predicting TBZ risk. The authors present the assembly process of the TBZ database: the compilation of an updated list of TBZ relevant for (sub-)tropics, the database design and its structure, the method of bibliographic search, the assessment of spatial precision of geo-referenced records. At the time of writing, 725 records extracted from 337 publications related to 59 countries in the (sub-)tropics, have been entered in the database. TBZ distribution maps were also produced. Imported cases have been also accounted for. The most important datasets with geo-referenced records were those on Spotted Fever Group rickettsiosis in Latin-America and Crimean-Congo Haemorrhagic Fever in Africa. The authors stress the need for international collaboration in data collection to update and improve the database. Supervision of data entered remains always necessary. Means to foster collaboration are discussed. The paper is also intended to describe the challenges encountered to assemble spatial data from various sources and to help develop similar data collections.
Resumo:
Prior to deforestation, So Paulo State had 79,000 km(2) covered by Cerrado (Brazilian savanna) physiognomies, but today less than 8.5% of this biodiversity hotspot remains, mostly in private lands. The global demand for agricultural goods has imposed strong pressure on natural areas, and the economic decisions of agribusiness managers are crucial to the fate of Cerrado domain remaining areas (CDRA) in Brazil. Our aim was to investigate the effectiveness of Brazilian private protected areas policy, and to propose a feasible alternative to promote CDRA protection. This article assessed the main agribusiness opportunity costs for natural areas preservation: the land use profitability and the arable land price. The CDRA percentage and the opportunity costs were estimated for 349 municipal districts of So Paulo State through secondary spatial data and profitability values of 38 main agricultural products. We found that Brazilian private protected areas policy fails to preserve CDRA, although the values of non-compliance fines were higher than average opportunity costs. The scenario with very restrictive laws on private protected areas and historical high interest rates allowed us to conceive a feasible cross compliance proposal to improve environmental and agricultural policies.
Resumo:
Predictive performance evaluation is a fundamental issue in design, development, and deployment of classification systems. As predictive performance evaluation is a multidimensional problem, single scalar summaries such as error rate, although quite convenient due to its simplicity, can seldom evaluate all the aspects that a complete and reliable evaluation must consider. Due to this, various graphical performance evaluation methods are increasingly drawing the attention of machine learning, data mining, and pattern recognition communities. The main advantage of these types of methods resides in their ability to depict the trade-offs between evaluation aspects in a multidimensional space rather than reducing these aspects to an arbitrarily chosen (and often biased) single scalar measure. Furthermore, to appropriately select a suitable graphical method for a given task, it is crucial to identify its strengths and weaknesses. This paper surveys various graphical methods often used for predictive performance evaluation. By presenting these methods in the same framework, we hope this paper may shed some light on deciding which methods are more suitable to use in different situations.