16 results for databases and data mining

in the Biblioteca Digital da Produção Intelectual da Universidade de São Paulo


Relevance: 100.00%

Abstract:

The reproductive performance of cattle may be influenced by several factors, but mineral imbalances are crucial in terms of direct effects on reproduction. Several studies have shown that elements such as calcium, copper, iron, magnesium, selenium, and zinc are essential for reproduction and can prevent oxidative stress. However, toxic elements such as lead, nickel, and arsenic can have adverse effects on reproduction. In this paper, we applied a simple and fast method of multi-element analysis to bovine semen samples from Zebu and European classes used in reproduction programs and artificial insemination. Samples were analyzed by inductively coupled plasma mass spectrometry (ICP-MS) using aqueous-medium calibration, after being diluted 1:50 in a solution containing 0.01% (vol/vol) Triton X-100 and 0.5% (vol/vol) nitric acid. Rhodium, iridium, and yttrium were used as the internal standards for ICP-MS analysis. To develop a reliable method of tracing the class of bovine semen, we used data mining techniques that make it possible to classify unknown samples after checking the differentiation of known-class samples. Based on the determination of 15 elements in 41 samples of bovine semen, 3 machine-learning tools for classification were applied to determine cattle class. Our results demonstrate the potential of support vector machine (SVM), multilayer perceptron (MLP), and random forest (RF) chemometric tools to identify cattle class. Moreover, the selection tools made it possible to reduce the number of chemical elements needed from 15 to just 8.
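
As an illustration of the kind of chemometric classification described above, the sketch below trains the three classifiers mentioned (SVM, MLP, RF) on a table of element concentrations with scikit-learn. This is not the authors' pipeline: the synthetic data, the library choice and the parameters are assumptions made only for illustration.

    # Minimal sketch (not the paper's pipeline): classify cattle class (Zebu vs. European)
    # from element concentrations with SVM, MLP and Random Forest. Data are synthetic.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(41, 15))      # 41 samples x 15 element concentrations (synthetic)
    y = rng.integers(0, 2, size=41)    # 0 = Zebu, 1 = European (synthetic labels)

    models = {
        "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
        "MLP": make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000)),
        "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validated accuracy
        print(f"{name}: mean accuracy = {scores.mean():.2f}")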

Relevance: 100.00%

Abstract:

Multi-element analysis of honey samples was carried out with the aim of developing a reliable method of tracing the origin of honey. Forty-two chemical elements were determined (Al, Cu, Pb, Zn, Mn, Cd, Tl, Co, Ni, Rb, Ba, Be, Bi, U, V, Fe, Pt, Pd, Te, Hf, Mo, Sn, Sb, P, La, Mg, I, Sm, Tb, Dy, Sd, Th, Pr, Nd, Tm, Yb, Lu, Gd, Ho, Er, Ce, Cr) by inductively coupled plasma mass spectrometry (ICP-MS). Then, three machine learning tools for classification and two for attribute selection were applied in order to prove that it is possible to use data mining tools to find the region where honey originated. Our results clearly demonstrate the potential of Support Vector Machine (SVM), Multilayer Perceptron (MLP) and Random Forest (RF) chemometric tools for honey origin identification. Moreover, the selection tools allowed a reduction from 42 trace element concentrations to only 5.
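
The reduction from 42 measured elements to 5 is an attribute-selection step. A rough analogue using recursive feature elimination in scikit-learn is sketched below; the data are synthetic and this is not necessarily the selection tool used in the study.

    # Hedged sketch of attribute selection: keep the 5 most informative of 42 elements.
    # Synthetic data; the study's own selection tools may differ.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFE

    rng = np.random.default_rng(1)
    X = rng.normal(size=(120, 42))       # honey samples x 42 element concentrations
    y = rng.integers(0, 3, size=120)     # three hypothetical regions of origin

    selector = RFE(RandomForestClassifier(n_estimators=200, random_state=0),
                   n_features_to_select=5)
    selector.fit(X, y)
    selected = np.flatnonzero(selector.support_)   # indices of the 5 retained elements
    print("retained element columns:", selected)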

Relevance: 100.00%

Abstract:

Background: The multi-relational approach has emerged as an alternative for analyzing structured data such as relational databases, since it allows data mining to be applied to multiple tables directly, avoiding expensive join operations and the associated semantic losses. This work proposes an algorithm with a multi-relational approach.

Methods: To compare the performance of the traditional and multi-relational approaches for mining association rules, this paper presents an empirical study between PatriciaMine, a traditional algorithm, and its proposed multi-relational counterpart, MR-Radix.

Results: This work showed the performance advantages of the multi-relational approach over several tables, which avoids the high cost of joining multiple tables and the semantic losses involved. MR-Radix runs faster than PatriciaMine, despite handling complex multi-relational patterns. Memory usage shows a more conservative growth curve for MR-Radix than for PatriciaMine, indicating that an increase in the number of frequent items in MR-Radix does not lead to the significant growth in memory consumption observed with PatriciaMine.

Conclusion: The comparative study between PatriciaMine and MR-Radix confirmed the efficacy of the multi-relational approach to data mining, both in execution time and in memory usage. In addition, the proposed multi-relational algorithm, unlike other algorithms of this approach, is efficient for use in large relational databases.
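
To make the cost that MR-Radix avoids concrete, the toy sketch below follows the traditional route: join two relational tables into a single table and then count frequent item pairs over the joined transactions. The tables, columns and support threshold are invented for illustration; MR-Radix itself mines the separate tables without this join.

    # Toy "traditional" association mining: join tables first, then count frequent pairs.
    # All data and thresholds are illustrative; MR-Radix would skip the join step.
    from collections import Counter
    from itertools import combinations

    import pandas as pd

    patients = pd.DataFrame({"pid": [1, 2, 3], "city": ["A", "A", "B"]})
    exams = pd.DataFrame({"pid": [1, 1, 2, 3], "exam": ["x", "y", "x", "y"]})

    joined = patients.merge(exams, on="pid")      # the potentially expensive join

    transactions = []                             # one transaction per patient
    for _, group in joined.groupby("pid"):
        transactions.append(set(group["city"]) | set(group["exam"]))

    pair_counts = Counter(pair for items in transactions
                          for pair in combinations(sorted(items), 2))
    min_support = 2
    print({pair: n for pair, n in pair_counts.items() if n >= min_support})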

Relevance: 100.00%

Abstract:

Background: The study and analysis of gene expression measurements is the primary focus of functional genomics. Once expression data are available, biologists are faced with the task of extracting (new) knowledge associated with the underlying biological phenomenon. Most often, in order to perform this task, biologists execute a number of analysis activities on the available gene expression dataset rather than a single analysis activity. The integration of heterogeneous tools and data sources to create an integrated analysis environment represents a challenging and error-prone task. Semantic integration enables the assignment of unambiguous meanings to data shared among different applications in an integrated environment, allowing the exchange of data in a semantically consistent and meaningful way. This work aims at developing an ontology-based methodology for the semantic integration of gene expression analysis tools and data sources. The proposed methodology relies on software connectors to support not only the access to heterogeneous data sources but also the definition of transformation rules on exchanged data.

Results: We have studied the different challenges involved in the integration of computer systems and the role software connectors play in this task. We have also studied a number of gene expression technologies, analysis tools and related ontologies in order to devise basic integration scenarios and propose a reference ontology for the gene expression domain. Then, we defined a number of activities and associated guidelines to prescribe how the development of connectors should be carried out. Finally, we applied the proposed methodology in the construction of three different integration scenarios involving the use of different tools for the analysis of different types of gene expression data.

Conclusions: The proposed methodology facilitates the development of connectors capable of semantically integrating different gene expression analysis tools and data sources. The methodology can be used in the development of connectors supporting both simple and nontrivial processing requirements, thus assuring accurate data exchange and information interpretation from exchanged data.
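
The sketch below is a deliberately small illustration of what such a software connector does: it maps tool-specific field names onto terms of a shared reference ontology and applies a simple transformation rule before data exchange. All identifiers, mappings and term IDs are hypothetical; they are not taken from the proposed reference ontology.

    # Toy "software connector": translate tool-specific fields to reference-ontology
    # terms and apply a transformation rule. All names and term IDs are hypothetical.
    REFERENCE_ONTOLOGY = {
        "probe_id": "GEXP:0000001",        # made-up ontology identifiers
        "expression_value": "GEXP:0000002",
        "sample_id": "GEXP:0000003",
    }

    TOOL_A_MAPPING = {"ProbeName": "probe_id", "Signal": "expression_value", "Array": "sample_id"}

    def connector_to_ontology(record, mapping):
        """Map a tool-specific record onto reference-ontology terms, applying
        a simple transformation rule (here: renaming plus float conversion)."""
        out = {}
        for src_field, value in record.items():
            concept = mapping.get(src_field)
            if concept is None:
                continue  # field has no agreed meaning; drop it
            if concept == "expression_value":
                value = float(value)
            out[REFERENCE_ONTOLOGY[concept]] = value
        return out

    print(connector_to_ontology({"ProbeName": "p_001", "Signal": "7.3", "Array": "s_01"},
                                TOOL_A_MAPPING))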

Relevance: 100.00%

Abstract:

The ubiquity of time series data across almost all human endeavors has produced a great interest in time series data mining in the last decade. While dozens of classification algorithms have been applied to time series, recent empirical evidence strongly suggests that simple nearest neighbor classification is exceptionally difficult to beat. The choice of distance measure used by the nearest neighbor algorithm is important, and depends on the invariances required by the domain. For example, motion capture data typically requires invariance to warping, and cardiology data requires invariance to the baseline (the mean value). Similarly, recent work suggests that for time series clustering, the choice of clustering algorithm is much less important than the choice of distance measure used. In this work we make a somewhat surprising claim: there is an invariance that the community seems to have missed, complexity invariance. Intuitively, the problem is that in many domains the different classes may have different complexities, and pairs of complex objects, even those which subjectively may seem very similar to the human eye, tend to be further apart under current distance measures than pairs of simple objects. This fact introduces errors in nearest neighbor classification, where some complex objects may be incorrectly assigned to a simpler class. Similarly, for clustering this effect can introduce errors by “suggesting” to the clustering algorithm that subjectively similar, but complex, objects belong in a sparser and larger-diameter cluster than is truly warranted. We introduce the first complexity-invariant distance measure for time series, and show that it generally produces significant improvements in classification and clustering accuracy. We further show that this improvement does not compromise efficiency, since we can lower-bound the measure and use a modification of the triangular inequality, thus making use of most existing indexing and data mining algorithms. We evaluate our ideas with the largest and most comprehensive set of time series mining experiments ever attempted in a single work, and show that complexity-invariant distance measures can produce improvements in classification and clustering in the vast majority of cases.
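
A commonly cited formulation of a complexity-invariant distance multiplies the Euclidean distance by a correction factor built from a complexity estimate of each series. The NumPy sketch below implements that formulation; the exact definition and normalization used in the paper may differ.

    # Complexity-invariant distance sketch: Euclidean distance scaled by a
    # complexity correction factor. Formulation assumed; details may differ.
    import numpy as np

    def complexity_estimate(t):
        # "Stretch" of the series: complex, wiggly series have larger values.
        return np.sqrt(np.sum(np.diff(t) ** 2))

    def cid(q, c):
        ed = np.linalg.norm(q - c)                      # plain Euclidean distance
        ce_q, ce_c = complexity_estimate(q), complexity_estimate(c)
        correction = max(ce_q, ce_c) / max(min(ce_q, ce_c), 1e-12)
        return ed * correction

    simple = np.sin(np.linspace(0, 2 * np.pi, 100))
    wiggly = simple + 0.3 * np.sin(np.linspace(0, 40 * np.pi, 100))
    print(cid(simple, wiggly), np.linalg.norm(simple - wiggly))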

Relevance: 100.00%

Abstract:

Given a large image set in which very few images have labels, how can we guess labels for the remaining majority? How can we spot images that need brand-new labels, different from the predefined ones? How can we summarize these data to route the user’s attention to what really matters? Here we answer all these questions. Specifically, we propose QuMinS, a fast, scalable solution to two problems: (i) low-labor labeling (LLL): given an image set in which very few images have labels, find the most appropriate labels for the rest; and (ii) mining and attention routing: in the same setting, find clusters, the top-N_O outlier images, and the N_R images that best represent the data. Experiments on satellite images spanning up to 2.25 GB show that, in contrast to state-of-the-art labeling techniques, QuMinS scales linearly with the data size, being up to 40 times faster than top competitors (GCap) while achieving better or equal accuracy; it spots images that potentially require unpredicted labels; and it works even with tiny initial label sets, i.e., nearly five examples. We also report a case study of our method’s practical usage to show that QuMinS is a viable tool for automatic coffee crop detection from remote sensing images.
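
QuMinS itself is not sketched here. As a generic stand-in for the low-labor labeling setting (a handful of labels propagated to a large unlabeled set), the snippet below uses scikit-learn's label spreading on synthetic feature vectors; this is a different algorithm from QuMinS and is shown only to make the problem setting concrete.

    # Generic semi-supervised stand-in for low-labor labeling (not QuMinS):
    # propagate ~5 known labels to 200 unlabeled items. Synthetic features.
    import numpy as np
    from sklearn.semi_supervised import LabelSpreading

    rng = np.random.default_rng(2)
    X = rng.random((200, 16))            # e.g., image feature vectors
    y = np.full(200, -1)                 # -1 marks unlabeled items
    y[:5] = [0, 1, 0, 1, 1]              # the tiny initial label set

    model = LabelSpreading(kernel="knn", n_neighbors=7)
    model.fit(X, y)
    print(model.transduction_[:20])      # guessed labels for the first items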

Relevance: 100.00%

Abstract:

In [1], the authors proposed a framework for automated clustering and visualization of biological data sets named AUTO-HDS. This letter is intended to complement that framework by showing that it is possible to get rid of a user-defined parameter in such a way that the clustering stage can be implemented more accurately while having reduced computational complexity.

Relevance: 100.00%

Abstract:

Background: The use of the knowledge produced by the sciences to promote human health is the main goal of translational medicine. To make it feasible, we need computational methods to handle the large amount of information that arises from bench to bedside, and to deal with its heterogeneity. A computational challenge that must be faced is to promote the integration of clinical, socio-demographic and biological data. In this effort, ontologies play an essential role as a powerful artifact for knowledge representation. Chado is a modular, ontology-oriented database model that gained popularity due to its robustness and flexibility as a generic platform to store biological data; however, it lacks support for representing clinical and socio-demographic information.

Results: We have implemented an extension of Chado, the Clinical Module, to allow the representation of this kind of information. Our approach consists of a framework for data integration through the use of a common reference ontology. The design of this framework has four levels: the data level, to store the data; the semantic level, to integrate and standardize the data through the use of ontologies; the application level, to manage clinical databases, ontologies and the data integration process; and the web interface level, to allow interaction between the user and the system. The Clinical Module was built based on the Entity-Attribute-Value (EAV) model. We also proposed a methodology to migrate data from legacy clinical databases to the integrative framework. A Chado instance was initialized using a relational database management system. The Clinical Module was implemented and the framework was loaded using data from a factual clinical research database. Clinical and demographic data, as well as biomaterial data, were obtained from patients with tumors of the head and neck. We implemented the IPTrans tool, a complete environment for data migration, which comprises: the construction of a model to describe the legacy clinical data, based on an ontology; the Extraction, Transformation and Load (ETL) process to extract the data from the source clinical database and load it into the Clinical Module of Chado; and the development of a web tool and a Bridge Layer to adapt the web tool to Chado, as well as to other applications.

Conclusions: Open-source computational solutions currently available for translational science do not have a model to represent biomolecular information and are not integrated with existing bioinformatics tools. On the other hand, existing genomic data models do not represent clinical patient data. A framework was developed to support translational research by integrating biomolecular information coming from different “omics” technologies with patients’ clinical and socio-demographic data. This framework should present some features: flexibility, compression and robustness. The experiments carried out on a use case demonstrated that the proposed system meets the requirements of flexibility and robustness, leading to the desired integration. The Clinical Module can be accessed at http://dcm.ffclrp.usp.br/caib/pg=iptrans.
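
As a minimal illustration of the Entity-Attribute-Value idea behind the Clinical Module, the sketch below stores one clinical fact as a (patient, attribute, value) row in SQLite. The table and column names and the ontology term are placeholders, not Chado's actual schema.

    # Minimal EAV sketch: one row per clinical fact. Names and the ontology term
    # are placeholders, not the actual Chado Clinical Module schema.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE attribute   (attribute_id INTEGER PRIMARY KEY, name TEXT, ontology_term TEXT);
    CREATE TABLE patient     (patient_id INTEGER PRIMARY KEY, external_code TEXT);
    CREATE TABLE observation (patient_id INTEGER, attribute_id INTEGER, value TEXT);
    """)
    conn.execute("INSERT INTO attribute VALUES (1, 'tumor_site', 'EX:0001')")  # placeholder term
    conn.execute("INSERT INTO patient VALUES (1, 'P-0001')")
    conn.execute("INSERT INTO observation VALUES (1, 1, 'larynx')")

    row = conn.execute("""
    SELECT p.external_code, a.name, o.value
    FROM observation o JOIN patient p USING (patient_id) JOIN attribute a USING (attribute_id)
    """).fetchone()
    print(row)   # ('P-0001', 'tumor_site', 'larynx')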

Relevance: 100.00%

Abstract:

Last Glacial Maximum simulated sea surface temperatures from the Paleo-Climate version of the National Center for Atmospheric Research Coupled Climate Model (NCAR-CCSM) are compared with available reconstructions and data-based products in the tropical and South Atlantic region. Model results are compared to data proxies based on the Multiproxy Approach for the Reconstruction of the Glacial Ocean surface (MARGO) product. Results show that the model sea surface temperature is not consistent with the proxy data in all of the region of interest. Discrepancies are found in the eastern, equatorial and high-latitude South Atlantic. The model overestimates the cooling in the southern South Atlantic (near 50 degrees S) shown by the proxy data. Near the equator, model and proxies are in better agreement. In the eastern part of the equatorial basin the model underestimates the cooling shown by all proxies. A northward shift in the position of the subtropical convergence zone in the simulation suggests a compression and/or an equatorward shift of the subtropical gyre at the surface, consistent with what is observed in the proxy reconstruction.

Relevance: 100.00%

Abstract:

Each plasma physics laboratory has a proprietary scheme for control and data acquisition, which usually differs from one laboratory to another. This means that each laboratory has its own way of controlling the experiment and retrieving data from the database. Fusion research relies to a great extent on international collaboration, and such private systems make it difficult to follow the work remotely. The TCABR data analysis and acquisition system has been upgraded to support a joint research programme using remote participation technologies. The choice of MDSplus (Model Driven System plus) is justified by the fact that it is widely used, so scientists from different institutions may use the same system in different experiments on different tokamaks without needing to know how each machine handles its acquisition and data analysis. Another important point is that MDSplus has a library system that allows communication between different languages (Java, Fortran, C, C++, Python) and programs such as MATLAB, IDL and Octave. In the case of the TCABR tokamak, interfaces (the object of this paper) between the system already in use and MDSplus were developed, instead of using MDSplus at all stages, from control and data acquisition to data analysis. This was done to preserve a complex system already in operation, which would otherwise take a long time to migrate; the implementation also allows new components to be added that use MDSplus fully at all stages.
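
To give a flavour of why a common data system helps remote collaborators, the snippet below reads a signal through the MDSplus Python thin-client interface. The server address, tree name, shot number and node path are placeholders, not TCABR's actual configuration.

    # Hedged example of remote data access via MDSplus (thin client).
    # Host, tree, shot and node path are placeholders, not TCABR's real setup.
    from MDSplus import Connection

    conn = Connection("mdsplus.example.org")            # remote MDSplus data server
    conn.openTree("tcabr", 12345)                       # tree name and shot number assumed
    signal = conn.get("\\TOP.DIAGNOSTICS:DENSITY")      # hypothetical node path
    time_axis = conn.get("dim_of(\\TOP.DIAGNOSTICS:DENSITY)")

    y = signal.data()                                   # numpy array with the samples
    t = time_axis.data()                                # corresponding time base
    print(len(y), len(t))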

Relevance: 100.00%

Abstract:

This work proposes a method for data clustering based on complex networks theory. A data set is represented as a network by considering different metrics to establish the connection between each pair of objects. The clusters are obtained by taking into account five community detection algorithms. The network-based clustering approach is applied to two real-world databases and two sets of artificially generated data. The obtained results suggest that the exponential of the Minkowski distance is the most suitable metric to quantify the similarities between pairs of objects. In addition, the community identification method based on greedy optimization provides the best cluster solution. We compare the network-based clustering approach with some traditional clustering algorithms and verify that it provides the lowest classification error rate.
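
A small sketch of the pipeline described above follows: build a graph whose edge weights come from the exponential of the (negated) Minkowski distance, then detect communities by greedy modularity optimization with NetworkX. The threshold and the toy data are illustrative assumptions, not the paper's settings.

    # Sketch of network-based clustering: similarity = exp(-Minkowski distance),
    # clusters = communities found by greedy modularity optimization. Toy data.
    import numpy as np
    import networkx as nx
    from networkx.algorithms import community

    def similarity_graph(X, p=2, threshold=0.5):
        G = nx.Graph()
        G.add_nodes_from(range(len(X)))
        for i in range(len(X)):
            for j in range(i + 1, len(X)):
                d = np.sum(np.abs(X[i] - X[j]) ** p) ** (1.0 / p)   # Minkowski distance
                s = float(np.exp(-d))                               # similarity in (0, 1]
                if s >= threshold:
                    G.add_edge(i, j, weight=s)
        return G

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(2, 0.2, (30, 2))])  # two blobs
    G = similarity_graph(X)
    clusters = community.greedy_modularity_communities(G, weight="weight")
    print([sorted(c)[:5] for c in clusters])   # first members of each detected cluster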

Relevance: 100.00%

Abstract:

Across the Americas and the Caribbean, nearly 561,000 slide-confirmed malaria infections were reported officially in 2008. The nine Amazonian countries accounted for 89% of these infections; Brazil and Peru alone contributed 56% and 7% of them, respectively. Local populations of the relatively neglected parasite Plasmodium vivax, which currently accounts for 77% of the regional malaria burden, are extremely diverse genetically and geographically structured. At a time when malaria elimination is placed on the public health agenda of several endemic countries, it remains unclear why malaria proved so difficult to control in areas of relatively low levels of transmission such as the Amazon Basin. We hypothesize that asymptomatic parasite carriage and massive environmental changes that affect vector abundance and behavior are major contributors to malaria transmission in epidemiologically diverse areas across the Amazon Basin. Here we review available data supporting this hypothesis and discuss their implications for current and future malaria intervention policies in the region. Given that locally generated scientific evidence is urgently required to support malaria control interventions in Amazonia, we briefly describe the aims of our current field-oriented malaria research in rural villages and gold-mining enclaves in Peru and a recently opened agricultural settlement in Brazil.

Relevance: 100.00%

Abstract:

The objective of this study is to retrospectively report the results of interventions for controlling a vancomycin-resistant enterococcus (VRE) outbreak in a tertiary-care pediatric intensive care unit (PICU) of a university hospital. After identification of the outbreak, interventions were made at the following levels: patient care, microbiological surveillance, and medical and nursing staff training. Data were collected from computer-based databases and from the electronic prescription system. Vancomycin use progressively increased after March 2008, peaking in August 2009. Five cases of VRE infection were identified, with 3 deaths. After the interventions, we noted a significant reduction in vancomycin prescription and use (a 75% reduction), and the last case of VRE infection was identified 4 months later. The survivors remained colonized until hospital discharge. After the interventions there was a transient increase in PICU length of stay and mortality. Since then, the use of vancomycin has remained relatively constant and strict, no other cases of VRE infection or colonization have been identified, and length of stay and mortality have returned to baseline. In conclusion, we showed that a bundle intervention aiming at strict control of vancomycin use and full compliance with the Hospital Infection Control Practices Advisory Committee guidelines, along with contact precautions and hand-hygiene promotion, can be effective in reducing vancomycin use and the emergence and spread of vancomycin-resistant bacteria in a tertiary-care PICU.

Relevance: 100.00%

Abstract:

The dipteran studied here is a native Brazilian insect that has become a valuable model system for developmental biology research because it provides an interesting opportunity to study a different type of insect oogenesis. Sequences from a cDNA library constructed with poly(A)+ RNA from the ovaries of larvae at different ages were analyzed. Molecular characterization confirmed interesting findings, such as the presence of nanos. The nanos gene encodes a conserved RNA-binding protein that is required during early development for the maintenance and division of the primordial germ cells of Diptera. Nanos plays an important role in specifying the posterior regions of insect embryos and is important for abdomen formation. In the present work, we showed the spatial and temporal expression profiles of this important gene, which is involved in oogenesis and early development. Data mining techniques were used to obtain the complete sequence of nanos. Bioinformatic tools were used for the following: (1) the secondary structure of the 3'-untranslated region of the mRNA, (2) the protein encoded by the isolated gene, (3) the conserved zinc-finger domains of the Nanos protein, and (4) phylogenetic analyses. Furthermore, RNA in situ hybridization and immunolocalization were used to determine mRNA and protein expression in the tissues studied and to define nanos as a germ cell molecular marker.

Relevance: 100.00%

Abstract:

Background: Mycelium-to-yeast transition in the human host is essential for pathogenicity by the fungus Paracoccidioides brasiliensis, and both cell types are therefore critical to the establishment of paracoccidioidomycosis (PCM), a systemic mycosis endemic to Latin America. The infected population is about 10 million individuals, 2% of whom will eventually develop the disease. Previously, transcriptome analysis of mycelium and yeast cells resulted in the assembly of 6,022 sequence groups. Gene expression analysis, using both in silico EST subtraction and cDNA microarray, revealed genes that were differentially expressed in yeast or mycelium, and we discussed those involved in sugar metabolism. To advance our understanding of the molecular mechanisms of the dimorphic transition, we performed an extended analysis of gene expression profiles using the methods mentioned above.

Results: In this work, continued data mining revealed 66 new differentially expressed sequences, which were MIPS (Munich Information Center for Protein Sequences)-categorised according to the cellular process in which they are presumably involved. Two well-represented classes were chosen for further analysis: (i) control of cell organisation (cell wall, membrane and cytoskeleton), whose representatives were hex (encoding a hexagonal peroxisome protein) and bgl (encoding a 1,3-β-glucosidase) in mycelium cells, and ags (an α-1,3-glucan synthase), cda (a chitin deacetylase) and vrp (a verprolin) in yeast cells; (ii) ion metabolism and transport, in which two genes putatively implicated in ion transport were confirmed to be highly expressed in mycelium cells, isc and ktp, respectively an iron-sulphur cluster-like protein and a cation transporter, together with a putative P-type cation pump (pct) in yeast. In addition, several enzymes of the de novo cysteine biosynthesis pathway were shown to be up-regulated in the yeast form, including ATP sulphurylase, APS kinase and PAPS reductase.

Conclusion: Taken together, these data show that several genes involved in cell organisation and ion metabolism/transport are differentially expressed during the dimorphic transition. The hyper-expression in yeast of the sulphur-metabolism enzymes reinforces the idea that this metabolic pathway could be important for this process. Understanding these changes through functional analysis of such genes may lead to a better understanding of the infective process, thus providing new targets and strategies to control PCM.
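
The paper's own differential-expression statistics are not reproduced here. As a generic illustration of in silico subtraction between two EST libraries, one can test whether a gene's read count differs between the yeast and mycelium libraries, for example with Fisher's exact test on invented counts, as sketched below; the study may have used a different statistic.

    # Generic illustration of in-silico EST subtraction (not the paper's statistic):
    # does a gene have proportionally more ESTs in the yeast library? Counts invented.
    from scipy.stats import fisher_exact

    yeast_hits, yeast_total = 18, 3000
    mycelium_hits, mycelium_total = 4, 3022

    table = [[yeast_hits, yeast_total - yeast_hits],
             [mycelium_hits, mycelium_total - mycelium_hits]]
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")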