9 results for DATA MINING
in Biblioteca Digital da Produção Intelectual da Universidade de São Paulo
Abstract:
The reproductive performance of cattle may be influenced by several factors, but mineral imbalances are crucial in terms of direct effects on reproduction. Several studies have shown that elements such as calcium, copper, iron, magnesium, selenium, and zinc are essential for reproduction and can prevent oxidative stress. However, toxic elements such as lead, nickel, and arsenic can have adverse effects on reproduction. In this paper, we applied a simple and fast method of multi-element analysis to bovine semen samples from Zebu and European classes used in reproduction programs and artificial insemination. Samples were analyzed by inductively coupled plasma mass spectrometry (ICP-MS) using aqueous medium calibration, after being diluted 1:50 in a solution containing 0.01% (vol/vol) Triton X-100 and 0.5% (vol/vol) nitric acid. Rhodium, iridium, and yttrium were used as the internal standards for ICP-MS analysis. To develop a reliable method of tracing the class of bovine semen, we used data mining techniques that make it possible to classify unknown samples after checking the differentiation of known-class samples. Based on the determination of 15 elements in 41 samples of bovine semen, 3 machine-learning tools for classification were applied to determine cattle class. Our results demonstrate the potential of support vector machine (SVM), multilayer perceptron (MLP), and random forest (RF) chemometric tools to identify cattle class. Moreover, the selection tools made it possible to reduce the number of chemical elements needed from 15 to just 8.
Abstract:
Multi-element analysis of honey samples was carried out with the aim of developing a reliable method of tracing the origin of honey. Forty-two chemical elements were determined (Al, Cu, Pb, Zn, Mn, Cd, Tl, Co, Ni, Rb, Ba, Be, Bi, U, V, Fe, Pt, Pd, Te, Hf, Mo, Sn, Sb, P, La, Mg, I, Sm, Tb, Dy, Sd, Th, Pr, Nd, Tm, Yb, Lu, Gd, Ho, Er, Ce, Cr) by inductively coupled plasma mass spectrometry (ICP-MS). Then, three machine learning tools for classification and two for attribute selection were applied in order to prove that it is possible to use data mining tools to find the region where honey originated. Our results clearly demonstrate the potential of Support Vector Machine (SVM), Multilayer Perceptron (MLP) and Random Forest (RF) chemometric tools for honey origin identification. Moreover, the selection tools allowed a reduction from 42 trace element concentrations to only 5. (C) 2012 Elsevier Ltd. All rights reserved.
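The abstract does not name the two attribute-selection tools, so the following is only a minimal sketch of one common filter-style approach: rank each attribute by its one-way ANOVA F statistic across the classes and keep the best few. The attribute names (`el_1`, `el_2`, `el_3`), the region labels, and every number are invented toy values for illustration, not data from the study.

```python
from statistics import mean

def f_score(values, labels):
    """One-way ANOVA F statistic of a single attribute across class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    grand_mean = mean(values)
    k, n = len(groups), len(values)
    # Variance explained by class membership (between groups).
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values()) / (k - 1)
    # Residual variance inside each class (within groups).
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g) / (n - k)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within

def top_attributes(table, labels, keep):
    """Return the names of the `keep` attributes with the highest F statistic."""
    ranked = sorted(table, key=lambda name: f_score(table[name], labels), reverse=True)
    return ranked[:keep]

# Hypothetical toy data: three attributes measured on six samples from two regions.
labels = ["A", "A", "A", "B", "B", "B"]
table = {
    "el_1": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],  # clearly separates the regions
    "el_2": [0.5, 0.7, 0.6, 0.6, 0.5, 0.7],  # same spread in both regions
    "el_3": [2.0, 2.4, 2.2, 2.6, 2.3, 2.5],  # weak separation
}
selected = top_attributes(table, labels, keep=2)
print(selected)
```

A filter like this scores each attribute independently of the classifier; the reduced attribute set would then be passed to the classification step (SVM, MLP, or RF in the abstract).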
Abstract:
Background: The multi-relational approach has emerged as an alternative for analyzing structured data such as relational databases, since it allows data mining to be applied to multiple tables directly, thus avoiding expensive join operations and semantic losses. This work proposes an algorithm that follows the multi-relational approach. Methods: To compare the performance of the traditional and multi-relational approaches to mining association rules, this paper presents an empirical study of PatriciaMine, a traditional algorithm, and its proposed multi-relational counterpart, MR-Radix. Results: This work showed the performance advantages of the multi-relational approach over several tables, which avoids both the high cost of join operations across multiple tables and semantic losses. MR-Radix ran faster than PatriciaMine, despite handling complex multi-relational patterns. Memory usage followed a more conservative growth curve for MR-Radix than for PatriciaMine, which shows that an increase in the number of frequent items does not lead to significant growth in memory usage in MR-Radix as it does in PatriciaMine. Conclusion: The comparative study of PatriciaMine and MR-Radix confirmed the efficacy of the multi-relational approach in the data mining process, in terms of both execution time and memory usage. In addition, the proposed multi-relational algorithm, unlike other algorithms of this approach, is efficient for use in large relational databases.
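For readers unfamiliar with the task both algorithms address, here is a minimal single-table sketch of frequent-itemset counting and rule confidence. It is a brute-force illustration only, not PatriciaMine or MR-Radix, and it does not touch the multi-relational case; the transactions and function names are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for every itemset found in at least min_support transactions."""
    items = sorted({item for t in transactions for item in t})
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_support:
                result[frozenset(candidate)] = count
                found_any = True
        if not found_any:
            # No frequent itemset of this size, so no larger one can be frequent.
            break
    return result

def confidence(itemsets, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent, from the support counts."""
    both = frozenset(antecedent) | frozenset(consequent)
    return itemsets[both] / itemsets[frozenset(antecedent)]

# Hypothetical toy transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
itemsets = frequent_itemsets(transactions, min_support=2)
rule_confidence = confidence(itemsets, {"butter"}, {"bread"})
print(len(itemsets), rule_confidence)
```

Enumerating candidates this way grows exponentially with the number of items, which is why dedicated structures and algorithms are used in practice.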
Abstract:
In [1], the authors proposed a framework for automated clustering and visualization of biological data sets named AUTO-HDS. This letter is intended to complement that framework by showing that a user-defined parameter can be eliminated in such a way that the clustering stage can be implemented more accurately and with reduced computational complexity.
Abstract:
We review recent visualization techniques aimed at supporting tasks that require the analysis of text documents, from approaches targeted at visually summarizing the relevant content of a single document to those aimed at assisting exploratory investigation of whole collections of documents. Techniques are organized considering their target input material (either single texts or collections of texts) and their focus, which may be at displaying content, emphasizing relevant relationships, highlighting the temporal evolution of a document or collection, or helping users to handle results from a query posed to a search engine. We describe the approaches adopted by distinct techniques and briefly review the strategies they employ to obtain meaningful text models, discuss how they extract the information required to produce representative visualizations, the tasks they intend to support and the interaction issues involved, and strengths and limitations. Finally, we show a summary of techniques, highlighting their goals and distinguishing characteristics. We also briefly discuss some open problems and research directions in the fields of visual text mining and text analytics.
Abstract:
In this paper we have quantified the consistency of word usage in written texts represented by complex networks, where words were taken as nodes, by measuring the degree of preservation of the node neighborhood. Words were considered highly consistent if the authors used them with the same neighborhood. When ranked according to the consistency of use, the words obeyed a log-normal distribution, in contrast to Zipf's law that applies to the frequency of use. Consistency correlated positively with the familiarity and frequency of use, and negatively with ambiguity and age of acquisition. An inspection of some highly consistent words confirmed that they are used in very limited semantic contexts. A comparison of consistency indices for eight authors indicated that these indices may be employed for author recognition. Indeed, as expected, authors of novels could be distinguished from those who wrote scientific texts. Our analysis demonstrated the suitability of the consistency indices, which can now be applied in other tasks, such as emotion recognition.
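The abstract does not define its consistency index, so the sketch below is not the authors' measure. It only illustrates the underlying idea of comparing a word's neighborhood across texts, assuming a simple word-adjacency network and a Jaccard overlap of neighbor sets; both choices, the function names, and the example sentences are assumptions made for illustration.

```python
import re

def neighbor_sets(text):
    """Map each word to the set of words that appear directly next to it."""
    words = re.findall(r"[a-z']+", text.lower())
    neighbors = {}
    for left, right in zip(words, words[1:]):
        neighbors.setdefault(left, set()).add(right)
        neighbors.setdefault(right, set()).add(left)
    return neighbors

def neighborhood_overlap(word, text_a, text_b):
    """Jaccard overlap of a word's neighbor sets in two texts (0 = disjoint, 1 = identical)."""
    a = neighbor_sets(text_a).get(word, set())
    b = neighbor_sets(text_b).get(word, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical example texts.
text_a = "the cat sat on the mat"
text_b = "the cat lay on the mat"
overlap_mat = neighborhood_overlap("mat", text_a, text_b)
overlap_on = neighborhood_overlap("on", text_a, text_b)
print(overlap_mat, overlap_on)
```

Under these assumptions, a word whose neighbors are the same in both texts scores 1, and a word used next to different words scores lower.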
Abstract:
The Dipteran [...] a native Brazilian insect that has become a valuable model system for developmental biology research because it provides an interesting opportunity to study a different type of insect oogenesis. Sequences from a cDNA library that was constructed with poly(A)+ RNA from the ovaries of larvae at different ages were analyzed. Molecular characterization confirmed interesting findings, such as the presence of [...]. The gene encodes a conserved RNA-binding protein that is required during early development for the maintenance and division of the primordial germ cells of Diptera. [...] plays an important role in specifying the posterior regions of insect embryos and is important for abdomen formation. In the present work, we showed the spatial and temporal expression profiles of this important gene, which is involved in oogenesis and early development. Data mining techniques were used to obtain the complete sequence of [...]. Bioinformatic tools were used to determine the following: (1) the secondary structure of the 3'-untranslated region of the mRNA, (2) the encoded protein of the isolated gene, (3) the conserved zinc-finger domains of the Nanos protein, and (4) phylogenetic analyses. Furthermore, RNA in situ hybridization and immunolocalization were used to determine mRNA and protein expression in the tissues that were studied and to define [...] as a germ cell molecular marker.
Abstract:
Background: Mycelium-to-yeast transition in the human host is essential for pathogenicity by the fungus Paracoccidioides brasiliensis, and both cell types are therefore critical to the establishment of paracoccidioidomycosis (PCM), a systemic mycosis endemic to Latin America. The infected population is about 10 million individuals, 2% of whom will eventually develop the disease. Previously, transcriptome analysis of mycelium and yeast cells resulted in the assembly of 6,022 sequence groups. Gene expression analysis, using both in silico EST subtraction and cDNA microarray, revealed genes that were differentially expressed in yeast or mycelium, and we discussed those involved in sugar metabolism. To advance our understanding of the molecular mechanisms of dimorphic transition, we performed an extended analysis of gene expression profiles using the methods mentioned above. Results: In this work, continuous data mining revealed 66 new differentially expressed sequences that were MIPS (Munich Information Center for Protein Sequences)-categorised according to the cellular process in which they are presumably involved. Two well represented classes were chosen for further analysis: (i) control of cell organisation – cell wall, membrane and cytoskeleton, whose representatives were hex (encoding a hexagonal peroxisome protein) and bgl (encoding a 1,3-β-glucosidase) in mycelium cells, and ags (an α-1,3-glucan synthase), cda (a chitin deacetylase) and vrp (a verprolin) in yeast cells; (ii) ion metabolism and transport – two genes putatively implicated in ion transport were confirmed to be highly expressed in mycelium cells, isc and ktp, respectively an iron-sulphur cluster-like protein and a cation transporter, and a putative P-type cation pump (pct) in yeast. Also, several enzymes from the cysteine de novo biosynthesis pathway were shown to be upregulated in the yeast form, including ATP sulphurylase, APS kinase and PAPS reductase.
Conclusion: Taken together, these data show that several genes involved in cell organisation and ion metabolism/transport are expressed differentially along the dimorphic transition. Overexpression in yeast of the enzymes of sulphur metabolism reinforces the idea that this metabolic pathway could be important for this process. Understanding these changes by functional analysis of such genes may lead to a better understanding of the infective process, thus providing new targets and strategies to control PCM.
Abstract:
The ubiquity of time series data across almost all human endeavors has produced a great interest in time series data mining in the last decade. While dozens of classification algorithms have been applied to time series, recent empirical evidence strongly suggests that simple nearest neighbor classification is exceptionally difficult to beat. The choice of distance measure used by the nearest neighbor algorithm is important, and depends on the invariances required by the domain. For example, motion capture data typically requires invariance to warping, and cardiology data requires invariance to the baseline (the mean value). Similarly, recent work suggests that for time series clustering, the choice of clustering algorithm is much less important than the choice of distance measure used. In this work we make a somewhat surprising claim. There is an invariance that the community seems to have missed: complexity invariance. Intuitively, the problem is that in many domains the different classes may have different complexities, and pairs of complex objects, even those which subjectively may seem very similar to the human eye, tend to be further apart under current distance measures than pairs of simple objects. This fact introduces errors in nearest neighbor classification, where some complex objects may be incorrectly assigned to a simpler class. Similarly, for clustering this effect can introduce errors by “suggesting” to the clustering algorithm that subjectively similar, but complex objects belong in a sparser and larger diameter cluster than is truly warranted. We introduce the first complexity-invariant distance measure for time series, and show that it generally produces significant improvements in classification and clustering accuracy. We further show that this improvement does not compromise efficiency, since we can lower-bound the measure and use a modification of the triangle inequality, thus making use of most existing indexing and data mining algorithms.
We evaluate our ideas with the largest and most comprehensive set of time series mining experiments ever attempted in a single work, and show that complexity-invariant distance measures can produce improvements in classification and clustering in the vast majority of cases.
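The abstract does not spell out the measure itself. The sketch below follows the form in which a complexity-invariant distance is commonly presented: the Euclidean distance multiplied by a correction factor, where the factor is the ratio of the larger to the smaller complexity estimate, and the complexity estimate is the root of the summed squared differences between consecutive points. Treat this form, the function names, the handling of flat series, and the toy series as assumptions of this sketch rather than a statement of the authors' exact definition.

```python
import math

def euclidean(q, c):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))

def complexity_estimate(series):
    """Root of the summed squared differences between consecutive points."""
    return math.sqrt(sum((series[i + 1] - series[i]) ** 2 for i in range(len(series) - 1)))

def cid(q, c):
    """Euclidean distance scaled by a complexity correction factor (always >= 1)."""
    ce_q, ce_c = complexity_estimate(q), complexity_estimate(c)
    low, high = min(ce_q, ce_c), max(ce_q, ce_c)
    if high == 0.0:
        factor = 1.0        # both series are flat: no correction (choice of this sketch)
    elif low == 0.0:
        factor = math.inf   # one flat, one not: ratio is unbounded (choice of this sketch)
    else:
        factor = high / low
    distance = euclidean(q, c)
    return 0.0 if distance == 0.0 else distance * factor

def nearest_neighbor_label(query, training, distance):
    """1-NN classification: label of the training series closest to the query."""
    return min(training, key=lambda item: distance(query, item[0]))[1]

# Hypothetical toy series: one smooth ramp and one jagged series.
smooth = [0.0, 1.0, 2.0, 3.0]
jagged = [0.0, 2.0, 0.0, 2.0]
plain = euclidean(smooth, jagged)
corrected = cid(smooth, jagged)
print(plain, corrected)
```

Because the correction factor is never below 1, the Euclidean distance is a lower bound on this measure, which is consistent with the abstract's remark that the measure can be lower-bounded. The `nearest_neighbor_label` helper accepts either `euclidean` or `cid` as its `distance` argument, so the two can be compared on the same training set.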