820 resultados para Data classification
Resumo:
This work proposes a method for data clustering based on complex networks theory. A data set is represented as a network by considering different metrics to establish the connection between each pair of objects. The clusters are obtained by taking into account five community detection algorithms. The network-based clustering approach is applied in two real-world databases and two sets of artificially generated data. The obtained results suggest that the exponential of the Minkowski distance is the most suitable metric to quantify the similarities between pairs of objects. In addition, the community identification method based on the greedy optimization provides the best cluster solution. We compare the network-based clustering approach with some traditional clustering algorithms and verify that it provides the lowest classification error rate. (C) 2012 Elsevier B.V. All rights reserved.
Resumo:
This paper presents a method for transforming the information of an engineering geological map into useful information for non-specialists involved in land-use planning. The method consists of classifying the engineering geological units in terms of land use capability and identifying the legal and the geologic restrictions that apply in the study area. Both informations are then superimposed over the land use and a conflict areas map is created. The analysis of these data leads to the identification of existing and forthcoming land use conflicts and enables the proposal of planning measures on a regional and local scale. The map for the regional planning was compiled at a 1:50,000 scale and encompasses the whole municipal land area where uses are mainly rural. The map for the local planning was compiled at a 1:10,000 scale and encompasses the urban area. Most of the classification and operations on maps used spatial analyst tools available in the Geographical Information System. The regional studies showed that the greater part of Analandia's territory presents appropriate land uses. The local-scale studies indicate that the majority of the densely occupied urban areas are in suitable land. Although the situation is in general positive, municipal policies should address the identified and expected land use conflicts, so that it can be further improved.
Resumo:
The attributes describing a data set may often be arranged in meaningful subsets, each of which corresponds to a different aspect of the data. An unsupervised algorithm (SCAD) that simultaneously performs fuzzy clustering and aspects weighting was proposed in the literature. However, SCAD may fail and halt given certain conditions. To fix this problem, its steps are modified and then reordered to reduce the number of parameters required to be set by the user. In this paper we prove that each step of the resulting algorithm, named ASCAD, globally minimizes its cost-function with respect to the argument being optimized. The asymptotic analysis of ASCAD leads to a time complexity which is the same as that of fuzzy c-means. A hard version of the algorithm and a novel validity criterion that considers aspect weights in order to estimate the number of clusters are also described. The proposed method is assessed over several artificial and real data sets.
Resumo:
Biogeography has been difficult to apply as a methodological approach because organismic biology is incomplete at levels where the process of formulating comparisons and analogies is complex. The study of insect biogeography became necessary because insects possess numerous evolutionary traits and play an important role as pollinators. Among insects, the euglossine bees, or orchid bees, attract interest because the study of their biology allows us to explain important steps in the evolution of social behavior and many other adaptive tradeoffs. We analyzed the distribution of morphological characteristics in Colombian orchid bees from an ecological perspective. The aim of this study was to observe the distribution of these attributes on a regional basis. Data corresponding to Colombian euglossine species were ordered with a correspondence analysis and with subsequent hierarchical clustering. Later, and based on community proprieties, we compared the resulting hierarchical model with the collection localities to seek to identify a biogeographic classification pattern. From this analysis, we derived a model that classifies the territory of Colombia into 11 biogeographic units or natural clusters. Ecological assumptions in concordance with the derived classification levels suggest that species characteristics associated with flight performance, nectar uptake, and social behavior are the factors that served to produce the current geographical structure.
Resumo:
Complex networks have attracted increasing interest from various fields of science. It has been demonstrated that each complex network model presents specific topological structures which characterize its connectivity and dynamics. Complex network classification relies on the use of representative measurements that describe topological structures. Although there are a large number of measurements, most of them are correlated. To overcome this limitation, this paper presents a new measurement for complex network classification based on partially self-avoiding walks. We validate the measurement on a data set composed by 40000 complex networks of four well-known models. Our results indicate that the proposed measurement improves correct classification of networks compared to the traditional ones. (C) 2012 American Institute of Physics. [http://dx.doi.org/10.1063/1.4737515]
Resumo:
The reproductive performance of cattle may be influenced by several factors, but mineral imbalances are crucial in terms of direct effects on reproduction. Several studies have shown that elements such as calcium, copper, iron, magnesium, selenium, and zinc are essential for reproduction and can prevent oxidative stress. However, toxic elements such as lead, nickel, and arsenic can have adverse effects on reproduction. In this paper, we applied a simple and fast method of multi-element analysis to bovine semen samples from Zebu and European classes used in reproduction programs and artificial insemination. Samples were analyzed by inductively coupled plasma spectrometry (ICP-MS) using aqueous medium calibration and the samples were diluted in a proportion of 1:50 in a solution containing 0.01% (vol/vol) Triton X-100 and 0.5% (vol/vol) nitric acid. Rhodium, iridium, and yttrium were used as the internal standards for ICP-MS analysis. To develop a reliable method of tracing the class of bovine semen, we used data mining techniques that make it possible to classify unknown samples after checking the differentiation of known-class samples. Based on the determination of 15 elements in 41 samples of bovine semen, 3 machine-learning tools for classification were applied to determine cattle class. Our results demonstrate the potential of support vector machine (SVM), multilayer perceptron (MLP), and random forest (RF) chemometric tools to identify cattle class. Moreover, the selection tools made it possible to reduce the number of chemical elements needed from 15 to just 8.
Resumo:
Multi-element analysis of honey samples was carried out with the aim of developing a reliable method of tracing the origin of honey. Forty-two chemical elements were determined (Al, Cu, Pb, Zn, Mn, Cd, Tl, Co, Ni, Rb, Ba, Be, Bi, U, V, Fe, Pt, Pd, Te, Hf, Mo, Sn, Sb, P, La, Mg, I, Sm, Tb, Dy, Sd, Th, Pr, Nd, Tm, Yb, Lu, Gd, Ho, Er, Ce, Cr) by inductively coupled plasma mass spectrometry (ICP-MS). Then, three machine learning tools for classification and two for attribute selection were applied in order to prove that it is possible to use data mining tools to find the region where honey originated. Our results clearly demonstrate the potential of Support Vector Machine (SVM), Multilayer Perceptron (MLP) and Random Forest (RF) chemometric tools for honey origin identification. Moreover, the selection tools allowed a reduction from 42 trace element concentrations to only 5. (C) 2012 Elsevier Ltd. All rights reserved.
Resumo:
Surveillance Levels (SLs) are categories for medical patients (used in Brazil) that represent different types of medical recommendations. SLs are defined according to risk factors and the medical and developmental history of patients. Each SL is associated with specific educational and clinical measures. The objective of the present paper was to verify computer-aided, automatic assignment of SLs. The present paper proposes a computer-aided approach for automatic recommendation of SLs. The approach is based on the classification of information from patient electronic records. For this purpose, a software architecture composed of three layers was developed. The architecture is formed by a classification layer that includes a linguistic module and machine learning classification modules. The classification layer allows for the use of different classification methods, including the use of preprocessed, normalized language data drawn from the linguistic module. We report the verification and validation of the software architecture in a Brazilian pediatric healthcare institution. The results indicate that selection of attributes can have a great effect on the performance of the system. Nonetheless, our automatic recommendation of surveillance level can still benefit from improvements in processing procedures when the linguistic module is applied prior to classification. Results from our efforts can be applied to different types of medical systems. The results of systems supported by the framework presented in this paper may be used by healthcare and governmental institutions to improve healthcare services in terms of establishing preventive measures and alerting authorities about the possibility of an epidemic.
Resumo:
We present a detailed study of carbon-enhanced metal-poor (CEMP) stars, based on high-resolution spectroscopic observations of a sample of 18 stars. The stellar spectra for this sample were obtained at the 4.2 m William Herschel Telescope in 2001 and 2002, using the Utrecht Echelle Spectrograph, at a resolving power R similar to 52 000 and S/N similar to 40, covering the wavelength range lambda lambda 3700-5700 angstrom. The atmospheric parameters determined for this sample indicate temperatures ranging from 4750 K to 7100 K, log g from 1.5 to 4.3, and metallicities -3.0 <= [Fe/H]<=-1.7. Elemental abundances for C, Na, Mg, Sc, Ti, Cr, Cu, Zn, Sr, Y, Zr, Ba, La, Ce, Nd, Sm, Eu, Gd, Dy are determined. Abundances for an additional 109 stars were taken from the literature and combined with the data of our sample. The literature sample reveals a lack of reliable abundance estimates for species that might be associated with the r-process elements for about 67% of CEMP stars, preventing a complete understanding of this class of stars, since [Ba/Eu] ratios are used to classify them. Although eight stars in our observed sample are also found in the literature sample, Eu abundances or limits are determined for four of these stars for the first time. From the observed correlations between C, Ba, and Eu, we argue that the CEMP-r/s class has the same astronomical origin as CEMP-s stars, highlighting the need for a more complete understanding of Eu production.
Resumo:
Most Cactaceae have succulent stems and inhabit dry or arid areas, but some are epiphytes of humid regions. Rhipsalis is the largest genus of epiphytic cacti. Species of Rhipsalis are notoriously difficult to identify, and the subgeneric classification of the genus has remained controversial. Between 1837 and 1995, eight different subgeneric classifications have been proposed for Rhipsalis. The most comprehensive taxonomic treatment of the genus recognized five subgenera, Phyllarthrorhipsalis, Rhipsalis, Epallagogonium, Calamorhipsalis, and Erythrorhipsalis, characterized mainly by stem morphology. Here, molecular phylogenetic information combined with morphological data is used to re-evaluate the former subgeneric classifications proposed for the genus. Three monophyletic subgenera are recognized, Rhipsalis, Calamorhipsalis and Erythrorhipsalis, which are mainly characterized by floral traits. The changes proposed include expanding the circumscription of Rhipsalis by the inclusion of species previously included in Phyllarthrorhipsalis and Epallagogoniwn and recognizing a broader Calamorhipsalis, also including species from subgenus Epallagogonium. The circumscription of Erythrorhipsalis remains unchanged. For each subgenus a list of synonyms, a brief description and a list of species included are presented. A key for the identification of subgenera is also provided.
Resumo:
Dimensionality reduction is employed for visual data analysis as a way to obtaining reduced spaces for high dimensional data or to mapping data directly into 2D or 3D spaces. Although techniques have evolved to improve data segregation on reduced or visual spaces, they have limited capabilities for adjusting the results according to user's knowledge. In this paper, we propose a novel approach to handling both dimensionality reduction and visualization of high dimensional data, taking into account user's input. It employs Partial Least Squares (PLS), a statistical tool to perform retrieval of latent spaces focusing on the discriminability of the data. The method employs a training set for building a highly precise model that can then be applied to a much larger data set very effectively. The reduced data set can be exhibited using various existing visualization techniques. The training data is important to code user's knowledge into the loop. However, this work also devises a strategy for calculating PLS reduced spaces when no training data is available. The approach produces increasingly precise visual mappings as the user feeds back his or her knowledge and is capable of working with small and unbalanced training sets.
Resumo:
Multivariate analyses of UV-Vis spectral data from cachaca wood extracts provide a simple and robust model to classify aged Brazilian cachacas according to the wood species used in the maturation barrels. The model is based on inspection of 93 extracts of oak and different Brazilian wood species by a non-aged cachaca used as an extraction solvent. Application of PCA (Principal Components Analysis) and HCA (Hierarchical Cluster Analysis) leads to identification of 6 clusters of cachaca wood extracts (amburana, amendoim, balsamo, castanheira, jatoba, and oak). LDA (Linear Discriminant Analysis) affords classification of 10 different wood species used in the cachaca extracts (amburana, amendoim, balsamo, cabreuva-parda, canela-sassafras, castanheira, jatoba, jequitiba-rosa, louro-canela, and oak) with an accuracy ranging from 80% (amendoim and castanheira) to 100% (balsamo and jequitiba-rosa). The methodology provides a low-cost alternative to methods based on liquid chromatography and mass spectrometry to classify cachacas aged in barrels that are composed of different wood species.
Resumo:
In multi-label classification, examples can be associated with multiple labels simultaneously. The task of learning from multi-label data can be addressed by methods that transform the multi-label classification problem into several single-label classification problems. The binary relevance approach is one of these methods, where the multi-label learning task is decomposed into several independent binary classification problems, one for each label in the set of labels, and the final labels for each example are determined by aggregating the predictions from all binary classifiers. However, this approach fails to consider any dependency among the labels. Aiming to accurately predict label combinations, in this paper we propose a simple approach that enables the binary classifiers to discover existing label dependency by themselves. An experimental study using decision trees, a kernel method as well as Naive Bayes as base-learning techniques shows the potential of the proposed approach to improve the multi-label classification performance.
Resumo:
Statistical methods have been widely employed to assess the capabilities of credit scoring classification models in order to reduce the risk of wrong decisions when granting credit facilities to clients. The predictive quality of a classification model can be evaluated based on measures such as sensitivity, specificity, predictive values, accuracy, correlation coefficients and information theoretical measures, such as relative entropy and mutual information. In this paper we analyze the performance of a naive logistic regression model (Hosmer & Lemeshow, 1989) and a logistic regression with state-dependent sample selection model (Cramer, 2004) applied to simulated data. Also, as a case study, the methodology is illustrated on a data set extracted from a Brazilian bank portfolio. Our simulation results so far revealed that there is no statistically significant difference in terms of predictive capacity between the naive logistic regression models and the logistic regression with state-dependent sample selection models. However, there is strong difference between the distributions of the estimated default probabilities from these two statistical modeling techniques, with the naive logistic regression models always underestimating such probabilities, particularly in the presence of balanced samples. (C) 2012 Elsevier Ltd. All rights reserved.
Resumo:
Abstract Background The search for enriched (aka over-represented or enhanced) ontology terms in a list of genes obtained from microarray experiments is becoming a standard procedure for a system-level analysis. This procedure tries to summarize the information focussing on classification designs such as Gene Ontology, KEGG pathways, and so on, instead of focussing on individual genes. Although it is well known in statistics that association and significance are distinct concepts, only the former approach has been used to deal with the ontology term enrichment problem. Results BayGO implements a Bayesian approach to search for enriched terms from microarray data. The R source-code is freely available at http://blasto.iq.usp.br/~tkoide/BayGO in three versions: Linux, which can be easily incorporated into pre-existent pipelines; Windows, to be controlled interactively; and as a web-tool. The software was validated using a bacterial heat shock response dataset, since this stress triggers known system-level responses. Conclusion The Bayesian model accounts for the fact that, eventually, not all the genes from a given category are observable in microarray data due to low intensity signal, quality filters, genes that were not spotted and so on. Moreover, BayGO allows one to measure the statistical association between generic ontology terms and differential expression, instead of working only with the common significance analysis.