788 resultados para dINSCY, subspace clustering, data mining, parallelo, distribuito, algoritmo
Resumo:
The domain of Knowledge Discovery (KD) and Data Mining (DM) is of growing importance in a time where more and more data is produced and knowledge is one of the most precious assets. Having explored both the existing underlying theory, the results of the ongoing research in academia and the industry practices in the domain of KD and DM, we have found that this is a domain that still lacks some systematization. We also found that this systematization exists to a greater degree in the Software Engineering and Requirements Engineering domains, probably due to being more mature areas. We believe that it is possible to improve and facilitate the participation of enterprise stakeholders in the requirements engineering for KD projects by systematizing requirements engineering process for such projects. This will, in turn, result in more projects that end successfully, that is, with satisfied stakeholders, including in terms of time and budget constraints. With this in mind and based on all information found in the state-of-the art, we propose SysPRE - Systematized Process for Requirements Engineering in KD projects. We begin by proposing an encompassing generic description of the KD process, where the main focus is on the Requirements Engineering activities. This description is then used as a base for the application of the Design and Engineering Methodology for Organizations (DEMO) so that we can specify a formal ontology for this process. The resulting SysPRE ontology can serve as a base that can be used not only to make enterprises become aware of their own KD process and requirements engineering process in the KD projects, but also to improve such processes in reality, namely in terms of success rate.
Resumo:
Layer mortality due to heat stress is an important economic loss for the producer. The aim of this study was to determine the mortality pattern of layers reared in the region of Bastos, SP, Brazil, according to external environment and bird age. Data mining technique were used based on monthly mortality records of hens in production, 135 poultry houses, from January 2004 to August 2008. The external environment was characterized according maximum and minimum temperatures, obtained monthly at the meteorological station CATI in the city of Tupa, SP, Brazil. Mortality was classified as normal (<= 1.2%) or high (> 1.2%), considering the mortality limits mentioned in literature. Data mining technique produced a decision tree with nine levels and 23 leaves, with 62.6% of overall accuracy. The hit rate for the High class was 64.1% and 59.9% for Normal class. The decision tree allowed finding a pattern in the mortality data, generating a model for estimating mortality based on the thermal environment and bird age.
Resumo:
Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)
Resumo:
Foliar diagnosis is a method for assessing the nutritional status of agricultural crops, which helps in the understanding of soil fertility and rationalized application of fertilizers taking into account economic and environmental criteria. The study aimed to use the landrelief as criteria to assist in interpreting the spatial variability of nutrient content of the citrus leaf. The leaves were collected at regular intervals of 50 m, totaling 332 sampling points. Data were analyzed by descriptive statistics, geostatistics and induction of decision tree. With the aid of digital elevation model (MDE) and the profile planaltimetric, the area was divided into three different landrelief and sub-strands. The highest values for nutrients from the leaves of citrus were observed at the top (concave area) segments on a half-slope and lower slope. The nutrients from the citrus leaves showed high values of correlation (above 0.5) with the altitude of the study area. The technique of geostatistics and the induction of decision tree show that the relief is the variable with the greatest potential to interpret the maps of spatial variability of nutrients from the citrus leaves.
Resumo:
The relevance of rising healthcare costs is a main topic in complementary health companies in Brazil. In 2011, these expenses consumed more than 80% of the monthly health insurance in Brazil. Considering the administrative costs, it is observed that the companies operating in this market work, on average, at the threshold between profit and loss. This paper presents results after an investigation of the welfare costs of a health plan company in Brazil. It was based on the KDD process and explorative Data Mining. A diversity of results is presented, such as data summarization, providing compact descriptions of the data, revealing common features and intrinsic observations. Among the key findings was observed that a small portion of the population is responsible for the most demanding of resources devoted to health care
Resumo:
Currently, one of the biggest challenges for the field of data mining is to perform cluster analysis on complex data. Several techniques have been proposed but, in general, they can only achieve good results within specific areas providing no consensus of what would be the best way to group this kind of data. In general, these techniques fail due to non-realistic assumptions about the true probability distribution of the data. Based on this, this thesis proposes a new measure based on Cross Information Potential that uses representative points of the dataset and statistics extracted directly from data to measure the interaction between groups. The proposed approach allows us to use all advantages of this information-theoretic descriptor and solves the limitations imposed on it by its own nature. From this, two cost functions and three algorithms have been proposed to perform cluster analysis. As the use of Information Theory captures the relationship between different patterns, regardless of assumptions about the nature of this relationship, the proposed approach was able to achieve a better performance than the main algorithms in literature. These results apply to the context of synthetic data designed to test the algorithms in specific situations and to real data extracted from problems of different fields
Resumo:
The opening of the Brazilian market of electricity and competitiveness between companies in the energy sector make the search for useful information and tools that will assist in decision making activities, increase by the concessionaires. An important source of knowledge for these utilities is the time series of energy demand. The identification of behavior patterns and description of events become important for the planning execution, seeking improvements in service quality and financial benefits. This dissertation presents a methodology based on mining and representation tools of time series, in order to extract knowledge that relate series of electricity demand in various substations connected of a electric utility. The method exploits the relationship of duration, coincidence and partial order of events in multi-dimensionals time series. To represent the knowledge is used the language proposed by Mörchen (2005) called Time Series Knowledge Representation (TSKR). We conducted a case study using time series of energy demand of 8 substations interconnected by a ring system, which feeds the metropolitan area of Goiânia-GO, provided by CELG (Companhia Energética de Goiás), responsible for the service of power distribution in the state of Goiás (Brazil). Using the proposed methodology were extracted three levels of knowledge that describe the behavior of the system studied, representing clearly the system dynamics, becoming a tool to assist planning activities
Resumo:
The genome of all organisms is subject to injuries that can be caused by endogenous and environmental factors. If these lesions are not corrected, it can be fixed generating a mutation which can be lethal to the organisms. In order to prevent this, there are different DNA repair mechanisms. These mechanisms are well known in bacteria, yeast, human, but not in plants. Two plant models Oriza sativa and Arabidopsis thaliana had the genome sequenced and due to this some DNA repair genes have been characterized. The aim of this work is to characterized two sugarcane cDNAs that had homology to AP endonuclease: scARP1 and scARP3. In silico has been done with these two sequences and other from plants. It has been observed domain conservation on these sequences, but the cystein at 65 position that is a characteristic from the redox domain in APE1 protein was not so conservated in plants. Phylogenetic relationship showed two branches, one branch with dicots and monocots sequence and the other branch with only monocots sequences. Another approach in order to characterized these two cDNAs was to construct overexpression cassettes (sense and antisense orientation) using the 35S promoter. After that, these cassettes were transferred to the binary vector pPZP211. Furthermore, previously in the laboratory was obtained a plant from nicotiana tabacum containing the overexpression cassette in anti-sense orientation. It has been observed that this plant had a slow development and problems in setting seeds. After some manual crossing, some seeds were obtained (T2) and it was analyzed the T2 segregation. The third approach used in this work was to clone the promoter region from these two cDNAs by PCR walking. The sequences obtained were analyzed using the program PLANTCARE. It was observed in these sequences some motives that may be related to oxidative stress response
Resumo:
Background: Leptospirosis is an important zoonotic disease associated with poor areas of urban settings of developing countries and early diagnosis and prompt treatment may prevent disease. Although rodents are reportedly considered the main reservoirs of leptospirosis, dogs may develop the disease, may become asymptomatic carriers and may be used as sentinels for disease epidemiology. The use of Geographical Information Systems (GIS) combined with spatial analysis techniques allows the mapping of the disease and the identification and assessment of health risk factors. Besides the use of GIS and spatial analysis, the technique of data mining, decision tree, can provide a great potential to find a pattern in the behavior of the variables that determine the occurrence of leptospirosis. The objective of the present study was to apply Geographical Information Systems and data prospection (decision tree) to evaluate the risk factors for canine leptospirosis in an area of Curitiba, PR.Materials, Methods & Results: The present study was performed on the Vila Pantanal, a urban poor community in the city of Curitiba. A total of 287 dog blood samples were randomly obtained house-by-house in a two-day sampling on January 2010. In addition, a questionnaire was applied to owners at the time of sampling. Geographical coordinates related to each household of tested dog were obtained using a Global Positioning System (GPS) for mapping the spatial distribution of reagent and non-reagent dogs to leptospirosis. For the decision tree, risk factors included results of microagglutination test (MAT) from the serum of dogs, previous disease on the household, contact with rats or other dogs, dog breed, outdoors access, feeding, trash around house or backyard, open sewer proximity and flooding. A total of 189 samples (about 2/3 of overall samples) were randomly selected for the training file and consequent decision rules. The remained 98 samples were used for the testing file. The seroprevalence showed a pattern of spatial distribution that involved all the Pantanal area, without agglomeration of reagent animals. In relation to data mining, from 189 samples used in decision tree, a total of 165 (87.3%) animal samples were correctly classified, generating a Kappa index of 0.413. A total of 154 out of 159 (96.8%) samples were considered non-reagent and were correctly classified and only 5/159 (3.2%) were wrongly identified. on the other hand, only 11 (36.7%) reagent samples were correctly classified, with 19 (63.3%) samples failing diagnosis.Discussion: The spatial distribution that involved all the Pantanal area showed that all the animals in the area are at risk of contamination by Leptospira spp. Although most samples had been classified correctly by the decision tree, a degree of difficulty of separability related to seropositive animals was observed, with only 36.7% of the samples classified correctly. This can occur due to the fact of seronegative animals number is superior to the number of seropositive ones, taking the differences in the pattern of variable behavior. The data mining helped to evaluate the most important risk factors for leptospirosis in an urban poor community of Curitiba. The variables selected by decision tree reflected the important factors about the existence of the disease (default of sewer, presence of rats and rubbish and dogs with free access to street). The analyses showed the multifactorial character of the epidemiology of canine leptospirosis.
Resumo:
Cultivated peanut (Arachis hypogaea) is an important crop, widely grown in tropical and subtropical regions of the world. It is highly susceptible to several biotic and abiotic stresses to which wild species are resistant. As a first step towards the introgression of these resistance genes into cultivated peanut, a linkage map based on microsatellite markers was constructed, using an F-2 population obtained from a cross between two diploid wild species with AA genome (A. duranensis and A. stenosperma). A total of 271 new microsatellite markers were developed in the present study from SSR-enriched genomic libraries, expressed sequence tags (ESTs), and by data-mining sequences available in GenBank. of these, 66 were polymorphic for cultivated peanut. The 271 new markers plus another 162 published for peanut were screened against both progenitors and 204 of these (47.1%) were polymorphic, with 170 codominant and 34 dominant markers. The 80 codominant markers segregating 1:2:1 (P < 0.05) were initially used to establish the linkage groups. Distorted and dominant markers were subsequently included in the map. The resulting linkage map consists of 11 linkage groups covering 1,230.89 cM of total map distance, with an average distance of 7.24 cM between markers. This is the first microsatellite-based map published for Arachis, and the first map based on sequences that are all currently publicly available. Because most markers used were derived from ESTs and genomic libraries made using methylation-sensitive restriction enzymes, about one-third of the mapped markers are genic. Linkage group ordering is being validated in other mapping populations, with the aim of constructing a transferable reference map for Arachis.
Resumo:
The data mining of Eucalyptus ESTs genome finds four clusters (EGCEST2257E11.g, EGBGRT3213F11.g, and EGCCFB1223H11.g) from highly conservative 14-3-3 protein family which modulates a wide variety of cellular processes. Multiple alignments were built from twenty four sequences of 14-3-3 proteins searched into the GenBank databases and into the four pools of Eucalyptus genome programs. The alignment has shown two regions highly conservative on the sequences corresponding to the motifs of protein phosphorylation and nine highly conservative regions on the sequence corresponding to the linkage regions of alpha helices structure based on three dimensional of dimer functional structure. The differences of amino acid into the structural and functional domains of 14-3-3 plant protein were identified and can explain the functional diversity of different isoforms. The phylogenic protein trees were built by the maximum parsimony and neighborjoining procedures of Clustal X alignments and PAUP software for phylogenic analysis.
Resumo:
The analysis of large amounts of data is better performed by humans when represented in a graphical format. Therefore, a new research area called the Visual Data Mining is being developed endeavoring to use the number crunching power of computers to prepare data for visualization, allied to the ability of humans to interpret data presented graphically.This work presents the results of applying a visual data mining tool, called FastMapDB to detect the behavioral pattern exhibited by a dataset of clinical information about hemoglobinopathies known as thalassemia. FastMapDB is a visual data mining tool that get tabular data stored in a relational database such as dates, numbers and texts, and by considering them as points in a multidimensional space, maps them to a three-dimensional space. The intuitive three-dimensional representation of objects enables a data analyst to see the behavior of the characteristics from abnormal forms of hemoglobin, highlighting the differences when compared to data from a group without alteration.
Resumo:
Oxidative stress generating active oxygen species has been proved to be one of the underlying agents causing tissue injury after the exposure of Eucalyptus (Eucalyptus spp.) plants to a wide variety of stress conditions. The objective of this study was to perform data mining to identify favorable genes and alleles associated with the enzyme systems superoxide dismutase, catalase, peroxidases, and glutathione S-transferase that are related to tolerance for environmental stresses and damage caused by pests, diseases, herbicides, and by weeds themselves. This was undertaken by using the eucalyptus expressed-sequence database (https//forests.esalq.usp.br). The alignment results between amino acid and nucleotide sequences indicated that the studied enzymes were adequately represented in the ESTs database of the FORESTs project.