797 results for Educational data mining


Relevance:

80.00%

Publisher:

Abstract:

Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)

Relevance:

80.00%

Publisher:

Abstract:

Traditional applications of feature selection in areas such as data mining, machine learning and pattern recognition aim to improve the accuracy and reduce the computational cost of the model. This is done by removing redundant, irrelevant or noisy data, finding a representative subset of the data that reduces its dimensionality without loss of performance. Research on ensembles of classifiers has shown that this type of model performs better than individual models, provided the base classifiers are diverse, which opens a new field of application for feature selection research. In this new field, the goal is to find diverse subsets of features for building the base classifiers of ensemble systems. This work proposes an approach that maximizes the diversity of the ensembles by selecting subsets of features using a model that is independent of the learning algorithm and has low computational cost. This is done using bio-inspired metaheuristics with filter-based evaluation criteria.

Relevance:

80.00%

Publisher:

Abstract:

Clustering data is a very important task in data mining, image processing and pattern recognition problems. One of the most popular clustering algorithms is Fuzzy C-Means (FCM). This thesis proposes a new way of calculating the cluster centers in the FCM procedure, called ckMeans, and applies it to some variants of FCM, in particular those that use other distances. The goal of this change is to reduce the number of iterations and the processing time of these algorithms without affecting the quality of the partition, or even to improve the number of correct classifications in some cases. We also developed an algorithm based on ckMeans to handle interval data using interval membership degrees. This algorithm represents the data without converting interval data into point data, as happens in other extensions of FCM that deal with interval data. To validate the proposed methodologies, we compared the clusterings produced by the ckMeans, K-Means and FCM algorithms (since the algorithm proposed in this paper to calculate the centers is similar to K-Means) under three different distances, using several well-known databases. For interval data, the results of Interval ckMeans were compared with those of other clustering algorithms on an interval database of minimum and maximum monthly temperatures for a given year, covering 37 cities distributed across continents.
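For reference, the standard FCM iteration that ckMeans modifies can be written as below. This is the textbook formulation, not the ckMeans center calculation, which this abstract does not spell out. The symbols are notation assumed here: $x_i$ are the $N$ data points, $v_j$ the $c$ cluster centers, $u_{ij}$ the membership of point $i$ in cluster $j$, and $m > 1$ the fuzzifier.

```latex
J_m = \sum_{i=1}^{N} \sum_{j=1}^{c} u_{ij}^{m} \, \lVert x_i - v_j \rVert^2,
\qquad
u_{ij} = \left[ \sum_{k=1}^{c} \left( \frac{\lVert x_i - v_j \rVert}{\lVert x_i - v_k \rVert} \right)^{\frac{2}{m-1}} \right]^{-1},
\qquad
v_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} \, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}
```

The memberships and centers are updated alternately until the centers stop moving; the last expression is the center update that the thesis proposes to replace.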

Relevance:

80.00%

Publisher:

Abstract:

Data clustering is applied in various fields such as data mining, image processing and pattern recognition. Clustering algorithms split a data set into clusters such that elements within the same cluster have a high degree of similarity, while elements belonging to different clusters have a high degree of dissimilarity. The Fuzzy C-Means algorithm (FCM) is one of the fuzzy clustering algorithms most used and discussed in the literature. The performance of FCM is strongly affected by the selection of the initial cluster centers, so choosing a good set of initial centers is very important. However, in FCM the initial centers are chosen randomly, which makes it difficult to find a good set. This paper proposes three new methods to obtain the initial cluster centers of the FCM algorithm deterministically; the methods can also be used in variants of FCM. In this work these initialization methods were applied to the ckMeans variant. With the proposed methods, we intend to obtain a set of initial centers that are close to the real cluster centers. With these new initialization approaches, the aim is to reduce the number of iterations these algorithms need to converge and their processing time without affecting the quality of the clustering, or even to improve the quality in some cases. Accordingly, cluster validation indices were used to measure the quality of the clusters obtained by the modified FCM and ckMeans algorithms with the proposed initialization methods when applied to various data sets.
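A minimal sketch of where initialization enters FCM is given below. The `fcm` function is the standard algorithm and takes the initial centers as an argument. The helper `evenly_spaced_centers` is a hypothetical deterministic initializer written only for illustration; it is not one of the three methods proposed in the paper, which the abstract does not describe.

```python
import math


def evenly_spaced_centers(points, c):
    """Hypothetical deterministic initializer (illustration only):
    sort the points and pick c of them at evenly spaced ranks."""
    ordered = sorted(points)
    n = len(ordered)
    return [ordered[(2 * k + 1) * n // (2 * c)] for k in range(c)]


def fcm(points, centers, m=2.0, tol=1e-6, max_iter=300):
    """Standard Fuzzy C-Means starting from the given initial centers.

    Returns (centers, memberships, iterations). The memberships are the
    ones computed in the last iteration, before the final center update.
    """
    centers = [tuple(c) for c in centers]
    exponent = 2.0 / (m - 1.0)
    dims = len(points[0])
    for iteration in range(1, max_iter + 1):
        # Membership update: closer centers receive higher membership.
        memberships = []
        for x in points:
            dists = [math.dist(x, c) for c in centers]
            if any(d == 0.0 for d in dists):
                # The point sits exactly on one or more centers.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                memberships.append([h / sum(hits) for h in hits])
            else:
                memberships.append(
                    [1.0 / sum((dj / dk) ** exponent for dk in dists) for dj in dists]
                )
        # Center update: mean of the points weighted by membership ** m.
        new_centers = []
        for j in range(len(centers)):
            weights = [u[j] ** m for u in memberships]
            total = sum(weights)
            new_centers.append(
                tuple(
                    sum(w * x[d] for w, x in zip(weights, points)) / total
                    for d in range(dims)
                )
            )
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships, iteration


# Toy data with two well-separated groups (made up for this example).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, memberships, iterations = fcm(points, evenly_spaced_centers(points, 2))
print(centers, iterations)
```

Because the initial centers are computed rather than drawn at random, repeated runs give the same partition and the same iteration count, which is the property that makes iteration counts comparable across initialization methods.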

Relevance:

80.00%

Publisher:

Abstract:

The genome of all organisms is subject to injuries that can be caused by endogenous and environmental factors. If these lesions are not corrected, they can become fixed, generating a mutation that can be lethal to the organism. To prevent this, there are different DNA repair mechanisms. These mechanisms are well known in bacteria, yeast and humans, but not in plants. Two plant models, Oryza sativa and Arabidopsis thaliana, have had their genomes sequenced, and as a result some DNA repair genes have been characterized. The aim of this work is to characterize two sugarcane cDNAs with homology to AP endonuclease: scARP1 and scARP3. An in silico analysis was carried out with these two sequences and others from plants. Domain conservation was observed in these sequences, but the cysteine at position 65, which is characteristic of the redox domain in the APE1 protein, was not well conserved in plants. The phylogenetic relationship showed two branches, one with dicot and monocot sequences and the other with only monocot sequences. Another approach to characterizing these two cDNAs was to construct overexpression cassettes (in sense and antisense orientation) using the 35S promoter. These cassettes were then transferred to the binary vector pPZP211. Furthermore, a Nicotiana tabacum plant containing the overexpression cassette in antisense orientation had previously been obtained in the laboratory. This plant showed slow development and problems in setting seed. After some manual crossing, some seeds (T2) were obtained and the T2 segregation was analyzed. The third approach used in this work was to clone the promoter regions of these two cDNAs by PCR walking. The sequences obtained were analyzed using the program PLANTCARE. Some motifs that may be related to the oxidative stress response were observed in these sequences.

Relevance:

80.00%

Publisher:

Abstract:

Background: Leptospirosis is an important zoonotic disease associated with poor areas of urban settings in developing countries, and early diagnosis and prompt treatment may prevent disease. Although rodents are considered the main reservoirs of leptospirosis, dogs may develop the disease, may become asymptomatic carriers and may be used as sentinels for disease epidemiology. The use of Geographical Information Systems (GIS) combined with spatial analysis techniques allows the mapping of the disease and the identification and assessment of health risk factors. Besides GIS and spatial analysis, the data mining technique of decision trees has great potential to find patterns in the behavior of the variables that determine the occurrence of leptospirosis. The objective of the present study was to apply Geographical Information Systems and data mining (decision tree) to evaluate the risk factors for canine leptospirosis in an area of Curitiba, PR. Materials, Methods & Results: The present study was performed in Vila Pantanal, a poor urban community in the city of Curitiba. A total of 287 dog blood samples were randomly obtained house by house in a two-day sampling in January 2010. In addition, a questionnaire was given to owners at the time of sampling. Geographical coordinates of the household of each tested dog were obtained using a Global Positioning System (GPS) to map the spatial distribution of dogs reagent and non-reagent to leptospirosis. For the decision tree, risk factors included results of the microagglutination test (MAT) on the dogs' serum, previous disease in the household, contact with rats or other dogs, dog breed, outdoor access, feeding, trash around the house or backyard, proximity to an open sewer and flooding. A total of 189 samples (about 2/3 of all samples) were randomly selected for the training file and the resulting decision rules. The remaining 98 samples were used for the testing file.
The seroprevalence showed a spatial distribution pattern that covered the whole Pantanal area, without clustering of reagent animals. As for data mining, of the 189 samples used in the decision tree, 165 (87.3%) were correctly classified, giving a Kappa index of 0.413. A total of 154 of 159 (96.8%) non-reagent samples were correctly classified and only 5/159 (3.2%) were wrongly identified. On the other hand, only 11 (36.7%) reagent samples were correctly classified, with 19 (63.3%) misclassified. Discussion: The spatial distribution covering the whole Pantanal area showed that all the animals in the area are at risk of contamination by Leptospira spp. Although most samples were classified correctly by the decision tree, seropositive animals proved difficult to separate, with only 36.7% of those samples classified correctly. This may be because seronegative animals outnumber seropositive ones, which affects the pattern of variable behavior. Data mining helped to evaluate the most important risk factors for leptospirosis in a poor urban community of Curitiba. The variables selected by the decision tree reflected the important factors for the existence of the disease (lack of sewerage, presence of rats and rubbish, and dogs with free access to the street). The analyses showed the multifactorial character of the epidemiology of canine leptospirosis.
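The reported accuracy and Kappa index can be checked from the counts given in the abstract. The sketch below assumes those counts form a 2x2 confusion matrix over the 189 training samples (11 reagent correctly classified, 19 reagent misclassified, 5 non-reagent misclassified, 154 non-reagent correctly classified) and that the Kappa index is Cohen's kappa; the function name is introduced here for illustration.

```python
def cohen_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    observed = (tp + tn) / total  # observed agreement (accuracy)
    # Agreement expected by chance, from the row and column totals.
    actual_pos, actual_neg = tp + fn, fp + tn
    pred_pos, pred_neg = tp + fp, fn + tn
    expected = (actual_pos * pred_pos + actual_neg * pred_neg) / total ** 2
    return (observed - expected) / (1 - expected)


accuracy = (11 + 154) / 189
kappa = cohen_kappa(tp=11, fn=19, fp=5, tn=154)
print(f"accuracy={accuracy:.3f}, kappa={kappa:.3f}")
```

Under these assumptions the result agrees with the 87.3% and 0.413 stated in the abstract. The gap between high accuracy and moderate kappa reflects the class imbalance mentioned in the discussion: a classifier that labels nearly everything non-reagent is already right most of the time by chance.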

Relevance:

80.00%

Publisher:

Abstract:

Cultivated peanut (Arachis hypogaea) is an important crop, widely grown in tropical and subtropical regions of the world. It is highly susceptible to several biotic and abiotic stresses to which wild species are resistant. As a first step towards the introgression of these resistance genes into cultivated peanut, a linkage map based on microsatellite markers was constructed, using an F2 population obtained from a cross between two diploid wild species with AA genome (A. duranensis and A. stenosperma). A total of 271 new microsatellite markers were developed in the present study from SSR-enriched genomic libraries, expressed sequence tags (ESTs), and by data-mining sequences available in GenBank. Of these, 66 were polymorphic for cultivated peanut. The 271 new markers plus another 162 published for peanut were screened against both progenitors and 204 of these (47.1%) were polymorphic, with 170 codominant and 34 dominant markers. The 80 codominant markers segregating 1:2:1 (P < 0.05) were initially used to establish the linkage groups. Distorted and dominant markers were subsequently included in the map. The resulting linkage map consists of 11 linkage groups covering 1,230.89 cM of total map distance, with an average distance of 7.24 cM between markers. This is the first microsatellite-based map published for Arachis, and the first map based on sequences that are all currently publicly available. Because most markers used were derived from ESTs and genomic libraries made using methylation-sensitive restriction enzymes, about one-third of the mapped markers are genic. Linkage group ordering is being validated in other mapping populations, with the aim of constructing a transferable reference map for Arachis.
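Whether a codominant marker segregates 1:2:1 in an F2 population is usually checked with a chi-square goodness-of-fit test. The sketch below shows that test under stated assumptions: the genotype counts are hypothetical (the abstract gives no per-marker counts or population size), the function names are introduced here, and 5.991 is the chi-square critical value for 2 degrees of freedom at the 0.05 level.

```python
CRITICAL_DF2_ALPHA_05 = 5.991  # chi-square critical value, 2 d.f., alpha = 0.05


def chi_square_121(n_aa, n_ab, n_bb):
    """Chi-square statistic of observed F2 genotype counts against 1:2:1."""
    total = n_aa + n_ab + n_bb
    expected = (total / 4, total / 2, total / 4)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def fits_121(n_aa, n_ab, n_bb):
    """True when the counts do not deviate significantly from 1:2:1."""
    return chi_square_121(n_aa, n_ab, n_bb) <= CRITICAL_DF2_ALPHA_05


# Hypothetical genotype counts for two markers (not taken from the study).
print(fits_121(24, 47, 22))  # close to 1:2:1
print(fits_121(40, 40, 13))  # distorted segregation
```

Markers that fail this check correspond to the distorted markers that the abstract says were added to the map only after the linkage groups had been established.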

Relevance:

80.00%

Publisher:

Abstract:

Data mining of the Eucalyptus EST genome found four clusters (EGCEST2257E11.g, EGBGRT3213F11.g, and EGCCFB1223H11.g) from the highly conserved 14-3-3 protein family, which modulates a wide variety of cellular processes. Multiple alignments were built from twenty-four 14-3-3 protein sequences retrieved from the GenBank databases and from the four pools of the Eucalyptus genome programs. The alignment showed two highly conserved regions corresponding to the protein phosphorylation motifs and nine highly conserved regions corresponding to the linkage regions of the alpha-helix structure, based on the three-dimensional structure of the functional dimer. Amino acid differences within the structural and functional domains of plant 14-3-3 proteins were identified and can explain the functional diversity of the different isoforms. The phylogenetic protein trees were built by the maximum parsimony and neighbor-joining procedures using Clustal X alignments and the PAUP software for phylogenetic analysis.

Relevance:

80.00%

Publisher:

Abstract:

Humans analyze large amounts of data better when the data are represented in a graphical format. Therefore, a new research area called Visual Data Mining is being developed, which uses the number-crunching power of computers to prepare data for visualization, combined with the ability of humans to interpret data presented graphically. This work presents the results of applying a visual data mining tool called FastMapDB to detect the behavioral pattern exhibited by a dataset of clinical information about the hemoglobinopathies known as thalassemia. FastMapDB is a visual data mining tool that takes tabular data stored in a relational database, such as dates, numbers and texts, and, by considering them as points in a multidimensional space, maps them to a three-dimensional space. The intuitive three-dimensional representation of objects enables a data analyst to see the behavior of the characteristics of abnormal forms of hemoglobin, highlighting the differences when compared to data from a group without alteration.
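The abstract does not say how FastMapDB performs the mapping. Its name suggests the FastMap projection, so the sketch below shows that technique as an assumption, not as the tool's actual implementation. It handles numeric tuples only; how dates and texts would be turned into distances is left out, and the function name is introduced here.

```python
import math


def fastmap(points, k=3):
    """Project numeric tuples to k coordinates with the FastMap heuristic."""
    n = len(points)
    coords = [[0.0] * k for _ in range(n)]

    def dist2(i, j, axis):
        # Squared distance left over after the first `axis` coordinates.
        d2 = math.dist(points[i], points[j]) ** 2
        for a in range(axis):
            d2 -= (coords[i][a] - coords[j][a]) ** 2
        return max(d2, 0.0)

    for axis in range(k):
        # Pivot heuristic: start anywhere, then jump to the farthest object twice.
        a = 0
        b = max(range(n), key=lambda j: dist2(a, j, axis))
        a = max(range(n), key=lambda j: dist2(b, j, axis))
        dab2 = dist2(a, b, axis)
        if dab2 == 0.0:
            break  # nothing left to project; remaining coordinates stay 0
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine-law projection of object i onto the line through a and b.
            coords[i][axis] = (dist2(a, i, axis) + dab2 - dist2(b, i, axis)) / (2 * dab)
    return coords


# Four made-up records with five numeric attributes each.
records = [(1, 0, 2, 5, 1), (2, 1, 2, 4, 0), (9, 8, 7, 1, 3), (8, 9, 6, 0, 4)]
for row in fastmap(records, k=3):
    print([round(v, 3) for v in row])
```

Each output row is a point that could be drawn in a 3D scatter plot, where records that are close in the original attribute space tend to stay close.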

Relevance:

80.00%

Publisher:

Abstract:

BACKGROUND AND OBJECTIVES: Events considered minor have taken on a fundamental role in determining the quality of service provided in Anesthesiology. The objective of the present study was to evaluate patients' main concerns about the post-anesthetic period and to test the hypothesis that the most undesired effects can be influenced by demographic characteristics. METHOD: A questionnaire was answered by 440 patients immediately before the pre-anesthetic evaluation. The possible undesirable effects in the immediate postoperative period were listed, based on a survey of data available in the literature and considering the criterion of frequency, but not that of severity. Demographic data were evaluated and the nine most frequently cited concerns were investigated. The information collected from the analysis of the questionnaires completed by the respondents was related to their anthropometric, socioeconomic and educational data, in order to evaluate the influence of these variables on the profile of the answers. RESULTS: Among the undesired effects, the fear of waking up intubated was the one most frequently cited as the most important, followed by severe pain at the surgical site and waking up during surgery. The analysis of the three most undesired effects in relation to demographic data showed no statistically significant difference, except for the item pain at the surgical site (less cited among male patients). CONCLUSIONS: Patients' main concerns about the post-anesthetic period are: waking up with a tube in the throat, severe pain at the surgical site and the memory of being awake during surgery. Age, educational level and family income did not determine differences in patients' concerns.

Relevance:

80.00%

Publisher:

Abstract:

Two cross-sectional studies were carried out in 2002 (N=379) and 2003 (N=397) to estimate the prevalence of enteroparasites in children from five municipal day-care centers in Botucatu, SP. Socioeconomic, sanitary and educational variables were collected and coproparasitological examinations were performed. Giardia duodenalis had a prevalence of 23.7% (2002) and 21.4% (2003), followed by Cryptosporidium sp. with 15.5% (2002) and 3.7% (2003).

Relevance:

80.00%

Publisher:

Abstract:

Oxidative stress generating active oxygen species has been proved to be one of the underlying agents causing tissue injury after the exposure of Eucalyptus (Eucalyptus spp.) plants to a wide variety of stress conditions. The objective of this study was to perform data mining to identify favorable genes and alleles associated with the enzyme systems superoxide dismutase, catalase, peroxidases, and glutathione S-transferase that are related to tolerance for environmental stresses and damage caused by pests, diseases, herbicides, and by weeds themselves. This was undertaken by using the eucalyptus expressed-sequence database (https//forests.esalq.usp.br). The alignment results between amino acid and nucleotide sequences indicated that the studied enzymes were adequately represented in the ESTs database of the FORESTs project.

Relevance:

80.00%

Publisher:

Abstract:

Concept drift is a problem of increasing importance in machine learning and data mining. Data sets under analysis are no longer only static databases, but also data streams in which concepts and data distributions may not be stable over time. However, most learning algorithms produced so far are based on the assumption that data comes from a fixed distribution, so they are not suitable for handling concept drift. Moreover, some concept drift applications require a fast response, which means an algorithm must always be (re)trained with the latest available data. But the process of labeling data is usually expensive and/or time consuming when compared to unlabeled data acquisition, so only a small fraction of the incoming data may be effectively labeled. Semi-supervised learning methods may help in this scenario, as they use both labeled and unlabeled data in the training process. However, most of them are also based on the assumption that the data is static. Therefore, semi-supervised learning with concept drift is still an open challenge in machine learning. Recently, a particle competition and cooperation approach was used to realize graph-based semi-supervised learning from static data. In this paper, we extend that approach to handle data streams and concept drift. The result is a passive algorithm using a single classifier, which naturally adapts to concept changes without any explicit drift detection mechanism. Its built-in mechanisms provide a natural way of learning from new data, gradually forgetting older knowledge as older labeled data items become less influential on the classification of newer data items. Some computer simulations are presented, showing the effectiveness of the proposed method.
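The idea of passive adaptation through gradual forgetting can be illustrated with a much simpler stand-in than the paper's method. The sketch below is a decay-weighted kernel vote over stored labeled items, not the particle competition and cooperation algorithm, and it ignores unlabeled data entirely. The class name and the parameters `decay`, `bandwidth` and `min_weight` are hypothetical.

```python
import math


class DecayingKernelClassifier:
    """Passive stream classifier: older labeled items lose influence over time."""

    def __init__(self, decay=0.5, bandwidth=1.0, min_weight=1e-3):
        self.decay = decay            # factor applied to old weights per new item
        self.bandwidth = bandwidth    # width of the Gaussian kernel
        self.min_weight = min_weight  # items below this weight are forgotten
        self.items = []               # (point, label, weight) triples

    def learn(self, point, label):
        # Age every stored item, drop the negligible ones, then add the new one.
        aged = [(p, l, w * self.decay) for p, l, w in self.items]
        self.items = [(p, l, w) for p, l, w in aged if w >= self.min_weight]
        self.items.append((tuple(point), label, 1.0))

    def predict(self, point):
        # Each stored item votes with its age weight times its kernel similarity.
        scores = {}
        for p, label, w in self.items:
            kernel = math.exp(-math.dist(point, p) ** 2 / (2 * self.bandwidth ** 2))
            scores[label] = scores.get(label, 0.0) + w * kernel
        return max(scores, key=scores.get) if scores else None


clf = DecayingKernelClassifier()
for _ in range(5):
    clf.learn((0.0,), "a")
print(clf.predict((0.0,)))  # region currently labeled "a"
for _ in range(3):
    clf.learn((0.0,), "b")  # the concept for the same region changes
print(clf.predict((0.0,)))  # recent labels now dominate the vote
```

No drift detector is involved: the prediction changes only because newer items outweigh older ones, which is the passive behavior the abstract describes.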

Relevance:

80.00%

Publisher:

Abstract:

As a new modeling method, support vector regression (SVR) has been regarded as the state-of-the-art technique for regression and approximation. In this study, SVR models were introduced and developed to predict body and carcass-related characteristics of 2 strains of broiler chicken. To evaluate the prediction ability of SVR models, we compared their performance with that of neural network (NN) models. Evaluation of the prediction accuracy of models was based on the R², MS error, and bias. The variables of interest as model output were BW, empty BW, carcass, breast, drumstick, thigh, and wing weight in 2 strains of Ross and Cobb chickens based on intake of dietary nutrients, including ME (kcal/bird per week), CP, TSAA, and Lys, all as grams per bird per week. A data set composed of 64 measurements taken from each strain was used for this analysis, where 44 data lines were used for model training, whereas the remaining 20 lines were used to test the created models. The results of this study revealed that it is possible to satisfactorily estimate the BW and carcass parts of the broiler chickens via their dietary nutrient intake. Through statistical criteria used to evaluate the performance of the SVR and NN models, the overall results demonstrate that the discussed models can be effective for accurate prediction of the body and carcass-related characteristics investigated here. However, the SVR method achieved better accuracy and generalization than the NN method. This indicates that the new data mining technique (SVR model) can be used as an alternative modeling tool for NN models. However, further reevaluation of this algorithm in the future is suggested.
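The three evaluation criteria and the 44/20 split can be sketched as follows. The definitions are the usual ones and are assumptions about what the study computed; in particular, bias is taken here as the mean of predicted minus observed. The function names and the sample numbers are hypothetical and do not come from the study.

```python
def split_train_test(rows, n_train=44):
    """First n_train rows for training, the rest for testing (44/20 for 64 rows)."""
    return rows[:n_train], rows[n_train:]


def evaluate(observed, predicted):
    """Return R squared, mean squared error and bias for paired values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    residuals = [p - o for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return {
        "r2": 1 - ss_res / ss_tot,
        "mse": ss_res / n,
        "bias": sum(residuals) / n,  # assumed sign convention: predicted - observed
    }


# Made-up body weights (g) and predictions from two hypothetical models.
observed = [2100.0, 2250.0, 2400.0, 2550.0]
model_a = [2110.0, 2240.0, 2410.0, 2540.0]
model_b = [2150.0, 2300.0, 2450.0, 2600.0]
print(evaluate(observed, model_a))
print(evaluate(observed, model_b))
```

Computing all three on the held-out rows only, as the abstract describes, is what allows the comparison to speak to generalization and not only to fit.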

Relevance:

80.00%

Publisher:

Abstract:

This paper describes a data mining environment for knowledge discovery in bioinformatics applications. The system has a generic kernel that implements the mining functions to be applied to input primary databases, with a warehouse architecture, of biomedical information. Both supervised and unsupervised classification can be implemented within the kernel and applied to data extracted from the primary database, with the results being suitably stored in a complex object database for knowledge discovery. The kernel also includes a specific high-performance library that allows designing and applying the mining functions in parallel machines. The experimental results obtained by the application of the kernel functions are reported. © 2003 Elsevier Ltd. All rights reserved.