906 results for Decision tree


Relevance: 60.00%

Publisher:

Abstract:

Data mining is one of the most active research areas today, with a wide variety of applications in everyday life. It is concerned with finding interesting hidden patterns in large historical databases. For example, from a sales database one may discover a pattern such as "people who buy magazines tend to buy newspapers as well"; from a sales point of view, placing these items together in the shop can then increase sales. In this research work, data mining is applied to the domain of placement chance prediction, since making a wise career decision is crucial for every student. In India, technical manpower analysis is carried out by the National Technical Manpower Information System (NTMIS), established in 1983-84 by India's Ministry of Education & Culture. NTMIS comprises a lead centre at the IAMR, New Delhi, and 21 nodal centres located in different parts of the country; the Kerala State Nodal Centre is located at Cochin University of Science and Technology. The nodal centre collects placement information by regularly sending postal questionnaires to graduated students. From this raw data, a historical database was prepared in which each record includes entrance rank range, reservation, sector, sex, and engineering branch. For each such combination of attributes, the corresponding placement chance is computed and stored in the database. Various popular data mining models are then built from this data and tested; these models can be used to predict the most suitable branch for a new student matching one of the above combinations of criteria, and a detailed performance comparison of the models is carried out. This research work also proposes a combination of data mining models, a hybrid stacking ensemble, for better predictions, together with a strategy to predict the overall absorption rate for the various branches and the time it takes for all students of a particular branch to be placed. Finally, it puts forward a new data mining algorithm, C4.5*stat, for numeric data sets, which is shown to have competitive accuracy on the standard UCI benchmark data sets, and proposes an optimization strategy, parameter tuning, to improve the standard C4.5 algorithm. In summary, this research work spans all four dimensions of a typical data mining project: application to a domain, development of classifier models, optimization, and ensemble methods.
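The thesis's hybrid stacking ensemble and its C4.5*stat algorithm are not publicly available, so the sketch below only illustrates the general stacking idea with scikit-learn: a decision tree and two other standard learners stand in for the base models, a logistic regression acts as meta-learner, and the data are synthetic stand-ins for the student records described above.

```python
# Minimal stacking-ensemble sketch (scikit-learn); the thesis's hybrid ensemble
# and the C4.5*stat algorithm are not reproduced here.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: each row would encode rank range, reservation, sector, sex, branch.
X, y = make_classification(n_samples=500, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5)),   # C4.5-style learner stand-in
        ("nb", GaussianNB()),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),                # meta-learner combines base predictions
)
stack.fit(X_train, y_train)
print("held-out accuracy:", stack.score(X_test, y_test))
```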

Relevance: 60.00%

Publisher:

Abstract:

The aim of this study is to show the importance of two classification techniques, decision trees and clustering, in the prediction of learning disabilities (LD) in school-age children. LDs affect about 10 percent of all children enrolled in schools, and the problems of children with specific learning disabilities have long been a cause of concern to parents and teachers. Decision trees and clustering are powerful and popular tools for classification and prediction in data mining. Rules extracted from the decision tree are used to predict learning disabilities, while clustering, the assignment of a set of observations into subsets called clusters, is useful for finding the different signs and symptoms (attributes) present in an LD-affected child. In this paper, the J48 algorithm is used to construct the decision tree and the K-means algorithm is used to create the clusters. By applying these classification techniques, LD in any child can be identified.
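As a minimal illustration of the two techniques named above, the sketch below builds a decision tree and a k-means clustering with scikit-learn; its CART tree stands in for Weka's J48 (C4.5), and the binary symptom matrix and target are hypothetical, not the study's data.

```python
# Sketch of the two techniques named above: a decision tree for rule-based
# prediction and k-means for grouping symptom profiles. scikit-learn's CART
# tree stands in for Weka's J48 (C4.5); the feature matrix is hypothetical.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6))       # 6 binary signs/symptoms per child (simulated)
y = (X[:, 0] & X[:, 3]) | X[:, 5]           # toy target: "LD present"

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree))                    # rules extracted from the tree

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))
```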

Relevance: 60.00%

Publisher:

Abstract:

Within quality control activities in the laboratory, the final results for a particular analyte are considered intermediate products, given the importance attached to quality assurance as the ultimate goal of quality management programmes. This view requires comprehensive instruments for detecting events such as cross-contamination and adopting measures to prevent them from affecting the analytical process. Objective: the main objective was to establish a system for monitoring and controlling cross-contamination in a food microbiology laboratory. Materials and methods: the methodology consisted of developing flow charts for the procedures controlling the populations of aerobic mesophiles and moulds arising from contamination of environments, surfaces, sterile material and culture media. These flow charts included a decision tree designed to trigger control actions based on tolerance intervals, established as an objective tool for decisions that bring the counts of the microbial populations in question back to normal. Results: the strictest alert limits were obtained for the aerobic mesophile and mould populations in the different controls, except for the environment of the media preparation area and the controls corresponding to sterile material. Conclusion: the process developed made it possible to complement the laboratory's internal quality control system by providing an objective means of closing non-conformities caused by cross-contamination.
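The abstract does not give the formula behind the tolerance intervals, so the following is only an illustrative sketch of one branch of such a control decision tree: an alert limit assumed to be mean + 2·s of historical log10 counts, with counts above it triggering the corrective action. All figures are hypothetical.

```python
# Illustrative sketch of an alert-limit decision rule for environmental colony
# counts (CFU); the tolerance-interval formula and the data are assumptions,
# not taken from the study.
import math

historical_log_counts = [1.2, 1.4, 1.1, 1.3, 1.5, 1.2, 1.0]   # log10 CFU per plate (hypothetical)
n = len(historical_log_counts)
mean = sum(historical_log_counts) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in historical_log_counts) / (n - 1))
upper_alert = mean + 2 * sd    # assumed k = 2; the study's own limits were stricter for some controls

def evaluate(count_cfu: float) -> str:
    """One branch of the control decision tree: act only when the limit is exceeded."""
    if math.log10(max(count_cfu, 1)) > upper_alert:
        return "non-conformity: investigate cross-contamination and repeat the control"
    return "within tolerance: continue routine monitoring"

print(evaluate(45))
```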

Relevance: 60.00%

Publisher:

Abstract:

Background: chronic kidney disease is a serious public health problem in our country because of the large amount of economic resources its care requires. Haemodialysis is the most widely used treatment in our setting, and vascular access and its complications are the main driver of the cost of care in these patients. Materials and methods: an economic study of vascular access was carried out in patients starting haemodialysis in 2012 at the RTS-Fundación Cardio Infantil unit. The cost of creating and maintaining access with a central catheter, a native arteriovenous fistula and an arteriovenous graft was established, together with the cost of treating the complications of each access, and the probability of complications was determined. A decision tree was used to model the behaviour of each access over a 5-year period, and quality-adjusted life years (QALYs) and the cost per QALY were established for each access. Results: of 36 incident haemodialysis patients in 2012, 100% started with a central catheter; 16 patients switched to a native arteriovenous fistula, 1 to an arteriovenous graft (later changing to CAPD), 15 continued with catheter access, and 4 patients died. Over 5 years, 2.36 QALYs were obtained for patients with a central catheter, at a cost of $24,813,036.39 per QALY, and 2.535 QALYs for patients with a native fistula, at $6,634,870.64 per QALY. Conclusions: this study shows that vascular access through a native arteriovenous fistula is more cost-effective than access through a catheter.
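A quick back-of-the-envelope check of the figures reported above, under the assumption that the quoted cost-per-QALY values multiply by the QALYs gained to give each strategy's 5-year total cost; on that reading the native fistula is both cheaper and more effective, i.e. it dominates the catheter strategy.

```python
# Back-of-the-envelope check with the figures reported above; assumes the
# quoted $/QALY values multiply by the QALYs gained to give total 5-year cost.
strategies = {
    "central catheter":  {"qaly": 2.36,  "cost_per_qaly": 24_813_036.39},
    "native AV fistula": {"qaly": 2.535, "cost_per_qaly": 6_634_870.64},
}
for name, s in strategies.items():
    s["total_cost"] = s["qaly"] * s["cost_per_qaly"]
    print(f"{name}: {s['qaly']} QALY, total cost ≈ ${s['total_cost']:,.0f}")

d_cost = strategies["native AV fistula"]["total_cost"] - strategies["central catheter"]["total_cost"]
d_qaly = strategies["native AV fistula"]["qaly"] - strategies["central catheter"]["qaly"]
# Negative incremental cost with positive incremental QALYs => the fistula dominates.
print("incremental cost:", round(d_cost), "incremental QALYs:", round(d_qaly, 3))
```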

Relevance: 60.00%

Publisher:

Abstract:

Treatments to raise cluster of differentiation 4 (CD4) counts in people living with human immunodeficiency virus (HIV) disease are important both for patients' well-being and for the proper functioning of health institutions. This study compares two pharmacological treatment lines, Lamivudine, Zidovudine and Efavirenz versus Efavirenz, Emtricitabine and Tenofovir Disoproxil, both recommended as first-line regimens in the Clinical Practice Guideline (2014). The cost-effectiveness of the two treatments was evaluated on the basis of the increase in CD4 counts at three time points (baseline, 6 and 12 months) and drug costs according to Colombian prices reported by SISMED for 2014. A factorial repeated-measures analysis of variance, a decision tree and an incremental cost-effectiveness analysis (ICER) were performed. Information was obtained from 546 patients, both men and women, of the Institución Asistencia Científica de Alta Complejidad S.A.S in Bogotá. Regimen 1 (Lamivudine, Zidovudine, Efavirenz) was found to be more effective and less costly than regimen 2 (Efavirenz, Emtricitabine, Tenofovir Disoproxil), and no high frequency of adverse effects that could favour one treatment over the other was observed. Based on these results, the institution and treating physicians have a pharmacoeconomic alternative to inform the choice of treatment when starting antiretroviral therapy in patients living with HIV with an undetectable viral load.
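A minimal sketch of the repeated-measures part of the analysis, using statsmodels' AnovaRM on simulated CD4 counts at the three time points; the full factorial design with the treatment factor and the cost-effectiveness calculations are not reproduced, and all values are simulated rather than the study's.

```python
# Minimal sketch of a repeated-measures ANOVA on CD4 counts at three time
# points (baseline, 6 and 12 months); data are simulated, not the study's,
# and the full factorial (treatment x time) design is not reproduced here.
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(1)
patients = np.repeat(np.arange(60), 3)
time = np.tile(["baseline", "6m", "12m"], 60)
cd4 = rng.normal(350, 80, size=180) + np.tile([0, 60, 120], 60)   # simulated rise over time

df = pd.DataFrame({"patient": patients, "time": time, "cd4": cd4})
res = AnovaRM(df, depvar="cd4", subject="patient", within=["time"]).fit()
print(res)
```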


Relevance: 60.00%

Publisher:

Abstract:

Biological emergencies such as the appearance of an exotic transboundary or emerging disease can become disasters. The question that faces Veterinary Services in developing countries is how to balance resources dedicated to active insurance measures, such as border control, surveillance, working with the governments of developing countries, and investing in improving veterinary knowledge and tools, with passive measures, such as contingency funds and vaccine banks. There is strong evidence that the animal health situation in developed countries has improved and is relatively stable. In addition, through trade with other countries, developing countries are becoming part of the international animal health system, the status of which is improving, though with occasional setbacks. However, despite these improvements, the risk of a possible biological disaster still remains, and has increased in recent times because of the threat of bioterrorism. This paper suggests that a model that combines decision tree analysis with epidemiology is required to identify critical points in food chains that should be strengthened to reduce the risk of emergencies and prevent emergencies from becoming disasters.

Relevance: 60.00%

Publisher:

Abstract:

Recently major processor manufacturers have announced a dramatic shift in their paradigm to increase computing power over the coming years. Instead of focusing on faster clock speeds and more powerful single-core CPUs, the trend clearly goes towards multi-core systems. This will also result in a paradigm shift for the development of algorithms for computationally expensive tasks, such as data mining applications. Obviously, work on parallel algorithms is not new per se, but concentrated efforts in the many application domains are still missing. Multi-core systems, but also clusters of workstations and even large-scale distributed computing infrastructures, provide new opportunities and pose new challenges for the design of parallel and distributed algorithms. Since data mining and machine learning systems rely on high-performance computing, research on the corresponding algorithms must be at the forefront of parallel algorithm research in order to keep pushing data mining and machine learning applications to be more powerful and, especially for the former, interactive. To bring together researchers and practitioners working in this exciting field, a workshop on parallel data mining was organized as part of PKDD/ECML 2006 (Berlin, Germany). The six contributions selected for the program describe various aspects of data mining and machine learning approaches featuring low to high degrees of parallelism: the first contribution addresses the classic problem of distributed association rule mining, focusing on communication efficiency to improve the state of the art. After this, a parallelization technique for speeding up decision tree construction by means of thread-level parallelism for shared-memory systems is presented. The next paper discusses the design of a parallel approach to the frequent subgraph mining problem for distributed-memory systems; this approach is based on a hierarchical communication topology to address issues related to multi-domain computational environments. The fourth paper describes the combined use and customization of software packages to facilitate top-down parallelism in the tuning of Support Vector Machines (SVM), and the next contribution presents an interesting idea concerning parallel training of Conditional Random Fields (CRFs) and motivates their use in labeling sequential data. The last contribution focuses on very efficient feature selection and describes a parallel algorithm for feature selection from random subsets. Selecting the papers included in this volume would not have been possible without the help of an international Program Committee that provided detailed reviews for each paper. We would also like to thank Matthew Otey, who helped with publicity for the workshop.
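As a rough illustration of the thread-level parallelism mentioned for decision tree construction, the sketch below evaluates candidate splits feature-by-feature in a thread pool and picks the best one; the Gini criterion, data and pool size are generic assumptions, not the workshop paper's implementation (and CPython's GIL limits the real speed-up of plain threads).

```python
# Sketch of thread-level parallelism in decision-tree construction: the search
# for the best split is farmed out per feature to a thread pool.  The splitting
# criterion and data are generic stand-ins, not the paper's implementation.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split_for_feature(X, y, j):
    """Return (feature index, (impurity, threshold)) of the best binary split on feature j."""
    best = (np.inf, None)
    for t in np.unique(X[:, j]):
        left, right = y[X[:, j] <= t], y[X[:, j] > t]
        if len(left) == 0 or len(right) == 0:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[0]:
            best = (score, t)
    return j, best

rng = np.random.default_rng(0)
X = rng.random((1000, 20))
y = (X[:, 3] > 0.5).astype(int)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda j: best_split_for_feature(X, y, j), range(X.shape[1])))

j, (score, t) = min(results, key=lambda r: r[1][0])
print(f"best split: feature {j} <= {t:.3f} (weighted Gini {score:.3f})")
```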

Relevance: 60.00%

Publisher:

Abstract:

We have performed microarray hybridization studies on 40 clinical isolates from 12 common serovars within Salmonella enterica subspecies I to identify the conserved chromosomal gene pool. We were able to separate the core invariant portion of the genome by a novel mathematical approach using a decision tree based on genes ranked by increasing variance. All genes within the core component were confirmed using available sequence and microarray information for S. enterica subspecies I strains. The majority of genes within the core component had conserved homologues in Escherichia coli K-12 strain MG1655. However, many genes present in the conserved set which were absent or highly divergent in K-12 had close homologues in pathogenic bacteria such as Shigella flexneri and Pseudomonas aeruginosa. Genes within previously established virulence determinants such as SPI1 to SPI5 were conserved. In addition several genes within SPI6, all of SPI9, and three fimbrial operons (fim, bcf, and stb) were conserved within all S. enterica strains included in this study. Although many phage and insertion sequence elements were missing from the core component, approximately half the pseudogenes present in S. enterica serovar Typhi were conserved. Furthermore, approximately half the genes conserved in the core set encoded hypothetical proteins. Separation of the core and variant gene sets within S. enterica subspecies I has offered fundamental biological insight into the genetic basis of phenotypic similarity and diversity across S. enterica subspecies I and shown how the core genome of these pathogens differs from the closely related E. coli K-12 laboratory strain.
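A conceptual sketch of the variance-ranking step described above: per-gene variance is computed across strains, genes are ranked by increasing variance, and a cut-off separates the low-variance core from the variable component. The authors' decision-tree procedure is not reproduced; here the cut-off is simply placed at the largest jump in the sorted variances, and the hybridisation matrix is simulated.

```python
# Conceptual sketch: rank genes by variance across strains and cut the ranked
# list into a low-variance "core" and a high-variance "variable" component.
# The hybridisation matrix is simulated; the authors' decision-tree procedure
# for choosing the cut-off is not reproduced.
import numpy as np

rng = np.random.default_rng(0)
n_strains, n_genes = 40, 500
is_core = rng.random(n_genes) < 0.6                                 # simulate ~60% conserved genes
signal = np.where(is_core, 1.0, rng.random((n_strains, n_genes)))   # core genes present in every strain

gene_variance = np.sort(signal.var(axis=0))       # genes ranked by increasing variance
gaps = np.diff(gene_variance)
cut = gene_variance[np.argmax(gaps)]              # largest jump separates the two components

n_core = int(np.sum(signal.var(axis=0) <= cut))
print(f"variance cut-off ≈ {cut:.4f}; {n_core} of {n_genes} genes fall in the core component")
```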

Relevance: 60.00%

Publisher:

Abstract:

The Prism family of algorithms induces modular classification rules which, in contrast to decision tree induction algorithms, do not necessarily fit together into a decision tree structure. Classifiers induced by Prism algorithms achieve accuracy comparable with decision trees and in some cases even outperform them. Both kinds of algorithms tend to overfit on large and noisy datasets, and this has led to the development of pruning methods. Pruning methods use various metrics to truncate decision trees or to eliminate whole rules or single rule terms from a Prism rule set. For decision trees many pre-pruning and post-pruning methods exist; however, for Prism algorithms only one pre-pruning method has been developed, J-pruning. Recent work with Prism algorithms examined J-pruning in the context of very large datasets and found that the current method does not use its full potential. This paper revisits the J-pruning method for the Prism family of algorithms and develops a new pruning method, Jmax-pruning, discusses it in theoretical terms and evaluates it empirically.
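For orientation, the sketch below computes the J-measure of Smyth and Goodman, the rule-quality metric on which J-pruning is based; the rule statistics are hypothetical, and the Jmax-pruning variant introduced in the paper is not reproduced.

```python
# Sketch of the J-measure (Smyth & Goodman) underlying J-pruning: the average
# information content of a rule "IF condition THEN class".  The probabilities
# below are hypothetical rule statistics, not taken from the paper.
import math

def j_measure(p_cond: float, p_class: float, p_class_given_cond: float) -> float:
    """J = p(cond) * [p(c|cond)*log2(p(c|cond)/p(c)) + (1-p(c|cond))*log2((1-p(c|cond))/(1-p(c)))]"""
    def term(post, prior):
        return 0.0 if post == 0.0 else post * math.log2(post / prior)
    j = term(p_class_given_cond, p_class) + term(1 - p_class_given_cond, 1 - p_class)
    return p_cond * j

# A rule that fires on 20% of the data and raises the class probability from 0.3 to 0.9:
print(round(j_measure(p_cond=0.2, p_class=0.3, p_class_given_cond=0.9), 4))
# J-pruning stops specialising a rule when adding another term would lower its J value.
```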

Relevance: 60.00%

Publisher:

Abstract:

Ensemble learning techniques generate multiple classifiers, so-called base classifiers, whose combined classification results are used to increase the overall classification accuracy. In most ensemble classifiers the base classifiers are based on the Top Down Induction of Decision Trees (TDIDT) approach. However, an alternative approach for the induction of rule-based classifiers is the Prism family of algorithms. Prism algorithms produce modular classification rules that do not necessarily fit into a decision tree structure. Prism classification rule sets achieve accuracy comparable with, and sometimes higher than, decision tree classifiers, particularly if the data is noisy and large. Yet Prism still suffers from overfitting on noisy and large datasets. In practice ensemble techniques tend to reduce overfitting; however, no ensemble learner exists for modular classification rule inducers such as the Prism family of algorithms. This article describes the first development of an ensemble learner based on the Prism family of algorithms, intended to enhance Prism's classification accuracy by reducing overfitting.
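Since Prism implementations are not part of mainstream libraries, the sketch below only illustrates the generic ensemble scheme: base classifiers trained on bootstrap samples and combined by majority vote, with scikit-learn decision trees standing in for the Prism rule inducers and synthetic data in place of real datasets.

```python
# Sketch of the general ensemble scheme: train base classifiers on bootstrap
# samples and combine them by majority vote.  Decision trees stand in for the
# Prism rule inducers, which have no mainstream library implementation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
base_classifiers = []
for _ in range(25):                                     # 25 base classifiers
    idx = rng.integers(0, len(X_tr), size=len(X_tr))    # bootstrap sample of the training data
    base_classifiers.append(DecisionTreeClassifier().fit(X_tr[idx], y_tr[idx]))

votes = np.array([clf.predict(X_te) for clf in base_classifiers])
majority = (votes.mean(axis=0) >= 0.5).astype(int)      # majority vote (binary classes 0/1)
print("ensemble accuracy:", np.mean(majority == y_te))
```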

Relevance: 60.00%

Publisher:

Abstract:

Advances in hardware and software over the past decade make it possible to capture, record and process fast data streams at a large scale. The research area of data stream mining has emerged as a consequence of these advances in order to cope with the real-time analysis of potentially large and changing data streams. Examples of data streams include Google searches, credit card transactions, telemetric data and data from continuous chemical production processes. In some cases the data can be processed in batches by traditional data mining approaches; however, some applications require the data to be analysed in real time as soon as it is captured, for example if the data stream is infinite, fast changing, or simply too large to be stored. One of the most important data mining techniques on data streams is classification, which involves training the classifier on the data stream in real time and adapting it to concept drift. Most data stream classifiers are based on decision trees. However, it is well known in the data mining community that there is no single optimal algorithm: an algorithm may work well on one or several datasets but badly on others. This paper introduces eRules, a new rule-based adaptive classifier for data streams, based on an evolving set of rules. eRules induces a set of rules that is constantly evaluated and adapted to changes in the data stream by adding new rules and removing old ones. It differs from the more popular decision tree based classifiers in that it tends to leave data instances unclassified rather than forcing a classification that could be wrong. The ongoing development of eRules aims to improve its accuracy further through dynamic parameter setting, which will also address the problem of changing feature domain values.
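The eRules inducer itself is not reproduced here; the sketch below only mimics the behaviour described above, an adaptive classifier that abstains when it is not confident and periodically re-learns from a sliding window of recent instances, with a shallow decision tree standing in for the rule learner and a simulated stream containing one concept drift.

```python
# Conceptual sketch of an adaptive stream classifier that abstains when it is
# not confident and periodically re-learns from a sliding window.  A shallow
# decision tree stands in for the eRules rule inducer; the stream is simulated.
from collections import deque
import numpy as np
from sklearn.tree import DecisionTreeClassifier

window = deque(maxlen=500)          # recent labelled instances
classifier = None

def maybe_retrain():
    """Re-induce the model once enough labelled instances have arrived."""
    global classifier
    if len(window) >= 200:
        Xw = np.array([x for x, _ in window])
        yw = np.array([y for _, y in window])
        classifier = DecisionTreeClassifier(max_depth=3).fit(Xw, yw)

def classify(x, confidence_threshold=0.8):
    """Return a class label, or None (abstain) when no prediction is confident enough."""
    if classifier is None:
        return None
    proba = classifier.predict_proba([x])[0]
    return int(np.argmax(proba)) if proba.max() >= confidence_threshold else None

rng = np.random.default_rng(0)
abstained, errors, seen = 0, 0, 0
for t in range(2000):
    x = rng.random(5)
    y = int(x[0] > 0.5) if t < 1000 else int(x[1] > 0.5)   # concept drift at t = 1000
    pred = classify(x)
    if pred is None:
        abstained += 1
    else:
        seen += 1
        errors += int(pred != y)
    window.append((x, y))            # true label arrives; keep it in the sliding window
    if (t + 1) % 100 == 0:
        maybe_retrain()              # periodically adapt to the current window
print(f"classified {seen}, abstained {abstained}, error rate {errors / max(seen, 1):.3f}")
```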

Relevance: 60.00%

Publisher:

Abstract:

Ensemble learning can be used to increase the overall classification accuracy of a classifier by generating multiple base classifiers and combining their classification results. A frequently used family of base classifiers for ensemble learning is decision trees. However, alternative approaches can potentially be used, such as the Prism family of algorithms, which induces classification rules. Compared with decision trees, Prism algorithms generate modular classification rules that cannot necessarily be represented in the form of a decision tree. Prism algorithms produce classification accuracy similar to that of decision trees; in some cases, for example if there is noise in the training and test data, they can even outperform decision trees. However, Prism still tends to overfit on noisy data; hence, ensemble learners have been adopted in this work to reduce the overfitting. This paper describes the development of an ensemble learner using a member of the Prism family as the base classifier in order to reduce the overfitting of Prism algorithms on noisy datasets. The developed ensemble classifier is compared with a stand-alone Prism classifier in terms of classification accuracy and resistance to noise.
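The comparison described above can be illustrated, very loosely, with scikit-learn: a single decision tree versus a bagged ensemble of trees on data with increasing amounts of injected label noise. Decision trees stand in for the Prism base classifier, and the data and noise levels are arbitrary choices, not the paper's experiments.

```python
# Illustrative comparison of a stand-alone classifier versus a bagged ensemble
# on data with injected label noise.  Decision trees stand in for the Prism
# base classifier; the noise levels and data are arbitrary choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

for noise in (0.0, 0.1, 0.3):                              # fraction of randomly flipped labels
    X, y = make_classification(n_samples=1000, n_features=20, flip_y=noise, random_state=0)
    single = DecisionTreeClassifier(random_state=0)
    bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
    print(f"noise={noise:.1f}  single={cross_val_score(single, X, y, cv=5).mean():.3f}"
          f"  ensemble={cross_val_score(bagged, X, y, cv=5).mean():.3f}")
```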

Relevance: 60.00%

Publisher:

Abstract:

Full-waveform laser scanning data acquired with a Riegl LMS-Q560 instrument were used to classify an orange orchard into orange trees, grass and ground using waveform parameters alone. Gaussian decomposition was performed on these data, captured during the National Airborne Field Experiment in November 2006, using a custom peak-detection procedure and a trust-region-reflective algorithm for fitting Gaussian functions. Calibration was carried out using waveforms returned from a road surface, and the backscattering coefficient c was derived for every waveform peak. The processed data were then analysed according to the number of returns detected within each waveform and classified into three classes based on pulse width and c. For single-peak waveforms the scatterplot of c versus pulse width was used to distinguish between ground, grass and orange trees. In the case of multiple returns, the relationship between first (or first plus middle) and last return c values was used to separate ground from other targets. Refinement of this classification, and further sub-classification into grass and orange trees, was performed using the c versus pulse width scatterplots of last returns. In all cases the separation was carried out using a decision tree with empirical relationships between the waveform parameters. Ground points were successfully separated from orange tree points. The most difficult class to separate and verify was grass, but those points in general corresponded well with the grass areas identified in the aerial photography. The overall accuracy reached 91%, using photography and relative elevation as ground truth. The overall accuracy for two classes, orange tree and a combined class of grass and ground, was 95%. Finally, the backscattering coefficient c of single-peak waveforms was also used to derive reflectance values of the three classes. The reflectance of the orange tree class (0.31) and ground class (0.60) are consistent with published values at the wavelength of the Riegl scanner (1550 nm). The grass class reflectance (0.46) falls in between the other two classes, as might be expected, since this class mixes the contributions of both vegetation and ground reflectance properties.
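A minimal Gaussian-decomposition sketch in the spirit of the processing described above: one Gaussian pulse is fitted to a simulated return waveform with SciPy's trust-region-reflective least squares ('trf'), the algorithm family named above. The custom peak detection, calibration and backscatter derivation of the study are not reproduced, and all waveform values are synthetic.

```python
# Minimal Gaussian-decomposition sketch: fit one Gaussian pulse to a simulated
# return waveform using SciPy's trust-region-reflective least squares ('trf').
# The waveform and noise are synthetic; the study's peak detection and
# calibration are not reproduced.
import numpy as np
from scipy.optimize import curve_fit

def gaussian(t, amplitude, centre, sigma):
    return amplitude * np.exp(-0.5 * ((t - centre) / sigma) ** 2)

rng = np.random.default_rng(0)
t = np.arange(0, 60, 1.0)                                   # sample times (hypothetical units)
waveform = gaussian(t, amplitude=120, centre=28.0, sigma=2.5) + rng.normal(0, 3, t.size)

p0 = (waveform.max(), t[waveform.argmax()], 2.0)            # initial guess from the raw peak
bounds = ([0, t.min(), 0.5], [np.inf, t.max(), 10.0])       # supplying bounds selects the 'trf' solver
params, _ = curve_fit(gaussian, t, waveform, p0=p0, bounds=bounds, method="trf")

amplitude, centre, sigma = params
pulse_width_fwhm = 2.3548 * sigma                           # FWHM = 2*sqrt(2*ln 2) * sigma
print(f"amplitude {amplitude:.1f}, centre {centre:.2f}, pulse width (FWHM) {pulse_width_fwhm:.2f}")
```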

Relevance: 60.00%

Publisher:

Abstract:

Land cover plays a key role in global to regional monitoring and modeling because it affects and is affected by climate change, and it has thus become one of the essential variables for climate change studies. National and international organizations require timely and accurate land cover information for reporting and management actions. The North American Land Change Monitoring System (NALCMS) is an international cooperation of organizations and entities from Canada, the United States and Mexico to map land cover change across North America's changing environment. This paper presents the methodology used to derive the land cover map of Mexico for the year 2005, which was integrated into the NALCMS continental map. Based on a time series of 250 m Moderate Resolution Imaging Spectroradiometer (MODIS) data and an extensive sample database, the complexity of the Mexican landscape required a specific approach to reflect land cover heterogeneity. To estimate the proportion of each land cover class for every pixel, several decision tree classifications were combined to obtain class-membership maps, which were finally converted to a discrete map accompanied by a confidence estimate. The map yielded an overall accuracy of 82.5% (Kappa of 0.79) for pixels with at least 50% map confidence (71.3% of the data). An additional assessment with 780 randomly stratified samples, using primary and alternative calls in the reference data to account for ambiguity, indicated 83.4% overall accuracy (Kappa of 0.80). A high agreement of 83.6% for all pixels, and 92.6% for pixels with a map confidence of more than 50%, was found in the comparison between the land cover maps of 2005 and 2006. Further wall-to-wall comparisons to related land cover maps resulted in 56.6% agreement with the MODIS land cover product and a congruence of 49.5% with GlobCover.
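A toy sketch of the map-integration step described above: several decision-tree classifications are combined into per-pixel class-membership proportions, and the winning class and its proportion give the discrete map and a confidence layer. The class list, features and number of trees are hypothetical, not those of the NALCMS workflow.

```python
# Toy sketch of combining several decision-tree classifications into per-pixel
# class-membership proportions, a discrete map and a confidence layer.  The
# classes, features and number of trees are hypothetical.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

classes = ["forest", "shrubland", "cropland", "urban"]
rng = np.random.default_rng(0)
X_train = rng.random((2000, 7))                     # per-pixel time-series features (simulated)
y_train = rng.integers(0, len(classes), 2000)
X_pixels = rng.random((10, 7))                      # pixels to be mapped

# Train several trees on bootstrap samples and count their votes per pixel.
votes = np.zeros((len(X_pixels), len(classes)))
for seed in range(10):
    idx = rng.integers(0, len(X_train), len(X_train))
    tree = DecisionTreeClassifier(max_depth=8, random_state=seed).fit(X_train[idx], y_train[idx])
    preds = tree.predict(X_pixels)
    votes[np.arange(len(X_pixels)), preds] += 1

membership = votes / votes.sum(axis=1, keepdims=True)   # class-membership proportions
discrete = membership.argmax(axis=1)                    # discrete land cover map
confidence = membership.max(axis=1)                     # e.g. keep pixels with >= 0.5 confidence
print([classes[c] for c in discrete[:5]], confidence[:5].round(2))
```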