64 results for WEKA
Abstract:
The CTC (Consolidated Tree Construction) algorithm is a machine learning paradigm that was designed to solve a class imbalance problem: a fraud detection problem in the area of car insurance [1] where, in addition, an explanation of the classification made was required. The algorithm is based on a decision tree construction algorithm, in this case the well-known C4.5, but it extracts knowledge from data using a set of samples instead of a single one as C4.5 does. In contrast to other methodologies that build a classifier from several samples, such as bagging, CTC builds a single tree and consequently obtains comprehensible classifiers. The main motivation of this implementation is to make a public implementation of the CTC algorithm available. With this purpose we have implemented the algorithm within the well-known WEKA data mining environment (http://www.cs.waikato.ac.nz/ml/weka/). WEKA is an open source project that contains a collection of machine learning algorithms written in Java for data mining tasks. J48 is the implementation of the C4.5 algorithm within the WEKA package. We call J48Consolidated the implementation of the CTC algorithm based on the J48 Java class.
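A minimal sketch of training this classifier through WEKA's standard Java API, assuming the class is exposed as weka.classifiers.trees.J48Consolidated (the abstract names the class but not its package; the ARFF path is illustrative):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.trees.J48Consolidated; // assumed package location

public class CtcExample {
    public static void main(String[] args) throws Exception {
        // Load a training set in ARFF format (path is illustrative).
        Instances data = DataSource.read("car-insurance-fraud.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Build one consolidated tree from a set of subsamples; the call
        // pattern is the same as for any WEKA classifier, including J48.
        J48Consolidated ctc = new J48Consolidated();
        ctc.buildClassifier(data);

        // The result is a single comprehensible tree, printable like J48's.
        System.out.println(ctc);
    }
}
```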
Abstract:
This work has as its objectives the implementation of an intelligent computational tool to identify non-technical losses and to select their most relevant features, considering information from a database of industrial consumer profiles of a power company. The solution to this problem is neither trivial nor of regional character; the minimization of non-technical losses represents the guarantee of investments in product quality and maintenance of power systems, introduced by the competitive environment that followed the privatization period on the national scene. This work presents the use of the WEKA software for the proposed objective, comparing various classification techniques and optimization through intelligent algorithms; in this way, it becomes possible to automate applications on Smart Grids. © 2012 IEEE.
Abstract:
An important responsibility of the Environment Protection Authority, Victoria, is to set objectives for levels of environmental contaminants. To support the development of environmental objectives for water quality, a need has been identified to understand the dual impacts of concentration and duration of a contaminant on biota in freshwater streams. For suspended solids contamination, information reported in the Newcombe and Jensen study of freshwater fish [North American Journal of Fisheries Management, 16(4):693--727, 1996] and the daily suspended solids data from the United States Geological Survey stream monitoring network are utilised. The study group was requested to examine both the utility of the Newcombe and Jensen results and the USGS data, as well as the formulation of a procedure for use by the Environment Protection Authority Victoria that takes concentration and duration of harmful episodes into account when assessing water quality. The extent to which the impact of a toxic event on fish health could be modelled deterministically was also considered. It was found that concentration and exposure duration were the main compounding factors on the severity of effects of suspended solids on freshwater fish. A protocol for assessing the cumulative effect on fish health and a simple deterministic model, based on the biology of gill harm and recovery, were proposed.
References
D. W. T. Au, C. A. Pollino, R. S. S. Wu, P. K. S. Shin, S. T. F. Lau, and J. Y. M. Tang. Chronic effects of suspended solids on gill structure, osmoregulation, growth, and triiodothyronine in juvenile green grouper Epinephelus coioides. Marine Ecology Progress Series, 266:255--264, 2004.
J. C. Bezdek, S. K. Chuah, and D. Leep. Generalized k-nearest neighbor rules. Fuzzy Sets and Systems, 18:237--26, 1986.
E. T. Champagne, K. L. Bett-Garber, A. M. McClung, and C. Bergman. Sensory characteristics of diverse rice cultivars as influenced by genetic and environmental factors. Cereal Chem., 81:237--243, 2004.
S. G. Cheung and P. K. S. Shin. Size effects of suspended particles on gill damage in green-lipped mussel Perna viridis. Marine Pollution Bulletin, 51(8--12):801--810, 2005.
D. H. Evans. The fish gill: site of action and model for toxic effects of environmental pollutants. Environmental Health Perspectives, 71:44--58, 1987.
G. C. Grigg. The failure of oxygen transport in a fish at low levels of ambient oxygen. Comp. Biochem. Physiol., 29:1253--1257, 1969.
G. Holmes, A. Donkin, and I. H. Witten. Weka: A machine learning workbench. In Proceedings of the Second Australia and New Zealand Conference on Intelligent Information Systems, volume 24, pages 357--361, Brisbane, Australia, 1994. IEEE Computer Society.
D. D. Macdonald and C. P. Newcombe. Utility of the stress index for predicting suspended sediment effects: response to comments. North American Journal of Fisheries Management, 13:873--876, 1993.
C. P. Newcombe. Suspended sediment in aquatic ecosystems: ill effects as a function of concentration and duration of exposure. Technical report, British Columbia Ministry of Environment, Lands and Parks, Habitat Protection Branch, Victoria, 1994.
C. P. Newcombe and J. O. T. Jensen. Channel suspended sediment and fisheries: A synthesis for quantitative assessment of risk and impact. North American Journal of Fisheries Management, 16(4):693--727, 1996.
C. P. Newcombe and D. D. Macdonald. Effects of suspended sediments on aquatic ecosystems. North American Journal of Fisheries Management, 11(1):72--82, 1991.
K. Schmidt-Nielsen. Scaling. Why is animal size so important? Cambridge University Press, NY, 1984.
J. S. Schwartz, A. Simon, and L. Klimetz. Use of fish functional traits to associate in-stream suspended sediment transport metrics with biological impairment. Environmental Monitoring and Assessment, 179(1--4):347--369, 2011.
E. A. Shaw and J. S. Richardson. Direct and indirect effects of sediment pulse duration on stream invertebrate assemblages and rainbow trout (Oncorhynchus mykiss) growth and survival. Canadian Journal of Fisheries and Aquatic Sciences, 58:2213--2221, 2001.
P. Tiwari and H. Hasegawa. Demand for housing in Tokyo: A discrete choice analysis. Regional Studies, 38:27--42, 2004.
Y. Tramblay, A. Saint-Hilaire, T. B. M. J. Ouarda, F. Moatar, and B. Hecht. Estimation of local extreme suspended sediment concentrations in California rivers. Science of the Total Environment, 408:4221--
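The protocol above rests on the Newcombe and Jensen severity-of-effects model, which is log-linear in exposure duration and concentration: SEV = a + b ln(duration) + c ln(concentration) on a 0-14 severity scale. A minimal sketch, with coefficients shown only as illustrative placeholders (the 1996 paper fits separate values for each species group and life stage):

```java
// Sketch of the log-linear severity-of-effects form used by Newcombe and
// Jensen (1996): SEV = a + b*ln(duration) + c*ln(concentration).
// The coefficients passed below are illustrative placeholders only; the
// published paper fits different values per species group and life stage.
public class SeverityIndex {
    static double severity(double durationHours, double concMgPerL,
                           double a, double b, double c) {
        return a + b * Math.log(durationHours) + c * Math.log(concMgPerL);
    }

    public static void main(String[] args) {
        // Example: 24 hours of exposure at 100 mg/L suspended solids.
        double sev = severity(24.0, 100.0, 1.0, 0.6, 0.7); // placeholder a, b, c
        System.out.printf("Severity score (0-14 scale): %.2f%n", sev);
    }
}
```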
Abstract:
Today's programming languages are supported by powerful third-party APIs. For a given application domain, it is common to have many competing APIs that provide similar functionality. Programmer productivity therefore depends heavily on the programmer's ability to discover suitable APIs both during an initial coding phase, as well as during software maintenance. The aim of this work is to support the discovery and migration of math APIs. Math APIs are at the heart of many application domains ranging from machine learning to scientific computations. Our approach, called MATHFINDER, combines executable specifications of mathematical computations with unit tests (operational specifications) of API methods. Given a math expression, MATHFINDER synthesizes pseudo-code comprised of API methods to compute the expression by mining unit tests of the API methods. We present a sequential version of our unit test mining algorithm and also design a more scalable data-parallel version. We perform an extensive evaluation of MATHFINDER (1) for API discovery, where math algorithms are to be implemented from scratch, and (2) for API migration, where client programs utilizing a math API are to be migrated to another API. We evaluated the precision and recall of MATHFINDER on a diverse collection of math expressions, culled from algorithms used in a wide range of application areas such as control systems and structural dynamics. In a user study to evaluate the productivity gains obtained by using MATHFINDER for API discovery, the programmers who used MATHFINDER finished their programming tasks twice as fast as their counterparts who used the usual techniques like web and code search, IDE code completion, and manual inspection of library documentation. For the problem of API migration, as a case study, we used MATHFINDER to migrate Weka, a popular machine learning library. Overall, our evaluation shows that MATHFINDER is easy to use, provides highly precise results across several math APIs and application domains even with a small number of unit tests per method, and scales to large collections of unit tests.
Abstract:
This document aims to describe an update of the implementation of the J48Consolidated class within the WEKA platform. The J48Consolidated class implements the CTC algorithm [2][3], which builds a unique decision tree based on a set of samples. The J48Consolidated class extends WEKA's J48 class, which implements the well-known C4.5 algorithm. This implementation was described in the technical report "J48Consolidated: An implementation of CTC algorithm for WEKA". The main, but not only, change in this update is the integration of the notion of coverage in order to determine the number of samples to be generated to build a consolidated tree. We define coverage as the percentage of examples of the training sample present in (or covered by) the set of generated subsamples. So, depending on the type of samples that we use, we will need more or fewer samples in order to achieve a specific value of coverage.
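To see why the required sample count depends on the type of samples, consider subsamples that each draw a fraction f of the training set: a given example is absent from all n independent subsamples with probability (1 - f)^n, so the expected coverage is 1 - (1 - f)^n and n = ceil(ln(1 - target) / ln(1 - f)). A back-of-the-envelope sketch of this reasoning, under an independence assumption made for illustration, not necessarily the exact rule implemented in J48Consolidated:

```java
// Estimate how many subsamples are needed to reach a target coverage,
// assuming each subsample independently draws a fraction f of the training
// examples. An example is missed by all n subsamples with probability
// (1 - f)^n, so expected coverage = 1 - (1 - f)^n.
// Illustration only: not necessarily the exact rule in J48Consolidated.
public class CoverageEstimate {
    static int samplesForCoverage(double targetCoverage, double sampleFraction) {
        return (int) Math.ceil(Math.log(1.0 - targetCoverage)
                             / Math.log(1.0 - sampleFraction));
    }

    public static void main(String[] args) {
        // 99% coverage with subsamples holding 25% of the training data:
        System.out.println(samplesForCoverage(0.99, 0.25)); // prints 17
    }
}
```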
Abstract:
In this work, classification models were used to mine data related to the learning of mathematics and to the profile of primary-school teachers. More specifically, we examined the factors related to educators in the State of Rio de Janeiro that positively or negatively influence the performance of 9th-grade students in mathematics tests. The data used to extract this information are made available by the Instituto Nacional de Estudos e Pesquisas Educacionais Anísio Teixeira (INEP), which evaluates the Brazilian educational system at several levels and teaching modalities, including basic education, whose assessment, the focus of this study, is carried out through the Prova Brasil. From this database, the Knowledge Discovery in Databases (KDD) process was applied, comprising the data preparation, mining, and post-processing steps. Patterns were extracted from the classification models generated by decision-tree, rule-induction, and Bayesian-classifier techniques, whose algorithms are implemented in the Weka software (Waikato Environment for Knowledge Analysis). In addition, clustering methods and a methodology to make the classes uniformly distributed were applied in order to improve the accuracy of the obtained models. The results revealed important factors that contribute to the teaching and learning of mathematics, as well as aspects that negatively affect student performance. Finally, the extracted results provide educators and public-policy makers with factors for an analysis that can support subsequent decision making.
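The abstract does not say which WEKA mechanism was used to make the classes uniformly distributed; one standard option is the supervised SpreadSubsample filter, sketched here (the dataset path is illustrative):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.SpreadSubsample;

public class BalanceClasses {
    public static void main(String[] args) throws Exception {
        // Illustrative path to the prepared Prova Brasil teacher data.
        Instances data = DataSource.read("prova-brasil-teachers.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Subsample so every class ends up with the same number of
        // examples (a distribution spread of 1.0 means uniform classes).
        SpreadSubsample balance = new SpreadSubsample();
        balance.setDistributionSpread(1.0);
        balance.setInputFormat(data);
        Instances balanced = Filter.useFilter(data, balance);

        System.out.println("Balanced instances: " + balanced.numInstances());
    }
}
```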
Abstract:
Nowadays, most operations carried out by companies and organizations are stored in databases that researchers can explore in order to obtain useful information to support decision making. Because of the large volume involved, extracting and analyzing the data is not a simple task. The overall process of converting raw data into useful information is called Knowledge Discovery in Databases (KDD). One of the steps of this process is Data Mining, which consists of applying algorithms and statistical techniques to explore information implicitly contained in large databases. Many areas use the KDD process to facilitate the recognition of patterns or models in their information bases. This work presents a practical application of the KDD process using the database of 9th-grade students of the State of Rio de Janeiro, available on the INEP website, with the goal of discovering interesting patterns between a student's socioeconomic profile and their mathematics performance in the 2011 Prova Brasil. In this work, using the Weka tool (Waikato Environment for Knowledge Analysis), the data mining task known as association was applied, and rules were extracted by means of the Apriori algorithm. This study found, for example, that students who had already failed a grade once tended to get a lower score on the mathematics test, while students who had never failed performed better. Other factors, such as students' future aspirations, parents' education level, preference for mathematics, the ethnic group to which the student belongs, and whether the student frequently reads websites, also positively or negatively influence learning. An analysis was also made according to the infrastructure of the school where the student studies, which showed that the discovered patterns occur regardless of whether these students attend schools with good or poor infrastructure. The results obtained can be used to draw profiles of students who perform better or worse in mathematics and to design public policies for primary education.
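A minimal sketch of the association step described above, using WEKA's Apriori implementation (the dataset path and the number of rules are illustrative; WEKA's Apriori expects nominal attributes):

```java
import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MineRules {
    public static void main(String[] args) throws Exception {
        // Illustrative path to the nominal student questionnaire data.
        Instances data = DataSource.read("prova-brasil-students.arff");

        Apriori apriori = new Apriori();
        apriori.setNumRules(20); // report the 20 best rules (illustrative)
        apriori.buildAssociations(data);

        // Rules such as "already failed a grade => lower mathematics
        // score" are printed with their support and confidence.
        System.out.println(apriori);
    }
}
```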
Abstract:
This research aims to verify the contribution of information technology to the forecasting of performance indicators at Company Alfa. A single case study was carried out in order to investigate the question in an exploratory and descriptive manner. The techniques used were document analysis and interviews, which supported the quantitative research in line with the objective proposed in the study and its theoretical foundation. The research was mainly based on the results of the performance indicators for the years 2012 and 2013 described in the strategic planning for 2013. From these results it was possible to forecast the indicator results for 2014 using the Weka software and thus carry out the necessary analyses. The main findings show that Company Alfa will need to anticipate actions to maximize its results and avoid negative impacts on profitability, in addition to needing to maintain a solid, structured database that can support reliable forecasts and feed the program in the future for new predictions. The research results indicate that the Weka information system contributes to forecasting results, making it possible to anticipate actions so that the organization can optimize its decision making, rendering it more efficient and effective.
Abstract:
Supply chain coordination has attracted wide attention in international supply chain management and is a highly interdisciplinary, knowledge-intensive frontier research area. It integrates techniques from economics, management, marketing, and modern network and information science, and improves the overall performance of the supply chain through competition and collaboration among its stakeholders. As an important route for upgrading and optimizing China's industrial structure, it has become a third source of profit growth for enterprises, after natural resources and human capital.
Building on an in-depth study of supply chain management, supply chain coordination, and contract theory, this thesis identifies trends in research on contract-based supply chain coordination; following these trends, its main contributions complement and extend current theory. Aiming at the many practical problems of supply chain management and coordination in China, and at the personalized, diverse, and uncertain customer demand of modern markets, the thesis studies a unified framework, methods, and technical roadmap for supply chain contracts, with emphasis on customer-demand-driven distribution contract coordination. It establishes contract optimization models under different conditions, gives a series of proofs based on game-theoretic equilibrium solutions, analyzes the influence of contract parameters on the performance of the whole chain and of each member, and finally designs and implements an automatic contract negotiation platform based on an SOA architecture. The main contributions are the following four:
1. A contract optimization model for a many-to-one supply chain with deterministic customer demand and complete information. For the distribution process of a many-to-one supply chain composed of several competing manufacturers and one independent, common retailer, with demand linearly related to the retail price and both sides holding complete information, and with the retailer's reservation profit given endogenously, the thesis builds a Stackelberg game model under a revenue-sharing contract framework and contract parameter optimization models under quantity-discount and two-part-tariff contracts. It analyzes how the party offering the contract, the choice of contract type, and the optimization of contract parameters affect overall supply chain performance and the division of profit among the parties; it proves that when the manufacturers' products are highly substitutable, the retailer's endogenous reservation profit effectively increases, i.e., the retailer's bargaining power relative to the manufacturers is strengthened, in which case manufacturers will prefer to offer a wholesale-price contract rather than more complex contract forms. Numerical simulations verify these theoretical results.
2. A mixed contract optimization model for a two-echelon supply chain with myopic customer demand, asymmetric information, and capacity constraints. For a duopoly supply chain composed of a single manufacturer and a single retailer, a newsvendor model of myopic customer demand, and asymmetric demand-forecast information, an improved buy-back contract model combining advance-purchase and buy-back contracts is established. This contract model simultaneously achieves collaborative forecasting, capacity optimization, and distribution coordination. Numerical simulations verify supply chain coordination under the improved buy-back contract, and the inefficiency of the two-echelon chain is explained through the risk-profit margin and the degree of information asymmetry.
3. A contract optimization model for a three-echelon supply chain with strategic demand and complete information. In a three-echelon structure composed of a manufacturer, a retailer, and rational customers whose product demand is strategic, a model framework for studying strategic customer behavior is built on the rational-expectations equilibrium between the retailer and the customers. Comparing the performance of the centralized supply chain with that of the decentralized chain under wholesale-price and price-compensation contracts yields the novel conclusion that, in a distribution channel driven by strategic customer demand and coordinated by these contracts, decentralized performance is strictly better than centralized performance.
4. An SOA-based automatic contract negotiation platform. Synthesizing the above theoretical results with existing work, a library of contract optimization service models covering contract types and parameters is abstracted and encapsulated. Based on service-oriented architecture (SOA), the data mining software Weka is used to refine customer demand types and the supply chain environment, Web Service technology encapsulates the various contract services, an enterprise service bus (ESB) provides binding, interaction, and management channels for the service components, and BPEL modeling tools orchestrate, recombine, and publish the services in logical workflows. The resulting platform makes contract negotiation in supply chain distribution networked, automated, intelligent, and flexible.
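As context for contribution 1, a hedged illustration of the textbook revenue-sharing coordination result for a one-supplier, one-retailer newsvendor channel (a standard result from the contract literature, e.g. Cachon and Lariviere, not the thesis's own multi-manufacturer model):

```latex
% Revenue sharing in the newsvendor channel: the retailer keeps a fraction
% \phi of revenue and pays unit wholesale price w. Setting
\[
  w(\phi) = \phi\, c, \qquad 0 < \phi \le 1,
\]
% makes the retailer's profit an exact share of the integrated chain's
% profit \Pi(q), so the retailer orders the integrated optimum and the
% channel is coordinated for any \phi:
\[
  \pi_r(q) = \phi\, \Pi(q) \quad\Longrightarrow\quad q_r^{*} = q^{*}.
\]
```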
Abstract:
Using scientific methods in the humanities is at the forefront of objective literary analysis. However, processing big data is particularly complex when the subject matter is qualitative rather than numerical. Large volumes of text require specialized tools to produce quantifiable data from ideas and sentiments. Our team researched the extent to which tools such as Weka and MALLET can test hypotheses about qualitative information. We examined the claim that literary commentary exists within political environments and used US periodical articles concerning Russian literature in the early twentieth century as a case study. These tools generated useful quantitative data that allowed us to run stepwise binary logistic regressions. These statistical tests allowed for time series experiments using sea change and emergency models of history, as well as classification experiments with regard to author characteristics, social issues, and sentiment expressed. Both types of experiments supported our claim to varying degrees but, more importantly, served as a definitive demonstration that digitally enhanced quantitative forms of analysis can apply to qualitative data. Our findings set the foundation for further experiments in the emerging field of digital humanities.
Abstract:
This work proposes an extended version of the well-known tree-augmented naive Bayes (TAN) classifier where the structure learning step is performed without requiring features to be connected to the class. Based on a modification of Edmonds' algorithm, our structure learning procedure explores a superset of the structures considered by TAN, yet achieves global optimality of the learning score function in a very efficient way (quadratic in the number of features, the same complexity as learning TANs). We enhance our procedure with a new score function that only takes into account arcs that are relevant to predict the class, as well as an optimization over the equivalent sample size during learning. These ideas may be useful for structure learning of Bayesian networks in general. A range of experiments shows that we obtain models with better prediction accuracy than naive Bayes and TAN, and with accuracy comparable to that of the state-of-the-art averaged one-dependence estimator (AODE) classifier. We release our implementation of ETAN so that it can be easily installed and run within Weka.
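The abstract does not give the class name under which ETAN is released; as a point of reference, the plain TAN baseline that ETAN generalizes can be trained with WEKA's built-in classes, as in this sketch (the dataset path is illustrative):

```java
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.net.search.local.TAN;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TanBaseline {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // illustrative
        data.setClassIndex(data.numAttributes() - 1);

        // Plain TAN: every feature gets the class plus at most one other
        // feature as parents; ETAN relaxes the requirement that features
        // be connected to the class.
        BayesNet tan = new BayesNet();
        tan.setSearchAlgorithm(new TAN());
        tan.buildClassifier(data);

        System.out.println(tan);
    }
}
```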
Abstract:
The main purpose of this thesis project is the prediction of symptom severity and cause in data from a test battery for Parkinson's disease patients, based on data mining. The data were collected from a test battery administered on a handheld computer. We use the chi-square method to check which variables are important and which are not. We then apply different data mining techniques to the normalized data and check which technique or method gives good results. The implementation of this thesis is in WEKA. We normalize our data and then apply different methods to it: Naïve Bayes, CART, and KNN. We use Bland-Altman plots and Spearman's correlation to check the final results and the predictions. The Bland-Altman plot tells us to what extent our confidence level in the data is correct, and Spearman's correlation tells us how strong the relationship is. Based on the results and analysis, all three methods give nearly the same results, but CART (J48 decision tree) gives good results for under- and over-predicted values, which lie between -2 and +2. The correlation between the actual and predicted values is 0.794 for CART. Cause gives a better percentage classification result than disability because it can use two classes.
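A minimal sketch of the chi-square variable-screening step in WEKA (the dataset path and the class attribute position are illustrative assumptions):

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.ChiSquaredAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ChiSquareRanking {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("test-battery.arff"); // illustrative
        data.setClassIndex(data.numAttributes() - 1);

        // Rank every attribute by its chi-squared statistic with respect
        // to the class, then inspect which variables matter.
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new ChiSquaredAttributeEval());
        selector.setSearch(new Ranker());
        selector.SelectAttributes(data);

        System.out.println(selector.toResultsString());
    }
}
```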
Abstract:
Wikipedia is a free, web-based, collaborative, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation. Because Wikipedia is free and allows everyone open access to edit articles, the quality of its articles may be affected. Since people do not all have the same level of knowledge, and different people hold different opinions about a topic, the contributions made by different authors may differ. To overcome this situation, it is very important to classify the articles so that good-quality articles can be separated from poor-quality ones, which should be removed from the database. The aim of this study is to classify Wikipedia articles into two classes, class 0 (poor quality) and class 1 (good quality), using an Adaptive Neuro-Fuzzy Inference System (ANFIS) and data mining techniques. Two ANFIS are built using the Fuzzy Logic Toolbox [1] available in Matlab. The first ANFIS is based on the rules obtained from the J48 classifier in WEKA, while the other was built using expert knowledge. The data used for this research contain 226 article records taken from the German version of Wikipedia. The dataset consists of 19 inputs and one output. The data were preprocessed to remove similar attributes. The input variables relate to the editors, contributors, article length, and article lifecycle. Finally, the different methods implemented in this research are analyzed to assess the performance of each classification method used.
Abstract:
Parkinson's disease (PD) is a degenerative illness whose cardinal symptoms include rigidity, tremor, and slowness of movement. In addition to its widely recognized effects, PD can have a profound effect on speech and voice. The speech symptoms most commonly demonstrated by patients with PD are reduced vocal loudness, monopitch, disruptions of voice quality, and an abnormally fast rate of speech; this cluster of speech symptoms is often termed hypokinetic dysarthria. The disease can be difficult to diagnose accurately, especially in its early stages; for this reason, automatic techniques based on artificial intelligence should increase diagnostic accuracy and help doctors make better decisions. The aim of this thesis work is to predict PD from audio files collected from various patients. The audio files are preprocessed to obtain the features; the preprocessed data contain 23 attributes and 195 instances, with on average six voice recordings per person. Using a data compression technique such as the Discrete Cosine Transform (DCT), the number of instances can be reduced. After data compression, attribute selection is done using several of WEKA's built-in methods, such as ChiSquared, GainRatio, and InfoGain. After identifying the important attributes, we evaluate them one by one using stepwise regression. Based on the selected attributes, we proceed in WEKA with a cost-sensitive classifier wrapping various algorithms such as MultiPass LVQ, Logistic Model Tree (LMT), and K-Star. The classification results average 80%; using these features, approximately 95% classification of PD is achieved. This shows that, using the audio dataset, PD can be predicted with a high level of accuracy.
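A hedged sketch of the cost-sensitive setup described above, wrapping WEKA's Logistic Model Tree (the cost ratio and dataset path are illustrative; the thesis does not state its cost matrix):

```java
import weka.classifiers.CostMatrix;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.classifiers.trees.LMT;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CostSensitivePd {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("parkinsons-voice.arff"); // illustrative
        data.setClassIndex(data.numAttributes() - 1);

        // Penalize missing a PD case five times as much as a false alarm;
        // the 5:1 ratio is an illustrative assumption.
        CostSensitiveClassifier csc = new CostSensitiveClassifier();
        csc.setCostMatrix(CostMatrix.parseMatlab("[0 1; 5 0]"));
        csc.setClassifier(new LMT()); // Logistic Model Tree, as in the abstract
        csc.buildClassifier(data);

        System.out.println(csc);
    }
}
```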
Predictive models for chronic renal disease using decision trees, naïve Bayes and case-based methods
Abstract:
Data mining can be used in the healthcare industry to "mine" clinical data and discover hidden information for intelligent and effective decision making. The discovery of hidden patterns and relationships often goes unexploited, yet advanced data mining techniques can serve as a remedy. This thesis mainly deals with Intelligent Prediction of Chronic Renal Disease (IPCRD). The data cover blood tests, urine tests, and external symptoms applied to predict chronic renal disease. Data from the database are first imported into Weka (3.6), and the chi-square method is used for feature selection. After normalizing the data, three classifiers were applied and the quality of the output was evaluated: Decision Tree, Naïve Bayes, and the K-Nearest Neighbour algorithm. Results show that each technique has its unique strength in realizing the objectives of the defined mining goals. The efficiency of the Decision Tree and KNN was almost the same, but Naïve Bayes proved to have a comparative edge over the others. Further, sensitivity and specificity tests are used as statistical measures to examine the performance of the binary classification: sensitivity (also called recall in some fields) measures the proportion of actual positives that are correctly identified, while specificity measures the proportion of negatives that are correctly identified. The CRISP-DM methodology is applied to build the mining models; it consists of six major phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
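A minimal sketch of how the reported sensitivity and specificity can be read off a cross-validated Naïve Bayes model in Weka's Java API (the dataset path and the positive-class index are illustrative assumptions):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RenalScreening {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("renal-disease.arff"); // illustrative
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation of Naive Bayes, then sensitivity
        // (TP rate of the positive class) and specificity (TN rate of
        // the positive class), matching the definitions in the abstract.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        int positive = 0; // assumed index of the "disease present" class
        System.out.printf("Sensitivity: %.3f%n", eval.truePositiveRate(positive));
        System.out.printf("Specificity: %.3f%n", eval.trueNegativeRate(positive));
    }
}
```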