881 results for mining algorithm


Relevance:

30.00%

Publisher:

Abstract:

The aim of this study was to group, by similarity, temporal profiles of the 10-day composite NDVI product obtained by the SPOT Vegetation sensor for municipalities with high soybean production in the state of Paraná, Brazil, in the 2005/2006 cropping season. Data mining is a valuable tool that allows knowledge to be extracted from a database by identifying valid, new, potentially useful and understandable patterns. Clusters were therefore generated with the K-Means, MAXVER and DBSCAN algorithms, as implemented in the WEKA software package. The clusters were based on the average temporal NDVI profiles of the 277 municipalities with high soybean production in the state. The best results were found with the K-Means algorithm, which grouped the municipalities into six clusters over the period from the beginning of October until the end of March, which corresponds to the crop vegetative cycle. Half of the generated clusters presented a spectro-temporal pattern characteristic of soybeans and lay mostly within the soybean belt of the state of Paraná, which indicates that the proposed methodology performs well at identifying homogeneous areas. These results will be useful for creating regional soybean "masks" to estimate the planted area for this crop.
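The grouping step described in this abstract can be illustrated with a minimal sketch of Lloyd's K-Means on short time series. It is not the WEKA implementation used in the study; the function names, the three-value profiles and the choice of initial centroids are assumptions made only for illustration (a real October to March profile of 10-day composites would be longer).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, initial_centroids, max_iterations=100):
    """Plain Lloyd's K-Means; returns (labels, centroids)."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each profile goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, label in zip(profiles, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical NDVI profiles: two flat ones and two with a mid-season peak.
profiles = [
    [0.20, 0.30, 0.20],
    [0.25, 0.30, 0.20],
    [0.20, 0.80, 0.30],
    [0.20, 0.85, 0.30],
]
labels, centroids = kmeans(profiles, [profiles[0], profiles[2]])
```

Passing the initial centroids explicitly keeps the sketch deterministic; practical implementations usually pick them at random or with a seeding heuristic.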

Relevance:

30.00%

Publisher:

Abstract:

This study aimed to identify differences in swine vocalization patterns according to animal gender and different stress conditions. A total of 150 barrows and 150 females (Dalland® genetic strain), aged 100 days, were used in the experiment. Pigs were exposed to different stressful situations: thirst (no access to water), hunger (no access to food), and thermal stress (THI exceeding 74). For the control treatment, animals were kept in a comfort situation (full access to food and water, with environmental THI lower than 70). Acoustic signals were recorded every 30 minutes, totaling six samples for each stress situation. Afterwards, the audio recordings were analyzed with Praat® 5.1.19 software, generating a sound spectrum. To determine stress conditions, data were processed with WEKA® 3.5 software, using the C4.5 decision tree algorithm, known as J48 in the software environment, and evaluated by 10-fold cross-validation (each fold holding out 10% of the samples). According to the decision tree, the most important acoustic attribute for classifying stress conditions was sound intensity (root node). It was not possible to identify animal gender from the vocal register using the tested attributes. A decision tree was generated for recognizing swine hunger, thirst, and heat stress from records of sound intensity, pitch frequency, and Formant 1.
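The 10-fold cross-validation mentioned in this abstract can be sketched as follows. This is a plain (non-stratified) split written for illustration, not WEKA's own procedure; the function name, the seed and the sample count of 300 are assumptions.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# With 300 samples and k=10, every round trains on 270 and tests on 30.
for train, test in k_fold_splits(300, k=10):
    assert len(train) == 270 and len(test) == 30
```

Each sample is held out exactly once, so the reported accuracy is an average over ten models, each tested on data it did not see during training.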

Relevance:

30.00%

Publisher:

Abstract:

In the current study, an epidemiological study is carried out by means of a literature survey, both in groups identified as having a higher potential for drug-drug interactions (DDIs) and in other cases, to explore patterns of DDIs and the factors affecting them. The structure of the FDA Adverse Event Reporting System (FAERS) database is studied and analyzed in detail to identify issues and challenges in mining drug-drug interactions. The necessary pre-processing algorithms are developed based on this analysis, and the Apriori algorithm is modified to suit the process. Finally, the modules are integrated into a tool to identify DDIs. The results are validated by comparison with a standard drug interaction database. Of the associations obtained, 31% were identified as new and 69% matched existing interactions. This match clearly indicates the validity of the methodology and its applicability to similar databases. Formulating the results using generic names expanded their relevance to a global scale. This global applicability helps health care professionals worldwide to exercise caution during the various stages of drug administration, thus considerably enhancing pharmacovigilance.
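For readers unfamiliar with Apriori, a minimal sketch of the standard, unmodified algorithm follows. The abstract does not describe the authors' modification, so none is attempted here; the function name, the drug names and the reports are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for all itemsets with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into candidates of size k.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical adverse-event reports, each listing the drugs taken together.
reports = [
    {"drug_a", "drug_b", "drug_c"},
    {"drug_a", "drug_b"},
    {"drug_a", "drug_c"},
    {"drug_b", "drug_c"},
    {"drug_a", "drug_b", "drug_c"},
]
frequent = apriori(reports, min_support=3)
```

In this toy data every pair of drugs is reported together three times, while the triple appears only twice and is therefore dropped.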

Relevance:

30.00%

Publisher:

Abstract:

Decision trees are very powerful tools for classification in data mining tasks that involve different types of attributes. Numeric data sets are usually first converted to categorical types and then classified using information gain. Information gain is a very popular and useful concept that indicates whether splitting on a given attribute yields any benefit in terms of information content. However, this process is computationally intensive for large data sets, and popular decision tree algorithms such as ID3 cannot handle numeric data sets. This paper proposes statistical variance as an alternative to information gain, together with the statistical mean, to split attributes in completely numerical data sets. The new algorithm has been shown to be competitive with its information-gain counterpart C4.5 and with many existing decision tree algorithms on the standard UCI benchmark datasets, as assessed with the ANOVA test. The specific advantages of the proposed algorithm are that it avoids the computational overhead of computing information gain for large data sets with many attributes, and that it avoids converting huge numeric data sets to categorical data, which is also a time-consuming task. In summary, huge numeric datasets can be submitted directly to this algorithm without any attribute mappings or information gain computations. The approach also blends the two closely related fields of statistics and data mining.
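The information gain criterion that this abstract takes as its baseline can be sketched as below. The sketch covers only the standard entropy-based measure for a categorical attribute, not the variance-based alternative the paper proposes, whose details the abstract does not give; the function names and toy data are assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on a categorical attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the partitions created by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


labels = ["yes", "yes", "no", "no"]
separating = ["hot", "hot", "cold", "cold"]    # splits the classes perfectly
uninformative = ["hot", "cold", "hot", "cold"]  # tells us nothing about the class
gain_separating = information_gain(separating, labels)
gain_uninformative = information_gain(uninformative, labels)
```

Computing this measure for every attribute at every node is the overhead the abstract refers to, and a numeric attribute must first be discretized before it can be used this way.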

Relevance:

30.00%

Publisher:

Abstract:

Data mining means summarizing information from large amounts of raw data. It is one of the key technologies in many areas of the economy, science, administration and the internet. In this report we introduce an approach that uses evolutionary algorithms to breed fuzzy classifier systems. The approach was applied as part of a structured procedure by the students Achler, Göb and Voigtmann as a contribution to the 2006 Data-Mining-Cup contest, yielding encouragingly positive results.

Relevance:

30.00%

Publisher:

Abstract:

We present a new algorithm called TITANIC for computing concept lattices. It is based on data mining techniques for computing frequent itemsets. The algorithm is experimentally evaluated and compared with B. Ganter's Next-Closure algorithm.

Relevance:

30.00%

Publisher:

Abstract:

Formal Concept Analysis is an unsupervised learning technique for conceptual clustering. We introduce the notion of iceberg concept lattices and show their use in Knowledge Discovery in Databases (KDD). Iceberg lattices are designed for analyzing very large databases. In particular they serve as a condensed representation of frequent patterns as known from association rule mining. In order to show the interplay between Formal Concept Analysis and association rule mining, we discuss the algorithm TITANIC. We show that iceberg concept lattices are a starting point for computing condensed sets of association rules without loss of information, and are a visualization method for the resulting rules.
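The link between frequent itemsets and concept intents can be illustrated with a brute-force sketch. It enumerates every itemset, so it is not TITANIC and would not scale to large databases; the function name and the toy transactions are assumptions. The closure of an itemset is taken as the intersection of all transactions that contain it, and the frequent closed itemsets are the intents of the iceberg lattice.

```python
from itertools import combinations


def iceberg_intents(transactions, min_support):
    """Return {closed_itemset: support} for closed itemsets with support >= min_support."""
    if min_support < 1:
        raise ValueError("min_support must be at least 1")
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    intents = {}
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            covering = [t for t in transactions if itemset <= t]
            if len(covering) >= min_support:
                # Closure: everything shared by all transactions containing the itemset.
                closed = frozenset.intersection(*covering)
                intents[closed] = len(covering)
    return intents


baskets = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}]
intents = iceberg_intents(baskets, min_support=2)
```

Here the five frequent itemsets (the empty set, {a}, {b}, {c}, {a, b}, {a, c} minus duplicates after closure) collapse to three closed ones, which is the condensation effect the abstract describes.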

Relevance:

30.00%

Publisher:

Abstract:

Association rules are a popular knowledge discovery technique for warehouse basket analysis. They indicate which items of the warehouse are frequently bought together. The problem of association rule mining was first stated in 1993. Five years later, several research groups discovered that this problem has a strong connection to Formal Concept Analysis (FCA). In this survey, we first introduce some basic ideas of this connection using a specific algorithm, TITANIC, and show how FCA helps to reduce the number of resulting rules without loss of information, before giving a general overview of the history and state of the art of applying FCA to association rule mining.

Relevance:

30.00%

Publisher:

Abstract:

In molecular biology, it is often desirable to find common properties in large numbers of drug candidates. One family of methods stems from the data mining community, where algorithms to find frequent graphs have received increasing attention over the past years. However, the computational complexity of the underlying problem and the large amount of data to be explored essentially render sequential algorithms useless. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. This problem is characterized by a highly irregular search tree, whereby no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely, a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute’s HIV-screening data set, where we were able to show close-to-linear speedup in a network of workstations. The proposed approach also allows for dynamic resource aggregation in a non-dedicated computational environment. These features make it suitable for large-scale, multi-domain, heterogeneous environments, such as computational grids.

Relevance:

30.00%

Publisher:

Abstract:

Structured data represented in the form of graphs arises in several fields of science, and the growing amount of available data makes distributed graph mining techniques particularly relevant. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. The problem is characterized by a highly irregular search tree, whereby no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute’s HIV-screening dataset, where the approach attains close-to-linear speedup in a network of workstations.

Relevance:

30.00%

Publisher:

Abstract:

Recently, major processor manufacturers have announced a dramatic shift in their paradigm for increasing computing power over the coming years. Instead of focusing on faster clock speeds and more powerful single-core CPUs, the trend clearly goes towards multi-core systems. This will also result in a paradigm shift for the development of algorithms for computationally expensive tasks, such as data mining applications. Obviously, work on parallel algorithms is not new per se, but concentrated efforts in the many application domains are still missing. Multi-core systems, but also clusters of workstations and even large-scale distributed computing infrastructures, provide new opportunities and pose new challenges for the design of parallel and distributed algorithms. Since data mining and machine learning systems rely on high-performance computing systems, research on the corresponding algorithms must be at the forefront of parallel algorithm research in order to keep pushing data mining and machine learning applications to be more powerful and, especially for the former, interactive. To bring together researchers and practitioners working in this exciting field, a workshop on parallel data mining was organized as part of PKDD/ECML 2006 (Berlin, Germany). The six contributions selected for the program describe various aspects of data mining and machine learning approaches featuring low to high degrees of parallelism. The first contribution addresses the classic problem of distributed association rule mining and focuses on communication efficiency to improve the state of the art. After this, a parallelization technique for speeding up decision tree construction by means of thread-level parallelism for shared-memory systems is presented. The next paper discusses the design of a parallel approach to the frequent subgraph mining problem for distributed-memory systems. This approach is based on a hierarchical communication topology to solve issues related to multi-domain computational environments. The fourth paper describes the combined use and customization of software packages to facilitate top-down parallelism in the tuning of Support Vector Machines (SVMs), and the next contribution presents an interesting idea concerning parallel training of Conditional Random Fields (CRFs) and motivates their use in labeling sequential data. The last contribution focuses on very efficient feature selection: it describes a parallel algorithm for feature selection from random subsets. Selecting the papers included in this volume would not have been possible without the help of an international Program Committee that provided detailed reviews for each paper. We would also like to thank Matthew Otey, who helped with publicity for the workshop.

Relevance:

30.00%

Publisher:

Abstract:

Advances in hardware and software in the past decade make it possible to capture, record and process fast data streams at a large scale. The research area of data stream mining has emerged as a consequence of these advances in order to cope with the real-time analysis of potentially large and changing data streams. Examples of data streams include Google searches, credit card transactions, telemetric data and data from continuous chemical production processes. In some cases the data can be processed in batches by traditional data mining approaches. However, some applications require the data to be analysed in real time, as soon as it is captured, for example when the data stream is infinite, fast changing, or simply too large to be stored. One of the most important data mining techniques on data streams is classification. This involves training the classifier on the data stream in real time and adapting it to concept drift. Most data stream classifiers are based on decision trees. However, it is well known in the data mining community that there is no single optimal algorithm: an algorithm may work well on one or several datasets but badly on others. This paper introduces eRules, a new rule-based adaptive classifier for data streams, based on an evolving set of rules. eRules induces a set of rules that is constantly evaluated and adapted to changes in the data stream by adding new rules and removing old ones. It differs from the more popular decision-tree-based classifiers in that it tends to leave data instances unclassified rather than force a classification that could be wrong. The ongoing development of eRules aims to improve its accuracy further through dynamic parameter setting, which will also address the problem of changing feature domain values.
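The idea of abstaining instead of forcing a classification can be shown with a very small sketch. This is not eRules itself and includes no rule induction or adaptation; the rule representation, the function name and the example rules are hypothetical.

```python
def classify(instance, rules):
    """Return the label of the first rule that covers the instance, or None to abstain."""
    for conditions, label in rules:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return label
    return None  # no rule fires: leave the instance unclassified


# Hypothetical rule set: each rule is (attribute conditions, predicted label).
rules = [
    ({"amount": "high", "country": "foreign"}, "fraud"),
    ({"amount": "low"}, "legitimate"),
]

covered = classify({"amount": "low", "country": "home"}, rules)
uncovered = classify({"amount": "high", "country": "home"}, rules)
```

A decision tree would route the second instance to some leaf and always return a label, whereas a rule set can simply report that none of its rules applies.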

Relevance:

30.00%

Publisher:

Abstract:

Exascale systems are the next frontier in high-performance computing and are expected to deliver a performance of the order of 10^18 operations per second using massive multicore processors. Very large- and extreme-scale parallel systems pose critical algorithmic challenges, especially related to concurrency, locality and the need to avoid global communication patterns. This work investigates a novel protocol for dynamic group communication that can be used to remove the global communication requirement and to reduce the communication cost in parallel formulations of iterative data mining algorithms. The protocol is used to provide a communication-efficient parallel formulation of the k-means algorithm for cluster analysis. The approach is based on a collective communication operation for dynamic groups of processes and exploits non-uniform data distributions. Non-uniform data distributions can be either found in real-world distributed applications or induced by means of multidimensional binary search trees. The analysis of the proposed dynamic group communication protocol has shown that it does not introduce significant communication overhead. The parallel clustering algorithm has also been extended to accommodate an approximation error, which allows a further reduction of the communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing elements.

Relevance:

30.00%

Publisher:

Abstract:

The amount of textual information stored digitally is growing every day. However, our capacity to process and analyze that information is not growing at the same pace. To overcome this limitation, it is important to develop semiautomatic processes, such as text mining, to extract relevant knowledge from textual information. One of the main and most expensive stages of the text mining process is text pre-processing, in which the unstructured text is transformed into a structured format such as an attribute-value table. Stemming, i.e. linguistic normalization, is usually used to find the attributes of this table. However, stemming depends strongly on the language of the original text. Furthermore, for most languages, the stemming algorithms proposed in the literature are computationally expensive. In this work, several improvements to the well-known Porter stemming algorithm for the Portuguese language, which exploit the characteristics of this language, are proposed. Experimental results show that the proposed algorithm executes in far less time without affecting the quality of the generated stems.

Relevance:

30.00%

Publisher:

Abstract:

Motivated by the development of a graphical representation of networks with a large number of vertices, useful for collaborative filtering applications, this work proposes the use of cohesion surfaces over a multidimensionally scaled thematic base. To this end, it uses a combination of classical multidimensional scaling and Procrustes analysis in an iterative algorithm that produces partial solutions, which are then combined into a global solution. Applied to an example of book-lending transactions from the Karl A. Boedecker Library, the proposed algorithm produces interpretable and thematically coherent outputs and exhibits lower stress than the solution obtained by classical scaling.