955 resultados para Large datasets
Resumo:
Inducing rules from very large datasets is one of the most challenging areas in data mining. Several approaches exist to scaling up classification rule induction to large datasets, namely data reduction and the parallelisation of classification rule induction algorithms. In the area of parallelisation of classification rule induction algorithms most of the work has been concentrated on the Top Down Induction of Decision Trees (TDIDT), also known as the ‘divide and conquer’ approach. However powerful alternative algorithms exist that induce modular rules. Most of these alternative algorithms follow the ‘separate and conquer’ approach of inducing rules, but very little work has been done to make the ‘separate and conquer’ approach scale better on large training data. This paper examines the potential of the recently developed blackboard based J-PMCRI methodology for parallelising modular classification rule induction algorithms that follow the ‘separate and conquer’ approach. A concrete implementation of the methodology is evaluated empirically on very large datasets.
Resumo:
The Prism family of algorithms induces modular classification rules which, in contrast to decision tree induction algorithms, do not necessarily fit together into a decision tree structure. Classifiers induced by Prism algorithms achieve a comparable accuracy compared with decision trees and in some cases even outperform decision trees. Both kinds of algorithms tend to overfit on large and noisy datasets and this has led to the development of pruning methods. Pruning methods use various metrics to truncate decision trees or to eliminate whole rules or single rule terms from a Prism rule set. For decision trees many pre-pruning and postpruning methods exist, however for Prism algorithms only one pre-pruning method has been developed, J-pruning. Recent work with Prism algorithms examined J-pruning in the context of very large datasets and found that the current method does not use its full potential. This paper revisits the J-pruning method for the Prism family of algorithms and develops a new pruning method Jmax-pruning, discusses it in theoretical terms and evaluates it empirically.
Resumo:
Ensemble learning techniques generate multiple classifiers, so called base classifiers, whose combined classification results are used in order to increase the overall classification accuracy. In most ensemble classifiers the base classifiers are based on the Top Down Induction of Decision Trees (TDIDT) approach. However, an alternative approach for the induction of rule based classifiers is the Prism family of algorithms. Prism algorithms produce modular classification rules that do not necessarily fit into a decision tree structure. Prism classification rulesets achieve a comparable and sometimes higher classification accuracy compared with decision tree classifiers, if the data is noisy and large. Yet Prism still suffers from overfitting on noisy and large datasets. In practice ensemble techniques tend to reduce the overfitting, however there exists no ensemble learner for modular classification rule inducers such as the Prism family of algorithms. This article describes the first development of an ensemble learner based on the Prism family of algorithms in order to enhance Prism’s classification accuracy by reducing overfitting.
Resumo:
In the recent years, the area of data mining has been experiencing considerable demand for technologies that extract knowledge from large and complex data sources. There has been substantial commercial interest as well as active research in the area that aim to develop new and improved approaches for extracting information, relationships, and patterns from large datasets. Artificial neural networks (NNs) are popular biologically-inspired intelligent methodologies, whose classification, prediction, and pattern recognition capabilities have been utilized successfully in many areas, including science, engineering, medicine, business, banking, telecommunication, and many other fields. This paper highlights from a data mining perspective the implementation of NN, using supervised and unsupervised learning, for pattern recognition, classification, prediction, and cluster analysis, and focuses the discussion on their usage in bioinformatics and financial data analysis tasks. © 2012 Wiley Periodicals, Inc.
Resumo:
Generally classifiers tend to overfit if there is noise in the training data or there are missing values. Ensemble learning methods are often used to improve a classifier's classification accuracy. Most ensemble learning approaches aim to improve the classification accuracy of decision trees. However, alternative classifiers to decision trees exist. The recently developed Random Prism ensemble learner for classification aims to improve an alternative classification rule induction approach, the Prism family of algorithms, which addresses some of the limitations of decision trees. However, Random Prism suffers like any ensemble learner from a high computational overhead due to replication of the data and the induction of multiple base classifiers. Hence even modest sized datasets may impose a computational challenge to ensemble learners such as Random Prism. Parallelism is often used to scale up algorithms to deal with large datasets. This paper investigates parallelisation for Random Prism, implements a prototype and evaluates it empirically using a Hadoop computing cluster.
Resumo:
This chapter introduces the latest practices and technologies in the interactive interpretation of environmental data. With environmental data becoming ever larger, more diverse and more complex, there is a need for a new generation of tools that provides new capabilities over and above those of the standard workhorses of science. These new tools aid the scientist in discovering interesting new features (and also problems) in large datasets by allowing the data to be explored interactively using simple, intuitive graphical tools. In this way, new discoveries are made that are commonly missed by automated batch data processing. This chapter discusses the characteristics of environmental science data, common current practice in data analysis and the supporting tools and infrastructure. New approaches are introduced and illustrated from the points of view of both the end user and the underlying technology. We conclude by speculating as to future developments in the field and what must be achieved to fulfil this vision.
Resumo:
n the past decade, the analysis of data has faced the challenge of dealing with very large and complex datasets and the real-time generation of data. Technologies to store and access these complex and large datasets are in place. However, robust and scalable analysis technologies are needed to extract meaningful information from these datasets. The research field of Information Visualization and Visual Data Analytics addresses this need. Information visualization and data mining are often used complementary to each other. Their common goal is the extraction of meaningful information from complex and possibly large data. However, though data mining focuses on the usage of silicon hardware, visualization techniques also aim to access the powerful image-processing capabilities of the human brain. This article highlights the research on data visualization and visual analytics techniques. Furthermore, we highlight existing visual analytics techniques, systems, and applications including a perspective on the field from the chemical process industry.
Resumo:
We propose first, a simple task for the eliciting attitudes toward risky choice, the SGG lottery-panel task, which consists in a series of lotteries constructed to compensate riskier options with higher risk-return trade-offs. Using Principal Component Analysis technique, we show that the SGG lottery-panel task is capable of capturing two dimensions of individual risky decision making i.e. subjects’ average risk taking and their sensitivity towards variations in risk-return. From the results of a large experimental dataset, we confirm that the task systematically captures a number of regularities such as: A tendency to risk averse behavior (only around 10% of choices are compatible with risk neutrality); An attraction to certain payoffs compared to low risk lotteries, compatible with over-(under-) weighting of small (large) probabilities predicted in PT and; Gender differences, i.e. males being consistently less risk averse than females but both genders being similarly responsive to the increases in risk-premium. Another interesting result is that in hypothetical choices most individuals increase their risk taking responding to the increase in return to risk, as predicted by PT, while across panels with real rewards we see even more changes, but opposite to the expected pattern of riskier choices for higher risk-returns. Therefore, we conclude from our data that an “economic anomaly” emerges in the real reward choices opposite to the hypothetical choices. These findings are in line with Camerer's (1995) view that although in many domains, paid subjects probably do exert extra mental effort which improves their performance, choice over money gambles is not likely to be a domain in which effort will improve adherence to rational axioms (p. 635). Finally, we demonstrate that both dimensions of risk attitudes, average risk taking and sensitivity towards variations in the return to risk, are desirable not only to describe behavior under risk but also to explain behavior in other contexts, as illustrated by an example. In the second study, we propose three additional treatments intended to elicit risk attitudes under high stakes and mixed outcome (gains and losses) lotteries. Using a dataset obtained from a hypothetical implementation of the tasks we show that the new treatments are able to capture both dimensions of risk attitudes. This new dataset allows us to describe several regularities, both at the aggregate and within-subjects level. We find that in every treatment over 70% of choices show some degree of risk aversion and only between 0.6% and 15.3% of individuals are consistently risk neutral within the same treatment. We also confirm the existence of gender differences in the degree of risk taking, that is, in all treatments females prefer safer lotteries compared to males. Regarding our second dimension of risk attitudes we observe, in all treatments, an increase in risk taking in response to risk premium increases. Treatment comparisons reveal other regularities, such as a lower degree of risk taking in large stake treatments compared to low stake treatments and a lower degree of risk taking when losses are incorporated into the large stake lotteries. Results that are compatible with previous findings in the literature, for stake size effects (e.g., Binswanger, 1980; Antoni Bosch-Domènech & Silvestre, 1999; Hogarth & Einhorn, 1990; Holt & Laury, 2002; Kachelmeier & Shehata, 1992; Kühberger et al., 1999; B. J. Weber & Chapman, 2005; Wik et al., 2007) and domain effect (e.g., Brooks and Zank, 2005, Schoemaker, 1990, Wik et al., 2007). Whereas for small stake treatments, we find that the effect of incorporating losses into the outcomes is not so clear. At the aggregate level an increase in risk taking is observed, but also more dispersion in the choices, whilst at the within-subjects level the effect weakens. Finally, regarding responses to risk premium, we find that compared to only gains treatments sensitivity is lower in the mixed lotteries treatments (SL and LL). In general sensitivity to risk-return is more affected by the domain than the stake size. After having described the properties of risk attitudes as captured by the SGG risk elicitation task and its three new versions, it is important to recall that the danger of using unidimensional descriptions of risk attitudes goes beyond the incompatibility with modern economic theories like PT, CPT etc., all of which call for tests with multiple degrees of freedom. Being faithful to this recommendation, the contribution of this essay is an empirically and endogenously determined bi-dimensional specification of risk attitudes, useful to describe behavior under uncertainty and to explain behavior in other contexts. Hopefully, this will contribute to create large datasets containing a multidimensional description of individual risk attitudes, while at the same time allowing for a robust context, compatible with present and even future more complex descriptions of human attitudes towards risk.
Resumo:
This paper presents a novel approach to the automatic classification of very large data sets composed of terahertz pulse transient signals, highlighting their potential use in biochemical, biomedical, pharmaceutical and security applications. Two different types of THz spectra are considered in the classification process. Firstly a binary classification study of poly-A and poly-C ribonucleic acid samples is performed. This is then contrasted with a difficult multi-class classification problem of spectra from six different powder samples that although have fairly indistinguishable features in the optical spectrum, they also possess a few discernable spectral features in the terahertz part of the spectrum. Classification is performed using a complex-valued extreme learning machine algorithm that takes into account features in both the amplitude as well as the phase of the recorded spectra. Classification speed and accuracy are contrasted with that achieved using a support vector machine classifier. The study systematically compares the classifier performance achieved after adopting different Gaussian kernels when separating amplitude and phase signatures. The two signatures are presented as feature vectors for both training and testing purposes. The study confirms the utility of complex-valued extreme learning machine algorithms for classification of the very large data sets generated with current terahertz imaging spectrometers. The classifier can take into consideration heterogeneous layers within an object as would be required within a tomographic setting and is sufficiently robust to detect patterns hidden inside noisy terahertz data sets. The proposed study opens up the opportunity for the establishment of complex-valued extreme learning machine algorithms as new chemometric tools that will assist the wider proliferation of terahertz sensing technology for chemical sensing, quality control, security screening and clinic diagnosis. Furthermore, the proposed algorithm should also be very useful in other applications requiring the classification of very large datasets.
Resumo:
To effectively prevent the onset and reduce mortality from noncommunicable diseases, we must consider every individual as metabolically unique to allow for a personalized management to take place. Diet and gut microbiota are major components of the exposome that interact together with a genetic make-up in a complex interplay to result in an individual’s metabolic phenotype. In this context, foodomics approaches (such as nutrigenetics, nutrimetabolomics, nutritranscriptomics, nutriproteomics and metagenomics) are essential tools to assess an individual’s optimal metabolic space. These have recently been applied to large human cohorts to identify specific gene-metabolite, diet-metabolite and gene–diet interactions. As the gut microbiota is a key player in metabolic homeostasis, we suggest following a holistic investigation of metagenome–hyperbolome–diet interactions, the findings of which will provide the basis for developing personalized nutrition and personalized functional foods. However, examining these three-way interactions will only be possible when the challenge of large datasets integration will be overcome.
Resumo:
Background qtl.outbred is an extendible interface in the statistical environment, R, for combining quantitative trait loci (QTL) mapping tools. It is built as an umbrella package that enables outbred genotype probabilities to be calculated and/or imported into the software package R/qtl. Findings Using qtl.outbred, the genotype probabilities from outbred line cross data can be calculated by interfacing with a new and efficient algorithm developed for analyzing arbitrarily large datasets (included in the package) or imported from other sources such as the web-based tool, GridQTL. Conclusion qtl.outbred will improve the speed for calculating probabilities and the ability to analyse large future datasets. This package enables the user to analyse outbred line cross data accurately, but with similar effort than inbred line cross data.
Resumo:
MAPfastR is a software package developed to analyze QTL data from inbred and outbred line-crosses. The package includes a number of modules for fast and accurate QTL analyses. It has been developed in the R language for fast and comprehensive analyses of large datasets. MAPfastR is freely available at: http://www.computationalgenetics.se/?page_id=7.
Resumo:
A visualização de conjuntos de dados volumétricos é comum em diversas áreas de aplicação e há já alguns anos os diversos aspectos envolvidos nessas técnicas vêm sendo pesquisados. No entanto, apesar dos avanços das técnicas de visualização de volumes, a interação com grandes volumes de dados ainda apresenta desafios devido a questões de percepção (ou isolamento) de estruturas internas e desempenho computacional. O suporte do hardware gráfico para visualização baseada em texturas permite o desenvolvimento de técnicas eficientes de rendering que podem ser combinadas com ferramentas de recorte interativas para possibilitar a inspeção de conjuntos de dados tridimensionais. Muitos estudos abordam a otimização do desempenho de ferramentas de recorte, mas muito poucos tratam das metáforas de interação utilizadas por essas ferramentas. O objetivo deste trabalho é desenvolver ferramentas interativas, intuitivas e fáceis de usar para o recorte de imagens volumétricas. Inicialmente, é apresentado um estudo sobre as principais técnicas de visualização direta de volumes e como é feita a exploração desses volumes utilizando-se recorte volumétrico. Nesse estudo é identificada a solução que melhor se enquadra no presente trabalho para garantir a interatividade necessária. Após, são apresentadas diversas técnicas de interação existentes, suas metáforas e taxonomias, para determinar as possíveis técnicas de interação mais fáceis de serem utilizadas por ferramentas de recorte. A partir desse embasamento, este trabalho apresenta o desenvolvimento de três ferramentas de recorte genéricas implementadas usando-se duas metáforas de interação distintas que são freqüentemente utilizadas por usuários de aplicativos 3D: apontador virtual e mão virtual. A taxa de interação dessas ferramentas é obtida através de programas de fragmentos especiais executados diretamente no hardware gráfico. Estes programas especificam regiões dentro do volume a serem descartadas durante o rendering, com base em predicados geométricos. Primeiramente, o desempenho, precisão e preferência (por parte dos usuários) das ferramentas de recorte volumétrico são avaliados para comparar as metáforas de interação empregadas. Após, é avaliada a interação utilizando-se diferentes dispositivos de entrada para a manipulação do volume e ferramentas. A utilização das duas mãos ao mesmo tempo para essa manipulação também é testada. Os resultados destes experimentos de avaliação são apresentados e discutidos.
Resumo:
Bayesian networks are powerful tools as they represent probability distributions as graphs. They work with uncertainties of real systems. Since last decade there is a special interest in learning network structures from data. However learning the best network structure is a NP-Hard problem, so many heuristics algorithms to generate network structures from data were created. Many of these algorithms use score metrics to generate the network model. This thesis compare three of most used score metrics. The K-2 algorithm and two pattern benchmarks, ASIA and ALARM, were used to carry out the comparison. Results show that score metrics with hyperparameters that strength the tendency to select simpler network structures are better than score metrics with weaker tendency to select simpler network structures for both metrics (Heckerman-Geiger and modified MDL). Heckerman-Geiger Bayesian score metric works better than MDL with large datasets and MDL works better than Heckerman-Geiger with small datasets. The modified MDL gives similar results to Heckerman-Geiger for large datasets and close results to MDL for small datasets with stronger tendency to select simpler network structures
Resumo:
In this paper we would like to shed light the problem of efficiency and effectiveness of image classification in large datasets. As the amount of data to be processed and further classified has increased in the last years, there is a need for faster and more precise pattern recognition algorithms in order to perform online and offline training and classification procedures. We deal here with the problem of moist area classification in radar image in a fast manner. Experimental results using Optimum-Path Forest and its training set pruning algorithm also provided and discussed. © 2011 IEEE.