907 resultados para computationally efficient algorithm


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Top Down Induction of Decision Trees (TDIDT) is the most commonly used method of constructing a model from a dataset in the form of classification rules to classify previously unseen data. Alternative algorithms have been developed such as the Prism algorithm. Prism constructs modular rules which produce qualitatively better rules than rules induced by TDIDT. However, along with the increasing size of databases, many existing rule learning algorithms have proved to be computational expensive on large datasets. To tackle the problem of scalability, parallel classification rule induction algorithms have been introduced. As TDIDT is the most popular classifier, even though there are strongly competitive alternative algorithms, most parallel approaches to inducing classification rules are based on TDIDT. In this paper we describe work on a distributed classifier that induces classification rules in a parallel manner based on Prism.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Induction of classification rules is one of the most important technologies in data mining. Most of the work in this field has concentrated on the Top Down Induction of Decision Trees (TDIDT) approach. However, alternative approaches have been developed such as the Prism algorithm for inducing modular rules. Prism often produces qualitatively better rules than TDIDT but suffers from higher computational requirements. We investigate approaches that have been developed to minimize the computational requirements of TDIDT, in order to find analogous approaches that could reduce the computational requirements of Prism.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A piecewise uniform fitted mesh method turns out to be sufficient for the solution of a surprisingly wide variety of singularly perturbed problems involving steep gradients. The technique is applied to a model of adsorption in bidisperse solids for which two fitted mesh techniques, a fitted-mesh finite difference method (FMFDM) and fitted mesh collocation method (FMCM) are presented. A combination (FMCMD) of FMCM and the DASSL integration package is found to be most effective in solving the problems. Numerical solutions (FMFDM and FMCMD) were found to match the analytical solution when the adsorption isotherm is linear, even under conditions involving steep gradients for which global collocation fails. In particular, FMCMD is highly efficient for macropore diffusion control or micropore diffusion control. These techniques are simple and there is no limit on the range of the parameters. The techniques can be applied to a variety of adsorption and desorption problems in bidisperse solids with non-linear isotherm and for arbitrary particle geometry.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Background: Research in epistasis or gene-gene interaction detection for human complex traits has grown over the last few years. It has been marked by promising methodological developments, improved translation efforts of statistical epistasis to biological epistasis and attempts to integrate different omics information sources into the epistasis screening to enhance power. The quest for gene-gene interactions poses severe multiple-testing problems. In this context, the maxT algorithm is one technique to control the false-positive rate. However, the memory needed by this algorithm rises linearly with the amount of hypothesis tests. Gene-gene interaction studies will require a memory proportional to the squared number of SNPs. A genome-wide epistasis search would therefore require terabytes of memory. Hence, cache problems are likely to occur, increasing the computation time. In this work we present a new version of maxT, requiring an amount of memory independent from the number of genetic effects to be investigated. This algorithm was implemented in C++ in our epistasis screening software MBMDR-3.0.3. We evaluate the new implementation in terms of memory efficiency and speed using simulated data. The software is illustrated on real-life data for Crohn’s disease. Results: In the case of a binary (affected/unaffected) trait, the parallel workflow of MBMDR-3.0.3 analyzes all gene-gene interactions with a dataset of 100,000 SNPs typed on 1000 individuals within 4 days and 9 hours, using 999 permutations of the trait to assess statistical significance, on a cluster composed of 10 blades, containing each four Quad-Core AMD Opteron(tm) Processor 2352 2.1 GHz. In the case of a continuous trait, a similar run takes 9 days. Our program found 14 SNP-SNP interactions with a multiple-testing corrected p-value of less than 0.05 on real-life Crohn’s disease (CD) data. Conclusions: Our software is the first implementation of the MB-MDR methodology able to solve large-scale SNP-SNP interactions problems within a few days, without using much memory, while adequately controlling the type I error rates. A new implementation to reach genome-wide epistasis screening is under construction. In the context of Crohn’s disease, MBMDR-3.0.3 could identify epistasis involving regions that are well known in the field and could be explained from a biological point of view. This demonstrates the power of our software to find relevant phenotype-genotype higher-order associations.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In this paper, a new two-dimensional shear deformable beam element based on the absolute nodal coordinate formulation is proposed. The nonlinear elastic forces of the beam element are obtained using a continuum mechanics approach without employing a local element coordinate system. In this study, linear polynomials are used to interpolate both the transverse and longitudinal components of the displacement. This is different from other absolute nodal-coordinate-based beam elements where cubic polynomials are used in the longitudinal direction. The accompanying defects of the phenomenon known as shear locking are avoided through the adoption of selective integration within the numerical integration method. The proposed element is verified using several numerical examples, and the results are compared to analytical solutions and the results for an existing shear deformable beam element. It is shown that by using the proposed element, accurate linear and nonlinear static deformations, as well as realistic dynamic behavior, can be achieved with a smaller computational effort than by using existing shear deformable two-dimensional beam elements.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

We propose a novel, simple, efficient and distribution-free re-sampling technique for developing prediction intervals for returns and volatilities following ARCH/GARCH models. In particular, our key idea is to employ a Box–Jenkins linear representation of an ARCH/GARCH equation and then to adapt a sieve bootstrap procedure to the nonlinear GARCH framework. Our simulation studies indicate that the new re-sampling method provides sharp and well calibrated prediction intervals for both returns and volatilities while reducing computational costs by up to 100 times, compared to other available re-sampling techniques for ARCH/GARCH models. The proposed procedure is illustrated by an application to Yen/U.S. dollar daily exchange rate data.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A multiple factor parametrization is described to permit the efficient calculation of collision efficiency (E) between electrically charged aerosol particles and neutral cloud droplets in numerical models of cloud and climate. The four-parameter representation summarizes the results obtained from a detailed microphysical model of E, which accounts for the different forces acting on the aerosol in the path of falling cloud droplets. The parametrization's range of validity is for aerosol particle radii of 0.4 to 10 mu m, aerosol particle densities of I to 2.0 g cm(-3), aerosol particle charges from neutral to 100 elementary charges and drop radii from 18.55 to 142 mu m. The parametrization yields values of E well within an order of magnitude of the detailed model's values, from a dataset of 3978 E values. Of these values 95% have modelled to parametrized ratios between 0.5 and 1.5 for aerosol particle sizes ranging between 0.4 and 2.0 mu m, and about 96% in the second size range. This parametrization speeds up the calculation of E by a factor of similar to 10(3) compared with the original microphysical model, permitting the inclusion of electric charge effects in numerical cloud and climate models.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In order to gain knowledge from large databases, scalable data mining technologies are needed. Data are captured on a large scale and thus databases are increasing at a fast pace. This leads to the utilisation of parallel computing technologies in order to cope with large amounts of data. In the area of classification rule induction, parallelisation of classification rules has focused on the divide and conquer approach, also known as the Top Down Induction of Decision Trees (TDIDT). An alternative approach to classification rule induction is separate and conquer which has only recently been in the focus of parallelisation. This work introduces and evaluates empirically a framework for the parallel induction of classification rules, generated by members of the Prism family of algorithms. All members of the Prism family of algorithms follow the separate and conquer approach.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Generally classifiers tend to overfit if there is noise in the training data or there are missing values. Ensemble learning methods are often used to improve a classifier's classification accuracy. Most ensemble learning approaches aim to improve the classification accuracy of decision trees. However, alternative classifiers to decision trees exist. The recently developed Random Prism ensemble learner for classification aims to improve an alternative classification rule induction approach, the Prism family of algorithms, which addresses some of the limitations of decision trees. However, Random Prism suffers like any ensemble learner from a high computational overhead due to replication of the data and the induction of multiple base classifiers. Hence even modest sized datasets may impose a computational challenge to ensemble learners such as Random Prism. Parallelism is often used to scale up algorithms to deal with large datasets. This paper investigates parallelisation for Random Prism, implements a prototype and evaluates it empirically using a Hadoop computing cluster.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Advances in hardware and software technologies allow to capture streaming data. The area of Data Stream Mining (DSM) is concerned with the analysis of these vast amounts of data as it is generated in real-time. Data stream classification is one of the most important DSM techniques allowing to classify previously unseen data instances. Different to traditional classifiers for static data, data stream classifiers need to adapt to concept changes (concept drift) in the stream in real-time in order to reflect the most recent concept in the data as accurately as possible. A recent addition to the data stream classifier toolbox is eRules which induces and updates a set of expressive rules that can easily be interpreted by humans. However, like most rule-based data stream classifiers, eRules exhibits a poor computational performance when confronted with continuous attributes. In this work, we propose an approach to deal with continuous data effectively and accurately in rule-based classifiers by using the Gaussian distribution as heuristic for building rule terms on continuous attributes. We show on the example of eRules that incorporating our method for continuous attributes indeed speeds up the real-time rule induction process while maintaining a similar level of accuracy compared with the original eRules classifier. We termed this new version of eRules with our approach G-eRules.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Advances in hardware technologies allow to capture and process data in real-time and the resulting high throughput data streams require novel data mining approaches. The research area of Data Stream Mining (DSM) is developing data mining algorithms that allow us to analyse these continuous streams of data in real-time. The creation and real-time adaption of classification models from data streams is one of the most challenging DSM tasks. Current classifiers for streaming data address this problem by using incremental learning algorithms. However, even so these algorithms are fast, they are challenged by high velocity data streams, where data instances are incoming at a fast rate. This is problematic if the applications desire that there is no or only a very little delay between changes in the patterns of the stream and absorption of these patterns by the classifier. Problems of scalability to Big Data of traditional data mining algorithms for static (non streaming) datasets have been addressed through the development of parallel classifiers. However, there is very little work on the parallelisation of data stream classification techniques. In this paper we investigate K-Nearest Neighbours (KNN) as the basis for a real-time adaptive and parallel methodology for scalable data stream classification tasks.