901 resultados para Clustering search algorithm


Relevância:

30.00% 30.00%

Publicador:

Resumo:

The K-Means algorithm for cluster analysis is one of the most influential and popular data mining methods. Its straightforward parallel formulation is well suited for distributed memory systems with reliable interconnection networks. However, in large-scale geographically distributed systems the straightforward parallel algorithm can be rendered useless by a single communication failure or high latency in communication paths. This work proposes a fully decentralised algorithm (Epidemic K-Means) which does not require global communication and is intrinsically fault tolerant. The proposed distributed K-Means algorithm provides a clustering solution which can approximate the solution of an ideal centralised algorithm over the aggregated data as closely as desired. A comparative performance analysis is carried out against the state of the art distributed K-Means algorithms based on sampling methods. The experimental analysis confirms that the proposed algorithm is a practical and accurate distributed K-Means implementation for networked systems of very large and extreme scale.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The K-Means algorithm for cluster analysis is one of the most influential and popular data mining methods. Its straightforward parallel formulation is well suited for distributed memory systems with reliable interconnection networks, such as massively parallel processors and clusters of workstations. However, in large-scale geographically distributed systems the straightforward parallel algorithm can be rendered useless by a single communication failure or high latency in communication paths. The lack of scalable and fault tolerant global communication and synchronisation methods in large-scale systems has hindered the adoption of the K-Means algorithm for applications in large networked systems such as wireless sensor networks, peer-to-peer systems and mobile ad hoc networks. This work proposes a fully distributed K-Means algorithm (EpidemicK-Means) which does not require global communication and is intrinsically fault tolerant. The proposed distributed K-Means algorithm provides a clustering solution which can approximate the solution of an ideal centralised algorithm over the aggregated data as closely as desired. A comparative performance analysis is carried out against the state of the art sampling methods and shows that the proposed method overcomes the limitations of the sampling-based approaches for skewed clusters distributions. The experimental analysis confirms that the proposed algorithm is very accurate and fault tolerant under unreliable network conditions (message loss and node failures) and is suitable for asynchronous networks of very large and extreme scale.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Evolutionary meta-algorithms for pulse shaping of broadband femtosecond duration laser pulses are proposed. The genetic algorithm searching the evolutionary landscape for desired pulse shapes consists of a population of waveforms (genes), each made from two concatenated vectors, specifying phases and magnitudes, respectively, over a range of frequencies. Frequency domain operators such as mutation, two-point crossover average crossover, polynomial phase mutation, creep and three-point smoothing as well as a time-domain crossover are combined to produce fitter offsprings at each iteration step. The algorithm applies roulette wheel selection; elitists and linear fitness scaling to the gene population. A differential evolution (DE) operator that provides a source of directed mutation and new wavelet operators are proposed. Using properly tuned parameters for DE, the meta-algorithm is used to solve a waveform matching problem. Tuning allows either a greedy directed search near the best known solution or a robust search across the entire parameter space.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Global communicationrequirements andloadimbalanceof someparalleldataminingalgorithms arethe major obstacles to exploitthe computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication costin parallel data mining algorithms and, in particular, in the k-means algorithm for cluster analysis. In the straightforward parallel formulation of the k-means algorithm, data and computation loads are uniformly distributed over the processing nodes. This approach has excellent load balancing characteristics that may suggest it could scale up to large and extreme-scale parallel computing systems. However, at each iteration step the algorithm requires a global reduction operationwhichhinders thescalabilityoftheapproach.Thisworkstudiesadifferentparallelformulation of the algorithm where the requirement of global communication is removed, while maintaining the same deterministic nature ofthe centralised algorithm. The proposed approach exploits a non-uniform data distribution which can be either found in real-world distributed applications or can be induced by means ofmulti-dimensional binary searchtrees. The approachcanalso be extended to accommodate an approximation error which allows a further reduction ofthe communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing element

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Exascale systems are the next frontier in high-performance computing and are expected to deliver a performance of the order of 10^18 operations per second using massive multicore processors. Very large- and extreme-scale parallel systems pose critical algorithmic challenges, especially related to concurrency, locality and the need to avoid global communication patterns. This work investigates a novel protocol for dynamic group communication that can be used to remove the global communication requirement and to reduce the communication cost in parallel formulations of iterative data mining algorithms. The protocol is used to provide a communication-efficient parallel formulation of the k-means algorithm for cluster analysis. The approach is based on a collective communication operation for dynamic groups of processes and exploits non-uniform data distributions. Non-uniform data distributions can be either found in real-world distributed applications or induced by means of multidimensional binary search trees. The analysis of the proposed dynamic group communication protocol has shown that it does not introduce significant communication overhead. The parallel clustering algorithm has also been extended to accommodate an approximation error, which allows a further reduction of the communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing elements.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper presents a hierarchical clustering method for semantic Web service discovery. This method aims to improve the accuracy and efficiency of the traditional service discovery using vector space model. The Web service is converted into a standard vector format through the Web service description document. With the help of WordNet, a semantic analysis is conducted to reduce the dimension of the term vector and to make semantic expansion to meet the user’s service request. The process and algorithm of hierarchical clustering based semantic Web service discovery is discussed. Validation is carried out on the dataset.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Background: In many experimental pipelines, clustering of multidimensional biological datasets is used to detect hidden structures in unlabelled input data. Taverna is a popular workflow management system that is used to design and execute scientific workflows and aid in silico experimentation. The availability of fast unsupervised methods for clustering and visualization in the Taverna platform is important to support a data-driven scientific discovery in complex and explorative bioinformatics applications. Results: This work presents a Taverna plugin, the Biological Data Interactive Clustering Explorer (BioDICE), that performs clustering of high-dimensional biological data and provides a nonlinear, topology preserving projection for the visualization of the input data and their similarities. The core algorithm in the BioDICE plugin is Fast Learning Self Organizing Map (FLSOM), which is an improved variant of the Self Organizing Map (SOM) algorithm. The plugin generates an interactive 2D map that allows the visual exploration of multidimensional data and the identification of groups of similar objects. The effectiveness of the plugin is demonstrated on a case study related to chemical compounds. Conclusions: The number and variety of available tools and its extensibility have made Taverna a popular choice for the development of scientific data workflows. This work presents a novel plugin, BioDICE, which adds a data-driven knowledge discovery component to Taverna. BioDICE provides an effective and powerful clustering tool, which can be adopted for the explorative analysis of biological datasets.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Tensor clustering is an important tool that exploits intrinsically rich structures in real-world multiarray or Tensor datasets. Often in dealing with those datasets, standard practice is to use subspace clustering that is based on vectorizing multiarray data. However, vectorization of tensorial data does not exploit complete structure information. In this paper, we propose a subspace clustering algorithm without adopting any vectorization process. Our approach is based on a novel heterogeneous Tucker decomposition model taking into account cluster membership information. We propose a new clustering algorithm that alternates between different modes of the proposed heterogeneous tensor model. All but the last mode have closed-form updates. Updating the last mode reduces to optimizing over the multinomial manifold for which we investigate second order Riemannian geometry and propose a trust-region algorithm. Numerical experiments show that our proposed algorithm compete effectively with state-of-the-art clustering algorithms that are based on tensor factorization.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper is concerned with the computational efficiency of fuzzy clustering algorithms when the data set to be clustered is described by a proximity matrix only (relational data) and the number of clusters must be automatically estimated from such data. A fuzzy variant of an evolutionary algorithm for relational clustering is derived and compared against two systematic (pseudo-exhaustive) approaches that can also be used to automatically estimate the number of fuzzy clusters in relational data. An extensive collection of experiments involving 18 artificial and two real data sets is reported and analyzed. (C) 2011 Elsevier B.V. All rights reserved.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

A novel technique for selecting the poles of orthonormal basis functions (OBF) in Volterra models of any order is presented. It is well-known that the usual large number of parameters required to describe the Volterra kernels can be significantly reduced by representing each kernel using an appropriate basis of orthonormal functions. Such a representation results in the so-called OBF Volterra model, which has a Wiener structure consisting of a linear dynamic generated by the orthonormal basis followed by a nonlinear static mapping given by the Volterra polynomial series. Aiming at optimizing the poles that fully parameterize the orthonormal bases, the exact gradients of the outputs of the orthonormal filters with respect to their poles are computed analytically by using a back-propagation-through-time technique. The expressions relative to the Kautz basis and to generalized orthonormal bases of functions (GOBF) are addressed; the ones related to the Laguerre basis follow straightforwardly as a particular case. The main innovation here is that the dynamic nature of the OBF filters is fully considered in the gradient computations. These gradients provide exact search directions for optimizing the poles of a given orthonormal basis. Such search directions can, in turn, be used as part of an optimization procedure to locate the minimum of a cost-function that takes into account the error of estimation of the system output. The Levenberg-Marquardt algorithm is adopted here as the optimization procedure. Unlike previous related work, the proposed approach relies solely on input-output data measured from the system to be modeled, i.e., no information about the Volterra kernels is required. Examples are presented to illustrate the application of this approach to the modeling of dynamic systems, including a real magnetic levitation system with nonlinear oscillatory behavior.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

In this paper we present a genetic algorithm with new components to tackle capacitated lot sizing and scheduling problems with sequence dependent setups that appear in a wide range of industries, from soft drink bottling to food manufacturing. Finding a feasible solution to highly constrained problems is often a very difficult task. Various strategies have been applied to deal with infeasible solutions throughout the search. We propose a new scheme of classifying individuals based on nested domains to determine the solutions according to the level of infeasibility, which in our case represents bands of additional production hours (overtime). Within each band, individuals are just differentiated by their fitness function. As iterations are conducted, the widths of the bands are dynamically adjusted to improve the convergence of the individuals into the feasible domain. The numerical experiments on highly capacitated instances show the effectiveness of this computational tractable approach to guide the search toward the feasible domain. Our approach outperforms other state-of-the-art approaches and commercial solvers. (C) 2009 Elsevier Ltd. All rights reserved.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Conventional procedures employed in the modeling of viscoelastic properties of polymer rely on the determination of the polymer`s discrete relaxation spectrum from experimentally obtained data. In the past decades, several analytical regression techniques have been proposed to determine an explicit equation which describes the measured spectra. With a diverse approach, the procedure herein introduced constitutes a simulation-based computational optimization technique based on non-deterministic search method arisen from the field of evolutionary computation. Instead of comparing numerical results, this purpose of this paper is to highlight some Subtle differences between both strategies and focus on what properties of the exploited technique emerge as new possibilities for the field, In oder to illustrate this, essayed cases show how the employed technique can outperform conventional approaches in terms of fitting quality. Moreover, in some instances, it produces equivalent results With much fewer fitting parameters, which is convenient for computational simulation applications. I-lie problem formulation and the rationale of the highlighted method are herein discussed and constitute the main intended contribution. (C) 2009 Wiley Periodicals, Inc. J Appl Polym Sci 113: 122-135, 2009

Relevância:

30.00% 30.00%

Publicador:

Resumo:

We present parallel algorithms on the BSP/CGM model, with p processors, to count and generate all the maximal cliques of a circle graph with n vertices and m edges. To count the number of all the maximal cliques, without actually generating them, our algorithm requires O(log p) communication rounds with O(nm/p) local computation time. We also present an algorithm to generate the first maximal clique in O(log p) communication rounds with O(nm/p) local computation, and to generate each one of the subsequent maximal cliques this algorithm requires O(log p) communication rounds with O(m/p) local computation. The maximal cliques generation algorithm is based on generating all maximal paths in a directed acyclic graph, and we present an algorithm for this problem that uses O(log p) communication rounds with O(m/p) local computation for each maximal path. We also show that the presented algorithms can be extended to the CREW PRAM model.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper presents the formulation of a combinatorial optimization problem with the following characteristics: (i) the search space is the power set of a finite set structured as a Boolean lattice; (ii) the cost function forms a U-shaped curve when applied to any lattice chain. This formulation applies for feature selection in the context of pattern recognition. The known approaches for this problem are branch-and-bound algorithms and heuristics that explore partially the search space. Branch-and-bound algorithms are equivalent to the full search, while heuristics are not. This paper presents a branch-and-bound algorithm that differs from the others known by exploring the lattice structure and the U-shaped chain curves of the search space. The main contribution of this paper is the architecture of this algorithm that is based on the representation and exploration of the search space by new lattice properties proven here. Several experiments, with well known public data, indicate the superiority of the proposed method to the sequential floating forward selection (SFFS), which is a popular heuristic that gives good results in very short computational time. In all experiments, the proposed method got better or equal results in similar or even smaller computational time. (C) 2009 Elsevier Ltd. All rights reserved.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This Thesis Work will concentrate on a very interesting problem, the Vehicle Routing Problem (VRP). In this problem, customers or cities have to be visited and packages have to be transported to each of them, starting from a basis point on the map. The goal is to solve the transportation problem, to be able to deliver the packages-on time for the customers,-enough package for each Customer,-using the available resources- and – of course - to be so effective as it is possible.Although this problem seems to be very easy to solve with a small number of cities or customers, it is not. In this problem the algorithm have to face with several constraints, for example opening hours, package delivery times, truck capacities, etc. This makes this problem a so called Multi Constraint Optimization Problem (MCOP). What’s more, this problem is intractable with current amount of computational power which is available for most of us. As the number of customers grow, the calculations to be done grows exponential fast, because all constraints have to be solved for each customers and it should not be forgotten that the goal is to find a solution, what is best enough, before the time for the calculation is up. This problem is introduced in the first chapter: form its basics, the Traveling Salesman Problem, using some theoretical and mathematical background it is shown, why is it so hard to optimize this problem, and although it is so hard, and there is no best algorithm known for huge number of customers, why is it a worth to deal with it. Just think about a huge transportation company with ten thousands of trucks, millions of customers: how much money could be saved if we would know the optimal path for all our packages.Although there is no best algorithm is known for this kind of optimization problems, we are trying to give an acceptable solution for it in the second and third chapter, where two algorithms are described: the Genetic Algorithm and the Simulated Annealing. Both of them are based on obtaining the processes of nature and material science. These algorithms will hardly ever be able to find the best solution for the problem, but they are able to give a very good solution in special cases within acceptable calculation time.In these chapters (2nd and 3rd) the Genetic Algorithm and Simulated Annealing is described in details, from their basis in the “real world” through their terminology and finally the basic implementation of them. The work will put a stress on the limits of these algorithms, their advantages and disadvantages, and also the comparison of them to each other.Finally, after all of these theories are shown, a simulation will be executed on an artificial environment of the VRP, with both Simulated Annealing and Genetic Algorithm. They will both solve the same problem in the same environment and are going to be compared to each other. The environment and the implementation are also described here, so as the test results obtained.Finally the possible improvements of these algorithms are discussed, and the work will try to answer the “big” question, “Which algorithm is better?”, if this question even exists.