106 resultados para Di Domenico, G.
Resumo:
One among the most influential and popular data mining methods is the k-Means algorithm for cluster analysis. Techniques for improving the efficiency of k-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree). These techniques allow to reduce the number of distance computations the algorithm performs at each iteration. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing issue. Three solutions have been developed and tested. Two approaches are based on a static partitioning of the data set and a third solution incorporates a dynamic load balancing policy.
Resumo:
In the past decade, the amount of data in biological field has become larger and larger; Bio-techniques for analysis of biological data have been developed and new tools have been introduced. Several computational methods are based on unsupervised neural network algorithms that are widely used for multiple purposes including clustering and visualization, i.e. the Self Organizing Maps (SOM). Unfortunately, even though this method is unsupervised, the performances in terms of quality of result and learning speed are strongly dependent from the neuron weights initialization. In this paper we present a new initialization technique based on a totally connected undirected graph, that report relations among some intersting features of data input. Result of experimental tests, where the proposed algorithm is compared to the original initialization techniques, shows that our technique assures faster learning and better performance in terms of quantization error.
Resumo:
Self-Organizing Map (SOM) algorithm has been extensively used for analysis and classification problems. For this kind of problems, datasets become more and more large and it is necessary to speed up the SOM learning. In this paper we present an application of the Simulated Annealing (SA) procedure to the SOM learning algorithm. The goal of the algorithm is to obtain fast learning and better performance in terms of matching of input data and regularity of the obtained map. An advantage of the proposed technique is that it preserves the simplicity of the basic algorithm. Several tests, carried out on different large datasets, demonstrate the effectiveness of the proposed algorithm in comparison with the original SOM and with some of its modification introduced to speed-up the learning.
Resumo:
In this work a new method for clustering and building a topographic representation of a bacteria taxonomy is presented. The method is based on the analysis of stable parts of the genome, the so-called “housekeeping genes”. The proposed method generates topographic maps of the bacteria taxonomy, where relations among different type strains can be visually inspected and verified. Two well known DNA alignement algorithms are applied to the genomic sequences. Topographic maps are optimized to represent the similarity among the sequences according to their evolutionary distances. The experimental analysis is carried out on 147 type strains of the Gammaprotebacteria class by means of the 16S rRNA housekeeping gene. Complete sequences of the gene have been retrieved from the NCBI public database. In the experimental tests the maps show clusters of homologous type strains and present some singular cases potentially due to incorrect classification or erroneous annotations in the database.
Resumo:
Three naming strategies are discussed that allow the processes of a distributed application to continue being addressed by their original logical name, along all the migrations they may be forced to undertake because of performance-improvement goals. A simple centralised solution is firstly discussed which showed a software bottleneck with the increase of the number of processes; other two solutions are considered that entail different communication schemes and different communication overheads for the naming protocol. All these strategies are based on the facility that each process is allowed to survive after migration, even in its original site, only to provide a forwarding service to those communications that used its obsolete address.
Resumo:
A new heuristic for the Steiner Minimal Tree problem is presented here. The method described is based on the detection of particular sets of nodes in networks, the “Hot Spot” sets, which are used to obtain better approximations of the optimal solutions. An algorithm is also proposed which is capable of improving the solutions obtained by classical heuristics, by means of a stirring process of the nodes in solution trees. Classical heuristics and an enumerative method are used CIS comparison terms in the experimental analysis which demonstrates the goodness of the heuristic discussed in this paper.
Resumo:
We present a method to enhance fault localization for software systems based on a frequent pattern mining algorithm. Our method is based on a large set of test cases for a given set of programs in which faults can be detected. The test executions are recorded as function call trees. Based on test oracles the tests can be classified into successful and failing tests. A frequent pattern mining algorithm is used to identify frequent subtrees in successful and failing test executions. This information is used to rank functions according to their likelihood of containing a fault. The ranking suggests an order in which to examine the functions during fault analysis. We validate our approach experimentally using a subset of Siemens benchmark programs.
Resumo:
Recently, two approaches have been introduced that distribute the molecular fragment mining problem. The first approach applies a master/worker topology, the second approach, a completely distributed peer-to-peer system, solves the scalability problem due to the bottleneck at the master node. However, in many real world scenarios the participating computing nodes cannot communicate directly due to administrative policies such as security restrictions. Thus, potential computing power is not accessible to accelerate the mining run. To solve this shortcoming, this work introduces a hierarchical topology of computing resources, which distributes the management over several levels and adapts to the natural structure of those multi-domain architectures. The most important aspect is the load balancing scheme, which has been designed and optimized for the hierarchical structure. The approach allows dynamic aggregation of heterogenous computing resources and is applied to wide area network scenarios.
Resumo:
The Konstanz Information Miner is a modular environment which enables easy visual assembly and interactive execution of a data pipeline. It is designed as a teaching, research and collaboration platform, which enables easy integration of new algorithms, data manipulation or visualization methods as new modules or nodes. In this paper we describe some of the design aspects of the underlying architecture and briefly sketch how new nodes can be incorporated.
Resumo:
A new heuristic for the Steiner minimal tree problem is presented. The method described is based on the detection of particular sets of nodes in networks, the “hot spot” sets, which are used to obtain better approximations of the optimal solutions. An algorithm is also proposed which is capable of improving the solutions obtained by classical heuristics, by means of a stirring process of the nodes in solution trees. Classical heuristics and an enumerative method are used as comparison terms in the experimental analysis which demonstrates the capability of the heuristic discussed
Resumo:
Ant colonies in nature provide a good model for a distributed, robust and adaptive routing algorithm. This paper proposes the adoption of the same strategy for the routing of packets in an Active Network. Traditional store-and-forward routers are replaced by active intermediate systems, which are able to perform computations on transient packets, in a way that results very helpful for developing and dynamically deploying new protocols. The adoption of the Active Networks paradigm associated with a cooperative learning environment produces a robust, decentralized routing algorithm capable of adapting to network traffic conditions.
Resumo:
New conceptual ideas on network architectures have been proposed in the recent past. Current store-andforward routers are replaced by active intermediate systems, which are able to perform computations on transient packets, in a way that results very helpful for developing and deploying new protocols in a short time. This paper introduces a new routing algorithm, based on a congestion metric, and inspired by the behavior of ants in nature. The use of the Active Networks paradigm associated with a cooperative learning environment produces a robust, decentralized algorithm capable of adapting quickly to changing conditions.