28 resultados para Data Structure Operations
Resumo:
Data from civil engineering projects can inform the operation of built infrastructure. This paper captures lessons for such data handover, from projects into operations, through interviews with leading clients and their supply chain. Clients are found to value receiving accurate and complete data. They recognise opportunities to use high quality information in decision-making about capital and operational expenditure; as well as in ensuring compliance with regulatory requirements. Providing this value to clients is a motivation for information management in projects. However, data handover is difficult as key people leave before project completion; and different data formats and structures are used in project delivery and operations. Lessons learnt from leading practice include defining data requirements at the outset, getting operations teams involved early, shaping the evolution of interoperable systems and standards, developing handover processes to check data rather than documentation, and fostering skills to use and update project data in operations
Resumo:
We have optimised the atmospheric radiation algorithm of the FAMOUS climate model on several hardware platforms. The optimisation involved translating the Fortran code to C and restructuring the algorithm around the computation of a single air column. Instead of the existing MPI-based domain decomposition, we used a task queue and a thread pool to schedule the computation of individual columns on the available processors. Finally, four air columns are packed together in a single data structure and computed simultaneously using Single Instruction Multiple Data operations. The modified algorithm runs more than 50 times faster on the CELL’s Synergistic Processing Elements than on its main PowerPC processing element. On Intel-compatible processors, the new radiation code runs 4 times faster. On the tested graphics processor, using OpenCL, we find a speed-up of more than 2.5 times as compared to the original code on the main CPU. Because the radiation code takes more than 60% of the total CPU time, FAMOUS executes more than twice as fast. Our version of the algorithm returns bit-wise identical results, which demonstrates the robustness of our approach. We estimate that this project required around two and a half man-years of work.
Resumo:
Bloom filters are a data structure for storing data in a compressed form. They offer excellent space and time efficiency at the cost of some loss of accuracy (so-called lossy compression). This work presents a yes-no Bloom filter, which as a data structure consisting of two parts: the yes-filter which is a standard Bloom filter and the no-filter which is another Bloom filter whose purpose is to represent those objects that were recognised incorrectly by the yes-filter (that is, to recognise the false positives of the yes-filter). By querying the no-filter after an object has been recognised by the yes-filter, we get a chance of rejecting it, which improves the accuracy of data recognition in comparison with the standard Bloom filter of the same total length. A further increase in accuracy is possible if one chooses objects to include in the no-filter so that the no-filter recognises as many as possible false positives but no true positives, thus producing the most accurate yes-no Bloom filter among all yes-no Bloom filters. This paper studies how optimization techniques can be used to maximize the number of false positives recognised by the no-filter, with the constraint being that it should recognise no true positives. To achieve this aim, an Integer Linear Program (ILP) is proposed for the optimal selection of false positives. In practice the problem size is normally large leading to intractable optimal solution. Considering the similarity of the ILP with the Multidimensional Knapsack Problem, an Approximate Dynamic Programming (ADP) model is developed making use of a reduced ILP for the value function approximation. Numerical results show the ADP model works best comparing with a number of heuristics as well as the CPLEX built-in solver (B&B), and this is what can be recommended for use in yes-no Bloom filters. In a wider context of the study of lossy compression algorithms, our researchis an example showing how the arsenal of optimization methods can be applied to improving the accuracy of compressed data.
Resumo:
In this paper a cell by cell anisotropic adaptive mesh technique is added to an existing staggered mesh Lagrange plus remap finite element ALE code for the solution of the Euler equations. The quadrilateral finite elements may be subdivided isotropically or anisotropically and a hierarchical data structure is employed. An efficient computational method is proposed, which only solves on the finest level of resolution that exists for each part of the domain with disjoint or hanging nodes being used at resolution transitions. The Lagrangian, equipotential mesh relaxation and advection (solution remapping) steps are generalised so that they may be applied on the dynamic mesh. It is shown that for a radial Sod problem and a two-dimensional Riemann problem the anisotropic adaptive mesh method runs over eight times faster.
Resumo:
Clustering is defined as the grouping of similar items in a set, and is an important process within the field of data mining. As the amount of data for various applications continues to increase, in terms of its size and dimensionality, it is necessary to have efficient clustering methods. A popular clustering algorithm is K-Means, which adopts a greedy approach to produce a set of K-clusters with associated centres of mass, and uses a squared error distortion measure to determine convergence. Methods for improving the efficiency of K-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting a more efficient data structure, notably a multi-dimensional binary search tree (KD-Tree) to store either centroids or data points. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient K-Means techniques in parallel computational environments. In this work, we provide a parallel formulation for the KD-Tree based K-Means algorithm and address its load balancing issues.
Resumo:
One among the most influential and popular data mining methods is the k-Means algorithm for cluster analysis. Techniques for improving the efficiency of k-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree). These techniques allow to reduce the number of distance computations the algorithm performs at each iteration. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing issue. Three solutions have been developed and tested. Two approaches are based on a static partitioning of the data set and a third solution incorporates a dynamic load balancing policy.
Resumo:
In real world applications sequential algorithms of data mining and data exploration are often unsuitable for datasets with enormous size, high-dimensionality and complex data structure. Grid computing promises unprecedented opportunities for unlimited computing and storage resources. In this context there is the necessity to develop high performance distributed data mining algorithms. However, the computational complexity of the problem and the large amount of data to be explored often make the design of large scale applications particularly challenging. In this paper we present the first distributed formulation of a frequent subgraph mining algorithm for discriminative fragments of molecular compounds. Two distributed approaches have been developed and compared on the well known National Cancer Institute’s HIV-screening dataset. We present experimental results on a small-scale computing environment.
Resumo:
This study investigates the superposition-based cooperative transmission system. In this system, a key point is for the relay node to detect data transmitted from the source node. This issued was less considered in the existing literature as the channel is usually assumed to be flat fading and a priori known. In practice, however, the channel is not only a priori unknown but subject to frequency selective fading. Channel estimation is thus necessary. Of particular interest is the channel estimation at the relay node which imposes extra requirement for the system resources. The authors propose a novel turbo least-square channel estimator by exploring the superposition structure of the transmission data. The proposed channel estimator not only requires no pilot symbols but also has significantly better performance than the classic approach. The soft-in-soft-out minimum mean square error (MMSE) equaliser is also re-derived to match the superimposed data structure. Finally computer simulation results are shown to verify the proposed algorithm.
Resumo:
Web Services for Remote Portlets (WSRP) is gaining attention among portal developers and vendors to enable easy development, increased richness in functionality, pluggability, and flexibility of deployment. Whilst currently not supporting all WSRP functionalities, open-source portal frameworks could in future use WSRP Consumers to access remote portlets found from a WSRP Producer registry service. This implies that we need a central registry for the remote portlets and a more expressive WSRP Consumer interface to implement the remote portlet functions. This paper reports on an investigation into a new system architecture, which includes a Web Services repository, registry, and client interface. The Web Services repository holds portlets as remote resource producers. A new data structure for expressing remote portlets is found and published by populating a Universal Description, Discovery and Integration (UDDI) registry. A remote portlet publish and search engine for UDDI has also been developed. Finally, a remote portlet client interface was developed as a Web application. The client interface supports remote portlet features, as well as window status and mode functions. Copyright (c) 2007 John Wiley & Sons, Ltd.
Resumo:
K-Means is a popular clustering algorithm which adopts an iterative refinement procedure to determine data partitions and to compute their associated centres of mass, called centroids. The straightforward implementation of the algorithm is often referred to as `brute force' since it computes a proximity measure from each data point to each centroid at every iteration of the K-Means process. Efficient implementations of the K-Means algorithm have been predominantly based on multi-dimensional binary search trees (KD-Trees). A combination of an efficient data structure and geometrical constraints allow to reduce the number of distance computations required at each iteration. In this work we present a general space partitioning approach for improving the efficiency and the scalability of the K-Means algorithm. We propose to adopt approximate hierarchical clustering methods to generate binary space partitioning trees in contrast to KD-Trees. In the experimental analysis, we have tested the performance of the proposed Binary Space Partitioning K-Means (BSP-KM) when a divisive clustering algorithm is used. We have carried out extensive experimental tests to compare the proposed approach to the one based on KD-Trees (KD-KM) in a wide range of the parameters space. BSP-KM is more scalable than KDKM, while keeping the deterministic nature of the `brute force' algorithm. In particular, the proposed space partitioning approach has shown to overcome the well-known limitation of KD-Trees in high-dimensional spaces and can also be adopted to improve the efficiency of other algorithms in which KD-Trees have been used.
Resumo:
A parallel formulation of an algorithm for the histogram computation of n data items using an on-the-fly data decomposition and a novel quantum-like representation (QR) is developed. The QR transformation separates multiple data read operations from multiple bin update operations thereby making it easier to bind data items into their corresponding histogram bins. Under this model the steps required to compute the histogram is n/s + t steps, where s is a speedup factor and t is associated with pipeline latency. Here, we show that an overall speedup factor, s, is available for up to an eightfold acceleration. Our evaluation also shows that each one of these cells requires less area/time complexity compared to similar proposals found in the literature.
Resumo:
The Bloom filter is a space efficient randomized data structure for representing a set and supporting membership queries. Bloom filters intrinsically allow false positives. However, the space savings they offer outweigh the disadvantage if the false positive rates are kept sufficiently low. Inspired by the recent application of the Bloom filter in a novel multicast forwarding fabric, this paper proposes a variant of the Bloom filter, the optihash. The optihash introduces an optimization for the false positive rate at the stage of Bloom filter formation using the same amount of space at the cost of slightly more processing than the classic Bloom filter. Often Bloom filters are used in situations where a fixed amount of space is a primary constraint. We present the optihash as a good alternative to Bloom filters since the amount of space is the same and the improvements in false positives can justify the additional processing. Specifically, we show via simulations and numerical analysis that using the optihash the false positives occurrences can be reduced and controlled at a cost of small additional processing. The simulations are carried out for in-packet forwarding. In this framework, the Bloom filter is used as a compact link/route identifier and it is placed in the packet header to encode the route. At each node, the Bloom filter is queried for membership in order to make forwarding decisions. A false positive in the forwarding decision is translated into packets forwarded along an unintended outgoing link. By using the optihash, false positives can be reduced. The optimization processing is carried out in an entity termed the Topology Manger which is part of the control plane of the multicast forwarding fabric. This processing is only carried out on a per-session basis, not for every packet. The aim of this paper is to present the optihash and evaluate its false positive performances via simulations in order to measure the influence of different parameters on the false positive rate. The false positive rate for the optihash is then compared with the false positive probability of the classic Bloom filter.
Resumo:
Advances made over the past decade in structure determination from powder diffraction data are reviewed with particular emphasis on algorithmic developments and the successes and limitations of the technique. While global optimization methods have been successful in the solution of molecular crystal structures, new methods are required to make the solution of inorganic crystal structures more routine. The use of complementary techniques such as NMR to assist structure solution is discussed and the potential for the combined use of X-ray and neutron diffraction data for structure verification is explored. Structures that have proved difficult to solve from powder diffraction data are reviewed and the limitations of structure determination from powder diffraction data are discussed. Furthermore, the prospects of solving small protein crystal structures over the next decade are assessed.
Resumo:
The simulated annealing approach to structure solution from powder diffraction data, as implemented in the DASH program, is easily amenable to parallelization at the individual run level. Very large scale increases in speed of execution can therefore be achieved by distributing individual DASH runs over a network of computers. The GDASH program achieves this by packaging DASH in a form that enables it to run under the Univa UD Grid MP system, which harnesses networks of existing computing resources to perform calculations.
Resumo:
The simulated annealing approach to structure solution from powder diffraction data, as implemented in the DASH program, is easily amenable to parallelization at the individual run level. Modest increases in speed of execution can therefore be achieved by executing individual DASH runs on the individual cores of CPUs.