58 resultados para Data Structure Evaluation

em CentAUR: Central Archive University of Reading - UK


Relevância:

90.00% 90.00%

Publicador:

Resumo:

Bloom filters are a data structure for storing data in a compressed form. They offer excellent space and time efficiency at the cost of some loss of accuracy (so-called lossy compression). This work presents a yes-no Bloom filter, which as a data structure consisting of two parts: the yes-filter which is a standard Bloom filter and the no-filter which is another Bloom filter whose purpose is to represent those objects that were recognised incorrectly by the yes-filter (that is, to recognise the false positives of the yes-filter). By querying the no-filter after an object has been recognised by the yes-filter, we get a chance of rejecting it, which improves the accuracy of data recognition in comparison with the standard Bloom filter of the same total length. A further increase in accuracy is possible if one chooses objects to include in the no-filter so that the no-filter recognises as many as possible false positives but no true positives, thus producing the most accurate yes-no Bloom filter among all yes-no Bloom filters. This paper studies how optimization techniques can be used to maximize the number of false positives recognised by the no-filter, with the constraint being that it should recognise no true positives. To achieve this aim, an Integer Linear Program (ILP) is proposed for the optimal selection of false positives. In practice the problem size is normally large leading to intractable optimal solution. Considering the similarity of the ILP with the Multidimensional Knapsack Problem, an Approximate Dynamic Programming (ADP) model is developed making use of a reduced ILP for the value function approximation. Numerical results show the ADP model works best comparing with a number of heuristics as well as the CPLEX built-in solver (B&B), and this is what can be recommended for use in yes-no Bloom filters. In a wider context of the study of lossy compression algorithms, our researchis an example showing how the arsenal of optimization methods can be applied to improving the accuracy of compressed data.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

In this paper a cell by cell anisotropic adaptive mesh technique is added to an existing staggered mesh Lagrange plus remap finite element ALE code for the solution of the Euler equations. The quadrilateral finite elements may be subdivided isotropically or anisotropically and a hierarchical data structure is employed. An efficient computational method is proposed, which only solves on the finest level of resolution that exists for each part of the domain with disjoint or hanging nodes being used at resolution transitions. The Lagrangian, equipotential mesh relaxation and advection (solution remapping) steps are generalised so that they may be applied on the dynamic mesh. It is shown that for a radial Sod problem and a two-dimensional Riemann problem the anisotropic adaptive mesh method runs over eight times faster.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Clustering is defined as the grouping of similar items in a set, and is an important process within the field of data mining. As the amount of data for various applications continues to increase, in terms of its size and dimensionality, it is necessary to have efficient clustering methods. A popular clustering algorithm is K-Means, which adopts a greedy approach to produce a set of K-clusters with associated centres of mass, and uses a squared error distortion measure to determine convergence. Methods for improving the efficiency of K-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting a more efficient data structure, notably a multi-dimensional binary search tree (KD-Tree) to store either centroids or data points. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient K-Means techniques in parallel computational environments. In this work, we provide a parallel formulation for the KD-Tree based K-Means algorithm and address its load balancing issues.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

One among the most influential and popular data mining methods is the k-Means algorithm for cluster analysis. Techniques for improving the efficiency of k-Means have been largely explored in two main directions. The amount of computation can be significantly reduced by adopting geometrical constraints and an efficient data structure, notably a multidimensional binary search tree (KD-Tree). These techniques allow to reduce the number of distance computations the algorithm performs at each iteration. A second direction is parallel processing, where data and computation loads are distributed over many processing nodes. However, little work has been done to provide a parallel formulation of the efficient sequential techniques based on KD-Trees. Such approaches are expected to have an irregular distribution of computation load and can suffer from load imbalance. This issue has so far limited the adoption of these efficient k-Means variants in parallel computing environments. In this work, we provide a parallel formulation of the KD-Tree based k-Means algorithm for distributed memory systems and address its load balancing issue. Three solutions have been developed and tested. Two approaches are based on a static partitioning of the data set and a third solution incorporates a dynamic load balancing policy.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

In real world applications sequential algorithms of data mining and data exploration are often unsuitable for datasets with enormous size, high-dimensionality and complex data structure. Grid computing promises unprecedented opportunities for unlimited computing and storage resources. In this context there is the necessity to develop high performance distributed data mining algorithms. However, the computational complexity of the problem and the large amount of data to be explored often make the design of large scale applications particularly challenging. In this paper we present the first distributed formulation of a frequent subgraph mining algorithm for discriminative fragments of molecular compounds. Two distributed approaches have been developed and compared on the well known National Cancer Institute’s HIV-screening dataset. We present experimental results on a small-scale computing environment.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

This study investigates the superposition-based cooperative transmission system. In this system, a key point is for the relay node to detect data transmitted from the source node. This issued was less considered in the existing literature as the channel is usually assumed to be flat fading and a priori known. In practice, however, the channel is not only a priori unknown but subject to frequency selective fading. Channel estimation is thus necessary. Of particular interest is the channel estimation at the relay node which imposes extra requirement for the system resources. The authors propose a novel turbo least-square channel estimator by exploring the superposition structure of the transmission data. The proposed channel estimator not only requires no pilot symbols but also has significantly better performance than the classic approach. The soft-in-soft-out minimum mean square error (MMSE) equaliser is also re-derived to match the superimposed data structure. Finally computer simulation results are shown to verify the proposed algorithm.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Web Services for Remote Portlets (WSRP) is gaining attention among portal developers and vendors to enable easy development, increased richness in functionality, pluggability, and flexibility of deployment. Whilst currently not supporting all WSRP functionalities, open-source portal frameworks could in future use WSRP Consumers to access remote portlets found from a WSRP Producer registry service. This implies that we need a central registry for the remote portlets and a more expressive WSRP Consumer interface to implement the remote portlet functions. This paper reports on an investigation into a new system architecture, which includes a Web Services repository, registry, and client interface. The Web Services repository holds portlets as remote resource producers. A new data structure for expressing remote portlets is found and published by populating a Universal Description, Discovery and Integration (UDDI) registry. A remote portlet publish and search engine for UDDI has also been developed. Finally, a remote portlet client interface was developed as a Web application. The client interface supports remote portlet features, as well as window status and mode functions. Copyright (c) 2007 John Wiley & Sons, Ltd.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

K-Means is a popular clustering algorithm which adopts an iterative refinement procedure to determine data partitions and to compute their associated centres of mass, called centroids. The straightforward implementation of the algorithm is often referred to as `brute force' since it computes a proximity measure from each data point to each centroid at every iteration of the K-Means process. Efficient implementations of the K-Means algorithm have been predominantly based on multi-dimensional binary search trees (KD-Trees). A combination of an efficient data structure and geometrical constraints allow to reduce the number of distance computations required at each iteration. In this work we present a general space partitioning approach for improving the efficiency and the scalability of the K-Means algorithm. We propose to adopt approximate hierarchical clustering methods to generate binary space partitioning trees in contrast to KD-Trees. In the experimental analysis, we have tested the performance of the proposed Binary Space Partitioning K-Means (BSP-KM) when a divisive clustering algorithm is used. We have carried out extensive experimental tests to compare the proposed approach to the one based on KD-Trees (KD-KM) in a wide range of the parameters space. BSP-KM is more scalable than KDKM, while keeping the deterministic nature of the `brute force' algorithm. In particular, the proposed space partitioning approach has shown to overcome the well-known limitation of KD-Trees in high-dimensional spaces and can also be adopted to improve the efficiency of other algorithms in which KD-Trees have been used.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Motivation: The ability of a simple method (MODCHECK) to determine the sequence–structure compatibility of a set of structural models generated by fold recognition is tested in a thorough benchmark analysis. Four Model Quality Assessment Programs (MQAPs) were tested on 188 targets from the latest LiveBench-9 automated structure evaluation experiment. We systematically test and evaluate whether the MQAP methods can successfully detect native-likemodels. Results: We show that compared with the other three methods tested MODCHECK is the most reliable method for consistently performing the best top model selection and for ranking the models. In addition, we show that the choice of model similarity score used to assess a model's similarity to the experimental structure can influence the overall performance of these tools. Although these MQAP methods fail to improve the model selection performance for methods that already incorporate protein three dimension (3D) structural information, an improvement is observed for methods that are purely sequence-based, including the best profile–profile methods. This suggests that even the best sequence-based fold recognition methods can still be improved by taking into account the 3D structural information.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

A procedure is described in which patients are randomized between two experimental treatments and a control. At a series of interim analyses, each experimental treatment is compared with control. One of the experimental treatments might then be found sufficiently superior to the control for it to be declared the best treatment, and the trial stopped. Alternatively, experimental treatments might be eliminated from further consideration at any stage. It is shown how the procedure can be conducted while controlling overall error probabilities. Data concerning evaluation of different doses of riluzole in the treatment of motor neurone disease are used for illustration.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

We have optimised the atmospheric radiation algorithm of the FAMOUS climate model on several hardware platforms. The optimisation involved translating the Fortran code to C and restructuring the algorithm around the computation of a single air column. Instead of the existing MPI-based domain decomposition, we used a task queue and a thread pool to schedule the computation of individual columns on the available processors. Finally, four air columns are packed together in a single data structure and computed simultaneously using Single Instruction Multiple Data operations. The modified algorithm runs more than 50 times faster on the CELL’s Synergistic Processing Elements than on its main PowerPC processing element. On Intel-compatible processors, the new radiation code runs 4 times faster. On the tested graphics processor, using OpenCL, we find a speed-up of more than 2.5 times as compared to the original code on the main CPU. Because the radiation code takes more than 60% of the total CPU time, FAMOUS executes more than twice as fast. Our version of the algorithm returns bit-wise identical results, which demonstrates the robustness of our approach. We estimate that this project required around two and a half man-years of work.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

The Bloom filter is a space efficient randomized data structure for representing a set and supporting membership queries. Bloom filters intrinsically allow false positives. However, the space savings they offer outweigh the disadvantage if the false positive rates are kept sufficiently low. Inspired by the recent application of the Bloom filter in a novel multicast forwarding fabric, this paper proposes a variant of the Bloom filter, the optihash. The optihash introduces an optimization for the false positive rate at the stage of Bloom filter formation using the same amount of space at the cost of slightly more processing than the classic Bloom filter. Often Bloom filters are used in situations where a fixed amount of space is a primary constraint. We present the optihash as a good alternative to Bloom filters since the amount of space is the same and the improvements in false positives can justify the additional processing. Specifically, we show via simulations and numerical analysis that using the optihash the false positives occurrences can be reduced and controlled at a cost of small additional processing. The simulations are carried out for in-packet forwarding. In this framework, the Bloom filter is used as a compact link/route identifier and it is placed in the packet header to encode the route. At each node, the Bloom filter is queried for membership in order to make forwarding decisions. A false positive in the forwarding decision is translated into packets forwarded along an unintended outgoing link. By using the optihash, false positives can be reduced. The optimization processing is carried out in an entity termed the Topology Manger which is part of the control plane of the multicast forwarding fabric. This processing is only carried out on a per-session basis, not for every packet. The aim of this paper is to present the optihash and evaluate its false positive performances via simulations in order to measure the influence of different parameters on the false positive rate. The false positive rate for the optihash is then compared with the false positive probability of the classic Bloom filter.

Relevância:

50.00% 50.00%

Publicador:

Resumo:

We compare the characteristics of synthetic European droughts generated by the HiGEM1 coupled climate model run with present day atmospheric composition with observed drought events extracted from the CRU TS3 data set. The results demonstrate consistency in both the rate of drought occurrence and the spatiotemporal structure of the events. Estimates of the probability density functions for event area, duration and severity are shown to be similar with confidence > 90%. Encouragingly, HiGEM is shown to replicate the extreme tails of the observed distributions and thus the most damaging European drought events. The soil moisture state is shown to play an important role in drought development. Once a large-scale drought has been initiated it is found to be 50% more likely to continue if the local soil moisture is below the 40th percentile. In response to increased concentrations of atmospheric CO2, the modelled droughts are found to increase in duration, area and severity. The drought response can be largely attributed to temperature driven changes in relative humidity. 1 HiGEM is based on the latest climate configuration of the Met Office Hadley Centre Unified Model (HadGEM1) with the horizontal resolution increased to 1.25 x 0.83 degrees in longitude and latitude in the atmosphere and 1/3 x 1/3 degrees in the ocean.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

The representation of the diurnal cycle in the Hadley Centre climate model is evaluated using simulations of the infrared radiances observed by Meteosat 7. In both the window and water vapour channels, the standard version of the model with 19 levels produces a good simulation of the geographical distributions of the mean radiances and of the amplitude of the diurnal cycle. Increasing the vertical resolution to 30 levels leads to further improvements in the mean fields. The timing of the maximum and minimum radiances reveals significant model errors, however, which are sensitive to the frequency with which the radiation scheme is called. In most regions, these errors are consistent with well documented errors in the timing of convective precipitation, which peaks before noon in the model, in contrast to the observed peak in the late afternoon or evening. When the radiation scheme is called every model time step (half an hour), as opposed to every three hours in the standard version, the timing of the minimum radiance is improved for convective regions over central Africa, due to the creation of upper-level layer-cloud by detrainment from the convection scheme, which persists well after the convection itself has dissipated. However, this produces a decoupling between the timing of the diurnal cycles of precipitation and window channel radiance. The possibility is raised that a similar decoupling may occur in reality and the implications of this for the retrieval of the diurnal cycle of precipitation from infrared radiances are discussed.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Regular visual observations of persistent contrails over Reading, UK, have been used to evaluate radiosonde measurements of temperature and humidity defining cold ice-supersaturated atmospheric regions which are assumed to be a necessary condition for persistent condensation trails (contrails) to form. Results show a good correlation between observations and predictions using data from Larkhill, 63 km from Reading. A statistical analysis of this result and the forecasts using data from four additional UK radiosonde stations are presented. The horizontal extent of supersaturated layers could be inferred from this to be several hundred kilometres. The necessity of bias corrections to radiosonde humidity measurements is discussed and an analysis of measured ice-supersaturated atmospheric layers in the troposphere is presented. It is found that ice supersaturation is more likely to occur in winter than in summer, with frequencies of 17.3% and 9.4%, respectively, which is mostly due to the layers being thicker in winter than in summer. The most probable height for them to occur is about 10 km.