32 results for proximity query, collision test, distance test, data compression, triangle test
in CentAUR: Central Archive University of Reading - UK
Abstract:
Bloom filters are a data structure for storing data in a compressed form. They offer excellent space and time efficiency at the cost of some loss of accuracy (so-called lossy compression). This work presents a yes-no Bloom filter, which is a data structure consisting of two parts: the yes-filter, which is a standard Bloom filter, and the no-filter, which is another Bloom filter whose purpose is to represent those objects that were recognised incorrectly by the yes-filter (that is, to recognise the false positives of the yes-filter). By querying the no-filter after an object has been recognised by the yes-filter, we get a chance of rejecting it, which improves the accuracy of data recognition in comparison with the standard Bloom filter of the same total length. A further increase in accuracy is possible if one chooses objects to include in the no-filter so that the no-filter recognises as many false positives as possible but no true positives, thus producing the most accurate yes-no Bloom filter among all yes-no Bloom filters. This paper studies how optimization techniques can be used to maximize the number of false positives recognised by the no-filter, with the constraint being that it should recognise no true positives. To achieve this aim, an Integer Linear Program (ILP) is proposed for the optimal selection of false positives. In practice the problem size is normally large, making the optimal solution intractable. Considering the similarity of the ILP with the Multidimensional Knapsack Problem, an Approximate Dynamic Programming (ADP) model is developed making use of a reduced ILP for the value function approximation. Numerical results show that the ADP model works best compared with a number of heuristics as well as the CPLEX built-in solver (B&B), and this is what can be recommended for use in yes-no Bloom filters.
In a wider context of the study of lossy compression algorithms, our research is an example showing how the arsenal of optimization methods can be applied to improving the accuracy of compressed data.
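The two-part structure described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class names, the salted SHA-256 hashing, and the greedy selection of false positives are all assumptions made for the example. In particular, the greedy loop is only a simple stand-in for the ILP and ADP models the abstract proposes.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter over a fixed-length bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Assumption: k hash functions obtained by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


class YesNoBloomFilter:
    """Yes-filter plus a no-filter that stores some of its false positives."""

    def __init__(self, yes_bits, no_bits, num_hashes):
        self.yes = BloomFilter(yes_bits, num_hashes)
        self.no = BloomFilter(no_bits, num_hashes)

    def build(self, members, candidates):
        members = set(members)
        for item in members:
            self.yes.add(item)
        for cand in candidates:
            # Only false positives of the yes-filter are worth storing.
            if cand in members or cand not in self.yes:
                continue
            saved = list(self.no.bits)
            self.no.add(cand)
            # Constraint from the abstract: the no-filter recognises no true positive.
            if any(m in self.no for m in members):
                self.no.bits = saved  # undo this candidate

    def __contains__(self, item):
        # Accept only if the yes-filter recognises it and the no-filter does not.
        return item in self.yes and item not in self.no
```

Because a candidate is kept only when no member is recognised by the no-filter, every true member is still accepted, and the set of accepted non-members can only shrink relative to the yes-filter alone.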
Abstract:
It is generally assumed that the variability of neuronal morphology has an important effect on both the connectivity and the activity of the nervous system, but this effect has not been thoroughly investigated. Neuroanatomical archives represent a crucial tool to explore structure–function relationships in the brain. We are developing computational tools to describe, generate, store and render large sets of three–dimensional neuronal structures in a format that is compact, quantitative, accurate and readily accessible to the neuroscientist. Single–cell neuroanatomy can be characterized quantitatively at several levels. In computer–aided neuronal tracing files, a dendritic tree is described as a series of cylinders, each represented by diameter, spatial coordinates and the connectivity to other cylinders in the tree. This ‘Cartesian’ description constitutes a completely accurate mapping of dendritic morphology but it bears little intuitive information for the neuroscientist. In contrast, a classical neuroanatomical analysis characterizes neuronal dendrites on the basis of the statistical distributions of morphological parameters, e.g. maximum branching order or bifurcation asymmetry. This description is intuitively more accessible, but it only yields information on the collective anatomy of a group of dendrites, i.e. it is not complete enough to provide a precise ‘blueprint’ of the original data. We are adopting a third, intermediate level of description, which consists of the algorithmic generation of neuronal structures within a certain morphological class based on a set of ‘fundamental’, measured parameters. This description is as intuitive as a classical neuroanatomical analysis (parameters have an intuitive interpretation), and as complete as a Cartesian file (the algorithms generate and display complete neurons). The advantages of the algorithmic description of neuronal structure are immense. 
If an algorithm can measure the values of a handful of parameters from an experimental database and generate virtual neurons whose anatomy is statistically indistinguishable from that of their real counterparts, a great deal of data compression and amplification can be achieved. Data compression results from the quantitative and complete description of thousands of neurons with a handful of statistical distributions of parameters. Data amplification is possible because, from a set of experimental neurons, many more virtual analogues can be generated. This approach could allow one, in principle, to create and store a neuroanatomical database containing data for an entire human brain in a personal computer. We are using two programs, L–NEURON and ARBORVITAE, to investigate systematically the potential of several different algorithms for the generation of virtual neurons. Using these programs, we have generated anatomically plausible virtual neurons for several morphological classes, including guinea pig cerebellar Purkinje cells and cat spinal cord motor neurons. These virtual neurons are stored in an online electronic archive of dendritic morphology. This process highlights the potential and the limitations of the ‘computational neuroanatomy’ strategy for neuroscience databases.
Abstract:
Written for communications and electronic engineers, technicians and students, this book begins with an introduction to data communications, and goes on to explain the concept of layered communications. Other chapters deal with physical communications channels, baseband digital transmission, analog data transmission, error control and data compression codes, physical layer standards, the data link layer, the higher layers of the protocol hierarchy, and local area networks (LANs). Finally, the book explores some likely future developments.
Abstract:
The resolution of remotely sensed data is becoming increasingly fine, and there are now many sources of data with a pixel size of 1 m x 1 m. This produces huge amounts of data that have to be stored, processed and transmitted. For environmental applications this resolution possibly provides far more data than are needed: data overload. This poses the question: how much is too much? We have explored two resolutions of data: 20 m pixel SPOT data and 1 m pixel Computerized Airborne Multispectral Imaging System (CAMIS) data from Fort A. P. Hill (Virginia, USA), using the variogram of geostatistics. For both we used the normalized difference vegetation index (NDVI). Three scales of spatial variation were identified in both the SPOT and 1 m data: there was some overlap at the intermediate spatial scales of about 150 m and of 500 m-600 m. We subsampled the 1 m data and scales of variation of about 30 m and of 300 m were identified consistently until the separation between pixel centroids was 15 m (or 1 in 225 pixels). At this stage, spatial scales of about 100 m and 600 m were described, which suggested that only now was there a real difference in the amount of spatial information available from an environmental perspective. These latter were similar spatial scales to those identified from the SPOT image. We have also analysed 1 m CAMIS data from Fort Story (Virginia, USA) for comparison and the outcome is similar. From these analyses it seems that a pixel size of 20 m is adequate for many environmental applications, and that if more detail is required the higher resolution data could be sub-sampled to a 10 m separation between pixel centroids without any serious loss of information. This reduces significantly the amount of data that needs to be stored, transmitted and analysed and has important implications for data compression.
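The arithmetic behind "1 in 225 pixels" can be checked with a short sketch. The NDVI formula is the standard one; the 30 x 30 scene and its reflectance values are made up for illustration, and the helper names are hypothetical.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from near-infrared and red reflectance."""
    return (nir - red) / (nir + red)


def subsample(grid, step):
    """Keep every `step`-th pixel along both axes of a 2-D list of values."""
    return [row[::step] for row in grid[::step]]


# Hypothetical 30 x 30 scene of 1 m pixels with made-up reflectances.
nir = [[0.5] * 30 for _ in range(30)]
red = [[0.1] * 30 for _ in range(30)]
ndvi_grid = [[ndvi(n, r) for n, r in zip(nir_row, red_row)]
             for nir_row, red_row in zip(nir, red)]

coarse = subsample(ndvi_grid, 15)  # 15 m between retained pixel centroids
kept = sum(len(row) for row in coarse)
total = sum(len(row) for row in ndvi_grid)
print(kept, "of", total, "pixels kept")
```

Keeping every 15th pixel in each direction retains one pixel per 15 x 15 block, which is the 1-in-225 ratio quoted in the abstract.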
Abstract:
Empirical orthogonal function (EOF) analysis is a powerful tool for data compression and dimensionality reduction used broadly in meteorology and oceanography. Often in the literature, EOF modes are interpreted individually, independent of other modes. In fact, it can be shown that no such attribution can generally be made. This review demonstrates that in general individual EOF modes (i) will not correspond to individual dynamical modes, (ii) will not correspond to individual kinematic degrees of freedom, (iii) will not be statistically independent of other EOF modes, and (iv) will be strongly influenced by the nonlocal requirement that modes maximize variance over the entire domain. The goal of this review is not to argue against the use of EOF analysis in meteorology and oceanography; rather, it is to demonstrate the care that must be taken in the interpretation of individual modes in order to distinguish the medium from the message.
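Point (i) can be illustrated with a toy two-dimensional case. The setup is an assumption made for this sketch and does not come from the review: two non-orthogonal "dynamical" patterns with independent amplitudes. The EOFs are the eigenvectors of the resulting covariance matrix, and they are orthogonal by construction, so they cannot both coincide with the two patterns.

```python
import math


def leading_eof_angle(theta, var1=1.0, var2=1.0):
    """Angle (radians) of the leading EOF for two patterns in a 2-D state space.

    Pattern 1 points along the x-axis; pattern 2 is rotated by `theta`.
    Their amplitudes are independent with variances var1 and var2.
    """
    # Covariance C = var1 * p1 p1^T + var2 * p2 p2^T, written out element by element.
    a = var1 + var2 * math.cos(theta) ** 2
    b = var2 * math.cos(theta) * math.sin(theta)
    d = var2 * math.sin(theta) ** 2
    # Orientation of the principal axis of the symmetric matrix [[a, b], [b, d]].
    return 0.5 * math.atan2(2 * b, a - d)


theta = math.radians(60)  # hypothetical angle between the two patterns
eof1 = leading_eof_angle(theta)
print(math.degrees(eof1))
```

With equal variances and patterns 60 degrees apart, the leading EOF lies at about 30 degrees, halfway between the two patterns, and the second EOF is perpendicular to it. Neither matches an individual pattern.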
Abstract:
This dissertation deals with aspects of sequential data assimilation (in particular ensemble Kalman filtering) and numerical weather forecasting. In the first part, the recently formulated Ensemble Kalman-Bucy filter (EnKBF) is revisited. It is shown that the previously used numerical integration scheme fails when the magnitude of the background error covariance grows beyond that of the observational error covariance in the forecast window. Therefore, we present a suitable integration scheme that handles the stiffening of the differential equations involved and does not incur further computational expense. Moreover, a transform-based alternative to the EnKBF is developed: under this scheme, the operations are performed in the ensemble space instead of in the state space. Advantages of this formulation are explained. For the first time, the EnKBF is implemented in an atmospheric model. The second part of this work deals with ensemble clustering, a phenomenon that arises when performing data assimilation using deterministic ensemble square root filters (EnSRFs) in highly nonlinear forecast models. Namely, an M-member ensemble detaches into an outlier and a cluster of M-1 members. Previous works may suggest that this issue represents a failure of EnSRFs; this work dispels that notion. It is shown that ensemble clustering can also be reverted by nonlinear processes, in particular the alternation between nonlinear expansion and compression of the ensemble for different regions of the attractor. Some EnSRFs that use random rotations have been developed to overcome this issue; these formulations are analyzed and their advantages and disadvantages with respect to common EnSRFs are discussed. The third and last part contains the implementation of the Robert-Asselin-Williams (RAW) filter in an atmospheric model.
The RAW filter is an improvement to the widely popular Robert-Asselin filter that successfully suppresses spurious computational waves while avoiding any distortion in the mean value of the function. Using statistical significance tests both at the local and field level, it is shown that the climatology of the SPEEDY model is not modified by the changed time stepping scheme; hence, no retuning of the parameterizations is required. It is found that the accuracy of the medium-term forecasts is increased by using the RAW filter.
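The filter step itself is small enough to sketch. The function below follows the commonly stated form of the RAW filter for a leapfrog scheme, applied to a single scalar. The function name and the default values of `nu` and `alpha` are illustrative assumptions, not settings taken from the dissertation.

```python
def raw_filter(x_prev_filtered, x_curr, x_next, nu=0.2, alpha=0.53):
    """Apply one RAW filter step to a leapfrog triplet of scalar values.

    Returns the filtered current value and the adjusted next value.
    alpha = 1 reduces to the classical Robert-Asselin filter, which
    leaves the next value untouched.
    """
    # Filter displacement, proportional to the discrete second time difference.
    d = 0.5 * nu * (x_prev_filtered - 2.0 * x_curr + x_next)
    return x_curr + alpha * d, x_next - (1.0 - alpha) * d


# Hypothetical triplet showing a computational zig-zag.
filtered_curr, adjusted_next = raw_filter(1.0, 2.0, 1.0, alpha=0.5)
print(filtered_curr, adjusted_next)
```

With `alpha = 0.5` the two displacements cancel, so the sum of the current and next values is unchanged. This is the sense in which the mean value is not distorted. With `alpha = 1` only the current value is modified.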
Abstract:
While over-dispersion in capture–recapture studies is well known to lead to poor estimation of population size, current diagnostic tools to detect the presence of heterogeneity have not been specifically developed for capture–recapture studies. To address this, a simple and efficient method of testing for over-dispersion in zero-truncated count data is developed and evaluated. The proposed method generalizes an over-dispersion test previously suggested for un-truncated count data and may also be used for testing residual over-dispersion in zero-inflation data. Simulations suggest that the asymptotic distribution of the test statistic is standard normal and that this approximation is also reasonable for small sample sizes. The method is also shown to be more efficient than an existing test for over-dispersion adapted for the capture–recapture setting. Studies with zero-truncated and zero-inflated count data are used to illustrate the test procedures.
Abstract:
We propose a novel method for scoring the accuracy of protein binding site predictions – the Binding-site Distance Test (BDT) score. Recently, the Matthews Correlation Coefficient (MCC) has been used to evaluate binding site predictions, both by developers of new methods and by the assessors for the community wide prediction experiment – CASP8. Whilst being a rigorous scoring method, the MCC does not take into account the actual 3D location of the predicted residues from the observed binding site. Thus, an incorrectly predicted site that is nevertheless close to the observed binding site will obtain an identical score to the same number of nonbinding residues predicted at random. The MCC is somewhat affected by the subjectivity of determining observed binding residues and the ambiguity of choosing distance cutoffs. By contrast the BDT method produces continuous scores ranging between 0 and 1, relating to the distance between the predicted and observed residues. Residues predicted close to the binding site will score higher than those more distant, providing a better reflection of the true accuracy of predictions. The CASP8 function predictions were evaluated using both the MCC and BDT methods and the scores were compared. The BDT was found to strongly correlate with the MCC scores whilst also being less susceptible to the subjectivity of defining binding residues. We therefore suggest that this new simple score is a potentially more robust method for future evaluations of protein-ligand binding site predictions.
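The contrast drawn in this abstract can be made concrete with a small sketch. The `mcc` function uses the standard definition from confusion-matrix counts. The `distance_score` function is a hypothetical distance-aware score written only for illustration; its form and the threshold `d0` are assumptions and should not be read as the published BDT formula.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def distance_score(distances, d0=3.0):
    """Hypothetical distance-aware score in (0, 1]; NOT the published BDT formula.

    `distances` holds, for each predicted residue, its distance to the
    nearest observed binding residue (same units as d0).
    """
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / len(distances)


# Two made-up predictions, each with five wrong residues out of 100.
near_miss = [1.0] * 5   # wrong residues lying close to the observed site
far_miss = [9.0] * 5    # wrong residues far from the observed site
print(mcc(0, 5, 5, 90), mcc(0, 5, 5, 90))            # same counts, same MCC
print(distance_score(near_miss), distance_score(far_miss))
```

Both predictions have the same confusion-matrix counts and therefore the same MCC, while a score that uses distances ranks the near miss above the far miss.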
Abstract:
A score test is developed for binary clinical trial data, which incorporates patient non-compliance while respecting randomization. It is assumed in this paper that compliance is all-or-nothing, in the sense that a patient either accepts all of the treatment assigned as specified in the protocol, or none of it. Direct analytic comparisons of the adjusted test statistic for both the score test and the likelihood ratio test are made with the corresponding test statistics that adhere to the intention-to-treat principle. It is shown that no gain in power is possible over the intention-to-treat analysis, by adjusting for patient non-compliance. Sample size formulae are derived and simulation studies are used to demonstrate that the sample size approximation holds. Copyright © 2003 John Wiley & Sons, Ltd.
Abstract:
We describe a simple comparative method for determining whether rates of diversification are correlated with continuous traits in species-level phylogenies. This involves comparing traits of species with net speciation rate (number of nodes linking extant species with the root divided by the root to tip evolutionary distance), using a phylogenetically corrected correlation. We use simulations to examine the power of this test. We find that the approach has acceptable power to uncover relationships between speciation and a continuous trait and is robust to background random extinction; however, the power of the approach is reduced when the rate of trait evolution is decreased. The test has low power to relate diversification to traits when extinction rate is correlated with the trait. Clearly, there are inherent limitations in using only data on extant species to infer correlates of extinction; however, this approach is potentially a powerful tool in analyzing correlates of speciation.
Abstract:
Accurate knowledge of species’ habitat associations is important for conservation planning and policy. Assessing habitat associations is a vital precursor to selecting appropriate indicator species for prioritising sites for conservation or assessing trends in habitat quality. However, much existing knowledge is based on qualitative expert opinion or local scale studies, and may not remain accurate across different spatial scales or geographic locations. Data from biological recording schemes have the potential to provide objective measures of habitat association, with the ability to account for spatial variation. We used data on 50 British butterfly species as a test case to investigate the correspondence of data-derived measures of habitat association with expert opinion, from two different butterfly recording schemes. One scheme collected large quantities of occurrence data (c. 3 million records) and the other, lower quantities of standardised monitoring data (c. 1400 sites). We used general linear mixed effects models to derive scores of association with broad-leaf woodland for both datasets and compared them with scores canvassed from experts. Scores derived from occurrence and abundance data both showed strongly positive correlations with expert opinion. However, only for occurrence data did these fall within the range of correlations between experts. Data-derived scores showed regional spatial variation in the strength of butterfly associations with broad-leaf woodland, with a significant latitudinal trend in 26% of species. Sub-sampling of the data suggested a mean sample size of 5000 occurrence records per species to gain an accurate estimation of habitat association, although habitat specialists are likely to be readily detected using several hundred records. Occurrence data from recording schemes can thus provide easily obtained, objective, quantitative measures of habitat association.
Abstract:
The no response test is a new scheme in inverse problems for partial differential equations which was recently proposed in [D. R. Luke and R. Potthast, SIAM J. Appl. Math., 63 (2003), pp. 1292–1312] in the framework of inverse acoustic scattering problems. The main idea of the scheme is to construct special probing waves which are small on some test domain. Then the response for these waves is constructed. If the response is small, the unknown object is assumed to be a subset of the test domain. The response is constructed from one, several, or many particular solutions of the problem under consideration. In this paper, we investigate the convergence of the no response test for the reconstruction of information about inclusions $D$ from the Cauchy values of solutions to the Helmholtz equation on an outer surface $\partial\Omega$ with $\overline{D} \subset \Omega$. We show that the one‐wave no response test provides a criterion to test the analytic extensibility of a field. In particular, we investigate the construction of approximations for the set of singular points $N(u)$ of the total fields $u$ from one given pair of Cauchy data. Thus, the no response test solves a particular version of the classical Cauchy problem. Also, if an infinite number of fields is given, we prove that a multifield version of the no response test reconstructs the unknown inclusion $D$. This is the first convergence analysis which could be achieved for the no response test.
Abstract:
Ecological risk assessments must increasingly consider the effects of chemical mixtures on the environment as anthropogenic pollution continues to grow in complexity. Yet testing every possible mixture combination is impractical and unfeasible; thus, there is an urgent need for models that can accurately predict mixture toxicity from single-compound data. Currently, two models are frequently used to predict mixture toxicity from single-compound data: concentration addition and independent action (IA). The accuracy of the predictions generated by these models is currently debated and needs to be resolved before their use in risk assessments can be fully justified. The present study addresses this issue by determining whether the IA model adequately described the toxicity of binary mixtures of five pesticides and other environmental contaminants (cadmium, chlorpyrifos, diuron, nickel, and prochloraz) each with dissimilar modes of action on the reproduction of the nematode Caenorhabditis elegans. In three out of 10 cases, the IA model failed to describe mixture toxicity adequately, with significant synergy or antagonism being observed. In a further three cases, there was an indication of synergy, antagonism, and effect-level-dependent deviations, respectively, but these were not statistically significant. The extent of the significant deviations that were found varied, but all were such that the predicted percentage effect seen on reproductive output would have been wrong by 18 to 35% (i.e., the effect concentration expected to cause a 50% effect led to an 85% effect). The presence of such a high number and variety of deviations has important implications for the use of existing mixture toxicity models for risk assessments, especially where all or part of the deviation is synergistic.
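The IA prediction referred to here is usually written as one minus the product of the unaffected fractions. A minimal sketch, assuming effects are expressed as fractions between 0 and 1; the function name and the numbers are made up for illustration and are not data from the study.

```python
from math import prod


def independent_action(effects):
    """Predicted mixture effect under independent action (IA).

    `effects` are the fractional effects (0 to 1) that each compound
    would cause on its own at its concentration in the mixture.
    """
    return 1.0 - prod(1.0 - e for e in effects)


predicted = independent_action([0.2, 0.375])  # made-up single-compound effects
observed = 0.85                                # made-up observed mixture effect
print(predicted, observed - predicted)
```

Here the two single-compound effects combine to a predicted 50% effect. An observed 85% effect would then be a synergistic deviation of 35 percentage points, the upper end of the range quoted in the abstract.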
Abstract:
In this article, we use the no-response test idea, introduced in Luke and Potthast (2003) and Potthast (Preprint) and the inverse obstacle problem, to identify the interface of the discontinuity of the coefficient $\gamma$ of the equation $\nabla \cdot \gamma(x) \nabla + c(x)$ with piecewise regular $\gamma$ and bounded function $c(x)$. We use infinitely many Cauchy data as measurement and give a reconstructive method to localize the interface. We will base this multiwave version of the no-response test on two different proofs. The first one contains a pointwise estimate as used by the singular sources method. The second one is built on an energy (or an integral) estimate which is the basis of the probe method. As a conclusion of this, the probe and the singular sources methods are equivalent regarding their convergence and the no-response test can be seen as a unified framework for these methods. As a further contribution, we provide a formula to reconstruct the values of the jump of $\gamma(x)$, $x \in \partial D$, at the boundary. A second consequence of this formula is that the blow-up rate of the indicator functions of the probe and singular sources methods at the interface is given by the order of the singularity of the fundamental solution.