876 resultados para Distributed data
Resumo:
Context awareness, dynamic reconfiguration at runtime and heterogeneity are key characteristics of future distributed systems, particularly in ubiquitous and mobile computing scenarios. The main contributions of this dissertation are theoretical as well as architectural concepts facilitating information exchange and fusion in heterogeneous and dynamic distributed environments. Our main focus is on bridging the heterogeneity issues and, at the same time, considering uncertain, imprecise and unreliable sensor information in information fusion and reasoning approaches. A domain ontology is used to establish a common vocabulary for the exchanged information. We thereby explicitly support different representations for the same kind of information and provide Inter-Representation Operations that convert between them. Special account is taken of the conversion of associated meta-data that express uncertainty and impreciseness. The Unscented Transformation, for example, is applied to propagate Gaussian normal distributions across highly non-linear Inter-Representation Operations. Uncertain sensor information is fused using the Dempster-Shafer Theory of Evidence as it allows explicit modelling of partial and complete ignorance. We also show how to incorporate the Dempster-Shafer Theory of Evidence into probabilistic reasoning schemes such as Hidden Markov Models in order to be able to consider the uncertainty of sensor information when deriving high-level information from low-level data. For all these concepts we provide architectural support as a guideline for developers of innovative information exchange and fusion infrastructures that are particularly targeted at heterogeneous dynamic environments. Two case studies serve as proof of concept. The first case study focuses on heterogeneous autonomous robots that have to spontaneously form a cooperative team in order to achieve a common goal. The second case study is concerned with an approach for user activity recognition which serves as baseline for a context-aware adaptive application. Both case studies demonstrate the viability and strengths of the proposed solution and emphasize that the Dempster-Shafer Theory of Evidence should be preferred to pure probability theory in applications involving non-linear Inter-Representation Operations.
Resumo:
Linear graph reduction is a simple computational model in which the cost of naming things is explicitly represented. The key idea is the notion of "linearity". A name is linear if it is only used once, so with linear naming you cannot create more than one outstanding reference to an entity. As a result, linear naming is cheap to support and easy to reason about. Programs can be translated into the linear graph reduction model such that linear names in the program are implemented directly as linear names in the model. Nonlinear names are supported by constructing them out of linear names. The translation thus exposes those places where the program uses names in expensive, nonlinear ways. Two applications demonstrate the utility of using linear graph reduction: First, in the area of distributed computing, linear naming makes it easy to support cheap cross-network references and highly portable data structures, Linear naming also facilitates demand driven migration of tasks and data around the network without requiring explicit guidance from the programmer. Second, linear graph reduction reveals a new characterization of the phenomenon of state. Systems in which state appears are those which depend on certain -global- system properties. State is not a localizable phenomenon, which suggests that our usual object oriented metaphor for state is flawed.
Resumo:
We consider the often-studied problem of sorting, for a parallel computer. Given an input array distributed evenly over p processors, the task is to compute the sorted output array, also distributed over the p processors. Many existing algorithms take the approach of approximately load-balancing the output, leaving each processor with Θ(n/p) elements. However, in many cases, approximate load-balancing leads to inefficiencies in both the sorting itself and in further uses of the data after sorting. We provide a deterministic parallel sorting algorithm that uses parallel selection to produce any output distribution exactly, particularly one that is perfectly load-balanced. Furthermore, when using a comparison sort, this algorithm is 1-optimal in both computation and communication. We provide an empirical study that illustrates the efficiency of exact data splitting, and shows an improvement over two sample sort algorithms.
Resumo:
One of the tantalising remaining problems in compositional data analysis lies in how to deal with data sets in which there are components which are essential zeros. By an essential zero we mean a component which is truly zero, not something recorded as zero simply because the experimental design or the measuring instrument has not been sufficiently sensitive to detect a trace of the part. Such essential zeros occur in many compositional situations, such as household budget patterns, time budgets, palaeontological zonation studies, ecological abundance studies. Devices such as nonzero replacement and amalgamation are almost invariably ad hoc and unsuccessful in such situations. From consideration of such examples it seems sensible to build up a model in two stages, the first determining where the zeros will occur and the second how the unit available is distributed among the non-zero parts. In this paper we suggest two such models, an independent binomial conditional logistic normal model and a hierarchical dependent binomial conditional logistic normal model. The compositional data in such modelling consist of an incidence matrix and a conditional compositional matrix. Interesting statistical problems arise, such as the question of estimability of parameters, the nature of the computational process for the estimation of both the incidence and compositional parameters caused by the complexity of the subcompositional structure, the formation of meaningful hypotheses, and the devising of suitable testing methodology within a lattice of such essential zero-compositional hypotheses. The methodology is illustrated by application to both simulated and real compositional data
Resumo:
Flood modelling of urban areas is still at an early stage, partly because until recently topographic data of sufficiently high resolution and accuracy have been lacking in urban areas. However, Digital Surface Models (DSMs) generated from airborne scanning laser altimetry (LiDAR) having sub-metre spatial resolution have now become available, and these are able to represent the complexities of urban topography. The paper describes the development of a LiDAR post-processor for urban flood modelling based on the fusion of LiDAR and digital map data. The map data are used in conjunction with LiDAR data to identify different object types in urban areas, though pattern recognition techniques are also employed. Post-processing produces a Digital Terrain Model (DTM) for use as model bathymetry, and also a friction parameter map for use in estimating spatially-distributed friction coefficients. In vegetated areas, friction is estimated from LiDAR-derived vegetation height, and (unlike most vegetation removal software) the method copes with short vegetation less than ~1m high, which may occupy a substantial fraction of even an urban floodplain. The DTM and friction parameter map may also be used to help to generate an unstructured mesh of a vegetated urban floodplain for use by a 2D finite element model. The mesh is decomposed to reflect floodplain features having different frictional properties to their surroundings, including urban features such as buildings and roads as well as taller vegetation features such as trees and hedges. This allows a more accurate estimation of local friction. The method produces a substantial node density due to the small dimensions of many urban features.
Resumo:
Flooding is a major hazard in both rural and urban areas worldwide, but it is in urban areas that the impacts are most severe. An investigation of the ability of high resolution TerraSAR-X data to detect flooded regions in urban areas is described. An important application for this would be the calibration and validation of the flood extent predicted by an urban flood inundation model. To date, research on such models has been hampered by lack of suitable distributed validation data. The study uses a 3m resolution TerraSAR-X image of a 1-in-150 year flood near Tewkesbury, UK, in 2007, for which contemporaneous aerial photography exists for validation. The DLR SETES SAR simulator was used in conjunction with airborne LiDAR data to estimate regions of the TerraSAR-X image in which water would not be visible due to radar shadow or layover caused by buildings and taller vegetation, and these regions were masked out in the flood detection process. A semi-automatic algorithm for the detection of floodwater was developed, based on a hybrid approach. Flooding in rural areas adjacent to the urban areas was detected using an active contour model (snake) region-growing algorithm seeded using the un-flooded river channel network, which was applied to the TerraSAR-X image fused with the LiDAR DTM to ensure the smooth variation of heights along the reach. A simpler region-growing approach was used in the urban areas, which was initialized using knowledge of the flood waterline in the rural areas. Seed pixels having low backscatter were identified in the urban areas using supervised classification based on training areas for water taken from the rural flood, and non-water taken from the higher urban areas. Seed pixels were required to have heights less than a spatially-varying height threshold determined from nearby rural waterline heights. Seed pixels were clustered into urban flood regions based on their close proximity, rather than requiring that all pixels in the region should have low backscatter. This approach was taken because it appeared that urban water backscatter values were corrupted in some pixels, perhaps due to contributions from side-lobes of strong reflectors nearby. The TerraSAR-X urban flood extent was validated using the flood extent visible in the aerial photos. It turned out that 76% of the urban water pixels visible to TerraSAR-X were correctly detected, with an associated false positive rate of 25%. If all urban water pixels were considered, including those in shadow and layover regions, these figures fell to 58% and 19% respectively. These findings indicate that TerraSAR-X is capable of providing useful data for the calibration and validation of urban flood inundation models.
Resumo:
GODIVA2 is a dynamic website that provides visual access to several terabytes of physically distributed, four-dimensional environmental data. It allows users to explore large datasets interactively without the need to install new software or download and understand complex data. Through the use of open international standards, GODIVA2 maintains a high level of interoperability with third-party systems, allowing diverse datasets to be mutually compared. Scientists can use the system to search for features in large datasets and to diagnose the output from numerical simulations and data processing algorithms. Data providers around Europe have adopted GODIVA2 as an INSPIRE-compliant dynamic quick-view system for providing visual access to their data.
Resumo:
Matheron's usual variogram estimator can result in unreliable variograms when data are strongly asymmetric or skewed. Asymmetry in a distribution can arise from a long tail of values in the underlying process or from outliers that belong to another population that contaminate the primary process. This paper examines the effects of underlying asymmetry on the variogram and on the accuracy of prediction, and the second one examines the effects arising from outliers. Standard geostatistical texts suggest ways of dealing with underlying asymmetry; however, this is based on informed intuition rather than detailed investigation. To determine whether the methods generally used to deal with underlying asymmetry are appropriate, the effects of different coefficients of skewness on the shape of the experimental variogram and on the model parameters were investigated. Simulated annealing was used to create normally distributed random fields of different size from variograms with different nugget:sill ratios. These data were then modified to give different degrees of asymmetry and the experimental variogram was computed in each case. The effects of standard data transformations on the form of the variogram were also investigated. Cross-validation was used to assess quantitatively the performance of the different variogram models for kriging. The results showed that the shape of the variogram was affected by the degree of asymmetry, and that the effect increased as the size of data set decreased. Transformations of the data were more effective in reducing the skewness coefficient in the larger sets of data. Cross-validation confirmed that variogram models from transformed data were more suitable for kriging than were those from the raw asymmetric data. The results of this study have implications for the 'standard best practice' in dealing with asymmetry in data for geostatistical analyses. (C) 2007 Elsevier Ltd. All rights reserved.
Resumo:
Asymmetry in a distribution can arise from a long tail of values in the underlying process or from outliers that belong to another population that contaminate the primary process. The first paper of this series examined the effects of the former on the variogram and this paper examines the effects of asymmetry arising from outliers. Simulated annealing was used to create normally distributed random fields of different size that are realizations of known processes described by variograms with different nugget:sill ratios. These primary data sets were then contaminated with randomly located and spatially aggregated outliers from a secondary process to produce different degrees of asymmetry. Experimental variograms were computed from these data by Matheron's estimator and by three robust estimators. The effects of standard data transformations on the coefficient of skewness and on the variogram were also investigated. Cross-validation was used to assess the performance of models fitted to experimental variograms computed from a range of data contaminated by outliers for kriging. The results showed that where skewness was caused by outliers the variograms retained their general shape, but showed an increase in the nugget and sill variances and nugget:sill ratios. This effect was only slightly more for the smallest data set than for the two larger data sets and there was little difference between the results for the latter. Overall, the effect of size of data set was small for all analyses. The nugget:sill ratio showed a consistent decrease after transformation to both square roots and logarithms; the decrease was generally larger for the latter, however. Aggregated outliers had different effects on the variogram shape from those that were randomly located, and this also depended on whether they were aggregated near to the edge or the centre of the field. The results of cross-validation showed that the robust estimators and the removal of outliers were the most effective ways of dealing with outliers for variogram estimation and kriging. (C) 2007 Elsevier Ltd. All rights reserved.
Resumo:
In molecular biology, it is often desirable to find common properties in large numbers of drug candidates. One family of methods stems from the data mining community, where algorithms to find frequent graphs have received increasing attention over the past years. However, the computational complexity of the underlying problem and the large amount of data to be explored essentially render sequential algorithms useless. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. This problem is characterized by a highly irregular search tree, whereby no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely, a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiverinitiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute’s HIV-screening data set, where we were able to show close-to linear speedup in a network of workstations. The proposed approach also allows for dynamic resource aggregation in a non dedicated computational environment. These features make it suitable for large-scale, multi-domain, heterogeneous environments, such as computational grids.
Resumo:
Recently, two approaches have been introduced that distribute the molecular fragment mining problem. The first approach applies a master/worker topology, the second approach, a completely distributed peer-to-peer system, solves the scalability problem due to the bottleneck at the master node. However, in many real world scenarios the participating computing nodes cannot communicate directly due to administrative policies such as security restrictions. Thus, potential computing power is not accessible to accelerate the mining run. To solve this shortcoming, this work introduces a hierarchical topology of computing resources, which distributes the management over several levels and adapts to the natural structure of those multi-domain architectures. The most important aspect is the load balancing scheme, which has been designed and optimized for the hierarchical structure. The approach allows dynamic aggregation of heterogenous computing resources and is applied to wide area network scenarios.
Resumo:
This paper presents a simple Bayesian approach to sample size determination in clinical trials. It is required that the trial should be large enough to ensure that the data collected will provide convincing evidence either that an experimental treatment is better than a control or that it fails to improve upon control by some clinically relevant difference. The method resembles standard frequentist formulations of the problem, and indeed in certain circumstances involving 'non-informative' prior information it leads to identical answers. In particular, unlike many Bayesian approaches to sample size determination, use is made of an alternative hypothesis that an experimental treatment is better than a control treatment by some specified magnitude. The approach is introduced in the context of testing whether a single stream of binary observations are consistent with a given success rate p(0). Next the case of comparing two independent streams of normally distributed responses is considered, first under the assumption that their common variance is known and then for unknown variance. Finally, the more general situation in which a large sample is to be collected and analysed according to the asymptotic properties of the score statistic is explored. Copyright (C) 2007 John Wiley & Sons, Ltd.
Resumo:
The monophyly of the Peltophorum group, one of nine informal groups recognized by Polhill in the Caesalpinieae, was tested using sequence data from the trnL-F, rbcL, and rps16 regions of the chloroplast genome. Exemplars were included from all 16 genera of the Peltophorum group, and from 15 genera representing seven of the other eight informal groups in the tribe. The data were analyzed separately and in combined analyses using parsimony and Bayesian methods. The analysis method had little effect on the topology of well-supported relationships. The molecular data recovered a generally well-supported phylogeny with many intergeneric relationships resolved. Results show that the Peltophorum group as currently delimited is polyphyletic, but that eight genera plus one undescribed genus form a core Peltophorum group, which is referred to here as the Peltophorum group sensu stricto. These genera are Bussea, Conzattia, Colvillea, Delonix, Heteroflorum (inedit.), Lemuropisum, Parkinsonia, Peltophorum, and Schizolobium. The remaining eight genera of the Peltophorum group s.l. are distributed across the Caesalpinieae. Morphological support for the redelimited Peltophorum group and the other recovered clades was assessed, and no unique synapomorphy was found for the Peltophorum group s.s. A proposal for the reclassification of the Peltophorum group s.l. is presented.
Resumo:
Heterogeneity in lifetime data may be modelled by multiplying an individual's hazard by an unobserved frailty. We test for the presence of frailty of this kind in univariate and bivariate data with Weibull distributed lifetimes, using statistics based on the ordered Cox-Snell residuals from the null model of no frailty. The form of the statistics is suggested by outlier testing in the gamma distribution. We find through simulation that the sum of the k largest or k smallest order statistics, for suitably chosen k , provides a powerful test when the frailty distribution is assumed to be gamma or positive stable, respectively. We provide recommended values of k for sample sizes up to 100 and simple formulae for estimated critical values for tests at the 5% level.
Resumo:
Physiological parameters measured by an embedded body sensor system were demonstrated to respond to changes of the air temperature in an office environment. The thermal parameters were monitored with the use of a wireless sensor system that made possible to turn any existing room into a field laboratory. Two human subjects were monitored over daily activities and at various steady-state thermal conditions when the air temperature of the room was altered from 22-23°C to 25-28°C. The subjects indicated their thermal feeling on questionnaires. The measured skin temperature was distributed close to the calculated mean skin temperature corresponding to the given activity level. The variation of Galvanic Skin Response (GSR) reflected the evaporative heat loss through the body surfaces and indicated whether sweating occurred on the subjects. Further investigations are needed to fully evaluate the influence of thermal and other factors on the output given by the investigated body sensor system.