958 results for Data sets storage


Relevance:

100.00%

Publisher:

Abstract:

Master's dissertation, Informatics Engineering, Faculdade de Ciências e Tecnologia, Universidade do Algarve, 2015

Relevance:

100.00%

Publisher:

Abstract:

The P-found protein folding and unfolding simulation repository is designed to allow scientists to perform data mining and other analyses across large, distributed simulation data sets. There are two storage components in P-found: a primary repository of simulation data that is used to populate the second component, and a data warehouse that contains important molecular properties. These properties may be used for data mining studies. Here we demonstrate how grid technologies can support multiple, distributed P-found installations. In particular, we look at two aspects: firstly, how grid data management technologies can be used to access the distributed data warehouses; and secondly, how the grid can be used to transfer analysis programs to the primary repositories — this is an important and challenging aspect of P-found, due to the large data volumes involved and the desire of scientists to maintain control of their own data. The grid technologies we are developing with the P-found system will allow new large data sets of protein folding simulations to be accessed and analysed in novel ways, with significant potential for enabling scientific discovery.
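
The second aspect above is essentially a "ship the analysis to the data" pattern. The sketch below is purely illustrative and assumes a hypothetical HTTP endpoint and payload format; none of it is part of the P-found system, which the abstract says uses grid technologies rather than a plain web API.

```python
# Illustrative sketch only: the endpoint URL and payload format are
# invented for this example and are not part of the P-found system.
import json
import urllib.request

REPOSITORY_URL = "https://repository.example.org/analyses"  # hypothetical

def submit_analysis(script_path: str, dataset_id: str) -> dict:
    """Send an analysis program to the primary repository instead of
    pulling the large simulation data set to the client."""
    with open(script_path, "rb") as fh:
        payload = {
            "dataset": dataset_id,
            "script": fh.read().decode("utf-8"),
        }
    request = urllib.request.Request(
        REPOSITORY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # Only the (small) analysis result travels back over the network,
        # not the multi-terabyte simulation data.
        return json.load(response)
```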

Relevance:

100.00%

Publisher:

Abstract:

The size and complexity of data sets generated within ecosystem-level programmes merit their capture, curation, storage, analysis, synthesis and visualisation using Big Data approaches. This review looks at previous attempts to organise and analyse such data through the International Biological Programme and draws on the mistakes made and the lessons learned for effective Big Data approaches to current Research Councils United Kingdom (RCUK) ecosystem-level programmes, using Biodiversity and Ecosystem Service Sustainability (BESS) and Environmental Virtual Observatory Pilot (EVOp) as exemplars. The challenges raised by such data are identified and explored, and suggestions are made for the two major issues of extending analyses across different spatio-temporal scales and of effectively integrating quantitative and qualitative data.

Relevance:

100.00%

Publisher:

Abstract:

Data is one of the domains in grid research that deals with the storage, replication, and management of large data sets in a distributed environment. All-data-to-all-sites replication schemes, such as read-one write-all and the tree grid structure (TGS), are popular techniques for replicating and managing data in this domain. However, these techniques have weaknesses in terms of data storage capacity and data access time, because a number of sites must 'agree' in order to execute certain transactions. In this paper, we propose an all-data-to-some-sites scheme called the neighbor replication on triangular grid (NRTG) technique, in which only neighboring sites hold replicated data; this minimizes the storage capacity required while providing high update availability. The technique also tolerates failures such as server failures, site failures, and even network partitioning, using remote procedure calls (RPC).
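
A minimal sketch of the neighbor-replication idea, assuming a triangular lattice laid out on integer coordinates in which each site replicates its data only to adjacent sites. The coordinate scheme, neighbor rule and demo grid are illustrative assumptions, not the paper's exact NRTG construction.

```python
# Sketch of an all-data-to-some-sites replica placement on a triangular
# grid. The coordinate scheme and neighbour rule are assumptions made
# for illustration; the NRTG paper defines its own grid construction.
from typing import Set, Tuple

Site = Tuple[int, int]  # (row, column) position in the logical grid

def neighbours(site: Site, sites: Set[Site]) -> Set[Site]:
    """Adjacent sites on a triangular lattice (up to six of them)."""
    r, c = site
    candidates = [(r, c - 1), (r, c + 1), (r - 1, c), (r + 1, c),
                  (r - 1, c + 1), (r + 1, c - 1)]
    return {s for s in candidates if s in sites}

def replica_set(site: Site, sites: Set[Site]) -> Set[Site]:
    """A site's data lives on the site itself plus its neighbours only,
    instead of on every site as in read-one write-all (ROWA)."""
    return {site} | neighbours(site, sites)

if __name__ == "__main__":
    grid: Set[Site] = {(r, c) for r in range(4) for c in range(4)}
    owner = (1, 1)
    replicas = replica_set(owner, grid)
    print(f"{len(replicas)} of {len(grid)} sites hold the data:", sorted(replicas))
```

In this toy 16-site grid, ROWA would place a copy on all 16 sites, whereas the neighbor scheme above uses at most 7, which illustrates the storage saving the abstract refers to.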

Relevance:

100.00%

Publisher:

Abstract:

Data in grid research deals with the storage, replication, and management of large data sets in a distributed environment. All-data-to-all-sites replication schemes, such as Read-One Write-All (ROWA) and the Tree Grid Structure (TGS), are popular techniques in grids. However, these techniques have weaknesses in data storage capacity and data access time. In this paper, we propose an all-data-to-some-sites scheme called the 'Neighbour Replication on Triangular Grid' (NRTG) technique. The proposed scheme minimises the required storage capacity as well as data access time while providing high update availability. It also tolerates failures such as server and site failures.

Relevance:

100.00%

Publisher:

Abstract:

The need to estimate a particular quantile of a distribution is an important problem that frequently arises in many computer vision and signal processing applications. For example, our work was motivated by the requirements of many semiautomatic surveillance analytics systems that detect abnormalities in closed-circuit television footage using statistical models of low-level motion features. In this paper, we specifically address the problem of estimating the running quantile of a data stream when the memory for storing observations is limited. We make the following major contributions: 1) we highlight the limitations of approaches previously described in the literature that make them unsuitable for nonstationary streams; 2) we describe a novel principle for the utilization of the available storage space; 3) we introduce two novel algorithms that exploit the proposed principle in different ways; and 4) we present a comprehensive evaluation and analysis of the proposed algorithms and the existing methods in the literature on both synthetic data sets and three large real-world streams acquired in the course of operation of an existing commercial surveillance system. Our findings convincingly demonstrate that both of the proposed methods are highly successful and vastly outperform the existing alternatives. We show that the better of the two algorithms (the data-aligned histogram) exhibits far superior performance in comparison with the previously described methods, achieving more than 10 times lower estimation errors on real-world data, even when its available working memory is an order of magnitude smaller.
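
A minimal sketch of the problem setting, assuming a simple fixed-size bucket histogram with a known value range. This is not the data-aligned histogram algorithm proposed in the paper (and, unlike it, does not adapt to nonstationary streams); it only illustrates estimating a running quantile under a hard memory budget.

```python
# Illustrative fixed-memory running-quantile estimator. This is a
# generic bucket-count sketch, not the paper's data-aligned histogram.
import bisect
import random

class StreamingQuantile:
    def __init__(self, lo: float, hi: float, buckets: int = 64):
        # Memory use is fixed: `buckets` counters plus the bin edges.
        self.edges = [lo + (hi - lo) * i / buckets for i in range(buckets + 1)]
        self.counts = [0] * buckets
        self.total = 0

    def update(self, x: float) -> None:
        i = min(max(bisect.bisect_right(self.edges, x) - 1, 0),
                len(self.counts) - 1)
        self.counts[i] += 1
        self.total += 1

    def quantile(self, q: float) -> float:
        """Return an estimate of the q-th quantile from the bucket counts."""
        target = q * self.total
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if seen >= target:
                return 0.5 * (self.edges[i] + self.edges[i + 1])
        return self.edges[-1]

if __name__ == "__main__":
    est = StreamingQuantile(lo=0.0, hi=1.0)
    for _ in range(100_000):
        est.update(random.random())
    print("estimated median:", round(est.quantile(0.5), 3))
```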

Relevance:

100.00%

Publisher:

Abstract:

Data deduplication describes a class of approaches that reduce the storage capacity needed to store data or the amount of data that has to be transferred over a network. These approaches detect coarse-grained redundancies within a data set, e.g. a file system, and remove them.

One of the most important applications of data deduplication is backup storage, where these approaches are able to reduce the storage requirements to a small fraction of the logical backup data size. This thesis introduces multiple new extensions of so-called fingerprinting-based data deduplication. It starts with the presentation of a novel system design, which allows using a cluster of servers to perform exact data deduplication with small chunks in a scalable way.

Afterwards, a combination of compression approaches for an important, but often overlooked, data structure in data deduplication systems, so-called block and file recipes, is introduced. Using these compression approaches, which exploit unique properties of data deduplication systems, the size of these recipes can be reduced by more than 92% in all investigated data sets. As file recipes can occupy a significant fraction of the overall storage capacity of data deduplication systems, the compression enables significant savings.

A technique to increase the write throughput of data deduplication systems, based on the aforementioned block and file recipes, is introduced next. The novel Block Locality Caching (BLC) uses properties of block and file recipes to overcome the chunk-lookup disk bottleneck of data deduplication systems, which limits either their scalability or their throughput. The presented BLC overcomes the disk bottleneck more efficiently than existing approaches, and it is shown to be less prone to aging effects.

Finally, it is investigated whether large HPC storage systems contain redundancies that can be found by fingerprinting-based data deduplication. Over 3 PB of HPC storage data from different data sets have been analyzed. In most data sets, between 20% and 30% of the data can be classified as redundant. According to these results, future work should further investigate how data deduplication can be integrated into HPC storage systems.

This thesis presents important novel work in different areas of data deduplication research.
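
A minimal sketch of the fingerprinting idea the thesis builds on, assuming simple fixed-size chunking, SHA-256 fingerprints and an in-memory index; real systems typically use content-defined chunking and the on-disk index and caching structures discussed above, so this is an illustration of the principle rather than the thesis design.

```python
# Illustrative fingerprinting-based deduplication: split data into
# chunks, fingerprint each chunk, and store only unseen fingerprints.
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4096  # bytes; production systems often use content-defined chunking

def deduplicate(data: bytes, index: Dict[str, bytes]) -> List[str]:
    """Return the file's 'recipe': the ordered list of chunk fingerprints.
    New chunks are added to `index`; duplicate chunks only add a reference."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in index:     # chunk lookup (the disk bottleneck at scale)
            index[fingerprint] = chunk   # store the unique chunk once
        recipe.append(fingerprint)
    return recipe

if __name__ == "__main__":
    store: Dict[str, bytes] = {}
    backup_1 = b"A" * 8192 + b"B" * 4096
    backup_2 = b"A" * 8192 + b"C" * 4096   # mostly identical to backup_1
    deduplicate(backup_1, store)
    deduplicate(backup_2, store)
    logical = len(backup_1) + len(backup_2)
    physical = sum(len(chunk) for chunk in store.values())
    print(f"logical {logical} B stored as {physical} B")
```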

Relevance:

100.00%

Publisher:

Abstract:

Data sets describing the state of the Earth's atmosphere are of great importance in the atmospheric sciences. Over the last decades, the quality and sheer amount of the available data have increased significantly, resulting in a rising demand for new tools capable of handling and analysing these large, multidimensional sets of atmospheric data. The interdisciplinary work presented in this thesis covers the development and application of practical software tools and efficient algorithms from the field of computer science, with the goal of enabling atmospheric scientists to analyse and gain new insights from these large data sets. For this purpose, our tools combine novel techniques with well-established methods from different areas such as scientific visualization and data segmentation. Three practical tools are presented: two software systems (Insight and IWAL) for different types of processing and interactive visualization of data, and an efficient algorithm for data segmentation implemented as part of Insight.

Insight is a toolkit for the interactive, three-dimensional visualization and processing of large sets of atmospheric data, originally developed as a testing environment for the novel segmentation algorithm. It provides a dynamic system for combining data from different sources at runtime, a variety of data processing algorithms, and several visualization techniques. Its modular architecture and flexible scripting support led to additional applications of the software, two of which are presented: the use of Insight as a WMS (web map service) server, and the automatic production of image sequences for the visualization of cyclone simulations. The core application of Insight is the provision of the novel segmentation algorithm for the efficient detection and tracking of 3D features in large sets of atmospheric data, as well as for the precise localization of the occurring genesis, lysis, merging and splitting events. Data segmentation usually leads to a significant reduction of the size of the considered data. This enables a practical visualization of the data, statistical analyses of the features and their events, and the manual or automatic detection of interesting situations for subsequent detailed investigation. The concepts of the novel algorithm, its technical realization, and several extensions for avoiding under- and over-segmentation are discussed. As example applications, this thesis covers the setup and results of segmenting upper-tropospheric jet streams and cyclones as full 3D objects.

Finally, IWAL is presented, a web application that provides easy interactive access to meteorological data visualizations, primarily aimed at students. As a web application, it avoids the need to retrieve all input data sets and to install and operate complex visualization tools on a local machine. The main challenge in providing customizable visualizations to large numbers of simultaneous users was to find an acceptable trade-off between the available visualization options and the performance of the application. Besides the implementation details, benchmarks and the results of a user survey are presented.
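
A minimal sketch of threshold-based 3D feature segmentation of a gridded atmospheric field, using connected-component labelling from scipy.ndimage. The synthetic wind-speed field and the threshold value are assumptions for demonstration; the thesis describes its own, more elaborate algorithm with feature tracking and over-/under-segmentation handling.

```python
# Illustrative 3D segmentation of a gridded field: threshold the field
# and label connected components as "features".
import numpy as np
from scipy import ndimage

def segment_features(field: np.ndarray, threshold: float):
    """Return a label volume and feature count for values above threshold."""
    mask = field > threshold
    labels, n_features = ndimage.label(mask)  # 3D connected components
    return labels, n_features

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic (level, lat, lon) wind-speed field, stand-in for model output.
    wind_speed = rng.gamma(shape=2.0, scale=10.0, size=(20, 60, 120))
    labels, n = segment_features(wind_speed, threshold=60.0)
    counts = np.bincount(labels.ravel())[1:]  # grid cells per feature (skip background)
    print(f"{n} features; largest spans {counts.max() if n else 0} grid cells")
```

Reducing the field to a handful of labelled objects in this way is what makes the statistics and tracking mentioned in the abstract practical on large data sets.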

Relevance:

100.00%

Publisher:

Abstract:

In this study, we assess the climate mitigation potential from afforestation in a mountainous snow-rich region (Switzerland) with strongly varying environmental conditions. Using radiative forcing calculations, we quantify both the carbon sequestration potential and the effect of albedo change at high resolution. We calculate the albedo radiative forcing based on remotely sensed data sets of albedo, global radiation and snow cover. Carbon sequestration is estimated from changes in carbon stocks based on national inventories. We first estimate the spatial pattern of radiative forcing (RF) across Switzerland assuming homogeneous transitions from open land to forest. This highlights where forest expansion still exhibits climatic benefits when including the radiative forcing of albedo change. Second, given that forest expansion is currently the dominant land-use change process in the Swiss Alps, we calculate the radiative forcing that occurred between 1985 and 1997. Our results show that the net RF of forest expansion ranges from −24 W m⁻² at low elevations of the northern Prealps to 2 W m⁻² at high elevations of the Central Alps. The albedo RF increases with increasing altitude, which offsets the CO2 RF at high elevations with long snow-covered periods, high global radiation and low carbon sequestration. Albedo RF is particularly relevant during transitions from open land to open forest but not in later stages of forest development. Between 1985 and 1997, when overall forest expansion in Switzerland was approximately 4%, the albedo RF offset the CO2 RF by an average of 40%. We conclude that the albedo RF should be considered at an appropriately high resolution when estimating the climatic effect of forestation in temperate mountainous regions.
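
In radiative-forcing terms, the bookkeeping behind the abstract's net effect can be written as a simple sum; this is a sketch of the sign convention, not the paper's full methodology:

```latex
\[
\mathrm{RF_{net}} = \mathrm{RF_{CO_2}} + \mathrm{RF_{albedo}},
\qquad
\mathrm{RF_{CO_2}} < 0 \ \text{(carbon sequestration, cooling)},
\quad
\mathrm{RF_{albedo}} > 0 \ \text{(darker forest surface, warming)}.
\]
```

The reported average 40% offset for 1985-1997 then corresponds to |RF_albedo| ≈ 0.4 |RF_CO2| over that period.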

Relevance:

100.00%

Publisher:

Abstract:

Signal integration determines cell fate on the cellular level, affects cognitive processes and affective responses on the behavioural level, and is likely to be involved in the psychoneurobiological processes underlying mood disorders. Interactions between stimuli may be subject to time effects, and such time-dependencies typically lead to complex responses at both the cellular and the behavioural level. We show that both three-factor models and time series models can be used to uncover such time-dependencies. However, we argue that for short longitudinal data the three-factor modelling approach is more suitable. To illustrate both approaches, we re-analysed previously published short longitudinal data sets. We found that in human embryonic kidney 293 (HEK293) cells the interaction effect in the regulation of extracellular signal-regulated kinase (ERK) 1 signalling activation by insulin and epidermal growth factor is subject to a time effect and decays dramatically at peak values of ERK activation. In contrast, we found that the interaction effect induced by hypoxia and tumour necrosis factor-alpha on the transcriptional activity of the human cyclo-oxygenase-2 promoter in HEK293 cells is time-invariant, at least in the first 12-hour window after stimulation. Furthermore, we applied the three-factor model to previously reported animal studies in which memory storage was found to be subject to an interaction effect of the beta-adrenoceptor agonist clenbuterol and certain antagonists acting on the alpha-1-adrenoceptor/glucocorticoid-receptor system. Our model-based analysis suggests that the interaction effect is only relevant if the antagonist drug is administered within a critical time window.
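
A minimal sketch of a three-factor analysis of the kind described above, using a synthetic data set with two stimuli and a time factor. The formula-based linear model with a three-way interaction term is a generic stand-in, not the authors' specific model; a significant A:B:time term would indicate that the stimulus interaction changes over time.

```python
# Illustrative three-factor model: response ~ stimulus A x stimulus B x time.
# The data are synthetic, with an A-B interaction that decays over time.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "A": rng.integers(0, 2, n),          # stimulus A present/absent
    "B": rng.integers(0, 2, n),          # stimulus B present/absent
    "time": rng.choice([5, 15, 30], n),  # minutes after stimulation
})
# Simulated response whose A:B interaction decays with time (plus noise).
df["response"] = (df.A + df.B + df.A * df.B * np.exp(-df.time / 10)
                  + rng.normal(0, 0.3, n))

model = smf.ols("response ~ A * B * time", data=df).fit()
print("A:B:time coefficient:", round(model.params["A:B:time"], 4),
      " p-value:", round(model.pvalues["A:B:time"], 4))
```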

Relevance:

100.00%

Publisher:

Abstract:

Thesis (Ph.D.)--University of Washington, 2016-08

Relevance:

100.00%

Publisher:

Abstract:

In today's big data world, data is being produced in massive volumes, at great velocity, and from a variety of sources such as mobile devices, sensors, a plethora of small devices hooked to the internet (the Internet of Things), social networks, communication networks and many others. Interactive querying and large-scale analytics are increasingly being used to derive value out of this big data. A large portion of this data is stored and processed in the Cloud due to the advantages the Cloud provides, such as scalability, elasticity, availability, low cost of ownership and the overall economies of scale. There is thus a growing need for large-scale cloud-based data management systems that can support real-time ingest, storage and processing of large volumes of heterogeneous data. However, in the pay-as-you-go Cloud environment, the cost of analytics can grow linearly with the time and resources required. Reducing the cost of data analytics in the Cloud thus remains a primary challenge. In my dissertation research, I have focused on building efficient and cost-effective cloud-based data management systems for different application domains that are predominant in cloud computing environments.

In the first part of my dissertation, I address the problem of reducing the cost of transactional workloads on relational databases to support database-as-a-service in the Cloud. The primary challenges in supporting such workloads include choosing how to partition the data across a large number of machines, minimizing the number of distributed transactions, providing high data availability, and tolerating failures gracefully. I have designed, built and evaluated SWORD, an end-to-end scalable online transaction processing system that utilizes workload-aware data placement and replication to minimize the number of distributed transactions, and that incorporates a suite of novel techniques to significantly reduce the overheads incurred both during the initial placement of data and during query execution at runtime.

In the second part of my dissertation, I focus on sampling-based progressive analytics as a means to reduce the cost of data analytics in the relational domain. Sampling has traditionally been used by data scientists to get progressive answers to complex analytical tasks over large volumes of data. Typically, this involves manually extracting samples of increasing data size (progressive samples) for exploratory querying. This provides the data scientists with user control, repeatable semantics, and result provenance. However, such solutions result in tedious workflows that preclude the reuse of work across samples. On the other hand, existing approximate query processing systems report early results, but do not offer the above benefits for complex ad-hoc queries. I propose a new progressive data-parallel computation framework, NOW!, that provides support for progressive analytics over big data. In particular, NOW! enables progressive relational (SQL) query support in the Cloud using unique progress semantics that allow efficient and deterministic query processing over samples, providing meaningful early results and provenance to data scientists. NOW! enables the provision of early results using significantly fewer resources, thereby enabling a substantial reduction in the cost incurred during such analytics.

Finally, I propose NSCALE, a system for efficient and cost-effective complex analytics on large-scale graph-structured data in the Cloud.
The system is based on the key observation that a wide range of complex analysis tasks over graph data require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs; examples include ego-network analysis, motif counting in biological networks, finding social circles in social networks, personalized recommendations, link prediction, etc. These tasks are not well served by existing vertex-centric graph processing frameworks, whose computation and execution models limit the user program to directly accessing the state of a single vertex, resulting in high execution overheads. Further, the lack of support for extracting the relevant portions of the graph that are of interest to an analysis task and loading them into distributed memory leads to poor scalability. NSCALE allows users to write programs at the level of neighborhoods or subgraphs rather than at the level of vertices, and to declaratively specify the subgraphs of interest. It enables the efficient distributed execution of these neighborhood-centric complex analysis tasks over large-scale graphs, while minimizing resource consumption and communication cost, thereby substantially reducing the overall cost of graph data analytics in the Cloud. The results of our extensive experimental evaluation of these prototypes with several real-world data sets and applications validate the effectiveness of our techniques, which provide orders-of-magnitude reductions in the overheads of distributed data querying and analysis in the Cloud.
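
A minimal sketch of sampling-based progressive analytics as described for NOW!: the same aggregate is recomputed over progressively larger samples, so early approximate answers arrive long before a full scan finishes. The query, sample sizes and synthetic data are illustrative assumptions, not the NOW! implementation.

```python
# Illustrative progressive aggregation: estimate an average over
# progressively larger samples of a large column, reporting early results.
import numpy as np

rng = np.random.default_rng(7)
revenue = rng.lognormal(mean=3.0, sigma=1.0, size=10_000_000)  # stand-in "big" column

def progressive_mean(values: np.ndarray, sample_sizes):
    """Yield (sample_size, estimate) pairs; each sample extends the last,
    so earlier work is reused rather than thrown away."""
    order = rng.permutation(len(values))         # one shared random order
    running_sum, taken = 0.0, 0
    for size in sample_sizes:
        new_idx = order[taken:size]
        running_sum += float(values[new_idx].sum())
        taken = size
        yield size, running_sum / taken

if __name__ == "__main__":
    for size, estimate in progressive_mean(revenue, [10_000, 100_000, 1_000_000]):
        print(f"after {size:>9,} rows: mean ~ {estimate:.3f}")
    print(f"exact mean: {revenue.mean():.3f}")
```

Reusing one shared random order gives nested progressive samples with repeatable semantics, which is the property the abstract highlights for data scientists.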

Relevance:

90.00%

Publisher:

Abstract:

Sequences of two chloroplast photosystem genes, psaA and psbB, together comprising about 3,500 bp, were obtained for all five major groups of extant seed plants and several outgroups among other vascular plants. Strongly supported, but significantly conflicting, phylogenetic signals were obtained in parsimony analyses from partitions of the data into first and second codon positions versus third positions. In the former, both genes supported monophyletic gymnosperms, with Gnetales closely related to certain conifers. In the latter, Gnetales are inferred to be the sister group of all other seed plants, with gymnosperms paraphyletic. None of the data supported the modern "anthophyte hypothesis," which places Gnetales as the sister group of flowering plants. A series of simulation studies were undertaken to examine the error rate for parsimony inference. Three kinds of errors were examined: random error, systematic bias (both properties of finite data sets), and statistical inconsistency owing to long-branch attraction (an asymptotic property). Parsimony reconstructions were extremely biased for third-position data for psbB. Regardless of the true underlying tree, a tree in which Gnetales are sister to all other seed plants was likely to be reconstructed from these data. None of the combinations of genes or partitions permits the anthophyte tree to be reconstructed with high probability. Simulations of progressively larger data sets indicate the existence of long-branch attraction (statistical inconsistency) for third-position psbB data if either the anthophyte tree or the gymnosperm tree is correct. This is also true for the anthophyte tree using either psaA third positions or psbB first and second positions. A factor contributing to bias and inconsistency is the extremely short branches at the base of the seed plant radiation, coupled with extremely high rates in Gnetales and non-seed-plant outgroups.
M. J. Sanderson, M. F. Wojciechowski, J.-M. Hu, T. Sher Khan, and S. G. Brady

Relevance:

90.00%

Publisher:

Abstract:

1. Ecological data sets often involve clustered measurements or repeated sampling in a longitudinal design. Choosing the correct covariance structure is an important step in the analysis of such data, as the covariance describes the degree of similarity among the repeated observations. 2. Three methods for choosing the covariance structure are the Akaike information criterion (AIC), the quasi-information criterion (QIC), and the deviance information criterion (DIC). We compared the methods using a simulation study and a data set that explored the effects of forest fragmentation on avian species richness over 15 years. 3. The overall success rate was 80.6% for the AIC, 29.4% for the QIC and 81.6% for the DIC. For the forest fragmentation study, the AIC and DIC selected the unstructured covariance, whereas the QIC selected the simpler autoregressive covariance. Graphical diagnostics suggested that the unstructured covariance was probably correct. 4. We recommend using the DIC for selecting the correct covariance structure.
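
For reference, the two criteria that performed well are standard information criteria; their usual textbook definitions (a reminder, not taken from this paper) are:

```latex
\[
\mathrm{AIC} = -2\log L(\hat{\theta}) + 2k,
\qquad
\mathrm{DIC} = \overline{D(\theta)} + p_D,
\quad
p_D = \overline{D(\theta)} - D(\bar{\theta}),
\]
```

where L(θ̂) is the maximised likelihood, k the number of estimated parameters, D(θ) = −2 log L(θ) the deviance, and p_D the effective number of parameters; for a set of candidate covariance structures, lower values indicate better support.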

Relevance:

90.00%

Publisher:

Abstract:

An educational priority of many nations is to enhance mathematical learning in early childhood. One area in need of special attention is that of statistics. This paper argues for a renewed focus on statistical reasoning in the beginning school years, with opportunities for children to engage in data modelling activities. Such modelling involves investigations of meaningful phenomena, deciding what is worthy of attention (i.e., identifying complex attributes), and then progressing to organising, structuring, visualising, and representing data. Results are reported from the first year of a three-year longitudinal study in which three classes of first-grade children and their teachers engaged in activities that required the creation of data models. The theme of “Looking after our Environment,” a component of the children’s science curriculum at the time, provided the context for the activities. Findings focus on how the children dealt with given complex attributes and how they generated their own attributes in classifying broad data sets, and the nature of the models the children created in organising, structuring, and representing their data.