987 results for data replication


Relevance:

70.00%

Publisher:

Abstract:

This thesis presents a study of Grid data access patterns in distributed analysis in the CMS experiment at the LHC accelerator. The study ranges from a deep analysis of the historical access patterns for the most relevant data types in CMS, to the use of a supervised Machine Learning classification system to set up machinery able to predict future data access patterns - i.e. the so-called “popularity” of CMS datasets on the Grid - with a focus on specific data types. All CMS workflows run on the Worldwide LHC Computing Grid (WLCG) computing centers (Tiers), and in particular the distributed analysis system supports hundreds of users and the applications (or “jobs”) they submit every day. These jobs access different data types hosted on disk storage systems at a large set of WLCG Tiers. A detailed study of how this data is accessed - by data type, hosting Tier, and time period - provides valuable insight into storage occupancy over time and into the different access patterns, and ultimately allows suggested actions to be extracted from this information (e.g. targeted disk clean-up and/or data replication). In this sense, applying Machine Learning techniques makes it possible to learn from past data and to gain predictive power over future CMS data access patterns. Chapter 1 provides an introduction to High Energy Physics at the LHC. Chapter 2 describes the CMS Computing Model, with special focus on the data management sector, and also discusses the concept of dataset popularity. Chapter 3 describes the study of CMS data access patterns at different levels of depth. Chapter 4 offers a brief introduction to basic machine learning concepts, introduces their application in CMS, and discusses the results obtained with this approach in the context of this thesis.
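
As a rough illustration of the supervised-classification idea described above (not the thesis' actual pipeline), a dataset-popularity predictor could be sketched as follows; the features, thresholds, and numbers are hypothetical:

```python
# Minimal sketch: a classifier that labels a dataset as "popular" (1) or not (0)
# in the next time window from aggregated past-access features.
# Feature names and values are hypothetical, not the thesis' real inputs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-dataset features:
# [accesses_last_week, unique_users, bytes_read, weeks_since_creation]
X = np.array([[120, 15, 3.2e12,  4],
              [  3,  1, 1.0e10, 60],
              [450, 40, 9.8e12,  2],
              [  0,  0, 0.0,    80]])
y = np.array([1, 0, 1, 0])  # label: accessed above some threshold next week

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

new_dataset = np.array([[200, 25, 5.0e12, 3]])  # hypothetical recent activity
print(clf.predict(new_dataset))                 # predicted popularity label
```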

Relevance:

60.00%

Publisher:

Abstract:

Vertebral fracture risk is a heritable complex trait. The aim of this study was to identify genetic susceptibility factors for osteoporotic vertebral fractures using a genome-wide association study (GWAS) approach. The GWAS discovery was based on the Rotterdam Study, a population-based study of elderly Dutch individuals aged >55 years, comprising 329 cases and 2666 controls with radiographic scoring (McCloskey-Kanis) and genetic data. Replication of one top-associated SNP was pursued by de-novo genotyping of 15 independent studies across Europe, the United States, and Australia, and one Asian study. Radiographic vertebral fracture assessment was performed using the McCloskey-Kanis or Genant semi-quantitative definitions. SNPs were analyzed in relation to vertebral fracture using logistic regression models adjusted for age and sex. Fixed-effects inverse-variance and Han-Eskin alternative random-effects meta-analyses were applied. Genome-wide significance was set at p < 5×10^-8. In the discovery, a SNP (rs11645938) on chromosome 16q24 was associated with the risk of vertebral fractures at p = 4.6×10^-8. However, the association was not significant across 5720 cases and 21,791 controls from 14 studies. The fixed-effects meta-analysis summary estimate was 1.06 (95% CI: 0.98-1.14; p = 0.17), with a high degree of heterogeneity (I² = 57%; Qhet p = 0.0006). Under the Han-Eskin alternative random-effects model the summary effect was significant (p = 0.0005). The SNP maps to a region previously found to be associated with lumbar spine bone mineral density (LS-BMD) in two large meta-analyses from the GEFOS consortium. A false-positive association in the GWAS discovery cannot be excluded; yet the low power of the discovery and replication settings (adequate to identify a risk effect size >1.25) may still be consistent with an effect size <1.10, more typical of complex traits. A larger effort, with studies using standardized phenotype definitions, is needed to confirm or reject the involvement of this locus in the risk of vertebral fractures.
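
For context, the fixed-effects inverse-variance meta-analysis mentioned above can be sketched as follows; the per-study estimates below are made up for illustration and are not the study's data:

```python
# Illustrative sketch of a fixed-effects inverse-variance meta-analysis of
# per-study log-odds ratios. The (log OR, SE) pairs are invented.
import math

studies = [(0.08, 0.05), (0.02, 0.07), (0.11, 0.09)]  # hypothetical cohorts

weights = [1.0 / se**2 for _, se in studies]                      # inverse-variance weights
pooled_log_or = sum(w * b for (b, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))
z = pooled_log_or / pooled_se
p = math.erfc(abs(z) / math.sqrt(2.0))                            # two-sided p, normal approximation

print(f"pooled OR = {math.exp(pooled_log_or):.3f}, "
      f"95% CI = ({math.exp(pooled_log_or - 1.96*pooled_se):.3f}, "
      f"{math.exp(pooled_log_or + 1.96*pooled_se):.3f}), p = {p:.3g}")
```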

Relevance:

60.00%

Publisher:

Abstract:

Erasure codes are an efficient means of storing data across a network in comparison to data replication, as they tend to reduce the amount of data stored in the network and offer increased resilience in the presence of node failures. These codes perform poorly, however, when a failed node must be repaired, as they typically require the entire file to be downloaded to repair a single node. A new class of erasure codes, termed regenerating codes, was recently introduced that does much better in this respect. However, given the variety of efficient erasure codes available in the literature, there is considerable interest in constructing coding schemes that allow traditional erasure codes to be used while retaining the feature that only a fraction of the data need be downloaded for node repair. In this paper, we present a simple, yet powerful, framework that does precisely this. Under this framework, the nodes are partitioned into two types and encoded using two codes in a manner that reduces the problem of node repair to that of erasure decoding of the constituent codes. Depending upon the choice of the two codes, the framework can be used to obtain one or more of the following advantages: simultaneous minimization of storage space and repair bandwidth, low complexity of operation, fewer disk reads at helper nodes during repair, and error detection and correction.
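
As a toy illustration of the repair-bandwidth drawback mentioned above (and not the paper's two-layer construction), a single-parity erasure code repairs a lost block only after reading all surviving blocks:

```python
# Toy (k+1, k) single-parity erasure code. Repairing one lost block requires
# downloading all k surviving blocks -- the cost that regenerating codes and
# the framework described above aim to reduce. Block contents are made up.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

data = [b"node0-db", b"node1-db", b"node2-db"]   # k = 3 data blocks
parity = xor_blocks(data)                        # stored on a fourth node

lost_index = 1                                   # node 1 fails
survivors = [blk for i, blk in enumerate(data) if i != lost_index] + [parity]
repaired = xor_blocks(survivors)                 # must read every survivor
assert repaired == data[lost_index]
print("repaired:", repaired)
```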

Relevance:

60.00%

Publisher:

Abstract:

Increasingly, infrastructure providers are supplying the cloud marketplace with storage and on-demand compute resources on which to host cloud applications. From an application user's point of view, it is desirable to identify the most appropriate set of available resources on which to execute an application. Resource choice can be complex and may involve comparing available hardware specifications, operating systems, value-added services such as network configuration or data replication, and operating costs such as hosting cost and data throughput. Providers' cost models often change, and new commodity cost models, such as spot pricing, have been introduced to offer significant savings. In this paper, a software abstraction layer is used to discover infrastructure resources for a particular application, across multiple providers, using a two-phase constraints-based approach. In the first phase, a set of possible infrastructure resources is identified for a given application. In the second phase, a heuristic is used to select the most appropriate resources from the initial set. For some applications a cost-based heuristic is most appropriate; for others a performance-based heuristic may be used. A financial services application and a high performance computing application are used to illustrate the execution of the proposed resource discovery mechanism. The experimental results show that the proposed model can dynamically select an appropriate set of resources that match the application's requirements.
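
A minimal sketch of the two-phase idea, with a hypothetical resource catalogue and heuristics (not the paper's implementation), might look like this:

```python
# Phase 1: filter offerings against hard constraints.
# Phase 2: rank the survivors with a pluggable heuristic (cost- or performance-based).
# The catalogue, field names, and heuristics below are hypothetical.
offerings = [
    {"provider": "A", "cores": 8,  "mem_gb": 32, "cost_per_hour": 0.40, "spot": True},
    {"provider": "B", "cores": 16, "mem_gb": 64, "cost_per_hour": 1.10, "spot": False},
    {"provider": "C", "cores": 4,  "mem_gb": 16, "cost_per_hour": 0.20, "spot": True},
]

def phase_one(catalogue, min_cores, min_mem_gb):
    """Keep only resources satisfying the application's constraints."""
    return [r for r in catalogue if r["cores"] >= min_cores and r["mem_gb"] >= min_mem_gb]

def phase_two(candidates, heuristic):
    """Pick the most appropriate candidate under the given heuristic."""
    return min(candidates, key=heuristic) if candidates else None

cost_heuristic = lambda r: r["cost_per_hour"]            # e.g. financial-services app
perf_heuristic = lambda r: -(r["cores"] * r["mem_gb"])   # e.g. HPC app

candidates = phase_one(offerings, min_cores=8, min_mem_gb=32)
print(phase_two(candidates, cost_heuristic))
print(phase_two(candidates, perf_heuristic))
```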

Relevance:

60.00%

Publisher:

Abstract:

Mobile computing has enabled users to seamlessly access databases even when they are on the move. However, in the absence of readily available high-quality communication, users are often forced to operate disconnected from the network. As a result, software applications have to be redesigned to take advantage of this environment while accommodating the new challenges posed by mobility. In particular, there is a need for replication and synchronization services in order to guarantee availability of data and functionality (including updates) in disconnected mode. To this end we propose a scalable and highly available data replication and management service. The proposed replication technique is compared with a baseline replication technique and shown to exhibit high availability, fault tolerance, and minimal access times for data and services, which are very important in an environment with low-quality communication links.
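
A minimal sketch of the general idea (not the paper's actual protocol): a local replica serves reads and writes while disconnected, queues its updates, and synchronizes them on reconnect. All names here are illustrative:

```python
# Disconnected-operation sketch: data stays available locally; updates made
# offline are queued and pushed to the server replica when connectivity returns.
class Replica:
    def __init__(self):
        self.data = {}
        self.pending = []          # updates made while disconnected
        self.connected = False

    def write(self, key, value):
        self.data[key] = value     # always available locally
        if not self.connected:
            self.pending.append((key, value))

    def read(self, key):
        return self.data.get(key)

    def reconnect(self, server):
        self.connected = True
        for key, value in self.pending:   # push queued updates
            server.data[key] = value
        self.pending.clear()
        self.data.update(server.data)     # pull the server's latest state

server, mobile = Replica(), Replica()
mobile.write("profile", "v2")             # offline update
mobile.reconnect(server)
print(server.read("profile"))             # "v2"
```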

Relevance:

60.00%

Publisher:

Abstract:

Recent advances in hardware technologies such as portable computers and wireless communication networks have led to the emergence of mobile computing systems. Availability and accessibility of data and services have thus become important issues in mobile computing systems. In this paper, we present a data replication and management scheme tailored for such environments. In the proposed scheme, data is replicated synchronously over stationary sites, while in the mobile network data is replicated asynchronously based on the commonly visited sites of each user. The proposed scheme is compared with other techniques and is shown to require lower communication cost per operation as well as to provide a higher degree of data availability.
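
A hypothetical sketch of the scheme's core idea (site names and data are made up): writes are replicated synchronously to the stationary sites and queued for asynchronous propagation to the mobile sites each user commonly visits:

```python
# Hybrid replication sketch: synchronous to stationary sites,
# asynchronous to per-user commonly visited mobile sites.
from collections import defaultdict

stationary_sites = {"dc1": {}, "dc2": {}}
mobile_sites = {"cell_a": {}, "cell_b": {}, "cell_c": {}}
commonly_visited = {"alice": ["cell_a", "cell_c"]}   # per-user visit profile

async_queue = defaultdict(list)   # mobile site -> pending (key, value) updates

def write(user, key, value):
    for site in stationary_sites.values():           # synchronous replication
        site[key] = value
    for site_name in commonly_visited.get(user, []): # asynchronous replication
        async_queue[site_name].append((key, value))

def flush(site_name):
    """Apply queued updates once the mobile site is reachable."""
    for key, value in async_queue.pop(site_name, []):
        mobile_sites[site_name][key] = value

write("alice", "calendar", "meeting@10")
flush("cell_a")
print(stationary_sites["dc1"], mobile_sites["cell_a"], mobile_sites["cell_b"])
```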

Relevance:

60.00%

Publisher:

Abstract:

Citizen science involves collaboration between multi-sector agencies and the public to address a natural resource management issue. The Sea Search citizen science programme involves community groups in monitoring and collecting subtidal rocky reef and intertidal rocky shore data in Victorian Marine Protected Areas (MPAs), Australia. In this study we compared volunteer-collected and scientifically collected data, as well as the volunteers' motivations for participating in the Sea Search programme. Volunteer-collected intertidal rocky shore data were found to be broadly comparable to data collected by scientists for species richness and diversity measures. For subtidal monitoring there was also no significant difference in species richness recorded by scientists and volunteers. However, low statistical power suggests that only large changes could be detected, owing to reduced data replication. In general, volunteers recorded lower species diversity for biological groups than scientists did, although the difference was not significant. Species abundance measures for algae were significantly different between volunteers and scientists. These results point to difficulty in identification and abundance measurement by volunteers and to the additional training needed for surveying algal assemblages. The subtidal monitoring results also highlight the difficulties of collecting data in exposed rocky reef habitats, with weather conditions and volunteer diver availability constraining sampling effort. The prime motivation for volunteer participation in Sea Search was to assist with scientific research, followed closely by wanting to work close to nature. This study revealed two important themes for volunteer engagement in Sea Search: 1) volunteer training and participation, and 2) usability of volunteer-collected data for MPA managers. Data collected through the Sea Search citizen science programme have the potential to support informed management practices in Victoria’s MPAs, but this requires the support and commitment of all partners involved.
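
For reference, species richness and Shannon diversity, the measures compared between volunteer and scientist surveys, are typically computed as follows; the species counts here are invented for illustration:

```python
# Sketch of the standard richness and Shannon diversity calculations.
# Counts are made up and do not come from the study.
import math

def richness(counts):
    return sum(1 for c in counts.values() if c > 0)

def shannon_diversity(counts):
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)

volunteer = {"species_a": 12, "species_b": 30, "species_c": 5}
scientist = {"species_a": 15, "species_b": 28, "species_c": 9, "species_d": 2}

for name, survey in [("volunteer", volunteer), ("scientist", scientist)]:
    print(name, "richness =", richness(survey), "H' =", round(shannon_diversity(survey), 3))
```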

Relevance:

60.00%

Publisher:

Abstract:

Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)

Relevance:

60.00%

Publisher:

Abstract:

Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)

Relevance:

60.00%

Publisher:

Abstract:

3D geographic information systems (GIS) are data and computation intensive by nature. Internet users are usually equipped with low-end personal computers and network connections of limited bandwidth. Data reduction and performance optimization techniques are therefore of critical importance in quality of service (QoS) management for online 3D GIS. In this research, QoS management issues in distributed 3D GIS presentation were studied in order to develop 3D TerraFly, an interactive 3D GIS that supports high-quality online terrain visualization and navigation. To tackle the QoS management challenges, a multi-resolution rendering model, adaptive level of detail (LOD) control, and mesh simplification algorithms were proposed to effectively reduce terrain model complexity. The rendering model is adaptively decomposed into sub-regions of up to three detail levels according to viewing distance and other dynamic quality measurements. The mesh simplification algorithm was designed as a hybrid algorithm that combines edge straightening and quad-tree compression to reduce mesh complexity by removing geometrically redundant vertices. The main advantage of this mesh simplification algorithm is that the grid mesh can be processed directly in parallel without triangulation overhead. Algorithms facilitating remote access and distributed processing of volumetric GIS data, such as data replication, directory service, request scheduling, predictive data retrieval, and caching, were also proposed. A prototype of the proposed 3D TerraFly implemented in this research demonstrates the effectiveness of the proposed QoS management framework in handling interactive online 3D GIS. The system implementation details and future directions of this research are also addressed in this thesis.
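
A hedged sketch of distance-based level-of-detail selection in the spirit of the adaptive LOD control described above; the thresholds and tile layout are hypothetical, not 3D TerraFly's actual values:

```python
# Pick one of three detail levels per terrain sub-region by viewing distance.
# Threshold distances and tile coordinates are made up.
import math

def lod_for_region(viewer, region_center, near=500.0, far=2000.0):
    """Return 0 (full), 1 (simplified), or 2 (coarsest) by distance in metres."""
    d = math.dist(viewer, region_center)
    if d < near:
        return 0
    elif d < far:
        return 1
    return 2

viewer = (0.0, 0.0, 100.0)
regions = {"tile_00": (100.0, 50.0, 0.0),
           "tile_07": (900.0, 300.0, 0.0),
           "tile_31": (4000.0, 2500.0, 0.0)}
for name, center in regions.items():
    print(name, "-> LOD", lod_for_region(viewer, center))
```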

Relevance:

60.00%

Publisher:

Abstract:

Database replication aims to copy data between databases distributed over a computer network. Data replication is important in several situations, such as creating backups of information, load balancing between nodes, distributing information across multiple locations, and integrating heterogeneous systems. Replication also reduces network traffic, since data remains available locally and can still be accessed in the event of network unavailability. This thesis is based on work carried out to develop a generic application for database replication, to be made available as open source software. The application that was built allows data integration between various systems, with particular focus on, amongst others, the integration of heterogeneous data, the fragmentation of data, replication in cascade, data format changes between replicas, master/slave and multi-master synchronization, and adaptability to a variety of situations.
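
An illustrative sketch of timestamp-based replication between two databases (not the thesis' application), using SQLite so the example is self-contained; table and column names are hypothetical:

```python
# Copy rows modified since the last replication watermark from a master
# database to a replica, overwriting older versions (last-writer-wins).
import sqlite3

master = sqlite3.connect(":memory:")
replica = sqlite3.connect(":memory:")
for db in (master, replica):
    db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)")

master.executemany("INSERT INTO items VALUES (?, ?, ?)",
                   [(1, "alpha", 100), (2, "beta", 200)])

def replicate(src, dst, since):
    """Copy rows modified after `since` from src to dst and return the new watermark."""
    rows = src.execute("SELECT id, name, updated_at FROM items WHERE updated_at > ?",
                       (since,)).fetchall()
    dst.executemany("INSERT OR REPLACE INTO items VALUES (?, ?, ?)", rows)
    dst.commit()
    return max((r[2] for r in rows), default=since)

watermark = replicate(master, replica, since=0)
print(replica.execute("SELECT * FROM items").fetchall(), "watermark =", watermark)
```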

Relevance:

40.00%

Publisher:

Abstract:

We conducted data-mining analyses of genome-wide association (GWA) studies of the CATIE and MGS-GAIN datasets, and found 13 markers in two physically linked genes, PTPN21 and EML5, showing nominally significant association with schizophrenia. Linkage disequilibrium (LD) analysis indicated that all 7 markers from PTPN21 were in high LD (r² > 0.8), including rs2274736 and rs2401751, the two non-synonymous markers with the most significant association signals (rs2401751, P = 1.10×10^-3 and rs2274736, P = 1.21×10^-3). In a meta-analysis of all 13 replication datasets with a total of 13,940 subjects, we found that the two non-synonymous markers are significantly associated with schizophrenia (rs2274736, OR = 0.92, 95% CI: 0.86-0.97, P = 5.45×10^-3 and rs2401751, OR = 0.92, 95% CI: 0.86-0.97, P = 5.29×10^-3). One SNP (rs7147796) in EML5 is also significantly associated with the disease (OR = 1.08, 95% CI: 1.02-1.14, P = 6.43×10^-3). These 3 markers remain significant after Bonferroni correction. Furthermore, haplotype-conditioned analyses indicated that the association signals observed for rs2274736/rs2401751 and rs7147796 are statistically independent. Given that two non-synonymous markers in PTPN21 are associated with schizophrenia, further investigation of this locus is warranted.
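
As a back-of-envelope consistency check (a normal approximation, not the authors' computation), the reported odds ratio and 95% CI approximately reproduce the reported p-value once rounding of the published interval is taken into account:

```python
# Recover z-score and two-sided p from a reported OR and 95% CI.
# This is only an approximate check; the published CI is rounded.
import math

def p_from_or_ci(or_, lo, hi):
    log_or = math.log(or_)
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)  # CI half-width on the log scale
    z = log_or / se
    return math.erfc(abs(z) / math.sqrt(2.0))        # two-sided p, normal approximation

# rs2274736 in PTPN21: OR = 0.92 (95% CI 0.86-0.97)
print(p_from_or_ci(0.92, 0.86, 0.97))  # ~6e-3, same order as the reported 5.45e-3
```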