3 results for Distributed data

in CORA - Cork Open Research Archive - University College Cork - Ireland


Relevance:

40.00%

Publisher:

Abstract:

It is estimated that the quantity of digital data being transferred, processed or stored at any one time currently stands at 4.4 zettabytes (4.4 × 2^70 bytes) and this figure is expected to have grown by a factor of 10 to 44 zettabytes by 2020. Exploiting this data is, and will remain, a significant challenge. At present there is the capacity to store 33% of digital data in existence at any one time; by 2020 this capacity is expected to fall to 15%. These statistics suggest that, in the era of Big Data, the identification of important, exploitable data will need to be done in a timely manner. Systems for the monitoring and analysis of data, e.g. stock markets, smart grids and sensor networks, can be made up of massive numbers of individual components. These components can be geographically distributed yet may interact with one another via continuous data streams, which in turn may affect the state of the sender or receiver. This introduces a dynamic causality, which further complicates the overall system by introducing a temporal constraint that is difficult to accommodate. Practical approaches to realising the system described above have led to a multiplicity of analysis techniques, each of which concentrates on specific characteristics of the system being analysed and treats these characteristics as the dominant component affecting the results being sought. The multiplicity of analysis techniques introduces another layer of heterogeneity, namely heterogeneity of approach, partitioning the field to the extent that results from one domain are difficult to exploit in another. The question asked is whether a generic solution for the monitoring and analysis of data can be identified that: accommodates temporal constraints; bridges the gap between expert knowledge and raw data; and enables data to be effectively interpreted and exploited in a transparent manner.
The approach proposed in this dissertation acquires, analyses and processes data in a manner that is free of the constraints of any particular analysis technique, while at the same time facilitating these techniques where appropriate. Constraints are applied by defining a workflow based on the production, interpretation and consumption of data. This supports the application of different analysis techniques on the same raw data without the danger of incorporating hidden bias that may exist. To illustrate and to realise this approach, a software platform has been created that allows for the transparent analysis of data, combining analysis techniques with a maintainable record of provenance so that independent third-party analysis can be applied to verify any derived conclusions. In order to demonstrate these concepts, a complex real-world example involving the near real-time capturing and analysis of neurophysiological data from a neonatal intensive care unit (NICU) was chosen. A system was engineered to gather raw data, analyse that data using different analysis techniques, uncover information, incorporate that information into the system and curate the evolution of the discovered knowledge. The application domain was chosen for three reasons: firstly, because it is complex and no comprehensive solution exists; secondly, it requires tight interaction with domain experts, thus requiring the handling of subjective knowledge and inference; and thirdly, given the dearth of neurophysiologists, there is a real-world need to provide a solution for this domain.

Relevance:

30.00%

Publisher:

Abstract:

Background: Statin therapy reduces the risk of occlusive vascular events, but uncertainty remains about potential effects on cancer. We sought to provide a detailed assessment of any effects on cancer of lowering LDL cholesterol (LDL-C) with a statin using individual patient records from 175,000 patients in 27 large-scale statin trials. Methods and Findings: Individual records of 134,537 participants in 22 randomised trials of statin versus control (median duration 4.8 years) and 39,612 participants in 5 trials of more intensive versus less intensive statin therapy (median duration 5.1 years) were obtained. Reducing LDL-C with a statin for about 5 years had no effect on newly diagnosed cancer or on death from such cancers in either the trials of statin versus control (cancer incidence: 3755 [1.4% per year [py]] versus 3738 [1.4% py], RR 1.00 [95% CI 0.96-1.05]; cancer mortality: 1365 [0.5% py] versus 1358 [0.5% py], RR 1.00 [95% CI 0.93-1.08]) or in the trials of more versus less statin (cancer incidence: 1466 [1.6% py] vs 1472 [1.6% py], RR 1.00 [95% CI 0.93-1.07]; cancer mortality: 447 [0.5% py] versus 481 [0.5% py], RR 0.93 [95% CI 0.82-1.06]). Moreover, there was no evidence of any effect of reducing LDL-C with statin therapy on cancer incidence or mortality at any of 23 individual categories of sites, with increasing years of treatment, for any individual statin, or in any given subgroup. In particular, among individuals with low baseline LDL-C (<2 mmol/L), there was no evidence that further LDL-C reduction (from about 1.7 to 1.3 mmol/L) increased cancer risk (381 [1.6% py] versus 408 [1.7% py]; RR 0.92 [99% CI 0.76-1.10]). Conclusions: In 27 randomised trials, a median of five years of statin therapy had no effect on the incidence of, or mortality from, any type of cancer (or the aggregate of all cancer).

Relevance:

30.00%

Publisher:

Abstract:

A substantial amount of information on the Internet is present in the form of text. The value of this semi-structured and unstructured data has been widely acknowledged, with consequent scientific and commercial exploitation. The ever-increasing data production, however, pushes data analytic platforms to their limit. This thesis proposes techniques for more efficient textual big data analysis suitable for the Hadoop analytic platform. This research explores the direct processing of compressed textual data. The focus is on developing novel compression methods with a number of desirable properties to support text-based big data analysis in distributed environments. The novel contributions of this work include the following. Firstly, a Content-aware Partial Compression (CaPC) scheme is developed. CaPC makes a distinction between informational and functional content in which only the informational content is compressed. Thus, the compressed data is made transparent to existing software libraries which often rely on functional content to work. Secondly, a context-free bit-oriented compression scheme (Approximated Huffman Compression) based on the Huffman algorithm is developed. This uses a hybrid data structure that allows pattern searching in compressed data in linear time. Thirdly, several modern compression schemes have been extended so that the compressed data can be safely split with respect to logical data records in distributed file systems. Furthermore, an innovative two layer compression architecture is used, in which each compression layer is appropriate for the corresponding stage of data processing. Peripheral libraries are developed that seamlessly link the proposed compression schemes to existing analytic platforms and computational frameworks, and also make the use of the compressed data transparent to developers. The compression schemes have been evaluated for a number of standard MapReduce analysis tasks using a collection of real-world datasets. 
In comparison with existing solutions, they have shown substantial improvement in performance and significant reduction in system resource requirements.
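
The idea behind partial compression, as the abstract describes it, can be illustrated with a minimal sketch. This is not the CaPC scheme itself: the codebook approach, the choice of commas and newlines as "functional" content, the `~` escape character, and all function names are assumptions made for illustration only. It shows that when only the informational tokens are shortened and the delimiters are left untouched, ordinary line- and field-splitting code still works on the compressed text.

```python
import re
from collections import Counter

# Assumption: commas and newlines are the "functional" content that
# downstream parsers rely on; everything between them is "informational".
FUNCTIONAL = re.compile(r"([,\n])")
# Assumption: this escape character never occurs in the input data.
MARK = "~"


def build_codebook(text, size=26):
    """Map the most frequent informational tokens to two-character codes."""
    # Only tokens longer than a code (2 characters) are worth replacing;
    # this also excludes the single-character delimiters.
    tokens = [t for t in FUNCTIONAL.split(text) if len(t) > 2]
    common = [tok for tok, _ in Counter(tokens).most_common(size)]
    return {tok: MARK + chr(ord("a") + i) for i, tok in enumerate(common)}


def compress(text, codebook):
    """Replace informational tokens by codes; delimiters pass through as-is."""
    if MARK in text:
        raise ValueError("input contains the escape character")
    return "".join(codebook.get(t, t) for t in FUNCTIONAL.split(text))


def decompress(text, codebook):
    """Invert compress() using the same codebook."""
    reverse = {code: tok for tok, code in codebook.items()}
    return "".join(reverse.get(t, t) for t in FUNCTIONAL.split(text))


data = "alice,london,42\nbob,london,17\nalice,paris,42\n"
codebook = build_codebook(data)
packed = compress(data, codebook)

# Record and field boundaries survive, so existing splitting logic still works.
for line in packed.splitlines():
    print(line.split(","))
```

Because the delimiters are never altered, the compressed text can also be cut at any newline without breaking a record, which is the property a distributed file system needs when it splits input between workers.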