5 resultados para Big data, Spark, Hadoop

em CORA - Cork Open Research Archive - University College Cork - Ireland


Relevância:

100.00% 100.00%

Publicador:

Resumo:

A substantial amount of information on the Internet is present in the form of text. The value of this semi-structured and unstructured data has been widely acknowledged, with consequent scientific and commercial exploitation. The ever-increasing data production, however, pushes data analytic platforms to their limit. This thesis proposes techniques for more efficient textual big data analysis suitable for the Hadoop analytic platform. This research explores the direct processing of compressed textual data. The focus is on developing novel compression methods with a number of desirable properties to support text-based big data analysis in distributed environments. The novel contributions of this work include the following. Firstly, a Content-aware Partial Compression (CaPC) scheme is developed. CaPC makes a distinction between informational and functional content in which only the informational content is compressed. Thus, the compressed data is made transparent to existing software libraries which often rely on functional content to work. Secondly, a context-free bit-oriented compression scheme (Approximated Huffman Compression) based on the Huffman algorithm is developed. This uses a hybrid data structure that allows pattern searching in compressed data in linear time. Thirdly, several modern compression schemes have been extended so that the compressed data can be safely split with respect to logical data records in distributed file systems. Furthermore, an innovative two layer compression architecture is used, in which each compression layer is appropriate for the corresponding stage of data processing. Peripheral libraries are developed that seamlessly link the proposed compression schemes to existing analytic platforms and computational frameworks, and also make the use of the compressed data transparent to developers. The compression schemes have been evaluated for a number of standard MapReduce analysis tasks using a collection of real-world datasets. In comparison with existing solutions, they have shown substantial improvement in performance and significant reduction in system resource requirements.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

It is estimated that the quantity of digital data being transferred, processed or stored at any one time currently stands at 4.4 zettabytes (4.4 × 2 70 bytes) and this figure is expected to have grown by a factor of 10 to 44 zettabytes by 2020. Exploiting this data is, and will remain, a significant challenge. At present there is the capacity to store 33% of digital data in existence at any one time; by 2020 this capacity is expected to fall to 15%. These statistics suggest that, in the era of Big Data, the identification of important, exploitable data will need to be done in a timely manner. Systems for the monitoring and analysis of data, e.g. stock markets, smart grids and sensor networks, can be made up of massive numbers of individual components. These components can be geographically distributed yet may interact with one another via continuous data streams, which in turn may affect the state of the sender or receiver. This introduces a dynamic causality, which further complicates the overall system by introducing a temporal constraint that is difficult to accommodate. Practical approaches to realising the system described above have led to a multiplicity of analysis techniques, each of which concentrates on specific characteristics of the system being analysed and treats these characteristics as the dominant component affecting the results being sought. The multiplicity of analysis techniques introduces another layer of heterogeneity, that is heterogeneity of approach, partitioning the field to the extent that results from one domain are difficult to exploit in another. The question is asked can a generic solution for the monitoring and analysis of data that: accommodates temporal constraints; bridges the gap between expert knowledge and raw data; and enables data to be effectively interpreted and exploited in a transparent manner, be identified? The approach proposed in this dissertation acquires, analyses and processes data in a manner that is free of the constraints of any particular analysis technique, while at the same time facilitating these techniques where appropriate. Constraints are applied by defining a workflow based on the production, interpretation and consumption of data. This supports the application of different analysis techniques on the same raw data without the danger of incorporating hidden bias that may exist. To illustrate and to realise this approach a software platform has been created that allows for the transparent analysis of data, combining analysis techniques with a maintainable record of provenance so that independent third party analysis can be applied to verify any derived conclusions. In order to demonstrate these concepts, a complex real world example involving the near real-time capturing and analysis of neurophysiological data from a neonatal intensive care unit (NICU) was chosen. A system was engineered to gather raw data, analyse that data using different analysis techniques, uncover information, incorporate that information into the system and curate the evolution of the discovered knowledge. The application domain was chosen for three reasons: firstly because it is complex and no comprehensive solution exists; secondly, it requires tight interaction with domain experts, thus requiring the handling of subjective knowledge and inference; and thirdly, given the dearth of neurophysiologists, there is a real world need to provide a solution for this domain

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The amount and quality of available biomass is a key factor for the sustainable livestock industry and agricultural management related decision making. Globally 31.5% of land cover is grassland while 80% of Ireland’s agricultural land is grassland. In Ireland, grasslands are intensively managed and provide the cheapest feed source for animals. This dissertation presents a detailed state of the art review of satellite remote sensing of grasslands, and the potential application of optical (Moderate–resolution Imaging Spectroradiometer (MODIS)) and radar (TerraSAR-X) time series imagery to estimate the grassland biomass at two study sites (Moorepark and Grange) in the Republic of Ireland using both statistical and state of the art machine learning algorithms. High quality weather data available from the on-site weather station was also used to calculate the Growing Degree Days (GDD) for Grange to determine the impact of ancillary data on biomass estimation. In situ and satellite data covering 12 years for the Moorepark and 6 years for the Grange study sites were used to predict grassland biomass using multiple linear regression, Neuro Fuzzy Inference Systems (ANFIS) models. The results demonstrate that a dense (8-day composite) MODIS image time series, along with high quality in situ data, can be used to retrieve grassland biomass with high performance (R2 = 0:86; p < 0:05, RMSE = 11.07 for Moorepark). The model for Grange was modified to evaluate the synergistic use of vegetation indices derived from remote sensing time series and accumulated GDD information. As GDD is strongly linked to the plant development, or phonological stage, an improvement in biomass estimation would be expected. It was observed that using the ANFIS model the biomass estimation accuracy increased from R2 = 0:76 (p < 0:05) to R2 = 0:81 (p < 0:05) and the root mean square error was reduced by 2.72%. The work on the application of optical remote sensing was further developed using a TerraSAR-X Staring Spotlight mode time series over the Moorepark study site to explore the extent to which very high resolution Synthetic Aperture Radar (SAR) data of interferometrically coherent paddocks can be exploited to retrieve grassland biophysical parameters. After filtering out the non-coherent plots it is demonstrated that interferometric coherence can be used to retrieve grassland biophysical parameters (i. e., height, biomass), and that it is possible to detect changes due to the grass growth, and grazing and mowing events, when the temporal baseline is short (11 days). However, it not possible to automatically uniquely identify the cause of these changes based only on the SAR backscatter and coherence, due to the ambiguity caused by tall grass laid down due to the wind. Overall, the work presented in this dissertation has demonstrated the potential of dense remote sensing and weather data time series to predict grassland biomass using machine-learning algorithms, where high quality ground data were used for training. At present a major limitation for national scale biomass retrieval is the lack of spatial and temporal ground samples, which can be partially resolved by minor modifications in the existing PastureBaseIreland database by adding the location and extent ofeach grassland paddock in the database. As far as remote sensing data requirements are concerned, MODIS is useful for large scale evaluation but due to its coarse resolution it is not possible to detect the variations within the fields and between the fields at the farm scale. However, this issue will be resolved in terms of spatial resolution by the Sentinel-2 mission, and when both satellites (Sentinel-2A and Sentinel-2B) are operational the revisit time will reduce to 5 days, which together with Landsat-8, should enable sufficient cloud-free data for operational biomass estimation at a national scale. The Synthetic Aperture Radar Interferometry (InSAR) approach is feasible if there are enough coherent interferometric pairs available, however this is difficult to achieve due to the temporal decorrelation of the signal. For repeat-pass InSAR over a vegetated area even an 11 days temporal baseline is too large. In order to achieve better coherence a very high resolution is required at the cost of spatial coverage, which limits its scope for use in an operational context at a national scale. Future InSAR missions with pair acquisition in Tandem mode will minimize the temporal decorrelation over vegetation areas for more focused studies. The proposed approach complements the current paradigm of Big Data in Earth Observation, and illustrates the feasibility of integrating data from multiple sources. In future, this framework can be used to build an operational decision support system for retrieval of grassland biophysical parameters based on data from long term planned optical missions (e. g., Landsat, Sentinel) that will ensure the continuity of data acquisition. Similarly, Spanish X-band PAZ and TerraSAR-X2 missions will ensure the continuity of TerraSAR-X and COSMO-SkyMed.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The organisational decision making environment is complex, and decision makers must deal with uncertainty and ambiguity on a continuous basis. Managing and handling decision problems and implementing a solution, requires an understanding of the complexity of the decision domain to the point where the problem and its complexity, as well as the requirements for supporting decision makers, can be described. Research in the Decision Support Systems domain has been extensive over the last thirty years with an emphasis on the development of further technology and better applications on the one hand, and on the other hand, a social approach focusing on understanding what decision making is about and how developers and users should interact. This research project considers a combined approach that endeavours to understand the thinking behind managers’ decision making, as well as their informational and decisional guidance and decision support requirements. This research utilises a cognitive framework, developed in 1985 by Humphreys and Berkeley that juxtaposes the mental processes and ideas of decision problem definition and problem solution that are developed in tandem through cognitive refinement of the problem, based on the analysis and judgement of the decision maker. The framework facilitates the separation of what is essentially a continuous process, into five distinct levels of abstraction of manager’s thinking, and suggests a structure for the underlying cognitive activities. Alter (2004) argues that decision support provides a richer basis than decision support systems, in both practice and research. The constituent literature on decision support, especially in regard to modern high profile systems, including Business Intelligence and Business analytics, can give the impression that all ‘smart’ organisations utilise decision support and data analytics capabilities for all of their key decision making activities. However this empirical investigation indicates a very different reality.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

As economies, societies, and environments change, official statistics evolve and develop to reflect those changes. In reaction to disruptive innovations arising from globalisation, technological advances, and cultural changes, the pace of change of official statistics will accelerate in the future. The motivation for change may also be more existential than that of the past as official statisticians consider the survival of their discipline. This article examines some of the emerging developments and questions whether they present threats or offer opportunities.