8 results for big data storage
in CORA - Cork Open Research Archive - University College Cork - Ireland
Abstract:
A substantial amount of information on the Internet is present in the form of text. The value of this semi-structured and unstructured data has been widely acknowledged, with consequent scientific and commercial exploitation. The ever-increasing data production, however, pushes data analytic platforms to their limit. This thesis proposes techniques for more efficient textual big data analysis suitable for the Hadoop analytic platform. This research explores the direct processing of compressed textual data. The focus is on developing novel compression methods with a number of desirable properties to support text-based big data analysis in distributed environments. The novel contributions of this work include the following. Firstly, a Content-aware Partial Compression (CaPC) scheme is developed. CaPC makes a distinction between informational and functional content in which only the informational content is compressed. Thus, the compressed data is made transparent to existing software libraries which often rely on functional content to work. Secondly, a context-free bit-oriented compression scheme (Approximated Huffman Compression) based on the Huffman algorithm is developed. This uses a hybrid data structure that allows pattern searching in compressed data in linear time. Thirdly, several modern compression schemes have been extended so that the compressed data can be safely split with respect to logical data records in distributed file systems. Furthermore, an innovative two layer compression architecture is used, in which each compression layer is appropriate for the corresponding stage of data processing. Peripheral libraries are developed that seamlessly link the proposed compression schemes to existing analytic platforms and computational frameworks, and also make the use of the compressed data transparent to developers. The compression schemes have been evaluated for a number of standard MapReduce analysis tasks using a collection of real-world datasets. 
In comparison with existing solutions, they have shown substantial improvement in performance and significant reduction in system resource requirements.
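As background to the Huffman-based scheme mentioned in the abstract above, the following is a minimal sketch of classic Huffman coding in Python. It is not the thesis's Approximated Huffman Compression and does not implement its hybrid data structure or compressed-domain search; the function names `build_codes`, `encode` and `decode` are hypothetical and chosen only for illustration.

```python
import heapq
from collections import Counter


def build_codes(text):
    """Return a dict mapping each symbol in text to its Huffman bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker ensures the dicts are never compared.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code below them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def encode(text, codes):
    """Concatenate the bit strings of all symbols in text."""
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    """Decode a bit string; works because Huffman codes are prefix-free."""
    inverse = {c: s for s, c in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)
```

Frequent symbols receive shorter codes, so the encoded bit string is shorter than a fixed 8-bit-per-character representation for typical text. Bit strings are kept as Python `str` objects here for readability; a practical implementation would pack them into bytes.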
Abstract:
Electron microscopy (EM) has advanced exponentially since the first transmission electron microscope (TEM) was built in the 1930s. The urge to ‘see’ things is an essential part of human nature (as in ‘seeing is believing’) and, apart from scanning tunnelling microscopes, which give information about the surface, EM is the only imaging technology capable of really visualising atomic structures in depth down to single atoms. With the development of nanotechnology the demand to image and analyse small things has become even greater and electron microscopes have found their way from highly delicate and sophisticated research grade instruments to turnkey and even bench-top instruments for everyday use in every materials research lab on the planet. The semiconductor industry is as dependent on the use of EM as the life sciences and pharmaceutical industries. With this generalisation of use for imaging, the need to deploy advanced uses of EM has become more and more apparent. The combination of several coinciding beams (electron, ion and even light) to create DualBeam or TripleBeam instruments, for instance, enhances the usefulness from pure imaging to manipulating on the nanoscale. And when it comes to the analytic power of EM, with the many ways the highly energetic electrons and ions interact with the matter in the specimen, there is a plethora of niches which evolved during the last two decades, specialising in every kind of analysis that can be thought of and combined with EM. In the course of this study the emphasis was placed on the application of these advanced analytical EM techniques in the context of multiscale and multimodal microscopy – multiscale meaning across length scales from micrometres or larger to nanometres, multimodal meaning numerous techniques applied to the same sample volume in a correlative manner.
In order to demonstrate the breadth and potential of the multiscale and multimodal concept an integration of it was attempted in two areas: I) Biocompatible materials using polycrystalline stainless steel and II) Semiconductors using thin multiferroic films. I) The motivation to use stainless steel (316L medical grade) comes from the potential modulation of endothelial cell growth which can have a big impact on the improvement of cardio-vascular stents – which are mainly made of 316L – through nano-texturing of the stent surface by focused ion beam (FIB) lithography. Patterning with FIB has never been reported before in connection with stents and cell growth and in order to gain a better understanding of the beam-substrate interaction during patterning a correlative microscopy approach was used to illuminate the patterning process from many possible angles. Electron backscattering diffraction (EBSD) was used to analyse the crystallographic structure, FIB was used for the patterning and simultaneously visualising the crystal structure as part of the monitoring process, scanning electron microscopy (SEM) and atomic force microscopy (AFM) were employed to analyse the topography and the final step being 3D visualisation through serial FIB/SEM sectioning. II) The motivation for the use of thin multiferroic films stems from the ever-growing demand for increased data storage at lesser and lesser energy consumption. The Aurivillius phase material used in this study has a high potential in this area. Yet it is necessary to show clearly that the film is really multiferroic and no second phase inclusions are present even at very low concentrations – ~0.1vol% could already be problematic. Thus, in this study a technique was developed to analyse ultra-low density inclusions in thin multiferroic films down to concentrations of 0.01%. 
The goal achieved was a complete structural and compositional analysis of the films, which required identifying second phase inclusions (through Energy Dispersive X-ray (EDX) elemental analysis), localising them (employing 72 hour EDX mapping in the SEM), isolating them for the TEM (using FIB) and giving an upper confidence limit of 99.5% to the influence of the inclusions on the magnetic behaviour of the main phase (statistical analysis).
Abstract:
It is estimated that the quantity of digital data being transferred, processed or stored at any one time currently stands at 4.4 zettabytes (4.4 × 2^70 bytes) and this figure is expected to have grown by a factor of 10 to 44 zettabytes by 2020. Exploiting this data is, and will remain, a significant challenge. At present there is the capacity to store 33% of digital data in existence at any one time; by 2020 this capacity is expected to fall to 15%. These statistics suggest that, in the era of Big Data, the identification of important, exploitable data will need to be done in a timely manner. Systems for the monitoring and analysis of data, e.g. stock markets, smart grids and sensor networks, can be made up of massive numbers of individual components. These components can be geographically distributed yet may interact with one another via continuous data streams, which in turn may affect the state of the sender or receiver. This introduces a dynamic causality, which further complicates the overall system by introducing a temporal constraint that is difficult to accommodate. Practical approaches to realising the system described above have led to a multiplicity of analysis techniques, each of which concentrates on specific characteristics of the system being analysed and treats these characteristics as the dominant component affecting the results being sought. The multiplicity of analysis techniques introduces another layer of heterogeneity, that is heterogeneity of approach, partitioning the field to the extent that results from one domain are difficult to exploit in another. The question asked is whether a generic solution for the monitoring and analysis of data can be identified that: accommodates temporal constraints; bridges the gap between expert knowledge and raw data; and enables data to be effectively interpreted and exploited in a transparent manner.
The approach proposed in this dissertation acquires, analyses and processes data in a manner that is free of the constraints of any particular analysis technique, while at the same time facilitating these techniques where appropriate. Constraints are applied by defining a workflow based on the production, interpretation and consumption of data. This supports the application of different analysis techniques on the same raw data without the danger of incorporating hidden bias that may exist. To illustrate and to realise this approach a software platform has been created that allows for the transparent analysis of data, combining analysis techniques with a maintainable record of provenance so that independent third party analysis can be applied to verify any derived conclusions. In order to demonstrate these concepts, a complex real world example involving the near real-time capturing and analysis of neurophysiological data from a neonatal intensive care unit (NICU) was chosen. A system was engineered to gather raw data, analyse that data using different analysis techniques, uncover information, incorporate that information into the system and curate the evolution of the discovered knowledge. The application domain was chosen for three reasons: firstly, because it is complex and no comprehensive solution exists; secondly, it requires tight interaction with domain experts, thus requiring the handling of subjective knowledge and inference; and thirdly, given the dearth of neurophysiologists, there is a real world need to provide a solution for this domain.
Abstract:
The amount and quality of available biomass is a key factor for the sustainable livestock industry and agricultural management related decision making. Globally 31.5% of land cover is grassland while 80% of Ireland’s agricultural land is grassland. In Ireland, grasslands are intensively managed and provide the cheapest feed source for animals. This dissertation presents a detailed state of the art review of satellite remote sensing of grasslands, and the potential application of optical (Moderate Resolution Imaging Spectroradiometer (MODIS)) and radar (TerraSAR-X) time series imagery to estimate the grassland biomass at two study sites (Moorepark and Grange) in the Republic of Ireland using both statistical and state of the art machine learning algorithms. High quality weather data available from the on-site weather station was also used to calculate the Growing Degree Days (GDD) for Grange to determine the impact of ancillary data on biomass estimation. In situ and satellite data covering 12 years for the Moorepark and 6 years for the Grange study sites were used to predict grassland biomass using multiple linear regression and Adaptive Neuro-Fuzzy Inference System (ANFIS) models. The results demonstrate that a dense (8-day composite) MODIS image time series, along with high quality in situ data, can be used to retrieve grassland biomass with high performance (R² = 0.86; p < 0.05, RMSE = 11.07 for Moorepark). The model for Grange was modified to evaluate the synergistic use of vegetation indices derived from remote sensing time series and accumulated GDD information. As GDD is strongly linked to the plant development, or phenological stage, an improvement in biomass estimation would be expected. It was observed that using the ANFIS model the biomass estimation accuracy increased from R² = 0.76 (p < 0.05) to R² = 0.81 (p < 0.05) and the root mean square error was reduced by 2.72%.
The work on the application of optical remote sensing was further developed using a TerraSAR-X Staring Spotlight mode time series over the Moorepark study site to explore the extent to which very high resolution Synthetic Aperture Radar (SAR) data of interferometrically coherent paddocks can be exploited to retrieve grassland biophysical parameters. After filtering out the non-coherent plots it is demonstrated that interferometric coherence can be used to retrieve grassland biophysical parameters (i.e., height, biomass), and that it is possible to detect changes due to the grass growth, and grazing and mowing events, when the temporal baseline is short (11 days). However, it is not possible to automatically and uniquely identify the cause of these changes based only on the SAR backscatter and coherence, due to the ambiguity caused by tall grass laid down by the wind. Overall, the work presented in this dissertation has demonstrated the potential of dense remote sensing and weather data time series to predict grassland biomass using machine-learning algorithms, where high quality ground data were used for training. At present a major limitation for national scale biomass retrieval is the lack of spatial and temporal ground samples, which can be partially resolved by a minor modification to the existing PastureBaseIreland database: adding the location and extent of each grassland paddock. As far as remote sensing data requirements are concerned, MODIS is useful for large scale evaluation but due to its coarse resolution it is not possible to detect the variations within the fields and between the fields at the farm scale. However, this issue will be resolved in terms of spatial resolution by the Sentinel-2 mission, and when both satellites (Sentinel-2A and Sentinel-2B) are operational the revisit time will reduce to 5 days, which together with Landsat-8, should enable sufficient cloud-free data for operational biomass estimation at a national scale.
The Synthetic Aperture Radar Interferometry (InSAR) approach is feasible if there are enough coherent interferometric pairs available; however, this is difficult to achieve due to the temporal decorrelation of the signal. For repeat-pass InSAR over a vegetated area even an 11-day temporal baseline is too large. In order to achieve better coherence a very high resolution is required at the cost of spatial coverage, which limits its scope for use in an operational context at a national scale. Future InSAR missions with pair acquisition in Tandem mode will minimize the temporal decorrelation over vegetation areas for more focused studies. The proposed approach complements the current paradigm of Big Data in Earth Observation, and illustrates the feasibility of integrating data from multiple sources. In future, this framework can be used to build an operational decision support system for retrieval of grassland biophysical parameters based on data from long term planned optical missions (e.g., Landsat, Sentinel) that will ensure the continuity of data acquisition. Similarly, Spanish X-band PAZ and TerraSAR-X2 missions will ensure the continuity of TerraSAR-X and COSMO-SkyMed.
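The Growing Degree Days (GDD) used as ancillary data in the abstract above can be illustrated with a short Python sketch. It uses the common averaging formula, GDD = max(0, (Tmax + Tmin) / 2 − Tbase); the abstract does not state which GDD variant or base temperature the dissertation used, so the 5 °C default and the function names `daily_gdd` and `accumulated_gdd` are assumptions for illustration only.

```python
def daily_gdd(t_max, t_min, t_base=5.0):
    """Growing degree days for one day from max/min air temperature (deg C).

    t_base is an assumed base temperature, not a value from the dissertation.
    """
    return max(0.0, (t_max + t_min) / 2.0 - t_base)


def accumulated_gdd(daily_temps, t_base=5.0):
    """Running total of GDD over a sequence of (t_max, t_min) pairs."""
    total = 0.0
    running = []
    for t_max, t_min in daily_temps:
        total += daily_gdd(t_max, t_min, t_base)
        running.append(total)
    return running


# Three days of hypothetical station readings (t_max, t_min) in deg C.
example = accumulated_gdd([(15.0, 5.0), (4.0, 0.0), (20.0, 10.0)])
```

Days whose mean temperature falls below the base contribute nothing, so the accumulated series never decreases, which is what makes it usable as a proxy for plant development stage alongside vegetation indices.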
Abstract:
Error correcting codes are combinatorial objects, designed to enable reliable transmission of digital data over noisy channels. They are ubiquitously used in communication, data storage etc. Error correction allows reconstruction of the original data from the received word. The classical decoding algorithms are constrained to output just one codeword. However, in the late 1950s researchers proposed a relaxed error correction model for potentially large error rates known as list decoding. The research presented in this thesis focuses on reducing the computational effort and enhancing the efficiency of decoding algorithms for several codes from an algorithmic as well as an architectural standpoint. The codes in consideration are linear block codes closely related to Reed-Solomon (RS) codes. A high speed low complexity algorithm and architecture are presented for encoding and decoding RS codes based on evaluation. The implementation results show that the hardware resources and the total execution time are significantly reduced as compared to the classical decoder. The evaluation based encoding and decoding schemes are modified and extended for shortened RS codes and software implementation shows substantial reduction in memory footprint at the expense of latency. Hermitian codes can be seen as concatenated RS codes and are much longer than RS codes over the same alphabet. A fast, novel and efficient VLSI architecture for Hermitian codes is proposed based on interpolation decoding. The proposed architecture is proven to have better performance than Kötter’s decoder for high rate codes. The thesis work also explores a method of constructing optimal codes by computing the subfield subcodes of Generalized Toric (GT) codes, which are a natural extension of RS codes over several dimensions. The polynomial generators or evaluation polynomials for subfield-subcodes of GT codes are identified, based on which the dimension and a bound for the minimum distance are computed.
The algebraic structure of the polynomials evaluating to the subfield is used to simplify the list decoding algorithm for BCH codes. Finally, an efficient and novel approach is proposed for exploiting powerful codes having complex decoding but a simple encoding scheme (comparable to RS codes) for multihop wireless sensor network (WSN) applications.
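The evaluation view of Reed-Solomon encoding mentioned in the abstract above can be sketched in a few lines of Python: the message symbols are treated as polynomial coefficients and the codeword is that polynomial evaluated at n distinct field points. This is a textbook illustration, not the thesis's algorithm or architecture. It assumes a small prime field GF(257) for simplicity (practical RS codes usually work over GF(2^m)), it recovers the message only from symbols known to be correct (erasure recovery, not error correction or list decoding), and the names `rs_encode` and `rs_recover` are hypothetical.

```python
P = 257  # assumed prime field size, chosen only to keep the arithmetic simple


def poly_eval(coeffs, x, p=P):
    """Evaluate a polynomial (lowest-degree coefficient first) with Horner's rule."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % p
    return result


def rs_encode(message, n, p=P):
    """Encode k message symbols as evaluations at the points 0, 1, ..., n-1."""
    if not (len(message) <= n <= p):
        raise ValueError("need k <= n <= p")
    return [poly_eval(message, x, p) for x in range(n)]


def poly_mul_linear(poly, root, p):
    """Multiply a polynomial by (x - root) over GF(p)."""
    out = [0] * (len(poly) + 1)
    for i, c in enumerate(poly):
        out[i] = (out[i] - root * c) % p
        out[i + 1] = (out[i + 1] + c) % p
    return out


def rs_recover(points, k, p=P):
    """Recover the k message coefficients from any k correct (x, y) pairs
    by Lagrange interpolation."""
    points = points[:k]
    coeffs = [0] * k
    for i, (xi, yi) in enumerate(points):
        basis, denom = [1], 1
        for j, (xj, _) in enumerate(points):
            if j != i:
                basis = poly_mul_linear(basis, xj, p)
                denom = (denom * (xi - xj)) % p
        # Modular inverse via Fermat's little theorem (p is prime).
        scale = yi * pow(denom, p - 2, p) % p
        for d, c in enumerate(basis):
            coeffs[d] = (coeffs[d] + scale * c) % p
    return coeffs
```

Because a polynomial of degree less than k is fixed by any k of its values, a codeword of length n survives the loss of up to n − k symbols; for example, a 3-symbol message encoded to 7 symbols can be rebuilt from any 3 surviving positions.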
Abstract:
The organisational decision making environment is complex, and decision makers must deal with uncertainty and ambiguity on a continuous basis. Managing and handling decision problems and implementing a solution requires an understanding of the complexity of the decision domain to the point where the problem and its complexity, as well as the requirements for supporting decision makers, can be described. Research in the Decision Support Systems domain has been extensive over the last thirty years with an emphasis on the development of further technology and better applications on the one hand, and on the other hand, a social approach focusing on understanding what decision making is about and how developers and users should interact. This research project considers a combined approach that endeavours to understand the thinking behind managers’ decision making, as well as their informational and decisional guidance and decision support requirements. This research utilises a cognitive framework, developed in 1985 by Humphreys and Berkeley, that juxtaposes the mental processes and ideas of decision problem definition and problem solution that are developed in tandem through cognitive refinement of the problem, based on the analysis and judgement of the decision maker. The framework facilitates the separation of what is essentially a continuous process into five distinct levels of abstraction of managers’ thinking, and suggests a structure for the underlying cognitive activities. Alter (2004) argues that decision support provides a richer basis than decision support systems, in both practice and research. The constituent literature on decision support, especially in regard to modern high profile systems, including Business Intelligence and Business Analytics, can give the impression that all ‘smart’ organisations utilise decision support and data analytics capabilities for all of their key decision making activities.
However, this empirical investigation indicates a very different reality.
Abstract:
As economies, societies, and environments change, official statistics evolve and develop to reflect those changes. In reaction to disruptive innovations arising from globalisation, technological advances, and cultural changes, the pace of change of official statistics will accelerate in the future. The motivation for change may also be more existential than that of the past as official statisticians consider the survival of their discipline. This article examines some of the emerging developments and questions whether they present threats or offer opportunities.
Abstract:
Multiferroic materials displaying coupled ferroelectric and ferromagnetic order parameters could provide a means for data storage whereby bits could be written electrically and read magnetically, or vice versa. Thin films of Aurivillius phase Bi6Ti2.8Fe1.52Mn0.68O18, previously prepared by a chemical solution deposition (CSD) technique, are multiferroics demonstrating magnetoelectric coupling at room temperature. Here, we demonstrate the growth of a similar composition, Bi6Ti2.99Fe1.46Mn0.55O18, via the liquid injection chemical vapor deposition technique. High-resolution magnetic measurements reveal a considerably higher in-plane ferromagnetic signature than CSD grown films (MS = 24.25 emu/g (215 emu/cm3), MR = 9.916 emu/g (81.5 emu/cm3), HC = 170 Oe). A statistical analysis of the results from a thorough microstructural examination of the samples allows us to conclude that the ferromagnetic signature can be attributed to the Aurivillius phase, with a confidence level of 99.95%. In addition, we report the direct piezoresponse force microscopy visualization of ferroelectric switching while going through a full in-plane magnetic field cycle, where increased volumes (8.6 to 14% compared with 4 to 7% for the CSD-grown films) of the film engage in magnetoelectric coupling and demonstrate both irreversible and reversible magnetoelectric domain switching.