Biblioteca Digital

A substantial amount of information on the Internet is present in the form of text. The value of this semi-structured and unstructured data has been widely acknowledged, with consequent scientific and commercial exploitation. The ever-increasing data production, however, pushes data analytic platforms to their limit. This thesis proposes techniques for more efficient textual big data analysis suitable for the Hadoop analytic platform. This research explores the direct processing of compressed textual data. The focus is on developing novel compression methods with a number of desirable properties to support text-based big data analysis in distributed environments. The novel contributions of this work include the following. Firstly, a Content-aware Partial Compression (CaPC) scheme is developed. CaPC makes a distinction between informational and functional content in which only the informational content is compressed. Thus, the compressed data is made transparent to existing software libraries which often rely on functional content to work. Secondly, a context-free bit-oriented compression scheme (Approximated Huffman Compression) based on the Huffman algorithm is developed. This uses a hybrid data structure that allows pattern searching in compressed data in linear time. Thirdly, several modern compression schemes have been extended so that the compressed data can be safely split with respect to logical data records in distributed file systems. Furthermore, an innovative two layer compression architecture is used, in which each compression layer is appropriate for the corresponding stage of data processing. Peripheral libraries are developed that seamlessly link the proposed compression schemes to existing analytic platforms and computational frameworks, and also make the use of the compressed data transparent to developers. The compression schemes have been evaluated for a number of standard MapReduce analysis tasks using a collection of real-world datasets. In comparison with existing solutions, they have shown substantial improvement in performance and significant reduction in system resource requirements.

Veja mais

Extending Science Gateway Frameworks to Support Big Data Applications in the Cloud

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Cloud computing offers massive scalability and elasticity required by many scien-tific and commercial applications. Combining the computational and data handling capabilities of clouds with parallel processing also has the potential to tackle Big Data problems efficiently. Science gateway frameworks and workflow systems enable application developers to implement complex applications and make these available for end-users via simple graphical user interfaces. The integration of such frameworks with Big Data processing tools on the cloud opens new oppor-tunities for application developers. This paper investigates how workflow sys-tems and science gateways can be extended with Big Data processing capabilities. A generic approach based on infrastructure aware workflows is suggested and a proof of concept is implemented based on the WS-PGRADE/gUSE science gateway framework and its integration with the Hadoop parallel data processing solution based on the MapReduce paradigm in the cloud. The provided analysis demonstrates that the methods described to integrate Big Data processing with workflows and science gateways work well in different cloud infrastructures and application scenarios, and can be used to create massively parallel applications for scientific analysis of Big Data.

Veja mais

Compositional data analysis of Holocene sediments from the West Bengal Sundarbans, India: Geochemical proxies for grain-size variability in a delta environment

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper is part of a special issue of Applied Geochemistry focusing on reliable applications of compositional multivariate statistical methods. This study outlines the application of compositional data analysis (CoDa) to calibration of geochemical data and multivariate statistical modelling of geochemistry and grain-size data from a set of Holocene sedimentary cores from the Ganges-Brahmaputra (G-B) delta. Over the last two decades, understanding near-continuous records of sedimentary sequences has required the use of core-scanning X-ray fluorescence (XRF) spectrometry, for both terrestrial and marine sedimentary sequences. Initial XRF data are generally unusable in ‘raw-format’, requiring data processing in order to remove instrument bias, as well as informed sequence interpretation. The applicability of these conventional calibration equations to core-scanning XRF data are further limited by the constraints posed by unknown measurement geometry and specimen homogeneity, as well as matrix effects. Log-ratio based calibration schemes have been developed and applied to clastic sedimentary sequences focusing mainly on energy dispersive-XRF (ED-XRF) core-scanning. This study has applied high resolution core-scanning XRF to Holocene sedimentary sequences from the tidal-dominated Indian Sundarbans, (Ganges-Brahmaputra delta plain). The Log-Ratio Calibration Equation (LRCE) was applied to a sub-set of core-scan and conventional ED-XRF data to quantify elemental composition. This provides a robust calibration scheme using reduced major axis regression of log-ratio transformed geochemical data. Through partial least squares (PLS) modelling of geochemical and grain-size data, it is possible to derive robust proxy information for the Sundarbans depositional environment. The application of these techniques to Holocene sedimentary data offers an improved methodological framework for unravelling Holocene sedimentation patterns.

Veja mais

High-performance parallel systems for data-intensive computing

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Thesis (Ph.D.)--University of Washington, 2016-08

Veja mais

Strategies for protecting intellectual property when using CUDA applications on graphics processing units

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Recent advances in the massively parallel computational abilities of graphical processing units (GPUs) have increased their use for general purpose computation, as companies look to take advantage of big data processing techniques. This has given rise to the potential for malicious software targeting GPUs, which is of interest to forensic investigators examining the operation of software. The ability to carry out reverse-engineering of software is of great importance within the security and forensics elds, particularly when investigating malicious software or carrying out forensic analysis following a successful security breach. Due to the complexity of the Nvidia CUDA (Compute Uni ed Device Architecture) framework, it is not clear how best to approach the reverse engineering of a piece of CUDA software. We carry out a review of the di erent binary output formats which may be encountered from the CUDA compiler, and their implications on reverse engineering. We then demonstrate the process of carrying out disassembly of an example CUDA application, to establish the various techniques available to forensic investigators carrying out black-box disassembly and reverse engineering of CUDA binaries. We show that the Nvidia compiler, using default settings, leaks useful information. Finally, we demonstrate techniques to better protect intellectual property in CUDA algorithm implementations from reverse engineering.

Veja mais

Processing Bio-Argo nitrate concentration at the DAC Level

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The only method used to date to measure dissolved nitrate concentration (NITRATE) with sensors mounted on profiling floats is based on the absorption of light at ultraviolet wavelengths by nitrate ion (Johnson and Coletti, 2002; Johnson et al., 2010; 2013; D’Ortenzio et al., 2012). Nitrate has a modest UV absorption band with a peak near 210 nm, which overlaps with the stronger absorption band of bromide, which has a peak near 200 nm. In addition, there is a much weaker absorption due to dissolved organic matter and light scattering by particles (Ogura and Hanya, 1966). The UV spectrum thus consists of three components, bromide, nitrate and a background due to organics and particles. The background also includes thermal effects on the instrument and slow drift. All of these latter effects (organics, particles, thermal effects and drift) tend to be smooth spectra that combine to form an absorption spectrum that is linear in wavelength over relatively short wavelength spans. If the light absorption spectrum is measured in the wavelength range around 217 to 240 nm (the exact range is a bit of a decision by the operator), then the nitrate concentration can be determined. Two different instruments based on the same optical principles are in use for this purpose. The In Situ Ultraviolet Spectrophotometer (ISUS) built at MBARI or at Satlantic has been mounted inside the pressure hull of a Teledyne/Webb Research APEX and NKE Provor profiling floats and the optics penetrate through the upper end cap into the water. The Satlantic Submersible Ultraviolet Nitrate Analyzer (SUNA) is placed on the outside of APEX, Provor, and Navis profiling floats in its own pressure housing and is connected to the float through an underwater cable that provides power and communications. Power, communications between the float controller and the sensor, and data processing requirements are essentially the same for both ISUS and SUNA. There are several possible algorithms that can be used for the deconvolution of nitrate concentration from the observed UV absorption spectrum (Johnson and Coletti, 2002; Arai et al., 2008; Sakamoto et al., 2009; Zielinski et al., 2011). In addition, the default algorithm that is available in Satlantic sensors is a proprietary approach, but this is not generally used on profiling floats. There are some tradeoffs in every approach. To date almost all nitrate sensors on profiling floats have used the Temperature Compensated Salinity Subtracted (TCSS) algorithm developed by Sakamoto et al. (2009), and this document focuses on that method. It is likely that there will be further algorithm development and it is necessary that the data systems clearly identify the algorithm that is used. It is also desirable that the data system allow for recalculation of prior data sets using new algorithms. To accomplish this, the float must report not just the computed nitrate, but the observed light intensity. Then, the rule to obtain only one NITRATE parameter is, if the spectrum is present then, the NITRATE should be recalculated from the spectrum while the computation of nitrate concentration can also generate useful diagnostics of data quality.

Veja mais

CATARINA 2012-Leg 1 - CTD-O2 Data report

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The CATARINA Leg1 cruise was carried out from June 22 to July 24 2012 on board the B/O Sarmiento de Gamboa, under the scientific supervision of Aida Rios (CSIC-IIM). It included the occurrence of the OVIDE hydrological section that was performed in June 2002, 2004, 2006, 2008 and 2010, as part of the CLIVAR program (name A25) ), and under the supervision of Herlé Mercier (CNRSLPO). This section begins near Lisbon (Portugal), runs through the West European Basin and the Iceland Basin, crosses the Reykjanes Ridge (300 miles north of Charlie-Gibbs Fracture Zone, and ends at Cape Hoppe (southeast tip of Greenland). The objective of this repeated hydrological section is to monitor the variability of water mass properties and main current transports in the basin, complementing the international observation array relevant for climate studies. In addition, the Labrador Sea was partly sampled (stations 101-108) between Greenland and Newfoundland, but heavy weather conditions prevented the achievement of the section south of 53°40’N. The quality of CTD data is essential to reach the first objective of the CATARINA project, i.e. to quantify the Meridional Overturning Circulation and water mass ventilation changes and their effect on the changes in the anthropogenic carbon ocean uptake and storage capacity. The CATARINA project was mainly funded by the Spanish Ministry of Sciences and Innovation and co-funded by the Fondo Europeo de Desarrollo Regional. The hydrological OVIDE section includes 95 surface-bottom stations from coast to coast, collecting profiles of temperature, salinity, oxygen and currents, spaced by 2 to 25 Nm depending on the steepness of the topography. The position of the stations closely follows that of OVIDE 2002. In addition, 8 stations were carried out in the Labrador Sea. From the 24 bottles closed at various depth at each stations, samples of sea water are used for salinity and oxygen calibration, and for measurements of biogeochemical components that are not reported here. The data were acquired with a Seabird CTD (SBE911+) and an SBE43 for the dissolved oxygen, belonging to the Spanish UTM group. The software SBE data processing was used after decoding and cleaning the raw data. Then, the LPO matlab toolbox was used to calibrate and bin the data as it was done for the previous OVIDE cruises, using on the one hand pre and post-cruise calibration results for the pressure and temperature sensors (done at Ifremer) and on the other hand the water samples of the 24 bottles of the rosette at each station for the salinity and dissolved oxygen data. A final accuracy of 0.002°C, 0.002 psu and 0.04 ml/l (2.3 umol/kg) was obtained on final profiles of temperature, salinity and dissolved oxygen, compatible with international requirements issued from the WOCE program.

Veja mais

927 resultados para Illinois. Data Processing Center.

Filtro por publicador