977 results for Data Deduplication Compression
Abstract:
Data deduplication describes a class of approaches that reduce the storage capacity needed to store data or the amount of data that has to be transferred over a network. These approaches detect coarse-grained redundancies within a data set, e.g. a file system, and remove them.

One of the most important applications of data deduplication is backup storage systems, where these approaches are able to reduce the storage requirements to a small fraction of the logical backup data size.

This thesis introduces multiple new extensions of so-called fingerprinting-based data deduplication. It starts with the presentation of a novel system design, which allows using a cluster of servers to perform exact data deduplication with small chunks in a scalable way.

Afterwards, a combination of compression approaches for an important, but often overlooked, data structure in data deduplication systems, so-called block and file recipes, is introduced. Using these compression approaches that exploit unique properties of data deduplication systems, the size of these recipes can be reduced by more than 92% in all investigated data sets. As file recipes can occupy a significant fraction of the overall storage capacity of data deduplication systems, the compression enables significant savings.

A technique to increase the write throughput of data deduplication systems, based on the aforementioned block and file recipes, is introduced next. The novel Block Locality Caching (BLC) uses properties of block and file recipes to overcome the chunk lookup disk bottleneck of data deduplication systems. This chunk lookup disk bottleneck limits either the scalability or the throughput of data deduplication systems. The presented BLC overcomes the disk bottleneck more efficiently than existing approaches. Furthermore, it is shown that the BLC is less prone to aging effects.

Finally, it is investigated whether large HPC storage systems exhibit redundancies that can be found by fingerprinting-based data deduplication. Over 3 PB of HPC storage data from different data sets have been analyzed. In most data sets, between 20 and 30% of the data can be classified as redundant. According to these results, future work should further investigate how data deduplication can be integrated into future HPC storage systems.

This thesis presents important novel work in different areas of data deduplication research.
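The fingerprinting-based approach described in this abstract can be illustrated with a minimal sketch. It is not the thesis's system: the fixed 8-byte chunk size, the SHA-1 fingerprint, the in-memory dictionary used as chunk index, and the function names `chunk_fixed`, `deduplicate`, and `restore` are all assumptions made for illustration. The list of fingerprints returned for a file plays the role of a file recipe.

```python
import hashlib

def chunk_fixed(data: bytes, size: int = 8):
    """Split data into fixed-size chunks (assumption: real systems often
    use content-defined chunking instead)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def deduplicate(data: bytes, chunk_store: dict, size: int = 8):
    """Store each previously unseen chunk once and return the file recipe,
    i.e. the ordered list of chunk fingerprints."""
    recipe = []
    for chunk in chunk_fixed(data, size):
        fp = hashlib.sha1(chunk).hexdigest()  # fingerprint of the chunk
        if fp not in chunk_store:             # chunk lookup in the index
            chunk_store[fp] = chunk
        recipe.append(fp)
    return recipe

def restore(recipe, chunk_store):
    """Rebuild the original data by following the recipe."""
    return b"".join(chunk_store[fp] for fp in recipe)

store = {}
backup = b"ABCDEFGH" * 3 + b"12345678"
recipe = deduplicate(backup, store)
print(len(recipe), "chunks referenced,", len(store), "chunks stored")
```

In this toy input the recipe references four chunks while only two distinct chunks are stored, which also shows why recipes themselves can take up a noticeable share of the stored data.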
Abstract:
The main goal of this research is to design an efficient compression algorithm for fingerprint images. The wavelet transform technique is the principal tool used to reduce interpixel redundancies and to obtain a parsimonious representation for these images. A specific fixed decomposition structure is designed to be used by the wavelet packet in order to save on the computation, transmission, and storage costs. This decomposition structure is based on analysis of information packing performance of several decompositions, two-dimensional power spectral density, effect of each frequency band on the reconstructed image, and the human visual sensitivities. This fixed structure is found to provide the "most" suitable representation for fingerprints, according to the chosen criteria. Different compression techniques are used for different subbands, based on their observed statistics. The decision is based on the effect of each subband on the reconstructed image according to the mean square criteria as well as the sensitivities in human vision. To design an efficient quantization algorithm, a precise model for distribution of the wavelet coefficients is developed. The model is based on the generalized Gaussian distribution. A least squares algorithm on a nonlinear function of the distribution model shape parameter is formulated to estimate the model parameters. A noise shaping bit allocation procedure is then used to assign the bit rate among subbands. To obtain high compression ratios, vector quantization is used. In this work, the lattice vector quantization (LVQ) is chosen because of its superior performance over other types of vector quantizers. The structure of a lattice quantizer is determined by its parameters known as truncation level and scaling factor. In lattice-based compression algorithms reported in the literature, the lattice structure is commonly predetermined, leading to a nonoptimized quantization approach.
In this research, a new technique for determining the lattice parameters is proposed. In the lattice structure design, no assumption about the lattice parameters is made and no training and multi-quantizing is required. The design is based on minimizing the quantization distortion by adapting to the statistical characteristics of the source in each subimage. Since LVQ is a multidimensional generalization of uniform quantizers, it produces minimum distortion for inputs with uniform distributions. In order to take advantage of the properties of LVQ and its fast implementation, while considering the i.i.d. nonuniform distribution of wavelet coefficients, the piecewise-uniform pyramid LVQ algorithm is proposed. The proposed algorithm quantizes almost all source vectors without the need to project them onto the outermost shell of the lattice, while it properly maintains a small codebook size. It also resolves the wedge region problem commonly encountered with sharply distributed random sources. These issues are among the drawbacks of the algorithm proposed by Barlaud [26]. The proposed algorithm handles all types of lattices, not only the cubic lattices, as opposed to the algorithms developed by Fischer [29] and Jeong [42]. Furthermore, no training and multiquantizing (to determine lattice parameters) is required, as opposed to Powell's algorithm [78]. For coefficients with high-frequency content, the positive-negative mean algorithm is proposed to improve the resolution of reconstructed images. For coefficients with low-frequency content, a lossless predictive compression scheme is used to preserve the quality of reconstructed images. A method to reduce bit requirements of necessary side information is also introduced. Lossless entropy coding techniques are subsequently used to remove coding redundancy. The algorithms result in high quality reconstructed images with better compression ratios than other available algorithms.
To evaluate the proposed algorithms, their objective and subjective performance comparisons with other available techniques are presented. The quality of the reconstructed images is important for a reliable identification. Enhancement and feature extraction on the reconstructed images are also investigated in this research. A structural-based feature extraction algorithm is proposed in which the unique properties of fingerprint textures are used to enhance the images and improve the fidelity of their characteristic features. The ridges are extracted from enhanced grey-level foreground areas based on the local ridge dominant directions. The proposed ridge extraction algorithm properly preserves the natural shape of grey-level ridges as well as precise locations of the features, as opposed to the ridge extraction algorithm in [81]. Furthermore, it is fast and operates only on foreground regions, as opposed to the adaptive floating average thresholding process in [68]. Spurious features are subsequently eliminated using the proposed post-processing scheme.
Abstract:
Active Grids are a form of grid infrastructure where the grid network is active and programmable. These grids directly support applications with value added services such as data migration, compression, adaptation and monitoring. Services such as these are particularly important for eResearch applications which by their very nature are performance critical and data intensive. We propose an architecture for improving the flexibility of Active Grids through web services. These enable Active Grid services to be easily and flexibly configured, monitored and deployed from practically any platform or application. The architecture is called WeSPNI ('Web Services based on Programmable Networks Infrastructure'). We present the architecture together with some early experimental results on using web services to monitor data movement in an active grid.
Abstract:
We present an algorithm for estimating dense image correspondences. Our versatile approach lends itself to various tasks typical for video post-processing, including image morphing, optical flow estimation, stereo rectification, disparity/depth reconstruction, and baseline adjustment. We incorporate recent advances in feature matching, energy minimization, stereo vision, and data clustering into our approach. At the core of our correspondence estimation we use Efficient Belief Propagation for energy minimization. While state-of-the-art algorithms only work on thumbnail-sized images, our novel feature downsampling scheme, in combination with a simple yet efficient data term compression, can cope with high-resolution data. The incorporation of SIFT (Scale-Invariant Feature Transform) features into data term computation further resolves matching ambiguities, making long-range correspondence estimation possible. We detect occluded areas by evaluating the correspondence symmetry, and we further apply Geodesic matting to automatically determine plausible values in these regions.
Abstract:
High-speed videokeratoscopy is an emerging technique that enables study of the corneal surface and tear-film dynamics. Unlike its static predecessor, this new technique results in a very large amount of digital data for which storage needs become significant. We aimed to design a compression technique that would use mathematical functions to parsimoniously fit corneal surface data with a minimum number of coefficients. Since the Zernike polynomial functions that have been traditionally used for modeling corneal surfaces may not necessarily correctly represent given corneal surface data in terms of its optical performance, we introduced the concept of Zernike polynomial-based rational functions. Modeling optimality criteria were employed in terms of both the rms surface error as well as the point spread function cross-correlation. The parameters of approximations were estimated using a nonlinear least-squares procedure based on the Levenberg-Marquardt algorithm. A large number of retrospective videokeratoscopic measurements were used to evaluate the performance of the proposed rational-function-based modeling approach. The results indicate that the rational functions almost always outperform the traditional Zernike polynomial approximations with the same number of coefficients.
Abstract:
We present the design and deployment results for PosNet - a large-scale, long-duration sensor network that gathers summary position and status information from mobile nodes. The mobile nodes have a fixed-sized memory buffer to which position data is added at a constant rate, and from which data is downloaded at a non-constant rate. We have developed a novel algorithm that performs online summarization of position data within the buffer, where the algorithm naturally accommodates data input and output rate mismatch, and also provides a delay-tolerant approach to data transport. The algorithm has been extensively tested in a large-scale long-duration cattle monitoring and control application.
Abstract:
The EEG time series has been subjected to various formalisms of analysis to extract meaningful information regarding the underlying neural events. In this paper the linear prediction (LP) method has been used for analysis and presentation of spectral array data for the better visualisation of background EEG activity. It has also been used for signal generation, efficient data storage and transmission of EEG. The LP method is compared with the standard Fourier method of compressed spectral array (CSA) of the multichannel EEG data. The autocorrelation autoregressive (AR) technique is used for obtaining the LP coefficients with a model order of 15. While the Fourier method reduces the data only by half, the LP method requires only the storage of the signal variance and the LP coefficients. The signal generated using white Gaussian noise as the input to the LP filter has a high correlation coefficient of 0.97 with the original signal, thus making LP a useful tool for storage and transmission of EEG. The biological significance of the Fourier method and the LP method with respect to the microstructure of neuronal events in the generation of EEG is discussed.
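The autocorrelation approach to obtaining LP coefficients mentioned above can be sketched as follows. This is an illustrative sketch, not the paper's code: it assumes the standard Levinson-Durbin recursion for solving the Yule-Walker equations, and the names `autocorr`, `levinson_durbin`, and `synthesize` are hypothetical. The demo uses a hand-picked autocorrelation sequence and order 2 for brevity, whereas the abstract reports a model order of 15.

```python
import random

def autocorr(x, max_lag):
    """Biased autocorrelation estimates r[0..max_lag] of a sequence."""
    n = len(x)
    return [sum(x[t] * x[t + k] for t in range(n - k)) / n
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations. Returns (a, err) such that
    x[t] is predicted by sum(a[k] * x[t - 1 - k]) and err is the
    variance of the prediction residual."""
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        err *= (1.0 - k * k)
    return a, err

def synthesize(a, noise_var, n, seed=0):
    """Regenerate a signal by driving the LP filter with white Gaussian noise."""
    rng = random.Random(seed)
    x = []
    for t in range(n):
        pred = sum(a[k] * x[t - 1 - k] for k in range(min(len(a), t)))
        x.append(pred + rng.gauss(0.0, noise_var ** 0.5))
    return x

# Autocorrelation of an AR(1) process with coefficient 0.5 (illustrative values).
coeffs, residual_var = levinson_durbin([1.0, 0.5, 0.25], order=2)
regenerated = synthesize(coeffs, residual_var, n=100)
print(coeffs, residual_var)
```

Only `coeffs` and `residual_var` need to be stored to regenerate a statistically similar signal, which is the storage saving the abstract refers to.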
Abstract:
Two methods based on wavelet/wavelet packet expansion to denoise and compress optical tomography data containing scattered noise are presented. In the first, the wavelet expansion coefficients of noisy data are shrunk using a soft threshold. In the second, the data are expanded into a wavelet packet tree upon which a best basis search is done. The resulting coefficients are truncated on the basis of energy content. It can be seen that the first method results in efficient denoising of experimental data when scattering particle density in the medium surrounding the object was up to 12.0 × 10^6 per cm^3. This method achieves a compression ratio of approximately 8:1. The wavelet packet based method resulted in a compression of up to 11:1 and also exhibited reasonable noise reduction capability. Tomographic reconstructions obtained from denoised data are presented. (C) 1999 Published by Elsevier Science B.V. All rights reserved.
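The first method above, shrinking wavelet coefficients with a soft threshold, can be sketched in a few lines. The abstract does not name the wavelet family or the threshold rule, so the one-level orthonormal Haar transform, the fixed threshold value, and the function names below are assumptions chosen only to keep the example short.

```python
import math

def haar_forward(x):
    """One-level orthonormal Haar transform of an even-length sequence."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Invert haar_forward."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out

def soft_threshold(c, t):
    """Shrink coefficient c toward zero by t; small coefficients become zero."""
    return math.copysign(max(abs(c) - t, 0.0), c)

def denoise(x, t):
    """Soft-threshold the detail coefficients and reconstruct the signal."""
    approx, detail = haar_forward(x)
    detail = [soft_threshold(d, t) for d in detail]
    return haar_inverse(approx, detail)

noisy = [1.0, 1.1, 5.0, 4.9]
cleaned = denoise(noisy, t=0.2)
print(cleaned)
```

Coefficients that shrink to zero need not be stored, which is how the same step yields both denoising and compression.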
Abstract:
The introduction of processor-based instruments in power systems is resulting in rapid growth of the measured data volume. The present practice in most utilities is to store only some of the important data in a retrievable fashion for a limited period. Subsequently, even this data is either deleted or stored in backup devices. The investigations presented here explore the application of lossless data compression techniques for the purpose of archiving all the operational data, so that they can be put to more effective use. Four arithmetic coding methods, suitably modified for handling power system steady-state operational data, are proposed here. The performance of the proposed methods is evaluated using actual data pertaining to the Southern Regional Grid of India. (C) 2012 Elsevier Ltd. All rights reserved.
Abstract:
Low power consumption per channel and data rate minimization are two key challenges which need to be addressed in future generations of neural recording systems (NRS). Power consumption can be reduced by avoiding unnecessary processing, whereas data rate is greatly decreased by sending spike time-stamps along with spike features as opposed to raw digitized data. Dynamic range in NRS can vary with time due to change in electrode-neuron distance or background noise, which demands adaptability. An analog-to-digital converter (ADC) is one of the most important blocks in an NRS. This paper presents an 8-bit SAR ADC in 0.13-μm CMOS technology along with input and reference buffer. A novel energy-efficient digital-to-analog converter switching scheme is proposed, which consumes 37% less energy than the present state-of-the-art. The use of a ping-pong input sampling scheme is emphasized for multichannel input to alleviate the bandwidth requirement of the input buffer. To reduce the data rate, the A/D process is only enabled through the in-built background noise rejection logic to ensure that the noise is not processed. The ADC resolution can be adjusted from 8 to 1 bit in 1-bit steps based on the input dynamic range. The ADC consumes 8.8 μW from a 1 V supply at 1 MS/s speed. It achieves an effective number of bits of 7.7 and a FoM of 42.3 fJ/conversion-step.
Abstract:
First, the data to be compressed are regarded as character strings produced by a virtual information source mapping M. Then the model of the virtual information source M is established by a neural network and SVM. Finally, we construct a lossless data compression (coding) scheme based on the neural network and SVM, using the model, an integer function, and an SVM discriminant. The scheme differs intrinsically from conventional entropy coding (compression), and it can compress some data already compressed by conventional entropy coding.
Abstract:
In this paper, an introduction to the wavelet transform and multi-resolution analysis is presented. We describe three data compression methods based on the wavelet transform for spectral information, and, using multi-resolution analysis, we compress spectral data with Daubechies's compactly supported orthogonal wavelet and the orthogonal cubic B-spline wavelet. Using the orthogonal cubic B-spline wavelet, with the coefficients of the sharpening signal set to zero, only very few large coefficients are stored, and a favourable data compression can be achieved.
Abstract:
In many applications in applied statistics researchers reduce the complexity of a data set by combining a group of variables into a single measure using factor analysis or an index number. We argue that such compression loses information if the data actually has high dimensionality. We advocate the use of a non-parametric estimator, commonly used in physics (the Takens estimator), to estimate the correlation dimension of the data prior to compression. The advantage of this approach over traditional linear data compression approaches is that the data does not have to be linearized. Applying our ideas to the United Nations Human Development Index we find that the four variables that are used in its construction have dimension three and the index loses information.
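The correlation-dimension estimate mentioned in this abstract can be sketched as follows. The sketch assumes the usual maximum-likelihood form of the Takens estimator, in which the dimension is the negative reciprocal of the mean of ln(r / r0) over all point pairs whose distance r is below a cutoff r0. The cutoff value, the test data, and the function name `takens_estimator` are assumptions for illustration and are not taken from the paper.

```python
import math
from itertools import combinations

def takens_estimator(points, r0):
    """Estimate the correlation dimension from pairwise distances below r0.

    points: sequence of equal-length coordinate tuples.
    r0: distance cutoff (an assumed tuning parameter).
    """
    logs = []
    for p, q in combinations(points, 2):
        r = math.dist(p, q)
        if 0.0 < r < r0:                 # skip coincident and distant pairs
            logs.append(math.log(r / r0))
    if not logs:
        raise ValueError("no point pairs closer than r0")
    return -len(logs) / sum(logs)

# Points on a straight line embedded in the plane: the estimate should be near 1,
# even though each observation has two coordinates.
line_points = [(i / 199, 0.0) for i in range(200)]
estimate = takens_estimator(line_points, r0=0.1)
print(round(estimate, 2))
```

An estimate well below the number of variables would suggest that the data can be combined into fewer measures with little loss, while an estimate close to the number of variables is the situation the abstract warns about.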