931 resultados para large data sets
Resumo:
An important competence of human data analysts is to interpret and explain the meaning of the results of data analysis to end-users. However, existing automatic solutions for intelligent data analysis provide limited help to interpret and communicate information to non-expert users. In this paper we present a general approach to generating explanatory descriptions about the meaning of quantitative sensor data. We propose a type of web application: a virtual newspaper with automatically generated news stories that describe the meaning of sensor data. This solution integrates a variety of techniques from intelligent data analysis into a web-based multimedia presentation system. We validated our approach in a real world problem and demonstrate its generality using data sets from several domains. Our experience shows that this solution can facilitate the use of sensor data by general users and, therefore, can increase the utility of sensor network infrastructures.
Resumo:
In recent years, applications in domains such as telecommunications, network security or large scale sensor networks showed the limits of the traditional store-then-process paradigm. In this context, Stream Processing Engines emerged as a candidate solution for all these applications demanding for high processing capacity with low processing latency guarantees. With Stream Processing Engines, data streams are not persisted but rather processed on the fly, producing results continuously. Current Stream Processing Engines, either centralized or distributed, do not scale with the input load due to single-node bottlenecks. Moreover, they are based on static configurations that lead to either under or over-provisioning. This Ph.D. thesis discusses StreamCloud, an elastic paralleldistributed stream processing engine that enables for processing of large data stream volumes. Stream- Cloud minimizes the distribution and parallelization overhead introducing novel techniques that split queries into parallel subqueries and allocate them to independent sets of nodes. Moreover, Stream- Cloud elastic and dynamic load balancing protocols enable for effective adjustment of resources depending on the incoming load. Together with the parallelization and elasticity techniques, Stream- Cloud defines a novel fault tolerance protocol that introduces minimal overhead while providing fast recovery. StreamCloud has been fully implemented and evaluated using several real word applications such as fraud detection applications or network analysis applications. The evaluation, conducted using a cluster with more than 300 cores, demonstrates the large scalability, the elasticity and fault tolerance effectiveness of StreamCloud. Resumen En los útimos años, aplicaciones en dominios tales como telecomunicaciones, seguridad de redes y redes de sensores de gran escala se han encontrado con múltiples limitaciones en el paradigma tradicional de bases de datos. En este contexto, los sistemas de procesamiento de flujos de datos han emergido como solución a estas aplicaciones que demandan una alta capacidad de procesamiento con una baja latencia. En los sistemas de procesamiento de flujos de datos, los datos no se persisten y luego se procesan, en su lugar los datos son procesados al vuelo en memoria produciendo resultados de forma continua. Los actuales sistemas de procesamiento de flujos de datos, tanto los centralizados, como los distribuidos, no escalan respecto a la carga de entrada del sistema debido a un cuello de botella producido por la concentración de flujos de datos completos en nodos individuales. Por otra parte, éstos están basados en configuraciones estáticas lo que conducen a un sobre o bajo aprovisionamiento. Esta tesis doctoral presenta StreamCloud, un sistema elástico paralelo-distribuido para el procesamiento de flujos de datos que es capaz de procesar grandes volúmenes de datos. StreamCloud minimiza el coste de distribución y paralelización por medio de una técnica novedosa la cual particiona las queries en subqueries paralelas repartiéndolas en subconjuntos de nodos independientes. Ademas, Stream- Cloud posee protocolos de elasticidad y equilibrado de carga que permiten una optimización de los recursos dependiendo de la carga del sistema. Unidos a los protocolos de paralelización y elasticidad, StreamCloud define un protocolo de tolerancia a fallos que introduce un coste mínimo mientras que proporciona una rápida recuperación. StreamCloud ha sido implementado y evaluado mediante varias aplicaciones del mundo real tales como aplicaciones de detección de fraude o aplicaciones de análisis del tráfico de red. La evaluación ha sido realizada en un cluster con más de 300 núcleos, demostrando la alta escalabilidad y la efectividad tanto de la elasticidad, como de la tolerancia a fallos de StreamCloud.
Resumo:
Spatial variability of Vertisol properties is relevant for identifying those zones with physical degradation. In this sense, one has to face the problem of identifying the origin and distribution of spatial variability patterns. The objectives of the present work were (i) to quantify the spatial structure of different physical properties collected from a Vertisol, (ii) to search for potential correlations between different spatial patterns and (iii) to identify relevant components through multivariate spatial analysis. The study was conducted on a Vertisol (Typic Hapludert) dedicated to sugarcane (Saccharum officinarum L.) production during the last sixty years. We used six soil properties collected from a squared grid (225 points) (penetrometer resistance (PR), total porosity, fragmentation dimension (Df), vertical electrical conductivity (ECv), horizontal electrical conductivity (ECh) and soil water content (WC)). All the original data sets were z-transformed before geostatistical analysis. Three different types of semivariogram models were necessary for fitting individual experimental semivariograms. This suggests the different natures of spatial variability patterns. Soil water content rendered the largest nugget effect (C0 = 0.933) while soil total porosity showed the largest range of spatial correlation (A = 43.92 m). The bivariate geostatistical analysis also rendered significant cross-semivariance between different paired soil properties. However, four different semivariogram models were required in that case. This indicates an underlying co-regionalization between different soil properties, which is of interest for delineating management zones within sugarcane fields. Cross-semivariograms showed larger correlation ranges than individual, univariate, semivariograms (A ≥ 29 m). All the findings were supported by multivariate spatial analysis, which showed the influence of soil tillage operations, harvesting machinery and irrigation water distribution on the status of the investigated area.
Resumo:
Most empirical disciplines promote the reuse and sharing of datasets, as it leads to greater possibility of replication. While this is increasingly the case in Empirical Software Engineering, some of the most popular bug-fix datasets are now known to be biased. This raises two significants concerns: first, that sample bias may lead to underperforming prediction models, and second, that the external validity of the studies based on biased datasets may be suspect. This issue has raised considerable consternation in the ESE literature in recent years. However, there is a confounding factor of these datasets that has not been examined carefully: size. Biased datasets are sampling only some of the data that could be sampled, and doing so in a biased fashion; but biased samples could be smaller, or larger. Smaller data sets in general provide less reliable bases for estimating models, and thus could lead to inferior model performance. In this setting, we ask the question, what affects performance more? bias, or size? We conduct a detailed, large-scale meta-analysis, using simulated datasets sampled with bias from a high-quality dataset which is relatively free of bias. Our results suggest that size always matters just as much bias direction, and in fact much more than bias direction when considering information-retrieval measures such as AUC and F-score. This indicates that at least for prediction models, even when dealing with sampling bias, simply finding larger samples can sometimes be sufficient. Our analysis also exposes the complexity of the bias issue, and raises further issues to be explored in the future.
Resumo:
Images acquired during free breathing using first-pass gadolinium-enhanced myocardial perfusion magnetic resonance imaging (MRI) exhibit a quasiperiodic motion pattern that needs to be compensated for if a further automatic analysis of the perfusion is to be executed. In this work, we present a method to compensate this movement by combining independent component analysis (ICA) and image registration: First, we use ICA and a time?frequency analysis to identify the motion and separate it from the intensity change induced by the contrast agent. Then, synthetic reference images are created by recombining all the independent components but the one related to the motion. Therefore, the resulting image series does not exhibit motion and its images have intensities similar to those of their original counterparts. Motion compensation is then achieved by using a multi-pass image registration procedure. We tested our method on 39 image series acquired from 13 patients, covering the basal, mid and apical areas of the left heart ventricle and consisting of 58 perfusion images each. We validated our method by comparing manually tracked intensity profiles of the myocardial sections to automatically generated ones before and after registration of 13 patient data sets (39 distinct slices). We compared linear, non-linear, and combined ICA based registration approaches and previously published motion compensation schemes. Considering run-time and accuracy, a two-step ICA based motion compensation scheme that first optimizes a translation and then for non-linear transformation performed best and achieves registration of the whole series in 32 ± 12 s on a recent workstation. The proposed scheme improves the Pearsons correlation coefficient between manually and automatically obtained time?intensity curves from .84 ± .19 before registration to .96 ± .06 after registration