982 results for Online analytical processing (OLAP)
Abstract:
The concepts of on-line transactional processing (OLTP) and on-line analytical processing (OLAP) are often confused with the technologies or models that are used to design transactional and analytics-based information systems. This confusion has contributed to gaps between the semantics of information captured during transactional processing and of information stored for analytical use. In this paper, we propose a unified semantics design model as a solution to help bridge the semantic gaps between data captured by OLTP systems and the information provided by OLAP systems. The central focus of this design approach is on enabling business intelligence using not just data, but data with context.
Abstract:
Big data is the term used to describe a collection of data so large in terms of volume, velocity and variety that it requires specific technologies and analytical methods to extract meaningful value. Many systems are increasingly characterized by enormous amounts of data to manage, originating from highly heterogeneous sources and in highly differentiated formats, and with extremely heterogeneous data quality. Another requirement in these systems may be the time factor: more and more systems need to obtain meaningful data from Big Data as soon as possible, and increasingly often the input to be handled is a continuous stream of information. Specific solutions for these cases, called Online Stream Processing, fit into this field. The goal of this thesis is to propose a working prototype that processes Instant Coupon data coming from different sources with different information and transmission formats and protocols, and that stores the processed data efficiently in order to provide real-time responses. The information sources can be of two types: XMPP and Eddystone. Once the system has received the incoming information, it extracts and processes it until it obtains meaningful data that can be used by third parties. These data are stored in Apache Cassandra. The biggest problem that had to be solved is that Apache Storm does not provide automatic rebalancing of resources, whereas in this specific case the distribution of customers during the day is highly variable and full of peaks. The internal rebalancing system exploits innovative technologies such as metrics and, based on throughput and execution latency, decides whether to increase or decrease the number of resources, or simply to do nothing if the statistics are within the desired threshold values.
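The threshold-based rebalancing decision described in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the `Thresholds` class, the `decide_rebalance` function, the numeric bounds and the one-worker step size are all hypothetical, and applying the resulting worker count to a running Apache Storm topology is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical bounds; the abstract does not state the actual values.
    max_latency_ms: float = 200.0
    min_latency_ms: float = 20.0
    max_throughput_per_worker: float = 1000.0
    min_throughput_per_worker: float = 100.0


def decide_rebalance(throughput, latency_ms, workers, limits=Thresholds()):
    """Return the new worker count: one more, one fewer, or unchanged."""
    per_worker = throughput / workers
    # Either metric above its upper bound signals an overloaded topology.
    overloaded = (latency_ms > limits.max_latency_ms
                  or per_worker > limits.max_throughput_per_worker)
    # Both metrics below their lower bounds signal spare capacity.
    idle = (latency_ms < limits.min_latency_ms
            and per_worker < limits.min_throughput_per_worker)
    if overloaded:
        return workers + 1
    if idle and workers > 1:
        return workers - 1
    # Statistics are within the desired thresholds: do nothing.
    return workers


print(decide_rebalance(throughput=5000, latency_ms=250, workers=4))
print(decide_rebalance(throughput=200, latency_ms=10, workers=4))
print(decide_rebalance(throughput=2000, latency_ms=100, workers=4))
```

Scaling out when either metric is too high but scaling in only when both are low is a deliberately conservative assumption, chosen here so that a load peak is not answered by removing resources.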
Abstract:
Thesis submitted to Universidade Portucalense for the degree of Master in Informatics, prepared under the supervision of Prof. Doutor Reis Lima and Eng. Jorge S. Coelho.
Abstract:
Master's dissertation
Abstract:
This work is based on a theoretical analysis of information systems topics such as data warehousing, OLAP cubes and business intelligence. Next, the economic sectors of Colombia are analyzed, with special interest in the food sector, in order to contextualize the company on which this work focuses. The work includes an analysis of the Summerwood Corporation success story, which provides a justification for the final proposal presented to Dipsa Food, an SME dedicated to the production of non-perishable foods located in the city of Bogotá D.C., Colombia, which has a strong interest in developing new technologies that provide reliable information for decision-making.
Abstract:
The increasing availability of mobility data and the awareness of its importance and value have motivated many researchers to develop models and tools for analyzing movement data. This paper presents a brief survey of significant research on modeling, processing and visualization of data about moving objects. We identified some key research fields that will provide better features for online analysis of movement data. As a result of the literature review, we suggest a generic multi-layer architecture for the development of an online analysis processing software tool, which will be used to define the future work of our team.
Abstract:
Inspired by the relational algebra of data processing, this paper addresses the foundations of data analytical processing from a linear algebra perspective. The paper investigates, in particular, how aggregation operations such as cross tabulations and data cubes essential to quantitative analysis of data can be expressed solely in terms of matrix multiplication, transposition and the Khatri–Rao variant of the Kronecker product. The approach offers a basis for deriving an algebraic theory of data consolidation, handling the quantitative as well as qualitative sides of data science in a natural, elegant and typed way. It also shows potential for parallel analytical processing, as the parallelization theory of such matrix operations is well acknowledged.
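As a rough illustration of the idea in this abstract, the sketch below computes a cross tabulation and a flattened two-dimensional cube using only matrix multiplication, transposition and a column-wise (Khatri-Rao) product. It is a sketch under stated assumptions, not the paper's typed formalism: the helper names (`membership`, `khatri_rao`, `scale_columns`) and the sample columns are hypothetical, and plain Python lists stand in for matrices.

```python
def membership(column):
    """One-hot matrix: one row per distinct value, one column per record."""
    values = sorted(set(column))
    return values, [[1 if cell == v else 0 for cell in column] for v in values]


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def khatri_rao(a, b):
    """Column-wise Kronecker product of two matrices with equal column counts."""
    n = len(a[0])
    return [[ra[j] * rb[j] for j in range(n)] for ra in a for rb in b]


def scale_columns(m, weights):
    """Multiply column j by weights[j] (same as m times diag(weights))."""
    return [[cell * w for cell, w in zip(row, weights)] for row in m]


# Hypothetical sample data: two dimension columns and one measure column.
region = ["north", "south", "north", "south", "north"]
product = ["tea", "tea", "coffee", "coffee", "tea"]
sales = [10, 20, 30, 40, 50]

region_values, t_region = membership(region)
product_values, t_product = membership(product)

# Counting cross tabulation: rows are regions, columns are products.
counts = matmul(t_region, transpose(t_product))
# Cross tabulation of the measure: weight each record by its sales value.
totals = matmul(scale_columns(t_region, sales), transpose(t_product))
# Khatri-Rao pairs the two dimensions; multiplying by the measure vector
# yields the same totals as a single flattened column.
cube = matmul(khatri_rao(t_region, t_product), [[s] for s in sales])

print(region_values, product_values)
print(counts)
print(totals)
print(cube)
```

Because every step is a matrix product, the same computation could in principle be handed to a parallel linear algebra library, which is the parallelization potential the abstract points to.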
Abstract:
Proton nuclear magnetic resonance (¹H NMR) spectroscopy is a successful technique for detecting biochemical changes in biological samples. However, the achieved NMR resolution is not sufficiently high when the analysis is performed with intact cells. To improve spectral resolution, high resolution magic angle spinning (HR-MAS) is used and the broad signals are separated by a T2 filter based on the CPMG pulse sequence. Additionally, HR-MAS experiments with a T2 filter are preceded by a water suppression procedure. The goal of this work is to demonstrate that the experimental procedures of water suppression and T2 or diffusion filters are unnecessary steps when the filter diagonalization method (FDM) is used to process the time domain HR-MAS signals. Manipulation of the FDM results, represented as a tabular list of peak positions, widths, amplitudes and phases, allows the removal of water signals without disturbing overlapping or nearby signals. Additionally, the FDM can be used for phase correction and noise suppression, and to discriminate between sharp and broad lines. Results demonstrate the applicability of FDM post-acquisition processing to obtain high quality HR-MAS spectra of heterogeneous biological materials.
Abstract:
Title on spine: Library applications of data processing: 1972.
Abstract:
This thesis is a study of performance management of Complex Event Processing (CEP) systems. CEP systems have characteristics distinct from those of other well-studied computer systems, such as batch and online transaction processing systems and database-centric applications, and these characteristics introduce new challenges and opportunities for the performance management of CEP systems. The methodologies used to benchmark CEP systems in many performance studies focus on scaling the load injection, but do not consider the impact of the functional capabilities of CEP systems. This thesis proposes evaluating the performance of CEP engines' functional behaviours on events and develops a benchmark platform for CEP systems: CEPBen. The CEPBen benchmark platform is developed to explore the fundamental functional performance of event processing systems: filtering, transformation and event pattern detection. It is also designed to provide a flexible environment for exploring new metrics and influential factors for CEP systems and for evaluating the performance of CEP systems. Studies on factors and new metrics are carried out using the CEPBen benchmark platform on Esper. Different measurement points of response time in the performance management of CEP systems are discussed, and the response time of a targeted event is proposed as a metric for quality of service evaluation, in combination with the traditional response time in CEP systems. Two further indicators are proposed for the performance management of CEP systems: maximum query load as a capacity indicator with regard to the complexity of queries, and the number of live objects in memory as a performance indicator with regard to memory management. Query depth is studied as a performance factor that influences CEP system performance.
Abstract:
The presence of inhibitory substances in biological forensic samples has affected, and continues to affect, the quality of the data generated following DNA typing processes. Although the chemistries used during the procedures have been enhanced to mitigate the effects of these deleterious compounds, some challenges remain. Inhibitors can be components of the samples, the substrate where samples were deposited or chemical(s) associated with the DNA purification step. Therefore, a thorough understanding of the extraction processes and their ability to handle the various types of inhibitory substances can help define the best analytical processing for any given sample. A series of experiments was conducted to establish the inhibition tolerance of quantification and amplification kits using common inhibitory substances in order to determine whether current laboratory practices are optimal for identifying potential problems associated with inhibition. DART mass spectrometry was used to determine the amount of inhibitor carryover after sample purification, its correlation to the initial inhibitor input in the sample and the overall effect on the results. Finally, a novel alternative for gathering investigative leads from samples that would otherwise be ineffective for DNA typing due to large amounts of inhibitory substances and/or environmental degradation was tested. This included generating data associated with microbial peak signatures to identify locations of clandestine human graves. Results demonstrate that the current methods for assessing inhibition are not necessarily accurate, as samples that appear inhibited in the quantification process can yield full DNA profiles, while those that do not indicate inhibition may suffer from lowered amplification efficiency or PCR artifacts.
The extraction methods tested were able to remove >90% of the inhibitors from all samples with the exception of phenol, which was present in variable amounts whenever the organic extraction approach was utilized. Although the results suggest that most inhibitors have minimal effect on downstream applications, analysts should exercise caution when selecting the best extraction method for particular samples, as casework DNA samples are often present in small quantities and can contain an overwhelming amount of inhibitory substances.
Abstract:
In today’s big data world, data is being produced in massive volumes, at great velocity and from a variety of different sources such as mobile devices, sensors, a plethora of small devices hooked to the internet (Internet of Things), social networks, communication networks and many others. Interactive querying and large-scale analytics are being increasingly used to derive value out of this big data. A large portion of this data is being stored and processed in the Cloud due to the several advantages provided by the Cloud, such as scalability, elasticity, availability, low cost of ownership and the overall economies of scale. There is thus a growing need for large-scale cloud-based data management systems that can support real-time ingest, storage and processing of large volumes of heterogeneous data. However, in the pay-as-you-go Cloud environment, the cost of analytics can grow linearly with the time and resources required. Reducing the cost of data analytics in the Cloud thus remains a primary challenge. In my dissertation research, I have focused on building efficient and cost-effective cloud-based data management systems for different application domains that are predominant in cloud computing environments. In the first part of my dissertation, I address the problem of reducing the cost of transactional workloads on relational databases to support database-as-a-service in the Cloud. The primary challenges in supporting such workloads include choosing how to partition the data across a large number of machines, minimizing the number of distributed transactions, providing high data availability, and tolerating failures gracefully.
I have designed, built and evaluated SWORD, an end-to-end scalable online transaction processing system that utilizes workload-aware data placement and replication to minimize the number of distributed transactions, and that incorporates a suite of novel techniques to significantly reduce the overheads incurred both during the initial placement of data and during query execution at runtime. In the second part of my dissertation, I focus on sampling-based progressive analytics as a means to reduce the cost of data analytics in the relational domain. Sampling has been traditionally used by data scientists to get progressive answers to complex analytical tasks over large volumes of data. Typically, this involves manually extracting samples of increasing data size (progressive samples) for exploratory querying. This provides the data scientists with user control, repeatable semantics, and result provenance. However, such solutions result in tedious workflows that preclude the reuse of work across samples. On the other hand, existing approximate query processing systems report early results, but do not offer the above benefits for complex ad-hoc queries. I propose a new progressive data-parallel computation framework, NOW!, that provides support for progressive analytics over big data. In particular, NOW! enables progressive relational (SQL) query support in the Cloud using unique progress semantics that allow efficient and deterministic query processing over samples, providing meaningful early results and provenance to data scientists. NOW! enables the provision of early results using significantly fewer resources, thereby enabling a substantial reduction in the cost incurred during such analytics. Finally, I propose NSCALE, a system for efficient and cost-effective complex analytics on large-scale graph-structured data in the Cloud.
The system is based on the key observation that a wide range of complex analysis tasks over graph data require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph; examples include ego network analysis, motif counting in biological networks, finding social circles in social networks, personalized recommendations, link prediction, etc. These tasks are not well served by existing vertex-centric graph processing frameworks, whose computation and execution models limit the user program to directly accessing the state of a single vertex, resulting in high execution overheads. Further, the lack of support for extracting the relevant portions of the graph that are of interest to an analysis task and loading them onto distributed memory leads to poor scalability. NSCALE allows users to write programs at the level of neighborhoods or subgraphs rather than at the level of vertices, and to declaratively specify the subgraphs of interest. It enables the efficient distributed execution of these neighborhood-centric complex analysis tasks over large-scale graphs, while minimizing resource consumption and communication cost, thereby substantially reducing the overall cost of graph data analytics in the Cloud. The results of our extensive experimental evaluation of these prototypes with several real-world data sets and applications validate the effectiveness of our techniques, which provide orders-of-magnitude reductions in the overheads of distributed data querying and analysis in the Cloud.
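The sampling-based progressive analytics discussed in the second part of this abstract can be illustrated with a small sketch. It does not reproduce NOW! or its progress semantics; it only shows, under simple assumptions, how nested samples of increasing size can give repeatable early answers while reusing the work done for smaller samples. The function name `progressive_means` and the fixed-seed shuffle are hypothetical choices made for this example.

```python
import random


def progressive_means(values, sample_sizes, seed=0):
    """Yield (sample_size, mean) over nested samples of increasing size."""
    rng = random.Random(seed)
    order = list(values)
    rng.shuffle(order)  # a fixed seed keeps the samples repeatable
    running_total = 0.0
    consumed = 0
    for size in sorted(sample_sizes):
        size = min(size, len(order))
        if size == 0:
            continue  # an empty sample has no mean
        # Only the newly added records are scanned; earlier work is reused.
        running_total += sum(order[consumed:size])
        consumed = size
        yield size, running_total / size


values = list(range(1, 101))
results = list(progressive_means(values, [10, 50, 100]))
for size, estimate in results:
    print(size, estimate)
```

Each sample is a prefix of the same shuffled order, so a larger sample always contains the smaller ones and the final estimate over the full data equals the exact mean.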
Abstract:
The decision-making process in university libraries is of utmost importance; however, it faces complications such as the large number of data sources and the large volumes of data to analyze. University libraries are used to producing and collecting a large amount of information about their data and services. Common data sources are the output of internal systems, online portals and catalogs, quality assessments and surveys. Unfortunately, these data sources are only partially used for decision-making because of the wide variety of formats and standards, as well as the lack of efficient methods and integration tools. This thesis project presents the analysis, design and implementation of a Data Warehouse, which is an integrated decision-making system for the Centro de Documentación Juan Bautista Vázquez. First, the requirements and the data analysis are presented based on a methodology that incorporates the key elements that influence a library decision, including process analysis, estimated quality, relevant information and user interaction. Next, the architecture and design of the Data Warehouse are proposed, together with its implementation, which supports the integration, processing and storage of data. Finally, the stored data are analyzed through analytical processing tools and the application of Bibliomining techniques, helping the administrators of the documentation center make optimal decisions about their resources and services.
Abstract:
Geographic Data Warehouses (GDW) are one of the main technologies used in decision-making processes and spatial analysis, and the literature proposes several conceptual and logical data models for GDW. However, little effort has been focused on studying how spatial data redundancy affects SOLAP (Spatial On-Line Analytical Processing) query performance over GDW. In this paper, we investigate this issue. Firstly, we compare redundant and non-redundant GDW schemas and conclude that redundancy is related to high performance losses. We also analyze the issue of indexing, aiming at improving SOLAP query performance on a redundant GDW. Comparisons of the SB-index approach, the star-join aided by R-tree and the star-join aided by GiST indicate that the SB-index reduces the elapsed time in query processing by 25% to 99% for SOLAP queries defined over the spatial predicates of intersection, enclosure and containment and applied to roll-up and drill-down operations. We also investigate the impact of the increase in data volume on performance. The increase did not impair the performance of the SB-index, which still greatly improved the elapsed time in query processing. Performance tests also show that the SB-index is far more compact than the star-join, requiring at most 0.20% of its volume. Moreover, we propose a specific enhancement of the SB-index to deal with spatial data redundancy. This enhancement improved performance by 80% to 91% for redundant GDW schemas.
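To make the kind of query studied in this abstract concrete, the sketch below runs a roll-up from city to state restricted by a spatial intersection predicate. It is only an illustration under simplifying assumptions: it scans the fact table linearly and does not implement the SB-index, R-tree or GiST, axis-aligned bounding boxes stand in for real geometries, and the `CITIES` and `FACTS` tables are hypothetical.

```python
from collections import defaultdict


def intersects(a, b):
    """True when two boxes given as (min_x, min_y, max_x, max_y) overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Hypothetical spatial dimension table: city -> (state, bounding box).
CITIES = {
    "A": ("S1", (0, 0, 2, 2)),
    "B": ("S1", (3, 0, 5, 2)),
    "C": ("S2", (6, 0, 8, 2)),
}
# Hypothetical fact table: (city, revenue).
FACTS = [("A", 10), ("B", 20), ("C", 30), ("A", 5)]


def rollup_by_state(window):
    """Sum revenue per state over cities whose box intersects the window."""
    totals = defaultdict(int)
    for city, revenue in FACTS:
        state, box = CITIES[city]
        if intersects(box, window):
            totals[state] += revenue
    return dict(totals)


print(rollup_by_state((1, 1, 4, 4)))
print(rollup_by_state((0, 0, 10, 10)))
```

In a redundant schema the geometry would be stored repeatedly alongside the facts, while here it is kept once in the dimension table; the predicate is evaluated per fact only to keep the sketch short.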