6 resultados para scalability
em DRUM (Digital Repository at the University of Maryland)
Resumo:
In today’s big data world, data is being produced in massive volumes, at great velocity and from a variety of different sources such as mobile devices, sensors, a plethora of small devices hooked to the internet (Internet of Things), social networks, communication networks and many others. Interactive querying and large-scale analytics are being increasingly used to derive value out of this big data. A large portion of this data is being stored and processed in the Cloud due the several advantages provided by the Cloud such as scalability, elasticity, availability, low cost of ownership and the overall economies of scale. There is thus, a growing need for large-scale cloud-based data management systems that can support real-time ingest, storage and processing of large volumes of heterogeneous data. However, in the pay-as-you-go Cloud environment, the cost of analytics can grow linearly with the time and resources required. Reducing the cost of data analytics in the Cloud thus remains a primary challenge. In my dissertation research, I have focused on building efficient and cost-effective cloud-based data management systems for different application domains that are predominant in cloud computing environments. In the first part of my dissertation, I address the problem of reducing the cost of transactional workloads on relational databases to support database-as-a-service in the Cloud. The primary challenges in supporting such workloads include choosing how to partition the data across a large number of machines, minimizing the number of distributed transactions, providing high data availability, and tolerating failures gracefully. I have designed, built and evaluated SWORD, an end-to-end scalable online transaction processing system, that utilizes workload-aware data placement and replication to minimize the number of distributed transactions that incorporates a suite of novel techniques to significantly reduce the overheads incurred both during the initial placement of data, and during query execution at runtime. In the second part of my dissertation, I focus on sampling-based progressive analytics as a means to reduce the cost of data analytics in the relational domain. Sampling has been traditionally used by data scientists to get progressive answers to complex analytical tasks over large volumes of data. Typically, this involves manually extracting samples of increasing data size (progressive samples) for exploratory querying. This provides the data scientists with user control, repeatable semantics, and result provenance. However, such solutions result in tedious workflows that preclude the reuse of work across samples. On the other hand, existing approximate query processing systems report early results, but do not offer the above benefits for complex ad-hoc queries. I propose a new progressive data-parallel computation framework, NOW!, that provides support for progressive analytics over big data. In particular, NOW! enables progressive relational (SQL) query support in the Cloud using unique progress semantics that allow efficient and deterministic query processing over samples providing meaningful early results and provenance to data scientists. NOW! enables the provision of early results using significantly fewer resources thereby enabling a substantial reduction in the cost incurred during such analytics. Finally, I propose NSCALE, a system for efficient and cost-effective complex analytics on large-scale graph-structured data in the Cloud. The system is based on the key observation that a wide range of complex analysis tasks over graph data require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph; examples include ego network analysis, motif counting in biological networks, finding social circles in social networks, personalized recommendations, link prediction, etc. These tasks are not well served by existing vertex-centric graph processing frameworks whose computation and execution models limit the user program to directly access the state of a single vertex, resulting in high execution overheads. Further, the lack of support for extracting the relevant portions of the graph that are of interest to an analysis task and loading it onto distributed memory leads to poor scalability. NSCALE allows users to write programs at the level of neighborhoods or subgraphs rather than at the level of vertices, and to declaratively specify the subgraphs of interest. It enables the efficient distributed execution of these neighborhood-centric complex analysis tasks over largescale graphs, while minimizing resource consumption and communication cost, thereby substantially reducing the overall cost of graph data analytics in the Cloud. The results of our extensive experimental evaluation of these prototypes with several real-world data sets and applications validate the effectiveness of our techniques which provide orders-of-magnitude reductions in the overheads of distributed data querying and analysis in the Cloud.
Resumo:
In the past decade, systems that extract information from millions of Internet documents have become commonplace. Knowledge graphs -- structured knowledge bases that describe entities, their attributes and the relationships between them -- are a powerful tool for understanding and organizing this vast amount of information. However, a significant obstacle to knowledge graph construction is the unreliability of the extracted information, due to noise and ambiguity in the underlying data or errors made by the extraction system and the complexity of reasoning about the dependencies between these noisy extractions. My dissertation addresses these challenges by exploiting the interdependencies between facts to improve the quality of the knowledge graph in a scalable framework. I introduce a new approach called knowledge graph identification (KGI), which resolves the entities, attributes and relationships in the knowledge graph by incorporating uncertain extractions from multiple sources, entity co-references, and ontological constraints. I define a probability distribution over possible knowledge graphs and infer the most probable knowledge graph using a combination of probabilistic and logical reasoning. Such probabilistic models are frequently dismissed due to scalability concerns, but my implementation of KGI maintains tractable performance on large problems through the use of hinge-loss Markov random fields, which have a convex inference objective. This allows the inference of large knowledge graphs using 4M facts and 20M ground constraints in 2 hours. To further scale the solution, I develop a distributed approach to the KGI problem which runs in parallel across multiple machines, reducing inference time by 90%. Finally, I extend my model to the streaming setting, where a knowledge graph is continuously updated by incorporating newly extracted facts. I devise a general approach for approximately updating inference in convex probabilistic models, and quantify the approximation error by defining and bounding inference regret for online models. Together, my work retains the attractive features of probabilistic models while providing the scalability necessary for large-scale knowledge graph construction. These models have been applied on a number of real-world knowledge graph projects, including the NELL project at Carnegie Mellon and the Google Knowledge Graph.
Resumo:
This dissertation investigates the connection between spectral analysis and frame theory. When considering the spectral properties of a frame, we present a few novel results relating to the spectral decomposition. We first show that scalable frames have the property that the inner product of the scaling coefficients and the eigenvectors must equal the inverse eigenvalues. From this, we prove a similar result when an approximate scaling is obtained. We then focus on the optimization problems inherent to the scalable frames by first showing that there is an equivalence between scaling a frame and optimization problems with a non-restrictive objective function. Various objective functions are considered, and an analysis of the solution type is presented. For linear objectives, we can encourage sparse scalings, and with barrier objective functions, we force dense solutions. We further consider frames in high dimensions, and derive various solution techniques. From here, we restrict ourselves to various frame classes, to add more specificity to the results. Using frames generated from distributions allows for the placement of probabilistic bounds on scalability. For discrete distributions (Bernoulli and Rademacher), we bound the probability of encountering an ONB, and for continuous symmetric distributions (Uniform and Gaussian), we show that symmetry is retained in the transformed domain. We also prove several hyperplane-separation results. With the theory developed, we discuss graph applications of the scalability framework. We make a connection with graph conditioning, and show the in-feasibility of the problem in the general case. After a modification, we show that any complete graph can be conditioned. We then present a modification of standard PCA (robust PCA) developed by Cand\`es, and give some background into Electron Energy-Loss Spectroscopy (EELS). We design a novel scheme for the processing of EELS through robust PCA and least-squares regression, and test this scheme on biological samples. Finally, we take the idea of robust PCA and apply the technique of kernel PCA to perform robust manifold learning. We derive the problem and present an algorithm for its solution. There is also discussion of the differences with RPCA that make theoretical guarantees difficult.
Resumo:
Sequences of timestamped events are currently being generated across nearly every domain of data analytics, from e-commerce web logging to electronic health records used by doctors and medical researchers. Every day, this data type is reviewed by humans who apply statistical tests, hoping to learn everything they can about how these processes work, why they break, and how they can be improved upon. To further uncover how these processes work the way they do, researchers often compare two groups, or cohorts, of event sequences to find the differences and similarities between outcomes and processes. With temporal event sequence data, this task is complex because of the variety of ways single events and sequences of events can differ between the two cohorts of records: the structure of the event sequences (e.g., event order, co-occurring events, or frequencies of events), the attributes about the events and records (e.g., gender of a patient), or metrics about the timestamps themselves (e.g., duration of an event). Running statistical tests to cover all these cases and determining which results are significant becomes cumbersome. Current visual analytics tools for comparing groups of event sequences emphasize a purely statistical or purely visual approach for comparison. Visual analytics tools leverage humans' ability to easily see patterns and anomalies that they were not expecting, but is limited by uncertainty in findings. Statistical tools emphasize finding significant differences in the data, but often requires researchers have a concrete question and doesn't facilitate more general exploration of the data. Combining visual analytics tools with statistical methods leverages the benefits of both approaches for quicker and easier insight discovery. Integrating statistics into a visualization tool presents many challenges on the frontend (e.g., displaying the results of many different metrics concisely) and in the backend (e.g., scalability challenges with running various metrics on multi-dimensional data at once). I begin by exploring the problem of comparing cohorts of event sequences and understanding the questions that analysts commonly ask in this task. From there, I demonstrate that combining automated statistics with an interactive user interface amplifies the benefits of both types of tools, thereby enabling analysts to conduct quicker and easier data exploration, hypothesis generation, and insight discovery. The direct contributions of this dissertation are: (1) a taxonomy of metrics for comparing cohorts of temporal event sequences, (2) a statistical framework for exploratory data analysis with a method I refer to as high-volume hypothesis testing (HVHT), (3) a family of visualizations and guidelines for interaction techniques that are useful for understanding and parsing the results, and (4) a user study, five long-term case studies, and five short-term case studies which demonstrate the utility and impact of these methods in various domains: four in the medical domain, one in web log analysis, two in education, and one each in social networks, sports analytics, and security. My dissertation contributes an understanding of how cohorts of temporal event sequences are commonly compared and the difficulties associated with applying and parsing the results of these metrics. It also contributes a set of visualizations, algorithms, and design guidelines for balancing automated statistics with user-driven analysis to guide users to significant, distinguishing features between cohorts. This work opens avenues for future research in comparing two or more groups of temporal event sequences, opening traditional machine learning and data mining techniques to user interaction, and extending the principles found in this dissertation to data types beyond temporal event sequences.
Resumo:
With the proliferation of new mobile devices and applications, the demand for ubiquitous wireless services has increased dramatically in recent years. The explosive growth in the wireless traffic requires the wireless networks to be scalable so that they can be efficiently extended to meet the wireless communication demands. In a wireless network, the interference power typically grows with the number of devices without necessary coordination among them. On the other hand, large scale coordination is always difficult due to the low-bandwidth and high-latency interfaces between access points (APs) in traditional wireless networks. To address this challenge, cloud radio access network (C-RAN) has been proposed, where a pool of base band units (BBUs) are connected to the distributed remote radio heads (RRHs) via high bandwidth and low latency links (i.e., the front-haul) and are responsible for all the baseband processing. But the insufficient front-haul link capacity may limit the scale of C-RAN and prevent it from fully utilizing the benefits made possible by the centralized baseband processing. As a result, the front-haul link capacity becomes a bottleneck in the scalability of C-RAN. In this dissertation, we explore the scalable C-RAN in the effort of tackling this challenge. In the first aspect of this dissertation, we investigate the scalability issues in the existing wireless networks and propose a novel time-reversal (TR) based scalable wireless network in which the interference power is naturally mitigated by the focusing effects of TR communications without coordination among APs or terminal devices (TDs). Due to this nice feature, it is shown that the system can be easily extended to serve more TDs. Motivated by the nice properties of TR communications in providing scalable wireless networking solutions, in the second aspect of this dissertation, we apply the TR based communications to the C-RAN and discover the TR tunneling effects which alleviate the traffic load in the front-haul links caused by the increment of TDs. We further design waveforming schemes to optimize the downlink and uplink transmissions in the TR based C-RAN, which are shown to improve the downlink and uplink transmission accuracies. Consequently, the traffic load in the front-haul links is further alleviated by the reducing re-transmissions caused by transmission errors. Moreover, inspired by the TR-based C-RAN, we propose the compressive quantization scheme which applies to the uplink of multi-antenna C-RAN so that more antennas can be utilized with the limited front-haul capacity, which provide rich spatial diversity such that the massive TDs can be served more efficiently.
Resumo:
The proliferation of new mobile communication devices, such as smartphones and tablets, has led to an exponential growth in network traffic. The demand for supporting the fast-growing consumer data rates urges the wireless service providers and researchers to seek a new efficient radio access technology, which is the so-called 5G technology, beyond what current 4G LTE can provide. On the other hand, ubiquitous RFID tags, sensors, actuators, mobile phones and etc. cut across many areas of modern-day living, which offers the ability to measure, infer and understand the environmental indicators. The proliferation of these devices creates the term of the Internet of Things (IoT). For the researchers and engineers in the field of wireless communication, the exploration of new effective techniques to support 5G communication and the IoT becomes an urgent task, which not only leads to fruitful research but also enhance the quality of our everyday life. Massive MIMO, which has shown the great potential in improving the achievable rate with a very large number of antennas, has become a popular candidate. However, the requirement of deploying a large number of antennas at the base station may not be feasible in indoor scenarios. Does there exist a good alternative that can achieve similar system performance to massive MIMO for indoor environment? In this dissertation, we address this question by proposing the time-reversal technique as a counterpart of massive MIMO in indoor scenario with the massive multipath effect. It is well known that radio signals will experience many multipaths due to the reflection from various scatters, especially in indoor environments. The traditional TR waveform is able to create a focusing effect at the intended receiver with very low transmitter complexity in a severe multipath channel. TR's focusing effect is in essence a spatial-temporal resonance effect that brings all the multipaths to arrive at a particular location at a specific moment. We show that by using time-reversal signal processing, with a sufficiently large bandwidth, one can harvest the massive multipaths naturally existing in a rich-scattering environment to form a large number of virtual antennas and achieve the desired massive multipath effect with a single antenna. Further, we explore the optimal bandwidth for TR system to achieve maximal spectral efficiency. Through evaluating the spectral efficiency, the optimal bandwidth for TR system is found determined by the system parameters, e.g., the number of users and backoff factor, instead of the waveform types. Moreover, we investigate the tradeoff between complexity and performance through establishing a generalized relationship between the system performance and waveform quantization in a practical communication system. It is shown that a 4-bit quantized waveforms can be used to achieve the similar bit-error-rate compared to the TR system with perfect precision waveforms. Besides 5G technology, Internet of Things (IoT) is another terminology that recently attracts more and more attention from both academia and industry. In the second part of this dissertation, the heterogeneity issue within the IoT is explored. One of the significant heterogeneity considering the massive amount of devices in the IoT is the device heterogeneity, i.e., the heterogeneous bandwidths and associated radio-frequency (RF) components. The traditional middleware techniques result in the fragmentation of the whole network, hampering the objects interoperability and slowing down the development of a unified reference model for the IoT. We propose a novel TR-based heterogeneous system, which can address the bandwidth heterogeneity and maintain the benefit of TR at the same time. The increase of complexity in the proposed system lies in the digital processing at the access point (AP), instead of at the devices' ends, which can be easily handled with more powerful digital signal processor (DSP). Meanwhile, the complexity of the terminal devices stays low and therefore satisfies the low-complexity and scalability requirement of the IoT. Since there is no middleware in the proposed scheme and the additional physical layer complexity concentrates on the AP side, the proposed heterogeneous TR system better satisfies the low-complexity and energy-efficiency requirement for the terminal devices (TDs) compared with the middleware approach.