965 resultados para Data Analytics
Resumo:
n the past decade, the analysis of data has faced the challenge of dealing with very large and complex datasets and the real-time generation of data. Technologies to store and access these complex and large datasets are in place. However, robust and scalable analysis technologies are needed to extract meaningful information from these datasets. The research field of Information Visualization and Visual Data Analytics addresses this need. Information visualization and data mining are often used complementary to each other. Their common goal is the extraction of meaningful information from complex and possibly large data. However, though data mining focuses on the usage of silicon hardware, visualization techniques also aim to access the powerful image-processing capabilities of the human brain. This article highlights the research on data visualization and visual analytics techniques. Furthermore, we highlight existing visual analytics techniques, systems, and applications including a perspective on the field from the chemical process industry.
Resumo:
An important application of Big Data Analytics is the real-time analysis of streaming data. Streaming data imposes unique challenges to data mining algorithms, such as concept drifts, the need to analyse the data on the fly due to unbounded data streams and scalable algorithms due to potentially high throughput of data. Real-time classification algorithms that are adaptive to concept drifts and fast exist, however, most approaches are not naturally parallel and are thus limited in their scalability. This paper presents work on the Micro-Cluster Nearest Neighbour (MC-NN) classifier. MC-NN is based on an adaptive statistical data summary based on Micro-Clusters. MC-NN is very fast and adaptive to concept drift whilst maintaining the parallel properties of the base KNN classifier. Also MC-NN is competitive compared with existing data stream classifiers in terms of accuracy and speed.
Resumo:
Con l'avanzare della tecnologia, i Big Data hanno assunto un ruolo importante. In questo lavoro è stato implementato, in linguaggio Java, un software volto alla analisi dei Big Data mediante R e Hadoop/MapReduce. Il software è stato utilizzato per analizzare le tracce rilasciate da Google, riguardanti il funzionamento dei suoi data center.
Resumo:
Thesis (Master's)--University of Washington, 2016-06
Resumo:
Parkinson's disease is a complex heterogeneous disorder with urgent need for disease-modifying therapies. Progress in successful therapeutic approaches for PD will require an unprecedented level of collaboration. At a workshop hosted by Parkinson's UK and co-organized by Critical Path Institute's (C-Path) Coalition Against Major Diseases (CAMD) Consortiums, investigators from industry, academia, government and regulatory agencies agreed on the need for sharing of data to enable future success. Government agencies included EMA, FDA, NINDS/NIH and IMI (Innovative Medicines Initiative). Emerging discoveries in new biomarkers and genetic endophenotypes are contributing to our understanding of the underlying pathophysiology of PD. In parallel there is growing recognition that early intervention will be key for successful treatments aimed at disease modification. At present, there is a lack of a comprehensive understanding of disease progression and the many factors that contribute to disease progression heterogeneity. Novel therapeutic targets and trial designs that incorporate existing and new biomarkers to evaluate drug effects independently and in combination are required. The integration of robust clinical data sets is viewed as a powerful approach to hasten medical discovery and therapies, as is being realized across diverse disease conditions employing big data analytics for healthcare. The application of lessons learned from parallel efforts is critical to identify barriers and enable a viable path forward. A roadmap is presented for a regulatory, academic, industry and advocacy driven integrated initiative that aims to facilitate and streamline new drug trials and registrations in Parkinson's disease.
Resumo:
Thesis (Ph.D.)--University of Washington, 2016-08
Resumo:
Kandidaatintyö on toteutettu kirjallisuuskatsauksena, jonka tavoitteena on selvittää data-analytiikan käyttökohteita ja datan hyödyntämisen vaikutusta liiketoimintaan. Työ käsittelee data-analytiikan käyttöä ja datan tehokkaan hyödyntämisen haasteita. Työ on rajattu tarkastelemaan yrityksen talouden ohjausta, jossa analytiikkaa käytetään johdon ja rahoituksen laskentatoimessa. Datan määrän eksponentiaalinen kasvunopeus luo data-analytiikan käytölle uusia haasteita ja mahdollisuuksia. Datalla itsessään ei kuitenkaan ole suurta arvoa yritykselle, vaan arvo syntyy prosessoinnin kautta. Vaikka data-analytiikkaa tutkitaan ja käytetään jo runsaasti, se tarjoaa paljon nykyisiä sovelluksia suurempia mahdollisuuksia. Yksi työn keskeisimmistä tuloksista on, että data-analytiikalla voidaan tehostaa johdon laskentatoimea ja helpottaa rahoituksen laskentatoimen tehtäviä. Tarjolla olevan datan määrä kasvaa kuitenkin niin nopeasti, että käytettävissä oleva teknologia ja osaamisen taso eivät pysy kehityksessä mukana. Varsinkin big datan laajempi käyttöönotto ja sen tehokas hyödyntäminen vaikuttavat jatkossa talouden ohjauksen käytäntöihin ja sovelluksiin yhä enemmän.
Resumo:
Americans are accustomed to a wide range of data collection in their lives: census, polls, surveys, user registrations, and disclosure forms. When logging onto the Internet, users’ actions are being tracked everywhere: clicking, typing, tapping, swiping, searching, and placing orders. All of this data is stored to create data-driven profiles of each user. Social network sites, furthermore, set the voluntarily sharing of personal data as the default mode of engagement. But people’s time and energy devoted to creating this massive amount of data, on paper and online, are taken for granted. Few people would consider their time and energy spent on data production as labor. Even if some people do acknowledge their labor for data, they believe it is accessory to the activities at hand. In the face of pervasive data collection and the rising time spent on screens, why do people keep ignoring their labor for data? How has labor for data been become invisible, as something that is disregarded by many users? What does invisible labor for data imply for everyday cultural practices in the United States? Invisible Labor for Data addresses these questions. I argue that three intertwined forces contribute to framing data production as being void of labor: data production institutions throughout history, the Internet’s technological infrastructure (especially with the implementation of algorithms), and the multiplication of virtual spaces. There is a common tendency in the framework of human interactions with computers to deprive data and bodies of their materiality. My Introduction and Chapter 1 offer theoretical interventions by reinstating embodied materiality and redefining labor for data as an ongoing process. The middle Chapters present case studies explaining how labor for data is pushed to the margin of the narratives about data production. I focus on a nationwide debate in the 1960s on whether the U.S. should build a databank, contemporary Big Data practices in the data broker and the Internet industries, and the group of people who are hired to produce data for other people’s avatars in the virtual games. I conclude with a discussion on how the new development of crowdsourcing projects may usher in the new chapter in exploiting invisible and discounted labor for data.
Resumo:
In today’s big data world, data is being produced in massive volumes, at great velocity and from a variety of different sources such as mobile devices, sensors, a plethora of small devices hooked to the internet (Internet of Things), social networks, communication networks and many others. Interactive querying and large-scale analytics are being increasingly used to derive value out of this big data. A large portion of this data is being stored and processed in the Cloud due the several advantages provided by the Cloud such as scalability, elasticity, availability, low cost of ownership and the overall economies of scale. There is thus, a growing need for large-scale cloud-based data management systems that can support real-time ingest, storage and processing of large volumes of heterogeneous data. However, in the pay-as-you-go Cloud environment, the cost of analytics can grow linearly with the time and resources required. Reducing the cost of data analytics in the Cloud thus remains a primary challenge. In my dissertation research, I have focused on building efficient and cost-effective cloud-based data management systems for different application domains that are predominant in cloud computing environments. In the first part of my dissertation, I address the problem of reducing the cost of transactional workloads on relational databases to support database-as-a-service in the Cloud. The primary challenges in supporting such workloads include choosing how to partition the data across a large number of machines, minimizing the number of distributed transactions, providing high data availability, and tolerating failures gracefully. I have designed, built and evaluated SWORD, an end-to-end scalable online transaction processing system, that utilizes workload-aware data placement and replication to minimize the number of distributed transactions that incorporates a suite of novel techniques to significantly reduce the overheads incurred both during the initial placement of data, and during query execution at runtime. In the second part of my dissertation, I focus on sampling-based progressive analytics as a means to reduce the cost of data analytics in the relational domain. Sampling has been traditionally used by data scientists to get progressive answers to complex analytical tasks over large volumes of data. Typically, this involves manually extracting samples of increasing data size (progressive samples) for exploratory querying. This provides the data scientists with user control, repeatable semantics, and result provenance. However, such solutions result in tedious workflows that preclude the reuse of work across samples. On the other hand, existing approximate query processing systems report early results, but do not offer the above benefits for complex ad-hoc queries. I propose a new progressive data-parallel computation framework, NOW!, that provides support for progressive analytics over big data. In particular, NOW! enables progressive relational (SQL) query support in the Cloud using unique progress semantics that allow efficient and deterministic query processing over samples providing meaningful early results and provenance to data scientists. NOW! enables the provision of early results using significantly fewer resources thereby enabling a substantial reduction in the cost incurred during such analytics. Finally, I propose NSCALE, a system for efficient and cost-effective complex analytics on large-scale graph-structured data in the Cloud. The system is based on the key observation that a wide range of complex analysis tasks over graph data require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph; examples include ego network analysis, motif counting in biological networks, finding social circles in social networks, personalized recommendations, link prediction, etc. These tasks are not well served by existing vertex-centric graph processing frameworks whose computation and execution models limit the user program to directly access the state of a single vertex, resulting in high execution overheads. Further, the lack of support for extracting the relevant portions of the graph that are of interest to an analysis task and loading it onto distributed memory leads to poor scalability. NSCALE allows users to write programs at the level of neighborhoods or subgraphs rather than at the level of vertices, and to declaratively specify the subgraphs of interest. It enables the efficient distributed execution of these neighborhood-centric complex analysis tasks over largescale graphs, while minimizing resource consumption and communication cost, thereby substantially reducing the overall cost of graph data analytics in the Cloud. The results of our extensive experimental evaluation of these prototypes with several real-world data sets and applications validate the effectiveness of our techniques which provide orders-of-magnitude reductions in the overheads of distributed data querying and analysis in the Cloud.
Resumo:
Sequences of timestamped events are currently being generated across nearly every domain of data analytics, from e-commerce web logging to electronic health records used by doctors and medical researchers. Every day, this data type is reviewed by humans who apply statistical tests, hoping to learn everything they can about how these processes work, why they break, and how they can be improved upon. To further uncover how these processes work the way they do, researchers often compare two groups, or cohorts, of event sequences to find the differences and similarities between outcomes and processes. With temporal event sequence data, this task is complex because of the variety of ways single events and sequences of events can differ between the two cohorts of records: the structure of the event sequences (e.g., event order, co-occurring events, or frequencies of events), the attributes about the events and records (e.g., gender of a patient), or metrics about the timestamps themselves (e.g., duration of an event). Running statistical tests to cover all these cases and determining which results are significant becomes cumbersome. Current visual analytics tools for comparing groups of event sequences emphasize a purely statistical or purely visual approach for comparison. Visual analytics tools leverage humans' ability to easily see patterns and anomalies that they were not expecting, but is limited by uncertainty in findings. Statistical tools emphasize finding significant differences in the data, but often requires researchers have a concrete question and doesn't facilitate more general exploration of the data. Combining visual analytics tools with statistical methods leverages the benefits of both approaches for quicker and easier insight discovery. Integrating statistics into a visualization tool presents many challenges on the frontend (e.g., displaying the results of many different metrics concisely) and in the backend (e.g., scalability challenges with running various metrics on multi-dimensional data at once). I begin by exploring the problem of comparing cohorts of event sequences and understanding the questions that analysts commonly ask in this task. From there, I demonstrate that combining automated statistics with an interactive user interface amplifies the benefits of both types of tools, thereby enabling analysts to conduct quicker and easier data exploration, hypothesis generation, and insight discovery. The direct contributions of this dissertation are: (1) a taxonomy of metrics for comparing cohorts of temporal event sequences, (2) a statistical framework for exploratory data analysis with a method I refer to as high-volume hypothesis testing (HVHT), (3) a family of visualizations and guidelines for interaction techniques that are useful for understanding and parsing the results, and (4) a user study, five long-term case studies, and five short-term case studies which demonstrate the utility and impact of these methods in various domains: four in the medical domain, one in web log analysis, two in education, and one each in social networks, sports analytics, and security. My dissertation contributes an understanding of how cohorts of temporal event sequences are commonly compared and the difficulties associated with applying and parsing the results of these metrics. It also contributes a set of visualizations, algorithms, and design guidelines for balancing automated statistics with user-driven analysis to guide users to significant, distinguishing features between cohorts. This work opens avenues for future research in comparing two or more groups of temporal event sequences, opening traditional machine learning and data mining techniques to user interaction, and extending the principles found in this dissertation to data types beyond temporal event sequences.
Resumo:
Automatic generation of classification rules has been an increasingly popular technique in commercial applications such as Big Data analytics, rule based expert systems and decision making systems. However, a principal problem that arises with most methods for generation of classification rules is the overfit-ting of training data. When Big Data is dealt with, this may result in the generation of a large number of complex rules. This may not only increase computational cost but also lower the accuracy in predicting further unseen instances. This has led to the necessity of developing pruning methods for the simplification of rules. In addition, classification rules are used further to make predictions after the completion of their generation. As efficiency is concerned, it is expected to find the first rule that fires as soon as possible by searching through a rule set. Thus a suit-able structure is required to represent the rule set effectively. In this chapter, the authors introduce a unified framework for construction of rule based classification systems consisting of three operations on Big Data: rule generation, rule simplification and rule representation. The authors also review some existing methods and techniques used for each of the three operations and highlight their limitations. They introduce some novel methods and techniques developed by them recently. These methods and techniques are also discussed in comparison to existing ones with respect to efficient processing of Big Data.
Resumo:
Incluye bibliografía
Resumo:
The new digital technologies have led to widespread use of cloud computing, recognition of the potential of big data analytics, and significant progress in aspects of the Internet of Things, such as home automation, smart cities and grids and digital manufacturing. In addition to closing gaps in respect of the basic necessities of access and usage, now the conditions must be established for using the new platforms and finding ways to participate actively in the creation of content and even new applications and platforms. This message runs through the three chapters of this book. Chapter I presents the main features of the digital revolution, emphasizing that today’s world economy is a digital economy. Chapter II examines the region’s strengths and weaknesses with respect to digital access and consumption. Chapter III reviews the main policy debates and urges countries to take a more proactive approach towards, for example, regulation, network neutrality and combating cybercrime. The conclusion highlights two crucial elements: first, the need to take steps towards a single regional digital market that can compete in a world of global platforms by tapping the benefits of economies of scale and developing network economies; and second, the significance of the next stage of the digital agenda for Latin America and the Caribbean (eLAC2018), which will embody the latest updates to a cooperation strategy that has been in place for over a decade.
Resumo:
In CMS è stato lanciato un progetto di Data Analytics e, all’interno di esso, un’attività specifica pilota che mira a sfruttare tecniche di Machine Learning per predire la popolarità dei dataset di CMS. Si tratta di un’osservabile molto delicata, la cui eventuale predizione premetterebbe a CMS di costruire modelli di data placement più intelligenti, ampie ottimizzazioni nell’uso dello storage a tutti i livelli Tiers, e formerebbe la base per l’introduzione di un solito sistema di data management dinamico e adattivo. Questa tesi descrive il lavoro fatto sfruttando un nuovo prototipo pilota chiamato DCAFPilot, interamente scritto in python, per affrontare questa sfida.
Resumo:
El paradigma de procesamiento de eventos CEP plantea la solución al reto del análisis de grandes cantidades de datos en tiempo real, como por ejemplo, monitorización de los valores de bolsa o el estado del tráfico de carreteras. En este paradigma los eventos recibidos deben procesarse sin almacenarse debido a que el volumen de datos es demasiado elevado y a las necesidades de baja latencia. Para ello se utilizan sistemas distribuidos con una alta escalabilidad, elevado throughput y baja latencia. Este tipo de sistemas son usualmente complejos y el tiempo de aprendizaje requerido para su uso es elevado. Sin embargo, muchos de estos sistemas carecen de un lenguaje declarativo de consultas en el que expresar la computación que se desea realizar sobre los eventos recibidos. En este trabajo se ha desarrollado un lenguaje declarativo de consultas similar a SQL y un compilador que realiza la traducción de este lenguaje al lenguaje nativo del sistema de procesamiento masivo de eventos. El lenguaje desarrollado en este trabajo es similar a SQL, con el que se encuentran familiarizados un gran número de desarrolladores y por tanto aprender este lenguaje no supondría un gran esfuerzo. Así el uso de este lenguaje logra reducir los errores en ejecución de la consulta desplegada sobre el sistema distribuido al tiempo que se abstrae al programador de los detalles de este sistema.---ABSTRACT---The complex event processing paradigm CEP has become the solution for high volume data analytics which demand scalability, high throughput, and low latency. Examples of applications which use this paradigm are financial processing or traffic monitoring. A distributed system is used to achieve the performance requisites. These same requisites force the distributed system not to store the events but to process them on the fly as they are received. These distributed systems are complex systems which require a considerably long time to learn and use. The majority of such distributed systems lack a declarative language in which to express the computation to perform over incoming events. In this work, a new SQL-like declarative language and a compiler have been developed. This compiler translates this new language to the distributed system native language. Due to its similarity with SQL a vast amount of developers who are already familiar with SQL will need little time to learn this language. Thus, this language reduces the execution failures at the time the programmer no longer needs to know every single detail of the underlying distributed system to submit a query.