891 results for Big data analytics
Abstract:
The real purpose of collecting big data is to identify causality in the hope that this will facilitate credible predictivity. But the search for causality can trap one in an infinite regress, and thus one takes refuge in seeking associations between variables in data sets. Regrettably, the mere knowledge of associations does not enable predictivity. Associations need to be embedded within the framework of probability calculus to make coherent predictions. This is so because associations are a feature of probability models, and hence they do not exist outside the framework of a model. Measures of association, like correlation, regression, and mutual information, merely refute a preconceived model. Estimated measures of association do not lead to a probability model; a model is the product of pure thought. This paper discusses these and other fundamentals that are germane to seeking associations in particular, and machine learning in general. ACM Computing Classification System (1998): H.1.2, H.2.4, G.3.
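To make the distinction concrete, the snippet below (an illustration, not taken from the paper) computes two such measures of association, Pearson correlation and a crude binned estimate of mutual information, on hypothetical data; both summarize dependence in the data, but neither specifies the probability model needed for coherent prediction.

```python
# Illustrative only: correlation and mutual information are summaries of
# association computed from data; neither, by itself, specifies a probability
# model for prediction. Data and variable names here are hypothetical.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x**2 + rng.normal(scale=0.1, size=1000)   # strong dependence, weak linear association

# Pearson correlation only detects linear association (near zero here).
r = np.corrcoef(x, y)[0, 1]

# Mutual information, estimated crudely by discretizing into deciles,
# detects the nonlinear dependence that correlation misses.
x_bins = np.digitize(x, np.quantile(x, np.linspace(0, 1, 11)[1:-1]))
y_bins = np.digitize(y, np.quantile(y, np.linspace(0, 1, 11)[1:-1]))
mi = mutual_info_score(x_bins, y_bins)

print(f"Pearson correlation: {r:.3f}")
print(f"Estimated mutual information (nats): {mi:.3f}")
```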
Abstract:
Big data comes in various ways, types, shapes, forms and sizes. Indeed, almost all areas of science, technology, medicine, public health, economics, business, linguistics and social science are bombarded by ever-increasing flows of data begging to be analyzed efficiently and effectively. In this paper, we propose a rough idea of a possible taxonomy of big data, along with some of the most commonly used tools for handling each particular category of bigness. The dimensionality p of the input space and the sample size n are usually the main ingredients in the characterization of data bigness. The specific statistical machine learning technique used to handle a particular big data set will depend on which category it falls into within the bigness taxonomy. Large p, small n data sets, for instance, require a different set of tools from the large n, small p variety. Among other tools, we discuss Preprocessing, Standardization, Imputation, Projection, Regularization, Penalization, Compression, Reduction, Selection, Kernelization, Hybridization, Parallelization, Aggregation, Randomization, Replication, and Sequentialization. Indeed, it is important to emphasize right away that the so-called no free lunch theorem applies here, in the sense that there is no universally superior method that outperforms all other methods on all categories of bigness. It is also important to stress that simplicity, in the sense of Ockham’s razor non-plurality principle of parsimony, tends to reign supreme when it comes to massive data. We conclude with a comparison of the predictive performance of some of the most commonly used methods on a few data sets.
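As a minimal sketch of the large p, small n regime and of two tools named above (standardization and penalization), the following snippet fits an L1-penalized regression on synthetic data. It is not the paper's benchmark code; the dimensions, penalty strength, and coefficients are all assumptions made for illustration.

```python
# A minimal sketch (not from the paper) of penalization/regularization on a
# synthetic "large p, small n" problem. All values here are illustrative.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n, p = 50, 2000                          # far more predictors than observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]   # only 5 truly relevant predictors
y = X @ beta + rng.normal(scale=0.5, size=n)

# Standardization (another item in the list) before penalized fitting.
X_std = StandardScaler().fit_transform(X)

# The L1 penalty shrinks most coefficients exactly to zero, making the
# p >> n problem well-posed at the cost of some bias.
model = Lasso(alpha=0.1).fit(X_std, y)
print("non-zero coefficients:", np.sum(model.coef_ != 0))
```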
Abstract:
At the Large Hadron Collider (LHC), more than 30 petabytes of collision data are collected in each year of data taking. Processing these data requires producing a large volume of simulated events through Monte Carlo techniques. In addition, physics analysis requires daily access to derived data formats for hundreds of users. The Worldwide LHC Computing Grid (WLCG) is an international collaboration of scientists and computing centres that has met the technological challenges of the LHC, making its scientific programme possible. As data taking continues, and with the recent approval of ambitious projects such as the High-Luminosity LHC, the limits of current computing capacity will soon be reached. One of the keys to overcoming these challenges in the coming decade, also in light of the budget constraints of the various national funding agencies, is to make efficient use of the available computing resources. This work aims to develop and evaluate tools to improve the understanding of how both production and analysis data are monitored in CMS. The work therefore comprises two parts. The first, concerning distributed analysis, is the development of a tool that quickly analyzes the log files of completed job submissions, so that on the next submission the user can make better use of the computing resources. The second part, concerning the monitoring of both production and analysis jobs, exploits Big Data technologies to provide a more efficient and flexible monitoring service. A noteworthy aspect of these improvements is the possibility of avoiding a high level of data aggregation at an early stage, and of collecting monitoring data at a high granularity that nevertheless allows later reprocessing and “on-demand” aggregation.
Abstract:
Constant technological advances have caused a data explosion in recent years. Accordingly, modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This phenomenon is particularly true for analyzing biological data. For example, DNA sequence data can be viewed as categorical variables, with each nucleotide taking one of four categories. Gene expression data, depending on the quantification technology, can be continuous values or counts. With the advancement of high-throughput technology, the abundance of such data has become unprecedentedly rich. Therefore, efficient statistical approaches are crucial in this big data era.
Previous statistical methods for big data often aim to find low-dimensional structures in the observed data. For example, in a factor analysis model, a latent Gaussian-distributed multivariate vector is assumed. With this assumption, a factor model produces a low-rank estimate of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents, in which the mixture proportions of topics are represented by a Dirichlet-distributed variable. This dissertation proposes several novel extensions to these statistical methods, developed to address challenges in big data. The novel methods are applied in multiple real-world applications, including the construction of condition-specific gene co-expression networks, estimating shared topics among newsgroups, analysis of promoter sequences, analysis of political-economic risk data, and estimating population structure from genotype data.
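As a hedged illustration of the low-rank idea mentioned above (and not of the dissertation's own extensions), the sketch below fits a standard factor analysis model to synthetic data and recovers a low-rank-plus-diagonal covariance estimate; all dimensions and data are invented.

```python
# Illustrative sketch, not the dissertation's own methods: a factor analysis
# model yields a low-rank-plus-diagonal estimate of the covariance of the
# observed variables. Data and dimensions are hypothetical.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n_samples, n_features, n_factors = 500, 30, 3

# Generate data from a true factor model: x = W z + noise.
W = rng.normal(size=(n_features, n_factors))
Z = rng.normal(size=(n_samples, n_factors))
X = Z @ W.T + rng.normal(scale=0.5, size=(n_samples, n_features))

fa = FactorAnalysis(n_components=n_factors).fit(X)

# Estimated covariance = loadings @ loadings.T + diag(noise variances),
# i.e. a rank-3 matrix plus a diagonal, rather than a full 30 x 30 estimate.
low_rank_part = fa.components_.T @ fa.components_
cov_estimate = low_rank_part + np.diag(fa.noise_variance_)
print(cov_estimate.shape, "rank of structured part:", np.linalg.matrix_rank(low_rank_part))
```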
Abstract:
Cloud computing offers the massive scalability and elasticity required by many scientific and commercial applications. Combining the computational and data-handling capabilities of clouds with parallel processing also has the potential to tackle Big Data problems efficiently. Science gateway frameworks and workflow systems enable application developers to implement complex applications and make these available for end-users via simple graphical user interfaces. The integration of such frameworks with Big Data processing tools on the cloud opens new opportunities for application developers. This paper investigates how workflow systems and science gateways can be extended with Big Data processing capabilities. A generic approach based on infrastructure-aware workflows is suggested and a proof of concept is implemented based on the WS-PGRADE/gUSE science gateway framework and its integration with the Hadoop parallel data processing solution based on the MapReduce paradigm in the cloud. The provided analysis demonstrates that the methods described to integrate Big Data processing with workflows and science gateways work well in different cloud infrastructures and application scenarios, and can be used to create massively parallel applications for scientific analysis of Big Data.
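For readers unfamiliar with the MapReduce paradigm that the Hadoop integration builds on, here is a toy, single-process illustration; it is not WS-PGRADE/gUSE or Hadoop code, and on a real cluster the map and reduce phases would run in parallel across many worker nodes.

```python
# A toy, single-process illustration of the MapReduce paradigm: word counting.
from collections import defaultdict

def map_phase(document):
    # map: emit (key, value) pairs -- here, (word, 1) for every word
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    # shuffle: group all values by key, as the framework would between phases
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # reduce: combine the values for each key -- here, summing the counts
    return {word: sum(counts) for word, counts in grouped.items()}

documents = ["big data on the cloud", "workflows for big data"]
pairs = (pair for doc in documents for pair in map_phase(doc))
print(reduce_phase(shuffle(pairs)))
```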
Abstract:
This dissertation offers a critical international political economy (IPE) analysis of the ways in which consumer information has been governed throughout the formal history of consumer finance (1840 – present). Drawing primarily on the United States, this project problematizes the notion of consumer financial big data as a ‘new era’ by tracing its roots historically from late nineteenth century through to the present. Using a qualitative case study approach, this project applies a unique theoretical framework to three instances of governance in consumer credit big data. Throughout, the historically specific means used to govern consumer credit data are rooted in dominant ideas, institutions and material factors.
Abstract:
Advocates of Big Data assert that we are in the midst of an epistemological revolution, promising the displacement of the modernist methodological hegemony of causal analysis and theory generation. It is alleged that the growing ‘deluge’ of digitally generated data, and the development of computational algorithms to analyse them, has enabled new inductive ways of accessing everyday relational interactions through their ‘datafication’. This paper critically engages with these discourses of Big Data and complexity, particularly as they operate in the discipline of International Relations, where it is alleged that Big Data approaches have the potential for developing self-governing societal capacities for resilience and adaptation through the real-time reflexive awareness and management of risks and problems as they arise. The epistemological and ontological assumptions underpinning Big Data are then analysed to suggest that critical and posthumanist approaches have come of age through these discourses, enabling process-based and relational understandings to be translated into policy and governance practices. The paper thus raises some questions for the development of critical approaches to new posthuman forms of governance and knowledge production.
How the World Learned to Stop Worrying and Love Failure: Big Data, Resilience and Emergent Causality
Abstract:
In modernity, failure was the discourse of critique; today, it is increasingly the discourse of power: failure has changed its allegiances. Over the last two decades, failure has been enfolded into discourses of power, facilitating the development of new policy approaches. Foremost among governing approaches that seek to include and to govern through failure is that of resilience. This article seeks to reflect upon how the understanding of failure has been transformed in this process, particularly linking this transformation to the radical appreciation of contingency and of the limits to instrumental cause-and-effect approaches to rule. Whereas modernity was shaped by a contestation over failure as an epistemological boundary, under conditions of contingency and complexity there appears to be a new consensus on failure as an ontological necessity. This problematic ‘ontological turn’ is illustrated using examples of changing approaches to risks, especially anthropogenic understandings of environmental threats, formerly seen as ‘natural’.
Abstract:
This paper synthesizes and discusses the spatial and temporal patterns of archaeological sites in Ireland, spanning the Neolithic period and the Bronze Age transition (4300–1900 cal BC), in order to explore the timing and implications of the main changes that occurred in the archaeological record of that period. Large amounts of new data are sourced from unpublished developer-led excavations and combined with national archives, published excavations and online databases. Bayesian radiocarbon models and context- and sample-sensitive summed radiocarbon probabilities are used to examine the dataset. The study captures the scale and timing of the initial expansion of Early Neolithic settlement and the ensuing attenuation of all such activity—an apparent boom-and-bust cycle. The Late Neolithic and Chalcolithic periods are characterised by a resurgence and diversification of activity. Contextualisation and spatial analysis of radiocarbon data reveals finer-scale patterning than is usually possible with summed-probability approaches: the boom-and-bust models of prehistoric populations may, in fact, be a misinterpretation of more subtle demographic changes occurring at the same time as cultural change and attendant differences in the archaeological record.
Abstract:
Much has been written about Big Data from a technical, economic, juridical and ethical perspective. Still, very little empirical and comparative data is available on how Big Data is approached and regulated in Europe and beyond. This contribution makes a first effort to fill that gap by presenting the responses to a survey on Big Data from the Data Protection Authorities of fourteen European countries and comparative legal research covering eleven countries. It presents those results, addressing 10 challenges for the regulation of Big Data.
Abstract:
Market research is often conducted through conventional methods such as surveys, focus groups and interviews. But the drawbacks of these methods are that they can be costly and time-consuming. This study develops a new method, based on a combination of standard techniques like sentiment analysis and normalisation, to conduct market research in a manner that is free and quick. The method can be used in many application areas, but this study focuses mainly on the veganism market to identify vegan food preferences in the form of a profile. Several food words are identified, along with their distribution between positive and negative sentiments in the profile. Surprisingly, non-vegan foods such as cheese, cake, milk, pizza and chicken dominate the profile, indicating that there is a significant market for vegan-suitable alternatives to such foods. Meanwhile, vegan-suitable foods such as coconut, potato, blueberries, kale and tofu also make strong appearances in the profile. Validation is performed by using the method on Volkswagen vehicle data to identify positive and negative sentiment across five car models. Some results were found to be consistent with sales figures and expert reviews, while others were inconsistent. The reliability of the method is therefore questionable, so the results should be used with caution.
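The abstract does not give the study's actual pipeline, but a lexicon-based sentiment pass followed by normalisation, of the general kind described, might look like the toy sketch below; the lexicon, posts, and food list are all invented for illustration and do not reproduce the study's data or scoring rules.

```python
# Toy sketch: lexicon-based sentiment scoring plus normalisation into a
# food-preference profile. Lexicon, posts, and food words are invented.
POSITIVE = {"love", "great", "delicious", "amazing"}
NEGATIVE = {"hate", "awful", "bland", "terrible"}
FOOD_WORDS = {"cheese", "tofu", "kale", "pizza"}

posts = [
    "I love vegan pizza with cashew cheese",
    "this tofu was bland and terrible",
    "kale smoothies are amazing",
]

profile = {food: {"pos": 0, "neg": 0} for food in FOOD_WORDS}
for post in posts:
    words = set(post.lower().split())
    # Net sentiment of the post; neutral posts are counted as positive here.
    sentiment = len(words & POSITIVE) - len(words & NEGATIVE)
    for food in words & FOOD_WORDS:
        profile[food]["pos" if sentiment >= 0 else "neg"] += 1

# Normalise counts so foods mentioned at very different rates are comparable.
total = sum(v["pos"] + v["neg"] for v in profile.values()) or 1
normalised = {food: {k: c / total for k, c in v.items()} for food, v in profile.items()}
print(normalised)
```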
Abstract:
Hotel chains have access to a treasure trove of “big data” on individual hotels’ monthly electricity and water consumption. Benchmarked comparisons of hotels within a specific chain create the opportunity to cost-effectively improve the environmental performance of specific hotels. This paper describes a simple approach for using such data to achieve the joint goals of reducing operating expenditure and achieving broad sustainability goals. In recent years, energy economists have used such “big data” to generate insights about the energy consumption of the residential, commercial, and industrial sectors. Lessons from these studies are directly applicable to the hotel sector. A hotel’s administrative data provide a “laboratory” for conducting randomized controlled trials to establish what works in enhancing hotel energy efficiency.
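A hypothetical sketch of the within-chain benchmarking idea mentioned above (not the paper's own analysis) could look like the following; hotel names, room counts, and consumption figures are invented.

```python
# Hypothetical sketch: benchmark each hotel's monthly electricity use against
# the other hotels in its chain, adjusting for size. All figures are invented.
import pandas as pd

usage = pd.DataFrame({
    "hotel":     ["A", "B", "C", "D"],
    "chain":     ["X", "X", "X", "X"],
    "rooms":     [120, 200, 150, 90],
    "kwh_month": [95000, 210000, 118000, 61000],
})

# Normalise by size, then benchmark each hotel against its chain's mean.
usage["kwh_per_room"] = usage["kwh_month"] / usage["rooms"]
chain_mean = usage.groupby("chain")["kwh_per_room"].transform("mean")
usage["pct_above_chain_mean"] = 100 * (usage["kwh_per_room"] / chain_mean - 1)

print(usage[["hotel", "kwh_per_room", "pct_above_chain_mean"]])
```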
Abstract:
Americans are accustomed to a wide range of data collection in their lives: census, polls, surveys, user registrations, and disclosure forms. When logging onto the Internet, users’ actions are tracked everywhere: clicking, typing, tapping, swiping, searching, and placing orders. All of this data is stored to create data-driven profiles of each user. Social network sites, furthermore, set the voluntary sharing of personal data as the default mode of engagement. But people’s time and energy devoted to creating this massive amount of data, on paper and online, are taken for granted. Few people would consider the time and energy they spend on data production to be labor. Even those who do acknowledge their labor for data believe it is accessory to the activities at hand. In the face of pervasive data collection and the rising amount of time spent on screens, why do people keep ignoring their labor for data? How has labor for data become invisible, something that is disregarded by many users? What does invisible labor for data imply for everyday cultural practices in the United States? Invisible Labor for Data addresses these questions. I argue that three intertwined forces contribute to framing data production as being void of labor: data production institutions throughout history, the Internet’s technological infrastructure (especially the implementation of algorithms), and the multiplication of virtual spaces. There is a common tendency in the framework of human interactions with computers to deprive data and bodies of their materiality. My Introduction and Chapter 1 offer theoretical interventions by reinstating embodied materiality and redefining labor for data as an ongoing process. The middle chapters present case studies explaining how labor for data is pushed to the margin of narratives about data production. I focus on a nationwide debate in the 1960s on whether the U.S. should build a databank, contemporary Big Data practices in the data broker and Internet industries, and the people who are hired to produce data for other people’s avatars in virtual games. I conclude with a discussion of how the new development of crowdsourcing projects may usher in a new chapter of exploiting invisible and discounted labor for data.