15 results for Big data, Spark, Hadoop

in Aston University Research Archive


Relevance: 100.00%

Abstract:

In this paper we evaluate and compare two representative and popular distributed processing engines for large-scale big data analytics: Spark and the graph-based engine GraphLab. We design a benchmark suite of representative algorithms and datasets to compare the engines in terms of running time, memory and CPU usage, and network and I/O overhead. The benchmark suite is tested both on a local computer cluster and on virtual machines in the cloud. By varying the number of computers and the amount of memory, we examine the scalability of the engines with increasing computing resources (such as CPU and memory). We also cross-evaluate generic and graph-based analytic algorithms on graph-processing and generic platforms to identify the potential performance degradation if only one processing engine is available. We observe that both engines scale well as computing resources increase. While GraphLab largely outperforms Spark on graph algorithms, its running time is close to Spark's on non-graph algorithms. Additionally, the running time of graph algorithms with Spark on cloud virtual machines is observed to increase by almost 100% compared to local computer clusters.
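
For illustration, here is the shape a single timed run on the Spark side might take: a fixed number of PageRank iterations over an edge list, with wall-clock time recorded. This is a minimal sketch under stated assumptions, not the paper's benchmark suite; the input path, cluster configuration and iteration count are all hypothetical.

```python
# Minimal sketch of one benchmark run: time Spark's classic PageRank
# over an edge list. The HDFS path and iteration count are assumptions.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pagerank-bench").getOrCreate()
sc = spark.sparkContext

# Edge-list input: one "src dst" pair per line (hypothetical file).
links = (sc.textFile("hdfs:///bench/edges.txt")
           .map(lambda line: (line.split()[0], line.split()[1]))
           .groupByKey()
           .cache())
ranks = links.mapValues(lambda _: 1.0)

start = time.time()
for _ in range(10):  # fixed iteration count keeps runs comparable
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
    ranks = (contribs.reduceByKey(lambda a, b: a + b)
                     .mapValues(lambda r: 0.15 + 0.85 * r))
ranks.count()  # force evaluation before stopping the clock
print("PageRank wall-clock time: %.1fs" % (time.time() - start))
spark.stop()
```

Memory, CPU, network and I/O figures would be gathered outside such a script, for example from the cluster's monitoring layer.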

Relevance: 100.00%

Abstract:

At the moment, the phrases “big data” and “analytics” are often being used as if they were magic incantations that will solve all an organization’s problems at a stroke. The reality is that data on its own, even with the application of analytics, will not solve any problems. The resources that analytics and big data can consume represent a significant strategic risk if applied ineffectively. Any analysis of data needs to be guided, and to lead to action. So while analytics may lead to knowledge and intelligence (in the military sense of that term), it also needs the input of knowledge and intelligence (in the human sense of that term). And somebody then has to do something new or different as a result of the new insights, or it won’t have been done to any purpose. Using an analytics example concerning accounts payable in the public sector in Canada, this paper reviews thinking from the domains of analytics, risk management and knowledge management, to show some of the pitfalls, and to present a holistic picture of how knowledge management might help tackle the challenges of big data and analytics.

Relevance: 100.00%

Abstract:

Semantics, knowledge and Grids represent three spaces where people interact, understand, learn and create. Grids represent advanced cyber-infrastructures and their evolution. Big data influences the evolution of semantics, knowledge and Grids. Exploring semantics, knowledge and Grids on big data helps accelerate the shift in scientific paradigms, the fourth industrial revolution, and the transformational innovation of technologies.

Relevance: 100.00%

Abstract:

We analyze a Big Data set of geo-tagged tweets covering one year (Oct. 2013–Oct. 2014) to understand regional linguistic variation in the U.S. Prior work on regional linguistic variation usually required lengthy data collection and focused on either rural or urban areas. Geo-tagged Twitter data offers an unprecedented database with rich linguistic representation at fine spatiotemporal resolution and continuity. From the one-year Twitter corpus, we extract lexical characteristics for Twitter users by summarizing the frequencies of a set of lexical alternations that each user has used. We spatially aggregate and smooth each lexical characteristic to derive county-based linguistic variables, from which orthogonal dimensions are extracted using principal component analysis (PCA). Finally, a regionalization method is used to discover hierarchical dialect regions from the PCA components. The regionalization results reveal interesting regional linguistic variations in the U.S. The discovered regions not only confirm past research findings in the literature but also provide new insights and a more detailed understanding of very recent linguistic patterns in the U.S.
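
As a rough illustration of the pipeline's final stages, the sketch below runs PCA over county-level linguistic variables and then clusters counties in component space. The data are synthetic, and plain hierarchical clustering stands in for the paper's regionalization method, which additionally accounts for spatial relationships between counties.

```python
# Synthetic stand-in for the abstract's final stages: PCA over
# county-level linguistic variables, then clustering in component space.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Rows: counties; columns: smoothed frequencies of lexical alternations.
X = rng.random((3000, 40))

pca = PCA(n_components=10)              # orthogonal dimensions of variation
components = pca.fit_transform(X)

# Hierarchical clustering of counties in PCA space (the paper's
# regionalization method, with spatial contiguity, is not modeled here).
regions = AgglomerativeClustering(n_clusters=8).fit_predict(components)
print(np.bincount(regions))             # counties per discovered region
```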

Relevance: 100.00%

Abstract:

Parkinson's disease (PD) is a complex heterogeneous disorder with an urgent need for disease-modifying therapies. Progress toward successful therapeutic approaches for PD will require an unprecedented level of collaboration. At a workshop hosted by Parkinson's UK and co-organized by the Critical Path Institute's (C-Path) Coalition Against Major Diseases (CAMD) Consortium, investigators from industry, academia, government and regulatory agencies agreed on the need for data sharing to enable future success. Government agencies included the EMA, FDA, NINDS/NIH and IMI (Innovative Medicines Initiative). Emerging discoveries in new biomarkers and genetic endophenotypes are contributing to our understanding of the underlying pathophysiology of PD. In parallel, there is growing recognition that early intervention will be key for successful treatments aimed at disease modification. At present, we lack a comprehensive understanding of disease progression and of the many factors that contribute to its heterogeneity. Novel therapeutic targets and trial designs that incorporate existing and new biomarkers to evaluate drug effects independently and in combination are required. The integration of robust clinical data sets is viewed as a powerful approach to hasten medical discovery and therapies, as is being realized across diverse disease conditions employing big data analytics for healthcare. Applying lessons learned from these parallel efforts is critical to identify barriers and enable a viable path forward. A roadmap is presented for a regulatory, academic, industry and advocacy driven integrated initiative that aims to facilitate and streamline new drug trials and registrations in Parkinson's disease.

Relevance: 100.00%

Abstract:

GraphChi is the first reported disk-based graph engine that can handle billion-scale graphs efficiently on a single PC. GraphChi can execute several advanced data mining, graph mining and machine learning algorithms on very large graphs. With its novel technique of parallel sliding windows (PSW) for loading subgraphs from disk into memory to update vertices and edges, it can achieve data-processing performance close to, and even better than, that of mainstream distributed graph engines. The GraphChi authors noted, however, that its memory is not effectively utilized on large datasets, which leads to suboptimal computational performance. In this paper, motivated by the concepts of 'pin' from TurboGraph and 'ghost' from GraphLab, we propose a new memory utilization mode for GraphChi, called Part-in-memory mode, to improve the performance of GraphChi algorithms. The main idea is to pin a fixed part of the data in memory during the whole computing process. Part-in-memory mode was implemented with only about 40 additional lines of code on top of the original GraphChi engine. Extensive experiments are performed with large real datasets (including a Twitter graph with 1.4 billion edges). The preliminary results show that the Part-in-memory memory management approach reduces GraphChi running time by up to 60% for the PageRank algorithm. Interestingly, a larger portion of data pinned in memory does not always lead to better performance when the whole dataset cannot fit in memory; there exists an optimal portion of data to keep in memory to achieve the best computational performance.
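
The pinning idea itself is compact enough to sketch. The following is not GraphChi's implementation; it only illustrates the Part-in-memory principle of serving a fixed prefix of vertex values from RAM while falling back to disk for the rest. File name, value layout and sizes are assumptions.

```python
# Illustrative Part-in-memory sketch: a fixed prefix of the vertex-value
# file stays pinned in RAM for the whole computation; the rest hits disk.
import struct

NUM_VERTICES = 1_000_000
VALUE_SIZE = 8                        # one float64 value per vertex
PINNED_FRACTION = 0.4                 # tunable: portion of values kept in RAM
PINNED_LIMIT = int(NUM_VERTICES * PINNED_FRACTION)

# Create a demo vertex-value file so the sketch is self-contained.
with open("vertex_values.bin", "wb") as out:
    out.write(b"\x00" * (NUM_VERTICES * VALUE_SIZE))

f = open("vertex_values.bin", "rb+")
pinned = bytearray(f.read(PINNED_LIMIT * VALUE_SIZE))   # resident partition

def read_value(v):
    """Return the value of vertex v: from RAM if pinned, else from disk."""
    if v < PINNED_LIMIT:
        return struct.unpack_from("<d", pinned, v * VALUE_SIZE)[0]
    f.seek(v * VALUE_SIZE)
    return struct.unpack("<d", f.read(VALUE_SIZE))[0]

def write_value(v, x):
    """Update vertex v; pinned updates never touch the disk."""
    if v < PINNED_LIMIT:
        struct.pack_into("<d", pinned, v * VALUE_SIZE, x)
    else:
        f.seek(v * VALUE_SIZE)
        f.write(struct.pack("<d", x))

write_value(3, 0.15)           # pinned: stays in memory
write_value(900_000, 0.15)     # unpinned: one seek + write on disk
print(read_value(3), read_value(900_000))
```

PINNED_FRACTION here corresponds to the knob the paper's experiments vary when observing that pinning more data is not always better once the full dataset exceeds memory.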

Relevance: 100.00%

Abstract:

Multiple transformative forces target marketing, many of which derive from new technologies that allow us to sample thinking in real time (i.e., brain imaging), or to look at large aggregations of decisions (i.e., big data). There has been an inclination to refer to the intersection of these technologies with the general topic of marketing as “neuromarketing”, but there has not been a serious effort to frame neuromarketing, which is the goal of this paper. Neuromarketing can be compared to neuroeconomics: neuroeconomics generally focuses on how individuals make “choices” and on representing distributions of choices. Neuromarketing, in contrast, focuses on how a distribution of choices can be shifted or “influenced”, which can occur at multiple “scales” of behavior (e.g., individual, group, or market/society). Given that influence can affect choice through many cognitive modalities, not just the valuation of choice options, a science of influence also implies a need to develop a model of cognitive function integrating attention, memory, and reward/aversion function. The paper concludes with a brief description of three domains of neuromarketing application for studying influence, and their caveats.

Relevance: 100.00%

Abstract:

This paper investigates the Matthew Effect in the Sina Weibo microblogging service. We take as our study target the microblogs in the ranking list of the Hot Microblog app in Sina Weibo. We analyze how the repost counts of microblogs in the ranking list differ before and after they enter the list, and we compare the spreading features of microblogs in the ranking list with those of hot microblogs not in the list and with ordinary microblogs of users who previously had a microblog in the ranking list. Our study confirms the existence of the Matthew Effect in social networks. © 2013 IEEE.
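
A minimal sketch of the core before/after comparison, run on a synthetic repost log; the column names and toy timestamps are illustrative assumptions, not the paper's data.

```python
# Compare repost volume before vs. after a microblog enters the ranking
# list. The toy repost log below stands in for the paper's Weibo data.
import pandas as pd

df = pd.DataFrame({
    "microblog_id": [1, 1, 1, 1, 2, 2, 2],
    "repost_time": pd.to_datetime([
        "2013-05-01 10:00", "2013-05-01 12:00", "2013-05-01 15:00",
        "2013-05-01 18:00", "2013-05-02 09:00", "2013-05-02 11:00",
        "2013-05-02 12:00"]),
    "list_entry_time": pd.to_datetime(
        ["2013-05-01 11:00"] * 4 + ["2013-05-02 10:00"] * 3),
})

# Label each repost as occurring before or after list entry.
df["phase"] = (df["repost_time"] >= df["list_entry_time"]).map(
    {False: "before", True: "after"})
counts = df.groupby(["microblog_id", "phase"]).size().unstack(fill_value=0)

# A Matthew Effect would show up as repost volume jumping after entry.
counts["after_to_before"] = counts["after"] / counts["before"].clip(lower=1)
print(counts)
```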

Relevance: 100.00%

Abstract:

The miniaturization, sophistication, proliferation, and accessibility of technologies are enabling the capture of more and previously inaccessible phenomena in Parkinson's disease (PD). However, more information has not translated into a greater understanding of disease complexity to satisfy diagnostic and therapeutic needs. Challenges include noncompatible technology platforms, the need for wide-scale and long-term deployment of sensor technology (among vulnerable elderly patients in particular), and the gap between the "big data" acquired with sensitive measurement technologies and their limited clinical application. Major opportunities could be realized if new technologies are developed as part of open-source and/or open-hardware platforms that enable multichannel data capture sensitive to the broad range of motor and nonmotor problems that characterize PD and are adaptable into self-adjusting, individualized treatment delivery systems. The International Parkinson and Movement Disorders Society Task Force on Technology is entrusted to convene engineers, clinicians, researchers, and patients to promote the development of integrated measurement and closed-loop therapeutic systems with high patient adherence that also serve to (1) encourage the adoption of clinico-pathophysiologic phenotyping and early detection of critical disease milestones, (2) enhance the tailoring of symptomatic therapy, (3) improve subgroup targeting of patients for future testing of disease-modifying treatments, and (4) identify objective biomarkers to improve the longitudinal tracking of impairments in clinical care and research. This article summarizes the work carried out by the task force toward identifying challenges and opportunities in the development of technologies with potential for improving the clinical management and the quality of life of individuals with PD. © 2016 International Parkinson and Movement Disorder Society.

Relevance: 100.00%

Abstract:

Sensing technology is a key enabler of the Internet of Things (IoT) and can produce huge volumes of data that contribute to the Big Data paradigm. Modelling sensing information is an important and challenging topic that essentially influences the quality of smart city systems. In this paper, the author discusses the relevant technologies and information modelling in the context of the smart city, and in particular reports an investigation of how to model sensing and location information to support smart city development.
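
As one hedged reading of what such a model could look like, the sketch below ties a sensor measurement to a location and a timestamp; every field name is an illustrative assumption rather than the paper's actual schema.

```python
# Illustrative sensing-information model: each observation binds a
# measurement to a sensor, a location, and a timestamp.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Location:
    latitude: float            # WGS84 degrees
    longitude: float
    label: str = ""            # optional semantic place name

@dataclass(frozen=True)
class SensorObservation:
    sensor_id: str             # stable identifier of the physical sensor
    quantity: str              # what is measured, e.g. "PM2.5"
    value: float
    unit: str                  # e.g. "ug/m3"
    location: Location
    observed_at: datetime      # UTC timestamp of the reading

obs = SensorObservation(
    sensor_id="aq-017",
    quantity="PM2.5",
    value=12.4,
    unit="ug/m3",
    location=Location(52.4862, -1.8904, label="City centre"),
    observed_at=datetime(2016, 3, 1, 8, 30, tzinfo=timezone.utc),
)
print(obs)
```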

Relevance: 100.00%

Abstract:

Purpose – The purpose of this paper is to examine challenges and potential of big data in heterogeneous business networks and relate these to an implemented logistics solution.

Design/methodology/approach – The paper establishes an overview of challenges and opportunities of current significance in the area of big data, specifically in the context of transparency and processes in heterogeneous enterprise networks. Within this context, the paper presents how existing components and purpose-driven research were combined for a solution implemented in a nationwide network for less-than-truckload consignments.

Findings – Aside from providing an extended overview of today's big data situation, the findings have shown that technical means and methods available today can comprise a feasible process transparency solution in a large heterogeneous network where legacy practices, reporting lags and incomplete data exist, yet processes are sensitive to inadequate policy changes.

Practical implications – The means introduced in the paper were found to be of utility value in improving process efficiency, transparency and planning in logistics networks. The particular system design choices in the presented solution allow an incremental introduction or evolution of resource handling practices, incorporating existing fragmentary, unstructured or tacit knowledge of experienced personnel into the theoretically founded overall concept.

Originality/value – The paper extends previous high-level views on the potential of big data, and presents new applied research and development results in a logistics application.

Relevance: 100.00%

Abstract:

Text summarization has been studied for over half a century, but traditional methods process texts empirically and neglect the fundamental characteristics and principles of language use and understanding. Automatic summarization is a desirable technique for processing big data. This reference summarizes previous text summarization approaches in a multi-dimensional category space, introduces a multi-dimensional methodology for research and development, unveils the basic characteristics and principles of language use and understanding, investigates some fundamental mechanisms of summarization, studies dimensions of representation, and proposes a multi-dimensional evaluation mechanism. The investigation extends to incorporating pictures into summaries and to the summarization of videos, graphs and pictures, and converges to a general summarization method. Further, some basic behaviors of summarization are studied in the complex cyber-physical-social space. Finally, a creative summarization mechanism is proposed as an effort toward the creative summarization of things, which is an open process of interactions among physical objects, data, people, and systems in cyber-physical-social space through a multi-dimensional lens of semantic computing. These insights can inspire research and development in many computing areas.

Relevance: 30.00%

Abstract:

Visualising data for exploratory analysis is a big challenge in scientific and engineering domains where there is a need to gain insight into the structure and distribution of the data. Typically, visualisation methods like principal component analysis and multi-dimensional scaling are used, but it is difficult to incorporate prior knowledge about the structure of the data into the analysis. In this technical report we discuss a complementary approach based on an extension of a well-known non-linear probabilistic model, the Generative Topographic Mapping. We show that by including prior information about the covariance structure in the model, we are able to improve both the data visualisation and the model fit.
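
For reference, the standard GTM (Bishop, Svensén and Williams) models the data density as a constrained mixture of Gaussians centred on the images of a regular latent grid. The second formula below is one assumed reading of the report's extension, with a structured covariance carrying the prior knowledge in place of the isotropic noise term; it is not quoted from the report.

```latex
% Standard GTM density: K latent grid points z_k mapped through basis
% functions \phi and weights W, with isotropic noise of precision \beta.
p(\mathbf{x}) = \frac{1}{K} \sum_{k=1}^{K}
  \mathcal{N}\!\big(\mathbf{x} \,\big|\, \mathbf{W}\boldsymbol{\phi}(\mathbf{z}_k),\, \beta^{-1}\mathbf{I}\big)

% Assumed form of the extension: a structured covariance \Sigma encoding
% the prior knowledge replaces the isotropic term \beta^{-1}\mathbf{I}.
p(\mathbf{x}) = \frac{1}{K} \sum_{k=1}^{K}
  \mathcal{N}\!\big(\mathbf{x} \,\big|\, \mathbf{W}\boldsymbol{\phi}(\mathbf{z}_k),\, \boldsymbol{\Sigma}\big)
```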

Relevance: 30.00%

Abstract:

The pricing of Big 4 industry leadership is examined for a sample of U.K. publicly listed companies, adding to evidence from the Australian and U.S. audit markets that city-specific industry leadership commands a fee premium. There is a significant fee premium for city-specific industry leaders relative to other Big 4 auditors, but no evidence that either the top-ranked or second-ranked firm nationally commands a fee premium relative to other Big 4 auditors, after controlling for city-level industry leadership. We also test for Big 4 fee premiums relative to non-Big 4 auditors, and the U.K. data suggest a three-level hierarchy based on audit fee differentials: (1) Big 4 city-specific industry leaders have the largest fees; (2) other Big 4 auditors (non-city leaders) and second-tier national firms have comparable fees that are lower than those of Big 4 city leaders but larger than those of third-tier firms; and (3) third-tier accounting firms have the lowest fees.

Relevance: 30.00%

Abstract:

Purpose – The purpose of this editorial is to stimulate debate and discussion among marketing scholars regarding the implications for scientific research of increasingly large amounts of data and sophisticated data analytic techniques.

Design/methodology/approach – The authors respond to a recent editorial in WIRED magazine which heralds the demise of the scientific method in the face of the vast data sets now available.

Findings – The authors propose that more data makes theory more important, not less. They differentiate between raw prediction and scientific knowledge, which is aimed at explanation.

Research limitations/implications – These thoughts are preliminary and intended to spark thinking and debate, not represent editorial policy. Due to space constraints, the coverage of many issues is necessarily brief.

Practical implications – Marketing researchers should find these thoughts at the very least stimulating, and may wish to investigate these issues further.

Originality/value – This piece should provide some interesting food for thought for marketing researchers.