21 resultados para big data storage
em Aston University Research Archive
Resumo:
In this paper we evaluate and compare two representativeand popular distributed processing engines for large scalebig data analytics, Spark and graph based engine GraphLab. Wedesign a benchmark suite including representative algorithmsand datasets to compare the performances of the computingengines, from performance aspects of running time, memory andCPU usage, network and I/O overhead. The benchmark suite istested on both local computer cluster and virtual machines oncloud. By varying the number of computers and memory weexamine the scalability of the computing engines with increasingcomputing resources (such as CPU and memory). We also runcross-evaluation of generic and graph based analytic algorithmsover graph processing and generic platforms to identify thepotential performance degradation if only one processing engineis available. It is observed that both computing engines showgood scalability with increase of computing resources. WhileGraphLab largely outperforms Spark for graph algorithms, ithas close running time performance as Spark for non-graphalgorithms. Additionally the running time with Spark for graphalgorithms over cloud virtual machines is observed to increaseby almost 100% compared to over local computer clusters.
Resumo:
At the moment, the phrases “big data” and “analytics” are often being used as if they were magic incantations that will solve all an organization’s problems at a stroke. The reality is that data on its own, even with the application of analytics, will not solve any problems. The resources that analytics and big data can consume represent a significant strategic risk if applied ineffectively. Any analysis of data needs to be guided, and to lead to action. So while analytics may lead to knowledge and intelligence (in the military sense of that term), it also needs the input of knowledge and intelligence (in the human sense of that term). And somebody then has to do something new or different as a result of the new insights, or it won’t have been done to any purpose. Using an analytics example concerning accounts payable in the public sector in Canada, this paper reviews thinking from the domains of analytics, risk management and knowledge management, to show some of the pitfalls, and to present a holistic picture of how knowledge management might help tackle the challenges of big data and analytics.
Resumo:
Semantics, knowledge and Grids represent three spaces where people interact, understand, learn and create. Grids represent the advanced cyber-infrastructures and evolution. Big data influence the evolution of semantics, knowledge and Grids. Exploring semantics, knowledge and Grids on big data helps accelerate the shift of scientific paradigm, the fourth industrial revolution, and the transformational innovation of technologies.
Resumo:
The increasing demand for high capacity data storage requires decreasing the head-to-tape gap and reducing the track width. A problem very often encountered is the development of adhesive debris on the heads at low humidity and high temperatures that can lead to an increase of space between the head and media, and thus a decrease in the playback signal. The influence of stains on the playback signal of reading heads is studied using RAW (Read After Write) tests and their influence on the wear of the heads by using indentation technique. The playback signal has been found to vary and the errors to increase as stains form a patchy pattern and grow in size to form a continuous layer. The indentation technique shows that stains reduce the wear rate of the heads. In addition, the wear tends to be more pronounced at the leading edge of the head compared to the trailing one. Chemical analysis of the stains using ferrite samples in conjunction with MP (metal particulate) tapes shows that stains contain iron particles and polymeric binder transferred from the MP tape. The chemical anchors in the binder used to grip the iron particles now react with the ferrite surface to create strong chemical bonds. At high humidity, a thin layer of iron oxyhydroxide forms on the surface of the ferrite. This soft material increases the wear rate and so reduces the amount of stain present on the heads. The stability of the binder under high humidity and under high temperature as well as the chemical reactions that might occur on the ferrite poles of the heads influences the dynamic behaviour of stains. A model of stain formation taking into account the channels of binder degradation and evolution upon different environmental conditions is proposed.
Resumo:
We analyze a Big Data set of geo-tagged tweets for a year (Oct. 2013–Oct. 2014) to understand the regional linguistic variation in the U.S. Prior work on regional linguistic variations usually took a long time to collect data and focused on either rural or urban areas. Geo-tagged Twitter data offers an unprecedented database with rich linguistic representation of fine spatiotemporal resolution and continuity. From the one-year Twitter corpus, we extract lexical characteristics for twitter users by summarizing the frequencies of a set of lexical alternations that each user has used. We spatially aggregate and smooth each lexical characteristic to derive county-based linguistic variables, from which orthogonal dimensions are extracted using the principal component analysis (PCA). Finally a regionalization method is used to discover hierarchical dialect regions using the PCA components. The regionalization results reveal interesting linguistic regional variations in the U.S. The discovered regions not only confirm past research findings in the literature but also provide new insights and a more detailed understanding of very recent linguistic patterns in the U.S.
Resumo:
Parkinson's disease is a complex heterogeneous disorder with urgent need for disease-modifying therapies. Progress in successful therapeutic approaches for PD will require an unprecedented level of collaboration. At a workshop hosted by Parkinson's UK and co-organized by Critical Path Institute's (C-Path) Coalition Against Major Diseases (CAMD) Consortiums, investigators from industry, academia, government and regulatory agencies agreed on the need for sharing of data to enable future success. Government agencies included EMA, FDA, NINDS/NIH and IMI (Innovative Medicines Initiative). Emerging discoveries in new biomarkers and genetic endophenotypes are contributing to our understanding of the underlying pathophysiology of PD. In parallel there is growing recognition that early intervention will be key for successful treatments aimed at disease modification. At present, there is a lack of a comprehensive understanding of disease progression and the many factors that contribute to disease progression heterogeneity. Novel therapeutic targets and trial designs that incorporate existing and new biomarkers to evaluate drug effects independently and in combination are required. The integration of robust clinical data sets is viewed as a powerful approach to hasten medical discovery and therapies, as is being realized across diverse disease conditions employing big data analytics for healthcare. The application of lessons learned from parallel efforts is critical to identify barriers and enable a viable path forward. A roadmap is presented for a regulatory, academic, industry and advocacy driven integrated initiative that aims to facilitate and streamline new drug trials and registrations in Parkinson's disease.
Resumo:
Information systems have developed to the stage that there is plenty of data available in most organisations but there are still major problems in turning that data into information for management decision making. This thesis argues that the link between decision support information and transaction processing data should be through a common object model which reflects the real world of the organisation and encompasses the artefacts of the information system. The CORD (Collections, Objects, Roles and Domains) model is developed which is richer in appropriate modelling abstractions than current Object Models. A flexible Object Prototyping tool based on a Semantic Data Storage Manager has been developed which enables a variety of models to be stored and experimented with. A statistical summary table model COST (Collections of Objects Statistical Table) has been developed within CORD and is shown to be adequate to meet the modelling needs of Decision Support and Executive Information Systems. The COST model is supported by a statistical table creator and editor COSTed which is also built on top of the Object Prototyper and uses the CORD model to manage its metadata.
Resumo:
The aim of the research project was to gain d complete and accurate accounting of the needs and deficiencies of materials selection and design data, with particular attention given to the feasibility of a computerised materials selection system that would include application analysis, property data and screening techniques. The project also investigates and integrates the three major aspects of materials resources, materials selection and materials recycling. Consideration of the materials resource base suggests that, though our discovery potential has increased, geologic availability is the ultimate determinant and several metals may well become scarce at the same time, thus compounding the problem of substitution. With around 2- to 20- million units of engineering materials data, the use of a computer is the only logical answer for scientific selection of materials. The system developed at Aston is used for data storage, mathematical computation and output. The system enables programs to be run in batch and interactive (on-line) mode. The program with modification can also handle such variables as quantity of mineral resources, energy cost of materials and depletion and utilisation rates of strateqic materials. The work also carries out an in-depth study of copper recycling in the U.K. and concludes that, somewhere in the region of 2 million tonnes of copper is missing from the recycling cycle. It also sets out guidelines on product design and conservation policies from the recyclability point of view.
Resumo:
A survey of the existing state-of-the-art of turbine blade manufacture highlights two operations that have not been automated namely that of loading of a turbine blade into an encapsulation die, and that of removing a machined blade from the encapsulation block. The automation of blade decapsulation has not been pursued. In order to develop a system to automate the loading of an encapsulation die a prototype mechanical handling robot has been designed together with a computer controlled encapsulation die. The robot has been designed as a mechanical handling robot of cylindrical geometry, suitable for use in a circular work cell. It is the prototype for a production model to be called `The Cybermate'. The prototype robot is mechanically complete but due to unforeseen circumstances the robot control system is not available (the development of the control system did not form a part of this project), hence it has not been possible to fully test and assess the robot mechanical design. Robot loading of the encapsulation die has thus been simulated. The research work with regard to the encapsulation die has focused on the development of computer controlled, hydraulically actuated, location pins. Such pins compensate for the inherent positional inaccuracy of the loading robot and reproduce the dexterity of the human operator. Each pin comprises a miniature hydraulic cylinder, controlled by a standard bidirectional flow control valve. The precision positional control is obtained through pulsing of the valves under software control, with positional feedback from an 8-bit transducer. A test-rig comprising one hydraulic location pin together with an opposing spring loaded pin has demonstrated that such a pin arrangement can be controlled with a repeatability of +/-.00045'. In addition this test-rig has demonstrated that such a pin arrangement can be used to gauge and compensate for the dimensional error of the component held between the pins, by offsetting the pin datum positions to allow for the component error. A gauging repeatability of +/- 0.00015' was demonstrated. This work has led to the design and manufacture of an encapsulation die comprising ten such pins and the associated computer software. All aspects of the control software except blade gauging and positional data storage have been demonstrated. Work is now required to achieve the accuracy of control demonstrated by the single pin test-rig, with each of the ten pins in the encapsulation die. This would allow trials of the complete loading cycle to take place.
Resumo:
We consider a variation of the prototype combinatorial optimization problem known as graph colouring. Our optimization goal is to colour the vertices of a graph with a fixed number of colours, in a way to maximize the number of different colours present in the set of nearest neighbours of each given vertex. This problem, which we pictorially call palette-colouring, has been recently addressed as a basic example of a problem arising in the context of distributed data storage. Even though it has not been proved to be NP-complete, random search algorithms find the problem hard to solve. Heuristics based on a naive belief propagation algorithm are observed to work quite well in certain conditions. In this paper, we build upon the mentioned result, working out the correct belief propagation algorithm, which needs to take into account the many-body nature of the constraints present in this problem. This method improves the naive belief propagation approach at the cost of increased computational effort. We also investigate the emergence of a satisfiable-to-unsatisfiable 'phase transition' as a function of the vertex mean degree, for different ensembles of sparse random graphs in the large size ('thermodynamic') limit.
Resumo:
Mode-locked lasers emitting a train of femtosecond pulses called dissipative solitons are an enabling technology for metrology, high-resolution spectroscopy, fibre optic communications, nano-optics and many other fields of science and applications. Recently, the vector nature of dissipative solitons has been exploited to demonstrate mode locked lasing with both locked and rapidly evolving states of polarisation. Here, for an erbium-doped fibre laser mode locked with carbon nanotubes, we demonstrate the first experimental and theoretical evidence of a new class of slowly evolving vector solitons characterized by a double-scroll chaotic polarisation attractor substantially different from Lorenz, Rössler and Ikeda strange attractors. The underlying physics comprises a long time scale coherent coupling of two polarisation modes. The observed phenomena, apart from the fundamental interest, provide a base for advances in secure communications, trapping and manipulation of atoms and nanoparticles, control of magnetisation in data storage devices and many other areas. © 2014 CIOMP. All rights reserved.
Resumo:
GraphChi is the first reported disk-based graph engine that can handle billion-scale graphs on a single PC efficiently. GraphChi is able to execute several advanced data mining, graph mining and machine learning algorithms on very large graphs. With the novel technique of parallel sliding windows (PSW) to load subgraph from disk to memory for vertices and edges updating, it can achieve data processing performance close to and even better than those of mainstream distributed graph engines. GraphChi mentioned that its memory is not effectively utilized with large dataset, which leads to suboptimal computation performances. In this paper we are motivated by the concepts of 'pin ' from TurboGraph and 'ghost' from GraphLab to propose a new memory utilization mode for GraphChi, which is called Part-in-memory mode, to improve the GraphChi algorithm performance. The main idea is to pin a fixed part of data inside the memory during the whole computing process. Part-in-memory mode is successfully implemented with only about 40 additional lines of code to the original GraphChi engine. Extensive experiments are performed with large real datasets (including Twitter graph with 1.4 billion edges). The preliminary results show that Part-in-memory mode memory management approach effectively reduces the GraphChi running time by up to 60% in PageRank algorithm. Interestingly it is found that a larger portion of data pinned in memory does not always lead to better performance in the case that the whole dataset cannot be fitted in memory. There exists an optimal portion of data which should be kept in the memory to achieve the best computational performance.
Resumo:
We report on a new vector model of an erbium-doped fibre laser mode locked with carbon nanotubes. This model goes beyond the limitations of the previously used models based on either coupled nonlinear Schrödinger or Ginzburg-Landau equations. Unlike the previous models, it accounts for the vector nature of the interaction between an optical field and an erbium-doped active medium, slow relaxation dynamics of erbium ions, linear birefringence in a fibre, linear and circular birefringence of a laser cavity caused by in-cavity polarization controller and light-induced anisotropy caused by elliptically polarized pump field. Interplay of aforementioned factors changes coherent coupling of two polarization modes at a long time scale and so results in a new family of vector solitons (VSs) with fast and slowly evolving states of polarization. The observed VSs can be of interest in secure communications, trapping and manipulation of atoms and nanoparticles, control of magnetization in data storage devices and many other areas.
Resumo:
Multiple transformative forces target marketing, many of which derive from new technologies that allow us to sample thinking in real time (i.e., brain imaging), or to look at large aggregations of decisions (i.e., big data). There has been an inclination to refer to the intersection of these technologies with the general topic of marketing as “neuromarketing”. There has not been a serious effort to frame neuromarketing, which is the goal of this paper. Neuromarketing can be compared to neuroeconomics, wherein neuroeconomics is generally focused on how individuals make “choices”, and represent distributions of choices. Neuromarketing, in contrast, focuses on how a distribution of choices can be shifted or “influenced”, which can occur at multiple “scales” of behavior (e.g., individual, group, or market/society). Given influence can affect choice through many cognitive modalities, and not just that of valuation of choice options, a science of influence also implies a need to develop a model of cognitive function integrating attention, memory, and reward/aversion function. The paper concludes with a brief description of three domains of neuromarketing application for studying influence, and their caveats.
Resumo:
This paper researches on Matthew Effect in Sina Weibo microblogger. We choose the microblogs in the ranking list of Hot Microblog App in Sina Weibo microblogger as target of our study. The differences of repost number of microblogs in the ranking list between before and after the time when it enter the ranking list of Hot Microblog app are analyzed. And we compare the spread features of the microblogs in the ranking list with those hot microblogs not in the list and those ordinary microblogs of users who have some microblog in the ranking list before. Our study proves the existence of Matthew Effect in social network. © 2013 IEEE.