49 results for big data storage
Abstract:
We present an overview of the MELODIES project, which is developing new data-intensive environmental services based on data from Earth Observation satellites, government databases, national and European agencies and more. We focus here on the capabilities and benefits of the project’s “technical platform”, which applies cloud computing and Linked Data technologies to enable the development of these services, providing flexibility and scalability.
Abstract:
Advances in hardware technologies make it possible to capture and process data in real time, and the resulting high-throughput data streams require novel data mining approaches. The research area of Data Stream Mining (DSM) is developing data mining algorithms that allow us to analyse these continuous streams of data in real time. The creation and real-time adaptation of classification models from data streams is one of the most challenging DSM tasks. Current classifiers for streaming data address this problem by using incremental learning algorithms. However, even though these algorithms are fast, they are challenged by high-velocity data streams, in which data instances arrive at a rapid rate. This is problematic if the application requires little or no delay between changes in the patterns of the stream and the absorption of these patterns by the classifier. The scalability problems that traditional data mining algorithms for static (non-streaming) datasets face with Big Data have been addressed through the development of parallel classifiers. However, there is very little work on the parallelisation of data stream classification techniques. In this paper we investigate K-Nearest Neighbours (KNN) as the basis for a real-time adaptive and parallel methodology for scalable data stream classification tasks.
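As a rough illustration of the underlying idea (a minimal sketch, not the methodology proposed in the paper; all names are illustrative), a KNN stream classifier can keep a fixed-size sliding window of recent labelled instances and classify each new instance by a majority vote among its k nearest neighbours in that window:

```python
from collections import deque
import numpy as np

class SlidingWindowKNN:
    """Illustrative KNN stream classifier over a fixed-size sliding window."""

    def __init__(self, k=3, window_size=1000):
        self.k = k
        self.window = deque(maxlen=window_size)   # holds (features, label) pairs

    def learn(self, x, label):
        # Appending evicts the oldest instance once the window is full,
        # which gives a crude form of adaptation to changes in the stream.
        self.window.append((np.asarray(x, dtype=float), label))

    def predict(self, x):
        if not self.window:
            return None                           # nothing learned yet
        features = np.array([f for f, _ in self.window])
        labels = [lab for _, lab in self.window]
        distances = np.linalg.norm(features - np.asarray(x, dtype=float), axis=1)
        nearest = np.argsort(distances)[: self.k]
        votes = [labels[i] for i in nearest]
        return max(set(votes), key=votes.count)   # majority vote
```

Because each prediction is an independent distance computation over the window, the search over instances is naturally parallelisable, which is the property the paper builds on.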
Abstract:
The Environmental Data Abstraction Library (EDAL) is a modular data management library for bringing new and diverse data types together for visualisation within numerous software packages, including the ncWMS viewing service, which already has very wide international uptake. The structure of EDAL is presented, along with examples of its use to compare satellite, model and in situ data types within the same visualisation framework. We emphasize the value of this capability for cross-calibration of datasets and evaluation of model products against observations, including preparation for data assimilation.
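As a purely hypothetical illustration of the abstraction-layer idea (this is not EDAL's actual API; all class and method names below are invented for the example), diverse data sources can be wrapped behind one common read interface so that a single visualisation front end can treat them uniformly:

```python
from abc import ABC, abstractmethod

class EnvironmentalDataset(ABC):
    """Common interface that satellite, model and in situ readers implement."""

    @abstractmethod
    def read_values(self, variable, bbox, time):
        """Return values of `variable` within a bounding box at a given time."""

class SatelliteSwathReader(EnvironmentalDataset):
    def read_values(self, variable, bbox, time):
        ...  # decode swath files and subset to the requested bounding box

class ModelGridReader(EnvironmentalDataset):
    def read_values(self, variable, bbox, time):
        ...  # subset a gridded model output file (e.g. NetCDF)

class InSituReader(EnvironmentalDataset):
    def read_values(self, variable, bbox, time):
        ...  # query point observations from a station database

def compare(datasets, variable, bbox, time):
    # Because all readers share one interface, cross-comparison (e.g. for
    # calibration or model evaluation) is a uniform loop over sources.
    return {type(d).__name__: d.read_values(variable, bbox, time) for d in datasets}
```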
Abstract:
An important application of Big Data Analytics is the real-time analysis of streaming data. Streaming data imposes unique challenges on data mining algorithms, such as concept drift, the need to analyse the data on the fly because streams are unbounded, and the need for scalable algorithms to cope with potentially high data throughput. Real-time classification algorithms that are fast and adaptive to concept drift exist; however, most approaches are not naturally parallel and are thus limited in their scalability. This paper presents work on the Micro-Cluster Nearest Neighbour (MC-NN) classifier. MC-NN is built on an adaptive statistical data summary based on Micro-Clusters. It is very fast and adaptive to concept drift whilst maintaining the parallel properties of the base KNN classifier. MC-NN is also competitive with existing data stream classifiers in terms of accuracy and speed.
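The following is a deliberately simplified sketch of the nearest-micro-cluster idea (it is not the full MC-NN algorithm; in particular the error counters that trigger micro-cluster splitting and removal under concept drift are omitted, and all names are illustrative):

```python
import numpy as np

class MicroCluster:
    """Simplified statistical summary of instances belonging to one class."""

    def __init__(self, x, label):
        self.n = 1
        self.linear_sum = np.asarray(x, dtype=float)
        self.label = label

    @property
    def centroid(self):
        return self.linear_sum / self.n

    def absorb(self, x):
        # Updating sums instead of storing instances keeps memory bounded.
        self.n += 1
        self.linear_sum += x

class SimplifiedMCNN:
    """Illustrative nearest-micro-cluster classifier (not the full MC-NN)."""

    def __init__(self):
        self.clusters = []

    def _nearest(self, x):
        return min(self.clusters, key=lambda c: np.linalg.norm(c.centroid - x))

    def predict(self, x):
        x = np.asarray(x, dtype=float)
        return self._nearest(x).label if self.clusters else None

    def learn(self, x, label):
        x = np.asarray(x, dtype=float)
        if not self.clusters:
            self.clusters.append(MicroCluster(x, label))
            return
        nearest = self._nearest(x)
        if nearest.label == label:
            nearest.absorb(x)                             # nearby correct cluster absorbs the instance
        else:
            self.clusters.append(MicroCluster(x, label))  # otherwise start a new cluster
```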
Abstract:
The accurate prediction of storms is vital to the oil and gas sector for the management of its operations. An overview of research exploring the prediction of storms by ensemble prediction systems is presented and its application to the oil and gas sector is discussed. The analysis method used requires larger amounts of data storage and computer processing time than other, more conventional analysis methods. To overcome these difficulties, eScience techniques have been utilised. These techniques potentially have applications in the oil and gas sector, helping to incorporate environmental data into its information systems.
Abstract:
Models play a vital role in supporting a range of activities in numerous domains. We rely on models to support the design, visualisation, analysis and representation of parts of the world around us, and as such significant research effort has been invested in numerous areas of modelling, including support for model semantics, dynamic states and behaviour, temporal data storage and visualisation. Whilst these efforts have increased our capabilities and allowed us to create increasingly powerful software-based models, the process of developing models, supporting tools and/or data structures remains difficult, expensive and error-prone. In this paper we define, from the literature, the key factors in assessing a model’s quality and usefulness: semantic richness, support for dynamic states and object behaviour, temporal data storage and visualisation. We also identify a number of shortcomings in both existing modelling standards and model development processes, and propose a unified generic process to guide users through the development of semantically rich, dynamic and temporal models.
Abstract:
The discourse surrounding the virtual has moved away from the utopian thinking accompanying the rise of the Internet in the 1990s. The cyber-gurus of the last decades promised a technotopia removed from materiality and the confines of the flesh and the built environment, a liberation from old institutions and power structures. But since then, the virtual has grown into a distinct yet related sphere of cultural and political production that both parallels and occasionally flows over into the old world of material objects. The strict dichotomy of matter and digital purity has more recently been replaced with a more complex model in which both the world of stuff and the world of knowledge support, resist and at the same time contain each other. Online social networks amplify and extend existing ones; other cultural interfaces like YouTube have not replaced the communal experience of watching moving images in a semi-public space (the cinema) or the semi-private space (the family living room). Rather, the experience of viewing is very much about sharing and communicating, offering interpretations and comments. Many of the web's strongest entities (Amazon, eBay, Gumtree etc.) sit exactly at this juncture, applying tools taken from the knowledge management industry to organize the chaos of the material world along (post-)Fordist rationality. Since the early 1990s there have been many artistic and curatorial attempts to use the Internet as a platform for producing and exhibiting art, but many of these were reluctant to let go of the fantasy of digital freedom. Storage Room collapses the binary opposition of real and virtual space by using online data storage as a conduit for IRL art production. The artworks here will not be available for viewing online in a 'screen' environment but only as part of a downloadable package, with the intention that the exhibition could be displayed (in a physical space) by any interested party and realised as ambitiously or minimally as the downloader wishes, based on their means. The artists will therefore also supply a set of instructions for the physical installation of the work alongside the digital files. In response to this curatorial initiative, File Transfer Protocol invites seven UK-based artists to produce digital art for a physical environment, addressing the intersection between the virtual and the material. The files range from sound, video, digital prints and net art to blueprints for an action to take place, something to be made, a conceptual text piece, etc.

About the works and artists:

Polly Fibre is the pseudonym of London-based artist Christine Ellison. Ellison creates live music using domestic devices such as sewing machines, irons and slide projectors. Her costumes and stage sets propose a physical manifestation of the virtual space that is created inside software like Photoshop. For this exhibition, Polly Fibre invites the audience to create a musical composition using a pair of amplified scissors and a turntable. http://www.pollyfibre.com

John Russell, a founding member of 1990s art group Bank, is an artist, curator and writer who explores in his work the contemporary political conditions of the work of art. In his digital print, Russell collages together visual representations of abstract philosophical ideas and transforms them into a post-apocalyptic landscape that is complex and banal at the same time. www.john-russell.org

The work of Bristol-based artist Jem Nobel opens up a dialogue between the contemporary and the legacy of 20th-century conceptual art around questions of collectivism and participation, authorship and individualism. His print SPACE concretizes the representation of the most common piece of Unicode: the vacant space between words. In this way, the gap itself turns from invisible cipher to sign. www.jemnoble.com

Annabel Frearson is rewriting Mary Shelley's Frankenstein using all and only the words from the original text. Frankenstein 2, or the Monster of Main Stream, is read in parts by different performers, embodying the psychotic character of the protagonist, a mongrel hybrid of used language. www.annabelfrearson.com

Darren Banks uses fragments of effect-laden Hollywood films to create an impossible space. The fictitious parts don't add up to a convincing material reality, leaving the viewer with a failed amalgamation of simulations of sophisticated technologies. www.darrenbanks.co.uk

FIELDCLUB is a collaboration between artist Paul Chaney and researcher Kenna Hernly. Chaney and Hernly together developed a project that critically examines various proposals for the management of sustainable ecological systems. Their FIELDMACHINE invites the public to design an ideal agricultural field. By playing with different types of crops found in the south west of England, the user can, for example, create a balanced but protein-poor diet, or simply decide to 'get rid' of half the population. The meeting point of the Platonic field and its physical consequences generates a geometric abstraction that investigates the relationship between modernist utopianism and contemporary actuality. www.fieldclub.co.uk

Pil and Galia Kollectiv, who have also curated the exhibition, are London-based artists and run the xero, kline & coma gallery. Here they present a dialogue between two computers. The conversation opens with a simple textbook problem in business studies, but gradually the language, mimicking the application of game theory in the business sector, becomes more abstract. The two interlocutors become adversaries trapped forever in a competition without winners. www.kollectiv.co.uk
Abstract:
SOA (Service Oriented Architecture), workflow, the Semantic Web and Grid computing are key enabling information technologies in the development of increasingly sophisticated e-Science infrastructures and application platforms. While the emergence of Cloud computing as a new computing paradigm has provided new directions and opportunities for e-Science infrastructure development, it also presents some challenges. Scientific research is increasingly finding it difficult to handle “big data” using traditional data processing techniques. Such challenges demonstrate the need for a comprehensive analysis of how the above-mentioned informatics techniques can be used to develop appropriate e-Science infrastructures and platforms in the context of Cloud computing. This survey paper describes recent research advances in applying informatics techniques to facilitate scientific research, particularly from the Cloud computing perspective. Our particular contributions include identifying the associated research challenges and opportunities, presenting lessons learned, and describing our future vision for applying Cloud computing to e-Science. We believe our research findings can help indicate the future trend of e-Science, and can inform funding and research directions on how to more appropriately employ computing technologies in scientific research. We point out open research issues in the hope of sparking new development and innovation in the e-Science field.
Abstract:
Automatic generation of classification rules has been an increasingly popular technique in commercial applications such as Big Data analytics, rule-based expert systems and decision-making systems. However, a principal problem that arises with most methods for generating classification rules is the overfitting of training data. With Big Data, this may result in the generation of a large number of complex rules, which may not only increase computational cost but also lower the accuracy of predicting further unseen instances. This has led to the need to develop pruning methods for the simplification of rules. In addition, classification rules are used to make predictions once their generation is complete. Where efficiency is concerned, the first rule that fires should be found as quickly as possible when searching through a rule set; thus a suitable structure is required to represent the rule set effectively. In this chapter, the authors introduce a unified framework for the construction of rule-based classification systems, consisting of three operations on Big Data: rule generation, rule simplification and rule representation. The authors also review some existing methods and techniques used for each of the three operations and highlight their limitations. They then introduce some novel methods and techniques that they have recently developed, and discuss these in comparison with existing ones with respect to the efficient processing of Big Data.
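As a rough illustration of the rule representation and prediction step described above (a minimal sketch, not the authors' framework; the attribute names and helper functions are hypothetical), a rule set can be stored as an ordered list and prediction returns the class of the first rule that fires:

```python
def make_rule(conditions, label):
    """conditions: dict mapping attribute name -> required value."""
    return {"conditions": conditions, "label": label}

def fires(rule, instance):
    # A rule fires when all of its conditions hold for the instance.
    return all(instance.get(attr) == value
               for attr, value in rule["conditions"].items())

def predict(rule_list, instance, default_label=None):
    for rule in rule_list:            # linear search; the chapter discusses
        if fires(rule, instance):     # structures that locate the firing rule faster
            return rule["label"]
    return default_label              # fall back if no rule fires

rules = [
    make_rule({"outlook": "sunny", "humidity": "high"}, "no"),
    make_rule({"outlook": "overcast"}, "yes"),
]
print(predict(rules, {"outlook": "overcast", "humidity": "normal"}))  # -> "yes"
```

Rule simplification (pruning) shortens or removes rules from such a list, which both reduces the search cost per prediction and counters overfitting.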
Abstract:
The induction of classification rules from previously unseen examples is one of the most important data mining tasks in science as well as in commercial applications. In order to reduce the influence of noise in the data, ensemble learners are often applied. However, most ensemble learners are based on decision tree classifiers, which are themselves affected by noise. The Random Prism classifier has recently been proposed as an alternative to the popular Random Forests classifier, which is based on decision trees. Random Prism is based on the Prism family of algorithms, which is more robust to noise. However, like most ensemble classification approaches, Random Prism does not scale well to large training data. This paper presents a thorough discussion of Random Prism and a recently proposed parallel version of it called Parallel Random Prism, which is based on the MapReduce programming paradigm. The paper provides, for the first time, a novel theoretical analysis of the proposed technique and an in-depth experimental study showing that Parallel Random Prism scales well with a large number of training examples, a large number of data features and a large number of processors. The expressiveness of the decision rules that our technique produces makes it a natural choice for Big Data applications, where informed decision making increases the user’s trust in the system.
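The sketch below illustrates the general MapReduce-style pattern behind such a parallel ensemble (it is not the authors' Hadoop implementation, and the base learner is a hypothetical stand-in rather than a Prism rule inducer): the "map" step trains independent base classifiers on bootstrap samples in parallel, and the "reduce" step combines their votes.

```python
from multiprocessing import Pool
from collections import Counter
import random

class MajorityRuleModel:
    """Hypothetical stand-in for a Prism-family rule set: it simply remembers
    the majority class of its training sample."""
    def __init__(self, label):
        self.label = label
    def predict(self, instance):
        return self.label

def train_on_bootstrap(data):
    # "Map" step: each worker draws a bootstrap sample and trains one base model.
    sample = [random.choice(data) for _ in data]
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return MajorityRuleModel(majority)

def ensemble_predict(models, instance):
    # "Reduce" step: combine the base models by majority vote.
    votes = Counter(m.predict(instance) for m in models)
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    data = [({"x": i}, "a" if i % 3 else "b") for i in range(100)]
    with Pool(4) as pool:
        models = pool.map(train_on_bootstrap, [data] * 10)  # 10 base models in parallel
    print(ensemble_predict(models, {"x": 5}))
```

Because each base classifier is trained independently on its own sample, adding processors (mappers) gives near-linear speed-up, which is the scalability property the paper studies.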