31 results for Big data, Spark, Hadoop


Relevance:

100.00%

Abstract:

The CHARMe project enables the annotation of climate data with key pieces of supporting information that we term “commentary”. Commentary reflects the experience that has built up in the user community, and can help new or less-expert users (such as consultants, SMEs, experts in other fields) to understand and interpret complex data. In the context of global climate services, the CHARMe system will record, retain and disseminate this commentary on climate datasets, and provide a means for feeding back this experience to the data providers. Based on novel linked data techniques and standards, the project has developed a core system, data model and suite of open-source tools to enable this information to be shared, discovered and exploited by the community.

Relevance:

100.00%

Abstract:

We present an overview of the MELODIES project, which is developing new data-intensive environmental services based on data from Earth Observation satellites, government databases, national and European agencies and more. We focus here on the capabilities and benefits of the project’s “technical platform”, which applies cloud computing and Linked Data technologies to enable the development of these services, providing flexibility and scalability.

Relevance:

100.00%

Abstract:

Advances in hardware technologies allow data to be captured and processed in real time, and the resulting high-throughput data streams require novel data mining approaches. The research area of Data Stream Mining (DSM) develops data mining algorithms that allow us to analyse these continuous streams of data in real time. The creation and real-time adaptation of classification models from data streams is one of the most challenging DSM tasks. Current classifiers for streaming data address this problem by using incremental learning algorithms. However, even though these algorithms are fast, they are challenged by high-velocity data streams, where data instances arrive at a fast rate. This is problematic if an application requires no delay, or only a very small delay, between changes in the patterns of the stream and the absorption of these patterns by the classifier. The scalability of traditional data mining algorithms to Big Data has been addressed for static (non-streaming) datasets through the development of parallel classifiers. However, there is very little work on the parallelisation of data stream classification techniques. In this paper we investigate K-Nearest Neighbours (KNN) as the basis for a real-time adaptive and parallel methodology for scalable data stream classification tasks.
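To make the idea of a KNN classifier that adapts to a stream concrete, the following is a minimal sketch only. It assumes a fixed-size sliding window as the adaptation mechanism, which the abstract does not state; the class name `SlidingWindowKNN` and its parameters are hypothetical and are not taken from the paper.

```python
from collections import Counter, deque
import math


class SlidingWindowKNN:
    """Illustrative stream classifier: KNN over the most recent instances.

    Assumption: old instances are forgotten through a fixed-size window.
    This is one simple way to adapt to change, not the paper's method.
    """

    def __init__(self, k=3, window_size=100):
        self.k = k
        # deque(maxlen=...) silently drops the oldest item when full
        self.window = deque(maxlen=window_size)

    def learn(self, x, label):
        """Absorb one labelled instance from the stream."""
        self.window.append((tuple(x), label))

    def predict(self, x):
        """Majority vote among the k nearest instances in the window."""
        if not self.window:
            return None
        nearest = sorted(
            self.window, key=lambda item: math.dist(item[0], x)
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]
```

Because each distance computation is independent of the others, the neighbour search is the part that could in principle be split across workers, which is the property the abstract refers to when it describes KNN as a basis for a parallel methodology.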

Relevance:

100.00%

Abstract:

The Environmental Data Abstraction Library (EDAL) provides a modular data management library for bringing new and diverse data types together for visualisation within numerous software packages, including the ncWMS viewing service, which already has very wide international uptake. The structure of EDAL is presented along with examples of its use to compare satellite, model and in situ data types within the same visualisation framework. We emphasize the value of this capability for cross-calibration of datasets and evaluation of model products against observations, including preparation for data assimilation.

Relevance:

100.00%

Abstract:

An important application of Big Data Analytics is the real-time analysis of streaming data. Streaming data poses unique challenges for data mining algorithms: concept drift, the need to analyse data on the fly because streams are unbounded, and the need for scalable algorithms because throughput is potentially high. Real-time classification algorithms that are fast and adaptive to concept drift exist; however, most approaches are not naturally parallel and are thus limited in their scalability. This paper presents work on the Micro-Cluster Nearest Neighbour (MC-NN) classifier. MC-NN builds on an adaptive statistical data summary composed of Micro-Clusters. MC-NN is very fast and adaptive to concept drift whilst maintaining the parallel properties of the base KNN classifier. MC-NN is also competitive with existing data stream classifiers in terms of accuracy and speed.
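The abstract does not describe the internals of a Micro-Cluster, so the sketch below only illustrates the general idea of replacing raw instances with an incrementally updated statistical summary. It assumes one summary per class label, holding an instance count and a per-feature linear sum, and classifies by nearest centroid. The names `MicroCluster` and `NearestMicroCluster` are hypothetical, and the real MC-NN classifier is more elaborate than this.

```python
import math


class MicroCluster:
    """Summary of many instances: a count and a per-feature linear sum."""

    def __init__(self, x, label):
        self.n = 1
        self.linear_sum = list(x)
        self.label = label

    def absorb(self, x):
        """Update the summary in place; the raw instance is not stored."""
        self.n += 1
        self.linear_sum = [s + v for s, v in zip(self.linear_sum, x)]

    @property
    def centroid(self):
        return [s / self.n for s in self.linear_sum]


class NearestMicroCluster:
    """Assumption for illustration: exactly one micro-cluster per label."""

    def __init__(self):
        self.clusters = {}  # label -> MicroCluster

    def learn(self, x, label):
        if label in self.clusters:
            self.clusters[label].absorb(x)
        else:
            self.clusters[label] = MicroCluster(x, label)

    def predict(self, x):
        """Return the label of the micro-cluster with the nearest centroid."""
        if not self.clusters:
            return None
        best = min(
            self.clusters.values(),
            key=lambda mc: math.dist(mc.centroid, x),
        )
        return best.label
```

The memory footprint depends on the number of summaries rather than the number of instances seen, which is why a summary of this kind suits unbounded streams.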

Relevance:

100.00%

Abstract:

Automatic generation of classification rules has been an increasingly popular technique in commercial applications such as Big Data analytics, rule-based expert systems and decision-making systems. However, a principal problem that arises with most methods for generating classification rules is the overfitting of training data. When dealing with Big Data, this may result in the generation of a large number of complex rules, which may not only increase computational cost but also lower the accuracy in predicting unseen instances. This has led to the necessity of developing pruning methods for the simplification of rules. In addition, once generated, classification rules are used to make predictions. For efficiency, the first rule that fires should be found as quickly as possible when searching through a rule set, so a suitable structure is required to represent the rule set effectively. In this chapter, the authors introduce a unified framework for the construction of rule-based classification systems consisting of three operations on Big Data: rule generation, rule simplification and rule representation. The authors also review some existing methods and techniques used for each of the three operations and highlight their limitations. They introduce some novel methods and techniques that they have developed recently. These methods and techniques are also discussed in comparison to existing ones with respect to efficient processing of Big Data.
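The notion of "the first rule that fires" can be illustrated with the simplest possible rule representation, an ordered list searched linearly. This is only a baseline sketch under assumed conventions: each rule is a pair of attribute-value conditions and a class label, and the function name `first_firing_rule` is hypothetical. The chapter's own representations are not described in the abstract and are not reproduced here.

```python
def first_firing_rule(rules, instance, default=None):
    """Return the label of the first rule whose conditions all match.

    rules:    ordered list of (conditions, label), where conditions is a
              dict mapping attribute name -> required value
    instance: dict mapping attribute name -> observed value
    """
    for conditions, label in rules:
        if all(instance.get(attr) == value
               for attr, value in conditions.items()):
            return label
    return default  # no rule fired


# Illustrative rule set (made-up attributes and values)
rules = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "overcast"}, "yes"),
]
```

With this list representation the search cost grows linearly with the number of rules in the worst case, which is the kind of cost that motivates looking for a more suitable structure.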

Relevance:

100.00%

Abstract:

The induction of classification rules from previously unseen examples is one of the most important data mining tasks in science as well as commercial applications. In order to reduce the influence of noise in the data, ensemble learners are often applied. However, most ensemble learners are based on decision tree classifiers, which are affected by noise. The Random Prism classifier has recently been proposed as an alternative to the popular Random Forests classifier, which is based on decision trees. Random Prism is based on the Prism family of algorithms, which is more robust to noise. However, like most ensemble classification approaches, Random Prism does not scale well on large training data. This paper presents a thorough discussion of Random Prism and a recently proposed parallel version of it called Parallel Random Prism, which is based on the MapReduce programming paradigm. The paper provides, for the first time, a novel theoretical analysis of the proposed technique and an in-depth experimental study, which show that Parallel Random Prism scales well on a large number of training examples, a large number of data features and a large number of processors. The expressiveness of the decision rules that our technique produces makes it a natural choice for Big Data applications where informed decision making increases the user’s trust in the system.
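The general shape of a MapReduce-style ensemble can be sketched as follows: a map step trains one base classifier per bootstrap sample, and a reduce step combines their predictions by majority vote. This is a sequential, in-process illustration under stated assumptions, not Parallel Random Prism itself. The base learner is a deliberately trivial stand-in that looks only at the first attribute, not a Prism rule learner, and all function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap(data, rng):
    """Sample len(data) instances with replacement."""
    return [rng.choice(data) for _ in data]


def train_base(sample):
    """Stand-in base learner (NOT Prism): most frequent label per value
    of attribute 0, falling back to the sample's overall majority label."""
    by_value = {}
    for x, y in sample:
        by_value.setdefault(x[0], Counter())[y] += 1
    fallback = Counter(y for _, y in sample).most_common(1)[0][0]
    table = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    return lambda x: table.get(x[0], fallback)


def map_phase(data, n_models, seed=0):
    """Map step: each bootstrap sample yields one trained base classifier.
    The trainings are independent, so they could run on separate workers."""
    rng = random.Random(seed)
    return [train_base(bootstrap(data, rng)) for _ in range(n_models)]


def reduce_phase(models, x):
    """Reduce step: combine the base classifiers' votes for instance x."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```

In a real deployment the map step would be distributed by a framework such as Hadoop; here both steps run in one process so that the data flow is easy to follow.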

Relevance:

100.00%

Abstract:

This study assesses Autism-Spectrum Quotient (AQ) scores in a ‘big data’ sample collected through the UK Channel 4 television website, following the broadcasting of a medical education program. We examine correlations between the AQ and age, sex, occupation, and UK geographic region in 450,394 individuals. We predicted that age and geography would not be correlated with AQ, whilst sex and occupation would be. The mean AQ score for the total sample was m = 19.83 (SD = 8.71), slightly higher than in a previous systematic review of 6,900 individuals in a non-clinical sample (mean of means = 16.94). This likely reflects that this big-data sample includes individuals with autism, who in the systematic review scored much higher (mean of means = 35.19). As predicted, sex and occupation differences were observed: on average, males (m = 21.55, SD = 8.82) scored higher than females (m = 18.95, SD = 8.52), and individuals working in a STEM career (m = 21.92, SD = 8.92) scored higher than individuals in non-STEM careers (m = 18.92, SD = 8.48). Also as predicted, age and geographic region were not meaningfully correlated with AQ. These results support previous findings relating to sex and STEM careers in the largest set of individuals for which AQ scores have been reported and suggest the AQ is a useful self-report measure of autistic traits.

Relevance:

100.00%

Abstract:

This introduction to the Virtual Special Issue surveys the development of spatial housing economics from its roots in neo-classical theory, through more recent developments in social interactions modelling, and touching on the role of institutions, path dependence and economic history. The survey also points to some of the more promising future directions for the subject that are beginning to appear in the literature. The survey covers elements of hedonic models, spatial econometrics, neighbourhood models, housing market areas, housing supply, models of segregation, migration, housing tenure, sub-national house price modelling including the so-called ripple effect, and agent-based models. Possible future directions are set in the context of a selection of recent papers that have appeared in Urban Studies. Nevertheless, there are still important gaps in the literature that merit further attention, arising at least partly from emerging policy problems. These include more research on housing and biodiversity, the relationship between housing and civil unrest, the effects of changing age distributions - notably housing for the elderly - and the impact of different international institutional structures. Methodologically, developments in Big Data provide an exciting framework for future work.

Relevance:

40.00%

Abstract:

The General Election for the 56th United Kingdom Parliament was held on 7 May 2015. Tweets related to UK politics, not only those with the specific hashtag “#GE2015”, were collected between 1 March and 31 May 2015. The resulting dataset contains over 28 million tweets, totalling 118 GB uncompressed or 15 GB compressed. This study describes the method used to collect the tweets, presents some analysis, including a political sentiment index, and outlines interesting research directions on Big Social Data based on Twitter microblogging.