856 resultados para Data Driven Clustering


Relevância:

30.00% 30.00%

Publicador:

Resumo:

A Bayesian method of classifying observations that are assumed to come from a number of distinct subpopulations is outlined. The method is illustrated with simulated data and applied to the classification of farms according to their level and variability of income. The resultant classification shows a greater diversity of technical charactersitics within farm types than is conventionally the case. The range of mean farm income between groups in the new classification is wider than that of the conventional method and the variability of income within groups is narrower. Results show that the highest income group in 2000 included large specialist dairy farmers and pig and poultry producers, whilst in 2001 it included large and small specialist dairy farms and large mixed dairy and arable farms. In both years the lowest income group is dominated by non-milk producing livestock farms.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Over the last decade, a number of new methods of population genetic analysis based on likelihood have been introduced. This review describes and explains the general statistical techniques that have recently been used, and discusses the underlying population genetic models. Experimental papers that use these methods to infer human demographic and phylogeographic history are reviewed. It appears that the use of likelihood has hitherto had little impact in the field of human population genetics, which is still primarily driven by more traditional approaches. However, with the current uncertainty about the effects of natural selection, population structure and ascertainment of single-nucleotide polymorphism markers, it is suggested that likelihood-based methods may have a greater impact in the future.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Studies of ignorance-driven decision making have been employed to analyse when ignorance should prove advantageous on theoretical grounds or else they have been employed to examine whether human behaviour is consistent with an ignorance-driven inference strategy (e. g., the recognition heuristic). In the current study we examine whether-under conditions where such inferences might be expected-the advantages that theoretical analyses predict are evident in human performance data. A single experiment shows that, when asked to make relative wealth judgements, participants reliably use recognition as a basis for their judgements. Their wealth judgements under these conditions are reliably more accurate when some of the target names are unknown than when participants recognize all of the names (a "less-is-more effect"). These results are consistent across a number of variations: the number of options given to participants and the nature of the wealth judgement. A basic model of recognition-based inference predicts these effects.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

A large volume of visual content is inaccessible until effective and efficient indexing and retrieval of such data is achieved. In this paper, we introduce the DREAM system, which is a knowledge-assisted semantic-driven context-aware visual information retrieval system applied in the film post production domain. We mainly focus on the automatic labelling and topic map related aspects of the framework. The use of the context- related collateral knowledge, represented by a novel probabilistic based visual keyword co-occurrence matrix, had been proven effective via the experiments conducted during system evaluation. The automatically generated semantic labels were fed into the Topic Map Engine which can automatically construct ontological networks using Topic Maps technology, which dramatically enhances the indexing and retrieval performance of the system towards an even higher semantic level.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

In this work a method for building multiple-model structures is presented. A clustering algorithm that uses data from the system is employed to define the architecture of the multiple-model, including the size of the region covered by each model, and the number of models. A heating ventilation and air conditioning system is used as a testbed of the proposed method.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

In this work a method for building multiple-model structures is presented. A clustering algorithm that uses data from the system is employed to define the architecture of the multiple-model, including the size of the region covered by each model, and the number of models. A heating ventilation and air conditioning system is used as a testbed of the proposed method.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The Self-Organizing Map (SOM) is a popular unsupervised neural network able to provide effective clustering and data visualization for data represented in multidimensional input spaces. In this paper, we describe Fast Learning SOM (FLSOM) which adopts a learning algorithm that improves the performance of the standard SOM with respect to the convergence time in the training phase. We show that FLSOM also improves the quality of the map by providing better clustering quality and topology preservation of multidimensional input data. Several tests have been carried out on different multidimensional datasets, which demonstrate better performances of the algorithm in comparison with the original SOM.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

A novel framework referred to as collaterally confirmed labelling (CCL) is proposed, aiming at localising the visual semantics to regions of interest in images with textual keywords. Both the primary image and collateral textual modalities are exploited in a mutually co-referencing and complementary fashion. The collateral content and context-based knowledge is used to bias the mapping from the low-level region-based visual primitives to the high-level visual concepts defined in a visual vocabulary. We introduce the notion of collateral context, which is represented as a co-occurrence matrix of the visual keywords. A collaborative mapping scheme is devised using statistical methods like Gaussian distribution or Euclidean distance together with collateral content and context-driven inference mechanism. We introduce a novel high-level visual content descriptor that is devised for performing semantic-based image classification and retrieval. The proposed image feature vector model is fundamentally underpinned by the CCL framework. Two different high-level image feature vector models are developed based on the CCL labelling of results for the purposes of image data clustering and retrieval, respectively. A subset of the Corel image collection has been used for evaluating our proposed method. The experimental results to-date already indicate that the proposed semantic-based visual content descriptors outperform both traditional visual and textual image feature models. (C) 2007 Elsevier B.V. All rights reserved.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Dissolved organic carbon (DOC) concentrations have been rising in streams and lakes draining catchments with organic soils across Northern Europe. These increases have shown a correlation with decreased sulphate and chloride concentrations. One hypothesis to explain this phenomenon is that these relationships are due an increased in DOC release from soils to freshwaters, caused by a decline in pollutant sulphur and sea-salt deposition. We carried out controlled deposition experiments in the laboratory on intact peat and organomineral O-horizon cores to test this hypothesis. Preliminary data showed a clear correlation between the change in soil water pH and change in DOC concentrations, however uncertainty still remains about whether this is due to changes in biological activity or chemical solubility.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Measured process data normally contain inaccuracies because the measurements are obtained using imperfect instruments. As well as random errors one can expect systematic bias caused by miscalibrated instruments or outliers caused by process peaks such as sudden power fluctuations. Data reconciliation is the adjustment of a set of process data based on a model of the process so that the derived estimates conform to natural laws. In this paper, techniques for the detection and identification of both systematic bias and outliers in dynamic process data are presented. A novel technique for the detection and identification of systematic bias is formulated and presented. The problem of detection, identification and elimination of outliers is also treated using a modified version of a previously available clustering technique. These techniques are also combined to provide a global dynamic data reconciliation (DDR) strategy. The algorithms presented are tested in isolation and in combination using dynamic simulations of two continuous stirred tank reactors (CSTR).

Relevância:

30.00% 30.00%

Publicador:

Resumo:

OPAL is an English national programme that takes scientists into the community to investigate environmental issues. Biological monitoring plays a pivotal role covering topics of: i) soil and earthworms; ii) air, lichens and tar spot on sycamore; iii) water and aquatic invertebrates; iv) biodiversity and hedgerows; v) climate, clouds and thermal comfort. Each survey has been developed by an interdisciplinary team and tested by voluntary, statutory and community sectors. Data are submitted via the web and instantly mapped. Preliminary results are presented, together with a discussion on data quality and uncertainty. Communities also investigate local pollution issues, ranging from nitrogen deposition on heathlands to traffic emissions on roadside vegetation. Over 200,000 people have participated so far, including over 1000 schools and 1000 voluntary groups. Benefits include a substantial, growing database on biodiversity and habitat condition, much from previously unsampled sites particularly in urban areas, and a more engaged public.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

In this study, we systematically compare a wide range of observational and numerical precipitation datasets for Central Asia. Data considered include two re-analyses, three datasets based on direct observations, and the output of a regional climate model simulation driven by a global re-analysis. These are validated and intercompared with respect to their ability to represent the Central Asian precipitation climate. In each of the datasets, we consider the mean spatial distribution and the seasonal cycle of precipitation, the amplitude of interannual variability, the representation of individual yearly anomalies, the precipitation sensitivity (i.e. the response to wet and dry conditions), and the temporal homogeneity of precipitation. Additionally, we carried out part of these analyses for datasets available in real time. The mutual agreement between the observations is used as an indication of how far these data can be used for validating precipitation data from other sources. In particular, we show that the observations usually agree qualitatively on anomalies in individual years while it is not always possible to use them for the quantitative validation of the amplitude of interannual variability. The regional climate model is capable of improving the spatial distribution of precipitation. At the same time, it strongly underestimates summer precipitation and its variability, while interannual variations are well represented during the other seasons, in particular in the Central Asian mountains during winter and spring

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The K-Means algorithm for cluster analysis is one of the most influential and popular data mining methods. Its straightforward parallel formulation is well suited for distributed memory systems with reliable interconnection networks. However, in large-scale geographically distributed systems the straightforward parallel algorithm can be rendered useless by a single communication failure or high latency in communication paths. This work proposes a fully decentralised algorithm (Epidemic K-Means) which does not require global communication and is intrinsically fault tolerant. The proposed distributed K-Means algorithm provides a clustering solution which can approximate the solution of an ideal centralised algorithm over the aggregated data as closely as desired. A comparative performance analysis is carried out against the state of the art distributed K-Means algorithms based on sampling methods. The experimental analysis confirms that the proposed algorithm is a practical and accurate distributed K-Means implementation for networked systems of very large and extreme scale.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Ensemble clustering (EC) can arise in data assimilation with ensemble square root filters (EnSRFs) using non-linear models: an M-member ensemble splits into a single outlier and a cluster of M−1 members. The stochastic Ensemble Kalman Filter does not present this problem. Modifications to the EnSRFs by a periodic resampling of the ensemble through random rotations have been proposed to address it. We introduce a metric to quantify the presence of EC and present evidence to dispel the notion that EC leads to filter failure. Starting from a univariate model, we show that EC is not a permanent but transient phenomenon; it occurs intermittently in non-linear models. We perform a series of data assimilation experiments using a standard EnSRF and a modified EnSRF by a resampling though random rotations. The modified EnSRF thus alleviates issues associated with EC at the cost of traceability of individual ensemble trajectories and cannot use some of algorithms that enhance performance of standard EnSRF. In the non-linear regimes of low-dimensional models, the analysis root mean square error of the standard EnSRF slowly grows with ensemble size if the size is larger than the dimension of the model state. However, we do not observe this problem in a more complex model that uses an ensemble size much smaller than the dimension of the model state, along with inflation and localisation. Overall, we find that transient EC does not handicap the performance of the standard EnSRF.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The K-Means algorithm for cluster analysis is one of the most influential and popular data mining methods. Its straightforward parallel formulation is well suited for distributed memory systems with reliable interconnection networks, such as massively parallel processors and clusters of workstations. However, in large-scale geographically distributed systems the straightforward parallel algorithm can be rendered useless by a single communication failure or high latency in communication paths. The lack of scalable and fault tolerant global communication and synchronisation methods in large-scale systems has hindered the adoption of the K-Means algorithm for applications in large networked systems such as wireless sensor networks, peer-to-peer systems and mobile ad hoc networks. This work proposes a fully distributed K-Means algorithm (EpidemicK-Means) which does not require global communication and is intrinsically fault tolerant. The proposed distributed K-Means algorithm provides a clustering solution which can approximate the solution of an ideal centralised algorithm over the aggregated data as closely as desired. A comparative performance analysis is carried out against the state of the art sampling methods and shows that the proposed method overcomes the limitations of the sampling-based approaches for skewed clusters distributions. The experimental analysis confirms that the proposed algorithm is very accurate and fault tolerant under unreliable network conditions (message loss and node failures) and is suitable for asynchronous networks of very large and extreme scale.