21 results for Data clustering. Fuzzy C-Means. Cluster centers initialization. Validation indices


Relevance:

40.00%

Abstract:

Data analytic applications are characterized by large data sets that pass through a series of processing phases. Some of these phases are executed sequentially, but others can be executed concurrently or in parallel on clusters, grids or clouds. The MapReduce programming model has been applied to process large data sets in cluster and cloud environments. Developing an application with MapReduce requires installing, configuring and accessing specific frameworks such as Apache Hadoop or Elastic MapReduce in the Amazon cloud. It would be desirable to provide more flexibility in adjusting such configurations according to the application characteristics. Furthermore, the composition of the multiple phases of a data analytic application requires the specification of all the phases and their orchestration. The original MapReduce model and environment lack flexible support for such configuration and composition. Recognizing that scientific workflows have been successfully applied to modeling complex applications, this paper describes our experiments on implementing MapReduce as subworkflows in the AWARD framework (Autonomic Workflow Activities Reconfigurable and Dynamic). A text mining data analytic application is modeled as a complex workflow with multiple phases, where individual workflow nodes support MapReduce computations. As in typical MapReduce environments, the end user only needs to define the application algorithms for input data processing and for the map and reduce functions. In the paper we present experimental results obtained when using the AWARD framework to execute MapReduce workflows deployed over multiple Amazon EC2 (Elastic Compute Cloud) instances.
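As a rough illustration of what "defining the map and reduce functions" means for a text mining task, the sketch below counts words with a tiny in-memory driver. It is plain Python and does not use the AWARD, Hadoop or Elastic MapReduce APIs; the names `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical and chosen only for this example.

```python
from collections import defaultdict

def map_fn(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce step: combine all values emitted for the same key."""
    return word, sum(counts)

def run_mapreduce(documents, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())

counts = run_mapreduce(["big data big clusters", "data workflows"], map_fn, reduce_fn)
```

In a real deployment the grouping step is what the framework distributes across nodes; only the two user functions would stay the same.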

Relevance:

40.00%

Abstract:

In cluster analysis, it can be useful to interpret the partition built from the data in the light of external categorical variables that are not directly involved in clustering the data. An approach is proposed in the model-based clustering context to select a number of clusters that both fits the data well and takes advantage of the potential illustrative ability of the external variables. This approach makes use of the integrated joint likelihood of the data and the partitions at hand, namely the model-based partition and the partitions associated with the external variables. It is noteworthy that each mixture model is fitted to the data by maximum likelihood, excluding the external variables, which are used only to select a relevant mixture model. Numerical experiments illustrate the promising behaviour of the derived criterion.

Relevance:

40.00%

Abstract:

A new algorithm is proposed for estimating the velocity vector of moving ships using Single Look Complex (SLC) SAR data in strip map acquisition mode. The algorithm exploits both amplitude and phase information of the Doppler decompressed data spectrum, with the aim of estimating both the azimuth antenna pattern and the backscattering coefficient as a function of the look angle. The antenna pattern estimation provides information about the target velocity; the backscattering coefficient can be used for vessel classification. The range velocity is retrieved in the slow time frequency domain by estimating the antenna pattern effects induced by the target motion, while the azimuth velocity is calculated from the estimated range velocity and the ship orientation. Finally, the algorithm is tested on simulated SAR SLC data.

Relevance:

40.00%

Abstract:

The solubilities of two C-tetraalkylcalix[4]resorcinarenes, namely C-tetramethylcalix[4]resorcinarene and C-tetrapentylcalix[4]resorcinarene, in supercritical carbon dioxide (SCCO2) were measured in a flow-type apparatus at temperatures from (313.2 to 333.2) K and at pressures from (12.0 to 35.0) MPa. The C-tetraalkylcalix[4]resorcinarenes were synthesized using our optimized procedure and fully characterized by means of gel permeation chromatography and infrared and nuclear magnetic resonance spectroscopy. The solubilities of the C-tetraalkylcalix[4]resorcinarenes in SCCO2 were determined by analysing the extracts by HPLC with ultraviolet (UV) detection, using a methodology adapted by our team. Four semiempirical density-based models, and the Soave-Redlich-Kwong cubic equation of state (SRK CEoS) with classical mixing rules, were applied to correlate the solubility of the calix[4]resorcinarenes in SCCO2. The physical properties required for the modeling were estimated and reported.

Relevance:

40.00%

Abstract:

The Evidence Accumulation Clustering (EAC) paradigm is a clustering ensemble method that derives a consensus partition from a collection of base clusterings obtained using different algorithms. From the partitions in the ensemble it collects a set of pairwise observations about the co-occurrence of objects in the same cluster, and it uses these co-occurrence statistics to derive a similarity matrix, referred to as the co-association matrix. The Probabilistic Evidence Accumulation for Clustering Ensembles (PEACE) algorithm is a principled approach for extracting a consensus clustering from the observations encoded in the co-association matrix, based on a probabilistic model for that matrix parameterized by the unknown assignments of objects to clusters. In this paper we extend the PEACE algorithm by deriving a consensus solution according to a MAP approach with Dirichlet priors defined for the unknown probabilistic cluster assignments. In particular, we study the positive regularization effect of Dirichlet priors on the final consensus solution with both synthetic and real benchmark data.
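The co-association matrix described above can be sketched in a few lines. This is a minimal illustration that assumes each base clustering is given as a list of integer labels over the same objects; the function name `co_association` and the toy ensemble are hypothetical, and the sketch covers only the pairwise counting step, not the PEACE model or its MAP extension.

```python
import numpy as np

def co_association(partitions):
    """Fraction of base clusterings in which each pair of objects shares a cluster."""
    partitions = np.asarray(partitions)      # shape: (n_partitions, n_objects)
    n_partitions, n_objects = partitions.shape
    counts = np.zeros((n_objects, n_objects))
    for labels in partitions:
        # Boolean matrix: True where objects i and j carry the same label.
        counts += labels[:, None] == labels[None, :]
    return counts / n_partitions

# Toy ensemble: three base clusterings of four objects (made-up labels).
ensemble = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
C = co_association(ensemble)
```

Label values need not match across partitions, since only co-membership within each partition is counted.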

Relevance:

40.00%

Abstract:

In the present paper we compare clustering solutions using indices of paired agreement. We propose a new method, IADJUST, to correct indices of paired agreement by excluding agreement by chance. This new method overcomes limitations known in the literature, as it permits the correction of any index. We illustrate its use in external clustering validation, to measure the accordance between clusters and an a priori known structure. The adjusted indices are intended to provide a realistic measure of clustering performance that excludes agreement by chance with ground truth. We use simulated data sets under a range of scenarios (considering diverse numbers of clusters, cluster overlaps and balances) to discuss the pertinence and the precision of our proposal. Precision is established based on comparisons with the analytical approach for correction; specific indices that can be corrected in this way are used for this purpose. The pertinence of the proposed correction is discussed through a detailed comparison between the performance of two classical clustering approaches, namely the Expectation-Maximization (EM) and K-Means (KM) algorithms. Eight indices of paired agreement are studied and new corrected indices are obtained.
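To make the idea of correcting a paired-agreement index for chance concrete, the sketch below computes the Rand index and applies the generic correction (observed - expected) / (maximum - expected). The abstract does not describe the IADJUST procedure, so this is not that method: estimating the expected value by shuffling one labelling, taking the maximum index value to be 1, and the names `rand_index` and `chance_corrected` are all assumptions made for illustration.

```python
import itertools
import random

def rand_index(a, b):
    """Share of object pairs on which two partitions agree (same/same or different/different)."""
    pairs = list(itertools.combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def chance_corrected(index, a, b, n_perm=500, seed=0):
    """Assumed correction: estimate chance agreement by permuting the labels of b."""
    rng = random.Random(seed)
    observed = index(a, b)
    shuffled = list(b)
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        total += index(a, shuffled)
    expected = total / n_perm
    # Assumes the index has maximum value 1 and that expected < 1.
    return (observed - expected) / (1.0 - expected)

truth = [0, 0, 0, 1, 1, 1]
perfect = chance_corrected(rand_index, truth, truth)
poor = chance_corrected(rand_index, [0, 0, 1, 1], [0, 1, 0, 1])
```

Because the correction takes any index function as an argument, the same wrapper could be applied to other paired-agreement indices, which is the spirit of a correction that is not tied to one index.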