765 results for Grouping, clustering, campi, associazione
Abstract:
Chebyshev-inequality-based convex relaxations of Chance-Constrained Programs (CCPs) are shown to be useful for learning classifiers on massive datasets. In particular, an algorithm that integrates efficient clustering procedures and CCP approaches for computing classifiers on large datasets is proposed. The key idea is to identify high density regions or clusters from individual class conditional densities and then use a CCP formulation to learn a classifier on the clusters. The CCP formulation ensures that most of the data points in a cluster are correctly classified by employing a Chebyshev-inequality-based convex relaxation. This relaxation is heavily dependent on the second-order statistics. However, this formulation and in general such relaxations that depend on the second-order moments are susceptible to moment estimation errors. One of the contributions of the paper is to propose several formulations that are robust to such errors. In particular, a generic way of making such formulations robust to moment estimation errors is illustrated using two novel confidence sets. An important contribution is to show that when either of the confidence sets is employed, for the special case of a spherical normal distribution of clusters, the robust variant of the formulation can be posed as a second-order cone program. Empirical results show that the robust formulations achieve accuracies comparable to those obtained with true moments, even when moment estimates are erroneous. Results also illustrate the benefits of employing the proposed methodology for robust classification of large-scale datasets.
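The Chebyshev-based relaxation mentioned above can be illustrated with a small sketch. It assumes the standard one-sided Chebyshev bound: a cluster with mean mu and covariance Sigma lies on the correct side of the hyperplane w.x = b with probability at least eta, for every distribution with those moments, whenever label * (w.mu - b) >= kappa * sqrt(w' Sigma w) with kappa = sqrt(eta / (1 - eta)). The function names and the parameters `eta` and `label` are illustrative and not taken from the paper, and the robust confidence-set variants are not shown.

```python
import numpy as np

def chebyshev_kappa(eta):
    """Multiplier kappa = sqrt(eta / (1 - eta)) from the one-sided Chebyshev bound."""
    if not 0.0 <= eta < 1.0:
        raise ValueError("eta must lie in [0, 1)")
    return np.sqrt(eta / (1.0 - eta))

def cluster_constraint_holds(w, b, mean, cov, eta, label=1):
    """Check the second-order cone constraint for one cluster.

    The constraint is label * (w.mean - b) >= kappa * sqrt(w' cov w).
    For a spherical cluster (cov = sigma^2 * I) the right-hand side
    reduces to kappa * sigma * ||w||.
    """
    w = np.asarray(w, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    margin = label * (w @ mean - b)
    spread = np.sqrt(w @ cov @ w)
    return bool(margin >= chebyshev_kappa(eta) * spread)
```

With `eta = 0.8` the multiplier is 2, so a unit-variance spherical cluster must have its mean at least two units of margin away from the hyperplane (for a unit-norm `w`) to satisfy the constraint.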
Abstract:
Learning from Positive and Unlabelled examples (LPU) has emerged as an important problem in data mining and information retrieval applications. Existing techniques are not ideally suited for real world scenarios where the datasets are linearly inseparable, as they either build linear classifiers or the non-linear classifiers fail to achieve the desired performance. In this work, we propose to extend maximum margin clustering ideas and present an iterative procedure to design a non-linear classifier for LPU. In particular, we build a least squares support vector classifier, suitable for handling this problem due to symmetry of its loss function. Further, we present techniques for appropriately initializing the labels of unlabelled examples and for enforcing the ratio of positive to negative examples while obtaining these labels. Experiments on real-world datasets demonstrate that the non-linear classifier designed using the proposed approach gives significantly better generalization performance than the existing relevant approaches for LPU.
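As a rough illustration of the building block named above, the following sketch trains a least squares support vector classifier with an RBF kernel by solving the usual LS-SVM linear system. It does not reproduce the paper's iterative label-assignment procedure or its class-ratio constraint; the function names and the parameters `gamma` (regularization) and `sigma` (kernel width) are assumptions made for the example.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    """Solve [[0, y'], [y, Omega + I/gamma]] [b; alpha] = [0; 1]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    omega = np.outer(y, y) * rbf_kernel(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    solution = np.linalg.solve(A, rhs)
    return solution[0], solution[1:]  # bias b, dual weights alpha

def lssvm_predict(X_train, y_train, b, alpha, X_new, sigma=1.0):
    """Sign of the decision function sum_i alpha_i y_i K(x, x_i) + b."""
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    scores = rbf_kernel(X_new, X_train, sigma) @ (alpha * y_train) + b
    return np.sign(scores)
```

Because the squared-error loss penalizes deviations on both sides of the target equally, training reduces to a single linear solve, which is convenient inside an iterative relabelling loop of the kind the abstract describes.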
Abstract:
Data clustering groups data so that similar items fall in the same group and dissimilar items fall in different groups. Because clustering is generally a subjective activity, different clusterings of the same data are possible depending on the need. This paper attempts to find the best clustering of the data by first carrying out feature selection and using only the selected features for clustering. Particle Swarm Optimization (PSO) has been used for clustering, with feature selection carried out simultaneously. The performance of the proposed algorithm is evaluated on some benchmark data sets. The experimental results show that the proposed methodology outperforms previous approaches such as basic PSO and K-means for the clustering problem.
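For orientation, a basic PSO clustering loop of the kind used as a baseline above might look like the following sketch. Each particle encodes k candidate centroids and is scored by the within-cluster sum of squared distances. The simultaneous feature selection proposed in the paper is not specified in the abstract and is omitted here; the inertia and acceleration values (`w`, `c1`, `c2`) are common textbook defaults and are assumptions.

```python
import numpy as np

def sse(X, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def pso_cluster(X, k, n_particles=20, n_iter=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Each particle starts from k distinct data points used as centroids.
    pos = np.stack([X[rng.choice(n, size=k, replace=False)] for _ in range(n_particles)])
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_fit = np.array([sse(X, p) for p in pos])
    g = pbest_fit.argmin()
    gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    history = [gbest_fit]
    for _ in range(n_iter):
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        # Velocity mixes inertia, pull towards personal best and pull towards global best.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        fit = np.array([sse(X, p) for p in pos])
        improved = fit < pbest_fit
        pbest[improved] = pos[improved]
        pbest_fit[improved] = fit[improved]
        g = pbest_fit.argmin()
        if pbest_fit[g] < gbest_fit:
            gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
        history.append(gbest_fit)
    return gbest, history
```

The returned `history` records the best fitness after each iteration; it never increases, because the global best is replaced only by a strictly better particle.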
Abstract:
Regionalization approaches are widely used in water resources engineering to identify hydrologically homogeneous groups of watersheds that are referred to as regions. Pooled information from sites (depicting watersheds) in a region forms the basis to estimate quantiles associated with hydrological extreme events at ungauged/sparsely gauged sites in the region. Conventional regionalization approaches can be effective when watersheds (data points) corresponding to different regions can be separated using straight lines or linear planes in the space of watershed related attributes. In this paper, a kernel-based Fuzzy c-means (KFCM) clustering approach is presented for use in situations where such linear separation of regions cannot be accomplished. The approach uses kernel-based functions to map the data points from the attribute space to a higher-dimensional space where they can be separated into regions by linear planes. A procedure to determine the optimal number of regions with the KFCM approach is suggested. Further, formulations to estimate flood quantiles at ungauged sites with the approach are developed. Effectiveness of the approach is demonstrated through Monte-Carlo simulation experiments and a case study on watersheds in the United States. Comparison of results with those based on conventional Fuzzy c-means clustering, the Region-of-influence approach and a prior study indicates that the KFCM approach outperforms the other approaches in forming regions that are closer to being statistically homogeneous and in estimating flood quantiles at ungauged sites.
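To make the kernel-based membership idea concrete, here is a minimal sketch of the fuzzy membership update in one common KFCM variant, assuming a Gaussian kernel and cluster prototypes kept in the attribute space. In that setting the squared feature-space distance between a point x and a prototype v is 2 * (1 - K(x, v)). The abstract does not state which KFCM variant the paper uses, so the kernel choice, the fuzzifier `m` and the width `sigma` are assumptions.

```python
import numpy as np

def gaussian_kernel(X, V, sigma=1.0):
    """Kernel values K(x_k, v_i), returned with shape (n_points, n_clusters)."""
    sq = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kfcm_memberships(X, V, m=2.0, sigma=1.0, eps=1e-12):
    """Fuzzy memberships of each point in each cluster.

    Memberships are proportional to dist2 ** (-1 / (m - 1)), where
    dist2 = 2 * (1 - K(x, v)) is the squared distance in the kernel-induced
    feature space, and each row is normalized to sum to 1.
    """
    X = np.asarray(X, dtype=float)
    V = np.asarray(V, dtype=float)
    dist2 = np.maximum(2.0 * (1.0 - gaussian_kernel(X, V, sigma)), eps)
    inv = dist2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)
```

Each row of the result gives one watershed's degree of belonging to every region, which is what allows a site to share information with more than one region.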
Abstract:
The complexity in visualizing volumetric data often limits the scope of direct exploration of scalar fields. Isocontour extraction is a popular method for exploring scalar fields because of its simplicity in presenting features in the data. In this paper, we present a novel representation of contours with the aim of studying the similarity relationship between the contours. The representation maps contours to points in a high-dimensional transformation-invariant descriptor space. We leverage the power of this representation to design a clustering based algorithm for detecting symmetric regions in a scalar field. Symmetry detection is a challenging problem because it demands both segmentation of the data and identification of transformation invariant segments. While the former task can be addressed using topological analysis of scalar fields, the latter requires geometry based solutions. Our approach combines the two by utilizing the contour tree for segmenting the data and the descriptor space for determining transformation invariance. We discuss two applications, query driven exploration and asymmetry visualization, that demonstrate the effectiveness of the approach.
Abstract:
Pure alpha-Al2O3 exhibits a very high degree of thermodynamical stability among all metal oxides and forms an inert oxide scale in a range of structural alloys at high temperatures. We report that amorphous Al2O3 thin films sputter deposited over crystalline Si instead show a surprisingly active interface. On annealing, crystallization begins with nuclei of a phase closely resembling gamma-Alumina forming almost randomly in an amorphous matrix, and with increasing frequency near the substrate/film interface. This nucleation is marked by the signature appearance of sharp (400) and (440) reflections and the formation of a diffuse diffraction halo with an outer maximal radius of approximately 0.23 nm enveloping the direct beam. The microstructure then evolves by a cluster-coalescence growth mechanism suggestive of swift nucleation and sluggish diffusional kinetics, while locally the Al ions redistribute slowly from chemisorbed and tetrahedral sites to higher anion coordinated sites. Chemical state plots constructed from XPS data and simple calculations of the diffraction patterns from hypothetically distorted lattices suggest that the true origins of the diffuse diffraction halo are probably related to a complex change in the electronic structure spurred by the a-gamma transformation rather than pure structural disorder. Concurrent with crystallization within the film, a substantially thick interfacial reaction zone also builds up at the film/substrate interface with the excess Al acting as a cationic source. © 2015 AIP Publishing LLC.
Abstract:
We propose a new approach to clustering. Our idea is to map cluster formation to coalition formation in cooperative games, and to use the Shapley value of the patterns to identify clusters and cluster representatives. We show that the underlying game is convex and this leads to an efficient biobjective clustering algorithm that we call BiGC. The algorithm yields high-quality clustering with respect to average point-to-center distance (potential) as well as average intracluster point-to-point distance (scatter). We demonstrate the superiority of BiGC over state-of-the-art clustering algorithms (including the center based and the multiobjective techniques) through a detailed experimentation using standard cluster validity criteria on several benchmark data sets. We also show that BiGC satisfies key clustering properties such as order independence, scale invariance, and richness.
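The Shapley value at the centre of this approach can be computed exactly for a small example. The sketch below assumes a hypothetical characteristic function in which a coalition's worth is the sum of pairwise similarities among its members; with non-negative similarities such a game is convex, but the abstract does not give the paper's own characteristic function, so this is only an illustration. Averaging marginal contributions over all orderings is exponential in the number of points and is meant for tiny inputs only.

```python
from itertools import permutations
import numpy as np

def coalition_value(members, sim):
    """Worth of a coalition: sum of similarities over unordered pairs of members."""
    members = list(members)
    return sum(sim[i, j] for a, i in enumerate(members) for j in members[a + 1:])

def shapley_values(sim):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    phi = np.zeros(n)
    orderings = list(permutations(range(n)))
    for order in orderings:
        coalition = []
        for player in order:
            before = coalition_value(coalition, sim)
            coalition.append(player)
            phi[player] += coalition_value(coalition, sim) - before
    return phi / len(orderings)
```

For this pairwise game the Shapley value of a point equals half the sum of its similarities to all other points, so points in dense regions score highest and are natural candidates for cluster representatives.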
Abstract:
Clustering techniques which can handle incomplete data have become increasingly important due to varied applications in marketing research, medical diagnosis and survey data analysis. Existing techniques cope with missing values either by using data modification/imputation or by partial distance computation, which is often unreliable depending on the number of features available. In this paper, we propose a novel approach for clustering data with missing values, which performs the task by Symmetric Non-Negative Matrix Factorization (SNMF) of a complete pair-wise similarity matrix, computed from the given incomplete data. To accomplish this, we define a novel similarity measure based on the Average Overlap similarity metric, which can effectively handle missing values without modification of data. Further, the similarity measure is more reliable than partial distances and inherently possesses the properties required to perform SNMF. The experimental evaluation on real world datasets demonstrates that the proposed approach is efficient, scalable and shows significantly better performance compared to the existing techniques.
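The factorization step can be sketched as follows, assuming a non-negative symmetric similarity matrix A is already available. The paper's Average Overlap based similarity is not defined in the abstract and is not reproduced here. The update used is a commonly cited damped multiplicative rule for A ≈ H H^T with damping `beta = 0.5`; it is an assumption that the paper uses the same rule.

```python
import numpy as np

def symnmf(A, k, n_iter=300, beta=0.5, seed=0, eps=1e-12):
    """Approximate a non-negative symmetric matrix A by H @ H.T with H >= 0."""
    A = np.asarray(A, dtype=float)
    rng = np.random.default_rng(seed)
    # Random non-negative start, scaled so H @ H.T is of the same order as A.
    H = rng.random((A.shape[0], k)) * np.sqrt(A.mean() / k)
    for _ in range(n_iter):
        numer = A @ H
        denom = H @ (H.T @ H) + eps  # eps guards against division by zero
        H = H * (1.0 - beta + beta * numer / denom)
    return H

def reconstruction_error(A, H):
    """Frobenius norm of the residual A - H @ H.T."""
    return np.linalg.norm(np.asarray(A, dtype=float) - H @ H.T)
```

A cluster label for each point can then be read off as the index of the largest entry in the corresponding row of `H`, for example with `H.argmax(axis=1)`.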
Abstract:
Motivated by multi-distribution divergences, which originate in information theory, we propose a notion of 'multipoint' kernels, and study their applications. We study a class of kernels based on Jensen type divergences and show that these can be extended to measure similarity among multiple points. We study tensor flattening methods and develop a multi-point (kernel) spectral clustering (MSC) method. We further emphasize a special case of the proposed kernels, which is a multi-point extension of the linear (dot-product) kernel, and show the existence of a cubic-time tensor flattening algorithm in this case. Finally, we illustrate the usefulness of our contributions using standard data sets and image segmentation tasks.
Abstract:
Homogeneous temperature regions are necessary for use in hydrometeorological studies. The regions are often delineated by analysing statistics derived from time series of maximum, minimum or mean temperature, rather than attributes influencing temperature. This practice cannot yield meaningful regions in data-sparse areas. Further, independent validation of the delineated regions for homogeneity in temperature is not possible, as temperature records form the basis to arrive at the regions. To address these issues, a two-stage clustering approach is proposed in this study to delineate homogeneous temperature regions. The first stage of the approach involves (1) determining the correlation structure between observed temperature over the study area and possible predictors (large-scale atmospheric variables) influencing the temperature and (2) using the correlation structure as the basis to delineate sites in the study area into clusters. The second stage of the approach involves analysis of each of the clusters to (1) identify potential predictors (large-scale atmospheric variables) influencing temperature at sites in the cluster and (2) partition the cluster into homogeneous fuzzy temperature regions using the identified potential predictors. Application of the proposed approach to India yielded 28 homogeneous regions that were demonstrated to be effective when compared to an alternate set of 6 regions that were previously delineated over the study area. Intersite cross-correlations of monthly maximum and minimum temperatures in the existing regions were found to be weak and negative for several months, which is undesirable. This problem was not found in the case of regions delineated using the proposed approach. The utility of the proposed regions in arriving at estimates of potential evapotranspiration for ungauged locations in the study area is demonstrated.
Abstract:
This article analyzes the relationship between the spatial clustering of the income distribution and inequality in the provinces of Argentina. The aim of this work is to use spatial techniques to analyze the extent to which the spatial clustering of the income distribution affects the inequality of the income distribution in a regional context in Argentina. In general, the inequality literature implicitly treats each region or province as an independent entity, and the potential for observing interaction across space has often been ignored. Meanwhile, spatial autocorrelation occurs when the spatial distribution of the variable of interest exhibits a systematic pattern. I compute three measures of global spatial autocorrelation (Moran's I, Geary's c, and Getis and Ord's G) as the degree of provincial clustering between 1991 and 2002. The main conclusion of the work is that there is evidence that provinces with relatively high (low) inequality tend to be located near other provinces with high (low) inequality more often than would be expected by chance. Therefore, each province should not be viewed as an independent observation, as has been implicitly assumed in previous studies of regional income inequality.
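Two of the global autocorrelation measures named above have simple closed forms, sketched here under the assumption of a user-supplied spatial weights matrix W (for example, 1 for neighbouring provinces and 0 otherwise). The function names and the toy data in the usage note are illustrative; the Getis and Ord G statistic and the significance tests used in the article are not shown.

```python
import numpy as np

def morans_i(x, W):
    """Moran's I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2, with z = x - mean(x)."""
    x = np.asarray(x, dtype=float)
    W = np.asarray(W, dtype=float)
    z = x - x.mean()
    return (len(x) / W.sum()) * (z @ W @ z) / (z @ z)

def gearys_c(x, W):
    """Geary's c = ((n - 1) / (2 * S0)) * sum_ij w_ij (x_i - x_j)^2 / sum_i z_i^2."""
    x = np.asarray(x, dtype=float)
    W = np.asarray(W, dtype=float)
    z = x - x.mean()
    diff2 = (x[:, None] - x[None, :]) ** 2
    return ((len(x) - 1) / (2.0 * W.sum())) * (W * diff2).sum() / (z @ z)
```

For four regions arranged in a line, with each region neighbouring the next, a smoothly increasing variable such as `[1, 2, 3, 4]` gives a positive Moran's I and a Geary's c below 1, which is the signature of similar values clustering together. An alternating pattern such as `[1, 4, 1, 4]` gives a negative Moran's I.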