873 resultados para constrained clustering
Resumo:
A computationally efficient agglomerative clustering algorithm based on multilevel theory is presented. Here, the data set is divided randomly into a number of partitions. The samples of each such partition are clustered separately using hierarchical agglomerative clustering algorithm to form sub-clusters. These are merged at higher levels to get the final classification. This algorithm leads to the same classification as that of hierarchical agglomerative clustering algorithm when the clusters are well separated. The advantages of this algorithm are short run time and small storage requirement. It is observed that the savings, in storage space and computation time, increase nonlinearly with the sample size.
Resumo:
K-means algorithm is a well known nonhierarchical method for clustering data. The most important limitations of this algorithm are that: (1) it gives final clusters on the basis of the cluster centroids or the seed points chosen initially, and (2) it is appropriate for data sets having fairly isotropic clusters. But this algorithm has the advantage of low computation and storage requirements. On the other hand, hierarchical agglomerative clustering algorithm, which can cluster nonisotropic (chain-like and concentric) clusters, requires high storage and computation requirements. This paper suggests a new method for selecting the initial seed points, so that theK-means algorithm gives the same results for any input data order. This paper also describes a hybrid clustering algorithm, based on the concepts of multilevel theory, which is nonhierarchical at the first level and hierarchical from second level onwards, to cluster data sets having (i) chain-like clusters and (ii) concentric clusters. It is observed that this hybrid clustering algorithm gives the same results as the hierarchical clustering algorithm, with less computation and storage requirements.
Resumo:
Objective To investigate the epidemic characteristics of human cutaneous anthrax (CA) in China, detect the spatiotemporal clusters at the county level for preemptive public health interventions, and evaluate the differences in the epidemiological characteristics within and outside clusters. Methods CA cases reported during 2005–2012 from the national surveillance system were evaluated at the county level using space-time scan statistic. Comparative analysis of the epidemic characteristics within and outside identified clusters was performed using using the χ2 test or Kruskal-Wallis test. Results The group of 30–39 years had the highest incidence of CA, and the fatality rate increased with age, with persons ≥70 years showing a fatality rate of 4.04%. Seasonality analysis showed that most of CA cases occurred between May/June and September/October of each year. The primary spatiotemporal cluster contained 19 counties from June 2006 to May 2010, and it was mainly located straddling the borders of Sichuan, Gansu, and Qinghai provinces. In these high-risk areas, CA cases were predominantly found among younger, local, males, shepherds, who were living on agriculture and stockbreeding and characterized with high morbidity, low mortality and a shorter period from illness onset to diagnosis. Conclusion CA was geographically and persistently clustered in the Southwestern China during 2005–2012, with notable differences in the epidemic characteristics within and outside spatiotemporal clusters; this demonstrates the necessity for CA interventions such as enhanced surveillance, health education, mandatory and standard decontamination or disinfection procedures to be geographically targeted to the areas identified in this study.
Resumo:
The crush bands that form during plastic deformation of closed-cell metal foams are often inclined at 11-20 degrees to the loading axis, allowing for shear displacement of one part of the foam with respect to the other. Such displacement is prevented by the presence of a lateral constraint. This was analysed in this study, which shows that resistance against shear by the constraint leads to the strain-hardening effect in the foam that has been reported in a recent experimental study. (C) 2009 Acta Materialia Inc. Published by Elsevier Ltd. All rights reserved.
Resumo:
The family of location and scale mixtures of Gaussians has the ability to generate a number of flexible distributional forms. The family nests as particular cases several important asymmetric distributions like the Generalized Hyperbolic distribution. The Generalized Hyperbolic distribution in turn nests many other well known distributions such as the Normal Inverse Gaussian. In a multivariate setting, an extension of the standard location and scale mixture concept is proposed into a so called multiple scaled framework which has the advantage of allowing different tail and skewness behaviours in each dimension with arbitrary correlation between dimensions. Estimation of the parameters is provided via an EM algorithm and extended to cover the case of mixtures of such multiple scaled distributions for application to clustering. Assessments on simulated and real data confirm the gain in degrees of freedom and flexibility in modelling data of varying tail behaviour and directional shape.
Resumo:
We propose a family of multivariate heavy-tailed distributions that allow variable marginal amounts of tailweight. The originality comes from introducing multidimensional instead of univariate scale variables for the mixture of scaled Gaussian family of distributions. In contrast to most existing approaches, the derived distributions can account for a variety of shapes and have a simple tractable form with a closed-form probability density function whatever the dimension. We examine a number of properties of these distributions and illustrate them in the particular case of Pearson type VII and t tails. For these latter cases, we provide maximum likelihood estimation of the parameters and illustrate their modelling flexibility on simulated and real data clustering examples.
Resumo:
New algorithms for the continuous wavelet transform are developed that are easy to apply, each consisting of a single-pass finite impulse response (FIR) filter, and several times faster than the fastest existing algorithms. The single-pass filter, named WT-FIR-1, is made possible by applying constraint equations to least-squares estimation of filter coefficients, which removes the need for separate low-pass and high-pass filters. Non-dyadic two-scale relations are developed and it is shown that filters based on them can work more efficiently than dyadic ones. Example applications to the Mexican hat wavelet are presented.
Resumo:
The concept of feature selection in a nonparametric unsupervised learning environment is practically undeveloped because no true measure for the effectiveness of a feature exists in such an environment. The lack of a feature selection phase preceding the clustering process seriously affects the reliability of such learning. New concepts such as significant features, level of significance of features, and immediate neighborhood are introduced which result in meeting implicitly the need for feature slection in the context of clustering techniques.
Resumo:
Clustering identities in a video is a useful task to aid in video search, annotation and retrieval, and cast identification. However, reliably clustering faces across multiple videos is challenging task due to variations in the appearance of the faces, as videos are captured in an uncontrolled environment. A person's appearance may vary due to session variations including: lighting and background changes, occlusions, changes in expression and make up. In this paper we propose the novel Local Total Variability Modelling (Local TVM) approach to cluster faces across a news video corpus; and incorporate this into a novel two stage video clustering system. We first cluster faces within a single video using colour, spatial and temporal cues; after which we use face track modelling and hierarchical agglomerative clustering to cluster faces across the entire corpus. We compare different face recognition approaches within this framework. Experiments on a news video database show that the Local TVM technique is able effectively model the session variation observed in the data, resulting in improved clustering performance, with much greater computational efficiency than other methods.
Resumo:
This paper presents a method of designing a minimax filter in the presence of large plant uncertainties and constraints on the mean squared values of the estimates. The minimax filtering problem is reformulated in the framework of a deterministic optimal control problem and the method of solution employed, invokes the matrix Minimum Principle. The constrained linear filter and its relation to singular control problems has been illustrated. For the class of problems considered here it is shown that the filter can he constrained separately after carrying out the mini maximization. Numorieal examples are presented to illustrate the results.
Resumo:
The Minimum Description Length (MDL) principle is a general, well-founded theoretical formalization of statistical modeling. The most important notion of MDL is the stochastic complexity, which can be interpreted as the shortest description length of a given sample of data relative to a model class. The exact definition of the stochastic complexity has gone through several evolutionary steps. The latest instantation is based on the so-called Normalized Maximum Likelihood (NML) distribution which has been shown to possess several important theoretical properties. However, the applications of this modern version of the MDL have been quite rare because of computational complexity problems, i.e., for discrete data, the definition of NML involves an exponential sum, and in the case of continuous data, a multi-dimensional integral usually infeasible to evaluate or even approximate accurately. In this doctoral dissertation, we present mathematical techniques for computing NML efficiently for some model families involving discrete data. We also show how these techniques can be used to apply MDL in two practical applications: histogram density estimation and clustering of multi-dimensional data.
Resumo:
Online content services can greatly benefit from personalisation features that enable delivery of content that is suited to each user's specific interests. This thesis presents a system that applies text analysis and user modeling techniques in an online news service for the purpose of personalisation and user interest analysis. The system creates a detailed thematic profile for each content item and observes user's actions towards content items to learn user's preferences. A handcrafted taxonomy of concepts, or ontology, is used in profile formation to extract relevant concepts from the text. User preference learning is automatic and there is no need for explicit preference settings or ratings from the user. Learned user profiles are segmented into interest groups using clustering techniques with the objective of providing a source of information for the service provider. Some theoretical background for chosen techniques is presented while the main focus is in finding practical solutions to some of the current information needs, which are not optimally served with traditional techniques.
Resumo:
The concept of feature selection in a nonparametric unsupervised learning environment is practically undeveloped because no true measure for the effectiveness of a feature exists in such an environment. The lack of a feature selection phase preceding the clustering process seriously affects the reliability of such learning. New concepts such as significant features, level of significance of features, and immediate neighborhood are introduced which result in meeting implicitly the need for feature slection in the context of clustering techniques.
Resumo:
Real-time scheduling algorithms, such as Rate Monotonic and Earliest Deadline First, guarantee that calculations are performed within a pre-defined time. As many real-time systems operate on limited battery power, these algorithms have been enhanced with power-aware properties. In this thesis, 13 power-aware real-time scheduling algorithms for processor, device and system-level use are explored.