974 resultados para Cluster Analysis


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Cluster analysis via a finite mixture model approach is considered. With this approach to clustering, the data can be partitioned into a specified number of clusters g by first fitting a mixture model with g components. An outright clustering of the data is then obtained by assigning an observation to the component to which it has the highest estimated posterior probability of belonging; that is, the ith cluster consists of those observations assigned to the ith component (i = 1,..., g). The focus is on the use of mixtures of normal components for the cluster analysis of data that can be regarded as being continuous. But attention is also given to the case of mixed data, where the observations consist of both continuous and discrete variables.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Normal mixture models are often used to cluster continuous data. However, conventional approaches for fitting these models will have problems in producing nonsingular estimates of the component-covariance matrices when the dimension of the observations is large relative to the number of observations. In this case, methods such as principal components analysis (PCA) and the mixture of factor analyzers model can be adopted to avoid these estimation problems. We examine these approaches applied to the Cabernet wine data set of Ashenfelter (1999), considering the clustering of both the wines and the judges, and comparing our results with another analysis. The mixture of factor analyzers model proves particularly effective in clustering the wines, accurately classifying many of the wines by location.

Relevância:

100.00% 100.00%

Publicador:

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper considers a model-based approach to the clustering of tissue samples of a very large number of genes from microarray experiments. It is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. Frequently in practice, there are also clinical data available on those cases on which the tissue samples have been obtained. Here we investigate how to use the clinical data in conjunction with the microarray gene expression data to cluster the tissue samples. We propose two mixture model-based approaches in which the number of components in the mixture model corresponds to the number of clusters to be imposed on the tissue samples. One approach specifies the components of the mixture model to be the conditional distributions of the microarray data given the clinical data with the mixing proportions also conditioned on the latter data. Another takes the components of the mixture model to represent the joint distributions of the clinical and microarray data. The approaches are demonstrated on some breast cancer data, as studied recently in van't Veer et al. (2002).

Relevância:

100.00% 100.00%

Publicador:

Resumo:

We describe a network module detection approach which combines a rapid and robust clustering algorithm with an objective measure of the coherence of the modules identified. The approach is applied to the network of genetic regulatory interactions surrounding the tumor suppressor gene p53. This algorithm identifies ten clusters in the p53 network, which are visually coherent and biologically plausible.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Finite mixture models are being increasingly used to model the distributions of a wide variety of random phenomena. While normal mixture models are often used to cluster data sets of continuous multivariate data, a more robust clustering can be obtained by considering the t mixture model-based approach. Mixtures of factor analyzers enable model-based density estimation to be undertaken for high-dimensional data where the number of observations n is very large relative to their dimension p. As the approach using the multivariate normal family of distributions is sensitive to outliers, it is more robust to adopt the multivariate t family for the component error and factor distributions. The computational aspects associated with robustness and high dimensionality in these approaches to cluster analysis are discussed and illustrated.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper describes the application of a new technique, rough clustering, to the problem of market segmentation. Rough clustering produces different solutions to k-means analysis because of the possibility of multiple cluster membership of objects. Traditional clustering methods generate extensional descriptions of groups, that show which objects are members of each cluster. Clustering techniques based on rough sets theory generate intensional descriptions, which outline the main characteristics of each cluster. In this study, a rough cluster analysis was conducted on a sample of 437 responses from a larger study of the relationship between shopping orientation (the general predisposition of consumers toward the act of shopping) and intention to purchase products via the Internet. The cluster analysis was based on five measures of shopping orientation: enjoyment, personalization, convenience, loyalty, and price. The rough clusters obtained provide interpretations of different shopping orientations present in the data without the restriction of attempting to fit each object into only one segment. Such descriptions can be an aid to marketers attempting to identify potential segments of consumers.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A culster analysis was performed on 78 cases of Alzheimer's disease (AD) to identify possible pathological subtypes of the disease. Data on 47 neuropathological variables, inculding features of the gross brain and the density and distribution of senile plaques (SP) and neurofibrillary tangles (NFT) were used to describe each case. Cluster analysis is a multivariate statistical method which combines together in groups, AD cases with the most similar neuropathological characteristics. The majority of cases (83%) were clustered into five such groups. The analysis suggested that an initial division of the 78 cases could be made into two major groups: (1) a large group (68%) in which the distribution of SP and NFT was restricted to a relatively small number of brain regions, and (2) a smaller group (15%) in which the lesions were more widely disseminated throughout the neocortex. Each of these groups could be subdivided on the degree of capillary amyloid angiopathy (CAA) present. In addition, those cases with a restricted development of SP/NFT and CAA could be divided further into an early and a late onset form. Familial AD cases did not cluster as a separate group but were either distributed between four of the five groups or were cases with unique combinations of pathological features not closely related to any of the groups. It was concluded that multivariate statistical methods may be of value in the classification of AD into subtypes. © 1994 Springer-Verlag.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Two contrasting multivariate statistical methods, viz., principal components analysis (PCA) and cluster analysis were applied to the study of neuropathological variations between cases of Alzheimer's disease (AD). To compare the two methods, 78 cases of AD were analyzed, each characterised by measurements of 47 neuropathological variables. Both methods of analysis revealed significant variations between AD cases. These variations were related primarily to differences in the distribution and abundance of senile plaques (SP) and neurofibrillary tangles (NFT) in the brain. Cluster analysis classified the majority of AD cases into five groups which could represent subtypes of AD. However, PCA suggested that variation between cases was more continuous with no distinct subtypes. Hence, PCA may be a more appropriate method than cluster analysis in the study of neuropathological variations between AD cases.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This thesis seeks to describe the development of an inexpensive and efficient clustering technique for multivariate data analysis. The technique starts from a multivariate data matrix and ends with graphical representation of the data and pattern recognition discriminant function. The technique also results in distances frequency distribution that might be useful in detecting clustering in the data or for the estimation of parameters useful in the discrimination between the different populations in the data. The technique can also be used in feature selection. The technique is essentially for the discovery of data structure by revealing the component parts of the data. lhe thesis offers three distinct contributions for cluster analysis and pattern recognition techniques. The first contribution is the introduction of transformation function in the technique of nonlinear mapping. The second contribution is the us~ of distances frequency distribution instead of distances time-sequence in nonlinear mapping, The third contribution is the formulation of a new generalised and normalised error function together with its optimal step size formula for gradient method minimisation. The thesis consists of five chapters. The first chapter is the introduction. The second chapter describes multidimensional scaling as an origin of nonlinear mapping technique. The third chapter describes the first developing step in the technique of nonlinear mapping that is the introduction of "transformation function". The fourth chapter describes the second developing step of the nonlinear mapping technique. This is the use of distances frequency distribution instead of distances time-sequence. The chapter also includes the new generalised and normalised error function formulation. Finally, the fifth chapter, the conclusion, evaluates all developments and proposes a new program. for cluster analysis and pattern recognition by integrating all the new features.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

PURPOSE: Two common approaches to identify subgroups of patients with bipolar disorder are clustering methodology (mixture analysis) based on the age of onset, and a birth cohort analysis. This study investigates if a birth cohort effect will influence the results of clustering on the age of onset, using a large, international database. METHODS: The database includes 4037 patients with a diagnosis of bipolar I disorder, previously collected at 36 collection sites in 23 countries. Generalized estimating equations (GEE) were used to adjust the data for country median age, and in some models, birth cohort. Model-based clustering (mixture analysis) was then performed on the age of onset data using the residuals. Clinical variables in subgroups were compared. RESULTS: There was a strong birth cohort effect. Without adjusting for the birth cohort, three subgroups were found by clustering. After adjusting for the birth cohort or when considering only those born after 1959, two subgroups were found. With results of either two or three subgroups, the youngest subgroup was more likely to have a family history of mood disorders and a first episode with depressed polarity. However, without adjusting for birth cohort (three subgroups), family history and polarity of the first episode could not be distinguished between the middle and oldest subgroups. CONCLUSION: These results using international data confirm prior findings using single country data, that there are subgroups of bipolar I disorder based on the age of onset, and that there is a birth cohort effect. Including the birth cohort adjustment altered the number and characteristics of subgroups detected when clustering by age of onset. Further investigation is needed to determine if combining both approaches will identify subgroups that are more useful for research.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The paper treats the task for cluster analysis of a given assembly of objects on the basis of the information contained in the description table of these objects. Various methods of cluster analysis are briefly considered. Heuristic method and rules for classification of the given assembly of objects are presented for the cases when their division into classes and the number of classes is not known. The algorithm is checked by a test example and two program products (PP) – learning systems and software for company management. Analysis of the results is presented.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This study subdivides the Weddell Sea, Antarctica, into seafloor regions using multivariate statistical methods. These regions are categories used for comparing, contrasting and quantifying biogeochemical processes and biodiversity between ocean regions geographically but also regions under development within the scope of global change. The division obtained is characterized by the dominating components and interpreted in terms of ruling environmental conditions. The analysis uses 28 environmental variables for the sea surface, 25 variables for the seabed and 9 variables for the analysis between surface and bottom variables. The data were taken during the years 1983-2013. Some data were interpolated. The statistical errors of several interpolation methods (e.g. IDW, Indicator, Ordinary and Co-Kriging) with changing settings have been compared for the identification of the most reasonable method. The multivariate mathematical procedures used are regionalized classification via k means cluster analysis, canonical-correlation analysis and multidimensional scaling. Canonical-correlation analysis identifies the influencing factors in the different parts of the cove. Several methods for the identification of the optimum number of clusters have been tested. For the seabed 8 and 12 clusters were identified as reasonable numbers for clustering the Weddell Sea. For the sea surface the numbers 8 and 13 and for the top/bottom analysis 8 and 3 were identified, respectively. Additionally, the results of 20 clusters are presented for the three alternatives offering the first small scale environmental regionalization of the Weddell Sea. Especially the results of 12 clusters identify marine-influenced regions which can be clearly separated from those determined by the geological catchment area and the ones dominated by river discharge.