907 resultados para Data clustering
Resumo:
Motivation: The clustering of gene profiles across some experimental conditions of interest contributes significantly to the elucidation of unknown gene function, the validation of gene discoveries and the interpretation of biological processes. However, this clustering problem is not straightforward as the profiles of the genes are not all independently distributed and the expression levels may have been obtained from an experimental design involving replicated arrays. Ignoring the dependence between the gene profiles and the structure of the replicated data can result in important sources of variability in the experiments being overlooked in the analysis, with the consequent possibility of misleading inferences being made. We propose a random-effects model that provides a unified approach to the clustering of genes with correlated expression levels measured in a wide variety of experimental situations. Our model is an extension of the normal mixture model to account for the correlations between the gene profiles and to enable covariate information to be incorporated into the clustering process. Hence the model is applicable to longitudinal studies with or without replication, for example, time-course experiments by using time as a covariate, and to cross-sectional experiments by using categorical covariates to represent the different experimental classes. Results: We show that our random-effects model can be fitted by maximum likelihood via the EM algorithm for which the E(expectation) and M(maximization) steps can be implemented in closed form. Hence our model can be fitted deterministically without the need for time-consuming Monte Carlo approximations. The effectiveness of our model-based procedure for the clustering of correlated gene profiles is demonstrated on three real datasets, representing typical microarray experimental designs, covering time-course, repeated-measurement and cross-sectional data. In these examples, relevant clusters of the genes are obtained, which are supported by existing gene-function annotation. A synthetic dataset is considered too.
Resumo:
Quantile computation has many applications including data mining and financial data analysis. It has been shown that an is an element of-approximate summary can be maintained so that, given a quantile query d (phi, is an element of), the data item at rank [phi N] may be approximately obtained within the rank error precision is an element of N over all N data items in a data stream or in a sliding window. However, scalable online processing of massive continuous quantile queries with different phi and is an element of poses a new challenge because the summary is continuously updated with new arrivals of data items. In this paper, first we aim to dramatically reduce the number of distinct query results by grouping a set of different queries into a cluster so that they can be processed virtually as a single query while the precision requirements from users can be retained. Second, we aim to minimize the total query processing costs. Efficient algorithms are developed to minimize the total number of times for reprocessing clusters and to produce the minimum number of clusters, respectively. The techniques are extended to maintain near-optimal clustering when queries are registered and removed in an arbitrary fashion against whole data streams or sliding windows. In addition to theoretical analysis, our performance study indicates that the proposed techniques are indeed scalable with respect to the number of input queries as well as the number of items and the item arrival rate in a data stream.
Resumo:
We have undertaken two-dimensional gel electrophoresis proteomic profiling on a series of cell lines with different recombinant antibody production rates. Due to the nature of gel-based experiments not all protein spots are detected across all samples in an experiment, and hence datasets are invariably incomplete. New approaches are therefore required for the analysis of such graduated datasets. We approached this problem in two ways. Firstly, we applied a missing value imputation technique to calculate missing data points. Secondly, we combined a singular value decomposition based hierarchical clustering with the expression variability test to identify protein spots whose expression correlates with increased antibody production. The results have shown that while imputation of missing data was a useful method to improve the statistical analysis of such data sets, this was of limited use in differentiating between the samples investigated, and highlighted a small number of candidate proteins for further investigation. (c) 2006 Elsevier B.V. All rights reserved.
Resumo:
Quality of life has been shown to be poor among people living with chronic hepatitis C However, it is not clear how this relates to the presence of symptoms and their severity. The aim of this study was to describe the typology of a broad array of symptoms that were attributed to hepatitis C virus (HCV) infection. Phase I used qualitative methods to identify symptoms. In Phase 2, 188 treatment-naive people living with HCV participated in a quantitative survey. The most prevalent symptom was physical tiredness (86%) followed by irritability (75%), depression (70%), mental tiredness (70%), and abdominal pain (68%). Temporal clustering of symptoms was reported in 62% of participants. Principal components analysis identified four symptom clusters: neuropsychiatric (mental tiredness, poor concentration, forgetfulness, depression, irritability, physical tiredness, and sleep problems); gastrointestinal (day sweats, nausea, food intolerance, night sweats, abdominal pain, poor appetite, and diarrhea); algesic (joint pain, muscle pain, and general body pain); and dysesthetic (noise sensitivity, light sensitivity, skin. problems, and headaches). These data demonstrate that symptoms are prevalent in treatment-naive people with HCV and support the hypothesis that symptom clustering occurs.
Resumo:
This paper considers a model-based approach to the clustering of tissue samples of a very large number of genes from microarray experiments. It is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. Frequently in practice, there are also clinical data available on those cases on which the tissue samples have been obtained. Here we investigate how to use the clinical data in conjunction with the microarray gene expression data to cluster the tissue samples. We propose two mixture model-based approaches in which the number of components in the mixture model corresponds to the number of clusters to be imposed on the tissue samples. One approach specifies the components of the mixture model to be the conditional distributions of the microarray data given the clinical data with the mixing proportions also conditioned on the latter data. Another takes the components of the mixture model to represent the joint distributions of the clinical and microarray data. The approaches are demonstrated on some breast cancer data, as studied recently in van't Veer et al. (2002).
Resumo:
In this paper we present an efficient k-Means clustering algorithm for two dimensional data. The proposed algorithm re-organizes dataset into a form of nested binary tree*. Data items are compared at each node with only two nearest means with respect to each dimension and assigned to the one that has the closer mean. The main intuition of our research is as follows: We build the nested binary tree. Then we scan the data in raster order by in-order traversal of the tree. Lastly we compare data item at each node to the only two nearest means to assign the value to the intendant cluster. In this way we are able to save the computational cost significantly by reducing the number of comparisons with means and also by the least use to Euclidian distance formula. Our results showed that our method can perform clustering operation much faster than the classical ones. © Springer-Verlag Berlin Heidelberg 2005
Resumo:
Non-technical losses (NTL) identification and prediction are important tasks for many utilities. Data from customer information system (CIS) can be used for NTL analysis. However, in order to accurately and efficiently perform NTL analysis, the original data from CIS need to be pre-processed before any detailed NTL analysis can be carried out. In this paper, we propose a feature selection based method for CIS data pre-processing in order to extract the most relevant information for further analysis such as clustering and classifications. By removing irrelevant and redundant features, feature selection is an essential step in data mining process in finding optimal subset of features to improve the quality of result by giving faster time processing, higher accuracy and simpler results with fewer features. Detailed feature selection analysis is presented in the paper. Both time-domain and load shape data are compared based on the accuracy, consistency and statistical dependencies between features.
Resumo:
This paper describes the application of a new technique, rough clustering, to the problem of market segmentation. Rough clustering produces different solutions to k-means analysis because of the possibility of multiple cluster membership of objects. Traditional clustering methods generate extensional descriptions of groups, that show which objects are members of each cluster. Clustering techniques based on rough sets theory generate intensional descriptions, which outline the main characteristics of each cluster. In this study, a rough cluster analysis was conducted on a sample of 437 responses from a larger study of the relationship between shopping orientation (the general predisposition of consumers toward the act of shopping) and intention to purchase products via the Internet. The cluster analysis was based on five measures of shopping orientation: enjoyment, personalization, convenience, loyalty, and price. The rough clusters obtained provide interpretations of different shopping orientations present in the data without the restriction of attempting to fit each object into only one segment. Such descriptions can be an aid to marketers attempting to identify potential segments of consumers.
Resumo:
Relationships between clustering, description length, and regularisation are pointed out, motivating the introduction of a cost function with a description length interpretation and the unusual and useful property of having its minimum approximated by the densest mode of a distribution. A simple inverse kinematics example is used to demonstrate that this property can be used to select and learn one branch of a multi-valued mapping. This property is also used to develop a method for setting regularisation parameters according to the scale on which structure is exhibited in the training data. The regularisation technique is demonstrated on two real data sets, a classification problem and a regression problem.
Resumo:
This study tested three hypotheses: (1) that there is clustering of the neuronal cytoplasmic inclusions (NCI), astrocytic plaques (AP) and ballooned neurons (BN) in corticobasal degeneration (CBD), (2) that the clusters of NCI and BN are not spatially correlated, and (3) that the lesions are correlated with disease ‘stage’. In 50% of the regions, clusters of lesions were 400–800 µm in diameter and regularly distributed parallel to the tissue boundary. Clusters of NCI and BN were larger in laminae II/III and V/VI, respectively. In a third of regions, the clusters of BN and NCI were negatively spatially correlated. Cluster size of the BN in the parahippocampal gyrus (PHG) was positively correlated with disease ‘stage’. The data suggest the following: (1) degeneration of the cortico-cortical pathways in CBD, (2) clusters of NCI and BN may affect different anatomical pathways and (3) BN may develop after the NCI in the PHG.
Resumo:
Correlations between the clustering patterns of the vacuolation ('spongiform change'), prion protein (PrP) deposits, and surviving neurons were studied in the cerebral cortex, hippocampus, and cerebellum in 11 cases of sporadic Creutzfeldt-Jakob disease (sCJD). Differences in the sizes of the clusters of vacuoles were observed between brain regions and in the cerebral cortex, between the upper and lower laminae. With the exception of the parietal cortex, mean cluster size of the vacuoles was similar to that of the PrP deposits in each brain area. Clusters of the vacuoles were spatially correlated with the density of surviving neurons and with the clusters of PrP deposits in 47% and 53% of cortical areas analysed respectively but there were few spatial correlation between the PrP deposits and the density of surviving neurons. The data suggest that the pathology of sCJD may spread through the brain via specific anatomical pathways. Development of the clusters of vacuoles is spatially related to surviving neurons while the appearance of clusters of PrP deposits is related to the development of the vacuolation.
Resumo:
Clustering techniques such as k-means and hierarchical clustering are commonly used to analyze DNA microarray derived gene expression data. However, the interactions between processes underlying the cell activity suggest that the complexity of the microarray data structure may not be fully represented with discrete clustering methods.
Resumo:
The best way of finding “natural groups” in management research remains subject to debate and within the literature there is no accepted consensus. The principle motivation behind this study is to explore the effect of choices of method upon strategic group research, an area that has suffered enduring criticism, as we believe that these method choices are still not fully exploited. Our study is novel in the use of a variety of more robust clustering and validation techniques, rarely used in management research, some borrowed from the natural sciences, which may provide a useful and more robust base for this type of research. Our results confirm that methods do exist to address the concerns over strategic group research and adoption of our chosen methods will improve the quality of management research.
Resumo:
The clustering pattern of diffuse, primitive and classic β-amyloid (Aβ) deposits was studied in the upper laminae of the frontal cortex of 9 patients with sporadic Alzheimer's disease (AD). Aβ stained tissue was counterstained with collagen type IV antiserum to determine whether the clusters of Aβ deposits were related to blood vessels. In all patients, Aβ deposits and blood vessels were clustered, with in many patients, a regular periodicity of clusters along the cortex parallel to the pia. The classic Aβ deposit clusters coincided with those of the larger blood vessels in all patients and with clusters of smaller blood vessels in 4 patients. Diffuse deposit clusters were related to blood vessels in 3 patients. Primitive deposit clusters were either unrelated to or negatively correlated with the blood vessels in six patients. Hence, Aβ deposit subtypes differ in their relationship to blood vessels. The data suggest a direct and specific role for the larger blood vessels in the formation of amyloid cores in AD. © 1995.
Resumo:
The spatial distribution patterns of the diffuse, primitive, and classic beta-amyloid (Abeta) deposits were studied in areas of the medial temporal lobe in 12 cases of Down's Syndrome (DS) 35 to 67 years of age. Large clusters of diffuse deposits were present in the youngest patients; cluster size then declined with patient age but increased again in the oldest patients. By contrast, the cluster sizes of the primitive and classic deposits increased with age to a maximum in patients 45 to 55 and 60 years of age respectively and declined in size in the oldest patients. In the parahippocampal gyrus (PHG), the clusters of the primitive deposits were most highly clustered in cases of intermediate age. The data suggest a developmental sequence in DS in which Abeta is deposited initially in the form of large clusters of diffuse deposits that are then gradually replaced by clusters of primitive and classic deposits. The oldest patients were an exception to this sequence in that the pattern of clustering resembled that of the youngest patients.