148 results for Clustering analysis
Abstract:
Motivation: The clustering of gene profiles across some experimental conditions of interest contributes significantly to the elucidation of unknown gene function, the validation of gene discoveries and the interpretation of biological processes. However, this clustering problem is not straightforward as the profiles of the genes are not all independently distributed and the expression levels may have been obtained from an experimental design involving replicated arrays. Ignoring the dependence between the gene profiles and the structure of the replicated data can result in important sources of variability in the experiments being overlooked in the analysis, with the consequent possibility of misleading inferences being made. We propose a random-effects model that provides a unified approach to the clustering of genes with correlated expression levels measured in a wide variety of experimental situations. Our model is an extension of the normal mixture model to account for the correlations between the gene profiles and to enable covariate information to be incorporated into the clustering process. Hence the model is applicable to longitudinal studies with or without replication, for example, time-course experiments by using time as a covariate, and to cross-sectional experiments by using categorical covariates to represent the different experimental classes. Results: We show that our random-effects model can be fitted by maximum likelihood via the EM algorithm for which the E (expectation) and M (maximization) steps can be implemented in closed form. Hence our model can be fitted deterministically without the need for time-consuming Monte Carlo approximations. The effectiveness of our model-based procedure for the clustering of correlated gene profiles is demonstrated on three real datasets, representing typical microarray experimental designs, covering time-course, repeated-measurement and cross-sectional data. In these examples, relevant clusters of the genes are obtained, which are supported by existing gene-function annotation. A synthetic dataset is also considered.
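As a rough illustration of what closed-form E and M steps look like, the sketch below fits an ordinary two-component univariate normal mixture by EM. It is not the random-effects model described in the abstract (there are no random effects, covariates or replicate structure here); the function names, the toy data and the starting values are all assumptions made for the example.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_two_normals(data, means, variances, weights, n_iter=50):
    """Fit a two-component univariate normal mixture by EM (illustrative only)."""
    means, variances, weights = list(means), list(variances), list(weights)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: closed-form weighted updates of proportions, means, variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # floor guards against a collapsing component
    # Hard cluster labels from the final posterior probabilities.
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return means, variances, weights, labels


# Hypothetical toy data: two well-separated groups around 1 and 5.
data = [0.9, 1.1, 1.0, 1.2, 0.8, 5.1, 4.9, 5.0, 5.2, 4.8]
means, variances, weights, labels = em_two_normals(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
```

Both steps are plain sums and ratios, so no sampling is needed; this is the sense in which such a fit is deterministic rather than Monte Carlo based.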
Abstract:
Context: The relationships among the different eating disorders that exist in the community are poorly understood, especially for residual disorders in which bingeing or purging occurs in the absence of other behaviors. Objective: To examine a community sample for the number of mutually exclusive weight and eating profiles. Design: Data regarding lifetime eating disorder symptoms and weight range were submitted to a latent profile analysis. Profiles were compared regarding personality, current eating and weight, retrospectively reported life events, and lifetime depressive psychopathology. Setting: Longitudinal study among female twins from the Australian Twin Registry in whom eating was assessed by a telephone interview. Participants: A community sample of 1002 twins (individuals) who had participated in earlier waves of data collection. Main Outcome Measures: Number and clinical character of latent profiles. Results: The best fit was a 5-profile solution with women who were (1) of normal weight with few lifetime eating disorders (4.3%), (2) overweight (10.6% had a lifetime eating disorder), (3) underweight and generally had no eating disorders except for 5.3% who had restricting anorexia nervosa, (4) of low to normal weight (89.0% had a lifetime eating disorder), and (5) obese (37.0% had a lifetime eating disorder). Each profile contained more than 1 type of lifetime eating disorder except for the third profile. Women in the first and third profiles had the best functioning, with women in the fourth and fifth profiles having similarly poorer functioning. The women in the fourth group had a symptom profile distinctive from the other 4 groups in terms of severity; they were also more likely to have had lifetime major depression and suicidality. Conclusion: Lifetime weight ranges and the severity of eating disorder symptoms affected clustering more than the type of eating disorder symptom.
Abstract:
Quality of life has been shown to be poor among people living with chronic hepatitis C. However, it is not clear how this relates to the presence of symptoms and their severity. The aim of this study was to describe the typology of a broad array of symptoms that were attributed to hepatitis C virus (HCV) infection. Phase 1 used qualitative methods to identify symptoms. In Phase 2, 188 treatment-naive people living with HCV participated in a quantitative survey. The most prevalent symptom was physical tiredness (86%) followed by irritability (75%), depression (70%), mental tiredness (70%), and abdominal pain (68%). Temporal clustering of symptoms was reported in 62% of participants. Principal components analysis identified four symptom clusters: neuropsychiatric (mental tiredness, poor concentration, forgetfulness, depression, irritability, physical tiredness, and sleep problems); gastrointestinal (day sweats, nausea, food intolerance, night sweats, abdominal pain, poor appetite, and diarrhea); algesic (joint pain, muscle pain, and general body pain); and dysesthetic (noise sensitivity, light sensitivity, skin problems, and headaches). These data demonstrate that symptoms are prevalent in treatment-naive people with HCV and support the hypothesis that symptom clustering occurs.
Abstract:
This paper considers a model-based approach to the clustering of tissue samples of a very large number of genes from microarray experiments. It is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. Frequently in practice, there are also clinical data available on those cases on which the tissue samples have been obtained. Here we investigate how to use the clinical data in conjunction with the microarray gene expression data to cluster the tissue samples. We propose two mixture model-based approaches in which the number of components in the mixture model corresponds to the number of clusters to be imposed on the tissue samples. One approach specifies the components of the mixture model to be the conditional distributions of the microarray data given the clinical data with the mixing proportions also conditioned on the latter data. Another takes the components of the mixture model to represent the joint distributions of the clinical and microarray data. The approaches are demonstrated on some breast cancer data, as studied recently in van't Veer et al. (2002).
Abstract:
We describe a network module detection approach which combines a rapid and robust clustering algorithm with an objective measure of the coherence of the modules identified. The approach is applied to the network of genetic regulatory interactions surrounding the tumor suppressor gene p53. This algorithm identifies ten clusters in the p53 network, which are visually coherent and biologically plausible.
Abstract:
Web transaction data between Web visitors and Web functionalities usually convey task-oriented user behavior patterns. Mining this type of click-stream data captures usage pattern information. Web usage mining has become one of the most widely used methods for Web recommendation, which customizes Web content to the user's preferred style. Traditional Web usage mining techniques, such as Web user session or Web page clustering, association rule mining, and frequent navigational path mining, can discover only explicit usage patterns. They cannot, however, reveal the underlying navigational activities or identify the latent relationships associated with the patterns among Web users and Web pages. In this work, we propose a Web recommendation framework that incorporates a Web usage mining technique based on the Probabilistic Latent Semantic Analysis (PLSA) model. The main advantage of this method is that it not only discovers usage-based access patterns but also reveals the underlying latent factors. With the discovered user access patterns, we then present users with content more likely to interest them via collaborative recommendation. To validate the effectiveness of the proposed approach, we conduct experiments on real-world datasets and make comparisons with some existing traditional techniques. The preliminary experimental results demonstrate the usability of the proposed approach.
Abstract:
Finite mixture models are being increasingly used to model the distributions of a wide variety of random phenomena. While normal mixture models are often used to cluster data sets of continuous multivariate data, a more robust clustering can be obtained by considering the t mixture model-based approach. Mixtures of factor analyzers enable model-based density estimation to be undertaken for high-dimensional data where the number of observations n is not very large relative to their dimension p. As the approach using the multivariate normal family of distributions is sensitive to outliers, it is more robust to adopt the multivariate t family for the component error and factor distributions. The computational aspects associated with robustness and high dimensionality in these approaches to cluster analysis are discussed and illustrated.
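To make the robustness argument concrete, the sketch below shows the weight that EM fitting of a t distribution typically assigns to each observation, (ν + p) / (ν + δ), where δ is the squared Mahalanobis distance from the component mean. It is a univariate, single-component simplification (p = 1) with the variance held fixed, which is an assumption made purely for illustration; the function names and toy data are hypothetical.

```python
def t_weight(x, mean, var, dof):
    """EM weight for a univariate t component: (dof + p) / (dof + delta), with p = 1."""
    delta = (x - mean) ** 2 / var  # squared Mahalanobis distance
    return (dof + 1.0) / (dof + delta)


def robust_mean(data, mean, var, dof, n_iter=20):
    """Iteratively reweighted location estimate; the variance is held fixed (assumption)."""
    for _ in range(n_iter):
        w = [t_weight(x, mean, var, dof) for x in data]
        mean = sum(wi * x for wi, x in zip(w, data)) / sum(w)
    return mean


# Hypothetical data: five points near 1 and one gross outlier at 10.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 10.0]
sample_mean = sum(data) / len(data)  # what a normal-based fit would use
t_mean = robust_mean(data, mean=sample_mean, var=1.0, dof=3.0)
outlier_weight = t_weight(10.0, t_mean, 1.0, 3.0)
inlier_weight = t_weight(1.0, t_mean, 1.0, 3.0)
```

Observations far from the centre receive small weights, so the outlier pulls the t-based estimate far less than it pulls the ordinary sample mean.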
Abstract:
This paper describes the application of a new technique, rough clustering, to the problem of market segmentation. Rough clustering produces different solutions to k-means analysis because of the possibility of multiple cluster membership of objects. Traditional clustering methods generate extensional descriptions of groups that show which objects are members of each cluster. Clustering techniques based on rough sets theory generate intensional descriptions, which outline the main characteristics of each cluster. In this study, a rough cluster analysis was conducted on a sample of 437 responses from a larger study of the relationship between shopping orientation (the general predisposition of consumers toward the act of shopping) and intention to purchase products via the Internet. The cluster analysis was based on five measures of shopping orientation: enjoyment, personalization, convenience, loyalty, and price. The rough clusters obtained provide interpretations of different shopping orientations present in the data without the restriction of attempting to fit each object into only one segment. Such descriptions can be an aid to marketers attempting to identify potential segments of consumers.
Abstract:
Rectangular dropshafts, commonly used in sewers and storm water systems, are characterised by significant flow aeration. New detailed air-water flow measurements were conducted in a near-full-scale dropshaft at large discharges. In the shaft pool and outflow channel, the results demonstrated the complexity of different competitive air entrainment mechanisms. Bubble size measurements showed a broad range of entrained bubble sizes. Analysis of streamwise distributions of bubbles further suggested some clustering process in the bubbly flow, although in the outflow channel bubble chords were on average smaller than in the shaft pool. A robust hydrophone was tested to measure bubble acoustic spectra and to assess its field application potential. The acoustic results characterised accurately the order of magnitude of entrained bubble sizes, but the transformation from acoustic frequencies to bubble radii did not predict correctly the probability distribution functions of bubble sizes.
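The abstract does not say which relation was used to convert acoustic frequencies to bubble radii. One widely used relation is the Minnaert resonance, f = (1 / (2πR)) · sqrt(3γP / ρ), and the sketch below assumes it purely for illustration. The default pressure, water density and polytropic index are assumed values for air bubbles in water at atmospheric pressure, not figures from the study.

```python
import math


def minnaert_radius(frequency_hz, pressure_pa=101325.0, density=998.0, gamma=1.4):
    """Bubble radius (m) whose Minnaert resonance equals frequency_hz.

    Assumes a single spherical bubble, negligible surface tension and the
    default ambient values above (all assumptions, not from the abstract).
    """
    return math.sqrt(3.0 * gamma * pressure_pa / density) / (2.0 * math.pi * frequency_hz)


# Radii implied by a few hypothetical acoustic frequencies.
radii_mm = {f: 1000.0 * minnaert_radius(f) for f in (500.0, 1000.0, 5000.0)}
```

A relation of this kind maps each frequency to a single radius, which is enough for an order-of-magnitude estimate; the abstract reports that the transformation did not reproduce the full size distributions.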
Abstract:
For fuel cell CO clean-up applications, the presence of water with silica membranes greatly reduces their selectivity to CO. We show results of a new functional carbonised template membrane of around 13 nm thickness which offered hydrothermal stability with no compromise to the membrane’s H2/CO permselectivity of 16. Lost permeance was also regenerated.
Abstract:
In an open channel, a hydraulic jump is the rapid transition from super- to sub-critical flow associated with strong turbulence and air bubble entrainment in the mixing layer. New experiments were performed at relatively large Reynolds numbers using phase-detection probes. Some new signal analysis provided characteristic air-water time and length scales of the vortical structures advecting the air bubbles in the developing shear flow. An analysis of the longitudinal air-water flow structure suggested little bubble clustering in the mixing layer, although an interparticle arrival time analysis showed some preferential bubble clustering for small bubbles with chord times below 3 ms. Correlation analyses yielded longitudinal air-water time scales Txx*V1/d1 of about 0.8 on average. The transverse integral length scale Z/d1 of the eddies advecting entrained bubbles was typically between 0.25 and 0.4, irrespective of the inflow conditions within the range of the investigations. Overall the findings highlighted the complicated nature of the air-water flow.
Abstract:
This paper presents a new relative measure of signal complexity, referred to here as relative structural complexity, which is based on the matching pursuit (MP) decomposition. By relative, we refer to the fact that this new measure is highly dependent on the decomposition dictionary used by MP. The structural part of the definition points to the fact that this new measure is related to the structure, or composition, of the signal under analysis. After a formal definition, the proposed relative structural complexity measure is used in the analysis of newborn EEG. To do this, firstly, a time-frequency (TF) decomposition dictionary is specifically designed to compactly represent the newborn EEG seizure state using MP. We then show, through the analysis of synthetic and real newborn EEG data, that the relative structural complexity measure can indicate changes in EEG structure as it transitions between the two EEG states; namely seizure and background (non-seizure).
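For readers unfamiliar with matching pursuit, the sketch below shows the basic greedy decomposition that the complexity measure builds on. It does not implement the relative structural complexity measure itself, whose definition is not given in the abstract. The tiny dictionary and signal are hypothetical, and atoms are assumed to have unit norm.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy matching pursuit; dictionary atoms are assumed to have unit norm."""
    residual = list(signal)
    picks = []
    for _ in range(n_atoms):
        # Choose the atom with the largest absolute correlation with the residual.
        scores = [dot(residual, atom) for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda i: abs(scores[i]))
        coeff = scores[best]
        picks.append((best, coeff))
        # Subtract the chosen atom's contribution from the residual.
        residual = [r - coeff * a for r, a in zip(residual, dictionary[best])]
    return picks, residual


# Hypothetical dictionary: two spikes and one flat atom, each of unit norm.
dictionary = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]
signal = [2.0, 1.0, 1.0, 1.0]
picks, residual = matching_pursuit(signal, dictionary, n_atoms=2)
residual_energy = dot(residual, residual)
```

Which atoms are chosen, and how quickly the residual energy falls, depends entirely on the dictionary. This is the dependence the abstract refers to when it calls the measure relative.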
Abstract:
Intracellular Wolbachia infections are extremely common in arthropods and exert profound control over the reproductive biology of the host. However, very little is known about the underlying molecular mechanisms which mediate these interactions with the host. We examined protein synthesis by Wolbachia in a Drosophila host in vivo by selective metabolic labelling of prokaryotic proteins and subsequent analysis by 1D and 2D gel electrophoresis. Using this method we could identify the major proteins synthesized by Wolbachia in ovaries and testes of flies. Of these proteins the most abundant was of low molecular weight and showed size variation between Wolbachia strains which correlated with the reproductive phenotype they generated in flies. Using the gel systems we employed it was not possible to identify any proteins of Wolbachia origin in the mature sperm cells of infected flies.
Abstract:
Bacterial endosymbionts of insects have long been implicated in the phenomenon of cytoplasmic incompatibility, in which certain crosses between symbiont-infected individuals lead to embryonic death or sex ratio distortion. The taxonomic position of these bacteria has, however, not been known with any certainty. Similarly, the relatedness of the bacteria infecting various insect hosts has been unclear. The inability to grow these bacteria on defined cell-free medium has been the major factor underlying these uncertainties. We circumvented this problem by selective PCR amplification and subsequent sequencing of the symbiont 16S rRNA genes directly from infected insect tissue. Maximum parsimony analysis of these sequences indicates that the symbionts belong in the α-subdivision of the Proteobacteria, where they are most closely related to the Rickettsia and their relatives. They are all closely related to each other and are assigned to the type species Wolbachia pipientis. Lack of congruence between the phylogeny of the symbionts and their insect hosts suggests that horizontal transfer of symbionts between insect species may occur. Comparison of the sequences for W. pipientis and for Wolbachia persica, an endosymbiont of ticks, shows that the genus Wolbachia is polyphyletic. A PCR assay based on 16S primers was designed for the detection of W. pipientis in insect tissue, and initial screening of insects indicates that cytoplasmic incompatibility may be a more general phenomenon in insects than is currently recognized.
Abstract:
Despite many successes of conventional DNA sequencing methods, some DNAs remain difficult or impossible to sequence. Unsequenceable regions occur in the genomes of many biologically important organisms, including the human genome. Such regions range in length from tens to millions of bases, and may contain valuable information such as the sequences of important genes. The authors have recently developed a technique that renders a wide range of problematic DNAs amenable to sequencing. The technique is known as sequence analysis via mutagenesis (SAM). This paper presents a number of algorithms for analysing and interpreting data generated by this technique.