60 resultados para Microarray Cancer Data
em University of Queensland eSpace - Australia
Resumo:
In microarray studies, the application of clustering techniques is often used to derive meaningful insights into the data. In the past, hierarchical methods have been the primary clustering tool employed to perform this task. The hierarchical algorithms have been mainly applied heuristically to these cluster analysis problems. Further, a major limitation of these methods is their inability to determine the number of clusters. Thus there is a need for a model-based approach to these. clustering problems. To this end, McLachlan et al. [7] developed a mixture model-based algorithm (EMMIX-GENE) for the clustering of tissue samples. To further investigate the EMMIX-GENE procedure as a model-based -approach, we present a case study involving the application of EMMIX-GENE to the breast cancer data as studied recently in van 't Veer et al. [10]. Our analysis considers the problem of clustering the tissue samples on the basis of the genes which is a non-standard problem because the number of genes greatly exceed the number of tissue samples. We demonstrate how EMMIX-GENE can be useful in reducing the initial set of genes down to a more computationally manageable size. The results from this analysis also emphasise the difficulty associated with the task of separating two tissue groups on the basis of a particular subset of genes. These results also shed light on why supervised methods have such a high misallocation error rate for the breast cancer data.
Resumo:
This paper considers a model-based approach to the clustering of tissue samples of a very large number of genes from microarray experiments. It is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. Frequently in practice, there are also clinical data available on those cases on which the tissue samples have been obtained. Here we investigate how to use the clinical data in conjunction with the microarray gene expression data to cluster the tissue samples. We propose two mixture model-based approaches in which the number of components in the mixture model corresponds to the number of clusters to be imposed on the tissue samples. One approach specifies the components of the mixture model to be the conditional distributions of the microarray data given the clinical data with the mixing proportions also conditioned on the latter data. Another takes the components of the mixture model to represent the joint distributions of the clinical and microarray data. The approaches are demonstrated on some breast cancer data, as studied recently in van't Veer et al. (2002).
Resumo:
Motivation: This paper introduces the software EMMIX-GENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to reduce effectively the dimension of the feature space of genes. Results: The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes are able to be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets.
Resumo:
We consider a mixture model approach to the regression analysis of competing-risks data. Attention is focused on inference concerning the effects of factors on both the probability of occurrence and the hazard rate conditional on each of the failure types. These two quantities are specified in the mixture model using the logistic model and the proportional hazards model, respectively. We propose a semi-parametric mixture method to estimate the logistic and regression coefficients jointly, whereby the component-baseline hazard functions are completely unspecified. Estimation is based on maximum likelihood on the basis of the full likelihood, implemented via an expectation-conditional maximization (ECM) algorithm. Simulation studies are performed to compare the performance of the proposed semi-parametric method with a fully parametric mixture approach. The results show that when the component-baseline hazard is monotonic increasing, the semi-parametric and fully parametric mixture approaches are comparable for mildly and moderately censored samples. When the component-baseline hazard is not monotonic increasing, the semi-parametric method consistently provides less biased estimates than a fully parametric approach and is comparable in efficiency in the estimation of the parameters for all levels of censoring. The methods are illustrated using a real data set of prostate cancer patients treated with different dosages of the drug diethylstilbestrol. Copyright (C) 2003 John Wiley Sons, Ltd.
Resumo:
With mixed feature data, problems are induced in modeling the gating network of normalized Gaussian (NG) networks as the assumption of multivariate Gaussian becomes invalid. In this paper, we propose an independence model to handle mixed feature data within the framework of NG networks. The method is illustrated using a real example of breast cancer data.
Resumo:
Sum: Plant biologists in fields of ecology, evolution, genetics and breeding frequently use multivariate methods. This paper illustrates Principal Component Analysis (PCA) and Gabriel's biplot as applied to microarray expression data from plant pathology experiments. Availability: An example program in the publicly distributed statistical language R is available from the web site (www.tpp.uq.edu.au) and by e-mail from the contact. Contact: scott.chapman@csiro.au.
Resumo:
In the context of cancer diagnosis and treatment, we consider the problem of constructing an accurate prediction rule on the basis of a relatively small number of tumor tissue samples of known type containing the expression data on very many (possibly thousands) genes. Recently, results have been presented in the literature suggesting that it is possible to construct a prediction rule from only a few genes such that it has a negligible prediction error rate. However, in these results the test error or the leave-one-out cross-validated error is calculated without allowance for the selection bias. There is no allowance because the rule is either tested on tissue samples that were used in the first instance to select the genes being used in the rule or because the cross-validation of the rule is not external to the selection process; that is, gene selection is not performed in training the rule at each stage of the cross-validation process. We describe how in practice the selection bias can be assessed and corrected for by either performing a cross-validation or applying the bootstrap external to the selection process. We recommend using 10-fold rather than leave-one-out cross-validation, and concerning the bootstrap, we suggest using the so-called. 632+ bootstrap error estimate designed to handle overfitted prediction rules. Using two published data sets, we demonstrate that when correction is made for the selection bias, the cross-validated error is no longer zero for a subset of only a few genes.
Resumo:
Alcohol and tobacco consumption are closely correlated and published results on their association with breast cancer have not always allowed adequately for confounding between these exposures. Over 80% of the relevant information worldwide on alcohol and tobacco consumption and breast cancer were collated, checked and analysed centrally. Analyses included 58515 women with invasive breast cancer and 95067 controls from 53 studies. Relative risks of breast cancer were estimated, after stratifying by study, age, parity and, where appropriate, women's age when their first child was born and consumption of alcohol and tobacco. The average consumption of alcohol reported by controls from developed countries was 6.0 g per day, i.e. about half a unit/drink of alcohol per day, and was greater in ever-smokers than never-smokers, (8.4 g per day and 5.0 g per day, respectively). Compared with women who reported drinking no alcohol, the relative risk of breast cancer was 1.32 (1.19 - 1.45, P < 0.00001) for an intake of 35 - 44 g per day alcohol, and 1.46 (1.33 - 1.61, P < 0.00001) for greater than or equal to 45 g per day alcohol. The relative risk of breast cancer increased by 7.1% (95% CI 5.5-8.7%; P
Resumo:
The tests that are currently available for the measurement of overexpression of the human epidermal growth factor-2 (HER2) in breast cancer have shown considerable problems in accuracy and interlaboratory reproducibility. Although these problems are partly alleviated by the use of validated, standardised 'kits', there may be considerable cost involved in their use. Prior to testing it may therefore be an advantage to be able to predict from basic pathology data whether a cancer is likely to overexpress HER2. In this study, we have correlated pathology features of cancers with the frequency of HER2 overexpression assessed by immunohistochemistry (IHC) using HercepTest (Dako). In addition, fluorescence in situ hybridisation (FISH) has been used to re-test the equivocal cancers and interobserver variation in assessing HER2 overexpression has been examined by a slide circulation scheme. Of the 1536 cancers, 1144 (74.5%) did not overexpress HER2. Unequivocal overexpression (3+ by IHC) was seen in 186 cancers (12%) and an equivocal result (2+ by IHC) was seen in 206 cancers (13%). Of the 156 IHC 3+ cancers for which complete data was available, 149 (95.5%) were ductal NST and 152 (97%) were histological grade 2 or 3. Only 1 of 124 infiltrating lobular carcinomas (0.8%) showed HER2 overexpression. None of the 49 'special types' of carcinoma showed HER2 overexpression. Re-testing by FISH of a proportion of the IHC 2+ cancers showed that only 25 (23%) of those assessable exhibited HER2 gene amplification, but 46 of the 47 IHC 3+ cancers (98%) were confirmed as showing gene amplification. Circulating slides for the assessment of HER2 score showed a moderate level of agreement between pathologists (kappa 0.4). As a result of this study we would advocate consideration of a triage approach to HER-2 testing. Infiltrating lobular and special types of carcinoma may not need to be routinely tested at presentation nor may grade 1 NST carcinomas in which only 1.4% have been shown to overexpress HER2. Testing of these carcinomas may be performed when HER2 status is required to assist in therapeutic or other clinical/prognostic decision-making. The highest yield of HER2 overexpressing carcinomas is seen in the grade 3 NST subgroup in which 24% are positive by IHC. (C) 2003 Elsevier Science Ltd. All rights reserved.
Resumo:
The identification of biomarkers capable of providing a reliable molecular diagnostic test for prostate cancer (PCa) is highly desirabie clinically. We describe here 4 biomarkers, UDP-N-Acetyl-alpha-D-galactosamine transferase (GalNAc-T3; not previously associated with PCa), PSMA, Hepsin and DD3/PCA3, which, in combination, distinguish prostate cancer from benign prostate hyperplasia (BPH). GalNAc-T3 was identified as overexpressed in PCa tissues by microarray analysis, confirmed by quantitative real-time PCR and shown immunohistochemically to be localised to prostate epithelial cells with higher expression in malignant cells. Real-time quantitative PCR analysis across 21 PCa and 34 BPH tissues showed 4.6-fold overexpression of GalNAc-T3 (p = 0.005). The noncoding mRNA (DD3/PCA3) was overexpressed 140-fold (p = 0.007) in the cancer samples compared to BPH tissues. Hepsin was overexpressed 21-fold (p = 0.049, whereas the overexpression for PSMA was 66-fold (p = 0.047). When the gene expression data for these 4 biomarkers was combined in a logistic regression model, a predictive index was obtained that distinguished 100% of the PCa samples from all of the BPH samples. Therefore, combining these genes in a real-time PCR assay represents a powerful new approach to diagnosing PCa by molecular profiling. (c) 2005 Wiley-Liss, Inc.