10 results for EXPRESSION DATA
in University of Queensland eSpace - Australia
Abstract:
This paper considers a model-based approach to the clustering of tissue samples on a very large number of genes from microarray experiments. It is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. Frequently in practice, clinical data are also available on the cases from which the tissue samples were obtained. Here we investigate how to use the clinical data in conjunction with the microarray gene expression data to cluster the tissue samples. We propose two mixture model-based approaches in which the number of components in the mixture model corresponds to the number of clusters to be imposed on the tissue samples. One approach specifies the components of the mixture model to be the conditional distributions of the microarray data given the clinical data, with the mixing proportions also conditioned on the latter data. The other takes the components of the mixture model to represent the joint distributions of the clinical and microarray data. The approaches are demonstrated on some breast cancer data, as studied recently in van't Veer et al. (2002).
Abstract:
With mixed feature data, problems are induced in modeling the gating network of normalized Gaussian (NG) networks as the assumption of multivariate Gaussian becomes invalid. In this paper, we propose an independence model to handle mixed feature data within the framework of NG networks. The method is illustrated using a real example of breast cancer data.
Abstract:
We consider the problem of assessing the number of clusters in a limited number of tissue samples containing gene expressions for possibly several thousands of genes. It is proposed to use a normal mixture model-based approach to the clustering of the tissue samples. One advantage of this approach is that the question on the number of clusters in the data can be formulated in terms of a test on the smallest number of components in the mixture model compatible with the data. This test can be carried out on the basis of the likelihood ratio test statistic, using resampling to assess its null distribution. The effectiveness of this approach is demonstrated on simulated data and on some microarray datasets, as considered previously in the bioinformatics literature. (C) 2004 Elsevier Inc. All rights reserved.
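The resampling idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes univariate data, a test of one versus two normal components, a parametric bootstrap as the resampling scheme, and a simple variance floor; all function names are hypothetical and the data are simulated.

```python
import math
import random


def normal_logpdf(x, mu, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def fit_one_normal(xs):
    """Maximum-likelihood fit of a single normal; returns (loglik, mu, var)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return sum(normal_logpdf(x, mu, var) for x in xs), mu, var


def mixture_point_logs(x, pi, mu, var):
    return [math.log(pi[k]) + normal_logpdf(x, mu[k], var[k]) for k in range(2)]


def fit_two_normals(xs, iters=200):
    """EM for a two-component normal mixture; returns the final log-likelihood."""
    n = len(xs)
    ordered = sorted(xs)
    half = n // 2
    _, _, total_var = fit_one_normal(xs)
    floor = 0.01 * total_var  # assumed variance floor to avoid degenerate components
    pi = [0.5, 0.5]
    mu = [sum(ordered[:half]) / half, sum(ordered[half:]) / (n - half)]
    var = [total_var, total_var]
    for _ in range(iters):
        # E-step: posterior membership probabilities (log-sum-exp for stability)
        resp = []
        for x in xs:
            logs = mixture_point_logs(x, pi, mu, var)
            top = max(logs)
            weights = [math.exp(v - top) for v in logs]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # M-step: update mixing proportions, means and variances
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-9
            pi[k] = nk / (n + 2e-9)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            spread = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            var[k] = max(spread, floor)
    loglik = 0.0
    for x in xs:
        logs = mixture_point_logs(x, pi, mu, var)
        top = max(logs)
        loglik += top + math.log(sum(math.exp(v - top) for v in logs))
    return loglik


def lrt_statistic(xs):
    """-2 log lambda for H0: one component versus H1: two components."""
    loglik_one, _, _ = fit_one_normal(xs)
    return 2.0 * (fit_two_normals(xs) - loglik_one)


def bootstrap_lrt(xs, n_boot, rng):
    """Bootstrap p-value: simulate from the fitted null model and refit both models."""
    observed = lrt_statistic(xs)
    _, mu, var = fit_one_normal(xs)
    sd = math.sqrt(var)
    exceed = 0
    for _ in range(n_boot):
        sample = [rng.gauss(mu, sd) for _ in xs]
        if lrt_statistic(sample) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (n_boot + 1)


# Simulated data with two well-separated groups (illustrative only).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(20)] + [rng.gauss(6.0, 1.0) for _ in range(20)]
stat, p_value = bootstrap_lrt(data, n_boot=19, rng=rng)
print(f"-2 log lambda = {stat:.2f}, bootstrap p = {p_value:.2f}")
```

To choose the number of clusters, the same test would be repeated for g versus g + 1 components until the null hypothesis is no longer rejected. With only 19 replicates the smallest attainable p-value is 0.05, so a real analysis would use more.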
Abstract:
The identification of biomarkers capable of providing a reliable molecular diagnostic test for prostate cancer (PCa) is highly desirable clinically. We describe here 4 biomarkers, UDP-N-Acetyl-alpha-D-galactosamine transferase (GalNAc-T3; not previously associated with PCa), PSMA, Hepsin and DD3/PCA3, which, in combination, distinguish prostate cancer from benign prostate hyperplasia (BPH). GalNAc-T3 was identified as overexpressed in PCa tissues by microarray analysis, confirmed by quantitative real-time PCR and shown immunohistochemically to be localised to prostate epithelial cells with higher expression in malignant cells. Real-time quantitative PCR analysis across 21 PCa and 34 BPH tissues showed 4.6-fold overexpression of GalNAc-T3 (p = 0.005). The noncoding mRNA (DD3/PCA3) was overexpressed 140-fold (p = 0.007) in the cancer samples compared to BPH tissues. Hepsin was overexpressed 21-fold (p = 0.049), whereas the overexpression for PSMA was 66-fold (p = 0.047). When the gene expression data for these 4 biomarkers were combined in a logistic regression model, a predictive index was obtained that distinguished 100% of the PCa samples from all of the BPH samples. Therefore, combining these genes in a real-time PCR assay represents a powerful new approach to diagnosing PCa by molecular profiling. (c) 2005 Wiley-Liss, Inc.
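The logistic-regression step described in this abstract could look roughly like the sketch below. It is not the authors' model: the expression values are synthetic placeholders rather than study data, and the fitting choices (batch gradient descent, a small L2 penalty, the names `fit_logistic` and `predictive_index`) are assumptions made for illustration.

```python
from math import exp


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.05, epochs=2000, l2=0.01):
    """Fit weights and intercept by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # The L2 penalty is applied to the weights only, not the intercept.
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predictive_index(x, w, b):
    """Estimated probability that a sample belongs to the positive (PCa) class."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Synthetic log2 relative-expression values for four markers (placeholders only).
X = [
    [2.1, 6.8, 4.0, 5.5],    # PCa-like
    [1.8, 7.5, 4.6, 6.2],    # PCa-like
    [2.5, 6.0, 3.8, 5.9],    # PCa-like
    [0.2, 0.5, -0.3, 0.4],   # BPH-like
    [-0.1, 1.0, 0.3, -0.2],  # BPH-like
    [0.4, -0.4, 0.1, 0.6],   # BPH-like
]
y = [1, 1, 1, 0, 0, 0]

w, b = fit_logistic(X, y)
scores = [predictive_index(x, w, b) for x in X]
print([round(s, 3) for s in scores])
```

Because the toy classes are perfectly separable, the penalty keeps the weights finite. Separation on the training samples alone says nothing about performance on new samples, which would need cross-validation or an independent cohort.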
Abstract:
Monoclonal antibodies (Mab) are heterotetramers consisting of an equimolar ratio of heavy chain (HC) and light chain (LC) polypeptides. Accordingly, most recombinant Mab expression systems utilize an equimolar ratio of heavy chain (hc) to light chain (lc) genes encoded on either one or two plasmids. However, there is no evidence to suggest that this gene ratio is optimal for stable or transient production of recombinant Mab. In this study we have determined the optimal ratio of hc:lc genes for production of a recombinant IgG4 Mab, cB72.3, by Chinese hamster ovary (CHO) cells using both empirical and mathematical modeling approaches. Polyethyleneimine-mediated transient expression of cB72.3 at varying ratios of hc:lc genes encoded on separate plasmids yielded an optimal Mab titer at a hc:lc gene ratio of 3:2, a conclusion confirmed by separate mathematical modeling of the Mab folding and assembly process using transient expression data. On the basis of this information, we hypothesized that utilization of hc genes at low hc:lc gene ratios is more efficient. To confirm this, cB72.3 Mab was transiently produced by CHO cells at constant hc and varying lc gene dose. Under these conditions, Mab yield was increased with a concomitant increase in lc gene dose. To determine if the above findings also apply to stably transfected CHO cells producing recombinant Mab, we compared the intra- and extracellular ratios of HC and LC polypeptides for three GS-CHO cell lines transfected with a 1:1 ratio of hc:lc genes and selected for stable expression of the same recombinant Mab, cB72.3. Intra- and extracellular HC:LC polypeptide ratios ranged from 1:2 to 1:5, less than that observed on transient expression of the same Mab in parental CHO cells using the same vector. In conclusion, our data suggest that the optimal ratio of hc:lc genes used for transient and stable expression of Mab differs. In the case of the latter, we infer that optimal Mab production by stably transfected cells represents a compromise between HC abundance limiting productivity and the requirement for excess LC to render Mab folding and assembly more efficient.
Abstract:
Background: The residue-wise contact order (RWCO) describes the sequence separations between a residue of interest and its contacting residues in a protein sequence. It is a new kind of one-dimensional protein structure that represents the extent of long-range contacts and is considered a generalization of contact order. Together with secondary structure, accessible surface area, the B factor, and contact number, RWCO provides comprehensive and indispensable information for reconstructing the protein three-dimensional structure from a set of one-dimensional structural properties. Accurately predicting RWCO values could have many important applications in protein three-dimensional structure prediction and protein folding rate prediction, and give deep insights into protein sequence-structure relationships. Results: We developed a novel approach to predict residue-wise contact order values in proteins based on support vector regression (SVR), starting from primary amino acid sequences. We explored seven different sequence encoding schemes to examine their effects on the prediction performance, including local sequence in the form of PSI-BLAST profiles, local sequence plus amino acid composition, local sequence plus molecular weight, local sequence plus secondary structure predicted by PSIPRED, local sequence plus molecular weight and amino acid composition, local sequence plus molecular weight and predicted secondary structure, and local sequence plus molecular weight, amino acid composition and predicted secondary structure. When using local sequences with multiple sequence alignments in the form of PSI-BLAST profiles, we could predict the RWCO distribution with a Pearson correlation coefficient (CC) between the predicted and observed RWCO values of 0.55, and root mean square error (RMSE) of 0.82, based on a well-defined dataset with 680 protein sequences. Moreover, by incorporating global features such as molecular weight and amino acid composition we could further improve the prediction performance, with a CC of 0.57 and an RMSE of 0.79. In addition, combining the secondary structure predicted by PSIPRED was found to significantly improve the prediction performance and could yield the best prediction accuracy with a CC of 0.60 and RMSE of 0.78, which is at least comparable with other existing methods. Conclusion: The SVR method shows a prediction performance competitive with or at least comparable to the previously developed linear regression-based methods for predicting RWCO values. In contrast to support vector classification (SVC), SVR is very good at estimating the raw value profiles of the samples. The successful application of the SVR approach in this study reinforces the fact that support vector regression is a powerful tool in extracting the protein sequence-structure relationship and in estimating the protein structural profiles from amino acid sequences.
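As a rough illustration of the quantity being predicted, the sketch below computes an RWCO-like profile from a binary contact map. The abstract does not give the formula, so the details are assumptions: contacts are taken as a 0/1 matrix, each residue's summed sequence separation is normalised by chain length, and pairs closer than a hypothetical `min_separation` of 3 are ignored.

```python
def contact_map(length, pairs):
    """Build a symmetric 0/1 contact matrix from a list of (i, j) index pairs."""
    contacts = [[0] * length for _ in range(length)]
    for i, j in pairs:
        contacts[i][j] = 1
        contacts[j][i] = 1
    return contacts


def residue_wise_contact_order(contacts, min_separation=3):
    """For each residue, sum |i - j| over its contacts and divide by chain length."""
    n = len(contacts)
    profile = []
    for i in range(n):
        total = 0
        for j in range(n):
            separation = abs(i - j)
            if separation >= min_separation and contacts[i][j]:
                total += separation
        profile.append(total / n)
    return profile


# Toy five-residue chain with three long-range contacts (illustrative only).
toy = contact_map(5, [(0, 3), (0, 4), (1, 4)])
print(residue_wise_contact_order(toy))
```

Residues with many contacts far away in sequence get large values and residues with only local contacts get values near zero, which is why the profile is described as capturing the extent of long-range contacts.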
Abstract:
Transcriptional regulatory networks govern cell differentiation and the cellular response to external stimuli. However, mammalian model systems have not yet been accessible for network analysis. Here, we present a genome-wide network analysis of the transcriptional regulation underlying the mouse macrophage response to bacterial lipopolysaccharide (LPS). Key to uncovering the network structure is our combination of time-series cap analysis of gene expression with in silico prediction of transcription factor binding sites. By integrating microarray and qPCR time-series expression data with a promoter analysis, we find dynamic subnetworks that describe how signaling pathways change during the progress of the macrophage LPS response, thus defining regulatory modules characteristic of the inflammatory response. In particular, our integrative analysis enabled us to suggest novel roles for the transcription factors ATF-3 and NRF-2 during the inflammatory response. We believe that the systems approach presented here is applicable to understanding cellular differentiation in higher eukaryotes. (c) 2006 Elsevier Inc. All rights reserved.
Abstract:
Time-course experiments with microarrays are often used to study dynamic biological systems and genetic regulatory networks (GRNs) that model how genes influence each other in cell-level development of organisms. Inference of GRNs provides important insights into fundamental biological processes such as growth and is useful in disease diagnosis and genomic drug design. Due to the experimental design, multilevel data hierarchies are often present in time-course gene expression data. Most existing methods, however, ignore the dependency of the expression measurements over time and the correlation among gene expression profiles. Such independence assumptions are at odds with regulatory interactions, can overlook important subject effects, and can lead to spurious inference for regulatory networks or mechanisms. In this paper, a multilevel mixed-effects model is adopted to incorporate data hierarchies in the analysis of time-course data, where temporal and subject effects are both assumed to be random. The method starts with the clustering of genes by fitting the mixture model within the multilevel random-effects model framework using the expectation-maximization (EM) algorithm. The network of regulatory interactions is then determined by searching for regulatory control elements (activators and inhibitors) shared by the clusters of co-expressed genes, based on a time-lagged correlation coefficient. The method is applied to two real time-course datasets from the budding yeast (Saccharomyces cerevisiae) genome. It is shown that the proposed method provides clusters of cell-cycle regulated genes that are supported by existing gene function annotations, and hence enables inference on regulatory interactions for the genetic network.
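The time-lagged correlation mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes a plain Pearson correlation between a candidate regulator at time t and a target profile at time t + lag, the helper names are hypothetical, and the series are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def lagged_correlation(regulator, target, lag):
    """Correlate regulator[t] with target[t + lag] for a non-negative lag."""
    n = len(regulator)
    if len(target) != n:
        raise ValueError("series must have equal length")
    if lag < 0 or n - lag < 2:
        raise ValueError("lag must leave at least two overlapping time points")
    return pearson(regulator[: n - lag], target[lag:])


def best_lag(regulator, target, max_lag):
    """Return (lag, correlation) with the largest absolute correlation."""
    scores = {lag: lagged_correlation(regulator, target, lag) for lag in range(max_lag + 1)}
    lag = max(scores, key=lambda k: abs(scores[k]))
    return lag, scores[lag]


# Made-up profiles: the target repeats the regulator two time points later.
regulator = [1, 3, 2, 5, 4, 6, 2, 1]
target = [0, 0, 1, 3, 2, 5, 4, 6]
lag, score = best_lag(regulator, target, max_lag=3)
print(lag, round(score, 3))
```

Under this reading, a strongly positive lagged correlation would point to an activator-like relationship and a strongly negative one to an inhibitor-like relationship. In the abstract the search is applied to clusters of co-expressed genes, so `target` would be a cluster-level profile rather than a single gene.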