889 resultados para Microarray Cancer Data


Relevância:

100.00% 100.00%

Publicador:

Resumo:

This research evaluates pattern recognition techniques on a subclass of big data where the dimensionality of the input space (p) is much larger than the number of observations (n). Specifically, we evaluate massive gene expression microarray cancer data where the ratio κ is less than one. We explore the statistical and computational challenges inherent in these high dimensional low sample size (HDLSS) problems and present statistical machine learning methods used to tackle and circumvent these difficulties. Regularization and kernel algorithms were explored in this research using seven datasets where κ < 1. These techniques require special attention to tuning necessitating several extensions of cross-validation to be investigated to support better predictive performance. While no single algorithm was universally the best predictor, the regularization technique produced lower test errors in five of the seven datasets studied.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In microarray studies, the application of clustering techniques is often used to derive meaningful insights into the data. In the past, hierarchical methods have been the primary clustering tool employed to perform this task. The hierarchical algorithms have been mainly applied heuristically to these cluster analysis problems. Further, a major limitation of these methods is their inability to determine the number of clusters. Thus there is a need for a model-based approach to these. clustering problems. To this end, McLachlan et al. [7] developed a mixture model-based algorithm (EMMIX-GENE) for the clustering of tissue samples. To further investigate the EMMIX-GENE procedure as a model-based -approach, we present a case study involving the application of EMMIX-GENE to the breast cancer data as studied recently in van 't Veer et al. [10]. Our analysis considers the problem of clustering the tissue samples on the basis of the genes which is a non-standard problem because the number of genes greatly exceed the number of tissue samples. We demonstrate how EMMIX-GENE can be useful in reducing the initial set of genes down to a more computationally manageable size. The results from this analysis also emphasise the difficulty associated with the task of separating two tissue groups on the basis of a particular subset of genes. These results also shed light on why supervised methods have such a high misallocation error rate for the breast cancer data.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

This paper considers a model-based approach to the clustering of tissue samples of a very large number of genes from microarray experiments. It is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. Frequently in practice, there are also clinical data available on those cases on which the tissue samples have been obtained. Here we investigate how to use the clinical data in conjunction with the microarray gene expression data to cluster the tissue samples. We propose two mixture model-based approaches in which the number of components in the mixture model corresponds to the number of clusters to be imposed on the tissue samples. One approach specifies the components of the mixture model to be the conditional distributions of the microarray data given the clinical data with the mixing proportions also conditioned on the latter data. Another takes the components of the mixture model to represent the joint distributions of the clinical and microarray data. The approaches are demonstrated on some breast cancer data, as studied recently in van't Veer et al. (2002).

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Motivation: This paper introduces the software EMMIX-GENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to reduce effectively the dimension of the feature space of genes. Results: The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes are able to be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets.

Relevância:

100.00% 100.00%

Publicador:

Relevância:

100.00% 100.00%

Publicador:

Resumo:

PURPOSE: Tumor stage and nuclear grade are the most important prognostic parameters of clear cell renal cell carcinoma (ccRCC). The progression risk of ccRCC remains difficult to predict particularly for tumors with organ-confined stage and intermediate differentiation grade. Elucidating molecular pathways deregulated in ccRCC may point to novel prognostic parameters that facilitate planning of therapeutic approaches. EXPERIMENTAL DESIGN: Using tissue microarrays, expression patterns of 15 different proteins were evaluated in over 800 ccRCC patients to analyze pathways reported to be physiologically controlled by the tumor suppressors von Hippel-Lindau protein and phosphatase and tensin homologue (PTEN). Tumor staging and grading were improved by performing variable selection using Cox regression and a recursive bootstrap elimination scheme. RESULTS: Patients with pT2 and pT3 tumors that were p27 and CAIX positive had a better outcome than those with all remaining marker combinations. A prolonged survival among patients with intermediate grade (grade 2) correlated with both nuclear p27 and cytoplasmic PTEN expression, as well as with inactive, nonphosphorylated ribosomal protein S6. By applying graphical log-linear modeling for over 700 ccRCC for which the molecular parameters were available, only a weak conditional dependence existed between the expression of p27, PTEN, CAIX, and p-S6, suggesting that the dysregulation of several independent pathways are crucial for tumor progression. CONCLUSIONS: The use of recursive bootstrap elimination, as well as graphical log-linear modeling for comprehensive tissue microarray (TMA) data analysis allows the unraveling of complex molecular contexts and may improve predictive evaluations for patients with advanced renal cancer.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Preoperative chemoradiation significantly improves oncological outcome in locally advanced rectal cancer. However there is no effective method of predicting tumor response to chemoradiation in these patients. Peripheral blood mononuclear cells have emerged recently as pathology markers of cancer and other diseases, making possible their use as therapy predictors. Furthermore, the importance of the immune response in radiosensivity of solid organs led us to hypothesized that microarray gene expression profiling of peripheral blood mononuclear cells could identify patients with response to chemoradiation in rectal cancer. Thirty five 35 patients with locally advanced rectal cancer were recruited initially to perform the study. Peripheral blood samples were obtained before neaodjuvant treatment. RNA was extracted and purified to obtain cDNA and cRNA for hybridization of microarrays included in Human WG CodeLink bioarrays. Quantitative real time PCR was used to validate microarray experiment data. Results were correlated with pathological response, according to Mandard´s criteria and final UICC Stage (patients with tumor regression grade 1-2 and downstaging being defined as responders and patients with grade 3-5 and no downstaging as non-responders). Twenty seven out of 35 patients were finally included in the study. We performed a multiple t-test using Significance Analysis of Microarrays, to find those genes differing significantly in expression, between responders (n = 11) and non-responders (n = 16) to CRT. The differently expressed genes were: BC 035656.1, CIR, PRDM2, CAPG, FALZ, HLA-DPB2, NUPL2, and ZFP36. The measurement of FALZ (p = 0.029) gene expression level determined by qRT-PCR, showed statistically significant differences between the two groups. Gene expression profiling reveals novel genes in peripheral blood samples of mononuclear cells that could predict responders and non-responders to chemoradiation in patients with locally advanced rectal cancer. Moreover, our investigation added further evidence to the importance of mononuclear cells' mediated response in the neoadjuvant treatment of rectal cancer.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Microarray data analysis is one of data mining tool which is used to extract meaningful information hidden in biological data. One of the major focuses on microarray data analysis is the reconstruction of gene regulatory network that may be used to provide a broader understanding on the functioning of complex cellular systems. Since cancer is a genetic disease arising from the abnormal gene function, the identification of cancerous genes and the regulatory pathways they control will provide a better platform for understanding the tumor formation and development. The major focus of this thesis is to understand the regulation of genes responsible for the development of cancer, particularly colorectal cancer by analyzing the microarray expression data. In this thesis, four computational algorithms namely fuzzy logic algorithm, modified genetic algorithm, dynamic neural fuzzy network and Takagi Sugeno Kang-type recurrent neural fuzzy network are used to extract cancer specific gene regulatory network from plasma RNA dataset of colorectal cancer patients. Plasma RNA is highly attractive for cancer analysis since it requires a collection of small amount of blood and it can be obtained at any time in repetitive fashion allowing the analysis of disease progression and treatment response.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The microarray technology provides a high-throughput technique to study gene expression. Microarrays can help us diagnose different types of cancers, understand biological processes, assess host responses to drugs and pathogens, find markers for specific diseases, and much more. Microarray experiments generate large amounts of data. Thus, effective data processing and analysis are critical for making reliable inferences from the data. ^ The first part of dissertation addresses the problem of finding an optimal set of genes (biomarkers) to classify a set of samples as diseased or normal. Three statistical gene selection methods (GS, GS-NR, and GS-PCA) were developed to identify a set of genes that best differentiate between samples. A comparative study on different classification tools was performed and the best combinations of gene selection and classifiers for multi-class cancer classification were identified. For most of the benchmarking cancer data sets, the gene selection method proposed in this dissertation, GS, outperformed other gene selection methods. The classifiers based on Random Forests, neural network ensembles, and K-nearest neighbor (KNN) showed consistently god performance. A striking commonality among these classifiers is that they all use a committee-based approach, suggesting that ensemble classification methods are superior. ^ The same biological problem may be studied at different research labs and/or performed using different lab protocols or samples. In such situations, it is important to combine results from these efforts. The second part of the dissertation addresses the problem of pooling the results from different independent experiments to obtain improved results. Four statistical pooling techniques (Fisher inverse chi-square method, Logit method. Stouffer's Z transform method, and Liptak-Stouffer weighted Z-method) were investigated in this dissertation. These pooling techniques were applied to the problem of identifying cell cycle-regulated genes in two different yeast species. As a result, improved sets of cell cycle-regulated genes were identified. The last part of dissertation explores the effectiveness of wavelet data transforms for the task of clustering. Discrete wavelet transforms, with an appropriate choice of wavelet bases, were shown to be effective in producing clusters that were biologically more meaningful. ^

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Tese de Doutoramento em Ciências (Especialidade em Matemática)

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Dissertação de mestrado em Estatística

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Cancer is a reportable disease as stated in the Iowa Administrative Code. Cancer data are collected by the State Health Registry of Iowa, located at The University of Iowa in the College of Public Health’s Department of Epidemiology. The staff includes more than 50 people. Half of them, situated throughout the state, regularly visit hospitals, clinics, and medical laboratories in Iowa and neighboring states to collect cancer data. A follow-up program tracks more than 99 percent of the cancer survivors diagnosed since 1973. This program provides regular updates for follow-up and survival. The Registry maintains the confidentiality of the patients, physicians, and hospitals providing data. In 2009 data will be collected on an estimated 16,000 new cancers among Iowa residents. In situ cases of bladder cancer are included in the estimates for bladder cancer, to be in agreement with the definition of reportable cases of the Surveillance, Epidemiology, and End Results (SEER) Program of the National Cancer Institute. Since 1973 the Iowa Registry has been funded primarily by the SEER Program of the National Cancer Institute. Iowa represents rural and Midwestern populations and provides data included in many National Cancer Institute publications. Beginning in 1990 between 5 and 10 percent of the Registry’s annual operating budget has been provided by the state of Iowa. Beginning in 2003, the University of Iowa has been providing cost-sharing funds. The Registry also receives funding through grants and contracts with university, state, and national researchers investigating cancer-related topics.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Breast cancer is the most common diagnosed cancer and the leading cause of cancer death among females worldwide. It is considered a highly heterogeneous disease and it must be classified into more homogeneous groups. Hence, the purpose of this study was to classify breast tumors based on variations in gene expression patterns derived from RNA sequencing by using different class discovery methods. 42 breast tumors paired-samples were sequenced by Illumine Genome Analyzer and the data was analyzed and prepared by TopHat2 and htseq-count. As reported previously, breast cancer could be grouped into five main groups known as basal epithelial-like group, HER2 group, normal breast-like group and two Luminal groups with a distinctive expression profile. Classifying breast tumor samples by using PAM50 method, the most common subtype was Luminal B and was significantly associated with ESR1 and ERBB2 high expression. Luminal A subtype had ESR1 and SLC39A6 significant high expression, whereas HER2 subtype had a high expression of ERBB2 and CNNE1 genes and low luminal epithelial gene expression. Basal-like and normal-like subtypes were associated with low expression of ESR1, PgR and HER2, and had significant high expression of cytokeratins 5 and 17. Our results were similar compared with TGCA breast cancer data results and with known studies related with breast cancer classification. Classifying breast tumors could add significant prognostic and predictive information to standard parameters, and moreover, identify marker genes for each subtype to find a better therapy for patients with breast cancer.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Accurately and reliably identifying the actual number of clusters present with a dataset of gene expression profiles, when no additional information on cluster structure is available, is a problem addressed by few algorithms. GeneMCL transforms microarray analysis data into a graph consisting of nodes connected by edges, where the nodes represent genes, and the edges represent the similarity in expression of those genes, as given by a proximity measurement. This measurement is taken to be the Pearson correlation coefficient combined with a local non-linear rescaling step. The resulting graph is input to the Markov Cluster (MCL) algorithm, which is an elegant, deterministic, non-specific and scalable method, which models stochastic flow through the graph. The algorithm is inherently affected by any cluster structure present, and rapidly decomposes a graph into cohesive clusters. The potential of the GeneMCL algorithm is demonstrated with a 5730 gene subset (IGS) of the Van't Veer breast cancer database, for which the clusterings are shown to reflect underlying biological mechanisms. (c) 2005 Elsevier Ltd. All rights reserved.