3 results for Unsupervised unmixing
in DigitalCommons@The Texas Medical Center
Abstract:
The biomedical literature is extensively catalogued and indexed in MEDLINE. MEDLINE indexing is done by trained human indexers, who identify the most important concepts in each article, and is expensive and inconsistent. Automating the indexing task is difficult: the National Library of Medicine produces the Medical Text Indexer (MTI), which suggests potential indexing terms to the indexers. MTI’s output is not good enough to work unattended. In my thesis, I propose a different approach to the indexing task, called MEDRank. MEDRank creates graphs representing the concepts in biomedical articles and their relationships within the text, and applies graph-based ranking algorithms to identify the most important concepts in each article. I evaluate the performance of several automated indexing solutions, including my own, by comparing their output to the indexing terms selected by the human indexers. MEDRank outperformed all other evaluated indexing solutions, including MTI, in general indexing performance and precision. MEDRank can be used to cluster documents or to index any kind of biomedical text with standard vocabularies, and it could become part of MTI itself.
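As a rough illustration of graph-based concept ranking, the sketch below links concepts that occur in the same sentence and scores them with PageRank. The abstract does not say how MEDRank builds its graphs or which ranking algorithms it applies, so the co-occurrence rule, the choice of PageRank, and the names `build_cooccurrence_graph` and `pagerank` are assumptions made only for this example.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(sentences):
    """Link every pair of distinct concepts found in the same sentence.

    `sentences` is a list of concept lists. Sentence-level co-occurrence is
    an assumed stand-in for the relationships MEDRank extracts from text.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for concepts in sentences:
        for a, b in combinations(sorted(set(concepts)), 2):
            graph[a][b] += 1.0  # edge weight counts shared sentences
            graph[b][a] += 1.0
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected, weighted graph."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Each neighbour m passes on its rank in proportion to the
            # weight of the edge m-n relative to m's total edge weight.
            incoming = sum(
                rank[m] * w / sum(graph[m].values())
                for m, w in graph[n].items()
            )
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


# Toy article: each inner list holds the concepts found in one sentence.
sentences = [
    ["asthma", "inhaler", "child"],
    ["asthma", "corticosteroid"],
    ["asthma", "child"],
    ["inhaler", "corticosteroid"],
]
scores = pagerank(build_cooccurrence_graph(sentences))
top_concepts = sorted(scores, key=scores.get, reverse=True)
```

In this toy input, `top_concepts` puts the most heavily connected concept first; an indexing system would then map the top-ranked concepts to terms in a controlled vocabulary.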
Abstract:
In population studies, most current methods focus on identifying one outcome-related SNP at a time by testing for differences in genotype frequencies between disease and healthy groups or among different population groups. However, testing a great number of SNPs simultaneously raises a multiple-testing problem and produces false-positive results. Although this problem can be dealt with effectively through approaches such as Bonferroni correction, permutation testing, and false discovery rates, patterns of joint effects of several genes, each with a weak effect, may remain undetected. With the availability of high-throughput genotyping technology, searching for multiple scattered SNPs over the whole genome and modeling their joint effect on the target variable has become possible. Exhaustive search of all SNP subsets is computationally infeasible for millions of SNPs in a genome-wide study. Several effective feature selection methods combined with classification functions have been proposed to search for an optimal SNP subset in large data sets where the number of feature SNPs far exceeds the number of observations.
In this study, we take two steps to achieve this goal. First, we selected 1000 SNPs through an effective filter method, and then we performed feature selection wrapped around a classifier to identify an optimal SNP subset for predicting disease. We also developed a novel classification method, the sequential information bottleneck method, wrapped inside different search algorithms to identify an optimal subset of SNPs for classifying the outcome variable. This new method was compared with classical linear discriminant analysis in terms of classification performance. Finally, we performed a chi-square test to examine the relationship between each SNP and disease from another point of view.
In general, our results show that filtering features using the harmonic mean of sensitivity and specificity (HMSS) through linear discriminant analysis (LDA) is better than using LDA training accuracy or mutual information in our study. Our results also demonstrate that exhaustive search of small subsets of one, two, or three SNPs based on the best 100 composite 2-SNPs can find an optimal subset, and that further inclusion of more SNPs through a heuristic algorithm does not always increase the performance of SNP subsets. Although sequential forward floating selection can be applied to prevent the nesting effect of forward selection, it does not always outperform the latter because of overfitting from observing more complex subset states.
Our results also indicate that HMSS, as a criterion to evaluate the classification ability of a function, can be used on imbalanced data without modifying the original dataset, unlike classification accuracy. Our four studies suggest that the sequential information bottleneck (sIB), a new unsupervised technique, can be adopted to predict the outcome and that its ability to detect the target status is superior to that of traditional LDA in the study.
From our results, the best test probability-HMSS for predicting CVD, stroke, CAD, and psoriasis through sIB is 0.59406, 0.641815, 0.645315, and 0.678658, respectively. In terms of group prediction accuracy, the highest test accuracy of sIB for diagnosing a normal status among controls can reach 0.708999, 0.863216, 0.639918, and 0.850275, respectively, in the four studies if the test accuracy among cases is required to be at least 0.4. On the other hand, the highest test accuracy of sIB for diagnosing a disease among cases can reach 0.748644, 0.789916, 0.705701, and 0.749436, respectively, in the four studies if the test accuracy among controls is required to be at least 0.4.
A further genome-wide association study using the chi-square test shows that no significant SNPs are detected at the cut-off level 9.09451E-08 in the Framingham heart study of CVD. The WTCCC study results detect only two significant SNPs associated with CAD. In the genome-wide study of psoriasis, most of the top 20 SNP markers with impressive classification accuracy are also significantly associated with the disease through the chi-square test at the cut-off value 1.11E-07.
Although our classification methods can achieve high accuracy in the study, complete descriptions of those classification results (95% confidence intervals or statistical tests of differences) require more cost-effective methods or a more efficient computing system, neither of which is currently available for our genome-wide study. We should also note that the purpose of this study is to identify subsets of SNPs with high prediction ability, and SNPs with good discriminant power are not necessarily causal markers for the disease.
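The HMSS criterion used in this abstract can be sketched in a few lines. The function below is a minimal illustration, assuming binary labels with 1 for cases and 0 for controls; the name `hmss` and the toy data are hypothetical and are not taken from the thesis.

```python
def hmss(true_labels, predicted_labels, positive=1):
    """Harmonic mean of sensitivity and specificity for binary labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive:
            if pred == positive:
                tp += 1  # case correctly called a case
            else:
                fn += 1  # case missed
        else:
            if pred == positive:
                fp += 1  # control wrongly called a case
            else:
                tn += 1  # control correctly called a control
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    if sensitivity + specificity == 0:
        return 0.0
    return 2 * sensitivity * specificity / (sensitivity + specificity)


# Imbalanced toy data: 10 cases, 90 controls, and a classifier that
# labels everyone as a control.
truth = [1] * 10 + [0] * 90
always_control = [0] * 100
accuracy = sum(t == p for t, p in zip(truth, always_control)) / len(truth)
score = hmss(truth, always_control)
```

Here `accuracy` is 0.9 while `score` is 0.0, which shows why a harmonic mean of the two group-wise rates is more informative than plain accuracy when one class dominates.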
Abstract:
Background. Assessment of estrogen receptor (ER) expression has inconsistent utility as a prognostic marker in epithelial ovarian carcinoma. In breast and endometrial cancers, the use of estrogen-induced gene panels, rather than ER expression alone, has shown improved prognostic capability. Specifically, over-expression of estrogen-induced genes in these tumors is associated with a better prognosis and signifies estrogen sensitivity that can be exploited with hormone antagonizing agents. It was therefore hypothesized that estrogen-induced gene expression in ovarian carcinoma would successfully predict outcomes and differentiate between tumors of varying estrogen sensitivities. Methods. Two hundred nineteen (219) patients with ovarian cancer who underwent surgery at M. D. Anderson between 2004 and 2007 were identified. Of these, eighty-three (83) patients were selected for inclusion because they had advanced stage, high-grade serous carcinoma of the ovary or peritoneum, had not received neoadjuvant chemotherapy, and had readily available frozen tissue for study. All patients had also received adjuvant treatment with platinum and taxane agents. The expression of seven genes known to be induced by estrogen in the female reproductive tract (EIG121, sFRP1, sFRP4, RALDH2, PR, IGF-1, and ER) was measured using qRT-PCR. Unsupervised cluster analyses of multiple gene permutations were used to categorize patients as high or low estrogen-induced gene expressors. QPCR gene expression results were then compared to ER and PR immunohistochemical (IHC) expression. Cox proportional hazards models were used to evaluate the effects of both individual genes and selected gene clusters on patient survival. Results. Median follow-up time was 38.7 months (range 1-68 months). In a multivariate model, overall survival was predicted by sFRP1 expression (HR 1.10 [1.02-1.19], p=0.01) and EIG121 expression (HR 1.28 [1.10-1.49], p<0.01). 
A cluster defined by EIG121 and ER was further examined because that combination appeared to reasonably segregate tumors into distinct groups of high and low estrogen-induced gene expressors. Shorter overall survival was associated with high estrogen-induced gene expressors (HR 2.84 [1.11-7.30], p=0.03), even after adjustment for race, age, body mass index, and residual disease at debulking. No difference in IHC ER or PR expression was noted between gene clusters. Conclusion. In sharp contrast to breast and endometrial cancers, high estrogen-induced gene expression predicts shorter overall survival in patients with high-grade serous ovarian carcinoma. An estrogen-induced gene biomarker panel may have utility as a prognostic indicator and may be useful to guide management with estrogen antagonists in this population.
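For readers who want to relate a reported hazard ratio, its interval, and its p-value, the sketch below recovers an approximate standard error and two-sided p-value from the interval alone. It assumes the bracketed ranges are 95% Wald confidence intervals on the log hazard ratio, which the abstract does not state; the function name `wald_p_from_hr` is hypothetical.

```python
import math


def wald_p_from_hr(hr, lower, upper, z_crit=1.959964):
    """Approximate SE, z statistic, and two-sided p-value for a hazard ratio.

    Assumes `lower` and `upper` bound a symmetric Wald interval on log(HR);
    z_crit is the normal quantile for a 95% interval.
    """
    log_hr = math.log(hr)
    # The interval spans 2 * z_crit standard errors on the log scale.
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = log_hr / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p


# Values reported above for the EIG121/ER cluster: HR 2.84 [1.11-7.30].
se, z, p = wald_p_from_hr(2.84, 1.11, 7.30)
```

Under these assumptions the recovered p-value comes out close to the reported p=0.03. Small discrepancies are expected because the published hazard ratios and bounds are rounded.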