20 resultados para microarray data classification
em Reposit
Resumo:
Background: A common task in analyzing microarray data is to determine which genes are differentially expressed across two (or more) kind of tissue samples or samples submitted under experimental conditions. Several statistical methods have been proposed to accomplish this goal, generally based on measures of distance between classes. It is well known that biological samples are heterogeneous because of factors such as molecular subtypes or genetic background that are often unknown to the experimenter. For instance, in experiments which involve molecular classification of tumors it is important to identify significant subtypes of cancer. Bimodal or multimodal distributions often reflect the presence of subsamples mixtures. Consequently, there can be genes differentially expressed on sample subgroups which are missed if usual statistical approaches are used. In this paper we propose a new graphical tool which not only identifies genes with up and down regulations, but also genes with differential expression in different subclasses, that are usually missed if current statistical methods are used. This tool is based on two measures of distance between samples, namely the overlapping coefficient (OVL) between two densities and the area under the receiver operating characteristic (ROC) curve. The methodology proposed here was implemented in the open-source R software. Results: This method was applied to a publicly available dataset, as well as to a simulated dataset. We compared our results with the ones obtained using some of the standard methods for detecting differentially expressed genes, namely Welch t-statistic, fold change (FC), rank products (RP), average difference (AD), weighted average difference (WAD), moderated t-statistic (modT), intensity-based moderated t-statistic (ibmT), significance analysis of microarrays (samT) and area under the ROC curve (AUC). On both datasets all differentially expressed genes with bimodal or multimodal distributions were not selected by all standard selection procedures. We also compared our results with (i) area between ROC curve and rising area (ABCR) and (ii) the test for not proper ROC curves (TNRC). We found our methodology more comprehensive, because it detects both bimodal and multimodal distributions and different variances can be considered on both samples. Another advantage of our method is that we can analyze graphically the behavior of different kinds of differentially expressed genes. Conclusion: Our results indicate that the arrow plot represents a new flexible and useful tool for the analysis of gene expression profiles from microarrays.
Resumo:
Microarray allow to monitoring simultaneously thousands of genes, where the abundance of the transcripts under a same experimental condition at the same time can be quantified. Among various available array technologies, double channel cDNA microarray experiments have arisen in numerous technical protocols associated to genomic studies, which is the focus of this work. Microarray experiments involve many steps and each one can affect the quality of raw data. Background correction and normalization are preprocessing techniques to clean and correct the raw data when undesirable fluctuations arise from technical factors. Several recent studies showed that there is no preprocessing strategy that outperforms others in all circumstances and thus it seems difficult to provide general recommendations. In this work, it is proposed to use exploratory techniques to visualize the effects of preprocessing methods on statistical analysis of cancer two-channel microarray data sets, where the cancer types (classes) are known. For selecting differential expressed genes the arrow plot was used and the graph of profiles resultant from the correspondence analysis for visualizing the results. It was used 6 background methods and 6 normalization methods, performing 36 pre-processing methods and it was analyzed in a published cDNA microarray database (Liver) available at http://genome-www5.stanford.edu/ which microarrays were already classified by cancer type. All statistical analyses were performed using the R statistical software.
Resumo:
Many learning problems require handling high dimensional datasets with a relatively small number of instances. Learning algorithms are thus confronted with the curse of dimensionality, and need to address it in order to be effective. Examples of these types of data include the bag-of-words representation in text classification problems and gene expression data for tumor detection/classification. Usually, among the high number of features characterizing the instances, many may be irrelevant (or even detrimental) for the learning tasks. It is thus clear that there is a need for adequate techniques for feature representation, reduction, and selection, to improve both the classification accuracy and the memory requirements. In this paper, we propose combined unsupervised feature discretization and feature selection techniques, suitable for medium and high-dimensional datasets. The experimental results on several standard datasets, with both sparse and dense features, show the efficiency of the proposed techniques as well as improvements over previous related techniques.
Resumo:
Dissertação para a obtenção do grau de Mestre em Engenharia Electrotécnica Ramo de Energia
Resumo:
Chronic liver disease (CLD) is most of the time an asymptomatic, progressive, and ultimately potentially fatal disease. In this study, an automatic hierarchical procedure to stage CLD using ultrasound images, laboratory tests, and clinical records are described. The first stage of the proposed method, called clinical based classifier (CBC), discriminates healthy from pathologic conditions. When nonhealthy conditions are detected, the method refines the results in three exclusive pathologies in a hierarchical basis: 1) chronic hepatitis; 2) compensated cirrhosis; and 3) decompensated cirrhosis. The features used as well as the classifiers (Bayes, Parzen, support vector machine, and k-nearest neighbor) are optimally selected for each stage. A large multimodal feature database was specifically built for this study containing 30 chronic hepatitis cases, 34 compensated cirrhosis cases, and 36 decompensated cirrhosis cases, all validated after histopathologic analysis by liver biopsy. The CBC classification scheme outperformed the nonhierachical one against all scheme, achieving an overall accuracy of 98.67% for the normal detector, 87.45% for the chronic hepatitis detector, and 95.71% for the cirrhosis detector.
Resumo:
PURPOSE: Fatty liver disease (FLD) is an increasing prevalent disease that can be reversed if detected early. Ultrasound is the safest and ubiquitous method for identifying FLD. Since expert sonographers are required to accurately interpret the liver ultrasound images, lack of the same will result in interobserver variability. For more objective interpretation, high accuracy, and quick second opinions, computer aided diagnostic (CAD) techniques may be exploited. The purpose of this work is to develop one such CAD technique for accurate classification of normal livers and abnormal livers affected by FLD. METHODS: In this paper, the authors present a CAD technique (called Symtosis) that uses a novel combination of significant features based on the texture, wavelet transform, and higher order spectra of the liver ultrasound images in various supervised learning-based classifiers in order to determine parameters that classify normal and FLD-affected abnormal livers. RESULTS: On evaluating the proposed technique on a database of 58 abnormal and 42 normal liver ultrasound images, the authors were able to achieve a high classification accuracy of 93.3% using the decision tree classifier. CONCLUSIONS: This high accuracy added to the completely automated classification procedure makes the authors' proposed technique highly suitable for clinical deployment and usage.
Resumo:
In this work the identification and diagnosis of various stages of chronic liver disease is addressed. The classification results of a support vector machine, a decision tree and a k-nearest neighbor classifier are compared. Ultrasound image intensity and textural features are jointly used with clinical and laboratorial data in the staging process. The classifiers training is performed by using a population of 97 patients at six different stages of chronic liver disease and a leave-one-out cross-validation strategy. The best results are obtained using the support vector machine with a radial-basis kernel, with 73.20% of overall accuracy. The good performance of the method is a promising indicator that it can be used, in a non invasive way, to provide reliable information about the chronic liver disease staging.
Resumo:
In this work liver contour is semi-automatically segmented and quantified in order to help the identification and diagnosis of diffuse liver disease. The features extracted from the liver contour are jointly used with clinical and laboratorial data in the staging process. The classification results of a support vector machine, a Bayesian and a k-nearest neighbor classifier are compared. A population of 88 patients at five different stages of diffuse liver disease and a leave-one-out cross-validation strategy are used in the classification process. The best results are obtained using the k-nearest neighbor classifier, with an overall accuracy of 80.68%. The good performance of the proposed method shows a reliable indicator that can improve the information in the staging of diffuse liver disease.
Resumo:
Low noise surfaces have been increasingly considered as a viable and cost-effective alternative to acoustical barriers. However, road planners and administrators frequently lack information on the correlation between the type of road surface and the resulting noise emission profile. To address this problem, a method to identify and classify different types of road pavements was developed, whereby near field road noise is analyzed using statistical learning methods. The vehicle rolling sound signal near the tires and close to the road surface was acquired by two microphones in a special arrangement which implements the Close-Proximity method. A set of features, characterizing the properties of the road pavement, was extracted from the corresponding sound profiles. A feature selection method was used to automatically select those that are most relevant in predicting the type of pavement, while reducing the computational cost. A set of different types of road pavement segments were tested and the performance of the classifier was evaluated. Results of pavement classification performed during a road journey are presented on a map, together with geographical data. This procedure leads to a considerable improvement in the quality of road pavement noise data, thereby increasing the accuracy of road traffic noise prediction models.
Resumo:
This work addresses the present-day (<100 ka) mantle heterogeneity in the Azores region through the study of two active volcanic systems from Terceira Island. Our study shows that mantle heterogeneities are detectable even when "coeval" volcanic systems (Santa Barbara and Fissural) erupted less than 10 km away. These volcanic systems, respectively, reflect the influence of the Terceira and D. Joao de Castro Bank end-members defined by Beier et at (2008) for the Terceira Rift Santa Barbara magmas are interpreted to be the result of mixing between a HIMU-type component, carried to the upper mantle by the Azores plume, and the regional depleted MORB magmas/source. Fissural lavas are characterized by higher Ba/Nb and Nb/U ratios and less radiogenic Pb-206/Pb-204, Nd-143/Nd-144 and Hf-176/Hf-177, requiring the small contribution of delaminated sub-continental lithospheric mantle residing in the upper mantle. Published noble gas data on lavas from both volcanic systems also indicate the presence of a relatively undegassed component, which is interpreted as inherited from a lower mantle reservoir sampled by the ascending Azores plume. As inferred from trace and major elements, melting began in the garnet stability field, while magma extraction occurred within the spinel zone. The intra-volcanic system's chemical heterogeneity is mainly explained by variable proportions of the above-mentioned local end-members and by crystal fractionation processes. (C) 2011 Elsevier By. All rights reserved.
Resumo:
Chronic Liver Disease is a progressive, most of the time asymptomatic, and potentially fatal disease. In this paper, a semi-automatic procedure to stage this disease is proposed based on ultrasound liver images, clinical and laboratorial data. In the core of the algorithm two classifiers are used: a k nearest neighbor and a Support Vector Machine, with different kernels. The classifiers were trained with the proposed multi-modal feature set and the results obtained were compared with the laboratorial and clinical feature set. The results showed that using ultrasound based features, in association with laboratorial and clinical features, improve the classification accuracy. The support vector machine, polynomial kernel, outperformed the others classifiers in every class studied. For the Normal class we achieved 100% accuracy, for the chronic hepatitis with cirrhosis 73.08%, for compensated cirrhosis 59.26% and for decompensated cirrhosis 91.67%.
Resumo:
Steatosis, also known as fatty liver, corresponds to an abnormal retention of lipids within the hepatic cells and reflects an impairment of the normal processes of synthesis and elimination of fat. Several causes may lead to this condition, namely obesity, diabetes, or alcoholism. In this paper an automatic classification algorithm is proposed for the diagnosis of the liver steatosis from ultrasound images. The features are selected in order to catch the same characteristics used by the physicians in the diagnosis of the disease based on visual inspection of the ultrasound images. The algorithm, designed in a Bayesian framework, computes two images: i) a despeckled one, containing the anatomic and echogenic information of the liver, and ii) an image containing only the speckle used to compute the textural features. These images are computed from the estimated RF signal generated by the ultrasound probe where the dynamic range compression performed by the equipment is taken into account. A Bayes classifier, trained with data manually classified by expert clinicians and used as ground truth, reaches an overall accuracy of 95% and a 100% of sensitivity. The main novelties of the method are the estimations of the RF and speckle images which make it possible to accurately compute textural features of the liver parenchyma relevant for the diagnosis.
Resumo:
Dissertação para obtenção do grau de Mestre em Engenharia Informática
Resumo:
Trabalho de Projeto para obtenção do grau de Mestre em Engenharia Informática e de Computadores
Resumo:
Cluster analysis for categorical data has been an active area of research. A well-known problem in this area is the determination of the number of clusters, which is unknown and must be inferred from the data. In order to estimate the number of clusters, one often resorts to information criteria, such as BIC (Bayesian information criterion), MML (minimum message length, proposed by Wallace and Boulton, 1968), and ICL (integrated classification likelihood). In this work, we adopt the approach developed by Figueiredo and Jain (2002) for clustering continuous data. They use an MML criterion to select the number of clusters and a variant of the EM algorithm to estimate the model parameters. This EM variant seamlessly integrates model estimation and selection in a single algorithm. For clustering categorical data, we assume a finite mixture of multinomial distributions and implement a new EM algorithm, following a previous version (Silvestre et al., 2008). Results obtained with synthetic datasets are encouraging. The main advantage of the proposed approach, when compared to the above referred criteria, is the speed of execution, which is especially relevant when dealing with large data sets.