813 results for microarray data classification
Abstract:
Trichoepithelioma is a benign neoplasm that shares both clinical and histological features with basal cell carcinoma. It is important to distinguish these neoplasms because they differ in clinical behavior and require different therapeutic planning. Many studies have addressed the use of immunohistochemistry to improve the differential diagnosis of these tumors. These studies present conflicting results when addressing the same markers, probably owing to the small number of basaloid tumors that comprised their studies, which generally did not exceed 50 cases. We built a tissue microarray with 162 trichoepithelioma and 328 basal cell carcinoma biopsies and tested a panel of immune markers composed of CD34, CD10, epithelial membrane antigen, Bcl-2, cytokeratins 15 and 20, and D2-40. The results were analyzed using multiple linear and logistic regression models. This analysis revealed a model that could differentiate trichoepithelioma from basal cell carcinoma in 36% of the cases. The panel of immunohistochemical markers required to differentiate between these tumors was composed of CD10, cytokeratin 15, cytokeratin 20 and D2-40. The results obtained in this work were generated from a large number of biopsies and confirmed the overlapping epithelial and stromal immunohistochemical profiles of these basaloid tumors. The results also corroborate the point of view that trichoepithelioma and basal cell carcinoma tumors represent two different points in the differentiation of a single cell type. Despite the use of panels of immune markers, histopathological criteria associated with clinical data certainly remain the best guideline for the differential diagnosis of trichoepithelioma and basal cell carcinoma. Modern Pathology (2012) 25, 1345-1353; doi: 10.1038/modpathol.2012.96; published online 8 June 2012
Abstract:
In multi-label classification, examples can be associated with multiple labels simultaneously. The task of learning from multi-label data can be addressed by methods that transform the multi-label classification problem into several single-label classification problems. The binary relevance approach is one of these methods, where the multi-label learning task is decomposed into several independent binary classification problems, one for each label in the set of labels, and the final labels for each example are determined by aggregating the predictions from all binary classifiers. However, this approach fails to consider any dependency among the labels. Aiming to accurately predict label combinations, in this paper we propose a simple approach that enables the binary classifiers to discover existing label dependency by themselves. An experimental study using decision trees, a kernel method and Naive Bayes as base-learning techniques shows the potential of the proposed approach to improve multi-label classification performance.
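As a rough illustration of the binary relevance decomposition described above (scikit-learn-style estimators are assumed; this is not the authors' implementation), one binary classifier is trained per label and the individual predictions are aggregated into a label set:

```python
# Illustrative sketch of binary relevance (assumed scikit-learn API; not the paper's code).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class BinaryRelevance:
    """Train one independent binary classifier per label."""
    def __init__(self, base_factory=DecisionTreeClassifier):
        self.base_factory = base_factory
        self.models = []

    def fit(self, X, Y):
        # Y is an (n_samples, n_labels) binary indicator matrix.
        self.models = []
        for j in range(Y.shape[1]):
            clf = self.base_factory()
            clf.fit(X, Y[:, j])          # each label is learned in isolation
            self.models.append(clf)
        return self

    def predict(self, X):
        # Aggregate the independent predictions into a label-set matrix.
        return np.column_stack([clf.predict(X) for clf in self.models])

# Toy usage: 6 examples with 4 features and 3 labels.
X = np.random.rand(6, 4)
Y = np.random.randint(0, 2, size=(6, 3))
print(BinaryRelevance().fit(X, Y).predict(X))
```

One common way to let the binary classifiers see label dependency, offered here only as an illustration since the paper describes its own mechanism, is to augment the feature space of each binary problem with the predictions obtained for the other labels and retrain.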
Abstract:
Statistical methods have been widely employed to assess the capabilities of credit scoring classification models in order to reduce the risk of wrong decisions when granting credit facilities to clients. The predictive quality of a classification model can be evaluated based on measures such as sensitivity, specificity, predictive values, accuracy, correlation coefficients and information theoretical measures, such as relative entropy and mutual information. In this paper we analyze the performance of a naive logistic regression model (Hosmer & Lemeshow, 1989) and a logistic regression with state-dependent sample selection model (Cramer, 2004) applied to simulated data. Also, as a case study, the methodology is illustrated on a data set extracted from a Brazilian bank portfolio. Our simulation results so far revealed that there is no statistically significant difference in terms of predictive capacity between the naive logistic regression models and the logistic regression with state-dependent sample selection models. However, there is a strong difference between the distributions of the estimated default probabilities from these two statistical modeling techniques, with the naive logistic regression models always underestimating such probabilities, particularly in the presence of balanced samples. (C) 2012 Elsevier Ltd. All rights reserved.
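A minimal sketch of fitting a plain ("naive") logistic regression and computing some of the predictive-quality measures mentioned above; the data are simulated placeholders, not the Brazilian bank portfolio, and the state-dependent sample selection model is not shown:

```python
# Hedged sketch: naive logistic regression plus sensitivity/specificity/accuracy
# on synthetic credit data (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                      # three synthetic covariates
p = 1 / (1 + np.exp(-(0.5 * X[:, 0] - X[:, 1])))    # true default probabilities
y = rng.binomial(1, p)                              # 1 = default, 0 = non-default

model = LogisticRegression().fit(X, y)
pred = (model.predict_proba(X)[:, 1] >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
sensitivity = tp / (tp + fn)        # true positive rate
specificity = tn / (tn + fp)        # true negative rate
accuracy = (tp + tn) / len(y)
print(sensitivity, specificity, accuracy)
```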
Abstract:
Background: One goal of gene expression profiling is to identify signature genes that robustly distinguish different types or grades of tumors. Several tumor classifiers based on expression profiling have been proposed using the microarray technique. Due to important differences in the probabilistic models of microarray and SAGE technologies, it is important to develop suitable techniques to select specific genes from SAGE measurements. Results: A new framework to select specific genes that distinguish different biological states based on the analysis of SAGE data is proposed. The new framework applies the bolstered error for the identification of strong genes that separate the biological states in a feature space defined by the gene expression of a training set. Credibility intervals defined from a probabilistic model of SAGE measurements are used to identify the genes that distinguish the different states with more reliability among all gene groups selected by the strong genes method. A score that takes into account the credibility and bolstered error values is proposed in order to rank the considered gene groups. Results obtained using SAGE data from gliomas are presented, thus corroborating the introduced methodology. Conclusion: The model representing counting data, such as SAGE, provides additional statistical information that allows a more robust analysis. The additional statistical information provided by the probabilistic model is incorporated in the methodology described in the paper. The introduced method is suitable to identify signature genes that lead to a good separation of the biological states using SAGE and may be adapted for other counting methods such as Massive Parallel Signature Sequencing (MPSS) or the recent Sequencing-By-Synthesis (SBS) technique. Some of the genes identified by the proposed method may be useful for generating classifiers.
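As a generic illustration of the kind of credibility interval a probabilistic model of counting data can provide (assuming binomial sampling of tags and a uniform prior, which may differ from the exact model used in the paper), one can compare the posterior intervals for a gene's abundance in two biological states:

```python
# Illustrative credibility interval for a gene's tag proportion, assuming binomial
# sampling and a uniform Beta(1, 1) prior (the paper's exact SAGE model may differ).
from scipy.stats import beta

def credibility_interval(x, n, level=0.95):
    """Credibility interval for the true tag proportion given x tags out of n."""
    lo = (1 - level) / 2
    post = beta(x + 1, n - x + 1)          # posterior under a uniform prior
    return post.ppf(lo), post.ppf(1 - lo)

# A gene observed 12 times in 50,000 tags in state A and 35 times in 50,000 tags
# in state B: non-overlapping intervals suggest the gene separates the states.
print(credibility_interval(12, 50_000))
print(credibility_interval(35, 50_000))
```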
Abstract:
Background: Spotted cDNA microarrays generally employ co-hybridization of fluorescently-labeled RNA targets to produce gene expression ratios for subsequent analysis. Direct comparison of two RNA samples in the same microarray provides the highest level of accuracy; however, due to the number of combinatorial pair-wise comparisons, the direct method is impractical for studies including a large number of individual samples (e.g., tumor classification studies). For such studies, indirect comparisons using a common reference standard have been the preferred method. Here we evaluated the precision and accuracy of reconstructed ratios from three indirect methods relative to ratios obtained from direct hybridizations, herein considered as the gold-standard. Results: We performed hybridizations using a fixed amount of Cy3-labeled reference oligonucleotide (RefOligo) against distinct Cy5-labeled targets from prostate, breast and kidney tumor samples. Reconstructed ratios between all tissue pairs were derived from ratios between each tissue sample and RefOligo. Reconstructed ratios were compared to (i) ratios obtained in parallel from direct pair-wise hybridizations of tissue samples, and to (ii) reconstructed ratios derived from hybridization of each tissue against a reference RNA pool (RefPool). To evaluate the effect of the external references, reconstructed ratios were also calculated directly from intensity values of single-channel (One-Color) measurements derived from tissue sample data collected in the RefOligo experiments. We show that the average coefficient of variation of ratios between intra- and inter-slide replicates derived from RefOligo, RefPool and One-Color were similar and 2 to 4-fold higher than ratios obtained in direct hybridizations. Correlation coefficients calculated for all three tissue comparisons were also similar. In addition, the performance of all indirect methods in terms of their robustness to identify genes deemed as differentially expressed based on direct hybridizations, as well as false-positive and false-negative rates, was found to be comparable. Conclusion: RefOligo produces ratios as precise and accurate as ratios reconstructed from a RNA pool, thus representing a reliable alternative in reference-based hybridization experiments. In addition, One-Color measurements alone can reconstruct expression ratios without loss in precision or accuracy. We conclude that both methods are adequate options in large-scale projects where the amount of a common reference RNA pool is usually restrictive.
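The ratio reconstruction underlying the indirect designs follows from cancelling the common reference channel; a minimal sketch on the log2 scale (variable names are illustrative, and the study's normalization steps are omitted):

```python
# Minimal sketch of ratio reconstruction from indirect hybridizations.
import numpy as np

# log2 ratios of each tissue against the common reference (one value per gene)
log_prostate_vs_ref = np.array([1.2, -0.4, 0.0, 2.1])
log_breast_vs_ref   = np.array([0.3, -1.0, 0.5, 1.0])

# The reference cancels out:
# log2(prostate/breast) = log2(prostate/ref) - log2(breast/ref)
log_prostate_vs_breast = log_prostate_vs_ref - log_breast_vs_ref
print(log_prostate_vs_breast)
```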
Abstract:
Background: Smear negative pulmonary tuberculosis (SNPT) accounts for 30% of pulmonary tuberculosis cases reported yearly in Brazil. This study aimed to develop a prediction model for SNPT for outpatients in areas with scarce resources. Methods: The study enrolled 551 patients with clinical-radiological suspicion of SNPT, in Rio de Janeiro, Brazil. The original data was divided into two equivalent samples for generation and validation of the prediction models. Symptoms, physical signs and chest X-rays were used for constructing logistic regression and classification and regression tree models. From the logistic regression, we generated a clinical and radiological prediction score. The area under the receiver operating characteristic curve, sensitivity, and specificity were used to evaluate the model's performance in both generation and validation samples. Results: It was possible to generate predictive models for SNPT with sensitivity ranging from 64% to 71% and specificity ranging from 58% to 76%. Conclusion: The results suggest that those models might be useful as screening tools for estimating the risk of SNPT, optimizing the utilization of more expensive tests, and avoiding costs of unnecessary anti-tuberculosis treatment. Those models might be cost-effective tools in a health care network with hierarchical distribution of scarce resources.
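A hedged sketch of how a clinical-radiological score can be derived from logistic-regression coefficients and evaluated with the area under the ROC curve; the predictors and weights below are synthetic placeholders, not the study's variables:

```python
# Illustrative score construction from logistic-regression coefficients (not the study's model).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(551, 5))               # binary symptoms / signs / X-ray findings
w = np.array([1.0, 0.8, 0.5, -0.3, 0.9])            # arbitrary "true" effects
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ w - 1)))) # simulated SNPT outcome

model = LogisticRegression().fit(X, y)
points = np.round(model.coef_[0] * 2).astype(int)   # integer points per predictor
score = X @ points                                  # clinical-radiological score per patient
print("points per predictor:", points)
print("AUC of the score:", roc_auc_score(y, score))
```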
Abstract:
Background: Oral squamous cell carcinoma (OSCC) is a frequent neoplasm, which is usually aggressive and has unpredictable biological behavior and unfavorable prognosis. The comprehension of the molecular basis of this variability should lead to the development of targeted therapies as well as to improvements in specificity and sensitivity of diagnosis. Results: Samples of primary OSCCs and their corresponding surgical margins were obtained from male patients during surgery and their gene expression profiles were screened using whole-genome microarray technology. Hierarchical clustering and Principal Components Analysis were used for data visualization and One-way Analysis of Variance was used to identify differentially expressed genes. Samples clustered mostly according to disease subsite, suggesting molecular heterogeneity within tumor stages. In order to corroborate our results, two publicly available datasets of microarray experiments were assessed. We found significant molecular differences between OSCC anatomic subsites concerning groups of genes presently or potentially important for drug development, including mRNA processing, cytoskeleton organization and biogenesis, metabolic process, cell cycle and apoptosis. Conclusion: Our results corroborate literature data on molecular heterogeneity of OSCCs. Differences between disease subsites and among samples belonging to the same TNM class highlight the importance of gene expression-based classification and challenge the development of targeted therapies.
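An illustrative sketch of the analysis steps named above (hierarchical clustering, PCA, and one-way ANOVA per gene) on a synthetic expression matrix; the thresholds and preprocessing are placeholders, not the study's pipeline:

```python
# Hedged sketch: clustering, PCA and per-gene ANOVA on simulated expression data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import f_oneway
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
expr = rng.normal(size=(20, 500))                    # 20 samples x 500 genes
subsite = np.array([0] * 10 + [1] * 10)              # two anatomic subsites

pcs = PCA(n_components=2).fit_transform(expr)        # coordinates for visualization
clusters = fcluster(linkage(expr, method="average"), t=2, criterion="maxclust")

# One-way ANOVA per gene between subsites; the p-value cutoff is arbitrary.
pvals = np.array([f_oneway(expr[subsite == 0, g], expr[subsite == 1, g]).pvalue
                  for g in range(expr.shape[1])])
print("differentially expressed genes:", np.sum(pvals < 0.01))
```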
Abstract:
Background: From shotgun libraries used for the genomic sequencing of the phytopathogenic bacterium Xanthomonas axonopodis pv. citri (XAC), clones that were representative of the largest possible number of coding sequences (CDSs) were selected to create a DNA microarray platform on glass slides (XACarray). The creation of the XACarray allowed for the establishment of a tool that is capable of providing data for the analysis of global genome expression in this organism. Findings: The inserts from the selected clones were amplified by PCR with the universal oligonucleotide primers M13R and M13F. The obtained products were purified and fixed in duplicate on glass slides specific for use in DNA microarrays. The number of spots on the microarray totaled 6,144 and included 768 positive controls and 624 negative controls per slide. Validation of the platform was performed through hybridization of total DNA probes from XAC labeled with different fluorophores, Cy3 and Cy5. In this validation assay, 86% of all PCR products fixed on the glass slides were confirmed to present a hybridization signal greater than twice the standard deviation of the global median signal-to-noise ratio. Conclusions: Our validation of the XACarray platform using DNA-DNA hybridization revealed that it can be used to evaluate the expression of 2,365 individual CDSs from all major functional categories, which corresponds to 52.7% of the annotated CDSs of the XAC genome. As a proof of concept, we used this platform in a previous work to verify the absence of genomic regions that could not be detected by sequencing in related strains of Xanthomonas.
Abstract:
CONTEXT AND OBJECTIVE: Epidemiology may help educators to face the challenge of establishing content guidelines for the curricula in medical schools. The aim was to develop learning objectives for a medical curriculum from an epidemiology database. DESIGN AND SETTING: Descriptive study assessing morbidity and mortality data, conducted in a private university in São Paulo. METHODS: An epidemiology database was used, with mortality and morbidity recorded as summaries of deaths and the World Health Organization's Disability-Adjusted Life Year (DALY). The scoring took into consideration probabilities for mortality and morbidity. RESULTS: The scoring presented a classification of health conditions to be used by a curriculum design committee, taking into consideration its highest and lowest quartiles, which corresponded respectively to the highest and lowest impact on morbidity and mortality. Data from three countries were used for international comparison and showed distinct results. The resulting scores indicated topics to be developed through educational taxonomy. CONCLUSION: The frequencies of the health conditions and their statistical treatment made it possible to identify topics that should be fully developed within medical education. The classification also suggested boundaries between topics that should be developed in depth, including knowledge and development of skills and attitudes, and topics that can be presented concisely at the level of knowledge.
Abstract:
One of the problems in the analysis of nucleus-nucleus collisions is to get information on the value of the impact parameter b. This work consists in the application of pattern recognition techniques aimed at associating values of b to groups of events. To this end, a support vector machine (SVM) classifier is adopted to analyze multifragmentation reactions. This method allows backtracing the values of b through a particular multidimensional analysis. The SVM classification consists of two main phases. In the first one, known as the training phase, the classifier learns to discriminate the events that are generated by two different models: Classical Molecular Dynamics (CMD) and Heavy-Ion Phase-Space Exploration (HIPSE), for the reaction 58Ni + 48Ca at 25 AMeV. In the second one, known as the test phase, what has been learned is tested on new events generated by the same models. These new results have been compared to the ones obtained through other techniques of backtracing the impact parameter. Our tests show that, following this approach, the central collisions and peripheral collisions for the CMD events are always better classified than with the other backtracing techniques. We have finally performed the SVM classification on the experimental data measured by the NUCL-EX collaboration with the CHIMERA apparatus for the aforementioned reaction.
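A minimal sketch of the two-phase SVM procedure (training on model-generated events, then testing on new events from the same models); the observables and impact-parameter classes below are synthetic stand-ins, not CMD or HIPSE output:

```python
# Hedged sketch: SVM trained on simulated events, evaluated on held-out events.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 6))                      # multidimensional event observables
b = rng.uniform(0, 10, size=2000)                   # impact parameter of each simulated event
y = np.digitize(b, bins=[3.0, 7.0])                 # 0 = central, 1 = mid, 2 = peripheral

# Training phase: the classifier learns from model-generated events.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)).fit(X_train, y_train)

# Test phase: what was learned is checked on new events from the same models.
print("test accuracy:", clf.score(X_test, y_test))
```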
Abstract:
This report describes an innovative satellite-based monitoring approach applied to the Iraqi Marshlands to survey the extent and distribution of marshland re-flooding and to assess the development of wetland vegetation cover. The study, conducted in collaboration with MEEO Srl, makes use of images collected by the (A)ATSR sensor onboard the ESA ENVISAT satellite at multi-temporal scales, and an analysis was adopted to observe the evolution of marshland re-flooding. The methodology uses a multi-temporal pixel-based approach based on classification maps produced by the classification tool SOIL MAPPER ®. The catalogue of the classification maps is available as a web service through the Service Support Environment Portal (SSE, supported by ESA). The inundation of the Iraqi marshlands, which has been continuous since April 2003, is characterized by a high degree of variability, ad-hoc interventions and uncertainty. Given the security constraints and vastness of the Iraqi marshlands, as well as cost-effectiveness considerations, satellite remote sensing was the only viable tool to observe the changes taking place on a continuous basis. The proposed system (ALCS – AATSR LAND CLASSIFICATION SYSTEM) avoids the direct use of the (A)ATSR images and foresees the application of LULCC evolution models directly to a 'stock' of classified maps. This approach is made possible by the availability of a 13-year classified image database, conceived and implemented in the CARD project (http://earth.esa.int/rtd/Projects/#CARD). The approach presented here evolves toward an innovative, efficient and fast method to exploit the potential of multi-temporal LULCC analysis of (A)ATSR images. The two main objectives of this work are both linked to assessment: the first is to assess the modeling capability of the ALCS web application using AATSR images classified with SOIL MAPPER ®, and the second is to evaluate the magnitude, character and extent of wetland rehabilitation.
Abstract:
Machine learning comprises a series of techniques for automatic extraction of meaningful information from large collections of noisy data. In many real world applications, data is naturally represented in structured form. Since traditional methods in machine learning deal with vectorial information, they require an a priori form of preprocessing. Among all the learning techniques for dealing with structured data, kernel methods are recognized to have a strong theoretical background and to be effective approaches. They do not require an explicit vectorial representation of the data in terms of features, but rely on a measure of similarity between any pair of objects of a domain, the kernel function. Designing fast and good kernel functions is a challenging problem. In the case of tree structured data two issues become relevant: kernels for trees should not be sparse and should be fast to compute. The sparsity problem arises when, given a dataset and a kernel function, most structures of the dataset are completely dissimilar to one another. In those cases the classifier has too little information to make correct predictions on unseen data. In fact, it tends to produce a discriminating function behaving as the nearest neighbour rule. Sparsity is likely to arise for some standard tree kernel functions, such as the subtree and subset tree kernels, when they are applied to datasets with node labels belonging to a large domain. A second drawback of using tree kernels is the time complexity required both in the learning and classification phases. Such a complexity can sometimes prevent the application of the kernel in scenarios involving large amounts of data. This thesis proposes three contributions for resolving the above issues of kernels for trees. A first contribution aims at creating kernel functions which adapt to the statistical properties of the dataset, thus reducing its sparsity with respect to traditional tree kernel functions. Specifically, we propose to encode the input trees by an algorithm able to project the data onto a lower dimensional space with the property that similar structures are mapped similarly. By building kernel functions on the lower dimensional representation, we are able to perform inexact matchings between different inputs in the original space. A second contribution is the proposal of a novel kernel function based on the convolution kernel framework. Convolution kernels measure the similarity of two objects in terms of the similarities of their subparts. Most convolution kernels are based on counting the number of shared substructures, partially discarding information about their position in the original structure. The kernel function we propose is, instead, especially focused on this aspect. A third contribution is devoted to reducing the computational burden related to the calculation of a kernel function between a tree and a forest of trees, which is a typical operation in the classification phase and, for some algorithms, also in the learning phase. We propose a general methodology applicable to convolution kernels. Moreover, we show an instantiation of our technique when kernels such as the subtree and subset tree kernels are employed. In those cases, Directed Acyclic Graphs can be used to compactly represent shared substructures in different trees, thus reducing the computational burden and storage requirements.
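For illustration, a didactic version of the subtree kernel mentioned above, which counts identical proper subtrees (a node together with all of its descendants) shared by two trees; the thesis' optimized DAG-based formulation is not reproduced here:

```python
# Didactic subtree (ST) kernel: count matching proper subtrees between two trees.
from collections import Counter

def serialize(tree):
    """Canonical string for a tree given as (label, [children])."""
    label, children = tree
    return f"({label}{''.join(serialize(c) for c in children)})"

def all_subtrees(tree):
    """Counter of the serializations of every proper subtree rooted in the tree."""
    label, children = tree
    counts = Counter([serialize(tree)])
    for child in children:
        counts += all_subtrees(child)
    return counts

def subtree_kernel(t1, t2):
    """Number of matching subtree pairs between t1 and t2."""
    c1, c2 = all_subtrees(t1), all_subtrees(t2)
    return sum(c1[s] * c2[s] for s in c1.keys() & c2.keys())

# Toy parse-tree-like structures
a = ("S", [("NP", [("D", []), ("N", [])]), ("VP", [("V", [])])])
b = ("S", [("NP", [("D", []), ("N", [])]), ("VP", [("V", []), ("NP", [])])])
print(subtree_kernel(a, b))   # shared subtrees: (NP(D)(N)), (D), (N), (V) -> 4
```

With large label domains most subtree strings never match, which is exactly the sparsity problem discussed above.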
Abstract:
This study provides a comprehensive genetic overview of the endangered Italian wolf population. In particular, it focuses on two research lines. On one hand, we focused on melanism in the wolf in order to isolate a mutation related to black coat colour in canids. With several reported black individuals (an exception at the European level), the Italian wolf population constituted a challenging research field posing many unanswered questions. As found in the North American wolf, we reported that melanism in the Italian population is caused by a different melanocortin pathway component, the K locus, in which a beta-defensin protein acts as an alternative ligand for the Mc1r. This research project was conducted in collaboration with Prof. Gregory Barsh, Department of Genetics and Paediatrics, Stanford University. On the other hand, we performed analysis of a high number of SNPs thanks to a customized Canine microarray, useful for integrating or replacing the STR markers for genotyping individuals and detecting wolf-dog hybrids. Thanks to DNA microchip technology, we obtained an impressive amount of genetic data, which provides a solid base for future functional genomic studies. This study was undertaken in collaboration with Prof. Robert K. Wayne, Department of Ecology and Evolutionary Biology, University of California, Los Angeles (UCLA).
Abstract:
The purpose of this Thesis is to develop a robust and powerful method to classify galaxies from large surveys, in order to establish and confirm the connections between the principal observational parameters of the galaxies (spectral features, colours, morphological indices), and help unveil the evolution of these parameters from $z \sim 1$ to the local Universe. Within the framework of the zCOSMOS-bright survey, and making use of its large database of objects ($\sim 10\,000$ galaxies in the redshift range $0 < z \lesssim 1.2$) and its great reliability in redshift and spectral property determinations, we first adopt and extend the \emph{classification cube method}, as developed by Mignoli et al. (2009), to exploit the bimodal properties of galaxies (spectral, photometric and morphological) separately, and then combine these three subclassifications. We use this classification method as a test for a newly devised statistical classification, based on Principal Component Analysis and the Unsupervised Fuzzy Partition clustering method (PCA+UFP), which is able to define the galaxy population exploiting its natural global bimodality, considering simultaneously up to 8 different properties. The PCA+UFP analysis is a very powerful and robust tool to probe the nature and the evolution of galaxies in a survey. It allows the classification of galaxies to be defined with less uncertainty, adding the flexibility to be adapted to different parameters: being a fuzzy classification it avoids the problems due to a hard classification, such as the classification cube presented in the first part of the article. The PCA+UFP method can be easily applied to different datasets: it does not rely on the nature of the data and for this reason it can be successfully employed with other observables (magnitudes, colours) or derived properties (masses, luminosities, SFRs, etc.). The agreement between the two classification cluster definitions is very high. ``Early'' and ``late'' type galaxies are well defined by the spectral, photometric and morphological properties, both when considering them separately and then combining the classifications (classification cube) and when treating them as a whole (PCA+UFP cluster analysis). Differences arise in the definition of outliers: the classification cube is much more sensitive to single measurement errors or misclassifications in one property than the PCA+UFP cluster analysis, in which errors are ``averaged out'' during the process. This method allowed us to observe the \emph{downsizing} effect taking place in the PC spaces: the migration from the blue cloud towards the red clump happens at higher redshifts for galaxies of larger mass. The determination of $M_{\mathrm{cross}}$, the transition mass, is in good agreement with other values in the literature.
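A rough sketch of the PCA plus fuzzy clustering idea, with a generic fuzzy c-means standing in for the UFP algorithm actually used and simulated galaxy properties instead of zCOSMOS data:

```python
# Hedged sketch: PCA compression of galaxy properties followed by fuzzy clustering.
import numpy as np
from sklearn.decomposition import PCA

def fuzzy_cmeans(X, c=2, m=2.0, n_iter=100, seed=0):
    """Return cluster centers and soft memberships (n_samples x c)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))
        U /= U.sum(axis=1, keepdims=True)           # fuzzy memberships sum to 1
    return centers, U

rng = np.random.default_rng(4)
# 8 observed properties per galaxy (colours, spectral indices, morphology, ...)
galaxies = np.vstack([rng.normal(0, 1, (300, 8)), rng.normal(3, 1, (300, 8))])

pcs = PCA(n_components=3).fit_transform(galaxies)    # compress the 8 properties
centers, memberships = fuzzy_cmeans(pcs, c=2)        # soft "early"/"late" split
print(memberships[:5].round(2))                      # fuzzy membership of first galaxies
```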
Abstract:
A study of maar-diatreme volcanoes has been performed by inversion of gravity and magnetic data. The geophysical inverse problem has been solved by means of the damped nonlinear least-squares method. To ensure stability and convergence of the solution of the inverse problem, a mathematical tool, consisting of data weighting and model scaling, has been worked out. Theoretical gravity and magnetic modeling of maar-diatreme volcanoes has been conducted in order to get information that is used for a simple, rough qualitative and/or quantitative interpretation. The information also serves as a priori information to design models for the inversion and/or to assist the interpretation of inversion results. The results of theoretical modeling have been used to roughly estimate the heights and the dip angles of the walls of eight Eifel maar-diatremes — each taken as a whole. Inverse modeling has been conducted for the Schönfeld Maar (magnetics) and the Hausten-Morswiesen Maar (gravity and magnetics). The geometrical parameters of these maars, as well as the density and magnetic properties of the rocks filling them, have been estimated. For a reliable interpretation of the inversion results, beside the knowledge from theoretical modeling, other tools such as field transformations and spectral analysis were used for complementary information. Geologic models, based on the synthesis of the respective interpretation results, are presented for the two maars mentioned above. The results gave more insight into the genesis, physics and posteruptive development of the maar-diatreme volcanoes. A classification of the maar-diatreme volcanoes into three main types has been elaborated. Relatively high magnetic anomalies are indicative of scoria cones embedded within maar-diatremes if they are not caused by a strong remanent component of the magnetization. Smaller (weaker) secondary gravity and magnetic anomalies on the background of the main anomaly of a maar-diatreme — especially in the boundary areas — are indicative of subsidence processes, which probably occurred in the late sedimentation phase of the posteruptive development. Contrary to postulates referring to kimberlite pipes, there is no generalized systematic relationship between diameter and height, nor between geophysical anomaly and the dimensions of the maar-diatreme volcanoes. Although both maar-diatreme volcanoes and kimberlite pipes are products of phreatomagmatism, they probably formed in different thermodynamic and hydrogeological environments. In the case of kimberlite pipes, large amounts of magma and groundwater, certainly supplied by deep and large reservoirs, interacted under high pressure and temperature conditions. This led to a prolonged phreatomagmatic process and hence to the formation of large structures. Concerning the maar-diatreme and tuff-ring-diatreme volcanoes, the phreatomagmatic process takes place due to an interaction between magma from small and shallow magma chambers (probably segregated magmas) and small amounts of near-surface groundwater under low pressure and temperature conditions. This leads to shorter eruptions and consequently to structures of smaller size in comparison with kimberlite pipes. Nevertheless, the results show that the diameter to height ratio for 50% of the studied maar-diatremes is around 1, whereby the dip angle of the diatreme walls is similar to that of the kimberlite pipes and lies between 70 and 85°.
Note that these numerical characteristics, especially the dip angle, hold for the maars whose diatremes, as estimated by modeling, have the shape of a truncated cone. This indicates that the diatreme cannot be completely resolved by inversion.
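As a hedged illustration of a damped nonlinear least-squares inversion with data weighting (a generic Levenberg-Marquardt-style update applied to a toy forward model, not the gravity and magnetic modeling code used in this work):

```python
# Illustrative damped nonlinear least-squares inversion on a toy anomaly model.
import numpy as np

def invert(forward, jacobian, d_obs, m0, weights, damping=1.0, n_iter=20):
    """Iteratively update model m to fit the weighted observed data d_obs."""
    W = np.diag(weights)                     # data weighting matrix
    m = m0.copy()
    for _ in range(n_iter):
        r = d_obs - forward(m)               # data residuals
        J = jacobian(m)
        # Damped normal equations: (J^T W J + damping * I) dm = J^T W r
        A = J.T @ W @ J + damping * np.eye(len(m))
        m = m + np.linalg.solve(A, J.T @ W @ r)
    return m

# Toy forward model: anomaly(x) = amplitude * exp(-x^2 / (2 * width^2))
x = np.linspace(-5, 5, 50)
forward = lambda m: m[0] * np.exp(-x**2 / (2 * m[1]**2))
jacobian = lambda m: np.column_stack([
    np.exp(-x**2 / (2 * m[1]**2)),                            # d/d(amplitude)
    m[0] * np.exp(-x**2 / (2 * m[1]**2)) * x**2 / m[1]**3,     # d/d(width)
])

true_m = np.array([10.0, 1.5])
d_obs = forward(true_m) + np.random.default_rng(5).normal(0, 0.1, x.size)
print(invert(forward, jacobian, d_obs, m0=np.array([5.0, 1.0]), weights=np.ones(x.size)))
```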