899 results for Information Filtering, Pattern Mining, Relevance Feature Discovery, Text Mining
Abstract:
Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)
Abstract:
The growing volume of collected spatial data has motivated the development of geovisualisation techniques, which aim to support knowledge extraction and decision making. One of these techniques is the 3D graph, which provides a dynamic and flexible way to enrich the analysis of results obtained by spatial data mining algorithms, particularly when several georeferenced objects occur at the same location. The original contribution of this work is to enhance the visual resources of a computational environment for spatial data mining; the efficiency of these techniques is then demonstrated on a real database. The application proved very useful for interpreting the results obtained, such as patterns occurring at the same locality, and for supporting activities that can be carried out on the basis of the visualised results. © 2013 Springer-Verlag.
Abstract:
Feature selection aims to find the most important information in order to save computational effort and data storage. We formulate this task as a combinatorial optimization problem, since the exponential growth of possible solutions makes an exhaustive search infeasible. In this work, we propose a new nature-inspired feature selection technique based on bat behavior, namely the binary bat algorithm. The wrapper approach combines the exploration power of the bats with the speed of the optimum-path forest classifier to find a better data representation. Experiments on public datasets have shown that the proposed technique can indeed improve the effectiveness of the optimum-path forest and outperform some well-known swarm-based techniques. © 2013 Elsevier Inc. All rights reserved.
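The wrapper idea described above can be sketched as follows. This is a minimal, simplified sketch and not the authors' implementation: it omits the loudness and pulse-rate mechanics of the full bat algorithm, the velocity update is written so that each bat is pulled toward the best mask found so far, and `toy_fitness` is a hypothetical stand-in for the optimum-path forest accuracy that the abstract names as the wrapper objective. All function and parameter names are assumptions.

```python
import math
import random


def binary_bat_select(fitness, n_features, n_bats=10, n_iter=30,
                      f_min=0.0, f_max=2.0, v_max=6.0, seed=0):
    """Simplified binary bat search over feature masks (lists of 0/1)."""
    rng = random.Random(seed)
    positions = [[rng.randint(0, 1) for _ in range(n_features)]
                 for _ in range(n_bats)]
    velocities = [[0.0] * n_features for _ in range(n_bats)]
    scores = [fitness(p) for p in positions]
    best_idx = max(range(n_bats), key=lambda i: scores[i])
    best, best_score = positions[best_idx][:], scores[best_idx]

    for _ in range(n_iter):
        for i in range(n_bats):
            # Each bat draws a random frequency that scales its velocity step.
            freq = f_min + (f_max - f_min) * rng.random()
            candidate = []
            for j in range(n_features):
                # Assumed update: move toward the best mask, clamp the velocity.
                v = velocities[i][j] + (best[j] - positions[i][j]) * freq
                v = max(-v_max, min(v_max, v))
                velocities[i][j] = v
                # Sigmoid transfer function turns velocity into P(bit = 1).
                prob_one = 1.0 / (1.0 + math.exp(-v))
                candidate.append(1 if rng.random() < prob_one else 0)
            cand_score = fitness(candidate)
            if cand_score >= scores[i]:
                positions[i], scores[i] = candidate, cand_score
            if cand_score > best_score:
                best, best_score = candidate[:], cand_score
    return best, best_score


def toy_fitness(mask):
    """Hypothetical objective: reward features 0 and 2, penalise mask size."""
    return mask[0] + mask[2] - 0.1 * sum(mask)
```

In a real wrapper, `fitness` would train and validate a classifier on the features selected by the mask; calling `binary_bat_select(toy_fitness, n_features=6)` returns the best mask found and its score.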
Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
Abstract:
Graduate Program in Electrical Engineering - FEIS
Abstract:
Ultrasonography has an inherent noise pattern, called speckle, which is known to hamper object recognition for both humans and computers. Speckle noise is produced by the mutual interference of a set of scattered wavefronts. Depending on the phase of the wavefronts, the interference may be constructive or destructive, which results in brighter or darker pixels, respectively. We propose a filter that minimizes noise fluctuation while simultaneously preserving local gray level information. It is based on steps to attenuate the destructive and constructive interference present in ultrasound images. This filter, called interference-based speckle filter followed by anisotropic diffusion (ISFAD), was developed to remove speckle texture from B-mode ultrasound images while preserving the edges and the gray level of the region. The performance of ISFAD was compared with that of 10 other filters. The evaluation was based on their application to images simulated by Field II (developed by Jensen et al.); the proposed filter presented the greatest structural similarity, 0.95. Functional improvement of the segmentation task was also measured by comparing true positive rate, false positive rate and accuracy. Using three different segmentation techniques, ISFAD also presented the best accuracy (greater than 90% for structures with well-defined borders). (E-mail: fernando.okara@gmail.com) © 2012 World Federation for Ultrasound in Medicine & Biology.
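The interference mechanism described above can be illustrated with a toy phasor sum. This is only a sketch of the physics under simplifying assumptions (unit-amplitude scatterers contributing to a single pixel); it is not the ISFAD filter, and the function name is hypothetical.

```python
import cmath
import math
import random


def resultant_amplitude(phases):
    """Magnitude of the sum of unit-amplitude wavefronts with given phases."""
    return abs(sum(cmath.exp(1j * phase) for phase in phases))


# In-phase wavefronts add up (constructive interference, brighter pixel).
bright = resultant_amplitude([0.0, 0.0, 0.0, 0.0])

# Opposite phases cancel (destructive interference, darker pixel).
dark = resultant_amplitude([0.0, math.pi, 0.0, math.pi])

# Random phases give amplitudes scattered between these two extremes,
# which shows up as granular speckle texture from pixel to pixel.
rng = random.Random(0)
speckle = [
    resultant_amplitude([rng.uniform(0, 2 * math.pi) for _ in range(4)])
    for _ in range(5)
]
```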
Abstract:
Background: Genome-wide association studies (GWAS) require large sample sizes to obtain adequate statistical power, but it may be possible to increase the power by incorporating complementary data. In this study we investigated the feasibility of automatically retrieving information from the medical literature and leveraging this information in GWAS. Methods: We developed a method, Adjusting Association Priors with Text (AdAPT), that searches through PubMed abstracts for pre-assigned keywords and key concepts and uses this information to assign a prior probability of association between each single nucleotide polymorphism (SNP) and the phenotype of interest. Association results from a GWAS can subsequently be ranked in the context of these priors using the Bayes False Discovery Probability (BFDP) framework. We initially tested AdAPT by comparing rankings of known susceptibility alleles in a previous lung cancer GWAS, and subsequently applied it in a two-phase GWAS of oral cancer. Results: Known lung cancer susceptibility SNPs were consistently ranked higher by AdAPT BFDPs than by p-values. In the oral cancer GWAS, we sought to replicate the top five SNPs as ranked by AdAPT BFDPs, of which rs991316, located in the ADH gene region of 4q23, displayed a statistically significant association with oral cancer risk in the replication phase (per-rare-allele log additive p-value [p(trend)] = 2.5 × 10⁻³). The combined OR for having one additional rare allele was 0.83 (95% CI: 0.76-0.90), and this association was independent of previously identified susceptibility SNPs in this gene region that are associated with overall upper aerodigestive tract (UADT) cancer. We also investigated whether rs991316 was associated with other cancers of the UADT, but no additional association signal was found.
Conclusion: This study highlights the potential utility of systematically incorporating prior knowledge from the medical literature in genome-wide analyses using the AdAPT methodology. AdAPT is available online (url: http://services.gate.ac.uk/lld/gwas/service/config).
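The ranking step can be sketched numerically. The snippet below is a minimal sketch, assuming the asymptotic approximate Bayes factor commonly used with the BFDP framework (a zero-mean normal prior with variance `prior_var` on the log odds ratio). The function name, the value of `prior_var`, and the example SNPs and their priors are hypothetical and are not taken from the study. It shows how a higher literature-derived prior lowers the BFDP for the same association evidence.

```python
import math


def bfdp(log_or, std_err, prior_assoc, prior_var=0.04):
    """Bayes false discovery probability for one SNP.

    log_or      -- estimated log odds ratio
    std_err     -- standard error of the estimate
    prior_assoc -- prior probability that the SNP is associated (0 < p < 1)
    prior_var   -- assumed variance of the normal prior on the log odds ratio
    """
    v = std_err ** 2
    z_sq = (log_or ** 2) / v
    shrink = prior_var / (v + prior_var)
    # Approximate Bayes factor in favour of the null hypothesis.
    abf_null = math.sqrt((v + prior_var) / v) * math.exp(-0.5 * z_sq * shrink)
    prior_odds_null = (1.0 - prior_assoc) / prior_assoc
    return abf_null * prior_odds_null / (1.0 + abf_null * prior_odds_null)


# Hypothetical SNPs with identical association evidence but different priors.
snps = {
    "snp_with_literature_support": bfdp(-0.19, 0.05, prior_assoc=0.05),
    "snp_without_support": bfdp(-0.19, 0.05, prior_assoc=0.0001),
}
ranking = sorted(snps, key=snps.get)  # lowest BFDP (most noteworthy) first
```

Ranking by p-value alone would tie these two SNPs; the text-derived prior is what separates them here.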
Abstract:
In this paper, we experimentally study the combination of face and facial feature detectors to improve face detection performance. The face detection problem, as suggested by recent face detection challenges, is still not solved. Face detectors traditionally fail in large-scale problems and/or when the face is occluded or different head rotations are present. The combination of face and facial feature detectors is evaluated with a public database. The results obtained show an improvement in the positive detection rate together with a reduction in the false detection rate. Additionally, we show that the integration of facial feature detectors provides useful information for pose estimation and face alignment.
Abstract:
In the past decade, the advent of efficient genome sequencing tools and high-throughput experimental biotechnology has led to enormous progress in the life sciences. Among the most important innovations is microarray technology. It allows the expression of thousands of genes to be quantified simultaneously by measuring the hybridization from a tissue of interest to probes on a small glass or plastic slide. The characteristics of these data include a fair amount of random noise, a predictor dimension in the thousands, and a sample size in the dozens. One of the most exciting areas to which microarray technology has been applied is the challenge of deciphering complex diseases such as cancer. In these studies, samples are taken from two or more groups of individuals with heterogeneous phenotypes, pathologies, or clinical outcomes. These samples are hybridized to microarrays in an effort to find a small number of genes that are strongly correlated with the group of individuals. Even though methods for analysing the data are now well developed and close to reaching a standard organization (through the efforts of international projects such as the Microarray Gene Expression Data (MGED) Society [1]), it is not infrequent to come across a clinician's question for which no compelling statistical method is available. The contribution of this dissertation to deciphering disease is the development of new approaches aimed at handling open problems posed by clinicians in specific experimental designs. In Chapter 1, starting from a necessary biological introduction, we review microarray technologies and all the important steps involved in an experiment, from the production of the array, through quality controls, to the preprocessing steps that will be used in the data analysis in the rest of the dissertation.
Chapter 2 provides a critical review of standard analysis methods, stressing most of the problems that [...]. Chapter 3 introduces a method to address the issue of unbalanced design in microarray experiments. In microarray experiments, experimental design is a crucial starting point for obtaining reasonable results. In a two-class problem, an equal or similar number of samples should be collected for the two classes. However, in some cases, e.g. rare pathologies, the approach to be taken is less evident. We propose to address this issue by applying a modified version of SAM [2]. MultiSAM consists of a reiterated application of SAM analysis, comparing the less populated class (LPC) with 1,000 random samplings of the same size from the more populated class (MPC). A list of differentially expressed genes is generated for each SAM application. After 1,000 reiterations, each probe is given a "score" ranging from 0 to 1,000 based on its recurrence as differentially expressed in the 1,000 lists. The performance of MultiSAM was compared with that of SAM and LIMMA [3] on two data sets simulated from beta and exponential distributions. The results of all three algorithms on low-noise data sets seem acceptable. However, on a real unbalanced two-channel data set regarding chronic lymphocytic leukemia, LIMMA finds no significant probe, SAM finds 23 significantly changed probes but cannot separate the two classes, while MultiSAM finds 122 probes with score >300 and separates the data into two clusters by hierarchical clustering. We also report extra-assay validation in terms of differentially expressed genes. Although standard algorithms perform well on low-noise simulated data sets, MultiSAM seems to be the only one able to reveal subtle differences in gene expression profiles on real unbalanced data. Chapter 4 describes a method to address the evaluation of similarities in a three-class problem by means of the Relevance Vector Machine [4].
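The resampling scheme described for Chapter 3 can be sketched as follows. This is a minimal sketch and not the MultiSAM implementation: a thresholded Welch-type t statistic is used as a simple stand-in for a full SAM analysis, and the function names, the threshold and the data layout (one list of expression values per probe, in the same sample order for every probe) are assumptions.

```python
import random
import statistics


def welch_t(a, b):
    """Welch-type t statistic; a stand-in for the SAM statistic."""
    std_err = (statistics.variance(a) / len(a)
               + statistics.variance(b) / len(b)) ** 0.5
    if std_err == 0:
        return 0.0
    return (statistics.mean(a) - statistics.mean(b)) / std_err


def resampling_scores(lpc, mpc, n_rounds=1000, threshold=3.0, seed=0):
    """Count how often each probe is called differentially expressed.

    lpc, mpc -- dicts mapping probe name to expression values in the less
                and more populated class, respectively.
    """
    rng = random.Random(seed)
    n_small = len(next(iter(lpc.values())))
    n_large = len(next(iter(mpc.values())))
    scores = {probe: 0 for probe in lpc}
    for _ in range(n_rounds):
        # Draw one balanced subset of MPC samples, shared by all probes.
        picked = rng.sample(range(n_large), n_small)
        for probe in lpc:
            subset = [mpc[probe][i] for i in picked]
            if abs(welch_t(lpc[probe], subset)) > threshold:
                scores[probe] += 1
    return scores
```

Each probe's score ranges from 0 to `n_rounds`, mirroring the 0 to 1,000 score in the text; probes can then be filtered by a cut-off on that score.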
In fact, when microarray data are viewed in a prognostic and diagnostic clinical framework, differences are not the only thing that can play a crucial role. In some cases similarities can provide useful, and sometimes even more important, information. Given three classes, the goal could be to establish, with a certain level of confidence, whether the third one is similar to the first or to the second. In this work we show that the Relevance Vector Machine (RVM) [2] could be a possible solution to the limitations of standard supervised classification. RVM offers many advantages compared, for example, with its well-known precursor, the Support Vector Machine (SVM) [3]. Among these advantages, the estimate of the posterior probability of class membership is a key feature for addressing the similarity issue. This is a highly important, but often overlooked, option of any practical pattern recognition system. We focused on a three-class tumor grade problem, with 67 samples of grade 1 (G1), 54 samples of grade 3 (G3) and 100 samples of grade 2 (G2). The goal is to find a model able to separate G1 from G3, and then to evaluate the third class, G2, as a test set to obtain the probability that each G2 sample is a member of class G1 or class G3. The analysis showed that breast cancer samples of grade 2 have a molecular profile more similar to breast cancer samples of grade 1. This result had been conjectured in the literature, but no measure of significance had been given before.
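The similarity evaluation can be sketched as: fit a probabilistic two-class model on G1 versus G3, then read off the posterior probabilities for the held-out G2 samples. The sketch below uses plain logistic regression on a single hypothetical feature as a stand-in, because it also yields class-membership probabilities; it is not an RVM, and the data values and function names are invented for illustration only.

```python
import math


def fit_logistic(xs, labels, lr=0.1, epochs=2000):
    """Fit p(label=1 | x) = sigmoid(w * (x - center) + b) by gradient descent."""
    center = sum(xs) / len(xs)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, labels):
            p = 1.0 / (1.0 + math.exp(-(w * (x - center) + b)))
            grad_w += (p - y) * (x - center)
            grad_b += p - y
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
    return w, b, center


def prob_grade3(x, model):
    """Posterior probability that a sample belongs to the G3 class."""
    w, b, center = model
    return 1.0 / (1.0 + math.exp(-(w * (x - center) + b)))


# Hypothetical one-dimensional expression scores (not data from the thesis).
g1 = [0.0, 0.5, 1.0, 1.5]
g3 = [3.0, 3.5, 4.0, 4.5]
model = fit_logistic(g1 + g3, [0] * len(g1) + [1] * len(g3))

# G2 is never used for training; it is only scored by the G1-vs-G3 model.
g2 = [1.0, 1.4, 2.0, 3.2]
g2_probs = [prob_grade3(x, model) for x in g2]
share_closer_to_g1 = sum(p < 0.5 for p in g2_probs) / len(g2)
```

The fraction of G2 samples whose posterior favours G1 gives a simple summary of which class G2 resembles more; an RVM would supply the posteriors in the approach the abstract describes.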
Abstract:
The study of protein expression profiles for biomarker discovery in serum and in mammalian cell populations requires the continuous improvement and combination of protein/peptide separation techniques, mass spectrometry, and statistical and bioinformatic approaches. In this thesis, two different mass spectrometry-based protein profiling strategies were developed and applied to liver diseases and inflammatory bowel diseases (IBDs) for the discovery of new biomarkers. The first, based on bulk solid-phase extraction combined with matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) and chemometric analysis of serum samples, was applied to the study of serum protein expression profiles in both IBDs (Crohn’s disease and ulcerative colitis) and liver diseases (cirrhosis, hepatocellular carcinoma, viral hepatitis). The approach allowed the enrichment of serum proteins/peptides, owing to the large interaction surface between analytes and solid phase, and gave high recovery, owing to the elution step being performed directly on the MALDI target plate. Furthermore, the use of a chemometric algorithm to select the variables with the highest discriminant power made it possible to evaluate patterns of 20-30 proteins involved in the differentiation and classification of serum samples from healthy donors and diseased patients. These protein profiles make it possible to discriminate among the pathologies with very good classification and prediction abilities. In particular, in the study of inflammatory bowel diseases, the analysis using C18 of 129 serum samples from healthy donors and from Crohn’s disease, ulcerative colitis and inflammatory control patients gave a classification ability of 90.7% and a prediction ability of 72.9%. In the study of liver diseases (hepatocellular carcinoma, viral hepatitis and cirrhosis), a prediction ability of 80.6% was achieved using IDA-Cu(II) as the extraction procedure.
The identification of the selected proteins by MALDI-TOF/TOF MS analysis, or by their selective enrichment followed by enzymatic digestion and MS/MS analysis, may give useful information for identifying new biomarkers involved in the diseases. The second mass spectrometry-based protein profiling strategy was based on a label-free liquid chromatography electrospray ionization quadrupole time-of-flight mass spectrometry (LC ESI-QTOF MS) differential analysis approach, combined with targeted MS/MS analysis of only the identified differences. The strategy was used for biomarker discovery in IBDs, and in particular in Crohn’s disease. The enriched serum peptidome and the subcellular fractions of intestinal epithelial cells (IECs) from healthy donors and Crohn’s disease patients were analysed. Combining the enrichment of low molecular weight serum proteins with the LC-MS approach made it possible to evaluate a pattern of peptides derived from specific exoprotease activity in the coagulation and complement activation pathways. Among these peptides, the discovery of clusters of peptides from fibrinopeptide A, apolipoproteins E and A4, and complement C3 and C4 was particularly interesting. Further studies need to be performed to evaluate the specificity of these clusters and validate the results, in order to develop a rapid serum diagnostic test. The label-free LC ESI-QTOF MS differential analysis of the subcellular fractions of IECs from Crohn’s disease patients and healthy donors revealed many proteins that could be involved in the inflammation process. Among them, heat shock protein 70, tryptase alpha-1 precursor and proteins whose upregulation can be explained by the increased activity of IECs in Crohn’s disease were identified. Follow-up studies will be performed to validate the results and to investigate in depth the inflammation pathways involved in the disease.
Both mass spectrometry-based protein profiling strategies developed here proved to be useful tools for the discovery of disease biomarkers, which need to be validated in further studies.