9 resultados para Annotations
em Doria (National Library of Finland DSpace Services) - National Library of Finland, Finland
Resumo:
Biomedical research is currently facing a new type of challenge: an excess of information, both in terms of raw data from experiments and in the number of scientific publications describing their results. Mirroring the focus on data mining techniques to address the issues of structured data, there has recently been great interest in the development and application of text mining techniques to make more effective use of the knowledge contained in biomedical scientific publications, accessible only in the form of natural human language. This thesis describes research done in the broader scope of projects aiming to develop methods, tools and techniques for text mining tasks in general and for the biomedical domain in particular. The work described here involves more specifically the goal of extracting information from statements concerning relations of biomedical entities, such as protein-protein interactions. The approach taken is one using full parsing—syntactic analysis of the entire structure of sentences—and machine learning, aiming to develop reliable methods that can further be generalized to apply also to other domains. The five papers at the core of this thesis describe research on a number of distinct but related topics in text mining. In the first of these studies, we assessed the applicability of two popular general English parsers to biomedical text mining and, finding their performance limited, identified several specific challenges to accurate parsing of domain text. In a follow-up study focusing on parsing issues related to specialized domain terminology, we evaluated three lexical adaptation methods. We found that the accurate resolution of unknown words can considerably improve parsing performance and introduced a domain-adapted parser that reduced the error rate of theoriginal by 10% while also roughly halving parsing time. To establish the relative merits of parsers that differ in the applied formalisms and the representation given to their syntactic analyses, we have also developed evaluation methodology, considering different approaches to establishing comparable dependency-based evaluation results. We introduced a methodology for creating highly accurate conversions between different parse representations, demonstrating the feasibility of unification of idiverse syntactic schemes under a shared, application-oriented representation. In addition to allowing formalism-neutral evaluation, we argue that such unification can also increase the value of parsers for domain text mining. As a further step in this direction, we analysed the characteristics of publicly available biomedical corpora annotated for protein-protein interactions and created tools for converting them into a shared form, thus contributing also to the unification of text mining resources. The introduced unified corpora allowed us to perform a task-oriented comparative evaluation of biomedical text mining corpora. This evaluation established clear limits on the comparability of results for text mining methods evaluated on different resources, prompting further efforts toward standardization. To support this and other research, we have also designed and annotated BioInfer, the first domain corpus of its size combining annotation of syntax and biomedical entities with a detailed annotation of their relationships. The corpus represents a major design and development effort of the research group, with manual annotation that identifies over 6000 entities, 2500 relationships and 28,000 syntactic dependencies in 1100 sentences. In addition to combining these key annotations for a single set of sentences, BioInfer was also the first domain resource to introduce a representation of entity relations that is supported by ontologies and able to capture complex, structured relationships. Part I of this thesis presents a summary of this research in the broader context of a text mining system, and Part II contains reprints of the five included publications.
Resumo:
Diabetes is a rapidly increasing worldwide problem which is characterised by defective metabolism of glucose that causes long-term dysfunction and failure of various organs. The most common complication of diabetes is diabetic retinopathy (DR), which is one of the primary causes of blindness and visual impairment in adults. The rapid increase of diabetes pushes the limits of the current DR screening capabilities for which the digital imaging of the eye fundus (retinal imaging), and automatic or semi-automatic image analysis algorithms provide a potential solution. In this work, the use of colour in the detection of diabetic retinopathy is statistically studied using a supervised algorithm based on one-class classification and Gaussian mixture model estimation. The presented algorithm distinguishes a certain diabetic lesion type from all other possible objects in eye fundus images by only estimating the probability density function of that certain lesion type. For the training and ground truth estimation, the algorithm combines manual annotations of several experts for which the best practices were experimentally selected. By assessing the algorithm’s performance while conducting experiments with the colour space selection, both illuminance and colour correction, and background class information, the use of colour in the detection of diabetic retinopathy was quantitatively evaluated. Another contribution of this work is the benchmarking framework for eye fundus image analysis algorithms needed for the development of the automatic DR detection algorithms. The benchmarking framework provides guidelines on how to construct a benchmarking database that comprises true patient images, ground truth, and an evaluation protocol. The evaluation is based on the standard receiver operating characteristics analysis and it follows the medical practice in the decision making providing protocols for image- and pixel-based evaluations. During the work, two public medical image databases with ground truth were published: DIARETDB0 and DIARETDB1. The framework, DR databases and the final algorithm, are made public in the web to set the baseline results for automatic detection of diabetic retinopathy. Although deviating from the general context of the thesis, a simple and effective optic disc localisation method is presented. The optic disc localisation is discussed, since normal eye fundus structures are fundamental in the characterisation of DR.
Resumo:
Dedicated to: Carl Fredrich Mennander, Nils Idman, Michaël Forselius, Jacob Fårskål, Anders Berendt Gadd, Georg Haveman, Catharina Tackou, née Morin, Jacob Carenius.
Resumo:
Poster at Open Repositories 2014, Helsinki, Finland, June 9-13, 2014
Resumo:
Presentation at Open Repositories 2014, Helsinki, Finland, June 9-13, 2014
Resumo:
Presentation at Open Repositories 2014, Helsinki, Finland, June 9-13, 2014
Resumo:
Biomedical natural language processing (BioNLP) is a subfield of natural language processing, an area of computational linguistics concerned with developing programs that work with natural language: written texts and speech. Biomedical relation extraction concerns the detection of semantic relations such as protein-protein interactions (PPI) from scientific texts. The aim is to enhance information retrieval by detecting relations between concepts, not just individual concepts as with a keyword search. In recent years, events have been proposed as a more detailed alternative for simple pairwise PPI relations. Events provide a systematic, structural representation for annotating the content of natural language texts. Events are characterized by annotated trigger words, directed and typed arguments and the ability to nest other events. For example, the sentence “Protein A causes protein B to bind protein C” can be annotated with the nested event structure CAUSE(A, BIND(B, C)). Converted to such formal representations, the information of natural language texts can be used by computational applications. Biomedical event annotations were introduced by the BioInfer and GENIA corpora, and event extraction was popularized by the BioNLP'09 Shared Task on Event Extraction. In this thesis we present a method for automated event extraction, implemented as the Turku Event Extraction System (TEES). A unified graph format is defined for representing event annotations and the problem of extracting complex event structures is decomposed into a number of independent classification tasks. These classification tasks are solved using SVM and RLS classifiers, utilizing rich feature representations built from full dependency parsing. Building on earlier work on pairwise relation extraction and using a generalized graph representation, the resulting TEES system is capable of detecting binary relations as well as complex event structures. We show that this event extraction system has good performance, reaching the first place in the BioNLP'09 Shared Task on Event Extraction. Subsequently, TEES has achieved several first ranks in the BioNLP'11 and BioNLP'13 Shared Tasks, as well as shown competitive performance in the binary relation Drug-Drug Interaction Extraction 2011 and 2013 shared tasks. The Turku Event Extraction System is published as a freely available open-source project, documenting the research in detail as well as making the method available for practical applications. In particular, in this thesis we describe the application of the event extraction method to PubMed-scale text mining, showing how the developed approach not only shows good performance, but is generalizable and applicable to large-scale real-world text mining projects. Finally, we discuss related literature, summarize the contributions of the work and present some thoughts on future directions for biomedical event extraction. This thesis includes and builds on six original research publications. The first of these introduces the analysis of dependency parses that leads to development of TEES. The entries in the three BioNLP Shared Tasks, as well as in the DDIExtraction 2011 task are covered in four publications, and the sixth one demonstrates the application of the system to PubMed-scale text mining.
Resumo:
y+LAT1 is a transmembrane protein that, together with the 4F2hc cell surface antigen, forms a transporter for cationic amino acids in the basolateral plasma membrane of epithelial cells. It is mainly expressed in the kidney and small intestine, and to a lesser extent in other tissues, such as the placenta and immunoactive cells. Mutations in y+LAT1 lead to a defect of the y+LAT1/4F2hc transporter, which impairs intestinal absorbance and renal reabsorbance of lysine, arginine and ornithine, causing lysinuric protein intolerance (LPI), a rare, recessively inherited aminoaciduria with severe multi-organ complications. This thesis examines the consequences of the LPI-causing mutations on two levels, the transporter structure and the Finnish patients’ gene expression profiles. Using fluorescence resonance energy transfer (FRET) confocal microscopy, optimised for this work, the subunit dimerisation was discovered to be a primary phenomenon occurring regardless of mutations in y+LAT1. In flow cytometric and confocal microscopic FRET analyses, the y+LAT1 molecules exhibit a strong tendency for homodimerisation both in the presence and absence of 4F2hc, suggesting a heterotetramer for the transporter’s functional form. Gene expression analysis of the Finnish patients, clinically variable but homogenic for the LPI-causing mutation in SLC7A7, revealed 926 differentially-expressed genes and a disturbance of the amino acid homeostasis affecting several transporters. However, despite the expression changes in individual patients, no overall compensatory effect of y+LAT2, the sister y+L transporter, was detected. The functional annotations of the altered genes included biological processes such as inflammatory response, immune system processes and apoptosis, indicating a strong immunological involvement for LPI.
Resumo:
Lichens are symbiotic organisms, which consist of the fungal partner and the photosynthetic partner, which can be either an alga or a cyanobacterium. In some lichen species the symbiosis is tripartite, where the relationship includes both an alga and a cyanobacterium alongside the primary symbiont, fungus. The lichen symbiosis is an evolutionarily old adaptation to life on land and many extant fungal species have evolved from lichenised ancestors. Lichens inhabit a wide range of habitats and are capable of living in harsh environments and on nutrient poor substrates, such as bare rocks, often enduring frequent cycles of drying and wetting. Most lichen species are desiccation tolerant, and they can survive long periods of dehydration, but can rapidly resume photosynthesis upon rehydration. The molecular mechanisms behind lichen desiccation tolerance are still largely uncharacterised and little information is available for any lichen species at the genomic or transcriptomic level. The emergence of the high-throughput next generation sequencing (NGS) technologies and the subsequent decrease in the cost of sequencing new genomes and transcriptomes has enabled non-model organism research on the whole genome level. In this doctoral work the transcriptome and genome of the grey reindeer lichen, Cladonia rangiferina, were sequenced, de novo assembled and characterised using NGS and traditional expressed sequence tag (EST) technologies. RNA extraction methods were optimised to improve the yield and quality of RNA extracted from lichen tissue. The effects of rehydration and desiccation on C. rangiferina gene expression on whole transcriptome level were studied and the most differentially expressed genes were identified. The secondary metabolites present in C. rangiferina decreased the quality – integrity, optical characteristics and utility for sensitive molecular biological applications – of the extracted RNA requiring an optimised RNA extraction method for isolating sufficient quantities of high-quality RNA from lichen tissue in a time- and cost-efficient manner. The de novo assembly of the transcriptome of C. rangiferina was used to produce a set of contiguous unigene sequences that were used to investigate the biological functions and pathways active in a hydrated lichen thallus. The de novo assembly of the genome yielded an assembly containing mostly genes derived from the fungal partner. The assembly was of sufficient quality, in size similar to other lichen-forming fungal genomes and included most of the core eukaryotic genes. Differences in gene expression were detected in all studied stages of desiccation and rehydration, but the largest changes occurred during the early stages of rehydration. The most differentially expressed genes did not have any annotations, making them potentially lichen-specific genes, but several genes known to participate in environmental stress tolerance in other organisms were also identified as differentially expressed.