54 results for Corpora (Linguistics)
Abstract:
Summary: Emergent metatheory in linguistics
Abstract:
Also published in print
Abstract:
My study examines the development of Finnish vocabulary in the 19th century, the period when Finnish developed into a multi-purpose language of culture and learning. The vocabulary of one specialized field, geography, serves as the example material. Written Finnish emerged in the 16th century, but at first the written language was needed mainly in religious contexts. During the 19th century the domains of use of the language diversified, and new vocabulary was needed for many specialized fields. People deliberately began to translate non-fiction and to write on various subjects. My study traces the development of geographical vocabulary over a hundred years, particularly in geography textbooks. The study describes vocabulary development from a theoretically new starting point by examining lexical variation. Variation is described in detail both as the development of the designations of individual concepts and as a general phenomenon. The study also draws on a cognitive approach, especially the theory of sociocognitive terminology. The analysis of the material yields a picture of how the vocabulary developed and became established. The study also describes the ways in which new concepts were named. It considers the relationship between different naming methods and the role of writers and their contemporaries in the establishment of the vocabulary. The 19th-century vocabulary of geography shows abundant variation; the designations of only a few concepts are established or become established quickly. Describing this variation as lexical variation proved to be a good method in the study. Because the written language was not yet standardized, the designations show much contextual variation, for example in spelling. Writers also deliberated over illustrative ways to name concepts, which gives rise to onomasiological variation. Semasiological variation, in turn, reflects the unsettled state of the concept system. The vocabulary in the material has its roots in Old Literary Finnish, but on this basis an enormous amount of new vocabulary is created, or designations previously used in the written language are given new meanings. 
An important role is played both by forming designations from native elements and by translation, in which the model comes from another language but the elements of the designations are native.
Abstract:
Biomedical research is currently facing a new type of challenge: an excess of information, both in terms of raw data from experiments and in the number of scientific publications describing their results. Mirroring the focus on data mining techniques to address the issues of structured data, there has recently been great interest in the development and application of text mining techniques to make more effective use of the knowledge contained in biomedical scientific publications, accessible only in the form of natural human language. This thesis describes research done in the broader scope of projects aiming to develop methods, tools and techniques for text mining tasks in general and for the biomedical domain in particular. The work described here involves more specifically the goal of extracting information from statements concerning relations of biomedical entities, such as protein-protein interactions. The approach taken is one using full parsing (syntactic analysis of the entire structure of sentences) and machine learning, aiming to develop reliable methods that can further be generalized to apply also to other domains. The five papers at the core of this thesis describe research on a number of distinct but related topics in text mining. In the first of these studies, we assessed the applicability of two popular general English parsers to biomedical text mining and, finding their performance limited, identified several specific challenges to accurate parsing of domain text. In a follow-up study focusing on parsing issues related to specialized domain terminology, we evaluated three lexical adaptation methods. We found that the accurate resolution of unknown words can considerably improve parsing performance and introduced a domain-adapted parser that reduced the error rate of the original by 10% while also roughly halving parsing time. 
To establish the relative merits of parsers that differ in the applied formalisms and the representation given to their syntactic analyses, we have also developed evaluation methodology, considering different approaches to establishing comparable dependency-based evaluation results. We introduced a methodology for creating highly accurate conversions between different parse representations, demonstrating the feasibility of unification of diverse syntactic schemes under a shared, application-oriented representation. In addition to allowing formalism-neutral evaluation, we argue that such unification can also increase the value of parsers for domain text mining. As a further step in this direction, we analysed the characteristics of publicly available biomedical corpora annotated for protein-protein interactions and created tools for converting them into a shared form, thus contributing also to the unification of text mining resources. The introduced unified corpora allowed us to perform a task-oriented comparative evaluation of biomedical text mining corpora. This evaluation established clear limits on the comparability of results for text mining methods evaluated on different resources, prompting further efforts toward standardization. To support this and other research, we have also designed and annotated BioInfer, the first domain corpus of its size combining annotation of syntax and biomedical entities with a detailed annotation of their relationships. The corpus represents a major design and development effort of the research group, with manual annotation that identifies over 6000 entities, 2500 relationships and 28,000 syntactic dependencies in 1100 sentences. In addition to combining these key annotations for a single set of sentences, BioInfer was also the first domain resource to introduce a representation of entity relations that is supported by ontologies and able to capture complex, structured relationships. 
Part I of this thesis presents a summary of this research in the broader context of a text mining system, and Part II contains reprints of the five included publications.
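As an illustration of what a dependency-based parser evaluation can look like, the following minimal sketch computes unlabeled and labeled attachment scores and a relative error reduction. It is not the evaluation methodology of the thesis, which the abstract does not specify; the function names, the (head, label) token representation and the example figures in the usage note are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed representation: one (head_index, relation_label) pair per token,
# with head index 0 standing for the artificial root.
Dependency = Tuple[int, str]


def attachment_scores(gold: List[Dependency],
                      predicted: List[Dependency]) -> Tuple[float, float]:
    """Return (unlabeled, labeled) attachment scores for one sentence."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted analyses must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # Unlabeled: the head is correct. Labeled: head and relation label are both correct.
    unlabeled = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    labeled = sum(g == p for g, p in zip(gold, predicted))
    return unlabeled / len(gold), labeled / len(gold)


def relative_error_reduction(baseline_accuracy: float,
                             new_accuracy: float) -> float:
    """Fraction of the baseline parser's errors removed by the new parser."""
    baseline_error = 1.0 - baseline_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline makes no errors; reduction is undefined")
    return (baseline_error - (1.0 - new_accuracy)) / baseline_error
```

Under these assumptions, a parser whose accuracy rises from 0.80 to 0.82 has removed one tenth of the baseline errors, which is the sense in which an error rate is said to fall "by 10%". Comparing parsers built on different formalisms would additionally require converting their outputs to a shared dependency scheme before scoring.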
Abstract:
In this thesis we study the field of opinion mining by giving a comprehensive review of the research that has been done on this topic. Using this knowledge, we also present a case study of a multilevel opinion mining system for a student organization's sales management system. We describe the field of opinion mining by discussing its historical roots, its motivations and applications, as well as the different scientific approaches that have been used to solve the challenging problem of mining opinions. To deal with this huge subfield of natural language processing, we first give an abstraction of the problem of opinion mining and describe the theoretical frameworks that are available for dealing with appraisal language. Then we discuss the relation between opinion mining and computational linguistics, which provides a crucial pre-processing step for the accuracy of the subsequent steps of opinion mining. The second part of our thesis deals with the semantics of opinions, where we describe the different ways used to collect lists of opinion words as well as the methods and techniques available for extracting knowledge from opinions present in unstructured textual data. In the part about collecting lists of opinion words we describe manual, semi-manual and automatic ways to do so and give a review of the available lists that are used as gold standards in opinion mining research. For the methods and techniques of opinion mining we divide the task into three levels: the document, sentence and feature levels. The techniques presented at the document and sentence levels are divided into supervised and unsupervised approaches that are used to determine the subjectivity and polarity of texts and sentences at these levels of analysis. At the feature level we give a description of the techniques available for finding the opinion targets, the polarity of the opinions about these opinion targets and the opinion holders. 
Also at the feature level we discuss the various ways to summarize and visualize the results of this level of analysis. In the third part of our thesis we present a case study of a sales management system that uses free-form text and that can benefit from an opinion mining system. Using the knowledge gathered in the review of this field, we propose a theoretical multi-level opinion mining system (MLOM) that can perform most of the tasks needed from an opinion mining system. Based on previous research, we suggest that such a system could support many of the laborious market research tasks done by the sales force that uses this sales management system, improving their insight into their partners and thereby the quality of their sales services and their overall results.
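To make the sentence-level and document-level analysis concrete, here is a minimal sketch of an unsupervised, lexicon-based polarity scorer. It is not the MLOM system described in the thesis; the tiny word list, the negation rule and the function names are hypothetical and stand in for the opinion-word lists the abstract mentions.

```python
import re
from typing import Dict

# Hypothetical toy opinion-word list; real systems use much larger lexicons.
OPINION_LEXICON: Dict[str, int] = {
    "good": 1, "great": 1, "helpful": 1,
    "bad": -1, "slow": -1, "poor": -1,
}
# Assumed, simplified negation cues.
NEGATIONS = {"not", "no", "never"}


def sentence_polarity(sentence: str,
                      lexicon: Dict[str, int] = OPINION_LEXICON) -> int:
    """Sum the polarity of opinion words, flipping the one that follows a negation."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    score = 0
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        value = lexicon.get(token, 0)
        if value:
            score += -value if negate else value
            negate = False  # a negation applies to the next opinion word only
    return score


def document_polarity(text: str) -> int:
    """Document-level polarity as the sum of its sentence-level polarities."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(sentence_polarity(s) for s in sentences)
```

A positive score suggests positive polarity, a negative score negative polarity, and zero either a neutral or an objective sentence. Feature-level analysis, which also identifies opinion targets and opinion holders, needs more than word counting and is not covered by this sketch.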
Abstract:
The article describes some concrete problems that were encountered when writing a two-level model of Mari morphology. Mari is an agglutinative Finno-Ugric language spoken in Russia by about 600 000 people. The work was begun in the 1980s on the basis of K. Koskenniemi’s Two-Level Morphology (1983), but in the latest stage R. Beesley’s and L. Karttunen’s Finite State Morphology (2003) was used. Many of the problems described in the article concern the inexplicitness of the rules in Mari grammars and the lack of information about the exact distribution of some suffixes, e.g. enclitics. The Mari grammars usually give complete paradigms for a few unproblematic verb stems, whereas the difficult or unclear forms of certain verbs are only superficially discussed. Another example of phenomena that are poorly described in grammars is the way suffixes with an initial sibilant combine with stems ending in a sibilant. The help of informants and searches of electronic corpora were used to overcome such difficulties in the development of the two-level model of Mari. Variation in the order of plural markers, case suffixes and possessive suffixes is a typical feature of Mari. The morphotactic rules constructed for Mari declensional forms tend to be recursive, and their productivity must be limited by some technical device, such as filters. In the present model, certain plural markers were treated like nouns. The positional and functional versatility of the possessive suffixes can be regarded as the most challenging phenomenon in attempts to formalize Mari morphology. Cyrillic orthography, which was used in the model, also caused problems. For instance, a Cyrillic letter may represent a sequence of two sounds, the first being part of the word stem while the other belongs to a suffix. In some cases, letters for voiced consonants are also generalized to represent voiceless consonants. 
Such orthographical conventions distance a morphological model based on orthography from the actual (morpho)phonological processes in the language.
Abstract:
Fluent health information flow is critical for clinical decision-making. However, a considerable part of this information is free-form text, and the inability to utilize it creates risks to patient safety and cost-effective hospital administration. Methods for automated processing of clinical text are emerging. The aim in this doctoral dissertation is to study machine learning and clinical text in order to support health information flow. First, by analyzing the content of authentic patient records, the aim is to specify clinical needs in order to guide the development of machine learning applications. The contributions are a model of the ideal information flow, a model of the problems and challenges in reality, and a road map for the technology development. Second, by developing applications for practical cases, the aim is to concretize ways to support health information flow. Altogether five machine learning applications for three practical cases are described: The first two applications are binary classification and regression related to the practical case of topic labeling and relevance ranking. The third and fourth applications are supervised and unsupervised multi-class classification for the practical case of topic segmentation and labeling. These four applications are tested with Finnish intensive care patient records. The fifth application is multi-label classification for the practical task of diagnosis coding. It is tested with English radiology reports. The performance of all these applications is promising. Third, the aim is to study how the quality of machine learning applications can be reliably evaluated. The associations between performance evaluation measures and methods are addressed, and a new hold-out method is introduced. This method contributes not only to processing time but also to the evaluation diversity and quality. The main conclusion is that developing machine learning applications for text requires interdisciplinary, international collaboration. 
Practical cases are very different, and hence the development must begin from genuine user needs and domain expertise. The technological expertise must cover linguistics, machine learning, and information systems. Finally, the methods must be evaluated both statistically and through authentic user feedback.
Abstract:
This dissertation considers the segmental durations of speech from the viewpoint of speech technology, especially speech synthesis. The idea is that better models of segmental durations lead to higher naturalness and better intelligibility. These features are the key factors for better usability and generality of synthesized speech technology. Even though the studies are based on a Finnish corpus, the approaches apply to all other languages as well. This is possibly due to the fact that most of the studies included in this dissertation are about universal effects taking place on utterance boundaries. The methods invented and used here are also suitable for studies of other languages. This study is based on two corpora of news reading speech and sentences read aloud. One corpus is read aloud by a 39-year-old male, whilst the other consists of several speakers in various situations. The purpose of using two corpora is twofold: it allows a comparison of the corpora and gives a broader view on the matters of interest. The dissertation begins with an overview of the phonemes and the quantity system in the Finnish language. In particular, we cover the intrinsic durations of phonemes and phoneme categories, as well as the difference in duration between short and long phonemes. The phoneme categories are presented to address the problem of variability of speech segments. In this dissertation we cover the boundary-adjacent effects on segmental durations. In initial positions of utterances we find that there seems to be initial shortening in Finnish, but the result depends on the level of detail and on the individual phoneme. On the phoneme level we find that the shortening or lengthening only affects the very first phonemes at the beginning of an utterance. However, on the word level, the effect on average seems to shorten the whole first word. We establish the effect of final lengthening in Finnish. 
The effect in Finnish has been an open question for a long time, and Finnish has been the last missing piece for final lengthening to be considered a universal phenomenon. Final lengthening is studied from various angles, and it is also shown that it is not a mere effect of prominence or an effect of a speech corpus with high inter- and intra-speaker variation. The effect of final lengthening seems to extend from the final to the penultimate word. On the phoneme level it reaches a much wider area than the initial effect. We also present a normalization method suitable for corpus studies on segmental durations. The method uses an utterance-level normalization approach to capture the pattern of segmental durations within each utterance. This prevents various problematic variations within the corpora from affecting the results. The normalization is used in a study on final lengthening to show that the results on the effect are not caused by variation in the material. The dissertation presents an implementation of speech synthesis on a mobile platform and its performance. We find that the rule-based method of speech synthesis is a real-time software solution, but the signal generation process slows down the system beyond real time. Future aspects of speech synthesis on limited platforms are discussed. The dissertation considers ethical issues in the development of speech technology. The main focus is on the development of speech synthesis with high naturalness, but the problems and solutions are applicable to any other speech technology approaches.
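The idea of utterance-level normalization can be sketched as follows. The abstract does not give the formula used in the dissertation, so this minimal example assumes a z-score computed separately within each utterance; the function name and the millisecond unit are likewise assumptions.

```python
from statistics import mean, pstdev
from typing import List


def normalize_utterance(durations_ms: List[float]) -> List[float]:
    """Z-score the segment durations of one utterance (an assumed normalization).

    Each duration is expressed relative to the mean and spread of its own
    utterance, so overall speaking rate differences between utterances or
    speakers drop out and only the within-utterance pattern remains.
    """
    if not durations_ms:
        return []
    mu = mean(durations_ms)
    sigma = pstdev(durations_ms)  # population standard deviation of this utterance
    if sigma == 0:
        # All segments equally long: there is no pattern to express.
        return [0.0] * len(durations_ms)
    return [(d - mu) / sigma for d in durations_ms]
```

With this assumption, a slow and a fast rendition of the same utterance yield the same normalized pattern whenever their durations differ only by a constant factor, and a lengthened final segment shows up as a positive value at the end of the sequence.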