967 results for Natural language techniques, Semantic spaces, Random projection, Documents


Relevance: 100.00%

Publisher:

Abstract:

The goal of this thesis is to find a suitable system solution for the customer company for managing electrical engineering documentation, by comparing how well the design systems used by engineering offices and general document management systems fit the customer environment. The thesis examines the usefulness of suitable metadata description methods for managing electrical documentation with the help of example projects. The work consists of four main parts. The first section examines the properties and life cycle of a document, from creation through active use to archiving and disposal, and describes the basic functions of document management. The second section deals with the goals of document description, description recommendations and standards, and the use of natural language in content description; the recommendations examined are those published by the W3C, Dublin Core, JHS 143 and SFS-EN 82045. The third section examines the characteristics and purpose of industrial documentation; since a plant contains many different system environments, the thesis also describes the needs for integrating document management with other systems. The final section describes various document management environments, starting from the heaviest end with product data management systems, moving on to smaller design systems and finally to general document management systems; this part also includes a list of software vendors. As a result of the work, metadata descriptions were drawn up for the selected document types using two description methods (JHS 143 and SFS-EN 82045), and both were found usable for handling electrical engineering documentation. These descriptions serve the customer in the specification phase of a document management project. A comparison of suitable system alternatives was also prepared for the customer to support procurement.
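The metadata description methods compared in the thesis (JHS 143 and SFS-EN 82045) both boil down to attaching a fixed set of descriptive fields to each document. As a hedged illustration only, the sketch below builds a flat Dublin Core-style record in Python; the field names follow the Dublin Core element set, while the function name and all values are hypothetical examples, not taken from the thesis.

```python
def make_record(title, creator, doc_type, identifier, date):
    """Build a flat Dublin Core-style metadata record as a dict.

    Field names follow the Dublin Core element set; a JHS 143 or
    SFS-EN 82045 description would use a richer, standard-specific set.
    """
    return {
        "dc:title": title,
        "dc:creator": creator,
        "dc:type": doc_type,
        "dc:identifier": identifier,
        "dc:date": date,
    }

# All values below are invented examples of an electrical document.
record = make_record(
    title="Main circuit diagram, substation A1",
    creator="Example Engineering Oy",
    doc_type="circuit diagram",
    identifier="DOC-2024-0001",
    date="2024-01-15",
)
```

The point of such a record is that every document type gets the same searchable fields, which is what makes both description methods usable for managing a heterogeneous document set.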

Relevance: 100.00%

Publisher:

Abstract:

In this thesis we study the field of opinion mining, giving a comprehensive review of the research available on the topic. Building on this review, we also present a case study of a multilevel opinion mining system for a student organization's sales management system. We describe the field of opinion mining by discussing its historical roots, its motivations and applications, and the different scientific approaches that have been used to solve the challenging problem of mining opinions. To deal with this large subfield of natural language processing, we first give an abstraction of the opinion mining problem and describe the theoretical frameworks available for dealing with appraisal language. We then discuss the relation between opinion mining and computational linguistics, whose methods form a crucial pre-processing step for the accuracy of the subsequent steps of opinion mining. The second part of the thesis deals with the semantics of opinions: we describe the different ways of collecting lists of opinion words, as well as the methods and techniques available for extracting knowledge from opinions present in unstructured textual data. For collecting lists of opinion words, we describe manual, semi-manual and automatic approaches, and review the available lists that are used as gold standards in opinion mining research. For the methods and techniques of opinion mining, we divide the task into three levels: the document, sentence and feature level. The techniques presented at the document and sentence levels are divided into supervised and unsupervised approaches for determining the subjectivity and polarity of texts and sentences. At the feature level we describe the techniques available for finding the opinion targets, the polarity of the opinions about these targets, and the opinion holders.
We also discuss the various ways to summarize and visualize the results of feature-level analysis. In the third part of the thesis we present a case study of a sales management system that uses free-form text and can benefit from an opinion mining system. Using the knowledge gathered in the review, we propose a theoretical multilevel opinion mining system (MLOM) that can perform most of the tasks expected of an opinion mining system. Based on prior research, we suggest that such a system could lighten many of the laborious market research tasks done by the sales force using this sales management system, improve their insight into their partners, and thereby increase the quality of their sales services and their overall results.
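One of the unsupervised sentence-level techniques the abstract alludes to is lexicon-based polarity scoring: count positive and negative opinion words, flipping polarity after a negator. The toy Python sketch below illustrates only the idea; the tiny word lists and the function name are invented for illustration and are not one of the gold-standard lexicons mentioned in the text.

```python
# Invented toy lexicons; real systems use large gold-standard lists.
POSITIVE = {"good", "great", "excellent", "helpful"}
NEGATIVE = {"bad", "poor", "terrible", "slow"}
NEGATORS = {"not", "never", "no"}

def sentence_polarity(sentence):
    """Return +1 (positive), -1 (negative) or 0 (neutral).

    A negator flips the polarity of the immediately following word.
    """
    score, negate = 0, False
    for token in sentence.lower().split():
        word = token.strip(".,!?")
        if word in NEGATORS:
            negate = True
            continue
        if word in POSITIVE:
            score += -1 if negate else 1
        elif word in NEGATIVE:
            score += 1 if negate else -1
        negate = False
    return (score > 0) - (score < 0)
```

A supervised document-level classifier would instead learn such word weights from labeled training texts; the lexicon approach needs no training data, which is why it is classed as unsupervised.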

Relevance: 100.00%

Publisher:

Abstract:

Learning of preference relations has recently received significant attention in the machine learning community. It is closely related to classification and regression analysis and can be reduced to these tasks. However, preference learning involves predicting an ordering of the data points rather than a single numerical value, as in regression, or a class label, as in classification. Therefore, studying preference relations within a separate framework not only facilitates better theoretical understanding of the problem, but also motivates the development of efficient algorithms for the task. Preference learning has many applications in domains such as information retrieval, bioinformatics and natural language processing. For example, algorithms that learn to rank are frequently used in search engines for ordering the documents retrieved by a query. Preference learning methods have also been applied to collaborative filtering problems, predicting individual customer choices from the vast amount of user-generated feedback. In this thesis we propose several algorithms for learning preference relations. These algorithms stem from the well-founded and robust class of regularized least-squares methods and have many attractive computational properties. In order to improve the performance of our methods, we introduce several non-linear kernel functions. Thus, the contribution of this thesis is twofold: kernel functions for structured data, used to take advantage of various non-vectorial data representations, and preference learning algorithms suitable for different tasks, namely efficient learning of preference relations, learning with large amounts of training data, and semi-supervised preference learning. The proposed kernel-based algorithms and kernels are applied to the parse ranking task in natural language processing, document ranking in information retrieval, and remote homology detection in bioinformatics.
Training kernel-based ranking algorithms can be infeasible when the training set is large. This problem is addressed by proposing a preference learning algorithm whose computational complexity scales linearly with the number of training data points. We also introduce a sparse approximation of the algorithm that can be efficiently trained with large amounts of data. For situations where a small amount of labeled data but a large amount of unlabeled data is available, we propose a co-regularized preference learning algorithm. To conclude, the methods presented in this thesis address not only the efficient training of the algorithms but also fast regularization parameter selection, multiple output prediction, and cross-validation. Furthermore, the proposed algorithms lead to notably better performance in many of the preference learning tasks considered.
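The regularized least-squares view of preference learning can be illustrated in a toy form: fit a linear scoring function whose score differences on pairs of training examples match their label differences, plus a small regularization term. The Python sketch below uses plain gradient descent on this pairwise squared loss; it is a hypothetical stand-in for the thesis's kernel-based algorithms, and the function names are mine, not the authors'.

```python
def fit_pairwise_ls(X, y, reg=0.01, lr=0.01, epochs=500):
    """Toy pairwise regularized least-squares: minimize, over all pairs
    (i, j), ((w.xi - w.xj) - (yi - yj))^2 + reg * ||w||^2 by gradient
    descent. X is a list of feature lists, y a list of relevance scores."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [2.0 * reg * wk for wk in w]          # regularizer gradient
        for i in range(n):
            for j in range(n):
                # residual of pair (i, j): predicted minus true difference
                diff = sum(w[k] * (X[i][k] - X[j][k]) for k in range(d))
                r = diff - (y[i] - y[j])
                for k in range(d):
                    grad[k] += 2.0 * r * (X[i][k] - X[j][k])
        w = [wk - lr * gk / (n * n) for wk, gk in zip(w, grad)]
    return w

def rank(X, w):
    """Order item indices by descending predicted score."""
    scores = [sum(wk * xk for wk, xk in zip(w, x)) for x in X]
    return sorted(range(len(X)), key=lambda i: -scores[i])
```

The kernel versions in the thesis replace the explicit feature vectors with kernel evaluations and exploit matrix-algebra shortcuts instead of this O(n^2)-per-epoch loop; the sketch only shows what the pairwise loss is asking of the model.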

Relevance: 100.00%

Publisher:

Abstract:

Machine learning provides tools for the automated construction of predictive models in data-intensive areas of engineering and science. The family of regularized kernel methods has in recent years become one of the mainstream approaches to machine learning, due to a number of advantages the methods share. The approach provides theoretically well-founded solutions to the problems of under- and overfitting, allows learning from structured data, and has been empirically demonstrated to yield high predictive performance on a wide range of application domains. Historically, the problems of classification and regression have gained the majority of attention in the field. In this thesis we focus on another type of learning problem: learning to rank. In learning to rank, the aim is to learn, from a set of past observations, a ranking function that can order new objects according to how well they match some underlying criterion of goodness. As an important special case of the setting, we can recover the bipartite ranking problem, corresponding to maximizing the area under the ROC curve (AUC) in binary classification. Ranking applications appear in a large variety of settings; examples encountered in this thesis include document retrieval in web search, recommender systems, information extraction, and automated parsing of natural language. We consider the pairwise approach to learning to rank, where ranking models are learned by minimizing the expected probability of ranking any two randomly drawn test examples incorrectly. The development of computationally efficient kernel methods based on this approach has in the past proven to be challenging. Moreover, it is not clear which techniques for estimating the predictive performance of learned models are the most reliable in the ranking setting, or how these techniques can be implemented efficiently. The contributions of this thesis are as follows.
First, we develop RankRLS, a computationally efficient kernel method for learning to rank that is based on minimizing a regularized pairwise least-squares loss. In addition to training methods, we introduce a variety of algorithms for tasks such as model selection, multi-output learning, and cross-validation, based on computational shortcuts from matrix algebra. Second, we improve the fastest known training method for the linear version of the RankSVM algorithm, one of the best-established methods for learning to rank. Third, we study the combination of the empirical kernel map and reduced set approximation, which allows the large-scale training of kernel machines using linear solvers, and propose computationally efficient solutions for cross-validation when using this approach. Next, we explore the problem of reliable cross-validation when using AUC as a performance criterion, through an extensive simulation study, and demonstrate that the proposed leave-pair-out cross-validation approach leads to more reliable performance estimation than commonly used alternatives. Finally, we present a case study on applying machine learning to information extraction from biomedical literature, which combines several of the approaches considered in the thesis. The thesis is divided into two parts: Part I provides the background for the research work and summarizes the most central results, while Part II consists of the five original research articles that are the main contribution of this thesis.
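The leave-pair-out cross-validation idea can be sketched directly: for every positive-negative pair, hold that pair out, train on the remaining data, and check whether the positive example is scored above the negative one; the fraction of correctly ordered pairs estimates AUC. In the hedged Python sketch below, a trivial centroid scorer stands in for a real learner such as RankRLS, and all names are illustrative.

```python
def centroid_scorer(X, y):
    """Toy learner: score by distance to the negative-class mean minus
    distance to the positive-class mean (1-D features, labels 0/1)."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    mp = sum(pos) / len(pos)
    mn = sum(neg) / len(neg)
    return lambda x: (x - mn) - (mp - x)   # higher = more positive-like

def leave_pair_out_auc(X, y):
    """Estimate AUC by leave-pair-out cross-validation: for each
    positive-negative pair, retrain without it and check the ordering.
    Ties count as half correct, as in the usual AUC definition."""
    pairs = [(i, j) for i in range(len(X)) if y[i] == 1
                    for j in range(len(X)) if y[j] == 0]
    correct = 0.0
    for i, j in pairs:
        keep = [k for k in range(len(X)) if k not in (i, j)]
        f = centroid_scorer([X[k] for k in keep], [y[k] for k in keep])
        si, sj = f(X[i]), f(X[j])
        correct += 1.0 if si > sj else (0.5 if si == sj else 0.0)
    return correct / len(pairs)
```

The expensive part is retraining once per held-out pair; the computational shortcuts the thesis derives are what make this scheme practical for kernel learners, which the toy scorer above sidesteps entirely.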

Relevance: 100.00%

Publisher:

Abstract:

My research deals with agent nouns in the language of the works of Mikael Agricola (ca. 1510–1557). The main tasks addressed in my thesis are to describe the individual agent noun types, to provide a comprehensive picture of the category of agent nouns, and to clarify the relations between the different types. My research material consists of all the agent nouns referring to persons in the language of Agricola's works, together with their context. The language studied is for the most part translated language. Agent nouns play an important role both in the vocabulary of natural language and in broader sentence structures, since a text must constantly refer to the persons acting in it. As a concept and a phenomenon, the agent noun is widely known across languages: it is a word formed with certain derivational affixes, which typically refers to a person. In my research the agent noun category includes both deverbal and denominal derivatives referring to persons, e.g. kirjoittaa > kirjoittaja (to write > writer), asua > asuva (to inhabit > inhabitant), imeä > imeväinen (to suck > suckling), juopua > juopunut (to drink > drunkard), pelätä > pelkuri (to fear > one who fears, 'a coward'), apu > apulainen (help > helper); lammas > lampuri (sheep > shepherd). Besides original Finnish expressions, agent noun derivatives adopted as such from foreign languages form a word group of central importance for the research (e.g. nikkari, porvari, ryöväri, based on the German/Swedish for carpenter, burgher, robber). The models offered by foreign languages are especially important for the formation of agent nouns in Finnish. The starting point for my work is predominantly semantic, as both the criteria for collecting the material and the categorisation underlying its analysis are based on semantic criteria.
When examining derivatives, structural aspects are also inevitably of central importance, as form and meaning are closely associated with each other in this type of vocabulary. The alliance of structure and meaning can be described in an illustrative manner with the help of structural schemata. The examination of agent nouns comprises, on the one hand, analysis of syntactic elements and, on the other, study of cultural words in their most typical form. The latter aspect offers a research object in which language and the extralinguistic world (referents, their designations and cultural-historical reality) are in concrete terms one and the same. Thus both the agent noun types that follow the word formation principles of the Finnish language and those of foreign origin borrowed wholesale into Finnish illustrate very well how an expression of a certain origin, formed according to a certain structural model, is inseparably bound up with the background of its referent and, in general, with semantic factors. This becomes evident both in the connection between certain linguistic features and text genre and in relation to cultural words referring to persons. For example, the model for the designations of God based on agent nouns goes back thousands of years and is still closely linked in 16th-century literature with certain text genres, bringing out the link between linguistic feature and genre in a very concrete manner. A good example of the connection between language and the extralinguistic world is provided by the cultural vocabulary referring to persons: originally Finnish agent noun derivatives are associated with an agrarian society, while the vocabulary relating to mediaeval urbanisation, the Hansa trade and specialisation by trade or profession is borrowed and originates in its entirety from vocabulary that was originally German.

Relevance: 100.00%

Publisher:

Abstract:

Biomedical natural language processing (BioNLP) is a subfield of natural language processing, an area of computational linguistics concerned with developing programs that work with natural language: written texts and speech. Biomedical relation extraction concerns the detection of semantic relations, such as protein-protein interactions (PPI), from scientific texts. The aim is to enhance information retrieval by detecting relations between concepts, not just individual concepts as in a keyword search. In recent years, events have been proposed as a more detailed alternative to simple pairwise PPI relations. Events provide a systematic, structural representation for annotating the content of natural language texts. Events are characterized by annotated trigger words, directed and typed arguments, and the ability to nest other events. For example, the sentence “Protein A causes protein B to bind protein C” can be annotated with the nested event structure CAUSE(A, BIND(B, C)). Once converted to such formal representations, the information in natural language texts can be used by computational applications. Biomedical event annotations were introduced by the BioInfer and GENIA corpora, and event extraction was popularized by the BioNLP'09 Shared Task on Event Extraction. In this thesis we present a method for automated event extraction, implemented as the Turku Event Extraction System (TEES). A unified graph format is defined for representing event annotations, and the problem of extracting complex event structures is decomposed into a number of independent classification tasks. These classification tasks are solved using SVM and RLS classifiers, utilizing rich feature representations built from full dependency parsing. Building on earlier work on pairwise relation extraction and using a generalized graph representation, the resulting TEES system is capable of detecting binary relations as well as complex event structures.
We show that this event extraction system performs well, achieving first place in the BioNLP'09 Shared Task on Event Extraction. Subsequently, TEES has achieved several first ranks in the BioNLP'11 and BioNLP'13 Shared Tasks, as well as showing competitive performance in the binary-relation Drug-Drug Interaction Extraction 2011 and 2013 shared tasks. The Turku Event Extraction System is published as a freely available open-source project, documenting the research in detail as well as making the method available for practical applications. In particular, this thesis describes the application of the event extraction method to PubMed-scale text mining, showing that the developed approach not only performs well but is generalizable and applicable to large-scale real-world text mining projects. Finally, we discuss related literature, summarize the contributions of the work, and present some thoughts on future directions for biomedical event extraction. This thesis includes and builds on six original research publications. The first introduces the analysis of dependency parses that led to the development of TEES; the entries in the three BioNLP Shared Tasks and in the DDIExtraction 2011 task are covered in four publications; and the sixth demonstrates the application of the system to PubMed-scale text mining.
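The nested event representation described above, e.g. CAUSE(A, BIND(B, C)), can be modelled as a small recursive data structure: an event has a trigger type and typed arguments that are either entities or further events. The Python sketch below illustrates only this nesting idea; it is not the TEES graph format, and the class name is invented.

```python
class Event:
    """A typed, nestable event: arguments are entity names (str) or
    other Event instances, mirroring the nesting in the annotation."""

    def __init__(self, etype, *args):
        self.etype = etype   # trigger type, e.g. "CAUSE" or "BIND"
        self.args = args

    def __str__(self):
        # Recursively render nested events in CAUSE(A, BIND(B, C)) form.
        return f"{self.etype}({', '.join(str(a) for a in self.args)})"

# "Protein A causes protein B to bind protein C"
event = Event("CAUSE", "A", Event("BIND", "B", "C"))
```

Decomposing the extraction of such structures into independent classification steps (trigger detection, then argument detection) is what lets flat classifiers like SVMs recover arbitrarily nested events.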

Relevance: 100.00%

Publisher:

Abstract:

The subject of this thesis is automatic sentence compression with machine learning, such that the compressed sentences remain grammatical and retain their essential meaning. Compression of natural language sentences has many possible uses; the focus of this thesis is the generation of Finnish television program subtitles, which are often compressed versions of the original script of the program. The main part of the thesis consists of machine learning experiments on automatic sentence compression using different approaches to the problem. The machine learning methods used are linear-chain conditional random fields (CRFs) and support vector machines. We also examine which automatic text analysis methods provide useful features for the task. The data, supplied by Lingsoft Inc., consists of subtitles in both compressed and uncompressed form. The models are compared to a baseline system from the literature, both automatically and with human evaluation, because of the potentially subjective nature of the output. The best result is achieved with a CRF sequence classifier using a rich feature set. All of the text analysis methods tried help classification, and the most useful is morphological analysis.
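Sentence compression of this kind can be framed as token-level sequence labeling: each token receives a KEEP or DROP label, and the compressed sentence keeps only the KEEP tokens. The Python sketch below replaces the trained CRF/SVM with a hypothetical stop-word rule purely to show the labeling scheme; the droppable word list and function name are invented, not from the thesis.

```python
# Invented list of filler words; a trained model would predict labels
# from rich features (e.g. morphological analysis) instead.
DROPPABLE = {"very", "really", "just", "quite", "basically"}

def compress(tokens):
    """Label each token KEEP or DROP, then keep only the KEEP tokens.

    Returns (compressed_tokens, labels) so the labeling is visible.
    """
    labels = ["DROP" if t.lower() in DROPPABLE else "KEEP" for t in tokens]
    kept = [t for t, lab in zip(tokens, labels) if lab == "KEEP"]
    return kept, labels
```

A linear-chain CRF fits this framing naturally because the KEEP/DROP decision for one token depends on the labels of its neighbours, which per-token rules like the one above cannot capture.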

Relevance: 100.00%

Publisher:

Abstract:

Thesis digitized by the Division de la gestion de documents et des archives, Université de Montréal.