171 results for parsing
Abstract:
The physiological basis of human cerebral asymmetry for language remains mysterious. We have used simultaneous physiological and anatomical measurements to investigate the issue. Concentrating on neural oscillatory activity in speech-specific frequency bands and exploring interactions between gestural (motor) and auditory-evoked activity, we find, in the absence of language-related processing, that left auditory, somatosensory, articulatory motor, and inferior parietal cortices show specific, lateralized, speech-related physiological properties. With the addition of ecologically valid audiovisual stimulation, activity in auditory cortex synchronizes with left-dominant input from the motor cortex at frequencies corresponding to syllabic, but not phonemic, speech rhythms. Our results support theories of language lateralization that posit a major role for intrinsic, hardwired perceptuomotor processing in syllabic parsing and are compatible both with the evolutionary view that speech arose from a combination of syllable-sized vocalizations and meaningful hand gestures and with developmental observations suggesting phonemic analysis is a developmentally acquired process.
Abstract:
BACKGROUND: Molecular interaction information is a key resource in modern biomedical research. Publicly available data have previously been provided in a broad array of diverse formats, making access to them very difficult. The publication and wide implementation of the Human Proteome Organisation Proteomics Standards Initiative Molecular Interactions (HUPO PSI-MI) format in 2004 was a major step towards the establishment of a single, unified format by which molecular interactions should be presented, but it focused purely on protein-protein interactions. RESULTS: The HUPO-PSI has further developed the PSI-MI XML schema to enable the description of interactions between a wider range of molecular types, for example nucleic acids, chemical entities, and molecular complexes. Extensive details about each supported molecular interaction can now be captured, including the biological role of each molecule within that interaction, detailed description of interacting domains, and the kinetic parameters of the interaction. The format is supported by data management and analysis tools and has been adopted by major interaction data providers. Additionally, a simpler, tab-delimited format, MITAB2.5, has been developed for the benefit of users who require only minimal information in an easily accessible configuration. CONCLUSION: The PSI-MI XML2.5 and MITAB2.5 formats have been jointly developed by interaction data producers and providers from both the academic and commercial sectors, and are already widely implemented and well supported by an active development community. PSI-MI XML2.5 enables the description of highly detailed molecular interaction data and facilitates data exchange between databases and users without loss of information. MITAB2.5 is a simpler format appropriate for fast Perl parsing or for loading into Microsoft Excel.
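The one-interaction-per-line layout of MITAB2.5 makes parsing straightforward in any scripting language. A minimal sketch in Python (the 15-column ordering below follows the published MITAB2.5 layout but should be verified against the PSI-MI documentation; the file name is hypothetical):

```python
# Minimal sketch of reading a MITAB2.5 file (tab-delimited, one
# binary interaction per line). Column names follow the MITAB2.5
# specification's 15-column ordering; verify against the PSI-MI
# documentation before relying on them.
import csv

MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_id_a", "alt_id_b", "alias_a", "alias_b",
    "detection_method", "first_author", "publication_id",
    "taxid_a", "taxid_b", "interaction_type",
    "source_db", "interaction_id", "confidence",
]

def read_mitab25(path):
    """Yield one dict per interaction line; '-' marks a missing value."""
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            record = dict(zip(MITAB25_COLUMNS, row))
            yield {k: (None if v == "-" else v) for k, v in record.items()}

# Hypothetical usage:
# for interaction in read_mitab25("interactions.mitab25.txt"):
#     print(interaction["id_a"], interaction["interaction_type"])
```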
Abstract:
This master's thesis investigates how the Eclipse environment can be used in test case generation. One of the main topics of the thesis is to study whether existing Eclipse components can improve symbol information, so that more information becomes available for test case generation. The thesis first gives a short overview of software testing, so that the reader understands which area of software engineering the thesis addresses. After that, more information is given about the test case generation process itself. Once the basics have been covered, the reader is introduced to the Eclipse environment: what it is, what it consists of, and what it can be used for. More detailed information is given about the Eclipse components that can be used to support test case generation. As an integration example, the thesis presents the integration of an existing test case generator into the Eclipse environment. Finally, the Eclipse-based solution is compared with the earlier solution in terms of symbol information and running time. The result of the thesis is a prototype that demonstrated that it is possible to integrate a test case generator into the Eclipse environment and that doing so can increase symbol information. This increase in information, however, also increased the required running time, in some cases significantly. It was also noted that there are ongoing projects aiming to improve the performance of the Eclipse components used, which may improve the results in the future.
Abstract:
This work studied the parsing of a file conforming to the IFC (Industry Foundation Classes) data model, further processing of the data, and data transfer between applications. The study examined what options exist for implementing the data transfer programmatically and in which direction data transfer is heading in the future. In the applied part, parsing and interpretation of an IFC-compliant ISO 10303 file (Part 21) into XML format was implemented. The application parses and interprets an IFC file produced with CAD software, using the C# programming language, and stores the data in an XML database to be read by cost estimation software.
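An ISO 10303 Part 21 file stores each entity instance as a line of the form `#id=TYPE(arg1,arg2,...);`, which makes the parse-and-emit idea easy to illustrate. A minimal sketch (the thesis work itself was done in C#; the regular expression and XML layout here are illustrative assumptions, and real Part 21 files need fuller handling of quoted strings and nested aggregates):

```python
# Minimal sketch: parse ISO 10303-21 entity instance lines such as
#   #42=IFCWALL('2O2Fr$t4X7Zf8NOew3FL9r',#5,'Wall-001',$);
# and emit them as XML elements. Real Part 21 data needs proper
# handling of quoted strings, nested lists and multi-line records.
import re
import xml.etree.ElementTree as ET

ENTITY_RE = re.compile(r"#(\d+)\s*=\s*([A-Z0-9_]+)\s*\((.*?)\);", re.DOTALL)

def step_to_xml(text):
    root = ET.Element("ifc")
    for match in ENTITY_RE.finditer(text):
        entity_id, entity_type, args = match.groups()
        element = ET.SubElement(root, "entity",
                                id=entity_id, type=entity_type)
        element.text = args.strip()  # arguments kept as raw text here
    return ET.tostring(root, encoding="unicode")

sample = "#42=IFCWALL('2O2Fr$t4X7Zf8NOew3FL9r',#5,'Wall-001',$);"
print(step_to_xml(sample))
```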
Abstract:
An increasing number of studies in recent years have sought to identify individual inventors from patent data. A variety of heuristics have been proposed for using the names and other information disclosed in patent documents to establish who is who in patents. This paper contributes to this literature by describing a methodology for identifying inventors using patent applications filed at the European Patent Office, EPO hereafter. As in much of this literature, we follow a three-step procedure: 1) the parsing stage, aimed at reducing the noise in the inventor's name and other fields of the patent; 2) the matching stage, where name matching algorithms are used to group similar names; and 3) the filtering stage, where additional information and various scoring schemes are used to filter these similarly named inventors. The paper presents the results obtained by applying the algorithms to the set of European inventors applying to the EPO over a long period of time.
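A toy sketch of the three stages, with invented example records (the actual scoring schemes and thresholds in the paper are far richer than the similarity heuristic used here):

```python
# Toy sketch of the three-stage disambiguation pipeline:
# parse (normalize names), match (group similar names),
# filter (keep pairs whose non-name fields also agree).
# Records, threshold and similarity measure are illustrative only.
import re
from difflib import SequenceMatcher

def parse_name(raw):
    """Parsing stage: lowercase, strip punctuation and titles."""
    name = re.sub(r"\b(dr|prof|ing)\.?\b", "", raw.lower())
    return re.sub(r"[^a-z ]", " ", name).split()

def name_similarity(a, b):
    """Matching stage: token-sorted string similarity."""
    return SequenceMatcher(None, " ".join(sorted(a)),
                           " ".join(sorted(b))).ratio()

def same_inventor(rec1, rec2, threshold=0.85):
    """Filtering stage: require a similar name plus a shared
    non-name field (here: city) before merging two records."""
    similar = name_similarity(parse_name(rec1["name"]),
                              parse_name(rec2["name"])) >= threshold
    return similar and rec1["city"] == rec2["city"]

r1 = {"name": "Dr. John A. Smith", "city": "Munich"}
r2 = {"name": "Smith, John", "city": "Munich"}
print(same_inventor(r1, r2))  # True for this toy pair
```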
Abstract:
Machine learning provides tools for automated construction of predictive models in data intensive areas of engineering and science. The family of regularized kernel methods has in recent years become one of the mainstream approaches to machine learning, due to a number of advantages these methods share. The approach provides theoretically well-founded solutions to the problems of under- and overfitting, allows learning from structured data, and has been empirically demonstrated to yield high predictive performance on a wide range of application domains. Historically, the problems of classification and regression have gained the majority of attention in the field. In this thesis we focus on another type of learning problem, that of learning to rank. In learning to rank, the aim is to learn, from a set of past observations, a ranking function that can order new objects according to how well they match some underlying criterion of goodness. As an important special case of the setting, we can recover the bipartite ranking problem, corresponding to maximizing the area under the ROC curve (AUC) in binary classification. Ranking applications appear in a large variety of settings; examples encountered in this thesis include document retrieval in web search, recommender systems, information extraction and automated parsing of natural language. We consider the pairwise approach to learning to rank, where ranking models are learned by minimizing the expected probability of ranking any two randomly drawn test examples incorrectly. The development of computationally efficient kernel methods, based on this approach, has in the past proven to be challenging. Moreover, it is not clear what techniques for estimating the predictive performance of learned models are the most reliable in the ranking setting, and how the techniques can be implemented efficiently. The contributions of this thesis are as follows. First, we develop RankRLS, a computationally efficient kernel method for learning to rank, that is based on minimizing a regularized pairwise least-squares loss. In addition to training methods, we introduce a variety of algorithms for tasks such as model selection, multi-output learning, and cross-validation, based on computational shortcuts from matrix algebra. Second, we improve the fastest known training method for the linear version of the RankSVM algorithm, which is one of the most well established methods for learning to rank. Third, we study the combination of the empirical kernel map and reduced set approximation, which allows the large-scale training of kernel machines using linear solvers, and propose computationally efficient solutions to cross-validation when using the approach. Next, we explore the problem of reliable cross-validation when using AUC as a performance criterion, through an extensive simulation study. We demonstrate that the proposed leave-pair-out cross-validation approach leads to more reliable performance estimation than commonly used alternative approaches. Finally, we present a case study on applying machine learning to information extraction from biomedical literature, which combines several of the approaches considered in the thesis. The thesis is divided into two parts. Part I provides the background for the research work and summarizes the most central results; Part II consists of the five original research articles that are the main contribution of this thesis.
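The bipartite special case makes the pairwise view concrete: AUC equals the fraction of (positive, negative) pairs that the model orders correctly. A small self-contained illustration (scores and labels are invented; ties are counted as half, a common convention):

```python
# AUC computed directly from its pairwise definition: the fraction
# of (positive, negative) example pairs ranked in the right order.
# Minimizing pairwise ranking errors, as in RankRLS or RankSVM,
# therefore maximizes AUC in the bipartite case.
def pairwise_auc(scores, labels):
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0
                  for p in positives for n in negatives)
    return correct / (len(positives) * len(negatives))

# Invented toy data: three positives, three negatives.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(pairwise_auc(scores, labels))  # 8/9 pairs correctly ordered
```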
Abstract:
The purpose of this work is to design and implement a web-based price estimation system for power plant solutions for the use of Savonia Power Oy. The purpose of the system is to automate the calculation of key figures for power plant solutions based on initial values supplied by the customer, and to store a possible contact request. The requirements for the system are that the calculation formulas be easy to update, that the formulas be fetched automatically from an Excel 2007 format file, and that the customer interface be web-based. The work is divided into two parts. The theoretical part reviews the background of the technologies used and describes the structure of Microsoft's OOXML file format to the extent required by the work. In the practical part, the complete system is designed, and partly implemented, using the PHP language, the XML markup language, and a MySQL database. The greatest challenges in implementing the system are parsing the calculation formulas from the Excel file without strictly constraining its content to a fixed template, and making the system easy to update with the extracted formulas. The end result of the work is a functional but unpolished system, together with this document. The main significance of the work lies in resolving the design challenges mentioned above, as well as in the finished program skeleton for a system taken into general use.
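Because an Excel 2007 workbook is a ZIP archive of OOXML XML parts, formulas can be pulled out with standard tools. A minimal sketch of the idea (the thesis implementation is in PHP; this Python version assumes a single worksheet at the conventional path xl/worksheets/sheet1.xml and ignores shared formulas and multiple sheets):

```python
# Minimal sketch: list cell formulas from an Excel 2007 (.xlsx)
# file. The workbook is a ZIP of OOXML parts; a formula cell
# appears in the sheet XML as <c r="B2"><f>A1*2</f><v>...</v></c>.
# The sheet path and file name below are illustrative assumptions.
import zipfile
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

def extract_formulas(xlsx_path, sheet="xl/worksheets/sheet1.xml"):
    formulas = {}
    with zipfile.ZipFile(xlsx_path) as archive:
        root = ET.fromstring(archive.read(sheet))
        for cell in root.iter("{%s}c" % NS["m"]):
            formula = cell.find("m:f", NS)
            if formula is not None and formula.text:
                formulas[cell.get("r")] = formula.text
    return formulas

# Hypothetical usage:
# print(extract_formulas("pricing_model.xlsx"))
```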
Abstract:
Presentation at Open Repositories 2014, Helsinki, Finland, June 9-13, 2014
Abstract:
Biomedical natural language processing (BioNLP) is a subfield of natural language processing, an area of computational linguistics concerned with developing programs that work with natural language: written texts and speech. Biomedical relation extraction concerns the detection of semantic relations such as protein-protein interactions (PPI) from scientific texts. The aim is to enhance information retrieval by detecting relations between concepts, not just individual concepts as with a keyword search. In recent years, events have been proposed as a more detailed alternative to simple pairwise PPI relations. Events provide a systematic, structural representation for annotating the content of natural language texts. Events are characterized by annotated trigger words, directed and typed arguments and the ability to nest other events. For example, the sentence “Protein A causes protein B to bind protein C” can be annotated with the nested event structure CAUSE(A, BIND(B, C)). Converted to such formal representations, the information of natural language texts can be used by computational applications. Biomedical event annotations were introduced by the BioInfer and GENIA corpora, and event extraction was popularized by the BioNLP'09 Shared Task on Event Extraction. In this thesis we present a method for automated event extraction, implemented as the Turku Event Extraction System (TEES). A unified graph format is defined for representing event annotations and the problem of extracting complex event structures is decomposed into a number of independent classification tasks. These classification tasks are solved using SVM and RLS classifiers, utilizing rich feature representations built from full dependency parsing. Building on earlier work on pairwise relation extraction and using a generalized graph representation, the resulting TEES system is capable of detecting binary relations as well as complex event structures. We show that this event extraction system has good performance, reaching first place in the BioNLP'09 Shared Task on Event Extraction. Subsequently, TEES has achieved several first ranks in the BioNLP'11 and BioNLP'13 Shared Tasks, as well as shown competitive performance in the binary relation Drug-Drug Interaction Extraction 2011 and 2013 shared tasks. The Turku Event Extraction System is published as a freely available open-source project, documenting the research in detail as well as making the method available for practical applications. In particular, in this thesis we describe the application of the event extraction method to PubMed-scale text mining, showing how the developed approach not only shows good performance, but is generalizable and applicable to large-scale real-world text mining projects. Finally, we discuss related literature, summarize the contributions of the work and present some thoughts on future directions for biomedical event extraction. This thesis includes and builds on six original research publications. The first of these introduces the analysis of dependency parses that leads to the development of TEES. The entries in the three BioNLP Shared Tasks, as well as in the DDIExtraction 2011 task, are covered in four publications, and the sixth one demonstrates the application of the system to PubMed-scale text mining.
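The nested annotation in the example above maps naturally onto a small recursive data structure. A sketch of that idea (this is not the TEES-internal graph format, which the thesis defines in full; class and field names are invented):

```python
# Sketch of a recursive event representation for the example
# "Protein A causes protein B to bind protein C", annotated as
# CAUSE(A, BIND(B, C)). Arguments are typed and may be proteins
# or further events, allowing arbitrary nesting.
from dataclasses import dataclass, field
from typing import List, Tuple, Union

@dataclass
class Protein:
    name: str

@dataclass
class Event:
    trigger: str      # annotated trigger word, e.g. "causes"
    event_type: str   # e.g. "CAUSE", "BIND"
    arguments: List[Tuple[str, Union["Protein", "Event"]]] = \
        field(default_factory=list)

bind = Event("bind", "BIND",
             [("Theme", Protein("B")), ("Theme", Protein("C"))])
cause = Event("causes", "CAUSE",
              [("Cause", Protein("A")), ("Theme", bind)])
print(cause)  # nested structure mirrors CAUSE(A, BIND(B, C))
```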
Abstract:
This thesis introduces an extension of Chomsky's context-free grammars equipped with operators for referring to left and right contexts of strings. The new model is called grammar with contexts. The semantics of these grammars are given in two equivalent ways: by language equations and by logical deduction, where a grammar is understood as a logic for the recursive definition of syntax. The motivation for grammars with contexts comes from an extensive example that completely defines the syntax and static semantics of a simple typed programming language. Grammars with contexts maintain the most important practical properties of context-free grammars, including a variant of the Chomsky normal form. For grammars with one-sided contexts (that is, either left or right), there is a cubic-time tabular parsing algorithm, applicable to an arbitrary grammar. The time complexity of this algorithm can be improved to quadratic, provided that the grammar is unambiguous, that is, it only allows one parse for every string it defines. A tabular parsing algorithm for grammars with two-sided contexts has fourth-power time complexity. For these grammars there is a recognition algorithm that uses a linear amount of space. For certain subclasses of grammars with contexts there are low-degree polynomial parsing algorithms. One of them is an extension of the classical recursive descent for context-free grammars; the version for grammars with contexts still works in linear time like its prototype. Another algorithm, with time complexity varying from linear to cubic depending on the particular grammar, adapts deterministic LR parsing to the new model. If all context operators in a grammar define regular languages, then such a grammar can be transformed to an equivalent grammar without context operators at all. This allows one to represent the syntax of languages in a more succinct way by utilizing context specifications. Linear grammars with contexts turned out to be non-trivial already over a one-letter alphabet. This fact leads to some undecidability results for this family of grammars.
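Since grammars with contexts keep a variant of the Chomsky normal form, their cubic-time tabular algorithm generalizes classical context-free tabular parsing. For orientation, here is a minimal CYK recognizer for a CNF context-free grammar, the baseline the new algorithm extends (the grammar and string are toy examples; the context operators themselves would add further deduction steps not shown here):

```python
# Classical CYK tabular recognition for a grammar in Chomsky
# normal form: the cubic-time baseline that the tabular algorithm
# for grammars with one-sided contexts extends.
# table[i][j] holds the nonterminals deriving the substring w[i:j+1].
def cyk(word, terminal_rules, binary_rules, start="S"):
    n = len(word)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, symbol in enumerate(word):
        table[i][i] = {lhs for lhs, t in terminal_rules if t == symbol}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):  # split point of the substring
                for lhs, (b, c) in binary_rules:
                    if b in table[i][k] and c in table[k + 1][j]:
                        table[i][j].add(lhs)
    return start in table[0][n - 1]

# Toy grammar for { a^n b^n : n >= 1 } in CNF:
#   S -> A B | A X,  X -> S B,  A -> a,  B -> b
terminals = [("A", "a"), ("B", "b")]
binaries = [("S", ("A", "B")), ("S", ("A", "X")), ("X", ("S", "B"))]
print(cyk("aabb", terminals, binaries))  # True
```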
Abstract:
Dynamic logic is an extension of modal logic originally intended for reasoning about computer programs. The method of proving correctness of properties of a computer program using the well-known Hoare logic can be implemented by utilizing the robustness of dynamic logic. For a very broad range of languages and applications in program verification, a theorem prover named KIV (Karlsruhe Interactive Verifier) has already been developed. But its high degree of automation and its complexity make it difficult to use for educational purposes. My research work is directed towards the design and implementation of a similar interactive theorem prover with educational use as its main design criterion. As the key purpose of this system is to serve as an educational tool, it is a self-explanatory system that explains every step of creating a derivation, i.e., proving a theorem. This deductive system is implemented in the platform-independent programming language Java. In addition, a very popular combination of the lexical analyzer generator JFlex and the parser generator BYacc/J has been used for parsing formulas and programs.
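The JFlex/BYacc/J toolchain generates the lexer and parser from declarative specifications; the parsing idea itself can be illustrated with a hand-written recursive-descent sketch for a tiny fragment of dynamic logic (the concrete syntax below, with `[a]p` for the box modality and `->` for implication, is an assumption for illustration, not the system's actual grammar):

```python
# Recursive-descent sketch for a tiny dynamic-logic fragment:
#   formula ::= unary | unary '->' formula
#   unary   ::= atom | '[' program ']' unary
# with atoms and atomic programs as single lowercase letters.
# The surface syntax here is illustrative, not the thesis grammar.
import re

def tokenize(text):
    return re.findall(r"\[|\]|->|[a-z]", text)

def parse(tokens):
    formula, rest = parse_formula(tokens)
    if rest:
        raise SyntaxError("trailing tokens: %r" % rest)
    return formula

def parse_formula(tokens):
    left, rest = parse_unary(tokens)
    if rest and rest[0] == "->":       # right-associative implication
        right, rest = parse_formula(rest[1:])
        return ("implies", left, right), rest
    return left, rest

def parse_unary(tokens):
    if tokens and tokens[0] == "[":    # box modality [a]phi
        if len(tokens) < 3 or tokens[2] != "]":
            raise SyntaxError("malformed box modality")
        body, rest = parse_unary(tokens[3:])
        return ("box", tokens[1], body), rest
    if tokens and tokens[0].isalpha():
        return ("atom", tokens[0]), tokens[1:]
    raise SyntaxError("unexpected input: %r" % tokens[:1])

print(parse(tokenize("[a]p -> q")))
# ('implies', ('box', 'a', ('atom', 'p')), ('atom', 'q'))
```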
Abstract:
In this dissertation, we present several techniques for learning semantic spaces for several domains, for example words and images, but also at the intersection of different domains. A representation space is called semantic if entities judged similar by a human being have their similarity preserved in that space. The first publication presents a pipeline of learning methods, including several unsupervised learning techniques, that allowed us to win the Unsupervised and Transfer Learning Challenge in 2011. The second article presents a way of extracting information from a structured context (177 object detectors at different positions and scales). We show that exploiting the structure of the data, combined with unsupervised learning, reduces dimensionality by 97% while improving scene recognition performance by +5% to +11% depending on the dataset. The third work examines the structure learned by the deep neural networks used in the two previous publications. Several hypotheses are presented and tested experimentally, showing that the learned space has better mixing properties (facilitating the exploration of different classes during the sampling process). The fourth publication addresses a problem of syntactic and semantic parsing with recurrent neural networks trained on word context windows. In our fifth work, we propose a way of performing "augmented" image search by learning a joint semantic space, in which a search for an image containing an object would also return images of the object's parts; for example, a search returning images of "car" would also return images of "windshields", "trunks" and "wheels" in addition to the initial images.
Abstract:
This thesis summarizes the results of studies on a syntax-based approach to translation between Malayalam, one of the Dravidian languages, and English, and on the development of the major modules in building a prototype machine translation system from Malayalam to English. The development of the system is a pioneering effort for the Malayalam language, unattempted by previous researchers. The computational models chosen for the system are the first of their kind for the Malayalam language. An in-depth study has been carried out in the design of the computational models and data structures needed for the different modules: a morphological analyzer, a parser, a syntactic structure transfer module and a target language sentence generator required for the prototype system. The generation of the list of part-of-speech tags, chunk tags and the hierarchical dependencies among the chunks required for the translation process has also been completed. In the development process, the major goals are: (a) accuracy of translation, (b) speed and (c) space. Accuracy-wise, smart tools for handling transfer grammar and translation standards, including equivalent words, expressions, phrases and styles in the target language, are to be developed. The grammar should be optimized with a view to obtaining a single correct parse and hence a single translated output. Speed-wise, innovative use of corpus analysis, an efficient parsing algorithm, design of efficient data structures and run-time frequency-based rearrangement of the grammar, which substantially reduces the parsing and generation time, are required. The space requirement also has to be minimised.
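The module decomposition above can be read as a simple pipeline. A skeletal sketch of that data flow (every stage below is a placeholder stub; the real modules embody the morphological rules, transfer grammar and generation logic the thesis develops):

```python
# Skeleton of the Malayalam-to-English transfer pipeline described
# above: morphological analysis, parsing, syntactic structure
# transfer, then target sentence generation. All bodies are stubs.
def morphological_analysis(sentence):
    """Split each word into a stem and inflectional tags (stub)."""
    return [(word, []) for word in sentence.split()]

def parse(analyzed):
    """Build a source-language parse over chunks (stub)."""
    return {"chunks": analyzed, "dependencies": []}

def transfer(source_tree):
    """Map the Malayalam structure onto English word order (stub)."""
    return source_tree

def generate(target_tree):
    """Linearize the transferred tree into an English sentence (stub)."""
    return " ".join(word for word, tags in target_tree["chunks"])

def translate(sentence):
    # Composition of the four modules in the order described above.
    return generate(transfer(parse(morphological_analysis(sentence))))
```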
Abstract:
This thesis aims at empowering software customers with a tool to build software tests themselves, based on a gradual refinement of natural language scenarios into executable visual test models. The process is divided into five steps: 1. First, a natural language parser is used to extract a graph of grammatical relations from the textual scenario descriptions. 2. The resulting graph is transformed into an informal story pattern by interpreting structurization rules based on Fujaba Story Diagrams. 3. While the informal story pattern can already be used by humans, the diagram still lacks technical details, especially type information. To add them, a recommender-based framework uses web sites and other resources to generate formalization rules. 4. In preparation for code generation, the classes derived for formal story patterns are aligned across all story steps, substituting for a class diagram. 5. Finally, a headless version of Fujaba is used to generate an executable JUnit test. The graph transformations used in the browser application are specified in a textual domain-specific language and visualized as story patterns. Last but not least, only the heavyweight parsing (step 1) and code generation (step 5) are executed on the server side. All graph transformation steps (2, 3 and 4) are executed in the browser by an interpreter written in JavaScript/GWT. This result paves the way for online collaboration between global teams of software customers, IT business analysts and software developers.
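The output of step 1, a graph of grammatical relations, is the data structure that all later transformation steps rewrite. A minimal sketch of such a graph and one step 2 structurization rule (the relation names and the rule are invented for illustration; the actual rules target Fujaba story diagrams):

```python
# Sketch of step 1 output and one step 2 structurization rule.
# The grammatical-relation graph for "the customer opens an account"
# is written down by hand here; a real run would obtain it from a
# natural language parser. Relation names are illustrative.
relations = [
    ("opens", "nsubj", "customer"),   # subject of the verb
    ("opens", "dobj", "account"),     # direct object of the verb
]

def structurize(grammatical_relations):
    """One rule: for each verb, connect its subject (acting object)
    to its direct object, using the verb as the link label of the
    informal story-pattern edge."""
    subjects = {head: dep for head, rel, dep in grammatical_relations
                if rel == "nsubj"}
    return [{"from": subjects.get(head), "link": head, "to": dep}
            for head, rel, dep in grammatical_relations if rel == "dobj"]

print(structurize(relations))
# [{'from': 'customer', 'link': 'opens', 'to': 'account'}]
```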
Abstract:
Free-word order languages have long posed significant problems for standard parsing algorithms. This thesis presents an implemented parser, based on Government-Binding (GB) theory, for a particular free-word order language, Warlpiri, an aboriginal language of central Australia. The words in a sentence of a free-word order language may swap about relatively freely with little effect on meaning: the permutations of a sentence mean essentially the same thing. It is assumed that this similarity in meaning is directly reflected in the syntax. The parser presented here properly processes free word order because it assigns the same syntactic structure to the permutations of a single sentence. The parser also handles fixed word order, as well as other phenomena. On the view presented here, there is no such thing as a "configurational" or "non-configurational" language. Rather, there is a spectrum of languages that are more or less ordered. The operation of this parsing system is quite different in character from that of more traditional rule-based parsing systems, e.g., context-free parsers. In this system, parsing is carried out via the construction of two different structures, one encoding precedence information and one encoding hierarchical information. This bipartite representation is the key to handling both free- and fixed-order phenomena. This thesis first presents an overview of the portion of Warlpiri that can be parsed. Following this is a description of the linguistic theory on which the parser is based. The chapter after that describes the representations and algorithms of the parser. In conclusion, the parser is compared to related work. The appendix contains a substantial list of test cases, both grammatical and ungrammatical, that the parser has actually processed.
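The bipartite idea can be made concrete with a toy example: the precedence structure changes with every permutation of a sentence, while a hierarchy recovered from case marking does not. A sketch, with invented case suffixes standing in for actual Warlpiri morphology:

```python
# Toy illustration of the bipartite representation: precedence
# (linear order) varies across permutations of a sentence, while
# the hierarchical structure, recovered here from invented case
# suffixes rather than from word order, stays the same.
CASE_ROLES = {"-erg": "subject", "-abs": "object"}

def analyze(words):
    precedence = list(words)   # order-dependent structure
    hierarchy = {}             # order-independent structure
    for word in words:
        for suffix, role in CASE_ROLES.items():
            if word.endswith(suffix):
                hierarchy[role] = word[: -len(suffix)]
        if word.endswith("-v"):           # toy verb marker
            hierarchy["predicate"] = word[:-2]
    return precedence, hierarchy

s1 = analyze(["man-erg", "kangaroo-abs", "spear-v"])
s2 = analyze(["spear-v", "man-erg", "kangaroo-abs"])
print(s1[0] != s2[0], s1[1] == s2[1])  # True True: same hierarchy
```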