999 results for SYNTACTIC DEPENDENCY NETWORKS
Abstract:
The realization that statistical physics methods can be applied to analyze written texts represented as complex networks has led to several developments in natural language processing, including automatic summarization and evaluation of machine translation. Most importantly, so far only a few metrics of complex networks have been used and therefore there is ample opportunity to enhance the statistics-based methods as new measures of network topology and dynamics are created. In this paper, we employ for the first time the metrics betweenness, vulnerability and diversity to analyze written texts in Brazilian Portuguese. Using strategies based on diversity metrics, a better performance in automatic summarization is achieved in comparison to previous work employing complex networks. With an optimized method the Rouge score (an automatic evaluation method used in summarization) was 0.5089, which is the best value ever achieved for an extractive summarizer with statistical methods based on complex networks for Brazilian Portuguese. Furthermore, the diversity metric can detect keywords with high precision, which is why we believe it is suitable to produce good summaries. It is also shown that incorporating linguistic knowledge through a syntactic parser does enhance the performance of the automatic summarizers, as expected, but the increase in the Rouge score is only minor. These results reinforce the suitability of complex network methods for improving automatic summarizers in particular, and treating text in general. (C) 2011 Elsevier B.V. All rights reserved.
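The diversity metric itself is defined in the paper and not reproduced in the abstract; as a minimal, hypothetical sketch of the overall pipeline (text to co-occurrence network to metric-ranked keywords), the snippet below uses plain node degree as a stand-in for the network metrics discussed, with invented example text:

```python
from collections import defaultdict

def cooccurrence_network(text, window=2):
    """Undirected word co-occurrence network: an edge links words
    appearing within `window` tokens of each other."""
    words = [w.lower().strip(".,;:!?") for w in text.split()]
    edges = defaultdict(int)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[i] != words[j]:
                edges[frozenset((words[i], words[j]))] += 1
    return edges

def keywords_by_degree(edges, k=3):
    """Rank words by degree (number of distinct neighbours), a simple
    stand-in for the centrality-style metrics discussed above."""
    degree = defaultdict(int)
    for pair in edges:
        for w in pair:
            degree[w] += 1
    return sorted(degree, key=degree.get, reverse=True)[:k]

text = ("complex networks model text structure; complex networks "
        "also support summarization of text")
kw = keywords_by_degree(cooccurrence_network(text))
```

An extractive summarizer in this style would then score sentences by how many highly ranked words they contain; the paper's contribution is in which metric does the ranking.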
Abstract:
The classification of texts has become a major endeavor with so much electronic material available, for it is an essential task in several applications, including search engines and information retrieval. There are different ways to define similarity for grouping similar texts into clusters, as the concept of similarity may depend on the purpose of the task. For instance, in topic extraction similar texts mean those within the same semantic field, whereas in author recognition stylistic features should be considered. In this study, we introduce ways to classify texts employing concepts of complex networks, which may be able to capture syntactic, semantic and even pragmatic features. The interplay between various metrics of the complex networks is analyzed with three applications, namely identification of machine translation (MT) systems, evaluation of the quality of machine-translated texts and authorship recognition. We show that topological features of the networks representing texts can enhance the ability to identify MT systems in particular cases. For evaluating the quality of MT texts, on the other hand, high correlation was obtained with methods capable of capturing the semantics. This was expected because the gold standards used are themselves based on word co-occurrence. Notwithstanding, the Katz similarity, which combines semantics and structure in the comparison of texts, achieved the highest correlation with the NIST measure, indicating that in some cases the combination of both approaches can improve the ability to quantify quality in MT. In authorship recognition, again the topological features were relevant in some contexts, though for the books and authors analyzed good results were obtained with semantic features as well.
Because hybrid approaches encompassing semantic and topological features have not been extensively used, we believe that the methodology proposed here may be useful to enhance text classification considerably, as it combines well-established strategies. (c) 2012 Elsevier B.V. All rights reserved.
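The Katz similarity mentioned above weighs all walks between nodes, damped by length; a minimal sketch of the generic Katz index, S = (I - beta*A)^(-1) - I with beta below the reciprocal of the spectral radius of A (the toy adjacency matrix is invented, not from the study):

```python
import numpy as np

# Adjacency matrix of a small undirected toy network (invented data).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# The Katz series converges when beta < 1 / spectral_radius(A).
beta = 0.9 / np.max(np.abs(np.linalg.eigvalsh(A)))

# S[i, j] sums beta**k over all walks of length k between i and j.
S = np.linalg.inv(np.eye(4) - beta * A) - np.eye(4)
```

Directly connected, well-embedded node pairs score highest; comparing such similarity structures across the networks of two texts is one way to combine semantic and topological information, as the abstract suggests.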
Abstract:
With the development of information technology, the theory and methodology of complex networks have been introduced into language research, representing the language system as a complex network composed of nodes and edges for quantitative analysis of language structure. The development of dependency grammar provides theoretical support for the construction of a treebank corpus, making a statistical analysis of complex networks possible. This paper introduces the theory and methodology of complex networks and builds dependency syntactic networks based on the treebank of speeches from the EEE-4 oral test. Through analysis of the overall characteristics of the networks, including the number of edges, the number of nodes, the average degree, the average path length, the network centrality and the degree distribution, it aims to find potential differences and similarities in the networks between various grades of speaking performance. Through clustering analysis, this research intends to demonstrate the discriminating power of the network parameters and provide a potential reference for scoring speaking performance.
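A dependency syntactic network of this kind can be built directly from head-dependent pairs extracted from a treebank; the sketch below uses invented pairs (not EEE-4 data) and computes the node count, edge count and average degree named above:

```python
from collections import defaultdict

# Hypothetical (dependent, head) pairs from a dependency-parsed
# treebank; attachments to the artificial ROOT are skipped.
dependencies = [
    ("I", "think"), ("think", "ROOT"), ("it", "works"),
    ("works", "think"), ("students", "speak"), ("speak", "ROOT"),
    ("English", "speak"), ("fluent", "English"),
]

adj = defaultdict(set)
for dep, head in dependencies:
    if head != "ROOT":          # drop the artificial root node
        adj[dep].add(head)
        adj[head].add(dep)      # treat the network as undirected

n_nodes = len(adj)
n_edges = sum(len(v) for v in adj.values()) // 2
avg_degree = 2 * n_edges / n_nodes   # each edge contributes two degrees
```

Average path length and degree distribution follow from the same adjacency structure via breadth-first search and a degree histogram, respectively.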
Abstract:
This thesis proposes a novel graphical model for inference called the Affinity Network, which displays the closeness between pairs of variables and is an alternative to Bayesian Networks and Dependency Networks. The Affinity Network shares some similarities with Bayesian Networks and Dependency Networks but avoids their heuristic and stochastic graph-construction algorithms by using a message-passing scheme. A comparison with the above two instances of graphical models is given for sparse discrete and continuous medical data and data taken from the UCI machine learning repository. The experimental study reveals that the Affinity Network graphs tend to be more accurate, on the basis of an exhaustive search, for the small datasets. Moreover, the graph-construction algorithm is faster than the other two methods on large datasets. The Affinity Network is also applied to data produced by a synchronised system. A detailed analysis and numerical investigation into this dynamical system is provided, and it is shown that the Affinity Network can be used to characterise its emergent behaviour even in the presence of noise.
Abstract:
Grammar-checking software sometimes produces illegitimate detections (false alarms), which we here call overdetections. This study describes experiments in developing a system designed to identify and silence the overdetections produced by the French grammar checker developed by the company Druide informatique. Several classifiers were trained in a supervised fashion on 14 types of detections made by the checker, using features covering various kinds of linguistic information (syntactic dependencies and categories, exploration of word context, etc.) extracted from sentences with and without overdetections. Eight of the 14 classifiers developed are now integrated into the new version of a very popular commercial grammar checker. Our experiments also showed that probabilistic language models, SVMs and word-sense disambiguation improve the quality of these classifiers. This work is a successful example of deploying a machine-learning approach in the service of a robust, mainstream language application.
Abstract:
This book argues for novel strategies to integrate engineering design procedures and structural analysis data into architectural design. Algorithmic procedures that have recently migrated into architectural practice are utilized to improve the interface between the two disciplines. Architectural design is predominantly conducted as a negotiation process of various factors but often lacks the rigor and data structures to link it to quantitative procedures. Numerical structural design, on the other hand, could act as a role model for handling data and robust optimization, but it often lacks the complexity of architectural design. The goal of this research is to bring together robust methods from structural design and the complex dependency networks of architectural design processes. The book presents three case studies of tools and methods that were developed to exemplify, analyze and evaluate a collaborative workflow.
Machine Learning applied to the Semantic Web: Statistical Relational Learning vs Tensor Factorization
Abstract:
The objective of this thesis is to analyze and test the main Machine Learning approaches applicable in semantic contexts, starting from Statistical Relational Learning algorithms such as Relational Probability Trees, Relational Bayesian Classifiers and Relational Dependency Networks, and then moving to approaches based on tensor factorization, in particular CANDECOMP/PARAFAC, Tucker and RESCAL.
Abstract:
Background: Microarray techniques have become an important tool for the investigation of genetic relationships and the assignment of different phenotypes. Since microarrays are still very expensive, most experiments are performed with small samples. This paper introduces a method to quantify dependency between data series composed of few sample points. The method is used to construct gene co-expression subnetworks of highly significant edges. Results: The results shown here are for an adapted subset of a Saccharomyces cerevisiae gene expression data set with low temporal resolution and poor statistics. The method reveals common transcription factors with a high confidence level and allows the construction of subnetworks with high biological relevance that reveal characteristic features of the processes driving the organism's adaptations to specific environmental conditions. Conclusion: Our method allows a reliable and sophisticated analysis of microarray data even under severe constraints. The use of systems biology improves biologists' ability to elucidate the mechanisms underlying cellular processes and to formulate new hypotheses.
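The paper's dependency measure for short series is its own contribution and is not reproduced in the abstract; as a generic, hypothetical baseline for co-expression subnetwork construction, the sketch below links genes whose expression profiles exceed a Pearson correlation threshold:

```python
from math import sqrt
from itertools import combinations

def pearson(x, y):
    """Sample Pearson correlation between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical expression profiles over five time points.
profiles = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [2.1, 3.9, 6.2, 8.0, 9.9],  # tracks GENE_A closely
    "GENE_C": [5.0, 4.1, 2.9, 2.0, 1.1],  # anti-correlated with GENE_A
}

# Keep only strongly associated pairs as subnetwork edges.
edges = [(g1, g2) for g1, g2 in combinations(profiles, 2)
         if abs(pearson(profiles[g1], profiles[g2])) > 0.95]
```

With so few sample points, a plain correlation threshold is exactly what the paper argues is unreliable; its method replaces this scoring step with a dependency measure designed for small samples.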
Abstract:
Background: DAPfinder and DAPview are novel BRB-ArrayTools plug-ins to construct gene coexpression networks and identify significant differences in pairwise gene-gene coexpression between two phenotypes. Results: Each significant difference in gene-gene association represents a Differentially Associated Pair (DAP). Our tools include several choices of filtering methods, gene-gene association metrics, statistical testing methods and multiple-comparison adjustments. Network results are easily displayed in Cytoscape. Analyses of glioma experiments and microarray simulations demonstrate the utility of these tools. Conclusions: DAPfinder is a new user-friendly tool for the reconstruction and comparison of biological networks.
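The abstract leaves the statistical test configurable; one standard choice for testing whether a gene-gene correlation differs between two phenotypes is Fisher's z-transformation, sketched here with invented correlations and sample sizes:

```python
from math import atanh, sqrt, erf

def fisher_z_diff(r1, n1, r2, n2):
    """Two-sided p-value for H0: the two population correlations are
    equal, using Fisher's z-transformation of each sample correlation."""
    z1, z2 = atanh(r1), atanh(r2)
    se = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Normal-approximation two-sided p-value.
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))

# Hypothetical: strong co-expression in tumours, none in controls.
p = fisher_z_diff(0.85, 40, 0.05, 40)
```

A pair whose p-value survives the chosen multiple-comparison adjustment would be reported as a Differentially Associated Pair.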
Abstract:
Admission controls, such as trunk reservation, are often used in loss networks to optimise their performance. Since the numerical evaluation of performance measures is complex, much attention has been given to finding approximation methods. The Erlang Fixed-Point (EFP) approximation, which is based on an independent blocking assumption, has been used for networks both with and without controls. Several more elaborate approximation methods which account for dependencies in blocking behaviour have been developed for the uncontrolled setting. This paper is an exploratory investigation of extensions and synthesis of these methods to systems with controls, in particular, trunk reservation. In order to isolate the dependency factor, we restrict our attention to a highly linear network. We will compare the performance of the resulting approximations against the benchmark of the EFP approximation extended to the trunk reservation setting. By doing this, we seek to gain insight into the critical factors in constructing an effective approximation. (C) 2003 Elsevier Ltd. All rights reserved.
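The EFP approximation reduces, for a single link, to the Erlang B formula; for a network, the per-link blocking probabilities are iterated to a fixed point under the independence assumption. A minimal two-link sketch without trunk reservation (loads and circuit counts are invented for illustration):

```python
def erlang_b(load, servers):
    """Erlang B blocking probability via the stable recurrence
    B(0) = 1,  B(k) = load*B(k-1) / (k + load*B(k-1))."""
    b = 1.0
    for k in range(1, servers + 1):
        b = load * b / (k + load * b)
    return b

# Two links in series carrying a single route offering 8 Erlangs;
# 10 circuits per link (all values hypothetical).
load, servers = 8.0, 10

# Independence fixed point: each link sees the route's load thinned
# by the blocking probability of the other link.
b1 = b2 = 0.0
for _ in range(100):
    b1 = erlang_b(load * (1.0 - b2), servers)
    b2 = erlang_b(load * (1.0 - b1), servers)

# End-to-end blocking under the independence assumption.
end_to_end = 1.0 - (1.0 - b1) * (1.0 - b2)
```

Trunk reservation would replace erlang_b with a state-dependent recursion; the independence assumption in the thinning step is precisely what the more elaborate approximations discussed in the paper try to relax.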
Abstract:
Using optimized voxel-based morphometry, we performed grey matter density analyses on 59 age-, sex- and intelligence-matched young adults with three distinct, progressive levels of musical training intensity or expertise. Structural brain adaptations in musicians have been repeatedly demonstrated in areas involved in auditory perception and motor skills. However, musical activities are not confined to auditory perception and motor performance, but are entangled with higher-order cognitive processes. In consequence, neuronal systems involved in such higher-order processing may also be shaped by experience-driven plasticity. We modelled expertise as a three-level regressor to study possible linear relationships of expertise with grey matter density. The key finding of this study resides in a functional dissimilarity between areas exhibiting increase versus decrease of grey matter as a function of musical expertise. Grey matter density increased with expertise in areas known for their involvement in higher-order cognitive processing: right fusiform gyrus (visual pattern recognition), right mid orbital gyrus (tonal sensitivity), left inferior frontal gyrus (syntactic processing, executive function, working memory), left intraparietal sulcus (visuo-motor coordination) and bilateral posterior cerebellar Crus II (executive function, working memory) and in auditory processing: left Heschl's gyrus. Conversely, grey matter density decreased with expertise in bilateral perirolandic and striatal areas that are related to sensorimotor function, possibly reflecting high automation of motor skills. Moreover, a multiple regression analysis evidenced that grey matter density in the right mid orbital area and the inferior frontal gyrus predicted accuracy in detecting fine-grained incongruities in tonal music.
Abstract:
Knowledge sharing and communication are important activities between networked companies, and they are regarded as a success factor and cornerstone of a collaborative relationship. Challenges related to knowledge sharing include the leakage of knowledge critical to a company's business, as well as the real-time availability and sufficient quantity of information that the business requires. In product development collaboration, the unstructured nature of knowledge is a particular challenge and consequently increases the need for knowledge sharing; in addition, the shared knowledge is often complex and detailed. Moreover, product life cycles are shortening, and outsourcing and collaboration are growing trends in business. Together these factors make knowledge sharing challenging, especially between networked companies. In this study, the challenges of knowledge sharing were addressed by taking as a starting point an understanding of the context dependency of knowledge sharing. The work answered two main questions: What is the context dependency of knowledge sharing, and how can it be managed? Context dependency here refers to the factors that affect how a company shares knowledge with its product development partners. Knowledge sharing, in turn, refers to the knowledge transferred from one company to another that is needed during a product development project. The empirical material was collected with a qualitative, case-study research approach in one telecommunications company and its different business units. The study population comprised 19 directors or managers working in product development and supplier management roles. The work draws mainly on the research field of purchasing and supply management, and to investigate context dependency it focused in particular on network research. The work described knowledge sharing as one function of a network, and the benefits, challenges and risks of collaborative knowledge sharing were identified.
In addition, the work developed models for studying networks and combined network research conducted at different levels. A model for studying network functions was presented, and it was concluded that network research should be carried out at the network, chain, business-relationship and company levels. Product- and task-specific characteristics should also be incorporated into the model. Based on the literature review, it was noted that knowledge sharing has previously been examined mainly at the product and business-relationship levels. The dissertation presented further significant factors that affect knowledge sharing, including the nature of the product development task, the maturity of the technology area and the capability of the supplier. In examining the nature of knowledge sharing, a distinction was made between operational knowledge, related to project management and product development, and general strategic knowledge, related to supplier management. According to the results, the specification phase of product development and face-to-face meetings were particularly emphasized in collaboration. The empirical data were also used to study the factors by which knowledge sharing can be managed on the basis of context dependency, since the management means and success factors of knowledge sharing had not previously been linked directly to different circumstances. These management means were divided into collaboration-level and product-development-project-level factors. One of the key results is that despite the challenges of knowledge sharing, many of them can be eliminated by recognizing the prevailing circumstances and investing in the means of managing knowledge sharing. The managerial benefit of the work concerns especially companies that plan and carry out product development collaboration with their partners. The work presents means for managing this challenging field and concludes that companies should pay even more attention to managing knowledge sharing and communication already when planning product development collaboration.
Abstract:
Biomedical research is currently facing a new type of challenge: an excess of information, both in terms of raw data from experiments and in the number of scientific publications describing their results. Mirroring the focus on data mining techniques to address the issues of structured data, there has recently been great interest in the development and application of text mining techniques to make more effective use of the knowledge contained in biomedical scientific publications, accessible only in the form of natural human language. This thesis describes research done in the broader scope of projects aiming to develop methods, tools and techniques for text mining tasks in general and for the biomedical domain in particular. The work described here involves more specifically the goal of extracting information from statements concerning relations of biomedical entities, such as protein-protein interactions. The approach taken is one using full parsing (syntactic analysis of the entire structure of sentences) and machine learning, aiming to develop reliable methods that can further be generalized to apply also to other domains. The five papers at the core of this thesis describe research on a number of distinct but related topics in text mining. In the first of these studies, we assessed the applicability of two popular general English parsers to biomedical text mining and, finding their performance limited, identified several specific challenges to accurate parsing of domain text. In a follow-up study focusing on parsing issues related to specialized domain terminology, we evaluated three lexical adaptation methods. We found that the accurate resolution of unknown words can considerably improve parsing performance and introduced a domain-adapted parser that reduced the error rate of the original by 10% while also roughly halving parsing time.
To establish the relative merits of parsers that differ in the applied formalisms and the representation given to their syntactic analyses, we have also developed evaluation methodology, considering different approaches to establishing comparable dependency-based evaluation results. We introduced a methodology for creating highly accurate conversions between different parse representations, demonstrating the feasibility of unifying diverse syntactic schemes under a shared, application-oriented representation. In addition to allowing formalism-neutral evaluation, we argue that such unification can also increase the value of parsers for domain text mining. As a further step in this direction, we analysed the characteristics of publicly available biomedical corpora annotated for protein-protein interactions and created tools for converting them into a shared form, thus contributing also to the unification of text mining resources. The introduced unified corpora allowed us to perform a task-oriented comparative evaluation of biomedical text mining corpora. This evaluation established clear limits on the comparability of results for text mining methods evaluated on different resources, prompting further efforts toward standardization. To support this and other research, we have also designed and annotated BioInfer, the first domain corpus of its size combining annotation of syntax and biomedical entities with a detailed annotation of their relationships. The corpus represents a major design and development effort of the research group, with manual annotation that identifies over 6000 entities, 2500 relationships and 28,000 syntactic dependencies in 1100 sentences. In addition to combining these key annotations for a single set of sentences, BioInfer was also the first domain resource to introduce a representation of entity relations that is supported by ontologies and able to capture complex, structured relationships.
Part I of this thesis presents a summary of this research in the broader context of a text mining system, and Part II contains reprints of the five included publications.
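Dependency-based parser evaluation of the kind described in the thesis reduces, once parses are converted to a shared representation, to precision and recall over (head, dependent, relation) triples; a minimal sketch with invented triples:

```python
def dependency_f1(gold, predicted):
    """Precision, recall and F1 over labelled dependency triples."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)          # triples both analyses agree on
    precision = tp / len(predicted)
    recall = tp / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# (head, dependent, relation) triples for one sentence (hypothetical).
gold = [("binds", "protein", "nsubj"),
        ("binds", "receptor", "obj"),
        ("receptor", "the", "det")]
pred = [("binds", "protein", "nsubj"),
        ("binds", "receptor", "obl"),   # wrong relation label
        ("receptor", "the", "det")]

p, r, f1 = dependency_f1(gold, pred)
```

The conversion step, not the scoring, is the hard part: the thesis's contribution is making outputs of formalism-diverse parsers comparable under one such shared triple representation.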