923 results for Information retrieval, dysorthography, dyslexia, finite state machines, readability
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
The classification of texts has become a major endeavor with so much electronic material available, for it is an essential task in several applications, including search engines and information retrieval. There are different ways to define similarity for grouping similar texts into clusters, as the concept of similarity may depend on the purpose of the task. For instance, in topic extraction similar texts mean those within the same semantic field, whereas in author recognition stylistic features should be considered. In this study, we introduce ways to classify texts employing concepts of complex networks, which may be able to capture syntactic, semantic and even pragmatic features. The interplay between various metrics of the complex networks is analyzed with three applications, namely identification of machine translation (MT) systems, evaluation of quality of machine translated texts and authorship recognition. We shall show that topological features of the networks representing texts can enhance the ability to identify MT systems in particular cases. For evaluating the quality of MT texts, on the other hand, high correlation was obtained with methods capable of capturing the semantics. This was expected because the gold standards used are themselves based on word co-occurrence. Nevertheless, the Katz similarity, which involves both semantics and structure in the comparison of texts, achieved the highest correlation with the NIST measurement, indicating that in some cases the combination of both approaches can improve the ability to quantify quality in MT. In authorship recognition, again the topological features were relevant in some contexts, though for the books and authors analyzed good results were obtained with semantic features as well.
Because hybrid approaches encompassing semantic and topological features have not been extensively used, we believe that the methodology proposed here may be useful to enhance text classification considerably, as it combines well-established strategies. (c) 2012 Elsevier B.V. All rights reserved.
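As a rough illustration of what "topological features of the networks representing texts" can look like, the sketch below builds a word adjacency network and computes two common metrics, degree and local clustering coefficient. It is only a minimal sketch: the adjacency-based construction, the function names `build_adjacency_network` and `clustering_coefficient`, and the sample sentence are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency_network(text):
    """Link each word to the word that immediately follows it (undirected).

    Assumption: a simple word adjacency model; the paper may build its
    networks differently (e.g. after stopword removal or lemmatisation).
    """
    words = text.lower().split()
    graph = defaultdict(set)
    for left, right in zip(words, words[1:]):
        if left != right:  # ignore self-loops
            graph[left].add(right)
            graph[right].add(left)
    return graph


def clustering_coefficient(graph, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
    return 2.0 * links / (k * (k - 1))


# Hypothetical sample text, used only to exercise the functions.
network = build_adjacency_network("the cat saw the dog and the dog saw the cat")
degrees = {word: len(neighbours) for word, neighbours in network.items()}
print(degrees["the"], clustering_coefficient(network, "the"))
```

Feature vectors built from such per-word metrics could then be fed to any standard classifier or combined with semantic (word frequency) features, which is the kind of hybrid the abstract argues for.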
The automatic disambiguation of word senses (i.e., the identification of which of the meanings is used in a given context for a word that has multiple meanings) is essential for such applications as machine translation and information retrieval, and represents a key step for developing the so-called Semantic Web. Humans disambiguate words in a straightforward fashion, but this does not apply to computers. In this paper we address the problem of Word Sense Disambiguation (WSD) by treating texts as complex networks, and show that word senses can be distinguished upon characterizing the local structure around ambiguous words. Our goal was not to obtain the best possible disambiguation system, but we nevertheless found that in half of the cases our approach outperforms traditional shallow methods. We show that the hierarchical connectivity and clustering of words are usually the most relevant features for WSD. The results reported here shed light on the relationship between semantic and structural parameters of complex networks. They also indicate that when combined with traditional techniques the complex network approach may be useful to enhance the discrimination of senses in large texts. Copyright (C) EPLA, 2012
This work reports the experience and procedures adopted in a process of analysis and identification of the journal titles received by the Library of the Instituto de Medicina Tropical de São Paulo of the Universidade de São Paulo since its creation. Data were collected from the bibliographic records in the Cataloguing Module of the Bibliographic Database DEDALUS (Aleph 500, Version 18.1) of the Universidade de São Paulo, following a set of pre-established criteria. It is concluded that, although the problems detected are of minor relevance relative to the collection analysed, a comparative study between user needs and the collection available in the Library should be maintained, so that the journals meet the information needs of its users.
The article presents an analysis of how folksonomies operate and of the possibility of applying this tool in information organization systems in the field of Information Science. To this end, an analysis was carried out of tag coherence and of the tagging features available on two websites, Last.fm and CiteULike. This analysis found incoherences and discrepancies in the tags used on both websites. However, the Last.fm system proved more functional than that of CiteULike, achieving better performance. Finally, the article suggests combining folksonomies with ontologies, which would allow the creation of automated systems for organizing informational content fed by the users themselves.
The need for a convergence between semi-structured data management and Information Retrieval techniques is manifest to the scientific community. In order to fulfil this growing request, W3C has recently proposed XQuery Full-Text, an IR-oriented extension of XQuery. However, the issue of query optimization requires the study of important properties like query equivalence and containment; to this end, a formal representation of documents and queries is needed. The goal of this thesis is to establish such a formal background. We define a data model for XML documents and propose an algebra able to represent most XQuery Full-Text expressions. We show how an XQuery Full-Text expression can be translated into an algebraic expression and how an algebraic expression can be optimized.
Service Oriented Computing is a new programming paradigm for addressing distributed system design issues. Services are autonomous computational entities which can be dynamically discovered and composed in order to form more complex systems able to achieve different kinds of task. E-government, e-business and e-science are some examples of the IT areas where Service Oriented Computing will be exploited in the coming years. At present, the most credited Service Oriented Computing technology is that of Web Services, whose specifications are enriched day by day by industrial consortia without following a precise and rigorous approach. This PhD thesis aims, on the one hand, at modelling Service Oriented Computing in a formal way in order to precisely define the main concepts it is based upon and, on the other hand, at defining a new approach, called the bipolar approach, for addressing system design issues by synergistically exploiting choreography and orchestration languages related by means of a mathematical relation called conformance. Choreography allows us to describe systems of services from a global viewpoint, whereas orchestration supplies a means for addressing the same issue from a local perspective. In this work we present SOCK, a process-algebra-based language inspired by the Web Service orchestration language WS-BPEL, which captures the essentials of Service Oriented Computing. From the definition of SOCK we will be able to define a general model for dealing with Service Oriented Computing where services and systems of services are related to the design of finite state automata and process algebra concurrent systems, respectively. Furthermore, we introduce a formal language for dealing with choreography. This language is equipped with a formal semantics and forms, together with a subset of the SOCK calculus, the bipolar framework.
Finally, we present JOLIE, a Java implementation of a subset of the SOCK calculus that is part of the bipolar framework we intend to promote.
Music informatics is a steadily growing discipline that is achieving truly interesting results through the use of intelligent artificial systems, such as neural networks, which make it possible to emulate human abilities in listening to and performing music. Of particular interest is the encoding of musical information in symbolic formats such as MIDI, which permits high-level analysis of musical pieces and enables surprisingly innovative applications. One of the most fruitful applications of these new encoding tools is the classification of musical audio files. This thesis sets out the fundamental theoretical aspects of classifying musical pieces with artificial neural networks and describes some experiments in classifying MIDI files. The first part provides background knowledge that allows the experiments in the second part to be read with deeper theoretical awareness. Its main purpose is to develop a comparison, from several disciplinary points of view, between human and artificial capacities for musical classification. Artificial neural networks are described as intelligent systems inspired by the structure of biological neural networks, with particular attention to the feedforward network and the backpropagation algorithm. The concept of perception is explored within cognitive psychology, with greater attention to auditory perception. After an outline of the basics of psychoacoustics, the thesis describes the structural components first of sound and then of music: the frequency and amplitude of waves, notes and timbre, harmony, melody and rhythm. Auditory illusions and the human brain's reprocessing of audio information are also discussed. The field closest to this thesis is then described: MIR (Music Information Retrieval).
The thesis analyses the fields that can benefit from this research, namely commercial ones, in which music databases play important roles, and more speculative and academic ones that study the behaviour of artificial and biological intelligent systems. It describes the different methods of musical classification, which can be catalogued by the format of the audio files involved and by the type of features to be extracted from them. The first, theoretical part concludes with a chapter devoted to MIDI that recounts the history of the protocol and describes its fundamental instructions as well as the structure of MIDI files. The second part describes the experiments carried out to classify MIDI files with neural networks, showing in detail the results obtained and the difficulties encountered. A presentation of the programs used and of the interface executables implemented is combined with a general description of the experimental procedure. The common goal of all the tests is to train a neural network so that it reaches the highest possible level of learning in recognising which of two composers wrote the pieces supplied to it as examples.
Synthetic Biology is a relatively new discipline, born at the beginning of the new millennium, that brings the typical engineering approach (abstraction, modularity and standardization) to biotechnology. These principles aim to tame the extreme complexity of the various components and aid the construction of artificial biological systems with specific functions, usually by means of synthetic genetic circuits implemented in bacteria or simple eukaryotes like yeast. The cell becomes a programmable machine and its low-level programming language is made of strings of DNA. This work was performed in collaboration with researchers of the Department of Electrical Engineering of the University of Washington in Seattle and also with a student of the Corso di Laurea Magistrale in Ingegneria Biomedica at the University of Bologna: Marilisa Cortesi. During the collaboration I contributed to a Synthetic Biology project already started in the Klavins Laboratory. In particular, I modeled and subsequently simulated a synthetic genetic circuit designed to implement a multicellular behavior in a growing bacterial microcolony. The first chapter introduces the foundations of molecular biology: structure of the nucleic acids, transcription, translation and methods to regulate gene expression. An introduction to Synthetic Biology completes the section. The second chapter describes the synthetic genetic circuit, conceived so that two different groups of cells, termed leaders and followers, emerge spontaneously from an isogenic microcolony of bacteria. The circuit exploits the intrinsic stochasticity of gene expression and intercellular communication via small molecules to break the symmetry in the phenotype of the microcolony. The four modules of the circuit (coin flipper, sender, receiver and follower) and their interactions are then illustrated.
The third chapter derives the mathematical representation of the various components of the circuit and makes the several simplifying assumptions explicit. Transcription and translation are modeled as a single step, and gene expression is a function of the intracellular concentration of the various transcription factors that act on the different promoters of the circuit. A list of the various parameters and a justification for their values close the chapter. The fourth chapter describes the main characteristics of the gro simulation environment, developed by the Self Organizing Systems Laboratory of the University of Washington. It then details a sensitivity analysis performed to pinpoint the desirable characteristics of the various genetic components. The sensitivity analysis makes use of a cost function based on the fraction of cells in each of the possible states at the end of the simulation and on the desired outcome. The parameters are ranked with the help of a particular kind of scatter plot. Starting from an initial condition in which all the parameters assume their nominal values, the ranking suggests which parameter to tune in order to reach the goal. Obtaining a microcolony in which almost all the cells are in the follower state and only a few are in the leader state seems to be the most difficult task, because a small number of leader cells struggle to produce enough signal to turn the rest of the microcolony into the follower state. It is possible to obtain a microcolony in which the majority of cells are followers by increasing the production of signal as much as possible. Reaching the goal of a microcolony that is split in half between leaders and followers is comparatively easy; the best strategy seems to be a slight increase in the production of the enzyme. To end up with a majority of leaders, instead, it is advisable to increase the basal expression of the coin flipper module.
At the end of the chapter, a possible future application of the leader election circuit, the spontaneous formation of spatial patterns in a microcolony, is modeled with the finite state machine formalism. The gro simulations provide insights into the genetic components that are needed to implement the behavior. In particular, since both examples of pattern formation rely on a local version of Leader Election, a short-range communication system is essential. Moreover, new synthetic components need to be developed that allow the growth rate to be reliably downregulated in specific cells without side effects. The appendix lists the gro code used to simulate the model of the circuit, a Python script used to split the simulations across a Linux cluster, and the Matlab code developed to analyze the data.
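The cost function mentioned in this abstract (based on the fraction of cells in each state at the end of a simulation and on the desired outcome) could take a form like the one sketched below. This is only an illustration under assumptions: the squared-difference form, the function names `state_fractions` and `cost`, the extra `undecided` state and the sample numbers are all hypothetical and are not taken from the thesis.

```python
def state_fractions(final_states, possible_states):
    """Fraction of cells found in each possible state at the end of a run."""
    if not final_states:
        raise ValueError("the simulation produced no cells")
    total = len(final_states)
    return {state: final_states.count(state) / total for state in possible_states}


def cost(final_states, target):
    """Sum of squared differences between observed and desired fractions.

    Assumption: a squared-error form; the thesis only says the cost depends
    on the per-state fractions and on the desired outcome.
    """
    observed = state_fractions(final_states, target.keys())
    return sum((observed[state] - target[state]) ** 2 for state in target)


# Hypothetical end-of-simulation snapshot: 2 leaders, 6 followers, 2 undecided.
snapshot = ["leader"] * 2 + ["follower"] * 6 + ["undecided"] * 2

# Hypothetical goal: a microcolony split in half between leaders and followers.
half_and_half = {"leader": 0.5, "follower": 0.5, "undecided": 0.0}

print(cost(snapshot, half_and_half))
```

Evaluating such a cost while varying one parameter at a time around its nominal value is one simple way to rank parameters by how strongly they move the outcome towards the goal.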
An interdisciplinary thesis that combines two important areas of Mathematics: Numerical Analysis and Mathematics Education. Some algorithms used for web information retrieval were introduced in two secondary-school classes with the aid of the computing software Matlab.
An introduction to semantic web techniques and the development of an approach that recreates the familiar environment of any search engine, with semantic-lexical features and the ability to extract, from the search results, the key concepts and terms that will form the corresponding collection groups for documents sharing common topics.
Apart from the article forming the main content, most HTML documents on the WWW contain additional content such as navigation menus, design elements or commercial banners. In several applications it is necessary to distinguish automatically between main and additional content. Content extraction and template detection are the two approaches to solving this task. This thesis gives an extensive overview of existing algorithms from both areas. It contributes an objective way to measure and evaluate the performance of content extraction algorithms under different aspects. These evaluation measures allow the first objective comparison of existing extraction solutions to be drawn. The newly introduced content code blurring algorithm overcomes several drawbacks of previous approaches and proves to be the best content extraction algorithm at the moment. An analysis of methods to cluster web documents according to their underlying templates is the third major contribution of this thesis. In combination with a localised crawling process, this clustering analysis can be used to automatically create sets of training documents for template detection algorithms. As the whole process can be automated, it allows template detection to be performed on a single document, thereby combining the advantages of single-document and multi-document algorithms.
The monitoring of cognitive functions aims at gaining information about the current cognitive state of the user by decoding brain signals. In recent years, this approach has yielded valuable information about the cognitive aspects of human interaction with the external world. Building on this, researchers started to consider passive applications of brain-computer interfaces (BCIs) in order to provide a novel input modality for technical systems based solely on brain activity. The objective of this thesis is to demonstrate how passive BCI applications can be used to assess the mental states of users in order to improve human-machine interaction. Two main studies are proposed. The first investigates whether morphological variations of Event Related Potentials (ERPs) can be used to predict the users' mental states (e.g. attentional resources, mental workload) during different reactive BCI tasks (e.g. P300-based BCIs), and whether this information can predict the subjects' performance in the tasks. In the second study, a passive BCI system is proposed that estimates the user's mental workload online by combining EEG and ECG biosignals. The latter study was performed by simulating an operational scenario in which errors or lapses in performance could have significant consequences. The results showed that the proposed system can estimate the subjects' mental workload online, discriminating three different difficulty levels of the tasks with high reliability.
Automatic design has become a common approach to evolving complex networks, such as artificial neural networks (ANNs) and random boolean networks (RBNs), and many evolutionary setups have been discussed to increase the efficiency of this process. However, networks evolved in this way have a few limitations that should not be overlooked. One of these is the black-box problem, which refers to the impossibility of analyzing the internal behaviour of complex networks in an efficient and meaningful way. The aim of this study is to develop a methodology that makes it possible to extract finite-state automata (FSA) descriptions of robot behaviours from the dynamics of automatically designed complex controller networks. These FSAs, unlike the complex networks from which they are extracted, are both readable and editable, making the resulting designs much more valuable.
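To make the "readable and editable" claim concrete, the sketch below stores a finite-state automaton as a plain transition table and shows that changing a behaviour amounts to changing one entry. It is purely illustrative: the states (`explore`, `avoid`), the input symbols (`clear`, `obstacle`) and the helper `run` are hypothetical and do not come from the study, which extracts its automata from evolved controller networks rather than writing them by hand.

```python
# Transition table: (current state, observed input) -> next state.
# All states and inputs are hypothetical examples of a robot behaviour.
TRANSITIONS = {
    ("explore", "clear"): "explore",
    ("explore", "obstacle"): "avoid",
    ("avoid", "obstacle"): "avoid",
    ("avoid", "clear"): "explore",
}


def run(transitions, start, inputs):
    """Return the sequence of states visited while consuming `inputs`."""
    state = start
    trace = [state]
    for symbol in inputs:
        state = transitions[(state, symbol)]
        trace.append(state)
    return trace


inputs = ["clear", "obstacle", "obstacle", "clear"]
trace = run(TRANSITIONS, "explore", inputs)

# Editing the behaviour: keep avoiding even after the path becomes clear.
edited = dict(TRANSITIONS)
edited[("avoid", "clear")] = "avoid"
edited_trace = run(edited, "explore", inputs)

print(trace)
print(edited_trace)
```

The same one-line edit would be hard to express directly on the weights of a neural controller, which is the practical sense in which an extracted automaton is easier to inspect and modify.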
This work introduces the basic concepts of Natural Language Processing, focusing on Information Extraction and analysing its application areas, its main tasks and how it differs from Information Retrieval. It then analyses the Named Entity Recognition process, concentrating on the main problems of text annotation and on methods for evaluating the quality of entity extraction. Finally, it gives an overview of the open-source language processing software platform GATE/ANNIE, describing its architecture and main components, with a closer look at the tools GATE offers for the rule-based approach to Named Entity Recognition.