945 results for Language Resources
Abstract:
Historically, newly created media appropriate language resources from pre-existing media. As media technologies develop, so do their languages, adapting simultaneously to the medium and its messages, to modes of production, and to ideal conditions of interaction with users. Digital media, by their nature, offer performative interfaces, thinking-images that allow more than the mere aesthetic representation of content. This is the context of the research problem addressed here: which transdisciplinary theories can contribute to understanding the complex communication processes involved in the relationship between human beings and digital media used for learning? The objective of this research was to extend the model developed by Stephen Littlejohn, incorporating new concepts and generalizations from other branches of science with different 'world views', in order to broaden Littlejohn's proposal into a Transdisciplinary Model for Communication with Digital Media which, in our view, helps to explain the phenomena pertaining to the relationship between humans and digital media, especially in science learning processes. The research was carried out using bibliographic and descriptive methods. (AU)
Abstract:
This paper presents a system for the recognition of temporal expressions in Spanish and the resolution of their temporal reference. The identification and recognition of temporal expressions is based on a temporal expression grammar, and the resolution on an inference engine that holds the information necessary to perform date operations on the recognized expressions. For further processing, the output is delivered as XML tags that add standard information about the resolution obtained. Different kinds of annotation of temporal expressions are described in other articles [WILSON2001][KATZ2001]. The evaluation of our proposal yielded successful results.
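The abstract gives only the pipeline (grammar-based recognition, resolution against a reference date, XML output), so the following is a minimal sketch of that idea rather than the paper's system; the pattern set, the TIMEX/VAL tag names and the helper function are all hypothetical:

```python
import re
from datetime import date, timedelta

# Hypothetical pattern set: a few Spanish temporal expressions mapped to
# resolution functions relative to a reference date.
PATTERNS = {
    r"\bhoy\b": lambda ref: ref,                          # "today"
    r"\bayer\b": lambda ref: ref - timedelta(days=1),     # "yesterday"
    r"\bmañana\b": lambda ref: ref + timedelta(days=1),   # "tomorrow"
}

def tag_temporal_expressions(text: str, ref: date) -> str:
    """Wrap each recognized expression in an XML-style tag with its resolved date."""
    for pattern, resolve in PATTERNS.items():
        value = resolve(ref).isoformat()
        text = re.sub(pattern,
                      lambda m, v=value: f'<TIMEX VAL="{v}">{m.group(0)}</TIMEX>',
                      text)
    return text

print(tag_temporal_expressions("La reunión fue ayer.", date(2003, 5, 20)))
# -> La reunión fue <TIMEX VAL="2003-05-19">ayer</TIMEX>.
```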
Abstract:
There are currently a large number of companies offering content analysis and data mining services for social networks, with the aim of performing opinion analysis and reputation management. A high percentage of small and medium-sized enterprises (SMEs) offer solutions tailored to a specific sector or industrial domain. However, acquiring the basic technology needed to offer such services is too complex and represents too high an overhead for their limited resources. The goal of the European project OpeNER is the reuse and development of components and resources for linguistic processing that provide the technology required for industrial and/or academic use.
imaxin|software: NLP applied to improving the multilingual communication of companies and institutions
Abstract:
imaxin|software is a company founded in 1997 by four computer engineering graduates with the goal of developing educational multimedia video games and multilingual natural language processing. Seventeen years later, we have developed reference multilingual resources, tools and applications for different languages: Portuguese (Galicia, Portugal, Brazil, etc.), Spanish (Spain, Argentina, Mexico, etc.), English, Catalan and French. In this article we describe the main milestones in bringing these NLP technologies to the industrial and institutional sectors.
Abstract:
Introduction: this research offers an analysis of the linguistic resources used by participants regarding the content and scope of the basic information and guidance service in community social services, as delivered by social workers. Material and methods: following a qualitative methodology and the discourse analysis approach of Wetherell and Potter (1996), using the analytical tool of interpretative repertoires, we seek to highlight the defining elements, professional strategies, values, norms, organizational practices and elements of institutional culture, among others, that shape the settings where these professionals work and that configure the community social services system. Results: interviews with twenty-five social workers from the province of Málaga reveal four interpretative repertoires reflecting how the professionals involved construct the social services system: the neglect of the community dimension, the system's perpetual lack of definition, the chained elephant, and 'scarcity sharpens ingenuity'. Discussion: the study shows how a model of intervention distant from what is established in norms and ethical codes is constructed as a result of organizational and institutional behaviours, which professionals try to minimize by putting personal skills into practice.
Abstract:
In this paper we present a new approach to ontology learning. Its basis lies in a dynamic and iterative view of knowledge acquisition for ontologies. The Abraxas approach is founded on three resources: a set of texts, a set of learning patterns and a set of ontological triples, each of which must remain in equilibrium. As events occur which disturb this equilibrium, various actions are triggered to re-establish a balance between the resources. Such events include the acquisition of a further text from external resources such as the Web, or the addition of ontological triples to the ontology. We develop the concept of a knowledge gap between the coverage of an ontology and the corpus of texts as a measure that triggers actions. We present an overview of the algorithm and its functionalities.
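The knowledge gap is described only at the conceptual level; one plausible reading is the share of corpus terms the ontology does not yet cover. The sketch below illustrates that reading used as an action trigger, not Abraxas's actual measure; the threshold, term sets and triggered actions are placeholders:

```python
# Hypothetical "knowledge gap" measure: the share of candidate corpus terms
# that the ontology does not yet cover. Term extraction, the threshold and
# the triggered actions are placeholder simplifications.
def knowledge_gap(corpus_terms: set[str], ontology_labels: set[str]) -> float:
    if not corpus_terms:
        return 0.0
    uncovered = corpus_terms - ontology_labels
    return len(uncovered) / len(corpus_terms)

corpus_terms = {"engine", "piston", "crankshaft", "valve"}
ontology_labels = {"engine", "valve"}

gap = knowledge_gap(corpus_terms, ontology_labels)
if gap > 0.25:  # placeholder threshold
    print(f"gap={gap:.2f}: trigger actions (fetch more texts, add triples)")
```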
Abstract:
This paper describes part of the corpus collection efforts underway in the EC-funded Companions project. The Companions project is collecting substantial quantities of dialogue data, a large part of which focuses on reminiscing about photographs. The texts are in English and Czech. We describe the context and objectives for which this dialogue corpus is being collected and the methodology being used, and make observations on the resulting data. The corpora will be made available to the wider research community through the Companions project web site.
Abstract:
Automatic Term Recognition (ATR) is a fundamental processing step preceding more complex tasks such as semantic search and ontology learning. Of the large number of methodologies available in the literature, only a few are able to handle both single- and multi-word terms. In this paper we present a comparison of five such algorithms and propose a combined approach using a voting mechanism. We evaluated the six approaches using two different corpora and show that the voting algorithm performs best on one corpus (a collection of texts from Wikipedia) and less well on the Genia corpus (a standard life-science corpus). This indicates that the choice and design of the corpus have a major impact on the evaluation of term recognition algorithms. Our experiments also showed that single-word terms can be equally important and occupy a fairly large proportion of certain domains. As a result, algorithms that ignore single-word terms may cause problems for tasks built on top of ATR. Effective ATR systems also need to take into account both the unstructured text and its structured aspects, which means information extraction techniques need to be integrated into the term recognition process.
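The abstract does not specify the voting mechanism, so the sketch below shows one common scheme (summing reciprocal ranks across methods) purely as an illustration; the paper's exact mechanism may differ, and the method outputs are made up:

```python
from collections import defaultdict

# Hypothetical voting scheme: each ATR method contributes a ranked list of
# candidate terms; a term's combined score is the sum of its reciprocal
# ranks across methods.
def vote(rankings: list[list[str]]) -> list[str]:
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, term in enumerate(ranking, start=1):
            scores[term] += 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True)

method_a = ["cell line", "protein", "dna"]   # made-up method outputs
method_b = ["protein", "cell line", "rna"]
print(vote([method_a, method_b]))
# -> ['cell line', 'protein', 'dna', 'rna']
```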
Abstract:
Almost everyone who has an email account receives unwanted emails from time to time. These can be jokes from friends or commercial product offers from unknown people. In this paper we focus on unwanted messages that try to promote a product or service, or to offer some “hot” business opportunities. These messages are called junk emails. Several methods to filter junk emails have been proposed, but none considers their linguistic characteristics. In this paper, we investigate the linguistic features of a corpus of junk emails and try to decide whether they constitute a distinct genre. Our corpus of junk emails was built from the messages received by the authors over a period of time. Initially, the corpus consisted of 1,563 files, but after automatically eliminating duplicates we kept only 673 files, totalling just over 373,000 tokens. In order to decide whether junk emails constitute a different genre, we compare them with a corpus of leaflets extracted from the BNC and with the whole BNC corpus. Several characteristics at the lexical and grammatical levels were identified.
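As an illustration of the deduplication step mentioned above, here is a minimal sketch that keeps one copy of each distinct message by hashing file contents; the paper does not state how its duplicates were detected, so this is an assumption, and the corpus path is hypothetical:

```python
import hashlib
from pathlib import Path

# Hypothetical deduplication step: keep one copy of each distinct message
# by hashing its raw content.
def deduplicate(paths: list[Path]) -> list[Path]:
    seen: set[str] = set()
    unique: list[Path] = []
    for path in paths:
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(path)
    return unique

# unique_files = deduplicate(sorted(Path("junk_corpus").glob("*.txt")))
```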
Abstract:
This is a study of specific aspects of classroom interaction at primary school level in Kenya. The study entailed the identification of the sources of particular communication problems during the change-over period from Kiswahili- to English-medium teaching in two primary schools. This was followed by an examination of the language resources employed by teachers to maintain pupil participation in communication in the light of the occurrence, or possibility of occurrence, of specific communication problems. The language resources found to be significant in this regard concerned, firstly, the use of different elicitation types by teachers to stimulate pupils into giving responses and, secondly, teachers' recourse to code-switching from English to Kiswahili and vice versa. It was also found that although the use of English as the medium of instruction in the classrooms observed resulted in certain communication problems, some of these problems need not have arisen had teachers been more careful in their use of language. This finding, considered alongside the role of different elicitation types and code-switching as interpretable from the data samples, has certain implications, specified in the study, for teaching in Kenyan primary schools. The corpus for the study consisted of audio-recordings of English, Science and Number-Work lessons, which were later transcribed. Relevant data samples were then extracted from the transcripts for analysis. Many of the samples contain examples of communication breakdowns, but they also illustrate how teachers maintained interaction with pupils who had yet to acquire an operational mastery of English. This study thus differs from most studies on classroom interaction in its basic concern with examining the resources available to teachers for overcoming problem areas of classroom communication.
Abstract:
This paper presents a novel prosody model in the context of computer text-to-speech synthesis applications for tone languages. We have demonstrated its applicability using the Standard Yorùbá (SY) language. Our approach is motivated by the theory that abstract and realised forms of various prosody dimensions should be modelled within a modular and unified framework [Coleman, J.S., 1994. Polysyllabic words in the YorkTalk synthesis system. In: Keating, P.A. (Ed.), Phonological Structure and Forms: Papers in Laboratory Phonology III, Cambridge University Press, Cambridge, pp. 293–324]. We have implemented this framework using the Relational Tree (R-Tree) technique. R-Tree is a sophisticated data structure for representing a multi-dimensional waveform in the form of a tree. The underlying assumption of this research is that it is possible to develop a practical prosody model by using appropriate computational tools and techniques which combine acoustic data with an encoding of the phonological and phonetic knowledge provided by experts. To implement the intonation dimension, fuzzy logic based rules were developed using speech data from native speakers of Yorùbá. The Fuzzy Decision Tree (FDT) and the Classification and Regression Tree (CART) techniques were tested in modelling the duration dimension. For practical reasons, we have selected the FDT for implementing the duration dimension of our prosody model. To establish the effectiveness of our prosody model, we have also developed a Stem-ML prosody model for SY. We have performed both quantitative and qualitative evaluations on our implemented prosody models. The results suggest that, although the R-Tree model does not predict the numerical speech prosody data as accurately as the Stem-ML model, it produces synthetic speech prosody with better intelligibility and naturalness. The R-Tree model is particularly suitable for speech prosody modelling for languages with limited language resources and expertise, e.g. African languages. Furthermore, the R-Tree model is easy to implement, interpret and analyse.
Abstract:
Sentiment classification over Twitter is usually affected by the noisy nature (abbreviations, irregular forms) of tweet data. A popular procedure to reduce the noise of textual data is to remove stopwords, using pre-compiled stopword lists or more sophisticated methods for dynamic stopword identification. However, the effectiveness of removing stopwords in the context of Twitter sentiment classification has been debated in recent years. In this paper we investigate whether removing stopwords helps or hampers the effectiveness of Twitter sentiment classification methods. To this end, we apply six different stopword identification methods to Twitter data from six different datasets and observe how removing stopwords affects two well-known supervised sentiment classification methods. We assess the impact of removing stopwords by observing fluctuations in the level of data sparsity, the size of the classifier's feature space and its classification performance. Our results show that using pre-compiled lists of stopwords negatively impacts the performance of Twitter sentiment classification approaches. On the other hand, dynamically generating stopword lists by removing infrequent terms appearing only once in the corpus appears to be the optimal method for maintaining high classification performance while reducing data sparsity and substantially shrinking the feature space.
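The best-performing method described here, building a stopword list dynamically from terms that occur exactly once in the corpus, can be sketched in a few lines; the tokenization and example data below are illustrative only, not the paper's setup:

```python
from collections import Counter

# Sketch of dynamic stopword generation: treat singleton terms (terms that
# occur exactly once in the corpus) as stopwords and filter them out
# before classification.
def singleton_stopwords(tweets: list[list[str]]) -> set[str]:
    counts = Counter(token for tweet in tweets for token in tweet)
    return {term for term, freq in counts.items() if freq == 1}

tweets = [["great", "phone"], ["battery", "dies", "fast"], ["great", "battery"]]
stopwords = singleton_stopwords(tweets)
filtered = [[t for t in tweet if t not in stopwords] for tweet in tweets]
print(filtered)  # -> [['great'], ['battery'], ['great', 'battery']]
```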
Abstract:
One of the leading motivations behind the multilingual semantic web is to make resources digitally accessible in an online, global, multilingual context. Consequently, it is fundamental for knowledge bases to find a way to manage multilingualism and thus be equipped with procedures for its conceptual modelling. In this context, the goal of this paper is to discuss how common-sense knowledge and cultural knowledge are modelled in a multilingual framework. More particularly, multilingualism and conceptual modelling are dealt with from the perspective of FunGramKB, a lexico-conceptual knowledge base for natural language understanding. This project argues for a clear division between the lexical and conceptual dimensions of knowledge. Moreover, the conceptual layer is organized into three modules, which result from a strong commitment to capturing semantic knowledge (Ontology), procedural knowledge (Cognicon) and episodic knowledge (Onomasticon). Cultural mismatches are discussed and formally represented at the three conceptual levels of FunGramKB.
Abstract:
This paper studies the way in which se structures are represented in 20 verb entries across nine dictionaries of the Spanish language. These structures are numerous and problematic for both native and non-native speakers. The verbs analysed are of middle-to-high frequency and, in most cases, highly polysemous, which makes it possible to observe the interconnections between the different se structures and the different meanings of each verb. The data from the lexicographic analysis are cross-checked against a corpus analysis of the same units. The results show wide variation, both across and within dictionaries, in the data each dictionary offers and in how they are presented. The reasons range from the theoretical approach of each project to its practical execution. This leads to the conclusion that the dictionary model in use needs further development in order to present lexico-grammatical phenomena such as se verbs in an accurate, clear and exhaustive way.