3 resultados para Corpus Linguistic
em Universidad Politécnica de Madrid
Resumo:
OntoTag - A Linguistic and Ontological Annotation Model Suitable for the Semantic Web
1. INTRODUCTION. LINGUISTIC TOOLS AND ANNOTATIONS: THEIR LIGHTS AND SHADOWS
Computational Linguistics is already a consolidated research area. It builds upon the results of other two major ones, namely Linguistics and Computer Science and Engineering, and it aims at developing computational models of human language (or natural language, as it is termed in this area). Possibly, its most well-known applications are the different tools developed so far for processing human language, such as machine translation systems and speech recognizers or dictation programs.
These tools for processing human language are commonly referred to as linguistic tools. Apart from the examples mentioned above, there are also other types of linguistic tools that perhaps are not so well-known, but on which most of the other applications of Computational Linguistics are built. These other types of linguistic tools comprise POS taggers, natural language parsers and semantic taggers, amongst others. All of them can be termed linguistic annotation tools.
Linguistic annotation tools are important assets. In fact, POS and semantic taggers (and, to a lesser extent, also natural language parsers) have become critical resources for the computer applications that process natural language. Hence, any computer application that has to analyse a text automatically and ‘intelligently’ will include at least a module for POS tagging. The more an application needs to ‘understand’ the meaning of the text it processes, the more linguistic tools and/or modules it will incorporate and integrate.
However, linguistic annotation tools have still some limitations, which can be summarised as follows:
1. Normally, they perform annotations only at a certain linguistic level (that is, Morphology, Syntax, Semantics, etc.).
2. They usually introduce a certain rate of errors and ambiguities when tagging. This error rate ranges from 10 percent up to 50 percent of the units annotated for unrestricted, general texts.
3. Their annotations are most frequently formulated in terms of an annotation schema designed and implemented ad hoc.
A priori, it seems that the interoperation and the integration of several linguistic tools into an appropriate software architecture could most likely solve the limitations stated in (1). Besides, integrating several linguistic annotation tools and making them interoperate could also minimise the limitation stated in (2). Nevertheless, in the latter case, all these tools should produce annotations for a common level, which would have to be combined in order to correct their corresponding errors and inaccuracies. Yet, the limitation stated in (3) prevents both types of integration and interoperation from being easily achieved.
In addition, most high-level annotation tools rely on other lower-level annotation tools and their outputs to generate their own ones. For example, sense-tagging tools (operating at the semantic level) often use POS taggers (operating at a lower level, i.e., the morphosyntactic) to identify the grammatical category of the word or lexical unit they are annotating. Accordingly, if a faulty or inaccurate low-level annotation tool is to be used by other higher-level one in its process, the errors and inaccuracies of the former should be minimised in advance. Otherwise, these errors and inaccuracies would be transferred to (and even magnified in) the annotations of the high-level annotation tool.
Therefore, it would be quite useful to find a way to
(i) correct or, at least, reduce the errors and the inaccuracies of lower-level linguistic tools;
(ii) unify the annotation schemas of different linguistic annotation tools or, more generally speaking, make these tools (as well as their annotations) interoperate.
Clearly, solving (i) and (ii) should ease the automatic annotation of web pages by means of linguistic tools, and their transformation into Semantic Web pages (Berners-Lee, Hendler and Lassila, 2001). Yet, as stated above, (ii) is a type of interoperability problem. There again, ontologies (Gruber, 1993; Borst, 1997) have been successfully applied thus far to solve several interoperability problems. Hence, ontologies should help solve also the problems and limitations of linguistic annotation tools aforementioned.
Thus, to summarise, the main aim of the present work was to combine somehow these separated approaches, mechanisms and tools for annotation from Linguistics and Ontological Engineering (and the Semantic Web) in a sort of hybrid (linguistic and ontological) annotation model, suitable for both areas. This hybrid (semantic) annotation model should (a) benefit from the advances, models, techniques, mechanisms and tools of these two areas; (b) minimise (and even solve, when possible) some of the problems found in each of them; and (c) be suitable for the Semantic Web. The concrete goals that helped attain this aim are presented in the following section.
2. GOALS OF THE PRESENT WORK
As mentioned above, the main goal of this work was to specify a hybrid (that is, linguistically-motivated and ontology-based) model of annotation suitable for the Semantic Web (i.e. it had to produce a semantic annotation of web page contents). This entailed that the tags included in the annotations of the model had to (1) represent linguistic concepts (or linguistic categories, as they are termed in ISO/DCR (2008)), in order for this model to be linguistically-motivated; (2) be ontological terms (i.e., use an ontological vocabulary), in order for the model to be ontology-based; and (3) be structured (linked) as a collection of ontology-based
Resumo:
Basándonos en la recopilación inicial de preposiciones, locuciones preposicionales, términos con preposición dependiente y phrasal verbs utilizados en el texto técnico realizada en otros proyectos anteriores del Departamento de Lingüística Aplicada a la Ciencia y a la Tecnología, el objetivo de este trabajo es completar, organizar, actualizar y dar visibilidad a esta información inicial. Tras realizar un proceso exhaustivo de verificación, unificación, clasificación y ampliación de la información existente, en caso necesario, el listado resultante se utiliza para elaborar un glosario de términos con preposición. El objetivo final de este proyecto es que este glosario esté a disposición de los usuarios, a través de una consulta on-line, en la página del ILLLab (http://illlab.euitt.upm.es/wordpress/), dependiente del Departamento de Lingüística Aplicada a la Ciencia y a la Tecnología. Para incluir en el glosario ejemplos actualizados de textos técnicos, se ha recopilado un corpus lingüístico de textos técnicos, tomando como base diferentes números de la revista IEEE Spectrum, en su edición digital, publicados entre los años 2009 y 2012. El objetivo de esta recopilación es la de ofrecer al consultante diferentes ejemplos de uso en el texto técnico de los distintos términos con preposición que componen el glosario, de manera que pueda acceder de manera rápida y sencilla a ejemplos de uso real de los términos que está buscando, con objeto de clarificar aspectos relacionados con su uso o, en su caso, facilitar su aprendizaje. Toda esta información, tanto el listado de términos con preposición como las frases pertenecientes al corpus recopilado, se incorpora a una base de datos, alojada dentro de la misma página web del ILLLab. A través de un formulario de consulta, a disposición del usuario en dicha página, se pueden obtener todos los términos recopilados que coincidan con los criterios de búsqueda introducidos. El usuario puede realizar dos tipos de búsqueda principales: por preposición o por término completo. Además, puede elegir una búsqueda global (entre todos los términos que integran el glosario) o parcial (en una sola de las categorías en las que se han dividido los diferentes términos, de acuerdo con su función gramatical). Por último, se presentan unas estadísticas de uso de los términos recopilados dentro de los diferentes textos que integran el corpus lingüístico, de manera que pueda establecerse una relación de los que aparecen con más frecuencia en el texto técnico. ABSTRACT. Based on the initial collection of prepositions, prepositional phrases, dependent prepositions and phrasal verbs used in technical texts collected on previous projects in the Department of Applied Linguistics to Science and Technology, the aim of this project is to improve, organize, update and provide visibility to this initial information. Following a process of verification, unification, classification and extension of existing information, if necessary, a glossary of terms with preposition is built. The ultimate objective of this project is to make this glossary available to users through an online consultation in the ILLLab webpage (http://illlab.euitt.upm.es/wordpress/). The administration of tis webpage depends of the Department of Applied Linguistics in Science and Technology. A linguistic corpus of technical texts has been compiled, based on different numbers of the IEEE Spectrum magazine, in its online edition, published between the years 2009 and 2012. The aim of this collection is to provide different examples of use in the technical text for the terms included in the glossary, so that examples of the actual use of the terms consulted can be easily and quickly accessed, in order to clarify doubts regarding their meaning or translation into Spanish and facilitate learning. All this information, both the list of terms with prepositional phrases as well as the corpus developed, is incorporated in a database. Through a searching form, the ILLLab's user may obtain all the terms matching the search criteria entered. The user can perform two types of main search: by preposition or by full term. Additionally, a global search can be selected (including all terms included in the glossary) or a partial one (including only one of the glossary's categories). Finally, some statistics of use are presented according to the various texts included in the corpus, so a relation of the most frequent prepositions in the technical text can be established.
Resumo:
Esta investigación se enmarca dentro de los denominados lenguajes de especialidad que para esta tesis será el de las Tecnologías de la Información y la Comunicación (TIC). De todos los aspectos relacionados con el estudio de estos lenguajes que pudieran tener interés lingüístico ha primado el análisis del componente terminológico. Tradicionalmente la conceptualización de un campo del saber se representaba mayoritariamente a través del elemento nominal, así lo defiende la Teoría General de la Terminología (Wüster, 1968). Tanto la lexicología como la lexicografía han aportado importantes contribuciones a los estudios terminológicos para la identificación del componente léxico a través del cual se transmite la información especializada. No obstante esos primeros estudios terminológicos que apuntaban al sustantivo como elmentos denominativo-conceptual, otras teorías más recientes, entre las que destacamos la Teoría Comunicativa de la Terminología (Cabré, 1999) identifican otras estructuras morfosintácticas integradas por otros elementos no nominales portadores igualmente de esa carga conceptual. A partir de esta consideración, hemos seleccionado para este estudio el adjetivo relacional en tanto que representa otra categoría gramatical distinta al sustantivo y mantiene un vínculo con éste debido a su procedencia. Todo lo cual puede suscitar cierto interés terminológico. A través de esta investigación, nos hemos propuesto demostrar las siguientes hipótesis: 1. El adjetivo relacional aporta contenido especializado en su asociación con el componente nominal. 2. El adjetivo relacional es portador de un valor semántico que hace posible identificar con más precisión la relación conceptual de los elementos -adjetivo y sustantivo - de la combinación léxica resultante, especialmente en algunas formaciones ambiguas. 3. El adjetivo relacional, como modificador natural del sustantivo al que acompaña, podría imponer cierta restricción en sus combinaciones y, por tanto, hacer una selección discriminada de los integrantes de la combinación léxica especializada. Teniendo en cuenta las anteriores hipótesis, esta investigación ha delimitado y caracterizado el segmento léxico objeto de estudio: la ‘combinación léxica especializada (CLE)’ formalmente representada por la estructura sintáctica [adjR+n], en donde adjR es el adjetivo y n el sustantivo al que acompaña. De igual forma hemos descrito el marco teórico desde el que abordar nuestro análisis. Se trata de la teoría del Lexicón Generatvio (LG) y de la representación semántica (Pustojovsky, 1995) que propone como explicación de la generación de significados. Hemos analizado las distintas estructuras de representación léxica y en especial la estructura qualia a través de la cual hemos identificado la relación semántica que mantienen los dos ítems léxicos [adjR+n] de la estructura sintáctica de nuestro estudio. El estudio semántico de las dos piezas léxicas ha permitido, además, comprobar el valor denominativo del adjetivo en la combinación. Ha sido necesario elaborar un corpus de textos escritos en inglés y español pertenecientes al discurso de especialidad de las TIC. Este material ha sido procesado para nuestros fines utilizando distintas herramientas electrónicas. Se ha hecho uso de lexicones electrónicos, diccionarios online generales y de especialidad y corpus de referencia online, estos últimos para poder eventualmente validad nuetros datos. Asimismo se han utilizado motores de búsqueda, entre ellos WordNet Search 3.1, para obtener la información semántica de nuestros elementos léxicos. Nuestras conclusiones han corroborado las hipótesis que se planteaban en esta tesis, en especial la referente al valor denominativo-conceptual del adjetivo relacional el cual, junto con el sustantivo al que acompaña, forma parte de la representación cognitiva del lenguaje de especialidad de las TIC. Como continuación a este estudio se proponen sugerencias sobre líneas futuras de investigación así como el diseño de herramientas informáticas que pudieran incorporar estos datos semánticos como complemento de los ítems léxicos dotados de valor denominativo-conceptual. ABSTRACT This research falls within the field of the so-called Specialized Languages which for the purpose of this study is the Information and Communication Technology (ICT) discourse. Considering their several distinguishing features terminology concentrates our interest from the point of view of linguistics. It is broadly assumed that terms represent concepts of a subject field. For the classical view of terminology (Wüster, 1968) these terms are formally represented by nouns. Both lexicology and terminology have made significant contributions to the study of terms. Later research as well as other theories on Terminology such as the Communicative Theory of Terminology (Cabré, 1993) have shown that other lexical units can also represent knowledge organization. On these bases, we have focused our research on the relational adjective which represents a functional unit different from a noun while still connected to the noun by means of its nominal root. This may have a potential terminological interest. Therefore the present research is based on the next hypotheses: 1. The relational adjective conveys specialized information when combined with the noun. 2. The relational adjective has a semantic meaning which helps understand the conceptual relationship between the adjective and the noun being modified and disambiguate certain senses of the resulting lexical combination. 3. The relational adjective may impose some restrictions when choosing the nouns it modifies. Considering the above hypotheses, this study has identified and described a multi-word lexical unit pattern [Radj+n] referred to as a Specialized Lexical Combination (SLC) linguistically realized by a relational adjective, Radj, and a noun, n. The analysis of such a syntactic pattern is addressed from the framework of the Generative Lexicon (Pustojovsky, 1995). Such theory provides several levels of semantic description which help lexical decomposition performed generatively. These levels of semantic representation are connected through generative operations or generative devices which account for the compositional interpretation of any linguistic utterance in a given context. This study analyses these different levels and focuses on one of them, i.e. the qualia structure since it may encode the conceptual meaning of the syntactic pattern [Radj+n]. The semantic study of these two lexical items has ultimately confirmed the conceptual meaning of the relational adjective. A corpus made of online ICT articles from magazines written in English and Spanish – some being their translations - has been used for the word extraction. For this purpose some word processing software packages have been employed. Moreover online general language and specialized language dictionaries have been consulted. Search engines, namely WordNet Search 3.1, have been also exploited to find the semantic information of our lexical units. Online reference corpora in English and Spanish have been used for a contrastive analysis of our data. Finally our conclusions have confirmed our initial hypotheses, i.e. relational adjectives are specialized lexical units which together with the nouns are part of the knowledge representation of the ICT subject field. Proposals for new research have been made together with some other suggestions for the design of computer applications to visually show the conceptual meaning of certain lexical units.