20 results for corpora allata
in Consorci de Serveis Universitaris de Catalunya (CSUC), Spain
Abstract:
The objective of the PANACEA ICT-2007.2.2 EU project is to build a platform that automates the stages involved in the acquisition, production, updating and maintenance of the large language resources required by, among others, MT systems. The development of a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the web is one of the most innovative building blocks of PANACEA. The CAC, which is the first stage in the PANACEA pipeline for building Language Resources, adopts an efficient and distributed methodology to crawl for web documents with rich textual content in specific languages and predefined domains. The CAC includes modules that can acquire parallel data from sites with in-domain content available in more than one language. In order to extrinsically evaluate the CAC methodology, we have conducted several experiments that used crawled parallel corpora for the identification and extraction of parallel sentences using sentence alignment. The corpora were then successfully used for domain adaptation of Machine Translation Systems.
Abstract:
The aim of this report is to present the application of a set of proposals on transcription, tagging and coding to two corpora: the bilingual LC corpus (La Canonja (Catalan-Spanish)) and the trilingual CSCD corpus (Code-switching as Communicative Design (Catalan-Spanish-English)). These proposals, which constitute the contribution of the IULA-LIPPS team (Language Interaction in Plurilingual and Plurilectal Speakers) to the coding manual of the LIDES system (Language Interaction Database Exchange System), adopted by the European LIPPS group, may be useful for transcribing, tagging and coding data from typologically close and distant languages.
Abstract:
This research investigates the phenomenon of translationese in two monolingual comparable corpora of original and translated Catalan texts. Translationese has been defined as the dialect, sub-language or code of translated language. This study aims at giving empirical evidence of translation universals regardless of the source language. Traditionally, research conducted on translation strategies has been mainly intuition-based. Computational Linguistics and Natural Language Processing techniques provide reliable information on lexical frequencies and on morphological and syntactic distribution in corpora. Therefore, they have been applied to observe which translation strategies occur in these corpora. Results seem to prove the simplification, interference and explicitation hypotheses, whereas no sign of normalization has been detected with the methodology used. The data collected and the resources created for identifying lexical, morphological and syntactic patterns of translations can be useful for Translation Studies teachers, scholars and students: teachers will have more tools to help students avoid the reproduction of translationese patterns. Resources developed will help in detecting non-genuine or inadequate structures in the target language. This fact may imply an improvement in stylistic quality in translations. Translation professionals can also take advantage of these resources to improve their translation quality.
Abstract:
CoCo is a collaborative web interface for the compilation of linguistic resources. In this demo we are presenting one of its possible applications: paraphrase acquisition.
Abstract:
Project carried out at the Universidad de Lleida between 2003 and 2006. The main aim of this work is to shed light on the genesis and evolution of phraseological units that come from Latin and of those that use some Greco-Roman motif in their creation. One of the reasons for choosing this type of phraseologism is that they date back to a fairly well-known period in the history of our people, a circumstance that will make it possible to pursue the second aim, namely: to find out which aspects of classical society and culture were selected by the ancients and by Spanish speakers to form a figurative expression, and to bring to light the possible causes behind this choice. Including these two groups of units (those that come directly from Latin and those that were based, already well into the Romance period, on a motif from classical culture) makes a third aim attainable: to find out whether the same aspects of reality caught the attention of both peoples (Greco-Roman and Hispanic), and in the same way, for the crystallization of phraseological units. The starting hypotheses were that Spanish should show both the survival or re-creation of Latin and Greek units and the creation of new units referring to cultural aspects of Greece and Rome; and that the number of these units should be high, since Latin is the language from which Spanish derives and Greco-Roman culture is the basis of our own. To establish the final corpus, two corpora of phraseologisms and proverbs (for Spanish and for Latin) were compiled and duly compared, eventually establishing some 20,000 units of Latin origin and some 3,000 of classical origin.
Abstract:
The Aria with Diverse Variations BWV 988, the fourth part of the Clavier-Übung, by Johann Sebastian Bach, is a work that has been the subject of many studies. Its extreme beauty, the nickname by which we know it today, Goldberg Variations (from the legend told by Forkel about a count who suffered from insomnia and his harpsichordist Goldberg), and its internal structure, so perfectly calculated yet raising major questions that are complex to answer, make this work an undisputed myth of keyboard literature that, for more than 250 years, has continued to fascinate us and that makes it immortal. This study is a humble approach to these aspects, which will bring us closer to the inner world of this work.
Abstract:
In this paper we present a description of the role of definitional verbal patterns for the extraction of semantic relations. Several studies show that semantic relations can be extracted from analytic definitions contained in machine-readable dictionaries (MRDs). In addition, definitions found in specialised texts are a good starting point to search for different types of definitions where other semantic relations occur. The extraction of definitional knowledge from specialised corpora represents another interesting approach for the extraction of semantic relations. Here, we present a descriptive analysis of definitional verbal patterns in Spanish and the first steps towards the development of a system for the automatic extraction of definitional knowledge.
Abstract:
In this article we present a hybrid approach for automatic summarization of Spanish medical texts. There are many systems for automatic summarization that use statistics or linguistics, but only a few of them combine both techniques. Our idea is that to reach a good summary we need to use linguistic aspects of texts, but we should also benefit from the advantages of statistical techniques. We have integrated the Cortex (Vector Space Model) and Enertex (statistical physics) systems coupled with the Yate term extractor, and the Disicosum system (linguistics). We have compared these systems and afterwards we have integrated them in a hybrid approach. Finally, we have applied this hybrid system to a corpus of medical articles and we have evaluated its performance, obtaining good results.
Abstract:
By providing a better understanding of paraphrase and coreference in terms of similarities and differences in their linguistic nature, this article delimits what the focus of paraphrase extraction and coreference resolution tasks should be, and to what extent they can help each other. We argue for the relevance of this discussion to Natural Language Processing.
Abstract:
In this paper, we present a critical analysis of the state of the art in the definition and typologies of paraphrasing. This analysis shows that there exists no characterization of paraphrasing that is comprehensive, linguistically based and computationally tractable at the same time. The following sets out to define and delimit the concept on the basis of the propositional content. We present a general, inclusive and computationally oriented typology of the linguistic mechanisms that give rise to form variations between paraphrase pairs.
Abstract:
The aim of this work is to determine whether the use of a group of verbs is specific to the Spanish of Argentina or whether, on the contrary, it extends to other Spanish-speaking countries. To this end, the process of verbal derivation, the semantics and the neological character of the words are analysed.
Abstract:
This work studies the relationship between morphology and lexicography through the analysis of six verbs prefixed with re-. Their definitions in three dictionaries are compared, and new definitions are proposed following the lexicographic entry model of the Diccionario de Aprendizaje de Español como Lengua Extranjera.
Abstract:
In recent decades, technological advances have made extensive documentation available to us. But the philologist must be aware of the dangers of poor use of the documentary corpus in order to avoid creating dreaded ghost words. In this paper we recall the main sources of this type of error: folk etymology phenomena among speakers, copyists' errors, transcribers' errors in the interpretation of some abbreviations and graphic variants of the manuscripts, onomastic changes introduced by cartographers' ignorance of linguistic variants, gaps in the dating of some documents, confusion in the processes of lemmatization and the evaluation of texts... All these sources of error contribute, to a greater or lesser degree, to the distortion or to the masking of the data on which the research of philologists is based. Hence the importance of philological rigour in the transmission and study of ancient texts.
Abstract:
This article introduces EsPal: a Web-accessible repository containing a comprehensive set of properties of Spanish words. EsPal is based on an extensible set of data sources, beginning with a 300 million token written database and a 460 million token subtitle database. Properties available include word frequency, orthographic structure and neighborhoods, phonological structure and neighborhoods, and subjective ratings such as imageability. Subword structure properties are also available in terms of bigrams and trigrams, bi-phones, and bi-syllables. Lemma and part-of-speech information and their corresponding frequencies are also indexed. The website enables users to either upload a set of words to receive their properties, or to receive a set of words matching constraints on the properties. The properties themselves are easily extensible and will be added over time as they become available. It is freely available from the following website: http://www.bcbl.eu/databases/espal
Abstract:
This article describes the development of an Open Source shallow-transfer machine translation system from Czech to Polish in the Apertium platform. It gives details of the methods and resources used in constructing the system. Although the resulting system has quite a high error rate, it is still competitive with other systems.