991 results for PoS Tagging
Abstract:
The great amount of text produced every day on the Web has made it one of the main sources of linguistic corpora, which are then analyzed with Natural Language Processing techniques. On a global scale, languages such as Portuguese - official in 9 countries - appear on the Web in several varieties, with lexical, morphological and syntactic differences, among others. Besides, a unified spelling system for Portuguese has recently been approved, and its implementation has already started in some countries. However, the process will take several years, so different varieties and spelling systems coexist. Since PoS-taggers for Portuguese are built for a particular variety, this work analyzes different combinations of training corpora and lexica aimed at building a model with high-precision annotation across several varieties and spelling systems of the language. Moreover, this paper presents different dictionaries of the new orthography (Spelling Agreement) as well as a new freely available testing corpus containing different varieties and textual typologies.
Abstract:
In this paper we describe the methodology and the structural design of a system that translates English into Malayalam using statistical models. A monolingual Malayalam corpus and a bilingual English/Malayalam corpus are the main resources for building this Statistical Machine Translator. The training strategy adopted has been enhanced by PoS tagging, which helps to eliminate insignificant alignments. Moreover, incorporating units such as a suffix separator and a stop word eliminator has proven effective in bringing about better training results. In the decoder, order conversion rules are applied to reduce the structural difference between the language pair. The quality of the statistical output of the decoder is further improved by applying mending rules. Experiments conducted on a sample corpus have generated reasonably good Malayalam translations, and the results are verified with the F measure, BLEU and WER evaluation metrics.
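Of the evaluation metrics named above, WER (word error rate) is the simplest to illustrate. The sketch below is not the authors' evaluation code. It assumes the standard definition: word-level edit distance between a reference translation and the system output, divided by the number of reference words. The function name `word_error_rate` and whitespace tokenisation are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: tokens are obtained by splitting on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A lower value is better. The score can exceed 1.0 when the output contains many more words than the reference.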
Abstract:
This paper outlines a methodology for translating text from English into the Dravidian language Malayalam using statistical models. By using a monolingual Malayalam corpus and a bilingual English/Malayalam corpus in the training phase, the machine automatically generates Malayalam translations of English sentences. This paper also discusses a technique to improve the alignment model by incorporating part-of-speech information into the bilingual corpus. Removing the insignificant alignments from the sentence pairs by this approach has ensured better training results. Pre-processing techniques such as suffix separation in the Malayalam corpus and stop word elimination in the bilingual corpus also proved to be effective in training. Various handcrafted rules designed for the suffix separation process, which can be used as a guideline for implementing suffix separation in Malayalam, are also presented in this paper. The structural difference between the English/Malayalam pair is resolved in the decoder by applying order conversion rules. Experiments conducted on a sample corpus have generated reasonably good Malayalam translations, and the results are verified with the F measure, BLEU and WER evaluation metrics.
Abstract:
This paper investigates certain training methods adopted in a Statistical Machine Translator (SMT) from English to Malayalam. In English/Malayalam SMT, the word-to-word translation is determined by training on the parallel corpus. Our primary goal is to improve the alignment model by reducing the number of possible alignments of all sentence pairs present in the bilingual corpus. Incorporating morphological information into the parallel corpus with the help of a part-of-speech tagger has brought about better training results with improved accuracy.
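The idea of reducing the number of possible alignments with PoS information can be sketched as follows. This is an illustration, not the authors' implementation. It assumes that both sides of a sentence pair have already been tagged with a shared coarse tagset (the Universal POS labels used here are an assumption), and that a word pair is a candidate only when the tags are compatible. The function name `candidate_alignments` and the placeholder target tokens `t1`, `t2`, `t3` are hypothetical.

```python
from itertools import product


def candidate_alignments(source_tagged, target_tagged, compatible=None):
    """Return (source_word, target_word) pairs whose PoS tags are compatible.

    source_tagged, target_tagged: lists of (word, tag) tuples.
    compatible: optional set of (source_tag, target_tag) pairs that may align.
                If omitted, only identical tags are treated as compatible.
    """
    pairs = []
    for (s_word, s_tag), (t_word, t_tag) in product(source_tagged, target_tagged):
        if compatible is None:
            allowed = s_tag == t_tag
        else:
            allowed = (s_tag, t_tag) in compatible
        if allowed:
            pairs.append((s_word, t_word))
    return pairs


if __name__ == "__main__":
    # Hypothetical sentence pair; the target tokens are placeholders,
    # not real Malayalam words.
    source = [("she", "PRON"), ("reads", "VERB"), ("books", "NOUN")]
    target = [("t1", "PRON"), ("t2", "NOUN"), ("t3", "VERB")]
    all_pairs = len(source) * len(target)
    kept = candidate_alignments(source, target)
    print(f"{len(kept)} of {all_pairs} word pairs kept: {kept}")
```

In this toy pair the filter keeps 3 of the 9 possible word pairs, so a downstream alignment model would have fewer candidates to score. A real system would need a tag-compatibility table rather than strict tag equality, since tagsets for the two languages rarely match one to one.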
Abstract:
A methodology for translating text from English into the Dravidian language Malayalam using statistical models is discussed in this paper. The translator uses a monolingual Malayalam corpus and a bilingual English/Malayalam corpus in the training phase and automatically generates the Malayalam translation of an unseen English sentence. Various techniques to improve the alignment model by incorporating morphological inputs into the bilingual corpus are discussed. Removing the insignificant alignments from the sentence pairs by this approach has ensured better training results. Pre-processing techniques such as suffix separation in the Malayalam corpus and stop word elimination in the bilingual corpus also proved to be effective in producing better alignments. Difficulties in the translation process that arise from the structural difference between the English/Malayalam pair are resolved in the decoding phase by applying order conversion rules. The handcrafted rules designed for the suffix separation process, which can be used as a guideline for implementing suffix separation in Malayalam, are also presented in this paper. Experiments conducted on a sample corpus have generated reasonably good Malayalam translations, and the results are verified with the F measure, BLEU and WER evaluation metrics.
Abstract:
Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)
Abstract:
OntoTag - A Linguistic and Ontological Annotation Model Suitable for the Semantic Web
1. INTRODUCTION. LINGUISTIC TOOLS AND ANNOTATIONS: THEIR LIGHTS AND SHADOWS
Computational Linguistics is already a consolidated research area. It builds upon the results of two other major areas, namely Linguistics and Computer Science and Engineering, and it aims at developing computational models of human language (or natural language, as it is termed in this area). Possibly its best-known applications are the different tools developed so far for processing human language, such as machine translation systems and speech recognizers or dictation programs.
These tools for processing human language are commonly referred to as linguistic tools. Apart from the examples mentioned above, there are also other types of linguistic tools that perhaps are not so well-known, but on which most of the other applications of Computational Linguistics are built. These other types of linguistic tools comprise POS taggers, natural language parsers and semantic taggers, amongst others. All of them can be termed linguistic annotation tools.
Linguistic annotation tools are important assets. In fact, POS and semantic taggers (and, to a lesser extent, also natural language parsers) have become critical resources for the computer applications that process natural language. Hence, any computer application that has to analyse a text automatically and ‘intelligently’ will include at least a module for POS tagging. The more an application needs to ‘understand’ the meaning of the text it processes, the more linguistic tools and/or modules it will incorporate and integrate.
However, linguistic annotation tools still have some limitations, which can be summarised as follows:
1. Normally, they perform annotations only at a certain linguistic level (that is, Morphology, Syntax, Semantics, etc.).
2. They usually introduce a certain rate of errors and ambiguities when tagging. This error rate ranges from 10 percent up to 50 percent of the units annotated for unrestricted, general texts.
3. Their annotations are most frequently formulated in terms of an annotation schema designed and implemented ad hoc.
A priori, it seems that the interoperation and integration of several linguistic tools into an appropriate software architecture could most likely solve the limitation stated in (1). Besides, integrating several linguistic annotation tools and making them interoperate could also minimise the limitation stated in (2). Nevertheless, in the latter case, all these tools should produce annotations for a common level, which would have to be combined in order to correct their corresponding errors and inaccuracies. Yet the limitation stated in (3) prevents both types of integration and interoperation from being easily achieved.
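One simple way to combine annotations produced for a common level is a per-token majority vote. The sketch below only illustrates the combination idea and is not the OntoTag mechanism. It assumes that the outputs of all tools have already been mapped to one shared tagset and one tokenisation, which is exactly what limitation (3) makes difficult in practice. The function name `combine_tags` and the tag labels are hypothetical.

```python
from collections import Counter


def combine_tags(annotations):
    """Combine tag sequences from several tools by per-token majority vote.

    annotations: list of tag sequences, one per tool, all covering the same
                 tokens in the same order and using the same tagset.
    Ties are broken in favour of the tool listed first.
    """
    if not annotations:
        raise ValueError("at least one tag sequence is required")
    length = len(annotations[0])
    if any(len(seq) != length for seq in annotations):
        raise ValueError("all tools must annotate the same tokens")

    combined = []
    for token_tags in zip(*annotations):
        counts = Counter(token_tags)
        # max() returns the first maximal key in insertion order,
        # so a tie goes to the tag proposed by the earliest tool.
        combined.append(max(counts, key=counts.get))
    return combined


if __name__ == "__main__":
    tool_a = ["DET", "NOUN", "VERB"]
    tool_b = ["DET", "VERB", "VERB"]
    tool_c = ["DET", "NOUN", "NOUN"]
    print(combine_tags([tool_a, tool_b, tool_c]))
```

Each tool makes one error in this toy input, and the vote recovers a sequence that none of the errors survives in, because the tools disagree on different tokens. Voting does not help when the tools share the same systematic error.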
In addition, most high-level annotation tools rely on lower-level annotation tools and their outputs to generate their own. For example, sense-tagging tools (operating at the semantic level) often use POS taggers (operating at a lower level, i.e., the morphosyntactic one) to identify the grammatical category of the word or lexical unit they are annotating. Accordingly, if a faulty or inaccurate low-level annotation tool is to be used by a higher-level one in its process, the errors and inaccuracies of the former should be minimised in advance. Otherwise, these errors and inaccuracies would be transferred to (and even magnified in) the annotations of the high-level annotation tool.
Therefore, it would be quite useful to find a way to
(i) correct or, at least, reduce the errors and the inaccuracies of lower-level linguistic tools;
(ii) unify the annotation schemas of different linguistic annotation tools or, more generally speaking, make these tools (as well as their annotations) interoperate.
Clearly, solving (i) and (ii) should ease the automatic annotation of web pages by means of linguistic tools, and their transformation into Semantic Web pages (Berners-Lee, Hendler and Lassila, 2001). Yet, as stated above, (ii) is a type of interoperability problem, and ontologies (Gruber, 1993; Borst, 1997) have been successfully applied thus far to solve several interoperability problems. Hence, ontologies should also help solve the aforementioned problems and limitations of linguistic annotation tools.
Thus, to summarise, the main aim of the present work was to combine these separate approaches, mechanisms and tools for annotation from Linguistics and Ontological Engineering (and the Semantic Web) into a hybrid (linguistic and ontological) annotation model suitable for both areas. This hybrid (semantic) annotation model should (a) benefit from the advances, models, techniques, mechanisms and tools of these two areas; (b) minimise (and even solve, when possible) some of the problems found in each of them; and (c) be suitable for the Semantic Web. The concrete goals that helped attain this aim are presented in the following section.
2. GOALS OF THE PRESENT WORK
As mentioned above, the main goal of this work was to specify a hybrid (that is, linguistically-motivated and ontology-based) model of annotation suitable for the Semantic Web (i.e. it had to produce a semantic annotation of web page contents). This entailed that the tags included in the annotations of the model had to (1) represent linguistic concepts (or linguistic categories, as they are termed in ISO/DCR (2008)), in order for this model to be linguistically-motivated; (2) be ontological terms (i.e., use an ontological vocabulary), in order for the model to be ontology-based; and (3) be structured (linked) as a collection of ontology-based
Abstract:
Doctoral thesis, Linguistics (Educational Linguistics), Universidade de Lisboa, Faculdade de Letras, 2016
Abstract:
Thesis (Ph.D.)--University of Washington, 2016-08
Abstract:
This research project is based on the Multimodal Corpus of Chinese Court Interpreting (MUCCCI [mutʃɪ]), a small-scale multimodal corpus on the basis of eight authentic court hearings with Chinese-English interpreting in Mainland China. The corpus has approximately 92,500 word tokens in total. Besides the transcription of linguistic and para-linguistic features, utilizing the facial expression classification rules suggested by Black and Yacoob (1995), MUCCCI also includes approximately 1,200 annotations of facial expressions linked to the six basic types of human emotions, namely, anger, disgust, happiness, surprise, sadness, and fear (Black & Yacoob, 1995). This thesis is an example of conducting qualitative analysis on interpreter-mediated courtroom interactions through a multimodal corpus. In particular, miscommunication events (MEs) and the reasons behind them were investigated in detail. During the analysis, although queries were conducted based on non-verbal annotations when searching for MEs, both verbal and non-verbal features were considered indispensable parts contributing to the entire context. This thesis also includes a detailed description of the compilation process of MUCCCI utilizing ELAN, from data collection to transcription, POS tagging and non-verbal annotation. The research aims at assessing the possibility and feasibility of conducting qualitative analysis through a multimodal corpus of court interpreting. The concept of integrating both verbal and non-verbal features to contribute to the entire context is emphasized. The qualitative analysis focusing on MEs can provide an inspiration for improving court interpreters’ performances. All the constraints and difficulties presented can be regarded as a reference for similar research in the future.
Abstract:
Amphibians have been declining worldwide and the comprehension of the threats that they face could be improved by using mark-recapture models to estimate vital rates of natural populations. Recently, the consequences of marking amphibians have been under discussion and the effects of toe clipping on survival are debatable, although it is still the most common technique for individually identifying amphibians. The passive integrated transponder (PIT tag) is an alternative technique, but comparisons among marking techniques in free-ranging populations are still lacking. We compared these two marking techniques using mark-recapture models to estimate apparent survival and recapture probability of a neotropical population of the blacksmith tree frog, Hypsiboas faber. We tested the effects of marking technique and number of toe pads removed while controlling for sex. Survival was similar among groups, although slightly decreased from individuals with one toe pad removed, to individuals with two and three toe pads removed, and finally to PIT-tagged individuals. No sex differences were detected. Recapture probability slightly increased with the number of toe pads removed and was the lowest for PIT-tagged individuals. Sex was an important predictor for recapture probability, with males being nearly five times more likely to be recaptured. Potential negative effects of both techniques may include reduced locomotion and high stress levels. We recommend the use of covariates in models to better understand the effects of marking techniques on frogs. Accounting for the effect of the technique on the results should be considered, because most techniques may reduce survival. Based on our results, but also on logistical and cost issues associated with PIT tagging, we suggest the use of toe clipping with anurans like the blacksmith tree frog.
Abstract:
Universidade Estadual de Campinas. Faculdade de Educação Física
Analysis of heart rate variability in sedentary and trained postmenopausal women
Abstract:
Universidade Estadual de Campinas. Faculdade de Educação Física
Abstract:
Universidade Estadual de Campinas. Faculdade de Educação Física
Abstract:
Tissue responses to the application of Rototags and Jumbo Rototags in the first dorsal fin of Carcharhinus melanopterus, C. obscurus and C. plumbeus were examined. The acute response included tissue tearing and haemorrhage and was present by 5 days post-tagging. The intermediate response had begun by 20 days post-tagging and continued beyond 207 days. This response involved decreased red blood cell activity as the inflammatory response commenced. The chronic response had begun by 301 days and was complete by 553 days with a layer of fibrous connective tissue walling off the tag. External damage to the fin was caused by continued abrasion by the tag. Repair scales were observed at 242 days using scanning electron microscopy and were confirmed histologically in 61- and 553-day samples. Repair scales were not seen in areas of continuous abrasion. No infection was observed in tissues surrounding the wound. Disruption of the fin surface was observed due to abrasion by the tag, but did not appear to cause a severe tissue reaction. The tissue responses observed were consistent with a normal, but relatively slow, healing in the vicinity of the tag wound. Use of Rototags or Jumbo Rototags appears to be an efficient way of marking elasmobranchs with minimal damage to the shark. (C) 1998 The Fisheries Society of the British Isles.