7 results for Text linguistics
in AMS Tesi di Dottorato - Alm@DL - Università di Bologna
Abstract:
The scientific community clearly recognizes the need for convergence between semi-structured data management and Information Retrieval techniques. To meet this growing demand, the W3C has recently proposed XQuery Full-Text, an IR-oriented extension of XQuery. However, query optimization requires the study of important properties such as query equivalence and containment; to this end, a formal representation of documents and queries is needed. The goal of this thesis is to establish such a formal background. We define a data model for XML documents and propose an algebra able to represent most XQuery Full-Text expressions. We show how an XQuery Full-Text expression can be translated into an algebraic expression and how an algebraic expression can be optimized.
Abstract:
For a long time, the work of Bartolomeo Della Pugliola, a Franciscan friar who lived in Bologna and Florence during the 13th and 14th centuries, was thought to have been lost. Recent paleographic research, however, has established that most of Della Pugliola’s work, although interspersed with that of other authors, is contained in two manuscripts (1994 and 3843) currently kept at the University Library in Bologna. Pugliola’s chronicle is central to Bolognese medieval literature, not only because it was the privileged source for the important Ramponis’ chronicle, but also because Bartolomeo Della Pugliola’s own sources include several significant works, such as Jacopo Bianchetti’s lost writings and Pietro and Floriano Villolas’ chronicle (1163-1372). Ongoing historical studies and recent discoveries enabled me to reconstruct the historical chronology of Pugliola’s work as well as the Bolognese language between the 13th and 14th centuries. The original purpose of my research was to add a linguistic commentary to the edition of the text in order to fill the gaps in medieval Bolognese language studies. In addition to being a reliable source, Pugliola’s chronicle was widely disseminated and became a sort of vulgate. Through collation, the chronicle tradition allows the language to be studied from a diachronic point of view. I therefore described all the linguistic phenomena related to phonetics, morphology and syntax in Pugliola’s text and compared these results with variants in Villola’s and Ramponis’ chronicles. I did likewise with another chronicle, which I edited, by a 16th-century merchant, Friano Ubaldini. This supplement helped to complete the outline of the Bolognese language from the 13th to the 16th century. 
To analyze the data I collected, I approached them from a sociolinguistic point of view, because each author represents a different variety of the language: the language used by Pugliola is closer to a scripta and to Florentine, while the language used by Ubaldini is closer to the dialect spoken in Bologna. Differences in handwriting in particular reveal the models the authors try to reproduce or imitate. The glossary added at the end of this study helps to clarify these nuances with a number of examples.
Abstract:
This study aims at elaborating juridical and administrative terminology in the Ladin language, specifically in the Ladin variety spoken in Val Badia. The need for this study stems from the fact that in South Tyrol the Ladin language is not merely safeguarded: the drafting of administrative and normative texts in Ladin is guaranteed by law. A uniform terminology is therefore needed to support translators and editors of specialised texts. The starting points of this study are, on the one hand, the need for a uniform terminology and, on the other, the translation work done so far in Ladin by employees of the public administration. In order to document their efforts, a corpus of digitized administrative and normative documents was built. The first two chapters focus on the state of the art of projects on terminology and corpus linguistics for lesser-used languages. The information was collected with the help of institutes, universities and researchers dealing with lesser-used languages. The third chapter focuses on the development of administrative language in Ladin, and the fourth chapter on the creation of the trilingual Italian-German-Ladin corpus of administrative and normative documents. The last chapter deals with the methodologies applied to elaborate the terminology entries in Ladin through the use of the trilingual corpus. Starting from the terminology entry, all steps are described: term extraction, the extraction of equivalents, contexts and definitions, and the elaboration of translation proposals for equivalents that were not found. Finally, the problems involved in elaborating terminology in Ladin are illustrated.
Abstract:
The construction and use of multimedia corpora has long been advocated in the literature as one of the expected future application fields of Corpus Linguistics. This research project represents a pioneering experience aimed at applying a data-driven methodology to the study of audiovisual translation (AVT), similarly to what has been done in the last few decades in the macro-field of Translation Studies. This research was based on the experience of Forlixt 1, the Forlì Corpus of Screen Translation, developed at the University of Bologna’s Department of Interdisciplinary Studies in Translation, Languages and Culture. In order to quantify strategies of linguistic transfer of an AV product, we need to take into consideration not only the linguistic aspect of such a product but all the meaning-making resources deployed in the filmic text. Given that one major benefit of Forlixt 1 is the combination of audiovisual and textual data, this corpus allows the user to access primary data for scientific investigation, and thus no longer rely on pre-processed material such as traditional annotated transcriptions. Based on this rationale, the first chapter of the thesis sets out to illustrate the state of the art of research in the disciplinary fields involved. The primary objective was to underline the main repercussions on multimedia texts resulting from the interaction of a double support, audio and video, and, accordingly, on the procedures, means, and methods adopted in their translation. By drawing on previous research in semiotics and film studies, the relevant codes at work in the visual and acoustic channels were outlined. Subsequently, we concentrated on the analysis of the verbal component and on the peculiar characteristics of filmic orality as opposed to spontaneous dialogic production. In the second part, an overview of the main AVT modalities was presented (dubbing, voice-over, interlinguistic and intra-linguistic subtitling, audio-description, etc.) 
in order to define the different technologies, processes and professional qualifications that this umbrella term presently includes. The second chapter focuses diachronically on the contribution of various theories (i.e. Descriptive Translation Studies, Polysystem Theory) to the application of Corpus Linguistics methods and tools to the field of Translation Studies. In particular, we discussed how the use of corpora can help reduce the gap between qualitative and quantitative approaches. Subsequently, we reviewed the tools traditionally employed by Corpus Linguistics in the construction of traditional “written language” corpora, to assess whether and how they can be adapted to meet the needs of multimedia corpora. In particular, we reviewed existing speech and spoken corpora, as well as multimedia corpora specifically designed to investigate translation. The third chapter reviews the main development steps of Forlixt 1 from a technical (IT design principles, data query functions) and methodological point of view, laying down extensive scientific foundations for the annotation methods adopted, which presently encompass categories of a pragmatic, sociolinguistic, linguacultural and semiotic nature. Finally, we described the main query tools (free search, guided search, advanced search and combined search) and the main intended uses of the database from a pedagogical perspective. The fourth chapter lists the specific compilation criteria adopted, as well as statistics for the two sub-corpora, presenting data broken down by language pair (French-Italian and German-Italian) and genre (film comedies, television soap operas and crime series). Next, we concentrated on the discussion of the results obtained from the analysis of summary tables reporting the frequency of categories applied to the French-Italian sub-corpus. 
The detailed observation of the distribution of categories identified in the original and dubbed corpus allowed us to empirically confirm some of the theories put forward in the literature, notably those concerning the nature of the filmic text, the dubbing process and the features of Italian dubbed language. This was possible by looking into some of the most problematic aspects, like the rendering of socio-linguistic variation. The corpus also allowed us to consider hitherto neglected aspects, such as pragmatic, prosodic, kinetic, facial, and semiotic elements, and their combination. At the end of this first exploration, some specific observations concerning possible macro-translation trends were made for each type of sub-genre considered (cinematic and TV genre). On the grounds of this first quantitative investigation, the fifth chapter examines the data further by applying ad hoc models of analysis. Given the virtually infinite number of combinations of the categories adopted, and of the latter with searchable textual units, three possible qualitative and quantitative methods were designed, each concentrating on a particular translation dimension of the filmic text. The first was the cultural dimension, which specifically focused on the rendering of selected cultural references and on the investigation of recurrent translation choices and strategies justified on the basis of the occurrence of specific clusters of categories. The second analysis was conducted on the linguistic dimension by exploring the occurrence of phrasal verbs in the Italian dubbed corpus and by ascertaining the influence of possible semiotic traits, such as gestures and facial expressions, on the adoption of related translation strategies. Finally, the main aim of the third study was to verify whether, under which circumstances, and through which modality, graphic and iconic elements were translated into Italian from an original corpus of both German and French films. 
After a review of the main translation techniques at work, an exhaustive account of possible causes for their non-translation was also provided. By way of conclusion, the discussion of the results obtained from the distribution of annotation categories in the French-Italian corpus, together with the application of specific models of analysis, allowed us to underline possible advantages and drawbacks of adopting a corpus-based approach to AVT studies. Even though possible updates and improvements were proposed to help solve some of the problems identified, it is argued that the added value of Forlixt 1 lies ultimately in having created a valuable instrument that makes it possible to carry out empirically sound contrastive studies, which may be usefully replicated on different language pairs and several types of multimedia texts. Furthermore, multimedia corpora can also play a crucial role in L2 and translation teaching, two disciplines in which their use still lacks systematic investigation.
Abstract:
This research focuses on defining the complex relationship between theory and project in the architectural work of Oswald Mathias Ungers. This relationship rests on several essays and publications which, although never collected in a single organic text, make up an articulated corpus that can be regarded as the foundation of a theory. More specifically, this thesis deals with the role of metaphor in Ungers’ theory and its subsequent practical application to his projects. The path leading from theoretical analysis to architectural project is, in Ungers’ view, slow and mediated; theory is an instrument without which it would not be possible to create the project's foundations. The metaphor is a figure of speech taken from disciplines such as philosophy, aesthetics and linguistics. Using a metaphor implies a transfer of meaning, as it is essentially based on the replacement of a real object with a figurative one. The research is articulated in three parts, each corresponding to a text by Ungers that is considered crucial to understanding the development of his architectural thinking. Each text marks one of three decades of Ungers’ work: the sixties, seventies and eighties. The first part of the research deals with the topic of Großform, expressed by Ungers in his 1966 publication Grossformen im Wohnungsbau, where he defines four criteria on the basis of which architecture identifies with a Großform. One of the hypotheses underlying this study is that there is a relationship between the notion of Großform and the figure of metaphor. The second part of the thesis analyzes the period between the end of the sixties and the seventies, i.e. the time during which Ungers lived in the USA and taught at Cornell University in Ithaca. 
The analysis focuses on the text Entwerfen und Denken in Vorstellungen, Metaphern und Analogien, written by Ungers in 1976 for the exhibition MAN transFORMS, held at the Cooper-Hewitt Museum in New York. In this text Ungers creates a sort of vocabulary to explain the notions of metaphor, analogy, sign, symbol and allegory. It can be regarded as the manifesto of his architectural theory, which is closely intertwined with metaphor as a design instrument and is best expressed in the 11 theses he presented with R. Koolhaas, P. Riemann, H. Kollhoff and A. Ovaska in Die Stadt in der Stadt. Berlin das grüne Stadtarchipel in 1977. The third part analyzes the indissoluble tie between the use of metaphor and the choice of the topic on which the project is based and, starting from Ungers’ 1982 publication Architecture as Theme, explains the relationship between idea/theme and image/metaphor. Playing with shapes requires metaphoric thinking, i.e. drawing references for new ideas from the world of shapes and not just from architecture. For Ungers, the metaphor as a tool to interpret reality becomes a method of inquiry that precedes a project and makes it possible to define the theme on which the project will be based. In Ungers’ case, the architecture of ideas matches the idea of architecture; for him the notions of idea and theme, image and metaphor cannot be separated from each other. The text on the thematization of architecture is not a report on his projects; rather, it expresses the need to put them in order and to highlight the theme on which they are based.
Abstract:
The aim of this dissertation is to create a new controlled language, called Español Técnico Simplificado (ETS). Based on the technical specification of Simplified Technical English (STE), officially known as ASD-STE100, the controlled Spanish ETS takes the form of a metalinguistic document that gives a technical writer or translator specific rules for producing a technical document. The implementation strategy begins with a preliminary study of several controlled languages similar to STE, such as Français Rationalisé and Simplified Technical Spanish. Following an approach characteristic of corpus linguistics, the proposed solution derives the new controlled language by extracting specific information from an ad hoc Spanish-language corpus created and queried for this purpose. The results point to a (controlled) linguistic method suited to producing technical documentation free of ambiguity. The ETS system rests on the concept of intelligibility as a necessary condition to be met when producing a controlled text. Through its macrostructure, the ETS document provides the tools needed to make the controlled text unambiguous. This two-part structure divides the rules logically: the first part contains syntactic and stylistic rules; the second contains a dictionary of a limited number of suitably selected lemmas. All of this serves the principle of one-to-one correspondence of signs, in this case those of the Spanish language. Taken as a whole, the project opens the way to a new language as an alternative to existing ones, created entirely within academia, which serves as a prototype for further research projects.
Abstract:
Information is nowadays a key resource: machine learning and data mining techniques have been developed to extract high-level information from large amounts of data. As most data comes in the form of unstructured text in natural languages, research on text mining is currently very active and deals with practical problems. Among these, text categorization deals with the automatic organization of large quantities of documents into predefined taxonomies of topic categories, possibly arranged in large hierarchies. In commonly proposed machine learning approaches, classifiers are automatically trained from pre-labeled documents: they can perform very accurate classification, but often require a substantial training set and notable computational effort. Methods for cross-domain text categorization have been proposed, which make it possible to leverage a set of labeled documents from one domain to classify those of another. Most methods use advanced statistical techniques, usually involving parameter tuning. A first contribution presented here is a method based on nearest centroid classification, where profiles of categories are generated from the known domain and then iteratively adapted to the unknown one. Despite being conceptually simple and having easily tuned parameters, this method achieves state-of-the-art accuracy on most benchmark datasets with fast running times. A second, deeper contribution involves the design of a domain-independent model to distinguish the degree and type of relatedness between arbitrary documents and topics, inferred from the different types of semantic relationships between their respective representative words, identified by specific search algorithms. The application of this model is tested on both flat and hierarchical text categorization, where it potentially allows the efficient addition of new categories during classification. 
Results show that classification accuracy still needs improvement, but models generated from one domain can be effectively reused in a different one.
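As an illustration only, the nearest centroid idea mentioned in this abstract (category profiles built from a labeled source domain, then iteratively adapted to an unlabeled target domain) might be sketched as follows. This is not the thesis's implementation: the bag-of-words vectors, the cosine similarity, the stopping rule and all function names are assumptions made for the sketch.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Turn a text into an L2-normalized bag-of-words vector (a dict)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()} if norm else {}


def cosine(unit_vec, profile):
    """Cosine similarity between a unit-length vector and a category profile."""
    norm = math.sqrt(sum(v * v for v in profile.values()))
    if not norm:
        return 0.0
    return sum(v * profile.get(w, 0.0) for w, v in unit_vec.items()) / norm


def centroids(vectors, labels):
    """Sum the vectors of each category into one profile per category.

    The sum is used instead of the mean because cosine similarity
    ignores the scale of the profile.
    """
    profiles = defaultdict(Counter)
    for vec, label in zip(vectors, labels):
        profiles[label].update(vec)
    return profiles


def classify(vectors, profiles):
    """Assign each vector to the category whose profile is most similar."""
    names = sorted(profiles)  # fixed order makes tie-breaking deterministic
    return [max(names, key=lambda c: cosine(v, profiles[c])) for v in vectors]


def cross_domain_nearest_centroid(source_texts, source_labels,
                                  target_texts, max_iter=10):
    """Hypothetical helper: adapt source profiles to the target domain."""
    source_vecs = [vectorize(t) for t in source_texts]
    target_vecs = [vectorize(t) for t in target_texts]

    # Step 1: profiles from the known (labeled) domain.
    profiles = centroids(source_vecs, source_labels)
    predicted = classify(target_vecs, profiles)

    # Step 2: rebuild profiles from the target documents themselves,
    # using the current predictions, until the labels stop changing.
    for _ in range(max_iter):
        adapted = centroids(target_vecs, predicted)
        for label, profile in profiles.items():
            # Keep the old profile if a category received no documents.
            adapted.setdefault(label, profile)
        updated = classify(target_vecs, adapted)
        if updated == predicted:
            break
        profiles, predicted = adapted, updated
    return predicted
```

Calling `cross_domain_nearest_centroid(source_texts, source_labels, target_texts)` returns one predicted label per target text. The adaptation loop lets vocabulary that appears only in the target domain enter the category profiles, which is the intuition behind reusing a model across domains.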