3 results for Text alignment
in AMS Tesi di Dottorato - Alm@DL - Università di Bologna
Abstract:
The scientific community has clearly recognized the need for a convergence between semi-structured data management and Information Retrieval techniques. To meet this growing demand, W3C has recently proposed XQuery Full Text, an IR-oriented extension of XQuery. However, query optimization requires the study of important properties such as query equivalence and containment; to this end, a formal representation of documents and queries is needed. The goal of this thesis is to establish such a formal background. We define a data model for XML documents and propose an algebra able to represent most XQuery Full-Text expressions. We show how an XQuery Full-Text expression can be translated into an algebraic expression and how an algebraic expression can be optimized.
Abstract:
The research project presented in this dissertation is about text and memory. The title of the work is "Text and memory between Semiotics and Cognitive Science: an experimental setting about remembering a movie". The object of the research is the relationship between texts, or "textuality" to use a more general semiotic term, and memory. The goal is to analyze the link between those semiotic artifacts that a culture defines as autonomous meaningful objects, namely texts, and the cognitive performance of memory that allows them to be remembered. The theoretical paradigm of this research is an active dialogue between Semiotics and Cognitive Science; the main intent is to establish a productive alignment between the "theory of text" developed in Semiotics and the "theory of memory" outlined in Cognitive Science. In particular, the research studies how human subjects remember and/or misremember a film, taken as a specific case study; in semiotics, films are "cinematographic texts". The research is based on a corpus of data gathered through the qualitative method of interviewing. After an initial screening of a full-length feature film, each participant in the experiment was interviewed twice, according to a pre-established set of questions. The first interview took place immediately after the screening; the follow-up interview took place three months after the screening. The purpose of this design is to elicit two types of recall from the participants. To allow a comparative inquiry, three films were used in the experimental setting. Each film was watched by thirteen subjects, who were each interviewed twice. The corpus of data therefore consists of seventy-eight interviews. The present dissertation reports the results of the investigation of these interviews. It is divided into six main parts. Chapter one presents a theoretical framework for the two main issues: memory and text.
The issue of memory is introduced through a number of studies drawn from the fields of Cognitive Science and Neuroscience. At the same time, a possible relationship with a semiotic approach is developed. The theoretical debate about textuality that characterizes the field of Semiotics is examined in the same chapter. Chapter two deals with methodology, showing how the whole method used to produce the corpus of data was defined. The interview is explored in detail: how it originated, what results were expected, and what the main underlying hypotheses are. Chapter three begins the investigation of the answers given by the spectators. It examines the phenomenon of outstanding details in the process of remembering and tries to define them in semiotic terms. It also investigates the most remembered scenes in the movie. Chapter four considers how the spectators deal with the narrative as a whole, and examines what they think about the global meaning of the film. Chapter five is about affects. It tries to define the role of emotions in the process of comprehension and remembering. Chapter six presents a study of how the spectators account for a single scene of the movie. The complete work offers a broad perspective on the semiotic issue of textuality, drawing on both semiotic and cognitive competence. At the same time it presents a new outlook on the issue of memory, opening several directions of research.
Abstract:
Information is a key resource today: machine learning and data mining techniques have been developed to extract high-level information from large amounts of data. As most data comes in the form of unstructured natural-language text, research on text mining is currently very active and addresses practical problems. Among these, text categorization deals with the automatic organization of large quantities of documents into predefined taxonomies of topic categories, possibly arranged in large hierarchies. In commonly proposed machine learning approaches, classifiers are automatically trained from pre-labeled documents: they can perform very accurate classification, but often require a substantial training set and considerable computational effort. Methods for cross-domain text categorization have been proposed, making it possible to leverage a set of labeled documents from one domain to classify those of another. Most methods use advanced statistical techniques, usually involving parameter tuning. A first contribution presented here is a method based on nearest centroid classification, where profiles of categories are generated from the known domain and then iteratively adapted to the unknown one. Despite being conceptually simple and having easily tuned parameters, this method achieves state-of-the-art accuracy on most benchmark datasets with fast running times. A second, deeper contribution involves the design of a domain-independent model to distinguish the degree and type of relatedness between arbitrary documents and topics, inferred from the different types of semantic relationships between their respective representative words, identified by specific search algorithms. The application of this model is tested on both flat and hierarchical text categorization, where it potentially allows the efficient addition of new categories during classification.
Results show that classification accuracy still needs improvement, but that models generated from one domain can be effectively reused in a different one.
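The idea of generating category profiles from the known domain and iteratively adapting them to the unknown one can be illustrated with a minimal sketch. This is not the thesis's algorithm: the bag-of-words weighting, the cosine similarity, the stopping rule, and all function names (`vectorize`, `centroids`, `cross_domain_classify`) are assumptions chosen only to keep the example small.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words term counts scaled to unit length (assumed weighting)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {t: v / norm for t, v in counts.items()} if norm else {}


def cosine(a, b):
    """Dot product of two sparse unit vectors, i.e. their cosine similarity."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def centroids(vectors, labels):
    """Sum the vectors of each category and scale the sum to unit length."""
    sums = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for t, w in vec.items():
            sums[label][t] += w
    profiles = {}
    for label, total in sums.items():
        norm = math.sqrt(sum(w * w for w in total.values()))
        profiles[label] = {t: w / norm for t, w in total.items()}
    return profiles


def classify(vectors, profiles):
    """Assign each vector to the category with the most similar profile."""
    names = sorted(profiles)  # fixed order makes tie-breaking deterministic
    return [max(names, key=lambda c: cosine(v, profiles[c])) for v in vectors]


def cross_domain_classify(source_docs, source_labels, target_docs, max_iter=10):
    """Build profiles on the labeled domain, then adapt them to the target."""
    source_vecs = [vectorize(d) for d in source_docs]
    target_vecs = [vectorize(d) for d in target_docs]
    profiles = centroids(source_vecs, source_labels)
    predicted = classify(target_vecs, profiles)
    for _ in range(max_iter):
        # Rebuild profiles from the target documents as currently labeled;
        # categories that received no documents keep their previous profile.
        profiles = {**profiles, **centroids(target_vecs, predicted)}
        updated = classify(target_vecs, profiles)
        if updated == predicted:  # assumed stopping rule: labels are stable
            break
        predicted = updated
    return predicted


source_docs = ["football match goal team", "team wins football cup",
               "computer software code bug", "software update computer chip"]
source_labels = ["sports", "sports", "tech", "tech"]
target_docs = ["goal scored by striker in match",
               "striker injured before cup final",
               "bug found in program code",
               "program crash after update"]
labels = cross_domain_classify(source_docs, source_labels, target_docs)
```

In this toy run the target documents share only a few words with the source profiles, yet after the first pass the profiles are rebuilt from the target documents themselves, which is the adaptation step the abstract refers to.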