Biblioteca Digital

941 results for context-free language

Aspectos culturais como fios condutores de interações em Tandem na aprendizagem de português língua estrangeira: interculturalidade, estereótipos e identidade(s)

Relevance: 80.00%

Publisher:

Abstract:

Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)

La traduzione dei racconti di Pat Mora: muoversi tra due lingue e due culture

Relevance: 80.00%

Publisher:

Abstract:

This dissertation deals with the translation of seven children’s books written by the Chicano author Pat Mora. I became interested in the Chicano world, a world suspended between Mexico and the United States, after reading a book by Sandra Cisneros. Pursuing that curiosity, I discovered a hybrid reality full of history, culture and traditions. In this context, the language used is characterized by continuous code-switching between Spanish and English, which I found an interesting phenomenon from both a literary and a translation point of view. During my research into Chicano culture, I came across Pat Mora. Her children’s books fascinated me because of their topical themes (cultural diversity and the defense of identity) and their beautiful illustrations. For this reason, I chose to translate seven of her books, which I believe could enrich children’s literature in Italy. The work consists of five chapters. The first deals with the identity of the Chicano people, their history, their literature and their language. In the second chapter, I outline Pat Mora’s profile, covering her biography and analyzing her most famous works. In the third chapter, I introduce the seven children’s books to be translated and outline their plots and main themes. In the fourth chapter, I present the translations of the books. The fifth chapter is the translation commentary, in which I analyze the language of the source texts and of the target texts, focusing on the choices made during the translation process.

Offline grammar-based recognition of handwritten sentences

Relevance: 80.00%

Publisher:

Abstract:

This paper proposes a sequential coupling of a Hidden Markov Model (HMM) recognizer for offline handwritten English sentences with a probabilistic bottom-up chart parser using Stochastic Context-Free Grammars (SCFG) extracted from a text corpus. Based on extensive experiments, we conclude that syntax analysis helps to improve recognition rates significantly.

Accepting splicing systems with permitting and forbidding words

Relevance: 80.00%

Publisher:

Abstract:

Abstract: In this paper we propose a generalization of the accepting splicing systems introduced in Mitrana et al. (Theor Comput Sci 411:2414–2422, 2010). More precisely, the input word is accepted as soon as a permitting word is obtained provided that no forbidding word has been obtained so far, otherwise it is rejected. Note that in the new variant of accepting splicing system the input word is rejected if either no permitting word is ever generated (like in Mitrana et al. in Theor Comput Sci 411:2414–2422, 2010) or a forbidding word has been generated and no permitting word had been generated before. We investigate the computational power of the new variants of accepting splicing systems and the interrelationships among them. We show that the new condition strictly increases the computational power of accepting splicing systems. Although there are regular languages that cannot be accepted by any of the splicing systems considered here, the new variants can accept non-regular and even non-context-free languages, a situation that is not very common in the case of (extended) finite splicing systems without additional restrictions. We also show that the smallest class of languages out of the four classes defined by accepting splicing systems is strictly included in the class of context-free languages. Solutions to a few decidability problems are immediately derived from the proof of this result.

Programación genética guiada por gramáticas

Relevance: 80.00%

Publisher:

Abstract:

This final-year project presents a tool for experimenting with Grammar-Guided Genetic Programming techniques. Most of the work done so far in this area is too restrictive, since it relies on grammars and fitness functions predefined within the tools themselves, so each tool is only useful for a single problem. The main objective of this tool is that all parameters, grammars, individuals and fitness functions can be easily modified through the interface. In other words, it is a general tool, valid for any type of problem that can be represented by a context-free grammar. To address this objective, the tool provides a mechanism to build the derivation tree of individuals according to a context-free grammar and, from there, applies a series of grammar-guided genetic operators to deliver a final result according to a fitness function, which the user can select before execution. The tool also offers a measure of similarity between individuals belonging to a given generation, allowing individuals to be compared in terms of the semantic information they contain. To validate the work, the tool has been tested with a predefined context-free grammar, and numerous types of tests have been run with different application parameters. The results are compared in order to study their speed of convergence.
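As a rough illustration of building a derivation tree from a context-free grammar, the following Python sketch grows a random individual from a toy grammar. The grammar, the `derive` and `leaves` helpers and the depth limit are all hypothetical choices made for this example; they are not taken from the tool described in the abstract.

```python
import random

# Hypothetical toy grammar: each nonterminal maps to its list of productions.
GRAMMAR = {
    "expr": [["expr", "+", "term"], ["term"]],
    "term": [["x"], ["1"]],
}

def derive(symbol, rng, depth=0, max_depth=6):
    """Return a derivation tree (symbol, children) rooted at `symbol`."""
    if symbol not in GRAMMAR:
        return (symbol, [])  # terminal symbol: leaf node
    productions = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, force the shortest production so growth stops.
        productions = [min(productions, key=len)]
    production = rng.choice(productions)
    children = [derive(s, rng, depth + 1, max_depth) for s in production]
    return (symbol, children)

def leaves(tree):
    """Read the generated sentence off the leaves, left to right."""
    symbol, children = tree
    if not children:
        return [symbol]
    return [leaf for child in children for leaf in leaves(child)]

tree = derive("expr", random.Random(0))
sentence = leaves(tree)
```

Grammar-guided genetic operators would then act on such trees, for instance by swapping subtrees rooted at the same nonterminal so that offspring remain valid sentences of the grammar.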

Contributions to Speech Analytics based on Speech Recognition and Topic Identification

Relevance: 80.00%

Publisher:

Abstract:

The last decade has witnessed major advances in speech recognition technology. Today’s commercial systems are able to recognize continuous speech from numerous speakers, with acceptable levels of error and without the need for an explicit adaptation procedure. Despite this progress, speech recognition is far from being a solved problem. Most of these systems are adjusted to a particular domain, and their efficacy depends significantly, among many other aspects, on the similarity between the language model used and the task being addressed. This dependence is even more important in scenarios where the statistical properties of the language fluctuate over time, for example in application domains involving spontaneous and multitopic speech. In recent years there has been an increasing effort to enhance speech recognition systems for such domains. This has been done, among other approaches, by means of automatic adaptation techniques. These techniques are applied to existing systems, especially since porting a system to a new task or domain may be both time-consuming and expensive. Adaptation techniques require additional sources of information, and spoken language can provide some of them. Speech not only conveys a message; it also provides information on the context in which the spoken communication takes place (e.g. the subject being discussed). Therefore, when we communicate through speech, it is feasible to identify the elements of the language that characterize the context and, at the same time, to track the changes that occur in those elements over time. This information can be extracted and exploited through information retrieval and machine learning techniques. This allows us, within the development of more robust speech recognition systems, to enhance the adaptation of language models to the conditions of the context, thus strengthening the recognition system for domains under changing conditions (such as potential variations in vocabulary, style and topic).

In this sense, the main contribution of this Thesis is the proposal and evaluation of a framework of topic-motivated contextualization based on the dynamic and unsupervised adaptation of language models for the enhancement of an automatic speech recognition system. This adaptation is based on a combined approach (drawing on both the information retrieval and machine learning fields) whereby we identify the topics that are being discussed in an audio recording. The topic identification, therefore, enables the system to adapt the language model according to the contextual conditions. The proposed framework can be divided into two major systems: a topic identification system and a dynamic language model adaptation system. This Thesis can be outlined from the perspective of the particular contributions made in each of the fields that compose the proposed framework:

Regarding the topic identification system, we have focused on enhancing the document preprocessing techniques and on contributing to the definition of more robust criteria for the selection of index-terms.
– Within both information retrieval and machine learning based approaches, the efficiency of topic identification systems depends, to a large extent, on the preprocessing mechanisms applied to the documents. Among the many operations that make up the preprocessing procedures, an adequate selection of index-terms is critical to establish conceptual and semantic relationships between terms and documents. This process might also be weakened by a poor choice of stopwords or a lack of precision in defining stemming rules. In this regard we compare and evaluate different criteria for preprocessing the documents, as well as for improving the selection of the index-terms. This allows us not only to reduce the size of the indexing structure but also to strengthen the topic identification process.
– One of the most crucial aspects of the performance of topic identification systems is the assignment of different weights to different terms depending on their contribution to the content of the document. In this sense we evaluate and propose alternative approaches to traditional weighting schemes (such as tf-idf) that allow us to improve the specificity of terms and to better identify the topics that are related to documents.

Regarding the dynamic language model adaptation, we divide the contextualization process into different steps.
– We propose supervised and unsupervised approaches for the generation of topic-based language models. The first is intended to generate topic-based language models by grouping the documents in the training set according to the original topic labels of the corpus. Nevertheless, a goal of this Thesis is to evaluate whether or not the use of these labels to generate language models is optimal in terms of recognition accuracy. For this reason, we propose a second approach, an unsupervised one, in which the objective is to group the data in the training set into automatic topic clusters based on the semantic similarity between the documents. By means of clustering approaches we expect to obtain a more cohesive association of the documents that are related by similar concepts, thus improving the coverage of the topic-based language models and enhancing the performance of the recognition system.
– We develop various strategies to create a context-dependent language model. Our aim is that this model reflects the semantic context of the current utterance, i.e. the most relevant topics that are being discussed. This model is generated by means of a linear interpolation between the topic-based language models related to the most relevant topics. The estimation of the interpolation weights is based mainly on the outcome of the topic identification process.
– Finally, we propose a methodology for the dynamic adaptation of a background language model. The adaptation process takes into account the context-dependent model as well as the information provided by the topic identification process. The scheme used for the adaptation is a linear interpolation between the background model and the context-dependent one. We also study different approaches to determine the interpolation weights used in this adaptation scheme.

Having defined the basis of our topic-motivated contextualization framework, we propose its application within an automatic speech recognition system. We focus on two aspects: the contextualization of the language models used by the system, and the incorporation of semantic information into a topic-based adaptation process. To achieve this, we propose an experimental framework based on a two-stage recognition architecture. In the first stage of the architecture, information retrieval and machine learning techniques are used to identify the topics in a transcription of an audio segment. This transcription is generated by the recognition system using a background language model. According to the confidence in the topics that have been identified, the dynamic language model adaptation is carried out. In the second stage of the recognition architecture, the adapted language model is used to re-decode the utterance. To test the benefits of the proposed framework, we evaluate each of the major systems mentioned above. The evaluation is conducted on speeches in the political domain using the EPPS (European Parliamentary Plenary Sessions) database from the European TC-STAR project. We analyse several performance metrics that allow us to compare the improvements of the proposed systems against the baseline ones.
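The two interpolation steps described in this abstract can be sketched in a few lines of Python. This is only a minimal illustration on unigram probabilities: the word distributions, the topic names and every weight below are invented for the example, whereas the thesis estimates the weights from the topic identification output.

```python
def interpolate(models, weights):
    """Linear interpolation of word distributions: P(w) = sum_i lambda_i * P_i(w)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    vocabulary = set().union(*models)
    return {
        word: sum(lam * model.get(word, 0.0) for lam, model in zip(weights, models))
        for word in vocabulary
    }

# Hypothetical unigram models (each sums to 1 over its own vocabulary).
background = {"the": 0.5, "vote": 0.1, "budget": 0.1, "fish": 0.3}
topic_economy = {"the": 0.4, "vote": 0.1, "budget": 0.5}
topic_fisheries = {"the": 0.4, "fish": 0.6}

# Step 1: context-dependent model built from the most relevant topic models.
# The weights stand in for topic identification scores (assumed values).
context = interpolate([topic_economy, topic_fisheries], [0.8, 0.2])

# Step 2: adapt the background model towards the context-dependent one.
adapted = interpolate([background, context], [0.6, 0.4])
```

Because each input is a probability distribution and the weights sum to one, the adapted model is again a distribution; here the probability of "budget" rises relative to the background model because the economy topic dominates the assumed context.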

Does translation hinder integration?

Relevance: 80.00%

Publisher:

Abstract:

2006 and 2007 saw renewed debates about rising costs of public sector translation and interpreting. One argument put forward was that the provision of such services for the immigrant population is deterring community cohesion in the UK. For example, the former education secretary Ruth Kelly raised the question of whether we are providing a ‘crutch’ by making translations available to immigrants, thus not giving them the incentive to learn English. The paper will address this question in the context of language policies and language ideologies in the UK. It will be argued that the debate about translation services has been instrumentalised for the discursive construction of identities.

The temporal binding deficit hypothesis of autism

Relevance: 80.00%

Publisher:

Abstract:

Frith has argued that people with autism show “weak central coherence,” an unusual bias toward piecemeal rather than configurational processing and a reduction in the normal tendency to process information in context. However, the precise cognitive and neurological mechanisms underlying weak central coherence are still unknown. We propose the hypothesis that the features of autism associated with weak central coherence result from a reduction in the integration of specialized local neural networks in the brain caused by a deficit in temporal binding. The visuoperceptual anomalies associated with weak central coherence may be attributed to a reduction in synchronization of high-frequency gamma activity between local networks processing local features. The failure to utilize context in language processing in autism can be explained in similar terms. Temporal binding deficits could also contribute to executive dysfunction in autism and to some of the deficits in socialization and communication.

Fitness and novelty in evolutionary art

Relevance: 80.00%

Publisher:

Abstract:

In this paper the effects of introducing novelty search in evolutionary art are explored. Our algorithm combines fitness and novelty metrics to frame image evolution as a multi-objective optimisation problem, promoting the creation of images that are both suitable and diverse. The method is illustrated by using two evolutionary art engines for the evolution of figurative objects and context free design grammars. The results demonstrate the ability of the algorithm to obtain a larger set of fit images compared to traditional fitness-based evolution, regardless of the engine used.
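One common way to treat fitness and novelty as two objectives is to keep the individuals that are not dominated on either. The Python sketch below assumes novelty is the mean Euclidean distance to the k nearest neighbours in some feature space and selects the Pareto front; the feature vectors, fitness values and both helper functions are illustrative assumptions, not the algorithm used in the paper.

```python
def novelty(index, features, k=2):
    """Mean Euclidean distance from one individual to its k nearest neighbours."""
    me = features[index]
    distances = sorted(
        sum((a - b) ** 2 for a, b in zip(me, other)) ** 0.5
        for j, other in enumerate(features)
        if j != index
    )
    return sum(distances[:k]) / k

def pareto_front(scores):
    """Indices of individuals not dominated on (fitness, novelty), both maximised."""
    def dominates(p, q):
        return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))
    return [
        i for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]

# Hypothetical population: image feature vectors and fitness values.
features = [(0.0, 0.0), (0.05, 0.0), (5.0, 5.0), (0.0, 0.1)]
fitness = [0.9, 0.5, 0.4, 0.8]

scores = [(fitness[i], novelty(i, features)) for i in range(len(features))]
front = pareto_front(scores)
```

In this toy population the third individual has the lowest fitness but survives because it lies far from the others, which is the kind of diversity pressure a novelty objective is meant to add.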

A DNA Codification for Genetic Algorithms Simulation

Relevance: 80.00%

Publisher:

Abstract:

In this paper we propose a model of encoding data into DNA strands so that this data can be used in the simulation of a genetic algorithm based on molecular operations. DNA computing is an impressive computational model that needs algorithms to work properly and efficiently. The first problem when trying to apply an algorithm in DNA computing is how to codify the data that the algorithm will use. In a genetic algorithm the first objective must be to codify the genes, which are the main data. A concrete encoding of the genes in a single DNA strand is presented and we discuss what this codification is suitable for. Previous work on DNA coding defined bond-free languages, which have several properties ensuring the stability of any DNA word of such a language. We prove that a bond-free language is necessary but not sufficient to codify a gene giving the correct codification.

Using Inside-Outside Algorithm for Estimation of the Offspring Distribution in Multitype Branching Processes

Relevance: 80.00%

Publisher:

Abstract:

Multitype branching processes (MTBP) model branching structures, where the nodes of the resulting tree are particles of different types. Usually such a process is not observable in the sense of the whole tree, but only as the “generation” at a given moment in time, which consists of the number of particles of every type. This requires an EM-type algorithm to obtain a maximum likelihood (ML) estimate of the parameters of the branching process. Using a version of the inside-outside algorithm for stochastic context-free grammars (SCFG), such an estimate could be obtained for the offspring distribution of the process.
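For readers unfamiliar with the inside-outside algorithm, the Python sketch below computes only its inside pass for a stochastic context-free grammar in Chomsky normal form. The toy grammar and its probabilities are assumptions made for illustration; the mapping from branching processes to grammar rules used in the paper is not reproduced here.

```python
from collections import defaultdict

# Hypothetical toy SCFG in Chomsky normal form.
BINARY = {("S", "A", "B"): 0.7, ("S", "S", "S"): 0.3}      # X -> Y Z
LEXICAL = {("A", "a"): 1.0, ("B", "b"): 0.6, ("B", "c"): 0.4}  # X -> word

def inside(words, start="S"):
    """Probability that `start` derives `words` (inside pass only)."""
    n = len(words)
    beta = defaultdict(float)  # beta[(i, j, X)] = P(X derives words[i:j])
    for i, word in enumerate(words):
        for (lhs, terminal), p in LEXICAL.items():
            if terminal == word:
                beta[(i, i + 1, lhs)] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for (lhs, left, right), p in BINARY.items():
                    beta[(i, j, lhs)] += p * beta[(i, k, left)] * beta[(k, j, right)]
    return beta[(0, n, start)]

p_short = inside(["a", "b"])
p_long = inside(["a", "b", "a", "c"])
```

The full inside-outside procedure adds a symmetric outside pass and uses both tables as the E-step of an EM iteration that re-estimates the rule probabilities.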

EM Estimation of the Offspring Distribution in Multitype Branching Processes - a Model in Cell Kinetics

Relevance: 80.00%

Publisher:

Abstract:

2000 Mathematics Subject Classification: 60J80, 60J85, 62P10, 92D25.

American Teachers in Anti-American Environments: How to Incorporate “Culture” in the EFL Classroom

Relevance: 80.00%

Publisher:

Abstract:

This paper explores the facets and the importance of culture as a necessary context for language competency, acknowledges the relevance of an antipathy towards Americanization, and investigates the characteristics of successful pedagogy for American teachers in a global setting of turbulent geopolitical circumstances influencing the EFL environment.

A noção de sistema na fundação da linguística moderna

Relevance: 80.00%

Publisher:

Abstract:

This work investigates the Saussurean notion of system. This notion is fundamental to Ferdinand de Saussure's theorization, since it is part of the definition of "langue" as he conceived it. This definition was crucial to delimiting the specific object of study of linguistics, which secured its place among the modern sciences. However, the notion of system was not created by Saussure. Not only in linguistics but also in other areas, this notion appeared in very ancient studies, bound up with the establishment of man in society and the development of economic and organizational activities. Specifically, in the context of language studies, the notion of system was part of the work of the first grammarians in the West, in ancient Greece. Moreover, it was also used later, in studies of synonymy and in the comparative analysis of languages developed by nineteenth-century scholars. Nevertheless, although Saussure was trained in Leipzig and Berlin, amid comparatist studies, his notion of system is an innovation as well as a continuation. In light of this, we aim to highlight the aspects of the Saussurean notion of system that allow a relationship of continuity and rupture with other conceptions of system to be established. To that end, we investigate four documents authored by Saussure: the « Cours de linguistique générale », the « Mémoire sur le système primitif des voyelles dans les langues indo-européennes », and the two sets of manuscripts « De l'essence double du langage » and « Notes pour le cours III ».

Content-aware compression for big textual data analysis

Relevance: 80.00%

Publisher:

Abstract:

A substantial amount of information on the Internet is present in the form of text. The value of this semi-structured and unstructured data has been widely acknowledged, with consequent scientific and commercial exploitation. The ever-increasing data production, however, pushes data analytic platforms to their limit. This thesis proposes techniques for more efficient textual big data analysis suitable for the Hadoop analytic platform. This research explores the direct processing of compressed textual data. The focus is on developing novel compression methods with a number of desirable properties to support text-based big data analysis in distributed environments. The novel contributions of this work include the following. Firstly, a Content-aware Partial Compression (CaPC) scheme is developed. CaPC makes a distinction between informational and functional content in which only the informational content is compressed. Thus, the compressed data is made transparent to existing software libraries which often rely on functional content to work. Secondly, a context-free bit-oriented compression scheme (Approximated Huffman Compression) based on the Huffman algorithm is developed. This uses a hybrid data structure that allows pattern searching in compressed data in linear time. Thirdly, several modern compression schemes have been extended so that the compressed data can be safely split with respect to logical data records in distributed file systems. Furthermore, an innovative two layer compression architecture is used, in which each compression layer is appropriate for the corresponding stage of data processing. Peripheral libraries are developed that seamlessly link the proposed compression schemes to existing analytic platforms and computational frameworks, and also make the use of the compressed data transparent to developers. The compression schemes have been evaluated for a number of standard MapReduce analysis tasks using a collection of real-world datasets. 
In comparison with existing solutions, they have shown substantial improvement in performance and significant reduction in system resource requirements.
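As background for the Huffman-based scheme mentioned in this abstract, the Python sketch below builds a classic Huffman code table and round-trips a short string. It shows the standard algorithm only; it is not the thesis's Approximated Huffman Compression, and the function names are chosen for this example.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix-free bit code per character with the classic Huffman algorithm."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}  # single symbol still needs one bit
    # Heap items: (weight, unique tiebreak, partial code table for that subtree).
    heap = [(n, i, {ch: ""}) for i, (ch, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n1, _, codes1 = heapq.heappop(heap)
        n2, _, codes2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in codes1.items()}
        merged.update({ch: "1" + code for ch, code in codes2.items()})
        heapq.heappush(heap, (n1 + n2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def encode(text, codes):
    return "".join(codes[ch] for ch in text)

def decode(bits, codes):
    reverse = {code: ch for ch, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:  # prefix-free: the first match is the symbol
            out.append(reverse[current])
            current = ""
    return "".join(out)

text = "abracadabra"
codes = huffman_codes(text)
bits = encode(text, codes)
```

Frequent characters receive shorter codes, which is why the bit string is much shorter than eight bits per character; searching or splitting such bit-oriented output is the harder problem that the thesis addresses.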
