226 results for lazy parsing
Abstract:
The increasing amount of available semistructured data demands efficient mechanisms to store, process, and search an enormous corpus of data to encourage its global adoption. Current techniques for storing semistructured documents either map them to relational databases or use a combination of flat files and indexes. Both approaches result in a mismatch between the tree structure of semistructured data and the access characteristics of the underlying storage devices. Furthermore, the inefficiency of XML parsing methods has slowed the large-scale adoption of XML in actual system implementations. The recent development of lazy parsing techniques is a major step towards improving this situation, but lazy parsers still have significant drawbacks that hinder the widespread adoption of XML.

Once the processing (storage and parsing) issues for semistructured data have been addressed, another key challenge in leveraging semistructured data is performing effective information discovery on it. Previous work has addressed this problem in a generic (i.e., domain-independent) way, but the process can be improved if knowledge about the specific domain is taken into consideration.

This dissertation had two general goals. The first was to devise novel techniques to efficiently store and process semistructured documents, with two specific aims: we proposed a method for storing semistructured documents that maps the physical characteristics of the documents to the geometrical layout of hard drives, and we developed a Double-Lazy Parser for semistructured documents that introduces lazy behavior in both the pre-parsing and progressive parsing phases of the standard Document Object Model's parsing mechanism.

The second goal was to construct a user-friendly and efficient engine for performing Information Discovery over domain-specific semistructured documents, also with two aims: we presented a framework that exploits domain-specific knowledge to improve the quality of the information discovery process by incorporating domain ontologies, and we proposed meaningful evaluation metrics to compare the results of search systems over semistructured documents.
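The core idea behind this kind of lazy DOM construction can be illustrated with a minimal, self-contained sketch: a node records only the byte span of its element and scans for its children the first time they are accessed. This is an invented toy, not the dissertation's actual Double-Lazy Parser; the LazyNode class and its naive regex scanner are hypothetical names introduced here for illustration.

```python
import re

# Naive tag scanner: start tags, end tags, and self-closing tags. It does
# not handle comments, CDATA, or ">" inside attribute values.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^>]*?(/?)>")

class LazyNode:
    """An element whose children are parsed only on first access."""

    def __init__(self, source, tag, body_start, body_end):
        self.source = source              # the full document text
        self.tag = tag
        self.body_start = body_start      # span of this element's content
        self.body_end = body_end
        self._children = None             # None = not yet scanned (lazy)

    @property
    def children(self):
        if self._children is None:        # parse on demand, exactly once
            self._children = self._scan_children()
        return self._children

    def _scan_children(self):
        kids, depth, opener = [], 0, None
        for m in TAG.finditer(self.source, self.body_start, self.body_end):
            closing, name, selfclosing = m.groups()
            if selfclosing:
                if depth == 0:            # direct child with an empty body
                    kids.append(LazyNode(self.source, name, m.end(), m.end()))
            elif not closing:
                if depth == 0:
                    opener = m            # remember the direct child's start tag
                depth += 1
            else:
                depth -= 1
                if depth == 0:            # matching end tag of a direct child
                    kids.append(LazyNode(self.source, opener.group(2),
                                         opener.end(), m.start()))
        return kids

def parse(text):
    first = TAG.search(text)              # root start tag
    return LazyNode(text, first.group(2), first.end(), text.rfind("</"))

doc = parse("<a><b><c/></b><d>x</d></a>")
print([c.tag for c in doc.children])              # ['b', 'd']; <c/> not parsed yet
print([c.tag for c in doc.children[0].children])  # ['c'], parsed only on demand
```

A real lazy parser would additionally persist the pre-parsed skeleton so that repeated queries touch only the fragments they need, which is precisely where lazy approaches win over eager DOM construction on large documents.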
Abstract:
Incremental parsing has long been recognized as a technique of great utility in the construction of language-based editors, and correspondingly, the area currently enjoys a mature theory. Unfortunately, many practical considerations have been largely overlooked in previously published algorithms. Many user requirements for an editing system necessarily impact on the design of its incremental parser, but most approaches focus only on one: response time. This paper details an incremental parser based on LR parsing techniques and designed for use in a modeless syntax recognition editor. The nature of this editor places significant demands on the structure and quality of the document representation it uses, and hence, on the parser. The strategy presented here is novel in that both the parser and the representation it constructs are tolerant of the inevitable and frequent syntax errors that arise during editing. This is achieved by a method that differs from conventional error repair techniques, and that is more appropriate for use in an interactive context. Furthermore, the parser aims to minimize disturbance to this representation, not only to ensure other system components can operate incrementally, but also to avoid unfortunate consequences for certain user-oriented services. The algorithm is augmented with a limited form of predictive tree-building, and a technique is presented for the determination of valid symbols for menu-based insertion. Copyright (C) 2001 John Wiley & Sons, Ltd.
Abstract:
Program slicing is a well-known family of techniques intended to identify and isolate code fragments which depend on, or are depended upon by, specific program entities. This is particularly useful in the areas of reverse engineering, program understanding, testing, and software maintenance. Most slicing methods, and the corresponding tools, target either the imperative or the object-oriented paradigm, where program slices are computed with respect to a variable or a program statement. Taking a complementary point of view, this paper focuses on the slicing of higher-order functional programs under a lazy evaluation strategy. A prototype of a Haskell slicer, built as a proof of concept for these ideas, is also introduced.
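For intuition, backward slicing in the simple first-order case reduces to reachability over a def-use dependency graph, as the sketch below shows. This illustrates the concept only: slicing lazy higher-order programs, as the paper does, additionally requires tracking which expressions are actually demanded. The deps table and variable names are invented for the example.

```python
# Each variable maps to the variables its defining expression uses:
#   a = input();  b = a + 1;  c = a * 2;  d = b + c;  e = b - 1
deps = {
    "a": [],
    "b": ["a"],
    "c": ["a"],
    "d": ["b", "c"],
    "e": ["b"],
}

def backward_slice(target, deps):
    """All definitions the target transitively depends on."""
    seen, stack = set(), [target]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(deps[v])      # follow def-use edges backwards
    return seen

print(sorted(backward_slice("e", deps)))   # ['a', 'b', 'e']; c and d sliced away
```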
Abstract:
This Master's dissertation in Translation aims, through the analysis of the translation of a culturally and temporally distant text, "Lazy Lawrence" (1796) by Maria Edgeworth, to reflect on the translation of children's and young adult literature, starting from certain theoretical assumptions. Reviving a text with these characteristics opens space for formulating multiple hypotheses and drawing some conclusions. The first part introduces the author under study, Maria Edgeworth, followed by an overview of her work. The second part offers considerations on children's and young adult literature and its translation. The third part presents some theoretical perspectives on translation, opening space for reflection based on an analysis of the translation of the tale "Lazy Lawrence", pointing out and describing some of the choices made, which seek to demonstrate the many questions that confront a translator dealing with a text that is temporally and culturally distant from the target culture.
Abstract:
Master's dissertation in Informatics Engineering.
Abstract:
Biomedical research is currently facing a new type of challenge: an excess of information, both in terms of raw data from experiments and in the number of scientific publications describing their results. Mirroring the focus on data mining techniques to address the issues of structured data, there has recently been great interest in the development and application of text mining techniques to make more effective use of the knowledge contained in biomedical scientific publications, accessible only in the form of natural human language. This thesis describes research done in the broader scope of projects aiming to develop methods, tools, and techniques for text mining tasks in general and for the biomedical domain in particular. The work described here involves more specifically the goal of extracting information from statements concerning relations of biomedical entities, such as protein-protein interactions. The approach taken uses full parsing, that is, syntactic analysis of the entire structure of sentences, together with machine learning, aiming to develop reliable methods that can further be generalized to other domains.

The five papers at the core of this thesis describe research on a number of distinct but related topics in text mining. In the first of these studies, we assessed the applicability of two popular general English parsers to biomedical text mining and, finding their performance limited, identified several specific challenges to accurate parsing of domain text. In a follow-up study focusing on parsing issues related to specialized domain terminology, we evaluated three lexical adaptation methods. We found that the accurate resolution of unknown words can considerably improve parsing performance, and we introduced a domain-adapted parser that reduced the error rate of the original by 10% while also roughly halving parsing time. To establish the relative merits of parsers that differ in the formalisms they apply and in the representation given to their syntactic analyses, we also developed evaluation methodology, considering different approaches to establishing comparable dependency-based evaluation results. We introduced a methodology for creating highly accurate conversions between different parse representations, demonstrating the feasibility of unifying diverse syntactic schemes under a shared, application-oriented representation. In addition to allowing formalism-neutral evaluation, we argue that such unification can also increase the value of parsers for domain text mining. As a further step in this direction, we analysed the characteristics of publicly available biomedical corpora annotated for protein-protein interactions and created tools for converting them into a shared form, thus also contributing to the unification of text mining resources. The introduced unified corpora allowed us to perform a task-oriented comparative evaluation of biomedical text mining corpora. This evaluation established clear limits on the comparability of results for text mining methods evaluated on different resources, prompting further efforts toward standardization. To support this and other research, we also designed and annotated BioInfer, the first domain corpus of its size combining annotation of syntax and biomedical entities with a detailed annotation of their relationships.

The corpus represents a major design and development effort of the research group, with manual annotation that identifies over 6,000 entities, 2,500 relationships, and 28,000 syntactic dependencies in 1,100 sentences. In addition to combining these key annotations for a single set of sentences, BioInfer was also the first domain resource to introduce a representation of entity relations that is supported by ontologies and able to capture complex, structured relationships. Part I of this thesis presents a summary of this research in the broader context of a text mining system, and Part II contains reprints of the five included publications.
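As a hedged illustration of why full parsing helps relation extraction (a generic technique, not this thesis's method or the BioInfer representation): one common approach connects two entity mentions through the shortest path in the sentence's dependency graph and uses that path as input to a classifier. The hand-written parse below stands in for real parser output, and the example proteins and relations are invented.

```python
from collections import deque

# "RAD51 interacts with BRCA2": child token index -> (head index, relation).
# Token 1 ("interacts") is the root of this hand-written dependency parse.
tokens = ["RAD51", "interacts", "with", "BRCA2"]
heads = {0: (1, "nsubj"), 2: (1, "prep"), 3: (2, "pobj")}

def shortest_dep_path(a, b, heads, n):
    graph = {i: set() for i in range(n)}       # undirected dependency graph
    for child, (head, _) in heads.items():
        graph[child].add(head)
        graph[head].add(child)
    prev, queue = {a: None}, deque([a])
    while queue:                               # breadth-first search
        v = queue.popleft()
        if v == b:
            path = []
            while v is not None:               # walk back to the start node
                path.append(v)
                v = prev[v]
            return path[::-1]
        for w in graph[v]:
            if w not in prev:
                prev[w] = v
                queue.append(w)
    return None

path = shortest_dep_path(0, 3, heads, len(tokens))
print(" -> ".join(tokens[i] for i in path))  # RAD51 -> interacts -> with -> BRCA2
```

Paths like this are exactly what the parse-representation conversions described above make comparable across parsers: once diverse formalisms are unified into one dependency scheme, the same extraction machinery can run on any parser's output.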
Abstract:
During the past few years, there has been much discussion of a shift from rule-based systems to principle-based systems for natural language processing. This paper outlines the major computational advantages of principle-based parsing, its differences from the usual rule-based approach, and surveys several existing principle-based parsing systems used for handling languages as diverse as Warlpiri, English, and Spanish, as well as language translation.
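The computational contrast can be caricatured in a few lines: instead of many construction-specific rules, a principle-based parser applies a small set of general well-formedness principles as filters over candidate structures. The sketch below shows only this generate-and-filter view under invented, drastically simplified "principles"; real systems interleave principle application with structure building for efficiency rather than enumerating all candidates.

```python
# Candidate analyses for a sentence, reduced to two boolean properties.
candidates = [
    {"theta_roles_assigned": True,  "case_assigned": True},
    {"theta_roles_assigned": True,  "case_assigned": False},
    {"theta_roles_assigned": False, "case_assigned": True},
]

# General principles acting as filters (simplified stand-ins for the
# Theta Criterion and the Case Filter).
principles = [
    lambda s: s["theta_roles_assigned"],
    lambda s: s["case_assigned"],
]

parses = [s for s in candidates if all(p(s) for p in principles)]
print(parses)   # only the first candidate survives every principle
```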
Abstract:
Meets the requirements of the curricula of England, Wales, and Scotland by providing the basic knowledge involved in learning to read and write. Material for developing phonological awareness, that is, awareness of the sounds in spoken words, and knowledge of symbol-sound relationships, concentrating on rhymes and letter sounds. Children's awareness of sounds begins with words and syllables, moves on to the sounds within words, and then to phonemes, which are the hardest to identify except when they occupy the initial position in a word. For discriminating, writing, and reading the phoneme 'er'. For children aged seven and over.
Abstract:
Abstract based on the author's. Abstract in Spanish and English. Notes at the end.
Abstract:
Twenty-five monolingual (L1) children with Specific Language Impairment (SLI), 32 sequential bilingual (L2) children, and 29 L1 controls completed the Test of Active & Passive Sentences-Revised (van der Lely, 1996) and a self-paced listening task with picture verification for actives and passives (Marinis, 2007). Both tasks revealed important between-group differences. The children with SLI showed difficulties with both actives and passives when they had to reanalyse thematic roles on-line, and their error pattern provided evidence of working-memory limitations. The L2 children showed difficulties only with passives, both on-line and off-line. We suggest that these difficulties relate to the complex syntactic algorithm in passives and reflect an earlier developmental stage due to reduced exposure to the L2. The results are discussed in relation to theories of SLI and are best accommodated by accounts proposing that difficulties in the comprehension of passives stem from processing limitations.