995 results for Web as a Corpus
Abstract:
In this paper, we describe a machine-translated parallel English corpus for the NTCIR Chinese, Japanese and Korean (CJK) Wikipedia collections. The document collection is named the CJK2E Wikipedia XML corpus. The corpus could serve the information retrieval research community and support knowledge sharing in Wikipedia in many ways; for example, it could be used for experiments in cross-lingual information retrieval, cross-lingual link discovery, or omni-lingual information retrieval research. Furthermore, the translated CJK articles could be used to further expand the current coverage of the English Wikipedia.
Abstract:
We present three natural language marking strategies based on fast and reliable shallow parsing techniques and on widely available lexical resources: lexical substitution, adjective conjunction swaps, and relativiser switching. We test these techniques on a random sample of the British National Corpus. Individual candidate marks are checked for goodness of structural and semantic fit, using both lexical resources and the web as a corpus. A representative sample of marks is given to 25 human judges to evaluate for acceptability and preservation of meaning. This establishes a correlation between corpus-based felicity measures and perceived quality, and makes qualified predictions. Grammatical acceptability correlates strongly with our automatic measure (Pearson's r = 0.795, p = 0.001), allowing us to account for about two thirds of the variability in human judgements. A moderate but statistically insignificant correlation (Pearson's r = 0.422, p = 0.356) is found with judgements of meaning preservation, indicating that the contextual window of five content words used for our automatic measure may need to be extended. © 2007 SPIE-IS&T.
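The "two thirds of variability" claim above follows from squaring the correlation coefficient. As a quick illustration (a minimal sketch for this listing, not code from the paper), Pearson's r and the share of variance it accounts for can be computed as:

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# r^2 is the proportion of variance in one variable accounted for by the other:
# the reported r = 0.795 gives r^2 ≈ 0.63, i.e. about two thirds.
print(round(0.795 ** 2, 2))  # → 0.63
```

By the same reasoning, the insignificant r = 0.422 for meaning preservation would account for only about 18% of the variance, which motivates the authors' suggestion to widen the context window.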
Abstract:
Graduate Program in Linguistic Studies - IBILCE
Abstract:
Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP)
Abstract:
False friends are pairs of words in two languages that are perceived as similar but have different meanings. We present an improved algorithm for acquiring false friends from a sentence-level aligned parallel corpus, based on statistical observations of word occurrences and co-occurrences in the parallel sentences. The results are compared with an entirely semantic measure of cross-lingual similarity between words, which uses the Web as a corpus by analysing the words' local contexts extracted from the text snippets returned by searching in Google. The statistical and semantic measures are further combined into an improved algorithm for the identification of false friends that achieves nearly twice the accuracy of previously known algorithms. The evaluation is performed on identifying cognates between Bulgarian and Russian, but the proposed methods could be adapted to other language pairs for which parallel corpora and bilingual glossaries are available.
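The semantic measure described above compares the local contexts of two words as harvested from search-result snippets. A rough illustration of that idea (a sketch only; the function names and whitespace tokenisation are assumptions, not the authors' implementation) is a cosine measure over co-occurrence counts:

```python
import math
from collections import Counter

def context_vector(snippets, target):
    """Bag-of-words vector of terms co-occurring with `target`
    in search-result snippets."""
    vec = Counter()
    for snippet in snippets:
        tokens = snippet.lower().split()
        if target in tokens:
            vec.update(t for t in tokens if t != target)
    return vec

def cosine(v1, v2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v1[t] * v2[t] for t in v1)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

In the cross-lingual setting, the context words of one vector would first be mapped through a bilingual glossary before comparing; true translations tend to share contexts, while false friends do not.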
Abstract:
In the twenty-first century, when the new information and communication technologies are the order of the day, expressions that traditionally belonged to other, less virtual settings now appear online, encompassing previously unstudied age groups in which gender differences are evident. With easy Internet access and connectivity, many of us have the tools to write on social networks about certain emotions in their most extreme degree, as well as about suicidal ideation. However, there are deeper locations unknown to some users, such as the Deep Web (and its Tor browser), which allow complete user anonymity. There is therefore a need to create a corpus of suicide-related messages and messages concerning deep emotions, in order to analyse their lexicon by computational means, with a prior categorisation of the results, so as to foster the creation of programs that detect these expressions and perform a preventive role.
Abstract:
Manually constructing domain-specific sentiment lexicons is extremely time-consuming, and it may not even be feasible for domains where linguistic expertise is not available. Research on the automatic construction of domain-specific sentiment lexicons has therefore become a hot topic in recent years. The main contribution of this paper is the illustration of a novel semi-supervised learning method which exploits both term-to-term and document-to-term relations hidden in a corpus for the construction of domain-specific sentiment lexicons. More specifically, the proposed two-pass pseudo-labeling method combines shallow linguistic parsing and corpus-based statistical learning to make domain-specific sentiment extraction scalable with respect to the sheer volume of opinionated documents archived on the Internet these days. Another novelty of the proposed method is that it can utilize the readily available user-contributed labels of opinionated documents (e.g., the user ratings of product reviews) to bootstrap the performance of sentiment lexicon construction. Our experiments show that the proposed method can generate high-quality domain-specific sentiment lexicons as directly assessed by human experts. Moreover, the system-generated domain-specific sentiment lexicons can improve polarity prediction tasks at the document level by 2.18% when compared to other well-known baseline methods. Our research opens the door to the development of practical and scalable methods for domain-specific sentiment analysis.
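The idea of bootstrapping a lexicon from user-contributed ratings can be conveyed with a deliberately naive sketch (the averaging scheme and the 1–5 rating scale are assumptions for illustration, not the paper's two-pass method):

```python
from collections import defaultdict

def build_lexicon(docs, ratings):
    """Naive pseudo-labeling sketch: a term's polarity score is the average
    centred rating of the documents it appears in.

    docs    -- list of token lists (one per opinionated document)
    ratings -- user star ratings on a 1-5 scale (3 = neutral)
    """
    score = defaultdict(float)
    count = defaultdict(int)
    for tokens, rating in zip(docs, ratings):
        for term in set(tokens):          # count each term once per document
            score[term] += rating - 3     # centre: >0 positive, <0 negative
            count[term] += 1
    return {term: score[term] / count[term] for term in score}
```

Terms that consistently appear in highly rated documents end up with positive scores, and vice versa; the paper's method refines such document-to-term signals with term-to-term relations and shallow parsing.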
Abstract:
Recent Australian early childhood policy and curriculum guidelines promoting the use of technologies invite investigations of young children’s practices in classrooms. This study examined the practices of one preparatory year classroom, to show teacher and child interactions as they engaged in Web searching. The study investigated the in situ practices of the teacher and children to show how they accomplished the Web search. The data corpus consists of eight hours of video-recorded interactions over three days where children and teachers engaged in Web searching. One episode was selected that showed a teacher and two children undertaking a Web search. The episode is shown to consist of four phases: deciding on a new search subject, inputting the search query, considering the result options, and exploring the selected result. The sociological perspectives of ethnomethodology and conversation analysis were employed as the conceptual and methodological frameworks of the study, to analyse the video-recorded teacher and child interactions as they co-constructed a Web search. Ethnomethodology is concerned with how people make ‘sense’ in everyday interactions, and conversation analysis focuses on the sequential features of interaction to show how the interaction unfolds moment by moment. This extended single case analysis showed how the Web search was accomplished over multiple turns, and how the children and teacher collaboratively engaged in talk. There are four main findings. The first was that Web searching featured sustained teacher-child interaction, requiring a particular sort of classroom organisation to enable the teacher to work in this sustained way. The second finding was that the teacher’s actions recognised the children’s interactional competence in situ, orchestrating an interactional climate where everyone was heard.
The third finding was that the teacher drew upon a range of interactional resources designed to progress the activity at hand, that of accomplishing the Web search. The teacher drew upon the interactional resources of interrogatives, discourse markers, and multi-unit turns during the Web search, and these assisted the teacher and children to co-construct their discussion, decide upon and co-ordinate their future actions, and accomplish the Web search in a timely way. The fourth finding explicates how particular social and pedagogic orders are accomplished through talk, where children collaborated with each other and with the teacher to complete the Web search. The study makes three key recommendations for the field of early childhood education. The first recommendation is that fine-grained transcription and analysis of interaction aids in understanding the interactional practices of Web searching. This study offers material for use in professional development, such as using transcribed and video-recorded interactions to highlight how teachers strategically engage with children, that is, how talk works in classroom settings. Another strategy is to focus on the social interactions of members engaging in Web searches, which is likely to be of interest to teachers as they work to engage with children in an increasingly online environment. The second recommendation involves classroom organisation: how teachers consider and plan for extended periods of time for Web searching, and how teachers accommodate children’s prior knowledge of Web searching in their classrooms. The third recommendation concerns future empirical research, with suggested topics focusing on the social interactions of children as they engage with peers while Web searching, as well as investigations of techno-literacy skills as children use the Internet in the early years.
Abstract:
This paper details the participation of the Australian e-Health Research Centre (AEHRC) in the ShARe/CLEF 2013 eHealth Evaluation Lab – Task 3. This task aims to evaluate the use of information retrieval (IR) systems to aid consumers (e.g. patients and their relatives) in seeking health advice on the Web. Our submissions to the ShARe/CLEF challenge are based on language models generated from the web corpus provided by the organisers. Our baseline system is a standard Dirichlet smoothed language model. We enhance the baseline by identifying and correcting spelling mistakes in queries, as well as expanding acronyms using AEHRC's Medtex medical text analysis platform. We then consider the readability and the authoritativeness of web pages to further enhance the quality of the document ranking. Measures of readability are integrated in the language models used for retrieval via prior probabilities. Prior probabilities are also used to encode authoritativeness information derived from a list of top-100 consumer health websites. Empirical results show that correcting spelling mistakes and expanding acronyms found in queries significantly improves the effectiveness of the language model baseline. Readability priors seem to increase retrieval effectiveness for graded relevance at early ranks (nDCG@5, but not precision), but no improvements are found at later ranks and when considering binary relevance. The authoritativeness prior does not appear to provide retrieval gains over the baseline: this is likely to be because of the small overlap between websites in the corpus and those in the top-100 consumer-health websites we acquired.
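The baseline named above is the standard Dirichlet-smoothed query-likelihood model, and document priors (here carrying readability or authoritativeness) enter the ranking score as an additive log term. A minimal sketch (parameter names and the μ value are illustrative defaults, not the paper's exact configuration):

```python
import math
from collections import Counter

def score(query, doc, collection, mu=2000.0, log_prior=0.0):
    """Dirichlet-smoothed query-likelihood score for one document, with an
    optional document log-prior (e.g. encoding readability).

    query, doc -- token lists; collection -- token list of the whole corpus
    """
    dtf = Counter(doc)                    # document term frequencies
    ctf = Counter(collection)             # collection term frequencies
    clen = len(collection)
    s = log_prior
    for w in query:
        p_c = ctf[w] / clen               # collection language model p(w|C)
        p_wd = (dtf[w] + mu * p_c) / (len(doc) + mu)
        s += math.log(p_wd) if p_wd > 0 else float("-inf")
    return s
```

Because the prior is simply added in log space, a page judged more readable or more authoritative is promoted uniformly across queries, which matches the paper's description of integrating such evidence via prior probabilities.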
Abstract:
This paper evaluates the efficiency of a number of popular corpus-based distributional models in performing literature-based discovery on very large document sets, including online collections. Literature-based discovery is the process of identifying previously unknown connections from text, often published literature, that could lead to the development of new techniques or technologies. Literature-based discovery has attracted growing research interest ever since Swanson's serendipitous discovery of the therapeutic effects of fish oil on Raynaud's disease in 1986. The successful application of distributional models in automating the identification of the indirect associations underpinning literature-based discovery has been amply demonstrated in the medical domain. However, we wish to investigate the computational complexity of distributional models for literature-based discovery on much larger document collections, as they may provide computationally tractable solutions to tasks such as predicting future disruptive innovations. In this paper we perform a computational complexity analysis of four successful corpus-based distributional models to evaluate their fitness for such tasks. Our results indicate that corpus-based distributional models that store their representations in fixed dimensions provide superior efficiency on literature-based discovery tasks.
Abstract:
This thesis examines the use of the contents of World Wide Web pages, in a corpus-like fashion, as linguistic research material. The World Wide Web contains many times more text than the largest existing traditional text corpora, so web pages are likely to yield many occurrences of words and constructions that are rare in traditional corpora. Web pages can be used as data in two ways: one can collect a random sample of web pages and build an independent corpus from their contents, or one can use the entire World Wide Web as a corpus through web search engines. Web pages have been used as research material in many fields of linguistics, such as lexicographic research, the study of syntactic constructions, pedagogical material, and minority-language research. Compared with traditional corpora, web pages have several disadvantageous properties that must be taken into account when they are used as data. Not all pages contain usable text, and pages often come in formats such as HTML, so they must be converted into a form that is easier to process. Web pages contain more linguistic errors than traditional corpora, and their text types and subject areas are more numerous than those of traditional corpora. Collecting material from web pages requires efficient software tools. The most common of these are commercial web search engines, which provide fast access to a large number of diverse pages. In addition, tools developed specifically for linguistic needs can be used. This thesis presents the software tools WebCorp, WebAsCorpus.org, BootCaT and the Web as Corpus Toolkit, with which material can be retrieved from web pages specifically for linguistic purposes.