39 results for Bag-of-words
Abstract:
In most previous research on distributional semantics, Vector Space Models (VSMs) of words are built either from topical information (e.g., documents in which a word is present), or from syntactic/semantic types of words (e.g., dependency parse links of a word in sentences), but not both. In this paper, we explore the utility of combining these two representations to build VSMs for the task of semantic composition of adjective-noun phrases. Through extensive experiments on benchmark datasets, we find that even though a type-based VSM is effective for semantic composition, it is often outperformed by a VSM built using a combination of topic- and type-based statistics. We also introduce a new evaluation task wherein we predict the composed vector representation of a phrase from the brain activity of a human subject reading that phrase. We exploit a large syntactically parsed corpus of 16 billion tokens to build our VSMs, with vectors for both phrases and words, and make them publicly available.
Abstract:
We consider the problem of segmenting text documents that have a two-part structure such as a problem part and a solution part. Documents of this genre include incident reports that typically involve description of events relating to a problem followed by those pertaining to the solution that was tried. Segmenting such documents into the component two parts would render them usable in knowledge reuse frameworks such as Case-Based Reasoning. This segmentation problem presents a hard case for traditional text segmentation due to the lexical inter-relatedness of the segments. We develop a two-part segmentation technique that can harness a corpus of similar documents to model the behavior of the two segments and their inter-relatedness using language models and translation models respectively. In particular, we use separate language models for the problem and solution segment types, whereas the inter-relatedness between segment types is modeled using an IBM Model 1 translation model. We model documents as being generated starting from the problem part that comprises words sampled from the problem language model, followed by the solution part whose words are sampled either from the solution language model or from a translation model conditioned on the words already chosen in the problem part. We show, through an extensive set of experiments on real-world data, that our approach outperforms the state-of-the-art text segmentation algorithms in the accuracy of segmentation, and that such improved accuracy translates well to improved usability in Case-Based Reasoning systems. We also analyze the robustness of our technique to varying amounts and types of noise and empirically illustrate that our technique is quite noise tolerant, and degrades gracefully with increasing amounts of noise.
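The generative story in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the unigram language models, the translation table, the mixture weight `lam`, the probability floor, and the function name `best_split` are all assumptions introduced here for illustration. The sketch scores every candidate boundary by the log-likelihood of the problem words under a problem language model, plus the log-likelihood of each solution word under a mixture of a solution language model and an IBM Model 1 style translation probability averaged over the problem words.

```python
import math


def best_split(sentences, p_lm, s_lm, trans, lam=0.5, floor=1e-6):
    """Return k so that sentences[:k] is the problem part and
    sentences[k:] the solution part with the highest log-likelihood.

    p_lm, s_lm: hypothetical unigram models (word -> probability).
    trans: hypothetical table mapping (solution_word, problem_word)
           to a translation probability.
    lam: assumed mixture weight of the solution language model.
    """
    best_k, best_ll = None, -math.inf
    for k in range(1, len(sentences)):
        problem = [w for s in sentences[:k] for w in s]
        solution = [w for s in sentences[k:] for w in s]
        # Problem words are drawn from the problem language model.
        ll = sum(math.log(p_lm.get(w, floor)) for w in problem)
        for w in solution:
            # IBM Model 1 style term: average the translation
            # probability of w over all words in the problem part.
            t = sum(trans.get((w, p), floor) for p in problem) / len(problem)
            ll += math.log(lam * s_lm.get(w, floor) + (1 - lam) * t)
        if ll > best_ll:
            best_k, best_ll = k, ll
    return best_k


# Toy data, invented for this example only.
doc = [["printer", "jams"], ["paper", "stuck"],
       ["replace", "roller"], ["restart", "printer"]]
p_lm = {"printer": 0.3, "jams": 0.3, "paper": 0.2, "stuck": 0.2}
s_lm = {"replace": 0.3, "roller": 0.3, "restart": 0.3, "printer": 0.1}
trans = {("roller", "jams"): 0.5, ("restart", "printer"): 0.3}
split_index = best_split(doc, p_lm, s_lm, trans)
```

In a realistic setting the three models would be estimated from a corpus of similar documents, as the abstract describes; the exhaustive scan over boundaries is kept here only because it is the simplest way to show how the two model types interact.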
Abstract:
Background: Search filters are combinations of words and phrases designed to retrieve an optimal set of records on a particular topic (subject filters) or study design (methodological filters). Information specialists are increasingly turning to reusable filters to focus their searches. However, the extent of the academic literature on search filters is unknown. We provide a broad overview of the academic literature on search filters.
Objectives: To map the academic literature on search filters from 2004 to 2015 using a novel form of content analysis.
Methods: We conducted a comprehensive search for literature between 2004 and 2015 across eight databases using a subjectively derived search strategy. We identified key words from titles, grouped them into categories, and examined their frequency and co-occurrences.
Results: The majority of records were housed in Embase (n = 178) and MEDLINE (n = 154). Over the last decade, both databases appeared to exhibit a bimodal distribution with the number of publications on search filters rising until 2006, before dipping in 2007, and steadily increasing until 2012. Few articles appeared in social science databases over the same time frame (e.g. Social Services Abstracts, n = 3).
Unsurprisingly, the term ‘search’ appeared in most titles, and was quite often used as a noun adjunct for the words ‘filter’ and ‘strategy’. Across the papers, the purpose of searches as a means of ‘identifying’ information and gathering ‘evidence’ from ‘databases’ emerged quite strongly. Other terms relating to the methodological assessment of search filters, such as precision and validation, also appeared, albeit less frequently.
Conclusions: Our findings show surprising commonality across the papers with regard to the literature on search filters. Much of the literature seems to be focused on developing search filters to identify and retrieve information, as opposed to testing or validating such filters. Furthermore, the literature is mostly housed in health-related databases, namely MEDLINE, CINAHL, and Embase, implying that it is medically driven. Relatively few papers focus on the use of search filters in the social sciences.
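The title-based content analysis described in the Methods can be sketched roughly as follows. This is only an illustration under stated assumptions: the stopword list, the tokenization rule, and the function names `title_keywords` and `count_terms` are hypothetical, and the manual grouping of key words into categories that the abstract mentions is omitted.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stopword list; the study's actual exclusions are not given.
STOPWORDS = {"a", "an", "the", "of", "for", "in", "to", "and", "on"}


def title_keywords(title):
    """Lowercase a title and return its distinct non-stopword terms, sorted."""
    words = re.findall(r"[a-z]+", title.lower())
    return sorted({w for w in words if w not in STOPWORDS})


def count_terms(titles):
    """Count term frequency and pairwise co-occurrence across titles."""
    freq, cooc = Counter(), Counter()
    for title in titles:
        kws = title_keywords(title)
        freq.update(kws)
        # Sorted keywords give each pair one canonical (a, b) order.
        cooc.update(combinations(kws, 2))
    return freq, cooc


# Invented example titles, not records from the study.
titles = [
    "Search filters for MEDLINE",
    "Validation of a search strategy",
    "Search filter precision in Embase",
]
freq, cooc = count_terms(titles)
```

Each term is counted once per title, so `freq` gives the number of titles containing a term and `cooc` the number of titles in which two terms appear together.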
Abstract:
This paper addresses the problem of colorectal tumour segmentation in complex real-world imagery. For efficient segmentation, a multi-scale strategy is developed for extracting the potentially cancerous region of interest (ROI) based on colour histograms while searching for the best texture resolution. To achieve better segmentation accuracy, we apply a novel bag-of-visual-words method based on rotation-invariant raw statistical features and random-projection-based l2-norm sparse representation to classify tumour areas in histopathology images. Experimental results on 20 real-world digital slides demonstrate that the proposed algorithm results in better recognition accuracy than several state-of-the-art segmentation techniques.
Abstract:
The analytic advantages of central concepts from linguistics and information theory, and the analogies demonstrated between them, for understanding patterns of retrieval from full-text indexes to documents are developed. The interaction between the syntagm and the paradigm in computational operations on written language in indexing, searching, and retrieval is used to account for transformations of the signified or meaning between documents and their representation and between queries and documents retrieved. Characteristics of the message, and messages for selection for written language, are brought to explain the relative frequency of occurrence of words and multiple word sequences in documents. The examples given in the companion article are revisited and a fuller example introduced. The signified of the sequence stood for, the term classically used in the definitions of the sign, as something standing for something else, can itself change rapidly according to its syntagm. A greater than ordinary discourse understanding of patterns in retrieval is obtained.
Abstract:
There is evidence that patients with schizophrenia have impaired explicit memory and intact implicit memory. The present study sought to replicate and extend that of O'Carroll et al. [O'Carroll, R.E., Russell, H.H., Lawrie, S.M. and Johnstone, E.C., 1999. Errorless learning and the cognitive rehabilitation of memory-impaired schizophrenic patients. Psychological Medicine 29, 105-112.] which reported that for memory-impaired patients with schizophrenia performance on a (cued) word recall task is enhanced using errorless learning techniques (in which errors are prevented during learning) compared to errorful learning (the traditional trial-and-error approach). Thirty patients with a DSM-IV diagnosis of schizophrenia and fifteen healthy controls (HC) participated. The Rivermead Behavioural Memory Test was administered and from their scores, the schizophrenic patients were classified as either memory-impaired (MIS), or memory-unimpaired (MUS). During the training phase two lists of words were learned separately, one using the errorless learning approach and the other using an errorful approach. Subjects were then tested for their recall of the words using cued recall. After errorful learning training, performance on word recall for the MIS group was impaired compared to the MUS and HC groups. However, after errorless learning training, no significant differences in performance were found between the three groups. Errorless learning may play an important role in remediation of cognitive deficits for patients with schizophrenia. (c) 2007 Elsevier Ireland Ltd. All rights reserved.
Abstract:
In this paper, we examine the war of words between those who contend that health care practice, including nursing, should primarily be informed by research (the evidence-based practice movement), and those who argue that there should be no restrictions on the sources of knowledge used by practitioners (the postmodernists). We review the postmodernist interventions of Dave Holmes and his colleagues, observing that the postmodernist style to which they adhere, which includes the use of continental philosophy, metaphors, and acerbic delivery, tends to obscure their substantive arguments. The heated nature of some responses to them has tended to have the same effect. However, the substantive arguments are important. Five main postmodernist charges are identified and discussed. The first argument, that the notion of ‘best evidence’ implies a hierarchical and exclusivist approach to knowledge, is persuasive. However, the contention that this hierarchy is maintained by the combined pressures of capitalism and vested interests within academia and the health services, is less well founded. Nevertheless, postmodernist contentions that the hierarchy embraced by the evidence-based practice movement damages health care because it excludes other forms of evidence that are needed to understand the complexity of care, it marginalizes important aspects of clinical knowledge, and it fails to take account of individuals or their experience, are all seen to be of some merit. However, we do not share the postmodernist conclusion that this adds up to a fascist order. Instead, we characterize evidence-based practice as a necessary but not sufficient component of health care knowledge.
Abstract:
Experiments show that for a large corpus, Zipf’s law does not hold for all ranks of words: the frequencies fall below those predicted by Zipf’s law for ranks greater than about 5,000 word types in the English language and about 30,000 word types in the inflected languages Irish and Latin. It also does not hold for syllables or words in the syllable-based languages, Chinese or Vietnamese. However, when single words are combined with word n-grams in one list and put in rank order, the frequency of tokens in the combined list extends Zipf’s law with a slope close to -1 on a log-log plot in all five languages. Further experiments have demonstrated the validity of this extension of Zipf’s law to n-grams of letters, phonemes or binary bits in English. It is shown theoretically that probability theory alone can predict this behavior in randomly created n-grams of binary bits.
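The combined ranking described in this abstract can be sketched as below. The sketch is an assumption-laden illustration, not the authors' procedure: the function names, the maximum n-gram length `max_n`, and the use of an ordinary least-squares fit for the slope are choices made here. It counts single words and word n-grams in one table, ranks all entries by frequency, and estimates the slope of log frequency against log rank.

```python
import math
from collections import Counter


def combined_ngram_counts(tokens, max_n=3):
    """Count all word n-grams for n = 1..max_n in a single table."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def loglog_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    Expects at least two entries; a value near -1 matches Zipf's law.
    """
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Tiny invented token stream; a real test needs a large corpus.
counts = combined_ngram_counts("a b a b a".split(), max_n=2)
```

On a toy input like this the slope is not meaningful; the function is intended for the rank-frequency list of a large corpus, where the abstract reports slopes close to -1.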
Abstract:
The core difficulty in developmental dyslexia across languages is a "phonological deficit", a specific difficulty with the neural representation of the sound structure of words. Recent data across languages suggest that this phonological deficit arises in part from inefficient auditory processing of the rate of change of the amplitude envelope at syllable onset (inefficient sensory processing of rise time). Rise time is a complex percept that also involves changes in duration and perceived intensity. Understanding the neural mechanisms that give rise to the phonological deficit in dyslexia is important for optimising educational interventions. In a three-deviant passive 'oddball' paradigm and a corresponding blocked 'deviant-alone' control condition we recorded ERPs to tones varying in rise time, duration and intensity in children with dyslexia and typically developing children longitudinally. We report here results from test Phases 1 and 2, when participants were aged 8-10 years. We found an MMN to duration, but not to rise time or intensity deviants, at both time points for both groups. For rise time, duration and intensity we found group effects in both the Oddball and Blocked conditions. There was a slower fronto-central P1 response in the dyslexic group compared to controls. The amplitude of the P1 fronto-centrally to tones with slower rise times and lower intensity was smaller compared to tones with sharper rise times and higher intensity in the Oddball condition, for children with dyslexia only. The latency of this ERP component for all three stimuli was shorter on the right compared to the left hemisphere, only for the dyslexic group in the Blocked condition. Furthermore, we found decreased N1c amplitude to tones with slower rise times compared to tones with sharper rise times for children with dyslexia, only in the Oddball condition. Several other effects of stimulus type, age and laterality were also observed. Our data suggest that neuronal responses underlying some aspects of auditory sensory processing may be impaired in dyslexia. © 2011 Elsevier Inc.
Abstract:
In this paper, we introduce an application of matrix factorization to produce corpus-derived, distributional models of semantics that demonstrate cognitive plausibility. We find that word representations learned by Non-Negative Sparse Embedding (NNSE), a variant of matrix factorization, are sparse, effective, and highly interpretable. To the best of our knowledge, this is the first approach which yields semantic representation of words satisfying these three desirable properties. Through extensive experimental evaluations on multiple real-world tasks and datasets, we demonstrate the superiority of semantic models learned by NNSE over other state-of-the-art baselines.
Abstract:
Computational models of meaning trained on naturally occurring text successfully model human performance on tasks involving simple similarity measures, but they characterize meaning in terms of undifferentiated bags of words or topical dimensions. This has led some to question their psychological plausibility (Murphy, 2002; Schunn, 1999). We present here a fully automatic method for extracting a structured and comprehensive set of concept descriptions directly from an English part-of-speech-tagged corpus. Concepts are characterized by weighted properties, enriched with concept-property types that approximate classical relations such as hypernymy and function. Our model outperforms comparable algorithms in cognitive tasks pertaining not only to concept-internal structures (discovering properties of concepts, grouping properties by property type) but also to inter-concept relations (clustering into superordinates), suggesting the empirical validity of the property-based approach. Copyright © 2009 Cognitive Science Society, Inc. All rights reserved.
Abstract:
Both embodied and symbolic accounts of conceptual organization would predict partial sharing and partial differentiation between the neural activations seen for concepts activated via different stimulus modalities. But cross-participant and cross-session variability in BOLD activity patterns makes analyses of such patterns with MVPA methods challenging. Here, we examine the effect of cross-modal and individual variation on the machine learning analysis of fMRI data recorded during a word property generation task. We present the same set of living and non-living concepts (land-mammals, or work tools) to a cohort of Japanese participants in two sessions: the first using auditory presentation of spoken words; the second using visual presentation of words written in Japanese characters. Classification accuracies confirmed that these semantic categories could be detected in single trials, with within-session predictive accuracies of 80-90%. However, cross-session prediction (learning from auditory-task data to classify data from the written-word task, or vice versa) suffered from a performance penalty, achieving 65-75% (still individually significant at p ≪ 0.05). We carried out several follow-on analyses to investigate the reason for this shortfall, concluding that distributional differences in neither time nor space alone could account for it. Rather, combined spatio-temporal patterns of activity need to be identified for successful cross-session learning, and this suggests that feature selection strategies could be modified to take advantage of this.