937 results for Distributional semantics
Abstract:
This thesis makes several contributions towards improved methods for encoding structure in computational models of word meaning. New methods are proposed that make it easy to encode linguistic structural features within a computational representation while retaining the ability to scale to large volumes of textual data. The methods are implemented and evaluated on a range of tasks to demonstrate their effectiveness.
Abstract:
This paper presents our system for the CogALex-IV 2014 shared task of identifying the single word most semantically related to a group of 5 words (queries). Our system uses an implementation of a neural language model and identifies the answer word by finding the word representation most semantically similar to the sum of the query representations. It is a fully unsupervised system trained on around 20% of the UkWaC corpus. Out of 2,000 queries, it identifies 85 exact targets and 285 approximate targets in lists of 5 suggestions.
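The selection step described in this abstract (sum the query representations, then pick the most similar word) can be sketched as follows. This is a minimal illustration, not the authors' system: the toy embedding table, the function names, the use of cosine similarity, and the choice to exclude the query words from the candidates are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_related(queries, embeddings, top_k=5):
    """Rank vocabulary words by similarity to the sum of the query vectors.

    `embeddings` is a hypothetical dict mapping word -> list of floats.
    Query words themselves are excluded from the candidates (an assumption).
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for query in queries:
        for i, value in enumerate(embeddings[query]):
            total[i] += value
    scored = [
        (cosine(total, vector), word)
        for word, vector in embeddings.items()
        if word not in queries
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_k]]


# Toy vectors invented for illustration only.
TOY_EMBEDDINGS = {
    "wheel": [0.9, 0.1, 0.0],
    "engine": [0.8, 0.2, 0.1],
    "road": [0.7, 0.3, 0.0],
    "car": [0.85, 0.2, 0.05],
    "banana": [0.0, 0.1, 0.9],
}

suggestions = most_related(["wheel", "engine", "road"], TOY_EMBEDDINGS, top_k=2)
```

Returning a ranked list rather than a single word mirrors the "lists of 5 suggestions" used for the approximate score; the exact-match score would look only at the first entry.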
Abstract:
[EU] In this work we study the use of distributional semantics and machine learning to improve statistical machine translation. To that end, we propose a machine learning model based on logistic regression to dynamically model the translation probability of phrases. We show that the proposed model is a generalization of the standard translation probabilities of statistical machine translation, and we use it to incorporate contextual and distributional semantic information through lexical features, word clusters and vector representations of words. In addition, we develop another approach to integrating distributional semantic knowledge into statistical machine translation: using bilingual word vector representations to model the similarity of phrase translations. Our experiments show the usefulness of the proposed models, obtaining promising results over a strong baseline system. Likewise, our work makes important contributions regarding bilingual mappings of vector representations and phrase similarity measures based on word vector representations, which have value of their own in the field of distributional semantics beyond machine translation.
Abstract:
Complex numbers are a fundamental aspect of the mathematical formalism of quantum physics. Quantum-like models developed outside physics have often overlooked the role of complex numbers. Specifically, previous models in Information Retrieval (IR) ignored complex numbers. We argue that to advance the use of quantum models of IR, one has to lift the constraint of real-valued representations of the information space, and package more information within the representation by means of complex numbers. As a first attempt, we propose a complex-valued representation for IR that explicitly uses complex-valued Hilbert spaces, in which terms, documents and queries are represented as complex-valued vectors. The proposal consists of integrating distributional semantics evidence within the real component of a term vector, whereas ontological information is encoded in the imaginary component. Our proposal has the merit of lifting the role of complex numbers from a computational byproduct of the model to the very mathematical texture that unifies different levels of semantic information. An empirical instantiation of our proposal is tested in the TREC Medical Record task of retrieving cohorts for clinical studies.
Abstract:
Distributional semantics tries to characterize the meaning of words by the contexts in which they occur. The similarity of words can hence be derived from the similarity of their contexts. The contexts of a word are usually represented as vectors of the words appearing near that word in a corpus. Previous research has observed that similarity measures for the context vectors of two words depend on the frequency of these words. In the present paper we investigate this dependency in more detail for one similarity measure, the Jensen-Shannon divergence. We give an empirical model of this dependency and propose, as an alternative similarity measure, the deviation of the observed Jensen-Shannon divergence from the divergence expected on the basis of the frequencies of the words. We show that this new similarity measure is superior to both the Jensen-Shannon divergence and the cosine similarity in a task in which pairs of words taken from WordNet have to be classified as synonyms or not.
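For reference, the plain Jensen-Shannon divergence between the context distributions of two words can be computed as in the sketch below. It covers only the baseline measure: the abstract does not specify the empirical frequency model used for the expected divergence, so that part is not reproduced. The function names, the use of raw co-occurrence counts as input, and the base-2 logarithm (which bounds the divergence by 1) are assumptions.

```python
import math


def normalize(counts):
    """Turn raw context counts (word -> count) into a probability distribution."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def kl_to_mixture(p, m):
    """Kullback-Leibler divergence KL(p || m) in bits; m must cover the support of p."""
    return sum(pv * math.log2(pv / m[word]) for word, pv in p.items() if pv > 0)


def jensen_shannon(counts_a, counts_b):
    """Jensen-Shannon divergence (base 2) between two context count vectors."""
    p = normalize(counts_a)
    q = normalize(counts_b)
    support = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2 over the joint support.
    m = {word: 0.5 * (p.get(word, 0.0) + q.get(word, 0.0)) for word in support}
    return 0.5 * kl_to_mixture(p, m) + 0.5 * kl_to_mixture(q, m)


# Hypothetical context counts, invented for illustration.
contexts_cat = {"purr": 4, "pet": 3, "food": 3}
contexts_dog = {"bark": 4, "pet": 3, "food": 3}
divergence = jensen_shannon(contexts_cat, contexts_dog)
```

The proposed measure would then be the difference between a value like `divergence` and the value predicted from the two word frequencies alone.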
Abstract:
In this paper, we introduce an application of matrix factorization to produce corpus-derived, distributional models of semantics that demonstrate cognitive plausibility. We find that word representations learned by Non-Negative Sparse Embedding (NNSE), a variant of matrix factorization, are sparse, effective, and highly interpretable. To the best of our knowledge, this is the first approach which yields semantic representations of words satisfying these three desirable properties. Through extensive experimental evaluations on multiple real-world tasks and datasets, we demonstrate the superiority of semantic models learned by NNSE over other state-of-the-art baselines.
Abstract:
In most previous research on distributional semantics, Vector Space Models (VSMs) of words are built either from topical information (e.g., documents in which a word is present), or from syntactic/semantic types of words (e.g., dependency parse links of a word in sentences), but not both. In this paper, we explore the utility of combining these two representations to build VSMs for the task of semantic composition of adjective-noun phrases. Through extensive experiments on benchmark datasets, we find that even though a type-based VSM is effective for semantic composition, it is often outperformed by a VSM built using a combination of topic- and type-based statistics. We also introduce a new evaluation task wherein we predict the composed vector representation of a phrase from the brain activity of a human subject reading that phrase. We exploit a large syntactically parsed corpus of 16 billion tokens to build our VSMs, with vectors for both phrases and words, and make them publicly available.
Abstract:
DIANA is a coordinated project involving the Natural Language Engineering and Pattern Recognition group (ELiRF) of the Universitat Politècnica de València and the Centre de Llenguatge i Computació (CLiC) group of the Universitat de Barcelona. It is a project of the R&D programme (TIN2012-38603) funded by the Ministerio de Economía y Competitividad. Paolo Rosso coordinates the DIANA project and leads the DIANA-Applications subproject, and M. Antònia Martí leads the DIANA-Constructions subproject.
Abstract:
25 p.
Abstract:
This study examined how riverine inputs, in particular sediment, influenced the community structure and trophic composition of reef fishes within Rio Bueno, north Jamaica. Due to river discharge, a distinct gradient of riverine inputs existed across the study sites. Results suggested that riverine inputs (or a factor associated with them) had a structuring effect on the fish community. Whilst fish communities at all sites were dominated by small individuals (
Abstract:
Aim: Species loss has increased significantly over the last 1000 years and is ultimately attributed to the direct and indirect consequences of increased human population growth across the planet. A growing number of species are becoming endangered and require human intervention to prevent their local extirpation or complete extinction. Management strategies aimed at mitigating species loss can benefit greatly from empirical approaches that indicate the rate of decline of a species, providing objective information on the need for immediate conservation actions, e.g. captive breeding; however, such approaches are rarely employed. The current study used a novel method to examine the distributional trends of a model endangered species, the freshwater pearl mussel, Margaritifera margaritifera (L.).
Location: United Kingdom and Republic of Ireland.
Methods: Using species presence data within 10-km grid squares since records began, three-parameter logistic regression curves were fitted to extrapolate an estimated date of regional extinction.
Results: This study has shown that freshwater pearl mussel distribution has contracted since known historical records and outlier populations were lost first. Within the United Kingdom and Republic of Ireland, distribution loss has been greatest in Scotland, Northern Ireland, Wales and England, respectively, with the Republic of Ireland containing the highest relative proportion of M. margaritifera distribution, in 1998.
Main conclusions: This study provides empirical evidence that this species could become extinct throughout countries within the United Kingdom within 170 years under the current trends and emphasizes that regionally specific management strategies need to be implemented to prevent extirpation of this species.
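The extrapolation step in the Methods can be illustrated with a small sketch. Everything specific here is an assumption, since the abstract does not give the curve's parametrization or the extinction criterion: the declining three-parameter form `upper / (1 + exp((year - midpoint) / scale))`, the parameter names, and the rule that "extinction" means the fitted curve falling below a chosen threshold number of occupied grid squares (a logistic curve never reaches zero exactly). The parameter values are invented, and the fitting itself is not shown.

```python
import math


def occupied_squares(year, upper, midpoint, scale):
    """Declining three-parameter logistic curve (assumed form).

    upper    -- asymptotic number of occupied grid squares in the distant past
    midpoint -- year at which occupancy has fallen to upper / 2
    scale    -- positive value controlling how fast the decline happens (years)
    """
    return upper / (1.0 + math.exp((year - midpoint) / scale))


def year_below(threshold, upper, midpoint, scale):
    """Year at which the fitted curve crosses `threshold` occupied squares.

    Obtained by solving occupied_squares(year) == threshold for year.
    """
    if not 0.0 < threshold < upper:
        raise ValueError("threshold must lie strictly between 0 and upper")
    return midpoint + scale * math.log(upper / threshold - 1.0)


# Hypothetical fitted parameters, for illustration only.
UPPER, MIDPOINT, SCALE = 100.0, 1950.0, 30.0
estimated_year = year_below(1.0, UPPER, MIDPOINT, SCALE)
```

With real data, the three parameters would first be estimated from the presence records per region, and the threshold would be a modelling decision that needs to be stated alongside the estimated date.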
Abstract:
Thecamoebians were examined from 123 surface sediment samples collected from 45 lakes in the Greater Toronto Area (GTA) and the surrounding region to i) elucidate the controls on faunal distribution in modern lake environments; and ii) consider the utility of thecamoebians in quantitative studies of water quality change. This area was chosen because it includes a high density of lakes that are threatened by urban development and where water quality has deteriorated locally as a result of contaminant inputs, particularly nutrients. Canonical Correspondence Analysis (CCA) and a series of partial CCAs were used to examine species-environment relationships. Twenty-four environmental variables were considered, including water properties (e.g. pH, DO, conductivity), substrate characteristics, nutrient loading, and environmentally available metals. The thecamoebian assemblages showed a strong association with Olsen's Phosphorus, reflecting the eutrophic status of many of the lakes, and locally with elevated conductivity measurements, which appear to reflect road salt inputs associated with winter de-icing operations. A transfer function was developed for Olsen P using this training set, based on weighted averaging with inverse deshrinking (WA Inv). The model was applied to infer past changes in Phosphorus enrichment in core samples from several lakes, including eutrophic Haynes Lake within the GTA. Thecamoebian-inferred changes in sedimentary Phosphorus from a 210Pb-dated core from Haynes Lake are related to i) widespread introduction of chemical fertilizers to agricultural land in the post-WWII era; ii) a steep decline in Phosphorus with a change in agricultural practices in the late 1970s; and iii) the construction of a golf course in close proximity to the lake in the early 1990s. This preliminary study confirms that thecamoebians have considerable potential as indicators of eutrophication in lakes and can provide an estimate of baseline conditions.
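A weighted-averaging transfer function with inverse deshrinking, as named in this abstract, can be sketched as below. This follows the usual textbook formulation (taxon optima as abundance-weighted means, sample estimates as optimum-weighted means, then a linear regression of observed on initially inferred values); it is not the authors' code. The function names and the toy training set are assumptions, and the sketch assumes every taxon occurs in at least one training sample.

```python
def wa_optima(abundances, env):
    """Abundance-weighted mean of the environmental variable for each taxon.

    abundances -- list of samples, each a list of taxon abundances
    env        -- observed environmental value (e.g. Olsen P) per sample
    """
    n_taxa = len(abundances[0])
    optima = []
    for k in range(n_taxa):
        weight = sum(row[k] for row in abundances)
        optima.append(sum(row[k] * x for row, x in zip(abundances, env)) / weight)
    return optima


def wa_initial(row, optima):
    """Initial (shrunken) estimate for one sample: optimum-weighted mean."""
    return sum(y * u for y, u in zip(row, optima)) / sum(row)


def fit_wa_inverse(abundances, env):
    """Fit optima plus the inverse deshrinking line: observed = b0 + b1 * initial."""
    optima = wa_optima(abundances, env)
    initial = [wa_initial(row, optima) for row in abundances]
    n = len(env)
    mean_init = sum(initial) / n
    mean_env = sum(env) / n
    sxx = sum((a - mean_init) ** 2 for a in initial)
    sxy = sum((a - mean_init) * (b - mean_env) for a, b in zip(initial, env))
    slope = sxy / sxx
    intercept = mean_env - slope * mean_init
    return optima, intercept, slope


def predict(row, model):
    """Deshrunken estimate of the environmental variable for a new sample."""
    optima, intercept, slope = model
    return intercept + slope * wa_initial(row, optima)


# Toy training set (3 samples x 2 taxa), invented for illustration.
TRAIN_ABUNDANCES = [[10, 0], [5, 5], [0, 10]]
TRAIN_ENV = [10.0, 20.0, 30.0]
model = fit_wa_inverse(TRAIN_ABUNDANCES, TRAIN_ENV)
```

In this toy set the initial estimates are pulled towards the mean (about 13.3, 20 and 26.7), and the deshrinking regression stretches them back towards the observed range, which is the reason the second step exists.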