148 resultados para Natural language techniques, Semantic spaces, Random projection, Documents
em Queensland University of Technology - ePrints Archive
Resumo:
The aim of this paper is to provide a comparison of various algorithms and parameters to build reduced semantic spaces. The effect of dimension reduction, the stability of the representation and the effect of word order are examined in the context of the five algorithms bearing on semantic vectors: Random projection (RP), singular value decom- position (SVD), non-negative matrix factorization (NMF), permutations and holographic reduced representations (HRR). The quality of semantic representation was tested by means of synonym finding task using the TOEFL test on the TASA corpus. Dimension reduction was found to improve the quality of semantic representation but it is hard to find the optimal parameter settings. Even though dimension reduction by RP was found to be more generally applicable than SVD, the semantic vectors produced by RP are somewhat unstable. The effect of encoding word order into the semantic vector representation via HRR did not lead to any increase in scores over vectors constructed from word co-occurrence in context information. In this regard, very small context windows resulted in better semantic vectors for the TOEFL test.
Resumo:
We present a text watermarking scheme that embeds a bitstream watermark Wi in a text document P preserving the meaning, context, and flow of the document. The document is viewed as a set of paragraphs, each paragraph being a set of sentences. The sequence of paragraphs and sentences used to embed watermark bits is permuted using a secret key. Then, English language sentence transformations are used to modify sentence lengths, thus embedding watermarking bits in the Least Significant Bits (LSB) of the sentences’ cardinalities. The embedding and extracting algorithms are public, while the secrecy and security of the watermark depends on a secret key K. The probability of False Positives is extremely small, hence avoiding incidental occurrences of our watermark in random text documents. Majority voting provides security against text addition, deletion, and swapping attacks, further reducing the probability of False Positives. The scheme is secure against the general attacks on text watermarks such as reproduction (photocopying, FAX), reformatting, synonym substitution, text addition, text deletion, text swapping, paragraph shuffling and collusion attacks.
Resumo:
Semantic space models of word meaning derived from co-occurrence statistics within a corpus of documents, such as the Hyperspace Analogous to Language (HAL) model, have been proposed in the past. While word similarity can be computed using these models, it is not clear how semantic spaces derived from different sets of documents can be compared. In this paper, we focus on this problem, and we revisit the proposal of using semantic subspace distance measurements [1]. In particular, we outline the research questions that still need to be addressed to investigate and validate these distance measures. Then, we describe our plans for future research.
Resumo:
This paper presents a symbolic navigation system that uses spatial language descriptions to inform goal-directed exploration in unfamiliar office environments. An abstract map is created from a collection of natural language phrases describing the spatial layout of the environment. The spatial representation in the abstract map is controlled by a constraint based interpretation of each natural language phrase. In goal-directed exploration of an unseen office environment, the robot links the information in the abstract map to observed symbolic information and its grounded world representation. This paper demonstrates the ability of the system, in both simulated and real-world trials, to efficiently find target rooms in environments that it has never been to previously. In three unexplored environments, it is shown that on average the system travels only 8.42% further than the optimal path when using only natural language phrases to complete navigation tasks.
Resumo:
This thesis introduces the problem of conceptual ambiguity, or Shades of Meaning (SoM) that can exist around a term or entity. As an example consider President Ronald Reagan the ex-president of the USA, there are many aspects to him that are captured in text; the Russian missile deal, the Iran-contra deal and others. Simply finding documents with the word “Reagan” in them is going to return results that cover many different shades of meaning related to "Reagan". Instead it may be desirable to retrieve results around a specific shade of meaning of "Reagan", e.g., all documents relating to the Iran-contra scandal. This thesis investigates computational methods for identifying shades of meaning around a word, or concept. This problem is related to word sense ambiguity, but is more subtle and based less on the particular syntactic structures associated with or around an instance of the term and more with the semantic contexts around it. A particularly noteworthy difference from typical word sense disambiguation is that shades of a concept are not known in advance. It is up to the algorithm itself to ascertain these subtleties. It is the key hypothesis of this thesis that reducing the number of dimensions in the representation of concepts is a key part of reducing sparseness and thus also crucial in discovering their SoMwithin a given corpus.
Resumo:
Process Modeling is a widely used concept for understanding, documenting and also redesigning the operations of organizations. The validation and usage of process models is however affected by the fact that only business analysts fully understand them in detail. This is in particular a problem because they are typically not domain experts. In this paper, we investigate in how far the concept of verbalization can be adapted from object-role modeling to process models. To this end, we define an approach which automatically transforms BPMN process models into natural language texts and combines different techniques from linguistics and graph decomposition in a flexible and accurate manner. The evaluation of the technique is based on a prototypical implementation and involves a test set of 53 BPMN process models showing that natural language texts can be generated in a reliable fashion.
Resumo:
Semantic Space models, which provide a numerical representation of words’ meaning extracted from corpus of documents, have been formalized in terms of Hermitian operators over real valued Hilbert spaces by Bruza et al. [1]. The collapse of a word into a particular meaning has been investigated applying the notion of quantum collapse of superpositional states [2]. While the semantic association between words in a Semantic Space can be computed by means of the Minkowski distance [3] or the cosine of the angle between the vector representation of each pair of words, a new procedure is needed in order to establish relations between two or more Semantic Spaces. We address the question: how can the distance between different Semantic Spaces be computed? By representing each Semantic Space as a subspace of a more general Hilbert space, the relationship between Semantic Spaces can be computed by means of the subspace distance. Such distance needs to take into account the difference in the dimensions between subspaces. The availability of a distance for comparing different Semantic Subspaces would enable to achieve a deeper understanding about the geometry of Semantic Spaces which would possibly translate into better effectiveness in Information Retrieval tasks.
Resumo:
The design and development of process-aware information systems is often supported by specifying requirements as business process models. Although this approach is generally accepted as an effective strategy, it remains a fundamental challenge to adequately validate these models given the diverging skill set of domain experts and system analysts. As domain experts often do not feel confident in judging the correctness and completeness of process models that system analysts create, the validation often has to regress to a discourse using natural language. In order to support such a discourse appropriately, so-called verbalization techniques have been defined for different types of conceptual models. However, there is currently no sophisticated technique available that is capable of generating natural-looking text from process models. In this paper, we address this research gap and propose a technique for generating natural language texts from business process models. A comparison with manually created process descriptions demonstrates that the generated texts are superior in terms of completeness, structure, and linguistic complexity. An evaluation with users further demonstrates that the texts are very understandable and effectively allow the reader to infer the process model semantics. Hence, the generated texts represent a useful input for process model validation.
Resumo:
Background Cancer monitoring and prevention relies on the critical aspect of timely notification of cancer cases. However, the abstraction and classification of cancer from the free-text of pathology reports and other relevant documents, such as death certificates, exist as complex and time-consuming activities. Aims In this paper, approaches for the automatic detection of notifiable cancer cases as the cause of death from free-text death certificates supplied to Cancer Registries are investigated. Method A number of machine learning classifiers were studied. Features were extracted using natural language techniques and the Medtex toolkit. The numerous features encompassed stemmed words, bi-grams, and concepts from the SNOMED CT medical terminology. The baseline consisted of a keyword spotter using keywords extracted from the long description of ICD-10 cancer related codes. Results Death certificates with notifiable cancer listed as the cause of death can be effectively identified with the methods studied in this paper. A Support Vector Machine (SVM) classifier achieved best performance with an overall F-measure of 0.9866 when evaluated on a set of 5,000 free-text death certificates using the token stem feature set. The SNOMED CT concept plus token stem feature set reached the lowest variance (0.0032) and false negative rate (0.0297) while achieving an F-measure of 0.9864. The SVM classifier accounts for the first 18 of the top 40 evaluated runs, and entails the most robust classifier with a variance of 0.001141, half the variance of the other classifiers. Conclusion The selection of features significantly produced the most influences on the performance of the classifiers, although the type of classifier employed also affects performance. In contrast, the feature weighting schema created a negligible effect on performance. Specifically, it is found that stemmed tokens with or without SNOMED CT concepts create the most effective feature when combined with an SVM classifier.
Resumo:
The Australian e-Health Research Centre (AEHRC) recently participated in the ShARe/CLEF eHealth Evaluation Lab Task 1. The goal of this task is to individuate mentions of disorders in free-text electronic health records and map disorders to SNOMED CT concepts in the UMLS metathesaurus. This paper details our participation to this ShARe/CLEF task. Our approaches are based on using the clinical natural language processing tool Metamap and Conditional Random Fields (CRF) to individuate mentions of disorders and then to map those to SNOMED CT concepts. Empirical results obtained on the 2013 ShARe/CLEF task highlight that our instance of Metamap (after ltering irrelevant semantic types), although achieving a high level of precision, is only able to identify a small amount of disorders (about 21% to 28%) from free-text health records. On the other hand, the addition of the CRF models allows for a much higher recall (57% to 79%) of disorders from free-text, without sensible detriment in precision. When evaluating the accuracy of the mapping of disorders to SNOMED CT concepts in the UMLS, we observe that the mapping obtained by our ltered instance of Metamap delivers state-of-the-art e ectiveness if only spans individuated by our system are considered (`relaxed' accuracy).
Resumo:
In computational linguistics, information retrieval and applied cognition, words and concepts are often represented as vectors in high dimensional spaces computed from a corpus of text. These high dimensional spaces are often referred to as Semantic Spaces. We describe a novel and efficient approach to computing these semantic spaces via the use of complex valued vector representations. We report on the practical implementation of the proposed method and some associated experiments. We also briefly discuss how the proposed system relates to previous theoretical work in Information Retrieval and Quantum Mechanics and how the notions of probability, logic and geometry are integrated within a single Hilbert space representation. In this sense the proposed system has more general application and gives rise to a variety of opportunities for future research.
Resumo:
With the growing number of XML documents on theWeb it becomes essential to effectively organise these XML documents in order to retrieve useful information from them. A possible solution is to apply clustering on the XML documents to discover knowledge that promotes effective data management, information retrieval and query processing. However, many issues arise in discovering knowledge from these types of semi-structured documents due to their heterogeneity and structural irregularity. Most of the existing research on clustering techniques focuses only on one feature of the XML documents, this being either their structure or their content due to scalability and complexity problems. The knowledge gained in the form of clusters based on the structure or the content is not suitable for reallife datasets. It therefore becomes essential to include both the structure and content of XML documents in order to improve the accuracy and meaning of the clustering solution. However, the inclusion of both these kinds of information in the clustering process results in a huge overhead for the underlying clustering algorithm because of the high dimensionality of the data. The overall objective of this thesis is to address these issues by: (1) proposing methods to utilise frequent pattern mining techniques to reduce the dimension; (2) developing models to effectively combine the structure and content of XML documents; and (3) utilising the proposed models in clustering. This research first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. A clustering framework with two types of models, implicit and explicit, is developed. The implicit model uses a Vector Space Model (VSM) to combine the structure and the content information. The explicit model uses a higher order model, namely a 3- order Tensor Space Model (TSM), to explicitly combine the structure and the content information. This thesis also proposes a novel incremental technique to decompose largesized tensor models to utilise the decomposed solution for clustering the XML documents. The proposed framework and its components were extensively evaluated on several real-life datasets exhibiting extreme characteristics to understand the usefulness of the proposed framework in real-life situations. Additionally, this research evaluates the outcome of the clustering process on the collection selection problem in the information retrieval on the Wikipedia dataset. The experimental results demonstrate that the proposed frequent pattern mining and clustering methods outperform the related state-of-the-art approaches. In particular, the proposed framework of utilising frequent structures for constraining the content shows an improvement in accuracy over content-only and structure-only clustering results. The scalability evaluation experiments conducted on large scaled datasets clearly show the strengths of the proposed methods over state-of-the-art methods. In particular, this thesis work contributes to effectively combining the structure and the content of XML documents for clustering, in order to improve the accuracy of the clustering solution. In addition, it also contributes by addressing the research gaps in frequent pattern mining to generate efficient and concise frequent subtrees with various node relationships that could be used in clustering.
Resumo:
This paper outlines a novel approach for modelling semantic relationships within medical documents. Medical terminologies contain a rich source of semantic information critical to a number of techniques in medical informatics, including medical information retrieval. Recent research suggests that corpus-driven approaches are effective at automatically capturing semantic similarities between medical concepts, thus making them an attractive option for accessing semantic information. Most previous corpus-driven methods only considered syntagmatic associations. In this paper, we adapt a recent approach that explicitly models both syntagmatic and paradigmatic associations. We show that the implicit similarity between certain medical concepts can only be modelled using paradigmatic associations. In addition, the inclusion of both types of associations overcomes the sensitivity to the training corpus experienced by previous approaches, making our method both more effective and more robust. This finding may have implications for researchers in the area of medical information retrieval.