8 resultados para SVD

em Queensland University of Technology - ePrints Archive


Relevância:

10.00% 10.00%

Publicador:

Resumo:

This article explores two matrix methods to induce the ``shades of meaning" (SoM) of a word. A matrix representation of a word is computed from a corpus of traces based on the given word. Non-negative Matrix Factorisation (NMF) and Singular Value Decomposition (SVD) compute a set of vectors corresponding to a potential shade of meaning. The two methods were evaluated based on loss of conditional entropy with respect to two sets of manually tagged data. One set reflects concepts generally appearing in text, and the second set comprises words used for investigations into word sense disambiguation. Results show that for NMF consistently outperforms SVD for inducing both SoM of general concepts as well as word senses. The problem of inducing the shades of meaning of a word is more subtle than that of word sense induction and hence relevant to thematic analysis of opinion where nuances of opinion can arise.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The computation of compact and meaningful representations of high dimensional sensor data has recently been addressed through the development of Nonlinear Dimensional Reduction (NLDR) algorithms. The numerical implementation of spectral NLDR techniques typically leads to a symmetric eigenvalue problem that is solved by traditional batch eigensolution algorithms. The application of such algorithms in real-time systems necessitates the development of sequential algorithms that perform feature extraction online. This paper presents an efficient online NLDR scheme, Sequential-Isomap, based on incremental singular value decomposition (SVD) and the Isomap method. Example simulations demonstrate the validity and significant potential of this technique in real-time applications such as autonomous systems.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

The aim of this paper is to provide a comparison of various algorithms and parameters to build reduced semantic spaces. The effect of dimension reduction, the stability of the representation and the effect of word order are examined in the context of the five algorithms bearing on semantic vectors: Random projection (RP), singular value decom- position (SVD), non-negative matrix factorization (NMF), permutations and holographic reduced representations (HRR). The quality of semantic representation was tested by means of synonym finding task using the TOEFL test on the TASA corpus. Dimension reduction was found to improve the quality of semantic representation but it is hard to find the optimal parameter settings. Even though dimension reduction by RP was found to be more generally applicable than SVD, the semantic vectors produced by RP are somewhat unstable. The effect of encoding word order into the semantic vector representation via HRR did not lead to any increase in scores over vectors constructed from word co-occurrence in context information. In this regard, very small context windows resulted in better semantic vectors for the TOEFL test.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Internet chatrooms are common means of interaction and communications, and they carry valuable information about formal or ad-hoc formation of groups with diverse objectives. This work presents a fully automated surveillance system for data collection and analysis in Internet chatrooms. The system has two components: First, it has an eavesdropping tool which collects statistics on individual (chatter) and chatroom behavior. This data can be used to profile a chatroom and its chatters. Second, it has a computational discovery algorithm based on Singular Value Decomposition (SVD) to locate hidden communities and communication patterns within a chatroom. The eavesdropping tool is used for fine tuning the SVD-based discovery algorithm which can be deployed in real-time and requires no semantic information processing. The evaluation of the system on real data shows that (i) statistical properties of different chatrooms vary significantly, thus profiling is possible, (ii) SVD-based algorithm has up to 70-80% accuracy to discover groups of chatters.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

This work identifies the limitations of n-way data analysis techniques in multidimensional stream data, such as Internet chat room communications data, and establishes a link between data collection and performance of these techniques. Its contributions are twofold. First, it extends data analysis to multiple dimensions by constructing n-way data arrays known as high order tensors. Chat room tensors are generated by a simulator which collects and models actual communication data. The accuracy of the model is determined by the Kolmogorov-Smirnov goodness-of-fit test which compares the simulation data with the observed (real) data. Second, a detailed computational comparison is performed to test several data analysis techniques including svd [1], and multi-way techniques including Tucker1, Tucker3 [2], and Parafac [3].

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Description of a patient's injuries is recorded in narrative text form by hospital emergency departments. For statistical reporting, this text data needs to be mapped to pre-defined codes. Existing research in this field uses the Naïve Bayes probabilistic method to build classifiers for mapping. In this paper, we focus on providing guidance on the selection of a classification method. We build a number of classifiers belonging to different classification families such as decision tree, probabilistic, neural networks, and instance-based, ensemble-based and kernel-based linear classifiers. An extensive pre-processing is carried out to ensure the quality of data and, in hence, the quality classification outcome. The records with a null entry in injury description are removed. The misspelling correction process is carried out by finding and replacing the misspelt word with a soundlike word. Meaningful phrases have been identified and kept, instead of removing the part of phrase as a stop word. The abbreviations appearing in many forms of entry are manually identified and only one form of abbreviations is used. Clustering is utilised to discriminate between non-frequent and frequent terms. This process reduced the number of text features dramatically from about 28,000 to 5000. The medical narrative text injury dataset, under consideration, is composed of many short documents. The data can be characterized as high-dimensional and sparse, i.e., few features are irrelevant but features are correlated with one another. Therefore, Matrix factorization techniques such as Singular Value Decomposition (SVD) and Non Negative Matrix Factorization (NNMF) have been used to map the processed feature space to a lower-dimensional feature space. Classifiers with these reduced feature space have been built. In experiments, a set of tests are conducted to reflect which classification method is best for the medical text classification. The Non Negative Matrix Factorization with Support Vector Machine method can achieve 93% precision which is higher than all the tested traditional classifiers. We also found that TF/IDF weighting which works well for long text classification is inferior to binary weighting in short document classification. Another finding is that the Top-n terms should be removed in consultation with medical experts, as it affects the classification performance.