871 resultados para Latent Dirichlet Allocation
Resumo:
This thesis addressed issues that have prevented qualitative researchers from using thematic discovery algorithms. The central hypothesis evaluated whether allowing qualitative researchers to interact with thematic discovery algorithms and incorporate domain knowledge improved their ability to address research questions and trust the derived themes. Non-negative Matrix Factorisation and Latent Dirichlet Allocation find latent themes within document collections but these algorithms are rarely used, because qualitative researchers do not trust and cannot interact with the themes that are automatically generated. The research determined the types of interactivity that qualitative researchers require and then evaluated interactive algorithms that matched these requirements. Theoretical contributions included the articulation of design guidelines for interactive thematic discovery algorithms, the development of an Evaluation Model and a Conceptual Framework for Interactive Content Analysis.
Resumo:
Local spatio-temporal features with a Bag-of-visual words model is a popular approach used in human action recognition. Bag-of-features methods suffer from several challenges such as extracting appropriate appearance and motion features from videos, converting extracted features appropriate for classification and designing a suitable classification framework. In this paper we address the problem of efficiently representing the extracted features for classification to improve the overall performance. We introduce two generative supervised topic models, maximum entropy discrimination LDA (MedLDA) and class- specific simplex LDA (css-LDA), to encode the raw features suitable for discriminative SVM based classification. Unsupervised LDA models disconnect topic discovery from the classification task, hence yield poor results compared to the baseline Bag-of-words framework. On the other hand supervised LDA techniques learn the topic structure by considering the class labels and improve the recognition accuracy significantly. MedLDA maximizes likelihood and within class margins using max-margin techniques and yields a sparse highly discriminative topic structure; while in css-LDA separate class specific topics are learned instead of common set of topics across the entire dataset. In our representation first topics are learned and then each video is represented as a topic proportion vector, i.e. it can be comparable to a histogram of topics. Finally SVM classification is done on the learned topic proportion vector. We demonstrate the efficiency of the above two representation techniques through the experiments carried out in two popular datasets. Experimental results demonstrate significantly improved performance compared to the baseline Bag-of-features framework which uses kmeans to construct histogram of words from the feature vectors.
Resumo:
It is important to identify the ``correct'' number of topics in mechanisms like Latent Dirichlet Allocation(LDA) as they determine the quality of features that are presented as features for classifiers like SVM. In this work we propose a measure to identify the correct number of topics and offer empirical evidence in its favor in terms of classification accuracy and the number of topics that are naturally present in the corpus. We show the merit of the measure by applying it on real-world as well as synthetic data sets(both text and images). In proposing this measure, we view LDA as a matrix factorization mechanism, wherein a given corpus C is split into two matrix factors M-1 and M-2 as given by C-d*w = M1(d*t) x Q(t*w).Where d is the number of documents present in the corpus anti w is the size of the vocabulary. The quality of the split depends on ``t'', the right number of topics chosen. The measure is computed in terms of symmetric KL-Divergence of salient distributions that are derived from these matrix factors. We observe that the divergence values are higher for non-optimal number of topics - this is shown by a `dip' at the right value for `t'.
Resumo:
Song-selection and mood are interdependent. If we capture a song’s sentiment, we can determine the mood of the listener, which can serve as a basis for recommendation systems. Songs are generally classified according to genres, which don’t entirely reflect sentiments. Thus, we require an unsupervised scheme to mine them. Sentiments are classified into either two (positive/negative) or multiple (happy/angry/sad/...) classes, depending on the application. We are interested in analyzing the feelings invoked by a song, involving multi-class sentiments. To mine the hidden sentimental structure behind a song, in terms of “topics”, we consider its lyrics and use Latent Dirichlet Allocation (LDA). Each song is a mixture of moods. Topics mined by LDA can represent moods. Thus we get a scheme of collecting similar-mood songs. For validation, we use a dataset of songs containing 6 moods annotated by users of a particular website.
Resumo:
In this paper, we present a novel approach that makes use of topic models based on Latent Dirichlet allocation(LDA) for generating single document summaries. Our approach is distinguished from other LDA based approaches in that we identify the summary topics which best describe a given document and only extract sentences from those paragraphs within the document which are highly correlated given the summary topics. This ensures that our summaries always highlight the crux of the document without paying any attention to the grammar and the structure of the documents. Finally, we evaluate our summaries on the DUC 2002 Single document summarization data corpus using ROUGE measures. Our summaries had higher ROUGE values and better semantic similarity with the documents than the DUC summaries.
Resumo:
Scene classification based on latent Dirichlet allocation (LDA) is a more general modeling method known as a bag of visual words, in which the construction of a visual vocabulary is a crucial quantization process to ensure success of the classification. A framework is developed using the following new aspects: Gaussian mixture clustering for the quantization process, the use of an integrated visual vocabulary (IVV), which is built as the union of all centroids obtained from the separate quantization process of each class, and the usage of some features, including edge orientation histogram, CIELab color moments, and gray-level co-occurrence matrix (GLCM). The experiments are conducted on IKONOS images with six semantic classes (tree, grassland, residential, commercial/industrial, road, and water). The results show that the use of an IVV increases the overall accuracy (OA) by 11 to 12% and 6% when it is implemented on the selected and all features, respectively. The selected features of CIELab color moments and GLCM provide a better OA than the implementation over CIELab color moment or GLCM as individuals. The latter increases the OA by only ∼2 to 3%. Moreover, the results show that the OA of LDA outperforms the OA of C4.5 and naive Bayes tree by ∼20%. © 2014 Society of Photo-Optical Instrumentation Engineers (SPIE) [DOI: 10.1117/1.JRS.8.083690]
Resumo:
The amount of information contained within the Internet has exploded in recent decades. As more and more news, blogs, and many other kinds of articles that are published on the Internet, categorization of articles and documents are increasingly desired. Among the approaches to categorize articles, labeling is one of the most common method; it provides a relatively intuitive and effective way to separate articles into different categories. However, manual labeling is limited by its efficiency, even thought the labels selected manually have relatively high quality. This report explores the topic modeling approach of Online Latent Dirichlet Allocation (Online-LDA). Additionally, a method to automatically label articles with their latent topics by combining the Online-LDA posterior with a probabilistic automatic labeling algorithm is implemented. The goal of this report is to examine the accuracy of the labels generated automatically by a topic model and probabilistic relevance algorithm for a set of real-world, dynamically updated articles from an online Rich Site Summary (RSS) service.
Resumo:
Constant technology advances have caused data explosion in recent years. Accord- ingly modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This phenomenon is particularly true for an- alyzing biological data. For example DNA sequence data can be viewed as categorical variables with each nucleotide taking four different categories. The gene expression data, depending on the quantitative technology, could be continuous numbers or counts. With the advancement of high-throughput technology, the abundance of such data becomes unprecedentedly rich. Therefore efficient statistical approaches are crucial in this big data era.
Previous statistical methods for big data often aim to find low dimensional struc- tures in the observed data. For example in a factor analysis model a latent Gaussian distributed multivariate vector is assumed. With this assumption a factor model produces a low rank estimation of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents. The mixture pro- portions of topics, represented by a Dirichlet distributed variable, is assumed. This dissertation proposes several novel extensions to the previous statistical methods that are developed to address challenges in big data. Those novel methods are applied in multiple real world applications including construction of condition specific gene co-expression networks, estimating shared topics among newsgroups, analysis of pro- moter sequences, analysis of political-economics risk data and estimating population structure from genotype data.
Resumo:
The ability to detect unusual events in surviellance footage as they happen is a highly desireable feature for a surveillance system. However, this problem remains challenging in crowded scenes due to occlusions and the clustering of people. In this paper, we propose using the Distributed Behavior Model (DBM), which has been widely used in computer graphics, for video event detection. Our approach does not rely on object tracking, and is robust to camera movements. We use sparse coding for classification, and test our approach on various datasets. Our proposed approach outperforms a state-of-the-art work which uses the social force model and Latent Dirichlet Allocation.
Resumo:
This paper elaborates the approach used by the Applied Data Mining Research Group (ADMRG) for the Social Event Detection (SED) Tasks of the 2013 MediaEval Benchmark. We extended the constrained clustering algorithm to apply to the first semi-supervised clustering task, and we compared several classifiers with Latent Dirichlet Allocation as feature selector in the second event classification task. The proposed approach focuses on scalability and efficient memory allocation when applied to a high dimensional data with large clusters. Results of the first task show the effectiveness of the proposed method. Results from task 2 indicate that attention on the imbalance categories distributions is needed.
Resumo:
Topic modelling, such as Latent Dirichlet Allocation (LDA), was proposed to generate statistical models to represent multiple topics in a collection of documents, which has been widely utilized in the fields of machine learning and information retrieval, etc. But its effectiveness in information filtering is rarely known. Patterns are always thought to be more representative than single terms for representing documents. In this paper, a novel information filtering model, Pattern-based Topic Model(PBTM) , is proposed to represent the text documents not only using the topic distributions at general level but also using semantic pattern representations at detailed specific level, both of which contribute to the accurate document representation and document relevance ranking. Extensive experiments are conducted to evaluate the effectiveness of PBTM by using the TREC data collection Reuters Corpus Volume 1. The results show that the proposed model achieves outstanding performance.
Resumo:
Many mature term-based or pattern-based approaches have been used in the field of information filtering to generate users’ information needs from a collection of documents. A fundamental assumption for these approaches is that the documents in the collection are all about one topic. However, in reality users’ interests can be diverse and the documents in the collection often involve multiple topics. Topic modelling, such as Latent Dirichlet Allocation (LDA), was proposed to generate statistical models to represent multiple topics in a collection of documents, and this has been widely utilized in the fields of machine learning and information retrieval, etc. But its effectiveness in information filtering has not been so well explored. Patterns are always thought to be more discriminative than single terms for describing documents. However, the enormous amount of discovered patterns hinder them from being effectively and efficiently used in real applications, therefore, selection of the most discriminative and representative patterns from the huge amount of discovered patterns becomes crucial. To deal with the above mentioned limitations and problems, in this paper, a novel information filtering model, Maximum matched Pattern-based Topic Model (MPBTM), is proposed. The main distinctive features of the proposed model include: (1) user information needs are generated in terms of multiple topics; (2) each topic is represented by patterns; (3) patterns are generated from topic models and are organized in terms of their statistical and taxonomic features, and; (4) the most discriminative and representative patterns, called Maximum Matched Patterns, are proposed to estimate the document relevance to the user’s information needs in order to filter out irrelevant documents. Extensive experiments are conducted to evaluate the effectiveness of the proposed model by using the TREC data collection Reuters Corpus Volume 1. The results show that the proposed model significantly outperforms both state-of-the-art term-based models and pattern-based models
Resumo:
Due to the popularity of security cameras in public places, it is of interest to design an intelligent system that can efficiently detect events automatically. This paper proposes a novel algorithm for multi-person event detection. To ensure greater than real-time performance, features are extracted directly from compressed MPEG video. A novel histogram-based feature descriptor that captures the angles between extracted particle trajectories is proposed, which allows us to capture motion patterns of multi-person events in the video. To alleviate the need for fine-grained annotation, we propose the use of Labelled Latent Dirichlet Allocation, a “weakly supervised” method that allows the use of coarse temporal annotations which are much simpler to obtain. This novel system is able to run at approximately ten times real-time, while preserving state-of-theart detection performance for multi-person events on a 100-hour real-world surveillance dataset (TRECVid SED).
Resumo:
This PhD research has proposed new machine learning techniques to improve human action recognition based on local features. Several novel video representation and classification techniques have been proposed to increase the performance with lower computational complexity. The major contributions are the construction of new feature representation techniques, based on advanced machine learning techniques such as multiple instance dictionary learning, Latent Dirichlet Allocation (LDA) and Sparse coding. A Binary-tree based classification technique was also proposed to deal with large amounts of action categories. These techniques are not only improving the classification accuracy with constrained computational resources but are also robust to challenging environmental conditions. These developed techniques can be easily extended to a wide range of video applications to provide near real-time performance.
Resumo:
There are many popular models available for classification of documents like Naïve Bayes Classifier, k-Nearest Neighbors and Support Vector Machine. In all these cases, the representation is based on the “Bag of words” model. This model doesn't capture the actual semantic meaning of a word in a particular document. Semantics are better captured by proximity of words and their occurrence in the document. We propose a new “Bag of Phrases” model to capture this discriminative power of phrases for text classification. We present a novel algorithm to extract phrases from the corpus using the well known topic model, Latent Dirichlet Allocation(LDA), and to integrate them in vector space model for classification. Experiments show a better performance of classifiers with the new Bag of Phrases model against related representation models.