Distributed Diverging Topic Models: A Novel Algorithm for Large Scale Topic Modeling in Spark


Author(s): Marquardt, James Andrew
Contributor(s)

De Cock, Martine

Date(s)

24/02/2015

2014

Abstract

Thesis (Master's)--University of Washington, 2014

In their 2003 paper Latent Dirichlet Allocation, Blei, Ng, and Jordan proposed the generative model of the same name, which has since become the basis for most research in the field of topic modeling. The model overcame many shortcomings of earlier probabilistic models, for example by allowing topics to be inferred for documents not present in the learning phase and by representing each document as a mixture of topics. In the past decade, the inference algorithm for the model has been implemented in many languages, extended to relate topics to other entities such as emotions and document labels, and optimized in a variety of ways to speed up learning. Latent Dirichlet Allocation (LDA) has found applications across a wide variety of disciplines, including the digital humanities, computational social science, e-commerce, and government science policy. In short, these advances and applications illustrate the significant influence of the original LDA algorithm.

However, despite the many publications and tools built on LDA, the model suffers from one central issue: it is extremely computationally intensive. This shortcoming is severe enough that its utility on large datasets, such as those mined from the Internet, is questionable. Additionally, topic modeling often requires a degree of active learning, with feedback from a domain expert, which would ideally be minimized in many settings.

In this work we present Distributed Diverging Latent Dirichlet Allocation (DD-LDA), a novel algorithm for building topic models based on the original Latent Dirichlet Allocation model. The algorithm takes advantage of recent advances in distributed computation, and demonstrates its utility through reduced running time as well as improved model performance via the ability to intelligently determine an appropriate model size.
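The generative story that LDA assumes can be sketched in a few lines of plain Python. This is only a toy illustration of the model the abstract describes, not the thesis's DD-LDA or Spark implementation; the two-topic vocabulary, topic distributions, and hyperparameter value below are invented for the example:

```python
import random

def sample_dirichlet(alpha, k):
    # Draw from a symmetric Dirichlet(alpha) by normalizing Gamma draws.
    draws = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs):
    # Draw an index according to the given probability vector.
    r = random.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate_document(topics, alpha, doc_length):
    """LDA's generative process for one document:
    draw a per-document topic mixture theta ~ Dirichlet(alpha),
    then for each word draw a topic z ~ theta and a word w ~ topics[z]."""
    k = len(topics)
    theta = sample_dirichlet(alpha, k)
    words = []
    for _ in range(doc_length):
        z = sample_categorical(theta)
        w = sample_categorical(topics[z])
        words.append(w)
    return theta, words

# Hypothetical corpus: 2 topics over a 4-word vocabulary.
topics = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0 puts all mass on words 0 and 1
    [0.0, 0.0, 0.5, 0.5],  # topic 1 puts all mass on words 2 and 3
]
theta, doc = generate_document(topics, alpha=0.1, doc_length=20)
```

Topic-model inference runs this story in reverse: given only the observed words, recover the per-document mixtures and per-topic word distributions, which is the computationally expensive step that DD-LDA distributes.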

Format

application/pdf

Identifier

Marquardt_washington_0250O_14002.pdf

http://hdl.handle.net/1773/27372

Language(s)

en_US

Rights

Copyright is held by the individual authors.

Keywords #Latent Dirichlet Allocation; Spark; Topic Modeling #Computer science #Computing and software systems
Type

Thesis