27 results for pLSA


Relevance:

20.00%

Publisher:

Abstract:

Non-negative matrix factorization (NMF) [5] is a well-known tool for unsupervised machine learning. It can be viewed as a generalization of K-means clustering, Expectation-Maximization-based clustering, and aspect modeling by Probabilistic Latent Semantic Analysis (PLSA). Specifically, PLSA is related to NMF with a KL-divergence objective function. Further, it has been shown that K-means clustering is a special case of NMF with a matrix L2-norm-based error function. In this paper, our objective is to analyze the relation between K-means clustering and PLSA by examining the KL-divergence function and the matrix L2-norm-based error function.
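
As an illustration of the two objective functions mentioned in this abstract, the following sketch (not code from the paper) fits NMF with the matrix L2 (Frobenius) error, the objective related to K-means, and with the KL-divergence error, the objective related to PLSA, on a toy non-negative matrix using scikit-learn.

```python
# Illustrative sketch: comparing the two NMF objectives discussed above.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.poisson(lam=2.0, size=(100, 50)).astype(float)  # toy term-document-like counts

# NMF with the matrix L2 (Frobenius) error, the K-means-related objective
nmf_l2 = NMF(n_components=5, init="nndsvda", beta_loss="frobenius",
             solver="cd", max_iter=500, random_state=0)
W_l2 = nmf_l2.fit_transform(X)

# NMF with the KL-divergence error, the PLSA-related objective
nmf_kl = NMF(n_components=5, init="nndsvda", beta_loss="kullback-leibler",
             solver="mu", max_iter=500, random_state=0)
W_kl = nmf_kl.fit_transform(X)

print("Frobenius reconstruction error:", nmf_l2.reconstruction_err_)
print("KL reconstruction error:", nmf_kl.reconstruction_err_)
```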

Relevance:

20.00%

Publisher:

Abstract:

Text segmentation has important applications in many areas, including information extraction, automatic summarization, language modeling, and anaphora resolution. PLSA-based text segmentation attempts to link the latent topics hidden within segments to the surface-level words and sentence pairs of the text. The experiments use complete Chinese sentences as elementary blocks and try several similarity measures and boundary-estimation strategies, while also accounting for the effect of out-of-vocabulary words repeated in adjacent sentences on the similarity values. The best result shows a segment-boundary recognition error rate of 6.06%, far lower than that of comparable algorithms.
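
The following is a hedged, minimal sketch of the kind of similarity-based boundary estimation this abstract describes: each sentence block is represented by a topic or term vector, and a segment boundary is placed wherever the cosine similarity between adjacent blocks dips below a threshold. The vectors and the threshold are illustrative placeholders, not the paper's actual model.

```python
# Toy boundary-estimation sketch: place a boundary where adjacent blocks are dissimilar.
import numpy as np

def cosine(a, b, eps=1e-12):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def segment_boundaries(block_vectors, threshold):
    """Return indices i such that a boundary is placed between block i and i+1."""
    boundaries = []
    for i in range(len(block_vectors) - 1):
        if cosine(block_vectors[i], block_vectors[i + 1]) < threshold:
            boundaries.append(i)
    return boundaries

# Toy example: six blocks drawn from two different "topics"
rng = np.random.default_rng(0)
topic_a, topic_b = rng.random(20), rng.random(20)
blocks = [topic_a + 0.05 * rng.random(20) for _ in range(3)] + \
         [topic_b + 0.05 * rng.random(20) for _ in range(3)]
print(segment_boundaries(blocks, threshold=0.9))  # expect a boundary near index 2
```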

Relevance:

20.00%

Publisher:

Abstract:

Beijing University of Technology (BJUT); Beijing Municipal Lab of Brain Informatics; Chinese Society of Radiology; National Natural Science Foundation of China (NSFC); State Administration of Foreign Experts Affairs

Relevance:

10.00%

Publisher:

Abstract:

Two decades after its inception, Latent Semantic Analysis (LSA) has become part and parcel of every modern introduction to Information Retrieval. For any tool that matures so quickly, it is important to check its lore and limitations, or else stagnation will set in. We focus here on the three main aspects of LSA that are well accepted, the gist of which can be summarized as follows: (1) LSA recovers latent semantic factors underlying the document space, (2) this can be accomplished through lossy compression of the document space by eliminating lexical noise, and (3) the latter is best achieved by Singular Value Decomposition. For each aspect we performed experiments analogous to those reported in the LSA literature and compared the evidence brought to bear in each case. On the negative side, we show that the above claims about LSA are much more limited than commonly believed. Even a simple example shows that LSA does not recover the optimal semantic factors intended in the pedagogical example used in many LSA publications. Additionally, and remarkably deviating from LSA lore, LSA does not scale up well: the larger the document space, the less likely it is that LSA recovers an optimal set of semantic factors. On the positive side, we describe new algorithms to replace LSA (and more recent alternatives such as pLSA, LDA, and kernel methods) by trading its l2 space for an l1 space, thereby guaranteeing an optimal set of semantic factors. These algorithms seem to salvage the spirit of LSA as we think it was initially conceived.
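
For context, the standard textbook LSA pipeline that this abstract examines (not the authors' proposed l1 replacement) can be sketched as follows: a TF-IDF document-term matrix is compressed by a rank-k SVD to obtain the "latent semantic factors". The toy documents below are illustrative placeholders.

```python
# Minimal textbook LSA sketch: TF-IDF followed by truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "the generation of random binary unordered trees",
    "the intersection graph of paths in trees",
]

X = TfidfVectorizer().fit_transform(docs)           # documents x terms
svd = TruncatedSVD(n_components=2, random_state=0)  # lossy rank-2 compression
doc_factors = svd.fit_transform(X)                  # documents in the latent space

print(doc_factors.shape)                 # (4, 2)
print(svd.explained_variance_ratio_)
```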

Relevance:

10.00%

Publisher:

Abstract:

Topic modeling has been widely used in information retrieval, text mining, text classification, and related fields. Most existing statistical topic modeling methods, such as LDA and pLSA, generate a term-based representation of each topic by selecting single words from the multinomial word distribution over that topic. This has two main shortcomings: first, popular or common words occur very often across different topics, which makes the topics ambiguous to interpret; second, single words lack the coherent semantic meaning needed to represent topics accurately. To overcome these problems, in this paper we propose a two-stage model that combines text mining and pattern mining with statistical modeling to generate more discriminative and semantically rich topic representations. Experiments show that the optimized topic representations generated by the proposed methods outperform the typical statistical topic modeling method LDA in terms of accuracy and certainty.
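
To make the criticized representation concrete, the sketch below (an assumed, generic setup rather than the paper's two-stage model) shows the conventional term-based topic representation: each LDA topic is displayed as its top single words from the topic's word distribution.

```python
# Illustrative sketch of the term-based topic representation discussed above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "data mining text mining pattern discovery",
    "topic model word distribution document corpus",
    "retrieval ranking query document relevance",
    "pattern mining frequent itemset association rule",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:5]                 # indices of the top-5 words
    print(f"topic {k}:", ", ".join(terms[i] for i in top))
```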

Relevance:

10.00%

Publisher:

Abstract:

Internet personal recommender systems recommend the most relevant Internet information to users according to their interests. With information overload on the web growing ever more severe and the demand for personalized information retrieval increasing, recommender systems already play a key role in core Internet applications such as search engines, e-commerce, and online communities, and they are receiving ever more attention.

However, deploying a mature recommender system on a large web site is still costly, requiring substantial computation and storage resources, and there remains considerable room (and need) for improving recommendation accuracy. This poses many research challenges. Among them, accuracy and scalability have long been the two issues of greatest concern in the field: accuracy refers to the proportion of recommended items that the user is genuinely interested in, while scalability refers to whether the system can process massive data within tolerable time and space complexity. Improving accuracy while strengthening scalability is the main goal of recommender system research. Academic work so far, however, has focused more on accuracy; many highly accurate algorithms require complex computation, have limited ability to handle large-scale dynamic data, and do not include scalability in their evaluation, which makes them difficult to apply at industrial scale.

This thesis attempts to address this problem by bringing the idea of incremental learning into recommendation algorithms: the existing machine learning model is updated using the newest training data, with little or no reference to old training data. Compared with batch processing over all data, incremental updating greatly reduces the complexity of model updates, so the recommendation model can be refreshed far more efficiently and cheaply when new training data arrive, updates become more timely, and recommendation accuracy improves. Concretely, we propose two new incremental collaborative filtering algorithms and use incremental learning to accelerate several of the currently most accurate recommendation algorithms, in particular speeding up how their models are updated when new training data arrive, thereby making their large-scale application feasible. In addition, new training data reflect the user's latest interests and should therefore receive higher weight than old training data during updates; in this way the recommendations account for long-term interests while emphasizing recent ones, making the results more accurate. In short, we aim to use incremental learning to make recommendation algorithms more efficient and more precise at update time, so that they are truly suited to recommendation over massive Internet data, and to provide a reference for other research on incremental recommender systems. Our contributions are as follows.

Incremental recommendation based on topic models. Topic models, in particular Probabilistic Latent Semantic Analysis (PLSA), are a mainstream method widely used in recommender systems, with good results in text recommendation, image recommendation, and collaborative filtering. A major factor limiting further success of PLSA is the high complexity of updating the model, which forces learning to be done in batch mode; recommendations therefore lag behind and cannot reflect users' latest interests or the latest overall dynamics. We propose an incremental learning method applicable to both text analysis and collaborative filtering. When new training data arrive, for text-based recommendation the incremental update only adjusts the topic distributions of the most related users and documents and the words involved, and gives new documents higher weight; for collaborative filtering, our method only updates the items rated by the current user and the users associated with the current item. This greatly reduces the computational complexity of updates and raises the weight of new data in the algorithm, making recommendations more accurate and more timely. Our algorithms achieve good results on a Tianya Q&A text dataset and on collaborative filtering datasets including the MovieLens movie recommendation, Last.FM music recommendation, and Douban book recommendation datasets.

A collaborative filtering method based on the ant colony algorithm. Inspired by swarm intelligence, we propose an ant-colony-like collaborative filtering method, Ant Collaborative Filtering. At initialization, each user (or group of users) is given a globally unique, unit amount of pheromone. When a user rates an item or expresses interest in it, the pheromone carried by the user spreads to that item, and the pheromone already on the item (initially zero) spreads back to the user. In addition, the pheromone carried by users and items evaporates at a certain rate over time; through this evaporation mechanism, recent interests are emphasized at recommendation time. In the recommendation phase, similarities are derived from the types and amounts of pheromone carried by users and items, and recommendations are produced with classical similarity-based methods. The advantages of ant-colony-based collaborative filtering are that it effectively reduces the sparsity of the training data, supports real-time model updates and recommendations, and takes the change of user interests over time into account. We validated the method on the MovieLens movie rating, Douban book recommendation, and Last.FM music recommendation datasets. Finally, we built an Internet news recommender system, implemented as a Firefox add-on, which automatically collects the user's browsing interests and preferences and uses different back-end recommendation algorithms to recommend news of interest to the user.

A two-stage collaborative filtering method based on co-clustering. Clustering is an effective way to reduce data scale and sparsity, and it is a natural and indeed effective preprocessing step for huge, sparse collaborative filtering data. We therefore propose a two-stage collaborative filtering framework: first, a proposed co-clustering method decomposes the original rating matrix into many small blocks, each containing the ratings of similar users on similar items; then matrix-approximation methods (we used non-negative matrix factorization, NMF, and the topic model PLSA) predict the unknown ratings within these small blocks. When a user adds a new rating for an item, only the block containing that user or item needs to be re-estimated, which greatly speeds up rating prediction. We validated the algorithm on the MovieLens movie rating dataset.

The results of this thesis can be applied directly in large recommender systems and also offer guidance for follow-up research on incremental recommendation. The PLSA-based incremental recommendation algorithm is a useful reference for other graphical-model-based recommender systems; the ant colony recommendation algorithm is a valuable exploration of a new class of collaborative filtering based on swarm intelligence; and the proposed two-stage collaborative filtering framework offers a general and effective solution for improving the scalability and update efficiency of recommendation algorithms. Recommender systems are an endless optimization effort: besides steadily improving accuracy, their performance must keep up with the growth of data on the Internet, and incremental learning is undoubtedly the most important way to speed up model updates. This thesis provides a reference for that direction.
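
As a purely illustrative sketch (not the thesis implementation), the pheromone-propagation idea behind Ant Collaborative Filtering can be captured roughly as follows: each user starts with a unit of its own pheromone type, interactions exchange pheromone between users and items, evaporation discounts older interests, and similarities are then computed on the resulting pheromone vectors. The transfer and evaporation rates below are arbitrary placeholders.

```python
# Highly simplified ant-colony-style collaborative filtering sketch.
import numpy as np

n_users, n_items = 4, 5
user_pher = np.eye(n_users)               # each user carries its own pheromone type
item_pher = np.zeros((n_items, n_users))  # items start with no pheromone

def interact(u, i, transfer=0.5, evaporation=0.9):
    """User u interacts with item i: exchange pheromone, then evaporate."""
    global user_pher, item_pher
    u_gain = transfer * item_pher[i]
    i_gain = transfer * user_pher[u]
    user_pher[u] += u_gain
    item_pher[i] += i_gain
    user_pher *= evaporation              # older interests fade
    item_pher *= evaporation

for (u, i) in [(0, 1), (1, 1), (1, 2), (2, 2), (3, 4)]:
    interact(u, i)

def cosine(a, b, eps=1e-12):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

# users 0 and 1 shared item 1, so their pheromone vectors should be more similar
print(cosine(user_pher[0], user_pher[1]), cosine(user_pher[0], user_pher[3]))
```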

Relevance:

10.00%

Publisher:

Abstract:

We present a new approach to model and classify breast parenchymal tissue. Given a mammogram, we first discover the distribution of the different tissue densities in an unsupervised manner, and second use this tissue distribution to perform the classification. We achieve this using a classifier based on local descriptors and probabilistic Latent Semantic Analysis (pLSA), a generative model from the statistical text literature. We studied the influence of different descriptors, such as texture and SIFT features, at the classification stage, showing that textons outperform SIFT in all cases. Moreover, we demonstrate that pLSA automatically extracts meaningful latent aspects, generating a compact tissue representation based on their densities that is useful for discrimination in mammogram classification. We show the results of tissue classification over the MIAS and DDSM datasets and compare our method with approaches that classified these same datasets, showing the better performance of our proposal.
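
A hedged sketch of the kind of texton representation this abstract relies on (illustrative only; the patch features and parameters are placeholders, not the paper's setup) is to cluster local patches into a codebook with k-means and describe each image by its normalized histogram of texton assignments, which can then feed a pLSA model or a classifier.

```python
# Toy bag-of-textons sketch: codebook by k-means, image descriptor by word histogram.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
patches = rng.random((2000, 49))                      # toy 7x7 patches, flattened
codebook = MiniBatchKMeans(n_clusters=32, random_state=0).fit(patches)

def texton_histogram(image_patches, codebook):
    """Bag-of-textons descriptor: normalized histogram of codeword assignments."""
    words = codebook.predict(image_patches)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

print(texton_histogram(rng.random((300, 49)), codebook))
```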

Relevance:

10.00%

Publisher:

Abstract:

We investigate whether dimensionality reduction using a latent generative model is beneficial for the task of weakly supervised scene classification. In detail, we are given a set of labeled images of scenes (for example, coast, forest, city, river, etc.), and our objective is to classify a new image into one of these categories. Our approach consists of first discovering latent "topics" using probabilistic Latent Semantic Analysis (pLSA), a generative model from the statistical text literature, here applied to a bag of visual words representation for each image, and subsequently training a multiway classifier on the topic distribution vector for each image. We compare this approach to that of representing each image by a bag of visual words vector directly and training a multiway classifier on these vectors. To this end, we introduce a novel vocabulary using dense color SIFT descriptors and then investigate the classification performance under changes in the size of the visual vocabulary, the number of latent topics learned, and the type of discriminative classifier used (k-nearest neighbor or SVM). We achieve superior classification performance to that of recent publications that have used a bag of visual words representation, in all cases using the authors' own data sets and testing protocols. We also investigate the gain in adding spatial information. We show applications to image retrieval with relevance feedback and to scene classification in videos.
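
The comparison described here, classifying either on raw bag-of-visual-words histograms or on low-dimensional topic vectors, can be sketched on synthetic data as below. NMF is used merely as a convenient stand-in for pLSA (both factor the word-document matrix), so this is not the paper's exact model, and the data are random placeholders.

```python
# Toy comparison: SVM on raw BoW counts vs. SVM on low-dimensional topic vectors.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_images, vocab = 200, 300
labels = rng.integers(0, 2, size=n_images)
# toy BoW counts whose word usage depends weakly on the class label
base = rng.random((2, vocab))
bow = np.vstack([rng.poisson(5 * base[y] + 0.5) for y in labels]).astype(float)

topics = NMF(n_components=10, init="nndsvda", max_iter=500,
             random_state=0).fit_transform(bow)        # topic distribution vectors

print("BoW + SVM   :", cross_val_score(SVC(), bow, labels, cv=5).mean())
print("Topics + SVM:", cross_val_score(SVC(), topics, labels, cv=5).mean())
```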

Relevance:

10.00%

Publisher:

Abstract:

The growth of databases containing ever more difficult images with an ever larger number of categories is driving the development of image representations that remain discriminative when working with multiple classes, and of algorithms that are efficient in learning and classification. This thesis explores the problem of classifying images by the object they contain when a large number of categories is involved. We first investigate how a hybrid system combining a generative model with a discriminative model can benefit image classification when the level of human annotation is minimal. For this task we introduce a new vocabulary using a dense representation of color-SIFT descriptors, and then investigate how the different parameters affect the final classification. We then propose a method for incorporating spatial information into the hybrid system, showing that context information is of great help for image classification. Next, we introduce a new shape descriptor that represents the image by its local and spatial shape, together with a kernel that incorporates this spatial information in pyramid form. The shape is represented by a compact vector, yielding a descriptor well suited to kernel-based learning algorithms. The experiments show that this shape information achieves results similar to (and sometimes better than) appearance-based descriptors. We also investigate how different features can be combined for image classification and show that the proposed shape descriptor together with an appearance descriptor substantially improves classification. Finally, we describe an algorithm that detects regions of interest automatically during training and classification. This provides a way of suppressing the image background and adds invariance to the position of objects within images. We show that using shape and appearance over this region of interest, together with random forest classifiers, improves both classification and computational time. We compare our results with results from the literature using the same datasets and the same training and classification protocols as the original authors, and show that all the introduced innovations increase the final image classification performance.
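
One simple way to encode spatial layout in pyramid form, in the spirit of the pyramid kernel mentioned above (this is a generic sketch, not the thesis's shape descriptor), is to concatenate visual-word histograms computed over increasingly fine grid cells.

```python
# Toy spatial-pyramid histogram: per-cell word histograms concatenated across levels.
import numpy as np

def spatial_pyramid(word_ids, positions, vocab_size, levels=2):
    """word_ids: codeword index per feature; positions: (x, y) in [0, 1) per feature."""
    feats = []
    for level in range(levels + 1):
        cells = 2 ** level
        cell_idx = np.floor(positions * cells).astype(int).clip(0, cells - 1)
        for cx in range(cells):
            for cy in range(cells):
                mask = (cell_idx[:, 0] == cx) & (cell_idx[:, 1] == cy)
                hist = np.bincount(word_ids[mask], minlength=vocab_size)
                feats.append(hist)
    return np.concatenate(feats).astype(float)

rng = np.random.default_rng(0)
ids = rng.integers(0, 50, size=400)
pos = rng.random((400, 2))
print(spatial_pyramid(ids, pos, vocab_size=50).shape)   # 50 * (1 + 4 + 16) = 1050 dims
```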

Relevance:

10.00%

Publisher:

Abstract:

Biomedical time series clustering, which automatically groups a collection of time series according to their internal similarity, is important for medical record management and inspection, such as bio-signal archiving and retrieval. In this paper, a novel framework is proposed that automatically groups a set of unlabelled multichannel biomedical time series according to their internal structural similarity. Specifically, we treat a multichannel biomedical time series as a document and extract local segments from the time series as words. We extend a topic model, the Hierarchical probabilistic Latent Semantic Analysis (H-pLSA) originally developed for visual motion analysis, to cluster a set of unlabelled multichannel time series. The H-pLSA models each channel of the multichannel time series with a local pLSA in the first layer. The topics learned in the local pLSA are then fed to a global pLSA in the second layer to discover the categories of multichannel time series. Experiments on a dataset extracted from multichannel Electrocardiography (ECG) signals demonstrate that the proposed method performs better than previous state-of-the-art approaches and is relatively robust to variations of parameters such as the length of local segments and the dictionary size. Although the experimental evaluation used multichannel ECG signals in a biometric scenario, the proposed algorithm is a universal framework for clustering multichannel biomedical time series by their structural similarity, with many applications in biomedical time series management.
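
The two-layer idea can be sketched roughly as follows (illustrative only, with NMF standing in for pLSA and random counts standing in for quantized local segments): local-segment "words" from each channel form a per-channel document, a local topic model summarizes each channel, and the per-channel topic mixtures are then factored by a global topic model to discover recording-level categories.

```python
# Toy two-layer (local then global) topic-model sketch for multichannel recordings.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_recordings, n_channels, vocab = 30, 3, 40

# toy "document" per (channel, recording): a histogram of local-segment words
counts = rng.poisson(2.0, size=(n_channels, n_recordings, vocab)).astype(float)

# layer 1: one local topic model per channel
local_topics = []
for ch in range(n_channels):
    local = NMF(n_components=4, init="nndsvda", max_iter=400, random_state=0)
    local_topics.append(local.fit_transform(counts[ch]))    # (n_recordings, 4)

# layer 2: global topic model over the concatenated channel-level topic mixtures
stacked = np.hstack(local_topics)                            # (n_recordings, 12)
global_model = NMF(n_components=3, init="nndsvda", max_iter=400, random_state=0)
categories = global_model.fit_transform(stacked).argmax(axis=1)
print(categories)   # discovered category per recording
```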

Relevance:

10.00%

Publisher:

Abstract:

There has been an increased demand for characterizing user access patterns with web mining techniques, since the informative knowledge extracted from web server log files can offer benefits not only for improving web site structure but also for better understanding user navigational behavior. In this paper, we present a web usage mining method that utilizes web usage and page linkage information to capture user access patterns based on the Probabilistic Latent Semantic Analysis (PLSA) model. A specific probabilistic model analysis algorithm, the EM algorithm, is applied to the integrated usage data to infer the latent semantic factors and to generate user session clusters that reveal user access patterns. Experiments have been conducted on a real-world data set to validate the effectiveness of the proposed approach. The results show that the presented method is capable of characterizing the latent semantic factors and of generating user profiles in terms of weighted page vectors, which reflect the common access interests exhibited by users within the same session cluster.
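
For reference, a minimal textbook pLSA EM loop of the kind applied here to usage data can be sketched as follows (a generic formulation, not the paper's exact implementation): given a session-by-page co-occurrence matrix N, the E-step computes P(z|d,w) and the M-step re-estimates P(z), P(d|z), and P(w|z) from the expected counts.

```python
# Minimal pLSA via EM (textbook formulation) on a toy session x page count matrix.
import numpy as np

def plsa(N, n_topics, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n_d, n_w = N.shape
    p_z = np.full(n_topics, 1.0 / n_topics)                    # P(z)
    p_d_z = rng.random((n_topics, n_d))                        # P(d | z)
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_w))                        # P(w | z)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: P(z | d, w), normalized over z
        joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
        p_z_dw = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)
        # M-step: re-estimate parameters from expected counts
        n_z_dw = N[None, :, :] * p_z_dw
        p_d_z = n_z_dw.sum(axis=2)
        p_d_z /= p_d_z.sum(axis=1, keepdims=True) + 1e-12
        p_w_z = n_z_dw.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z = n_z_dw.sum(axis=(1, 2))
        p_z /= p_z.sum()
    return p_z, p_d_z, p_w_z

N = np.random.default_rng(1).poisson(1.0, size=(20, 15)).astype(float)
p_z, p_d_z, p_w_z = plsa(N, n_topics=3)
print(p_z)
```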

Relevance:

10.00%

Publisher:

Abstract:

Web transaction data between Web visitors and Web functionalities usually convey user task-oriented behavior patterns, and mining such click-stream data leads to the capture of usage pattern information. Nowadays, Web usage mining has become one of the most widely used methods for Web recommendation, which customizes Web content to a user-preferred style. Traditional Web usage mining techniques, such as Web user session or Web page clustering, association rules, and frequent navigational path mining, can only discover usage patterns explicitly; they cannot reveal the underlying navigational activities or identify the latent relationships associated with those patterns among Web users and Web pages. In this work, we propose a Web recommendation framework incorporating a Web usage mining technique based on the Probabilistic Latent Semantic Analysis (PLSA) model. The main advantage of this method is that it not only discovers usage-based access patterns but also reveals the underlying latent factors. With the discovered user access patterns, we then present users with content of greater interest via collaborative recommendation. To validate the effectiveness of the proposed approach, we conduct experiments on real-world datasets and make comparisons with some existing traditional techniques. The preliminary experimental results demonstrate the usability of the proposed approach.
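
Once a PLSA-style model has been trained, the collaborative recommendation step described above can be sketched as ranking unvisited pages by the mixture score sum_z P(z|user) P(page|z). The factor matrices below are random placeholders standing in for a trained model, so this is an illustration of the ranking step only.

```python
# Toy collaborative recommendation from latent factors: rank unvisited pages per user.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_pages, n_topics = 5, 8, 3

p_z_given_user = rng.random((n_users, n_topics))
p_z_given_user /= p_z_given_user.sum(axis=1, keepdims=True)   # P(z | user)
p_page_given_z = rng.random((n_topics, n_pages))
p_page_given_z /= p_page_given_z.sum(axis=1, keepdims=True)   # P(page | z)
visited = rng.random((n_users, n_pages)) < 0.3                # pages already seen

scores = p_z_given_user @ p_page_given_z                      # P(page | user)
scores[visited] = -np.inf                                      # do not re-recommend
top_pages = scores.argsort(axis=1)[:, ::-1][:, :3]            # top-3 pages per user
print(top_pages)
```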