6 resultados para document categorization

em Chinese Academy of Sciences Institutional Repositories Grid Portal


Relevância:

30.00% 30.00%

Publicador:

Resumo:

Abstract. Latent Dirichlet Allocation (LDA) is a document level language model. In general, LDA employ the symmetry Dirichlet distribution as prior of the topic-words’ distributions to implement model smoothing. In this paper, we propose a data-driven smoothing strategy in which probability mass is allocated from smoothing-data to latent variables by the intrinsic inference procedure of LDA. In such a way, the arbitrariness of choosing latent variables'priors for the multi-level graphical model is overcome. Following this data-driven strategy,two concrete methods, Laplacian smoothing and Jelinek-Mercer smoothing, are employed to LDA model. Evaluations on different text categorization collections show data-driven smoothing can significantly improve the performance in balanced and unbalanced corpora.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We propose a highly efficient content-lossless compression scheme for Chinese document images. The scheme combines morphologic analysis with pattern matching to cluster patterns. In order to achieve the error maps with minimal error numbers, the morphologic analysis is applied to decomposing and recomposing the Chinese character patterns. In the pattern matching, the criteria are adapted to the characteristics of Chinese characters. Since small-size components sometimes can be inserted into the blank spaces of large-size components, we can achieve small-size pattern library images. Arithmetic coding is applied to the final compression. Our method achieves much better compression performance than most alternative methods, and assures content-lossless reconstruction. (c) 2006 Society of Photo-Optical Instrumentation Engineers.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We propose a highly efficient content-lossless compression scheme for Chinese document images. The scheme combines morphologic analysis with pattern matching to cluster patterns. In order to achieve the error maps with minimal error numbers, the morphologic analysis is applied to decomposing and recomposing the Chinese character patterns. In the pattern matching, the criteria are adapted to the characteristics of Chinese characters. Since small-size components sometimes can be inserted into the blank spaces of large-size components, we can achieve small-size pattern library images. Arithmetic coding is applied to the final compression. Our method achieves much better compression performance than most alternative methods, and assures content-lossless reconstruction. (c) 2006 Society of Photo-Optical Instrumentation Engineers.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

It is common that documents are represented by document icon in graphical user interfaces. The document icon facilitates user to retrieve documents, but it is difficult to distinguish the document from a collection of documents that user have accessed to. Our paper presents a document icon on which the users can add some subjective values and mark. Then we describe a system ex-explorer that users can browser and search the extent document icon. We found that it is easy to re-find the document on which users added some annotation or mark by themselves.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

ACM SIGIR; ACM SIGWEB