The Use of Latent Semantic Indexing to Mitigate OCR Effects of Related Document Images


Author(s): BULCAO-NETO, Renato F.; CAMACHO-GUERRERO, Jose A.; DUTRA, Marcio; BARREIRO, Alvaro; PARAPAR, Javier; MACEDO, Alessandra A.
Contributor(s)

UNIVERSIDADE DE SÃO PAULO

Date(s)

17/04/2012

2011

Abstract

Owing to the widespread, multipurpose use of document images and the large number of document image repositories now available, demand for robust information retrieval mechanisms and systems has increased. This paper presents an approach that supports the automatic generation of relationships among document images by exploiting Latent Semantic Indexing (LSI) and Optical Character Recognition (OCR). We developed the LinkDI (Linking of Document Images) service, which extracts and indexes the content of document images, computes its latent semantics, and defines relationships among images as hyperlinks. LinkDI was tested on document image repositories, and its performance was evaluated by comparing the quality of the relationships created among textual documents with that of the relationships created among their respective document images. Using the same document images, we ran further experiments to compare the performance of LinkDI with and without the LSI technique. Experimental results showed that LSI can mitigate the effects of common OCR misrecognition, which reinforces the feasibility of using LinkDI to relate highly degraded OCR output.
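The pipeline the abstract describes (index text, compute a latent semantic space, link related items) can be illustrated with a minimal sketch. This is not the LinkDI implementation: the function names, the toy documents, the rank `k=2`, and the 0.9 similarity threshold are all assumptions chosen for illustration, and the OCR errors are simulated by hand by swapping `l`/`i` for `1`.

```python
import numpy as np

def term_document_matrix(docs):
    """Build a raw term-frequency matrix with shape (terms, documents)."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            A[index[w], j] += 1.0
    return A, vocab

def lsi_document_vectors(A, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # One row per document; columns are latent dimensions scaled by singular values.
    return Vt[:k, :].T * s[:k]

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid division by zero for empty documents
    Xn = X / norms
    return Xn @ Xn.T

def link_documents(sim, threshold):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold."""
    n = sim.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if sim[i, j] >= threshold]

# Hypothetical toy corpus: documents 1 and 3 imitate OCR misrecognition.
docs = [
    "latent semantic indexing retrieval",
    "latent semantic 1ndexing retrieva1",
    "football match goal score",
    "football match goa1 score",
]

A, vocab = term_document_matrix(docs)

# Plain term matching: corrupted words no longer match, so similarity drops.
raw_links = link_documents(cosine_similarity_matrix(A.T), threshold=0.9)

# LSI: documents sharing co-occurring terms collapse onto the same latent axis.
lsi_links = link_documents(
    cosine_similarity_matrix(lsi_document_vectors(A, k=2)), threshold=0.9)

print(raw_links)  # -> []
print(lsi_links)  # -> [(0, 1), (2, 3)]
```

In this toy case the corrupted tokens fall on discarded low-variance dimensions, so each clean document and its noisy counterpart end up with near-identical latent vectors. Real OCR output is far noisier, and the choice of rank and threshold would need to be tuned on actual data.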

CNPq[557976/2008-1]

FAPESP[05/60038-5]

FAPESP[05/60729-8]

FAPESP[06/58984-2]

FAPESP[09/14292-8]

FAPESP[2009/05504-1]

Spanish Ministerio de Ciencia e Innovacion[TIN2008-06566-C04-04]

FEDER

Xunta de Galicia[07SIN005206PR]

Innolution Sistemas de Informatica

Identifier

JOURNAL OF UNIVERSAL COMPUTER SCIENCE, v.17, n.1, p.64-80, 2011

0948-695X

http://producao.usp.br/handle/BDPI/14973

http://www.jucs.org/jucs_17_1/the_use_of_latent/jucs_17_01_0064_0080_neto.pdf

Language(s)

eng

Publisher

GRAZ UNIV TECHNOLOGY, INST INFORMATION SYSTEMS COMPUTER MEDIA-IICM

Relation

Journal of Universal Computer Science

Rights

openAccess

Copyright GRAZ UNIV TECHNOLOGY, INST INFORMATION SYSTEMS COMPUTER MEDIA-IICM

Keywords

#Applied Computing #Information Retrieval #Document Engineering #Latent Semantic #Optical Character Recognition #Document Image #Experimentation #VECTOR-SPACE #RETRIEVAL

Type

article

original article

publishedVersion