1000 resultados para Documents électroniques


Relevância:

20.00% 20.00%

Publicador:

Resumo:

XML has emerged as a medium for interoperability over the Internet. As the number of documents published in the form of XML is increasing there is a need for selective dissemination of XML documents based on user interests. In the proposed technique, a combination of Self Adaptive Migration Model Genetic Algorithm (SAMCA)[5] and multi class Support Vector Machine (SVM) are used to learn a user model. Based on the feedback from the users the system automatically adapts to the user's preference and interests. The user model and a similarity metric are used for selective dissemination of a continuous stream of XML documents. Experimental evaluations performed over a wide range of XML documents indicate that the proposed approach significantly improves the performance of the selective dissemination task, with respect to accuracy and efficiency.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The following topics were dealt with: document analysis and recognition; multimedia document processing; character recognition; document image processing; cheque processing; form processing; music processing; document segmentation; electronic documents; character classification; handwritten character recognition; information retrieval; postal automation; font recognition; Indian language OCR; handwriting recognition; performance evaluation; graphics recognition; oriental character recognition; and word recognition

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We propose a set of metrics that evaluate the uniformity, sharpness, continuity, noise, stroke width variance,pulse width ratio, transient pixels density, entropy and variance of components to quantify the quality of a document image. The measures are intended to be used in any optical character recognition (OCR) engine to a priori estimate the expected performance of the OCR. The suggested measures have been evaluated on many document images, which have different scripts. The quality of a document image is manually annotated by users to create a ground truth. The idea is to correlate the values of the measures with the user annotated data. If the measure calculated matches the annotated description,then the metric is accepted; else it is rejected. In the set of metrics proposed, some of them are accepted and the rest are rejected. We have defined metrics that are easily estimatable. The metrics proposed in this paper are based on the feedback of homely grown OCR engines for Indic (Tamil and Kannada) languages. The metrics are independent of the scripts, and depend only on the quality and age of the paper and the printing. Experiments and results for each proposed metric are discussed. Actual recognition of the printed text is not performed to evaluate the proposed metrics. Sometimes, a document image containing broken characters results in good document image as per the evaluated metrics, which is part of the unsolved challenges. The proposed measures work on gray scale document images and fail to provide reliable information on binarized document image.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

When document corpus is very large, we often need to reduce the number of features. But it is not possible to apply conventional Non-negative Matrix Factorization(NMF) on billion by million matrix as the matrix may not fit in memory. Here we present novel Online NMF algorithm. Using Online NMF, we reduced original high-dimensional space to low-dimensional space. Then we cluster all the documents in reduced dimension using k-means algorithm. We experimentally show that by processing small subsets of documents we will be able to achieve good performance. The method proposed outperforms existing algorithms.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The broader goal of the research being described here is to automatically acquire diagnostic knowledge from documents in the domain of manual and mechanical assembly of aircraft structures. These documents are treated as a discourse used by experts to communicate with others. It therefore becomes possible to use discourse analysis to enable machine understanding of the text. The research challenge addressed in the paper is to identify documents or sections of documents that are potential sources of knowledge. In a subsequent step, domain knowledge will be extracted from these segments. The segmentation task requires partitioning the document into relevant segments and understanding the context of each segment. In discourse analysis, the division of a discourse into various segments is achieved through certain indicative clauses called cue phrases that indicate changes in the discourse context. However, in formal documents such language may not be used. Hence the use of a domain specific ontology and an assembly process model is proposed to segregate chunks of the text based on a local context. Elements of the ontology/model, and their related terms serve as indicators of current context for a segment and changes in context between segments. Local contexts are aggregated for increasingly larger segments to identify if the document (or portions of it) pertains to the topic of interest, namely, assembly. Knowledge acquired through such processes enables acquisition and reuse of knowledge during any part of the lifecycle of a product.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In optical character recognition of very old books, the recognition accuracy drops mainly due to the merging or breaking of characters. In this paper, we propose the first algorithm to segment merged Kannada characters by using a hypothesis to select the positions to be cut. This method searches for the best possible positions to segment, by taking into account the support vector machine classifier's recognition score and the validity of the aspect ratio (width to height ratio) of the segments between every pair of cut positions. The hypothesis to select the cut position is based on the fact that a concave surface exists above and below the touching portion. These concave surfaces are noted down by tracing the valleys in the top contour of the image and similarly doing it for the image rotated upside-down. The cut positions are then derived as closely matching valleys of the original and the rotated images. Our proposed segmentation algorithm works well for different font styles, shapes and sizes better than the existing vertical projection profile based segmentation. The proposed algorithm has been tested on 1125 different word images, each containing multiple merged characters, from an old Kannada book and 89.6% correct segmentation is achieved and the character recognition accuracy of merged words is 91.2%. A few points of merge are still missed due to the absence of a matched valley due to the specific shapes of the particular characters meeting at the merges.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This is a translation of selected articles from the Japanese language publication Hiroshimaken Suisan Shikenjo Hokoku (Report of Hirshima Prefectural Fisheries Experimental Station), Hiroshima City, Japan, vol.22, no. 1, 1960, pages 1-76. Articles translated are: Haematological study of bacteria affected oysters, The distribution of oyster larvae and spatfalls in the Hiroshima City perimeter, On the investigation of the timing of spatfalls, On the prediction of oyster seeding at inner Hiroshima Bay, Oyster growth and its environment at the oyster farm in Hiroshima Bay