Abstract:
We propose a set of metrics that evaluate the uniformity, sharpness, continuity, noise, stroke width variance, pulse width ratio, transient pixel density, entropy and variance of components to quantify the quality of a document image. The measures are intended to be used in any optical character recognition (OCR) engine to estimate, a priori, the expected performance of the OCR. The suggested measures have been evaluated on many document images in different scripts. The quality of each document image is manually annotated by users to create a ground truth. The idea is to correlate the values of the measures with the user-annotated data. If the calculated measure matches the annotated description, the metric is accepted; otherwise it is rejected. Of the set of metrics proposed, some are accepted and the rest are rejected. We have defined metrics that are easy to estimate. The metrics proposed in this paper are based on feedback from home-grown OCR engines for Indic (Tamil and Kannada) languages. The metrics are independent of the script and depend only on the quality and age of the paper and of the printing. Experiments and results for each proposed metric are discussed. Actual recognition of the printed text is not performed to evaluate the proposed metrics. Sometimes a document image containing broken characters is still rated as a good document image by the evaluated metrics, which is one of the remaining unsolved challenges. The proposed measures work on grayscale document images and fail to provide reliable information on binarized document images.
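As a rough illustration of the kind of grayscale measure described above (the paper's exact definitions are not given in this abstract), the following sketch computes two simple per-image quality indicators, a histogram entropy and a gradient-based sharpness score; the function names and the 0-255 intensity range are assumptions.

```python
# Minimal sketch, not the paper's formulas: two illustrative grayscale quality measures.
import numpy as np

def histogram_entropy(gray: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy of the grayscale histogram (higher ~ richer tonal detail)."""
    hist, _ = np.histogram(gray, bins=bins, range=(0, 255))
    p = hist.astype(float) / max(hist.sum(), 1)
    p = p[p > 0]                                  # ignore empty bins
    return float(-(p * np.log2(p)).sum())

def gradient_sharpness(gray: np.ndarray) -> float:
    """Mean gradient magnitude as a crude sharpness proxy."""
    gy, gx = np.gradient(gray.astype(float))
    return float(np.mean(np.hypot(gx, gy)))

# Usage on a synthetic 'page'; real input would be a scanned grayscale image.
page = np.full((200, 200), 255, dtype=np.uint8)
page[50:150, 90:96] = 0                           # a fake vertical stroke
print(histogram_entropy(page), gradient_sharpness(page))
```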
Abstract:
When a document corpus is very large, we often need to reduce the number of features. However, it is not possible to apply conventional Non-negative Matrix Factorization (NMF) to a billion-by-million matrix, as the matrix may not fit in memory. Here we present a novel Online NMF algorithm. Using Online NMF, we reduce the original high-dimensional space to a low-dimensional space. We then cluster all the documents in the reduced dimension using the k-means algorithm. We show experimentally that by processing small subsets of documents we can achieve good performance. The proposed method outperforms existing algorithms.
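The abstract does not spell out the update rules of the proposed Online NMF, so the sketch below only illustrates the general pipeline: a generic mini-batch NMF with multiplicative updates, followed by k-means on the reduced document representations. The rank, batch layout and all names are assumptions, not the authors' algorithm.

```python
# Generic mini-batch NMF sketch (V ~ W H), then k-means on the low-dimensional rows.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
eps = 1e-10

def online_nmf(doc_batches, n_features, rank=20, inner_iters=10):
    """Documents arrive as row batches; H is the shared basis, W rows are the reduced docs."""
    H = rng.random((rank, n_features)) + eps
    W_rows = []
    for V in doc_batches:                         # V: (batch_size, n_features), nonnegative
        W = rng.random((V.shape[0], rank)) + eps
        for _ in range(inner_iters):
            # multiplicative updates for the Frobenius objective
            W *= (V @ H.T) / (W @ H @ H.T + eps)
            H *= (W.T @ V) / (W.T @ W @ H + eps)
        W_rows.append(W)
    return np.vstack(W_rows), H

# Toy corpus: 5 batches of 100 documents with 1000 nonnegative term weights each.
batches = [rng.random((100, 1000)) for _ in range(5)]
W, H = online_nmf(batches, n_features=1000, rank=20)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(W)
print(labels[:10])
```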
Abstract:
The broader goal of the research described here is to automatically acquire diagnostic knowledge from documents in the domain of manual and mechanical assembly of aircraft structures. These documents are treated as a discourse used by experts to communicate with others; it therefore becomes possible to use discourse analysis to enable machine understanding of the text. The research challenge addressed in this paper is to identify documents, or sections of documents, that are potential sources of knowledge. In a subsequent step, domain knowledge will be extracted from these segments. The segmentation task requires partitioning the document into relevant segments and understanding the context of each segment. In discourse analysis, the division of a discourse into segments is achieved through indicative clauses called cue phrases, which signal changes in the discourse context. However, such language may not be used in formal documents. Hence the use of a domain-specific ontology and an assembly process model is proposed to segregate chunks of the text based on a local context. Elements of the ontology/model and their related terms serve as indicators of the current context of a segment and of changes in context between segments. Local contexts are aggregated over increasingly larger segments to identify whether the document (or portions of it) pertains to the topic of interest, namely assembly. Knowledge acquired through such processes enables the acquisition and reuse of knowledge during any part of the lifecycle of a product.
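As a toy illustration of using ontology terms as local context indicators (the actual ontology, assembly process model and aggregation scheme are not described in this abstract), the sketch below labels fixed-size word windows of a document by their dominant context; the term lists, labels and window size are invented for the example.

```python
# Toy context labelling of text segments via an assumed mini-ontology.
import re
from collections import Counter

ONTOLOGY = {  # hypothetical context label -> indicative terms (not from the paper)
    "assembly": {"rivet", "fastener", "jig", "shim", "drill", "torque"},
    "inspection": {"gauge", "tolerance", "defect", "crack", "measurement"},
}

def segment_contexts(text, window=50):
    """Split text into word windows and label each by its dominant ontology context."""
    words = re.findall(r"[a-z]+", text.lower())
    labels = []
    for i in range(0, len(words), window):
        counts = Counter()
        for word in words[i:i + window]:
            for label, terms in ONTOLOGY.items():
                if word in terms:
                    counts[label] += 1
        labels.append(counts.most_common(1)[0][0] if counts else "other")
    return labels

doc = "Install each rivet with the jig and verify torque ... then gauge the tolerance of the joint."
print(segment_contexts(doc, window=8))
```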
Abstract:
In optical character recognition of very old books, recognition accuracy drops mainly due to the merging or breaking of characters. In this paper, we propose the first algorithm to segment merged Kannada characters, using a hypothesis to select the positions to be cut. The method searches for the best possible positions to segment by taking into account the support vector machine classifier's recognition score and the validity of the aspect ratio (width-to-height ratio) of the segments between every pair of cut positions. The hypothesis for selecting the cut position is based on the fact that a concave surface exists above and below the touching portion. These concave surfaces are located by tracing the valleys in the top contour of the image, and similarly for the image rotated upside down. The cut positions are then derived as closely matching valleys of the original and rotated images. Our proposed segmentation algorithm works well for different font styles, shapes and sizes, and performs better than the existing vertical projection profile based segmentation. The proposed algorithm has been tested on 1125 different word images, each containing multiple merged characters, from an old Kannada book; 89.6% correct segmentation is achieved, and the character recognition accuracy of merged words is 91.2%. A few points of merge are still missed due to the absence of a matched valley, owing to the specific shapes of the particular characters meeting at the merges.
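The sketch below illustrates only the valley-matching idea described above, i.e. locating dips in the top contour of the word image and of its upside-down copy and keeping columns where the two sets of valleys roughly coincide; the SVM scoring and aspect-ratio validation steps are omitted, and the function names and matching tolerance are assumptions.

```python
# Sketch of valley-matching for candidate cut columns in a binary word image (ink = 1).
import numpy as np

def top_contour(binary):
    """Row of the first ink pixel per column; ink-free columns get 0 so they never form a valley."""
    return np.where(binary.any(axis=0), binary.argmax(axis=0), 0)

def valleys(contour):
    """Columns where the contour dips lowest, i.e. local maxima of the first-ink row index."""
    c = contour
    return [x for x in range(1, len(c) - 1) if c[x] > c[x - 1] and c[x] >= c[x + 1]]

def candidate_cuts(binary, tol=2):
    """Match valleys of the image with valleys of its upside-down copy within a column tolerance."""
    v_top = valleys(top_contour(binary))
    v_bot = valleys(top_contour(binary[::-1, :]))
    return sorted({t for t in v_top if any(abs(t - b) <= tol for b in v_bot)})

# Toy example: two blobs joined by a thin bridge around columns 9-11.
img = np.zeros((20, 21), dtype=np.uint8)
img[5:15, 2:9] = 1
img[5:15, 12:19] = 1
img[9:11, 9:12] = 1            # the touching portion
print(candidate_cuts(img))     # candidate cut columns in the bridge region
```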
Abstract:
This is a translation of selected articles from the Japanese-language publication Hiroshimaken Suisan Shikenjo Hokoku (Report of Hiroshima Prefectural Fisheries Experimental Station), Hiroshima City, Japan, vol. 22, no. 1, 1960, pages 1-76. The articles translated are: Haematological study of bacteria-affected oysters; The distribution of oyster larvae and spatfalls in the Hiroshima City perimeter; On the investigation of the timing of spatfalls; On the prediction of oyster seeding at inner Hiroshima Bay; and Oyster growth and its environment at the oyster farm in Hiroshima Bay.
Abstract:
This compilation of references to works which synthesize information on coastal topics is intended to be useful to resource managers in decision-making processes. However, its utility must be understood in terms of its limited coverage. The bibliography is not inclusive of all the published materials on the topics selected; coverage is clearly defined in the following paragraph. The time span of the bibliography is limited to references that were published from 1983 to 1993, except for a last-minute addition of a few 1994 publications. All searches were done in mid- to late 1993. The bibliography was compiled from searches done on the following DIALOG electronic databases: Aquatic Sciences and Fisheries Abstracts, BIOSIS Previews, Dissertation Abstracts Online, Life Sciences Collection, NTIS (National Technical Information Service), Oceanic Abstracts, Pollution Abstracts, SciSearch, and Water Resources Abstracts. In addition, two NOAA electronic databases were searched: the NOAA Library and Information Catalog and the NOAA Sea Grant Depository Database. "Synthesis of information" is not a ubiquitous term in database development; in order to locate syntheses of the required coastal topics, 89 search terms were used in combinations which required 10 searches from each file. From the nearly 6,000 citations which resulted from the electronic searches, the most appropriate were selected to produce this bibliography. The document was edited and indexed using WordPerfect software. When available, an abstract has been included, and every abstract was edited. The bibliography is subdivided into four main topics or sections: ecosystems, coastal water body conditions, natural disasters, and resource management. In the ecosystems section, emphasis is placed on organisms in their environment on the major coastlines of the U.S. In the second section, coastal water body conditions, the environment itself is emphasized; references were found for the Alaskan coast, but none were found for Hawaii. The third section, on natural disasters, emphasizes environmental impacts resulting from natural phenomena. Guidelines, planning and management reports, modelling documents, strategic and restoration plans, and environmental economics related to sustainability are included in the fourth section, resource management. Author, geographic, and subject indices are provided. The authors would like to thank Victor Omelczenko and Terry Seldon of the NOAA Sea Grant Office for access to and training on the NOAA Sea Grant Depository Database. We are grateful also to Dorothy Anderson, Philip Keavey, and Elizabeth Petersen, who reviewed the draft document.
Abstract:
This research proposes a method for systematically extracting technology intelligence (TI) from a large set of document data. To do this, the internal and external sources in the form of documents that might be valuable for TI are first identified. Then the existing techniques and software systems applicable to document analysis are examined. Finally, based on these reviews, a document-mining framework designed for TI is suggested and guidelines for software selection are proposed. The research output is expected to support intelligence operatives in finding suitable techniques and software systems for extracting value from document mining and thus facilitate effective knowledge management. Copyright © 2012 Inderscience Enterprises Ltd.