5 resultados para Cross Document Structure Theory
em Nottingham eTheses
Resumo:
A strategy for document analysis is presented which uses Portable Document Format (PDF the underlying file structure for Adobe Acrobat software) as its starting point. This strategy examines the appearance and geometric position of text and image blocks distributed over an entire document. A blackboard system is used to tag the blocks as a first stage in deducing the fundamental relationships existing between them. PDF is shown to be a useful intermediate stage in the bottom-up analysis of document structure. Its information on line spacing and font usage gives important clues in bridging the semantic gap between the scanned bitmap page and its fully analysed, block-structured form. Analysis of PDF can yield not only accurate page decomposition but also sufficient document information for the later stages of structural analysis and document understanding.
Resumo:
Portable Document Format (PDF) is a page-oriented, graphically rich format based on PostScript semantics and it is also the format interpreted by the Adobe Acrobat viewers. Although each of the pages in a PDF document is an independent graphic object this property does not necessarily extend to the components (headings, diagrams, paragraphs etc.) within a page. This, in turn, makes the manipulation and extraction of graphic objects on a PDF page into a very difficult and uncertain process. The work described here investigates the advantages of a model wherein PDF pages are created from assemblies of COGs (Component Object Graphics) each with a clearly defined graphic state. The relative positioning of COGs on a PDF page is determined by appropriate "spacer" objects and a traversal of the tree of COGs and spacers determines the rendering order. The enhanced revisability of PDF documents within the COG model is discussed, together with the application of the model in those contexts which require easy revisability coupled with the ability to maintain and amend PDF document structure.
Resumo:
Documents are often marked up in XML-based tagsets to delineate major structural components such as headings, paragraphs, figure captions and so on, without much regard to their eventual displayed appearance. And yet these same abstract documents, after many transformations and 'typesetting' processes, often emerge in the popular format of Adobe PDF, either for dissemination or archiving. Until recently PDF has been a totally display-based document representation, relying on the underlying PostScript semantics of PDF. Early versions of PDF had no mechanism for retaining any form of abstract document structure but recent releases have now introduced an internal structure tree to create the so called 'Tagged PDF'. This paper describes the development of a plugin for Adobe Acrobat which creates a two-window display. In one window is shown an XML document original and in the other its Tagged PDF counterpart is seen, with an internal structure tree that, in some sense, matches the one seen in XML. If a component is highlighted in either window then the corresponding structured item, with any attendant text, is also highlighted in the other window. Important applications of correctly Tagged PDF include making PDF documents reflow intelligently on small screen devices and enabling them to be read out in correct reading order, via speech synthesiser software, for the visually impaired. By tracing structure transformation from source document to destination one can implement the repair of damaged PDF structure or the adaptation of an existing structure tree to an incrementally updated document.
Resumo:
As the amount of material on the World Wide Web continues to grow, users are discovering that the Web's embedded, hard-coded, links are difficult to maintain and update. Hyperlinks need a degree of abstraction in the way they are specified together with a sound underlying document structure and the property of separability from the documents they are linking. The case is made by studying the advantages of program/data separation in computer system architectures and also by re-examining some selected hypermedia systems that have already implemented separability. The prospects for introducing more abstract links into future versions of HTML and PDF, via emerging standards such as XPath, XPointer XLink and URN, are briefly discussed.
Resumo:
The purpose of this paper is twofold. Firstly it presents a preliminary and ethnomethodologically-informed analysis of the way in which the growing structure of a particular program's code was ongoingly derived from its earliest stages. This was motivated by an interest in how the detailed structure of completed program `emerged from nothing' as a product of the concrete practices of the programmer within the framework afforded by the language. The analysis is broken down into three sections that discuss: the beginnings of the program's structure; the incremental development of structure; and finally the code productions that constitute the structure and the importance of the programmer's stock of knowledge. The discussion attempts to understand and describe the emerging structure of code rather than focus on generating `requirements' for supporting the production of that structure. Due to time and space constraints, however, only a relatively cursory examination of these features was possible. Secondly the paper presents some thoughts on the difficulties associated with the analytic---in particular ethnographic---study of code, drawing on general problems as well as issues arising from the difficulties and failings encountered as part of the analysis presented in the first section.