982 results for Web documents
Abstract:
Magdeburg, Univ., Fak. für Informatik, Diss., 2010
Abstract:
A vast amount of temporal information is provided on the Web. Even though many facts expressed in documents are time-related, the temporal properties of Web presentations have not received much attention. In database research, temporal databases have become a mainstream topic in recent years. In Web documents, temporal data may exist as metadata in the header and as user-directed data in the body of a document. Whereas temporal data can easily be identified in the semi-structured metadata, it is more difficult to determine temporal data and its role in the body. We propose procedures for maintaining the temporal integrity of Web pages and outline different approaches to applying bitemporal data concepts to Web documents. In particular, we consider desirable functionalities of Web repositories and other Web-related tools that may support Webmasters in managing the temporal data of their Web documents. Some properties of a prototype environment are described.
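As an illustration of the bitemporal idea referred to above, the following sketch (hypothetical names, not taken from the thesis) attaches both a valid-time interval and a transaction-time stamp to a time-related statement in a Web page and uses them for a simple temporal-integrity check:

    from dataclasses import dataclass
    from datetime import date
    from typing import Optional

    @dataclass
    class BitemporalFact:
        """A time-related statement in a Web document, tracked on two time lines."""
        text: str
        valid_from: date               # when the statement holds in the real world
        valid_to: date                 # end of real-world validity
        recorded_at: date              # transaction time: when it was published on the page
        superseded_at: Optional[date] = None  # when a later revision replaced it, if ever

        def is_current(self, today: date) -> bool:
            # Temporal integrity: only show facts that are still valid and not superseded.
            return self.valid_from <= today <= self.valid_to and self.superseded_at is None

    deadline = BitemporalFact("Paper deadline: 30 June 2010",
                              valid_from=date(2010, 1, 5), valid_to=date(2010, 6, 30),
                              recorded_at=date(2010, 1, 5))
    print(deadline.is_current(date(2010, 7, 1)))  # False: the page is now outdated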
Abstract:
Web document cluster analysis plays an important role in information retrieval by organizing large amounts of documents into a small number of meaningful clusters. Traditional web document clustering is based on the Vector Space Model (VSM), which takes into account only two levels of knowledge granularity (document and term) but ignores the bridging paragraph granularity. However, this two-level granularity may lead to unsatisfactory clustering results with “false correlation”. To deal with this problem, a Hierarchical Representation Model with Multi-granularity (HRMM), which consists of a five-layer representation of data and a two-phase clustering process, is proposed based on granular computing and article structure theory. To deal with the zero-valued similarity problem resulting from the sparse term-paragraph matrix, an ontology-based strategy and a tolerance-rough-set-based strategy are introduced into HRMM. By using granular computing, structural knowledge hidden in documents can be captured more efficiently and effectively in HRMM, and web document clusters of higher quality can thus be generated. Extensive experiments show that HRMM, HRMM with the tolerance-rough-set strategy, and HRMM with the ontology strategy all significantly outperform VSM and a representative non-VSM-based algorithm, WFP, in terms of F-Score.
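The zero-valued similarity problem mentioned above can be reproduced with a few lines of plain VSM code; the example below is illustrative only, not the HRMM implementation, and computes cosine similarity between two paragraph vectors that share no terms:

    import math
    from collections import Counter

    def cosine(a: Counter, b: Counter) -> float:
        """Cosine similarity of two bag-of-words vectors (plain VSM)."""
        dot = sum(a[t] * b[t] for t in a)
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    # Two paragraphs on the same topic but with disjoint vocabularies:
    p1 = Counter("web page clustering uses term vectors".split())
    p2 = Counter("site document grouping relies on word frequencies".split())
    print(cosine(p1, p2))  # 0.0 although the paragraphs are clearly related

The ontology-based and tolerance-rough-set-based strategies are precisely meant to recover such hidden relatedness between sparse paragraph vectors.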
Abstract:
Document engineering is the computer science discipline that investigates systems for documents in any form and in all media. As with the relationship between software engineering and software, document engineering is concerned with principles, tools and processes that improve our ability to create, manage, and maintain documents (http://www.documentengineering.org). The ACM Symposium on Document Engineering is an annual meeting of researchers active in document engineering; it is sponsored by ACM through the ACM SIGWEB Special Interest Group. In this editorial, we first point to work carried out in the context of document engineering that is directly related to multimedia tools and applications. We conclude with a summary of the papers presented in this special issue.
Abstract:
Besides the article that forms the main content, most HTML documents on the WWW contain additional content such as navigation menus, design elements or commercial banners. In the context of several applications it is necessary to draw the distinction between main and additional content automatically. Content extraction and template detection are the two approaches to this task. This thesis gives an extensive overview of existing algorithms from both areas. It contributes an objective way to measure and evaluate the performance of content extraction algorithms under different aspects. These evaluation measures make it possible to draw the first objective comparison of existing extraction solutions. The newly introduced content code blurring algorithm overcomes several drawbacks of previous approaches and currently proves to be the best content extraction algorithm. An analysis of methods for clustering web documents according to their underlying templates is the third major contribution of this thesis. In combination with a localised crawling process, this clustering analysis can be used to automatically create sets of training documents for template detection algorithms. As the whole process can be automated, it makes it possible to perform template detection on a single document, thereby combining the advantages of single- and multi-document algorithms.
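The thesis's content code blurring algorithm is not reproduced here, but the underlying intuition, that main content is text-rich while template content is markup-rich, can be sketched with a crude character-level ratio (the threshold and function names are illustrative assumptions):

    import re

    def text_markup_ratio(line: str) -> float:
        """Fraction of a line's characters that are plain text rather than markup."""
        text = re.sub(r"<[^>]*>", "", line)
        return len(text.strip()) / len(line) if line.strip() else 0.0

    def extract_main_content(html: str, threshold: float = 0.5) -> str:
        # Keep text-dominated lines; markup-dominated lines are likely navigation
        # menus, banners or other template content.
        return "\n".join(line for line in html.splitlines()
                         if text_markup_ratio(line) > threshold)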
Abstract:
Web-scale knowledge retrieval can be enabled by distributed information retrieval, clustering Web clients into a large-scale computing infrastructure for knowledge discovery from Web documents. Based on this infrastructure, we propose to apply semiotic (i.e., sub-syntactical) and inductive (i.e., probabilistic) methods for inferring concept associations in human knowledge. These associations can be combined to form a fuzzy (i.e., gradual) semantic net representing a map of the knowledge in the Web. Thus, we propose to provide interactive visualizations of these cognitive concept maps to end users, who can browse and search the Web in a human-oriented, visual, and associative interface.
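A minimal sketch of how such gradual concept associations could be estimated from co-occurrence in crawled documents is given below; the Jaccard-style weighting is an assumption for illustration, not necessarily the method proposed in the paper:

    from collections import Counter
    from itertools import combinations

    def association_strengths(documents):
        """Estimate gradual (0..1) associations between concepts from co-occurrence."""
        single, pair = Counter(), Counter()
        for concepts in documents:           # each document is a set of concept labels
            single.update(concepts)
            pair.update(frozenset(p) for p in combinations(sorted(concepts), 2))
        edges = {}
        for p, together in pair.items():
            a, b = sorted(p)
            edges[(a, b)] = together / (single[a] + single[b] - together)  # Jaccard weight
        return edges

    docs = [{"semantic", "web"}, {"semantic", "net"}, {"web", "crawler"}, {"semantic", "web"}]
    print(association_strengths(docs))  # ('semantic', 'web') gets the highest weight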
Abstract:
This dissertation research points out major challenging problems with current Knowledge Organization (KO) systems, such as subject gateways or web directories: (1) the current systems use traditional knowledge organization schemes based on controlled vocabulary, which are not very well suited to web resources, and (2) information is organized by professionals rather than by users, so it does not reflect users' intuitively and instantaneously expressed current needs. In order to explore users' needs, I examined social tags, which are user-generated uncontrolled vocabulary. As investment in professionally developed subject gateways and web directories diminishes (support for both BUBL and Intute, examined in this study, is being discontinued), understanding the characteristics of social tagging becomes even more critical. Several researchers have discussed social tagging behavior and its usefulness for classification or retrieval; however, further research is needed to investigate social tagging qualitatively and quantitatively in order to verify its quality and benefit. This research particularly examined the indexing consistency of social tagging in comparison to professional indexing, to examine the quality and efficacy of tagging. The data analysis was divided into three phases: analysis of indexing consistency, analysis of tagging effectiveness, and analysis of tag attributes. Most indexing consistency studies have been conducted with a small number of professional indexers and have tended to exclude users; furthermore, they have mainly focused on physical library collections. This dissertation research bridged these gaps by (1) extending the scope of resources to various web documents indexed by users and (2) employing the Information Retrieval (IR) Vector Space Model (VSM)-based indexing consistency method, since it is suitable for dealing with a large number of indexers. As a second phase, an analysis of tagging effectiveness in terms of tagging exhaustivity and tag specificity was conducted to ameliorate the drawbacks of consistency analysis based only on quantitative measures of vocabulary matching. Finally, to investigate tagging patterns and behaviors, a content analysis of tag attributes was conducted based on the FRBR model. The findings revealed that there was greater consistency across all subjects among taggers than among the two groups of professionals. Examination of the exhaustivity and specificity of social tags provided insights into particular characteristics of tagging behavior and its variation across subjects. To further investigate the quality of tags, a Latent Semantic Analysis (LSA) was conducted to determine to what extent tags are conceptually related to professionals' keywords; it was found that tags of higher specificity tended to have higher semantic relatedness to professionals' keywords. This leads to the conclusion that a term's power as a differentiator is related to its semantic relatedness to documents. The findings on tag attributes identified important bibliographic attributes of tags beyond describing the subjects or topics of a document. The findings also showed that tags have essential attributes matching those defined in FRBR.
Furthermore, in terms of specific subject areas, the findings newly identified that taggers exhibited different tagging behaviors, with distinctive features and tendencies, on web documents characterizing heterogeneous digital media resources. These results lead to the conclusion that there should be an increased awareness of diverse user needs by subject in order to improve metadata in practical applications. This dissertation research is the first necessary step toward utilizing social tagging in digital information organization by verifying the quality and efficacy of social tagging. It combined quantitative (statistical) and qualitative (content analysis using FRBR) approaches to the vocabulary analysis of tags, which provided a more complete examination of tag quality. Through the detailed analysis of tag properties undertaken in this dissertation, we have a clearer understanding of the extent to which social tagging can be used to replace (and in some cases improve upon) professional indexing.
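For readers unfamiliar with the VSM-based consistency measure referred to above, the following sketch compares the aggregated tag vector of many taggers with a professional indexer's terms for the same web document; the weighting is a simplifying assumption, not the dissertation's exact formula:

    import math
    from collections import Counter

    def vsm_consistency(tag_assignments, professional_terms):
        """Cosine similarity between aggregated tagger vocabulary and professional terms."""
        tagger_vec = Counter(t for tags in tag_assignments for t in tags)
        prof_vec = Counter(professional_terms)
        dot = sum(tagger_vec[t] * prof_vec[t] for t in prof_vec)
        norm = (math.sqrt(sum(v * v for v in tagger_vec.values()))
                * math.sqrt(sum(v * v for v in prof_vec.values())))
        return dot / norm if norm else 0.0

    print(vsm_consistency([["python", "tutorial"], ["python", "howto"], ["programming"]],
                          ["python", "programming"]))  # about 0.80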
Abstract:
While multimedia data, image data in particular, is an integral part of most websites and web documents, our quest for information so far is still restricted to text-based search. To explore the World Wide Web more effectively, especially its rich repository of truly multimedia information, we face a number of challenging problems. Firstly, we face the ambiguous and highly subjective nature of defining image semantics and similarity. Secondly, multimedia data can come from highly diversified sources, as a result of automatic image capturing and generation processes. Finally, multimedia information exists in decentralised sources over the Web, making it difficult to use conventional content-based image retrieval (CBIR) techniques for effective and efficient search. In this special issue, we present a collection of five papers on visual and multimedia information management and retrieval topics, addressing some aspects of these challenges. These papers have been selected from the conference proceedings (Kluwer Academic Publishers, ISBN: 1-4020-7060-8) of the Sixth IFIP 2.6 Working Conference on Visual Database Systems (VDB6), held in Brisbane, Australia, on 29–31 May 2002.
Abstract:
The objective of the PANACEA ICT-2007.2.2 EU project is to build a platform that automates the stages involved in the acquisition, production, updating and maintenance of the large language resources required by, among others, MT systems. The development of a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the web is one of the most innovative building blocks of PANACEA. The CAC, which is the first stage in the PANACEA pipeline for building Language Resources, adopts an efficient and distributed methodology to crawl for web documents with rich textual content in specific languages and predefined domains. The CAC includes modules that can acquire parallel data from sites with in-domain content available in more than one language. In order to extrinsically evaluate the CAC methodology, we conducted several experiments that used crawled parallel corpora for the identification and extraction of parallel sentences using sentence alignment. The corpora were then successfully used for domain adaptation of Machine Translation systems.
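As a rough illustration of the kind of filtering a focused crawler such as the CAC applies, the sketch below keeps a fetched page only if it has enough running text and enough hits against a predefined domain term list; the thresholds and the seed list are purely hypothetical and not taken from PANACEA:

    def looks_in_domain(text, seed_terms, min_words=200, min_hits=5):
        """Crude check for 'rich textual content in a predefined domain'."""
        words = text.lower().split()
        hits = sum(words.count(term) for term in seed_terms)
        return len(words) >= min_words and hits >= min_hits

    # Hypothetical seed list for an environment domain:
    seeds = {"environment", "emission", "climate", "pollution", "energy"}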
Abstract:
Information and communication technologies have enabled cultural and scientific heritage, and information in general, to be presented in digital format as well as in traditional analogue formats. The response was immediate, and since the 1990s different projects have been designed to guarantee permanent access to digital production: retrieval, storage, handling, preservation and dissemination. This article presents an international overview of existing models of national digital repositories, the name given to these projects, which are normally created by national libraries with a common objective: ensuring that web pages are always accessible.
Abstract:
Summary: Functional cataloguing of web documents
Abstract:
My bachelor's thesis examines Invoice trading and the advantages it offers compared with conventional tax-free trade. In the Invoice tax refund system, a customer residing outside the EU receives the tax refund from the same shop on their next visit. VAT refunds must, however, be claimed within six months of making the purchases. In conventional tax-free trade, the customer receives the VAT refund at the border when leaving Finland. With Invoice, the customer gets a larger share of the tax back than in conventional tax-free trade, but receiving the refund takes longer, because the refund can only be obtained from the same shop where the purchases were made. My thesis examines the topic from the merchant's perspective. From the merchant's point of view, a particular advantage of Invoice is "hooking" customers, since refunds must always be claimed from the same shop where the products were bought. As a result, the same customers often return to the same shop on subsequent trips to Finland to collect their refunds. This often also brings shops new regular customers. On the other hand, the additional work and costs that using Invoice may cause for the merchant and cashiers must also be taken into account. Paying refunds back to customers and handling receipts stamped at customs takes more time than usual at the tills and may require additional staff. The thesis was carried out as a qualitative study, using interviews as the research method. The interviewees are merchants from the South-East Finland region. My aim was to assemble as diverse a group of interviewees as possible, including clothing and leisure shops as well as general and grocery stores. As theoretical sources I used books borrowed from the university library, articles from the LUT databases and the Edilex database, and documents and online publications of the Finnish Tax Administration. In addition, I have also drawn on current news items and articles from various local newspapers and magazines. In the conclusions of my thesis I found that the most advantageous option for the merchant is to use both Invoice and the traditional tax-free system based on refund operators at the same time. This gives shops access to the widest possible customer base. Shopping tourists who visit Finland frequently generally prefer Invoice because of the full tax refund, whereas refund operators charge their own service fee on the refund paid to the customer. For those who visit Finland less often, the services of refund operators are more advantageous, since refunds are received at the border when leaving the country and there is no need to return to the same shop within six months. Compared with Invoice, refund operators also offer greater convenience, as shopping tourists who shop in different stores receive their VAT refunds for all purchases made during the trip from a single place, instead of collecting them separately from each shop.