986 results for documents éphémères
Abstract:
Apart from the article forming the main content, most HTML documents on the WWW contain additional content such as navigation menus, design elements or commercial banners. Several applications need to distinguish automatically between main and additional content. Content extraction and template detection are the two approaches to solving this task. This thesis gives an extensive overview of existing algorithms from both areas. It contributes an objective way to measure and evaluate the performance of content extraction algorithms under different aspects. These evaluation measures make possible the first objective comparison of existing extraction solutions. The newly introduced content code blurring algorithm overcomes several drawbacks of previous approaches and proves to be the best content extraction algorithm at the moment. An analysis of methods to cluster web documents according to their underlying templates is the third major contribution of this thesis. In combination with a localised crawling process, this clustering analysis can be used to automatically create sets of training documents for template detection algorithms. As the whole process can be automated, it allows template detection to be performed on a single document, thereby combining the advantages of single- and multi-document algorithms.
Abstract:
The aim of this dissertation is to identify the most appropriate technologies for building parametric editors for structured documents and to describe LIME, a parametric, language-independent markup editor. The recent evolution of XML technologies has led to increasingly widespread use of structured documents. Today these are used both for typesetting purposes and for data interchange over the Internet. For this reason, more and more people deal with XML documents in their daily work. Some XML dialects, however, are not easy to understand and use, so XML editors are needed that can guide authors of XML documents through the whole markup process. In some contexts, especially legal informatics, markup editors have been introduced: WYSIWYG software that assists the user in creating correct documents. These editors can also be used by people who do not know XML in depth but, on the other hand, they are usually based on one specific XML language. This means that considerable programming resources are needed to adapt them to other XML languages or other contexts. By basing the architecture of markup editors on parameters, it is possible to design and develop software that does not depend on a specific XML language and that can be customised for use in a variety of contexts.
Abstract:
From the beginning of the standardisation of language in Bosnia and Herzegovina, i.e. from the acceptance of Karadžić's phonetic spelling in the mid-19th century, to the present day, when there are three different language standards in force (Bosniac (Muslim), Croatian and Serbian), language in Bosnia and Herzegovina has been a subject of political conflict. Documents on language policy from this period show the degree to which domestic and foreign political factors influenced the standard language issue, beginning with the very appellation for the specific norm regulation. The material analysed (proclamations by political, cultural and other organisations as well as corresponding constitutional and statutory provisions on language use) shows the differing treatment of the standard language in Bosnia and Herzegovina in different historical periods. During the period of Turkish rule (until 1878) there was no real political interest in the issue. Under Austro-Hungarian rule (1878-1918) there was an attempt to use the language as a means of forming a united Bosnian nation, but this was later abandoned. During the first Yugoslavia (1918-1941) a uniform solution was imposed on Bosnia and Herzegovina, as throughout the Serbo-Croatian language area, while under the Independent State of Croatia (1941-1945), the official language of Bosnia and Herzegovina was Croatian. The period from 1945 to 1991 had two phases: the first, a standard-language unity of Serbs, Croats, Muslims and Montenegrins (until 1965), and the second, a gradual but stormy separation of national languages, which has been largely completed since 1991. The introductory study includes a detailed analysis of all the expressions used, with special reference to the present state, and accompanies the collection of documents which represent the main outcome of the research.