12 resultados para HTML (Language for Labelling Documents)
em Universidad Politécnica de Madrid
Resumo:
En la realización de este proyecto se ha tratado principalmente la temática del web scraping sobre documentos HTML en Android. Como resultado del mismo, se ha propuesto una metodología para poder realizar web scraping en aplicaciones implementadas para este sistema operativo y se desarrollará una aplicación basada en esta metodología que resulte útil a los alumnos de la escuela. Web scraping se puede definir como una técnica basada en una serie de algoritmos de búsqueda de contenido con el fin de obtener una determinada información de páginas web, descartando aquella que no sea relevante. Como parte central, se ha dedicado bastante tiempo al estudio de los navegadores y servidores Web, y del lenguaje HTML presente en casi todas las páginas web en la actualidad así como de los mecanismos utilizados para la comunicación entre cliente y servidor ya que son los pilares en los que se basa esta técnica. Se ha realizado un estudio de las técnicas y herramientas necesarias, aportándose todos los conceptos teóricos necesarios, así como la proposición de una posible metodología para su implementación. Finalmente se ha codificado la aplicación UPMdroid, desarrollada con el fin de ejemplificar la implementación de la metodología propuesta anteriormente y a la vez desarrollar una aplicación cuya finalidad es brindar al estudiante de la ETSIST un soporte móvil en Android que le facilite el acceso y la visualización de aquellos datos más importantes del curso académico como son: el horario de clases y las calificaciones de las asignaturas en las que se matricule. Esta aplicación, además de implementar la metodología propuesta, es una herramienta muy interesante para el alumno, ya que le permite utilizar de una forma sencilla e intuitiva gran número de funcionalidades de la escuela solucionando así los problemas de visualización de contenido web en los dispositivos. ABSTRACT. The main topic of this project is about the web scraping over HTML documents on Android OS. As a result thereof, it is proposed a methodology to perform web scraping in deployed applications for this operating system and based on this methodology that is useful to the ETSIST school students. Web scraping can be defined as a technique based on a number of content search algorithms in order to obtain certain information from web pages, discarding those that are not relevant. As a main part, has spent considerable time studying browsers and Web servers, and the HTML language that is present today in almost all websites as well as the mechanisms used for communication between client and server because they are the pillars which this technique is based. We performed a study of the techniques and tools needed, providing all the necessary theoretical concepts, as well as the proposal of a possible methodology for implementation. Finally it has codified UPMdroid application, developed in order to illustrate the implementation of the previously proposed methodology and also to give the student a mobile ETSIST Android support to facilitate access and display those most important data of the current academic year such as: class schedules and scores for the subjects in which you are enrolled. This application, in addition to implement the proposed methodology is also a very interesting tool for the student, as it allows a simple and intuitive way of use these school functionalities thus fixing the viewing web content on devices.
Resumo:
We discuss from a practical point of view a number of ssues involved in writing distributed Internet and WWW applications using LP/CLP systems. We describe PiLLoW, a publicdomain Internet and WWW programming library for LP/CLP systems that we have designed in order to simplify the process of writing such applications. PiLLoW provides facilities for accessing documents and code on the WWW; parsing, manipulating and generating HTML and XML structured documents and data; producing HTML forms; writing form handlers and CGI-scripts; and processing HTML/XML templates. An important contribution of PÍ'LLOW is to model HTML/XML code (and, thus, the content of WWW pages) as terms. The PÍ'LLOW library has been developed in the context of the Ciao Prolog system, but it has been adapted to a number of popular LP/CLP systems, supporting most of its functionality. We also describe the use of concurrency and a highlevel model of client-server interaction, Ciao Prolog's active modules, in the context of WWW programming. We propose a solution for client-side downloading and execution of Prolog code, using generic browsers. Finally, we also provide an overview of related work on the topic.
Resumo:
This paper describes a preprocessing module for improving the performance of a Spanish into Spanish Sign Language (Lengua de Signos Espanola: LSE) translation system when dealing with sparse training data. This preprocessing module replaces Spanish words with associated tags. The list with Spanish words (vocabulary) and associated tags used by this module is computed automatically considering those signs that show the highest probability of being the translation of every Spanish word. This automatic tag extraction has been compared to a manual strategy achieving almost the same improvement. In this analysis, several alternatives for dealing with non-relevant words have been studied. Non-relevant words are Spanish words not assigned to any sign. The preprocessing module has been incorporated into two well-known statistical translation architectures: a phrase-based system and a Statistical Finite State Transducer (SFST). This system has been developed for a specific application domain: the renewal of Identity Documents and Driver's License. In order to evaluate the system a parallel corpus made up of 4080 Spanish sentences and their LSE translation has been used. The evaluation results revealed a significant performance improvement when including this preprocessing module. In the phrase-based system, the proposed module has given rise to an increase in BLEU (Bilingual Evaluation Understudy) from 73.8% to 81.0% and an increase in the human evaluation score from 0.64 to 0.83. In the case of SFST, BLEU increased from 70.6% to 78.4% and the human evaluation score from 0.65 to 0.82.
Resumo:
This paper proposes the use of Factored Translation Models (FTMs) for improving a Speech into Sign Language Translation System. These FTMs allow incorporating syntactic-semantic information during the translation process. This new information permits to reduce significantly the translation error rate. This paper also analyses different alternatives for dealing with the non-relevant words. The speech into sign language translation system has been developed and evaluated in a specific application domain: the renewal of Identity Documents and Driver’s License. The translation system uses a phrase-based translation system (Moses). The evaluation results reveal that the BLEU (BiLingual Evaluation Understudy) has improved from 69.1% to 73.9% and the mSER (multiple references Sign Error Rate) has been reduced from 30.6% to 24.8%.
Resumo:
We describe lpdoc, a tool which generates documentation manuals automatically from one or more logic program source files, written in Ciao, ISO-Prolog, and other (C)LP languages. It is particularly useful for documenting library modules, for which it automatically generates a rich description of the module interface. However, it can also be used quite successfully to document full applications. A fundamental advantage of using lpdoc is that it helps maintaining a true correspondence between the program and its documentation, and also identifying precisely to what versión of the program a given printed manual corresponds. The quality of the documentation generated can be greatly enhanced by including within the program text assertions (declarations with types, modes, etc. ...) for the predicates in the program, and machine-readable comments. One of the main novelties of lpdoc is that these assertions and comments are written using the Ciao system asseriion language, which is also the language of communication between the compiler and the user and between the components of the compiler. This allows a significant synergy among specification, debugging, documentation, optimization, etc. A simple compatibility library allows conventional (C)LP systems to ignore these assertions and comments and treat normally programs documented in this way. The documentation can be generated interactively from emacs or from the command line, in many formats including texinfo, dvi, ps, pdf, info, ascii, html/css, Unix nroff/man, Windows help, etc., and can include bibliographic citations and images, lpdoc can also genérate "man" pages (Unix man page format), nicely formatted plain ASCII "readme" files, installation scripts useful when the manuals are included in software distributions, brief descriptions in html/css or info formats suitable for inclusión in on-line Índices of manuals, and even complete WWW and info sites containing on-line catalogs of documents and software distributions. The lpdoc manual, all other Ciao system manuals, and parts of this paper are generated by lpdoc.
Resumo:
We describe lpdoc, a tool which generates documentation manuals automatically from one or more logic program source files, written in ISO-Prolog, Ciao, and other (C)LP languages. It is particularly useful for documenting library modules, for which it automatically generates a rich description of the module interface. However, it can also be used quite successfully to document full applications. A fundamental advantage of using lpdoc is that it helps maintaining a true correspondence between the program and its documentation, and also identifying precisely to what version of the program a given printed manual corresponds. The quality of the documentation generated can be greatly enhanced by including within the program text assertions (declarations with types, modes, etc.) for the predicates in the program, and machine-readable comments. One of the main novelties of lpdoc is that these assertions and comments are written using the Ciao system assertion language, which is also the language of communication between the compiler and the user and between the components of the compiler. This allows a significant synergy among specification, documentation, optimization, etc. A simple compatibility library allows conventional (C)LP systems to ignore these assertions and comments and treat normally programs documented in this way. The documentation can be generated in many formats including texinfo, dvi, ps, pdf, info, html/css, Unix nroff/man, Windows help, etc., and can include bibliographic citations and images. lpdoc can also generate “man” pages (Unix man page format), nicely formatted plain ascii “readme” files, installation scripts useful when the manuals are included in software distributions, brief descriptions in html/css or info formats suitable for inclusion in on-line indices of manuals, and even complete WWW and info sites containing on-line catalogs of documents and software distributions. The lpdoc manual, all other Ciao system manuals, and parts of this paper are generated by lpdoc.
Resumo:
We describe lpdoc, a tool which generates documentation manuals automatically from one or more logic program source files, written in ISO-Prolog, Ciao, and other (C)LP languages. It is particularly useful for documenting library modules, for which it automatically generates a rich description of the module interface. However, it can also be used quite successfully to document full applications. The documentation can be generated in many formats including t e x i n f o, dvi, ps, pdf, inf o, html/css, Unix nrof f/man, Windows help, etc., and can include bibliographic citations and images, lpdoc can also genérate "man" pages (Unix man page format), nicely formatted plain ascii "readme" files, installation scripts useful when the manuals are included in software distributions, brief descriptions in html/css or inf o formats suitable for inclusión in on-line Índices of manuals, and even complete WWW and inf o sites containing on-line catalogs of documents and software distributions. A fundamental advantage of using lpdoc is that it helps maintaining a true correspondence between the program and its documentation, and also identifying precisely to what versión of the program a given printed manual corresponds. The quality of the documentation generated can be greatly enhanced by including within the program text assertions (declarations with types, modes, etc. ...) for the predicates in the program, and machine-readable comments. These assertions and comments are written using the Ciao system assertion language. A simple compatibility library allows conventional (C)LP systems to ignore these assertions and comments and treat normally programs documented in this way. The lpdoc manual, all other Ciao system manuals, and most of this paper, are generated by lpdoc.
Resumo:
Modularity allows the construction of complex designs from simpler, independent units that most of the time can be developed separately. In this paper we are concerned with developing mechanisms for easily implementing modular extensions to modular (logic) languages. By (language) extensions we refer to different groups of syntactic definitions and translation rules that extend a language. Our application of the concept of modularity in this context is twofold. We would like these extensions to be modular, in the above sense, i.e., we should be able to develop different extensions mostly separately. At the same time, the sources and targets for the extensions are modular languages, i.e., such extensions may take as input separate pieces of code and also produce separate pieces of code. Dealing with this double requirement involves interesting challenges to ensure that modularity is not broken: first, combinations of extensions (as if they were a single extension) must be given a precise meaning. Also, the separate translation of multiple sources (as if they were a single source) must be feasible. We present a detailed description of a code expansion-based framework that proposes novel solutions for these problems. We argue that the approach, while implemented for Ciao, can be adapted for other languages and Prolog-based systems.
Resumo:
Lpdoc is an automatic program documentation generator for (C)LP systems. Lpdoc generates a reference manual automatically from one or more source files for a logic program (including ISO-Prolog, Ciao, many CLP systems, ...). It is particularly useful for documenting library modules, for which it automatically generates a description of the module interface. However, lpdoc can also be used quite successfully to document full applications and to generate nicely formatted plain ascii "readme" files. A fundamental advantage of using lpdoc to document programs is that it is much easier to maintain a true correspondence between the program and its documentation, and to identify precisely to what version of the program a given printed manual corresponds. The quality of the documentation generated can be greatly enhanced by including within the program text: • assertions (types, modes, etc. ...) for the predicates in the program, and • machine-readable comments (in the "literate programming" style). The assertions and comments included in the source file need to be written using the Ciao system assertion language. A simple compatibility library is available to make traditional (constraint) logic programming systems ignore these assertions and comments allowing normal treatment of programs documented in this way. The documentation is currently generated in HTML or texinf o format. From the texinf o output, printed and on-line manuals in several formats (dvi, ps, info, etc.) can be easily generated automatically, using publicly available tools, lpdoc can also generate 'man' pages (Unix man page format) as well as brief descriptions in html or emacs info formats suitable for inclusion in an on-line index of applications. In particular, lpdoc can create and maintain fully automatically WWW and info sites containing on-line versions of the documents it produces. The lpdoc manual (and the Ciao system manuals) are generated by lpdoc. Lpdoc is distributed under the GNU general public license. Note: lpdoc is fully supported on Linux, Mac OS X, and other Un*x-like systems. Due to the use of several Un*x-related utilities, some documentation back-ends may require Cygwin under Win32. This documentation corresponds to version 3.0 (2011/7/7, 16:33:15 CEST).
Resumo:
We describe a simple, public domain, HTML package for LP/CLP systems. The package allows generating HTML documents easily from LP/CLP systems, including HTML forms. It also provides facilities for parsing the input provided by HTML forms, as well as for creating standalone form handlers. The purpose of this document is to serve as a user's manual as well as a short description of the capabilities of the package. The package was originally developed for SICStus Prolog and the UPM &-Prolog/CIAO systems, but has been adapted to a number of popular LP/CLP systems. The document is also a WWW/HTML primer, containing sufficient information for developing medium complexity WWW applications in Prolog and other LP and CLP languages.
Resumo:
Patent claims defi ne the protection scope of the intellectual property sought by the patent applicant or patentee. Broad claims are valuable as they can describe more expansive rights to the invention. Therefore, if these claims are too broad a potential infringer will more easily argue against them. But if the claims are too narrow the scope of protection of the intellectual property is greatly reduced. Patent claims have to be, on the one hand, determinate and precise enough and, on the other hand, as inclusive as possible. Therefore patent applicants must fi nd a balance in the broadness of the scope defi ned by their claims. This balance can be achieved by the choice of words with a convenient degree of semantic indeterminacy, by the choice of modifi ers or other strategies. In fact, vagueness in patent claims is a desirable characteristic for such documents. A quantitative and qualitative analysis of a corpus of 350 U.S. patents provides a promising starting point to understand the linguistic instruments used to achieve the balance between property claim scope and precision of property description. To conclude, some issues relating vagueness and pragmatics are suggested as a line of further research.
Resumo:
Language resources, such as multilingual lexica and multilingual electronic dictionaries, contain collections of lexical entries in several languages. Having access to the corresponding explicit or implicit translation relations between such entries might be of great interest for many NLP-based applications. By using Semantic Web-based techniques, translations can be available on the Web to be consumed by other (semantic enabled) resources in a direct manner, not relying on application-specific formats. To that end, in this paper we propose a model for representing translations as linked data, as an extension of the lemon model. Our translation module represents some core information associated to term translations and does not commit to specific views or translation theories. As a proof of concept, we have extracted the translations of the terms contained in Terminesp, a multilingual terminological database, and represented them as linked data. We have made them accessible on the Web both for humans (via a Web interface) and software agents (with a SPARQL endpoint).