999 results for web harvesting
Abstract:
Current-day web search engines (e.g., Google) do not crawl and index a significant portion of the Web; hence, web users relying on search engines alone are unable to discover and access a large amount of information in the non-indexable part of the Web. Specifically, dynamic pages generated from parameters provided by a user via web search forms (or search interfaces) are not indexed by search engines and cannot be found in searchers' results. Such search interfaces provide web users with online access to myriads of databases on the Web. To obtain information from a web database of interest, a user issues a query by specifying query terms in a search form and receives the query results: a set of dynamic pages that embed the required information from the database. At the same time, issuing a query via an arbitrary search interface is an extremely complex task for any kind of automatic agent, including web crawlers, which, at least up to the present day, do not even attempt to pass through web forms on a large scale. In this thesis, our primary object of study is a huge portion of the Web (hereafter referred to as the deep Web) hidden behind web search interfaces. We concentrate on three classes of problems around the deep Web: characterization of the deep Web, finding and classifying deep web resources, and querying web databases.
Characterizing the deep Web: Though the term deep Web was coined in 2000, which is sufficiently long ago for any web-related concept or technology, we still do not know many important characteristics of the deep Web. Another matter of concern is that existing surveys of the deep Web are predominantly based on studies of deep web sites in English. One can therefore expect that findings from these surveys may be biased, especially owing to the steady increase in non-English web content. Thus, surveying national segments of the deep Web is of interest not only to national communities but to the whole web community as well. In this thesis, we propose two new methods for estimating the main parameters of the deep Web. We use the suggested methods to estimate the scale of one specific national segment of the Web and report our findings. We also build and make publicly available a dataset describing more than 200 web databases from the national segment of the Web.
Finding deep web resources: The deep Web has been growing at a very fast pace, and it has been estimated that there are hundreds of thousands of deep web sites. Due to the huge volume of information in the deep Web, there has been significant interest in approaches that allow users and computer applications to leverage this information. Most approaches assume that search interfaces to the web databases of interest have already been discovered and are known to query systems. However, such assumptions rarely hold, mostly because of the large scale of the deep Web: for any given domain of interest there are too many web databases with relevant content. Thus, the ability to locate search interfaces to web databases becomes a key requirement for any application accessing the deep Web. In this thesis, we describe the architecture of the I-Crawler, a system for finding and classifying search interfaces. Specifically, the I-Crawler is intentionally designed to be used in deep Web characterization studies and for constructing directories of deep web resources.
Unlike almost all other approaches to the deep Web proposed so far, the I-Crawler is able to recognize and analyze JavaScript-rich and non-HTML searchable forms.
Querying web databases: Retrieving information by filling out web search forms is a typical task for a web user, all the more so because the interfaces of conventional search engines are themselves web forms. At present, a user needs to manually provide input values to search interfaces and then extract the required data from the result pages. Manually filling out forms is cumbersome and infeasible for complex queries, yet such queries are essential for many web searches, especially in the area of e-commerce. Thus, automating the querying and retrieval of data behind search interfaces is desirable and essential for tasks such as building domain-independent deep web crawlers and automated web agents, searching for domain-specific information (vertical search engines), and extracting and integrating information from various deep web resources. We present a data model for representing search interfaces and discuss techniques for extracting field labels, client-side scripts, and structured data from HTML pages. We also describe a representation of result pages and discuss how to extract and store the results of form queries. In addition, we present a user-friendly and expressive form query language that allows one to retrieve information behind search interfaces and extract useful data from the result pages based on specified conditions. We implement a prototype system for querying web databases and describe its architecture and component design.
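The abstract does not give implementation details of the I-Crawler's interface classifier, so the following is only a rough sketch of the kind of heuristics that detecting candidate search interfaces can involve; the placeholder URL, the keyword list, and the rules are assumptions for illustration, not the thesis's method.

# Rough illustration of heuristics for spotting candidate search forms on a page.
import requests
from bs4 import BeautifulSoup

SEARCH_HINTS = ("search", "query", "find", "keyword")  # assumed hint words

def looks_like_search_form(form):
    """Heuristic: has a text input, no password field, and a search-like hint somewhere."""
    inputs = form.find_all("input")
    has_text = any((i.get("type") or "text").lower() in ("text", "search") for i in inputs)
    has_password = any((i.get("type") or "").lower() == "password" for i in inputs)
    blob = " ".join(filter(None, [form.get("id"), form.get("action"), form.get("name")]
                           + [i.get("name") for i in inputs])).lower()
    return has_text and not has_password and any(h in blob for h in SEARCH_HINTS)

def find_search_forms(url):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return [f for f in soup.find_all("form") if looks_like_search_form(f)]

if __name__ == "__main__":
    for form in find_search_forms("https://example.org"):  # placeholder URL
        print(form.get("action"), form.get("method", "get"))

Similarly, the form query language and prototype system are only described at a high level; as a toy illustration of submitting a query through a search form and extracting records from the result page, under the assumption of a simple GET form with a field named "q" and a hypothetical CSS selector for result records:

# Toy example of issuing a query through a search form and scraping results.
import requests
from bs4 import BeautifulSoup

def query_web_database(form_url, term, field="q", result_selector="div.result"):
    """Submit one term through a simple GET form and return extracted records."""
    response = requests.get(form_url, params={field: term}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Each matched element is treated as one record embedded in the dynamic result page.
    return [el.get_text(" ", strip=True) for el in soup.select(result_selector)]

if __name__ == "__main__":
    for record in query_web_database("https://example.org/search", "web harvesting"):
        print(record)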
Abstract:
This paper presents the current state and development of a prototype web-GIS (Geographic Information System) decision support platform intended for application in natural hazard and risk management, mainly for floods and landslides. The platform uses open-source geospatial software and technologies, particularly the Boundless (formerly OpenGeo) framework and its client-side software development kit (SDK). Its main purpose is to assist experts and stakeholders in the decision-making process for evaluating and selecting different risk management strategies through an interactive participation approach, integrating a web-GIS interface with a decision support tool based on compromise programming. The access rights and functionality of the platform vary depending on the roles and responsibilities of stakeholders in managing the risk. The application of the prototype platform is demonstrated on an example case study site, the Malborghetto Valbruna municipality in north-eastern Italy, where flash floods and landslides are frequent and major events occurred in 2003. Preliminary feedback collected from stakeholders in the region is discussed to understand their perspectives on the proposed prototype platform.
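The abstract mentions a decision support tool based on compromise programming but gives no formulas, so the following sketch only illustrates how compromise programming typically ranks alternatives by their weighted L_p distance to an ideal point; the strategies, criteria, weights, and scores are invented for the example and are not from the paper.

# Illustrative compromise programming ranking (not the platform's actual code).
def compromise_distance(scores, ideal, worst, weights, p=2.0):
    """Weighted L_p distance of an alternative's scores from the ideal point."""
    total = 0.0
    for s, f_best, f_worst, w in zip(scores, ideal, worst, weights):
        # Normalize the shortfall from the ideal to [0, 1].
        shortfall = (f_best - s) / (f_best - f_worst) if f_best != f_worst else 0.0
        total += (w * shortfall) ** p
    return total ** (1.0 / p)

# Hypothetical risk-management strategies scored on three criteria
# (risk reduction, cost-effectiveness, environmental impact; 0-10, higher is better).
alternatives = {
    "levee_upgrade":   [8.0, 4.0, 5.0],
    "early_warning":   [6.0, 8.0, 9.0],
    "land_use_zoning": [7.0, 7.0, 8.0],
}
weights = [0.5, 0.3, 0.2]  # assumed stakeholder weights

criteria = list(zip(*alternatives.values()))
ideal = [max(c) for c in criteria]   # best observed value per criterion
worst = [min(c) for c in criteria]   # worst observed value per criterion

ranking = sorted(alternatives,
                 key=lambda a: compromise_distance(alternatives[a], ideal, worst, weights))
for name in ranking:
    d = compromise_distance(alternatives[name], ideal, worst, weights)
    print(f"{name}: distance to ideal = {d:.3f}")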
Abstract:
BACKGROUND: Available methods for simulating nucleotide or amino acid data typically use Markov models to simulate each position independently. These approaches are not appropriate for assessing the performance of combinatorial and probabilistic methods that look for coevolving positions in nucleotide or amino acid sequences. RESULTS: We have developed a web-based platform that gives user-friendly access to two phylogenetic methods implementing the Coev model: the evaluation of coevolving scores and the simulation of coevolving positions. We have also extended the capabilities of the Coev model to generalize the alphabet used in the Markov model, so that both nucleotide and amino acid data sets can now be analysed. The simulation of coevolving positions is novel and builds upon the developments of the Coev model; it allows users to simulate pairs of dependent nucleotide or amino acid positions. CONCLUSIONS: The main focus of our paper is the new simulation method we present for coevolving positions. The implementation of this method is embedded in the web platform Coev-web, which is freely accessible at http://coev.vital-it.ch/ and was tested in most modern web browsers.
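The abstract does not specify the Coev model's rate matrix, so the sketch below only illustrates the general idea of simulating a dependent pair of positions with a Markov process on the paired alphabet, as opposed to simulating each position independently; the favoured-pair set, rates, and branch length are assumptions, not the Coev model itself.

# Toy simulation of a coevolving pair of nucleotide positions along one branch.
import random

NUCS = "ACGT"
FAVOURED = {"AT", "TA", "GC", "CG"}          # hypothetical co-adapted pair states
RATE_TO_FAVOURED, RATE_TO_OTHER = 5.0, 0.5   # assumed substitution rates

def neighbours(state):
    """States reachable by changing exactly one of the two positions."""
    return [state[:i] + n + state[i + 1:]
            for i in (0, 1) for n in NUCS if n != state[i]]

def simulate_pair(start="AT", branch_length=2.0, seed=0):
    """Gillespie-style simulation: substitutions into favoured pairs are faster."""
    rng = random.Random(seed)
    t, state = 0.0, start
    while True:
        nbrs = neighbours(state)
        rates = [RATE_TO_FAVOURED if s in FAVOURED else RATE_TO_OTHER for s in nbrs]
        t += rng.expovariate(sum(rates))     # waiting time until the next substitution
        if t > branch_length:
            return state
        state = rng.choices(nbrs, weights=rates, k=1)[0]

# Dependent pairs end up enriched in the favoured combinations.
print([simulate_pair(seed=i) for i in range(10)])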
Abstract:
The article discusses the development of WEBDATANET, established in 2011, which aims to create a multidisciplinary network of web-based data collection experts in Europe. Topics include the participation of 190 experts in 30 European countries and abroad, the establishment of web-based teaching and discussion platforms, and the formation of working groups and task forces. The scope of the research carried out by WEBDATANET is also discussed. In light of the growing importance of web-based data in the social and behavioral sciences, WEBDATANET was established in 2011 as a COST Action (IS 1004) to create a multidisciplinary network of web-based data collection experts: (web) survey methodologists, psychologists, sociologists, linguists, economists, Internet scientists, and media and public opinion researchers. The aim was to accumulate and synthesize knowledge regarding methodological issues of web-based data collection (surveys, experiments, tests, non-reactive data, and mobile Internet research) and to foster its scientific use in a broader community.
Abstract:
Analysis, design, and development of an application for managing a company's internal training.
Abstract:
Online paper web analysis relies on traversing scanners that criss-cross on top of a rapidly moving paper web. The sensors embedded in the scanners measure many important quality variables of the paper, such as basis weight, caliper, and porosity. Most of these quantities vary considerably, and the measurements are noisy at many different scales. The zigzagging nature of scanning makes it difficult to separate machine direction (MD) and cross direction (CD) variability from one another. To improve the 2D resolution of the quality variables above, the paper quality control team at the Department of Mathematics and Physics at LUT has implemented efficient Kalman filtering based methods that currently use 2D Fourier series. Fourier series are global and therefore resolve local spatial detail on the paper web rather poorly. The aim of this thesis is to study alternative wavelet-based representations as candidates to replace the Fourier basis for higher-resolution spatial reconstruction of these quality variables. The accuracy of wavelet-compressed 2D web fields is compared with that of corresponding truncated Fourier series based fields.
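The thesis's Kalman filtering setup is not described in the abstract, so the following sketch only illustrates the underlying comparison it motivates: how a wavelet basis can represent a localized feature in a 2D field with the same coefficient budget as a truncated Fourier representation. The synthetic field, the Daubechies-4 wavelet, and the 5% coefficient budget are assumptions for illustration.

# Compare truncated 2D Fourier vs. wavelet compression on a synthetic field.
import numpy as np
import pywt

rng = np.random.default_rng(0)
x, y = np.meshgrid(np.linspace(0, 1, 128), np.linspace(0, 1, 128))
field = np.sin(4 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)
field += 2.0 * np.exp(-((x - 0.7) ** 2 + (y - 0.3) ** 2) / 0.001)  # localized defect

def keep_largest(coeffs, fraction):
    """Zero out all but the largest `fraction` of coefficients (by magnitude)."""
    flat = np.abs(coeffs).ravel()
    threshold = np.sort(flat)[int((1 - fraction) * flat.size)]
    return np.where(np.abs(coeffs) >= threshold, coeffs, 0)

fraction = 0.05  # keep 5% of the coefficients in both bases

# Truncated 2D Fourier representation.
F = np.fft.fft2(field)
fourier_rec = np.real(np.fft.ifft2(keep_largest(F, fraction)))

# Wavelet-compressed representation (db4 chosen arbitrarily).
coeffs = pywt.wavedec2(field, "db4", level=4)
arr, slices = pywt.coeffs_to_array(coeffs)
wavelet_rec = pywt.waverec2(
    pywt.array_to_coeffs(keep_largest(arr, fraction), slices, output_format="wavedec2"),
    "db4")
wavelet_rec = wavelet_rec[:field.shape[0], :field.shape[1]]  # trim any padding

for name, rec in [("Fourier", fourier_rec), ("Wavelet", wavelet_rec)]:
    rmse = np.sqrt(np.mean((rec - field) ** 2))
    print(f"{name:8s} RMSE with {fraction:.0%} of coefficients: {rmse:.4f}")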
Abstract:
This work investigated good practices and established conventions for building a web application with a long service life. It was found that, over an application's life cycle, most of the costs come from maintenance. Since the goal was to build an application that would be used for a long time, the share of maintenance costs had to be kept as small as possible. Errors detected as early as possible in the software development process reduce repair costs substantially compared to errors detected in the finished product. Therefore, the web application built in this work put particular effort into the early phases of the process: specification and design. Good software development practices have a substantial effect on the maintainability and clarity of a web application. By using an existing application framework and adding functionality with ready-made software components, one obtains an application built according to good practices. The web application implemented in this work was built using an application framework and a component architecture, and the resulting implementation was clear. The application was divided into logical units that handle the views, the database, and the mapping of data between them. Each unit works independently, which helps in maintaining and testing the application.
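The abstract names no specific framework or components, so the following is only a minimal sketch of the kind of layered split it describes, with independently testable units for data access, mapping, and views; the class names, the training-course domain, and the in-memory storage are illustrative assumptions.

# Minimal sketch of a layered split: data layer, mapping/service layer, view layer.
from dataclasses import dataclass

@dataclass
class Course:
    id: int
    title: str

class CourseRepository:
    """Data layer: the only unit that knows how records are stored."""
    def __init__(self):
        self._rows = {1: "Intro to Web Security", 2: "Maintainable Frontends"}

    def find_all(self):
        return [Course(id=k, title=v) for k, v in sorted(self._rows.items())]

class CourseService:
    """Mapping layer: combines and prepares data for the views."""
    def __init__(self, repository: CourseRepository):
        self._repository = repository

    def course_listing(self):
        return [{"id": c.id, "title": c.title} for c in self._repository.find_all()]

class CourseView:
    """View layer: renders whatever the service hands it, nothing more."""
    @staticmethod
    def render(listing):
        return "\n".join(f"{row['id']}: {row['title']}" for row in listing)

# Each layer can be maintained and tested on its own, e.g. by passing a fake repository.
print(CourseView.render(CourseService(CourseRepository()).course_listing()))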