5 results for web content
Abstract:
A rapidly increasing number of Web databases are now accessible via
their HTML form-based query interfaces. Query result pages are dynamically generated
in response to user queries, which encode structured data and are displayed for human
use. Query result pages usually contain other types of information in addition to query
results, e.g., advertisements and navigation bars. The problem of extracting structured data
from query result pages is critical for web data integration applications, such as comparison
shopping and meta-search engines, and has been intensively studied. A number of approaches
have been proposed. As the structures of Web pages become more and more complex, the
existing approaches start to fail, and most of them do not remove irrelevant content, which
may affect the accuracy of data record extraction. We propose an automated approach for
Web data extraction. First, it makes use of visual features and query terms to identify data
sections and extracts data records in these sections. We also represent several content and
visual features of visual blocks in a data section, and use them to filter out noisy blocks.
Second, it measures similarity between data items in different data records based on their
visual and content features, and aligns them into different groups so that the data in the
same group have the same semantics. The results of our experiments with a large set of
Web query result pages in different domains show that our proposed approaches are highly
effective.
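
The alignment step described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the `text` and `font` fields, the equal weighting of content and visual similarity, the 0.6 threshold, and the greedy grouping strategy are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def item_similarity(a, b, content_weight=0.5):
    """Blend string similarity with a crude visual cue (same font or not).

    The weighting and the single 'font' feature are hypothetical.
    """
    content = SequenceMatcher(None, a["text"], b["text"]).ratio()
    visual = 1.0 if a["font"] == b["font"] else 0.0
    return content_weight * content + (1 - content_weight) * visual


def align_items(records, threshold=0.6):
    """Greedily place each data item into the most similar existing group.

    Items from the same record never share a group, so each group ends up
    holding at most one item per record (e.g. all titles, all prices).
    """
    groups = []
    for record in records:
        used = set()  # groups already taken by an item of this record
        for item in record:
            best_idx, best_score = None, threshold
            for idx, group in enumerate(groups):
                if idx in used:
                    continue
                score = max(item_similarity(item, member) for member in group)
                if score >= best_score:
                    best_idx, best_score = idx, score
            if best_idx is None:
                groups.append([item])
                used.add(len(groups) - 1)
            else:
                groups[best_idx].append(item)
                used.add(best_idx)
    return groups


# Hypothetical data records extracted from a query result page.
records = [
    [{"text": "Data Mining Concepts", "font": "bold"},
     {"text": "$45.00", "font": "red"}],
    [{"text": "Web Data Mining", "font": "bold"},
     {"text": "$39.99", "font": "red"}],
]

for group in align_items(records):
    print([item["text"] for item in group])
```

A real system would likely use richer visual features (position, size, color) and a globally optimal matching rather than this greedy pass.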
Abstract:
Web sites that rely on databases for their content are now ubiquitous. Query result pages are dynamically generated from these databases in response to user-submitted queries. Automatically extracting structured data from query result pages is a challenging problem, as the structure of the data is not explicitly represented. While humans have shown good intuition in visually understanding data records on a query result page as displayed by a web browser, no existing approach to data record extraction has made full use of this intuition. We propose a novel approach, in which we make use of the common sources of evidence that humans use to understand data records on a displayed query result page. These include structural regularity, and visual and content similarity between data records displayed on a query result page. Based on these observations we propose new techniques that can identify each data record individually, while ignoring noise items, such as navigation bars and adverts. We have implemented these techniques in a software prototype, rExtractor, and tested it using two datasets. Our experimental results show that our approach achieves significantly higher accuracy than previous approaches. Furthermore, it establishes the case for use of vision-based algorithms in the context of data extraction from web sites.
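
As a rough illustration of how structural regularity can separate data records from noise, the sketch below keeps only the page blocks whose layout signature repeats most often. This is not rExtractor's method: the `signature` strings, the `min_repeats` parameter, and the sample blocks are hypothetical.

```python
from collections import Counter


def select_repeated_blocks(blocks, min_repeats=2):
    """Return the blocks sharing the most frequent layout signature.

    Blocks with a one-off signature (navigation bars, adverts) are dropped.
    """
    if not blocks:
        return []
    counts = Counter(block["signature"] for block in blocks)
    signature, repeats = counts.most_common(1)[0]
    if repeats < min_repeats:
        return []
    return [block for block in blocks if block["signature"] == signature]


# Hypothetical visual blocks from one rendered query result page.
page_blocks = [
    {"signature": "nav>a*5", "text": "Home | Books | Music"},
    {"signature": "img+title+price", "text": "Web Data Mining $39.99"},
    {"signature": "img+title+price", "text": "Learning Python $29.50"},
    {"signature": "iframe", "text": "Sponsored"},
    {"signature": "img+title+price", "text": "Data Mining Concepts $45.00"},
]

for block in select_repeated_blocks(page_blocks):
    print(block["text"])
```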
Abstract:
Background: Men can be hard to reach with face-to-face health-related information, while increasingly, research shows that they are seeking health information from online sources. Recognizing this trend, there is merit in developing innovative online knowledge translation (KT) strategies capable of translating research on men’s health into engaging health promotion materials. While the concept of KT has become a new mantra for researchers wishing to bridge the gap between research evidence and improved health outcomes, little is written about the process, necessary skills, and best practices by which researchers can develop online knowledge translation.
Objective: Our aim was to illustrate some of the processes and challenges involved in, and potential value of, developing research knowledge online to promote men’s health.
Methods: We present experiences of KT across two case studies of men’s health. First, we describe a study that uses interactive Web apps to translate knowledge relating to Canadian men’s depression. Through a range of mechanisms, study findings were repackaged with the explicit aim of raising awareness and reducing the stigma associated with men’s depression and/or help-seeking. Second, we describe an educational resource for teenage men about unintended pregnancy, developed for delivery in the formal Relationship and Sexuality Education school curricula of Ireland, Northern Ireland (United Kingdom), and South Australia. The intervention is based around a Web-based interactive film drama entitled “If I Were Jack”.
Results: For each case study, we describe the KT process and strategies that aided development of credible and well-received online content focused on men’s health promotion. In both case studies, the original research generated the inspiration for the interactive online content and the core development strategy was working with a multidisciplinary team to develop this material through arts-based approaches. In both cases also, there is an acknowledgment of the need for gender and culturally sensitive information. Both aimed to engage men by disrupting stereotypes about men, while simultaneously addressing men through authentic voices and faces. Finally, in both case studies we draw attention to the need to think beyond placement of content online to delivery to target audiences from the outset.
Conclusions: The case studies highlight some of the new skills required by academics in the emerging paradigm of translational research and contribute to the nascent literature on KT. Our approach to online KT was to go beyond dissemination and diffusion to actively repackage research knowledge through arts-based approaches (videos and film scripts) as health promotion tools, with optimal appeal, to target male audiences. Our findings highlight the importance of developing a multidisciplinary team to inform the design of content, the importance of adaptation to context, both in terms of the national implementation context and consideration of gender-specific needs, and an integrated implementation and evaluation framework in all KT work.
Abstract:
In this paper, we propose a new learning approach to Web data annotation, where a support vector machine-based multiclass classifier is trained to assign labels to data items. For data record extraction, a data section re-segmentation algorithm based on visual and content features is introduced to improve the performance of Web data record extraction. We have implemented the proposed approach and tested it with a large set of Web query result pages in different domains. Our experimental results show that our proposed approach is highly effective and efficient.
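
To make the annotation idea concrete, here is a minimal sketch of a one-vs-rest linear SVM that assigns labels to data items. It is an illustration under stated assumptions, not the paper's classifier: the three content features, the labels, the training strings, and the hyperparameters are all hypothetical, and training uses plain subgradient descent on the hinge loss rather than a dedicated SVM solver.

```python
def features(text):
    """Hypothetical content features: digit ratio, currency flag, letter ratio."""
    n = max(len(text), 1)
    digits = sum(ch.isdigit() for ch in text) / n
    letters = sum(ch.isalpha() for ch in text) / n
    return [digits, 1.0 if "$" in text else 0.0, letters]


def score(model, x):
    """Signed distance-like score w.x + b of a linear model."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_binary_svm(samples, targets, epochs=300, lr=0.1, reg=0.001):
    """Linear SVM (targets are +1/-1) via subgradient descent on hinge loss."""
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, targets):
            violated = y * score((w, b), x) < 1  # inside the margin?
            w = [wi * (1 - lr * reg) for wi in w]  # L2 shrinkage
            if violated:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def train_one_vs_rest(texts, labels):
    """Train one binary SVM per label against all other labels."""
    samples = [features(t) for t in texts]
    return {
        cls: train_binary_svm(samples, [1 if lab == cls else -1 for lab in labels])
        for cls in sorted(set(labels))
    }


def annotate(models, text):
    """Assign the label whose binary SVM gives the highest score."""
    x = features(text)
    return max(models, key=lambda cls: score(models[cls], x))


# Hypothetical labeled data items.
train_texts = ["Web Data Mining", "Learning Python", "$39.99", "$120.00", "2012", "1998"]
train_labels = ["title", "title", "price", "price", "year", "year"]
models = train_one_vs_rest(train_texts, train_labels)

print([annotate(models, t) for t in ["Data Structures", "$7.50", "2005"]])
```

One-vs-rest is only one way to build a multiclass SVM; the abstract does not say which scheme or which features the authors used.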
Abstract:
Background: This study investigated the nature of newspaper reporting about online health information in the UK and US. Internet users frequently search for health information online, although the accuracy of the information retrieved varies greatly and can be misleading. Newspapers have the potential to influence public health behaviours, but information has been lacking in relation to how newspapers portray online health information to their readers.
Methods: The newspaper database Nexis® UK was searched for articles published from 2003 to 2012 relating to online health information. Systematic content analysis of articles published in the highest-circulation newspapers in the UK and US was performed. A second researcher coded a 10% sample to establish inter-rater reliability of coding.
Results: In total, 161 newspaper articles were included in the analysis. Publication was most frequent in 2003, 2008 and 2009, which coincided with global threats to public health. UK broadsheet newspapers were significantly more likely to cover online health information than UK tabloid newspapers (p = 0.04) and only one article was identified in US tabloid newspapers. Articles most frequently appeared in health sections. Among the 79 articles that linked online health information to specific diseases or health topics, diabetes was the most frequently mentioned disease, cancer the commonest group of diseases and sexual health the most frequent health topic. Articles portrayed benefits of obtaining online health information more frequently than risks. Quotations from health professionals portrayed mixed opinions regarding public access to online health information. 108 (67.1%) articles directed readers to specific health-related web sites. 135 (83.9%) articles were rated as having balanced judgement and 76 (47.2%) were judged as having excellent quality reporting. No difference was found in the quality of reporting between UK and US articles.
Conclusions: Newspaper coverage of online health information was low during the 10-year period 2003 to 2012. Journalists tended to emphasise the benefits and understate the risks of online health information and the quality of reporting varied considerably. Newspapers directed readers to sources of online health information during global epidemics although, as most articles appeared in the health sections of broadsheet newspapers, coverage was limited to a relatively small readership.
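
The abstract mentions that a second researcher coded a 10% sample to establish inter-rater reliability but does not name the statistic used. Assuming Cohen's kappa, a common choice for two raters, the calculation could be sketched as follows. The category names and codes below are hypothetical and are not data from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty sample")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1 - expected)


# Hypothetical codes for a 16-article sample (roughly 10% of 161 articles).
rater_a = ["benefit"] * 8 + ["risk"] * 8
rater_b = ["benefit"] * 7 + ["risk"] * 9

print(round(cohen_kappa(rater_a, rater_b), 3))  # 0.875
```

In this made-up sample the raters disagree on one article, so observed agreement is 15/16 and chance agreement is 0.5.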