973 results for Compressed text search
Abstract:
Manually constructing domain-specific sentiment lexicons is extremely time consuming, and it may not even be feasible for domains where linguistic expertise is not available. Research on the automatic construction of domain-specific sentiment lexicons has therefore become a hot topic in recent years. The main contribution of this paper is a novel semi-supervised learning method which exploits both term-to-term and document-to-term relations hidden in a corpus for the construction of domain-specific sentiment lexicons. More specifically, the proposed two-pass pseudo labeling method combines shallow linguistic parsing and corpus-based statistical learning to make domain-specific sentiment extraction scalable with respect to the sheer volume of opinionated documents archived on the Internet these days. Another novelty of the proposed method is that it can utilize the readily available user-contributed labels of opinionated documents (e.g., the user ratings of product reviews) to bootstrap the performance of sentiment lexicon construction. Our experiments show that the proposed method can generate high quality domain-specific sentiment lexicons as directly assessed by human experts. Moreover, the system-generated domain-specific sentiment lexicons can improve polarity prediction tasks at the document level by 2.18% when compared to other well-known baseline methods. Our research opens the door to the development of practical and scalable methods for domain-specific sentiment analysis.
Abstract:
In the present paper, we introduce BioPatML.NET, an application library for the Microsoft Windows .NET framework [2] that implements the BioPatML pattern definition language and sequence search engine. BioPatML.NET is integrated with the Microsoft Biology Foundation (MBF) application library [3], unifying the parsers and annotation services supported or emerging through MBF with the language, search framework and pattern repository of BioPatML. End users who wish to exploit the BioPatML.NET engine and repository without engaging the services of a programmer may do so via the freely accessible web-based BioPatML Editor, which we describe below.
Abstract:
Information has no value unless it is accessible. Information must be connected together so a knowledge network can then be built. Such a knowledge base is a key resource for Internet users to interlink information from documents. Information retrieval, a key technology for knowledge management, guarantees access to large corpora of unstructured text. Collaborative knowledge management systems such as Wikipedia are becoming more popular than ever; however, their link creation function is not optimized for discovering possible links in the collection, and the quality of automatically generated links has never been quantified. This research begins with an evaluation forum intended to support collaborative experiments in focused link discovery as well as investigation of the link discovery application. The research focus was on the evaluation strategy: the proposed evaluation framework, including rules, formats, pooling, validation, assessment and evaluation, has proved to be efficient for conducting evaluation and reusable for further extension. The collection-split approach is used to reconstruct the Wikipedia collection into a split collection comprising single-passage files. This split collection has proved feasible for improving the discovery of relevant passages and serves as a corpus for focused link discovery. Following these experiments, a mobile client-side prototype built on the iPhone was developed to address the mobile search issue by using focused link discovery technology. According to the interview survey, the proposed mobile interactive UI does improve the experience of mobile information seeking. Based on this evaluation framework, a novel cross-language link discovery proposal using multiple text collections is developed. A dynamic evaluation approach is proposed to enhance both the collaborative effort and the interaction between submission and evaluation. A realistic evaluation scheme has been implemented at NTCIR for cross-language link discovery tasks.
Abstract:
Volatile properties of particle emissions from four compressed natural gas (CNG) and four diesel buses were investigated under steady state and transient driving modes on a chassis dynamometer. The exhaust was diluted utilising a full-flow continuous volume sampling system and passed through a thermodenuder at controlled temperature. Particle number concentration and size distribution were measured with a condensation particle counter and a scanning mobility particle sizer, respectively. We show that, while almost all the particles emitted by the CNG buses were in the nanoparticle size range, at least 85% and 98% were removed at 100°C and 250°C, respectively. Closer analysis of the volatility of particles emitted during transient cycles showed that volatilisation began at around 40°C, with the majority occurring by 80°C. Particles produced during hard acceleration from rest exhibited lower volatility than those produced during other times of the cycle. Based on our results and the observation of ash deposits on the walls of the tailpipes, we suggest that these non-volatile particles were composed mostly of ash from lubricating oil. Heating the diesel bus emissions to 100°C reduced ultrafine particle numbers by 69% to 82% when a nucleation mode was present and by just 18% when it was not.
Abstract:
Data processing for information extraction is of growing importance for Web databases. Due to the sheer size and volume of databases, retrieving the relevant information users need has become a cumbersome process. Information seekers face information overload: too many results are returned for their queries. Conversely, too few or no results are returned if a very specific query is asked. This paper proposes a ranking algorithm that gives higher preference to a user’s current search and also utilizes profile information in order to obtain the relevant results for a user’s query.
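One way to picture a ranking that weights the current search above profile information is a simple linear blend of two similarity scores. The sketch below is only an illustration under stated assumptions, not the algorithm of the paper: the Jaccard term overlap, the weight `alpha`, and the function names `jaccard` and `rank` are all hypothetical.

```python
def jaccard(a, b):
    """Term-overlap similarity between two term collections (assumed measure)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank(results, query_terms, profile_terms, alpha=0.7):
    """Order (doc_id, terms) pairs by a blended relevance score.

    alpha > 0.5 gives the current query more influence than the stored
    profile; the value 0.7 is a hypothetical default, not from the paper.
    """
    def score(doc_terms):
        return (alpha * jaccard(query_terms, doc_terms)
                + (1 - alpha) * jaccard(profile_terms, doc_terms))

    return sorted(results, key=lambda r: score(r[1]), reverse=True)


docs = [
    ("a", ["jaguar", "car", "speed"]),
    ("b", ["jaguar", "cat", "habitat"]),
    ("c", ["banana"]),
]
# Both "a" and "b" match the query equally; the profile breaks the tie.
ranked = rank(docs, ["jaguar"], ["cat", "wildlife"])
print([doc_id for doc_id, _ in ranked])
```

With `alpha` close to 1 the profile is almost ignored, and with `alpha` at 0.5 the query and profile count equally, so the weight is the knob that expresses "higher preference to the current search".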
Abstract:
This paper discusses human factors issues of low-cost railway level crossings in Australia. Several issues are discussed, including safety at passive railway level crossings, human factors considerations associated with unavailability of a warning device, and a conceptual model for how safety could be compromised at railway level crossings following prolonged or frequent unavailability. The research plans to quantify safety risk to motorists at level crossings using a Human Reliability Assessment (HRA) method, supported by data collected using an advanced driving simulator. This method aims to identify human error within tasks and task units identified as part of the task analysis process. It is anticipated that by modelling driver behaviour the current study will be able to quantify meaningful task variability, including temporal parameters, between participants and within participants. The process of complex tasks such as driving through a level crossing is fundamentally context-bound. Therefore, this study also aims to quantify those performance-shaping factors that contribute to vehicle-train collisions by highlighting changes in the task units and driver physiology. Finally, we will also consider a number of variables germane to ensuring external validity of our results. Without this inclusion, such an analysis could seriously underestimate the probabilistic risk assessment.
Abstract:
The Web has become a worldwide repository of information which individuals, companies, and organizations utilize to solve or address various information problems. Many of these Web users utilize automated agents to gather this information for them. Some assume that this approach represents a more sophisticated method of searching. However, there is little research investigating how Web agents search for online information. In this research, we first provide a classification for information agents using stages of information gathering, gathering approaches, and agent architecture. We then examine an implementation of one of the resulting classifications in detail, investigating how agents search for information on Web search engines, including the session, query, term, duration and frequency of interactions. For this temporal study, we analyzed three data sets of queries and page views from agents interacting with the Excite and AltaVista search engines from 1997 to 2002, examining approximately 900,000 queries submitted by over 3,000 agents. Findings include: (1) agent sessions are extremely interactive, with sometimes hundreds of interactions per second, (2) agent queries are comparable to those of human searchers, with little use of query operators, (3) Web agents are searching for a relatively limited variety of information, wherein only 18% of the terms used are unique, and (4) the duration of agent-Web search engine interaction typically spans several hours. We discuss the implications for Web information agents and search engines.
Abstract:
This paper reports results from a study exploring the multimedia search functionality of Chinese language search engines. Web searching in Chinese (Mandarin) is a growing research area and a technical challenge for popular commercial Web search engines. Few studies have been conducted on Chinese language search engines. We investigate two research questions: which Chinese language search engines provide multimedia searching, and what multimedia search functionalities are available in Chinese language Web search engines. Specifically, we examine each Web search engine's (1) features permitting Chinese language multimedia searches, (2) extent of search personalization and user control of multimedia search variables, and (3) the relationships between Web search engines and their features in the Chinese context. Key findings show that Chinese language Web search engines offer limited multimedia search functionality, and general search engines provide a wider range of features than specialized multimedia search engines. Study results have implications for Chinese Web users, Website designers and Web search engine developers. © 2009 Elsevier Ltd. All rights reserved.
Abstract:
Usability is a multi-dimensional characteristic of a computer system. This paper focuses on usability as a measurement of interaction between the user and the system. The research employs a task-oriented approach to evaluate the usability of a meta search engine. This engine encourages and accepts queries of unlimited size expressed in natural language. A variety of conventional metrics developed by academic and industrial research, including ISO standards, are applied to the information retrieval process consisting of sequential tasks. Tasks range from formulating (long) queries to interpreting and retaining search results. Results of the evaluation and analysis of the operation log indicate that obtaining advanced search engine results can be accomplished simultaneously with enhancing the usability of the interactive process. In conclusion, we discuss implications for interactive information retrieval system design and directions for future usability research. © 2008 Academy Publisher.
Abstract:
Detecting query reformulations within a session by a Web searcher is an important area of research for designing more helpful searching systems and targeting content to particular users. Methods explored by other researchers include both qualitative approaches (i.e., the use of human judges to manually analyze query patterns on usually small samples) and nondeterministic algorithms, which typically use large amounts of training data to predict query modification during sessions. In this article, we explore three alternative methods for detection of session boundaries. All three methods are computationally straightforward and therefore easily implemented for detection of session changes. We examine 2,465,145 interactions from 534,507 users of Dogpile.com on May 6, 2005. We compare session analysis using (a) Internet Protocol address and cookie; (b) Internet Protocol address, cookie, and a temporal limit on intrasession interactions; and (c) Internet Protocol address, cookie, and query reformulation patterns. Overall, our analysis shows that defining sessions by query reformulation along with Internet Protocol address and cookie provides the best measure, resulting in an 82% increase in the count of sessions. Regardless of the method used, the mean session length was fewer than three queries, and the mean session duration was less than 30 min. Searchers most often modified their query by changing query terms (nearly 23% of all query modifications) rather than adding or deleting terms. Implications are that for measuring searching traffic, unique sessions may be a better indicator than the common metric of unique visitors. This research also sheds light on the more complex aspects of Web searching involving query modifications and may lead to advances in searching tools.
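Method (b) in the abstract above, Internet Protocol address plus cookie plus a temporal limit, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the record layout `(ip, cookie, timestamp)`, the function name `split_sessions`, and the 30-minute default gap are assumptions, since the abstract does not state the limit that was used.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def split_sessions(interactions, max_gap=timedelta(minutes=30)):
    """Group (ip, cookie, timestamp) records into sessions.

    Records sharing an (ip, cookie) pair belong to one searcher; a new
    session starts whenever that searcher is idle for longer than max_gap.
    The 30-minute default is a hypothetical value, not taken from the study.
    """
    by_user = defaultdict(list)
    for ip, cookie, ts in interactions:
        by_user[(ip, cookie)].append(ts)

    sessions = []
    for user, stamps in by_user.items():
        stamps.sort()
        current = [stamps[0]]
        for ts in stamps[1:]:
            if ts - current[-1] > max_gap:
                # Gap exceeds the temporal limit: close the session.
                sessions.append((user, current))
                current = [ts]
            else:
                current.append(ts)
        sessions.append((user, current))
    return sessions


t0 = datetime(2005, 5, 6, 9, 0)
log = [
    ("1.2.3.4", "c1", t0),
    ("1.2.3.4", "c1", t0 + timedelta(minutes=5)),
    ("1.2.3.4", "c1", t0 + timedelta(hours=2)),
    ("5.6.7.8", "c2", t0),
]
sessions = split_sessions(log)
print(len(sessions))
```

Passing a very large `max_gap` makes the temporal limit irrelevant, which reduces the sketch to method (a), one session per IP address and cookie pair. Method (c) would replace the time test with a comparison of consecutive queries, which is not shown here.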
Abstract:
From the late sixteenth century, in response to the problem of how best to teach children to read, a variety of texts such as primers, spellers and readers were produced in England for vernacular instruction. This paper describes how these materials were used by teachers to develop first, a specific religious understanding according to the stricture of the time and second, a moral reading practice that provided the child with a guide to secular conduct. The analysis focuses on the use of these texts as a productive means for shaping the child-reader in the context of newly emerging educational spaces which fostered a particular, morally formative relation among teacher, child and text.
Abstract:
Discovering proper search intents is a vital process to return desired results. It has consistently been a hot research topic in information retrieval in recent years. Existing methods are mainly limited to context-based mining, query expansion, and user profiling techniques, which still suffer from the issue of ambiguity in search queries. In this paper, we introduce a novel ontology-based approach in terms of a world knowledge base in order to construct personalized ontologies for identifying adequate concept levels for matching user search intents. An iterative mining algorithm is designed for evaluating potential intents level by level until meeting the best result. The propose-to-attempt approach is evaluated on a large-volume RCV1 data set, and experimental results indicate a distinct improvement in top precision when compared with baseline models.
Abstract:
Flexible work practices spreading work times across the entire week have reduced the time to engage in leisure activities and for some have compounded the problem of a lack of defined break between work weeks. This study examines time spent outside of the workplace through a multiple case study of working time and leisure in the construction industry. A framework of synchronous leisure is used to examine the interplay of work and non-work arrangements. The effects of changing work arrangements to deliver a longer break between working weeks and the consequent impact on leisure activities are analysed. Interviews and focus groups across four construction sites revealed that while leisure is important to relieve fatigue and overwork, a work schedule allowing a long break between working weeks, specifically on a weekend, enables workers to achieve synchronous time, particularly with family, and improves work-life balance satisfaction. It was found that a well-defined break across a weekend also offers the opportunity to synchronize schedules with others to spend time away on short breaks.
Abstract:
It is a big challenge to guarantee the quality of discovered relevance features in text documents for describing user preferences because of the large number of terms, patterns, and noise. Most existing popular text mining and classification methods have adopted term-based approaches. However, they have all suffered from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern-based methods should perform better than term-based ones in describing user preferences, but many experiments do not support this hypothesis. This research presents a promising method, Relevance Feature Discovery (RFD), for solving this challenging issue. It discovers both positive and negative patterns in text documents as high-level features in order to accurately weight low-level features (terms) based on their specificity and their distributions in the high-level features. The thesis also introduces an adaptive model (called ARFD) to enhance the flexibility of using RFD in adaptive environments. ARFD automatically updates the system's knowledge based on a sliding window over new incoming feedback documents. It can efficiently decide which incoming documents can bring new knowledge into the system. Substantial experiments using the proposed models on Reuters Corpus Volume 1 and TREC topics show that the proposed models significantly outperform both the state-of-the-art term-based methods underpinned by Okapi BM25, Rocchio or Support Vector Machine and other pattern-based methods.