319 results for INFORMATION RETRIEVAL


Relevance:

60.00%

Abstract:

With the growing number of XML documents on the Web, it becomes essential to effectively organise these XML documents in order to retrieve useful information from them. A possible solution is to apply clustering on the XML documents to discover knowledge that promotes effective data management, information retrieval and query processing. However, many issues arise in discovering knowledge from these types of semi-structured documents due to their heterogeneity and structural irregularity. Most of the existing research on clustering techniques focuses on only one feature of the XML documents, either their structure or their content, due to scalability and complexity problems. The knowledge gained in the form of clusters based on the structure or the content alone is not suitable for real-life datasets. It therefore becomes essential to include both the structure and content of XML documents in order to improve the accuracy and meaning of the clustering solution. However, the inclusion of both these kinds of information in the clustering process results in a huge overhead for the underlying clustering algorithm because of the high dimensionality of the data. The overall objective of this thesis is to address these issues by: (1) proposing methods to utilise frequent pattern mining techniques to reduce the dimensionality; (2) developing models to effectively combine the structure and content of XML documents; and (3) utilising the proposed models in clustering. This research first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. A clustering framework with two types of models, implicit and explicit, is developed. The implicit model uses a Vector Space Model (VSM) to combine the structure and the content information.
The explicit model uses a higher order model, namely a 3-order Tensor Space Model (TSM), to explicitly combine the structure and the content information. This thesis also proposes a novel incremental technique to decompose large-sized tensor models and to utilise the decomposed solution for clustering the XML documents. The proposed framework and its components were extensively evaluated on several real-life datasets exhibiting extreme characteristics to understand the usefulness of the proposed framework in real-life situations. Additionally, this research evaluates the outcome of the clustering process on the collection selection problem in information retrieval on the Wikipedia dataset. The experimental results demonstrate that the proposed frequent pattern mining and clustering methods outperform the related state-of-the-art approaches. In particular, the proposed framework of utilising frequent structures for constraining the content shows an improvement in accuracy over content-only and structure-only clustering results. The scalability evaluation experiments conducted on large-scale datasets clearly show the strengths of the proposed methods over state-of-the-art methods. In particular, this thesis work contributes to effectively combining the structure and the content of XML documents for clustering, in order to improve the accuracy of the clustering solution. In addition, it also contributes by addressing the research gaps in frequent pattern mining to generate efficient and concise frequent subtrees with various node relationships that could be used in clustering.
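The idea of an implicit, vector-space combination of structure and content can be illustrated with a minimal sketch. It is not the thesis's implementation: structural features are assumed here to be simple path strings standing in for mined frequent subtrees, and the function names `combined_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def combined_vector(subtrees, terms):
    """Build one sparse vector holding both structure and content features.

    `subtrees` are path strings standing in for frequent subtrees (an
    assumption for illustration); `terms` are the content words.
    """
    vec = Counter()
    for s in subtrees:
        vec["struct:" + s] += 1
    for t in terms:
        vec["term:" + t] += 1
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


doc_a = combined_vector(["article/title", "article/body/sec"], ["xml", "clustering"])
doc_b = combined_vector(["article/title", "article/body/sec"], ["xml", "retrieval"])
doc_c = combined_vector(["book/chapter"], ["cooking", "recipes"])

print(cosine(doc_a, doc_b))  # shared structure and one shared term
print(cosine(doc_a, doc_c))  # nothing shared
```

Any standard clustering algorithm could then operate on these similarities; the prefixes keep structural and content dimensions from colliding.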

Relevance:

60.00%

Abstract:

Many user studies in Web information searching have found a significant effect of task types on search strategies. However, little attention has been given to Web image searching strategies, especially the query reformulation activity, even though this is a crucial part of Web image searching. In this study, we investigated the effects of topic domains and task types on users’ image searching behavior and query reformulation strategies. Some significant differences in users’ task specificity and initial concepts were identified among the task domains. Task types were also found to influence participants’ result reviewing behavior and query reformulation strategies.

Relevance:

60.00%

Abstract:

In information retrieval (IR) research, more and more focus has been placed on optimizing a query language model by detecting and estimating the dependencies between the query and the observed terms occurring in the selected relevance feedback documents. In this paper, we propose a novel Aspect Language Modeling framework featuring term association acquisition, document segmentation, query decomposition, and an Aspect Model (AM) for parameter optimization. Through the proposed framework, we advance the theory and practice of applying high-order and context-sensitive term relationships to IR. We first decompose a query into subsets of query terms. Then we segment the relevance feedback documents into chunks using multiple sliding windows. Finally, we discover the higher order term associations, that is, the terms in these chunks with a high degree of association to the subsets of the query. In this process, we adopt an approach that combines the AM with Association Rule (AR) mining. In our approach, the AM not only considers the subsets of a query as “hidden” states and estimates their prior distributions, but also evaluates the dependencies between the subsets of a query and the observed terms extracted from the chunks of feedback documents. The AR provides a reasonable initial estimation of the high-order term associations by discovering the association rules from the document chunks. Experimental results on various TREC collections verify the effectiveness of our approach, which significantly outperforms a baseline language model and two state-of-the-art query language models, namely the Relevance Model and the Information Flow model.
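The segmentation and co-occurrence steps described above can be sketched as follows. This is only an illustration under stated assumptions: the window sizes, the step, and the function names `sliding_chunks` and `associated_terms` are hypothetical, and plain co-occurrence counting stands in for the Aspect Model and association rule mining, which are not reproduced here.

```python
from collections import Counter


def sliding_chunks(tokens, window_sizes, step):
    """Yield overlapping chunks of the token list, once per window size."""
    for size in window_sizes:
        last_start = max(len(tokens) - size, 0)
        for start in range(0, last_start + 1, step):
            yield tokens[start:start + size]


def associated_terms(tokens, query_subset, window_sizes=(4, 8), step=2):
    """Count terms that share a chunk with every term of a query subset."""
    subset = set(query_subset)
    counts = Counter()
    for chunk in sliding_chunks(tokens, window_sizes, step):
        present = set(chunk)
        if subset <= present:
            # Each chunk contributes at most one count per co-occurring term.
            counts.update(present - subset)
    return counts


feedback_doc = "query expansion improves retrieval when expansion terms fit the query".split()
print(associated_terms(feedback_doc, ["query", "expansion"]).most_common(3))
```

Counts of this kind could serve as an initial estimate of which terms are associated with which query subset, before any probabilistic model refines them.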

Relevance:

60.00%

Abstract:

At NTCIR-9, we participated in the cross-lingual link discovery (Crosslink) task. In this paper we describe our approaches to discovering Chinese, Japanese, and Korean (CJK) cross-lingual links for English documents in Wikipedia. Our experimental results show that a link mining approach that mines the existing link structure for anchor probabilities and relies on the “translation” using cross-lingual document name triangulation performs very well. The evaluation shows encouraging results for our system.
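The abstract does not define how anchor probabilities are computed. A minimal sketch, assuming the common definition (the fraction of a phrase's occurrences in the collection that appear as link anchors), might look like this; the function name and input dictionaries are hypothetical.

```python
def anchor_probabilities(link_counts, occurrence_counts):
    """Estimate, for each phrase, how often it is used as a link anchor.

    link_counts: phrase -> number of times the phrase appears as an anchor.
    occurrence_counts: phrase -> number of times the phrase appears at all.
    Both inputs are assumed to have been mined from existing documents.
    """
    probs = {}
    for phrase, occurrences in occurrence_counts.items():
        if occurrences > 0:
            probs[phrase] = link_counts.get(phrase, 0) / occurrences
    return probs


probs = anchor_probabilities({"Tokyo": 8}, {"Tokyo": 10, "the": 100})
# Phrases with a high probability are better candidates for new links.
print(sorted(probs.items(), key=lambda item: item[1], reverse=True))
```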

Relevance:

60.00%

Abstract:

Consider the concept combination ‘pet human’. In word association experiments, human subjects produce the associate ‘slave’ in relation to this combination. The striking aspect of this associate is that it is not produced as an associate of ‘pet’ or ‘human’ in isolation. In other words, the associate ‘slave’ seems to be emergent. Such emergent associations sometimes have a creative character, and cognitive science is largely silent about how we produce them. Departing from a dimensional model of human conceptual space, this article explores concept combinations and argues that emergent associations are a result of abductive reasoning within conceptual space, that is, below the symbolic level of cognition. A tensor-based approach is used to model concept combinations, allowing such combinations to be formalized as interacting quantum systems. Free association norm data is used to motivate the underlying basis of the conceptual space. It is shown by analogy how some concept combinations may behave like quantum-entangled (non-separable) particles. Two methods of analysis are presented for empirically validating the presence of non-separable concept combinations in human cognition. One method is based on quantum theory and the other is based on comparing a joint (true theoretic) probability distribution with another distribution based on a separability assumption using a chi-square goodness-of-fit test. Although these methods were inconclusive in relation to an empirical study of bi-ambiguous concept combinations, avenues for further refinement of these methods are identified.
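The second method can be sketched as a chi-square comparison. This assumes that "separability" means the joint distribution factorises into the product of its marginals, which is an interpretation for illustration rather than the article's exact formulation. The function name and the count tables are hypothetical, and all row and column totals are assumed to be non-zero.

```python
def chi_square_vs_separable(observed):
    """Chi-square statistic of a 2-D count table against the product of its marginals.

    observed[i][j] is the number of subjects giving interpretation i to the
    first concept and interpretation j to the second (illustrative layout).
    """
    total = sum(sum(row) for row in observed)
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the two concepts were interpreted independently.
            expected = row_sums[i] * col_sums[j] / total
            statistic += (count - expected) ** 2 / expected
    degrees_of_freedom = (len(row_sums) - 1) * (len(col_sums) - 1)
    return statistic, degrees_of_freedom


separable_stat, dof = chi_square_vs_separable([[10, 10], [10, 10]])
correlated_stat, _ = chi_square_vs_separable([[20, 0], [0, 20]])
print(separable_stat, correlated_stat, dof)
```

A large statistic relative to the chi-square distribution with the returned degrees of freedom would count as evidence against the separability assumption.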

Relevance:

60.00%

Abstract:

Electronic services are a leitmotif in ‘hot’ topics like Software as a Service, Service Oriented Architecture (SOA), Service Oriented Computing, Cloud Computing, application markets and smart devices. We propose to consider these in what has been termed the Service Ecosystem (SES). The SES encompasses all levels of electronic services and their interaction, with human consumption and initiation on its periphery, in much the same way the ‘Web’ describes a plethora of technologies that eventuate to connect information and expose it to humans. Presently, the SES is heterogeneous, fragmented and confined to semi-closed systems. A key issue hampering the emergence of an integrated SES is Service Discovery (SD). A SES will be dynamic, with areas of structured and unstructured information within which service providers and ‘lay’ human consumers interact; until now the two have been disjointed, e.g., SOA-enabled organisations, industries and domains are choreographed by domain experts or ‘hard-wired’ to smart device application markets and web applications. In a SES, services are accessible, comparable and exchangeable to human consumers, closing the gap to the providers. This requires a new SD with which humans can discover services transparently and effectively without special knowledge or training. We propose two modes of discovery: directed search, which follows an agenda, and explorative search, which speculatively expands knowledge of an area of interest by means of categories. Inspired by conceptual space theory from cognitive science, we propose to implement the modes of discovery using concepts to map a lay consumer’s service need to terminologically sophisticated descriptions of services. To this end, we reframe SD as an information retrieval task on the information attached to services, such as descriptions, reviews, documentation and web sites - the Service Information Shadow.
The Semantic Space model transforms the shadow's unstructured semantic information into a geometric, concept-like representation. We introduce an improved and extended Semantic Space including categorization, calling it the Semantic Service Discovery model. We evaluate our model with a highly relevant, service-related corpus simulating a Service Information Shadow, including manually constructed complex service agendas as well as manual groupings of services. We compare our model against state-of-the-art information retrieval systems and clustering algorithms. By means of an extensive series of empirical evaluations, we establish optimal parameter settings for the semantic space model. The evaluations demonstrate the model’s effectiveness for SD in terms of retrieval precision over state-of-the-art information retrieval models (directed search) and the meaningful, automatic categorization of service-related information, which shows potential to form the basis of a useful, cognitively motivated map of the SES for exploratory search.

Relevance:

60.00%

Abstract:

It is a big challenge to clearly identify the boundary between positive and negative streams. Several attempts have used negative feedback to solve this challenge; however, there are two issues in using negative relevance feedback to improve the effectiveness of information filtering. The first is how to select constructive negative samples in order to reduce the space of negative documents. The second is how to decide which noisy extracted features should be updated based on the selected negative samples. This paper proposes a pattern mining based approach to select some offenders from the negative documents, where an offender can be used to reduce the side effects of noisy features. It also classifies extracted features (i.e., terms) into three categories: positive specific terms, general terms, and negative specific terms. In this way, multiple revising strategies can be used to update extracted features. An iterative learning algorithm is also proposed to implement this approach on RCV1, and substantial experiments show that the proposed approach achieves encouraging performance.

Relevance:

60.00%

Abstract:

Fundamental tooling is required in order to apply USDL in practical settings. This chapter discusses three fundamental types of tools for USDL. First, USDL editors have been developed for expert and casual users, respectively. Second, several USDL repositories have been built to allow editors to access and store USDL descriptions. Third, our generic USDL marketplace allows providers to describe their services once and potentially trade them anywhere. In addition, the marketplace addresses the idiosyncrasies of service trading as opposed to the simpler case of product trading. The chapter also presents several deployment scenarios of such tools to foster individual value chains and support new business models across organizational boundaries. We close the chapter with an application of USDL in the context of service engineering.

Relevance:

60.00%

Abstract:

This paper develops a framework for classifying term dependencies in query expansion with respect to the role terms play in structural linguistic associations. The framework is used to classify and compare the query expansion terms produced by the unigram and positional relevance models. As the unigram relevance model does not explicitly model term dependencies in its estimation process, it is often thought to ignore dependencies that exist between words in natural language. The framework presented in this paper is underpinned by two types of linguistic association, namely syntagmatic and paradigmatic associations. It was found that syntagmatic associations were the more prevalent form of linguistic association used in query expansion. Paradoxically, it was the unigram model that exhibited this association more than the positional relevance model. This surprising finding has two potential implications for information retrieval models: (1) if linguistic associations underpin query expansion, then a probabilistic term dependence assumption based on position is inadequate for capturing them; (2) the unigram relevance model captures more term dependency information than its underlying theoretical model suggests, so its normative position as a baseline that ignores term dependencies should perhaps be reviewed.

Relevance:

60.00%

Abstract:

Traditional recommendation methods recommend items to users; the items are inanimate and the recommendation is one-way. Emerging applications such as online dating or job recruitment require reciprocal people-to-people recommendations, which are animate and two-way. In this paper, we propose a reciprocal collaborative method based on the concepts of users' similarities and common neighbors. The dataset employed for the experiment was gathered from a real-life online dating network. The proposed method is compared with baseline methods that use traditional collaborative algorithms. Results show that the proposed method can achieve noticeably better performance than the baseline methods.

Relevance:

60.00%

Abstract:

This paper discusses users’ query reformulation behaviour while searching for information on the Web. Query reformulations have emerged as an important component of Web search behaviour and human-computer interaction (HCI) because a user’s success in information retrieval (IR) depends on how he or she formulates queries. There are various factors, such as cognitive styles, that influence users’ query reformulation behaviour. Understanding how users with different cognitive styles formulate their queries while performing Web searches can help HCI researchers and information systems (IS) developers to provide assistance to the users. This paper aims to examine the effects of users’ cognitive styles on their query reformulation behaviour. To achieve the goal of the study, a user study was conducted in which a total of 3613 search terms and 872 search queries were submitted by 50 users who engaged in 150 scenario-based search tasks. Riding’s (1991) Cognitive Style Analysis (CSA) test was used to assess users’ cognitive style as wholist or analytic, and verbaliser or imager. The study findings show that users’ query reformulation behaviour is affected by their cognitive styles. The results reveal that analytic users tended to prefer Add queries while all other users preferred New queries. A significant difference was found between wholists and analytics in the manner they performed Remove query reformulations. Future HCI researchers and IS developers can utilize the study results to develop interactive and user-centred search models, and to provide context-based query suggestions for users.

Relevance:

60.00%

Abstract:

Divergence from a random baseline is a technique for the evaluation of document clustering. It ensures that cluster quality measures are performing useful work by preventing ineffective clusterings, which provide no useful result, from receiving high scores. These concepts are defined and analysed using intrinsic and extrinsic approaches to the evaluation of document cluster quality. This includes the classical clusters-to-categories approach and a novel approach that uses ad hoc information retrieval. The divergence from a random baseline approach is able to differentiate ineffective clusterings encountered in the INEX XML Mining track. It also appears to perform a normalisation similar to the Normalised Mutual Information (NMI) measure, but it can be applied to any measure of cluster quality. When it is applied to the intrinsic measure of distortion as measured by RMSE, subtraction from a random baseline provides a clear optimum that is not apparent otherwise. This approach can be applied to any clustering evaluation. This paper describes its use in the context of document clustering evaluation.
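The subtraction of a random baseline can be illustrated with a small sketch. It assumes purity as the quality measure and builds the baseline by shuffling cluster assignments so that cluster sizes are preserved; both choices, and the function names, are assumptions for illustration rather than details taken from the abstract.

```python
import random
from collections import Counter


def purity(clusters, labels):
    """Fraction of documents that carry the majority label of their cluster."""
    by_cluster = {}
    for cluster_id, label in zip(clusters, labels):
        by_cluster.setdefault(cluster_id, Counter())[label] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(labels)


def divergence_from_random(clusters, labels, trials=100, seed=0):
    """Purity of a clustering minus the mean purity of shuffled assignments."""
    rng = random.Random(seed)
    shuffled = list(clusters)
    baseline = 0.0
    for _ in range(trials):
        rng.shuffle(shuffled)  # same cluster sizes, random membership
        baseline += purity(shuffled, labels)
    return purity(clusters, labels) - baseline / trials


labels = ["a", "a", "b", "b"]
print(divergence_from_random([0, 0, 1, 1], labels))  # meaningful clustering
print(divergence_from_random([0, 1, 2, 3], labels))  # one document per cluster
```

A clustering with one document per cluster gets perfect raw purity, yet its random baseline is equally perfect, so the divergence is zero. This is the kind of ineffective solution the technique is meant to expose.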

Relevance:

60.00%

Abstract:

In 2012 the existing eight disciplines of the Creative Industries Faculty, QUT combined with the School of Design (formerly a component of the Faculty of Built Environment and Engineering) to create a super faculty that includes the following disciplines: Architecture, Creative Writing & Literary Studies, Dance, Drama, Fashion, Film & Television, Industrial Design, Interior Design, Journalism, Media & Communication, Landscape Architecture, Music & Sound and Urban Design. The university’s research training unit AIRS (Advanced Information Retrieval Skills) is a systematic introduction to research-level information literacies. It is currently being redesigned to reflect today’s new data-intensive research environment and facilitate the capacity for life-long learning. Upon completion participants are expected to be able to: 1. Demonstrate an understanding of the theory of advanced search and evaluative strategies to efficiently yield appropriate resources to create original research. 2. Apply appropriate data management strategies to organise and utilize your information proficiently, ethically and legally. 3. Identify strategies to ensure best practice in the use of information sources, information technologies, information access tools and investigative methods. All Creative Industries Faculty research students must complete this unit, in which CI Librarians teach discipline-specific material. The library employs a team of research-specific experts as well as Liaison Librarians for each faculty.
Together they develop and deliver a generic research training program that provides researcher training in the following areas: Managing Research Data, QUT ePrints: New features for tracking your research impact, Tracking Research Impact, Research Students and the Library: Overview of Library Research Support Services, Technologies for Research Collaboration, Open Access Publishing, Greater Impact via Creative Commons Licence, CAMBIA - Navigating the patent literature, Uploading Publications to QUT ePrints Workshop, AIRS for supervisors, Finding Existing Research Data, Keeping up to date: Discovering and managing current awareness information and Getting Published. In 2011 Creative Industries initiated a new faculty-specific research training program to promote capacity building for research within the Faculty, with workshops designed and developed with Faculty Research Leaders, The Office of Research and Liaison Librarians. “Show me the money”, which assists staff to pursue alternative funding sources, was one such session that was well attended and generated much discussion and interest. Drop-in support sessions for ePrints, EndNote referencing software and Tracking Research Impact for the Creative Industries were also popular options on the menu. Liaison Librarians continue to provide one-on-one consultations with individual researchers as requested. This service greatly assists Librarians in getting to know their researchers and monitoring their changing needs. The CI Faculty has enlisted two Research Leaders, one for each of the two Schools (Design and Media, Entertainment & Creative Arts), whose role it is to mentor newer research staff. Similarly, within the CI library liaison team one librarian is assigned the role of Research Coordinator, whose responsibility it is to be the primary liaison with the Assistant Dean, Research and other key Faculty research managers, and who is the one most likely to attend Faculty committees and meetings relating to research support.

Relevance:

60.00%

Abstract:

Success of query reformulation and relevant information retrieval depends on many factors, such as users’ prior knowledge, age, gender, and cognitive styles. One of the important factors that affect a user’s query reformulation behaviour is the nature of the search tasks. Few studies have examined the impact of search task types on query reformulation behaviour during Web searches. This paper examines how the nature of the search tasks affects users’ query reformulation behaviour during information searching. The paper reports empirical results from a user study in which 50 participants performed a set of three Web search tasks – exploratory, factual and abstract. Users’ interactions with search engines were logged by using a monitoring program. 872 unique search queries were classified into five query types – New, Add, Remove, Replace and Repeat. Users submitted fewer queries for the factual task, which accounted for 26% of the total queries. They submitted a higher number of queries (40% of the total queries) while carrying out the exploratory task. A one-way MANOVA test indicated a significant effect of search task types on users’ query reformulation behaviour. In particular, the search task types influenced the manner in which users reformulated the New and Repeat queries.
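The abstract names the five query types but not the rules used to assign them. A minimal sketch, assuming the types are decided by comparing the term sets of consecutive queries, could look like this; the rules and the function names are hypothetical and are not the study's coding scheme.

```python
def classify_reformulation(previous, current):
    """Label a query relative to the one before it (assumed rules).

    New: first query, or no terms shared with the previous query.
    Repeat: same set of terms. Add: only new terms added.
    Remove: only terms dropped. Replace: some terms kept, some swapped.
    """
    if previous is None:
        return "New"
    prev_terms = set(previous.lower().split())
    curr_terms = set(current.lower().split())
    if curr_terms == prev_terms:
        return "Repeat"
    if not (curr_terms & prev_terms):
        return "New"
    if curr_terms > prev_terms:
        return "Add"
    if curr_terms < prev_terms:
        return "Remove"
    return "Replace"


def label_session(queries):
    """Label every query in a session in submission order."""
    labels = []
    previous = None
    for query in queries:
        labels.append(classify_reformulation(previous, query))
        previous = query
    return labels


session = ["cheap flights", "cheap flights paris", "flights paris", "hotels paris"]
print(label_session(session))
```

Counting these labels per task type would give the kind of table on which a test such as the MANOVA mentioned above could be run.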