915 results for Open source information retrieval
Abstract:
Traditional information retrieval (IR) systems respond to user queries with ranked lists of relevant documents. The separation of content and structure in XML documents allows individual XML elements to be selected in isolation. Thus, users expect XML-IR systems to return highly relevant results that are more precise than entire documents. In this paper we describe the implementation of a search engine for XML document collections. The system is keyword based and is built upon an XML inverted file system. We describe the approach that was adopted to meet the requirements of Content Only (CO) and Vague Content and Structure (VCAS) queries in INEX 2004.
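The idea of an inverted file whose postings point at individual XML elements rather than whole documents can be sketched as follows. This is only an illustration under assumptions: the function names, the path notation, and the two toy documents are hypothetical, and the abstract does not describe the actual file layout or ranking used in the INEX 2004 system.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_element_index(docs):
    """Map each lowercased term to the set of (doc_id, element_path) containing it."""
    index = defaultdict(set)

    def walk(doc_id, elem, path):
        # Index the text that appears directly inside this element.
        for term in (elem.text or "").lower().split():
            index[term].add((doc_id, path))
        sibling_counts = defaultdict(int)
        for child in elem:
            sibling_counts[child.tag] += 1
            child_path = f"{path}/{child.tag}[{sibling_counts[child.tag]}]"
            walk(doc_id, child, child_path)
            # Text following a child (its tail) still belongs to the parent element.
            for term in (child.tail or "").lower().split():
                index[term].add((doc_id, path))

    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        walk(doc_id, root, f"/{root.tag}[1]")
    return index


def search(index, query):
    """Return the elements that contain every query term (boolean AND), sorted."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Toy collection (hypothetical content).
DOCS = {
    "d1": "<article><title>XML retrieval</title><sec><p>inverted file for XML</p></sec></article>",
    "d2": "<article><title>Peer networks</title><sec><p>inverted index</p></sec></article>",
}
INDEX = build_element_index(DOCS)
HITS = search(INDEX, "inverted xml")
```

Because postings identify elements, a keyword query returns the matching paragraph of `d1` rather than the whole article, which is the kind of focused result the abstract says users expect.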
Abstract:
Peer to peer systems have been widely used on the Internet. However, most peer to peer information systems still lack some important features, for example cross-language IR (Information Retrieval) and collection selection / fusion features. Cross-language IR is a state-of-the-art research area in the IR research community. It has not been used in any real world IR systems yet. Cross-language IR makes it possible to issue a query in one language and receive documents in other languages. In a typical peer to peer environment, users come from multiple countries, and their collections are in multiple languages. Cross-language IR can help users to find documents more easily. For example, many Chinese researchers search for research papers in both Chinese and English. With cross-language IR, they can issue one query in Chinese and get documents in both languages. The Out Of Vocabulary (OOV) problem is one of the key research areas in cross-language information retrieval. In recent years, web mining has been shown to be one of the effective approaches to solving this problem. However, how to extract Multiword Lexical Units (MLUs) from web content and how to select the correct translations from the extracted candidate MLUs are still two difficult problems in web mining based automated translation approaches. Discovering resource descriptions and merging results obtained from remote search engines are two key issues in distributed information retrieval studies. In uncooperative environments, query-based sampling and normalized-score based merging strategies are well-known approaches to solving such problems. However, such approaches consider only the content of the remote database and not the retrieval performance of the remote search engine. This thesis presents research on building a peer to peer IR system with cross-language IR and advanced collection profiling techniques for fusion features.
In particular, this thesis first presents a new Chinese term measurement and a new Chinese MLU extraction process that works well on small corpora. An approach to selecting MLUs more accurately is also presented. After that, this thesis proposes a collection profiling strategy which can discover not only collection content but also the retrieval performance of the remote search engine. Based on collection profiling, a web-based query classification method and two collection fusion approaches are developed and presented in this thesis. Our experiments show that the proposed strategies are effective in merging results in uncooperative peer to peer environments. Here, an uncooperative environment is defined as one in which each peer in the system is autonomous: peers are willing to share documents but do not share collection statistics. This is a typical peer to peer IR environment. Finally, all these approaches are combined to build a secure peer to peer multilingual IR system that cooperates through X.509 and the email system.
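The normalized-score merging that the abstract cites as a well-known baseline can be sketched as below. It uses min-max normalization per peer. The optional `weights` argument is a hypothetical stand-in for a per-peer estimate of retrieval performance; the thesis's actual profiling and fusion methods are not described in the abstract and are not reproduced here.

```python
def min_max_normalize(results):
    """Rescale one peer's (doc_id, score) list to the range [0, 1]."""
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: treat every document as equally good.
        return [(doc, 1.0) for doc, _ in results]
    return [(doc, (score - lo) / (hi - lo)) for doc, score in results]


def merge(result_lists, weights=None):
    """Merge per-peer result lists into one ranking of (doc_id, score, peer).

    weights is an assumed mapping from peer name to a multiplier; peers
    without an entry get 1.0.
    """
    merged = []
    for peer, results in result_lists.items():
        if not results:
            continue
        weight = 1.0 if weights is None else weights.get(peer, 1.0)
        for doc, score in min_max_normalize(results):
            merged.append((doc, weight * score, peer))
    return sorted(merged, key=lambda row: row[1], reverse=True)


# Raw scores from two peers are on incomparable scales (toy values).
RESULTS = {
    "peer_a": [("a1", 12.0), ("a2", 6.0), ("a3", 3.0)],
    "peer_b": [("b1", 0.9), ("b2", 0.5)],
}
MERGED = merge(RESULTS, weights={"peer_a": 1.0, "peer_b": 0.5})
TOP3 = [doc for doc, _, _ in MERGED[:3]]
```

With equal weights, the top document of every peer ties at 1.0 regardless of how good that peer's engine is, which illustrates the limitation the abstract points out for content-only approaches.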
Abstract:
Key topics: Since the birth of the Open Source movement in the mid-1980s, open source software has become more and more widespread. Amongst others, the Linux operating system, the Apache web server and the Firefox web browser have taken substantial market share from their proprietary competitors. Open source software is governed by particular types of licenses. Whereas proprietary licenses only allow the software's use in exchange for a fee, open source licenses grant users more rights, such as the free use, free copy, free modification and free distribution of the software, as well as free access to the source code. This new phenomenon has raised many managerial questions: organizational issues related to the system of governance that underlies such open source communities (Raymond, 1999a; Lerner and Tirole, 2002; Lee and Cole 2003; Mockus et al. 2000; Tuomi, 2000; Demil and Lecocq, 2006; O'Mahony and Ferraro, 2007; Fleming and Waguespack, 2007), collaborative innovation issues (Von Hippel, 2003; Von Krogh et al., 2003; Von Hippel and Von Krogh, 2003; Dahlander, 2005; Osterloh, 2007; David, 2008), issues related to the nature as well as the motivations of developers (Lerner and Tirole, 2002; Hertel, 2003; Dahlander and McKelvey, 2005; Jeppesen and Frederiksen, 2006), public policy and innovation issues (Jullien and Zimmermann, 2005; Lee, 2006), technological competition issues related to standards battles between proprietary and open source software (Bonaccorsi and Rossi, 2003; Bonaccorsi et al. 2004; Economides and Katsamakas, 2005; Chen, 2007), intellectual property rights and licensing issues (Laat 2005; Lerner and Tirole, 2005; Gambardella, 2006; Determann et al., 2007). A major unresolved issue concerns open source business models and revenue capture, given that open source licenses imply no fee for users.
On this topic, articles show that a commercial activity based on open source software is possible, as they describe different possible ways of doing business around open source (Raymond, 1999; Dahlander, 2004; Daffara, 2007; Bonaccorsi and Merito, 2007). These studies usually look at open source-based companies. Open source-based companies encompass a wide range of firms with different categories of activities: providers of packaged open source solutions, IT Services & Software Engineering firms and open source software publishers. However, business model implications are different for each of these categories: the activities of providers of packaged solutions and of IT Services & Software Engineering firms are based on software developed outside their boundaries, whereas commercial software publishers sponsor the development of the open source software. This paper focuses on open source software publishers' business models, as this issue is even more crucial for this category of firms, which take the risk of investing in the development of the software. Finally, the literature identifies and depicts only two generic types of business models for open source software publishers: the business models of ''bundling'' (Pal and Madanmohan, 2002; Dahlander 2004) and the dual licensing business models (Välimäki, 2003; Comino and Manenti, 2007). Nevertheless, these business models are not applicable in all circumstances. Methodology: The objectives of this paper are: (1) to explore in which contexts the two generic business models described in the literature can be implemented successfully and (2) to depict an additional business model for open source software publishers which can be used in a different context. To do so, this paper draws upon an explorative case study of IdealX, a French open source security software publisher. This case study consists of a series of three interviews conducted between February 2005 and April 2006 with the co-founder and the business manager.
It aims at depicting the process of IdealX's search for the appropriate business model between its creation in 2000 and 2006. This software publisher tried both generic types of open source software publishers' business models before designing its own. Consequently, through IdealX's trials and errors, I investigate the conditions under which such generic business models can be effective. Moreover, this study describes the business model finally designed and adopted by IdealX: an additional open source software publisher's business model based on the principle of ''mutualisation'', which is applicable in a different context. Results and implications: Finally, this article contributes to ongoing empirical work within entrepreneurship and strategic management on open source software publishers' business models: it provides the characteristics of three generic business models (the business model of bundling, the dual licensing business model and the business model of mutualisation) as well as the conditions under which they can be successfully implemented (regarding the type of product developed and the competencies of the firm). This paper also goes beyond the traditional concept of business model used by scholars in the open source related literature. In this article, a business model is not only considered as a way of generating income (''revenue model'' (Amit and Zott, 2001)), but rather as the necessary conjunction of value creation and value capture, according to the recent literature about business models (Amit and Zott, 2001; Chesbrough and Rosenbloom, 2002; Teece, 2007). Consequently, this paper analyses the business models from the point of view of these two components.
Abstract:
The increasing diversity of the Internet has created a vast number of multilingual resources on the Web. A huge number of these documents are written in languages other than English. Consequently, the demand for searching in non-English languages is growing exponentially. It is desirable that a search engine can search for information over collections of documents in other languages. This research investigates techniques for developing high-quality Chinese information retrieval systems. A distinctive feature of Chinese text is that a Chinese document is a sequence of Chinese characters with no space or boundary between Chinese words. This feature makes Chinese information retrieval more difficult, since a retrieved document which contains the query term as a sequence of Chinese characters may not be really relevant to the query: the query term (as a sequence of Chinese characters) may not be a valid Chinese word in that document. On the other hand, a document that is actually relevant may not be retrieved because it does not contain the query sequence but contains other relevant words. In this research, we propose two approaches to deal with these problems. In the first approach, we propose a hybrid Chinese information retrieval model that incorporates word-based techniques with the traditional character-based techniques. The aim of this approach is to investigate the influence of Chinese segmentation on the performance of Chinese information retrieval. Two ranking methods are proposed to rank retrieved documents based on their relevance to the query, calculated by combining character-based ranking and word-based ranking. Our experimental results show that Chinese segmentation can improve the performance of Chinese information retrieval, but the improvement is not significant if only Chinese segmentation is incorporated with the traditional character-based approach.
In the second approach, we propose a novel query expansion method which applies text mining techniques in order to find the most relevant words to extend the query. Most existing query expansion methods generally select highly frequent indexing terms from the retrieved documents to expand the query. In our approach, by contrast, we utilize text mining techniques to find patterns from the retrieved documents that highly correlate with the query term and then use the relevant words in the patterns to expand the original query. This research project develops and implements a Chinese information retrieval system for evaluating the proposed approaches. There are two stages in the experiments. The first stage investigates whether high accuracy segmentation can improve Chinese information retrieval. In the second stage, a text mining based query expansion approach is implemented and a further experiment compares the performance of the standard Rocchio approach with that of the proposed text mining based query expansion method. The NTCIR-5 Chinese collections are used in the experiments. The experimental results show that by incorporating the text mining based query expansion with the hybrid model, significant improvement is achieved in both precision and recall.
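For reference, the standard Rocchio approach used as the comparison baseline above can be sketched as follows. The sketch works on sparse term-weight dictionaries. The parameter values and the toy vectors are assumptions chosen for illustration; the abstract does not state the settings used in the experiments, and the proposed text mining based method is not reproduced here.

```python
from collections import Counter


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: alpha*q + beta*mean(relevant) - gamma*mean(non_relevant).

    Every vector is a dict mapping a term to its weight. Terms whose final
    weight is not positive are dropped from the expanded query.
    """
    expanded = Counter()
    for term, weight in query.items():
        expanded[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            expanded[term] += beta * weight / len(relevant)
    for doc in non_relevant:
        for term, weight in doc.items():
            expanded[term] -= gamma * weight / len(non_relevant)
    return {term: weight for term, weight in expanded.items() if weight > 0}


# Toy example with hypothetical weights.
QUERY = {"retrieval": 1.0}
RELEVANT = [
    {"retrieval": 1.0, "chinese": 2.0},
    {"chinese": 2.0, "segmentation": 1.0},
]
NON_RELEVANT = [{"weather": 4.0}]
EXPANDED = rocchio(QUERY, RELEVANT, NON_RELEVANT, alpha=1.0, beta=0.5, gamma=0.25)
```

Here the expansion terms are simply those that carry weight in the documents judged relevant, which is the frequency-driven behaviour that the abstract contrasts with its pattern-based selection.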
Abstract:
Information Retrieval is an important albeit imperfect component of information technologies. The problem of insufficient diversity of retrieved documents is one of the primary issues studied in this research. This study shows that this problem leads to a decrease in precision and recall, the traditional measures of information retrieval effectiveness. This thesis presents an adaptive IR system based on the theory of adaptive dual control. The aim of the approach is the optimization of retrieval precision after all feedback has been issued. This is done by increasing the diversity of retrieved documents. This study shows that the value of recall reflects this diversity. The Probability Ranking Principle is viewed in the literature as the “bedrock” of current probabilistic Information Retrieval theory. Neither the proposed approach nor other methods of diversification of retrieved documents from the literature conform to this principle. This study shows by counterexample that the Probability Ranking Principle does not in general lead to optimal precision in a search session with feedback (for which it may not have been designed but is actively used). Retrieval precision of the search session should be optimized with a multistage stochastic programming model to accomplish the aim. However, such models are computationally intractable. Therefore, approximate linear multistage stochastic programming models are derived in this study, where the multistage improvement of the probability distribution is modelled using the proposed feedback correctness method. The proposed optimization models are based on several assumptions, starting with the assumption that Information Retrieval is conducted in units of topics. The use of clusters is the primary reason why a new method of probability estimation is proposed. The adaptive dual control of a topic-based IR system was evaluated in a series of experiments conducted on the Reuters, Wikipedia and TREC collections of documents.
The Wikipedia experiment revealed that the dual control feedback mechanism improves precision and S-recall when all the underlying assumptions are satisfied. In the TREC experiment, this feedback mechanism was compared to a state-of-the-art adaptive IR system based on BM-25 term weighting and the Rocchio relevance feedback algorithm. The baseline system exhibited better effectiveness than the cluster-based optimization model of ADTIR. The main reason for this was insufficient quality of the generated clusters in the TREC collection that violated the underlying assumption.
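The BM25 term weighting behind the TREC baseline can be sketched as below. This is one common form of Okapi BM25, with the non-negative idf variant `log(1 + (N - df + 0.5) / (df + 0.5))`. The values `k1=1.2` and `b=0.75` and the toy documents are assumptions; the abstract does not give the baseline's configuration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs (a non-empty list) against the query."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Toy collection (hypothetical content).
DOCS = [
    ["adaptive", "retrieval", "feedback"],
    ["retrieval", "retrieval", "clusters", "topics"],
    ["stock", "market", "news"],
]
SCORES = bm25_scores(["retrieval", "feedback"], DOCS)
```

In this toy run the first document ranks highest because it matches both query terms, one of which is rare in the collection. A relevance feedback step such as Rocchio would then reweight the query terms before the next round of scoring.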
Abstract:
Most information retrieval (IR) models treat the presence of a term within a document as an indication that the document is somehow "about" that term; they do not take into account when a term might be explicitly negated. Medical data, by its nature, contains a high frequency of negated terms, e.g. "review of systems showed no chest pain or shortness of breath". This paper presents a study of the effects of negation on information retrieval. We present a number of experiments to determine whether negation has a significant negative effect on IR performance and whether language models that take negation into account might improve performance. We use a collection of real medical records as our test corpus. Our findings are that negation has some effect on system performance, but this will likely be confined to domains such as medical data where negation is prevalent.
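One simple way to make an indexer aware of negation is to separate affirmed from negated terms before indexing. The rule-based sketch below is an assumption for illustration only: the trigger and terminator word lists are hypothetical and far from complete, and the abstract does not say that the paper detects negation this way.

```python
import re

# Hypothetical, deliberately tiny word lists.
NEGATION_TRIGGERS = {"no", "not", "denies", "without"}
SCOPE_TERMINATORS = {"but", "however", "except"}


def split_negated(text):
    """Return (affirmed_terms, negated_terms) for a piece of clinical text.

    A trigger word opens a negation scope that lasts until a terminator word
    or the end of the sentence. Trigger and terminator words are not returned.
    """
    affirmed, negated = [], []
    for sentence in re.split(r"[.;!?]", text.lower()):
        in_scope = False
        for token in re.findall(r"[a-z]+", sentence):
            if token in NEGATION_TRIGGERS:
                in_scope = True
            elif token in SCOPE_TERMINATORS:
                in_scope = False
            elif in_scope:
                negated.append(token)
            else:
                affirmed.append(token)
    return affirmed, negated


NOTE = "Review of systems showed no chest pain or shortness of breath. Patient reports cough."
AFFIRMED, NEGATED = split_negated(NOTE)
```

An index built only from `AFFIRMED` would no longer return this record for the query "chest pain". Negated terms could instead be kept in a separate field and down-weighted.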
Abstract:
A distinctive feature of Chinese text is that a Chinese document is a sequence of Chinese characters with no space or boundary between Chinese words. This feature makes Chinese information retrieval more difficult, since a retrieved document which contains the query term as a sequence of Chinese characters may not be really relevant to the query: the query term (as a sequence of Chinese characters) may not be a valid Chinese word in that document. On the other hand, a document that is actually relevant may not be retrieved because it does not contain the query sequence but contains other relevant words. In this research, we propose a hybrid Chinese information retrieval model that incorporates word-based techniques with the traditional character-based techniques. The aim of this approach is to investigate the influence of Chinese segmentation on the performance of Chinese information retrieval. Two ranking methods are proposed to rank retrieved documents based on their relevance to the query, calculated by combining character-based ranking and word-based ranking. Our experimental results show that Chinese segmentation can improve the performance of Chinese information retrieval, but the improvement is not significant if only Chinese segmentation is incorporated with the traditional character-based approach.
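The mismatch between character-sequence matching and word matching, and one way of combining the two kinds of evidence, can be illustrated with a small sketch. The linear interpolation and the weight `lam` are assumptions, as are the pre-segmented toy documents; the abstract does not specify its two ranking methods or which segmenter it uses.

```python
def char_score(query, doc_words):
    """Character-based evidence: occurrences of the query string in the
    unsegmented text, ignoring word boundaries."""
    return "".join(doc_words).count(query)


def word_score(query, doc_words):
    """Word-based evidence: number of segmented words equal to the query."""
    return doc_words.count(query)


def hybrid_score(query, doc_words, lam=0.5):
    """Linear interpolation of character-based and word-based evidence."""
    return lam * char_score(query, doc_words) + (1 - lam) * word_score(query, doc_words)


# Documents are assumed to be already segmented by some external segmenter.
DOCS = {
    "d1": ["她", "穿", "和服"],    # "she wears a kimono"
    "d2": ["产品", "和", "服务"],  # "products and services"
}
QUERY = "和服"  # "kimono"
RANKED = sorted(DOCS, key=lambda d: hybrid_score(QUERY, DOCS[d]), reverse=True)
```

Both documents contain the character sequence 和服, but only `d1` contains it as a word; in `d2` it straddles the boundary between 和 ("and") and 服务 ("services"). The word-based component therefore pushes `d1` above `d2`.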
Abstract:
This paper presents a framework for evaluating information retrieval of medical records. We use the BLULab corpus, a large collection of real-world de-identified medical records. The collection has been hand coded by clinical terminologists using the ICD-9 medical classification system. The ICD codes are used to devise queries and relevance judgements for this collection. Results of initial test runs using a baseline IR system are provided. Queries and relevance judgements are online to aid further research in medical IR. Please visit: http://koopman.id.au/med_eval.
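A minimal sketch of how hand-assigned codes could be turned into relevance judgements and used to score a ranking is given below. It assumes that a record counts as relevant to a code-derived query exactly when it carries that code. The record identifiers, the code assignments, and the ranking are made up for illustration, and the paper's actual procedure may differ.

```python
def qrels_from_codes(doc_codes):
    """Invert a doc_id -> set-of-codes mapping into code -> set of relevant doc_ids."""
    qrels = {}
    for doc_id, codes in doc_codes.items():
        for code in codes:
            qrels.setdefault(code, set()).add(doc_id)
    return qrels


def precision_recall_at_k(ranking, relevant, k):
    """Precision and recall over the top k documents of a ranking."""
    hits = sum(1 for doc_id in ranking[:k] if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical coded records; the codes are illustrative only.
DOC_CODES = {
    "r1": {"786.50"},
    "r2": {"486"},
    "r3": {"786.50", "486"},
    "r4": {"401.9"},
}
QRELS = qrels_from_codes(DOC_CODES)

# Hypothetical system output for the query devised from code 786.50.
RANKING = ["r3", "r4", "r1", "r2"]
P2, R2 = precision_recall_at_k(RANKING, QRELS["786.50"], k=2)
P4, R4 = precision_recall_at_k(RANKING, QRELS["786.50"], k=4)
```

Deriving judgements from existing codes avoids a separate manual assessment pass, at the cost of inheriting any omissions in the original coding.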