170 results for query
Abstract:
Existing multi-model approaches for image set classification extract local models by clustering each image set individually only once, with fixed clusters used for matching with other image sets. However, this may result in the two closest clusters representing different characteristics of an object, due to undesirable environmental conditions (such as variations in illumination and pose). To address this problem, we propose to constrain the clustering of each query image set by forcing the clusters to have resemblance to the clusters in the gallery image sets. We first define a Frobenius norm distance between subspaces over Grassmann manifolds based on reconstruction error. We then extract local linear subspaces from a gallery image set via sparse representation. For each local linear subspace, we adaptively construct the corresponding closest subspace from the samples of a probe image set by joint sparse representation. We show that by minimising the sparse representation reconstruction error, we approach the nearest point on a Grassmann manifold. Experiments on the Honda, ETH-80 and Cambridge-Gesture datasets show that the proposed method consistently outperforms several other recent techniques, such as Affine Hull based Image Set Distance (AHISD), Sparse Approximated Nearest Points (SANP) and Manifold Discriminant Analysis (MDA).
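To make the geometry concrete, the sketch below computes a Frobenius-norm distance between two linear subspaces via their projection matrices, a standard Grassmann-manifold distance. It is an illustrative approximation of the kind of subspace distance discussed above, not the authors' implementation; the dimensions and data are invented.

```python
# Illustrative sketch: projection-Frobenius distance between two subspaces,
# a standard Grassmann-manifold metric (not the paper's exact formulation).
import numpy as np

def grassmann_proj_frobenius_distance(X1, X2):
    """X1, X2: (d, k) matrices whose columns span two k-dimensional subspaces."""
    U1, _ = np.linalg.qr(X1)   # orthonormal basis of the first subspace
    U2, _ = np.linalg.qr(X2)   # orthonormal basis of the second subspace
    P1 = U1 @ U1.T             # projection matrix onto subspace 1
    P2 = U2 @ U2.T             # projection matrix onto subspace 2
    return np.linalg.norm(P1 - P2, 'fro') / np.sqrt(2)

# toy usage: two random 3-dimensional subspaces of R^50
rng = np.random.default_rng(0)
d = grassmann_proj_frobenius_distance(rng.standard_normal((50, 3)),
                                      rng.standard_normal((50, 3)))
```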
Abstract:
We present a study to understand the effect that negated terms (e.g., "no fever") and family history (e.g., "family history of diabetes") have on searching clinical records. Our analysis is aimed at devising the most effective means of handling negation and family history. In doing so, we explicitly represent a clinical record according to its different content types: negated, family history and normal content; the retrieval model weights each of these separately. Empirical evaluation shows that overall the presence of negation harms retrieval effectiveness, while family history has little effect. We show negation is best handled by weighting negated content (rather than the common practice of removing or replacing it). However, we also show that many queries benefit from the inclusion of negated content and that negation is optimally handled on a per-query basis. Additional evaluation shows that adaptive handling of negated and family history content can have significant benefits.
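As a rough illustration of weighting content types separately, the following sketch combines per-content-type retrieval scores with tunable weights. The field names, weights and score values are assumptions for illustration, not the paper's actual retrieval model.

```python
# Minimal sketch (assumed field names and weights): score a clinical record by
# combining separate retrieval scores for its normal, negated and
# family-history content.
def combined_score(field_scores, w_normal=1.0, w_negated=0.3, w_family=0.1):
    """field_scores: per-content-type scores,
    e.g. {'normal': 5.2, 'negated': 1.1, 'family_history': 0.4}."""
    return (w_normal * field_scores.get('normal', 0.0)
            + w_negated * field_scores.get('negated', 0.0)
            + w_family * field_scores.get('family_history', 0.0))
```

Per-query handling, as the study suggests, would amount to tuning these weights for each query rather than fixing them globally.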
Abstract:
The top-k retrieval problem aims to find the optimal set of k documents from a number of relevant documents given the user’s query. The key issue is to balance the relevance and diversity of the top-k search results. In this paper, we address this problem using Facility Location Analysis taken from Operations Research, where the locations of facilities are optimally chosen according to some criteria. We show how this analysis technique is a generalization of state-of-the-art retrieval models for diversification (such as the Modern Portfolio Theory for Information Retrieval), which treat the top-k search results like “obnoxious facilities” that should be dispersed as far as possible from each other. However, Facility Location Analysis suggests that the top-k search results could be treated like “desirable facilities” to be placed as close as possible to their customers. This leads to a new top-k retrieval model where the best representatives of the relevant documents are selected. In a series of experiments conducted on two TREC diversity collections, we show that significant improvements can be made over the current state-of-the-art through this alternative treatment of the top-k retrieval problem.
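The "desirable facilities" view can be sketched as a greedy facility-location selection: pick k documents so that the remaining candidates (the "customers") sit close to some selected document. The code below is a generic greedy k-median heuristic under that assumption, not the specific model evaluated in the paper.

```python
# Illustrative greedy facility-location sketch: select k documents (facilities)
# that minimise the total distance from every candidate document (customer)
# to its nearest selected facility. Distances are placeholders.
import numpy as np

def greedy_facility_topk(dist, k):
    """dist: (n, n) pairwise distance matrix over candidate documents."""
    n = dist.shape[0]
    selected = []
    nearest = np.full(n, np.inf)   # distance to nearest selected facility
    for _ in range(k):
        best_doc, best_cost = None, np.inf
        for d in range(n):
            if d in selected:
                continue
            cost = np.minimum(nearest, dist[:, d]).sum()
            if cost < best_cost:
                best_doc, best_cost = d, cost
        selected.append(best_doc)
        nearest = np.minimum(nearest, dist[:, best_doc])
    return selected
```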
Abstract:
In the TREC Web Diversity track, novelty-biased cumulative gain (α-NDCG) is one of the official measures to assess retrieval performance of IR systems. The measure is characterised by a parameter, α, the effect of which has not been thoroughly investigated. We find that common settings of α, i.e. α=0.5, may prevent the measure from behaving as desired when evaluating result diversification. This is because it excessively penalises systems that cover many intents while it rewards those that redundantly cover only a few intents. This issue is crucial since it highly influences systems at top ranks. We revisit our previously proposed threshold, suggesting α be set on a per-query basis. The intuitiveness of the measure is then studied by examining actual rankings from TREC 09-10 Web track submissions. By varying α according to our query-based threshold, the discriminative power of α-NDCG is not harmed and, in fact, our approach improves α-NDCG's robustness. Experimental results show that the threshold for α can make the measure more intuitive than its common settings.
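For reference, a minimal implementation of the (unnormalised) α-DCG gain, following the standard formulation in which a document's gain for an intent is discounted by (1-α) for every earlier document already covering that intent; the normalisation by an ideal ranking, found greedily, is omitted for brevity.

```python
# Sketch of the alpha-DCG computation (standard formulation, normalisation omitted).
import math

def alpha_dcg(ranking, alpha=0.5, depth=10):
    """ranking: list of sets, each the intents covered by the document at that rank."""
    seen = {}            # intent -> number of earlier documents covering it
    dcg = 0.0
    for rank, intents in enumerate(ranking[:depth], start=1):
        gain = sum((1 - alpha) ** seen.get(i, 0) for i in intents)
        dcg += gain / math.log2(rank + 1)
        for i in intents:
            seen[i] = seen.get(i, 0) + 1
    return dcg
```

With α near 1, redundant coverage of an already-covered intent contributes almost nothing, which is why the choice of α so strongly affects how systems are ranked.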
Abstract:
In this paper, we consider the problem of document ranking in a non-traditional retrieval task, called subtopic retrieval. This task involves promoting relevant documents that cover many subtopics of a query at early ranks, thus providing diversity within the ranking. In recent years, several approaches have been proposed to diversify retrieval results. These approaches can be classified into two main paradigms, depending upon how the ranks of documents are revised for promoting diversity. In the first paradigm, subtopic diversification is achieved implicitly, by choosing documents that are different from each other, while in the second it is done explicitly, by estimating the subtopics covered by documents. Within this context, we compare methods belonging to the two paradigms. Furthermore, we investigate possible strategies for integrating the two paradigms with the aim of formulating a new ranking method for subtopic retrieval. We conduct a number of experiments to empirically validate and contrast the state-of-the-art approaches as well as instantiations of our integration approach. The results show that the integration approach outperforms state-of-the-art strategies with respect to a number of measures.
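A compact sketch contrasting the two paradigms, using simplified MMR-style (implicit) and xQuAD-style (explicit) scoring functions; the estimators rel, sim, p_s and p_d_s and the trade-off parameter lam are placeholders, and this is not the paper's exact instantiation.

```python
import math

def mmr_score(d, selected, rel, sim, lam=0.5):
    """Implicit diversification (MMR-style): reward relevance, penalise
    similarity to documents already selected."""
    max_sim = max((sim(d, s) for s in selected), default=0.0)
    return lam * rel(d) - (1 - lam) * max_sim

def xquad_score(d, selected, subtopics, rel, p_s, p_d_s, lam=0.5):
    """Explicit diversification (xQuAD-style): reward coverage of subtopics
    not yet covered by the selected documents."""
    diversity = sum(
        p_s(s) * p_d_s(d, s) * math.prod(1.0 - p_d_s(d2, s) for d2 in selected)
        for s in subtopics)
    return (1 - lam) * rel(d) + lam * diversity
```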
Abstract:
The location of previously unseen and unregistered individuals in complex camera networks from semantic descriptions is a time-consuming and often inaccurate process carried out by human operators, or security staff on the ground. To promote the development and evaluation of automated semantic description based localisation systems, we present a new, publicly available, unconstrained 110-sequence database, collected from 6 stationary cameras. Each sequence contains detailed semantic information for a single search subject who appears in the clip (gender, age, height, build, hair and skin colour, clothing type, texture and colour), and between 21 and 290 frames for each clip are annotated with the target subject location (over 11,000 frames are annotated in total). A novel approach for localising a person given a semantic query is also proposed and demonstrated on this database. The proposed approach incorporates clothing colour and type (for clothing worn below the waist), as well as height and build, to detect people. A method to assess the quality of candidate regions, as well as a symmetry-driven approach to aid in modelling clothing on the lower half of the body, is proposed within this approach. An evaluation on the proposed dataset shows that a relative improvement in localisation accuracy of up to 21% is achieved over the baseline technique.
Abstract:
During the early design stages of construction projects, accurate and timely cost feedback is critical to design decision making. This is particularly challenging for cost estimators, as they must quickly and accurately estimate the cost of the building when the design is still incomplete and evolving. State-of-the-art software tools typically use a rule-based approach to generate detailed quantities from the design details present in a building model and relate them to the cost items in a cost estimating database. In this paper, we propose a generic approach for creating and maintaining a cost estimate using flexible mappings between a building model and a cost estimate. The approach uses queries on the building design to populate views, and each view is then associated with one or more cost items. The benefit of this approach is that the flexibility of modern query languages allows the estimator to encode a broad variety of relationships between the design and the estimate. It also avoids the use of a common standard to which both designers and estimators must conform, giving the estimator added flexibility and functionality in their work.
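A toy illustration of the query-to-view-to-cost-item mapping, with invented element attributes, cost codes and rates: a query over building model elements populates a view, and the view's quantity drives an estimate line.

```python
# Hypothetical sketch: a "query" over building model elements populates a view,
# and the view is associated with a cost item whose rate prices the quantity.
walls = [
    {"type": "Wall", "material": "concrete", "area_m2": 42.0},
    {"type": "Wall", "material": "brick",    "area_m2": 18.5},
]

# query on the design: select concrete walls -> a view
concrete_wall_view = [e for e in walls
                      if e["type"] == "Wall" and e["material"] == "concrete"]

# view -> cost item mapping: the view's total quantity drives the estimate line
cost_item = {"code": "03-3000", "description": "Cast-in-place concrete wall",
             "unit": "m2", "rate": 185.0}
quantity = sum(e["area_m2"] for e in concrete_wall_view)
line_total = quantity * cost_item["rate"]
```

Because the mapping is just a query, the estimator can change how design elements relate to cost items without requiring the designer to model to a shared standard.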
Abstract:
This paper reports on the 2nd ShARe/CLEFeHealth evaluation lab, which continues our evaluation resource building activities for the medical domain. In this lab we focus on patients' information needs, as opposed to the more common campaign focus of the specialised information needs of physicians and other healthcare workers. The usage scenario of the lab is to ease patients' and next-of-kins' understanding of eHealth information, in particular clinical reports. The 1st ShARe/CLEFeHealth evaluation lab was held in 2013; it consisted of three tasks. Task 1 focused on named entity recognition and normalization of disorders; Task 2 on normalization of acronyms/abbreviations; and Task 3 on information retrieval to address questions patients may have when reading clinical reports. This year's lab introduces a new challenge in Task 1 on visual-interactive search and exploration of eHealth data. Its aim is to help patients (or their next-of-kin) with readability issues related to their hospital discharge documents and related information search on the Internet. Task 2 then continues the information extraction work of the 2013 lab, specifically focusing on disorder attribute identification and normalization from clinical text. Finally, this year's Task 3 further extends the 2013 information retrieval task, by cleaning the 2013 document collection and introducing a new query generation method and multilingual queries. De-identified clinical reports used by the three tasks were from US intensive care and originated from the MIMIC II database. Other text documents for Tasks 1 and 3 were from the Internet and originated from the Khresmoi project. Task 2 annotations originated from the ShARe annotations. For Tasks 1 and 3, new annotations, queries, and relevance assessments were created. 50, 79, and 91 people registered their interest in Tasks 1, 2, and 3, respectively. 24 unique teams participated, with 1, 10, and 14 teams in Tasks 1, 2 and 3, respectively. The teams were from Africa, Asia, Canada, Europe, and North America. The Task 1 submission, reviewed by 5 expert peers, related to the task evaluation category of Effective use of interaction and targeted the needs of both expert and novice users. The best system had an Accuracy of 0.868 in Task 2a, an F1-score of 0.576 in Task 2b, and Precision at 10 (P@10) of 0.756 in Task 3. The results demonstrate the substantial community interest and capabilities of these systems in making clinical reports easier to understand for patients. The organisers have made data and tools available for future research and development.
Abstract:
This paper presents our system to address the CogALex-IV 2014 shared task of identifying a single word most semantically related to a group of 5 words (queries). Our system uses an implementation of a neural language model and identifies the answer word by finding the word representation most semantically similar to the sum of the query representations. It is a fully unsupervised system which learns on around 20% of the UkWaC corpus. It correctly identifies 85 exact targets out of 2,000 queries, and 285 approximate targets within lists of 5 suggestions.
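A minimal sketch of the answer-selection step described above: sum the query-word vectors and return the vocabulary word whose embedding is most cosine-similar to the sum. The embedding training itself (the neural language model) is omitted, and the function and variable names are illustrative.

```python
# Sketch of answer selection given pre-trained word vectors (training omitted).
import numpy as np

def most_related_word(query_words, embeddings, vocab):
    """embeddings: dict word -> vector; vocab: candidate answer words."""
    target = np.sum([embeddings[w] for w in query_words if w in embeddings], axis=0)
    target = target / np.linalg.norm(target)
    best_word, best_sim = None, -1.0
    for w in vocab:
        if w in query_words:
            continue                       # the answer must differ from the cue words
        v = embeddings[w]
        sim = float(np.dot(target, v) / np.linalg.norm(v))
        if sim > best_sim:
            best_word, best_sim = w, sim
    return best_word
```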
Abstract:
The aim of spoken term detection (STD) is to find all occurrences of a specified query term in a large audio database. This process is usually divided into two steps: indexing and search. In a previous study, it was shown that knowing the topic of an audio document helps to improve the accuracy of the indexing step, which results in better performance for the STD system. In this paper, we propose the use of topic information not only in the indexing step, but also in the search step. Results of our experiments show that topic information can also be used in the search step to improve STD accuracy.
Abstract:
Private title insurance has been the subject of much debate by law reform bodies and academics. This article adds a new dimension to the discussion by analysing its role against a recent scenario where a nun was betrayed by the actions of her brother, and compensation payable from the assurance fund, after much challenge by the registrar, amounted to in excess of $4 million. We ask whether the slow burning of title insurance into the psyche of Australian home purchasers will see state-based assurance funds looking to minimise their role in the Torrens system. We also query how the rather more immediate establishment of electronic conveyancing will alter the balance between the assurance fund, private title insurance and the increasing responsibilities on stakeholders involved in conveyancing.
Abstract:
Determination of sequence similarity is a central issue in computational biology, a problem addressed primarily through BLAST, an alignment-based heuristic which has underpinned much of the analysis and annotation of the genomic era. Despite their success, alignment-based approaches scale poorly with increasing data set size, and are not robust under structural sequence rearrangements. Successive waves of innovation in sequencing technologies – so-called Next Generation Sequencing (NGS) approaches – have led to an explosion in data availability, challenging existing methods and motivating novel approaches to sequence representation and similarity scoring, including adaptation of existing methods from other domains such as information retrieval. In this work, we investigate locality-sensitive hashing of sequences through binary document signatures, applying the method to a bacterial protein classification task. Here, the goal is to predict the gene family to which a given query protein belongs. Experiments carried out on a pair of small but biologically realistic datasets (the full protein repertoires of families of Chlamydia and Staphylococcus aureus genomes respectively) show that a measure of similarity obtained by locality-sensitive hashing gives highly accurate results while offering a number of avenues which will lead to substantial performance improvements over BLAST.
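An illustrative sketch of the signature idea under assumed parameters: represent a protein by its k-mer counts, project onto random hyperplanes to obtain a binary signature, and compare signatures by Hamming distance. The alphabet, k and signature width are arbitrary choices for illustration, not the paper's settings.

```python
# Illustrative locality-sensitive hashing of protein sequences via random
# hyperplanes; similar k-mer profiles tend to produce similar bit signatures.
import numpy as np
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
K = 3
KMERS = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=K))}

def kmer_vector(seq):
    v = np.zeros(len(KMERS))
    for i in range(len(seq) - K + 1):
        idx = KMERS.get(seq[i:i + K])
        if idx is not None:
            v[idx] += 1
    return v

rng = np.random.default_rng(0)
PLANES = rng.standard_normal((1024, len(KMERS)))   # 1024-bit signatures

def signature(seq):
    return (PLANES @ kmer_vector(seq)) > 0          # boolean array of sign bits

def hamming(sig_a, sig_b):
    return int(np.count_nonzero(sig_a != sig_b))
```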
Abstract:
A key concept in many Information Retrieval (IR) tasks, e.g. document indexing, query language modelling, aspect and diversity retrieval, is the relevance measurement of topics, i.e. to what extent an information object (e.g. a document or a query) is about the topics. This paper investigates the interference in the relevance measurement of a topic caused by another topic. For example, consider that two user groups are required to judge whether a topic q is relevant to a document d, and q is presented together with another topic (referred to as a companion topic). If different companion topics are used for different groups, interestingly, different relevance probabilities of q given d can be reached. In this paper, we present empirical results showing that the relevance of a topic to a document is greatly affected by the companion topic's relevance to the same document, and the extent of the impact differs with respect to different companion topics. We further analyse the phenomenon from classical and quantum-like interference perspectives, and connect the phenomenon to non-reality and contextuality in quantum mechanics. We demonstrate that a quantum-like model fits the empirical data and could potentially be used for predicting relevance when interference exists.
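One common way the quantum-IR literature expresses such interference is as a term 2·sqrt(P1·P2)·cos(θ) added to the classical combination of probabilities; the toy sketch below solves for that interference angle given observed probabilities. It is a hedged illustration of the general idea, not necessarily the exact model fitted in the paper, and the inputs are invented.

```python
# Toy sketch: absorb the deviation of the observed probability from the
# classical expectation into an interference term 2*sqrt(p1*p2)*cos(theta).
import math

def interference_angle(p1, p2, p_observed):
    """Solve p_observed = p1 + p2 + 2*sqrt(p1*p2)*cos(theta) for theta."""
    cos_theta = (p_observed - p1 - p2) / (2 * math.sqrt(p1 * p2))
    return math.acos(max(-1.0, min(1.0, cos_theta)))

# invented example: observed relevance probability below the classical sum
theta = interference_angle(0.4, 0.3, 0.55)
```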
Abstract:
In a pilot application based on a web search engine, called Web-based Relation Completion (WebRC), we propose to join two columns of entities linked by a predefined relation by mining knowledge from the web through a web search engine. To achieve this, a novel retrieval task, Relation Query Expansion (RelQE), is modelled: given an entity (query), the task is to retrieve documents containing entities in a predefined relation to the given one. Solving this problem entails expanding the query before submitting it to a web search engine to ensure that mostly documents containing the linked entity are returned in the top K search results. In this paper, we propose a novel Learning-based Relevance Feedback (LRF) approach to solve this retrieval task. Expansion terms are learned from training pairs of entities linked by the predefined relation and applied to new entity-queries to find entities linked by the same relation. After describing the approach, we present experimental results on real-world web data collections, which show that the LRF approach always improves the precision of top-ranked search results, reaching up to 8.6 times the baseline. Using LRF, WebRC also performs well above the baseline.
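A hypothetical sketch of applying learned expansion terms at query time: terms learned from training entity pairs of the target relation are appended to a new entity-query before it is submitted to the search engine. The relation, terms and weights are invented for illustration, and the learning of the terms themselves is omitted.

```python
# Hypothetical sketch of RelQE-style query expansion with learned terms.
def expand_query(entity, expansion_terms, max_terms=5):
    """entity: the query entity; expansion_terms: (term, weight) pairs
    learned from training pairs, sorted by weight descending."""
    terms = [t for t, _ in expansion_terms[:max_terms]]
    return " ".join([entity] + terms)

# e.g. for a 'company -> headquarters city' relation (invented example)
query = expand_query("Acme Corp",
                     [("headquarters", 0.9), ("based", 0.7), ("located", 0.6)])
```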
Abstract:
We used our TopSig open-source indexing and retrieval tool to produce runs for the ShARe/CLEF eHealth 2013 track. TopSig was used to produce runs using the query fields and the provided discharge summaries, where appropriate. Although the improvement was not great, TopSig was able to gain some benefit from utilising the discharge summaries, though the software needed to be modified to support this. This was part of a larger experiment investigating the applicability and limits of signature-based approaches.