932 resultados para Retrieval


Relevância:

20.00% 20.00%

Publicador:

Resumo:

The structured representation of cases by attribute graphs in a Case-Based Reasoning (CBR) system for course timetabling has been the subject of previous research by the authors. In that system, the case base is organised as a decision tree and the retrieval process chooses those cases which are sub attribute graph isomorphic to the new case. The drawback of that approach is that it is not suitable for solving large problems. This paper presents a multiple-retrieval approach that partitions a large problem into small solvable sub-problems by recursively inputting the unsolved part of the graph into the decision tree for retrieval. The adaptation combines the retrieved partial solutions of all the partitioned sub-problems and employs a graph heuristic method to construct the whole solution for the new case. We present a methodology which is not dependant upon problem specific information and which, as such, represents an approach which underpins the goal of building more general timetabling systems. We also explore the question of whether this multiple-retrieval CBR could be an effective initialisation method for local search methods such as Hill Climbing, Tabu Search and Simulated Annealing. Significant results are obtained from a wide range of experiments. An evaluation of the CBR system is presented and the impact of the approach on timetabling research is discussed. We see that the approach does indeed represent an effective initialisation method for these approaches.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

International audience

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The size of online image datasets is constantly increasing. Considering an image dataset with millions of images, image retrieval becomes a seemingly intractable problem for exhaustive similarity search algorithms. Hashing methods, which encodes high-dimensional descriptors into compact binary strings, have become very popular because of their high efficiency in search and storage capacity. In the first part, we propose a multimodal retrieval method based on latent feature models. The procedure consists of a nonparametric Bayesian framework for learning underlying semantically meaningful abstract features in a multimodal dataset, a probabilistic retrieval model that allows cross-modal queries and an extension model for relevance feedback. In the second part, we focus on supervised hashing with kernels. We describe a flexible hashing procedure that treats binary codes and pairwise semantic similarity as latent and observed variables, respectively, in a probabilistic model based on Gaussian processes for binary classification. We present a scalable inference algorithm with the sparse pseudo-input Gaussian process (SPGP) model and distributed computing. In the last part, we define an incremental hashing strategy for dynamic databases where new images are added to the databases frequently. The method is based on a two-stage classification framework using binary and multi-class SVMs. The proposed method also enforces balance in binary codes by an imbalance penalty to obtain higher quality binary codes. We learn hash functions by an efficient algorithm where the NP-hard problem of finding optimal binary codes is solved via cyclic coordinate descent and SVMs are trained in a parallelized incremental manner. For modifications like adding images from an unseen class, we propose an incremental procedure for effective and efficient updates to the previous hash functions. Experiments on three large-scale image datasets demonstrate that the incremental strategy is capable of efficiently updating hash functions to the same retrieval performance as hashing from scratch.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Image (Video) retrieval is an interesting problem of retrieving images (videos) similar to the query. Images (Videos) are represented in an input (feature) space and similar images (videos) are obtained by finding nearest neighbors in the input representation space. Numerous input representations both in real valued and binary space have been proposed for conducting faster retrieval. In this thesis, we present techniques that obtain improved input representations for retrieval in both supervised and unsupervised settings for images and videos. Supervised retrieval is a well known problem of retrieving same class images of the query. We address the practical aspects of achieving faster retrieval with binary codes as input representations for the supervised setting in the first part, where binary codes are used as addresses into hash tables. In practice, using binary codes as addresses does not guarantee fast retrieval, as similar images are not mapped to the same binary code (address). We address this problem by presenting an efficient supervised hashing (binary encoding) method that aims to explicitly map all the images of the same class ideally to a unique binary code. We refer to the binary codes of the images as `Semantic Binary Codes' and the unique code for all same class images as `Class Binary Code'. We also propose a new class­ based Hamming metric that dramatically reduces the retrieval times for larger databases, where only hamming distance is computed to the class binary codes. We also propose a Deep semantic binary code model, by replacing the output layer of a popular convolutional Neural Network (AlexNet) with the class binary codes and show that the hashing functions learned in this way outperforms the state­ of ­the art, and at the same time provide fast retrieval times. In the second part, we also address the problem of supervised retrieval by taking into account the relationship between classes. For a given query image, we want to retrieve images that preserve the relative order i.e. we want to retrieve all same class images first and then, the related classes images before different class images. We learn such relationship aware binary codes by minimizing the similarity between inner product of the binary codes and the similarity between the classes. We calculate the similarity between classes using output embedding vectors, which are vector representations of classes. Our method deviates from the other supervised binary encoding schemes as it is the first to use output embeddings for learning hashing functions. We also introduce new performance metrics that take into account the related class retrieval results and show significant gains over the state­ of­ the art. High Dimensional descriptors like Fisher Vectors or Vector of Locally Aggregated Descriptors have shown to improve the performance of many computer vision applications including retrieval. In the third part, we will discuss an unsupervised technique for compressing high dimensional vectors into high dimensional binary codes, to reduce storage complexity. In this approach, we deviate from adopting traditional hyperplane hashing functions and instead learn hyperspherical hashing functions. The proposed method overcomes the computational challenges of directly applying the spherical hashing algorithm that is intractable for compressing high dimensional vectors. A practical hierarchical model that utilizes divide and conquer techniques using the Random Select and Adjust (RSA) procedure to compress such high dimensional vectors is presented. We show that our proposed high dimensional binary codes outperform the binary codes obtained using traditional hyperplane methods for higher compression ratios. In the last part of the thesis, we propose a retrieval based solution to the Zero shot event classification problem - a setting where no training videos are available for the event. To do this, we learn a generic set of concept detectors and represent both videos and query events in the concept space. We then compute similarity between the query event and the video in the concept space and videos similar to the query event are classified as the videos belonging to the event. We show that we significantly boost the performance using concept features from other modalities.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Executing a cloud or aerosol physical properties retrieval algorithm from controlled synthetic data is an important step in retrieval algorithm development. Synthetic data can help answer questions about the sensitivity and performance of the algorithm or aid in determining how an existing retrieval algorithm may perform with a planned sensor. Synthetic data can also help in solving issues that may have surfaced in the retrieval results. Synthetic data become very important when other validation methods, such as field campaigns,are of limited scope. These tend to be of relatively short duration and often are costly. Ground stations have limited spatial coverage whilesynthetic data can cover large spatial and temporal scales and a wide variety of conditions at a low cost. In this work I develop an advanced cloud and aerosol retrieval simulator for the MODIS instrument, also known as Multi-sensor Cloud and Aerosol Retrieval Simulator (MCARS). In a close collaboration with the modeling community I have seamlessly combined the GEOS-5 global climate model with the DISORT radiative transfer code, widely used by the remote sensing community, with the observations from the MODIS instrument to create the simulator. With the MCARS simulator it was then possible to solve the long standing issue with the MODIS aerosol optical depth retrievals that had a low bias for smoke aerosols. MODIS aerosol retrieval did not account for effects of humidity on smoke aerosols. The MCARS simulator also revealed an issue that has not been recognized previously, namely,the value of fine mode fraction could create a linear dependence between retrieved aerosol optical depth and land surface reflectance. MCARS provided the ability to examine aerosol retrievals against “ground truth” for hundreds of thousands of simultaneous samples for an area covered by only three AERONET ground stations. Findings from MCARS are already being used to improve the performance of operational MODIS aerosol properties retrieval algorithms. The modeling community will use the MCARS data to create new parameterizations for aerosol properties as a function of properties of the atmospheric column and gain the ability to correct any assimilated retrieval data that may display similar dependencies in comparisons with ground measurements.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Compressed covariance sensing using quadratic samplers is gaining increasing interest in recent literature. Covariance matrix often plays the role of a sufficient statistic in many signal and information processing tasks. However, owing to the large dimension of the data, it may become necessary to obtain a compressed sketch of the high dimensional covariance matrix to reduce the associated storage and communication costs. Nested sampling has been proposed in the past as an efficient sub-Nyquist sampling strategy that enables perfect reconstruction of the autocorrelation sequence of Wide-Sense Stationary (WSS) signals, as though it was sampled at the Nyquist rate. The key idea behind nested sampling is to exploit properties of the difference set that naturally arises in quadratic measurement model associated with covariance compression. In this thesis, we will focus on developing novel versions of nested sampling for low rank Toeplitz covariance estimation, and phase retrieval, where the latter problem finds many applications in high resolution optical imaging, X-ray crystallography and molecular imaging. The problem of low rank compressive Toeplitz covariance estimation is first shown to be fundamentally related to that of line spectrum recovery. In absence if noise, this connection can be exploited to develop a particular kind of sampler called the Generalized Nested Sampler (GNS), that can achieve optimal compression rates. In presence of bounded noise, we develop a regularization-free algorithm that provably leads to stable recovery of the high dimensional Toeplitz matrix from its order-wise minimal sketch acquired using a GNS. Contrary to existing TV-norm and nuclear norm based reconstruction algorithms, our technique does not use any tuning parameters, which can be of great practical value. The idea of nested sampling idea also finds a surprising use in the problem of phase retrieval, which has been of great interest in recent times for its convex formulation via PhaseLift, By using another modified version of nested sampling, namely the Partial Nested Fourier Sampler (PNFS), we show that with probability one, it is possible to achieve a certain conjectured lower bound on the necessary measurement size. Moreover, for sparse data, an l1 minimization based algorithm is proposed that can lead to stable phase retrieval using order-wise minimal number of measurements.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Background: About 10% to 15% of infertile men have azoospermia, which could be obstructive or non-obstructive. Diagnostic biopsy from the testis and recently testicular sperm extraction (TESE) are the most precise investigations in these patients. Testicular biopsy can be done unilaterally or bilaterally. The worth of unilateral or bilateral testicular biopsy in men with azoospermia is controversial. Objective: To evaluate the necessity of bilateral diagnostic biopsy from the testis in new era of diagnosis and treatment of male infertility. Materials and Methods: In this retrospective study, we reviewed the results of testis biopsy in 419 azoospermic men, referred to Yazd Research and Clinical Center for Infertility from 2009-2013. Patients with known obstructive azoospermia were excluded from the study. Results: In totally, 254 infertile men (60.6%) were underwent unilateral TESE, which in 175 patients (88.4%) sperm were extracted from their testes successfully. Bilateral testis biopsy was done in 165 patients (39.4%) which in 37 patients (22.4%), sperm were found in their testes tissues. Conclusion: Due to the low probability of positive bilateral TESE results especially when we can’t found sperm in the first side, we recommend that physicians re-evaluate the risk and benefit of this procedure in era of newer and more precise technique of sperm retrieval like micro TESE.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Things change. Words change, meaning changes and use changes both words and meaning. In information access systems this means concept schemes such as thesauri or clas- sification schemes change. They always have. Concept schemes that have survived have evolved over time, moving from one version, often called an edition, to the next. If we want to manage how words and meanings - and as a conse- quence use - change in an effective manner, and if we want to be able to search across versions of concept schemes, we have to track these changes. This paper explores how we might expand SKOS, a World Wide Web Consortium (W3C) draft recommendation in order to do that kind of tracking.The Simple Knowledge Organization System (SKOS) Core Guide is sponsored by the Semantic Web Best Practices and Deployment Working Group. The second draft, edited by Alistair Miles and Dan Brickley, was issued in November 2005. SKOS is a “model for expressing the basic structure and content of concept schemes such as thesauri, classification schemes, subject heading lists, taxonomies, folksonomies, other types of controlled vocabulary and also concept schemes embedded in glossaries and terminologies” in RDF. How SKOS handles version in concept schemes is an open issue. The current draft guide suggests using OWL and DCTERMS as mechanisms for concept scheme revision.As it stands an editor of a concept scheme can make notes or declare in OWL that more than one version exists. This paper adds to the SKOS Core by introducing a tracking sys- tem for changes in concept schemes. We call this tracking system vocabulary ontogeny. Ontogeny is a biological term for the development of an organism during its lifetime. Here we use the ontogeny metaphor to describe how vocabularies change over their lifetime. Our purpose here is to create a conceptual mechanism that will track these changes and in so doing enhance information retrieval and prevent document loss through versioning, thereby enabling persistent retrieval.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The MARS (Media Asset Retrieval System) Project is the collaborative effort of public broadcasters,libraries and schools in the Puget Sound region to create a digital online resource that provides access to content produced by public broadcasters via the public libraries. Convergence ConsortiumThe Convergence Consortium is a model for community collaboration, including organizations such as public broadcasters, libraries, museums, and schools in the Puget Sound region to assess the needs of their constituents and pool resources to develop solutions to meet those needs. Specifically, the archives of public broadcasters have been identified as significant resources for the local communities and nationally. These resources can be accessed on the broadcasters websites, and through libraries and used by schools, and integrated with text and photographic archives from other partners.MARS’ goalCreate an online resource that provides effective access to the content produced locally by KCTS (Seattle PBS affiliate) and KUOW (Seattle NPR affiliate). The broadcasts will be made searchable using the CPB Metadata Element Set (under development) and controlled vocabularies (to be developed). This will ensure a user friendly search and navigation mechanism and user satisfaction.Furthermore, the resource can search the local public library’s catalog concurrently and provide the user with relevant TV material, radio material, and books on a given subject.The ultimate goal is to produce a model that can be used in cities around the country.The current phase of the project assesses the community’s need, analyzes the current operational systems, and makes recommendations for the design of the resource.Deliverables• Literature review of the issues surrounding the organization, description and representation of media assets• Needs assessment report of internal and external stakeholders• Profile of the systems in the area of managing and organizing media assetsfor public broadcasting nationwideActivities• Analysis of information seeking behavior• Analysis of collaboration within the respective organizations• Analysis of the scope and context of the proposed system• Examining the availability of information resources and exchangeof resources among users

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The MARS (Media Asset Retrieval System) Project is a collaboration between public broadcasters, libraries and schools in the Puget Sound region to assess the needs of their constituents and pool resources to develop solutions to meet those needs. The Project’s ultimate goal is to create a digital online resource that will provide access to content produced by public broadcasters and libraries. The MARS Project is funded by a grant from the Corporation for Public Broadcasting (CPB) Television Future Fund. Convergence ConsortiumThe Convergence Consortium is a model for community collaboration, including representatives from public broadcasting, libraries and schools in the Puget Sound region. They meet regularly to consider collaborative efforts that will be mutually beneficial to their institutions and constituents. Specifically, the archives of public broadcasters have been identified as significant resources that can be accessed through libraries and used by schools, and integrated with text and photographic archives from other partners.Using the work-centered framework, we collected data through interviews with nine engineers and observation of their searching while they performed their regular, job-related searches on the Web. The framework was used to analyze the data on two levels: 1) the activities and organizational relationships and constrains of work domains, and 2) users’ cognitive and social activities and their subjective preferences during searching.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Conventional web search engines are centralised in that a single entity crawls and indexes the documents selected for future retrieval, and the relevance models used to determine which documents are relevant to a given user query. As a result, these search engines suffer from several technical drawbacks such as handling scale, timeliness and reliability, in addition to ethical concerns such as commercial manipulation and information censorship. Alleviating the need to rely entirely on a single entity, Peer-to-Peer (P2P) Information Retrieval (IR) has been proposed as a solution, as it distributes the functional components of a web search engine – from crawling and indexing documents, to query processing – across the network of users (or, peers) who use the search engine. This strategy for constructing an IR system poses several efficiency and effectiveness challenges which have been identified in past work. Accordingly, this thesis makes several contributions towards advancing the state of the art in P2P-IR effectiveness by improving the query processing and relevance scoring aspects of a P2P web search. Federated search systems are a form of distributed information retrieval model that route the user’s information need, formulated as a query, to distributed resources and merge the retrieved result lists into a final list. P2P-IR networks are one form of federated search in routing queries and merging result among participating peers. The query is propagated through disseminated nodes to hit the peers that are most likely to contain relevant documents, then the retrieved result lists are merged at different points along the path from the relevant peers to the query initializer (or namely, customer). However, query routing in P2P-IR networks is considered as one of the major challenges and critical part in P2P-IR networks; as the relevant peers might be lost in low-quality peer selection while executing the query routing, and inevitably lead to less effective retrieval results. This motivates this thesis to study and propose query routing techniques to improve retrieval quality in such networks. Cluster-based semi-structured P2P-IR networks exploit the cluster hypothesis to organise the peers into similar semantic clusters where each such semantic cluster is managed by super-peers. In this thesis, I construct three semi-structured P2P-IR models and examine their retrieval effectiveness. I also leverage the cluster centroids at the super-peer level as content representations gathered from cooperative peers to propose a query routing approach called Inverted PeerCluster Index (IPI) that simulates the conventional inverted index of the centralised corpus to organise the statistics of peers’ terms. The results show a competitive retrieval quality in comparison to baseline approaches. Furthermore, I study the applicability of using the conventional Information Retrieval models as peer selection approaches where each peer can be considered as a big document of documents. The experimental evaluation shows comparative and significant results and explains that document retrieval methods are very effective for peer selection that brings back the analogy between documents and peers. Additionally, Learning to Rank (LtR) algorithms are exploited to build a learned classifier for peer ranking at the super-peer level. The experiments show significant results with state-of-the-art resource selection methods and competitive results to corresponding classification-based approaches. Finally, I propose reputation-based query routing approaches that exploit the idea of providing feedback on a specific item in the social community networks and manage it for future decision-making. The system monitors users’ behaviours when they click or download documents from the final ranked list as implicit feedback and mines the given information to build a reputation-based data structure. The data structure is used to score peers and then rank them for query routing. I conduct a set of experiments to cover various scenarios including noisy feedback information (i.e, providing positive feedback on non-relevant documents) to examine the robustness of reputation-based approaches. The empirical evaluation shows significant results in almost all measurement metrics with approximate improvement more than 56% compared to baseline approaches. Thus, based on the results, if one were to choose one technique, reputation-based approaches are clearly the natural choices which also can be deployed on any P2P network.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In questa tesi si trattano lo studio e la sperimentazione di un modello generativo retrieval-augmented, basato su Transformers, per il task di Abstractive Summarization su lunghe sentenze legali. La sintesi automatica del testo (Automatic Text Summarization) è diventata un task di Natural Language Processing (NLP) molto importante oggigiorno, visto il grandissimo numero di dati provenienti dal web e banche dati. Inoltre, essa permette di automatizzare un processo molto oneroso per gli esperti, specialmente nel settore legale, in cui i documenti sono lunghi e complicati, per cui difficili e dispendiosi da riassumere. I modelli allo stato dell’arte dell’Automatic Text Summarization sono basati su soluzioni di Deep Learning, in particolare sui Transformers, che rappresentano l’architettura più consolidata per task di NLP. Il modello proposto in questa tesi rappresenta una soluzione per la Long Document Summarization, ossia per generare riassunti di lunghe sequenze testuali. In particolare, l’architettura si basa sul modello RAG (Retrieval-Augmented Generation), recentemente introdotto dal team di ricerca Facebook AI per il task di Question Answering. L’obiettivo consiste nel modificare l’architettura RAG al fine di renderla adatta al task di Abstractive Long Document Summarization. In dettaglio, si vuole sfruttare e testare la memoria non parametrica del modello, con lo scopo di arricchire la rappresentazione del testo di input da riassumere. A tal fine, sono state sperimentate diverse configurazioni del modello su diverse tipologie di esperimenti e sono stati valutati i riassunti generati con diverse metriche automatiche.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Dopo lo sviluppo dei primi casi di Covid-19 in Cina nell’autunno del 2019, ad inizio 2020 l’intero pianeta è precipitato in una pandemia globale che ha stravolto le nostre vite con conseguenze che non si vivevano dall’influenza spagnola. La grandissima quantità di paper scientifici in continua pubblicazione sul coronavirus e virus ad esso affini ha portato alla creazione di un unico dataset dinamico chiamato CORD19 e distribuito gratuitamente. Poter reperire informazioni utili in questa mole di dati ha ulteriormente acceso i riflettori sugli information retrieval systems, capaci di recuperare in maniera rapida ed efficace informazioni preziose rispetto a una domanda dell'utente detta query. Di particolare rilievo è stata la TREC-COVID Challenge, competizione per lo sviluppo di un sistema di IR addestrato e testato sul dataset CORD19. Il problema principale è dato dal fatto che la grande mole di documenti è totalmente non etichettata e risulta dunque impossibile addestrare modelli di reti neurali direttamente su di essi. Per aggirare il problema abbiamo messo a punto nuove soluzioni self-supervised, a cui abbiamo applicato lo stato dell'arte del deep metric learning e dell'NLP. Il deep metric learning, che sta avendo un enorme successo soprattuto nella computer vision, addestra il modello ad "avvicinare" tra loro immagini simili e "allontanare" immagini differenti. Dato che sia le immagini che il testo vengono rappresentati attraverso vettori di numeri reali (embeddings) si possano utilizzare le stesse tecniche per "avvicinare" tra loro elementi testuali pertinenti (e.g. una query e un paragrafo) e "allontanare" elementi non pertinenti. Abbiamo dunque addestrato un modello SciBERT con varie loss, che ad oggi rappresentano lo stato dell'arte del deep metric learning, in maniera completamente self-supervised direttamente e unicamente sul dataset CORD19, valutandolo poi sul set formale TREC-COVID attraverso un sistema di IR e ottenendo risultati interessanti.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Most of the existing open-source search engines, utilize keyword or tf-idf based techniques to find relevant documents and web pages relative to an input query. Although these methods, with the help of a page rank or knowledge graphs, proved to be effective in some cases, they often fail to retrieve relevant instances for more complicated queries that would require a semantic understanding to be exploited. In this Thesis, a self-supervised information retrieval system based on transformers is employed to build a semantic search engine over the library of Gruppo Maggioli company. Semantic search or search with meaning can refer to an understanding of the query, instead of simply finding words matches and, in general, it represents knowledge in a way suitable for retrieval. We chose to investigate a new self-supervised strategy to handle the training of unlabeled data based on the creation of pairs of ’artificial’ queries and the respective positive passages. We claim that by removing the reliance on labeled data, we may use the large volume of unlabeled material on the web without being limited to languages or domains where labeled data is abundant.