921 results for cross-language information retrieval
Resumo:
Sharing information with those who need it has always been an idealistic goal of networked environments. With the proliferation of computer networks, information is so widely distributed among systems that well-organized schemes for its retrieval and discovery are imperative. This thesis investigates the problems associated with such schemes and proposes a software architecture aimed at achieving meaningful discovery. The use of information elements as a modelling base for efficient information discovery in distributed systems is demonstrated with the aid of a novel conceptual entity called the infotron.

The investigation focuses on distributed systems and their associated problems. The study was directed towards identifying a suitable software architecture and incorporating it in an environment where information growth is phenomenal, making a proper mechanism for information discovery feasible. An empirical study, undertaken with the aid of an election database of geographically distributed constituencies, provided the required insights. This is manifested in the Election Counting and Reporting Software (ECRS) System, a distributed software system designed to prepare reports for district administrators about the election counting process and to generate other miscellaneous statutory reports.

Most distributed systems of the nature of ECRS possess a "fragile architecture" that makes them liable to collapse when minor faults occur. This is resolved by the proposed penta-tier architecture, which places five different technologies at the different tiers of the architecture. The results of the experiments conducted, and their analysis, show that such an architecture helps keep the different components of the software intact and insulated from internal or external faults. The architecture thus evolved needed a mechanism to support information processing and discovery, which necessitated the introduction of the novel concept of infotrons. Further, when a computing machine has to perform any meaningful extraction of information, it is guided by what is termed an infotron dictionary.

The other empirical study sought to find out which of the two prominent markup languages, HTML and XML, is better suited for the incorporation of infotrons. A comparative study of 200 documents in HTML and XML was undertaken; the result was in favor of XML.

The concepts of the infotron and the infotron dictionary were applied to implement an Information Discovery System (IDS). IDS is essentially a system that starts from the infotron(s) supplied as clue(s) and distills the information required to satisfy the information discoverer's need from the documents at its disposal (its information space). The various components of the system and their interactions follow the penta-tier architectural model and can therefore be considered fault-tolerant. IDS is generic in nature, and its characteristics and specifications were drawn up accordingly. Many subsystems interacted with the multiple infotron dictionaries maintained in the system.

To demonstrate the working of the IDS and to discover information without modifying a typical Library Information System (LIS), an Information Discovery in Library Information System (IDLIS) application was developed. IDLIS is essentially a wrapper for the LIS, which maintains all the databases of the library. The purpose was to demonstrate that the functionality of a legacy system can be enhanced by augmenting it with IDS, adding an information discovery service. IDLIS demonstrates IDS in action, and it proves that any legacy system can be effectively augmented with IDS to provide the additional functionality of an information discovery service. Possible applications of IDS and the scope for further research in the field are covered.
Resumo:
Modeling and predicting co-occurrences of events is a fundamental problem of unsupervised learning. In this contribution we develop a statistical framework for analyzing co-occurrence data in a general setting where elementary observations are joint occurrences of pairs of abstract objects from two finite sets. The main challenge for statistical models in this context is to overcome the inherent data sparseness and to estimate the probabilities for pairs which were rarely observed or even unobserved in a given sample set. Moreover, it is often of considerable interest to extract grouping structure or to find a hierarchical data organization. A novel family of mixture models is proposed which explains the observed data by a finite number of shared aspects or clusters. This provides a common framework for statistical inference and structure discovery and also includes several recently proposed models as special cases. Adopting the maximum likelihood principle, EM algorithms are derived to fit the model parameters. We develop improved versions of EM which largely avoid overfitting problems and overcome the inherent locality of EM-based optimization. Among the broad variety of possible applications, e.g., in information retrieval, natural language processing, data mining, and computer vision, we have chosen document retrieval, the statistical analysis of noun/adjective co-occurrence and the unsupervised segmentation of textured images to test and evaluate the proposed algorithms.
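As a minimal sketch of the aspect-model idea in this abstract, the code below fits a PLSA-style decomposition P(x,y) = sum_k P(k) P(x|k) P(y|k) to a co-occurrence count matrix with plain EM. The parameterization and variable names are illustrative assumptions, and the paper's improved EM variants (which avoid overfitting and local optima, e.g. via tempering) are not shown.

import numpy as np

def fit_aspect_model(N, K, iters=50, seed=0):
    """Plain EM for a simple aspect (mixture) model of co-occurrence counts.

    N : (X, Y) matrix of co-occurrence counts n(x, y)
    K : number of latent aspects/clusters
    Returns p_k (K,), p_x_k (X, K), p_y_k (Y, K).
    """
    rng = np.random.default_rng(seed)
    X, Y = N.shape
    p_k = np.full(K, 1.0 / K)
    p_x_k = rng.dirichlet(np.ones(X), size=K).T   # each column sums to 1
    p_y_k = rng.dirichlet(np.ones(Y), size=K).T
    for _ in range(iters):
        # E-step: posterior P(k | x, y) for every cell of the matrix
        joint = p_k[None, None, :] * p_x_k[:, None, :] * p_y_k[None, :, :]
        post = joint / np.maximum(joint.sum(axis=2, keepdims=True), 1e-12)
        # M-step: re-estimate parameters from expected counts
        weighted = N[:, :, None] * post              # expected counts per aspect
        p_k = weighted.sum(axis=(0, 1))
        p_x_k = weighted.sum(axis=1) / np.maximum(p_k[None, :], 1e-12)
        p_y_k = weighted.sum(axis=0) / np.maximum(p_k[None, :], 1e-12)
        p_k = p_k / p_k.sum()
    return p_k, p_x_k, p_y_k

# Example: a toy 3x3 count matrix with two latent aspects
counts = np.array([[10, 0, 2], [0, 8, 1], [9, 1, 0]], dtype=float)
p_k, p_x_k, p_y_k = fit_aspect_model(counts, K=2)

Because unseen pairs receive probability through the shared aspects, the model smooths over the data sparseness the abstract describes.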
Resumo:
Real-time geoparsing of social media streams (e.g. Twitter, YouTube, Instagram, Flickr, FourSquare) is providing a new 'virtual sensor' capability to end users such as emergency response agencies (e.g. tsunami early warning centres, civil protection authorities) and news agencies (e.g. Deutsche Welle, BBC News). Challenges in this area include scaling up natural language processing (NLP) and information retrieval (IR) approaches to handle real-time traffic volumes, reducing false positives, creating real-time infographic displays useful for effective decision support, and providing support for trust and credibility analysis using geosemantics. In this seminar I will present on-going work by the IT Innovation Centre over the last 4 years (TRIDEC and REVEAL FP7 projects) on building such systems, and highlight our research towards improving the trustworthiness and credibility of crisis map displays and real-time analytics for trending topics and influential social networks during major newsworthy events.
Resumo:
This paper is about the use of natural language to communicate with computers. Most research pursuing this goal considers only requests expressed in English. One way to facilitate the use of several languages in natural language systems is to use an interlingua, an intermediary representation for natural language information that can be processed by machines. We propose to convert natural language requests into an interlingua, the Universal Networking Language (UNL), and to execute these requests using software components. To achieve this goal, we propose OntoMap, an ontology-based architecture that performs the semantic mapping between UNL sentences and software components. OntoMap also performs component search and retrieval based on semantic information formalized in ontologies and rules.
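Purely as an illustration of the kind of rule-based mapping described above, the sketch below routes UNL-style binary relations to software components. The relation names, rule format, and component registry are invented for the example; they are not OntoMap's actual API or rule language.

# A UNL sentence is approximated here as (relation, head, modifier) triples,
# e.g. "send the report by email" -> [("obj", "send", "report"), ("ins", "send", "email")]
Sentence = list[tuple[str, str, str]]

RULES = {
    # (relation, head) -> hypothetical component and parameter slot
    ("obj", "send"): ("MailerComponent", "payload"),
    ("ins", "send"): ("MailerComponent", "transport"),
    ("obj", "search"): ("RetrievalComponent", "query"),
}

def map_to_components(sentence: Sentence) -> dict[str, dict[str, str]]:
    """Collect, per component, the parameters the UNL relations provide."""
    calls: dict[str, dict[str, str]] = {}
    for rel, head, modifier in sentence:
        match = RULES.get((rel, head))
        if match:
            component, slot = match
            calls.setdefault(component, {})[slot] = modifier
    return calls

print(map_to_components([("obj", "send", "report"), ("ins", "send", "email")]))
# -> {'MailerComponent': {'payload': 'report', 'transport': 'email'}}

In the real architecture the mapping is mediated by ontologies rather than a flat rule table, so components can be matched by semantic class rather than exact head words.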
Resumo:
Successful classification, information retrieval and image analysis tools are intimately related to the quality of the features employed in the process. Pixel intensities, color, texture and shape are, in general, the basis from which most such features are computed. This paper presents a novel shape-based feature extraction approach in which an image is decomposed into multiple contours, which are then characterized by Fourier descriptors. Unlike traditional approaches, we make use of topological knowledge to generate well-defined closed contours, which are efficient signatures for image retrieval. The method has been evaluated in the CBIR context and in image analysis. The results show that the multi-contour decomposition, as opposed to a single piece of shape information, introduces a significant improvement in discrimination power. (c) 2008 Elsevier B.V. All rights reserved.
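A sketch of the Fourier-descriptor step for one closed contour, using the standard complex-coordinate textbook formulation; the paper's contour decomposition and exact normalization may differ.

import numpy as np

def fourier_descriptors(contour, n_coeffs=16):
    """Translation-, scale- and rotation-invariant Fourier descriptors of a
    closed contour given as an (N, 2) array of (x, y) boundary points."""
    z = contour[:, 0] + 1j * contour[:, 1]   # boundary as a complex signal
    F = np.fft.fft(z)
    F[0] = 0.0                               # drop DC term -> translation invariance
    mags = np.abs(F)                         # magnitudes -> rotation/start-point invariance
    mags = mags / (mags[1] + 1e-12)          # normalize by first harmonic -> scale invariance
    return mags[1:n_coeffs + 1]

# Each contour obtained from the decomposition yields one descriptor vector;
# the set of vectors forms the multi-contour image signature used for retrieval.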
Resumo:
Automatic summarization of texts is now crucial for several information retrieval tasks owing to the huge amount of information available in digital media, which has increased the demand for simple, language-independent extractive summarization strategies. In this paper, we employ concepts and metrics of complex networks to select sentences for an extractive summary. The graph or network representing one piece of text consists of nodes corresponding to sentences, while edges connect sentences that share common meaningful nouns. Because various metrics could be used, we developed a set of 14 summarizers, generically referred to as CN-Summ, employing network concepts such as node degree, length of shortest paths, d-rings and k-cores. An additional summarizer was created which selects the highest ranked sentences in the 14 systems, as in a voting system. When applied to a corpus of Brazilian Portuguese texts, some CN-Summ versions performed better than summarizers that do not employ deep linguistic knowledge, with results comparable to state-of-the-art summarizers based on expensive linguistic resources. The use of complex networks to represent texts appears therefore as suitable for automatic summarization, consistent with the belief that the metrics of such networks may capture important text features. (c) 2008 Elsevier Inc. All rights reserved.
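A minimal sketch of one of the network concepts named above (node degree): sentences are nodes, edges link sentences sharing meaningful nouns, and the highest-degree sentences form the extract. Noun detection is faked with a stopword filter here; the actual CN-Summ systems use proper linguistic preprocessing, and the other 13 metrics are not shown.

import itertools
import re

STOP = {"the", "a", "of", "and", "to", "in", "is", "that", "for", "on"}

def summarize(text, n_sentences=3):
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [{w for w in re.findall(r"[a-z]+", s.lower())} - STOP for s in sentences]
    degree = [0] * len(sentences)
    for i, j in itertools.combinations(range(len(sentences)), 2):
        if words[i] & words[j]:              # shared (approximate) nouns -> edge
            degree[i] += 1
            degree[j] += 1
    ranked = sorted(range(len(sentences)), key=lambda i: -degree[i])[:n_sentences]
    return " ".join(sentences[i] for i in sorted(ranked))   # keep original order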
Resumo:
Addresses issues of information organization and retrieval in the specific case of the collection of the Centro de Pesquisa e Documentação de História Contemporânea do Brasil (CPDOC). The analysis is based on a case study of the use of the institution's reference service, provided through its consultation room (Sala de Consulta), and of the use of the Accessus database. A profile of the users of the institution's collection is drawn, along with a research profile of these individuals, by mapping users' behavior with the Accessus tool. The study covers the context in which the database was developed and investigates the creation of the controlled vocabulary for history and related fields that served as the basis for Accessus. It problematizes the accessibility of this language to an audience unfamiliar with the field, pairing this discussion with an analysis of the different user profiles. Finally, it discusses how the CPDOC collection is indexed and prompts reflections on making this process take the users' profiles directly into account.
Resumo:
The use of the alphabetical indexing language of union catalogs was evaluated from the perspective of university libraries and within the socio-cognitive context of indexers and users. It was concluded that the proper use of indexing languages for specialized scientific fields requires evaluating their currency, specificity, and compatibility in order to meet indexing and information retrieval needs.
Resumo:
This study presents a bibliographic synthesis of the evaluation methodologies proposed by international and Brazilian researchers and used by indexers at teaching and/or research institutions working in information units and/or documentation centers, as well as those assessed through the opinions of the end users of the information recorded and made available in numerous information systems, focusing on quantitative, qualitative, and qualitative/cognitive approaches, respectively.
Resumo:
Indexing automation has been discussed by researchers in the area of Information Science; however, the discussions have not been very clear on the use of indexing software. It is therefore necessary to understand indexing software and its application in the analysis of documentary content. To that end, we propose to investigate both the consistency of indexing and the exhaustiveness and precision of information retrieval, by means of a comparative analysis between SISA (Sistema de Indización Semi-Automático) automatic indexing and BIREME (Centro Latino-Americano e do Caribe de Informação em Ciências da Saúde) manual indexing. The aim of this paper is to contribute to the theoretical development of indexing automation and to the improvement of SISA. SISA was thus applied and evaluated based on the calculation of consistency indexes between the two types of indexing, and on the calculation of exhaustiveness and precision indexes in information retrieval, by searching the BDSISA and BIREME databases, composed of descriptors taken from SISA and from manual indexing, respectively. The differences between the terms used in scientific papers and the DeCS descriptors were the main factor hindering higher consistency indexes. These differences also influenced the exhaustiveness and precision indexes in information retrieval, showing that the documentary language used by the SISA software needs improvement and that linguistic methods should be incorporated.
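For concreteness, the sketch below computes the kind of indexes the study mentions, assuming Hooper's consistency measure and the standard precision/recall ("exhaustiveness") ratios; the paper's exact formulas are not given in the abstract.

def hooper_consistency(terms_a: set, terms_b: set) -> float:
    """Agreement between two indexings of the same document:
    common terms / (common + unique to A + unique to B)."""
    common = len(terms_a & terms_b)
    return common / (len(terms_a | terms_b) or 1)

def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    hits = len(retrieved & relevant)
    precision = hits / (len(retrieved) or 1)
    recall = hits / (len(relevant) or 1)     # "exhaustiveness"
    return precision, recall

# Hypothetical descriptor sets for one document, indexed two ways
sisa = {"neoplasms", "therapy", "child"}
bireme = {"neoplasms", "drug therapy", "child", "brazil"}
print(hooper_consistency(sisa, bireme))      # 2 common / 5 total -> 0.4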
Resumo:
Characteristics of speech, especially figures of speech, are used by specific communities or domains and, in this way, reflect their identities through their choice of vocabulary. This topic should be an object of study in the context of knowledge representation, since it deals with different contexts of document production. This study aims to explore the dimensions of the concepts of euphemism, dysphemism, and orthophemism, focusing on the latter, with the goal of extracting a concept that can be included in discussions about subject analysis and indexing. Euphemism is used as an alternative to a non-preferred expression or to an offensive attribution, in order to avoid potential offense taken by the listener or by other persons; for instance, pass away. Dysphemism, on the other hand, is used by speakers to talk about people and things that frustrate and annoy them; their choice of language indicates disapproval, and the topic is therefore denigrated, humiliated, or degraded; for instance, kick the bucket. While euphemism tries to make something sound better, dysphemism tries to make something sound worse. Orthophemism (Allan and Burridge 2006) is likewise an alternative expression, but it is the preferred, formal, and direct way of representing an object or a situation; for instance, die. This paper suggests that the comprehension and use of such concepts could support the following issues: possible contributions from linguistics and terminology to subject analysis, as demonstrated by Talamo et al. (1992); reduction of the polysemy and ambiguity of terms used to represent certain topics of documents; and construction and evaluation of indexing languages. The concept of orthophemism can also serve to support associative relationships in the context of subject analysis, indexing, and even information retrieval for more specific requests.
Resumo:
In some applications of case-based systems, the attributes available for indexing are better described as linguistic variables than given numerical treatment. In these applications, the concept of the fuzzy hypercube can be applied to give a geometrical interpretation of the similarities among cases. This paper presents an approach that uses geometrical properties of the fuzzy hypercube space to index and retrieve cases.
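A sketch of the geometric view: each case is a point in the unit hypercube [0, 1]^n, one coordinate per linguistic attribute (its membership degree). The similarity below is a standard fuzzy measure (1 minus normalized Hamming distance, in the Kosko fuzzy-hypercube tradition); the paper's exact geometric measure may differ, and the membership function is a toy assumption.

def fuzzify_age(years: float) -> float:
    """Toy membership function for the linguistic value 'young'."""
    return max(0.0, min(1.0, (40 - years) / 25))

def similarity(case_a: list[float], case_b: list[float]) -> float:
    """1 - normalized Hamming distance between two points in [0, 1]^n."""
    n = len(case_a)
    return 1 - sum(abs(a - b) for a, b in zip(case_a, case_b)) / n

query = [fuzzify_age(30), 0.8]           # ('young', 'high income') memberships
stored = [fuzzify_age(35), 0.6]
print(similarity(query, stored))         # 0.8 -> closer to 1 = more similar case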
Resumo:
The need to represent semantics and common sense, and to organize them in a lexical database or knowledge base, has motivated the development of large projects such as the Wordnets, CYC and Mikrokosmos. Besides these generic bases, another approach is the construction of ontologies for specific domains. Among the advantages of this approach is the possibility of greater and more detailed coverage of a specific domain and its terminology. Domain ontologies are important resources in several language processing tasks, especially those related to information retrieval and extraction over textual bases. Information retrieval and even question answering systems can benefit from the domain knowledge represented in an ontology. Besides embracing the terminology of the field, the ontology makes the relationships among the terms explicit. Copyright 2007 ACM.
Resumo:
A comparative evaluation was made of the use of natural language versus two specialized indexing languages, aiming to demonstrate the influence of the availability of indexing languages on the functioning of information retrieval systems. The study was conducted within the ambit of the construction of search strategies by subject in online university library catalogs. The precision ratio was calculated to determine the accuracy of each indexing language in subject-based information retrieval. From the comparative evaluation, it was concluded that the term specificity required by the user during retrieval was better satisfied when the query was made through controlled languages, whose availability and simplicity are also indispensable requisites.
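For reference, the precision ratio mentioned above is the standard IR measure, stated here in its textbook form (the study's exact computation is not given in the abstract):

\[
\text{Precision} = \frac{|\,\text{relevant} \cap \text{retrieved}\,|}{|\,\text{retrieved}\,|} \times 100\%
\]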