974 resultados para Automatic term extraction
Resumo:
There are many published methods available for creating keyphrases for documents. Previous work in the field has shown that in a significant proportion of cases author selected keyphrases are not appropriate for the document they accompany. This requires the use of such automated methods to improve the use of keyphrases. Often the keyphrases are not updated when the focus of a paper changes or include keyphrases that are more classificatory than explanatory. The published methods are all evaluated using different corpora, typically one relevant to their field of study. This not only makes it difficult to incorporate the useful elements of algorithms in future work but also makes comparing the results of each method inefficient and ineffective. This paper describes the work undertaken to compare five methods across a common baseline of six corpora. The methods chosen were term frequency, inverse document frequency, the C-Value, the NC-Value, and a synonym based approach. These methods were compared to evaluate performance and quality of results, and to provide a future benchmark. It is shown that, with the comparison metric used for this study Term Frequency and Inverse Document Frequency were the best algorithms, with the synonym based approach following them. Further work in the area is required to determine an appropriate (or more appropriate) comparison metric.
Resumo:
In this paper the approach for automatic road extraction for an urban region using structural, spectral and geometric characteristics of roads has been presented. Roads have been extracted based on two levels: Pre-processing and road extraction methods. Initially, the image is pre-processed to improve the tolerance by reducing the clutter (that mostly represents the buildings, parking lots, vegetation regions and other open spaces). The road segments are then extracted using Texture Progressive Analysis (TPA) and Normalized cut algorithm. The TPA technique uses binary segmentation based on three levels of texture statistical evaluation to extract road segments where as, Normalizedcut method for road extraction is a graph based method that generates optimal partition of road segments. The performance evaluation (quality measures) for road extraction using TPA and normalized cut method is compared. Thus the experimental result show that normalized cut method is efficient in extracting road segments in urban region from high resolution satellite image.
Resumo:
Analysis of high resolution satellite images has been an important research topic for urban analysis. One of the important features of urban areas in urban analysis is the automatic road network extraction. Two approaches for road extraction based on Level Set and Mean Shift methods are proposed. From an original image it is difficult and computationally expensive to extract roads due to presences of other road-like features with straight edges. The image is preprocessed to improve the tolerance by reducing the noise (the buildings, parking lots, vegetation regions and other open spaces) and roads are first extracted as elongated regions, nonlinear noise segments are removed using a median filter (based on the fact that road networks constitute large number of small linear structures). Then road extraction is performed using Level Set and Mean Shift method. Finally the accuracy for the road extracted images is evaluated based on quality measures. The 1m resolution IKONOS data has been used for the experiment.
Resumo:
Automatic indexing and retrieval of digital data poses major challenges. The main problem arises from the ever increasing mass of digital media and the lack of efficient methods for indexing and retrieval of such data based on the semantic content rather than keywords. To enable intelligent web interactions, or even web filtering, we need to be capable of interpreting the information base in an intelligent manner. For a number of years research has been ongoing in the field of ontological engineering with the aim of using ontologies to add such (meta) knowledge to information. In this paper, we describe the architecture of a system (Dynamic REtrieval Analysis and semantic metadata Management (DREAM)) designed to automatically and intelligently index huge repositories of special effects video clips, based on their semantic content, using a network of scalable ontologies to enable intelligent retrieval. The DREAM Demonstrator has been evaluated as deployed in the film post-production phase to support the process of storage, indexing and retrieval of large data sets of special effects video clips as an exemplar application domain. This paper provides its performance and usability results and highlights the scope for future enhancements of the DREAM architecture which has proven successful in its first and possibly most challenging proving ground, namely film production, where it is already in routine use within our test bed Partners' creative processes. (C) 2009 Published by Elsevier B.V.
Resumo:
This work describes a novel methodology for automatic contour extraction from 2D images of 3D neurons (e.g. camera lucida images and other types of 2D microscopy). Most contour-based shape analysis methods cannot be used to characterize such cells because of overlaps between neuronal processes. The proposed framework is specifically aimed at the problem of contour following even in presence of multiple overlaps. First, the input image is preprocessed in order to obtain an 8-connected skeleton with one-pixel-wide branches, as well as a set of critical regions (i.e., bifurcations and crossings). Next, for each subtree, the tracking stage iteratively labels all valid pixel of branches, tip to a critical region, where it determines the suitable direction to proceed. Finally, the labeled skeleton segments are followed in order to yield the parametric contour of the neuronal shape under analysis. The reported system was successfully tested with respect to several images and the results from a set of three neuron images are presented here, each pertaining to a different class, i.e. alpha, delta and epsilon ganglion cells, containing a total of 34 crossings. The algorithms successfully got across all these overlaps. The method has also been found to exhibit robustness even for images with close parallel segments. The proposed method is robust and may be implemented in an efficient manner. The introduction of this approach should pave the way for more systematic application of contour-based shape analysis methods in neuronal morphology. (C) 2008 Elsevier B.V. All rights reserved.
Resumo:
The central objective of research in Information Retrieval (IR) is to discover new techniques to retrieve relevant information in order to satisfy an Information Need. The Information Need is satisfied when relevant information can be provided to the user. In IR, relevance is a fundamental concept which has changed over time, from popular to personal, i.e., what was considered relevant before was information for the whole population, but what is considered relevant now is specific information for each user. Hence, there is a need to connect the behavior of the system to the condition of a particular person and his social context; thereby an interdisciplinary sector called Human-Centered Computing was born. For the modern search engine, the information extracted for the individual user is crucial. According to the Personalized Search (PS), two different techniques are necessary to personalize a search: contextualization (interconnected conditions that occur in an activity), and individualization (characteristics that distinguish an individual). This movement of focus to the individual's need undermines the rigid linearity of the classical model overtaken the ``berry picking'' model which explains that the terms change thanks to the informational feedback received from the search activity introducing the concept of evolution of search terms. The development of Information Foraging theory, which observed the correlations between animal foraging and human information foraging, also contributed to this transformation through attempts to optimize the cost-benefit ratio. This thesis arose from the need to satisfy human individuality when searching for information, and it develops a synergistic collaboration between the frontiers of technological innovation and the recent advances in IR. The search method developed exploits what is relevant for the user by changing radically the way in which an Information Need is expressed, because now it is expressed through the generation of the query and its own context. As a matter of fact the method was born under the pretense to improve the quality of search by rewriting the query based on the contexts automatically generated from a local knowledge base. Furthermore, the idea of optimizing each IR system has led to develop it as a middleware of interaction between the user and the IR system. Thereby the system has just two possible actions: rewriting the query, and reordering the result. Equivalent actions to the approach was described from the PS that generally exploits information derived from analysis of user behavior, while the proposed approach exploits knowledge provided by the user. The thesis went further to generate a novel method for an assessment procedure, according to the "Cranfield paradigm", in order to evaluate this type of IR systems. The results achieved are interesting considering both the effectiveness achieved and the innovative approach undertaken together with the several applications inspired using a local knowledge base.
Resumo:
Recent advances in airborne Light Detection and Ranging (LIDAR) technology allow rapid and inexpensive measurements of topography over large areas. Airborne LIDAR systems usually return a 3-dimensional cloud of point measurements from reflective objects scanned by the laser beneath the flight path. This technology is becoming a primary method for extracting information of different kinds of geometrical objects, such as high-resolution digital terrain models (DTMs), buildings and trees, etc. In the past decade, LIDAR gets more and more interest from researchers in the field of remote sensing and GIS. Compared to the traditional data sources, such as aerial photography and satellite images, LIDAR measurements are not influenced by sun shadow and relief displacement. However, voluminous data pose a new challenge for automated extraction the geometrical information from LIDAR measurements because many raster image processing techniques cannot be directly applied to irregularly spaced LIDAR points. ^ In this dissertation, a framework is proposed to filter out information about different kinds of geometrical objects, such as terrain and buildings from LIDAR automatically. They are essential to numerous applications such as flood modeling, landslide prediction and hurricane animation. The framework consists of several intuitive algorithms. Firstly, a progressive morphological filter was developed to detect non-ground LIDAR measurements. By gradually increasing the window size and elevation difference threshold of the filter, the measurements of vehicles, vegetation, and buildings are removed, while ground data are preserved. Then, building measurements are identified from no-ground measurements using a region growing algorithm based on the plane-fitting technique. Raw footprints for segmented building measurements are derived by connecting boundary points and are further simplified and adjusted by several proposed operations to remove noise, which is caused by irregularly spaced LIDAR measurements. To reconstruct 3D building models, the raw 2D topology of each building is first extracted and then further adjusted. Since the adjusting operations for simple building models do not work well on 2D topology, 2D snake algorithm is proposed to adjust 2D topology. The 2D snake algorithm consists of newly defined energy functions for topology adjusting and a linear algorithm to find the minimal energy value of 2D snake problems. Data sets from urbanized areas including large institutional, commercial, and small residential buildings were employed to test the proposed framework. The results demonstrated that the proposed framework achieves a very good performance. ^
Resumo:
The methodology of extracting information from texts has widely been described in the current literature. However, the methodology has been developed mainly for the purposes of other fields than terminology science. In addition, the research has been English language oriented. Therefore, there are no satisfactory language-independent methods for extracting terminological information from texts. The aim of the present study is to form the basis for a further improvement of methods for extraction of terminological information. A further aim is to determine differences in term extraction between subject groups with or without knowledge of the special field in question. The study is based on the theory of terminology, and has mainly a qualitative approach. The research material consists of electronically readable specialized texts in the subject domain of maritime safety. Textbooks, conference papers, research reports and articles from professional journals in Finnish and in Russian are included. The thesis first deals with certain term extraction methods. These are manual term identification and semi-automatic term extraction, the latter of which was carried out by using three commercial computer programs. The results of term extraction were compared and the recall and precision of the methods were evaluated. The latter part of the study is dedicated to the identification of concept relations. Certain linguistic expressions, which some researchers call knowledge probes, were applied to identify concept relations. The results of the present thesis suggest that special field knowledge is an advantage in manual term identification. However, in the candidate term lists the variation between subject groups was not as remarkable as it was between individual subjects. The term extraction software tested here produces candidate term lists which can be useful, but only after some manual work. Therefore, the work emphasizes the need to further develop term extraction software. Furthermore, the analyses indicate that there are a certain number of terms which were extracted by all the subjects and the software. These terms we call core terms. As the result of the experiment on linguistic expressions which signal concept relations, a proposal of Finnish and Russian knowledge probes in the field of maritime safety was made. The main finding was that it would be useful to combine the use of knowledge probes with semi-automatic term extraction since knowledge probes usually occur in the vicinity of terms.
Resumo:
Les travaux entrepris dans le cadre de la présente thèse portent sur l’analyse de l’équivalence terminologique en corpus parallèle et en corpus comparable. Plus spécifiquement, nous nous intéressons aux corpus de textes spécialisés appartenant au domaine du changement climatique. Une des originalités de cette étude réside dans l’analyse des équivalents de termes simples. Les bases théoriques sur lesquelles nous nous appuyons sont la terminologie textuelle (Bourigault et Slodzian 1999) et l’approche lexico-sémantique (L’Homme 2005). Cette étude poursuit deux objectifs. Le premier est d’effectuer une analyse comparative de l’équivalence dans les deux types de corpus afin de vérifier si l’équivalence terminologique observable dans les corpus parallèles se distingue de celle que l’on trouve dans les corpus comparables. Le deuxième consiste à comparer dans le détail les équivalents associés à un même terme anglais, afin de les décrire et de les répertorier pour en dégager une typologie. L’analyse détaillée des équivalents français de 343 termes anglais est menée à bien grâce à l’exploitation d’outils informatiques (extracteur de termes, aligneur de textes, etc.) et à la mise en place d’une méthodologie rigoureuse divisée en trois parties. La première partie qui est commune aux deux objectifs de la recherche concerne l’élaboration des corpus, la validation des termes anglais et le repérage des équivalents français dans les deux corpus. La deuxième partie décrit les critères sur lesquels nous nous appuyons pour comparer les équivalents des deux types de corpus. La troisième partie met en place la typologie des équivalents associés à un même terme anglais. Les résultats pour le premier objectif montrent que sur les 343 termes anglais analysés, les termes présentant des équivalents critiquables dans les deux corpus sont relativement peu élevés (12), tandis que le nombre de termes présentant des similitudes d’équivalence entre les corpus est très élevé (272 équivalents identiques et 55 équivalents non critiquables). L’analyse comparative décrite dans ce chapitre confirme notre hypothèse selon laquelle la terminologie employée dans les corpus parallèles ne se démarque pas de celle des corpus comparables. Les résultats pour le deuxième objectif montrent que de nombreux termes anglais sont rendus par plusieurs équivalents (70 % des termes analysés). Il est aussi constaté que ce ne sont pas les synonymes qui forment le groupe le plus important des équivalents, mais les quasi-synonymes. En outre, les équivalents appartenant à une autre partie du discours constituent une part importante des équivalents. Ainsi, la typologie élaborée dans cette thèse présente des mécanismes de l’équivalence terminologique peu décrits aussi systématiquement dans les travaux antérieurs.
Resumo:
In any terminological study, candidate term extraction is a very time-consuming task. Corpus analysis tools have automatized some processes allowing the detection of relevant data within the texts, facilitating term candidate selection as well. Nevertheless, these tools are (normally) not specific for terminology research; therefore, the units which are automatically extracted need manual evaluation. Over the last few years some software products have been specifically developed for automatic term extraction. They are based on corpus analysis, but use linguistic and statistical information to filter data more precisely. As a result, the time needed for manual evaluation is reduced. In this framework, we tried to understand if and how these new tools can really be an advantage. In order to develop our project, we simulated a terminology study: we chose a domain (i.e. legal framework for medicinal products for human use) and compiled a corpus from which we extracted terms and phraseologisms using AntConc, a corpus analysis tool. Afterwards, we compared our list with the lists extracted automatically from three different tools (TermoStat Web, TaaS e Sketch Engine) in order to evaluate their performance. In the first chapter we describe some principles relating to terminology and phraseology in language for special purposes and show the advantages offered by corpus linguistics. In the second chapter we illustrate some of the main concepts of the domain selected, as well as some of the main features of legal texts. In the third chapter we describe automatic term extraction and the main criteria to evaluate it; moreover, we introduce the term-extraction tools used for this project. In the fourth chapter we describe our research method and, in the fifth chapter, we show our results and draw some preliminary conclusions on the performance and usefulness of term-extraction tools.
Resumo:
In the paper we consider the technology of new domain's ontologies development. We discuss main principles of ontology development, automatic methods of terms extraction from the domain texts and types of ontology relations.