992 results for natural languages


Relevance:

60.00%

Publisher:

Abstract:

While statistical physics methods have proved useful for unveiling many patterns in large text corpora, no comprehensive investigation has been performed on the interdependence between syntactic and semantic factors. In this study we propose a framework for determining whether a text (e.g., one written in an unknown alphabet) is compatible with a natural language and, if so, to which language it could belong. The approach is based on three types of statistical measurements: first-order statistics of word properties in a text, the topology of complex networks representing texts, and intermittency concepts in which the text is treated as a time series. Comparative experiments were performed with the New Testament in 15 different languages and with distinct books in English and Portuguese, in order to quantify how much the different measurements depend on the language and on the story being told in the book. The metrics found to be informative in distinguishing real texts from their shuffled versions include assortativity, degree, and selectivity of words. As an illustration, we analyze an undeciphered medieval manuscript known as the Voynich Manuscript. We show that it is mostly compatible with natural languages and incompatible with random texts. We also obtain candidate keywords for the Voynich Manuscript, which could be helpful in the effort to decipher it. Because we were able to identify statistical measurements that depend more on syntax than on semantics, the framework may also serve for text analysis in language-dependent applications.
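One of the network metrics cited above, degree assortativity, can be illustrated with a minimal stdlib-only sketch: build a word adjacency network from consecutive words and correlate the degrees at the two ends of each edge. This is a toy illustration of the metric only, not the authors' actual pipeline, and the example sentence is invented.

```python
import random
from collections import defaultdict

def assortativity(tokens):
    """Degree assortativity of a word adjacency network: the Pearson
    correlation between the degrees at the two endpoints of each edge."""
    # Link consecutive distinct words with undirected, deduplicated edges.
    edges = set()
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            edges.add(tuple(sorted((a, b))))
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    # Collect endpoint degrees in both orientations, then correlate.
    xs, ys = [], []
    for a, b in edges:
        xs += [degree[a], degree[b]]
        ys += [degree[b], degree[a]]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

words = "the quick brown fox jumps over the lazy dog and the fox".split()
shuffled = words[:]
random.seed(0)
random.shuffle(shuffled)
print(assortativity(words), assortativity(shuffled))
```

On real corpora the paper compares such values between authentic texts and their shuffled versions; on a sentence this short the comparison is only schematic.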

Relevance:

60.00%

Publisher:

Abstract:

Information is nowadays a key resource: machine learning and data mining techniques have been developed to extract high-level information from great amounts of data. As most data comes in the form of unstructured text in natural languages, research on text mining is currently very active and addresses practical problems. Among these, text categorization deals with the automatic organization of large quantities of documents into predefined taxonomies of topic categories, possibly arranged in large hierarchies. In commonly proposed machine learning approaches, classifiers are trained automatically from pre-labeled documents: they can perform very accurate classification, but often require a sizable training set and notable computational effort. Methods for cross-domain text categorization have been proposed that leverage a set of labeled documents from one domain to classify those of another. Most such methods use advanced statistical techniques, usually involving parameter tuning. A first contribution presented here is a method based on nearest centroid classification, in which profiles of categories are generated from the known domain and then iteratively adapted to the unknown one. Despite being conceptually simple and having easily tuned parameters, this method achieves state-of-the-art accuracy on most benchmark datasets with fast running times. A second, deeper contribution involves the design of a domain-independent model to distinguish the degree and type of relatedness between arbitrary documents and topics, inferred from the different types of semantic relationships between their representative words, which are identified by specific search algorithms. The application of this model is tested on both flat and hierarchical text categorization, where it potentially allows the efficient addition of new categories during classification.
Results show that classification accuracy still requires improvement, but models generated from one domain prove to be effectively reusable in a different one.
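The nearest centroid adaptation described above can be sketched roughly as follows. This is a hedged toy reconstruction with invented data, plain bag-of-words profiles, and cosine similarity; the thesis' actual method and its parameters are not reproduced here.

```python
from collections import Counter
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse bag-of-words vectors (dicts)."""
    dot = sum(u[w] * v.get(w, 0) for w in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(docs):
    """Normalized sum of the documents' term counts."""
    c = Counter()
    for d in docs:
        c.update(d)
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

def adapt_centroids(source_labeled, target_docs, iterations=5):
    """Build category centroids from the labeled source domain, then
    iteratively re-estimate each centroid from the target documents
    it attracts, adapting the profiles to the unknown domain."""
    cats = {c: centroid(ds) for c, ds in source_labeled.items()}
    for _ in range(iterations):
        buckets = {c: [] for c in cats}
        for d in target_docs:
            best = max(cats, key=lambda c: cosine(d, cats[c]))
            buckets[best].append(d)
        cats = {c: centroid(ds) if ds else cats[c] for c, ds in buckets.items()}
    return cats

# Hypothetical source-domain documents (labeled) and target documents.
src = {
    "sports": [Counter("ball game team ball".split())],
    "tech":   [Counter("code software bug".split())],
}
tgt = [Counter("team match ball".split()), Counter("software release code".split())]
cats = adapt_centroids(src, tgt)
label = max(cats, key=lambda c: cosine(Counter("ball team".split()), cats[c]))
print(label)
```

The design choice mirrored here is the one the abstract emphasizes: no statistical machinery beyond centroids and a similarity function, so the only parameters are the iteration count and the document representation.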

Relevance:

60.00%

Publisher:

Abstract:

In his influential article about the evolution of the Web, Berners-Lee [1] envisions a Semantic Web in which humans and computers alike are capable of understanding and processing information. This vision is yet to materialize. The main obstacle is that on today's Web, meaning is most often rooted not in formal semantics but in natural language and, in the sense of semiology, emerges only upon interpretation and processing. Yet an automated form of interpretation and processing can be tackled by precisiating raw natural language. To do that, Web agents extract fuzzy grassroots ontologies through induction from existing Web content. Inductive fuzzy grassroots ontologies thus constitute organically evolved knowledge bases that resemble automated gradual thesauri, which allow precisiating natural language [2]. The Web agents' underlying dynamic, self-organizing, best-effort induction enables a sub-syntactical, bottom-up learning of semiotic associations. Knowledge is thus induced from users' natural use of language in mutual Web interactions and stored in a gradual, thesaurus-like lexical world-knowledge database serving as a top-level ontology, eventually allowing a form of computing with words [3]. Since in computing with words the objects of computation are words, phrases, and propositions drawn from natural languages, it proves to be a practical notion for yielding emergent semantics for the Semantic Web. In the end, an improved understanding by computers should, on the one hand, upgrade human-computer interaction on the Web and, on the other hand, allow an initial version of human intelligence amplification through the Web.
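One simple way such graded, thesaurus-like associations could be induced from Web content is co-occurrence statistics: grade the association from word a to word b by the share of documents containing a that also contain b. This is only an illustrative assumption about the induction step, not the abstract's actual algorithm, and the documents are invented.

```python
from collections import Counter
from itertools import combinations

def induce_fuzzy_thesaurus(documents):
    """Induce graded word associations: mu(a -> b) is the fraction of
    documents containing a that also contain b, a membership degree in [0, 1]."""
    doc_count = Counter()
    pair_count = Counter()
    for doc in documents:
        terms = set(doc.split())
        doc_count.update(terms)
        for a, b in combinations(sorted(terms), 2):
            pair_count[(a, b)] += 1
            pair_count[(b, a)] += 1
    return {(a, b): n / doc_count[a] for (a, b), n in pair_count.items()}

docs = ["jaguar speed car", "jaguar jungle cat", "car speed road"]
mu = induce_fuzzy_thesaurus(docs)
print(mu[("jaguar", "speed")])  # 0.5: half of the 'jaguar' documents mention 'speed'
```

The asymmetry of mu (mu(a, b) need not equal mu(b, a)) is what makes the structure a gradual thesaurus rather than a flat similarity matrix.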

Relevance:

60.00%

Publisher:

Abstract:

Recently, the Semantic Web has experienced significant advancements in standards and techniques, as well as in the amount of semantic information available online. Nevertheless, mechanisms are still needed to automatically reconcile information expressed in different natural languages on the Web of Data, in order to improve access to semantic information across language barriers. In this context, several challenges arise [1], such as: (i) ontology translation/localization, (ii) cross-lingual ontology mappings, (iii) representation of multilingual lexical information, and (iv) cross-lingual access and querying of linked data. In the following we focus on the second challenge: the necessity of establishing, representing, and storing cross-lingual links among semantic information on the Web. Indeed, in a "truly" multilingual Semantic Web, semantic data with lexical representations in one natural language would be mapped to equivalent or related information in other languages, thus making navigation across multilingual information possible for software agents.

Relevance:

60.00%

Publisher:

Abstract:

Recently, the Semantic Web has experienced significant advancements in standards and techniques, as well as in the amount of semantic information available online. Even so, mechanisms are still needed to automatically reconcile semantic information when it is expressed in different natural languages, so that access to Web information across language barriers can be improved. That requires developing techniques for discovering and representing cross-lingual links on the Web of Data. In this paper we explore the different dimensions of such a problem and reflect on possible avenues of research on that topic.

Relevance:

60.00%

Publisher:

Abstract:

The Web has witnessed an enormous growth in the amount of semantic information published in recent years. This growth has been stimulated to a large extent by the emergence of Linked Data. Although this brings us a big step closer to the vision of a Semantic Web, it also raises new issues, such as the need to deal with information expressed in different natural languages. Indeed, although the Web of Data can contain any kind of information in any language, it still lacks explicit mechanisms to automatically reconcile such information when it is expressed in different languages. This leads to situations in which data expressed in a certain language is not easily accessible to speakers of other languages. The Web of Data shows the potential for being extended to a truly multilingual web, as vocabularies and data can be published in a language-independent fashion, while the associated language-dependent (linguistic) information supporting access across languages can be stored separately. In this sense, the multilingual Web of Data can in our view be realized as a layer of services and resources on top of the existing Linked Data infrastructure, adding: (i) linguistic information for data and vocabularies in different languages, (ii) mappings between data with labels in different languages, and (iii) services to dynamically access and traverse Linked Data across different languages. In this article we present this vision of a multilingual Web of Data. We discuss the challenges that need to be addressed to make this vision come true, as well as the role that techniques such as ontology localization, ontology mapping, and cross-lingual ontology-based information access and presentation will play in achieving it. Further, we propose an initial architecture and describe a roadmap that can provide a basis for the implementation of this vision.
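The second ingredient of such a layer, mappings between data with labels in different languages, can be sketched naively as label matching through a bilingual dictionary. The URIs, labels, and dictionary below are invented, and a real service would need disambiguation and confidence scoring rather than exact matches; the sketch only shows the shape of the mapping step.

```python
def propose_crosslingual_links(dataset_en, dataset_es, bilingual_dict):
    """Propose owl:sameAs candidates between two datasets by translating
    English rdfs:label values through a bilingual dictionary and
    matching them against Spanish labels."""
    # Index English resources by the translated (Spanish) form of their label.
    translated = {}
    for uri, label in dataset_en.items():
        es_label = bilingual_dict.get(label.lower())
        if es_label:
            translated[es_label] = uri
    links = []
    for uri_es, label_es in dataset_es.items():
        uri_en = translated.get(label_es.lower())
        if uri_en:
            links.append((uri_en, "owl:sameAs", uri_es))
    return links

# Hypothetical resources and dictionary.
en = {"http://ex.org/en/Moon": "moon"}
es = {"http://ex.org/es/Luna": "luna"}
links = propose_crosslingual_links(en, es, {"moon": "luna"})
print(links)
```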

Relevance:

60.00%

Publisher:

Abstract:

In the context of the Semantic Web, resources on the net can be enriched with well-defined, machine-understandable metadata describing their associated conceptual meaning. These metadata, which consist of natural language descriptions of concepts, are the focus of the activity we describe in this chapter, namely ontology localization. In the framework of the NeOn Methodology, ontology localization is defined as the activity of adapting an ontology to a particular language and culture. This adaptation mainly involves translating the natural language descriptions of the ontology from a source natural language to a target natural language, with the final objective of obtaining a multilingual ontology, that is, an ontology documented in several natural languages. The purpose of this chapter is to provide detailed and prescriptive methodological guidelines to support the performance of this activity.

Relevance:

60.00%

Publisher:

Abstract:

In this article, we argue that there is a growing number of linked datasets in different natural languages, and that there is a need for guidelines and mechanisms to ensure the quality and organic growth of this emerging multilingual data network. However, we have little knowledge regarding the actual state of this data network, its current practices, and the open challenges that it poses. Questions regarding the distribution of natural languages, the links that are established across data in different languages, or how linguistic features are represented, remain mostly unanswered. Addressing these and other language-related issues can help to identify existing problems, propose new mechanisms and guidelines or adapt the ones in use for publishing linked data including language-related features, and, ultimately, provide metrics to evaluate quality aspects. In this article we review, discuss, and extend current guidelines for publishing linked data by focusing on those methods, techniques and tools that can help RDF publishers to cope with language barriers. Whenever possible, we will illustrate and discuss each of these guidelines, methods, and tools on the basis of practical examples that we have encountered in the publication of the datos.bne.es dataset.
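One of the open questions raised above, the distribution of natural languages across the data network, can be estimated directly from the language tags on literals in published triples. A minimal sketch, assuming simple N-Triples input; the example triples are invented:

```python
import re
from collections import Counter

def language_distribution(ntriples_lines):
    """Tally the language tags of language-tagged literals ("..."@tag)
    appearing in a sequence of N-Triples lines."""
    tag = re.compile(r'"@([A-Za-z]+(?:-[A-Za-z0-9]+)*)')
    counts = Counter()
    for line in ntriples_lines:
        counts.update(tag.findall(line))
    return counts

data = [
    '<http://ex.org/s> <http://www.w3.org/2000/01/rdf-schema#label> "Moon"@en .',
    '<http://ex.org/s> <http://www.w3.org/2000/01/rdf-schema#label> "Luna"@es .',
    '<http://ex.org/s> <http://ex.org/p> "42" .',
]
print(language_distribution(data))
```

Untagged literals (like the "42" above) are deliberately ignored; how often publishers omit language tags is itself one of the quality aspects the article discusses.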

Relevance:

60.00%

Publisher:

Abstract:

The purpose of this thesis is the automatic construction of ontologies from texts, within the area known as Ontology Learning. This discipline aims to automate the building of domain models from structured or unstructured information sources; it originated at the turn of the millennium, as a result of the exponential growth in the volume of information accessible on the Internet. Since most information on the Web is presented as text, automatic ontology learning has focused on this type of source, drawing over the years on a wide variety of techniques from areas such as Information Retrieval, Information Extraction, Summarization and, in general, fields related to natural language processing. The main contribution of this thesis is that, unlike most current techniques, the proposed method does not analyze the surface syntactic structure of language but studies its deep semantic level. Its objective, therefore, is to infer the domain model from the way the meanings of sentences are articulated in natural language. Because the deep semantic level is independent of the language, the method can operate in multilingual scenarios in which information from texts in different languages must be combined. To access this level of language, the method uses the interlingua model. These formalisms, which come from the field of machine translation, represent the meaning of sentences independently of any particular language. Specifically, UNL (Universal Networking Language) is used, regarded as the only standardized general-purpose interlingua. The approach taken in this thesis continues previous work by its author and by the research group to which he belongs, which studied how to use the interlingua model in multilingual information extraction and retrieval. Essentially, the procedure defined by the method tries to identify, in the UNL representation of texts, certain regularities from which the pieces of the domain ontology can be deduced. Because UNL is a formalism based on semantic networks, these regularities take the form of graphs, generalized into structures called linguistic patterns. UNL also preserves certain discourse-cohesion mechanisms inherited from natural languages, such as anaphora. To increase the effectiveness of expression understanding, the method provides, as another relevant contribution, an algorithm for resolving pronominal anaphora within the interlingua model, limited to third-person personal pronouns whose antecedent is a proper noun. The proposed method rests on a formal framework, built by adapting certain definitions from graph theory and incorporating new ones, in order to situate the notions of UNL expression and linguistic pattern, together with the pattern-matching operations that are the basis of the method's processes. Both the formal framework and all the processes the method defines have been implemented in order to carry out the experimentation, which was applied to an article from the UNESCO EOLSS "Encyclopedia of Life Support Systems" collection.
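The pattern-matching operation over UNL-like semantic graphs can be sketched as a small backtracking matcher. The triple representation and the "?variable" syntax are assumptions made for illustration; the thesis defines these notions formally, which this sketch does not attempt to reproduce.

```python
def match_pattern(pattern, graph, binding=None):
    """Match a linguistic pattern (a list of (relation, source, target)
    triples, where names starting with '?' are variables) against a
    semantic graph given as a set of ground triples. Returns a variable
    binding on success, or None."""
    binding = dict(binding or {})
    if not pattern:
        return binding
    rel, s, t = pattern[0]
    for grel, gs, gt in graph:
        if grel != rel:
            continue
        b = dict(binding)
        ok = True
        for var, val in ((s, gs), (t, gt)):
            if var.startswith("?"):
                if b.get(var, val) != val:  # conflicting earlier binding
                    ok = False
                    break
                b[var] = val
            elif var != val:                # constant mismatch
                ok = False
                break
        if ok:
            result = match_pattern(pattern[1:], graph, b)
            if result is not None:
                return result
    return None

# UNL-like relations for "John reads a book": agent and object of 'read'.
graph = {("agt", "read", "John"), ("obj", "read", "book")}
pattern = [("agt", "?verb", "?who"), ("obj", "?verb", "?what")]
print(match_pattern(pattern, graph))
```

Sharing the variable `?verb` across both pattern triples is what forces the two relations to attach to the same predicate, the kind of regularity from which ontology pieces are deduced.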

Relevance:

60.00%

Publisher:

Abstract:

Given a distribution of words (SLUNs or CLUNs) in a text written in language L(MT), one can fit one of the mathematical expressions for distributions found in the literature and take some parameter of the chosen expression as a measure of diversity. But because the fit is not always perfect, it is preferable to select an index that does not postulate a regularity of distribution expressible by a simple formula. The problem can be approached statistically, without special interest in the organization of the text. Any monotonic function can serve as an index provided it has a minimum value when all elements belong to the same class, that is, when all individuals carry the same symbol, and a maximum value when each element belongs to a different class, that is, when each individual carries a different symbol. It should also satisfy certain conditions: it should not be very sensitive to the length of the text, and it should be invariant under a certain number of selection operations on the text, which can in theory be random. The expressions that offer the most advantages are those coming from Shannon-Weaver information theory. Based on them, the authors develop a theoretical study of diversity indexes to be applied to texts built in the modeling language L(MT), although nothing prevents them from being applied to texts written in natural languages.
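The Shannon-based index described above, minimal when every individual carries the same symbol and maximal when all symbols differ, can be computed directly. A minimal sketch of the index itself, not of the authors' full theoretical study:

```python
from collections import Counter
from math import log

def shannon_diversity(tokens):
    """Shannon diversity H = sum p_i * log(1 / p_i): 0 when all tokens are
    the same symbol, log(N) when all N tokens are distinct symbols."""
    n = len(tokens)
    return sum((c / n) * log(n / c) for c in Counter(tokens).values())

print(shannon_diversity(["a"] * 10))          # minimum: one class only
print(shannon_diversity(list("abcdefghij")))  # maximum for 10 tokens: log(10)
```

Note the insensitivity properties the text asks for have to be checked separately; the bare index still grows slowly with text length as new word types appear.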

Relevance:

60.00%

Publisher:

Abstract:

Within the framework of contrastive grammar between English and Spanish, we adopt the generative perspective, according to which human syntax results from derivational processes that combine, through innate mechanisms, elements capable of encoding primitive meanings. We work with data, whether real or invented, that illustrate the different types of "grammatical" sentences possible in a natural language. We focus on the notion of "ergativity", used by different theories to explain syntactic and semantic phenomena that do not always coincide. From North American Functionalism/Cognitivism, Scott DeLancey (2001) compares different languages and morphologically distinguishes "agentive" subjects in caused structures from "affected" subjects in non-caused structures. According to Systemic Functional Linguistics (Halliday 1985, 2004), ergativity covers not only the presence or absence of Cause in a particular process, but also the causal relation linking different processes to one another. Halliday's ergative scheme includes, besides pairs of non-caused and externally caused change-of-state events, externally caused unergative events and transitive events instigated by another process, either expressed linguistically or inferred. From the perspective of Relational Semantics (Mateu, 2002), we reduce the number of available "primitive constructs" from three to two. We consider two alternations: the ergative-transitive alternation and the internal/external cause alternation with unergatives. We do not link transitive constructions, either syntactically or derivationally, with the inferred events that instigate them. We justify the derivational relation linking stative passive constructions with the ergative constructions of naturally alternating verbs, on the one hand, and with the transitive constructions of "locatum" and unergative verbs, on the other.
We reanalyze, with respect to the canonical literature, the nature of the clitic "se" in Spanish ergative constructions. This analysis provides theoretical guidance for approaching the relevant constructions and offers possible equivalences that may prove useful for translation.



Relevance:

60.00%

Publisher:

Abstract:

Bulgarian and world computer science lost a prominent colleague: Dimitar Petrov Shishkov, 22 January 1939, Varna – 8 March 2004, Sofia. D. Shishkov graduated in mathematics from Sofia University in 1962. In the last year of his studies he began a three-year specialization as a programmer at the Joint Institute of Nuclear Research – Dubna. He then worked at the Institute of Mathematics for two years. In 1966 D. Shishkov, together with a group of experts, transferred to the newly created Central Laboratory for Information Technologies. In 1976 he defended his PhD dissertation. He was an associate professor in computer science at Sofia University from 1985 and a professor in computer science from 2000. His scientific interests and results lay in the fields of computer architectures, computational linguistics, artificial intelligence, numerical methods, data structures, etc. He was also a remarkable teacher. D. Shishkov created high-quality software for the first Bulgarian electronic calculator "ELKA", one of the first calculators in the world, as well as for the series of calculators that followed and for specialized minicomputers. He was the initiator of the international project "Computerization of the natural languages" and a member of a range of international scientific organizations. Among his numerous activities was the organization of the first programming competition in 1979. D. Shishkov was the initiator of sport dancing in Bulgaria (1967) and founder of the first sport-dancing high school education in the world. He was a highly accomplished person with a diversity of interests, a developed sense of social responsibility, and great accuracy in his work. In 1996 D. Shishkov was awarded the International Prize ITHEA for outstanding achievements in the field of Information Theories and Applications. We are grateful to D. Shishkov for the chance to work together with him on the establishment and development of IJ ITA.

Relevance:

60.00%

Publisher:

Abstract:

In this paper we try to present how information technologies, as tools for the creation of digital bilingual dictionaries, can help preserve natural languages. Natural languages are an outstanding part of human cultural values and for that reason should be preserved as part of the world's cultural heritage. We describe our work on the bilingual lexical database supporting the Bulgarian-Polish Online dictionary, and briefly describe the main software tools for the Web presentation of the dictionary. We focus special attention on the presentation of verbs, the linguistic category with the richest specific characteristics in Bulgarian.
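A record in a bilingual lexical database of the kind described might look roughly like the sketch below. The schema is purely hypothetical and far simpler than the actual Bulgarian-Polish database, which encodes the rich specific characteristics of Bulgarian verbs; the entry shown is only an illustration of the lookup structure.

```python
def make_verb_entry(bg_lemma, pl_equivalents, aspect, conjugation):
    """A hypothetical record for one Bulgarian verb in a bilingual
    (Bulgarian-Polish) lexical database."""
    return {
        "lemma_bg": bg_lemma,
        "pos": "verb",
        "aspect": aspect,            # 'perfective' or 'imperfective'
        "conjugation": conjugation,  # Bulgarian conjugation class
        "equivalents_pl": pl_equivalents,
    }

def lookup(database, bg_lemma):
    """All entries for a Bulgarian lemma (homonyms may yield several)."""
    return [e for e in database if e["lemma_bg"] == bg_lemma]

db = [make_verb_entry("чета", ["czytać"], "imperfective", 1)]
print(lookup(db, "чета")[0]["equivalents_pl"])
```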