13 results for semantic textual similarity
in Biblioteca Digital da Produção Intelectual da Universidade de São Paulo
Abstract:
The classification of texts has become a major endeavor with so much electronic material available, for it is an essential task in several applications, including search engines and information retrieval. There are different ways to define similarity for grouping similar texts into clusters, as the concept of similarity may depend on the purpose of the task. For instance, in topic extraction similar texts mean those within the same semantic field, whereas in author recognition stylistic features should be considered. In this study, we introduce ways to classify texts employing concepts of complex networks, which may be able to capture syntactic, semantic and even pragmatic features. The interplay between various metrics of the complex networks is analyzed with three applications, namely identification of machine translation (MT) systems, evaluation of quality of machine translated texts and authorship recognition. We shall show that topological features of the networks representing texts can enhance the ability to identify MT systems in particular cases. For evaluating the quality of MT texts, on the other hand, high correlation was obtained with methods capable of capturing the semantics. This was expected because the gold standards used are themselves based on word co-occurrence. Nevertheless, the Katz similarity, which involves both semantics and structure in the comparison of texts, achieved the highest correlation with the NIST measurement, indicating that in some cases the combination of both approaches can improve the ability to quantify quality in MT. In authorship recognition, again the topological features were relevant in some contexts, though for the books and authors analyzed good results were obtained with semantic features as well. Because hybrid approaches encompassing semantic and topological features have not been extensively used, we believe that the methodology proposed here may be useful to enhance text classification considerably, as it combines well-established strategies. (c) 2012 Elsevier B.V. All rights reserved.
Abstract:
Even though the digital processing of documents is increasingly widespread in industry, printed documents are still largely in use. To process the contents of printed documents electronically, information must be extracted from digital images of the documents. When dealing with complex documents, in which the contents of different regions and fields can be highly heterogeneous with respect to layout, printing quality and the use of fonts and typing standards, reconstructing the contents of documents from digital images can be a difficult problem. In this article we present an efficient solution to this problem, in which the semantic contents of fields in a complex document are extracted from a digital image.
Abstract:
The Neotropical evaniid genus Evaniscus Szepligeti currently includes six species. Two new species are described, Evaniscus lansdownei Mullins, sp. n. from Colombia and Brazil and E. rafaeli Kawada, sp. n. from Brazil. Evaniscus sulcigenis Roman, syn. n., is synonymized under E. rufithorax Enderlein. An identification key to species of Evaniscus is provided. Thirty-five parsimony informative morphological characters are analyzed for six ingroup and four outgroup taxa. A topology resulting in a monophyletic Evaniscus is presented with E. tibialis and E. rafaeli as sister to the remaining Evaniscus species. The Hymenoptera Anatomy Ontology and other relevant biomedical ontologies are employed to create semantic phenotype statements in Entity-Quality (EQ) format for species descriptions. This approach is an early effort to formalize species descriptions and to make descriptive data available to other domains.
Abstract:
Background: Early progressive nonfluent aphasia (PNFA) may be difficult to differentiate from semantic dementia (SD) in a nonspecialist setting. There are descriptions of the clinical and neuropsychological profiles of patients with PNFA and SD but few systematic comparisons. Method: We compared the performance of groups with SD (n = 27) and PNFA (n = 16) with comparable ages, education, disease duration, and severity of dementia as measured by the Clinical Dementia Rating Scale on a comprehensive neuropsychological battery. Principal components analysis and intergroup comparisons were used. Results: A 5-factor solution accounted for 78.4% of the total variance with good separation of neuropsychological variables. As expected, both groups were anomic with preserved visuospatial function and mental speed. Patients with SD had lower scores on comprehension-based semantic tests and better performance on verbal working memory and phonological processing tasks. The opposite pattern was found in the PNFA group. Conclusions: Neuropsychological tests that examine verbal and nonverbal semantic associations, verbal working memory, and phonological processing are the most helpful for distinguishing between PNFA and SD.
Abstract:
With the increase in research on the components of Body Image, validated instruments are needed to evaluate its dimensions. The Body Change Inventory (BCI) assesses strategies used to alter body size among adolescents. The scope of this study was to describe the translation and evaluation for semantic equivalence of the BCI in the Portuguese language. The process involved the steps of (1) translation of the questionnaire to the Portuguese language; (2) back-translation to English; (3) evaluation of semantic equivalence; and (4) assessment of comprehension by professional experts and the target population. The six subscales of the instrument were translated into the Portuguese language. Language adaptations were made to render the instrument suitable for the Brazilian reality. The questions were interpreted as easily understandable by both experts and young people. The Body Change Inventory has been translated and adapted into Portuguese. Evaluation of the operational, measurement and functional equivalence is still needed.
Abstract:
XML similarity evaluation has become a central issue in the database and information communities, its applications ranging over document clustering, version control, data integration and ranked retrieval. Various algorithms for comparing hierarchically structured data, XML documents in particular, have been proposed in the literature. Most of them make use of techniques for finding the edit distance between tree structures, XML documents being commonly modeled as Ordered Labeled Trees. Yet, a thorough investigation of current approaches led us to identify several similarity aspects, i.e., sub-tree related structural and semantic similarities, which are not sufficiently addressed while comparing XML documents. In this paper, we provide an integrated and fine-grained comparison framework to deal with both structural and semantic similarities in XML documents (detecting the occurrences and repetitions of structurally and semantically similar sub-trees), and to allow the end-user to adjust the comparison process according to her requirements. Our framework consists of four main modules for (i) discovering the structural commonalities between sub-trees, (ii) identifying sub-tree semantic resemblances, (iii) computing tree-based edit operations costs, and (iv) computing tree edit distance. Experimental results demonstrate higher comparison accuracy with respect to alternative methods, while timing experiments reflect the impact of semantic similarity on overall system performance.
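To make the idea of an edit distance between Ordered Labeled Trees concrete, here is a minimal Python sketch. It is not the framework described in the abstract: it assumes a simplified, top-down (Selkow-style) distance in which relabeling a node costs 1 and inserting or deleting a whole sub-tree costs its number of nodes, and it ignores semantic similarity between labels entirely. The names `Node`, `size` and `tree_distance` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A node of an ordered labeled tree (e.g. an XML element)."""
    label: str
    children: list[Node] = field(default_factory=list)


def size(tree: Node) -> int:
    """Number of nodes in the tree; used as the cost of inserting/deleting it."""
    return 1 + sum(size(child) for child in tree.children)


def tree_distance(a: Node, b: Node) -> int:
    """Simplified top-down edit distance (assumed unit costs, structure only)."""
    relabel = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)

    # dp[i][j]: cost of turning the first i children of a into the first j of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(a.children[i - 1]),      # delete sub-tree
                dp[i][j - 1] + size(b.children[j - 1]),      # insert sub-tree
                dp[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + dp[m][n]


doc1 = Node("book", [Node("title"), Node("author")])
doc2 = Node("book", [Node("title"), Node("writer")])
doc3 = Node("book", [Node("title"), Node("author"), Node("year")])
print(tree_distance(doc1, doc2), tree_distance(doc1, doc3))
```

In this sketch `author` and `writer` count as a full relabel. A semantically aware comparison of the kind the abstract describes would presumably lower that cost for related labels, but how it does so is not stated here.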
Abstract:
The ability to discriminate nestmates from non-nestmates in insect societies is essential to protect colonies from conspecific invaders. The acceptance threshold hypothesis predicts that organisms whose recognition systems classify recipients without errors should optimize the balance between acceptance and rejection. In this process, cuticular hydrocarbons play an important role as cues of recognition in social insects. The aim of this study was to determine whether guards exhibit a restrictive level of rejection towards chemically distinct individuals, becoming more permissive during encounters with either nestmate or non-nestmate individuals bearing chemically similar profiles. The study demonstrates that Melipona asilvai (Hymenoptera: Apidae: Meliponini) guards exhibit a flexible system of nestmate recognition according to the degree of chemical similarity between the incoming forager and their own cuticular hydrocarbon profile. Guards became less restrictive in their acceptance rates when they encountered non-nestmates with highly similar chemical profiles, which they probably mistook for nestmates, hence broadening their acceptance level.
Abstract:
Traditional supervised data classification considers only physical features (e.g., distance or similarity) of the input data. Here, this type of learning is called low level classification. On the other hand, the human (animal) brain performs both low and high orders of learning and readily identifies patterns according to the semantic meaning of the input data. Data classification that considers not only physical attributes but also the pattern formation is, here, referred to as high level classification. In this paper, we propose a hybrid classification technique that combines both types of learning. The low level term can be implemented by any classification technique, while the high level term is realized by the extraction of features of the underlying network constructed from the input data. Thus, the former classifies the test instances by their physical features or class topologies, while the latter measures the compliance of the test instances with the pattern formation of the data. Our study shows that the proposed technique can not only perform classification according to the pattern formation, but also improve the performance of traditional classification techniques. Furthermore, as the complexity of the class configuration increases, such as the mixture among different classes, a larger portion of the high level term is required to obtain correct classification. This feature confirms that high level classification has special importance in complex classification situations. Finally, we show how the proposed technique can be employed in a real-world application, where it is capable of identifying variations and distortions of handwritten digit images. As a result, it supplies an improvement in the overall pattern recognition rate.
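The combination of a low level term and a high level term can be illustrated with a short Python sketch. The abstract does not give the formula, so the convex combination below, the mixing weight `lam`, and the function names `hybrid_scores` and `predict` are all assumptions made for illustration. The per-class scores are placeholder numbers, not results from the paper.

```python
def hybrid_scores(low: dict[str, float], high: dict[str, float], lam: float) -> dict[str, float]:
    """Mix per-class scores: (1 - lam) * low level + lam * high level.

    Assumed form: the abstract only says that a larger portion of the
    high level term is needed as class configurations get more complex.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if low.keys() != high.keys():
        raise ValueError("both terms must score the same classes")
    return {c: (1.0 - lam) * low[c] + lam * high[c] for c in low}


def predict(low: dict[str, float], high: dict[str, float], lam: float) -> str:
    """Return the class with the highest mixed score."""
    scores = hybrid_scores(low, high, lam)
    return max(scores, key=scores.get)


# Placeholder scores for one test instance: the low level classifier
# prefers class "a", while the pattern-based term prefers class "b".
low_term = {"a": 0.6, "b": 0.4}
high_term = {"a": 0.2, "b": 0.8}

for lam in (0.0, 0.2, 0.5):
    print(lam, predict(low_term, high_term, lam))
```

With these placeholder values the predicted class switches from `a` to `b` once `lam` is large enough. This mirrors, in toy form, the abstract's observation that harder cases need a larger share of the high level term.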
Abstract:
HTLV-1 is endemic in Brazil and HIV/HTLV-1 coinfection has been detected, mostly in the northeast region. Cosmopolitan HTLV-1a is the main subtype that circulates in Brazil. This study characterized 17 HTLV-1 isolates from HIV-coinfected patients of southern (n = 7) and southeastern (n = 10) Brazil. HTLV-1 provirus DNA was amplified by nested PCR (env and LTR) and sequenced. Env sequences (705 bp) from 15 isolates and LTR sequences (731 bp) from 17 isolates showed 99.5% and 98.8% similarity among sequences, respectively. Comparing these sequences with ATK (HTLV-1a) and Mel5 (HTLV-1c) prototypes, similarities of 99% and 97.4%, respectively, for env and LTR with ATK, and 91.6% and 90.3% with Mel5, were detected. Phylogenetic analysis showed that all sequences belonged to the transcontinental subgroup A of the Cosmopolitan subtype, clustering in two Latin American clusters.
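A percentage similarity between two sequences is often computed as the share of matching positions in an alignment. The Python sketch below illustrates that arithmetic only. The abstract does not say how its similarity figures were obtained, so the definition used here (identical positions over alignment length, for two already-aligned sequences of equal length) is an assumption, and `percent_identity` is a hypothetical helper. The sequences are made-up examples.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two aligned sequences agree.

    Assumes the sequences are already aligned and of equal length;
    gap handling is deliberately left out of this sketch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(x == y for x, y in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Made-up 10-base fragments that differ at the last position only.
print(percent_identity("ACGTACGTAC", "ACGTACGTAA"))
```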
Abstract:
There is evidence that the explicit lexical-semantic processing deficits which characterize aphasia may be observed in the absence of implicit semantic impairment. The aim of this article was to critically review the international literature on lexical-semantic processing in aphasia, as tested through the semantic priming paradigm. Specifically, this review focused on aphasia and lexical-semantic processing, the methodological strengths and weaknesses of the semantic paradigms used, and recent evidence from neuroimaging studies on lexical-semantic processing. Furthermore, evidence on dissociations between implicit and explicit lexical-semantic processing reported in the literature is discussed and interpreted by referring to functional neuroimaging evidence from healthy populations. There is evidence that semantic priming effects can be found both in fluent and in non-fluent aphasias, and that these effects are related to an extensive network which includes the temporal lobe, the prefrontal cortex, the left frontal gyrus, the left temporal gyrus and the cingulate cortex.
Abstract:
Background: The study and analysis of gene expression measurements is the primary focus of functional genomics. Once expression data is available, biologists are faced with the task of extracting (new) knowledge associated with the underlying biological phenomenon. Most often, in order to perform this task, biologists execute a number of analysis activities on the available gene expression dataset rather than a single analysis activity. The integration of heterogeneous tools and data sources to create an integrated analysis environment represents a challenging and error-prone task. Semantic integration enables the assignment of unambiguous meanings to data shared among different applications in an integrated environment, allowing the exchange of data in a semantically consistent and meaningful way. This work aims at developing an ontology-based methodology for the semantic integration of gene expression analysis tools and data sources. The proposed methodology relies on software connectors to support not only the access to heterogeneous data sources but also the definition of transformation rules on exchanged data. Results: We have studied the different challenges involved in the integration of computer systems and the role software connectors play in this task. We have also studied a number of gene expression technologies, analysis tools and related ontologies in order to devise basic integration scenarios and propose a reference ontology for the gene expression domain. Then, we have defined a number of activities and associated guidelines to prescribe how the development of connectors should be carried out. Finally, we have applied the proposed methodology in the construction of three different integration scenarios involving the use of different tools for the analysis of different types of gene expression data. Conclusions: The proposed methodology facilitates the development of connectors capable of semantically integrating different gene expression analysis tools and data sources. The methodology can be used in the development of connectors supporting both simple and nontrivial processing requirements, thus assuring accurate data exchange and information interpretation from exchanged data.
Abstract:
In this paper, we present a novel approach to perform similarity queries over medical images, maintaining the semantics of a given query posed by the user. Content-based image retrieval systems relying on relevance feedback techniques usually request the users to label relevant/irrelevant images. Thus, we present a highly effective strategy to survey user profiles, taking advantage of such labeling to implicitly gather the user's perceptual similarity. The profiles maintain the settings desired for each user, allowing tuning of the similarity assessment, which encompasses the dynamic change of the distance function employed through an interactive process. Experiments on medical images show that the method is effective and can improve the decision-making process during analysis.
Abstract:
With the increasing production of information from e-government initiatives, there is also the need to transform a large volume of unstructured data into useful information for society. All this information should be easily accessible and made available in a meaningful and effective way in order to achieve semantic interoperability in electronic government services, which is a challenge to be pursued by governments around the world. Our aim is to discuss the context of e-Government Big Data and to present a framework to promote semantic interoperability through automatic generation of ontologies from unstructured information found on the Internet. We propose the use of fuzzy mechanisms to deal with natural language terms and present some related works found in this area. The results of this study consist of the architectural definition, major components and requirements that compose the proposed framework. With this, it is possible to take advantage of the large volume of information generated from e-Government initiatives and use it to benefit society.