13 resultados para Semantic Publishing, Linked Data, Bibliometrics, Informetrics, Data Retrieval, Citations
em Biblioteca Digital da Produção Intelectual da Universidade de São Paulo
Resumo:
With the increasing production of information from e-government initiatives, there is also the need to transform a large volume of unstructured data into useful information for society. All this information should be easily accessible and made available in a meaningful and effective way in order to achieve semantic interoperability in electronic government services, which is a challenge to be pursued by governments round the world. Our aim is to discuss the context of e-Government Big Data and to present a framework to promote semantic interoperability through automatic generation of ontologies from unstructured information found in the Internet. We propose the use of fuzzy mechanisms to deal with natural language terms and present some related works found in this area. The results achieved in this study are based on the architectural definition and major components and requirements in order to compose the proposed framework. With this, it is possible to take advantage of the large volume of information generated from e-Government initiatives and use it to benefit society.
Resumo:
In this paper we have quantified the consistency of word usage in written texts represented by complex networks, where words were taken as nodes, by measuring the degree of preservation of the node neighborhood. Words were considered highly consistent if the authors used them with the same neighborhood. When ranked according to the consistency of use, the words obeyed a log-normal distribution, in contrast to Zipf's law that applies to the frequency of use. Consistency correlated positively with the familiarity and frequency of use, and negatively with ambiguity and age of acquisition. An inspection of some highly consistent words confirmed that they are used in very limited semantic contexts. A comparison of consistency indices for eight authors indicated that these indices may be employed for author recognition. Indeed, as expected, authors of novels could be distinguished from those who wrote scientific texts. Our analysis demonstrated the suitability of the consistency indices, which can now be applied in other tasks, such as emotion recognition.
Resumo:
Abstract Background The study and analysis of gene expression measurements is the primary focus of functional genomics. Once expression data is available, biologists are faced with the task of extracting (new) knowledge associated to the underlying biological phenomenon. Most often, in order to perform this task, biologists execute a number of analysis activities on the available gene expression dataset rather than a single analysis activity. The integration of heteregeneous tools and data sources to create an integrated analysis environment represents a challenging and error-prone task. Semantic integration enables the assignment of unambiguous meanings to data shared among different applications in an integrated environment, allowing the exchange of data in a semantically consistent and meaningful way. This work aims at developing an ontology-based methodology for the semantic integration of gene expression analysis tools and data sources. The proposed methodology relies on software connectors to support not only the access to heterogeneous data sources but also the definition of transformation rules on exchanged data. Results We have studied the different challenges involved in the integration of computer systems and the role software connectors play in this task. We have also studied a number of gene expression technologies, analysis tools and related ontologies in order to devise basic integration scenarios and propose a reference ontology for the gene expression domain. Then, we have defined a number of activities and associated guidelines to prescribe how the development of connectors should be carried out. Finally, we have applied the proposed methodology in the construction of three different integration scenarios involving the use of different tools for the analysis of different types of gene expression data. Conclusions The proposed methodology facilitates the development of connectors capable of semantically integrating different gene expression analysis tools and data sources. The methodology can be used in the development of connectors supporting both simple and nontrivial processing requirements, thus assuring accurate data exchange and information interpretation from exchanged data.
Resumo:
We report new archeointensity data obtained from the analyses of baked clay elements (architectural and kiln brick fragments) sampled in Southeast Brazil and historically and/or archeologically dated between the end of the XVIth century and the beginning of the XXth century AD. The results were determined using the classical Thellier and Thellier protocol as modified by Coe, including partial thermoremanent magnetization (pTRM) and pTRM-tail checks, and the Triaxe protocol, which involves continuous high-temperature magnetization measurements. In both protocols, TRM anisotropy and cooling rate TRM dependence effects were taken into account for intensity determinations which were successfully performed for 150 specimens from 43 fragments, with a good agreement between intensity results obtained from the two procedures. Nine site-mean intensity values were derived from three to eight fragments and defined with standard deviations of less than 8%. The site-mean values vary from similar to 25 mu T to similar to 42 mu T and describe in Southeast Brazil a continuous decreasing trend by similar to 5 mu T per century between similar to 1600 AD and similar to 1900 AD. Their comparison with recent archeointensity results obtained from Northeast Brazil and reduced at a same latitude shows that: (1) the geocentric axial dipole approximation is not valid between these southeastern and northeastern regions of Brazil, whose latitudes differ by similar to 10 degrees, and (2) the available global geomagnetic field models (gufm1 models, their recalibrated versions and the CALSK3 models) are not sufficiently precise to reliably reproduce the non-dipole field effects which prevailed in Brazil for at least the 1600-1750 period. The large non-dipole contribution thus highlighted is most probably linked to the evolution of the South Atlantic Magnetic Anomaly (SAMA) during that period. Furthermore, although our dataset is limited, the Brazilian archeointensity data appear to support the view of a rather oscillatory behavior of the axial dipole moment during the past three centuries that would have been marked in particular by a moderate increase between the end of the XVIIIth century and the middle of the XIXth century followed by the well-known decrease from 1840 AD attested by direct measurements. (C) 2011 Elsevier B.V. All rights reserved.
Resumo:
The Amazonian lowlands include large patches of open vegetation which contrast sharply with the rainforest, and the origin of these patches has been debated. This study focuses on a large area of open vegetation in northern Brazil, where d13C and, in some instances, C/N analyses of the organic matter preserved in late Quaternary sediments were used to achieve floristic reconstructions over time. The main goal was to determine when the modern open vegetation started to develop in this area. The variability in d13C data derived from nine cores ranges from -32.2 to -19.6 parts per thousand, but with nearly 60% of data above -26.5 parts per thousand. The most enriched values were detected only in ecotone and open vegetated areas. The development of open vegetation communities was asynchronous, varying between estimated ages of 6400 and 3000 cal a BP. This suggests that the origin of the studied patches of open vegetation might be linked to sedimentary dynamics of a late Quaternary megafan system. As sedimentation ended, this vegetation type became established over the megafan surface. In addition, the data presented here show that the presence of C4 plants must be used carefully as a proxy to interpret dry paleoclimatic episodes in Amazonian areas. Copyright (c) 2012 John Wiley & Sons, Ltd.
Resumo:
Content-based image retrieval is still a challenging issue due to the inherent complexity of images and choice of the most discriminant descriptors. Recent developments in the field have introduced multidimensional projections to burst accuracy in the retrieval process, but many issues such as introduction of pattern recognition tasks and deeper user intervention to assist the process of choosing the most discriminant features still remain unaddressed. In this paper, we present a novel framework to CBIR that combines pattern recognition tasks, class-specific metrics, and multidimensional projection to devise an effective and interactive image retrieval system. User interaction plays an essential role in the computation of the final multidimensional projection from which image retrieval will be attained. Results have shown that the proposed approach outperforms existing methods, turning out to be a very attractive alternative for managing image data sets.
Resumo:
Background: Infant mortality is an important measure of human development, related to the level of welfare of a society. In order to inform public policy, various studies have tried to identify the factors that influence, at an aggregated level, infant mortality. The objective of this paper is to analyze the regional pattern of infant mortality in Brazil, evaluating the effect of infrastructure, socio-economic, and demographic variables to understand its distribution across the country. Methods: Regressions including socio-economic and living conditions variables are conducted in a structure of panel data. More specifically, a spatial panel data model with fixed effects and a spatial error autocorrelation structure is used to help to solve spatial dependence problems. The use of a spatial modeling approach takes into account the potential presence of spillovers between neighboring spatial units. The spatial units considered are Minimum Comparable Areas, defined to provide a consistent definition across Census years. Data are drawn from the 1980, 1991 and 2000 Census of Brazil, and from data collected by the Ministry of Health (DATASUS). In order to identify the influence of health care infrastructure, variables related to the number of public and private hospitals are included. Results: The results indicate that the panel model with spatial effects provides the best fit to the data. The analysis confirms that the provision of health care infrastructure and social policy measures (e. g. improving education attainment) are linked to reduced rates of infant mortality. An original finding concerns the role of spatial effects in the analysis of IMR. Spillover effects associated with health infrastructure and water and sanitation facilities imply that there are regional benefits beyond the unit of analysis. Conclusions: A spatial modeling approach is important to produce reliable estimates in the analysis of panel IMR data. Substantively, this paper contributes to our understanding of the physical and social factors that influence IMR in the case of a developing country.
Resumo:
Traditional supervised data classification considers only physical features (e. g., distance or similarity) of the input data. Here, this type of learning is called low level classification. On the other hand, the human (animal) brain performs both low and high orders of learning and it has facility in identifying patterns according to the semantic meaning of the input data. Data classification that considers not only physical attributes but also the pattern formation is, here, referred to as high level classification. In this paper, we propose a hybrid classification technique that combines both types of learning. The low level term can be implemented by any classification technique, while the high level term is realized by the extraction of features of the underlying network constructed from the input data. Thus, the former classifies the test instances by their physical features or class topologies, while the latter measures the compliance of the test instances to the pattern formation of the data. Our study shows that the proposed technique not only can realize classification according to the pattern formation, but also is able to improve the performance of traditional classification techniques. Furthermore, as the class configuration's complexity increases, such as the mixture among different classes, a larger portion of the high level term is required to get correct classification. This feature confirms that the high level classification has a special importance in complex situations of classification. Finally, we show how the proposed technique can be employed in a real-world application, where it is capable of identifying variations and distortions of handwritten digit images. As a result, it supplies an improvement in the overall pattern recognition rate.
Resumo:
Dimensionality reduction is employed for visual data analysis as a way to obtaining reduced spaces for high dimensional data or to mapping data directly into 2D or 3D spaces. Although techniques have evolved to improve data segregation on reduced or visual spaces, they have limited capabilities for adjusting the results according to user's knowledge. In this paper, we propose a novel approach to handling both dimensionality reduction and visualization of high dimensional data, taking into account user's input. It employs Partial Least Squares (PLS), a statistical tool to perform retrieval of latent spaces focusing on the discriminability of the data. The method employs a training set for building a highly precise model that can then be applied to a much larger data set very effectively. The reduced data set can be exhibited using various existing visualization techniques. The training data is important to code user's knowledge into the loop. However, this work also devises a strategy for calculating PLS reduced spaces when no training data is available. The approach produces increasingly precise visual mappings as the user feeds back his or her knowledge and is capable of working with small and unbalanced training sets.
Resumo:
Background: Several studies in Drosophila have shown excessive movement of retrogenes from the X chromosome to autosomes, and that these genes are frequently expressed in the testis. This phenomenon has led to several hypotheses invoking natural selection as the process driving male-biased genes to the autosomes. Metta and Schlotterer (BMC Evol Biol 2010, 10:114) analyzed a set of retrogenes where the parental gene has been subsequently lost. They assumed that this class of retrogenes replaced the ancestral functions of the parental gene, and reported that these retrogenes, although mostly originating from movement out of the X chromosome, showed female-biased or unbiased expression. These observations led the authors to suggest that selective forces (such as meiotic sex chromosome inactivation and sexual antagonism) were not responsible for the observed pattern of retrogene movement out of the X chromosome. Results: We reanalyzed the dataset published by Metta and Schlotterer and found several issues that led us to a different conclusion. In particular, Metta and Schlotterer used a dataset combined with expression data in which significant sex-biased expression is not detectable. First, the authors used a segmental dataset where the genes selected for analysis were less testis-biased in expression than those that were excluded from the study. Second, sex-biased expression was defined by comparing male and female whole-body data and not the expression of these genes in gonadal tissues. This approach significantly reduces the probability of detecting sex-biased expressed genes, which explains why the vast majority of the genes analyzed (parental and retrogenes) were equally expressed in both males and females. Third, the female-biased expression observed by Metta and Schltterer is mostly found for parental genes located on the X chromosome, which is known to be enriched with genes with female-biased expression. Fourth, using additional gonad expression data, we found that autosomal genes analyzed by Metta and Schlotterer are less up regulated in ovaries and have higher chance to be expressed in meiotic cells of spermatogenesis when compared to X-linked genes. Conclusions: The criteria used to select retrogenes and the sex-biased expression data based on whole adult flies generated a segmental dataset of female-biased and unbiased expressed genes that was unable to detect the higher propensity of autosomal retrogenes to be expressed in males. Thus, there is no support for the authors' view that the movement of new retrogenes, which originated from X-linked parental genes, was not driven by selection. Therefore, selection-based genetic models remain the most parsimonious explanations for the observed chromosomal distribution of retrogenes.
Resumo:
A common interest in gene expression data analysis is to identify from a large pool of candidate genes the genes that present significant changes in expression levels between a treatment and a control biological condition. Usually, it is done using a statistic value and a cutoff value that are used to separate the genes differentially and nondifferentially expressed. In this paper, we propose a Bayesian approach to identify genes differentially expressed calculating sequentially credibility intervals from predictive densities which are constructed using the sampled mean treatment effect from all genes in study excluding the treatment effect of genes previously identified with statistical evidence for difference. We compare our Bayesian approach with the standard ones based on the use of the t-test and modified t-tests via a simulation study, using small sample sizes which are common in gene expression data analysis. Results obtained report evidence that the proposed approach performs better than standard ones, especially for cases with mean differences and increases in treatment variance in relation to control variance. We also apply the methodologies to a well-known publicly available data set on Escherichia coli bacterium.
Resumo:
Background The use of the knowledge produced by sciences to promote human health is the main goal of translational medicine. To make it feasible we need computational methods to handle the large amount of information that arises from bench to bedside and to deal with its heterogeneity. A computational challenge that must be faced is to promote the integration of clinical, socio-demographic and biological data. In this effort, ontologies play an essential role as a powerful artifact for knowledge representation. Chado is a modular ontology-oriented database model that gained popularity due to its robustness and flexibility as a generic platform to store biological data; however it lacks supporting representation of clinical and socio-demographic information. Results We have implemented an extension of Chado – the Clinical Module - to allow the representation of this kind of information. Our approach consists of a framework for data integration through the use of a common reference ontology. The design of this framework has four levels: data level, to store the data; semantic level, to integrate and standardize the data by the use of ontologies; application level, to manage clinical databases, ontologies and data integration process; and web interface level, to allow interaction between the user and the system. The clinical module was built based on the Entity-Attribute-Value (EAV) model. We also proposed a methodology to migrate data from legacy clinical databases to the integrative framework. A Chado instance was initialized using a relational database management system. The Clinical Module was implemented and the framework was loaded using data from a factual clinical research database. Clinical and demographic data as well as biomaterial data were obtained from patients with tumors of head and neck. We implemented the IPTrans tool that is a complete environment for data migration, which comprises: the construction of a model to describe the legacy clinical data, based on an ontology; the Extraction, Transformation and Load (ETL) process to extract the data from the source clinical database and load it in the Clinical Module of Chado; the development of a web tool and a Bridge Layer to adapt the web tool to Chado, as well as other applications. Conclusions Open-source computational solutions currently available for translational science does not have a model to represent biomolecular information and also are not integrated with the existing bioinformatics tools. On the other hand, existing genomic data models do not represent clinical patient data. A framework was developed to support translational research by integrating biomolecular information coming from different “omics” technologies with patient’s clinical and socio-demographic data. This framework should present some features: flexibility, compression and robustness. The experiments accomplished from a use case demonstrated that the proposed system meets requirements of flexibility and robustness, leading to the desired integration. The Clinical Module can be accessed in http://dcm.ffclrp.usp.br/caib/pg=iptrans webcite.
Resumo:
Abstract Background Once multi-relational approach has emerged as an alternative for analyzing structured data such as relational databases, since they allow applying data mining in multiple tables directly, thus avoiding expensive joining operations and semantic losses, this work proposes an algorithm with multi-relational approach. Methods Aiming to compare traditional approach performance and multi-relational for mining association rules, this paper discusses an empirical study between PatriciaMine - an traditional algorithm - and its corresponding multi-relational proposed, MR-Radix. Results This work showed advantages of the multi-relational approach in performance over several tables, which avoids the high cost for joining operations from multiple tables and semantic losses. The performance provided by the algorithm MR-Radix shows faster than PatriciaMine, despite handling complex multi-relational patterns. The utilized memory indicates a more conservative growth curve for MR-Radix than PatriciaMine, which shows the increase in demand of frequent items in MR-Radix does not result in a significant growth of utilized memory like in PatriciaMine. Conclusion The comparative study between PatriciaMine and MR-Radix confirmed efficacy of the multi-relational approach in data mining process both in terms of execution time and in relation to memory usage. Besides that, the multi-relational proposed algorithm, unlike other algorithms of this approach, is efficient for use in large relational databases.