62 resultados para Data distribution


Relevância:

30.00% 30.00%

Publicador:

Resumo:

In this article, we propose a new Bayesian flexible cure rate survival model, which generalises the stochastic model of Klebanov et al. [Klebanov LB, Rachev ST and Yakovlev AY. A stochastic-model of radiation carcinogenesis - latent time distributions and their properties. Math Biosci 1993; 113: 51-75], and has much in common with the destructive model formulated by Rodrigues et al. [Rodrigues J, de Castro M, Balakrishnan N and Cancho VG. Destructive weighted Poisson cure rate models. Technical Report, Universidade Federal de Sao Carlos, Sao Carlos-SP. Brazil, 2009 (accepted in Lifetime Data Analysis)]. In our approach, the accumulated number of lesions or altered cells follows a compound weighted Poisson distribution. This model is more flexible than the promotion time cure model in terms of dispersion. Moreover, it possesses an interesting and realistic interpretation of the biological mechanism of the occurrence of the event of interest as it includes a destructive process of tumour cells after an initial treatment or the capacity of an individual exposed to irradiation to repair altered cells that results in cancer induction. In other words, what is recorded is only the damaged portion of the original number of altered cells not eliminated by the treatment or repaired by the repair system of an individual. Markov Chain Monte Carlo (MCMC) methods are then used to develop Bayesian inference for the proposed model. Also, some discussions on the model selection and an illustration with a cutaneous melanoma data set analysed by Rodrigues et al. [Rodrigues J, de Castro M, Balakrishnan N and Cancho VG. Destructive weighted Poisson cure rate models. Technical Report, Universidade Federal de Sao Carlos, Sao Carlos-SP. Brazil, 2009 (accepted in Lifetime Data Analysis)] are presented.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Abstract Background With the development of DNA hybridization microarray technologies, nowadays it is possible to simultaneously assess the expression levels of thousands to tens of thousands of genes. Quantitative comparison of microarrays uncovers distinct patterns of gene expression, which define different cellular phenotypes or cellular responses to drugs. Due to technical biases, normalization of the intensity levels is a pre-requisite to performing further statistical analyses. Therefore, choosing a suitable approach for normalization can be critical, deserving judicious consideration. Results Here, we considered three commonly used normalization approaches, namely: Loess, Splines and Wavelets, and two non-parametric regression methods, which have yet to be used for normalization, namely, the Kernel smoothing and Support Vector Regression. The results obtained were compared using artificial microarray data and benchmark studies. The results indicate that the Support Vector Regression is the most robust to outliers and that Kernel is the worst normalization technique, while no practical differences were observed between Loess, Splines and Wavelets. Conclusion In face of our results, the Support Vector Regression is favored for microarray normalization due to its superiority when compared to the other methods for its robustness in estimating the normalization curve.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Abstract Background The search for enriched (aka over-represented or enhanced) ontology terms in a list of genes obtained from microarray experiments is becoming a standard procedure for a system-level analysis. This procedure tries to summarize the information focussing on classification designs such as Gene Ontology, KEGG pathways, and so on, instead of focussing on individual genes. Although it is well known in statistics that association and significance are distinct concepts, only the former approach has been used to deal with the ontology term enrichment problem. Results BayGO implements a Bayesian approach to search for enriched terms from microarray data. The R source-code is freely available at http://blasto.iq.usp.br/~tkoide/BayGO in three versions: Linux, which can be easily incorporated into pre-existent pipelines; Windows, to be controlled interactively; and as a web-tool. The software was validated using a bacterial heat shock response dataset, since this stress triggers known system-level responses. Conclusion The Bayesian model accounts for the fact that, eventually, not all the genes from a given category are observable in microarray data due to low intensity signal, quality filters, genes that were not spotted and so on. Moreover, BayGO allows one to measure the statistical association between generic ontology terms and differential expression, instead of working only with the common significance analysis.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Abstract Background One goal of gene expression profiling is to identify signature genes that robustly distinguish different types or grades of tumors. Several tumor classifiers based on expression profiling have been proposed using microarray technique. Due to important differences in the probabilistic models of microarray and SAGE technologies, it is important to develop suitable techniques to select specific genes from SAGE measurements. Results A new framework to select specific genes that distinguish different biological states based on the analysis of SAGE data is proposed. The new framework applies the bolstered error for the identification of strong genes that separate the biological states in a feature space defined by the gene expression of a training set. Credibility intervals defined from a probabilistic model of SAGE measurements are used to identify the genes that distinguish the different states with more reliability among all gene groups selected by the strong genes method. A score taking into account the credibility and the bolstered error values in order to rank the groups of considered genes is proposed. Results obtained using SAGE data from gliomas are presented, thus corroborating the introduced methodology. Conclusion The model representing counting data, such as SAGE, provides additional statistical information that allows a more robust analysis. The additional statistical information provided by the probabilistic model is incorporated in the methodology described in the paper. The introduced method is suitable to identify signature genes that lead to a good separation of the biological states using SAGE and may be adapted for other counting methods such as Massive Parallel Signature Sequencing (MPSS) or the recent Sequencing-By-Synthesis (SBS) technique. Some of such genes identified by the proposed method may be useful to generate classifiers.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Abstract Background Transcript enumeration methods such as SAGE, MPSS, and sequencing-by-synthesis EST "digital northern", are important high-throughput techniques for digital gene expression measurement. As other counting or voting processes, these measurements constitute compositional data exhibiting properties particular to the simplex space where the summation of the components is constrained. These properties are not present on regular Euclidean spaces, on which hybridization-based microarray data is often modeled. Therefore, pattern recognition methods commonly used for microarray data analysis may be non-informative for the data generated by transcript enumeration techniques since they ignore certain fundamental properties of this space. Results Here we present a software tool, Simcluster, designed to perform clustering analysis for data on the simplex space. We present Simcluster as a stand-alone command-line C package and as a user-friendly on-line tool. Both versions are available at: http://xerad.systemsbiology.net/simcluster. Conclusion Simcluster is designed in accordance with a well-established mathematical framework for compositional data analysis, which provides principled procedures for dealing with the simplex space, and is thus applicable in a number of contexts, including enumeration-based gene expression data.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Abstract Background Several mathematical and statistical methods have been proposed in the last few years to analyze microarray data. Most of those methods involve complicated formulas, and software implementations that require advanced computer programming skills. Researchers from other areas may experience difficulties when they attempting to use those methods in their research. Here we present an user-friendly toolbox which allows large-scale gene expression analysis to be carried out by biomedical researchers with limited programming skills. Results Here, we introduce an user-friendly toolbox called GEDI (Gene Expression Data Interpreter), an extensible, open-source, and freely-available tool that we believe will be useful to a wide range of laboratories, and to researchers with no background in Mathematics and Computer Science, allowing them to analyze their own data by applying both classical and advanced approaches developed and recently published by Fujita et al. Conclusion GEDI is an integrated user-friendly viewer that combines the state of the art SVR, DVAR and SVAR algorithms, previously developed by us. It facilitates the application of SVR, DVAR and SVAR, further than the mathematical formulas present in the corresponding publications, and allows one to better understand the results by means of available visualizations. Both running the statistical methods and visualizing the results are carried out within the graphical user interface, rendering these algorithms accessible to the broad community of researchers in Molecular Biology.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Abstract Background Invasive cervical cancer is the second most common malignant tumor affecting Brazilian women. Knowledge on Human Papillomavirus (HPV) genotypes in invasive cervical cancer cases is crucial to guide the introduction and further evaluate the impact of new preventive strategies based on HPV. We aimed to provide updated comprehensive data about the HPV types’ distribution in patients with invasive cervical cancer. Methods Fresh tumor tissue samples of histologically confirmed invasive cervical cancer were collected from 175 women attending two cancer reference hospitals from São Paulo State: ICESP and Hospital de Câncer de Barretos. HPV detection and genotyping were performed by the Linear Array HPV Genotyping Test (Roche Molecular Diagnostics, Pleasanton,USA). Results 170 out of 172 valid samples (99%) were HPV DNA positive. The most frequent types were HPV16 (77.6%), HPV18 (12.3%), HPV31 (8.8%), HPV33 (7.1%) and HPV35 (5.9%). Most infections (75%) were caused by individual HPV types. Women with adenocarcinoma were not younger than those with squamous cell carcinoma, as well, as women infected with HPV33 were older than those infected by other HPV types. Some differences between results obtained in the two hospitals were observed: higher overall prevalence of HPV16, absence of single infection by HPV31 and HPV45 was verified in HC-Barretos in comparison to ICESP patients. Conclusions To our knowledge, this is one of the largest studies made with fresh tumor tissues of invasive cervical cancer cases in Brazil. This study depicted a distinct HPV genotype distribution between two centers that may reflect the local epidemiology of HPV transmission among these populations. Due to the impact of these findings on cervical cancer preventive strategies, extension of this investigation to routine screening populations is warranted.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This study aimed to demonstrate that microspheres, used as delivery vehicle of DNA-Hsp65/TDM [plasmid DNA encoding heat shock protein 65 (Hsp65) coencapsulated with trehalose dimycolate (TDM) into PLGA microspheres], are widely spread among several organs after intramuscular administration in BALB/c mice. In general, we showed that these particles were phagocytosed by antigen presenting cells, such as macrophages and dendritic cells. Besides, it was demonstrated herein that draining lymph node cells presented a significant increase in the number of cells expressing costimulatory molecules (CD80 and CD86) and MHC class II, and also that the administration of the DNA-Hsp65/TDM and vector/TDM formulations resulted in the up-regulation of CD80, CD86 and MHC class II expression when compared to control formulations (vector/TDM and empty). Regarding the intracellular trafficking we observed that following phagocytosis, the microspheres were not found in the late endosomes and/or lysosomes, until 15 days after internalization, and we suggest that these constructions were hydrolysed in early compartments. Overall, these data expand our knowledge on PLGA [poly (lactic-co- glycolic acid)] microspheres as gene carriers in vaccination strategies, as well as open perspectives for their potential use in clinical practice.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Abstract Background Smallpox is a lethal disease that was endemic in many parts of the world until eradicated by massive immunization. Due to its lethality, there are serious concerns about its use as a bioweapon. Here we analyze publicly available microarray data to further understand survival of smallpox infected macaques, using systems biology approaches. Our goal is to improve the knowledge about the progression of this disease. Results We used KEGG pathways annotations to define groups of genes (or modules), and subsequently compared them to macaque survival times. This technique provided additional insights about the host response to this disease, such as increased expression of the cytokines and ECM receptors in the individuals with higher survival times. These results could indicate that these gene groups could influence an effective response from the host to smallpox. Conclusion Macaques with higher survival times clearly express some specific pathways previously unidentified using regular gene-by-gene approaches. Our work also shows how third party analysis of public datasets can be important to support new hypotheses to relevant biological problems.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Abstract Background A popular model for gene regulatory networks is the Boolean network model. In this paper, we propose an algorithm to perform an analysis of gene regulatory interactions using the Boolean network model and time-series data. Actually, the Boolean network is restricted in the sense that only a subset of all possible Boolean functions are considered. We explore some mathematical properties of the restricted Boolean networks in order to avoid the full search approach. The problem is modeled as a Constraint Satisfaction Problem (CSP) and CSP techniques are used to solve it. Results We applied the proposed algorithm in two data sets. First, we used an artificial dataset obtained from a model for the budding yeast cell cycle. The second data set is derived from experiments performed using HeLa cells. The results show that some interactions can be fully or, at least, partially determined under the Boolean model considered. Conclusions The algorithm proposed can be used as a first step for detection of gene/protein interactions. It is able to infer gene relationships from time-series data of gene expression, and this inference process can be aided by a priori knowledge available.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Abstract Background American cutaneous leishmaniasis (ACL) is a re-emerging disease in the state of São Paulo, Brazil. It is important to understand both the vector and disease distribution to help design control strategies. As an initial step in applying geographic information systems (GIS) and remote sensing (RS) tools to map disease-risk, the objectives of the present work were to: (i) produce a single database of species distributions of the sand fly vectors in the state of São Paulo, (ii) create combined distributional maps of both the incidence of ACL and its sand fly vectors, and (iii) thereby provide individual municipalities with a source of reference material for work carried out in their area. Results A database containing 910 individual records of sand fly occurrence in the state of São Paulo, from 37 different sources, was compiled. These records date from between 1943 to 2009, and describe the presence of at least one of the six incriminated or suspected sand fly vector species in 183/645 (28.4%) municipalities. For the remaining 462 (71.6%) municipalities, we were unable to locate records of any of the six incriminated or suspected sand fly vector species (Nyssomyia intermedia, N. neivai, N. whitmani, Pintomyia fischeri, P. pessoai and Migonemyia migonei). The distribution of each of the six incriminated or suspected vector species of ACL in the state of São Paulo were individually mapped and overlaid on the incidence of ACL for the period 1993 to 1995 and 1998 to 2007. Overall, the maps reveal that the six sand fly vector species analyzed have unique and heterogeneous, although often overlapping, distributions. Several sand fly species - Nyssomyia intermedia and N. neivai - are highly localized, while the other sand fly species - N. whitmani, M. migonei, P. fischeri and P. pessoai - are much more broadly distributed. ACL has been reported in 160/183 (87.4%) of the municipalities with records for at least one of the six incriminated or suspected sand fly vector species, while there are no records of any of these sand fly species in 318/478 (66.5%) municipalities with ACL. Conclusions The maps produced in this work provide basic data on the distribution of the six incriminated or suspected sand fly vectors of ACL in the state of São Paulo, and highlight the complex and geographically heterogeneous pattern of ACL transmission in the region. Further studies are required to clarify the role of each of the six suspected sand fly vector species in different regions of the state of São Paulo, especially in the majority of municipalities where ACL is present but sand fly vectors have not yet been identified.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Background The use of the knowledge produced by sciences to promote human health is the main goal of translational medicine. To make it feasible we need computational methods to handle the large amount of information that arises from bench to bedside and to deal with its heterogeneity. A computational challenge that must be faced is to promote the integration of clinical, socio-demographic and biological data. In this effort, ontologies play an essential role as a powerful artifact for knowledge representation. Chado is a modular ontology-oriented database model that gained popularity due to its robustness and flexibility as a generic platform to store biological data; however it lacks supporting representation of clinical and socio-demographic information. Results We have implemented an extension of Chado – the Clinical Module - to allow the representation of this kind of information. Our approach consists of a framework for data integration through the use of a common reference ontology. The design of this framework has four levels: data level, to store the data; semantic level, to integrate and standardize the data by the use of ontologies; application level, to manage clinical databases, ontologies and data integration process; and web interface level, to allow interaction between the user and the system. The clinical module was built based on the Entity-Attribute-Value (EAV) model. We also proposed a methodology to migrate data from legacy clinical databases to the integrative framework. A Chado instance was initialized using a relational database management system. The Clinical Module was implemented and the framework was loaded using data from a factual clinical research database. Clinical and demographic data as well as biomaterial data were obtained from patients with tumors of head and neck. We implemented the IPTrans tool that is a complete environment for data migration, which comprises: the construction of a model to describe the legacy clinical data, based on an ontology; the Extraction, Transformation and Load (ETL) process to extract the data from the source clinical database and load it in the Clinical Module of Chado; the development of a web tool and a Bridge Layer to adapt the web tool to Chado, as well as other applications. Conclusions Open-source computational solutions currently available for translational science does not have a model to represent biomolecular information and also are not integrated with the existing bioinformatics tools. On the other hand, existing genomic data models do not represent clinical patient data. A framework was developed to support translational research by integrating biomolecular information coming from different “omics” technologies with patient’s clinical and socio-demographic data. This framework should present some features: flexibility, compression and robustness. The experiments accomplished from a use case demonstrated that the proposed system meets requirements of flexibility and robustness, leading to the desired integration. The Clinical Module can be accessed in http://dcm.ffclrp.usp.br/caib/pg=iptrans webcite.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Background: A common approach for time series gene expression data analysis includes the clustering of genes with similar expression patterns throughout time. Clustered gene expression profiles point to the joint contribution of groups of genes to a particular cellular process. However, since genes belong to intricate networks, other features, besides comparable expression patterns, should provide additional information for the identification of functionally similar genes. Results: In this study we perform gene clustering through the identification of Granger causality between and within sets of time series gene expression data. Granger causality is based on the idea that the cause of an event cannot come after its consequence. Conclusions: This kind of analysis can be used as a complementary approach for functional clustering, wherein genes would be clustered not solely based on their expression similarity but on their topological proximity built according to the intensity of Granger causality among them.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Abstract Background Once multi-relational approach has emerged as an alternative for analyzing structured data such as relational databases, since they allow applying data mining in multiple tables directly, thus avoiding expensive joining operations and semantic losses, this work proposes an algorithm with multi-relational approach. Methods Aiming to compare traditional approach performance and multi-relational for mining association rules, this paper discusses an empirical study between PatriciaMine - an traditional algorithm - and its corresponding multi-relational proposed, MR-Radix. Results This work showed advantages of the multi-relational approach in performance over several tables, which avoids the high cost for joining operations from multiple tables and semantic losses. The performance provided by the algorithm MR-Radix shows faster than PatriciaMine, despite handling complex multi-relational patterns. The utilized memory indicates a more conservative growth curve for MR-Radix than PatriciaMine, which shows the increase in demand of frequent items in MR-Radix does not result in a significant growth of utilized memory like in PatriciaMine. Conclusion The comparative study between PatriciaMine and MR-Radix confirmed efficacy of the multi-relational approach in data mining process both in terms of execution time and in relation to memory usage. Besides that, the multi-relational proposed algorithm, unlike other algorithms of this approach, is efficient for use in large relational databases.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Abstract Background The study and analysis of gene expression measurements is the primary focus of functional genomics. Once expression data is available, biologists are faced with the task of extracting (new) knowledge associated to the underlying biological phenomenon. Most often, in order to perform this task, biologists execute a number of analysis activities on the available gene expression dataset rather than a single analysis activity. The integration of heteregeneous tools and data sources to create an integrated analysis environment represents a challenging and error-prone task. Semantic integration enables the assignment of unambiguous meanings to data shared among different applications in an integrated environment, allowing the exchange of data in a semantically consistent and meaningful way. This work aims at developing an ontology-based methodology for the semantic integration of gene expression analysis tools and data sources. The proposed methodology relies on software connectors to support not only the access to heterogeneous data sources but also the definition of transformation rules on exchanged data. Results We have studied the different challenges involved in the integration of computer systems and the role software connectors play in this task. We have also studied a number of gene expression technologies, analysis tools and related ontologies in order to devise basic integration scenarios and propose a reference ontology for the gene expression domain. Then, we have defined a number of activities and associated guidelines to prescribe how the development of connectors should be carried out. Finally, we have applied the proposed methodology in the construction of three different integration scenarios involving the use of different tools for the analysis of different types of gene expression data. Conclusions The proposed methodology facilitates the development of connectors capable of semantically integrating different gene expression analysis tools and data sources. The methodology can be used in the development of connectors supporting both simple and nontrivial processing requirements, thus assuring accurate data exchange and information interpretation from exchanged data.