926 results for Genomic data integration
Abstract:
It is well accepted that tumorigenesis is a multi-step process involving aberrant functioning of genes regulating cell proliferation, differentiation, apoptosis, genome stability, angiogenesis and motility. To obtain a full understanding of tumorigenesis, it is necessary to collect information on all aspects of cell activity. Recent advances in high-throughput technologies allow biologists to generate massive amounts of data, more than might have been imagined decades ago. These advances have made it possible to launch comprehensive projects such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), which systematically characterize the molecular fingerprints of cancer cells using gene expression, methylation, copy number, microRNA and SNP microarrays as well as next-generation sequencing assays interrogating somatic mutations, insertions, deletions, translocations and structural rearrangements. Given the massive amount of data, a major challenge is to integrate information from multiple sources and formulate testable hypotheses. This thesis focuses on developing methodologies for integrative analyses of genomic assays profiled on the same set of samples. We have developed several novel methods for integrative biomarker identification and cancer classification. We introduce a regression-based approach that identifies biomarkers predictive of therapy response or survival by integrating multiple assays, including gene expression, methylation and copy number data, through penalized regression. To identify key cancer-specific genes while accounting for multiple mechanisms of regulation, we have developed the integIRTy software, which provides robust and reliable inferences about gene alteration by automatically adjusting for sample heterogeneity as well as technical artifacts using Item Response Theory.
To cope with the increasing need for accurate cancer diagnosis and individualized therapy, we have developed a robust and powerful algorithm called SIBER to systematically identify bimodally expressed genes using next-generation RNAseq data. We have shown that prediction models built from these bimodal genes have the same accuracy as models built from all genes. Further, prediction models that use gene expression measurements dichotomized according to their bimodal shapes still perform well. The effectiveness of outcome prediction using discretized signals paves the way for more accurate and interpretable cancer classification by integrating signals from multiple sources.
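The dichotomization idea can be illustrated with a minimal sketch. This is not SIBER itself: it assumes a plain one-dimensional two-cluster split on log-scale expression values and scores the separation with a bimodality index of the form sqrt(pi(1 - pi)) * |mu1 - mu2| / sigma, where sigma is a pooled standard deviation. The function names and the toy expression values are hypothetical.

```python
import math
import statistics


def split_two_clusters(values, max_iter=100):
    """1-D two-means clustering: return the threshold between low and high groups."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("need at least two distinct values")
    threshold = (lo + hi) / 2
    for _ in range(max_iter):
        low = [v for v in values if v <= threshold]
        high = [v for v in values if v > threshold]
        # Move the threshold to the midpoint of the two group means.
        new_threshold = (statistics.fmean(low) + statistics.fmean(high)) / 2
        if new_threshold == threshold:
            break
        threshold = new_threshold
    return threshold


def bimodality_index(values, threshold):
    """Separation of the two groups in units of the pooled standard deviation."""
    low = [v for v in values if v <= threshold]
    high = [v for v in values if v > threshold]
    pi = len(low) / len(values)
    mu_low, mu_high = statistics.fmean(low), statistics.fmean(high)
    # Pooled within-group variance (assumes both modes share one variance).
    ss = sum((v - mu_low) ** 2 for v in low) + sum((v - mu_high) ** 2 for v in high)
    sigma = math.sqrt(ss / (len(values) - 2))
    return math.sqrt(pi * (1 - pi)) * (mu_high - mu_low) / sigma


# Hypothetical log-expression values for one gene across eight samples.
expression = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
cutoff = split_two_clusters(expression)
labels = [int(v > cutoff) for v in expression]  # 0 = low mode, 1 = high mode
score = bimodality_index(expression, cutoff)
```

The binary `labels` are the kind of discretized signal that could be fed to a classifier in place of the continuous measurements; a real analysis would fit a proper mixture model suited to count data.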
Abstract:
INFOBIOMED is a European Network of Excellence (NoE) funded by the Information Society Directorate-General of the European Commission (EC). A consortium of European organizations from ten different countries participates in the network. Four pilots, all related to linking clinical and genomic information, are being carried out. From an informatics perspective, the work involves various challenges related to data integration and mining.
Abstract:
In recent years there has been a large increase in biomedical data sources. The emergence of new techniques for extracting genomic data, and of databases containing this information, has created the need to store it so that it can be accessed and worked with. The information produced by biomedical research is stored in databases because they allow data to be stored and handled simply and quickly. These databases come in a wide variety of formats, such as Excel, CSV or RDF, among others. Currently, this research is based on data analysis, which seeks correlations that allow inferring, for example, new treatments or more effective therapies for a specific disease or ailment. The volume of data handled is very large and disparate, which makes it necessary to develop automatic methods for integrating and homogenizing heterogeneous data. The European project p-medicine (FP7-ICT-2009-270089) aims to assist medical researchers, in this case in cancer-related research, by providing them with new tools for managing data and generating new knowledge from the analysis of the managed data. The ingestion of data into the p-medicine platform, and its processing with the methods provided, seek to generate new models for clinical decision making. Within this project there are various tools for heterogeneous data integration, the design and management of clinical trials, the simulation and visualization of tumors, and statistical data analysis.
In the area of heterogeneous data integration in particular, there is a need to add external information from public databases to the system and to relate it to the existing information through semantic integration techniques. To meet this need, a tool called Term Searcher has been created that carries out this process semi-automatically. This work describes the development of the tool and the algorithms created for its operation. The tool offers functionality that did not previously exist within the p-medicine project for adding new data from public sources and semantically integrating them with private data.
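A semi-automatic matching step of this kind might look roughly like the sketch below. The abstract does not describe the tool's actual algorithms, so this is only an assumed stand-in: it uses approximate string matching from Python's standard `difflib` module to propose candidate local terms for an external term, leaving the final decision to a human curator. The function name and the vocabulary are hypothetical.

```python
import difflib


def suggest_matches(external_term, local_terms, n=3, cutoff=0.6):
    """Return candidate local terms for a curator to confirm or reject."""
    # Compare case-insensitively but return the terms in their original spelling.
    by_lower = {term.lower(): term for term in local_terms}
    hits = difflib.get_close_matches(
        external_term.lower(), list(by_lower), n=n, cutoff=cutoff
    )
    return [by_lower[h] for h in hits]


# Hypothetical local vocabulary and an external term with a spelling variant.
local_vocabulary = ["Tumor", "Gene", "Patient"]
candidates = suggest_matches("tumour", local_vocabulary)
```

String similarity only catches spelling variants; genuine semantic integration would also need synonym and ontology lookups, which are outside this sketch.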
Abstract:
Novel, low-abundance microbial species can be easily overlooked in standard polymerase chain reaction (PCR)-based surveys. We used community genomic data obtained without PCR or cultivation to reconstruct DNA fragments bearing unusual 16S ribosomal RNA (rRNA) and protein-coding genes from organisms belonging to novel archaeal lineages. The organisms are minor components of all biofilms growing in pH 0.5 to 1.5 solutions within the Richmond Mine, California. Probes specific for 16S rRNA showed that the fraction less than 0.45 micrometers in diameter is dominated by these organisms. Transmission electron microscope images revealed that the cells are pleomorphic with unusual folded membrane protrusions and have apparent volumes of less than 0.006 cubic micrometers.
Abstract:
This paper highlights the challenges of integrating satellite monitoring systems, in particular on a Grid platform, and reviews possible solutions to these problems. We describe integration issues at two levels: the data integration level and the task management level (job submission, in Grid terms). We show an example of applying the described technologies to integrate the monitoring systems of Ukraine (National Space Agency of Ukraine, NASU) and Russia (Space Research Institute RAS, IKI RAN). Another example concerns the development of an InterGrid infrastructure that integrates several regional and national Grid systems: the Ukrainian Academician Grid (with its satellite data processing Grid segment) and the RSGS Grid (Chinese Academy of Sciences).
Abstract:
An Automatic Vehicle Location (AVL) system is a computer-based vehicle tracking system that is capable of determining a vehicle's location in real time. As a major technology of the Advanced Public Transportation System (APTS), AVL systems have been widely deployed by transit agencies for purposes such as real-time operation monitoring, computer-aided dispatching, and arrival time prediction. AVL systems make available a large amount of transit performance data that are valuable for transit performance management and planning. However, the difficulty of extracting useful information from the huge spatial-temporal database has hindered off-line applications of AVL data. In this study, a data mining process that includes data integration, cluster analysis, and multiple regression is proposed. The AVL-generated data are first integrated into a Geographic Information System (GIS) platform. A model-based cluster method is employed to investigate the spatial and temporal patterns of transit travel speeds, which may be easily translated into travel time. The transit speed variations along the route segments are identified. Transit service periods such as morning peak, mid-day, afternoon peak, and evening periods are determined based on analyses of transit travel speed variations for different times of day. The seasonal patterns of transit performance are investigated using analysis of variance (ANOVA). Travel speed models based on the clustered time-of-day intervals are developed using factors identified as having significant effects on speed for different time-of-day periods. Transit performance was found to vary across seasons and time-of-day periods. The geographic location of a transit route segment also plays a role in the variation of transit performance.
The results of this research indicate that advanced data mining techniques have good potential to provide automated support for transit agencies in service planning, scheduling, and operations control.
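The seasonal comparison mentioned above relies on analysis of variance. As a minimal sketch, assuming travel speeds have already been grouped by season, the one-way ANOVA F statistic can be computed as follows. The speed values are invented for illustration and are not taken from the study.

```python
import statistics


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of group means around the grand mean.
    ss_between = sum(
        len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups
    )
    # Variation of observations around their own group mean.
    ss_within = sum(
        sum((x - statistics.fmean(g)) ** 2 for x in g) for g in groups
    )
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical mean segment speeds (km/h) observed in three seasons.
speeds_by_season = {
    "winter": [18.0, 19.0, 20.0],
    "spring": [22.0, 23.0, 24.0],
    "summer": [26.0, 27.0, 28.0],
}
f_stat, df_b, df_w = one_way_anova_f(list(speeds_by_season.values()))
```

A large F relative to the F distribution with (df_between, df_within) degrees of freedom suggests that mean speed differs between seasons; computing the p-value would require a statistics library and is omitted here.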
Abstract:
The mediator software architecture design has been developed to provide data integration and retrieval in distributed, heterogeneous environments. Since the initial conceptualization of this architecture, many new technologies have emerged that can facilitate the implementation of this design. The purpose of this thesis was to show that a mediator framework supporting users of mobile devices could be implemented using common software technologies available today. In addition, the prototype was developed with a view to providing a better understanding of what a mediator is and to expose issues that will have to be addressed in fuller, more robust designs. The prototype developed for this thesis was implemented using various technologies, including Java, XML, and the Simple Object Access Protocol (SOAP), which was used for inter-process communication. In the end, it is expected that more data-intensive software applications will be possible in a world with ever-increasing demands for information.
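The mediator pattern itself can be sketched in a few lines. The thesis prototype used Java, XML and SOAP; the sketch below is an assumed, in-process Python illustration only, with hypothetical class names and toy data. Each wrapper translates one source's native format into a shared record shape, and the mediator fans a query out to all wrappers and merges the answers.

```python
class CsvLikeSource:
    """Hypothetical source that stores rows as 'id,title' strings."""

    def __init__(self, rows):
        self._rows = rows

    def query(self, keyword):
        results = []
        for row in self._rows:
            ident, title = row.split(",", 1)
            if keyword.lower() in title.lower():
                results.append({"id": ident, "title": title})
        return results


class DictSource:
    """Hypothetical source that uses different field names ('key', 'name')."""

    def __init__(self, records):
        self._records = records

    def query(self, keyword):
        return [
            {"id": r["key"], "title": r["name"]}
            for r in self._records
            if keyword.lower() in r["name"].lower()
        ]


class Mediator:
    """Gives callers one query interface over several wrapped sources."""

    def __init__(self, wrappers):
        self._wrappers = wrappers  # mapping: source name -> wrapper

    def query(self, keyword):
        merged = []
        for name, wrapper in self._wrappers.items():
            for record in wrapper.query(keyword):
                merged.append({**record, "source": name})
        return merged


mediator = Mediator({
    "archive": CsvLikeSource(["a1,Grid computing survey", "a2,Papaya ripening"]),
    "catalog": DictSource([{"key": "c7", "name": "Satellite Grid systems"}]),
})
hits = mediator.query("grid")
```

In a distributed deployment the wrapper calls would be remote invocations rather than local method calls, but the caller-facing interface stays the same.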
Abstract:
Data integration systems offer uniform access to a set of autonomous and heterogeneous data sources. One of the main challenges in data integration is reconciling semantic differences among data sources. Approaches that have been used to solve this problem can be categorized as schema-based and attribute-based. Schema-based approaches use schema information to identify the semantic similarity in data; furthermore, they focus on reconciling types before reconciling attributes. In contrast, attribute-based approaches use statistical and structural information about attributes to identify the semantic similarity of data in different sources. This research examines an approach to semantic reconciliation based on integrating properties expressed at different levels of abstraction or granularity using the concept of property precedence. Property precedence reconciles the meaning of attributes by identifying similarities between attributes based on what these attributes represent in the real world. In order to use property precedence for semantic integration, we need to identify the precedence of attributes within and across data sources. The goal of this research is to develop and evaluate a method and algorithms that identify precedence relations among attributes and build a property precedence graph (PPG) that can be used to support integration.
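To make the idea of a property precedence graph more concrete, here is a small sketch. It is not the method developed in the research: it assumes that a more general property precedes the more specific attributes that imply it, stores those relations as a directed graph, and treats any property that precedes two attributes from different sources as a candidate basis for integrating them. The class, its methods and the attribute names are hypothetical.

```python
class PropertyPrecedenceGraph:
    """Directed graph of 'general property precedes specific property' relations."""

    def __init__(self):
        self._preceded_by = {}  # specific property -> set of general properties

    def add_precedence(self, general, specific):
        self._preceded_by.setdefault(specific, set()).add(general)

    def preceding(self, prop):
        """All properties that directly or transitively precede prop."""
        seen = set()
        stack = [prop]
        while stack:
            current = stack.pop()
            for general in self._preceded_by.get(current, ()):
                if general not in seen:
                    seen.add(general)
                    stack.append(general)
        return seen

    def common_preceding(self, a, b):
        """Properties shared by a and b (including a or b themselves)."""
        return (self.preceding(a) | {a}) & (self.preceding(b) | {b})


ppg = PropertyPrecedenceGraph()
# Hypothetical attributes from two sources, linked to assumed general properties.
ppg.add_precedence("earns_income", "sourceA.monthly_salary")
ppg.add_precedence("earns_income", "sourceB.hourly_wage")
ppg.add_precedence("has_contact", "sourceA.email")

shared = ppg.common_preceding("sourceA.monthly_salary", "sourceB.hourly_wage")
unrelated = ppg.common_preceding("sourceA.email", "sourceB.hourly_wage")
```

Here the two pay-related attributes share `earns_income`, so they could be reconciled at that level of abstraction, while the e-mail attribute shares nothing with the wage attribute.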
Abstract:
The authors would like to thank the College of Life Sciences of Aberdeen University and Marine Scotland Science which funded CP's PhD project. Skate tagging experiments were undertaken as part of Scottish Government project SP004. We thank Ian Burrett for help in catching the fish and the other fishermen and anglers who returned tags. We thank José Manuel Gonzalez-Irusta for extracting and making available the environmental layers used as environmental covariates in the environmental suitability modelling procedure. We also thank Jason Matthiopoulos for insightful suggestions on habitat utilization metrics as well as Stephen C.F. Palmer, and three anonymous reviewers for useful suggestions to improve the clarity and quality of the manuscript.
Abstract:
Here, we describe gene expression compositional assignment (GECA), a powerful, yet simple method based on compositional statistics that can validate the transfer of prior knowledge, such as gene lists, into independent data sets, platforms and technologies. Transcriptional profiling has been used to derive gene lists that stratify patients into prognostic molecular subgroups and assess biomarker performance in the pre-clinical setting. Archived public data sets are an invaluable resource for subsequent in silico validation, though their use can lead to data integration issues. We show that GECA can be used without the need for normalising expression levels between data sets and can outperform rank-based correlation methods. To validate GECA, we demonstrate its success in the cross-platform transfer of gene lists in different domains including: bladder cancer staging, tumour site of origin and mislabelled cell lines. We also show its effectiveness in transferring an epithelial ovarian cancer prognostic gene signature across technologies, from a microarray to a next-generation sequencing setting. In a final case study, we predict the tumour site of origin and histopathology of epithelial ovarian cancer cell lines. In particular, we identify and validate the commonly-used cell line OVCAR-5 as non-ovarian, being gastrointestinal in origin. GECA is available as an open-source R package.
Abstract:
Abstract: Decision support systems have been widely used for years in companies to gain insights from internal data and thus make successful decisions. Lately, thanks to the increasing availability of open data, these systems are also integrating open data to enrich the decision-making process with external data. On the other hand, within an open-data scenario, decision support systems can also be useful for deciding which data should be opened, considering not only technical or legal constraints but also other requirements, such as the "reuse potential" of the data. In this talk, we focus on both issues: (i) open data for decision making, and (ii) decision making for opening data. We will first briefly comment on some research problems regarding the use of open data for decision making. Then, we will outline a novel decision-making approach (based on how open data is actually being used in open-source projects hosted on GitHub) for supporting open data publication. Bio of the speaker: Jose-Norberto Mazón holds a PhD from the University of Alicante (Spain). He is head of the "Cátedra Telefónica" on Big Data and coordinator of the Computing degree at the University of Alicante. He is also a member of the WaKe research group at the University of Alicante. His research work focuses on open data management, data integration and business intelligence within "big data" scenarios, and their application to the tourism domain (smart tourism destinations). He has published his research in international journals such as Decision Support Systems, Information Sciences, Data & Knowledge Engineering and ACM Transactions on the Web. Finally, he is involved in the open data project at the University of Alicante, including its open data portal at http://datos.ua.es
Abstract:
Cancer and cardiovascular diseases are the leading causes of death worldwide. Caused by systemic genetic and molecular disruptions in cells, these disorders are the manifestation of a profound disturbance of normal cellular homeostasis. People suffering from or at high risk of these disorders need early diagnosis and personalized therapeutic intervention. Successful implementation of such clinical measures can significantly improve global health. However, the development of effective therapies is hindered by the challenges in identifying genetic and molecular determinants of the onset of diseases; and in cases where therapies already exist, the main challenge is to identify molecular determinants that drive resistance to the therapies. Due to the progress in sequencing technologies, access to large genome-wide biological data sets now extends far beyond a few experimental labs to the global research community. The unprecedented availability of the data has revolutionized the capabilities of computational researchers, enabling them to collaboratively address long-standing problems from many different perspectives. Likewise, this thesis tackles these two main public health challenges using data-driven approaches. Numerous association studies have been proposed to identify genomic variants that determine disease. However, their clinical utility remains limited due to their inability to distinguish causal variants from associated variants. In this thesis, we first propose a simple scheme that improves association studies in a supervised fashion and show its applicability in identifying genomic regulatory variants associated with hypertension. Next, we propose a coupled Bayesian regression approach -- eQTeL, which leverages epigenetic data to estimate regulatory and gene interaction potential, and identifies combinations of regulatory genomic variants that explain the gene expression variance.
On human heart data, eQTeL not only explains a significantly greater proportion of expression variance in samples, but also predicts gene expression more accurately than other methods. We demonstrate by simulation that eQTeL accurately detects causal regulatory SNPs, particularly those with small effect sizes. Using various functional data, we show that SNPs detected by eQTeL are enriched for allele-specific protein binding and histone modifications, which potentially disrupt binding of core cardiac transcription factors and are spatially proximal to their target. eQTeL SNPs capture a substantial proportion of genetic determinants of expression variance and we estimate that 58% of these SNPs are putatively causal. The challenge of identifying molecular determinants of cancer resistance could so far be addressed only through labor-intensive and costly experimental studies, and in the case of experimental drugs such studies are infeasible. Here we take a fundamentally different, data-driven approach to understand the evolving landscape of emerging resistance. We introduce a novel class of genetic interactions termed synthetic rescues (SR) in cancer, which denotes a functional interaction between two genes where a change in the activity of one vulnerable gene (which may be a target of a cancer drug) is lethal, but subsequently altered activity of its partner rescuer gene restores cell viability. Next we describe a comprehensive computational framework --termed INCISOR-- for identifying SR underlying cancer resistance. Applying INCISOR to mine The Cancer Genome Atlas (TCGA), a large collection of cancer patient data, we identified the first pan-cancer SR networks, composed of interactions common to many cancer types. We experimentally test and validate a subset of these interactions involving the master regulator gene mTOR. We find that rescuer genes become increasingly activated as breast cancer progresses, testifying to pervasive ongoing rescue processes.
We show that SRs can be used to successfully predict patients' survival and response to the majority of current cancer drugs and, importantly, to predict the emergence of drug resistance from the initial tumor biopsy. Our analysis suggests a potential new strategy for enhancing the effectiveness of existing cancer therapies by targeting their rescuer genes to counteract resistance. The thesis provides statistical frameworks that can harness ever-increasing high-throughput genomic data to address challenges in determining the molecular underpinnings of hypertension, cardiovascular disease and cancer resistance. We discover novel molecular mechanistic insights that will advance progress in early disease prevention and personalized therapeutics. Our analyses shed light on the fundamental biological understanding of gene regulation and interaction, and open up exciting avenues of translational applications in risk prediction and therapeutics.
Abstract:
Papaya (Carica papaya) is a relevant tropical crop in which physico-chemical changes take place very quickly, as a consequence of the activation of biochemical pathways by de novo synthesis of several proteins. Thus, in order to obtain information on the changes in gene expression in ripening papaya, transcripts from the pulp of unripe and ripe fruit were profiled by differential-display RT-PCR (DDRT-PCR). Seventy transcript-derived fragments (TDFs) isolated from gels were re-amplified by PCR and differential expression of 40 papaya genes was confirmed by reverse northern blotting. Twenty-nine positively cloned TDFs were sequenced, and 17 were putatively identified by homology search. Ten of these genes were downregulated during ripening, and UDP-glucose glucosyltransferase, alpha-2 importin, RNase L inhibitor-like protein, and a syntaxin protein were identified. Among the upregulated genes there was a carboxylesterase, an integral membrane Yip1 family protein, a glycosyl hydrolase family-like protein and an endopolygalacturonase. Considering their relatedness to papaya quality, the fragments of genes potentially implicated in carbohydrate metabolism and pulp softening may be considered of interest for further studies. According to the results, differential display was a feasible approach to investigate differences in gene expression during fruit ripening, and can provide interesting information about fruits for which genomic data are scarce, as is the case for papaya. (c) 2009 Elsevier B.V. All rights reserved.
Abstract:
The genus Schistosoma is composed of blood flukes that infect vertebrates, from which three species are major causative agents of human schistosomiasis, a tropical disease that affects more than 200 million people. Current models of the recent evolution of Schistosoma indicate multiple events of migration and speciation from an Asian ancestral species. Transposable elements are important drivers of genome evolution and have been hypothesised to have an important role in speciation. In this work, we describe a comprehensive inventory of Schistosoma mansoni and Schistosoma japonicum retrotransposons, based on their recently published genomic data. We find a considerable difference in retrotransposon representation between the two species (22% and 13%, respectively). A large part of this difference can be attributed to higher representation of two previously described families of S. mansoni retrotransposons (SR2 and Perere-3/SR3), compared with the representation of their closest relative families in S. japonicum. A more detailed analysis suggests that these two S. mansoni families were the subject of recent bursts of transposition that were not paralleled by their S. japonicum counterparts. We hypothesise that these bursts could be a consequence of the evolutionary pressure resulting from migration of Schistosoma from Asia to Africa and their establishment in this new environment, helping both speciation and adaptation. (C) 2009 Australian Society for Parasitology Inc. Published by Elsevier Ltd. All rights reserved.