948 results for data integration
Abstract:
Developing and implementing data-oriented workflows for data migration processes is a complex task, involving several problems related to the integration of data coming from different schemas. These workflows usually involve very specific requirements - every process is almost unique. Having a way to abstract their representation helps us to better understand and validate them with business users, which is a crucial step in requirements validation. In this demo we present an approach that incrementally enriches conceptual models in order to support the automatic generation of their corresponding physical implementation. We show how the B2K (Business to Kettle) system transforms BPMN 2.0 conceptual models into executable Kettle data-integration processes, covering the most relevant aspects of model design and enrichment, model-to-system transformation, and system execution.
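As a rough illustration of the model-to-system transformation described above (not the actual B2K implementation), the following Python sketch parses BPMN 2.0 task elements and emits a skeleton Kettle-style transformation document with one placeholder step per task; the embedded model and the emitted element names are assumptions made for the example.

```python
# Minimal sketch of a BPMN-to-Kettle style mapping (not the actual B2K implementation).
# It reads BPMN 2.0 task elements and emits a skeleton Kettle transformation with one
# placeholder step per task; the emitted element names are illustrative assumptions.
import xml.etree.ElementTree as ET

BPMN_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"

bpmn_source = """<?xml version="1.0"?>
<bpmn:definitions xmlns:bpmn="http://www.omg.org/spec/BPMN/20100524/MODEL">
  <bpmn:process id="migration">
    <bpmn:task id="t1" name="Read customers"/>
    <bpmn:task id="t2" name="Clean addresses"/>
    <bpmn:task id="t3" name="Load warehouse"/>
  </bpmn:process>
</bpmn:definitions>"""

def bpmn_tasks(xml_text):
    """Return (id, name) pairs for every BPMN task in the model."""
    root = ET.fromstring(xml_text)
    return [(t.get("id"), t.get("name")) for t in root.iter("{%s}task" % BPMN_NS)]

def to_kettle_skeleton(tasks):
    """Build a bare-bones Kettle-style transformation document with one dummy step per task."""
    transformation = ET.Element("transformation")
    for task_id, task_name in tasks:
        step = ET.SubElement(transformation, "step")
        ET.SubElement(step, "name").text = task_name or task_id
        ET.SubElement(step, "type").text = "Dummy"  # placeholder; a real mapping picks a concrete step type
    return ET.tostring(transformation, encoding="unicode")

print(to_kettle_skeleton(bpmn_tasks(bpmn_source)))
```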
Abstract:
Under the framework of constraint-based modeling, genome-scale metabolic models (GSMMs) have been used for several tasks, such as metabolic engineering and phenotype prediction. More recently, their application in health-related research has spanned drug discovery, biomarker identification and host-pathogen interactions, targeting diseases such as cancer, Alzheimer's disease, obesity or diabetes. In recent years, the development of novel techniques for genome sequencing and other high-throughput methods, together with advances in bioinformatics, has allowed the reconstruction of GSMMs for human cells. Considering the diversity of cell types and tissues present in the human body, it is imperative to develop tissue-specific metabolic models. Methods to automatically generate these models, based on generic human metabolic models and a plethora of omics data, have been proposed. However, their results have not yet been adequately and critically evaluated and compared. This work presents a survey of the most important tissue- or cell-type-specific metabolic model reconstruction methods, which use literature, transcriptomics, proteomics and metabolomics data, together with a global template model. As a case study, we analyzed the consistency between several omics data sources and reconstructed distinct metabolic models of hepatocytes using different methods and data sources as inputs. The results show that the omics data sources overlap poorly and, in some cases, are even contradictory. Additionally, the hepatocyte metabolic models generated are in many cases unable to perform metabolic functions known to be present in liver tissue. We conclude that reliable methods for a priori omics data integration are required to support the reconstruction of complex models of human cells.
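The consistency analysis mentioned above can be illustrated with a minimal sketch that measures the agreement between two omics evidence sets using a Jaccard index; the gene identifiers below are invented, and the actual study compared far richer data sources.

```python
# Illustrative check of overlap between omics evidence sets (hypothetical gene identifiers).
# A low Jaccard index flags the kind of poor agreement between data sources reported above.

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

transcriptomics = {"HMGCR", "CYP3A4", "ALB", "G6PC", "PCK1"}   # genes called expressed from mRNA data
proteomics      = {"ALB", "CYP3A4", "FABP1", "APOA1"}          # proteins detected by mass spectrometry

print("overlap:", sorted(transcriptomics & proteomics))
print("Jaccard index: %.2f" % jaccard(transcriptomics, proteomics))
```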
Abstract:
Somatic copy number aberrations (CNA) represent a mutation type encountered in the majority of cancer genomes. Here, we present the 2014 edition of arrayMap (http://www.arraymap.org), a publicly accessible collection of pre-processed oncogenomic array data sets and CNA profiles, representing a vast range of human malignancies. Since the initial release, we have enhanced this resource both in content and especially with regard to data mining support. The 2014 release of arrayMap contains more than 64,000 genomic array data sets, representing about 250 tumor diagnoses. Data sets included in arrayMap have been assembled from public repositories as well as additional resources, and integrated by applying custom processing pipelines. Online tools have been upgraded for a more flexible array data visualization, including options for processing user provided, non-public data sets. Data integration has been improved by mapping to multiple editions of the human reference genome, with the majority of the data now being available for the UCSC hg18 as well as GRCh37 versions. The large amount of tumor CNA data in arrayMap can be freely downloaded by users to promote data mining projects, and to explore special events such as chromothripsis-like genome patterns.
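As a hedged illustration of how chromothripsis-like patterns can be screened for in segmented CNA profiles (not arrayMap's actual pipeline), the sketch below counts copy-number state oscillations along one chromosome; the segment values and the switch threshold are assumptions.

```python
# Sketch of flagging a chromothripsis-like pattern in CNA segment data: count how often the
# copy-number state oscillates along a chromosome. Segments and threshold are illustrative.

def state(log2_ratio, gain=0.2, loss=-0.2):
    """Map a segment log2 ratio to a coarse copy-number state."""
    if log2_ratio >= gain:
        return "gain"
    if log2_ratio <= loss:
        return "loss"
    return "neutral"

def switch_count(segments):
    """Number of state changes between consecutive segments (ordered by genomic position)."""
    states = [state(s) for s in segments]
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)

chr7_segments = [0.4, -0.5, 0.3, -0.6, 0.5, -0.4, 0.4, -0.5, 0.0, 0.4]  # hypothetical log2 ratios
switches = switch_count(chr7_segments)
print("state switches:", switches,
      "-> chromothripsis-like" if switches >= 8 else "-> unremarkable")
```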
Abstract:
In this paper we review the impact that the availability of the Schistosoma mansoni genome sequence and annotation has had on schistosomiasis research. Easy access to the genomic information is important and several types of data are currently being integrated, such as proteomics, microarray and polymorphic loci. Access to the genome annotation and powerful means of extracting information are major resources to the research community.
Abstract:
Genome-scale metabolic network reconstructions are now routinely used in the study of metabolic pathways, their evolution and design. The development of such reconstructions involves the integration of information on reactions and metabolites from the scientific literature as well as public databases and existing genome-scale metabolic models. The reconciliation of discrepancies between data from these sources generally requires significant manual curation, which constitutes a major obstacle in efforts to develop and apply genome-scale metabolic network reconstructions. In this work, we discuss some of the major difficulties encountered in the mapping and reconciliation of metabolic resources and review three recent initiatives that aim to accelerate this process, namely BKM-react, MetRxn and MNXref (presented in this article). Each of these resources provides a pre-compiled reconciliation of many of the most commonly used metabolic resources. By reducing the time required for manual curation of metabolite and reaction discrepancies, these resources aim to accelerate the development and application of high-quality genome-scale metabolic network reconstructions and models.
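A minimal sketch of the reconciliation idea follows, assuming a pre-compiled cross-reference table of the kind these resources provide; all mapping entries and the common-namespace identifiers are hypothetical, and real tables such as MNXref are far larger.

```python
# Sketch of reconciling metabolite identifiers against a pre-compiled cross-reference table,
# in the spirit of resources such as MNXref. All entries and the common identifiers
# (M_0001, M_0002) are hypothetical.

# (database, database-specific identifier) -> common-namespace identifier
XREF = {
    ("kegg", "C00031"): "M_0001",
    ("chebi", "CHEBI:4167"): "M_0001",
    ("bigg", "glc__D"): "M_0001",
    ("kegg", "C00022"): "M_0002",
    ("bigg", "pyr"): "M_0002",
}

def reconcile(entries):
    """Group metabolite entries from different sources by their common identifier."""
    groups = {}
    for source, identifier in entries:
        common = XREF.get((source, identifier))  # None means unmapped -> manual curation needed
        groups.setdefault(common, []).append((source, identifier))
    return groups

model_a = [("kegg", "C00031"), ("kegg", "C00022")]
model_b = [("bigg", "glc__D"), ("bigg", "pyr"), ("bigg", "atp")]
for common, members in reconcile(model_a + model_b).items():
    print(common, members)
```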
Abstract:
This work discusses the ETL (Extract, Transform and Load) process and the tools associated with it. A theoretical overview of the process is presented, distinguishing its main stages (Extraction, Transformation and Loading) and exploring the concept in some depth. It also covers ETL tools, both commercial and open source, with emphasis on Talend Open Studio for Data Integration, since this is the tool used to implement ETL systems at Unitel T+, and a practical case study of those systems is presented.
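A conceptual sketch of the three ETL stages in Python is shown below; Talend Open Studio builds such flows graphically, so this is only an illustration of the process, not Talend code, and the sample data are invented.

```python
# Conceptual Extract-Transform-Load sketch (Talend Open Studio builds such flows graphically;
# this only illustrates the three stages with invented data, it is not Talend code).
import csv, io, sqlite3

raw_csv = "id,name,amount\n1, Alice ,10.5\n2,Bob,not_a_number\n3,Carla,7\n"

def extract(text):
    """Extract: read raw records from the source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim names, coerce types, and drop records that fail validation."""
    clean = []
    for row in rows:
        try:
            clean.append((int(row["id"]), row["name"].strip(), float(row["amount"])))
        except ValueError:
            pass  # a real flow would route rejected rows to an error table
    return clean

def load(rows):
    """Load: write the cleaned records into the target store (here, an in-memory SQLite DB)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE payments (id INTEGER, name TEXT, amount REAL)")
    db.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    return db.execute("SELECT COUNT(*) FROM payments").fetchone()[0]

print("rows loaded:", load(transform(extract(raw_csv))))
```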
Abstract:
Uncertainty quantification of petroleum reservoir models is one of the present challenges, which is usually approached with a wide range of geostatistical tools linked with statistical optimisation and/or inference algorithms. This paper considers a data-driven approach to modelling uncertainty in spatial predictions. The proposed semi-supervised Support Vector Regression (SVR) model has demonstrated its capability to represent realistic features and to describe the stochastic variability and non-uniqueness of spatial properties. It is able to capture and preserve key spatial dependencies such as connectivity, which is often difficult to achieve with two-point geostatistical models. Semi-supervised SVR is designed to integrate various kinds of conditioning data and learn dependencies from them. A stochastic semi-supervised SVR model is integrated into a Bayesian framework to quantify uncertainty with multiple models fitted to dynamic observations. The developed approach is illustrated with a reservoir case study. The resulting probabilistic production forecasts are described by uncertainty envelopes.
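A simplified sketch of the supervised core of this approach follows: an SVR model is fitted to sparse spatial conditioning data and used to predict the property on a grid. The semi-supervised extension and the Bayesian treatment of multiple models are omitted, and all data and hyperparameters are invented.

```python
# Simplified sketch of the supervised core of the approach: fit Support Vector Regression to
# sparse spatial conditioning data and predict the property on a grid. The semi-supervised
# extension and Bayesian model weighting are omitted; data and hyperparameters are invented.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
wells = rng.uniform(0, 1000, size=(30, 2))                          # (x, y) of conditioning points
porosity = 0.18 + 0.05 * np.sin(wells[:, 0] / 150.0) + rng.normal(0, 0.01, 30)

model = SVR(kernel="rbf", C=10.0, epsilon=0.005, gamma="scale")
model.fit(wells, porosity)

gx, gy = np.meshgrid(np.linspace(0, 1000, 50), np.linspace(0, 1000, 50))
grid = np.column_stack([gx.ravel(), gy.ravel()])
prediction = model.predict(grid).reshape(gx.shape)                  # spatial porosity estimate
print("predicted porosity range: %.3f - %.3f" % (prediction.min(), prediction.max()))
```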
Abstract:
Advanced neuroinformatics tools are required for methods of connectome mapping, analysis, and visualization. The inherent multi-modality of connectome datasets poses new challenges for data organization, integration, and sharing. We have designed and implemented the Connectome Viewer Toolkit - a set of free and extensible open source neuroimaging tools written in Python. The key components of the toolkit are as follows: (1) The Connectome File Format is an XML-based container format to standardize multi-modal data integration and structured metadata annotation. (2) The Connectome File Format Library enables management and sharing of connectome files. (3) The Connectome Viewer is an integrated research and development environment for visualization and analysis of multi-modal connectome data. The Connectome Viewer's plugin architecture supports extensions with network analysis packages and an interactive scripting shell, to enable easy development and community contributions. Integration with tools from the scientific Python community allows the leveraging of numerous existing libraries for powerful connectome data mining, exploration, and comparison. We demonstrate the applicability of the Connectome Viewer Toolkit using Diffusion MRI datasets processed by the Connectome Mapper. The Connectome Viewer Toolkit is available from http://www.cmtk.org/
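As a generic illustration of the kind of network analysis such tools enable (using NetworkX directly rather than the Connectome Viewer Toolkit's own API), the sketch below builds a toy structural connectome and reports a few common graph measures; the regions and weights are invented.

```python
# Generic example of connectome-style network analysis with scientific Python libraries.
# This uses NetworkX directly and is NOT the Connectome Viewer Toolkit's API; the toy graph
# below (regions and fiber counts) is invented.
import networkx as nx

edges = [("lh-precentral", "lh-postcentral", 120),
         ("lh-precentral", "rh-precentral", 45),
         ("rh-precentral", "rh-postcentral", 110),
         ("lh-postcentral", "lh-superiorparietal", 60)]

G = nx.Graph()
G.add_weighted_edges_from(edges)

# Simple graph measures commonly reported for connectomes.
print("node degree:", dict(G.degree()))
print("weighted degree:", dict(G.degree(weight="weight")))
print("betweenness:", nx.betweenness_centrality(G))
```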
Abstract:
To identify common variants influencing body mass index (BMI), we analyzed genome-wide association data from 16,876 individuals of European descent. After previously reported variants in FTO, the strongest association signal (rs17782313, P = 2.9 × 10⁻⁶) mapped 188 kb downstream of MC4R (melanocortin-4 receptor), mutations of which are the leading cause of monogenic severe childhood-onset obesity. We confirmed the BMI association in 60,352 adults (per-allele effect = 0.05 Z-score units; P = 2.8 × 10⁻¹⁵) and 5,988 children aged 7-11 (0.13 Z-score units; P = 1.5 × 10⁻⁸). In case-control analyses (n = 10,583), the odds for severe childhood obesity reached 1.30 (P = 8.0 × 10⁻¹¹). Furthermore, we observed overtransmission of the risk allele to obese offspring in 660 families (P (pedigree disequilibrium test average; PDT-avg) = 2.4 × 10⁻⁴). The SNP location and patterns of phenotypic associations are consistent with effects mediated through altered MC4R function. Our findings establish that common variants near MC4R influence fat mass, weight and obesity risk at the population level and reinforce the need for large-scale data integration to identify variants influencing continuous biomedical traits.
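A sketch of the per-allele (additive) association test underlying effect sizes of this kind is shown below: a BMI Z-score is regressed on risk-allele count (0, 1 or 2). The data are simulated and this is not the consortium's analysis pipeline.

```python
# Sketch of a per-allele (additive model) association test: regress a BMI Z-score on
# risk-allele count. Data are simulated; allele frequency and effect size are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 5000
risk_allele_count = rng.binomial(2, 0.24, size=n)                    # assumed risk-allele frequency
bmi_z = 0.05 * risk_allele_count + rng.normal(0, 1, size=n)          # simulated 0.05 SD per-allele effect

slope, intercept, r, p_value, stderr = stats.linregress(risk_allele_count, bmi_z)
print("per-allele effect: %.3f Z-score units (P = %.2g)" % (slope, p_value))
```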
Abstract:
Simulated-annealing-based conditional simulations provide a flexible means of quantitatively integrating diverse types of subsurface data. Although such techniques are being increasingly used in hydrocarbon reservoir characterization studies, their potential in environmental, engineering and hydrological investigations is still largely unexploited. Here, we introduce a novel simulated annealing (SA) algorithm geared towards the integration of high-resolution geophysical and hydrological data which, compared to more conventional approaches, provides significant advancements in the way that large-scale structural information in the geophysical data is accounted for. Model perturbations in the annealing procedure are made by drawing from a probability distribution for the target parameter conditioned to the geophysical data. This is the only place where geophysical information is utilized in our algorithm, which is in marked contrast to other approaches where model perturbations are made through the swapping of values in the simulation grid and agreement with soft data is enforced through a correlation coefficient constraint. Another major feature of our algorithm is the way in which available geostatistical information is utilized. Instead of constraining realizations to match a parametric target covariance model over a wide range of spatial lags, we constrain the realizations only at smaller lags where the available geophysical data cannot provide enough information. Thus we allow the larger-scale subsurface features resolved by the geophysical data to exert much more control over the output realizations. Further, since the only component of the SA objective function required in our approach is a covariance constraint at small lags, our method has improved convergence and computational efficiency over more traditional methods. Finally, we present the results of applying our algorithm to the integration of porosity log and tomographic crosshole georadar data to generate stochastic realizations of the local-scale porosity structure. Our procedure is first tested on a synthetic data set, and then applied to data collected at the Boise Hydrogeophysical Research Site.
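A toy one-dimensional sketch of the annealing scheme described above follows: candidate perturbations are drawn from a distribution conditioned on the soft (geophysical) data, and the objective only constrains the experimental covariance at a small lag. All values are illustrative and the real algorithm operates on multidimensional grids.

```python
# Toy 1-D sketch of the described SA scheme: perturbations are drawn from a distribution
# conditioned on soft data, and the objective penalizes only small-lag covariance mismatch.
# All numbers below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n_cells = 200
soft_mean = 0.25 + 0.05 * np.sin(np.arange(n_cells) / 15.0)   # porosity trend implied by geophysics
cond_sd = 0.02                                                 # spread of the conditional distribution
target_lag1_corr = 0.6                                         # small-lag geostatistical constraint

def lag1_corr(x):
    return np.corrcoef(x[:-1], x[1:])[0, 1]

def objective(x):
    return (lag1_corr(x) - target_lag1_corr) ** 2

x = rng.normal(soft_mean, cond_sd)                             # initial realization
temperature = 1e-3
for step in range(20000):
    i = rng.integers(n_cells)
    candidate = x.copy()
    candidate[i] = rng.normal(soft_mean[i], cond_sd)           # perturbation conditioned on soft data
    delta = objective(candidate) - objective(x)
    if delta < 0 or rng.random() < np.exp(-delta / temperature):
        x = candidate                                          # Metropolis acceptance
    temperature *= 0.9997                                      # slow cooling schedule

print("final lag-1 correlation: %.2f (target %.2f)" % (lag1_corr(x), target_lag1_corr))
```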
Abstract:
The main objective of this final degree project (TFC) is the construction and exploitation of a data warehouse. The work is based on a practical case study, in which a scenario is presented where a data warehouse must be developed for the Fundació d'Estudis per a la Conducció Responsable, which wishes to study the evolution of the number of motor-vehicle trips in Catalonia and to analyse possible correlations between means of transport, driver profiles and several road safety variables.
Abstract:
Software configuration management (SCM) is an important part of software projects. It consists of configuration management planning, change management, version management, building, packaging, configuration status tracking and configuration auditing. The software configuration management database (SCM DB) is intended for storing SCM-related data in one place, where it can be found by everyone. The SCM DB is a relational database with a WWW user interface. The database stores the SCM infrastructure, SCM resources, SCM work areas, integration planning data, packaging reports and instructions, change management data and tool management data. The database has many users. SCM managers store general information in the database, and integration managers store release-specific information for the integration plan. Those responsible for packaging store packaging reports. Software designers submit change requests to the database, which are processed by the change control board; they can also view error reports through the database. Tool coordination is likewise handled through data stored in the database. For read access, the database can be used by everyone from testing to designers, for example to consult schedules. The database can also be used to read information stored by the packaging tools about software blocks in different package versions, and the packaging tools or those responsible for packaging can obtain source information for the packaging tools directly from the database.
Abstract:
OBJECTIVE: Blood-borne biomarkers reflecting atherosclerotic plaque burden have great potential to improve clinical management of atherosclerotic coronary artery disease and acute coronary syndrome (ACS). APPROACH AND RESULTS: Using data integration from gene expression profiling of coronary thrombi versus peripheral blood mononuclear cells and proteomic analysis of atherosclerotic plaque-derived secretomes versus healthy tissue secretomes, we identified fatty acid-binding protein 4 (FABP4) as a biomarker candidate for coronary artery disease. Its diagnostic and prognostic performance was validated in 3 different clinical settings: (1) in a cross-sectional cohort of patients with stable coronary artery disease, ACS, and healthy individuals (n=820), (2) in a nested case-control cohort of patients with ACS with 30-day follow-up (n=200), and (3) in a population-based nested case-control cohort of asymptomatic individuals with 5-year follow-up (n=414). Circulating FABP4 was marginally higher in patients with ST-segment-elevation myocardial infarction (24.9 ng/mL) compared with controls (23.4 ng/mL; P=0.01). However, elevated FABP4 was associated with adverse secondary cerebrovascular or cardiovascular events during 30-day follow-up after index ACS, independent of age, sex, renal function, and body mass index (odds ratio, 1.7; 95% confidence interval, 1.1-2.5; P=0.02). Circulating FABP4 predicted adverse events with similar prognostic performance as the GRACE in-hospital risk score or N-terminal pro-brain natriuretic peptide. Finally, no significant difference between baseline FABP4 was found in asymptomatic individuals with or without coronary events during 5-year follow-up. CONCLUSIONS: Circulating FABP4 may prove useful as a prognostic biomarker in risk stratification of patients with ACS.
Abstract:
Activated T helper (Th) cells have the ability to differentiate into functionally distinct Th1, Th2 and Th17 subsets through a series of overlapping networks, including signaling, transcriptional control and epigenetic mechanisms, that direct immune responses. However, inappropriate execution of the differentiation process and abnormal function of these Th cells can lead to the development of several immune-mediated diseases. The thesis therefore aimed at identifying genes and gene regulatory mechanisms responsible for Th17 differentiation and at studying the epigenetic changes associated with the early stage of Th1/Th2 cell differentiation. Genome-wide transcriptional profiling during the early stages of human Th17 cell differentiation demonstrated differential regulation of several novel and previously known genes associated with Th17 differentiation. Selected candidate genes were further validated at the protein level, and their specificity for Th17 cells, as compared with other T helper subsets, was analyzed. Moreover, the combination of RNA interference-mediated downregulation of gene expression, genome-wide transcriptome profiling and chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-seq), together with computational data integration, led to the identification of direct and indirect target genes of STAT3, a pivotal upstream transcription factor for Th17 cell polarization. The results indicated that STAT3 directly regulates the expression of several genes known to play a role in the activation, differentiation, proliferation and survival of Th17 cells. These results provide a basis for constructing a network regulating gene expression during early human Th17 differentiation. Th1 and Th2 lineage-specific enhancers were identified from genome-wide maps of histone modifications generated from cells differentiating towards the Th1 and Th2 lineages at 72 h. Further analysis of the lineage-specific enhancers revealed known and novel transcription factors that potentially control lineage-specific gene expression. Finally, we found that a subset of enhancers overlaps with SNPs associated with autoimmune diseases in genome-wide association studies, suggesting a potential role for enhancer elements in disease development. In conclusion, the results obtained extend our knowledge of Th cell differentiation and provide new mechanistic insights into the dysregulation of Th cell differentiation in human immune-mediated diseases.
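The integration logic for separating direct from indirect STAT3 targets can be sketched as a simple set operation on the knockdown-responsive genes and the ChIP-seq-bound genes; the gene lists below are invented for illustration and do not reproduce the study's results.

```python
# Sketch of the data-integration logic described above: a gene is called a direct STAT3
# target if it both responds to STAT3 knockdown (RNAi + transcriptome) and carries a STAT3
# ChIP-seq binding site, and an indirect target if it only responds to the knockdown.
# All gene sets below are invented for illustration.

knockdown_responsive = {"IL17A", "IL23R", "RORC", "BATF", "IL2"}   # differentially expressed on STAT3 RNAi
stat3_bound          = {"IL17A", "IL23R", "BATF", "SOCS3"}         # genes with a STAT3 ChIP-seq peak

direct_targets   = knockdown_responsive & stat3_bound
indirect_targets = knockdown_responsive - stat3_bound

print("direct STAT3 targets:  ", sorted(direct_targets))
print("indirect STAT3 targets:", sorted(indirect_targets))
```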
Abstract:
A simple, low-cost concentric capillary nebulizer (CCN) was developed and evaluated for ICP spectrometry. The CCN could be operated at sample uptake rates of 0.050-1.00 ml min⁻¹ and under oscillating and non-oscillating conditions. Aerosol characteristics of the CCN were studied using a laser Fraunhofer diffraction analyzer. Solvent transport efficiencies and transport rates, detection limits, and short- and long-term stabilities were evaluated for the CCN with a modified cyclonic spray chamber at different sample uptake rates. The Mg II (280.2 nm)/Mg I (285.2 nm) ratio was used for matrix effect studies. Results were compared with those obtained using conventional nebulizers: a cross-flow nebulizer with a Scott-type spray chamber, a GemCone nebulizer with a cyclonic spray chamber, and a Meinhard TR-30-K3 concentric nebulizer with a cyclonic spray chamber. Transport efficiencies of up to 57% were obtained for the CCN. For the elements tested, short- and long-term precisions and detection limits obtained with the CCN at 0.050-0.500 ml min⁻¹ are similar to, or better than, those obtained on the same instrument using the conventional nebulizers (at 1.0 ml min⁻¹). The depressive and enhancement effects of the easily ionizable element Na, sulfuric acid, and dodecylamine surfactant on analyte signals with the CCN are similar to, or better than, those obtained with the conventional nebulizers. However, clogging of the capillary was observed when a sample solution with high dissolved solids was nebulized for more than 40 min. The effects of data acquisition and data processing on detection limits were studied using inductively coupled plasma-atomic emission spectrometry. The study examined the effects of different detection limit approaches, data integration modes, regression modes, the standard concentration range and number of standards, the sample uptake rate, and the integration time. All the experiments followed the same protocols. Three detection limit approaches were examined: the IUPAC method, the residual standard deviation (RSD) method, and the signal-to-background ratio and relative standard deviation of the background (SBR-RSDB) method. The study demonstrated that the different approaches, integration modes, regression methods, and sample uptake rates can affect detection limits. It also showed that the different approaches give different detection limits and that some methods (for example, RSD) are susceptible to the quality of the calibration curves. Multicomponent spectral fitting (MSF) gave the best results among the three integration modes examined (peak height, peak area, and MSF). The weighted least squares method showed the ability to produce better-quality calibration curves. Although an effect of the number of standards on detection limits was not observed, multiple standards are recommended because they provide more reliable calibration curves. Increasing the sample uptake rate and the integration time could improve detection limits; however, an improvement in detection limits with increased integration time was not observed because the auto integration mode was used.
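A sketch of the IUPAC-style detection limit calculation mentioned above is shown below (three times the standard deviation of blank replicate signals divided by the calibration slope); the intensities and concentrations are invented, and the RSD and SBR-RSDB approaches compared in the study are not reproduced.

```python
# Sketch of an IUPAC-style detection limit: three times the standard deviation of blank
# replicate signals divided by the calibration slope. All intensities below are invented.
import numpy as np

blank_signal = np.array([102.0, 98.5, 101.2, 99.8, 100.4, 97.9, 103.1, 100.9, 99.2, 101.6])
standards_conc = np.array([0.0, 1.0, 2.0, 5.0, 10.0])                # mg/L
standards_signal = np.array([100.0, 1450.0, 2810.0, 6900.0, 13750.0])

slope, intercept = np.polyfit(standards_conc, standards_signal, 1)   # unweighted least squares fit
detection_limit = 3.0 * blank_signal.std(ddof=1) / slope

print("slope: %.1f counts per mg/L" % slope)
print("detection limit: %.4f mg/L" % detection_limit)
```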