957 results for Imbalanced datasets
Abstract:
Despite an emerging understanding of the genetic alterations giving rise to various tumors, the mechanisms whereby most oncogenes are overexpressed remain unclear. Here we have utilized an integrated approach of genome-wide regulatory element mapping via DNase-seq followed by conventional reporter assays and transcription factor binding site discovery to characterize the transcriptional regulation of the medulloblastoma oncogene Orthodenticle Homeobox 2 (OTX2). Through these studies we have revealed that OTX2 is differentially regulated in medulloblastoma at the level of chromatin accessibility, which is in part mediated by DNA methylation. In cell lines exhibiting chromatin accessibility of OTX2 regulatory regions, we found that autoregulation maintains OTX2 expression. Comparison of medulloblastoma regulatory elements with those of the developing brain reveals that these tumors engage a developmental regulatory program to drive OTX2 transcription. Finally, we have identified a transcriptional regulatory element mediating retinoid-induced OTX2 repression in these tumors. This work characterizes for the first time the mechanisms of OTX2 overexpression in medulloblastoma. Furthermore, this study establishes proof of principle for applying ENCODE datasets towards the characterization of upstream trans-acting factors mediating expression of individual genes.
Abstract:
• PREMISE OF THE STUDY: Understanding fern (monilophyte) phylogeny and its evolutionary timescale is critical for broad investigations of the evolution of land plants, and for providing the point of comparison necessary for studying the evolution of the fern sister group, seed plants. Molecular phylogenetic investigations have revolutionized our understanding of fern phylogeny; however, to date, these studies have relied almost exclusively on plastid data. • METHODS: Here we take a curated phylogenomics approach to infer the first broad fern phylogeny from multiple nuclear loci, by combining broad taxon sampling (73 ferns and 12 outgroup species) with focused character sampling (25 loci comprising 35,877 bp), along with rigorous alignment, orthology inference, and model selection. • KEY RESULTS: Our phylogeny corroborates some earlier inferences and provides novel insights; in particular, we find strong support for Equisetales as sister to the rest of ferns, Marattiales as sister to leptosporangiate ferns, and Dennstaedtiaceae as sister to the eupolypods. Our divergence-time analyses reveal that divergences among the extant fern orders all occurred prior to ∼200 MYA. Finally, our species-tree inferences are congruent with analyses of concatenated data, but generally with lower support. Those cases where species-tree support values are higher than expected involve relationships that have been supported by smaller plastid datasets, suggesting that deep coalescence may be reducing support from the concatenated nuclear data. • CONCLUSIONS: Our study demonstrates the utility of a curated phylogenomics approach to inferring fern phylogeny, and highlights the need to consider underlying data characteristics, along with data quantity, in phylogenetic studies.
Abstract:
With increasing recognition of the roles RNA molecules and RNA/protein complexes play in an unexpected variety of biological processes, understanding RNA structure-function relationships is currently of high importance. To draw clean biological interpretations from three-dimensional structures, it is imperative to have high-quality, accurate RNA crystal structures available, and the community has thoroughly embraced that goal. However, due to the many degrees of freedom inherent in RNA structure (especially in the backbone), building accurate experimental models of RNA structures is a significant challenge. This chapter describes the tools and techniques our research group and our collaborators have developed over the years to help RNA structural biologists both evaluate and achieve better accuracy. Expert analysis of large, high-resolution, quality-conscious RNA datasets provides the fundamental information that enables automated methods for robust and efficient error diagnosis in validating RNA structures at all resolutions. The even more crucial goal of correcting the diagnosed outliers has steadily developed toward highly effective, computationally based techniques. Automation enables solving complex issues in large RNA structures, but cannot circumvent the need for thoughtful examination of local details, and so we also provide some guidance for interpreting and acting on the results of current structure validation for RNA.
Abstract:
BACKGROUND: Determining the evolutionary relationships among the major lineages of extant birds has been one of the biggest challenges in systematic biology. To address this challenge, we assembled or collected the genomes of 48 avian species spanning most orders of birds, including all Neognathae and two of the five Palaeognathae orders. We used these genomes to construct a genome-scale avian phylogenetic tree and perform comparative genomic analyses. FINDINGS: Here we present the datasets associated with the phylogenomic analyses, which include sequence alignment files consisting of nucleotides, amino acids, indels, and transposable elements, as well as tree files containing gene trees and species trees. Inferring an accurate phylogeny required generating: 1) a well-annotated data set across species based on genome synteny; 2) alignments with unaligned or incorrectly overaligned sequences filtered out; and 3) diverse data sets, including genes and their inferred trees, indels, and transposable elements. Our total evidence nucleotide tree (TENT) data set (consisting of exons, introns, and UCEs) gave what we consider our most reliable species tree when using the concatenation-based ExaML algorithm or when using statistical binning with the coalescence-based MP-EST algorithm (which we refer to as MP-EST*). Other data sets, such as the coding sequence of some exons, revealed other properties of genome evolution, namely convergence. CONCLUSIONS: To our knowledge, the Avian Phylogenomics Project is the largest vertebrate phylogenomics project to date. The sequence, alignment, and tree data are expected to accelerate analyses in phylogenomics and other related areas.
Abstract:
Transcriptional regulation has been studied intensively in recent decades. One important aspect of this regulation is the interaction between regulatory proteins, such as transcription factors (TFs) and nucleosomes, and the genome. Different high-throughput techniques have been invented to map these interactions genome-wide, including ChIP-based methods (ChIP-chip, ChIP-seq, etc.), nuclease digestion methods (DNase-seq, MNase-seq, etc.), and others. However, a single experimental technique often provides only partial and noisy information about the whole picture of protein-DNA interactions. Therefore, the overarching goal of this dissertation is to provide computational developments for jointly modeling different experimental datasets to achieve a holistic inference of the protein-DNA interaction landscape.
We first present a computational framework that can incorporate the protein binding information in MNase-seq data into a thermodynamic model of protein-DNA interaction. We use a correlation-based objective function to model the MNase-seq data and a Markov chain Monte Carlo method to maximize the function. Our results show that the inferred protein-DNA interaction landscape is concordant with the MNase-seq data and provides a mechanistic explanation for the experimentally collected MNase-seq fragments. Our framework is flexible and can easily incorporate other data sources. To demonstrate this flexibility, we use prior distributions to integrate experimentally measured protein concentrations.
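A minimal sketch of this strategy, under stated assumptions: the Python code below scores candidate binding-energy parameters by their Pearson correlation with an observed MNase-seq coverage track and maximizes that objective with a Metropolis-style Markov chain Monte Carlo search. The logistic occupancy function and the per-position energy parameterization are illustrative assumptions, not the dissertation's actual thermodynamic model.

```python
# Hedged sketch: Metropolis-style MCMC maximizing a correlation-based
# objective between a toy occupancy model and MNase-seq coverage. A real
# version would parameterize energies via factor motifs and concentrations
# (with priors on measured concentrations); this is illustration only.
import numpy as np

rng = np.random.default_rng(0)

def predicted_occupancy(energies, mu=0.0):
    # Boltzmann-style occupancy given per-position binding energies
    return 1.0 / (1.0 + np.exp(energies - mu))

def objective(energies, coverage):
    # Pearson correlation between model occupancy and observed coverage
    return np.corrcoef(predicted_occupancy(energies), coverage)[0, 1]

def mcmc_maximize(coverage, n_iter=20000, step=0.1, temperature=0.01):
    energies = rng.normal(size=coverage.shape)
    score = objective(energies, coverage)
    best_e, best_s = energies.copy(), score
    for _ in range(n_iter):
        proposal = energies + rng.normal(scale=step, size=energies.shape)
        new_score = objective(proposal, coverage)
        # Accept improvements always; accept worse moves with Metropolis
        # probability so the chain can escape local optima.
        if new_score > score or rng.random() < np.exp((new_score - score) / temperature):
            energies, score = proposal, new_score
            if score > best_s:
                best_e, best_s = energies.copy(), score
    return best_e, best_s
```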
We also study the ability of DNase-seq data to position nucleosomes. Traditionally, DNase-seq has been used mainly to identify DNase hypersensitive sites, which tend to be open chromatin regulatory regions devoid of nucleosomes. We reveal for the first time that DNase-seq datasets also contain substantial information about nucleosome translational positioning, and that existing DNase-seq data can be used to infer nucleosome positions with high accuracy. We develop a Bayes-factor-based nucleosome scoring method to position nucleosomes using DNase-seq data. Our approach utilizes several effective strategies to extract nucleosome positioning signals from the noisy DNase-seq data, including jointly modeling data points across the nucleosome body and explicitly modeling the quadratic and oscillatory DNase I digestion pattern on nucleosomes. We show that our DNase-seq-based nucleosome map is highly consistent with previous high-resolution maps. We also show that the oscillatory DNase I digestion pattern is useful in revealing the nucleosome rotational context around TF binding sites.
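To make the scoring idea concrete, here is a hedged sketch of what a Bayes-factor nucleosome score might look like, assuming Poisson cut counts and an invented quadratic-times-oscillatory expected profile (a ~10.3-bp period standing in for the rotational signal); it jointly models all positions across a 147-bp nucleosome body, but it is not the dissertation's exact model.

```python
# Illustrative Bayes-factor score: likelihood of DNase I cut counts in a
# 147-bp window under a nucleosome model (quadratic profile modulated by a
# helical-periodicity oscillation) versus a flat background model.
import numpy as np
from scipy.stats import poisson

NUC_LEN = 147
x = np.arange(NUC_LEN)

def nucleosome_profile(total_rate):
    # Expected cuts: depleted centre (quadratic, high at the edges),
    # modulated by a ~10.3-bp oscillation mimicking rotational positioning.
    quad = ((x - NUC_LEN / 2) / (NUC_LEN / 2)) ** 2
    osc = 1.0 + 0.3 * np.cos(2 * np.pi * x / 10.3)
    shape = quad * osc
    return total_rate * shape / shape.sum()

def log_bayes_factor(cuts):
    # log BF of nucleosome vs. uniform background for one 147-bp window
    total = cuts.sum()
    lam_nuc = nucleosome_profile(total)
    lam_bg = np.full(NUC_LEN, total / NUC_LEN)
    return (poisson.logpmf(cuts, lam_nuc).sum()
            - poisson.logpmf(cuts, lam_bg).sum())

def score_track(cut_counts):
    # Slide the window along a cut-count track; peaks suggest dyad positions
    n = len(cut_counts) - NUC_LEN + 1
    return np.array([log_bayes_factor(cut_counts[i:i + NUC_LEN]) for i in range(n)])
```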
Finally, we present a state-space model (SSM) for jointly modeling different kinds of genomic data to provide an accurate view of the protein-DNA interaction landscape. We also provide an efficient expectation-maximization algorithm to learn model parameters from data. We first show in simulation studies that the SSM can effectively recover underlying true protein binding configurations. We then apply the SSM to model real genomic data (both DNase-seq and MNase-seq data). Through incrementally increasing the types of genomic data in the SSM, we show that different data types can contribute complementary information for the inference of protein binding landscape and that the most accurate inference comes from modeling all available datasets.
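As a deliberately simplified stand-in for this chapter's model, the sketch below implements a two-state hidden Markov model (0 = unbound, 1 = bound) with unit-variance Gaussian emissions over paired DNase-seq and MNase-seq signals, fixed transition probabilities, and an EM loop that re-estimates only the emission means; the actual state-space model and its expectation-maximization algorithm are richer than this.

```python
# Toy two-state HMM over paired genomic signals, with a log-space
# forward-backward E-step and a means-only M-step. Illustration only.
import numpy as np
from scipy.special import logsumexp

def forward_backward(log_em, log_A, log_pi):
    """Log-space forward-backward; returns posterior state probabilities."""
    T, K = log_em.shape
    la, lb = np.zeros((T, K)), np.zeros((T, K))
    la[0] = log_pi + log_em[0]
    for t in range(1, T):
        la[t] = log_em[t] + logsumexp(la[t - 1][:, None] + log_A, axis=0)
    for t in range(T - 2, -1, -1):
        lb[t] = logsumexp(log_A + (log_em[t + 1] + lb[t + 1])[None, :], axis=1)
    post = la + lb
    return np.exp(post - logsumexp(post, axis=1, keepdims=True))

def em(obs, n_iter=50):
    """obs: (T, 2) array of per-position DNase-seq and MNase-seq values."""
    log_A = np.log([[0.99, 0.01], [0.05, 0.95]])   # fixed transitions
    log_pi = np.log([0.9, 0.1])
    means = np.stack([obs.mean(0) - obs.std(0), obs.mean(0) + obs.std(0)])
    for _ in range(n_iter):
        # E-step: unit-variance Gaussian log-emissions for both data types
        log_em = -0.5 * ((obs[:, None, :] - means[None]) ** 2).sum(-1)
        gamma = forward_backward(log_em, log_A, log_pi)
        # M-step: posterior-weighted re-estimate of state-conditional means
        means = (gamma.T @ obs) / gamma.sum(0)[:, None]
    return means, gamma
```

Adding a second data type here simply widens the emission vector, which mirrors the chapter's observation that each additional dataset contributes complementary evidence to the posterior over binding states.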
This dissertation provides a foundation for future research by taking a step toward genome-wide inference of the protein-DNA interaction landscape through data integration.
Abstract:
The second round of the community-wide initiative Critical Assessment of automated Structure Determination of Proteins by NMR (CASD-NMR-2013) comprised ten blind target datasets, consisting of unprocessed spectral data, assigned chemical shift lists, and unassigned NOESY peak and RDC lists, that were made available in both curated (i.e., manually refined) and un-curated (i.e., automatically generated) forms. Ten structure calculation programs, using fully automated protocols only, generated a total of 164 three-dimensional structures (entries) for the ten targets, sometimes using both curated and un-curated lists to generate multiple entries for a single target. The accuracy of the entries could be established by comparing them to the corresponding manually solved structure of each target, which was not available at the time the data were provided. Across the entire data set, 71% of all entries submitted achieved an accuracy relative to the reference NMR structure better than 1.5 Å. Methods based on NOESY peak lists achieved even better results, with up to 100% of the entries within the 1.5 Å threshold for some programs. However, some methods did not converge for some targets using un-curated NOESY peak lists. Over 90% of the entries achieved an accuracy better than the more relaxed threshold of 2.5 Å that was used in the previous CASD-NMR-2010 round. Comparisons between entries generated with un-curated versus curated peaks show only marginal improvements for the latter in those cases where both calculations converged.
Abstract:
Parallel computing is now widely used in numerical simulation, particularly for application codes based on finite difference and finite element methods. A popular and successful technique for parallelizing such codes on large distributed memory systems is to partition the mesh into sub-domains that are then allocated to processors. The code then executes in parallel, using the SPMD methodology, with message passing for inter-processor interactions. In order to improve the parallel efficiency of an imbalanced structured mesh CFD code, a new dynamic load balancing (DLB) strategy has been developed in which the processor partition range limits of just one of the partitioned dimensions are allowed to be non-coincidental, as opposed to coincidental, across processors. The ‘local’ partition limit change allows greater flexibility in obtaining a balanced load distribution, as the workload increase, or decrease, on a processor is no longer restricted by the ‘global’ (coincidental) limit change. The automatic implementation of this generic DLB strategy within an existing parallel code is presented in this chapter, along with some preliminary results.
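A hedged illustration of the coincidental versus non-coincidental idea, on synthetic per-cell workloads rather than the chapter's CFD code: with coincidental limits every column strip must share the same row-partition boundaries, whereas letting each strip choose its own (‘local’) boundaries in one dimension tracks an uneven workload more closely.

```python
# Compare load imbalance under coincidental (shared) vs. non-coincidental
# (per-strip) row limits for a 2D mesh partitioned among pcols x prows
# processors. Workload values are synthetic and illustrative.
import numpy as np

def balanced_limits(weights, nparts):
    # Choose cut points so each part receives ~equal cumulative weight.
    cum = np.cumsum(weights)
    targets = cum[-1] * np.arange(1, nparts) / nparts
    return np.searchsorted(cum, targets).tolist() + [len(weights)]

def partition(work, pcols, prows, coincidental=True):
    col_lims = balanced_limits(work.sum(axis=0), pcols)
    global_rows = balanced_limits(work.sum(axis=1), prows)
    loads, c0 = [], 0
    for c1 in col_lims:
        strip = work[:, c0:c1]
        # Shared row limits for all strips, or strip-local ("non-coincidental")
        rows = global_rows if coincidental else balanced_limits(strip.sum(axis=1), prows)
        r0 = 0
        for r1 in rows:
            loads.append(strip[r0:r1].sum())
            r0 = r1
        c0 = c1
    return np.array(loads)

rng = np.random.default_rng(1)
work = rng.gamma(2.0, size=(200, 200)) * np.linspace(0.2, 2.0, 200)  # imbalanced
for mode in (True, False):
    loads = partition(work, pcols=4, prows=4, coincidental=mode)
    label = "coincidental" if mode else "non-coincidental"
    print(label, "max/mean load imbalance:", loads.max() / loads.mean())
```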
Abstract:
There is concern in the Cross-Channel region of Nord-Pas-de-Calais (France) and Kent (Great Britain) regarding the extent of atmospheric pollution detected in the area from emitted gaseous (VOC, NOx, SO2) and particulate substances. In particular, the air quality of the Cross-Channel or "Trans-Manche" region is highly affected by the heavily industrial area of Dunkerque, in addition to transportation sources linked to cross-channel traffic in Kent and Calais, posing threats to the environment and human health. In the framework of the cross-border EU Interreg IIIA activity, the joint Anglo-French project ATTMA has been commissioned to study Aerosol Transport in the Trans-Manche Atmosphere. Using ground monitoring data from UK and French networks, and with the assistance of satellite images, the project aims to determine dispersion patterns and identify the sources responsible for the pollutants. The findings of this study will increase awareness and have a bearing on future air quality policy in the region. Public interest is evident from the presence of local authorities on both sides of the English Channel as collaborators. The research is based on pollution transport simulations using (a) Lagrangian Particle Dispersion (LPD) models and (b) an Eulerian receptor-based model. This paper is concerned with part (a), the LPD models. LPD models are often used to numerically simulate the dispersion of a passive tracer in the planetary boundary layer by calculating the Lagrangian trajectories of thousands of notional particles. In this contribution, the project investigated the use of two widely used particle dispersion models: the Hybrid Single Particle Lagrangian Integrated Trajectory (HYSPLIT) model and the model FLEXPART. In both models, forward tracking and inverse (or receptor-based) modes are possible. Certain distinct pollution episodes have been selected from the monitoring database EXPER/PF and from UK monitoring stations, and their likely trajectories predicted using prevailing weather data. Global meteorological datasets were downloaded from the ECMWF MARS archive. Part of the difficulty in identifying pollution sources arises from the fact that much of the pollution originates outside the monitoring area. For example, heightened particulate concentrations are thought to originate from sand storms in the Sahara or volcanic activity in Iceland or the Caribbean; this work identifies such long-range influences. The output of the simulations shows that there are notable differences between the formulations of FLEXPART and HYSPLIT, although both models used the same meteorological data and source input, suggesting that the identification of the primary emissions during air pollution episodes may be rather uncertain.
Abstract:
Serial Analysis of Gene Expression (SAGE) is a relatively new method for monitoring gene expression levels and is expected to contribute significantly to progress in cancer treatment by enabling precise and early diagnosis. A promising application of SAGE gene expression data is the classification of tumors. In this paper, we build three event models (the multivariate Bernoulli model, the multinomial model and the normalized multinomial model) for SAGE data classification. Both binary classification and multicategory classification are investigated. Experiments on two SAGE datasets show that the multivariate Bernoulli model performs well with small feature sizes and the multinomial model performs better with large feature sizes, while the normalized multinomial model performs well with medium feature sizes. The multinomial model achieves the highest overall accuracy.
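As a rough, hedged illustration of how such event models can be compared, the sketch below uses scikit-learn stand-ins on synthetic tag counts: BernoulliNB for the multivariate Bernoulli model, MultinomialNB on raw counts for the multinomial model, and MultinomialNB on library-size-normalized counts as a proxy for the normalized multinomial model. The data, labels, and normalization constant are invented for illustration and are not the paper's SAGE libraries.

```python
# Compare three naive-Bayes event models on synthetic SAGE-like tag counts.
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

rng = np.random.default_rng(0)
n_samples, n_tags = 60, 500
# Synthetic libraries: Poisson tag counts with gamma-distributed tag rates
X = rng.poisson(lam=rng.gamma(1.0, size=n_tags), size=(n_samples, n_tags))
signal = X[:, :10].sum(axis=1)                 # first 10 tags carry the class
y = (signal > np.median(signal)).astype(int)   # binary tumor-vs-normal labels

models = {
    "multivariate Bernoulli": BernoulliNB(binarize=0.5),
    "multinomial": MultinomialNB(),
    "normalized multinomial": make_pipeline(
        # scale each library to an equal total tag count before the multinomial
        FunctionTransformer(lambda A: A / A.sum(axis=1, keepdims=True) * 1000),
        MultinomialNB(),
    ),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {acc:.3f}")
```

The Bernoulli model sees only tag presence/absence, which is why it tends to hold up with few features, while the count-based multinomial models can exploit abundance information as the feature set grows.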
Abstract:
This paper presents an analysis of biofluid behavior in a T-shaped microchannel device and a design optimization for improved biofluid performance in terms of particle-liquid separation. The biofluid is modeled as a single-phase, shear-rate-dependent non-Newtonian flow with blood properties. The separation of red blood cells from plasma is evident from the biofluid distribution in the microchannels, judged against various relevant effects and findings, including the Zweifach-Fung bifurcation law, the Fahraeus effect, the Fahraeus-Lindqvist effect and the cell-free layer phenomenon. Modeling with the initial device shows that this T-microchannel device can separate red blood cells from plasma, but the separation efficiency among different bifurcations varies widely. In response to this imbalanced performance, a design optimization is conducted. This includes implementing a series of simulations to investigate the effect of the lengths of the main and branch channels on biofluid behavior, and searching for an improved design with optimal separation performance. It is found that changing the relative lengths of the branch channels is effective both for the uniformity of the flow rate ratio among bifurcations and for reducing the difference in flow velocities between the branch channels, whereas extending the length of the main channel from the bifurcation region is only effective for the uniformity of the flow rate ratio.
Abstract:
In 2000 a Review of Current Marine Observations in relation to present and future needs was undertaken by the Inter-Agency Committee for Marine Science and Technology (IACMST). The Marine Environmental Change Network (MECN) was initiated in 2002 as a direct response to the recommendations of the report. A key part of the current phase of the MECN is to ensure that information from the network is provided to policy makers and other end-users to enable them to produce more accurate assessments of ecosystem state and gain a clearer understanding of factors influencing change in marine ecosystems. The MECN holds workshops on an annual basis, bringing together partners maintaining time series and long-term datasets as well as end-users interested in outputs from the network. It was decided that the first workshop of the MECN continuation phase should consist of an evaluation of the time series and data sets maintained by partners in the MECN with regard to their ‘fitness for purpose’ for answering key science questions and informing policy development. This report is based on the outcomes of the workshop. Section one of the report contains a brief introduction to monitoring, time series and long-term datasets. The various terms are defined and the need for MECN-type data to complement compliance monitoring programmes is discussed. Outlines are also given of initiatives such as the United Kingdom Marine Monitoring and Assessment Strategy (UKMMAS) and Oceans 2025. Section two contains detailed information for each of the MECN time series and long-term datasets, including information on scientific outputs and current objectives. This information is mainly based on the presentations given at the workshop and therefore follows a format in which the following headings are addressed: origin of the time series, including original objectives; current objectives; policy relevance; products (advice, publications, science and society). Section three consists of comments made by the review panel concerning all the time series and the network. Needs and issues highlighted by the panel with regard to the future of long-term datasets and time series in the UK are shown, along with advice and potential solutions where offered. The recommendations are divided into four categories: ‘The MECN and end-user requirements’; ‘Procedures & protocols’; ‘Securing data series’; and ‘Future developments’. Ever since marine environmental protection issues came to the fore in the 1960s, it has been recognised that a suitable evidence base on environmental change is required in order to support policy and management for UK waters. Section four gives a brief summary of the development of marine policy in the UK, along with comments on the availability and necessity of long-term marine observations for the implementation of this policy. Policy relating to three main areas is discussed: marine conservation (protecting biodiversity and marine ecosystems), marine pollution, and fisheries. The conclusion of this section is that there has always been a specific requirement for information on long-term change in marine ecosystems around the UK in order to address concerns over pollution, fishing and general conservation. It is now imperative that this need is addressed in order for the UK to be able to fulfil its policy commitments and manage marine ecosystems in the light of climate change and other factors.
Abstract:
The contract work has demonstrated that older data can be assessed and entered into the MR format. Older data has associated problems but is retrievable. The contract successfully imported all datasets as required. MNCR survey sheets fit well into the MR format. The data validation and verification process can be improved: a number of computerised shortcuts can be suggested and the process made more intuitive. Such a move is vital if MR is to be adopted as a standard by the recording community, both at a voluntary level and potentially by consultancies.
Abstract:
This work demonstrates an example of the importance of an adequate method of sub-sampling model results when comparing them with in situ measurements. A test of model skill was performed by employing a point-to-point method to compare a multi-decadal hindcast against a sparse, unevenly distributed historic in situ dataset. The point-to-point method masked out all hindcast cells that did not have a corresponding in situ measurement, in order to match each in situ measurement against its most similar cell from the model. The application of the point-to-point method showed that the model was successful at reproducing the inter-annual variability of the in situ datasets. However, this success was not apparent when the measurements were aggregated to regional averages. Time series, data density and target diagrams were employed to illustrate the impact of switching from the regional average method to the point-to-point method. The comparison based on regional averages gave significantly different and sometimes contradictory results that could lead to erroneous conclusions about the model's performance. Furthermore, the point-to-point technique is a more appropriate method for exploiting sparse, uneven in situ data while compensating for the variability of its sampling. We therefore recommend that researchers account for the limitations of the in situ datasets and process the model output to resemble the data as closely as possible.
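A minimal sketch of the point-to-point idea, under assumed array layouts and variable names (not the study's actual hindcast or code): match every in situ record to its nearest model cell in time, latitude, and longitude, then compute skill statistics over the matched pairs only.

```python
# Match sparse in situ records to their nearest hindcast cells and compute
# skill over the matched pairs, instead of comparing regional averages.
import numpy as np

def point_to_point(model, model_t, model_lat, model_lon, obs):
    """model: hindcast array of shape (time, lat, lon); obs: iterable of
    (time, lat, lon, value) records. Returns matched (model, obs) pairs."""
    matched = []
    for t, la, lo, val in obs:
        it = np.abs(model_t - t).argmin()      # nearest model time step
        ia = np.abs(model_lat - la).argmin()   # nearest grid latitude
        io = np.abs(model_lon - lo).argmin()   # nearest grid longitude
        matched.append((model[it, ia, io], val))
    return np.array(matched)

def skill(pairs):
    """Correlation and mean bias over the matched pairs -- the quantities
    one would summarise in time-series or target diagrams."""
    m, o = pairs[:, 0], pairs[:, 1]
    return np.corrcoef(m, o)[0, 1], (m - o).mean()
```

Because every unmatched model cell is ignored, the comparison inherits the observations' sampling pattern, which is precisely what compensates for a sparse, uneven in situ dataset.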
Abstract:
The role of the ocean in the cycling of oxygenated volatile organic compounds (OVOCs) remains largely unresolved due to a paucity of datasets. We describe the development of a membrane inlet-proton transfer reaction/mass spectrometry (MI-PTR/MS) method as an efficient means of analysing methanol, acetaldehyde and acetone in seawater. Validation of the technique with water standards shows that the optimised responses are linear and reproducible. Limits of detection are 27 nM for methanol, 0.7 nM for acetaldehyde and 0.3 nM for acetone. Acetone and acetaldehyde concentrations generated by MI-PTR/MS are compared with those from a second, independent method based on purge and trap-gas chromatography/flame ionisation detection (P&T-GC/FID) and show excellent agreement. Chromatographic separation of the isomeric species acetone and propanal permits correction of the mass 59 signal generated by the PTR/MS and overcomes a known uncertainty in reporting acetone concentrations via mass spectrometry. A third, bioassay-based technique using radiolabelled acetone further supported the results generated by this method. We present the development and optimisation of the MI-PTR/MS technique as a reliable and convenient tool for analysing seawater samples for these trace gases. We compare this method with other analytical techniques and discuss its potential to improve the current understanding of the cycling of oceanic OVOCs.
Abstract:
Observations of Earth from space have been made for over 40 years and have contributed to advances in many aspects of climate science. However, attempts to exploit this wealth of data are often hampered by a lack of homogeneity and continuity and by insufficient understanding of the products and their uncertainties. There is, therefore, a need to reassess and reprocess satellite datasets to maximize their usefulness for climate science. The European Space Agency has responded to this need by establishing the Climate Change Initiative (CCI). The CCI will create new climate data records for (currently) 13 essential climate variables (ECVs) and make these open and easily accessible to all. Each ECV project works closely with users to produce time series from the available satellite observations relevant to users' needs. A climate modeling users' group provides a climate system perspective and a forum to bring the data and modeling communities together. This paper presents the CCI program, outlining its benefits and presenting the approaches and challenges for each ECV project, covering clouds, aerosols, ozone, greenhouse gases, sea surface temperature, ocean color, sea level, sea ice, land cover, fire, glaciers, soil moisture, and ice sheets. It also discusses how the CCI approach may contribute to defining and shaping future developments in Earth observation for climate science.