970 results for Datasets
Abstract:
• PREMISE OF THE STUDY: Understanding fern (monilophyte) phylogeny and its evolutionary timescale is critical for broad investigations of the evolution of land plants, and for providing the point of comparison necessary for studying the evolution of the fern sister group, seed plants. Molecular phylogenetic investigations have revolutionized our understanding of fern phylogeny; however, to date, these studies have relied almost exclusively on plastid data. • METHODS: Here we take a curated phylogenomics approach to infer the first broad fern phylogeny from multiple nuclear loci, by combining broad taxon sampling (73 ferns and 12 outgroup species) with focused character sampling (25 loci comprising 35,877 bp), along with rigorous alignment, orthology inference and model selection. • KEY RESULTS: Our phylogeny corroborates some earlier inferences and provides novel insights; in particular, we find strong support for Equisetales as sister to the rest of ferns, Marattiales as sister to leptosporangiate ferns, and Dennstaedtiaceae as sister to the eupolypods. Our divergence-time analyses reveal that divergences among the extant fern orders all occurred prior to ∼200 MYA. Finally, our species-tree inferences are congruent with analyses of concatenated data, but generally with lower support. Those cases where species-tree support values are higher than expected involve relationships that have been supported by smaller plastid datasets, suggesting that deep coalescence may be reducing support from the concatenated nuclear data. • CONCLUSIONS: Our study demonstrates the utility of a curated phylogenomics approach to inferring fern phylogeny, and highlights the need to consider underlying data characteristics, along with data quantity, in phylogenetic studies.
Abstract:
With increasing recognition of the roles RNA molecules and RNA/protein complexes play in an unexpected variety of biological processes, understanding of RNA structure-function relationships is of high current importance. To make clean biological interpretations from three-dimensional structures, it is imperative to have high-quality, accurate RNA crystal structures available, and the community has thoroughly embraced that goal. However, due to the many degrees of freedom inherent in RNA structure (especially for the backbone), it is a significant challenge to succeed in building accurate experimental models for RNA structures. This chapter describes the tools and techniques our research group and our collaborators have developed over the years to help RNA structural biologists both evaluate and achieve better accuracy. Expert analysis of large, high-resolution, quality-conscious RNA datasets provides the fundamental information that enables automated methods for robust and efficient error diagnosis in validating RNA structures at all resolutions. The even more crucial goal of correcting the diagnosed outliers has steadily developed toward highly effective, computationally based techniques. Automation enables solving complex issues in large RNA structures, but cannot circumvent the need for thoughtful examination of local details, and so we also provide some guidance for interpreting and acting on the results of current structure validation for RNA.
Abstract:
BACKGROUND: Determining the evolutionary relationships among the major lineages of extant birds has been one of the biggest challenges in systematic biology. To address this challenge, we assembled or collected the genomes of 48 avian species spanning most orders of birds, including all Neognathae and two of the five Palaeognathae orders. We used these genomes to construct a genome-scale avian phylogenetic tree and perform comparative genomic analyses. FINDINGS: Here we present the datasets associated with the phylogenomic analyses, which include sequence alignment files consisting of nucleotides, amino acids, indels, and transposable elements, as well as tree files containing gene trees and species trees. Inferring an accurate phylogeny required generating: 1) a well-annotated data set across species based on genome synteny; 2) alignments with unaligned or incorrectly overaligned sequences filtered out; and 3) diverse data sets, including genes and their inferred trees, indels, and transposable elements. Our total evidence nucleotide tree (TENT) data set (consisting of exons, introns, and UCEs) gave what we consider our most reliable species tree when using the concatenation-based ExaML algorithm or when using statistical binning with the coalescence-based MP-EST algorithm (which we refer to as MP-EST*). Other data sets, such as the coding sequence of some exons, revealed other properties of genome evolution, namely convergence. CONCLUSIONS: To our knowledge, the Avian Phylogenomics Project is the largest vertebrate phylogenomics project to date. The sequence, alignment, and tree data are expected to accelerate analyses in phylogenomics and other related areas.
Abstract:
Transcriptional regulation has been studied intensively in recent decades. One important aspect of this regulation is the interaction between regulatory proteins, such as transcription factors (TFs) and nucleosomes, and the genome. Different high-throughput techniques have been invented to map these interactions genome-wide, including ChIP-based methods (ChIP-chip, ChIP-seq, etc.), nuclease digestion methods (DNase-seq, MNase-seq, etc.), and others. However, a single experimental technique often provides only partial and noisy information about the whole picture of protein-DNA interactions. Therefore, the overarching goal of this dissertation is to provide computational developments for jointly modeling different experimental datasets to achieve a holistic inference of the protein-DNA interaction landscape.
We first present a computational framework that can incorporate the protein binding information in MNase-seq data into a thermodynamic model of protein-DNA interaction. We use a correlation-based objective function to model the MNase-seq data and a Markov chain Monte Carlo method to maximize the function. Our results show that the inferred protein-DNA interaction landscape is concordant with the MNase-seq data and provides a mechanistic explanation for the experimentally collected MNase-seq fragments. Our framework is flexible and can easily incorporate other data sources. To demonstrate this flexibility, we use prior distributions to integrate experimentally measured protein concentrations.
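To make the general approach concrete, the following minimal Python sketch (not the dissertation's code; the toy occupancy model, site positions and all parameter values are illustrative assumptions) uses a Metropolis-Hastings random walk over binding energies and scores each proposal by the Pearson correlation between a simple thermodynamic occupancy profile and an observed MNase-seq-like coverage track.

```python
# Minimal sketch (illustrative only): Metropolis-Hastings over binding energies,
# scored by correlation between a toy occupancy profile and an MNase-seq-like signal.
import numpy as np

rng = np.random.default_rng(0)

def occupancy(energies, positions, genome_len=500, footprint=20):
    """Toy Boltzmann occupancy: each site contributes its binding probability over a footprint."""
    occ = np.zeros(genome_len)
    weights = np.exp(-np.asarray(energies))
    probs = weights / (1.0 + weights)            # independent-site approximation
    for p, pr in zip(positions, probs):
        occ[p:p + footprint] += pr
    return occ

def correlation_score(energies, positions, observed):
    pred = occupancy(energies, positions, genome_len=observed.size)
    return np.corrcoef(pred, observed)[0, 1]

# Synthetic "observed" MNase-seq coverage and candidate binding sites (assumptions).
observed = np.convolve(rng.poisson(2.0, 500), np.ones(20) / 20, mode="same")
sites = [50, 180, 320, 440]

current = np.zeros(len(sites))                   # initial binding energies
current_score = correlation_score(current, sites, observed)
best, best_score = current.copy(), current_score
for _ in range(2000):                            # Metropolis-Hastings updates
    proposal = current + rng.normal(0.0, 0.3, size=current.size)
    prop_score = correlation_score(proposal, sites, observed)
    delta = prop_score - current_score
    if delta > 0 or rng.random() < np.exp(delta / 0.05):   # temperature 0.05
        current, current_score = proposal, prop_score
        if current_score > best_score:
            best, best_score = current.copy(), current_score

print("best correlation:", round(best_score, 3), "energies:", np.round(best, 2))
```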
We also study the ability of DNase-seq data to position nucleosomes. Traditionally, DNase-seq has only been widely used to identify DNase hypersensitive sites, which tend to be open chromatin regulatory regions devoid of nucleosomes. We reveal for the first time that DNase-seq datasets also contain substantial information about nucleosome translational positioning, and that existing DNase-seq data can be used to infer nucleosome positions with high accuracy. We develop a Bayes-factor-based nucleosome scoring method to position nucleosomes using DNase-seq data. Our approach utilizes several effective strategies to extract nucleosome positioning signals from the noisy DNase-seq data, including jointly modeling data points across the nucleosome body and explicitly modeling the quadratic and oscillatory DNase I digestion pattern on nucleosomes. We show that our DNase-seq-based nucleosome map is highly consistent with previous high-resolution maps. We also show that the oscillatory DNase I digestion pattern is useful in revealing the nucleosome rotational context around TF binding sites.
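A rough illustration of scoring a candidate nucleosome position follows; it is not the dissertation's method, and the profile shape, ~10 bp periodicity and Poisson likelihoods are assumptions. It contrasts the likelihood of DNase-like cut counts under a quadratic, oscillatory cut-rate profile with a flat background, as a crude stand-in for a Bayes-factor score.

```python
# Minimal sketch (illustrative only): Poisson log-likelihood ratio between a
# quadratic, ~10.3 bp-periodic nucleosome cut profile and a flat background.
import numpy as np
from scipy.stats import poisson

def nucleosome_profile(length=147, period=10.3, depth=0.6):
    """Assumed relative DNase I cut rate across a nucleosome body."""
    x = np.linspace(-1.0, 1.0, length)
    quadratic = 1.0 - depth * (1.0 - x**2)                # protected in the middle
    oscillation = 1.0 + 0.3 * np.cos(2 * np.pi * np.arange(length) / period)
    profile = quadratic * oscillation
    return profile / profile.mean()

def log_score(cuts, profile):
    """Log-likelihood ratio, standing in for a Bayes factor with point priors."""
    rate = cuts.mean()
    ll_nuc = poisson.logpmf(cuts, rate * profile).sum()
    ll_bg = poisson.logpmf(cuts, rate).sum()
    return ll_nuc - ll_bg

rng = np.random.default_rng(1)
profile = nucleosome_profile()
cuts_nuc = rng.poisson(3.0 * profile)      # window drawn from the nucleosome model
cuts_bg = rng.poisson(3.0, size=147)       # window drawn from flat background

print("nucleosome-like window score:", round(log_score(cuts_nuc, profile), 1))
print("background window score:     ", round(log_score(cuts_bg, profile), 1))
```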
Finally, we present a state-space model (SSM) for jointly modeling different kinds of genomic data to provide an accurate view of the protein-DNA interaction landscape. We also provide an efficient expectation-maximization algorithm to learn model parameters from data. We first show in simulation studies that the SSM can effectively recover the underlying true protein binding configurations. We then apply the SSM to model real genomic data (both DNase-seq and MNase-seq data). By incrementally increasing the types of genomic data in the SSM, we show that different data types can contribute complementary information for the inference of the protein binding landscape and that the most accurate inference comes from modeling all available datasets.
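The sketch below conveys the flavour of such an EM procedure on joint DNase-seq/MNase-seq signals; it is a simplified stand-in (an independent-position, two-state mixture rather than the dissertation's state-space model), and all distributions, signal levels and parameters are assumptions.

```python
# Minimal sketch (illustrative only): EM for a two-state ("bound"/"unbound") mixture
# over positions, with each state emitting Gaussian DNase-seq and MNase-seq signals.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Simulate 1000 positions: 30% bound, with distinct DNase/MNase signal levels.
truth = rng.random(1000) < 0.3
dnase = np.where(truth, rng.normal(4.0, 1.0, 1000), rng.normal(1.0, 1.0, 1000))
mnase = np.where(truth, rng.normal(0.5, 1.0, 1000), rng.normal(3.0, 1.0, 1000))
data = np.column_stack([dnase, mnase])

pi, mu_bound, mu_unbound = 0.5, np.array([3.0, 1.0]), np.array([2.0, 2.0])
for _ in range(50):
    # E-step: posterior probability that each position is bound.
    ll_bound = norm.logpdf(data, mu_bound, 1.0).sum(axis=1) + np.log(pi)
    ll_unbound = norm.logpdf(data, mu_unbound, 1.0).sum(axis=1) + np.log(1 - pi)
    resp = 1.0 / (1.0 + np.exp(ll_unbound - ll_bound))
    # M-step: update mixing weight and state means.
    pi = resp.mean()
    mu_bound = (resp[:, None] * data).sum(axis=0) / resp.sum()
    mu_unbound = ((1 - resp)[:, None] * data).sum(axis=0) / (1 - resp).sum()

print("estimated bound fraction:", round(pi, 2))
print("bound-state means (DNase, MNase):  ", np.round(mu_bound, 2))
print("unbound-state means (DNase, MNase):", np.round(mu_unbound, 2))
```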
This dissertation provides a foundation for future research by taking a step toward genome-wide inference of the protein-DNA interaction landscape through data integration.
Abstract:
The second round of the community-wide initiative Critical Assessment of automated Structure Determination of Proteins by NMR (CASD-NMR-2013) comprised ten blind target datasets, consisting of unprocessed spectral data, assigned chemical shift lists and unassigned NOESY peak and RDC lists, that were made available in either curated (i.e. manually refined) or un-curated (i.e. automatically generated) form. Ten structure calculation programs, using fully automated protocols only, generated a total of 164 three-dimensional structures (entries) for the ten targets, sometimes using both curated and un-curated lists to generate multiple entries for a single target. The accuracy of the entries could be established by comparing them to the corresponding manually solved structure of each target, which was not available at the time the data were provided. Across the entire data set, 71 % of all entries submitted achieved an accuracy better than 1.5 Å relative to the reference NMR structure. Methods based on NOESY peak lists achieved even better results, with up to 100 % of the entries within the 1.5 Å threshold for some programs. However, some methods did not converge for some targets using un-curated NOESY peak lists. Over 90 % of the entries achieved an accuracy better than the more relaxed threshold of 2.5 Å that was used in the previous CASD-NMR-2010 round. Comparisons between entries generated with un-curated versus curated peaks show only marginal improvements for the latter in those cases where both calculations converged.
Abstract:
There is concern in the Cross-Channel region of Nord-Pas-de-Calais (France) and Kent (Great Britain) regarding the extent of atmospheric pollution detected in the area from emitted gaseous (VOC, NOx, SO2) and particulate substances. In particular, the air quality of the Cross-Channel or "Trans-Manche" region is highly affected by the heavily industrialised area of Dunkerque, in addition to transportation sources linked to cross-channel traffic in Kent and Calais, posing threats to the environment and human health. In the framework of the cross-border EU Interreg IIIA activity, the joint Anglo-French project ATTMA has been commissioned to study Aerosol Transport in the Trans-Manche Atmosphere. Using ground monitoring data from UK and French networks, and with the assistance of satellite images, the project aims to determine dispersion patterns and identify the sources responsible for the pollutants. The findings of this study will increase awareness and have a bearing on future air quality policy in the region. Public interest is evident from the presence of local authorities on both sides of the English Channel as collaborators. The research is based on pollution transport simulations using (a) Lagrangian Particle Dispersion (LPD) models and (b) an Eulerian receptor-based model. This paper is concerned with part (a), the LPD models. LPD models are often used to numerically simulate the dispersion of a passive tracer in the planetary boundary layer by calculating the Lagrangian trajectories of thousands of notional particles. In this contribution, the project investigated the use of two widely used particle dispersion models: the Hybrid Single Particle Lagrangian Integrated Trajectory (HYSPLIT) model and the model FLEXPART. In both models, forward tracking and inverse (or receptor-based) modes are possible. Certain distinct pollution episodes have been selected from the monitoring database EXPER/PF and from UK monitoring stations, and their likely trajectories predicted using prevailing weather data. Global meteorological datasets were downloaded from the ECMWF MARS archive. Part of the difficulty in identifying pollution sources arises from the fact that much of the pollution originates outside the monitoring area. For example, heightened particulate concentrations are thought to originate from sand storms in the Sahara or volcanic activity in Iceland or the Caribbean; this work identifies such long-range influences. The output of the simulations shows that there are notable differences between the formulations of FLEXPART and HYSPLIT, although both models used the same meteorological data and source input, suggesting that the identification of the primary emissions during air pollution episodes may be rather uncertain.
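To make the LPD idea concrete, the following toy Python sketch (not HYSPLIT or FLEXPART; the wind, turbulence scale and release setup are arbitrary assumptions) advects a cloud of notional particles with a mean wind plus a random-walk perturbation standing in for boundary-layer turbulence, i.e. the forward-tracking mode described above; an inverse run would simply reverse wind and time.

```python
# Minimal sketch (toy illustration only): forward Lagrangian dispersion of
# notional particles in 2-D with mean advection and random-walk "turbulence".
import numpy as np

rng = np.random.default_rng(3)

n_particles, n_steps, dt = 5000, 240, 60.0   # 5000 particles, 4 h at 60 s steps
wind = np.array([5.0, 1.5])                  # mean wind, m/s (assumed)
sigma_turb = 0.8                             # turbulent velocity scale, m/s (assumed)

pos = np.zeros((n_particles, 2))             # all particles released at the source
for _ in range(n_steps):
    turb = rng.normal(0.0, sigma_turb, size=(n_particles, 2))
    pos += (wind + turb) * dt                # advection plus stochastic spread

centre = pos.mean(axis=0) / 1000.0           # km downwind of the source
spread = pos.std(axis=0) / 1000.0
print(f"plume centre after 4 h: {centre.round(1)} km, spread (1 sigma): {spread.round(1)} km")
```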
Abstract:
Serial Analysis of Gene Expression (SAGE) is a relatively new method for monitoring gene expression levels and is expected to contribute significantly to progress in cancer treatment by enabling precise and early diagnosis. A promising application of SAGE gene expression data is the classification of tumors. In this paper, we build three event models (the multivariate Bernoulli model, the multinomial model and the normalized multinomial model) for SAGE data classification. Both binary classification and multicategory classification are investigated. Experiments on two SAGE datasets show that the multivariate Bernoulli model performs well with small feature sizes, the multinomial model performs better at large feature sizes, and the normalized multinomial model performs well at medium feature sizes. Overall, the multinomial model achieves the highest accuracy.
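For readers unfamiliar with the event models, the sketch below (illustrative only, on synthetic SAGE-like tag counts; it is not the paper's implementation or data) applies the standard multivariate Bernoulli and multinomial naive Bayes classifiers corresponding to the first two models.

```python
# Minimal sketch (illustrative only): Bernoulli vs multinomial naive Bayes event
# models applied to synthetic SAGE-like tag-count data for two tumour classes.
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)

n_samples, n_tags = 60, 200
labels = rng.integers(0, 2, n_samples)                   # two tumour classes
base = rng.gamma(2.0, 2.0, n_tags)                       # baseline tag abundances
effect = np.where(rng.random(n_tags) < 0.1, 3.0, 1.0)    # 10% differential tags
rates = np.outer(np.ones(n_samples), base) * np.where(labels[:, None] == 1, effect, 1.0)
counts = rng.poisson(rates)                              # SAGE-like count matrix

for name, model in [("multivariate Bernoulli", BernoulliNB()),
                    ("multinomial", MultinomialNB())]:
    acc = cross_val_score(model, counts, labels, cv=5).mean()
    print(f"{name:>22s} naive Bayes accuracy: {acc:.2f}")
```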
Abstract:
In 2000 a Review of Current Marine Observations in relation to present and future needs was undertaken by the Inter-Agency Committee for Marine Science and Technology (IACMST). The Marine Environmental Change Network (MECN) was initiated in 2002 as a direct response to the recommendations of the report. A key part of the current phase of the MECN is to ensure that information from the network is provided to policy makers and other end-users to enable them to produce more accurate assessments of ecosystem state and gain a clearer understanding of factors influencing change in marine ecosystems. The MECN holds workshops on an annual basis, bringing together partners maintaining time-series and long-term datasets as well as end-users interested in outputs from the network. It was decided that the first workshop of the MECN continuation phase should consist of an evaluation of the time series and data sets maintained by partners in the MECN with regard to their fitness for purpose for answering key science questions and informing policy development. This report is based on the outcomes of the workshop. Section one of the report contains a brief introduction to monitoring, time series and long-term datasets. The various terms are defined and the need for MECN-type data to complement compliance monitoring programmes is discussed. Outlines are also given of initiatives such as the United Kingdom Marine Monitoring and Assessment Strategy (UKMMAS) and Oceans 2025. Section two contains detailed information for each of the MECN time series and long-term datasets, including information on scientific outputs and current objectives. This information is mainly based on the presentations given at the workshop and therefore follows a format in which the following headings are addressed: origin of the time series, including original objectives; current objectives; policy relevance; and products (advice, publications, science and society). Section three consists of comments made by the review panel concerning all the time series and the network. Needs or issues highlighted by the panel with regard to the future of long-term datasets and time series in the UK are shown, along with advice and potential solutions where offered. The recommendations are divided into four categories: ‘The MECN and end-user requirements’; ‘Procedures & protocols’; ‘Securing data series’; and ‘Future developments’. Ever since marine environmental protection issues came to the fore in the 1960s, it has been recognised that a suitable evidence base on environmental change is required to support policy and management for UK waters. Section four gives a brief summary of the development of marine policy in the UK, along with comments on the availability and necessity of long-term marine observations for the implementation of this policy. Policy relating to three main areas is discussed: marine conservation (protecting biodiversity and marine ecosystems), marine pollution, and fisheries. The conclusion of this section is that there has always been a specific requirement for information on long-term change in marine ecosystems around the UK in order to address concerns over pollution, fishing and general conservation. It is now imperative that this need is addressed in order for the UK to be able to fulfil its policy commitments and manage marine ecosystems in the light of climate change and other factors.
Abstract:
The contract work has demonstrated that older data can be assessed and entered into the MR format. Older data has associated problems but is retrievable. The contract successfully imported all datasets as required. MNCR survey sheets fit well into the MR format. The data validation and verification process can be improved. A number of computerised short cuts can be suggested and the process made more intuitive. Such a move is vital if MR is to be adopted as a standard by the recording community both on a voluntary level and potentially by consultancies.
Abstract:
This work demonstrates an example of the importance of an adequate method for sub-sampling model results when comparing with in situ measurements. A test of model skill was performed by employing a point-to-point method to compare a multi-decadal hindcast against a sparse, unevenly distributed historic in situ dataset. The point-to-point method masked out all hindcast cells that did not have a corresponding in situ measurement, in order to match each in situ measurement against its most similar cell from the model. The application of the point-to-point method showed that the model was successful at reproducing the inter-annual variability of the in situ datasets. However, this success was not apparent when the measurements were aggregated to regional averages. Time series, data density and target diagrams were employed to illustrate the impact of switching from the regional average method to the point-to-point method. The comparison based on regional averages gave significantly different and sometimes contradictory results that could lead to erroneous conclusions about the model performance. Furthermore, the point-to-point technique is a more appropriate way to exploit sparse, uneven in situ data while compensating for the variability of its sampling. We therefore recommend that researchers take into account the limitations of the in situ datasets and process the model output to resemble the data as much as possible.
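The following short sketch (illustrative of the general idea, not the paper's code; the grid, field and observations are synthetic assumptions) contrasts point-to-point matching, where the model is sampled only at the cell nearest each observation, with a plain regional average.

```python
# Minimal sketch (illustrative only): compare a model field to sparse observations
# by nearest-cell (point-to-point) matching versus a simple regional average.
import numpy as np

rng = np.random.default_rng(5)

# Toy model field on a 1-degree grid and a sparse, uneven set of observations.
lats = np.arange(50.0, 60.0, 1.0)
lons = np.arange(-10.0, 2.0, 1.0)
model = 10.0 + 0.5 * lats[:, None] + 0.1 * lons[None, :] + rng.normal(0, 0.2, (lats.size, lons.size))

obs_lat = rng.uniform(50.0, 59.0, 40)
obs_lon = rng.uniform(-10.0, 1.0, 40)
obs_val = 10.0 + 0.5 * obs_lat + 0.1 * obs_lon + rng.normal(0, 0.5, 40)

# Point-to-point: sample the model only where there is an observation.
i = np.abs(lats[:, None] - obs_lat[None, :]).argmin(axis=0)
j = np.abs(lons[:, None] - obs_lon[None, :]).argmin(axis=0)
matched = model[i, j]

print("point-to-point bias:  ", round(float((matched - obs_val).mean()), 3))
print("regional-average bias:", round(float(model.mean() - obs_val.mean()), 3))
```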
Abstract:
The role of the ocean in the cycling of oxygenated volatile organic compounds (OVOCs) remains largely unresolved due to a paucity of datasets. We describe the development of a membrane inlet-proton transfer reaction/mass spectrometry (MI-PTR/MS) method for the efficient analysis of methanol, acetaldehyde and acetone in seawater. Validation of the technique with water standards shows that the optimised responses are linear and reproducible. Limits of detection are 27 nM for methanol, 0.7 nM for acetaldehyde and 0.3 nM for acetone. Acetone and acetaldehyde concentrations generated by MI-PTR/MS are compared to those from a second, independent method based on purge and trap-gas chromatography/flame ionisation detection (P&T-GC/FID) and show excellent agreement. Chromatographic separation of the isomeric species acetone and propanal permits correction of the mass 59 signal generated by the PTR/MS and overcomes a known uncertainty in reporting acetone concentrations via mass spectrometry. A third, bioassay-based technique using radiolabelled acetone further supports the results generated by this method. We present the development and optimisation of the MI-PTR/MS technique as a reliable and convenient tool for analysing seawater samples for these trace gases. We compare this method with other analytical techniques and discuss its potential use in improving the current understanding of the cycling of oceanic OVOCs.
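As background on how such detection limits are commonly derived, the sketch below (with entirely hypothetical calibration numbers, not the paper's data) fits a linear calibration of signal against standard concentration and applies the common 3 × (residual standard deviation) / slope convention.

```python
# Minimal sketch (hypothetical numbers): linear calibration fit and a limit of
# detection estimated as 3 * residual standard deviation / slope.
import numpy as np

rng = np.random.default_rng(6)

conc_nM = np.array([0, 5, 10, 25, 50, 100], dtype=float)           # assumed standards
signal = 40.0 + 2.5 * conc_nM + rng.normal(0, 4.0, conc_nM.size)   # assumed response

slope, intercept = np.polyfit(conc_nM, signal, 1)
residual_sd = np.std(signal - (slope * conc_nM + intercept), ddof=2)
lod = 3.0 * residual_sd / slope

print(f"slope = {slope:.2f} counts/nM, LOD ~ {lod:.1f} nM")
```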
Abstract:
Observations of Earth from space have been made for over 40 years and have contributed to advances in many aspects of climate science. However, attempts to exploit this wealth of data are often hampered by a lack of homogeneity and continuity and by insufficient understanding of the products and their uncertainties. There is, therefore, a need to reassess and reprocess satellite datasets to maximize their usefulness for climate science. The European Space Agency has responded to this need by establishing the Climate Change Initiative (CCI). The CCI will create new climate data records for (currently) 13 essential climate variables (ECVs) and make these open and easily accessible to all. Each ECV project works closely with users to produce time series from the available satellite observations relevant to users' needs. A climate modeling users' group provides a climate system perspective and a forum to bring the data and modeling communities together. This paper presents the CCI program. It outlines its benefits and presents approaches and challenges for each ECV project, covering clouds, aerosols, ozone, greenhouse gases, sea surface temperature, ocean color, sea level, sea ice, land cover, fire, glaciers, soil moisture, and ice sheets. It also discusses how the CCI approach may contribute to defining and shaping future developments in Earth observation for climate science.
Abstract:
Differential phenological responses to climate among species are predicted to disrupt trophic interactions, but datasets to evaluate this are scarce. We compared phenological trends for species from 4 levels of a North Sea food web over 24 yr when sea surface temperature (SST) increased significantly. We found little consistency in phenological trends between adjacent trophic levels, no significant relationships with SST, and no significant pairwise correlations between predator and prey phenologies, suggesting that trophic mismatching is occurring. Finer resolution data on timing of peak energy demand (mid-chick-rearing) for 5 seabird species at a major North Sea colony were compared to modelled daily changes in length of 0-group (young of the year) lesser sandeels Ammodytes marinus. The date at which sandeels reached a given threshold length became significantly later during the study. Although the phenology of all the species except shags also became later, these changes were insufficient to keep pace with sandeel length, and thus mean length (and energy value) of 0-group sandeels at mid-chick-rearing showed net declines. The magnitude of declines in energy value varied among the seabirds, being more marked in species showing no phenological response (shag, 4.80 kJ) and in later breeding species feeding on larger sandeels (kittiwake, 2.46 kJ) where, because the relationship between sandeel length and energy value is non-linear, small reductions in length result in relatively large reductions in energy. However, despite the decline in energy value of 0-group sandeels during chick-rearing, there was no evidence of any adverse effect on breeding success for any of the seabird species. Trophic mismatch appears to be prevalent within the North Sea pelagic food web, suggesting that ecosystem functioning may be disrupted.
Abstract:
Following the publication of our paper (Attrill et al. 2007), we quickly became aware of a couple of errors. We have subsequently been collaborating with Dr. Chris Lynam (Lynam et al. 2004, 2005) to bring together our two datasets, explore the common patterns within our data, and attempt to provide a consensus on how climate is affecting gelatinous plankton in the North Sea. During this reanalysis, two errors within the data were discovered, one involving a transcription error of a column of residuals during de-trended analysis, the other a major data entry error deep in the Continuous Plankton Recorder (CPR) database for sector B2. Here we present a revised version of table 1 from Attrill et al. (2007) to incorporate corrections to these transcription and data entry errors. These corrections alter some of the results in our original data table, mainly by increasing the number and strength of the significant relations we found (e.g., for sector B2 and the whole sea area); all previous main results remain robustly significant. Following discussions with Dr. Lynam, two clarifications of statements made in Attrill et al. (2007) are also required. Page 482, Results, last line of first column: "There were no...robust, consistent relations between jellyfish frequency and any environmental variables for B and D… contrary to the findings of previous shorter time series (Lynam et al. 2005)." The Lynam et al. (2004, 2005) papers presented no data for the D sector and found no link in the B sector, contrary to our revised results. Page 482, Discussion, paragraph 1, last sentence: "… positive association … North of Scotland (Lynam et al. 2005) … does not appear to be maintained." Our paper did not report on any data that covered Lynam et al.'s (2005) North of Scotland area, so the statement is not directly supported, although their positive relation North of Scotland, when considered in conjunction with inflow, may agree with the C2 and B2 results of Attrill et al. (2007).
Abstract:
Agglomerative cluster analyses encompass many techniques, which have been widely used in various fields of science. In biology, and specifically ecology, datasets are generally highly variable and may contain outliers, which makes it more difficult to identify the number of clusters. Here we present a new criterion to determine statistically the optimal level of partition in a classification tree. The robustness of the criterion is tested against perturbed data (outliers) using an observation or variable with randomly generated values. The technique, called the Random Simulation Test (RST), is tested on (1) the well-known Iris dataset [Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Ann. Eugenic. 7, 179–188], (2) simulated data with predetermined numbers of clusters following Milligan and Cooper [Milligan, G.W., Cooper, M.C., 1985. An examination of procedures for determining the number of clusters in a data set. Psychometrika 50, 159–179], and finally (3) real copepod community data previously analyzed in Beaugrand et al. [Beaugrand, G., Ibanez, F., Lindley, J.A., Reid, P.C., 2002. Diversity of calanoid copepods in the North Atlantic and adjacent seas: species associations and biogeography. Mar. Ecol. Prog. Ser. 232, 179–195]. The technique is compared to several standard techniques. RST generally performed better than existing algorithms on simulated data and proved to be especially efficient with highly variable datasets.
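The sketch below gives a loose illustration of the underlying idea of testing a partition's robustness against a randomly generated variable; it is not the RST algorithm itself, and the clustering method, stability measure (adjusted Rand index) and candidate cut levels are assumptions. The Iris data are used simply because they are cited above.

```python
# Minimal sketch (illustrative only, not RST): for each candidate number of clusters,
# add a randomly generated variable, re-cluster, and measure partition stability.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(7)
X = load_iris().data

def partition(data, k):
    """Cut a Ward hierarchical clustering tree into k clusters."""
    return fcluster(linkage(data, method="ward"), t=k, criterion="maxclust")

for k in range(2, 7):
    base = partition(X, k)
    scores = []
    for _ in range(20):                      # 20 random perturbations
        noise = rng.normal(X.mean(), X.std(), size=(X.shape[0], 1))
        scores.append(adjusted_rand_score(base, partition(np.hstack([X, noise]), k)))
    print(f"k = {k}: mean stability (ARI) = {np.mean(scores):.2f}")
```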