902 results for Large Data Sets
Abstract:
BACKGROUND: Determining the evolutionary relationships among the major lineages of extant birds has been one of the biggest challenges in systematic biology. To address this challenge, we assembled or collected the genomes of 48 avian species spanning most orders of birds, including all Neognathae and two of the five Palaeognathae orders. We used these genomes to construct a genome-scale avian phylogenetic tree and perform comparative genomic analyses. FINDINGS: Here we present the datasets associated with the phylogenomic analyses, which include sequence alignment files consisting of nucleotides, amino acids, indels, and transposable elements, as well as tree files containing gene trees and species trees. Inferring an accurate phylogeny required generating: 1) A well annotated data set across species based on genome synteny; 2) Alignments with unaligned or incorrectly overaligned sequences filtered out; and 3) Diverse data sets, including genes and their inferred trees, indels, and transposable elements. Our total evidence nucleotide tree (TENT) data set (consisting of exons, introns, and UCEs) gave what we consider our most reliable species tree when using the concatenation-based ExaML algorithm or when using statistical binning with the coalescence-based MP-EST algorithm (which we refer to as MP-EST*). Other data sets, such as the coding sequence of some exons, revealed other properties of genome evolution, namely convergence. CONCLUSIONS: The Avian Phylogenomics Project is the largest vertebrate phylogenomics project to date that we are aware of. The sequence, alignment, and tree data are expected to accelerate analyses in phylogenomics and other related areas.
Abstract:
PURPOSE: X-ray computed tomography (CT) is widely used, both clinically and preclinically, for fast, high-resolution anatomic imaging; however, compelling opportunities exist to expand its use in functional imaging applications. For instance, spectral information combined with nanoparticle contrast agents enables quantification of tissue perfusion levels, while temporal information details cardiac and respiratory dynamics. The authors propose and demonstrate a projection acquisition and reconstruction strategy for 5D CT (3D+dual energy+time) which recovers spectral and temporal information without substantially increasing radiation dose or sampling time relative to anatomic imaging protocols. METHODS: The authors approach the 5D reconstruction problem within the framework of low-rank and sparse matrix decomposition. Unlike previous work on rank-sparsity constrained CT reconstruction, the authors establish an explicit rank-sparse signal model to describe the spectral and temporal dimensions. The spectral dimension is represented as a well-sampled time and energy averaged image plus regularly undersampled principal components describing the spectral contrast. The temporal dimension is represented as the same time and energy averaged reconstruction plus contiguous, spatially sparse, and irregularly sampled temporal contrast images. Using a nonlinear, image-domain filtration approach that the authors refer to as rank-sparse kernel regression, the authors transfer image structure from the well-sampled time and energy averaged reconstruction to the spectral and temporal contrast images. This regularization strategy strictly constrains the reconstruction problem while approximately separating the temporal and spectral dimensions. Separability results in a highly compressed representation for the 5D data in which projections are shared between the temporal and spectral reconstruction subproblems, enabling substantial undersampling. The authors solved the 5D reconstruction problem using the split Bregman method and GPU-based implementations of backprojection, reprojection, and kernel regression. Using a preclinical mouse model, the authors apply the proposed algorithm to study myocardial injury following radiation treatment of breast cancer. RESULTS: Quantitative 5D simulations are performed using the MOBY mouse phantom. Twenty data sets (ten cardiac phases, two energies) are reconstructed with 88 μm isotropic voxels from 450 total projections acquired over a single 360° rotation. In vivo 5D myocardial injury data sets acquired in two mice injected with gold and iodine nanoparticles are also reconstructed with 20 data sets per mouse using the same acquisition parameters (dose: ∼60 mGy). For both the simulations and the in vivo data, the reconstruction quality is sufficient to perform material decomposition into gold and iodine maps to localize the extent of myocardial injury (gold accumulation) and to measure cardiac functional metrics (vascular iodine). Their 5D CT imaging protocol represents a 95% reduction in radiation dose per cardiac phase and energy and a 40-fold decrease in projection sampling time relative to their standard imaging protocol. CONCLUSIONS: Their 5D CT data acquisition and reconstruction protocol efficiently exploits the rank-sparse nature of spectral and temporal CT data to provide high-fidelity reconstruction results without increased radiation dose or sampling time.
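The low-rank plus sparse split at the heart of this strategy can be illustrated in isolation. Below is a minimal numpy sketch of the generic robust-PCA decomposition, not the authors' 5D reconstruction (which works with projection data, split Bregman iterations, and GPU kernels); the threshold heuristics and the fixed iteration count are illustrative assumptions.

import numpy as np

def soft_threshold(x, tau):
    # Element-wise soft thresholding: proximal operator of the l1 norm.
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def svd_threshold(x, tau):
    # Singular value thresholding: proximal operator of the nuclear norm.
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return u @ np.diag(soft_threshold(s, tau)) @ vt

def low_rank_plus_sparse(M, lam=None, mu=None, n_iter=200):
    # Decompose M into L (low rank) + S (sparse) with a simple inexact
    # augmented-Lagrangian scheme (classic robust PCA).
    m, n = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = mu if mu is not None else 0.25 * m * n / (np.abs(M).sum() + 1e-12)
    L = np.zeros_like(M, dtype=float)
    S = np.zeros_like(M, dtype=float)
    Y = np.zeros_like(M, dtype=float)
    for _ in range(n_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)   # nuclear-norm step
        S = soft_threshold(M - L + Y / mu, lam / mu)  # l1 step
        Y = Y + mu * (M - L - S)                      # dual variable update
    return L, S

In the 5D setting one might, for example, stack each vectorized phase/energy image as a column of M, so that L captures the shared time- and energy-averaged structure while S carries the temporal and spectral contrast; that mapping is a simplification of the authors' explicit signal model.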
Abstract:
The SB distributional model of Johnson's 1949 paper was introduced by a transformation to normality, that is, z ~ N(0, 1), consisting of a linear scaling to the range (0, 1), a logit transformation, and an affine transformation, z = γ + δu. The model, in its original parameterization, has often been used in forest diameter distribution modelling. In this paper, we define the SB distribution in terms of the inverse transformation from normality, including an initial linear scaling transformation, u = γ′ + δ′z (δ′ = 1/δ and γ′ = −γ/δ). The SB model in terms of the new parameterization is derived, and maximum likelihood estimation schemes are presented for both model parameterizations. The statistical properties of the two alternative parameterizations are compared empirically on 20 data sets of diameter distributions of Changbai larch (Larix olgensis Henry). The new parameterization is shown to be statistically better than Johnson's original parameterization for the data sets considered here.
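For readers wanting the two parameterizations side by side, the transformations described above can be written out as follows (using ξ for the location and λ for the range of the bounded variable is a notational assumption; the abstract only states that the data are linearly scaled to (0, 1)):

% Johnson's (1949) direction: data to normality
y = \frac{x - \xi}{\lambda}, \qquad u = \ln\!\frac{y}{1 - y}, \qquad z = \gamma + \delta u \sim N(0, 1)

% Inverse (new) parameterization: normality to data
u = \gamma' + \delta' z, \qquad \delta' = \frac{1}{\delta}, \qquad \gamma' = -\frac{\gamma}{\delta}, \qquad x = \xi + \lambda\,\frac{e^{u}}{1 + e^{u}}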
Abstract:
In this paper, the buildingEXODUS evacuation model is described and discussed and attempts at qualitative and quantitative model validation are presented. The data sets used for validation are the Stapelfeldt and Milburn House evacuation data. As part of the validation exercise, the sensitivity of the buildingEXODUS predictions to a range of variables is examined, including occupant drive, occupant location, exit flow capacity, exit size, occupant response times and geometry definition. An important consideration that has been highlighted by this work is that any validation exercise must be scrutinised to identify both the results generated and the considerations and assumptions on which they are based. During the course of the validation exercise, both data sets were found to be less than ideal for the purpose of validating complex evacuation models. However, the buildingEXODUS evacuation model was found to be able to produce reasonable qualitative and quantitative agreement with the experimental data.
Abstract:
This document is the first of three iterations of the DMP that will be formally delivered during the project. Version 2 is due in month 24 and version 3 towards the end of the project. The DMP is thus not a fixed document; it evolves and gains more precision and substance during the lifespan of the project. In this first version we describe the planned research data sets related to the RAGE evaluation and validation activities, and the fifteen principles that will guide data management in RAGE. The former are described in the format of the EU data management template; the latter are described in terms of the guiding principle itself, how we propose to implement it, and when it will be implemented. This document is therefore primarily relevant to WP5 and WP8 members.
Abstract:
Several environmental/physical variables derived from satellite and in situ data sets were used to understand the variability of coccolithophore abundance in the subarctic North Atlantic. The 7-yr (1997–2004) time-series analysis showed that the combined effects of high solar radiation, shallow mixed layer depth (<20 m), and increased temperatures explained >89% of the coccolithophore variation. The June 1998 bloom, which was associated with high light intensity, unusually high sea-surface temperature, and a very shallow mixed layer, was found to be one of the most extensive (>995,000 km2) blooms ever recorded. There was a pronounced sea-surface temperature shift in the mid-1990s with a peak in 1998, suggesting that exceptionally large blooms are caused by pronounced environmental conditions and that the variability of the physical environment strongly affects the spatial extent of these blooms. Consequently, if the physical environment varies, the effects of these blooms on the atmospheric and oceanic environment will vary as well.
Abstract:
Size-fractionated filtration (SFF) is a direct method for estimating pigment concentration in various size classes. It is also common practice to infer the size structure of phytoplankton communities from diagnostic pigments estimated by high-performance liquid chromatography (HPLC). In this paper, the three-component model of Brewin et al. (2010) was fitted to coincident data from HPLC and from SFF collected along Atlantic Meridional Transect cruises. The model accounted for the variability in each data set, but the fitted model parameters differed between the two data sets. Both HPLC and SFF data supported the conceptual framework of the three-component model, which assumes that the chlorophyll concentration in small cells increases to an asymptotic maximum, beyond which further increase in chlorophyll is achieved by the addition of larger-celled phytoplankton. The three-component model was extended to a multicomponent model of size structure using observed relationships between model parameters and assuming that the asymptotic concentration that can be reached by cells increases linearly with the upper bound on the cell size. The multicomponent model was verified using independent SFF data for a variety of size fractions and was found to perform well (0.628 ≤ r ≤ 0.989), lending support to the underlying assumptions. An advantage of the multicomponent model over the three-component model is that, for the same number of parameters, it can be applied to any size range in a continuous fashion. The multicomponent model provides a useful tool for studying the distribution of phytoplankton size structure at large scales.
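As an illustration of the conceptual framework described above, here is a minimal sketch assuming the exponential-saturation form usually quoted for the Brewin et al. (2010) three-component model; the function name and all parameter values are placeholders and are not the parameters fitted in this study.

import numpy as np

def three_component(chl_total, c_pn_max=1.06, d_pn=0.85, c_p_max=0.11, d_p=6.8):
    # Partition total chlorophyll-a into pico-, nano- and micro-phytoplankton
    # using an exponential-saturation (asymptotic maximum) form.
    chl_total = np.asarray(chl_total, dtype=float)
    c_pn = c_pn_max * (1.0 - np.exp(-d_pn * chl_total))  # combined pico+nano, saturates at c_pn_max
    c_p = c_p_max * (1.0 - np.exp(-d_p * chl_total))     # picophytoplankton, saturates at c_p_max
    c_n = c_pn - c_p                                     # nanophytoplankton
    c_m = chl_total - c_pn                               # microphytoplankton takes the remainder
    return c_p, c_n, c_m

The multicomponent extension described in the abstract generalizes this idea by letting the asymptotic concentration grow linearly with the upper bound of the size class, so the same functional form can be evaluated for any chosen size range.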
Abstract:
Charge exchange X-ray and far-ultraviolet (FUV) aurorae can provide detailed insight into the interaction between solar system plasmas. Using the two complementary experimental techniques of photon emission spectroscopy and translational energy spectroscopy, we have studied state-selective charge exchange in collisions between fully ionized helium and target gases characteristic of cometary and planetary atmospheres (H2O, CO2, CO, and CH4). The experiments were performed at velocities typical for the solar wind (200-1500 km s(-1)). Data sets are produced that can be used for modeling the interaction of solar wind alpha particles with cometary and planetary atmospheres. These data sets are used to demonstrate the diagnostic potential of helium line emission. Existing Extreme Ultraviolet Explorer (EUVE) observations of comets Hyakutake and Hale-Bopp are analyzed in terms of solar wind and coma characteristics. The case of Hale-Bopp illustrates well the dependence of the helium line emission on the collision velocity. For Hale-Bopp, our model requires low velocities in the interaction zone. We interpret this as the effect of severe post-bow-shock cooling in this extraordinarily large comet.
Abstract:
The SuperWASP project is an ultra-wide-angle search for extrasolar planetary transits. However, it can also serendipitously detect solar system objects, such as asteroids and comets. Each SuperWASP instrument consists of up to eight cameras, combined with high-quality Peltier-cooled CCDs, which photometrically survey large numbers of stars in the magnitude range 7–15. Each camera covers a 7.8 × 7.8 degree field of view. Located on La Palma, the SuperWASP-I instrument has been observing the Northern Hemisphere with five cameras since its inauguration in April 2004. The ultra-wide-angle field of view gives SuperWASP the possibility of discovering new fast-moving (near-Earth) asteroids that could have been missed by other instruments. It also provides an excellent opportunity to produce a magnitude-limited lightcurve survey of known main-belt asteroids. As slow-moving asteroids stay within a single SuperWASP field for several weeks, and may be seen in many fields, a survey of all objects brighter than magnitude 15 is possible. This will provide a significant increase in the total number of lightcurves available for statistical studies without the inherent bias against longer periods present in the current data sets. We present the methodology used in the automated collection of asteroid data from SuperWASP and some of the first examples of lightcurves from numbered asteroids.
Abstract:
A goal of phylogeography is to relate patterns of genetic differentiation to potential historical geographic isolating events. Quaternary glaciations, particularly the one culminating in the Last Glacial Maximum ~21 ka (thousands of years ago), greatly affected the distributions and population sizes of temperate marine species as their ranges retreated southward to escape ice sheets. Traditional genetic models of glacial refugia and routes of recolonization include these predictions: low genetic diversity in formerly glaciated areas, with a small number of alleles/haplotypes dominating disproportionately large areas, and high diversity including "private" alleles in glacial refugia. In the Northern Hemisphere, low diversity in the north and high diversity in the south are expected. This simple model does not account for the possibility of populations surviving in relatively small northern periglacial refugia. If these periglacial populations experienced extreme bottlenecks, they could have the low genetic diversity expected in recolonized areas with no refugia, but should have more endemic diversity (private alleles) than recently recolonized areas. This review examines evidence of putative glacial refugia for eight benthic marine taxa in the temperate North Atlantic. All data sets were reanalyzed to allow direct comparisons between geographic patterns of genetic diversity and distribution of particular clades and haplotypes including private alleles. We contend that for marine organisms the genetic signatures of northern periglacial and southern refugia can be distinguished from one another. There is evidence for several periglacial refugia in northern latitudes, giving credence to recent climatic reconstructions with less extensive glaciation.
Abstract:
Studies concerning the physiological significance of Ca2+ sparks often depend on the detection and measurement of large populations of events in noisy microscopy images. Automated detection methods have been developed to quickly and objectively distinguish potential sparks from noise artifacts. However, previously described algorithms are not suited to the reliable detection of sparks in images where the local baseline fluorescence and noise properties can vary significantly, and risk introducing additional bias when applied to such data sets. Here, we describe a new, conceptually straightforward approach to spark detection in linescans that addresses this issue by combining variance stabilization with local baseline subtraction. We also show that in addition to greatly increasing the range of images in which sparks can be automatically detected, the use of a more accurate noise model enables our algorithm to achieve similar detection sensitivities with fewer false positives than previous approaches when applied both to synthetic and experimental data sets. We propose, therefore, that it might be a useful tool for improving the reliability and objectivity of spark analysis in general, and describe how it might be further optimized for specific applications.
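The two ingredients named above, variance stabilization and local baseline subtraction, can be sketched as follows for a space-by-time linescan array. This is not the authors' algorithm: the Anscombe transform, the rolling 20th-percentile baseline along the time axis, and the MAD-based threshold are illustrative stand-ins for the noise model and baseline estimator they describe.

import numpy as np
from scipy.ndimage import percentile_filter, gaussian_filter, label

def detect_spark_candidates(linescan, baseline_window=(1, 101), z_thresh=3.8):
    # Flag candidate Ca2+ spark pixels in a linescan image (space x time).
    img = np.asarray(linescan, dtype=float)
    stabilized = 2.0 * np.sqrt(img + 3.0 / 8.0)              # Anscombe variance stabilization
    baseline = percentile_filter(stabilized, percentile=20,
                                 size=baseline_window)       # local baseline along the time axis
    residual = stabilized - baseline
    mad = np.median(np.abs(residual - np.median(residual)))
    noise_sigma = 1.4826 * mad                               # robust noise estimate
    smoothed = gaussian_filter(residual, sigma=1.0)
    mask = smoothed > z_thresh * noise_sigma
    labels, n_candidates = label(mask)                       # connected candidate regions
    return labels, n_candidates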
Abstract:
A robust method for fitting to the results of gel electrophoresis assays of damage to plasmid DNA caused by radiation is presented. This method makes use of nonlinear regression to fit analytically derived dose-response curves to observations of the supercoiled, open circular and linear plasmid forms simultaneously, allowing for more accurate results than fitting to individual forms. Comparisons with a commonly used analysis method show that, while the benefit is relatively small for data sets with small errors, the parameters generated by this method remain much more closely distributed around the true value in the face of increasing measurement uncertainties. This allows parameters to be specified with greater confidence, reflected in a reduction of the errors on fitted parameters. On test data sets, fitted uncertainties were reduced by 30%, similar to the improvement that would be offered by moving from triplicate to fivefold repeats (assuming standard errors). This method has been implemented in a popular spreadsheet package and made available online to improve its accessibility. (C) 2011 by Radiation Research Society
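For illustration only, one simple way to set up such a simultaneous fit is shown below, assuming Poisson-distributed single-strand breaks (yield alpha per unit dose) and double-strand breaks (yield beta); this toy dose-response model and the scipy-based fit are assumptions for the sketch, not the analytically derived curves or spreadsheet implementation of the paper.

import numpy as np
from scipy.optimize import curve_fit

def plasmid_fractions(dose, alpha, beta):
    # Fractions of supercoiled (s), linear (l) and open-circular (c) plasmid.
    s = np.exp(-(alpha + beta) * dose)       # no breaks of either kind
    l = beta * dose * np.exp(-beta * dose)   # exactly one double-strand break
    c = 1.0 - s - l                          # remainder; multi-DSB fragments assumed negligible
    return np.concatenate([s, l, c])

def fit_all_forms(dose, frac_s, frac_l, frac_c):
    # Simultaneous nonlinear fit of shared parameters to all three band intensities.
    y = np.concatenate([frac_s, frac_l, frac_c])
    popt, pcov = curve_fit(plasmid_fractions, dose, y, p0=[0.1, 0.01])
    perr = np.sqrt(np.diag(pcov))            # standard errors on alpha and beta
    return popt, perr

Fitting all three forms against shared parameters is what couples the equations and, as the abstract argues, is what tightens the parameter uncertainties relative to fitting each form separately.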
Abstract:
Recent years have witnessed rapidly increasing interest in the topic of incremental learning. Unlike in conventional machine learning settings, the data targeted by incremental learning become available continuously over time. Accordingly, it is desirable to be able to abandon the traditional assumption that representative training data are available during a training period for developing decision boundaries. Under scenarios of continuous data flow, the challenge is how to transform the vast amount of raw stream data into information and knowledge representations, and to accumulate experience over time to support future decision-making. In this paper, we propose a general adaptive incremental learning framework named ADAIN that is capable of learning from continuous raw data, accumulating experience over time, and using such knowledge to improve future learning and prediction performance. Detailed system-level architecture and design strategies are presented in this paper. Simulation results on several real-world data sets are used to validate the effectiveness of this method.
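The core loop the paragraph describes, consume a chunk of the stream, update the model, and carry the accumulated state forward, can be sketched with scikit-learn's partial_fit API. This is not the ADAIN framework itself; it is a generic test-then-train streaming loop, and SGDClassifier is an arbitrary choice of base learner used only for illustration.

import numpy as np
from sklearn.linear_model import SGDClassifier

def stream_learn(chunks, classes):
    # chunks: an iterable of (X_chunk, y_chunk) pairs arriving over time.
    # classes: array of all possible labels, required on the first update.
    model = SGDClassifier()
    accuracies = []
    for i, (X_chunk, y_chunk) in enumerate(chunks):
        if i > 0:
            accuracies.append(model.score(X_chunk, y_chunk))  # test on unseen data first
        model.partial_fit(X_chunk, y_chunk, classes=classes)  # then incorporate the chunk
    return model, accuracies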
Abstract:
The relationship between biodiversity and ecological processes is currently the focus of considerable research effort, made all the more urgent by the rate of biodiversity loss world-wide. Rigorous experimental approaches to this question have been dominated by terrestrial ecologists, but shallow-water marine systems offer great opportunities by virtue of their relative ease of manipulation, fast response times and well-understood effects of macrofauna on sediment processes. In this paper, we describe a series of experiments whereby species richness has been manipulated in a controlled way and the concentrations of nutrients (ammonium, nitrate and phosphate) in the overlying water measured under these different treatments. The results indicate variable effects of species and location on ecosystem processes, and are discussed in the context of emerging mainstream ecological theory on biodiversity and ecosystem relations. Extensions of the application of the experimental approach to species-rich, large-scale benthic systems are discussed and the potential for novel analyses of existing data sets is highlighted. (C) 2002 Elsevier Science B.V. All rights reserved.
Abstract:
We read with interest the comments offered by Drs. Hughes and Bradley (1) on our systematic review (2). Four single nucleotide polymorphisms (SNPs), rs9332739 and rs547154 in the complement component 2 gene (C2) and rs4151667 and rs641153 in the complement factor B gene (CFB), were pooled. Hughes and Bradley point out that we omitted the most common variant, rs12614. In fact, rs12614 is in high linkage disequilibrium (LD) with rs641153, which was included, and the major allele frequency of both of these SNPs is in the range of 90% (population CEU in the International HapMap Project (http://hapmap.ncbi.nlm.nih.gov/)). Moreover, our review was initiated in September 2010, at which point only 4 studies had published associations with rs12614, whereas 14 studies (n = 11,378) were available for rs641153. While it is true that both SNPs are better analyzed as a haplotype, these data were simply not available for pooling.
Hughes and Bradley also point out that we obtained and pooled new data that were not previously published. While it is recommended that contact with authors be completed as part of a comprehensive meta-analysis, we acknowledge that these additional data were not previously published and peer reviewed and, hence, do not have the same level of transparency. However, given that sample collections often increase over time and that the instrumentation for genotyping is continually improving, we thought that it would be advantageous to use the most recent information; this is a subjective decision.
We also agree that the allele frequencies given by Kaur et al. (3) were exactly opposite to those expected and were suggestive of strand flipping. However, we specifically queried this with the lead author on 2 separate occasions and were assured it was not.
Hughes and Bradley do make an interesting suggestion that SNPs in high LD should be used as a gauge of genotyping quality in HuGE reviews. This is an interesting idea but difficult to put into practice, as the r2 parameter they propose as a measure of LD has some unusual properties. Although r2 is a measure of LD, it is also linked to the allele frequency; even small differences in allele frequencies between 2 linked SNPs can reduce the r2 dramatically. Wray (4) explored these effects and found that, at a baseline allele frequency of 10%, even a difference in allele frequency between 2 SNPs as small as 2% can drop the r2 value below 0.8. This degree of allele frequency difference is consistent with what could be expected from sampling error. Furthermore, when we look at 2 linked diallelic SNPs, giving 4 possible haplotypes, the absence of 1 haplotype dramatically reduces r2, despite the 2 loci being in high LD as measured by D'. In fact, this is the situation for rs12614 and rs641153, where the low frequency of 1 haplotype means that the r2 is 0.01 but the D' is 1.
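To make the last point concrete, here is a small worked computation with hypothetical haplotype frequencies, chosen only so that the haplotype carrying both minor alleles is absent (roughly mimicking the rs12614/rs641153 situation): D' comes out as 1 while r2 is about 0.01.

# Hypothetical haplotype frequencies for two diallelic SNPs whose minor
# alleles never occur on the same haplotype; values are illustrative only.
p_AB, p_Ab, p_aB, p_ab = 0.81, 0.09, 0.10, 0.00   # ab (both minor alleles) is absent

p_A = p_AB + p_Ab        # major allele frequency at SNP 1 (0.90)
p_B = p_AB + p_aB        # major allele frequency at SNP 2 (0.91)
D = p_AB - p_A * p_B     # linkage disequilibrium coefficient (-0.009 here)

# D' rescales |D| by its maximum attainable value given the allele frequencies
D_max = min(p_A * p_B, (1 - p_A) * (1 - p_B)) if D < 0 else min(p_A * (1 - p_B), (1 - p_A) * p_B)
D_prime = abs(D) / D_max

# r2 additionally penalizes mismatched allele frequencies
r2 = D ** 2 / (p_A * (1 - p_A) * p_B * (1 - p_B))

print(f"D' = {D_prime:.2f}, r2 = {r2:.3f}")       # prints D' = 1.00, r2 = 0.011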
Hughes and Bradley also suggest consideration of genotype call rate restrictions as an inclusion criterion for meta-analysis. This would be more appropriate when focusing on genetic variants per se, as considered within the context of a genome-wide association study or other specific genetic analysis where large numbers of SNPs are evaluated (5).
The concerns raised by Hughes and Bradley reflect the limited ability of a meta-analysis based on summary data to tease out inconsistencies best identified at the individual level. We agree that SNPs in LD should be evaluated, but this will not necessarily be straightforward. A move to make genetic data sets publicly available, as in the Database of Genotypes and Phenotypes (http://www.ncbi.nlm.nih.gov/gap), is a step in the right direction for greater transparency.