902 results for Large Data Sets
Abstract:
Microbial communities respond to a variety of environmental factors related to resources (e.g. plant and soil organic matter), habitat (e.g. soil characteristics) and predation (e.g. nematodes, protozoa and viruses). However, the relative contribution of these factors to microbial community composition is poorly understood. Here, we sampled soils from 30 chalk grassland fields located on three different chalk hill ridges of Southern England, using a spatially explicit sampling scheme. We assessed microbial communities via phospholipid fatty acid (PLFA) analyses and PCR-denaturing gradient gel electrophoresis (DGGE) and measured soil characteristics, as well as nematode and plant community composition. The relative influences of space, soil, vegetation and nematodes on soil microorganisms were contrasted using variation partitioning and path analysis. Results indicate that soil characteristics and plant community composition, representing habitat and resources, shape soil microbial community composition, whereas the influence of nematodes, a potential predation factor, appears to be relatively small. Spatial variation in microbial community structure was detected at broad (between fields) and fine (within fields) scales, suggesting that microbial communities exhibit biogeographic patterns at different scales. Although our analysis included several relevant explanatory data sets, a large part of the variation in microbial communities remained unexplained (up to 92% in some analyses). However, in several analyses, significant parts of the variation in microbial community structure could be explained. The results of this study contribute to our understanding of the relative importance of different environmental and spatial factors in driving the composition of soil-borne microbial communities.
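A minimal sketch of the variation-partitioning step, reduced here to partitioning ordinary-least-squares R² of a single community ordination axis among explanatory blocks. The array names, block contents and the reduction to one axis are illustrative assumptions, not the authors' pipeline (which would typically use partial redundancy analysis, e.g. varpart in R's vegan):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def r2(X, y):
    """R-squared of an OLS fit of the response y on the explanatory block X."""
    return LinearRegression().fit(X, y).score(X, y)

# Hypothetical inputs: one ordination axis of the PLFA/DGGE community data (y)
# and three explanatory blocks (soil properties, plant composition, nematodes).
rng = np.random.default_rng(0)
y = rng.normal(size=90)                 # community axis scores, one per sample
soil = rng.normal(size=(90, 4))         # e.g. pH, moisture, C, N
plants = rng.normal(size=(90, 3))       # plant-community ordination axes
nematodes = rng.normal(size=(90, 2))    # nematode-community ordination axes

r_all = r2(np.hstack([soil, plants, nematodes]), y)

# Unique (partial) fraction of each block = full-model R2 minus R2 of the other blocks.
unique_soil = r_all - r2(np.hstack([plants, nematodes]), y)
unique_plant = r_all - r2(np.hstack([soil, nematodes]), y)
unique_nem = r_all - r2(np.hstack([soil, plants]), y)
unexplained = 1.0 - r_all

print(unique_soil, unique_plant, unique_nem, unexplained)
```

The unique fractions and the unexplained remainder correspond to the partitions discussed in the abstract.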
Abstract:
The difference between cirrus emissivities at 8 and 11 μm is sensitive to the mean effective ice crystal size of the cirrus cloud, De. By using single scattering properties of ice crystals shaped as planar polycrystals, diameters of up to about 70 μm can be retrieved, instead of only up to 45 μm when assuming spheres or hexagonal columns. The method described in this article is used for a global determination of mean effective ice crystal sizes of cirrus clouds from TOVS satellite observations. A sensitivity study of the De retrieval to uncertainties in hypotheses on ice crystal shape, size distributions, and temperature profiles, as well as in vertical and horizontal cloud heterogeneities, shows that uncertainties can be as large as 30%. However, the TOVS data set is one of the few data sets that provide global and long-term coverage. Analysis of the years 1987–1991 shows that the measured effective ice crystal diameters De are stable from year to year. For 1990 a global median De of 53.5 μm was determined. Averages distinguishing ocean/land, season, and latitude lie between 23 μm in winter over Northern Hemisphere midlatitude land and 64 μm in the tropics. In general, larger values of De are found in regions with higher atmospheric water vapor and for cirrus with a smaller effective emissivity.
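A hedged sketch of the retrieval principle: the 8–11 μm emissivity difference is mapped onto De through a lookup table derived from single-scattering calculations. The table values below are placeholders, not the planar-polycrystal results used in the study:

```python
import numpy as np

# Hypothetical lookup table: emissivity difference eps(8 um) - eps(11 um) versus
# mean effective ice crystal diameter De (micrometres), as would be produced by
# single-scattering calculations for planar polycrystals.
de_grid = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0])
deps_grid = np.array([0.20, 0.14, 0.10, 0.075, 0.055, 0.042, 0.033])  # decreases with De

def retrieve_de(eps8, eps11):
    """Interpolate De from the observed emissivity difference.

    Returns NaN when the difference falls outside the table, i.e. when the
    retrieval saturates (roughly above ~70 um for polycrystals)."""
    deps = eps8 - eps11
    if deps < deps_grid[-1] or deps > deps_grid[0]:
        return float("nan")
    # np.interp needs increasing x, so reverse the monotonically decreasing table.
    return float(np.interp(deps, deps_grid[::-1], de_grid[::-1]))

print(retrieve_de(0.78, 0.70))  # example observation -> De of a few tens of um
```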
Abstract:
Population size estimation with discrete or nonparametric mixture models is considered, and reliable ways of constructing the nonparametric mixture model estimator are reviewed and set into perspective. The maximum likelihood estimator of the mixing distribution is constructed for any number of components, up to the global nonparametric maximum likelihood bound, using the EM algorithm. In addition, the estimators of Chao and Zelterman are considered, together with some generalisations of Zelterman's estimator. All computations are done with CAMCR, software developed specifically for population size estimation with mixture models. Several examples and data sets are discussed and the estimators illustrated. Problems with the use of the mixture model-based estimators are highlighted.
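For reference, the Chao and Zelterman estimators mentioned above have simple closed forms in terms of the frequencies of units observed once (f1) and twice (f2); a sketch (not the CAMCR implementation) is:

```python
from math import exp

def chao_lower_bound(freq):
    """Chao's lower-bound estimator of population size.

    freq[k] = number of units observed exactly k times (k = 1, 2, ...)."""
    n = sum(freq.values())               # number of distinct units observed
    f1, f2 = freq.get(1, 0), freq.get(2, 0)
    if f2 == 0:
        return float("inf")
    return n + (f1 * f1) / (2.0 * f2)

def zelterman(freq):
    """Zelterman's estimator: a Poisson rate based only on f1 and f2, plugged
    into the zero-truncation correction n / (1 - exp(-lambda))."""
    n = sum(freq.values())
    f1, f2 = freq.get(1, 0), freq.get(2, 0)
    if f1 == 0:
        return float(n)                  # degenerate case: no singletons observed
    lam = 2.0 * f2 / f1
    return n / (1.0 - exp(-lam))

counts = {1: 120, 2: 45, 3: 16, 4: 5}    # illustrative frequency-of-frequencies data
print(chao_lower_bound(counts), zelterman(counts))
```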
Abstract:
It is now possible to assay a large number of genetic markers from patients in clinical trials in order to tailor drugs with respect to efficacy. The statistical methodology for analysing such massive data sets is challenging. The most popular type of statistical analysis is to use a univariate test for each genetic marker once all the data from a clinical study have been collected. This paper presents a sequential method for conducting an omnibus test for detecting gene-drug interactions across the genome, thus allowing informed decisions at the earliest opportunity and overcoming the multiple testing problems that arise from conducting many univariate tests. We first propose an omnibus test for a fixed sample size. This test is based on combining F-statistics that test for an interaction between treatment and individual single nucleotide polymorphisms (SNPs). As SNPs tend to be correlated, we use permutations to calculate a global p-value. We then extend our omnibus test to the sequential case. In order to control the type I error rate, we propose a sequential method that uses permutations to obtain the stopping boundaries. The results of a simulation study show that the sequential permutation method is more powerful than alternative sequential methods that control the type I error rate, such as the inverse-normal method. The proposed method is flexible, as we do not need to assume a mode of inheritance and can also adjust for confounding factors. An application to real clinical data illustrates that the method is computationally feasible for a large number of SNPs. Copyright (c) 2007 John Wiley & Sons, Ltd.
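A minimal sketch of the fixed-sample omnibus test: compute a treatment-by-SNP interaction F-statistic for each SNP, combine them (here by summation, an assumption), and recompute the combined statistic under permutations of the treatment labels to obtain a global p-value. The variable names and toy data are illustrative:

```python
import numpy as np

def interaction_f(y, treat, snp):
    """F-statistic (1 df) for the treatment-by-SNP interaction from nested OLS fits."""
    n = len(y)
    base = np.column_stack([np.ones(n), treat, snp])     # main effects only
    full = np.column_stack([base, treat * snp])          # main effects + interaction
    rss = lambda X: float(np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2))
    rss0, rss1 = rss(base), rss(full)
    return (rss0 - rss1) / (rss1 / (n - full.shape[1]))

def omnibus_p(y, treat, snps, n_perm=1000, seed=0):
    """Global p-value for any gene-drug interaction: per-SNP F-statistics are summed
    and the sum is re-evaluated under permutations of the treatment labels."""
    rng = np.random.default_rng(seed)
    observed = sum(interaction_f(y, treat, snps[:, j]) for j in range(snps.shape[1]))
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(treat)
        stat = sum(interaction_f(y, perm, snps[:, j]) for j in range(snps.shape[1]))
        exceed += stat >= observed
    return (exceed + 1) / (n_perm + 1)

# Toy usage: 100 patients, binary treatment, 50 SNPs coded 0/1/2, continuous response.
rng = np.random.default_rng(5)
treat = rng.integers(0, 2, 100).astype(float)
snps = rng.integers(0, 3, (100, 50)).astype(float)
y = rng.normal(size=100)
print(omnibus_p(y, treat, snps, n_perm=200))
```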
Abstract:
In the Biodiversity World (BDW) project we have created a flexible and extensible Web Services-based Grid environment for biodiversity researchers to solve problems in biodiversity and analyse biodiversity patterns. In this environment, heterogeneous and globally distributed biodiversity-related resources such as data sets and analytical tools are made available to be accessed and assembled by users into workflows to perform complex scientific experiments. One such experiment is bioclimatic modelling of the geographical distribution of individual species using climate variables in order to predict past and future climate-related changes in species distribution. Data sources and analytical tools required for such analysis of species distribution are widely dispersed, available on heterogeneous platforms, present data in different formats and lack interoperability. The BDW system brings all these disparate units together so that the user can combine tools with little thought as to their availability, data formats and interoperability. The current Web Services-based Grid environment enables execution of the BDW workflow tasks on remote nodes, but with a limited scope. The next step in the evolution of the BDW architecture is to enable workflow tasks to utilise computational resources available within and outside the BDW domain. We describe the present BDW architecture and its transition to a new framework which provides a distributed computational environment for mapping and executing workflows, in addition to bringing together heterogeneous resources and analytical tools.
Abstract:
We describe a general likelihood-based 'mixture model' for inferring phylogenetic trees from gene-sequence or other character-state data. The model accommodates cases in which different sites in the alignment evolve in qualitatively distinct ways, but does not require prior knowledge of these patterns or partitioning of the data. We call this qualitative variability in the pattern of evolution across sites "pattern-heterogeneity" to distinguish it both from a homogeneous process of evolution and from one characterized principally by differences in rates of evolution. We present studies to show that the model correctly retrieves the signals of pattern-heterogeneity from simulated gene-sequence data, and we apply the method to protein-coding genes and to a ribosomal 12S data set. The mixture model outperforms conventional partitioning in both these data sets. We implement the mixture model such that it can simultaneously detect rate- and pattern-heterogeneity. The model simplifies to a homogeneous model or a rate-variability model as special cases, and therefore always performs at least as well as these two approaches, and often considerably improves upon them. We make the model available within a Bayesian Markov-chain Monte Carlo framework for phylogenetic inference, as an easy-to-use computer program.
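The central computation in such a mixture is that each site's likelihood is a weighted sum of its likelihoods under the component patterns of evolution. A toy sketch of that step, taking the per-component site log-likelihoods as given (in practice they come from pruning-algorithm calculations on the tree), is shown below:

```python
import numpy as np
from scipy.special import logsumexp

def mixture_log_likelihood(site_lnl, weights):
    """Total log-likelihood of an alignment under a pattern-heterogeneity mixture.

    site_lnl : (n_sites, n_components) array of per-site log-likelihoods, one
               column per component evolutionary pattern (e.g. distinct rate matrices).
    weights  : mixture weights, one per component, summing to one.
    """
    # Each site's likelihood is sum_k w_k * L_k(site); work in log space for stability.
    per_site = logsumexp(site_lnl + np.log(np.asarray(weights)), axis=1)
    return float(per_site.sum())

# Two components, three sites, illustrative per-site log-likelihoods.
lnl = np.array([[-4.2, -6.0], [-3.1, -2.8], [-7.5, -7.9]])
print(mixture_log_likelihood(lnl, [0.6, 0.4]))
```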
Abstract:
Once unit-cell dimensions have been determined from a powder diffraction data set and therefore the crystal system is known (e.g. orthorhombic), the method presented by Markvardsen, David, Johnson & Shankland [Acta Cryst. (2001), A57, 47-54] can be used to generate a table ranking the extinction symbols of the given crystal system according to probability. Markvardsen et al. tested a computer program (ExtSym) implementing the method against Pawley refinement outputs generated using the TF12LS program [David, Ibberson & Matthewman (1992). Report RAL-92-032. Rutherford Appleton Laboratory, Chilton, Didcot, Oxon, UK]. Here, it is shown that ExtSym can be used successfully with many well known powder diffraction analysis packages, namely DASH [David, Shankland, van de Streek, Pidcock, Motherwell & Cole (2006). J. Appl. Cryst. 39, 910-915], FullProf [Rodriguez-Carvajal (1993). Physica B, 192, 55-69], GSAS [Larson & Von Dreele (1994). Report LAUR 86-748. Los Alamos National Laboratory, New Mexico, USA], PRODD [Wright (2004). Z. Kristallogr. 219, 1-11] and TOPAS [Coelho (2003). Bruker AXS GmbH, Karlsruhe, Germany]. In addition, a precise description of the optimal input for ExtSym is given to enable other software packages to interface with ExtSym and to allow the improvement/modification of existing interfacing scripts. ExtSym takes as input the powder data in the form of integrated intensities and error estimates for these intensities. The output returned by ExtSym is demonstrated to be strongly dependent on the accuracy of these error estimates and the reason for this is explained. ExtSym is tested against a wide range of data sets, confirming the algorithm to be very successful at ranking the published extinction symbol as the most likely. (C) 2008 International Union of Crystallography Printed in Singapore - all rights reserved.
Abstract:
We present a novel approach to intrusion detection using an intelligent data classifier based on a self-organizing map (SOM). We survey other unsupervised intrusion detection methods, alternative SOM-based techniques and the KDD Cup winning IDS methods. This paper provides a robustly designed and implemented intelligent data classifier based on a single large (30x30) self-organizing map that is capable of detecting all attack types in the DARPA 1999 archive, achieving a false positive rate as low as 0.04% and a detection rate as high as 99.73% when tested on the full KDD data sets, and a comparable detection rate of 89.54% with a false positive rate as low as 0.18% when tested on the corrected data sets.
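A sketch of the classifier's core idea using the open-source MiniSom library as a stand-in for the authors' implementation: train a 30x30 SOM on normalized connection-record features, label each map unit by the majority class of the training records it wins, and classify new records via their best-matching unit. The feature and label arrays are placeholders for preprocessed KDD'99 data:

```python
from collections import Counter, defaultdict

import numpy as np
from minisom import MiniSom  # third-party SOM library, used as a stand-in here

# Placeholder arrays standing in for preprocessed KDD'99 connection records:
# X holds normalized features (one row per connection), y holds attack/normal labels.
rng = np.random.default_rng(1)
X = rng.random((5000, 41))
y = rng.integers(0, 2, size=5000)       # 0 = normal, 1 = attack (toy labels)

som = MiniSom(30, 30, X.shape[1], sigma=1.5, learning_rate=0.5, random_seed=1)
som.random_weights_init(X)
som.train_random(X, 20000)              # unsupervised training of the map

# Label each map unit by the majority class of the training records it wins.
votes = defaultdict(Counter)
for xi, yi in zip(X, y):
    votes[som.winner(xi)][yi] += 1
unit_label = {unit: c.most_common(1)[0][0] for unit, c in votes.items()}

def classify(record):
    """Assign the label of the record's best-matching unit (default: attack)."""
    return unit_label.get(som.winner(record), 1)

print(classify(X[0]))
```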
Abstract:
This paper introduces a new neurofuzzy model construction and parameter estimation algorithm from observed finite data sets, based on a Takagi and Sugeno (T-S) inference mechanism and a new extended Gram-Schmidt orthogonal decomposition algorithm, for the modeling of a priori unknown dynamical systems in the form of a set of fuzzy rules. The first contribution of the paper is the introduction of a one-to-one mapping between a fuzzy rule-base and a model matrix feature subspace using the T-S inference mechanism. This link enables the numerical properties associated with a rule-based matrix subspace, the relationships amongst these matrix subspaces, and the correlation between the output vector and a rule-base matrix subspace to be investigated and extracted as rule-based knowledge to enhance model transparency. The matrix subspace spanned by a fuzzy rule is initially derived as the input regression matrix multiplied by a weighting matrix that consists of the corresponding fuzzy membership functions over the training data set. Model transparency is explored by deriving an equivalence between an A-optimality experimental design criterion of the weighting matrix and the average model output sensitivity to the fuzzy rule, so that rule-bases can be effectively measured by their identifiability via the A-optimality experimental design criterion. The A-optimality criterion of the weighting matrices of the fuzzy rules is used to construct an initial model rule-base. An extended Gram-Schmidt algorithm is then developed to estimate the parameter vector for each rule. This new algorithm decomposes the model rule-bases via an orthogonal subspace decomposition approach, so as to enhance model transparency with the capability of interpreting the derived rule-base energy level. This new approach is computationally simpler than the conventional Gram-Schmidt algorithm for resolving high-dimensional regression problems, for which it is computationally desirable to decompose a complex model into a few submodels rather than use a single model with a large number of input variables and the associated curse of dimensionality. Numerical examples are included to demonstrate the effectiveness of the proposed new algorithm.
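One plausible reading of the rule-ranking step, sketched below: for each candidate rule, form the weighted regression matrix diag(membership) x P and score it by the A-optimality criterion trace((M^T M)^-1), ranking rules by identifiability. The membership functions, regressors and the exact form of the criterion are illustrative assumptions, not the paper's derivation:

```python
import numpy as np

def rule_a_optimality(P, membership):
    """A-optimality score of one fuzzy rule's weighted regression matrix.

    P          : (N, m) input regression matrix over the training data.
    membership : (N,) firing strengths of the rule for each training sample.

    The rule's matrix subspace is taken as diag(membership) @ P, and the score is
    trace((M.T M)^-1); smaller values suggest a more identifiable rule.
    """
    M = membership[:, None] * P
    return float(np.trace(np.linalg.inv(M.T @ M)))

# Illustrative data: 200 samples, 3 regressors, two candidate rules with
# Gaussian membership functions centred at different operating points.
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=200)
P = np.column_stack([np.ones_like(x), x, x**2])
mu1 = np.exp(-((x - 0.5) ** 2) / 0.1)
mu2 = np.exp(-((x + 0.5) ** 2) / 0.1)

scores = sorted([("rule 1", rule_a_optimality(P, mu1)),
                 ("rule 2", rule_a_optimality(P, mu2))], key=lambda t: t[1])
print(scores)  # rules ranked by identifiability for initial rule-base selection
```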
Abstract:
The ability to display and inspect powder diffraction data quickly and efficiently is a central part of the data analysis process. Whilst many computer programs are capable of displaying powder data, their focus is typically on advanced operations such as structure solution or Rietveld refinement. This article describes a lightweight software package, Jpowder, whose focus is fast and convenient visualization and comparison of powder data sets in a variety of formats from computers with network access. Jpowder is written in Java and uses its associated Web Start technology to allow ‘single-click deployment’ from a web page, http://www.jpowder.org. Jpowder is open source, free and available for use by anyone.
Abstract:
A program is provided to determine structural parameters of atoms in or adsorbed on surfaces by refinement of atomistic models towards experimentally determined data generated by the normal incidence X-ray standing wave (NIXSW) technique. The method employs a combination of Differential Evolution Genetic Algorithms and Steepest Descent Line Minimisations to provide a fast, reliable and user-friendly tool for experimentalists to interpret complex multidimensional NIXSW data sets.
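A hedged sketch of the two-stage search strategy using SciPy's general-purpose optimizers in place of the program's own routines: a global differential-evolution search over structural parameters, followed by a gradient-based local polish standing in for the steepest-descent line minimisations. The forward model and data arrays are toy placeholders:

```python
import numpy as np
from scipy.optimize import differential_evolution, minimize

# Placeholder NIXSW observables: measured coherent fraction and position with errors.
observed = np.array([0.72, 0.31])
errors = np.array([0.03, 0.02])

def model(params):
    """Toy forward model mapping structural parameters (here an adsorption height in
    layer-spacing units and a site occupancy) to a predicted coherent fraction and
    coherent position; a real analysis would use the NIXSW structure-factor sums."""
    height, occupancy = params
    return np.array([occupancy, height % 1.0])

def chi2(params):
    return float(np.sum(((model(params) - observed) / errors) ** 2))

bounds = [(0.0, 3.0), (0.0, 1.0)]        # parameter search ranges

# Stage 1: global search with a differential-evolution genetic algorithm.
de = differential_evolution(chi2, bounds, seed=3, tol=1e-8)

# Stage 2: local refinement from the best DE candidate (gradient-based polish,
# used here in place of the program's steepest-descent line minimisations).
refined = minimize(chi2, de.x, method="L-BFGS-B", bounds=bounds)
print(refined.x, refined.fun)
```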
Abstract:
An updated analysis of observed stratospheric temperature variability and trends is presented on the basis of satellite, radiosonde, and lidar observations. Satellite data include measurements from the series of NOAA operational instruments, including the Microwave Sounding Unit covering 1979–2007 and the Stratospheric Sounding Unit (SSU) covering 1979–2005. Radiosonde results are compared for six different data sets, incorporating a variety of homogeneity adjustments to account for changes in instrumentation and observational practices. Temperature changes in the lower stratosphere show cooling of 0.5 K/decade over much of the globe for 1979–2007, with some differences in detail among the different radiosonde and satellite data sets. Substantially larger cooling trends are observed in the Antarctic lower stratosphere during spring and summer, in association with development of the Antarctic ozone hole. Trends in the lower stratosphere derived from radiosonde data are also analyzed for a longer record (back to 1958); trends for the presatellite era (1958–1978) have a large range among the different homogenized data sets, implying large trend uncertainties. Trends in the middle and upper stratosphere have been derived from updated SSU data, taking into account changes in the SSU weighting functions due to observed atmospheric CO2 increases. The results show mean cooling of 0.5–1.5 K/decade during 1979–2005, with the greatest cooling in the upper stratosphere near 40–50 km. Temperature anomalies throughout the stratosphere were relatively constant during the decade 1995–2005. Long records of lidar temperature measurements at a few locations show reasonable agreement with SSU trends, although sampling uncertainties are large in the localized lidar measurements. Updated estimates of the solar cycle influence on stratospheric temperatures show a statistically significant signal in the tropics (30N–S), with an amplitude (solar maximum minus solar minimum) of 0.5 K (lower stratosphere) to 1.0 K (upper stratosphere).
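The quoted trends are linear fits to anomaly time series expressed per decade; a minimal sketch of that calculation on an illustrative annual series:

```python
import numpy as np

# Illustrative annual-mean lower-stratospheric temperature anomalies (K), 1979-2007;
# the synthetic series imposes a cooling of about -0.5 K/decade plus noise.
years = np.arange(1979, 2008)
rng = np.random.default_rng(4)
anomalies = -0.05 * (years - 1979) + rng.normal(scale=0.3, size=years.size)

# Ordinary least-squares slope in K/yr, reported in the conventional K/decade.
slope_per_year = np.polyfit(years, anomalies, 1)[0]
print(f"trend: {10 * slope_per_year:+.2f} K/decade")
```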
Abstract:
A role for sequential test procedures is emerging in genetic and epidemiological studies using banked biological resources. This stems from the methodology's potential for improved use of information relative to comparable fixed sample designs. Studies in which cost, time and ethics feature prominently are particularly suited to a sequential approach. In this paper sequential procedures for matched case–control studies with binary data will be investigated and assessed. Design issues such as sample size evaluation and error rates are identified and addressed. The methodology is illustrated and evaluated using both real and simulated data sets.
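As an illustration of the sequential idea (not the specific designs evaluated in the paper), a Wald sequential probability ratio test can be applied to the discordant matched pairs as they accrue, with the alternative probability p1 an assumed design parameter:

```python
from math import log

def sprt_matched_pairs(discordant, p1=0.7, alpha=0.05, beta=0.1):
    """Wald SPRT on binary-exposure discordant pairs as they accrue.

    discordant : list of indicators in accrual order; 1 means the case (but not its
                 matched control) was exposed, 0 the reverse.
    Under H0 (no association) P(1) = 0.5; under H1, P(1) = p1 (assumed alternative).
    """
    upper = log((1 - beta) / alpha)      # stop and reject H0
    lower = log(beta / (1 - alpha))      # stop and accept H0
    llr = 0.0
    for i, d in enumerate(discordant, start=1):
        llr += log(p1 / 0.5) if d else log((1 - p1) / 0.5)
        if llr >= upper:
            return "stop: evidence of association", i
        if llr <= lower:
            return "stop: no evidence of association", i
    return "continue sampling", len(discordant)

print(sprt_matched_pairs([1, 1, 0, 1, 1, 1, 0, 1, 1, 1]))
```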
Abstract:
Svalgaard and Cliver (2010) recently reported a consensus between the various reconstructions of the heliospheric field over recent centuries. This is a significant development because, individually, each has uncertainties introduced by instrument calibration drifts, limited numbers of observatories, and the strength of the correlations employed. However, taken collectively, a consistent picture is emerging. We here show that this consensus extends to more data sets and methods than reported by Svalgaard and Cliver, including that used by Lockwood et al. (1999), when their algorithm is used to predict the heliospheric field rather than the open solar flux. One area where there is still some debate relates to the existence and meaning of a floor value to the heliospheric field. From cosmogenic isotope abundances, Steinhilber et al. (2010) have recently deduced that the near-Earth IMF at the end of the Maunder minimum was 1.80 ± 0.59 nT, which is considerably lower than the revised floor of 4 nT proposed by Svalgaard and Cliver. We here combine cosmogenic and geomagnetic reconstructions and modern observations (with allowance for the effect of solar wind speed and structure on the near-Earth data) to derive an estimate for the open solar flux of (0.48 ± 0.29) × 10¹⁴ Wb at the end of the Maunder minimum. By way of comparison, the largest and smallest annual means recorded by instruments in space between 1965 and 2010 are 5.75 × 10¹⁴ Wb and 1.37 × 10¹⁴ Wb, respectively, set in 1982 and 2009, and the maximum of the 11-year running means was 4.38 × 10¹⁴ Wb in 1986. Hence the average open solar flux during the Maunder minimum is found to have been 11% of its peak value during the recent grand solar maximum.