931 resultados para large data sets


Relevância:

90.00% 90.00%

Publicador:

Resumo:

Assaying a large number of genetic markers from patients in clinical trials is now possible in order to tailor drugs with respect to efficacy. The statistical methodology for analysing such massive data sets is challenging. The most popular type of statistical analysis is to use a univariate test for each genetic marker, once all the data from a clinical study have been collected. This paper presents a sequential method for conducting an omnibus test for detecting gene-drug interactions across the genome, thus allowing informed decisions at the earliest opportunity and overcoming the multiple testing problems from conducting many univariate tests. We first propose an omnibus test for a fixed sample size. This test is based on combining F-statistics that test for an interaction between treatment and the individual single nucleotide polymorphism (SNP). As SNPs tend to be correlated, we use permutations to calculate a global p-value. We extend our omnibus test to the sequential case. In order to control the type I error rate, we propose a sequential method that uses permutations to obtain the stopping boundaries. The results of a simulation study show that the sequential permutation method is more powerful than alternative sequential methods that control the type I error rate, such as the inverse-normal method. The proposed method is flexible as we do not need to assume a mode of inheritance and can also adjust for confounding factors. An application to real clinical data illustrates that the method is computationally feasible for a large number of SNPs. Copyright (c) 2007 John Wiley & Sons, Ltd.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

In the Biodiversity World (BDW) project we have created a flexible and extensible Web Services-based Grid environment for biodiversity researchers to solve problems in biodiversity and analyse biodiversity patterns. In this environment, heterogeneous and globally distributed biodiversity-related resources such as data sets and analytical tools are made available to be accessed and assembled by users into workflows to perform complex scientific experiments. One such experiment is bioclimatic modelling of the geographical distribution of individual species using climate variables in order to predict past and future climate-related changes in species distribution. Data sources and analytical tools required for such analysis of species distribution are widely dispersed, available on heterogeneous platforms, present data in different formats and lack interoperability. The BDW system brings all these disparate units together so that the user can combine tools with little thought as to their availability, data formats and interoperability. The current Web Servicesbased Grid environment enables execution of the BDW workflow tasks in remote nodes but with a limited scope. The next step in the evolution of the BDW architecture is to enable workflow tasks to utilise computational resources available within and outside the BDW domain. We describe the present BDW architecture and its transition to a new framework which provides a distributed computational environment for mapping and executing workflows in addition to bringing together heterogeneous resources and analytical tools.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

We describe a general likelihood-based 'mixture model' for inferring phylogenetic trees from gene-sequence or other character-state data. The model accommodates cases in which different sites in the alignment evolve in qualitatively distinct ways, but does not require prior knowledge of these patterns or partitioning of the data. We call this qualitative variability in the pattern of evolution across sites "pattern-heterogeneity" to distinguish it from both a homogenous process of evolution and from one characterized principally by differences in rates of evolution. We present studies to show that the model correctly retrieves the signals of pattern-heterogeneity from simulated gene-sequence data, and we apply the method to protein-coding genes and to a ribosomal 12S data set. The mixture model outperforms conventional partitioning in both these data sets. We implement the mixture model such that it can simultaneously detect rate- and pattern-heterogeneity. The model simplifies to a homogeneous model or a rate- variability model as special cases, and therefore always performs at least as well as these two approaches, and often considerably improves upon them. We make the model available within a Bayesian Markov-chain Monte Carlo framework for phylogenetic inference, as an easy-to-use computer program.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Once unit-cell dimensions have been determined from a powder diffraction data set and therefore the crystal system is known (e.g. orthorhombic), the method presented by Markvardsen, David, Johnson & Shankland [Acta Cryst. (2001), A57, 47-54] can be used to generate a table ranking the extinction symbols of the given crystal system according to probability. Markvardsen et al. tested a computer program (ExtSym) implementing the method against Pawley refinement outputs generated using the TF12LS program [David, Ibberson & Matthewman (1992). Report RAL-92-032. Rutherford Appleton Laboratory, Chilton, Didcot, Oxon, UK]. Here, it is shown that ExtSym can be used successfully with many well known powder diffraction analysis packages, namely DASH [David, Shankland, van de Streek, Pidcock, Motherwell & Cole (2006). J. Appl. Cryst. 39, 910-915], FullProf [Rodriguez-Carvajal (1993). Physica B, 192, 55-69], GSAS [Larson & Von Dreele (1994). Report LAUR 86-748. Los Alamos National Laboratory, New Mexico, USA], PRODD [Wright (2004). Z. Kristallogr. 219, 1-11] and TOPAS [Coelho (2003). Bruker AXS GmbH, Karlsruhe, Germany]. In addition, a precise description of the optimal input for ExtSym is given to enable other software packages to interface with ExtSym and to allow the improvement/modification of existing interfacing scripts. ExtSym takes as input the powder data in the form of integrated intensities and error estimates for these intensities. The output returned by ExtSym is demonstrated to be strongly dependent on the accuracy of these error estimates and the reason for this is explained. ExtSym is tested against a wide range of data sets, confirming the algorithm to be very successful at ranking the published extinction symbol as the most likely. (C) 2008 International Union of Crystallography Printed in Singapore - all rights reserved.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

We have discovered a novel approach of intrusion detection system using an intelligent data classifier based on a self organizing map (SOM). We have surveyed all other unsupervised intrusion detection methods, different alternative SOM based techniques and KDD winner IDS methods. This paper provides a robust designed and implemented intelligent data classifier technique based on a single large size (30x30) self organizing map (SOM) having the capability to detect all types of attacks given in the DARPA Archive 1999 the lowest false positive rate being 0.04 % and higher detection rate being 99.73% tested using full KDD data sets and 89.54% comparable detection rate and 0.18% lowest false positive rate tested using corrected data sets.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

This paper introduces a new neurofuzzy model construction and parameter estimation algorithm from observed finite data sets, based on a Takagi and Sugeno (T-S) inference mechanism and a new extended Gram-Schmidt orthogonal decomposition algorithm, for the modeling of a priori unknown dynamical systems in the form of a set of fuzzy rules. The first contribution of the paper is the introduction of a one to one mapping between a fuzzy rule-base and a model matrix feature subspace using the T-S inference mechanism. This link enables the numerical properties associated with a rule-based matrix subspace, the relationships amongst these matrix subspaces, and the correlation between the output vector and a rule-base matrix subspace, to be investigated and extracted as rule-based knowledge to enhance model transparency. The matrix subspace spanned by a fuzzy rule is initially derived as the input regression matrix multiplied by a weighting matrix that consists of the corresponding fuzzy membership functions over the training data set. Model transparency is explored by the derivation of an equivalence between an A-optimality experimental design criterion of the weighting matrix and the average model output sensitivity to the fuzzy rule, so that rule-bases can be effectively measured by their identifiability via the A-optimality experimental design criterion. The A-optimality experimental design criterion of the weighting matrices of fuzzy rules is used to construct an initial model rule-base. An extended Gram-Schmidt algorithm is then developed to estimate the parameter vector for each rule. This new algorithm decomposes the model rule-bases via an orthogonal subspace decomposition approach, so as to enhance model transparency with the capability of interpreting the derived rule-base energy level. This new approach is computationally simpler than the conventional Gram-Schmidt algorithm for resolving high dimensional regression problems, whereby it is computationally desirable to decompose complex models into a few submodels rather than a single model with large number of input variables and the associated curse of dimensionality problem. Numerical examples are included to demonstrate the effectiveness of the proposed new algorithm.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Population size estimation with discrete or nonparametric mixture models is considered, and reliable ways of construction of the nonparametric mixture model estimator are reviewed and set into perspective. Construction of the maximum likelihood estimator of the mixing distribution is done for any number of components up to the global nonparametric maximum likelihood bound using the EM algorithm. In addition, the estimators of Chao and Zelterman are considered with some generalisations of Zelterman’s estimator. All computations are done with CAMCR, a special software developed for population size estimation with mixture models. Several examples and data sets are discussed and the estimators illustrated. Problems using the mixture model-based estimators are highlighted.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

The ability to display and inspect powder diffraction data quickly and efficiently is a central part of the data analysis process. Whilst many computer programs are capable of displaying powder data, their focus is typically on advanced operations such as structure solution or Rietveld refinement. This article describes a lightweight software package, Jpowder, whose focus is fast and convenient visualization and comparison of powder data sets in a variety of formats from computers with network access. Jpowder is written in Java and uses its associated Web Start technology to allow ‘single-click deployment’ from a web page, http://www.jpowder.org. Jpowder is open source, free and available for use by anyone.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

A program is provided to determine structural parameters of atoms in or adsorbed on surfaces by refinement of atomistic models towards experimentally determined data generated by the normal incidence X-ray standing wave (NIXSW) technique. The method employs a combination of Differential Evolution Genetic Algorithms and Steepest Descent Line Minimisations to provide a fast, reliable and user friendly tool for experimentalists to interpret complex multidimensional NIXSW data sets.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

An updated analysis of observed stratospheric temperature variability and trends is presented on the basis of satellite, radiosonde, and lidar observations. Satellite data include measurements from the series of NOAA operational instruments, including the Microwave Sounding Unit covering 1979–2007 and the Stratospheric Sounding Unit (SSU) covering 1979–2005. Radiosonde results are compared for six different data sets, incorporating a variety of homogeneity adjustments to account for changes in instrumentation and observational practices. Temperature changes in the lower stratosphere show cooling of 0.5 K/decade over much of the globe for 1979–2007, with some differences in detail among the different radiosonde and satellite data sets. Substantially larger cooling trends are observed in the Antarctic lower stratosphere during spring and summer, in association with development of the Antarctic ozone hole. Trends in the lower stratosphere derived from radiosonde data are also analyzed for a longer record (back to 1958); trends for the presatellite era (1958–1978) have a large range among the different homogenized data sets, implying large trend uncertainties. Trends in the middle and upper stratosphere have been derived from updated SSU data, taking into account changes in the SSU weighting functions due to observed atmospheric CO2 increases. The results show mean cooling of 0.5–1.5 K/decade during 1979–2005, with the greatest cooling in the upper stratosphere near 40–50 km. Temperature anomalies throughout the stratosphere were relatively constant during the decade 1995–2005. Long records of lidar temperature measurements at a few locations show reasonable agreement with SSU trends, although sampling uncertainties are large in the localized lidar measurements. Updated estimates of the solar cycle influence on stratospheric temperatures show a statistically significant signal in the tropics (30N–S), with an amplitude (solar maximum minus solar minimum) of 0.5 K (lower stratosphere) to 1.0 K (upper stratosphere).

Relevância:

90.00% 90.00%

Publicador:

Resumo:

A role for sequential test procedures is emerging in genetic and epidemiological studies using banked biological resources. This stems from the methodology's potential for improved use of information relative to comparable fixed sample designs. Studies in which cost, time and ethics feature prominently are particularly suited to a sequential approach. In this paper sequential procedures for matched case–control studies with binary data will be investigated and assessed. Design issues such as sample size evaluation and error rates are identified and addressed. The methodology is illustrated and evaluated using both real and simulated data sets.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Svalgaard and Cliver (2010) recently reported a consensus between the various reconstructions of the heliospheric field over recent centuries. This is a significant development because, individually, each has uncertainties introduced by instrument calibration drifts, limited numbers of observatories, and the strength of the correlations employed. However, taken collectively, a consistent picture is emerging. We here show that this consensus extends to more data sets and methods than reported by Svalgaard and Cliver, including that used by Lockwood et al. (1999), when their algorithm is used to predict the heliospheric field rather than the open solar flux. One area where there is still some debate relates to the existence and meaning of a floor value to the heliospheric field. From cosmogenic isotope abundances, Steinhilber et al. (2010) have recently deduced that the near-Earth IMF at the end of the Maunder minimum was 1.80 ± 0.59 nT which is considerably lower than the revised floor of 4nT proposed by Svalgaard and Cliver. We here combine cosmogenic and geomagnetic reconstructions and modern observations (with allowance for the effect of solar wind speed and structure on the near-Earth data) to derive an estimate for the open solar flux of (0.48 ± 0.29) × 1014 Wb at the end of the Maunder minimum. By way of comparison, the largest and smallest annual means recorded by instruments in space between 1965 and 2010 are 5.75 × 1014 Wb and 1.37 × 1014 Wb, respectively, set in 1982 and 2009, and the maximum of the 11 year running means was 4.38 × 1014 Wb in 1986. Hence the average open solar flux during the Maunder minimum is found to have been 11% of its peak value during the recent grand solar maximum.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Overall phylogenetic relationships within the genus Pelargonium (Geraniaceae) were inferred based on DNA sequences from mitochondrial(mt)-encoded nad1 b/c exons and from chloroplast(cp)-encoded trnL (UAA) 5' exon-trnF (GAA) exon regions using two species of Geranium and Sarcocaulon vanderetiae as outgroups. The group II intron between nad1 exons b and c was found to be absent from the Pelargonium, Geranium, and Sarcocaulon sequences presented here as well as from Erodium, which is the first recorded loss of this intron in angiosperms. Separate phylogenetic analyses of the mtDNA and cpDNA data sets produced largely congruent topologies, indicating linkage between mitochondrial and chloroplast genome inheritance. Simultaneous analysis of the combined data sets yielded a well-resolved topology with high clade support exhibiting a basic split into small and large chromosome species, the first group containing two lineages and the latter three. One large chromosome lineage (x = 11) comprises species from sections Myrrhidium and Chorisma and is sister to a lineage comprising P. mutans (x = 11) and species from section Jenkinsonia (x = 9). Sister to these two lineages is a lineage comprising species from sections Ciconium (x = 9) and Subsucculentia (x = 10). Cladistic evaluation of this pattern suggests that x = 11 is the ancestral basic chromosome number for the genus.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

This paper uses data provided by three major real estate advisory firms to investigate the level and pattern of variation in the measurement of historic real estate rental values for the main European office centres. The paper assesses the extent to which the data providing organizations agree on historic market performance in terms of returns, risk and timing and examines the relationship between market maturity and agreement. The analysis suggests that at the aggregate level and for many markets, there is substantial agreement on direction, quantity and timing of market change. However, there is substantial variability in the level of agreement among cities. The paper also assesses whether the different data sets produce different explanatory models and market forecast. It is concluded that, although disagreement on the direction of market change is high for many market, the different data sets often produce similar explanatory models and predict similar relative performance.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

The requirement to forecast volcanic ash concentrations was amplified as a response to the 2010 Eyjafjallajökull eruption when ash safety limits for aviation were introduced in the European area. The ability to provide accurate quantitative forecasts relies to a large extent on the source term which is the emissions of ash as a function of time and height. This study presents source term estimations of the ash emissions from the Eyjafjallajökull eruption derived with an inversion algorithm which constrains modeled ash emissions with satellite observations of volcanic ash. The algorithm is tested with input from two different dispersion models, run on three different meteorological input data sets. The results are robust to which dispersion model and meteorological data are used. Modeled ash concentrations are compared quantitatively to independent measurements from three different research aircraft and one surface measurement station. These comparisons show that the models perform reasonably well in simulating the ash concentrations, and simulations using the source term obtained from the inversion are in overall better agreement with the observations (rank correlation = 0.55, Figure of Merit in Time (FMT) = 25–46%) than simulations using simplified source terms (rank correlation = 0.21, FMT = 20–35%). The vertical structures of the modeled ash clouds mostly agree with lidar observations, and the modeled ash particle size distributions agree reasonably well with observed size distributions. There are occasionally large differences between simulations but the model mean usually outperforms any individual model. The results emphasize the benefits of using an ensemble-based forecast for improved quantification of uncertainties in future ash crises.