931 results for large data sets
Abstract:
Developments in the statistical analysis of compositional data over the last two decades have made possible a much deeper exploration of the nature of variability, and the possible processes associated with compositional data sets from many disciplines. In this paper we concentrate on geochemical data sets. First we explain how hypotheses of compositional variability may be formulated within the natural sample space, the unit simplex, including useful hypotheses of subcompositional discrimination and specific perturbational change. Then we develop through standard methodology, such as generalised likelihood ratio tests, statistical tools to allow the systematic investigation of a complete lattice of such hypotheses. Some of these tests are simple adaptations of existing multivariate tests but others require special construction. We comment on the use of graphical methods in compositional data analysis and on the ordination of specimens. The recent development of the concept of compositional processes is then explained together with the necessary tools for a staying-in-the-simplex approach, namely compositional singular value decompositions. All these statistical techniques are illustrated for a substantial compositional data set, consisting of 209 major-oxide and rare-element compositions of metamorphosed limestones from the Northeast and Central Highlands of Scotland. Finally we point out a number of unresolved problems in the statistical analysis of compositional processes.
Abstract:
R (http://www.r-project.org/) is ‘GNU S’, a language and environment for statistical computing and graphics. Many classical and modern statistical techniques are implemented in the base environment, and many more are supplied as packages. There are 8 standard packages, and many more are available through the CRAN family of Internet sites (http://cran.r-project.org). We have started to develop a library of functions in R to support the analysis of mixtures, and our goal is a MixeR package for compositional data analysis that provides support for: operations on compositions (perturbation and power multiplication, subcomposition with or without residuals, centering of the data, and computation of Aitchison’s, Euclidean and Bhattacharyya distances, the compositional Kullback-Leibler divergence, etc.); graphical presentation of compositions in ternary diagrams and tetrahedrons with additional features (barycenter, geometric mean of the data set, percentile lines, marking and coloring of subsets of the data set and their geometric means, notation of individual data in the set, ...); dealing with zeros and missing values in compositional data sets, with R procedures for simple and multiplicative replacement strategies; and the time series analysis of compositional data. We will present the current status of MixeR development and illustrate its use on selected data sets.
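The basic operations on compositions named in this abstract can be illustrated with a minimal sketch. It is written in Python with NumPy rather than R, and it is not the MixeR API: the function names `closure`, `perturb`, `power`, `clr` and `aitchison_distance` are hypothetical and chosen only for illustration. The Aitchison distance is computed here as the Euclidean distance between centred log-ratio vectors, which assumes strictly positive parts.

```python
import numpy as np


def closure(x):
    """Rescale a vector of positive parts so that it sums to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()


def perturb(x, y):
    """Perturbation: component-wise product followed by closure."""
    return closure(np.asarray(x, dtype=float) * np.asarray(y, dtype=float))


def power(x, a):
    """Power multiplication: component-wise power followed by closure."""
    return closure(np.asarray(x, dtype=float) ** a)


def clr(x):
    """Centred log-ratio transform (requires strictly positive parts)."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean()


def aitchison_distance(x, y):
    """Euclidean distance between the clr-transformed compositions."""
    return float(np.linalg.norm(clr(x) - clr(y)))
```

Because the distance is built from log-ratios, rescaling a composition leaves it unchanged: `aitchison_distance([1, 2, 3], [2, 4, 6])` is zero up to rounding error.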
Abstract:
The log-ratio methodology makes available powerful tools for analyzing compositional data. Nevertheless, the use of this methodology is only possible for data sets without null values. Consequently, in data sets where zeros are present, a prior treatment becomes necessary. Recent advances in the treatment of compositional zeros have centered especially on zeros of a structural nature and on rounded zeros. These tools do not address the particular case of count compositional data sets with null values. In this work we deal with "count zeros" and we introduce a treatment based on a mixed Bayesian-multiplicative estimation. We use the Dirichlet probability distribution as a prior and we estimate the posterior probabilities. Then we apply a multiplicative modification for the non-zero values. We present a case study where this new methodology is applied. Key words: count data, multiplicative replacement, composition, log-ratio analysis
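The two steps described above (a Dirichlet posterior estimate for the zero parts, then a multiplicative adjustment of the non-zero parts) can be sketched as follows. This is an illustrative Python sketch and not the authors' implementation: the function name `bayes_mult_replace` is hypothetical, and the uniform prior with strength equal to the number of parts is an assumption made only to keep the example concrete.

```python
import numpy as np


def bayes_mult_replace(counts, strength=None):
    """Replace count zeros in one vector of counts.

    Assumptions (not taken from the abstract): a uniform Dirichlet prior
    t_j = 1/D and, by default, a prior strength s = D.
    """
    c = np.asarray(counts, dtype=float)
    D = c.size
    n = c.sum()
    s = float(D) if strength is None else float(strength)
    t = np.full(D, 1.0 / D)

    # Posterior mean of each multinomial probability under the Dirichlet prior.
    posterior = (c + s * t) / (n + s)

    x = c / n                      # observed proportions
    is_zero = c == 0
    replaced = np.where(is_zero, posterior, 0.0)

    # Zeros take the posterior estimate; non-zero parts are shrunk
    # multiplicatively so that the result still sums to 1.
    return np.where(is_zero, replaced, x * (1.0 - replaced.sum()))
```

The multiplicative step preserves the ratios between the non-zero parts, which is what keeps the log-ratios among the observed parts unchanged.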
Abstract:
It has been shown that the accuracy of mammographic abnormality detection methods is strongly dependent on the breast tissue characteristics, where a dense breast drastically reduces detection sensitivity. In addition, breast tissue density is widely accepted to be an important risk indicator for the development of breast cancer. Here, we describe the development of an automatic breast tissue classification methodology, which can be summarized in a number of distinct steps: 1) the segmentation of the breast area into fatty versus dense mammographic tissue; 2) the extraction of morphological and texture features from the segmented breast areas; and 3) the use of a Bayesian combination of a number of classifiers. The evaluation, based on a large number of cases from two different mammographic data sets, shows a strong correlation ( and 0.67 for the two data sets) between automatic and expert-based Breast Imaging Reporting and Data System mammographic density assessment
Abstract:
This seminar is a research discussion around a very interesting problem, which may be a good basis for a WAISfest theme. A little over a year ago Professor Alan Dix came to tell us of his plans for a magnificent adventure: to walk all of the way round Wales, 1000 miles, 'Alan Walks Wales'. The walk was a personal journey, but also a technological and community one, exploring the needs of the walker and the people along the way. Whilst walking he recorded his thoughts in an audio diary, took lots of photos, wrote a blog and collected data from the tech instruments he was wearing. As a result Alan has extensive quantitative data (bio-sensing and location) and qualitative data (text, images and some audio). There are challenges in analysing individual kinds of data, including merging similar data streams, entity identification, time-series and textual data mining, dealing with provenance, ontologies for paths, and journeys. There are also challenges for author and third-party annotation, linking the data sets and visualising the merged narrative or facets of it.
Abstract:
The influence of the basis set size and the correlation energy on the static electrical properties of the CO molecule is assessed. In particular, we have studied both the nuclear relaxation and the vibrational contributions to the static molecular electrical properties, the vibrational Stark effect (VSE) and the vibrational intensity effect (VIE). From a mathematical point of view, when a static and uniform electric field is applied to a molecule, the energy of this system can be expressed in terms of a double power series with respect to the bond length and to the field strength. From the power series expansion of the potential energy, field-dependent expressions for the equilibrium geometry, for the potential energy and for the force constant are obtained. The nuclear relaxation and vibrational contributions to the molecular electrical properties are analyzed in terms of the derivatives of the electronic molecular properties. In general, the results presented show that accurate inclusion of the correlation energy and large basis sets are needed to calculate the molecular electrical properties and their derivatives with respect to nuclear displacements and/or field strength. With respect to experimental data, the calculated power series coefficients are overestimated by the SCF, CISD, and QCISD methods. In contrast, perturbation methods (MP2 and MP4) tend to underestimate them. On average, using the 6-311+G(3df) basis set for the CO molecule, the nuclear relaxation and the vibrational contributions to the molecular electrical properties amount to 11.7%, 3.3%, and 69.7% of the purely electronic μ, α, and β values, respectively.
Abstract:
Conservation planning requires identifying pertinent habitat factors and locating geographic locations where land management may improve habitat conditions for high priority species. I derived habitat models and mapped predicted abundance for the Golden-winged Warbler (Vermivora chrysoptera), a species of high conservation concern, using bird counts, environmental variables, and hierarchical models applied at multiple spatial scales. My aim was to understand habitat associations at multiple spatial scales and create a predictive abundance map for purposes of conservation planning for the Golden-winged Warbler. My models indicated a substantial influence of landscape conditions, including strong positive associations with total forest composition within the landscape. However, many of the associations I observed were counter to reported associations at finer spatial extents; for instance, I found Golden-winged Warblers negatively associated with several measures of edge habitat. No single spatial scale dominated, indicating that this species is responding to factors at multiple spatial scales. I found Golden-winged Warbler abundance was negatively related with Blue-winged Warbler (Vermivora cyanoptera) abundance. I also observed a north-south spatial trend suggestive of a regional climate effect that was not previously noted for this species. The map of predicted abundance indicated a large area of concentrated abundance in west-central Wisconsin, with smaller areas of high abundance along the northern periphery of the Prairie Hardwood Transition. This map of predicted abundance compared favorably with independent evaluation data sets and can thus be used to inform regional planning efforts devoted to conserving this species.
Abstract:
Population models are essential components of large-scale conservation and management plans for the federally endangered Golden-cheeked Warbler (Setophaga chrysoparia; hereafter GCWA). However, existing models are based on vital rate estimates calculated using relatively small data sets that are now more than a decade old. We estimated more current, precise adult and juvenile apparent survival (Φ) probabilities and their associated variances for male GCWAs. In addition to providing estimates for use in population modeling, we tested hypotheses about spatial and temporal variation in Φ. We assessed whether a linear trend in Φ or a change in the overall mean Φ corresponded to an observed increase in GCWA abundance during 1992-2000 and if Φ varied among study plots. To accomplish these objectives, we analyzed long-term GCWA capture-resight data from 1992 through 2011, collected across seven study plots on the Fort Hood Military Reservation using a Cormack-Jolly-Seber model structure within program MARK. We also estimated Φ process and sampling variances using a variance-components approach. Our results did not provide evidence of site-specific variation in adult Φ on the installation. Because of a lack of data, we could not assess whether juvenile Φ varied spatially. We did not detect a strong temporal association between GCWA abundance and Φ. Mean estimates of Φ for adult and juvenile male GCWAs for all years analyzed were 0.47 with a process variance of 0.0120 and a sampling variance of 0.0113 and 0.28 with a process variance of 0.0076 and a sampling variance of 0.0149, respectively. Although juvenile Φ did not differ greatly from previous estimates, our adult Φ estimate suggests previous GCWA population models were overly optimistic with respect to adult survival. These updated Φ probabilities and their associated variances will be incorporated into new population models to assist with GCWA conservation decision making.
Abstract:
The longwave radiative cooling of the clear-sky atmosphere (QLWc) is a crucial component of the global hydrological cycle and is composed of the clear-sky outgoing longwave radiation to space (OLRc) and the net downward minus upward clear-sky longwave radiation to the surface (SNLc). Estimates of QLWc from reanalyses and observations are presented for the period 1979-2004. Compared to other reanalyses data sets, the European Centre for Medium-Range Weather Forecasts 40-year reanalysis (ERA40) produces the largest QLWc over the tropical oceans (217 W m⁻²), explained by the least negative SNLc. On the basis of comparisons with data derived from satellite measurements, ERA40 provides the most realistic QLWc climatology over the tropical oceans but exhibits a spurious interannual variability for column integrated water vapor (CWV) and SNLc. Interannual monthly anomalies of QLWc are broadly consistent between data sets with large increases during the warm El Niño events. Since relative humidity (RH) errors applying throughout the troposphere result in compensating effects on the cooling to space and to the surface, they exert only a marginal effect on QLWc. An observed increase in CWV with surface temperature of 3 kg m⁻² K⁻¹ over the tropical oceans is important in explaining a positive relationship between QLWc and surface temperature, in particular over ascending regimes; over tropical ocean descending regions this relationship ranges from 3.6 to 4.6 ± 0.4 W m⁻² K⁻¹ for the data sets considered, consistent with idealized sensitivity tests in which tropospheric warming is applied and RH is held constant and implying an increase in precipitation with warming.
Abstract:
The distribution and variability of water vapor and its links with radiative cooling and latent heating via precipitation are crucial to understanding feedbacks and processes operating within the climate system. Column-integrated water vapor (CWV) and additional variables from the European Centre for Medium-Range Weather Forecasts (ECMWF) 40-year reanalysis (ERA40) are utilized to quantify the spatial and temporal variability in tropical water vapor over the period 1979–2001. The moisture variability is partitioned between dynamical and thermodynamic influences and compared with variations in precipitation provided by the Climate Prediction Center Merged Analysis of Precipitation (CMAP) and the Global Precipitation Climatology Project (GPCP). The spatial distribution of CWV is strongly determined by thermodynamic constraints. Spatial variability in CWV is dominated by changes in the large-scale dynamics, in particular associated with the El Niño–Southern Oscillation (ENSO). Trends in CWV are also dominated by dynamics rather than thermodynamics over the period considered. However, increases in CWV associated with changes in temperature are significant over the equatorial east Pacific when analyzing interannual variability and over the north and northwest Pacific when analyzing trends. Significant positive trends in CWV tend to predominate over the oceans while negative trends in CWV are found over equatorial Africa and Brazil. Links between changes in CWV and vertical motion fields are identified over these regions and also the equatorial Atlantic. However, trends in precipitation are generally incoherent and show little association with the CWV trends. This may in part reflect the inadequacies of the precipitation data sets and reanalysis products when analyzing decadal variability. 
Though the dynamic component of CWV is a major factor in determining precipitation variability in the tropics, in some regions/seasons the thermodynamic component cancels its effect on precipitation variability.
Abstract:
For many networks in nature, science and technology, it is possible to order the nodes so that most links are short-range, connecting near-neighbours, and relatively few long-range links, or shortcuts, are present. Given a network as a set of observed links (interactions), the task of finding an ordering of the nodes that reveals such a range-dependent structure is closely related to some sparse matrix reordering problems arising in scientific computation. The spectral, or Fiedler vector, approach for sparse matrix reordering has successfully been applied to biological data sets, revealing useful structures and subpatterns. In this work we argue that a periodic analogue of the standard reordering task is also highly relevant. Here, rather than encouraging nonzeros only to lie close to the diagonal of a suitably ordered adjacency matrix, we also allow them to inhabit the off-diagonal corners. Indeed, for the classic small-world model of Watts & Strogatz (1998, Collective dynamics of ‘small-world’ networks. Nature, 393, 440–442) this type of periodic structure is inherent. We therefore devise and test a new spectral algorithm for periodic reordering. By generalizing the range-dependent random graph class of Grindrod (2002, Range-dependent random graphs and their application to modeling large small-world proteome datasets. Phys. Rev. E, 66, 066702-1–066702-7) to the periodic case, we can also construct a computable likelihood ratio that suggests whether a given network is inherently linear or periodic. Tests on synthetic data show that the new algorithm can detect periodic structure, even in the presence of noise. Further experiments on real biological data sets then show that some networks are better regarded as periodic than linear. Hence, we find both qualitative (reordered networks plots) and quantitative (likelihood ratios) evidence of periodicity in biological networks.
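The standard (non-periodic) spectral reordering that this abstract builds on can be sketched as follows. This is a Python/NumPy illustration, not the authors' new periodic algorithm: nodes are sorted by their entries in the Fiedler vector, the eigenvector of the graph Laplacian belonging to the second-smallest eigenvalue. The helper names `fiedler_order` and `bandwidth` are hypothetical, and the sketch assumes an undirected, connected network given as a dense symmetric adjacency matrix.

```python
import numpy as np


def fiedler_order(adjacency):
    """Return a node ordering obtained by sorting the Fiedler vector.

    Assumes a symmetric adjacency matrix of a connected graph.
    """
    A = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(A.sum(axis=1)) - A
    # eigh returns eigenvalues in ascending order for symmetric matrices,
    # so column 1 is the eigenvector of the second-smallest eigenvalue.
    _, eigenvectors = np.linalg.eigh(laplacian)
    return np.argsort(eigenvectors[:, 1])


def bandwidth(adjacency):
    """Largest distance |i - j| of any nonzero entry from the diagonal."""
    rows, cols = np.nonzero(np.asarray(adjacency))
    return int(np.abs(rows - cols).max()) if rows.size else 0
```

Applying the returned ordering to both rows and columns, for example `A[np.ix_(order, order)]`, should pull the nonzeros of a range-dependent network towards the diagonal; for a shuffled path graph it recovers the path up to reversal.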
Abstract:
Asymmetry in a distribution can arise from a long tail of values in the underlying process or from outliers that belong to another population that contaminate the primary process. The first paper of this series examined the effects of the former on the variogram and this paper examines the effects of asymmetry arising from outliers. Simulated annealing was used to create normally distributed random fields of different size that are realizations of known processes described by variograms with different nugget:sill ratios. These primary data sets were then contaminated with randomly located and spatially aggregated outliers from a secondary process to produce different degrees of asymmetry. Experimental variograms were computed from these data by Matheron's estimator and by three robust estimators. The effects of standard data transformations on the coefficient of skewness and on the variogram were also investigated. Cross-validation was used to assess the performance of models fitted to experimental variograms computed from a range of data contaminated by outliers for kriging. The results showed that where skewness was caused by outliers the variograms retained their general shape, but showed an increase in the nugget and sill variances and nugget:sill ratios. This effect was only slightly more for the smallest data set than for the two larger data sets and there was little difference between the results for the latter. Overall, the effect of size of data set was small for all analyses. The nugget:sill ratio showed a consistent decrease after transformation to both square roots and logarithms; the decrease was generally larger for the latter, however. Aggregated outliers had different effects on the variogram shape from those that were randomly located, and this also depended on whether they were aggregated near to the edge or the centre of the field. 
The results of cross-validation showed that the robust estimators and the removal of outliers were the most effective ways of dealing with outliers for variogram estimation and kriging. (C) 2007 Elsevier Ltd. All rights reserved.
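Matheron's estimator, used above to compute the experimental variograms, is half the mean squared difference between all pairs of values separated by a given lag. The sketch below is a simplified Python illustration restricted to a regularly spaced one-dimensional transect, which is an assumption made for brevity (the study itself used two-dimensional random fields); the function name `matheron_variogram` is hypothetical.

```python
import numpy as np


def matheron_variogram(z, max_lag):
    """Experimental semivariances for lags 1..max_lag on a regular transect.

    gamma(h) = 1 / (2 N(h)) * sum over pairs (z_i - z_{i+h})^2,
    where N(h) is the number of pairs separated by lag h.
    """
    z = np.asarray(z, dtype=float)
    return np.array(
        [0.5 * np.mean((z[h:] - z[:-h]) ** 2) for h in range(1, max_lag + 1)]
    )
```

Because the differences are squared, a single outlier inflates every semivariance whose pairs include it, which is consistent with the increase in nugget and sill variances reported above.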
Abstract:
Microbial communities respond to a variety of environmental factors related to resources (e.g. plant and soil organic matter), habitat (e.g. soil characteristics) and predation (e.g. nematodes, protozoa and viruses). However, the relative contribution of these factors on microbial community composition is poorly understood. Here, we sampled soils from 30 chalk grassland fields located in three different chalk hill ridges of Southern England, using a spatially explicit sampling scheme. We assessed microbial communities via phospholipid fatty acid (PLFA) analyses and PCR-denaturing gradient gel electrophoresis (DGGE) and measured soil characteristics, as well as nematode and plant community composition. The relative influences of space, soil, vegetation and nematodes on soil microorganisms were contrasted using variation partitioning and path analysis. Results indicate that soil characteristics and plant community composition, representing habitat and resources, shape soil microbial community composition, whereas the influence of nematodes, a potential predation factor, appears to be relatively small. Spatial variation in microbial community structure was detected at broad (between fields) and fine (within fields) scales, suggesting that microbial communities exhibit biogeographic patterns at different scales. Although our analysis included several relevant explanatory data sets, a large part of the variation in microbial communities remained unexplained (up to 92% in some analyses). However, in several analyses, significant parts of the variation in microbial community structure could be explained. The results of this study contribute to our understanding of the relative importance of different environmental and spatial factors in driving the composition of soil-borne microbial communities.
Abstract:
The difference between cirrus emissivities at 8 and 11 μm is sensitive to the mean effective ice crystal size of the cirrus cloud, De. By using single scattering properties of ice crystals shaped as planar polycrystals, diameters of up to about 70 μm can be retrieved, instead of up to 45 μm assuming spheres or hexagonal columns. The method described in this article is used for a global determination of mean effective ice crystal sizes of cirrus clouds from TOVS satellite observations. A sensitivity study of the De retrieval to uncertainties in hypotheses on ice crystal shape, size distributions, and temperature profiles, as well as in vertical and horizontal cloud heterogeneities shows that uncertainties can be as large as 30%. However, the TOVS data set is one of few data sets which provides global and long-term coverage. Having analyzed the years 1987–1991, it was found that measured effective ice crystal diameters De are stable from year to year. For 1990 a global median De of 53.5 μm was determined. Averages distinguishing ocean/land, season, and latitude lie between 23 μm in winter over Northern Hemisphere midlatitude land and 64 μm in the tropics. In general, larger Des are found in regions with higher atmospheric water vapor and for cirrus with a smaller effective emissivity.
Abstract:
Population size estimation with discrete or nonparametric mixture models is considered, and reliable ways of construction of the nonparametric mixture model estimator are reviewed and set into perspective. Construction of the maximum likelihood estimator of the mixing distribution is done for any number of components up to the global nonparametric maximum likelihood bound using the EM algorithm. In addition, the estimators of Chao and Zelterman are considered with some generalisations of Zelterman’s estimator. All computations are done with CAMCR, a special software developed for population size estimation with mixture models. Several examples and data sets are discussed and the estimators illustrated. Problems using the mixture model-based estimators are highlighted.
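The estimators of Chao and Zelterman mentioned above use only the number of observed units and the frequencies of units seen exactly once and exactly twice. The following Python sketch shows their textbook forms; it is not CAMCR code, the function names are hypothetical, and the generalisations of Zelterman's estimator discussed in the paper are not covered.

```python
import math


def chao_estimate(n_observed, f1, f2):
    """Chao's lower-bound estimator: n + f1^2 / (2 * f2).

    f1 and f2 are the numbers of units observed exactly once and twice.
    """
    return n_observed + f1 ** 2 / (2.0 * f2)


def zelterman_estimate(n_observed, f1, f2):
    """Zelterman's estimator: n / (1 - exp(-lambda)), lambda = 2 * f2 / f1."""
    rate = 2.0 * f2 / f1
    return n_observed / (1.0 - math.exp(-rate))
```

For instance, with 20 observed units of which 10 were seen once and 5 twice, `chao_estimate(20, 10, 5)` gives 30.0. Both functions require `f1 > 0` and `f2 > 0`.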