902 results for Large Data Sets


Relevance: 90.00%

Abstract:

Real-world learning tasks often involve high-dimensional data sets with complex patterns of missing features. In this paper we review the problem of learning from incomplete data from two statistical perspectives: the likelihood-based and the Bayesian. The goal is two-fold: to place current neural network approaches to missing data within a statistical framework, and to describe a set of algorithms, derived from the likelihood-based framework, that handle clustering, classification, and function approximation from incomplete data in a principled and efficient manner. These algorithms are based on mixture modeling and make two distinct appeals to the Expectation-Maximization (EM) principle (Dempster, Laird, and Rubin 1977), both for the estimation of mixture components and for coping with the missing data.
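
The EM recipe this abstract appeals to is easiest to see in the simplest setting. Below is a minimal sketch in Python, assuming a two-component univariate Gaussian mixture with fully observed data; the paper's second appeal to EM (integrating over missing features in the E-step) would extend the same loop. All function and variable names are illustrative.

```python
import numpy as np

def em_gmm(x, k=2, iters=100, seed=0):
    """Minimal EM for a univariate Gaussian mixture on complete data."""
    rng = np.random.default_rng(seed)
    n = len(x)
    w = np.full(k, 1.0 / k)                    # mixing proportions
    mu = rng.choice(x, size=k, replace=False)  # initialise means from data
    var = np.full(k, x.var())
    for _ in range(iters):
        # E-step: responsibilities r[i, j] = P(component j | x_i)
        log_p = -0.5 * (np.log(2 * np.pi * var)
                        + (x[:, None] - mu) ** 2 / var) + np.log(w)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted maximum-likelihood updates
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 200)])
print(em_gmm(x))
```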

Relevance: 90.00%

Abstract:

One of the tantalising remaining problems in compositional data analysis lies in how to deal with data sets in which there are components which are essential zeros. By an essential zero we mean a component which is truly zero, not something recorded as zero simply because the experimental design or the measuring instrument has not been sufficiently sensitive to detect a trace of the part. Such essential zeros occur in many compositional situations, such as household budget patterns, time budgets, palaeontological zonation studies, and ecological abundance studies. Devices such as nonzero replacement and amalgamation are almost invariably ad hoc and unsuccessful in such situations. From consideration of such examples it seems sensible to build up a model in two stages, the first determining where the zeros will occur and the second determining how the available unit is distributed among the non-zero parts. In this paper we suggest two such models, an independent binomial conditional logistic normal model and a hierarchical dependent binomial conditional logistic normal model. The compositional data in such modelling consist of an incidence matrix and a conditional compositional matrix. Interesting statistical problems arise, such as the question of estimability of parameters, the nature of the computational process for the estimation of both the incidence and compositional parameters caused by the complexity of the subcompositional structure, the formation of meaningful hypotheses, and the devising of suitable testing methodology within a lattice of such essential-zero compositional hypotheses. The methodology is illustrated by application to both simulated and real compositional data.
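
A minimal simulation sketch of the two-stage construction, assuming the independent binomial variant: a Bernoulli draw per part decides the incidence pattern, and an additive logistic-normal draw distributes the unit among the nonzero parts. The names and parameterisation are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_two_stage(n, p_occur, mu, cov):
    """Stage 1: independent Bernoulli incidence (which parts are nonzero).
    Stage 2: additive logistic-normal composition on the nonzero parts."""
    d = len(p_occur)
    comps = np.zeros((n, d))
    for i in range(n):
        z = rng.random(d) < p_occur       # incidence vector
        if not z.any():
            z[rng.integers(d)] = True     # ensure at least one nonzero part
        k = int(z.sum())
        if k == 1:
            comps[i, z] = 1.0             # whole unit in the single part
            continue
        y = rng.multivariate_normal(mu[:k - 1], cov[:k - 1, :k - 1])
        e = np.append(np.exp(y), 1.0)     # inverse additive-logratio map
        comps[i, z] = e / e.sum()
    return comps

d = 4
X = simulate_two_stage(200, np.full(d, 0.7), np.zeros(d - 1), np.eye(d - 1))
print(X[:3], X.sum(axis=1)[:3])  # rows are compositions with essential zeros
```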

Relevance: 90.00%

Abstract:

As stated in Aitchison (1986), a proper study of relative variation in a compositional data set should be based on logratios, and dealing with logratios excludes dealing with zeros. Nevertheless, it is clear that zero observations might be present in real data sets, either because the corresponding part is completely absent (essential zeros) or because it is below the detection limit (rounded zeros). Because the second kind of zero is usually understood as "a trace too small to measure", it seems reasonable to replace it by a suitable small value, and this has been the traditional approach. As stated, e.g., by Tauber (1999) and by Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2000), the principal problem in compositional data analysis is related to rounded zeros. One should be careful to use a replacement strategy that does not seriously distort the general structure of the data. In particular, the covariance structure of the involved parts (and thus the metric properties) should be preserved, as otherwise further analysis on subpopulations could be misleading. Following this point of view, a non-parametric imputation method is introduced in Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2000). This method is analyzed in depth by Martín-Fernández, Barceló-Vidal, and Pawlowsky-Glahn (2003), where it is shown that the theoretical drawbacks of the additive zero replacement method proposed in Aitchison (1986) can be overcome using a new multiplicative approach on the non-zero parts of a composition. The new approach has reasonable properties from a compositional point of view. In particular, it is "natural" in the sense that it recovers the "true" composition if replacement values are identical to the missing values, and it is coherent with the basic operations on the simplex. This coherence implies that the covariance structure of subcompositions with no zeros is preserved. As a generalization of the multiplicative replacement, a substitution method for missing values in compositional data sets is introduced in the same paper.
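
The multiplicative replacement has a one-line form. The sketch below assumes a composition already closed to 1 and a single small replacement value delta for every rounded zero; the function name is illustrative.

```python
import numpy as np

def multiplicative_replacement(x, delta):
    """Replace rounded zeros by delta and rescale the nonzero parts
    multiplicatively so the result still sums to 1. Ratios among the
    nonzero parts, and hence the subcompositional covariance structure,
    are left untouched."""
    x = np.asarray(x, dtype=float)
    zeros = x == 0
    return np.where(zeros, delta, x * (1 - delta * zeros.sum()))

comp = np.array([0.1, 0.0, 0.6, 0.3])
out = multiplicative_replacement(comp, 0.005)
print(out, out.sum())
print(comp[0] / comp[2], out[0] / out[2])  # nonzero ratios preserved
```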

Relevance: 90.00%

Abstract:

Developments in the statistical analysis of compositional data over the last two decades have made possible a much deeper exploration of the nature of variability, and of the possible processes associated with compositional data sets from many disciplines. In this paper we concentrate on geochemical data sets. First we explain how hypotheses of compositional variability may be formulated within the natural sample space, the unit simplex, including useful hypotheses of subcompositional discrimination and specific perturbational change. Then we develop, through standard methodology such as generalised likelihood ratio tests, statistical tools to allow the systematic investigation of a complete lattice of such hypotheses. Some of these tests are simple adaptations of existing multivariate tests, but others require special construction. We comment on the use of graphical methods in compositional data analysis and on the ordination of specimens. The recent development of the concept of compositional processes is then explained, together with the necessary tools for a staying-in-the-simplex approach, namely compositional singular value decompositions. All these statistical techniques are illustrated for a substantial compositional data set, consisting of 209 major-oxide and rare-element compositions of metamorphosed limestones from the Northeast and Central Highlands of Scotland. Finally we point out a number of unresolved problems in the statistical analysis of compositional processes.
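
A sketch of the staying-in-the-simplex machinery under the usual assumptions: strictly positive parts, a centred log-ratio (clr) transform, double centring, and then an ordinary singular value decomposition. Names are illustrative.

```python
import numpy as np

def clr(x):
    """Centred log-ratio transform; rows are compositions."""
    lx = np.log(x)
    return lx - lx.mean(axis=1, keepdims=True)

def compositional_svd(X):
    """SVD of the centred clr data: a sketch of the compositional
    singular value decomposition the abstract refers to."""
    Z = clr(X)
    Z = Z - Z.mean(axis=0)   # remove the compositional centre
    return np.linalg.svd(Z, full_matrices=False)

rng = np.random.default_rng(0)
raw = rng.dirichlet(np.ones(5), size=50)   # stand-in for geochemical data
U, s, Vt = compositional_svd(raw)
print(s)  # a dominant leading singular value suggests low-rank structure
```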

Relevance: 90.00%

Abstract:

R, from http://www.r-project.org/, is 'GNU S': a language and environment for statistical computing and graphics. Many classical and modern statistical techniques have been implemented in the base environment, and many more are supplied as packages. There are 8 standard packages, and many more are available through the CRAN family of Internet sites, http://cran.r-project.org. We started to develop a library of functions in R to support the analysis of mixtures, and our goal is a MixeR package for compositional data analysis that provides support for: operations on compositions (perturbation and power multiplication, subcomposition with or without residuals, centering of the data, computation of Aitchison, Euclidean and Bhattacharyya distances, compositional Kullback-Leibler divergence, etc.); graphical presentation of compositions in ternary diagrams and tetrahedrons, with additional features such as the barycenter, the geometric mean of the data set, percentile lines, marking and coloring of subsets of the data set and their geometric means, and annotation of individual data in the set; dealing with zeros and missing values in compositional data sets, with R procedures for simple and multiplicative replacement strategies; and time series analysis of compositional data. We will present the current status of MixeR development and illustrate its use on selected data sets.
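
For readers without R to hand, the core simplex operations such a package supports look roughly as follows in Python. These are the textbook Aitchison-geometry definitions, not MixeR's API.

```python
import numpy as np

def close(x):
    """Closure: rescale positive parts to sum to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def perturb(x, y):
    """Perturbation: the simplex analogue of addition."""
    return close(np.asarray(x, dtype=float) * np.asarray(y, dtype=float))

def power(x, a):
    """Power multiplication: the simplex analogue of scaling."""
    return close(np.asarray(x, dtype=float) ** a)

def aitchison_dist(x, y):
    """Aitchison distance: Euclidean distance between clr coordinates."""
    lx, ly = np.log(close(x)), np.log(close(y))
    return float(np.linalg.norm((lx - lx.mean()) - (ly - ly.mean())))

a, b = [0.2, 0.3, 0.5], [0.4, 0.4, 0.2]
print(perturb(a, b), power(a, 2.0), aitchison_dist(a, b))
```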

Relevance: 90.00%

Abstract:

The log-ratio methodology makes available powerful tools for analyzing compositional data. Nevertheless, this methodology can only be applied to data sets without null values. Consequently, in data sets where zeros are present, a prior treatment becomes necessary. Recent advances in the treatment of compositional zeros have focused mainly on zeros of a structural nature and on rounded zeros. These tools do not cover the particular case of count compositional data sets with null values. In this work we deal with "count zeros" and we introduce a treatment based on a mixed Bayesian-multiplicative estimation. We use the Dirichlet probability distribution as a prior and we estimate the posterior probabilities. Then we apply a multiplicative modification to the non-zero values. We present a case study where this new methodology is applied. Key words: count data, multiplicative replacement, composition, log-ratio analysis.
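
A sketch of the Bayesian-multiplicative idea, assuming a symmetric Dirichlet prior and posterior-mean estimates; the paper's exact prior and adjustment may differ.

```python
import numpy as np

def bayes_mult_replacement(counts, alpha=0.5):
    """Count zeros: a Dirichlet(alpha) prior yields posterior mean
    proportions (c_j + alpha) / (n + D * alpha). Zero parts take their
    posterior estimate; nonzero parts are rescaled multiplicatively so
    the result is still a composition. alpha = 0.5 (a Jeffreys-type
    choice) is illustrative."""
    c = np.asarray(counts, dtype=float)
    n, d = c.sum(), len(c)
    post = (c + alpha) / (n + d * alpha)   # posterior mean proportions
    zeros = c == 0
    return np.where(zeros, post, (c / n) * (1 - post[zeros].sum()))

x = bayes_mult_replacement([12, 0, 7, 0, 31])
print(x, x.sum())  # zero-free composition, ready for log-ratio analysis
```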

Relevance: 90.00%

Abstract:

It has been shown that the accuracy of mammographic abnormality detection methods is strongly dependent on the breast tissue characteristics, where a dense breast drastically reduces detection sensitivity. In addition, breast tissue density is widely accepted to be an important risk indicator for the development of breast cancer. Here, we describe the development of an automatic breast tissue classification methodology, which can be summarized in a number of distinct steps: 1) the segmentation of the breast area into fatty versus dense mammographic tissue; 2) the extraction of morphological and texture features from the segmented breast areas; and 3) the use of a Bayesian combination of a number of classifiers. The evaluation, based on a large number of cases from two different mammographic data sets, shows a strong correlation ( and 0.67 for the two data sets) between automatic and expert-based Breast Imaging Reporting and Data System mammographic density assessment.
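
The abstract does not spell out the "Bayesian combination" in step 3; one common reading is a naive-Bayes product rule over the per-classifier posteriors, sketched below with illustrative names and an assumed conditional-independence approximation.

```python
import numpy as np

def bayesian_combine(posteriors, priors):
    """Product-rule fusion of per-classifier posteriors (a sketch, not
    necessarily the paper's scheme). posteriors: (n_classifiers,
    n_classes); priors: (n_classes,). Assumes the classifiers are
    conditionally independent given the class."""
    posteriors = np.asarray(posteriors, dtype=float)
    m = posteriors.shape[0]
    log_comb = np.log(posteriors).sum(axis=0) + (1 - m) * np.log(priors)
    p = np.exp(log_comb - log_comb.max())   # stabilise before normalising
    return p / p.sum()

# three classifiers voting over four hypothetical density classes
votes = [[0.6, 0.2, 0.1, 0.1], [0.5, 0.3, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1]]
print(bayesian_combine(votes, np.full(4, 0.25)))
```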

Relevance: 90.00%

Abstract:

This seminar is a research discussion around a very interesting problem, which may be a good basis for a WAISfest theme. A little over a year ago Professor Alan Dix came to tell us of his plans for a magnificent adventure: to walk all of the way round Wales, 1000 miles, 'Alan Walks Wales'. The walk was a personal journey, but also a technological and community one, exploring the needs of the walker and the people along the way. Whilst walking he recorded his thoughts in an audio diary, took lots of photos, wrote a blog and collected data from the tech instruments he was wearing. As a result Alan has extensive quantitative data (bio-sensing and location) and qualitative data (text, images and some audio). There are challenges in analysing individual kinds of data, including merging similar data streams, entity identification, time-series and textual data mining, dealing with provenance, and ontologies for paths and journeys. There are also challenges for author and third-party annotation, linking the data sets and visualising the merged narrative or facets of it.

Relevance: 90.00%

Abstract:

The influence of basis set size and correlation energy on the static electrical properties of the CO molecule is assessed. In particular, we have studied both the nuclear relaxation and the vibrational contributions to the static molecular electrical properties, the vibrational Stark effect (VSE) and the vibrational intensity effect (VIE). From a mathematical point of view, when a static and uniform electric field is applied to a molecule, the energy of this system can be expressed as a double power series with respect to the bond length and the field strength. From the power series expansion of the potential energy, field-dependent expressions for the equilibrium geometry, the potential energy and the force constant are obtained. The nuclear relaxation and vibrational contributions to the molecular electrical properties are analyzed in terms of the derivatives of the electronic molecular properties. In general, the results presented show that accurate inclusion of the correlation energy and large basis sets are needed to calculate the molecular electrical properties and their derivatives with respect to nuclear displacements and/or field strength. With respect to experimental data, the calculated power series coefficients are overestimated by the SCF, CISD, and QCISD methods; on the contrary, perturbation methods (MP2 and MP4) tend to underestimate them. On average, using the 6-311+G(3df) basis set for the CO molecule, the nuclear relaxation and the vibrational contributions to the molecular electrical properties amount to 11.7%, 3.3%, and 69.7% of the purely electronic μ, α, and β values, respectively.
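
The double power series described here can be written out explicitly. This is the standard expansion, a sketch of the setup with the usual sign conventions for the electrical properties, not a reproduction of the paper's equations:

\[
E(r,F)=\sum_{n\ge 0}\sum_{m\ge 0}\frac{1}{n!\,m!}
\left.\frac{\partial^{\,n+m}E}{\partial r^{n}\,\partial F^{m}}\right|_{r_e,\,F=0}
(r-r_e)^{n}\,F^{m},
\qquad
\mu=-\left.\frac{\partial E}{\partial F}\right|_{0},\quad
\alpha=-\left.\frac{\partial^{2}E}{\partial F^{2}}\right|_{0},\quad
\beta=-\left.\frac{\partial^{3}E}{\partial F^{3}}\right|_{0}.
\]

Solving \(\partial E/\partial r=0\) at fixed \(F\) then gives the field-dependent equilibrium geometry, potential energy and force constant referred to in the text.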

Relevance: 90.00%

Abstract:

Conservation planning requires identifying pertinent habitat factors and locating geographic locations where land management may improve habitat conditions for high priority species. I derived habitat models and mapped predicted abundance for the Golden-winged Warbler (Vermivora chrysoptera), a species of high conservation concern, using bird counts, environmental variables, and hierarchical models applied at multiple spatial scales. My aim was to understand habitat associations at multiple spatial scales and create a predictive abundance map for purposes of conservation planning for the Golden-winged Warbler. My models indicated a substantial influence of landscape conditions, including strong positive associations with total forest composition within the landscape. However, many of the associations I observed were counter to reported associations at finer spatial extents; for instance, I found Golden-winged Warblers negatively associated with several measures of edge habitat. No single spatial scale dominated, indicating that this species is responding to factors at multiple spatial scales. I found Golden-winged Warbler abundance was negatively related with Blue-winged Warbler (Vermivora cyanoptera) abundance. I also observed a north-south spatial trend suggestive of a regional climate effect that was not previously noted for this species. The map of predicted abundance indicated a large area of concentrated abundance in west-central Wisconsin, with smaller areas of high abundance along the northern periphery of the Prairie Hardwood Transition. This map of predicted abundance compared favorably with independent evaluation data sets and can thus be used to inform regional planning efforts devoted to conserving this species.

Relevance: 90.00%

Abstract:

Population models are essential components of large-scale conservation and management plans for the federally endangered Golden-cheeked Warbler (Setophaga chrysoparia; hereafter GCWA). However, existing models are based on vital rate estimates calculated using relatively small data sets that are now more than a decade old. We estimated more current, precise adult and juvenile apparent survival (Φ) probabilities and their associated variances for male GCWAs. In addition to providing estimates for use in population modeling, we tested hypotheses about spatial and temporal variation in Φ. We assessed whether a linear trend in Φ or a change in the overall mean Φ corresponded to an observed increase in GCWA abundance during 1992-2000 and if Φ varied among study plots. To accomplish these objectives, we analyzed long-term GCWA capture-resight data from 1992 through 2011, collected across seven study plots on the Fort Hood Military Reservation using a Cormack-Jolly-Seber model structure within program MARK. We also estimated Φ process and sampling variances using a variance-components approach. Our results did not provide evidence of site-specific variation in adult Φ on the installation. Because of a lack of data, we could not assess whether juvenile Φ varied spatially. We did not detect a strong temporal association between GCWA abundance and Φ. Mean estimates of Φ for adult and juvenile male GCWAs for all years analyzed were 0.47 with a process variance of 0.0120 and a sampling variance of 0.0113 and 0.28 with a process variance of 0.0076 and a sampling variance of 0.0149, respectively. Although juvenile Φ did not differ greatly from previous estimates, our adult Φ estimate suggests previous GCWA population models were overly optimistic with respect to adult survival. These updated Φ probabilities and their associated variances will be incorporated into new population models to assist with GCWA conservation decision making.
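
One way to read the variance-components figures: the total variability of the survival estimates splits, in the standard formulation, as

\[
\widehat{\operatorname{var}}_{\text{total}}(\hat\Phi)
=\hat\sigma^{2}_{\text{process}}+\hat\sigma^{2}_{\text{sampling}},
\]

so, for adult males, \(0.0120+0.0113=0.0233\); only the process component reflects real temporal variation in survival.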

Relevance: 90.00%

Abstract:

The longwave radiative cooling of the clear-sky atmosphere (Q_LWc) is a crucial component of the global hydrological cycle and is composed of the clear-sky outgoing longwave radiation to space (OLRc) and the net downward minus upward clear-sky longwave radiation to the surface (SNLc). Estimates of Q_LWc from reanalyses and observations are presented for the period 1979-2004. Compared to other reanalysis data sets, the European Centre for Medium-Range Weather Forecasts 40-year reanalysis (ERA40) produces the largest Q_LWc over the tropical oceans (217 W m^-2), explained by the least negative SNLc. On the basis of comparisons with data derived from satellite measurements, ERA40 provides the most realistic Q_LWc climatology over the tropical oceans but exhibits a spurious interannual variability for column-integrated water vapor (CWV) and SNLc. Interannual monthly anomalies of Q_LWc are broadly consistent between data sets, with large increases during warm El Niño events. Since relative humidity (RH) errors applying throughout the troposphere result in compensating effects on the cooling to space and to the surface, they exert only a marginal effect on Q_LWc. An observed increase in CWV with surface temperature of 3 kg m^-2 K^-1 over the tropical oceans is important in explaining a positive relationship between Q_LWc and surface temperature, in particular over ascending regimes; over tropical ocean descending regions this relationship ranges from 3.6 to 4.6 ± 0.4 W m^-2 K^-1 for the data sets considered, consistent with idealized sensitivity tests in which tropospheric warming is applied and RH is held constant, and implying an increase in precipitation with warming.
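
In symbols, and consistent with the sign convention stated above (SNLc is net downward minus upward at the surface, hence typically negative over the tropical oceans):

\[
Q_{LWc}=OLR_{c}+SNL_{c},\qquad
SNL_{c}=LW^{\downarrow}_{\mathrm{sfc}}-LW^{\uparrow}_{\mathrm{sfc}},
\]

which is why the least negative SNLc yields the largest Q_LWc.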

Relevance: 90.00%

Abstract:

The distribution and variability of water vapor and its links with radiative cooling and latent heating via precipitation are crucial to understanding feedbacks and processes operating within the climate system. Column-integrated water vapor (CWV) and additional variables from the European Centre for Medium-Range Weather Forecasts (ECMWF) 40-year reanalysis (ERA40) are utilized to quantify the spatial and temporal variability in tropical water vapor over the period 1979–2001. The moisture variability is partitioned between dynamical and thermodynamic influences and compared with variations in precipitation provided by the Climate Prediction Center Merged Analysis of Precipitation (CMAP) and the Global Precipitation Climatology Project (GPCP). The spatial distribution of CWV is strongly determined by thermodynamic constraints. Spatial variability in CWV is dominated by changes in the large-scale dynamics, in particular associated with the El Niño–Southern Oscillation (ENSO). Trends in CWV are also dominated by dynamics rather than thermodynamics over the period considered. However, increases in CWV associated with changes in temperature are significant over the equatorial east Pacific when analyzing interannual variability and over the north and northwest Pacific when analyzing trends. Significant positive trends in CWV tend to predominate over the oceans while negative trends in CWV are found over equatorial Africa and Brazil. Links between changes in CWV and vertical motion fields are identified over these regions and also the equatorial Atlantic. However, trends in precipitation are generally incoherent and show little association with the CWV trends. This may in part reflect the inadequacies of the precipitation data sets and reanalysis products when analyzing decadal variability. Though the dynamic component of CWV is a major factor in determining precipitation variability in the tropics, in some regions/seasons the thermodynamic component cancels its effect on precipitation variability.

Relevance: 90.00%

Abstract:

For many networks in nature, science and technology, it is possible to order the nodes so that most links are short-range, connecting near-neighbours, and relatively few long-range links, or shortcuts, are present. Given a network as a set of observed links (interactions), the task of finding an ordering of the nodes that reveals such a range-dependent structure is closely related to some sparse matrix reordering problems arising in scientific computation. The spectral, or Fiedler vector, approach for sparse matrix reordering has successfully been applied to biological data sets, revealing useful structures and subpatterns. In this work we argue that a periodic analogue of the standard reordering task is also highly relevant. Here, rather than encouraging nonzeros only to lie close to the diagonal of a suitably ordered adjacency matrix, we also allow them to inhabit the off-diagonal corners. Indeed, for the classic small-world model of Watts & Strogatz (1998, Collective dynamics of ‘small-world’ networks. Nature, 393, 440–442) this type of periodic structure is inherent. We therefore devise and test a new spectral algorithm for periodic reordering. By generalizing the range-dependent random graph class of Grindrod (2002, Range-dependent random graphs and their application to modeling large small-world proteome datasets. Phys. Rev. E, 66, 066702-1–066702-7) to the periodic case, we can also construct a computable likelihood ratio that suggests whether a given network is inherently linear or periodic. Tests on synthetic data show that the new algorithm can detect periodic structure, even in the presence of noise. Further experiments on real biological data sets then show that some networks are better regarded as periodic than linear. Hence, we find both qualitative (reordered network plots) and quantitative (likelihood ratios) evidence of periodicity in biological networks.
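
A sketch of the spectral reordering ideas, using only the graph Laplacian: the linear ordering sorts nodes by the Fiedler vector, and a periodic analogue sorts nodes by the angle formed by the next two eigenvectors. This illustrates the approach in spirit; it is not the paper's exact algorithm.

```python
import numpy as np

def spectral_order(A):
    """Linear reordering: sort nodes by the Fiedler vector (eigenvector
    of the second-smallest Laplacian eigenvalue)."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A        # combinatorial Laplacian
    _, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    return np.argsort(vecs[:, 1])

def periodic_order(A):
    """Periodic analogue: place each node at an angle in the plane of
    the two eigenvectors above the trivial one and sort around the
    circle, so nonzeros may also inhabit the off-diagonal corners."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(L)
    return np.argsort(np.arctan2(vecs[:, 2], vecs[:, 1]))

# a ring network: inherently periodic, as in the Watts-Strogatz model
n = 12
A = np.zeros((n, n))
idx = np.arange(n)
A[idx, (idx + 1) % n] = A[(idx + 1) % n, idx] = 1
print(periodic_order(A))   # recovers the cycle up to rotation/reflection
```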

Relevance: 90.00%

Abstract:

Asymmetry in a distribution can arise from a long tail of values in the underlying process or from outliers that belong to another population that contaminate the primary process. The first paper of this series examined the effects of the former on the variogram and this paper examines the effects of asymmetry arising from outliers. Simulated annealing was used to create normally distributed random fields of different size that are realizations of known processes described by variograms with different nugget:sill ratios. These primary data sets were then contaminated with randomly located and spatially aggregated outliers from a secondary process to produce different degrees of asymmetry. Experimental variograms were computed from these data by Matheron's estimator and by three robust estimators. The effects of standard data transformations on the coefficient of skewness and on the variogram were also investigated. Cross-validation was used to assess the performance of models fitted to experimental variograms computed from a range of data contaminated by outliers for kriging. The results showed that where skewness was caused by outliers the variograms retained their general shape, but showed an increase in the nugget and sill variances and nugget:sill ratios. This effect was only slightly more for the smallest data set than for the two larger data sets and there was little difference between the results for the latter. Overall, the effect of size of data set was small for all analyses. The nugget:sill ratio showed a consistent decrease after transformation to both square roots and logarithms; the decrease was generally larger for the latter, however. Aggregated outliers had different effects on the variogram shape from those that were randomly located, and this also depended on whether they were aggregated near to the edge or the centre of the field. The results of cross-validation showed that the robust estimators and the removal of outliers were the most effective ways of dealing with outliers for variogram estimation and kriging.
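
Matheron's method-of-moments estimator, against which the outlier effects above are measured, is short enough to state in code. A minimal sketch for a one-dimensional transect with unit sampling interval; the names and demonstration data are illustrative.

```python
import numpy as np

def matheron_variogram(z, max_lag):
    """gamma(h) = (1 / 2N(h)) * sum_i (z_i - z_{i+h})^2 for integer lags."""
    z = np.asarray(z, dtype=float)
    return np.array([0.5 * np.mean((z[:-h] - z[h:]) ** 2)
                     for h in range(1, max_lag + 1)])

rng = np.random.default_rng(42)
field = 0.1 * np.cumsum(rng.normal(size=200)) + rng.normal(size=200)
clean = matheron_variogram(field, 20)
contaminated = field.copy()
contaminated[rng.integers(0, 200, 5)] += 10.0   # randomly located outliers
print(matheron_variogram(contaminated, 20) - clean)  # inflated nugget and sill
```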