20 results for Missing data
in CentAUR: Central Archive at the University of Reading - UK
Abstract:
Resolving the relationships between Metazoa and other eukaryotic groups, as well as between metazoan phyla, is central to understanding the origin and evolution of animals. The current view is based on limited data sets, either a single gene with many species (e.g., ribosomal RNA) or many genes but with only a few species. Because a reliable phylogenetic inference simultaneously requires numerous genes and numerous species, we assembled a very large data set containing 129 orthologous proteins (~30,000 aligned amino acid positions) for 36 eukaryotic species. Included in the alignments are data from the choanoflagellate Monosiga ovata, obtained through the sequencing of about 1,000 cDNAs. We provide conclusive support for choanoflagellates as the closest relative of animals and for fungi as the second closest. The monophyly of Plantae and chromalveolates was recovered but without strong statistical support. Within animals, in contrast to the monophyly of Coelomata observed in several recent large-scale analyses, we recovered a paraphyletic Coelomata, with nematodes and platyhelminths nested within. To include a diverse sample of organisms, data from EST projects were used for several species, resulting in a large amount of missing data in our alignment (about 25%). Using different approaches, we verify that the inferred phylogeny is not sensitive to these missing data. This large data set therefore provides a reliable phylogenetic framework for studying eukaryotic and animal evolution and will be easily extendable when large amounts of sequence information become available from a broader taxonomic range.
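As a loose illustration of this kind of robustness check (not the authors' actual pipeline), the sketch below masks ~25% of a toy alignment and compares pairwise p-distances computed with and without the mask; all sizes, divergence rates, and the synthetic data are assumptions.
```python
# Hypothetical robustness check: does the distance signal that drives tree
# inference survive ~25% missing data?
import numpy as np

rng = np.random.default_rng(0)
n_taxa, n_sites = 36, 30_000
anc = rng.integers(0, 20, n_sites)                     # toy ancestral sequence
rates = rng.uniform(0.05, 0.6, n_taxa)                 # per-taxon divergence
aln = np.where(rng.random((n_taxa, n_sites)) < rates[:, None],
               rng.integers(0, 20, (n_taxa, n_sites)), anc)
mask = rng.random((n_taxa, n_sites)) < 0.25            # ~25% missing cells

def p_distances(aln, missing):
    """Pairwise p-distance over sites observed in both taxa."""
    n = aln.shape[0]
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            shared = ~missing[i] & ~missing[j]
            d[i, j] = d[j, i] = np.mean(aln[i, shared] != aln[j, shared])
    return d

full = p_distances(aln, np.zeros_like(mask))
degraded = p_distances(aln, mask)
iu = np.triu_indices(n_taxa, k=1)
# A correlation near 1 suggests the pairwise signal is largely insensitive
# to this level of missing data.
print(np.corrcoef(full[iu], degraded[iu])[0, 1])
```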
Abstract:
An important feature of agribusiness promotion programs is their lagged impact on consumption. Efficient investment in advertising requires reliable estimates of these lagged responses, and it is desirable from both applied and theoretical standpoints to have a flexible method for estimating them. This note derives an alternative Bayesian methodology for estimating lagged responses when investments occur intermittently within a time series. The method exploits a latent-variable extension of the natural-conjugate, normal-linear model, Gibbs sampling, and data augmentation. It is applied to a monthly time series on Turkish pasta consumption (1993:5-1998:3) and three nonconsecutive promotion campaigns (1996:3, 1997:3, 1997:10). The results suggest that responses were greatest to the second campaign, which allocated its entire budget to television media; that its impact peaked in the sixth month following expenditure; and that the rate of return (measured in metric tons of additional consumption per thousand dollars expended) was around a factor of 20.
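The following is a minimal sketch of the ingredients named in the abstract (a conjugate normal-linear model, Gibbs sampling, and data augmentation for unobserved responses), not the paper's exact specification; the lag length, prior precision, campaign timing, and simulated series are all assumptions.
```python
# Distributed-lag regression of consumption on intermittent promotion
# spending, estimated by Gibbs sampling with missing months imputed
# (data augmentation) each sweep.
import numpy as np

rng = np.random.default_rng(1)
T, L = 59, 8                                   # months, max lag (assumed)
spend = np.zeros(T); spend[[10, 22, 29]] = 1.0 # three intermittent campaigns
X = np.column_stack([np.roll(spend, k) for k in range(L + 1)])
X[:L] = 0.0                                    # drop wrap-around from np.roll
beta_true = np.exp(-0.5 * (np.arange(L + 1) - 5) ** 2)
y = X @ beta_true + rng.normal(0, 0.3, T)
miss = rng.random(T) < 0.1                     # some months unobserved

beta, sig2 = np.zeros(L + 1), 1.0
for sweep in range(2000):
    # Augmentation: draw missing y from the current conditional model.
    y[miss] = X[miss] @ beta + rng.normal(0, np.sqrt(sig2), miss.sum())
    # beta | sigma^2, y  (conjugate normal prior with small precision)
    V = np.linalg.inv(X.T @ X / sig2 + 1e-4 * np.eye(L + 1))
    beta = rng.multivariate_normal(V @ X.T @ y / sig2, V)
    # sigma^2 | beta, y  (scaled inverse chi-square draw)
    resid = y - X @ beta
    sig2 = (resid @ resid) / rng.chisquare(T)
print(np.round(beta, 2))   # lag profile; the peak should sit near lag 5
```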
Abstract:
An improved algorithm for the generation of gridded window brightness temperatures is presented. The primary data source is the International Satellite Cloud Climatology Project, level B3 data, covering the period from July 1983 to the present. The algorithm takes window brightness temperatures from multiple satellites, both geostationary and polar orbiting, which have already been navigated and normalized radiometrically to the National Oceanic and Atmospheric Administration's Advanced Very High Resolution Radiometer, and generates 3-hourly global images on a 0.5 degrees by 0.5 degrees latitude-longitude grid. The gridding uses a hierarchical scheme based on spherical kernel estimators. As part of the gridding procedure, the geostationary data are corrected for limb effects using a simple empirical correction to the radiances, from which the corrected temperatures are computed. This is in addition to the application of satellite zenith angle weighting to downweight limb pixels in preference to nearer-nadir pixels. The polar orbiter data are windowed on the target time with temporal weighting to account for the noncontemporaneous nature of the data. Large regions of missing data are interpolated from adjacent processed images using a form of motion-compensated interpolation based on the estimation of motion vectors with a hierarchical block matching scheme. Examples are shown of the various stages in the process, as well as of the usefulness of this type of data in GCM validation.
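A hedged sketch of two of the ingredients described here (kernel-weighted averaging onto a grid and satellite-zenith-angle downweighting); this is not the ISCCP B3 algorithm itself, and the limb correction and motion-compensated interpolation steps are omitted. The kernel width, window, and synthetic pixels are assumptions.
```python
# Kernel-weighted brightness-temperature gridding over a small window,
# with cos(zenith) weights so limb pixels count less than near-nadir ones.
import numpy as np

def grid_bt(lat, lon, bt, zenith, grid_res=0.5, sigma_deg=0.5,
            lat_rng=(-10.0, 10.0), lon_rng=(-10.0, 10.0)):
    glat = np.arange(lat_rng[0], lat_rng[1], grid_res) + grid_res / 2
    glon = np.arange(lon_rng[0], lon_rng[1], grid_res) + grid_res / 2
    out = np.full((glat.size, glon.size), np.nan)
    w_zen = np.cos(np.radians(zenith))        # downweight limb pixels
    for i, la in enumerate(glat):
        for j, lo in enumerate(glon):
            d2 = (lat - la) ** 2 + (lon - lo) ** 2
            w = np.exp(-d2 / (2 * sigma_deg ** 2)) * w_zen
            if w.sum() > 0:
                out[i, j] = np.sum(w * bt) / w.sum()
    return out

rng = np.random.default_rng(6)
lat, lon = rng.uniform(-10, 10, 5000), rng.uniform(-10, 10, 5000)
bt = 280 + 5 * np.sin(np.radians(lat) * 18) + rng.normal(0, 1, 5000)
zen = rng.uniform(0, 70, 5000)
print(grid_bt(lat, lon, bt, zen).shape)       # (40, 40) regional grid
```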
Abstract:
Relationships between the four families placed in the angiosperm order Fabales (Leguminosae, Polygalaceae, Quillajaceae, Surianaceae) were hitherto poorly resolved. We combine published molecular data for the chloroplast regions matK and rbcL with 66 morphological characters surveyed for 73 ingroup and two outgroup species, and use parsimony and Bayesian approaches to explore matrices with different amounts of missing data. All combined analyses using parsimony recovered the topology Polygalaceae (Leguminosae (Quillajaceae + Surianaceae)). Bayesian analyses with matched morphological and molecular sampling recover the same topology, but analyses based on other data sets recover a different Bayesian topology: ((Polygalaceae + Leguminosae) (Quillajaceae + Surianaceae)). We explore the evolution of floral characters in the context of the more consistent topology: Polygalaceae (Leguminosae (Quillajaceae + Surianaceae)). This reveals the presence of free filaments and marginal/ventral placentation as synapomorphies for (Leguminosae (Quillajaceae + Surianaceae)), pentamery and apocarpy as synapomorphies for (Quillajaceae + Surianaceae), and the presence of an abaxial median sepal and a unicarpellate gynoecium as synapomorphies for Leguminosae. An octamerous androecium is synapomorphic for Polygalaceae. The development of papilionate flowers, and the evolutionary context in which these phenotypes appeared in Leguminosae and Polygalaceae, shows that the morphologies are convergent rather than synapomorphic within Fabales.
Abstract:
Data augmentation is a powerful technique for estimating models with latent or missing data, but applications in agricultural economics have thus far been few. This paper showcases the technique in an application to data on milk market participation in the Ethiopian highlands. There, a key impediment to economic development is an apparently low rate of market participation. Consequently, economic interest centers on the “locations” of nonparticipants in relation to the market and their “reservation values” across covariates. These quantities are of policy interest because they provide measures of the additional inputs necessary for nonparticipants to enter the market. One quantity of primary interest is the minimum amount of surplus milk (the “minimum efficient scale of operations”) that the household must acquire before market participation becomes feasible. We estimate this quantity through routine application of data augmentation and Gibbs sampling applied to a random-censored Tobit regression. Incorporating random censoring markedly affects the household's marketable-surplus requirement but only slightly affects the covariate-requirement estimates and, generally, leads to more plausible policy estimates than those obtained from the zero-censored formulation.
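The sketch below shows the data-augmentation step for a censored (Tobit-type) regression: latent values for censored units are drawn from a truncated normal each Gibbs sweep, after which the conjugate updates proceed as if the data were complete. It is an illustrative zero-censored toy, not the paper's richer random-censoring model; all data and priors are assumptions.
```python
# Gibbs sampler for a zero-censored Tobit via data augmentation.
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
ystar = X @ beta_true + rng.normal(size=n)
y = np.maximum(ystar, 0.0)            # observed: zeros are nonparticipants
cens = ystar <= 0

beta, sig2, z = np.zeros(2), 1.0, y.copy()
draws = []
for sweep in range(3000):
    # Augmentation: latent values for censored units, upper-truncated at 0.
    mu_c = X[cens] @ beta
    b = (0.0 - mu_c) / np.sqrt(sig2)
    z[cens] = truncnorm.rvs(-np.inf, b, loc=mu_c, scale=np.sqrt(sig2),
                            random_state=rng)
    # Conjugate updates for beta and sigma^2 given the completed data z.
    V = np.linalg.inv(X.T @ X / sig2 + 1e-4 * np.eye(2))
    beta = rng.multivariate_normal(V @ X.T @ z / sig2, V)
    resid = z - X @ beta
    sig2 = (resid @ resid) / rng.chisquare(n)
    draws.append(beta.copy())
print(np.mean(draws[500:], axis=0))   # should be near beta_true
```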
Abstract:
Considerable effort is presently being devoted to producing high-resolution sea surface temperature (SST) analyses, with a goal of spatial grid resolutions as low as 1 km. Because grid resolution is not the same as feature resolution, a method is needed to objectively determine the resolution capability and accuracy of SST analysis products. Ocean model SST fields are used in this study as simulated “true” SST data and subsampled based on actual infrared and microwave satellite data coverage. The subsampled data are used to simulate sampling errors due to missing data. Two different SST analyses are considered and run using both the full and the subsampled model SST fields, with and without additional noise. The results are compared as a function of spatial scales of variability using wavenumber auto- and cross-spectral analysis. The spectral variance at high wavenumbers (smallest wavelengths) is shown to be attenuated relative to the true SST because of smoothing that is inherent to both analysis procedures. Comparisons of the two analyses (both having roughly the same grid size) show important differences. One analysis tends to reproduce small-scale features more accurately when the high-resolution data coverage is good but produces more spurious small-scale noise when the high-resolution data coverage is poor. Analysis procedures can thus generate small-scale features with and without data, but the small-scale features in an SST analysis may be just noise when high-resolution data are sparse. Users must therefore be skeptical of high-resolution SST products, especially in regions where high-resolution (~5 km) infrared satellite data are limited because of cloud cover.
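An illustrative check in the spirit of the study: compare the wavenumber spectrum of a synthetic “true” SST transect with that of a smoothed analysis of it; the smoothing shows up as attenuated variance at high wavenumbers. The red-noise field, grid spacing, and boxcar smoother are assumptions, not the study's analyses.
```python
# Spectral attenuation of a smoothed "analysis" relative to "truth".
import numpy as np

rng = np.random.default_rng(3)
n, dx = 1024, 5.0                        # samples and grid spacing in km
k = np.fft.rfftfreq(n, d=dx)             # cycles per km
amp = np.where(k > 0, k ** -1.0, 0.0)    # red-noise amplitude spectrum
truth = np.fft.irfft(amp * (rng.normal(size=k.size) +
                            1j * rng.normal(size=k.size)), n)
analysis = np.convolve(truth, np.ones(9) / 9, mode="same")  # crude smoother

def spectrum(x):
    return np.abs(np.fft.rfft(x)) ** 2

ratio = spectrum(analysis)[1:] / spectrum(truth)[1:]
# Near 1 at low wavenumbers, falling toward 0 at high wavenumbers:
print(ratio[:5].mean(), ratio[-50:].mean())
```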
Abstract:
Background: Dietary intervention studies suggest that flavan-3-ol intake can improve vascular function and reduce the risk of cardiovascular diseases (CVD). However, results from prospective studies have failed to show a consistent beneficial effect. Objective: To investigate associations between flavan-3-ol intake and CVD risk in the Norfolk arm of the European Prospective Investigation into Cancer and Nutrition (EPIC-Norfolk). Design: Data were available from 24,885 participants (11,252 men; 13,633 women), recruited between 1993 and 1997 into the EPIC-Norfolk study. Flavan-3-ol intake was assessed using 7-day food diaries and the FLAVIOLA Flavanol Food Composition database. Missing data for plasma cholesterol and vitamin C were imputed using multiple imputation. Associations between flavan-3-ol intake and blood pressure at baseline were determined using linear regression models. Associations with CVD risk were estimated using Cox regression analyses. Results: Median intake of total flavan-3-ols was 1034 mg/d (range: 0 – 8531 mg/d) for men and 970 mg/d (0 – 6695 mg/d) for women; median intake of flavan-3-ol monomers was 233 mg/d (0 – 3248 mg/d) for men and 217 mg/d (0 – 2712 mg/d) for women. There were no consistent associations between flavan-3-ol monomer intake and baseline systolic and diastolic blood pressure (BP). After 286,147 person-years of follow-up, there were 8463 cardiovascular events and 1987 CVD-related deaths; no consistent association between flavan-3-ol intake and CVD risk (HR 0.93, 95% CI: 0.87, 1.00; Q1 vs Q5) or mortality (HR 0.93, 95% CI: 0.84, 1.04) was observed. Conclusions: Flavan-3-ol intake in EPIC-Norfolk is not sufficient to achieve a statistically significant reduction in CVD risk.
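A hedged sketch of the analysis pattern described (multiple imputation of missing covariates, a Cox model per imputed dataset, then pooling across imputations). scikit-learn and lifelines are stand-ins for whatever software the study used, and all variable names and data are invented, not the EPIC-Norfolk dataset.
```python
# Multiple imputation + Cox regression + Rubin's-rules point estimate.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from lifelines import CoxPHFitter

rng = np.random.default_rng(4)
n = 2000
df = pd.DataFrame({
    "flavanol": rng.gamma(2.0, 500.0, n),       # mg/day (synthetic)
    "cholesterol": rng.normal(5.5, 1.0, n),
    "vitamin_c": rng.normal(60.0, 15.0, n),
    "followup_yrs": rng.exponential(12.0, n),
    "cvd_event": rng.integers(0, 2, n),
})
df.loc[rng.random(n) < 0.2, ["cholesterol", "vitamin_c"]] = np.nan

coefs = []
for m in range(5):                              # 5 imputed datasets
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    comp = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
    comp["cvd_event"] = comp["cvd_event"].round().astype(int)
    cph = CoxPHFitter()
    cph.fit(comp, duration_col="followup_yrs", event_col="cvd_event")
    coefs.append(cph.params_["flavanol"])
# Rubin's rules (point estimate): average the per-imputation coefficients.
print(np.mean(coefs), "pooled log-hazard per mg/day")
```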
Abstract:
Background Cognitive–behavioural therapy (CBT) for childhood anxiety disorders is associated with modest outcomes in the context of parental anxiety disorder. Objectives This study evaluated whether or not the outcome of CBT for children with anxiety disorders in the context of maternal anxiety disorders is improved by the addition of (i) treatment of maternal anxiety disorders, or (ii) treatment focused on maternal responses. The incremental cost-effectiveness of the additional treatments was also evaluated. Design Participants were randomised to receive (i) child cognitive–behavioural therapy (CCBT); (ii) CCBT with CBT to target maternal anxiety disorders [CCBT + maternal cognitive–behavioural therapy (MCBT)]; or (iii) CCBT with an intervention to target mother–child interactions (MCIs) (CCBT + MCI). Setting An NHS university clinic in Berkshire, UK. Participants Two hundred and eleven children with a primary anxiety disorder, whose mothers also had an anxiety disorder. Interventions All families received eight sessions of individual CCBT. Mothers in the CCBT + MCBT arm also received eight sessions of CBT targeting their own anxiety disorders. Mothers in the MCI arm received 10 sessions targeting maternal parenting cognitions and behaviours. Non-specific interventions were delivered to balance groups for therapist contact. Main outcome measures Primary clinical outcomes were the child’s primary anxiety disorder status and degree of improvement at the end of treatment. Follow-up assessments were conducted at 6 and 12 months. Outcomes in the economic analyses were identified and measured using estimated quality-adjusted life-years (QALYs). QALYs were combined with treatment, health and social care costs and presented within an incremental cost–utility analysis framework with associated uncertainty. Results MCBT was associated with significant short-term improvement in maternal anxiety; however, after children had received CCBT, group differences were no longer apparent. CCBT + MCI was associated with a reduction in maternal overinvolvement and more confident expectations of the child. However, neither CCBT + MCBT nor CCBT + MCI conferred a significant post-treatment benefit over CCBT in terms of child anxiety disorder diagnoses [CCBT + MCBT vs. CCBT: adjusted risk ratio (RR) 1.18, 95% confidence interval (CI) 0.87 to 1.62, p = 0.29; CCBT + MCI vs. CCBT: adjusted RR 1.22, 95% CI 0.90 to 1.67, p = 0.20] or global improvement ratings (adjusted RR 1.25, 95% CI 1.00 to 1.59, p = 0.05; adjusted RR 1.20, 95% CI 0.95 to 1.53, p = 0.13). CCBT + MCI outperformed CCBT on some secondary outcome measures. Furthermore, primary economic analyses suggested that, at commonly accepted thresholds of cost-effectiveness, the probability that CCBT + MCI will be cost-effective in comparison with CCBT (plus non-specific interventions) is about 75%. Conclusions Good outcomes were achieved for children and their mothers across treatment conditions. There was no evidence of a benefit to child outcome of supplementing CCBT with either an intervention focusing on maternal anxiety disorder or one focusing on maternal cognitions and behaviours. However, supplementing CCBT with treatment that targeted maternal cognitions and behaviours represented a cost-effective use of resources, although the high percentage of missing data on some economic variables is a shortcoming. Future work should consider whether or not effects of the adjunct interventions are enhanced in particular contexts.
The economic findings highlight the utility of considering a broad range of services when evaluating interventions with this client group. Trial registration Current Controlled Trials ISRCTN19762288. Funding This trial was funded by the Medical Research Council (MRC) and Berkshire Healthcare Foundation Trust and managed by the National Institute for Health Research (NIHR) on behalf of the MRC–NIHR partnership (09/800/17) and will be published in full in Health Technology Assessment; Vol. 19, No. 38.
Abstract:
1. Comparative analyses are used to address the key question of what makes a species more prone to extinction by exploring the links between vulnerability and intrinsic species’ traits and/or extrinsic factors. This approach requires comprehensive species data, but information is rarely available for all species of interest. As a result, comparative analyses often rely on subsets of relatively few species that are assumed to be representative samples of the overall studied group. 2. Our study challenges this assumption and quantifies the taxonomic, spatial, and data-type biases associated with the quantity of data available for 5415 mammalian species using the freely available life-history database PanTHERIA. 3. Moreover, we explore how existing biases influence the results of comparative analyses of extinction risk by using subsets of data that attempt to correct for the detected biases. In particular, we focus on links between four species’ traits commonly linked to vulnerability (distribution range area, adult body mass, population density and gestation length) and conduct univariate and multivariate analyses to understand how biases affect model predictions. 4. Our results show important biases in data availability, with c. 22% of mammals completely lacking data. Missing data, which appear not to be missing at random, occur frequently in all traits (14–99% of cases missing). Data availability is explained by intrinsic traits, with larger mammals occupying bigger range areas being the best studied. Importantly, we find that existing biases affect the results of comparative analyses by overestimating the risk of extinction and changing which traits are identified as important predictors. 5. Our results raise concerns over our ability to draw general conclusions regarding what makes a species more prone to extinction. Missing data represent a prevalent problem in comparative analyses and, unfortunately, because data are not missing at random, conventional approaches to filling data gaps are not valid or present important challenges. These results show the importance of making appropriate inferences from comparative analyses by focusing on the subset of species for which data are available. Ultimately, addressing the data bias problem requires greater investment in data collection and dissemination, as well as the development of methodological approaches to effectively correct existing biases.
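A loose illustration of the core diagnostic (testing whether a trait's missingness is predicted by other traits, i.e. whether data are missing at random); column names and the simulated relationship are hypothetical, not the PanTHERIA schema.
```python
# Is missingness in population density predicted by body mass?
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 5415
mass = rng.lognormal(3, 2, n)                   # adult body mass, g (toy)
# Make small-bodied species more likely to lack population-density data.
p_missing = 1 / (1 + np.exp(0.8 * (np.log(mass) - 3)))
density = np.where(rng.random(n) < p_missing, np.nan, rng.lognormal(2, 1, n))
df = pd.DataFrame({"body_mass": mass, "pop_density": density})

ybin = df["pop_density"].isna().astype(int)     # 1 = missing
Xmat = sm.add_constant(np.log(df["body_mass"]))
fit = sm.Logit(ybin, Xmat).fit(disp=0)
# A clearly nonzero slope indicates data are not missing at random.
print(fit.params)
```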
Abstract:
The common GIS-based approach to regional analyses of soil organic carbon (SOC) stocks and changes is to define geographic layers for which unique sets of driving variables are derived, including land use, climate, and soils. These GIS layers, with their associated attribute data, can then be fed into a range of empirical and dynamic models. Common methodologies for collating and formatting regional data sets on land use, climate, and soils were adopted for the project Assessment of Soil Organic Carbon Stocks and Changes at National Scale (GEFSOC). This permitted the development of a uniform protocol for handling the various inputs to the dynamic GEFSOC Modelling System. Consistent soil data sets for Amazon-Brazil, the Indo-Gangetic Plains (IGP) of India, Jordan and Kenya, the case study areas considered in the GEFSOC project, were prepared using methodologies developed for the World Soils and Terrain Database (SOTER). The approach involved three main stages: (1) compiling new soil geographic and attribute data in SOTER format; (2) using expert estimates and common sense to fill selected gaps in the measured or primary data; and (3) using a scheme of taxonomy-based pedotransfer rules and expert rules to derive soil parameter estimates for similar soil units with missing soil analytical data. The most appropriate approach varied from country to country, depending largely on the overall accessibility and quality of the primary soil data available in the case study areas. The secondary SOTER data sets discussed here are appropriate for a wide range of environmental applications at national scale, including agro-ecological zoning, land evaluation, modelling of soil C stocks and changes, and studies of soil vulnerability to pollution. Estimates of national-scale stocks of SOC, calculated using SOTER methods, are presented as a first example of database application. Independent estimates of SOC stocks are needed to evaluate the outcome of the GEFSOC Modelling System for current conditions of land use and climate.
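A minimal sketch of the flavour of stage (3), taxonomy-based gap filling: a missing soil attribute inherits a value derived from measured records of the same soil unit. The table layout, attribute names, and median rule are invented for illustration, not the SOTER pedotransfer scheme.
```python
# Fill missing organic-carbon values from the same soil unit's median.
import numpy as np
import pandas as pd

soter = pd.DataFrame({
    "soil_unit": ["Ferralsol", "Ferralsol", "Vertisol", "Vertisol", "Vertisol"],
    "org_c_pct": [1.8, np.nan, 0.9, 1.1, np.nan],
})
soter["org_c_pct"] = (soter.groupby("soil_unit")["org_c_pct"]
                           .transform(lambda s: s.fillna(s.median())))
print(soter)
```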
Abstract:
Constructing biodiversity richness maps from Environmental Niche Models (ENMs) of thousands of species is time consuming. A separate species-occurrence data pre-processing phase enables the experimenter to control test AUC score variance due to species dataset size. Besides removing duplicate occurrences and points with missing environmental data, we discuss the need for coordinate-precision, wide-dispersion, temporal, and synonymity filters. After species data filtering, the final task of a pre-processing phase should be the automatic generation of species occurrence datasets which can then be directly ‘plugged in’ to the ENM. A software application capable of carrying out all these tasks will be a valuable time-saver, particularly for large-scale biodiversity studies.
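A sketch of three of the filters discussed (duplicates, missing environmental data, coordinate precision); the column names and the string-based precision test are assumptions, not a specific ENM tool's schema, and the temporal, dispersion, and synonymity filters are omitted.
```python
# Occurrence-record pre-processing filters for ENM input.
import pandas as pd

def preprocess(occ: pd.DataFrame, min_decimals: int = 2) -> pd.DataFrame:
    occ = occ.drop_duplicates(subset=["species", "lat", "lon"])
    occ = occ.dropna(subset=["env1", "env2"])   # missing environmental data
    # Coordinate-precision filter: require at least `min_decimals` decimals.
    def precise(v):
        s = f"{v}"
        return "." in s and len(s.split(".")[1]) >= min_decimals
    return occ[occ["lat"].map(precise) & occ["lon"].map(precise)]

occ = pd.DataFrame({"species": ["a", "a", "b"],
                    "lat": [1.234, 1.234, 2.5],
                    "lon": [3.456, 3.456, 4.5],
                    "env1": [1.0, 1.0, None],
                    "env2": [2.0, 2.0, 2.0]})
print(preprocess(occ))   # one record survives deduplication and filtering
```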
Abstract:
This contribution investigates the problem of estimating the size of a population, also known as the missing-cases problem. Suppose a registration system aims to identify all cases having a certain characteristic, such as a specific disease (cancer, heart disease, ...), a disease-related condition (HIV, heroin use, ...) or a specific behavior (driving a car without a license). Every case in such a registration system has a certain notification history in that it might have been identified several times (at least once), which can be understood as a particular capture-recapture situation. Cases that have never been listed on any occasion are necessarily left out, and it is this frequency one wants to estimate. In this paper, modelling concentrates on the counting distribution, i.e. the distribution of the variable that counts how often a given case has been identified by the registration system. Besides very simple models like the binomial or Poisson distribution, finite (nonparametric) mixtures of these are considered, providing rather flexible modelling tools. Estimation is done by maximum likelihood by means of the EM algorithm. A case study on heroin users in Bangkok in the year 2001 completes the contribution.
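The simplest version of this idea, sketched below under assumed toy data: fit a zero-truncated Poisson to the observed identification counts, then estimate the never-identified zero class from the implied zero probability. The paper's finite-mixture models fitted by EM generalize this single-Poisson case.
```python
# Zero-truncated Poisson fit and missing zero-class estimation.
import numpy as np
from scipy.optimize import brentq

counts = np.array([1]*1000 + [2]*300 + [3]*80 + [4]*15)  # toy notifications
n, xbar = counts.size, counts.mean()

# The MLE of lambda for a zero-truncated Poisson solves
# lambda / (1 - exp(-lambda)) = sample mean.
lam = brentq(lambda l: l / (1 - np.exp(-l)) - xbar, 1e-6, 50)
p0 = np.exp(-lam)               # implied probability of never being listed
N_hat = n / (1 - p0)            # estimated total population, zero class included
print(lam, round(N_hat))
```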
Abstract:
Estimation of population size with a missing zero class is an important problem encountered in epidemiological assessment studies. Fitting a Poisson model to the observed data by the method of maximum likelihood and estimating the population size based on this fit is an approach that has been widely used for this purpose. In practice, however, the Poisson assumption is seldom satisfied. Zelterman (1988) has proposed a robust estimator for unclustered data that works well in a wide class of distributions applicable to count data. In the work presented here, we extend this estimator to clustered data. The estimator requires fitting a zero-truncated homogeneous Poisson model by maximum likelihood and then using a Horvitz-Thompson estimator of population size. This was found to work well when the data follow the hypothesized homogeneous Poisson model. However, when the true distribution deviates from the hypothesized model, the population size was found to be underestimated. In search of a more robust estimator, we focused on three models that use, respectively, all clusters with exactly one case, those with exactly two cases, and those with exactly three cases to estimate the probability of the zero class, and thereby use data collected on all the clusters in the Horvitz-Thompson estimator of population size. The loss in efficiency associated with the gain in robustness was examined in a simulation study. As a trade-off between gain in robustness and loss in efficiency, the model that uses data collected on clusters with at most three cases to estimate the probability of the zero class was found to be preferable in general. In applications, we recommend obtaining estimates from all three models and making a choice in light of the three estimates, robustness, and the loss in efficiency.
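For orientation, here is the unclustered Zelterman (1988) estimator that the paper extends: it rests only on the singleton and doubleton frequencies, which is what makes it robust to departures from the Poisson model. The toy counts are assumptions.
```python
# Zelterman's robust estimator of population size (unclustered case).
import numpy as np

counts = np.array([1]*1000 + [2]*300 + [3]*80 + [4]*15)
n = counts.size
f1, f2 = np.sum(counts == 1), np.sum(counts == 2)
lam = 2 * f2 / f1               # rate estimate from singletons and doubletons
N_hat = n / (1 - np.exp(-lam))  # Horvitz-Thompson estimator of population size
print(round(N_hat))
```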
Abstract:
Recently, major processor manufacturers have announced a dramatic shift in their paradigm to increase computing power over the coming years. Instead of focusing on faster clock speeds and more powerful single-core CPUs, the trend clearly goes towards multi-core systems. This will also result in a paradigm shift for the development of algorithms for computationally expensive tasks, such as data mining applications. Obviously, work on parallel algorithms is not new per se, but concentrated efforts in the many application domains are still missing. Multi-core systems, but also clusters of workstations and even large-scale distributed computing infrastructures, provide new opportunities and pose new challenges for the design of parallel and distributed algorithms. Since data mining and machine learning systems rely on high performance computing systems, research on the corresponding algorithms must be at the forefront of parallel algorithm research in order to keep pushing data mining and machine learning applications to be more powerful and, especially for the former, interactive. To bring together researchers and practitioners working in this exciting field, a workshop on parallel data mining was organized as part of PKDD/ECML 2006 (Berlin, Germany). The six contributions selected for the program describe various aspects of data mining and machine learning approaches featuring low to high degrees of parallelism: the first contribution addresses the classic problem of distributed association rule mining and focuses on communication efficiency to improve the state of the art. After this, a parallelization technique for speeding up decision tree construction by means of thread-level parallelism for shared memory systems is presented. The next paper discusses the design of a parallel approach for distributed memory systems of the frequent subgraphs mining problem. This approach is based on a hierarchical communication topology to solve issues related to multi-domain computational environments. The fourth paper describes the combined use and the customization of software packages to facilitate a top-down parallelism in the tuning of Support Vector Machines (SVM), and the next contribution presents an interesting idea concerning parallel training of Conditional Random Fields (CRFs) and motivates their use in labeling sequential data. The last contribution finally focuses on very efficient feature selection, describing a parallel algorithm for feature selection from random subsets. Selecting the papers included in this volume would not have been possible without the help of an international Program Committee that provided detailed reviews for each paper. We would also like to thank Matthew Otey, who helped with publicity for the workshop.
Abstract:
Current methods for estimating vegetation parameters are generally sub-optimal in the way they exploit information and do not generally consider uncertainties. We look forward to a future where operational data assimilation schemes improve estimates by tracking land surface processes and exploiting multiple types of observations. Data assimilation schemes seek to combine observations and models in a statistically optimal way, taking into account uncertainty in both, but have not yet been much exploited in this area. The EO-LDAS scheme and prototype, developed under ESA funding, is designed to exploit the anticipated wealth of data that will be available under GMES missions, such as the Sentinel family of satellites, to provide improved mapping of land surface biophysical parameters. This paper describes the EO-LDAS implementation and explores some of its core functionality. EO-LDAS is a weak-constraint variational data assimilation system. The prototype provides a mechanism for constraint based on a prior estimate of the state vector, a linear dynamic model, and Earth Observation data (top-of-canopy reflectance here). The observation operator is a non-linear optical radiative transfer model for a vegetation canopy with a soil lower boundary, operating over the range 400 to 2500 nm. Adjoint codes for all model and operator components are provided in the prototype by automatic differentiation of the computer codes. In this paper, EO-LDAS is applied to the problem of daily estimation of six of the parameters controlling the radiative transfer operator over the course of a year (> 2000 state vector elements). Zero- and first-order process model constraints are implemented and explored as the dynamic model. The assimilation estimates all state vector elements simultaneously. This is performed in the context of a typical Sentinel-2 MSI operating scenario, using synthetic MSI observations simulated with the observation operator, with uncertainties typical of those achieved by optical sensors assumed for the data. The experiments consider a baseline state vector estimation case where dynamic constraints are applied, and assess the impact of dynamic constraints on the a posteriori uncertainties. The results demonstrate that reductions in uncertainty by a factor of up to two might be obtained by applying the sorts of dynamic constraints used here. The hyperparameter (dynamic model uncertainty) required to control the assimilation is estimated by a cross-validation exercise. The result of the assimilation is seen to be robust to missing observations with quite large data gaps.
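A toy variational sketch of the underlying idea: minimize a prior-plus-observation cost function. The real EO-LDAS observation operator is a radiative transfer model with adjoint code; here H is a trivial linear stand-in, and all numbers are assumptions.
```python
# Minimal variational data assimilation: minimize
# J(x) = 0.5 (x - x_b)' B^-1 (x - x_b) + 0.5 (y - H(x))' R^-1 (y - H(x)).
import numpy as np
from scipy.optimize import minimize

x_b = np.array([0.5, 0.3])                 # prior state (e.g., two parameters)
B_inv = np.diag([1 / 0.2**2, 1 / 0.1**2])  # prior inverse covariance
y = np.array([0.42])                       # one synthetic observation
R_inv = np.diag([1 / 0.05**2])             # observation inverse covariance

def H(x):                                  # stand-in observation operator
    return np.array([0.6 * x[0] + 0.4 * x[1]])

def cost(x):
    dxb, dy = x - x_b, y - H(x)
    return 0.5 * dxb @ B_inv @ dxb + 0.5 * dy @ R_inv @ dy

xa = minimize(cost, x_b).x                 # gradient approximated numerically
print(xa)                                  # posterior (analysis) state
```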