89 resultados para missing data imputation

em Biblioteca Digital da Produção Intelectual da Universidade de São Paulo (BDPI/USP)


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Grass reference evapotranspiration (ETo) is an important agrometeorological parameter for climatological and hydrological studies, as well as for irrigation planning and management. There are several methods to estimate ETo, but their performance in different environments is diverse, since all of them have some empirical background. The FAO Penman-Monteith (FAD PM) method has been considered as a universal standard to estimate ETo for more than a decade. This method considers many parameters related to the evapotranspiration process: net radiation (Rn), air temperature (7), vapor pressure deficit (Delta e), and wind speed (U); and has presented very good results when compared to data from lysimeters Populated with short grass or alfalfa. In some conditions, the use of the FAO PM method is restricted by the lack of input variables. In these cases, when data are missing, the option is to calculate ETo by the FAD PM method using estimated input variables, as recommended by FAD Irrigation and Drainage Paper 56. Based on that, the objective of this study was to evaluate the performance of the FAO PM method to estimate ETo when Rn, Delta e, and U data are missing, in Southern Ontario, Canada. Other alternative methods were also tested for the region: Priestley-Taylor, Hargreaves, and Thornthwaite. Data from 12 locations across Southern Ontario, Canada, were used to compare ETo estimated by the FAD PM method with a complete data set and with missing data. The alternative ETo equations were also tested and calibrated for each location. When relative humidity (RH) and U data were missing, the FAD PM method was still a very good option for estimating ETo for Southern Ontario, with RMSE smaller than 0.53 mm day(-1). For these cases, U data were replaced by the normal values for the region and Delta e was estimated from temperature data. The Priestley-Taylor method was also a good option for estimating ETo when U and Delta e data were missing, mainly when calibrated locally (RMSE = 0.40 mm day(-1)). When Rn was missing, the FAD PM method was not good enough for estimating ETo, with RMSE increasing to 0.79 mm day(-1). When only T data were available, adjusted Hargreaves and modified Thornthwaite methods were better options to estimate ETo than the FAO) PM method, since RMSEs from these methods, respectively 0.79 and 0.83 mm day(-1), were significantly smaller than that obtained by FAO PM (RMSE = 1.12 mm day(-1). (C) 2009 Elsevier B.V. All rights reserved.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

When missing data occur in studies designed to compare the accuracy of diagnostic tests, a common, though naive, practice is to base the comparison of sensitivity, specificity, as well as of positive and negative predictive values on some subset of the data that fits into methods implemented in standard statistical packages. Such methods are usually valid only under the strong missing completely at random (MCAR) assumption and may generate biased and less precise estimates. We review some models that use the dependence structure of the completely observed cases to incorporate the information of the partially categorized observations into the analysis and show how they may be fitted via a two-stage hybrid process involving maximum likelihood in the first stage and weighted least squares in the second. We indicate how computational subroutines written in R may be used to fit the proposed models and illustrate the different analysis strategies with observational data collected to compare the accuracy of three distinct non-invasive diagnostic methods for endometriosis. The results indicate that even when the MCAR assumption is plausible, the naive partial analyses should be avoided.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

We review some issues related to the implications of different missing data mechanisms on statistical inference for contingency tables and consider simulation studies to compare the results obtained under such models to those where the units with missing data are disregarded. We confirm that although, in general, analyses under the correct missing at random and missing completely at random models are more efficient even for small sample sizes, there are exceptions where they may not improve the results obtained by ignoring the partially classified data. We show that under the missing not at random (MNAR) model, estimates on the boundary of the parameter space as well as lack of identifiability of the parameters of saturated models may be associated with undesirable asymptotic properties of maximum likelihood estimators and likelihood ratio tests; even in standard cases the bias of the estimators may be low only for very large samples. We also show that the probability of a boundary solution obtained under the correct MNAR model may be large even for large samples and that, consequently, we may not always conclude that a MNAR model is misspecified because the estimate is on the boundary of the parameter space.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Some factors complicate comparisons between linkage maps from different studies. This problem can be resolved if measures of precision, such as confidence intervals and frequency distributions, are associated with markers. We examined the precision of distances and ordering of microsatellite markers in the consensus linkage maps of chromosomes 1, 3 and 4 from two F 2 reciprocal Brazilian chicken populations, using bootstrap sampling. Single and consensus maps were constructed. The consensus map was compared with the International Consensus Linkage Map and with the whole genome sequence. Some loci showed segregation distortion and missing data, but this did not affect the analyses negatively. Several inversions and position shifts were detected, based on 95% confidence intervals and frequency distributions of loci. Some discrepancies in distances between loci and in ordering were due to chance, whereas others could be attributed to other effects, including reciprocal crosses, sampling error of the founder animals from the two populations, F(2) population structure, number of and distance between microsatellite markers, number of informative meioses, loci segregation patterns, and sex. In the Brazilian consensus GGA1, locus LEI1038 was in a position closer to the true genome sequence than in the International Consensus Map, whereas for GGA3 and GGA4, no such differences were found. Extending these analyses to the remaining chromosomes should facilitate comparisons and the integration of several available genetic maps, allowing meta-analyses for map construction and quantitative trait loci (QTL) mapping. The precision of the estimates of QTL positions and their effects would be increased with such information.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Background: Cerebral palsy (CP) patients have motor limitations that can affect functionality and abilities for activities of daily living (ADL). Health related quality of life and health status instruments validated to be applied to these patients do not directly approach the concepts of functionality or ADL. The Child Health Assessment Questionnaire (CHAQ) seems to be a good instrument to approach this dimension, but it was never used for CP patients. The purpose of the study was to verify the psychometric properties of CHAQ applied to children and adolescents with CP. Methods: Parents or guardians of children and adolescents with CP, aged 5 to 18 years, answered the CHAQ. A healthy group of 314 children and adolescents was recruited during the validation of the CHAQ Brazilian-version. Data quality, reliability and validity were studied. The motor function was evaluated by the Gross Motor Function Measure (GMFM). Results: Ninety-six parents/guardians answered the questionnaire. The age of the patients ranged from 5 to 17.9 years (average: 9.3). The rate of missing data was low(< 9.3%). The floor effect was observed in two domains, being higher only in the visual analogue scales (<= 35.5%). The ceiling effect was significant in all domains and particularly high in patients with quadriplegia (81.8 to 90.9%) and extrapyramidal (45.4 to 91.0%). The Cronbach alpha coefficient ranged from 0.85 to 0.95. The validity was appropriate: for the discriminant validity the correlation of the disability index with the visual analogue scales was not significant; for the convergent validity CHAQ disability index had a strong correlation with the GMFM (0.77); for the divergent validity there was no correlation between GMFM and the pain and overall evaluation scales; for the criterion validity GMFM as well as CHAQ detected differences in the scores among the clinical type of CP (p < 0.01); for the construct validity, the patients' disability index score (mean: 2.16; SD: 0.72) was higher than the healthy group ( mean: 0.12; SD: 0.23)(p < 0.01). Conclusion: CHAQ reliability and validity were adequate to this population. However, further studies are necessary to verify the influence of the ceiling effect on the responsiveness of the instrument.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

When building genetic maps, it is necessary to choose from several marker ordering algorithms and criteria, and the choice is not always simple. In this study, we evaluate the efficiency of algorithms try (TRY), seriation (SER), rapid chain delineation (RCD), recombination counting and ordering (RECORD) and unidirectional growth (UG), as well as the criteria PARF (product of adjacent recombination fractions), SARF (sum of adjacent recombination fractions), SALOD (sum of adjacent LOD scores) and LHMC (likelihood through hidden Markov chains), used with the RIPPLE algorithm for error verification, in the construction of genetic linkage maps. A linkage map of a hypothetical diploid and monoecious plant species was simulated containing one linkage group and 21 markers with fixed distance of 3 cM between them. In all, 700 F(2) populations were randomly simulated with and 400 individuals with different combinations of dominant and co-dominant markers, as well as 10 and 20% of missing data. The simulations showed that, in the presence of co-dominant markers only, any combination of algorithm and criteria may be used, even for a reduced population size. In the case of a smaller proportion of dominant markers, any of the algorithms and criteria (except SALOD) investigated may be used. In the presence of high proportions of dominant markers and smaller samples (around 100), the probability of repulsion linkage increases between them and, in this case, use of the algorithms TRY and SER associated to RIPPLE with criterion LHMC would provide better results. Heredity (2009) 103, 494-502; doi:10.1038/hdy.2009.96; published online 29 July 2009

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Crown group Archosauria, which includes birds, dinosaurs, crocodylomorphs, and several extinct Mesozoic groups, is a primary division of the vertebrate tree of life. However, the higher-level phylogenetic relationships within Archosauria are poorly resolved and controversial, despite years of study. The phylogeny of crocodile-line archosaurs (Crurotarsi) is particularly contentious, and has been plagued by problematic taxon and character sampling. Recent discoveries and renewed focus on archosaur anatomy enable the compilation of a new dataset, which assimilates and standardizes character data pertinent to higher-level archosaur phylogeny, and is scored across the largest group of taxa yet analysed. This dataset includes 47 new characters (25% of total) and eight taxa that have yet to be included in an analysis, and total taxonomic sampling is more than twice that of any previous study. This analysis produces a well-resolved phylogeny, which recovers mostly traditional relationships within Avemetatarsalia, places Phytosauria as a basal crurotarsan clade, finds a close relationship between Aetosauria and Crocodylomorpha, and recovers a monophyletic Rauisuchia comprised of two major subclades. Support values are low, suggesting rampant homoplasy and missing data within Archosauria, but the phylogeny is highly congruent with stratigraphy. Comparison with alternative analyses identifies numerous scoring differences, but indicates that character sampling is the main source of incongruence. The phylogeny implies major missing lineages in the Early Triassic and may support a Carnian-Norian extinction event.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Background: Genome wide association studies (GWAS) are becoming the approach of choice to identify genetic determinants of complex phenotypes and common diseases. The astonishing amount of generated data and the use of distinct genotyping platforms with variable genomic coverage are still analytical challenges. Imputation algorithms combine directly genotyped markers information with haplotypic structure for the population of interest for the inference of a badly genotyped or missing marker and are considered a near zero cost approach to allow the comparison and combination of data generated in different studies. Several reports stated that imputed markers have an overall acceptable accuracy but no published report has performed a pair wise comparison of imputed and empiric association statistics of a complete set of GWAS markers. Results: In this report we identified a total of 73 imputed markers that yielded a nominally statistically significant association at P < 10(-5) for type 2 Diabetes Mellitus and compared them with results obtained based on empirical allelic frequencies. Interestingly, despite their overall high correlation, association statistics based on imputed frequencies were discordant in 35 of the 73 (47%) associated markers, considerably inflating the type I error rate of imputed markers. We comprehensively tested several quality thresholds, the haplotypic structure underlying imputed markers and the use of flanking markers as predictors of inaccurate association statistics derived from imputed markers. Conclusions: Our results suggest that association statistics from imputed markers showing specific MAF (Minor Allele Frequencies) range, located in weak linkage disequilibrium blocks or strongly deviating from local patterns of association are prone to have inflated false positive association signals. The present study highlights the potential of imputation procedures and proposes simple procedures for selecting the best imputed markers for follow-up genotyping studies.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

The substitution of missing values, also called imputation, is an important data preparation task for many domains. Ideally, the substitution of missing values should not insert biases into the dataset. This aspect has been usually assessed by some measures of the prediction capability of imputation methods. Such measures assume the simulation of missing entries for some attributes whose values are actually known. These artificially missing values are imputed and then compared with the original values. Although this evaluation is useful, it does not allow the influence of imputed values in the ultimate modelling task (e.g. in classification) to be inferred. We argue that imputation cannot be properly evaluated apart from the modelling task. Thus, alternative approaches are needed. This article elaborates on the influence of imputed values in classification. In particular, a practical procedure for estimating the inserted bias is described. As an additional contribution, we have used such a procedure to empirically illustrate the performance of three imputation methods (majority, naive Bayes and Bayesian networks) in three datasets. Three classifiers (decision tree, naive Bayes and nearest neighbours) have been used as modelling tools in our experiments. The achieved results illustrate a variety of situations that can take place in the data preparation practice.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Interval-censored survival data, in which the event of interest is not observed exactly but is only known to occur within some time interval, occur very frequently. In some situations, event times might be censored into different, possibly overlapping intervals of variable widths; however, in other situations, information is available for all units at the same observed visit time. In the latter cases, interval-censored data are termed grouped survival data. Here we present alternative approaches for analyzing interval-censored data. We illustrate these techniques using a survival data set involving mango tree lifetimes. This study is an example of grouped survival data.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Advances in diagnostic research are moving towards methods whereby the periodontal risk can be identified and quantified by objective measures using biomarkers. Patients with periodontitis may have elevated circulating levels of specific inflammatory markers that can be correlated to the severity of the disease. The purpose of this study was to evaluate whether differences in the serum levels of inflammatory biomarkers are differentially expressed in healthy and periodontitis patients. Twenty-five patients (8 healthy patients and 17 chronic periodontitis patients) were enrolled in the study. A 15 mL blood sample was used for identification of the inflammatory markers, with a human inflammatory flow cytometry multiplex assay. Among 24 assessed cytokines, only 3 (RANTES, MIG and Eotaxin) were statistically different between groups (p<0.05). In conclusion, some of the selected markers of inflammation are differentially expressed in healthy and periodontitis patients. Cytokine profile analysis may be further explored to distinguish the periodontitis patients from the ones free of disease and also to be used as a measure of risk. The present data, however, are limited and larger sample size studies are required to validate the findings of the specific biomarkers.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

ABSTRACT Microphysical and thermodynamical features of two tropical systems, namely Hurricane Ivan and Typhoon Conson, and one sub-tropical, Catarina, have been analyzed based on space-born radar PR measurements available on the TRMM satellite. The procedure to classify the reflectivity profiles followed the Heymsfield et al (2000) and Steiner et al (1995) methodologies. The water and ice content have been calculated using a relationship obtained with data of the surface SPOL radar and PR in Rondonia State in Brazil. The diabatic heating rate due to latent heat release has been estimated using the methodology developed by Tao et al (1990). A more detailed analysis has been performed for Hurricane Catarina, the first of its kind in South Atlantic. High water content mean value has been found in Conson and Ivan at low levels and close to their centers. Results indicate that hurricane Catarina was shallower than the other two systems, with less water and the water was concentrated closer to its center. The mean ice content in Catarina was about 0.05 g kg-1 while in Conson it was 0.06 g kg-1 and in Ivan 0.08 g kg-1. Conson and Ivan had water content up to 0.3 g kg-1 above the 0ºC layer, while Catarina had less than 0.15 g kg-1. The latent heat released by Catarina showed to be very similar to the other two systems, except in the regions closer to the center.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Lepidocharax, new genus, and Lepidocharax diamantina and L. burnsi new species from eastern Brazil are described herein. Lepidocharax is considered a monophyletic genus of the Stevardiinae and can be distinguished from the other members of this subfamily except Planaltina, Pseudocorynopoma, and Xenurobrycon by having the dorsal-fin origin vertically aligned with the anal-fin origin, vs. dorsal fin origin anterior or posterior to anal-fin origin. Additionally the new genus can be distinguished from those three genera by not having the scales extending over the ventral caudal-fin lobe modified to form the dorsal border of the pheromone pouch organ or to represent a pouch scale in sexually mature males. In this paper, we describe these two recently discovered species and the ultrastructure of their spermatozoa.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

During the exploration and mapping of new caves in Serra do Ramalho karst area, southern Bahia state, cavers from the Grupo Bambuí de Pesquisas Espeleológicas - GBPE (Belo Horizonte) noticed the presence of troglomorphic catfishes (species with reduced eyes and/or melanic pigmentation), which we intensively investigated with regards to their ecology and behavior since 2005. Non-troglomorphic fishes regularly found in the studied caves were included in this investigation. We present here data on the natural history of two troglobitic (exclusively subterranean troglomorphic species) fishes - Rhamdia enfurnada Bichuette & Trajano, 2005 (Heptapteridae; Gruna do Enfurnado) and Trichomycterus undescribed species (Trichomycteridae; Lapa dos Peixes and Gruna da Água Clara), and non-troglomorphic Hoplias cf. malabaricus, probably a troglophile (able to form populations both in epigean and subterranean habitats) in the Gruna do Enfurnado, and Pimelodella sp., a species with a sink population in the Lapa dos Peixes.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Geographic Data Warehouses (GDW) are one of the main technologies used in decision-making processes and spatial analysis, and the literature proposes several conceptual and logical data models for GDW. However, little effort has been focused on studying how spatial data redundancy affects SOLAP (Spatial On-Line Analytical Processing) query performance over GDW. In this paper, we investigate this issue. Firstly, we compare redundant and non-redundant GDW schemas and conclude that redundancy is related to high performance losses. We also analyze the issue of indexing, aiming at improving SOLAP query performance on a redundant GDW. Comparisons of the SB-index approach, the star-join aided by R-tree and the star-join aided by GiST indicate that the SB-index significantly improves the elapsed time in query processing from 25% up to 99% with regard to SOLAP queries defined over the spatial predicates of intersection, enclosure and containment and applied to roll-up and drill-down operations. We also investigate the impact of the increase in data volume on the performance. The increase did not impair the performance of the SB-index, which highly improved the elapsed time in query processing. Performance tests also show that the SB-index is far more compact than the star-join, requiring only a small fraction of at most 0.20% of the volume. Moreover, we propose a specific enhancement of the SB-index to deal with spatial data redundancy. This enhancement improved performance from 80 to 91% for redundant GDW schemas.