965 results for Bayesian p-values
Abstract:
Surveys collect important data that inform policy decisions and drive social science research. Large government surveys gather information from the U.S. population on a wide range of topics, including demographics, education, employment, and lifestyle. Analysis of survey data presents unique challenges: in particular, one must account for missing data, complex sampling designs, and measurement error. Conceptually, a survey organization could devote substantial resources to obtaining high-quality responses from a simple random sample, yielding survey data that are easy to analyze. In practice, however, this scenario is rarely realistic. To address these practical issues, survey organizations can leverage information available from other data sources. For example, in longitudinal studies that suffer from attrition, they can use information from refreshment samples to correct for potential attrition bias. They can use information from known marginal distributions or from the survey design to improve inferences. They can use information from gold standard sources to correct for measurement error.
This thesis presents novel approaches to combining information from multiple sources that address the three problems described above.
The first method addresses nonignorable unit nonresponse and attrition in a panel survey with a refreshment sample. Panel surveys typically suffer from attrition, which can lead to biased inference when basing analysis only on cases that complete all waves of the panel. Unfortunately, the panel data alone cannot inform the extent of the bias due to attrition, so analysts must make strong and untestable assumptions about the missing data mechanism. Many panel studies also include refreshment samples, which are data collected from a random sample of new individuals during some later wave of the panel. Refreshment samples offer information that can be utilized to correct for biases induced by nonignorable attrition while reducing reliance on strong assumptions about the attrition process. To date, these bias correction methods have not dealt with two key practical issues in panel studies: unit nonresponse in the initial wave of the panel and in the refreshment sample itself. As we illustrate, nonignorable unit nonresponse can significantly compromise the analyst's ability to use the refreshment samples for attrition bias correction. Thus, it is crucial for analysts to assess how sensitive their inferences---corrected for panel attrition---are to different assumptions about the nature of the unit nonresponse. We present an approach that facilitates such sensitivity analyses, both for suspected nonignorable unit nonresponse in the initial wave and in the refreshment sample. We illustrate the approach using simulation studies and an analysis of data from the 2007-2008 Associated Press/Yahoo News election panel study.
The second method incorporates informative prior beliefs about marginal probabilities into Bayesian latent class models for categorical data. The basic idea is to append synthetic observations to the original data such that (i) the empirical distributions of the desired margins match those of the prior beliefs, and (ii) the values of the remaining variables are left missing. The degree of prior uncertainty is controlled by the number of augmented records. Posterior inferences can be obtained via typical MCMC algorithms for latent class models, tailored to deal efficiently with the missing values in the concatenated data. We illustrate the approach using a variety of simulations based on data from the American Community Survey, including an example of how augmented records can be used to fit latent class models to data from stratified samples.
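As an illustration of the augmentation idea, the sketch below (with hypothetical variable names and prior values, not the ACS application itself) appends synthetic records whose values on the constrained margin reproduce a stated prior belief while every other variable is left missing; the number of appended records controls how strongly the prior is enforced.

```python
# Illustrative sketch of the data-augmentation prior: append synthetic rows
# whose empirical distribution on one margin matches the prior belief and
# whose remaining variables are missing (to be imputed by the latent class MCMC).
import numpy as np
import pandas as pd

def append_margin_records(df, variable, prior_probs, n_aug):
    """Append n_aug synthetic rows: `variable` drawn so that its empirical
    distribution matches prior_probs (up to rounding); all other columns missing."""
    levels = list(prior_probs)
    counts = np.round(n_aug * np.array([prior_probs[l] for l in levels])).astype(int)
    synth = pd.DataFrame({variable: np.repeat(levels, counts)})
    for col in df.columns:
        if col != variable:
            synth[col] = np.nan            # left missing on purpose
    return pd.concat([df, synth], ignore_index=True)

# Toy categorical survey data (hypothetical).
df = pd.DataFrame({"sex": ["F", "M", "F", "M"],
                   "employed": ["yes", "no", "yes", "yes"]})

# Prior belief: P(sex = F) = 0.52; n_aug controls the weight given to it.
augmented = append_margin_records(df, "sex", {"F": 0.52, "M": 0.48}, n_aug=100)
print(augmented.tail())
```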
The third method leverages information from a gold standard survey to model reporting error. Survey data are subject to reporting error when respondents misunderstand the question or accidentally select the wrong response. Sometimes respondents knowingly select the wrong response, for example by reporting a higher level of education than they have actually attained. We present an approach that allows an analyst to model reporting error by incorporating information from a gold standard survey. The analyst can specify various reporting error models and assess how sensitive the resulting conclusions are to different assumptions about the reporting error process. We illustrate the approach using simulations based on data from the 1993 National Survey of College Graduates. We use the method to impute error-corrected educational attainments in the 2010 American Community Survey, using the 2010 National Survey of College Graduates as the gold standard survey.
Abstract:
A Bayesian optimisation algorithm for a nurse scheduling problem is presented, which involves choosing a suitable scheduling rule from a set for each nurse's assignment. When a human scheduler works, he normally builds a schedule systematically, following a set of rules. After much practice, the scheduler gradually masters the knowledge of which solution parts go well with others. He can identify good parts and is aware of the solution quality even if the scheduling process is not yet complete, and thus has the ability to finish a schedule by using flexible, rather than fixed, rules. In this paper, we design a more human-like scheduling algorithm by using a Bayesian optimisation algorithm to implement explicit learning from past solutions. A nurse scheduling problem from a UK hospital is used for testing. Unlike our previous work that used Genetic Algorithms to implement implicit learning [1], the learning in the proposed algorithm is explicit, i.e. we identify and mix building blocks directly. The Bayesian optimisation algorithm implements such explicit learning by building a Bayesian network of the joint distribution of solutions. The conditional probability of each variable in the network is computed according to an initial set of promising solutions. Subsequently, each new instance of each variable is generated by using the corresponding conditional probabilities, until all variables have been generated, i.e. in our case, new rule strings have been obtained. Sets of rule strings are generated in this way, some of which will replace previous strings based on fitness. If the stopping conditions are not met, the conditional probabilities for all nodes in the Bayesian network are updated again using the current set of promising rule strings. For clarity, consider the following toy example of scheduling five nurses with two rules (1: random allocation, 2: allocate nurse to low-cost shifts). At the beginning of the search, the probability of choosing rule 1 or 2 for each nurse is equal, i.e. 50%. After a few iterations, due to the selection pressure and reinforcement learning, two solution pathways emerge: because pure low-cost or pure random allocation produces low-quality solutions, either rule 1 is used for the first 2-3 nurses and rule 2 for the remainder, or vice versa. In essence, the Bayesian network learns 'use rule 2 after using rule 1 two or three times', or vice versa. It should be noted that for our and most other scheduling problems, the structure of the network model is known and all variables are fully observed. In this case, the goal of learning is to find the rule values that maximize the likelihood of the training data, so learning amounts to 'counting' in the case of multinomial distributions. For our problem, we use four rules: Random, Cheapest Cost, Best Cover, and Balance of Cost and Cover. In more detail, the steps of our Bayesian optimisation algorithm for nurse scheduling are: 1. Set t = 0, and generate an initial population P(0) at random; 2. Use roulette-wheel selection to choose a set of promising rule strings S(t) from P(t); 3. Compute the conditional probabilities of each node according to this set of promising solutions; 4. Assign each nurse using roulette-wheel selection based on the rules' conditional probabilities; a set of new rule strings O(t) is generated in this way; 5. Create a new population P(t+1) by replacing some rule strings in P(t) with O(t), and set t = t+1; 6. If the termination conditions are not met (we use 2000 generations), go to step 2.
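To make the estimation-of-distribution loop above concrete, here is a minimal, self-contained Python sketch under simplifying assumptions: a chain-structured Bayesian network over nurses, a hypothetical surrogate cost function in place of the hospital scheduling cost, truncation selection instead of the paper's roulette-wheel selection in step 2, and toy problem sizes. It shows the counting-based learning and sampling steps, not the authors' actual implementation.

```python
# Sketch of the Bayesian optimisation (estimation-of-distribution) loop:
# learn P(rule | nurse, previous nurse's rule) by counting in promising
# rule strings, then sample new rule strings from those probabilities.
import numpy as np

N_NURSES, N_RULES = 5, 4          # rules: Random, Cheapest Cost, Best Cover, Balance
POP, GENERATIONS, SELECT = 40, 200, 20
rng = np.random.default_rng(0)

def cost(rule_string):
    """Hypothetical surrogate for the schedule cost built with these rules."""
    return float(np.sum((rule_string - 2) ** 2)) + rng.normal(0, 0.1)

def learn_probabilities(selected):
    """Step 3: counting-based estimates of P(rule | nurse, previous rule)."""
    probs = np.ones((N_NURSES, N_RULES, N_RULES))      # Laplace smoothing
    for s in selected:
        for i in range(1, N_NURSES):
            probs[i, s[i - 1], s[i]] += 1
    return probs / probs.sum(axis=2, keepdims=True)

def sample_string(probs, marginal0):
    """Step 4: roulette-wheel sampling of a new rule string, nurse by nurse."""
    s = np.empty(N_NURSES, dtype=int)
    s[0] = rng.choice(N_RULES, p=marginal0)
    for i in range(1, N_NURSES):
        s[i] = rng.choice(N_RULES, p=probs[i, s[i - 1]])
    return s

pop = rng.integers(0, N_RULES, size=(POP, N_NURSES))   # step 1: random P(0)
for t in range(GENERATIONS):
    fitness = np.array([cost(s) for s in pop])
    # Step 2: promising strings (truncation here; the paper uses roulette-wheel).
    selected = pop[np.argsort(fitness)[:SELECT]]
    probs = learn_probabilities(selected)
    marginal0 = np.bincount(selected[:, 0], minlength=N_RULES) + 1.0
    marginal0 /= marginal0.sum()
    offspring = np.array([sample_string(probs, marginal0) for _ in range(POP)])
    pop = np.vstack([selected, offspring[: POP - SELECT]])  # step 5: replacement

best = pop[np.argmin([cost(s) for s in pop])]
print("best rule string:", best)
```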
Computational results from 52 real data instances demonstrate the success of this approach. They also suggest that the learning mechanism in the proposed approach might be suitable for other scheduling problems. Another direction for further research is to see whether there is a good construction sequence for individual data instances, given a fixed nurse scheduling order. If so, the good patterns could be recognized and then extracted as new domain knowledge. Thus, by using this extracted knowledge, we could assign specific rules to the corresponding nurses beforehand and only schedule the remaining nurses with all available rules, making it possible to reduce the solution space. Acknowledgements: The work was funded by the UK Government's major funding agency, the Engineering and Physical Sciences Research Council (EPSRC), under grant GR/R92899/01. References: [1] Aickelin U, "An Indirect Genetic Algorithm for Set Covering Problems", Journal of the Operational Research Society, 53(10): 1118-1126.
Abstract:
The cerebral cortex presents self-similarity over a suitable interval of spatial scales, a property typical of natural objects exhibiting fractal geometry. Its complexity can therefore be characterized by the value of its fractal dimension (FD). In computing this metric, a frequentist approach to probability has usually been employed, with point-estimator methods yielding only the optimal value of the FD. In our study, we aimed to obtain a more complete evaluation of the FD by using a Bayesian model for the linear regression analysis of the box-counting algorithm. We used T1-weighted MRI data of 86 healthy subjects (age 44.2 ± 17.1 years, mean ± standard deviation, 48% males) in order to gain insight into the confidence of our measure and to investigate the relationship between mean Bayesian FD and age. Our approach yielded a stronger and significant (P < .001) correlation between mean Bayesian FD and age compared to the previous implementation. Thus, our results suggest that the Bayesian FD is a more faithful estimate of the fractal dimension of the cerebral cortex than the frequentist FD.
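For intuition, the following is a minimal sketch of a Bayesian treatment of the box-counting regression, log N(eps) = FD * log(1/eps) + c, assuming a flat prior on the coefficients and a Jeffreys prior on the noise variance, so that the FD has a Student-t marginal posterior rather than a single point estimate. The box counts are illustrative placeholders; the study's actual priors and data may differ.

```python
# Sketch: Bayesian estimate of the box-counting fractal dimension.
# Assumes box counts N(eps) were already extracted from a binary cortical
# mask; the numbers below are illustrative placeholders only.
import numpy as np

eps = np.array([1, 2, 4, 8, 16, 32], dtype=float)        # box sizes (voxels)
counts = np.array([100000, 18000, 3300, 600, 110, 20])    # occupied boxes N(eps)

# Box-counting relation: log N(eps) = FD * log(1/eps) + c
x = np.log(1.0 / eps)
y = np.log(counts)
X = np.column_stack([np.ones_like(x), x])

# Flat prior on (c, FD), Jeffreys prior on the noise variance: the marginal
# posterior of the slope (the FD) is a scaled Student-t around the LS fit.
n, p = X.shape
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)
fd_mean = beta_hat[1]
fd_scale = np.sqrt(s2 * XtX_inv[1, 1])
dof = n - p

# Posterior draws give a full distribution for the FD, not just a point value.
rng = np.random.default_rng(0)
fd_draws = fd_mean + fd_scale * rng.standard_t(dof, size=10_000)
print(f"posterior mean FD = {fd_draws.mean():.3f}, "
      f"95% interval = ({np.quantile(fd_draws, 0.025):.3f}, "
      f"{np.quantile(fd_draws, 0.975):.3f})")
```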
Abstract:
Garlic is a spice and a medicinal plant; hence, there is increasing interest in developing new varieties with different culinary properties or with a high content of nutraceutical compounds. Phenotypic traits and dominant molecular markers are predominantly used to evaluate the genetic diversity of garlic clones. However, 24 codominant SSR markers specific to garlic are available in the literature, fostering germplasm research. In this study, we genotyped 130 garlic accessions from Brazil and abroad using 17 polymorphic SSR markers to assess genetic diversity and structure. This is the first attempt to evaluate a large set of accessions maintained by Brazilian institutions. A high level of redundancy was detected in the collection (50% of the accessions represented eight haplotypes). However, the non-redundant accessions presented high genetic diversity. We detected on average five alleles per locus, a Shannon index of 1.2, HO of 0.5, and HE of 0.6. A core collection of 17 accessions was assembled, covering 100% of the alleles with minimum redundancy. Overall FST and D values indicate a strong genetic structure among accessions. The two major groups identified by both model-based (Bayesian approach) and hierarchical clustering (UPGMA dendrogram) techniques were consistent with the classification of accessions according to maturity time (growth cycle): early/late and midseason accessions. Assessing the genetic diversity and structure of garlic collections is the first step towards efficient management and conservation of accessions in genebanks, as well as towards advancing future genetic studies and the improvement of garlic worldwide.
Abstract:
Current data indicate that the size of high-density lipoprotein (HDL) may be considered an important marker for cardiovascular disease risk. We established reference values of mean HDL size and volume in an asymptomatic, representative Brazilian population sample (n=590) and their associations with metabolic parameters by gender. Size and volume were determined in HDL isolated from plasma by polyethyleneglycol precipitation of apoB-containing lipoproteins and measured using the dynamic light scattering (DLS) technique. Although the gender and age distributions agreed with other studies, the mean HDL size reference value was slightly lower than in some other populations. Both HDL size and volume were influenced by gender and varied according to age. HDL size was associated with age and HDL-C (total population); inversely with non-white ethnicity and CETP (females); and with HDL-C and PLTP mass (males). On the other hand, HDL volume was determined only by HDL-C (total population and both genders) and by PLTP mass (males). The reference values for mean HDL size and volume using the DLS technique were thus established in an asymptomatic, representative Brazilian population sample, along with their related metabolic factors. HDL-C was a major determinant of HDL size and volume, which were differently modulated in females and in males.
Abstract:
Universidade Estadual de Campinas. Faculdade de Educação Física
Abstract:
OBJECTIVE: The aim of the present study was to verify the torque precision of metallic brackets with MBT prescription, using canine brackets as the representative sample from six commercial brands. MATERIAL AND METHODS: Twenty maxillary and 20 mandibular canine brackets of each of the following commercial brands were selected: 3M Unitek, Abzil, American Orthodontics, TP Orthodontics, Morelli and Ortho Organizers. The torque angle, established by reference points and lines, was measured by an operator using an optical microscope coupled to a computer. The values were compared to those established by the MBT prescription. RESULTS: For the maxillary canine brackets, only the Morelli torque (-3.33º) presented a statistically significant difference from the proposed value (-7º). For the mandibular canines, American Orthodontics (-6.34º) and Ortho Organizers (-6.25º) presented statistically significant differences from the standard (-6º). Comparing the brands, Morelli presented statistically significant differences from all the other brands for maxillary canine brackets. For the mandibular canine brackets, there was no statistically significant difference between brands. CONCLUSIONS: There are significant variations in the torque values of some of the brackets assessed, which could clinically compromise the buccolingual positioning of the tooth at the end of orthodontic treatment.
Abstract:
This study evaluated the bone response to a Ca- and P-enriched titanium (Ti) surface treated by a multiphase anodic spark deposition coating (BSP-AK). Two mongrel dogs received bilateral implantation of 3 Ti cylinders (4.1 x 12 mm) in the humerus, either BSP-AK treated or untreated (machined, control). At 8 weeks post-implantation, bone fragments containing the implants were harvested and processed for histologic and histomorphometric analyses. Bone formation was observed in the cortical area and towards the medullary canal, associated with approximately 1/3 of the implant extension. In most cases, in the medullary area, collagen fiber bundles were detected adjacent and oriented parallel to the Ti surfaces. This connective tissue formation exhibited focal areas of mineralized matrix lined by active osteoblasts. The mean percentages of bone-to-implant contact were 2.3 (range 0.0-7.2) for BSP-AK and 0.4 (range 0.0-1.3) for control. Although the Mann-Whitney test did not detect statistically significant differences between groups, these results indicate a trend of BSP-AK treated surfaces to support contact osteogenesis in an experimental model that produces low bone-to-implant contact values.
Abstract:
The Cerrado domain comprises a continuous area in the central states of Brazil and disjunct areas in other states, including São Paulo. This vegetation originally covered 21% of the Brazilian territory, of which only 21.6% of the original extent remains today. In São Paulo, this vegetation covered 14% of the state's total area, and its remnants now cover less than 1% of its original occurrence. Recent studies indicate that the net productivity of the Cerrado Pé-de-Gigante (SP) constitutes a small carbon sink and that seasonality was the determining factor of the observed value. Studies of carbon fluxes in terrestrial ecosystems are rarely accompanied by ecophysiological approaches that explore the functional relationships among the species composing the ecosystem and the net values obtained for it. Thus, the objective of this work was to characterize the structure of the vegetation in the area of greatest influence of the flux tower installed in the Cerrado Pé-de-Gigante, in order to enable studies related to the long-term quantification of the dynamics of water, energy and CO2 fluxes in Cerrado vegetation. To this end, 20 plots (10 x 10 m) were surveyed in 0.2 ha of Cerrado, and all plants with a perimeter at ground level >6 cm were sampled (except lianas and dead trees). The distribution of diameter classes and the vertical structure, as well as the phytosociological parameters, were analyzed. We found 1451 individuals, distributed in 85 species, 52 genera and 31 families. The absolute density and basal area were 7255 ind.ha-1 and 7.9 m².ha-1, respectively. The family Leguminosae presented the largest number of species (13). The Shannon diversity index (H') was 3.27 nats.ind-1. The diameter class distribution showed a reverse "J" curve, with most individuals in the first class. We conclude that the area should be classified as dense Cerrado (Cerrado denso), mainly because of the dominance of the tree species Anadenanthera falcata, whose occurrence in the state had been reported only in sites with base-rich soils in the Cuestas Basálticas region, and also because of the larger basal area of the individuals compared with other Cerrado fragments. Besides this species, Myrcia lingua and Xylopia aromatica presented the highest importance values (IVI).
Abstract:
The aim of this study was to evaluate the ability of the BANA Test to detect different levels of Porphyromonas gingivalis, Treponema denticola and Tannerella forsythia, or their combinations, in subgingival samples at the initial diagnosis and after periodontal therapy. Periodontal sites with probing depths of 5-7 mm and clinical attachment levels of 5-10 mm, from 53 subjects with chronic periodontitis, were sampled at four time points: initial diagnosis (T0), and immediately (T1), 45 days (T2) and 60 days (T3) after scaling and root planing. The BANA Test and Checkerboard DNA-DNA hybridization were used to identify red complex species in the subgingival biofilm. In all experimental periods, the highest frequencies of score 2 (Checkerboard DNA-DNA hybridization) for P. gingivalis, T. denticola and T. forsythia were observed when strong enzymatic activity (BANA) was present (p < 0.01). The best agreement was observed at the initial diagnosis. The BANA Test sensitivity was 95.54% (T0), 65.18% (T1), 65.22% (T2) and 50.26% (T3). The specificity values were 12.24% (T0), 57.38% (T1), 46.27% (T2) and 53.48% (T3). The BANA Test is more effective for the detection of red complex pathogens when bacterial levels are high, i.e. at the initial diagnosis of chronic periodontitis.
Abstract:
Gene clustering is a useful exploratory technique for grouping together genes with similar expression levels under distinct cell cycle phases or distinct conditions. It helps the biologist to identify potentially meaningful relationships between genes. In this study, we propose a clustering method based on multivariate normal mixture models, where the number of clusters is predicted via sequential hypothesis tests: at each step, the method considers a mixture model with m components (m = 2 in the first step) and tests whether it should in fact be m - 1. If the hypothesis is rejected, m is increased and a new test is carried out. The method continues (increasing m) until the hypothesis is accepted. The theoretical core of the method is the full Bayesian significance test, an intuitive Bayesian approach that requires neither a model-complexity penalty nor positive prior probabilities for sharp hypotheses. Numerical experiments were based on a cDNA microarray dataset consisting of expression levels of 205 genes belonging to four functional categories, for 10 distinct strains of Saccharomyces cerevisiae. To analyze the method's sensitivity to data dimension, we performed principal components analysis on the original dataset and predicted the number of classes using 2 to 10 principal components. Compared to Mclust (model-based clustering), our method shows more consistent results.
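The sequential procedure can be sketched as the loop below. The decision step in the paper is the full Bayesian significance test; here it is replaced by a hypothetical placeholder based on a simple log-likelihood comparison (with an arbitrary threshold), purely to show the control flow of growing m until the reduced model is accepted.

```python
# Sketch of the sequential model-growing loop: fit a mixture with m components,
# test whether m - 1 would suffice, and grow m until that hypothesis is accepted.
# `supports_m_minus_1` is a hypothetical stand-in for the FBST evidence step.
import numpy as np
from sklearn.mixture import GaussianMixture

def supports_m_minus_1(X, m, threshold=10.0):
    """Placeholder decision rule: accept the (m - 1)-component model when the
    gain in total log-likelihood from the m-th component is small."""
    ll_m = GaussianMixture(m, random_state=0).fit(X).score(X) * len(X)
    ll_m1 = GaussianMixture(m - 1, random_state=0).fit(X).score(X) * len(X)
    return (ll_m - ll_m1) < threshold

def predict_num_clusters(X, max_m=10):
    m = 2
    while m <= max_m:
        if supports_m_minus_1(X, m):
            return m - 1            # hypothesis "m - 1 components" accepted
        m += 1                      # rejected: consider a richer mixture
    return max_m

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(4, 1, (120, 3))])
print("predicted number of clusters:", predict_num_clusters(X))
```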
Abstract:
The aim of this study was to compare REML/BLUP and Least Squares procedures in the prediction and estimation of genetic parameters and breeding values in soybean progenies. F(2:3) and F(4:5) progenies were evaluated in the 2005/06 growing season, and the F(2:4) and F(4:6) generations derived from them were evaluated in 2006/07. These progenies originated from two semi-early experimental lines that differ in grain yield. The experiments were conducted in a lattice design, and plots consisted of a 2 m row, spaced 0.5 m apart. Grain yield per plot was the evaluated trait. It was observed that early selection is more efficient for discriminating the best lines from the F(4) generation onwards. No practical differences were observed between the Least Squares and REML/BLUP procedures for the models and simplifications of REML/BLUP used here.