987 resultados para Statistical efficiency
Resumo:
Complex diseases such as cancer result from multiple genetic changes and environmental exposures. Due to the rapid development of genotyping and sequencing technologies, we are now able to more accurately assess causal effects of many genetic and environmental factors. Genome-wide association studies have been able to localize many causal genetic variants predisposing to certain diseases. However, these studies only explain a small portion of variations in the heritability of diseases. More advanced statistical models are urgently needed to identify and characterize some additional genetic and environmental factors and their interactions, which will enable us to better understand the causes of complex diseases. In the past decade, thanks to the increasing computational capabilities and novel statistical developments, Bayesian methods have been widely applied in the genetics/genomics researches and demonstrating superiority over some regular approaches in certain research areas. Gene-environment and gene-gene interaction studies are among the areas where Bayesian methods may fully exert its functionalities and advantages. This dissertation focuses on developing new Bayesian statistical methods for data analysis with complex gene-environment and gene-gene interactions, as well as extending some existing methods for gene-environment interactions to other related areas. It includes three sections: (1) Deriving the Bayesian variable selection framework for the hierarchical gene-environment and gene-gene interactions; (2) Developing the Bayesian Natural and Orthogonal Interaction (NOIA) models for gene-environment interactions; and (3) extending the applications of two Bayesian statistical methods which were developed for gene-environment interaction studies, to other related types of studies such as adaptive borrowing historical data. We propose a Bayesian hierarchical mixture model framework that allows us to investigate the genetic and environmental effects, gene by gene interactions (epistasis) and gene by environment interactions in the same model. It is well known that, in many practical situations, there exists a natural hierarchical structure between the main effects and interactions in the linear model. Here we propose a model that incorporates this hierarchical structure into the Bayesian mixture model, such that the irrelevant interaction effects can be removed more efficiently, resulting in more robust, parsimonious and powerful models. We evaluate both of the 'strong hierarchical' and 'weak hierarchical' models, which specify that both or one of the main effects between interacting factors must be present for the interactions to be included in the model. The extensive simulation results show that the proposed strong and weak hierarchical mixture models control the proportion of false positive discoveries and yield a powerful approach to identify the predisposing main effects and interactions in the studies with complex gene-environment and gene-gene interactions. We also compare these two models with the 'independent' model that does not impose this hierarchical constraint and observe their superior performances in most of the considered situations. The proposed models are implemented in the real data analysis of gene and environment interactions in the cases of lung cancer and cutaneous melanoma case-control studies. The Bayesian statistical models enjoy the properties of being allowed to incorporate useful prior information in the modeling process. Moreover, the Bayesian mixture model outperforms the multivariate logistic model in terms of the performances on the parameter estimation and variable selection in most cases. Our proposed models hold the hierarchical constraints, that further improve the Bayesian mixture model by reducing the proportion of false positive findings among the identified interactions and successfully identifying the reported associations. This is practically appealing for the study of investigating the causal factors from a moderate number of candidate genetic and environmental factors along with a relatively large number of interactions. The natural and orthogonal interaction (NOIA) models of genetic effects have previously been developed to provide an analysis framework, by which the estimates of effects for a quantitative trait are statistically orthogonal regardless of the existence of Hardy-Weinberg Equilibrium (HWE) within loci. Ma et al. (2012) recently developed a NOIA model for the gene-environment interaction studies and have shown the advantages of using the model for detecting the true main effects and interactions, compared with the usual functional model. In this project, we propose a novel Bayesian statistical model that combines the Bayesian hierarchical mixture model with the NOIA statistical model and the usual functional model. The proposed Bayesian NOIA model demonstrates more power at detecting the non-null effects with higher marginal posterior probabilities. Also, we review two Bayesian statistical models (Bayesian empirical shrinkage-type estimator and Bayesian model averaging), which were developed for the gene-environment interaction studies. Inspired by these Bayesian models, we develop two novel statistical methods that are able to handle the related problems such as borrowing data from historical studies. The proposed methods are analogous to the methods for the gene-environment interactions on behalf of the success on balancing the statistical efficiency and bias in a unified model. By extensive simulation studies, we compare the operating characteristics of the proposed models with the existing models including the hierarchical meta-analysis model. The results show that the proposed approaches adaptively borrow the historical data in a data-driven way. These novel models may have a broad range of statistical applications in both of genetic/genomic and clinical studies.
Resumo:
This thesis addresses computational challenges arising from Bayesian analysis of complex real-world problems. Many of the models and algorithms designed for such analysis are ‘hybrid’ in nature, in that they are a composition of components for which their individual properties may be easily described but the performance of the model or algorithm as a whole is less well understood. The aim of this research project is to after a better understanding of the performance of hybrid models and algorithms. The goal of this thesis is to analyse the computational aspects of hybrid models and hybrid algorithms in the Bayesian context. The first objective of the research focuses on computational aspects of hybrid models, notably a continuous finite mixture of t-distributions. In the mixture model, an inference of interest is the number of components, as this may relate to both the quality of model fit to data and the computational workload. The analysis of t-mixtures using Markov chain Monte Carlo (MCMC) is described and the model is compared to the Normal case based on the goodness of fit. Through simulation studies, it is demonstrated that the t-mixture model can be more flexible and more parsimonious in terms of number of components, particularly for skewed and heavytailed data. The study also reveals important computational issues associated with the use of t-mixtures, which have not been adequately considered in the literature. The second objective of the research focuses on computational aspects of hybrid algorithms for Bayesian analysis. Two approaches will be considered: a formal comparison of the performance of a range of hybrid algorithms and a theoretical investigation of the performance of one of these algorithms in high dimensions. For the first approach, the delayed rejection algorithm, the pinball sampler, the Metropolis adjusted Langevin algorithm, and the hybrid version of the population Monte Carlo (PMC) algorithm are selected as a set of examples of hybrid algorithms. Statistical literature shows how statistical efficiency is often the only criteria for an efficient algorithm. In this thesis the algorithms are also considered and compared from a more practical perspective. This extends to the study of how individual algorithms contribute to the overall efficiency of hybrid algorithms, and highlights weaknesses that may be introduced by the combination process of these components in a single algorithm. The second approach to considering computational aspects of hybrid algorithms involves an investigation of the performance of the PMC in high dimensions. It is well known that as a model becomes more complex, computation may become increasingly difficult in real time. In particular the importance sampling based algorithms, including the PMC, are known to be unstable in high dimensions. This thesis examines the PMC algorithm in a simplified setting, a single step of the general sampling, and explores a fundamental problem that occurs in applying importance sampling to a high-dimensional problem. The precision of the computed estimate from the simplified setting is measured by the asymptotic variance of the estimate under conditions on the importance function. Additionally, the exponential growth of the asymptotic variance with the dimension is demonstrated and we illustrates that the optimal covariance matrix for the importance function can be estimated in a special case.
Resumo:
Statistical methods are often used to analyse commercial catch and effort data to provide standardised fishing effort and/or a relative index of fish abundance for input into stock assessment models. Achieving reliable results has proved difficult in Australia's Northern Prawn Fishery (NPF), due to a combination of such factors as the biological characteristics of the animals, some aspects of the fleet dynamics, and the changes in fishing technology. For this set of data, we compared four modelling approaches (linear models, mixed models, generalised estimating equations, and generalised linear models) with respect to the outcomes of the standardised fishing effort or the relative index of abundance. We also varied the number and form of vessel covariates in the models. Within a subset of data from this fishery, modelling correlation structures did not alter the conclusions from simpler statistical models. The random-effects models also yielded similar results. This is because the estimators are all consistent even if the correlation structure is mis-specified, and the data set is very large. However, the standard errors from different models differed, suggesting that different methods have different statistical efficiency. We suggest that there is value in modelling the variance function and the correlation structure, to make valid and efficient statistical inferences and gain insight into the data. We found that fishing power was separable from the indices of prawn abundance only when we offset the impact of vessel characteristics at assumed values from external sources. This may be due to the large degree of confounding within the data, and the extreme temporal changes in certain aspects of individual vessels, the fleet and the fleet dynamics.
Resumo:
The standard approach to signal reconstruction in frequency-domain optical-coherence tomography (FDOCT) is to apply the inverse Fourier transform to the measurements. This technique offers limited resolution (due to Heisenberg's uncertainty principle). We propose a new super-resolution reconstruction method based on a parametric representation. We consider multilayer specimens, wherein each layer has a constant refractive index and show that the backscattered signal from such a specimen fits accurately in to the framework of finite-rate-of-innovation (FRI) signal model and is represented by a finite number of free parameters. We deploy the high-resolution Prony method and show that high-quality, super-resolved reconstruction is possible with fewer measurements (about one-fourth of the number required for the standard Fourier technique). To further improve robustness to noise in practical scenarios, we take advantage of an iterated singular-value decomposition algorithm (Cadzow denoiser). We present results of Monte Carlo analyses, and assess statistical efficiency of the reconstruction techniques by comparing their performance against the Cramer-Rao bound. Reconstruction results on experimental data obtained from technical as well as biological specimens show a distinct improvement in resolution and signal-to-reconstruction noise offered by the proposed method in comparison with the standard approach.
Resumo:
In this study we show that forest areas contribute significantly to the estimated benefits from om outdoor recreation in Northern Ireland. Secondly we provide empirical evidence of the gains in the statistical efficiency of both benefit and parameter estimates obtained by analysing follow-up responses with Double Bounded interval data analysis. As these gains are considerable, it is clearly worth considering this method in CVM survey design even when moderately large sample sizes are used. Finally we demonstrate that estimates of means and medians of WTP distributions for access to forest recreation show plausible magnitude, are consistent with previous UK studies, and converge across parametric and non-parametic methods of estimation.
Resumo:
Background The Well London programme used community engagement, complemented by changes to the physical and social neighbourhood environment, to improve physical activity levels, healthy eating and mental wellbeing in the most deprived communities in London. The effectiveness of Well London is being evaluated in a pair-matched cluster randomised trial (CRT). The baseline survey data are reported here. Methods The CRT involved 20 matched pairs of intervention and control communities (defined as UK census lower super output areas; ranked in the 11% most deprived LSOAs in London by Index of Multiple Deprivation) across 20 London boroughs. The primary trial outcomes, sociodemographic information and environmental neighbourhood characteristics were assessed in three quantitative components within the Well London CRT at baseline: a cross-sectional, interviewer-administered adult household survey; a self-completed, school-based adolescent questionnaire; a fieldworker completed neighbourhood environmental audit. Baseline data collection occurred in 2008. Physical activity, healthy eating and mental wellbeing were assessed using standardised, validated questionnaire tools. Multiple imputation was used to account for missing data in the outcomes and other variables in the adult and adolescent surveys. Results There were 4107 adults and 1214 adolescent respondents in the baseline surveys. The intervention and control areas were broadly comparable with respect to the primary outcomes and key sociodemographic characteristics. The environmental characteristics of the intervention and control neighbourhoods were broadly similar. There was greater between cluster variation in the primary outcomes in the adult population compared to the adolescent population. Levels of healthy eating, smoking and self-reported anxiety/depression were similar in the Well London population and the national Health Survey for England. Levels of physical activity were higher in the Well London population but this is likely to be due to the different measurement tools used in the two surveys. Conclusions Randomisation of social interventions such as Well London is acceptable and feasible and in this study the intervention and control arms are well balanced with respect to the primary outcomes and key sociodemographic characteristics. The matched design has improved the statistical efficiency of the study amongst adults but less so amongst adolescents. Follow-up data collection will be completed 2012.
Resumo:
Malgré des progrès constants en termes de capacité de calcul, mémoire et quantité de données disponibles, les algorithmes d'apprentissage machine doivent se montrer efficaces dans l'utilisation de ces ressources. La minimisation des coûts est évidemment un facteur important, mais une autre motivation est la recherche de mécanismes d'apprentissage capables de reproduire le comportement d'êtres intelligents. Cette thèse aborde le problème de l'efficacité à travers plusieurs articles traitant d'algorithmes d'apprentissage variés : ce problème est vu non seulement du point de vue de l'efficacité computationnelle (temps de calcul et mémoire utilisés), mais aussi de celui de l'efficacité statistique (nombre d'exemples requis pour accomplir une tâche donnée). Une première contribution apportée par cette thèse est la mise en lumière d'inefficacités statistiques dans des algorithmes existants. Nous montrons ainsi que les arbres de décision généralisent mal pour certains types de tâches (chapitre 3), de même que les algorithmes classiques d'apprentissage semi-supervisé à base de graphe (chapitre 5), chacun étant affecté par une forme particulière de la malédiction de la dimensionalité. Pour une certaine classe de réseaux de neurones, appelés réseaux sommes-produits, nous montrons qu'il peut être exponentiellement moins efficace de représenter certaines fonctions par des réseaux à une seule couche cachée, comparé à des réseaux profonds (chapitre 4). Nos analyses permettent de mieux comprendre certains problèmes intrinsèques liés à ces algorithmes, et d'orienter la recherche dans des directions qui pourraient permettre de les résoudre. Nous identifions également des inefficacités computationnelles dans les algorithmes d'apprentissage semi-supervisé à base de graphe (chapitre 5), et dans l'apprentissage de mélanges de Gaussiennes en présence de valeurs manquantes (chapitre 6). Dans les deux cas, nous proposons de nouveaux algorithmes capables de traiter des ensembles de données significativement plus grands. Les deux derniers chapitres traitent de l'efficacité computationnelle sous un angle différent. Dans le chapitre 7, nous analysons de manière théorique un algorithme existant pour l'apprentissage efficace dans les machines de Boltzmann restreintes (la divergence contrastive), afin de mieux comprendre les raisons qui expliquent le succès de cet algorithme. Finalement, dans le chapitre 8 nous présentons une application de l'apprentissage machine dans le domaine des jeux vidéo, pour laquelle le problème de l'efficacité computationnelle est relié à des considérations d'ingénierie logicielle et matérielle, souvent ignorées en recherche mais ô combien importantes en pratique.
Resumo:
We present an unsupervised learning algorithm that acquires a natural-language lexicon from raw speech. The algorithm is based on the optimal encoding of symbol sequences in an MDL framework, and uses a hierarchical representation of language that overcomes many of the problems that have stymied previous grammar-induction procedures. The forward mapping from symbol sequences to the speech stream is modeled using features based on articulatory gestures. We present results on the acquisition of lexicons and language models from raw speech, text, and phonetic transcripts, and demonstrate that our algorithm compares very favorably to other reported results with respect to segmentation performance and statistical efficiency.
Resumo:
Resiliência remete à habilidade do ser humano de demonstrar êxito diante das adversidades da vida, superá-las e, inclusive, ser fortalecido ou transformado por elas. O construto tem sido estudado há cerca de quarenta anos na Psiquiatria com foco em crianças, mas sua investigação é bem mais recente com a população adulta. No mundo da competição esportiva, os estudos são escassos. O contexto esportivo apresenta altos desafios e adversidades constantes que os atletas precisam vencer para cumprir as metas profissionais; por isso, convivem, muito frequentemente, com seus limites físicos e psicológicos. Assim, a resiliência pode ser um importante aspecto em suas vidas profissionais. Este estudo objetiva descrever os níveis de resiliência dos atletas no Basquetebol e identificar possíveis relações entre resiliência e alguns indicadores de eficiência estatística. Participaram da pesquisa, voluntariamente, 71 atletas profissionais adultos e atuantes da modalidade. As variáveis foram avaliadas por meio da Escala de Avaliação de Resiliência EAR, de um questionário de dados sociodemográficos e de índices de eficiência registrados pela Federação Paulista de Basquetebol. Os resultados de análises estatísticas descritivas e de correlações bivariadas de Pearson permitiram observar que os atletas demonstraram um alto nível de resiliência com destaque para a persistência diante das dificuldades e a aceitação positiva de mudanças. Os fatores que compõem a resiliência não apresentaram correlação significativa no tocante ao coeficiente de eficiência dos atletas. Ao comparar as médias por meio da análise de variância percebeu-se que os atletas que possuíam entre cinco e dez anos de profissão apresentaram melhores médias de coeficiente de eficiência. Os resultados revelam, ainda, que os atletas que atuam menos de 8 minutos na partida, em média, produzem menores índices de eficiência estatística e que os atletas que pertencem às equipes de resultados medianos na tabela de classificação tendem a apresentar maior percepção de competência pessoal que os atletas que atuam nas equipes mais mal colocadas. Os fatores de resiliência não se diferenciam em função da experiência dos atletas, nem do tempo em média que permanecem em quadra. Esses resultados revelam a necessidade de questionar se os indicadores de eficiência estatística seriam os critérios mais adequados para verificar o papel da resiliência na vida de atletas de Basquetebol e apontam para a necessidade de aumentar o número de estudos sobre a influência de características individuais no mundo dos esportes profissionais.
Resumo:
Paired-tow calibration studies provide information on changes in survey catchability that may occur because of some necessary change in protocols (e.g., change in vessel or vessel gear) in a fish stock survey. This information is important to ensure the continuity of annual time-series of survey indices of stock size that provide the basis for fish stock assessments. There are several statistical models used to analyze the paired-catch data from calibration studies. Our main contributions are results from simulation experiments designed to measure the accuracy of statistical inferences derived from some of these models. Our results show that a model commonly used to analyze calibration data can provide unreliable statistical results when there is between-tow spatial variation in the stock densities at each paired-tow site. However, a generalized linear mixed-effects model gave very reliable results over a wide range of spatial variations in densities and we recommend it for the analysis of paired-tow survey calibration data. This conclusion also applies if there is between-tow variation in catchability.
Resumo:
The preceding discussion and review of literature show that studies on gear selectivity have received great attention, while gear efficiency studies do not seem to have received equal consideration. In temperate waters, fishing industry is well organised and relatively large and well equipped vessels and gear are used for commercial fishing and the number of species are less; whereas in tropics particularly in India, small scale fishery dominates the scene and the fishery is multispecies operated upon by nmltigear. Therefore many of the problems faced in India may not exist in developed countries. Perhaps this would be the reason for the paucity of literature on the problems in estimation of relative efficiency. Much work has been carried out in estimating relative efficiency (Pycha, 1962; Pope, 1963; Gulland, 1967; Dickson, 1971 and Collins, 1979). The main subject of interest in the present thesis is an investigation into the problems in the comparison of fishing gears. especially in using classical test procedures with special reference to the prevailing fishing practices (that is. with reference to the catch data generated by the existing system). This has been taken up with a view to standardizing an approach for comparing the efficiency of fishing gear. Besides this, the implications of the terms ‘gear efficiency‘ and ‘gear selectivity‘ have been examined and based on the commonly used selectivity model (Holt, 1963), estimation of the ratio of fishing power of two gear has been considered. An attempt to determine the size of fish for which a gear is most efficient.has also been made. The work has been presented in eight chapters
Resumo:
This paper uses an output oriented Data Envelopment Analysis (DEA) measure of technical efficiency to assess the technical efficiencies of the Brazilian banking system. Four approaches to estimation are compared in order to assess the significance of factors affecting inefficiency. These are nonparametric Analysis of Covariance, maximum likelihood using a family of exponential distributions, maximum likelihood using a family of truncated normal distributions, and the normal Tobit model. The sole focus of the paper is on a combined measure of output and the data analyzed refers to the year 2001. The factors of interest in the analysis and likely to affect efficiency are bank nature (multiple and commercial), bank type (credit, business, bursary and retail), bank size (large, medium, small and micro), bank control (private and public), bank origin (domestic and foreign), and non-performing loans. The latter is a measure of bank risk. All quantitative variables, including non-performing loans, are measured on a per employee basis. The best fits to the data are provided by the exponential family and the nonparametric Analysis of Covariance. The significance of a factor however varies according to the model fit although it can be said that there is some agreements between the best models. A highly significant association in all models fitted is observed only for nonperforming loans. The nonparametric Analysis of Covariance is more consistent with the inefficiency median responses observed for the qualitative factors. The findings of the analysis reinforce the significant association of the level of bank inefficiency, measured by DEA residuals, with the risk of bank failure.