949 results for Multiple testing
Abstract:
In this report, we describe a simple correction for multiple testing of single-nucleotide polymorphisms (SNPs) in linkage disequilibrium (LD) with each other, on the basis of the spectral decomposition (SpD) of matrices of pairwise LD between SNPs. This method provides a useful alternative to more computationally intensive permutation tests. Additionally, output from SNPSpD includes eigenvalues, principal-component coefficients, and factor "loadings" after varimax rotation, enabling the selection of a subset of SNPs that optimize the information in a genomic region.
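The abstract does not spell out the correction itself. As a minimal sketch, assuming the effective-number-of-tests formula usually associated with this spectral-decomposition approach, Meff = 1 + (M - 1)(1 - Var(eigenvalues)/M), and a Šidák-type adjustment of the significance threshold, the calculation might look as follows. The function names and the choice of the sample variance (ddof=1) are illustrative assumptions, not taken from the SNPSpD software.

```python
import numpy as np


def effective_number_of_tests(ld_corr):
    """Effective number of independent tests from a pairwise LD correlation matrix.

    Assumed formula: Meff = 1 + (M - 1) * (1 - Var(eigenvalues) / M).
    """
    ld_corr = np.asarray(ld_corr, dtype=float)
    m = ld_corr.shape[0]
    if m == 1:
        return 1.0
    # eigvalsh is appropriate because a correlation matrix is symmetric.
    eigenvalues = np.linalg.eigvalsh(ld_corr)
    variance = np.var(eigenvalues, ddof=1)
    return 1.0 + (m - 1.0) * (1.0 - variance / m)


def adjusted_threshold(ld_corr, alpha=0.05):
    """Sidak-type per-test threshold using Meff in place of the raw SNP count."""
    meff = effective_number_of_tests(ld_corr)
    return 1.0 - (1.0 - alpha) ** (1.0 / meff)
```

Under this formula, fully independent SNPs (identity matrix) give Meff equal to the number of SNPs, while SNPs in complete LD (all correlations equal to 1) give Meff equal to 1.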
Abstract:
Background
Biomedical researchers are now often faced with situations where it is necessary to test a large number of hypotheses simultaneously, e.g., in comparative gene expression studies using high-throughput microarray technology. To properly control false positive errors, the FDR (false discovery rate) approach has become widely used in multiple testing. Accurate estimation of the FDR requires an accurate estimate of the proportion of true null hypotheses. To date many methods for estimating this quantity have been proposed. Typically, when a new method is introduced, some simulations are carried out to show its improved accuracy. However, these simulations are often very limited, covering only a few points in the parameter space.
Results
Here I have carried out extensive in silico experiments to compare some commonly used methods for estimating the proportion of true null hypotheses. The coverage of the parameter space in these simulations is unprecedentedly thorough compared to typical simulation studies in the literature. This work therefore enables us to draw global conclusions about the performance of these different methods. It was found that a very simple method gives the most accurate estimation over a dominantly large area of the parameter space. Given its simplicity and its overall superior accuracy, I recommend its use as the first choice for estimating the proportion of true null hypotheses in multiple testing.
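The abstract does not name the methods compared or the one recommended. Purely as an illustration of what an estimator of the proportion of true null hypotheses looks like, here is a sketch of one widely known approach, the fixed-lambda estimator, which assumes that p-values above a cut-off `lam` come almost entirely from true nulls. The function name and the default `lam=0.5` are illustrative choices, not taken from the paper.

```python
import numpy as np


def estimate_pi0(p_values, lam=0.5):
    """Fixed-lambda estimate of the proportion of true null hypotheses.

    pi0_hat = #{p_i > lam} / (m * (1 - lam)), truncated at 1.
    """
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    p = np.asarray(p_values, dtype=float)
    # Null p-values are uniform, so a fraction (1 - lam) of them falls above lam.
    estimate = np.mean(p > lam) / (1.0 - lam)
    return float(min(1.0, estimate))
```

For example, with p-values `[0.01, 0.02, 0.03, 0.8]` and `lam=0.5`, one of four p-values exceeds the cut-off, giving an estimate of 0.25 / 0.5 = 0.5.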
Abstract:
Most panel unit root tests are designed to test the joint null hypothesis of a unit root for each individual series in a panel. After a rejection, it will often be of interest to identify which series can be deemed to be stationary and which series can be deemed nonstationary. Researchers will sometimes carry out this classification on the basis of n individual (univariate) unit root tests based on some ad hoc significance level. In this paper, we demonstrate how to use the false discovery rate (FDR) in evaluating I(1)/I(0) classifications based on individual unit root tests when the size of the cross section (n) and time series (T) dimensions are large. We report results from a simulation experiment and illustrate the methods on two data sets.
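The abstract does not state which FDR procedure is applied. As a hedged sketch, the standard Benjamini-Hochberg step-up procedure could be applied to the n p-values of the individual unit root tests, with each rejection read as classifying that series as stationary, I(0). The function name and the example p-values are hypothetical.

```python
import numpy as np


def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure; returns a boolean rejection mask."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Compare the i-th smallest p-value with q * i / n.
    thresholds = q * np.arange(1, n + 1) / n
    below = p[order] <= thresholds
    reject = np.zeros(n, dtype=bool)
    if below.any():
        # Step-up: reject every hypothesis up to the largest index that passes.
        k = np.nonzero(below)[0].max()
        reject[order[: k + 1]] = True
    return reject


# Hypothetical p-values from four univariate unit root tests.
unit_root_p = [0.02, 0.03, 0.035, 0.9]
is_stationary = benjamini_hochberg(unit_root_p, q=0.05)
```

In this example the two smallest p-values fail their own thresholds, but the third passes, so the step-up rule classifies the first three series as I(0) and leaves the fourth as I(1).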
Abstract:
In this work we propose a new approach for preliminary epidemiological studies on Standardized Mortality Ratios (SMRs) collected in many spatial regions. A preliminary study on SMRs aims to formulate hypotheses to be investigated via individual epidemiological studies that avoid the bias carried by aggregated analyses. Starting from collected disease counts and expected disease counts calculated from reference population disease rates, an SMR is derived in each area as the MLE under the Poisson assumption on each observation. Such estimators have high standard errors in small areas, i.e. where the expected count is low either because of the small population underlying the area or the rarity of the disease under study. Disease mapping models and other techniques for screening disease rates across the map, aiming to detect anomalies and possible high-risk areas, have been proposed in the literature under both the classical and the Bayesian paradigm. Our proposal approaches this issue with a decision-oriented method that focuses on multiple testing control, without abandoning the preliminary-study perspective that an analysis of SMR indicators is meant to serve. We implement control of the FDR, a quantity widely used to address multiple comparison problems in the field of microarray data analysis but not usually employed in disease mapping. Controlling the FDR means providing an estimate of the FDR for a set of rejected null hypotheses. The small-areas issue creates difficulties in applying traditional methods for FDR estimation, which are usually based only on knowledge of the p-values (Benjamini and Hochberg, 1995; Storey, 2003). Tests evaluated by a traditional p-value provide weak power in small areas, where the expected number of disease cases is small. Moreover, tests cannot be assumed independent when spatial correlation between SMRs is expected, nor are they identically distributed when the population underlying the map is heterogeneous.
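The small-area problem described above can be made explicit with a short worked derivation. The symbols $O_i$ (observed count), $E_i$ (expected count) and $\theta_i$ (relative risk) are notation assumed here for illustration; they do not appear in the abstract.

```latex
% Poisson model for the observed count in area i
O_i \mid \theta_i \sim \mathrm{Poisson}(\theta_i E_i)

% Maximising the log-likelihood O_i \log(\theta_i E_i) - \theta_i E_i in \theta_i
\hat{\theta}_i = \frac{O_i}{E_i} = \mathrm{SMR}_i

% Variance of the estimator, since \mathrm{Var}(O_i) = \theta_i E_i
\mathrm{Var}(\hat{\theta}_i) = \frac{\theta_i E_i}{E_i^2} = \frac{\theta_i}{E_i}
```

The standard error therefore scales as $1/\sqrt{E_i}$, which is why SMRs are unstable wherever the expected count is low.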
The Bayesian paradigm offers a way to overcome the inappropriateness of p-value-based methods. Another peculiarity of the present work is that it proposes a fully Bayesian hierarchical model for FDR estimation in testing many null hypotheses of absence of risk. We use concepts from Bayesian models for disease mapping, referring in particular to the Besag, York and Mollié model (1991), often used in practice for its flexible prior assumption on the distribution of risks across regions. The borrowing of strength between prior and likelihood typical of a hierarchical Bayesian model has the advantage of evaluating a single test (i.e. a test in a single area) by means of all observations in the map under study, rather than just the single observation. This improves the power of the test in small areas and addresses more appropriately the spatial correlation issue, which suggests that relative risks are closer in spatially contiguous regions. The proposed model aims to estimate the FDR by means of the MCMC-estimated posterior probabilities $b_i$ of the null hypothesis (absence of risk) for each area. An estimate of the expected FDR conditional on the data ($\widehat{FDR}$) can be calculated for any set of $b_i$'s relative to areas declared high-risk (where the null hypothesis is rejected) by averaging the $b_i$'s themselves. The $\widehat{FDR}$ can be used to provide an easy decision rule for selecting high-risk areas, i.e. selecting as many areas as possible such that the $\widehat{FDR}$ does not exceed a prefixed value; we call these $\widehat{FDR}$-based decision (or selection) rules. The sensitivity and specificity of such rules depend on the accuracy of the FDR estimate: over-estimation of the FDR causes a loss of power, and under-estimation of the FDR produces a loss of specificity. Moreover, our model has the interesting feature of still being able to provide an estimate of relative risk values, as in the Besag, York and Mollié model (1991).
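The selection rule described above can be sketched in a few lines. This is an illustration only: it assumes the posterior null probabilities have already been obtained from the MCMC output, and it reads the rule as keeping the estimated FDR at or below the prefixed value. The function names and the example probabilities are hypothetical.

```python
import numpy as np


def estimated_fdr(posterior_null, selected):
    """Estimated FDR of a selection: mean posterior null probability over it."""
    b = np.asarray(posterior_null, dtype=float)
    selected = np.asarray(selected, dtype=bool)
    if not selected.any():
        return 0.0
    return float(b[selected].mean())


def select_high_risk(posterior_null, target_fdr):
    """Select as many areas as possible while the estimated FDR stays <= target_fdr."""
    b = np.asarray(posterior_null, dtype=float)
    n = b.size
    order = np.argsort(b)
    # Running mean of the sorted probabilities is non-decreasing, so the
    # admissible selections form a prefix of the ordering.
    running_mean = np.cumsum(b[order]) / np.arange(1, n + 1)
    admissible = np.nonzero(running_mean <= target_fdr)[0]
    selected = np.zeros(n, dtype=bool)
    if admissible.size:
        selected[order[: admissible.max() + 1]] = True
    return selected


# Hypothetical posterior probabilities of "no excess risk" for five areas.
b_example = [0.01, 0.5, 0.03, 0.2, 0.9]
high_risk = select_high_risk(b_example, target_fdr=0.05)
```

Here the first and third areas are selected, with an estimated FDR of about 0.02; adding the next area would raise the running mean to 0.08, above the target.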
A simulation study was set up to evaluate the model's performance in terms of FDR estimation accuracy, the sensitivity and specificity of the decision rule, and the goodness of estimation of relative risks. We chose a real map from which we generated several spatial scenarios whose disease counts vary according to the degree of spatial correlation, the size of the areas, the number of areas where the null hypothesis is true and the risk level in the latter areas. In summarizing the simulation results we always consider FDR estimation in sets consisting of all $b_i$'s lower than a threshold t. We show graphs of the $\widehat{FDR}$ and the true FDR (known by simulation) plotted against the threshold t to assess the FDR estimation. By varying the threshold we can learn which FDR values can be accurately estimated by a practitioner willing to apply the model (judged by the closeness between $\widehat{FDR}$ and the true FDR). By plotting the calculated sensitivity and specificity (both known by simulation) against the $\widehat{FDR}$ we can check the sensitivity and specificity of the corresponding $\widehat{FDR}$-based decision rules. To investigate the degree of over-smoothing of the relative risk estimates we compare box-plots of such estimates in high-risk areas (known by simulation), obtained by both our model and the classic Besag, York and Mollié model. All the summary tools are worked out for all simulated scenarios (54 scenarios in total). Results show that the FDR is well estimated (in the worst case we get an overestimation, hence a conservative FDR control) in scenarios with small areas, low risk levels and spatially correlated risks, which are our primary aims. In such scenarios we have good estimates of the FDR for all values less than or equal to 0.10. The sensitivity of $\widehat{FDR}$-based decision rules is generally low but specificity is high. In such scenarios the use of a selection rule based on $\widehat{FDR} = 0.05$ or $\widehat{FDR} = 0.10$ can be suggested.
In cases where the number of true alternative hypotheses (number of true high-risk areas) is small, FDR values of 0.15 are also well estimated, and decision rules based on $\widehat{FDR} = 0.15$ gain power while maintaining a high specificity. On the other hand, in scenarios with non-small areas and non-small risk levels the FDR is under-estimated except for very small values of it (much lower than 0.05), resulting in a loss of specificity of a decision rule based on $\widehat{FDR} = 0.05$. In such scenarios decision rules based on $\widehat{FDR} = 0.05$ or, even worse, $\widehat{FDR} = 0.1$ cannot be suggested because the true FDR is actually much higher. As regards relative risk estimation, our model achieves almost the same results as the classic Besag, York and Mollié model. For this reason, our model is interesting for its ability to perform both the estimation of relative risk values and FDR control, except in scenarios with non-small areas and large risk levels. A case study is finally presented to show how the method can be used in epidemiology.
Abstract:
In the modern genomic era, the volume of data generated by genetic sequencing has become extremely large. The analysis of genomic data requires methods of statistical significance to quantify the robustness of the correlations identified in the data. Statistical significance lets us understand whether the relationships in the data we are analysing actually carry statistical weight, that is, whether the event under analysis happened "by chance" or whether it is in fact correct to think that it occurs with a useful probability. Regardless of the statistical test used, in the presence of multiple hypothesis tests ("Multiple Testing Hypothesis") it is necessary to use methods for correcting statistical significance ("Multiple Testing Correction"). The aim of this thesis is to make available implementations of the best-known methods for correcting statistical significance. A collection of these methods was created, in the form of a library, precisely because nothing of the kind was found in the current bioinformatics landscape.
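The abstract does not list which correction methods the library implements. As an illustration of the kind of method such a collection would typically contain, here is a minimal sketch of two classical family-wise corrections, Bonferroni and Holm. The function names are hypothetical and are not taken from the thesis.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: reject when p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down correction; returns rejections in the input order."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, index in enumerate(order):
        if p_values[index] <= alpha / (m - rank):
            reject[index] = True
        else:
            # Step-down: stop at the first p-value that fails its threshold.
            break
    return reject
```

With p-values `[0.005, 0.015, 0.02, 0.04]` and `alpha=0.05`, Bonferroni rejects only the first hypothesis (threshold 0.0125), whereas Holm rejects all four because its thresholds relax after each rejection.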
Abstract:
An optimal multiple testing procedure is identified for linear hypotheses under the general linear model, maximizing the expected number of false null hypotheses rejected at any significance level. The optimal procedure depends on the unknown data-generating distribution, but can be consistently estimated. Drawing information together across many hypotheses, the estimated optimal procedure provides an empirical alternative hypothesis by adapting to underlying patterns of departure from the null. Proposed multiple testing procedures based on the empirical alternative are evaluated through simulations and an application to gene expression microarray data. Compared to a standard multiple testing procedure, it is not unusual for use of an empirical alternative hypothesis to increase by 50% or more the number of true positives identified at a given significance level.
Abstract:
The present study assessed the effectiveness of the Cognitive Interview (CI) in a multiple-testing situation. One-hundred and eighty-two undergraduate psychology students viewed a short film clip depicting an automobile accident. Subsequently, the subjects were interviewed twice using either the CI or standard interviewing technique. In both instances, subjects who received the CI recalled more accurate information (m=32.30 at Time 1 and m=30.51 at Time 2) than subjects who received the standard interview (m=18.14 at Time 1 and m=18.38 at Time 2). There was no effect of type of interview at Time 1 on amount recalled at Time 2. This research has implications not only for judicial fact-finders, but also for further researchers interested in the CI procedure.
Abstract:
Recent association studies in multiple sclerosis (MS) have identified and replicated several single nucleotide polymorphism (SNP) susceptibility loci including CLEC16A, IL2RA, IL7R, RPL5, CD58, CD40 and chromosome 12q13–14 in addition to the well established allele HLA-DR15. There is potential that these genetic susceptibility factors could also modulate MS disease severity, as demonstrated previously for the MS risk allele HLA-DR15. We investigated this hypothesis in a cohort of 1006 well characterised MS patients from South-Eastern Australia. We tested the MS-associated SNPs for association with five measures of disease severity incorporating disability, age of onset, cognition and brain atrophy. We observed trends towards association between the RPL5 risk SNP and time between first demyelinating event and relapse, and between the CD40 risk SNP and symbol digit test score. No associations were significant after correction for multiple testing. We found no evidence for the hypothesis that these new MS disease risk-associated SNPs influence disease severity.
Abstract:
CONTEXT People meeting diagnostic criteria for anxiety or depressive disorders tend to score high on the personality scale of neuroticism. Studying this personality dimension can give insights into the etiology of these important psychiatric disorders. OBJECTIVES To undertake a comprehensive genome-wide linkage study of neuroticism using large study samples that have been measured multiple times and to compare the results between countries for replication and across time within countries for consistency. DESIGN Genome-wide linkage scan. SETTING Twin individuals and their family members from Australia and the Netherlands. PARTICIPANTS Nineteen thousand six hundred thirty-five sibling pairs completed self-report questionnaires for neuroticism up to 5 times over a period of up to 22 years. Five thousand sixty-nine sibling pairs were genotyped with microsatellite markers. METHODS Nonparametric linkage analyses were conducted in MERLIN-REGRESS for the mean neuroticism scores averaged across time. Additional analyses were conducted for the time-specific measures of neuroticism from each country to investigate consistency of linkage results. RESULTS Three chromosomal regions exceeded empirically derived thresholds for suggestive linkage using mean neuroticism scores: 10p 5 Kosambi centimorgans (cM) (Dutch study sample), 14q 103 cM (Dutch study sample), and 18q 117 cM (combined Australian and Dutch study sample), but only 14q retained significance after correction for multiple testing. These regions all showed evidence for linkage in individual time-specific measures of neuroticism and 1 (18q) showed some evidence for replication between countries. Linkage intervals for these regions all overlap with regions identified in other studies of neuroticism or related traits and/or in studies of anxiety in mice.
CONCLUSIONS Our results demonstrate the value of the availability of multiple measures over time and add to the optimism reported in recent reviews for replication of linkage regions for neuroticism. These regions are likely to harbor causal variants for neuroticism and its related psychiatric disorders and can inform prioritization of results from genome-wide association studies.
Abstract:
Assaying a large number of genetic markers from patients in clinical trials is now possible in order to tailor drugs with respect to efficacy. The statistical methodology for analysing such massive data sets is challenging. The most popular type of statistical analysis is to use a univariate test for each genetic marker, once all the data from a clinical study have been collected. This paper presents a sequential method for conducting an omnibus test for detecting gene-drug interactions across the genome, thus allowing informed decisions at the earliest opportunity and overcoming the multiple testing problems from conducting many univariate tests. We first propose an omnibus test for a fixed sample size. This test is based on combining F-statistics that test for an interaction between treatment and the individual single nucleotide polymorphism (SNP). As SNPs tend to be correlated, we use permutations to calculate a global p-value. We extend our omnibus test to the sequential case. In order to control the type I error rate, we propose a sequential method that uses permutations to obtain the stopping boundaries. The results of a simulation study show that the sequential permutation method is more powerful than alternative sequential methods that control the type I error rate, such as the inverse-normal method. The proposed method is flexible as we do not need to assume a mode of inheritance and can also adjust for confounding factors. An application to real clinical data illustrates that the method is computationally feasible for a large number of SNPs. Copyright (c) 2007 John Wiley & Sons, Ltd.