904 results for monotone missing data
Abstract:
Background: Selection bias in HIV prevalence estimates occurs if non-participation in testing is correlated with HIV status. Longitudinal data suggest that individuals who know or suspect they are HIV positive are less likely to participate in testing in HIV surveys, in which case methods to correct for missing data that are based on imputation and observed characteristics will produce biased results.
Methods: The identity of the HIV survey interviewer is typically associated with HIV testing participation, but is unlikely to be correlated with HIV status. Interviewer identity can thus be used as a selection variable, allowing estimation of Heckman-type selection models. These models produce asymptotically unbiased HIV prevalence estimates even when non-participation is correlated with unobserved characteristics, such as knowledge of HIV status. We introduce a new random-effects method for these selection models which overcomes the non-convergence caused by collinearity, the small-sample bias, and the incorrect inference of existing approaches. Our method is easy to implement in standard statistical software and allows the construction of bootstrapped standard errors that adjust for the fact that the relationship between testing and HIV status is uncertain and needs to be estimated.
Results: Using nationally representative data from the Demographic and Health Surveys, we illustrate our approach with new point estimates and confidence intervals (CI) for HIV prevalence among men in Ghana (2003) and Zambia (2007). In Ghana, we find little evidence of selection bias: our selection model gives an HIV prevalence estimate of 1.4% (95% CI 1.2%–1.6%), compared with 1.6% among those with a valid HIV test. In Zambia, our selection model gives an HIV prevalence estimate of 16.3% (95% CI 11.0%–18.4%), compared with 12.1% among those with a valid HIV test; those who decline to test in Zambia are therefore more likely to be HIV positive.
Conclusions: Our approach corrects for selection bias in HIV prevalence estimates, can be implemented even when HIV prevalence or non-participation is very high or very low, and provides a practical way to account for both sampling and parameter uncertainty in the estimation of confidence intervals. The wide confidence intervals estimated in the example with high HIV prevalence indicate that it is difficult to correct statistically for the bias that may occur when a large proportion of people refuse to test.
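A minimal sketch of the Heckman-type selection structure described above, in one common parameterisation (the notation and the treatment of interviewer effects are ours, not necessarily the authors'): consent to testing and HIV status are modelled jointly, with interviewer identity entering only the selection equation,

\[
s_i^* = z_i'\gamma + \delta_{j(i)} + u_i, \qquad s_i = \mathbf{1}[s_i^* > 0] \quad \text{(consent to test)},
\]
\[
h_i^* = x_i'\beta + \varepsilon_i, \qquad h_i = \mathbf{1}[h_i^* > 0] \quad \text{(HIV status, observed only if } s_i = 1\text{)},
\]
\[
(u_i, \varepsilon_i) \sim \mathcal{N}_2\!\left( \mathbf{0}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right),
\]

where \(\delta_{j(i)}\) is the effect of the interviewer allocated to respondent \(i\) and \(\rho\) captures the dependence between participation and HIV status; \(\rho \neq 0\) corresponds to data that are missing not at random, and population prevalence is recovered from the joint model rather than from the tested subsample alone.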
Adjusting HIV Prevalence Estimates for Non-participation: An Application to Demographic Surveillance
Abstract:
Introduction: HIV testing is a cornerstone of efforts to combat the HIV epidemic, and testing conducted as part of surveillance provides invaluable data on the spread of infection and the effectiveness of campaigns to reduce the transmission of HIV. However, participation in HIV testing can be low, and if respondents systematically select not to be tested because they know or suspect they are HIV positive (and fear disclosure), standard approaches to deal with missing data will fail to remove selection bias. We implemented Heckman-type selection models, which can be used to adjust for missing data that are not missing at random, and established the extent of selection bias in a population-based HIV survey in an HIV hyperendemic community in rural South Africa.
Methods: We used data from a population-based HIV survey carried out in 2009 in rural KwaZulu-Natal, South Africa. In this survey, 5565 women (35%) and 2567 men (27%) provided blood for an HIV test. We accounted for missing data using interviewer identity as a selection variable which predicted consent to HIV testing but was unlikely to be independently associated with HIV status. Our approach involved using this selection variable to examine the HIV status of residents who would ordinarily refuse to test, except that they were allocated a persuasive interviewer. Our copula model allows for flexibility when modelling the dependence structure between HIV survey participation and HIV status.
Results: For women, our selection model generated an HIV prevalence estimate of 33% (95% CI 27–40) for all people eligible to consent to HIV testing in the survey. This estimate is higher than the estimate of 24% generated when only information from respondents who participated in testing is used in the analysis, and the estimate of 27% when imputation analysis is used to predict missing data on HIV status. For men, we found an HIV prevalence of 25% (95% CI 15–35) using the selection model, compared to 16% among those who participated in testing, and 18% estimated with imputation. We provide new confidence intervals that correct for the fact that the relationship between testing and HIV status is unknown and requires estimation.
Conclusions: We confirm the feasibility and value of adopting selection models to account for missing data in population-based HIV surveys and surveillance systems. Elements of survey design, such as interviewer identity, present the opportunity to adopt this approach in routine applications. Where non-participation is high, the true confidence intervals are much wider than those suggested by standard approaches to dealing with missing data.
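Both abstracts emphasise confidence intervals that propagate the uncertainty in the estimated relationship between testing and HIV status. A minimal sketch of that idea in Python (the function name, data layout and `estimate_prevalence` callable are placeholders, not the authors' implementation): refit the entire selection model in each bootstrap replicate and take percentiles of the resulting prevalence estimates.

```python
import numpy as np

def bootstrap_prevalence_ci(data, estimate_prevalence, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a model-based prevalence estimate.

    `data` holds one row per survey respondent; `estimate_prevalence` is any
    callable that fits the selection model on a resampled survey and returns
    the implied population prevalence.  Because the model is re-fitted in
    every replicate, the interval reflects both sampling variation and
    uncertainty in the selection parameters (e.g. the error correlation).
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample respondents with replacement
        estimates.append(estimate_prevalence(data[idx]))
    lo, hi = np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)
```

In the surveys described above, resampling would normally respect the survey design (for example resampling clusters or interviewers rather than individual respondents); the sketch ignores this for brevity.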
Abstract:
How can we correlate neural activity in the human brain as it responds to words with behavioral data expressed as answers to questions about these same words? In short, we want to find latent variables that explain both the brain activity and the behavioral responses. We show that this is an instance of the Coupled Matrix-Tensor Factorization (CMTF) problem. We propose Scoup-SMT, a novel, fast, and parallel algorithm that solves the CMTF problem and produces a sparse latent low-rank subspace of the data. In our experiments, we find that Scoup-SMT is 50–100 times faster than a state-of-the-art algorithm for CMTF, along with a 5-fold increase in sparsity. Moreover, we extend Scoup-SMT to handle missing data without degradation of performance. We apply Scoup-SMT to BrainQ, a dataset consisting of a (nouns, brain voxels, human subjects) tensor and a (nouns, properties) matrix, with coupling along the nouns dimension. Scoup-SMT is able to find meaningful latent variables, as well as to predict brain activity with competitive accuracy. Finally, we demonstrate the generality of Scoup-SMT by applying it to a Facebook dataset (users, friends, wall-postings); there, Scoup-SMT spots spammer-like anomalies.
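For context, one common formulation of the CMTF objective the abstract refers to (generic notation; Scoup-SMT additionally imposes sparsity on the factors and solves the problem in parallel): the (nouns, voxels, subjects) tensor \(\mathcal{X}\) and the (nouns, properties) matrix \(Y\) share the factor matrix \(A\) along the coupled nouns mode,

\[
\min_{A,B,C,D} \;\; \Big\| \mathcal{X} - \sum_{r=1}^{R} a_r \circ b_r \circ c_r \Big\|_F^2 \;+\; \big\| Y - A D^{\top} \big\|_F^2,
\]

where \(a_r, b_r, c_r\) are the \(r\)-th columns of \(A, B, C\), \(\circ\) denotes the outer (rank-one) product, and \(R\) is the number of latent variables.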
Abstract:
Background: Prospective investigations of the association between impaired orthostatic blood pressure (BP) regulation and cognitive decline in older adults are limited, and findings to date have been mixed. The aim of this study was to determine whether impaired recovery of orthostatic BP was associated with change in cognitive function over a 2-year period in a population-based sample of community-dwelling older adults.
Methods: Data from the first two waves of the Irish Longitudinal Study on Ageing were analysed. Orthostatic BP was measured during a lying to standing orthostatic stress protocol at wave 1 using beat-to-beat digital plethysmography, and impaired recovery of BP at 40 s post stand was investigated. Cognitive function was assessed at wave 1 and wave 2 (2 years later) using the Mini-Mental State Exam (MMSE), verbal fluency and word recall tasks.
Results: After adjustment for measured potential confounders, and with multiple imputation for missing data, the change in the number of errors on the MMSE between waves was 10% higher [IRR (95% CI) = 1.10 (0.96, 1.26)] in those with impaired recovery at 40 s. However, this was not statistically significant (p = 0.17). Impaired BP recovery was not associated with change in performance on any of the other cognitive measures.
Conclusions: There was no clear evidence for an association between impaired recovery of orthostatic BP and change in cognition over a 2-year period in this nationally representative cohort of older adults. Longer follow-up and more detailed cognitive testing would be advantageous to further investigate the relationship between orthostatic BP and cognitive decline.
Abstract:
Background: The Well London programme used community engagement, complemented by changes to the physical and social neighbourhood environment, to improve physical activity levels, healthy eating and mental wellbeing in the most deprived communities in London. The effectiveness of Well London is being evaluated in a pair-matched cluster randomised trial (CRT). The baseline survey data are reported here.
Methods: The CRT involved 20 matched pairs of intervention and control communities (defined as UK census lower super output areas (LSOAs) ranked in the 11% most deprived LSOAs in London by the Index of Multiple Deprivation) across 20 London boroughs. The primary trial outcomes, sociodemographic information and environmental neighbourhood characteristics were assessed at baseline in three quantitative components of the Well London CRT: a cross-sectional, interviewer-administered adult household survey; a self-completed, school-based adolescent questionnaire; and a fieldworker-completed neighbourhood environmental audit. Baseline data collection occurred in 2008. Physical activity, healthy eating and mental wellbeing were assessed using standardised, validated questionnaire tools. Multiple imputation was used to account for missing data in the outcomes and other variables in the adult and adolescent surveys.
Results: There were 4107 adult and 1214 adolescent respondents in the baseline surveys. The intervention and control areas were broadly comparable with respect to the primary outcomes and key sociodemographic characteristics. The environmental characteristics of the intervention and control neighbourhoods were broadly similar. There was greater between-cluster variation in the primary outcomes in the adult population than in the adolescent population. Levels of healthy eating, smoking and self-reported anxiety/depression were similar in the Well London population and the national Health Survey for England. Levels of physical activity were higher in the Well London population, but this is likely to be due to the different measurement tools used in the two surveys.
Conclusions: Randomisation of social interventions such as Well London is acceptable and feasible, and in this study the intervention and control arms are well balanced with respect to the primary outcomes and key sociodemographic characteristics. The matched design has improved the statistical efficiency of the study amongst adults, but less so amongst adolescents. Follow-up data collection will be completed in 2012.
Abstract:
Between the Bullet and the Hole is a film centred on the elusive and complex effects of war on women's role in ballistic research and early computing. The film features new and archival high-speed bullet photography, schlieren and electric spark imagery, bullet sound wave imagery, forensic ballistic photography, slide rules, punch cards, computer diagrams, and a soundtrack by Scanner. Like a frantic animation storyboard, it explores the flickering space between the frames, testing the perceptual mechanics of visual interpolation, the possibility of reading or deciphering the gap between before and after. Interpolation - the main task of the women studying ballistics in WW2 - is the construction or guessing of missing data using only two known data points. The film tries to unpack this gap, open it up to interrogation. It questions how we read, interpolate or construct the gaps between bullet and hole, perpetrator and victim, presence and absence. The project involves exchanges with specialists in this area such as the Cranfield University Forensics department, the London-based Forensic Firearms Consultancy, the Imperial War Museum, the ENIAC programmers project, the Smithsonian Institution, and forensic scientists at the Palm Beach County Sheriff's Office (USA). Exhibitions: solo exhibition at Dallas Contemporary (Texas, Jan-Mar 2016), including newly commissioned lenticular prints and a dual slide projector installation; group exhibition at the Sydney Biennale (Sydney, Mar-June 2016); UK premiere and solo retrospective screening at Whitechapel Gallery (London); forthcoming solo exhibition at Iliya Fridman Gallery (NY, Oct-Dec 2016). Film festivals and screenings: International Film Festival Rotterdam (Jan 2016); Whitechapel Gallery (London, Feb 2016); Cornerhouse/Home (Manchester, Nov 2016). Public lectures: Whitechapel Gallery with Prof. David Alan Grier and Morgan Quaintance; Carriageworks (Sydney) with Prof. Douglas Khan; Monash University (Melbourne); Gertrude Space (Melbourne). Reviews and interviews: Artforum, Studio International, Mousse Magazine.
Abstract:
Affiliation: Département de Biochimie, Université de Montréal
Abstract:
Missing data are common in surveys and can lead to substantial errors in parameter estimation. This methodological thesis in sociology examines the influence of missing data on the estimation of the effect of a prevention programme. The first two sections set out the biases that missing data can introduce and present the theoretical frameworks used to describe them. The third section covers methods for handling missing data: the classical methods are described, along with three more recent ones. The fourth section presents the Enquête longitudinale et expérimentale de Montréal (ELEM) and describes the data used. The fifth section sets out the analyses performed; it covers the method used to estimate the effect of an intervention from longitudinal data, an in-depth description of the missing data in the ELEM, and a diagnosis of the missingness patterns and mechanism. The sixth section reports the estimates of the programme effect under different assumptions about the missing-data mechanism and using four methods: complete-case analysis, maximum likelihood, weighting, and multiple imputation. The results indicate (I) that the assumption of a MAR missing-data mechanism appears to influence the estimate of the programme effect, and (II) that the estimates obtained with the different methods lead to similar conclusions about the effect of the intervention.
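As a generic illustration of the multiple-imputation workflow compared in the final section (not the ELEM analysis itself; the variable names, the scikit-learn imputer, and the plain OLS outcome model are illustrative assumptions): create m completed datasets, fit the analysis model on each, and pool the results with Rubin's rules.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def mi_pooled_effect(df, outcome, treatment, covariates, m=20):
    """Multiple imputation plus Rubin's rules for one regression coefficient."""
    cols = [outcome, treatment] + list(covariates)
    estimates, variances = [], []
    for i in range(m):
        # each imputation draws from the posterior predictive distribution
        imputer = IterativeImputer(sample_posterior=True, random_state=i)
        completed = pd.DataFrame(imputer.fit_transform(df[cols]), columns=cols)
        X = sm.add_constant(completed[[treatment] + list(covariates)])
        fit = sm.OLS(completed[outcome], X).fit()
        estimates.append(fit.params[treatment])
        variances.append(fit.bse[treatment] ** 2)
    q_bar = np.mean(estimates)                   # pooled point estimate
    u_bar = np.mean(variances)                   # within-imputation variance
    b = np.var(estimates, ddof=1)                # between-imputation variance
    total_se = np.sqrt(u_bar + (1 + 1 / m) * b)  # Rubin's rules total variance
    return q_bar, total_se
```

A longitudinal programme-effect analysis like the one described above would replace the OLS step with the appropriate longitudinal model; only the impute-analyse-pool structure carries over.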
Abstract:
The explosion in the number of available sequences has allowed phylogenomics, that is, the study of relationships among species from large multi-gene alignments, to flourish. It is unquestionably a way to overcome the stochastic error of single-gene phylogenies, but many problems remain despite progress in modelling the evolutionary process. In this thesis we characterise certain aspects of model misfit to the data and study their impact on the accuracy of inference. Unlike heterotachy, variation over time in the amino-acid substitution process has so far received little attention. We show not only that this heterogeneity is widespread in animals, but also that its existence can harm the quality of phylogenomic inference. Thus, in the absence of an adequate model, removing heterogeneous columns that are poorly handled by the model can make a reconstruction artefact disappear. In a phylogenomic setting, the sequencing techniques used often imply that not all genes are present for all species. The controversy over the impact of the amount of empty cells has recently been revived, but most studies of missing data are carried out on small simulated sequence datasets. We therefore set out to quantify this impact in the case of a large alignment of real data. For a reasonable proportion of missing data, it appears that the incompleteness of the alignment affects the accuracy of inference less than the choice of model does. Conversely, adding an incomplete sequence that breaks a long branch can restore, at least partially, an erroneous phylogeny. Since model violations remain the major limitation on the accuracy of phylogenetic inference, improving the sampling of species and genes remains a useful alternative in the absence of an adequate model. We therefore developed sequence-selection software that builds reproducible datasets based on the amount of data present, the rate of evolution, and compositional biases. In the course of this study we showed that human expertise still provides indispensable knowledge. The various analyses carried out for this thesis all point to the primary importance of the evolutionary model.
Abstract:
Despite steady progress in computing power, memory, and the quantity of available data, machine learning algorithms must use these resources efficiently. Minimising costs is obviously an important factor, but another motivation is the search for learning mechanisms capable of reproducing the behaviour of intelligent beings. This thesis addresses the problem of efficiency through several articles dealing with a variety of learning algorithms: the problem is considered not only from the point of view of computational efficiency (computing time and memory used), but also from that of statistical efficiency (the number of examples required to accomplish a given task). A first contribution of this thesis is to expose statistical inefficiencies in existing algorithms. We show that decision trees generalise poorly for certain types of tasks (Chapter 3), as do classical graph-based semi-supervised learning algorithms (Chapter 5), each being affected by a particular form of the curse of dimensionality. For a certain class of neural networks, called sum-product networks, we show that representing certain functions with single-hidden-layer networks can be exponentially less efficient than with deep networks (Chapter 4). Our analyses improve the understanding of some problems intrinsic to these algorithms and point research towards directions that might resolve them. We also identify computational inefficiencies in graph-based semi-supervised learning algorithms (Chapter 5) and in the learning of Gaussian mixtures in the presence of missing values (Chapter 6). In both cases we propose new algorithms able to handle significantly larger datasets. The last two chapters address computational efficiency from a different angle. In Chapter 7 we provide a theoretical analysis of an existing algorithm for efficient learning in restricted Boltzmann machines (contrastive divergence), in order to better understand the reasons for its success. Finally, in Chapter 8 we present an application of machine learning to video games, in which the problem of computational efficiency is tied to software and hardware engineering considerations that are often ignored in research yet crucially important in practice.
Abstract:
We present distribution-independent bounds on the generalization misclassification performance of a family of kernel classifiers with margin. Support Vector Machine (SVM) classifiers stem from this class of machines. The bounds are derived through computations of the $V_\gamma$ dimension of a family of loss functions to which the SVM loss belongs. Bounds that use functions of margin distributions (i.e. functions of the slack variables of SVM) are also derived.
Abstract:
Customer satisfaction and retention are key issues for organizations in today's competitive marketplace. As such, much research and revenue have been invested in developing accurate ways of assessing consumer satisfaction at both the macro (national) and micro (organizational) level, facilitating comparisons in performance both within and between industries. Since the instigation of the national customer satisfaction indices (CSI), partial least squares (PLS) has been used to estimate the CSI models in preference to structural equation models (SEM) because it does not rely on strict assumptions about the data. However, this choice was based upon some misconceptions about the use of SEMs and does not take into consideration more recent advances in SEM, including estimation methods that are robust to non-normality and missing data. In this paper, both SEM and PLS approaches were compared by evaluating perceptions of the Isle of Man Post Office's products and customer service using a CSI format. The new robust SEM procedures were found to be advantageous over PLS. Product quality was found to be the only driver of customer satisfaction, while image and satisfaction were the only predictors of loyalty, thus arguing for the specificity of postal services.
Abstract:
The R package “compositions” is a tool for advanced compositional analysis. Its basic functionality has seen some conceptual improvement, now containing facilities to work with and represent ilr bases built from balances, and an elaborated subsystem for dealing with several kinds of irregular data: (rounded or structural) zeroes, incomplete observations and outliers. The general approach to these irregularities is based on subcompositions: for an irregular datum, one can distinguish a “regular” subcomposition (where all parts are actually observed and the datum behaves typically) and a “problematic” subcomposition (with those unobserved, zero or rounded parts, or else where the datum shows an erratic or atypical behaviour). Systematic classification schemes are proposed for both outliers and missing values (including zeros), focusing on the nature of irregularities in the datum subcomposition(s). To compute statistics with values missing at random and structural zeros, a projection approach is implemented: a given datum contributes to the estimation of the desired parameters only on the subcomposition where it was observed. For data sets with values below the detection limit, two different approaches are provided: the well-known imputation technique, and also the projection approach. To compute statistics in the presence of outliers, robust statistics are adapted to the characteristics of compositional data, based on the minimum covariance determinant approach. The outlier classification is based on four different models of outlier occurrence and Monte-Carlo-based tests for their characterization. Furthermore, the package provides special plots helping to understand the nature of outliers in the dataset.
Keywords: coda-dendrogram, lost values, MAR, missing data, MCD estimator, robustness, rounded zeros
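For reference, the balances from which ilr bases are built take the standard compositional-data form (general notation, not an API of the package): for a partition with \(r\) parts in the numerator group and \(s\) parts in the denominator group,

\[
b = \sqrt{\frac{rs}{r+s}} \, \ln \frac{\left( \prod_{i \in \text{num}} x_i \right)^{1/r}}{\left( \prod_{j \in \text{den}} x_j \right)^{1/s}},
\]

i.e. a normalised log-ratio of the geometric means of the two groups of parts; a sequence of such balances over a full binary partition of the parts defines an ilr basis.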
Abstract:
The objective of this thesis is to predict the performance of doctoral students at the Universidad de Girona from the students' personal (background), attitudinal, and social network characteristics. The population studied comprises third- and fourth-year doctoral students and their thesis supervisors. To obtain the data, a web questionnaire was designed, specifying its advantages and taking into account some traditional problems of non-coverage and non-response. A web questionnaire was used because of the complexity of the social network questions. The electronic questionnaire, through a series of instructions, reduces the time needed to respond and makes the task less burdensome. It is also self-administered, which, according to the literature, yields more honest answers than an interviewer-administered questionnaire. The quality of the social network questions in web questionnaires for egocentric data is analysed. To this end, the reliability and validity of this type of question are computed, for the first time, using the Multitrait-Multimethod (MTMM) model. Since the data are egocentric they can be considered hierarchical, and for the first time a multilevel Multitrait-Multimethod model is used. Reliability and validity can be obtained at the individual level (within-group component) or at the group level (between-group component) and are used to carry out a meta-analysis with other European universities to analyse certain questionnaire design characteristics: whether social network questions in web questionnaires are more reliable and valid when asked "by questions" or "by alters", whether all frequency labels for the items should be shown or only the first and last, and whether the questionnaire design is better in colour or in black and white. The quality of the network as a whole is also analysed, in this specific case the university's research groups. The problems of missing data in complete networks are addressed. A new alternative is proposed to the usual solutions of the egocentric network or proxy respondents: we call it the "Nosduocentered Network", based on two central actors in a network. Estimating regression models, this Nosduocentered Network has more predictive power for doctoral students' performance than the egocentric network. In addition, the correlations of the attitudinal variables are corrected for attenuation due to the small sample size. Finally, regressions are run for the three types of variables (background, attitudinal, and social network) and then combined to analyse which best predicts the performance (in terms of academic publications) of the doctoral students. The results lead to the conclusion that the academic performance of doctoral students depends on personal (background) and attitudinal variables. The results are also compared with other published studies.
Abstract:
An improved algorithm for the generation of gridded window brightness temperatures is presented. The primary data source is the International Satellite Cloud Climatology Project, level B3 data, covering the period from July 1983 to the present. The algorithm takes window brightness temperatures from multiple satellites, both geostationary and polar orbiting, which have already been navigated and normalized radiometrically to the National Oceanic and Atmospheric Administration's Advanced Very High Resolution Radiometer, and generates 3-hourly global images on a 0.5 degrees by 0.5 degrees latitude-longitude grid. The gridding uses a hierarchical scheme based on spherical kernel estimators. As part of the gridding procedure, the geostationary data are corrected for limb effects using a simple empirical correction to the radiances, from which the corrected temperatures are computed. This is in addition to the application of satellite zenith angle weighting to downweight limb pixels relative to nearer-nadir pixels. The polar orbiter data are windowed on the target time with temporal weighting to account for the noncontemporaneous nature of the data. Large regions of missing data are interpolated from adjacent processed images using a form of motion-compensated interpolation based on the estimation of motion vectors using a hierarchical block-matching scheme. Examples are shown of the various stages in the process. Also shown are examples of the usefulness of this type of data in GCM validation.
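A deliberately simplified sketch of the kernel-gridding step described above (single-level rather than hierarchical, with an illustrative Gaussian kernel width; the function and parameter names are ours): each pixel contributes to grid cells with a weight that decays with great-circle distance, optionally multiplied by a satellite-zenith-angle weight to downweight limb pixels.

```python
import numpy as np

def grid_brightness_temps(lat, lon, tb, zenith_weight=None, res=0.5, sigma_km=50.0):
    """Grid pixel brightness temperatures onto a regular lat-lon grid using a
    Gaussian kernel in great-circle distance (simplified, single-level)."""
    lat, lon, tb = map(np.asarray, (lat, lon, tb))
    w0 = np.ones_like(tb, dtype=float) if zenith_weight is None else np.asarray(zenith_weight)
    R = 6371.0  # Earth radius in km
    glat = np.arange(-90 + res / 2, 90, res)    # grid cell centre latitudes
    glon = np.arange(-180 + res / 2, 180, res)  # grid cell centre longitudes
    num = np.zeros((glat.size, glon.size))
    den = np.zeros_like(num)
    cos_glat = np.cos(np.radians(glat))[:, None]
    for la, lo, t, w in zip(np.radians(lat), np.radians(lon), tb, w0):
        # haversine distance from this pixel to every grid-cell centre
        dlat = np.radians(glat)[:, None] - la
        dlon = np.radians(glon)[None, :] - lo
        a = np.sin(dlat / 2) ** 2 + np.cos(la) * cos_glat * np.sin(dlon / 2) ** 2
        d = 2 * R * np.arcsin(np.sqrt(np.clip(a, 0, 1)))
        k = w * np.exp(-0.5 * (d / sigma_km) ** 2)  # distance kernel x zenith weight
        num += k * t
        den += k
    tb_grid = np.where(den > 0, num / den, np.nan)  # kernel-weighted mean per cell
    return glat, glon, tb_grid
```

A production implementation would restrict each pixel's contribution to nearby grid cells and use the hierarchical scheme; the full-grid loop here is written only for clarity.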