971 resultados para Gaussian mixture models
Resumo:
wo methods for registering laser-scans of human heads and transforming them to a new semantically consistent topology defined by a user-provided template mesh are described. Both algorithms are stated within the Iterative Closest Point framework. The first method is based on finding landmark correspondences by iteratively registering the vicinity of a landmark with a re-weighted error function. Thin-plate spline interpolation is then used to deform the template mesh and finally the scan is resampled in the topology of the deformed template. The second algorithm employs a morphable shape model, which can be computed from a database of laser-scans using the first algorithm. It directly optimizes pose and shape of the morphable model. The use of the algorithm with PCA mixture models, where the shape is split up into regions each described by an individual subspace, is addressed. Mixture models require either blending or regularization strategies, both of which are described in detail. For both algorithms, strategies for filling in missing geometry for incomplete laser-scans are described. While an interpolation-based approach can be used to fill in small or smooth regions, the model-driven algorithm is capable of fitting a plausible complete head mesh to arbitrarily small geometry, which is known as "shape completion". The importance of regularization in the case of extreme shape completion is shown.
Resumo:
Recurrent wheezing or asthma is a common problem in children that has increased considerably in prevalence in the past few decades. The causes and underlying mechanisms are poorly understood and it is thought that a numb er of distinct diseases causing similar symptoms are involved. Due to the lack of a biologically founded classification system, children are classified according to their observed disease related features (symptoms, signs, measurements) into phenotypes. The objectives of this PhD project were a) to develop tools for analysing phenotypic variation of a disease, and b) to examine phenotypic variability of wheezing among children by applying these tools to existing epidemiological data. A combination of graphical methods (multivariate co rrespondence analysis) and statistical models (latent variables models) was used. In a first phase, a model for discrete variability (latent class model) was applied to data on symptoms and measurements from an epidemiological study to identify distinct phenotypes of wheezing. In a second phase, the modelling framework was expanded to include continuous variability (e.g. along a severity gradient) and combinations of discrete and continuo us variability (factor models and factor mixture models). The third phase focused on validating the methods using simulation studies. The main body of this thesis consists of 5 articles (3 published, 1 submitted and 1 to be submitted) including applications, methodological contributions and a review. The main findings and contributions were: 1) The application of a latent class model to epidemiological data (symptoms and physiological measurements) yielded plausible pheno types of wheezing with distinguishing characteristics that have previously been used as phenotype defining characteristics. 2) A method was proposed for including responses to conditional questions (e.g. questions on severity or triggers of wheezing are asked only to children with wheeze) in multivariate modelling.ii 3) A panel of clinicians was set up to agree on a plausible model for wheezing diseases. The model can be used to generate datasets for testing the modelling approach. 4) A critical review of methods for defining and validating phenotypes of wheeze in children was conducted. 5) The simulation studies showed that a parsimonious parameterisation of the models is required to identify the true underlying structure of the data. The developed approach can deal with some challenges of real-life cohort data such as variables of mixed mode (continuous and categorical), missing data and conditional questions. If carefully applied, the approach can be used to identify whether the underlying phenotypic variation is discrete (classes), continuous (factors) or a combination of these. These methods could help improve precision of research into causes and mechanisms and contribute to the development of a new classification of wheezing disorders in children and other diseases which are difficult to classify.
Resumo:
Complex diseases such as cancer result from multiple genetic changes and environmental exposures. Due to the rapid development of genotyping and sequencing technologies, we are now able to more accurately assess causal effects of many genetic and environmental factors. Genome-wide association studies have been able to localize many causal genetic variants predisposing to certain diseases. However, these studies only explain a small portion of variations in the heritability of diseases. More advanced statistical models are urgently needed to identify and characterize some additional genetic and environmental factors and their interactions, which will enable us to better understand the causes of complex diseases. In the past decade, thanks to the increasing computational capabilities and novel statistical developments, Bayesian methods have been widely applied in the genetics/genomics researches and demonstrating superiority over some regular approaches in certain research areas. Gene-environment and gene-gene interaction studies are among the areas where Bayesian methods may fully exert its functionalities and advantages. This dissertation focuses on developing new Bayesian statistical methods for data analysis with complex gene-environment and gene-gene interactions, as well as extending some existing methods for gene-environment interactions to other related areas. It includes three sections: (1) Deriving the Bayesian variable selection framework for the hierarchical gene-environment and gene-gene interactions; (2) Developing the Bayesian Natural and Orthogonal Interaction (NOIA) models for gene-environment interactions; and (3) extending the applications of two Bayesian statistical methods which were developed for gene-environment interaction studies, to other related types of studies such as adaptive borrowing historical data. We propose a Bayesian hierarchical mixture model framework that allows us to investigate the genetic and environmental effects, gene by gene interactions (epistasis) and gene by environment interactions in the same model. It is well known that, in many practical situations, there exists a natural hierarchical structure between the main effects and interactions in the linear model. Here we propose a model that incorporates this hierarchical structure into the Bayesian mixture model, such that the irrelevant interaction effects can be removed more efficiently, resulting in more robust, parsimonious and powerful models. We evaluate both of the 'strong hierarchical' and 'weak hierarchical' models, which specify that both or one of the main effects between interacting factors must be present for the interactions to be included in the model. The extensive simulation results show that the proposed strong and weak hierarchical mixture models control the proportion of false positive discoveries and yield a powerful approach to identify the predisposing main effects and interactions in the studies with complex gene-environment and gene-gene interactions. We also compare these two models with the 'independent' model that does not impose this hierarchical constraint and observe their superior performances in most of the considered situations. The proposed models are implemented in the real data analysis of gene and environment interactions in the cases of lung cancer and cutaneous melanoma case-control studies. The Bayesian statistical models enjoy the properties of being allowed to incorporate useful prior information in the modeling process. Moreover, the Bayesian mixture model outperforms the multivariate logistic model in terms of the performances on the parameter estimation and variable selection in most cases. Our proposed models hold the hierarchical constraints, that further improve the Bayesian mixture model by reducing the proportion of false positive findings among the identified interactions and successfully identifying the reported associations. This is practically appealing for the study of investigating the causal factors from a moderate number of candidate genetic and environmental factors along with a relatively large number of interactions. The natural and orthogonal interaction (NOIA) models of genetic effects have previously been developed to provide an analysis framework, by which the estimates of effects for a quantitative trait are statistically orthogonal regardless of the existence of Hardy-Weinberg Equilibrium (HWE) within loci. Ma et al. (2012) recently developed a NOIA model for the gene-environment interaction studies and have shown the advantages of using the model for detecting the true main effects and interactions, compared with the usual functional model. In this project, we propose a novel Bayesian statistical model that combines the Bayesian hierarchical mixture model with the NOIA statistical model and the usual functional model. The proposed Bayesian NOIA model demonstrates more power at detecting the non-null effects with higher marginal posterior probabilities. Also, we review two Bayesian statistical models (Bayesian empirical shrinkage-type estimator and Bayesian model averaging), which were developed for the gene-environment interaction studies. Inspired by these Bayesian models, we develop two novel statistical methods that are able to handle the related problems such as borrowing data from historical studies. The proposed methods are analogous to the methods for the gene-environment interactions on behalf of the success on balancing the statistical efficiency and bias in a unified model. By extensive simulation studies, we compare the operating characteristics of the proposed models with the existing models including the hierarchical meta-analysis model. The results show that the proposed approaches adaptively borrow the historical data in a data-driven way. These novel models may have a broad range of statistical applications in both of genetic/genomic and clinical studies.
Resumo:
A number of methods for cooperative localization has been proposed, but most of them provide only location estimate, without associated uncertainty. On the other hand, nonparametric belief propagation (NBP), which provides approximated posterior distributions of the location estimates, is expensive mostly because of the transmission of the particles. In this paper, we propose a novel approach to reduce communication overhead for cooperative positioning using NBP. It is based on: i) communication of the beliefs (instead of the messages), ii) approximation of the belief with Gaussian mixture of very few components, and iii) censoring. According to our simulations results, these modifications reduce significantly communication overhead while providing the estimates almost as accurate as the transmission of the particles.
Resumo:
En entornos hostiles tales como aquellas instalaciones científicas donde la radiación ionizante es el principal peligro, el hecho de reducir las intervenciones humanas mediante el incremento de las operaciones robotizadas está siendo cada vez más de especial interés. CERN, la Organización Europea para la Investigación Nuclear, tiene alrededor de unos 50 km de superficie subterránea donde robots móviles controlador de forma remota podrían ayudar en su funcionamiento, por ejemplo, a la hora de llevar a cabo inspecciones remotas sobre radiación en los diferentes áreas destinados al efecto. No solo es preciso considerar que los robots deben ser capaces de recorrer largas distancias y operar durante largos periodos de tiempo, sino que deben saber desenvolverse en los correspondientes túneles subterráneos, tener en cuenta la presencia de campos electromagnéticos, radiación ionizante, etc. y finalmente, el hecho de que los robots no deben interrumpir el funcionamiento de los aceleradores. El hecho de disponer de un sistema de comunicaciones inalámbrico fiable y robusto es esencial para la correcta ejecución de las misiones que los robots deben afrontar y por supuesto, para evitar tales situaciones en las que es necesario la recuperación manual de los robots al agotarse su energía o al perder el enlace de comunicaciones. El objetivo de esta Tesis es proveer de las directrices y los medios necesarios para reducir el riesgo de fallo en la misión y maximizar las capacidades de los robots móviles inalámbricos los cuales disponen de almacenamiento finito de energía al trabajar en entornos peligrosos donde no se dispone de línea de vista directa. Para ello se proponen y muestran diferentes estrategias y métodos de comunicación inalámbrica. Teniendo esto en cuenta, se presentan a continuación los objetivos de investigación a seguir a lo largo de la Tesis: predecir la cobertura de comunicaciones antes y durante las misiones robotizadas; optimizar la capacidad de red inalámbrica de los robots móviles con respecto a su posición; y mejorar el rango operacional de esta clase de robots. Por su parte, las contribuciones a la Tesis se citan más abajo. El primer conjunto de contribuciones son métodos novedosos para predecir el consumo de energía y la autonomía en la comunicación antes y después de disponer de los robots en el entorno seleccionado. Esto es importante para proporcionar conciencia de la situación del robot y evitar fallos en la misión. El consumo de energía se predice usando una estrategia propuesta la cual usa modelos de consumo provenientes de diferentes componentes en un robot. La predicción para la cobertura de comunicaciones se desarrolla usando un nuevo filtro de RSS (Radio Signal Strength) y técnicas de estimación con la ayuda de Filtros de Kalman. El segundo conjunto de contribuciones son métodos para optimizar el rango de comunicaciones usando novedosas técnicas basadas en muestreo espacial que son robustas frente a ruidos de campos de detección y radio y que proporcionan redundancia. Se emplean métodos de diferencia central finitos para determinar los gradientes 2D RSS y se usa la movilidad del robot para optimizar el rango de comunicaciones y la capacidad de red. Este método también se valida con un caso de estudio centrado en la teleoperación háptica de robots móviles inalámbricos. La tercera contribución es un algoritmo robusto y estocástico descentralizado para la optimización de la posición al considerar múltiples robots autónomos usados principalmente para extender el rango de comunicaciones desde la estación de control al robot que está desarrollando la tarea. Todos los métodos y algoritmos propuestos se verifican y validan usando simulaciones y experimentos de campo con variedad de robots móviles disponibles en CERN. En resumen, esta Tesis ofrece métodos novedosos y demuestra su uso para: predecir RSS; optimizar la posición del robot; extender el rango de las comunicaciones inalámbricas; y mejorar las capacidades de red de los robots móviles inalámbricos para su uso en aplicaciones dentro de entornos peligrosos, que como ya se mencionó anteriormente, se destacan las instalaciones científicas con emisión de radiación ionizante. En otros términos, se ha desarrollado un conjunto de herramientas para mejorar, facilitar y hacer más seguras las misiones de los robots en entornos hostiles. Esta Tesis demuestra tanto en teoría como en práctica que los robots móviles pueden mejorar la calidad de las comunicaciones inalámbricas mediante la profundización en el estudio de su movilidad para optimizar dinámicamente sus posiciones y mantener conectividad incluso cuando no existe línea de vista. Los métodos desarrollados en la Tesis son especialmente adecuados para su fácil integración en robots móviles y pueden ser aplicados directamente en la capa de aplicación de la red inalámbrica. ABSTRACT In hostile environments such as in scientific facilities where ionising radiation is a dominant hazard, reducing human interventions by increasing robotic operations are desirable. CERN, the European Organization for Nuclear Research, has around 50 km of underground scientific facilities, where wireless mobile robots could help in the operation of the accelerator complex, e.g. in conducting remote inspections and radiation surveys in different areas. The main challenges to be considered here are not only that the robots should be able to go over long distances and operate for relatively long periods, but also the underground tunnel environment, the possible presence of electromagnetic fields, radiation effects, and the fact that the robots shall in no way interrupt the operation of the accelerators. Having a reliable and robust wireless communication system is essential for successful execution of such robotic missions and to avoid situations of manual recovery of the robots in the event that the robot runs out of energy or when the robot loses its communication link. The goal of this thesis is to provide means to reduce risk of mission failure and maximise mission capabilities of wireless mobile robots with finite energy storage capacity working in a radiation environment with non-line-of-sight (NLOS) communications by employing enhanced wireless communication methods. Towards this goal, the following research objectives are addressed in this thesis: predict the communication range before and during robotic missions; optimise and enhance wireless communication qualities of mobile robots by using robot mobility and employing multi-robot network. This thesis provides introductory information on the infrastructures where mobile robots will need to operate, the tasks to be carried out by mobile robots and the problems encountered in these environments. The reporting of research work carried out to improve wireless communication comprises an introduction to the relevant radio signal propagation theory and technology followed by explanation of the research in the following stages: An analysis of the wireless communication requirements for mobile robot for different tasks in a selection of CERN facilities; predictions of energy and communication autonomies (in terms of distance and time) to reduce risk of energy and communication related failures during missions; autonomous navigation of a mobile robot to find zone(s) of maximum radio signal strength to improve communication coverage area; and autonomous navigation of one or more mobile robots acting as mobile wireless relay (repeater) points in order to provide a tethered wireless connection to a teleoperated mobile robot carrying out inspection or radiation monitoring activities in a challenging radio environment. The specific contributions of this thesis are outlined below. The first sets of contributions are novel methods for predicting the energy autonomy and communication range(s) before and after deployment of the mobile robots in the intended environments. This is important in order to provide situational awareness and avoid mission failures. The energy consumption is predicted by using power consumption models of different components in a mobile robot. This energy prediction model will pave the way for choosing energy-efficient wireless communication strategies. The communication range prediction is performed using radio signal propagation models and applies radio signal strength (RSS) filtering and estimation techniques with the help of Kalman filters and Gaussian process models. The second set of contributions are methods to optimise the wireless communication qualities by using novel spatial sampling based techniques that are robust to sensing and radio field noises and provide redundancy features. Central finite difference (CFD) methods are employed to determine the 2-D RSS gradients and use robot mobility to optimise the communication quality and the network throughput. This method is also validated with a case study application involving superior haptic teleoperation of wireless mobile robots where an operator from a remote location can smoothly navigate a mobile robot in an environment with low-wireless signals. The third contribution is a robust stochastic position optimisation algorithm for multiple autonomous relay robots which are used for wireless tethering of radio signals and thereby to enhance the wireless communication qualities. All the proposed methods and algorithms are verified and validated using simulations and field experiments with a variety of mobile robots available at CERN. In summary, this thesis offers novel methods and demonstrates their use to predict energy autonomy and wireless communication range, optimise robots position to improve communication quality and enhance communication range and wireless network qualities of mobile robots for use in applications in hostile environmental characteristics such as scientific facilities emitting ionising radiations. In simpler terms, a set of tools are developed in this thesis for improving, easing and making safer robotic missions in hostile environments. This thesis validates both in theory and experiments that mobile robots can improve wireless communication quality by exploiting robots mobility to dynamically optimise their positions and maintain connectivity even when the (radio signal) environment possess non-line-of-sight characteristics. The methods developed in this thesis are well-suited for easier integration in mobile robots and can be applied directly at the application layer of the wireless network. The results of the proposed methods have outperformed other comparable state-of-the-art methods.
Resumo:
Peer reviewed
Resumo:
Although NSSI engagement is a growing public health concern, little research has documented the developmental precursors to NSSI in longitudinal studies using youth samples. This study aimed to expand upon previous research on groups of NSSI engagement in a population-based sample of youth using multi-wave data. Moreover, this study examined whether chronic peer and romantic stress, the serotonin transporter gene (5-HTTLPR), parenting behaviors, and negative attributional style predicted the NSSI group membership as well as the role of sex and grade. Participants were 549 youth in beginning in the 3rd, 6th, and 9th grades at the baseline assessment. NSSI was assessed across 7 waves of data. Chronic peer and romantic stress, 5-HTTLPR, parenting behaviors, and negative attributional style were assessed at baseline. Growth mixture models, conducted to test the latent trajectory of NSSI groups did not converge. Three NSSI groups were manually created according to classifications that were determined a priori. NSSI groups included: no NSSI (85.1%), episodic NSSI (8.5%), and repeated NSSI (6.4%). Chronic peer and romantic stress, sex, and grade differentiated the no NSSI vs. repeated NSSI groups and the episodic NSSI vs. repeated NSSI groups. Specifically, higher levels of stress, being female, and being in higher grades related to repeated NSSI. 5-HTTLPR differentiated the no NSSI vs. repeated NSSI groups, such that carrying the short allele of 5-HTTLPR related to repeated NSSI. Exploratory analyses revealed that the relationship between attributional style and NSSI group was moderated by grade. This study suggests chronic interpersonal peer and romantic stress is an important factor placing youth at greater risk for repeatedly engaging in NSSI.
Resumo:
Using a sample of 339 university graduates from the University of Alicante (Spain) three years after completion of their studies, we studied the relationships between general intelligence (GI), personality traits, emotional intelligence (EI), academic performance, and occupational attainment and compared the results of conventional regression analysis with the results obtained from applying regression mixture models. The results reveal the influence of unobserved population heterogeneity (latent class) on the relationship between predictors and criteria and the improvement in the prediction obtained from applying regression mixture models compared to applying a conventional regression model.
Resumo:
Normal mixture models are often used to cluster continuous data. However, conventional approaches for fitting these models will have problems in producing nonsingular estimates of the component-covariance matrices when the dimension of the observations is large relative to the number of observations. In this case, methods such as principal components analysis (PCA) and the mixture of factor analyzers model can be adopted to avoid these estimation problems. We examine these approaches applied to the Cabernet wine data set of Ashenfelter (1999), considering the clustering of both the wines and the judges, and comparing our results with another analysis. The mixture of factor analyzers model proves particularly effective in clustering the wines, accurately classifying many of the wines by location.
Resumo:
Finite mixture models are being increasingly used to model the distributions of a wide variety of random phenomena. While normal mixture models are often used to cluster data sets of continuous multivariate data, a more robust clustering can be obtained by considering the t mixture model-based approach. Mixtures of factor analyzers enable model-based density estimation to be undertaken for high-dimensional data where the number of observations n is very large relative to their dimension p. As the approach using the multivariate normal family of distributions is sensitive to outliers, it is more robust to adopt the multivariate t family for the component error and factor distributions. The computational aspects associated with robustness and high dimensionality in these approaches to cluster analysis are discussed and illustrated.
Resumo:
A Bayesian procedure for the retrieval of wind vectors over the ocean using satellite borne scatterometers requires realistic prior near-surface wind field models over the oceans. We have implemented carefully chosen vector Gaussian Process models; however in some cases these models are too smooth to reproduce real atmospheric features, such as fronts. At the scale of the scatterometer observations, fronts appear as discontinuities in wind direction. Due to the nature of the retrieval problem a simple discontinuity model is not feasible, and hence we have developed a constrained discontinuity vector Gaussian Process model which ensures realistic fronts. We describe the generative model and show how to compute the data likelihood given the model. We show the results of inference using the model with Markov Chain Monte Carlo methods on both synthetic and real data.
Resumo:
Investigations into the modelling techniques that depict the transport of discrete phases (gas bubbles or solid particles) and model biochemical reactions in a bubble column reactor are discussed here. The mixture model was used to calculate gas-liquid, solid-liquid and gasliquid-solid interactions. Multiphase flow is a difficult phenomenon to capture, particularly in bubble columns where the major driving force is caused by the injection of gas bubbles. The gas bubbles cause a large density difference to occur that results in transient multi-dimensional fluid motion. Standard design procedures do not account for the transient motion, due to the simplifying assumptions of steady plug flow. Computational fluid dynamics (CFD) can assist in expanding the understanding of complex flows in bubble columns by characterising the flow phenomena for many geometrical configurations. Therefore, CFD has a role in the education of chemical and biochemical engineers, providing the examples of flow phenomena that many engineers may not experience, even through experimentation. The performance of the mixture model was investigated for three domains (plane, rectangular and cylindrical) and three flow models (laminar, k-e turbulence and the Reynolds stresses). mThis investigation raised many questions about how gas-liquid interactions are captured numerically. To answer some of these questions the analogy between thermal convection in a cavity and gas-liquid flow in bubble columns was invoked. This involved modelling the buoyant motion of air in a narrow cavity for a number of turbulence schemes. The difference in density was caused by a temperature gradient that acted across the width of the cavity. Multiple vortices were obtained when the Reynolds stresses were utilised with the addition of a basic flow profile after each time step. To implement the three-phase models an alternative mixture model was developed and compared against a commercially available mixture model for three turbulence schemes. The scheme where just the Reynolds stresses model was employed, predicted the transient motion of the fluids quite well for both mixture models. Solid-liquid and then alternative formulations of gas-liquid-solid model were compared against one another. The alternative form of the mixture model was found to perform particularly well for both gas and solid phase transport when calculating two and three-phase flow. The improvement in the solutions obtained was a result of the inclusion of the Reynolds stresses model and differences in the mixture models employed. The differences between the alternative mixture models were found in the volume fraction equation (flux and deviatoric stress tensor terms) and the viscosity formulation for the mixture phase.
Resumo:
This thesis addresses the Batch Reinforcement Learning methods in Robotics. This sub-class of Reinforcement Learning has shown promising results and has been the focus of recent research. Three contributions are proposed that aim to extend the state-of-art methods allowing for a faster and more stable learning process, such as required for learning in Robotics. The Q-learning update-rule is widely applied, since it allows to learn without the presence of a model of the environment. However, this update-rule is transition-based and does not take advantage of the underlying episodic structure of collected batch of interactions. The Q-Batch update-rule is proposed in this thesis, to process experiencies along the trajectories collected in the interaction phase. This allows a faster propagation of obtained rewards and penalties, resulting in faster and more robust learning. Non-parametric function approximations are explored, such as Gaussian Processes. This type of approximators allows to encode prior knowledge about the latent function, in the form of kernels, providing a higher level of exibility and accuracy. The application of Gaussian Processes in Batch Reinforcement Learning presented a higher performance in learning tasks than other function approximations used in the literature. Lastly, in order to extract more information from the experiences collected by the agent, model-learning techniques are incorporated to learn the system dynamics. In this way, it is possible to augment the set of collected experiences with experiences generated through planning using the learned models. Experiments were carried out mainly in simulation, with some tests carried out in a physical robotic platform. The obtained results show that the proposed approaches are able to outperform the classical Fitted Q Iteration.
Resumo:
The recent advent of new technologies has led to huge amounts of genomic data. With these data come new opportunities to understand biological cellular processes underlying hidden regulation mechanisms and to identify disease related biomarkers for informative diagnostics. However, extracting biological insights from the immense amounts of genomic data is a challenging task. Therefore, effective and efficient computational techniques are needed to analyze and interpret genomic data. In this thesis, novel computational methods are proposed to address such challenges: a Bayesian mixture model, an extended Bayesian mixture model, and an Eigen-brain approach. The Bayesian mixture framework involves integration of the Bayesian network and the Gaussian mixture model. Based on the proposed framework and its conjunction with K-means clustering and principal component analysis (PCA), biological insights are derived such as context specific/dependent relationships and nested structures within microarray where biological replicates are encapsulated. The Bayesian mixture framework is then extended to explore posterior distributions of network space by incorporating a Markov chain Monte Carlo (MCMC) model. The extended Bayesian mixture model summarizes the sampled network structures by extracting biologically meaningful features. Finally, an Eigen-brain approach is proposed to analyze in situ hybridization data for the identification of the cell-type specific genes, which can be useful for informative blood diagnostics. Computational results with region-based clustering reveals the critical evidence for the consistency with brain anatomical structure.
Resumo:
The objective of this study was to gain an understanding of the effects of population heterogeneity, missing data, and causal relationships on parameter estimates from statistical models when analyzing change in medication use. From a public health perspective, two timely topics were addressed: the use and effects of statins in populations in primary prevention of cardiovascular disease and polypharmacy in older population. Growth mixture models were applied to characterize the accumulation of cardiovascular and diabetes medications among apparently healthy population of statin initiators. The causal effect of statin adherence on the incidence of acute cardiovascular events was estimated using marginal structural models in comparison with discrete-time hazards models. The impact of missing data on the growth estimates of evolution of polypharmacy was examined comparing statistical models under different assumptions for missing data mechanism. The data came from Finnish administrative registers and from the population-based Geriatric Multidisciplinary Strategy for the Good Care of the Elderly study conducted in Kuopio, Finland, during 2004–07. Five distinct patterns of accumulating medications emerged among the population of apparently healthy statin initiators during two years after statin initiation. Proper accounting for time-varying dependencies between adherence to statins and confounders using marginal structural models produced comparable estimation results with those from a discrete-time hazards model. Missing data mechanism was shown to be a key component when estimating the evolution of polypharmacy among older persons. In conclusion, population heterogeneity, missing data and causal relationships are important aspects in longitudinal studies that associate with the study question and should be critically assessed when performing statistical analyses. Analyses should be supplemented with sensitivity analyses towards model assumptions.