941 resultados para Biology, Biostatistics|Hydrology
Resumo:
The primary interest was in predicting the distribution runs in a sequence of Bernoulli trials. Difference equation techniques were used to express the number of runs of a given length k in n trials under three assumptions (1) no runs of length greater than k, (2) no runs of length less than k, (3) no other assumptions about the length of runs. Generating functions were utilized to obtain the distributions of the future number of runs, future number of minimum run lengths and future number of the maximum run lengths unconditional on the number of successes and failures in the Bernoulli sequence. When applying the model to Texas hydrology data, the model provided an adequate fit for the data in eight of the ten regions. Suggested health applications of this approach to run theory are provided. ^
Resumo:
Contexte. Les études cas-témoins sont très fréquemment utilisées par les épidémiologistes pour évaluer l’impact de certaines expositions sur une maladie particulière. Ces expositions peuvent être représentées par plusieurs variables dépendant du temps, et de nouvelles méthodes sont nécessaires pour estimer de manière précise leurs effets. En effet, la régression logistique qui est la méthode conventionnelle pour analyser les données cas-témoins ne tient pas directement compte des changements de valeurs des covariables au cours du temps. Par opposition, les méthodes d’analyse des données de survie telles que le modèle de Cox à risques instantanés proportionnels peuvent directement incorporer des covariables dépendant du temps représentant les histoires individuelles d’exposition. Cependant, cela nécessite de manipuler les ensembles de sujets à risque avec précaution à cause du sur-échantillonnage des cas, en comparaison avec les témoins, dans les études cas-témoins. Comme montré dans une étude de simulation précédente, la définition optimale des ensembles de sujets à risque pour l’analyse des données cas-témoins reste encore à être élucidée, et à être étudiée dans le cas des variables dépendant du temps. Objectif: L’objectif général est de proposer et d’étudier de nouvelles versions du modèle de Cox pour estimer l’impact d’expositions variant dans le temps dans les études cas-témoins, et de les appliquer à des données réelles cas-témoins sur le cancer du poumon et le tabac. Méthodes. J’ai identifié de nouvelles définitions d’ensemble de sujets à risque, potentiellement optimales (le Weighted Cox model and le Simple weighted Cox model), dans lesquelles différentes pondérations ont été affectées aux cas et aux témoins, afin de refléter les proportions de cas et de non cas dans la population source. Les propriétés des estimateurs des effets d’exposition ont été étudiées par simulation. Différents aspects d’exposition ont été générés (intensité, durée, valeur cumulée d’exposition). Les données cas-témoins générées ont été ensuite analysées avec différentes versions du modèle de Cox, incluant les définitions anciennes et nouvelles des ensembles de sujets à risque, ainsi qu’avec la régression logistique conventionnelle, à des fins de comparaison. Les différents modèles de régression ont ensuite été appliqués sur des données réelles cas-témoins sur le cancer du poumon. Les estimations des effets de différentes variables de tabac, obtenues avec les différentes méthodes, ont été comparées entre elles, et comparées aux résultats des simulations. Résultats. Les résultats des simulations montrent que les estimations des nouveaux modèles de Cox pondérés proposés, surtout celles du Weighted Cox model, sont bien moins biaisées que les estimations des modèles de Cox existants qui incluent ou excluent simplement les futurs cas de chaque ensemble de sujets à risque. De plus, les estimations du Weighted Cox model étaient légèrement, mais systématiquement, moins biaisées que celles de la régression logistique. L’application aux données réelles montre de plus grandes différences entre les estimations de la régression logistique et des modèles de Cox pondérés, pour quelques variables de tabac dépendant du temps. Conclusions. Les résultats suggèrent que le nouveau modèle de Cox pondéré propose pourrait être une alternative intéressante au modèle de régression logistique, pour estimer les effets d’expositions dépendant du temps dans les études cas-témoins
Resumo:
Les avancées en biotechnologie ont permis l’identification d’un grand nombre de mécanismes moléculaires, soulignant également la complexité de la régulation génique. Néanmoins, avoir une vision globale de l’homéostasie cellulaire, nous est pour l’instant inaccessible et nous ne sommes en mesure que d’en avoir qu’une vue fractionnée. Étant donné l’avancement des connaissances des dysfonctionnements moléculaires observés dans les maladies génétiques telles que la fibrose kystique, il est encore difficile de produire des thérapies efficaces. La fibrose kystique est causée par la mutation de gène CFTR (cystic fibrosis transmembrane conductance regulator), qui code pour un canal chlorique transmembranaire. La mutation la plus fréquente (ΔF508) induit un repliement incorrect de la protéine et sa rétention dans le réticulum endoplasmique. L’absence de CFTR fonctionnel à la membrane a un impact sur l’homéostasie ionique et sur l’hydratation de la muqueuse respiratoire. Ceci a pour conséquence un défaut dans la clairance mucocilliaire, induisant infection chronique et inflammation excessive, deux facteurs fondamentaux de la physiopathologie. L’inflammation joue un rôle très important dans l’évolution de la maladie et malgré le nombre important d’études sur le sujet, la régulation du processus inflammatoire est encore très mal comprise et la place qu’y occupe le CFTR n’est pas établie. Toutefois, plusieurs autres facteurs, tels que le stress oxydatif participent à la physiopathologie de la maladie, et considérer leurs impacts est important pour permettre une vision globale des acteurs impliqués. Dans notre étude, nous exploitons la technologie des puces à ADN, pour évaluer l’état transcriptionnel d’une cellule épithéliale pulmonaire humaine fibro-kystique. Dans un premier temps, l’analyse de notre expérience identifie 128 gènes inflammatoires sur-exprimés dans les cellules FK par rapport aux cellules non FK où apparaissent plusieurs familles de gènes inflammatoires comme les cytokines ou les calgranulines. L’analyse de la littérature et des annotations suggèrent que la modulation de ces transcripts dépend de la cascade de NF-κB et/ou des voies de signalisation associées aux interférons (IFN). En outre, leurs modulations pourraient être associées à des modifications épigénétiques de leurs loci chromosomiques. Dans un second temps, nous étudions l’activité transcriptionnelle d’une cellule épithéliale pulmonaire humaine FK en présence de DMNQ, une molécule cytotoxique. Notre but est d’identifier les processus biologiques perturbés par la mutation du gène CFTR en présence du stress oxydatif. Fondé sur une analyse canonique de redondance, nous identifions 60 gènes associés à la mort cellulaire et leur variance, observée dans notre expérience, s’explique par un effet conjoint de la mutation et du stress oxydatif. La mesure de l’activité des caspases 3/7, des effecteurs de l’apoptose (la mort cellulaire programmée), montre que les cellules porteuses de la mutation ΔF508, dans des conditions de stress oxydatif, seraient moins apoptotiques que les cellules saines. Nos données transcriptomiques suggèrent que la sous-activité de la cascade des MAPK et la sur-expression des gènes anti-apoptotiques pourraient être impliquées dans le déséquilibre de la balance apoptotique.
Resumo:
The developmental processes and functions of an organism are controlled by the genes and the proteins that are derived from these genes. The identification of key genes and the reconstruction of gene networks can provide a model to help us understand the regulatory mechanisms for the initiation and progression of biological processes or functional abnormalities (e.g. diseases) in living organisms. In this dissertation, I have developed statistical methods to identify the genes and transcription factors (TFs) involved in biological processes, constructed their regulatory networks, and also evaluated some existing association methods to find robust methods for coexpression analyses. Two kinds of data sets were used for this work: genotype data and gene expression microarray data. On the basis of these data sets, this dissertation has two major parts, together forming six chapters. The first part deals with developing association methods for rare variants using genotype data (chapter 4 and 5). The second part deals with developing and/or evaluating statistical methods to identify genes and TFs involved in biological processes, and construction of their regulatory networks using gene expression data (chapter 2, 3, and 6). For the first part, I have developed two methods to find the groupwise association of rare variants with given diseases or traits. The first method is based on kernel machine learning and can be applied to both quantitative as well as qualitative traits. Simulation results showed that the proposed method has improved power over the existing weighted sum method (WS) in most settings. The second method uses multiple phenotypes to select a few top significant genes. It then finds the association of each gene with each phenotype while controlling the population stratification by adjusting the data for ancestry using principal components. This method was applied to GAW 17 data and was able to find several disease risk genes. For the second part, I have worked on three problems. First problem involved evaluation of eight gene association methods. A very comprehensive comparison of these methods with further analysis clearly demonstrates the distinct and common performance of these eight gene association methods. For the second problem, an algorithm named the bottom-up graphical Gaussian model was developed to identify the TFs that regulate pathway genes and reconstruct their hierarchical regulatory networks. This algorithm has produced very significant results and it is the first report to produce such hierarchical networks for these pathways. The third problem dealt with developing another algorithm called the top-down graphical Gaussian model that identifies the network governed by a specific TF. The network produced by the algorithm is proven to be of very high accuracy.
Resumo:
I studied the apolipoprotein (apo) B 3$\sp\prime$ variable number tandem repeat (VNTR) and did computer simulations of the stepwise mutation model to address four questions: (1) How did the apo B VNTR originate? (2) What is the mutational mechanism of repeat number change at the apo B VNTR? (3) To what extent are population and molecular level events responsible for the determination of the contemporary apo B allele frequency distribution? (4) Can VNTR allele frequency distributions be explained by a simple and conservative mutation-drift model? I used three general approaches to address these questions: (1) I characterized the apo B VNTR region in non-human primate species; (2) I constructed haplotypes of polymorphic markers flanking the apo B VNTR in a sample of individuals from Lorrain, France and studied the associations between the flanking-marker haplotypes and apo B VNTR size; (3) I did computer simulations of the one-step stepwise mutation model and compared the results to real data in terms of four allele frequency distribution characteristics.^ The results of this work have allowed me to conclude that the apo B VNTR originated after an initial duplication of a sequence which is still present as a single copy sequence in New World monkey species. I conclude that this locus did not originate by the transposition of an array of repeats from somewhere else in the genome. It is unlikely that recombination is the primary mutational mechanism. Furthermore, the clustered nature of these associations implicates a stepwise mutational mechanism. From the high frequencies of certain haplotype-allele size combinations, it is evident that population level events have also been important in the determination of the apo B VNTR allele frequency distribution. Results from computer simulations of the one-step stepwise mutation model have allowed me to conclude that bimodal and multimodal allele frequency distributions are not unexpected at loci evolving via stepwise mutation mechanisms. Short tandem repeat loci fit the stepwise mutation model best, followed by microsatellite loci. I therefore conclude that there are differences in the mutational mechanisms of VNTR loci as classed by repeat unit size. (Abstract shortened by UMI.) ^
Resumo:
Variable number of tandem repeats (VNTR) are genetic loci at which short sequence motifs are found repeated different numbers of times among chromosomes. To explore the potential utility of VNTR loci in evolutionary studies, I have conducted a series of studies to address the following questions: (1) What are the population genetic properties of these loci? (2) What are the mutational mechanisms of repeat number change at these loci? (3) Can DNA profiles be used to measure the relatedness between a pair of individuals? (4) Can DNA fingerprint be used to measure the relatedness between populations in evolutionary studies? (5) Can microsatellite and short tandem repeat (STR) loci which mutate stepwisely be used in evolutionary analyses?^ A large number of VNTR loci typed in many populations were studied by means of statistical methods developed recently. The results of this work indicate that there is no significant departure from Hardy-Weinberg expectation (HWE) at VNTR loci in most of the human populations examined, and the departure from HWE in some VNTR loci are not solely caused by the presence of population sub-structure.^ A statistical procedure is developed to investigate the mutational mechanisms of VNTR loci by studying the allele frequency distributions of these loci. Comparisons of frequency distribution data on several hundreds VNTR loci with the predictions of two mutation models demonstrated that there are differences among VNTR loci grouped by repeat unit sizes.^ By extending the ITO method, I derived the distribution of the number of shared bands between individuals with any kinship relationship. A maximum likelihood estimation procedure is proposed to estimate the relatedness between individuals from the observed number of shared bands between them.^ It was believed that classical measures of genetic distance are not applicable to analysis of DNA fingerprints which reveal many minisatellite loci simultaneously in the genome, because the information regarding underlying alleles and loci is not available. I proposed a new measure of genetic distance based on band sharing between individuals that is applicable to DNA fingerprint data.^ To address the concern that microsatellite and STR loci may not be useful for evolutionary studies because of the convergent nature of their mutation mechanisms, by a theoretical study as well as by computer simulation, I conclude that the possible bias caused by the convergent mutations can be corrected, and a novel measure of genetic distance that makes the correction is suggested. In summary, I conclude that hypervariable VNTR loci are useful in evolutionary studies of closely related populations or species, especially in the study of human evolution and the history of geographic dispersal of Homo sapiens. (Abstract shortened by UMI.) ^
Resumo:
This study describes the patterns of occurrence of amyotrophic lateral sclerosis (ALS) and parkinsonism-dementia complex (PDC) of Guam during 1950-1989. Both ALS and PDC occur with high frequency among the indigenous Chamorro population, first recognized in the early 1950's. Reports in the early 1980's indicated that both ALS and PDC were disappearing, due to a purported reduction in exposure to harmful environmental factors as a result of the dramatic changes in lifestyle that took place after World War II. However, this study provides compelling evidence that ALS and PDC have not disappeared on Guam and that rates for both are higher during 1980-1989 than previously reported.^ The patterns of occurrence for both ALS and PDC overlap in most respects: (1) incidence and mortality are decreasing; (2) median age at onset is increasing; (3) males are at increased risk for developing disease; (4) risk is higher for those residing in the south compared to the non-south; and (5) age-specific incidence is decreasing over time except in the oldest age groups.^ Age-specific incidence of ALS and PDC, separately and together, is generally higher for cohorts born before 1920 than for those born after 1920. A significant birth cohort effect on the incidence of PDC for the 1906-1915 birth cohort was found, but not for ALS and for ALS and PDC together. Whether or not a cohort effect, period effect, or both are associated with incidence of ALS and PDC cannot be determined from the data currently available and will require additional follow-up of individuals born after 1920.^ The epidemiological data amassed over this 40-year period provide evidence that supports an environmental exposure model for disease occurrence as opposed to a simple genetic or infectious disease model. Whether neurodegenerative disease in this population occurs as a consequence of a single exposure or is explained by a multifactorial model such as a genetic predisposition with some environmental interaction is yet to be determined. However, descriptive studies such as this can provide clues concerning timing and location of potential adverse exposures but cannot determine etiology, underscoring the urgent need for analytic studies of ALS and PDC to further investigate existing etiologic hypotheses and to test new hypotheses. ^
Resumo:
The factorial validity of the SF-36 was evaluated using confirmatory factor analysis (CFA) methods, structural equation modeling (SEM), and multigroup structural equation modeling (MSEM). First, the measurement and structural model of the hypothesized SF-36 was explicated. Second, the model was tested for the validity of a second-order factorial structure, upon evidence of model misfit, determined the best-fitting model, and tested the validity of the best-fitting model on a second random sample from the same population. Third, the best-fitting model was tested for invariance of the factorial structure across race, age, and educational subgroups using MSEM.^ The findings support the second-order factorial structure of the SF-36 as proposed by Ware and Sherbourne (1992). However, the results suggest that: (a) Mental Health and Physical Health covary; (b) general mental health cross-loads onto Physical Health; (c) general health perception loads onto Mental Health instead of Physical Health; (d) many of the error terms are correlated; and (e) the physical function scale is not reliable across these two samples. This hierarchical factor pattern was replicated across both samples of health care workers, suggesting that the post hoc model fitting was not data specific. Subgroup analysis suggests that the physical function scale is not reliable across the "age" or "education" subgroups and that the general mental health scale path from Mental Health is not reliable across the "white/nonwhite" or "education" subgroups.^ The importance of this study is in the use of SEM and MSEM in evaluating sample data from the use of the SF-36. These methods are uniquely suited to the analysis of latent variable structures and are widely used in other fields. The use of latent variable models for self reported outcome measures has become widespread, and should now be applied to medical outcomes research. Invariance testing is superior to mean scores or summary scores when evaluating differences between groups. From a practical, as well as, psychometric perspective, it seems imperative that construct validity research related to the SF-36 establish whether this same hierarchical structure and invariance holds for other populations.^ This project is presented as three articles to be submitted for publication. ^
Resumo:
Nuclear morphometry (NM) uses image analysis to measure features of the cell nucleus which are classified as: bulk properties, shape or form, and DNA distribution. Studies have used these measurements as diagnostic and prognostic indicators of disease with inconclusive results. The distributional properties of these variables have not been systematically investigated although much of the medical data exhibit nonnormal distributions. Measurements are done on several hundred cells per patient so summary measurements reflecting the underlying distribution are needed.^ Distributional characteristics of 34 NM variables from prostate cancer cells were investigated using graphical and analytical techniques. Cells per sample ranged from 52 to 458. A small sample of patients with benign prostatic hyperplasia (BPH), representing non-cancer cells, was used for general comparison with the cancer cells.^ Data transformations such as log, square root and 1/x did not yield normality as measured by the Shapiro-Wilks test for normality. A modulus transformation, used for distributions having abnormal kurtosis values, also did not produce normality.^ Kernel density histograms of the 34 variables exhibited non-normality and 18 variables also exhibited bimodality. A bimodality coefficient was calculated and 3 variables: DNA concentration, shape and elongation, showed the strongest evidence of bimodality and were studied further.^ Two analytical approaches were used to obtain a summary measure for each variable for each patient: cluster analysis to determine significant clusters and a mixture model analysis using a two component model having a Gaussian distribution with equal variances. The mixture component parameters were used to bootstrap the log likelihood ratio to determine the significant number of components, 1 or 2. These summary measures were used as predictors of disease severity in several proportional odds logistic regression models. The disease severity scale had 5 levels and was constructed of 3 components: extracapsulary penetration (ECP), lymph node involvement (LN+) and seminal vesicle involvement (SV+) which represent surrogate measures of prognosis. The summary measures were not strong predictors of disease severity. There was some indication from the mixture model results that there were changes in mean levels and proportions of the components in the lower severity levels. ^
Resumo:
This paper reports a comparison of three modeling strategies for the analysis of hospital mortality in a sample of general medicine inpatients in a Department of Veterans Affairs medical center. Logistic regression, a Markov chain model, and longitudinal logistic regression were evaluated on predictive performance as measured by the c-index and on accuracy of expected numbers of deaths compared to observed. The logistic regression used patient information collected at admission; the Markov model was comprised of two absorbing states for discharge and death and three transient states reflecting increasing severity of illness as measured by laboratory data collected during the hospital stay; longitudinal regression employed Generalized Estimating Equations (GEE) to model covariance structure for the repeated binary outcome. Results showed that the logistic regression predicted hospital mortality as well as the alternative methods but was limited in scope of application. The Markov chain provides insights into how day to day changes of illness severity lead to discharge or death. The longitudinal logistic regression showed that increasing illness trajectory is associated with hospital mortality. The conclusion is reached that for standard applications in modeling hospital mortality, logistic regression is adequate, but for new challenges facing health services research today, alternative methods are equally predictive, practical, and can provide new insights. ^
Resumo:
Health care providers face the problem of trying to make decisions with inadequate information and also with an overload of (often contradictory) information. Physicians often choose treatment long before they know which disease is present. Indeed, uncertainty is intrinsic to the practice of medicine. Decision analysis can help physicians structure and work through a medical decision problem, and can provide reassurance that decisions are rational and consistent with the beliefs and preferences of other physicians and patients. ^ The primary purpose of this research project is to develop the theory, methods, techniques and tools necessary for designing and implementing a system to support solving medical decision problems. A case study involving “abdominal pain” serves as a prototype for implementing the system. The research, however, focuses on a generic class of problems and aims at covering theoretical as well as practical aspects of the system developed. ^ The main contributions of this research are: (1) bridging the gap between the statistical approach and the knowledge-based (expert) approach to medical decision making; (2) linking a collection of methods, techniques and tools together to allow for the design of a medical decision support system, based on a framework that involves the Analytic Network Process (ANP), the generalization of the Analytic Hierarchy Process (AHP) to dependence and feedback, for problems involving diagnosis and treatment; (3) enhancing the representation and manipulation of uncertainty in the ANP framework by incorporating group consensus weights; and (4) developing a computer program to assist in the implementation of the system. ^
Resumo:
Many studies in biostatistics deal with binary data. Some of these studies involve correlated observations, which can complicate the analysis of the resulting data. Studies of this kind typically arise when a high degree of commonality exists between test subjects. If there exists a natural hierarchy in the data, multilevel analysis is an appropriate tool for the analysis. Two examples are the measurements on identical twins, or the study of symmetrical organs or appendages such as in the case of ophthalmic studies. Although this type of matching appears ideal for the purposes of comparison, analysis of the resulting data while ignoring the effect of intra-cluster correlation has been shown to produce biased results.^ This paper will explore the use of multilevel modeling of simulated binary data with predetermined levels of correlation. Data will be generated using the Beta-Binomial method with varying degrees of correlation between the lower level observations. The data will be analyzed using the multilevel software package MlwiN (Woodhouse, et al, 1995). Comparisons between the specified intra-cluster correlation of these data and the estimated correlations, using multilevel analysis, will be used to examine the accuracy of this technique in analyzing this type of data. ^
Resumo:
A non-parametric method was developed and tested to compare the partial areas under two correlated Receiver Operating Characteristic curves. Based on the theory of generalized U-statistics the mathematical formulas have been derived for computing ROC area, and the variance and covariance between the portions of two ROC curves. A practical SAS application also has been developed to facilitate the calculations. The accuracy of the non-parametric method was evaluated by comparing it to other methods. By applying our method to the data from a published ROC analysis of CT image, our results are very close to theirs. A hypothetical example was used to demonstrate the effects of two crossed ROC curves. The two ROC areas are the same. However each portion of the area between two ROC curves were found to be significantly different by the partial ROC curve analysis. For computation of ROC curves with large scales, such as a logistic regression model, we applied our method to the breast cancer study with Medicare claims data. It yielded the same ROC area computation as the SAS Logistic procedure. Our method also provides an alternative to the global summary of ROC area comparison by directly comparing the true-positive rates for two regression models and by determining the range of false-positive values where the models differ. ^
Resumo:
The use of group-randomized trials is particularly widespread in the evaluation of health care, educational, and screening strategies. Group-randomized trials represent a subset of a larger class of designs often labeled nested, hierarchical, or multilevel and are characterized by the randomization of intact social units or groups, rather than individuals. The application of random effects models to group-randomized trials requires the specification of fixed and random components of the model. The underlying assumption is usually that these random components are normally distributed. This research is intended to determine if the Type I error rate and power are affected when the assumption of normality for the random component representing the group effect is violated. ^ In this study, simulated data are used to examine the Type I error rate, power, bias and mean squared error of the estimates of the fixed effect and the observed intraclass correlation coefficient (ICC) when the random component representing the group effect possess distributions with non-normal characteristics, such as heavy tails or severe skewness. The simulated data are generated with various characteristics (e.g. number of schools per condition, number of students per school, and several within school ICCs) observed in most small, school-based, group-randomized trials. The analysis is carried out using SAS PROC MIXED, Version 6.12, with random effects specified in a random statement and restricted maximum likelihood (REML) estimation specified. The results from the non-normally distributed data are compared to the results obtained from the analysis of data with similar design characteristics but normally distributed random effects. ^ The results suggest that the violation of the normality assumption for the group component by a skewed or heavy-tailed distribution does not appear to influence the estimation of the fixed effect, Type I error, and power. Negative biases were detected when estimating the sample ICC and dramatically increased in magnitude as the true ICC increased. These biases were not as pronounced when the true ICC was within the range observed in most group-randomized trials (i.e. 0.00 to 0.05). The normally distributed group effect also resulted in bias ICC estimates when the true ICC was greater than 0.05. However, this may be a result of higher correlation within the data. ^