931 results for large data sets
Abstract:
Generalized linear mixed models (GLMMs) provide an elegant framework for the analysis of correlated data. Due to the non-closed form of the likelihood, GLMMs are often fit by computational procedures like penalized quasi-likelihood (PQL). Special cases of these models are generalized linear models (GLMs), which are often fit using algorithms like iterative weighted least squares (IWLS). High computational costs and memory space constraints often make it difficult to apply these iterative procedures to data sets with a very large number of cases. This paper proposes a computationally efficient strategy based on the Gauss-Seidel algorithm that iteratively fits sub-models of the GLMM to subsetted versions of the data. Additional gains in efficiency are achieved for Poisson models, commonly used in disease mapping problems, because of their special collapsibility property, which allows data reduction through summaries. Convergence of the proposed iterative procedure is guaranteed for canonical link functions. The strategy is applied to investigate the relationship between ischemic heart disease, socioeconomic status and age/gender category in New South Wales, Australia, based on outcome data consisting of approximately 33 million records. A simulation study demonstrates the algorithm's reliability in analyzing a data set with 12 million records for a (non-collapsible) logistic regression model.
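The collapsibility idea can be illustrated with a short sketch: for a Poisson log-linear model, summing events and exposure within covariate cells yields the same likelihood kernel as the record-level data, so a model for millions of records can be fit from a few thousand cell summaries. The sketch below is illustrative only (simulated data, hypothetical variable names such as area, age_sex and ses), not the paper's implementation.

```python
# Minimal sketch (not the paper's implementation): Poisson collapsibility lets a
# record-level log-linear model be fit from cell-level count summaries.
# Variable names (area, age_sex, ses, y) are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200_000                                   # stand-in for "millions of records"
records = pd.DataFrame({
    "area":    rng.integers(0, 50, n),        # small-area identifier
    "age_sex": rng.integers(0, 8, n),         # age/gender category
    "ses":     rng.integers(0, 5, n),         # socioeconomic status quintile
})
records["y"] = rng.poisson(0.02, n)           # event count per record

# Collapse: sum events and person-records within covariate cells.
cells = (records.groupby(["area", "age_sex", "ses"], as_index=False)
                .agg(events=("y", "sum"), exposure=("y", "size")))

# The cell-level Poisson GLM with a log(exposure) offset has the same likelihood
# kernel as the record-level model, but far fewer rows.
X = pd.get_dummies(cells[["age_sex", "ses"]].astype("category"),
                   drop_first=True).astype(float)
X = sm.add_constant(X)
fit = sm.GLM(cells["events"], X, family=sm.families.Poisson(),
             offset=np.log(cells["exposure"])).fit()
print(fit.params.head())
```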
Abstract:
We propose a novel class of models for functional data exhibiting skewness or other shape characteristics that vary with spatial or temporal location. We use copulas so that the marginal distributions and the dependence structure can be modeled independently. Dependence is modeled with a Gaussian or t-copula, so that there is an underlying latent Gaussian process. We model the marginal distributions using the skew t family. The mean, variance, and shape parameters are modeled nonparametrically as functions of location. A computationally tractable inferential framework for estimating heterogeneous asymmetric or heavy-tailed marginal distributions is introduced. This framework provides a new set of tools for increasingly complex data collected in medical and public health studies. Our methods were motivated by and are illustrated with a state-of-the-art study of neuronal tracts in multiple sclerosis patients and healthy controls. Using the tools we have developed, we were able to find those locations along the tract most affected by the disease. However, our methods are general and highly relevant to many functional data sets. In addition to the application to one-dimensional tract profiles illustrated here, higher-dimensional extensions of the methodology could have direct applications to other biological data including functional and structural MRI.
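A minimal sketch of the copula construction described here, under simplifying assumptions: each location's marginal is fit with a skewed parametric family, and the latent Gaussian dependence is estimated from the normal scores. Since SciPy has no skew-t distribution, the skew-normal is used purely as a stand-in for the skew-t family, and all names and dimensions are illustrative.

```python
# Minimal sketch of the copula idea (not the authors' estimator): model each
# location's marginal with a skewed family, then estimate the latent Gaussian
# dependence on the normal scores. skew-normal is a stand-in for skew-t.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_subjects, n_locations = 100, 40
# toy tract profiles: skewness drifting along the tract
loc_shape = np.linspace(-4, 4, n_locations)
Y = stats.skewnorm.rvs(a=loc_shape, size=(n_subjects, n_locations),
                       random_state=rng)

Z = np.empty_like(Y)
marginals = []
for j in range(n_locations):
    a, loc, scale = stats.skewnorm.fit(Y[:, j])      # location-specific marginal
    marginals.append((a, loc, scale))
    u = stats.skewnorm.cdf(Y[:, j], a, loc, scale)   # probability integral transform
    u = np.clip(u, 1e-6, 1 - 1e-6)
    Z[:, j] = stats.norm.ppf(u)                      # normal scores

# Correlation of the normal scores estimates the Gaussian-copula dependence
latent_corr = np.corrcoef(Z, rowvar=False)
print(latent_corr[0, :5].round(2))
```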
Abstract:
Functional neuroimaging techniques enable investigations into the neural basis of human cognition, emotions, and behaviors. In practice, applications of functional magnetic resonance imaging (fMRI) have provided novel insights into the neuropathophysiology of major psychiatric, neurological, and substance abuse disorders, as well as into the neural responses to their treatments. Modern activation studies often compare localized task-induced changes in brain activity between experimental groups. One may also extend voxel-level analyses by simultaneously considering the ensemble of voxels constituting an anatomically defined region of interest (ROI) or by considering means or quantiles of the ROI. In this work we present a Bayesian extension of voxel-level analyses that offers several notable benefits. First, it combines whole-brain voxel-by-voxel modeling and ROI analyses within a unified framework. Second, an unstructured variance/covariance for regional mean parameters allows for the study of inter-regional functional connectivity, provided enough subjects are available to allow for accurate estimation. Finally, an exchangeable correlation structure within regions allows for the consideration of intra-regional functional connectivity. We perform estimation for our model using Markov chain Monte Carlo (MCMC) techniques implemented via Gibbs sampling which, despite the high-throughput nature of the data, can be executed quickly (less than 30 minutes). We apply our Bayesian hierarchical model to two novel fMRI data sets: one considering inhibitory control in cocaine-dependent men and the second considering verbal memory in subjects at high risk for Alzheimer's disease. The unifying hierarchical model presented in this manuscript is shown to enhance the interpretability of these data sets.
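A toy Gibbs-sampler sketch of the kind of two-level normal model implied here (voxel effects nested within ROIs), with simplified conjugate priors; it is a stand-in for the paper's richer hierarchical model, and all priors, dimensions and variable names are assumptions for illustration.

```python
# Minimal Gibbs-sampler sketch for a two-level normal model of voxel effects
# nested in ROIs -- a toy stand-in for the paper's hierarchical model.
import numpy as np

rng = np.random.default_rng(2)
R, n_vox = 8, 200                       # regions, voxels per region
true_theta = rng.normal(0.5, 1.0, R)    # regional mean activations
y = true_theta[:, None] + rng.normal(0, 1.0, (R, n_vox))   # voxel contrasts

mu, tau2, sigma2 = 0.0, 1.0, 1.0        # initial values (tau2 held fixed for brevity)
a0, b0 = 2.0, 1.0                       # inverse-gamma prior on sigma2
draws = []
for it in range(2000):
    # region means: conjugate normal update
    prec = n_vox / sigma2 + 1.0 / tau2
    mean = (y.sum(axis=1) / sigma2 + mu / tau2) / prec
    theta = rng.normal(mean, np.sqrt(1.0 / prec))
    # grand mean: flat prior
    mu = rng.normal(theta.mean(), np.sqrt(tau2 / R))
    # voxel-level variance: inverse-gamma update
    ss = ((y - theta[:, None]) ** 2).sum()
    sigma2 = 1.0 / rng.gamma(a0 + y.size / 2, 1.0 / (b0 + ss / 2))
    if it >= 500:                       # discard burn-in
        draws.append(theta.copy())

post_mean = np.mean(draws, axis=0)
print(np.round(post_mean, 2), np.round(true_theta, 2))
```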
Abstract:
Quantifying the health effects associated with simultaneous exposure to many air pollutants is now a research priority of the US EPA. Bayesian hierarchical models (BHM) have been extensively used in multisite time series studies of air pollution and health to estimate health effects of a single pollutant adjusted for potential confounding of other pollutants and other time-varying factors. However, when the scientific goal is to estimate the impacts of many pollutants jointly, a straightforward application of BHM is challenged by the need to specify a random-effect distribution on a high-dimensional vector of nuisance parameters, which often do not have an easy interpretation. In this paper we introduce a new BHM formulation, which we call "reduced BHM", aimed at analyzing clustered data sets in the presence of a large number of random effects that are not of primary scientific interest. At the first stage of the reduced BHM, we calculate the integrated likelihood of the parameter of interest (e.g. excess number of deaths attributed to simultaneous exposure to high levels of many pollutants). At the second stage, we specify a flexible random-effect distribution directly on the parameter of interest. The reduced BHM overcomes many of the challenges in the specification and implementation of full BHM in the context of a large number of nuisance parameters. In simulation studies we show that the reduced BHM performs comparably to the full BHM in many scenarios, and even performs better in some cases. Methods are applied to estimate location-specific and overall relative risks of cardiovascular hospital admissions associated with simultaneous exposure to elevated levels of particulate matter and ozone in 51 US counties during the period 1999-2005.
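The two-stage logic of the reduced BHM can be sketched as follows: stage one reduces each location to an approximate likelihood summary (estimate and standard error) of the parameter of interest, and stage two pools these summaries with a random-effects model. The sketch below uses simulated data and a simple method-of-moments (DerSimonian-Laird) second stage as a stand-in for the flexible random-effect distribution described in the paper.

```python
# Minimal two-stage sketch in the spirit of the "reduced BHM": summarize each
# location by an approximate likelihood for the parameter of interest, then
# pool with a random-effects model. Data and names are illustrative.
import numpy as np

rng = np.random.default_rng(3)
n_counties = 51
true_overall, true_sd = 0.05, 0.03                 # log relative risks
beta = rng.normal(true_overall, true_sd, n_counties)
se = rng.uniform(0.02, 0.08, n_counties)           # county-specific std. errors
beta_hat = rng.normal(beta, se)                    # stage 1: per-county summaries
                                                   # (estimate, standard error)

# Stage 2: random-effects pooling across counties (DerSimonian-Laird moments)
w = 1.0 / se**2
fixed = np.sum(w * beta_hat) / np.sum(w)
Q = np.sum(w * (beta_hat - fixed) ** 2)
tau2 = max(0.0, (Q - (n_counties - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
w_star = 1.0 / (se**2 + tau2)
overall = np.sum(w_star * beta_hat) / np.sum(w_star)
overall_se = np.sqrt(1.0 / np.sum(w_star))
print(f"overall log-RR = {overall:.3f} (se {overall_se:.3f}), tau^2 = {tau2:.4f}")
```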
Abstract:
We present a state-of-the-art application of smoothing for dependent bivariate binomial spatial data to Loa loa prevalence mapping in West Africa. This application is special because it starts with the non-spatial calibration of survey instruments, continues with the spatial model building and assessment and ends with robust, tested software that will be used by the field scientists of the World Health Organization for online prevalence map updating. From a statistical perspective several important methodological issues were addressed: (a) building spatial models that are complex enough to capture the structure of the data but remain computationally usable; (b) reducing the computational burden in the handling of very large covariate data sets; (c) devising methods for comparing spatial prediction methods for a given exceedance policy threshold.
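Issue (c) can be illustrated with a short sketch of exceedance-probability mapping: given posterior samples of prevalence at each prediction location, report the probability that prevalence exceeds the policy threshold. The 20% threshold, the 0.7 flagging rule and the simulated samples below are illustrative assumptions, not the values used in the study.

```python
# Minimal sketch of exceedance-probability mapping for a policy threshold.
import numpy as np

rng = np.random.default_rng(4)
n_locations, n_samples = 1000, 500
# stand-in for posterior samples of prevalence from a spatial binomial model
logit_mean = rng.normal(-1.5, 0.8, n_locations)
samples = 1 / (1 + np.exp(-(logit_mean[:, None]
                            + rng.normal(0, 0.4, (n_locations, n_samples)))))

threshold = 0.20
exceed_prob = (samples > threshold).mean(axis=1)   # per-location exceedance probability
flagged = np.where(exceed_prob > 0.7)[0]           # locations flagged for precautionary policy
print(f"{flagged.size} of {n_locations} locations flagged")
```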
Abstract:
210Pb, 137Cs and 14C dated sediments of two late Holocene landslide lakes in the Provincial Park Lagunas de Yala (Laguna Rodeo, Laguna Comedero, 24°06′S, 65°30′W, 2100 m asl, northwestern Argentina) reveal a high-resolution multi-proxy data set of climate change and human impact for the past ca. 2000 years. Comparison of the lake sediment data set for the 20th century (sediment mass accumulation rates MARs, pollen spectra, nutrient and charcoal fluxes) with independent dendroecological data from the catchment (fire scars, tree growth) and long regional precipitation series (from 1934 onwards) show that (1) the lake sediment data set is internally highly consistent and compares well with independent data sets, (2) the chronology of the sediment is reliable, (3) large fires (1940s, 1983/1984–1989) as documented in the local fire scar frequency are recorded in the charcoal flux to the lake sediments and coincide with low wet-season precipitation rates (e.g., 1940s, 1983/1984) and/or high interannual precipitation variability (late 1940s), and (4) the regional increase in precipitation after 1970 is recorded in an increase in the MARs (L. Rodeo from 100 to 390 mg cm−2 yr−1) and in an increase in fern spores reflecting wet vegetation. The most significant change in MARs and nutrient fluxes (Corg and P) of the past 2000 years is observed with the transition from the Inca Empire to the Spanish Conquest around 1600 AD. Compared with the pre-17th century conditions, MARs increased by a factor of ca. 5 to >8 (to 800 +130, −280 mg cm−2 yr−1), PO4 fluxes increased by a factor of 7, and Corg fluxes by a factor of 10.5 for the time between 1640 and 1930 AD. 17th to 19th century MARs and nutrient fluxes also exceed 20th century values. Excess Pb deposition as indicated by a significant increase in Pb/Zr and Pb/Rb ratios in the sediments after the 1950s coincides with a rapid expansion of the regional mining industry. Excess Pb is interpreted as atmospheric deposition and direct human impact due to Pb smelting.
Abstract:
Electroencephalograms (EEG) are often contaminated with high amplitude artifacts limiting the usability of data. Methods that reduce these artifacts are often restricted to certain types of artifacts, require manual interaction or large training data sets. Within this paper we introduce a novel method, which is able to eliminate many different types of artifacts without manual intervention. The algorithm first decomposes the signal into different sub-band signals in order to isolate different types of artifacts into specific frequency bands. After signal decomposition with principal component analysis (PCA) an adaptive threshold is applied to eliminate components with high variance corresponding to the dominant artifact activity. Our results show that the algorithm is able to significantly reduce artifacts while preserving the EEG activity. Parameters for the algorithm do not have to be identified for every patient individually making the method a good candidate for preprocessing in automatic seizure detection and prediction algorithms.
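A minimal sketch of the band-wise PCA idea under stated assumptions (the band edges, the mean-plus-k-standard-deviations threshold rule, and the simulated data are all illustrative; this is not the authors' exact algorithm):

```python
# Sketch: split the EEG into sub-bands, run PCA across channels in each band,
# suppress high-variance components above an adaptive threshold, and recombine.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def clean_eeg(x, fs, bands=((1, 4), (4, 8), (8, 13), (13, 30), (30, 45)), k=2.0):
    """x: array (n_channels, n_samples); returns an artifact-reduced signal."""
    cleaned = np.zeros_like(x)
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        xb = sosfiltfilt(sos, x, axis=1)               # sub-band signal
        xb_c = xb - xb.mean(axis=1, keepdims=True)
        # PCA across channels via eigen-decomposition of the channel covariance
        cov = xb_c @ xb_c.T / xb_c.shape[1]
        evals, evecs = np.linalg.eigh(cov)
        scores = evecs.T @ xb_c                        # component time courses
        var = scores.var(axis=1)
        thresh = var.mean() + k * var.std()            # adaptive threshold
        keep = var <= thresh                           # drop dominant artifact components
        cleaned += evecs[:, keep] @ scores[keep]
    return cleaned

# toy usage: 8 channels, 10 s at 256 Hz, with a large slow artifact on one channel
fs = 256
rng = np.random.default_rng(5)
eeg = rng.normal(0, 1, (8, 10 * fs))
eeg[0] += 50 * np.sin(2 * np.pi * 2 * np.arange(10 * fs) / fs)
print(np.round(clean_eeg(eeg, fs).std(axis=1), 2))
```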
Abstract:
PURPOSE: To describe the implementation and use of an electronic patient-referral system as an aid to the efficient referral of patients to a remote and specialized treatment center. METHODS AND MATERIALS: A system for the exchange of radiotherapy data between different commercial planning systems and a specially developed planning system for proton therapy has been developed through the use of the PAPYRUS diagnostic image standard as an intermediate format. To ensure the cooperation of the different treatment planning system (TPS) manufacturers, the number of data sets defined for transfer has been restricted to the three core data sets of CT, VOIs, and three-dimensional dose distributions. As a complement to the exchange of data, network-wide application-sharing (video-conferencing) technologies have been adopted to provide methods for the interactive discussion and assessment of treatment plans with one or more partner clinics. RESULTS: Through the use of evaluation plans based on the exchanged data, referring clinics can accurately assess the advantages offered by proton therapy on a patient-by-patient basis, while the practicality or otherwise of the proposed treatments can simultaneously be assessed by the proton therapy center. Such a system, along with the interactive capabilities provided by video-conferencing methods, has been found to be an efficient solution to the problem of patient assessment and selection at a specialized treatment center, and is a necessary first step toward the full electronic integration of such centers with their remotely situated referral centers.
Abstract:
This dissertation has three separate parts: the first part deals with general pedigree association testing incorporating continuous covariates; the second part deals with association tests under population stratification using conditional likelihood tests; the third part deals with genome-wide association studies based on the real rheumatoid arthritis (RA) disease data sets from Genetic Analysis Workshop 16 (GAW16) Problem 1. Many statistical tests have been developed to test linkage and association using either case-control status or phenotype covariates for family data structures, separately. Such univariate analyses may not use all the information coming from the family members in practical studies. On the other hand, human complex diseases do not have a clear inheritance pattern; genes may interact or act independently. In Part I, the newly proposed approach, MPDT, focuses on using both the case-control information and the phenotype covariates. This approach can be applied to detect multiple marker effects. Based on two popular existing statistics in family studies, for case-control and quantitative traits respectively, the new approach can be used on simple family structures as well as general pedigrees. The combined statistic is calculated from the two statistics; a permutation procedure is applied to assess the p-value, with a Bonferroni adjustment for the multiple markers (a minimal sketch of this procedure appears after this abstract). We use simulation studies to evaluate the type I error rates and the power of the proposed approach. Our results show that the combined test using both case-control information and phenotype covariates not only has the correct type I error rates but also is more powerful than other existing methods. For multiple marker interactions, our proposed method is also very powerful. Selective genotyping is an economical strategy for detecting and mapping quantitative trait loci in the genetic dissection of complex disease. When the samples arise from different ethnic groups or an admixed population, all the existing selective genotyping methods may result in spurious association due to different ancestry distributions. The problem can be more serious when the sample size is large, a general requirement for obtaining sufficient power to detect modest genetic effects for most complex traits. In Part II, I describe a useful strategy for selective genotyping when population stratification is present. Our procedure uses a principal-component-based approach to eliminate any effect of population stratification. The paper evaluates the performance of our procedure using both simulated data from an earlier study and the HapMap data sets under a variety of population admixture models generated from empirical data. The rheumatoid arthritis data set of GAW16 Problem 1 contains one binary trait and two continuous traits: RA status, anti-CCP and IgM. To allow multiple traits, we propose a set of SNP-level F statistics, based on the concept of multiple correlation, to measure the genetic association between multiple trait values and SNP-specific genotypic scores, and we obtain their null distributions. We then perform six genome-wide association analyses using the novel one- and two-stage approaches, which are based on single, double and triple traits.
Combining all six analyses, we successfully validate the SNPs that have been identified in the literature as responsible for rheumatoid arthritis and detect additional disease susceptibility SNPs for follow-up studies in the future. Except for chromosomes 13 and 18, each of the other chromosomes is found to harbour susceptibility regions for rheumatoid arthritis or related diseases such as lupus erythematosus. This topic is discussed in Part III.
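A minimal sketch of the combined-statistic permutation procedure referenced in Part I, under strong simplifications: unrelated individuals stand in for pedigree data, and the combined statistic is simply the sum of two chi-square-type components for the case-control and quantitative traits. All sizes and names are illustrative.

```python
# Sketch: per-marker combined statistic, permutation p-values, Bonferroni adjustment.
import numpy as np

rng = np.random.default_rng(6)
n, n_markers = 500, 20
geno = rng.integers(0, 3, size=(n, n_markers)).astype(float)   # 0/1/2 allele counts
case = rng.integers(0, 2, n).astype(float)                     # case-control status
pheno = rng.normal(0, 1, n)                                     # quantitative covariate

def combined_stat(geno, case, pheno):
    g = (geno - geno.mean(0)) / geno.std(0)
    z_cc = (g * (case - case.mean())[:, None]).sum(0) / np.sqrt(n) / case.std()
    z_qt = (g * (pheno - pheno.mean())[:, None]).sum(0) / np.sqrt(n) / pheno.std()
    return z_cc**2 + z_qt**2          # per-marker combined statistic

obs = combined_stat(geno, case, pheno)
n_perm = 2000
exceed = np.zeros(n_markers)
for _ in range(n_perm):
    idx = rng.permutation(n)          # permute phenotypes jointly across individuals
    exceed += combined_stat(geno, case[idx], pheno[idx]) >= obs
p = (exceed + 1) / (n_perm + 1)
p_bonf = np.minimum(p * n_markers, 1.0)   # Bonferroni adjustment over markers
print(p_bonf.round(3))
```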
Abstract:
One of the original ocean-bottom time-lapse seismic studies was performed at the Teal South oil field in the Gulf of Mexico during the late 1990’s. This work reexamines some aspects of previous work using modern analysis techniques to provide improved quantitative interpretations. Using three-dimensional volume visualization of legacy data and the two phases of post-production time-lapse data, I provide additional insight into the fluid migration pathways and the pressure communication between different reservoirs, separated by faults. This work supports a conclusion from previous studies that production from one reservoir caused regional pressure decline that in turn resulted in liberation of gas from multiple surrounding unproduced reservoirs. I also provide an explanation for unusual time-lapse changes in amplitude-versus-offset (AVO) data related to the compaction of the producing reservoir which, in turn, changed an isotropic medium to an anisotropic medium. In the first part of this work, I examine regional changes in seismic response due to the production of oil and gas from one reservoir. The previous studies primarily used two post-production ocean-bottom surveys (Phase I and Phase II), and not the legacy streamer data, due to the unavailability of legacy prestack data and very different acquisition parameters. In order to incorporate the legacy data in the present study, all three poststack data sets were cross-equalized and examined using instantaneous amplitude and energy volumes. This approach appears quite effective and helps to suppress changes unrelated to production while emphasizing those large-amplitude changes that are related to production in this noisy (by current standards) suite of data. I examine the multiple data sets first by using the instantaneous amplitude and energy attributes, and then also examine specific apparent time-lapse changes through direct comparisons of seismic traces. In so doing, I identify time-delays that, when corrected for, indicate water encroachment at the base of the producing reservoir. I also identify specific sites of leakage from various unproduced reservoirs, the result of regional pressure blowdown as explained in previous studies; those earlier studies, however, were unable to identify direct evidence of fluid movement. Of particular interest is the identification of one site where oil apparently leaked from one reservoir into a “new” reservoir that did not originally contain oil, but was ideally suited as a trap for fluids leaking from the neighboring spill-point. With continued pressure drop, oil in the new reservoir increased as more oil entered into the reservoir and expanded, liberating gas from solution. Because of the limited volume available for oil and gas in that temporary trap, oil and gas also escaped from it into the surrounding formation. I also note that some of the reservoirs demonstrate time-lapse changes only in the “gas cap” and not in the oil zone, even though gas must be coming out of solution everywhere in the reservoir. This is explained by interplay between pore-fluid modulus reduction by gas saturation decrease and dry-frame modulus increase by frame stiffening. In the second part of this work, I examine various rock-physics models in an attempt to quantitatively account for frame-stiffening that results from reduced pore-fluid pressure in the producing reservoir, searching for a model that would predict the unusual AVO features observed in the time-lapse prestack and stacked data at Teal South. 
While several rock-physics models are successful at predicting the time-lapse response for initial production, most fail to match the observations for continued production between Phase I and Phase II. Because the reservoir was initially overpressured and unconsolidated, reservoir compaction was likely significant, and is probably accomplished largely by uniaxial strain in the vertical direction; this implies that an anisotropic model may be required. Using Walton’s model for anisotropic unconsolidated sand, I successfully model the time-lapse changes for all phases of production. This observation may be of interest for application to other unconsolidated overpressured reservoirs under production.
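The attribute computations used for the cross-survey comparisons in the first part of this work can be sketched briefly: instantaneous amplitude as the envelope of the analytic signal (Hilbert transform) and a windowed energy attribute. The traces, window length and the injected "time-lapse change" below are simulated for illustration.

```python
# Minimal sketch of instantaneous amplitude and energy attributes for
# comparing a legacy survey against a monitor survey.
import numpy as np
from scipy.signal import hilbert

def instantaneous_amplitude(traces):
    """traces: array (n_traces, n_samples); returns the analytic-signal envelope."""
    return np.abs(hilbert(traces, axis=1))

def energy(traces, win=11):
    """Running sum of squared amplitudes in a short window along each trace."""
    kernel = np.ones(win)
    sq = traces**2
    return np.apply_along_axis(lambda t: np.convolve(t, kernel, mode="same"), 1, sq)

# toy time-lapse comparison between a legacy and a monitor survey
rng = np.random.default_rng(7)
legacy = rng.normal(0, 1, (50, 500))
monitor = legacy + rng.normal(0, 0.1, legacy.shape)
monitor[:, 250:270] *= 1.8                      # stand-in for a production-related change

diff_amp = instantaneous_amplitude(monitor) - instantaneous_amplitude(legacy)
diff_en = energy(monitor) - energy(legacy)
print(int(np.abs(diff_amp).argmax(axis=1).mean()),
      int(np.abs(diff_en).argmax(axis=1).mean()))   # anomaly centred near sample ~260
```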
Abstract:
A combinatorial protocol (CP) is introduced here and interfaced with multiple linear regression (MLR) for variable selection. The efficiency of CP-MLR is primarily based on restricting the entry of correlated variables into the model development stage. It has been used for the analysis of the Selwood et al. data set [16], and the obtained models are compared with those reported from the GFA [8] and MUSEUM [9] approaches. For this data set CP-MLR could identify three highly independent models (27, 28 and 31) with Q2 values between 0.518 and 0.632. These models are also divergent and unique. Even though the present study does not share any models with the GFA [8] and MUSEUM [9] results, several descriptors are common to all these studies, including the present one. A simulation is also carried out on the same data set to explain the model formation in CP-MLR. The results demonstrate that the proposed method should be able to offer solutions for data sets with 50 to 60 descriptors in a reasonable time frame. By carefully selecting the inter-parameter correlation cutoff values in CP-MLR, one can identify divergent models and handle data sets larger than the present one without excessive computer time.
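A minimal sketch of the CP-MLR logic under illustrative assumptions (toy data, a 0.5 correlation cutoff and three-descriptor models): only descriptor subsets whose pairwise correlations stay below the cutoff reach the MLR stage, and the resulting models are ranked by leave-one-out Q2.

```python
# Sketch: correlation-filtered subset search + MLR, ranked by leave-one-out Q2.
import itertools
import numpy as np

rng = np.random.default_rng(8)
n, p = 30, 12                                  # compounds, descriptors (toy scale)
X = rng.normal(0, 1, (n, p))
y = X[:, 0] - 0.8 * X[:, 3] + rng.normal(0, 0.5, n)

def loo_q2(Xs, y):
    press, ss_tot = 0.0, ((y - y.mean()) ** 2).sum()
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        A = np.column_stack([np.ones(mask.sum()), Xs[mask]])
        coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        press += (y[i] - np.dot(np.r_[1.0, Xs[i]], coef)) ** 2
    return 1.0 - press / ss_tot

cutoff, size, models = 0.5, 3, []
corr = np.abs(np.corrcoef(X, rowvar=False))
for subset in itertools.combinations(range(p), size):
    sub_corr = corr[np.ix_(subset, subset)]
    if np.all(sub_corr[np.triu_indices(size, k=1)] < cutoff):   # correlation filter
        models.append((loo_q2(X[:, subset], y), subset))

models.sort(reverse=True)
print(models[:3])                              # top models by Q2
```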
Abstract:
Low self-esteem and depression are strongly correlated in cross-sectional studies, yet little is known about their prospective effects on each other. The vulnerability model hypothesizes that low self-esteem serves as a risk factor for depression, whereas the scar model hypothesizes that low self-esteem is an outcome, not a cause, of depression. To test these models, the authors used 2 large longitudinal data sets, each with 4 repeated assessments between the ages of 15 and 21 years and 18 and 21 years, respectively. Cross-lagged regression analyses indicated that low self-esteem predicted subsequent levels of depression, but depression did not predict subsequent levels of self-esteem. These findings held for both men and women and after controlling for content overlap between the self-esteem and depression scales. Thus, the results supported the vulnerability model, but not the scar model, of self-esteem and depression.
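A minimal sketch of a cross-lagged regression, reduced to two waves for brevity (the study used four); the simulated data are constructed to follow the vulnerability pattern, with earlier self-esteem predicting later depression but not the reverse.

```python
# Sketch: two-wave cross-lagged regressions on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 2000
se1 = rng.normal(0, 1, n)                            # self-esteem, wave 1
dep1 = -0.5 * se1 + rng.normal(0, 1, n)              # depression, wave 1
se2 = 0.6 * se1 + rng.normal(0, 1, n)                # self-esteem, wave 2
dep2 = 0.5 * dep1 - 0.3 * se1 + rng.normal(0, 1, n)  # vulnerability effect only

# depression at wave 2 on prior depression and prior self-esteem (cross-lag)
m_dep = sm.OLS(dep2, sm.add_constant(np.column_stack([dep1, se1]))).fit()
# self-esteem at wave 2 on prior self-esteem and prior depression (cross-lag)
m_se = sm.OLS(se2, sm.add_constant(np.column_stack([se1, dep1]))).fit()
print("SE1 -> DEP2:", round(m_dep.params[2], 2),
      "  DEP1 -> SE2:", round(m_se.params[2], 2))
```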
Abstract:
Substantial progress has been made in the past few years towards quantifying and understanding climate variability during past centuries. At the same time, present-day climate has been studied using state-of-the-art data sets and tools with respect to the physical and chemical mechanisms governing climate variability. Both the understanding of the past and the knowledge of the processes are important for assessing and attributing the anthropogenic effect on present and future climate. The most important time period in this context is the past approximately 100 years, which comprises large natural variations and extremes (such as long droughts) as well as anthropogenic influences, most pronounced in the past few decades. Recent and ongoing research efforts steadily improve the observational record of the 20th century, while atmospheric circulation models are used to underpin the mechanisms behind large climatic variations. Atmospheric chemistry and composition are important for understanding climate variability and change, and considerable progress has been made in this field in the past few years. The evolving integration of these research areas into a more comprehensive analysis of recent climate variability was reflected in the organisation of the workshop "Climate variability and extremes in the past 100 years", held in Gwatt near Thun (Switzerland), 24–26 July 2006. The aim of this workshop was to bring scientists working on data issues together with statistical climatologists, modellers, and atmospheric chemists to discuss gaps in our understanding of climate variability during the past approximately 100 years.
Abstract:
BACKGROUND: Mode of inheritance of equine recurrent airway obstruction (RAO) is unknown. HYPOTHESIS: Major genes are responsible for RAO. ANIMALS: Direct offspring of 2 RAO-affected Warmblood stallions (n = 197; n = 163) and a representative sample of Swiss Warmbloods (n = 401). METHODS: One environmental and 4 genetic models (general, mixed inheritance, major gene, and polygene) were tested for Horse Owner Assessed Respiratory Signs Index (1-4, unaffected to severely affected) by segregation analyses of the 2 half-sib sire families, both combined and separately, using prevalences estimated in a representative sample. RESULTS: In all data sets the mixed inheritance model was most likely to explain the pattern of inheritance. In all 3 datasets the mixed inheritance model did not differ significantly from the general model (P= .62, P= 1.00, and P= .27) but was always better than the major gene model (P < .01) and the polygene model (P < .01). The frequency of the deleterious allele differed considerably between the 2 sire families (P= .23 and P= .06). In both sire families the displacement was large (t= 17.52 and t= 12.24) and the heritability extremely large (h(2)= 1). CONCLUSIONS AND CLINICAL RELEVANCE: Segregation analyses clearly reveal the presence of a major gene playing a role in RAO. In 1 family, the mode of inheritance was autosomal dominant, whereas in the other family it was autosomal recessive. Although the expression of RAO is influenced by exposure to hay, these findings suggest a strong, complex genetic background for RAO.
Abstract:
BACKGROUND The coping resources questionnaire for back pain (FBR) uses 12 items to measure the perceived helpfulness of different coping resources (CRs: social-emotional support, practical help, knowledge, movement and relaxation, leisure and pleasure, spirituality, and cognitive strategies). The aim of the study was to evaluate the instrument in a clinical patient sample assessed in a primary care setting. SAMPLE AND METHODS The study was a secondary evaluation of empirical data from a large cohort study in general practices. The 58 participating primary care practices recruited patients who reported chronic back pain in the consultation. Besides the FBR and a pain sketch, the patients completed scales measuring depression, anxiety, resilience, sociodemographic factors and pain characteristics. To allow computation of retest parameters, the FBR was sent again to some of the original participants after 6 months (90% response rate). We calculated internal consistency and retest reliability coefficients as well as correlations between the FBR subscales and the depression, anxiety and resilience scores to assess validity. By means of a cluster analysis, groups with different resource profiles were formed. RESULTS For the study, 609 complete FBR baseline data sets could be used for statistical analysis. The internal consistency scores ranged from α=0.58 to α=0.78 and retest reliability scores were between rTT=0.41 and rTT=0.63. Correlations with depression, anxiety and resilience ranged from r=-0.38 to r=0.42. The cluster analysis resulted in four groups with relatively homogeneous intragroup profiles (high CRs, low spirituality, medium CRs, low CRs). The four groups differed significantly in anxiety and depression (the less favourable the resources, the higher the scores) as well as in resilience (the less favourable the resources, the lower the scores). The group with low CRs also reported permanent pain with no relief. The groups did not otherwise differ. CONCLUSIONS The FBR is an economical instrument that is suitable for practical use, e.g. in primary care practices, to identify strengths and deficits in the CRs of chronic pain patients that can then be addressed in face-to-face consultation. However, due to the rather low reliability, the use of subscales for profile differentiation and for follow-up measurement in individual diagnostics is limited.
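The reliability measures reported here (internal consistency and retest reliability) can be sketched in a few lines; the item counts, follow-up correlation and simulated responses below are illustrative, not the study's data.

```python
# Sketch: Cronbach's alpha for a subscale's items and a retest correlation.
import numpy as np

rng = np.random.default_rng(10)
n, n_items = 609, 3                                  # respondents, items per subscale
latent = rng.normal(0, 1, n)
items_t0 = latent[:, None] + rng.normal(0, 1.0, (n, n_items))   # baseline items
score_t0 = items_t0.sum(axis=1)
score_t6 = 0.6 * score_t0 + rng.normal(0, 2.0, n)    # 6-month follow-up score

def cronbach_alpha(items):
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

alpha = cronbach_alpha(items_t0)                     # internal consistency
r_tt = np.corrcoef(score_t0, score_t6)[0, 1]         # retest reliability
print(f"alpha = {alpha:.2f}, r_tt = {r_tt:.2f}")
```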