963 results for Link variable method
Abstract:
BACKGROUND Record linkage of existing individual health care data is an efficient way to answer important epidemiological research questions. Reuse of individual health-related data faces several problems: either a unique personal identifier, such as a social security number, is not available, or non-unique person-identifiable information, such as names, is privacy protected and cannot be accessed. A solution to protect privacy in probabilistic record linkage is to encrypt this sensitive information. Unfortunately, encrypted hash codes of two names differ completely if the plain names differ by only a single character, so standard encryption methods cannot be applied. To overcome these challenges, we developed the Privacy Preserving Probabilistic Record Linkage (P3RL) method. METHODS In this Privacy Preserving Probabilistic Record Linkage method we apply a three-party protocol, with two sites collecting individual data and an independent trusted linkage center as the third partner. Our method consists of three main steps: pre-processing, encryption and probabilistic record linkage. Data pre-processing and encryption are done at the sites by local personnel. To guarantee similar quality and format of variables and an identical encryption procedure at each site, the linkage center generates semi-automated pre-processing and encryption templates. To retrieve the information (i.e. data structure) needed to create these templates without ever accessing plain person-identifiable information, we introduced a novel method of data masking. Sensitive string variables are encrypted using Bloom filters, which enables the calculation of similarity coefficients. For date variables, we developed special encryption procedures to handle the most common date errors. The linkage center performs probabilistic record linkage with encrypted person-identifiable information and plain non-sensitive variables. RESULTS In this paper we describe step by step how to link existing health-related data using encryption methods to preserve the privacy of persons in the study. CONCLUSION Privacy Preserving Probabilistic Record Linkage expands record linkage facilities in settings where a unique identifier is unavailable and/or regulations restrict access to the non-unique person-identifiable information needed to link existing health-related data sets. Automated pre-processing and encryption fully protect sensitive information, ensuring participant confidentiality. This method is suitable not just for epidemiological research but also for any setting with similar challenges.
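The Bloom-filter step can be illustrated with a short sketch: character bigrams of a name are hashed into a bit array with a keyed hash, and the Dice coefficient of two filters approximates name similarity. This is a generic illustration of the technique, not the P3RL implementation; the filter length, number of hash functions and HMAC scheme below are assumptions.

```python
import hashlib
import hmac

FILTER_LEN = 1000   # bit-array length (assumed)
NUM_HASHES = 15     # hash functions per bigram (assumed)
SECRET_KEY = b"shared-site-secret"  # keyed hashing so only the sites can encode (assumed)

def bigrams(name: str):
    """Split a padded, lower-cased name into character bigrams."""
    padded = f"_{name.strip().lower()}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def bloom_encode(name: str) -> set:
    """Encode a name as the set of bit positions set in its Bloom filter."""
    bits = set()
    for gram in bigrams(name):
        for k in range(NUM_HASHES):
            digest = hmac.new(SECRET_KEY, f"{k}:{gram}".encode(), hashlib.sha256).hexdigest()
            bits.add(int(digest, 16) % FILTER_LEN)
    return bits

def dice_similarity(a: set, b: set) -> float:
    """Dice coefficient of two Bloom filters (sets of set bits)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

# Small typos still yield a high similarity, unlike ordinary hashes of the full name:
print(dice_similarity(bloom_encode("Meier"), bloom_encode("Meyer")))   # high
print(dice_similarity(bloom_encode("Meier"), bloom_encode("Schmidt"))) # low
```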
Abstract:
Numerous studies have reported a strong link between working memory capacity (WMC) and fluid intelligence (Gf), although views differ with respect to how closely these two constructs are related. In the present study, we used a WMC task with five levels of task demands to assess the relationship between WMC and Gf by means of a new methodological approach referred to as fixed-links modeling. Fixed-links models belong to the family of confirmatory factor analysis (CFA) and are of particular interest for experimental, repeated-measures designs. With this technique, processes systematically varying across task conditions can be disentangled from processes unaffected by the experimental manipulation. Proceeding from the assumption that experimental manipulation in a WMC task leads to increasing demands on WMC, the processes systematically varying across task conditions can be assumed to be WMC-specific. Processes not varying across task conditions, on the other hand, are probably independent of WMC. Fixed-links models allow these two kinds of processes to be represented by two independent latent variables. In contrast to traditional CFA, where a common latent variable is derived from the different task conditions, fixed-links models facilitate a more precise, or purified, representation of the WMC-related processes of interest. By using fixed-links modeling to analyze data from 200 participants, we identified a non-experimental latent variable, representing processes that remained constant irrespective of the WMC task conditions, and an experimental latent variable reflecting processes that varied as a function of the experimental manipulation. This latter variable represents the increasing demands on WMC and, hence, was considered a purified measure of WMC controlled for the constant processes. Fixed-links modeling showed that both the purified measure of WMC (β = .48) and the constant processes involved in the task (β = .45) were related to Gf. Taken together, these two latent variables explained the same portion of variance in Gf as a single latent variable obtained by traditional CFA (β = .65), indicating that traditional CFA overestimates the effective relationship between WMC and Gf. Thus, fixed-links modeling provides a feasible method for a more valid investigation of the functional relationship between specific constructs.
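The fixed-links idea can be stated compactly: the five task conditions load on a constant latent variable with loadings fixed to 1 and on an experimental latent variable with loadings fixed to increasing values, so that only the latter tracks the manipulation. The concrete loading values in this sketch are illustrative assumptions, not those of the study.

```latex
% Illustrative fixed-links specification for five task conditions X_1,...,X_5;
% loading values are assumptions, not those reported in the study.
X_j = 1 \cdot \eta_{\mathrm{const}} + \lambda_j\,\eta_{\mathrm{exp}} + \varepsilon_j,
\qquad \lambda_j \ \text{fixed (e.g. } \lambda_j = j\text{)}, \quad j = 1,\dots,5,
\qquad \operatorname{Cov}(\eta_{\mathrm{const}},\eta_{\mathrm{exp}}) = 0,
\\[4pt]
\mathrm{Gf} = \beta_1\,\eta_{\mathrm{const}} + \beta_2\,\eta_{\mathrm{exp}} + \zeta .
```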
Abstract:
Random Forests™ is reported to be one of the most accurate classification algorithms for complex data analysis. It shows excellent performance even when most predictors are noisy and the number of variables is much larger than the number of observations. In this thesis, Random Forests was applied to a large-scale lung cancer case-control study, a novel way of automatically selecting prognostic factors was proposed, and a synthetic positive control was used to validate the Random Forests method. Throughout this study we showed that Random Forests can handle a large number of weak input variables without overfitting and can account for non-additive interactions between these input variables. Random Forests can also be used for variable selection without being adversely affected by collinearities. Random Forests can deal with large-scale data sets without rigorous data pre-processing and has a robust variable-importance ranking measure. We propose a novel variable selection method, in the context of Random Forests, that uses the data noise level as the cut-off value to determine the subset of important predictors. This new approach enhances the ability of the Random Forests algorithm to automatically identify important predictors in complex data. The cut-off value can also be adjusted based on the results of the synthetic positive control experiments. When the data set had a high variables-to-observations ratio, Random Forests complemented the established logistic regression. This study suggests that Random Forests is recommended for such high-dimensional data: one can use Random Forests to select the important variables and then use logistic regression, or Random Forests itself, to estimate the effect sizes of the predictors and to classify new observations. We also found that the mean decrease in accuracy is a more reliable variable-ranking measure than the mean decrease in Gini.
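The noise-level cut-off described above can be sketched generically: append a purely random predictor, fit a Random Forest, and keep only predictors whose permutation importance (mean decrease in accuracy) exceeds that of the noise column. The simulated data, the single noise column and the use of scikit-learn are assumptions, not the thesis code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=40, n_informative=5, random_state=0)

# Append a synthetic noise feature; its importance estimates the "data noise level".
X_aug = np.hstack([X, rng.normal(size=(X.shape[0], 1))])

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_aug, y)
imp = permutation_importance(forest, X_aug, y, n_repeats=20, random_state=0)

noise_level = imp.importances_mean[-1]           # importance of the pure-noise column
selected = np.where(imp.importances_mean[:-1] > noise_level)[0]
print("predictors retained above the noise cut-off:", selected)
```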
Abstract:
In population studies, most current methods focus on identifying one outcome-related SNP at a time by testing for differences in genotype frequencies between disease and healthy groups or among different population groups. However, testing a great number of SNPs simultaneously raises the problem of multiple testing and gives false-positive results. Although this problem can be effectively dealt with through approaches such as Bonferroni correction, permutation testing and false discovery rates, patterns of joint effects of several genes, each with a weak effect, may still go undetected. With the availability of high-throughput genotyping technology, searching for multiple scattered SNPs over the whole genome and modeling their joint effect on the target variable has become possible. Exhaustive search of all SNP subsets is computationally infeasible for millions of SNPs in a genome-wide study. Several effective feature selection methods combined with classification functions have been proposed to search for an optimal SNP subset in large data sets where the number of feature SNPs far exceeds the number of observations. In this study, we took two steps to achieve this goal. First we selected 1000 SNPs through an effective filter method, and then we performed a feature selection wrapped around a classifier to identify an optimal SNP subset for predicting disease. We also developed a novel classification method, the sequential information bottleneck method, wrapped inside different search algorithms to identify an optimal subset of SNPs for classifying the outcome variable. This new method was compared with classical linear discriminant analysis in terms of classification performance. Finally, we performed chi-square tests to examine the relationship between each SNP and disease from another point of view. In general, our results show that filtering features using the harmonic mean of sensitivity and specificity (HMSS) through linear discriminant analysis (LDA) is better than using LDA training accuracy or mutual information in our study. Our results also demonstrate that exhaustive search of small subsets (one SNP, two SNPs, or three-SNP subsets built from the best 100 composite 2-SNP combinations) can find an optimal subset, and that further inclusion of more SNPs through a heuristic algorithm does not always increase the performance of the SNP subsets. Although sequential forward floating selection can be applied to prevent the nesting effect of forward selection, it does not always outperform the latter, owing to overfitting from observing more complex subset states. Our results also indicate that HMSS, as a criterion for evaluating the classification ability of a function, can be used on imbalanced data without modifying the original dataset, in contrast to classification accuracy. Our four studies suggest that the sequential information bottleneck (sIB), a new unsupervised technique, can be adopted to predict the outcome, and its ability to detect the target status is superior to traditional LDA in this study. From our results, the best test probability-HMSS for predicting CVD, stroke, CAD and psoriasis through sIB is 0.59406, 0.641815, 0.645315 and 0.678658, respectively. In terms of group prediction accuracy, the highest test accuracy of sIB for diagnosing a normal status among controls reaches 0.708999, 0.863216, 0.639918 and 0.850275, respectively, in the four studies if the test accuracy among cases is required to be at least 0.4. On the other hand, the highest test accuracy of sIB for diagnosing disease among cases reaches 0.748644, 0.789916, 0.705701 and 0.749436, respectively, in the four studies if the test accuracy among controls is required to be at least 0.4. A further genome-wide association study through chi-square tests shows that no significant SNPs are detected at the cut-off level 9.09451E-08 in the Framingham Heart Study of CVD. The WTCCC study results detect only two significant SNPs associated with CAD. In the genome-wide study of psoriasis, most of the top 20 SNP markers with impressive classification accuracy are also significantly associated with the disease through the chi-square test at the cut-off value 1.11E-07. Although our classification methods can achieve high accuracy in this study, complete descriptions of those classification results (95% confidence intervals or statistical tests of differences) require more cost-effective methods or a more efficient computing system, neither of which could be accomplished in our genome-wide study. We should also note that the purpose of this study is to identify subsets of SNPs with high prediction ability, and SNPs with good discriminant power are not necessarily causal markers for the disease.
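The harmonic mean of sensitivity and specificity (HMSS) used above as a filtering and evaluation criterion is straightforward to compute; below is a minimal sketch (the function name and the use of scikit-learn are assumptions).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def hmss(y_true, y_pred) -> float:
    """Harmonic mean of sensitivity and specificity for a binary outcome."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    if sensitivity + specificity == 0:
        return 0.0
    return 2 * sensitivity * specificity / (sensitivity + specificity)

# Unlike plain accuracy, HMSS is not inflated by the majority class in imbalanced data:
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)        # classifier that always predicts "control"
print(hmss(y_true, y_pred))              # 0.0, although accuracy would be 0.9
```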
Abstract:
Ordinal outcomes are frequently employed in diagnosis and clinical trials. Clinical trials of Alzheimer's disease (AD) treatments are a case in point, using mild, moderate or severe disease status as the outcome measure. As in many other outcome-oriented studies, the disease status may be misclassified. This study estimates the extent of misclassification in an ordinal outcome such as disease status, as well as the extent of misclassification in a predictor variable such as genotype status. An ordinal logistic regression model is commonly used to model the relationship between disease status, the effect of treatment, and other predictive factors. A simulation study was performed. First, data were created based on a set of hypothetical parameters and hypothetical rates of misclassification. Next, the maximum likelihood method was employed to generate likelihood equations accounting for misclassification. The Nelder-Mead simplex method was used to solve for the misclassification and model parameters. Finally, this method was applied to an AD dataset to detect the amount of misclassification present. The estimates of the ordinal regression model parameters were close to the hypothetical parameters: β1 was hypothesized at 0.50 and its mean estimate was 0.488; β2 was hypothesized at 0.04 and the mean of its estimates was 0.04. Although the estimates for the rates of misclassification of X1 were not as close as those of β1 and β2, they validate this method. The 0-1 misclassification of X1 was hypothesized as 2.98% and the mean of the simulated estimates was 1.54%; in the best case, the misclassification of k from high to medium was hypothesized at 4.87% and had a sample mean of 3.62%. In the AD dataset, the estimated odds ratio for X1 (having both copies of the APOE 4 allele) changed from 1.377 to 1.418, demonstrating that the estimate of the odds ratio changes when the analysis adjusts for misclassification.
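A compact sketch of the estimation idea described above: the likelihood of the observed ordinal categories mixes proportional-odds probabilities with an assumed outcome-misclassification matrix, and the negative log-likelihood is minimized with the Nelder-Mead simplex in SciPy. The three-category setup, parameter values and misclassification matrix below are illustrative assumptions, not the study's.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

# Illustrative data: one covariate, true ordinal outcome with 3 levels, then misclassified.
rng = np.random.default_rng(1)
n, beta_true, cuts_true = 2000, 0.5, np.array([-0.5, 1.0])
x = rng.normal(size=n)
p_le = expit(cuts_true[:, None] - beta_true * x)              # P(Y <= k | x), shape (2, n)
probs = np.vstack([p_le[0], p_le[1] - p_le[0], 1 - p_le[1]])  # category probabilities
y_true = np.array([rng.choice(3, p=probs[:, i]) for i in range(n)])
M = np.array([[0.95, 0.04, 0.00],                             # assumed misclassification matrix
              [0.05, 0.92, 0.05],                             # M[j, k] = P(observe j | true k)
              [0.00, 0.04, 0.95]])
y_obs = np.array([rng.choice(3, p=M[:, k]) for k in y_true])

def negloglik(theta):
    """Negative log-likelihood of the observed categories under misclassification."""
    beta, a1, log_gap = theta
    cuts = np.array([a1, a1 + np.exp(log_gap)])               # keeps the cut points ordered
    p_le = expit(cuts[:, None] - beta * x)
    p_true = np.vstack([p_le[0], p_le[1] - p_le[0], 1 - p_le[1]])
    p_obs = M @ p_true                                        # mix with misclassification
    return -np.sum(np.log(p_obs[y_obs, np.arange(n)] + 1e-12))

fit = minimize(negloglik, x0=[0.0, 0.0, 0.0], method="Nelder-Mead")
print("estimated beta:", fit.x[0])
```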
Abstract:
Studies on the relationship between psychosocial determinants and HIV risk behaviors have produced little evidence to support hypotheses based on theoretical relationships. One limitation inherent in many articles in the literature is the method of measuring the determinants and the analytic approach selected. To reduce the misclassification associated with unit scaling of measures specific to internalized homonegativity, I evaluated the psychometric properties of the Reactions to Homosexuality scale in a confirmatory factor analytic framework. In addition, I assessed the measurement invariance of the scale across racial/ethnic classifications in a sample of men who have sex with men. The resulting measure contained eight items loading on three first-order factors. Invariance assessment identified metric and partial strong invariance between racial/ethnic groups in the sample. Application of the updated measure to a structural model allowed for the exploration of direct and indirect effects of internalized homonegativity on unprotected anal intercourse. Pathways identified in the model show that drug and alcohol use at the last sexual encounter, the number of sexual partners in the previous three months, and sexual compulsivity all contribute directly to risk behavior. Internalized homonegativity reduced the likelihood of exposure to drugs, alcohol or higher numbers of partners. For men who developed compulsive sexual behavior as a coping strategy for internalized homonegativity, there was an increase in the prevalence odds of risk behavior. In the final stage of the analysis, I conducted a latent profile analysis of the items in the updated Reactions to Homosexuality scale. This analysis identified five distinct profiles, suggesting that the construct is not homogeneous in samples of men who have sex with men. Lack of prior consideration of these distinct manifestations of internalized homonegativity may have contributed to the analytic difficulty in identifying a relationship between the trait and high-risk sexual practices.
Abstract:
Interaction effects are of scientific interest in many areas of research. A common approach for investigating the interaction effect of two continuous covariates on a response variable is a cross-product term in multiple linear regression. In epidemiological studies, the two-way analysis of variance (ANOVA) type of method has also been utilized to examine the interaction effect by replacing the continuous covariates with their discretized levels. However, the implications of the model assumptions of either approach have not been examined, and the statistical validation has focused only on the general method, not specifically on the interaction effect. In this dissertation, we investigated the validity of both approaches based on their mathematical assumptions for non-skewed data. We showed that linear regression may not be an appropriate model when the interaction effect exists, because it implies a highly skewed distribution for the response variable. We also showed that the normality and constant variance assumptions required by ANOVA are not satisfied in the model where the continuous covariates are replaced with their discretized levels. Therefore, naïve application of the ANOVA method may lead to an incorrect conclusion. Given the problems identified above, we proposed a novel method, modified from the traditional ANOVA approach, to rigorously evaluate the interaction effect. The analytical expression of the interaction effect was derived based on the conditional distribution of the response variable given the discretized continuous covariates. A testing procedure that combines the p-values from each level of the discretized covariates was developed to test the overall significance of the interaction effect. According to the simulation study, the proposed method is more powerful than least squares regression and the ANOVA method in detecting the interaction effect when the data come from a trivariate normal distribution. The proposed method was applied to a dataset from the National Institute of Neurological Disorders and Stroke (NINDS) tissue plasminogen activator (t-PA) stroke trial, and a baseline age-by-weight interaction effect was found to be significant in predicting the change from baseline in NIHSS at Month 3 among patients who received t-PA therapy.
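For reference, the cross-product approach discussed above can be written out in a short sketch; the simulated data, variable names and use of statsmodels are illustrative assumptions, and the dissertation's own proposed test is not reproduced here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data for two continuous covariates and a response (illustrative only).
rng = np.random.default_rng(0)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 0.3 * x2 + 0.4 * x1 * x2 + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

# 'x1:x2' is the cross-product term whose coefficient is tested for interaction.
fit = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()
print(fit.summary().tables[1])
```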
Abstract:
The performance of the Hosmer-Lemeshow global goodness-of-fit statistic for logistic regression models was explored in a wide variety of conditions not previously fully investigated. Computer simulations, each consisting of 500 regression models, were run to assess the statistic in 23 different situations. The factors that varied among the situations included the number of observations used in each regression, the number of covariates, the degree of dependence among the covariates, the combinations of continuous and discrete variables, and the generation of the values of the dependent variable for model fit or lack of fit. The study found that the $\mathrm{C}_g^*$ statistic was adequate in tests of significance for most situations. However, when testing data that deviate from a logistic model, the statistic has low power to detect such deviation. Although grouping of the estimated probabilities into 8 to 30 quantiles was studied, the deciles-of-risk approach was generally sufficient. Subdividing the estimated probabilities into more than 10 quantiles when there are many covariates in the model is not necessary, despite theoretical reasons that suggest otherwise. Because it does not follow a $\chi^2$ distribution, the statistic is not recommended for use in models containing only categorical variables with a limited number of covariate patterns. The statistic performed adequately when there were at least 10 observations per quantile. Large numbers of observations per quantile did not lead to incorrect conclusions that the model did not fit the data when it actually did. However, the statistic failed to detect lack of fit when it existed and should be supplemented with further tests for the influence of individual observations. Careful examination of the parameter estimates is also essential, since the statistic did not perform as desired when there was moderate to severe collinearity among covariates. Two methods studied for handling tied values of the estimated probabilities made only a slight difference in conclusions about model fit. Neither method split observations with identical probabilities into different quantiles. Approaches that create equal-size groups by separating ties should be avoided.
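For reference, a minimal sketch of the deciles-of-risk version of the Hosmer-Lemeshow statistic evaluated above; the tie handling (pandas qcut with duplicate bin edges dropped) and the simulated example are assumptions, not the dissertation's simulation setup.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, groups=10):
    """Hosmer-Lemeshow statistic using 'groups' quantiles of the fitted probabilities."""
    df = pd.DataFrame({"y": y, "p": p_hat})
    df["decile"] = pd.qcut(df["p"], q=groups, duplicates="drop")
    obs_g = df.groupby("decile", observed=True)["y"].sum()      # observed events per group
    exp_g = df.groupby("decile", observed=True)["p"].sum()      # expected events per group
    n_g = df.groupby("decile", observed=True)["y"].count()
    stat = (((obs_g - exp_g) ** 2) / (exp_g * (1 - exp_g / n_g))).sum()
    dof = len(obs_g) - 2
    return stat, chi2.sf(stat, dof)

# Example: a well-calibrated model should give a large p-value.
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, size=2000)
y = rng.binomial(1, p)
print(hosmer_lemeshow(y, p))
```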
Abstract:
Geostrophic surface velocities can be derived from the gradients of the mean dynamic topography, the difference between the mean sea surface and the geoid. Therefore, independently observed mean dynamic topography data are valuable input parameters and constraints for ocean circulation models. For a successful fit to observational dynamic topography data, not only the mean dynamic topography on the particular ocean model grid is required, but also information about its inverse covariance matrix. The calculation of the mean dynamic topography from satellite-based gravity field models and altimetric sea surface height measurements, however, is not straightforward. For this purpose, we previously developed an integrated approach to combining these two different observation groups in a consistent way without using the common filter approaches (Becker et al. in J Geodyn 59(60):99-110, 2012, doi:10.1016/j.jog.2011.07.0069; Becker in Konsistente Kombination von Schwerefeld, Altimetrie und hydrographischen Daten zur Modellierung der dynamischen Ozeantopographie, 2012, http://nbn-resolving.de/nbn:de:hbz:5n-29199). Within this combination method, the full spectral range of the observations is considered. Further, it allows the direct determination of the normal equations (i.e., the inverse of the error covariance matrix) of the mean dynamic topography on arbitrary grids, which is one of the requirements for ocean data assimilation. In this paper, we report progress through the selection and improved processing of altimetric data sets. We focus on the preprocessing steps of along-track altimetry data from Jason-1 and Envisat to obtain a mean sea surface profile. During this procedure, a rigorous variance propagation is accomplished, so that, for the first time, the full covariance matrix of the mean sea surface is available. The combination of the mean profile and a combined GRACE/GOCE gravity field model yields a mean dynamic topography model for the North Atlantic Ocean that is characterized by a defined set of assumptions. We show that including the geodetically derived mean dynamic topography with its full error structure in a 3D stationary inverse ocean model improves modeled oceanographic features over previous estimates.
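The opening sentence refers to the standard geostrophic relations; written out below for readability (these are textbook formulas, not equations quoted from the paper), with η the mean dynamic topography, g gravitational acceleration and f the Coriolis parameter:

```latex
% Standard geostrophic relations (not quoted from the paper):
% MSS = mean sea surface, N = geoid height.
\eta = \mathrm{MSS} - N, \qquad
u = -\frac{g}{f}\,\frac{\partial \eta}{\partial y}, \qquad
v = \frac{g}{f}\,\frac{\partial \eta}{\partial x}.
```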
Abstract:
Euphausiids constitute a major biomass component in shelf ecosystems and play a fundamental role in the rapid vertical transport of carbon from the ocean surface to the deeper layers during their daily vertical migration (DVM). DVM depth and migration patterns depend on oceanographic conditions with respect to temperature, light and oxygen availability at depth, factors that are highly dependent on season in most marine regions. Changes in the abiotic conditions also shape euphausiid metabolism, including aerobic and anaerobic energy production. Here we introduce a global krill respiration model, based on artificial neural networks (ANN), which includes the effect of latitude (LAT), the day of the year of interest (DoY), and the number of daylight hours on the day of interest (DLh), in addition to the basal variables that determine ectothermal oxygen consumption (temperature, body mass and depth). The newly implemented parameters link space and time, in terms of season and photoperiod, to krill respiration. The ANN model showed a better fit (r**2 = 0.780) when DLh and LAT were included, indicating a decrease in respiration with increasing LAT and decreasing DLh. We therefore propose DLh as a potential variable to consider when building physiological models for both hemispheres. We also tested the standard respiration rates of the most commonly investigated species for seasonality, across a large range of DLh and DoY, with multiple linear regression (MLR) or generalized additive models (GAM). GAM successfully integrated DLh (r**2 = 0.563) and DoY (r**2 = 0.572) effects on respiration rates of the Antarctic krill, Euphausia superba, yielding the minimum metabolic activity in mid-June and the maximum at the end of December. Neither the MLR nor the GAM approach worked for the North Pacific krill Euphausia pacifica, and MLR for the North Atlantic krill Meganyctiphanes norvegica remained inconclusive because of insufficient seasonal data coverage. We strongly encourage comparative respiration measurements of key euphausiid species worldwide at different seasons to improve accuracy in ecosystem modelling.
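A minimal sketch of an ANN regression with the predictor set named above (temperature, body mass, depth, LAT, DoY, DLh); the network architecture, the scikit-learn implementation and the synthetic placeholder data are assumptions, not the authors' model.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Columns: temperature (°C), log body mass, depth (m), latitude, day of year, daylight hours.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(-2, 20, 500),    # temperature
    rng.uniform(-1, 2, 500),     # log10 body mass
    rng.uniform(0, 500, 500),    # depth
    rng.uniform(-75, 75, 500),   # LAT
    rng.integers(1, 366, 500),   # DoY
    rng.uniform(0, 24, 500),     # DLh
])
y = rng.normal(size=500)         # placeholder respiration rates (synthetic, illustrative)

model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000, random_state=0),
)
model.fit(X, y)
print("R^2 on training data:", model.score(X, y))
```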
Abstract:
(preliminary) Exchanges of carbon, water and energy between the land surface and the atmosphere are monitored at the ecosystem level by the eddy covariance technique. Currently, the FLUXNET database contains more than 500 registered sites, and up to 250 of them share data (Free Fair-Use dataset). Many modelling groups use the FLUXNET dataset for evaluating ecosystem model performance, but this requires uninterrupted time series for the meteorological variables used as input. Because original in-situ data often contain gaps, from very short (a few hours) up to relatively long (some months), we develop a new and robust method for filling the gaps in meteorological data measured at the site level. Our approach has the benefit of making use of continuous, globally available data (ERA-Interim) at high temporal resolution, spanning from 1989 to today. These data are, however, not measured at the site level, and for this reason a method to downscale and correct the ERA-Interim data is needed. We apply this method to the level 4 data (L4) from the LaThuile collection, freely available after registration under a Fair-Use policy. The performance of the developed method varies across sites and also depends on the meteorological variable. On average over all sites, the bias correction removes from 10% to 36% of the initial mismatch between in-situ and ERA-Interim data, depending on the meteorological variable considered. In comparison to the internal variability of the in-situ data, the root mean square error (RMSE) between the in-situ data and the un-biased ERA-Interim data remains relatively large (on average over all sites, from 27% to 76% of the standard deviation of the in-situ data, depending on the meteorological variable considered). The performance of the method remains low for the wind speed field, in particular regarding its capacity to conserve a standard deviation similar to the one measured at FLUXNET stations.
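The downscaling and bias-correction idea can be sketched as a simple linear mapping fitted on time steps where both the in-situ and the reanalysis series are available, then applied at the gap positions. This is a generic illustration under assumed variable names, not the actual correction used in the method.

```python
import numpy as np

def fill_gaps(insitu: np.ndarray, era: np.ndarray) -> np.ndarray:
    """Fill NaN gaps in an in-situ series using a bias-corrected reanalysis series.

    A linear correction era -> insitu is fitted on time steps where both series
    are available, then applied to the reanalysis values at the gap positions.
    """
    valid = ~np.isnan(insitu) & ~np.isnan(era)
    slope, intercept = np.polyfit(era[valid], insitu[valid], deg=1)
    filled = insitu.copy()
    gaps = np.isnan(insitu) & ~np.isnan(era)
    filled[gaps] = slope * era[gaps] + intercept
    return filled

# Example with an artificial two-month gap in the station record:
t = np.arange(1000)
era = 10 + 8 * np.sin(2 * np.pi * t / 365) + np.random.default_rng(0).normal(0, 1, 1000)
insitu = 0.9 * era + 1.5 + np.random.default_rng(1).normal(0, 0.5, 1000)
insitu[300:360] = np.nan
print(np.isnan(fill_gaps(insitu, era)).sum())   # 0 remaining gaps
```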
Abstract:
As anthropogenic climate change is an ongoing concern, scientific investigations of its impacts on coral reefs are increasing. Although the impacts of combined ocean acidification (OA) and temperature stress (T) on reef-building scleractinian corals have been studied at the genus, species and population levels, there are few data available on how individual corals respond to combined OA and anomalous temperatures. In this study, we exposed individual colonies of Acropora digitifera, Montipora digitata and Porites cylindrica to four pCO2-temperature treatments (400 µatm-28 °C, 400 µatm-31 °C, 1000 µatm-28 °C and 1000 µatm-31 °C) for 26 days. Physiological parameters including calcification, protein content, maximum photosynthetic efficiency, Symbiodinium density and chlorophyll content, along with the Symbiodinium type of each colony, were examined. In addition to intercolonial responses, the responses of individual colonies versus pooled data were investigated. The main results were: 1) responses to OA, T or their combination differed between individual colonies when considering physiological functions; 2) tolerance to either OA or T was not synonymous with tolerance to the other parameter; 3) tolerance to both OA and T did not necessarily lead to tolerance of OA and T combined (OAT) at the same time; 4) OAT had negative, positive or no impacts on the physiological functions of coral colonies; and 5) pooled data were not representative of the responses of all individual colonies. Indeed, the pooled data obscured the actual responses of individual colonies or presented a response that was not observed in any individual. Based on these results, we recommend improving the experimental designs of studies investigating the physiological responses of corals to climate change by complementing them with colony-specific examinations.
Abstract:
Ocean acidification, the result of increased dissolution of carbon dioxide (CO2) in seawater, is a leading subject of current research. The effects of acidification on non-calcifying macroalgae are, however, still unclear. The current study reports two 1-month experiments using two different macroalgae, the red alga Palmaria palmata (Rhodophyta) and the kelp Saccharina latissima (Phaeophyta), exposed to control (pH_NBS = 8.04) and increased (pH_NBS = 7.82) levels of CO2-induced seawater acidification. The impacts of both increased acidification and time of exposure on net primary production (NPP), respiration (R), dimethylsulphoniopropionate (DMSP) concentrations, and algal growth were assessed. In P. palmata, although NPP significantly increased during the testing period, it significantly decreased with acidification, whereas R showed a significant decrease with acidification only. S. latissima significantly increased NPP with acidification but not with time, and significantly increased R with both acidification and time, suggesting a concomitant increase in gross primary production. The DMSP concentrations of both species remained unchanged by either acidification or time during the experimental period. In contrast, algal growth differed markedly between the two experiments: P. palmata showed very little growth throughout the experiment, while S. latissima grew substantially during the course of the study, with a significant difference between the acidified and control treatments. These two experiments suggest that the study species were resistant to short-term exposure to ocean acidification, with some of the differences seen between species possibly linked to different nutrient concentrations between the experiments.
Abstract:
This study subdivides Potter Cove, King George Island, Antarctica, into seafloor regions using multivariate statistical methods. These regions serve as categories for comparing, contrasting and quantifying biogeochemical processes and biodiversity, both between geographically distinct ocean regions and between regions undergoing change in the context of global change. The division obtained is characterized by the dominating components and interpreted in terms of the prevailing environmental conditions. The analysis includes a total of 42 different environmental variables, interpolated from samples taken during the austral summer seasons 2010/2011 and 2011/2012. The statistical errors of several interpolation methods (e.g. IDW, indicator, ordinary and co-kriging) with varying settings were compared and the most reasonable method was applied. The multivariate mathematical procedures used are regionalized classification via k-means cluster analysis, canonical correlation analysis and multidimensional scaling. Canonical correlation analysis identifies the influencing factors in the different parts of the cove. Several methods for identifying the optimum number of clusters were tested, and 4, 7, 10 and 12 were identified as reasonable numbers of clusters for Potter Cove. In particular, the results for 10 and 12 clusters identify marine-influenced regions that can be clearly separated from those determined by the geological catchment area and those dominated by river discharge.
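The regionalized-classification step can be illustrated with a brief k-means sketch on standardized environmental variables, scanning the candidate cluster numbers mentioned above; the synthetic data and the silhouette criterion for comparing cluster numbers are assumptions, not necessarily the criteria used in the study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Rows = interpolated grid cells, columns = 42 environmental variables (synthetic stand-in).
rng = np.random.default_rng(0)
env = rng.normal(size=(2000, 42))

Z = StandardScaler().fit_transform(env)
for k in (4, 7, 10, 12):                        # candidate cluster numbers from the study
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z)
    print(k, "clusters, silhouette =", round(silhouette_score(Z, labels), 3))

# The labels for the chosen k define the seafloor regions on the interpolation grid.
```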
Abstract:
ENVISAT ASAR WSM images with a pixel size of 150 × 150 m, acquired under different meteorological, oceanographic and sea ice conditions, were used to detect icebergs in the Amundsen Sea (Antarctica). An object-based method for automatic iceberg detection from SAR data has been developed and applied. The object identification is based on spectral and spatial parameters at 5 scale levels and was verified against manual classification in four polygon areas chosen to represent varying environmental conditions. The algorithm works comparatively well in the freezing temperatures and strong wind conditions prevailing in the Amundsen Sea during the year. The detection rate was 96%, corresponding to 94% of the area (counting icebergs larger than 0.03 km**2), for all seasons. The presented algorithm tends to generate errors in the form of false alarms, mainly caused by the presence of ice floes, rather than misses. This affects the reliability, since false alarms were manually corrected post-analysis.
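As a deliberately simplified stand-in for the multi-scale, object-based classification described above, the sketch below detects bright, compact objects in a SAR backscatter image by thresholding and connected-component labelling with SciPy; the threshold and the toy image are assumptions, and only the pixel size and the 0.03 km**2 minimum area come from the abstract.

```python
import numpy as np
from scipy import ndimage

PIXEL_AREA_KM2 = 0.15 * 0.15       # ENVISAT ASAR WSM pixel size, 150 m x 150 m
MIN_AREA_KM2 = 0.03                # only count icebergs larger than 0.03 km**2

def detect_icebergs(backscatter: np.ndarray, threshold: float):
    """Return the label image and areas (km**2) of objects above the minimum size."""
    mask = backscatter > threshold                 # bright targets against water/sea ice
    labels, n = ndimage.label(mask)
    areas = ndimage.sum(mask, labels, index=np.arange(1, n + 1)) * PIXEL_AREA_KM2
    keep = np.where(areas >= MIN_AREA_KM2)[0] + 1  # label ids of sufficiently large objects
    return labels, dict(zip(keep.tolist(), areas[areas >= MIN_AREA_KM2].tolist()))

# Toy image: background clutter plus two bright blobs standing in for icebergs.
img = np.random.default_rng(0).normal(0.1, 0.02, (200, 200))
img[50:60, 50:64] += 0.5
img[120:123, 150:153] += 0.5
labels, icebergs = detect_icebergs(img, threshold=0.3)
print(icebergs)                    # {label: area_km2} for detected objects
```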