11 results for non-trivial data structures
in DigitalCommons@The Texas Medical Center
Abstract:
There are many diseases associated with the expansion of DNA repeats in humans. Myotonic dystrophy type 2 (DM2) is one such disease, characterized by expansion of a (CCTG)•(CAGG) repeat tract in intron 1 of the zinc finger protein 9 (ZNF9) gene on chromosome 3q21.3. The region 5' of the DM2 repeat tract contains a polymorphic repetitive sequence, (TG)14-25(TCTG)4-11(CCTG)n. The (CCTG)•(CAGG) repeat typically spans 11-26 repeats in unaffected persons but can expand to as many as 11,000 repeats in affected individuals, the largest expansion seen in any DNA repeat disease to date. This tract remains one of the least characterized disease-associated DNA repeats, and the mechanisms causing its expansion in humans have yet to be elucidated. Alternative, non-B DNA structures formed by expanded repeats are typical of DNA repeat expansion diseases, and these structures may promote instability of the repeat tracts. I determined that slipped-strand structures form in (CCTG)•(CAGG) repeats at lengths of 42 repeats or more. In addition, Z-DNA forms in the flanking human sequence adjacent to the (CCTG)•(CAGG) repeat tract. Genetic assays in E. coli cells indicate that the (CCTG)•(CAGG) repeats behave more like the highly unstable (CTG)•(CAG) repeat tracts of Huntington's disease and myotonic dystrophy type 1 than like the more stable (ATTCT)•(AGAAT) repeat tracts of spinocerebellar ataxia type 10. This instability, however, is RecA-independent for the (CCTG)•(CAGG) and (ATTCT)•(AGAAT) repeats, whereas it is RecA-dependent for the (CTG)•(CAG) repeats. Structural studies of the (CCTG)•(CAGG) repeat tract and its flanking sequence, together with genetic selection assays, may reveal the mechanisms responsible for the repeat instability in E. coli, which may in turn lead to a better understanding of the mechanisms contributing to the human disease state.
Abstract:
Strategies are compared for developing a linear regression model with stochastic (multivariate normal) regressor variables and for subsequently assessing its predictive ability. Bias and mean squared error of four estimators of predictive performance are evaluated in simulated samples from 32 population correlation matrices. Models including all available predictors are compared with those obtained using selected subsets. The subset selection procedures investigated include two stopping rules, Cp and Sp, each combined with either an 'all possible subsets' or a 'forward selection' search. The estimators of performance include parametric (MSEPm) and non-parametric (PRESS) assessments in the entire sample, and two data-splitting estimates restricted to a random or balanced (Snee's DUPLEX) 'validation' half-sample. The simulations were performed as a designed experiment, with population correlation matrices representing a broad range of data structures.

The techniques examined for subset selection do not generally yield improved predictions relative to the full model. 'Forward selection' approaches produce slightly smaller prediction errors and less biased estimators of predictive accuracy than 'all possible subsets' approaches, but no differences are detected between the performances of Cp and Sp. In every case, prediction errors of models obtained by subset selection in either half-split exceed those obtained using all predictors and the entire sample.

Only the random-split estimator is conditionally (on β) unbiased; however, MSEPm is unbiased on average, and PRESS is nearly so, in unselected (fixed-form) models. When subset selection techniques are used, MSEPm and PRESS always underestimate prediction errors, by as much as 27 percent (on average) in small samples.
Despite their bias, the mean squared errors (MSE) of these estimators are at least 30 percent less than that of the unbiased random-split estimator. The DUPLEX split estimator suffers from large MSE as well as bias, and seems of little value in the context of stochastic regressor variables. To maximize predictive accuracy while retaining a reliable estimate of that accuracy, it is recommended that the entire sample be used for model development and that a leave-one-out statistic (e.g., PRESS) be used for assessment.
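The recommended leave-one-out statistic can be computed without refitting the model n times: for ordinary least squares, the i-th leave-one-out residual equals the ordinary residual divided by (1 − h_ii), where h_ii is the i-th hat-matrix diagonal. A minimal sketch (function name and interface are illustrative, not from the dissertation):

```python
import numpy as np

def press_statistic(X, y):
    """PRESS: sum of squared leave-one-out prediction errors for OLS.

    Uses the hat-matrix identity e_(i) = e_i / (1 - h_ii), so no
    explicit refitting is needed.
    """
    X = np.column_stack([np.ones(len(y)), X])  # add intercept column
    H = X @ np.linalg.solve(X.T @ X, X.T)      # hat (projection) matrix
    resid = y - H @ y                          # ordinary residuals
    return float(np.sum((resid / (1.0 - np.diag(H))) ** 2))
```

The shortcut is algebraically exact, so it agrees with a brute-force loop that deletes each observation and refits.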
Abstract:
Next-generation sequencing (NGS) technology has become a prominent tool in biological and biomedical research. However, NGS data analysis, such as de novo assembly, mapping, and variant detection, is far from mature, and the high sequencing error rate is one of the major problems. To minimize the impact of sequencing errors, we developed a highly robust and efficient method, MTM, to correct errors in NGS reads. We demonstrated the effectiveness of MTM on both single-cell data with highly non-uniform coverage and normal data with uniformly high coverage, showing that MTM's performance does not depend on the coverage of the sequencing reads. MTM was also compared with Hammer and Quake, the best methods for correcting non-uniform and uniform data, respectively. For non-uniform data, MTM outperformed both Hammer and Quake. For uniform data, MTM showed better performance than Quake and results comparable to Hammer. By providing better error correction, MTM improved the quality of downstream analyses such as mapping and SNP detection. SNP calling is a major application of NGS technologies. However, the existence of sequencing errors complicates this process, especially for the low coverage (
Abstract:
Genetic instability in mammalian cells can occur by many different mechanisms. In the absence of exogenous sources of DNA damage, DNA structure itself has been implicated in genetic instability. When the canonical B-DNA helix is naturally altered into a non-canonical DNA structure such as Z-DNA or H-DNA, genetic instability can result in the form of DNA double-strand breaks (DSBs) (1, 2). Our laboratory found that the stability of these non-B DNA structures differs between mammals and Escherichia coli (E. coli) bacteria (1, 2). One explanation for this difference may lie in how DSBs are repaired in each species: non-homologous end-joining (NHEJ) is the primary pathway for repairing DSBs in mammalian cells, while bacteria that lack NHEJ (such as E. coli) use homologous recombination (HR) to repair DSBs. To investigate the role of the error-prone NHEJ repair pathway in DNA structure-induced genetic instability, E. coli cells were modified to express the genes required for a functional NHEJ system in different HR backgrounds. The NHEJ-sufficient system from Mycobacterium tuberculosis is composed of Ku and Ligase D (LigD) (3). These inducible NHEJ components were expressed individually and together in E. coli cells, with or without functional HR (RecA/RecB), and the Z-DNA- and H-DNA-induced mutations were characterized. The Z-DNA structure gave rise to higher mutation frequencies than the controls regardless of the DSB repair pathway(s) available; however, the type of mutants produced after repair depended strongly on the available DSB repair system, as indicated by a shift from 2% large-scale deletions in the total mutant population to 24% large-scale deletions when NHEJ was present (4). This suggests that NHEJ has a role in the large deletions induced by Z-DNA-forming sequences.
The H-DNA structure, however, did not exhibit increased mutagenesis in the newly engineered E. coli environment, suggesting that other factors regulate H-DNA formation and stability in bacterial cells. Accurate repair by established DNA DSB repair pathways is essential to maintain the stability of eukaryotic and prokaryotic genomes, and our results suggest that an error-prone NHEJ pathway is involved in non-B DNA structure-induced mutagenesis in both prokaryotes and eukaryotes.
Abstract:
Friedreich's ataxia (FRDA) is caused by expansion of the GAA•TTC trinucleotide repeat sequence located in intron 1 of the frataxin gene. Long GAA•TTC repeats are known to form several non-B DNA structures, including hairpins, triplexes, parallel DNA, and sticky DNA. It is therefore believed that alternative DNA structures play a role in the loss of mRNA transcript and functional frataxin protein in FRDA patients. We sought to further elucidate the requirements for formation and stability of sticky DNA by evaluating the structure in a plasmid-based system in vitro and in vivo in Escherichia coli. The negative supercoil density of plasmids harboring GAA•TTC repeats of different lengths, in either one or two repeat tracts, was studied in E. coli to determine whether plasmids containing two long tracts (≥60 repeats) in a direct repeat orientation have a different topological effect in vivo than plasmids harboring only one GAA•TTC tract or two tracts of <60 repeats. The experiments revealed that sticky DNA-forming plasmids did in fact have a lower average negative supercoil density (−σ) than all control plasmids with the potential to form other non-B DNA structures such as triplexes or Z-DNA. The requirements for in vitro dissociation and reconstitution of the DNA•DNA-associated region of sticky DNA were also evaluated. The results show that the two repeat tracts associate in the presence of negative supercoiling and MgCl2 or MnCl2 in a time- and concentration-dependent manner. Interaction of the repeat sequences was not observed in the absence of negative supercoiling and/or MgCl2, or in the presence of other monovalent or divalent cations, indicating that supercoiling and quite specific cations are needed for the association of sticky DNA. These are the first experiments to address a specific role for supercoiling and cations in this DNA conformation.
To support our model of the topological effects of sticky DNA in plasmids, changes in sticky DNA band migration were measured relative to linear DNA after treatment with increasing concentrations of ethidium bromide (EtBr). This method confirmed the presence of independent negative supercoil domains segregated by the DNA-DNA-associated region. Sequence-specific polyamide molecules were used to test whether binding of these ligands to the GAA•TTC repeats inhibits sticky DNA formation. The destabilization of the sticky DNA conformation in vitro through polyamide binding demonstrates a first conceptual therapeutic approach to treating FRDA at the DNA level. Thus, examining the properties of sticky DNA formed by these long repeat tracts is important for elucidating the possible role of sticky DNA in Friedreich's ataxia.
Abstract:
Interaction effects are an important scientific interest in many areas of research. A common approach for investigating the interaction effect of two continuous covariates on a response variable is a cross-product term in multiple linear regression. In epidemiological studies, the two-way analysis of variance (ANOVA) type of method has also been used to examine the interaction effect by replacing the continuous covariates with their discretized levels. However, the implications of the model assumptions of either approach have not been examined, and statistical validation has focused only on the general method, not specifically on the interaction effect.

In this dissertation, we investigated the validity of both approaches based on their mathematical assumptions for non-skewed data. We showed that linear regression may not be an appropriate model when the interaction effect exists, because it implies a highly skewed distribution for the response variable. We also showed that the normality and constant-variance assumptions required by ANOVA are not satisfied in the model where the continuous covariates are replaced by their discretized levels. Naïve application of the ANOVA method may therefore lead to an incorrect conclusion.

Given the problems identified above, we proposed a novel method, modified from the traditional ANOVA approach, to rigorously evaluate the interaction effect. The analytical expression of the interaction effect was derived from the conditional distribution of the response variable given the discretized continuous covariates. A testing procedure that combines the p-values from each level of the discretized covariates was developed to test the overall significance of the interaction effect. According to a simulation study, the proposed method is more powerful than least-squares regression and the ANOVA method in detecting the interaction effect when the data come from a trivariate normal distribution.
The proposed method was applied to a dataset from the National Institute of Neurological Disorders and Stroke (NINDS) tissue plasminogen activator (t-PA) stroke trial, and a baseline age-by-weight interaction effect was found to be significant in predicting the change from baseline in NIHSS at month 3 among patients who received t-PA therapy.
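The abstract does not specify how the per-level p-values are combined; one standard technique for combining independent p-values is Fisher's method, sketched below purely as an illustration (the dissertation's actual combining procedure may differ, and the function name is mine):

```python
import numpy as np
from scipy.stats import chi2

def fisher_combine(pvalues):
    """Combine independent p-values with Fisher's method.

    Under the global null, -2 * sum(log p_i) follows a chi-square
    distribution with 2k degrees of freedom, where k = len(pvalues).
    Returns (test statistic, combined p-value).
    """
    p = np.asarray(pvalues, dtype=float)
    stat = -2.0 * np.log(p).sum()
    return stat, chi2.sf(stat, df=2 * len(p))
```

For example, `fisher_combine([0.1, 0.2, 0.3])` yields a chi-square statistic of about 10.23 on 6 degrees of freedom, so no single level need be significant for the combined test to approach significance.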
Abstract:
ExxonMobil, a Fortune 500 oil and gas corporation, has a global workforce with employees assigned to projects in areas at risk for infectious diseases, particularly malaria. Accordingly, the corporation has put in place a program, the Malaria Control Program (MCP), to protect the health of workers and ensure their safety in malaria-endemic zones. One component of this program is the more specific Malaria Chemoprophylaxis Compliance Program (MCCP), in which employees enroll after consenting to random drug testing for compliance with the company's chemoprophylaxis requirements. Each year, data are gathered on the number of employees working in these locations, who are selected randomly and tested for chemoprophylaxis compliance; the selection strives to test each eligible worker once per year. Test results that are positive for the chemoprophylaxis drug are considered "detects"; tests that are negative for the drug, and therefore show the worker is non-compliant and at risk for severe malaria infection, are considered "non-detects".

The current report used aggregate data to calculate statistics on test results reflecting compliance among both employees and contractors in various malaria-endemic areas. This aggregate, non-individualized data reflects the effectiveness and reach of ExxonMobil's Malaria Chemoprophylaxis Compliance Program. To assess compliance, the number of non-detect test results was compared with the number of tests completed per year. The data show that over time, non-detect results have declined in both employee and contractor populations and vary somewhat by location owing to the size and scope of the MCCP implemented in-country. Although the data indicate a positive trend for the corporation, some recommendations are made for future implementation of the program.
Abstract:
In biomedical studies, the common data structures are matched (paired) and unmatched designs. Recently, many researchers have turned to meta-analysis to obtain a better understanding of a medical treatment from several clinical datasets. A hybrid design, which combines the two data structures, raises fundamental questions for statistical methods and challenges for statistical inference. The appropriate methods depend on the underlying distribution: if the outcomes are normally distributed, we can use the classic paired t-test and two-independent-sample t-test for the matched and unmatched cases, respectively; if not, we can apply the Wilcoxon signed-rank and rank-sum tests.

To assess an overall treatment effect in a hybrid design, we can apply the inverse-variance weighting used in meta-analysis. In the nonparametric case, we can use a test statistic that combines the two Wilcoxon test statistics; however, these two statistics are not on the same scale. We propose a hybrid test statistic based on the Hodges-Lehmann estimates of the treatment effects, which are medians on the same scale.

For comparison, we apply the classic meta-analysis t-test statistic to the combined estimates of the treatment effects from the two t-test statistics. Theoretically, the efficiency of two unbiased estimators of a parameter is the ratio of their variances. Using the concept of asymptotic relative efficiency (ARE) developed by Pitman, we derive the ARE of the hybrid test statistic relative to the classic meta-analysis t-test statistic using the Hodges-Lehmann estimators associated with the two test statistics.

From several simulation studies, we calculate the empirical type I error rate and power of the test statistics. The proposed statistic provides an effective tool for evaluating and understanding treatment effects in various public health studies as well as clinical trials.
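The Hodges-Lehmann estimators referred to above have simple closed forms: the one-sample (paired) estimate is the median of the Walsh averages, and the two-sample estimate of shift is the median of all pairwise differences. A minimal sketch, assuming the standard textbook definitions (function names are mine, not the dissertation's):

```python
import numpy as np

def hodges_lehmann_center(x):
    """One-sample Hodges-Lehmann estimate of the center:
    the median of all Walsh averages (x_i + x_j) / 2, i <= j."""
    x = np.asarray(x, dtype=float)
    i, j = np.triu_indices(len(x))       # all pairs with i <= j
    return float(np.median((x[i] + x[j]) / 2.0))

def hodges_lehmann_shift(x, y):
    """Two-sample Hodges-Lehmann estimate of the shift y - x:
    the median of all pairwise differences y_j - x_i."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.median(np.subtract.outer(y, x)))
```

Because both estimates are medians on the original measurement scale, they can be combined across the matched and unmatched components of a hybrid design, unlike the raw Wilcoxon statistics.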
Abstract:
Nuclear morphometry (NM) uses image analysis to measure features of the cell nucleus, classified as bulk properties, shape or form, and DNA distribution. Studies have used these measurements as diagnostic and prognostic indicators of disease, with inconclusive results. The distributional properties of these variables have not been systematically investigated, although much medical data exhibit non-normal distributions. Measurements are made on several hundred cells per patient, so summary measures reflecting the underlying distribution are needed.

Distributional characteristics of 34 NM variables from prostate cancer cells were investigated using graphical and analytical techniques. Cells per sample ranged from 52 to 458. A small sample of patients with benign prostatic hyperplasia (BPH), representing non-cancer cells, was used for general comparison with the cancer cells.

Data transformations such as log, square root, and 1/x did not yield normality as measured by the Shapiro-Wilk test. A modulus transformation, used for distributions with abnormal kurtosis values, also did not produce normality.

Kernel density histograms of the 34 variables exhibited non-normality, and 18 variables also exhibited bimodality. A bimodality coefficient was calculated, and 3 variables (DNA concentration, shape, and elongation) showed the strongest evidence of bimodality and were studied further.

Two analytical approaches were used to obtain a summary measure for each variable for each patient: cluster analysis to determine significant clusters, and a mixture-model analysis using a two-component Gaussian model with equal variances. The mixture-component parameters were used to bootstrap the log-likelihood ratio to determine the significant number of components, 1 or 2. These summary measures were then used as predictors of disease severity in several proportional-odds logistic regression models.
The disease severity scale had 5 levels and was constructed from 3 components: extracapsular penetration (ECP), lymph node involvement (LN+), and seminal vesicle involvement (SV+), which serve as surrogate measures of prognosis. The summary measures were not strong predictors of disease severity. The mixture-model results gave some indication of changes in the mean levels and proportions of the components at the lower severity levels.
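The abstract does not give its exact formula, but a commonly used bimodality coefficient (the SAS definition) relates sample skewness and kurtosis, with values above 5/9 (the value for a uniform distribution) taken as evidence of bimodality. A sketch under that assumption (the dissertation's coefficient may be defined differently):

```python
import numpy as np
from scipy.stats import skew, kurtosis

def bimodality_coefficient(x):
    """SAS-style bimodality coefficient.

    BC = (g1^2 + 1) / (g2 + 3(n-1)^2 / ((n-2)(n-3))),
    where g1 is the bias-corrected sample skewness and g2 the
    bias-corrected excess kurtosis. Values above 5/9 (~0.555)
    suggest bimodality.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    g1 = skew(x, bias=False)
    g2 = kurtosis(x, fisher=True, bias=False)
    return (g1 ** 2 + 1.0) / (g2 + 3.0 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
```

A unimodal normal sample scores well below the 5/9 threshold, while a sample concentrated at two separated values scores well above it.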
Abstract:
Cancer is the second leading cause of death in the United States. With the advent of new technologies, changes in health care delivery, and the multiplicity of provider types that patients must see, cancer care management has become increasingly complex. The availability of cancer health information has been shown to help cancer patients cope with the management and effects of their cancers. As a result, more cancer patients are using the internet to find resources that can aid in decision-making and recovery.

The Health Information National Trends Survey (HINTS) is a nationally representative survey designed to collect information about the experiences of cancer and non-cancer adults with health information sources. The HINTS survey focused on both conventional sources and newer technologies, particularly the internet. This study is a descriptive analysis of the HINTS 2003 and HINTS 2005 survey data. The purpose of the research is to explore general trends in health information seeking and use by US adults, and especially by cancer patients.

From 2003 to 2005, internet use for various health-related activities appears to have increased among adults with and without cancer. Differences were found between the groups in general trust in information media, particularly the internet; non-cancer respondents tended to have greater trust in information media than cancer respondents.

The latter portion of this work examined characteristics of HINTS respondents thought to be relevant to how much trust individuals placed in the internet as a source of health information. Trust in health information from the internet was significantly greater among younger adults, higher-earning households, internet users, online seekers of health or cancer information, and those who found online cancer information useful.
Abstract:
Objectives. The central objective of this study was to systematically examine the internal structure of multihospital systems, determining the management principles used and the performance levels achieved in medical care and administrative areas.

The Universe. The study universe consisted of short-term general American hospitals owned and operated by multihospital corporations. The corporations compared were the investor-owned (for-profit) and the voluntary multihospital systems. The individual hospital was the unit of analysis.

Theoretical Considerations. Contingency theory, drawing on selected aspects of the classical and human relations schools of thought, seemed well suited to describing multihospital organization and was used in this research.

The Study Hypotheses. The main null hypotheses were that there are no significant differences between the voluntary and the investor-owned multihospital sectors in (1) hospital structures and (2) patient care and administrative performance levels.

The Sample. A stratified random sample of 212 hospitals owned by multihospital systems was selected to represent the two study sectors equally. Of the sampled hospitals approached, 90.1% responded.

The Analysis. Sixteen scales were constructed in conjunction with 16 structural variables developed from the major questions and sub-items of the questionnaire. This was followed by analysis of an additional 7 structural and 24 effectiveness (performance) measures, using frequency distributions. Finally, summary statistics and statistical tests for each variable and its sub-items were completed and recorded in 38 tables.

Study Findings. While it has been argued that there are great differences between the two sectors, this study found that, with a few exceptions, the null hypotheses of no difference in the organizational and operational characteristics of non-profit and for-profit hospitals were accepted.
There were, however, several significant differences in the structural variables: functional specialization and autonomy were significantly higher in the voluntary sector, and only centralization was significantly different in the investor-owned sector. Among the effectiveness measures, occupancy rate, cost of data processing, total man-hours worked, F.T.E. ratios, and personnel per occupied bed were significantly higher in the voluntary sector. The findings indicated that both voluntary and for-profit systems were converging toward a common hierarchical corporate management approach. Factors of size and management style may better characterize a specific multihospital group than its profit or nonprofit status. (Abstract shortened with permission of author.)