17 results for Data Modeling
at DigitalCommons@The Texas Medical Center
Abstract:
The first manuscript, entitled "Time-Series Analysis as Input for Clinical Predictive Modeling: Modeling Cardiac Arrest in a Pediatric ICU," lays out the theoretical background for the project. There are several core concepts presented in this paper. First, traditional multivariate models (where each variable is represented by only one value) provide single point-in-time snapshots of patient status: they are incapable of characterizing deterioration. Since deterioration is consistently identified as a precursor to cardiac arrests, we maintain that the traditional multivariate paradigm is insufficient for predicting arrests. We identify time series analysis as a method capable of characterizing deterioration in an objective, mathematical fashion, and describe how to build a general foundation for predictive modeling using time series analysis results as latent variables. Building a solid foundation for any given modeling task involves addressing a number of issues during the design phase. These include selecting the proper candidate features on which to base the model and selecting the most appropriate tool to measure them. We also identified several unique design issues that are introduced when time series data elements are added to the set of candidate features. One such issue is defining the duration and resolution of time series elements required to sufficiently characterize the time series phenomena being considered as candidate features for the predictive model. Once the duration and resolution are established, there must also be explicit mathematical or statistical operations that produce the time series analysis result to be used as a latent candidate feature. In synthesizing the comprehensive framework for building a predictive model based on time series data elements, we identified at least four classes of data that can be used in the model design. The first two classes are shared with traditional multivariate models: multivariate data and clinical latent features. Multivariate data are represented by the standard one-value-per-variable paradigm and are widely employed in a host of clinical models and tools; they are often represented by a number in a given cell of a table. Clinical latent features are derived, rather than directly measured, data elements that more accurately represent a particular clinical phenomenon than any of the directly measured data elements in isolation. The second two classes are unique to the time series data elements. The first of these comprises the raw data elements: multiple values per variable, constituting the measured observations that are typically available to end users when they review time series data, often represented as dots on a graph. The final class of data results from performing time series analysis. This class of data represents the fundamental concept on which our hypothesis is based. The specific statistical or mathematical operations are up to the modeler to determine, but we generally recommend that a variety of analyses be performed in order to maximize the likelihood of producing a representation of the time series data elements that is able to distinguish between two or more classes of outcomes.
The second manuscript, entitled "Building Clinical Prediction Models Using Time Series Data: Modeling Cardiac Arrest in a Pediatric ICU," provides a detailed description, start to finish, of the methods required to prepare the data, build, and validate a predictive model that uses the time series data elements determined in the first paper. One of the fundamental tenets of the second paper is that manual implementations of time series-based models are infeasible due to the relatively large number of data elements and the complexity of preprocessing that must occur before data can be presented to the model. Each of the seventeen steps in this process is analyzed from the perspective of how it may be automated, when necessary. We identify the general objectives and available strategies of each of the steps, and we present our rationale for choosing a specific strategy for each step in the case of predicting cardiac arrest in a pediatric intensive care unit. Another issue brought to light by the second paper is that the individual steps required to use time series data for predictive modeling are more numerous and more complex than those used for modeling with traditional multivariate data. Even after complexities attributable to the design phase (addressed in our first paper) have been accounted for, the management and manipulation of the time series elements (the preprocessing steps in particular) are issues that are not present in a traditional multivariate modeling paradigm. In our methods, we present the issues that arise from the time series data elements: defining a reference time; imputing and reducing time series data in order to conform to a predefined structure that was specified during the design phase; and normalizing variable families rather than individual variable instances.
The final manuscript, entitled "Using Time-Series Analysis to Predict Cardiac Arrest in a Pediatric Intensive Care Unit," presents the results that were obtained by applying the theoretical construct and its associated methods (detailed in the first two papers) to the case of cardiac arrest prediction in a pediatric intensive care unit. Our results showed that utilizing the trend analysis from the time series data elements reduced the number of classification errors by 73%. The area under the receiver operating characteristic curve increased from a baseline of 87% to 98% when the trend analysis was included. In addition to the performance measures, we were also able to demonstrate that adding raw time series data elements without their associated trend analyses improved classification accuracy compared to the baseline multivariate model, but diminished classification accuracy compared to when just the trend analysis features were added (i.e., without adding the raw time series data elements). We believe this phenomenon was largely attributable to overfitting, which is known to increase as the ratio of candidate features to class examples rises. Furthermore, although we employed several feature reduction strategies to counteract the overfitting problem, they failed to improve the performance beyond that which was achieved by exclusion of the raw time series elements. Finally, our data demonstrated that pulse oximetry and systolic blood pressure readings tend to start diminishing about 10-20 minutes before an arrest, whereas heart rates tend to diminish rapidly less than 5 minutes before an arrest.
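Below is a minimal sketch of the kind of trend-analysis latent feature described above: a least-squares slope fitted over a fixed-duration, fixed-resolution window of one vital-sign series, returned alongside a simple window mean. The window length, sampling interval, and feature names are illustrative assumptions, not the dissertation's actual specification.

```python
import numpy as np

def trend_features(values, timestamps, window_minutes=60):
    """Illustrative trend-analysis features for one vital-sign time series.

    Assumes `timestamps` are minutes relative to the reference time (negative
    values are in the past); returns a least-squares slope and mean over the
    most recent `window_minutes`.
    """
    t = np.asarray(timestamps, dtype=float)
    v = np.asarray(values, dtype=float)
    mask = t >= -window_minutes               # keep only the analysis window
    t, v = t[mask], v[mask]
    slope, intercept = np.polyfit(t, v, deg=1)  # first-order trend
    return {"slope_per_min": slope, "window_mean": v.mean()}

# Example: systolic blood pressure sampled every 5 minutes over the last hour.
ts = np.arange(-60, 1, 5)
sbp = 110 + 0.2 * ts + np.random.normal(0, 2, ts.size)  # gently declining values
print(trend_features(sbp, ts))
```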
Abstract:
Every x-ray attenuation curve inherently contains all the information necessary to extract the complete energy spectrum of a beam. To date, attempts to obtain accurate spectral information from attenuation data have been inadequate. This investigation presents a mathematical pair model, grounded in physical reality by the Laplace transformation, to describe the attenuation of a photon beam and the corresponding bremsstrahlung spectral distribution. In addition, the Laplace model has been mathematically extended to include characteristic radiation in a physically meaningful way. A method to determine the fraction of characteristic radiation in any diagnostic x-ray beam was introduced for use with the extended model. This work has examined the reconstructive capability of the Laplace pair model for a photon beam range from 50 kVp to 25 MV, using both theoretical and experimental methods. In the diagnostic region, excellent agreement between a wide variety of experimental spectra and those reconstructed with the Laplace model was obtained when the atomic composition of the attenuators was accurately known. The model successfully reproduced a 2 MV spectrum but demonstrated difficulty in accurately reconstructing orthovoltage and 6 MV spectra. The 25 MV spectrum was successfully reconstructed, although poor agreement with the spectrum obtained by Levy was found. The analysis of errors, performed with diagnostic energy data, demonstrated the relative insensitivity of the model to typical experimental errors and confirmed that the model can be successfully used to theoretically derive accurate spectral information from experimental attenuation data.
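For orientation, the physical relationship a Laplace pair model of this kind exploits can be written schematically as follows. This is a generic statement of the transform pair, not the dissertation's exact parameterization, and characteristic radiation is omitted.

```latex
% Transmission (attenuation) curve as a Laplace-type transform of the spectrum.
% N(x): transmitted signal after absorber thickness x; \psi(\mu): spectral weight
% expressed as a distribution over linear attenuation coefficients \mu.
\begin{align}
  N(x) &= \int_{0}^{\infty} \psi(\mu)\, e^{-\mu x}\, d\mu
        = \mathcal{L}\{\psi(\mu)\}(x), \\
  \psi(\mu) &= \mathcal{L}^{-1}\{N(x)\}(\mu)
  \quad\text{(spectrum recovered by inverting the transform).}
\end{align}
```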
Abstract:
The joint modeling of longitudinal and survival data is a new approach in many applications such as HIV, cancer vaccine trials, and quality-of-life studies. There have been recent developments in the methodologies for each of the components of the joint model, as well as in the statistical processes that link them together. Among these, second-order polynomial random effect models and linear mixed effects models are the most commonly used for the longitudinal trajectory function. In this study, we first relax the parametric constraints of polynomial random effect models by using Dirichlet process priors, and we consider three longitudinal markers rather than only one marker in one joint model. Second, we use a linear mixed effect model for the longitudinal process in a joint model analyzing the three markers. In this research, these methods were applied to the primary biliary cirrhosis (PBC) sequential data, which were collected from a clinical trial of PBC of the liver conducted between 1974 and 1984 at the Mayo Clinic. The effects of three longitudinal markers, (1) total serum bilirubin, (2) serum albumin, and (3) serum glutamic-oxaloacetic transaminase (SGOT), on patients' survival were investigated. The proportion of treatment effect was also studied using the proposed joint modeling approaches. Based on the results, we conclude that the proposed modeling approaches yield a better fit to the data and give less biased parameter estimates for these trajectory functions than previous methods. Model fit is also improved after considering three longitudinal markers instead of only one. The results from the analysis of the proportion of treatment effect from these joint models support the same conclusion as the final model of Fleming and Harrington (1991): bilirubin and albumin together have a stronger impact in predicting patients' survival and serve as surrogate endpoints for treatment.
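A schematic of a shared-parameter joint model of the general type described (a mixed-effects longitudinal trajectory linked to a proportional-hazards survival submodel). The notation and link structure below are a common textbook form shown only to fix ideas; they are not the dissertation's exact specification, and the Dirichlet process prior on the random effects is indicated only in a comment.

```latex
% Longitudinal submodel for marker k of subject i (linear mixed effects):
\begin{align}
  y_{ik}(t) &= m_{ik}(t) + \varepsilon_{ik}(t), \qquad
  m_{ik}(t) = \mathbf{x}_{ik}^{\top}(t)\,\boldsymbol{\beta}_k
            + \mathbf{z}_{ik}^{\top}(t)\,\mathbf{b}_{ik}, \\
  % Survival submodel sharing the current values of the three markers:
  \lambda_i(t) &= \lambda_0(t)\,
    \exp\!\Big( \boldsymbol{\gamma}^{\top}\mathbf{w}_i
      + \sum_{k=1}^{3} \alpha_k\, m_{ik}(t) \Big).
\end{align}
% Relaxing normality: the random effects b_{ik} may be given a Dirichlet
% process (mixture) prior instead of a parametric Gaussian distribution.
```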
Abstract:
Microarray technology is a high-throughput method for genotyping and gene expression profiling. Limited sensitivity and specificity are among the essential problems of this technology. Most existing methods of microarray data analysis have an apparent limitation in that they deal only with the numerical part of microarray data and make little use of gene sequence information. Because it is the gene sequences that precisely define the physical objects being measured by a microarray, it is natural to make the gene sequences an essential part of the data analysis. This dissertation focused on the development of free energy models to integrate sequence information into microarray data analysis. The models were used to characterize the mechanism of hybridization on microarrays and to enhance the sensitivity and specificity of microarray measurements. Cross-hybridization is a major obstacle to the sensitivity and specificity of microarray measurements. In this dissertation, we evaluated the scope of the cross-hybridization problem on short-oligo microarrays. The results showed that cross-hybridization on arrays is mostly caused by oligo fragments with a run of 10 to 16 nucleotides complementary to the probes. Furthermore, a free-energy-based model was proposed to quantify the amount of cross-hybridization signal on each probe. This model treats cross-hybridization as an integral effect of the interactions between a probe and various off-target oligo fragments. Using public spike-in datasets, the model showed high accuracy in predicting the cross-hybridization signals on those probes whose intended targets are absent in the sample. Several prospective models were also proposed to improve the positional-dependent nearest-neighbor (PDNN) model for better quantification of gene expression and cross-hybridization. The problem addressed in this dissertation is fundamental to microarray technology. We expect that this study will help us to understand the detailed mechanism that determines sensitivity and specificity on microarrays. Consequently, this research will have a wide impact on how microarrays are designed and how the data are interpreted.
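A toy sketch of the positional-dependent nearest-neighbor idea: a probe-target binding free energy is accumulated from nearest-neighbor stacking terms, each weighted by its position along the probe, and converted to a predicted signal through a Langmuir-type isotherm. The weight profile, stacking-energy table, and isotherm constants below are placeholders, not fitted PDNN parameters.

```python
import math

# Placeholder nearest-neighbor stacking energies (arbitrary units, not fitted values).
STACK_ENERGY = {("A", "C"): -1.0, ("C", "G"): -1.8, ("G", "T"): -1.2, ("T", "A"): -0.6}

def pdnn_like_energy(probe, weights):
    """Position-weighted sum of nearest-neighbor stacking energies along a probe."""
    energy = 0.0
    for k in range(len(probe) - 1):
        pair = (probe[k], probe[k + 1])
        energy += weights[k] * STACK_ENERGY.get(pair, -1.0)  # default for unlisted pairs
    return energy

def predicted_signal(energy, n_sites=1000.0, background=50.0):
    """Langmuir-type isotherm: stronger (more negative) energy -> brighter signal."""
    return n_sites / (1.0 + math.exp(energy)) + background

probe = "ACGTACGTACGTACGTACGTACGT"                                   # 24-mer, illustrative only
weights = [1.0 - abs(k - 11.5) / 12 for k in range(len(probe) - 1)]  # peak near probe center
e = pdnn_like_energy(probe, weights)
print(round(e, 2), round(predicted_signal(e), 1))
```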
Abstract:
Mixture modeling is commonly used to model categorical latent variables that represent subpopulations in which population membership is unknown but can be inferred from the data. In recent years, finite mixture models have been applied to time-to-event data. However, the commonly used survival mixture model assumes that the effects of the covariates involved in failure times differ across latent classes, while the covariate distribution is homogeneous. The aim of this dissertation is to develop a method to examine time-to-event data in the presence of unobserved heterogeneity within a mixture modeling framework. A joint model is developed to incorporate the latent survival trajectory along with the observed information for the joint analysis of a time-to-event variable, its discrete and continuous covariates, and a latent class variable. It is assumed that the effects of covariates on survival times and the distribution of covariates vary across different latent classes. The unobservable survival trajectories are identified by estimating the probability that a subject belongs to a particular class based on observed information. We applied this method to a Hodgkin lymphoma study with long-term follow-up and observed four distinct latent classes in terms of long-term survival and distributions of prognostic factors. Our results from simulation studies and from the Hodgkin lymphoma study demonstrated the superiority of our joint model compared with the conventional survival model. This flexible inference method provides more accurate estimation and accommodates unobservable heterogeneity among individuals while taking the interactions between covariates into consideration.
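The joint mixture formulation can be sketched generically as below: each latent class has its own covariate distribution and its own proportional-hazards survival submodel, and class membership is inferred from both. This is a schematic of the kind of framework described, not the dissertation's exact likelihood (censoring is handled in the usual way through the survival function).

```latex
% Latent class c = 1,...,C with prior probabilities \pi_c; g_c is the
% class-specific covariate distribution.
\begin{align}
  f(t_i, \mathbf{x}_i)
    &= \sum_{c=1}^{C} \pi_c \; g_c(\mathbf{x}_i)\;
       \lambda_c(t_i \mid \mathbf{x}_i)\, S_c(t_i \mid \mathbf{x}_i), \qquad
  \lambda_c(t \mid \mathbf{x}) = \lambda_{0c}(t)\,
       \exp\!\big(\boldsymbol{\beta}_c^{\top}\mathbf{x}\big), \\
  \Pr(\text{class } c \mid t_i, \mathbf{x}_i) &\propto
       \pi_c\, g_c(\mathbf{x}_i)\, \lambda_c(t_i \mid \mathbf{x}_i)\,
       S_c(t_i \mid \mathbf{x}_i).
\end{align}
```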
Abstract:
Empirical evidence and theoretical studies suggest that the phenotype, i.e., the cellular- and molecular-scale dynamics (including proliferation rate and adhesiveness) arising from microenvironmental factors and gene expression that govern tumor growth and invasiveness, also determines gross tumor-scale morphology. It has been difficult to quantify the relative effect of these links on disease progression and prognosis using conventional clinical and experimental methods and observables. As a result, successful individualized treatment of highly malignant and invasive cancers, such as glioblastoma, via surgical resection and chemotherapy cannot be offered, and outcomes are generally poor. What is needed is a deterministic, quantifiable method to enable understanding of the connections between phenotype and tumor morphology. Here, we critically assess the advantages and disadvantages of recent computational modeling efforts (e.g., continuum, discrete, and cellular automata models) that have pursued this understanding. Based on this assessment, we review a multiscale (i.e., from the molecular to the gross tumor scale) mathematical and computational "first-principle" approach based on mass conservation and other physical laws, such as employed in reaction-diffusion systems. Model variables describe known characteristics of tumor behavior, and parameters and functional relationships across scales are informed by in vitro, in vivo, and ex vivo biology. We review the feasibility of this methodology, which, once coupled to tumor imaging and tumor biopsy or cell culture data, should enable prediction of tumor growth and therapy outcome through quantification of the relation between the underlying dynamics and morphological characteristics. In particular, morphologic stability analysis of this mathematical model reveals that tumor cell patterning at the tumor-host interface is regulated by cell proliferation, adhesion, and other phenotypic characteristics: histopathology information on the tumor boundary can be input to the mathematical model and used as a phenotype-diagnostic tool to predict collective and individual tumor cell invasion of surrounding tissue. This approach further provides a means to deterministically test the effects of novel and hypothetical therapy strategies on tumor behavior.
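As a point of reference, the "first-principle" mass-conservation structure referred to above is typically written as a reaction-diffusion (or advection-reaction-diffusion) balance for the tumor cell density. The generic form below is illustrative only and omits the reviewed model's specific constitutive choices.

```latex
% n: tumor cell density; D: diffusivity (motility); u: cell velocity arising from
% pressure/adhesion effects; \lambda_M, \lambda_A: mitosis and apoptosis rates,
% possibly functions of a nutrient concentration \sigma.
\begin{equation}
  \frac{\partial n}{\partial t}
  + \nabla\!\cdot\!\big(n\,\mathbf{u}\big)
  = \nabla\!\cdot\!\big(D\,\nabla n\big)
  + \lambda_M(\sigma)\, n
  - \lambda_A(\sigma)\, n .
\end{equation}
```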
Abstract:
The factorial validity of the SF-36 was evaluated using confirmatory factor analysis (CFA) methods, structural equation modeling (SEM), and multigroup structural equation modeling (MSEM). First, the measurement and structural model of the hypothesized SF-36 was explicated. Second, the model was tested for the validity of a second-order factorial structure; upon evidence of model misfit, the best-fitting model was determined and its validity tested on a second random sample from the same population. Third, the best-fitting model was tested for invariance of the factorial structure across race, age, and educational subgroups using MSEM. The findings support the second-order factorial structure of the SF-36 as proposed by Ware and Sherbourne (1992). However, the results suggest that: (a) Mental Health and Physical Health covary; (b) general mental health cross-loads onto Physical Health; (c) general health perception loads onto Mental Health instead of Physical Health; (d) many of the error terms are correlated; and (e) the physical function scale is not reliable across these two samples. This hierarchical factor pattern was replicated across both samples of health care workers, suggesting that the post hoc model fitting was not data specific. Subgroup analysis suggests that the physical function scale is not reliable across the "age" or "education" subgroups and that the general mental health scale path from Mental Health is not reliable across the "white/nonwhite" or "education" subgroups. The importance of this study lies in the use of SEM and MSEM in evaluating sample data from the SF-36. These methods are uniquely suited to the analysis of latent variable structures and are widely used in other fields. The use of latent variable models for self-reported outcome measures has become widespread and should now be applied to medical outcomes research. Invariance testing is superior to mean scores or summary scores when evaluating differences between groups. From a practical as well as a psychometric perspective, it seems imperative that construct validity research related to the SF-36 establish whether this same hierarchical structure and invariance holds for other populations. This project is presented as three articles to be submitted for publication.
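In SEM notation, the second-order structure being tested can be sketched as follows: the observed scale scores load on two first-order factors (Physical Health, Mental Health), which in turn load on a single second-order factor. The matrix form below is the standard CFA parameterization shown for orientation, not output from this analysis.

```latex
% y: observed scale scores; \eta_1 = Physical Health, \eta_2 = Mental Health;
% \xi = second-order general health factor.
\begin{align}
  \mathbf{y} &= \boldsymbol{\Lambda}_y\,\boldsymbol{\eta} + \boldsymbol{\varepsilon},
  &\boldsymbol{\eta} &= \boldsymbol{\Gamma}\,\xi + \boldsymbol{\zeta}, \\
  \operatorname{Cov}(\mathbf{y}) &=
    \boldsymbol{\Lambda}_y
      \big(\boldsymbol{\Gamma}\,\phi\,\boldsymbol{\Gamma}^{\top}
           + \boldsymbol{\Psi}\big)
    \boldsymbol{\Lambda}_y^{\top}
    + \boldsymbol{\Theta}_{\varepsilon},
  &\phi &= \operatorname{Var}(\xi).
\end{align}
% Multigroup invariance testing constrains \Lambda_y (and optionally \Gamma and
% \Theta_\varepsilon) to be equal across race, age, and education subgroups.
```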
Abstract:
Many studies in biostatistics deal with binary data. Some of these studies involve correlated observations, which can complicate the analysis of the resulting data. Studies of this kind typically arise when a high degree of commonality exists between test subjects. If there is a natural hierarchy in the data, multilevel analysis is an appropriate tool for the analysis. Two examples are measurements on identical twins and the study of symmetrical organs or appendages, as in the case of ophthalmic studies. Although this type of matching appears ideal for the purposes of comparison, analysis of the resulting data while ignoring the effect of intra-cluster correlation has been shown to produce biased results. This paper explores the use of multilevel modeling of simulated binary data with predetermined levels of correlation. Data are generated using the Beta-Binomial method with varying degrees of correlation between the lower-level observations. The data are analyzed using the multilevel software package MLwiN (Woodhouse et al., 1995). Comparisons between the specified intra-cluster correlation of these data and the correlations estimated using multilevel analysis are used to examine the accuracy of this technique for analyzing this type of data.
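A small sketch of the Beta-Binomial data-generating step described above, using the standard relationship between the Beta parameters and the intra-cluster correlation (ICC = 1/(a+b+1)). The cluster count, cluster size, and marginal probability are arbitrary illustrative choices, and the analysis step in MLwiN is not reproduced.

```python
import numpy as np

def simulate_beta_binomial(n_clusters, cluster_size, p, icc, rng):
    """Generate clustered binary data with a specified intra-cluster correlation.

    For a Beta(a, b) cluster-level probability, the pairwise within-cluster
    correlation of the Bernoulli outcomes is 1 / (a + b + 1), so a and b are
    solved from the marginal probability p and the target ICC.
    """
    total = (1.0 - icc) / icc                  # a + b implied by the target ICC
    a, b = p * total, (1.0 - p) * total
    cluster_p = rng.beta(a, b, size=n_clusters)                      # one probability per cluster
    return rng.binomial(1, cluster_p[:, None], size=(n_clusters, cluster_size))

rng = np.random.default_rng(42)
y = simulate_beta_binomial(n_clusters=200, cluster_size=2, p=0.3, icc=0.2, rng=rng)
print(y.shape, y.mean())   # roughly 0.3 marginal prevalence
```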
Abstract:
The respiratory central pattern generator is a collection of medullary neurons that generates the rhythm of respiration. The respiratory central pattern generator feeds phrenic motor neurons, which, in turn, drive the main muscle of respiration, the diaphragm. The purpose of this thesis is to understand the neural control of respiration through mathematical models of the respiratory central pattern generator and phrenic motor neurons. We first designed and validated a Hodgkin-Huxley-type model that mimics the behavior of phrenic motor neurons under a wide range of electrical and pharmacological perturbations. This model was constrained by physiological data from the literature. Next, we designed and validated a model of the respiratory central pattern generator by connecting four Hodgkin-Huxley-type models of medullary respiratory neurons in a mutually inhibitory network. This network was in turn driven by a simple model of an endogenously bursting neuron, which acted as the pacemaker for the respiratory central pattern generator. Finally, the respiratory central pattern generator and phrenic motor neuron models were connected and their interactions studied. Our study of the models has provided a number of insights into the behavior of the respiratory central pattern generator and phrenic motor neurons. These include the suggestion of a role for the T-type and N-type calcium channels during single spikes and repetitive firing in phrenic motor neurons, as well as a better understanding of network properties underlying respiratory rhythm generation. We also utilized an existing model of lung mechanics to study the interactions between the respiratory central pattern generator and ventilation.
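A minimal Hodgkin-Huxley-type single-compartment integration, of the general kind used as a building block for such neuron models. The channel set and parameters below are the classic squid-axon values, not the thesis's phrenic motor neuron model (which additionally includes currents such as T-type and N-type calcium channels).

```python
import numpy as np

# Classic Hodgkin-Huxley parameters (squid axon); illustrative only.
C_M, G_NA, G_K, G_L = 1.0, 120.0, 36.0, 0.3          # uF/cm^2, mS/cm^2
E_NA, E_K, E_L = 50.0, -77.0, -54.4                  # mV

def alpha_beta(v):
    """Voltage-dependent rate constants for gates m, h, n."""
    am = 0.1 * (v + 40.0) / (1.0 - np.exp(-(v + 40.0) / 10.0))
    bm = 4.0 * np.exp(-(v + 65.0) / 18.0)
    ah = 0.07 * np.exp(-(v + 65.0) / 20.0)
    bh = 1.0 / (1.0 + np.exp(-(v + 35.0) / 10.0))
    an = 0.01 * (v + 55.0) / (1.0 - np.exp(-(v + 55.0) / 10.0))
    bn = 0.125 * np.exp(-(v + 65.0) / 80.0)
    return am, bm, ah, bh, an, bn

def simulate(i_inj=10.0, t_max=50.0, dt=0.01):
    """Forward-Euler integration of the membrane equation under a constant injected current."""
    v, m, h, n = -65.0, 0.05, 0.6, 0.32
    trace = []
    for _ in range(int(t_max / dt)):
        am, bm, ah, bh, an, bn = alpha_beta(v)
        m += dt * (am * (1 - m) - bm * m)
        h += dt * (ah * (1 - h) - bh * h)
        n += dt * (an * (1 - n) - bn * n)
        i_na = G_NA * m**3 * h * (v - E_NA)
        i_k = G_K * n**4 * (v - E_K)
        i_l = G_L * (v - E_L)
        v += dt * (i_inj - i_na - i_k - i_l) / C_M
        trace.append(v)
    return np.array(trace)

suprathreshold = (simulate() > 0).sum()   # crude count of samples above 0 mV
print("suprathreshold samples:", suprathreshold)
```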
Abstract:
Anticancer drugs typically are administered in the clinic in the form of mixtures, sometimes called combinations. Only in rare cases, however, are mixtures approved as drugs. Rather, research on mixtures tends to occur after single drugs have been approved. The goal of this research project was to develop modeling approaches that would encourage rational preclinical mixture design. To this end, a series of models were developed. First, several QSAR classification models were constructed to predict the cytotoxicity, oral clearance, and acute systemic toxicity of drugs. The QSAR models were applied to a set of over 115,000 natural compounds in order to identify promising ones for testing in mixtures. Second, an improved method was developed to assess synergistic, antagonistic, and additive effects between drugs in a mixture. This method, dubbed the MixLow method, is similar to the Median-Effect method, the de facto standard for assessing drug interactions. The primary difference between the two is that the MixLow method uses a nonlinear mixed-effects model to estimate the parameters of concentration-effect curves, rather than an ordinary least squares procedure. Parameter estimators produced by the MixLow method were more precise than those produced by the Median-Effect method, and coverage of Loewe index confidence intervals was superior. Third, a model was developed to predict drug interactions based on scores obtained from virtual docking experiments. This represents a novel approach for modeling drug mixtures and was more useful for the data modeled here than competing approaches. The model was applied to cytotoxicity data for 45 mixtures, each composed of up to 10 selected drugs. One drug, doxorubicin, was a standard chemotherapy agent, and the others were well-known natural compounds including curcumin, EGCG, quercetin, and rhein. Predictions of synergism/antagonism were made for all possible fixed-ratio mixtures, cytotoxicities of the 10 best-scoring mixtures were tested, and drug interactions were assessed. Predicted and observed responses were highly correlated (r² = 0.83). Results suggested that some mixtures allowed up to an 11-fold reduction of doxorubicin concentrations without sacrificing efficacy. Taken together, the models developed in this project present a general approach to the rational design of mixtures during preclinical drug development.
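For context, here is a sketch of the two ingredients being compared: a median-effect (Hill-type) concentration-effect curve fit, and the Loewe (combination) index used to call synergism or antagonism for a fixed-ratio mixture. The curve fit here is ordinary least squares on simulated data with made-up parameters; the MixLow method itself replaces this fitting step with a nonlinear mixed-effects model, which is not reproduced.

```python
import numpy as np
from scipy.optimize import curve_fit

def median_effect(d, dm, m):
    """Fraction affected under the median-effect (Hill) equation."""
    return 1.0 / (1.0 + (dm / d) ** m)

def dose_for_effect(fa, dm, m):
    """Invert the curve: single-drug dose needed to reach fraction affected fa."""
    return dm * (fa / (1.0 - fa)) ** (1.0 / m)

def loewe_index(doses_in_mixture, single_drug_params, fa):
    """Loewe index at effect level fa: <1 synergy, =1 additivity, >1 antagonism."""
    return sum(d / dose_for_effect(fa, dm, m)
               for d, (dm, m) in zip(doses_in_mixture, single_drug_params))

# Simulated single-drug data (illustrative parameters), fit by ordinary least squares.
rng = np.random.default_rng(0)
doses = np.array([0.1, 0.3, 1.0, 3.0, 10.0])
fa_obs = median_effect(doses, dm=1.5, m=1.2) + rng.normal(0, 0.02, doses.size)
(dm_hat, m_hat), _ = curve_fit(median_effect, doses, fa_obs,
                               p0=(1.0, 1.0), bounds=(1e-6, np.inf))

# Combination index for a hypothetical mixture producing fa = 0.5, assuming a
# second drug with placeholder parameters (dm=2.0, m=1.0).
ci = loewe_index([0.6, 0.6], [(dm_hat, m_hat), (2.0, 1.0)], fa=0.5)
print(round(ci, 2))
```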
Abstract:
Colorectal cancer is the fourth most commonly diagnosed cancer in the United States. Every year, about one hundred forty-seven thousand people are diagnosed with colorectal cancer and fifty-six thousand people lose their lives to this disease. Most hereditary nonpolyposis colorectal cancers (HNPCC) and 12% of sporadic colorectal cancers show microsatellite instability. Colorectal cancer is a multistep progressive disease. It starts from a mutation in a normal colorectal cell and grows into a clone of cells that further accumulates mutations and finally develops into a malignant tumor. In terms of molecular evolution, the process of colorectal tumor progression represents the acquisition of sequential mutations. Clinical studies use biomarkers such as microsatellites or single nucleotide polymorphisms (SNPs) to study mutation frequencies in colorectal cancer. Microsatellite data obtained from single-genome-equivalent PCR or small-pool PCR can be used to infer tumor progression. Since tumor progression is similar to population evolution, we used an approach known as the coalescent, which is well established in population genetics, to analyze this type of data. Coalescent theory can be used to infer a sample's evolutionary path through the analysis of microsatellite data. The simulation results indicate that the constant population size pattern and the rapid tumor growth pattern have different genetic polymorphism patterns. The simulation results were compared with experimental data collected from HNPCC patients. The preliminary results show that the mutation rate in 6 HNPCC patients ranges from 0.001 to 0.01. The patients' polymorphism patterns are similar to the constant population size pattern, which implies that tumor progression occurs through multilineage persistence instead of clonal sequential evolution. These results should be further verified using a larger dataset.
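A toy sketch of the simulation idea in the constant-population-size case: coalesce a sample of lineages under the standard Kingman coalescent, drop symmetric stepwise microsatellite mutations on the branches, and summarize the resulting polymorphism pattern. Sample size, mutation rate, and the starting repeat number are arbitrary illustrative values, and the rapid-growth scenario is not implemented here.

```python
import numpy as np

def coalescent_alleles(n, theta, rng, start_repeat=20):
    """Microsatellite alleles for a sample of size n under the constant-size
    Kingman coalescent with a symmetric stepwise mutation model."""
    lineages = [{i} for i in range(n)]       # sampled tips descending from each active lineage
    heights = [0.0] * n                      # age of each active lineage
    offsets = np.zeros(n)                    # net repeat-unit change per sampled tip
    t = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        t += rng.exponential(2.0 / (k * (k - 1)))          # E[T_k] = 2 / (k(k-1))
        i, j = sorted(rng.choice(k, size=2, replace=False))
        for idx in (i, j):                                 # mutations on each child branch
            n_mut = rng.poisson(theta / 2.0 * (t - heights[idx]))
            step = rng.choice([-1, 1], size=n_mut).sum() if n_mut else 0
            offsets[list(lineages[idx])] += step
        merged = lineages[i] | lineages[j]
        for idx in (j, i):                                 # remove the higher index first
            lineages.pop(idx)
            heights.pop(idx)
        lineages.append(merged)
        heights.append(t)
    return start_repeat + offsets

rng = np.random.default_rng(3)
alleles = coalescent_alleles(n=20, theta=2.0, rng=rng)
print("distinct alleles:", len(set(alleles)), "range:", alleles.min(), alleles.max())
```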
Abstract:
Hodgkin's disease (HD) is a cancer of the lymphatic system. Survivors of HD face a variety of consequent adverse effects, of which second primary tumors (SPT) are among the most serious. This dissertation aims to model time-to-SPT in the presence of death and HD relapses during follow-up. The model is designed to handle a mixture phenomenon of SPT and the influence of death. Relapses of HD are adjusted for as a covariate. A proportional hazards framework is used to define the SPT intensity function, which includes an exponential term for the explanatory variables. Death as a competing risk is considered according to different scenarios, depending on which terminal event comes first. The Newton-Raphson method is used to obtain the parameter estimates. The proposed method is applied to a real data set containing a group of HD patients. Several risk factors for the development of SPT are identified, and the findings are noteworthy for the development of healthcare guidelines that may lead to the early detection or prevention of SPT.
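Schematically, an intensity of the kind described (proportional hazards with an exponential covariate term, with relapse entering as a covariate) can be written as below. The competing-risk handling of death and the mixture component for patients who never develop SPT are described in the text and omitted from this sketch.

```latex
% Intensity of SPT for subject i, with relapse status R_i(t) as a
% (possibly time-dependent) covariate and baseline intensity \lambda_0(t):
\begin{equation}
  \lambda_i(t \mid \mathbf{Z}_i)
  = \lambda_0(t)\,
    \exp\!\big(\boldsymbol{\beta}^{\top}\mathbf{Z}_i + \gamma\, R_i(t)\big).
\end{equation}
```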
Abstract:
This dissertation examined body mass index (BMI) growth trajectories and the effects of gender, ethnicity, dietary intake, and physical activity (PA) on BMI growth trajectories among 3rd to 12th graders (9-18 years of age). Growth curve model analysis was performed using data from the Child and Adolescent Trial for Cardiovascular Health (CATCH) study. The study population included 2909 students who were followed from grades 3-12. The main outcome was BMI at grades 3, 4, 5, 8, and 12. The results revealed that BMI growth differed across two distinct developmental periods of childhood and adolescence. The rate of BMI growth was faster in middle childhood (9-11 years old, or 3rd-5th grades) than in adolescence (11-18 years old, or 5th-12th grades). Students with higher BMI at 3rd grade (baseline) had faster rates of BMI growth. Three groups of students with distinct BMI growth trajectories were identified: high, average, and low. Black and Hispanic children were more likely to be in the groups with higher baseline BMI and faster rates of BMI growth over time. The effects of gender or ethnicity on BMI growth differed across the three groups. The effects of ethnicity on BMI growth weakened as the children aged. The effects of gender on BMI growth were attenuated in the groups with a large proportion of black and Hispanic children, i.e., the "high" or "average" BMI trajectory groups. After controlling for gender, ethnicity, and age at baseline, in the "high" BMI trajectory group, the rate of yearly BMI growth in middle childhood increased by 0.102 for every 500 kcal increase in energy intake (p=0.049). No significant effects of the percentage of energy from total fat or saturated fat on BMI growth were found. Baseline BMI increased by 0.041 for every 30-minute increase in moderate-to-vigorous PA (MVPA) in the "low" BMI trajectory group, while baseline BMI decreased by 0.345 for every 30-minute increase in vigorous PA (VPA) in the "high" BMI trajectory group. Childhood overweight and obesity interventions should start at the earliest possible ages, prior to 3rd grade, and continue through grade school. Interventions should focus on all children, but especially black and Hispanic children, who are most likely to be at highest risk. Promoting VPA earlier in childhood is important for preventing overweight and obesity among children and adolescents. Interventions should target total energy intake, rather than only the percentage of energy from total fat or saturated fat.
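Schematically, the two-period growth structure described above corresponds to a piecewise linear growth curve with child-level random effects. The parameterization below (a single knot at age 11, with covariates collected in x) is shown only to fix ideas and is not the study's exact model.

```latex
% BMI of child i at age t, with a knot at age 11 separating middle childhood
% from adolescence; b_{0i}, b_{1i}, b_{2i} are child-level random effects and
% x_i collects covariates such as gender, ethnicity, diet, and activity.
\begin{equation}
  \mathrm{BMI}_{i}(t)
  = \beta_0 + b_{0i}
  + (\beta_1 + b_{1i})\,\min(t, 11)
  + (\beta_2 + b_{2i})\,\max(t - 11, 0)
  + \boldsymbol{\gamma}^{\top}\mathbf{x}_i
  + \varepsilon_{i}(t).
\end{equation}
```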
Abstract:
Radiotherapy has been a method of choice in cancer treatment for a number of years. Mathematical modeling is an important tool for studying the survival behavior of any cell as well as its radiosensitivity. One particular cell under investigation is the normal T-cell, the radiosensitivity of which may be indicative of the patient's tolerance to radiation doses. The model derived is a compound branching process with a random initial population of T-cells that is assumed to follow a compound distribution. T-cells in any generation are assumed to double or die at random lengths of time. This population is assumed to undergo a random number of generations within a period of time. The model is then used to obtain an estimate of the survival probability of T-cells for the data under investigation. This estimate is derived iteratively by applying the likelihood principle. Further assessment of the validity of the model is performed by simulating a number of subjects under this model. This study shows that there is a great deal of variation in T-cell survival from one individual to another. These variations can be observed under normal conditions as well as under radiotherapy. The findings are in agreement with a recent study and show that genetic diversity plays a role in determining the survival of T-cells.
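A small simulation sketch in the spirit of the model described: each T-cell independently waits an exponential length of time and then either divides or dies, the process runs for a fixed observation period, and survival is summarized over repeated runs. The initial population size, rates, and division probability here are arbitrary placeholders, not the study's estimates, and the likelihood-based estimation step is not reproduced.

```python
import heapq
import random

def simulate_clone(p_divide=0.55, rate=1.0, t_max=5.0, n0=10, rng=random):
    """Continuous-time branching process: each cell divides (prob p_divide) or dies."""
    events = [(rng.expovariate(rate), i) for i in range(n0)]   # (event time, cell id)
    heapq.heapify(events)
    alive, next_id = n0, n0
    while events:
        t, _ = heapq.heappop(events)
        if t > t_max:                                          # observation period over
            break
        alive -= 1                                             # this cell resolves its fate
        if rng.random() < p_divide:                            # division -> two daughters
            for _ in range(2):
                heapq.heappush(events, (t + rng.expovariate(rate), next_id))
                next_id += 1
                alive += 1
    return alive

random.seed(1)
runs = [simulate_clone() for _ in range(200)]
print("fraction of clones surviving to t_max:",
      sum(r > 0 for r in runs) / len(runs))
```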
Abstract:
The infant mortality rate (IMR) is considered one of the most important indices of a country's well-being. Countries around the world and health organizations such as the World Health Organization are dedicating their resources, knowledge, and energy to reducing infant mortality rates. The well-known Millennium Development Goal 4 (MDG 4), whose aim is to achieve a two-thirds reduction in the under-five mortality rate between 1990 and 2015, is an example of this commitment. In this study, our goal is to model the trends in IMR from the 1950s to the 2010s for selected countries. We would like to know how the IMR is changing over time and how it differs across countries. IMR data collected over time form a time series, and the repeated observations of an IMR time series are not statistically independent, so in modeling the trend of IMR it is necessary to account for these correlations. We propose to use the generalized least squares method in a general linear models setting to deal with the variance-covariance structure in our model. In order to estimate the variance-covariance matrix, we refer to time-series models, especially the autoregressive and moving average models. Furthermore, we compare results from the general linear model with a correlation structure to those from the ordinary least squares method, which does not take the correlation structure into account, to check how significantly the estimates change.
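A minimal sketch of the estimation strategy described: generalized least squares for a linear trend when the errors follow an AR(1) process, with the variance-covariance matrix built from the autoregressive parameter, compared against ordinary least squares. The trend specification, AR(1) coefficient, and simulated data are illustrative assumptions, not the study's IMR series.

```python
import numpy as np

def gls_ar1(y, X, rho):
    """GLS estimate beta = (X' V^-1 X)^-1 X' V^-1 y with AR(1) correlation V_ij = rho^|i-j|."""
    n = len(y)
    V = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    Vinv = np.linalg.inv(V)
    return np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

# Simulated IMR-like declining trend with AR(1) errors (illustrative only).
rng = np.random.default_rng(7)
years = np.arange(1950, 2011)
e = np.zeros(years.size)
for i in range(1, years.size):
    e[i] = 0.7 * e[i - 1] + rng.normal(0, 1.5)
y = 120.0 - 1.4 * (years - 1950) + e            # infant deaths per 1,000 live births
X = np.column_stack([np.ones_like(years, dtype=float), years - 1950])

print("GLS (AR1) slope:", round(gls_ar1(y, X, rho=0.7)[1], 3))
print("OLS slope:      ", round(np.linalg.lstsq(X, y, rcond=None)[0][1], 3))
```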