7 results for Probabilistic latent semantic analysis (PLSA)
in DigitalCommons@The Texas Medical Center
Abstract:
OBJECTIVE: To characterize PubMed usage over a typical day and compare it to previous studies of user behavior on Web search engines. DESIGN: We performed a lexical and semantic analysis of 2,689,166 queries issued on PubMed over 24 consecutive hours on a typical day. MEASUREMENTS: We measured the number of queries, number of distinct users, queries per user, terms per query, common terms, Boolean operator use, common phrases, result set size, MeSH categories, used semantic measurements to group queries into sessions, and studied the addition and removal of terms from consecutive queries to gauge search strategies. RESULTS: The size of the result sets from a sample of queries showed a bimodal distribution, with peaks at approximately 3 and 100 results, suggesting that a large group of queries was tightly focused and another was broad. Like Web search engine sessions, most PubMed sessions consisted of a single query. However, PubMed queries contained more terms. CONCLUSION: PubMed's usage profile should be considered when educating users, building user interfaces, and developing future biomedical information retrieval systems.
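Two of the lexical measurements listed above, terms per query and Boolean operator use, can be sketched on a toy query log as follows. This is a minimal illustration only: whitespace tokenization, the uppercase-only operator set, the function name query_stats, and the sample queries are all assumptions, since the abstract does not describe the study's actual processing.

```python
from statistics import mean

# Assumption: only uppercase AND/OR/NOT count as Boolean operators.
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}

def query_stats(queries):
    """Return mean terms per query and the share of queries using a Boolean operator."""
    term_counts = []
    boolean_queries = 0
    for query in queries:
        tokens = query.split()  # naive whitespace tokenization (assumption)
        operators = [t for t in tokens if t in BOOLEAN_OPERATORS]
        terms = [t for t in tokens if t not in BOOLEAN_OPERATORS]
        term_counts.append(len(terms))
        if operators:
            boolean_queries += 1
    return {
        "mean_terms_per_query": mean(term_counts),
        "boolean_share": boolean_queries / len(queries),
    }

# Hypothetical queries, not taken from the study's log.
sample = ["breast cancer AND tamoxifen", "asthma", "diabetes mellitus type 2"]
stats = query_stats(sample)
print(stats)
```

A real log would also need handling for quoted phrases and field tags, which this sketch ignores.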
Abstract:
Studies on the relationship between psychosocial determinants and HIV risk behaviors have produced little evidence to support hypotheses based on theoretical relationships. One limitation inherent in many articles in the literature is the method of measurement of the determinants and the analytic approach selected. To reduce the misclassification associated with unit scaling of measures specific to internalized homonegativity, I evaluated the psychometric properties of the Reactions to Homosexuality scale in a confirmatory factor analytic framework. In addition, I assessed the measurement invariance of the scale across racial/ethnic classifications in a sample of men who have sex with men. The resulting measure contained eight items loading on three first-order factors. Invariance assessment identified metric and partial strong invariance between racial/ethnic groups in the sample. Application of the updated measure to a structural model allowed for the exploration of direct and indirect effects of internalized homonegativity on unprotected anal intercourse. Pathways identified in the model show that drug and alcohol use at last sexual encounter, the number of sexual partners in the previous three months and sexual compulsivity all contribute directly to risk behavior. Internalized homonegativity reduced the likelihood of exposure to drugs, alcohol or higher numbers of partners. For men who developed compulsive sexual behavior as a coping strategy for internalized homonegativity, there was an increase in the prevalence odds of risk behavior. In the final stage of the analysis, I conducted a latent profile analysis of the items in the updated Reactions to Homosexuality scale. This analysis identified five distinct profiles, which suggested that the construct was not homogeneous in samples of men who have sex with men.
Lack of prior consideration of these distinct manifestations of internalized homonegativity may have contributed to the analytic difficulty in identifying a relationship between the trait and high-risk sexual practices.
Abstract:
Injection drug use is the third most frequent risk factor for new HIV infections in the United States. A dual mode of exposure, unsafe drug-using practices and risky sexual behaviors, underlies injection drug users' (IDUs) risk for HIV infection. This research study aims to characterize patterns of drug use and sexual behaviors and to examine the social contexts associated with risk behaviors among a sample of injection drug users. This cross-sectional study includes 523 eligible injection drug users from Houston, Texas, recruited into the 2009 National HIV Behavioral Surveillance project. Three separate sets of analyses were carried out. First, using latent class analysis (LCA) and maximum likelihood, we identified classes of behavior describing levels of HIV risk from nine drug and sexual behaviors. Second, eight separate multivariable regression models were built to examine the odds of reporting a given risk behavior. We constructed the most parsimonious multivariable model using a manual backward stepwise process. Third, we examined whether HIV serostatus knowledge (self-reported positive, negative, or unknown serostatus) is associated with drug use and sexual HIV risk behaviors. Participants were mostly male, older, and non-Hispanic Black. Forty-two percent of our sample had behaviors putting them at high risk, 25% at moderate risk, and 33% at low risk for HIV infection. Individuals in the High-risk group had the highest probability of risky behaviors, categorized as almost always sharing needles (0.93), seldom using condoms (0.10), reporting recent exchange sex partners (0.90), and practicing anal sex (0.34). We observed that unsafe injecting practices were associated with high-risk sexual behaviors. IDUs who shared needles had higher odds of having anal sex (OR=2.89, 95% CI: 1.69-4.92) and unprotected sex (OR=2.66, 95% CI: 1.38-5.10) at last sex.
Additionally, homelessness was associated with needle sharing (OR=2.24, 95% CI: 1.34-3.76) and cocaine use was associated with multiple sex partners (OR=1.82, 95% CI: 1.07-3.11). Furthermore, twenty-one percent of the sample was unaware of their HIV serostatus. The three serostatus groups did not differ from each other in drug-use behaviors: always using a new sterile needle, sharing needles, or sharing drug preparation equipment. However, IDUs unaware of their HIV serostatus were 33% more likely to report having more than three sexual partners in the past 12 months, 45% more likely to report having unprotected sex, and 85% more likely to use drugs and/or alcohol during or before last sex compared to HIV-positive IDUs. This analysis underscores the merit of the LCA approach for empirically categorizing injection drug users into distinct classes and identifying their risk patterns using multiple indicators. Our results show considerable overlap of high-risk sexual and drug use behaviors among the high-risk class members. The observed clustering pattern of drug and sexual risk behavior among this population confirms that injection drug users do not represent a homogeneous population in terms of HIV risk. These findings will help develop tailored prevention programs.
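For readers unfamiliar with how an odds ratio and its 95% confidence interval are obtained, the following sketch computes an unadjusted odds ratio from a 2x2 table with a Wald interval on the log scale. It is not the study's method: the abstract reports estimates from multivariable regression models, and the counts and the function name odds_ratio_ci below are hypothetical.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    a: exposed with outcome,   b: exposed without outcome,
    c: unexposed with outcome, d: unexposed without outcome.
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts, not data from the study.
estimate, lower, upper = odds_ratio_ci(40, 60, 20, 80)
print(f"OR={estimate:.2f}, 95% CI: {lower:.2f}-{upper:.2f}")
```

An interval that excludes 1, like each of those reported in the abstract, indicates an association at the 5% significance level.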
Abstract:
The Soehner-Dmochowski strain of murine sarcoma virus (MuSV-SD) was derived from a bone tumor of a New Zealand Black (NZB) rat infected with the Moloney strain of MuSV, which carries the gene encoding the v-mos protein. Serial passage of cell-free tumor extracts both decreased the latent period and resulted in osteosarcomas. Cells from a late passage tumor were established in culture, cell-free extracts frozen, and later inoculated into newborn NZB rats. One of the resulting bone tumors was established in culture and clonal cell lines derived, of which S4 was selected for the present study. The objectives of the study were two-fold: an examination of the genetic organization of MuSV-SD, and an examination of the biochemical characteristics of the viral proteins, since this is an acutely transforming virus which may yield insights into the mechanism of transformation caused by the v-mos protein. Blot hybridization of digested S4 genomic DNA reveals three candidate MuSV-SD integrated viral DNAs. The largest of these, MuSV-SD-6.5, was cloned from an S4 cosmid library, and the complete MuSV-SD-mos sequence was determined. The predicted amino acid sequence of the v-mos protein was compared to that of MuSV-124 and Ht-1, which show a 96.5% and 97.1% similarity, respectively. To characterize the MuSV-SD-mos protein further, immunochemical assays were performed using anti-mos antisera. The immunoblot analysis and immunoprecipitation assays demonstrated that similar levels of the v-mos protein were present in cells chronically infected with either MuSV-SD or MuSV-124; however, the immune complex kinase assay revealed greatly reduced in vitro serine kinase activity of the MuSV-SD-mos protein compared to that of MuSV-124. Sequence analysis demonstrated that the serine at amino acid residue 358 of the MuSV-SD-mos protein, like that of MuSV-Ht-1, had been mutated to a glycine. 
Mutations of this serine residue have been shown to affect the detectable in vitro kinase activity; however, v-mos proteins containing this mutation still retain transforming properties. Therefore, although the characteristic in vitro kinase activity of the MuSV-SD-mos protein has not been demonstrated, it is clear that this virus is a potent transforming agent.
Abstract:
Mixture modeling is commonly used to model categorical latent variables that represent subpopulations in which population membership is unknown but can be inferred from the data. In recent years, finite mixture models have been applied to time-to-event data. However, the commonly used survival mixture model assumes that the effects of the covariates involved in failure times differ across latent classes, but the covariate distribution is homogeneous. The aim of this dissertation is to develop a method to examine time-to-event data in the presence of unobserved heterogeneity under a framework of mixture modeling. A joint model is developed to incorporate the latent survival trajectory along with the observed information for the joint analysis of a time-to-event variable, its discrete and continuous covariates, and a latent class variable. It is assumed that the effects of covariates on survival times and the distribution of covariates vary across different latent classes. The unobservable survival trajectories are identified through estimating the probability that a subject belongs to a particular class based on observed information. We applied this method to a Hodgkin lymphoma study with long-term follow-up and observed four distinct latent classes in terms of long-term survival and distributions of prognostic factors. Our results from simulation studies and from the Hodgkin lymphoma study demonstrated the superiority of our joint model compared with the conventional survival model. This flexible inference method provides more accurate estimation and accommodates unobservable heterogeneity among individuals while taking involved interactions between covariates into consideration.
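The step of estimating the probability that a subject belongs to a particular class can be illustrated with Bayes' rule in a deliberately simplified survival mixture. The sketch assumes exponential survival within each class and omits the covariate distributions that the dissertation's joint model includes. The function name class_posteriors, the class weights, and the hazards are hypothetical.

```python
import math

def class_posteriors(time, event, class_weights, hazards):
    """Posterior class membership under a toy exponential survival mixture.

    time: observed follow-up time; event: True if failure was observed,
    False if the subject was censored at `time`.
    """
    weighted = []
    for weight, hazard in zip(class_weights, hazards):
        survival = math.exp(-hazard * time)
        # An observed event contributes the density; censoring contributes
        # the survival function.
        contribution = hazard * survival if event else survival
        weighted.append(weight * contribution)
    total = sum(weighted)
    return [value / total for value in weighted]

# Hypothetical two-class example: a low-hazard and a high-hazard class.
posteriors = class_posteriors(
    time=10.0, event=False, class_weights=[0.5, 0.5], hazards=[0.01, 0.2]
)
print(posteriors)
```

A subject still event-free at time 10 is assigned mostly to the low-hazard class. In a full model these posteriors would typically be recomputed at each iteration of the estimation procedure.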
Abstract:
Clinical text understanding (CTU) is of interest to health informatics because critical clinical information, frequently represented as unconstrained text in electronic health records, is extensively used by human experts to guide clinical practice and decision making and to document delivery of care, but is largely unusable by information systems for queries and computations. Recent initiatives advocating for translational research call for the generation of technologies that can integrate structured clinical data with unstructured data, provide a unified interface to all data, and contextualize clinical information for reuse in the multidisciplinary and collaborative environment envisioned by the CTSA program. This implies that technologies for the processing and interpretation of clinical text should be evaluated not only in terms of their validity and reliability in their intended environment, but also in light of their interoperability and their ability to support information integration and contextualization in a distributed and dynamic environment. This vision adds a new layer of information representation requirements that need to be accounted for when conceptualizing the implementation or acquisition of clinical text processing tools and technologies for multidisciplinary research. On the other hand, electronic health records frequently contain unconstrained clinical text with high variability in the use of terms and documentation practices, and without commitment to the grammatical or syntactic structure of the language (e.g., triage notes, physician and nurse notes, and chief complaints). This hinders the performance of natural language processing technologies, which typically rely heavily on the syntax of language and the grammatical structure of the text.
This document introduces our method to transform unconstrained clinical text found in electronic health information systems to a formal (computationally understandable) representation that is suitable for querying, integration, contextualization and reuse, and is resilient to the grammatical and syntactic irregularities of the clinical text. We present our design rationale, method, and results of evaluation in processing chief complaints and triage notes from 8 different emergency departments in Houston, Texas. Finally, we discuss the significance of our contribution in enabling the use of clinical text in a practical bio-surveillance setting.
Abstract:
The first manuscript, entitled "Time-Series Analysis as Input for Clinical Predictive Modeling: Modeling Cardiac Arrest in a Pediatric ICU" lays out the theoretical background for the project. There are several core concepts presented in this paper. First, traditional multivariate models (where each variable is represented by only one value) provide single point-in-time snapshots of patient status: they are incapable of characterizing deterioration. Since deterioration is consistently identified as a precursor to cardiac arrests, we maintain that the traditional multivariate paradigm is insufficient for predicting arrests. We identify time series analysis as a method capable of characterizing deterioration in an objective, mathematical fashion, and describe how to build a general foundation for predictive modeling using time series analysis results as latent variables. Building a solid foundation for any given modeling task involves addressing a number of issues during the design phase. These include selecting the proper candidate features on which to base the model, and selecting the most appropriate tool to measure them. We also identified several unique design issues that are introduced when time series data elements are added to the set of candidate features. One such issue is in defining the duration and resolution of time series elements required to sufficiently characterize the time series phenomena being considered as candidate features for the predictive model. Once the duration and resolution are established, there must also be explicit mathematical or statistical operations that produce the time series analysis result to be used as a latent candidate feature. In synthesizing the comprehensive framework for building a predictive model based on time series data elements, we identified at least four classes of data that can be used in the model design. 
The first two classes are shared with traditional multivariate models: multivariate data and clinical latent features. Multivariate data is represented by the standard one-value-per-variable paradigm and is widely employed in a host of clinical models and tools. These are often represented by a number present in a given cell of a table. Clinical latent features are derived, rather than directly measured, data elements that more accurately represent a particular clinical phenomenon than any of the directly measured data elements in isolation. The second two classes are unique to the time series data elements. The first of these is the raw data elements. These are represented by multiple values per variable, and constitute the measured observations that are typically available to end users when they review time series data. These are often represented as dots on a graph. The final class of data results from performing time series analysis. This class of data represents the fundamental concept on which our hypothesis is based. The specific statistical or mathematical operations are up to the modeler to determine, but we generally recommend that a variety of analyses be performed in order to maximize the likelihood that a representation of the time series data elements is produced that is able to distinguish between two or more classes of outcomes. The second manuscript, entitled "Building Clinical Prediction Models Using Time Series Data: Modeling Cardiac Arrest in a Pediatric ICU" provides a detailed description, start to finish, of the methods required to prepare the data, build, and validate a predictive model that uses the time series data elements determined in the first paper. One of the fundamental tenets of the second paper is that manual implementations of time series based models are unfeasible due to the relatively large number of data elements and the complexity of preprocessing that must occur before data can be presented to the model.
Each of the seventeen steps is analyzed from the perspective of how it may be automated, when necessary. We identify the general objectives and available strategies of each of the steps, and we present our rationale for choosing a specific strategy for each step in the case of predicting cardiac arrest in a pediatric intensive care unit. Another issue brought to light by the second paper is that the individual steps required to use time series data for predictive modeling are more numerous and more complex than those used for modeling with traditional multivariate data. Even after complexities attributable to the design phase (addressed in our first paper) have been accounted for, the management and manipulation of the time series elements (the preprocessing steps in particular) are issues that are not present in a traditional multivariate modeling paradigm. In our methods, we present the issues that arise from the time series data elements: defining a reference time; imputing and reducing time series data in order to conform to a predefined structure that was specified during the design phase; and normalizing variable families rather than individual variable instances. The final manuscript, entitled: "Using Time-Series Analysis to Predict Cardiac Arrest in a Pediatric Intensive Care Unit" presents the results that were obtained by applying the theoretical construct and its associated methods (detailed in the first two papers) to the case of cardiac arrest prediction in a pediatric intensive care unit. Our results showed that utilizing the trend analysis from the time series data elements reduced the number of classification errors by 73%. The area under the Receiver Operating Characteristic curve increased from a baseline of 87% to 98% by including the trend analysis. 
In addition to the performance measures, we were also able to demonstrate that adding raw time series data elements without their associated trend analyses improved classification accuracy as compared to the baseline multivariate model, but diminished classification accuracy as compared to when just the trend analysis features were added (i.e., without adding the raw time series data elements). We believe this phenomenon was largely attributable to overfitting, which is known to increase as the ratio of candidate features to class examples rises. Furthermore, although we employed several feature reduction strategies to counteract the overfitting problem, they failed to improve the performance beyond that which was achieved by exclusion of the raw time series elements. Finally, our data demonstrated that pulse oximetry and systolic blood pressure readings tend to start diminishing about 10-20 minutes before an arrest, whereas heart rates tend to diminish rapidly less than 5 minutes before an arrest.
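One simple way to turn a window of raw time series readings into a single trend feature is an ordinary least-squares slope, sketched below. The abstract does not name the specific trend analyses used, so the choice of a linear slope, the function name trend_slope, the window, and the readings are all assumptions made for illustration.

```python
def trend_slope(times, values):
    """Least-squares slope of values against times (units of value per unit time)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    return sxy / sxx

# Hypothetical pulse oximetry readings (percent) at minutes relative to a
# reference time; these are not data from the study.
minutes = [-20, -15, -10, -5, 0]
spo2 = [98, 97, 95, 92, 90]
slope = trend_slope(minutes, spo2)
print(f"SpO2 trend: {slope:.2f} percentage points per minute")
```

The resulting scalar, here a decline of 0.42 percentage points per minute, can sit alongside conventional one-value-per-variable inputs, which is what lets a model register deterioration rather than a single snapshot.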