11 results for Topic modeling

in Deakin Research Online - Australia


Relevance:

100.00%

Publisher:

Abstract:

Although the randomized controlled trial is the gold standard in medical research, researchers are increasingly looking to alternative data sources for hypothesis generation and early-stage evidence collection. Coded clinical data are collected routinely in most hospitals. While they contain rich information directly related to the real clinical setting, they are both noisy and semantically diverse, making them difficult to analyze with conventional statistical tools. This paper presents a novel application of Bayesian nonparametric modeling to uncover latent information in coded clinical data. For a patient cohort, a Bayesian nonparametric model is used to reveal the common comorbidity groups shared by the patients and the proportion in which each comorbidity group is reflected in each individual patient. To demonstrate the method, we present a case study based on hospitalization coding from an Australian hospital. The model recovered 15 comorbidity groups among 1012 patients hospitalized during a month. When patients from two areas of unequal socio-economic status were compared, the model revealed a higher prevalence of diverticular disease in the region of lower socio-economic status. The study builds a convincing case for using routine coded data to speed up hypothesis generation.

Relevance:

60.00%

Publisher:

Relevance:

60.00%

Publisher:

Abstract:

Significant world events often cause a behavioral convergence in the expression of shared sentiment. This paper examines the use of the blogosphere as a framework to study users' psychological behaviors, using their sentiment responses as a form of ‘sensor’ to infer real-world events of importance automatically. We formulate a novel temporal sentiment index function using a quantitative measure of the valence values of affect-bearing words in blog posts, where the set of affect-bearing words is inspired by psychological research on emotion structure. The annual local minima and maxima of the proposed sentiment signal function are used to extract significant events of the year, and the corresponding blog posts are further analyzed using topic modeling tools to understand their content. The paper then examines how the discovered topics correlate with world news events reported by the mainstream news service provider Cable News Network and found through the Google search engine. Next, aiming to understand sentiment at a finer granularity over time, we propose a stochastic burst detection model, extended from the work of Kleinberg, that works incrementally with stream data. The proposed model is then used to extract sentiment bursts occurring within a specific mood label (for example, a burst in the mood ‘shocked’). The blog posts at those time indices are analyzed to extract topics, and these are compared to real-world news events. Our comprehensive set of experiments, conducted on a large-scale set of 12 million posts from LiveJournal, shows that the proposed sentiment index function coincides well with significant world events, while bursts in sentiment allow us to locate finer-grained external world events.

Relevance:

60.00%

Publisher:

Abstract:

Users often have additional knowledge when Bayesian nonparametric (BNP) models are employed. For clustering, for example, there may be prior knowledge that some of the data instances should be in the same cluster (must-link constraint) or in different clusters (cannot-link constraint); similarly, for topic modeling, some words should be grouped together or kept apart because of their underlying semantics. This can be achieved by imposing appropriate sampling probabilities based on such constraints. However, the traditional inference technique for BNP models, Gibbs sampling, is time consuming and does not scale to large data. Variational approximations are faster but often do not offer good solutions. To address this, we present a small-variance asymptotic analysis of the MAP estimates of BNP models with constraints. We derive the objective function for the Dirichlet process mixture model with constraints and devise a simple and efficient K-means-type algorithm. We further extend the small-variance analysis to hierarchical BNP models with constraints and devise a similarly simple objective function. Experiments on synthetic and real data sets demonstrate the efficiency and effectiveness of our algorithms.
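
For orientation, the kind of K-means-type algorithm that a small-variance analysis of a Dirichlet process mixture typically yields can be sketched as follows. This is the unconstrained variant commonly called DP-means, not the constrained algorithm of the abstract: the must-link and cannot-link terms are omitted, and the function name `dp_means` and the penalty parameter `lam` are assumptions chosen for illustration.

```python
def dp_means(points, lam, max_iter=100):
    """Unconstrained DP-means sketch (illustrative, not the paper's method).

    Behaves like K-means, except that a point opens a new cluster whenever
    its squared distance to every existing centroid exceeds `lam`.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    dim = len(points[0])
    # Start with one cluster whose centroid is the global mean.
    centroids = [tuple(sum(p[d] for p in points) / len(points) for d in range(dim))]
    labels = [0] * len(points)

    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            dists = [sqdist(p, c) for c in centroids]
            k = min(range(len(centroids)), key=dists.__getitem__)
            if dists[k] > lam:
                # Too far from every centroid: open a new cluster at this point.
                centroids.append(tuple(p))
                k = len(centroids) - 1
            if labels[i] != k:
                labels[i] = k
                changed = True

        # Recompute centroids as cluster means and drop empty clusters.
        members = {}
        for lbl, p in zip(labels, points):
            members.setdefault(lbl, []).append(p)
        order = sorted(members)
        remap = {old: new for new, old in enumerate(order)}
        centroids = [
            tuple(sum(q[d] for q in members[o]) / len(members[o]) for d in range(dim))
            for o in order
        ]
        labels = [remap[lbl] for lbl in labels]
        if not changed:
            break
    return labels, centroids
```

A larger `lam` makes opening a cluster more expensive and so produces fewer clusters; the number of clusters is not fixed in advance, which is the nonparametric aspect the sketch is meant to show.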

Relevance:

60.00%

Publisher:

Abstract:

The Electronic Medical Record (EMR) has established itself as a valuable resource for large-scale analysis of health data. A hospital EMR dataset typically consists of medical records of hospitalized patients. A medical record contains diagnostic information (diagnosis codes), procedures performed (procedure codes) and admission details. Traditional topic models, such as latent Dirichlet allocation (LDA) and the hierarchical Dirichlet process (HDP), can be employed to discover disease topics from EMR data by treating patients as documents and diagnosis codes as words. This topic modeling helps to understand the constitution of patient diseases and offers a tool for better planning of treatment. In this paper, we propose a novel and flexible hierarchical Bayesian nonparametric model, the word distance dependent Chinese restaurant franchise (wddCRF), which incorporates word-to-word distances to discover semantically coherent disease topics. We are motivated by the fact that diagnosis codes are connected in the form of the ICD-10 tree structure, which represents semantic relationships between codes. We exploit a decay function to incorporate distances between words at the bottom level of the wddCRF. Efficient inference is derived for the wddCRF using an MCMC technique. Furthermore, since procedure codes are often correlated with diagnosis codes, we develop the correspondence wddCRF (Corr-wddCRF) to explore conditional relationships of procedure codes for a given disease pattern. Efficient collapsed Gibbs sampling is derived for the Corr-wddCRF. We evaluate the proposed models on two real-world medical datasets: PolyVascular disease and Acute Myocardial Infarction disease. We demonstrate that the Corr-wddCRF model discovers more coherent topics than the Corr-HDP. We also use disease topic proportions as new features and show that features from the Corr-wddCRF outperform the baselines on 14-day readmission prediction. In addition, the prediction of procedure codes based on the Corr-wddCRF also shows considerable accuracy.
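
The "patients as documents, diagnosis codes as words" framing can be made concrete with a small sketch that turns per-patient code lists into a bag-of-codes count matrix, the usual input to a topic model such as LDA. The function name `build_code_matrix`, the patient identifiers, and the toy records are assumptions for illustration; the wddCRF itself is not implemented here.

```python
from collections import Counter

def build_code_matrix(patient_codes):
    """Build a patient-by-code count matrix (hypothetical helper).

    patient_codes maps a patient id to the list of diagnosis codes recorded
    for that patient; repeated codes are counted like repeated words.
    """
    vocab = sorted({code for codes in patient_codes.values() for code in codes})
    index = {code: j for j, code in enumerate(vocab)}
    matrix = {}
    for pid, codes in patient_codes.items():
        row = [0] * len(vocab)
        for code, count in Counter(codes).items():
            row[index[code]] = count
        matrix[pid] = row
    return vocab, matrix

# Toy records using ICD-10 category codes (illustrative data only).
records = {
    "p1": ["I21", "I10", "I10"],
    "p2": ["E11", "I10"],
}
vocab, matrix = build_code_matrix(records)
```

Each row of `matrix` plays the role of one document's word counts, with columns ordered as in `vocab`.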

Relevance:

30.00%

Publisher:

Abstract:

INTRODUCTION: Studies that address sensitive topics, such as female sexual difficulty and dysfunction, often achieve poor response rates that can bias results. Factors that affect response rates to studies in this area are not well characterized.
AIM: To model the response rate in studies investigating the prevalence of female sexual difficulty and dysfunction.
METHODS: Databases were searched for English-language prevalence studies using the search terms: sexual difficulties/dysfunction, woman/women/female, prevalence, and cross-sectional. Studies that did not report response rates or were clinic-based were excluded. A multiple linear regression model was constructed.
MAIN OUTCOME MEASURES: Published response rates.
RESULTS: A total of 1,380 publications were identified, and 54 of these met our inclusion criteria. Our model explained 58% of the variance in response rates of studies investigating the prevalence of difficulty with desire, arousal, orgasm, or sexual pain (R² = 0.581, P = 0.027). This model was based on study design variables, study year, location, and the reported prevalence of each type of sexual difficulty. More recent studies (beta = -1.05, P = 0.037) and studies that only included women over 50 years of age (beta = -31.11, P = 0.007) had lower response rates. The use of face-to-face interviews was associated with a higher response rate (beta = 20.51, P = 0.036). Studies that did not include questions regarding desire difficulties achieved higher response rates than those that did (beta = 23.70, P = 0.034).
CONCLUSION: Response rates in prevalence studies addressing female sexual difficulty and dysfunction are frequently low and have decreased by an average of just over 1% per annum since the late 1960s. Participation may improve by conducting interviews in person. Studies that investigate a broad range of ages may be less representative of older women, due to a poorer response in older age groups. Lower response rates in studies that investigate desire difficulty suggest that sexual desire is a particularly sensitive topic.

Relevance:

30.00%

Publisher:

Abstract:

The topic of systems of systems has been one of the most challenging areas in science and engineering due to its multidisciplinary scope and inherent complexity. Despite all attempts carried out so far in both academia and industry, real-world applications remain remote. The purpose of this paper is to modify and adopt a recently developed modeling paradigm for systems of systems and then employ it to model a generic baggage handling system of an airport complex. In a top-down design approach, we start the modeling process by defining modeling goals that guide us in selecting high-level attributes. Functional attributes are then defined, which act as ties between the high-level attributes (the first level of abstraction) and low-level metrics/measurements. Since the most challenging issues in developing models for systems of systems are the identification and representation of dependencies amongst constituent entities, a machine learning technique is adopted to address these issues.

Relevance:

30.00%

Publisher:

Abstract:

In this paper we introduce a probabilistic framework to exploit hierarchy, structure sharing and duration information for topic transition detection in videos. Our probabilistic detection framework is a combination of a shot classification step and a detection phase using hierarchical probabilistic models. We consider two models in this paper: the extended Hierarchical Hidden Markov Model (HHMM) and the Coxian Switching Hidden semi-Markov Model (S-HSMM). Both allow the natural decomposition of semantics in videos, including shared structures, to be modeled directly, thus enabling efficient inference and reducing the sample complexity in learning. Additionally, the S-HSMM allows duration information to be incorporated; consequently, the modeling of long-term dependencies in videos is enriched through both hierarchical and duration modeling. Furthermore, the use of the Coxian distribution in the S-HSMM makes it tractable to deal with long sequences in video. Our experiments with the proposed framework on twelve educational and training videos show that both models outperform the baseline cases (flat HMM and HSMM) and the performance reported in earlier work on topic detection. The superior performance of the S-HSMM over the HHMM supports our belief that duration information is an important factor in video content modeling.

Relevance:

30.00%

Publisher:

Abstract:

Smartphones are pervasively used in society, and have been both the target and victim of malware writers. Motivated by the significant threat that this presents to legitimate users, we survey the current status of smartphone malware and its propagation models. The content of this paper is presented in two parts. In the first part, we review the short history of mobile malware evolution since 2004, and then list the classes of mobile malware and their infection vectors. At the end of the first part, we enumerate the possible damage caused by smartphone malware. In the second part, we focus on smartphone malware propagation modeling. In order to understand the propagation behavior of smartphone malware, we recall generic epidemic models as a foundation for further exploration. We then extensively survey the smartphone malware propagation models. At the end of this paper, we highlight issues with the current smartphone malware propagation models and discuss possible future trends based on our understanding of this topic. © 2014 IEEE.
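
As a reminder of what a generic epidemic model looks like, the sketch below steps a classic susceptible-infected-recovered (SIR) model forward with simple Euler updates. The abstract does not say which epidemic models the survey recalls, so the choice of SIR, the function name `simulate_sir`, and all parameter values are assumptions for illustration only.

```python
def simulate_sir(s0, i0, r0, beta, gamma, dt, steps):
    """Euler integration of the SIR equations (illustrative sketch).

    beta is the infection rate and gamma the recovery rate; the total
    population s + i + r stays constant at every step.
    """
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    history = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i / n * dt   # susceptible devices that become infected
        new_recoveries = gamma * i * dt          # infected devices that are cleaned
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history

# Hypothetical scenario: 1000 devices, one initially infected.
trajectory = simulate_sir(s0=999, i0=1, r0=0, beta=0.5, gamma=0.1, dt=0.1, steps=500)
```

In a malware setting the "recovered" compartment would correspond to patched or cleaned devices; propagation models specific to smartphones generally add structure (for example contact channels) that this generic sketch leaves out.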

Relevance:

30.00%

Publisher:

Abstract:

Discovering knowledge from unstructured texts is a central theme in data mining and machine learning. We focus on fast discovery of thematic structures from a corpus. Our approach is based on a versatile probabilistic formulation, the restricted Boltzmann machine (RBM), where the underlying graphical model is an undirected bipartite graph. Inference is efficient: a document representation can be computed with a single matrix projection, making RBMs suitable for the massive text corpora available today. Standard RBMs, however, operate under the bag-of-words assumption, ignoring the inherent underlying relational structures among words. This results in less coherent thematic grouping of words. We introduce graph-based regularization schemes that exploit the linguistic structures, which in turn can be constructed from either corpus statistics or domain knowledge. We demonstrate that the proposed technique improves the group coherence, facilitates visualization, provides means for estimation of intrinsic dimensionality, reduces overfitting, and possibly leads to better classification accuracy.
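
The "single matrix projection" can be illustrated with a minimal sketch, assuming a standard RBM with binary hidden units: the hidden representation of a document is the sigmoid of its word vector projected through the weight matrix plus a bias. The function name, the toy weights, and the toy word counts are assumptions, and the graph-based regularization proposed in the abstract is not included.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rbm_hidden_representation(word_counts, weights, hidden_bias):
    """Posterior activation of each hidden unit given a word-count vector.

    weights[i][j] connects visible word i to hidden unit j, so the result is
    sigmoid(hidden_bias + word_counts @ weights) written out in plain Python.
    """
    return [
        sigmoid(hidden_bias[j] + sum(v * weights[i][j] for i, v in enumerate(word_counts)))
        for j in range(len(hidden_bias))
    ]

# Toy example: 3 vocabulary words, 2 hidden units (hypothetical values).
doc = [1, 0, 2]
W = [[0.5, -0.5], [0.0, 0.0], [0.25, -0.25]]
b = [0.0, 0.0]
representation = rbm_hidden_representation(doc, W, b)
```

Because the hidden units are conditionally independent given the words, no iterative inference is needed at test time, which is what makes this step cheap on large corpora.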

Relevance:

30.00%

Publisher:

Abstract:

BIM has received considerable attention from academics and innovative construction companies in recent years within the Iranian context. However, there is a conspicuous lack of studies that give a picture of the current state of BIM in Iran. To address this gap in the body of knowledge, this study presents an account of the current state of BIM, with a focus on the barriers and drivers associated with its adoption in Iran, based on the perceptions of Iranian construction practitioners. Drawing upon a questionnaire survey completed by 44 construction practitioners, and through data visualization alongside statistical analyses, it came to light that industry practitioners in Iran are inexperienced in the use of BIM and that BIM implementation in the country is at the lowest level of BIM maturity. That is, 29.5% of construction companies are involved in some level of BIM adoption whereas 56.8% have had no exposure to BIM and 36.4% do not even have any plans to adopt BIM in the near future. The findings also showed that the highest ranked barriers to the adoption of BIM in Iran are almost entirely associated with the structure of the Iranian market, the nature of the construction industry and the predominant business environment in the country, as well as a lack of attention from policy makers and the government. On the other hand, the major drivers were found to be associated with monetary gains and enhancing competitiveness in the market. The clear message is that widespread adoption of BIM in Iran will not occur in the absence of a supportive regulatory environment and financial assistance from policy makers. The paper contributes to the field by sharing the preliminary findings of the first study conducted on BIM adoption in Iran, which provides a sound basis for further inquiries on the topic.