870 resultados para Latent Semantic Analysis
Resumo:
Questionnaire data may contain missing values because certain questions do not apply to all respondents. For instance, questions addressing particular attributes of a symptom, such as frequency, triggers or seasonality, are only applicable to those who have experienced the symptom, while for those who have not, responses to these items will be missing. This missing information does not fall into the category 'missing by design', rather the features of interest do not exist and cannot be measured regardless of survey design. Analysis of responses to such conditional items is therefore typically restricted to the subpopulation in which they apply. This article is concerned with joint multivariate modelling of responses to both unconditional and conditional items without restricting the analysis to this subpopulation. Such an approach is of interest when the distributions of both types of responses are thought to be determined by common parameters affecting the whole population. By integrating the conditional item structure into the model, inference can be based both on unconditional data from the entire population and on conditional data from subjects for whom they exist. This approach opens new possibilities for multivariate analysis of such data. We apply this approach to latent class modelling and provide an example using data on respiratory symptoms (wheeze and cough) in children. Conditional data structures such as that considered here are common in medical research settings and, although our focus is on latent class models, the approach can be applied to other multivariate models.
Resumo:
The present research analyses the adequacy of the widely used Career Satisfaction Scale (CSS; Greenhaus, Parasuraman, & Wormley, 1990) for measuring change over time. We used data of a sample of 1,273 professionals over a 5-year time period. First, we tested longitudinal measurement invariance of the CSS. Second, we analysed changes in career satisfaction by means of multiple indicator latent growth modelling (MLGM). Results revealed that the CSS can be reliably used in mean change analyses. Altogether, career satisfaction was relatively stable over time; however, we found significant variance in intra-individual growth trajectories and a negative correlation between the initial level of and changes in career satisfaction. Professionals who were initially highly satisfied became less satisfied over time. Theoretical and practical implications with respect to the construct of career satisfaction and its development over time (i.e., alpha, beta, and gamma change) are discussed.
Resumo:
Semantic Web aims to allow machines to make inferences using the explicit conceptualisations contained in ontologies. By pointing to ontologies, Semantic Web-based applications are able to inter-operate and share common information easily. Nevertheless, multilingual semantic applications are still rare, owing to the fact that most online ontologies are monolingual in English. In order to solve this issue, techniques for ontology localisation and translation are needed. However, traditional machine translation is difficult to apply to ontologies, owing to the fact that ontology labels tend to be quite short in length and linguistically different from the free text paradigm. In this paper, we propose an approach to enhance machine translation of ontologies based on exploiting the well-structured concept descriptions contained in the ontology. In particular, our approach leverages the semantics contained in the ontology by using Cross Lingual Explicit Semantic Analysis (CLESA) for context-based disambiguation in phrase-based Statistical Machine Translation (SMT). The presented work is novel in the sense that application of CLESA in SMT has not been performed earlier to the best of our knowledge.
Resumo:
This article applies methods of latent class analysis (LCA) to data on lifetime illicit drug use in order to determine whether qualitatively distinct classes of illicit drug users can be identified. Self-report data on lifetime illicit drug use (cannabis, stimulants, hallucinogens, sedatives, inhalants, cocaine, opioids and solvents) collected from a sample of 6265 Australian twins (average age 30 years) were analyzed using LCA. Rates of childhood sexual and physical abuse, lifetime alcohol and tobacco dependence, symptoms of illicit drug abuse/dependence and psychiatric comorbidity were compared across classes using multinomial logistic regression. LCA identified a 5-class model: Class 1 (68.5%) had low risks of the use of all drugs except cannabis; Class 2 (17.8%) had moderate risks of the use of all drugs; Class 3 (6.6%) had high rates of cocaine, other stimulant and hallucinogen use but lower risks for the use of sedatives or opioids. Conversely, Class 4 (3.0%) had relatively low risks of cocaine, other stimulant or hallucinogen use but high rates of sedative and opioid use. Finally, Class 5 (4.2%) had uniformly high probabilities for the use of all drugs. Rates of psychiatric comorbidity were highest in the polydrug class although the sedative/opioid class had elevated rates of depression/suicidal behaviors and exposure to childhood abuse. Aggregation of population-level data may obscure important subgroup differences in patterns of illicit drug use and psychiatric comorbidity. Further exploration of a 'self-medicating' subgroup is needed.
Resumo:
Context: The relationships among the different eating disorders that exist in the community are poorly understood, especially for residual disorders in which bingeing or purging occurs in the absence of other behaviors. Objective: To examine a community sample for the number of mutually exclusive weight and eating profiles. Design: Data regarding lifetime eating disorder symptoms and weight range were submitted to a latent profile analysis. Profiles were compared regarding personality, current eating and weight, retrospectively reported life events, and lifetime depressive psychopathology. Setting: Longitudinal study among female twins from the Australian Twin Registry in whom eating was assessed by a telephone interview. Participants: A community sample of 1002 twins (individuals) who had participated in earlier waves of data collection. Main Outcome Measures: Number and clinical character of latent profiles. Results: The best fit was a 5-profile solution with women who were (1) of normal weight with few lifetime eating disorders (4.3%), (2) overweight (10.6% had a lifetime eating disorder), (3) underweight and generally had no eating disorders except for 5.3% who had restricting anorexia nervosa, (4) of low to normal weight (89.0% had a lifetime eating disorder), and (5) obese (37.0% had a lifetime eating disorder). Each profile contained more than 1 type of lifetime eating disorder except for the third profile. Women in the first and third profiles had the best functioning, with women in the fourth and fifth profiles having similarly poorer functioning. The women in the fourth group had a symptom profile distinctive from the other 4 groups in terms of severity; they were also more likely to have had lifetime major depression and suicidality. Conclusion: Lifetime weight ranges and the severity of eating disorder symptoms affected clustering more than the type of eating disorder symptom.
Resumo:
There has been an increased demand for characterizing user access patterns using web mining techniques since the informative knowledge extracted from web server log files can not only offer benefits for web site structure improvement but also for better understanding of user navigational behavior. In this paper, we present a web usage mining method, which utilize web user usage and page linkage information to capture user access pattern based on Probabilistic Latent Semantic Analysis (PLSA) model. A specific probabilistic model analysis algorithm, EM algorithm, is applied to the integrated usage data to infer the latent semantic factors as well as generate user session clusters for revealing user access patterns. Experiments have been conducted on real world data set to validate the effectiveness of the proposed approach. The results have shown that the presented method is capable of characterizing the latent semantic factors and generating user profile in terms of weighted page vectors, which may reflect the common access interest exhibited by users among same session cluster.
Resumo:
Discovery Driven Analysis (DDA) is a common feature of OLAP technology to analyze structured data. In essence, DDA helps analysts to discover anomalous data by highlighting 'unexpected' values in the OLAP cube. By giving indications to the analyst on what dimensions to explore, DDA speeds up the process of discovering anomalies and their causes. However, Discovery Driven Analysis (and OLAP in general) is only applicable on structured data, such as records in databases. We propose a system to extend DDA technology to semi-structured text documents, that is, text documents with a few structured data. Our system pipeline consists of two stages: first, the text part of each document is structured around user specified dimensions, using semi-PLSA algorithm; then, we adapt DDA to these fully structured documents, thus enabling DDA on text documents. We present some applications of this system in OLAP analysis and show how scalability issues are solved. Results show that our system can handle reasonable datasets of documents, in real time, without any need for pre-computation.
Resumo:
This dissertation research points out major challenging problems with current Knowledge Organization (KO) systems, such as subject gateways or web directories: (1) the current systems use traditional knowledge organization systems based on controlled vocabulary which is not very well suited to web resources, and (2) information is organized by professionals not by users, which means it does not reflect intuitively and instantaneously expressed users’ current needs. In order to explore users’ needs, I examined social tags which are user-generated uncontrolled vocabulary. As investment in professionally-developed subject gateways and web directories diminishes (support for both BUBL and Intute, examined in this study, is being discontinued), understanding characteristics of social tagging becomes even more critical. Several researchers have discussed social tagging behavior and its usefulness for classification or retrieval; however, further research is needed to qualitatively and quantitatively investigate social tagging in order to verify its quality and benefit. This research particularly examined the indexing consistency of social tagging in comparison to professional indexing to examine the quality and efficacy of tagging. The data analysis was divided into three phases: analysis of indexing consistency, analysis of tagging effectiveness, and analysis of tag attributes. Most indexing consistency studies have been conducted with a small number of professional indexers, and they tended to exclude users. Furthermore, the studies mainly have focused on physical library collections. This dissertation research bridged these gaps by (1) extending the scope of resources to various web documents indexed by users and (2) employing the Information Retrieval (IR) Vector Space Model (VSM) - based indexing consistency method since it is suitable for dealing with a large number of indexers. As a second phase, an analysis of tagging effectiveness with tagging exhaustivity and tag specificity was conducted to ameliorate the drawbacks of consistency analysis based on only the quantitative measures of vocabulary matching. Finally, to investigate tagging pattern and behaviors, a content analysis on tag attributes was conducted based on the FRBR model. The findings revealed that there was greater consistency over all subjects among taggers compared to that for two groups of professionals. The analysis of tagging exhaustivity and tag specificity in relation to tagging effectiveness was conducted to ameliorate difficulties associated with limitations in the analysis of indexing consistency based on only the quantitative measures of vocabulary matching. Examination of exhaustivity and specificity of social tags provided insights into particular characteristics of tagging behavior and its variation across subjects. To further investigate the quality of tags, a Latent Semantic Analysis (LSA) was conducted to determine to what extent tags are conceptually related to professionals’ keywords and it was found that tags of higher specificity tended to have a higher semantic relatedness to professionals’ keywords. This leads to the conclusion that the term’s power as a differentiator is related to its semantic relatedness to documents. The findings on tag attributes identified the important bibliographic attributes of tags beyond describing subjects or topics of a document. The findings also showed that tags have essential attributes matching those defined in FRBR. Furthermore, in terms of specific subject areas, the findings originally identified that taggers exhibited different tagging behaviors representing distinctive features and tendencies on web documents characterizing digital heterogeneous media resources. These results have led to the conclusion that there should be an increased awareness of diverse user needs by subject in order to improve metadata in practical applications. This dissertation research is the first necessary step to utilize social tagging in digital information organization by verifying the quality and efficacy of social tagging. This dissertation research combined both quantitative (statistics) and qualitative (content analysis using FRBR) approaches to vocabulary analysis of tags which provided a more complete examination of the quality of tags. Through the detailed analysis of tag properties undertaken in this dissertation, we have a clearer understanding of the extent to which social tagging can be used to replace (and in some cases to improve upon) professional indexing.
Resumo:
220 p.
Resumo:
Part 19: Knowledge Management in Networks
Resumo:
This paper demonstrates an experimental study that examines the accuracy of various information retrieval techniques for Web service discovery. The main goal of this research is to evaluate algorithms for semantic web service discovery. The evaluation is comprehensively benchmarked using more than 1,700 real-world WSDL documents from INEX 2010 Web Service Discovery Track dataset. For automatic search, we successfully use Latent Semantic Analysis and BM25 to perform Web service discovery. Moreover, we provide linking analysis which automatically links possible atomic Web services to meet the complex requirements of users. Our fusion engine recommends a final result to users. Our experiments show that linking analysis can improve the overall performance of Web service discovery. We also find that keyword-based search can quickly return results but it has limitation of understanding users’ goals.
Resumo:
This article presents and evaluates a model to automatically derive word association networks from text corpora. Two aspects were evaluated: To what degree can corpus-based word association networks (CANs) approximate human word association networks with respect to (1) their ability to quantitatively predict word associations and (2) their structural network characteristics. Word association networks are the basis of the human mental lexicon. However, extracting such networks from human subjects is laborious, time consuming and thus necessarily limited in relation to the breadth of human vocabulary. Automatic derivation of word associations from text corpora would address these limitations. In both evaluations corpus-based processing provided vector representations for words. These representations were then employed to derive CANs using two measures: (1) the well known cosine metric, which is a symmetric measure, and (2) a new asymmetric measure computed from orthogonal vector projections. For both evaluations, the full set of 4068 free association networks (FANs) from the University of South Florida word association norms were used as baseline human data. Two corpus based models were benchmarked for comparison: a latent topic model and latent semantic analysis (LSA). We observed that CANs constructed using the asymmetric measure were slightly less effective than the topic model in quantitatively predicting free associates, and slightly better than LSA. The structural networks analysis revealed that CANs do approximate the FANs to an encouraging degree.
Resumo:
Automatic identification of software faults has enormous practical significance. This requires characterizing program execution behavior and the use of appropriate data mining techniques on the chosen representation. In this paper, we use the sequence of system calls to characterize program execution. The data mining tasks addressed are learning to map system call streams to fault labels and automatic identification of fault causes. Spectrum kernels and SVM are used for the former while latent semantic analysis is used for the latter The techniques are demonstrated for the intrusion dataset containing system call traces. The results show that kernel techniques are as accurate as the best available results but are faster by orders of magnitude. We also show that latent semantic indexing is capable of revealing fault-specific features.
Resumo:
Non-negative matrix factorization [5](NMF) is a well known tool for unsupervised machine learning. It can be viewed as a generalization of the K-means clustering, Expectation Maximization based clustering and aspect modeling by Probabilistic Latent Semantic Analysis (PLSA). Specifically PLSA is related to NMF with KL-divergence objective function. Further it is shown that K-means clustering is a special case of NMF with matrix L2 norm based error function. In this paper our objective is to analyze the relation between K-means clustering and PLSA by examining the KL-divergence function and matrix L2 norm based error function.