870 resultados para Latent Semantic Analysis
Resumo:
Two decades after its inception, Latent Semantic Analysis(LSA) has become part and parcel of every modern introduction to Information Retrieval. For any tool that matures so quickly, it is important to check its lore and limitations, or else stagnation will set in. We focus here on the three main aspects of LSA that are well accepted, and the gist of which can be summarized as follows: (1) that LSA recovers latent semantic factors underlying the document space, (2) that such can be accomplished through lossy compression of the document space by eliminating lexical noise, and (3) that the latter can best be achieved by Singular Value Decomposition. For each aspect we performed experiments analogous to those reported in the LSA literature and compared the evidence brought to bear in each case. On the negative side, we show that the above claims about LSA are much more limited than commonly believed. Even a simple example may show that LSA does not recover the optimal semantic factors as intended in the pedagogical example used in many LSA publications. Additionally, and remarkably deviating from LSA lore, LSA does not scale up well: the larger the document space, the more unlikely that LSA recovers an optimal set of semantic factors. On the positive side, we describe new algorithms to replace LSA (and more recent alternatives as pLSA, LDA, and kernel methods) by trading its l2 space for an l1 space, thereby guaranteeing an optimal set of semantic factors. These algorithms seem to salvage the spirit of LSA as we think it was initially conceived.
Resumo:
Web transaction data between Web visitors and Web functionalities usually convey user task-oriented behavior pattern. Mining such type of click-stream data will lead to capture usage pattern information. Nowadays Web usage mining technique has become one of most widely used methods for Web recommendation, which customizes Web content to user-preferred style. Traditional techniques of Web usage mining, such as Web user session or Web page clustering, association rule and frequent navigational path mining can only discover usage pattern explicitly. They, however, cannot reveal the underlying navigational activities and identify the latent relationships that are associated with the patterns among Web users as well as Web pages. In this work, we propose a Web recommendation framework incorporating Web usage mining technique based on Probabilistic Latent Semantic Analysis (PLSA) model. The main advantages of this method are, not only to discover usage-based access pattern, but also to reveal the underlying latent factor as well. With the discovered user access pattern, we then present user more interested content via collaborative recommendation. To validate the effectiveness of proposed approach, we conduct experiments on real world datasets and make comparisons with some existing traditional techniques. The preliminary experimental results demonstrate the usability of the proposed approach.
Resumo:
Bayesian probabilistic analysis offers a new approach to characterize semantic representations by inferring the most likely feature structure directly from the patterns of brain activity. In this study, infinite latent feature models [1] are used to recover the semantic features that give rise to the brain activation vectors when people think about properties associated with 60 concrete concepts. The semantic features recovered by ILFM are consistent with the human ratings of the shelter, manipulation, and eating factors that were recovered by a previous factor analysis. Furthermore, different areas of the brain encode different perceptual and conceptual features. This neurally-inspired semantic representation is consistent with some existing conjectures regarding the role of different brain areas in processing different semantic and perceptual properties. © 2012 Springer-Verlag.
Resumo:
In this paper, we compare a well-known semantic spacemodel, Latent Semantic Analysis (LSA) with another model, Hyperspace Analogue to Language (HAL) which is widely used in different area, especially in automatic query refinement. We conduct this comparative analysis to prove our hypothesis that with respect to ability of extracting the lexical information from a corpus of text, LSA is quite similar to HAL. We regard HAL and LSA as black boxes. Through a Pearsonrsquos correlation analysis to the outputs of these two black boxes, we conclude that LSA highly co-relates with HAL and thus there is a justification that LSA and HAL can potentially play a similar role in the area of facilitating automatic query refinement. This paper evaluates LSA in a new application area and contributes an effective way to compare different semantic space models.
Resumo:
Unstructured text data, such as emails, blogs, contracts, academic publications, organizational documents, transcribed interviews, and even tweets, are important sources of data in Information Systems research. Various forms of qualitative analysis of the content of these data exist and have revealed important insights. Yet, to date, these analyses have been hampered by limitations of human coding of large data sets, and by bias due to human interpretation. In this paper, we compare and combine two quantitative analysis techniques to demonstrate the capabilities of computational analysis for content analysis of unstructured text. Specifically, we seek to demonstrate how two quantitative analytic methods, viz., Latent Semantic Analysis and data mining, can aid researchers in revealing core content topic areas in large (or small) data sets, and in visualizing how these concepts evolve, migrate, converge or diverge over time. We exemplify the complementary application of these techniques through an examination of a 25-year sample of abstracts from selected journals in Information Systems, Management, and Accounting disciplines. Through this work, we explore the capabilities of two computational techniques, and show how these techniques can be used to gather insights from a large corpus of unstructured text.
Resumo:
Context Cancer patients experience a broad range of physical and psychological symptoms as a result of their disease and its treatment. On average, these patients report ten unrelieved and co-occurring symptoms. Objectives To determine if subgroups of oncology outpatients receiving active treatment (n=582) could be identified based on their distinct experience with thirteen commonly occurring symptoms; to determine whether these subgroups differed on select demographic, and clinical characteristics; and to determine if these subgroups differed on quality of life (QOL) outcomes. Methods Demographic, clinical, and symptom data from one Australian and two U.S. studies were combined. Latent class analysis (LCA) was used to identify patient subgroups with distinct symptom experiences based on self-report data on symptom occurrence using the Memorial Symptom Assessment Scale (MSAS). Results Four distinct latent classes were identified (i.e., All Low (28.0%), Moderate Physical and Lower Psych (26.3%), Moderate Physical and Higher Psych (25.4%), All High (20.3%)). Age, gender, education, cancer diagnosis, and presence of metastatic disease differentiated among the latent classes. Patients in the All High class had the worst QOL scores. Conclusion Findings from this study confirm the large amount of interindividual variability in the symptom experience of oncology patients. The identification of demographic and clinical characteristics that place patients are risk for a higher symptom burden can be used to guide more aggressive and individualized symptom management interventions.
Resumo:
Context: Identifying susceptibility genes for schizophrenia may be complicated by phenotypic heterogeneity, with some evidence suggesting that phenotypic heterogeneity reflects genetic heterogeneity. Objective: To evaluate the heritability and conduct genetic linkage analyses of empirically derived, clinically homogeneous schizophrenia subtypes. Design: Latent class and linkage analysis. Setting: Taiwanese field research centers. Participants: The latent class analysis included 1236 Han Chinese individuals with DSM-IV schizophrenia. These individuals were members of a large affected-sibling-pair sample of schizophrenia (606 ascertained families), original linkage analyses of which detected a maximum logarithm of odds (LOD) of 1.8 (z = 2.88) on chromosome 10q22.3. Main Outcome Measures: Multipoint exponential LOD scores by latent class assignment and parametric heterogeneity LOD scores. Results: Latent class analyses identified 4 classes, with 2 demonstrating familial aggregation. The first (LC2) described a group with severe negative symptoms, disorganization, and pronounced functional impairment, resembling “deficit schizophrenia.” The second (LC3) described a group with minimal functional impairment, mild or absent negative symptoms, and low disorganization. Using the negative/deficit subtype, we detected genome-wide significant linkage to 1q23-25 (LOD = 3.78, empiric genome-wide P = .01). This region was not detected using the DSM-IV schizophrenia diagnosis, but has been strongly implicated in schizophrenia pathogenesis by previous linkage and association studies.Variants in the 1q region may specifically increase risk for a negative/deficit schizophrenia subtype. Alternatively, these results may reflect increased familiality/heritability of the negative class, the presence of multiple 1q schizophrenia risk genes, or a pleiotropic 1q risk locus or loci, with stronger genotype-phenotype correlation with negative/deficit symptoms. Using the second familial latent class, we identified nominally significant linkage to the original 10q peak region. Conclusion: Genetic analyses of heritable, homogeneous phenotypes may improve the power of linkage and association studies of schizophrenia and thus have relevance to the design and analysis of genome-wide association studies.
Resumo:
For zygosity diagnosis in the absence of genotypic data, or in the recruitment phase of a twin study where only single twins from same-sex pairs are being screened, or to provide a test for sample duplication leading to the false identification of a dizygotic pair as monozygotic, the appropriate analysis of respondents' answers to questions about zygosity is critical. Using data from a young adult Australian twin cohort (N = 2094 complete pairs and 519 singleton twins from same-sex pairs with complete responses to all zygosity items), we show that application of latent class analysis (LCA), fitting a 2-class model, yields results that show good concordance with traditional methods of zygosity diagnosis, but with certain important advantages. These include the ability, in many cases, to assign zygosity with specified probability on the basis of responses of a single informant (advantageous when one zygosity type is being oversampled); and the ability to quantify the probability of misassignment of zygosity, allowing prioritization of cases for genotyping as well as identification of cases of probable laboratory error. Out of 242 twins (from 121 like-sex pairs) where genotypic data were available for zygosity confirmation, only a single case was identified of incorrect zygosity assignment by the latent class algorithm. Zygosity assignment for that single case was identified by the LCA as uncertain (probability of being a monozygotic twin only 76%), and the co-twin's responses clearly identified the pair as dizygotic (probability of being dizygotic 100%). In the absence of genotypic data, or as a safeguard against sample duplication, application of LCA for zygosity assignment or confirmation is strongly recommended.
Resumo:
In this article, we introduce the general statistical analysis approach known as latent class analysis and discuss some of the issues associated with this type of analysis in practice. Two recent examples from the respiratory health literature are used to highlight the types of research questions that have been addressed using this approach.
Resumo:
Background: Mycobacterium tuberculosis, a causative agent of chronic tuberculosis disease, is widespread among some animal species too. There is paucity of information on the distribution, prevalence and true disease status of tuberculosis in Asian elephants (Elephas maximus). The aim of this study was to estimate the sensitivity and specificity of serological tests to diagnose M. tuberculosis infection in captive elephants in southern India while simultaneously estimating sero-prevalence. Methodology/Principal Findings: Health assessment of 600 elephants was carried out and their sera screened with a commercially available rapid serum test. Trunk wash culture of select rapid serum test positive animals yielded no animal positive for M. tuberculosis isolation. Under Indian field conditions where the true disease status is unknown, we used a latent class model to estimate the diagnostic characteristics of an existing (rapid serum test) and new (four in-house ELISA) tests. One hundred and seventy nine sera were randomly selected for screening in the five tests. Diagnostic sensitivities of the four ELISAs were 91.3-97.6% (95% Credible Interval (CI): 74.8-99.9) and diagnostic specificity were 89.6-98.5% (95% CI: 79.4-99.9) based on the model we assumed. We estimate that 53.6% (95% CI: 44.6-62.8) of the samples tested were free from infection with M. tuberculosis and 15.9% (97.5% CI: 9.8 - to 24.0) tested positive on all five tests. Conclusions/Significance: Our results provide evidence for high prevalence of asymptomatic M. tuberculosis infection in Asian elephants in a captive Indian setting. Further validation of these tests would be important in formulating area-specific effective surveillance and control measures.