3 results for latent semantic analysis
in Illinois Digital Environment for Access to Learning and Scholarship Repository
Abstract:
Discovery Driven Analysis (DDA) is a common feature of OLAP technology for analyzing structured data. In essence, DDA helps analysts discover anomalous data by highlighting 'unexpected' values in the OLAP cube. By indicating which dimensions to explore, DDA speeds up the discovery of anomalies and their causes. However, DDA (and OLAP in general) is applicable only to structured data, such as database records. We propose a system that extends DDA to semi-structured text documents, that is, text documents that carry a small amount of structured data. Our pipeline consists of two stages: first, the text part of each document is structured around user-specified dimensions using the semi-PLSA algorithm; then, we adapt DDA to these fully structured documents, thus enabling DDA on text documents. We present some applications of this system in OLAP analysis and show how scalability issues are solved. Results show that our system can handle document collections of reasonable size in real time, without any need for pre-computation.
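The idea of highlighting 'unexpected' cube values can be illustrated with a minimal sketch. It is not the system described in the abstract: it assumes a two-dimensional cube held as a list of rows, an additive model for the expected value of each cell, and a hypothetical function name and threshold.

```python
from statistics import mean, pstdev


def flag_unexpected_cells(cube, threshold=2.0):
    """Return (row, col, residual) for cells that deviate from an additive model.

    Assumption: the expected value of a cell is
    row_mean + col_mean - grand_mean, and a cell counts as 'unexpected'
    when its residual exceeds `threshold` standard deviations of all residuals.
    """
    row_means = [mean(row) for row in cube]
    col_means = [mean(col) for col in zip(*cube)]
    # Rows have equal length, so the mean of row means is the grand mean.
    grand_mean = mean(row_means)

    residuals = [
        [cube[i][j] - (row_means[i] + col_means[j] - grand_mean)
         for j in range(len(col_means))]
        for i in range(len(row_means))
    ]

    spread = pstdev([r for row in residuals for r in row])
    if spread == 0:
        return []  # every cell matches the model exactly

    return [
        (i, j, residuals[i][j])
        for i in range(len(row_means))
        for j in range(len(col_means))
        if abs(residuals[i][j]) / spread > threshold
    ]
```

On a 4 x 4 cube filled with 10 except for a single cell set to 50, only that cell is flagged, which is the kind of hint that would tell an analyst where to drill down.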
Abstract:
This dissertation research points out major problems with current Knowledge Organization (KO) systems, such as subject gateways or web directories: (1) the current systems use traditional knowledge organization systems based on controlled vocabulary, which is not well suited to web resources, and (2) information is organized by professionals, not by users, which means it does not reflect users’ current needs as they are intuitively and instantaneously expressed. In order to explore users’ needs, I examined social tags, which are user-generated uncontrolled vocabulary. As investment in professionally developed subject gateways and web directories diminishes (support for both BUBL and Intute, examined in this study, is being discontinued), understanding the characteristics of social tagging becomes even more critical. Several researchers have discussed social tagging behavior and its usefulness for classification or retrieval; however, further research is needed to investigate social tagging qualitatively and quantitatively in order to verify its quality and benefit. This research examined in particular the indexing consistency of social tagging in comparison to professional indexing, in order to assess the quality and efficacy of tagging. The data analysis was divided into three phases: analysis of indexing consistency, analysis of tagging effectiveness, and analysis of tag attributes. Most indexing consistency studies have been conducted with a small number of professional indexers, and they have tended to exclude users. Furthermore, the studies have focused mainly on physical library collections. This dissertation research bridged these gaps by (1) extending the scope of resources to various web documents indexed by users and (2) employing the Information Retrieval (IR) Vector Space Model (VSM)-based indexing consistency method, since it is suitable for dealing with a large number of indexers.
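A VSM-based consistency measure of the kind mentioned above can be sketched as the cosine similarity between two term-frequency vectors, one per group of indexers. The function name and the use of raw term counts as weights are assumptions made for illustration; the abstract does not state the weighting used in the dissertation.

```python
from collections import Counter
from math import sqrt


def cosine_consistency(terms_a, terms_b):
    """Cosine similarity between two bags of index terms.

    Each argument is the pooled list of terms that one group of indexers
    assigned to the same document (raw counts are an assumed weighting).
    """
    vec_a, vec_b = Counter(terms_a), Counter(terms_b)
    # Only terms used by both groups contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm = (sqrt(sum(c * c for c in vec_a.values()))
            * sqrt(sum(c * c for c in vec_b.values())))
    return dot / norm if norm else 0.0
```

Because every indexer simply adds counts to a group vector, the measure scales to many indexers without pairwise comparison, which is the property the abstract cites as the reason for choosing it.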
As a second phase, an analysis of tagging effectiveness in terms of tagging exhaustivity and tag specificity was conducted to offset the drawbacks of a consistency analysis based only on quantitative measures of vocabulary matching. Finally, to investigate tagging patterns and behaviors, a content analysis of tag attributes was conducted based on the FRBR model. The findings revealed greater consistency across all subjects among taggers than for two groups of professionals. Examination of the exhaustivity and specificity of social tags provided insights into particular characteristics of tagging behavior and its variation across subjects. To further investigate the quality of tags, a Latent Semantic Analysis (LSA) was conducted to determine to what extent tags are conceptually related to professionals’ keywords, and it was found that tags of higher specificity tended to have a higher semantic relatedness to professionals’ keywords. This leads to the conclusion that a term’s power as a differentiator is related to its semantic relatedness to documents. The findings on tag attributes identified important bibliographic attributes of tags beyond describing the subjects or topics of a document. The findings also showed that tags have essential attributes matching those defined in FRBR. Furthermore, in terms of specific subject areas, the findings originally identified that taggers exhibited different tagging behaviors, representing distinctive features and tendencies on web documents characterizing heterogeneous digital media resources.
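For readers unfamiliar with LSA, the semantic relatedness referred to above is commonly computed from a truncated singular value decomposition of a term-by-document matrix. The notation below is a generic sketch under assumed symbols, not the specific setup used in the dissertation.

```latex
% Assumed notation: X is an m x n term-by-document matrix (m terms, n documents).
X = U \Sigma V^{\top}, \qquad
X_k = U_k \Sigma_k V_k^{\top} \quad (k \ll \min(m, n))

% Term i is represented by row i of U_k \Sigma_k; relatedness of terms i and j
% is the cosine of the angle between their reduced vectors:
t_i = (U_k \Sigma_k)_{i,:}, \qquad
\operatorname{rel}(i, j) = \frac{t_i \cdot t_j}{\lVert t_i \rVert \, \lVert t_j \rVert}
```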
These results have led to the conclusion that there should be an increased awareness of diverse user needs by subject in order to improve metadata in practical applications. This dissertation research is a necessary first step toward using social tagging in digital information organization, because it verifies the quality and efficacy of social tagging. It combined quantitative (statistics) and qualitative (content analysis using FRBR) approaches to the vocabulary analysis of tags, which provided a more complete examination of the quality of tags. Through the detailed analysis of tag properties undertaken in this dissertation, we have a clearer understanding of the extent to which social tagging can be used to replace (and in some cases to improve upon) professional indexing.
Abstract:
In this thesis I examine a variety of linguistic elements which involve "alternative" semantic values (a class arguably including focus, interrogatives, indefinites, and disjunctions) and the connections between these elements. This study focusses on the analysis of such elements in Sinhala, with comparison to Malayalam, Tlingit, and Japanese. The central part of the study concerns the proper syntactic and semantic analysis of Q[uestion]-particles (including Sinhala "da", Malayalam "-oo", Japanese "ka"), which, in many languages, appear not only in interrogatives, but also in the formation of indefinites, disjunctions, and relative clauses. This set of contexts is syntactically heterogeneous, and so syntax does not offer an explanation for the appearance of Q-particles in this particular set of environments. I propose that these contexts can be united in terms of semantics, as all involving some element which denotes a set of "alternatives". Both wh-words and disjunctions can be analysed as creating Hamblin-type sets of "alternatives". Q-particles can be treated as uniformly denoting variables over choice functions which apply to the aforementioned Hamblin-type sets, thus "restoring" the derivation to normal Montagovian semantics. The treatment of Q-particles as uniformly denoting variables over choice functions provides an explanation for why these particles appear in just this set of contexts: they all include an element with Hamblin-type semantics. However, we also find variation in the use of Q-particles, including, in some languages, the appearance of multiple morphologically distinct Q-particles in different syntactic contexts. Such variation can be handled largely by positing that Q-particles may vary in their formal syntactic feature specifications, which determine the syntactic contexts in which they are licensed.
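The core of the proposal just described can be written out schematically. The formulas below are an illustrative sketch in assumed notation (an assignment function g, an index i on the Q-particle, and a sample denotation for "who"); they are not taken from the thesis itself.

```latex
% A wh-word denotes a Hamblin-type set of alternatives (illustrative example):
[\![ \textit{who} ]\!] = \{\, x : \mathrm{person}(x) \,\}

% f is a choice function iff it picks a member of every non-empty set in its domain:
\mathrm{CF}(f) \iff \forall A \neq \emptyset :\ f(A) \in A

% A Q-particle denotes a variable over choice functions, applied to the
% alternative set of its sister \alpha, which yields an ordinary (non-set) value:
[\![ Q_i\ \alpha ]\!]^{g} = g(i)\big( [\![ \alpha ]\!]^{g} \big), \qquad \mathrm{CF}(g(i))
```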
The unified analysis of Q-particles as denoting variables over choice functions also raises various questions about the proper analysis of interrogatives, indefinites, and disjunctions, including issues concerning the nature of the semantics of wh-words and the syntactic structure of disjunction. I also observe that indefinites involving Q-particles have a crosslinguistic tendency to be epistemic indefinites, i.e. indefinites which explicitly signal ignorance of details regarding who or what satisfies the existential claim. I provide an account of such indefinites which draws on the analysis of Q-particles as variables over choice functions. These pragmatic "signals of ignorance" (which I argue to be presuppositions) also have a further role to play in determining the distribution of Q-particles in disjunctions. The final section of this study investigates the historical development of focus constructions and Q-particles in Sinhala. This diachronic study not only allows us to observe the origin and development of such elements, but also serves to delimit the range of possible synchronic analyses, thus providing us with further insights into the formal syntactic and semantic properties of Q-particles. This study highlights both the importance of considering various components of the grammar (e.g. syntax, semantics, pragmatics, morphology) and the use of philology in developing plausible formal analyses of complex linguistic phenomena such as the crosslinguistic distribution of Q-particles.