6 resultados para 080704 Information Retrieval and Web Search
em Illinois Digital Environment for Access to Learning and Scholarship Repository
                                
Resumo:
With the dramatic growth of text information, there is an increasing need for powerful text mining systems that can automatically discover useful knowledge from text. Text is generally associated with all kinds of contextual information. Those contexts can be explicit, such as the time and the location where a blog article is written, and the author(s) of a biomedical publication, or implicit, such as the positive or negative sentiment that an author had when she wrote a product review; there may also be complex context such as the social network of the authors. Many applications require analysis of topic patterns over different contexts. For instance, analysis of search logs in the context of the user can reveal how we can improve the quality of a search engine by optimizing the search results according to particular users; analysis of customer reviews in the context of positive and negative sentiments can help the user summarize public opinions about a product; analysis of blogs or scientific publications in the context of a social network can facilitate discovery of more meaningful topical communities. Since context information significantly affects the choices of topics and language made by authors, in general, it is very important to incorporate it into analyzing and mining text data. In general, modeling the context in text, discovering contextual patterns of language units and topics from text, a general task which we refer to as Contextual Text Mining, has widespread applications in text mining. In this thesis, we provide a novel and systematic study of contextual text mining, which is a new paradigm of text mining treating context information as the ``first-class citizen.'' We formally define the problem of contextual text mining and its basic tasks, and propose a general framework for contextual text mining based on generative modeling of text. This conceptual framework provides general guidance on text mining problems with context information and can be instantiated into many real tasks, including the general problem of contextual topic analysis. We formally present a functional framework for contextual topic analysis, with a general contextual topic model and its various versions, which can effectively solve the text mining problems in a lot of real world applications. We further introduce general components of contextual topic analysis, by adding priors to contextual topic models to incorporate prior knowledge, regularizing contextual topic models with dependency structure of context, and postprocessing contextual patterns to extract refined patterns. The refinements on the general contextual topic model naturally lead to a variety of probabilistic models which incorporate different types of context and various assumptions and constraints. These special versions of the contextual topic model are proved effective in a variety of real applications involving topics and explicit contexts, implicit contexts, and complex contexts. We then introduce a postprocessing procedure for contextual patterns, by generating meaningful labels for multinomial context models. This method provides a general way to interpret text mining results for real users. By applying contextual text mining in the ``context'' of other text information management tasks, including ad hoc text retrieval and web search, we further prove the effectiveness of contextual text mining techniques in a quantitative way with large scale datasets. The framework of contextual text mining not only unifies many explorations of text analysis with context information, but also opens up many new possibilities for future research directions in text mining.
                                
Resumo:
At the dawn of the twentieth century, Imperial Russia was in the throes of immense social, political and cultural upheaval. The effects of rapid industrialization, rising capitalism and urbanization, as well as the trauma wrought by revolution and war, reverberated through all levels of society and every cultural sphere. In the aftermath of the 1905 revolution, amid a growing sense of panic over the chaos and divisions emerging in modern life, a portion of Russian educated society (obshchestvennost’) looked to the transformative and unifying power of music as a means of salvation from the personal, social and intellectual divisions of the contemporary world. Transcending professional divisions, these “orphans of Nietzsche” comprised a distinct aesthetic group within educated Russian society. While lacking a common political, religious or national outlook, these philosophers, poets, musicians and other educated members of the upper and middle strata were bound together by their shared image of music’s unifying power, itself built upon a synthesis of Russian and European ideas. They yearned for a “musical Orpheus,” a composer capable of restoring wholeness to society through his music. My dissertation is a study in what I call “musical metaphysics,” an examination of the creation, development, crisis and ultimate failure of this Orphic worldview. To begin, I examine the institutional foundations of musical life in late Imperial Russia, as well as the explosion of cultural life in the aftermath of the 1905 Revolution, a vibrant social context which nourished the formation of musical metaphysics. From here, I assess the intellectual basis upon which musical metaphysics rested: central concepts (music, life-transformation, theurgy, unity, genius, nation), as well as the philosophical heritage of Nietzsche and the Christian thinkers Vladimir Solov’ev, Aleksei Khomiakov, Ivan Kireevskii and Lev Tolstoi. Nietzsche’s orphans’ struggle to reconcile an amoral view of reality with a deeply felt sense of religious purpose gave rise to neo-Slavophile interpretations of history, in which the Russian nation (narod) was singled out as the savior of humanity from the materialism of modern life. This nationalizing tendency existed uneasily within the framework of the multi-ethnic empire. From broad social and cultural trends, I turn to detailed analysis of three of Moscow’s most admired contemporary composers, whose individual creative voices intersected with broader social concerns. The music of Aleksandr Scriabin (1871-1915) was associated with images of universal historical progress. Nikolai Medtner (1879-1951) embodied an “Imperial” worldview, in which musical style was imbued with an eternal significance which transcended the divisions of nation. The compositions of Sergei Rachmaninoff (1873-1943) were seen as the expression of a Russian “national” voice. Heightened nationalist sentiment and the impact of the Great War spelled the doom of this musical worldview. Music became an increasingly nationalized sphere within which earlier, Imperial definitions of belonging grew ever more problematic. As the Germanic heritage upon which their vision was partially based came under attack, Nietzsche’s orphans found themselves ever more divided and alienated from society as a whole. Music’s inability to physically transform the world ultimately came to symbolize the failure of Russia’s educated strata to effectively deal with the pressures of a modernizing society. In the aftermath of the 1917 revolutions, music was transformed from a symbol of active, unifying power into a space of memory, a means of commemorating, reinterpreting, and idealizing the lost world of Imperial Russia itself.
                                
Resumo:
We build a system to support search and visualization on heterogeneous information networks. We first build our system on a specialized heterogeneous information network: DBLP. The system aims to facilitate people, especially computer science researchers, toward a better understanding and user experience about academic information networks. Then we extend our system to the Web. Our results are much more intuitive and knowledgeable than the simple top-k blue links from traditional search engines, and bring more meaningful structural results with correlated entities. We also investigate the ranking algorithm, and we show that the personalized PageRank and proposed Hetero-personalized PageRank outperform the TF-IDF ranking or mixture of TF-IDF and authority ranking. Our work opens several directions for future research.
                                
Resumo:
This dissertation research points out major challenging problems with current Knowledge Organization (KO) systems, such as subject gateways or web directories: (1) the current systems use traditional knowledge organization systems based on controlled vocabulary which is not very well suited to web resources, and (2) information is organized by professionals not by users, which means it does not reflect intuitively and instantaneously expressed users’ current needs. In order to explore users’ needs, I examined social tags which are user-generated uncontrolled vocabulary. As investment in professionally-developed subject gateways and web directories diminishes (support for both BUBL and Intute, examined in this study, is being discontinued), understanding characteristics of social tagging becomes even more critical. Several researchers have discussed social tagging behavior and its usefulness for classification or retrieval; however, further research is needed to qualitatively and quantitatively investigate social tagging in order to verify its quality and benefit. This research particularly examined the indexing consistency of social tagging in comparison to professional indexing to examine the quality and efficacy of tagging. The data analysis was divided into three phases: analysis of indexing consistency, analysis of tagging effectiveness, and analysis of tag attributes. Most indexing consistency studies have been conducted with a small number of professional indexers, and they tended to exclude users. Furthermore, the studies mainly have focused on physical library collections. This dissertation research bridged these gaps by (1) extending the scope of resources to various web documents indexed by users and (2) employing the Information Retrieval (IR) Vector Space Model (VSM) - based indexing consistency method since it is suitable for dealing with a large number of indexers. As a second phase, an analysis of tagging effectiveness with tagging exhaustivity and tag specificity was conducted to ameliorate the drawbacks of consistency analysis based on only the quantitative measures of vocabulary matching. Finally, to investigate tagging pattern and behaviors, a content analysis on tag attributes was conducted based on the FRBR model. The findings revealed that there was greater consistency over all subjects among taggers compared to that for two groups of professionals. The analysis of tagging exhaustivity and tag specificity in relation to tagging effectiveness was conducted to ameliorate difficulties associated with limitations in the analysis of indexing consistency based on only the quantitative measures of vocabulary matching. Examination of exhaustivity and specificity of social tags provided insights into particular characteristics of tagging behavior and its variation across subjects. To further investigate the quality of tags, a Latent Semantic Analysis (LSA) was conducted to determine to what extent tags are conceptually related to professionals’ keywords and it was found that tags of higher specificity tended to have a higher semantic relatedness to professionals’ keywords. This leads to the conclusion that the term’s power as a differentiator is related to its semantic relatedness to documents. The findings on tag attributes identified the important bibliographic attributes of tags beyond describing subjects or topics of a document. The findings also showed that tags have essential attributes matching those defined in FRBR. Furthermore, in terms of specific subject areas, the findings originally identified that taggers exhibited different tagging behaviors representing distinctive features and tendencies on web documents characterizing digital heterogeneous media resources. These results have led to the conclusion that there should be an increased awareness of diverse user needs by subject in order to improve metadata in practical applications. This dissertation research is the first necessary step to utilize social tagging in digital information organization by verifying the quality and efficacy of social tagging. This dissertation research combined both quantitative (statistics) and qualitative (content analysis using FRBR) approaches to vocabulary analysis of tags which provided a more complete examination of the quality of tags. Through the detailed analysis of tag properties undertaken in this dissertation, we have a clearer understanding of the extent to which social tagging can be used to replace (and in some cases to improve upon) professional indexing.
                                
Resumo:
This thesis attempts to provide deeper historical and theoretical grounding for sense-making, thereby illustrating its applicability to practical information seeking research. In Chapter One I trace the philosophical origins of Brenda Dervin’s theory known as “sense making,” reaching beyond current scholarship that locates the origins of sense-making in twentieth-century Phenomenology and Communication theory and find its rich ontological, epistemological, and etymological heritage that dates back to the Pre-Socratics. After exploring sense-making’s Greek roots, I examine sense-making’s philosophical undercurrents found in Hegel’s Phenomenology of Spirit (1807), where he also returns to the simplicity of the Greeks for his concept of sense. With Chapter Two I explore sense-making methodology and find, in light of the Greek and Hegelian dialectic, a dialogical bridge connecting sense-making’s theory with pragmatic uses. This bridge between Dervin’s situation and use occupies a distinct position in sense-making theory. Moreover, building upon Brenda Dervin’s model of sense-making, I use her metaphors of gap and bridge analogy to discuss the dialectic and dialogic components of sense making. The purpose of Chapter Three is pragmatic – to gain insight into the online information-seeking needs, experiences, and motivation of first-degree relatives (FDRs) of breast cancer survivors through the lens of sense-making. This research analyses four questions: 1) information-seeking behavior among FDRs of cancer survivors compared to survivors and to undiagnosed, non-related online cancer information seekers in the general population, 2) types of and places where information is sought, 3) barriers or gaps and satisfaction rates FDRs face in their cancer information quest, and 4) types and degrees of cancer information and resources FDRs want and use in their information search for themselves and other family members. An online survey instrument designed to investigate these questions was developed and pilot tested. Via an email communication, the Susan Love Breast Cancer Research Foundation distributed 322,000 invitations to its membership to complete the survey, and from March 24th to April 5th 10,692 women agreed to take the survey with 8,804 volunteers actually completing survey responses. Of the 8,804 surveys, 95% of FDRs have searched for cancer information online, and 84% of FDRs use the Internet as a sense-making tool for additional information they have received from doctors or nurses. FDRs report needing much more information than either survivors or family/friends in ten out of fifteen categories related to breast and ovarian cancer. When searching for cancer information online, FDRs also rank highest in several of sense-making’s emotional levels: uncertainty, confusion, frustration, doubt, and disappointment than do either survivors or friends and family. The sense-making process has existed in theory and praxis since the early Greeks. In applying sense–making’s theory to a contemporary problem, the survey reveals unaddressed situations and gaps of FDRs’ information search process. FDRs are a highly motivated group of online information seekers whose needs are largely unaddressed as a result of gaps in available online information targeted to address their specific needs. Since FDRs represent a quarter of the population, further research addressing their specific online information needs and experiences is necessary.
                                
Resumo:
Visual recognition is a fundamental research topic in computer vision. This dissertation explores datasets, features, learning, and models used for visual recognition. In order to train visual models and evaluate different recognition algorithms, this dissertation develops an approach to collect object image datasets on web pages using an analysis of text around the image and of image appearance. This method exploits established online knowledge resources (Wikipedia pages for text; Flickr and Caltech data sets for images). The resources provide rich text and object appearance information. This dissertation describes results on two datasets. The first is Berg’s collection of 10 animal categories; on this dataset, we significantly outperform previous approaches. On an additional set of 5 categories, experimental results show the effectiveness of the method. Images are represented as features for visual recognition. This dissertation introduces a text-based image feature and demonstrates that it consistently improves performance on hard object classification problems. The feature is built using an auxiliary dataset of images annotated with tags, downloaded from the Internet. Image tags are noisy. The method obtains the text features of an unannotated image from the tags of its k-nearest neighbors in this auxiliary collection. A visual classifier presented with an object viewed under novel circumstances (say, a new viewing direction) must rely on its visual examples. This text feature may not change, because the auxiliary dataset likely contains a similar picture. While the tags associated with images are noisy, they are more stable when appearance changes. The performance of this feature is tested using PASCAL VOC 2006 and 2007 datasets. This feature performs well; it consistently improves the performance of visual object classifiers, and is particularly effective when the training dataset is small. With more and more collected training data, computational cost becomes a bottleneck, especially when training sophisticated classifiers such as kernelized SVM. This dissertation proposes a fast training algorithm called Stochastic Intersection Kernel Machine (SIKMA). This proposed training method will be useful for many vision problems, as it can produce a kernel classifier that is more accurate than a linear classifier, and can be trained on tens of thousands of examples in two minutes. It processes training examples one by one in a sequence, so memory cost is no longer the bottleneck to process large scale datasets. This dissertation applies this approach to train classifiers of Flickr groups with many group training examples. The resulting Flickr group prediction scores can be used to measure image similarity between two images. Experimental results on the Corel dataset and a PASCAL VOC dataset show the learned Flickr features perform better on image matching, retrieval, and classification than conventional visual features. Visual models are usually trained to best separate positive and negative training examples. However, when recognizing a large number of object categories, there may not be enough training examples for most objects, due to the intrinsic long-tailed distribution of objects in the real world. This dissertation proposes an approach to use comparative object similarity. The key insight is that, given a set of object categories which are similar and a set of categories which are dissimilar, a good object model should respond more strongly to examples from similar categories than to examples from dissimilar categories. This dissertation develops a regularized kernel machine algorithm to use this category dependent similarity regularization. Experiments on hundreds of categories show that our method can make significant improvement for categories with few or even no positive examples.
 
                    