981 results for Semantic Text Analysis
Abstract:
In the last decade, a large number of social media services have emerged and become widely used in people's daily lives as important tools for sharing and acquiring information. Given the substantial amount of user-contributed text data on social media, it has become necessary to develop text analysis methods and tools for this emerging data, so that it can be better used to deliver meaningful information to users. Previous work on text analytics over the last several decades has focused mainly on traditional types of text such as emails, news and academic literature, and several critical issues for text data on social media have not been well explored: 1) how to detect sentiment from text on social media; 2) how to make use of social media's real-time nature; 3) how to address information overload for flexible information needs. In this dissertation, we focus on these three problems. First, to detect the sentiment of text on social media, we propose a dual active supervision method based on non-negative matrix tri-factorization (tri-NMF) to minimize human labeling effort for this new type of data. Second, to make use of social media's real-time nature, we propose approaches to detect events from text streams on social media. Third, to address information overload for flexible information needs, we propose two summarization frameworks: a dominating-set-based summarization framework and a learning-to-rank-based summarization framework. The dominating-set-based framework can be applied to different types of summarization problems, while the learning-to-rank-based framework helps use existing training data to guide new summarization tasks. In addition, we integrate these techniques in an application study of event summarization for sports games as an example of how to better utilize social media data.
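The idea behind dominating-set-based summarization can be illustrated with a minimal sketch. The abstract does not give the dissertation's actual graph construction or selection algorithm, so everything here is an assumption for illustration: sentences are nodes, two sentences are linked when the Jaccard similarity of their word sets reaches a hypothetical `threshold`, and a standard greedy approximation picks sentences until every sentence is either selected or linked to a selected one.

```python
import re


def tokens(sentence):
    """Return the set of lowercased words in a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_graph(sentences, threshold=0.3):
    """Map each sentence index to its closed neighborhood (itself included).

    The similarity measure and threshold are illustrative assumptions.
    """
    toks = [tokens(s) for s in sentences]
    neighbors = {i: {i} for i in range(len(sentences))}
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if jaccard(toks[i], toks[j]) >= threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors


def greedy_dominating_set(neighbors):
    """Greedy approximation: repeatedly take the node covering most uncovered nodes."""
    uncovered = set(neighbors)
    chosen = []
    while uncovered:
        # Ties are broken in favor of the lowest index.
        best = max(sorted(neighbors), key=lambda v: len(neighbors[v] & uncovered))
        chosen.append(best)
        uncovered -= neighbors[best]
    return sorted(chosen)


def summarize(sentences, threshold=0.3):
    """Return the sentences selected by the greedy dominating set."""
    graph = build_graph(sentences, threshold)
    return [sentences[i] for i in greedy_dominating_set(graph)]
```

Calling `summarize` on a list of near-duplicate posts keeps one representative per group of similar sentences; the greedy rule does not guarantee a minimum dominating set, only a valid one.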
Abstract:
This dissertation applies statistical methods to the evaluation of automatic summarization using data from the Text Analysis Conferences in 2008-2011. Several aspects of the evaluation framework itself are studied, including the statistical testing used to determine significant differences, the assessors, and the design of the experiment. In addition, a family of evaluation metrics is developed to predict the score an automatically generated summary would receive from a human judge and its results are demonstrated at the Text Analysis Conference. Finally, variations on the evaluation framework are studied and their relative merits considered. An over-arching theme of this dissertation is the application of standard statistical methods to data that does not conform to the usual testing assumptions.
Abstract:
In the present dissertation, multilingual thesauri were approached as cultural products, and the focus was twofold. On the empirical level, the focus was placed on the translatability of certain British-English social science indexing terms into the Finnish language and culture at the concept, term and indexing term levels. On the theoretical level, the focus was placed on the aim of translation and on the concept of equivalence. In accordance with modern communicative and dynamic translation theories, the interest was in the human dimension. The study is qualitative. In this study, equivalence was understood in a similar way to how dynamic, functional equivalence is commonly understood in translation studies. Translating was seen as a decision-making process in which a translator often has different options to choose from in order to fulfil the function of the translation. Accordingly, and as a starting point for the construction of the empirical part, the function of the source text was considered to be the same as or similar to the function of the target text, that is, a functional thesaurus in both the source and the target context. Further, the study approached the challenges of multilingual thesaurus construction from the perspectives of semantics and pragmatics. In semantic analysis the focus was on what the words conventionally mean, and in pragmatics on the ‘invisible’ meaning, or how we recognise what is meant even when it is not actually said (or written). Languages and the ideas expressed by languages are created mainly in accordance with the expressional needs of the surrounding culture, and thesauri were considered to reflect several subcultures and consequently the discourses which represent them.
The research material consisted of different kinds of potential discourses: dictionaries, database records, and thesauri; Finnish versus British social science researchers; Finnish versus British indexers; simulated indexing tasks with five articles; and Finnish versus British thesaurus constructors. In practice, the professional backgrounds of the last two groups were rather similar. It became even clearer that all the material types had their own characteristics, although naturally they were not entirely separate from each other. It is further noteworthy that the different types and origins of research material were not used to represent true comparison pairs, and that the aim of triangulating methods and material was to gain a holistic view. The general research questions were: 1. Can differences be found between Finnish and British discourses regarding family roles as thesaurus terms, and if so, what kinds of differences, and what are the implications for multilingual thesaurus construction? 2. What is pragmatic indexing term equivalence? The first question examined how the same topic (family roles) was represented in different contexts and by different users, and further focused on how the possible differences were handled in multilingual thesaurus construction. The second question was based on the findings of the first, and addressed the final question of what kinds of factors should be considered when defining translation equivalence in multilingual thesaurus construction. The study used multiple cases and several data collection and analysis methods, aiming at theoretical replication and complementarity.
The empirical material and analysis consisted of focused interviews (with Finnish and British social scientists, thesaurus constructors and indexers), simulated indexing tasks with Finnish and British indexers, semantic component analysis of dictionary definitions and translations, co-word analysis and datasets retrieved from databases, and discourse analysis of thesauri. As a terminological starting point, family roles was selected as the topic and case. The results were clear: 1) It was possible to identify different discourses. Subdiscourses also existed. For example, within the group of social scientists, the orientation towards qualitative versus quantitative research had an impact on the way they reacted to the studied words and discourses, and indexers placed more emphasis on information seekers, whereas thesaurus constructors approached the construction problems from a more material-based standpoint. The differences between the specialist groups, i.e. the social scientists, the indexers and the thesaurus constructors, were often greater than those between the geo-cultural groups, i.e. Finnish versus British. The differences arose from different translation aims, diverging expectations for multilingual thesauri and a variety of practices. For multilingual thesaurus construction this means severe challenges. The clearly ambiguous concept of the multilingual thesaurus, as well as different construction and translation strategies, should be considered more precisely in order to shed light on focus and equivalence types, which are clearly not self-evident. The research also revealed the close connection between the aims of multilingual thesauri and pragmatic indexing term equivalence. 2) Pragmatic indexing term equivalence is very much context-dependent.
Although thesaurus term equivalence is defined and standardised in the field of library and information science (LIS), it is not understood in one established way, and current LIS tools do not provide adequate analytical means for either constructing or studying different kinds of multilingual thesauri and their indexing term equivalence. The tools provided in translation science were more practical and theoretical, and the division of the different meanings of a word in particular provided a useful tool for analysing pragmatic equivalence, which often differs from the ideal model represented in the thesaurus construction literature. The study thus showed that the variety of different discourses should be acknowledged, that there is a need to operationalise new types of multilingual thesauri, and that the factors influencing pragmatic indexing term equivalence should be discussed more precisely than is traditionally done.
Abstract:
This class introduces the basics of web mining and information retrieval, including, for example, an introduction to the Vector Space Model and text mining. Guest lecturer: Dr. Michael Granitzer. Optional reading: Modeling the Internet and the Web: Probabilistic Methods and Algorithms, Pierre Baldi, Paolo Frasconi, Padhraic Smyth, Wiley, 2003 (Chapter 4, Text Analysis).
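As a rough illustration of the Vector Space Model mentioned in this class description, the sketch below represents documents as TF-IDF weighted term vectors and ranks them against a query by cosine similarity. It is not taken from the course material: the whitespace tokenizer, the `log(n / df) + 1` weighting (one common variant among several), and all function names are assumptions chosen for brevity.

```python
import math
from collections import Counter


def weight(term_list, idf):
    """Turn a list of terms into a sparse TF-IDF vector (terms unknown to idf are dropped)."""
    return {t: count * idf[t] for t, count in Counter(term_list).items() if t in idf}


def build_index(docs):
    """Compute IDF weights and one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}
    return idf, [weight(toks, idf) for toks in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return document indices ordered by decreasing similarity to the query."""
    idf, vectors = build_index(docs)
    q = weight(query.lower().split(), idf)
    scored = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]
```

Documents that share no term with the query receive a similarity of zero and keep their original relative order at the end of the ranking.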
Abstract:
First conference. Digital Libraries and Repositories: Knowledge Management, Open Access and Latin American Visibility (BIREDIAL). May 9 to 11, 2011. Bogotá, Colombia.
Abstract:
Numerous linguistic operations have been assigned to cortical brain areas, but the contributions of subcortical structures to human language processing are still being discussed. Using simultaneous EEG recordings directly from deep brain structures and the scalp, we show that the human thalamus systematically reacts to syntactic and semantic parameters of auditorily presented language in a temporally interleaved manner in coordination with cortical regions. In contrast, two key structures of the basal ganglia, the globus pallidus internus and the subthalamic nucleus, were not found to be engaged in these processes. We therefore propose that syntactic and semantic language analysis is primarily realized within cortico-thalamic networks, whereas a cohesive basal ganglia network is not involved in these essential operations of language analysis.
Abstract:
Complex networks have been increasingly used in text analysis, including in connection with natural language processing tools, as important text features appear to be captured by the topology and dynamics of the networks. Following previous works that apply complex network concepts to text quality measurement, summary evaluation, and author characterization, we now focus on machine translation (MT). In this paper we assess the possible representation of texts as complex networks to evaluate cross-linguistic issues inherent in manual and machine translation. We show that translations of different quality generated by MT tools can be distinguished from their manual counterparts by means of metrics such as in-degree (ID) and out-degree (OD), clustering coefficient (CC), and shortest paths (SP). For instance, we demonstrate that the average OD in networks of automatic translations consistently exceeds the values obtained for manual ones, and that the CC values of source texts are not preserved in manual translations, but are preserved in good automatic translations. This probably reflects the text rearrangements humans perform during manual translation. We envisage that such findings could lead to better MT tools and automatic evaluation metrics.
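To make the network metrics named in this abstract concrete, the sketch below builds a directed word adjacency network from a text and computes in-degree, out-degree and a local clustering coefficient. The abstract does not describe the paper's preprocessing or exact metric definitions, so this is only one plausible reading under stated assumptions: each distinct word is a node, an edge links a word to the word that immediately follows it, no stopword removal or lemmatization is applied, and the clustering coefficient is computed on the undirected version of the graph.

```python
from collections import defaultdict


def word_adjacency_network(text):
    """Build successor and predecessor maps from consecutive word pairs."""
    words = text.lower().split()
    succ = defaultdict(set)
    pred = defaultdict(set)
    for a, b in zip(words, words[1:]):
        if a != b:  # ignore self-loops from repeated words
            succ[a].add(b)
            pred[b].add(a)
    return set(words), dict(succ), dict(pred)


def degrees(nodes, succ, pred):
    """Map each word to its (in-degree, out-degree) pair."""
    return {w: (len(pred.get(w, ())), len(succ.get(w, ()))) for w in nodes}


def average_out_degree(nodes, succ):
    """Mean number of distinct successors per word."""
    if not nodes:
        return 0.0
    return sum(len(succ.get(w, ())) for w in nodes) / len(nodes)


def clustering_coefficient(node, succ, pred):
    """Fraction of pairs of neighbors that are themselves linked (direction ignored)."""
    nbrs = sorted((succ.get(node, set()) | pred.get(node, set())) - {node})
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i, u in enumerate(nbrs)
        for v in nbrs[i + 1:]
        if v in succ.get(u, ()) or u in succ.get(v, ())
    )
    return 2.0 * links / (k * (k - 1))
```

Comparing `average_out_degree` for a source text and for its translation is the kind of measurement the abstract refers to, although real experiments would need the paper's own preprocessing to be comparable.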
Abstract:
Many countries have recognized the potential of medical tourism as an alternative source of economic growth. Especially after the economic crisis, many Asian countries entered medical tourism in the hope of escaping severe financial difficulty. However, only a few countries have so far managed to become well-known medical tourism destinations. With a growing number of competitors, countries that have newly entered medical tourism face difficulty in presenting themselves as attractive medical tourism destinations. South Korea, as a new medical tourism destination, should consider what to offer medical tourists in order to attract them. The aim of the thesis was to investigate the aspects influencing the participation of medical tourists, in order to discover how South Korea could develop into an attractive medical tourism destination. After examining the case study and the results of the text analysis, the researcher reached the conclusion that quality, cost and accessibility of treatment are the major reasons for participating in medical tourism. In the face of fierce competition, it is also important to develop offers that are differentiated from those of other destinations. Therefore, Korea should concentrate on specialized treatments and ICT systems to become an attractive medical tourism destination.
Abstract:
The objective of the present article is to identify and discuss the possibilities of using qualitative data analysis software within the framework of procedures proposed by SDI (socio-discursive interactionism), emphasizing freely distributed software or free versions of commercial software. A literature review of software for qualitative data analysis in the social sciences and humanities, focusing on language studies, is presented. Some tools, such as Weft QDA, MLCT, Yoshikoder and Tropes, are examined with their respective features and functions. The software called Tropes is examined in more detail because of its particular relation to language and semantic analysis, as well as its embedded classification of linguistic elements such as types of verbs, adjectives, modalizations, etc. Although completely automating an SDI-based analysis is not feasible, the programs appear to be powerful aids in analyzing specific questions. Still, it seems important to be familiar with the software options and to use different applications in order to obtain a more diversified view of the data. It is up to the researcher to be critical of the analysis provided by the machine.
Abstract:
While the use of statistical physics methods to analyze large corpora has been useful to unveil many patterns in texts, no comprehensive investigation has been performed on the interdependence between syntactic and semantic factors. In this study we propose a framework for determining whether a text (e.g., written in an unknown alphabet) is compatible with a natural language and to which language it could belong. The approach is based on three types of statistical measurements, i.e. obtained from first-order statistics of word properties in a text, from the topology of complex networks representing texts, and from intermittency concepts where text is treated as a time series. Comparative experiments were performed with the New Testament in 15 different languages and with distinct books in English and Portuguese in order to quantify the dependency of the different measurements on the language and on the story being told in the book. The metrics found to be informative in distinguishing real texts from their shuffled versions include assortativity, degree and selectivity of words. As an illustration, we analyze an undeciphered medieval manuscript known as the Voynich Manuscript. We show that it is mostly compatible with natural languages and incompatible with random texts. We also obtain candidates for keywords of the Voynich Manuscript which could be helpful in the effort of deciphering it. Because we were able to identify statistical measurements that are more dependent on the syntax than on the semantics, the framework may also serve for text analysis in language-dependent applications.
Abstract:
* This work was financially supported by RFBF-04-01-00858.
Abstract:
In this paper we introduce the online version of our ReaderBench framework, which includes multilingual, comprehension-centered web services designed to address a wide range of individual and collaborative learning scenarios, as follows. First, students can be engaged in reading course material and then eliciting their understanding of it; the reading strategies component provides an in-depth perspective on comprehension processes. Second, students can write an essay or a summary; the automated essay grading component gives them access to more than 200 textual complexity indices covering lexical, syntactic, semantic and discourse structure measurements. Third, students can start a discussion in a chat or a forum; the Computer Supported Collaborative Learning (CSCL) component provides in-depth conversation analysis by evaluating each member's involvement in the CSCL environment. Finally, the sentiment analysis, semantic models and topic mining components give a clearer perspective on learners' points of view and underlying interests.
Abstract:
With the dramatic growth of text information, there is an increasing need for powerful text mining systems that can automatically discover useful knowledge from text. Text is generally associated with all kinds of contextual information. Those contexts can be explicit, such as the time and the location where a blog article is written, and the author(s) of a biomedical publication, or implicit, such as the positive or negative sentiment that an author had when she wrote a product review; there may also be complex context such as the social network of the authors. Many applications require analysis of topic patterns over different contexts. For instance, analysis of search logs in the context of the user can reveal how we can improve the quality of a search engine by optimizing the search results according to particular users; analysis of customer reviews in the context of positive and negative sentiments can help the user summarize public opinions about a product; analysis of blogs or scientific publications in the context of a social network can facilitate discovery of more meaningful topical communities. Since context information significantly affects the choices of topics and language made by authors, it is very important to incorporate it into analyzing and mining text data. Modeling the context in text and discovering contextual patterns of language units and topics from text, a general task which we refer to as Contextual Text Mining, has widespread applications in text mining. In this thesis, we provide a novel and systematic study of contextual text mining, which is a new paradigm of text mining treating context information as the "first-class citizen." We formally define the problem of contextual text mining and its basic tasks, and propose a general framework for contextual text mining based on generative modeling of text.
This conceptual framework provides general guidance on text mining problems with context information and can be instantiated into many real tasks, including the general problem of contextual topic analysis. We formally present a functional framework for contextual topic analysis, with a general contextual topic model and its various versions, which can effectively solve the text mining problems in many real-world applications. We further introduce general components of contextual topic analysis, by adding priors to contextual topic models to incorporate prior knowledge, regularizing contextual topic models with the dependency structure of context, and postprocessing contextual patterns to extract refined patterns. The refinements on the general contextual topic model naturally lead to a variety of probabilistic models which incorporate different types of context and various assumptions and constraints. These special versions of the contextual topic model prove effective in a variety of real applications involving topics and explicit contexts, implicit contexts, and complex contexts. We then introduce a postprocessing procedure for contextual patterns, by generating meaningful labels for multinomial context models. This method provides a general way to interpret text mining results for real users. By applying contextual text mining in the "context" of other text information management tasks, including ad hoc text retrieval and web search, we further demonstrate the effectiveness of contextual text mining techniques quantitatively with large-scale datasets. The framework of contextual text mining not only unifies many explorations of text analysis with context information, but also opens up many new possibilities for future research directions in text mining.
Abstract:
This article undertakes a text analysis of the promotional materials generated by two educational brokers, the British Council’s Education Counselling Service (ECS) and Australia’s International Development Programme (IDP-Education Australia). By focusing on the micropractices of branding, the constructions of the "international student" and "international education" are examined to uncover the relations between international education and globalisation. The conclusion reached here is that the dominant marketing messages used to brand and sell education are unevenly weighted in favour of the economic imperative. International education remains fixed in modernist spatiotemporal contexts that ignore the challenges presented by globalisation. Developing new notions of international education will require a more critical engagement with the geopolitics of knowledge and with issues of subjectivity, difference, and power. Ultimately, a more sustained and comprehensive engagement with the noneconomic dimensions of globalisation will be necessary to achieve new visions of international education.