104 results for Abstractive summarization
Abstract:
In the last decade, a large number of social media services have emerged and become widely used in people's daily lives as important tools for sharing and acquiring information. With a substantial amount of user-contributed text data on social media, it has become necessary to develop text analysis methods and tools for this emerging data in order to deliver meaningful information to users. Previous work on text analytics over the last several decades has focused mainly on traditional types of text such as emails, news, and academic literature, and several issues critical to text data on social media have not been well explored: 1) how to detect sentiment in text on social media; 2) how to make use of social media's real-time nature; 3) how to address information overload for flexible information needs. In this dissertation, we focus on these three problems. First, to detect the sentiment of text on social media, we propose a non-negative matrix tri-factorization (tri-NMF) based dual active supervision method to minimize human labeling effort for this new type of data. Second, to make use of social media's real-time nature, we propose approaches to detect events from text streams on social media. Third, to address information overload for flexible information needs, we propose two summarization frameworks: a dominating-set-based summarization framework and a learning-to-rank-based summarization framework. The dominating-set-based framework can be applied to different types of summarization problems, while the learning-to-rank-based framework leverages existing training data to guide new summarization tasks. In addition, we integrate these techniques in an application study of event summarization for sports games as an example of how to better utilize social media data.
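As a rough illustration of the dominating-set idea mentioned above (not the dissertation's exact algorithm), the sketch below builds a sentence similarity graph and greedily extracts a dominating set as the summary; the word-overlap measure and the threshold are illustrative assumptions.

```python
# Minimal sketch of dominating-set extractive summarization: sentences are
# nodes, edges link sufficiently similar sentences, and the summary is a
# greedily chosen dominating set of that graph.

def word_overlap(a, b):
    """Jaccard overlap between the word sets of two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def dominating_set_summary(sentences, threshold=0.2):
    n = len(sentences)
    neighbors = {i: {i} for i in range(n)}  # each node covers itself
    for i in range(n):
        for j in range(i + 1, n):
            if word_overlap(sentences[i], sentences[j]) >= threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)
    uncovered, summary = set(range(n)), []
    while uncovered:
        # Greedy step: pick the sentence covering the most uncovered ones.
        best = max(uncovered, key=lambda i: len(neighbors[i] & uncovered))
        summary.append(sentences[best])
        uncovered -= neighbors[best]
    return summary
```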
Abstract:
Text summarization has been studied for over half a century, but traditional methods process texts empirically and neglect the fundamental characteristics and principles of language use and understanding. Automatic summarization is a desirable technique for processing big data. This reference summarizes previous text summarization approaches in a multi-dimensional category space, introduces a multi-dimensional methodology for research and development, unveils the basic characteristics and principles of language use and understanding, investigates some fundamental mechanisms of summarization, studies dimensions of representation, and proposes a multi-dimensional evaluation mechanism. The investigation extends to incorporating pictures into summaries and to the summarization of videos, graphs, and pictures, and converges on a general summarization method. Further, some basic behaviors of summarization are studied in the complex cyber-physical-social space. Finally, a creative summarization mechanism is proposed as a step toward the creative summarization of things: an open process of interactions among physical objects, data, people, and systems in cyber-physical-social space, viewed through a multi-dimensional lens of semantic computing. These insights can inspire research and development in many areas of computing.
Abstract:
Online Social Network (OSN) services provided by Internet companies bring people together to chat, share information, and consume information. Meanwhile, huge amounts of data are generated by those services (which can be regarded as social media) every day, every hour, even every minute and every second. Researchers are currently interested in analyzing OSN data, extracting interesting patterns from it, and applying those patterns to real-world applications. However, the large scale of OSN data makes it difficult to analyze effectively. This dissertation focuses on applying data mining and information retrieval techniques to mine two key components of social media data: users and user-generated content. Specifically, it aims at addressing three problems related to social media users and content: (1) how does one organize the users and the content? (2) how does one summarize the textual content so that users do not have to read every post to capture the general idea? (3) how does one identify the influential users in social media to benefit other applications, e.g., marketing campaigns? The contributions of this dissertation are briefly summarized as follows. (1) It provides a comprehensive and versatile data mining framework for analyzing users and user-generated content from social media. (2) It designs a hierarchical co-clustering algorithm to organize users and content. (3) It proposes multi-document summarization methods to extract core information from social network content. (4) It introduces three important dimensions of social influence and a dynamic influence model for identifying influential users.
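The dissertation designs its own hierarchical co-clustering algorithm, which is not reproduced here; as a minimal sketch of the underlying idea of jointly clustering users and terms, one can run scikit-learn's SpectralCoclustering on a toy user-by-term matrix (the data and cluster count below are assumptions).

```python
# Rough co-clustering illustration: rows (users) and columns (terms) are
# partitioned simultaneously, so users cluster together with the vocabulary
# they tend to use.
import numpy as np
from sklearn.cluster import SpectralCoclustering

# Toy user-by-term count matrix (rows: users, columns: terms).
X = np.array([
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [0, 1, 6, 5],
    [0, 0, 5, 6],
])

model = SpectralCoclustering(n_clusters=2, random_state=0)
model.fit(X)
print("user clusters:", model.row_labels_)
print("term clusters:", model.column_labels_)
```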
Abstract:
While news stories are an important traditional medium for broadcasting and consuming news, microblogging has recently emerged as a place where people can discuss, disseminate, collect, or report information about news. However, the massive amount of information in the microblogosphere makes it hard for readers to keep up with these real-time updates. This is especially a problem for breaking news, where people are eager to know “what is happening”. Therefore, this dissertation is intended as an exploratory effort to investigate computational methods that augment human effort in monitoring the development of breaking news on a given topic from a microblog stream, by extractively summarizing the updates in a timely manner. More specifically, given an interest in a topic, either entered as a query or presented as an initial news report, a microblog temporal summarization system is proposed to filter microblog posts from a stream with three primary concerns: topical relevance, novelty, and salience. Considering the relatively high arrival rate of microblog streams, a cascade framework consisting of three stages is proposed to progressively reduce the quantity of posts. For each step in the cascade, this dissertation studies methods that improve over current baselines. In the relevance filtering stage, query and document expansion techniques are applied to mitigate sparsity and vocabulary mismatch issues. The use of word embeddings as a basis for filtering is also explored, using unsupervised and supervised modeling to characterize lexical and semantic similarity. In the novelty filtering stage, several statistical ways of characterizing novelty are investigated, and ensemble learning techniques are used to integrate results from these diverse techniques. These results are compared with a baseline clustering approach using both standard and delay-discounted measures. In the salience filtering stage, because of the real-time prediction requirement, a method that learns verb phrase usage from past relevant news reports is used in conjunction with standard measures of writing quality. Following a Cranfield-like evaluation paradigm, this dissertation includes a series of experiments to evaluate the proposed methods for each step and for the end-to-end system. New microblog novelty and salience judgments are created, building on existing relevance judgments from the TREC Microblog track. The results point to future research directions at the intersection of social media, computational journalism, information retrieval, automatic summarization, and machine learning.
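A minimal sketch of the three-stage cascade described above, with simple stand-in scoring functions (cosine overlap for relevance and novelty, post length for salience) rather than the dissertation's trained models; all thresholds are illustrative assumptions.

```python
# Cascade filtering of a microblog stream: relevance -> novelty -> salience.
# Each stage discards posts, so later (costlier) stages see fewer candidates.
import math
from collections import Counter

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def cascade_filter(stream, query, rel_t=0.2, nov_t=0.8, sal_t=5):
    query_vec = Counter(query.lower().split())
    selected = []
    for post in stream:
        vec = Counter(post.lower().split())
        if cosine(vec, query_vec) < rel_t:            # stage 1: relevance
            continue
        if any(cosine(vec, Counter(s.lower().split())) > nov_t
               for s in selected):                    # stage 2: novelty
            continue
        if len(vec) < sal_t:                          # stage 3: crude salience
            continue
        selected.append(post)                         # emit as a timely update
    return selected
```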
Abstract:
This dissertation applies statistical methods to the evaluation of automatic summarization, using data from the Text Analysis Conferences in 2008-2011. Several aspects of the evaluation framework itself are studied, including the statistical testing used to determine significant differences, the assessors, and the design of the experiment. In addition, a family of evaluation metrics is developed to predict the score an automatically generated summary would receive from a human judge, and its results are demonstrated at the Text Analysis Conference. Finally, variations on the evaluation framework are studied and their relative merits considered. An overarching theme of this dissertation is the application of standard statistical methods to data that do not conform to the usual testing assumptions.
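As an example of the kind of statistical testing such an evaluation relies on, a paired Wilcoxon signed-rank test on hypothetical per-topic scores of two summarizers might look like this; the dissertation's point is precisely that such standard tests rest on assumptions the data may violate.

```python
# Paired significance test on per-topic summary scores (toy data).
from scipy.stats import wilcoxon

system_a = [0.41, 0.37, 0.45, 0.39, 0.50, 0.36, 0.44, 0.48]
system_b = [0.38, 0.36, 0.40, 0.41, 0.46, 0.33, 0.42, 0.43]

stat, p = wilcoxon(system_a, system_b)
print(f"Wilcoxon statistic = {stat:.2f}, p = {p:.3f}")
# A small p suggests the per-topic differences are systematic rather than
# noise, provided the test's assumptions hold for this data.
```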
Abstract:
We present the design and deployment results for PosNet, a large-scale, long-duration sensor network that gathers summary position and status information from mobile nodes. The mobile nodes have a fixed-size memory buffer to which position data is added at a constant rate and from which data is downloaded at a non-constant rate. We have developed a novel algorithm that performs online summarization of position data within the buffer; the algorithm naturally accommodates the mismatch between data input and output rates and also provides a delay-tolerant approach to data transport. The algorithm has been extensively tested in a large-scale, long-duration cattle monitoring and control application.
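The paper's exact buffer algorithm is not reproduced here; a generic sketch of online position summarization in a fixed-size buffer, assuming eviction of the interior point whose removal least distorts the track, could look like this.

```python
# Fixed-capacity buffer thinning: when full, drop the interior point with the
# smallest perpendicular distance to the chord between its neighbors, i.e.
# the point contributing least to the track's shape.

def chord_distance(p, a, b):
    """Perpendicular distance from point p to the chord a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = (dx * dx + dy * dy) ** 0.5
    if norm == 0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    return abs(dy * (px - ax) - dx * (py - ay)) / norm

def add_position(buffer, point, capacity=64):
    buffer.append(point)
    if len(buffer) > capacity:
        i = min(range(1, len(buffer) - 1),
                key=lambda k: chord_distance(buffer[k],
                                             buffer[k - 1], buffer[k + 1]))
        del buffer[i]  # evict the least informative interior point
    return buffer
```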
Abstract:
Today, user-generated information such as online reviews has become increasingly significant for customers in the decision-making process. Meanwhile, as the volume of online reviews proliferates, there is an insistent demand to help users tackle the information overload problem. To extract useful information from overwhelming numbers of reviews, considerable work has been proposed, such as review summarization and review selection. In particular, to avoid redundant information, researchers attempt to select a small set of reviews to represent the entire review corpus by preserving its statistical properties (e.g., opinion distribution). However, one significant drawback of existing work is that it only measures the utility of the extracted reviews as a whole, without considering the quality of each individual review. As a result, the set of chosen reviews may contain low-quality ones even if its statistical properties are close to those of the original review corpus, which users do not prefer. In this paper, we propose a review selection method that takes review quality into consideration during the selection process. Specifically, we examine the relationships between product features based on a domain ontology to capture review characteristics, and use these to select reviews that are of good quality and also preserve the opinion distribution. Our experimental results on real-world review datasets demonstrate that our proposed approach is feasible and effectively improves the performance of review selection.
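A sketch of quality-aware review selection in the spirit described above: greedily trade off review quality against closeness to the corpus opinion distribution. The L1 distribution distance, the label set, and the weight lam are illustrative assumptions, not the paper's ontology-based formulation.

```python
# Greedy review selection balancing individual quality against preserving the
# corpus-level opinion distribution.
from collections import Counter

def distance(p, q, labels=("pos", "neg", "neu")):
    """L1 distance between two normalized opinion-label distributions."""
    total_p, total_q = sum(p.values()) or 1, sum(q.values()) or 1
    return sum(abs(p[l] / total_p - q[l] / total_q) for l in labels)

def select_reviews(reviews, k, lam=0.5):
    """reviews: list of (text, opinion_label, quality in [0, 1])."""
    corpus = Counter(op for _, op, _ in reviews)
    chosen, chosen_ops = [], Counter()
    candidates = list(reviews)
    for _ in range(k):
        def gain(r):
            trial = chosen_ops + Counter([r[1]])
            # Reward quality, penalize drift from the corpus distribution.
            return lam * r[2] - (1 - lam) * distance(trial, corpus)
        best = max(candidates, key=gain)
        candidates.remove(best)
        chosen.append(best)
        chosen_ops[best[1]] += 1
    return chosen
```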
Abstract:
Users can rarely reveal their information need in full detail to a search engine within 1--2 words, so search engines need to "hedge their bets" and present diverse results within the precious 10 response slots. Diversity in ranking is of much recent interest. Most existing solutions estimate the marginal utility of an item given a set of items already in the response, and then use variants of greedy set cover. Others design graphs with the items as nodes and choose diverse items based on visit rates (PageRank). Here we introduce a radically new and natural formulation of diversity as finding centers in resistive graphs. Unlike in PageRank, we do not specify the edge resistances (equivalently, conductances) and ask for node visit rates. Instead, we look for a sparse set of center nodes so that the effective conductance from the center to the rest of the graph has maximum entropy. We give a cogent semantic justification for turning PageRank thus on its head. In marked deviation from prior work, our edge resistances are learnt from training data. Inference and learning are NP-hard, but we give practical solutions. In extensive experiments with subtopic retrieval, social network search, and document summarization, our approach convincingly surpasses recently published diversity algorithms like subtopic cover, maximal marginal relevance (MMR), Grasshopper, DivRank, and SVMdiv.
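The formulation builds on effective conductance in resistive graphs. Below is a small numpy sketch of the standard computation of effective resistance from the pseudoinverse of the graph Laplacian; the paper's center-selection and learning steps are not reproduced, and the example graph is an assumption.

```python
# Effective resistance between all node pairs of a weighted graph, via the
# Moore-Penrose pseudoinverse of the Laplacian; effective conductance is the
# reciprocal of effective resistance.
import numpy as np

def effective_resistance(W):
    """W: symmetric nonnegative edge-weight (conductance) matrix."""
    L = np.diag(W.sum(axis=1)) - W   # graph Laplacian
    Lp = np.linalg.pinv(L)           # pseudoinverse (L is singular)
    n = len(W)
    R = np.zeros((n, n))
    for u in range(n):
        for v in range(n):
            R[u, v] = Lp[u, u] + Lp[v, v] - 2 * Lp[u, v]
    return R

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(effective_resistance(W).round(2))
```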
Abstract:
In this paper, we present a novel approach that makes use of topic models based on Latent Dirichlet Allocation (LDA) to generate single-document summaries. Our approach is distinguished from other LDA-based approaches in that we identify the summary topics that best describe a given document and extract sentences only from those paragraphs within the document that are highly correlated with the summary topics. This ensures that our summaries always highlight the crux of the document without attending to the grammar or structure of the documents. Finally, we evaluate our summaries on the DUC 2002 single-document summarization corpus using ROUGE measures. Our summaries had higher ROUGE values and better semantic similarity with the documents than the DUC summaries.
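A minimal sketch of the general recipe (fit LDA, identify a dominant summary topic, extract the sentences most aligned with it), simplified here to sentence-level rather than paragraph-level selection; the hyperparameters are illustrative, not the paper's settings.

```python
# LDA-based extractive summarization sketch: score each sentence by the
# probability mass it places on the document's dominant topic.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def lda_summary(sentences, n_topics=5, n_sents=3):
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    sent_topics = lda.fit_transform(X)       # sentence-topic distributions
    doc_topics = sent_topics.sum(axis=0)     # aggregate to a document view
    top_topic = doc_topics.argmax()          # dominant "summary topic"
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sent_topics[i, top_topic], reverse=True)
    # Return the top sentences, restored to document order.
    return [sentences[i] for i in sorted(ranked[:n_sents])]
```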
Abstract:
Measuring semantic similarity and relatedness between textual items (words, sentences, paragraphs, or even documents) is a very important research area in Natural Language Processing (NLP). In fact, it has many practical applications in other NLP tasks, for instance Word Sense Disambiguation, Textual Entailment, paraphrase detection, Machine Translation, summarization, and related tasks such as Information Retrieval and Question Answering. In this master's thesis we study different approaches to computing the semantic similarity between textual items. In the framework of the European PATHS project, we also evaluate a knowledge-based method on a dataset of cultural item descriptions. Additionally, we describe the work carried out for the Semantic Textual Similarity (STS) shared task of SemEval-2012. This work involved supporting the creation of datasets for similarity tasks, as well as the organization of the task itself.
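One simple unsupervised baseline for the similarity task studied here is cosine similarity between TF-IDF vectors; the thesis examines a range of approaches, including knowledge-based ones, so this is only the most basic vector-space sketch.

```python
# TF-IDF cosine similarity between two texts, a common STS baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def sts_score(text_a, text_b):
    tfidf = TfidfVectorizer().fit_transform([text_a, text_b])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

print(sts_score("A man is playing a guitar.",
                "Someone plays an acoustic guitar."))
```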
Abstract:
This paper briefly describes a general method for finite element analysis of the thermoelastic deformation of large optical mirrors using the ANSYS software, highlights several issues and practical techniques that deserve attention during the analysis, and presents a worked example analyzing the gravity-induced deformation of a lens.
Abstract:
This thesis aims to structure the constitutional prerequisites that the current, humanized content of the participatory adversarial principle (contraditório) imposes on techniques for summarizing cognition. The first part of the study examines the role of the adversarial principle in the civil procedure system and its current minimum content, drawing on international experience, especially that of the human rights courts, contrasted with the evolutionary stage of Brazilian case law. The second part studies the pressure that demands for speed exert on the boundaries of the adversarial principle, examining data made available by the Conselho Nacional de Justiça and other institutes and the content of the right to a reasonable duration of proceedings, again supported by the experience of the international human rights courts, with a careful examination of the judgments handed down against Brazil by the Inter-American Court of Human Rights and of domestic case law on the subject, which denies injured parties the right to compensation for harm suffered through unjustified delays. With these foundations established, the thesis analyzes techniques for summarizing cognition, together with their grounds, objectives, and types. Summary cognition is defined in contrast to full cognition, under which the parties may fully exercise in court the rights inherent to participatory adversarial proceedings. The final part structures the constitutional prerequisites that legitimize the use of summary-cognition techniques, prerequisites imposed by the adversarial principle as a brake on the constant pressure for speed. Under the current constitutional framework, the legitimate use of differentiated protection techniques that rely on summary cognition to accelerate results presupposes (i) observance of the essential core of the adversarial principle, identified with the bilateral hearing, throughout the entire course of the proceedings; (ii) legislative predetermination, so that cognitive shortcuts are not made on a case-by-case basis; (iii) an opportunity, guaranteed to the parties, to complete the adversarial exchange in another phase or proceeding under full cognition; and (iv) the maintenance of balance in the stabilization of results, since summary cognition, being marked by incompleteness, cannot be exhaustive in itself. Finally, after examining the waivable character of these guarantees, the thesis analyzes several existing procedural institutions in which traces of summary cognition can be identified, and indicates the legislative corrections needed to bring these models into line with the proposed legitimizing standards, rebalancing the foundations of the civil procedure system.
Abstract:
To increase the usability of a medical device, the usability engineering standards IEC 60601-1-6 and IEC 62366 suggest incorporating user information in the design and development process. However, practice shows that integrating user information, and the related investigation of users known as user research, is difficult in the field of medical devices. In particular, identifying the most appropriate user research methods is a difficult process. This difficulty results from the complexity of the medical device industry, especially with respect to regulations and standards, the characteristics of this market, and the broad range of potential user research methods available from various research disciplines. Against this background, this study aimed at guiding designers and engineers in selecting effective user research methods according to their stage in the design process. Two approaches are described that reduce the complexity of method selection by summarizing the large number of methods into homogeneous method classes. These approaches are closely connected to the design phases characteristic of the medical device industry and therefore make it possible to select design-phase-specific user research methods. In the first approach, potential user research methods are classified according to their characteristics in the design process. The second approach summarizes methods according to their similarity in data collection techniques and provides an additional linkage to design phase characteristics. Both approaches have been tested in practice, and the results show that both facilitate user research method selection.
Abstract:
Freenet is a distributed information storage and retrieval system designed to guarantee the privacy and availability of information. The system operates like a location-independent distributed file system composed of many independent computers that allow users to insert, store, and request files anonymously. In Freenet, the anonymity of publication and the identification, publication, storage, and retrieval of files are all closely tied to Freenet's key and indexing mechanisms. This paper analyzes and explores Freenet's key mechanism to some degree, in the hope of aiding further research into and understanding of how Freenet operates.
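As a toy illustration of the content-hash keying idea behind Freenet's content-hash keys (CHKs), and not Freenet's exact key format or encryption scheme:

```python
# A file's key is derived from a hash of its contents, so the same data always
# maps to the same key regardless of where it is stored; this underpins the
# location-independent retrieval described above.
import hashlib

def content_hash_key(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

key = content_hash_key(b"an anonymously inserted file")
print("CHK-like key:", key)
```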