948 results for Information Mining
Abstract:
EMAp - Escola de Matemática Aplicada
Abstract:
With the growing number of XML documents across varied domains, it has become essential to find ways of extracting interesting information from them. Data mining techniques are used to derive this information. Because XML documents are semi-structured, the model used to represent them affects how they can be mined. In this chapter we therefore present an overview of the various models of XML documents, how these models have been used for mining, and some of the issues and challenges they raise. The chapter also offers some insights into future models of XML documents that effectively capture their two important features for mining, namely structure and content.
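To make the distinction between structure and content concrete, here is a minimal sketch, not taken from the chapter, that turns one XML document into two feature sets: root-to-node tag paths (structure) and lower-cased text tokens (content). The function name `structure_and_content` and the sample document are hypothetical, and text that follows a closing tag (tail text) is ignored for brevity.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def structure_and_content(xml_text):
    """Return (tag-path counts, text tokens) for one XML document."""
    root = ET.fromstring(xml_text)
    paths = Counter()   # structural features: how often each tag path occurs
    terms = []          # content features: tokens found in element text

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        paths[path] += 1
        if node.text and node.text.strip():
            terms.extend(node.text.lower().split())
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths, terms


# Hypothetical sample document, for illustration only.
doc = "<book><title>XML Mining</title><author><name>Ann Lee</name></author></book>"
paths, terms = structure_and_content(doc)
print(sorted(paths))
print(terms)
```

A structure-only model would mine `paths`, a content-only model would mine `terms`, and a combined model would use both.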
Abstract:
The aim of this paper is to evaluate the efficacy of the application WebBootCaT for creating specialised corpora automatically, by investigating the translation of articles of association from Italian into English. The first section reviews the relevant literature and argues for the usefulness of corpora for translators. The second section discusses the methodology employed, and the third section analyses the results obtained and comments on how language professionals could exploit the application to the full. The fourth section provides a few concrete usage examples of the corpora thus built, and concludes that WebBootCaT is a genuinely powerful tool that professional translators could adopt in order to save time and improve their translations in the long term.
Abstract:
Analysing wastewater samples is an innovative approach that overcomes many limitations of traditional surveys in identifying and measuring a range of chemicals that people living in a sewer catchment area have consumed or been exposed to. Since the approach was first conceptualised in 2001, much progress has been made towards making wastewater analysis (WWA) a reliable and robust tool for measuring chemical consumption and/or exposure. At the moment, the most popular application of WWA, sometimes referred to as sewage epidemiology, is monitoring the consumption of illicit drugs in communities around the globe, including China. The approach has been largely adopted by law enforcement agencies as a means of monitoring the temporal and geographical patterns of drug consumption. In the future, the methodology can be extended to other chemicals, including biomarkers of population health (e.g. environmental or oxidative stress biomarkers, lifestyle indicators or medications that are taken by different demographic groups) and pollutants that people are exposed to (e.g. polycyclic aromatic hydrocarbons, perfluorinated chemicals, and toxic pesticides). The extension of WWA to a huge range of chemicals may give rise to a field called sewage chemical-information mining (SCIM), whose potential is still unexplored. China has many densely populated cities with thousands of sewage treatment plants, which makes it well suited to applying WWA/SCIM to help the relevant authorities gather information about illicit drug consumption and population health status. However, the methodology has some prerequisites and uncertainties that should be addressed for SCIM to reach its full potential in China.
Abstract:
Sensor networks can be naturally represented as graphical models, where the edge set encodes the presence of sparsity in the correlation structure between sensors. Such graphical representations can be valuable for information mining purposes as well as for optimizing bandwidth and battery usage with minimal loss of estimation accuracy. We use a computationally efficient technique for estimating sparse graphical models, which fits a sparse linear regression locally at each node of the graph via the Lasso estimator. Using a recently suggested online, temporally adaptive implementation of the Lasso, we propose an algorithm for streaming graphical model selection over sensor networks. With battery consumption minimization applications in mind, we use this algorithm as the basis of an adaptive querying scheme. We discuss implementation issues in the context of environmental monitoring using sensor networks, where the objective is short-term forecasting of local wind direction. The algorithm is tested against real UK weather data, and conclusions are drawn about certain tradeoffs inherent in decentralized sensor network data analysis. © 2010 The Author. Published by Oxford University Press on behalf of The British Computer Society. All rights reserved.
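The node-wise Lasso idea described above can be sketched as follows. This is a batch illustration only, not the authors' online, temporally adaptive algorithm: each sensor is regressed on all the others with a plain coordinate-descent Lasso, and an edge is kept when either regression selects it (the "or" rule, an assumption here). The function names, the penalty `lam=0.3`, and the synthetic three-sensor data are all hypothetical.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_cd(X, y, lam, sweeps=100):
    """Minimise (1/2n)||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, with beta = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the residual that excludes coordinate j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def neighbourhood_graph(data, lam):
    """Regress each sensor on the others; keep an edge if either side selects it."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in data]
    edges = set()
    for k in range(p):
        others = [j for j in range(p) if j != k]
        X = [[row[j] for j in others] for row in centred]
        y = [row[k] for row in centred]
        for j, b in zip(others, lasso_cd(X, y, lam)):
            if abs(b) > 1e-8:
                edges.add((min(j, k), max(j, k)))
    return edges


# Synthetic readings: sensor 1 tracks sensor 0, sensor 2 is independent.
random.seed(0)
readings = []
for _ in range(400):
    a = random.gauss(0, 1)
    readings.append([a, a + 0.3 * random.gauss(0, 1), random.gauss(0, 1)])

edges = neighbourhood_graph(readings, lam=0.3)
print(sorted(edges))
```

A larger `lam` gives a sparser graph. In a streaming setting the per-node regressions would be updated incrementally rather than refitted from scratch.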
Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
Abstract:
This paper describes a framework for annotation on travel blogs based on subjectivity (FATS). The framework can automatically annotate, sentence by sentence, sections (posts) of Spanish-language blogs about travelling. In this experiment, FATS is used to annotate components of travel blogs in order to create a corpus of 300 annotated posts. Each subjective element in a sentence is annotated as positive or negative, as appropriate. Currently, about 95 per cent of the annotations are correct in our subset of the travel domain. Through an iterative annotation process, we can create a subjectively annotated, domain-specific corpus.
Abstract:
This paper presents an approach for comparing two types of data, subjective data (the polarity of the 2011 Pan American Games event by country) and objective data (the number of medals won by each participating country), based on the Pearson correlation. When dealing with events described by people, knowledge acquisition is difficult because the structure of the descriptions is heterogeneous and subjective. A first step towards determining the polarity of the information provided by people is to classify the posts automatically into clusters according to their polarity. The authors carried out a set of experiments using a corpus of 5600 posts extracted from 168 Internet resources related to a specific event: the 2011 Pan American Games. The approach is based on four components: a crawler, a filter, a synthesizer and a polarity analyzer. The PanAmerican approach automatically classifies the polarity of the event into clusters with the following results: 588 positive, 336 neutral, and 76 negative. We found a correlation of .74, from which we conclude that the polarity of the content produced is strongly influenced by the results of the event. Finally, the precision of the PanAmerican approach for the three polarity classes evaluated is .87, .90, and .80.
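As a reminder of how the Pearson correlation used above compares a subjective series with an objective one, here is a minimal sketch. The per-country polarity scores and medal counts are invented for illustration and are not the paper's data; the function name `pearson` is likewise hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Invented example values: one polarity score and one medal count per country.
polarity = [0.1, 0.4, 0.5, 0.9]
medals = [2, 10, 8, 30]
r = pearson(polarity, medals)
print(round(r, 2))
```

Values near 1 indicate that countries with more medals also tend to receive more positive content, which is the kind of relationship the reported .74 describes.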