16 results for Information Filtering, Pattern Mining, Relevance Feature Discovery, Text Mining
Abstract:
Dissertation submitted in partial fulfilment of the requirements for the Degree of Master of Science in Geospatial Technologies
Abstract:
Dissertation presented as a partial requirement for obtaining the degree of Master in Statistics and Information Management
Abstract:
Until recently, hardly anyone could have predicted the current course of GIS development. GIS is moving from the desktop to the cloud. Web 2.0 has enabled people to contribute data to the web, and these data are increasingly geolocated. The resulting large volumes of data are commonly called "Big Data", and scientists still do not know how to deal with them completely. Different Data Mining tools are used to try to extract useful information from this Big Data. Our study deals with one part of these data: User Generated Geographic Content (UGGC). The Panoramio initiative allows people to upload photos and describe them with tags. These photos are geolocated, meaning that they have an exact location on the Earth's surface according to a certain spatial reference system. Using Data Mining tools, we try to answer whether it is possible to extract land use information from Panoramio photo tags, and to what extent this information can be accurate. Finally, we compare different Data Mining methods to determine which one performs best on this kind of data, which is text. Our answers are quite encouraging. With an accuracy of more than 70%, we show that extracting land use information is possible to some extent. We also found the Memory Based Reasoning (MBR) method to be the most suitable for this kind of data in all cases.
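As an illustration of the memory-based reasoning idea mentioned in this abstract, the following is a minimal sketch and not the thesis's implementation. It assumes Jaccard similarity between tag sets and a majority vote among the k most similar stored photos; the training tags, the land use labels and the function names are all hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two tag collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mbr_classify(query_tags, labelled_photos, k=3):
    """Label a photo by majority vote among its k most similar stored photos."""
    ranked = sorted(
        labelled_photos,
        key=lambda photo: jaccard(query_tags, photo[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled examples: (tag set, land use class).
TRAINING = [
    ({"beach", "sea", "sand"}, "coastal"),
    ({"sea", "harbour", "boats"}, "coastal"),
    ({"beach", "sunset", "waves"}, "coastal"),
    ({"street", "building", "traffic"}, "urban"),
    ({"building", "square", "church"}, "urban"),
    ({"forest", "trees", "trail"}, "forest"),
]

print(mbr_classify({"sand", "sea", "waves"}, TRAINING))
```

The method stores the labelled examples as they are and defers all work to query time, which is why no separate training step appears in the sketch.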
Abstract:
Dissertation submitted in partial fulfilment of the requirements for the Degree of Master of Science in Geospatial Technologies
Abstract:
Paper presented at the 9th European Conference on Knowledge Management, Southampton Solent University, Southampton, UK, 4-5 Sep. 2008. URL: http://academic-conferences.org/eckm/eckm2008/eckm08-home.htm
Abstract:
Dissertation presented at the Faculdade de Ciências e Tecnologia of the Universidade Nova de Lisboa for obtaining the degree of Master in Electrical and Computer Engineering
Abstract:
In the stock market, information takes on special relevance, due to the market’s permanent updating and the great fluidity of the information within it. Just as in any other negotiation, the party with the better information has a bargaining advantage, as it is able to make more advantageous business decisions. However, unlike most other markets, the proper functioning of the stock market is greatly dependent on investors’ trust in the market itself. As such, if there are investors who, due to any condition they possess or office they hold, have access to relevant information which is not accessible to the general public, distrust is bred within the market and, consequently, investment is lessened. Thus, there is a need to prevent those who hold privileged information from using it in abusive ways. In Portugal, abuse of privileged information is set out and punished criminally in Article 378 of the Portuguese Securities Code (‘Código dos Valores Mobiliários’). In this dissertation, I have set out, firstly, to analyze the conditions required for there to be a crime of abuse of privileged information; secondly, to analyze two well-known cases, which took place and were decided in other jurisdictions, and attempt to understand how these cases would fall under Article 378 of the Portuguese Securities Code. Whereas the first case, Chiarella v. United States, was scrutinized under Article 378 of the Portuguese Securities Code, in the second, Lafonta v. AMF, the conclusion arrived at was that the crime that had taken place was a different one. This analysis allowed, on the one hand, the application to a particular case of prerequisites and concepts which had first been explained from a more theoretical perspective; on the other hand, it also allowed the further development of specific aspects of the regime, namely the difference between an insider and a tippee, as well as a clearer definition of the limits to the precise character of the information at hand.
Abstract:
Dissertation presented to obtain the Master’s Degree in Electrical Engineering and Computer Science at Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia
Abstract:
Driven by the growth of the internet and the semantic web, together with faster communication and the rapid growth of storage capacity, the volume of data and information rises considerably every day. Because of this, in the last few years there has been a growing interest in structures for formal representation with suitable characteristics, such as the ability to organize data and information and to reuse their contents to generate new knowledge. Controlled vocabularies, and ontologies in particular, stand out as one such representation structure with high potential. They allow not only for data representation but also for the reuse of such data for knowledge extraction, coupled with its subsequent storage through relatively simple formalisms. However, to ensure that the knowledge in an ontology is always up to date, ontologies need maintenance. Ontology Learning is the area that studies the update and maintenance of ontologies. The relevant literature already presents first results on the automatic maintenance of ontologies, but these are still at a very early stage. Human-based processes are still the usual way to update and maintain an ontology, which makes this a cumbersome task. New knowledge for ontology growth can be generated using Data Mining, an area that studies techniques for data processing, pattern discovery and knowledge extraction in IT systems. This work proposes a novel semi-automatic method for knowledge extraction from unstructured data sources using Data Mining techniques, namely pattern discovery, focused on improving the precision of the concepts and semantic relations present in an ontology. To verify the applicability of the proposed method, a proof of concept was developed and applied to the building and construction sector, and its results are presented.
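The abstract does not say which pattern discovery technique is used. As a hedged illustration only, the sketch below counts term pairs that co-occur in at least `min_support` documents, one simple form of frequent pattern mining that could suggest candidate relations between ontology concepts. The function name, the threshold and the sample terms are assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(docs, min_support=2):
    """Return term pairs that co-occur in at least min_support documents."""
    counts = Counter()
    for terms in docs:
        # Each unordered pair is counted once per document, in sorted order.
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical term lists extracted from three construction-sector documents.
DOCS = [
    ["concrete", "beam", "column"],
    ["concrete", "beam"],
    ["beam", "window"],
]

print(frequent_pairs(DOCS))
```

In a semi-automatic setting such as the one described, pairs above the support threshold would be shown to a human expert as candidates rather than added to the ontology directly.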
Abstract:
The principal topic of this work is the application of data mining techniques, in particular machine learning, to the discovery of knowledge in a protein database. The first chapter presents general background: section 1.1 gives an overview of the methodology of a Data Mining project and its main algorithms, and section 1.2 introduces proteins and their supporting file formats. The chapter concludes with section 1.3, which defines the main problem we intend to address with this work: determining whether an amino acid is exposed or buried in a protein, in a discrete (i.e., not continuous) way, for five exposure levels: 2%, 10%, 20%, 25% and 30%. The second chapter, closely following the CRISP-DM methodology, presents the whole process of constructing the database that supported this work, namely the process of loading data from the Protein Data Bank, DSSP and SCOP. An initial data exploration is then performed and a simple baseline prediction model of the relative solvent accessibility of an amino acid is introduced. The chapter also introduces the Data Mining Table Creator, a program developed to produce the data mining tables required for this problem. In the third chapter the results obtained are analyzed with statistical significance tests. First, the classifiers used (Neural Networks, C5.0, CART and CHAID) are compared, and it is concluded that C5.0 is the most suitable for the problem at stake. The influence of parameters such as the amino acid information level, the amino acid window size and the SCOP class type on the accuracy of the predictive models is also compared. The fourth chapter starts with a brief review of the literature on amino acid relative solvent accessibility. We then summarize the main results achieved and finally discuss possible future work. The fifth and last chapter consists of appendices. Appendix A has the schema of the database that supported this thesis. Appendix B has a set of tables with additional information. Appendix C describes the software provided on the DVD accompanying this thesis, which allows the reconstruction of the present work.
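The discrete exposed/buried target described in this abstract can be sketched as follows. This is a minimal illustration under one assumption: a residue counts as exposed when its relative solvent accessibility is greater than or equal to the threshold (the thesis may use a strict inequality). The function name is hypothetical; the five thresholds are those listed in the abstract.

```python
# Exposure levels from the abstract, in percent.
THRESHOLDS = (2, 10, 20, 25, 30)


def exposure_labels(rsa_percent, thresholds=THRESHOLDS):
    """Map a relative solvent accessibility (in %) to one binary label per threshold.

    Assumption: "exposed" means rsa_percent >= threshold.
    """
    return {
        t: ("exposed" if rsa_percent >= t else "buried")
        for t in thresholds
    }


# A residue with 22% relative solvent accessibility.
print(exposure_labels(22.0))
```

Each threshold yields its own binary classification problem, so the same residue can be exposed at one level and buried at another.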
Abstract:
A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Information Systems.
Abstract:
Project Work presented as a partial requirement for obtaining the degree of Master in Statistics and Information Management
Abstract:
Dissertation for obtaining the degree of Master in Informatics Engineering
Abstract:
The extraction of relevant terms from texts is an extensively researched task in Text Mining. Relevant terms have been applied in areas such as Information Retrieval and document clustering and classification. However, relevance has a rather fuzzy nature, since the classification of some terms as relevant or not relevant is not consensual. For instance, while words such as "president" and "republic" are generally considered relevant by human evaluators, and words like "the" and "or" are not, terms such as "read" and "finish" gather no consensus about their semantics and informativeness. Concepts, on the other hand, have a less fuzzy nature. Therefore, instead of deciding on the relevance of a term during the extraction phase, as most extractors do, I propose to first extract from texts what I have called generic concepts (all concepts) and to postpone the decision about relevance to downstream applications, according to their needs. For instance, a keyword extractor may assume that the most relevant keywords are the most frequent concepts in the documents. Moreover, most statistical extractors are incapable of extracting single-word and multi-word expressions using the same methodology. These factors led to the development of the ConceptExtractor, a statistical and language-independent methodology which is explained in Part I of this thesis. In Part II, I will show that the automatic extraction of concepts has great applicability. For instance, for the extraction of keywords from documents, using the Tf-Idf metric only on concepts yields better results than using Tf-Idf without concepts, especially for multi-words. In addition, since concepts can be semantically related to other concepts, this allows us to build implicit document descriptors. These applications led to published work. Finally, I will present some work that, although not published yet, is briefly discussed in this document.
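To make the idea of applying Tf-Idf to concepts concrete, here is a minimal sketch. It assumes each document has already been reduced to a list of concepts (the ConceptExtractor itself is not reproduced), and it uses one common Tf-Idf variant, relative frequency times log(N/df), which may differ from the formula used in the thesis. The sample concepts are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score each concept in each document with tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each concept appears.
    df = Counter(concept for doc in docs for concept in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append(
            {c: (k / total) * math.log(n_docs / df[c]) for c, k in counts.items()}
        )
    return scores


# Hypothetical documents, each given as a list of extracted concepts
# (multi-word concepts are kept as single items).
DOCS = [
    ["text mining", "concept", "concept"],
    ["text mining", "keyword"],
    ["text mining", "clustering"],
]

scores = tf_idf(DOCS)
# The highest-scoring concept of the first document is its keyword candidate.
print(max(scores[0], key=scores[0].get))
```

Because multi-word concepts are treated as single items, single-word and multi-word candidates are ranked on the same scale, which is the property the abstract highlights.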
Abstract:
Currently, with the widespread use of social networks, companies convey their message through their communication channels, but consumers give their opinion on it. They argue, comment and criticize (Nardi, Schiano, Gumbrecht, & Swartz, 2004), positively or negatively. In this context, Text Mining emerges as an interesting approach to meeting the need to obtain knowledge from existing data. In this work we used a hierarchical Clustering algorithm to discover distinct themes in a set of tweets collected over a given period of time for the companies Burger King and McDonald’s. To understand the sentiment associated with these themes, a sentiment analysis was performed on each theme found, using a Bag-of-Words algorithm. We concluded that the Clustering algorithm was able to find themes in the collected tweets, essentially linked to products and services sold by the companies. The Sentiment Analysis algorithm assigned a sentiment to these themes, making it possible to understand which of the identified products/services obtained a positive or negative polarity, and thus to flag potentially problematic situations in the companies’ strategy, as well as positive situations that allow successful operational decisions to be identified.
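A bag-of-words sentiment step of the kind mentioned in this abstract can be sketched as below. This is an illustration under stated assumptions, not the method used in the work: the two word lists are tiny hypothetical lexicons, and polarity is simply the count of positive words minus the count of negative words.

```python
import re
from collections import Counter

# Hypothetical sentiment lexicons; a real study would use a much larger resource.
POSITIVE = {"love", "great", "tasty"}
NEGATIVE = {"cold", "slow", "bad"}


def polarity(text):
    """Classify text as positive, negative or neutral from word counts alone."""
    words = Counter(re.findall(r"[a-z']+", text.lower()))
    score = sum(words[w] for w in POSITIVE) - sum(words[w] for w in NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("Fries were cold and service was slow"))
```

Applied per theme, the same function could be run on the concatenated tweets of each cluster to give that theme a single polarity.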