899 results for Information Filtering, Pattern Mining, Relevance Feature Discovery, Text Mining
Abstract:
The principal topic of this work is the application of data mining techniques, in particular machine learning, to the discovery of knowledge in a protein database. The first chapter presents the general background. Section 1.1 gives an overview of the methodology of a data mining project and its main algorithms. Section 1.2 introduces proteins and their supporting file formats. The chapter concludes with section 1.3, which defines the main problem we intend to address in this work: determining whether an amino acid is exposed or buried in a protein, in a discrete (i.e., non-continuous) way, for five exposure levels: 2%, 10%, 20%, 25% and 30%. The second chapter, following the CRISP-DM methodology closely, presents the whole process of constructing the database that supported this work. It describes the process of loading data from the Protein Data Bank, DSSP and SCOP. An initial data exploration is then performed, and a simple prediction model (baseline) of the relative solvent accessibility of an amino acid is introduced. The Data Mining Table Creator, a program developed to produce the data mining tables required for this problem, is also introduced. In the third chapter the results obtained are analyzed with statistical significance tests. First, the classifiers used (Neural Networks, C5.0, CART and CHAID) are compared, and C5.0 is found to be the most suitable for the problem at hand. The influence of parameters such as the amino acid information level, the amino acid window size and the SCOP class type on the accuracy of the predictive models is also compared. The fourth chapter starts with a brief review of the literature on amino acid relative solvent accessibility. We then give an overview of the main results achieved and finally discuss possible future work. The fifth and last chapter consists of appendices. Appendix A contains the schema of the database that supported this thesis. Appendix B contains a set of tables with additional information. Appendix C describes the software provided on the DVD accompanying this thesis, which allows the present work to be reconstructed.
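To make the discrete problem stated in the abstract above concrete, the following minimal sketch labels one residue as exposed or buried at each of the five thresholds. The function name, the `>=` comparison and the sample value are assumptions made for illustration; the thesis may define the cut-off differently.

```python
# Exposure thresholds (percent) listed in the abstract.
THRESHOLDS = (2, 10, 20, 25, 30)

def classify_exposure(rsa_percent, thresholds=THRESHOLDS):
    """Return {threshold: 'exposed' | 'buried'} for a single residue.

    Assumption: a residue counts as exposed when its relative solvent
    accessibility (in percent) is at or above the threshold.
    """
    return {t: ("exposed" if rsa_percent >= t else "buried") for t in thresholds}

# Hypothetical residue with 22% relative solvent accessibility.
labels = classify_exposure(22.0)
```

Under these assumptions the same residue is exposed at the 2%, 10% and 20% levels and buried at 25% and 30%, which is why each threshold yields a separate binary classification task.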
Abstract:
Feature selection is a central problem in machine learning and pattern recognition. On large datasets (in terms of dimension and/or number of instances), using search-based or wrapper techniques can be computationally prohibitive. Moreover, many filter methods based on relevance/redundancy assessment also take a prohibitively long time on high-dimensional datasets. In this paper, we propose efficient unsupervised and supervised feature selection/ranking filters for high-dimensional datasets. These methods use low-complexity relevance and redundancy criteria, applicable to supervised, semi-supervised, and unsupervised learning, and can act as pre-processors for computationally intensive methods, letting them focus their attention on smaller subsets of promising features. The experimental results, with up to 10^5 features, show the time efficiency of our methods, with lower generalization error than state-of-the-art techniques, while being dramatically simpler and faster.
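The abstract above does not name its relevance and redundancy criteria, so the sketch below is only a generic illustration of a relevance/redundancy filter, not the authors' method. It assumes variance as an unsupervised relevance score and absolute Pearson correlation as the redundancy measure; `select_features` and `max_similarity` are hypothetical names.

```python
import math

def variance(col):
    """Population variance of one feature column."""
    m = sum(col) / len(col)
    return sum((x - m) ** 2 for x in col) / len(col)

def abs_correlation(a, b):
    """Absolute Pearson correlation; 0.0 if either column is constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    if sa == 0 or sb == 0:
        return 0.0
    return abs(cov / (sa * sb))

def select_features(columns, max_similarity=0.9):
    """Rank features by variance, then skip any feature that is too
    similar to one already kept. Returns the kept column indices."""
    order = sorted(range(len(columns)),
                   key=lambda j: variance(columns[j]), reverse=True)
    kept = []
    for j in order:
        if all(abs_correlation(columns[j], columns[i]) < max_similarity
               for i in kept):
            kept.append(j)
    return kept

# Toy data: column 1 is exactly twice column 0, so one of them is redundant.
columns = [[1, 2, 3, 4], [2, 4, 6, 8], [5, 5, 5, 6], [1, 0, 1, 0]]
kept = select_features(columns)
```

Ranking costs one pass per feature, and each candidate is compared only against the features already kept, which is the kind of low-complexity behavior a filter needs on high-dimensional data.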
Abstract:
A thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Information Systems.
Abstract:
The vision of the Internet of Things (IoT) includes large and dense deployments of interconnected smart sensing and monitoring devices. Such vast deployments necessitate the collection and processing of large volumes of measurement data. However, collecting all the measured data from individual devices on such a scale may be impractical and time consuming. Moreover, processing these measurements requires complex algorithms to extract useful information. Thus, it becomes imperative to devise distributed information processing mechanisms that identify application-specific features in a timely manner and with low overhead. In this article, we present a feature extraction mechanism for dense networks that takes advantage of dominance-based medium access control (MAC) protocols to (i) efficiently obtain global extrema of the sensed quantities, (ii) extract local extrema, and (iii) detect the boundaries of events, by using simple transforms that nodes apply to their local data. We extend our results to a large dense network with multiple broadcast domains (MBD). We discuss and compare two approaches for addressing the challenges of MBD, and we show through extensive evaluations that our proposed distributed MBD approach is fast and efficient at retrieving the most valuable measurements, independent of the number of sensor nodes in the network.
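The idea of obtaining a global extremum through a dominance-based MAC can be illustrated with a small simulation of bitwise arbitration. This is a generic sketch, not the protocol from the article: it assumes a wired-OR channel on which a transmitted 1 dominates a 0, and `arbitrate_max` is a hypothetical name.

```python
def arbitrate_max(values, bits=8):
    """Simulate bitwise arbitration among nodes on a shared channel.

    Each node transmits its value from the most significant bit down.
    Assumption: the channel behaves as a wired OR, so a 1 dominates.
    A node that sent 0 while the channel carried 1 withdraws.
    Returns (maximum value, indices of the nodes that hold it).
    """
    if any(v < 0 or v >= 2 ** bits for v in values):
        raise ValueError("values must fit in the given number of bits")
    contenders = list(range(len(values)))
    result = 0
    for bit in reversed(range(bits)):
        sent = {i: (values[i] >> bit) & 1 for i in contenders}
        channel = max(sent.values())  # wired OR of all transmitted bits
        contenders = [i for i in contenders if sent[i] == channel]
        result = (result << 1) | channel
    return result, contenders

# Hypothetical sensor readings from four nodes.
maximum, winners = arbitrate_max([17, 42, 9, 42])
```

The number of rounds depends on the bit width of the values rather than on the number of nodes, which matches the abstract's claim of independence from network size. A minimum can be found the same way by having each node transmit the bitwise complement of its value.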
Abstract:
Context and Objective: Chagas disease is considered a worldwide emerging disease; it is endemic in Mexico and in the state of Coahuila, yet it is considered of little relevance. The objective of this study was to determine the seroprevalence of T. cruzi infection in blood donors and of Chagas cardiomyopathy in patients from the coal mining region of Coahuila, Mexico. Design and Setting: Epidemiological, exploratory and prospective study in a general hospital during the period January to June 2011. Methods: We performed ELISA and indirect hemagglutination laboratory tests in three groups of individuals: 1) asymptomatic voluntary blood donors, 2) patients hospitalized in the cardiology department and 3) patients with dilated cardiomyopathy. Results: Three levels of seroprevalence were found: 0.31% in asymptomatic individuals, 1.25% in cardiac patients and 21.14% in patients with dilated cardiomyopathy. Conclusions: Although autochthonous cases of Chagas disease were detected, its importance to local public health remains to be established, as do the details of the transmission dynamics; the study is therefore still in progress.
Abstract:
Twelve pigs were experimentally inoculated with approximately 100,000 infective Toxocara cati larval eggs. The T. cati eggs used for inoculation were collected from cat feces. Another group of three pigs served as an uninfected control. Groups of infected pigs were euthanized at seven, 14, 21, and 28 days post-inoculation (dpi). Tissue samples were taken for digestion and for assessment of histopathological changes in the early phase. The number of larvae recovered from the lungs peaked at seven and 14 dpi; larvae were also present at 21 and 28 dpi. Larvae of T. cati were present in the lymph nodes of the small and large intestine at seven, 14, and 28 dpi and at seven, 14, 21, and 28 dpi, respectively. In the other tissues studied, either no larvae or less than one larva per gram was detected. At seven and 14 dpi, the pathological response consisted of white spots on the liver surface and areas of consolidation in the lungs. At 28 dpi, the lungs showed an inflammatory reaction with larvae in the center. In the liver we observed periportal and perilobular hepatitis. The lymph nodes of the intestines displayed eosinophilic lymphadenitis with reactive centers, some of which contained parasitic forms. No granulomatous reaction was observed in any tissue. The other examined tissues were of lesser significance. The relevance of this parasite as an etiological agent that leads to disease in paratenic hosts is evident.
Abstract:
Project work presented as a partial requirement for obtaining the degree of Master in Statistics and Information Management
Abstract:
Dissertation submitted for the degree of Master in Informatics Engineering
Abstract:
The extraction of relevant terms from texts is an extensively researched task in Text Mining. Relevant terms have been applied in areas such as Information Retrieval and document clustering and classification. However, relevance has a rather fuzzy nature, since the classification of some terms as relevant or not relevant is not consensual. For instance, while words such as "president" and "republic" are generally considered relevant by human evaluators, and words like "the" and "or" are not, terms such as "read" and "finish" gather no consensus about their semantics and informativeness. Concepts, on the other hand, have a less fuzzy nature. Therefore, instead of deciding on the relevance of a term during the extraction phase, as most extractors do, I propose to first extract from texts what I have called generic concepts (all concepts) and to postpone the decision about relevance to downstream applications, according to their needs. For instance, a keyword extractor may assume that the most relevant keywords are the most frequent concepts in the documents. Moreover, most statistical extractors are incapable of extracting single-word and multi-word expressions using the same methodology. These factors led to the development of the ConceptExtractor, a statistical and language-independent methodology which is explained in Part I of this thesis. In Part II, I show that the automatic extraction of concepts has great applicability. For instance, for the extraction of keywords from documents, using the Tf-Idf metric only on concepts yields better results than using Tf-Idf without concepts, especially for multi-words. In addition, since concepts can be semantically related to other concepts, this allows us to build implicit document descriptors. These applications led to published work. Finally, I present some work that, although not yet published, is briefly discussed in this document.
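The keyword application mentioned in the abstract above, scoring concepts with Tf-Idf, can be sketched as follows. The sketch assumes the concepts have already been extracted and arrive as one list per document; it does not reproduce the ConceptExtractor, and the Tf-Idf variant used (relative frequency times the natural log of N/df) is one common choice, not necessarily the one in the thesis. The sample concepts are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each concept in each document.

    docs: list of documents, each a list of concept strings
          (single-word or multi-word alike).
    Returns one {concept: score} dict per document.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each concept once per document
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({c: (tf[c] / len(doc)) * math.log(n / df[c])
                       for c in tf})
    return scores

# Hypothetical concept lists for three documents.
docs = [
    ["president", "republic", "president of the republic"],
    ["president", "text mining"],
    ["text mining", "republic", "text mining"],
]
scores = tf_idf(docs)
top_concept = max(scores[0], key=scores[0].get)
```

Because single-word and multi-word concepts are just strings in the same list, they are scored by exactly the same formula, which is the property the abstract highlights.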
Abstract:
INTRODUCTION: Bacterial meningitis has great social relevance due to its ability to produce sequelae and cause death. It is most frequently found in developing countries, especially among children. Meningococcal meningitis occurs at a high frequency in populations with poor living conditions. This study describes the temporal evolution of bacterial meningitis in Salvador, Brazil, 1995-2009, and verifies the association between its spatial variation and the living conditions of the population. METHODS: This was an ecological study in which the areas of information were classified by an index of living conditions. Fluctuations were examined using a trend curve, and the relationship between this index and the spatial distribution of meningitis was verified using simple linear regression. RESULTS: From 1995 to 2009, there were 3,456 confirmed cases of bacterial meningitis in Salvador. We observed a downward trend during this period, with a yearly incidence of 9.1 cases/100,000 population and a fatality rate of 16.7%. Children under 5 years of age and males were more affected. There was no significant spatial autocorrelation or pattern in the spatial distribution of the disease. The areas with the worst living conditions had higher fatality from meningococcal disease (β = 0.0078117, p < 0.005). CONCLUSIONS: Bacterial meningitis reaches all social strata; however, areas with poor living conditions have a greater proportion of cases that progress to death. This finding reflects the difficulty of readily accessing medical care and the poor quality of the care available to these populations.
Abstract:
Nowadays, with the widespread use of social networks, companies convey their message through their communication channels, but consumers give their opinion about it. They argue, comment and criticize (Nardi, Schiano, Gumbrecht, & Swartz, 2004), positively or negatively. In this context, Text Mining emerges as an interesting approach to meeting the need to obtain knowledge from existing data. In this work we used a hierarchical Clustering algorithm to discover distinct themes in a set of tweets collected over a given period of time for the companies Burger King and McDonald's. To understand the sentiment associated with these themes, a sentiment analysis was performed on each theme found, using a Bag-of-Words algorithm. We concluded that the Clustering algorithm was able to find themes in the collected tweets, essentially related to the products and services sold by the companies. The Sentiment Analysis algorithm assigned a sentiment to these themes, making it possible to understand which of the identified products/services obtained a positive or negative polarity, and thus to flag potentially problematic situations in the companies' strategy, as well as positive situations that can point to successful operational decisions.
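A bag-of-words sentiment step of the kind described above can be sketched as a lexicon count over the tweets of one theme. The word lists, function names and sample tweets below are invented for illustration; the work's actual lexicon and scoring rule are not given in the abstract, and the clustering step is not shown.

```python
import re

# Hypothetical mini-lexicons; a real analysis would use a full sentiment lexicon.
POSITIVE = {"good", "great", "love", "tasty"}
NEGATIVE = {"bad", "cold", "slow", "hate"}

def tweet_score(text):
    """Positive-word count minus negative-word count, ignoring word order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

def theme_polarity(tweets):
    """Aggregate tweet scores for one theme (cluster) into a polarity label."""
    total = sum(tweet_score(t) for t in tweets)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

# Made-up tweets assumed to belong to a single theme.
theme = ["Love the new burger, great fries", "Service was slow today"]
polarity = theme_polarity(theme)
```

Scoring at the theme level rather than per tweet mirrors the abstract's goal of attaching one polarity to each product or service theme found by the clustering.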
Abstract:
[...] intelligence applications for the banking industry. Searches were performed in relevant journals, resulting in 219 articles published between 2002 and 2013. To analyze such a large number of manuscripts, text mining techniques were used in pursuit of relevant terms in both the business intelligence and banking domains. Moreover, latent Dirichlet allocation modeling was used in order to group articles into several relevant topics. The analysis was conducted using a dictionary of terms belonging to both the banking and business intelligence domains. This procedure allowed the identification of relationships between terms and the topics grouping the articles, enabling hypotheses to emerge regarding research directions. To confirm these hypotheses, relevant articles were collected and scrutinized, which made it possible to validate the text mining procedure. The results show that credit in banking is clearly the main application trend, particularly predicting risk and thus supporting credit approval or denial. There is also relevant interest in bankruptcy and fraud prediction. Customer retention seems to be associated, although weakly, with targeting, justifying bank offers to reduce churn. In addition, a large number of articles focused more on business intelligence techniques and their applications, using the banking industry just for evaluation and thus not clearly claiming benefits for the banking business. By identifying these current research topics, this study also highlights opportunities for future research.
Abstract:
To better understand the dynamic behavior of metabolic networks under a wide variety of conditions, the field of Systems Biology has increased its interest in the use of kinetic models. The databases currently available do not contain enough data on this topic. Given that a significant part of the information relevant to the development of such models is still widely scattered in the literature, it becomes essential to develop specific and powerful text mining tools to collect these data. In this context, the main objective of this work is the development of a text mining tool to extract, from the scientific literature, kinetic parameters, their respective values and their relations with enzymes and metabolites. The proposed approach integrates the development of a novel plug-in for the text mining framework @Note2. Finally, the pipeline developed was validated with a case study on Kluyveromyces lactis, spanning the analysis and results of 20 full-text documents.
Abstract:
Integrated master's dissertation in Information Systems Engineering and Management