945 resultados para text mining clusterizzazione clustering auto-organizzazione conoscenza MoK
Resumo:
Mode of access: Internet.
Resumo:
Short bibliography at end of each chapter.
Resumo:
The purpose of this paper is to explain the notion of clustering and a concrete clustering method- agglomerative hierarchical clustering algorithm. It shows how a data mining method like clustering can be applied to the analysis of stocks, traded on the Bulgarian Stock Exchange in order to identify similar temporal behavior of the traded stocks. This problem is solved with the aid of a data mining tool that is called XLMiner™ for Microsoft Excel Office.
Resumo:
With the explosive growth of the volume and complexity of document data (e.g., news, blogs, web pages), it has become a necessity to semantically understand documents and deliver meaningful information to users. Areas dealing with these problems are crossing data mining, information retrieval, and machine learning. For example, document clustering and summarization are two fundamental techniques for understanding document data and have attracted much attention in recent years. Given a collection of documents, document clustering aims to partition them into different groups to provide efficient document browsing and navigation mechanisms. One unrevealed area in document clustering is that how to generate meaningful interpretation for the each document cluster resulted from the clustering process. Document summarization is another effective technique for document understanding, which generates a summary by selecting sentences that deliver the major or topic-relevant information in the original documents. How to improve the automatic summarization performance and apply it to newly emerging problems are two valuable research directions. To assist people to capture the semantics of documents effectively and efficiently, the dissertation focuses on developing effective data mining and machine learning algorithms and systems for (1) integrating document clustering and summarization to obtain meaningful document clusters with summarized interpretation, (2) improving document summarization performance and building document understanding systems to solve real-world applications, and (3) summarizing the differences and evolution of multiple document sources.
Resumo:
Due to the rapid advances in computing and sensing technologies, enormous amounts of data are being generated everyday in various applications. The integration of data mining and data visualization has been widely used to analyze these massive and complex data sets to discover hidden patterns. For both data mining and visualization to be effective, it is important to include the visualization techniques in the mining process and to generate the discovered patterns for a more comprehensive visual view. In this dissertation, four related problems: dimensionality reduction for visualizing high dimensional datasets, visualization-based clustering evaluation, interactive document mining, and multiple clusterings exploration are studied to explore the integration of data mining and data visualization. In particular, we 1) propose an efficient feature selection method (reliefF + mRMR) for preprocessing high dimensional datasets; 2) present DClusterE to integrate cluster validation with user interaction and provide rich visualization tools for users to examine document clustering results from multiple perspectives; 3) design two interactive document summarization systems to involve users efforts and generate customized summaries from 2D sentence layouts; and 4) propose a new framework which organizes the different input clusterings into a hierarchical tree structure and allows for interactive exploration of multiple clustering solutions.
Resumo:
Online Social Network (OSN) services provided by Internet companies bring people together to chat, share the information, and enjoy the information. Meanwhile, huge amounts of data are generated by those services (they can be regarded as the social media ) every day, every hour, even every minute, and every second. Currently, researchers are interested in analyzing the OSN data, extracting interesting patterns from it, and applying those patterns to real-world applications. However, due to the large-scale property of the OSN data, it is difficult to effectively analyze it. This dissertation focuses on applying data mining and information retrieval techniques to mine two key components in the social media data — users and user-generated contents. Specifically, it aims at addressing three problems related to the social media users and contents: (1) how does one organize the users and the contents? (2) how does one summarize the textual contents so that users do not have to go over every post to capture the general idea? (3) how does one identify the influential users in the social media to benefit other applications, e.g., Marketing Campaign? The contribution of this dissertation is briefly summarized as follows. (1) It provides a comprehensive and versatile data mining framework to analyze the users and user-generated contents from the social media. (2) It designs a hierarchical co-clustering algorithm to organize the users and contents. (3) It proposes multi-document summarization methods to extract core information from the social network contents. (4) It introduces three important dimensions of social influence, and a dynamic influence model for identifying influential users.
Resumo:
Modern IT infrastructures are constructed by large scale computing systems and administered by IT service providers. Manually maintaining such large computing systems is costly and inefficient. Service providers often seek automatic or semi-automatic methodologies of detecting and resolving system issues to improve their service quality and efficiency. This dissertation investigates several data-driven approaches for assisting service providers in achieving this goal. The detailed problems studied by these approaches can be categorized into the three aspects in the service workflow: 1) preprocessing raw textual system logs to structural events; 2) refining monitoring configurations for eliminating false positives and false negatives; 3) improving the efficiency of system diagnosis on detected alerts. Solving these problems usually requires a huge amount of domain knowledge about the particular computing systems. The approaches investigated by this dissertation are developed based on event mining algorithms, which are able to automatically derive part of that knowledge from the historical system logs, events and tickets. ^ In particular, two textual clustering algorithms are developed for converting raw textual logs into system events. For refining the monitoring configuration, a rule based alert prediction algorithm is proposed for eliminating false alerts (false positives) without losing any real alert and a textual classification method is applied to identify the missing alerts (false negatives) from manual incident tickets. For system diagnosis, this dissertation presents an efficient algorithm for discovering the temporal dependencies between system events with corresponding time lags, which can help the administrators to determine the redundancies of deployed monitoring situations and dependencies of system components. To improve the efficiency of incident ticket resolving, several KNN-based algorithms that recommend relevant historical tickets with resolutions for incoming tickets are investigated. Finally, this dissertation offers a novel algorithm for searching similar textual event segments over large system logs that assists administrators to locate similar system behaviors in the logs. Extensive empirical evaluation on system logs, events and tickets from real IT infrastructures demonstrates the effectiveness and efficiency of the proposed approaches.^
Resumo:
Online Social Network (OSN) services provided by Internet companies bring people together to chat, share the information, and enjoy the information. Meanwhile, huge amounts of data are generated by those services (they can be regarded as the social media ) every day, every hour, even every minute, and every second. Currently, researchers are interested in analyzing the OSN data, extracting interesting patterns from it, and applying those patterns to real-world applications. However, due to the large-scale property of the OSN data, it is difficult to effectively analyze it. This dissertation focuses on applying data mining and information retrieval techniques to mine two key components in the social media data — users and user-generated contents. Specifically, it aims at addressing three problems related to the social media users and contents: (1) how does one organize the users and the contents? (2) how does one summarize the textual contents so that users do not have to go over every post to capture the general idea? (3) how does one identify the influential users in the social media to benefit other applications, e.g., Marketing Campaign? The contribution of this dissertation is briefly summarized as follows. (1) It provides a comprehensive and versatile data mining framework to analyze the users and user-generated contents from the social media. (2) It designs a hierarchical co-clustering algorithm to organize the users and contents. (3) It proposes multi-document summarization methods to extract core information from the social network contents. (4) It introduces three important dimensions of social influence, and a dynamic influence model for identifying influential users.
Resumo:
In this paper we use concepts from graph theory and cellular biology represented as ontologies, to carry out semantic mining tasks on signaling pathway networks. Specifically, the paper describes the semantic enrichment of signaling pathway networks. A cell signaling network describes the basic cellular activities and their interactions. The main contribution of this paper is in the signaling pathway research area, it proposes a new technique to analyze and understand how changes in these networks may affect the transmission and flow of information, which produce diseases such as cancer and diabetes. Our approach is based on three concepts from graph theory (modularity, clustering and centrality) frequently used on social networks analysis. Our approach consists into two phases: the first uses the graph theory concepts to determine the cellular groups in the network, which we will call them communities; the second uses ontologies for the semantic enrichment of the cellular communities. The measures used from the graph theory allow us to determine the set of cells that are close (for example, in a disease), and the main cells in each community. We analyze our approach in two cases: TGF-β and the Alzheimer Disease.
Resumo:
Il periodo in cui viviamo rappresenta la cuspide di una forte e rapida evoluzione nella comprensione del linguaggio naturale, raggiuntasi prevalentemente grazie allo sviluppo di modelli neurali. Nell'ambito dell'information extraction, tali progressi hanno recentemente consentito di riconoscere efficacemente relazioni semantiche complesse tra entità menzionate nel testo, quali proteine, sintomi e farmaci. Tale task -- reso possibile dalla modellazione ad eventi -- è fondamentale in biomedicina, dove la crescita esponenziale del numero di pubblicazioni scientifiche accresce ulteriormente il bisogno di sistemi per l'estrazione automatica delle interazioni racchiuse nei documenti testuali. La combinazione di AI simbolica e sub-simbolica può consentire l'introduzione di conoscenza strutturata nota all'interno di language model, rendendo quest'ultimi più robusti, fattuali e interpretabili. In tale contesto, la verbalizzazione di grafi è uno dei task su cui si riversano maggiori aspettative. Nonostante l'importanza di tali contributi (dallo sviluppo di chatbot alla formulazione di nuove ipotesi di ricerca), ad oggi, risultano assenti contributi capaci di verbalizzare gli eventi biomedici espressi in letteratura, apprendendo il legame tra le interazioni espresse in forma a grafo e la loro controparte testuale. La tesi propone il primo dataset altamente comprensivo su coppie evento-testo, includendo diverse sotto-aree biomediche, quali malattie infettive, ricerca oncologica e biologia molecolare. Il dataset introdotto viene usato come base per l'addestramento di modelli generativi allo stato dell'arte sul task di verbalizzazione, adottando un approccio text-to-text e illustrando una tecnica formale per la codifica di grafi evento mediante testo aumentato. Infine, si dimostra la validità degli eventi per il miglioramento delle capacità di comprensione dei modelli neurali su altri task NLP, focalizzandosi su single-document summarization e multi-task learning.
Resumo:
This paper develops an interactive approach for exploratory spatial data analysis. Measures of attribute similarity and spatial proximity are combined in a clustering model to support the identification of patterns in spatial information. Relationships between the developed clustering approach, spatial data mining and choropleth display are discussed. Analysis of property crime rates in Brisbane, Australia is presented. A surprising finding in this research is that there are substantial inconsistencies in standard choropleth display options found in two widely used commercial geographical information systems, both in terms of definition and performance. The comparative results demonstrate the usefulness and appeal of the developed approach in a geographical information system environment for exploratory spatial data analysis.
Resumo:
Geospatial clustering must be designed in such a way that it takes into account the special features of geoinformation and the peculiar nature of geographical environments in order to successfully derive geospatially interesting global concentrations and localized excesses. This paper examines families of geospaital clustering recently proposed in the data mining community and identifies several features and issues especially important to geospatial clustering in data-rich environments.
Resumo:
O Brasil é um dos países que possuem as maiores jazidas de minério de ferro do mundo. Devido aos recursos financeiros envolvidos, mão-de-obra, arrecadação de impostos, envolvimento de atividades logísticas e comercialização exterior, a mineração é uma das principais atividades econômicas do país, envolvendo nesse contexto as plantas de pelotização de minério de ferro. As pelotas de minério de ferro são um dos insumos essenciais utilizados na produção mundial do aço. Compostas essencialmente de óxido de ferro (Fe2O3), são aglomerados com diâmetro que varia de 6,3 a 16 mm, cuja sua produção envolve a combustão para que obtenham a resistência mecânica necessária para serem utilizadas nos alto fornos siderúrgicos. Tal característica impede que as pelotas se quebrem e se transformem em finos, dificultando a permeação do ar através da carga do alto forno e consequentemente reduzindo a eficiência da queima e produção do ferro gusa, o qual é transformado em aço nas etapas posteriores de refino. Para garantir a produção de pelotas com a qualidade desejada, especialmente nos aspectos de composição e resistência mecânica, foi proposta queima controlada em reator de leito fixo simulando as características dos fornos utilizados na indústria. Com os resultados obtidos, foi possível correlacionar nível de temperatura, consumo de O2 e geração de CO2 e resistência à compressão, permitindo o aprimoramento do processo de combustão de pelotas, além de atestar a possibilidade de utilização de outros combustíveis sólidos como alternativas ao gás natural.
Resumo:
Durante o período pré-menstrual é comum a ocorrência de disfonia, e são poucas as mulheres que se dão conta dessa variação da voz dentro do ciclo menstrual (Quinteiro, 1989). OBJETIVO: Verificar se há diferença no padrão vocal de mulheres no período de ovulação em relação ao primeiro dia do ciclo menstrual, utilizando-se da análise perceptivo-auditiva, da espectrografia, dos parâmetros acústicos e quando esta diferença está presente, se é percebida pelas mulheres. FORMA DE ESTUDO: Caso-controle. MATERIAL E MÉTODO: A amostra coletada foi de 30 estudantes de Fonoaudiologia, na faixa etária de 18 a 25 anos, não-fumantes, com ciclo menstrual regular e sem o uso de contraceptivo oral. As vozes foram gravadas no primeiro dia de menstruação e no décimo-terceiro dia pós-menstruação (ovulação), para posterior comparação. RESULTADOS: Observou-se durante o período menstrual que as vozes estão rouco-soprosa de grau leve a moderado, instáveis, sem a presença de quebra de sonoridade, com pitch e loudness adequados e ressonância equilibrada. Há pior qualidade de definição dos harmônicos, maior quantidade de ruído entre eles e menor extensão dos harmônicos superiores. Encontramos uma f0 mais aguda, jitter e shimmer aumentados e PHR diminuída. CONCLUSÃO: No período menstrual há mudanças na qualidade vocal, no comportamento dos harmônicos e nos parâmetros vocais (f0,jitter, shimmer e PHR). Além disso, a maioria das estudantes de Fonoaudiologia não percebeu a variação da voz durante o ciclo menstrual.
Resumo:
Apesar da alta prevalência entre idosos, a perda auditiva é pouco investigada. A audiometria é o teste padrão ouro para avaliação de perda auditiva, mas sua realização em larga escala traz dificuldades operacionais. O auto-relato pode ser uma alternativa. OBJETIVO: Determinar se uma única questão genérica tem validade para ser utilizada em pesquisas epidemiológicas. FORMA DE ESTUDO:Revisão sistemática. MATERIAL E MÉTODO: Foi realizada uma pesquisa da literatura médica nas bases de dados MEDLINE e LILACS, no período de 1990 a 2004. Foram analisados também artigos citados nas referências dos artigos identificados na busca eletrônica. Foram selecionados os artigos que compararam os resultados obtidos através de auto-relato, por meio de uma questão única genérica, e da audiometria tonal. Foram extraídos os dados de prevalência da perda auditiva, e de sensibilidade, especificidade e valores preditivos. RESULTADO: Foram incluídos dez estudos transversais. A questão única genérica parece ser um indicador aceitável de perda auditiva, sensível e razoavelmente específico, principalmente quando a perda é identificada como sendo a média tonal que inclua freqüências até 2 ou 4 kHz, a um nível de 40 dBNA, na melhor orelha. CONCLUSÃO: Uma questão única genérica tem uma boa performance em identificar idosos com perda auditiva e pode, portanto, ser recomendada para um estudo epidemiológico que não possa realizar medidas audiométricas.