954 results for educational data mining


Relevance:

80.00%

Publisher:

Abstract:

Most traditional data mining algorithms struggle to handle very large datasets efficiently. In this paper, we propose a general framework that accelerates existing clustering algorithms on large-scale datasets containing large numbers of attributes, items, and clusters. Our framework uses locality sensitive hashing (LSH) to significantly reduce the cluster search space. We also prove theoretically that the framework has a guaranteed error bound in terms of clustering quality. The framework can be applied to a set of centroid-based clustering algorithms that assign an object to the most similar cluster, and we adopt the popular K-Modes categorical clustering algorithm to show how it can be applied. We validated our framework with five synthetic datasets and a real-world Yahoo! Answers dataset. The experimental results demonstrate that our framework speeds up the existing clustering algorithm by a factor of 2 to 6 while maintaining comparable cluster purity.
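
The idea of using LSH to shrink the cluster search space can be illustrated with a minimal sketch. It is not the framework from the abstract and does not reproduce its error bound: the choice of MinHash with banding, the class name `LshModeIndex`, and the parameters `num_hashes` and `bands` are all assumptions made for illustration. Each cluster mode is hashed into buckets, and an item is compared (by Hamming distance, as in K-Modes) only against modes that share at least one bucket with it.

```python
import hashlib
from collections import defaultdict


def _h(seed, token):
    """Deterministic 64-bit hash of an (attribute index, value) token."""
    data = f"{seed}|{token[0]}|{token[1]}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(item, num_hashes):
    """MinHash signature of a categorical item, seen as a set of (index, value) tokens."""
    tokens = list(enumerate(item))
    return tuple(min(_h(seed, t) for t in tokens) for seed in range(num_hashes))


def band_keys(signature, bands):
    """Split a signature into `bands` equal slices; each slice is one bucket key."""
    rows = len(signature) // bands
    return [(b, signature[b * rows:(b + 1) * rows]) for b in range(bands)]


def hamming(a, b):
    """Number of attributes on which two categorical items disagree."""
    return sum(x != y for x, y in zip(a, b))


class LshModeIndex:
    """Hypothetical helper: buckets cluster modes so items check fewer clusters."""

    def __init__(self, modes, num_hashes=16, bands=8):
        self.modes = modes
        self.num_hashes = num_hashes
        self.bands = bands
        self.buckets = defaultdict(set)
        for cid, mode in enumerate(modes):
            for key in band_keys(minhash_signature(mode, num_hashes), bands):
                self.buckets[key].add(cid)

    def candidates(self, item):
        """Clusters sharing a bucket with the item; all clusters if none do."""
        found = set()
        for key in band_keys(minhash_signature(item, self.num_hashes), self.bands):
            found |= self.buckets.get(key, set())
        return found or set(range(len(self.modes)))

    def assign(self, item):
        """Assign the item to the nearest candidate mode by Hamming distance."""
        return min(sorted(self.candidates(item)),
                   key=lambda cid: hamming(item, self.modes[cid]))


modes = [("red", "small", "round"),
         ("blue", "large", "square"),
         ("green", "medium", "flat")]
index = LshModeIndex(modes)
print(index.assign(("red", "small", "oval")))
```

In a full K-Modes loop the index would have to be rebuilt (or updated) whenever the modes change; the sketch covers only the assignment step.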

Relevance:

80.00%

Publisher:

Abstract:

The past decade has witnessed unprecedented growth in the amount of available digital content, and its volume is expected to keep growing over the next few years. Unstructured text data generated from web and enterprise sources forms a large fraction of such content. Many of these sources contain large volumes of reusable data, such as solutions to frequently occurring problems and general know-how that may be reused in appropriate contexts. In this work, we address issues around leveraging unstructured text data from sources as diverse as the web and the enterprise within the Case-based Reasoning framework. Case-based Reasoning (CBR) provides a framework and methodology for the systematic reuse of historical knowledge, available in the form of problem-solution pairs, in solving new problems. Here, we consider possibilities for enhancing Textual CBR systems under three main themes: procurement, maintenance and retrieval. We adapt and build upon state-of-the-art techniques from data mining and natural language processing to address the various challenges therein. Under procurement, we investigate the problem of extracting cases (i.e., problem-solution pairs) from data sources such as incident/experience reports. We develop case-base maintenance methods, specifically tuned to text, that aim to retain solutions such that the utility of the filtered case base in solving new problems is maximized. Further, we address the problem of query suggestions for textual case bases and show that exploiting the problem-solution partition can enhance retrieval effectiveness by prioritizing more useful query suggestions. Additionally, we illustrate interpretable clustering as a tool for drilling down into domain-specific text collections (since CBR systems are usually very domain specific) and develop techniques for improved similarity assessment in social media sources such as microblogs. Through extensive empirical evaluations, we illustrate the improvements that we achieve over state-of-the-art methods for the respective tasks.

Relevance:

80.00%

Publisher:

Abstract:

Air pollution is currently one of the leading environmental causes of mortality. About 30% of the European population living in urban areas is exposed to air pollution levels above the air quality limit values legislated for the protection of human health, and road traffic is one of the main sources of urban pollution. In addition to the traditional pollutants assessed in urban areas, pollutants classified as hazardous to health (Hazardous Air Pollutants, HAPs) are particularly relevant because of their known toxic and carcinogenic effects. Exposure assessment is therefore essential for determining the relationship between urban air pollution and health effects. The main objective of this study is to develop and implement a methodology for assessing individual exposure to urban air pollution related to road traffic. To achieve this objective, the parameters relevant to quantifying exposure were identified, and the current and potential future health impacts associated with exposure to urban pollution were analysed. In this context, the ExPOSITION model (EXPOSure model to traffIc-relaTed aIr pOllutioN) was developed, based on an innovative approach that analyses individuals' trajectories, collected by GPS-enabled mobile phones and processed using data mining and geospatial analysis. The ExPOSITION model also adopts a probabilistic approach to characterise the variability of microenvironmental parameters and their contribution to individual exposure. Additionally, to meet the objectives of the study, a new module was developed to calculate HAP emissions from road transport.
In this study, a modelling system comprising transport, emissions, dispersion and exposure models was applied to the urban area of Leiria to quantify individual exposure to PM2.5 and benzene. The modelling results were validated against measurements obtained through personal monitoring and biological monitoring, and showed good agreement between model results and measured data. The methodology developed and implemented in this work makes it possible to analyse and estimate the magnitude, frequency, and inter- and intra-variability of individual exposure levels, as well as the contribution of different microenvironments, while explicitly considering the sequence of exposure events and the source-receptor relationship, which is fundamental for assessing health effects and for epidemiological studies. This work contributes to a better understanding of individual exposure in urban areas, offering new perspectives on individual exposure that are essential for selecting strategies to reduce exposure to urban air pollution and its consequent health effects.

Relevance:

80.00%

Publisher:

Abstract:

The evolution of calcified tissues is a defining feature in vertebrate evolution. Investigating the evolution of proteins involved in tissue calcification should help elucidate how calcified tissues have evolved. The purpose of this study was to collect and compare sequences of matrix and bone γ-carboxyglutamic acid proteins (MGP and BGP, respectively) to identify common features and determine the evolutionary relationship between MGP and BGP. Thirteen cDNAs and genes were cloned using standard methods or reconstructed through the use of comparative genomics and data mining. These sequences were compared with available annotated sequences; in total, 48 complete or nearly complete sequences (28 BGPs and 20 MGPs) were identified across 32 different species (representing most classes of vertebrates), and evolutionarily conserved features in both MGP and BGP were analyzed using bioinformatic tools and the Tree-Puzzle software. We propose that: 1) MGP and BGP genes originated from two genome duplications that occurred around 500 and 400 million years ago before jawless and jawed fish evolved, respectively; 2) MGP appeared first concomitantly with the emergence of cartilaginous structures, and BGP appeared thereafter along with bony structures; and 3) BGP derives from MGP. We also propose a highly specific pattern definition for the Gla domain of BGP and MGP.
BGP (bone Gla protein or osteocalcin) and MGP (matrix Gla protein) belong to the growing family of vitamin K-dependent (VKD) proteins, the members of which are involved in a broad range of biological functions such as skeletogenesis and bone maintenance (BGP and MGP), hemostasis (prothrombin, clotting factors VII, IX, and X, and proteins C, S, and Z), growth control (gas6), and potentially signal transduction (proline-rich Gla proteins 1 and 2).
VKD proteins are characterized by the presence of several Gla residues resulting from the post-translational vitamin K-dependent γ-carboxylation of specific glutamates, through which they can bind to calcium-containing minerals such as hydroxyapatite. To date, VKD proteins have only been clearly identified in vertebrates (1), although the presence of a γ-glutamyl carboxylase has been reported in the fruit fly Drosophila melanogaster (2) and in marine snails belonging to the genus Conus (3). Gla residues have also been found in neuropeptides from Conus venoms (4), suggesting a wider prevalence of γ-carboxylation.

Relevance:

80.00%

Publisher:

Abstract:

Doctoral thesis, Informatics (Bioinformatics), Universidade de Lisboa, Faculdade de Ciências, 2014

Relevance:

80.00%

Publisher:

Abstract:

In recent years the data mining field has grown and consolidated considerably. Several efforts seek to establish standards in the area, among them SEMMA and CRISP-DM. Both are emerging as industry standards and define a set of sequential steps intended to guide the implementation of data mining applications. This raises the question of whether they differ substantially from the traditional KDD process. This paper draws a parallel between these two methodologies and the KDD process and examines the similarities between them.

Relevance:

80.00%

Publisher:

Abstract:

This paper presents a Multi-Agent Market simulator designed for developing new agent market strategies based on a complete understanding of buyer and seller behaviors, preference models and pricing algorithms, considering user risk preferences and game theory for scenario analysis. The tool studies negotiations based on different market mechanisms and on time- and behavior-dependent strategies. The results of the negotiations between agents are analyzed by data mining algorithms in order to extract rules that give agents feedback to improve their strategies. The system also includes agents that can improve their performance through their own experience, by adapting to market conditions, and that can take other agents' reactions into account.

Relevance:

80.00%

Publisher:

Abstract:

This paper presents five different clustering methods to identify typical load profiles of medium voltage (MV) electricity consumers. These methods are intended to be used in a smart grid environment to extract useful knowledge about customers' behaviour. The knowledge obtained can be used to support a decision tool, not only for utilities but also for consumers. Utilities can use load profiles to identify the aspects that cause system load peaks and to develop specific contracts with their customers. The framework presented throughout the paper consists of several steps, namely data pre-processing, the application of clustering algorithms, and the evaluation of the quality of the partition, which is supported by cluster validity indices. The process ends with the analysis of the discovered knowledge. To validate the proposed framework, a case study with a real database of 208 MV consumers is used.
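
The three steps named in this abstract (pre-processing, clustering, and evaluation with a validity index) can be sketched as follows. The abstract does not name its five clustering methods or its validity indices, so the use of k-means and the Davies-Bouldin index here is an assumption, as are the peak normalisation, the function names, and the toy four-point load curves.

```python
import math


def normalize(profile):
    """Pre-processing: scale a load curve by its peak so shapes are comparable."""
    peak = max(profile)
    return [v / peak for v in profile] if peak else list(profile)


def dist(a, b):
    """Euclidean distance between two load curves."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of curves."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points, k, iterations=50):
    """Plain k-means; centroids start at the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids


def davies_bouldin(points, labels, centroids):
    """Davies-Bouldin index: lower means compact, well-separated clusters.

    Assumes k >= 2 and that no two centroids coincide.
    """
    k = len(centroids)
    scatter = []
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        scatter.append(sum(dist(p, centroids[c]) for p in members) / len(members)
                       if members else 0.0)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
                     for j in range(k) if j != i)
    return total / k


# Toy data (hypothetical): two daytime-peaking and two evening-peaking consumers.
raw = [[20, 100, 90, 30], [10, 30, 60, 100], [40, 200, 170, 70], [25, 60, 130, 200]]
profiles = [normalize(p) for p in raw]
labels, centroids = kmeans(profiles, k=2)
score = davies_bouldin(profiles, labels, centroids)
print(labels, round(score, 3))
```

Comparing the index across several values of k, or across several algorithms, is one common way to choose a partition; the centroids of the chosen partition then serve as the typical load profiles.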

Relevance:

80.00%

Publisher:

Abstract:

Electricity markets are complex environments with very particular characteristics. MASCEM is a market simulator developed to allow in-depth study of the interactions between the players that take part in electricity market negotiations. This paper presents a new proposal for defining MASCEM players' strategies for negotiating in the market. The proposed methodology is multiagent based and uses reinforcement learning algorithms to let players perceive changes in the environment and adapt their bid formulation to their needs, using a set of different techniques at their disposal.

Relevance:

80.00%

Publisher:

Abstract:

The growing importance and influence of new resources connected to power systems has caused many changes in their operation. Environmental policies and several well-known advantages have led to the wide dissemination of renewable-based energy resources. These resources, including Distributed Generation (DG), are being connected at lower voltage levels, where Demand Response (DR) must be considered too. These changes increase the complexity of system operation because of both new operational constraints and the amount of data to be processed. Virtual Power Players (VPP) are entities able to manage these resources. Addressing these issues, this paper proposes a methodology to support VPP actions when these act as a Curtailment Service Provider (CSP) that provides DR capacity to a DR program declared by the Independent System Operator (ISO) or by the VPP itself. The amount of DR capacity that the CSP can assure is determined using data mining techniques applied to a database obtained for a large set of operation scenarios. The paper includes a case study based on 27,000 scenarios considering a diversity of distributed resources in a 33-bus distribution network.

Relevance:

80.00%

Publisher:

Abstract:

This paper establishes a Virtual Producer/Consumer Agent (VPCA) in order to optimize the integrated management of distributed energy resources and to improve and control Demand Side Management (DSM) and its aggregated loads. The paper presents the VPCA architecture and the proposed function-based organization used to coordinate the various generation technologies, the different load types and the storage systems. This VPCA organization uses a framework based on data mining techniques to characterize the customers. The paper includes results of several experimental test cases, using real data and taking into account electricity generation resources as well as consumption data.

Relevance:

80.00%

Publisher:

Abstract:

Many current e-commerce systems personalize the content they show to users. In this context, recommender systems make personalized suggestions and provide information about the items available in the system. Nowadays, a vast number of methods, including data mining techniques, can be employed for personalization in recommender systems. However, these methods are still quite vulnerable to some limitations and shortcomings related to the recommender environment. In order to deal with some of them, in this work we implement a recommendation methodology in a recommender system for tourism, where classification based on association is applied. Classification based on association methods, also called associative classification methods, are an alternative data mining technique that combines concepts from classification and association so that association rules can be employed in a prediction context. The proposed methodology was evaluated in several case studies, in which we verified that it is able to mitigate limitations present in recommender systems and to enhance recommendation quality.
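
To make the notion of associative classification concrete, here is a minimal sketch in which association rules of the form "itemset implies class" are mined under support and confidence thresholds and then used for prediction. It is not the methodology evaluated in the abstract: the brute-force rule enumeration, the function names `mine_rules` and `classify`, the thresholds, and the toy tourism data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def mine_rules(rows, labels, min_support=0.2, min_confidence=0.6, max_len=2):
    """Enumerate class association rules (body -> label) by brute force."""
    n = len(rows)
    body_counts = Counter()   # how often each itemset occurs
    rule_counts = Counter()   # how often each itemset occurs with each label
    for items, label in zip(rows, labels):
        items = sorted(set(items))
        for k in range(1, max_len + 1):
            for body in combinations(items, k):
                body_counts[body] += 1
                rule_counts[(body, label)] += 1
    rules = []
    for (body, label), count in rule_counts.items():
        support = count / n
        confidence = count / body_counts[body]
        if support >= min_support and confidence >= min_confidence:
            rules.append((confidence, support, body, label))
    # Rank by confidence, then support, then shorter bodies; remaining ties
    # are broken lexicographically so the order is deterministic.
    rules.sort(key=lambda r: (-r[0], -r[1], len(r[2]), r[2], r[3]))
    return rules


def classify(rules, items, default=None):
    """Predict with the highest-ranked rule whose body the items contain."""
    items = set(items)
    for _confidence, _support, body, label in rules:
        if items.issuperset(body):
            return label
    return default


# Toy data (hypothetical): user context items and the attraction type chosen.
rows = [("beach", "summer"), ("beach", "family"), ("museum", "winter"),
        ("museum", "rain"), ("beach", "summer")]
labels = ["resort", "resort", "city_tour", "city_tour", "resort"]

rules = mine_rules(rows, labels)
print(classify(rules, ("beach", "winter")))
```

Practical associative classifiers usually add rule pruning and a more efficient frequent-itemset search; the exhaustive enumeration above is only workable for small item sets.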

Relevance:

80.00%

Publisher:

Abstract:

Project submitted for the degree of Master in Informatics and Computer Engineering

Relevance:

80.00%

Publisher:

Abstract:

The search for patterns in data in order to form groups is known as data clustering, and it is one of the most frequently performed tasks in data mining and pattern recognition. This dissertation addresses the concept of entropy and uses algorithms with entropic criteria to cluster biomedical data. The use of entropy for clustering is relatively recent; it arises from an attempt to exploit entropy's ability to extract higher-order information from the data distribution, either as the criterion for forming groups (clusters) or to complement or improve existing algorithms in pursuit of better results. Some studies involving algorithms based on entropic criteria have reported positive results in the analysis of real data. In this work, several algorithms based on entropic criteria were explored, together with their applicability to biomedical data, in an attempt to assess how well suited these algorithms are to this type of data. The results of the tested algorithms are compared with those obtained by more "conventional" algorithms such as k-means, spectral clustering algorithms and a density-based algorithm.