793 resultados para Data Mining, Clustering, PSA, Pavement Deflection
Resumo:
Peer reviewed
Resumo:
Peer reviewed
Resumo:
Data mining can be defined as the extraction of implicit, previously un-known, and potentially useful information from data. Numerous re-searchers have been developing security technology and exploring new methods to detect cyber-attacks with the DARPA 1998 dataset for Intrusion Detection and the modified versions of this dataset KDDCup99 and NSL-KDD, but until now no one have examined the performance of the Top 10 data mining algorithms selected by experts in data mining. The compared classification learning algorithms in this thesis are: C4.5, CART, k-NN and Naïve Bayes. The performance of these algorithms are compared with accuracy, error rate and average cost on modified versions of NSL-KDD train and test dataset where the instances are classified into normal and four cyber-attack categories: DoS, Probing, R2L and U2R. Additionally the most important features to detect cyber-attacks in all categories and in each category are evaluated with Weka’s Attribute Evaluator and ranked according to Information Gain. The results show that the classification algorithm with best performance on the dataset is the k-NN algorithm. The most important features to detect cyber-attacks are basic features such as the number of seconds of a network connection, the protocol used for the connection, the network service used, normal or error status of the connection and the number of data bytes sent. The most important features to detect DoS, Probing and R2L attacks are basic features and the least important features are content features. Unlike U2R attacks, where the content features are the most important features to detect attacks.
Resumo:
The incredible rapid development to huge volumes of air travel, mainly because of jet airliners that appeared to the sky in the 1950s, created the need for systematic research for aviation safety and collecting data about air traffic. The structured data can be analysed easily using queries from databases and running theseresults through graphic tools. However, in analysing narratives that often give more accurate information about the case, mining tools are needed. The analysis of textual data with computers has not been possible until data mining tools have been developed. Their use, at least among aviation, is still at a moderate level. The research aims at discovering lethal trends in the flight safety reports. The narratives of 1,200 flight safety reports from years 1994 – 1996 in Finnish were processed with three text mining tools. One of them was totally language independent, the other had a specific configuration for Finnish and the third originally created for English, but encouraging results had been achieved with Spanish and that is why a Finnish test was undertaken, too. The global rate of accidents is stabilising and the situation can now be regarded as satisfactory, but because of the growth in air traffic, the absolute number of fatal accidents per year might increase, if the flight safety will not be improved. The collection of data and reporting systems have reached their top level. The focal point in increasing the flight safety is analysis. The air traffic has generally been forecasted to grow 5 – 6 per cent annually over the next two decades. During this period, the global air travel will probably double also with relatively conservative expectations of economic growth. This development makes the airline management confront growing pressure due to increasing competition, signify cant rise in fuel prices and the need to reduce the incident rate due to expected growth in air traffic volumes. All this emphasises the urgent need for new tools and methods. All systems provided encouraging results, as well as proved challenges still to be won. Flight safety can be improved through the development and utilisation of sophisticated analysis tools and methods, like data mining, using its results supporting the decision process of the executives.
Resumo:
Este trabalho objetivou realizar a sistematização e análise das informações disponíveis na literatura sobre técnicas de produção de mudas de seis espécies florestais nativas e exóticas no Bioma Amazônia.
Resumo:
The semiarid region of northeastern Brazil, the Caatinga, is extremely important due to its biodiversity and endemism. Measurements of plant physiology are crucial to the calibration of Dynamic Global Vegetation Models (DGVMs) that are currently used to simulate the responses of vegetation in face of global changes. In a field work realized in an area of preserved Caatinga forest located in Petrolina, Pernambuco, measurements of carbon assimilation (in response to light and CO2) were performed on 11 individuals of Poincianella microphylla, a native species that is abundant in this region. These data were used to calibrate the maximum carboxylation velocity (Vcmax) used in the INLAND model. The calibration techniques used were Multiple Linear Regression (MLR), and data mining techniques as the Classification And Regression Tree (CART) and K-MEANS. The results were compared to the UNCALIBRATED model. It was found that simulated Gross Primary Productivity (GPP) reached 72% of observed GPP when using the calibrated Vcmax values, whereas the UNCALIBRATED approach accounted for 42% of observed GPP. Thus, this work shows the benefits of calibrating DGVMs using field ecophysiological measurements, especially in areas where field data is scarce or non-existent, such as in the Caatinga
Resumo:
This paper presents the characterization of high voltage (HV) electric power consumers based on a data clustering approach. The typical load profiles (TLP) are obtained selecting the best partition of a power consumption database among a pool of data partitions produced by several clustering algorithms. The choice of the best partition is supported using several cluster validity indices. The proposed data-mining (DM) based methodology, that includes all steps presented in the process of knowledge discovery in databases (KDD), presents an automatic data treatment application in order to preprocess the initial database in an automatic way, allowing time saving and better accuracy during this phase. These methods are intended to be used in a smart grid environment to extract useful knowledge about customers’ consumption behavior. To validate our approach, a case study with a real database of 185 HV consumers was used.
Resumo:
O paradigma de avaliação do ensino superior foi alterado em 2005 para ter em conta, para além do número de entradas, o número de alunos diplomados. Esta alteração pressiona as instituições académicas a melhorar o desempenho dos alunos. Um fenómeno perceptível ao analisar esse desempenho é que a performance registada não é nem uniforme nem constante ao longo da estadia do aluno no curso. Estas variações não estão a ser consideradas no esforço de melhorar o desempenho académico e surge motivação para detectar os diferentes perfis de desempenho e utilizar esse conhecimento para melhorar a o desempenho das instituições académicas. Este documento descreve o trabalho realizado no sentido de propor uma metodologia para detectar padrões de desempenho académico, num curso do ensino superior. Como ferramenta de análise são usadas técnicas de data mining, mais precisamente algoritmos de agrupamento. O caso de estudo para este trabalho é a população estudantil da licenciatura em Eng. Informática da FCT-UNL. Propõe-se dois modelos para o aluno, que servem de base para a análise. Um modelo analisa os alunos tendo em conta a sua performance num ano lectivo e o segundo analisa os alunos tendo em conta o seu percurso académico pelo curso, desde que entrou até se diplomar, transferir ou desistir. Esta análise é realizada recorrendo aos algoritmos de agrupamento: algoritmo aglomerativo hierárquico, k-means, SOM e SNN, entre outros.
Resumo:
Dissertação de mestrado integrado em Engenharia e Gestão de Sistemas de Informação
Resumo:
Internet ha rivoluzionato il modo di comunicare degli individui. Siamo testimoni della nascita e dello sviluppo di un'era caratterizzata dalla disponibilità di informazione libera e accessibile a tutti. Negli ultimi anni grazie alla diffusione di smartphone, tablet e altre tipologie di dispositivi connessi, è cambiato il fulcro dell'innovazione spostandosi dalle persone agli oggetti. E' così che nasce il concetto di Internet of Things, termine usato per descrivere la rete di comunicazione creata tra i diversi dispositivi connessi ad Internet e capaci di interagire in autonomia. Gli ambiti applicativi dell'Internet of Things spaziano dalla domotica alla sanità, dall'environmental monitoring al concetto di smart cities e così via. L'obiettivo principale di tale disciplina è quello di migliorare la vita delle persone grazie a sistemi che siano in grado di interagire senza aver bisogno dell'intervento dell'essere umano. Proprio per la natura eterogenea della disciplina e in relazione ai diversi ambiti applicativi, nell'Internet of Things si può incorrere in problemi derivanti dalla presenza di tecnologie differenti o di modalità eterogenee di memorizzazione dei dati. A questo proposito viene introdotto il concetto di Internet of Things collaborativo, termine che indica l'obiettivo di realizzare applicazioni che possano garantire interoperabilità tra i diversi ecosistemi e tra le diverse fonti da cui l'Internet of Things attinge, sfruttando la presenza di piattaforme di pubblicazione di Open Data. L'obiettivo di questa tesi è stato quello di creare un sistema per l'aggregazione di dati da due piattaforme, ThingSpeak e Sparkfun, con lo scopo di unificarli in un unico database ed estrarre informazioni significative dai dati tramite due tecniche di Data Mining: il Dictionary Learning e l'Affinity Propagation. Vengono illustrate le due metodologie che rientrano rispettivamente tra le tecniche di classificazione e di clustering.
Resumo:
Geospatial clustering must be designed in such a way that it takes into account the special features of geoinformation and the peculiar nature of geographical environments in order to successfully derive geospatially interesting global concentrations and localized excesses. This paper examines families of geospaital clustering recently proposed in the data mining community and identifies several features and issues especially important to geospatial clustering in data-rich environments.
Resumo:
Trabalho de Projeto apresentado como requisito parcial para obtenção do grau de Mestre em Estatística e Gestão de Informação
Resumo:
Knowledge discovery in databases is the non-trivial process of identifying valid, novel potentially useful and ultimately understandable patterns from data. The term Data mining refers to the process which does the exploratory analysis on the data and builds some model on the data. To infer patterns from data, data mining involves different approaches like association rule mining, classification techniques or clustering techniques. Among the many data mining techniques, clustering plays a major role, since it helps to group the related data for assessing properties and drawing conclusions. Most of the clustering algorithms act on a dataset with uniform format, since the similarity or dissimilarity between the data points is a significant factor in finding out the clusters. If a dataset consists of mixed attributes, i.e. a combination of numerical and categorical variables, a preferred approach is to convert different formats into a uniform format. The research study explores the various techniques to convert the mixed data sets to a numerical equivalent, so as to make it equipped for applying the statistical and similar algorithms. The results of clustering mixed category data after conversion to numeric data type have been demonstrated using a crime data set. The thesis also proposes an extension to the well known algorithm for handling mixed data types, to deal with data sets having only categorical data. The proposed conversion has been validated on a data set corresponding to breast cancer. Moreover, another issue with the clustering process is the visualization of output. Different geometric techniques like scatter plot, or projection plots are available, but none of the techniques display the result projecting the whole database but rather demonstrate attribute-pair wise analysis