907 resultados para Data clustering
Resumo:
Complex systems, i.e. systems composed of a large set of elements interacting in a non-linear way, are constantly found all around us. In the last decades, different approaches have been proposed toward their understanding, one of the most interesting being the Complex Network perspective. This legacy of the 18th century mathematical concepts proposed by Leonhard Euler is still current, and more and more relevant in real-world problems. In recent years, it has been demonstrated that network-based representations can yield relevant knowledge about complex systems. In spite of that, several problems have been detected, mainly related to the degree of subjectivity involved in the creation and evaluation of such network structures. In this Thesis, we propose addressing these problems by means of different data mining techniques, thus obtaining a novel hybrid approximation intermingling complex networks and data mining. Results indicate that such techniques can be effectively used to i) enable the creation of novel network representations, ii) reduce the dimensionality of analyzed systems by pre-selecting the most important elements, iii) describe complex networks, and iv) assist in the analysis of different network topologies. The soundness of such approach is validated through different validation cases drawn from actual biomedical problems, e.g. the diagnosis of cancer from tissue analysis, or the study of the dynamics of the brain under different neurological disorders.
Resumo:
This study analyses financial data using the result characterization of a self-organized neural network model. The goal was prototyping a tool that may help an economist or a market analyst to analyse stock market series. To reach this goal, the tool shows economic dependencies and statistics measures over stock market series. The neural network SOM (self-organizing maps) model was used to ex-tract behavioural patterns of the data analysed. Based on this model, it was de-veloped an application to analyse financial data. This application uses a portfo-lio of correlated markets or inverse-correlated markets as input. After the anal-ysis with SOM, the result is represented by micro clusters that are organized by its behaviour tendency. During the study appeared the need of a better analysis for SOM algo-rithm results. This problem was solved with a cluster solution technique, which groups the micro clusters from SOM U-Matrix analyses. The study showed that the correlation and inverse-correlation markets projects multiple clusters of data. These clusters represent multiple trend states that may be useful for technical professionals.
Resumo:
Botnets are a group of computers infected with a specific sub-set of a malware family and controlled by one individual, called botmaster. This kind of networks are used not only, but also for virtual extorsion, spam campaigns and identity theft. They implement different types of evasion techniques that make it harder for one to group and detect botnet traffic. This thesis introduces one methodology, called CONDENSER, that outputs clusters through a self-organizing map and that identify domain names generated by an unknown pseudo-random seed that is known by the botnet herder(s). Aditionally DNS Crawler is proposed, this system saves historic DNS data for fast-flux and double fastflux detection, and is used to identify live C&Cs IPs used by real botnets. A program, called CHEWER, was developed to automate the calculation of the SVM parameters and features that better perform against the available domain names associated with DGAs. CONDENSER and DNS Crawler were developed with scalability in mind so the detection of fast-flux and double fast-flux networks become faster. We used a SVM for the DGA classififer, selecting a total of 11 attributes and achieving a Precision of 77,9% and a F-Measure of 83,2%. The feature selection method identified the 3 most significant attributes of the total set of attributes. For clustering, a Self-Organizing Map was used on a total of 81 attributes. The conclusions of this thesis were accepted in Botconf through a submited article. Botconf is known conferênce for research, mitigation and discovery of botnets tailled for the industry, where is presented current work and research. This conference is known for having security and anti-virus companies, law enforcement agencies and researchers.
Resumo:
The superfluous consumption of energy is faced by the modern society as a Socio-Economical and Environmental problem of the present days. This situation is worsening given that it is becoming clear that the tendency is to increase energy price every year. It is also noticeable that people, not necessarily proficient in technology, are not able to know where savings can be achieved, due to the absence of accessible awareness mechanisms. One of the home user concerns is to balance the need of reducing energy consumption, while producing the same activity with all the comfort and work efficiency. The common techniques to reduce the consumption are to use a less wasteful equipment, altering the equipment program to a more economical one or disconnecting appliances that are not necessary at the moment. However, there is no direct feedback from this performed actions, which leads to the situation where the user is not aware of the influence that these techniques have in the electrical bill. With the intension to give some control over the home consumption, Energy Management Systems (EMS) were developed. These systems allow the access to the consumption information and help understanding the energy waste. However, some studies have proven that these systems have a clear mismatch between the information that is presented and the one the user finds useful for his daily life, leading to demotivation of use. In order to create a solution more oriented towards the user’s demands, a specially tailored language (DSL) was implemented. This solution allows the user to acquire the information he considers useful, through the construction of questions about his energy consumption. The development of this language, following the Model Driven Development (MDD) approach, took into consideration the ideas of facility managers and home users in the phases of design and validation. These opinions were gathered through meetings with experts and a survey, which was conducted to the purpose of collecting statistics about what home users want to know.
Resumo:
Companies are increasingly more and more dependent on distributed web-based software systems to support their businesses. This increases the need to maintain and extend software systems with up-to-date new features. Thus, the development process to introduce new features usually needs to be swift and agile, and the supporting software evolution process needs to be safe, fast, and efficient. However, this is usually a difficult and challenging task for a developer due to the lack of support offered by programming environments, frameworks, and database management systems. Changes needed at the code level, database model, and the actual data contained in the database must be planned and developed together and executed in a synchronized way. Even under a careful development discipline, the impact of changing an application data model is hard to predict. The lifetime of an application comprises changes and updates designed and tested using data, which is usually far from the real, production, data. So, coding DDL and DML SQL scripts to update database schema and data, is the usual (and hard) approach taken by developers. Such manual approach is error prone and disconnected from the real data in production, because developers may not know the exact impact of their changes. This work aims to improve the maintenance process in the context of Agile Platform by Outsystems. Our goal is to design and implement new data-model evolution features that ensure a safe support for change and a sound migration process. Our solution includes impact analysis mechanisms targeting the data model and the data itself. This provides, to developers, a safe, simple, and guided evolution process.
Resumo:
RESUMO - A Segurança do Doente tem assumido uma relevância crescente nas organizações de saúde, resultado da divulgação de diversos estudos que revelaram a magnitude deste problema e simultaneamente, de uma maior pressão por parte da opinião pública e da comunicação social. Este estudo pretende desenvolver e avaliar a performance de um sistema eletrónico de deteção de eventos adversos, baseado num Data Warehouse, por comparação com os resultados obtidos pela metodologia tradicional de revisão dos registos clínicos. O objetivo principal do trabalho consistiu em identificar um conjunto de triggers / indicadores de alerta que permitam detetar potenciais eventos adversos mais comuns. O sistema desenvolvido apresentou um Valor Preditivo Positivo de 18.2%, uma sensibilidade de 65.1% e uma especificidade de 68.6%, sendo constituído por nove indicadores baseados em informação clínica e 445 códigos do ICD-9-CM, relativos a diagnósticos e procedimentos. Apesar de terem algumas limitações, os sistemas eletrónicos de deteção de eventos adversos apresentam inúmeras potencialidades, nomeadamente a utilização em tempo real e em complemento a metodologias já existentes. Considerando a importância da problemática em análise e a necessidade de aprofundar os resultados obtidos neste trabalho de projeto, seria relevante a sua extensão a um universo mais alargado de instituições hospitalares, estando a sua replicabilidade facilitada, uma vez que o Data Warehouse tem por base um conjunto de aplicações disseminadas a nível nacional. O desenvolvimento e a consolidação dos sistemas eletrónicos de deteção de eventos adversos constitui inegavelmente uma área de futuro, com reflexos ao nível da melhoria da informação existente nas organizações e que contribuirá decisivamente para a melhoria dos cuidados de saúde prestados aos doentes.
Resumo:
Stratigraphic Columns (SC) are the most useful and common ways to represent the eld descriptions (e.g., grain size, thickness of rock packages, and fossil and lithological components) of rock sequences and well logs. In these representations the width of SC vary according to the grain size (i.e., the wider the strata, the coarser the rocks (Miall 1990; Tucker 2011)), and the thickness of each layer is represented at the vertical axis of the diagram. Typically these representations are drawn 'manually' using vector graphic editors (e.g., Adobe Illustrator®, CorelDRAW®, Inskape). Nowadays there are various software which automatically plot SCs, but there are not versatile open-source tools and it is very di cult to both store and analyse stratigraphic information. This document presents Stratigraphic Data Analysis in R (SDAR), an analytical package1 designed for both plotting and facilitate the analysis of Stratigraphic Data in R (R Core Team 2014). SDAR, uses simple stratigraphic data and takes advantage of the exible plotting tools available in R to produce detailed SCs. The main bene ts of SDAR are: (i) used to generate accurate and complete SC plot including multiple features (e.g., sedimentary structures, samples, fossil content, color, structural data, contacts between beds), (ii) developed in a free software environment for statistical computing and graphics, (iii) run on a wide variety of platforms (i.e., UNIX, Windows, and MacOS), (iv) both plotting and analysing functions can be executed directly on R's command-line interface (CLI), consequently this feature enables users to integrate SDAR's functions with several others add-on packages available for R from The Comprehensive R Archive Network (CRAN).
Resumo:
Crisis-affected communities and global organizations for international aid are becoming increasingly digital as consequence geotechnology popularity. Humanitarian sector changed in profound ways by adopting new technical approach to obtain information from area with difficult geographical or political access. Since 2011, turkey is hosting a growing number of Syrian refugees along southeastern region. Turkish policy of hosting them in camps and the difficulty created by governors to international aid group expeditions to get information, made such international organizations to investigate and adopt other approach in order to obtain information needed. They intensified its remote sensing approach. However, the majority of studies used very high-resolution satellite imagery (VHRSI). The study area is extensive and the temporal resolution of VHRSI is low, besides it is infeasible only using these sensors as unique approach for the whole area. The focus of this research, aims to investigate the potentialities of mid-resolution imagery (here only Landsat) to obtain information from region in crisis (here, southeastern Turkey) through a new web-based platform called Google Earth Engine (GEE). Hereby it is also intended to verify GEE currently reliability once the Application Programming Interface (API) is still in beta version. The finds here shows that the basic functions are trustworthy. Results pointed out that Landsat can recognize change in the spectral resolution clearly only for the first settlement. The ongoing modifications vary for each case. Overall, Landsat demonstrated high limitations, but need more investigations and may be used, with restriction, as a support of VHRSI.
Resumo:
This study identifies a measure of the cultural importance of an area within a city. It does so by making use of origindestination trip data and the bike stations of the bike share system in New York City as a proxy to study the city. Rarely is movement in the city studied at such a small scale. The change in strength of the similarity of movement between each station is studied. It is the first study to provide this measure of importance for every point in the system. This measure is then related to the characteristics which make for vibrant city communities, namely highly mixed land use types. It reveals that the spatial pattern of important areas remains constant over differing time periods. Communities are then characterised by the land uses surrounding these stations with high measures of importance. Finally it identifies the areas of global cultural importance alongside the areas of local importance to the city.
Resumo:
In the last few years, we have observed an exponential increasing of the information systems, and parking information is one more example of them. The needs of obtaining reliable and updated information of parking slots availability are very important in the goal of traffic reduction. Also parking slot prediction is a new topic that has already started to be applied. San Francisco in America and Santander in Spain are examples of such projects carried out to obtain this kind of information. The aim of this thesis is the study and evaluation of methodologies for parking slot prediction and the integration in a web application, where all kind of users will be able to know the current parking status and also future status according to parking model predictions. The source of the data is ancillary in this work but it needs to be understood anyway to understand the parking behaviour. Actually, there are many modelling techniques used for this purpose such as time series analysis, decision trees, neural networks and clustering. In this work, the author explains the best techniques at this work, analyzes the result and points out the advantages and disadvantages of each one. The model will learn the periodic and seasonal patterns of the parking status behaviour, and with this knowledge it can predict future status values given a date. The data used comes from the Smart Park Ontinyent and it is about parking occupancy status together with timestamps and it is stored in a database. After data acquisition, data analysis and pre-processing was needed for model implementations. The first test done was with the boosting ensemble classifier, employed over a set of decision trees, created with C5.0 algorithm from a set of training samples, to assign a prediction value to each object. In addition to the predictions, this work has got measurements error that indicates the reliability of the outcome predictions being correct. The second test was done using the function fitting seasonal exponential smoothing tbats model. Finally as the last test, it has been tried a model that is actually a combination of the previous two models, just to see the result of this combination. The results were quite good for all of them, having error averages of 6.2, 6.6 and 5.4 in vacancies predictions for the three models respectively. This means from a parking of 47 places a 10% average error in parking slot predictions. This result could be even better with longer data available. In order to make this kind of information visible and reachable from everyone having a device with internet connection, a web application was made for this purpose. Beside the data displaying, this application also offers different functions to improve the task of searching for parking. The new functions, apart from parking prediction, were: - Park distances from user location. It provides all the distances to user current location to the different parks in the city. - Geocoding. The service for matching a literal description or an address to a concrete location. - Geolocation. The service for positioning the user. - Parking list panel. This is not a service neither a function, is just a better visualization and better handling of the information.
Resumo:
Retail services are a main contributor to municipal budget and are an activity that affects perceived quality-of-life, especially for those with mobility difficulties (e.g. the elderly, low income citizens). However, there is evidence of a decline in some of the services market towns provide to their citizens. In market towns, this decline has been reported all over the western world, from North America to Australia. The aim of this research was to understand retail decline and enlighten on some ways of addressing this decline, using a case study, Thornbury, a small town in the Southwest of England. Data collected came from two participatory approaches: photo-surveys and multicriteria mapping. The interpretation of data came from using participants as analysts, but also, using systems thinking (systems diagramming and social trap theory) for theory building. This research moves away from mainstream economic and town planning perspectives by making use of different methods and concepts used in anthropology and visual sociology (photo-surveys), decision-making and ecological economics (multicriteria mapping and social trap theory). In sum, this research has experimented with different methods, out of their context, to analyse retail decline in a small town. This research developed a conceptual model for retail decline and identified the existence of conflicting goals and interests and their implications for retail decline, as well as causes for these. Most of the potential causes have had little attention in the literature. This research also identified that some of the measures commonly used for dealing with retail decline may be contributing to the causes of retail decline itself. Additionally, this research reviewed some of the measures that can be used to deal with retail decline, implications for policy-making and reflected on the use of the data collection and analysis methods in the context of small to medium towns.
Resumo:
INTRODUCTION: Malaria is a serious problem in the Brazilian Amazon region, and the detection of possible risk factors could be of great interest for public health authorities. The objective of this article was to investigate the association between environmental variables and the yearly registers of malaria in the Amazon region using Bayesian spatiotemporal methods. METHODS: We used Poisson spatiotemporal regression models to analyze the Brazilian Amazon forest malaria count for the period from 1999 to 2008. In this study, we included some covariates that could be important in the yearly prediction of malaria, such as deforestation rate. We obtained the inferences using a Bayesian approach and Markov Chain Monte Carlo (MCMC) methods to simulate samples for the joint posterior distribution of interest. The discrimination of different models was also discussed. RESULTS: The model proposed here suggests that deforestation rate, the number of inhabitants per km², and the human development index (HDI) are important in the prediction of malaria cases. CONCLUSIONS: It is possible to conclude that human development, population growth, deforestation, and their associated ecological alterations are conducive to increasing malaria risk. We conclude that the use of Poisson regression models that capture the spatial and temporal effects under the Bayesian paradigm is a good strategy for modeling malaria counts.
Resumo:
This work project (WP) is a study about a clustering strategy for Sport Zone. The general cluster study’s objective is to create groups such that within each group the individuals are similar to each other, but should be different among groups. The clusters creation is a mix of common sense, trial and error and some statistical supporting techniques. Our particular objective is to support category managers to better define the product type to be displayed in the stores’ shelves by doing store clusters. This research was carried out for Sport Zone, and comprises an objective definition, a literature review, the clustering activity itself, some factor analysis and a discriminant analysis to better frame our work. Together with this quantitative part, a survey addressed to category managers to better understand their key drivers, for choosing the type of product of each store, was carried out. Based in a non-random sample of 65 stores with data referring to 2013, the final result was the choice of 6 store clusters (Figure 1) which were individually characterized as the main outcome of this work. In what relates to our selected variables, all were important for the distinction between clusters, which proves the adequacy of their choice. The interpretation of the results gives category managers a tool to understand which products best fit the clustered stores. Furthermore, as a side finding thanks to the clusterization, a STP (Segmentation, Targeting and Positioning) was initiated, being this WP the first steps of a continuous process.
Resumo:
This paper demonstrates the significance of culture in examining the relationshipbetween democratic capital and environmental performance.The aim is to examine the relationship among scores on the Environmental Performance Index and the two dimensions of cross cultural variation suggested by Ronald Inglehart and Christian Welzel. Significantional interrelationships among democracy, cultural and environmental sustaintability measures could be found, following the regression results. Firstly, higher levels of democratic capital stock are associated with better environmental performance. Secondly importance to distinguish between cultural groups could be confirmed.
Resumo:
O paradigma de avaliação do ensino superior foi alterado em 2005 para ter em conta, para além do número de entradas, o número de alunos diplomados. Esta alteração pressiona as instituições académicas a melhorar o desempenho dos alunos. Um fenómeno perceptível ao analisar esse desempenho é que a performance registada não é nem uniforme nem constante ao longo da estadia do aluno no curso. Estas variações não estão a ser consideradas no esforço de melhorar o desempenho académico e surge motivação para detectar os diferentes perfis de desempenho e utilizar esse conhecimento para melhorar a o desempenho das instituições académicas. Este documento descreve o trabalho realizado no sentido de propor uma metodologia para detectar padrões de desempenho académico, num curso do ensino superior. Como ferramenta de análise são usadas técnicas de data mining, mais precisamente algoritmos de agrupamento. O caso de estudo para este trabalho é a população estudantil da licenciatura em Eng. Informática da FCT-UNL. Propõe-se dois modelos para o aluno, que servem de base para a análise. Um modelo analisa os alunos tendo em conta a sua performance num ano lectivo e o segundo analisa os alunos tendo em conta o seu percurso académico pelo curso, desde que entrou até se diplomar, transferir ou desistir. Esta análise é realizada recorrendo aos algoritmos de agrupamento: algoritmo aglomerativo hierárquico, k-means, SOM e SNN, entre outros.