70 results for data mining
Abstract:
Telecommunications is one of the most dynamic and strategic sectors in the world today. Organizations are constantly seeking new management practices in an increasingly competitive environment where resources are becoming scarce. In this scenario, data obtained from business and corporate processes gain even greater importance, although such data are not yet adequately explored. Knowledge Discovery in Databases (KDD) thus emerges as an option for studying complex problems in different areas of management. This work proposes both a systematization of KDD activities, drawing on concepts from different methodologies such as CRISP-DM, SEMMA and Fayyad's approach, and a study of the viability of multivariate regression models for explaining corporate telecommunications sales from performance indicators. Statistical methods were outlined to analyze the effects of these indicators on business productivity. Based on business knowledge and standard statistical analysis, regression equations were defined and fitted, and their coefficients of determination were obtained. Hypothesis tests were also conducted on the parameters in order to validate the regression models. The results show that there is a relationship between these indicators and the volume of sales.
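As a rough illustration of the regression step described in this abstract, the sketch below fits a multiple linear regression by ordinary least squares and reports the coefficient of determination. It is not the author's model: the function names (`solve`, `fit_ols`), the two indicators and the sales figures are all hypothetical, and the hypothesis tests on the parameters are left out.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return (coefficients, r_squared); coefficients[0] is the intercept."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot


# Hypothetical data: two performance indicators per period and the sales observed.
indicators = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0), (6.0, 5.0)]
sales = [9.1, 7.9, 19.2, 17.8, 29.1, 27.9]
beta, r2 = fit_ols(indicators, sales)
print("coefficients:", beta)
print("R^2:", r2)
```

A coefficient of determination close to 1 would indicate that the indicators explain most of the variation in sales; validating the individual coefficients would still require the hypothesis tests the abstract mentions.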
Abstract:
Self-organizing maps (SOM) are artificial neural networks widely used in data mining, mainly because they act as a dimensionality reduction technique, given the fixed grid of neurons associated with the network. In order to properly partition and visualize the SOM network, the various methods available in the literature must be applied in a post-processing stage, which consists of inferring relevant characteristics of the data set through its neurons. In general, applying such processing to the network neurons instead of the entire database reduces the computational cost thanks to vector quantization. This work proposes a post-processing of the SOM neurons in the input and output spaces, combining visualization techniques with algorithms based on gravitational forces and on the search for the shortest path with the greatest reward. These methods take into account the connection strength between neighbouring neurons as well as pattern density and inter-neuron distances, both associated with the positions the neurons occupy in the data space after the network is trained. The goal is to define more clearly the arrangement of the clusters present in the data. Experiments were carried out to evaluate the proposed methods using several artificially generated data sets as well as real-world data sets. The results were compared with those of a number of well-known methods from the literature.
Abstract:
Brazil's weakness in tourism competitiveness is an observable fact in data from the World Tourism Organization. In 2011 Brazil fell from 45th to 52nd position, despite leading in the natural resources attribute and ranking 23rd in cultural resources. Consequently, considerable interest and effort have been directed toward studying the competitiveness of tourism products and destinations. A tourist destination is characterized by a complex, interrelated set of tangible and intangible factors; it exhibits high complexity, high-dimensional data, nonlinearity and dynamic behavior, which makes these processes difficult to model with approaches based on classical statistical techniques. This thesis investigated structural equation models and their algorithms as applied to this area, covering the complete data analysis cycle: a confirmatory process for developing and evaluating a holistic model of tourist satisfaction; validation of the structure of the measurement model and of the structural model through multi-group invariance tests; a comparative analysis of the MLE, GLS and ULS estimation methods for modeling satisfaction; and market segmentation in the tourist destination sector using Kohonen self-organizing maps, validated with structural equation modeling. Applications were made to data analyses in the tourism sector, the main service industry of the State of Rio Grande do Norte, in which structural equation models of tourist destination behavioral patterns were theoretically developed and empirically tested.
The results of the empirical study were based on surveys using systematic random sampling, carried out in Natal-RN between January and March 2013, and provided sound evidence that the proposed theoretical model is satisfactory, with high explanatory and predictive power, and that satisfaction is the most important antecedent of destination loyalty. In addition, satisfaction mediates between travel motivation and destination loyalty, and tourists first seek satisfaction with the quality of tourism services and, subsequently, with the aspects that influence loyalty. Academic and managerial contributions are presented, and suggestions are given for future work.
Abstract:
Rising healthcare costs are a central concern for complementary health companies in Brazil. In 2011, these expenses consumed more than 80% of monthly health insurance premiums in Brazil. Once administrative costs are taken into account, companies operating in this market work, on average, at the threshold between profit and loss. This paper presents the results of an investigation into the healthcare costs of a health plan company in Brazil, based on the KDD process and exploratory data mining. A variety of results is presented, such as data summarization, which provides compact descriptions of the data and reveals common features and intrinsic observations. Among the key findings, a small portion of the population was found to be responsible for most of the demand on the resources devoted to health care.
Abstract:
Currently, one of the biggest challenges in the field of data mining is to perform cluster analysis on complex data. Several techniques have been proposed but, in general, they achieve good results only within specific areas, and there is no consensus on the best way to group this kind of data. These techniques generally fail because of unrealistic assumptions about the true probability distribution of the data. For this reason, this thesis proposes a new measure based on the Cross Information Potential that uses representative points of the dataset and statistics extracted directly from the data to measure the interaction between groups. The proposed approach makes it possible to use all the advantages of this information-theoretic descriptor while overcoming the limitations imposed on it by its own nature. On this basis, two cost functions and three algorithms are proposed to perform cluster analysis. Because Information Theory captures the relationship between different patterns regardless of assumptions about the nature of that relationship, the proposed approach achieved better performance than the main algorithms in the literature. These results hold both for synthetic data designed to test the algorithms in specific situations and for real data extracted from problems in different fields.
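For readers unfamiliar with the descriptor, the sketch below computes the standard Parzen-window estimate of the Cross Information Potential between two groups of points. It is the textbook estimator only, not the new measure, cost functions or algorithms proposed in the thesis; the function names, the kernel width `sigma` and the sample points are assumptions made for illustration.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Isotropic Gaussian kernel G_sigma(x - y) for points of equal dimension."""
    d = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    norm = (2.0 * math.pi * sigma ** 2) ** (d / 2.0)
    return math.exp(-sq_dist / (2.0 * sigma ** 2)) / norm


def cross_information_potential(cluster_a, cluster_b, sigma=1.0):
    """Parzen estimate of the integral of p_a(x) * p_b(x) over x.

    Convolving two Gaussian windows of width sigma gives a Gaussian of
    width sigma * sqrt(2), hence the widened kernel below.
    """
    wide = sigma * math.sqrt(2.0)
    total = sum(gaussian_kernel(x, y, wide) for x in cluster_a for y in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


# Hypothetical 2-D groups: `near` overlaps `base`, `far` does not.
base = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6)]
near = [(0.8, 0.5), (1.0, 1.0), (0.4, 0.9)]
far = [(8.0, 8.0), (8.5, 7.5), (9.0, 8.2)]
print(cross_information_potential(base, near))
print(cross_information_potential(base, far))
```

A larger value means the two estimated densities overlap more, so groups that should be merged score higher than well-separated ones. The pairwise double sum costs O(N_a * N_b) kernel evaluations, which is one reason working with representative points rather than all samples can be attractive.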
Abstract:
The opening of the Brazilian electricity market and the competition among companies in the energy sector have increased utilities' search for useful information and tools to assist decision making. An important source of knowledge for these utilities is the time series of energy demand. Identifying behavior patterns and describing events are important for planning, with a view to improving service quality and obtaining financial benefits. This dissertation presents a methodology based on time series mining and representation tools for extracting knowledge that relates the electricity demand series of several interconnected substations of an electric utility. The method exploits the relations of duration, coincidence and partial order of events in multidimensional time series. The knowledge is represented in the language proposed by Mörchen (2005), called Time Series Knowledge Representation (TSKR). We conducted a case study using energy demand time series from 8 substations interconnected by a ring system that feeds the metropolitan area of Goiânia-GO, provided by CELG (Companhia Energética de Goiás), which is responsible for power distribution in the state of Goiás (Brazil). Using the proposed methodology, three levels of knowledge were extracted that describe the behavior of the system studied and clearly represent its dynamics, providing a tool to assist planning activities.
Abstract:
Traditional applications of feature selection in areas such as data mining, machine learning and pattern recognition aim to improve the accuracy and reduce the computational cost of a model. This is done by removing redundant, irrelevant or noisy data and finding a representative subset of features that reduces dimensionality without loss of performance. With the progress of research on ensembles of classifiers, and the finding that this type of model performs better than individual models when the base classifiers are diverse, a new field of application for feature selection research has emerged. In this new field, the goal is to find diverse subsets of features for building the base classifiers of ensemble systems. This work proposes an approach that maximizes the diversity of ensembles by selecting subsets of features using a model that is independent of the learning algorithm and has low computational cost. This is done using bio-inspired metaheuristics with filter-based evaluation criteria.
Abstract:
Data clustering is a very important task in data mining, image processing and pattern recognition problems. One of the most popular clustering algorithms is Fuzzy C-Means (FCM). This thesis proposes a new way of calculating the cluster centers in the FCM procedure, called ckMeans, and applies it to some variants of FCM, in particular those that use other distances. The goal of this change is to reduce the number of iterations and the processing time of these algorithms without affecting the quality of the partition, or even to improve the number of correct classifications in some cases. We also developed an algorithm based on ckMeans to handle interval data using interval membership degrees. This algorithm allows the data to be represented without converting interval data into point data, as happens in other extensions of FCM that deal with interval data. To validate the proposed methodologies, the clusterings produced by the ckMeans, K-Means and FCM algorithms were compared (since the algorithm proposed in this work to calculate the centers is similar to K-Means), considering three different distances and several well-known databases. The results of Interval ckMeans were compared with those of other clustering algorithms when applied to an interval database containing the minimum and maximum monthly temperatures for a given year for 37 cities distributed across continents.
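As background for the change this abstract describes, the sketch below shows one common formulation of the standard FCM iteration, alternating the membership update and the center update with Euclidean distance. It deliberately does not implement ckMeans, whose center formula is not given in the abstract; the function name `fcm`, the fuzzifier `m = 2`, the fixed iteration count and the sample points are assumptions for illustration.

```python
import math


def fcm(points, centers, m=2.0, iterations=50):
    """Run standard Fuzzy C-Means from the given initial centers.

    Returns (centers, memberships), where memberships[j][i] is the degree
    to which point j belongs to cluster i, as computed in the last iteration.
    """
    centers = [list(c) for c in centers]
    dim = len(points[0])
    memberships = []
    for _ in range(iterations):
        # Membership update: u_ij = 1 / sum_k (d_ij / d_kj)^(2 / (m - 1)).
        memberships = []
        for p in points:
            # Clamp distances to avoid division by zero when a point sits on a center.
            dists = [max(math.dist(p, c), 1e-12) for c in centers]
            row = [1.0 / sum((di / dk) ** (2.0 / (m - 1.0)) for dk in dists)
                   for di in dists]
            memberships.append(row)
        # Center update: weighted mean of the points with weights u_ij^m.
        for i in range(len(centers)):
            weights = [row[i] ** m for row in memberships]
            total = sum(weights)
            centers[i] = [sum(w * p[d] for w, p in zip(weights, points)) / total
                          for d in range(dim)]
    return centers, memberships


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, memberships = fcm(points, [(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

The center update is the step ckMeans replaces. A practical implementation would normally stop when the centers move less than a tolerance instead of running a fixed number of iterations, which is how the iteration counts discussed in the abstract would be measured.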
Abstract:
Data clustering is applied in various fields, such as data mining, image processing and pattern recognition. Clustering algorithms split a data set into clusters such that elements within the same cluster have a high degree of similarity, while elements belonging to different clusters have a high degree of dissimilarity. The Fuzzy C-Means algorithm (FCM) is the fuzzy clustering algorithm most used and discussed in the literature. The performance of FCM is strongly affected by the selection of the initial cluster centers, so choosing a good set of initial centers is very important for the performance of the algorithm. In FCM, however, the initial centers are chosen randomly, which makes it difficult to find a good set. This paper proposes three new methods to obtain initial cluster centers deterministically for the FCM algorithm, which can also be used in variants of FCM. In this work these initialization methods were applied to the ckMeans variant. With the proposed methods, we intend to obtain a set of initial centers that are close to the real cluster centers. With these new initialization approaches, the aim is to reduce the number of iterations these algorithms need to converge and their processing time without affecting the quality of the clustering, or even to improve it in some cases. Accordingly, cluster validation indices were used to measure the quality of the clusters obtained by the modified FCM and ckMeans algorithms with the proposed initialization methods when applied to various data sets.
Abstract:
Educational Data Mining is an application domain within artificial intelligence that has been extensively explored in recent years. Technological advances, and in particular the increasing use of virtual learning environments, have generated considerable amounts of data to be investigated. Among the tasks addressed in this context is the prediction of students' school performance, which can be accomplished with machine learning techniques. Such techniques may be used to classify students into predefined labels. One strategy for applying these techniques is to combine them into multi-classifier systems, whose efficiency has been demonstrated by results from studies in several areas, such as medicine, commerce and biometrics. The data used in the experiments were obtained from student interactions in Moodle, one of the most widely used virtual learning environments. In this context, this paper presents the results of several experiments using specific multi-classifier systems, called ensembles, with the aim of achieving better results in school performance prediction, that is, the highest accuracy in student classification. This paper therefore presents a substantial exploration of educational data and analyses of the relevant results of these experiments.
Abstract:
Soft skills and teamwork practices were identified as the main deficiencies of recent graduates of computing courses. This issue led to a qualitative study aimed at investigating the challenges faced by professors of those courses in conducting, monitoring and assessing collaborative software development projects. Different challenges were reported by the professors, including difficulties in assessing students at both the collective and individual levels. In this context, a quantitative study was conducted with the aim of mapping students' soft skills to a set of indicators that can be extracted from software repositories using data mining techniques. These indicators are intended to measure soft skills such as teamwork, leadership, problem solving and pace of communication. A peer assessment approach was then applied in a collaborative software development course of the software engineering major at the Federal University of Rio Grande do Norte (UFRN). This research presents a correlation study between the students' soft skills scores and indicators based on mining software repositories. This study contributes by: (i) presenting professors' perceptions of the difficulties and opportunities for improving management and monitoring practices in collaborative software development projects; (ii) investigating relationships between soft skills and the activities performed by students in software repositories; (iii) encouraging the development of soft skills and the use of software repositories among software engineering students; (iv) contributing to the state of the art of three important areas of software engineering, namely software engineering education, educational data mining and human aspects of software engineering.
Abstract:
This work aims to characterize workers in mining activities exposed to lung injuries in the municipality of Parelhas, Rio Grande do Norte State, seeking to relate respiratory diseases to mining activity. The study area (Parelhas), with about 19,700 inhabitants, is located in the Seridó region, approximately 232 km from Natal. About 5,000 people are involved in informal mining activity (garimpo) in the Seridó region. These workers generally do not use any kind of personal protective equipment and develop, at an early age and during their most productive years, severe forms of disease that end up disabling them for professional activities and for family and social life. Deaths from respiratory problems (e.g. silicosis) have been reported in very young adults. A descriptive observational study was conducted based on records from Dr. José Augusto Dantas Hospital for the years 1996 to 2006. The occupational and socio-economic characteristics of the population, which was selected from the hospital records, were obtained through individually answered forms. The purpose was to link occupational activities to respiratory diseases. The next stage of the research was an observational case-control study in a 1:1 ratio. The data obtained confirmed the central hypothesis of the research, which states that the pneumoconiosis cases are due to mineral-based activities in the study area. The final step of the investigation sought to assess the knowledge of silicosis among relatives of students in public and private elementary and high schools in Parelhas. About 15.4% of urban schools were analyzed through a structured questionnaire. The results show distinct socio-economic levels and a difference in the perception of silicosis between the relatives of students in public and in private schools.
It was possible to identify the characteristics of the population economically involved in mineral-based activities and to define the group that deserves priority attention in preventive actions. The work also points out some environmental problems caused by inadequate mining operations in the region.
Abstract:
When a company wishes to invest in a project, it must obtain the resources needed to make the investment. The alternatives are to use the firm's internal resources or to obtain external resources through debt contracts and the issuance of shares. Decisions about the mix of internal resources, debt and shares in the total resources used to finance a company's activities concern the choice of its capital structure. Although there are studies in finance on the determinants of firms' debt, the issue of capital structure remains controversial. This work sought to identify the predominant factors that determine the capital structure of Brazilian non-financial joint-stock firms. A quantitative approach was used, applying multiple linear regression to panel data. Estimates were made by ordinary least squares with a fixed-effects model. About 116 companies were selected for this research, and the period considered is 2003 to 2007. The variables and hypotheses tested were built on theories of capital structure and on empirical research. The results indicate that variables such as risk, size, asset composition and firm growth influence indebtedness. The profitability variable was not relevant to the composition of indebtedness of the companies analyzed. However, when only long-term debt is analyzed, the relevant variables are firm size and, especially, the composition of assets (tangibility). In this sense, the smaller the firm or the greater the share of fixed assets in total assets, the greater its propensity for long-term debt. Furthermore, this research could not identify a predominant theory to explain the capital structure of Brazilian
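To make the estimation approach mentioned in this abstract more concrete, the sketch below shows the usual "within" form of a fixed-effects regression: each firm's observations are demeaned, which removes the firm-specific intercept, and ordinary least squares is then applied to the demeaned data. It uses a single regressor for brevity, whereas the study uses several; the function name `within_estimator`, the firm labels and all figures are hypothetical.

```python
from collections import defaultdict


def within_estimator(panel):
    """Estimate the slope of y on x with firm fixed effects.

    `panel` is a list of (firm, x, y) observations. Demeaning x and y
    within each firm removes the firm-specific intercept, after which
    ordinary least squares on the demeaned data gives the slope.
    """
    groups = defaultdict(list)
    for firm, x, y in panel:
        groups[firm].append((x, y))
    sxy = sxx = 0.0
    for obs in groups.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
    return sxy / sxx


# Hypothetical panel: x could be asset tangibility, y a debt ratio.
# Firm A has intercept 5 and firm B intercept 20; both share a slope of 0.5.
panel = [
    ("A", 10.0, 10.0), ("A", 12.0, 11.0), ("A", 14.0, 12.0),
    ("B", 30.0, 35.0), ("B", 32.0, 36.0), ("B", 36.0, 38.0),
]
slope = within_estimator(panel)
print(slope)
```

Pooling the same data without firm effects would mix the difference in intercepts into the slope, which is the reason a fixed-effects specification is chosen when firms differ in unobserved, time-invariant ways.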
Abstract:
This work aims to analyze information technology (IT) risks in data migration procedures. It considers ALEPH, the Integrated Library System (ILS) whose data were migrated to the Library Module of the software called Sistema Integrado de Gestão de Atividades Acadêmicas (SIGAA) at the Zila Mamede Central Library of the Federal University of Rio Grande do Norte (UFRN) in Natal, Brazil. The methodological procedure was qualitative exploratory research, with a case study carried out at that library in order to better understand this phenomenon. Data were collected through a semi-structured interview applied to eleven (11) subjects employed at the library and at the Technology Superintendence of UFRN. The data were examined using content analysis and a thematic review process. After data migration, the interview results were then linked to the units of analysis and their system records, with corresponding categories. The main risks detected were data destruction, data loss, database communication failure, user response delay, and data inconsistency and duplication. These elements have implications and cause disruptions that affect external and internal system users, leading to stress, duplicated work and inconvenience. Some risk management measures were therefore taken, such as adequate planning, central management support and pilot test simulations. Their advantages include reducing risk, the occurrence of problems and possible unforeseen costs, and making it possible to achieve organizational objectives, among others. It is therefore inferred that risks in database conversion in libraries exist and that some are predictable; however, librarians are either unaware of them or disregard them and show little concern for identifying risks in database conversion, even though recognizing them would minimize or even eliminate them.
Another important aspect to consider is the scarcity of empirical research dealing specifically with this subject, which points to the need for new approaches in order to promote a better understanding of the matter in the corporate environment of information units.