91 resultados para Agrupamento de dados
em Universidade Federal do Rio Grande do Norte(UFRN)
Resumo:
Symbolic Data Analysis (SDA) main aims to provide tools for reducing large databases to extract knowledge and provide techniques to describe the unit of such data in complex units, as such, interval or histogram. The objective of this work is to extend classical clustering methods for symbolic interval data based on interval-based distance. The main advantage of using an interval-based distance for interval-based data lies on the fact that it preserves the underlying imprecision on intervals which is usually lost when real-valued distances are applied. This work includes an approach allow existing indices to be adapted to interval context. The proposed methods with interval-based distances are compared with distances punctual existing literature through experiments with simulated data and real data interval
Resumo:
The use of clustering methods for the discovery of cancer subtypes has drawn a great deal of attention in the scientific community. While bioinformaticians have proposed new clustering methods that take advantage of characteristics of the gene expression data, the medical community has a preference for using classic clustering methods. There have been no studies thus far performing a large-scale evaluation of different clustering methods in this context. This work presents the first large-scale analysis of seven different clustering methods and four proximity measures for the analysis of 35 cancer gene expression data sets. Results reveal that the finite mixture of Gaussians, followed closely by k-means, exhibited the best performance in terms of recovering the true structure of the data sets. These methods also exhibited, on average, the smallest difference between the actual number of classes in the data sets and the best number of clusters as indicated by our validation criteria. Furthermore, hierarchical methods, which have been widely used by the medical community, exhibited a poorer recovery performance than that of the other methods evaluated. Moreover, as a stable basis for the assessment and comparison of different clustering methods for cancer gene expression data, this study provides a common group of data sets (benchmark data sets) to be shared among researchers and used for comparisons with new methods
Resumo:
The main goal of this work is to investigate the suitability of applying cluster ensemble techniques (ensembles or committees) to gene expression data. More specifically, we will develop experiments with three diferent cluster ensembles methods, which have been used in many works in literature: coassociation matrix, relabeling and voting, and ensembles based on graph partitioning. The inputs for these methods will be the partitions generated by three clustering algorithms, representing diferent paradigms: kmeans, ExpectationMaximization (EM), and hierarchical method with average linkage. These algorithms have been widely applied to gene expression data. In general, the results obtained with our experiments indicate that the cluster ensemble methods present a better performance when compared to the individual techniques. This happens mainly for the heterogeneous ensembles, that is, ensembles built with base partitions generated with diferent clustering algorithms
Resumo:
Este estudo teve por objetivo avaliar a eficácia de uma estratégia de ensino sobre diagnósticos de enfermagem, fundamentada na aprendizagem, baseada em problemas no desempenho do raciocínio clínico e julgamento diagnóstico dos discentes de graduação. É estudo experimental, realizado em duas fases: validação de conteúdo dos problemas e aplicação da estratégia educativa. Os resultados mostraram melhora na capacidade de agrupamento dos dados dos discentes do grupo experimental. Conclui-se que houve influência positiva da estratégia implementada
Resumo:
In this paper artificial neural network (ANN) based on supervised and unsupervised algorithms were investigated for use in the study of rheological parameters of solid pharmaceutical excipients, in order to develop computational tools for manufacturing solid dosage forms. Among four supervised neural networks investigated, the best learning performance was achieved by a feedfoward multilayer perceptron whose architectures was composed by eight neurons in the input layer, sixteen neurons in the hidden layer and one neuron in the output layer. Learning and predictive performance relative to repose angle was poor while to Carr index and Hausner ratio (CI and HR, respectively) showed very good fitting capacity and learning, therefore HR and CI were considered suitable descriptors for the next stage of development of supervised ANNs. Clustering capacity was evaluated for five unsupervised strategies. Network based on purely unsupervised competitive strategies, classic "Winner-Take-All", "Frequency-Sensitive Competitive Learning" and "Rival-Penalize Competitive Learning" (WTA, FSCL and RPCL, respectively) were able to perform clustering from database, however this classification was very poor, showing severe classification errors by grouping data with conflicting properties into the same cluster or even the same neuron. On the other hand it could not be established what was the criteria adopted by the neural network for those clustering. Self-Organizing Maps (SOM) and Neural Gas (NG) networks showed better clustering capacity. Both have recognized the two major groupings of data corresponding to lactose (LAC) and cellulose (CEL). However, SOM showed some errors in classify data from minority excipients, magnesium stearate (EMG) , talc (TLC) and attapulgite (ATP). NG network in turn performed a very consistent classification of data and solve the misclassification of SOM, being the most appropriate network for classifying data of the study. The use of NG network in pharmaceutical technology was still unpublished. NG therefore has great potential for use in the development of software for use in automated classification systems of pharmaceutical powders and as a new tool for mining and clustering data in drug development
Resumo:
Data clustering is applied to various fields such as data mining, image processing and pattern recognition technique. Clustering algorithms splits a data set into clusters such that elements within the same cluster have a high degree of similarity, while elements belonging to different clusters have a high degree of dissimilarity. The Fuzzy C-Means Algorithm (FCM) is a fuzzy clustering algorithm most used and discussed in the literature. The performance of the FCM is strongly affected by the selection of the initial centers of the clusters. Therefore, the choice of a good set of initial cluster centers is very important for the performance of the algorithm. However, in FCM, the choice of initial centers is made randomly, making it difficult to find a good set. This paper proposes three new methods to obtain initial cluster centers, deterministically, the FCM algorithm, and can also be used in variants of the FCM. In this work these initialization methods were applied in variant ckMeans.With the proposed methods, we intend to obtain a set of initial centers which are close to the real cluster centers. With these new approaches startup if you want to reduce the number of iterations to converge these algorithms and processing time without affecting the quality of the cluster or even improve the quality in some cases. Accordingly, cluster validation indices were used to measure the quality of the clusters obtained by the modified FCM and ckMeans algorithms with the proposed initialization methods when applied to various data sets
Resumo:
Este estudo teve por objetivo avaliar a eficácia de uma estratégia de ensino sobre diagnósticos de enfermagem, fundamentada na aprendizagem, baseada em problemas no desempenho do raciocínio clínico e julgamento diagnóstico dos discentes de graduação. É estudo experimental, realizado em duas fases: validação de conteúdo dos problemas e aplicação da estratégia educativa. Os resultados mostraram melhora na capacidade de agrupamento dos dados dos discentes do grupo experimental. Conclui-se que houve influência positiva da estratégia implementada
Resumo:
Self-organizing maps (SOM) are artificial neural networks widely used in the data mining field, mainly because they constitute a dimensionality reduction technique given the fixed grid of neurons associated with the network. In order to properly the partition and visualize the SOM network, the various methods available in the literature must be applied in a post-processing stage, that consists of inferring, through its neurons, relevant characteristics of the data set. In general, such processing applied to the network neurons, instead of the entire database, reduces the computational costs due to vector quantization. This work proposes a post-processing of the SOM neurons in the input and output spaces, combining visualization techniques with algorithms based on gravitational forces and the search for the shortest path with the greatest reward. Such methods take into account the connection strength between neighbouring neurons and characteristics of pattern density and distances among neurons, both associated with the position that the neurons occupy in the data space after training the network. Thus, the goal consists of defining more clearly the arrangement of the clusters present in the data. Experiments were carried out so as to evaluate the proposed methods using various artificially generated data sets, as well as real world data sets. The results obtained were compared with those from a number of well-known methods existent in the literature
Resumo:
The main objective of this study is to apply recently developed methods of physical-statistic to time series analysis, particularly in electrical induction s profiles of oil wells data, to study the petrophysical similarity of those wells in a spatial distribution. For this, we used the DFA method in order to know if we can or not use this technique to characterize spatially the fields. After obtain the DFA values for all wells, we applied clustering analysis. To do these tests we used the non-hierarchical method called K-means. Usually based on the Euclidean distance, the K-means consists in dividing the elements of a data matrix N in k groups, so that the similarities among elements belonging to different groups are the smallest possible. In order to test if a dataset generated by the K-means method or randomly generated datasets form spatial patterns, we created the parameter Ω (index of neighborhood). High values of Ω reveals more aggregated data and low values of Ω show scattered data or data without spatial correlation. Thus we concluded that data from the DFA of 54 wells are grouped and can be used to characterize spatial fields. Applying contour level technique we confirm the results obtained by the K-means, confirming that DFA is effective to perform spatial analysis
Resumo:
In recent years, the DFA introduced by Peng, was established as an important tool capable of detecting long-range autocorrelation in time series with non-stationary. This technique has been successfully applied to various areas such as: Econophysics, Biophysics, Medicine, Physics and Climatology. In this study, we used the DFA technique to obtain the Hurst exponent (H) of the profile of electric density profile (RHOB) of 53 wells resulting from the Field School of Namorados. In this work we want to know if we can or not use H to spatially characterize the spatial data field. Two cases arise: In the first a set of H reflects the local geology, with wells that are geographically closer showing similar H, and then one can use H in geostatistical procedures. In the second case each well has its proper H and the information of the well are uncorrelated, the profiles show only random fluctuations in H that do not show any spatial structure. Cluster analysis is a method widely used in carrying out statistical analysis. In this work we use the non-hierarchy method of k-means. In order to verify whether a set of data generated by the k-means method shows spatial patterns, we create the parameter Ω (index of neighborhood). High Ω shows more aggregated data, low Ω indicates dispersed or data without spatial correlation. With help of this index and the method of Monte Carlo. Using Ω index we verify that random cluster data shows a distribution of Ω that is lower than actual cluster Ω. Thus we conclude that the data of H obtained in 53 wells are grouped and can be used to characterize space patterns. The analysis of curves level confirmed the results of the k-means
Resumo:
Clustering data is a very important task in data mining, image processing and pattern recognition problems. One of the most popular clustering algorithms is the Fuzzy C-Means (FCM). This thesis proposes to implement a new way of calculating the cluster centers in the procedure of FCM algorithm which are called ckMeans, and in some variants of FCM, in particular, here we apply it for those variants that use other distances. The goal of this change is to reduce the number of iterations and processing time of these algorithms without affecting the quality of the partition, or even to improve the number of correct classifications in some cases. Also, we developed an algorithm based on ckMeans to manipulate interval data considering interval membership degrees. This algorithm allows the representation of data without converting interval data into punctual ones, as it happens to other extensions of FCM that deal with interval data. In order to validate the proposed methodologies it was made a comparison between a clustering for ckMeans, K-Means and FCM algorithms (since the algorithm proposed in this paper to calculate the centers is similar to the K-Means) considering three different distances. We used several known databases. In this case, the results of Interval ckMeans were compared with the results of other clustering algorithms when applied to an interval database with minimum and maximum temperature of the month for a given year, referring to 37 cities distributed across continents
Resumo:
Peng was the first to work with the Technical DFA (Detrended Fluctuation Analysis), a tool capable of detecting auto-long-range correlation in time series with non-stationary. In this study, the technique of DFA is used to obtain the Hurst exponent (H) profile of the electric neutron porosity of the 52 oil wells in Namorado Field, located in the Campos Basin -Brazil. The purpose is to know if the Hurst exponent can be used to characterize spatial distribution of wells. Thus, we verify that the wells that have close values of H are spatially close together. In this work we used the method of hierarchical clustering and non-hierarchical clustering method (the k-mean method). Then compare the two methods to see which of the two provides the best result. From this, was the parameter � (index neighborhood) which checks whether a data set generated by the k- average method, or at random, so in fact spatial patterns. High values of � indicate that the data are aggregated, while low values of � indicate that the data are scattered (no spatial correlation). Using the Monte Carlo method showed that combined data show a random distribution of � below the empirical value. So the empirical evidence of H obtained from 52 wells are grouped geographically. By passing the data of standard curves with the results obtained by the k-mean, confirming that it is effective to correlate well in spatial distribution
Resumo:
The main objective of this study is to apply recently developed methods of physical-statistic to time series analysis, particularly in electrical induction s profiles of oil wells data, to study the petrophysical similarity of those wells in a spatial distribution. For this, we used the DFA method in order to know if we can or not use this technique to characterize spatially the fields. After obtain the DFA values for all wells, we applied clustering analysis. To do these tests we used the non-hierarchical method called K-means. Usually based on the Euclidean distance, the K-means consists in dividing the elements of a data matrix N in k groups, so that the similarities among elements belonging to different groups are the smallest possible. In order to test if a dataset generated by the K-means method or randomly generated datasets form spatial patterns, we created the parameter Ω (index of neighborhood). High values of Ω reveals more aggregated data and low values of Ω show scattered data or data without spatial correlation. Thus we concluded that data from the DFA of 54 wells are grouped and can be used to characterize spatial fields. Applying contour level technique we confirm the results obtained by the K-means, confirming that DFA is effective to perform spatial analysis
Resumo:
In recent years, the DFA introduced by Peng, was established as an important tool capable of detecting long-range autocorrelation in time series with non-stationary. This technique has been successfully applied to various areas such as: Econophysics, Biophysics, Medicine, Physics and Climatology. In this study, we used the DFA technique to obtain the Hurst exponent (H) of the profile of electric density profile (RHOB) of 53 wells resulting from the Field School of Namorados. In this work we want to know if we can or not use H to spatially characterize the spatial data field. Two cases arise: In the first a set of H reflects the local geology, with wells that are geographically closer showing similar H, and then one can use H in geostatistical procedures. In the second case each well has its proper H and the information of the well are uncorrelated, the profiles show only random fluctuations in H that do not show any spatial structure. Cluster analysis is a method widely used in carrying out statistical analysis. In this work we use the non-hierarchy method of k-means. In order to verify whether a set of data generated by the k-means method shows spatial patterns, we create the parameter Ω (index of neighborhood). High Ω shows more aggregated data, low Ω indicates dispersed or data without spatial correlation. With help of this index and the method of Monte Carlo. Using Ω index we verify that random cluster data shows a distribution of Ω that is lower than actual cluster Ω. Thus we conclude that the data of H obtained in 53 wells are grouped and can be used to characterize space patterns. The analysis of curves level confirmed the results of the k-means
Resumo:
The theoretical recital of the present study it is initiated of the evidence that the work occupies an important space in the man s life in way that the majority of the people works and passes great part of its time inside organizati ons. However, it is verified that the relation between man and work is becoming increasingly disagreement a time that the employees had started to complain work s routines, stress, not use all their potential and inadequate work s conditions. It can be observed by the way of Dejours (1994) studies. Thus, as contribution for the quality of work life s (QWL) studies the research developed here objectified to characterize the public employees quality of work life at EMATER -RN taking as reference an instrumen t of research synthesized from the typical academic literature of the subject. The synthesis of an ampler instrument is a necessity not taken care to the literature that treats on the subject but already perceived by some studies like Moraes et al (1990); Rodrigues (1989); Siqueira & Coleta (1989); Moraes et al (1992); Carvalho & Souza (2003); El -Aouar & Souza (2003) and Mourão, Kilimnick & Fernandes (2005); Adorno, Marques & Borges (2005) amongst others. These studies point out weak points of the existing models in the QWL s literature, as well as they recommend the elaboration of a model more flexible, that contemplates Brazilian cultural characteristics, and that contemplates the entire variable studied in the main existing models. For reach this objectiv e the adopted methodology was characterized as a case study with collected data in qualitative and quantitative way. Questionnaires and comments had been used as sources of evidences. These evidences had been tabulated through of statistical package SPSS ( Statistical Package for Social Science), in which the main technique of multivariate analysis used were the factorial analysis. As for the gotten results, it was verified the grouping of the quality of work life s indicators in 11 factors which are: Work s execution, Individual accomplishment, Work s equity, Relation individual and organization, Work s organization, Adequacy of the remuneration, Relation between head and subordinate, Effectiveness of the communication and the learning, Relation between work and personal life, Participation and Effectiveness of the work processes. Whatever to the characterization of the EMATER -RN s quality of work life it was clearly that to the measure that the satisfaction s evaluation with the QWL in the organization walks to intrinsic factors for extrinsic factors this level of satisfaction goes diminishing what points to the importance to improve these extrinsic factors in the institution. In summary it is possible to conclude that the organization studied has offered a significant set of referring variable to the quality of work life of the individual