940 resultados para K-means


Relevância:

60.00% 60.00%

Publicador:

Resumo:

Global communicationrequirements andloadimbalanceof someparalleldataminingalgorithms arethe major obstacles to exploitthe computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication costin parallel data mining algorithms and, in particular, in the k-means algorithm for cluster analysis. In the straightforward parallel formulation of the k-means algorithm, data and computation loads are uniformly distributed over the processing nodes. This approach has excellent load balancing characteristics that may suggest it could scale up to large and extreme-scale parallel computing systems. However, at each iteration step the algorithm requires a global reduction operationwhichhinders thescalabilityoftheapproach.Thisworkstudiesadifferentparallelformulation of the algorithm where the requirement of global communication is removed, while maintaining the same deterministic nature ofthe centralised algorithm. The proposed approach exploits a non-uniform data distribution which can be either found in real-world distributed applications or can be induced by means ofmulti-dimensional binary searchtrees. The approachcanalso be extended to accommodate an approximation error which allows a further reduction ofthe communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing element

Relevância:

60.00% 60.00%

Publicador:

Resumo:

A statistical–dynamical regionalization approach is developed to assess possible changes in wind storm impacts. The method is applied to North Rhine-Westphalia (Western Germany) using the FOOT3DK mesoscale model for dynamical downscaling and ECHAM5/OM1 global circulation model climate projections. The method first classifies typical weather developments within the reanalysis period using K-means cluster algorithm. Most historical wind storms are associated with four weather developments (primary storm-clusters). Mesoscale simulations are performed for representative elements for all clusters to derive regional wind climatology. Additionally, 28 historical storms affecting Western Germany are simulated. Empirical functions are estimated to relate wind gust fields and insured losses. Transient ECHAM5/OM1 simulations show an enhanced frequency of primary storm-clusters and storms for 2060–2100 compared to 1960–2000. Accordingly, wind gusts increase over Western Germany, reaching locally +5% for 98th wind gust percentiles (A2-scenario). Consequently, storm losses are expected to increase substantially (+8% for A1B-scenario, +19% for A2-scenario). Regional patterns show larger changes over north-eastern parts of North Rhine-Westphalia than for western parts. For storms with return periods above 20 yr, loss expectations for Germany may increase by a factor of 2. These results document the method's functionality to assess future changes in loss potentials in regional terms.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Boreal winter wind storm situations over Central Europe are investigated by means of an objective cluster analysis. Surface data from the NCEP-Reanalysis and ECHAM4/OPYC3-climate change GHG simulation (IS92a) are considered. To achieve an optimum separation of clusters of extreme storm conditions, 55 clusters of weather patterns are differentiated. To reduce the computational effort, a PCA is initially performed, leading to a data reduction of about 98 %. The clustering itself was computed on 3-day periods constructed with the first six PCs using "k-means" clustering algorithm. The applied method enables an evaluation of the time evolution of the synoptic developments. The climate change signal is constructed by a projection of the GCM simulation on the EOFs attained from the NCEP-Reanalysis. Consequently, the same clusters are obtained and frequency distributions can be compared. For Central Europe, four primary storm clusters are identified. These clusters feature almost 72 % of the historical extreme storms events and add only to 5 % of the total relative frequency. Moreover, they show a statistically significant signature in the associated wind fields over Europe. An increased frequency of Central European storm clusters is detected with enhanced GHG conditions, associated with an enhancement of the pressure gradient over Central Europe. Consequently, more intense wind events over Central Europe are expected. The presented algorithm will be highly valuable for the analysis of huge data amounts as is required for e.g. multi-model ensemble analysis, particularly because of the enormous data reduction.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Exascale systems are the next frontier in high-performance computing and are expected to deliver a performance of the order of 10^18 operations per second using massive multicore processors. Very large- and extreme-scale parallel systems pose critical algorithmic challenges, especially related to concurrency, locality and the need to avoid global communication patterns. This work investigates a novel protocol for dynamic group communication that can be used to remove the global communication requirement and to reduce the communication cost in parallel formulations of iterative data mining algorithms. The protocol is used to provide a communication-efficient parallel formulation of the k-means algorithm for cluster analysis. The approach is based on a collective communication operation for dynamic groups of processes and exploits non-uniform data distributions. Non-uniform data distributions can be either found in real-world distributed applications or induced by means of multidimensional binary search trees. The analysis of the proposed dynamic group communication protocol has shown that it does not introduce significant communication overhead. The parallel clustering algorithm has also been extended to accommodate an approximation error, which allows a further reduction of the communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing elements.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Global communication requirements and load imbalance of some parallel data mining algorithms are the major obstacles to exploit the computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication cost in iterative parallel data mining algorithms. In particular, the analysis focuses on one of the most influential and popular data mining methods, the k-means algorithm for cluster analysis. The straightforward parallel formulation of the k-means algorithm requires a global reduction operation at each iteration step, which hinders its scalability. This work studies a different parallel formulation of the algorithm where the requirement of global communication can be relaxed while still providing the exact solution of the centralised k-means algorithm. The proposed approach exploits a non-uniform data distribution which can be either found in real world distributed applications or can be induced by means of multi-dimensional binary search trees. The approach can also be extended to accommodate an approximation error which allows a further reduction of the communication costs.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Background: The validity of ensemble averaging on event-related potential (ERP) data has been questioned, due to its assumption that the ERP is identical across trials. Thus, there is a need for preliminary testing for cluster structure in the data. New method: We propose a complete pipeline for the cluster analysis of ERP data. To increase the signalto-noise (SNR) ratio of the raw single-trials, we used a denoising method based on Empirical Mode Decomposition (EMD). Next, we used a bootstrap-based method to determine the number of clusters, through a measure called the Stability Index (SI). We then used a clustering algorithm based on a Genetic Algorithm (GA)to define initial cluster centroids for subsequent k-means clustering. Finally, we visualised the clustering results through a scheme based on Principal Component Analysis (PCA). Results: After validating the pipeline on simulated data, we tested it on data from two experiments – a P300 speller paradigm on a single subject and a language processing study on 25 subjects. Results revealed evidence for the existence of 6 clusters in one experimental condition from the language processing study. Further, a two-way chi-square test revealed an influence of subject on cluster membership.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Extratropical transition (ET) has eluded objective identification since the realisation of its existence in the 1970s. Recent advances in numerical, computational models have provided data of higher resolution than previously available. In conjunction with this, an objective characterisation of the structure of a storm has now become widely accepted in the literature. Here we present a method of combining these two advances to provide an objective method for defining ET. The approach involves applying K-means clustering to isolate different life-cycle stages of cyclones and then analysing the progression through these stages. This methodology is then tested by applying it to five recent years from the European Centre of Medium-Range Weather Forecasting operational analyses. It is found that this method is able to determine the general characteristics for ET in the Northern Hemisphere. Between 2008 and 2012, 54% (±7, 32 of 59) of Northern Hemisphere tropical storms are estimated to undergo ET. There is great variability across basins and time of year. To fully capture all the instances of ET is necessary to introduce and characterise multiple pathways through transition. Only one of the three transition types needed has been previously well-studied. A brief description of the alternate types of transitions is given, along with illustrative storms, to assist with further study

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Precipitation over western Europe (WE) is projected to increase (decrease) roughly northward (equatorward) of 50°N during the 21st century. These changes are generally attributed to alterations in the regional large-scale circulation, e.g., jet stream, cyclone activity, and blocking frequencies. A novel weather typing within the sector (30°W–10°E, 25–70°N) is used for a more comprehensive dynamical interpretation of precipitation changes. A k-means clustering on daily mean sea level pressure was undertaken for ERA-Interim reanalysis (1979–2014). Eight weather types are identified: S1, S2, S3 (summertime types), W1, W2, W3 (wintertime types), B1, and B2 (blocking-like types). Their distinctive dynamical characteristics allow identifying the main large-scale precipitation-driving mechanisms. Simulations with 22 Coupled Model Intercomparison Project 5 models for recent climate conditions show biases in reproducing the observed seasonality of weather types. In particular, an overestimation of weather type frequencies associated with zonal airflow is identified. Considering projections following the (Representative Concentration Pathways) RCP8.5 scenario over 2071–2100, the frequencies of the three driest types (S1, B2, and W3) are projected to increase (mainly S1, +4%) in detriment of the rainiest types, particularly W1 (−3%). These changes explain most of the precipitation projections over WE. However, a weather type-independent background signal is identified (increase/decrease in precipitation over northern/southern WE), suggesting modifications in precipitation-generating processes and/or model inability to accurately simulate these processes. Despite these caveats in the precipitation scenarios for WE, which must be duly taken into account, our approach permits a better understanding of the projected trends for precipitation over WE.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This paper is concerned with the computational efficiency of fuzzy clustering algorithms when the data set to be clustered is described by a proximity matrix only (relational data) and the number of clusters must be automatically estimated from such data. A fuzzy variant of an evolutionary algorithm for relational clustering is derived and compared against two systematic (pseudo-exhaustive) approaches that can also be used to automatically estimate the number of fuzzy clusters in relational data. An extensive collection of experiments involving 18 artificial and two real data sets is reported and analyzed. (C) 2011 Elsevier B.V. All rights reserved.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

A large amount of biological data has been produced in the last years. Important knowledge can be extracted from these data by the use of data analysis techniques. Clustering plays an important role in data analysis, by organizing similar objects from a dataset into meaningful groups. Several clustering algorithms have been proposed in the literature. However, each algorithm has its bias, being more adequate for particular datasets. This paper presents a mathematical formulation to support the creation of consistent clusters for biological data. Moreover. it shows a clustering algorithm to solve this formulation that uses GRASP (Greedy Randomized Adaptive Search Procedure). We compared the proposed algorithm with three known other algorithms. The proposed algorithm presented the best clustering results confirmed statistically. (C) 2009 Elsevier Ltd. All rights reserved.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

This paper proposes a novel way to combine different observation models in a particle filter framework. This, so called, auto-adjustable observation model, enhance the particle filter accuracy when the tracked objects overlap without infringing a great runtime penalty to the whole tracking system. The approach has been tested under two important real world situations related to animal behavior: mice and larvae tracking. The proposal was compared to some state-of-art approaches and the results show, under the datasets tested, that a good trade-off between accuracy and runtime can be achieved using an auto-adjustable observation model. (C) 2009 Elsevier B.V. All rights reserved.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Parkinson's disease (PD) is the second most common neurodegenerative disorder (after Alzheimer's disease) and directly affects upto 5 million people worldwide. The stages (Hoehn and Yaar) of disease has been predicted by many methods which will be helpful for the doctors to give the dosage according to it. So these methods were brought up based on the data set which includes about seventy patients at nine clinics in Sweden. The purpose of the work is to analyze unsupervised technique with supervised neural network techniques in order to make sure the collected data sets are reliable to make decisions. The data which is available was preprocessed before calculating the features of it. One of the complex and efficient feature called wavelets has been calculated to present the data set to the network. The dimension of the final feature set has been reduced using principle component analysis. For unsupervised learning k-means gives the closer result around 76% while comparing with supervised techniques. Back propagation and J4 has been used as supervised model to classify the stages of Parkinson's disease where back propagation gives the variance percentage of 76-82%. The results of both these models have been analyzed. This proves that the data which are collected are reliable to predict the disease stages in Parkinson's disease.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

To efficiently and yet accurately cluster Web documents is of great interests to Web users and is a key component of the searching accuracy of a Web search engine. To achieve this, this paper introduces a new approach for the clustering of Web documents, which is called maximal frequent itemset (MFI) approach. Iterative clustering algorithms, such as K-means and expectation-maximization (EM), are sensitive to their initial conditions. MFI approach firstly locates the center points of high density clusters precisely. These center points then are used as initial points for the K-means algorithm. Our experimental results tested on 3 Web document sets show that our MFI approach outperforms the other methods we compared in most cases, particularly in the case of large number of categories in Web document sets.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

For many clustering algorithms, such as K-Means, EM, and CLOPE, there is usually a requirement to set some parameters. Often, these parameters directly or indirectly control the number of clusters, that is, k, to return. In the presence of different data characteristics and analysis contexts, it is often difficult for the user to estimate the number of clusters in the data set. This is especially true in text collections such as Web documents, images, or biological data. In an effort to improve the effectiveness of clustering, we seek the answer to a fundamental question: How can we effectively estimate the number of clusters in a given data set? We propose an efficient method based on spectra analysis of eigenvalues (not eigenvectors) of the data set as the solution to the above. We first present the relationship between a data set and its underlying spectra with theoretical and experimental results. We then show how our method is capable of suggesting a range of k that is well suited to different analysis contexts. Finally, we conclude with further  empirical results to show how the answer to this fundamental question enhances the clustering process for large text collections.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

For many clustering algorithms, such as k-means, EM, and CLOPE, there is usually a requirement to set some parameters. Often, these parameters directly or indirectly control the number of clusters to return. In the presence of different data characteristics and analysis contexts, it is often difficult for the user to estimate the number of clusters in the data set. This is especially true in text collections such as Web documents, images or biological data. The fundamental question this paper addresses is: ldquoHow can we effectively estimate the natural number of clusters in a given text collection?rdquo. We propose to use spectral analysis, which analyzes the eigenvalues (not eigenvectors) of the collection, as the solution to the above. We first present the relationship between a text collection and its underlying spectra. We then show how the answer to this question enhances the clustering process. Finally, we conclude with empirical results and related work.