229 results for Cluster Counting Algorithm
Abstract:
The K-Means algorithm for cluster analysis is one of the most influential and popular data mining methods. Its straightforward parallel formulation is well suited to distributed memory systems with reliable interconnection networks, such as massively parallel processors and clusters of workstations. However, in large-scale geographically distributed systems the straightforward parallel algorithm can be rendered useless by a single communication failure or by high latency in communication paths. The lack of scalable and fault-tolerant global communication and synchronisation methods in large-scale systems has hindered the adoption of the K-Means algorithm in large networked systems such as wireless sensor networks, peer-to-peer systems and mobile ad hoc networks. This work proposes a fully distributed K-Means algorithm (EpidemicK-Means) which does not require global communication and is intrinsically fault tolerant. The proposed distributed K-Means algorithm provides a clustering solution which can approximate the solution of an ideal centralised algorithm over the aggregated data as closely as desired. A comparative performance analysis is carried out against state-of-the-art sampling methods and shows that the proposed method overcomes the limitations of sampling-based approaches for skewed cluster distributions. The experimental analysis confirms that the proposed algorithm is accurate and fault tolerant under unreliable network conditions (message loss and node failures) and is suitable for asynchronous networks of very large and extreme scale.
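A minimal sketch of the epidemic idea: the global reduction is replaced by pairwise gossip averaging of per-centroid sums and counts, so every node converges to the same aggregate statistics without any global communication. The function names, the uniform push-pull schedule and the fixed round count below are illustrative assumptions, not the paper's exact protocol.

```python
import random
import numpy as np

def local_kmeans_stats(local_data, centroids):
    """One local K-Means step: assign local points to the nearest centroid
    and return per-centroid sums and counts (the quantities to be gossiped)."""
    sums = np.zeros_like(centroids)
    counts = np.zeros(len(centroids))
    for x in local_data:
        j = np.argmin(np.linalg.norm(centroids - x, axis=1))
        sums[j] += x
        counts[j] += 1
    return sums, counts

def gossip_average(states, rounds=30):
    """Uniform push-pull gossip (an assumed schedule): each round, every
    node averages its (sums, counts) state with one random peer; all
    states converge towards the global average without a reduction."""
    n = len(states)
    for _ in range(rounds):
        for i in range(n):
            j = random.randrange(n)
            avg_s = (states[i][0] + states[j][0]) / 2
            avg_c = (states[i][1] + states[j][1]) / 2
            states[i] = (avg_s.copy(), avg_c.copy())
            states[j] = (avg_s.copy(), avg_c.copy())
    return states

# Demo: 4 nodes, each holding local 2-D data, k = 2 centroids.
rng = np.random.default_rng(0)
data = [rng.normal(size=(50, 2)) + i for i in range(4)]
centroids = rng.normal(size=(2, 2))
states = [local_kmeans_stats(d, centroids) for d in data]
states = gossip_average(states)
sums, counts = states[0]
centroids = sums / np.maximum(counts, 1)[:, None]  # next-iteration centroids
```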
Abstract:
Advances in hardware and software over the past decade have made it possible to capture, record and process fast data streams at large scale. The research area of data stream mining has emerged from these advances in order to cope with the real-time analysis of potentially large and changing data streams. Examples of data streams include Google searches, credit card transactions, telemetric data and data from continuous chemical production processes. In some cases the data can be processed in batches by traditional data mining approaches. However, some applications require the data to be analysed in real time as soon as it is captured, for example when the data stream is infinite, fast changing, or simply too large to be stored. One of the most important data mining techniques on data streams is classification. This involves training the classifier on the data stream in real time and adapting it to concept drifts. Most data stream classifiers are based on decision trees. However, it is well known in the data mining community that there is no single optimal algorithm: an algorithm may work well on one or several datasets but badly on others. This paper introduces eRules, a new rule-based adaptive classifier for data streams, based on an evolving set of rules. eRules induces a set of rules that is constantly evaluated and adapted to changes in the data stream by adding new rules and removing old ones. It differs from the more popular decision-tree-based classifiers in that it tends to leave data instances unclassified rather than forcing a classification that could be wrong. The ongoing development of eRules aims to improve its accuracy further through dynamic parameter setting, which will also address the problem of changing feature domain values.
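A minimal sketch of the eRules idea under stated assumptions: prediction abstains when no rule fires, per-rule accuracy is tracked on the stream, and rules that drift below a threshold are pruned. The threshold values are assumptions, and the induction of new rules from the buffered instances (eRules itself builds on a Prism-style rule learner) is omitted.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    # Conjunction of (feature, low, high) interval conditions -> class label.
    conditions: list
    label: str
    correct: int = 0
    seen: int = 0

    def covers(self, x):
        return all(lo <= x[f] <= hi for f, lo, hi in self.conditions)

class StreamRuleClassifier:
    """Sketch of an eRules-style adaptive classifier: predict only when a
    rule fires (otherwise abstain), evaluate each rule on the labelled
    stream, and remove rules whose accuracy has drifted too low."""

    def __init__(self, min_accuracy=0.7, min_seen=20):
        self.rules = []
        self.min_accuracy = min_accuracy
        self.min_seen = min_seen
        self.unclassified = []   # buffer from which new rules are induced

    def predict(self, x):
        for rule in self.rules:
            if rule.covers(x):
                return rule.label
        return None              # abstain rather than force a guess

    def update(self, x, y):
        fired = False
        for rule in self.rules:
            if rule.covers(x):
                fired = True
                rule.seen += 1
                rule.correct += (rule.label == y)
        if not fired:
            self.unclassified.append((x, y))
        # Prune rules whose stream accuracy has dropped (concept drift).
        self.rules = [r for r in self.rules if r.seen < self.min_seen
                      or r.correct / r.seen >= self.min_accuracy]

# Toy rule: feature 0 in [0, 1] and feature 1 in [5, 10] -> class "A".
clf = StreamRuleClassifier()
clf.rules.append(Rule(conditions=[(0, 0.0, 1.0), (1, 5.0, 10.0)], label="A"))
print(clf.predict({0: 0.5, 1: 7.0}))   # "A"
print(clf.predict({0: 3.0, 1: 7.0}))   # None: abstains
```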
Abstract:
This contribution introduces a new digital predistorter to compensate for the serious distortions caused by memory high power amplifiers (HPAs) which exhibit output saturation characteristics. The proposed design is based on direct learning using a data-driven B-spline Wiener system modeling approach. The nonlinear HPA with memory is first identified based on the B-spline neural network model using the Gauss-Newton algorithm, which incorporates the efficient De Boor algorithm with both B-spline curve and first-derivative recursions. The estimated Wiener HPA model is then used to design the Hammerstein predistorter. In particular, the inverse of the amplitude distortion of the HPA's static nonlinearity can be calculated effectively using the Newton-Raphson formula based on the inverse De Boor algorithm. A major advantage of this approach is that both the Wiener HPA identification and the Hammerstein predistorter inverse can be achieved very efficiently and accurately. Simulation results are presented to demonstrate the effectiveness of this novel digital predistorter design.
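The Newton-Raphson inversion of the static amplitude nonlinearity can be illustrated with a generic saturating curve standing in for the identified B-spline model; psi, dpsi and all constants here are illustrative assumptions, whereas in the paper's setting the curve and its first derivative would be evaluated via the De Boor recursions.

```python
import math

def invert_amplitude(psi, dpsi, y, r0=0.1, tol=1e-10, max_iter=50):
    """Newton-Raphson inversion of a static amplitude nonlinearity:
    find r such that psi(r) = y, given psi and its derivative dpsi."""
    r = r0
    for _ in range(max_iter):
        f = psi(r) - y
        if abs(f) < tol:
            break
        r -= f / dpsi(r)   # Newton step
    return r

# Toy saturating nonlinearity standing in for the identified HPA model.
psi = lambda r: math.tanh(2.0 * r)              # amplitude response
dpsi = lambda r: 2.0 / math.cosh(2.0 * r) ** 2  # its first derivative

r = invert_amplitude(psi, dpsi, y=0.5)
print(r, psi(r))   # psi(r) is approximately 0.5
```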
Abstract:
High spatial resolution environmental data give us a better understanding of the environmental factors affecting plant distributions at fine spatial scales. However, large environmental datasets dramatically increase compute times and output species model sizes, motivating the need for an alternative computing solution. Cluster computing offers such a solution by allowing both multiple plant species Environmental Niche Models (ENMs) and individual tiles of high spatial resolution models to be computed concurrently on the same compute cluster. We apply our methodology to a case study of 4,209 species of Mediterranean flora (around 17% of the species believed present in the biome). We demonstrate a 16-times speed-up of ENM computation time when 16 CPUs are used on the compute cluster. Our custom Java ‘Merge’ and ‘Downsize’ programs reduce ENM output file sizes by 94%. The median test AUC score of 0.98 across species ENMs is aided by various species occurrence data filtering techniques. Finally, by calculating the percentage change of individual grid cell values, we map the projected percentages of plant species vulnerable to climate change in the Mediterranean region between 1950–2000 and 2020.
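The embarrassingly parallel structure described above (one ENM task per species and tile) can be sketched with a process pool; fit_enm is a hypothetical placeholder, and the study's actual modelling tool and the Java ‘Merge’/‘Downsize’ post-processing are not shown.

```python
from multiprocessing import Pool

def fit_enm(task):
    """Placeholder for fitting one Environmental Niche Model: in the study
    this would run a niche model for one species on one tile of the
    high-resolution environmental grid and write the output to disk."""
    species, tile = task
    # ... load occurrences and environmental layers, fit model, save ...
    return (species, tile, "ok")

if __name__ == "__main__":
    # One task per (species, tile) pair; the pool spreads tasks over CPUs,
    # which is the source of the reported near-linear speed-up.
    tasks = [(s, t) for s in ["species_a", "species_b"] for t in range(4)]
    with Pool(processes=4) as pool:
        for result in pool.map(fit_enm, tasks):
            print(result)
```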
Abstract:
Evolutionary meta-algorithms for pulse shaping of broadband femtosecond-duration laser pulses are proposed. The genetic algorithm searching the evolutionary landscape for desired pulse shapes operates on a population of waveforms (genes), each made from two concatenated vectors specifying phases and magnitudes, respectively, over a range of frequencies. Frequency-domain operators such as mutation, two-point crossover, average crossover, polynomial phase mutation, creep and three-point smoothing, as well as a time-domain crossover, are combined to produce fitter offspring at each iteration step. The algorithm applies roulette-wheel selection, elitism and linear fitness scaling to the gene population. A differential evolution (DE) operator that provides a source of directed mutation, together with new wavelet operators, is proposed. Using properly tuned parameters for DE, the meta-algorithm is used to solve a waveform matching problem. Tuning allows either a greedy directed search near the best known solution or a robust search across the entire parameter space.
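A sketch of the gene representation and two of the crossover operators named above; the frequency-grid size and the value ranges are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_FREQ = 64  # number of frequency bins (an assumed grid size)

def random_gene():
    """A gene is two concatenated vectors over the frequency grid:
    phases in [-pi, pi] followed by magnitudes in [0, 1]."""
    phases = rng.uniform(-np.pi, np.pi, N_FREQ)
    magnitudes = rng.uniform(0.0, 1.0, N_FREQ)
    return np.concatenate([phases, magnitudes])

def two_point_crossover(a, b):
    """Swap the segment between two random cut points."""
    i, j = sorted(rng.integers(0, len(a), size=2))
    child = a.copy()
    child[i:j] = b[i:j]
    return child

def average_crossover(a, b):
    """Blend the parents component-wise."""
    return (a + b) / 2.0

parent1, parent2 = random_gene(), random_gene()
child = two_point_crossover(parent1, parent2)
```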
Abstract:
This paper analyses and studies a pervasive computing system for tracking people in a mining environment based on RFID (radio frequency identification) technology. We first explain the RFID fundamentals and the LANDMARC (location identification based on dynamic active RFID calibration) algorithm. We then present the proposed algorithm, which combines LANDMARC with a trilateration technique to collect the coordinates of the people inside the mine. Next, we generalise a pervasive computing system that can be implemented in mining, and finally we present the results and conclusions.
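The LANDMARC step can be sketched as k-nearest-reference-tag weighting in RSSI space with the standard 1/E² weights; the reader count, reference grid and readings below are toy assumptions, and the paper's trilateration refinement is not shown.

```python
import numpy as np

def landmarc_locate(tag_rssi, ref_rssi, ref_pos, k=4):
    """LANDMARC-style location estimate: compare the tracked tag's RSSI
    vector (one reading per reader) against reference tags at known
    positions, then average the k nearest references' coordinates with
    weights proportional to 1 / E^2, where E is the Euclidean distance
    in RSSI space."""
    E = np.linalg.norm(ref_rssi - tag_rssi, axis=1)
    nearest = np.argsort(E)[:k]
    w = 1.0 / (E[nearest] ** 2 + 1e-12)   # guard against division by zero
    w /= w.sum()
    return w @ ref_pos[nearest]

# Toy example: 3 readers, 5 reference tags on a known grid (positions in m).
ref_rssi = np.array([[40, 55, 60], [42, 50, 58], [48, 45, 52],
                     [55, 42, 47], [60, 40, 44]], dtype=float)
ref_pos = np.array([[0, 0], [0, 5], [5, 5], [5, 0], [2.5, 2.5]], dtype=float)
print(landmarc_locate(np.array([45.0, 48.0, 55.0]), ref_rssi, ref_pos))
```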
Abstract:
Global communication requirements and load imbalance of some parallel data mining algorithms are the major obstacles to exploiting the computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication cost in parallel data mining algorithms and, in particular, in the k-means algorithm for cluster analysis. In the straightforward parallel formulation of the k-means algorithm, data and computation loads are uniformly distributed over the processing nodes. This approach has excellent load balancing characteristics that may suggest it could scale up to large and extreme-scale parallel computing systems. However, at each iteration step the algorithm requires a global reduction operation which hinders the scalability of the approach. This work studies a different parallel formulation of the algorithm in which the requirement of global communication is removed, while the same deterministic nature of the centralised algorithm is maintained. The proposed approach exploits a non-uniform data distribution which can either be found in real-world distributed applications or be induced by means of multi-dimensional binary search trees. The approach can also be extended to accommodate an approximation error, which allows a further reduction of the communication costs. The effectiveness of the exact and approximate methods has been tested in a parallel computing system with 64 processors and in simulations with 1024 processing elements.
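Inducing the non-uniform distribution via multi-dimensional binary search tree median splits can be sketched as follows. This is an illustrative sketch, not the paper's implementation: it only produces the spatially coherent blocks, while the communication-pruning logic built on top of them (a node need only exchange information about centroids that can affect its block) is omitted.

```python
import numpy as np

def kd_split(points, levels, depth=0):
    """Median split along alternating axes; after `levels` recursions the
    data is divided into 2**levels spatially coherent blocks, one per
    processing node, yielding the non-uniform distribution exploited
    by the method."""
    if levels == 0:
        return [points]
    axis = depth % points.shape[1]
    order = np.argsort(points[:, axis])
    mid = len(points) // 2
    left, right = points[order[:mid]], points[order[mid:]]
    return (kd_split(left, levels - 1, depth + 1)
            + kd_split(right, levels - 1, depth + 1))

rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 2))
blocks = kd_split(data, levels=3)   # 8 blocks for 8 processing nodes
print([len(b) for b in blocks])
```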
Abstract:
Objective To undertake a process evaluation of pharmacists' recommendations arising in the context of a complex IT-enabled pharmacist-delivered randomised controlled trial (PINCER trial) to reduce the risk of hazardous medicines management in general practices. Methods PINCER pharmacists manually recorded patients' demographics, details of interventions recommended, actions undertaken by practice staff and time taken to manage individual cases of hazardous medicines management. Data were coded and double-entered into SPSS v15, and then summarised using percentages for categorical data (with 95% CI) and, as appropriate, means (SD) or medians (IQR) for continuous data. Key findings Pharmacists spent a median of 20 minutes (IQR 10, 30) reviewing medical records, recommending interventions and completing actions in each case of hazardous medicines management. Pharmacists judged 72% (95% CI 70, 74) (1463/2026) of cases of hazardous medicines management to be clinically relevant. Pharmacists recommended 2105 interventions in 74% (95% CI 73, 76) (1516/2038) of cases, and 1685 actions were taken in 61% (95% CI 59, 63) (1246/2038) of cases; 66% (95% CI 64, 68) (1383/2105) of interventions recommended by pharmacists were completed and 5% (95% CI 4, 6) (104/2105) of recommendations were accepted by general practitioners (GPs) but not completed by the end of the pharmacists' placement; the remaining recommendations were rejected or considered not relevant by GPs. Conclusions The outcome measures were used to target pharmacist activity in general practice towards patients at risk from hazardous medicines management. Recommendations from trained PINCER pharmacists were found to be broadly acceptable to GPs and led to ameliorative action in the majority of cases. It seems likely that the approach used by the PINCER pharmacists could be employed by other practice pharmacists following appropriate training.
Abstract:
For Northern Hemisphere extra-tropical cyclone activity, the dependency of a potential anthropogenic climate change signal on the identification method applied is analysed. This study investigates the impact of the algorithm used on the change signal, not the robustness of the climate change signal itself. Using one single transient AOGCM simulation as standard input for eleven state-of-the-art identification methods, the patterns of the model-simulated present-day climatologies are found to be close to those computed from re-analysis, independent of the method applied. Although differences exist in the total number of cyclones identified, the climate change signals (IPCC SRES A1B) in the model run considered are largely similar between methods for all cyclones. Taking into account all tracks, decreasing numbers are found in the Mediterranean, the Arctic (in the Barents and Greenland Seas), the mid-latitude Pacific and North America. The change patterns are even more similar if only the most severe systems are considered: the methods reveal a coherent, statistically significant increase in frequency over the eastern North Atlantic and North Pacific. We find that the differences between the methods considered are largely due to the different role of weaker systems in the specific methods.
Abstract:
A statistical–dynamical regionalization approach is developed to assess possible changes in wind storm impacts. The method is applied to North Rhine-Westphalia (Western Germany) using the FOOT3DK mesoscale model for dynamical downscaling and ECHAM5/OM1 global circulation model climate projections. The method first classifies typical weather developments within the reanalysis period using a K-means cluster algorithm. Most historical wind storms are associated with four weather developments (primary storm-clusters). Mesoscale simulations are performed for representative elements of all clusters to derive a regional wind climatology. Additionally, 28 historical storms affecting Western Germany are simulated. Empirical functions are estimated to relate wind gust fields to insured losses. Transient ECHAM5/OM1 simulations show an enhanced frequency of primary storm-clusters and storms for 2060–2100 compared to 1960–2000. Accordingly, wind gusts increase over Western Germany, reaching locally +5% for the 98th wind gust percentile (A2 scenario). Consequently, storm losses are expected to increase substantially (+8% for the A1B scenario, +19% for the A2 scenario). Regional patterns show larger changes over the north-eastern parts of North Rhine-Westphalia than for the western parts. For storms with return periods above 20 years, loss expectations for Germany may increase by a factor of 2. These results document the method's ability to assess future changes in loss potentials in regional terms.
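The classification step uses plain K-Means. The following is a generic sketch, where the feature vectors (e.g. flattened pressure fields describing a weather development) and the cluster count are assumptions for illustration, not the study's exact inputs.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain K-Means: X holds one feature vector per weather situation;
    the resulting labels define the weather-development clusters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each sample to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster empties.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

# Toy demo: 200 "weather situations" with 10 features, 4 clusters.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
centroids, labels = kmeans(X, k=4)
print(np.bincount(labels, minlength=4))
```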
Abstract:
Northern Hemisphere cyclone activity is assessed by applying an algorithm for the detection and tracking of synoptic-scale cyclones to mean sea level pressure data. The method, originally developed for the Southern Hemisphere, is adapted for application to the Northern Hemisphere winter season. NCEP reanalysis data from 1958/59 to 1997/98 are used as input. The sensitivities of the results to particular parameters of the algorithm are discussed both for case studies and from a climatological point of view. Results show that the choice of settings is of major relevance, especially for the tracking of smaller-scale and fast-moving systems. With an appropriate setting, the algorithm is capable of automatically tracking different types of cyclones at the same time: both fast-moving and developing systems over the large ocean basins and smaller-scale cyclones over the Mediterranean basin can be assessed. The climatology of cyclone variables, e.g., cyclone track density, cyclone counts, intensification rates, propagation speeds and areas of cyclogenesis and cyclolysis, gives detailed information on typical cyclone life cycles for different regions. Lowering the spatial and temporal resolution of the input data from the full resolution T62/06h to T42/12h decreases the cyclone track density and cyclone counts. Reducing the temporal resolution alone contributes to a decline in the number of fast-moving systems, which is relevant for the cyclone track density. Lowering the spatial resolution alone mainly reduces the number of weak cyclones.
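A greedy nearest-neighbour sketch of the tracking stage, under stated assumptions: planar coordinates in km, and a maximum per-step displacement standing in for a propagation-speed limit. The actual algorithm's detection of pressure minima and its matching rules are more elaborate; this only illustrates why the parameter settings matter for fast-moving systems.

```python
import numpy as np

def track_cyclones(detections, max_step=1000.0):
    """detections: one array-like of cyclone centre positions per time step.
    Each centre extends the track whose last point is nearest, provided the
    jump is below max_step (an assumed speed limit); otherwise it starts a
    new track (cyclogenesis). Each track is extended at most once per step."""
    tracks = [[p] for p in detections[0]]
    for centres in detections[1:]:
        extended = set()
        for p in centres:
            best, best_d = None, max_step
            for i, t in enumerate(tracks):
                if i in extended:
                    continue
                d = np.linalg.norm(p - t[-1])
                if d < best_d:
                    best, best_d = i, d
            if best is not None:
                tracks[best].append(p)
                extended.add(best)
            else:
                tracks.append([p])
                extended.add(len(tracks) - 1)
    return tracks

# Toy demo: a fast-moving system plus a new system appearing at step 3.
steps = [[np.array([0.0, 0.0])], [np.array([300.0, 100.0])],
         [np.array([600.0, 180.0]), np.array([5000.0, 0.0])]]
for t in track_cyclones(steps):
    print([p.tolist() for p in t])
```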
Abstract:
For an increasing number of applications, mesoscale modelling systems now aim to better represent urban areas. The complexity of the processes resolved by urban parametrization schemes varies with the application. The concept of fitness-for-purpose is therefore critical both for the choice of parametrizations and for the way in which the scheme is evaluated. A systematic and objective model response analysis procedure (the Multiobjective Shuffled Complex Evolution Metropolis (MOSCEM) algorithm) is used to assess the fitness of the single-layer urban canopy parametrization implemented in the Weather Research and Forecasting (WRF) model. The scheme is evaluated with regard to its ability to simulate observed surface energy fluxes and its sensitivity to input parameters. Recent amendments are described, focusing on features which improve the scheme's applicability to numerical weather prediction, such as a reduced and physically more meaningful list of input parameters. The study shows a high sensitivity of the scheme to parameters characterizing roof properties, in contrast to a low response to road-related ones. Problems in the partitioning of energy between turbulent sensible and latent heat fluxes are also emphasized. Some initial guidelines to prioritize efforts to obtain urban land-cover class characteristics in WRF are provided.
Abstract:
In this paper a modified algorithm is suggested for developing polynomial neural network (PNN) models. Optimal partial description (PD) modeling is introduced at each layer of the PNN expansion, a task accomplished using the orthogonal least squares (OLS) method. Based on the initial PD models determined by the polynomial order and the number of PD inputs, OLS selects the most significant regressor terms, reducing the output error variance. The method produces PNN models exhibiting a high level of accuracy and superior generalization capabilities. Additionally, parsimonious models are obtained, comprising a considerably smaller number of parameters than those generated by the conventional PNN algorithm. Three benchmark examples are elaborated, including modeling of the gas furnace process as well as the iris and wine classification problems. Extensive simulation results and comparisons with other methods in the literature demonstrate the effectiveness of the suggested modeling approach.
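Forward orthogonal least squares selection, the mechanism used here to prune each layer's partial descriptions, can be sketched as follows. The error-reduction-ratio criterion is the standard OLS one; the matrix sizes and the toy demo are assumptions, and the PNN layer construction around it is not shown.

```python
import numpy as np

def ols_select(P, y, n_terms):
    """Forward OLS: greedily pick the candidate regressor (column of P)
    with the largest error reduction ratio, then orthogonalise the
    remaining candidates against each selection."""
    P = P.astype(float).copy()
    y = np.asarray(y, dtype=float)
    selected = []
    sigma = float(y @ y)
    for _ in range(n_terms):
        # Error reduction ratio of each remaining candidate.
        err = np.zeros(P.shape[1])
        for j in range(P.shape[1]):
            p = P[:, j]
            pp = p @ p
            if pp > 1e-12:
                err[j] = (p @ y) ** 2 / (pp * sigma)
        best = int(np.argmax(err))
        selected.append(best)
        w = P[:, best].copy()
        P[:, best] = 0.0   # remove the chosen column from the candidates
        # Orthogonalise the remaining candidates against the chosen regressor.
        for j in range(P.shape[1]):
            if P[:, j] @ P[:, j] > 1e-12:
                P[:, j] -= ((w @ P[:, j]) / (w @ w)) * w
    return selected

# Toy demo: y depends on columns 0 and 2 of the candidate matrix.
rng = np.random.default_rng(0)
P = rng.normal(size=(200, 5))
y = 3.0 * P[:, 0] - 2.0 * P[:, 2] + 0.1 * rng.normal(size=200)
print(ols_select(P, y, n_terms=2))   # typically selects [0, 2]
```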