162 results for Agglomerative Hierarchical Clustering
Abstract:
Empirical evidence shows that repositories of business process models used in industrial practice contain significant amounts of duplication. This duplication arises, for example, when the repository covers multiple variants of the same processes or due to copy-pasting. Previous work has addressed the problem of efficiently retrieving exact clones that can be refactored into shared subprocess models. This article studies the broader problem of approximate clone detection in process models. The article proposes techniques for detecting clusters of approximate clones based on two well-known clustering algorithms: DBSCAN and Hierarchical Agglomerative Clustering (HAC). The article also defines a measure of standardizability of an approximate clone cluster, meaning the potential benefit of replacing the approximate clones with a single standardized subprocess. Experiments show that both techniques, in conjunction with the proposed standardizability measure, accurately retrieve clusters of approximate clones that originate from copy-pasting followed by independent modifications to the copied fragments. Additional experiments show that both techniques produce clusters that match those produced by human subjects and that are perceived to be standardizable.
Abstract:
The problem of clustering a large document collection is challenged not only by the number of documents and the number of dimensions, but also by the number and sizes of the clusters. Traditional clustering methods fail to scale when they need to generate a large number of clusters. Furthermore, when the cluster sizes in the solution are heterogeneous, i.e. some of the clusters are large, the similarity measures tend to degrade. A ranking-based clustering method is proposed to deal with these issues in the context of the Social Event Detection task. Ranking scores are used to select a small number of the most relevant clusters against which a document is compared and placed. Additionally, instead of conventional cluster centroids, cluster patches, which are hub-like sets of documents, are proposed to represent clusters. Text, temporal, spatial and visual content information collected from the social event images is utilized in calculating similarity. Results show that these strategies allow a balance between the performance and the accuracy of the clustering solution.
Abstract:
Recently, attempts to improve decision making in species management have focussed on uncertainties associated with modelling temporal fluctuations in populations. Reducing model uncertainty is challenging; while larger samples improve estimation of species trajectories and reduce statistical errors, they typically amplify variability in observed trajectories. In particular, traditional modelling approaches aimed at estimating population trajectories usually do not account well for nonlinearities and uncertainties associated with multi-scale observations characteristic of large spatio-temporal surveys. We present a Bayesian semi-parametric hierarchical model for simultaneously quantifying uncertainties associated with model structure and parameters, and scale-specific variability over time. We estimate uncertainty across a four-tiered spatial hierarchy of coral cover from the Great Barrier Reef. Coral variability is well described; however, our results show that, in the absence of additional model specifications, conclusions regarding coral trajectories become highly uncertain when considering multiple reefs, suggesting that management should focus more at the scale of individual reefs. The approach presented facilitates the description and estimation of population trajectories and associated uncertainties when variability cannot be attributed to specific causes and origins. We argue that our model can unlock value contained in large-scale datasets, provide guidance for understanding sources of uncertainty, and support better informed decision making.
Abstract:
This project is a step forward in the study of text mining where enhanced text representation with semantic information plays a significant role. It develops effective methods of entity-oriented retrieval, semantic relation identification and text clustering utilizing semantically annotated data. These methods are based on an enriched text representation generated by introducing semantic information extracted from Wikipedia into the input text data. The proposed methods are evaluated against several state-of-the-art benchmarking methods on real-life datasets. In particular, this thesis improves the performance of entity-oriented retrieval, identifies different lexical forms for an entity relation and handles the clustering of documents with multiple feature spaces.
Abstract:
High-Order Co-Clustering (HOCC) methods have attracted considerable attention in recent years because of their ability to cluster multiple types of objects simultaneously using all available information. During the clustering process, HOCC methods exploit object co-occurrence information, i.e., inter-type relationships amongst different types of objects, as well as object affinity information, i.e., intra-type relationships amongst objects of the same type. However, it is difficult to learn accurate intra-type relationships in the presence of noise and outliers. Existing HOCC methods consider the p nearest neighbours based on Euclidean distance for the intra-type relationships, which leads to incomplete and inaccurate intra-type relationships. In this paper, we propose a novel HOCC method that incorporates multiple subspace learning with a heterogeneous manifold ensemble to learn complete and accurate intra-type relationships. Multiple subspace learning reconstructs the similarity between any pair of objects that belong to the same subspace. The heterogeneous manifold ensemble is created from two types of intra-type relationships, learnt using a p-nearest-neighbour graph and multiple subspace learning. Moreover, to ensure the robustness of the clustering process, we introduce a sparse error matrix into the matrix decomposition and develop a novel iterative algorithm. Empirical experiments show that the proposed method achieves improved results over state-of-the-art HOCC methods in terms of FScore and NMI.
Abstract:
Clustering is an important technique in organising and categorising web-scale document collections. The main challenges faced in clustering the billions of documents available on the web are the processing power required and the sheer size of the datasets available. More importantly, it is nigh impossible to generate the labels for a general web document collection containing billions of documents and a vast taxonomy of topics. However, document clusters are most commonly evaluated by comparison to a ground truth set of labels for documents. This paper presents a clustering and labeling solution in which Wikipedia is clustered and hundreds of millions of web documents in ClueWeb12 are mapped onto those clusters. This solution is based on the assumption that Wikipedia contains such a wide range of diverse topics that it represents a small-scale web. We found that it was possible to perform the web-scale document clustering and labeling process on one desktop computer in under a couple of days for the Wikipedia clustering solution containing about 1000 clusters. It takes longer to execute a solution with finer-granularity clusters, such as 10,000 or 50,000. These results were evaluated using a set of external data.
Abstract:
Long-term measurements of particle number size distribution (PNSD) produce a very large number of observations, and their analysis requires an efficient approach in order to produce results in the least possible time and with maximum accuracy. Clustering techniques are a family of sophisticated methods which have recently been employed to analyse PNSD data; however, very little information is available comparing the performance of different clustering techniques on PNSD data. This study aims to apply several clustering techniques (i.e. K-means, PAM, CLARA and SOM) to PNSD data, in order to identify and apply the optimum technique to PNSD data measured at 25 sites across Brisbane, Australia. A new method, based on the Generalised Additive Model (GAM) with a basis of penalised B-splines, was proposed to parameterise the PNSD data, and the temporal weight of each cluster was also estimated using the GAM. In addition, each cluster was associated with its possible source based on the results of this parameterisation, together with the characteristics of each cluster. The performance of the four clustering techniques was compared using the Dunn index and Silhouette width validation values, and the K-means technique was found to have the highest performance, with five clusters being the optimum. Therefore, five clusters were found within the data using the K-means technique. The diurnal occurrence of each cluster was used together with other air quality parameters, temporal trends and the physical properties of each cluster, in order to attribute each cluster to its source and origin. The five clusters were attributed to three major sources and origins, including regional background particles, photochemically induced nucleated particles and vehicle generated particles. Overall, clustering was found to be an effective technique for attributing each particle size spectrum to its source, and the GAM was suitable for parameterising the PNSD data. These two techniques can help researchers immensely in analysing PNSD data for characterisation and source apportionment purposes.
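The Dunn index mentioned above as a validation value can be illustrated with a minimal sketch. It assumes the common definition (smallest distance between points of different clusters, divided by the largest within-cluster diameter) and Euclidean distance; the function name `dunn_index` and the toy points are hypothetical and are not taken from the study's PNSD data.

```python
from itertools import combinations
from math import dist


def dunn_index(clusters):
    """Smallest between-cluster distance divided by the largest cluster diameter.

    `clusters` is a list of clusters, each a list of coordinate tuples.
    At least one cluster must contain two or more points.
    """
    # Largest distance between two points that share a cluster.
    max_diameter = max(
        dist(p, q) for c in clusters for p, q in combinations(c, 2)
    )
    # Smallest distance between two points in different clusters.
    min_separation = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    return min_separation / max_diameter


# Toy data (hypothetical): the same four points grouped in two different ways.
compact = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
stretched = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 1.0), (10.0, 1.0)]]

print(dunn_index(compact))    # tight, well-separated clusters score higher
print(dunn_index(stretched))  # wide, overlapping clusters score lower
```

A higher value indicates compact, well-separated clusters, which is why the index can be used to compare partitions produced by different techniques or different numbers of clusters.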
Abstract:
A Bitcoin wallet is a set of private keys known to a user that allow that user to spend any Bitcoin associated with those keys. In a hierarchical deterministic (HD) wallet, child private keys are generated pseudorandomly from a master private key, and the corresponding child public keys can be generated by anyone with knowledge of the master public key. These wallets have several interesting applications including Internet retail, trustless audit, and a treasurer allocating funds among departments. A specification of HD wallets has even been accepted as Bitcoin standard BIP32. Unfortunately, in all existing HD wallets, including BIP32 wallets, an attacker can easily recover the master private key given the master public key and any child private key. This vulnerability precludes use cases such as a combined treasurer-auditor, and some in the Bitcoin community have suspected that this vulnerability cannot be avoided. We propose a new HD wallet that is not subject to this vulnerability. Our HD wallet can tolerate the leakage of up to m private keys with a master public key size of O(m). We prove that breaking our HD wallet is at least as hard as the so-called "one more" discrete logarithm problem.
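The key-recovery weakness described above can be sketched with a toy model. This is an illustration under stated assumptions, not BIP32 itself and not the construction proposed in the paper: it replaces the secp256k1 curve with a small multiplicative group modulo a prime, treats the chain code as part of the master public key, and uses hypothetical function names throughout. What it keeps is the additive structure of non-hardened derivation, where a child private key is the master private key plus a publicly computable offset.

```python
import hashlib
import hmac

# Toy group parameters (assumption: stand-ins for the elliptic curve used by Bitcoin).
P = 2**127 - 1   # prime modulus
G = 3            # base element
ORDER = P - 1    # exponents can be reduced modulo P - 1 (Fermat's little theorem)


def public_key(private_key):
    return pow(G, private_key, P)


def offset(master_pub, chain_code, index):
    """Tweak derived only from public data: master public key, chain code, index."""
    msg = master_pub.to_bytes(16, "big") + index.to_bytes(4, "big")
    digest = hmac.new(chain_code, msg, hashlib.sha512).digest()
    return int.from_bytes(digest[:32], "big") % ORDER


def child_private(master_priv, chain_code, index):
    # Child private key = master private key + public offset.
    return (master_priv + offset(public_key(master_priv), chain_code, index)) % ORDER


def child_public(master_pub, chain_code, index):
    # Anyone holding the master public key can derive child public keys.
    return (master_pub * pow(G, offset(master_pub, chain_code, index), P)) % P


def recover_master(master_pub, chain_code, index, leaked_child_priv):
    # The attack: subtract the public offset from one leaked child private key.
    return (leaked_child_priv - offset(master_pub, chain_code, index)) % ORDER


master_priv = 0x1234567890ABCDEF1234567890ABCDEF % ORDER   # hypothetical secret
chain_code = b"toy chain code"                             # hypothetical chain code
master_pub = public_key(master_priv)

leaked = child_private(master_priv, chain_code, index=7)
print(recover_master(master_pub, chain_code, 7, leaked) == master_priv)
```

Because the offset depends only on public values, a single leaked child private key is enough to undo the addition, which is the property the proposed wallet is designed to avoid.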
Abstract:
Existing techniques for automated discovery of process models from event logs generally produce flat process models. Thus, they fail to exploit the notion of subprocess as well as error handling and repetition constructs provided by contemporary process modeling notations, such as the Business Process Model and Notation (BPMN). This paper presents a technique for automated discovery of hierarchical BPMN models containing interrupting and non-interrupting boundary events and activity markers. The technique employs functional and inclusion dependency discovery techniques in order to elicit a process-subprocess hierarchy from the event log. Given this hierarchy and the projected logs associated with each node in the hierarchy, parent process and subprocess models are then discovered using existing techniques for flat process model discovery. Finally, the resulting models and logs are heuristically analyzed in order to identify boundary events and markers. By employing approximate dependency discovery techniques, it is possible to filter out noise in the event log arising, for example, from data entry errors or missing events. A validation with one synthetic and two real-life logs shows that process models derived by the proposed technique are more accurate and less complex than those derived with flat process discovery techniques. Meanwhile, a validation on a family of synthetically generated logs shows that the technique is resilient to varying levels of noise.
Abstract:
Purpose: This paper aims to set out a new hierarchical and differentiated model of social marketing principles, concepts and techniques that builds on, but supersedes, the existing lists of non-equivalent and undifferentiated benchmark criteria. Design/methodology/approach: This is a conceptual paper that proposes a hierarchical model of social marketing principles, concepts and techniques. Findings: This new delineation of the social marketing principle, its four core concepts and five techniques, represents a new way to conceptualize and recognize the different elements that constitute social marketing. This new model will help add to and further the development of the theoretical basis of social marketing, building on the definitional work led by the International Social Marketing Association (iSMA), Australian Association of Social Marketing (AASM) and European Social Marketing Association (ESMA). Research limitations/implications: This proposed model offers a foundation for future research to expand upon. Further research is recommended to empirically test the proposed model. Originality/value: This paper seeks to advance the theoretical base of social marketing by making a reasoned case for the need to differentiate between principles, concepts and techniques when seeking to describe social marketing.
Abstract:
The proliferation of the web presents an unsolved problem of automatically analyzing billions of pages of natural language. We introduce a scalable algorithm that clusters hundreds of millions of web pages into hundreds of thousands of clusters. It does this on a single mid-range machine using efficient algorithms and compressed document representations. It is applied to two web-scale crawls covering tens of terabytes. ClueWeb09 and ClueWeb12 contain 500 and 733 million web pages and were clustered into 500,000 to 700,000 clusters. To the best of our knowledge, such fine-grained clustering has not been previously demonstrated. Previous approaches clustered a sample, which limits the maximum number of discoverable clusters. The proposed EM-tree algorithm uses the entire collection in clustering and produces several orders of magnitude more clusters than the existing algorithms. Fine-grained clustering is necessary for meaningful clustering in massive collections where the number of distinct topics grows linearly with collection size. These fine-grained clusters show improved cluster quality when assessed with two novel evaluations using ad hoc search relevance judgments and spam classifications for external validation. These evaluations solve the problem of assessing the quality of clusters where categorical labeling is unavailable and infeasible.
Abstract:
We propose a novel technique for conducting robust voice activity detection (VAD) in high-noise recordings. We use Gaussian mixture modeling (GMM) to train two generic models: speech and non-speech. We then score smaller segments of a given (unseen) recording against each of these GMMs to obtain two respective likelihood scores for each segment. These scores are used to compute a dissimilarity measure between pairs of segments and to carry out complete-linkage clustering of the segments into speech and non-speech clusters. We compare the accuracy of our method against state-of-the-art and standardised VAD techniques to demonstrate an absolute improvement of 15% in half-total error rate (HTER) over the best-performing baseline system across the QUT-NOISE-TIMIT database. We then apply our approach to the Audio-Visual Database of American English (AVDBAE) to demonstrate the performance of our algorithm in using visual, audio-visual or a proposed fusion of these features.
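The complete-linkage clustering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each segment is represented by a pair of scores (one per GMM) and uses plain Euclidean distance between score pairs as the dissimilarity, whereas the abstract does not specify the measure. The function name and the score values are hypothetical.

```python
from itertools import combinations
from math import dist


def complete_linkage(points, n_clusters):
    """Agglomerative clustering that merges clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest complete-linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # a < b, so index a is unaffected by the deletion
    return [sorted(c) for c in clusters]


# Hypothetical per-segment scores: (speech log-likelihood, non-speech log-likelihood).
segment_scores = [
    (-10.0, -30.0),
    (-12.0, -28.0),
    (-11.0, -31.0),
    (-29.0, -9.0),
    (-31.0, -12.0),
]

clusters = complete_linkage(segment_scores, n_clusters=2)
print(clusters)  # segment indices grouped into two clusters
```

Stopping at two clusters mirrors the speech versus non-speech split; deciding which cluster is speech would still require comparing the scores within each cluster.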