786 resultados para Incremental Clustering
Resumo:
With the size and state of the Internet today, a good quality approach to organizing this mass of information is of great importance. Clustering web pages into groups of similar documents is one approach, but relies heavily on good feature extraction and document representation as well as a good clustering approach and algorithm. Due to the changing nature of the Internet, resulting in a dynamic dataset, an incremental approach is preferred. In this work we propose an enhanced incremental clustering approach to develop a better clustering algorithm that can help to better organize the information available on the Internet in an incremental fashion. Experiments show that the enhanced algorithm outperforms the original histogram based algorithm by up to 7.5%.
Resumo:
Emerging high-dimensional data mining applications needs to find interesting clusters embeded in arbitrarily aligned subspaces of lower dimensionality. It is difficult to cluster high-dimensional data objects, when they are sparse and skewed. Updations are quite common in dynamic databases and they are usually processed in batch mode. In very large dynamic databases, it is necessary to perform incremental cluster analysis only to the updations. We present a incremental clustering algorithm for subspace clustering in very high dimensions, which handles both insertion and deletions of datapoints to the backend databases.
Resumo:
Les logiciels sont en constante évolution, nécessitant une maintenance et un développement continus. Ils subissent des changements tout au long de leur vie, que ce soit pendant l'ajout de nouvelles fonctionnalités ou la correction de bogues dans le code. Lorsque ces logiciels évoluent, leurs architectures ont tendance à se dégrader avec le temps et deviennent moins adaptables aux nouvelles spécifications des utilisateurs. Elles deviennent plus complexes et plus difficiles à maintenir. Dans certains cas, les développeurs préfèrent refaire la conception de ces architectures à partir du zéro plutôt que de prolonger la durée de leurs vies, ce qui engendre une augmentation importante des coûts de développement et de maintenance. Par conséquent, les développeurs doivent comprendre les facteurs qui conduisent à la dégradation des architectures, pour prendre des mesures proactives qui facilitent les futurs changements et ralentissent leur dégradation. La dégradation des architectures se produit lorsque des développeurs qui ne comprennent pas la conception originale du logiciel apportent des changements au logiciel. D'une part, faire des changements sans comprendre leurs impacts peut conduire à l'introduction de bogues et à la retraite prématurée du logiciel. D'autre part, les développeurs qui manquent de connaissances et–ou d'expérience dans la résolution d'un problème de conception peuvent introduire des défauts de conception. Ces défauts ont pour conséquence de rendre les logiciels plus difficiles à maintenir et évoluer. Par conséquent, les développeurs ont besoin de mécanismes pour comprendre l'impact d'un changement sur le reste du logiciel et d'outils pour détecter les défauts de conception afin de les corriger. Dans le cadre de cette thèse, nous proposons trois principales contributions. La première contribution concerne l'évaluation de la dégradation des architectures logicielles. Cette évaluation consiste à utiliser une technique d’appariement de diagrammes, tels que les diagrammes de classes, pour identifier les changements structurels entre plusieurs versions d'une architecture logicielle. Cette étape nécessite l'identification des renommages de classes. Par conséquent, la première étape de notre approche consiste à identifier les renommages de classes durant l'évolution de l'architecture logicielle. Ensuite, la deuxième étape consiste à faire l'appariement de plusieurs versions d'une architecture pour identifier ses parties stables et celles qui sont en dégradation. Nous proposons des algorithmes de bit-vecteur et de clustering pour analyser la correspondance entre plusieurs versions d'une architecture. La troisième étape consiste à mesurer la dégradation de l'architecture durant l'évolution du logiciel. Nous proposons un ensemble de m´etriques sur les parties stables du logiciel, pour évaluer cette dégradation. La deuxième contribution est liée à l'analyse de l'impact des changements dans un logiciel. Dans ce contexte, nous présentons une nouvelle métaphore inspirée de la séismologie pour identifier l'impact des changements. Notre approche considère un changement à une classe comme un tremblement de terre qui se propage dans le logiciel à travers une longue chaîne de classes intermédiaires. Notre approche combine l'analyse de dépendances structurelles des classes et l'analyse de leur historique (les relations de co-changement) afin de mesurer l'ampleur de la propagation du changement dans le logiciel, i.e., comment un changement se propage à partir de la classe modifiée è d'autres classes du logiciel. La troisième contribution concerne la détection des défauts de conception. Nous proposons une métaphore inspirée du système immunitaire naturel. Comme toute créature vivante, la conception de systèmes est exposée aux maladies, qui sont des défauts de conception. Les approches de détection sont des mécanismes de défense pour les conception des systèmes. Un système immunitaire naturel peut détecter des pathogènes similaires avec une bonne précision. Cette bonne précision a inspiré une famille d'algorithmes de classification, appelés systèmes immunitaires artificiels (AIS), que nous utilisions pour détecter les défauts de conception. Les différentes contributions ont été évaluées sur des logiciels libres orientés objets et les résultats obtenus nous permettent de formuler les conclusions suivantes: • Les métriques Tunnel Triplets Metric (TTM) et Common Triplets Metric (CTM), fournissent aux développeurs de bons indices sur la dégradation de l'architecture. La d´ecroissance de TTM indique que la conception originale de l'architecture s’est dégradée. La stabilité de TTM indique la stabilité de la conception originale, ce qui signifie que le système est adapté aux nouvelles spécifications des utilisateurs. • La séismologie est une métaphore intéressante pour l'analyse de l'impact des changements. En effet, les changements se propagent dans les systèmes comme les tremblements de terre. L'impact d'un changement est plus important autour de la classe qui change et diminue progressivement avec la distance à cette classe. Notre approche aide les développeurs à identifier l'impact d'un changement. • Le système immunitaire est une métaphore intéressante pour la détection des défauts de conception. Les résultats des expériences ont montré que la précision et le rappel de notre approche sont comparables ou supérieurs à ceux des approches existantes.
Resumo:
In this paper we present a study of the computational cost of the GNG3D algorithm for mesh optimization. This algorithm has been implemented taking as a basis a new method which is based on neural networks and consists on two differentiated phases: an optimization phase and a reconstruction phase. The optimization phase is developed applying an optimization algorithm based on the Growing Neural Gas model, which constitutes an unsupervised incremental clustering algorithm. The primary goal of this phase is to obtain a simplified set of vertices representing the best approximation of the original 3D object. In the reconstruction phase we use the information provided by the optimization algorithm to reconstruct the faces thus obtaining the optimized mesh. The computational cost of both phases is calculated, showing some examples.
Resumo:
In this paper, we present ICICLE (Image ChainNet and Incremental Clustering Engine), a prototype system that we have developed to efficiently and effectively retrieve WWW images based on image semantics. ICICLE has two distinguishing features. First, it employs a novel image representation model called Weight ChainNet to capture the semantics of the image content. A new formula, called list space model, for computing semantic similarities is also introduced. Second, to speed up retrieval, ICICLE employs an incremental clustering mechanism, ICC (Incremental Clustering on ChainNet), to cluster images with similar semantics into the same partition. Each cluster has a summary representative and all clusters' representatives are further summarized into a balanced and full binary tree structure. We conducted an extensive performance study to evaluate ICICLE. Compared with some recently proposed methods, our results show that ICICLE provides better recall and precision. Our clustering technique ICC facilitates speedy retrieval of images without sacrificing recall and precision significantly.
Resumo:
With the growing number of XML documents on theWeb it becomes essential to effectively organise these XML documents in order to retrieve useful information from them. A possible solution is to apply clustering on the XML documents to discover knowledge that promotes effective data management, information retrieval and query processing. However, many issues arise in discovering knowledge from these types of semi-structured documents due to their heterogeneity and structural irregularity. Most of the existing research on clustering techniques focuses only on one feature of the XML documents, this being either their structure or their content due to scalability and complexity problems. The knowledge gained in the form of clusters based on the structure or the content is not suitable for reallife datasets. It therefore becomes essential to include both the structure and content of XML documents in order to improve the accuracy and meaning of the clustering solution. However, the inclusion of both these kinds of information in the clustering process results in a huge overhead for the underlying clustering algorithm because of the high dimensionality of the data. The overall objective of this thesis is to address these issues by: (1) proposing methods to utilise frequent pattern mining techniques to reduce the dimension; (2) developing models to effectively combine the structure and content of XML documents; and (3) utilising the proposed models in clustering. This research first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. A clustering framework with two types of models, implicit and explicit, is developed. The implicit model uses a Vector Space Model (VSM) to combine the structure and the content information. The explicit model uses a higher order model, namely a 3- order Tensor Space Model (TSM), to explicitly combine the structure and the content information. This thesis also proposes a novel incremental technique to decompose largesized tensor models to utilise the decomposed solution for clustering the XML documents. The proposed framework and its components were extensively evaluated on several real-life datasets exhibiting extreme characteristics to understand the usefulness of the proposed framework in real-life situations. Additionally, this research evaluates the outcome of the clustering process on the collection selection problem in the information retrieval on the Wikipedia dataset. The experimental results demonstrate that the proposed frequent pattern mining and clustering methods outperform the related state-of-the-art approaches. In particular, the proposed framework of utilising frequent structures for constraining the content shows an improvement in accuracy over content-only and structure-only clustering results. The scalability evaluation experiments conducted on large scaled datasets clearly show the strengths of the proposed methods over state-of-the-art methods. In particular, this thesis work contributes to effectively combining the structure and the content of XML documents for clustering, in order to improve the accuracy of the clustering solution. In addition, it also contributes by addressing the research gaps in frequent pattern mining to generate efficient and concise frequent subtrees with various node relationships that could be used in clustering.
Resumo:
With the growing size and variety of social media files on the web, it’s becoming critical to efficiently organize them into clusters for further processing. This paper presents a novel scalable constrained document clustering method that harnesses the power of search engines capable of dealing with large text data. Instead of calculating distance between the documents and all of the clusters’ centroids, a neighborhood of best cluster candidates is chosen using a document ranking scheme. To make the method faster and less memory dependable, the in-memory and in-database processing are combined in a semi-incremental manner. This method has been extensively tested in the social event detection application. Empirical analysis shows that the proposed method is efficient both in computation and memory usage while producing notable accuracy.
Resumo:
For many applications, it is necessary to produce speech transcriptions in a causal fashion. To produce high quality transcripts, speaker adaptation is often used. This requires online speaker clustering and incremental adaptation techniques to be developed. This paper presents an integrated approach to online speaker clustering and adaptation which allows efficient clustering of speakers using the same accumulated statistics that are normally used for adaptation. Using a consistent criterion for both clustering and adaptation should yield gains for both stages. The proposed approach is evaluated on a meetings transcription task using audio from multiple distant microphones. Consistent gains over standard clustering and adaptation were obtained. Copyright © 2011 ISCA.
Resumo:
The present work presents a new method for activity extraction and reporting from video based on the aggregation of fuzzy relations. Trajectory clustering is first employed mainly to discover the points of entry and exit of mobiles appearing in the scene. In a second step, proximity relations between resulting clusters of detected mobiles and contextual elements from the scene are modeled employing fuzzy relations. These can then be aggregated employing typical soft-computing algebra. A clustering algorithm based on the transitive closure calculation of the fuzzy relations allows building the structure of the scene and characterises the ongoing different activities of the scene. Discovered activity zones can be reported as activity maps with different granularities thanks to the analysis of the transitive closure matrix. Taking advantage of the soft relation properties, activity zones and related activities can be labeled in a more human-like language. We present results obtained on real videos corresponding to apron monitoring in the Toulouse airport in France.
Resumo:
In this paper, we focus on the design of bivariate EDAs for discrete optimization problems and propose a new approach named HSMIEC. While the current EDAs require much time in the statistical learning process as the relationships among the variables are too complicated, we employ the Selfish gene theory (SG) in this approach, as well as a Mutual Information and Entropy based Cluster (MIEC) model is also set to optimize the probability distribution of the virtual population. This model uses a hybrid sampling method by considering both the clustering accuracy and clustering diversity and an incremental learning and resample scheme is also set to optimize the parameters of the correlations of the variables. Compared with several benchmark problems, our experimental results demonstrate that HSMIEC often performs better than some other EDAs, such as BMDA, COMIT, MIMIC and ECGA. © 2009 Elsevier B.V. All rights reserved.
Clustering of Protein Structures Using Hydrophobic Free Energy And Solvent Accessibility of Proteins