773 results for Traditional clustering
Abstract:
With the growing size and variety of social media files on the web, it’s becoming critical to efficiently organize them into clusters for further processing. This paper presents a novel scalable constrained document clustering method that harnesses the power of search engines capable of dealing with large text data. Instead of calculating the distance between each document and all of the clusters’ centroids, a neighborhood of best cluster candidates is chosen using a document ranking scheme. To make the method faster and less memory dependent, in-memory and in-database processing are combined in a semi-incremental manner. This method has been extensively tested in the social event detection application. Empirical analysis shows that the proposed method is efficient both in computation and memory usage while producing notable accuracy.
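The candidate-neighborhood idea in this abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: a toy inverted index stands in for the search engine, counting shared terms stands in for the document ranking scheme, and the names `build_index`, `candidate_clusters` and `assign`, the cosine similarity, and the sample centroids are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(centroids):
    """Map each term to the set of cluster ids whose centroid contains it."""
    index = defaultdict(set)
    for cid, vec in centroids.items():
        for term in vec:
            index[term].add(cid)
    return index


def candidate_clusters(doc, index, k):
    """Rank clusters by the number of terms shared with doc; keep the top k."""
    hits = Counter()
    for term in doc:
        for cid in index.get(term, ()):
            hits[cid] += 1
    return [cid for cid, _ in hits.most_common(k)]


def assign(doc, centroids, index, k=2):
    """Compare the document only with its k candidate clusters."""
    candidates = candidate_clusters(doc, index, k)
    if not candidates:
        return None  # no shared terms: the caller could open a new cluster
    return max(candidates, key=lambda cid: cosine(doc, centroids[cid]))


# Hypothetical centroids and document, for illustration only.
centroids = {
    "concert": Counter({"music": 3, "stage": 2, "band": 2}),
    "football": Counter({"match": 3, "goal": 2, "stadium": 2}),
    "protest": Counter({"march": 3, "street": 2, "police": 1}),
}
index = build_index(centroids)
doc = Counter("band on stage playing music".split())
label = assign(doc, centroids, index)
```

Here the similarity is computed only for clusters returned by the index lookup, so the cost per document depends on the number of candidates rather than on the total number of clusters.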
Abstract:
Traditional nearest points methods use all the samples in an image set to construct a single convex or affine hull model for classification. However, strong artificial features and noisy data may be generated from combinations of training samples when significant intra-class variations and/or noise occur in the image set. Existing multi-model approaches extract local models by clustering each image set individually only once, with fixed clusters used for matching with various image sets. This may not be optimal for discrimination, as undesirable environmental conditions (e.g. illumination and pose variations) may result in the two closest clusters representing different characteristics of an object (e.g. a frontal face being compared to a non-frontal face). To address the above problem, we propose a novel approach to enhance nearest points based methods by integrating affine/convex hull classification with an adapted multi-model approach. We first extract multiple local convex hulls from a query image set via maximum margin clustering to diminish the artificial variations and constrain the noise in local convex hulls. We then propose adaptive reference clustering (ARC) to constrain the clustering of each gallery image set by forcing the clusters to resemble the clusters in the query image set. By applying ARC, noisy clusters in the query set can be discarded. Experiments on Honda, MoBo and ETH-80 datasets show that the proposed method outperforms single model approaches and other recent techniques, such as Sparse Approximated Nearest Points, Mutual Subspace Method and Manifold Discriminant Analysis.
Abstract:
Cougars, Grannies, Evil Stepmothers, and Menopausal Hot Flashers: Roles, Representations of Age and the Non-traditional Romance Heroine is an examination of the stereotyped roles of age and the under-representation of women over forty as worthy protagonists in romance fiction.
Abstract:
The claim that restorative justice emerged in response to the failings of the traditional criminal justice system is frequently made and rarely challenged in the restorative justice literature. It is stated unproblematically, as though it is an unassailable fact rather than a powerful truth claim, thereby positioning restorative justice as a natural, progressive and superior model of justice in comparison with the traditional criminal justice system. This truth claim therefore bestows restorative justice with a legitimacy that is difficult to challenge or refute. Drawing on a Foucaultian genealogy of restorative justice, this article seeks to destabilise the truth claim that restorative justice emerged in response to the failings of the criminal justice system. While the shortcomings of the traditional criminal justice system may provide a backdrop to the emergence of restorative justice, this article argues that this backdrop makes restorative justice a possibility rather than an inevitability.
Abstract:
This thesis presents new methods for classification and thematic grouping of billions of web pages, at scales previously not achievable. This process is also known as document clustering, where similar documents are automatically associated with clusters that represent various distinct topics. These automatically discovered topics are in turn used to improve search engine performance by only searching the topics that are deemed relevant to particular user queries.
Abstract:
We present a novel method for improving hierarchical speaker clustering in the tasks of speaker diarization and speaker linking. In hierarchical clustering, a tree can be formed that demonstrates various levels of clustering. We propose a ratio that expresses the impact of each cluster on the formation of this tree and use this to rescale cluster scores. This provides score normalisation based on the impact of each cluster. We use a state-of-the-art speaker diarization and linking system across the SAIVT-BNEWS corpus to show that our proposed impact ratio can provide a relative improvement of 16% in diarization error rate (DER).
Abstract:
A description of a patient's injuries is recorded in narrative text form by hospital emergency departments. For statistical reporting, this text data needs to be mapped to pre-defined codes. Existing research in this field uses the Naïve Bayes probabilistic method to build classifiers for mapping. In this paper, we focus on providing guidance on the selection of a classification method. We build a number of classifiers belonging to different classification families such as decision tree, probabilistic, neural networks, and instance-based, ensemble-based and kernel-based linear classifiers. Extensive pre-processing is carried out to ensure the quality of the data and, hence, the quality of the classification outcome. Records with a null entry in the injury description are removed. Misspellings are corrected by finding each misspelt word and replacing it with a similar-sounding word. Meaningful phrases have been identified and kept, instead of removing part of a phrase as a stop word. Abbreviations that appear in many forms are manually identified and a single form of each is used. Clustering is utilised to discriminate between non-frequent and frequent terms. This process reduced the number of text features dramatically from about 28,000 to 5000. The medical narrative text injury dataset under consideration is composed of many short documents. The data can be characterized as high-dimensional and sparse, i.e., few features are irrelevant but features are correlated with one another. Therefore, matrix factorization techniques such as Singular Value Decomposition (SVD) and Non Negative Matrix Factorization (NNMF) have been used to map the processed feature space to a lower-dimensional feature space. Classifiers have been built on these reduced feature spaces. In experiments, a set of tests is conducted to determine which classification method is best for medical text classification.
The Non Negative Matrix Factorization with Support Vector Machine method can achieve 93% precision, which is higher than all the tested traditional classifiers. We also found that TF/IDF weighting, which works well for long text classification, is inferior to binary weighting in short document classification. Another finding is that the Top-n terms should be removed in consultation with medical experts, as it affects the classification performance.
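To make the two term-weighting schemes compared in this abstract concrete, here is a small sketch in plain Python. It assumes one common TF/IDF variant (raw term frequency times log(N / document frequency)); the abstract does not say which variant was used, and the three injury records are invented purely for illustration.

```python
import math
from collections import Counter


def binary_weights(docs):
    """Binary weighting: 1.0 for every term present in a document."""
    return [{t: 1.0 for t in set(d)} for d in docs]


def tfidf_weights(docs):
    """Assumed variant: raw term frequency * log(N / document frequency)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(d).items()}
        for d in docs
    ]


# Hypothetical short records, already tokenised.
docs = [
    "fell off ladder".split(),
    "fell from bike".split(),
    "burn from hot water".split(),
]
binary = binary_weights(docs)
tfidf = tfidf_weights(docs)
```

In these toy records every term occurs once per record, so the TF/IDF weights differ from the binary ones only through the IDF factor: `tfidf[0]["ladder"]` is log(3), while `tfidf[0]["fell"]` is the smaller log(3/2) because "fell" appears in two records.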
Abstract:
This thesis is a morphological study of the settlement patterns of the diverse hill groups in Chittagong Hill Tracts – a mountainous borderland of Bangladesh in South Asia. It examines the settlement morphology of a hill town, using a combination of both quantitative and qualitative methods, and explains the recurrent neighbourhood types of the highland groups in relation to their urbanisation. The research findings related to the settlements of diverse cultural groups in a cross-border region of the Asian uplands are also relevant to similar contexts and enquiries. Furthermore, the developed methodological framework that facilitated the data collection process in CHT's culturally diverse regions is also applicable to the investigation of geographic areas with similar socio-cultural complexities. Finally, this research specifically contributes to the literature of cross-cultural studies of highland towns and vernacular settlements in the Asian context.
Abstract:
Understanding the natural variability of the Earth's climate system and accurately identifying potential anthropogenic influences requires long term, geographically distributed records of key climate indicators, such as temperature and precipitation, that extend prior to the last 400 years of the Holocene. Reef corals provide an excellent source of high resolution climate records, and importantly represent the tropical marine environment where palaeoclimate data are urgently required. Recent decades have seen significant improvement in our understanding of coral biomineralisation, the associated uptake of geochemical proxies and methods of identifying and understanding the effects of both early and late post-depositional diagenetic alteration. These processes all have significant implications for interpreting geochemical proxies relevant to palaeoclimatic reconstructions. This paper reviews the current 'state of the art' in terms of coral based palaeoclimate reconstructions and highlights a key remaining problem. The majority of coral based palaeoclimate research has been derived from massive colonies of Porites. However, massive Porites are not globally abundant and may not provide material of a particular age of interest in those regions where they are present. Therefore, there is great potential for alternate coral genera to act as complementary climate archives. While it remains critical to consider five key factors - vital effects, differential growth morphologies, geochemical heterogeneity in the skeletal ultrastructure, transfer equation selection and diagenetic screening of skeletal material - in order to allow the highest level of accuracy in coral palaeoclimate reconstructions, it is also important to develop alternate taxa for palaeoclimate studies in regions where Porites colonies are absent or rare.
Currently as many as nine genera other than Porites have proven at least limited utility in palaeothermometry, most of which are found in the Atlantic/Caribbean region where massive Porites do not exist. Even branching taxa such as Acropora have significant potential to preserve environmental archives. Increasing this capability will greatly expand the number of potential geochemical archives available for longer term temporal records of palaeoclimate.
Abstract:
This project is a step forward in the study of text mining where enhanced text representation with semantic information plays a significant role. It develops effective methods of entity-oriented retrieval, semantic relation identification and text clustering utilizing semantically annotated data. These methods are based on enriched text representation generated by introducing semantic information extracted from Wikipedia into the input text data. The proposed methods are evaluated against several state-of-the-art benchmarking methods on real-life datasets. In particular, this thesis improves the performance of entity-oriented retrieval, identifies different lexical forms for an entity relation and handles clustering documents with multiple feature spaces.
Abstract:
High-Order Co-Clustering (HOCC) methods have attracted considerable attention in recent years because of their ability to cluster multiple types of objects simultaneously using all available information. During the clustering process, HOCC methods exploit object co-occurrence information, i.e., inter-type relationships amongst different types of objects, as well as object affinity information, i.e., intra-type relationships amongst the same types of objects. However, it is difficult to learn accurate intra-type relationships in the presence of noise and outliers. Existing HOCC methods consider the p nearest neighbours based on Euclidean distance for the intra-type relationships, which leads to incomplete and inaccurate intra-type relationships. In this paper, we propose a novel HOCC method that incorporates multiple subspace learning with a heterogeneous manifold ensemble to learn complete and accurate intra-type relationships. Multiple subspace learning reconstructs the similarity between any pair of objects that belong to the same subspace. The heterogeneous manifold ensemble is created based on two types of intra-type relationships learnt using a p-nearest-neighbour graph and multiple subspace learning. Moreover, to ensure the robustness of the clustering process, we introduce a sparse error matrix into the matrix decomposition and develop a novel iterative algorithm. Empirical experiments show that the proposed method achieves improved results over the state-of-the-art HOCC methods for FScore and NMI.
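As a rough illustration of the p-nearest-neighbour graph that this abstract attributes to existing HOCC methods, the sketch below builds such a graph from Euclidean distances. It shows only this baseline building block, not the proposed subspace learning or manifold ensemble; the function name `p_nearest_neighbour_graph` and the four sample points are assumptions made for the example.

```python
import math


def p_nearest_neighbour_graph(points, p):
    """Return {i: indices of the p points closest to point i (Euclidean)}."""
    graph = {}
    for i, a in enumerate(points):
        # Distance from point i to every other point, paired with its index.
        others = [(math.dist(a, b), j) for j, b in enumerate(points) if j != i]
        others.sort()
        graph[i] = [j for _, j in others[:p]]
    return graph


# Hypothetical data: three nearby points and one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0)]
graph = p_nearest_neighbour_graph(points, p=2)
```

In this toy data the outlier at (5.0, 5.0) is still linked to two neighbours even though it is far from every other point, which is one way a fixed-p graph can encode a relationship that is not really there.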
Abstract:
This paper explores how traditional media organizations (such as magazines, music, film, books, and newspapers) develop routines for coping with an increasingly productive audience. While previous studies have reported on how such organizations have been affected by digital technologies, this study makes a contribution to this literature by being one of the first to show how organizational routines for engaging with an increasingly productive audience actually emerge and diffuse between industries. The paper explores to what extent routines employed by two traditional media organizations have been brought in from other organizational settings, specifically from so-called ‘software platform operators’. Data on routines for engaging with productive audiences have been collected from two information-rich cases in the music and the magazine industries, and from eight high-profile software platform operators. The paper concludes that the routines employed by the two traditional media organizations and by the software platform operators are based on the same set of principles: Provide the audience with (a) tools that allow them to easily generate cultural content; (b) building blocks which facilitate their creative activities; and (c) recognition and rewards based on both rationality and emotion.