48 resultados para document clustering
Resumo:
This paper proposes max separation clustering (MSC), a new non-hierarchical clustering method used for feature extraction from optical emission spectroscopy (OES) data for plasma etch process control applications. OES data is high dimensional and inherently highly redundant with the result that it is difficult if not impossible to recognize useful features and key variables by direct visualization. MSC is developed for clustering variables with distinctive patterns and providing effective pattern representation by a small number of representative variables. The relationship between signal-to-noise ratio (SNR) and clustering performance is highlighted, leading to a requirement that low SNR signals be removed before applying MSC. Experimental results on industrial OES data show that MSC with low SNR signal removal produces effective summarization of the dominant patterns in the data.
Resumo:
This paper considers the musicological aspects of the songs performed by Ophelia in Shakespeare's Hamlet. It proposes a reconsideration of the concept of madness and insanity by an attentive, attuned and learned listening to the songs sung by Ophelia and the ways in which they are performed and received.
Resumo:
The quantity and quality of spatial data are increasing rapidly. This is particularly evident in the case of movement data. Devices capable of accurately recording the position of moving entities have become ubiquitous and created an abundance of movement data. Valuable knowledge concerning processes occurring in the physical world can be extracted from these large movement data sets. Geovisual analytics offers powerful techniques to achieve this. This article describes a new geovisual analytics tool specifically designed for movement data. The tool features the classic space-time cube augmented with a novel clustering approach to identify common behaviour. These techniques were used to analyse pedestrian movement in a city environment which revealed the effectiveness of the tool for identifying spatiotemporal patterns. © 2014 Taylor & Francis.
Resumo:
Recent technological advances have increased the quantity of movement data being recorded. While valuable knowledge can be gained by analysing such data, its sheer volume creates challenges. Geovisual analytics, which helps the human cognition process by using tools to reason about data, offers powerful techniques to resolve these challenges. This paper introduces such a geovisual analytics environment for exploring movement trajectories, which provides visualisation interfaces, based on the classic space-time cube. Additionally, a new approach, using the mathematical description of motion within a space-time cube, is used to determine the similarity of trajectories and forms the basis for clustering them. These techniques were used to analyse pedestrian movement. The results reveal interesting and useful spatiotemporal patterns and clusters of pedestrians exhibiting similar behaviour.
Resumo:
Increasingly semiconductor manufacturers are exploring opportunities for virtual metrology (VM) enabled process monitoring and control as a means of reducing non-value added metrology and achieving ever more demanding wafer fabrication tolerances. However, developing robust, reliable and interpretable VM models can be very challenging due to the highly correlated input space often associated with the underpinning data sets. A particularly pertinent example is etch rate prediction of plasma etch processes from multichannel optical emission spectroscopy data. This paper proposes a novel input-clustering based forward stepwise regression methodology for VM model building in such highly correlated input spaces. Max Separation Clustering (MSC) is employed as a pre-processing step to identify a reduced srt of well-conditioned, representative variables that can then be used as inputs to state-of-the-art model building techniques such as Forward Selection Regression (FSR), Ridge regression, LASSO and Forward Selection Ridge Regression (FCRR). The methodology is validated on a benchmark semiconductor plasma etch dataset and the results obtained are compared with those achieved when the state-of-art approaches are applied directly to the data without the MSC pre-processing step. Significant performance improvements are observed when MSC is combined with FSR (13%) and FSRR (8.5%), but not with Ridge Regression (-1%) or LASSO (-32%). The optimal VM results are obtained using the MSC-FSR and MSC-FSRR generated models. © 2012 IEEE.
Resumo:
Analysis of colorectal carcinoma (CRC) tissue for KRAS codon 12 or 13 mutations to guide use of anti-epidermal growth factor receptor (EGFR) therapy is now considered mandatory in the UK. The scope of this practice has been recently extended because of data indicating that NRAS mutations and additional KRAS mutations also predict for poor response to anti-EGFR therapy. The following document provides guidance on RAS (i.e., KRAS and NRAS) testing of CRC tissue in the setting of personalised medicine within the UK and particularly within the NHS. This guidance covers issues related to case selection, preanalytical aspects, analysis and interpretation of such RAS testing.
Resumo:
In this paper we propose a graph stream clustering algorithm with a unied similarity measure on both structural and attribute properties of vertices, with each attribute being treated as a vertex. Unlike others, our approach does not require an input parameter for the number of clusters, instead, it dynamically creates new sketch-based clusters and periodically merges existing similar clusters. Experiments on two publicly available datasets reveal the advantages of our approach in detecting vertex clusters in the graph stream. We provide a detailed investigation into how parameters affect the algorithm performance. We also provide a quantitative evaluation and comparison with a well-known offline community detection algorithm which shows that our streaming algorithm can achieve comparable or better average cluster purity.
Resumo:
Report on implementation of the candidate gender quota in the Fianna Fail Party.
Resumo:
Introduction
Mild cognitive impairment (MCI) has clinical value in its ability to predict later dementia. A better understanding of cognitive profiles can further help delineate who is most at risk of conversion to dementia. We aimed to (1) examine to what extent the usual MCI subtyping using core criteria corresponds to empirically defined clusters of patients (latent profile analysis [LPA] of continuous neuropsychological data) and (2) compare the two methods of subtyping memory clinic participants in their prediction of conversion to dementia.
Methods
Memory clinic participants (MCI, n = 139) and age-matched controls (n = 98) were recruited. Participants had a full cognitive assessment, and results were grouped (1) according to traditional MCI subtypes and (2) using LPA. MCI participants were followed over approximately 2 years after their initial assessment to monitor for conversion to dementia.
Results
Groups were well matched for age and education. Controls performed significantly better than MCI participants on all cognitive measures. With the traditional analysis, most MCI participants were in the amnestic multidomain subgroup (46.8%) and this group was most at risk of conversion to dementia (63%). From the LPA, a three-profile solution fit the data best. Profile 3 was the largest group (40.3%), the most cognitively impaired, and most at risk of conversion to dementia (68% of the group).
Discussion
LPA provides a useful adjunct in delineating MCI participants most at risk of conversion to dementia and adds confidence to standard categories of clinical inference.
Resumo:
Most traditional data mining algorithms struggle to cope with the sheer scale of data efficiently. In this paper, we propose a general framework to accelerate existing clustering algorithms to cluster large-scale datasets which contain large numbers of attributes, items, and clusters. Our framework makes use of locality sensitive hashing (LSH) to significantly reduce the cluster search space. We also theoretically prove that our framework has a guaranteed error bound in terms of the clustering quality. This framework can be applied to a set of centroid-based clustering algorithms that assign an object to the most similar cluster, and we adopt the popular K-Modes categorical clustering algorithm to present how the framework can be applied. We validated our framework with five synthetic datasets and a real world Yahoo! Answers dataset. The experimental results demonstrate that our framework is able to speed up the existing clustering algorithm between factors of 2 and 6, while maintaining comparable cluster purity.