42 resultados para classification aided by clustering
Resumo:
Clustering analysis of data from DNA microarray hybridization studies is an essential task for identifying biologically relevant groups of genes. Attribute cluster algorithm (ACA) has provided an attractive way to group and select meaningful genes. However, ACA needs much prior knowledge about the genes to set the number of clusters. In practical applications, if the number of clusters is misspecified, the performance of the ACA will deteriorate rapidly. In fact, it is a very demanding to do that because of our little knowledge. We propose the Cooperative Competition Cluster Algorithm (CCCA) in this paper. In the algorithm, we assume that both cooperation and competition exist simultaneously between clusters in the process of clustering. By using this principle of Cooperative Competition, the number of clusters can be found in the process of clustering. Experimental results on a synthetic and gene expression data are demonstrated. The results show that CCCA can choose the number of clusters automatically and get excellent performance with respect to other competing methods.
Resumo:
Traditionally, the Internet provides only a “best-effort” service, treating all packets going to the same destination equally. However, providing differentiated services for different users based on their quality requirements is increasingly becoming a demanding issue. For this, routers need to have the capability to distinguish and isolate traffic belonging to different flows. This ability to determine the flow each packet belongs to is called packet classification. Technology vendors are reluctant to support algorithmic solutions for classification due to their nondeterministic performance. Although content addressable memories (CAMs) are favoured by technology vendors due to their deterministic high-lookup rates, they suffer from the problems of high-power consumption and high-silicon cost. This paper provides a new algorithmic-architectural solution for packet classification that mixes CAMs with algorithms based on multilevel cutting of the classification space into smaller spaces. The provided solution utilizes the geometrical distribution of rules in the classification space. It provides the deterministic performance of CAMs, support for dynamic updates, and added flexibility for system designers.
Resumo:
Clinical and pathological heterogeneity of breast cancer hinders selection of appropriate treatment for individual cases. Molecular profiling at gene or protein levels may elucidate the biological variance of tumors and provide a new classification system that correlates better with biological, clinical and prognostic parameters. We studied the immunohistochemical profile of a panel of seven important biomarkers using tumor tissue arrays. The tumor samples were then classified with a monothetic (binary variables) clustering algorithm. Two distinct groups of tumors are characterized by the estrogen receptor (ER) status and tumor grade (p = 0.0026). Four biomarkers, c-erbB2, Cox-2, p53 and VEGF, were significantly overexpressed in tumors with the ER-negative (ER-) phenotype. Eight subsets of tumors were further identified according to the expression status of VEGF, c-erbB2 and p53. The malignant potential of the ER-/VEGF+ subgroup was associated with the strong correlations of Cox-2 and c-erb132 with VEGF. Our results indicate that this molecular classification system, based on the statistical analysis of immunohistochemical profiling, is a useful approach for tumor grouping. Some of these subgroups have a relative genetic homogeneity that may allow further study of specific genetically-controlled metabolic pathways. This approach may hold great promise in rationalizing the application of different therapeutic strategies for different subgroups of breast tumors. (C) 2003 Elsevier Inc. All rights reserved.
Resumo:
Conditional branches frequently exhibit similar behavior (bias, time-varying behavior,...), a property that can be used to improve branch prediction accuracy. Branch clustering constructs groups or clusters of branches with similar behavior and applies different branch prediction techniques to each branch cluster. We revisit the topic of branch clustering with the aim of generalizing branch clustering. We investigate several methods to measure cluster information, with the most effective the storage of information in the branch target buffer. Also, we investigate alternative methods of using the branch cluster identification in the branch predictor. By these improvements we arrive at a branch clustering technique that obtains higher accuracy than previous approaches presented in the literature for the gshare predictor. Furthermore, we evaluate our branch clustering technique in a wide range of predictors to show the general applicability of the method. Branch clustering improves the accuracy of the local history (PAg) predictor, the path-based perceptron and the PPM-like predictor, one of the 2004 CBP finalists.
Resumo:
Recent technological advances have increased the quantity of movement data being recorded. While valuable knowledge can be gained by analysing such data, its sheer volume creates challenges. Geovisual analytics, which helps the human cognition process by using tools to reason about data, offers powerful techniques to resolve these challenges. This paper introduces such a geovisual analytics environment for exploring movement trajectories, which provides visualisation interfaces, based on the classic space-time cube. Additionally, a new approach, using the mathematical description of motion within a space-time cube, is used to determine the similarity of trajectories and forms the basis for clustering them. These techniques were used to analyse pedestrian movement. The results reveal interesting and useful spatiotemporal patterns and clusters of pedestrians exhibiting similar behaviour.
Resumo:
In this study, 137 corn distillers dried grains with solubles (DDGS) samples from a range of different geographical origins (Jilin Province of China, Heilongjiang Province of China, USA and Europe) were collected and analysed. Different near infrared spectrometers combined with different chemometric packages were used in two independent laboratories to investigate the feasibility of classifying geographical origin of DDGS. Base on the same dataset, one laboratory developed a partial least square discriminant analysis model and another laboratory developed an orthogonal partial least square discriminant analysis model. Results showed that both models could perfectly classify DDGS samples from different geographical origins. These promising results encourage the development of larger scale efforts to produce datasets which can be used to differentiate the geographical origin of DDGS and such efforts are required to provide higher level food security measures on a global scale.
Resumo:
Classification methods with embedded feature selection capability are very appealing for the analysis of complex processes since they allow the analysis of root causes even when the number of input variables is high. In this work, we investigate the performance of three techniques for classification within a Monte Carlo strategy with the aim of root cause analysis. We consider the naive bayes classifier and the logistic regression model with two different implementations for controlling model complexity, namely, a LASSO-like implementation with a L1 norm regularization and a fully Bayesian implementation of the logistic model, the so called relevance vector machine. Several challenges can arise when estimating such models mainly linked to the characteristics of the data: a large number of input variables, high correlation among subsets of variables, the situation where the number of variables is higher than the number of available data points and the case of unbalanced datasets. Using an ecological and a semiconductor manufacturing dataset, we show advantages and drawbacks of each method, highlighting the superior performance in term of classification accuracy for the relevance vector machine with respect to the other classifiers. Moreover, we show how the combination of the proposed techniques and the Monte Carlo approach can be used to get more robust insights into the problem under analysis when faced with challenging modelling conditions.
Resumo:
Breast cancer remains a frequent cause of female cancer death despite the great strides in elucidation of biological subtypes and their reported clinical and prognostic significance. We have defined a general cohort of breast cancers in terms of putative actionable targets, involving growth and proliferative factors, the cell cycle, and apoptotic pathways, both as single biomarkers across a general cohort and within intrinsic molecular subtypes.
We identified 293 patients treated with adjuvant chemotherapy. Additional hormonal therapy and trastuzumab was administered depending on hormonal and HER2 status respectively. We performed immunohistochemistry for ER, PR, HER2, MM1, CK5/6, p53, TOP2A, EGFR, IGF1R, PTEN, p-mTOR and e-cadherin. The cohort was classified into luminal (62%) and non-luminal (38%) tumors as well as luminal A (27%), luminal B HER2 negative (22%) and positive (12%), HER2 enriched (14%) and triple negative (25%). Patients with luminal tumors and co-overexpression of TOP2A or IGF1R loss displayed worse overall survival (p=0.0251 and p=0.0008 respectively). Non-luminal tumors had much greater heterogeneous expression profiles with no individual markers of prognostic significance. Non-luminal tumors were characterised by EGFR and TOP2A overexpression, IGF1R, PTEN and p-mTOR negativity and extreme p53 expression.
Our results indicate that only a minority of intrinsic subtype tumors purely express single novel actionable targets. This lack of pure biomarker expression is particular prevalent in the triple negative subgroup and may allude to the mechanism of targeted therapy inaction and myriad disappointing trial results. Utilising a combinatorial biomarker approach may enhance studies of targeted therapies providing additional information during design and patient selection while also helping decipher negative trial results.
Resumo:
Application of sensor-based technology within activity monitoring systems is becoming a popular technique within the smart environment paradigm. Nevertheless, the use of such an approach generates complex constructs of data, which subsequently requires the use of intricate activity recognition techniques to automatically infer the underlying activity. This paper explores a cluster-based ensemble method as a new solution for the purposes of activity recognition within smart environments. With this approach activities are modelled as collections of clusters built on different subsets of features. A classification process is performed by assigning a new instance to its closest cluster from each collection. Two different sensor data representations have been investigated, namely numeric and binary. Following the evaluation of the proposed methodology it has been demonstrated that the cluster-based ensemble method can be successfully applied as a viable option for activity recognition. Results following exposure to data collected from a range of activities indicated that the ensemble method had the ability to perform with accuracies of 94.2% and 97.5% for numeric and binary data, respectively. These results outperformed a range of single classifiers considered as benchmarks.