778 resultados para Traditional clustering
Resumo:
Members of the Yorta Yorta Aboriginal Community v State of Victoria was the first case in which a claim for native title was lodged in a non-remote area of the Australian mainland which was the subject of European settlement at an early stage in Australian history - highlights the difficulties in establishing native title claims in long settled regions of Australia - a failure to recognise the strength of oral tradition in establishing Aboriginal connection with the land.
Resumo:
Motivation: This paper introduces the software EMMIX-GENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to reduce effectively the dimension of the feature space of genes. Results: The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes are able to be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets.
Resumo:
In microarray studies, the application of clustering techniques is often used to derive meaningful insights into the data. In the past, hierarchical methods have been the primary clustering tool employed to perform this task. The hierarchical algorithms have been mainly applied heuristically to these cluster analysis problems. Further, a major limitation of these methods is their inability to determine the number of clusters. Thus there is a need for a model-based approach to these. clustering problems. To this end, McLachlan et al. [7] developed a mixture model-based algorithm (EMMIX-GENE) for the clustering of tissue samples. To further investigate the EMMIX-GENE procedure as a model-based -approach, we present a case study involving the application of EMMIX-GENE to the breast cancer data as studied recently in van 't Veer et al. [10]. Our analysis considers the problem of clustering the tissue samples on the basis of the genes which is a non-standard problem because the number of genes greatly exceed the number of tissue samples. We demonstrate how EMMIX-GENE can be useful in reducing the initial set of genes down to a more computationally manageable size. The results from this analysis also emphasise the difficulty associated with the task of separating two tissue groups on the basis of a particular subset of genes. These results also shed light on why supervised methods have such a high misallocation error rate for the breast cancer data.
Resumo:
This paper discusses a document discovery tool based on Conceptual Clustering by Formal Concept Analysis. The program allows users to navigate e-mail using a visual lattice metaphor rather than a tree. It implements a virtual. le structure over e-mail where files and entire directories can appear in multiple positions. The content and shape of the lattice formed by the conceptual ontology can assist in e-mail discovery. The system described provides more flexibility in retrieving stored e-mails than what is normally available in e-mail clients. The paper discusses how conceptual ontologies can leverage traditional document retrieval systems and aid knowledge discovery in document collections.
Resumo:
Although stock prices fluctuate, the variations are relatively small and are frequently assumed to be normal distributed on a large time scale. But sometimes these fluctuations can become determinant, especially when unforeseen large drops in asset prices are observed that could result in huge losses or even in market crashes. The evidence shows that these events happen far more often than would be expected under the generalized assumption of normal distributed financial returns. Thus it is crucial to properly model the distribution tails so as to be able to predict the frequency and magnitude of extreme stock price returns. In this paper we follow the approach suggested by McNeil and Frey (2000) and combine the GARCH-type models with the Extreme Value Theory (EVT) to estimate the tails of three financial index returns DJI,FTSE 100 and NIKKEI 225 representing three important financial areas in the world. Our results indicate that EVT-based conditional quantile estimates are much more accurate than those from conventional AR-GARCH models assuming normal or Student’s t-distribution innovations when doing out-of-sample estimation (within the insample estimation, this is so for the right tail of the distribution of returns).
Resumo:
OBJECTIVE: To estimate the incidence rate of type 1 diabetes in the urban area of Santiago, Chile, from March 21, 1997 to March 20, 1998, and to assess the spatio-temporal clustering of cases during that period. METHODS: All sixty-one incident cases were located temporally (day of diagnosis) and spatially (place of residence) in the area of study. Knox's method was used to assess spatio-temporal clustering of incident cases. RESULTS: The overall incidence rate of type 1 diabetes was 4.11 cases per 100,000 children aged less than 15 years per year (95% confidence interval: 3.06--5.14). The incidence rate seems to have increased since the last estimate of the incidence calculated for the years 1986--1992 in the metropolitan region of Santiago. Different combinations of space-time intervals have been evaluated to assess spatio-temporal clustering. The smallest p-value was found for the combination of critical distances of 750 meters and 60 days (uncorrected p-value = 0.048). CONCLUSIONS: Although these are preliminary results regarding space-time clustering in Santiago, exploratory analysis of the data method would suggest a possible aggregation of incident cases in space-time coordinates.
Resumo:
A definition of medium voltage (MV) load diagrams was made, based on the data base knowledge discovery process. Clustering techniques were used as support for the agents of the electric power retail markets to obtain specific knowledge of their customers’ consumption habits. Each customer class resulting from the clustering operation is represented by its load diagram. The Two-step clustering algorithm and the WEACS approach based on evidence accumulation (EAC) were applied to an electricity consumption data from a utility client’s database in order to form the customer’s classes and to find a set of representative consumption patterns. The WEACS approach is a clustering ensemble combination approach that uses subsampling and that weights differently the partitions in the co-association matrix. As a complementary step to the WEACS approach, all the final data partitions produced by the different variations of the method are combined and the Ward Link algorithm is used to obtain the final data partition. Experiment results showed that WEACS approach led to better accuracy than many other clustering approaches. In this paper the WEACS approach separates better the customer’s population than Two-step clustering algorithm.
Resumo:
With the electricity market liberalization, the distribution and retail companies are looking for better market strategies based on adequate information upon the consumption patterns of its electricity consumers. A fair insight on the consumers’ behavior will permit the definition of specific contract aspects based on the different consumption patterns. In order to form the different consumers’ classes, and find a set of representative consumption patterns we use electricity consumption data from a utility client’s database and two approaches: Two-step clustering algorithm and the WEACS approach based on evidence accumulation (EAC) for combining partitions in a clustering ensemble. While EAC uses a voting mechanism to produce a co-association matrix based on the pairwise associations obtained from N partitions and where each partition has equal weight in the combination process, the WEACS approach uses subsampling and weights differently the partitions. As a complementary step to the WEACS approach, we combine the partitions obtained in the WEACS approach with the ALL clustering ensemble construction method and we use the Ward Link algorithm to obtain the final data partition. The characterization of the obtained consumers’ clusters was performed using the C5.0 classification algorithm. Experiment results showed that the WEACS approach leads to better results than many other clustering approaches.