933 resultados para Clustering analysis


Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper describes the approach taken to the XML Mining track at INEX 2008 by a group at the Queensland University of Technology. We introduce the K-tree clustering algorithm in an Information Retrieval context by adapting it for document clustering. Many large scale problems exist in document clustering. K-tree scales well with large inputs due to its low complexity. It offers promising results both in terms of efficiency and quality. Document classification was completed using Support Vector Machines.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper proposes a novel Hybrid Clustering approach for XML documents (HCX) that first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. The empirical analysis reveals that the proposed method is scalable and accurate.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

XML document clustering is essential for many document handling applications such as information storage, retrieval, integration and transformation. An XML clustering algorithm should process both the structural and the content information of XML documents in order to improve the accuracy and meaning of the clustering solution. However, the inclusion of both kinds of information in the clustering process results in a huge overhead for the underlying clustering algorithm because of the high dimensionality of the data. This paper introduces a novel approach that first determines the structural similarity in the form of frequent subtrees and then uses these frequent subtrees to represent the constrained content of the XML documents in order to determine the content similarity. The proposed method reduces the high dimensionality of input data by using only the structure-constrained content. The empirical analysis reveals that the proposed method can effectively cluster even very large XML datasets and outperform other existing methods.

Relevância:

30.00% 30.00%

Publicador:

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Digital collections are growing exponentially in size as the information age takes a firm grip on all aspects of society. As a result Information Retrieval (IR) has become an increasingly important area of research. It promises to provide new and more effective ways for users to find information relevant to their search intentions. Document clustering is one of the many tools in the IR toolbox and is far from being perfected. It groups documents that share common features. This grouping allows a user to quickly identify relevant information. If these groups are misleading then valuable information can accidentally be ignored. There- fore, the study and analysis of the quality of document clustering is important. With more and more digital information available, the performance of these algorithms is also of interest. An algorithm with a time complexity of O(n2) can quickly become impractical when clustering a corpus containing millions of documents. Therefore, the investigation of algorithms and data structures to perform clustering in an efficient manner is vital to its success as an IR tool. Document classification is another tool frequently used in the IR field. It predicts categories of new documents based on an existing database of (doc- ument, category) pairs. Support Vector Machines (SVM) have been found to be effective when classifying text documents. As the algorithms for classifica- tion are both efficient and of high quality, the largest gains can be made from improvements to representation. Document representations are vital for both clustering and classification. Representations exploit the content and structure of documents. Dimensionality reduction can improve the effectiveness of existing representations in terms of quality and run-time performance. Research into these areas is another way to improve the efficiency and quality of clustering and classification results. Evaluating document clustering is a difficult task. Intrinsic measures of quality such as distortion only indicate how well an algorithm minimised a sim- ilarity function in a particular vector space. Intrinsic comparisons are inherently limited by the given representation and are not comparable between different representations. Extrinsic measures of quality compare a clustering solution to a “ground truth” solution. This allows comparison between different approaches. As the “ground truth” is created by humans it can suffer from the fact that not every human interprets a topic in the same manner. Whether a document belongs to a particular topic or not can be subjective.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This overview focuses on the application of chemometrics techniques for the investigation of soils contaminated by polycyclic aromatic hydrocarbons (PAHs) and metals because these two important and very diverse groups of pollutants are ubiquitous in soils. The salient features of various studies carried out in the micro- and recreational environments of humans, are highlighted in the context of the various multivariate statistical techniques available across discipline boundaries that have been effectively used in soil studies. Particular attention is paid to techniques employed in the geosciences that may be effectively utilized for environmental soil studies; classical multivariate approaches that may be used in isolation or as complementary methods to these are also discussed. Chemometrics techniques widely applied in atmospheric studies for identifying sources of pollutants or for determining the importance of contaminant source contributions to a particular site, have seen little use in soil studies, but may be effectively employed in such investigations. Suitable programs are also available for suggesting mitigating measures in cases of soil contamination, and these are also considered. Specific techniques reviewed include pattern recognition techniques such as Principal Components Analysis (PCA), Fuzzy Clustering (FC) and Cluster Analysis (CA); geostatistical tools include variograms, Geographical Information Systems (GIS), contour mapping and kriging; source identification and contribution estimation methods reviewed include Positive Matrix Factorisation (PMF), and Principal Component Analysis on Absolute Principal Component Scores (PCA/APCS). Mitigating measures to limit or eliminate pollutant sources may be suggested through the use of ranking analysis and multi criteria decision making methods (MCDM). These methods are mainly represented in this review by studies employing the Preference Ranking Organisation Method for Enrichment Evaluation (PROMETHEE) and its associated graphic output, Geometrical Analysis for Interactive Aid (GAIA).

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This study aimed to investigate the spatial clustering and dynamic dispersion of dengue incidence in Queensland, Australia. We used Moran’s I statistic to assess the spatial autocorrelation of reported dengue cases. Spatial empirical Bayes smoothing estimates were used to display the spatial distribution of dengue in postal areas throughout Queensland. Local indicators of spatial association (LISA) maps and logistic regression models were used to identify spatial clusters and examine the spatio-temporal patterns of the spread of dengue. The results indicate that the spatial distribution of dengue was clustered during each of the three periods of 1993–1996, 1997–2000 and 2001–2004. The high-incidence clusters of dengue were primarily concentrated in the north of Queensland and low-incidence clusters occurred in the south-east of Queensland. The study concludes that the geographical range of notified dengue cases has significantly expanded in Queensland over recent years.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

A hierarchical structure is used to represent the content of the semi-structured documents such as XML and XHTML. The traditional Vector Space Model (VSM) is not sufficient to represent both the structure and the content of such web documents. Hence in this paper, we introduce a novel method of representing the XML documents in Tensor Space Model (TSM) and then utilize it for clustering. Empirical analysis shows that the proposed method is scalable for a real-life dataset as well as the factorized matrices produced from the proposed method helps to improve the quality of clusters due to the enriched document representation with both the structure and the content information.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The modern society has come to expect the electrical energy on demand, while many of the facilities in power systems are aging beyond repair and maintenance. The risk of failure is increasing with the aging equipments and can pose serious consequences for continuity of electricity supply. As the equipments used in high voltage power networks are very expensive, economically it may not be feasible to purchase and store spares in a warehouse for extended periods of time. On the other hand, there is normally a significant time before receiving equipment once it is ordered. This situation has created a considerable interest in the evaluation and application of probability methods for aging plant and provisions of spares in bulk supply networks, and can be of particular importance for substations. Quantitative adequacy assessment of substation and sub-transmission power systems is generally done using a contingency enumeration approach which includes the evaluation of contingencies, classification of the contingencies based on selected failure criteria. The problem is very complex because of the need to include detailed modelling and operation of substation and sub-transmission equipment using network flow evaluation and to consider multiple levels of component failures. In this thesis a new model associated with aging equipment is developed to combine the standard tools of random failures, as well as specific model for aging failures. This technique is applied in this thesis to include and examine the impact of aging equipments on system reliability of bulk supply loads and consumers in distribution network for defined range of planning years. The power system risk indices depend on many factors such as the actual physical network configuration and operation, aging conditions of the equipment, and the relevant constraints. The impact and importance of equipment reliability on power system risk indices in a network with aging facilities contains valuable information for utilities to better understand network performance and the weak links in the system. In this thesis, algorithms are developed to measure the contribution of individual equipment to the power system risk indices, as part of the novel risk analysis tool. A new cost worth approach was developed in this thesis that can make an early decision in planning for replacement activities concerning non-repairable aging components, in order to maintain a system reliability performance which economically is acceptable. The concepts, techniques and procedures developed in this thesis are illustrated numerically using published test systems. It is believed that the methods and approaches presented, substantially improve the accuracy of risk predictions by explicit consideration of the effect of equipment entering a period of increased risk of a non-repairable failure.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Background: Waist circumference has been identified as a valuable predictor of cardiovascular risk in children. The development of waist circumference percentiles and cut-offs for various ethnic groups are necessary because of differences in body composition. The purpose of this study was to develop waist circumference percentiles for Chinese children and to explore optimal waist circumference cut-off values for predicting cardiovascular risk factors clustering in this population.----- ----- Methods: Height, weight, and waist circumference were measured in 5529 children (2830 boys and 2699 girls) aged 6-12 years randomly selected from southern and northern China. Blood pressure, fasting triglycerides, low-density lipoprotein cholesterol, high-density lipoprotein cholesterol, and glucose were obtained in a subsample (n = 1845). Smoothed percentile curves were produced using the LMS method. Receiver-operating characteristic analysis was used to derive the optimal age- and gender-specific waist circumference thresholds for predicting the clustering of cardiovascular risk factors.----- ----- Results: Gender-specific waist circumference percentiles were constructed. The waist circumference thresholds were at the 90th and 84th percentiles for Chinese boys and girls respectively, with sensitivity and specificity ranging from 67% to 83%. The odds ratio of a clustering of cardiovascular risk factors among boys and girls with a higher value than cut-off points was 10.349 (95% confidence interval 4.466 to 23.979) and 8.084 (95% confidence interval 3.147 to 20.767) compared with their counterparts.----- ----- Conclusions: Percentile curves for waist circumference of Chinese children are provided. The cut-off point for waist circumference to predict cardiovascular risk factors clustering is at the 90th and 84th percentiles for Chinese boys and girls, respectively.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The traditional Vector Space Model (VSM) is not able to represent both the structure and the content of XML documents. This paper introduces a novel method of representing XML documents in a Tensor Space Model (TSM) and then utilizing it for clustering. Empirical analysis shows that the proposed method is scalable for large-sized datasets; as well, the factorized matrices produced from the proposed method help to improve the quality of clusters through the enriched document representation of both structure and content information.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper proposes the use of eigenvoice modeling techniques with the Cross Likelihood Ratio (CLR) as a criterion for speaker clustering within a speaker diarization system. The CLR has previously been shown to be a robust decision criterion for speaker clustering using Gaussian Mixture Models. Recently, eigenvoice modeling techniques have become increasingly popular, due to its ability to adequately represent a speaker based on sparse training data, as well as an improved capture of differences in speaker characteristics. This paper hence proposes that it would be beneficial to capitalize on the advantages of eigenvoice modeling in a CLR framework. Results obtained on the 2002 Rich Transcription (RT-02) Evaluation dataset show an improved clustering performance, resulting in a 35.1% relative improvement in the overall Diarization Error Rate (DER) compared to the baseline system.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper proposes an innovative instance similarity based evaluation metric that reduces the search map for clustering to be performed. An aggregate global score is calculated for each instance using the novel idea of Fibonacci series. The use of Fibonacci numbers is able to separate the instances effectively and, in hence, the intra-cluster similarity is increased and the inter-cluster similarity is decreased during clustering. The proposed FIBCLUS algorithm is able to handle datasets with numerical, categorical and a mix of both types of attributes. Results obtained with FIBCLUS are compared with the results of existing algorithms such as k-means, x-means expected maximization and hierarchical algorithms that are widely used to cluster numeric, categorical and mix data types. Empirical analysis shows that FIBCLUS is able to produce better clustering solutions in terms of entropy, purity and F-score in comparison to the above described existing algorithms.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The multifractal properties of two indices of geomagnetic activity, D st (representative of low latitudes) and a p (representative of the global geomagnetic activity), with the solar X-ray brightness, X l , during the period from 1 March 1995 to 17 June 2003 are examined using multifractal detrended fluctuation analysis (MF-DFA). The h(q) curves of D st and a p in the MF-DFA are similar to each other, but they are different from that of X l , indicating that the scaling properties of X l are different from those of D st and a p . Hence, one should not predict the magnitude of magnetic storms directly from solar X-ray observations. However, a strong relationship exists between the classes of the solar X-ray irradiance (the classes being chosen to separate solar flares of class X-M, class C, and class B or less, including no flares) in hourly measurements and the geomagnetic disturbances (large to moderate, small, or quiet) seen in D st and a p during the active period. Each time series was converted into a symbolic sequence using three classes. The frequency, yielding the measure representations, of the substrings in the symbolic sequences then characterizes the pattern of space weather events. Using the MF-DFA method and traditional multifractal analysis, we calculate the h(q), D(q), and τ (q) curves of the measure representations. The τ (q) curves indicate that the measure representations of these three indices are multifractal. On the basis of this three-class clustering, we find that the h(q), D(q), and τ (q) curves of the measure representations of these three indices are similar to each other for positive values of q. Hence, a positive flare storm class dependence is reflected in the scaling exponents h(q) in the MF-DFA and the multifractal exponents D(q) and τ (q). This finding indicates that the use of the solar flare classes could improve the prediction of the D st classes.