160 results for hierarchical clustering

in Queensland University of Technology - ePrints Archive


Relevance:

100.00%

Abstract:

Genetic correlation (rg) analysis determines how much of the correlation between two measures is due to common genetic influences. In an analysis of 4 Tesla diffusion tensor images (DTI) from 531 healthy young adult twins and their siblings, we generalized the concept of genetic correlation to determine common genetic influences on white matter integrity, measured by fractional anisotropy (FA), at all points of the brain, yielding an NxN genetic correlation matrix rg(x,y) between FA values at all pairs of voxels in the brain. With hierarchical clustering, we identified brain regions with relatively homogeneous genetic determinants, to boost the power to identify causal single nucleotide polymorphisms (SNPs). We applied genome-wide association (GWA) to assess associations between 529,497 SNPs and FA in clusters defined by hubs of the clustered genetic correlation matrix. We identified a network of genes, with a scale-free topology, that influences white matter integrity over multiple brain regions.
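As a rough illustration of clustering a correlation matrix such as rg(x,y), the sketch below converts correlations to dissimilarities (1 - r) and applies average-linkage agglomerative clustering. The linkage rule, the 1 - r conversion, the function name `average_linkage`, and the toy 4x4 matrix are all assumptions made for illustration; the abstract does not state which linkage or distance the authors used.

```python
from itertools import combinations


def average_linkage(dist, n_clusters):
    """Agglomerative clustering with average linkage on a full distance matrix."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Average pairwise distance between members of the two clusters.
            d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
            d /= len(clusters[a]) * len(clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a (a < b, so deleting b is safe).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy 4x4 correlation matrix (invented values, not data from the study).
rg = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
# Turn correlation into a dissimilarity: highly correlated points are "close".
dist = [[1.0 - r for r in row] for row in rg]
regions = average_linkage(dist, n_clusters=2)
print(regions)
```

With this toy matrix, points 0 and 1 fall in one group and points 2 and 3 in the other. The naive search is only practical for small N, so a voxel-level matrix would need an optimized library implementation.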

Relevance:

80.00%

Abstract:

We present a novel method for improving hierarchical speaker clustering in the tasks of speaker diarization and speaker linking. In hierarchical clustering, a tree can be formed that demonstrates various levels of clustering. We propose a ratio that expresses the impact of each cluster on the formation of this tree and use this to rescale cluster scores. This provides score normalisation based on the impact of each cluster. We use a state-of-the-art speaker diarization and linking system across the SAIVT-BNEWS corpus to show that our proposed impact ratio can provide a relative improvement of 16% in diarization error rate (DER).

Relevance:

70.00%

Abstract:

This paper addresses the following predictive business process monitoring problem: Given the execution trace of an ongoing case, and given a set of traces of historical (completed) cases, predict the most likely outcome of the ongoing case. In this context, a trace refers to a sequence of events with corresponding payloads, where a payload consists of a set of attribute-value pairs. Meanwhile, an outcome refers to a label associated with completed cases, for example, a label indicating that a given case completed “on time” (with respect to a given desired duration) or “late”, or a label indicating whether or not a given case led to a customer complaint. The paper tackles this problem via a two-phased approach. In the first phase, prefixes of historical cases are encoded using complex symbolic sequences and clustered. In the second phase, a classifier is built for each of the clusters. To predict the outcome of an ongoing case at runtime given its (uncompleted) trace, we select the closest cluster(s) to the trace in question and apply the respective classifier(s), taking into account the Euclidean distance of the trace from the center of the clusters. We consider two families of clustering algorithms – hierarchical clustering and k-medoids – and use random forests for classification. The approach was evaluated on four real-life datasets.
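The two-phase idea (cluster first, then one classifier per cluster, chosen at prediction time by Euclidean distance to the cluster center) can be sketched as follows. This is a minimal illustration under stated assumptions: traces are taken to be already encoded as numeric vectors and already grouped into clusters, and a majority-vote rule stands in for the per-cluster random forest. The names `fit`, `predict`, and `centroid` and the toy data are hypothetical.

```python
import math
from collections import Counter


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def fit(clusters):
    """clusters: list of lists of (feature_vector, outcome_label) pairs."""
    model = []
    for members in clusters:
        center = centroid([x for x, _ in members])
        # Stand-in "classifier": the majority outcome within the cluster.
        # (The paper uses random forests; this is a simplification.)
        majority = Counter(y for _, y in members).most_common(1)[0][0]
        model.append((center, majority))
    return model


def predict(model, x):
    """Apply the classifier of the cluster whose center is nearest to x."""
    center, label = min(model, key=lambda m: math.dist(m[0], x))
    return label


# Toy encoded prefixes with outcomes (invented values).
clusters = [
    [([1.0, 1.0], "on time"), ([1.0, 2.0], "on time"), ([2.0, 1.0], "late")],
    [([8.0, 8.0], "late"), ([9.0, 8.0], "late")],
]
model = fit(clusters)
print(predict(model, [1.5, 1.5]))
print(predict(model, [7.0, 9.0]))
```

Selecting only the single nearest cluster is one simple choice; the abstract also mentions using several close clusters and weighting by distance, which this sketch does not attempt.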

Relevance:

60.00%

Abstract:

We have used microarray gene expression profiling and machine learning to predict the presence of BRAF mutations in a panel of 61 melanoma cell lines. The BRAF gene was found to be mutated in 42 samples (69%) and intragenic mutations of the NRAS gene were detected in seven samples (11%). No cell line carried mutations of both genes. Using support vector machines, we have built a classifier that differentiates between melanoma cell lines based on BRAF mutation status. As few as 83 genes are able to discriminate between BRAF mutant and BRAF wild-type samples with clear separation observed using hierarchical clustering. Multidimensional scaling was used to visualize the relationship between a BRAF mutation signature and that of a generalized mitogen-activated protein kinase (MAPK) activation (either BRAF or NRAS mutation) in the context of the discriminating gene list. We observed that samples carrying NRAS mutations lie somewhere between those with or without BRAF mutations. These observations suggest that there are gene-specific mutation signals in addition to a common MAPK activation that result from the pleiotropic effects of either BRAF or NRAS on other signaling pathways, leading to measurably different transcriptional changes.

Relevance:

60.00%

Abstract:

Evidence exists that repositories of business process models used in industrial practice contain significant amounts of duplication. This duplication may stem from the fact that the repository describes variants of the same processes and/or because of copy/pasting activity throughout the lifetime of the repository. Previous work has put forward techniques for identifying duplicate fragments (clones) that can be refactored into shared subprocesses. However, these techniques are limited to finding exact clones. This paper analyzes the problem of approximate clone detection and puts forward two techniques for detecting clusters of approximate clones. Experiments show that the proposed techniques are able to accurately retrieve clusters of approximate clones that originate from copy/pasting followed by independent modifications to the copied fragments.

Relevance:

60.00%

Abstract:

This paper is devoted to the analysis of career paths and employability. The state of the art on this topic is rather poor in methodologies. Some authors propose distances well adapted to the data, but limit their analysis to hierarchical clustering. Other authors apply sophisticated methods, but only after paying the price of transforming the categorical data into continuous data via a factorial analysis. The latter approach has an important drawback, since it makes a linear assumption on the data. We propose a new methodology, inspired by biology and adapted to career paths, combining optimal matching and self-organizing maps. A complete study on real-life data will illustrate our proposal.

Relevance:

60.00%

Abstract:

Atherosclerotic cardiovascular disease remains the leading cause of morbidity and mortality in industrialized societies. The lack of metabolite biomarkers has so far impeded the clinical diagnosis of atherosclerosis. In this study, stable atherosclerosis patients (n=16) and age- and sex-matched healthy subjects without atherosclerosis (n=28) were recruited from the local community (Harbin, P. R. China). Plasma was collected from each study subject and subjected to metabolomics analysis by GC/MS. Pattern recognition analyses (principal components analysis, orthogonal partial least-squares discriminant analysis, and hierarchical clustering analysis) consistently showed that the plasma metabolome differed significantly between atherosclerotic and non-atherosclerotic subjects. Atherosclerosis-induced metabolic perturbations of fatty acids, such as palmitate, stearate, and 1-monolinoleoylglycerol, were confirmed, consistent with a previous publication showing that palmitate significantly contributes to atherosclerosis development via targeting apoptosis and inflammation pathways. Altogether, this study demonstrated that the development of atherosclerosis directly perturbed fatty acid metabolism, especially that of palmitate, which was confirmed as a phenotypic biomarker for clinical diagnosis of atherosclerosis.

Relevance:

60.00%

Abstract:

Data associated with germplasm collections are typically large and multivariate with a considerable number of descriptors measured on each of many accessions. Pattern analysis methods of clustering and ordination have been identified as techniques for statistically evaluating the available diversity in germplasm data. While used in many studies, the approaches have not dealt explicitly with the computational consequences of large data sets (i.e. greater than 5000 accessions). To consider the application of these techniques to germplasm evaluation data, 11328 accessions of groundnut (Arachis hypogaea L) from the International Research Institute for the Semi-Arid Tropics, Andhra Pradesh, India were examined. Data for nine quantitative descriptors measured in the rainy and post-rainy growing seasons were used. The ordination technique of principal component analysis was used to reduce the dimensionality of the germplasm data. The identification of phenotypically similar groups of accessions within large scale data via the computationally intensive hierarchical clustering techniques was not feasible and non-hierarchical techniques had to be used. Finite mixture models that maximise the likelihood of an accession belonging to a cluster were used to cluster the accessions in this collection. The patterns of response for the different growing seasons were found to be highly correlated. However, in relating the results to passport and other characterisation and evaluation descriptors, the observed patterns did not appear to be related to taxonomy or any other well known characteristics of groundnut.

Relevance:

60.00%

Abstract:

Samples of Forsythia suspensa from raw (Laoqiao) and ripe (Qingqiao) fruit were analyzed with the use of HPLC-DAD and the EIS-MS techniques. Seventeen peaks were detected, and of these, twelve were identified. Most were related to the glucopyranoside molecular fragment. Samples collected from three geographical areas (Shanxi, Henan and Shandong Provinces) were discriminated with the use of hierarchical clustering analysis (HCA), discriminant analysis (DA), and principal component analysis (PCA) models, but only PCA was able to provide further information about the relationships between objects and loadings; eight peaks were related to the provinces of sample origin. The supervised classification models, K-nearest neighbor (KNN), least squares support vector machines (LS-SVM), and counter propagation artificial neural network (CP-ANN), indicated successful classification, with KNN producing a 100% classification rate. Thus, the fruit were discriminated on the basis of their places of origin.

Relevance:

60.00%

Abstract:

Travel speed is one of the most critical parameters for road safety; the evidence suggests that increased vehicle speed is associated with higher crash risk and injury severity. Both naturalistic and simulator studies have reported that drivers distracted by a mobile phone select a lower driving speed. Speed decrements have been argued to be a risk compensatory behaviour of distracted drivers. Nonetheless, the extent and circumstances of the speed change among distracted drivers are still not known very well. As such, the primary objective of this study was to investigate patterns of speed variation in relation to contextual factors and distraction. Using the CARRS-Q high-fidelity Advanced Driving Simulator, the speed selection behaviour of 32 drivers aged 18-26 years was examined in two phone conditions: baseline (no phone conversation) and handheld phone operation. The simulator driving route contained five different types of road traffic complexities, including one road section with a horizontal S curve, one horizontal S curve with adjacent traffic, one straight segment of suburban road without traffic, one straight segment of suburban road with traffic interactions, and one road segment in a city environment. Speed deviations from the posted speed limit were analysed using Ward’s Hierarchical Clustering method to identify the effects of road traffic environment and cognitive distraction. The speed deviations along curved road sections formed two different clusters for the two phone conditions, implying that distracted drivers adopt a different strategy for selecting driving speed in a complex driving situation. In particular, distracted drivers selected a lower speed while driving along a horizontal curve. 
The speed deviation along the city road segment and other straight road segments grouped into a different cluster, and the deviations were not significantly different across phone conditions, suggesting a negligible effect of distraction on speed selection along these road sections. Future research should focus on developing a risk compensation model to explain the relationship between road traffic complexity and distraction.
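For reference, Ward's method merges, at each step, the pair of clusters whose union gives the smallest increase in the total within-cluster sum of squares. The standard form of that merge cost is given below; it is the general textbook definition, not a formula taken from the study, and the symbols (clusters A and B with means \mu_A and \mu_B) are introduced here for illustration only.

```latex
\Delta(A, B) = \frac{|A|\,|B|}{|A| + |B|}\,\lVert \mu_A - \mu_B \rVert^{2}
```

Under this criterion, two groups of speed deviations are merged early when their mean profiles are close and the groups are small, which is why conditions with similar speed behaviour end up in the same cluster.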

Relevance:

60.00%

Abstract:

Objectives: In China, “serious road traffic crashes” (SRTCs) are those in which there are 10-30 fatalities, 50-100 serious injuries or a total cost of 50-100 million RMB ($US8-16m), and “particularly serious road traffic crashes” (PSRTCs) are those which are more severe or costly. Due to the large number of fatalities and injuries as well as the negative public reaction they elicit, SRTCs and PSRTCs have become great concerns to China during recent years. The aim of this study is to identify the main factors contributing to these road traffic crashes and to propose preventive measures to reduce their number. Methods: 49 contributing factors of the SRTCs and PSRTCs that occurred from 2007 to 2013 were collected from the database “In-depth Investigation and Analysis System for Major Road traffic crashes” (IIASMRTC) and were analyzed through the integrated use of principal component analysis and hierarchical clustering to determine the primary and secondary groups of contributing factors. Results: Speeding and overloading of passengers were the primary contributing factors, featuring in up to 66.3% and 32.6% of accidents respectively. Two secondary contributing factors were road-related: lack of or nonstandard roadside safety infrastructure, and slippery roads due to rain, snow or ice. Conclusions: The current approach to SRTCs and PSRTCs is focused on the attribution of responsibility and the enforcement of regulations considered relevant to particular SRTCs and PSRTCs. It would be more effective to investigate contributing factors and characteristics of SRTCs and PSRTCs as a whole, to provide adequate information for safety interventions in regions where SRTCs and PSRTCs are more common.
In addition to mandating of a driver training program and publicisation of the hazards associated with traffic violations, implementation of speed cameras, speed signs, markings and vehicle-mounted GPS are suggested to reduce speeding of passenger vehicles, while increasing regular checks by traffic police and passenger station staff, and improving transportation management to increase income of contractors and drivers are feasible measures to prevent overloading of people. Other promising measures include regular inspection of roadside safety infrastructure, and improving skid resistance on dangerous road sections in mountainous areas.

Relevance:

30.00%

Abstract:

A hierarchical structure is used to represent the content of semi-structured documents such as XML and XHTML. The traditional Vector Space Model (VSM) is not sufficient to represent both the structure and the content of such web documents. Hence, in this paper, we introduce a novel method of representing XML documents in a Tensor Space Model (TSM) and then utilize it for clustering. Empirical analysis shows that the proposed method scales to a real-life dataset and that the factorized matrices it produces help to improve the quality of clusters, owing to the enriched document representation that carries both structure and content information.

Relevance:

30.00%

Abstract:

This paper proposes an innovative instance similarity based evaluation metric that reduces the search map for clustering to be performed. An aggregate global score is calculated for each instance using the novel idea of Fibonacci series. The use of Fibonacci numbers is able to separate the instances effectively and, hence, the intra-cluster similarity is increased and the inter-cluster similarity is decreased during clustering. The proposed FIBCLUS algorithm is able to handle datasets with numerical, categorical and a mix of both types of attributes. Results obtained with FIBCLUS are compared with the results of existing algorithms such as k-means, x-means, expectation maximization and hierarchical algorithms that are widely used to cluster numeric, categorical and mixed data types. Empirical analysis shows that FIBCLUS is able to produce better clustering solutions in terms of entropy, purity and F-score in comparison to the above described existing algorithms.
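Two of the evaluation measures named above, purity and entropy, can be computed as in the sketch below. It follows one common convention (purity as the fraction of instances in the majority class of their cluster, entropy as the size-weighted mean of per-cluster label entropy in bits); the paper may use a differently normalized variant, and the function names and toy labels are assumptions for illustration.

```python
import math
from collections import Counter


def purity(clusters):
    """clusters: list of lists of true class labels, one list per cluster."""
    total = sum(len(c) for c in clusters)
    # Count, in each cluster, the instances belonging to its majority class.
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / total


def entropy(clusters):
    """Size-weighted average of the per-cluster label entropy (in bits)."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        counts = Counter(c).values()
        h = -sum((n / len(c)) * math.log2(n / len(c)) for n in counts)
        result += (len(c) / total) * h
    return result


# Toy clustering: true labels of the instances assigned to each cluster.
clusters = [["a", "a", "a", "b"], ["b", "b", "b", "b"]]
p = purity(clusters)
e = entropy(clusters)
print(p, round(e, 4))
```

Higher purity and lower entropy both indicate clusters that line up better with the known classes, which is the sense in which the abstract reports "better clustering solutions".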

Relevance:

30.00%

Abstract:

Capacity probability models of generating units are commonly used in many power system reliability studies, at hierarchical level one (HLI). Analytical modelling of a generating system with many units or generating units with many derated states in a system, can result in an extensive number of states in the capacity model. Limitations on available memory and computational time of present computer facilities can pose difficulties for assessment of such systems in many studies. A cluster procedure using the nearest centroid sorting method was used for IEEE-RTS load model. The application proved to be very effective in producing a highly similar model with substantially fewer states. This paper presents an extended application of the clustering method to include capacity probability representation. A series of sensitivity studies are illustrated using IEEE-RTS generating system and load models. The loss of load and energy expectations (LOLE, LOEE), are used as indicators to evaluate the application

Relevance:

30.00%

Abstract:

This paper investigates the business cycle co-movement across countries and regions since 1950 as a measure for quantifying the economic interdependence in the ongoing globalisation process. Our methodological approach is based on analysis of a correlation matrix and the networks it contains. Such an approach summarises the interaction and interdependence of all elements, and it represents a more accurate measure of the global interdependence involved in an economic system. Our results show (1) the dynamics of interdependence has been driven more by synchronisation in regional growth patterns than by the synchronisation of the world economy, and (2) world crisis periods dramatically increase the global co-movement in the world economy.