790 results for Agglomerative Hierarchical Clustering
Abstract:
Atherosclerotic cardiovascular disease remains the leading cause of morbidity and mortality in industrialized societies. The lack of metabolite biomarkers has so far impeded the clinical diagnosis of atherosclerosis. In this study, stable atherosclerosis patients (n=16) and age- and sex-matched healthy subjects without atherosclerosis (n=28) were recruited from the local community (Harbin, P. R. China). Plasma was collected from each subject and subjected to metabolomics analysis by GC/MS. Pattern recognition analyses (principal components analysis, orthogonal partial least-squares discriminant analysis, and hierarchical clustering analysis) consistently showed that the plasma metabolome differed significantly between atherosclerotic and non-atherosclerotic subjects. Atherosclerosis-induced metabolic perturbations of fatty acids, such as palmitate, stearate, and 1-monolinoleoylglycerol, were confirmed and are consistent with a previous publication showing that palmitate contributes significantly to atherosclerosis development by targeting apoptosis and inflammation pathways. Altogether, this study demonstrated that the development of atherosclerosis directly perturbed fatty acid metabolism, especially that of palmitate, which was confirmed as a phenotypic biomarker for the clinical diagnosis of atherosclerosis.
Abstract:
Data associated with germplasm collections are typically large and multivariate, with a considerable number of descriptors measured on each of many accessions. Pattern analysis methods of clustering and ordination have been identified as techniques for statistically evaluating the available diversity in germplasm data. While used in many studies, these approaches have not dealt explicitly with the computational consequences of large data sets (i.e. more than 5000 accessions). To consider the application of these techniques to germplasm evaluation data, 11328 accessions of groundnut (Arachis hypogaea L.) from the International Crops Research Institute for the Semi-Arid Tropics, Andhra Pradesh, India were examined. Data for nine quantitative descriptors measured in the rainy and post-rainy growing seasons were used. The ordination technique of principal component analysis was used to reduce the dimensionality of the germplasm data. Identifying phenotypically similar groups of accessions within large-scale data via the computationally intensive hierarchical clustering techniques was not feasible, and non-hierarchical techniques had to be used. Finite mixture models that maximise the likelihood of an accession belonging to a cluster were used to cluster the accessions in this collection. The patterns of response for the different growing seasons were found to be highly correlated. However, in relating the results to passport and other characterisation and evaluation descriptors, the observed patterns did not appear to be related to taxonomy or any other well-known characteristics of groundnut.
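As a rough illustration of the finite mixture idea mentioned in this abstract, the sketch below fits a two-component one-dimensional Gaussian mixture by expectation-maximisation and assigns each observation to its most likely component. It is not the authors' model: the number of components, the initialisation, the variance floor, and the `scores` values (imagined as first principal-component scores of eight accessions) are all assumptions made for the example.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_component_mixture(data, n_iter=50, var_floor=1e-6):
    """Fit a 2-component 1-D Gaussian mixture with EM (illustrative only)."""
    # Assumed initialisation: one mean at each extreme of the data.
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each component.
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means and variances from the posteriors.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_floor, spread)
    # Hard assignment: the component with the larger posterior probability.
    labels = [max(range(2), key=lambda k: r[k]) for r in resp]
    return means, variances, weights, labels


# Hypothetical first-principal-component scores for eight accessions.
scores = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
means, variances, weights, labels = fit_two_component_mixture(scores)
print(means, labels)
```

Each EM iteration costs time linear in the number of observations, whereas agglomerative clustering needs all pairwise distances, which is one plausible reading of why a mixture model was preferred for more than 11000 accessions.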
Abstract:
Samples of Forsythia suspensa from raw (Laoqiao) and ripe (Qingqiao) fruit were analyzed using HPLC-DAD and ESI-MS techniques. Seventeen peaks were detected, of which twelve were identified. Most were related to the glucopyranoside molecular fragment. Samples collected from three geographical areas (Shanxi, Henan and Shandong Provinces) were discriminated using hierarchical clustering analysis (HCA), discriminant analysis (DA), and principal component analysis (PCA) models, but only PCA was able to provide further information about the relationships between objects and loadings; eight peaks were related to the provinces of sample origin. The supervised classification models, K-nearest neighbor (KNN), least squares support vector machines (LS-SVM), and counter propagation artificial neural network (CP-ANN), indicated successful classification, but KNN produced a 100% classification rate. Thus, the fruit were discriminated on the basis of their places of origin.
Abstract:
Travel speed is one of the most critical parameters for road safety; the evidence suggests that increased vehicle speed is associated with higher crash risk and injury severity. Both naturalistic and simulator studies have reported that drivers distracted by a mobile phone select a lower driving speed. Speed decrements have been argued to be a risk compensatory behaviour of distracted drivers. Nonetheless, the extent and circumstances of the speed change among distracted drivers are still not known very well. As such, the primary objective of this study was to investigate patterns of speed variation in relation to contextual factors and distraction. Using the CARRS-Q high-fidelity Advanced Driving Simulator, the speed selection behaviour of 32 drivers aged 18-26 years was examined in two phone conditions: baseline (no phone conversation) and handheld phone operation. The simulator driving route contained five different types of road traffic complexities, including one road section with a horizontal S curve, one horizontal S curve with adjacent traffic, one straight segment of suburban road without traffic, one straight segment of suburban road with traffic interactions, and one road segment in a city environment. Speed deviations from the posted speed limit were analysed using Ward’s Hierarchical Clustering method to identify the effects of road traffic environment and cognitive distraction. The speed deviations along curved road sections formed two different clusters for the two phone conditions, implying that distracted drivers adopt a different strategy for selecting driving speed in a complex driving situation. In particular, distracted drivers selected a lower speed while driving along a horizontal curve. 
The speed deviation along the city road segment and other straight road segments grouped into a different cluster, and the deviations were not significantly different across phone conditions, suggesting a negligible effect of distraction on speed selection along these road sections. Future research should focus on developing a risk compensation model to explain the relationship between road traffic complexity and distraction.
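The clustering step described in this abstract can be sketched as follows. This is a minimal pure-Python version of Ward's agglomerative method, which at each step merges the two clusters whose union gives the smallest increase in within-cluster sum of squares. The speed-deviation values, the labels, and the choice of three clusters are hypothetical and are not taken from the study.

```python
from itertools import combinations


def centroid(points, members):
    """Mean vector of the points whose indices are listed in members."""
    dims = range(len(points[0]))
    return [sum(points[i][d] for i in members) / len(members) for d in dims]


def ward_cost(points, a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(points, a), centroid(points, b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_clusters(points, k):
    """Merge singleton clusters greedily until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: ward_cost(points, clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] += clusters[j]  # i < j, so deleting j leaves i in place
        del clusters[j]
    return clusters


# Hypothetical mean speed deviations (km/h) from the posted limit,
# one value per (road section, phone condition) pair.
observations = [
    ("S curve / baseline", -3.0),
    ("S curve / phone", -9.0),
    ("S curve + traffic / baseline", -4.0),
    ("S curve + traffic / phone", -10.0),
    ("straight / baseline", 2.0),
    ("straight / phone", 1.5),
    ("straight + traffic / baseline", 1.0),
    ("straight + traffic / phone", 0.5),
    ("city / baseline", 0.0),
    ("city / phone", -0.5),
]
labels = [name for name, _ in observations]
points = [[value] for _, value in observations]

groups = [sorted(labels[i] for i in c) for c in ward_clusters(points, 3)]
for group in groups:
    print(group)
```

With these made-up numbers the curved sections split by phone condition while the straight and city sections fall into one cluster regardless of condition, which mirrors the pattern the abstract reports.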
Abstract:
Aim: This study investigated the use of stable δ13C and δ18O isotopes in the sagittal otolith carbonate of narrow-barred Spanish mackerel, Scomberomorus commerson, as indicators of population structure across Australia. Location: Samples were collected from 25 locations extending from the lower west coast of Western Australia (30°), across northern Australian waters, and to the east coast of Australia (18°) covering a coastline length of approximately 9500 km, including samples from Indonesia. Methods: The stable δ13C and δ18O isotopes in the sagittal otolith carbonate of S. commerson were analysed using standard mass spectrometric techniques. The isotope ratios across northern Australian subregions were subjected to an agglomerative hierarchical cluster analysis to define subregions. Isotope ratios within each of the subregions were compared to assess population structure across Australia. Results: Cluster analysis separated samples into four subregions: central Western Australia, north Western Australia, northern Australia and the Gulf of Carpentaria and eastern Australia. Isotope signatures for fish from a number of sampling sites from across Australia and Indonesia were significantly different, indicating population separation. No significant differences were found in otolith isotope ratios between sampling times (no temporal variation). Main conclusions: Significant differences in the isotopic signatures of S. commerson demonstrate that there is unlikely to be any substantial movement of fish among these spatially discrete adult assemblages. The lack of temporal variation among otolith isotope ratios indicates that S. commerson populations do not undergo longshore spatial shifts in distribution during their life history. 
The temporal persistence of spatially explicit stable isotopic signatures indicates that, at these spatial scales, the population units sampled comprise functionally distinct management units or separate ‘stocks’ for many of the purposes of fisheries management. The spatial subdivision evident among populations of S. commerson across northern and western Australia indicates that it may be advantageous to consider S. commerson population dynamics and fisheries management from a metapopulation perspective (at least at the regional level).
Abstract:
In this thesis we present and evaluate two pattern-matching-based methods for answer extraction in textual question answering systems. A textual question answering system is a system that seeks answers to natural language questions from unstructured text. Textual question answering is an important research problem because, as the amount of natural language text in digital format keeps growing, the need for novel methods for pinpointing important knowledge in vast textual databases becomes more and more urgent. We concentrate on developing methods for the automatic creation of answer extraction patterns. A new type of extraction pattern is also developed. The pattern-matching-based approach chosen is interesting because of its language and application independence. The answer extraction methods are developed in the framework of our own question answering system. Publicly available datasets in English are used as training and evaluation data for the methods. The techniques developed are based on the well-known methods of sequence alignment and hierarchical clustering. The similarity metric used is based on edit distance. The main conclusion of the research is that answer extraction patterns consisting of the most important words of the question and of the following information extracted from the answer context: plain words, part-of-speech tags, punctuation marks and capitalization patterns, can be used in the answer extraction module of a question answering system. This type of pattern and the two new methods for generating answer extraction patterns provide average results when compared to those produced by other systems using the same dataset. However, most answer extraction methods in the question answering systems tested with the same dataset are both hand-crafted and based on a system-specific and fine-grained question classification. The new methods developed in this thesis require no manual creation of answer extraction patterns. 
As a source of knowledge, they require a dataset of sample questions and answers, as well as a set of text documents that contain answers to most of the questions. The question classification used in the training data is a standard one and provided already in the publicly available data.
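To make the combination of edit distance and hierarchical clustering concrete, here is a small sketch that groups token sequences by Levenshtein distance using average-linkage agglomerative clustering. It is not the thesis's algorithm: the example patterns, the placeholder tokens `NAME` and `ANSWER`, the linkage rule, and the stopping threshold `max_dist` are assumptions chosen for illustration.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def agglomerate(items, max_dist):
    """Average-linkage clustering; stops when the closest pair exceeds max_dist."""
    clusters = [[i] for i in range(len(items))]

    def linkage(a, b):
        total = sum(edit_distance(items[i], items[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i, j in combinations(range(len(clusters)), 2))
        if d > max_dist:
            break
        clusters[i] += clusters[j]  # i < j, so deleting j leaves i in place
        del clusters[j]
    return clusters


# Hypothetical answer contexts, tokenised on whitespace.
patterns = [
    "NAME was born in ANSWER".split(),
    "NAME was born on ANSWER".split(),
    "NAME , born in ANSWER".split(),
    "ANSWER is the capital of NAME".split(),
    "ANSWER , the capital of NAME".split(),
]
clusters = agglomerate(patterns, max_dist=2)
print(clusters)
```

Each resulting cluster could then be collapsed into a single generalised pattern by aligning its members, which is the role sequence alignment plays in the abstract.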
Abstract:
This thesis studies human gene expression space using high-throughput gene expression data from DNA microarrays. In molecular biology, high-throughput techniques allow numerical measurements of the expression of tens of thousands of genes simultaneously. In a single study, this data is traditionally obtained from a limited number of sample types with a small number of replicates. For organism-wide analysis, this data has been largely unavailable and the global structure of the human transcriptome has remained unknown. This thesis introduces a human transcriptome map of different biological entities and an analysis of its general structure. The map is constructed from gene expression data from the two largest public microarray data repositories, GEO and ArrayExpress. The creation of this map contributed to the development of ArrayExpress by identifying and retrofitting previously unusable and missing data and by improving access to its data. It also contributed to the creation of several new tools for microarray data manipulation and the establishment of data exchange between GEO and ArrayExpress. The data integration for the global map required the creation of a new large ontology of human cell types, disease states, organism parts and cell lines. The ontology was used in a new text-mining and decision-tree-based method for automatic conversion of human-readable free-text microarray data annotations into a categorised format. Data comparability and the minimisation of the systematic measurement errors that are characteristic of each laboratory in this large cross-laboratory integrated dataset were ensured by computing a range of microarray data quality metrics and excluding incomparable data. The structure of a global map of human gene expression was then explored by principal component analysis and hierarchical clustering using heuristics and help from another purpose-built sample ontology. 
A preface and motivation to the construction and analysis of a global map of human gene expression is given by the analysis of two microarray datasets of human malignant melanoma. The analysis of these sets incorporates an indirect comparison of statistical methods for finding differentially expressed genes and points to the need to study gene expression on a global level.
Abstract:
Objectives In China, “serious road traffic crashes” (SRTCs) are those in which there are 10-30 fatalities, 50-100 serious injuries or a total cost of 50-100 million RMB ($US8-16m), and “particularly serious road traffic crashes” (PSRTCs) are those which are more severe or costly. Due to the large number of fatalities and injuries as well as the negative public reaction they elicit, SRTCs and PSRTCs have become great concerns to China during recent years. The aim of this study is to identify the main factors contributing to these road traffic crashes and to propose preventive measures to reduce their number. Methods 49 contributing factors of the SRTCs and PSRTCs that occurred from 2007 to 2013 were collected from the database “In-depth Investigation and Analysis System for Major Road traffic crashes” (IIASMRTC) and were analyzed through the integrated use of principal component analysis and hierarchical clustering to determine the primary and secondary groups of contributing factors. Results Speeding and overloading of passengers were the primary contributing factors, featuring in up to 66.3% and 32.6% of accidents respectively. Two secondary contributing factors were road-related: lack of or nonstandard roadside safety infrastructure, and slippery roads due to rain, snow or ice. Conclusions The current approach to SRTCs and PSRTCs is focused on the attribution of responsibility and the enforcement of regulations considered relevant to particular SRTCs and PSRTCs. It would be more effective to investigate contributing factors and characteristics of SRTCs and PSRTCs as a whole, to provide adequate information for safety interventions in regions where SRTCs and PSRTCs are more common. 
In addition to mandating of a driver training program and publicisation of the hazards associated with traffic violations, implementation of speed cameras, speed signs, markings and vehicle-mounted GPS are suggested to reduce speeding of passenger vehicles, while increasing regular checks by traffic police and passenger station staff, and improving transportation management to increase income of contractors and drivers are feasible measures to prevent overloading of people. Other promising measures include regular inspection of roadside safety infrastructure, and improving skid resistance on dangerous road sections in mountainous areas.
Abstract:
Core Vector Machine (CVM) is suitable for efficient large-scale pattern classification. In this paper, a method is proposed for improving the performance of CVM with a Gaussian kernel function irrespective of the ordering of patterns belonging to different classes within the data set. This method employs selective-sampling-based training of CVM using a novel kernel-based scalable hierarchical clustering algorithm. Empirical studies on synthetic and real-world data sets show that the proposed strategy performs well on large data sets.
Abstract:
The large number of spectral bands in hyperspectral images increases the capability to distinguish between various physical structures. However, such images suffer from the high dimensionality of the data. Hence, hyperspectral images are processed in two stages: dimensionality reduction and unsupervised classification. The dimensionality of the data is reduced with Principal Component Analysis (PCA). The selected dimensions are classified using the Niche Hierarchical Artificial Immune System (NHAIS). NHAIS combines a splitting method, which searches for the optimal cluster centers using a niching procedure, with a merging method, which groups the data points based on majority voting. Results are presented for two hyperspectral images, namely the EO-1 Hyperion image and the Indian Pines image. A performance comparison of the proposed hierarchical clustering algorithm with three earlier unsupervised algorithms is presented. From the results obtained, we deduce that NHAIS is efficient.
Abstract:
Establishing functional relationships between multi-domain protein sequences is a non-trivial task. Traditionally, delineating the functional assignment and relationships of proteins requires domain assignments as a prerequisite. This process is sensitive to alignment quality and domain definitions. In multi-domain proteins, alignment quality is poor for multiple reasons. We report the correspondence between the classification of proteins represented as full-length gene products and their functions. Our approach differs fundamentally from traditional methods in not performing the classification at the level of domains. Our method is based on an alignment-free local matching score (LMS) computation at the amino-acid sequence level, followed by hierarchical clustering. As there are no gold standards for full-length protein sequence classification, we resorted to Gene Ontology and domain-architecture-based similarity measures to assess our classification. The final clusters obtained using LMS show high functional and domain-architectural similarities. Comparison of the current method with alignment-based approaches at both the domain and full-length protein levels showed the superiority of the LMS scores. Using this method, we have recreated objective relationships among different protein kinase sub-families and also classified immunoglobulin-containing proteins, for which sub-family definitions do not currently exist. This method can be applied to any set of protein sequences and hence will be instrumental in the analysis of large numbers of full-length protein sequences.
Abstract:
We investigated the site response characteristics of the Kachchh rift basin over the meizoseismal area of the 2001, Mw 7.6, Bhuj (NW India) earthquake using the spectral ratio of the horizontal and vertical components of ambient vibrations. Using the available knowledge of the regional geology of Kachchh and well-documented ground responses from the earthquake, we evaluated the pattern of H/V curves across sediment-filled valleys and uplifted areas generally characterized by weathered sandstones. Although our H/V curves showed a largely fuzzy nature, we found that the hierarchical clustering method was useful for comparing large numbers of response curves and identifying areas with similar responses. Broad and plateau-shaped peaks in a cluster of curves within the valley region suggest the possibility of basin effects within the valley. Fundamental resonance frequencies (f(0)) are found in the narrow range of 0.1-2.3 Hz, and their spatial distribution demarcated the uplifted regions from the valleys. In contrast, low H/V peak amplitudes (A(0) = 2-4) were observed on the uplifted areas and varying values (2-9) were found within valleys. Compared to the amplification factors, the liquefaction indices (kg) were able to effectively indicate the areas which experienced severe liquefaction. The amplification ranges obtained in the current study were found to be comparable to those obtained from earthquake data for a limited number of seismic stations located on uplifted areas; however, the values in the valley region may not reflect their true amplification potential due to basin effects. Our study highlights the practical usefulness as well as the limitations of the H/V method for studying complex geological settings such as Kachchh. (C) 2014 Elsevier Ltd. All rights reserved.
Abstract:
This study analyzed species richness, distribution, and sighting frequency of selected reef fishes to describe species assemblage composition, abundance, and spatial distribution patterns among sites and regions (Upper Keys, Middle Keys, Lower Keys, and Dry Tortugas) within the Florida Keys National Marine Sanctuary (FKNMS) barrier reef ecosystem. Data were obtained from the Reef Environmental Education Foundation (REEF) Fish Survey Project, a volunteer fish-monitoring program. A total of 4,324 visual fish surveys conducted at 112 sites throughout the FKNMS were used in these analyses. The data set contained sighting information on 341 fish species comprising 68 families. Species richness was generally highest in the Upper Keys sites (maximum was 220 species at Molasses Reef) and lowest in the Dry Tortugas sites. Encounter rates differed among regions, with the Dry Tortugas having the highest rate, potentially a result of differences in the evenness in fishes and the lower diversity of habitat types in the Dry Tortugas region. Geographic coverage maps were developed for 29 frequently observed species. Fourteen of these species showed significant regional variation in mean sighting frequency (%SF). Six species had significantly lower mean %SF and eight species had significantly higher mean %SF in the Dry Tortugas compared with other regions. Hierarchical clustering based on species composition (presence-absence) and species % SF revealed interesting patterns of similarities among sites that varied across spatial scales. 
Results presented here indicate that phenomena affecting reef fish composition in the FKNMS operate at multiple spatial scales, including a biogeographic scale that defines the character of the region as a whole, a reef scale (~50-100 km) that includes meso-scale physical oceanographic processes and regional variation in reef structure and associated reef habitats, and a local scale that includes level of protection, cross-shelf location and a suite of physical characteristics of a given reef. It is likely that at both regional and local scales, species habitat requirements strongly influence the patterns revealed in this study, and are particularly limiting for species that are less frequently observed in the Dry Tortugas. The results of this report serve as a benchmark for the current status of the reef fishes in the FKNMS. In addition, these data provide the basis for analyses of reserve effects and the biogeographic coupling of benthic habitats and fish assemblages that are currently underway. (PDF contains 61 pages.)
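The presence-absence clustering mentioned in this abstract can be illustrated with a short sketch. The abstract does not name the dissimilarity measure or the linkage rule, so the Jaccard distance and single linkage used here are assumptions, and the site names and species lists are invented for the example.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus the share of species found at both sites among those at either."""
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0


def single_linkage(sites):
    """Return the merge history as (height, left members, right members)."""
    clusters = [(name,) for name in sites]
    merges = []
    while len(clusters) > 1:
        # Single linkage: cluster distance is the smallest site-to-site distance.
        height, x, y = min(
            (min(jaccard_distance(sites[p], sites[q]) for p in x for q in y), x, y)
            for x, y in combinations(clusters, 2)
        )
        clusters.remove(x)
        clusters.remove(y)
        clusters.append(x + y)
        merges.append((height, x, y))
    return merges


# Hypothetical presence-absence data: the set of species sighted at each site.
sites = {
    "site A": {"sp1", "sp2", "sp3", "sp4"},
    "site B": {"sp1", "sp2", "sp3"},
    "site C": {"sp4", "sp5", "sp6"},
    "site D": {"sp5", "sp6"},
}
merges = single_linkage(sites)
for height, left, right in merges:
    print(round(height, 3), left, right)
```

The merge heights are what a dendrogram would plot on its vertical axis; sites that share most of their species join near the bottom, and dissimilar groups join near the top.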
Abstract:
Elucidating the intricate relationship between brain structure and function, both in healthy and pathological conditions, is a key challenge for modern neuroscience. Recent progress in neuroimaging has helped advance our understanding of this important issue, with diffusion images providing information about structural connectivity (SC) and functional magnetic resonance imaging shedding light on resting state functional connectivity (rsFC). Here, we adopt a systems approach, relying on modular hierarchical clustering, to study together SC and rsFC datasets gathered independently from healthy human subjects. Our novel approach allows us to find a common skeleton shared by structure and function from which a new, optimal, brain partition can be extracted. We describe the emerging common structure-function modules (SFMs) in detail and compare them with commonly employed anatomical or functional parcellations. Our results underline the strong correspondence between brain structure and resting-state dynamics as well as the emerging coherent organization of the human brain.
Abstract:
We introduce the Pitman Yor Diffusion Tree (PYDT) for hierarchical clustering, a generalization of the Dirichlet Diffusion Tree (Neal, 2001) which removes the restriction to binary branching structure. The generative process is described and shown to result in an exchangeable distribution over data points. We prove some theoretical properties of the model and then present two inference methods: a collapsed MCMC sampler which allows us to model uncertainty over tree structures, and a computationally efficient greedy Bayesian EM search algorithm. Both algorithms use message passing on the tree structure. The utility of the model and algorithms is demonstrated on synthetic and real world data, both continuous and binary.