795 results for hierarchical clustering
Abstract:
Establishing functional relationships between multi-domain protein sequences is a non-trivial task. Traditionally, delineating the functional assignment and relationships of proteins requires domain assignments as a prerequisite. This process is sensitive to alignment quality and domain definitions. In multi-domain proteins, the quality of alignments is poor for multiple reasons. We report the correspondence between the classification of proteins represented as full-length gene products and their functions. Our approach differs fundamentally from traditional methods in not performing the classification at the level of domains. Our method is based on an alignment-free computation of local matching scores (LMS) at the amino-acid sequence level, followed by hierarchical clustering. As there are no gold standards for full-length protein sequence classification, we resorted to Gene Ontology and domain-architecture-based similarity measures to assess our classification. The final clusters obtained using LMS show high functional and domain-architectural similarities. Comparison of the current method with alignment-based approaches at both the domain and full-length protein levels showed the superiority of the LMS scores. Using this method, we have recreated objective relationships among different protein kinase sub-families and also classified immunoglobulin-containing proteins, for which sub-family definitions do not currently exist. This method can be applied to any set of protein sequences and hence will be instrumental in the analysis of large numbers of full-length protein sequences.
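The abstract does not define how LMS is computed, so the following is only a minimal sketch of the general idea of an alignment-free distance that can feed hierarchical clustering. It swaps in a plain k-mer Jaccard distance as a stand-in, which is not the authors' LMS; the function names and the toy sequences are hypothetical.

```python
from itertools import combinations

def kmers(seq, k=3):
    """Return the set of overlapping k-mers in an amino-acid sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_distance(a, b, k=3):
    """Jaccard distance between k-mer sets: 0 = same k-mer content, 1 = nothing shared."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)

def distance_matrix(seqs, k=3):
    """Symmetric pairwise distance matrix, usable as input to hierarchical clustering."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = kmer_distance(seqs[i], seqs[j], k)
    return d

# Hypothetical toy sequences, not real proteins.
toy = ["MKVLAAGIVG", "MKVLAAGLVG", "QQPTTSRWHE"]
D = distance_matrix(toy)
```

No alignment is computed at any point: the two similar toy sequences end up close together simply because they share most of their 3-mers.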
Abstract:
We investigated the site response characteristics of the Kachchh rift basin over the meizoseismal area of the 2001, Mw 7.6, Bhuj (NW India) earthquake using the spectral ratio of the horizontal and vertical components (H/V) of ambient vibrations. Using the available knowledge on the regional geology of Kachchh and well-documented ground responses from the earthquake, we evaluated the pattern of H/V curves across sediment-filled valleys and uplifted areas generally characterized by weathered sandstones. Although our H/V curves showed a largely fuzzy nature, we found that the hierarchical clustering method was useful for comparing large numbers of response curves and identifying the areas with similar responses. Broad and plateau-shaped peaks of a cluster of curves within the valley region suggest the possibility of basin effects within the valley. Fundamental resonance frequencies (f0) are found in the narrow range of 0.1-2.3 Hz, and their spatial distribution demarcated the uplifted regions from the valleys. In contrast, low H/V peak amplitudes (A0 = 2-4) were observed on the uplifted areas and varying values (2-9) were found within valleys. Compared to the amplification factors, the liquefaction indices (Kg) were able to effectively indicate the areas which experienced severe liquefaction. The amplification ranges obtained in the current study were found to be comparable to those obtained from earthquake data for a limited number of seismic stations located on uplifted areas; however, the values in the valley region may not reflect their true amplification potential due to basin effects. Our study highlights the practical usefulness as well as the limitations of the H/V method for studying complex geological settings such as Kachchh. (C) 2014 Elsevier Ltd. All rights reserved.
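As a rough illustration of the quantities named above, the sketch below forms an H/V curve from amplitude spectra and reads off the peak frequency f0 and peak amplitude A0. Merging the two horizontal components by their geometric mean is an assumption (the abstract does not say how they were combined), and all spectra values are made up.

```python
import math

def hv_curve(ns, ew, v):
    """H/V ratio per frequency bin; horizontals merged by geometric mean (an assumption)."""
    return [math.sqrt(n * e) / z for n, e, z in zip(ns, ew, v)]

def peak(freqs, hv):
    """Return (f0, A0): the frequency and amplitude of the largest H/V value."""
    i = max(range(len(hv)), key=hv.__getitem__)
    return freqs[i], hv[i]

# Hypothetical amplitude spectra sampled at five frequencies (Hz).
freqs = [0.2, 0.5, 1.0, 2.0, 4.0]
ns = [1.0, 2.0, 8.0, 3.0, 1.0]   # north-south component
ew = [1.0, 2.0, 2.0, 3.0, 1.0]   # east-west component
v = [1.0, 1.0, 1.0, 1.5, 1.0]    # vertical component
f0, a0 = peak(freqs, hv_curve(ns, ew, v))
```

Taking the single largest value is the simplest possible peak pick; broad or plateau-shaped curves like those described for the valley sites would need a more careful criterion.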
Abstract:
This study analyzed species richness, distribution, and sighting frequency of selected reef fishes to describe species assemblage composition, abundance, and spatial distribution patterns among sites and regions (Upper Keys, Middle Keys, Lower Keys, and Dry Tortugas) within the Florida Keys National Marine Sanctuary (FKNMS) barrier reef ecosystem. Data were obtained from the Reef Environmental Education Foundation (REEF) Fish Survey Project, a volunteer fish-monitoring program. A total of 4,324 visual fish surveys conducted at 112 sites throughout the FKNMS were used in these analyses. The data set contained sighting information on 341 fish species comprising 68 families. Species richness was generally highest in the Upper Keys sites (maximum was 220 species at Molasses Reef) and lowest in the Dry Tortugas sites. Encounter rates differed among regions, with the Dry Tortugas having the highest rate, potentially a result of differences in the evenness in fishes and the lower diversity of habitat types in the Dry Tortugas region. Geographic coverage maps were developed for 29 frequently observed species. Fourteen of these species showed significant regional variation in mean sighting frequency (%SF). Six species had significantly lower mean %SF and eight species had significantly higher mean %SF in the Dry Tortugas compared with other regions. Hierarchical clustering based on species composition (presence-absence) and species %SF revealed patterns of similarity among sites that varied across spatial scales.
Results presented here indicate that phenomena affecting reef fish composition in the FKNMS operate at multiple spatial scales, including a biogeographic scale that defines the character of the region as a whole, a reef scale (~50-100 km) that includes meso-scale physical oceanographic processes and regional variation in reef structure and associated reef habitats, and a local scale that includes level of protection, cross-shelf location and a suite of physical characteristics of a given reef. It is likely that at both regional and local scales, species habitat requirements strongly influence the patterns revealed in this study, and are particularly limiting for species that are less frequently observed in the Dry Tortugas. The results of this report serve as a benchmark for the current status of the reef fishes in the FKNMS. In addition, these data provide the basis for analyses on reserve effects and the biogeographic coupling of benthic habitats and fish assemblages that are currently underway. (PDF contains 61 pages.)
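Clustering sites on presence-absence data can be sketched as follows. This is an illustrative example under stated assumptions: the report does not name its distance measure or linkage rule here, so Jaccard distance and average linkage are choices made for the sketch, and the site names and species lists are hypothetical.

```python
def jaccard_distance(a, b):
    """One minus (shared species / all species seen at either site)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def average_linkage(items, dist, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[name] for name in items]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average of all pairwise distances between the two clusters.
                d = sum(dist(items[p], items[q])
                        for p in clusters[i] for q in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return [sorted(c) for c in clusters]

# Hypothetical sites and species lists, for illustration only.
sites = {
    "site_A": {"grunt", "snapper", "parrotfish", "wrasse"},
    "site_B": {"grunt", "snapper", "parrotfish"},
    "site_C": {"angelfish", "grouper", "wrasse"},
    "site_D": {"angelfish", "grouper"},
}
groups = average_linkage(sites, jaccard_distance, n_clusters=2)
```

The same routine could be run on %SF profiles by passing a different `dist` function, which is one way the two analyses mentioned in the abstract might share code.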
Abstract:
Elucidating the intricate relationship between brain structure and function, both in healthy and pathological conditions, is a key challenge for modern neuroscience. Recent progress in neuroimaging has helped advance our understanding of this important issue, with diffusion images providing information about structural connectivity (SC) and functional magnetic resonance imaging shedding light on resting-state functional connectivity (rsFC). Here, we adopt a systems approach, relying on modular hierarchical clustering, to jointly study SC and rsFC datasets gathered independently from healthy human subjects. Our novel approach allows us to find a common skeleton shared by structure and function from which a new, optimal brain partition can be extracted. We describe the emerging common structure-function modules (SFMs) in detail and compare them with commonly employed anatomical or functional parcellations. Our results underline the strong correspondence between brain structure and resting-state dynamics as well as the emerging coherent organization of the human brain.
Abstract:
We introduce the Pitman-Yor Diffusion Tree (PYDT) for hierarchical clustering, a generalization of the Dirichlet Diffusion Tree (Neal, 2001) which removes the restriction to binary branching structure. The generative process is described and shown to result in an exchangeable distribution over data points. We prove some theoretical properties of the model and then present two inference methods: a collapsed MCMC sampler which allows us to model uncertainty over tree structures, and a computationally efficient greedy Bayesian EM search algorithm. Both algorithms use message passing on the tree structure. The utility of the model and algorithms is demonstrated on synthetic and real-world data, both continuous and binary.
Abstract:
Multivariate classification methods were used to evaluate data on the concentrations of eight metals in human senile lenses measured by atomic absorption spectrometry. Principal components analysis and hierarchical clustering separated senile cataract lenses, nuclei from cataract lenses, and normal lenses into three classes on the basis of the eight elements. Stepwise discriminant analysis was applied to give discriminant functions with five selected variables. Results provided by the linear learning machine method were also satisfactory; the k-nearest neighbour method was less useful.
Abstract:
Small failures should only disrupt a small part of a network. One way to do this is by marking the surrounding area as untrustworthy --- circumscribing the failure. This can be done with a distributed algorithm using hierarchical clustering and neighbor relations, and the resulting circumscription is near-optimal for convex failures.
Abstract:
Q. Meng and M.H. Lee, 'Error-driven active learning in growing radial basis function networks for early robot learning', 2006 IEEE International Conference on Robotics and Automation (IEEE ICRA 2006), 2984-90, Orlando, Florida, USA.
Abstract:
Skøt, L., Humphreys, M. O., Armstead, I. P., Heywood, S., Skøt, K. P., Sanderson, R., Thomas, I. D., Chorlton, K. H., & Sackville Hamilton, N. R. (2005). An association mapping approach to identify flowering time genes in natural populations of Lolium perenne (L.). Molecular Breeding, 15(3), 233-245. Sponsorship: BBSRC RAE2008
Abstract:
The SIEGE (Smoking Induced Epithelial Gene Expression) database is a clinical resource for compiling and analyzing gene expression data from epithelial cells of the human intra-thoracic airway. This database supports a translational research study whose goal is to profile the changes in airway gene expression that are induced by cigarette smoke. RNA is isolated from airway epithelium obtained at bronchoscopy from current-, former- and never-smoker subjects, and hybridized to Affymetrix HG-U133A Genechips, which measure the level of expression of ~22 500 human transcripts. The microarray data generated, along with relevant patient information, are uploaded to SIEGE by study administrators using the database's web interface, found at http://pulm.bumc.bu.edu/siegeDB. PERL-coded scripts integrated with SIEGE perform various quality control functions including the processing, filtering and formatting of stored data. The R statistical package is used to import database expression values and execute a number of statistical analyses including t-tests, correlation coefficients and hierarchical clustering. Values from all statistical analyses can be queried through CGI-based tools and web forms found in the "Search" section of the database website. Query results are embedded with graphical capabilities as well as with links to other databases containing valuable gene resources, including Entrez Gene, GO, Biocarta, GeneCards, dbSNP and the NCBI Map Viewer.
Abstract:
We present a highly accurate method for classifying web pages based on link percentage, which is the percentage of text characters that are parts of links normalized by the number of all text characters on a web page. K-means clustering is used to create unique thresholds to differentiate index pages and article pages on individual web sites. Index pages contain mostly links to articles and other indices, while article pages contain mostly text. We also present a novel link grouping algorithm using agglomerative hierarchical clustering that groups links in the same spatial neighborhood together while preserving link structure. Grouping allows users with severe disabilities to use a scan-based mechanism to tab through a web page and select items. In experiments, we saw up to a 40-fold reduction in the number of commands needed to click on a link with a scan-based interface, which shows that we can vastly improve the rate of communication for users with disabilities. We used web page classification and link grouping to alter web page display on an accessible web browser that we developed to make a usable browsing interface for users with disabilities. Our classification method consistently outperformed a baseline classifier even when using minimal data to generate article and index clusters, and achieved classification accuracy of 94.0% on web sites with well-formed or slightly malformed HTML, compared with 80.1% accuracy for the baseline classifier.
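The link-percentage feature and a per-site threshold can be sketched as below. The abstract states only that k-means produces the thresholds, so deriving the threshold as the midpoint between two 1-D centroids is an assumption, and the function names and character counts are hypothetical.

```python
def link_percentage(link_chars, total_chars):
    """Share of a page's text characters that sit inside links, in percent."""
    return 100.0 * link_chars / total_chars if total_chars else 0.0

def two_means_threshold(values, iterations=20):
    """1-D k-means with k=2; returns the midpoint between the two centroids."""
    lo, hi = min(values), max(values)
    for _ in range(iterations):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not high:  # all values identical: nothing to separate
            break
        lo, hi = sum(low) / len(low), sum(high) / len(high)
    return (lo + hi) / 2.0

# Hypothetical (link characters, total characters) counts for pages of one site.
pages = [(900, 1000), (850, 1000), (50, 1000), (120, 1000), (80, 1000)]
percentages = [link_percentage(l, t) for l, t in pages]
threshold = two_means_threshold(percentages)
labels = ["index" if p > threshold else "article" for p in percentages]
```

Fitting the threshold separately for each web site, as the abstract describes, lets a link-heavy site and a text-heavy site each get a cut-off suited to their own page layout.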
Abstract:
BACKGROUND: A major challenge in oncology is the selection of the most effective chemotherapeutic agents for individual patients, while the administration of ineffective chemotherapy increases mortality and decreases quality of life in cancer patients. This emphasizes the need to evaluate every patient's probability of responding to each chemotherapeutic agent and limiting the agents used to those most likely to be effective. METHODS AND RESULTS: Using gene expression data on the NCI-60 and corresponding drug sensitivity, mRNA and microRNA profiles were developed representing sensitivity to individual chemotherapeutic agents. The mRNA signatures were tested in an independent cohort of 133 breast cancer patients treated with the TFAC (paclitaxel, 5-fluorouracil, adriamycin, and cyclophosphamide) chemotherapy regimen. To further dissect the biology of resistance, we applied signatures of oncogenic pathway activation and performed hierarchical clustering. We then used mRNA signatures of chemotherapy sensitivity to identify alternative therapeutics for patients resistant to TFAC. Profiles from mRNA and microRNA expression data represent distinct biologic mechanisms of resistance to common cytotoxic agents. The individual mRNA signatures were validated in an independent dataset of breast tumors (P = 0.002, NPV = 82%). When the accuracy of the signatures was analyzed based on molecular variables, the predictive ability was found to be greater in basal-like than non basal-like patients (P = 0.03 and P = 0.06). Samples from patients with co-activated Myc and E2F represented the cohort with the lowest percentage (8%) of responders. Using mRNA signatures of sensitivity to other cytotoxic agents, we predict that TFAC non-responders are more likely to be sensitive to docetaxel (P = 0.04), representing a viable alternative therapy. 
CONCLUSIONS: Our results suggest that the optimal strategy for chemotherapy sensitivity prediction integrates molecular variables such as ER and HER2 status with corresponding microRNA and mRNA expression profiles. Importantly, we also present evidence to support the concept that analysis of molecular variables can present a rational strategy to identifying alternative therapeutic opportunities.
Abstract:
Coastal zooplankton have been investigated since 1984 at a Long Term Ecological Research station MC (LTER-MC) in the inner Gulf of Naples (Tyrrhenian Sea, Western Mediterranean). The sampling site, located between the littoral and the open sea systems, has very active hydrography that affects plankton communities. The present work was aimed at establishing whether, in such a dynamic and variable environment, species associations and homogeneous periods could be identified as characteristic and stable features of the mesozooplankton over the period 1984–2006. Hierarchical clustering was applied to assess species associations based on a matrix of similarities between species (R-mode), and homogeneous periods based on a matrix of similarities between observations (Q-mode). The Indicator Value index [IndVal, Dufrene and Legendre (1997) Species assemblages and indicator species: the need for a flexible asymmetrical approach. Ecol. Monogr., 67, 345–366] was calculated to identify species characterizing each period. Five taxonomic groups with well-defined composition and abundance were identified as robust associations that likely reflect different modes of community functioning. The temporal course of these associations was largely shaped by strong seasonal forcing comprising both physical and biological (e.g. trophic) signals. These associations persisted over the long term, thus indicating some stable characters in the Naples zooplankton time-series, providing evidence of resilience in communities in highly variable coastal conditions.
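The difference between the two clustering modes comes down to which axis of the data table the similarities are computed along. The sketch below illustrates this with a made-up abundance table; Bray-Curtis similarity is an assumption, since the abstract does not name the similarity measure used.

```python
def bray_curtis_similarity(x, y):
    """1 - sum|x_i - y_i| / sum(x_i + y_i); 1 means identical abundance profiles."""
    total = sum(x) + sum(y)
    if total == 0:
        return 1.0
    return 1.0 - sum(abs(a - b) for a, b in zip(x, y)) / total

def similarity_matrix(rows):
    """Pairwise similarities between the rows of a table."""
    return [[bray_curtis_similarity(r, s) for s in rows] for r in rows]

# Hypothetical abundance table: rows are sampling dates, columns are species.
abundance = [
    [10, 8, 0, 0],    # date 1
    [12, 9, 1, 0],    # date 2
    [0, 1, 20, 15],   # date 3
]

# Q-mode: compare observations (rows) to find homogeneous periods.
q_mode = similarity_matrix(abundance)
# R-mode: transpose first, then compare species (columns) to find associations.
r_mode = similarity_matrix([list(col) for col in zip(*abundance)])
```

Either matrix can then be passed to the same hierarchical clustering routine; only the interpretation of the resulting groups changes.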
Abstract:
Juvenile idiopathic arthritis (JIA) comprises a poorly understood group of chronic, childhood-onset, autoimmune diseases with variable clinical outcomes. We investigated whether profiling of the synovial fluid (SF) proteome by a fluorescent dye-based, two-dimensional gel (DIGE) approach could distinguish patients in whom inflammation extends to affect a large number of joints early in the disease process. SF samples from 22 JIA patients were analyzed: 10 with oligoarticular arthritis, 5 with extended oligoarticular and 7 with polyarticular disease. SF samples were labeled with Cy dyes and separated by two-dimensional electrophoresis. Multivariate analyses were used to isolate a panel of proteins which distinguish patient subgroups. Proteins were identified using MALDI-TOF mass spectrometry, with expression further verified by Western immunoblotting and immunohistochemistry. Hierarchical clustering based on the expression levels of a set of 40 proteins segregated the extended oligoarticular from the oligoarticular patients (p < 0.05). Expression patterns of the isolated protein panel have also been observed over time, as disease spreads to multiple joints. The data indicate that synovial fluid proteome profiles could be used to stratify patients based on risk of disease extension. These protein profiles may also assist in monitoring therapeutic responses over time and help predict joint damage. © 2009 American Chemical Society.
Abstract:
The recent emergence of high-throughput arrays for methylation analysis has made the influence of tumor content on the interpretation of methylation levels increasingly pertinent. However, the degree to which tumor content influences the results, and the level of tumor content that makes a specimen acceptable for accurate analysis, remain unclear. Taking a systematic approach, we analyzed 98 unselected formalin-fixed and paraffin-embedded gastric tumors and matched normal tissue samples using the Illumina GoldenGate methylation assay. Unsupervised hierarchical clustering showed 2 separate clusters with a significant difference in average tumor content levels. The probes identified to be significantly differentially methylated between the tumors and normals also differed according to the tumor content of the samples included, with the sensitivity of identifying the