29 resultados para Datasets
em Cambridge University Engineering Department Publications Database
Resumo:
MOTIVATION: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct-but often complementary-information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets. RESULTS: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. In the two-dataset case, we show that MDI's performance is comparable with the present state-of-the-art. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation-chip and protein-protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques-as well as to non-integrative approaches-demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods.
Resumo:
Campylobacter jejuni is the most common bacterial cause of foodborne disease in the developed world. Its general physiology and biochemistry, as well as the mechanisms enabling it to colonize and cause disease in various hosts, are not well understood, and new approaches are required to understand its basic biology. High-throughput sequencing technologies provide unprecedented opportunities for functional genomic research. Recent studies have shown that direct Illumina sequencing of cDNA (RNA-seq) is a useful technique for the quantitative and qualitative examination of transcriptomes. In this study we report RNA-seq analyses of the transcriptomes of C. jejuni (NCTC11168) and its rpoN mutant. This has allowed the identification of hitherto unknown transcriptional units, and further defines the regulon that is dependent on rpoN for expression. The analysis of the NCTC11168 transcriptome was supplemented by additional proteomic analysis using liquid chromatography-MS. The transcriptomic and proteomic datasets represent an important resource for the Campylobacter research community. © 2011 SGM.
Resumo:
MOTIVATION: We present a method for directly inferring transcriptional modules (TMs) by integrating gene expression and transcription factor binding (ChIP-chip) data. Our model extends a hierarchical Dirichlet process mixture model to allow data fusion on a gene-by-gene basis. This encodes the intuition that co-expression and co-regulation are not necessarily equivalent and hence we do not expect all genes to group similarly in both datasets. In particular, it allows us to identify the subset of genes that share the same structure of transcriptional modules in both datasets. RESULTS: We find that by working on a gene-by-gene basis, our model is able to extract clusters with greater functional coherence than existing methods. By combining gene expression and transcription factor binding (ChIP-chip) data in this way, we are better able to determine the groups of genes that are most likely to represent underlying TMs. AVAILABILITY: If interested in the code for the work presented in this article, please contact the authors. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Resumo:
This paper presents an incremental learning solution for Linear Discriminant Analysis (LDA) and its applications to object recognition problems. We apply the sufficient spanning set approximation in three steps i.e. update for the total scatter matrix, between-class scatter matrix and the projected data matrix, which leads an online solution which closely agrees with the batch solution in accuracy while significantly reducing the computational complexity. The algorithm yields an efficient solution to incremental LDA even when the number of classes as well as the set size is large. The incremental LDA method has been also shown useful for semi-supervised online learning. Label propagation is done by integrating the incremental LDA into an EM framework. The method has been demonstrated in the task of merging large datasets which were collected during MPEG standardization for face image retrieval, face authentication using the BANCA dataset, and object categorisation using the Caltech101 dataset. © 2010 Springer Science+Business Media, LLC.
Resumo:
We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the non-parametric flexibility of Gaussian processes. This model accommodates input dependent signal and noise correlations between multiple response variables, input dependent length-scales and amplitudes, and heavy-tailed predictive distributions. We derive both efficient Markov chain Monte Carlo and variational Bayes inference procedures for this model. We apply GPRN as a multiple output regression and multivariate volatility model, demonstrating substantially improved performance over eight popular multiple output (multi-task) Gaussian process models and three multivariate volatility models on benchmark datasets, including a 1000 dimensional gene expression dataset.
Resumo:
In this paper a novel visualisation method for diffusion tensor MRI datasets is introduced. This is based on the use of Complex Wavelets in order to produce "stripy" textures which depict the anisotropic component of the diffusion tensors. Grey-scale pixel intensity is used to show the isotropic component. This paper also discusses enhancements of the technique for 3D visualisation. © 2004 IEEE.
Resumo:
Most previous work on trainable language generation has focused on two paradigms: (a) using a statistical model to rank a set of generated utterances, or (b) using statistics to inform the generation decision process. Both approaches rely on the existence of a handcrafted generator, which limits their scalability to new domains. This paper presents BAGEL, a statistical language generator which uses dynamic Bayesian networks to learn from semantically-aligned data produced by 42 untrained annotators. A human evaluation shows that BAGEL can generate natural and informative utterances from unseen inputs in the information presentation domain. Additionally, generation performance on sparse datasets is improved significantly by using certainty-based active learning, yielding ratings close to the human gold standard with a fraction of the data. © 2010 Association for Computational Linguistics.
Resumo:
A mixture of Gaussians fit to a single curved or heavy-tailed cluster will report that the data contains many clusters. To produce more appropriate clusterings, we introduce a model which warps a latent mixture of Gaussians to produce nonparametric cluster shapes. The possibly low-dimensional latent mixture model allows us to summarize the properties of the high-dimensional clusters (or density manifolds) describing the data. The number of manifolds, as well as the shape and dimension of each manifold is automatically inferred. We derive a simple inference scheme for this model which analytically integrates out both the mixture parameters and the warping function. We show that our model is effective for density estimation, performs better than infinite Gaussian mixture models at recovering the true number of clusters, and produces interpretable summaries of high-dimensional datasets.
Resumo:
This paper tackles the novel challenging problem of 3D object phenotype recognition from a single 2D silhouette. To bridge the large pose (articulation or deformation) and camera viewpoint changes between the gallery images and query image, we propose a novel probabilistic inference algorithm based on 3D shape priors. Our approach combines both generative and discriminative learning. We use latent probabilistic generative models to capture 3D shape and pose variations from a set of 3D mesh models. Based on these 3D shape priors, we generate a large number of projections for different phenotype classes, poses, and camera viewpoints, and implement Random Forests to efficiently solve the shape and pose inference problems. By model selection in terms of the silhouette coherency between the query and the projections of 3D shapes synthesized using the galleries, we achieve the phenotype recognition result as well as a fast approximate 3D reconstruction of the query. To verify the efficacy of the proposed approach, we present new datasets which contain over 500 images of various human and shark phenotypes and motions. The experimental results clearly show the benefits of using the 3D priors in the proposed method over previous 2D-based methods. © 2011 IEEE.
Resumo:
The Internet has enabled the creation of a growing number of large-scale knowledge bases in a variety of domains containing complementary information. Tools for automatically aligning these knowledge bases would make it possible to unify many sources of structured knowledge and answer complex queries. However, the efficient alignment of large-scale knowledge bases still poses a considerable challenge. Here, we present Simple Greedy Matching (SiGMa), a simple algorithm for aligning knowledge bases with millions of entities and facts. SiGMa is an iterative propagation algorithm which leverages both the structural information from the relationship graph as well as flexible similarity measures between entity properties in a greedy local search, thus making it scalable. Despite its greedy nature, our experiments indicate that SiGMa can efficiently match some of the world's largest knowledge bases with high precision. We provide additional experiments on benchmark datasets which demonstrate that SiGMa can outperform state-of-the-art approaches both in accuracy and efficiency.
Resumo:
A combination of singular systems analysis and analytic phase techniques are used to investigate the possible occurrence in observations of coherent synchronization between quasi-biennial and semi-annual oscillations (QBOs; SAOs) in the stratosphere and troposphere. Time series of zonal mean zonal winds near the Equator are analysed from the ERA-40 and ERA-interim reanalysis datasets over a ∼ 50-year period. In the stratosphere, the QBO is found to synchronize with the SAO almost all the time, but with a frequency ratio that changes erratically between 4:1, 5:1 and 6:1. A similar variable synchronization is also evident in the tropical troposphere between semi-annual and quasi-biennial cycles (known as TBOs). Mean zonal winds from ERA-40 and ERA-interim, and also time series of indices for the Indian and West Pacific monsoons, are commonly found to exhibit synchronization, with SAO/TBO ratios that vary between 4:1 and 7:1. Coherent synchronization between the QBO and tropical TBO does not appear to persist for long intervals, however. This suggests that both the QBO and tropical TBOs may be separately synchronized to SAOs that are themselves enslaved to the seasonal cycle, or to the annual cycle itself. However, the QBO and TBOs are evidently only weakly coupled between themselves and are frequently found to lose mutual coherence when each changes its frequency ratio to its respective SAO. This suggests a need to revise a commonly cited paradigm that advocates the use of stratospheric QBO indices as a predictor for tropospheric phenomena such as monsoons and hurricanes. © 2012 Royal Meteorological Society.
Resumo:
Real-time cardiac ultrasound allows monitoring the heart motion during intracardiac beating heart procedures. Our application assists atrial septal defect (ASD) closure techniques using real-time 3D ultrasound guidance. One major image processing challenge is the processing of information at high frame rate. We present an optimized block flow technique, which combines the probability-based velocity computation for an entire block with template matching. We propose adapted similarity constraints both from frame to frame, to conserve energy, and globally, to minimize errors. We show tracking results on eight in-vivo 4D datasets acquired from porcine beating-heart procedures. Computing velocity at the block level with an optimized scheme, our technique tracks ASD motion at 41 frames/s. We analyze the errors of motion estimation and retrieve the cardiac cycle in ungated images. © 2007 IEEE.