738 resultados para Annotation de génomes
Resumo:
Common diseases such as endometriosis (ED), Alzheimer's disease (AD) and multiple sclerosis (MS) account for a significant proportion of the health care burden in many countries. Genome-wide association studies (GWASs) for these diseases have identified a number of individual genetic variants contributing to the risk of those diseases. However, the effect size for most variants is small and collectively the known variants explain only a small proportion of the estimated heritability. We used a linear mixed model to fit all single nucleotide polymorphisms (SNPs) simultaneously, and estimated genetic variances on the liability scale using SNPs from GWASs in unrelated individuals for these three diseases. For each of the three diseases, case and control samples were not all genotyped in the same laboratory. We demonstrate that a careful analysis can obtain robust estimates, but also that insufficient quality control (QC) of SNPs can lead to spurious results and that too stringent QC is likely to remove real genetic signals. Our estimates show that common SNPs on commercially available genotyping chips capture significant variation contributing to liability for all three diseases. The estimated proportion of total variation tagged by all SNPs was 0.26 (SE 0.04) for ED, 0.24 (SE 0.03) for AD and 0.30 (SE 0.03) for MS. Further, we partitioned the genetic variance explained into five categories by a minor allele frequency (MAF), by chromosomes and gene annotation. We provide strong evidence that a substantial proportion of variation in liability is explained by common SNPs, and thereby give insights into the genetic architecture of the diseases.
Resumo:
Video surveillance infrastructure has been widely installed in public places for security purposes. However, live video feeds are typically monitored by human staff, making the detection of important events as they occur difficult. As such, an expert system that can automatically detect events of interest in surveillance footage is highly desirable. Although a number of approaches have been proposed, they have significant limitations: supervised approaches, which can detect a specific event, ideally require a large number of samples with the event spatially and temporally localised; while unsupervised approaches, which do not require this demanding annotation, can only detect whether an event is abnormal and not specific event types. To overcome these problems, we formulate a weakly-supervised approach using Kullback-Leibler (KL) divergence to detect rare events. The proposed approach leverages the sparse nature of the target events to its advantage, and we show that this data imbalance guarantees the existence of a decision boundary to separate samples that contain the target event from those that do not. This trait, combined with the coarse annotation used by weakly supervised learning (that only indicates approximately when an event occurs), greatly reduces the annotation burden while retaining the ability to detect specific events. Furthermore, the proposed classifier requires only a decision threshold, simplifying its use compared to other weakly supervised approaches. We show that the proposed approach outperforms state-of-the-art methods on a popular real-world traffic surveillance dataset, while preserving real time performance.
Resumo:
Clustering identities in a video is a useful task to aid in video search, annotation and retrieval, and cast identification. However, reliably clustering faces across multiple videos is challenging task due to variations in the appearance of the faces, as videos are captured in an uncontrolled environment. A person's appearance may vary due to session variations including: lighting and background changes, occlusions, changes in expression and make up. In this paper we propose the novel Local Total Variability Modelling (Local TVM) approach to cluster faces across a news video corpus; and incorporate this into a novel two stage video clustering system. We first cluster faces within a single video using colour, spatial and temporal cues; after which we use face track modelling and hierarchical agglomerative clustering to cluster faces across the entire corpus. We compare different face recognition approaches within this framework. Experiments on a news video database show that the Local TVM technique is able effectively model the session variation observed in the data, resulting in improved clustering performance, with much greater computational efficiency than other methods.
Automatic detection of diabetic foot complications with infrared thermography by asymmetric analysis
Resumo:
Early identification of diabetic foot complications and their precursors is essential in preventing their devastating consequences, such as foot infection and amputation. Frequent, automatic risk assessment by an intelligent telemedicine system might be feasible and cost effective. Infrared thermography is a promising modality for such a system. The temperature differences between corresponding areas on contralateral feet are the clinically significant parameters. This asymmetric analysis is hindered by (1) foot segmentation errors, especially when the foot temperature and the ambient temperature are comparable, and by (2) different shapes and sizes between contralateral feet due to deformities or minor amputations. To circumvent the first problem, we used a color image and a thermal image acquired synchronously. Foot regions, detected in the color image, were rigidly registered to the thermal image. This resulted in 97.8% ± 1.1% sensitivity and 98.4% ± 0.5% specificity over 76 high-risk diabetic patients with manual annotation as a reference. Nonrigid landmark-based registration with Bsplines solved the second problem. Corresponding points in the two feet could be found regardless of the shapes and sizes of the feet. With that, the temperature difference of the left and right feet could be obtained.
Resumo:
The time of the large sequencing projects has enabled unprecedented possibilities of investigating more complex aspects of living organisms. Among the high-throughput technologies based on the genomic sequences, the DNA microarrays are widely used for many purposes, including the measurement of the relative quantity of the messenger RNAs. However, the reliability of microarrays has been strongly doubted as robust analysis of the complex microarray output data has been developed only after the technology had already been spread in the community. An objective of this study consisted of increasing the performance of microarrays, and was measured by the successful validation of the results by independent techniques. To this end, emphasis has been given to the possibility of selecting candidate genes with remarkable biological significance within specific experimental design. Along with literature evidence, the re-annotation of the probes and model-based normalization algorithms were found to be beneficial when analyzing Affymetrix GeneChip data. Typically, the analysis of microarrays aims at selecting genes whose expression is significantly different in different conditions followed by grouping them in functional categories, enabling a biological interpretation of the results. Another approach investigates the global differences in the expression of functionally related groups of genes. Here, this technique has been effective in discovering patterns related to temporal changes during infection of human cells. Another aspect explored in this thesis is related to the possibility of combining independent gene expression data for creating a catalog of genes that are selectively expressed in healthy human tissues. Not all the genes present in human cells are active; some involved in basic activities (named housekeeping genes) are expressed ubiquitously. Other genes (named tissue-selective genes) provide more specific functions and they are expressed preferably in certain cell types or tissues. Defining the tissue-selective genes is also important as these genes can cause disease with phenotype in the tissues where they are expressed. The hypothesis that gene expression could be used as a measure of the relatedness of the tissues has been also proved. Microarray experiments provide long lists of candidate genes that are often difficult to interpret and prioritize. Extending the power of microarray results is possible by inferring the relationships of genes under certain conditions. Gene transcription is constantly regulated by the coordinated binding of proteins, named transcription factors, to specific portions of the its promoter sequence. In this study, the analysis of promoters from groups of candidate genes has been utilized for predicting gene networks and highlighting modules of transcription factors playing a central role in the regulation of their transcription. Specific modules have been found regulating the expression of genes selectively expressed in the hippocampus, an area of the brain having a central role in the Major Depression Disorder. Similarly, gene networks derived from microarray results have elucidated aspects of the development of the mesencephalon, another region of the brain involved in Parkinson Disease.
Resumo:
Regulated transcription controls the diversity, developmental pathways and spatial organization of the hundreds of cell types that make up a mammal. Using single-molecule cDNA sequencing, we mapped transcription start sites (TSSs) and their usage in human and mouse primary cells, cell lines and tissues to produce a comprehensive overview of mammalian gene expression across the human body. We find that few genes are truly 'housekeeping', whereas many mammalian promoters are composite entities composed of several closely separated TSSs, with independent cell-type-specific expression profiles. TSSs specific to different cell types evolve at different rates, whereas promoters of broadly expressed genes are the most conserved. Promoter-based expression analysis reveals key transcription factors defining cell states and links them to binding-site motifs. The functions of identified novel transcripts can be predicted by coexpression and sample ontology enrichment analyses. The functional annotation of the mammalian genome 5 (FANTOM5) project provides comprehensive expression profiles and functional annotation of mammalian cell-type-specific transcriptomes with wide applications in biomedical research.
Resumo:
Bacilysin is a non-ribosomally synthesized dipeptide antibiotic that is active against a wide range of bacteria and some fungi. Synthesis of bacilysin (L-alanine-[2,3-epoxycyclohexano-4]-L-alanine) is achieved by proteins in the bac operon, also referred to as the bacABCDE (ywfBCDEF) gene cluster in B. subtilis. Extensive genetic analysis from several strains of B. subtilis suggests that the bacABC gene cluster encodes all the proteins that synthesize the epoxyhexanone ring of L-anticapsin. These data, however, were not consistent with the putative functional annotation for these proteins whereby BacA, a prephenate dehydratase along with a potential isomerase/guanylyl transferase, BacB and an oxidoreductase, BacC, could synthesize L-anticapsin. Here we demonstrate that BacA is a decarboxylase that acts on prephenate. Further, based on the biochemical characterization and the crystal structure of BacB, we show that BacB is an oxidase that catalyzes the synthesis of 2-oxo-3-(4-oxocyclohexa-2,5-dienyl)propanoic acid, a precursor to L-anticapsin. This protein is a bi-cupin, with two putative active sites each containing a bound metal ion. Additional electron density at the active site of the C-terminal domain of BacB could be interpreted as a bound phenylpyruvic acid. A significant decrease in the catalytic activity of a point variant of BacB with a mutation at the N-terminal domain suggests that the N-terminal cupin domain is involved in catalysis.
Resumo:
Background: The members of cupin superfamily exhibit large variations in their sequences, functions, organization of domains, quaternary associations and the nature of bound metal ion, despite having a conserved beta-barrel structural scaffold. Here, an attempt has been made to understand structure-function relationships among the members of this diverse superfamily and identify the principles governing functional diversity. The cupin superfamily also contains proteins for which the structures are available through world-wide structural genomics initiatives but characterized as ``hypothetical''. We have explored the feasibility of obtaining clues to functions of such proteins by means of comparative analysis with cupins of known structure and function. Methodology/Principal Findings: A 3-D structure-based phylogenetic approach was undertaken. Interestingly, a dendrogram generated solely on the basis of structural dissimilarity measure at the level of domain folds was found to cluster functionally similar members. This clustering also reflects an independent evolution of the two domains in bicupins. Close examination of structural superposition of members across various functional clusters reveals structural variations in regions that not only form the active site pocket but are also involved in interaction with another domain in the same polypeptide or in the oligomer. Conclusions/Significance: Structure-based phylogeny of cupins can influence identification of functions of proteins of yet unknown function with cupin fold. This approach can be extended to other proteins with a common fold that show high evolutionary divergence. This approach is expected to have an influence on the function annotation in structural genomics initiatives.
Resumo:
In the current era of high-throughput sequencing and structure determination, functional annotation has become a bottleneck in biomedical science. Here, we show that automated inference of molecular function using functional linkages among genes increases the accuracy of functional assignments by >= 8% and enriches functional descriptions in >= 34% of top assignments. Furthermore, biochemical literature supports >80% of automated inferences for previously unannotated proteins. These results emphasize the benefit of incorporating functional linkages in protein annotation.
Resumo:
Autoimmune diseases are more common in dogs than in humans and are already threatening the future of some highly predisposed dog breeds. Susceptibility to autoimmune diseases is controlled by environmental and genetic factors, especially the major histocompatibility complex (MHC) gene region. Dogs show a similar physiology, disease presentation and clinical response as humans, making them an excellent disease model for autoimmune diseases common to both species. The genetic background of canine autoimmune disorders is largely unknown, but recent annotation of the dog genome and subsequent development of new genomic tools offer a unique opportunity to map novel autoimmune genes in various breeds. Many autoimmune disorders show breed-specific enrichment, supporting a strong genetic background. Furthermore, the presence of hundreds of breeds as genetic isolates facilitates gene mapping in complex autoimmune disorders. Identification of novel predisposing genes establishes breeds as models and may reveal novel candidate genes for the corresponding human disorders. Genetic studies will eventually shed light on common biological functions and interactions between genes and the environment. This study aimed to identify genetic risk factors in various autoimmune disorders, including systemic lupus erythematosus (SLE)-related diseases, comprising immune-mediated rheumatic disease (IMRD) and steroid-responsive meningitis arteritis (SMRA) as well as Addison s disease (AD) in Nova Scotia Duck Tolling Retrievers (NSDTRs) and chronic superficial keratitis (CSK) in German Shepherd dogs (GSDs). We used two different approaches to identify genetic risk factors. Firstly, a candidate gene approach was applied to test the potential association of MHC class II, also known as a dog leukocyte antigen (DLA) in canine species. Secondly, a genome-wide association study (GWAS) was performed to identify novel risk loci for SLE-related disease and AD in NSDTRs. We identified DLA risk haplotypes for an IMRD subphenotype of SLE-related disease, AD and CSK, but not in SMRA, and show that the MHC class II gene region is a major genetic risk factor in canine autoimmune diseases. An elevated risk was found for IMRD in dogs that carried the DLA-DRB1*00601/DQA1*005011/DQB1*02001 haplotype (OR = 2.0, 99% CI = 1.03-3.95, p = 0.01) and for ANA-positive IMRD dogs (OR = 2.3, 99% CI = 1.07-5.04, p-value 0.007). We also found that DLA-DRB1*01502/DQA*00601/DQB1*02301 haplotype was significantly associated with AD in NSDTRs (OR = 2.1, CI = 1.0-4.4, P = 0.044) and the DLA-DRB1*01501/DQA1*00601/DQB1*00301 haplotype with the CSK in GSDs (OR=2.67, CI=1.17-6.44, p= 0.02). In addition, we found that homozygosity for the risk haplotype increases the risk for each disease phenotype and that an overall homozygosity for the DLA region predisposes to CSK and AD. Our results have enabled the development of genetic tests to improve breeding practices by avoiding the production of puppies homozygous for risk haplotypes. We also performed the first successful GWAS for a complex disease in dogs. With less than 100 cases and 100 controls, we identified five risk loci for SLE-related disease and AD and found strong candidate genes involved in a novel T-cell activation pathway. We show that an inbred dog population has fewer risk factors, but each of them has a stronger genetic risk. Ongoing studies aim to identify the causative mutations and bring new knowledge to help diagnostics, treatment and understanding of the aetiology of SLE-related diseases.
Resumo:
Gene expression is one of the most critical factors influencing the phenotype of a cell. As a result of several technological advances, measuring gene expression levels has become one of the most common molecular biological measurements to study the behaviour of cells. The scientific community has produced enormous and constantly increasing collection of gene expression data from various human cells both from healthy and pathological conditions. However, while each of these studies is informative and enlighting in its own context and research setup, diverging methods and terminologies make it very challenging to integrate existing gene expression data to a more comprehensive view of human transcriptome function. On the other hand, bioinformatic science advances only through data integration and synthesis. The aim of this study was to develop biological and mathematical methods to overcome these challenges and to construct an integrated database of human transcriptome as well as to demonstrate its usage. Methods developed in this study can be divided in two distinct parts. First, the biological and medical annotation of the existing gene expression measurements needed to be encoded by systematic vocabularies. There was no single existing biomedical ontology or vocabulary suitable for this purpose. Thus, new annotation terminology was developed as a part of this work. Second part was to develop mathematical methods correcting the noise and systematic differences/errors in the data caused by various array generations. Additionally, there was a need to develop suitable computational methods for sample collection and archiving, unique sample identification, database structures, data retrieval and visualization. Bioinformatic methods were developed to analyze gene expression levels and putative functional associations of human genes by using the integrated gene expression data. Also a method to interpret individual gene expression profiles across all the healthy and pathological tissues of the reference database was developed. As a result of this work 9783 human gene expression samples measured by Affymetrix microarrays were integrated to form a unique human transcriptome resource GeneSapiens. This makes it possible to analyse expression levels of 17330 genes across 175 types of healthy and pathological human tissues. Application of this resource to interpret individual gene expression measurements allowed identification of tissue of origin with 92.0% accuracy among 44 healthy tissue types. Systematic analysis of transcriptional activity levels of 459 kinase genes was performed across 44 healthy and 55 pathological tissue types and a genome wide analysis of kinase gene co-expression networks was done. This analysis revealed biologically and medically interesting data on putative kinase gene functions in health and disease. Finally, we developed a method for alignment of gene expression profiles (AGEP) to perform analysis for individual patient samples to pinpoint gene- and pathway-specific changes in the test sample in relation to the reference transcriptome database. We also showed how large-scale gene expression data resources can be used to quantitatively characterize changes in the transcriptomic program of differentiating stem cells. Taken together, these studies indicate the power of systematic bioinformatic analyses to infer biological and medical insights from existing published datasets as well as to facilitate the interpretation of new molecular profiling data from individual patients.
Resumo:
We outline the design and creation of a syntactically and morphologically annotated corpora of Finnish for use by the research community. We motivate a definitional, systematic “grammar definition corpus” as a first step in an three-year annotation effort to help create higher-quality, better-documented extensive parsebanks at a later stage. The syntactic representation, consisting of a dependency structure and a basic set of dependency functions, is outlined with examples. Reference is made to double-blind annotation experiments to measure the applicability of the newgrammar definition corpus methodology.
Resumo:
Computer Vision has seen a resurgence in the parts-based representation for objects over the past few years. The parts are usually annotated beforehand for training. We present an annotation free parts-based representation for the pedestrian using Non-Negative Matrix Factorization (NMF). We show that NMF is able to capture the wide range of pose and clothing of the pedestrians. We use a modified form of NMF i.e. NMF with sparsity constraints on the factored matrices. We also make use of Riemannian distance metric for similarity measurements in NMF space as the basis vectors generated by NMF aren't orthogonal. We show that for 1% drop in accuracy as compared to the Histogram of Oriented Gradients (HOG) representation we can achieve robustness to partial occlusion.
Resumo:
Depth measures the extent of atom/residue burial within a protein. It correlates with properties such as protein stability, hydrogen exchange rate, protein-protein interaction hot spots, post-translational modification sites and sequence variability. Our server, DEPTH, accurately computes depth and solvent-accessible surface area (SASA) values. We show that depth can be used to predict small molecule ligand binding cavities in proteins. Often, some of the residues lining a ligand binding cavity are both deep and solvent exposed. Using the depth-SASA pair values for a residue, its likelihood to form part of a small molecule binding cavity is estimated. The parameters of the method were calibrated over a training set of 900 high-resolution X-ray crystal structures of single-domain proteins bound to small molecules (molecular weight < 1.5 KDa). The prediction accuracy of DEPTH is comparable to that of other geometry-based prediction methods including LIGSITE, SURFNET and Pocket-Finder (all with Matthew's correlation coefficient of similar to 0.4) over a testing set of 225 single and multi-chain protein structures. Users have the option of tuning several parameters to detect cavities of different sizes, for example, geometrically flat binding sites. The input to the server is a protein 3D structure in PDB format. The users have the option of tuning the values of four parameters associated with the computation of residue depth and the prediction of binding cavities. The computed depths, SASA and binding cavity predictions are displayed in 2D plots and mapped onto 3D representations of the protein structure using Jmol. Links are provided to download the outputs. Our server is useful for all structural analysis based on residue depth and SASA, such as guiding site-directed mutagenesis experiments and small molecule docking exercises, in the context of protein functional annotation and drug discovery.
Resumo:
Background: Sensitive remote homology detection and accurate alignments especially in the midnight zone of sequence similarity are needed for better function annotation and structural modeling of proteins. An algorithm, AlignHUSH for HMM-HMM alignment has been developed which is capable of recognizing distantly related domain families The method uses structural information, in the form of predicted secondary structure probabilities, and hydrophobicity of amino acids to align HMMs of two sets of aligned sequences. The effect of using adjoining column(s) information has also been investigated and is found to increase the sensitivity of HMM-HMM alignments and remote homology detection. Results: We have assessed the performance of AlignHUSH using known evolutionary relationships available in SCOP. AlignHUSH performs better than the best HMM-HMM alignment methods and is observed to be even more sensitive at higher error rates. Accuracy of the alignments obtained using AlignHUSH has been assessed using the structure-based alignments available in BaliBASE. The alignment length and the alignment quality are found to be appropriate for homology modeling and function annotation. The alignment accuracy is found to be comparable to existing methods for profile-profile alignments. Conclusions: A new method to align HMMs has been developed and is shown to have better sensitivity at error rates of 10% and above when compared to other available programs. The proposed method could effectively aid obtaining clues to functions of proteins of yet unknown function. A web-server incorporating the AlignHUSH method is available at http://crick.mbu.iisc.ernet.in/similar to alignhush/