738 results for Annotation de génomes
Abstract:
This paper describes the efforts at MILE lab, IISc, to create a 100,000-word database each in Kannada and Tamil for the design and development of online handwriting recognition. The data have been collected from over 600 users in order to capture variations in writing style. We describe features of the scripts and how the number of symbols was reduced so that recognizers can be trained effectively. The list of words includes all the characters, Kannada and Indo-Arabic numerals, punctuation marks and other symbols. A semi-automated tool is used to annotate the data from the stroke level to the word level. It segments each word into stroke groups and also acts as a validation mechanism for segmentation. The tool displays the strokes, stroke groups and aksharas of a word and hence can be used to study the various styles of writing and delayed strokes, and to assign quality tags to the words. The tool is currently being used for annotating Tamil and Kannada data. The output is stored in a standard XML format.
Abstract:
A decade after the Mycobacterium tuberculosis (Mtb) genome sequence became available, no promising drug has seen the light of day. This not only indicates the challenges in discovering new drugs but also suggests a gap in our current understanding of Mtb biology. We attempt to bridge this gap by carrying out extensive re-annotation and constructing a systems-level protein interaction map of Mtb, with the objective of finding novel drug target candidates. Towards this, we combined crowdsourcing and social networking methods through an initiative called "Connect to Decode" (C2D) to generate the first and largest manually curated interactome of Mtb, termed the "interactome pathway" (IPW), encompassing a total of 1434 proteins connected through 2575 functional relationships. Interactions leading to gene regulation, signal transduction, metabolism and structural complex formation have been catalogued. In the process, we have functionally annotated 87% of the Mtb genome in the context of gene products. We further combine IPW with a STRING-based network to report central proteins, which may be assessed as potential drug targets for the development of drugs with the fewest possible side effects. The fact that five of the 17 predicted drug targets are already experimentally validated, either genetically or biochemically, lends credence to our approach.
Abstract:
Protein functional annotation relies on the identification of accurate relationships, sequence divergence being a key factor. This is especially evident when distant protein relationships are demonstrated only with three-dimensional structures. To address this challenge, we describe a computational approach to purposefully bridge gaps between related protein families through directed design of protein-like "linker" sequences. For this, we represented SCOP domain families, integrated with sequence homologues, as multiple profiles and performed HMM-HMM alignments between related domain families. Where convincing alignments were achieved, we applied a roulette wheel-based method to design 3,611,010 protein-like sequences corresponding to 374 SCOP folds. To analyze their ability to link proteins in homology searches, we used 3024 queries to search two databases, one containing only natural sequences and another additionally containing designed sequences. Searches in the augmented database showed up to 30% improvement in fold coverage for over 74% of the folds, with 52 folds achieving all theoretically possible connections. Although sequences could not be designed between some families, the availability of designed sequences between other families within the fold established the sequence continuum to demonstrate 373 difficult relationships. Ultimately, as a practical and realistic extension, we demonstrate that such protein-like sequences can be "plugged into" routine and generic sequence database searches to empower not only remote homology detection but also fold recognition. Our statistically well-supported findings show that complementary searches in both databases will increase the effectiveness of sequence-based searches in recognizing all homologues sharing a common fold. (C) 2013 Elsevier Ltd. All rights reserved.
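The roulette wheel-based design step mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the actual procedure, so the following only shows generic roulette-wheel selection applied column by column to a residue-probability profile; the profile values and the names `roulette_wheel_pick`, `design_sequence` and `toy_profile` are hypothetical.

```python
import random


def roulette_wheel_pick(probabilities, rng):
    """Pick one residue with chance proportional to its weight."""
    total = sum(probabilities.values())
    spin = rng.random() * total  # where the "ball" lands on the wheel
    cumulative = 0.0
    for residue, weight in probabilities.items():
        cumulative += weight
        if spin < cumulative:
            return residue
    return residue  # guard against floating-point round-off on the last slice


def design_sequence(profile, rng):
    """Sample one protein-like sequence, one residue per profile column."""
    return "".join(roulette_wheel_pick(column, rng) for column in profile)


# Hypothetical three-column profile (not taken from the paper).
toy_profile = [
    {"A": 0.7, "G": 0.3},
    {"L": 1.0},
    {"K": 0.5, "R": 0.5},
]

rng = random.Random(0)  # fixed seed so repeated runs give the same sequences
designed = [design_sequence(toy_profile, rng) for _ in range(5)]
print(designed)
```

Residues that are frequent in a column are drawn more often, so the sampled sequences stay close to the profile while still varying from one draw to the next.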
Abstract:
Background: The number of genome-wide association studies (GWAS) has increased rapidly in the past couple of years, resulting in the identification of genes associated with different diseases. The next step in translating these findings into biomedically useful information is to find out the mechanism of action of these genes. However, GWAS often implicate genes whose functions are currently unknown; for example, MYEOV, ANKLE1, TMEM45B and ORAOV1 are found to be associated with breast cancer, but their molecular function is unknown. Results: We carried out Bayesian inference of Gene Ontology (GO) term annotations of genes by employing the directed acyclic graph structure of GO and the network of protein-protein interactions (PPIs). The approach is based on the premise that two proteins that interact biophysically would be in physical proximity to each other, would possess complementary molecular functions, and would play roles in related biological processes. Predicted GO terms were ranked according to their relative association scores, and the approach was evaluated quantitatively by plotting precision versus recall and F-scores (the harmonic mean of precision and recall) versus varying thresholds. Precisions of ~58% and ~40% for the localization and functions of proteins, respectively, were determined at a threshold of ~30 (top 30 GO terms in the ranked list). Comparison with function prediction based on semantic similarity among nodes in an ontology, with those similarities incorporated in a k-nearest-neighbor classifier, confirmed that our results compared favorably. Conclusions: This approach was applied to predict the cellular component and molecular function GO terms of all human proteins that have interacting partners possessing at least one known GO annotation. The list of predictions is available at http://severus.dbmi.pitt.edu/engo/GOPRED.html.
We present the algorithm, evaluations and the results of the computational predictions, especially for genes identified in GWAS studies to be associated with diseases, which are of translational interest.
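The evaluation described in this abstract (precision, recall and F-score at a rank threshold) can be sketched as follows. This is a minimal illustration rather than the authors' evaluation code; the function name `precision_recall_f` and the toy term lists are assumptions, and the GO identifiers serve only as example labels.

```python
def precision_recall_f(ranked_terms, true_terms, k):
    """Score the top-k ranked predictions against a set of known annotations."""
    top_k = ranked_terms[:k]
    hits = sum(1 for term in top_k if term in true_terms)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(true_terms) if true_terms else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-score: harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy data (hypothetical, not from the paper): predictions ranked by score.
ranked = ["GO:0005634", "GO:0005737", "GO:0005886", "GO:0005739"]
known = {"GO:0005634", "GO:0005739", "GO:0005829"}

for k in (2, 4):
    p, r, f = precision_recall_f(ranked, known, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Sweeping `k` over a range of thresholds and plotting the resulting values gives the precision-versus-recall and F-score-versus-threshold curves that the abstract refers to.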
Abstract:
The structural annotation of proteins with no detectable homologs of known 3D structure identified using sequence-search methods is a major challenge today. We propose an original method that computes the conditional probabilities for the amino-acid sequence of a protein to fit known protein 3D structures using a structural alphabet known as Protein Blocks (PBs). PBs constitute a library of 16 local structural prototypes that approximate every part of protein backbone structures. They are used to encode 3D protein structures into 1D PB sequences and to capture sequence-to-structure relationships. Our method relies on amino acid occurrence matrices, one for each PB, to score global and local threading of query amino acid sequences to protein folds encoded into PB sequences. It does not use any information from residue contacts or sequence-search methods, nor does it explicitly incorporate the hydrophobic effect. The performance of the method was assessed with independent test datasets derived from SCOP 1.75A. With a Z-score cutoff that achieved 95% specificity (i.e., less than 5% false positives), global and local threading showed sensitivity of 64.1% and 34.2%, respectively. We further tested its performance on 57 difficult CASP10 targets that had no known homologs in PDB: 38 compatible templates were identified by our approach and 66% of these hits yielded correctly predicted structures. This method scales up well and offers promising perspectives for structural annotations at the genomic level. It has been implemented in the form of a web server that is freely available at http://www.bo-protscience.fr/forsa.
Abstract:
In recent times, zebrafish has gained considerable popularity as a model organism for studying human cancers. Despite its high evolutionary divergence from humans, zebrafish develops almost all types of human tumors when induced. However, the mechanistic details of tumor formation have remained largely unknown. The present study analyzes the repertoire of kinases in the zebrafish proteome to provide insights into various cellular components. Annotation using highly sensitive remote homology detection methods revealed a "substantial expansion" of the Ser/Thr/Tyr kinase family in zebrafish compared to humans, constituting over 3% of the proteome. Subsequent classification of kinases into subfamilies revealed a large number of CAMK group kinases, with massive representation of PIM kinases, which are important for cell cycle regulation and growth. Extensive sequence comparison between human and zebrafish PIM kinases revealed high conservation of functionally important residues, with a few organism-specific variations. There are about 300 PIM kinases in the zebrafish kinome, whereas the human genome codes for only about 500 kinases altogether. PIM kinases have been implicated in various human cancers and are currently being targeted to explore their therapeutic potential. Hence, in-depth analysis of PIM kinases in zebrafish has opened up new avenues of research to verify the model organism status of zebrafish.
Abstract:
Background: In the post-genomic era, where sequences are being determined at a rapid rate, we are highly reliant on computational methods for their tentative biochemical characterization. The Pfam database currently contains 3,786 families corresponding to "Domains of Unknown Function" (DUF) or "Uncharacterized Protein Family" (UPF), of which 3,087 families have no reported three-dimensional structure, constituting almost one-fourth of the known protein families in search of both structure and function. Results: We applied a "computational structural genomics" approach using five state-of-the-art remote similarity detection methods to detect relationships between uncharacterized DUFs and domain families of known structure. The association with a structural domain family could serve as a starting point in elucidating the function of a DUF. Amongst these five methods, searches in the SCOP-NrichD database have been applied for the first time. Predictions were classified as high, medium or low confidence based on the consensus of results from the various approaches and were also annotated with enzyme and Gene Ontology terms. 614 uncharacterized DUFs could be associated with a known structural domain, of which high-confidence predictions, involving at least four methods, were made for 54 families. These structure-function relationships for the 614 DUF families can be accessed online at http://proline.biochem.iisc.ernet.in/RHD_DUFS/. For potential enzymes in this set, we assessed their compatibility with the associated fold and performed detailed structural and functional annotation by examining alignments and the extent of conservation of functional residues. Detailed discussion is provided for interesting assignments for DUF3050, DUF1636, DUF1572, DUF2092 and DUF659. Conclusions: This study provides insights into the structure and potential function of nearly 20% of the DUFs.
The use of different computational approaches enables us to reliably recognize distant relationships, especially when they converge on a common assignment, because the methods are often complementary. We observe that while pointers to the structural domain can offer the right clues to the function of a protein, recognition of its precise functional role is still non-trivial, with many DUF domains conserving only some of the critical residues. It is not clear whether these are functional vestiges or instances involving alternate substrates and interacting partners. Reviewers: This article was reviewed by Drs Eugene Koonin, Frank Eisenhaber and Srikrishna Subramanian.
Abstract:
Background: Candida auris is a multidrug-resistant, emerging agent of fungemia in humans. Its actual global distribution remains obscure because current commercial methods of clinical diagnosis misidentify it as C. haemulonii. Here we report the first draft genome of C. auris to explore the genomic basis of virulence and unique differences that could be employed for differential diagnosis. Results: More than 99.5% of the C. auris genomic reads did not align to the current whole (or draft) genome sequences of Candida albicans, Candida lusitaniae, Candida glabrata and Saccharomyces cerevisiae, thereby indicating its divergence from the active Candida clade. The genome spans around 12.49 Mb with 8527 predicted genes. Functional annotation revealed that, among the sequenced Candida species, it is closest to the hemiascomycete species Clavispora lusitaniae. Comparison with the well-studied species Candida albicans showed that it shares significant virulence attributes with other pathogenic Candida species, such as oligopeptide transporters, mannosyl transferases, secreted proteases and genes involved in biofilm formation. We also identified a plethora of transporters belonging to the ABC and major facilitator superfamilies, along with known MDR transcription factors, which explain its high tolerance to antifungal drugs. Conclusions: Our study emphasizes an urgent need for accurate fungal screening methods such as PCR and electrophoretic karyotyping to ensure proper management of fungemia. Our work highlights the potential genetic mechanisms involved in the virulence and pathogenicity of an important emerging human pathogen, namely C. auris. Owing to its diversity at the genomic scale, we expect the genome sequence to be a useful resource for mapping species-specific differences that will help develop accurate diagnostic markers and better drug targets.
Abstract:
We have developed an integrated database for Mycobacterium tuberculosis H37Rv (Mtb) that collates information on protein sequences, domain assignments, functional annotation and 3D structural information along with protein-protein and protein-small molecule interactions. SInCRe (Structural Interactome Computational Resource) has been developed out of the CamBan (Cambridge and Bangalore) collaboration. The motivation for developing this database is to provide an integrated platform that allows easy access to, and interpretation of, the data and results obtained by all the groups in CamBan in the field of Mtb informatics. In-house algorithms and databases developed independently by various academic groups in CamBan are used to generate Mtb-specific datasets, which are integrated in this database to provide a structural dimension to studies on tuberculosis. The SInCRe database readily provides information on the identification of functional domains, genome-scale modelling of structures of Mtb proteins and characterization of the small-molecule binding sites within Mtb. The resource also provides structure-based function annotation and information on small-molecule binders, including FDA (Food and Drug Administration)-approved drugs, protein-protein interactions (PPIs) and natural compounds that potentially bind to pathogen proteins and weaken or eliminate host-pathogen protein-protein interactions. Together they provide prerequisites for the identification of off-target binding.
Abstract:
[ENG] Aiming at an integrated and mechanistic view of the early biological effects of selected metals in the marine sentinel organism Mytilus galloprovincialis, we exposed mussels for 48 hours to 50, 100 and 200 nM solutions of equimolar Cd, Cu and Hg salts and measured cytological and molecular biomarkers in parallel. Focusing on the mussel gills, the first target of toxic water contaminants and an actively proliferating tissue, we detected significant dose-related increases in cells with micronuclei and other nuclear abnormalities in the treated mussels, with differences in the bioconcentration of the three metals determined in the mussel flesh by atomic absorption spectrometry. Gene expression profiles, determined in the same individual gills in parallel, revealed some transcriptional changes at the 50 nM dose and substantial increases in differentially expressed genes at the 100 and 200 nM doses, with roughly similar numbers of up- and down-regulated genes. The functional annotation of gill transcripts with consistent expression trends and significantly altered at one dose point or more disclosed the complexity of the induced cell response. The most evident transcriptional changes concerned protein synthesis and turnover, ion homeostasis, cell cycle regulation and apoptosis, and intracellular trafficking (transcript sequences denoting heat shock proteins, metal binding thioneins, sequestosome 1 and proteasome subunits, and GADD45 exemplify up-regulated genes, while transcript sequences denoting actin, tubulins and the apoptosis inhibitor 1 exemplify down-regulated genes). Overall, nanomolar doses of co-occurring free metal ions induced significant structural and functional changes in the mussel gills: the intensity of the response to the stimulus measured in the laboratory supports the additional validation of molecular markers of metal exposure to be used in Mussel Watch programs.
Abstract:
Traditional software development captures user needs during requirements analysis. The Web makes this endeavour even harder because of the difficulty of determining who these users are. In an attempt to tackle the heterogeneity of the user base, Web Personalization techniques are proposed to guide the users’ experience. In addition, Open Innovation allows organisations to look beyond their internal resources to develop new products or improve existing processes. This thesis sits in between by introducing Open Personalization as a means to incorporate actors other than webmasters in the personalization of web applications. The aim is to provide the technological basis for a trusted environment in which webmasters and companion actors can collaborate, i.e. "an architecture of participation". Such an architecture depends heavily on these actors’ profiles. This work tackles three profiles (i.e. software partners, hobby programmers and end users) and proposes three "architectures of participation", one tuned for each profile. Each architecture rests on different technologies: a .NET annotation library based on Inversion of Control for software partners, a Modding Interface in JavaScript for hobby programmers, and finally, a domain-specific language for end users. Proof-of-concept implementations are available for the three cases, while a quantitative evaluation is conducted for the domain-specific language.
Abstract:
These guidelines have been produced to support the implementation of the Code of Conduct for Responsible Fisheries, particularly with regard to the need for responsibility in the post-harvest sector of the fish producing industry. The industry that produces fish for food has three major areas of responsibility: to the consumer of the food, to ensure that it is safe to eat and is of the expected quality and nutritional value; to the resource, to ensure that it is not wasted; and to the environment, to ensure that negative impacts are minimized. In addition, the industry has a responsibility to itself to ensure the continued ability of many millions of people throughout the world to earn a gainful living from working within the industry. Article 11.1 of the Code of Conduct for Responsible Fisheries and other related parts of the Code are concerned particularly with these responsibilities. This publication provides annotation to and guidance on these articles to assist those charged with implementation of the Code in identifying possible courses of action necessary to ensure that the industry is conducted in a sustainable manner. (PDF contains 42 pages)
Abstract:
ENGLISH: The skipjack tuna, Katsuwonus pelamis is an important resource of the tropical and subtropical waters of the world ocean. Fishermen of many countries exploit this resource; at the present time, the annual world catch is approximately 200 thousand metric tons. Many fishery experts believe that the skipjack is not being fully utilized while stocks of other tunas are being fished, in some areas, at levels exceeding their maximum sustainable yields. In addition to the importance of skipjack as a commercial fish and as a source of food, there is a small but expanding recreational fishery in some countries bordering the Pacific. This bibliography provides a list of publications pertaining to the biology and fishery of the Pacific skipjack tuna. Papers concerned with food technology, food chemistry, radio-chemistry, and certain other subjects are excluded. The main sources for our publication have been the existing bibliographies of tunas, which are listed and indexed accordingly. In addition, reports of various marine laboratories and other scientific organizations have been checked; these are too numerous to list. We are fairly confident that all major works pertaining to skipjack tuna in the Pacific, printed prior to the end of 1966, appear in this bibliography. Only reports considered to be in permanent form are included. Annotations are based on actual examination of each of the entries listed here. The annotations do not evaluate a paper but serve rather to give a more precise idea of its contents if not revealed by the title alone. If the title sufficed in this respect, no annotation was prepared. A relatively small number of works believed to contain information pertinent to our bibliography could not be examined, but a list of such papers is provided. SPANISH: El atún barrilete, Katsuwonus pelamis, es un recurso importante de las aguas tropicales y subtropicales del océano mundial. 
Los pescadores de varios países explotan este recurso; actualmente, la captura mundial anual es aproximadamente de 200,000 toneladas métricas. Muchos expertos en la pesquería creen que el barrilete no es utilizado completamente, mientras los stocks de otros atunes son pescados en algunas áreas a niveles que exceden su rendimiento máximo sostenible. Además de la importancia del barrilete como pez comercial y como fuente de alimento, existe una pesquería pequeña recreativa que se está desarrollando en algunos países colindantes con el Pacífico. Esta bibliografía suministra una lista de publicaciones correspondientes a la biología y pesquería del atún barrilete en el Pacífico. Estudios referentes a la tecnología alimenticia, química alimenticia, radioquímica y ciertos otros sujetos son excluídos. Las fuentes principales correspondientes a nuestra publicación han sido las bibliografías existentes sobre atunes, las cuales están enumeradas y catalogadas de acuerdo. Además, se han examinado los informes de varios laboratorios marítimos y los de otras organizaciones científicas; éstos son demasiado numerosos para enumerar. Estamos bastante seguros de que todos los trabajos principales correspondientes al atún barrilete del Pacífico, editados antes de terminar el año de 1966, aparecen en esta bibliografía. Se incluyen únicamente los informes que se consideran permanentes. Las anotaciones se basan en el examen actual de cada una de las entradas aquí referidas. Las anotaciones no evaluan un estudio, pero sirven más bien para dar una idea más precisa de su contenido si el título por sí mismo no lo explica. No se preparó ninguna anotación si el título a este respecto era suficiente. Un número relativamente pequeño de trabajos que se cree tengan información pertinente a nuestra bibliografía no pudo ser examinado, pero se suministra una lista de tales estudios. (PDF contains 227 pages.)
Abstract:
[EN] In this report we present the tags we use when annotating the gold standard of syntactic functions and the decisions taken during its annotation. The gold standard is a necessary resource for evaluating the rule-based surface syntactic parser (the one based on the Constraint Grammar formalism), and it can also be useful for developing and evaluating statistical parsers. The tags we present here follow the Constraint Grammar (CG) formalism (Karlsson et al., 1995). In fact, recent experiments show that good results have been obtained when parsing with CG (Karlsson et al., 1995; Samuelsson and Voutilainen, 1997; Tapanainen and Järvinen, 1997; Bick, 2000).
Abstract:
In the first part of the thesis we explore three fundamental questions that arise naturally when we conceive a machine learning scenario where the training and test distributions can differ. Contrary to conventional wisdom, we show that mismatched training and test distributions can in fact yield better out-of-sample performance. This optimal performance can be obtained by training with the dual distribution. This optimal training distribution depends on the test distribution set by the problem, but not on the target function that we want to learn. We show how to obtain this distribution in both discrete and continuous input spaces, as well as how to approximate it in a practical scenario. The benefits of using this distribution are exemplified on both synthetic and real data sets.
In order to apply the dual distribution in the supervised learning scenario where the training data set is fixed, it is necessary to use weights to make the sample appear as if it came from the dual distribution. We explore the negative effect that weighting a sample can have. The theoretical decomposition of the use of weights regarding its effect on the out-of-sample error is easy to understand but not actionable in practice, as the quantities involved cannot be computed. Hence, we propose the Targeted Weighting algorithm that determines if, for a given set of weights, the out-of-sample performance will improve or not in a practical setting. This is necessary as the setting assumes there are no labeled points distributed according to the test distribution, only unlabeled samples.
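The idea of weighting a fixed sample so that it appears to come from another distribution can be sketched with standard importance weights, w(x) = p_target(x) / p_train(x). This is not the Targeted Weighting algorithm of the thesis, only the generic reweighting step it builds on; the distributions, losses and function names below are hypothetical.

```python
def importance_weights(samples, p_train, p_target):
    """Weight each sample by how much more likely it is under the target."""
    return [p_target[x] / p_train[x] for x in samples]


def weighted_mean(values, weights):
    """Average of values in which each one counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical discrete input space with two points, "a" and "b".
p_train = {"a": 0.5, "b": 0.5}   # distribution the fixed sample was drawn from
p_target = {"a": 0.8, "b": 0.2}  # distribution we want the sample to mimic

samples = ["a", "b", "a", "b"]
losses = [1.0, 3.0, 1.0, 3.0]    # per-sample loss of some fixed model

weights = importance_weights(samples, p_train, p_target)
plain = sum(losses) / len(losses)
reweighted = weighted_mean(losses, weights)
print(plain, reweighted)
```

In this toy case the reweighted average matches the expected loss under the target distribution (0.8 × 1 + 0.2 × 3), whereas the plain average reflects the training distribution. With small samples, large weights increase the variance of the estimate, which is one form of the negative effect of weighting that the abstract mentions.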
Finally, we propose a new class of matching algorithms that can be used to match the training set to a desired distribution, such as the dual distribution (or the test distribution). These algorithms can be applied to very large datasets, and we show how they lead to improved performance in a large real dataset such as the Netflix dataset. Their computational complexity is the main reason for their advantage over previous algorithms proposed in the covariate shift literature.
In the second part of the thesis we apply machine learning to the problem of behavior recognition. We develop a specific behavior classifier to study fly aggression, and we develop a system that allows behavior in videos of animals to be analyzed with minimal supervision. The system, which we call CUBA (Caltech Unsupervised Behavior Analysis), detects movemes, actions, and stories from time series describing the position of animals in videos. The method summarizes the data and provides biologists with a mathematical tool to test new hypotheses. Other benefits of CUBA include finding classifiers for specific behaviors without the need for annotation, as well as providing a means to discriminate between groups of animals, for example according to their genetic line.