970 results for GENE PREDICTION
Abstract:
Mutations in genes, which are simple changes in our DNA, can produce undesirable phenotypes known as genetic diseases or disorders. These small changes happen frequently and can have extreme consequences. Understanding and identifying them, and associating the mutated genes with genetic diseases, can play an important role in our health by enabling better diagnostic and therapeutic strategies. Years of experiments have produced a vast amount of data on the human genome and on different genetic diseases, and these data still need to be processed properly to extract useful information. This work is an effort to analyze some useful datasets and to apply different techniques to associate genes with genetic diseases. Two genetic diseases were studied: Parkinson’s disease and breast cancer. Using genetic programming, we analyzed the complex network around the known disease genes of these two diseases and generated a ranking of genes based on their relevance to each disease. To generate these rankings, centrality measures were calculated for all nodes in the complex network surrounding the known disease genes of the given disease. Using genetic programming, all nodes were assigned scores based on the similarity of their centrality measures to those of the known disease genes. The results showed that this method is successful at finding these patterns in centrality measures and that the highly ranked genes are good candidate disease genes for further study. Using standard benchmark tests, we compared our approach with ENDEAVOUR and CIPHER, two well-known disease gene ranking frameworks, and obtained comparable results.
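The centrality-based ranking idea in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract uses genetic programming to learn the scoring, whereas this sketch swaps in a plain Euclidean distance to the mean centrality profile of the known disease genes. The toy graph, the gene labels A to E, and the function names are all hypothetical.

```python
from collections import deque
from math import dist


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(graph)
    return {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}


def closeness_centrality(graph):
    """Reachable-node count divided by the sum of shortest-path lengths (BFS)."""
    result = {}
    for source in graph:
        seen = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in graph[u]:
                if w not in seen:
                    seen[w] = seen[u] + 1
                    queue.append(w)
        total = sum(seen.values())
        result[source] = (len(seen) - 1) / total if total else 0.0
    return result


def rank_candidates(graph, known):
    """Rank non-disease nodes by closeness of their centrality profile
    to the average profile of the known disease genes (assumed scoring)."""
    deg = degree_centrality(graph)
    clo = closeness_centrality(graph)
    profile = {v: (deg[v], clo[v]) for v in graph}
    centroid = tuple(
        sum(profile[g][i] for g in known) / len(known) for i in range(2)
    )
    candidates = [v for v in graph if v not in known]
    return sorted(candidates, key=lambda v: dist(profile[v], centroid))


# Hypothetical toy network: A and B play the role of known disease genes.
toy_graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"A", "B", "E"},
    "E": {"D"},
}
ranking = rank_candidates(toy_graph, {"A", "B"})
print(ranking)
```

In this toy network, node D has the same degree and closeness as the two known genes, so it is ranked first. A real study would use many more centrality measures and a learned scoring function.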
Abstract:
The woodland strawberry, Fragaria vesca (2n = 2x = 14), is a versatile experimental plant system. This diminutive herbaceous perennial has a small genome (240 Mb), is amenable to genetic transformation and shares substantial sequence identity with the cultivated strawberry (Fragaria × ananassa) and other economically important rosaceous plants. Here we report the draft F. vesca genome, which was sequenced to ×39 coverage using second-generation technology, assembled de novo and then anchored to the genetic linkage map into seven pseudochromosomes. This diploid strawberry sequence lacks the large genome duplications seen in other rosids. Gene prediction modeling identified 34,809 genes, with most being supported by transcriptome mapping. Genes critical to valuable horticultural traits including flavor, nutritional value and flowering time were identified. Macrosyntenic relationships between Fragaria and Prunus predict a hypothetical ancestral Rosaceae genome that had nine chromosomes. New phylogenetic analysis of 154 protein-coding genes suggests that assignment of Populus to Malvidae, rather than Fabidae, is warranted.
Abstract:
Ankylosing spondylitis (AS) and spondyloarthritis are strongly genetically determined. The long-standing association with HLA-B27 is well described, although the mechanism by which that association induces AS remains uncertain. Recent developments include the description of HLA-B27 tag single nucleotide polymorphisms in European and Asian populations. An increasing number of non-MHC genetic associations have been reported, which provided amongst other things the first evidence of the involvement of the IL-23 pathway in AS. The association with ERAP1 is now known to be restricted to HLA-B27 positive disease. Preliminary studies on the genetics of axial spondyloarthritis demonstrate a lower HLA-B27 carriage rate compared with AS. Studies with larger samples and including non-European ethnic groups are likely to further advance the understanding of the genetics of AS and spondyloarthritis. © 2012.
Abstract:
Automatic molecular classification of cancer based on DNA microarrays has many advantages over conventional classification based on the morphological appearance of the tumor. Artificial neural networks are a general approach to automatic classification. In this paper, the Direction-Basis-Function neuron and the Priority-Ordered algorithm are applied to neural networks. The leukemia gene expression dataset is used as an example to test the classifier. The result of our method is compared with that of an SVM and shows that our method performs better.
Abstract:
Lactococcus lactis is used extensively world-wide for the production of fermented dairy products. Bacteriophages (phages) infecting L. lactis can result in slow or incomplete fermentations, or may even cause total fermentation failure. Therefore, bacteriophages disrupting L. lactis fermentation are of economic concern. This thesis employed a multifaceted approach to investigate various molecular aspects of phage-host interaction in L. lactis. The genome sequence of an Irish dairy starter strain, the prophage-cured L. lactis subsp. cremoris UC509.9, was studied. The 2,250,427 bp circular chromosome represents the smallest among its sequenced lactococcal equivalents. The genome displays clear genetic adaptation to the dairy niche in the form of extensive reductive evolution. Gene prediction identified 2066 protein-encoding genes, including 104 which showed significant homology to transposase-specifying genes. Over 9% of the identified genes appear to be inactivated through stop codons or frame shift mutations. Many pseudogenes were found in genes that are assigned to carbohydrate and amino acid transport and metabolism orthologous groups, reflecting L. lactis UC509.9’s adaptation to the lactose and casein-rich dairy environment. Sequence analysis of the eight plasmids of L. lactis revealed extensive adaptation to the dairy environment. Key industrial phenotypes were mapped and novel lactococcal plasmid-associated genes highlighted. In addition to chromosomally-encoded bacteriophage resistance systems, six functional such systems were identified, including two abortive infection systems, AbiB and AbiD1, explaining the observed phage resistance of L. lactis UC509.9. Molecular analysis suggests that the constitutive expression of AbiB is not lethal to cells, suggesting the protein is expressed in an un/inactivated form.
Analysis of 936 species phage sk1-escape mutants of AbiB revealed that all such mutants harbour mutations in orf6, which encodes the major capsid protein. Results suggest that the major capsid protein is required for activation of the AbiB system, although this requires further investigation. Temporal transcriptomes of L. lactis UC509.9 undergoing lytic infection with either one of two distinct bacteriophages, Tuc2009 and c2, were determined and compared to the transcriptome of uninfected UC509.9 cells. Whole genome microarrays performed at various time-points post-infection demonstrated a rather modest impact on host transcription. Alterations in the UC509.9 transcriptome during lytic infection appear phage-specific, with a relatively small number of differentially transcribed genes shared between infection with either Tuc2009 or c2. Transcriptional profiles of both bacteriophages during lytic infection were shown to generally correlate with previous studies and allowed the confirmation of previously predicted promoter sequences. Bioinformatic analysis of genomic regions encoding the presumed cell wall polysaccharide (CW PS) biosynthesis gene cluster of several strains of L. lactis was performed. Results demonstrate the presence of three dominant genetic types of this gene cluster, termed type A, B and C. These regions were used for the development of a multiplex PCR to identify CW PS genotype of various lactococcal strains. Analysis of 936 species phage receptor binding protein (RBP) phylogeny and CW PS genotype revealed an apparent correlation between RBP phylogeny and CW PS type, thereby providing a partial explanation for the observed narrow host range of 936 phages. Further analysis of the genetic locus encompassing the presumed CW PS biosynthesis operon of eight strains identified as belonging to the CW PS C (geno)type, revealed the presence of a variable region among the examined strains.
The obtained comparative analysis allowed for the identification of five subgroups of the C type, named C1 to C5. We purified an acidic polysaccharide from the cell wall of L. lactis 3107 (C2 subtype) and confirmed that it is structurally different from the CW PS of the C1 subtype L. lactis MG1363. Combinations of genes from the variable region of C2 subtype were amplified from L. lactis 3107 and introduced into a mutant of the C1 subtype L. lactis NZ9000 (a direct derivative of MG1363) deficient in CW PS biosynthesis. The resulting recombinant mutant synthesized a CW PS with a composition characteristic for that of the C2 subtype L. lactis 3107 and not the wildtype C1 L. lactis NZ9000. The recombinant mutant exhibited a changed phage resistance/sensitivity profile consistent with that of L. lactis 3107, which unambiguously demonstrated that L. lactis 3107 CW PS is the host cell surface receptor of two bacteriophages belonging to the P335 species as well as phages that are members of the 936 species. The research presented in this thesis has significantly advanced our understanding of L. lactis bacteriophage-host interactions in several ways. Firstly, the examination of plasmid-encoded bacteriophage resistance systems has allowed inferences to be made regarding the mode of action of AbiB, thereby providing a platform for further elucidation of the molecular trigger of this system. Secondly, the phage infection transcriptome data presented, in addition to previous work, has made L. lactis a model organism in terms of transcriptomic studies of bacteriophage-host interactions. And finally, the research described in this thesis has for the first time explicitly revealed the nature of a carbohydrate bacteriophage receptor in L. lactis, while also providing a logical explanation for the observed narrow host ranges exhibited by 936 and P335 phages. Future research in discerning the structures of other L. lactis CW PS, combined with the determination of the molecular interplay between receptor binding proteins of these phages and CW PS will allow an in depth understanding of the mechanism by which the most prevalent lactococcal phages identify and adsorb to their specific host.
Abstract:
Mycosis fungoides (MF) is the most frequent type of cutaneous T-cell lymphoma, whose diagnosis and study are hampered by its morphologic similarity to inflammatory dermatoses (ID) and the low proportion of tumoral cells, which often account for only 5% to 10% of the total tissue cells. cDNA microarray studies using the CNIO OncoChip of 29 MF and 11 ID cases revealed a signature of 27 genes implicated in the tumorigenesis of MF, including tumor necrosis factor receptor (TNFR)-dependent apoptosis regulators, STAT4, CD40L, and other oncogenes and apoptosis inhibitors. Subsequently a 6-gene prediction model was constructed that is capable of distinguishing MF and ID cases with unprecedented accuracy. This model correctly predicted the class of 97% of cases in a blind test validation using 24 MF patients with low clinical stages. Unsupervised hierarchical clustering has revealed 2 major subclasses of MF, one of which tends to include more aggressive-type MF cases including tumoral MF forms. Furthermore, signatures associated with abnormal immunophenotype (11 genes) and tumor stage disease (5 genes) were identified.
Abstract:
The genome sequence of Aedes aegypti was recently reported. A large number of Expressed Sequence Tags (ESTs) were sequenced to aid in the gene prediction process. In the present work we describe an integrated analysis of the genomic and EST data, focusing on genes with preferential expression in larvae (LG), adults (AG) and in both stages (SG). A total of 913 genes (5.4% of the transcript complement) are LG, including ion transporters and cuticle proteins that are important for ion homeostasis and defense. From a starting set of 245 genes encoding the trypsin domain, we identified 66 putative LG, AG, and SG trypsins by manual curation. Phylogenetic analyses showed that AG trypsins are divergent from their larval counterparts (LG), grouping with blood-induced trypsins from Anopheles gambiae and Simulium vittatum. These results support the hypothesis that blood-feeding arose only once, in the ancestral Culicomorpha. Peritrophins are proteins that interlock chitin fibrils to form the peritrophic membrane (PM) that compartmentalizes the food in the midgut. These proteins are recognized by having chitin-binding domains with 6 conserved Cys and may also present mucin-like domains (regions expected to be highly O-glycosylated). PM may be formed by a ring of cells (type 2, seen in Ae. aegypti larvae and Drosophila melanogaster) or by most midgut cells (type 1, found in Ae. aegypti adult and Tribolium castaneum). LG and D. melanogaster peritrophins have more complex domain structures than AG and T. castaneum peritrophins. Furthermore, mucin-like domains of peritrophins from T. castaneum (feeding on rough food) are lengthier than those of adult Ae. aegypti (blood-feeding). This suggests, for the first time, that type 1 and type 2 PM may have variable molecular architectures determined by different peritrophins and/or ancillary proteins, which may be partly modulated by diet.
Abstract:
One of the most important goals of bioinformatics is the ability to identify genes in uncharacterized DNA sequences in worldwide databases. Gene expression in prokaryotes initiates when the RNA polymerase enzyme interacts with DNA regions called promoters, where the main regulatory elements of the transcription process are located. Despite improvements in in vitro techniques for molecular biology analysis, characterizing and identifying a large number of promoters in a genome is a complex task. Moreover, the main drawback is the absence of a large set of promoters from which to identify conserved patterns among species. Hence, an in silico method to predict them in any species is a challenge. Improved promoter prediction methods can be one step towards developing more reliable ab initio gene prediction methods. In this work, we present an empirical comparison of Machine Learning (ML) techniques, namely Naïve Bayes, Decision Trees, Support Vector Machines, Neural Networks, Voted Perceptron, PART and k-NN, together with ensemble approaches (Bagging and Boosting), on the task of predicting Bacillus subtilis promoters. To do so, we first built two datasets of promoter and non-promoter sequences for B. subtilis and a hybrid one. A cross-validation procedure was applied to evaluate the ML methods. Good results were obtained with ML methods such as SVM and Naïve Bayes on the B. subtilis data. However, we did not obtain good results on the hybrid database.
Abstract:
A database (SpliceDB) of known mammalian splice site sequences has been developed. We extracted 43 337 splice pairs from mammalian divisions of the gene-centered Infogene database, including sites from incomplete or alternatively spliced genes. Known EST sequences supported 22 815 of them. After discarding sequences with putative errors and ambiguous location of splice junctions the verified dataset includes 22 489 entries. Of these, 98.71% contain canonical GT–AG junctions (22 199 entries) and 0.56% have non-canonical GC–AG splice site pairs. The remainder (0.73%) occur in many small groups (with a maximum size of 0.05%). We especially studied non-canonical splice sites, which comprise 3.73% of GenBank annotated splice pairs. EST alignments allowed us to verify only the exonic part of splice sites. To check the conserved dinucleotides we compared sequences of human non-canonical splice sites with sequences from the high throughput genome sequencing project (HTG). Out of 171 human non-canonical and EST-supported splice pairs, 156 (91.23%) had a clear match in the human HTG. They can be classified after sequence analysis as: 79 GC–AG pairs (of which one was an error that was corrected to GC–AG), 61 errors corrected to GT–AG canonical pairs, six AT–AC pairs (of which two were errors corrected to AT–AC), one case was produced from a non-existent intron, seven cases were found in HTG that were deposited to GenBank and finally there were only two other cases left of supported non-canonical splice pairs. The information about verified splice site sequences for canonical and non-canonical sites is presented in SpliceDB with the supporting evidence. We also built weight matrices for the major splice groups, which can be incorporated into gene prediction programs. SpliceDB is available at the computational genomic Web server of the Sanger Centre: http://genomic.sanger.ac.uk/spldb/SpliceDB.html and at http://www.softberry.com/spldb/SpliceDB.html.
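The weight matrices mentioned in this abstract can be illustrated with a minimal sketch of a log-odds position weight matrix. The abstract does not say how the SpliceDB matrices were built, so the pseudocount, the uniform background frequency, the function names and the four toy donor-site sequences below are all assumptions made for illustration.

```python
from math import log2


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from equal-length aligned sites.

    Each column maps a base to log2(observed frequency / background),
    with a pseudocount so that unseen bases get a finite penalty.
    """
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: log2((c / total) / background) for b, c in counts.items()})
    return pwm


def score(pwm, seq):
    """Sum the per-position log-odds weights for a candidate site."""
    return sum(column[base] for column, base in zip(pwm, seq))


# Hypothetical toy donor sites, all starting with the canonical GT dinucleotide.
toy_sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
pwm = build_pwm(toy_sites)
canonical_score = score(pwm, "GTAAGT")
variant_score = score(pwm, "GCAAGT")
print(canonical_score, variant_score)
```

A gene prediction program could score every candidate position with such a matrix and keep those above a threshold. Because the toy sites contain no GC donors, the GC variant scores lower here, which is why separate matrices per splice group, as the abstract describes, would be needed to recognize non-canonical sites.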
Abstract:
By using sensitive homology-search and gene-finding programs, we have found that a genomic region from the tip of the short arm of human chromosome 16 (16p13.3) encodes a putative secreted protein consisting of a domain related to the whey acidic protein (WAP) domain, a domain homologous with follistatin modules of the Kazal-domain family (FS module), an immunoglobulin-related domain (Ig domain), two tandem domains related to Kunitz-type protease inhibitor modules (KU domains), and a domain belonging to the recently defined NTR-module family (NTR domain). The gene encoding these WAP, FS, Ig, KU, and NTR modules (hereafter referred to as the WFIKKN gene) is intron-depleted—its single 1,157-bp intron splits the WAP module. The validity of our gene prediction was confirmed by sequencing a WFIKKN cDNA cloned from a lung cDNA library. Studies on the tissue-expression pattern of the WFIKKN gene have shown that the gene is expressed primarily in pancreas, kidney, liver, placenta, and lung. As to the function of the WFIKKN protein, it is noteworthy that it contains FS, WAP, and KU modules, i.e., three different module types homologous with domains frequently involved in inhibition of serine proteases. The protein also contains an NTR module, a domain type implicated in inhibition of zinc metalloproteinases of the metzincin family. On the basis of its intriguing homologies, we suggest that the WFIKKN protein is a multivalent protease inhibitor that may control the action of multiple types of serine proteases as well as metalloproteinase(s).
Abstract:
Formal grammars can be used for describing complex repeatable structures such as DNA sequences. In this paper, we describe the structural composition of DNA sequences using a context-free stochastic L-grammar. L-grammars are a special class of parallel grammars that can model the growth of living organisms, e.g. plant development, and model the morphology of a variety of organisms. We believe that parallel grammars can also be used for modeling genetic mechanisms and sequences such as promoters. Promoters are short regulatory DNA sequences located upstream of a gene. Detection of promoters in DNA sequences is important for successful gene prediction. Promoters can be recognized by certain patterns that are conserved within a species, but there are many exceptions, which make promoter recognition a complex problem. We replace the problem of promoter recognition by induction of context-free stochastic L-grammar rules, which are later used for the structural analysis of promoter sequences. L-grammar rules are derived automatically from the Drosophila and vertebrate promoter datasets using a genetic programming technique and their fitness is evaluated using a Support Vector Machine (SVM) classifier. The artificial promoter sequences generated using the derived L-grammar rules are analyzed and compared with natural promoter sequences.
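To make the notion of a context-free stochastic L-grammar concrete, here is a minimal sketch of parallel rewriting, in which every symbol of the current word is replaced in the same step by a production drawn according to its probability. The rules, probabilities and function names are hypothetical; they are not the rules induced in the paper, and the genetic programming and SVM steps are not shown.

```python
import random


def rewrite(word, rules, rng):
    """Apply one parallel rewriting step to every symbol of the word."""
    out = []
    for symbol in word:
        productions = rules.get(symbol)
        if productions is None:
            out.append(symbol)  # symbols without a rule are copied unchanged
            continue
        bodies = [body for body, _ in productions]
        weights = [weight for _, weight in productions]
        out.append(rng.choices(bodies, weights=weights, k=1)[0])
    return "".join(out)


def derive(axiom, rules, steps, seed=0):
    """Derive a word from the axiom by repeated parallel rewriting."""
    rng = random.Random(seed)
    word = axiom
    for _ in range(steps):
        word = rewrite(word, rules, rng)
    return word


# Deterministic special case (every probability is 1): Lindenmayer's algae system.
algae_rules = {"A": [("AB", 1.0)], "B": [("A", 1.0)]}
algae = derive("A", algae_rules, steps=3)

# Hypothetical stochastic rules over the DNA alphabet.
dna_rules = {
    "A": [("AT", 0.6), ("A", 0.4)],
    "T": [("TA", 0.5), ("T", 0.5)],
    "G": [("GC", 0.3), ("G", 0.7)],
    "C": [("C", 1.0)],
}
sequence = derive("TATAG", dna_rules, steps=4, seed=42)
print(algae, sequence)
```

The grammar is context-free because each production depends only on the symbol being rewritten, and stochastic because a symbol may have several weighted productions. Fixing the seed makes a derivation reproducible, which is convenient when generated sequences are compared with natural ones.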
Abstract:
The Bifidobacterium longum subsp. longum 35624™ strain (formerly named Bifidobacterium longum subsp. infantis) is a well described probiotic that has shown clinical efficacy in Irritable Bowel Syndrome clinical trials and induces immunoregulatory effects in mice and in humans. This paper presents (a) the genome sequence of the organism allowing the assignment to its correct subspeciation longum; (b) a comparative genome assessment with other B. longum strains and (c) the molecular structure of the 35624 exopolysaccharide (EPS624). Comparative genome analysis of the 35624 strain with other B. longum strains determined that the sub-speciation of the strain is longum and revealed the presence of a 35624-specific gene cluster, predicted to encode the biosynthetic machinery for EPS624. Following isolation and acid treatment of the EPS, its chemical structure was determined using gas and liquid chromatography for sugar constituent and linkage analysis, electrospray and matrix assisted laser desorption ionization mass spectrometry for sequencing and NMR. The EPS consists of a branched hexasaccharide repeating unit containing two galactose and two glucose moieties, galacturonic acid and the unusual sugar 6-deoxy-L-talose. These data demonstrate that the B. longum 35624 strain has specific genetic features, one of which leads to the generation of a characteristic exopolysaccharide.
Abstract:
A novel multiple regression method (RM) is developed to predict identity-by-descent probabilities at a locus L (IBDL), among individuals without pedigree, given information on surrounding markers and population history. These IBDL probabilities are a function of the increase in linkage disequilibrium (LD) generated by drift in a homogeneous population over generations. Three parameters are sufficient to describe population history: effective population size (Ne), number of generations since foundation (T), and marker allele frequencies among founders (p). IBDL probabilities are used in a simulation study to map a quantitative trait locus (QTL) via variance component estimation. RM is compared to a coalescent method (CM) in terms of power and robustness of QTL detection. Differences between RM and CM are small but significant. For example, RM is more powerful than CM in dioecious populations, but not in monoecious populations. Moreover, RM is more robust than CM when marker phases are unknown or when there is complete LD among founders or Ne is wrong, and less robust when p is wrong. CM utilises all marker haplotype information, whereas RM utilises information contained in each individual marker and all possible marker pairs but not in higher order interactions. RM consists of a family of models encompassing four different population structures, and two ways of using marker information, which contrasts with the single model that must cater for all possible evolutionary scenarios in CM.
Abstract:
MOTIVATION: Synthetic lethal interactions represent pairs of genes whose individual mutations are not lethal, while the double mutation of both genes does incur lethality. Several studies have shown a correlation between functional similarity of genes and their distances in networks based on synthetic lethal interactions. However, there is a lack of algorithms for predicting gene function from synthetic lethality interaction networks. RESULTS: In this article, we present a novel technique called kernelROD for gene function prediction from synthetic lethal interaction networks based on kernel machines. We apply our novel algorithm to Gene Ontology functional annotation prediction in yeast. Our experiments show that our method leads to improved gene function prediction compared with state-of-the-art competitors and that combining genetic and congruence networks leads to a further improvement in prediction accuracy.
Abstract:
National Natural Science Foundation of China 60753001