ABSTRACT: BACKGROUND: Many parasitic organisms, eukaryotes as well as bacteria, possess surface antigens with amino acid repeats. Because they make up the interface between host and pathogen, such repetitive proteins may be virulence factors involved in immune evasion or cytoadherence. They find immunological applications in serodiagnostics and vaccine development. Here we use proteins that contain perfect repeats as a basis for comparative genomics between parasitic and free-living organisms. RESULTS: We have developed Reptile (http://reptile.unibe.ch), a program for proteome-wide probabilistic description of perfect repeats in proteins. Parasite proteomes exhibited a large variance in the proportion of repeat-containing proteins. Interestingly, the percentage of highly repetitive proteins correlated well with mean protein length in parasite proteomes, but not at all in the proteomes of free-living eukaryotes. Combining Reptile with programs for the prediction of transmembrane domains and GPI-anchoring yielded an effective tool for in silico identification of potential surface antigens and virulence factors from parasites. CONCLUSION: Systematic surveys for perfect amino acid repeats allowed basic comparisons between free-living and parasitic organisms that were directly applicable to predicting proteins of serological and parasitological importance. An on-line tool is available at http://genomics.unibe.ch/dora.
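
To make the notion of a perfect repeat concrete, the following is a minimal sketch that finds perfect tandem repeats of a fixed unit length in a protein sequence. It is not Reptile's algorithm, which the abstract describes as probabilistic; the function name, the parameters and the toy sequence are assumptions made for illustration only.

```python
import re


def perfect_tandem_repeats(seq, unit_len, min_copies=3):
    """Return (start, unit, copies) for each perfect tandem repeat in seq.

    Illustrative sketch only: it reports exact back-to-back copies of a
    unit of unit_len residues and assigns no probability to them.
    """
    # (.{n}) captures a unit of n residues; \1{k,} requires at least k
    # further exact copies of that unit immediately afterwards.
    pattern = re.compile(r"(.{%d})\1{%d,}" % (unit_len, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq):
        copies = len(match.group(0)) // unit_len
        hits.append((match.start(), match.group(1), copies))
    return hits


# Toy sequence (hypothetical): a 4-residue unit repeated four times.
toy_protein = "MKT" + "NANP" * 4 + "GG"
repeats = perfect_tandem_repeats(toy_protein, unit_len=4)
```

Running the function once per unit length of interest and counting the proteins with at least one hit would give a crude proportion of repeat-containing proteins for a proteome.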


Helicobacter pylori is a Gram-negative bacterial pathogen with a small genome of 1.64–1.67 Mb. More than 20 putative DNA restriction-modification (R-M) systems, comprising more than 4% of the total genome, have been identified in the two completely sequenced H. pylori strains, 26695 and J99, based on sequence similarities. In this study, we have investigated the biochemical activities of 14 Type II R-M systems in H. pylori 26695. Less than 30% of the Type II R-M systems in 26695 are fully functional, similar to the results obtained from strain J99. Although nearly 90% of the R-M genes are shared by the two H. pylori strains, different sets of these R-M genes are functionally active in each strain. Interestingly, all strain-specific R-M genes are active, whereas most shared genes are inactive. This agrees with the notion that strain-specific genes have been acquired more recently through horizontal transfer from other bacteria and selected for function. Thus, they are less likely to be impaired by random mutations. Our results also show that H. pylori has extremely diversified R-M systems in different strains, and that the diversity may be maintained by constantly acquiring new R-M systems and by inactivating and deleting the old ones.


Mycobacteria of the Mycobacterium tuberculosis complex (MTBC) greatly affect humans and animals worldwide. The life cycle of mycobacteria is complex, and the mechanisms resulting in pathogen infection and survival in host cells are not fully understood. Recently, comparative genomics analyses have provided new insights into the evolution and adaptation of the MTBC to survive inside the host. However, most of this information has been obtained using M. tuberculosis but not other members of the MTBC such as M. bovis and M. caprae. In this study, the genomes of three M. bovis (MB1, MB3, MB4) and one M. caprae (MB2) field isolates with different lesion score, prevalence and host distribution phenotypes were sequenced. Genome sequence information was used for whole-genome and protein-targeted comparative genomics analysis with the aim of finding correlates with phenotypic variation with potential implications for tuberculosis (TB) disease risk assessment and control. At the whole-genome level, the results of this first comparative genomics study of field isolates of M. bovis including M. caprae showed that, as previously reported for M. tuberculosis, sequential chromosomal nucleotide substitutions were the main driver of M. bovis genome evolution. The phylogenetic analysis provided strong support for the M. bovis/M. caprae clade, but supported M. caprae as a separate species. The comparison of the MB1 and MB4 isolates revealed differences in genome sequence, including in gene families that are important for bacterial infection and transmission, thus highlighting differences with functional implications between isolates otherwise classified with the same spoligotype.
Strategic protein-targeted analysis using the ESX or type VII secretion system, proteins linking stress response with lipid metabolism, host T cell epitopes of mycobacteria, antigens and peptidoglycan assembly protein identified new genetic markers and candidate vaccine antigens that warrant further study to develop tools to evaluate risks for TB disease caused by M. bovis/M. caprae and for TB control in humans and animals.


Arbuscular mycorrhizal fungi (AMF) are widespread in soil, where they form symbiotic associations, called arbuscular mycorrhizae, with the majority of plants. The development of AMF depends strongly on the host plant, so that they cannot live saprotrophically; they are therefore considered obligate biotrophs. AMF form a basal evolutionary lineage of fungi and belong to the phylum Glomeromycota. Their mycelia consist of a network of coenocytic hyphae in which nuclei and cell organelles can move freely from one compartment to another. AMF give the host plant improved mineral nutrition through the network of extraradical hyphae, which extends beyond the soil zone explored by the roots. These hyphae have a high capacity to absorb nutrients, which they transport to the roots. AMF thereby improve plant growth while protecting plants from biotic and abiotic stresses. Despite the importance of AMF, their genetics and evolution remain poorly understood. They are difficult to study because their lifestyle prevents them from being cultured in the absence of host plants. In addition, the intra-isolate genetic diversity of their nuclear genomes further complicates such studies, in particular the development of molecular markers for biological and ecological studies and for studies of AMF function. For these reasons, mitochondrial genomes offer attractive opportunities and alternatives for studying AMF. Indeed, the mitochondrial (mt) genomes published to date show no intra-isolate genetic polymorphism. However, exceptions may exist.
To move mitochondrial genomics forward, we need to generate a large amount of mitochondrial DNA (mtDNA) sequencing data in order to study the evolutionary mechanisms, population genetics, community ecology and function of AMF. In this context, the objectives of my doctoral project were to: 1) study the evolution of mt genomes using comparative genomics among closely related species, among isolates, and among phylogenetically distant species of AMF; 2) study the genetic inheritance of mt genomes among isolates of the model species Rhizophagus irregularis through anastomoses; 3) study mtDNA organisation and mt genes in order to develop molecular markers for phylogenetic studies. We used whole genome shotgun sequencing with 454 pyrosequencing and Illumina HiSeq to sequence several AMF taxa selected for their importance and availability. De novo assembly, conventional Sanger sequencing, annotation and comparative genomics were carried out to characterise complete mtDNAs. We discovered several interesting evolutionary mechanisms in Gigaspora rosea, whose mt genome is completely rearranged compared with that of Rhizophagus irregularis isolate DAOM 197198. In addition, we showed that two genes, cox1 and rns, are each fragmented into two pieces. We demonstrated that the RNAs transcribed from the two cox1 fragments are joined by trans-splicing with the help of the RNA of the nad5 I3 gene, which brings together the two RNAs cox1.1 and cox1.2 to form a complete, functional RNA. We also found a very unusual mtDNA organisation in Rhizophagus sp. isolate DAOM 213198, whose mt genome consists of two circular chromosomes.
In addition, we found a considerable amount of plasmid-related sequences in the Glomeraceae compared with the Gigasporaceae, which contributes to rapid evolution of mtDNA in the Glomeromycota. We also sequenced several isolates of R. irregularis and Rhizophagus sp. to resolve their phylogenetic positions and infer the evolutionary relationships among them. Comparative mt genomics showed the existence of several mobile elements, such as open reading frames (mORFs), short inverted repeats (SIRs) and plasmid-related sequences (dpo), which affect mt gene order and enable chromosomal rearrangement of mtDNA. All of these evolutionary mechanisms observed among isolates allow us to develop molecular markers specific to each AMF isolate or species. The data generated in my doctoral project have advanced fundamental knowledge of mitochondrial genomes not only in the Glomeromycetes but also in the kingdom Fungi and in eukaryotes in general. The molecular toolkits developed in this project can be used for studies of the population genetics, genetic exchange and ecology of AMF, which will contribute to understanding the essential role of AMF in agriculture and the environment.


We present a novel, web-accessible scientific workflow system which makes large-scale comparative studies accessible without programming or excessive configuration requirements. GPFlow allows a workflow defined on single input values to be automatically lifted to operate over collections of input values and supports the formation and processing of collections of values without the need for explicit iteration constructs. We introduce a new model for collection processing based on key aggregation and slicing which guarantees processing integrity and facilitates automatic association of inputs, allowing scientific users to manage the combinatorial explosion of data values inherent in large scale comparative studies. The approach is demonstrated using a core task from comparative genomics, and builds upon our previous work in supporting combined interactive and batch operation, through a lightweight web-based user interface.
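
The idea of lifting a single-value workflow to a collection can be illustrated with a short sketch. This is not GPFlow's interface or its key aggregation and slicing model; `lift`, `gc_content` and the toy genomes are hypothetical names chosen for illustration, and the only point shown is that keeping a key alongside each value preserves the association between inputs and outputs without an explicit loop in the workflow definition.

```python
def lift(step):
    """Turn a step defined on one value into a step on a keyed collection.

    Hypothetical helper: each output keeps the key of the input it came
    from, so results stay associated with their inputs.
    """
    def lifted(collection):
        return {key: step(value) for key, value in collection.items()}
    return lifted


def gc_content(sequence):
    """Single-value step: fraction of G and C bases in one sequence."""
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Toy inputs (hypothetical organism names and sequences).
genomes = {"orgA": "GGCC", "orgB": "ATGC", "orgC": "ATAT"}

# The same step now runs over the whole collection.
gc_by_genome = lift(gc_content)(genomes)
```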


BACKGROUND: The increasing number of assembled mammalian genomes makes it possible to compare genome organisation across mammalian lineages and reconstruct chromosomes of the ancestral marsupial and therian (marsupial and eutherian) mammals. However, the reconstruction of ancestral genomes requires genome assemblies to be anchored to chromosomes. The recently sequenced tammar wallaby (Macropus eugenii) genome was assembled into over 300,000 contigs. We previously devised an efficient strategy for mapping large evolutionarily conserved blocks in non-model mammals, and applied this to determine the arrangement of conserved blocks on all wallaby chromosomes, thereby permitting comparative maps to be constructed and resolving the long-debated issue of a 2n=14 versus a 2n=22 ancestral marsupial karyotype. RESULTS: We identified large blocks of genes conserved between human and opossum, and mapped genes corresponding to the ends of these blocks by fluorescence in situ hybridization (FISH). A total of 242 genes was assigned to wallaby chromosomes in the present study, bringing the total number of genes mapped to 554 and making it the most densely cytogenetically mapped marsupial genome. We used these gene assignments to construct comparative maps between wallaby and opossum, which uncovered many intrachromosomal rearrangements, particularly for genes found on wallaby chromosomes X and 3. Expanding comparisons to include chicken and human permitted the putative ancestral marsupial (2n=14) and therian mammal (2n=19) karyotypes to be reconstructed. CONCLUSIONS: Our physical mapping data for the tammar wallaby have uncovered the events shaping marsupial genomes and enabled us to predict the ancestral marsupial karyotype, supporting a 2n=14 ancestor. Furthermore, our predicted therian ancestral karyotype has helped to understand the evolution of the ancestral eutherian genome.


Chlamydia pneumoniae is an obligate intracellular bacterium implicated in a wide range of human diseases including atherosclerosis and Alzheimer's disease. Efforts to understand the relationships between C. pneumoniae detected in these diseases have been hindered by the limited availability of sequence data for non-respiratory strains. In this study, we sequenced the whole genomes of C. pneumoniae isolates from atherosclerosis and Alzheimer's disease, and compared these to previously published C. pneumoniae genomes. Phylogenetic analyses of these new C. pneumoniae strains indicate two sub-groups within human C. pneumoniae, and suggest that both recombination and mutation events have driven the evolution of human C. pneumoniae. Further fine-detailed analyses of these new C. pneumoniae sequences show several genetically variable loci. This suggests that similar strains of C. pneumoniae are found in the brain, lungs and cardiovascular system and that only minor genetic differences may contribute to the adaptation of particular strains in human disease.


The evolutionary history of biological entities is recorded within their nucleic acid sequences and can (sometimes) be deciphered by thorough genomic analysis. In this study we sought to gain insights into the diversity and evolution of bacterial and archaeal viruses. We focused primarily on those virus groups/families for which comprehensive genomic analysis was not previously possible because of a lack of sufficient genomic data. During the course of this work, twenty-five putative proviruses integrated into various prokaryotic genomes were identified, enabling us to undertake a comparative genomics approach. This analysis allowed us to test the previously formulated evolutionary hypotheses and also provided valuable information on the molecular mechanisms behind the genome evolution of the studied virus groups.


Background. Several types of networks, such as transcriptional, metabolic or protein-protein interaction networks of various organisms, have been constructed and have provided a variety of insights into metabolism and regulation. Here, we seek to exploit the reaction-based networks of three organisms for comparative genomics. We use concepts from spectral graph theory to systematically determine how differences in basic metabolism of organisms are reflected at the systems level and in the overall topological structures of their metabolic networks. Methodology/Principal Findings. Metabolome-based reaction networks of Mycobacterium tuberculosis, Mycobacterium leprae and Escherichia coli have been constructed based on the KEGG LIGAND database, followed by graph spectral analysis of the networks to identify hubs as well as the sub-clustering of reactions. The shortest and alternate paths in the reaction networks have also been examined. Sub-cluster profiling demonstrates that reactions of the mycolic acid pathway in mycobacteria form a tightly connected sub-cluster. Identification of hubs reveals reactions involving glutamate to be central to mycobacterial metabolism, and pyruvate to be at the centre of the E. coli metabolome. The analysis of shortest paths between reactions has revealed several paths that are shorter than well-established pathways. Conclusions. We conclude that severe downsizing of the M. leprae genome has not significantly altered the global structure of its reaction network but has reduced the total number of alternate paths between its reactions while keeping the shortest paths between them intact. The hubs in the mycobacterial networks that are absent in the human metabolome can be explored as potential drug targets. This work demonstrates the usefulness of constructing metabolome-based networks of organisms and the feasibility of their analyses through graph spectral methods.
The insights obtained from such studies provide a broad overview of the similarities and differences between organisms, taking comparative genomics studies to a higher dimension.
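
As an illustration of how a spectral quantity can point to hubs, the sketch below ranks the nodes of a toy reaction network by eigenvector centrality, that is, by the principal eigenvector of the adjacency matrix computed with power iteration. This is a stand-in chosen for brevity and may differ from the spectral measure used in the study; the reaction names and the network itself are hypothetical.

```python
def eigenvector_centrality(adjacency, iterations=200):
    """Approximate the principal eigenvector of a symmetric 0/1 adjacency
    matrix (list of lists) by power iteration.

    The iteration multiplies by (A + I) rather than A; the identity shift
    keeps the same eigenvectors but avoids oscillation on bipartite graphs.
    """
    n = len(adjacency)
    scores = [1.0] * n
    for _ in range(iterations):
        updated = [
            scores[i] + sum(adjacency[i][j] * scores[j] for j in range(n))
            for i in range(n)
        ]
        norm = sum(x * x for x in updated) ** 0.5
        scores = [x / norm for x in updated]
    return scores


# Toy undirected reaction network (hypothetical): the first reaction
# shares a metabolite with every other reaction.
names = ["R_glutamate", "R_a", "R_b", "R_c", "R_d"]
adjacency = [
    [0, 1, 1, 1, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]

scores = eigenvector_centrality(adjacency)
# The node with the largest eigenvector component is reported as the hub.
hub = names[max(range(len(names)), key=lambda i: scores[i])]
```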


The Asian elephant Elephas maximus and the African elephant Loxodonta africana, which diverged 5-7 million years ago, exhibit differences in their physiology, behaviour and morphology. A comparative genomics approach would be useful and necessary for evolutionary and functional genetic studies of elephants. We sequenced E. maximus and mapped it to L. africana at approximately 15X coverage. Through comparative sequence analyses, we have identified Asian elephant-specific homozygous, non-synonymous single nucleotide variants (SNVs) that map to 1514 protein-coding genes, many of which are involved in olfaction. We also present the first report of a high-coverage transcriptome sequence in E. maximus from peripheral blood lymphocytes. We have identified 103 novel protein-coding transcripts and 66 long non-coding RNAs (lncRNAs). We also report the presence of 181 protein domains unique to elephants when compared to other Afrotheria species. Each of these findings can be further investigated to gain a better understanding of functional differences unique to elephant species, as well as those unique to elephantids in comparison with other mammals. This work therefore provides a valuable resource to explore the immense research potential of comparative analyses of transcriptome and genome sequences in the Asian elephant.


The full-length cDNA sequence (3219 base pairs) of the trehalose-6-phosphate synthase gene of Porphyra yezoensis (PyTPS) was isolated by RACE-PCR and deposited in GenBank (NCBI) with the accession number AY729671. PyTPS encodes a protein of 908 amino acids before a stop codon, and has a calculated molecular mass of 101,591 Daltons. The PyTPS protein consists of a TPS domain in the N-terminus and a putative TPP domain at the C-terminus. Homology alignment for PyTPS and the TPS proteins from bacteria, yeast and higher plants indicated that the most closely related sequences to PyTPS were those from higher plants (OsTPS and AtTPS5), whereas the most distant sequence to PyTPS was from bacteria (EcOtsAB). Based on the identified sequence of the PyTPS gene, PCR primers were designed and used to amplify the TPS genes from nine other seaweed species. Sequences of the nine obtained TPS genes were deposited in GenBank (NCBI). All 10 TPS genes encoded peptides of 908 amino acids and the sequences were highly conserved both in nucleotide composition (>94%) and in amino acid composition (>96%). Unlike the TPS genes from some other plants, there was no intron in any of the 10 isolated seaweed TPS genes.


Yersiniosis is an acute or chronic enteric zoonosis caused by enteropathogenic Yersinia species. Although yersiniosis is predominantly associated with gastroenteric forms of infection, extraintestinal forms are often reported from the elderly or patients with predisposing factors. Yersiniosis is often reported in countries with cold and mild climates (Northern and Central Europe, New Zealand and the north of the Russian Federation). The Irish Health Protection Surveillance Centre (HPSC) currently records only 3-7 notified cases of yersiniosis per year. At the same time, pathogenic Yersinia enterocolitica is recovered from pigs (the main source of pathogenic Y. enterocolitica) at levels similar to those observed in Yersinia-endemic countries. Introduction of Yersinia-selective culture procedures may increase Yersinia isolation rates. To establish whether the small number of notifications of human disease was an underestimate due to the lack of specific selective culture for Yersinia, we carried out a prospective culture study of faecal samples from outpatients with diarrhoea, with additional culture of appendix and throat swabs. Levels of anti-Yersinia seroprevalence that are higher than yersiniosis notification rates in endemic countries suggest that most yersiniosis cases are unrecognised by culture. Subsequently, in addition to a prospective culture study of clinical specimens, we carried out serological screening of Irish blood donors and environmental screening of human sewage. Pathogenic Yersinia strains were not isolated from 1,189 faeces samples, nor from 297 throat swabs or 23 appendix swabs. This suggested that current low notification rates in Ireland are not due to the lack of specific Yersinia culture procedures. Molecular screening detected a wider variety of Y. enterocolitica-specific targets in pig slurry than in human sewage.
A serological survey for antibodies against Yersinia outer proteins (YOPs) in Irish blood donors found antibodies in 25%, with an age-related trend to increased seropositivity, compatible with the hypothesis that yersiniosis may have been more prevalent in Ireland in the recent past. Y. enterocolitica is a heterogeneous group of microorganisms that comprises strains with different degrees of pathogenicity. Although non-pathogenic Y. enterocolitica lack conventional virulence factors, these strains can be isolated from patients with diarrhoea. Insecticidal Toxin Complex (ITC) and Cytolethal Distending Toxins (CDT) can potentially contribute to the virulence of non-pathogenic Y. enterocolitica in the absence of other virulence factors. We compared the distribution of ITC and CDT loci among pathogenic and non-pathogenic Y. enterocolitica. Additionally, to demonstrate the potential pathogenicity of non-pathogenic Y. enterocolitica, we compared their virulence towards Galleria mellonella larvae (a non-mammalian model of human bacterial infections) with the virulence of highly and mildly pathogenic Y. enterocolitica strains. Surprisingly, the virulence of pathogenic and non-pathogenic Y. enterocolitica in Galleria mellonella larvae observed at 37°C did not correlate with their pathogenic potential towards humans. Comparative phylogenomic analysis detects predicted coding sequences (CDSs) that define host-pathogen interactions and hence provides insights into the molecular evolution of bacterial virulence. Comparative phylogenomic analysis of microarray data generated from Y. enterocolitica strains isolated in Great Britain from humans with diarrhoea and from domestic animals revealed high genetic heterogeneity of this species. Because of the extensive human, animal and food exchanges between the UK and Ireland, the objective of this study was to gain further insight into genetic heterogeneity and relationships among clinical and non-clinical Y. enterocolitica strains of various pathogenic potential isolated in Ireland and Great Britain. No evidence of direct transfer of strains between the two countries was found.


BACKGROUND: The evolutionary relationships of modern birds are among the most challenging to understand in systematic biology and have been debated for centuries. To address this challenge, we assembled or collected the genomes of 48 avian species spanning most orders of birds, including all Neognathae and two of the five Palaeognathae orders, and used the genomes to construct a genome-scale avian phylogenetic tree and perform comparative genomics analyses (Jarvis et al. in press; Zhang et al. in press). Here we release assemblies and datasets associated with the comparative genome analyses, which include 38 newly sequenced avian genomes plus previously released or simultaneously released genomes of Chicken, Zebra finch, Turkey, Pigeon, Peregrine falcon, Duck, Budgerigar, Adelie penguin, Emperor penguin and the Medium Ground Finch. We hope that this resource will serve future efforts in phylogenomics and comparative genomics. FINDINGS: The 38 bird genomes were sequenced using the Illumina HiSeq 2000 platform and assembled using a whole genome shotgun strategy. The 48 genomes were categorized into two groups according to the N50 scaffold size of the assemblies: a high depth group comprising 23 species sequenced at high coverage (>50X) with multiple insert size libraries, resulting in N50 scaffold sizes greater than 1 Mb (except the White-throated Tinamou and Bald Eagle); and a low depth group comprising 25 species sequenced at low coverage (~30X) with two insert size libraries, resulting in an average N50 scaffold size of about 50 kb. Repetitive elements comprised 4%-22% of the bird genomes. The assembled scaffolds allowed the homology-based annotation of 13,000-17,000 protein-coding genes in each avian genome relative to chicken, zebra finch and human, as well as comparative and sequence conservation analyses.
CONCLUSIONS: Here we release full genome assemblies of 38 newly sequenced avian species, link genome assembly downloads for 7 of the remaining 10 species, and provide a guideline to the genomic data that have been generated and used in our Avian Phylogenomics Project. To the best of our knowledge, the Avian Phylogenomics Project is the biggest vertebrate comparative genomics project to date. The genomic data presented here are expected to accelerate further analyses in many fields, including phylogenetics, comparative genomics, evolution, neurobiology, developmental biology, and other related areas.
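
Because the two assembly groups are defined by N50 scaffold size, a short sketch of how that statistic is usually computed may help. It uses the common definition (the length L such that scaffolds of length at least L contain at least half of the assembled bases); the function name and the toy lengths are illustrative and are not taken from the project's pipeline.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length
    of the scaffold at which the running total first reaches half of the
    total assembly size.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no scaffold lengths given")


# Toy assembly (hypothetical lengths, total 300): 80 + 70 = 150 reaches
# half of the total, so the N50 is 70.
toy_n50 = n50([80, 70, 50, 40, 30, 20, 10])
```

A few very long scaffolds raise the N50 sharply, which is consistent with the high depth group, built from multiple insert size libraries, reaching N50 values above 1 Mb.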


The order Lagomorpha comprises about 90 living species, divided into two families: the pikas (Family Ochotonidae), and the rabbits, hares, and jackrabbits (Family Leporidae). Lagomorphs are important economically and scientifically as major human food resources, valued game species, pests of agricultural significance, model laboratory animals, and key elements in food webs. A quarter of the lagomorph species are listed as threatened. They are native to all continents except Antarctica, and occur up to 5000 m above sea level, from the equator to the Arctic, spanning a wide range of environmental conditions. The order has notable taxonomic problems, presenting significant difficulties for defining a species due to broad phenotypic variation, overlap of morphological characteristics, and relatively recent speciation events. At present, the genomes of only two species, the European rabbit (Oryctolagus cuniculus) and the American pika (Ochotona princeps), have been sequenced and assembled. Starting from a paucity of genome information, the main scientific aim of the Lagomorph Genomics Consortium (LaGomiCs), born from a cooperative initiative of the European COST Action “A Collaborative European Network on Rabbit Genome Biology—RGB-Net” and the World Lagomorph Society (WLS), is to provide an international framework for sequencing the genomes of all extant and selected extinct lagomorphs. Sequencing the genomes of an entire order will provide a large amount of information to address biological problems related not only to lagomorphs but also to all mammals. We present current and planned sequencing programs and outline the final objective of LaGomiCs, which can be achieved through broad international collaboration.