931 results for Bioinformatics
Abstract:
Amplifications and deletions of chromosomal DNA, as well as copy-neutral loss of heterozygosity, have been associated with disease processes. High-throughput single nucleotide polymorphism (SNP) arrays are useful for making genome-wide estimates of copy number and genotype calls. Because neighboring SNPs in high-throughput SNP arrays are likely to have dependent copy number and genotype due to the underlying haplotype structure and linkage disequilibrium, hidden Markov models (HMMs) may be useful for improving genotype calls and copy number estimates that do not incorporate information from nearby SNPs. We improve on previous approaches that use an HMM framework for inference in high-throughput SNP arrays by integrating copy number, genotype calls, and the corresponding confidence scores when available. Using simulated data, we demonstrate how confidence scores control smoothing in a probabilistic framework. Software for fitting HMMs to SNP array data is available in the R package ICE.
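The role of confidence scores in smoothing can be illustrated with a minimal Viterbi sketch. This is not the model implemented in ICE: the three copy-number states, the Gaussian emission, the `stay` probability, and the use of a per-SNP standard deviation as a stand-in for the confidence score are all assumptions made for illustration.

```python
import math

def viterbi_copy_number(obs, sds, states=(1, 2, 3), stay=0.99):
    """Most likely copy-number path for noisy per-SNP estimates.

    obs: per-SNP copy number estimates.
    sds: per-SNP standard deviations (hypothetical stand-in for confidence;
         a larger sd means the observation is trusted less).
    """
    n_states = len(states)
    switch = (1.0 - stay) / (n_states - 1)

    def log_emit(x, mean, sd):
        # Gaussian log-density of the estimate around the state's copy number
        return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

    def log_trans(i, j):
        return math.log(stay if i == j else switch)

    # Uniform initial distribution over states
    score = [math.log(1.0 / n_states) + log_emit(obs[0], s, sds[0]) for s in states]
    back = []
    for x, sd in zip(obs[1:], sds[1:]):
        prev, score, ptr = score, [], []
        for j, s in enumerate(states):
            best_i = max(range(n_states), key=lambda i: prev[i] + log_trans(i, j))
            ptr.append(best_i)
            score.append(prev[best_i] + log_trans(best_i, j) + log_emit(x, s, sd))
        back.append(ptr)

    # Trace the best path backwards from the best final state
    j = max(range(n_states), key=lambda i: score[i])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [states[k] for k in path]

obs = [2.0, 2.1, 1.9, 3.0, 2.0, 2.1]
# Low confidence in the fourth SNP: the outlier is smoothed away.
print(viterbi_copy_number(obs, [0.2, 0.2, 0.2, 1.0, 0.2, 0.2]))   # [2, 2, 2, 2, 2, 2]
# High confidence in the fourth SNP: the single-SNP gain is kept.
print(viterbi_copy_number(obs, [0.2, 0.2, 0.2, 0.05, 0.2, 0.2]))  # [2, 2, 2, 3, 2, 2]
```

In this toy setting the same observation is either smoothed or retained depending only on how much it is trusted, which is the qualitative behavior the abstract describes.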
Abstract:
Genotyping platforms such as Affymetrix can be used to assess genotype-phenotype as well as copy number-phenotype associations at millions of markers. While genotyping algorithms are largely concordant when assessed on HapMap samples, tools to assess copy number changes are more variable and often discordant. One explanation for the discordance is that copy number estimates are susceptible to systematic differences between groups of samples that were processed at different times or by different labs. Analysis algorithms that do not adjust for batch effects are prone to spurious measures of association. The R package crlmm implements a multilevel model that adjusts for batch effects and provides allele-specific estimates of copy number. This paper illustrates a workflow for the estimation of allele-specific copy number, develops marker- and study-level summaries of batch effects, and demonstrates how the marker-level estimates can be integrated with complementary Bioconductor software for inferring regions of copy number gain or loss. All analyses are performed in the statistical environment R. A compendium for reproducing the analysis is available from the author’s website (http://www.biostat.jhsph.edu/~rscharpf/crlmmCompendium/index.html).
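To make the idea of a batch effect concrete, the sketch below centers one marker's copy number estimates on the median of each processing batch. This is a deliberately simple stand-in, not the multilevel model in crlmm; the function name and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import median

def center_by_batch(values, batches):
    """Subtract each batch's median from the values measured in that batch."""
    by_batch = defaultdict(list)
    for v, b in zip(values, batches):
        by_batch[b].append(v)
    batch_median = {b: median(vs) for b, vs in by_batch.items()}
    return [v - batch_median[b] for v, b in zip(values, batches)]

# One marker measured in two batches; batch "b" reads systematically higher.
values = [2.0, 2.2, 2.1, 2.6, 2.8, 2.7]
batches = ["a", "a", "a", "b", "b", "b"]
print(center_by_batch(values, batches))
```

After centering, both batches are distributed around zero, so a comparison between groups that happen to align with batches no longer picks up the shift between them.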
Abstract:
Important food crops like rice are constantly exposed to various stresses that can have devastating effects on their survival and productivity. Being sessile, these highly evolved organisms have developed elaborate molecular machineries to sense a mixture of stress signals and elicit a precise response to minimize the damage. However, recent discoveries revealed that the interplay of these stress regulatory and signaling molecules is highly complex and remains largely unknown. In this work, we conducted a large-scale analysis of differential gene expression using advanced computational methods to dissect the regulation of the stress response, which is at the heart of all molecular changes leading to the observed phenotypic susceptibility. One of the most important stress conditions in terms of loss of productivity is drought. We performed genomic and proteomic analysis of epigenetic and miRNA mechanisms in the regulation of drought-responsive genes in rice and found subsets of genes with striking properties. Overexpressed genesets included a higher number of epigenetic marks, miRNA targets and transcription factors that regulate drought tolerance. On the other hand, underexpressed genesets were poor in the above features but rich in metabolic genes with multiple co-expression partners contributing substantially to drought resistance. Identifying and characterizing the patterns exhibited by differentially expressed genes holds the key to uncovering the synergistic and antagonistic components of the cross talk between stress response mechanisms. We performed a meta-analysis on drought and bacterial stresses in rice and Arabidopsis, and identified hundreds of shared genes. We found a high level of conservation of gene expression between these stresses. Weighted co-expression network analysis detected two tight clusters of genes made up of master transcription factors and signaling genes showing strikingly opposite expression status.
To comprehensively identify the shared stress-responsive genes between multiple abiotic and biotic stresses in rice, we performed meta-analyses of microarray studies from seven different abiotic and six biotic stresses separately and found more than thirteen hundred shared stress-responsive genes. Various machine learning techniques utilizing these genes classified the stresses into two major classes, namely abiotic and biotic stresses, and into multiple classes of individual stresses with high accuracy, and identified the top genes showing distinct patterns of expression. Functional enrichment and co-expression network analysis revealed the different roles of plant hormones and transcription factors in conserved and non-conserved genesets in the regulation of the stress response.
Abstract:
BACKGROUND: Gene expression analysis has emerged as a major biological research area, with real-time quantitative reverse transcription PCR (RT-QPCR) being one of the most accurate and widely used techniques for expression profiling of selected genes. In order to obtain results that are comparable across assays, a stable normalization strategy is required. In general, the normalization of PCR measurements between different samples uses one to several control genes (e.g. housekeeping genes), from which a baseline reference level is constructed. Thus, the choice of the control genes is of utmost importance, yet there is not a generally accepted standard technique for screening a large number of candidates and identifying the best ones. RESULTS: We propose a novel approach for scoring and ranking candidate genes for their suitability as control genes. Our approach relies on publicly available microarray data and allows the combination of multiple data sets originating from different platforms and/or representing different pathologies. The use of microarray data allows the screening of tens of thousands of genes, producing very comprehensive lists of candidates. We also provide two lists of candidate control genes: one which is breast cancer-specific and one with more general applicability. Two genes from the breast cancer list which had not been previously used as control genes are identified and validated by RT-QPCR. Open source R functions are available at http://www.isrec.isb-sib.ch/~vpopovic/research/ CONCLUSION: We proposed a new method for identifying candidate control genes for RT-QPCR which was able to rank thousands of genes according to some predefined suitability criteria and we applied it to the case of breast cancer. We also empirically showed that translating the results from microarray to PCR platform was achievable.
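The abstract does not spell out its suitability criteria, so the following is only a sketch of one plausible scheme: within each microarray data set, genes are ranked by the standard deviation of their (assumed log-scale) expression, and the ranks are averaged across data sets. The function names and the toy values are hypothetical and are not taken from the authors' R functions.

```python
from statistics import mean, pstdev

def stability_ranks(expr):
    """Rank genes within one data set; rank 1 = least variable expression."""
    sds = {gene: pstdev(values) for gene, values in expr.items()}
    ordered = sorted(sds, key=sds.get)
    return {gene: rank for rank, gene in enumerate(ordered, start=1)}

def rank_control_genes(datasets):
    """Order the genes shared by all data sets by their average stability rank."""
    shared = set.intersection(*(set(d) for d in datasets))
    per_dataset = [stability_ranks({g: d[g] for g in shared}) for d in datasets]
    avg_rank = {g: mean(r[g] for r in per_dataset) for g in shared}
    return sorted(shared, key=lambda g: (avg_rank[g], g))

# Two toy data sets (possibly from different platforms); gene D is absent from the second.
d1 = {"A": [8.0, 8.1, 7.9], "B": [5.0, 9.0, 2.0], "C": [6.0, 6.5, 5.5], "D": [1.0, 2.0, 3.0]}
d2 = {"A": [10.0, 10.1, 10.0], "B": [4.0, 8.0, 6.0], "C": [7.0, 7.6, 6.8]}
print(rank_control_genes([d1, d2]))  # ['A', 'C', 'B']
```

Working with ranks rather than raw variability is one simple way to combine platforms whose measurement scales differ; other combination rules would be equally defensible.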
Abstract:
BACKGROUND: The mollicute Mycoplasma conjunctivae is the etiological agent of infectious keratoconjunctivitis (IKC) in domestic sheep and wild Caprinae. Although this pathogen is relatively benign for domestic animals treated with antibiotics, it can lead to blindness and death in wild animals. It is a major cause of death in protected species in the Alps (e.g., Capra ibex, Rupicapra rupicapra). METHODS: The genome was sequenced using a combined technique of GS-FLX (454) and Sanger sequencing, and annotated by an automatic pipeline that we designed using several tools interconnected via Perl scripts. The resulting annotations are stored in a MySQL database. RESULTS: The annotated sequence is deposited in the EMBL database (FM864216) and uploaded into the mollicutes database MolliGen http://cbi.labri.fr/outils/molligen/ allowing for comparative genomics. CONCLUSION: We show that our automatic pipeline allows for annotating a complete mycoplasma genome and present several examples of analysis in search of biological targets (e.g., pathogenic proteins).
Abstract:
BACKGROUND Aeromonas salmonicida subsp. salmonicida, the etiologic agent of furunculosis, is a major pathogen of fisheries worldwide. Several virulence factors have been described, but the type-three secretion system (T3SS) is recognized as having a major effect on virulence by injecting effectors directly into fish cells. In this study we used high-throughput proteomics to display the differences between in vitro secretome of A. salmonicida wild-type (wt, hypervirulent, JF2267) and T3SS-deficient (isogenic ΔascV, extremely low-virulent, JF2747) strains in exponential and stationary phases of growth. RESULTS Results confirmed the secretion of effectors AopH, AexT, AopP and AopO via T3SS, and for the first time demonstrated the impact of T3SS in secretion of Ati2, AopN and ExsE that are known as effectors in other pathogens. Translocators, needle subunits, Ati1, and AscX were also secreted in supernatants (SNs) dependent on T3SS. AopH, Ati2, AexT, AopB and AopD were in the top seven most abundant excreted proteins. EF-G, EF-Tu, DnaK, HtpG, PNPase, PepN and MdeA were moderately secreted in wt SNs and predicted to be putative T3 effectors by bioinformatics. Pta and ASA_P5G088 were increased in wt SNs and T3-associated in other bacteria. Ten conserved cytoplasmic proteins were more abundant in wt SNs than in the ΔascV mutant, but without any clear association to a secretion system. T1-secreted proteins were predominantly found in wt SNs: OmpAI, OmpK40, DegQ, insulinase ASA_0716, hypothetical ASA_0852 and ASA_3619. Presence of T3SS components in pellets was clearly decreased by ascV deletion, while no impact was observed on T1- and T2SS. Our results demonstrated that the ΔascV mutant strain excreted well-described (VapA, AerA, AerB, GCAT, Pla1, PlaC, TagA, Ahe2, GbpA and enolase) and yet uncharacterized potential toxins, adhesins and enzymes as much as or even more than the wt strain. Other putative important virulence factors were not detected. 
CONCLUSIONS We demonstrated the whole in vitro secretome and T3SS repertoire of hypervirulent A. salmonicida. Several toxins, adhesins and enzymes that are not part of the T3SS secretome were secreted to a higher extent in the extremely low-virulent ΔascV mutant. Altogether, our results show the high importance of an intact T3SS for initiating furunculosis and offer new information about its pathogenesis.
Abstract:
Brain tumor is one of the most aggressive types of cancer in humans, with an estimated median survival time of 12 months and only 4% of the patients surviving more than 5 years after disease diagnosis. Until recently, brain tumor prognosis has been based only on clinical information such as tumor grade and patient age, but there are reports indicating that molecular profiling of gliomas can reveal subgroups of patients with distinct survival rates. We hypothesize that coupling molecular profiling of brain tumors with clinical information might improve predictions of patient survival time and, consequently, better guide future treatment decisions. In order to evaluate this hypothesis, the general goal of this research is to build models for survival prediction of glioma patients using DNA molecular profiles (U133 Affymetrix gene expression microarrays) along with clinical information. First, a predictive Random Forest model is built for binary outcomes (i.e. short vs. long-term survival) and a small subset of genes whose expression values can be used to predict survival time is selected. Next, a new statistical methodology is developed for predicting time-to-death outcomes using Bayesian ensemble trees. Because of the large heterogeneity observed within the prognostic classes obtained by the Random Forest model, prediction can be improved by relating time-to-death to the gene expression profile directly. We propose a Bayesian ensemble model for survival prediction which is appropriate for high-dimensional data such as gene expression data. Our approach is based on the ensemble "sum-of-trees" model, which is flexible enough to incorporate additive and interaction effects between genes. We specify a fully Bayesian hierarchical approach and illustrate our methodology for the CPH, Weibull, and AFT survival models. We overcome the lack of conjugacy using a latent variable formulation to model the covariate effects, which decreases computation time for model fitting.
Also, our proposed models provide a model-free way to select important predictive prognostic markers based on controlling false discovery rates. We compare the performance of our methods with baseline reference survival methods and apply our methodology to an unpublished data set of brain tumor survival times and gene expression data, selecting genes potentially related to the development of the disease under study. A closing discussion compares results obtained by Random Forest and Bayesian ensemble methods from the biological and clinical perspectives and highlights the statistical advantages and disadvantages of the new methodology in the context of DNA microarray data analysis.
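As a rough illustration of the "sum-of-trees" idea in the accelerated failure time (AFT) case, the log survival time can be written as a sum of regression trees plus noise. The notation below (m trees, tree structures T_j, leaf parameters M_j, normally distributed errors) is an assumption borrowed from standard additive-tree formulations and is not taken from the work itself:

```latex
\log t_i \;=\; \sum_{j=1}^{m} g(\mathbf{x}_i;\, T_j, M_j) \;+\; \varepsilon_i,
\qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
```

Here t_i is the survival time of patient i, x_i collects the gene expression and clinical covariates, and each g is the prediction of a single tree. Trees that split on one gene contribute additive effects, while trees that split on several genes contribute interaction effects.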
Abstract:
Most newly synthesized messenger RNAs possess a 5’ cap and a 3’ poly(A) tail. The process of poly(A) tail shortening, also termed deadenylation, is important for post-transcriptional gene regulation, because deadenylation not only leads to mRNA translational inhibition but also is the first step of major mRNA degradation. Translationally inhibited mRNAs can be stored and/or degraded in dynamic cytoplasmic foci termed mRNA processing bodies, or P bodies, which are conserved in eukaryotes. To shed new light on the mechanisms of P body formation and P body functions, I focused on the link between deadenylation factors and P bodies. I found that the two major deadenylation complexes, Pan3-Pan2 and Ccr4-Caf1, can both be enriched in P bodies. The deadenylase activity of the Ccr4-Caf1 complex is a prerequisite for P body formation. Pan3, but not the deadenylase Pan2, is essential for P body formation. While the C-terminal domain of Pan3 is important for interaction with Pan2, the N-terminal domain of Pan3 is important for Pan3 to form cytoplasmic foci colocalizing with P bodies and to promote mRNA decay. Interestingly, the Pan3 N-terminal domain may be phosphorylated to regulate Pan3 localization and functions. Aside from the functions of the two deadenylation complexes in P bodies, I also studied all reported human P body proteins as a whole using bioinformatics. This effort has not only generated a comprehensive picture of the functions of and interactions among human P body proteins, but also predicted proteins that may regulate P body formation and/or functions. In summary, my study has established a direct link between mRNA deadenylation and P body formation and has also led to new hypotheses to guide future research on how P body dynamics are controlled.
Abstract:
High-throughput assays, such as the yeast two-hybrid system, have generated a huge amount of protein-protein interaction (PPI) data in the past decade. This tremendously increases the need for developing reliable methods to systematically and automatically suggest protein functions and relationships between them. With the available PPI data, it is now possible to study the functions and relationships in the context of a large-scale network. To date, several network-based schemes have been provided to effectively annotate protein functions on a large scale. However, due to the noise inherent in high-throughput data generation, new methods and algorithms should be developed to increase the reliability of functional annotations. Previous work in a yeast PPI network (Samanta and Liang, 2003) has shown that the local connection topology, particularly for two proteins sharing an unusually large number of neighbors, can predict functional associations between proteins, and hence suggest their functions. One advantage of the work is that their algorithm is not sensitive to noise (false positives) in high-throughput PPI data. In this study, we improved their prediction scheme by developing a new algorithm and new methods which we applied to a human PPI network to make a genome-wide functional inference. We used the new algorithm to measure and reduce the influence of hub proteins on detecting functionally associated proteins. We used the annotations of the Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) as independent and unbiased benchmarks to evaluate our algorithms and methods within the human PPI network. We showed that, compared with the previous work from Samanta and Liang, our algorithm and methods developed in this study improved the overall quality of functional inferences for human proteins. By applying the algorithms to the human PPI network, we obtained 4,233 significant functional associations among 1,754 proteins.
Further comparisons of their KEGG and GO annotations allowed us to assign 466 KEGG pathway annotations to 274 proteins and 123 GO annotations to 114 proteins with estimated false discovery rates of <21% for KEGG and <30% for GO. We clustered 1,729 proteins by their functional associations and performed pathway analysis to identify several subclusters that are highly enriched in certain signaling pathways. In particular, we performed a detailed analysis of a subcluster enriched in the transforming growth factor β signaling pathway (P < 10^-50), which is important in cell proliferation and tumorigenesis. Analysis of another four subclusters also suggested potential new players in six signaling pathways worthy of further experimental investigation. Our study gives clear insight into the common neighbor-based prediction scheme and provides a reliable method for large-scale functional annotation in this post-genomic era.
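The notion of two proteins "sharing an unusually large number of neighbors" can be made concrete with a hypergeometric tail probability. The sketch below is a textbook formulation under a random-neighbor null model; whether it matches the exact statistic of Samanta and Liang or the improved algorithm described here is an assumption, and the toy network is hypothetical.

```python
from math import comb

def shared_neighbor_pvalue(n_total, n1, n2, m):
    """P(at least m shared neighbors) when two proteins with n1 and n2
    neighbors pick those neighbors at random among n_total proteins."""
    upper = min(n1, n2)
    tail = sum(comb(n1, k) * comb(n_total - n1, n2 - k) for k in range(m, upper + 1))
    return tail / comb(n_total, n2)

def common_neighbors(adjacency, a, b):
    """Neighbors shared by proteins a and b in an adjacency dict of sets."""
    return adjacency[a] & adjacency[b]

# Toy network: P1 and P2 do not interact directly but share all three partners.
adjacency = {
    "P1": {"X", "Y", "Z"},
    "P2": {"X", "Y", "Z"},
    "X": {"P1", "P2"}, "Y": {"P1", "P2"}, "Z": {"P1", "P2"},
    "Q1": set(), "Q2": set(), "Q3": set(), "Q4": set(), "Q5": set(),
}
m = len(common_neighbors(adjacency, "P1", "P2"))
p = shared_neighbor_pvalue(len(adjacency), len(adjacency["P1"]), len(adjacency["P2"]), m)
print(m, p)  # 3 shared neighbors, p = 1/120
```

A small p-value flags the pair as a candidate functional association. A hub protein with very many neighbors shares neighbors with almost everyone, which is one reason the influence of hubs needs separate treatment.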
Abstract:
Attention has recently been drawn to Enterococcus faecium because of an increasing number of nosocomial infections caused by this species and its resistance to multiple antibacterial agents. However, relatively little is known about the pathogenic determinants of this organism. We have previously identified a cell-wall-anchored collagen adhesin, Acm, produced by some isolates of E. faecium, and a secreted antigen, SagA, exhibiting broad-spectrum binding to extracellular matrix proteins. Here, we analysed the draft genome of strain TX0016 for potential microbial surface components recognizing adhesive matrix molecules (MSCRAMMs). Genome-based bioinformatics identified 22 predicted cell-wall-anchored E. faecium surface proteins (Fms), of which 15 (including Acm) had characteristics typical of MSCRAMMs, including predicted folding into a modular architecture with multiple immunoglobulin-like domains. Functional characterization of one [Fms10; redesignated second collagen adhesin of E. faecium (Scm)] revealed that recombinant Scm(65) (A- and B-domains) and Scm(36) (A-domain) bound to collagen type V efficiently in a concentration-dependent manner, bound considerably less to collagen type I and fibrinogen, and differed from Acm in their binding specificities to collagen types IV and V. Results from far-UV circular dichroism measurements of recombinant Scm(36) and of Acm(37) indicated that these proteins were rich in beta-sheets, supporting our folding predictions. Whole-cell ELISA and FACS analyses unambiguously demonstrated surface expression of Scm in most E. faecium isolates. Strikingly, 11 of the 15 predicted MSCRAMMs clustered in four loci, each with a class C sortase gene; nine of these showed similarity to Enterococcus faecalis Ebp pilus subunits and also contained motifs essential for pilus assembly. 
Antibodies against one of the predicted major pilus proteins, Fms9 (redesignated EbpC(fm)), detected a 'ladder' pattern of high-molecular-mass protein bands in a Western blot analysis of cell surface extracts from E. faecium, suggesting that EbpC(fm) is polymerized into a pilus structure. Further analysis of the transcripts of the corresponding gene cluster indicated that fms1 (ebpA(fm)), fms5 (ebpB(fm)) and ebpC(fm) are co-transcribed, a result consistent with those for pilus-encoding gene clusters of other Gram-positive bacteria. All 15 genes occurred frequently in 30 clinically derived diverse E. faecium isolates tested. The common occurrence of MSCRAMM- and pilus-encoding genes and the presence of a second collagen-binding protein may have important implications for our understanding of this emerging pathogen.
Abstract:
Vector control is the mainstay of malaria control programmes. Successful vector control relies heavily on accurate information on the target mosquito populations in order to choose the most appropriate intervention for a given mosquito species and to monitor its impact. An impediment to identifying mosquito species is the existence of morphologically identical sibling species that play different roles in the transmission of pathogens and parasites. Currently, PCR diagnostics are used to distinguish between sibling species. PCR-based methods are, however, expensive and time-consuming, and their development requires a priori DNA sequence information. Here, we evaluated an inexpensive molecular proteomics approach for Anopheles species: matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS). MALDI-TOF MS is a well-developed protein profiling tool for the identification of microorganisms but so far has received little attention as a diagnostic tool in entomology. We measured MS spectra from specimens of 32 laboratory colonies and 2 field populations representing 12 Anopheles species including the A. gambiae species complex. An important step in the study was the advancement and implementation of a bioinformatics approach improving the resolution over previously applied cluster analysis. Borrowing tools for linear discriminant analysis from genomics, MALDI-TOF MS accurately identified taxonomically closely related mosquito species, including the separation between the M and S molecular forms of A. gambiae sensu stricto. The approach also classifies specimens from different laboratory colonies and hence is also very promising for use in colony authentication as part of quality assurance in laboratory studies. While being exceptionally accurate and robust, MALDI-TOF MS has several advantages over other typing methods, including simple sample preparation and short processing time.
As the method does not require DNA sequence information, data can also be reviewed at any later stage for diagnostic or functional patterns without the need for re-designing and re-processing biological material.
Abstract:
In this paper, we propose novel methodologies for the automatic segmentation and recognition of multi-food images. The proposed methods implement the first modules of a carbohydrate counting and insulin advisory system for type 1 diabetic patients. Initially, the plate is segmented using pyramidal mean-shift filtering and a region growing algorithm. Then each of the resulting segments is described by both color and texture features and classified by a support vector machine into one of six different major food classes. Finally, a modified version of the Huang and Dom evaluation index is proposed, addressing the particular needs of the food segmentation problem. The experimental results demonstrate the effectiveness of the proposed method, which achieves a segmentation accuracy of 88.5% and a recognition rate of 87%.