44 results for Annotation de génomes
Abstract:
In this paper, we propose a new learning approach to Web data annotation, where a support vector machine-based multiclass classifier is trained to assign labels to data items. For data record extraction, a data section re-segmentation algorithm based on visual and content features is introduced to improve the performance of Web data record extraction. We have implemented the proposed approach and tested it with a large set of Web query result pages in different domains. Our experimental results show that our proposed approach is highly effective and efficient.
Abstract:
BACKGROUND: Klebsiella pneumoniae strains are pathogenic to animals and humans, in which they are both a frequent cause of nosocomial infections and a re-emerging cause of severe community-acquired infections. K. pneumoniae isolates of the capsular serotype K2 are among the most virulent. In order to identify novel putative virulence factors that may account for the severity of K2 infections, the genome sequence of the K2 reference strain Kp52.145 was determined and compared to two K1 and K2 strains of low virulence and to the reference strains MGH 78578 and NTUH-K2044.
RESULTS: In addition to diverse functions related to host colonization and virulence encoded in genomic regions common to the four strains, four genomic islands specific for Kp52.145 were identified. These regions encoded genes for the synthesis of colibactin toxin, a putative cytotoxin outer membrane protein, secretion systems, nucleases and eukaryotic-like proteins. In addition, an insertion within a type VI secretion system locus included Sel1 domain-containing proteins and a phospholipase D family protein (PLD1). The pld1 mutant was avirulent in a mouse pneumonia model. The pld1 mRNA was expressed in vivo and the pld1 gene was associated with K. pneumoniae isolates from severe infections. Analysis of the lipid composition of a defective E. coli strain complemented with pld1 suggests an involvement of PLD1 in cardiolipin metabolism.
CONCLUSIONS: Determination of the complete genome of the K2 reference strain identified several genomic islands comprising putative elements of pathogenicity. The role of PLD1 in pathogenesis was demonstrated for the first time and suggests that lipid metabolism is a novel virulence mechanism of K. pneumoniae.
Abstract:
An MS/MS-based analytical strategy was followed to solve the complete sequence of two new peptides from frog (Odorrana schmackeri) skin secretion. This involved reduction and alkylation with two different alkylating agents followed by high-resolution tandem mass spectrometry. De novo sequencing was achieved by complementary CID and ETD fragmentations of full-length peptides and of selected tryptic fragments. Heavy and light isotope dimethyl labeling assisted with annotation of sequence ion series. The identified primary structures are GCD[I/L]STCATHN[I/L]VNE[I/L]NKFDKSKPSSGGVGPESP-NH2 and SCNLSTCATHNLVNELNKFDKSKPSSGGVGPESF-NH2, i.e. two carboxyamidated 34-residue peptides with an aminoterminal intramolecular ring structure formed by a disulfide bridge between Cys2 and Cys7. Edman degradation analysis of the second peptide positively confirmed the exact sequence, resolving I/L discriminations. Both peptide sequences are novel and share homology with calcitonin, calcitonin gene-related peptide (CGRP) and adrenomedullin from other vertebrates. Detailed sequence analysis as well as the 34-residue length of both O. schmackeri peptides suggest they do not fully qualify as either calcitonins (32 residues) or CGRPs (37 residues) and may justify their classification in a novel peptide family within the calcitonin gene-related peptide superfamily. Smooth muscle contractility assays with synthetic replicas of the S–S linked peptides on rat tail artery, uterus, bladder and ileum did not reveal myotropic activity.
Abstract:
To assess factors influencing the success of whole-genome sequencing for mainstream clinical diagnosis, we sequenced 217 individuals from 156 independent cases or families across a broad spectrum of disorders in whom previous screening had identified no pathogenic variants. We quantified the number of candidate variants identified using different strategies for variant calling, filtering, annotation and prioritization. We found that jointly calling variants across samples, filtering against both local and external databases, deploying multiple annotation tools and using familial transmission above biological plausibility contributed to accuracy. Overall, we identified disease-causing variants in 21% of cases, with the proportion increasing to 34% (23/68) for mendelian disorders and 57% (8/14) in family trios. We also discovered 32 potentially clinically actionable variants in 18 genes unrelated to the referral disorder, although only 4 were ultimately considered reportable. Our results demonstrate the value of genome sequencing for routine clinical diagnosis but also highlight many outstanding challenges.
Abstract:
The discovery and clinical application of molecular biomarkers in solid tumors increasingly relies on nucleic acid extraction from FFPE tissue sections and subsequent molecular profiling. This in turn requires the pathological review of haematoxylin & eosin (H&E) stained slides to ensure sample quality, to confirm tumor DNA sufficiency by visually estimating the percentage of tumor nuclei, and to annotate the tumor for manual macrodissection. In this study on NSCLC, we demonstrate considerable variation in tumor nuclei percentage between pathologists, potentially undermining the precision of NSCLC molecular evaluation and emphasising the need for quantitative tumor evaluation. We subsequently describe the development and validation of a system called TissueMark for automated tumor annotation and percentage tumor nuclei measurement in NSCLC using computerized image analysis. Evaluation of 245 NSCLC slides showed precise automated tumor annotation of cases using TissueMark, strong concordance with manually drawn boundaries and identical EGFR mutational status following manual macrodissection from the image analysis generated tumor boundaries. Automated analysis of cell counts for % tumor measurements by TissueMark showed reduced variability and significant correlation (p < 0.001) with benchmark tumor cell counts. This study demonstrates a robust image analysis technology that can facilitate the automated quantitative analysis of tissue samples for molecular profiling in discovery and diagnostics.
Abstract:
Static timing analysis provides the basis for setting the clock period of a microprocessor core, based on its worst-case critical path. However, depending on the design, this critical path is not always excited and therefore dynamic timing margins exist that can theoretically be exploited for the benefit of better speed or lower power consumption (through voltage scaling). This paper introduces predictive instruction-based dynamic clock adjustment as a technique to trim dynamic timing margins in pipelined microprocessors. To this end, we exploit the different timing requirements for individual instructions during the dynamically varying program execution flow without the need for complex circuit-level measures to detect and correct timing violations. We provide a design flow to extract the dynamic timing information for the design using post-layout dynamic timing analysis and we integrate the results into a custom cycle-accurate simulator. This simulator allows annotation of individual instructions with their impact on timing (in each pipeline stage) and rapidly derives the overall code execution time for complex benchmarks. The design methodology is illustrated at the microarchitecture level, demonstrating the performance and power gains possible on a 6-stage OpenRISC in-order general purpose processor core in a 28nm CMOS technology. We show that employing instruction-dependent dynamic clock adjustment leads on average to an increase in operating speed by 38% or to a reduction in power consumption by 24%, compared to traditional synchronous clocking, which at all times has to respect the worst-case timing identified through static timing analysis.
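The idea of annotating each instruction with its per-stage timing and deriving the total execution time can be illustrated with a minimal Python sketch. It is not the authors' simulator: the four-stage pipeline, the instruction classes and every delay value in `STAGE_DELAYS` are hypothetical, and stalls and hazards are ignored. In each cycle the clock period is set to the slowest stage currently in flight, and the result is compared with a fixed clock at the worst-case delay.

```python
# Hypothetical per-stage delays in nanoseconds for each instruction class.
# The values and the 4-stage pipeline are illustrative assumptions only.
STAGE_DELAYS = {
    "add":  [0.8, 0.9, 1.0, 0.7],
    "mul":  [0.8, 0.9, 1.6, 0.7],
    "load": [0.8, 1.0, 1.1, 1.4],
}
N_STAGES = 4


def cycle_periods(program):
    """Return the clock period required in each cycle of an ideal in-order pipeline."""
    periods = []
    total_cycles = len(program) + N_STAGES - 1
    for cycle in range(total_cycles):
        needed = 0.0
        for stage in range(N_STAGES):
            idx = cycle - stage  # index of the instruction occupying this stage
            if 0 <= idx < len(program):
                needed = max(needed, STAGE_DELAYS[program[idx]][stage])
        periods.append(needed)
    return periods


def execution_times(program):
    """Return (dynamic_time, static_time) in nanoseconds for the program."""
    periods = cycle_periods(program)
    # Static clocking must always respect the worst delay of any instruction and stage.
    static_period = max(max(delays) for delays in STAGE_DELAYS.values())
    return sum(periods), static_period * len(periods)


if __name__ == "__main__":
    dynamic, static = execution_times(["add", "mul", "load", "add"])
    print(f"dynamic: {dynamic:.2f} ns, static: {static:.2f} ns")
```

The gap between the two totals is the dynamic timing margin that such a scheme tries to recover; real gains depend on the instruction mix and on the measured post-layout delays.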
Abstract:
The introduction of Next Generation Sequencing (NGS) has revolutionised population genetics, providing studies of non-model species with unprecedented genomic coverage, allowing evolutionary biologists to address questions previously far beyond the reach of available resources. Furthermore, the simple mutation model of Single Nucleotide Polymorphisms (SNPs) permits cost-effective high-throughput genotyping in thousands of individuals simultaneously. Genomic resources are scarce for the Atlantic herring (Clupea harengus), a small pelagic species that sustains high revenue fisheries. This paper details the development of 578 SNPs using a combined NGS and high-throughput genotyping approach. Eight individuals covering the species distribution in the eastern Atlantic were bar-coded and multiplexed into a single cDNA library and sequenced using the 454 GS FLX platform. SNP discovery was performed by de novo sequence clustering and contig assembly, followed by the mapping of reads against consensus contig sequences. Selection of candidate SNPs for genotyping was conducted using an in silico approach. SNP validation and genotyping were performed simultaneously using an Illumina 1,536 GoldenGate assay. Although the conversion rate of candidate SNPs in the genotyping assay cannot be predicted in advance, this approach has the potential to maximise cost and time efficiencies by avoiding expensive and time-consuming laboratory stages of SNP validation. Additionally, the in silico approach leads to lower ascertainment bias in the resulting SNP panel as marker selection is based only on the ability to design primers and the predicted presence of intron-exon boundaries. Consequently SNPs with a wider spectrum of minor allele frequencies (MAFs) will be genotyped in the final panel. 
The genomic resources presented here represent a valuable multi-purpose resource for developing informative marker panels for population discrimination, microarray development and for population genomic studies in the wild.
Abstract:
The growing accessibility to genomic resources using next-generation sequencing (NGS) technologies has revolutionized the application of molecular genetic tools to ecology and evolutionary studies in non-model organisms. Here we present the case study of the European hake (Merluccius merluccius), one of the most important demersal resources of European fisheries. Two sequencing platforms, the Roche 454 FLX (454) and the Illumina Genome Analyzer (GAII), were used for Single Nucleotide Polymorphism (SNP) discovery in the hake muscle transcriptome. De novo transcriptome assembly into unique contigs, annotation, and in silico SNP detection were carried out in parallel for 454 and GAII sequence data. High-throughput genotyping using the Illumina GoldenGate assay was performed for validating 1,536 putative SNPs. Validation results were analysed to compare the performances of 454 and GAII methods and to evaluate the role of several variables (e.g. sequencing depth, intron-exon structure, sequence quality and annotation). Despite well-known differences in sequence length and throughput, the two approaches showed similar assay conversion rates (approximately 43%) and percentages of polymorphic loci (67.5% and 63.3% for GAII and 454, respectively). Both NGS platforms therefore proved suitable for large-scale identification of SNPs in transcribed regions of non-model species, although the lack of a reference genome profoundly affects the genotyping success rate. The overall efficiency, however, can be improved using strict quality and filtering criteria for SNP selection (sequence quality, intron-exon structure, target region score).
Abstract:
Evidence that persistent environmental pollutants may target the male reproductive system is increasing. The male reproductive system is regulated by secretion of testosterone by testicular Leydig cells, and perturbation of Leydig cell function may have ultimate consequences. 3-Methylsulfonyl-DDE (3-MeSO2-DDE) is a potent adrenal toxicant formed from the persistent insecticide DDT. Although studies have revealed the endocrine disruptive effect of 3-MeSO2-DDE, the underlying mechanisms at the cellular level in steroidogenic Leydig cells remain to be established. The current study addresses the effect of 3-MeSO2-DDE on viability, hormone production and proteome response of primary neonatal porcine Leydig cells. The AlamarBlue™ assay was used to evaluate cell viability. Solid phase radioimmunoassay was used to measure the concentration of hormones produced by both unstimulated and luteinizing hormone (LH)-stimulated Leydig cells following 48 h exposure. Protein samples from Leydig cells exposed to a non-cytotoxic concentration of 3-MeSO2-DDE (10 μM) were subjected to nano-LC-MS/MS, analyzed on a Q Exactive mass spectrometer and quantified using a label-free quantitative algorithm. Gene Ontology (GO) and Ingenuity Pathway Analysis (IPA) were carried out for functional annotation and identification of protein interaction networks. 3-MeSO2-DDE regulated Leydig cell steroidogenesis differentially depending on cell culture condition. Whereas its effect on testosterone secretion at basal condition was stimulatory, the effect on LH-stimulated cells was inhibitory. From triplicate experiments, a total of 6804 proteins were identified, of which the abundance of 86 proteins in unstimulated Leydig cells and 145 proteins in LH-stimulated Leydig cells was found to be significantly regulated in response to 3-MeSO2-DDE exposure.
These proteins are not only the first reported in relation to 3-MeSO2-DDE exposure, but also include only a small number of proteins shared between culture conditions, suggesting the action of 3-MeSO2-DDE on several targeted pathways, including mitochondrial dysfunction, oxidative phosphorylation, EIF2-signaling, and glutathione-mediated detoxification. Further identification and characterization of these proteins and pathways may build our understanding of the molecular basis of 3-MeSO2-DDE induced endocrine disruption in Leydig cells.
Abstract:
Introduction: Fewer than 50% of adults and 40% of youth meet US CDC guidelines for physical activity (PA), with the built environment (BE) a culprit for limited PA. A challenge in evaluating policy and BE change is the forethought to capture a priori PA behaviors and the ability to eliminate bias in post-change environments. The present objective was to analyze existing public data feeds to quantify effectiveness of BE interventions. The Archive of Many Outdoor Scenes (AMOS) has collected 135 million images of outdoor environments from 12,000 webcams since 2006. Many of these environments have experienced BE change. Methods: One example of BE change is the addition of protected bike lanes and a bike share program in Washington, DC. We selected an AMOS webcam that captured this change. AMOS captures a photograph from each webcam every half hour. AMOS captured the 120 webcam photographs between 0700 and 1900 during the first work week of June 2009 and the 120 photographs from the same week in 2010. We used the Amazon Mechanical Turk (MTurk) website to crowd-source the image annotation. MTurk workers were paid US$0.01 to mark each pedestrian, cyclist and vehicle in a photograph. Each image was coded 5 unique times (n=1200). The data, counts of transportation mode, were downloaded to SPSS for analysis. Results: The number of cyclists per scene increased four-fold between 2009 and 2010 (F=36.72, p=0.002). There was no significant increase in pedestrians between the two years; however, there was a significant increase in the number of vehicles per scene (F=16.81, p
Abstract:
Background: Esophageal adenocarcinoma (EA) is one of the fastest rising cancers in western countries. Barrett’s Esophagus (BE) is the premalignant precursor of EA. However, only a subset of BE patients develop EA, which complicates the clinical management in the absence of valid predictors. Genetic risk factors for BE and EA are incompletely understood. This study aimed to identify novel genetic risk factors for BE and EA. Methods: Within an international consortium of groups involved in the genetics of BE/EA, we performed the first meta-analysis of all genome-wide association studies (GWAS) available, involving 6,167 BE patients, 4,112 EA patients, and 17,159 representative controls, all of European ancestry, genotyped on Illumina high-density SNP-arrays, collected from four separate studies within North America, Europe, and Australia. Meta-analysis was conducted using the fixed-effects inverse variance-weighting approach. We used the standard genome-wide significance threshold of 5×10⁻⁸ for this study. We also conducted an association analysis following reweighting of loci using an approach that investigates annotation enrichment among the genome-wide significant loci. The entire GWAS data set was also analyzed using bioinformatics approaches including functional annotation databases as well as gene-based and pathway-based methods in order to identify pathophysiologically relevant cellular pathways. Findings: We identified eight new associated risk loci for BE and EA, within or near the CFTR (rs17451754, P=4·8×10⁻¹⁰), MSRA (rs17749155, P=5·2×10⁻¹⁰), BLK (rs10108511, P=2·1×10⁻⁹), KHDRBS2 (rs62423175, P=3·0×10⁻⁹), TPPP/CEP72 (rs9918259, P=3·2×10⁻⁹), TMOD1 (rs7852462, P=1·5×10⁻⁸), SATB2 (rs139606545, P=2·0×10⁻⁸), and HTR3C/ABCC5 genes (rs9823696, P=1·6×10⁻⁸). A further novel risk locus at LPA (rs12207195, posterior probability=0·925) was identified after re-weighting using significantly enriched annotations. This study thereby doubled the number of known risk loci.
The strongest disease pathways identified (P<10⁻⁶) belong to muscle cell differentiation and to mesenchyme development/differentiation, which fit with current pathophysiological BE/EA concepts. To our knowledge, this study identified for the first time an EA-specific association (rs9823696, P=1·6×10⁻⁸) near HTR3C/ABCC5 which is independent of BE development (P=0·45). Interpretation: The identified disease loci and pathways reveal new insights into the etiology of BE and EA. Furthermore, the EA-specific association at HTR3C/ABCC5 may constitute a novel genetic marker for the prediction of transition from BE to EA. Mutations in CFTR, one of the new risk loci identified in this study, cause cystic fibrosis (CF), the most common recessive disorder in Europeans. Gastroesophageal reflux (GER) belongs to the phenotypic CF-spectrum and represents the main risk factor for BE/EA. Thus, the CFTR locus may trigger a common GER-mediated pathophysiology.
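For readers unfamiliar with the fixed-effects inverse variance-weighting approach named in the Methods, the following Python sketch shows the standard textbook calculation for a single variant. It is an illustration under stated assumptions, not the consortium's pipeline: the function names and the example effect sizes are hypothetical, and the two-sided P value relies on a normal approximation.

```python
import math

GENOME_WIDE_THRESHOLD = 5e-8  # significance threshold quoted in the abstract


def fixed_effects_meta(betas, standard_errors):
    """Combine per-study effect estimates with inverse-variance weights.

    Returns (pooled_beta, pooled_se, z_score, two_sided_p).
    """
    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z_score = pooled_beta / pooled_se
    # Two-sided P value under the standard normal distribution.
    p_value = math.erfc(abs(z_score) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z_score, p_value


def is_genome_wide_significant(p_value):
    """Check a P value against the genome-wide threshold."""
    return p_value < GENOME_WIDE_THRESHOLD


if __name__ == "__main__":
    # Hypothetical log odds ratios and standard errors from four studies.
    beta, se, z, p = fixed_effects_meta(
        [0.12, 0.15, 0.10, 0.14], [0.04, 0.05, 0.03, 0.06]
    )
    print(f"beta={beta:.4f} se={se:.4f} z={z:.2f} p={p:.2e}")
    print("genome-wide significant:", is_genome_wide_significant(p))
```

Studies with smaller standard errors dominate the pooled estimate, which is why large cohorts drive the combined signal under a fixed-effects model.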
Abstract:
MOTIVATION: Data from RNA-seq experiments provide us with many new possibilities to gain insights into biological and disease mechanisms of cellular functioning. However, the reproducibility and robustness of RNA-seq data analysis results is often unclear. This is in part attributed to the two counteracting goals of (a) a cost-efficient and (b) an optimal experimental design, leading to a compromise, e.g., in the sequencing depth of experiments.
RESULTS: We introduce an R package called samExploreR that allows the subsampling (m out of n bootstrapping) of short reads based on SAM files, facilitating the investigation of sequencing depth related questions for the experimental design. Overall, this provides a systematic way for exploring the reproducibility and robustness of general RNA-seq studies. We exemplify the usage of samExploreR by studying the influence of the sequencing depth and the annotation on the identification of differentially expressed genes.
AVAILABILITY: samExploreR is available as an R package from Bioconductor (after acceptance of the paper, download link: http://www.bio-complexity.com/samExploreR_1.0.0.tar.gz).
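The subsampling idea can be sketched in a few lines of Python. This does not use or reproduce the samExploreR API (which is an R package); the function name `subsample_sam`, its parameters and the toy records are hypothetical. The abstract does not say whether reads are drawn with or without replacement, so the sketch exposes that choice as a flag.

```python
import random


def subsample_sam(lines, fraction, seed=0, replace=False):
    """Keep all SAM header lines and draw m of the n alignment records.

    m is round(fraction * n). With replace=True the draw is a bootstrap-style
    resample; otherwise each record appears at most once.
    """
    header = [ln for ln in lines if ln.startswith("@")]
    reads = [ln for ln in lines if not ln.startswith("@")]
    m = round(fraction * len(reads))
    rng = random.Random(seed)  # seeded so a given subsample is reproducible
    if replace:
        chosen = [rng.choice(reads) for _ in range(m)]
    else:
        chosen = rng.sample(reads, m)
    return header + chosen


if __name__ == "__main__":
    # Toy SAM-like input: one header line and ten simplified alignment records.
    sam_lines = ["@HD\tVN:1.0"] + [f"read{i}\t0\tchr1\t{100 + i}" for i in range(10)]
    for seed in range(3):
        subset = subsample_sam(sam_lines, fraction=0.5, seed=seed)
        print(seed, [ln.split("\t")[0] for ln in subset[1:]])
```

Repeating the draw with different seeds at several fractions, then rerunning the downstream analysis on each subset, is one way to see how stable a result is as sequencing depth drops.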
Abstract:
The annotation of Business Dynamics models with parameters and equations, to simulate the system under study and further evaluate its simulation output, typically involves a lot of manual work. In this paper we present an approach for automated equation formulation of a given Causal Loop Diagram (CLD) and a set of associated time series with the help of neural network evolution (NEvo). NEvo enables the automated retrieval of surrogate equations for each quantity in the given CLD; hence it produces a fully annotated CLD that can be used for later simulations to predict future KPI development. At the end of the paper, we provide a detailed evaluation of NEvo on a business use-case to demonstrate its single-step prediction capabilities.