945 results for Bioinformatics
Abstract:
Early transcriptional activation events that occur in bladder immediately following bacterial urinary tract infection (UTI) are not well defined. In this study, we describe the whole bladder transcriptome of uropathogenic Escherichia coli (UPEC) cystitis in mice using genome-wide expression profiling to define the transcriptome of innate immune activation stemming from UPEC colonization of the bladder. Bladder RNA from female C57BL/6 mice, analyzed using 1.0 ST-Affymetrix microarrays, revealed extensive activation of diverse sets of innate immune response genes, including those that encode multiple IL-family members, receptors, metabolic regulators, MAPK activators, and lymphocyte signaling molecules. These were among 1564 genes differentially regulated at 2 h postinfection, highlighting a rapid and broad innate immune response to bladder colonization. Integrative systems-level analyses using InnateDB (http://www.innatedb.com) bioinformatics and ingenuity pathway analysis identified multiple distinct biological pathways in the bladder transcriptome with extensive involvement of lymphocyte signaling, cell cycle alterations, cytoskeletal, and metabolic changes. A key regulator of IL activity identified in the transcriptome was IL-10, which was analyzed functionally to reveal marked exacerbation of cystitis in IL-10–deficient mice. Studies of clinical UTI revealed significantly elevated urinary IL-10 in patients with UPEC cystitis, indicating a role for IL-10 in the innate response to human UTI. The whole bladder transcriptome presented in this work provides new insight into the diversity of innate factors that determine UTI on a genome-wide scale and will be valuable for further data mining. Identification of protective roles for other elements in the transcriptome will provide critical new insight into the complex cascade of events that underpin UTI.
Abstract:
Background: Designing novel proteins with site-directed recombination has enormous prospects. Locating effective recombination sites for swapping sequence parts dramatically increases the probability that hybrid sequences have the desired properties. The prohibitive requirements for applying current tools led us to investigate machine learning to assist in finding useful recombination sites from amino acid sequence alone. Results: We present STAR (Site Targeted Amino acid Recombination predictor), which produces a score indicating the structural disruption caused by recombination for each position in an amino acid sequence. Example predictions, contrasted with those of alternative tools, illustrate STAR's utility in determining useful recombination sites. Overall, the correlation coefficient between the output of the experimentally validated protein design algorithm SCHEMA and the prediction of STAR is very high (0.89). Conclusion: STAR allows the user to explore useful recombination sites in amino acid sequences with unknown structure and unknown evolutionary origin. The predictor service is available from http://pprowler.itee.uq.edu.au/star.
Abstract:
We present a machine learning model that predicts a structural disruption score from a protein's primary structure. SCHEMA was introduced by Frances Arnold and colleagues as a method for determining putative recombination sites of a protein on the basis of the full (PDB) description of its structure. The present method provides an alternative to SCHEMA that is able to determine the same score from sequence data only. Circumventing the need to resolve the full structure enables the exploration of as yet unresolved and even hypothetical sequences for protein design efforts. Deriving the SCHEMA score from a primary structure is achieved using a two-step approach: first predicting a secondary structure from the sequence, and then predicting the SCHEMA score from the predicted secondary structure. The correlation coefficient for the prediction is 0.88, which indicates the feasibility of replacing SCHEMA with little loss of precision.
Abstract:
Determination of sequence similarity is a central issue in computational biology, a problem addressed primarily through BLAST, an alignment-based heuristic which has underpinned much of the analysis and annotation of the genomic era. Despite their success, alignment-based approaches scale poorly with increasing data set size, and are not robust under structural sequence rearrangements. Successive waves of innovation in sequencing technologies (so-called Next Generation Sequencing, or NGS, approaches) have led to an explosion in data availability, challenging existing methods and motivating novel approaches to sequence representation and similarity scoring, including adaptation of existing methods from other domains such as information retrieval. In this work, we investigate locality-sensitive hashing of sequences through binary document signatures, applying the method to a bacterial protein classification task. Here, the goal is to predict the gene family to which a given query protein belongs. Experiments carried out on a pair of small but biologically realistic datasets (the full protein repertoires of families of Chlamydia and Staphylococcus aureus genomes, respectively) show that a measure of similarity obtained by locality-sensitive hashing gives highly accurate results while offering a number of avenues which will lead to substantial performance improvements over BLAST.
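The idea of comparing sequences through binary signatures can be illustrated with a minimal sketch. It is not the implementation evaluated in the abstract: the k-mer tokenisation, the SimHash-style bit voting, the 64-bit signature width and the function names `kmers`, `signature`, `similarity` and `classify` are all assumptions chosen for illustration.

```python
import hashlib

def kmers(seq, k=3):
    """Return the overlapping k-mers of a sequence as a list."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def signature(seq, k=3, bits=64):
    """Build a binary signature: each k-mer votes +1/-1 on every bit position.

    Assumption: a SimHash-style scheme; bits must be a multiple of 8, at most 256.
    """
    votes = [0] * bits
    for kmer in kmers(seq, k):
        # A stable hash of the k-mer supplies one pseudo-random bit per position.
        digest = hashlib.sha256(kmer.encode()).digest()[:bits // 8]
        value = int.from_bytes(digest, "big")
        for b in range(bits):
            votes[b] += 1 if (value >> b) & 1 else -1
    # Keep only the sign of each vote total, packed into an integer.
    return sum(1 << b for b in range(bits) if votes[b] > 0)

def similarity(sig_a, sig_b, bits=64):
    """Fraction of signature bits that agree (1 - normalised Hamming distance)."""
    return 1 - bin(sig_a ^ sig_b).count("1") / bits

def classify(query, labelled, k=3, bits=64):
    """Return the label of the reference whose signature is closest to the query.

    `labelled` is a list of (sequence, family_label) pairs (hypothetical format).
    """
    q = signature(query, k, bits)
    best = max(labelled, key=lambda item: similarity(q, signature(item[0], k, bits)))
    return best[1]
```

Because signatures are fixed-width integers, comparing two proteins costs one XOR and a bit count regardless of sequence length. In practice the reference signatures would be precomputed once rather than rebuilt for every query as they are in this sketch.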
Abstract:
Motivation: Shotgun sequence read data derived from xenograft material contains a mixture of reads arising from the host and reads arising from the graft. Classifying the read mixture to separate the two allows for more precise analysis to be performed. Results: We present a technique, with an associated tool, Xenome, which performs fast, accurate and specific classification of xenograft-derived sequence read data. We have evaluated it on RNA-Seq data from human, mouse and human-in-mouse xenograft datasets.
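The abstract does not describe how reads are classified, so the following is only a toy sketch of one possible approach and should not be read as Xenome's algorithm. It assumes that reads are assigned by looking up their k-mers in host and graft reference k-mer sets; the function names, the class labels and the value of k are hypothetical.

```python
def kmer_set(seq, k):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify_read(read, host_kmers, graft_kmers, k=5):
    """Assign a read to a class from the origin of its k-mers (assumed scheme)."""
    kms = kmer_set(read, k)
    host_only = any(km in host_kmers and km not in graft_kmers for km in kms)
    graft_only = any(km in graft_kmers and km not in host_kmers for km in kms)
    shared = any(km in host_kmers and km in graft_kmers for km in kms)
    if host_only and graft_only:
        return "ambiguous"   # conflicting evidence for both genomes
    if host_only:
        return "host"
    if graft_only:
        return "graft"
    if shared:
        return "both"        # only k-mers common to both genomes were seen
    return "neither"         # no k-mer matched either reference

# Toy reference sequences standing in for host and graft genomes.
host_kmers = kmer_set("ACGTACGTAA", 5)
graft_kmers = kmer_set("TTGCATTGCA", 5)
```

Calling `classify_read("ACGTAC", host_kmers, graft_kmers)` assigns the read to the host, because both of its 5-mers occur only in the host reference.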
Abstract:
Advances in modern information and communication technology (ICT) continue to address the challenges faced by survivors of chronic diseases such as prostate cancer and to improve their health outcomes. Managing survivorship is becoming an increasingly important need for survivors, who must manage their chronic conditions. Technology interventions such as tele-health and self-managed technology applications have shown potential to improve survivorship outcomes. However, the application of these tools should be supported by strong health economics evidence. This work discusses the challenges of technology-led survivorship care models and presents an integrated approach to address these challenges.
Abstract:
Descriptions of patients' injuries are recorded in narrative text form by hospital emergency departments. For statistical reporting, this text data needs to be mapped to pre-defined codes. Existing research in this field uses the Naïve Bayes probabilistic method to build classifiers for mapping. In this paper, we focus on providing guidance on the selection of a classification method. We build a number of classifiers belonging to different classification families, such as decision tree, probabilistic, neural network, instance-based, ensemble-based and kernel-based linear classifiers. Extensive pre-processing is carried out to ensure the quality of the data and, hence, the quality of the classification outcome. Records with a null entry in the injury description are removed. Misspellings are corrected by finding each misspelt word and replacing it with a sound-alike word. Meaningful phrases are identified and kept, instead of removing part of a phrase as a stop word. Abbreviations appearing in many forms of entry are manually identified, and only one form of each abbreviation is used. Clustering is utilised to discriminate between non-frequent and frequent terms. This process reduced the number of text features dramatically, from about 28,000 to 5,000. The medical narrative text injury dataset under consideration is composed of many short documents. The data can be characterized as high-dimensional and sparse, i.e., few features are irrelevant, but features are correlated with one another. Therefore, matrix factorization techniques such as Singular Value Decomposition (SVD) and Non-Negative Matrix Factorization (NNMF) have been used to map the processed feature space to a lower-dimensional feature space. Classifiers have been built on these reduced feature spaces. In experiments, a set of tests is conducted to determine which classification method is best for medical text classification.
Non-Negative Matrix Factorization with a Support Vector Machine achieves 93% precision, which is higher than all the tested traditional classifiers. We also found that TF/IDF weighting, which works well for long text classification, is inferior to binary weighting in short document classification. Another finding is that the Top-n terms should be removed in consultation with medical experts, as their removal affects the classification performance.
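The two term-weighting schemes compared in the abstract can be sketched in a few lines. This is an illustrative sketch, not the authors' pipeline: whitespace tokenisation, the idf formula log(N/df) and the function names `binary_weights` and `tfidf_weights` are assumptions, and the three example narratives are invented.

```python
import math
from collections import Counter

def binary_weights(docs):
    """Weight 1.0 if a term occurs in the document, else 0.0."""
    tokenised = [set(d.split()) for d in docs]
    vocab = sorted(set().union(*tokenised))
    rows = [[1.0 if t in toks else 0.0 for t in vocab] for toks in tokenised]
    return vocab, rows

def tfidf_weights(docs):
    """Weight = term frequency * log(N / document frequency) (assumed variant)."""
    tokenised = [d.split() for d in docs]
    vocab = sorted({t for toks in tokenised for t in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for toks in tokenised for t in set(toks))
    rows = []
    for toks in tokenised:
        tf = Counter(toks)
        rows.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, rows

# Invented short narratives, used only to show the two weightings.
docs = ["fell off ladder", "fell from bike", "cut finger knife"]
vocab, binary_rows = binary_weights(docs)
_, tfidf_rows = tfidf_weights(docs)
```

In these short texts every term occurs at most once per document, so the TF/IDF weight of a term reduces to its idf factor, whereas binary weighting treats all present terms equally.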
Abstract:
This thesis presents a novel program parallelization technique that incorporates both dynamic and static scheduling. It utilizes a problem-specific pattern developed from prior knowledge of the targeted problem abstraction. Suitable for solving complex parallelization problems such as data-intensive all-to-all comparison constrained by memory, the technique delivers more robust and faster task scheduling than state-of-the-art techniques. The technique achieves good performance in data-intensive bioinformatics applications.
Abstract:
Circos plots are graphical outputs that display three-dimensional chromosomal interactions and fusion transcripts. However, the Circos plot tool is not an interactive visualization tool, but rather a figure generator. For example, it does not enable data to be added dynamically, nor does it provide information for specific data points interactively. Recently, an R-based Circos tool (RCircos) has been developed to integrate Circos into R, but similarly, RCircos can only be used to generate plots. Thus, we have developed a Circos plot tool (J-Circos), an interactive visualization tool that can plot Circos figures, dynamically add data to the figure, and provide information for specific data points using mouse hover display and zoom in/out functions. J-Circos is written in Java so that it can be used on most operating systems (Windows, MacOS, Linux). Users can input data into J-Circos using flat data formats, as well as from the GUI. J-Circos will enable biologists to better study more complex chromosomal interactions and fusion transcripts that are otherwise difficult to visualize from next-generation sequencing data.
Abstract:
The function of a protein can be partially determined by the information contained in its amino acid sequence. It can be assumed that proteins with similar amino acid sequences normally have similar functions. Hence, analysing the similarity of proteins has become one of the most important areas of protein study. In this work, a layered comparison method is used to analyse the similarity of proteins. It is based on the empirical mode decomposition (EMD) method, and protein sequences are characterized by their intrinsic mode functions (IMFs). The similarity of proteins is studied with a new cross-correlation formula. The results suggest that the EMD method can be used to detect the functional relationship between two proteins. This kind of similarity method complements traditional sequence similarity approaches, which focus on the alignment of amino acids.
Abstract:
Epigenetic changes correspond to heritable modifications of the chromosome structure, which do not involve alteration of the DNA sequence but do affect gene expression. These mechanisms play an important role in normal cell differentiation, but their aberration is also associated with several diseases, including cancer and neural disorders. Despite intensive study in recent years, the contribution of these modifications remains largely unquantified because of overall system complexity and insufficient data. Computational models can provide powerful auxiliary tools to experimentation, not least because they can span scales from the sub-cellular to cell populations (or to networks of genes). In this paper, the challenges to developing realistic cross-scale models are discussed and illustrated with respect to current work.