917 results for High-throughput
Abstract:
High-throughput SNP arrays provide genotype estimates for up to one million loci and are widely used in genome-wide association studies. While these estimates are typically very accurate, genotyping errors do occur, and they can particularly influence the most extreme test statistics and p-values. Estimates of the genotype uncertainties are also available, although they are typically ignored. In this manuscript, we develop a framework to incorporate these genotype uncertainties in case-control studies for any genetic model. We verify that the assumption of a "local alternative" in the score test is very reasonable for effect sizes typically seen in SNP association studies, and show that the power of the score test is simply a function of the correlation of the genotype probabilities with the true genotypes. We demonstrate that the power to detect a true association can be substantially increased for difficult-to-call genotypes, resulting in improved inference in association studies.
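The core computation is simple enough to sketch. Below is a minimal illustration, not the authors' implementation: a 1-degree-of-freedom score test for case-control association in which the hard genotype call is replaced by the expected allele dosage computed from the genotype probabilities, under an additive model with no covariates. The Dirichlet-simulated probabilities are purely illustrative.

```python
import numpy as np
from scipy import stats

def score_test_dosage(geno_probs, y):
    """geno_probs: (n, 3) genotype probabilities P(AA), P(AB), P(BB) per subject.
    y: (n,) binary phenotype (1 = case, 0 = control)."""
    x = geno_probs @ np.array([0.0, 1.0, 2.0])   # expected allele dosage E[G]
    ybar = y.mean()
    u = np.sum((y - ybar) * x)                   # score under H0: no association
    v = ybar * (1 - ybar) * np.sum((x - x.mean()) ** 2)
    chi2 = u ** 2 / v
    return chi2, stats.chi2.sf(chi2, df=1)

# Toy example with simulated uncertain genotypes
rng = np.random.default_rng(0)
probs = rng.dirichlet([1.0, 1.0, 1.0], size=500)
y = rng.integers(0, 2, size=500)
print(score_test_dosage(probs, y))
```

Intuitively, when genotypes are called with certainty the dosage equals the true genotype and this reduces to the usual score test; as the probabilities become less correlated with the truth, power degrades accordingly.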
Abstract:
High-throughput gene expression technologies such as microarrays have been used in a variety of scientific applications. Most of the work has focused on assessing univariate associations between gene expression and clinical outcome (variable selection) or on developing classification procedures with gene expression data (supervised learning). We consider a hybrid variable selection/classification approach based on linear combinations of the gene expression profiles that maximize an accuracy measure summarized by the receiver operating characteristic curve. Under a specific probability model, this leads to consideration of linear discriminant functions. We incorporate an automated variable selection approach using the LASSO. An equivalence between LASSO estimation and support vector machines allows for model fitting using standard software. We apply the proposed method to simulated data as well as data from a recently published prostate cancer study.
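As a rough sketch of the hybrid selection/classification idea (using an L1-penalized logistic model rather than the paper's exact LASSO/SVM formulation), the following fits a sparse linear rule to expression data and summarizes accuracy by the AUC; the data and dimensions are simulated placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1000))            # 100 samples x 1000 genes
beta = np.zeros(1000)
beta[:5] = 1.5                              # only 5 genes carry signal
y = (X @ beta + rng.normal(size=100) > 0).astype(int)

# The L1 penalty performs variable selection and classification in one fit
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(clf.coef_)        # genes retained by the penalty
print(len(selected), roc_auc_score(y, clf.decision_function(X)))
```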
Abstract:
The last few years have seen the advent of high-throughput technologies to analyze various properties of the transcriptome and proteome of several organisms. The congruency of these different data sources, or lack thereof, can shed light on the mechanisms that govern cellular function. A central challenge for bioinformatics research is to develop a unified framework for combining the multiple sources of functional genomics information and testing associations between them, thus obtaining a robust and integrated view of the underlying biology. We present a graph theoretic approach to test the significance of the association between multiple disparate sources of functional genomics data by proposing two statistical tests, namely edge permutation and node label permutation tests. We demonstrate the use of the proposed tests by finding significant association between a Gene Ontology-derived "predictome" and data obtained from mRNA expression and phenotypic experiments for Saccharomyces cerevisiae. Moreover, we employ the graph theoretic framework to recast a surprising discrepancy presented in Giaever et al. (2002) between gene expression and knockout phenotype, using expression data from a different set of experiments.
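A node-label permutation test is straightforward to sketch. In the toy version below (an illustration, not the authors' software), edges come from one data source, labels from another, and the statistic counts edges joining same-labeled nodes; shuffling labels over nodes generates the null distribution. An edge permutation test would instead rewire edges while preserving the degree sequence.

```python
import random

def label_concordance(edges, labels):
    # statistic: number of edges whose endpoints share a label
    return sum(labels[u] == labels[v] for u, v in edges)

def node_label_permutation_test(edges, labels, n_perm=10000, seed=0):
    rng = random.Random(seed)
    observed = label_concordance(edges, labels)
    nodes = list(labels)
    pool = list(labels.values())
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pool)                      # permute labels over nodes
        if label_concordance(edges, dict(zip(nodes, pool))) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)  # one-sided permutation p-value

edges = [("g1", "g2"), ("g2", "g3"), ("g1", "g4")]
labels = {"g1": "lethal", "g2": "lethal", "g3": "viable", "g4": "viable"}
print(node_label_permutation_test(edges, labels))
```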
Abstract:
Advances in computational biology have made simultaneous monitoring of thousands of features possible. High-throughput technologies not only bring about a much richer information context in which to study various aspects of gene function, but also present the challenge of analyzing data with a large number of covariates and few samples. As an integral part of machine learning, classification of samples into two or more categories is almost always of interest to scientists. In this paper, we address the question of classification in this setting by extending partial least squares (PLS), a popular dimension-reduction tool in chemometrics, to the context of generalized linear regression, building on a previous approach, Iteratively ReWeighted Partial Least Squares (IRWPLS; Marx, 1996). We compare our results with two-stage PLS (Nguyen and Rocke, 2002a, 2002b) and other classifiers. We show that by phrasing the problem in a generalized linear model setting and applying a bias correction to the likelihood to avoid (quasi)separation, we often obtain lower classification error rates.
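The IRWPLS idea can be sketched compactly: run the usual IRLS iterations for logistic regression, but replace each weighted least-squares solve with a weighted PLS fit so the procedure remains well defined when covariates far outnumber samples. The sketch below is a simplified rendering, not Marx's or the authors' code, and it substitutes a crude clipping of the linear predictor for the likelihood bias correction mentioned above.

```python
import numpy as np

def irwpls(X, y, n_components=3, n_iter=25):
    """Iteratively ReWeighted PLS for a binary outcome (simplified sketch)."""
    n, p = X.shape
    eta = np.zeros(n)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-eta))              # logistic inverse link
        w = np.clip(mu * (1.0 - mu), 1e-4, None)     # IRLS weights
        z = eta + (y - mu) / w                       # working response
        Xd = X - np.average(X, axis=0, weights=w)
        zd = z - np.average(z, weights=w)
        eta = np.full(n, np.average(z, weights=w))
        for _ in range(n_components):                # weighted PLS with deflation
            r = Xd.T @ (w * zd)                      # weighted covariance direction
            t = Xd @ r                               # component score
            denom = (w * t) @ t + 1e-12
            b = (w * t) @ zd / denom
            eta = eta + b * t
            zd = zd - b * t                          # deflate response
            Xd = Xd - np.outer(t, (w * t) @ Xd / denom)  # deflate predictors
        eta = np.clip(eta, -8, 8)   # crude stand-in for the bias correction
    return eta

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 200))                       # p >> n setting
y = (X[:, 0] - X[:, 1] + rng.normal(size=60) > 0).astype(float)
print(np.mean((irwpls(X, y) > 0) == (y == 1)))       # training accuracy
```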
Abstract:
Submicroscopic changes in chromosomal DNA copy number are common and have been implicated in many heritable diseases and cancers. Recent high-throughput technologies have a resolution that permits the detection of segmental changes in DNA copy number that span thousands of base pairs across the genome. Genome-wide association studies (GWAS) may simultaneously screen for copy number-phenotype and SNP-phenotype associations as part of the analytic strategy. However, genome-wide array analyses are particularly susceptible to batch effects, as the logistics of preparing DNA and processing thousands of arrays often involve multiple laboratories and technicians, or changes over calendar time to the reagents and laboratory equipment. Failure to adjust for batch effects can lead to incorrect inference and requires inefficient post hoc quality control procedures that exclude regions associated with batch. Our work extends previous model-based approaches to copy number estimation by explicitly modeling batch effects and using shrinkage to improve locus-specific estimates of copy number uncertainty. Key features of this approach include the use of diallelic genotype calls from experimental data to estimate batch- and locus-specific parameters of background and signal without requiring training data. We illustrate these ideas using a study of bipolar disease and a study of chromosome 21 trisomy. The former has batch effects that dominate much of the observed variation in quantile-normalized intensities, while the latter illustrates the robustness of our approach to datasets in which as many as 25% of the samples have altered copy number. Locus-specific estimates of copy number can be plotted on the copy-number scale to investigate mosaicism and to guide the choice of appropriate downstream approaches for smoothing copy number as a function of physical position. The software is open source and implemented in the R package CRLMM, available from Bioconductor (http://www.bioconductor.org).
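The shrinkage component of such a model can be illustrated with a deliberately stripped-down sketch (not the CRLMM model itself): batch-specific intensity means for one locus and genotype class are pulled toward the across-batch average, so that sparsely observed batches borrow strength from the rest. The prior_weight constant is an assumed tuning parameter.

```python
import numpy as np

def shrink_batch_means(intensities, batches, prior_weight=10.0):
    """intensities: (n,) normalized intensities for one locus/genotype class.
    batches: (n,) batch label per sample.
    prior_weight: pseudo-observations given to the grand mean (assumed)."""
    grand = intensities.mean()
    return {
        b: (intensities[batches == b].sum() + prior_weight * grand)
           / ((batches == b).sum() + prior_weight)
        for b in np.unique(batches)
    }

rng = np.random.default_rng(3)
batches = np.repeat(["plate1", "plate2", "plate3"], [50, 50, 5])
x = rng.normal([0.0] * 50 + [0.3] * 50 + [0.5] * 5, 0.2)
print(shrink_batch_means(x, batches))   # plate3 (n=5) is shrunk the most
```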
Abstract:
Functional neuroimaging techniques enable investigations into the neural basis of human cognition, emotions, and behaviors. In practice, applications of functional magnetic resonance imaging (fMRI) have provided novel insights into the neuropathophysiology of major psychiatric, neurological, and substance abuse disorders, as well as into the neural responses to their treatments. Modern activation studies often compare localized task-induced changes in brain activity between experimental groups. One may also extend voxel-level analyses by simultaneously considering the ensemble of voxels constituting an anatomically defined region of interest (ROI) or by considering means or quantiles of the ROI. In this work we present a Bayesian extension of voxel-level analyses that offers several notable benefits. First, it combines whole-brain voxel-by-voxel modeling and ROI analyses within a unified framework. Second, an unstructured variance/covariance matrix for regional mean parameters allows for the study of inter-regional functional connectivity, provided enough subjects are available to allow for accurate estimation. Finally, an exchangeable correlation structure within regions allows for the consideration of intra-regional functional connectivity. We perform estimation for our model using Markov chain Monte Carlo (MCMC) techniques implemented via Gibbs sampling which, despite the high-throughput nature of the data, can be executed quickly (in less than 30 minutes). We apply our Bayesian hierarchical model to two novel fMRI data sets: one considering inhibitory control in cocaine-dependent men and the second considering verbal memory in subjects at high risk for Alzheimer's disease. The unifying hierarchical model presented in this manuscript is shown to enhance the interpretation of these data sets.
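To make the estimation step concrete, here is a Gibbs sampler for a deliberately reduced two-level model: voxel effects theta_v ~ N(mu, tau^2) within one region, and data y_v ~ N(theta_v, sigma^2). It shows the kind of conjugate updates such samplers cycle through; the paper's model, with its unstructured inter-regional covariance, is substantially richer.

```python
import numpy as np

def gibbs(y, sigma2=1.0, n_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    V = len(y)
    mu, tau2 = 0.0, 1.0
    mu_draws = []
    for _ in range(n_iter):
        # voxel effects | mu, tau2: conjugate normal update
        prec = 1 / sigma2 + 1 / tau2
        theta = rng.normal((y / sigma2 + mu / tau2) / prec, np.sqrt(1 / prec))
        # regional mean | theta (flat prior on mu)
        mu = rng.normal(theta.mean(), np.sqrt(tau2 / V))
        # between-voxel variance | theta, mu: inverse-gamma update, IG(1, 1) prior
        tau2 = 1 / rng.gamma(V / 2 + 1, 1 / (np.sum((theta - mu) ** 2) / 2 + 1))
        mu_draws.append(mu)
    return np.array(mu_draws)

y = np.random.default_rng(1).normal(0.5, 1.0, size=200)  # toy voxel contrasts
print(gibbs(y)[1000:].mean())    # posterior mean of the regional effect
```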
Abstract:
Functional magnetic resonance imaging (fMRI) is a non-invasive technique commonly used to quantify changes in blood oxygenation and flow coupled to neuronal activation. One of the primary goals of fMRI studies is to identify localized brain regions where neuronal activation levels vary between groups. Single-voxel t-tests have been commonly used to determine whether activation related to the protocol differs across groups. Due to the generally limited number of subjects within each study, accurate estimation of variance at each voxel is difficult. Thus, combining information across voxels in the statistical analysis of fMRI data is desirable in order to improve efficiency. Here we construct a hierarchical model and apply an empirical Bayes framework to the analysis of group fMRI data, employing techniques used in high-throughput genomic studies. The key idea is to shrink residual variances by combining information across voxels, and subsequently to construct an improved test statistic in lieu of the classical t-statistic. This hierarchical model results in a shrinkage of voxel-wise residual sample variances towards a common value. The shrunken estimator for voxel-specific variance components in the group analyses outperforms the classical residual error estimator in terms of mean squared error. Moreover, the shrunken test statistic decreases the false positive rate when testing differences in brain contrast maps across a wide range of simulation studies. This methodology was also applied to experimental data regarding a cognitive activation task.
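The shrinkage step can be written in a few lines. The sketch below is in the spirit of the moderated t-statistics used in genomics: each voxel's residual variance is combined with a common prior value before the test statistic is formed. Here the prior degrees of freedom d0 and prior variance s0 are fixed, assumed constants; an empirical Bayes treatment would estimate them from all voxels.

```python
import numpy as np

def moderated_t(diff_means, s2, n1, n2, d0=4.0, s0=0.05):
    """diff_means: voxel-wise group mean differences; s2: pooled sample variances."""
    d = n1 + n2 - 2                              # residual df per voxel
    s2_shrunk = (d0 * s0 + d * s2) / (d0 + d)    # shrink variances toward s0
    se = np.sqrt(s2_shrunk * (1 / n1 + 1 / n2))
    return diff_means / se

rng = np.random.default_rng(4)
s2 = 0.05 * rng.chisquare(10, size=5000) / 10    # simulated voxel variances
diff = rng.normal(0, np.sqrt(0.05 * (1 / 12 + 1 / 12)), size=5000)  # null voxels
t_mod = moderated_t(diff, s2, n1=12, n2=12)
print(np.mean(np.abs(t_mod) > 2))   # fraction of voxels exceeding |t| = 2
```

Because under-estimated variances no longer produce inflated statistics, the shrunken denominator tames the extreme tail that drives false positives.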
Abstract:
Genotyping platforms such as Affymetrix can be used to assess genotype-phenotype as well as copy number-phenotype associations at millions of markers. While genotyping algorithms are largely concordant when assessed on HapMap samples, tools to assess copy number changes are more variable and often discordant. One explanation for the discordance is that copy number estimates are susceptible to systematic differences between groups of samples that were processed at different times or by different labs. Analysis algorithms that do not adjust for batch effects are prone to spurious measures of association. The R package crlmm implements a multilevel model that adjusts for batch effects and provides allele-specific estimates of copy number. This paper illustrates a workflow for the estimation of allele-specific copy number, develops marker- and study-level summaries of batch effects, and demonstrates how the marker-level estimates can be integrated with complementary Bioconductor software for inferring regions of copy number gain or loss. All analyses are performed in the statistical environment R. A compendium for reproducing the analysis is available from the author's website (http://www.biostat.jhsph.edu/~rscharpf/crlmmCompendium/index.html).
Abstract:
The detection of virulence determinants harbored by pathogenic Escherichia coli is important for establishing the pathotype responsible for infection. A sensitive and specific miniaturized virulence microarray containing 60 oligonucleotide probes was developed. It detected six E. coli pathotypes and will be suitable in the future for high-throughput use.
Abstract:
Cu is an essential nutrient for man, but can be toxic if intakes are too high. In sensitive populations, marginal over- or under-exposure can have detrimental effects. Malnourished children, the elderly, and pregnant or lactating females may be susceptible to Cu deficiency. Cu status and exposure in the population cannot currently be measured easily, as neither plasma Cu nor plasma cuproenzymes reflect Cu status precisely. Some blood markers (such as ceruloplasmin) indicate severe Cu depletion, but do not inversely respond to Cu excess, and are not suitable for indicating marginal states. A biomarker of Cu is needed that is sensitive to small changes in Cu status, and that responds to Cu excess as well as deficiency. Such a marker will aid in monitoring Cu status in large populations, and will help to avoid chronic health effects (for example, liver damage in chronic toxicity; osteoporosis, loss of collagen stability, or increased susceptibility to infections in deficiency). The advent of high-throughput technologies has enabled us to screen for potential biomarkers in the whole proteome of a cell, not excluding markers that have no direct link to Cu. Further, this screening allows us to search for a whole group of proteins that, in combination, reflect Cu status. The present review emphasises the need to find sensitive biomarkers for Cu, examines potential markers of Cu status already available, and discusses methods to identify a novel suite of biomarkers.
Abstract:
Trypanosoma brucei rhodesiense and T. b. gambiense are the causative agents of sleeping sickness, a fatal disease that affects 36 countries in sub-Saharan Africa. Nevertheless, only a handful of clinically useful drugs are available, and these drugs suffer from severe side-effects. The situation is further aggravated by the alarming incidence of treatment failures in several sleeping sickness foci, apparently indicating the occurrence of drug-resistant trypanosomes. For these reasons, and since vaccination does not appear to be feasible due to the trypanosomes' ever-changing coat of variable surface glycoproteins (VSGs), new drugs are needed urgently. The entry of Trypanosoma brucei into the post-genomic age raises hopes for the identification of novel kinds of drug targets and, in turn, new treatments for sleeping sickness. The pragmatic definition of a drug target is a protein that is essential for the parasite and does not have homologues in the host. Such proteins are identified by comparing the predicted proteomes of T. brucei and Homo sapiens, then validated by large-scale gene disruption or gene silencing experiments in trypanosomes. Once all proteins that are essential and unique to the parasite are identified, inhibitors may be found by high-throughput screening. However powerful, this functional genomics approach is going to miss a number of attractive targets. Several current, successful parasiticides attack proteins that have close homologues in the human proteome. Drugs like DFMO or pyrimethamine inhibit parasite and host enzymes alike; a therapeutic window is opened only by subtle differences in the regulation of the targets, which cannot be recognized in silico. Working against the post-genomic approach is also the fact that essential proteins tend to be more highly conserved between species than non-essential ones. Here we advocate drug targeting, i.e. uptake or activation of a drug via parasite-specific pathways, as a chemotherapeutic strategy to selectively inhibit enzymes that have equally sensitive counterparts in the host. The T. brucei purine salvage machinery offers opportunities for both metabolic and transport-based targeting: unusual nucleoside and nucleobase permeases may be exploited for selective import, and salvage enzymes for selective activation of purine antimetabolites.
Abstract:
In this study, we present a novel genotyping scheme to classify German wild-type varicella-zoster virus (VZV) strains and to differentiate them from the Oka vaccine strain (genotype B). This approach is based on analysis of four loci in open reading frames (ORFs) 51 to 58, encompassing a total length of 1,990 bp. The new genotyping scheme produced identical clusters in phylogenetic analyses compared to full-genome sequences from well-characterized VZV strains. Based on genotype A, D, B, and C reference strains, a dichotomous identification key (DIK) was developed and applied for VZV strains obtained from vesicle fluid and liquor samples originating from 42 patients suffering from varicella or zoster between 2003 and 2006. Sequencing of regions in ORFs 51, 52, 53, 56, 57, and 58 identified 18 single-nucleotide polymorphisms (SNPs), including two novel ones, SNP 89727 and SNP 92792 in ORF51 and ORF52, respectively. The DIK as well as phylogenetic analysis by Bayesian inference showed that 14 VZV strains belonged to genotype A, and 28 VZV strains were classified as genotype D. Neither Japanese (vaccine)-like B strains nor recombinant-like C strains were found within the samples from Germany. The novel genotyping scheme and the DIK were demonstrated to be practical and simple and allow the highly efficient replication of phylogenetic patterns in VZV initially derived from full-genome DNA sequence analyses. Therefore, this approach may allow us to draw a more comprehensive picture of wild-type VZV strains circulating in Germany and Central Europe by high-throughput procedures in the future.
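For illustration only, a dichotomous identification key is just a sequence of binary decisions on marker states. The sketch below shows the shape of such a key in code; the branching alleles and the third marker position are hypothetical placeholders, not the published key.

```python
def classify_vzv(snps):
    """snps: dict mapping genome positions to observed bases.
    Positions 89727 and 92792 are the novel SNPs named above; the alleles
    used at each branch and position 100000 are made-up placeholders."""
    if snps.get(89727) == "A":                 # first dichotomy (assumed allele)
        return "genotype A" if snps.get(92792) == "G" else "genotype D"
    return "genotype B" if snps.get(100000) == "T" else "genotype C"

print(classify_vzv({89727: "A", 92792: "G"}))   # -> "genotype A"
```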
Abstract:
Biofuels are an increasingly important component of worldwide energy supply. This research aims to understand the pathways and impacts of biofuels production, and to improve these processes to make them more efficient. In Chapter 2, a life cycle assessment (LCA) is presented for cellulosic ethanol production from five potential feedstocks of regional importance to the upper Midwest (hybrid poplar, hybrid willow, switchgrass, diverse prairie grasses, and logging residues), according to the requirements of the Renewable Fuel Standard (RFS). Direct land use change emissions are included for the conversion of abandoned agricultural land to feedstock production, and computer models of the conversion process are used to determine the effect of varying biomass composition on overall life cycle impacts. All scenarios analyzed here result in greater than a 60% reduction in greenhouse gas emissions relative to petroleum gasoline. Land use change effects were found to contribute significantly to the overall emissions for the first 20 years after plantation establishment. Chapter 3 is an investigation of the effects of biomass mixtures on overall sugar recovery from the combined processes of dilute acid pretreatment and enzymatic hydrolysis. The biomass mixtures studied were aspen, a hardwood species well suited to biochemical processing; balsam, a high-lignin softwood species; and switchgrass, an herbaceous energy crop with high ash content. A matrix of three dilute acid pretreatment severities and three enzyme loading levels was used to characterize interactions between pretreatment and enzymatic hydrolysis. The maximum glucose yield for any species was 70% of theoretical, for switchgrass, and the maximum xylose yield was 99.7% of theoretical, for aspen. Supplemental β-glucosidase increased glucose yield from enzymatic hydrolysis by an average of 15%, and total sugar recoveries for mixtures could be predicted to within 4% by linear interpolation of the pure-species results. Chapter 4 is an evaluation of the potential for producing Trichoderma reesei cellulose hydrolases in the Kluyveromyces lactis yeast expression system. The exoglucanases Cel6A and Cel7A and the endoglucanase Cel7B were inserted separately into K. lactis, and the enzymes were analyzed for activity on various substrates. Recombinant Cel7B was found to be active on carboxymethyl cellulose and Avicel powdered cellulose substrates. Recombinant Cel6A was also found to be active on Avicel. Recombinant Cel7A was produced, but no enzymatic activity was detected on any substrate. Chapter 5 presents a new method for enzyme improvement studies using enzyme co-expression and yeast growth rate measurements as a potential high-throughput expression and screening system in K. lactis yeast. Two different K. lactis strains were evaluated for their usefulness in growth screening studies: one wild-type strain and one in which the main galactose metabolic pathway has been disabled. Sequential transformation and co-expression of the exoglucanase Cel6A and endoglucanase Cel7B was performed, and improved hydrolysis rates on Avicel were detectable in the cell culture supernatant. Future work should focus on hydrolysis of natural substrates, developing the growth screening method, and utilizing the K. lactis expression system for directed evolution of enzymes.
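The mixture prediction rule from Chapter 3 amounts to a mass-weighted average of the pure-species recoveries. A worked example, with placeholder recovery values rather than the study's measurements:

```python
# Linear interpolation of pure-species sugar recoveries for a blend.
# Recovery fractions below are illustrative placeholders, not measured values.
fractions = {"aspen": 0.50, "balsam": 0.25, "switchgrass": 0.25}   # mass basis
pure_recovery = {"aspen": 0.72, "balsam": 0.45, "switchgrass": 0.65}

predicted = sum(fractions[s] * pure_recovery[s] for s in fractions)
print(f"predicted blend recovery: {predicted:.1%}")  # reported accurate to ~4%
```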
Abstract:
Synthetic oligonucleotides and peptides have found wide applications in industry and academic research labs. There are ~60 peptide drugs on the market and over 500 under development. The global annual sale of peptide drugs in 2010 was estimated to be $13 billion. There are three oligonucleotide-based drugs on the market; among them, the newly FDA-approved Kynamro was predicted to reach $100 million in annual sales. The annual sale of oligonucleotides to academic labs was estimated to be $700 million. Both bio-oligomers are mostly synthesized on automated synthesizers using solid phase synthesis technology, in which nucleoside or amino acid monomers are added sequentially until the desired full-length sequence is reached. The additions cannot be complete, which generates truncated, undesired failure sequences. For almost all applications, these impurities must be removed. The most widely used method is HPLC. However, the method is slow, expensive, labor-intensive, not amenable to automation, difficult to scale up, and unsuitable for high-throughput purification. It requires large capital investment and consumes large volumes of harmful solvents. Purification costs are estimated to be more than 50% of total production costs. Other methods for bio-oligomer purification also have drawbacks, and are less favored than HPLC for most applications. To overcome the problems of known biopolymer purification technologies, we have developed two non-chromatographic purification methods: (1) catching failure sequences by polymerization, and (2) catching full-length sequences by polymerization. In the first method, a polymerizable group is attached to the failure sequences of the bio-oligomers during automated synthesis; purification is achieved by simply polymerizing the failure sequences into an insoluble gel and extracting the full-length sequences. In the second method, a polymerizable group is attached to the full-length sequences, which are then incorporated into a polymer; impurities are removed by washing, and the pure product is cleaved from the polymer. These methods do not require chromatography, and the drawbacks of HPLC no longer apply. Using them, purification is achieved by simple manipulations such as shaking and extraction. They are therefore suitable for large-scale purification of oligonucleotide and peptide drugs, and also ideal for high-throughput purification, which is currently in high demand for research projects involving total gene synthesis. This dissertation presents the details of the development of these techniques. Chapter 1 introduces oligodeoxynucleotides (ODNs) and their synthesis and purification. Chapter 2 describes detailed studies of using the catching-failure-sequences-by-polymerization method to purify ODNs. Chapter 3 describes further optimization of this ODN purification technology to the level of practical use. Chapter 4 presents the use of the catching-full-length-sequences-by-polymerization method for ODN purification with an acid-cleavable linker. Chapter 5 introduces peptides and their synthesis and purification. Chapter 6 describes studies using the catching-full-length-sequences-by-polymerization method for peptide purification.
Abstract:
Radiation metabolomics employing mass spectral technologies represents a plausible means of high-throughput, minimally invasive radiation biodosimetry. A simplified metabolomics protocol is described that employs ubiquitous gas chromatography-mass spectrometry and open-source software, including the random forests machine learning algorithm, to uncover latent biomarkers of 3 Gy gamma radiation in rats. Urine was collected from six male Wistar rats and six sham-irradiated controls for 7 days, 4 days prior to irradiation and 3 days after irradiation. Water and food consumption, urine volume, body weight, and sodium, potassium, calcium, chloride, phosphate and urea excretion showed major effects of exposure to gamma radiation. The metabolomics protocol uncovered several urinary metabolites that were significantly up-regulated (glyoxylate, threonate, thymine, uracil, p-cresol) and down-regulated (citrate, 2-oxoglutarate, adipate, pimelate, suberate, azelaate) as a result of radiation exposure. Thymine and uracil were shown to derive largely from thymidine and 2'-deoxyuridine, which are known radiation biomarkers in the mouse. The radiation metabolomic phenotype in rats appeared to derive from oxidative stress and effects on kidney function. Gas chromatography-mass spectrometry is a promising platform on which to develop the field of radiation metabolomics further and to assist in the design of instrumentation for use in detecting biological consequences of environmental radiation release.
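As a sketch of the analysis pattern (not the authors' pipeline), the following trains a random forest to separate irradiated from sham urine profiles and ranks metabolites by importance; the data matrix and effect sizes are simulated, and the metabolite list is a subset of those named above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
metabolites = ["glyoxylate", "threonate", "thymine", "citrate", "adipate"]
X = rng.lognormal(size=(12, 5))        # 12 urine samples x 5 metabolite peaks
X[:6, :3] *= 2.0                       # irradiated group: first 3 up-regulated
y = np.array([1] * 6 + [0] * 6)        # 1 = irradiated, 0 = sham

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
ranked = sorted(zip(metabolites, rf.feature_importances_), key=lambda t: -t[1])
for name, imp in ranked:
    print(f"{name}: {imp:.2f}")        # candidate biomarkers rank highest
```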