3 results for Bioinformatics
in Bucknell University Digital Commons - Pennsylvania - USA
Abstract:
With the advent of cheaper and faster DNA sequencing technologies, assembly methods have changed greatly. Instead of outputting reads that are thousands of base pairs long, new sequencers parallelize the task and produce reads between 35 and 400 base pairs long. Reconstructing an organism’s genome from these millions of reads is a computationally expensive task. Our algorithm addresses this problem by organizing and indexing the reads using n-grams, which are short, fixed-length DNA sequences of length n. These n-grams are used to efficiently locate putative read joins, thereby eliminating the need to perform an exhaustive search over all possible read pairs. Our goal was to develop a novel n-gram method for the assembly of genomes from next-generation sequencers. Specifically, we used a probabilistic, iterative approach to determine the most likely reads to join, developing a new metric that models the probability of any two arbitrary reads being joined together. Tests were run using simulated short-read data based on randomly generated genomes ranging in length from 10,000 to 100,000 nucleotides with 16x to 20x coverage. We were able to successfully reassemble entire genomes up to 100,000 nucleotides in length.
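The indexing idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function names `build_ngram_index` and `candidate_joins` are hypothetical, and exact suffix-prefix matching stands in for the probabilistic join metric, which the abstract does not define. The sketch only shows how an n-gram index avoids comparing every pair of reads.

```python
from collections import defaultdict


def build_ngram_index(reads, n):
    """Map each n-gram to the (read_id, offset) positions where it occurs."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for offset in range(len(read) - n + 1):
            index[read[offset:offset + n]].append((read_id, offset))
    return index


def candidate_joins(reads, n):
    """Return {(left_id, right_id): overlap_length} for exact overlaps.

    Assumption: a join is an exact match between a suffix of the left read
    and a prefix of the right read, at least n bases long. The abstract's
    probabilistic metric is not reproduced here.
    """
    index = build_ngram_index(reads, n)
    joins = {}
    for right_id, right in enumerate(reads):
        if len(right) < n:
            continue
        # Only reads containing the right read's leading n-gram can overlap it.
        for left_id, offset in index[right[:n]]:
            if left_id == right_id:
                continue
            left = reads[left_id]
            overlap = len(left) - offset
            if overlap <= len(right) and left[offset:] == right[:overlap]:
                key = (left_id, right_id)
                joins[key] = max(joins.get(key, 0), overlap)
    return joins


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "TACGGA", "GGATTT"]
    print(candidate_joins(toy_reads, 3))
```

Each read's leading n-gram is looked up once, so the work scales with the number of index hits instead of the number of read pairs.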
Abstract:
Human Immunodeficiency Virus (HIV) has infected millions of people worldwide, many of whom are unaware that they are infected, so it is vital to understand how the virus functions at the molecular level. Because there is currently no vaccine and no way to eradicate the virus from an infected person, any information about how the virus interacts with its host improves our understanding of HIV and brings scientists one step closer to being able to combat such a destructive virus. Thousands of HIV isolates have been sequenced, and the sequences are available in many online databases for public use. Attributes linked to each sequence include the viral load within the host and how sick the patient currently is. Being able to predict a patient's stage of infection is valuable, as it could potentially aid in choosing treatment options and proper medication use. Our approach of analyzing region-specific amino acid composition for select genes predicts patient disease state with an accuracy of up to 85.4%. Moreover, we output a set of sequence-based classification rules that may prove useful for predicting the expected clinical outcome of the infected patient.
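As a rough illustration of what a region-specific amino acid composition feature might look like, the following Python sketch computes the fraction of each of the 20 standard amino acids within a slice of a protein sequence. The function name `region_composition`, the region boundaries, and the toy sequence are all assumptions; the abstract does not say which genes, regions, or classifier were used.

```python
from collections import Counter

# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def region_composition(protein, start, end):
    """Return the fraction of each standard amino acid in protein[start:end].

    Assumption: regions are half-open index ranges chosen by the caller.
    Non-standard symbols (gaps, 'X', stop codons) are ignored.
    """
    counts = Counter(protein[start:end].upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


if __name__ == "__main__":
    # Toy sequence, not real HIV data.
    features = region_composition("MKKLLA", 0, 4)
    print({aa: frac for aa, frac in features.items() if frac > 0})
```

A feature vector like this, computed per region and per gene, could be passed to any rule-based or tree-based classifier to derive classification rules of the kind the abstract mentions.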
Abstract:
Digital signal processing (DSP) techniques for biological sequence analysis continue to grow in popularity due to the inherently digital nature of these sequences. DSP methods have demonstrated early success in detecting coding regions in a gene. More recently, these methods have been used to establish DNA gene similarity. We present the inter-coefficient difference (ICD) transformation, a novel extension of the discrete Fourier transform that can be applied to any DNA sequence. The ICD method is a mathematical, alignment-free DNA comparison method that generates a genetic signature for any DNA sequence; these signatures are then used to generate relative measures of similarity among DNA sequences. We demonstrate our method on a set of insulin genes obtained from an evolutionarily wide range of species, and on a set of avian influenza viral sequences, which represents a set of highly similar sequences. We compare phylogenetic trees generated using our technique against trees generated using traditional alignment techniques and demonstrate that the ICD method produces a highly accurate tree without requiring an alignment prior to establishing sequence similarity.
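The abstract does not define the ICD transformation itself, so the Python sketch below shows only the standard starting point that DSP methods for DNA commonly build on: converting a sequence into four binary indicator signals (one per base) and summing the squared magnitudes of their discrete Fourier transforms. The function names are hypothetical, and a direct O(N^2) DFT is used to keep the example dependency-free.

```python
import cmath


def indicator(seq, base):
    """Binary signal: 1.0 where seq has the given base, else 0.0."""
    return [1.0 if symbol == base else 0.0 for symbol in seq]


def dft(signal):
    """Direct discrete Fourier transform (O(N^2)); fine for short examples."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def power_spectrum(seq):
    """Sum of |DFT|^2 over the A, C, G, and T indicator signals.

    Assumption: this is the plain DFT power spectrum, not the ICD
    transformation, whose definition is not given in the abstract.
    """
    seq = seq.upper()
    total = [0.0] * len(seq)
    for base in "ACGT":
        for k, coeff in enumerate(dft(indicator(seq, base))):
            total[k] += abs(coeff) ** 2
    return total


if __name__ == "__main__":
    # A sequence with exact period 3 concentrates power at k = N / 3.
    spectrum = power_spectrum("ATGATGATGATG")
    print([round(value, 6) for value in spectrum])
```

Because spectra like this are computed independently per sequence, two sequences can be compared by a distance between their spectra without aligning them first, which is the general idea behind alignment-free comparison.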