20 results for Subsequences


Relevance: 10.00%

Abstract:

The analysis of sequential data is required in many diverse areas such as telecommunications, stock market analysis, and bioinformatics. A basic problem related to the analysis of sequential data is the sequence segmentation problem. A sequence segmentation is a partition of the sequence into a number of non-overlapping segments that cover all data points, such that each segment is as homogeneous as possible. This problem can be solved optimally using a standard dynamic programming algorithm. In the first part of the thesis, we present a new approximation algorithm for the sequence segmentation problem. This algorithm has a smaller running time than the optimal dynamic programming algorithm, while retaining a bounded approximation ratio. The basic idea is to divide the input sequence into subsequences, solve the problem optimally in each subsequence, and then appropriately combine the solutions to the subproblems into one final solution. In the second part of the thesis, we study alternative segmentation models that are devised to better fit the data. More specifically, we focus on clustered segmentations and segmentations with rearrangements. While in the standard segmentation of a multidimensional sequence all dimensions share the same segment boundaries, in a clustered segmentation the multidimensional sequence is segmented in such a way that dimensions are allowed to form clusters. Each cluster of dimensions is then segmented separately. We formally define the problem of clustered segmentations and we experimentally show that segmenting sequences using this segmentation model leads to solutions with smaller error for the same model cost. Segmentation with rearrangements is a novel variation of the segmentation problem: in addition to partitioning the sequence, we also seek to apply a limited amount of reordering, so that the overall representation error is minimized.
We formulate the problem of segmentation with rearrangements and we show that it is an NP-hard problem to solve or even to approximate. We devise effective algorithms for the proposed problem, combining ideas from dynamic programming and outlier detection algorithms in sequences. In the final part of the thesis, we discuss the problem of aggregating results of segmentation algorithms on the same set of data points. In this case, we are interested in producing a partitioning of the data that agrees as much as possible with the input partitions. We show that this problem can be solved optimally in polynomial time using dynamic programming. Furthermore, we show that not all data points are candidates for segment boundaries in the optimal solution.
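
As an illustration of the standard dynamic programming approach mentioned in the abstract above, here is a minimal Python sketch. It assumes a one-dimensional sequence and takes "homogeneous" to mean low squared error around each segment's mean; the abstract does not fix the error measure, and the function name `optimal_segmentation` is hypothetical. This is the cubic-time baseline, not the thesis's approximation algorithm.

```python
from itertools import accumulate


def optimal_segmentation(points, k):
    """Split `points` into k contiguous segments minimizing total squared error.

    Returns (total_error, boundaries), where boundaries holds the start
    index of every segment except the first.
    Assumption: each segment is represented by its mean (illustrative choice).
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(points)")

    # Prefix sums give the squared error of any segment in O(1).
    s1 = [0.0] + list(accumulate(points))
    s2 = [0.0] + list(accumulate(x * x for x in points))

    def seg_error(i, j):
        """Squared error of points[i:j] around its own mean."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[m][j]: best error for the first j points using exactly m segments.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = cost[m - 1][i] + seg_error(i, j)
                if c < cost[m][j]:
                    cost[m][j] = c
                    back[m][j] = i

    # Follow the back-pointers to recover the segment boundaries.
    bounds = []
    j = n
    for m in range(k, 1, -1):
        j = back[m][j]
        bounds.append(j)
    return cost[k][n], bounds[::-1]
```

For example, `optimal_segmentation([1, 1, 1, 5, 5, 5, 9, 9], 3)` places the boundaries at indices 3 and 6, where the values change.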

Relevance: 10.00%

Abstract:

In this article we describe and demonstrate the versatility of a computer program, GENOME MAPPING, that uses interactive graphics and runs on an IRIS workstation. The program helps to visualize as well as analyse global and local patterns of genomic DNA sequences. It was developed keeping in mind the requirements of the human genome sequencing programme, which requires rapid analysis of the data. Using GENOME MAPPING one can discern signature patterns of different kinds of sequences and analyse such patterns for repetitive as well as rare sequence strings. Further, one can visualize the extent of global homology between different genomic sequences. An application of our method to the published yeast mitochondrial genome data shows similar sequence organizations in the entire sequence and in smaller subsequences.

Relevance: 10.00%

Abstract:

DNA samples are found as fragments, obtained from traces at a crime scene or collected from hair or blood samples, for genetic or paternity tests. To determine whether such a fragment belongs to a given DNA sequence, it must be compared with a reference sequence, which may be stored in a database in order, for example, to identify a suspect. This requires an efficient tool for aligning the DNA sequence found with the one stored in the database. DNA sequence alignment (DNA matching) is the field of bioinformatics that seeks to understand the relationship between genetic sequences and their functional and parental relationships. This task is often performed by software that scans clusters of databases, demanding high computational power, which raises the cost of a DNA sequence alignment project. This dissertation presents a parallel hardware architecture for the BLAST algorithm that enables the alignment of a pair of DNA sequences. BLAST is a heuristic method and is currently the fastest. BLAST's strategy is to divide the original sequences into smaller subsequences of size w. After comparing these small subsequences, the BLAST stages analyze only the subsequences that are identical. In this way, the algorithm reduces the number of tests and combinations needed to perform the alignment. For each identical sequence, the algorithm carries out three stages: seeding, extension, and evaluation. The proposed solution draws on the characteristics of the algorithm to implement fully parallel hardware, pipelined across the basic BLAST stages. The proposed hardware architecture was implemented on an FPGA, and the results compare occupied area, number of cycles, and maximum permitted operating frequency as functions of the alignment parameters. The result is a hardware architecture in reconfigurable logic that is scalable, efficient, and low-cost, capable of aligning pairs of sequences using the BLAST algorithm.
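
The seeding and extension stages described in the abstract above can be sketched in software as follows. This is a simplified Python illustration, not the dissertation's hardware design: the function names `find_seeds` and `extend_seed` are hypothetical, and extension here simply grows a seed while characters match exactly, whereas real BLAST implementations extend using alignment scores.

```python
from collections import defaultdict


def find_seeds(query, subject, w):
    """Seeding: return (query_pos, subject_pos) pairs whose length-w words match."""
    # Index every length-w word of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)

    # Look up every length-w word of the subject in that index.
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, w):
    """Extension (simplified): grow the seed in both directions on exact matches.

    Returns (query_start, subject_start, length) of the extended match.
    """
    left = 0
    while (i - left > 0 and j - left > 0
           and query[i - left - 1] == subject[j - left - 1]):
        left += 1
    right = 0
    while (i + w + right < len(query) and j + w + right < len(subject)
           and query[i + w + right] == subject[j + w + right]):
        right += 1
    return i - left, j - left, w + left + right
```

With `query = "ACGTACGT"`, `subject = "TTACGTAA"` and `w = 4`, the seed at query position 0 and subject position 2 extends by one base to the right, covering `ACGTA` in both sequences. An evaluation stage would then rank such extended matches, for instance by length.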

Relevance: 10.00%

Abstract:

Glutenite reservoirs are one of the most important reservoir types in China. Because of their distinctive rock structure and pore structure, they are usually difficult to develop, especially given their strong heterogeneity. On the basis of seismic data, well logs, core data and production performance, the lower Wuerhe group can be divided into one second-order sequence, two third-order sequences and twenty-two subsequences, corresponding to the five stages and twenty-two minlayers. In addition, the fault systems are interpreted and the control exerted by the fault systems on reservoir development is also described. The lower Wuerhe formation of the 8th district belongs to fluvial-dominated fan delta sedimentation, according to the analysis of well logs, logging data and core data. It can be subdivided into two kinds of subfacies and nine kinds of microfacies. The fan delta plain subfacies mainly consist of braided channel, unconcentrated flow, mud flow and sieve deposit microfacies. The fan delta front subfacies include subaqueous distributary channel, subaqueous interdistributary channel, debris flow, subaqueous barrier and grain flow microfacies. Combined with the regional geological characteristics, the porosity model of the lower Wuerhe formation is built using core data. A permeability model based on the flow zone index is also built according to the pore throat characteristics and flow properties. Finally, the heterogeneity is analyzed. The result shows that the lower Wuerhe formation has medium-to-high heterogeneity, and that it is controlled by material sources and the sedimentary facies belt.
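
The abstract above does not state how its flow zone index (FZI) permeability model is formulated. As a rough illustration only, the Python sketch below uses the commonly cited definition, with permeability in millidarcies and porosity as a fraction: RQI = 0.0314 * sqrt(k / phi), phi_z = phi / (1 - phi), FZI = RQI / phi_z. The function names are hypothetical, and whether the study uses exactly this form is an assumption.

```python
import math


def flow_zone_index(permeability_md, porosity):
    """FZI in micrometres from permeability (mD) and fractional porosity.

    Assumption: the commonly cited RQI / normalized-porosity definition.
    """
    rqi = 0.0314 * math.sqrt(permeability_md / porosity)  # reservoir quality index
    phi_z = porosity / (1.0 - porosity)                   # normalized porosity
    return rqi / phi_z


def permeability_from_fzi(fzi, porosity):
    """Invert the definition: predict permeability (mD) for a given flow zone."""
    phi_z = porosity / (1.0 - porosity)
    rqi = fzi * phi_z
    return porosity * (rqi / 0.0314) ** 2
```

In such a model, core samples with similar FZI values are grouped into one flow unit, and `permeability_from_fzi` then maps log-derived porosity to permeability within that unit.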

Relevance: 10.00%

Abstract:

The synapsing variable-length crossover (SVLC) algorithm provides a biologically inspired method for performing meaningful crossover between variable-length genomes. In addition to providing a rationale for variable-length crossover, it also provides a genotypic similarity metric for variable-length genomes, enabling standard niche formation techniques to be used with variable-length genomes. Unlike other variable-length crossover techniques, which consider genomes to be rigid, inflexible arrays and in which some or all of the crossover points are randomly selected, the SVLC algorithm considers genomes to be flexible and chooses non-random crossover points based on the common parental sequence similarity. The SVLC algorithm recurrently "glues" or synapses homogenous genetic subsequences together. This is done in such a way that common parental sequences are automatically preserved in the offspring with only the genetic differences being exchanged or removed, independent of the length of such differences. In a variable-length test problem, the SVLC algorithm compares favorably with current variable-length crossover techniques. The variable-length approach is further advocated by demonstrating how a variable-length genetic algorithm (GA) can obtain a high fitness solution in fewer iterations than a traditional fixed-length GA in a two-dimensional vector approximation task.

Relevance: 10.00%

Abstract:

In an attempt to improve automated gene prediction in the untranslated region of a gene, we completed an in-depth analysis of the minimum free energy for 8,689 sub-genetic DNA sequences. We expanded Zhang's classification model and classified each sub-genetic sequence into one of 27 possible motifs. We calculated the minimum free energy for each motif to explore statistical features that correlate with biologically relevant sub-genetic sequences. If biologically relevant sub-genetic sequences fall into distinct free energy quanta, it may be possible to characterize a motif based on its minimum free energy. Proper characterization of motifs can lead to greater understanding in automated gene finding, gene variability and the role DNA structure plays in gene network regulation.

Our analysis determined: (1) the average free energy value for exons, introns and other biologically relevant sub-genetic sequences, (2) that these subsequences do not exist in distinct energy quanta, (3) that introns, however, exist in a tightly coupled average minimum free energy quantum compared to all other biologically relevant sub-genetic sequence types, (4) that single-exon genes demonstrate a higher stability than exons which span the entire coding sequence as part of a multi-exon gene and (5) that all motif types contain a free energy global minimum at approximately nucleotide position 1,000 before reaching a plateau. These results should be relevant to the biochemist and bioinformatician seeking to understand the relationship between sub-genetic sequences and the information behind them.

Relevance: 10.00%

Abstract:

One of the results of the surge in interest in the internet is the great increase in availability of pictorial and video data. Web browsers such as Netscape give access to an enormous range of such data. In order to make use of large amounts of pictorial and video data, it is necessary to develop indexing and retrieval methods. Pictorial databases have made great progress recently, to the extent that there are now a number of commercially available products. Video databases are now being researched and developed from a number of different viewpoints. Given a general indexing scheme for video, the next step is to reuse clips in further applications. In this paper we present an initial application for the reuse of video clips. The aim of the system is to resequence video clips for a particular application. We have chosen a well-constrained application for this purpose, the aim being to produce a video tour of a campus between designated start and destination points from a set of indexed video clips. We use clips of a guide entering and leaving buildings on our campus, and when visitors select a start location and a destination, the system will retrieve clips suitable for guiding the visitor along the correct path. The system uses an index of spatial relationships of key objects for the video clips to decide which clips provide the correct sequence of motion around the campus. Although the full power of the indexing notation is unnecessary for this simple problem, the results from this initial implementation indicate that the concept could be applicable to more complex problems.

Relevance: 10.00%

Abstract:

Time series discord has proven to be a useful concept for time-series anomaly identification. To search for discords, various algorithms have been developed. Most of these algorithms rely on pre-building an index (such as a trie) for subsequences. Users of these algorithms are typically required to choose optimal values for word-length and/or alphabet-size parameters of the index, which are not intuitive. In this paper, we propose an algorithm to directly search for the top-K discords, without the requirement of building an index or tuning external parameters. The algorithm exploits quasi-periodicity present in many time series. For quasi-periodic time series, the algorithm gains significant speedup by reducing the number of calls to the distance function.
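
For context, the brute-force baseline that discord search algorithms try to speed up can be sketched in Python as below. This is not the algorithm proposed in the abstract above: it compares every pair of subsequences and therefore makes the maximum number of distance calls. The function name `top_discord` is hypothetical, and plain Euclidean distance without normalization is an assumption made for brevity.

```python
import math


def top_discord(series, m):
    """Brute-force top-1 discord of length m.

    The discord is the subsequence whose nearest non-overlapping neighbour
    is farthest away. Returns (start_index, distance_to_nearest_neighbour).
    """
    n = len(series) - m + 1  # number of subsequences of length m

    def dist(a, b):
        """Euclidean distance between the subsequences starting at a and b."""
        return math.sqrt(sum((series[a + t] - series[b + t]) ** 2
                             for t in range(m)))

    best_pos, best_dist = -1, -1.0
    for i in range(n):
        nearest = math.inf
        for j in range(n):
            if abs(i - j) >= m:  # skip overlapping (trivial) matches
                nearest = min(nearest, dist(i, j))
        if nearest != math.inf and nearest > best_dist:
            best_pos, best_dist = i, nearest
    return best_pos, best_dist
```

On a periodic series with a single spike, the reported subsequence covers the spike. The quadratic number of `dist` calls in the nested loops is the cost that an index-free direct search aims to reduce.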

Relevance: 10.00%

Abstract:

Time-series discord is widely used in data mining applications to characterize anomalous subsequences in time series. Compared to some other discord search algorithms, the direct search algorithm based on the recurrence plot shows the advantage of being fast and parameter free. The direct search algorithm, however, relies on quasi-periodicity in input time series, an assumption that limits the algorithm's applicability. In this paper, we eliminate the periodicity assumption from the direct search algorithm by proposing a reference function for subsequences and a new sampling strategy based on the reference function. These measures result in a new algorithm with improved efficiency and robustness, as evidenced by our empirical evaluation.

Relevance: 10.00%

Abstract:

Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)

Relevance: 10.00%

Abstract:

Material tracking is becoming increasingly important in the metal industry: a metal strip must pass through a specified program during the manufacturing process, and only then is the quality of the final product guaranteed. Current practice is to assign each metal strip a number, with which the strip is labeled. When strips are stored for days between two production steps, this method proves error-prone: the labels can, for example, be lost, mixed up, misread, or become illegible. In 2007, iba AG filed a patent for identifying metal strips by their thickness profile (Anhaus [3]); with it, the identity of a metal strip can be established beyond doubt, and reliable material tracking became possible. It turned out, however, that the thickness profiles, which are subject to measurement error and can be regarded as long time series, cannot be compared successfully using existing methods (e.g., L2 distance minimization or Dynamic Time Warping). This work presents an efficient feature-based algorithm for comparing two time series. It is robust to noise and measurement dropouts and invariant under coordinate transformations of the time series such as scaling and translation. Comparisons with partial time series (subsequences) are also possible. Our framework is distinguished by both high accuracy and high speed: more than 99.5% of the queries to our test database, which consists of real profiles, are answered correctly. At several hundred time-series comparisons per second, it is roughly a factor of 10 faster than the established methods in time series analysis, which, however, are unable to process more than 90% of the queries correctly. The algorithm has proven suitable for industrial use. iba AG uses it in a globally unique thickness-profile-based monitoring system for material tracking, which is already in successful use in the first steel and aluminum rolling mills.

Relevance: 10.00%

Abstract:

The DNA microarray is a powerful tool for measuring the levels of a mixed population of nucleic acids at one time, and it has had great impact on many aspects of life sciences research. In order to distinguish nucleic acids with very similar composition by hybridization, it is necessary to design probes with high specificity, i.e., uniqueness, and also high sensitivity, i.e., suitable melting temperature and no secondary structure. We use available biology tools to obtain the necessary sequence information for human chromosome 12, and combine it with an evolutionary strategy (ES) to find unique subsequences representing all predicted exons. The results are presented and discussed.
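
The uniqueness requirement for probes can be illustrated with a small Python sketch. It uses an exhaustive scan over fixed-length subsequences rather than the evolutionary strategy the abstract above describes, and it ignores melting temperature and secondary structure. The function name `unique_subsequences` and the fixed length `k` are hypothetical choices for illustration.

```python
from collections import Counter


def unique_subsequences(target, others, k):
    """Return (position, subsequence) pairs of length k that occur exactly
    once in `target` and nowhere in any sequence in `others`."""
    # Count every length-k subsequence across the target and the background.
    counts = Counter()
    for seq in [target, *others]:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1

    # A candidate probe is a target subsequence seen exactly once overall.
    return [(i, target[i:i + k])
            for i in range(len(target) - k + 1)
            if counts[target[i:i + k]] == 1]
```

For instance, with `target = "ACGTTGCA"`, `others = ["TTGCAAAA"]` and `k = 4`, the subsequences `TTGC` and `TGCA` are rejected because they also occur in the background sequence. A search heuristic such as an ES becomes attractive when further criteria have to be optimized at the same time.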

Relevance: 10.00%

Abstract:

Transgenic BALB/c mice that express intrathyroidal human thyroid stimulating hormone receptor (TSHR) A-subunit, unlike wild-type (WT) littermates, develop thyroid lymphocytic infiltration and spreading to other thyroid autoantigens after T regulatory cell (Treg) depletion and immunization with human thyrotropin receptor (hTSHR) adenovirus. To determine if this process involves intramolecular epitope spreading, we studied antibody and T cell recognition of TSHR ectodomain peptides (A–Z). In transgenic and WT mice, regardless of Treg depletion, TSHR antibodies bound predominantly to N-terminal peptide A and much less to a few downstream peptides. After Treg depletion, splenocytes from WT mice responded to peptides C, D and J (all in the A-subunit), but transgenic splenocytes recognized only peptide D. Because CD4+ T cells are critical for thyroid lymphocytic infiltration, amino acid sequences of these peptides were examined for in silico binding to BALB/c major histocompatibility complex class II (IA–d). High affinity subsequences (inhibitory concentration of 50% < 50 nM) are present in peptides C and D (not J) of the hTSHR and mouse TSHR equivalents. These data probably explain why transgenic splenocytes do not recognize peptide J. Mouse TSHR mRNA levels are comparable in transgenic and WT thyroids, but only transgenics have human A-subunit mRNA. Transgenic mice can present mouse TSHR and human A-subunit-derived peptides. However, WT mice can present only mouse TSHR, and two to four amino acid species differences may preclude recognition by CD4+ T cells activated by hTSHR-adenovirus. Overall, thyroid lymphocytic infiltration in the transgenic mice is unrelated to epitopic spreading but involves human A-subunit peptides for recognition by T cells activated using the hTSHR.

Relevance: 10.00%

Abstract:

Background. The secondary structure of folded RNA sequences is a good model to map phenotype onto genotype, as represented by the RNA sequence. Computational studies of the evolution of ensembles of RNA molecules towards target secondary structures yield valuable clues to the mechanisms behind adaptation of complex populations. The relationship between the space of sequences and structures, the organization of RNA ensembles at mutation-selection equilibrium, the time of adaptation as a function of the population parameters, the presence of collective effects in quasispecies, or the optimal mutation rates to promote adaptation all are issues that can be explored within this framework. Results. We investigate the effect of microscopic mutations on the phenotype of RNA molecules during their in silico evolution and adaptation. We calculate the distribution of the effects of mutations on fitness, the relative fractions of beneficial and deleterious mutations and the corresponding selection coefficients for populations evolving under different mutation rates. Three different situations are explored: the mutation-selection equilibrium (optimized population) in three different fitness landscapes, the dynamics during adaptation towards a goal structure (adapting population), and the behavior under periodic population bottlenecks (perturbed population). Conclusions. The ratio between the number of beneficial and deleterious mutations experienced by a population of RNA sequences increases with the value of the mutation rate µ at which evolution proceeds. In contrast, the selective value of mutations remains almost constant, independent of µ, indicating that adaptation occurs through an increase in the amount of beneficial mutations, with little variations in the average effect they have on fitness. Statistical analyses of the distribution of fitness effects reveal that small effects, either beneficial or deleterious, are well described by a Pareto distribution. 
These results are robust under changes in the fitness landscape, remarkably when, in addition to selecting a target secondary structure, specific subsequences or low-energy folds are required. A population perturbed by bottlenecks behaves similarly to an adapting population, struggling to return to the optimized state. Whether it can survive in the long run or whether it goes extinct depends critically on the length of the time interval between bottlenecks. © 2010 Stich et al; licensee BioMed Central Ltd.

Relevance: 10.00%

Abstract:

The accurate identification of T-cell epitopes remains a principal goal of bioinformatics within immunology. As the immunogenicity of peptide epitopes is dependent on their binding to major histocompatibility complex (MHC) molecules, the prediction of binding affinity is a prerequisite to the reliable prediction of epitopes. The iterative self-consistent (ISC) partial-least-squares (PLS)-based additive method is a recently developed bioinformatic approach for predicting class II peptide−MHC binding affinity. The ISC−PLS method overcomes many of the conceptual difficulties inherent in the prediction of class II peptide−MHC affinity, such as the binding of a mixed population of peptide lengths due to the open-ended class II binding site. The method has applications in both the accurate prediction of class II epitopes and the manipulation of affinity for heteroclitic and competitor peptides. The method is applied here to six class II mouse alleles (I-Ab, I-Ad, I-Ak, I-As, I-Ed, and I-Ek) and included peptides up to 25 amino acids in length. A series of regression equations highlighting the quantitative contributions of individual amino acids at each peptide position was established. The initial model for each allele exhibited only moderate predictivity. Once the set of selected peptide subsequences had converged, the final models exhibited a satisfactory predictive power. Convergence was reached between the 4th and 17th iterations, and the leave-one-out cross-validation statistical terms - q2, SEP, and NC - ranged between 0.732 and 0.925, 0.418 and 0.816, and 1 and 6, respectively. The non-cross-validated statistical terms r2 and SEE ranged between 0.98 and 0.995 and 0.089 and 0.180, respectively. The peptides used in this study are available from the AntiJen database (http://www.jenner.ac.uk/AntiJen). The PLS method is available commercially in the SYBYL molecular modeling software package. 
The resulting models, which can be used for accurate T-cell epitope prediction, will be made freely available online (http://www.jenner.ac.uk/MHCPred).