Abstract:
The amount of biological data has grown exponentially in recent decades. Modern biotechnologies, such as microarrays and next-generation sequencing, are capable of producing massive amounts of biomedical data in a single experiment. As the amount of data grows rapidly, there is an urgent need for reliable computational methods for analyzing and visualizing it. This thesis addresses this need by studying how to efficiently and reliably analyze and visualize high-dimensional data, especially data obtained from gene expression microarray experiments. First, we study ways to improve the quality of microarray data by replacing (imputing) missing data entries with estimated values. Missing value imputation is commonly used to make an originally incomplete data set complete, and thus easier to analyze with statistical and computational methods. Our novel approach was to use curated external biological information as a guide for the missing value imputation. Secondly, we studied the effect of missing value imputation on downstream data analysis methods such as clustering. We compared multiple recent imputation algorithms on eight publicly available microarray data sets. We observed that missing value imputation is indeed a rational way to improve the quality of biological data. The research revealed differences between the clustering results obtained with different imputation methods. On most data sets the simple and fast k-NN imputation was good enough, but there was also a need for more advanced imputation methods, such as Bayesian principal component analysis (BPCA). Finally, we studied the visualization of biological network data. Biological interaction networks are an example of the outcome of multiple biological experiments, such as those using gene microarray techniques. Such networks are typically very large and highly connected, so fast algorithms are needed for producing visually pleasing layouts. We developed a computationally efficient way to produce layouts of large biological interaction networks. The algorithm uses multilevel optimization within the regular force-directed graph layout algorithm.
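The abstract names k-NN imputation as the simple baseline that was often good enough. The following is a minimal sketch of that idea, assuming a genes-by-samples NumPy matrix with NaN marking missing entries; the function and toy data are illustrative only and are not the thesis implementation or its biologically guided variant.

```python
import numpy as np

def knn_impute(X, k=10):
    """Fill NaN entries of a genes-x-samples matrix X by averaging the
    corresponding values of the k most similar complete rows. Similarity is
    Euclidean distance over the columns the incomplete row actually observes."""
    X = X.astype(float)
    filled = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]            # rows with no missing values
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        row = X[i]
        obs = ~np.isnan(row)                          # columns observed in this row
        d = np.sqrt(((complete[:, obs] - row[obs]) ** 2).mean(axis=1))
        nn = complete[np.argsort(d)[:k]]              # k nearest complete rows
        filled[i, ~obs] = nn[:, ~obs].mean(axis=0)    # impute by neighbour average
    return filled

# toy usage: 5 genes x 4 arrays with two missing entries
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [1.1, 2.1, np.nan, 4.2],
              [0.9, 1.9, 3.1, np.nan],
              [5.0, 6.0, 7.0, 8.0],
              [5.1, 6.2, 7.1, 8.1]])
print(knn_impute(X, k=2))
```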
Abstract:
Identification of low-dimensional structures and main sources of variation from multivariate data are fundamental tasks in data analysis. Many methods aimed at these tasks involve the solution of an optimization problem. Thus, the objective of this thesis is to develop computationally efficient and theoretically justified methods for solving such problems. Most of the thesis is based on a statistical model in which ridges of the density estimated from the data are considered the relevant features. Finding ridges, which are generalized maxima, necessitates the development of advanced optimization methods. An efficient and convergent trust region Newton method for projecting a point onto a ridge of the underlying density is developed for this purpose. The method is utilized in a differential equation-based approach for tracing ridges and computing projection coordinates along them. The density estimation is done nonparametrically using Gaussian kernels, which allows ridge-based methods to be applied with only mild assumptions on the underlying structure of the data. The statistical model and the ridge finding methods are adapted to two different applications. The first is the extraction of curvilinear structures from noisy data mixed with background clutter. The second is a novel nonlinear generalization of principal component analysis (PCA) and its extension to time series data. The methods have a wide range of potential applications where most earlier approaches are inadequate. Examples include identification of faults from seismic data and identification of filaments from cosmological data. Applicability of the nonlinear PCA to climate analysis and to the reconstruction of periodic patterns from noisy time series data is also demonstrated. Other contributions of the thesis include the development of an efficient semidefinite optimization method for embedding graphs into Euclidean space. The method produces structure-preserving embeddings that maximize interpoint distances. It is primarily developed for dimensionality reduction, but it also has potential applications in graph theory and various areas of physics, chemistry and engineering. The asymptotic behaviour of ridges and maxima of Gaussian kernel densities is also investigated as the kernel bandwidth approaches infinity. The results are applied to the nonlinear PCA and to finding significant maxima of such densities, which is a typical problem in visual object tracking.
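As a rough illustration of the density-ridge idea, the sketch below computes the gradient and Hessian of a Gaussian kernel density estimate and moves a point toward a one-dimensional ridge by subspace-constrained gradient ascent. This is a simplified stand-in for intuition only, not the trust region Newton method developed in the thesis; all names, parameter values, and the toy data are illustrative.

```python
import numpy as np

def kde_grad_hess(x, data, h):
    """Gradient and Hessian at x of a Gaussian kernel density estimate of data."""
    d = x.size
    diff = data - x                                       # vectors from x to the samples, shape (n, d)
    w = np.exp(-0.5 * (diff ** 2).sum(axis=1) / h**2)     # unnormalised kernel weights
    grad = (w[:, None] * diff).sum(axis=0) / h**2
    hess = (w[:, None, None] *
            (diff[:, :, None] * diff[:, None, :] / h**2 - np.eye(d))).sum(axis=0) / h**2
    norm = len(data) * (2 * np.pi) ** (d / 2) * h ** d    # KDE normalisation constant
    return grad / norm, hess / norm

def project_to_ridge(x, data, h, steps=200, lr=0.05):
    """Move x toward a 1-D density ridge: ascend the gradient restricted to the
    Hessian eigen-directions of smallest eigenvalue (i.e. across the ridge)."""
    x = x.astype(float)
    for _ in range(steps):
        g, H = kde_grad_hess(x, data, h)
        vals, vecs = np.linalg.eigh(H)                    # eigenvalues in ascending order
        V = vecs[:, :-1]                                  # drop the ridge (largest-eigenvalue) direction
        x = x + lr * V @ (V.T @ g)                        # constrained ascent step
    return x

# toy usage: a noisy circle; start from a point already near the ridge
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 500)
data = np.c_[np.cos(t), np.sin(t)] + 0.1 * rng.normal(size=(500, 2))
print(project_to_ridge(np.array([0.8, 0.1]), data, h=0.3))
```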
Abstract:
This study examines a waste power plant's optimal processing chain; it is important to consider from several points of view why one option is better than another, so as to ensure that the right decision is made. Incineration of waste has developed into a decent option for waste disposal. There are several legislative matters and technical options to consider when starting up a waste power plant. Of the techniques, pretreatment, the burner and flue gas cleaning are the biggest ones to consider. The treatment of incineration residues is important, since they can be very harmful to the environment. The actual energy production from waste is not highly efficient, and several harmful compounds are emitted. Recycling of waste before incineration is not very typical, and there are not many recycling options for materials that cannot easily be recycled into the same product. Life cycle assessment is a good option for studying the environmental effects of the system; it has four phases that are part of an iterative study process. In this study the case environment is a waste power plant. The modeling of the plant is done with the GaBi 6 software, and the scope is from gate to grave. There are three different scenarios, of which the first and second are compared to each other to reach conclusions. The zero scenario is part of the study to demonstrate the situation without the power plant. In scenario one the power plant recycles some materials; in scenario two it recycles even more materials and utilizes the bottom ash in more ways than one. The model includes substitutive processes for the materials when they are not recycled in the plant. The global warming potential results show that scenario one is the best option, and the variable costs that were considered point to the same result. The conclusion is that the waste power plant should not recycle more or utilize bottom ash in a number of ways; the area is not ready for that kind of utilization and production from recycled materials.
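For readers unfamiliar with how an impact category such as global warming potential is aggregated in an LCA tool like GaBi, the toy sketch below shows the underlying arithmetic: a characterization-factor weighted sum of emissions in CO2-equivalents. The factor values and emission amounts are placeholders for illustration, not data or results from this study.

```python
# Approximate 100-year global warming potential factors (kg CO2-eq per kg emitted);
# a real assessment would use the factor set built into the LCA software.
GWP100 = {"CO2": 1.0, "CH4": 28.0, "N2O": 265.0}

def global_warming_potential(emissions_kg):
    """Aggregate a dict of emissions (kg) into a single kg CO2-eq score."""
    return sum(GWP100[gas] * kg for gas, kg in emissions_kg.items())

# hypothetical emissions per tonne of treated waste in two scenarios
scenario_one = {"CO2": 950.0, "CH4": 1.2, "N2O": 0.05}
scenario_two = {"CO2": 990.0, "CH4": 1.0, "N2O": 0.06}
print(global_warming_potential(scenario_one), global_warming_potential(scenario_two))
```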
Abstract:
The advancement of science and technology makes it clear that no single perspective is any longer sufficient to describe the true nature of any phenomenon. That is why interdisciplinary research is gaining more attention over time. An excellent example of this type of research is natural computing, which stands on the borderline between biology and computer science. The contribution of research done in natural computing is twofold: on one hand, it sheds light on how nature works and how it processes information and, on the other hand, it provides some guidelines on how to design bio-inspired technologies. The first direction in this thesis focuses on a nature-inspired process called gene assembly in ciliates. The second one studies reaction systems, a modeling framework whose rationale is built upon the biochemical interactions happening within a cell. The process of gene assembly in ciliates has attracted a lot of attention as a research topic in the past 15 years. Two main modelling frameworks were initially proposed at the end of the 1990s to capture ciliates' gene assembly process, namely the intermolecular model and the intramolecular model. They were followed by other model proposals such as template-based assembly and DNA rearrangement pathway recombination models. In this thesis we are interested in a variation of the intramolecular model called the simple gene assembly model, which focuses on the simplest possible folds in the assembly process. We propose a new framework called directed overlap-inclusion (DOI) graphs to overcome the limitations that previously introduced models faced in capturing all the combinatorial details of the simple gene assembly process. We investigate a number of combinatorial properties of these graphs, including a necessary property in terms of forbidden induced subgraphs. We also introduce DOI graph-based rewriting rules that capture all the operations of the simple gene assembly model and prove that they are equivalent to the string-based formalization of the model. Reaction systems (RS) are another nature-inspired modeling framework studied in this thesis. Their rationale is based upon two main regulation mechanisms, facilitation and inhibition, which control the interactions between biochemical reactions. Reaction systems are a complementary modeling framework to traditional quantitative frameworks, focusing on explicit cause-effect relationships between reactions. The explicit formulation of the facilitation and inhibition mechanisms behind reactions, as well as the focus on interactions between reactions (rather than on the dynamics of concentrations), makes their applicability potentially wide and useful beyond biological case studies. In this thesis, we construct a reaction system model of the heat shock response mechanism based on a novel concept of dominance graph that captures the competition for resources in the ODE model. We also introduce for RS various concepts inspired by biology, e.g., mass conservation, steady state and periodicity, to enable model checking of reaction system-based models. We prove that the complexity of the decision problems related to these properties varies from P to NP-complete and coNP-complete to PSPACE-complete. We further focus on the mass conservation relation in an RS, introduce the conservation dependency graph to capture the relation between the species, and propose an algorithm to list the conserved sets of a given reaction system.
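As a minimal illustration of the reaction systems formalism referred to above, the sketch below encodes a reaction as a triple of reactant, inhibitor, and product sets: a reaction is enabled on a state when all its reactants are present and none of its inhibitors is, and the successor state is the union of the products of all enabled reactions (there is no permanency, so entities that are not produced disappear). The toy reactions are made up for illustration and are unrelated to the heat shock response model of the thesis.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reaction:
    reactants: frozenset
    inhibitors: frozenset
    products: frozenset

    def enabled(self, state):
        # all reactants present, no inhibitor present
        return self.reactants <= state and not (self.inhibitors & state)

def result(reactions, state):
    """One step of a reaction system: union of products of all enabled reactions."""
    out = set()
    for a in reactions:
        if a.enabled(state):
            out |= a.products
    return frozenset(out)

# toy system: "a" produces "b" unless "c" inhibits it; "b" sustains both "a" and "b"
rs = [Reaction(frozenset({"a"}), frozenset({"c"}), frozenset({"b"})),
      Reaction(frozenset({"b"}), frozenset(), frozenset({"a", "b"}))]
state = frozenset({"a"})
for _ in range(4):                  # the trajectory quickly reaches a steady state
    print(sorted(state))
    state = result(rs, state)
```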
Abstract:
The Baltic Sea is a unique environment that contains genetically unique populations. In order to study these populations at the genetic level, basic molecular research is needed. The aim of this thesis was to provide a basic genetic resource for population genomic studies by de novo assembling a transcriptome for the Baltic Sea isopod Idotea balthica. RNA was extracted from a single whole adult male isopod and sequenced using Illumina (125 bp PE) RNA-Seq. The reads were preprocessed using FASTQC for quality control, TRIMMOMATIC for trimming, and RCORRECTOR for error correction. The preprocessed reads were then assembled with TRINITY, a de Bruijn graph-based assembler, using different k-mer sizes. The different assemblies were combined and clustered using CD-HIT. The assemblies were evaluated using TRANSRATE for quality and filtering, BUSCO for completeness, and TRANSDECODER for annotation potential. The 25-mer assembly was annotated using PANNZER (protein annotation with z-score) and BLASTX. The 25-mer assembly represents the best first draft assembly, since it contains the most information. However, this assembly shows high levels of polymorphism, which currently cannot be differentiated as paralogs or allelic variants. Furthermore, the assembly is incomplete, which could be improved by sampling additional developmental stages.
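As a rough illustration of the de Bruijn graph idea behind k-mer based assemblers such as TRINITY, the toy sketch below decomposes reads into k-mers and links each (k-1)-mer prefix to its (k-1)-mer suffix, showing why different k-mer sizes yield different assemblies. The reads are hypothetical, and real assemblers add coverage weighting, error handling, and graph simplification on top of this core data structure.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes it connects to."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])    # edge: prefix -> suffix
    return graph

reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]     # hypothetical overlapping RNA-Seq reads
for k in (4, 5):                              # smaller k joins more reads, larger k is more specific
    print(k, dict(de_bruijn_graph(reads, k)))
```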