32 results for computational complexity
in Helda - Digital Repository of the University of Helsinki
Abstract:
Matrix decompositions, where a given matrix is represented as a product of two other matrices, are regularly used in data mining. Most matrix decompositions have their roots in linear algebra, but the needs of data mining are not always those of linear algebra. In data mining one needs to have results that are interpretable -- and what is considered interpretable in data mining can be very different from what is considered interpretable in linear algebra. The purpose of this thesis is to study matrix decompositions that directly address the issue of interpretability. An example is a decomposition of binary matrices where the factor matrices are assumed to be binary and the matrix multiplication is Boolean. The restriction to binary factor matrices increases interpretability -- the factor matrices are of the same type as the original matrix -- and allows the use of Boolean matrix multiplication, which is often more intuitive than normal matrix multiplication with binary matrices. Several other decomposition methods are also described, and the computational complexity of computing them is studied together with the hardness of approximating the related optimization problems. Based on these studies, algorithms for constructing the decompositions are proposed. Constructing the decompositions turns out to be computationally hard, and the proposed algorithms are mostly based on various heuristics. Nevertheless, the algorithms are shown to be capable of finding good results in empirical experiments conducted with both synthetic and real-world data.
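To make the Boolean product concrete: for 0/1 matrices, it replaces the sum in ordinary matrix multiplication with a logical OR, so each row of the product is the union of the "patterns" selected by the corresponding row of the left factor. A minimal sketch (the small example matrices are invented for illustration, not taken from the thesis):

```python
import numpy as np

def boolean_product(B, C):
    # (B o C)[i, j] = OR_k (B[i, k] AND C[k, j]); for 0/1 matrices this
    # equals the ordinary integer product thresholded at 1.
    return (B @ C > 0).astype(int)

# A 4x4 binary matrix expressed exactly as the Boolean product of two
# binary rank-2 factors: each row of A is the union (OR) of the pattern
# rows of C selected by the corresponding row of B.
B = np.array([[1, 0],
              [0, 1],
              [1, 1],
              [1, 1]])
C = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]])
A = boolean_product(B, C)
print(A)
# [[1 1 0 0]
#  [0 0 1 1]
#  [1 1 1 1]
#  [1 1 1 1]]
```

Because 1 OR 1 = 1, overlapping patterns never push an entry above 1, which is what keeps the factors readable as overlapping sets of patterns.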
Abstract:
This dissertation is a theoretical study of finite-state based grammars used in natural language processing. The study is concerned with certain varieties of finite-state intersection grammars (FSIG) whose parsers define regular relations between surface strings and annotated surface strings. The study focuses on the following three aspects of FSIGs:

(i) Computational complexity of grammars under limiting parameters. In the study, the computational complexity in practical natural language processing is approached through performance-motivated parameters on structural complexity. Each parameter splits some grammars in the Chomsky hierarchy into an infinite set of subset approximations. When the approximations are regular, they seem to fall into the logarithmic-time hierarchy and the dot-depth hierarchy of star-free regular languages. This theoretical result is important and possibly relevant to grammar induction.

(ii) Linguistically applicable structural representations. Related to the linguistically applicable representations of syntactic entities, the study contains new bracketing schemes that cope with dependency links, left- and right-branching, crossing dependencies and spurious ambiguity. New grammar representations that resemble the Chomsky-Schützenberger representation of context-free languages are presented in the study; they include, in particular, representations for mildly context-sensitive non-projective dependency grammars whose performance-motivated approximations are parseable in linear time.

(iii) Compilation and simplification of linguistic constraints. Efficient compilation methods for certain regular operations, such as generalized restriction, are presented. These include an elegant algorithm that has already been adopted as the approach in a proprietary finite-state tool. In addition to the compilation methods, an approach to on-the-fly simplification of finite-state representations of parse forests is sketched.

These findings are tightly coupled with each other under the theme of locality. I argue that the findings help us to develop better, linguistically oriented formalisms for finite-state parsing and to develop more efficient parsers for natural language processing.

Keywords: syntactic parsing, finite-state automata, dependency grammar, first-order logic, linguistic performance, star-free regular approximations, mildly context-sensitive grammars
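The operation that gives finite-state intersection grammars their name is intersecting a sentence automaton with constraint automata, computed by the classical product construction. Here is a generic sketch of that construction with two toy constraints invented for illustration (the thesis's actual constraint representations are far richer):

```python
from itertools import product

def intersect_dfas(d1, d2):
    """Product construction: the result accepts exactly the strings
    accepted by both DFAs. A DFA here is a tuple
    (states, alphabet, delta, start, accepting) with delta a dict."""
    states1, alpha, delta1, s1, acc1 = d1
    states2, _, delta2, s2, acc2 = d2
    delta = {((p, q), a): (delta1[p, a], delta2[q, a])
             for p, q in product(states1, states2) for a in alpha}
    acc = {(p, q) for p in acc1 for q in acc2}
    return (set(product(states1, states2)), alpha, delta, (s1, s2), acc)

def accepts(dfa, word):
    _, _, delta, state, acc = dfa
    for a in word:
        state = delta[state, a]
    return state in acc

# Two toy constraints over {'a', 'b'}: "even number of a's" and "ends in b".
even_a = ({0, 1}, {'a', 'b'},
          {(0, 'a'): 1, (1, 'a'): 0, (0, 'b'): 0, (1, 'b'): 1}, 0, {0})
ends_b = ({0, 1}, {'a', 'b'},
          {(0, 'a'): 0, (1, 'a'): 0, (0, 'b'): 1, (1, 'b'): 1}, 0, {1})

both = intersect_dfas(even_a, ends_b)
print(accepts(both, "aab"))  # True: two a's, ends in b
print(accepts(both, "ab"))   # False: odd number of a's
```

The product states multiply, which is exactly why the complexity and locality properties of the constraint languages matter so much in FSIG parsing.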
Abstract:
The Minimum Description Length (MDL) principle is a general, well-founded theoretical formalization of statistical modeling. The most important notion of MDL is the stochastic complexity, which can be interpreted as the shortest description length of a given sample of data relative to a model class. The exact definition of the stochastic complexity has gone through several evolutionary steps. The latest instantiation is based on the so-called Normalized Maximum Likelihood (NML) distribution, which has been shown to possess several important theoretical properties. However, applications of this modern version of MDL have been quite rare because of computational complexity problems: for discrete data, the definition of NML involves an exponential sum, and in the case of continuous data, a multi-dimensional integral that is usually infeasible to evaluate or even approximate accurately. In this doctoral dissertation, we present mathematical techniques for computing NML efficiently for some model families involving discrete data. We also show how these techniques can be used to apply MDL in two practical applications: histogram density estimation and clustering of multi-dimensional data.
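For intuition about the exponential sum mentioned above: for a multinomial model with K categories and sample size n, the NML normalizer sums the maximized likelihood over every possible count vector. A naive, definition-level sketch (the efficient techniques are the thesis's contribution; this brute force is only feasible for toy sizes):

```python
from math import factorial

def multinomial_nml_normalizer(K, n):
    """Naive evaluation of the NML normalizing sum over all count vectors
    (h_1, ..., h_K) with h_1 + ... + h_K = n:
        C(K, n) = sum  n!/(h_1! ... h_K!) * prod_k (h_k / n)^h_k
    (with 0^0 := 1). Exponential in K; for illustration only."""
    def term(counts):
        coef, p = factorial(n), 1.0
        for h in counts:
            coef //= factorial(h)
            if h:
                p *= (h / n) ** h
        return coef * p
    def rec(k, remaining, counts):
        if k == 1:
            return term(counts + (remaining,))
        return sum(rec(k - 1, remaining - h, counts + (h,))
                   for h in range(remaining + 1))
    return rec(K, n, ())

# The log of this normalizer is the parametric-complexity part of the
# stochastic complexity for the multinomial model class.
print(multinomial_nml_normalizer(2, 5))  # ~3.5104 for a binary alphabet, n = 5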
Abstract:
An efficient and statistically robust solution for the identification of asteroids among numerous sets of astrometry is presented. In particular, numerical methods have been developed for the short-term identification of asteroids at discovery, and for the long-term identification of scarcely observed asteroids over apparitions, a task which has been lacking a robust method until now. The methods are based on the solid foundation of statistical orbital inversion properly taking into account the observational uncertainties, which allows for the detection of practically all correct identifications. Through the use of dimensionality-reduction techniques and efficient data structures, the exact methods have a loglinear, that is, O(n log n), computational complexity, where n is the number of included observation sets. The methods developed are thus suitable for future large-scale surveys which anticipate a substantial increase in the astrometric data rate. Due to the discontinuous nature of asteroid astrometry, separate sets of astrometry must be linked to a common asteroid from the very first discovery detections onwards. The reason for the discontinuity in the observed positions is the rotation of the observer with the Earth as well as the motion of the asteroid and the observer about the Sun. Therefore, the aim of identification is to find a set of orbital elements that reproduce the observed positions with residuals similar to the inevitable observational uncertainty. Unless the astrometric observation sets are linked, the corresponding asteroid is eventually lost as the uncertainty of the predicted positions grows too large to allow successful follow-up. Whereas the presented identification theory and the numerical comparison algorithm are generally applicable, that is, also in fields other than astronomy (e.g., in the identification of space debris), the numerical methods developed for asteroid identification can immediately be applied to all objects on heliocentric orbits with negligible effects due to non-gravitational forces in the time frame of the analysis. The methods developed have been successfully applied to various identification problems. Simulations have shown that the methods developed are able to find virtually all correct linkages despite challenges such as numerous scarce observation sets, astrometric uncertainty, numerous objects confined to a limited region on the celestial sphere, long linking intervals, and substantial parallaxes. Tens of previously unknown main-belt asteroids have been identified with the short-term method in a preliminary study to locate asteroids among numerous unidentified sets of single-night astrometry of moving objects, and scarce astrometry obtained nearly simultaneously with Earth-based and space-based telescopes has been successfully linked despite a substantial parallax. Using the long-term method, thousands of realistic 3-linkages typically spanning several apparitions have so far been found among designated observation sets each spanning less than 48 hours.
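The thesis's exact algorithms are considerably more involved, but the role of dimensionality reduction and efficient data structures in reaching loglinear complexity can be sketched generically: once each observation set is mapped to a low-dimensional point (the hard part, merely assumed here), a kd-tree yields candidate linkages without comparing all n(n-1)/2 pairs. The coordinates and search radius below are invented for illustration:

```python
import numpy as np
from scipy.spatial import cKDTree

# Pretend each observation set has been reduced to a point in a
# 3-dimensional comparison space (e.g., reduced orbital-element
# coordinates); this reduction step is assumed, not implemented.
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(10_000, 3))

tree = cKDTree(points)                      # build: O(n log n)
candidate_pairs = tree.query_pairs(r=0.01)  # all pairs within radius r

# Only these candidate linkages need the expensive statistical
# orbital-inversion check, instead of every pair of observation sets.
print(len(candidate_pairs))
```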
Abstract:
A distributed system is a collection of networked autonomous processing units which must work in a cooperative manner. Currently, large-scale distributed systems, such as various telecommunication and computer networks, are abundant and used in a multitude of tasks. The field of distributed computing studies what can be computed efficiently in such systems. Distributed systems are usually modelled as graphs where nodes represent the processors and edges denote communication links between processors. This thesis concentrates on the computational complexity of the distributed graph colouring problem. The objective of the graph colouring problem is to assign a colour to each node in such a way that no two nodes connected by an edge share the same colour. In particular, it is often desirable to use only a small number of colours. This task is a fundamental symmetry-breaking primitive in various distributed algorithms. A graph that has been coloured in this manner using at most k different colours is said to be k-coloured. This work examines the synchronous message-passing model of distributed computation: every node runs the same algorithm, and the system operates in discrete synchronous communication rounds. During each round, a node can communicate with its neighbours and perform local computation. In this model, the time complexity of a problem is the number of synchronous communication rounds required to solve the problem. It is known that 3-colouring any k-coloured directed cycle requires at least ½(log* k - 3) communication rounds and is possible in ½(log* k + 7) communication rounds for all k ≥ 3. This work shows that for any k ≥ 3, colouring a k-coloured directed cycle with at most three colours is possible in ½(log* k + 3) rounds. In contrast, it is also shown that for some values of k, colouring a directed cycle with at most three colours requires at least ½(log* k + 1) communication rounds. Furthermore, in the case of directed rooted trees, reducing a k-colouring to a 3-colouring requires at least log* k + 1 rounds for some k, and is possible in log* k + 3 rounds for all k ≥ 3. The new positive and negative results are derived using computational methods, as the existence of distributed colouring algorithms corresponds to the colourability of so-called neighbourhood graphs. The colourability of these graphs is analysed using Boolean satisfiability (SAT) solvers. Finally, this thesis shows that similar methods are applicable in capturing the existence of distributed algorithms for other graph problems, such as the maximal matching problem.
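The log* bounds discussed above build on iterated colour reduction in the style of the classic Cole-Vishkin technique, where each node derives its new colour from the lowest bit position in which its colour differs from its successor's. A minimal sketch of that standard reduction step on a directed cycle (this illustrates the well-known technique, not the thesis's SAT-based analysis; the initial colouring is invented):

```python
def cole_vishkin_round(colours):
    """One synchronous Cole-Vishkin round on a directed cycle.
    colours[v] is node v's current colour; node v's successor is
    v + 1 (mod n). Each node finds the lowest bit position where its
    colour differs from its successor's and encodes
    (bit index, own bit value) as its new colour. If the input is a
    proper colouring, the output is one too."""
    n = len(colours)
    new = []
    for v in range(n):
        c, c_succ = colours[v], colours[(v + 1) % n]
        diff = c ^ c_succ                    # nonzero for a proper colouring
        i = (diff & -diff).bit_length() - 1  # index of lowest differing bit
        new.append(2 * i + ((c >> i) & 1))
    return new

# Start from a proper colouring of a directed 8-cycle with large colour
# values; each round shrinks the palette roughly logarithmically until it
# stalls at colours 0..5 (getting from 6 colours down to 3 requires a
# different, constant-time step).
colours = [712_431, 9, 100_003, 52, 881_122, 7, 300_000, 2]
while max(colours) > 5:
    colours = cole_vishkin_round(colours)
    print(colours)
```

Iterating this step log* k + O(1) times is what produces round complexities of the form seen in the bounds above.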
Abstract:
Metabolism is the cellular subsystem responsible for generation of energy from nutrients and production of building blocks for larger macromolecules. Computational and statistical modeling of metabolism is vital to many disciplines including bioengineering, the study of diseases, drug target identification, and understanding the evolution of metabolism. In this thesis, we propose efficient computational methods for metabolic modeling. The techniques presented are targeted particularly at the analysis of large metabolic models encompassing the whole metabolism of one or several organisms. We concentrate on three major themes of metabolic modeling: metabolic pathway analysis, metabolic reconstruction and the study of evolution of metabolism. In the first part of this thesis, we study metabolic pathway analysis. We propose a novel modeling framework called gapless modeling to study biochemically viable metabolic networks and pathways. In addition, we investigate the utilization of atom-level information on metabolism to improve the quality of pathway analyses. We describe efficient algorithms for discovering both gapless and atom-level metabolic pathways, and conduct experiments with large-scale metabolic networks. The presented gapless approach offers a compromise in terms of complexity and feasibility between the previous graph-theoretic and stoichiometric approaches to metabolic modeling. Gapless pathway analysis shows that microbial metabolic networks are not as robust to random damage as suggested by previous studies. Furthermore, the amino acid biosynthesis pathways of the fungal species Trichoderma reesei discovered from atom-level data are shown to closely correspond to those of Saccharomyces cerevisiae. In the second part, we propose computational methods for metabolic reconstruction in the gapless modeling framework. We study the task of reconstructing a metabolic network that does not suffer from connectivity problems. Such problems often limit the usability of reconstructed models, and typically require a significant amount of manual postprocessing. We formulate gapless metabolic reconstruction as an optimization problem and propose an efficient divide-and-conquer strategy to solve it with real-world instances. We also describe computational techniques for solving problems stemming from ambiguities in metabolite naming. These techniques have been implemented in ReMatch, a web-based software tool intended for the reconstruction of models for 13C metabolic flux analysis. In the third part, we extend our scope from single to multiple metabolic networks and propose an algorithm for inferring gapless metabolic networks of ancestral species from phylogenetic data. Experimenting with 16 fungal species, we show that the method is able to generate results that are easily interpretable and that provide hypotheses about the evolution of metabolism.
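The connectivity requirement behind gapless modeling can be illustrated with a simple reachability expansion: starting from the available nutrients, repeatedly fire every reaction whose substrates are all producible; reactions that never fire expose gaps. This is a minimal sketch of the general idea under invented reaction data, not the thesis's actual formulation or algorithm:

```python
def reachable_metabolism(reactions, nutrients):
    """Iteratively expand the set of producible metabolites: a reaction
    can fire once all of its substrates are available. The network is
    connectivity-gap-free w.r.t. the nutrients if every reaction
    eventually fires. reactions: dict name -> (substrates, products)."""
    producible = set(nutrients)
    fired = set()
    changed = True
    while changed:
        changed = False
        for name, (subs, prods) in reactions.items():
            if name not in fired and set(subs) <= producible:
                producible |= set(prods)
                fired.add(name)
                changed = True
    gaps = set(reactions) - fired
    return producible, gaps

# Toy network (names invented): r3 can never fire, since nothing produces E.
reactions = {
    "r1": (["glc"], ["A"]),
    "r2": (["A"], ["B", "C"]),
    "r3": (["C", "E"], ["D"]),
}
producible, gaps = reachable_metabolism(reactions, ["glc"])
print(sorted(producible))  # ['A', 'B', 'C', 'glc']
print(gaps)                # {'r3'}  <- a connectivity gap
```

Reconstruction in this spirit then becomes an optimization problem: choose a minimal set of additional reactions so that no gaps remain.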
Abstract:
Minimum Description Length (MDL) is an information-theoretic principle that can be used for model selection and other statistical inference tasks. There are various ways to use the principle in practice. One theoretically valid way is to use the normalized maximum likelihood (NML) criterion. Due to computational difficulties, this approach has not been used very often. This thesis presents efficient floating-point algorithms that make it possible to compute the NML for multinomial, Naive Bayes and Bayesian forest models. None of the presented algorithms rely on asymptotic analysis, and for the first two model classes we also discuss how to compute exact rational-number solutions.
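As one example of the flavour of such algorithms, the multinomial NML normalizer admits a simple linear-time recurrence, C(K+2, n) = C(K+1, n) + (n/K)·C(K, n), known from this line of work; the sketch below assumes that recurrence (which can also be checked directly against the naive exponential sum) rather than quoting this thesis's algorithms:

```python
from math import comb

def multinomial_nml_normalizers(K_max, n):
    """Normalizing sums C(1, n) .. C(K_max, n) of the multinomial NML in
    O(n + K_max) time via the recurrence
        C(K+2, n) = C(K+1, n) + (n / K) * C(K, n),
    with C(1, n) = 1 and C(2, n) computed from its definition."""
    C = [None, 1.0]                        # 1-indexed: C[1] = C(1, n) = 1
    c2 = 2.0 + sum(comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
                   for k in range(1, n))   # k = 0 and k = n contribute 1 each
    C.append(c2)                           # C[2] = C(2, n)
    for K in range(1, K_max - 1):
        C.append(C[K + 1] + (n / K) * C[K])
    return C[1:K_max + 1]

# log C(K, n) is the parametric-complexity term of the stochastic complexity.
print(multinomial_nml_normalizers(3, 2))  # [1.0, 2.5, 4.5]
```

Note that this straightforward version uses floating point; computing exact rational solutions, as the abstract mentions for some model classes, requires exact arithmetic throughout.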
Abstract:
We have presented an overview of the FSIG approach and related FSIG grammars to issues of very low complexity and parsing strategy. We ended up with serious optimism: most FSIG grammars could be decomposed in a reasonable way and then processed efficiently.
Abstract:
Atherosclerosis is a disease of the arteries; its characteristic features include chronic inflammation, extra- and intracellular lipid accumulation, extracellular matrix remodeling, and an increase in extracellular matrix volume. The underlying mechanisms in the pathogenesis of advanced atherosclerotic plaques, which involve local acidity of the extracellular fluid, are still incompletely understood. In this thesis project, my co-workers and I studied the different mechanisms by which local extracellular acidity could promote accumulation of the atherogenic apolipoprotein B-100 (apoB-100)-containing plasma lipoprotein particles in the inner layer of the arterial wall, the intima. We found that lipolysis of atherogenic apoB-100-containing plasma lipoprotein particles (LDL, IDL, and sVLDL) by the secretory phospholipase A2 group V (sPLA2-V) enzyme was increased at acidic pH. Also, the binding of apoB-100-containing plasma lipoprotein particles to human aortic proteoglycans was dramatically enhanced at acidic pH. Additionally, lipolysis by the sPLA2-V enzyme further increased this binding. Using proteoglycan-affinity chromatography, we found that sVLDL lipoprotein particles consist of populations differing in their affinities toward proteoglycans. These populations also contained different amounts of apolipoprotein E (apoE) and apolipoprotein C-III (apoC-III); the amounts of apoC-III and apoE per particle were highest in the population with the lowest affinity toward proteoglycans. Since PLA2-modification of LDL particles has been shown to change their aggregation behavior, we also studied the effect of acidic pH on the monolayer structure covering lipoprotein particles after PLA2-induced hydrolysis. Using molecular dynamics simulations, we found that, at acidic pH, the monolayer is more tightly packed laterally; moreover, its spontaneous curvature is negative, suggesting that acidity may promote lipoprotein particle fusion. In addition to extracellular lipid accumulation, the apoB-100-containing plasma lipoprotein particles can be taken up by inflammatory cells, namely macrophages. Using radiolabeled lipoprotein particles and cell cultures, we showed that sPLA2-V-modification of LDL, IDL, and sVLDL lipoprotein particles, at neutral or acidic pH, increased their uptake by human monocyte-derived macrophages.
Abstract:
The molecular level structure of mixtures of water and alcohols is very complicated and has been under intense research in the recent past. Both experimental and computational methods have been used in the studies. One method for studying the intra- and intermolecular bindings in the mixtures is the use of so-called difference Compton profiles, which are a way to obtain information about changes in the electron wave functions. In Compton scattering, a photon scatters inelastically off an electron. The Compton profile, which is obtained from the electron wave functions, is directly proportional to the probability of a photon scattering at a given energy into a given solid angle. In this work we develop a method to compute Compton profiles numerically for mixtures of liquids. In order to obtain the electronic wave functions necessary to calculate the Compton profiles, we need some statistical information about atomic coordinates. Acquiring this using ab-initio molecular dynamics is beyond our computational capabilities, and therefore we use classical molecular dynamics to model the movement of atoms in the mixture. We discuss the validity of the chosen method in view of the results obtained from the simulations. There are some difficulties in using classical molecular dynamics for the quantum mechanical calculations, but these can possibly be overcome by parameter tuning. According to the calculations, clear differences can be seen in the Compton profiles of different mixtures. This prediction needs to be tested in experiments in order to find out whether the approximations made are valid.
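To make the wave-function-to-profile relation concrete: for a spherically averaged momentum density ρ(p), the isotropic Compton profile is J(q) = 2π ∫_{|q|}^{∞} p ρ(p) dp (atomic units). The sketch below, which is my own illustration rather than the thesis's method, checks this numerically against the textbook closed form for the hydrogen 1s state:

```python
import numpy as np
from scipy.integrate import quad

def compton_profile(rho, q):
    """Isotropic Compton profile from a spherically symmetric momentum
    density rho(p):  J(q) = 2*pi * integral_{|q|}^{inf} p * rho(p) dp."""
    val, _ = quad(lambda p: 2.0 * np.pi * p * rho(p), abs(q), np.inf)
    return val

# Hydrogen 1s momentum density (a.u.): rho(p) = 8 / (pi^2 (1 + p^2)^4),
# whose profile has the closed form J(q) = 8 / (3 pi (1 + q^2)^3).
def rho_1s(p):
    return 8.0 / (np.pi ** 2 * (1.0 + p * p) ** 4)

for q in (0.0, 0.5, 1.0, 2.0):
    analytic = 8.0 / (3.0 * np.pi * (1.0 + q * q) ** 3)
    print(q, compton_profile(rho_1s, q), analytic)

# A "difference Compton profile" between two systems is then simply the
# difference of two such J(q) curves.
```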
Abstract:
There is intense activity in the area of theoretical chemistry of gold. It is now possible to predict new molecular species and, more recently, solids by combining relativistic methodology with isoelectronic thinking. In this thesis we predict a series of solid sheet-type crystals for Group-11 cyanides, MCN (M=Cu, Ag, Au), and Group-2 and 12 carbides MC2 (M=Be-Ba, Zn-Hg). The idea of sheets is then extended to nanostrips, which can be bent to nanorings. The bending energies and deformation frequencies can be systematized by treating these molecules as elastic bodies. In these species Au atoms act as an 'intermolecular glue'. Further suggested molecular species are the new uncongested aurocarbons and the neutral Au_nHg_m clusters. Many of the suggested species are expected to be stabilized by aurophilic interactions. We also estimate the MP2 basis-set limit of the aurophilicity for the model compounds [ClAuPH_3]_2 and [P(AuPH_3)_4]^+. Besides investigating the size of the basis set applied, our research confirms that the 19-VE TZVP+2f level, used a decade ago, already produced 74 % of the present aurophilic attraction energy for the [ClAuPH_3]_2 dimer. Likewise, we verify the preferred C4v structure for the [P(AuPH_3)_4]^+ cation at the MP2 level. We also perform the first calculation on model aurophilic systems using the SCS-MP2 method and compare the results to high-accuracy CCSD(T) ones. The recently obtained high-resolution microwave spectra on MCN molecules (M=Cu, Ag, Au) provide an excellent testing ground for quantum chemistry. MP2 or CCSD(T) calculations, correlating all 19 valence electrons of Au and including BSSE and SO corrections, are able to give bond lengths to 0.6 pm, or better. Our calculated vibrational frequencies are expected to be better than the currently available experimental estimates. Qualitative evidence for multiple Au-C bonding in triatomic AuCN is also found.
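For reference, the elastic-body treatment of bent nanostrips mentioned above amounts to the classical bending energy of a thin rod: for a strip of length L closed into a ring of radius R = L/(2π), the constant curvature κ = 1/R gives (textbook elasticity with the bending rigidity B as a fitted parameter; not a formula quoted from the thesis)

$$
E_{\mathrm{bend}} \;=\; \frac{1}{2}\int_0^L B\,\kappa(s)^2\,\mathrm{d}s
\;=\; \frac{B L}{2R^2}
\;=\; \frac{2\pi^2 B}{L},
$$

so fitting B to the computed ring energies systematizes both the bending energies and, via the same elastic model, the deformation frequencies.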
Abstract:
The chemical and physical properties of bimetallic clusters have attracted considerable attention due to the potential technological applications of mixed-metal systems. Clusters are of fundamental interest because they are the link between atomic, surface, and bulk properties; studying them can hence reveal more information about the metal-metal bond in small systems. The studies in my thesis focus mainly on two different kinds of bimetallic clusters: clusters consisting of extraordinarily shaped all-metal four-membered rings, and a series of sodium auride clusters. As described in most general organic chemistry books nowadays, a group of compounds is classified as aromatic because of their remarkable stability and their particular geometrical and energetic properties. The notion of aromaticity is essentially qualitative. More recently, the connection has been made between aromaticity and energetic and magnetic properties. Also, discussions of the aromatic nature of molecular rings are no longer limited to organic compounds obeying Hückel's rule. In our research, we mainly applied the GIMIC method to several bimetallic clusters at the CCSD level and compared the results with those obtained using chemical-shift-based methods. The magnetically induced ring currents can easily be generated with the GIMIC method, and the aromatic nature of each system can therefore be clarified. We performed intensive quantum chemical calculations to explore the character of the anionic sodium auride clusters and the corresponding neutral clusters, since molecules involving gold atoms are fascinating to investigate due to gold's distinctive physical and chemical properties. Like small gold clusters, the sodium auride clusters seem to form planar structures. With the addition of a negative charge, the gold atom in anionic clusters prefers to carry the charge and orients itself away from other gold atoms. As a result, the energetically lowest isomer of an anionic cluster differs from that of the corresponding neutral cluster. Most importantly, we present a comprehensive strategy of ab initio applications to computationally reproduce the experimental photoelectron spectra.