4 resultados para data complexity
em Helda - Digital Repository of University of Helsinki
Resumo:
Matrix decompositions, where a given matrix is represented as a product of two other matrices, are regularly used in data mining. Most matrix decompositions have their roots in linear algebra, but the needs of data mining are not always those of linear algebra. In data mining one needs to have results that are interpretable -- and what is considered interpretable in data mining can be very different to what is considered interpretable in linear algebra. --- The purpose of this thesis is to study matrix decompositions that directly address the issue of interpretability. An example is a decomposition of binary matrices where the factor matrices are assumed to be binary and the matrix multiplication is Boolean. The restriction to binary factor matrices increases interpretability -- factor matrices are of the same type as the original matrix -- and allows the use of Boolean matrix multiplication, which is often more intuitive than normal matrix multiplication with binary matrices. Also several other decomposition methods are described, and the computational complexity of computing them is studied together with the hardness of approximating the related optimization problems. Based on these studies, algorithms for constructing the decompositions are proposed. Constructing the decompositions turns out to be computationally hard, and the proposed algorithms are mostly based on various heuristics. Nevertheless, the algorithms are shown to be capable of finding good results in empirical experiments conducted with both synthetic and real-world data.
Resumo:
The Minimum Description Length (MDL) principle is a general, well-founded theoretical formalization of statistical modeling. The most important notion of MDL is the stochastic complexity, which can be interpreted as the shortest description length of a given sample of data relative to a model class. The exact definition of the stochastic complexity has gone through several evolutionary steps. The latest instantation is based on the so-called Normalized Maximum Likelihood (NML) distribution which has been shown to possess several important theoretical properties. However, the applications of this modern version of the MDL have been quite rare because of computational complexity problems, i.e., for discrete data, the definition of NML involves an exponential sum, and in the case of continuous data, a multi-dimensional integral usually infeasible to evaluate or even approximate accurately. In this doctoral dissertation, we present mathematical techniques for computing NML efficiently for some model families involving discrete data. We also show how these techniques can be used to apply MDL in two practical applications: histogram density estimation and clustering of multi-dimensional data.
Resumo:
The increased accuracy in the cosmological observations, especially in the measurements of the comic microwave background, allow us to study the primordial perturbations in grater detail. In this thesis, we allow the possibility for a correlated isocurvature perturbations alongside the usual adiabatic perturbations. Thus far the simplest six parameter \Lambda CDM model has been able to accommodate all the observational data rather well. However, we find that the 3-year WMAP data and the 2006 Boomerang data favour a nonzero nonadiabatic contribution to the CMB angular power sprctrum. This is primordial isocurvature perturbation that is positively correlated with the primordial curvature perturbation. Compared with the adiabatic \Lambda CMD model we have four additional parameters describing the increased complexity if the primordial perturbations. Our best-fit model has a 4% nonadiabatic contribution to the CMB temperature variance and the fit is improved by \Delta\chi^2 = 9.7. We can attribute this preference for isocurvature to a feature in the peak structure of the angular power spectrum, namely, the widths of the second and third acoustic peak. Along the way, we have improved our analysis methods by identifying some issues with the parametrisation of the primordial perturbation spectra and suggesting ways to handle these. Due to the improvements, the convergence of our Markov chains is improved. The change of parametrisation has an effect on the MCMC analysis because of the change in priors. We have checked our results against this and find only marginal differences between our parametrisation.
Resumo:
In this dissertation I study language complexity from a typological perspective. Since the structuralist era, it has been assumed that local complexity differences in languages are balanced out in cross-linguistic comparisons and that complexity is not affected by the geopolitical or sociocultural aspects of the speech community. However, these assumptions have seldom been studied systematically from a typological point of view. My objective is to define complexity so that it is possible to compare it across languages and to approach its variation with the methods of quantitative typology. My main empirical research questions are: i) does language complexity vary in any systematic way in local domains, and ii) can language complexity be affected by the geographical or social environment? These questions are studied in three articles, whose findings are summarized in the introduction to the dissertation. In order to enable cross-language comparison, I measure complexity as the description length of the regularities in an entity; I separate it from difficulty, focus on local instead of global complexity, and break it up into different types. This approach helps avoid the problems that plagued earlier metrics of language complexity. My approach to grammar is functional-typological in nature, and the theoretical framework is basic linguistic theory. I delimit the empirical research functionally to the marking of core arguments (the basic participants in the sentence). I assess the distributions of complexity in this domain with multifactorial statistical methods and use different sampling strategies, implementing, for instance, the Greenbergian view of universals as diachronic laws of type preference. My data come from large and balanced samples (up to approximately 850 languages), drawn mainly from reference grammars. The results suggest that various significant trends occur in the marking of core arguments in regard to complexity and that complexity in this domain correlates with population size. These results provide evidence that linguistic patterns interact among themselves in terms of complexity, that language structure adapts to the social environment, and that there may be cognitive mechanisms that limit complexity locally. My approach to complexity and language universals can therefore be successfully applied to empirical data and may serve as a model for further research in these areas.