39 resultados para avoin data


Relevância:

20.00% 20.00%

Publicador:

Resumo:

The Minimum Description Length (MDL) principle is a general, well-founded theoretical formalization of statistical modeling. The most important notion of MDL is the stochastic complexity, which can be interpreted as the shortest description length of a given sample of data relative to a model class. The exact definition of the stochastic complexity has gone through several evolutionary steps. The latest instantation is based on the so-called Normalized Maximum Likelihood (NML) distribution which has been shown to possess several important theoretical properties. However, the applications of this modern version of the MDL have been quite rare because of computational complexity problems, i.e., for discrete data, the definition of NML involves an exponential sum, and in the case of continuous data, a multi-dimensional integral usually infeasible to evaluate or even approximate accurately. In this doctoral dissertation, we present mathematical techniques for computing NML efficiently for some model families involving discrete data. We also show how these techniques can be used to apply MDL in two practical applications: histogram density estimation and clustering of multi-dimensional data.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Matrix decompositions, where a given matrix is represented as a product of two other matrices, are regularly used in data mining. Most matrix decompositions have their roots in linear algebra, but the needs of data mining are not always those of linear algebra. In data mining one needs to have results that are interpretable -- and what is considered interpretable in data mining can be very different to what is considered interpretable in linear algebra. --- The purpose of this thesis is to study matrix decompositions that directly address the issue of interpretability. An example is a decomposition of binary matrices where the factor matrices are assumed to be binary and the matrix multiplication is Boolean. The restriction to binary factor matrices increases interpretability -- factor matrices are of the same type as the original matrix -- and allows the use of Boolean matrix multiplication, which is often more intuitive than normal matrix multiplication with binary matrices. Also several other decomposition methods are described, and the computational complexity of computing them is studied together with the hardness of approximating the related optimization problems. Based on these studies, algorithms for constructing the decompositions are proposed. Constructing the decompositions turns out to be computationally hard, and the proposed algorithms are mostly based on various heuristics. Nevertheless, the algorithms are shown to be capable of finding good results in empirical experiments conducted with both synthetic and real-world data.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Analyzing statistical dependencies is a fundamental problem in all empirical science. Dependencies help us understand causes and effects, create new scientific theories, and invent cures to problems. Nowadays, large amounts of data is available, but efficient computational tools for analyzing the data are missing. In this research, we develop efficient algorithms for a commonly occurring search problem - searching for the statistically most significant dependency rules in binary data. We consider dependency rules of the form X->A or X->not A, where X is a set of positive-valued attributes and A is a single attribute. Such rules describe which factors either increase or decrease the probability of the consequent A. A classical example are genetic and environmental factors, which can either cause or prevent a disease. The emphasis in this research is that the discovered dependencies should be genuine - i.e. they should also hold in future data. This is an important distinction from the traditional association rules, which - in spite of their name and a similar appearance to dependency rules - do not necessarily represent statistical dependencies at all or represent only spurious connections, which occur by chance. Therefore, the principal objective is to search for the rules with statistical significance measures. Another important objective is to search for only non-redundant rules, which express the real causes of dependence, without any occasional extra factors. The extra factors do not add any new information on the dependence, but can only blur it and make it less accurate in future data. The problem is computationally very demanding, because the number of all possible rules increases exponentially with the number of attributes. In addition, neither the statistical dependency nor the statistical significance are monotonic properties, which means that the traditional pruning techniques do not work. As a solution, we first derive the mathematical basis for pruning the search space with any well-behaving statistical significance measures. The mathematical theory is complemented by a new algorithmic invention, which enables an efficient search without any heuristic restrictions. The resulting algorithm can be used to search for both positive and negative dependencies with any commonly used statistical measures, like Fisher's exact test, the chi-squared measure, mutual information, and z scores. According to our experiments, the algorithm is well-scalable, especially with Fisher's exact test. It can easily handle even the densest data sets with 10000-20000 attributes. Still, the results are globally optimal, which is a remarkable improvement over the existing solutions. In practice, this means that the user does not have to worry whether the dependencies hold in future data or if the data still contains better, but undiscovered dependencies.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Cell transition data is obtained from a cellular phone that switches its current serving cell tower. The data consists of a sequence of transition events, which are pairs of cell identifiers and transition times. The focus of this thesis is applying data mining methods to such data, developing new algorithms, and extracting knowledge that will be a solid foundation on which to build location-aware applications. In addition to a thorough exploration of the features of the data, the tools and methods developed in this thesis provide solutions to three distinct research problems. First, we develop clustering algorithms that produce a reliable mapping between cell transitions and physical locations observed by users of mobile devices. The main clustering algorithm operates in online fashion, and we consider also a number of offline clustering methods for comparison. Second, we define the concept of significant locations, known as bases, and give an online algorithm for determining them. Finally, we consider the task of predicting the movement of the user, based on historical data. We develop a prediction algorithm that considers paths of movement in their entirety, instead of just the most recent movement history. All of the presented methods are evaluated with a significant body of real cell transition data, collected from about one hundred different individuals. The algorithms developed in this thesis are designed to be implemented on a mobile device, and require no extra hardware sensors or network infrastructure. By not relying on external services and keeping the user information as much as possible on the user s own personal device, we avoid privacy issues and let the users control the disclosure of their location information.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Despite much research on forest biodiversity in Fennoscandia, the exact mechanisms of species declines in dead-wood dependent fungi are still poorly understood. In particular, there is only limited information on why certain fungal species have responded negatively to habitat loss and fragmentation, while others have not. Understanding the mechanisms behind species declines would be essential for the design and development of ecologically effective and scientifically informed conservation measures, and management practices that would promote biodiversity in production forests. In this thesis I study the ecology of polypores and their responses to forest management, with a particular focus on why some species have declined more than others. The data considered in the thesis comprise altogether 98,318 dead-wood objects, with 43,085 observations of 174 fungal species. Out of these, 1,964 observations represent 58 red-listed species. The data were collected from 496 sites, including woodland key habitats, clear-cuts with retention trees, mature managed forests, and natural or natural-like forests in southern Finland and Russian Karelia. I show that the most relevant way of measuring resource availability can differ to a great extent between species seemingly sharing the same resources. It is thus critical to measure the availability of resources in a way that takes into account the ecological requirements of the species. The results show that connectivity at the local, landscape and regional scales is important especially for the highly specialized species, many of which are also red-listed. Habitat loss and fragmentation affect not only species diversity but also the relative abundances of the species and, consequently, species interactions and fungal successional pathways. Changes in species distributions and abundances are likely to affect the food chains in which wood-inhabiting fungi are involved, and thus the functioning of the whole forest ecosystem. The findings of my thesis highlight the importance of protecting well-connected, large and high-quality forest areas to maintain forest biodiversity. Small habitat patches distributed across the landscape are likely to contribute only marginally to protection of red-listed species, especially if habitat quality is not substantially higher than in ordinary managed forest, as is the case with woodland key habitats. Key habitats might supplement the forest protection network if they were delineated larger and if harvesting of individual trees was prohibited in them. Taking the landscape perspective into account in the design and development of conservation measures is critical while striving to halt the decline of forest biodiversity in an ecologically effective manner.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

During the past decades agricultural intensification has caused dramatic population declines in a wide range of taxa related to farmland habitats, including farmland birds. In this thesis, I studied how boreal farmland landscape characteristics and agricultural land use affect the abundance and diversity of farmland birds using extensive field data collected by territory mapping of breeding farmland birds in various parts of Finland. My results show that the area and openness of agricultural areas are key determinants of farmland bird abundance and distribution. A landscape composition with enough open farmland combined with key habitats such as farmyards and wetland is likely to provide essential prerequisites for the occurrence of a rich farmland avifauna. In Finland, the majority of large areas suitable for open habitat specialists are located in southern and western parts of the country. However, the diversity of the species with an unfavourable conservation status in Europe (SPECs) had notable hotspot areas in northern and north-western agricultural areas. I found that in boreal agroecosystems farmland birds favour fields with springtime vegetative cover, especially agricultural grasslands and set-asides. Hence, in the spring cereal dominated Finnish agroecosystems it is the absence of field vegetation that may limit populations of many farmland bird species. It is likely that the decrease of crops providing vegetative cover in the spring, such as permanent grasslands, cultivated grass, and autumn-sown cereals, has greatly contributed to the declines of Finnish farmland birds. Grass crops have persistently declined in Finland as a consequence of specialization in crop production and the large-scale decline in livestock husbandry. Small-scale non-crop habitats, especially ditches and ditch margins, are also important for many bird species in the Finnish agroecosystems, but have dramatically declined during the last decades. A major problem for farmland bird conservation in Finland is the conflict between landscape structure and agricultural management. Areas with mixed and cattle farming are virtually absent from the large agricultural plains of southern and south-western Finland, where the landscape structure is more likely to be favourable for rich farmland bird assemblages. On the other hand, mixed and cattle farming is still rather frequent in northern and central parts of the country, where the landscape structure is not suitable for many farmland specialist birds requiring open landscapes. My results provide useful guidelines for farmland bird conservation, and imply that considerable attention needs to be paid to landscape factors when selecting areas for various conservational management actions, such as agri-environment schemes. Actions promoting the abundance of set-asides, grass crops, and ditches would markedly benefit Finnish farmland bird populations. Organic farming may benefit farmland birds, but it is not clear how general its beneficial effect is in boreal agroecosystems. The most urgent action aiming to preserve farmland biodiversity would be to support re-introducing and sustaining cattle farming by environmental subsidies. This would be especially beneficial in the southern parts of Finland, where the landscape characteristics and abundance of agricultural areas are most suitable for farmland birds and where cattle farming is currently rare.