9 results for Statistically Weighted Regularities
in Helda - Digital Repository of University of Helsinki
Abstract:
Analyzing statistical dependencies is a fundamental problem in all empirical science. Dependencies help us understand causes and effects, create new scientific theories, and invent cures to problems. Nowadays, large amounts of data are available, but efficient computational tools for analyzing the data are missing. In this research, we develop efficient algorithms for a commonly occurring search problem - searching for the statistically most significant dependency rules in binary data. We consider dependency rules of the form X->A or X->not A, where X is a set of positive-valued attributes and A is a single attribute. Such rules describe which factors either increase or decrease the probability of the consequent A. A classical example is genetic and environmental factors, which can either cause or prevent a disease. The emphasis in this research is that the discovered dependencies should be genuine - i.e. they should also hold in future data. This is an important distinction from the traditional association rules, which - in spite of their name and a similar appearance to dependency rules - do not necessarily represent statistical dependencies at all or represent only spurious connections, which occur by chance. Therefore, the principal objective is to search for the rules with statistical significance measures. Another important objective is to search for only non-redundant rules, which express the real causes of dependence, without any occasional extra factors. The extra factors do not add any new information on the dependence, but can only blur it and make it less accurate in future data. The problem is computationally very demanding, because the number of all possible rules increases exponentially with the number of attributes. In addition, neither statistical dependency nor statistical significance is a monotonic property, which means that the traditional pruning techniques do not work.
As a solution, we first derive the mathematical basis for pruning the search space with any well-behaved statistical significance measure. The mathematical theory is complemented by a new algorithmic invention, which enables an efficient search without any heuristic restrictions. The resulting algorithm can be used to search for both positive and negative dependencies with any commonly used statistical measures, like Fisher's exact test, the chi-squared measure, mutual information, and z scores. According to our experiments, the algorithm scales well, especially with Fisher's exact test. It can easily handle even the densest data sets with 10000-20000 attributes. Still, the results are globally optimal, which is a remarkable improvement over the existing solutions. In practice, this means that the user does not have to worry whether the dependencies hold in future data or whether the data still contains better, but undiscovered dependencies.
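As an illustration of how a single rule X->A can be scored with Fisher's exact test, the following is a minimal sketch, not the search algorithm described in the abstract. It assumes each data row is represented as a Python set of the attributes that are present; the function names `rule_counts` and `fisher_p` are hypothetical and chosen only for this example.

```python
from math import comb


def rule_counts(rows, x_attrs, a):
    """Count the frequencies needed to score the rule X -> A.

    rows    : list of sets, each holding the attributes present in one row
              (an assumed representation of the binary data)
    x_attrs : set of attributes forming the antecedent X
    a       : the single consequent attribute A
    """
    n = len(rows)
    n_x = sum(1 for r in rows if x_attrs <= r)
    n_a = sum(1 for r in rows if a in r)
    n_xa = sum(1 for r in rows if x_attrs <= r and a in r)
    return n, n_x, n_a, n_xa


def fisher_p(n, n_x, n_a, n_xa):
    """One-sided Fisher's exact p-value for positive dependence of A on X.

    This is the hypergeometric probability of seeing at least n_xa rows
    with both X and A when the margins n_x and n_a are held fixed.
    """
    upper = min(n_x, n_a)
    tail = sum(comb(n_x, k) * comb(n - n_x, n_a - k)
               for k in range(n_xa, upper + 1))
    return tail / comb(n, n_a)
```

A smaller p-value indicates a more significant positive dependency. For a negative rule X->not A, the same function could be applied after replacing `n_a` with `n - n_a` and `n_xa` with `n_x - n_xa`. Scoring every candidate rule this way is exactly the exponential enumeration that the pruning theory in the abstract is meant to avoid.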
Abstract:
The Earth's ecosystems are protected from the dangerous part of the solar ultraviolet (UV) radiation by stratospheric ozone, which absorbs most of the harmful UV wavelengths. Severe depletion of stratospheric ozone has been observed in the Antarctic region, and to a lesser extent in the Arctic and midlatitudes. Concern about the effects of increasing UV radiation on human beings and the natural environment has led to ground-based monitoring of UV radiation. In order to achieve high-quality UV time series for scientific analyses, proper quality control (QC) and quality assurance (QA) procedures have to be followed. In this work, practices of QC and QA are developed for Brewer spectroradiometers and NILU-UV multifilter radiometers, which measure in the Arctic and Antarctic regions, respectively. These practices are applicable to other UV instruments as well. The spectral features and the effect of different factors affecting UV radiation were studied for the spectral UV time series at Sodankylä. The QA of the Finnish Meteorological Institute's (FMI) two Brewer spectroradiometers included daily maintenance, laboratory characterizations, the calculation of long-term spectral responsivity, data processing and quality assessment. New methods for the cosine correction, the temperature correction and the calculation of long-term changes of spectral responsivity were developed. Reconstructed UV irradiances were used as a QA tool for spectroradiometer data. The actual cosine correction factor was found to vary between 1.08-1.12 and 1.08-1.13. The temperature characterization showed a linear temperature dependence between the instrument's internal temperature and the photon counts per cycle. Both Brewers have participated in international spectroradiometer comparisons and have shown good stability. The differences between the Brewers and the portable reference spectroradiometer QASUME have been within 5% during 2002-2010.
The features of the spectral UV radiation time series at Sodankylä were analysed for the time period 1990-2001. No statistically significant long-term changes in UV irradiances were found, and the results were strongly dependent on the time period studied. Ozone was the dominant factor affecting UV radiation during the springtime, whereas clouds played a more important role during the summertime. During this work, the Antarctic NILU-UV multifilter radiometer network was established by the Instituto Nacional de Meteorología (INM) as a joint Spanish-Argentinian-Finnish cooperation project. As part of this work, the QC/QA practices of the network were developed. They included training of the operators, daily maintenance, regular lamp tests and solar comparisons with the travelling reference instrument. Drifts of up to 35% in the sensitivity of the channels of the NILU-UV multifilter radiometers were found during the first four years of operation. This work emphasized the importance of proper QC/QA, including regular lamp tests, for the multifilter radiometers as well. The effect of the drifts was corrected by a method scaling the site NILU-UV channels to those of the travelling reference NILU-UV. After correction, the mean ratios of erythemally-weighted UV dose rates measured during solar comparisons between the reference NILU-UV and the site NILU-UVs were 1.007±0.011 and 1.012±0.012 for Ushuaia and Marambio, respectively, when the solar zenith angle varied up to 80°. Solar comparisons between the NILU-UVs and spectroradiometers showed a ±5% difference near local noon time, which can be seen as proof of successful QC/QA procedures and transfer of irradiance scales. This work also showed that UV measurements made in the Arctic and Antarctic can be comparable with each other.
Abstract:
We propose to compress weighted graphs (networks), motivated by the observation that large networks of social, biological, or other relations can be complex to handle and visualize. In the process also known as graph simplification, nodes and (unweighted) edges are grouped to supernodes and superedges, respectively, to obtain a smaller graph. We propose models and algorithms for weighted graphs. The interpretation (i.e. decompression) of a compressed, weighted graph is that a pair of original nodes is connected by an edge if their supernodes are connected by one, and that the weight of an edge is approximated to be the weight of the superedge. The compression problem now consists of choosing supernodes, superedges, and superedge weights so that the approximation error is minimized while the amount of compression is maximized. In this paper, we formulate this task as the 'simple weighted graph compression problem'. We then propose a much wider class of tasks under the name of 'generalized weighted graph compression problem'. The generalized task extends the optimization to preserve longer-range connectivities between nodes, not just individual edge weights. We study the properties of these problems and propose a range of algorithms to solve them, with different balances between complexity and quality of the result. We evaluate the problems and algorithms experimentally on real networks. The results indicate that weighted graphs can be compressed efficiently with relatively little compression error.
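The decompression rule described above can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the paper's algorithms: the grouping of nodes into supernodes is taken as given, each superedge weight is set to the mean of the original weights it replaces (with absent edges counted as 0), and the approximation error is measured as the Euclidean distance between original and decompressed edge weights. The abstract does not specify these choices, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def compress(weights, partition):
    """Build superedge weights for a given grouping of nodes.

    weights   : dict mapping frozenset({u, v}) -> edge weight
    partition : dict mapping each node -> its supernode id
    Each superedge gets the mean weight over all node pairs it covers
    (absent edges count as 0); all-zero superedges are dropped.
    """
    sums, counts = {}, {}
    for u, v in combinations(sorted(partition), 2):
        key = frozenset((partition[u], partition[v]))
        sums[key] = sums.get(key, 0.0) + weights.get(frozenset((u, v)), 0.0)
        counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums if sums[k] > 0}


def decompress(super_weights, partition):
    """Two nodes are joined iff their supernodes are; the edge weight
    is approximated by the superedge weight."""
    out = {}
    for u, v in combinations(sorted(partition), 2):
        key = frozenset((partition[u], partition[v]))
        if key in super_weights:
            out[frozenset((u, v))] = super_weights[key]
    return out


def compression_error(weights, approx, nodes):
    """Euclidean distance between original and decompressed weights
    (an assumed error measure)."""
    return sqrt(sum(
        (weights.get(frozenset(p), 0.0) - approx.get(frozenset(p), 0.0)) ** 2
        for p in combinations(sorted(nodes), 2)))
```

With this setup, merging two nodes whose edges to every other node carry identical weights produces zero error, while merging nodes with differing weights trades error for a smaller graph. This is the balance that the compression problem optimizes.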
Abstract:
We use parallel weighted finite-state transducers to implement a part-of-speech tagger, which obtains state-of-the-art accuracy when used to tag the Europarl corpora for Finnish, Swedish and English. Our system consists of a weighted lexicon and a guesser combined with a bigram model factored into two weighted transducers. We use both lemmas and tag sequences in the bigram model, which guarantees reliable bigram estimates.
Abstract:
In this paper we present simple methods for construction and evaluation of finite-state spell-checking tools using an existing finite-state lexical automaton, freely available finite-state tools and Internet corpora acquired from projects such as Wikipedia. As an example, we use a freely available open-source implementation of Finnish morphology, made with traditional finite-state morphology tools, and demonstrate rapid building of Northern Sámi and English spell checkers from tools and resources available from the Internet.