14 results for minimum message length

in Helda - Digital Repository of University of Helsinki


Relevance: 80.00%

Abstract:

In this Thesis, we develop theory and methods for computational data analysis. The problems in data analysis are approached from three perspectives: statistical learning theory, the Bayesian framework, and the information-theoretic minimum description length (MDL) principle. Contributions in statistical learning theory address the possibility of generalization to unseen cases, and regression analysis with partially observed data with an application to mobile device positioning. In the second part of the Thesis, we discuss so-called Bayesian network classifiers, and show that they are closely related to logistic regression models. In the final part, we apply the MDL principle to tracing the history of old manuscripts, and to noise reduction in digital signals.

Relevance: 80.00%

Abstract:

Minimum Description Length (MDL) is an information-theoretic principle that can be used for model selection and other statistical inference tasks. There are various ways to use the principle in practice. One theoretically valid way is to use the normalized maximum likelihood (NML) criterion. Due to computational difficulties, this approach has not been used very often. This thesis presents efficient floating-point algorithms that make it possible to compute the NML for multinomial, Naive Bayes and Bayesian forest models. None of the presented algorithms rely on asymptotic analysis and with the first two model classes we also discuss how to compute exact rational number solutions.

Relevance: 80.00%

Abstract:

The Minimum Description Length (MDL) principle is a general, well-founded theoretical formalization of statistical modeling. The most important notion of MDL is the stochastic complexity, which can be interpreted as the shortest description length of a given sample of data relative to a model class. The exact definition of the stochastic complexity has gone through several evolutionary steps. The latest instantiation is based on the so-called Normalized Maximum Likelihood (NML) distribution, which has been shown to possess several important theoretical properties. However, applications of this modern version of MDL have been quite rare because of computational complexity problems: for discrete data, the definition of NML involves an exponential sum, and in the case of continuous data, a multi-dimensional integral that is usually infeasible to evaluate or even approximate accurately. In this doctoral dissertation, we present mathematical techniques for computing NML efficiently for some model families involving discrete data. We also show how these techniques can be used to apply MDL in two practical applications: histogram density estimation and clustering of multi-dimensional data.
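To make the "exponential sum" concrete, the following minimal sketch evaluates the NML normalizer for the Bernoulli model class in two ways: by brute force over all 2^n binary sequences, and by grouping sequences by their number of ones. This is a textbook special case chosen for illustration, not one of the dissertation's algorithms, and all function names are hypothetical.

```python
from itertools import product
from math import comb, log2


def max_likelihood(ones: int, n: int) -> float:
    """Maximized Bernoulli likelihood of a length-n sequence with `ones` ones."""
    p = ones / n
    # Python evaluates 0.0 ** 0 as 1.0, matching the convention 0^0 = 1.
    return (p ** ones) * ((1 - p) ** (n - ones))


def nml_normalizer_bruteforce(n: int) -> float:
    """Sum the maximized likelihood over all 2^n sequences (the exponential sum)."""
    return sum(max_likelihood(sum(seq), n) for seq in product((0, 1), repeat=n))


def nml_normalizer_by_counts(n: int) -> float:
    """Same sum grouped by the number of ones: n + 1 terms instead of 2^n."""
    return sum(comb(n, k) * max_likelihood(k, n) for k in range(n + 1))


def stochastic_complexity(seq) -> float:
    """NML code length in bits: -log2 of the maximized likelihood plus log2 of the normalizer."""
    n = len(seq)
    return -log2(max_likelihood(sum(seq), n)) + log2(nml_normalizer_by_counts(n))
```

Both normalizer functions return the same value for small n (n must be at least 1); only the grouped version remains practical as n grows, which hints at why specialised techniques are needed for richer model families.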

Relevance: 20.00%

Abstract:

The effect of temperature on height growth of Scots pine in the northern boreal zone in Lapland was studied on two different time scales. Intra-annual growth was monitored in four stands in up to four growing seasons using an approximately biweekly measurement interval. Inter-annual growth was studied using growth records representing seven stands and five geographical locations. All the stands were growing on a dry to semi-dry heath that is a typical site type for pine stands in Finland. The methodology is based on applied time-series analysis and multilevel modelling. Intra-annual elongation of the leader shoot correlated with temperature sum accumulation. Height growth ceased when, on average, 41% of the relative temperature sum of the site was achieved (observed minimum and maximum were 38% and 43%). The relative temperature sum was calculated by dividing the actual temperature sum by the long-term mean of the total annual temperature sum for the site. Our results suggest that annual height growth ceases when a location-specific temperature sum threshold is attained. The positive effect of the mean July temperature of the previous year on annual height increment proved to be very strong at high latitudes. The mean November temperature of the year before the previous had a statistically significant effect on height increment in the three northernmost stands. The effect of mean monthly precipitation on annual height growth was statistically insignificant. There was a non-linear dependence between length and needle density of annual shoots. Exceptionally low height growth results in high needle density, but the effect is weaker in years of average or good height growth. Radial growth and next year's height growth are both largely controlled by current July temperature. Nevertheless, their growth variation in terms of minimum and maximum is not necessarily strongly correlated. This is partly because height growth is more sensitive to changes in temperature. In addition, the actual effective temperature period is not exactly the same for these two growth components. Yet, there is a long-term balance that was also statistically distinguishable; radial growth correlated significantly with height growth with a lag of 2 years. Temperature periods shorter than a month are more effective variables than mean monthly values, but the improvement is on the scale of modest to good when applying Julian days or growing-degree-days as pointers.
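The relative temperature sum described above can be sketched as follows. The 5 °C base temperature and the function names are assumptions made for illustration; the abstract does not state how the temperature sum itself was accumulated.

```python
def temperature_sum(daily_means, base=5.0):
    """Accumulated degree days: daily mean temperature in excess of `base` (assumed 5 degrees C)."""
    return sum(max(t - base, 0.0) for t in daily_means)


def relative_temperature_sum(daily_means, long_term_annual_sum, base=5.0):
    """Actual temperature sum divided by the site's long-term mean annual temperature sum."""
    return temperature_sum(daily_means, base) / long_term_annual_sum


def cessation_day(daily_means, long_term_annual_sum, threshold=0.41, base=5.0):
    """Index of the first day on which the relative temperature sum reaches `threshold`, else None."""
    accumulated = 0.0
    for day, t in enumerate(daily_means):
        accumulated += max(t - base, 0.0)
        if accumulated / long_term_annual_sum >= threshold:
            return day
    return None
```

The default threshold of 0.41 mirrors the average reported in the abstract; a site-specific value between the observed 0.38 and 0.43 could be passed instead.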

Relevance: 20.00%

Abstract:

This study addresses three important issues in tree bucking optimization in the context of cut-to-length harvesting. (1) Would the fit between the log demand and log output distributions be better if the price and/or demand matrices controlling the bucking decisions on modern cut-to-length harvesters were adjusted to the unique conditions of each individual stand? (2) In what ways can we generate stand- and product-specific price and demand matrices? (3) What alternatives do we have to measure the fit between the log demand and log output distributions, and what would be an ideal goodness-of-fit measure? Three iterative search systems were developed for seeking stand-specific price and demand matrix sets: (1) a fuzzy logic control system for calibrating the price matrix of one log product for one stand at a time (the stand-level one-product approach); (2) a genetic algorithm system for adjusting the price matrices of one log product in parallel for several stands (the forest-level one-product approach); and (3) a genetic algorithm system for dividing the overall demand matrix of each of the several log products into stand-specific sub-demands simultaneously for several stands and products (the forest-level multi-product approach). The stem material used for testing the performance of the stand-specific price and demand matrices against that of the reference matrices comprised 9 155 Norway spruce (Picea abies (L.) Karst.) sawlog stems gathered by harvesters from 15 mature spruce-dominated stands in southern Finland. The reference price and demand matrices were either direct copies or slightly modified versions of those used by two Finnish sawmilling companies. Two types of stand-specific bucking matrices were compiled for each log product. One was from the harvester-collected stem profiles and the other was from the pre-harvest inventory data. Four goodness-of-fit measures were analyzed for their appropriateness in determining the similarity between the log demand and log output distributions: (1) the apportionment degree (index), (2) the chi-square statistic, (3) the Laspeyres quantity index, and (4) the price-weighted apportionment degree. The study confirmed that any improvement in the fit between the log demand and log output distributions can only be realized at the expense of log volumes produced. Stand-level pre-control of price matrices was found to be advantageous, provided the control is done with perfect stem data. Forest-level pre-control of price matrices resulted in no improvement in the cumulative apportionment degree. Cutting stands under the control of stand-specific demand matrices yielded a better total fit between the demand and output matrices at the forest level than was obtained by cutting each stand with non-stand-specific reference matrices. The theoretical and experimental analyses suggest that none of the three alternative goodness-of-fit measures clearly outperforms the traditional apportionment degree measure. Keywords: harvesting, tree bucking optimization, simulation, fuzzy control, genetic algorithms, goodness-of-fit
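Two of the goodness-of-fit measures named above can be sketched briefly. The apportionment degree below follows the definition commonly used in the bucking literature (100 times one minus half the summed absolute differences between relative demand and relative output); the abstract does not give its exact formulation, so this is an assumption. The matrices are flattened to plain lists of log counts per length-diameter class, and the function names are hypothetical.

```python
def apportionment_degree(demand, output):
    """Percentage overlap between relative demand and relative output distributions."""
    d_total = sum(demand)
    o_total = sum(output)
    diff = sum(abs(d / d_total - o / o_total) for d, o in zip(demand, output))
    return 100.0 * (1.0 - diff / 2.0)


def chi_square(demand, output):
    """Pearson chi-square of output counts against demand scaled to the same total."""
    scale = sum(output) / sum(demand)
    # Classes with zero demand are skipped to avoid division by zero.
    return sum(
        (o - d * scale) ** 2 / (d * scale)
        for d, o in zip(demand, output)
        if d > 0
    )
```

For example, a demand of `[50, 30, 20]` against an output of `[40, 40, 20]` gives an apportionment degree of 90, since 10% of the logs fall in the wrong class.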

Relevance: 20.00%

Abstract:

This thesis studies optimisation problems related to modern large-scale distributed systems, such as wireless sensor networks and wireless ad-hoc networks. The concrete tasks that we use as motivating examples are the following: (i) maximising the lifetime of a battery-powered wireless sensor network, (ii) maximising the capacity of a wireless communication network, and (iii) minimising the number of sensors in a surveillance application. A sensor node consumes energy both when it is transmitting or forwarding data, and when it is performing measurements. Hence task (i), lifetime maximisation, can be approached from two different perspectives. First, we can seek optimal data flows that make the most out of the energy resources available in the network; such optimisation problems are examples of so-called max-min linear programs. Second, we can conserve energy by putting redundant sensors into sleep mode; we arrive at the sleep scheduling problem, in which the objective is to find an optimal schedule that determines when each sensor node is asleep and when it is awake. In a wireless network, simultaneous radio transmissions may interfere with each other. Task (ii), capacity maximisation, therefore gives rise to another scheduling problem, the activity scheduling problem, in which the objective is to find a minimum-length conflict-free schedule that satisfies the data transmission requirements of all wireless communication links. Task (iii), minimising the number of sensors, is related to the classical graph problem of finding a minimum dominating set. However, if we are not only interested in detecting an intruder but also locating the intruder, it is not sufficient to solve the dominating set problem; formulations such as minimum-size identifying codes and locating-dominating codes are more appropriate. This thesis presents approximation algorithms for each of these optimisation problems, i.e., for max-min linear programs, sleep scheduling, activity scheduling, identifying codes, and locating-dominating codes. Two complementary approaches are taken. The main focus is on local algorithms, which are constant-time distributed algorithms. The contributions include local approximation algorithms for max-min linear programs, sleep scheduling, and activity scheduling. In the case of max-min linear programs, tight upper and lower bounds are proved for the best possible approximation ratio that can be achieved by any local algorithm. The second approach is the study of centralised polynomial-time algorithms in local graphs; these are geometric graphs whose structure exhibits spatial locality. Among other contributions, it is shown that while identifying codes and locating-dominating codes are hard to approximate in general graphs, they admit a polynomial-time approximation scheme in local graphs.
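The difference between detecting and locating an intruder can be illustrated with a small centralised check, written here as a sketch and not as one of the thesis's algorithms. It uses the standard definitions: a dominating set intersects every closed neighbourhood, while an identifying code additionally makes those intersections pairwise distinct. The adjacency-dictionary representation and the function names are assumptions.

```python
def closed_neighbourhood(graph, v):
    """The vertex v together with its neighbours."""
    return {v} | set(graph[v])


def is_dominating_set(graph, chosen):
    """Every vertex is in `chosen` or adjacent to a vertex in `chosen`."""
    chosen = set(chosen)
    return all(closed_neighbourhood(graph, v) & chosen for v in graph)


def is_identifying_code(graph, chosen):
    """Dominating, and each vertex sees a distinct subset of `chosen`."""
    chosen = set(chosen)
    signatures = [frozenset(closed_neighbourhood(graph, v) & chosen) for v in graph]
    # An empty signature is falsy, so all(...) also verifies domination.
    return all(signatures) and len(set(signatures)) == len(signatures)


# Path 0 - 1 - 2 - 3 as an adjacency dictionary.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```

On this path, sensors at `{1, 2}` detect an intruder anywhere but cannot tell vertex 1 from vertex 2, because both are seen by the same pair of sensors; adding a sensor at vertex 0 makes every signature unique.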

Relevance: 20.00%

Abstract:

In this thesis the role played by expansive and introduced species in the phytoplankton ecology of the Baltic Sea was investigated. The aims were threefold. First, the studies investigated the resting stages of dinoflagellates, which were transported into the Baltic Sea via shipping and were able to germinate under the ambient, nutrient-rich, brackish water conditions. The studies also estimated which factors favoured the occurrence and spread of P. minimum in the Baltic Sea and discussed the identification of this morphologically variable species. In addition, the classification of phytoplankton species recently observed in the Baltic Sea was discussed. Incubation of sediments from four Finnish ports and 10 ships' ballast tanks revealed that the sediments act as sources of living dinoflagellates and other phytoplankton. Dinoflagellates germinated from all ports detected and from 90% of ballast tanks. The concentrations of cells germinating from ballast tank sediments were mostly low compared with the acceptable cell concentrations set by the International Maritime Organization's (IMO's) International Convention for the Control and Management of Ships' Ballast Water and Sediments. However, the IMO allows such high concentrations of small cells in the discharged ballast water that the total number of cells in large ballast water tanks can be very high. Prorocentrum minimum occurred in the Baltic Sea annually but with no obvious trend in the 10-year timespan from 1993 to 2002. The species occurred under wide ranges of temperatures and salinities, and the abundance of the species was positively related especially to the presence of organic nitrogen and phosphorus. This indicated that the species was favoured by increased organic nutrient loading and runoff from land and rivers. The cell shape of P. minimum varied from triangular to oval-round, but morphological fine details indicated that only one morphospecies was present. P. minimum is also, according to present knowledge, the only potentially harmful phytoplankton species that has recently expanded widely into new areas of the Baltic Sea.

Relevance: 20.00%

Abstract:

Knowledge Flow, my dear friend! I would like to introduce you to a close relative of yours: Organizational Communication. You might want to take a moment to hear what your newfound kin has to say. As bright as you are, dear Flow, you're missing a piece of the puzzle - for one cannot study any aspect of an organization relating to communication without acknowledging the message. Without a message, communication does not exist. Organizational Communication has always appreciated this. Perhaps the time has come for you to join ranks and do so too? The main point of this work is to prove that the form of a message considerably affects communication, interpretation - and knowledge flow. As stories are at the heart of this thesis, and entertaining, reader-friendly communication its main argument, the entire manuscript is written in story form and intentionally breaks with academic writing tradition as far as writing style goes. Each chapter reads as a story of sorts, and put together they create a grand narrative of my journey as a PhD student, the research I have conducted and the outcomes of this work. Thus if a reader hopes to make any sense of this title, she must read it in the same way one would read a novel, from beginning to end. This is a thesis with three aspirations. First, it sets out to prove that knowledge flow cannot be studied without a message. Second, it moves on to give the reader a once-over of a much used message form: storytelling. After these two goals are tackled, the path is clear to research whether message form indeed is as essential as claimed. I do so through both a qualitative and a quantitative study. The former acted as both a stepping stone into the research area and as an inspirational pilot, from which the research design for the larger quantitative study was drawn. Together, these two studies answered my research question - and allowed me to fulfill the third, final and foremost aspiration of this study - bridging the gap between two separate fields of knowledge management: knowledge flow and storytelling.

Relevance: 20.00%

Abstract:

The aim of this study was to examine the applicability of the Phonological Mean Length of Utterance (pMLU) method to the data of children acquiring Finnish, for both typically developing children and children with a Specific Language Impairment (SLI). Study I examined typically developing children at the end of the one-word stage (N=17, mean age 1;8), and Study II analysed children's (N=5) productions in a follow-up study with four assessment points (ages 2;0, 2;6, 3;0, 3;6). Study III was carried out in the form of a review article that examined recent research on the phonological development of children acquiring Finnish and compared the results with general trends and cross-linguistic findings in phonological development. Study IV included children with SLI (N=4, mean age 4;10) and age-matched peers. The analyses in Studies I, II and IV were made using the quantitative pMLU method. In the pMLU method, pMLU values are counted for both the words that the children targeted (so-called target words) and the words produced by the children. When the child's average pMLU value is divided by the average target word pMLU value, it is possible to examine that child's accuracy in producing the words with the Whole-Word Proximity (PWP) value. In addition, the number of entirely correctly produced words is counted to obtain the Whole-Word Correctness (PWC) value. Qualitative analyses were carried out in order to examine how the children's phoneme inventories and deficiencies in phonotactics would explain the observed pMLU, PWP and PWC values. The results showed that the pMLU values for children acquiring Finnish were relatively high already at the end of the one-word stage (Study I). The values were found to reflect the characteristics of the ambient language. Typological features that lead to cross-linguistic differences in pMLU values were also observed in the review article (Study III), which noted that in the course of phonological acquisition there are a large number of language-specific phenomena and processes. Study II indicated that overall the children's phonological development during the follow-up period was reflected in the pMLU, PWP and PWC values, although the method showed limitations in detecting qualitative differences between the children. Correct vowels were not scored in the pMLU counts, which led to some misleadingly high pMLU and PWP results: vowel errors were only reflected in the PWC values. Typically developing children in Study II reached the highest possible pMLU results already around age 3;6. At the same time, the differences between the children with SLI and age-matched peers in the pMLU values were very prominent (Study IV). The values for the children with SLI were similar to the ones reported for two-year-old children. Qualitative analyses revealed that the phonologies of the children with SLI largely resembled the ones of younger, typically developing children. However, unusual errors were also witnessed (e.g., vowel errors, omissions of word-initial stops, consonants added to the initial position in words beginning with a vowel). This dissertation provides an application of a new tool for quantitative phonological assessment and analysis in children acquiring Finnish. The preliminary results suggest that, with some modifications, the pMLU method can be used to assess children's phonological development and that it has some advantages compared to the earlier, segment-oriented approaches. Qualitative analyses complemented the pMLU's observations on the children's phonologies. More research is needed in order to verify the levels of the pMLU, PWP and PWC values in children acquiring Finnish.
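The two derived measures described above reduce to simple ratios once per-word pMLU scores are available. The sketch below takes those scores as given and does not implement the pMLU scoring rules themselves; expressing PWC as a proportion of words is an assumption (the abstract speaks of counting correct words), and the function names are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def pwp(child_pmlu, target_pmlu):
    """Whole-word proximity: child's average pMLU divided by the targets' average pMLU."""
    return mean(child_pmlu) / mean(target_pmlu)


def pwc(produced_words, target_words):
    """Whole-word correctness: share of productions identical to their target."""
    correct = sum(1 for p, t in zip(produced_words, target_words) if p == t)
    return correct / len(target_words)
```

For instance, per-word scores of `[4, 6]` against target scores of `[6, 6]` give a PWP of about 0.83, while producing one of two words exactly right gives a PWC of 0.5.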

Relevance: 20.00%

Abstract:

Microbes in natural and artificial environments as well as in the human body are a key part of the functional properties of these complex systems. The presence or absence of certain microbial taxa is a correlate of functional status, such as risk of disease or the course of metabolic processes of a microbial community. As microbes are highly diverse and mostly not cultivable, molecular markers like gene sequences are a potential basis for detection and identification of key types. The goal of this thesis was to study molecular methods for identification of microbial DNA in order to develop a tool for analysis of environmental and clinical DNA samples. Particular emphasis was placed on specificity of detection, which is a major challenge when analyzing complex microbial communities. The approach taken in this study was the application and optimization of enzymatic ligation of DNA probes coupled with microarray read-out for high-throughput microbial profiling. The results show that fungal phylotypes and human papillomavirus genotypes could be accurately identified from pools of PCR amplicons generated from purified sample DNA. Approximately 1 ng/μl of sample DNA was needed for representative PCR amplification as measured by comparisons between clone sequencing and microarray. A minimum of 0.25 amol/μl of PCR amplicons was detectable from amongst 5 ng/μl of background DNA, suggesting that the detection limit of the test, comprising a ligation reaction followed by microarray read-out, was approximately 0.04%. Detection from sample DNA directly was shown to be feasible with probes forming a circular molecule upon ligation followed by PCR amplification of the probe. In this approach, the minimum detectable relative amount of target genome was found to be 1% of all genomes in the sample as estimated from 454 deep sequencing results. Signal-to-noise of contact printed microarrays could be improved by using an internal microarray hybridization control oligonucleotide probe together with a computational algorithm. The algorithm was based on identification of a bias in the microarray data and correction of the bias as shown by simulated and real data. The results further suggest semiquantitative detection to be possible by ligation detection, allowing estimation of target abundance in a sample. However, in practice, comprehensive sequence information of full-length rRNA genes is needed to support probe design with complex samples. This study shows that DNA microarray has the potential for an accurate microbial diagnostic platform to take advantage of increasing sequence data and to replace traditional, less efficient methods that still dominate routine testing in laboratories. The data suggest that a ligation reaction based microarray assay can be optimized to a degree that allows good signal-to-noise and semiquantitative detection.