12 results for Maximum-likelihood-estimation
in Helda - Digital Repository of University of Helsinki
Abstract:
Whether a statistician wants to complement a probability model for observed data with a prior distribution and carry out fully probabilistic inference, or base the inference only on the likelihood function, may be a fundamental question in theory, but in practice it may well be of less importance if the likelihood contains much more information than the prior. Maximum likelihood inference can be justified as a Gaussian approximation at the posterior mode, using flat priors. However, in situations where parametric assumptions in standard statistical models would be too rigid, more flexible model formulation, combined with fully probabilistic inference, can be achieved using hierarchical Bayesian parametrization. This work includes five articles, all of which apply probability modeling to various problems involving incomplete observation. Three of the papers apply maximum likelihood estimation and two of them hierarchical Bayesian modeling. Because maximum likelihood may be presented as a special case of Bayesian inference, but not the other way round, in the introductory part of this work we present a framework for probability-based inference using only Bayesian concepts. We also re-derive some results presented in the original articles using the toolbox developed herein, to show that they are also justifiable under this more general framework. Here the assumption of exchangeability and de Finetti's representation theorem are applied repeatedly for justifying the use of standard parametric probability models with conditionally independent likelihood contributions. It is argued that this same reasoning can also be applied under sampling from a finite population. The main emphasis here is on probability-based inference under incomplete observation due to study design. This is illustrated using a generic two-phase cohort sampling design as an example.
The alternative approaches presented for analysis of such a design are full likelihood, which utilizes all observed information, and conditional likelihood, which is restricted to a completely observed set, conditioning on the rule that generated that set. Conditional likelihood inference is also applied for a joint analysis of prevalence and incidence data, a situation subject to both left censoring and left truncation. Other topics covered are model uncertainty and causal inference using posterior predictive distributions. We formulate a non-parametric monotonic regression model for one or more covariates and a Bayesian estimation procedure, and apply the model in the context of optimal sequential treatment regimes, demonstrating that inference based on posterior predictive distributions is feasible also in this case.
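The remark that maximum likelihood can be read as a Gaussian approximation at the posterior mode under a flat prior can be illustrated with a minimal sketch. The binomial model, the counts and the function name below are illustrative assumptions and are not taken from the thesis.

```python
import math

def flat_prior_gaussian_approx(successes, trials):
    """Return (mode, sd) of the Gaussian approximation to the posterior
    of a binomial proportion under a flat Uniform(0, 1) prior."""
    if not 0 < successes < trials:
        raise ValueError("need 0 < successes < trials for an interior mode")
    # With a flat prior the posterior is proportional to the likelihood,
    # so the posterior mode coincides with the maximum likelihood estimate.
    mode = successes / trials
    # Observed information: minus the second derivative of the
    # log-likelihood at the mode, which equals n / (p (1 - p)).
    information = trials / (mode * (1.0 - mode))
    return mode, math.sqrt(1.0 / information)

# Hypothetical data: 7 events in 20 trials.
mode, sd = flat_prior_gaussian_approx(successes=7, trials=20)
print(f"mode = {mode:.3f}, approximate sd = {sd:.3f}")
```

The approximation is only as good as the quadratic expansion of the log-likelihood, so it tends to be poor for small samples or modes near the boundary.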
Abstract:
The Minimum Description Length (MDL) principle is a general, well-founded theoretical formalization of statistical modeling. The most important notion of MDL is the stochastic complexity, which can be interpreted as the shortest description length of a given sample of data relative to a model class. The exact definition of the stochastic complexity has gone through several evolutionary steps. The latest instantiation is based on the so-called Normalized Maximum Likelihood (NML) distribution, which has been shown to possess several important theoretical properties. However, applications of this modern version of MDL have been quite rare because of computational complexity problems: for discrete data, the definition of NML involves an exponential sum, and in the case of continuous data, a multi-dimensional integral that is usually infeasible to evaluate or even approximate accurately. In this doctoral dissertation, we present mathematical techniques for computing NML efficiently for some model families involving discrete data. We also show how these techniques can be used to apply MDL in two practical applications: histogram density estimation and clustering of multi-dimensional data.
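To make the "exponential sum" concrete, here is a minimal sketch for the simplest discrete case, a Bernoulli model. It is not one of the dissertation's algorithms; the function names are hypothetical. The sum over all 2^n binary sequences collapses to n + 1 terms because sequences with the same number of ones share the same maximized likelihood.

```python
import math

def bernoulli_nml_normalizer(n):
    """Sum of maximized likelihoods over all binary sequences of length n,
    grouped by the number of ones k (n + 1 terms instead of 2**n)."""
    total = 0.0
    for k in range(n + 1):
        p = k / n
        # Python evaluates 0.0 ** 0 as 1.0, the convention needed here.
        total += math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
    return total

def stochastic_complexity(bits):
    """NML code length in nats: -log P(x | theta_hat(x)) + log C(n)."""
    n, k = len(bits), sum(bits)
    p = k / n
    max_likelihood = p ** k * (1.0 - p) ** (n - k)
    return -math.log(max_likelihood) + math.log(bernoulli_nml_normalizer(n))

print(stochastic_complexity([1, 0, 1, 1, 0, 1, 1, 1]))  # hypothetical sample
```

For richer model families no such simple grouping is available, which is the computational difficulty the abstract refers to.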
Abstract:
The focus of this study is on the statistical analysis of categorical responses where the response values depend on each other. The most typical example of this kind of dependence is when repeated responses have been obtained from the same study unit. For example, in Paper I, the response of interest is pneumococcal nasopharyngeal carriage (yes/no) in 329 children. For each child, carriage is measured nine times during the first 18 months of life, and thus the repeated responses on each child cannot be assumed independent of each other. In this example, the interest typically lies in the carriage prevalence and in whether different risk factors affect the prevalence. Regression analysis is the established method for studying the effects of risk factors. In order to make correct inferences from the regression model, the associations between repeated responses need to be taken into account. The analysis of repeated categorical responses typically focuses on regression modelling. However, further insights can also be gained by investigating the structure of the association. The central theme of this study is the development of joint regression and association models. The analysis of repeated, or otherwise clustered, categorical responses is computationally difficult. Likelihood-based inference is often feasible only when the number of repeated responses for each study unit is small. In Paper IV, an algorithm is presented that substantially facilitates maximum likelihood fitting, especially when the number of repeated responses increases. In addition, a notable result arising from this work is freely available software for likelihood-based estimation of clustered categorical responses.
Abstract:
Leaf and needle biomasses are key factors in forest health. Insects that feed on needles cause growth losses and tree mortality. Insect outbreaks in Finnish forests have increased rapidly during the last decade, and due to climate change the damage is expected to become more serious. There is a need for cost-efficient methods for inventorying these outbreaks. Remote sensing is a promising means of assessing forests and forest damage. The purpose of this study is to investigate the usability of airborne laser scanning in estimating Scots pine defoliation caused by the common pine sawfly (Diprion pini L.). The study area is situated in the Ilomantsi district, eastern Finland. Study materials included high-pulse airborne laser scans from July and October 2008. Reference data consisted of 90 circular field plots measured in May-June 2009. Defoliation percentage on these field plots was estimated visually. The study was made at plot level, and the methods used were linear regression, unsupervised classification, the maximum likelihood method, and stepwise linear regression. Field plots were divided into defoliation classes in two different ways: with two classes, the defoliation percentages used were 0–20 % and 20–100 %, and with four classes, 0–10 %, 10–20 %, 20–30 % and 30–100 %. The results varied depending on the method and the laser scan. In the first laser scan the best results were obtained with stepwise linear regression. The kappa value was 0.47 when using two classes and 0.37 when using four classes. In the second laser scan the best results were obtained with the maximum likelihood method. The kappa values were 0.42 and 0.37, respectively. The feature that explained defoliation best was a vegetation index (pulses reflected from height > 2 m / all pulses). There was no significant difference in the results between the two laser scans, so the seasonal change in defoliation could not be detected in this study.
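The kappa values reported above measure classification agreement beyond chance. As a minimal sketch of how Cohen's kappa is computed from a confusion matrix, consider the following; the matrix is hypothetical and does not reproduce the study's data.

```python
def cohens_kappa(confusion):
    """Cohen's kappa for a square confusion matrix given as a list of rows
    (rows: reference classes, columns: predicted classes)."""
    total = sum(sum(row) for row in confusion)
    # Observed agreement: share of plots on the diagonal.
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1.0 - expected)

# Hypothetical two-class result (e.g. 0-20 % vs 20-100 % defoliation).
example = [[20, 5],
           [10, 15]]
print(round(cohens_kappa(example), 2))
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance, which puts the reported values of roughly 0.4 in the moderate range.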
Abstract:
The Thesis presents a state-space model for a basketball league and a Kalman filter algorithm for the estimation of the state of the league. In the state-space model, each of the basketball teams is associated with a rating that represents its strength compared to the other teams. The ratings are assumed to evolve in time following a stochastic process with independent Gaussian increments. The estimation of the team ratings is based on the observed game scores, which are assumed to depend linearly on the true strengths of the teams and independent Gaussian noise. The team ratings are estimated using a recursive Kalman filter algorithm that produces least squares optimal estimates for the team strengths and predictions for the scores of future games. Additionally, if the Gaussianity assumption holds, the predictions given by the Kalman filter maximize the likelihood of the observed scores. The team ratings allow probabilistic inference about the ranking of the teams and their relative strengths as well as about the teams’ winning probabilities in future games. The predictions about the winners of the games are correct 65-70% of the time. The team ratings explain 16% of the random variation observed in the game scores. Furthermore, the winning probabilities given by the model are consistent with the observed scores. The state-space model includes four independent parameters that involve the variances of the noise terms and the home court advantage observed in the scores. The Thesis presents the estimation of these parameters using the maximum likelihood method as well as other techniques. The Thesis also gives various example analyses related to the American professional basketball league, the National Basketball Association (NBA), and the regular seasons played in the years 2005 through 2010. Additionally, the season 2009-2010 is discussed in full detail, including the playoffs.
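A minimal sketch of one Kalman filter step for a rating model of this kind follows. It assumes, for illustration only, that ratings follow a random walk and that the observed home-minus-away score margin equals the rating difference plus a home advantage plus Gaussian noise. The function name, parameter names and default values are hypothetical, and the Thesis's actual model may differ.

```python
def kalman_game_update(x, P, home, away, margin, q=0.1, r=100.0, home_adv=3.0):
    """One predict/update step for team ratings x (list) with covariance P
    (list of lists) after observing a single game's score margin."""
    n = len(x)
    # Predict: random-walk ratings, so each variance grows by q.
    P_pred = [[P[i][j] + (q if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    # Observation row H has +1 at home and -1 at away, so P_pred @ H^T is:
    PHt = [P_pred[i][home] - P_pred[i][away] for i in range(n)]
    innovation = margin - (x[home] - x[away] + home_adv)
    S = PHt[home] - PHt[away] + r          # innovation variance (scalar)
    K = [v / S for v in PHt]               # Kalman gain
    x_new = [x[i] + K[i] * innovation for i in range(n)]
    # P_pred is symmetric, so H @ P_pred equals PHt as a row vector.
    P_new = [[P_pred[i][j] - K[i] * PHt[j] for j in range(n)]
             for i in range(n)]
    return x_new, P_new

# Three hypothetical teams with equal prior ratings; team 0 hosts team 1
# and wins by 13 points.
ratings = [0.0, 0.0, 0.0]
cov = [[25.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
ratings, cov = kalman_game_update(ratings, cov, home=0, away=1, margin=13.0)
print(ratings)
```

Under these assumptions the winner's rating moves up and the loser's moves down by the same amount, while a team not involved in the game keeps its rating as long as its prior is uncorrelated with the two playing teams.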
Abstract:
Aptitude-based student selection: A study concerning the admission processes of some technically oriented healthcare degree programmes in Finland (Orthotics and Prosthetics, Dental Technology and Optometry). The data studied consisted of convenience samples of preadmission information and the results of the admission processes of three technically oriented healthcare degree programmes (Orthotics and Prosthetics, Dental Technology and Optometry) in Finland during the years 1977-1986 and 2003. The number of subjects tested and interviewed in the first samples was 191, 615 and 606, and in the second 67, 64 and 89, respectively. The questions of the six studies were: I. How were different kinds of preadmission data related to each other? II. Which were the major determinants of the admission decisions? III. Did the graduated students and those who dropped out differ from each other? IV. Was it possible to predict how well students would perform in the programmes? V. How was the student selection executed in the year 2003? VI. Should clinical vs. statistical prediction or both be used? (Some remarks are presented on Meehl's argument: "Always, we might as well face it, the shadow of the statistician hovers in the background; always the actuary will have the final word.") The main results of the study were as follows: Ability tests, dexterity tests and judgements of personality traits (communication skills, initiative, stress tolerance and motivation) provided unique, non-redundant information about the applicants. Available demographic variables did not bias the judgements of personality traits. In all three programme settings, four-factor solutions (personality, reasoning, gender-technical and age-vocational with factor scores) could be extracted by the Maximum Likelihood method with graphical Varimax rotation. The personality factor dominated the final aptitude judgements and very strongly affected the selection decisions.
There were no clear differences between graduated students and those who had dropped out in regard to the four factors. In addition, the factor scores did not predict how well the students performed in the programmes. Meehl's argument on the uncertainty of clinical prediction was supported by the results, which on the other hand did not provide any relevant data for rules on statistical prediction. No clear arguments for or against aptitude-based student selection were presented. However, the structure of the aptitude measures and their impact on the admission process are now better known. The concept of "personal aptitude" is not necessarily included in the values and preferences of those in charge of organizing the schooling. Thus, obviously the most well-founded and cost-effective way to execute student selection is to rely on e.g. the grade point averages of the matriculation examination and/or written entrance exams. This procedure, according to the present study, would result in a student group which has a quite different makeup (60%) from the group selected on the basis of aptitude tests. For the recruiting organizations, instead, "personal aptitude" may be a matter of great importance. The employers, of course, decide on personnel selection. The psychologists, if consulted, are responsible for the proper use of psychological measures.
Abstract:
Hoof diseases are a growing problem on dairy farms. In addition to premature culling, hoof and leg disorders cause economic losses by reducing milk yield and increasing veterinary and hoof-trimming costs. The aim of this work was to study the heritability of hoof diseases and the factors affecting them. The data were obtained from the Terveet Sorkat (Healthy Hooves) programme, in which participation is voluntary. Hoof trimmers had classified the hoof diseases in 2003–2004. The hoof diseases (sole haemorrhage, chronic laminitis, white line separation, sole ulcer, interdigital dermatitis, heel horn erosion, digital dermatitis, corkscrew claw and other hoof diseases) were recorded in the data as binary (yes/no) traits. The WSYS software was used for data preprocessing, preliminary analyses and testing the statistical significance of fixed effects with the F-test. In addition, the significance of the fixed effects was tested with a logit model in the SAS software. Variance components were estimated with the Restricted Maximum Likelihood (REML) method using the VCE4 software. A repeatability animal model gave the following heritability estimates: sole haemorrhage 0.05, white line separation 0.04, corkscrew claw 0.05, heel horn erosion 0.01, sole ulcer 0.03, and hoof diseases as a single trait 0.06. Transformed to heritabilities of liability to hoof disease, the estimates were: sole haemorrhage 0.11, white line separation 0.12, corkscrew claw 0.15, heel horn erosion 0.03, sole ulcer 0.17, and hoof diseases as a single trait 0.09. The genetic correlations between hoof diseases were positive, except for the genetic correlation between white line separation and heel horn erosion, which was slightly negative. The genetic correlations of hoof diseases with 305-day milk yield ranged from -0.20 to 0.27. Based on this study and earlier studies, the genetic contribution to hoof diseases is not very large.
Because environmental factors play a major role in the occurrence of hoof diseases, prevention should pay particular attention to barn conditions, regular hoof trimming and correct feeding.
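The abstract reports heritabilities both on the observed yes/no scale and on the liability scale, but does not say how the conversion was done. One commonly used conversion for binary traits is the threshold-model formula h²(liability) = h²(observed) · p(1 − p) / z², where p is the prevalence and z is the standard normal density at the threshold. The sketch below assumes that formula; the prevalence in the example is hypothetical.

```python
from statistics import NormalDist

def liability_heritability(h2_observed, prevalence):
    """Convert a heritability estimated on the binary (0/1) scale to the
    underlying liability scale under a threshold model (assumed formula)."""
    std_normal = NormalDist()
    # Threshold on the liability scale above which the trait is expressed.
    threshold = std_normal.inv_cdf(1.0 - prevalence)
    z = std_normal.pdf(threshold)  # normal density at the threshold
    return h2_observed * prevalence * (1.0 - prevalence) / z ** 2

# Hypothetical example: observed-scale estimate 0.05, prevalence 10 %.
print(round(liability_heritability(0.05, 0.10), 3))
```

The conversion factor grows as the trait becomes rarer, which is one reason liability-scale estimates for low-prevalence diseases can be noticeably higher than observed-scale ones.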
Abstract:
The Taita Hills in southeastern Kenya form the northernmost part of Africa’s Eastern Arc Mountains, which have been identified by Conservation International as one of the top ten biodiversity hotspots on Earth. As with many areas of the developing world, over recent decades the Taita Hills have experienced significant population growth leading to associated major changes in land use and land cover (LULC), as well as escalating land degradation, particularly soil erosion. Multi-temporal medium resolution multispectral optical satellite data, such as imagery from the SPOT HRV, HRVIR, and HRG sensors, provides a valuable source of information for environmental monitoring and modelling at a landscape level at local and regional scales. However, utilization of multi-temporal SPOT data in quantitative remote sensing studies requires the removal of atmospheric effects and the derivation of surface reflectance factor. Furthermore, for areas of rugged terrain, such as the Taita Hills, topographic correction is necessary to derive comparable reflectance throughout a SPOT scene. Reliable monitoring of LULC change over time and modelling of land degradation and human population distribution and abundance are of crucial importance to sustainable development, natural resource management, biodiversity conservation, and understanding and mitigating climate change and its impacts. The main purpose of this thesis was to develop and validate enhanced processing of SPOT satellite imagery for use in environmental monitoring and modelling at a landscape level, in regions of the developing world with limited ancillary data availability. 
The Taita Hills formed the application study site, whilst the Helsinki metropolitan region was used as a control site for validation and assessment of the applied atmospheric correction techniques, where multiangular reflectance field measurements were taken and where horizontal visibility meteorological data concurrent with image acquisition were available. The proposed historical empirical line method (HELM) for absolute atmospheric correction was found to be the only applied technique that could derive surface reflectance factor within an RMSE of < 0.02 ps in the SPOT visible and near-infrared bands; an accuracy level identified as a benchmark for successful atmospheric correction. A multi-scale segmentation/object relationship modelling (MSS/ORM) approach was applied to map LULC in the Taita Hills from the multi-temporal SPOT imagery. This object-based procedure was shown to yield significant improvements over a uni-scale maximum-likelihood technique. The derived LULC data were used in combination with low-cost GIS geospatial layers describing elevation, rainfall and soil type to model degradation in the Taita Hills in the form of potential soil loss, utilizing the simple universal soil loss equation (USLE). Furthermore, human population distribution and abundance were modelled with satisfactory results using only SPOT and GIS derived data and non-Gaussian predictive modelling techniques. The SPOT derived LULC data were found to be unnecessary as a predictor because the first and second order image texture measurements had greater power to explain variation in dwelling unit occurrence and abundance. The ability of the procedures to be implemented locally in the developing world using low-cost or freely available data and software was considered. The techniques discussed in this thesis are considered equally applicable to other medium- and high-resolution optical satellite imagery, as well as to the SPOT data used here.
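The universal soil loss equation mentioned above is a product of five factors, A = R · K · LS · C · P. The following minimal sketch shows the calculation for a single map cell; the factor values are hypothetical and are not taken from the thesis.

```python
def usle_soil_loss(rainfall_erosivity, soil_erodibility, slope_length_steepness,
                   cover_management, support_practice):
    """Potential annual soil loss A = R * K * LS * C * P.
    Units of A follow from the units chosen for R and K."""
    return (rainfall_erosivity * soil_erodibility * slope_length_steepness
            * cover_management * support_practice)

# Hypothetical factor values for one map cell.
loss = usle_soil_loss(rainfall_erosivity=100.0, soil_erodibility=0.3,
                      slope_length_steepness=1.5, cover_management=0.2,
                      support_practice=1.0)
print(loss)
```

In a GIS workflow of the kind described, each factor would typically be a raster layer (for example, C derived from the LULC map) and the product would be taken cell by cell.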
Abstract:
Minimum Description Length (MDL) is an information-theoretic principle that can be used for model selection and other statistical inference tasks. There are various ways to use the principle in practice. One theoretically valid way is to use the normalized maximum likelihood (NML) criterion. Due to computational difficulties, this approach has not been used very often. This thesis presents efficient floating-point algorithms that make it possible to compute the NML for multinomial, Naive Bayes and Bayesian forest models. None of the presented algorithms relies on asymptotic analysis, and for the first two model classes we also discuss how to compute exact rational number solutions.
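For the multinomial case, one recurrence known from the NML literature gives the normalizing sum C(K, n) for K categories and n observations as C(K + 2, n) = C(K + 1, n) + (n / K) · C(K, n). Whether this is among the algorithms presented in the thesis is not stated in the abstract, so the sketch below is an illustration only, checked against a brute-force sum for small inputs. All function names are hypothetical.

```python
import math
from itertools import product

def multinomial_nml_normalizer(num_categories, n):
    """C(K, n) via the recurrence C(K+2, n) = C(K+1, n) + (n / K) * C(K, n)."""
    if num_categories < 1 or n < 1:
        raise ValueError("num_categories and n must be positive")
    c_prev = 1.0  # C(1, n): a single category always has likelihood 1
    if num_categories == 1:
        return c_prev
    # C(2, n): direct sum over the count k of the first category.
    c_curr = sum(math.comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
                 for k in range(n + 1))
    for k_cat in range(1, num_categories - 1):
        c_prev, c_curr = c_curr, c_curr + (n / k_cat) * c_prev
    return c_curr

def brute_force_normalizer(num_categories, n):
    """Reference value: sum maximized likelihoods over all K**n sequences."""
    total = 0.0
    for seq in product(range(num_categories), repeat=n):
        counts = [seq.count(c) for c in range(num_categories)]
        total += math.prod((h / n) ** h for h in counts)
    return total

print(multinomial_nml_normalizer(3, 4), brute_force_normalizer(3, 4))
```

The brute-force version grows as K to the power n and is only usable as a check on tiny inputs, whereas the recurrence costs O(n + K) arithmetic operations.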
Abstract:
The aim of the study was to estimate the genetic parameters of the traits measured in the puppy test of the Guide Dog School (Opaskoirakoulu) and to identify the factors affecting them. The data consisted of dogs (n = 900) that had taken the puppy test at the Guide Dog School in 1988–2008. Pedigree data were obtained from the Finnish Kennel Club (Suomen Kennelliitto ry) and supplemented with dogs not included in the register. Eleven traits were studied: behaviour when held in the lap, coming when called, contact, fighting, submission, recovery, sound sensitivity, behaviour on a table, following, functional capacity and nerve structure. All traits are scored in the puppy test according to predefined behaviour descriptions. Microsoft Office Excel 2003 and the WSYS-L and XWSYS software were used for data preprocessing and preliminary analyses. The WSYS-L and XWSYS software were used for classifying the fixed effects and testing their significance. Variance components and heritabilities were estimated with the Restricted Maximum Likelihood (REML) method using the VCE6 software. The heritability estimates of the traits studied were low or moderate (h² = 0.07–0.39). The highest heritability estimates were obtained for submission (0.39), functional capacity (0.32) and contact (0.31). The genetic correlations between the traits were mainly positive. The exception was submission, which was negatively correlated with recovery (-0.47) and following (-0.64). Some of the traits were very strongly correlated (r > 0.8). The phenotypic correlations ranged from 0.11 to 0.73 and were lower than the genetic correlations. According to this study, there is genetic variation in behaviour when held in the lap, coming when called, contact, fighting, submission, sound sensitivity, behaviour on a table and functional capacity, and these traits can therefore be changed in the desired direction through breeding.
Abstract:
Questions of the small size of non-industrial private forest (NIPF) holdings in Finland are considered and factors affecting their partitioning are analyzed. This work arises out of Finnish forest policy statements in which the small average size of holdings has been seen to have a negative influence on the economics of forestry. A survey of the literature indicates that the size of holdings is an important factor determining the costs of logging and silvicultural operations, while its influence on the timber supply is slight. The empirical data are based on a sample of 314 holdings collected by interviewing forest owners in the years 1980-86. In 1990-91 the same holdings were resurveyed by means of a postal inquiry and partly by interviewing forest owners. The principal objective in compiling the data is to assist in quantifying ownership factors that influence partitioning among different kinds of NIPF holdings. Thus the mechanisms of partitioning were described and a maximum likelihood logistic regression model was constructed using seven independent holding and ownership variables. One out of four holdings had undergone partitioning in conjunction with a change in ownership: one fifth among family owned holdings and nearly a half among jointly owned holdings. The results of the logistic regression model indicate, for instance, that the odds of partitioning are about three times greater for jointly owned holdings than for family owned ones. Also, the probabilities of partitioning were estimated, and the impact of independent dichotomous variables on the probability of partitioning ranged between 0.02 and 0.10. The low value of the Hosmer-Lemeshow test statistic indicates a good fit of the model, and the rate of correct classification was estimated to be 88 per cent with a cutoff point of 0.5. The average size of holdings undergoing ownership changes decreased from 29.9 ha to 28.7 ha over the approximate interval 1983-90.
In addition, the transition probability matrix showed that the trends towards smaller size categories mostly involved the small size categories (less than 20 ha). The results of the study can be used in considering the effects of the small size of holdings on forestry, and when the purpose is to influence partitioning through forest or rural policy.
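An odds ratio from a logistic regression translates into a probability only once a baseline is fixed. The sketch below applies an odds ratio of 3 to a baseline probability of one fifth, both figures quoted in the abstract. The model's odds ratio is adjusted for other covariates, so this is a rough illustration rather than a reproduction of the study's estimates, and the function name is hypothetical.

```python
def apply_odds_ratio(baseline_probability, odds_ratio):
    """Probability obtained by multiplying the baseline odds by an odds ratio."""
    baseline_odds = baseline_probability / (1.0 - baseline_probability)
    odds = baseline_odds * odds_ratio
    return odds / (1.0 + odds)

# Family owned baseline of 0.20 and an odds ratio of about 3 for joint ownership.
print(round(apply_odds_ratio(0.20, 3.0), 3))
```

Tripling the odds does not triple the probability: here the probability roughly doubles, which is in line with the "nearly a half" reported for jointly owned holdings.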