13 results for Naive Bayes classifier
in Helda - Digital Repository of University of Helsinki
Abstract:
Minimum Description Length (MDL) is an information-theoretic principle that can be used for model selection and other statistical inference tasks. There are various ways to use the principle in practice. One theoretically valid way is to use the normalized maximum likelihood (NML) criterion. Due to computational difficulties, this approach has not been used very often. This thesis presents efficient floating-point algorithms that make it possible to compute the NML for multinomial, Naive Bayes and Bayesian forest models. None of the presented algorithms rely on asymptotic analysis and with the first two model classes we also discuss how to compute exact rational number solutions.
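To make the multinomial case concrete, the sketch below evaluates the NML normalizing sum C(K, n) exactly with rational arithmetic, where the NML probability of a data set is its maximized likelihood divided by C(K, n). It uses a recurrence over the number of categories K that is known from the NML literature. It is only an illustration under that assumption, not necessarily the algorithm presented in the thesis, and the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def binomial_nml_sum(n):
    """Exact C(2, n): sum over h of comb(n, h) * (h/n)^h * ((n-h)/n)^(n-h)."""
    if n == 0:
        return Fraction(1)
    total = Fraction(0)
    for h in range(n + 1):
        # Fraction(0) ** 0 evaluates to 1, which matches the convention 0^0 = 1.
        total += comb(n, h) * Fraction(h, n) ** h * Fraction(n - h, n) ** (n - h)
    return total


def multinomial_nml_sum(num_categories, n):
    """Exact C(K, n) via the recurrence C(k+2, n) = C(k+1, n) + (n/k) * C(k, n)."""
    if num_categories < 1:
        raise ValueError("need at least one category")
    prev = Fraction(1)            # C(1, n) = 1: a single category has one outcome
    if num_categories == 1:
        return prev
    cur = binomial_nml_sum(n)     # C(2, n)
    for k in range(1, num_categories - 1):
        prev, cur = cur, cur + Fraction(n, k) * prev
    return cur


# Three categories, two observations: 3 * 1 + 3 * (2 * 1/4) = 9/2.
print(multinomial_nml_sum(3, 2))
```

Because every intermediate value is a `Fraction`, the result is an exact rational number; replacing `Fraction` with floats gives a floating-point variant at the cost of rounding error.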
Abstract:
The aim of this thesis is to develop a fully automatic lameness detection system that operates in a milking robot. The instrumentation, measurement software, algorithms for data analysis and a neural network model for lameness detection were developed. Automatic milking has become a common practice in dairy husbandry, and in the year 2006 about 4000 farms worldwide used over 6000 milking robots. There is a worldwide movement with the objective of fully automating every process from feeding to milking. Increase in automation is a consequence of increasing farm sizes, the demand for more efficient production and the growth of labour costs. As the level of automation increases, the time that the cattle keeper uses for monitoring animals often decreases. This has created a need for systems for automatically monitoring the health of farm animals. The popularity of milking robots also offers a new and unique possibility to monitor animals in a single confined space up to four times daily. Lameness is a crucial welfare issue in the modern dairy industry. Limb disorders cause serious welfare, health and economic problems especially in loose housing of cattle. Lameness causes losses in milk production and leads to early culling of animals. These costs could be reduced with early identification and treatment. At present, only a few methods for automatically detecting lameness have been developed, and the most common methods used for lameness detection and assessment are various visual locomotion scoring systems. The problem with locomotion scoring is that it needs experience to be conducted properly, it is labour intensive as an on-farm method and the results are subjective. A four balance system for measuring the leg load distribution of dairy cows during milking in order to detect lameness was developed and set up in the University of Helsinki Research farm Suitia. 
The leg weights of 73 cows were successfully recorded during almost 10,000 robotic milkings over a period of 5 months. The cows were locomotion scored weekly, and the lame cows were inspected clinically for hoof lesions. Unsuccessful measurements, caused by cows standing outside the balances, were removed from the data with a special algorithm, and the mean leg loads and the number of kicks during milking were calculated. In order to develop an expert system to automatically detect lameness cases, a model was needed. A probabilistic neural network (PNN) classifier model was chosen for the task. The data were divided into two parts, and 5,074 measurements from 37 cows were used to train the model. The operation of the model was evaluated for its ability to detect lameness in the validation dataset, which had 4,868 measurements from 36 cows. The model was able to classify 96% of the measurements correctly as sound or lame cows, and 100% of the lameness cases in the validation data were identified. The proportion of measurements causing false alarms was 1.1%. The developed model has the potential to be used for on-farm decision support and can be used in a real-time lameness monitoring system.
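For readers unfamiliar with probabilistic neural networks, the following minimal sketch shows the general idea: each class is scored by the average of Gaussian kernels centred on its training samples, and the highest-scoring class wins. The feature vectors (four leg-load fractions), the smoothing parameter `sigma`, and the equal class priors are assumptions made for illustration only; the abstract does not specify the inputs or settings used in the thesis.

```python
import math


def pnn_predict(x, train_x, train_y, sigma=0.05):
    """Classify x with a basic PNN (Parzen-window classifier, equal class priors)."""
    kernel_sums = {}
    counts = {}
    for xi, yi in zip(train_x, train_y):
        squared_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
        kernel = math.exp(-squared_dist / (2 * sigma ** 2))
        kernel_sums[yi] = kernel_sums.get(yi, 0.0) + kernel
        counts[yi] = counts.get(yi, 0) + 1
    # Average kernel response per class approximates the class-conditional density.
    return max(kernel_sums, key=lambda c: kernel_sums[c] / counts[c])


# Hypothetical toy data: fraction of body weight carried by each of four legs.
train_x = [
    [0.25, 0.25, 0.25, 0.25],
    [0.26, 0.24, 0.25, 0.25],
    [0.15, 0.35, 0.25, 0.25],
    [0.14, 0.34, 0.27, 0.25],
]
train_y = ["sound", "sound", "lame", "lame"]

print(pnn_predict([0.16, 0.33, 0.26, 0.25], train_x, train_y))
```

A PNN has no iterative training phase; the stored training samples are the model, so prediction cost grows with the size of the training set.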
Abstract:
Whether a statistician wants to complement a probability model for observed data with a prior distribution and carry out fully probabilistic inference, or base the inference only on the likelihood function, may be a fundamental question in theory, but in practice it may well be of less importance if the likelihood contains much more information than the prior. Maximum likelihood inference can be justified as a Gaussian approximation at the posterior mode, using flat priors. However, in situations where parametric assumptions in standard statistical models would be too rigid, more flexible model formulation, combined with fully probabilistic inference, can be achieved using hierarchical Bayesian parametrization. This work includes five articles, all of which apply probability modeling under various problems involving incomplete observation. Three of the papers apply maximum likelihood estimation and two of them hierarchical Bayesian modeling. Because maximum likelihood may be presented as a special case of Bayesian inference, but not the other way round, in the introductory part of this work we present a framework for probability-based inference using only Bayesian concepts. We also re-derive some results presented in the original articles using the toolbox equipped herein, to show that they are also justifiable under this more general framework. Here the assumption of exchangeability and de Finetti's representation theorem are applied repeatedly for justifying the use of standard parametric probability models with conditionally independent likelihood contributions. It is argued that this same reasoning can be applied also under sampling from a finite population. The main emphasis here is in probability-based inference under incomplete observation due to study design. This is illustrated using a generic two-phase cohort sampling design as an example. 
The alternative approaches presented for analysis of such a design are full likelihood, which utilizes all observed information, and conditional likelihood, which is restricted to a completely observed set, conditioning on the rule that generated that set. Conditional likelihood inference is also applied for a joint analysis of prevalence and incidence data, a situation subject to both left censoring and left truncation. Other topics covered are model uncertainty and causal inference using posterior predictive distributions. We formulate a non-parametric monotonic regression model for one or more covariates and a Bayesian estimation procedure, and apply the model in the context of optimal sequential treatment regimes, demonstrating that inference based on posterior predictive distributions is feasible also in this case.
Abstract:
In this thesis the use of the Bayesian approach to statistical inference in fisheries stock assessment is studied. The work was conducted in collaboration with the Finnish Game and Fisheries Research Institute by using the problem of monitoring and prediction of the juvenile salmon population in the River Tornionjoki as an example application. The River Tornionjoki is the largest salmon river flowing into the Baltic Sea. This thesis tackles the issues of model formulation and model checking as well as computational problems related to Bayesian modelling in the context of fisheries stock assessment. Each article of the thesis provides a novel method either for extracting information from data obtained via a particular type of sampling system or for integrating the information about the fish stock from multiple sources in terms of a population dynamics model. Mark-recapture and removal sampling schemes and a random catch sampling method are covered for the estimation of the population size. In addition, a method for estimating the stock composition of a salmon catch based on DNA samples is presented. For most of the articles, Markov chain Monte Carlo (MCMC) simulation has been used as a tool to approximate the posterior distribution. Problems arising from the sampling method are also briefly discussed and potential solutions for these problems are proposed. Special emphasis in the discussion is given to the philosophical foundation of the Bayesian approach in the context of fisheries stock assessment. It is argued that the role of subjective prior knowledge needed in practically all parts of a Bayesian model should be recognized and consequently fully utilised in the process of model formulation.
Abstract:
In this thesis, we develop theory and methods for computational data analysis. The problems in data analysis are approached from three perspectives: statistical learning theory, the Bayesian framework, and the information-theoretic minimum description length (MDL) principle. Contributions in statistical learning theory address the possibility of generalization to unseen cases, and regression analysis with partially observed data, with an application to mobile device positioning. In the second part of the thesis, we discuss so-called Bayesian network classifiers, and show that they are closely related to logistic regression models. In the final part, we apply the MDL principle to tracing the history of old manuscripts, and to noise reduction in digital signals.
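The link between Bayesian network classifiers and logistic regression can be seen in the simplest case, Naive Bayes with binary features and two classes: the posterior log-odds are linear in the features. The sketch below checks this numerically. The parameter values and function names are made up for illustration, and the thesis may treat a more general class of models.

```python
import math
from itertools import product


def nb_posterior(x, prior1, p1, p0):
    """P(class = 1 | x) for binary features, computed directly from Bayes' rule."""
    joint1, joint0 = prior1, 1.0 - prior1
    for xi, a, b in zip(x, p1, p0):
        joint1 *= a if xi else 1.0 - a
        joint0 *= b if xi else 1.0 - b
    return joint1 / (joint1 + joint0)


def nb_as_logistic(prior1, p1, p0):
    """Return (bias, weights) of the logistic model equivalent to Naive Bayes."""
    bias = math.log(prior1 / (1.0 - prior1))
    weights = []
    for a, b in zip(p1, p0):
        absent = math.log((1.0 - a) / (1.0 - b))  # log-odds contribution when x_i = 0
        bias += absent
        weights.append(math.log(a / b) - absent)  # extra contribution when x_i = 1
    return bias, weights


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters: P(x_i = 1 | class 1) and P(x_i = 1 | class 0).
prior1, p1, p0 = 0.3, [0.9, 0.2, 0.6], [0.4, 0.5, 0.6]
bias, weights = nb_as_logistic(prior1, p1, p0)

max_gap = max(
    abs(nb_posterior(x, prior1, p1, p0)
        - sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))))
    for x in product([0, 1], repeat=3)
)
print(max_gap)
```

The two models share a functional form but are usually fitted differently: Naive Bayes maximizes the joint likelihood, while logistic regression maximizes the conditional likelihood of the class given the features.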
Abstract:
In visual object detection and recognition, classifiers have two interesting characteristics: accuracy and speed. Accuracy depends on the complexity of the image features and classifier decision surfaces. Speed depends on the hardware and the computational effort required to use the features and decision surfaces. When attempts to increase accuracy lead to increases in complexity and effort, it is necessary to ask how much are we willing to pay for increased accuracy. For example, if increased computational effort implies quickly diminishing returns in accuracy, then those designing inexpensive surveillance applications cannot aim for maximum accuracy at any cost. It becomes necessary to find trade-offs between accuracy and effort. We study efficient classification of images depicting real-world objects and scenes. Classification is efficient when a classifier can be controlled so that the desired trade-off between accuracy and effort (speed) is achieved and unnecessary computations are avoided on a per input basis. A framework is proposed for understanding and modeling efficient classification of images. Classification is modeled as a tree-like process. In designing the framework, it is important to recognize what is essential and to avoid structures that are narrow in applicability. Earlier frameworks are lacking in this regard. The overall contribution is two-fold. First, the framework is presented, subjected to experiments, and shown to be satisfactory. Second, certain unconventional approaches are experimented with. This allows the separation of the essential from the conventional. To determine if the framework is satisfactory, three categories of questions are identified: trade-off optimization, classifier tree organization, and rules for delegation and confidence modeling. Questions and problems related to each category are addressed and empirical results are presented. 
For example, related to trade-off optimization, we address the problem of computational bottlenecks that limit the range of trade-offs. We also ask if accuracy versus effort trade-offs can be controlled after training. For another example, regarding classifier tree organization, we first consider the task of organizing a tree in a problem-specific manner. We then ask if problem-specific organization is necessary.
Abstract:
At a general level, the subject of the study was the spirituality of the Finnish Pentecostal movement. The sampling frame of the study consisted of people attending the meetings of the Saalem congregation in Helsinki. The data were collected with questionnaires in autumn 2004 at meetings of the Saalem congregation. A total of 230 completed questionnaires were returned. The respondents' ages ranged from 13 to 87 years, and 36% of them were men. 70% belonged to the Saalem congregation and 17% to some other Pentecostal congregation. Non-Pentecostals made up 13% of the respondents. To a limited extent, a comparison data set of 500 respondents from the Kallio district was also available. The respondents in this so-called Case Kallio data set were, as a rule, weakly committed to Christian doctrine and to devotional practice. 50% of these respondents were men, and their ages ranged from 18 to 39 years. The theoretical starting point for the study was the American Daniel Albrecht's empirical research on Pentecostal-charismatic spirituality. He defines Pentecostal-charismatic spirituality as consisting of three components: beliefs, practices, and so-called sensibilities. Sensibilities refer to an orientation toward action. Two measures were constructed for the questionnaire on the basis of Albrecht's categories. One measured the basic factors describing the whole field of Pentecostal spirituality, which included beliefs, practices, and sensibilities. The other focused on measuring only one part of the definition of spirituality, the sensibilities. In addition to the Pentecostal-focused perspective, the study made use of David Hay's view of spirituality. He defines spirituality as an awareness that transcends everyday reality. Hay's categories were used to map general human spirituality. The purpose of the study was to examine the forms in which spirituality manifests itself in the Saalem congregation and how it differs according to respondents' backgrounds.
In addition, the data collected from Saalem were compared with the comparison data (Case Kallio), and the relationship between the two views of spirituality, which arise from different starting points, was examined. The study was quantitative in nature. Statistical tests and factor analysis were used as research methods. Alongside factor analysis, so-called Bayesian modelling was used, which is not subject to the strict conditions of use imposed on parametric methods. For the Saalem congregation, the study yielded 11 dimensions of spirituality at different levels. The seven sensibility categories proposed by Albrecht were found in the data almost as such, whereas the measures used for the basic factors of Pentecostal spirituality and for general human spirituality did not work entirely as expected. The two data sets could be compared with respect to general human spirituality. General human spirituality was not an unfamiliar phenomenon to respondents alienated from Christian doctrine and devotional practice. It nevertheless received higher response scores among Pentecostals. Two dimensions emerged to describe this spirituality: communal altruism and the beauty of everyday life. From the data collected from the Saalem congregation alone, three further basic factors of Pentecostal spirituality emerged: the dimensions of word and mission, leader-centredness, and praise. Six sensibility dimensions arose from the same data: praise, general purification, ceremoniality, spiritual gifts, goal-orientedness, and spiritual purification and change. One of the praise dimensions described the meaning of praise, the other the manner of praise. The dimension describing word and mission was central in the data collected from the Saalem congregation. The goal-orientedness dimension received the highest mean response, and both dimensions reflecting general human spirituality also received high mean responses.
The dimensions of Pentecostal spirituality correlated positively with the dimensions of general human spirituality. The results could be generalized to the membership of the Saalem congregation in Helsinki and to Pentecostalism in the Helsinki metropolitan area. For the Finnish Pentecostal movement as a whole, the results could be regarded as indicative. Keywords: Pentecostal movement, spirituality, Saalem, quantitative research, multivariate methods, Bayesian modelling, Daniel Albrecht, David Hay
Abstract:
Measurement of fractional exhaled nitric oxide (FENO) has proven useful in assessment of patients with respiratory symptoms, especially in predicting steroid response. The objective of these studies was to clarify issues relevant for the clinical use of FENO. The influence of allergic sensitization per se on FENO in healthy asymptomatic subjects was studied, the association between airway inflammation and bronchial hyperresponsiveness (BHR) in steroid-naive subjects with symptoms suggesting asthma was examined, as well as the possible difference in this association between atopic and nonatopic subjects. Influence of smoking on FENO was compared between atopic and nonatopic steroid-naive asthmatics and healthy subjects. The short-term repeatability of FENO in COPD patients was examined in order to assess whether the degree of chronic obstruction influences the repeatability. For these purposes, we studied a random sample of 248 citizens of Helsinki, 227 army conscripts with current symptoms suggesting asthma, 19 COPD patients, and 39 healthy subjects. FENO measurement, spirometry and bronchodilatation test, a structured interview, skin prick tests, and histamine and exercise challenges were performed. Among healthy subjects with no signs of airway diseases, median FENO was similar in skin prick test-positive and -negative subjects, and the upper normal limit of FENO was 30 ppb. In atopic and nonatopic subjects with symptoms suggesting asthma, FENO was associated with severity of exercise- or histamine-induced BHR only in atopic patients. FENO in smokers with steroid-naive asthma was significantly higher than in healthy smokers and nonsmokers. Among atopic asthmatics, FENO was significantly lower in smokers than in nonsmokers, whereas no difference appeared among nonatopic asthmatics. The 24-h repeatability of FENO was as good in COPD patients as in healthy subjects. 
These findings indicate that allergic sensitization per se does not influence FENO, supporting the view that elevated FENO indicates NO-producing airway inflammation, and that the same reference range can be applied to both skin prick test-positive and -negative subjects. The significant correlation between FENO and degree of BHR only in atopic steroid-naive subjects with current asthmatic symptoms supports the view that pathogenesis of BHR in atopic asthma is strongly involved in NO-producing airway inflammation, whereas in development of BHR in nonatopic asthma other mechanisms may dominate. Attenuation of FENO only in atopic but not in nonatopic smokers with steroid-naive asthma may result from differences in mechanisms of FENO formation as well as in sensitivity of these mechanisms to smoking in atopic and nonatopic asthma. The results suggest, however, that in young adult smokers, FENO measurement may prove useful in assessment of airway inflammation. The short-term repeatability of FENO in COPD patients with moderate to very severe disease and in healthy subjects was equally good.
Abstract:
Exposure to water-damaged buildings and the associated health problems have evoked concern and created confusion during the past 20 years. Individuals exposed to moisture problem buildings report adverse health effects such as non-specific respiratory symptoms. Microbes, especially fungi, growing on the damp material have been considered as potential sources of the health problems encountered in these buildings. Fungi and their airborne fungal spores contain allergens and secondary metabolites which may trigger allergic as well as inflammatory types of responses in the eyes and airways. Although epidemiological studies have revealed an association between damp buildings and health problems, no direct cause-and-effect relationship has been established. Further knowledge is needed about the epidemiology and the mechanisms leading to the symptoms associated with exposure to fungi. Two different approaches have been used in this thesis in order to investigate the diverse health effects associated with exposure to moulds. In the first part, sensitization to moulds was evaluated and potential cross-reactivity studied in patients attending a hospital for suspected allergy. In the second part, one typical mould known to be found in water-damaged buildings and to produce toxic secondary metabolites was used to study the airway responses in an experimental model. Exposure studies were performed on both naive and allergen sensitized mice. The first part of the study showed that mould allergy is rare and highly dependent on the atopic status of the examined individual. The prevalence of sensitization was 2.7% to Cladosporium herbarum and 2.8% to Alternaria alternata in patients, the majority of whom were atopic subjects. Some of the patients sensitized to mould suffered from atopic eczema. Frequently the patients were observed to possess specific serum IgE antibodies to a yeast present in the normal skin flora, Pityrosporum ovale. 
In some of these patients, the IgE binding was partly found to be due to binding to shared glycoproteins in the mould and yeast allergen extracts. The second part of the study revealed that exposure to Stachybotrys chartarum spores induced an airway inflammation in the lungs of mice. The inflammation was characterized by an influx of inflammatory cells, mainly neutrophils and lymphocytes, into the lungs but with almost no differences in airway responses seen between the satratoxin producing and non-satratoxin producing strain. On the other hand, when mice were exposed to S. chartarum and sensitized/challenged with ovalbumin the extent of the inflammation was markedly enhanced. A synergistic increase in the numbers of inflammatory cells was seen in BAL and severe inflammation was observed in the histological lung sections. In conclusion, the results in this thesis imply that exposure to moulds in water damaged buildings may trigger health effects in susceptible individuals. The symptoms can rarely be explained by IgE mediated allergy to moulds. Other non-allergic mechanisms seem to be involved. Stachybotrys chartarum is one of the moulds potentially responsible for health problems. In this thesis, new reaction models for the airway inflammation induced by S. chartarum have been found using experimental approaches. The immunological status played an important role in the airway inflammation, enhancing the effects of mould exposure. The results imply that sensitized individuals may be more susceptible to exposure to moulds than non-sensitized individuals.
Abstract:
The purpose of this research is to draw up a clear construction of an anticipatory communicative decision-making process and a successful implementation of a Bayesian application that can be used as an anticipatory communicative decision-making support system. This study is a decision-oriented and constructive research project, and it includes examples of simulated situations. As a basis for further methodological discussion about different approaches to management research, in this research, a decision-oriented approach is used, which is based on mathematics and logic, and it is intended to develop problem solving methods. The approach is theoretical and characteristic of normative management science research. Also, the approach of this study is constructive. An essential part of the constructive approach is to tie the problem to its solution with theoretical knowledge. Firstly, the basic definitions and behaviours of an anticipatory management and managerial communication are provided. These descriptions include discussions of the research environment and formed management processes. These issues define and explain the background to further research. Secondly, it is processed to managerial communication and anticipatory decision-making based on preparation, problem solution, and solution search, which are also related to risk management analysis. After that, a solution to the decision-making support application is formed, using four different Bayesian methods, as follows: the Bayesian network, the influence diagram, the qualitative probabilistic network, and the time critical dynamic network. The purpose of the discussion is not to discuss different theories but to explain the theories which are being implemented. Finally, an application of Bayesian networks to the research problem is presented. The usefulness of the prepared model in examining a problem and the represented results of research is shown. 
The theoretical contribution includes definitions and a model of anticipatory decision-making. The main theoretical contribution of this study has been to develop a process for anticipatory decision-making that includes management with communication, problem-solving, and the improvement of knowledge. The practical contribution includes a Bayesian Decision Support Model, which is based on Bayesian influence diagrams. The main contributions of this research are two developed processes, one for anticipatory decision-making, and the other to produce a model of a Bayesian network for anticipatory decision-making. In summary, this research contributes to decision-making support by being one of the few publicly available academic descriptions of the anticipatory decision support system, by representing a Bayesian model that is grounded on firm theoretical discussion, by publishing algorithms suitable for decision-making support, and by defining the idea of anticipatory decision-making for a parallel version. Finally, according to the results of research, an analysis of anticipatory management for planned decision-making is presented, which is based on observation of environment, analysis of weak signals, and alternatives to creative problem solving and communication.
Abstract:
The relationship between site characteristics and understorey vegetation composition was analysed with quantitative methods, especially from the viewpoint of site quality estimation. Theoretical models were applied to an empirical data set collected from the upland forests of southern Finland comprising 104 sites dominated by Scots pine (Pinus sylvestris L.), and 165 sites dominated by Norway spruce (Picea abies (L.) Karsten). Site index H100 was used as an independent measure of site quality. A new model for the estimation of site quality at sites with a known understorey vegetation composition was introduced. It is based on the application of Bayes' theorem to the density function of site quality within the study area combined with the species-specific presence-absence response curves. The resulting posterior probability density function may be used for calculating an estimate for the site variable. Using this method, a jackknife estimate of site index H100 was calculated separately for pine- and spruce-dominated sites. The results indicated that the cross-validation root mean squared error (RMSEcv) of the estimates improved from 2.98 m down to 2.34 m relative to the "null" model (standard deviation of the sample distribution) in pine-dominated forests. In spruce-dominated forests RMSEcv decreased from 3.94 m down to 3.16 m. In order to assess these results, four other estimation methods based on understorey vegetation composition were applied to the same data set. The results showed that none of the methods was clearly superior to the others. In pine-dominated forests, RMSEcv varied between 2.34 and 2.47 m, and the corresponding range for spruce-dominated forests was from 3.13 to 3.57 m.
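The estimation idea described above, a prior density for site quality combined with species-specific presence-absence response curves through Bayes' theorem, can be sketched on a discrete grid as follows. The logistic response curves, the flat prior, the grid, and the assumption that species occur independently given the site index are all illustrative choices made here, not details taken from the thesis.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def posterior_site_index(grid, prior, response_curves, presence):
    """Return (posterior weights on grid, posterior mean) of the site index.

    response_curves[j](s) is P(species j present | site index s);
    presence[j] is 1 if species j was observed at the site, else 0.
    """
    weights = []
    for s, p in zip(grid, prior):
        likelihood = 1.0
        for curve, observed in zip(response_curves, presence):
            q = curve(s)
            likelihood *= q if observed else 1.0 - q
        weights.append(p * likelihood)
    total = sum(weights)
    posterior = [w / total for w in weights]
    mean = sum(s * w for s, w in zip(grid, posterior))
    return posterior, mean


# Hypothetical setup: site index H100 between 15 m and 35 m, flat prior.
grid = [float(s) for s in range(15, 36)]
prior = [1.0] * len(grid)
curves = [
    lambda s: logistic(0.5 * (s - 25.0)),   # species favouring fertile sites
    lambda s: logistic(-0.5 * (s - 25.0)),  # species favouring poor sites
]

posterior, estimate = posterior_site_index(grid, prior, curves, [1, 0])
print(round(estimate, 2))
```

The posterior mean is one possible point estimate; the posterior mode or median could be used instead, and the abstract does not say which one the thesis adopts.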
Abstract:
Bayesian networks are compact, flexible, and interpretable representations of a joint distribution. When the network structure is unknown but there are observational data at hand, one can try to learn the network structure. This is called structure discovery. This thesis contributes to two areas of structure discovery in Bayesian networks: space-time tradeoffs and learning ancestor relations. The fastest exact algorithms for structure discovery in Bayesian networks are based on dynamic programming and use excessive amounts of space. Motivated by the space usage, several schemes for trading space against time are presented. These schemes are presented in a general setting for a class of computational problems called permutation problems; structure discovery in Bayesian networks is seen as a challenging variant of the permutation problems. The main contribution in the area of the space-time tradeoffs is the partial order approach, in which the standard dynamic programming algorithm is extended to run over partial orders. In particular, a certain family of partial orders called parallel bucket orders is considered. A partial order scheme that provably yields an optimal space-time tradeoff within parallel bucket orders is presented. Practical issues concerning parallel bucket orders are also discussed. Learning ancestor relations, that is, directed paths between nodes, is motivated by the need for robust summaries of the network structures when there are unobserved nodes at work. Ancestor relations are nonmodular features and hence learning them is more difficult than learning modular features. A dynamic programming algorithm is presented for computing posterior probabilities of ancestor relations exactly. Empirical tests suggest that ancestor relations can be learned from observational data almost as accurately as arcs even in the presence of unobserved nodes.
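To show why the standard dynamic programming approach is space-hungry, here is a minimal sketch of exact structure discovery over node subsets for a decomposable score. It is a textbook-style formulation written for illustration, not the partial order scheme of the thesis; the scoring function `toy_score` is hypothetical and stands in for a real local score computed from data.

```python
def best_network_score(n, local_score):
    """Maximum total score of a DAG on n nodes, where local_score(v, parent_mask)
    scores node v with the parent set encoded as a bitmask."""
    full = (1 << n) - 1

    # best_local[v][C]: best score of v when its parents are chosen from subset C.
    best_local = []
    for v in range(n):
        table = {}
        for C in range(full + 1):
            if C & (1 << v):
                continue  # a node cannot be its own parent candidate
            best = local_score(v, C)
            rest = C
            while rest:
                bit = rest & -rest
                rest ^= bit
                best = max(best, table[C ^ bit])  # smaller subsets are already filled
            table[C] = best
        best_local.append(table)

    # best[S]: best score of a network on node subset S; the last node of S in
    # some topological order (a sink) picks its parents from the rest of S.
    best = [0.0] * (full + 1)
    for S in range(1, full + 1):
        candidates = []
        rest = S
        while rest:
            bit = rest & -rest
            rest ^= bit
            v = bit.bit_length() - 1
            candidates.append(best[S ^ bit] + best_local[v][S ^ bit])
        best[S] = max(candidates)
    return best[full]


def toy_score(v, parent_mask):
    """Hypothetical score: node 1 benefits from parent 0; every arc costs 0.1."""
    bonus = 1.0 if (v == 1 and parent_mask == 0b001) else 0.0
    return bonus - 0.1 * bin(parent_mask).count("1")


print(best_network_score(3, toy_score))
```

Every table in this sketch has on the order of 2^n entries, which is the memory bottleneck that motivates trading space against time.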