35 resultados para Statistical parameters
em Helda - Digital Repository of University of Helsinki
Resumo:
In genetic epidemiology, population-based disease registries are commonly used to collect genotype or other risk factor information concerning affected subjects and their relatives. This work presents two new approaches for the statistical inference of ascertained data: a conditional and full likelihood approaches for the disease with variable age at onset phenotype using familial data obtained from population-based registry of incident cases. The aim is to obtain statistically reliable estimates of the general population parameters. The statistical analysis of familial data with variable age at onset becomes more complicated when some of the study subjects are non-susceptible, that is to say these subjects never get the disease. A statistical model for a variable age at onset with long-term survivors is proposed for studies of familial aggregation, using latent variable approach, as well as for prospective studies of genetic association studies with candidate genes. In addition, we explore the possibility of a genetic explanation of the observed increase in the incidence of Type 1 diabetes (T1D) in Finland in recent decades and the hypothesis of non-Mendelian transmission of T1D associated genes. Both classical and Bayesian statistical inference were used in the modelling and estimation. Despite the fact that this work contains five studies with different statistical models, they all concern data obtained from nationwide registries of T1D and genetics of T1D. In the analyses of T1D data, non-Mendelian transmission of T1D susceptibility alleles was not observed. In addition, non-Mendelian transmission of T1D susceptibility genes did not make a plausible explanation for the increase in T1D incidence in Finland. Instead, the Human Leucocyte Antigen associations with T1D were confirmed in the population-based analysis, which combines T1D registry information, reference sample of healthy subjects and birth cohort information of the Finnish population. Finally, a substantial familial variation in the susceptibility of T1D nephropathy was observed. The presented studies show the benefits of sophisticated statistical modelling to explore risk factors for complex diseases.
Resumo:
Tämän tutkimuksen tavoitteena oli selvittää tilalla määritetyn hyvinvoinnin yhteyttä emakoiden tuotantotuloksiin. Hyvinvointia arvioitiin suomalaisen hyvinvointi-indeksin, A-indeksi, avulla. Tuotantotuloksina käytettiin kahta erilaista tuotosaineistoa, jotka molemmat pohjautuivat kansalliseen tuotosseuranta aineistoon. Hyvinvointimääritykset tehtiin 30 porsastuotantosikalassa maaliskuun 2007 aikana. A-indeksi koostuu kuudesta kategoriasta ’liikkumismahdollisuudet’, ’alustan ominaisuudet’, ’sosiaaliset kontaktit’, ’valo, ilma ja melu’, ’ruokinta ja veden saanti’ sekä ’eläinten terveys ja hoidon taso’. Jokaisessa kategoriassa on 3-10 pääosin ympäristöperäistä muuttujaa, jotka vaihtelevat osastoittain. Maksimipistemäärä osastolle on 100. Hyvinvointimittaukset tehtiin porsitus-, tiineytys- ja joutilasosastoilla. Erillisten tiineytysosastojen pienen lukumäärän takia (n=7) tilakohtaiset tiineytys- ja joutilasosastopisteet yhdistettiin ja keskiarvoja käytettiin analyyseissä. Yhteyksiä tuotokseen tutkittiin kahden eri aineiston avulla 1) Tilaraportti aineisto (n=29) muodostuu muokkaamattomista tila- ja tuotostuloksista tilavierailua edeltävän vuoden ajalta, 2) POTSIaineisto (n=30) muodostuu POTSI-ohjelmalla (MTT) muokatusta tuotantoaineistosta, joka sisältää managementtiryhmän (tila, vuosi, vuodenaika) vaikutuksen ensikoiden ja emakoiden pahnuekohtaiseen tuotokseen. Yhteyksiä analysointiin korrelaatio- ja regressioanalyysien avulla. Vaikka osallistuminen tutkimukseen oli vapaaehtoista, molempien tuotantoaineistojen perusteella tutkimustilat edustavat keskituottoista suomalaista sikatilaa. A-indeksin kokonaispisteet vaihtelivat välillä 37,5–64,0 porsitusosastolla ja 39,5–83,5 joutilasosastolla. Tilaraporttiaineistoa käytettäessä paremmat pisteet porsitusosaston ’eläinten terveys ja hoidon taso’ -kategoriasta lyhensivät eläinten lisääntymissykliä, lisäsivät syntyvien pahnueiden ja porsaiden määrää sekä alensivat kuolleena syntyneiden lukumäärää. Regressiomallin mukaan ’eläinten terveys ja hoidon taso’ -kategoria selitti syntyvien porsaiden lukumäärän, porsimisvälin pituuden sekä keskiporsimiskerran vaihtelua. Paremmat pisteet joutilasosaston ’liikkumismahdollisuudet’ kategoriasta alensivat syntyneiden pahnueiden sekä syntyneiden että vieroitettujen porsaiden lukumäärää. Regressiomallin mukaan ensikkopahnueiden osuus ja ”liikkumismahdollisuudet” kategorian pisteet selittivät vieroitettujen porsaiden lukumäärän vaihtelua. POTSI-aineiston yhteydessä kuolleena syntyneiden porsaiden lukumäärän aleneminen oli ensikoilla yhteydessä parempiin porsitusosaston ’sosiaalisiin kontakteihin’ ja emakoilla puolestaan joutilasosaston parempiin ’eläinten terveys ja hoidon taso’ pisteisiin. Kahden eri tuotantoaineiston avulla saadut tulokset erosivat toisistaan. Seuraavissa tutkimuksissa onkin suositeltavampaa käyttää Tilaraporttiaineistoja, joissa tuotokset ilmoitetaan vuosikohtaisina. Tämän tutkimuksen perusteella hyvinvoinnilla ja tuotoksella on yhteyksiä, joilla on myös merkittävää taloudellista vaikutusta. Erityisesti hyvä eläinten hoito ja eläinten terveys lisäävät tuotettujen porsaiden määrää ja lyhentävät lisääntymiskiertoa. Erityishuomiota tulee kiinnittää vapaana olevien joutilaiden emakoiden sosiaaliseen stressiin ja rehunsaannin varmistamiseen kaikille yksilöille.
Resumo:
Modern sample surveys started to spread after statistician at the U.S. Bureau of the Census in the 1940s had developed a sampling design for the Current Population Survey (CPS). A significant factor was also that digital computers became available for statisticians. In the beginning of 1950s, the theory was documented in textbooks on survey sampling. This thesis is about the development of the statistical inference for sample surveys. For the first time the idea of statistical inference was enunciated by a French scientist, P. S. Laplace. In 1781, he published a plan for a partial investigation in which he determined the sample size needed to reach the desired accuracy in estimation. The plan was based on Laplace s Principle of Inverse Probability and on his derivation of the Central Limit Theorem. They were published in a memoir in 1774 which is one of the origins of statistical inference. Laplace s inference model was based on Bernoulli trials and binominal probabilities. He assumed that populations were changing constantly. It was depicted by assuming a priori distributions for parameters. Laplace s inference model dominated statistical thinking for a century. Sample selection in Laplace s investigations was purposive. In 1894 in the International Statistical Institute meeting, Norwegian Anders Kiaer presented the idea of the Representative Method to draw samples. Its idea was that the sample would be a miniature of the population. It is still prevailing. The virtues of random sampling were known but practical problems of sample selection and data collection hindered its use. Arhtur Bowley realized the potentials of Kiaer s method and in the beginning of the 20th century carried out several surveys in the UK. He also developed the theory of statistical inference for finite populations. It was based on Laplace s inference model. R. A. Fisher contributions in the 1920 s constitute a watershed in the statistical science He revolutionized the theory of statistics. In addition, he introduced a new statistical inference model which is still the prevailing paradigm. The essential idea is to draw repeatedly samples from the same population and the assumption that population parameters are constants. Fisher s theory did not include a priori probabilities. Jerzy Neyman adopted Fisher s inference model and applied it to finite populations with the difference that Neyman s inference model does not include any assumptions of the distributions of the study variables. Applying Fisher s fiducial argument he developed the theory for confidence intervals. Neyman s last contribution to survey sampling presented a theory for double sampling. This gave the central idea for statisticians at the U.S. Census Bureau to develop the complex survey design for the CPS. Important criterion was to have a method in which the costs of data collection were acceptable, and which provided approximately equal interviewer workloads, besides sufficient accuracy in estimation.
Resumo:
This study reports a diachronic corpus investigation of common-number pronouns used to convey unknown or otherwise unspecified reference. The study charts agreement patterns in these pronouns in various diachronic and synchronic corpora. The objective is to provide base-line data on variant frequencies and distributions in the history of English, as there are no previous systematic corpus-based observations on this topic. This study seeks to answer the questions of how pronoun use is linked with the overall typological development in English and how their diachronic evolution is embedded in the linguistic and social structures in which they are used. The theoretical framework draws on corpus linguistics and historical sociolinguistics, grammaticalisation, diachronic typology, and multivariate analysis of modelling sociolinguistic variation. The method employs quantitative corpus analyses from two main electronic corpora, one from Modern English and the other from Present-day English. The Modern English material is the Corpus of Early English Correspondence, and the time frame covered is 1500-1800. The written component of the British National Corpus is used in the Present-day English investigations. In addition, the study draws supplementary data from other electronic corpora. The material is used to compare the frequencies and distributions of common-number pronouns between these two time periods. The study limits the common-number uses to two subsystems, one anaphoric to grammatically singular antecedents and one cataphoric, in which the pronoun is followed by a relative clause. Various statistical tools are used to process the data, ranging from cross-tabulations to multivariate VARBRUL analyses in which the effects of sociolinguistic and systemic parameters are assessed to model their impact on the dependent variable. This study shows how one pronoun type has extended its uses in both subsystems, an increase linked with grammaticalisation and the changes in other pronouns in English through the centuries. The variationist sociolinguistic analysis charts how grammaticalisation in the subsystems is embedded in the linguistic and social structures in which the pronouns are used. The study suggests a scale of two statistical generalisations of various sociolinguistic factors which contribute to grammaticalisation and its embedding at various stages of the process.
Resumo:
Screening of wastewater effluents from municipal and industrial wastewater treatment plants with biotests showed that the treated wastewater effluents possess only minor acute toxic properties towards whole organisms (e.g. bacteria, algae, daphnia), if any. In vitro tests (sub-mitochondrial membranes and fish hepatocytes) were generally more susceptible to the effluents. Most of the effluents indicated the presence of hormonally active compounds, as the production of vitellogenin, an egg yolk precursor protein, was induced in fish hepatocytes exposed to wastewater. In addition, indications of slight genotoxic potential was found in one effluent concentrate with a recombinant bacteria test. Reverse electron transport (RET) of mitochondrial membranes was used as a model test to conduct effluent assessment followed by toxicant characterisations and identifications. Using a modified U.S. EPA Toxicity Identification Evaluation Phase I scheme and additional case-specific methods, the main compound in a pulp and paper mill effluent causing RET inhibition was characterised to be an organic, relatively hydrophilic high molecular weight (HMW) compound. The toxicant could be verified as HMW lignin by structural analyses using nuclear magnetic resonance. In the confirmation step commercial and in-house extracted lignin products were used. The possible toxicity related structures were characterised by statistical analysis of the chemical breakdown structures of laboratory-scale pulping and bleaching effluents and the toxicities of these effluents. Finally, the biological degradation of the identified toxicant and other wastewater constituents was evaluated using bioassays in combination with chemical analyses. Biological methods have not been used routinely in establishing effluent discharge limits in Finland. However, the biological effects observed in this study could not have been predicted using only routine physical and chemical effluent monitoring parameters. Therefore chemical parameters cannot be considered to be sufficient in controlling effluent discharges especially in case of unknown, possibly bioaccumulative, compounds that may be present in small concentrations and may cause chronic effects.
Resumo:
Pressurised hot water extraction (PHWE) exploits the unique temperature-dependent solvent properties of water minimising the use of harmful organic solvents. Water is environmentally friendly, cheap and easily available extraction medium. The effects of temperature, pressure and extraction time in PHWE have often been studied, but here the emphasis was on other parameters important for the extraction, most notably the dimensions of the extraction vessel and the stability and solubility of the analytes to be extracted. Non-linear data analysis and self-organising maps were employed in the data analysis to obtain correlations between the parameters studied, recoveries and relative errors. First, pressurised hot water extraction (PHWE) was combined on-line with liquid chromatography-gas chromatography (LC-GC), and the system was applied to the extraction and analysis of polycyclic aromatic hydrocarbons (PAHs) in sediment. The method is of superior sensitivity compared with the traditional methods, and only a small 10 mg sample was required for analysis. The commercial extraction vessels were replaced by laboratory-made stainless steel vessels because of some problems that arose. The performance of the laboratory-made vessels was comparable to that of the commercial ones. In an investigation of the effect of thermal desorption in PHWE, it was found that at lower temperatures (200ºC and 250ºC) the effect of thermal desorption is smaller than the effect of the solvating property of hot water. At 300ºC, however, thermal desorption is the main mechanism. The effect of the geometry of the extraction vessel on recoveries was studied with five specially constructed extraction vessels. In addition to the extraction vessel geometry, the sediment packing style and the direction of water flow through the vessel were investigated. The geometry of the vessel was found to have only minor effect on the recoveries, and the same was true of the sediment packing style and the direction of water flow through the vessel. These are good results because these parameters do not have to be carefully optimised before the start of extractions. Liquid-liquid extraction (LLE) and solid-phase extraction (SPE) were compared as trapping techniques for PHWE. LLE was more robust than SPE and it provided better recoveries and repeatabilities than did SPE. Problems related to blocking of the Tenax trap and unrepeatable trapping of the analytes were encountered in SPE. Thus, although LLE is more labour intensive, it can be recommended over SPE. The stabilities of the PAHs in aqueous solutions were measured using a batch-type reaction vessel. Degradation was observed at 300ºC even with the shortest heating time. Ketones and quinones and other oxidation products were observed. Although the conditions of the stability studies differed considerably from the extraction conditions in PHWE, the results indicate that the risk of analyte degradation must be taken into account in PHWE. The aqueous solubilities of acenaphthene, anthracene and pyrene were measured, first below and then above the melting point of the analytes. Measurements below the melting point were made to check that the equipment was working, and the results were compared with those obtained earlier. Good agreement was found between the measured and literature values. A new saturation cell was constructed for the solubility measurements above the melting point of the analytes because the flow-through saturation cell could not be used above the melting point. An exponential relationship was found between the solubilities measured for pyrene and anthracene and temperature.
Resumo:
The problem of recovering information from measurement data has already been studied for a long time. In the beginning, the methods were mostly empirical, but already towards the end of the sixties Backus and Gilbert started the development of mathematical methods for the interpretation of geophysical data. The problem of recovering information about a physical phenomenon from measurement data is an inverse problem. Throughout this work, the statistical inversion method is used to obtain a solution. Assuming that the measurement vector is a realization of fractional Brownian motion, the goal is to retrieve the amplitude and the Hurst parameter. We prove that under some conditions, the solution of the discretized problem coincides with the solution of the corresponding continuous problem as the number of observations tends to infinity. The measurement data is usually noisy, and we assume the data to be the sum of two vectors: the trend and the noise. Both vectors are supposed to be realizations of fractional Brownian motions, and the goal is to retrieve their parameters using the statistical inversion method. We prove a partial uniqueness of the solution. Moreover, with the support of numerical simulations, we show that in certain cases the solution is reliable and the reconstruction of the trend vector is quite accurate.
Resumo:
Advancements in the analysis techniques have led to a rapid accumulation of biological data in databases. Such data often are in the form of sequences of observations, examples including DNA sequences and amino acid sequences of proteins. The scale and quality of the data give promises of answering various biologically relevant questions in more detail than what has been possible before. For example, one may wish to identify areas in an amino acid sequence, which are important for the function of the corresponding protein, or investigate how characteristics on the level of DNA sequence affect the adaptation of a bacterial species to its environment. Many of the interesting questions are intimately associated with the understanding of the evolutionary relationships among the items under consideration. The aim of this work is to develop novel statistical models and computational techniques to meet with the challenge of deriving meaning from the increasing amounts of data. Our main concern is on modeling the evolutionary relationships based on the observed molecular data. We operate within a Bayesian statistical framework, which allows a probabilistic quantification of the uncertainties related to a particular solution. As the basis of our modeling approach we utilize a partition model, which is used to describe the structure of data by appropriately dividing the data items into clusters of related items. Generalizations and modifications of the partition model are developed and applied to various problems. Large-scale data sets provide also a computational challenge. The models used to describe the data must be realistic enough to capture the essential features of the current modeling task but, at the same time, simple enough to make it possible to carry out the inference in practice. The partition model fulfills these two requirements. The problem-specific features can be taken into account by modifying the prior probability distributions of the model parameters. The computational efficiency stems from the ability to integrate out the parameters of the partition model analytically, which enables the use of efficient stochastic search algorithms.
Resumo:
Microarrays are high throughput biological assays that allow the screening of thousands of genes for their expression. The main idea behind microarrays is to compute for each gene a unique signal that is directly proportional to the quantity of mRNA that was hybridized on the chip. A large number of steps and errors associated with each step make the generated expression signal noisy. As a result, microarray data need to be carefully pre-processed before their analysis can be assumed to lead to reliable and biologically relevant conclusions. This thesis focuses on developing methods for improving gene signal and further utilizing this improved signal for higher level analysis. To achieve this, first, approaches for designing microarray experiments using various optimality criteria, considering both biological and technical replicates, are described. A carefully designed experiment leads to signal with low noise, as the effect of unwanted variations is minimized and the precision of the estimates of the parameters of interest are maximized. Second, a system for improving the gene signal by using three scans at varying scanner sensitivities is developed. A novel Bayesian latent intensity model is then applied on these three sets of expression values, corresponding to the three scans, to estimate the suitably calibrated true signal of genes. Third, a novel image segmentation approach that segregates the fluorescent signal from the undesired noise is developed using an additional dye, SYBR green RNA II. This technique helped in identifying signal only with respect to the hybridized DNA, and signal corresponding to dust, scratch, spilling of dye, and other noises, are avoided. Fourth, an integrated statistical model is developed, where signal correction, systematic array effects, dye effects, and differential expression, are modelled jointly as opposed to a sequential application of several methods of analysis. The methods described in here have been tested only for cDNA microarrays, but can also, with some modifications, be applied to other high-throughput technologies. Keywords: High-throughput technology, microarray, cDNA, multiple scans, Bayesian hierarchical models, image analysis, experimental design, MCMC, WinBUGS.
Resumo:
Bacteria play an important role in many ecological systems. The molecular characterization of bacteria using either cultivation-dependent or cultivation-independent methods reveals the large scale of bacterial diversity in natural communities, and the vastness of subpopulations within a species or genus. Understanding how bacterial diversity varies across different environments and also within populations should provide insights into many important questions of bacterial evolution and population dynamics. This thesis presents novel statistical methods for analyzing bacterial diversity using widely employed molecular fingerprinting techniques. The first objective of this thesis was to develop Bayesian clustering models to identify bacterial population structures. Bacterial isolates were identified using multilous sequence typing (MLST), and Bayesian clustering models were used to explore the evolutionary relationships among isolates. Our method involves the inference of genetic population structures via an unsupervised clustering framework where the dependence between loci is represented using graphical models. The population dynamics that generate such a population stratification were investigated using a stochastic model, in which homologous recombination between subpopulations can be quantified within a gene flow network. The second part of the thesis focuses on cluster analysis of community compositional data produced by two different cultivation-independent analyses: terminal restriction fragment length polymorphism (T-RFLP) analysis, and fatty acid methyl ester (FAME) analysis. The cluster analysis aims to group bacterial communities that are similar in composition, which is an important step for understanding the overall influences of environmental and ecological perturbations on bacterial diversity. A common feature of T-RFLP and FAME data is zero-inflation, which indicates that the observation of a zero value is much more frequent than would be expected, for example, from a Poisson distribution in the discrete case, or a Gaussian distribution in the continuous case. We provided two strategies for modeling zero-inflation in the clustering framework, which were validated by both synthetic and empirical complex data sets. We show in the thesis that our model that takes into account dependencies between loci in MLST data can produce better clustering results than those methods which assume independent loci. Furthermore, computer algorithms that are efficient in analyzing large scale data were adopted for meeting the increasing computational need. Our method that detects homologous recombination in subpopulations may provide a theoretical criterion for defining bacterial species. The clustering of bacterial community data include T-RFLP and FAME provides an initial effort for discovering the evolutionary dynamics that structure and maintain bacterial diversity in the natural environment.
Resumo:
In this Thesis, we develop theory and methods for computational data analysis. The problems in data analysis are approached from three perspectives: statistical learning theory, the Bayesian framework, and the information-theoretic minimum description length (MDL) principle. Contributions in statistical learning theory address the possibility of generalization to unseen cases, and regression analysis with partially observed data with an application to mobile device positioning. In the second part of the Thesis, we discuss so called Bayesian network classifiers, and show that they are closely related to logistic regression models. In the final part, we apply the MDL principle to tracing the history of old manuscripts, and to noise reduction in digital signals.
Resumo:
What can the statistical structure of natural images teach us about the human brain? Even though the visual cortex is one of the most studied parts of the brain, surprisingly little is known about how exactly images are processed to leave us with a coherent percept of the world around us, so we can recognize a friend or drive on a crowded street without any effort. By constructing probabilistic models of natural images, the goal of this thesis is to understand the structure of the stimulus that is the raison d etre for the visual system. Following the hypothesis that the optimal processing has to be matched to the structure of that stimulus, we attempt to derive computational principles, features that the visual system should compute, and properties that cells in the visual system should have. Starting from machine learning techniques such as principal component analysis and independent component analysis we construct a variety of sta- tistical models to discover structure in natural images that can be linked to receptive field properties of neurons in primary visual cortex such as simple and complex cells. We show that by representing images with phase invariant, complex cell-like units, a better statistical description of the vi- sual environment is obtained than with linear simple cell units, and that complex cell pooling can be learned by estimating both layers of a two-layer model of natural images. We investigate how a simplified model of the processing in the retina, where adaptation and contrast normalization take place, is connected to the nat- ural stimulus statistics. Analyzing the effect that retinal gain control has on later cortical processing, we propose a novel method to perform gain control in a data-driven way. Finally we show how models like those pre- sented here can be extended to capture whole visual scenes rather than just small image patches. By using a Markov random field approach we can model images of arbitrary size, while still being able to estimate the model parameters from the data.