953 results for Statistical inference


Relevance:

20.00% 20.00%

Publisher:

Abstract:

Advances in analysis techniques have led to a rapid accumulation of biological data in databases. Such data are often in the form of sequences of observations, for example DNA sequences and amino acid sequences of proteins. The scale and quality of the data promise to answer various biologically relevant questions in more detail than has been possible before. For example, one may wish to identify regions in an amino acid sequence that are important for the function of the corresponding protein, or investigate how characteristics at the level of the DNA sequence affect the adaptation of a bacterial species to its environment. Many of the interesting questions are intimately associated with understanding the evolutionary relationships among the items under consideration. The aim of this work is to develop novel statistical models and computational techniques to meet the challenge of deriving meaning from the increasing amounts of data. Our main focus is on modeling the evolutionary relationships based on the observed molecular data. We operate within a Bayesian statistical framework, which allows a probabilistic quantification of the uncertainties related to a particular solution. As the basis of our modeling approach we use a partition model, which describes the structure of the data by appropriately dividing the data items into clusters of related items. Generalizations and modifications of the partition model are developed and applied to various problems. Large-scale data sets also pose a computational challenge. The models used to describe the data must be realistic enough to capture the essential features of the current modeling task but, at the same time, simple enough to make it possible to carry out the inference in practice. The partition model fulfills these two requirements. The problem-specific features can be taken into account by modifying the prior probability distributions of the model parameters.
The computational efficiency stems from the ability to integrate out the parameters of the partition model analytically, which enables the use of efficient stochastic search algorithms.
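
As a rough illustration of what "integrating out the parameters analytically" can look like, the following sketch scores a partition of categorical sequences with a Dirichlet-multinomial marginal likelihood. The model details are assumptions made for this example only (independent sites, a symmetric Dirichlet prior with hypothetical parameter `alpha`, the function name `log_marginal`); the abstract does not specify the thesis's actual likelihood or priors.

```python
from math import lgamma
from collections import Counter

def log_marginal(items, partition, alphabet="ACGT", alpha=1.0):
    """Log marginal likelihood of a partition of equal-length sequences.

    Assumption for this sketch: each cluster has, at every site, its own
    category probabilities with a symmetric Dirichlet(alpha) prior. Those
    probabilities are integrated out in closed form, so the score depends
    only on the per-cluster, per-site symbol counts.
    """
    k = len(alphabet)
    total = 0.0
    for cluster in partition:
        for site in range(len(items[0])):
            counts = Counter(items[i][site] for i in cluster)
            n = sum(counts.values())
            # Dirichlet-multinomial normalising terms
            total += lgamma(k * alpha) - lgamma(k * alpha + n)
            total += sum(lgamma(alpha + counts[a]) - lgamma(alpha)
                         for a in alphabet)
    return total

# Hypothetical toy data: two obvious groups of sequences.
items = ["AAAA", "AAAT", "CCGG", "CCGG"]
good = [[0, 1], [2, 3]]   # groups similar sequences together
bad = [[0, 2], [1, 3]]    # mixes the two groups
print(log_marginal(items, good), log_marginal(items, bad))
```

Because the score is available in closed form for any partition, a stochastic search only needs to propose moves (for example, reassigning one item to another cluster) and compare scores, without sampling the cluster-level parameters.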

Relevance:

20.00% 20.00%

Publisher:

Abstract:

The focus of this study is on the statistical analysis of categorical responses whose values depend on each other. The most typical example of this kind of dependence arises when repeated responses have been obtained from the same study unit. For example, in Paper I, the response of interest is pneumococcal nasopharyngeal carriage (yes/no) in 329 children. For each child, carriage is measured nine times during the first 18 months of life, and thus repeated responses on each child cannot be assumed independent of each other. In the case of the above example, the interest typically lies in the carriage prevalence, and whether different risk factors affect the prevalence. Regression analysis is the established method for studying the effects of risk factors. In order to make correct inferences from the regression model, the associations between repeated responses need to be taken into account. The analysis of repeated categorical responses typically focuses on regression modelling. However, further insights can also be gained by investigating the structure of the association. The central theme of this study is the development of joint regression and association models. The analysis of repeated, or otherwise clustered, categorical responses is computationally difficult. Likelihood-based inference is often feasible only when the number of repeated responses for each study unit is small. In Paper IV, an algorithm is presented that substantially facilitates maximum likelihood fitting, especially when the number of repeated responses increases. In addition, a notable result arising from this work is the freely available software for likelihood-based estimation of clustered categorical responses.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

In this Thesis, we develop theory and methods for computational data analysis. The problems in data analysis are approached from three perspectives: statistical learning theory, the Bayesian framework, and the information-theoretic minimum description length (MDL) principle. Contributions in statistical learning theory address the possibility of generalization to unseen cases, and regression analysis with partially observed data, with an application to mobile device positioning. In the second part of the Thesis, we discuss so-called Bayesian network classifiers, and show that they are closely related to logistic regression models. In the final part, we apply the MDL principle to tracing the history of old manuscripts, and to noise reduction in digital signals.
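
The link between Bayesian network classifiers and logistic regression can be illustrated with the simplest such classifier, naive Bayes on binary features. This is a textbook special case chosen for this sketch, not necessarily the construction used in the Thesis, and all function names and parameter values below are hypothetical. The posterior computed by Bayes' rule coincides with a logistic function of a linear score.

```python
from math import exp, log

def nb_posterior(x, prior1, p1, p0):
    """P(y=1 | x) for binary naive Bayes, computed directly via Bayes' rule.

    p1[i] = P(x_i = 1 | y = 1), p0[i] = P(x_i = 1 | y = 0).
    """
    j1, j0 = prior1, 1.0 - prior1
    for xi, a, b in zip(x, p1, p0):
        j1 *= a if xi else 1.0 - a
        j0 *= b if xi else 1.0 - b
    return j1 / (j1 + j0)

def logistic_from_nb(prior1, p1, p0):
    """Weights and bias of the logistic model with the same posterior."""
    w = [log(a / b) - log((1 - a) / (1 - b)) for a, b in zip(p1, p0)]
    bias = log(prior1 / (1 - prior1)) + sum(
        log((1 - a) / (1 - b)) for a, b in zip(p1, p0))
    return w, bias

def logistic_posterior(x, w, bias):
    """Sigmoid of the linear score bias + w . x."""
    z = bias + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + exp(-z))

# Hypothetical parameters for three binary features.
prior1, p1, p0 = 0.3, [0.9, 0.2, 0.6], [0.4, 0.5, 0.1]
w, bias = logistic_from_nb(prior1, p1, p0)
x = (1, 0, 1)
print(nb_posterior(x, prior1, p1, p0), logistic_posterior(x, w, bias))
```

The two printed values agree up to floating-point error, because the log-odds of the naive Bayes posterior is linear in the features.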

Relevance:

20.00% 20.00%

Publisher:

Abstract:

This thesis, which consists of an introduction and four peer-reviewed original publications, studies the problems of haplotype inference (haplotyping) and local alignment significance. The problems studied here belong to the broad area of bioinformatics and computational biology. The presented solutions are computationally fast and accurate, which makes them practical in high-throughput sequence data analysis. Haplotype inference is a computational problem where the goal is to estimate haplotypes from a sample of genotypes as accurately as possible. This problem is important because the direct measurement of haplotypes is difficult, whereas the genotypes are easier to quantify. Haplotypes are key players when studying, for example, the genetic causes of diseases. In this thesis, three methods are presented for the haplotype inference problem: HaploParser, HIT, and BACH. HaploParser is based on a combinatorial mosaic model and hierarchical parsing that together mimic recombinations and point mutations in a biologically plausible way. In this mosaic model, the current population is assumed to have evolved from a small founder population. Thus, the haplotypes of the current population are recombinations of the (implicit) founder haplotypes with some point mutations. HIT (Haplotype Inference Technique) uses a hidden Markov model for haplotypes, and efficient algorithms are presented to learn this model from genotype data. The model structure of HIT is analogous to the mosaic model of HaploParser with founder haplotypes. Therefore, it can be seen as a probabilistic model of recombinations and point mutations. BACH (Bayesian Context-based Haplotyping) utilizes a context tree weighting algorithm to efficiently sum over all variable-length Markov chains to evaluate the posterior probability of a haplotype configuration. Algorithms are presented that find haplotype configurations with high posterior probability.
BACH is the most accurate method presented in this thesis and has comparable performance to the best available software for haplotype inference. Local alignment significance is a computational problem where one is interested in whether the local similarities in two sequences arise because the sequences are related or just by chance. Similarity of sequences is measured by their best local alignment score, and from that a p-value is computed. This p-value is the probability of picking two sequences from the null model whose best local alignment score is at least as good. Local alignment significance is used routinely, for example, in homology searches. In this thesis, a general framework is sketched that allows one to compute a tight upper bound for the p-value of a local pairwise alignment score. Unlike the previous methods, the presented framework is not affected by so-called edge effects and can handle gaps (deletions and insertions) without troublesome sampling and curve fitting.
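
To make the p-value definition above concrete, here is a minimal sketch that estimates it by brute-force sampling. Note that this is exactly the kind of sampling approach the thesis's framework is designed to avoid; it is shown only to illustrate the quantity being bounded. The scoring scheme, the i.i.d. uniform null model, and the function names are assumptions made for this example.

```python
import random

def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            s = max(0,
                    prev[j - 1] + (match if ca == cb else mismatch),
                    prev[j] + gap,      # gap in b
                    cur[j - 1] + gap)   # gap in a
            cur.append(s)
            best = max(best, s)
        prev = cur
    return best

def empirical_p_value(a, b, trials=200, alphabet="ACGT", seed=0):
    """Monte Carlo estimate of P(null score >= observed score).

    Assumed null model: both sequences are i.i.d. uniform over the
    alphabet, with the same lengths as the observed sequences.
    """
    rng = random.Random(seed)
    observed = local_score(a, b)
    hits = 0
    for _ in range(trials):
        ra = "".join(rng.choice(alphabet) for _ in a)
        rb = "".join(rng.choice(alphabet) for _ in b)
        if local_score(ra, rb) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)

seq = "ACGTACGTACGTACGTACGT"
print(local_score(seq, seq), empirical_p_value(seq, seq, trials=50))
```

The estimate can never fall below 1/(trials + 1), which is one reason sampling is impractical for the very small p-values that matter in homology searches.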

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Variety selection in perennial pasture crops involves identifying the best varieties from data collected at multiple harvest times in field trials. For accurate selection, the statistical methods for analysing such data need to account for the spatial and temporal correlation typically present. This paper provides an approach for analysing multi-harvest data from variety selection trials in which there may be a large number of harvest times. Methods are presented for modelling the variety by harvest effects while accounting for the spatial and temporal correlation between observations. These methods provide an improvement in model fit compared to separate analyses for each harvest, and provide insight into variety by harvest interactions. The approach is illustrated using two traits from a lucerne variety selection trial. The proposed method provides variety predictions allowing for the natural sources of variation and correlation in multi-harvest data.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Early detection of (pre-)signs of ulceration on a diabetic foot is valuable for clinical practice. Hyperspectral imaging is a promising technique for detection and classification of such (pre-)signs. However, the number of spectral bands should be limited to avoid overfitting, which is critical for pixel classification with hyperspectral image data. The goal was to design a detector/classifier based on spectral imaging (SI) with a small number of optical bandpass filters. The performance and stability of the design were also investigated. The selection of the bandpass filters boils down to a feature selection problem. A dataset was built, containing reflectance spectra of 227 skin spots from 64 patients, measured with a spectrometer. Each skin spot was annotated manually by clinicians as "healthy" or a specific (pre-)sign of ulceration. Statistical analysis of the data set showed that the number of required filters is between 3 and 7, depending on additional constraints on the filter set. The stability analysis revealed that shot noise was the most critical factor affecting the classification performance. It indicated that this impact could be avoided in future SI systems with a camera sensor whose saturation level is higher than 10^6, or by image post-processing.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

From the autocorrelation function of geomagnetic polarity intervals, it is shown that the field reversal intervals are not independent but form a process akin to the Markov process, where the random input to the model is itself a moving average process. The input to the moving average model is, however, an independent Gaussian random sequence. All the parameters in this model of the geomagnetic field reversal have been estimated. In physical terms this model implies that the mechanism of reversal possesses a memory.
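
The model structure described above (a Markov-type recursion whose input is a moving average of independent Gaussian noise) can be sketched as follows. The orders (first-order recursion, first-order moving average) and the coefficients `phi` and `theta` are assumptions picked for illustration; the abstract does not report the estimated parameters, and how polarity intervals map onto the modelled variable is not specified there.

```python
import random

def simulate(n, phi=0.5, theta=0.4, seed=0):
    """Simulate x[t] = phi * x[t-1] + u[t], with u[t] = e[t] + theta * e[t-1].

    e[t] is an independent Gaussian sequence, u[t] is the moving-average
    input, and the recursion on x gives the process its memory.
    """
    rng = random.Random(seed)
    e_prev = rng.gauss(0.0, 1.0)
    x = 0.0
    out = []
    for _ in range(n):
        e = rng.gauss(0.0, 1.0)
        u = e + theta * e_prev   # moving-average input
        x = phi * x + u          # Markov-type (autoregressive) step
        out.append(x)
        e_prev = e
    return out

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x)
    cov = sum((x[t] - m) * (x[t + lag] - m) for t in range(n - lag))
    return cov / var

series = simulate(5000)
print([round(autocorr(series, k), 3) for k in range(4)])
```

A non-zero autocorrelation at positive lags is the signature of dependence between successive values; for an independent sequence these values would scatter around zero.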

Relevance:

20.00% 20.00%

Publisher:

Abstract:

Population dynamics are generally viewed as the result of intrinsic (purely density dependent) and extrinsic (environmental) processes. Both components, and potential interactions between the two, have to be modelled in order to understand and predict the dynamics of natural populations, a topic that is of great importance in population management and conservation. This thesis focuses on modelling environmental effects in population dynamics and on how the effects of potentially relevant environmental variables can be statistically identified and quantified from time series data. Chapter I presents some useful models of multiplicative environmental effects for unstructured density dependent populations. The presented models can be written as standard multiple regression models that are easy to fit to data. Chapters II-IV constitute empirical studies that statistically model environmental effects on the population dynamics of several migratory bird species with different life history characteristics and migration strategies. In Chapter II, spruce cone crops are found to have a strong positive effect on the population growth of the great spotted woodpecker (Dendrocopos major), while cone crops of pine, another important food resource for the species, do not effectively explain population growth. The study compares rate- and ratio-dependent effects of cone availability, using state-space models that distinguish between process and observation error in the time series data. Chapter III shows how drought, in combination with settling behaviour during migration, produces asymmetric spatially synchronous patterns of population dynamics in North American ducks (genus Anas). Chapter IV investigates the dynamics of a Finnish population of skylark (Alauda arvensis), and points out effects of rainfall and habitat quality on population growth.
Because the skylark time series and some of the environmental variables included show strong positive autocorrelation, the statistical significances are calculated using a Monte Carlo method, where random autocorrelated time series are generated. Chapter V is a simulation-based study, showing that ignoring observation error in analyses of population time series data can bias the estimated effects and measures of uncertainty, if the environmental variables are autocorrelated. It is concluded that the use of state-space models is an effective way to reach more accurate results. In summary, there are several biological assumptions and methodological issues that can affect the inferential outcome when estimating environmental effects from time series data, and that therefore need special attention. The functional form of the environmental effects and potential interactions between environment and population density are important to deal with. Other issues that should be considered are assumptions about density dependent regulation, modelling potential observation error, and when needed, accounting for spatial and/or temporal autocorrelation.
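
The Monte Carlo significance test mentioned for the skylark data can be sketched roughly as below. This is one possible reading of "random autocorrelated time series are generated", not the thesis's actual procedure: the surrogates are assumed to be first-order autoregressive series matching the lag-1 autocorrelation of the environmental variable, the test statistic is assumed to be the absolute correlation, and all function names are hypothetical.

```python
import random

def ar1(n, rho, rng):
    """Stationary AR(1) series with unit variance and lag-1 correlation rho."""
    x = rng.gauss(0.0, 1.0)
    out = []
    for _ in range(n):
        out.append(x)
        x = rho * x + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
    return out

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5

def lag1(x):
    """Sample lag-1 autocorrelation, used to tune the surrogates."""
    return corr(x[:-1], x[1:])

def mc_p_value(pop, env, trials=999, seed=0):
    """Share of autocorrelated surrogates correlating with pop at least
    as strongly as the observed environmental series does."""
    rng = random.Random(seed)
    observed = abs(corr(pop, env))
    rho = max(-0.99, min(0.99, lag1(env)))
    hits = sum(abs(corr(pop, ar1(len(env), rho, rng))) >= observed
               for _ in range(trials))
    return (hits + 1) / (trials + 1)
```

Comparing against autocorrelated surrogates rather than independent ones matters because two unrelated but strongly autocorrelated series often show large sample correlations by chance, so a standard test would overstate significance.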

Relevance:

20.00% 20.00%

Publisher:

Abstract:

An efficient and statistically robust solution for the identification of asteroids among numerous sets of astrometry is presented. In particular, numerical methods have been developed for the short-term identification of asteroids at discovery, and for the long-term identification of scarcely observed asteroids over apparitions, a task which has been lacking a robust method until now. The methods are based on the solid foundation of statistical orbital inversion properly taking into account the observational uncertainties, which allows for the detection of practically all correct identifications. Through the use of dimensionality-reduction techniques and efficient data structures, the exact methods have a log-linear, that is, O(n log n), computational complexity, where n is the number of included observation sets. The methods developed are thus suitable for future large-scale surveys which anticipate a substantial increase in the astrometric data rate. Due to the discontinuous nature of asteroid astrometry, separate sets of astrometry must be linked to a common asteroid from the very first discovery detections onwards. The reason for the discontinuity in the observed positions is the rotation of the observer with the Earth as well as the motion of the asteroid and the observer about the Sun. Therefore, the aim of identification is to find a set of orbital elements that reproduce the observed positions with residuals similar to the inevitable observational uncertainty. Unless the astrometric observation sets are linked, the corresponding asteroid is eventually lost as the uncertainty of the predicted positions grows too large to allow successful follow-up.
Whereas the presented identification theory and the numerical comparison algorithm are generally applicable, that is, also in fields other than astronomy (e.g., in the identification of space debris), the numerical methods developed for asteroid identification can immediately be applied to all objects on heliocentric orbits with negligible effects due to non-gravitational forces in the time frame of the analysis. The methods developed have been successfully applied to various identification problems. Simulations have shown that the methods developed are able to find virtually all correct linkages despite challenges such as numerous scarce observation sets, astrometric uncertainty, numerous objects confined to a limited region on the celestial sphere, long linking intervals, and substantial parallaxes. Tens of previously unknown main-belt asteroids have been identified with the short-term method in a preliminary study to locate asteroids among numerous unidentified sets of single-night astrometry of moving objects, and scarce astrometry obtained nearly simultaneously with Earth-based and space-based telescopes has been successfully linked despite a substantial parallax. Using the long-term method, thousands of realistic 3-linkages typically spanning several apparitions have so far been found among designated observation sets each spanning less than 48 hours.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

A generalized technique is proposed for modeling the effects of process variations on dynamic power by directly relating the variations in process parameters to variations in the dynamic power of a digital circuit. The dynamic power of a 2-input NAND gate is characterized by mixed-mode simulations, to be used as a library element for 65 nm gate length technology. The proposed methodology is demonstrated with a multiplier circuit built using the NAND gate library, by characterizing its dynamic power through Monte Carlo analysis. The statistical technique of Response Surface Methodology (RSM), using Design of Experiments (DOE) and the Least Squares Method (LSM), is employed to generate a "hybrid model" for gate power to account for simultaneous variations in multiple process parameters. We demonstrate that our hybrid-model-based statistical design approach results in considerable savings in the power budget of low-power CMOS designs with an error of less than 1%, with significant reductions in uncertainty by at least 6X on a normalized basis, against worst-case design.

Relevance:

20.00% 20.00%

Publisher:

Abstract:

"We thank Mr Gilder for his considered comments and suggestions for alternative analyses of our data. We also appreciate Mr Gilder’s support of our call for larger studies to contribute to the evidence base for preoperative loading with high-carbohydrate fluids..."