34 results for statistical framework
in Helda - Digital Repository of the University of Helsinki
Abstract:
Advances in analysis techniques have led to a rapid accumulation of biological data in databases. Such data are often in the form of sequences of observations; examples include DNA sequences and the amino acid sequences of proteins. The scale and quality of the data promise answers to various biologically relevant questions in more detail than has been possible before. For example, one may wish to identify regions of an amino acid sequence that are important for the function of the corresponding protein, or investigate how characteristics at the level of the DNA sequence affect the adaptation of a bacterial species to its environment. Many of the interesting questions are intimately associated with understanding the evolutionary relationships among the items under consideration. The aim of this work is to develop novel statistical models and computational techniques to meet the challenge of deriving meaning from the increasing amounts of data. Our main concern is modeling the evolutionary relationships based on the observed molecular data. We operate within a Bayesian statistical framework, which allows a probabilistic quantification of the uncertainties related to a particular solution. As the basis of our modeling approach we utilize a partition model, which describes the structure of the data by appropriately dividing the data items into clusters of related items. Generalizations and modifications of the partition model are developed and applied to various problems. Large-scale data sets also pose a computational challenge. The models used to describe the data must be realistic enough to capture the essential features of the current modeling task but, at the same time, simple enough to make it possible to carry out the inference in practice. The partition model fulfills both requirements. Problem-specific features can be taken into account by modifying the prior probability distributions of the model parameters.
The computational efficiency stems from the ability to integrate out the parameters of the partition model analytically, which enables the use of efficient stochastic search algorithms.
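A minimal sketch of this idea, assuming a Dirichlet-multinomial formulation (the function names and the conjugate-prior choice here are illustrative, not the thesis's actual model): with a symmetric Dirichlet prior on each cluster's per-column symbol frequencies, the cluster parameters integrate out in closed form, so any candidate partition receives an exact score that a stochastic search can compare cheaply.

```python
import math

def log_marginal(cluster_rows, n_symbols=4, alpha=1.0):
    """Log marginal likelihood of one cluster: per-column symbol
    frequencies get a symmetric Dirichlet(alpha) prior, which is
    integrated out analytically (Dirichlet-multinomial)."""
    if not cluster_rows:
        return 0.0
    total = 0.0
    for j in range(len(cluster_rows[0])):
        counts = [0] * n_symbols
        for row in cluster_rows:
            counts[row[j]] += 1
        n = sum(counts)
        total += math.lgamma(n_symbols * alpha) - math.lgamma(n + n_symbols * alpha)
        total += sum(math.lgamma(c + alpha) - math.lgamma(alpha) for c in counts)
    return total

def score_partition(data, labels, n_clusters):
    """Closed-form score of a candidate partition: the sum of the
    clusters' log marginal likelihoods."""
    return sum(log_marginal([d for d, l in zip(data, labels) if l == c])
               for c in range(n_clusters))
```

A stochastic search would then propose moving items between clusters and keep (or probabilistically accept) moves that improve this score, without ever sampling the cluster parameters themselves.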
Abstract:
Bacteria play an important role in many ecological systems. The molecular characterization of bacteria using either cultivation-dependent or cultivation-independent methods reveals the large scale of bacterial diversity in natural communities, and the vastness of subpopulations within a species or genus. Understanding how bacterial diversity varies across different environments and also within populations should provide insights into many important questions of bacterial evolution and population dynamics. This thesis presents novel statistical methods for analyzing bacterial diversity using widely employed molecular fingerprinting techniques. The first objective of this thesis was to develop Bayesian clustering models to identify bacterial population structures. Bacterial isolates were identified using multilocus sequence typing (MLST), and Bayesian clustering models were used to explore the evolutionary relationships among the isolates. Our method involves the inference of genetic population structures via an unsupervised clustering framework in which the dependence between loci is represented using graphical models. The population dynamics that generate such a population stratification were investigated using a stochastic model, in which homologous recombination between subpopulations can be quantified within a gene flow network. The second part of the thesis focuses on cluster analysis of community compositional data produced by two different cultivation-independent analyses: terminal restriction fragment length polymorphism (T-RFLP) analysis and fatty acid methyl ester (FAME) analysis. The cluster analysis aims to group bacterial communities that are similar in composition, which is an important step towards understanding the overall influences of environmental and ecological perturbations on bacterial diversity.
A common feature of T-RFLP and FAME data is zero-inflation, meaning that zero values are observed much more frequently than would be expected, for example, from a Poisson distribution in the discrete case or a Gaussian distribution in the continuous case. We provide two strategies for modeling zero-inflation within the clustering framework, which were validated on both synthetic and complex empirical data sets. We show in the thesis that our model, which takes into account dependencies between loci in MLST data, can produce better clustering results than methods that assume independent loci. Furthermore, computer algorithms that are efficient in analyzing large-scale data were adopted to meet the increasing computational need. Our method for detecting homologous recombination in subpopulations may provide a theoretical criterion for defining bacterial species. The clustering of bacterial community data, including T-RFLP and FAME, provides an initial effort toward discovering the evolutionary dynamics that structure and maintain bacterial diversity in the natural environment.
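The zero-inflation idea in the discrete case can be sketched as a two-component mixture (a hypothetical stand-alone illustration, not the thesis's clustering model): a point mass at zero mixed with an ordinary Poisson component, so that excess zeros no longer distort the fit.

```python
import math

def poisson_logpmf(k, lam):
    """Log pmf of a plain Poisson(lam) count."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def zip_logpmf(k, pi, lam):
    """Zero-inflated Poisson: with probability pi an exact zero,
    otherwise an ordinary Poisson(lam) count."""
    if k == 0:
        return math.log(pi + (1 - pi) * math.exp(-lam))
    return math.log(1 - pi) + poisson_logpmf(k, lam)

def loglik(data, logpmf, *params):
    """Total log-likelihood of the data under a given pmf."""
    return sum(logpmf(k, *params) for k in data)
```

On data with many structural zeros, the zero-inflated model attains a higher log-likelihood than a single Poisson forced to average the zeros and the positive counts together.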
Abstract:
In this Thesis, we develop theory and methods for computational data analysis. The problems in data analysis are approached from three perspectives: statistical learning theory, the Bayesian framework, and the information-theoretic minimum description length (MDL) principle. Contributions in statistical learning theory address the possibility of generalization to unseen cases, and regression analysis with partially observed data, with an application to mobile device positioning. In the second part of the Thesis, we discuss so-called Bayesian network classifiers, and show that they are closely related to logistic regression models. In the final part, we apply the MDL principle to tracing the history of old manuscripts, and to noise reduction in digital signals.
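The MDL principle mentioned in the final part can be illustrated with a toy two-part code for a binary sequence (a generic textbook-style sketch, not the manuscript-history or noise-reduction models themselves): a model is preferred if the bits spent on its parameters plus the bits spent on the data given those parameters are fewer than under a competing model.

```python
import math

def codelen_fair(seq):
    """Code length in bits under a fixed fair-coin model: 1 bit/symbol."""
    return float(len(seq))

def codelen_two_part(seq):
    """Two-part MDL code: roughly 0.5*log2(n) bits to state the ML
    Bernoulli parameter, plus the data encoded with that parameter."""
    n = len(seq)
    k = sum(seq)
    if k in (0, n):
        return 0.5 * math.log2(n)  # data cost is zero at p in {0, 1}
    p = k / n
    data_bits = -(k * math.log2(p) + (n - k) * math.log2(1 - p))
    return 0.5 * math.log2(n) + data_bits

def mdl_prefers_biased(seq):
    """True if the two-part code beats the fair-coin baseline."""
    return codelen_two_part(seq) < codelen_fair(seq)
```

A heavily biased sequence is compressed by paying for the parameter, while a balanced sequence is not: the parameter cost then buys no savings on the data.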
Abstract:
This study reports a diachronic corpus investigation of common-number pronouns used to convey unknown or otherwise unspecified reference. The study charts agreement patterns in these pronouns in various diachronic and synchronic corpora. The objective is to provide baseline data on variant frequencies and distributions in the history of English, as there are no previous systematic corpus-based observations on this topic. This study seeks to answer the questions of how pronoun use is linked with the overall typological development in English and how their diachronic evolution is embedded in the linguistic and social structures in which they are used. The theoretical framework draws on corpus linguistics and historical sociolinguistics, grammaticalisation, diachronic typology, and multivariate modelling of sociolinguistic variation. The method employs quantitative corpus analyses from two main electronic corpora, one from Modern English and the other from Present-day English. The Modern English material is the Corpus of Early English Correspondence, and the time frame covered is 1500-1800. The written component of the British National Corpus is used in the Present-day English investigations. In addition, the study draws supplementary data from other electronic corpora. The material is used to compare the frequencies and distributions of common-number pronouns between these two time periods. The study limits the common-number uses to two subsystems, one anaphoric to grammatically singular antecedents and one cataphoric, in which the pronoun is followed by a relative clause. Various statistical tools are used to process the data, ranging from cross-tabulations to multivariate VARBRUL analyses in which the effects of sociolinguistic and systemic parameters are assessed to model their impact on the dependent variable.
This study shows how one pronoun type has extended its uses in both subsystems, an increase linked with grammaticalisation and the changes in other pronouns in English through the centuries. The variationist sociolinguistic analysis charts how grammaticalisation in the subsystems is embedded in the linguistic and social structures in which the pronouns are used. The study suggests a scale of two statistical generalisations of various sociolinguistic factors which contribute to grammaticalisation and its embedding at various stages of the process.
Abstract:
In this dissertation, I present an overall methodological framework for studying linguistic alternations, focusing specifically on lexical variation in denoting a single meaning, that is, synonymy. As the practical example, I employ the synonymous set of the four most common Finnish verbs denoting THINK, namely ajatella, miettiä, pohtia and harkita ‘think, reflect, ponder, consider’. As a continuation of previous work, I describe in considerable detail the extension of statistical methods from dichotomous linguistic settings (e.g., Gries 2003; Bresnan et al. 2007) to polytomous ones, that is, those concerning more than two possible alternative outcomes. The applied statistical methods are arranged into a succession of stages of increasing complexity, proceeding from univariate via bivariate to multivariate techniques. As the central multivariate method, I argue for the use of polytomous logistic regression and demonstrate its practical application to the studied phenomenon, thus extending the work of Bresnan et al. (2007), who applied simple (binary) logistic regression to a dichotomous structural alternation in English. The results of the various statistical analyses confirm that a wide range of contextual features across different categories are indeed associated with the use and selection of the selected THINK lexemes; however, a substantial part of these features is not exemplified in current Finnish lexicographical descriptions. The multivariate analysis results indicate that the semantic classifications of syntactic argument types are on average the most distinctive feature category, followed by overall semantic characterizations of the verb chains, and then syntactic argument types alone, with morphological features pertaining to the verb chain and extra-linguistic features relegated to the last position.
In terms of overall performance of the multivariate analysis and modeling, the prediction accuracy seems to reach a ceiling at a Recall rate of roughly two-thirds of the sentences in the research corpus. The analysis of these results suggests a limit to what can be explained and determined within the immediate sentential context and applying the conventional descriptive and analytical apparatus based on currently available linguistic theories and models. The results also support Bresnan’s (2007) and others’ (e.g., Bod et al. 2003) probabilistic view of the relationship between linguistic usage and the underlying linguistic system, in which only a minority of linguistic choices are categorical, given the known context – represented as a feature cluster – that can be analytically grasped and identified. Instead, most contexts exhibit degrees of variation as to their outcomes, resulting in proportionate choices over longer stretches of usage in texts or speech.
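The polytomous (multinomial) logistic regression at the heart of the multivariate stage can be sketched as follows (a generic from-first-principles implementation; the toy features and training loop are illustrative assumptions, not the dissertation's actual feature set): each outcome class gets a linear score, and a softmax turns the scores into proportionate choice probabilities, mirroring the probabilistic, non-categorical view of variation described above.

```python
import math

def softmax(scores):
    """Numerically stable softmax over per-class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def predict_proba(x, W):
    """x: feature vector with a leading 1 for the intercept;
    W: one weight row per outcome class."""
    return softmax([sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in W])

def fit(X, y, n_classes, lr=0.1, epochs=1000):
    """Per-sample gradient descent on the multinomial log-loss."""
    d = len(X[0])
    W = [[0.0] * d for _ in range(n_classes)]
    for _ in range(epochs):
        for x, label in zip(X, y):
            p = predict_proba(x, W)
            for c in range(n_classes):
                err = p[c] - (1.0 if c == label else 0.0)
                for j in range(d):
                    W[c][j] -= lr * err * x[j]
    return W
```

On a toy one-feature, three-class data set the fitted model carves the feature axis into three regions of graded probability rather than hard categorical choices.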
Abstract:
The goal of this research was to establish the necessary conditions under which individuals are prepared to commit themselves to quality assurance work in the organisation of a Polytechnic. The conditions were studied using four main concepts: awareness of quality, commitment to the organisation, leadership and work welfare. First, individuals were asked to describe these four concepts. Then, relationships between the concepts were analysed in order to establish the conditions for the commitment of an individual towards quality assurance work (QA). The study group comprised the entire personnel of Helsinki Polytechnic, of which 341 individuals (44.5%) participated. Mixed methods were used as the methodological base. A questionnaire and interviews were used as the research methods. The data from the interviews were used for the validation of the results, as well as for completing the analysis. The results of these interviews and analyses were integrated using the concurrent nested design method. In addition, the questionnaire was used to separately analyse the impressions and meanings of the awareness of quality and leadership, because, according to the pre-understanding, impressions of phenomena expressed in terms of reality have an influence on the commitment to QA. In addition to descriptive statistics, principal component analysis was used as a descriptive method. For comparisons between groups, one-way analysis of variance and effect size analysis were used. As explanatory analysis methods, forward regression analysis and structural modelling were applied. As a result of the research it was found that 51% of the conditions necessary for a commitment to QA were explained by an individual's experience/belief that QA was a method of development, that it was possible to participate in QA, and that the meaning of quality included both product and process qualities.
If analysed separately, the other main concepts (commitment to the organisation, leadership and work welfare) played only a small part in explaining an individual's commitment. In the context of this research, a structural path model of the main concepts was built. In the model, the concepts were interconnected by paths created as a result of a literature search covering the main concepts, as well as a result of an analysis of the empirical material of this thesis work. The path model explained 46% of the necessary conditions under which individuals are prepared to commit themselves to QA. The most important path for achieving a commitment stemmed from product and system quality emanating from the new goals of the Polytechnic, moved through the individual's experience that QA is a method of the total development of quality and ended in a commitment to QA. The second most important path stemmed from the individual's experience of belonging to a supportive work community, moved through the supportive value of the job and through affective commitment to the organisation and ended in a commitment to QA. The third path stemmed from an individual's experiences in participating in QA, moved through collective system quality and through this to the supportive value of the job and affective commitment to the organisation, and ended in a commitment to QA. The final path in the path model stemmed from leadership by empowerment, moved through collective system quality, the supportive value of the job and an affective commitment to the organisation, and again ended in a commitment to QA. As a result of the research, it was found that the individual's functional department was an important factor in explaining the differences between groups. Therefore, it was found that understanding the workings of subcultures in the organisation is important when developing QA. Likewise, learning-teaching paradigms proved to be a differentiating factor.
Individuals thinking according to the humanistic-constructivist paradigm showed more commitment to QA than technological-rational thinkers. Also, it was shown that the QA training program did not increase commitment, as the path model demonstrated that those who participated in training showed 34% commitment, whereas those who did not showed 55% commitment. As a summary of the results, it can be said that the necessary conditions under which individuals are prepared to commit themselves to QA cannot be treated in a reductionist way. Instead, the conditions must be treated as one totality, with all the main concepts interacting simultaneously. Also, the theoretical framework of quality must include its dynamic aspect, which means the development of the work of the individual and learning through auditing. In addition, this dynamism includes reflection on the paradigm of the functions of the individual as well as that of all parts of the organisation. It is important to understand and manage the various ways of thinking and the cultural differences produced by the fragmentation of the organisation. Finally, it seems possible that the path model can be generalised for use in any organisation development project where the personnel should be committed.
Abstract:
The purpose of the present study was to investigate the possibilities and interconnections that exist concerning the relationship between the University of Applied Sciences and the Learning by Developing action model (LbD), on the one hand, and education for sustainable development and high-quality learning as a part of professional competence development on the other. The research and learning environment was the Coping at Home research project and its Caring TV project, which provided the context of the Physiotherapy for Elderly People professional study unit. The researcher was a teacher and an evaluator of her own students' learning. The aims of the study were to monitor and evaluate learning at the individual and group level using tools of high-quality learning − improved concept maps − related to understanding the project's core concept of successful ageing. Conceptions were evaluated through aspects of sustainable development and a conceptual basis of physiotherapy. As educational research this was a multi-method case study design experiment. The three research questions were as follows. 1. What kind of individual conceptions and conceptual structures do students build concerning the concept of successful ageing? How many and what kind of concepts and propositions do they have a) before the study unit, b) after the study unit, c) after the social-knowledge building? 2. What kind of social-knowledge building exists? a) What kind of social learning process exists? b) What kind of socially created concepts, propositions and conceptual structures do the students possess after the project? c) What kind of meaning does the social-knowledge building have at an individual level? 3. How do physiotherapy competences develop according to the results of the first and second research questions? The subjects were 22 female, third-year Bachelor of Physiotherapy students in Laurea University of Applied Sciences in Finland.
Individual learning was evaluated in 12 of the 22 students. The data was collected as a part of the learning exercises of the Physiotherapy for Elderly People study unit, with improved concept maps at both the individual and group levels. The students were divided into two social-knowledge building groups: the first group had 15 members and the second 7 members. Each group created a group-level concept map on the theme of successful ageing. These face-to-face interactions were recorded with CMapTools and videotaped. The data consists of both the individually produced concept maps, the group-produced concept maps of the two groups and the videotaped material of these processes. The data analysis was carried out at the intersection of various research traditions. Individually produced data was analysed based on content analysis. Group-produced data was analysed based on content analysis and dialogue analysis. The data was also analysed by simple statistical analysis. In the individually produced improved concept maps the students' conceptions were comprehensive, and the first concept maps were found to have many concepts unrelated to each other. The conceptual structures were between spoke structures and chain structures. Only a few professional concepts were evident. In the second individual improved concept maps the conception was more professional than earlier, particularly from the functional point of view. The conceptual structures mostly resembled spoke structures. After the second individual concept mapping, social mapping interventions were made in the two groups. After this, multidisciplinary concrete links were established between all concepts in almost all individual concept maps, and the interconnectedness of the concepts in different subject areas was thus understood. The conceptual structures were mainly net structures.
The concepts in these individual concept maps were also found to be more professional and concrete than in the previous concept maps of these subjects. In addition, the wider context dependency of the concepts was recognized in many individual concept maps. This implies a conceptual framework for specialists. The social-knowledge building was similar to a social learning process. Both socio-cultural processes and cognitive processes were found to develop students' conceptual awareness and the ability to engage in intentional learning. In the knowledge-building process two aspects were found: knowledge creation and pedagogical action. The discussion during the concept-mapping process was similar to a shared thinking process. In visualising the process with CMapTools, students easily complemented each other's thoughts and words, as if mutually telepathic. Synthesizing, supporting, asking and answering, peer teaching and counselling, tutoring, evaluating and arguing took place, and students were very active, self-directed and creative. It took hundreds of conversations before a common understanding could be found. The use of concept mapping in particular was very effective. The concepts in these group-produced concept maps were found to be professional, and values of sustainable development were observed. The results show the importance of developing the contents and objectives of the European Qualifications Framework as well as education for sustainable development, especially in terms of the need for knowledge creation, global responsibility and systemic, holistic and critical thinking in order to develop clinical practice. Keywords: education for sustainable development, learning, knowledge building, improved concept map, conceptual structure, competence, successful ageing
Abstract:
In genetic epidemiology, population-based disease registries are commonly used to collect genotype or other risk factor information concerning affected subjects and their relatives. This work presents two new approaches to the statistical inference of ascertained data: conditional and full likelihood approaches for diseases with a variable age-at-onset phenotype, using familial data obtained from a population-based registry of incident cases. The aim is to obtain statistically reliable estimates of the general population parameters. The statistical analysis of familial data with variable age at onset becomes more complicated when some of the study subjects are non-susceptible, that is to say, these subjects never get the disease. A statistical model for a variable age at onset with long-term survivors is proposed for studies of familial aggregation, using a latent variable approach, as well as for prospective genetic association studies with candidate genes. In addition, we explore the possibility of a genetic explanation of the observed increase in the incidence of Type 1 diabetes (T1D) in Finland in recent decades and the hypothesis of non-Mendelian transmission of T1D-associated genes. Both classical and Bayesian statistical inference were used in the modelling and estimation. Although this work contains five studies with different statistical models, they all concern data obtained from nationwide registries of T1D and the genetics of T1D. In the analyses of T1D data, non-Mendelian transmission of T1D susceptibility alleles was not observed. In addition, non-Mendelian transmission of T1D susceptibility genes did not provide a plausible explanation for the increase in T1D incidence in Finland. Instead, the Human Leucocyte Antigen associations with T1D were confirmed in a population-based analysis, which combines T1D registry information, a reference sample of healthy subjects and birth cohort information of the Finnish population.
Finally, substantial familial variation in the susceptibility to T1D nephropathy was observed. The presented studies show the benefits of sophisticated statistical modelling in exploring risk factors for complex diseases.
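The long-term survivor (non-susceptible) model described above can be sketched as a mixture likelihood (a simplified stand-alone illustration with an exponential onset distribution; the thesis's actual models are richer): an observed onset implies susceptibility, while a censored subject may be either cured or susceptible with onset beyond the censoring time.

```python
import math

def loglik_cure(times, events, p_cure, rate):
    """Log-likelihood of a cure-rate model: with probability p_cure the
    subject is non-susceptible (never gets the disease); otherwise the
    age at onset is Exponential(rate). events[i]=1 means onset observed
    at times[i]; events[i]=0 means right-censored at times[i]."""
    ll = 0.0
    for t, d in zip(times, events):
        if d:  # onset observed: the subject must be susceptible
            ll += math.log(1 - p_cure) + math.log(rate) - rate * t
        else:  # censored: cured, or susceptible with onset after t
            ll += math.log(p_cure + (1 - p_cure) * math.exp(-rate * t))
    return ll
```

With many subjects still disease-free at high ages, a model allowing a non-susceptible fraction fits far better than one forcing everyone to be at risk, which is exactly why ignoring long-term survivors distorts the onset-rate estimates.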
Abstract:
The sustainability of food production has increasingly attracted the attention of consumers, farmers, food and retailing companies, and politicians. One manifestation of such attention is the growing interest in organic foods. Organic agriculture has the potential to enhance the ecological modernisation of food production by implementing the organic method as a preventative innovation that simultaneously produces environmental and economic benefits. However, in addition to the challenges to organic farming, the small market share of organic products in many countries today, and in Finland in particular, risks undermining the achievement of such benefits. The problems identified as hindrances to the increased consumption of organic food are the poor availability, limited variety and high prices of organic products, the complicated buying decisions and the difficulties in delivering the intangible value of organic foods. Small volumes and sporadic markets, high costs, lack of market information, as well as poor supply reliability are obstacles to increasing the volume of organic production and processing. These problems shift the focus from a single actor to the entire supply chain and require solutions that involve more interaction among the actors within the organic chain. As an entity, the organic food chain has received very little scholarly attention. Researchers have mainly approached the organic chain from the perspective of a single actor, or they have described its structure rather than the interaction between the actors. Consequently, interaction among the primary actors in organic chains, i.e. farmers, manufacturers, retailers and consumers, has largely gone unexamined. The purpose of this study is to shed light on the interaction of the primary actors within a whole organic chain in relation to the ecological modernisation of food production. This information is organised into a conceptual framework to help illuminate this complex field.
This thesis integrates the theories and concepts of three approaches: food system studies, supply chain management and ecological modernisation. Through a case study, a conceptual system framework is developed and applied to a real-life situation. The thesis is supported by research published in four articles. All examine the same organic chains through case studies, but each approaches the problem from a different, complementary perspective. The findings indicated that regardless of the coherent values emphasising responsibility, the organic chains were loosely integrated to operate as a system. The focus was on product flow, leaving other aspects of value creation largely aside. Communication with consumers was rare, and none of the actors had taken a leading role in enhancing the market for organic products. Such a situation presents unsuitable conditions for the ecological modernisation of food production through organic food and calls for contributions from stakeholders other than those directly involved in the product chain. The findings inspired a revision of the original conceptual framework. The revised framework, the 'three-layer framework', distinguishes the different layers of interaction. By gradually enlarging the chain orientation, the different but interrelated layers become visible. A framework is thus provided for further research and for understanding the practical implications of the performance of organic food chains. The revised framework both provides an ideal model for organic chains in relation to ecological modernisation and demonstrates a situation consistent with the empirical evidence.
Abstract:
Digital elevation models (DEMs) have been an important topic in geography and the surveying sciences for decades due to their geomorphological importance as the reference surface for gravitation-driven material flow, as well as their wide range of uses and applications. When a DEM is used in terrain analysis, for example in automatic drainage basin delineation, errors of the model accumulate in the analysis results. Investigation of this phenomenon is known as error propagation analysis, which has a direct influence on the decision-making process based on interpretations and applications of terrain analysis. Additionally, it may have an indirect influence on data acquisition and DEM generation. The focus of the thesis was on fine toposcale DEMs, which are typically represented in a 5-50 m grid and used at the application scale of 1:10 000-1:50 000. The thesis presents a three-step framework for investigating error propagation in DEM-based terrain analysis. The framework includes methods for visualising the morphological gross errors of DEMs, exploring the statistical and spatial characteristics of the DEM error, performing analytical and simulation-based error propagation analyses and interpreting the error propagation analysis results. The DEM error model was built using geostatistical methods. The results show that appropriate and exhaustive reporting of the various aspects of fine toposcale DEM error is a complex task. This is due to the high number of outliers in the error distribution and the morphological gross errors, which are detectable with the presented visualisation methods. In addition, the use of a global characterisation of DEM error is a gross generalisation of reality due to the small extent of the areas in which the assumption of stationarity is not violated. This was shown using an exhaustive high-quality reference DEM based on airborne laser scanning and local semivariogram analysis.
The error propagation analysis revealed that, as expected, an increase in the DEM vertical error will increase the error in surface derivatives. However, contrary to expectations, the spatial autocorrelation of the model appears to have varying effects on the error propagation analysis depending on the application. The use of a spatially uncorrelated DEM error model has been considered a 'worst-case scenario', but this view is now challenged because none of the DEM derivatives investigated in the study had maximum variation with spatially uncorrelated random error. Significant performance improvement was achieved in simulation-based error propagation analysis by applying process convolution in generating realisations of the DEM error model. In addition, a typology of uncertainty in drainage basin delineation is presented.
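The simulation-based error propagation the abstract describes can be sketched in miniature (illustrative only — the thesis uses geostatistical error models and process convolution, whereas this sketch uses the simple spatially uncorrelated case): perturb the DEM with random error, recompute a derivative such as slope, and summarise the spread of the results.

```python
import math
import random

def slope_deg(dem, i, j, cell=25.0):
    """Maximum-gradient slope in degrees from central differences
    at grid cell (i, j) of a DEM given as a list of rows."""
    dzdx = (dem[i][j + 1] - dem[i][j - 1]) / (2 * cell)
    dzdy = (dem[i + 1][j] - dem[i - 1][j]) / (2 * cell)
    return math.degrees(math.atan(math.hypot(dzdx, dzdy)))

def propagate(dem, i, j, sigma=2.0, n_sim=400, seed=1):
    """Monte Carlo error propagation: perturb the DEM with spatially
    uncorrelated N(0, sigma) error and collect the derived slope."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_sim):
        noisy = [[z + rng.gauss(0.0, sigma) for z in row] for row in dem]
        slopes.append(slope_deg(noisy, i, j))
    mean = sum(slopes) / n_sim
    sd = math.sqrt(sum((s - mean) ** 2 for s in slopes) / (n_sim - 1))
    return mean, sd
```

The standard deviation of the simulated slopes is the propagated uncertainty; swapping in spatially correlated realisations (e.g., via process convolution) is what changes this figure in application-dependent ways, as the abstract notes.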
Abstract:
Determination of the environmental factors controlling earth surface processes and landform patterns is one of the central themes in physical geography. However, the identification of the main drivers of the geomorphological phenomena is often challenging. Novel spatial analysis and modelling methods could provide new insights into the process-environment relationships. The objective of this research was to map and quantitatively analyse the occurrence of cryogenic phenomena in subarctic Finland. More precisely, utilising a grid-based approach the distribution and abundance of periglacial landforms were modelled to identify important landscape scale environmental factors. The study was performed using a comprehensive empirical data set of periglacial landforms from an area of 600 km2 at a 25-ha resolution. The utilised statistical methods were generalized linear modelling (GLM) and hierarchical partitioning (HP). GLMs were used to produce distribution and abundance models and HP to reveal independently the most likely causal variables. The GLM models were assessed utilising statistical evaluation measures, prediction maps, field observations and the results of HP analyses. A total of 40 different landform types and subtypes were identified. Topographical, soil property and vegetation variables were the primary correlates for the occurrence and cover of active periglacial landforms on the landscape scale. In the model evaluation, most of the GLMs were shown to be robust although the explanation power, prediction ability as well as the selected explanatory variables varied between the models. The great potential of the combination of a spatial grid system, terrain data and novel statistical techniques to map the occurrence of periglacial landforms was demonstrated in this study. 
GLM proved to be a useful modelling framework for testing the shapes of the response functions and the significance of the environmental variables, and the HP method helped to make better inferences about the important factors of earth surface processes. Hence, the numerical approach presented in this study can be a useful addition to the current range of techniques available to researchers to map and monitor different geographical phenomena.
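Hierarchical partitioning can be sketched independently of any particular GLM (the goodness-of-fit values below are made-up inputs; in practice they would be, e.g., explained deviances of fitted GLMs): each variable's independent contribution is its average improvement to model fit over all models it could join, averaged across model sizes.

```python
from itertools import combinations

def hierarchical_partitioning(variables, gof):
    """Independent contribution of each variable: the average, over all
    model sizes, of the mean improvement in goodness-of-fit when the
    variable joins a model that lacks it. gof maps frozensets of
    variable names to goodness-of-fit values (all 2^k subsets)."""
    k = len(variables)
    contrib = {}
    for v in variables:
        others = [w for w in variables if w != v]
        level_means = []
        for r in range(k):  # size of the model before adding v
            gains = [gof[frozenset(sub) | {v}] - gof[frozenset(sub)]
                     for sub in combinations(others, r)]
            level_means.append(sum(gains) / len(gains))
        contrib[v] = sum(level_means) / k
    return contrib
```

A useful property (and a quick sanity check) is that the independent contributions sum to the goodness-of-fit of the full model, so shared variance between correlated predictors is split rather than double-counted.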
Abstract:
Whether a statistician wants to complement a probability model for observed data with a prior distribution and carry out fully probabilistic inference, or to base the inference only on the likelihood function, may be a fundamental question in theory, but in practice it may well be of less importance if the likelihood contains much more information than the prior. Maximum likelihood inference can be justified as a Gaussian approximation at the posterior mode, using flat priors. However, in situations where the parametric assumptions of standard statistical models would be too rigid, a more flexible model formulation, combined with fully probabilistic inference, can be achieved using hierarchical Bayesian parametrization. This work includes five articles, all of which apply probability modeling to various problems involving incomplete observation. Three of the papers apply maximum likelihood estimation and two of them hierarchical Bayesian modeling. Because maximum likelihood may be presented as a special case of Bayesian inference, but not the other way round, in the introductory part of this work we present a framework for probability-based inference using only Bayesian concepts. We also re-derive some results presented in the original articles using the toolbox developed herein, to show that they are also justifiable under this more general framework. Here the assumption of exchangeability and de Finetti's representation theorem are applied repeatedly to justify the use of standard parametric probability models with conditionally independent likelihood contributions. It is argued that the same reasoning can also be applied under sampling from a finite population. The main emphasis here is on probability-based inference under incomplete observation due to study design. This is illustrated using a generic two-phase cohort sampling design as an example.
The alternative approaches presented for analysis of such a design are full likelihood, which utilizes all observed information, and conditional likelihood, which is restricted to a completely observed set, conditioning on the rule that generated that set. Conditional likelihood inference is also applied for a joint analysis of prevalence and incidence data, a situation subject to both left censoring and left truncation. Other topics covered are model uncertainty and causal inference using posterior predictive distributions. We formulate a non-parametric monotonic regression model for one or more covariates and a Bayesian estimation procedure, and apply the model in the context of optimal sequential treatment regimes, demonstrating that inference based on posterior predictive distributions is feasible also in this case.
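The effect of conditioning on the observation rule can be illustrated with left truncation in its simplest form (a hypothetical exponential example, not one of the thesis's actual models): each subject's density is divided by the probability of being observed at all, i.e., of the onset exceeding the entry time.

```python
import math

def loglik_lt_exp(rate, times, entry):
    """Log-likelihood of Exponential(rate) onset times under left
    truncation: subject i is observed only because onset exceeded the
    entry time, so each density f(t) is divided by S(entry)."""
    ll = 0.0
    for t, e in zip(times, entry):
        ll += math.log(rate) - rate * t  # log f(t)
        ll += rate * e                   # minus log S(e): condition on onset > entry
    return ll

def mle_rate(times, entry):
    """Closed form: by memorylessness, the residual times t - e are iid
    Exponential(rate), so the MLE is n / sum(t - e)."""
    return len(times) / sum(t - e for t, e in zip(times, entry))
```

Ignoring the truncation term would systematically overstate the onset times and bias the rate downward; the conditional likelihood corrects for exactly the rule that generated the observed set.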