578 resultados para Dirichlet-multinomial
Resumo:
Archaeozoological mortality profiles have been used to infer site-specific subsistence strategies. There is however no common agreement on the best way to present these profiles and confidence intervals around age class proportions. In order to deal with these issues, we propose the use of the Dirichlet distribution and present a new approach to perform age-at-death multivariate graphical comparisons. We demonstrate the efficiency of this approach using domestic sheep/goat dental remains from 10 Cardial sites (Early Neolithic) located in South France and the Iberian Peninsula. We show that the Dirichlet distribution in age-at-death analysis can be used: (i) to generate Bayesian credible intervals around each age class of a mortality profile, even when not all age classes are observed; and (ii) to create 95% kernel density contours around each age-at-death frequency distribution when multiple sites are compared using correspondence analysis. The statistical procedure we present is applicable to the analysis of any categorical count data and particularly well-suited to archaeological data (e.g. potsherds, arrow heads) where sample sizes are typically small.
Resumo:
Text cohesion is an important element of discourse processing. This paper presents a new approach to modeling, quantifying, and visualizing text cohesion using automated cohesion flow indices that capture semantic links among paragraphs. Cohesion flow is calculated by applying Cohesion Network Analysis, a combination of semantic distances, Latent Semantic Analysis, and Latent Dirichlet Allocation, as well as Social Network Analysis. Experiments performed on 315 timed essays indicated that cohesion flow indices are significantly correlated with human ratings of text coherence and essay quality. Visualizations of the global cohesion indices are also included to support a more facile understanding of how cohesion flow impacts coherence in terms of semantic dependencies between paragraphs.
Resumo:
This paper introduces a novel, in-depth approach of analyzing the differences in writing style between two famous Romanian orators, based on automated textual complexity indices for Romanian language. The considered authors are: (a) Mihai Eminescu, Romania’s national poet and a remarkable journalist of his time, and (b) Ion C. Brătianu, one of the most important Romanian politicians from the middle of the 18th century. Both orators have a common journalistic interest consisting in their desire to spread the word about political issues in Romania via the printing press, the most important public voice at that time. In addition, both authors exhibit writing style particularities, and our aim is to explore these differences through our ReaderBench framework that computes a wide range of lexical and semantic textual complexity indices for Romanian and other languages. The used corpus contains two collections of speeches for each orator that cover the period 1857–1880. The results of this study highlight the lexical and cohesive textual complexity indices that reflect very well the differences in writing style, measures relying on Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) semantic models.
Resumo:
El artículo analiza los cambios político electorales en León, Guanajuato, a partir de cómo se fue configurando el desplazamiento del Partido Revolucionario Institucional (PRI) por el Partido Acción Nacional (PAN) en este ayuntamiento en el año 1988, hasta el cambio de correlación de fuerzas en el año 2012. Ello da pauta para analizar los escenarios que podrían caracterizar las próximas elecciones de este año. Con este objetivo se propone un modelo estadístico para dicho estudio: el modelo de regresión Dirichlet, el cual permite considerar la naturaleza de los datos electorales.
The article analyzes the electoral changes in León, Guanajuato, based on how it was setting the displacement of the Institutional Revolutionary Party (PRI) by the National Action Party (PAN) in this council in 1988, until the change of correlation forces in 2012, which gives guidelines to analyze the scenarios that could characterize the upcoming elections this year. With this aim the authors proposed a statistical model for the study: the Dirichlet regression model, which allows to consider the nature of electoral data.
Resumo:
Tests for dependence of continuous, discrete and mixed continuous-discrete variables are ubiquitous in science. The goal of this paper is to derive Bayesian alternatives to frequentist null hypothesis significance tests for dependence. In particular, we will present three Bayesian tests for dependence of binary, continuous and mixed variables. These tests are nonparametric and based on the Dirichlet Process, which allows us to use the same prior model for all of them. Therefore, the tests are “consistent” among each other, in the sense that the probabilities that variables are dependent computed with these tests are commensurable across the different types of variables being tested. By means of simulations with artificial data, we show the effectiveness of the new tests.
Resumo:
Genome-wide association studies (GWAS) of schizophrenia have yielded more than 100 common susceptibility variants, and strongly support a substantial polygenic contribution of a large number of small allelic effects. It has been hypothesized that familial schizophrenia is largely a consequence of inherited rather than environmental factors. We investigated the extent to which familiality of schizophrenia is associated with enrichment for common risk variants detectable in a large GWAS. We analyzed single nucleotide polymorphism (SNP) data for cases reporting a family history of psychotic illness (N = 978), cases reporting no such family history (N = 4,503), and unscreened controls (N = 8,285) from the Psychiatric Genomics Consortium (PGC1) study of schizophrenia. We used a multinomial logistic regression approach with model-fitting to detect allelic effects specific to either family history subgroup. We also considered a polygenic model, in which we tested whether family history positive subjects carried more schizophrenia risk alleles than family history negative subjects, on average. Several individual SNPs attained suggestive but not genome-wide significant association with either family history subgroup. Comparison of genome-wide polygenic risk scores based on GWAS summary statistics indicated a significant enrichment for SNP effects among family history positive compared to family history negative cases (Nagelkerke's R(2 ) = 0.0021; P = 0.00331; P-value threshold <0.4). Estimates of variability in disease liability attributable to the aggregate effect of genome-wide SNPs were significantly greater for family history positive compared to family history negative cases (0.32 and 0.22, respectively; P = 0.031). We found suggestive evidence of allelic effects detectable in large GWAS of schizophrenia that might be specific to particular family history subgroups. However, consideration of a polygenic risk score indicated a significant enrichment among family history positive cases for common allelic effects. Familial illness might, therefore, represent a more heritable form of schizophrenia, as suggested by previous epidemiological studies.
Resumo:
Thesis (Ph.D.)--University of Washington, 2016-08
Resumo:
This paper presents a study of the effects of alcohol consumption on household income in Ireland using the Slán National Health and Lifestyle Survey 2007 dataset, accounting for endogeneity and selection bias. Drinkers are categorised into one of four categories based on the recommended weekly drinking levels by the Irish Health Promotion Unit; those who never drank, non-drinkers, moderate and heavy drinkers. A multinomial logit OLS Two Step Estimate is used to explain individual's choice of drinking status and to correct for selection bias which would result in the selection into a particular category of drinking being endogenous. Endogeneity which may arise through the simultaneity of drinking status and income either due to the reverse causation between the two variables, income affecting alcohol consumption or alcohol consumption affecting income, or due to unobserved heterogeneity, is addressed. This paper finds that the household income of drinkers is higher than that of non-drinkers and of those who never drank. There is very little difference between the household income of moderate and heavy drinkers, with heavy drinkers earning slightly more. Weekly household income for those who never drank is €454.20, non-drinkers is €506.26, compared with €683.36 per week for moderate drinkers and €694.18 for heavy drinkers.
Resumo:
The Short Term Assessment of Risk and Treatability is a structured judgement tool used to inform risk estimation for multiple adverse outcomes. In research, risk estimates outperform the tool's strength and vulnerability scales for violence prediction. Little is known about what its’component parts contribute to the assignment of risk estimates and how those estimates fare in prediction of non-violent adverse outcomes compared with the structured components. START assessment and outcomes data from a secure mental health service (N=84) was collected. Binomial and multinomial regression analyses determined the contribution of selected elements of the START structured domain and recent adverse risk events to risk estimates and outcomes prediction for violence, self-harm/suicidality, victimisation, and self-neglect. START vulnerabilities and lifetime history of violence, predicted the violence risk estimate; self-harm and victimisation estimates were predicted only by corresponding recent adverse events. Recent adverse events uniquely predicted all corresponding outcomes, with the exception of self-neglect which was predicted by the strength scale. Only for victimisation did the risk estimate outperform prediction based on the START components and recent adverse events. In the absence of recent corresponding risk behaviour, restrictions imposed on the basis of START-informed risk estimates could be unwarranted and may be unethical.
Resumo:
With the dramatic growth of text information, there is an increasing need for powerful text mining systems that can automatically discover useful knowledge from text. Text is generally associated with all kinds of contextual information. Those contexts can be explicit, such as the time and the location where a blog article is written, and the author(s) of a biomedical publication, or implicit, such as the positive or negative sentiment that an author had when she wrote a product review; there may also be complex context such as the social network of the authors. Many applications require analysis of topic patterns over different contexts. For instance, analysis of search logs in the context of the user can reveal how we can improve the quality of a search engine by optimizing the search results according to particular users; analysis of customer reviews in the context of positive and negative sentiments can help the user summarize public opinions about a product; analysis of blogs or scientific publications in the context of a social network can facilitate discovery of more meaningful topical communities. Since context information significantly affects the choices of topics and language made by authors, in general, it is very important to incorporate it into analyzing and mining text data. In general, modeling the context in text, discovering contextual patterns of language units and topics from text, a general task which we refer to as Contextual Text Mining, has widespread applications in text mining. In this thesis, we provide a novel and systematic study of contextual text mining, which is a new paradigm of text mining treating context information as the ``first-class citizen.'' We formally define the problem of contextual text mining and its basic tasks, and propose a general framework for contextual text mining based on generative modeling of text. This conceptual framework provides general guidance on text mining problems with context information and can be instantiated into many real tasks, including the general problem of contextual topic analysis. We formally present a functional framework for contextual topic analysis, with a general contextual topic model and its various versions, which can effectively solve the text mining problems in a lot of real world applications. We further introduce general components of contextual topic analysis, by adding priors to contextual topic models to incorporate prior knowledge, regularizing contextual topic models with dependency structure of context, and postprocessing contextual patterns to extract refined patterns. The refinements on the general contextual topic model naturally lead to a variety of probabilistic models which incorporate different types of context and various assumptions and constraints. These special versions of the contextual topic model are proved effective in a variety of real applications involving topics and explicit contexts, implicit contexts, and complex contexts. We then introduce a postprocessing procedure for contextual patterns, by generating meaningful labels for multinomial context models. This method provides a general way to interpret text mining results for real users. By applying contextual text mining in the ``context'' of other text information management tasks, including ad hoc text retrieval and web search, we further prove the effectiveness of contextual text mining techniques in a quantitative way with large scale datasets. The framework of contextual text mining not only unifies many explorations of text analysis with context information, but also opens up many new possibilities for future research directions in text mining.
Resumo:
The purpose of this study was to examine relationships between multiple characteristics of maternal employment, parenting practices, and adolescents’ transition outcomes to young adulthood. The research addressed four main research questions. First, are the characteristics of maternal work (i.e., hours worked, multiple jobs held, work schedules, earnings, and occupation) related to adolescents’ enrollment in post-secondary education, employment, or involvement in neither of these types of activities as young adults? Second, are the work characteristics related to parental involvement and monitoring, and are the parenting practices related to adolescents’ transition outcomes? Third, do parental involvement and monitoring mediate any relationships between the characteristics of maternal employment and adolescents’ transition outcomes? Finally, do any associations between characteristics of maternal employment and parenting practices and adolescents’ transition outcomes vary by poverty status, race/ethnicity, or gender? To address these research questions, secondary data analysis was conducted, using data from the National Longitudinal Survey of Youth (NLSY) from 1998 through 2004. The study sample consisted of 849 youths who were 15 through 17 years of age in either 1998 or 2000, and were 19 through 21 years of age when their transition outcomes in young adulthood were measured four years later. Multinomial logistic and ordinary least squares regression models were estimated to answer the research questions. Study findings indicated that of the maternal work characteristics, mothers’ multiple jobs held, occupation, and work schedule were significantly related to the youths’ transition outcomes. When mothers held multiple jobs for 1 to 25 weeks per year, and when mothers held jobs involving lower levels of occupational complexity, their youths were more likely to experience employment rather than post-secondary education. Adolescents whose mothers worked a standard work schedule were less likely to experience other types of transitions than post-secondary education. With regard to the effects of maternal employment on parenting practices, none of the maternal work variables were related to parental involvement, and only one variable, mothers working less than 40 hours per week, was negatively related to parental monitoring. In addition, when parents were more involved with their youths’ education, the youths were less likely to transition into employment and other types of transitions rather than post-secondary education. The parenting practices did not mediate the relation between the significant work variables (holding multiple jobs, work schedule, and occupation) and youths’ transition outcomes. Finally, none of the interactions between maternal work characteristics and poverty status, race/ethnicity, and gender met the criteria for determining significance; but in a series of sub-group analyses, some differences according to poverty status and gender were found. Despite the lack of mediation and moderation, the findings of this study have important implications for social policy and social work intervention. Based on the findings, suggestions are made in these areas to improve working mothers’ lives and their adolescents’ development and successful transition to adulthood. Finally, directions for future research are discussed.
Resumo:
Abstract Scheduling problems are generally NP-hard combinatorial problems, and a lot of research has been done to solve these problems heuristically. However, most of the previous approaches are problem-specific and research into the development of a general scheduling algorithm is still in its infancy. Mimicking the natural evolutionary process of the survival of the fittest, Genetic Algorithms (GAs) have attracted much attention in solving difficult scheduling problems in recent years. Some obstacles exist when using GAs: there is no canonical mechanism to deal with constraints, which are commonly met in most real-world scheduling problems, and small changes to a solution are difficult. To overcome both difficulties, indirect approaches have been presented (in [1] and [2]) for nurse scheduling and driver scheduling, where GAs are used by mapping the solution space, and separate decoding routines then build solutions to the original problem. In our previous indirect GAs, learning is implicit and is restricted to the efficient adjustment of weights for a set of rules that are used to construct schedules. The major limitation of those approaches is that they learn in a non-human way: like most existing construction algorithms, once the best weight combination is found, the rules used in the construction process are fixed at each iteration. However, normally a long sequence of moves is needed to construct a schedule and using fixed rules at each move is thus unreasonable and not coherent with human learning processes. When a human scheduler is working, he normally builds a schedule step by step following a set of rules. After much practice, the scheduler gradually masters the knowledge of which solution parts go well with others. He can identify good parts and is aware of the solution quality even if the scheduling process is not completed yet, thus having the ability to finish a schedule by using flexible, rather than fixed, rules. In this research we intend to design more human-like scheduling algorithms, by using ideas derived from Bayesian Optimization Algorithms (BOA) and Learning Classifier Systems (LCS) to implement explicit learning from past solutions. BOA can be applied to learn to identify good partial solutions and to complete them by building a Bayesian network of the joint distribution of solutions [3]. A Bayesian network is a directed acyclic graph with each node corresponding to one variable, and each variable corresponding to individual rule by which a schedule will be constructed step by step. The conditional probabilities are computed according to an initial set of promising solutions. Subsequently, each new instance for each node is generated by using the corresponding conditional probabilities, until values for all nodes have been generated. Another set of rule strings will be generated in this way, some of which will replace previous strings based on fitness selection. If stopping conditions are not met, the Bayesian network is updated again using the current set of good rule strings. The algorithm thereby tries to explicitly identify and mix promising building blocks. It should be noted that for most scheduling problems the structure of the network model is known and all the variables are fully observed. In this case, the goal of learning is to find the rule values that maximize the likelihood of the training data. Thus learning can amount to 'counting' in the case of multinomial distributions. In the LCS approach, each rule has its strength showing its current usefulness in the system, and this strength is constantly assessed [4]. To implement sophisticated learning based on previous solutions, an improved LCS-based algorithm is designed, which consists of the following three steps. The initialization step is to assign each rule at each stage a constant initial strength. Then rules are selected by using the Roulette Wheel strategy. The next step is to reinforce the strengths of the rules used in the previous solution, keeping the strength of unused rules unchanged. The selection step is to select fitter rules for the next generation. It is envisaged that the LCS part of the algorithm will be used as a hill climber to the BOA algorithm. This is exciting and ambitious research, which might provide the stepping-stone for a new class of scheduling algorithms. Data sets from nurse scheduling and mall problems will be used as test-beds. It is envisaged that once the concept has been proven successful, it will be implemented into general scheduling algorithms. It is also hoped that this research will give some preliminary answers about how to include human-like learning into scheduling algorithms and may therefore be of interest to researchers and practitioners in areas of scheduling and evolutionary computation. References 1. Aickelin, U. and Dowsland, K. (2003) 'Indirect Genetic Algorithm for a Nurse Scheduling Problem', Computer & Operational Research (in print). 2. Li, J. and Kwan, R.S.K. (2003), 'Fuzzy Genetic Algorithm for Driver Scheduling', European Journal of Operational Research 147(2): 334-344. 3. Pelikan, M., Goldberg, D. and Cantu-Paz, E. (1999) 'BOA: The Bayesian Optimization Algorithm', IlliGAL Report No 99003, University of Illinois. 4. Wilson, S. (1994) 'ZCS: A Zeroth-level Classifier System', Evolutionary Computation 2(1), pp 1-18.
Resumo:
A Bayesian optimisation algorithm for a nurse scheduling problem is presented, which involves choosing a suitable scheduling rule from a set for each nurse's assignment. When a human scheduler works, he normally builds a schedule systematically following a set of rules. After much practice, the scheduler gradually masters the knowledge of which solution parts go well with others. He can identify good parts and is aware of the solution quality even if the scheduling process is not yet completed, thus having the ability to finish a schedule by using flexible, rather than fixed, rules. In this paper, we design a more human-like scheduling algorithm, by using a Bayesian optimisation algorithm to implement explicit learning from past solutions. A nurse scheduling problem from a UK hospital is used for testing. Unlike our previous work that used Genetic Algorithms to implement implicit learning [1], the learning in the proposed algorithm is explicit, i.e. we identify and mix building blocks directly. The Bayesian optimisation algorithm is applied to implement such explicit learning by building a Bayesian network of the joint distribution of solutions. The conditional probability of each variable in the network is computed according to an initial set of promising solutions. Subsequently, each new instance for each variable is generated by using the corresponding conditional probabilities, until all variables have been generated, i.e. in our case, new rule strings have been obtained. Sets of rule strings are generated in this way, some of which will replace previous strings based on fitness. If stopping conditions are not met, the conditional probabilities for all nodes in the Bayesian network are updated again using the current set of promising rule strings. For clarity, consider the following toy example of scheduling five nurses with two rules (1: random allocation, 2: allocate nurse to low-cost shifts). In the beginning of the search, the probabilities of choosing rule 1 or 2 for each nurse is equal, i.e. 50%. After a few iterations, due to the selection pressure and reinforcement learning, we experience two solution pathways: Because pure low-cost or random allocation produces low quality solutions, either rule 1 is used for the first 2-3 nurses and rule 2 on remainder or vice versa. In essence, Bayesian network learns 'use rule 2 after 2-3x using rule 1' or vice versa. It should be noted that for our and most other scheduling problems, the structure of the network model is known and all variables are fully observed. In this case, the goal of learning is to find the rule values that maximize the likelihood of the training data. Thus, learning can amount to 'counting' in the case of multinomial distributions. For our problem, we use our rules: Random, Cheapest Cost, Best Cover and Balance of Cost and Cover. In more detail, the steps of our Bayesian optimisation algorithm for nurse scheduling are: 1. Set t = 0, and generate an initial population P(0) at random; 2. Use roulette-wheel selection to choose a set of promising rule strings S(t) from P(t); 3. Compute conditional probabilities of each node according to this set of promising solutions; 4. Assign each nurse using roulette-wheel selection based on the rules' conditional probabilities. A set of new rule strings O(t) will be generated in this way; 5. Create a new population P(t+1) by replacing some rule strings from P(t) with O(t), and set t = t+1; 6. If the termination conditions are not met (we use 2000 generations), go to step 2. Computational results from 52 real data instances demonstrate the success of this approach. They also suggest that the learning mechanism in the proposed approach might be suitable for other scheduling problems. Another direction for further research is to see if there is a good constructing sequence for individual data instances, given a fixed nurse scheduling order. If so, the good patterns could be recognized and then extracted as new domain knowledge. Thus, by using this extracted knowledge, we can assign specific rules to the corresponding nurses beforehand, and only schedule the remaining nurses with all available rules, making it possible to reduce the solution space. Acknowledgements The work was funded by the UK Government's major funding agency, Engineering and Physical Sciences Research Council (EPSRC), under grand GR/R92899/01. References [1] Aickelin U, "An Indirect Genetic Algorithm for Set Covering Problems", Journal of the Operational Research Society, 53(10): 1118-1126,
Resumo:
Objective: To describe the association between consumption of different alcoholic beverages and adherence to the Mediterranean diet. Methods: A cross-sectional analysis was conducted of the baseline data of the DiSA-UMH study, an ongoing cohort study with Spanish health science students (n = 1098) aged 17-35 years. Dietary information was collected by a validated 84-item food frequency questionnaire. Participants were grouped into non-drinkers, exclusive beer and/or wine drinkers and drinkers of all types of alcoholic beverages. Mediterranean diet adherence was determined by using a modification of the relative Mediterranean Diet Score (rMED; score range: 0-16) according to consumption of 8 dietary components. We performed multiple linear and multinomial regression analyses. Results: The mean alcohol consumption was 4.3 g/day (SD: 6.1). A total of 19.5%, 18.9% and 61.6% of the participants were non-drinkers, exclusive beer and/or wine drinkers and drinkers of all types of alcoholic beverages, respectively. Participants who consumed beer and/or wine exclusively had higher rMED scores than non-drinkers (β: 0.76, 95%CI: 0.25-1.27). Drinkers of all types of alcoholic beverages had similar rMED scores to non-drinkers. Non-drinkers consumed less fish and more meat, whereas drinkers of all types of alcoholic beverages consumed fewer fruits, vegetables and more meat than exclusive beer and/or wine drinkers. Conclusions: The overall alcohol consumption among the students in our study was low-to-moderate. Exclusive beer and/or wine drinkers differed regarding the Mediterranean diet pattern from non-drinkers and drinkers of all types of alcohol. These results show the need to properly adjust for diet in studies of the effects of alcohol consumption.
Resumo:
In this dissertation I draw a connection between quantum adiabatic optimization, spectral graph theory, heat-diffusion, and sub-stochastic processes through the operators that govern these processes and their associated spectra. In particular, we study Hamiltonians which have recently become known as ``stoquastic'' or, equivalently, the generators of sub-stochastic processes. The operators corresponding to these Hamiltonians are of interest in all of the settings mentioned above. I predominantly explore the connection between the spectral gap of an operator, or the difference between the two lowest energies of that operator, and certain equilibrium behavior. In the context of adiabatic optimization, this corresponds to the likelihood of solving the optimization problem of interest. I will provide an instance of an optimization problem that is easy to solve classically, but leaves open the possibility to being difficult adiabatically. Aside from this concrete example, the work in this dissertation is predominantly mathematical and we focus on bounding the spectral gap. Our primary tool for doing this is spectral graph theory, which provides the most natural approach to this task by simply considering Dirichlet eigenvalues of subgraphs of host graphs. I will derive tight bounds for the gap of one-dimensional, hypercube, and general convex subgraphs. The techniques used will also adapt methods recently used by Andrews and Clutterbuck to prove the long-standing ``Fundamental Gap Conjecture''.