25 resultados para Decision Processes
Resumo:
This work addresses the problem of estimating the optimal value function in a Markov Decision Process from observed state-action pairs. We adopt a Bayesian approach to inference, which allows both the model to be estimated and predictions about actions to be made in a unified framework, providing a principled approach to mimicry of a controller on the basis of observed data. A new Markov chain Monte Carlo (MCMC) sampler is devised for simulation from theposterior distribution over the optimal value function. This step includes a parameter expansion step, which is shown to be essential for good convergence properties of the MCMC sampler. As an illustration, the method is applied to learning a human controller.
Resumo:
Decisions about noisy stimuli require evidence integration over time. Traditionally, evidence integration and decision making are described as a one-stage process: a decision is made when evidence for the presence of a stimulus crosses a threshold. Here, we show that one-stage models cannot explain psychophysical experiments on feature fusion, where two visual stimuli are presented in rapid succession. Paradoxically, the second stimulus biases decisions more strongly than the first one, contrary to predictions of one-stage models and intuition. We present a two-stage model where sensory information is integrated and buffered before it is fed into a drift diffusion process. The model is tested in a series of psychophysical experiments and explains both accuracy and reaction time distributions. © 2012 Rüter et al.
Resumo:
This work shows how a dialogue model can be represented as a Partially Observable Markov Decision Process (POMDP) with observations composed of a discrete and continuous component. The continuous component enables the model to directly incorporate a confidence score for automated planning. Using a testbed simulated dialogue management problem, we show how recent optimization techniques are able to find a policy for this continuous POMDP which outperforms a traditional MDP approach. Further, we present a method for automatically improving handcrafted dialogue managers by incorporating POMDP belief state monitoring, including confidence score information. Experiments on the testbed system show significant improvements for several example handcrafted dialogue managers across a range of operating conditions.
Resumo:
We consider the inverse reinforcement learning problem, that is, the problem of learning from, and then predicting or mimicking a controller based on state/action data. We propose a statistical model for such data, derived from the structure of a Markov decision process. Adopting a Bayesian approach to inference, we show how latent variables of the model can be estimated, and how predictions about actions can be made, in a unified framework. A new Markov chain Monte Carlo (MCMC) sampler is devised for simulation from the posterior distribution. This step includes a parameter expansion step, which is shown to be essential for good convergence properties of the MCMC sampler. As an illustration, the method is applied to learning a human controller.
Resumo:
Production responsiveness refers to the ability of a production system to achieve its operational goals in the presence of supplier, internal and customer disturbances, where disturbances are those sources of change which occur independently of the system's intentions. A set of audit tools for assessing the responsiveness of production operations is being prepared as part of an EPSRC funded investigation. These tools are based on the idea that the ability to respond is linked to: the nature of the disturbances or changes requiring a response; their impact on production goals; and the inherent response capabilities of the operation. These response capabilities include information gathering and processing (to detect disturbances and production conditions), decision processes (which initiate system responses to disturbances) and various types of process flexibilities and buffers (which provide the physical means of dealing with disturbances). The paper discusses concepts and issues associated with production responsiveness, describes the audit tools that have been developed and illustrates their use in the context of a steel manufacturing plant.
Resumo:
Although partially observable Markov decision processes (POMDPs) have shown great promise as a framework for dialog management in spoken dialog systems, important scalability issues remain. This paper tackles the problem of scaling slot-filling POMDP-based dialog managers to many slots with a novel technique called composite point-based value iteration (CSPBVI). CSPBVI creates a "local" POMDP policy for each slot; at runtime, each slot nominates an action and a heuristic chooses which action to take. Experiments in dialog simulation show that CSPBVI successfully scales POMDP-based dialog managers without compromising performance gains over baseline techniques and preserving robustness to errors in user model estimation. Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
Resumo:
This article presents a novel algorithm for learning parameters in statistical dialogue systems which are modeled as Partially Observable Markov Decision Processes (POMDPs). The three main components of a POMDP dialogue manager are a dialogue model representing dialogue state information; a policy that selects the system's responses based on the inferred state; and a reward function that specifies the desired behavior of the system. Ideally both the model parameters and the policy would be designed to maximize the cumulative reward. However, while there are many techniques available for learning the optimal policy, no good ways of learning the optimal model parameters that scale to real-world dialogue systems have been found yet. The presented algorithm, called the Natural Actor and Belief Critic (NABC), is a policy gradient method that offers a solution to this problem. Based on observed rewards, the algorithm estimates the natural gradient of the expected cumulative reward. The resulting gradient is then used to adapt both the prior distribution of the dialogue model parameters and the policy parameters. In addition, the article presents a variant of the NABC algorithm, called the Natural Belief Critic (NBC), which assumes that the policy is fixed and only the model parameters need to be estimated. The algorithms are evaluated on a spoken dialogue system in the tourist information domain. The experiments show that model parameters estimated to maximize the expected cumulative reward result in significantly improved performance compared to the baseline hand-crafted model parameters. The algorithms are also compared to optimization techniques using plain gradients and state-of-the-art random search algorithms. In all cases, the algorithms based on the natural gradient work significantly better. © 2011 ACM.
Resumo:
A set of audit tools is being prepared for assessing the response capability of a production operation, as part of an EPSRC1 funded investigation into improving the responsiveness of manufacturing production systems. These tools are based on the idea that the ability to respond is linked to i) the nature of the disturbances or changes requiring a response, ii) their impact on production goals and iii) the decision processes which initiate system responses to disturbances.