844 resultados para Markov decision processes
Resumo:
Bounded parameter Markov Decision Processes (BMDPs) address the issue of dealing with uncertainty in the parameters of a Markov Decision Process (MDP). Unlike the case of an MDP, the notion of an optimal policy for a BMDP is not entirely straightforward. We consider two notions of optimality based on optimistic and pessimistic criteria. These have been analyzed for discounted BMDPs. Here we provide results for average reward BMDPs. We establish a fundamental relationship between the discounted and the average reward problems, prove the existence of Blackwell optimal policies and, for both notions of optimality, derive algorithms that converge to the optimal value function.
Resumo:
Exploiting wind-energy is one possible way to ex- tend flight duration for Unmanned Arial Vehicles. Wind-energy can also be used to minimise energy consumption for a planned path. In this paper, we consider uncertain time-varying wind fields and plan a path through them. A Gaussian distribution is used to determine uncertainty in the Time-varying wind fields. We use Markov Decision Process to plan a path based upon the uncertainty of Gaussian distribution. Simulation results that compare the direct line of flight between start and target point and our planned path for energy consumption and time of travel are presented. The result is a robust path using the most visited cell while sampling the Gaussian distribution of the wind field in each cell.
Resumo:
Ocean processes are complex and have high variability in both time and space. Thus, ocean scientists must collect data over long time periods to obtain a synoptic view of ocean processes and resolve their spatiotemporal variability. One way to perform these persistent observations is to utilise an autonomous vehicle that can remain on deployment for long time periods. However, such vehicles are generally underactuated and slow moving. A challenge for persistent monitoring with these vehicles is dealing with currents while executing a prescribed path or mission. Here we present a path planning method for persistent monitoring that exploits ocean currents to increase navigational accuracy and reduce energy consumption.
Resumo:
Exploiting wind-energy is one possible way to extend flight duration for Unmanned Arial Vehicles. Wind-energy can also be used to minimise energy consumption for a planned path. In this paper, we consider uncertain time-varying wind fields and plan a path through them. A Gaussian distribution is used to determine uncertainty in the Time-varying wind fields. We use Markov Decision Process to plan a path based upon the uncertainty of Gaussian distribution. Simulation results that compare the direct line of flight between start and target point and our planned path for energy consumption and time of travel are presented. The result is a robust path using the most visited cell while sampling the Gaussian distribution of the wind field in each cell.
Resumo:
We develop a simulation based algorithm for finite horizon Markov decision processes with finite state and finite action space. Illustrative numerical experiments with the proposed algorithm are shown for problems in flow control of communication networks and capacity switching in semiconductor fabrication.
Resumo:
We develop in this article the first actor-critic reinforcement learning algorithm with function approximation for a problem of control under multiple inequality constraints. We consider the infinite horizon discounted cost framework in which both the objective and the constraint functions are suitable expected policy-dependent discounted sums of certain sample path functions. We apply the Lagrange multiplier method to handle the inequality constraints. Our algorithm makes use of multi-timescale stochastic approximation and incorporates a temporal difference (TD) critic and an actor that makes a gradient search in the space of policy parameters using efficient simultaneous perturbation stochastic approximation (SPSA) gradient estimates. We prove the asymptotic almost sure convergence of our algorithm to a locally optimal policy. (C) 2010 Elsevier B.V. All rights reserved.
Resumo:
The existence of an optimal feedback law is established for the risk-sensitive optimal control problem with denumerable state space. The main assumptions imposed are irreducibility and a near monotonicity condition on the one-step cost function. A solution can be found constructively using either value iteration or policy iteration under suitable conditions on initial feedback law.
Resumo:
We develop a simulation based algorithm for finite horizon Markov decision processes with finite state and finite action space. Illustrative numerical experiments with the proposed algorithm are shown for problems in flow control of communication networks and capacity switching in semiconductor fabrication.
Resumo:
We develop an online actor-critic reinforcement learning algorithm with function approximation for a problem of control under inequality constraints. We consider the long-run average cost Markov decision process (MDP) framework in which both the objective and the constraint functions are suitable policy-dependent long-run averages of certain sample path functions. The Lagrange multiplier method is used to handle the inequality constraints. We prove the asymptotic almost sure convergence of our algorithm to a locally optimal solution. We also provide the results of numerical experiments on a problem of routing in a multi-stage queueing network with constraints on long-run average queue lengths. We observe that our algorithm exhibits good performance on this setting and converges to a feasible point.
Resumo:
We introduce and study a class of non-stationary semi-Markov decision processes on a finite horizon. By constructing an equivalent Markov decision process, we establish the existence of a piecewise open loop relaxed control which is optimal for the finite horizon problem.
Resumo:
We present a novel multi-timescale Q-learning algorithm for average cost control in a Markov decision process subject to multiple inequality constraints. We formulate a relaxed version of this problem through the Lagrange multiplier method. Our algorithm is different from Q-learning in that it updates two parameters - a Q-value parameter and a policy parameter. The Q-value parameter is updated on a slower time scale as compared to the policy parameter. Whereas Q-learning with function approximation can diverge in some cases, our algorithm is seen to be convergent as a result of the aforementioned timescale separation. We show the results of experiments on a problem of constrained routing in a multistage queueing network. Our algorithm is seen to exhibit good performance and the various inequality constraints are seen to be satisfied upon convergence of the algorithm.
Resumo:
This work addresses the problem of estimating the optimal value function in a Markov Decision Process from observed state-action pairs. We adopt a Bayesian approach to inference, which allows both the model to be estimated and predictions about actions to be made in a unified framework, providing a principled approach to mimicry of a controller on the basis of observed data. A new Markov chain Monte Carlo (MCMC) sampler is devised for simulation from theposterior distribution over the optimal value function. This step includes a parameter expansion step, which is shown to be essential for good convergence properties of the MCMC sampler. As an illustration, the method is applied to learning a human controller.