24 results for critic
Abstract:
This article proposes a three-timescale simulation-based algorithm for the solution of infinite-horizon Markov Decision Processes (MDPs). We assume a finite state space and a discounted cost criterion and adopt the value iteration approach. An approximation of the Dynamic Programming operator T is applied to the value function iterates. This 'approximate' operator is implemented using three timescales, the slowest of which updates the value function iterates. On the middle timescale, we perform a gradient search over the feasible action set of each state using Simultaneous Perturbation Stochastic Approximation (SPSA) gradient estimates, thus finding the minimizing action in T. On the fastest timescale, the 'critic' estimates over which the gradient search is performed are obtained. A sketch of convergence explaining the dynamics of the algorithm using associated ODEs is also presented. Numerical experiments on rate-based flow control at a bottleneck node, using a continuous-time queueing model, are performed with the proposed algorithm. The results obtained are verified against classical value iteration with a suitably discretized feasible set. In this discretized setting, the proposed algorithm is also compared with a variant of the algorithm of [12] and is found to converge faster.
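To make the middle-timescale step concrete, below is a minimal sketch of a two-sided SPSA gradient estimate, assuming the critic's cost estimate is available as a noisy callable `f`; the function name, the fixed perturbation size `c`, and the Rademacher perturbations are illustrative choices, not the paper's exact construction.

```python
import numpy as np

def spsa_gradient(f, a, c=0.1, rng=None):
    """Two-sided SPSA estimate of the gradient of f at the action vector a.

    f : noisy evaluation of the critic's cost estimate (assumed callable);
    c : perturbation size (a decreasing step-size sequence in the
        actual multi-timescale algorithm).
    """
    rng = rng or np.random.default_rng()
    delta = rng.choice([-1.0, 1.0], size=a.shape)   # Rademacher perturbations
    # Two simulations per estimate, independent of the dimension of a:
    return (f(a + c * delta) - f(a - c * delta)) / (2.0 * c * delta)
```

Only two evaluations of `f` are needed per estimate, regardless of the dimension of the action vector, which is what makes SPSA attractive for the gradient search over the feasible action set.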
Abstract:
An approximate dynamic programming (ADP) based neurocontroller is developed for a heat transfer application. The heat transfer problem for a fin in a car's electronic module is modeled as a nonlinear distributed parameter (infinite-dimensional) system by taking into account heat loss and generation due to conduction, convection, and radiation. A low-order, finite-dimensional lumped parameter model for this problem is obtained by using Galerkin projection and basis functions designed through the 'proper orthogonal decomposition' (POD) technique and the 'snapshot' solutions. A suboptimal neurocontroller is obtained with a single network adaptive critic (SNAC). A further contribution of this paper is the development of an online robust controller to account for unmodeled dynamics and parametric uncertainties. A weight update rule is presented that guarantees boundedness of the weights and eliminates the need for the persistence of excitation (PE) condition to be satisfied. Since the ADP- and neural-network-based controllers have a fairly general structure, they appear to have the potential to serve as controller synthesis tools for nonlinear distributed parameter systems, especially where it is difficult to obtain an accurate model.
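As an illustration of the model-reduction step, here is a minimal sketch of computing a POD basis from snapshot solutions via the thin SVD; the variable names are hypothetical, and the paper may equally use the method-of-snapshots eigendecomposition.

```python
import numpy as np

def pod_basis(snapshots, num_modes):
    """Compute POD basis functions from a snapshot matrix.

    snapshots : (n_grid, n_snapshots) array, each column one 'snapshot'
                solution of the distributed parameter system.
    num_modes : number of retained basis functions (the reduced order).
    """
    # Thin SVD: the left singular vectors are the POD modes,
    # ordered by the energy they capture (the singular values).
    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)   # captured-energy fraction
    return U[:, :num_modes], energy[num_modes - 1]
```

The retained modes then serve as the trial functions in the Galerkin projection that yields the low-order lumped parameter model.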
Abstract:
In this paper, we use reinforcement learning (RL) as a tool to study price dynamics in an electronic retail market consisting of two competing sellers and price-sensitive, lead-time-sensitive customers. Sellers, offering identical products, compete on price to satisfy stochastically arriving demands (customers), and follow standard inventory control and replenishment policies to manage their inventories. RL techniques have not previously been applied in such a generalized setting. We consider two representative cases: 1) the no-information case, where none of the sellers has any information about customer queue levels, inventory levels, or prices at the competitors; and 2) the partial-information case, where every seller has information about the customer queue levels and inventory levels of the competitors. Sellers employ automated pricing agents, or pricebots, which use RL-based pricing algorithms to reset prices at random intervals based on factors such as the number of back orders, inventory levels, and replenishment lead times, with the objective of maximizing discounted cumulative profit. In the no-information case, we show that a seller who uses Q-learning outperforms a seller who uses derivative following (DF). In the partial-information case, we model the problem as a Markovian game and use actor-critic based RL to learn dynamic prices. We believe our approach to solving these problems is a new and promising way of setting dynamic prices in multiseller environments with stochastic demands, price-sensitive customers, and inventory replenishments.
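For the no-information case, a pricebot's learning step could look like the following minimal tabular Q-learning sketch; the state encoding (discretized inventory and back-order levels), the step size, and the discount factor are assumptions made for illustration.

```python
import numpy as np

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.95):
    """One Q-learning step for a pricebot.

    Q      : (n_states, n_prices) table of action values.
    s, a   : current state index (e.g., discretized inventory and
             back-order levels) and the chosen price index.
    reward : profit realized since the last price reset.
    """
    target = reward + gamma * np.max(Q[s_next])   # bootstrap from the best next price
    Q[s, a] += alpha * (target - Q[s, a])         # move the estimate toward the target
    return Q
```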
Abstract:
An approximate dynamic programming (ADP)-based suboptimal neurocontroller to obtain the desired temperature for a high-speed aerospace vehicle is synthesized in this paper. A 1-D distributed parameter model of a fin is developed from basic thermal physics principles. "Snapshot" solutions of the dynamics are generated with a simple dynamic inversion-based feedback controller. Empirical basis functions are designed using the "proper orthogonal decomposition" (POD) technique and the snapshot solutions. A low-order nonlinear lumped parameter system characterizing the infinite-dimensional system is obtained by carrying out a Galerkin projection. An ADP-based neurocontroller with a dual heuristic programming (DHP) formulation is obtained with a single-network-adaptive-critic (SNAC) controller for this approximate nonlinear model. The actual control in the original domain is calculated with the same POD basis functions through a reverse mapping. A further contribution of this paper is the development of an online robust neurocontroller to account for unmodeled dynamics and parametric uncertainties inherent in such a complex dynamic system. A neural network (NN) weight update rule that guarantees boundedness of the weights and relaxes the need for the persistence of excitation (PE) condition is presented. Simulation studies show that, in a fairly extensive but compact domain, any desired temperature profile can be achieved starting from any initial temperature profile. Therefore, the ADP- and NN-based controllers appear to have the potential to become controller synthesis tools for nonlinear distributed parameter systems.
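The SNAC training idea, in a DHP formulation, can be sketched as a successive-approximation loop like the one below; the dynamics `f`, the optimal control equation, the costate recursion, and the single-sample `fit_step` update are all placeholders standing in for the paper's reduced-order model and network details.

```python
def train_snac(net, f, control_from_costate, costate_step, sample_state,
               n_iters=1000, lr=1e-3):
    """Successive-approximation training of a single network adaptive critic.

    net(x) approximates the costate lambda_{k+1} as a function of the
    state x_k; all other callables are problem-specific placeholders.
    """
    for _ in range(n_iters):
        x = sample_state()                     # random state in the training domain
        lam_next = net(x)                      # critic's current costate guess
        u = control_from_costate(x, lam_next)  # optimal control equation
        x_next = f(x, u)                       # propagate the reduced-order model
        # Target from the costate (adjoint) equation, evaluated with the
        # critic at the next state -- the DHP consistency condition.
        target = costate_step(x_next, net(x_next))
        net.fit_step(x, target, lr)            # hypothetical single-sample update
    return net
```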
Abstract:
Beavers are often found to be in conflict with human interests by creating nuisances such as building dams on flowing water (leading to flooding), blocking irrigation canals, cutting down timber, etc. At the same time, they contribute to raising water tables, increased vegetation, etc. Consequently, maintaining an optimal beaver population is beneficial. Because of their diffusion externality (due to their migratory nature), strategies based on lumped parameter models are often ineffective. Using a distributed parameter model for the beaver population that accounts for its spatial and temporal behavior, an optimal control (trapping) strategy is presented in this paper that leads to a desired distribution of the animal density in a region in the long run. The optimal control solution presented embeds the solution for a large number of initial conditions (i.e., it has a feedback form), which is otherwise nontrivial to obtain. The solution obtained can be used in real time by a nonexpert in control theory, since it only involves using neural networks trained offline. Proper orthogonal decomposition-based basis function design, followed by use of the basis functions in a Galerkin projection, is incorporated in the solution process as a model reduction technique. Optimal solutions are obtained through a "single network adaptive critic" (SNAC) neural-network architecture.
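For a generic semilinear system x_t = A x + N(x) + B u, the Galerkin projection onto an orthonormal POD basis Phi can be sketched as follows; the operators are placeholders, and spatial inner products are approximated by discrete matrix products.

```python
import numpy as np

def galerkin_reduce(Phi, A, nonlinearity, B):
    """Reduce x_t = A x + N(x) + B u onto an orthonormal POD basis Phi.

    Returns the right-hand side of the lumped parameter model
    a_t = Ar a + Phi^T N(Phi a) + Br u for the modal coefficients a.
    """
    Ar = Phi.T @ A @ Phi          # reduced linear operator
    Br = Phi.T @ B                # reduced control influence

    def rhs(a, u):
        x = Phi @ a               # lift coefficients back to the full grid
        return Ar @ a + Phi.T @ nonlinearity(x) + Br @ u

    return rhs
```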
Abstract:
Diabetes is a long-term disease during which the body's production and use of insulin are impaired, causing the glucose concentration level in the bloodstream to increase. Regulating blood glucose levels as close to normal as possible leads to a substantial decrease in the long-term complications of diabetes. In this paper, an intelligent online feedback treatment strategy is presented for the control of blood glucose levels in diabetic patients using single network adaptive critic (SNAC) neural networks, which are based on nonlinear optimal control theory. A recently developed mathematical model of the nonlinear dynamics of glucose and insulin interaction in the blood system has been revised and considered for synthesizing the neural network for feedback control. The idea is to replicate the function of pancreatic insulin, i.e., to have a fairly continuous measurement of blood glucose and a situation-dependent insulin injection into the body using an external device. Detailed studies are carried out to analyze the effectiveness of this adaptive critic-based feedback medication strategy. A comparison study with linear quadratic regulator (LQR) theory shows that the proposed nonlinear approach offers some important advantages, such as quicker response and avoidance of hypoglycemia problems. Robustness of the proposed approach is also demonstrated through a large number of simulations considering random initial conditions and parametric uncertainties.
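As a reference for such a comparison, an LQR benchmark can be computed from a linearization of the glucose-insulin dynamics as sketched below; the matrices A, B, Q, R are placeholders, and SciPy's continuous-time algebraic Riccati solver is assumed available.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def lqr_gain(A, B, Q, R):
    """LQR feedback u = -K x for a linearized glucose-insulin model.

    A, B : linearization of the nonlinear model about the operating point.
    Q, R : state and control weighting matrices.
    """
    P = solve_continuous_are(A, B, Q, R)   # algebraic Riccati solution
    K = np.linalg.solve(R, B.T @ P)        # K = R^{-1} B^T P
    return K
```

The LQR gain is valid only near the linearization point, which is one reason a nonlinear SNAC-based controller can respond more quickly far from the operating point.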
Abstract:
We propose, for the first time, two reinforcement learning algorithms with function approximation for average-cost adaptive control of traffic lights. One of these algorithms is a version of Q-learning with function approximation, while the other is a policy gradient actor-critic algorithm that incorporates multi-timescale stochastic approximation. We compare the performance of these algorithms on various network settings against a range of fixed-timing algorithms, as well as against a Q-learning algorithm with full state representation that we also implement. We observe that, while the full state representation algorithm shows the best results on a two-junction corridor (as expected), it is not implementable on larger road networks. The proposed algorithm PG-AC-TLC shows the best overall performance.
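A minimal sketch of the function-approximation flavor of Q-learning, of the kind the first algorithm uses, is given below; the feature map `phi` (e.g., coarse codes of queue lengths and elapsed signal times) and the step sizes are assumptions.

```python
import numpy as np

def q_fa_update(theta, phi, s, a, r, s_next, actions, alpha=0.01, gamma=0.9):
    """One step of Q-learning with linear function approximation.

    theta : weight vector; Q(s, a) is approximated by theta . phi(s, a).
    phi   : feature map, e.g., coarse codes of queue lengths and
            elapsed green times at each junction (hypothetical).
    """
    q_next = max(theta @ phi(s_next, b) for b in actions)   # greedy bootstrap
    td_error = r + gamma * q_next - theta @ phi(s, a)
    theta += alpha * td_error * phi(s, a)                   # semi-gradient step
    return theta
```

Unlike the full state representation, the weight vector grows with the number of features rather than the number of joint states, which is what makes the approach implementable on larger road networks.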
Abstract:
In this paper, we investigate the application of reinforcement learning (RL) techniques to the problem of determining dynamic prices in an electronic retail market. As representative models, we consider a single-seller market and a two-seller market, and formulate the dynamic pricing problem in a setting that easily generalizes to markets with more than two sellers. We first formulate the single-seller dynamic pricing problem in the RL framework and solve it using the Q-learning algorithm through simulation. Next, we model the two-seller dynamic pricing problem as a Markovian game and formulate it in the RL framework. We solve this problem using actor-critic algorithms through simulation. We believe our approach to solving these problems is a promising way of setting dynamic prices in multi-agent environments. We illustrate the methodology with two examples of typical retail markets.
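A minimal sketch of a tabular actor-critic step of the kind usable for the two-seller game is shown below, assuming a softmax policy over discrete price levels and a faster step size for the critic than for the actor; all names and parameters are illustrative.

```python
import numpy as np

def actor_critic_step(theta, V, s, a, r, s_next,
                      alpha_critic=0.1, alpha_actor=0.01, gamma=0.95):
    """One actor-critic step with a softmax policy over price levels.

    theta : (n_states, n_prices) actor preferences; the policy at s is
            softmax over theta[s].
    V     : state-value table maintained by the critic.
    """
    td_error = r + gamma * V[s_next] - V[s]   # critic's TD error
    V[s] += alpha_critic * td_error           # fast critic update
    pi = np.exp(theta[s] - np.max(theta[s]))
    pi /= pi.sum()                            # softmax policy at s
    grad = -pi
    grad[a] += 1.0                            # d log pi(a|s) / d theta[s]
    theta[s] += alpha_actor * td_error * grad # slow actor update
    return theta, V
```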
Abstract:
This paper presents an advanced single network adaptive critic (SNAC) aided nonlinear dynamic inversion (NDI) approach for simultaneous attitude control and trajectory tracking of a micro-quadrotor. Control of micro-quadrotors is a challenging problem because of their small size, strong coupling in pitch, yaw, and roll, and aerodynamic effects that often need to be ignored in the control design process to avoid mathematical complexities. In the proposed SNAC-aided NDI approach, the gains of the dynamic inversion design are selected so that the resulting controller behaves closely to a pre-synthesized SNAC controller for the output regulation problem. Since SNAC is based on optimal control theory, this makes the dynamic inversion controller operate near-optimally and also enhances its robustness. More importantly, it retains two major benefits of dynamic inversion: (i) a closed-form expression for the controller and (ii) easy scalability to command tracking applications, even without any a priori knowledge of the reference command. The effectiveness of the proposed controller is demonstrated through six-degree-of-freedom simulation studies of a micro-quadrotor. It is also observed that the proposed SNAC-aided NDI approach is more robust to modeling inaccuracies than an NDI controller designed independently from time-domain specifications.
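For a single output channel with control-affine dynamics, the closed-form character of NDI that the abstract highlights can be sketched as below; the dynamics `f` and `g`, the gain `k`, and the constant reference are assumptions, and the SNAC-based gain selection is not shown.

```python
def ndi_control(x, y, y_ref, f, g, k):
    """First-order NDI law for a channel with dynamics y_dot = f(x) + g(x) u.

    Enforces the desired error dynamics y_dot = k * (y_ref - y) for a
    constant reference by inverting the input map (g(x) assumed nonzero);
    in the paper's approach, k would be chosen so that the closed loop
    mimics a pre-synthesized SNAC controller.
    """
    y_dot_des = k * (y_ref - y)        # commanded output rate
    return (y_dot_des - f(x)) / g(x)   # closed-form inversion
```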