17 results for Reward
Abstract:
Learning automata arranged in a two-level hierarchy are considered. The automata operate in a stationary random environment and update their action probabilities according to the linear reward-ε-penalty algorithm at each level. Unlike some hierarchical systems previously proposed, no information transfer exists from one level to another, and yet the hierarchy possesses good convergence properties. Using weak-convergence concepts, it is shown that for large time and small values of the parameters in the algorithm, the evolution of the optimal path probability can be represented by a diffusion whose parameters can be computed explicitly.
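A single automaton's update in the linear reward-penalty family can be sketched as follows. This is the textbook form of the update, not code from the paper: the function name and the step sizes `a` (reward) and `b` (penalty) are illustrative assumptions. Choosing `b` much smaller than `a` gives a small-penalty variant, and `b = 0` gives reward-inaction.

```python
def linear_reward_penalty_update(p, action, rewarded, a, b):
    """Return the updated action probabilities of one learning automaton.

    p        -- current action probabilities (sums to 1, at least 2 actions)
    action   -- index of the action that was just played
    rewarded -- True if the environment responded with a reward
    a, b     -- reward and penalty step sizes in [0, 1] (illustrative names)
    """
    r = len(p)
    new_p = []
    for j, pj in enumerate(p):
        if rewarded:
            # Reward: move probability mass toward the chosen action.
            new_p.append(pj + a * (1.0 - pj) if j == action else (1.0 - a) * pj)
        else:
            # Penalty: shrink the chosen action and spread mass over the others.
            new_p.append((1.0 - b) * pj if j == action
                         else b / (r - 1) + (1.0 - b) * pj)
    return new_p
```

Both branches keep the vector on the probability simplex, so the update can be applied repeatedly without renormalizing.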
Abstract:
Multiaction learning automata which update their action probabilities on the basis of the responses they get from an environment are considered in this paper. The automata update the probabilities according to whether the environment responds with a reward or a penalty. Learning automata are said to possess ergodicity of the mean if the mean action probability is the state probability (or unconditional probability) of an ergodic Markov chain. In an earlier paper [11] we considered the problem of a two-action learning automaton being ergodic in the mean (EM). The family of such automata was characterized completely by proving the necessary and sufficient conditions for automata to be EM. In this paper, we generalize the results of [11] and obtain necessary and sufficient conditions for the multiaction learning automaton to be EM. These conditions involve two families of probability updating functions. It is shown that for the automaton to be EM the two families must be linearly dependent. The vector defining the linear dependence is the only vector parameter which controls the rate of convergence of the automaton. Further, the technique for reducing the variance of the limiting distribution is discussed. Just as in the two-action case, it is shown that the set of absolutely expedient schemes and the set of schemes which possess ergodicity of the mean are mutually disjoint.
Abstract:
We develop extensions of the Simulated Annealing with Multiplicative Weights (SAMW) algorithm, which was proposed as a solution method for Finite-Horizon Markov Decision Processes (FH-MDPs). The extensions are in three directions: a) use of the dynamic programming principle in the policy update step of SAMW, b) a two-timescale actor-critic algorithm that uses simulated transitions alone, and c) extension of the algorithm to the infinite-horizon discounted-reward scenario. In particular, a) reduces the storage required from exponential to linear in the number of actions per stage-state pair. On the faster timescale, a 'critic' recursion performs policy evaluation, while on the slower timescale an 'actor' recursion performs policy improvement using SAMW. We give a proof outlining convergence w.p. 1 and show experimental results on two settings: semiconductor fabrication and flow control in communication networks.
Abstract:
Query incentive networks capture the role of incentives in extracting information from decentralized information networks such as a social network. Several game theoretic models of query incentive networks have been proposed in the literature to study and characterize the dependence of the monetary reward required to extract the answer to a query on various factors such as the structure of the network, the level of difficulty of the query, and the required success probability. None of the existing models, however, captures the practical and important factor of quality of answers. In this paper, we develop a complete mechanism design based framework to incorporate the quality of answers in the monetization of query incentive networks. First, we extend the model of Kleinberg and Raghavan [2] to allow the nodes to modulate the incentive on the basis of the quality of the answer they receive. For this quality conscious model, we show the existence of a unique Nash equilibrium and study the impact of quality of answers on the growth rate of the initial reward with respect to the branching factor of the network. Next, we present two mechanisms, the direct comparison mechanism and the peer prediction mechanism, for truthful elicitation of quality from the agents. These mechanisms are based on scoring rules and cover different scenarios which may arise in query incentive networks. We show that the proposed quality elicitation mechanisms are incentive compatible and ex-ante budget balanced. We also derive conditions under which ex-post budget balance can be achieved by these mechanisms.
Abstract:
Two optimal non-linear reinforcement schemes—the Reward-Inaction and the Penalty-Inaction—for the two-state automaton functioning in a stationary random environment are considered. Very simple conditions of symmetry of the non-linear function figuring in the reinforcement scheme are shown to be necessary and sufficient for optimality. General expressions for the variance and rate of learning are derived. These schemes are compared with the already existing optimal linear schemes in the light of average variance and average rate of learning.
Abstract:
Motivated by certain situations in manufacturing systems and communication networks, we look into the problem of maximizing the profit in a queueing system with linear reward and cost structure and having a choice of selecting the streams of Poisson arrivals according to an independent Markov chain. We view the system as an MMPP/GI/1 queue and seek to maximize the profits by optimally choosing the stationary probabilities of the modulating Markov chain. We consider two formulations of the optimization problem. The first one (which we call the PUT problem) seeks to maximize the profit per unit time, whereas the second one considers the maximization of the profit per accepted customer (the PAC problem). In each of these formulations, we explore three separate problems. In the first one, the constraints come from bounding the utilization of an infinite capacity server; in the second one, the constraints arise from bounding the mean queue length of the same queue; and in the third one, the finite capacity of the buffer is reflected as a set of constraints. In the problems bounding the utilization factor of the queue, the solutions are given by essentially linear programs, while the problems with mean queue length constraints are linear programs if the service is exponentially distributed. The problems modeling the finite capacity queue are non-convex programs for which global maxima can be found. There is a rich relationship between the solutions of the PUT and PAC problems. In particular, the PUT solutions always make the server work at a utilization factor that is no less than that of the PAC solutions.
Abstract:
In many problems of decision making under uncertainty, the system has to acquire knowledge of its environment and learn the optimal decision through its experience. Such problems may also involve the system having to arrive at the globally optimal decision when, at each instant, only a subset of the entire set of possible alternatives is available. These problems can be successfully modelled and analysed by learning automata. In this paper an estimator learning algorithm, which maintains estimates of the reward characteristics of the random environment, is presented for an automaton with a changing number of actions. A learning automaton using the new scheme is shown to be ε-optimal. The simulation results demonstrate the fast convergence properties of the new algorithm. The results of this study can be extended to the design of other types of estimator algorithms with good convergence properties.
Abstract:
In this paper we propose a novel technique to model and analyze the performability of parallel and distributed architectures using GSPN-reward models.
Abstract:
The rhesus monkey Macaca mulatta and Hanuman langur Presbytis entellus are distributed all over the State of Himachal Pradesh, India. Although both species inhabit forested areas, only rhesus monkeys seem also to have become urbanized. There are about 200,000 rhesus monkeys and 120,000 Hanuman langurs. A three-year survey at Shimla showed an increasing trend in their populations. Potential threats to survival of these primates differ in the 12 districts. The two species differ in feeding and habitat preferences. People's feelings, perceptions and attitudes toward them point to an incipient man-monkey conflict and erosion of conservation ethics. A comprehensive management plan for these primates should be formulated, and involve local people. Copyright (C) 1996 Elsevier Science Limited
Abstract:
We develop a simulation-based, two-timescale actor-critic algorithm for infinite horizon Markov decision processes with finite state and action spaces, with a discounted reward criterion. The algorithm is of the gradient ascent type and performs a search in the space of stationary randomized policies. The algorithm uses certain simultaneous deterministic perturbation stochastic approximation (SDPSA) gradient estimates for enhanced performance. We show an application of our algorithm on a problem of mortgage refinancing. Our algorithm obtains the optimal refinancing strategies in a computationally efficient manner.
Abstract:
Our work is motivated by geographical forwarding of sporadic alarm packets to a base station in a wireless sensor network (WSN), where the nodes are sleep-wake cycling periodically and asynchronously. We seek to develop local forwarding algorithms that can be tuned so as to trade off the end-to-end delay against a total cost, such as the hop count or total energy. Our approach is to solve, at each forwarding node en route to the sink, the local forwarding problem of minimizing one-hop waiting delay subject to a lower bound constraint on a suitable reward offered by the next-hop relay; the constraint serves to tune the tradeoff. The reward metric used for the local problem is based on the end-to-end total cost objective (for instance, when the total cost is hop count, we choose to use the progress toward sink made by a relay as the reward). The forwarding node, to begin with, is uncertain about the number of relays, their wake-up times, and the reward values, but knows the probability distributions of these quantities. At each relay wake-up instant, when a relay reveals its reward value, the forwarding node's problem is to forward the packet or to wait for further relays to wake up. In terms of the operations research literature, our work can be considered as a variant of the asset selling problem. We formulate our local forwarding problem as a partially observable Markov decision process (POMDP) and obtain inner and outer bounds for the optimal policy. Motivated by the computational complexity involved in the policies derived out of these bounds, we formulate an alternate simplified model, the optimal policy for which is a simple threshold rule. We provide simulation results to compare the performance of the inner and outer bound policies against the simple policy, and also against the optimal policy when the source knows the exact number of relays.
Observing the good performance and the ease of implementation of the simple policy, we apply it to our motivating problem, i.e., local geographical routing of sporadic alarm packets in a large WSN. We compare the end-to-end performance (i.e., average total delay and average total cost) obtained by the simple policy, when used for local geographical forwarding, against that obtained by the globally optimal forwarding algorithm proposed by Kim et al. [1].
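A threshold rule of the kind mentioned in this abstract might look like the sketch below. The abstract does not give the threshold value or say what happens when no relay meets it, so the function name, the `(wake_time, reward)` input format, and the fallback to the best relay seen so far are all assumptions made for illustration.

```python
def threshold_forward(relays, threshold):
    """Pick a next-hop relay with a simple threshold rule.

    relays    -- non-empty list of (wake_time, reward) pairs sorted by wake_time
    threshold -- minimum reward at which the forwarder stops waiting

    Returns (index_of_chosen_relay, time_at_which_the_packet_is_forwarded).
    """
    if not relays:
        raise ValueError("at least one relay is required")
    best = 0
    for i, (wake_time, reward) in enumerate(relays):
        if reward >= threshold:
            # Stop at the first relay whose reward meets the threshold.
            return i, wake_time
        if reward > relays[best][1]:
            best = i
    # Assumption (not from the abstract): if no relay meets the threshold,
    # forward to the best relay seen once the last relay has woken up.
    return best, relays[-1][0]
```

Raising `threshold` makes the forwarder wait longer for a better relay, which is how a rule like this trades one-hop delay against reward.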
Abstract:
In this paper, the authors study the structure of a novel binaural sound with a certain phase and amplitude modulation and the response to this excitation when it is applied to the natural reward circuit of the human brain through auditory neural pathways. This novel excitation, also referred to as gyrosonic excitation in this work, has been found to have interesting effects such as stabilization effects on the left and right hemispheric brain signaling as captured by Galvanic Skin Resistance (GSR) measurements, control of cardiac rhythms (observed from ECG signals), mitigation of psychosomatic syndrome, and mitigation of migraine pain. Experimental data collected from human subjects are presented, and these data are examined to categorize the extent of systems disorder and reinforcement reward due to the gyrosonic stimulus. A multi-path reduced-order model has been developed to analyze the GSR signals. The filtered results are indicative of complicated reinforcing reward patterns due to the gyrosonic stimulation when it is used as a control input for patients with psychosomatic and cardiac disorders.
Abstract:
This paper addresses the problem of finding optimal power control policies for wireless energy harvesting sensor (EHS) nodes with automatic repeat request (ARQ)-based packet transmissions. The EHS harvests energy from the environment according to a Bernoulli process; and it is required to operate within the constraint of energy neutrality. The EHS obtains partial channel state information (CSI) at the transmitter through the link-layer ARQ protocol, via the ACK/NACK feedback messages, and uses it to adapt the transmission power for the packet (re)transmission attempts. The underlying wireless fading channel is modeled as a finite state Markov chain with known transition probabilities. Thus, the goal of the power management policy is to determine the best power setting for the current packet transmission attempt, so as to maximize a long-run expected reward such as the expected outage probability. The problem is addressed in a decision-theoretic framework by casting it as a partially observable Markov decision process (POMDP). Due to the large size of the state-space, the exact solution to the POMDP is computationally expensive. Hence, two popular approximate solutions are considered, which yield good power management policies for the transmission attempts. Monte Carlo simulation results illustrate the efficacy of the approach and show that the approximate solutions significantly outperform conventional approaches.
Abstract:
In geographical forwarding of packets in a large wireless sensor network (WSN) with sleep-wake cycling nodes, we are interested in the local decision problem faced by a node that has "custody" of a packet and has to choose one among a set of next-hop relay nodes to forward the packet toward the sink. Each relay is associated with a "reward" that summarizes the benefit of forwarding the packet through that relay. We seek a solution to this local problem, the idea being that such a solution, if adopted by every node, could provide a reasonable heuristic for the end-to-end forwarding problem. Toward this end, we propose a local relay selection problem consisting of a forwarding node and a collection of relay nodes, with the relays waking up sequentially at random times. At each relay wake-up instant, the forwarder can choose to probe a relay to learn its reward value, based on which the forwarder can then decide whether to stop (and forward its packet to the chosen relay) or to continue to wait for further relays to wake up. The forwarder's objective is to select a relay so as to minimize a combination of waiting delay, reward, and probing cost. The local decision problem can be considered as a variant of the asset selling problem studied in the operations research literature. We formulate the local problem as a Markov decision process (MDP) and characterize the solution in terms of stopping sets and probing sets. We provide results illustrating the structure of the stopping sets, namely, the (lower bound) threshold and the stage independence properties. Regarding the probing sets, we make an interesting conjecture that these sets are characterized by upper bounds. Through simulation experiments, we provide valuable insights into the performance of the optimal local forwarding and its use as an end-to-end forwarding heuristic.
Abstract:
Standard Susceptible-Infected-Susceptible (SIS) epidemic models assume that a message spreads from the infected to the susceptible nodes due to only susceptible-infected epidemic contact. We modify the standard SIS epidemic model to include direct recruitment of susceptible individuals to the infected class at a constant rate (independent of epidemic contacts), to accelerate information spreading in a social network. Such recruitment can be carried out by placing advertisements in the media. We provide a closed form analytical solution for system evolution in the proposed model and use it to study campaigning in two different scenarios. In the first, the net cost function is a linear combination of the reward due to extent of information diffusion and the cost due to application of control. In the second, the campaign budget is fixed. Results reveal the effectiveness of the proposed system in accelerating and improving the extent of information diffusion. Our work is useful for devising effective strategies for product marketing and political/social-awareness/crowd-funding campaigns that target individuals in a social network.
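The effect of direct recruitment can be illustrated numerically. The abstract does not state the model equations, so the sketch below assumes a common normalized form, di/dt = beta*i*(1 - i) + u*(1 - i) - gamma*i, where i is the infected fraction, beta the contact rate, gamma the recovery rate, and u the constant recruitment rate. These symbols, the function name, and the use of forward Euler integration in place of the paper's closed-form solution are all assumptions.

```python
def simulate_sis(beta, gamma, u, i0, horizon, dt=1e-3):
    """Integrate an assumed SIS model with constant direct recruitment.

    Assumed dynamics (not taken from the paper):
        di/dt = beta * i * (1 - i) + u * (1 - i) - gamma * i

    Returns the infected fraction at time `horizon`, starting from i0.
    """
    i = i0
    steps = int(round(horizon / dt))
    for _ in range(steps):
        # Epidemic contact term, direct recruitment term, recovery term.
        di = beta * i * (1.0 - i) + u * (1.0 - i) - gamma * i
        i += dt * di  # forward Euler step
    return i


# Compare spreading with and without recruitment over the same horizon.
without_recruitment = simulate_sis(beta=0.5, gamma=0.2, u=0.0, i0=0.01, horizon=10.0)
with_recruitment = simulate_sis(beta=0.5, gamma=0.2, u=0.1, i0=0.01, horizon=10.0)
```

Under these assumed dynamics, a positive `u` raises the infected fraction reached by a given time, consistent with the accelerated spreading the abstract reports.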