986 resultados para Policy iteration algorithm


Relevância:

100.00% 100.00%

Publicador:

Resumo:

The main goal of this paper is to apply the so-called policy iteration algorithm (PIA) for the long run average continuous control problem of piecewise deterministic Markov processes (PDMP`s) taking values in a general Borel space and with compact action space depending on the state variable. In order to do that we first derive some important properties for a pseudo-Poisson equation associated to the problem. In the sequence it is shown that the convergence of the PIA to a solution satisfying the optimality equation holds under some classical hypotheses and that this optimal solution yields to an optimal control strategy for the average control problem for the continuous-time PDMP in a feedback form.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

This paper proposes a high-level reinforcement learning (RL) control system for solving the action selection problem of an autonomous robot. Although the dominant approach, when using RL, has been to apply value function based algorithms, the system here detailed is characterized by the use of direct policy search methods. Rather than approximating a value function, these methodologies approximate a policy using an independent function approximator with its own parameters, trying to maximize the future expected reward. The policy based algorithm presented in this paper is used for learning the internal state/action mapping of a behavior. In this preliminary work, we demonstrate its feasibility with simulated experiments using the underwater robot GARBI in a target reaching task

Relevância:

90.00% 90.00%

Publicador:

Resumo:

One objective of artificial intelligence is to model the behavior of an intelligent agent interacting with its environment. The environment's transformations can be modeled as a Markov chain, whose state is partially observable to the agent and affected by its actions; such processes are known as partially observable Markov decision processes (POMDPs). While the environment's dynamics are assumed to obey certain rules, the agent does not know them and must learn. In this dissertation we focus on the agent's adaptation as captured by the reinforcement learning framework. This means learning a policy---a mapping of observations into actions---based on feedback from the environment. The learning can be viewed as browsing a set of policies while evaluating them by trial through interaction with the environment. The set of policies is constrained by the architecture of the agent's controller. POMDPs require a controller to have a memory. We investigate controllers with memory, including controllers with external memory, finite state controllers and distributed controllers for multi-agent systems. For these various controllers we work out the details of the algorithms which learn by ascending the gradient of expected cumulative reinforcement. Building on statistical learning theory and experiment design theory, a policy evaluation algorithm is developed for the case of experience re-use. We address the question of sufficient experience for uniform convergence of policy evaluation and obtain sample complexity bounds for various estimators. Finally, we demonstrate the performance of the proposed algorithms on several domains, the most complex of which is simulated adaptive packet routing in a telecommunication network.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

This paper proposes a high-level reinforcement learning (RL) control system for solving the action selection problem of an autonomous robot. Although the dominant approach, when using RL, has been to apply value function based algorithms, the system here detailed is characterized by the use of direct policy search methods. Rather than approximating a value function, these methodologies approximate a policy using an independent function approximator with its own parameters, trying to maximize the future expected reward. The policy based algorithm presented in this paper is used for learning the internal state/action mapping of a behavior. In this preliminary work, we demonstrate its feasibility with simulated experiments using the underwater robot GARBI in a target reaching task

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Este trabalho apresenta uma solução para o problema de controle admissão de conexão e alocação dinâmica de recursos em redes IEEE 802.16 através da modelagem de um Processo Markoviano de Decisão (PMD) utilizando o conceito de degradação de largura de banda, o qual é baseado nos requisitos diferenciados de largura de banda das classes de serviço do IEEE 802.16. Para o critério de desempenho do PMD é feita a atribuição de diferentes retornos a cada classe de serviço, fazendo assim o tratamento diferenciado de cada fluxo. Nesse sentido, é possível avaliar a política ótima, obtida através de um algoritmo de iteração de valores, considerando aspectos como o nível de degradação médio das classes de serviço, utilização dos recursos e probabilidades de bloqueios de cada classe de serviço em relação à carga do sistema. Resultados obtidos mostram que o método de controle markoviano proposto é capaz de priorizar as classes de serviço consideradas mais relevantes para o sistema.

Relevância:

90.00% 90.00%

Publicador:

Resumo:

Die Arbeit behandelt das Problem der Skalierbarkeit von Reinforcement Lernen auf hochdimensionale und komplexe Aufgabenstellungen. Unter Reinforcement Lernen versteht man dabei eine auf approximativem Dynamischen Programmieren basierende Klasse von Lernverfahren, die speziell Anwendung in der Künstlichen Intelligenz findet und zur autonomen Steuerung simulierter Agenten oder realer Hardwareroboter in dynamischen und unwägbaren Umwelten genutzt werden kann. Dazu wird mittels Regression aus Stichproben eine Funktion bestimmt, die die Lösung einer "Optimalitätsgleichung" (Bellman) ist und aus der sich näherungsweise optimale Entscheidungen ableiten lassen. Eine große Hürde stellt dabei die Dimensionalität des Zustandsraums dar, die häufig hoch und daher traditionellen gitterbasierten Approximationsverfahren wenig zugänglich ist. Das Ziel dieser Arbeit ist es, Reinforcement Lernen durch nichtparametrisierte Funktionsapproximation (genauer, Regularisierungsnetze) auf -- im Prinzip beliebig -- hochdimensionale Probleme anwendbar zu machen. Regularisierungsnetze sind eine Verallgemeinerung von gewöhnlichen Basisfunktionsnetzen, die die gesuchte Lösung durch die Daten parametrisieren, wodurch die explizite Wahl von Knoten/Basisfunktionen entfällt und so bei hochdimensionalen Eingaben der "Fluch der Dimension" umgangen werden kann. Gleichzeitig sind Regularisierungsnetze aber auch lineare Approximatoren, die technisch einfach handhabbar sind und für die die bestehenden Konvergenzaussagen von Reinforcement Lernen Gültigkeit behalten (anders als etwa bei Feed-Forward Neuronalen Netzen). Allen diesen theoretischen Vorteilen gegenüber steht allerdings ein sehr praktisches Problem: der Rechenaufwand bei der Verwendung von Regularisierungsnetzen skaliert von Natur aus wie O(n**3), wobei n die Anzahl der Daten ist. Das ist besonders deswegen problematisch, weil bei Reinforcement Lernen der Lernprozeß online erfolgt -- die Stichproben werden von einem Agenten/Roboter erzeugt, während er mit der Umwelt interagiert. Anpassungen an der Lösung müssen daher sofort und mit wenig Rechenaufwand vorgenommen werden. Der Beitrag dieser Arbeit gliedert sich daher in zwei Teile: Im ersten Teil der Arbeit formulieren wir für Regularisierungsnetze einen effizienten Lernalgorithmus zum Lösen allgemeiner Regressionsaufgaben, der speziell auf die Anforderungen von Online-Lernen zugeschnitten ist. Unser Ansatz basiert auf der Vorgehensweise von Recursive Least-Squares, kann aber mit konstantem Zeitaufwand nicht nur neue Daten sondern auch neue Basisfunktionen in das bestehende Modell einfügen. Ermöglicht wird das durch die "Subset of Regressors" Approximation, wodurch der Kern durch eine stark reduzierte Auswahl von Trainingsdaten approximiert wird, und einer gierigen Auswahlwahlprozedur, die diese Basiselemente direkt aus dem Datenstrom zur Laufzeit selektiert. Im zweiten Teil übertragen wir diesen Algorithmus auf approximative Politik-Evaluation mittels Least-Squares basiertem Temporal-Difference Lernen, und integrieren diesen Baustein in ein Gesamtsystem zum autonomen Lernen von optimalem Verhalten. Insgesamt entwickeln wir ein in hohem Maße dateneffizientes Verfahren, das insbesondere für Lernprobleme aus der Robotik mit kontinuierlichen und hochdimensionalen Zustandsräumen sowie stochastischen Zustandsübergängen geeignet ist. Dabei sind wir nicht auf ein Modell der Umwelt angewiesen, arbeiten weitestgehend unabhängig von der Dimension des Zustandsraums, erzielen Konvergenz bereits mit relativ wenigen Agent-Umwelt Interaktionen, und können dank des effizienten Online-Algorithmus auch im Kontext zeitkritischer Echtzeitanwendungen operieren. Wir demonstrieren die Leistungsfähigkeit unseres Ansatzes anhand von zwei realistischen und komplexen Anwendungsbeispielen: dem Problem RoboCup-Keepaway, sowie der Steuerung eines (simulierten) Oktopus-Tentakels.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

In this paper, we examine the postbuckling behavior of functionally graded material FGM rectangular plates that are integrated with surface-bonded piezoelectric actuators and are subjected to the combined action of uniform temperature change, in-plane forces, and constant applied actuator voltage. A Galerkin-differential quadrature iteration algorithm is proposed for solution of the non-linear partial differential governing equations. To account for the transverse shear strains, the Reddy higher-order shear deformation plate theory is employed. The bifurcation-type thermo-mechanical buckling of fully clamped plates, and the postbuckling behavior of plates with more general boundary conditions subject to various thermo-electro-mechanical loads, are discussed in detail. Parametric studies are also undertaken, and show the effects of applied actuator voltage, in-plane forces, volume fraction exponents, temperature change, and the character of boundary conditions on the buckling and postbuckling characteristics of the plates. (C) 2003 Elsevier Science Ltd. All rights reserved.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Dissertação para obtenção do Grau de Mestre em Engenharia Electrotécnica e de Computadores

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Die Untersuchung des dynamischen aeroelastischen Stabilitätsverhaltens von Flugzeugen erfordert sehr komplexe Rechenmodelle, welche die wesentlichen elastomechanischen und instationären aerodynamischen Eigenschaften der Konstruktion wiedergeben sollen. Bei der Modellbildung müssen einerseits Vereinfachungen und Idealisierungen im Rahmen der Anwendung der Finite Elemente Methode und der aerodynamischen Theorie vorgenommen werden, deren Auswirkungen auf das Simulationsergebnis zu bewerten sind. Andererseits können die strukturdynamischen Kenngrößen durch den Standschwingungsversuch identifiziert werden, wobei die Ergebnisse Messungenauigkeiten enthalten. Für eine robuste Flatteruntersuchung müssen die identifizierten Unwägbarkeiten in allen Prozessschritten über die Festlegung von unteren und oberen Schranken konservativ ermittelt werden, um für alle Flugzustände eine ausreichende Flatterstabilität sicherzustellen. Zu diesem Zweck wird in der vorliegenden Arbeit ein Rechenverfahren entwickelt, welches die klassische Flatteranalyse mit den Methoden der Fuzzy- und Intervallarithmetik verbindet. Dabei werden die Flatterbewegungsgleichungen als parameterabhängiges nichtlineares Eigenwertproblem formuliert. Die Änderung der komplexen Eigenlösung infolge eines veränderlichen Einflussparameters wird mit der Methode der numerischen Fortsetzung ausgehend von der nominalen Startlösung verfolgt. Ein modifizierter Newton-Iterations-Algorithmus kommt zur Anwendung. Als Ergebnis liegen die berechneten aeroelastischen Dämpfungs- und Frequenzverläufe in Abhängigkeit von der Fluggeschwindigkeit mit Unschärfebändern vor.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

This paper studies the average control problem of discrete-time Markov Decision Processes (MDPs for short) with general state space, Feller transition probabilities, and possibly non-compact control constraint sets A(x). Two hypotheses are considered: either the cost function c is strictly unbounded or the multifunctions A(r)(x) = {a is an element of A(x) : c(x, a) <= r} are upper-semicontinuous and compact-valued for each real r. For these two cases we provide new results for the existence of a solution to the average-cost optimality equality and inequality using the vanishing discount approach. We also study the convergence of the policy iteration approach under these conditions. It should be pointed out that we do not make any assumptions regarding the convergence and the continuity of the limit function generated by the sequence of relative difference of the alpha-discounted value functions and the Poisson equations as often encountered in the literature. (C) 2012 Elsevier Inc. All rights reserved.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

The design process of any electric vehicle system has to be oriented towards the best energy efficiency, together with the constraint of maintaining comfort in the vehicle cabin. Main aim of this study is to research the best thermal management solution in terms of HVAC efficiency without compromising occupant’s comfort and internal air quality. An Arduino controlled Low Cost System of Sensors was developed and compared against reference instrumentation (average R-squared of 0.92) and then used to characterise the vehicle cabin in real parking and driving conditions trials. Data on the energy use of the HVAC was retrieved from the car On-Board Diagnostic port. Energy savings using recirculation can reach 30 %, but pollutants concentration in the cabin builds up in this operating mode. Moreover, the temperature profile appeared strongly nonuniform with air temperature differences up to 10° C. Optimisation methods often require a high number of runs to find the optimal configuration of the system. Fast models proved to be beneficial for these task, while CFD-1D model are usually slower despite the higher level of detail provided. In this work, the collected dataset was used to train a fast ML model of both cabin and HVAC using linear regression. Average scaled RMSE over all trials is 0.4 %, while computation time is 0.0077 ms for each second of simulated time on a laptop computer. Finally, a reinforcement learning environment was built in OpenAI and Stable-Baselines3 using the built-in Proximal Policy Optimisation algorithm to update the policy and seek for the best compromise between comfort, air quality and energy reward terms. The learning curves show an oscillating behaviour overall, with only 2 experiments behaving as expected even if too slow. This result leaves large room for improvement, ranging from the reward function engineering to the expansion of the ML model.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper presents an Adaptive Maximum Entropy (AME) approach for modeling biological species. The Maximum Entropy algorithm (MaxEnt) is one of the most used methods in modeling biological species geographical distribution. The approach presented here is an alternative to the classical algorithm. Instead of using the same set features in the training, the AME approach tries to insert or to remove a single feature at each iteration. The aim is to reach the convergence faster without affect the performance of the generated models. The preliminary experiments were well performed. They showed an increasing on performance both in accuracy and in execution time. Comparisons with other algorithms are beyond the scope of this paper. Some important researches are proposed as future works.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Although it is always weak between RFID Tag and Terminal in focus of the security, there are no security skills in RFID Tag. Recently there are a lot of studying in order to protect it, but because it has some physical limitation of RFID, that is it should be low electric power and high speed, it is impossible to protect with the skills. At present, the methods of RFID security are using a security server, a security policy and security. One of them the most famous skill is the security module, then they has an authentication skill and an encryption skill. In this paper, we designed and implemented after modification original SEED into 8 Round and 64 bits for Tag.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

The purpose of this work is to present an algorithm to solve nonlinear constrained optimization problems, using the filter method with the inexact restoration (IR) approach. In the IR approach two independent phases are performed in each iteration—the feasibility and the optimality phases. The first one directs the iterative process into the feasible region, i.e. finds one point with less constraints violation. The optimality phase starts from this point and its goal is to optimize the objective function into the satisfied constraints space. To evaluate the solution approximations in each iteration a scheme based on the filter method is used in both phases of the algorithm. This method replaces the merit functions that are based on penalty schemes, avoiding the related difficulties such as the penalty parameter estimation and the non-differentiability of some of them. The filter method is implemented in the context of the line search globalization technique. A set of more than two hundred AMPL test problems is solved. The algorithm developed is compared with LOQO and NPSOL software packages.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Methods like Event History Analysis can show the existence of diffusion and part of its nature, but do not study the process itself. Nowadays, thanks to the increasing performance of computers, processes can be studied using computational modeling. This thesis presents an agent-based model of policy diffusion mainly inspired from the model developed by Braun and Gilardi (2006). I first start by developing a theoretical framework of policy diffusion that presents the main internal drivers of policy diffusion - such as the preference for the policy, the effectiveness of the policy, the institutional constraints, and the ideology - and its main mechanisms, namely learning, competition, emulation, and coercion. Therefore diffusion, expressed by these interdependencies, is a complex process that needs to be studied with computational agent-based modeling. In a second step, computational agent-based modeling is defined along with its most significant concepts: complexity and emergence. Using computational agent-based modeling implies the development of an algorithm and its programming. When this latter has been developed, we let the different agents interact. Consequently, a phenomenon of diffusion, derived from learning, emerges, meaning that the choice made by an agent is conditional to that made by its neighbors. As a result, learning follows an inverted S-curve, which leads to partial convergence - global divergence and local convergence - that triggers the emergence of political clusters; i.e. the creation of regions with the same policy. Furthermore, the average effectiveness in this computational world tends to follow a J-shaped curve, meaning that not only time is needed for a policy to deploy its effects, but that it also takes time for a country to find the best-suited policy. To conclude, diffusion is an emergent phenomenon from complex interactions and its outcomes as ensued from my model are in line with the theoretical expectations and the empirical evidence.Les méthodes d'analyse de biographie (event history analysis) permettent de mettre en évidence l'existence de phénomènes de diffusion et de les décrire, mais ne permettent pas d'en étudier le processus. Les simulations informatiques, grâce aux performances croissantes des ordinateurs, rendent possible l'étude des processus en tant que tels. Cette thèse, basée sur le modèle théorique développé par Braun et Gilardi (2006), présente une simulation centrée sur les agents des phénomènes de diffusion des politiques. Le point de départ de ce travail met en lumière, au niveau théorique, les principaux facteurs de changement internes à un pays : la préférence pour une politique donnée, l'efficacité de cette dernière, les contraintes institutionnelles, l'idéologie, et les principaux mécanismes de diffusion que sont l'apprentissage, la compétition, l'émulation et la coercition. La diffusion, définie par l'interdépendance des différents acteurs, est un système complexe dont l'étude est rendue possible par les simulations centrées sur les agents. Au niveau méthodologique, nous présenterons également les principaux concepts sous-jacents aux simulations, notamment la complexité et l'émergence. De plus, l'utilisation de simulations informatiques implique le développement d'un algorithme et sa programmation. Cette dernière réalisée, les agents peuvent interagir, avec comme résultat l'émergence d'un phénomène de diffusion, dérivé de l'apprentissage, où le choix d'un agent dépend en grande partie de ceux faits par ses voisins. De plus, ce phénomène suit une courbe en S caractéristique, poussant à la création de régions politiquement identiques, mais divergentes au niveau globale. Enfin, l'efficacité moyenne, dans ce monde simulé, suit une courbe en J, ce qui signifie qu'il faut du temps, non seulement pour que la politique montre ses effets, mais également pour qu'un pays introduise la politique la plus efficace. En conclusion, la diffusion est un phénomène émergent résultant d'interactions complexes dont les résultats du processus tel que développé dans ce modèle correspondent tant aux attentes théoriques qu'aux résultats pratiques.