Policy gradient critics

Daan Wierstra, Jürgen Schmidhuber

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

8 Scopus citations

Abstract

We present Policy Gradient Actor-Critic (PGAC), a new model-free Reinforcement Learning (RL) method for creating limited-memory stochastic policies for Partially Observable Markov Decision Processes (POMDPs) that require long-term memories of past observations and actions. The approach involves estimating a policy gradient for an Actor through a Policy Gradient Critic which evaluates probability distributions on actions. Gradient-based updates of history-conditional action probability distributions enable the algorithm to learn a mapping from memory states (or event histories) to probability distributions on actions, solving POMDPs through a combination of memory and stochasticity. This goes beyond previous approaches to learning purely reactive POMDP policies, without giving up their advantages. Preliminary results on important benchmark tasks show that our approach can in principle be used as a general purpose POMDP algorithm that solves RL problems in both continuous and discrete action domains. © Springer-Verlag Berlin Heidelberg 2007.
Original language: English (US)
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Publisher: Springer Verlag
Pages: 466-477
Number of pages: 12
ISBN (Print): 9783540749578
State: Published - Jan 1 2007
Externally published: Yes
