TY - GEN

T1 - Incremental basis construction from temporal difference error

AU - Sun, Yi

AU - Gomez, Faustino

AU - Ring, Mark

AU - Schmidhuber, Jürgen

N1 - Generated from Scopus record by KAUST IRTS on 2022-09-14

PY - 2011/10/7

Y1 - 2011/10/7

N2 - In many reinforcement learning (RL) systems, the value function is approximated as a linear combination of a fixed set of basis functions. Performance can be improved by adding to this set. Previous approaches construct a series of basis functions that in sufficient number can eventually represent the value function. In contrast, we show that there is a single, ideal basis function, which can directly represent the value function. Its addition to the set immediately reduces the error to zero, without changing existing weights. Moreover, this ideal basis function is simply the value function that results from replacing the MDP's reward function with its Bellman error. This result suggests a novel method for improving value-function estimation: a primary reinforcement learner estimates its value function using its present basis functions; it then sends its TD error to a secondary learner, which interprets that error as a reward function and estimates the corresponding value function; the resulting value function then becomes the primary learner's new basis function. We present both batch and online versions in combination with incremental basis projection, and demonstrate that the performance is superior to existing methods, especially in the case of large discount factors. Copyright 2011 by the author(s)/owner(s).

AB - In many reinforcement learning (RL) systems, the value function is approximated as a linear combination of a fixed set of basis functions. Performance can be improved by adding to this set. Previous approaches construct a series of basis functions that in sufficient number can eventually represent the value function. In contrast, we show that there is a single, ideal basis function, which can directly represent the value function. Its addition to the set immediately reduces the error to zero, without changing existing weights. Moreover, this ideal basis function is simply the value function that results from replacing the MDP's reward function with its Bellman error. This result suggests a novel method for improving value-function estimation: a primary reinforcement learner estimates its value function using its present basis functions; it then sends its TD error to a secondary learner, which interprets that error as a reward function and estimates the corresponding value function; the resulting value function then becomes the primary learner's new basis function. We present both batch and online versions in combination with incremental basis projection, and demonstrate that the performance is superior to existing methods, especially in the case of large discount factors. Copyright 2011 by the author(s)/owner(s).

UR - http://www.scopus.com/inward/record.url?scp=80053457849&partnerID=8YFLogxK

M3 - Conference contribution

SN - 9781450306195

SP - 481

EP - 488

BT - Proceedings of the 28th International Conference on Machine Learning, ICML 2011

ER -