Fairbank, M.; Alonso, E. (2012)
Publisher: IEEE Press
Languages: English
Types: Unknown
Subjects: QA75
We describe an Adaptive Dynamic Programming algorithm, VGL(λ), for learning a critic function over a large continuous state space. The algorithm, which requires a learned model of the environment, extends Dual Heuristic Dynamic Programming to include a bootstrapping parameter analogous to that used in the reinforcement learning algorithm TD(λ). We provide on-line and batch-mode implementations of the algorithm, and summarise the theoretical relationships and motivations for using this method over its precursor algorithms, Dual Heuristic Dynamic Programming and TD(λ). Experiments for control problems using a neural network and a greedy policy are provided.
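The abstract describes a critic that learns the gradient of the value function, with targets formed by blending a bootstrapped critic estimate and a λ-weighted return, in analogy to TD(λ). The following is only a minimal sketch of that general idea on a hypothetical toy problem — scalar linear dynamics s' = c·s, reward r = −s², and a linear critic G(s) = w·s are illustrative assumptions, not details taken from the paper:

```python
# Sketch of a value-gradient critic update with a TD(lambda)-style
# bootstrapping parameter, on an assumed toy problem (not from the paper):
# deterministic scalar dynamics s' = c*s (policy folded into c),
# reward r = -s^2, discount gamma, and a linear critic G(s) = w*s
# approximating the value gradient dV/ds.

gamma, lam, c = 0.9, 0.7, 0.8   # discount, bootstrapping parameter, closed-loop gain
w = 0.0                          # critic weight
alpha = 0.05                     # learning rate

for episode in range(200):
    s = 1.0
    traj = []
    for t in range(20):
        s_next = c * s           # learned model prediction (here exact)
        dr_ds = -2.0 * s         # gradient of reward with respect to state
        ds_next_ds = c           # model Jacobian ds'/ds
        traj.append((s, dr_ds, ds_next_ds, s_next))
        s = s_next

    # Backward pass: blend the bootstrapped critic gradient with the
    # lambda-weighted target gradient propagated from later time steps.
    G_next = w * s               # critic's gradient estimate at the final state
    for (s_t, dr_ds, J, s_n) in reversed(traj):
        G_tilde = w * s_n        # critic's own estimate at the next state
        G_target = dr_ds + gamma * J * (lam * G_next + (1 - lam) * G_tilde)
        w += alpha * s_t * (G_target - w * s_t)   # gradient step on squared error
        G_next = G_target
```

For this toy system the true value gradient is dV/ds = −2s / (1 − γc²), so w should settle near −2 / (1 − 0.9·0.64) ≈ −4.72; with λ = 0 the backward pass reduces to a purely bootstrapped (DHP-style) target, and with λ = 1 it propagates the full model-based return gradient.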
