Fairbank, M.; Alonso, E.; Prokhorov, D. (2013)
Languages: English
Types: Article
Subjects: RC0321, BF, QA75
We consider the adaptive dynamic programming technique called Dual Heuristic Programming (DHP), which is designed to learn a critic function when using learned model functions of the environment. DHP is designed for optimizing control problems in large and continuous state spaces. We extend DHP into a new algorithm that we call Value-Gradient Learning, VGL(λ), and prove that an instance of the new algorithm is equivalent to Backpropagation Through Time for Control with a greedy policy. Not only does this equivalence provide a link between these two different approaches, but it also gives our variant of DHP guaranteed convergence, under certain smoothness conditions and a greedy policy, when using a general smooth nonlinear function approximator for the critic. We consider several experimental scenarios, including some that demonstrate divergence of DHP under a greedy policy, in contrast to our provably convergent algorithm.
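The abstract does not state the update rule, so the following is only a minimal sketch of how a λ-blended value-gradient target could be computed along one trajectory. It assumes the recursion G'_t = (DU/Dx)_t + γ (Df/Dx)_t (λ G'_{t+1} + (1 − λ) G̃_{t+1}), where G̃ is the critic's output. Under that assumption, λ = 0 gives a DHP-style one-step target and λ = 1 gives a full backward pass of the kind used in Backpropagation Through Time. The function name `vgl_targets`, its arguments, and the zero terminal gradient are illustrative assumptions, not the authors' code.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def mat_vec(m: Matrix, v: Sequence[float]) -> Vector:
    """Return the matrix-vector product m @ v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def vgl_targets(
    du_dx: List[Vector],
    df_dx: List[Matrix],
    critic_grads: List[Vector],
    lam: float,
    gamma: float = 1.0,
) -> List[Vector]:
    """Compute lambda-blended target value gradients by a backward sweep.

    du_dx[t]        : gradient of the step cost w.r.t. the state at step t (t = 0..T-1).
    df_dx[t]        : matrix whose entry [i][j] is d f_j / d x_i at step t, supplied by
                      the caller (assumed to already include any dependence on the policy).
    critic_grads[t] : critic output at the state visited at step t (t = 0..T).
    Assumption: there is no terminal cost, so the target at the final state is zero.
    """
    horizon = len(du_dx)
    dim = len(du_dx[0])
    targets: List[Vector] = [[0.0] * dim for _ in range(horizon + 1)]
    for t in range(horizon - 1, -1, -1):
        # Blend the bootstrapped target with the critic's own estimate at step t+1.
        blended = [
            lam * g_target + (1.0 - lam) * g_critic
            for g_target, g_critic in zip(targets[t + 1], critic_grads[t + 1])
        ]
        propagated = mat_vec(df_dx[t], blended)
        targets[t] = [u + gamma * p for u, p in zip(du_dx[t], propagated)]
    return targets


# Hypothetical scalar example: x_{t+1} = 0.5 * x_t, step cost x_t ** 2, x_0 = 2.
step_cost_grads = [[4.0], [2.0], [1.0]]            # 2 * x_t for x = 2, 1, 0.5
model_jacobians = [[[0.5]], [[0.5]], [[0.5]]]
critic_outputs = [[0.0], [10.0], [20.0], [30.0]]   # arbitrary critic estimates
full_backward = vgl_targets(step_cost_grads, model_jacobians, critic_outputs, lam=1.0)
one_step = vgl_targets(step_cost_grads, model_jacobians, critic_outputs, lam=0.0)
```

With λ = 1 the critic's outputs drop out of the targets entirely, so `full_backward[0]` is the exact gradient of the three-step cost with respect to the initial state. With λ = 0 each target depends only on the critic's estimate one step ahead.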