Experience Replay for Least-Squares Policy Iteration

Quan Liu; Xin Zhou; Fei Zhu; Qiming Fu; Yuchen Fu

Volume 1 Issue 3

Jul. 2014

IEEE/CAA Journal of Automatica Sinica



Quan Liu, Xin Zhou, Fei Zhu, Qiming Fu and Yuchen Fu, "Experience Replay for Least-Squares Policy Iteration," IEEE/CAA J. of Autom. Sinica, vol. 1, no. 3, pp. 274-281, 2014.



1. School of Computer Science and Technology, Soochow University, Jiangsu 215006, China, and also with the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China;
2. School of Computer Science and Technology, Soochow University, Jiangsu 215006, China;
3. School of Computer Science and Technology, Soochow University, Jiangsu 215006, China, and also with the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China;
4. School of Computer Science and Technology, Soochow University, Jiangsu 215006, China

Funds:

This work was supported by the National Natural Science Foundation of China (61303108, 61272005, 61373094, 61103045), the Natural Science Foundation of Jiangsu (BK2012616), the High School Natural Foundation of Jiangsu (13KJB520020), the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04), and the Suzhou Industrial Application of Basic Research Program (SYG201422).

Abstract


Policy iteration is a reinforcement learning method that alternately evaluates and improves the control policy. Policy evaluation with the least-squares method extracts more useful information from empirical data and therefore makes better use of that data. However, most existing online least-squares policy iteration methods use each sample only once, which results in a low utilization rate. To improve utilization efficiency, we propose experience replay for least-squares policy iteration (ERLSPI) and prove its convergence. The ERLSPI method combines online least-squares policy iteration with experience replay: it stores the samples generated online and reuses them with the least-squares method to update the control policy. We apply the ERLSPI method to the inverted pendulum system, a typical benchmark problem. The experimental results show that the method effectively exploits previous experience and knowledge, improves the utilization efficiency of empirical data, and accelerates convergence.

Key words:

- reinforcement learning
- experience replay
- least-squares
- policy iteration
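
The idea summarized in the abstract (store the samples generated online, then reuse them in least-squares policy evaluation to update the policy) can be illustrated with a minimal Python sketch. It is not the authors' ERLSPI algorithm and does not use the inverted pendulum. The toy chain environment, the one-hot features, the ridge term, the decaying exploration schedule, and all function names and hyperparameters below are assumptions made only for illustration.

```python
import numpy as np

def features(s, a, n_states, n_actions):
    """One-hot feature vector for a (state, action) pair (an assumed basis)."""
    phi = np.zeros(n_states * n_actions)
    phi[s * n_actions + a] = 1.0
    return phi

def greedy(theta, s, n_states, n_actions):
    """Greedy action under the linear Q-function Q(s, a) = phi(s, a) . theta."""
    q = [features(s, a, n_states, n_actions) @ theta for a in range(n_actions)]
    return int(np.argmax(q))

def lstdq_replay(buffer, theta, n_states, n_actions, gamma=0.9, reg=1e-3):
    """One least-squares evaluation pass over ALL stored samples.

    The evaluated policy is the one that is greedy with respect to `theta`.
    """
    k = n_states * n_actions
    A = reg * np.eye(k)  # small ridge term (assumption) keeps A invertible
    b = np.zeros(k)
    for s, a, r, s_next in buffer:
        phi = features(s, a, n_states, n_actions)
        a_next = greedy(theta, s_next, n_states, n_actions)
        phi_next = features(s_next, a_next, n_states, n_actions)
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi
    return np.linalg.solve(A, b)

def chain_step(s, a, n_states):
    """Hypothetical toy chain: action 1 moves right, action 0 moves left.

    The reward is 1 whenever the next state is the right end, else 0.
    """
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return reward, s_next

def run_sketch(n_states=5, n_actions=2, n_steps=300,
               update_every=50, n_replays=3, seed=0):
    """Collect samples online, store them, and replay them at each update."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_states * n_actions)
    buffer = []  # the stored experience
    s = 0
    for t in range(1, n_steps + 1):
        # Assumed schedule: exploration decays linearly from 1.0 to 0.1.
        epsilon = max(0.1, 1.0 - t / n_steps)
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))
        else:
            a = greedy(theta, s, n_states, n_actions)
        r, s_next = chain_step(s, a, n_states)
        buffer.append((s, a, r, s_next))
        s = s_next
        if t % update_every == 0:
            # Replay: every stored sample is reused in each of several
            # evaluation/improvement passes, not consumed just once.
            for _ in range(n_replays):
                theta = lstdq_replay(buffer, theta, n_states, n_actions)
    return theta, buffer
```

Calling `theta, buffer = run_sketch()` returns the learned weight vector and the stored samples. Each call to `lstdq_replay` re-evaluates the current greedy policy on the whole buffer, so older samples keep contributing to later policy updates.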


