Efficient Exploration for Multi-Agent Reinforcement Learning via Transferable Successor Features

Wenzhang Liu; Lu Dong; Dan Niu; Changyin Sun

doi:10.1109/JAS.2022.105809

Volume 9 Issue 9

Sep. 2022

IEEE/CAA Journal of Automatica Sinica



W. Z. Liu, L. Dong, D. Niu, and C. Y. Sun, “Efficient exploration for multi-agent reinforcement learning via transferable successor features,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 9, pp. 1673–1686, Sept. 2022. doi: 10.1109/JAS.2022.105809


Wenzhang Liu^1, Lu Dong^2, Dan Niu^3, Changyin Sun^4

1. School of Artificial Intelligence, Anhui University, Hefei 230039, and also with the Peng Cheng Laboratory, Shenzhen 518055, China
2. School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China
3. School of Automation, Southeast University, Nanjing 210096, China
4. School of Automation, Southeast University, Nanjing 210096, and also with the Peng Cheng Laboratory, Shenzhen 518055, China

Funds: This work was supported in part by the National Key R&D Program of China (2021ZD0112700, 2018AAA0101400), the National Natural Science Foundation of China (62173251, 61921004, U1713209), and the Natural Science Foundation of Jiangsu Province of China (BK20202006)


Author Bio:
Wenzhang Liu received the B.S. degree in engineering from the College of Communication Engineering, Jilin University, in 2016, and the Ph.D. degree in engineering from the School of Automation, Southeast University, in 2021. He is currently a Postdoctoral Fellow with the School of Artificial Intelligence, Anhui University. His current research interests include multi-agent reinforcement learning, transfer reinforcement learning, and optimization.

Lu Dong (Member, IEEE) received the B.S. degree in physics and the Ph.D. degree in electrical engineering from Southeast University, in 2012 and 2017, respectively. She is currently an Associate Professor with the School of Cyber Science and Engineering, Southeast University. Her current research interests include adaptive dynamic programming, event-triggered control, nonlinear system control, and optimization.

Dan Niu (Member, IEEE) received the Ph.D. degree in electronic science and technology from the Graduate School of Information, Production and Systems, Waseda University, Japan, in 2013. He is an Associate Professor with the School of Automation, Southeast University. His research interests include AI for industrial control and AI for EDA.

Changyin Sun (Senior Member, IEEE) received the B.S. degree in applied mathematics from the College of Mathematics, Sichuan University, in 1996, and the M.S. and Ph.D. degrees in electrical engineering from Southeast University, in 2001 and 2004, respectively. He is currently a Professor with the School of Automation, Southeast University. His current research interests include intelligent control, flight control, and optimal theory. He is an Associate Editor of IEEE Transactions on Neural Networks and Learning Systems, Neural Processing Letters, and IEEE/CAA Journal of Automatica Sinica.
Corresponding author: Changyin Sun, e-mail: cysun@seu.edu.cn
¹ In this paper, we use the superscript i to represent all the corresponding variables and values of agent i.
² The code is available at: https://github.com/wenzhangliu/maddpg-sfkt.git
³ The video of the simulations: https://youtu.be/w0kscgRTGz8
Received Date: 2021-06-27
Revised Date: 2021-11-29
Accepted Date: 2022-02-22

Available Online: 2022-04-26

Abstract

In multi-agent reinforcement learning (MARL), the behavior of each agent can influence the learning of the others, and the agents have to search a joint-action space that grows exponentially with the number of agents. Hence, it is challenging for multi-agent teams to explore the environment: agents may converge to suboptimal policies and fail to solve some complex tasks. To improve both the exploration efficiency and the performance on MARL tasks, we propose a new approach that transfers knowledge across tasks. Unlike traditional MARL algorithms, we first assume that the reward functions can be computed as linear combinations of a shared feature function and a set of task-specific weights. Then, we define a set of basic MARL tasks in the source domain and pre-train on them to obtain basic knowledge for later use. Finally, once the weights for the target tasks are available, it is easier to obtain a well-performing policy for exploring the target domain. The learning process of the agents on the target tasks is thus sped up by making full use of the previously learned basic knowledge. We evaluate the proposed algorithm on two challenging MARL tasks: cooperative box-pushing and non-monotonic predator-prey. The experimental results demonstrate improved performance compared with state-of-the-art MARL algorithms.
Keywords: knowledge transfer, multi-agent systems, reinforcement learning, successor features
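
The linear-reward assumption in the abstract can be illustrated with a small sketch. The example below is not the authors' MADDPG-SFKT implementation: it is a tabular, single-agent toy in which the transition tensor `P`, the feature tensor `Phi`, the function names, and the random source policies are all assumptions made for illustration. It shows that, if rewards are written as r(s, a) = phi(s, a) · w, the successor features of a pre-trained policy give its action values for any new weight vector w through a single dot product, and that a starting policy for the target task can be picked greedily across the source policies (generalized policy improvement, a technique from the successor-features literature that is swapped in here for simplicity).

```python
import numpy as np


def successor_features(P, Phi, pi, gamma=0.9):
    """Exact successor features psi^pi(s, a) for a deterministic policy.

    P:   (S, A, S) transition probabilities (hypothetical toy dynamics)
    Phi: (S, A, d) shared feature function phi(s, a)
    pi:  (S,) action index chosen in each state
    """
    S = P.shape[0]
    idx = np.arange(S)
    P_pi = P[idx, pi]      # (S, S) state transitions under pi
    phi_pi = Phi[idx, pi]  # (S, d) features collected under pi
    # Solve psi(s) = phi_pi(s) + gamma * sum_s' P_pi(s, s') psi(s')
    psi_state = np.linalg.solve(np.eye(S) - gamma * P_pi, phi_pi)
    # One-step lookahead gives the state-action successor features
    return Phi + gamma * np.einsum("sat,td->sad", P, psi_state)


def q_from_sf(psi, w):
    """Action values Q(s, a) = psi(s, a) . w for task weights w."""
    return psi @ w


def gpi_action(psis, w, s):
    """Greedy action in state s over all source policies' Q-values."""
    q = np.stack([psi[s] @ w for psi in psis])  # (n_policies, A)
    return int(q.max(axis=0).argmax())


# Toy setup: random dynamics, random features, three "source" policies.
rng = np.random.default_rng(0)
S, A, d = 5, 3, 4
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)
Phi = rng.random((S, A, d))
source_policies = [rng.integers(A, size=S) for _ in range(3)]
psis = [successor_features(P, Phi, pi) for pi in source_policies]

# A new target task only changes the weights, not the learned psi.
w_target = rng.normal(size=d)
actions = [gpi_action(psis, w_target, s) for s in range(S)]
```

The point of the design is that `psis` is computed once in the source domain and reused: evaluating a target task costs only `psi @ w_target`, which is what makes a reasonable exploration policy available before any learning on the target task has taken place.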




Highlights

Describe the tasks in multi-agent systems with successor features
Propose multi-agent successor features, which are approximated by a global successor feature network
Propose a new algorithm called multi-agent deep deterministic policy gradient with successor features (MADDPG-SFs)
Improve the exploration efficiency of MARL by transferring previously learned knowledge of successor features
Propose a new algorithm called multi-agent deep deterministic policy gradient with successor features and knowledge transfer (MADDPG-SFKT)
