A journal of IEEE and CAA , publishes high-quality papers in English on original theoretical/experimental research and development in all areas of automation
Volume 9 Issue 9
Sep.  2022

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 11.8, Top 4% (SCI Q1)
    CiteScore: 23.5, Top 2% (Q1)
    Google Scholar h5-index: 77, TOP 5
Turn off MathJax
Article Contents
W. Z. Liu, L. Dong, D. Niu, and C. Y. Sun, “Efficient exploration for multi-agent reinforcement learning via transferable successor features,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 9, pp. 1673–1686, Sept. 2022. doi: 10.1109/JAS.2022.105809
Citation: W. Z. Liu, L. Dong, D. Niu, and C. Y. Sun, “Efficient exploration for multi-agent reinforcement learning via transferable successor features,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 9, pp. 1673–1686, Sept. 2022. doi: 10.1109/JAS.2022.105809

Efficient Exploration for Multi-Agent Reinforcement Learning via Transferable Successor Features

doi: 10.1109/JAS.2022.105809
Funds:  This work was supported in part by the National Key R&D Program of China (2021ZD0112700, 2018AAA0101400), the National Natural Science Foundation of China (62173251, 61921004, U1713209), and the Natural Science Foundation of Jiangsu Province of China (BK20202006)
More Information
  • In multi-agent reinforcement learning (MARL), the behaviors of each agent can influence the learning of others, and the agents have to search in an exponentially enlarged joint-action space. Hence, it is challenging for the multi-agent teams to explore in the environment. Agents may achieve suboptimal policies and fail to solve some complex tasks. To improve the exploring efficiency as well as the performance of MARL tasks, in this paper, we propose a new approach by transferring the knowledge across tasks. Differently from the traditional MARL algorithms, we first assume that the reward functions can be computed by linear combinations of a shared feature function and a set of task-specific weights. Then, we define a set of basic MARL tasks in the source domain and pre-train them as the basic knowledge for further use. Finally, once the weights for target tasks are available, it will be easier to get a well-performed policy to explore in the target domain. Hence, the learning process of agents for target tasks is speeded up by taking full use of the basic knowledge that was learned previously. We evaluate the proposed algorithm on two challenging MARL tasks: cooperative box-pushing and non-monotonic predator-prey. The experiment results have demonstrated the improved performance compared with state-of-the-art MARL algorithms.

     

  • loading
  • 1 In this paper, we use the superscript i to represent all the corresponding variables and values of agent i.
    2 The code is available at: https://github.com/wenzhangliu/maddpg-sfkt.git
    3 The video of the simulations: https://youtu.be/w0kscgRTGz8
  • [1]
    V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015. doi: 10.1038/nature14236
    [2]
    K. Shao, Z. Tang, Y. Zhu, N. Li, and D. Zhao, “A survey of deep reinforcement learning in video games,” arXiv preprint arXiv: 1912.10944, 2019.
    [3]
    T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv: 1509.02971, 2015.
    [4]
    Y. Ouyang, L. Dong, L. Xue, and C. Sun, “Adaptive control based on neural networks for an uncertain 2DOF helicopter system with input deadzone and output constraints,” IEEE/CAA J. Autom. Sinica, vol. 6, no. 3, pp. 807–815, 2019. doi: 10.1109/JAS.2019.1911495
    [5]
    X. Li, L. Dong, and C. Sun, “Data-based optimal tracking of autonomous nonlinear switching systems,” IEEE/CAA J. Autom. Sinica, vol. 8, no. 1, pp. 227–238, 2020.
    [6]
    Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, “Target-driven visual navigation in indoor scenes using deep reinforcement learning,” in Proc. IEEE Int. Conf. Robotics and Automation, 2017, pp. 3357–3364.
    [7]
    H. Li, Q. Zhang, and D. Zhao, “Deep reinforcement learning-based automatic exploration for navigation in unknown environment,” IEEE Trans. Neural Networks and Learning Systems, vol. 31, no. 6, pp. 2064–2076, 2019.
    [8]
    M. G. Bellemare, S. Candido, S. Castro, J. Gong, M. C. Machado, S. Moitra, S. S. Ponda, and Z. Wang, “Autonomous navigation of stratospheric balloons using reinforcement learning,” Nature, vol. 588, no. 7836, pp. 77–82, 2020. doi: 10.1038/s41586-020-2939-8
    [9]
    T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, “Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications,” IEEE Trans. Cybernetics, vol. 50, no. 9, pp. 3826–3839, 2020. doi: 10.1109/TCYB.2020.2977374
    [10]
    L. Buşoniu, R. Babuška, and B. De Schutter, “Multi-agent reinforcement learning: An overview,” in Innovations in Multi-Agent Systems and Applications-1, Berlin Heidelberg, Germany: Springer, 2010, pp. 183–221.
    [11]
    P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote, “A survey of learning in multiagent environments: Dealing with non-stationarity,” arXiv preprint arXiv: 1707.09183, 2017.
    [12]
    G. Papoudakis, F. Christianos, A. Rahman, and S. V. Albrecht, “Dealing with non-stationarity in multiagent deep reinforcement learning,” arXiv preprint arXiv: 1906.04737, 2019.
    [13]
    J. Liu, Y. Zhang, Y. Yu, and C. Sun, “Fixed-time leaderfollower consensus of networked nonlinear systems via event/self-triggered control,” IEEE Trans. Neural Networks and Learning Systems, vol. 31, no. 11, pp. 5029–5037, 2020. doi: 10.1109/TNNLS.2019.2957069
    [14]
    W. Böhmer, T. Rashid, and S. Whiteson, “Exploration with unreliable intrinsic reward in multi-agent reinforcement learning,” arXiv preprint arXiv: 1906.02138, 2019.
    [15]
    M. Tan, “Multi-agent reinforcement learning: Independent vs. cooperative agents,” in Proc. Int. Conf. Machine Learning, 1993, pp. 330–337.
    [16]
    A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente, “Multiagent cooperation and competition with deep reinforcement learning,” PloS One, vol. 12, no. 4, p. e0172395, 2017.
    [17]
    F. A. Oliehoek, M. T. Spaan, and N. Vlassis, “Optimal and approximate q-value functions for decentralized pomdps,” J. Artificial Intelligence Research, vol. 32, pp. 289–353, 2008. doi: 10.1613/jair.2447
    [18]
    L. Kraemer and B. Banerjee, “Multi-agent reinforcement learning as a rehearsal for decentralized planning,” Neurocomputing, vol. 190, pp. 82–94, 2016. doi: 10.1016/j.neucom.2016.01.031
    [19]
    M. E. Taylor and Stone, “Transfer learning for reinforcement learning domains: A survey,” J. Machine Learning Research, vol. 10, no. 7, pp. 1633–1685, 2009.
    [20]
    R. Laroche and M. Barlier, “Transfer reinforcement learning with shared dynamics,” in Proc. AAAI Conf. Artificial Intelligence, 2017, pp. 2147–2153.
    [21]
    L. Steccanella, S. Totaro, D. Allonsius, and A. Jonsson, “Hierarchical reinforcement learning for efficient exploration and transfer,” arXiv preprint arXiv: 2011.06335, 2020.
    [22]
    T. D. Kulkarni, A. Saeedi, S. Gautam, and S. J. Gershman, “Deep successor reinforcement learning,” arXiv preprint arXiv: 1606.02396, 2016.
    [23]
    A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. P. van Hasselt, and D. Silver, “Successor features for transfer in reinforcement learning,” in Proc. Advances in Neural Information Processing Systems, 2017, pp. 4055–4065.
    [24]
    A. Barreto, D. Borsa, J. Quan, T. Schaul, D. Silver, M. Hessel, D. Mankowitz, A. Zidek, and R. Munos, “Transfer in deep reinforcement learning using successor features and generalised policy improvement,” in Proc. Int. Conf. Machine Learning, PMLR, 2018, pp. 501–510.
    [25]
    A. Barreto, D. Borsa, S. Hou, G. Comanici, E. Aygun, P. Hamel, D. Toyama, S. Mourad, D. Silver, and D. Precup, “The option keyboard: Combining skills in reinforcement learning,” in Proc. Advances in Neural Information Processing Systems, 2019, pp. 13052–13062.
    [26]
    A. Barreto, S. Hou, D. Borsa, D. Silver, and D. Precup, “Fast reinforcement learning with generalized policy updates,” Proc. National Academy of Sciences, vol. 117, no. 48, pp. 30079–30087, 2020. doi: 10.1073/pnas.1907370117
    [27]
    L. Lehnert and M. L. Littman, “Successor features combine elements of model-free and model-based reinforcement learning,” J. Machine Learning Research, vol. 21, no. 196, pp. 1–53, 2020.
    [28]
    R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Proc. Advances in Neural Information Processing Systems, 2017, pp. 6379–6390.
    [29]
    P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. F. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, and T. Graepel, “Value-decomposition networks for cooperative multi-agent learning based on team reward,” in Proc. Int. Conf. Autonomous Agents and Multi-Agent Systems, 2018, pp. 2085–2087.
    [30]
    T. Rashid, M. Samvelyan, C. Schroeder, G. Farquhar, J. Foerster, and S. Whiteson, “Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning,” in Proc. Int. Conf. Machine Learning, 2018, pp. 4295–4304.
    [31]
    H. Mao, Z. Zhang, Z. Xiao, and Z. Gong, “Modelling the dynamic joint policy of teammates with attention multiagent DDPG,” in Proc. Int. Conf. Autonomous Agents and MultiAgent Systems, 2019, pp. 1108–1116.
    [32]
    F. L. da Silva, G. Warnell, A. H. R. Costa, and P. Stone, “Agents teaching agents: A survey on inter-agent transfer learning,” in Proc. Int. Conf. Autonomous Agents and Multiagent Systems, 2020, pp. 2165–2167.
    [33]
    G. Boutsioukis, I. Partalas, and I. Vlahavas, “Transfer learning in multi-agent reinforcement learning domains,” in Proc. European Workshop on Reinforcement Learning, Springer, 2011, pp. 249–260.
    [34]
    F. L. Da Silva and A. H. R. Costa, “A survey on transfer learning for multiagent reinforcement learning systems,” J. Artificial Intelligence Research, vol. 64, pp. 645–703, 2019. doi: 10.1613/jair.1.11396
    [35]
    T. Yang, W. Wang, H. Tang, J. Hao, Z. Meng, H. Mao, D. Li, W. Liu, C. Zhang, Y. Hu, Y. Chen, and C. Fan, “Transfer among agents: An efficient multiagent transfer learning framework,” arXiv preprint arXiv: 2002.08030, 2020.
    [36]
    S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian, “Deep decentralized multi-task multi-agent reinforcement learning under partial observability,” in Proc. Int. Conf. Machine Learning, 2017, pp. 2681–2690.
    [37]
    Dayan, “Improving generalization for temporal difference learning: The successor representation,” Neural Computation, vol. 5, no. 4, pp. 613–624, 1993. doi: 10.1162/neco.1993.5.4.613
    [38]
    I. Momennejad, E. M. Russek, J. H. Cheong, M. M. Botvinick, N. D. Daw, and S. J. Gershman, “The successor representation in human reinforcement learning,” Nature Human Behaviour, vol. 1, no. 9, pp. 680–692, 2017. doi: 10.1038/s41562-017-0180-8
    [39]
    T. Gupta, A. Mahajan, B. Peng, W. Böhmer, and S. Whiteson, “Uneven: Universal value exploration for multi-agent reinforcement learning,” in Proc. Int. Conf. Machine Learning, 2021, pp. 3930–3941.
    [40]
    R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, USA: MIT Press, 2018.
    [41]
    T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in Proc. Int. Conf. Machine Learning, 2018, pp. 1856–1865.
    [42]
    M. L. Littman, “Markov games as a framework for multiagent reinforcement learning,” in Machine Learning Proceedings, San Francisco, USA: Elsevier, 1994, pp. 157–163.
    [43]
    J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” in Proc. AAAI Conf. Artificial Intelligence, 2018, pp. 2974–2982.
    [44]
    V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances in Neural Information Processing Systems, 2000, pp. 1008–1014.
    [45]
    Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015. doi: 10.1038/nature14539
    [46]
    L. Busoniu, R. Babuska, and B. De Schutter, “A comprehensive survey of multiagent reinforcement learning,” IEEE Trans. Systems,Man,and Cybernetics,Part C (Applications and Reviews), vol. 38, no. 2, pp. 156–172, 2008. doi: 10.1109/TSMCC.2007.913919
    [47]
    Y. Wang, L. Dong, and C. Sun, “Cooperative control for multi-player pursuit-evasion games with reinforcement learning,” Neurocomputing, vol. 412, pp. 101–114, 2020. doi: 10.1016/j.neucom.2020.06.031
    [48]
    T. Madarasz and T. E. J. Behrens, “Better transfer learning with inferred successor maps,” in Proc. Advances in Neural Information Processing Systems, 2019, pp. 9026–9037.
    [49]
    A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. Int. Conf. Machine Learning, 2013, vol. 30.

Catalog

    通讯作者: 陈斌, bchen63@163.com
    • 1. 

      沈阳化工大学材料科学与工程学院 沈阳 110142

    1. 本站搜索
    2. 百度学术搜索
    3. 万方数据库搜索
    4. CNKI搜索

    Figures(9)  / Tables(1)

    Article Metrics

    Article views (769) PDF downloads(132) Cited by()

    Highlights

    • Describe the tasks in multi-agent systems with successor features
    • Propose multi-agent successor features which is approximated by a global successor feature network
    • Propose a new algorithms called multi-agent deep deterministic policy gradient with successor features (MADDPG-SFs)
    • Improve the exploring efficiency of the MARL by transferring the knowledge of successor features that was learned before
    • Propose a new algorithm called multi-agent deep deterministic policy gradient with successor features and knowledge transfer (MADDPG-SFKT)

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return