IEEE/CAA Journal of Automatica Sinica
Citation: Dimitri Bertsekas, "Multiagent Reinforcement Learning: Rollout and Policy Iteration," IEEE/CAA J. Autom. Sinica, vol. 8, no. 2, pp. 249–272, Feb. 2021. doi: 10.1109/JAS.2021.1003814
[1] 
D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I. 4th ed. Belmont, USA: Athena Scientific, 2017.

[2] 
D. P. Bertsekas, Reinforcement Learning and Optimal Control. Belmont, USA: Athena Scientific, 2019.

[3] 
D. P. Bertsekas, Rollout, Policy Iteration, and Distributed Reinforcement Learning. Belmont, USA: Athena Scientific, 2020.

[4] 
D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, "Mastering chess and Shogi by self-play with a general reinforcement learning algorithm," arXiv preprint arXiv: 1712.01815, 2017.

[5] 
D. P. Bertsekas, "Multiagent value iteration algorithms in dynamic programming and reinforcement learning," arXiv: 2005.01627, 2020.

[6] 
D. P. Bertsekas, "Constrained multiagent rollout and multidimensional assignment with the auction algorithm," arXiv: 2002.07407, 2020.

[7] 
D. P. Bertsekas, "Distributed dynamic programming," IEEE Trans. Autom. Control, vol. 27, no. 3, pp. 610–616, Jun. 1982.

[8] 
D. P. Bertsekas, "Asynchronous distributed computation of fixed points," Math. Programming, vol. 27, no. 1, pp. 107–120, Sep. 1983.

[9] 
D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Englewood Cliffs, USA: Prentice-Hall, 1989.

[10] 
D. P. Bertsekas and H. Z. Yu, "Asynchronous distributed policy iteration in dynamic programming," in Proc. 48th Annu. Allerton Conf. Communication, Control, and Computing, Allerton, USA, 2010, pp. 1368–1374.

[11] 
D. P. Bertsekas and H. Z. Yu, "Q-learning and enhanced policy iteration in discounted dynamic programming," Math. Oper. Res., vol. 37, pp. 66–94, Feb. 2012.

[12] 
H. Z. Yu and D. P. Bertsekas, "Q-learning and policy iteration algorithms for stochastic shortest path problems," Ann. Oper. Res., vol. 208, no. 1, pp. 95–132, Sep. 2013.

[13] 
D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II. 4th ed. Belmont, USA: Athena Scientific, 2012.

[14] 
D. P. Bertsekas, Abstract Dynamic Programming. Belmont, USA: Athena Scientific, 2018.

[15] 
S. Bhattacharya, S. Badyal, T. Wheeler, S. Gil, and D. P. Bertsekas, "Reinforcement learning for POMDP: Partitioned rollout and policy iteration with application to autonomous sequential repair problems," IEEE Rob. Autom. Lett., vol. 5, no. 3, pp. 3967–3974, Jul. 2020.

[16] 
H. S. Witsenhausen, "A counterexample in stochastic optimum control," SIAM J. Control, vol. 6, no. 1, pp. 131–147, 1968. http://www.ams.org/mathscinet-getitem?mr=231649

[17] 
H. S. Witsenhausen, "Separation of estimation and control for discrete time systems," Proc. IEEE, vol. 59, no. 11, pp. 1557–1566, Nov. 1971.

[18] 
J. Marschak, "Elements for a theory of teams," Manage. Sci., vol. 1, no. 2, pp. 127–137, Jan. 1955.

[19] 
R. Radner, "Team decision problems," Ann. Math. Statist., vol. 33, no. 3, pp. 857–881, Sep. 1962.

[20] 
H. S. Witsenhausen, "On information structures, feedback and causality," SIAM J. Control, vol. 9, no. 2, pp. 149–160, 1971. http://www.researchgate.net/publication/250955878_On_Information_Structures_Feedback_and_Causality

[21] 
J. Marschak and R. Radner, Economic Theory of Teams. New Haven, USA: Yale University Press, 1976.

[22] 
N. Sandell, P. Varaiya, M. Athans, and M. Safonov, "Survey of decentralized control methods for large scale systems," IEEE Trans. Autom. Control, vol. 23, no. 2, pp. 108–128, Apr. 1978.

[23] 
T. Yoshikawa, "Decomposition of dynamic team decision problems," IEEE Trans. Autom. Control, vol. 23, no. 4, pp. 627–632, Aug. 1978.

[24] 
Y. C. Ho, "Team decision theory and information structures," Proc. IEEE, vol. 68, no. 6, pp. 644–654, Jun. 1980.

[25] 
D. Bauso and R. Pesenti, "Generalized person-by-person optimization in team problems with binary decisions," in Proc. American Control Conf., Seattle, USA, 2008, pp. 717–722.

[26] 
D. Bauso and R. Pesenti, "Team theory and person-by-person optimization with binary decisions," SIAM J. Control Optim., vol. 50, no. 5, pp. 3011–3028, Jan. 2012.

[27] 
A. Nayyar, A. Mahajan, and D. Teneketzis, "Decentralized stochastic control with partial history sharing: A common information approach," IEEE Trans. Autom. Control, vol. 58, no. 7, pp. 1644–1658, Jul. 2013.

[28] 
A. Nayyar and D. Teneketzis, "Common knowledge and sequential team problems," IEEE Trans. Autom. Control, vol. 64, no. 12, pp. 5108–5115, Dec. 2019.

[29] 
Y. Y. Li, Y. J. Tang, R. Y. Zhang, and N. Li, "Distributed reinforcement learning for decentralized linear quadratic control: A derivative-free policy optimization approach," arXiv: 1912.09135, 2019.

[30] 
G. Qu and N. Li, "Exploiting Fast Decaying and Locality in Multi-Agent MDP with Tree Dependence Structure," in Proc. of CDC, Nice, France, 2019.

[31] 
A. Gupta, "Existence of team-optimal solutions in static teams with common information: A topology of information approach," SIAM J. Control Optim., vol. 58, no. 2, pp. 998–1021, Apr. 2020.

[32] 
F. Bullo, J. Cortes, and S. Martinez, Distributed Control of Robotic Networks: A Mathematical Approach to Motion Coordination Algorithms. Princeton, USA: Princeton University Press, 2009.

[33] 
M. Mesbahi and M. Egerstedt, Graph Theoretic Methods in Multiagent Networks. Princeton, USA: Princeton University Press, 2010.

[34] 
M. S. Mahmoud, Multiagent Systems: Introduction and Coordination Control. Boca Raton, USA: CRC Press, 2020.

[35] 
R. Zoppoli, M. Sanguineti, G. Gnecco, and T. Parisini, Neural Approximations for Optimal Control and Decision, Springer, 2020. http://www.researchgate.net/publication/338322921_Neural_Approximations_for_Optimal_Control_and_Decision

[36] 
F. A. Oliehoek and C. Amato, A Concise Introduction to Decentralized POMDPs, Springer International Publishing, 2016. doi: 10.1007/978-3-319-28929-8

[37] 
P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote, "A survey of learning in multiagent environments: Dealing with non-stationarity," arXiv: 1707.09183, 2017.

[38] 
K. Q. Zhang, Z. R. Yang, and T. Başar, "Multiagent reinforcement learning: A selective overview of theories and algorithms," arXiv: 1911.10635, 2019.

[39] 
L. S. Shapley, "Stochastic games," Proc. Natl. Acad. Sci., vol. 39, no. 10, pp. 1095–1100, 1953.

[40] 
M. L. Littman, "Markov games as a framework for multiagent reinforcement learning," in Machine Learning Proceedings 1994, W. W. Cohen and H. Hirsh, Eds. Amsterdam, The Netherlands: Elsevier, 1994, pp. 157–163.

[41] 
K. P. Sycara, "Multiagent systems," AI Mag., vol. 19, no. 2, pp. 79–92, Jun. 1998.

[42] 
P. Stone and M. Veloso, "Multiagent systems: A survey from a machine learning perspective," Auton. Rob., vol. 8, no. 3, pp. 345–383, Jun. 2000.

[43] 
L. Panait and S. Luke, "Cooperative multiagent learning: The state of the art," Auton. Agent. Multi-Agent Syst., vol. 11, no. 3, pp. 387–434, Nov. 2005.

[44] 
L. Busoniu, R. Babuska, and B. De Schutter, "A comprehensive survey of multiagent reinforcement learning," IEEE Trans. Syst., Man, Cybern., Part C, vol. 38, no. 2, pp. 156–172, Mar. 2008.

[45] 
L. Busoniu, R. Babuška, and B. De Schutter, "Multiagent reinforcement learning: An overview," in Innovations in Multi-Agent Systems and Applications-1, D. Srinivasan and L. C. Jain, Eds. Berlin, Germany: Springer, 2010, pp. 183–221.

[46] 
L. Matignon, G. J. Laurent, and N. Le Fort-Piat, "Independent reinforcement learners in cooperative Markov games: A survey regarding coordination problems," Knowl. Eng. Rev., vol. 27, no. 1, pp. 1–31, Feb. 2012.

[47] 
P. Hernandez-Leal, B. Kartal, and M. E. Taylor, "A survey and critique of multiagent deep reinforcement learning," Auton. Agent. Multi-Agent Syst., vol. 33, no. 6, pp. 750–797, Oct. 2019.

[48] 
A. OroojlooyJadid and D. Hajinezhad, "A review of cooperative multiagent deep reinforcement learning," arXiv: 1908.03963, 2019.

[49] 
T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, "Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications," IEEE Trans. Cybern., vol. 50, no. 9, pp. 3826–3839, Sep. 2020.

[50] 
G. Tesauro, "Extending Q-learning to general adaptive multiagent systems," in Proc. 16th Int. Conf. Neural Information Processing Systems, 2004, pp. 871–878.

[51] 
F. A. Oliehoek, J. F. P. Kooij, and N. Vlassis, "The cross-entropy method for policy search in decentralized POMDPs," Informatica, vol. 32, no. 4, pp. 341–357, 2008. http://www.ams.org/mathscinet-getitem?mr=2481391

[52] 
P. Pennesi and I. C. Paschalidis, "A distributed actor-critic algorithm and applications to mobile sensor network coordination problems," IEEE Trans. Autom. Control, vol. 55, no. 2, pp. 492–497, Feb. 2010.

[53] 
I. C. Paschalidis and Y. W. Lin, "Mobile agent coordination via a distributed actor-critic algorithm," in Proc. 19th Mediterranean Conf. Control Automation, Corfu, Greece, 2011, pp. 644–649.

[54] 
S. Kar, J. M. F. Moura, and H. V. Poor, "QD-Learning: A collaborative distributed strategy for multiagent reinforcement learning through consensus + innovations," IEEE Trans. Signal Process., vol. 61, no. 7, pp. 1848–1862, Apr. 2013.

[55] 
J. N. Foerster, Y. M. Assael, N. De Freitas, and S. Whiteson, "Learning to Communicate with Deep Multi-Agent Reinforcement Learning," in Proc. 30th Int. Conf. Neural Information Processing Systems, Barcelona, Spain, 2016, pp. 2137–2145.

[56] 
S. Omidshafiei, A. A. Agha-Mohammadi, C. Amato, S. Y. Liu, J. P. How, and J. Vian, "Graph-based cross entropy method for solving multi-robot decentralized POMDPs," in Proc. IEEE Int. Conf. Robotics and Automation, Stockholm, Sweden, 2016, pp. 5395–5402.

[57] 
J. K. Gupta, M. Egorov, and M. Kochenderfer, "Cooperative multiagent control using deep reinforcement learning," in Proc. Int. Conf. Autonomous Agents and Multiagent Systems, Best Papers, Brazil, 2017, pp. 66–83.

[58] 
R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, "Multiagent actor-critic for mixed cooperative-competitive environments," in Proc. 31st Int. Conf. Neural Information Processing Systems, Long Beach, USA, 2017, pp. 6379–6390.

[59] 
M. Zhou, Y. Chen, Y. Wen, Y. D. Yang, Y. F. Su, W. N. Zhang, D. Zhang, and J. Wang, "Factorized Q-learning for large-scale multiagent systems," arXiv: 1809.03738, 2018.

[60] 
K. Q. Zhang, Z. R. Yang, H. Liu, T. Zhang, and T. Başar, "Fully decentralized multiagent reinforcement learning with networked agents," arXiv: 1802.08757, 2018.

[61] 
Y. Zhang and M. M. Zavlanos, "Distributed off-policy actor-critic reinforcement learning with policy consensus," in Proc. IEEE 58th Conf. Decision and Control, Nice, France, 2019, pp. 4674–4679.

[62] 
C. S. de Witt, J. N. Foerster, G. Farquhar, P. H. S. Torr, W. Boehmer, and S. Whiteson, "Multiagent common knowledge reinforcement learning," in Proc. 33rd Int. Conf. Neural Information Processing Systems, Vancouver, Canada, 2019, pp. 9927–9939.

[63] 
D. P. Bertsekas, "Multiagent rollout algorithms and reinforcement learning," arXiv: 1910.00120, 2019.

[64] 
S. Bhattacharya, S. Kailas, S. Badyal, S. Gil, and D. P. Bertsekas, "Multiagent rollout and policy iteration for POMDP with application to multi-robot repair problems," in Proc. Conf. Robot Learning, 2020; also arXiv preprint, arXiv: 2011.04222.

[65] 
D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, USA: Athena Scientific, 1996.

[66] 
G. Tesauro and G. R. Galperin, "On-line policy improvement using Monte-Carlo search," in Proc. 9th Int. Conf. Neural Information Processing Systems, Denver, USA, 1996, pp. 1068–1074.

[67] 
D. P. Bertsekas, Nonlinear Programming. 3rd ed. Belmont, USA: Athena Scientific, 2016.

[68] 
M. G. Lagoudakis and R. Parr, "Reinforcement learning as classification: Leveraging modern classifiers," in Proc. 20th Int. Conf. Machine Learning, Washington, USA, 2003, pp. 424–431.

[69] 
C. Dimitrakakis and M. G. Lagoudakis, "Rollout sampling approximate policy iteration," Mach. Learn., vol. 72, no. 3, pp. 157–171, Jul. 2008.

[70] 
A. Lazaric, M. Ghavamzadeh, and R. Munos, "Analysis of a classification-based policy iteration algorithm," in Proc. 27th Int. Conf. Machine Learning, Haifa, Israel, 2010.

[71] 
P. Abbeel and A. Y. Ng, "Apprenticeship learning via inverse reinforcement learning," in Proc. 21st Int. Conf. Machine Learning, Banff, Canada, 2004.

[72] 
B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration," Rob. Auton. Syst., vol. 57, no. 5, pp. 469–483, May 2009.

[73] 
G. Neu and C. Szepesvari, "Apprenticeship learning using inverse reinforcement learning and gradient methods," arXiv: 1206.5264, 2012.

[74] 
H. Ben Amor, D. Vogt, M. Ewerton, E. Berger, B. Jung, and J. Peters, "Learning responsive robot behavior by imitation," in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems, Tokyo, Japan, 2013, pp. 3257–3264.

[75] 
J. Lee, "A survey of robot learning from demonstrations for human-robot collaboration," arXiv: 1710.08789, 2017.

[76] 
M. K. Hanawal, H. Liu, H. H. Zhu, and I. C. Paschalidis, "Learning policies for Markov decision processes from data," IEEE Trans. Autom. Control, vol. 64, no. 6, pp. 2298–2309, Jun. 2019.

[77] 
D. Gagliardi and G. Russo, "On a probabilistic approach to synthesize control policies from example datasets," arXiv: 2005.11191, 2020.

[78] 
T. T. Xu, H. H. Zhu, and I. C. Paschalidis, "Learning parametric policies and transition probability models of Markov decision processes from data," Eur. J. Control, 2020.

[79] 
R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, USA: MIT Press, 2018.

[80] 
D. P. Bertsekas, "Feature-based aggregation and deep reinforcement learning: A survey and some new implementations," IEEE/CAA J. Autom. Sinica, vol. 6, no. 1, pp. 1–31, Jan. 2019.

[81] 
D. P. Bertsekas, "Approximate policy iteration: A survey and some new methods," J. Control Theory Appl., vol. 9, no. 3, pp. 310–335, Jul. 2011; Expanded version appears as Lab. for Info. and Decision System Report LIDS-2833, MIT, 2011.

[82] 
J. N. Tsitsiklis, "Asynchronous stochastic approximation and Q-learning," Mach. Learn., vol. 16, no. 3, pp. 185–202, Sep. 1994.

[83] 
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015. http://europepmc.org/abstract/med/25719670
