Multiagent Reinforcement Learning:Rollout and Policy Iteration

Dimitri Bertsekas

doi:10.1109/JAS.2021.1003814

Volume 8 Issue 2

Feb. 2021

IEEE/CAA Journal of Automatica Sinica

JCR Impact Factor: 19.2, Top 1 (SCI Q1)

CiteScore: 28.2, Top 1% (Q1)
Google Scholar h5-index: 95， TOP 5

Turn off MathJax

Article Contents

Article Navigation > IEEE/CAA Journal of Automatica Sinica > 2021 > 8(2): 249-272

Dimitri Bertsekas, "Multiagent Reinforcement Learning:Rollout and Policy Iteration," IEEE/CAA J. Autom. Sinica, vol. 8, no. 2, pp. 249-272, Feb. 2021. doi: 10.1109/JAS.2021.1003814

Citation:

Dimitri Bertsekas, "Multiagent Reinforcement Learning:Rollout and Policy Iteration," IEEE/CAA J. Autom. Sinica, vol. 8, no. 2, pp. 249-272, Feb. 2021. doi: 10.1109/JAS.2021.1003814

Dimitri Bertsekas, "Multiagent Reinforcement Learning:Rollout and Policy Iteration," IEEE/CAA J. Autom. Sinica, vol. 8, no. 2, pp. 249-272, Feb. 2021. doi: 10.1109/JAS.2021.1003814

Citation:

Dimitri Bertsekas, "Multiagent Reinforcement Learning:Rollout and Policy Iteration," IEEE/CAA J. Autom. Sinica, vol. 8, no. 2, pp. 249-272, Feb. 2021. doi: 10.1109/JAS.2021.1003814

PDF( 2312 KB)

Multiagent Reinforcement Learning:Rollout and Policy Iteration

doi: 10.1109/JAS.2021.1003814

Dimitri Bertsekas

Arizona State University (ASU), Tempe, AZ 85281 USA, and also with Massachusetts Institute of Technology (MIT), Cambridge, MA 02139 USA

More Information

Author Bio:
Dimitri Bertsekas undergraduate studies were in engineering at the National Technical University of Athens, Greece. He obtained his MS in electrical engineering at the George Washington University, Wash. DC in 1969, and his Ph.D. in system science in 1971 at the Massachusetts Institute of Technology.
Dr. Bertsekas has held faculty positions with the Engineering-Economic Systems Dept., Stanford University (1971-1974) and the Electrical Engineering Dept. of the University of Illinois, Urbana (1974-1979). From 1979 to 2019 he was with the Electrical Engineering and Computer Science Department of the Massachusetts Institute of Technology (M.I.T.), where he served as McAfee Professor of Engineering. In 2019, he was appointed Fulton Professor of Computational Decision Making, and a full time faculty member at the department of Computer, Information, and Decision Systems Engineering at Arizona State University, Tempe, while maintaining a research position at MIT. His research spans several fields, including optimization, control, large-scale computation, and data communication networks, and is closely tied to his teaching and book authoring activities. He has written numerous research papers, and eighteen books and research monographs, several of which are used as textbooks in MIT classes. Most recently Dr Bertsekas has been focusing on reinforcement learning, and authored a textbook in 2019, and a research monograph on its distributed and multiagent implementation aspects in 2020.
Professor Bertsekas was awarded the INFORMS 1997 Prize for Research Excellence in the Interface Between Operations Research and Computer Science for his book "Neuro-Dynamic Programming", the 2000 Greek National Award for Operations Research, the 2001 ACC John R. Ragazzini Education Award, the 2009 INFORMS Expository Writing Award, the 2014 ACC Richard E. Bellman Control Heritage Award for "contributions to the foundations of deterministic and stochastic optimization-based methods in systems and control, " the 2014 Khachiyan Prize for Life-Time Accomplishments in Optimization, and the SIAM/MOS 2015 George B. Dantzig Prize. In 2018, he was awarded, jointly with his coauthor John Tsitsiklis, the INFORMS John von Neumann Theory Prize, for the contributions of the research monographs "Parallel and Distributed Computation" and "Neuro-Dynamic Programming". In 2001, he was elected to the United States National Academy of Engineering for "pioneering contributions to fundamental research, practice and education of optimization/control theory, and especially its application to data communication networks."
Dr. Bertsekas' recent books are "Introduction to Probability: 2nd Edition" (2008), "Convex Optimization Theory" (2009), "Dynamic Programming and Optimal Control, " Vol. I, (2017), and Vol. Ⅱ: (2012), "Abstract Dynamic Programming" (2018), "Convex Optimization Algorithms" (2015), "Reinforcement Learning and Optimal Control" (2019), and "Rollout, Policy Iteration, and Distributed Reinforcement Learning"(2020), all published by Athena Scientific
Recommended by Associate Editor Qinglai Wei.
Received Date: 2020-09-23
Revised Date: 2020-10-28
Accepted Date: 2020-10-30

Abstract

Abstract

We discuss the solution of complex multistage decision problems using methods that are based on the idea of policy iteration (PI), i.e., start from some base policy and generate an improved policy. Rollout is the simplest method of this type, where just one improved policy is generated. We can view PI as repeated application of rollout, where the rollout policy at each iteration serves as the base policy for the next iteration. In contrast with PI, rollout has a robustness property: it can be applied on-line and is suitable for on-line replanning. Moreover, rollout can use as base policy one of the policies produced by PI, thereby improving on that policy. This is the type of scheme underlying the prominently successful AlphaZero chess program.In this paper we focus on rollout and PI-like methods for problems where the control consists of multiple components each selected (conceptually) by a separate agent. This is the class of multiagent problems where the agents have a shared objective function, and a shared and perfect state information. Based on a problem reformulation that trades off control space complexity with state space complexity, we develop an approach, whereby at every stage, the agents sequentially (one-at-a-time) execute a local rollout algorithm that uses a base policy, together with some coordinating information from the other agents. The amount of total computation required at every stage grows linearly with the number of agents. By contrast, in the standard rollout algorithm, the amount of total computation grows exponentially with the number of agents. Despite the dramatic reduction in required computation, we show that our multiagent rollout algorithm has the fundamental cost improvement property of standard rollout: it guarantees an improved performance relative to the base policy. We also discuss autonomous multiagent rollout schemes that allow the agents to make decisions autonomously through the use of precomputed signaling information, which is sufficient to maintain the cost improvement property, without any on-line coordination of control selection between the agents.For discounted and other infinite horizon problems, we also consider exact and approximate PI algorithms involving a new type of one-agent-at-a-time policy improvement operation. For one of our PI algorithms, we prove convergence to an agent-by-agent optimal policy, thus establishing a connection with the theory of teams. For another PI algorithm, which is executed over a more complex state space, we prove convergence to an optimal policy. Approximate forms of these algorithms are also given, based on the use of policy and value neural networks. These PI algorithms, in both their exact and their approximate form are strictly off-line methods, but they can be used to provide a base policy for use in an on-line multiagent rollout scheme.
- Dynamic programming,
- multiagent problems,
- neuro-dynamic programming,
- policy iteration,
- reinforcement learning,
- rollout

FullText(HTML)

Recommended by Associate Editor Qinglai Wei.

References(83)

References

[1]	D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I. 4th ed. Belmont, USA: Athena Scientific, 2017.
[2]	D. P. Bertsekas, Reinforcement Learning and Optimal Control. Belmont, USA: Athena Scientific, 2019.
[3]	D. P. Bertsekas, Rollout, Policy Iteration, and Distributed Reinforcement Learning. Belmont, USA: Athena Scientific, 2020.
[4]	D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, "Mastering chess and Shogi by self-play with a general reinforcement learning algorithm, " arXiv preprint arXiv: 1712.01815, 2017.
[5]	D. P. Bertsekas, "Multiagent value iteration algorithms in dynamic programming and reinforcement learning, " arxiv: 2005.01627, 2020.
[6]	D. P. Bertsekas, "Constrained multiagent rollout and multidimensional assignment with the auction algorithm, " arXiv: 2002.07407, 2020.
[7]	D. P. Bertsekas, "Distributed dynamic programming, " IEEE Trans. Autom. Control, vol. 27, no. 3, pp. 610-616, Jun. 1982.
[8]	D. P. Bertsekas, "Asynchronous distributed computation of fixed points, " Math. Programming, vol. 27, no. 1, pp. 107-120, Sep. 1983.
[9]	D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Englewood Cliffs, USA: Prentice-Hall, 1989.
[10]	D. P. Bertsekas and H. Z. Yu, "Asynchronous distributed policy iteration in dynamic programming, " in Proc. 48th Annu. Allerton Conf. Communication, Control, and Computing, Allerton, USA, 2010, pp. 1368-1374.
[11]	D. P. Bertsekas and H. Z. Yu, "Q-learning and enhanced policy iteration in discounted dynamic programming, " Math. Oper. Res., vol. 37, pp. 66-94, Feb. 2012.
[12]	H. Z. Yu and D. P. Bertsekas, "Q-learning and policy iteration algorithms for stochastic shortest path problems, " Ann. Oper. Res., vol. 208, no. 1, pp. 95-132, Sep. 2013.
[13]	D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. Ⅱ. 4th ed. Belmont, USA: Athena Scientific, 2012.
[14]	D. P. Bertsekas, Abstract Dynamic Programming. Belmont, USA: Athena Scientific, 2018.
[15]	S. Bhattacharya, S. Badyal, T. Wheeler, S. Gil, and D. P. Bertsekas, "Reinforcement learning for POMDP: Partitioned rollout and policy iteration with application to autonomous sequential repair problems, " IEEE Rob. Autom. Lett., vol. 5, no. 3, pp. 3967-3974, Jul. 2020.
[16]	H. S. Witsenhausen, "A counterexample in stochastic optimum control, " SIAM J. Control, vol. 6, no. 1, pp. 131-147, 1968. http://www.ams.org/mathscinet-getitem?mr=231649
[17]	H. S. Witsenhausen, "Separation of estimation and control for discrete time systems, " Proc. IEEE, vol. 59, no. 11, pp. 1557-1566, Nov. 1971.
[18]	J. Marschak, "Elements for a theory of teams, " Manage. Sci., vol. 1, no. 2, pp. 127-137, Jan. 1975.
[19]	R. Radner, "Team decision problems, " Ann. Math. Statist., vol. 33, no. 3, pp. 857-881, Sep. 1962.
[20]	H. S. Witsenhausen, "On information structures, feedback and causality, " SIAM J. Control, vol. 9, no. 2, pp. 149-160, 1971. http://www.researchgate.net/publication/250955878_On_Information_Structures_Feedback_and_Causality
[21]	J. Marschak and R. Radner, Economic Theory of Teams. New Haven, USA: Yale University Press, 1976.
[22]	N. Sandell, P. Varaiya, M. Athans, and M. Safonov, "Survey of decentralized control methods for large scale systems, " IEEE Trans. Autom. Control, vol. 23, no. 2, pp. 108-128, Apr. 1978.
[23]	T. Yoshikawa, "Decomposition of dynamic team decision problems, " IEEE Trans. Autom. Control, vol. 23, no. 4, pp. 627-632, Aug. 1978.
[24]	Y. C. Ho, "Team decision theory and information structures, " Proc. IEEE, vol. 68, no. 6, pp. 644-654, Jun. 1980.
[25]	D. Bauso and R. Pesenti, "Generalized person-by-person optimization in team problems with binary decisions, " in Proc. American Control Conf., Seattle, USA, 2008, pp. 717-722.
[26]	D. Bauso and R. Pesenti, "Team theory and person-by-person optimization with binary decisions, " SIAM J. Control Optim., vol. 50, no. 5, pp. 3011-3028, Jan. 2012.
[27]	A. Nayyar, A. Mahajan, and D. Teneketzis, "Decentralized stochastic control with partial history sharing: A common information approach, " IEEE Trans. Autom. Control, vol. 58, no. 7, pp. 1644-1658, Jul. 2013.
[28]	A. Nayyar and D. Teneketzis, "Common knowledge and sequential team problems, " IEEE Trans Autom. Control, vol. 64, no. 12, pp. 5108-5115, Dec. 2019.
[29]	Y. Y. Li, Y. J. Tang, R. Y. Zhang, and N. Li, "Distributed reinforcement learning for decentralized linear quadratic control: A derivative-free policy optimization approach, " arXiv: 1912.09135, 2019.
[30]	G. Qu and N. Li, "Exploiting Fast Decaying and Locality in Multi-Agent MDP with Tree Dependence Structure, " in Proc. of CDC, Nice, France, 2019.
[31]	A. Gupta, "Existence of team-optimal solutions in static teams with common information: A topology of information approach, " SIAM J. Control Optim., vol. 58, no. 2, pp. 998-1021, Apr. 2020.
[32]	F. Bullo, J. Cortes, and S. Martinez, Distributed Control of Robotic Networks: A Mathematical Approach to Motion Coordination Algorithms. St. Princeton, USA: Princeton University Press, 2009.
[33]	M. Mesbahi and M. Egerstedt, Graph Theoretic Methods in Multiagent Networks. Princeton, USA: Princeton University Press, 2010.
[34]	M. S. Mahmoud, Multiagent Systems: Introduction and Coordination Control. Boca Raton, USA: CRC Press, 2020.
[35]	R. Zoppoli, M. Sanguineti, G. Gnecco, and T. Parisini, Neural Approximations for Optimal Control and Decision, Springer, 2020. http://www.researchgate.net/publication/338322921_Neural_Approximations_for_Optimal_Control_and_Decision
[36]	F. A. Oliehoek and C. Amato, A Concise Introduction to Decentralized POMDPs, Springer International Publishing, 2016. doi: 10.1007/978-3-319-28929-8
[37]	P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote, "A survey of learning in multiagent environments: Dealing with non-stationarity, " arXiv: 1707.09183, 2017.
[38]	K. Q. Zhang, Z. R. Yang, and T. Başar, "Multi-agent reinforcement learning: A selective overview of theories and algorithms, " arXiv: 1911.10635, 2019.
[39]	L. S. Shapley, "Stochastic games, " Proc. Natl. Acad. Sci., vol. 39, no. 10, pp. 1095-1100, 1953.
[40]	M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning, " in Machine Learning Proceedings 1994, W. W. Cohen and H. Hirsh, Eds. Amsterdam, The Netherlands: Elsevier, 1994, pp. 157-163.
[41]	K. P. Sycara, "Multiagent systems, " AI Mag., vol. 19, no. 2, pp. 79-92, Jun. 1998.
[42]	P. Stone and M. Veloso, "Multiagent systems: A survey from a machine learning perspective, " Auton. Rob., vol. 8, no. 3, pp. 345-383, Jun. 2000.
[43]	L. Panait and S. Luke, "Cooperative multi-agent learning: The state of the art, " Auton. Agen. Multi-Agent Syst., vol. 11, no. 3, pp. 387-434, Nov. 2005.
[44]	L. Busoniu, R. Babuska, and B. De Schutter, "A comprehensive survey of multiagent reinforcement learning, " IEEE Trans. Syst., Man, Cybern., Part C, vol. 38, no. 2, pp. 156-172, Mar. 2008.
[45]	L. Busoniu, R. Babuška, and B. De Schutter, "Multi-agent reinforcement learning: An overview, " in Innovations in Multi-Agent Systems and Applications-1, D. Srinivasan and L. C. Jain, Eds. Berlin, Germany: Springer, 2010, pp. 183-221.
[46]	L. Matignon, G. J. Laurent, and N. Le Fort-Piat, "Independent reinforcement learners in cooperative Markov games: A survey regarding coordination problems, " Knowl. Eng. Rev., vol. 27, no. 1, pp. 1-31, Feb. 2012.
[47]	P. Hernandez-Leal, B. Kartal, and M. E. Taylor, "A survey and critique of multiagent deep reinforcement learning, " Auton. Agent. Multi-Agent Syst., vol. 33, no. 6, pp. 750-797, Oct. 2019.
[48]	A. OroojlooyJadid and D. Hajinezhad, "A review of cooperative multi-agent deep reinforcement learning, " arXiv: 1908.03963, 2019.
[49]	T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, "Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications, " IEEE Trans Cybern., vol. 50, no. 9, pp. 3826-3839, Sep. 2020.
[50]	G. Tesauro, "Extending Q-learning to general adaptive multi-agent systems, " in Proc. 16th Int. Conf. Neural Information Processing Systems, 2004, pp. 871-878.
[51]	F. A. Oliehoek, J. F. P. Kooij, and N. Vlassis, "The cross-entropy method for policy search in decentralized POMDPs, " Informatica, vol. 32, no. 4, pp. 341-357, 2008. http://www.ams.org/mathscinet-getitem?mr=2481391
[52]	P. Pennesi and I. C. Paschalidis, "A distributed actor-critic algorithm and applications to mobile sensor network coordination problems, " IEEE Trans. Autom. Control, vol. 55, no. 2, pp. 492-497, Feb. 2010.
[53]	I. C. Paschalidis and Y. W. Lin, "Mobile agent coordination via a distributed actor-critic algorithm, " in Proc. 19th Mediterranean Conf. Control Automation, Corfu, Greece, 2011, pp. 644-649.
[54]	S. Kar, J. M. F. Moura, and H. V. Poor, "QD-Learning: A collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations, " IEEE Trans. Signal Process., vol. 61, no. 7, pp. 1848-1862, Apr. 2013.
[55]	J. N. Foerster, Y. M. Assael, N. De Freitas, and S. Whiteson, "Learning to Communicate with Deep Multi-Agent Reinforcement Learning, " in Proc. 30th Int. Conf. Neural Information Processing Systems, Barcelona, Spain, 2016, pp. 2137-2145.
[56]	S. Omidshafiei, A. A. Agha-Mohammadi, C. Amato, S. Y. Liu, J. P. How, and J. Vian, "Graph-based cross entropy method for solving multi-robot decentralized POMDPs, " in Proc. IEEE Int. Conf. Robotics and Automation, Stockholm, Sweden, 2016, pp. 5395-5402.
[57]	J. K. Gupta, M. Egorov, and M. Kochenderfer, "Cooperative multi-agent control using deep reinforcement learning, " in Proc. Int. Conf. Autonomous Agents and Multiagent Systems, Best Papers, Brazil, 2017, pp. 66-83.
[58]	R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, "Multi-agent actor-critic for mixed cooperative-competitive environments, " in Proc. 31st Int. Conf. Neural Information Processing Systems, Long Beach, USA, 2017, pp. 6379-6390.
[59]	M. Zhou, Y. Chen, Y. Wen, Y. D. Yang, Y. F. Su, W. N. Zhang, D. Zhang, and J. Wang, "Factorized Q-learning for large-scale multi-agent systems, " arXiv: 1809.03738, 2018.
[60]	K. Q. Zhang, Z. R. Yang, H. Liu, T. Zhang, and T. Başar, "Fully decentralized multi-agent reinforcement learning with networked agents, " arXiv: 1802.08757, 2018.
[61]	Y. Zhang and M. M. Zavlanos, 2019 "Distributed off-policy actor-critic reinforcement learning with policy consensus, " in Proc. IEEE 58th Conf. Decision and Control, Nice, France, 2018, pp. 4674-4679.
[62]	C. S. de Witt, J. N. Foerster, G. Farquhar, P. H. S. Torr, W. Boehmer, and S. Whiteson, "Multi-agent common knowledge reinforcement learning", in Proc. 31st Int. Conf. Neural Information Processing Systems, Vancouver, Canada, 2019, pp. 9927-9939.
[63]	D. P. Bertsekas, "Multiagent rollout algorithms and reinforcement learning, " arXiv: 2002.07407, 2019.
[64]	S. Bhattacharya, S. Kailas, S. Badyal, S. Gil, and D. P. Bertsekas, "Multiagent rollout and policy iteration for POMDP with application to multi-robot repair problems, " in Proc. Conf. Robot Learning, 2020; also arXiv preprint, arXiv: 2011.04222.
[65]	D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, USA: Athena Scientific, 1996.
[66]	G. Tesauro, and G. R. Galperin, "On-line policy improvement using Monte-Carlo search, " in Proc. 9th Int. Conf. Neural Information Processing Systems, Denver, USA, 1996, pp. 1068-1074.
[67]	D. P. Bertsekas, Nonlinear Programming. 3rd ed. Belmont, USA: Athena Scientific, 2016.
[68]	M. G. Lagoudakis and R. Parr, "Reinforcement learning as classification: Leveraging modern classifiers, " in Proc. 20th Int. Conf. Machine Learning, Washington, USA, 2003, pp. 424-431.
[69]	C. Dimitrakakis and M. G. Lagoudakis, "Rollout sampling approximate policy iteration, " Mach. Learn., vol. 72, no. 3, pp. 157-171, Jul. 2008.
[70]	A. Lazaric, M. Ghavamzadeh, and R. Munos, "Analysis of a classification-based policy iteration algorithm, " in Proc. 27th Int. Conf. Machine Learning, Haifa, Israel, 2010.
[71]	P. Abbeel and A. Y. Ng, "Apprenticeship learning via inverse reinforcement learning, " in Proc. 21st Int. Conf. Machine Learning, Banff, Canada, 2004.
[72]	B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration, " Rob. Auton. Syst., vol. 57, no. 5, pp. 469-483, May 2009.
[73]	G. Neu and C. Szepesvari, "Apprenticeship learning using inverse reinforcement learning and gradient methods, " arXiv: 1206.5264, 2012.
[74]	H. Ben Amor, D. Vogt, M. Ewerton, E. Berger, B. Jung, and J. Peters, "Learning responsive robot behavior by imitation, " in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems, Tokyo, Japan, 2013, pp. 3257-3264.
[75]	J. Lee, "A survey of robot learning from demonstrations for human-robot collaboration, " arXiv: 1710.08789, 2017.
[76]	M. K. Hanawal, H. Liu, H. H. Zhu, and I. C. Paschalidis, "Learning policies for Markov decision processes from data, " IEEE Trans. Autom. Control, vol. 64, no. 6, pp. 2298-2309, Jun. 2019.
[77]	D. Gagliardi and G. Russo, "On a probabilistic approach to synthesize control policies from example datasets, " arXiv: 2005.11191, 2020.
[78]	T. T. Xu, H. H. Zhu, and I. C. Paschalidis, "Learning parametric policies and transition probability models of Markov decision processes from data, " Eur. J. Control, 2020. http://www.researchgate.net/publication/347300920_Nearly_Minimax_Optimal_Reinforcement_Learning_for_Linear_Mixture_Markov_Decision_Processes
[79]	R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd Ed. Cambridge, USA: MIT Press, 2018.
[80]	D. P. Bertsekas, "Feature-based aggregation and deep reinforcement learning: A survey and some new implementations, " IEEE/CAA J. Autom. Sinica, vol. 6, no. 1, pp. 1-31, Jan. 2019.
[81]	D. P. Bertsekas, "Approximate policy iteration: A survey and some new methods, " J. Control Theory Appl., vol. 9, no. 3, pp. 310-335, Jul. 2011; Expanded version appears as Lab. for Info. and Decision System Report LIDS-2833, MIT, 2011.
[82]	J. N. Tsitsiklis, "Asynchronous stochastic approximation and Q-learning, " Mach. Learn., vol. 16, no. 3, pp. 185-202, Sep. 1994.
[83]	V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning, " Nature, vol. 518, no. 7540, pp. 529-533, 2015. http://europepmc.org/abstract/med/25719670

Supplements(0)

Cited By

Proportional views

Proportional views

通讯作者: 陈斌, bchen63@163.com

1.
沈阳化工大学材料科学与工程学院沈阳 110142

Figures(7)

Get Citation

PDF

XML

Article Metrics

Article views (6860) PDF downloads(631)

Multiagent Reinforcement Learning:Rollout and Policy Iteration

doi: 10.1109/JAS.2021.1003814

Abstract

References

Proportional views

Catalog

通讯作者: 陈斌, bchen63@163.com

Article Metrics

Export File

Citation

Format

Content