A journal of IEEE and CAA, publishing high-quality papers in English on original theoretical/experimental research and development in all areas of automation.
Volume 6 Issue 1
Jan.  2019

IEEE/CAA Journal of Automatica Sinica

Dimitri P. Bertsekas, "Feature-Based Aggregation and Deep Reinforcement Learning: A Survey and Some New Implementations," IEEE/CAA J. Autom. Sinica, vol. 6, no. 1, pp. 1-31, Jan. 2019. doi: 10.1109/JAS.2018.7511249

Feature-Based Aggregation and Deep Reinforcement Learning: A Survey and Some New Implementations

doi: 10.1109/JAS.2018.7511249
  • In this paper we discuss policy iteration methods for approximate solution of a finite-state discounted Markov decision problem, with a focus on feature-based aggregation methods and their connection with deep reinforcement learning schemes. We introduce features of the states of the original problem, and we formulate a smaller "aggregate" Markov decision problem, whose states relate to the features. We discuss properties and possible implementations of this type of aggregation, including a new approach to approximate policy iteration. In this approach the policy improvement operation combines feature-based aggregation with feature construction using deep neural networks or other calculations. We argue that the cost function of a policy may be approximated much more accurately by the nonlinear function of the features provided by aggregation than by the linear function of the features provided by neural network-based reinforcement learning, thereby potentially leading to more effective policy improvement.
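The aggregation idea summarized in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the toy MDP below is randomly generated, and it uses the simplest variant of the scheme (hard aggregation, where the feature map assigns each state to exactly one aggregate state, with uniform disaggregation probabilities). The smaller aggregate problem is solved by value iteration, and its solution is lifted back to a piecewise-constant approximation of the original cost function.

```python
import numpy as np

# Toy example of hard, feature-based state aggregation for a discounted
# MDP (hypothetical problem; a sketch of the general idea, not the
# paper's code). States sharing a feature value form one aggregate
# state; disaggregation probabilities are taken uniform within a group.

np.random.seed(0)
n, m, gamma = 6, 2, 0.9             # states, actions, discount factor
P = np.random.rand(m, n, n)         # transition kernels P[u, i, j]
P /= P.sum(axis=2, keepdims=True)   # make each row a distribution
g = np.random.rand(m, n)            # stage costs g[u, i] in [0, 1]

phi = np.array([0, 0, 0, 1, 1, 1])  # feature map: state -> aggregate state
S = phi.max() + 1                   # number of aggregate states

# Build the aggregate problem: average original transition rows within
# each group (uniform disaggregation), then sum the transition mass
# falling into each group (aggregation of columns).
P_agg = np.zeros((m, S, S))
g_agg = np.zeros((m, S))
for u in range(m):
    for s in range(S):
        members = np.where(phi == s)[0]
        row = P[u, members].mean(axis=0)          # uniform disaggregation
        for t in range(S):
            P_agg[u, s, t] = row[phi == t].sum()  # aggregate columns
        g_agg[u, s] = g[u, members].mean()

# Solve the (much smaller) aggregate problem by value iteration.
r = np.zeros(S)
for _ in range(1000):
    r = np.min(g_agg + gamma * P_agg @ r, axis=0)

# Lift back: a piecewise-constant approximation of the optimal cost
# function of the original problem, constant within each group.
J_approx = r[phi]
print(J_approx)
```

The feature map `phi` here is hand-picked for the toy problem; the paper's point is that richer features (e.g., produced by a deep neural network) define the groups, and the aggregate problem then supplies a nonlinear-in-the-features cost approximation.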

