IEEE/CAA Journal of Automatica Sinica
Citation: Dimitri P. Bertsekas, "Feature-Based Aggregation and Deep Reinforcement Learning: A Survey and Some New Implementations," IEEE/CAA J. Autom. Sinica, vol. 6, no. 1, pp. 1-31, Jan. 2019. doi: 10.1109/JAS.2018.7511249
[1] 
D. P. Bertsekas, "Approximate policy iteration: a survey and some new methods, " J. Control Theory Appl., vol. 9, no. 3, pp. 310335, Aug. 2011. http://www.cnki.com.cn/Article/CJFDTotalKZLL201103003.htm

[2] 
C. E. Shannon, "Programming a digital computer for playing chess," Phil. Mag., Ser. 7, vol. 41, pp. 256-275, 1950. doi: 10.1080/14786445008521796

[3] 
A. L. Samuel, "Some studies in machine learning using the game of checkers," IBM J. Res. Develop., vol. 3, pp. 210-229, 1959. doi: 10.1147/rd.33.0210

[4] 
A. L. Samuel, "Some studies in machine learning using the game of checkers. II - Recent progress," IBM J. Res. Develop., vol. 11, pp. 601-617, 1967.

[5] 
P. J. Werbos, "Advanced forecasting methods for global crisis warning and models of intelligence," Gener. Syst. Yearbook, vol. 22, no. 6, pp. 25-38, Jan. 1977.

[6] 
A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Trans. Syst. Man Cybern., vol. 13, no. 5, pp. 835-846, Sep.-Oct. 1983.

[7] 
J. Christensen and R. E. Korf, "A unified theory of heuristic evaluation functions and its application to learning," in Proc. of the 5th AAAI National Conf. on Artificial Intelligence, Philadelphia, PA, 1986, pp. 148-152.

[8] 
J. H. Holland, "Escaping brittleness: the possibilities of general purpose learning algorithms applied to parallel rule-based systems," in Machine Learning: An Artificial Intelligence Approach, R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, eds. San Mateo, CA: Morgan Kaufmann, 1986, pp. 593-623.

[9] 
R. S. Sutton, "Learning to predict by the methods of temporal differences," Mach. Learn., vol. 3, no. 1, pp. 9-44, Aug. 1988.

[10] 
G. Tesauro, "Practical issues in temporal difference learning," Mach. Learn., vol. 8, no. 3-4, pp. 257-277, 1992. doi: 10.1007/BF00992697

[11] 
G. Tesauro, "TD-Gammon, a self-teaching backgammon program, achieves master-level play," Neural Comput., vol. 6, no. 2, pp. 215-219, Mar. 1994.

[12] 
G. Tesauro, "Temporal difference learning and TD-Gammon," Commun. ACM, vol. 38, no. 3, pp. 58-68, Mar. 1995.

[13] 
G. Tesauro, "Programming backgammon using self-teaching neural nets," Artif. Intell., vol. 134, no. 1-2, pp. 181-199, Jan. 2002.

[14] 
G. Tesauro, "Neurogammon wins computer olympiad," Neural Comput., vol. 1, no. 3, pp. 321-323, 1989. doi: 10.1162/neco.1989.1.3.321

[15] 
G. Tesauro, "Connectionist learning of expert preferences by comparison training," in Advances in Neural Information Processing Systems, Cambridge, MA, 1989, pp. 99-106.

[16] 
G. Tesauro and G. R. Galperin, "On-line policy improvement using Monte-Carlo search," presented at the 1996 Neural Information Processing Systems Conf., Denver, CO; also in Advances in Neural Information Processing Systems 9, M. Mozer et al., eds., 1997.

[17] 
D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I, 4th ed. Belmont, MA: Athena Scientific, 2017.

[18] 
A. G. Barto, S. J. Bradtke, and S. P. Singh, "Learning to act using real-time dynamic programming," Artif. Intell., vol. 72, no. 1-2, pp. 81-138, 1995.

[19] 
D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.

[20] 
R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998 (a draft 2nd edition is available online).

[21] 
X. R. Cao, Stochastic Learning and Optimization: A Sensitivity-Based Approach. Springer, 2007.

[22] 
L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators. Boca Raton, FL: CRC Press, 2010.

[23] 
C. Szepesvari, Algorithms for Reinforcement Learning. San Francisco, CA: Morgan and Claypool Publishers, 2010.

[24] 
W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd ed. Hoboken, NJ: John Wiley and Sons, 2011.

[25] 
H. S. Chang, J. Q. Hu, M. C. Fu, and S. I. Marcus, Simulation-Based Algorithms for Markov Decision Processes, 2nd ed. Springer, 2013.

[26] 
D. Vrabie, K. G. Vamvoudakis, and F. L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles. London: The Institution of Engineering and Technology, 2012.

[27] 
A. Gosavi, Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, 2nd ed. Springer, 2015.

[28] 
J. Si, A. G. Barto, W. B. Powell, and D. Wunsch, Handbook of Learning and Approximate Dynamic Programming. IEEE Press, 2004.

[29] 
F. L. Lewis, D. Liu, and G. G. Lendaris, "Special issue on adaptive dynamic programming and reinforcement learning in feedback control," IEEE Trans. Syst. Man Cybern. Part B: Cybern., vol. 38, no. 4, 2008.

[30] 
F. L. Lewis and D. Liu, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. IEEE Press, Computational Intelligence Series, 2012.

[31] 
B. Abramson, "Expected-outcome: a general model of static evaluation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 12, no. 2, pp. 182-193, Feb. 1990.

[32] 
D. P. Bertsekas, J. N. Tsitsiklis, and C. Wu, "Rollout algorithms for combinatorial optimization," J. Heurist., vol. 3, no. 3, pp. 245-262, Dec. 1997.

[33] 
D. P. Bertsekas and D. A. Castanon, "Rollout algorithms for stochastic scheduling problems," J. Heurist., vol. 5, no. 1, pp. 89-108, Apr. 1999.

[34] 
D. P. Bertsekas, "Rollout algorithms for discrete optimization: a survey, " in Handbook of Combinatorial Optimization, P. Pardalos, D. Z. Du, and R. Graham, eds. New York, NY: Springer, 2013.

[35] 
H. S. Chang, M. C. Fu, J. Hu, and S. I. Marcus, "An adaptive sampling algorithm for solving Markov decision processes," Operat. Res., vol. 53, no. 1, pp. 126-139, Feb. 2005.

[36] 
R. Coulom, "Efficient selectivity and backup operators in Monte-Carlo tree search," in Proc. of the 5th Int. Conf. on Computers and Games, Turin, Italy, 2006, pp. 72-83.

[37] 
C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, "A survey of Monte Carlo tree search methods," IEEE Trans. Comput. Intell. AI Games, vol. 4, no. 1, pp. 1-43, Mar. 2012.

[38] 
I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA: MIT Press, 2016.

[39] 
J. Schmidhuber, "Deep learning in neural networks: an overview," Neural Netw., vol. 61, pp. 85-117, Jan. 2015.

[40] 
K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "A brief survey of deep reinforcement learning," arXiv preprint arXiv:1708.05866, 2017.

[41] 
W. B. Liu, Z. D. Wang, X. H. Liu, N. Y. Zeng, Y. R. Liu, and F. E. Alsaadi, "A survey of deep neural network architectures and their applications," Neurocomputing, vol. 234, pp. 11-26, Apr. 2017.

[42] 
Y. X. Li, "Deep reinforcement learning: an overview," arXiv preprint arXiv:1701.07274v5, 2017.

[43] 
D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, "Mastering chess and Shogi by self-play with a general reinforcement learning algorithm," arXiv preprint arXiv:1712.01815, 2017.

[44] 
A. G. Ivakhnenko, "The group method of data handling: a rival of the method of stochastic approximation," Soviet Autom. Control, vol. 13, pp. 43-55, 1968.

[45] 
A. G. Ivakhnenko, "Polynomial theory of complex systems," IEEE Trans. Syst. Man Cybern., vol. SMC-1, no. 4, pp. 364-378, Oct. 1971.

[46] 
J. C. Bean, J. R. Birge, and R. L. Smith, "Aggregation in dynamic programming," Operat. Res., vol. 35, no. 2, pp. 215-220, Apr. 1987.

[47] 
F. Chatelin and W. L. Miranker, "Acceleration by aggregation of successive approximation methods," Linear Algebra Appl., vol. 43, pp. 17-47, Mar. 1982.

[48] 
C. C. Douglas and J. Douglas, "A unified convergence theory for abstract multigrid or multilevel algorithms, serial and parallel," SIAM J. Numer. Anal., vol. 30, no. 1, pp. 136-158, 1993. doi: 10.1137/0730007

[49] 
D. F. Rogers, R. D. Plante, R. T. Wong, and J. R. Evans, "Aggregation and disaggregation techniques and methodology in optimization," Operat. Res., vol. 39, no. 4, pp. 553-582, Aug. 1991.

[50] 
S. P. Singh, T. Jaakkola, and M. I. Jordan, "Reinforcement learning with soft state aggregation," in Advances in Neural Information Processing Systems 7, MIT Press, 1995.

[51] 
G. J. Gordon, "Stable function approximation in dynamic programming," in Proc. of the 12th Int. Conf. on Machine Learning, California, 1995.

[52] 
J. N. Tsitsiklis and B. Van Roy, "Feature-based methods for large scale dynamic programming," Mach. Learn., vol. 22, no. 1-3, pp. 59-94, Mar. 1996.

[53] 
K. Ciosek and D. Silver, "Value iteration with options and state aggregation," Report, Center for Computational Statistics and Machine Learning, University College London, 2015.

[54] 
I. V. Serban, C. Sankar, M. Pieper, J. Pineau, and Y. Bengio, "The bottleneck simulator: a model-based deep reinforcement learning approach," arXiv preprint arXiv:1807.04723v1, 2018.

[55] 
D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II: Approximate Dynamic Programming, 4th ed. Belmont, MA: Athena Scientific, 2012.

[56] 
O. E. David, N. S. Netanyahu, and L. Wolf, "DeepChess: end-to-end deep neural network for automatic learning in chess," in Proc. of the 25th Int. Conf. on Artificial Neural Networks, Barcelona, Spain, 2016, pp. 88-96.

[57] 
D. P. Bertsekas, Abstract Dynamic Programming, 2nd ed. Belmont, MA: Athena Scientific, 2018.

[58] 
M. A. Krasnoselskii, G. M. Vainikko, R. P. Zabreyko, Y. B. Ruticki, and V. V. Stetsenko, Approximate Solution of Operator Equations (translated by D. Louvish). Groningen: Wolters-Noordhoff, 1972.

[59] 
C. A. J. Fletcher, Computational Galerkin Methods. Springer, 1984.

[60] 
Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed. Philadelphia, PA: SIAM, 2003.

[61] 
A. Kirsch, An Introduction to the Mathematical Theory of Inverse Problems, 2nd ed. Springer, 2011.

[62] 
D. P. Bertsekas, "Temporal difference methods for general projected equations, " IEEE Trans. Autom. Control, vol. 56, no. 9, pp. 21282139, Sep. 2011.

[63] 
H. Z. Yu and D. P. Bertsekas, "Error bounds for approximations from projected linear equations," Math. Operat. Res., vol. 35, no. 2, pp. 306-329, Apr. 2010.

[64] 
D. P. Bertsekas, "Proximal algorithms and temporal differences for large linear systems: extrapolation, approximation, and simulation, " Report LIDSP3205, MIT; arXiv preprint arXiv: 1610.05427, 2016.

[65] 
D. P. Bertsekas, "Proximal algorithms and temporal difference methods for solving fixed point problems, " Comput. Optimiz. Appl., vol. 70, no. 3, pp. 709736, Jul. 2018.

[66] 
H. Z. Yu and D. P. Bertsekas, "Weighted Bellman equations and their applications in approximate dynamic programming," Lab. for Information and Decision Systems Report LIDS-P-2876, MIT, 2012.

[67] 
D. P. Bertsekas, "λ-policy iteration: a review and a new implementation," Lab. for Information and Decision Systems Report LIDS-P-2874, MIT; in Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, F. L. Lewis and D. Liu, eds. IEEE Press, Computational Intelligence Series, 2012.

[68] 
A. Fern, S. Yoon, and R. Givan, "Approximate policy iteration with a policy language bias: solving relational Markov decision processes," J. Artif. Intell. Res., vol. 25, pp. 75-118, Jan. 2006.

[69] 
B. Scherrer, "Performance bounds for λ policy iteration and application to the game of Tetris, " J. Mach. Learn. Res., vol. 14, no. 1, pp. 11811227, Jan. 2013. http://www.oalib.com/paper/4086129

[70] 
B. Scherrer, M. Ghavamzadeh, V. Gabillon, B. Lesner, and M. Geist, "Approximate modified policy iteration and its application to the game of Tetris," J. Mach. Learn. Res., vol. 16, no. 1, pp. 1629-1676, Jan. 2015.

[71] 
V. Gabillon, M. Ghavamzadeh, and B. Scherrer, "Approximate dynamic programming finally performs well in the game of Tetris," in Advances in Neural Information Processing Systems, South Lake Tahoe, United States, 2013, pp. 1754-1762.

[72] 
D. P. Bertsekas, Convex Optimization Algorithms. Belmont, MA: Athena Scientific, 2015.

[73] 
D. P. Bertsekas, Nonlinear Programming, 3rd ed. Belmont, MA: Athena Scientific, 2016.

[74] 
D. P. Bertsekas and J. N. Tsitsiklis, "Gradient convergence in gradient methods with errors, " SIAM J. Optimiz., vol. 10, no. 3, pp. 627642, 2000. http://www.ams.org/mathscinetgetitem?mr=1741189

[75] 
G. Cybenko, "Approximation by superpositions of a sigmoidal function," Math. Control Sign. Syst., vol. 2, no. 4, pp. 303-314, Dec. 1989.

[76] 
K. Funahashi, "On the approximate realization of continuous mappings by neural networks," Neural Netw., vol. 2, no. 3, pp. 183-192, 1989.

[77] 
K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Netw., vol. 2, no. 5, pp. 359-366, 1989.

[78] 
M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken, "Multilayer feedforward networks with a nonpolynomial activation function can approximate any function," Neural Netw., vol. 6, no. 6, pp. 861-867, 1993.

[79] 
C. M. Bishop, Neural Networks for Pattern Recognition. Oxford: Oxford University Press, 1995.

[80] 
L. K. Jones, "Constructive approximations for neural networks by sigmoidal functions," Proc. IEEE, vol. 78, no. 10, pp. 1586-1589, Oct. 1990.

[81] 
G. E. Hinton, S. Osindero, and Y. W. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, no. 7, pp. 1527-1554, Jul. 2006.

[82] 
S. Haykin, Neural Networks and Learning Machines, 3rd ed. Englewood Cliffs, NJ: Prentice-Hall, 2008.

[83] 
B. Van Roy, "Performance loss bounds for approximate value iteration with state aggregation," Math. Operat. Res., vol. 31, no. 2, pp. 234-244, May 2006.

[84] 
J. N. Tsitsiklis, "Asynchronous stochastic approximation and Qlearning, " Mach. Learn., vol. 16, no. 3, pp. 185202, Sep. 1994.

[85] 
G. Tesauro, "Comparison training of chess evaluation functions, " in Machines that Learn to Play Games, Commack, Nova Science Publishers, 2001, pp. 117130.

[86] 
D. P. Bertsekas and D. A. Castanon, "Adaptive aggregation methods for infinite horizon dynamic programming, " IEEE Trans. Autom. Control, vol. 34, no. 6, pp. 589598, Jun. 1989.

[87] 
T. P. Runarsson, M. Schoenauer, and M. Sebag, "Pilot, rollout and Monte Carlo tree search methods for job shop scheduling," in Learning and Intelligent Optimization, Berlin, Heidelberg: Springer, 2012, pp. 160-174.

[88] 
D. P. Bertsekas, "A counterexample to temporal differences learning, " Neural Comput., vol. 7, no. 2, pp. 270279, Mar. 1995. http://dl.acm.org/citation.cfm?id=206098

[89] 
D. P. Bertsekas and J. N. Tsitsiklis, "An analysis of stochastic shortest path problems, " Math. Operat. Res., vol. 16, no. 3, pp. 580595, Aug. 1991. http://dl.acm.org/citation.cfm?id=119204
