[1] Sutton R S, Barto A G. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[2] Engel Y, Mannor S, Meir R. Reinforcement learning with Gaussian processes. In: Proceedings of the 22nd International Conference on Machine Learning. New York: ACM, 2005. 201-208
[3] Ernst D, Geurts P, Wehenkel L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 2005, 6: 503-556
[4] Maei H R, Szepesvári C, Bhatnagar S, Sutton R S. Toward off-policy learning control with function approximation. In: Proceedings of the 2010 International Conference on Machine Learning, Haifa, Israel, 2010.
[5] Rasmussen C E, Williams C K I. Gaussian Processes for Machine Learning. Cambridge, MA: The MIT Press, 2006.
[6] Engel Y, Szabo P, Volkinshtein D. Learning to control an octopus arm with Gaussian process temporal difference methods. In: Advances in Neural Information Processing Systems 18. Cambridge, MA: The MIT Press, 2005. 347-354
[7] Melo F S, Meyn S P, Ribeiro M I. An analysis of reinforcement learning with function approximation. In: Proceedings of the 2008 International Conference on Machine Learning. New York: ACM, 2008. 664-671
[8] Deisenroth M P. Efficient Reinforcement Learning Using Gaussian Processes [Ph.D. dissertation], Karlsruhe Institute of Technology, Germany, 2010.
[9] Jung T, Stone P. Gaussian processes for sample efficient reinforcement learning with RMAX-like exploration. In: Proceedings of the 2012 European Conference on Machine Learning (ECML). 2012. 601-616
[10] Rasmussen C E, Kuss M. Gaussian processes in reinforcement learning. In: Advances in Neural Information Processing Systems 16. Cambridge, MA: The MIT Press, 2004. 751-759
[11] Kolter J Z, Ng A Y. Regularization and feature selection in least-squares temporal difference learning. In: Proceedings of the 26th Annual International Conference on Machine Learning. New York: ACM, 2009. 521-528
[12] Liu B, Mahadevan S, Liu J. Regularized off-policy TD-learning. In: Advances in Neural Information Processing Systems 25. Cambridge, MA: The MIT Press, 2012. 845-853
[13] Sutton R S, Maei H R, Precup D, Bhatnagar S, Silver D, Szepesvári C, Wiewiora E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: Proceedings of the 26th Annual International Conference on Machine Learning. New York: ACM, 2009. 993-1000
[14] Csató L, Opper M. Sparse on-line Gaussian processes. Neural Computation, 2002, 14(3): 641-668
[15] Singh S P, Yee R C. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 1994, 16(3): 227-233
[16] Watkins C J C H, Dayan P. Q-learning. Machine Learning, 1992, 8(3-4): 279-292
[17] Baird L C. Residual algorithms: reinforcement learning with function approximation. In: Proceedings of the 12th International Conference on Machine Learning. Morgan Kaufmann, 1995. 30-37
[18] Tsitsiklis J N, van Roy B. An analysis of temporal difference learning with function approximation. IEEE Transactions on Automatic Control, 1997, 42(5): 674-690
[19] Parr R, Painter-Wakefield C, Li L, Littman M L. Analyzing feature generation for value-function approximation. In: Proceedings of the 2007 International Conference on Machine Learning. New York: ACM, 2007. 737-744
[20] Ormoneit D, Sen S. Kernel-based reinforcement learning. Machine Learning, 2002, 49(2-3): 161-178
[21] Farahmand A M, Ghavamzadeh M, Szepesvári C, Mannor S. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. In: Proceedings of the 2009 American Control Conference. St. Louis, MO: IEEE, 2009. 725-730
[22] Geist M, Pietquin O. Kalman temporal differences. Journal of Artificial Intelligence Research (JAIR), 2010, 39(1): 483-532
[23] Strehl A L, Littman M L. A theoretical analysis of model-based interval estimation. In: Proceedings of the 22nd International Conference on Machine Learning. New York: ACM, 2005. 856-863
[24] Krause A, Guestrin C. Nonmyopic active learning of Gaussian processes: an exploration-exploitation approach. In: Proceedings of the 24th International Conference on Machine Learning. New York: ACM, 2007. 449-456
[25] Desautels T, Krause A, Burdick J W. Parallelizing exploration-exploitation tradeoffs with Gaussian process bandit optimization. In: Proceedings of the 29th International Conference on Machine Learning. ICML, 2012. 1191-1198
[26] Chung J J, Lawrance N R J, Sukkarieh S. Gaussian processes for informative exploration in reinforcement learning. In: Proceedings of the 2013 IEEE International Conference on Robotics and Automation. Karlsruhe: IEEE, 2013. 2633-2639
[27] Barreto A D M S, Precup D, Pineau J. Reinforcement learning using kernel-based stochastic factorization. In: Advances in Neural Information Processing Systems 24. Cambridge, MA: The MIT Press, 2011. 720-728
[28] Chen X G, Gao Y, Wang R L. Online selective kernel-based temporal difference learning. IEEE Transactions on Neural Networks and Learning Systems, 2013, 24(12): 1944-1956
[29] Kveton B, Theocharous G. Kernel-based reinforcement learning on representative states. In: Proceedings of the 26th AAAI Conference on Artificial Intelligence. AAAI, 2012. 977-983
[30] Xu X, Hu D W, Lu X C. Kernel-based least squares policy iteration for reinforcement learning. IEEE Transactions on Neural Networks, 2007, 18(4): 973-992
[31] Snelson E, Ghahramani Z. Sparse Gaussian processes using pseudo-inputs. In: Advances in Neural Information Processing Systems 18. Cambridge, MA: The MIT Press, 2006. 1257-1264
[32] Lázaro-Gredilla M, Quiñonero-Candela J, Rasmussen C E, Figueiras-Vidal A R. Sparse spectrum Gaussian process regression. The Journal of Machine Learning Research, 2010, 11: 1865-1881
[33] Varah J M. A lower bound for the smallest singular value of a matrix. Linear Algebra and Its Applications, 1975, 11(1): 3-5
[34] Lizotte D J. Convergent fitted value iteration with linear function approximation. In: Advances in Neural Information Processing Systems 24. Cambridge, MA: The MIT Press, 2011. 2537-2545
[35] Kingravi H. Reduced-Set Models for Improving the Training and Execution Speed of Kernel Methods [Ph.D. dissertation], Georgia Institute of Technology, Atlanta, GA, 2013
[36] Boyan J, Moore A. Generalization in reinforcement learning: Safely approximating the value function. In: Proceedings of the Advances in Neural Information Processing Systems 7. Cambridge, MA: The MIT Press, 1995. 369-376
[37] Engel Y, Mannor S, Meir R. The kernel recursive least-squares algorithm. IEEE Transactions on Signal Processing, 2004, 52(8): 2275-2285
[38] Krause A, Singh A, Guestrin C. Near-optimal sensor placements in Gaussian processes: theory, efficient algorithms and empirical studies. The Journal of Machine Learning Research, 2008, 9: 235-284
[39] Ehrhart E. Géométrie diophantienne: sur les polyèdres rationnels homothétiques à n dimensions. Comptes Rendus Hebdomadaires des Séances de l'Académie des Sciences, 1962, 254(4): 616
[40] Benveniste A, Métivier M, Priouret P. Adaptive Algorithms and Stochastic Approximations. New York: Springer-Verlag, 1990
[41] Haddad W M, Chellaboina V. Nonlinear Dynamical Systems and Control: A Lyapunov-Based Approach. Princeton: Princeton University Press, 2008
[42] Khalil H K. Nonlinear Systems (3rd edition). Upper Saddle River, NJ: Prentice Hall, 2002.
[43] Lagoudakis M G, Parr R. Least-squares policy iteration. Journal of Machine Learning Research (JMLR), 2003, 4: 1107-1149