IEEE/CAA Journal of Automatica Sinica, 2014, Vol. 1, Issue 3: 274-281
Experience Replay for Least-Squares Policy Iteration
Quan Liu, Xin Zhou, Fei Zhu, Qiming Fu, Yuchen Fu
1. School of Computer Science and Technology, Soochow University, Jiangsu 215006, China, and also with the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China;
2. School of Computer Science and Technology, Soochow University, Jiangsu 215006, China;
3. School of Computer Science and Technology, Soochow University, Jiangsu 215006, China, and also with the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China;
4. School of Computer Science and Technology, Soochow University, Jiangsu 215006, China
Abstract: Policy iteration, which evaluates and improves the control policy iteratively, is a reinforcement learning method. Policy evaluation with the least-squares method can extract more useful information from the empirical data and therefore improve data efficiency. However, most existing online least-squares policy iteration methods use each sample only once, resulting in a low utilization rate of the data. With the goal of improving the utilization efficiency, we propose experience replay for least-squares policy iteration (ERLSPI) and prove its convergence. ERLSPI combines online least-squares policy iteration with experience replay: it stores the samples generated online and reuses them with the least-squares method to update the control policy. We apply ERLSPI to the inverted pendulum system, a typical benchmark problem. The experimental results show that the method can effectively take advantage of previous experience and knowledge, improve the empirical utilization efficiency, and accelerate the convergence speed.
Key words: reinforcement learning     experience replay     least-squares     policy iteration
Ⅰ. INTRODUCTION

Reinforcement learning (RL) interacts with the environment and learns how to map situations to actions in order to obtain the maximum cumulative reward. Agents constantly try to find the best action, the one that obtains the maximum reward. In reinforcement learning, an action affects not only the immediate reward, but also the next state and all subsequent rewards. Reinforcement learning is characterized by trial-and-error search as well as delayed reward[1, 2]. It has good learning performance in complex nonlinear systems with large spaces, and is widely used in process control, task scheduling, robot design, gaming and many other fields[3, 4, 5].

In reinforcement learning, we use value functions to estimate the long-term cumulative reward of a state or a state-action pair: the V-function evaluates states and the Q-function evaluates state-action pairs. The policy in reinforcement learning is a mapping from the state space to the action space; agents eventually reach the target by comparing value functions to seek the optimal policy[6]. Policy iteration (PI) is an important reinforcement learning method, whose algorithms update the current policy by calculating its value functions and then improve the policy greedily. The least-squares method can be used advantageously in reinforcement learning algorithms. Bradtke et al. proposed the least-squares temporal difference (LSTD) algorithm based on the V-function[7, 8]. Although LSTD requires more computation per time step than traditional temporal difference (TD) algorithms, it extracts more useful information from the empirical data and therefore improves data efficiency. However, because no model is available for action selection in the model-free case, LSTD only solves prediction problems and cannot be used for control problems[9, 10, 11]. Lagoudakis et al. extended it to the least-squares temporal difference for Q-functions (LSTD-Q)[12, 13], and proposed the least-squares policy iteration (LSPI) algorithm, which made it applicable to control problems. LSPI uses LSTD-Q in policy evaluation and accurately estimates the current policy with all samples collected in advance, which makes it a typical offline algorithm. Reinforcement learning, however, is capable of interacting with the environment online. Busoniu et al. extended the offline least-squares policy iteration algorithm to the online case[14] and presented an online least-squares policy iteration (online LSPI) algorithm. The algorithm uses samples generated online with an exploratory policy and improves the policy every few steps, but it uses each sample only once, which gives a low empirical utilization rate. Therefore, in the early stage of the algorithm, when little sample data is available, it is difficult to obtain a good control policy, which leads to poor initial performance and slow convergence.

Experience replay (ER) methods can reuse prior empirical data and therefore reduce the number of samples required to obtain a good policy[15]: sample data is generated and stored online, and then reused to update the current policy. In this work, combining the ER method with online LSPI, we propose an ER for least-squares policy iteration (ERLSPI) algorithm based on linear least-squares function approximation. The algorithm reuses the sample data collected online and extracts more information from it, which improves the empirical utilization rate and the convergence speed.

Ⅱ. RELATED THEORIES

A. Markov Decision Process

A state signal that successfully retains all pertinent information is said to have the Markov property. In reinforcement learning, if a state has the Markov property, then the response of the environment at time $t+1$ depends only on the representation of the state and action at time $t$. A reinforcement learning task that satisfies the Markov property is known as a Markov decision process (MDP). An MDP can be defined by four elements: the state space $X$, the action space $U$, the transition function $f$, and the reward function $\rho$. In the deterministic case, the state transition function is $f$: $X\times U\rightarrow X$ and the reward function is $\rho$: $X\times U\rightarrow{\bf R}$; at the current time step $k$, the controller takes action $u_k$ according to the control policy $h$: $X\rightarrow U$ in the given state $x_{k}$, and the system obtains the immediate reward $r_{k+1} =\rho(x_k,u_k)$. In the stochastic case, the state transition function is $\tilde{f}$: $X\times U\times X\rightarrow[0,\infty)$ and the reward function is $\tilde{\rho}$: $X\times U\times X\rightarrow {\bf R}$; at the current time step $k$, the controller takes action $u_{k}$ according to the control policy $h$: $X\rightarrow U$ in the given state $x_{k}$, transfers to the next state $x_{k+1}$ ($x_{k+1} \in X_{k+1}$, $X_{k+1}\subseteq X$) with probability $P(x_{k+1} \in X_{k+1}\mid x_{k},u_k) = \int\nolimits_{X_{k+1}} \tilde{f}(x_{k},u_{k},x'){\rm d}x'$, and the system obtains the immediate reward $r_{k+1}=\tilde{\rho}(x_{k},u_{k},x_{k+1})$. If the state space is countable, the state transition function can be written as ${\bar{f}}$: $X\times U\times X\rightarrow[0,1]$, and the probability of transferring to the next state is $P(x_{k+1} =x'\mid x_{k},u_{k}) = {\bar {f}}(x_{k},u_{k},x')$. Given $f$ and $\rho$, as well as the current state $x_{k}$ and the current action $u_{k}$, it is sufficient to determine the next state $x_{k+1}$ and the immediate reward $r_{k+1}$; this is the Markov property, and it is the theoretical basis for reinforcement learning algorithms.
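To make the two formulations above concrete, the following Python sketch (illustrative only; the class and function names are not from the paper) encodes a deterministic MDP through $f$ and $\rho$, and a countable-state stochastic MDP through $\bar{f}$ and $\tilde{\rho}$.

# Minimal sketch (not from the paper) of the two MDP forms described above.
# All names are illustrative.
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DeterministicMDP:
    states: List[int]                              # X
    actions: List[int]                             # U
    f: Callable[[int, int], int]                   # f: X x U -> X, next state
    rho: Callable[[int, int], float]               # rho: X x U -> R, reward

    def step(self, x, u):
        x_next = self.f(x, u)
        r = self.rho(x, u)
        return x_next, r

@dataclass
class StochasticMDP:
    states: List[int]
    actions: List[int]
    f_bar: Callable[[int, int, int], float]        # f_bar(x, u, x'): transition probability
    rho: Callable[[int, int, int], float]          # rho(x, u, x'): reward

    def step(self, x, u):
        # Sample x' according to the countable-state transition probabilities f_bar(x, u, .)
        probs = [self.f_bar(x, u, xp) for xp in self.states]
        x_next = random.choices(self.states, weights=probs, k=1)[0]
        r = self.rho(x, u, x_next)
        return x_next, r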

If an MDP has a terminal state from which, once reached, no further transition is possible, the task is called an episodic task; on the contrary, a task without a terminal state is called a continuous task. In particular, an MDP without unreachable states, in which every state is visited with nonzero probability, is ergodic.

B. Online Least-Squares Policy Iteration

The Q-function $Q^{h}$ induced by the policy $h$ can be found by solving the Bellman equation $Q^{h}=T^{h}(Q^{h})$, where the policy evaluation mapping $T^{h}$ is $[T^{h}(Q)](x,u)= {\rm E}_{x'} \{\rho(x,u,x')+\gamma Q(x',h(x'))\}$. The PI algorithm starts with an arbitrary initial policy $h_0$; in each iteration $l \,(l\geq0)$, the algorithm evaluates the current policy by calculating the Q-function $Q^{h_l}$, and then uses $h_{l+1}(x)={\rm arg}\max\limits_{u}Q^{h_l}(x,u)$ to improve the policy. The PI algorithm eventually converges to the optimal policy $h^{*}$.
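For reference, here is a minimal tabular policy iteration loop in Python, assuming a small finite MDP given by hypothetical arrays P and R; it illustrates the evaluate-then-improve cycle described above, not the approximate version developed in the rest of the paper.

# Illustrative tabular policy iteration for a small finite MDP
# (hypothetical transition/reward arrays; not the paper's algorithm).
import numpy as np

def policy_iteration(P, R, gamma=0.95, tol=1e-8):
    """P[x, u, y] = transition probability, R[x, u] = expected reward."""
    n_states, n_actions, _ = P.shape
    h = np.zeros(n_states, dtype=int)          # arbitrary initial policy h_0
    while True:
        # Policy evaluation: solve Q^h = T^h(Q^h) for the current policy.
        Q = np.zeros((n_states, n_actions))
        while True:
            V = Q[np.arange(n_states), h]                     # V^h(y) = Q^h(y, h(y))
            Q_next = R + gamma * np.einsum('xuy,y->xu', P, V)
            if np.max(np.abs(Q_next - Q)) < tol:
                Q = Q_next
                break
            Q = Q_next
        # Policy improvement: h_{l+1}(x) = argmax_u Q^{h_l}(x, u).
        h_new = np.argmax(Q, axis=1)
        if np.array_equal(h_new, h):
            return h, Q
        h = h_new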

Traditional algorithms store a distinct value for every state (if V-functions are used) or for every state-action pair (in the case of Q-functions) in tabular form. When some of the state variables have a very large or infinite number of possible values, exact storage is impossible, and the value functions have to be represented approximately.

In function approximation methods, we suppose the state space $X$ and action space $U$ contain a finite number of elements, represented by $X=\{x_{1},\cdots,x_{\bar{N}}\}$ and $U=\{u_{1},\cdots,u_{\bar{M}}\}$, respectively. Define the basis function matrix ${\phi}\in {\bf R}^{n_{s}\times n}$ as follows:

$ {\phi_{l_{s},l}}={\phi_{l}}({x}_{l_{s}},{u}_{l_{s}}),\quad {l}=0,1,\cdots,{n}-1,\quad l_s=1,\cdots,{n_s} $ (1)

where $n$ is the dimension of the basis functions, $l_{s}$ is the serial number of a sample, and $n_{s}$ is the number of samples. The basis functions are state-dependent. In the state-action basis function vector ${\phi}(x_{l_{s}},u_{l_{s}})$, all the basis functions that do not correspond to the current discrete action are set equal to 0:

$ {\phi}({x}_{l_{s}},{u}_{l_{s}})=[0,\cdots,0,{\phi_{1}}({x}_{l_{s}}),\cdots,{\phi_{\bar{N}}}({x}_{l_{s}}),0,\cdots,0]^{\rm T},\notag $

where ${\phi_{i}}({x}_{l_{s}})~(i=1,\cdots,\bar{N})$ are Gaussian radial basis functions (RBFs), which can be defined as

$ {\phi_{i}}({x}_{l_{s}})={\rm e}^{-\frac{\parallel {x}_{l_{s}}-{\mu}_{i} \parallel^2}{2\sigma^2}},\notag $

where $\mu_{i}(i=1,\cdots,\bar{N})$ is the center of the $i$-th RBF and $\sigma$ is the width.

Under the linear parameterization, given the basis functions ${\phi}$ and a policy parameter vector ${\theta}$ (${\theta}\in{\bf R}^n$), the approximate Q-function can be represented by (2):

$ \label{equ 2} {{\hat{Q}}} = {\phi}^{\rm T}{\theta} $ (2)
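As an illustration of (1) and (2), the sketch below builds the zero-padded state-action feature vector from Gaussian RBFs and evaluates the linear Q-function; the array shapes, the integer action index, and the helper names are assumptions made for the example.

# Sketch of the state-action features of (1) and the linear Q-function of (2).
# Assumptions: centers has shape (n_rbf, state_dim), x has shape (state_dim,),
# and u is an integer action index.
import numpy as np

def rbf_state_features(x, centers, sigma):
    # phi_i(x) = exp(-||x - mu_i||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - centers) ** 2, axis=-1) / (2.0 * sigma ** 2))

def state_action_features(x, u, centers, sigma, n_actions):
    # Blocks corresponding to actions other than u are zero, as described above.
    n_rbf = len(centers)
    phi = np.zeros(n_rbf * n_actions)
    phi[u * n_rbf:(u + 1) * n_rbf] = rbf_state_features(x, centers, sigma)
    return phi

def q_hat(x, u, theta, centers, sigma, n_actions):
    # Q_hat(x, u) = phi(x, u)^T theta, as in (2).
    return state_action_features(x, u, centers, sigma, n_actions) @ theta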

The LSPI algorithm iteratively updates the policy parameter $\theta$ to obtain the optimal policy. To find an approximate Q-function for a policy $h$, a projected form of the Bellman equation can be written as

$ \hat{Q}^h=P^w(T^h(\hat{Q}^h)), $ (3)

where $P^w$ performs a weighted least-squares projection on the space of representable Q-functions[14]. The projection $P^w$ is defined by

$ P^w(Q)(x,u)={\phi}^{\rm T}(x,u){\theta}^+,\notag $

where

$ {\theta}^+\in {\rm arg}\min\limits_{\theta} \sum\limits_{(x,u)\in X\times U}w(x,u)({\phi}^{\rm T}(x,u){\theta}-Q(x,u))^2.\notag $
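A small sketch of the weighted least-squares projection $P^w$ used in (3), with hypothetical inputs: given a feature matrix, target Q-values, and weights, it computes ${\theta}^+$ and the projected Q-values.

# Illustrative weighted least-squares projection P^w (hypothetical inputs).
import numpy as np

def weighted_projection(Phi, q, w):
    """Phi: (n_pairs, n) feature matrix, q: (n_pairs,) targets, w: (n_pairs,) weights."""
    W = np.diag(w)
    A = Phi.T @ W @ Phi
    b = Phi.T @ W @ q
    theta_plus = np.linalg.solve(A, b)     # theta^+ = (Phi^T W Phi)^{-1} Phi^T W q
    return Phi @ theta_plus, theta_plus    # projected Q-values and parameters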

The LSPI algorithm uses LSTD-Q to calculate the policy parameters ${\theta}$ from all samples collected in advance, as follows:

$ \frac{1}{n_s}{\Gamma}{\theta}=\gamma\frac{1}{n_s}{\Lambda}{\theta}+\frac{1}{n_s}{z} $ (4)

where the matrices ${\Gamma},~{\Lambda} \in {\bf R}^{n\times n}$ and the vector ${z}\in {\bf R}^{n}$ are defined as follows:

$ {{\Gamma}}=\sum_{l_s=1}^{n_s}{{\phi}}(x_{l_s},u_{l_s}) {{\phi}}^{\rm T}(x_{l_s},u_{l_s}),\notag\\ { \Lambda}= \sum_{l_s=1}^{n_s}{{\phi}} (x_{l_s},u_{l_s}){{\phi}}^{\rm T}(x'_{l_s},h(x'_{l_s})),\notag\\ {{ z}}=\sum_{l_s=1}^{n_s}{{\phi}}(x_{l_s},u_{l_s}){{r}}_{l_s}. $ (5)
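The batch LSTD-Q evaluation of (4)-(5) can be sketched as follows; here feature(x, u) stands for the vector ${\phi}(x,u)$ of (1), and the sample format is assumed to be the tetrad $(x,u,r,x')$ used later in Section III.

# Illustrative batch LSTD-Q evaluation as in (4)-(5): accumulate Gamma, Lambda, z
# over stored samples (x, u, r, x') and solve for theta under the current policy h.
import numpy as np

def lstdq(samples, h, feature, n, gamma=0.95):
    Gamma = np.zeros((n, n))
    Lam = np.zeros((n, n))
    z = np.zeros(n)
    for (x, u, r, x_next) in samples:
        phi = feature(x, u)
        phi_next = feature(x_next, h(x_next))
        Gamma += np.outer(phi, phi)
        Lam += np.outer(phi, phi_next)
        z += phi * r
    # Solve (Gamma - gamma * Lambda) theta = z, i.e., the linear system behind (4).
    return np.linalg.solve(Gamma - gamma * Lam, z)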

LSPI is an offline algorithm: all samples collected in advance are needed in each update. In online LSPI, policy improvements are performed every few (not too many) steps (for continuous tasks) or episodes (for episodic tasks), and the improved policy is then applied to generate new samples. Another policy improvement then takes place, and the cycle repeats[14].

Ⅲ. ER LEAST-SQUARES POLICY ITERATION

A. ER Least-Squares Policy Iteration

In the online LSPI algorithm, the sample data generated online is used only once and then discarded, which leads to a low empirical utilization rate, especially in cases where collecting samples is difficult. Improving the empirical utilization rate is therefore particularly important, and in this work we use the ER method to achieve this goal. The sample data generated online is preserved for future reuse. Each sample is stored as a tetrad ($x,u,r,x'$), where $x$ is the current state, $u$ is the action selected under the current policy, $r$ is the immediate reward obtained, and $x'$ is the next state. The stored samples constitute the experience and, together with the current policy $h$, form the quintuple ($x,u,r,x',h(x')$) used to update the policy. The policy $h$ is associated with the policy parameters $\theta$; different $\theta$ correspond to different $h$; and $h(x')$ is the action to be taken in state $x'$. Policy updating is performed after the prescribed $k$ episodes (episodic tasks) or $T$ time steps (continuous tasks) rather than after each time step. All the sample data currently collected is used in each update of the policy parameter $\theta$, which updates the corresponding policy $h$; the sample data is then reused to constitute new quintuples and update $\theta$ under the new policy. This procedure is repeated $d$ times to achieve the reuse of experience. The sample collection is restarted after each update, which prevents an excessive amount of sample data and the resulting excessive computational cost.

LSTD-Q is used to evaluate the policy; calculating $\theta$ by (4) requires a complete matrix inversion, which is computationally intensive. Many approaches can be used to reduce this cost. In this work, we take advantage of the Sherman-Morrison approach to calculate $\theta$ incrementally[7]. Set the matrix $B\in {\bf R}^{n\times n}$ with initial matrix $B_0$, typically a diagonal matrix of the form $\beta_{\Gamma}I$, where $\beta_{\Gamma}$ is a small positive constant and $I$ is the identity matrix; this keeps the initialization close to 0 while remaining invertible. Suppose the serial number of the sample currently being processed is $l_s$; then $B_{l_s}$ is updated by

$ {B}_{l_s} = {B}_{l_s-1}- \frac{{B}_{l_s-1}{\phi}(x_{l_s},u_{l_s})({\phi}(x_{l_s},u_{l_s})- \gamma{\phi}(x'_{l_s},h(x'_{l_s})))^{\rm T}{B}_{l_s-1}}{1+ ({\phi}(x_{l_s},u_{l_s})-\gamma{\phi}(x'_{l_s},h(x'_{l_s})))^{\rm T} {B}_{l_s-1}{\phi}(x_{l_s},u_{l_s})}. $ (6)

Thus, by updating the matrix $B$ recursively, the explicit matrix inversion is avoided and the computation is greatly reduced. The ERLSPI algorithm with LSTD-Q is shown as Algorithm 1.

Algorithm 1. ERLSPI with LSTD-Q

1: ${B}=\beta_\Gamma {I}$ ($\beta_\Gamma$ is a small positive constant)

2: ${{z}}={0}$

3: ${n}_s=0$ (${n}_s$ is the number of current samples)

4: ${i}=0$ (${i}$ is the counter)

5: ${\bf repeat}$

6: At current state ${x_{l_{s}}}$, choose action ${u_{l_{s}}}~({l_{s}}=1,2,\cdots$) according to a certain policy (for example $\varepsilon$-greedy)

7: Execute action $u_{l_{s}}$

8: Save sample data $\langle x_{l_{s}},{u_{l_{s}}},{r_{l_{s}}},{x'_{l_{s}}}\rangle$

9: $l_s=l_s+1$

10: $n_s=n_s+1$

11: ${\bf if}~~x_{l_{s}}$ is the terminal state of episode $k$ (for episodic task)

12: ${\bf repeat}$

13: $h(x'_{l_{s}}) = {\rm arg\,max}_u\,{\phi}^{\rm T}(x'_{l_s},u){\theta}$

14: $l_s=1$

15: ${\bf repeat}$

16:

$ {B}= {B}- \frac{{B}{\phi}(x_{l_s},u_{l_s})({\phi}(x_{l_s},u_{l_s})- \gamma{\phi}(x'_{l_s},h(x'_{l_s})))^{\rm T}{B}}{1+ ({\phi}(x_{l_s},u_{l_s})-\gamma {\phi}(x'_{l_s},h(x'_{l_s})))^{\rm T}{B} {\phi}(x_{l_s},u_{l_s})} $

17: ${{z}}={{z}}+{\phi}(x_{l_{s}},u_{l_{s}}){r}_{l_{s}}$

18: ${l_s}={l_s}+1$

19: ${\bf until}~~{l_s}={n_s}$

20: ${\theta}={B}{{z}}$

21: $i=i+1$

22: ${\bf until}~~i=d$

23: $l_s=0$

24: $n_s=0$

25: $i=0$

26: ${\bf end if}$

27: Update current state $x_{l_{s}}=x'_{l_{s}-1}$

28: ${\bf until}$ the terminal condition is satisfied
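The following Python sketch gives one reading of the replay loop inside Algorithm 1 under stated assumptions: feature() and the sample format are illustrative, and $B$ is initialized here as the inverse of a small regularizer $\beta I$ so that it tracks $({\Gamma}-\gamma{\Lambda})^{-1}$ through the Sherman-Morrison recursion of (6), whereas Algorithm 1 writes the initialization as $\beta_\Gamma I$.

# Sketch of one ERLSPI (LSTD-Q) policy update over a stored batch of tetrads,
# replayed d times. Not a verbatim implementation of Algorithm 1.
import numpy as np

def erlspi_lstdq_update(samples, theta, feature, n_actions, n, d=5,
                        gamma=0.95, beta=0.001):
    def greedy(x, th):
        # h(x) = argmax_u phi(x, u)^T theta
        return max(range(n_actions), key=lambda u: feature(x, u) @ th)

    B = (1.0 / beta) * np.eye(n)   # inverse of the regularizer beta*I; tracks (Gamma - gamma*Lambda)^{-1}
    z = np.zeros(n)
    for _ in range(d):             # replay the same tetrads d times, each pass under the latest policy
        for (x, u, r, x_next) in samples:
            phi = feature(x, u)
            phi_next = feature(x_next, greedy(x_next, theta))   # the quintuple uses h(x')
            delta = phi - gamma * phi_next
            # Sherman-Morrison rank-one update, cf. (6)
            B -= (B @ np.outer(phi, delta) @ B) / (1.0 + delta @ B @ phi)
            z += phi * r
        theta = B @ z              # new parameters, hence a new greedy policy for the next pass
    return theta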

The least-squares policy evaluation for Q-functions (LSPE-Q) proposed by Jung et al. uses a formula similar to (4) to process the sample data and evaluate the policy[16, 17]. The initial parameter vector ${\theta}_0$ can be set arbitrarily. The updating equation is as follows:

$ { \theta}_{l_s}={\theta}_{l_{s-1}}+\alpha({\theta}^+_{l_s}-{ \theta}_{l_{s-1}}), $ (7)

where ${\theta}^+_{l_s}=\left(\frac{1}{l_s}{{\Gamma}}_{l_s}\right)^{-1}\left({\gamma}\frac{1}{l_s}{{\Lambda}}_{l_s}{\theta}_{l_s-1}+\frac{1}{l_s}{z}_{l_s}\right)$, $\alpha$ is the step size parameter, and $l_s$ is the serial number of the sample currently being processed. The Sherman-Morrison formula is again used to avoid explicit matrix inversion; the matrix ${B}\in {\bf R}^{n\times n}$ is updated with

$ {B}_{l_s}={B}_{l_s-1}-\frac{{B}_{l_s-1}{\phi} (x_{l_s},u_{l_s}){\phi}^{\rm T}(x_{l_s},u_{l_s}){B}_{l_s-1}}{1+ {\phi}^{\rm T}(x_{l_s},u_{l_s}){B}_{l_s-1}{\phi}(x_{l_s},u_{l_s})}. $ (8)

Unlike ERLSPI with LSTD-Q, ERLSPI with LSPE-Q updates the policy parameter $\theta$ after each state transition; each update depends on the result of the previous step, so the value of $\theta$ depends on the order in which samples are processed, and it is also affected by the step size parameter $\alpha$. The algorithm is therefore less stable than ERLSPI with LSTD-Q, which uses all samples currently collected and eliminates the step size parameter $\alpha$. However, if a good initial value is given and the step size parameter $\alpha$ is lowered gradually, the algorithm can often achieve better performance. ERLSPI with LSPE-Q is given as Algorithm 2.

Algorithm 2. ERLSPI with LSPE-Q

1: ${B}=\beta_\Gamma {I}$ ($\beta_\Gamma$ is a small positive constant)

2: ${{{{z}}}={0}}$

3: ${n}_s=0$ (${n}_s$ is the number of current samples)

4: ${i}=0$ ($i$ is the counter)

5: ${\bf repeat}$

6: At current state ${x_{l_{s}}}$,choose action ${u_{l_{s}}} ({l_{s}}=1,2,\cdots$) according to a certain policy (for example $\varepsilon$-${\rm greedy}$)

7: Execute action $u_{l_{s}}$

8: Save sample data $\langle{x_{l_{s}}},{u_{l_{s}}},{r_{l_{s}}},{x'_{l_{s}}}\rangle$

9: $l_s={l_s}+1$

10: $n_s=n_s+1$

11: ${\bf if}~~{x_{l_s}}$ is the terminal state of episode ${k}$ (for episodic task)

12: ${\bf repeat}$

13: $h(x'_{l_{s}}) = {\rm arg\,max}_u\,{\phi}^{\rm T}(x'_{l_s},u){\theta}_{l_{s}}$

14: $l_s=1$

15: ${\bf repeat}$

16:

$ {{{B}}}_{l_{s}}= {{{B}}}_{l_s-1}-\frac{{{{B}}}_{l_s-1} {\phi}(x_{l_s},u_{l_s}){\phi}^{\rm T}(x_{l_s},u_{l_s}){{B}}_{l_s-1}}{1+ {\phi}^{\rm T}(x_{l_s},u_{l_s}){{B}}_{l_s-1}{\phi}(x_{l_s},u_{l_s})} $

17: ${\Lambda}_{l_s}= {\Lambda}_{l_s-1}+{{\phi}}(x_{l_s},u_{l_s}){{\phi}}^{\rm T}(x'_{l_s},h(x'_{l_s}))$

18: ${{{ z}}}={{{ z}}}+{\phi}(x_{l_{s}},u_{l_{s}}){{ r}}_{l_{s}}$

19: ${{\theta}}_{l_{s}}={\theta}_{l_s-1}+\alpha({\theta}^+_{l_{s}}-{\theta}_{l_s-1})$, where ${\theta}^+_{l_{s}}={B}_{l_{s}}({\gamma}{\Lambda}_{l_{s}}{\theta}_{l_s-1}+{{z}}_{l_{s}})$

20: $l_s=l_s+1$

21: ${\bf until}~l_s=n_s$

22: $i=i+1$

23: ${\bf until}~~i=d$

24: $l_s=0$

25: $n_s=0$

26: $i=0$

27: ${\bf end if}$

28: Update current state $x_{l_{s}}=x'_{l_{s}-1}$

29: ${\bf until}$ the terminal condition is satisfied
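A corresponding sketch of the replay passes of Algorithm 2, under the same assumptions as before (feature() and the sample format are illustrative, and $B$ is initialized as the inverse of a small regularizer so that it tracks ${\Gamma}^{-1}$ via (8)); ${\Lambda}$ and ${z}$ are accumulated as in (5), and ${\theta}$ follows the damped update (7).

# Sketch of one ERLSPI (LSPE-Q) policy update over a stored batch of tetrads.
# Not a verbatim implementation of Algorithm 2.
import numpy as np

def erlspi_lspeq_update(samples, theta, feature, n_actions, n, d=5,
                        gamma=0.95, alpha=0.005, beta=0.001):
    def greedy(x, th):
        # h(x) = argmax_u phi(x, u)^T theta
        return max(range(n_actions), key=lambda u: feature(x, u) @ th)

    B = (1.0 / beta) * np.eye(n)   # inverse of the regularizer beta*I; tracks Gamma^{-1} via (8)
    Lam = np.zeros((n, n))
    z = np.zeros(n)
    for _ in range(d):                               # replay the stored tetrads d times
        for (x, u, r, x_next) in samples:
            phi = feature(x, u)
            phi_next = feature(x_next, greedy(x_next, theta))
            # Sherman-Morrison update of Gamma^{-1}, cf. (8)
            B -= (B @ np.outer(phi, phi) @ B) / (1.0 + phi @ B @ phi)
            Lam += np.outer(phi, phi_next)
            z += phi * r
            # theta^+ = Gamma^{-1}(gamma * Lambda * theta + z), then the damped step (7)
            theta_plus = B @ (gamma * Lam @ theta + z)
            theta = theta + alpha * (theta_plus - theta)
    return theta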

B. Algorithm Analysis

To ensure the convergence of the algorithm, the samples must be generated with exploratory policies. If only state-action pairs of the form $(x,h(x))$ are collected, then no information is obtained about state-action pairs $(x,u)$ with $u\neq h(x)$, which means the corresponding weight $w(x,u)$ is 0. The Q-values of such state-action pairs would therefore not be properly estimated, and the policy improvement based on the approximate Q-function could not be well performed. To avoid this, actions different from $h(x)$, such as random actions taken with a certain probability, should be selected under appropriate conditions. For a given stationary exploratory policy, the algorithm can be used to evaluate a further exploratory policy.

Lemma 1. For any Markov chain, if 1) $\theta_t$ is obtained by the ERLSPI algorithm, 2) each state-action pair $(x,u)$ is visited infinitely often, 3) each state-action pair $(x,u)$ is visited in proportion $w(x,u)$ with probability 1 in the long run, and 4) ${\phi}^{\rm T}{w}({I}-\gamma\bar{{f}}{h}){\phi}$ is invertible, then the equation

$ {\theta}_{\rm ERLSPI}=[{\phi}^{\rm T}{{{w}}}({I}-\gamma\bar{{{f}}}{{h}}){\phi}]^{-1}[{\phi}^{\rm T}{{{w}}}\bar{{{r}}}]\notag $

holds with probability 1, where ${\phi}\in {\bf R}^{\bar{N}\bar{M}\times\bar{N}\bar{M}}$ is the basis function matrix whose row corresponding to the state-action pair $(x_i,u_j)$ is ${\phi}^{\rm T}(x_i,u_j)$; ${{w}}\in {\bf R}^{\bar{N}\bar{M}\times\bar{N}\bar{M}}$ is the diagonal matrix of the weights $w(x,u)$ of the state-action pairs, with $\sum\limits_{x\in X}\sum\limits_{u\in U}{w}(x,u)=1$; $\bar{{{f}}}\in{\bf R}^{\bar{N}\bar{M}\times\bar{N}}$ is the state transition probability matrix with $\bar{f}_{[i,j],i'}=\bar{f}(x_i,u_j,x_{i'})$; ${{h}}\in {\bf R}^{\bar{N}\times\bar{N}\bar{M}}$ is the policy matrix with ${h}_{i',[i,j]}=1$ when $i'=i$ and $h(x_i)=u_j$, and ${h}_{i',[i,j]}=0$ otherwise; and $\bar{{{r}}}\in {\bf R}^{\bar{N}\bar{M}}$ is the expected immediate reward vector with $\bar{r}_{[i,j]}=\sum\limits_{x'}\bar{f}(x_i,u_j,x')\rho(x_i,u_j,x')$.

Proof. In the case of online LSPI,at time $t$,by (4) we get

$ \theta_t=\Big[\frac{1}{{t}}\sum^t_{k=1}{\phi}_k({\phi}_k-\gamma{\phi}_{k'})^{\rm T}\Big]^{-1}\Big[\frac{1}{t}\sum^t_{k=1}{\phi}_k{{r}}_k\Big].\notag $

In the case of ERLSPI, where the sample data is reused $d$ times, this can be further transformed into

$ \theta_t=\left[\frac{1}{dt}\sum^t_{k=1}\sum^d_{i=1}{ \phi}_k({\phi}_k-\gamma{\phi}_{k'})^{\rm T}\right]^{-1}\left[\frac{1}{t}\sum^t_{k=1}{\phi}_k{{r}}_k\right].\notag $

When ${t}\rightarrow \infty$, by condition 2), each state-action pair $({x},{u})$ is visited infinitely often, and the empirical transition frequencies between state-action pairs approach the true transition probabilities $\bar{f}$ with probability 1. Meanwhile, by condition 3), each state-action pair $({x},{u})$ is visited in proportion $w(x,u)$ with probability 1. Therefore, by condition 4), ${\theta}_{\rm ERLSPI}$ can be estimated as

$ {\theta}_{\rm ERLSPI} = \lim\limits_{t\rightarrow\infty}{\theta}_t=\notag \\ \lim\limits_{t\rightarrow\infty}\left[\frac{1}{dt}\sum^t_{k=1}\sum^d_{i=1}{\phi}_k({\phi}_k-\gamma {\phi}_{k'})^{\rm T}\right]^{-1}\left[\frac{1}{t}\sum^t_{k=1}{\phi}_k{{r}}_k\right]=\notag \\ \left[\lim\limits_{t\rightarrow\infty}\frac{1}{dt}\sum^t_{k=1}\sum^d_{i=1}{\phi}_k ({\phi}_k-\gamma{\phi}_{k'})^{\rm T}\right]^{-1} \lim\limits_{t\rightarrow\infty}\left[\frac{1}{t}\sum^t_{k=1}{\phi}_k{{r}}_k\right] =\notag\\ \Big[\sum\limits_{x\in {X}}\sum\limits_{u\in {U}}w(x,u) \sum\limits_{x'\in {X}}\bar{f}(x,u,x'){\phi}(x,u) ({\phi}(x,u)-\gamma{\phi}(x',h(x')))^{\rm T}\Big]^{-1}\times \notag\\ \Big[\sum\limits_{x\in {X}}\sum\limits_{u\in {U}}w(x,u){\phi}(x,u) \sum\limits_{x'\in {X}}\bar{f}(x,u,x')\rho(x,u,x')\Big]=\notag\\ [{\phi}^{\rm T}{{w}}({I}-\gamma\bar{{{f}}}{h}) {\phi}]^{-1}[{\phi}^{\rm T}{{w}}\bar{{{r}}}]. $

Theorem 1. For ergodic Markov chains, which have no unreachable states and in which each state is visited with nonzero probability, if:

1) The feature vectors $\{{\phi}(x,u)|x\in {X},u\in {U}\}$ which represent state-action pairs are linearly independent,

2) The dimension of each feature vector ${\phi}(x,u)$ is $n=|{X}|\times|{U}|$ ,

3) The discount factor satisfies $0\leq\gamma<1$, then, as the number of state transitions tends to infinity, the asymptotic parameter vector ${\theta}_{\rm ERLSPI}$ of the ERLSPI algorithm converges to its optimal value ${\theta}^*$ with probability 1.

Proof. Since the Markov chain is ergodic, when the time step $t$ tends to infinity, each state-action pair is visited infinitely often with probability 1. Meanwhile, each state-action pair is visited in the proportion $w(x,u)$ with probability 1, and for any $(x,u)\in({X}\times{U})$, $w(x,u)>0$; thus the matrix ${{w}}$ is invertible. By condition 2), ${\phi}$ is an ${n}\times{n}$ square matrix, and by condition 1), the rank of ${\phi}$ is $n$, so ${\phi}$ is invertible. By condition 3), $0\leq\gamma<1$, so ${I}-\gamma\bar{{{f}}}{{h}}$ is invertible. Therefore, by Lemma 1, the equation ${\theta}_{\rm ERLSPI}=[{\phi}^{\rm T}{{w}}({I}-\gamma\bar{{{f}}}{{h}}){\phi}]^{-1}[{\phi}^{\rm T}{{w}}\bar{{{r}}}]$ holds with probability 1.

The Q-function satisfies the consistency condition $Q(x,u)=\bar{r}(x,u)+\gamma \sum\limits_{x' \in {X}}\bar{f}(x,u,x')Q(x',h(x'))$.

So the expected immediate reward $\bar{r}(x,u)$ can be obtained as

$ \bar{r}(x,u)=Q(x,u)-\gamma\sum\limits_{x'\in {X}} \bar{f}(x,u,x')Q(x',h(x'))= \notag\\ \quad {\phi}^{\rm T}(x,u){\theta}^*-\gamma\sum\limits_{x' \in {X}}\bar{f}(x,u,x'){\phi}^{\rm T}(x',h(x')){\theta}^*=\notag\\ \quad \Big({\phi}(x,u)-\gamma\sum\limits_{x' \in {X}} \bar{f}(x,u,x'){\phi}(x',h(x'))\Big)^{\rm T}{\theta}^*. $

Its matrix form is

$ \bar{{{r}}}=({I}-\gamma\bar{{{f}}}{{h}}){\phi}{\theta}^*.\notag $

Substituting this into the formula for ${\theta}_{\rm ERLSPI}$, we get

$ {\theta}_{\rm ERLSPI} = [{\phi}^{\rm T}{{w}}(I-\gamma\bar{{f}} {{h}}){\phi}]^{-1}[\phi^{\rm T}{w}\bar{{r}}]= \notag\\ \qquad [{{\phi}}^{\rm T}{w}(I-\gamma\bar{{f}}{h}) {\phi}]^{-1}[{\phi}^{\rm T}{w}({I}-\gamma\bar{{f}}{{h}}){\phi}] {\theta}^*={\theta}^*. $

Thus,${\theta}_{\rm ERLSPI}$ converges to ${\theta}^*$ with probability 1.

Since the sample data is generated with an exploratory policy, each state-action pair $(x,u)$ is visited infinitely often as $t\rightarrow\infty$. By the law of large numbers, the empirical frequencies approach the true probabilities, so the empirical transition frequencies between state-action pairs approximate the true state transition probabilities $\bar{f}$. Moreover, the visiting frequency of each state-action pair $(x,u)$ approaches $w(x,u)$ with probability 1, which meets the condition of an ergodic Markov chain. Therefore, using linearly independent Gaussian radial basis functions $\phi(x,u)~(x\in {X},u\in {U})$ whose dimension is $n=|{X}|\times|{U}|$, and taking the discount factor $0\leq\gamma<1$, the ERLSPI algorithm converges by Theorem 1.

From the computational point of view, calculating $\theta$ directly from (4) requires a complete matrix inversion, whose complexity is ${\rm O}(n^3)$, where $n$ is the dimension of the radial basis functions. For episodic tasks, suppose that a batch updating interval contains $k$ episodes, the sample data is reused $d$ times, and the number of samples collected in an updating interval is $m$. In ERLSPI with LSTD-Q, $\theta$ is calculated $d$ times in each updating interval, so the average time complexity per episode is ${\rm O}(\frac{d}{k}n^3)$; in ERLSPI with LSPE-Q, $\theta$ is updated at every state transition, so its time complexity is ${\rm O}(m\frac{d}{k}n^3)$. Calculating incrementally with the Sherman-Morrison formula eliminates the matrix inversion, and the time complexity per episode is reduced to ${\rm O}(m\frac{d}{k}n^2)$. The time complexity is proportional to $d$ and inversely proportional to $k$. The larger the value of $d$, the more the experience is reused and the more quickly the optimal policy is attained, but the greater the computational cost. Thereby, we can reduce the number of policy updates by increasing the value of $k$ properly, which reduces the computational cost without reducing the amount of sample data. The time complexities of online LSPI, ERLSPI with LSTD-Q and ERLSPI with LSPE-Q are given in Table I.

Ⅳ. EXPERIMENT AND RESULTS ANALYSIS

A. Experimental Description

In order to verify the performance of the proposed algorithm, we simulate the inverted pendulum system: a cart that can move along a horizontal rail carries an inverted pendulum that can rotate freely about its axis, and the goal is to keep the inverted pendulum balanced by applying different forces, as shown in Fig. 1.

Fig. 1. Inverted pendulum system.

Following [13], the state of the system, represented by the angle $\chi$ between the inverted pendulum and the vertical direction and the current angular velocity $\dot{\chi}$, is continuous. The action is the force applied to the cart, with three distinct choices: leftward force ($-$50 N), rightward force (50 N), and no force (0 N). All three actions are noisy: a uniform noise in the interval [$-$10, 10] is added to the selected action, and the actual (noisy) force applied to the cart is denoted by $F$. The transitions are governed by the nonlinear dynamics of the system, which can be represented by

$ \ddot{\chi}=\frac{{g}{\rm sin}(\chi) - \eta al\frac{(\dot{\chi})^2}{2}-\eta {\rm cos}(\chi)F}{\frac{4}{3}l-\eta al{\rm cos}^2(\chi)}, $ (9)

where $g$ is the acceleration of gravity ($g=9.8~{\rm m/s^2}$), $a$ is the mass of the inverted pendulum, $b$ is the mass of the cart ($b=8.0~{\rm kg}$), $l$ is the length of the inverted pendulum, and $\eta$ is a constant ($\eta=1/(a+b)$). The time step of the system simulation is 0.1 s. This is an episodic task: as long as the absolute value of the angle between the inverted pendulum and the vertical direction does not exceed $\pi/2$, the reward is 0; once it exceeds $\pi/2$, the episode ends and a reward of $-$1 is received. The discount factor of the experiment is 0.95.
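For concreteness, a simple Euler simulation of (9) at the 0.1 s step is sketched below; the pendulum mass $a$ and length $l$ are not specified in the text, so the values used here are assumptions made for the example.

# Illustrative simulation step for the inverted pendulum of (9).
# A_MASS (a) and LENGTH (l) are assumed values; b = 8.0 kg and dt = 0.1 s follow the text.
import numpy as np

G, A_MASS, B_MASS, LENGTH, DT = 9.8, 2.0, 8.0, 0.5, 0.1
ETA = 1.0 / (A_MASS + B_MASS)
ACTIONS = [-50.0, 0.0, 50.0]                             # leftward, none, rightward (N)

def step(chi, chi_dot, action_index, rng=np.random.default_rng()):
    force = ACTIONS[action_index] + rng.uniform(-10.0, 10.0)   # uniform noise in [-10, 10]
    # Angular acceleration from (9)
    num = G * np.sin(chi) - ETA * A_MASS * LENGTH * (chi_dot ** 2) / 2.0 - ETA * np.cos(chi) * force
    den = 4.0 / 3.0 * LENGTH - ETA * A_MASS * LENGTH * np.cos(chi) ** 2
    chi_ddot = num / den
    chi_dot = chi_dot + DT * chi_ddot
    chi = chi + DT * chi_dot
    done = abs(chi) > np.pi / 2.0            # episode ends when the pendulum falls
    reward = -1.0 if done else 0.0
    return chi, chi_dot, reward, done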

B. Experimental Settings

In this experiment, for each action a set of 10-dimensional basis functions (30-dimensional overall for the 3 actions) is used to approximate the value function, consisting of a constant term and 9 Gaussian radial basis functions over the two-dimensional state space. For the state $x=(\chi,\dot{\chi})$ and action $u$, the basis functions corresponding to the remaining actions are all 0. The basis functions corresponding to action $u$ are as follows:

$ (1,{\rm e}^{-\frac{\|x-\mu_1\|^2}{2\sigma^2}},\cdots,{\rm e}^{-\frac{\|x-\mu_9\|^2}{2\sigma^2}})^\texttt{T} $ (10)

where $\mu_i~(i=1,\cdots,9)$ are the grid points of $\{-\pi/4,0,+\pi/4\}\times\{-1,0,+1\}$ in the two-dimensional state space and $\sigma$ is the width. The minimum number of balancing steps is set to 3000; that is, if the inverted pendulum has not fallen after 3000 time steps, the simulation is considered successful and the next episode starts. The maximum number of episodes in the experiment is set to 300. We analyze the performance of the algorithm by observing the balancing steps over 300 episodes. We use $\varepsilon$-greedy as the behavior policy in the experiment. The exploration probability is initially set to $\varepsilon_0=0.005$ and decays exponentially once every second with rate $\varepsilon_d=0.995$:

$ \varepsilon_{s} = \varepsilon_{0} \varepsilon_{d}^{\lfloor sT_{s}\rfloor},\notag $

where $s$ is the time step, $T_{s}$ is the sampling time of the process, $T_{s}=0.1$ s, and $\lfloor\cdot\rfloor$ denotes the floor operator.
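The basis of (10) and the exploration schedule above can be sketched as follows; the RBF width $\sigma$ is not given in the text, so its value here is an assumption.

# Sketch of the experimental settings: the constant-plus-9-RBF basis of (10) with
# centers on the grid {-pi/4, 0, pi/4} x {-1, 0, 1}, and the decaying exploration schedule.
import itertools
import numpy as np

CENTERS = np.array(list(itertools.product([-np.pi / 4, 0.0, np.pi / 4], [-1.0, 0.0, 1.0])))
SIGMA = 1.0                                   # assumed RBF width
EPS0, EPS_DECAY, T_S = 0.005, 0.995, 0.1      # epsilon_0, epsilon_d, sampling time (s)

def basis(chi, chi_dot):
    # (1, exp(-||x - mu_1||^2 / 2 sigma^2), ..., exp(-||x - mu_9||^2 / 2 sigma^2))^T
    x = np.array([chi, chi_dot])
    rbf = np.exp(-np.sum((x - CENTERS) ** 2, axis=1) / (2.0 * SIGMA ** 2))
    return np.concatenate(([1.0], rbf))       # 10-dimensional block for one action

def epsilon(step_count):
    # epsilon_s = epsilon_0 * epsilon_d ** floor(s * T_s)
    return EPS0 * EPS_DECAY ** int(np.floor(step_count * T_S))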

C. Experimental Results and Analysis

The experimental results of the online LSPI method and ERLSPI with LSTD-Q under different parameters are shown in Fig. 2, where the horizontal axis represents the episodes of the simulation experiment, the vertical axis represents the balancing steps of the inverted pendulum in each episode, $d$ represents the number of experience reuses and $k$ represents the updating interval. For example, in ERLSPI, $k=5$ and $d=2$ denote that the policy is updated once every five episodes and the experience is reused twice in each update. Fig. 2(a) shows the results of ERLSPI $(d=2)$, ERLSPI $(d=5)$ and online LSPI in the case of $k=1$; the experimental data are averaged over 10 runs. As we can see from the figure, the balancing steps of the online LSPI method remain at a low level until about 230 episodes. The balancing steps of ERLSPI are significantly larger than those of online LSPI, and the greater $d$ is, the more improvement we obtain. A greater $d$ means more experience is reused, faster learning of the optimal policy, and higher computational cost. Increasing the value of $k$ properly reduces the number of updates, which reduces the computational cost without reducing the amount of sample data. In Fig. 2(b), we can see that with $k=5$, where the ERLSPI methods still yield good results, the initial performance is slightly lower than with $k=1$. However, increasing the parameter $k$ further only leads to gradual deterioration of the experimental results, as shown in Fig. 2(c). Therefore, appropriate parameters should be selected to ensure good experimental results with reasonable computational cost. Table II shows the balancing steps of online LSPI and ERLSPI with LSTD-Q under different parameters in the first $p\,(p=50,100,150,200,250,300)$ episodes. The data in Table II show that the balancing steps of online LSPI remain below 50 within the first 200 episodes and eventually reach 2112 steps within 300 episodes, while the balancing steps of ERLSPI with $d=5$ reach several hundred within dozens of episodes and finally reach the 3000 balancing steps within 300 episodes. Therefore, reusing previous experience can significantly improve the performance of the algorithm and accelerate convergence.

Table Ⅰ
THE ORDER OF TIME COMPLEXITY OF ONLINE LSPI, ERLSPI WITH LSTD-Q AND ERLSPI WITH LSPE-Q

Fig. 2. The performance of online LSPI and ERLSPI with LSTD-Q under different parameters.

Table Ⅱ
THE BALANCING STEPS OF ONLINE LSPI AND ERLSPI WITH LSTD-Q UNDER DIFFERENT PARAMETERS

In the case of $k>5$, as $k$ increases, the computational cost decreases and so do the balancing steps, but the algorithm still reaches 3000 balancing steps within 300 episodes.

The experimental results of online LSPI and ERLSPI with LSPE-Q under different parameters are shown in Fig. 3, where the step size parameter is $\alpha=0.005$. Fig. 3(a) shows the results of ERLSPI ($d=2$), ERLSPI $(d=5)$ and online LSPI in the case of $k=1$; the experimental data are averaged over 10 runs. As we can see from Fig. 3, similar to ERLSPI with LSTD-Q, the balancing steps of online LSPI remain at a low level until about 180 episodes. The balancing steps of ERLSPI with LSPE-Q are significantly larger than those of online LSPI, and the greater $d$ is, the more improvement we obtain. A greater $d$ means more experience is reused, faster learning of the optimal policy, and higher computational cost. Increasing the value of $k$ properly reduces the number of updates, which reduces the computational cost without reducing the amount of sample data. In Fig. 3(b), we can see that with $k=5$, where the ERLSPI methods still yield good results, the initial performance is slightly lower than in the other cases. However, increasing the parameter $k$ further only leads to gradual deterioration of the experimental results, as shown in Fig. 3(c). Therefore, appropriate parameters should be selected to ensure good experimental results with reasonable computational cost.

Fig. 3. The performance of online LSPI and ERLSPI with LSPE-Q under different parameters.

Table III shows the balancing steps of online LSPI and ERLSPI with LSPE-Q under different parameters in the first $p\,(p=50,100,150,200,250,300)$ episodes. The data in the table show that the balancing steps of online LSPI remain below 50 within the first 150 episodes and eventually reach 2112 steps within 300 episodes, while the balancing steps of ERLSPI with $d=5$ reach several hundred within dozens of episodes and finally reach the 3000 balancing steps within 300 episodes. Therefore, reusing previous experience can significantly improve the performance of the algorithm and accelerate convergence. In the case of $k>5$, as $k$ increases, the computational cost decreases and so do the balancing steps, but the algorithm still reaches the 3000 balancing steps within 300 episodes.

Table Ⅲ
THE BALANCING STEPS OF ONLINE LSPI AND ERLSPI WITH LSPE-Q UNDER DIFFERENT PARAMETERS

The performance comparison of ERLSPI with online LSPI, offline LSPI and batch TD is shown in Fig. 4, where the batch algorithms update the policy once every 5 episodes, the samples are reused 5 times, and the step size parameter is $\alpha=0.005$. The balancing steps of batch TD remain within dozens of steps throughout the 300 episodes, far lower than those of the least-squares methods, which accumulate the sample data in matrices and make full use of it. In the online LSPI algorithm, the sample data is used only once, so the initial performance is poor, but after about 230 episodes it improves rapidly. The initial performance of offline LSPI is better because it reuses all the samples until the current policy is stable. Offline LSPI reuses samples most frequently; however, it requires samples collected in advance and cannot be supplemented with new and better online samples, which leads to slow improvement. The ERLSPI algorithm collects samples online and reuses the sample data, which ensures a certain initial performance and allows it to converge quickly to the optimal policy and achieve 3000 balancing steps.

Fig. 4. The balancing steps of different algorithms.

Define the mean squared error (MSE) of policy parameter $\theta$ as

$ {\rm MSE}(\theta)=\sum_{i=1}^{N}(\theta(i)-\theta^*(i))^2, $ (11)
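A direct computation of (11), for completeness:

# MSE of a parameter vector against the optimal one, as in (11).
import numpy as np

def mse(theta, theta_star):
    return float(np.sum((np.asarray(theta) - np.asarray(theta_star)) ** 2))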

where $\theta^*$ is the optimal policy parameter and $N$ is the dimension of $\theta$. The MSE indicates the distance between the current policy and the optimal policy: the smaller the MSE, the higher the accuracy of the policy. The MSE values of the ERLSPI algorithms, online LSPI, offline LSPI and batch TD are shown in Fig. 5. The MSE of the batch TD algorithm is large, indicating that its policy accuracy is unsatisfactory; online LSPI improves the policy accuracy noticeably, but the improvement is slow; offline LSPI improves very fast in the beginning, but it levels off after about 120 episodes and is unable to find the optimal policy within 300 episodes; the ERLSPI algorithms almost find the optimal policy after about 200 episodes and maintain high policy accuracy. Therefore, we can conclude that the ERLSPI algorithms improve the policy accuracy within the same number of episodes, find the optimal policy more quickly, enhance the empirical utilization rate and accelerate convergence.

Fig. 5. The MSE of different algorithms.
Ⅴ. CONCLUSION

The traditional online least-squares policy iteration method uses the samples generated online only once, which leads to a low empirical utilization rate and slow convergence. We combine the experience replay method with online least-squares policy iteration and propose an ER least-squares policy iteration method. The method stores the samples generated online in the form of a tetrad $(x,u,r,x')$ and reuses them. In the process of experience reuse, although the same tetrads are used, the policy parameter is updated with the quintuple $(x,u,r,x',h(x'))$, and the policy differs in each update. As a result, samples can be used more than once to update the policy parameter, improving the empirical utilization rate. ERLSPI with LSTD-Q updates the policy parameter after processing a batch of samples and has more stable performance, while ERLSPI with LSPE-Q updates the policy parameter after processing every sample and its result depends on the previous evaluation; if a good initial value is assigned and the step size parameter is decreased gradually, it can also achieve good performance. Since computing the policy parameter requires a matrix inversion with high complexity and large computational cost, the Sherman-Morrison formula is applied to eliminate the matrix inversion and reduce the time complexity and computational cost. The experimental results on the inverted pendulum show that the ERLSPI method can reuse previous experience, improve the empirical utilization rate, and accelerate convergence. Additionally, more repetitions bring more improvement but require higher computational cost; increasing the batch updating interval properly can also achieve fine performance with a smaller computational cost. Therefore, appropriate algorithm parameters should be selected according to the specific circumstances.

When the state space is divided more finely, the dimension of the radial basis functions becomes relatively large, and the least-squares method generally has a high computational cost. How to reduce the computational cost, improve the computational efficiency, and select proper samples efficiently are important questions for our future research. In addition, extending this work from discrete action spaces to continuous action spaces is also a direction for further study.

References
[1] Wiering M, van Otterlo M. Reinforcement learning: state-of-the-art. Adaptation, Learning, and Optimization. Berlin, Heidelberg: Springer, 2012.12-50
[2] Zhu Fei, Liu Quan, Fu Qi-Ming, Fu Yu-Chen. A least square actor-critic approach for continuous action space. Journal of Computer Research and Development, 2014, 51(3): 548-558 (in Chinese)
[3] Liu De-Rong, Li Hong-Liang, Wang Ding. Data-based self-learning optimal control: research progress and prospects. Acta Automatica Sinica, 2013, 39(11): 1858-1870 (in Chinese)
[4] Zhu Mei-Qiang, Cheng Yu-Hu, Li Ming, Wang Xue-Song, Feng Huan-Ting. A hybrid transfer algorithm for reinforcement learning based on spectral method. Acta Automatica Sinica, 2012, 38(11): 1765-1776 (in Chinese)
[5] Chen Xin, Wei Hai-Jun, Wu Min, Cao Wei-Hua. Tracking learning based on Gaussian regression for multi-agent systems in continuous space. Acta Automatica Sinica, 2013, 39(12): 2021-2031 (in Chinese)
[6] Xu X, Zuo L, Huang Z H. Reinforcement learning algorithms with function approximation: recent advances and applications. Information Sciences, 2014, 261: 1-31
[7] Bradtke S J, Barto A G. Linear least-squares algorithms for temporal difference learning. Recent Advances in Reinforcement Learning. New York: Springer, 1996.33-57
[8] Escandell-Montero P, Martínez-Martínez J D, Soria-Olivas E, Gómez-Sanchis J. Least-squares temporal difference learning based on an extreme learning machine. Neurocomputing, 2014, 141: 37-45
[9] Maei H R, Szepesvári C, Bhatnagar S, Sutton R S. Toward off-policy learning control with function approximation. In: Proceedings of the 27th International Conference on Machine Learning. Haifa: Omnipress, 2010.719-726
[10] Tamar A, Castro D D, Mannor S. Temporal difference methods for the variance of the reward to go. In: Proceedings of the 30th International Conference on Machine Learning (ICML-13). Atlanta, Georgia, 2013.495-503
[11] Dann C, Neumann G, Peters J. Policy evaluation with temporal differences: a survey and comparison. The Journal of Machine Learning Research, 2014, 15(1): 809-883
[12] Lagoudakis M G, Parr R, Littman M L. Least-squares methods in reinforcement learning for control. Methods and Applications of Artificial Intelligence. Berlin, Heidelberg: Springer, 2002, 2308: 249-260
[13] Lagoudakis M, Parr R. Least-squares policy iteration. Journal of Machine Learning Research, 2003, 4: 1107-1149
[14] Busoniu L, Babuska R, De Schutter B, Ernst D. Reinforcement Learning and Dynamic Programming using Function Approximators. New York: CRC Press, 2010.100-118
[15] Adam S, Busoniu L, Babuska R. Experience replay for real-time reinforcement learning control. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2012, 42(2): 201-212
[16] Jung T, Polani D. Kernelizing LSPE (λ). In: Proceedings of the 2007 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning. Honolulu, HI: IEEE, 2007. 338-345
[17] Jung T, Polani D. Least squares SVM for least squares TD learning. In: Proceedings of the 17th European Conference on Artificial Intelligence. Amsterdam: IOS Press, 2006. 499-503