IEEE/CAA Journal of Automatica Sinica, 2014, Vol. 1, Issue 3: 323-336
Continuous Action Reinforcement Learning for Control-Affine Systems with Unknown Dynamics
Aleksandra Faust, Peter Ruymgaart, Molly Salman, Rafael Fierro, Lydia Tapia     
1. Department of Computer Science, University of New Mexico, Albuquerque, NM 87131, USA;
2. Computer Science Department, Austin College, Sherman, TX 75090, USA;
3. Department of Electrical and Computer Engineering, University of New Mexico, Albuquerque, NM 87131, USA
Abstract: Control of nonlinear systems is challenging in real time. Decision making, performed many times per second, must ensure system safety. Designing input to perform a task often involves solving a nonlinear system of differential equations, which is a computationally intensive, if not intractable, problem. This article proposes sampling-based task learning for control-affine nonlinear systems through the combined learning of both state-value and action-value functions in a model-free approximate value iteration setting with continuous inputs. A quadratic negative definite state-value function implies the existence of a unique maximum of the action-value function at any state. This allows the replacement of the standard greedy policy with a computationally efficient policy approximation that guarantees progression to a goal state without knowledge of the system dynamics. The policy approximation is consistent, i.e., it does not depend on the action samples used to calculate it. This method is appropriate for mechanical systems with high-dimensional input spaces and unknown dynamics performing Constraint-Balancing Tasks. We verify it both in simulation and experimentally for an unmanned aerial vehicle (UAV) carrying a suspended load, and in simulation for the rendezvous of heterogeneous robots.
Key words: reinforcement learning, policy approximation, approximate value iteration, fitted value iteration, continuous action spaces, control-affine nonlinear systems
Ⅰ. INTRODUCTION

HUMANS increasingly rely on robots to perform tasks. A particular class of tasks that interests us is Constraint-Balancing Tasks. These tasks have one goal state and opposing constraining preferences on the system. Balancing the speed and the quality of the task is often seen as two opposing preferential constraints. For example, the time-sensitive aerial cargo delivery task must deliver a suspended load to the origin as soon as possible with minimal load displacement along the trajectory (Fig. 1 (a)). The rendezvous task (Fig. 1 (b)) requires cargo-bearing unmanned aerial vehicles (UAVs) and a ground robot to meet without a predetermined meeting point. In these tasks the robot must manipulate and interact with its environment while completing the task in a timely manner. This article considers robots as mechanical systems with nonlinear control-affine dynamics. Without knowing their exact dynamics, we are interested in producing motions that perform a given Constraint-Balancing Task.

Fig. 1. Evaluated tasks.

Controllers for multi-dimensional nonlinear systems, such as robots and autonomous vehicles, must make decisions many times per second and must ensure system safety. Yet, designing inputs to perform a task typically requires knowledge of the system dynamics. Classical optimal control approaches use a combination of open-loop and closed-loop controllers to generate and track trajectories[1]. Another technique first linearizes the system and then applies LQR methods locally[2]. All classical methods for solving nonlinear control problems require knowledge of the system dynamics[2]. On the other hand, we present a solution to an optimal nonlinear control problem when the system dynamics is unknown.

Reinforcement learning (RL) solves control of unknown or intractable dynamics by learning from experience and observations. The outcome of RL is a control policy. Typically, RL learns the value (cost) function and derives a greedy control policy with respect to the value. In continuous spaces, the value function is approximated[3]. When actions are continuous, the greedy policy must be approximated as well. The downside of RL is that its sampling nature renders stability and convergence proofs challenging[3].

We rely on RL to learn a control policy for Constraint-Balancing Tasks without knowing the robot's dynamics. Given the continuous state space, Fitted Value Iteration (FVI) approximates a value function with a linear map of basis functions[4]. FVI learns the linear map parametrization iteratively in an expectation-maximization manner[3, 5]. The basis function selection is challenging because the learning convergence is sensitive to the selection of the approximation functional space[3]. Here, we select the basis functions to both fit the task and define the value function as a Lyapunov candidate function.

We extend FVI, a discrete action RL algorithm, to continuous action spaces to develop Continuous Action Fitted Value Iteration (CAFVI). The novelty is the joint use of two value functions, state-value and action-value, to learn the control. CAFVI learns the state-value function globally over the state space; its negative is a Lyapunov candidate function. On the other hand, in the estimation step, it learns an action-value function locally around a state to estimate its maximum. This maximum is found using newly developed policies that divide and conquer the problem by finding the optimal inputs on each axis separately and then combining them. Not only are the policies computationally efficient, scaling linearly with the input's dimensionality, but they also produce consistent near-optimal input; their outcome does not depend on the input samples used for the calculation. Although problem decomposition via individual dimensions is a common technique for dimensionality reduction[6], this article shows that single-component policies lead to a stable system, offers three examples of such policies that turn the equilibrium into an asymptotically stable point, and characterizes the systems for which the technique is applicable. The reinforcement learning agent is evaluated on a quadrotor with a suspended load and on a heterogeneous robot rendezvous task.

From the practical perspective, the article gives methods to implement FVI with a linear map approximation for a Constraint-Balancing Task on control-affine systems[7] with unknown dynamics and in the presence of a bounded drift. These tasks require the system to reach a goal state while minimizing opposing constraints along the trajectory. The method is fast and easy to implement, making it an inexpensive tool to try before more heavy-handed approaches are attempted.

Ⅱ. RELATED WORK

Efficient, near-optimal nonlinear system control is an important topic of research both in feedback controls and in reinforcement learning. With known system dynamics, [8] develops adaptive control for interconnected systems. When the system dynamics is not known, optimal[9, 10, 11] and near-optimal[12, 13, 14] control for interconnected nonlinear systems has been developed by learning the state-value function using neural networks. This article addresses the same problem, but we use linearly parametrized state-value functions with linear regression rather than neural networks for parameter learning. The generalized Hamilton-Jacobi-Bellman (HJB) equation for control-affine systems can be approximately solved with iterative least squares[15]. Our method also learns the value function, which corresponds to the generalized HJB equation solution, through iterative minimization of the least squares error. However, we learn from samples with linear regression rather than neural networks. For linear unknown systems, [16] gives an optimal control using approximate dynamic programming, while we consider nonlinear control-affine systems. Convergence proofs exist for neural network-based approximate value iteration dynamic programming for linear[17] and control-affine systems[18], both with known dynamics. Here, we are concerned with approximate value iteration methods in the reinforcement learning setting, i.e., without knowing the system dynamics.

In continuous action RL, the decision making step, which selects an input through a policy, becomes a multivariate optimization. The optimization poses a challenge in the RL setting because the objective function is not known. Robots need to perform input selection many times per second; 50 to 100 Hz is not unusual[19]. The decision-making challenges brought action selection in continuous spaces to the forefront of current RL research, with the main idea that gradient descent methods find maximums for known, convex value functions[20] and in actor-critic RL[21]. Our method is critic-only, and because the system dynamics is unknown, the value-function gradient is unavailable[21]. Thus, we develop a gradient-free method that divides and conquers the problem by finding the optimal input in each direction and then combining them. Other gradient-free approaches such as Gibbs sampling[22], Monte Carlo methods[23], and sample averages[24] have been tried. Online optimistic sampling planners have also been researched[25, 26, 27, 28]. Specifically, Hierarchical Optimistic Optimization applied to Trees (HOOT)[27] uses hierarchical discretization to progressively narrow the search to the most promising areas of the input space, thus ensuring arbitrarily small error. Our methods find a near-optimal action through sample-based interpolation of the objective function and find the maximum in closed form on each axis independently.

Discrete action FVI has solved the minimal residual oscillations task for a quadrotor with a suspended load and has developed the stability conditions for a discrete action Markov Decision Process (MDP)[29]. Empirical validation in [29] shows that the conditions hold. This article characterizes basis vector forms for control-affine systems, defines admissible policies resulting in an asymptotically stable equilibrium, and analytically shows the system stability. The empirical comparison with [29] in Section Ⅳ-B shows that the method presented here is both faster and performs the task with higher precision. This is because the decision making quality presented here is not limited to a finite action space and is independent of the available samples. We also show wider applicability of the methods developed here by applying them to a multi-agent rendezvous task. Our earlier work[30] extends [29] to environments with static obstacles specifically for aerial cargo delivery applications, and is concerned with generating trajectories in discrete action spaces along kinematic paths.

Ⅲ. METHODS

This section consists of four parts. First, Section Ⅲ-A specifies the problem formulation for a task on a control-affine system suitable for approximate value iteration with linear basis vectors. Based on the task, the system, and the constraints, we develop basis functions and write the state-value function in Lyapunov quadratic form. Second, Section Ⅲ-B develops sample-efficient policies that take the system to the goal and can be used for both planning and learning. Third, Section Ⅲ-C places the policies into the FVI setting to present a learning algorithm for goal-oriented tasks. Together they give practical implementation tools for solving Constraint-Balancing Tasks through reinforcement learning on control-affine systems with unknown dynamics. We discuss these tools in Section Ⅲ-D.

A. Problem Formulation

Consider a discrete-time, control-affine system with no disturbances, $\vec{D}:X \times U \rightarrow X$,

\begin{align} \vec{D}: \;\;\; \vec{x}_{k+1} = \vec{f}({\vec{x}_k}) + \vec{g}(\vec{x}_k)\vec{u}_k, \end{align} (1)

where the states are $\vec{x}_k \in X \subseteq {\bf R}^{d_x}$, the input is defined on a closed interval around the origin, $\vec{u}_k \in U \subseteq {\bf R}^{d_u}$, $d_u \leq d_x$, $\vec{0} \in U$, and $\vec{g}:X \rightarrow {\bf R}^{d_x}\times{\bf R}^{d_u}$, ${\vec{g}}(\vec{x}_k)^{\rm T} = [{\vec{g}}_1(\vec{x}_k)\;\cdots\; {\vec{g}}_{d_u}(\vec{x}_k)]$, is regular for $\vec{x}_k \in X \setminus \{ \vec{0}\}$, nonlinear, and Lipschitz continuous. The drift $\vec{f}:X \rightarrow {\bf R}^{d_x}$ is nonlinear and Lipschitz. Assume that the system is controllable[2]. We are interested in autonomously finding a control input $\vec{u}_k$ that takes the system to its origin in a timely manner while reducing $ \|{A}\vec{x} \| $ along the trajectory, where ${A}^{\rm T} = [{\vec{a}}_1,\cdots,{\vec{ a}}_{d_g}] \in {\bf R}^{d_g} \times {\bf R}^{d_x}$, $d_g \leq d_x$, is nonsingular.
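For concreteness, a minimal Python sketch of querying a generative model with the structure of (1) is given below; the drift $\vec{f}$ and input matrix $\vec{g}$ shown are hypothetical placeholders standing in for the unknown dynamics, and only the control-affine composition is taken from (1).

```python
import numpy as np

def affine_step(x, u, f, g):
    """One step of the control-affine dynamics (1): x_{k+1} = f(x_k) + g(x_k) u_k.
    f and g are black boxes to the learner; only their affine composition is used."""
    return f(x) + g(x) @ u

# Hypothetical toy instance, for illustration only.
f = lambda x: 0.9 * x                # a bounded drift
g = lambda x: np.eye(x.shape[0])     # a regular input matrix
x_next = affine_step(np.array([1.0, -2.0]), np.array([0.1, 0.3]), f, g)
```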

A discrete-time, deterministic, first-order MDP with continuous state and action spaces,

\begin{align} \label{eq:mdp} {M}: \;(X,U,\vec{D},\rho), \end{align} (2)

defines the problem. $\rho:X \rightarrow {\bf R}$ is the observed state reward, and the system dynamics $\vec{D}$ is given in (1). We assume that we have access to its generative model or samples, but that we do not know $\vec{D}$. In the remainder of the article, when the time step $k$ is not important, it is dropped from the state notation without loss of generality.

A solution to the MDP is an optimal policy $\vec{h^*}:X \rightarrow U$ that maximizes the discounted cumulative state reward. Thus, the objective function to maximize, the state-value function $V:X \rightarrow {\bf R}$, is

\begin{align} \label{eq:objective} V(\vec{x}) = \sum_{k=0}^\infty \gamma^k \rho_k, \end{align} (3)

where $\rho_k$ is the immediate reward observed at time step $k$ starting at state $\vec{x}$, and $0 \leq \gamma < 1$ is a discount constant. RL solves the MDP without analytical knowledge of the system dynamics $\vec{D}$ and the reward $\rho$. Instead, it interacts with the system and iteratively constructs the value function. Using the Bellman equation[31], the state-value function $V$ can be recursively represented as

$$ V(\vec{x} ) = \rho(\vec{x} ) + \gamma \max_{\vec{u} } {V(\vec{D}(\vec{x} ,\vec{u} ))}. $$

The state-value function is the immediate state reward plus the discounted value of the state the system transitions to when following the greedy policy. The action-value function $Q:X \times U \rightarrow {\bf R}$ is

$$ Q(\vec{x},\vec{u}) = \rho(\vec{x}') + \gamma \max_{\vec{u}'}V(\vec{D}(\vec{x}',\vec{u}')),\text{ where } \vec{x}' = \vec{D}(\vec{x},\vec{u}). $$

The action-value function $Q$ is the sum of the reward obtained upon performing action $\vec{u}$ from a state $\vec{x}$ and the value of the state that follows. Both value functions give an estimate of a value. A state-value function, $V$, is a measure of a state's value, while an action-value function, $Q$, assigns a value to a transition from a given state using an input. Note that the RL literature works with either a state reward $\rho$ or a related state-action reward, where the reward is a function of both the state and the action. We do not consider a cost of the action itself; thus the state-action reward is simply the reward of the state that the agent transitions to upon applying action $\vec{u}$ in state $\vec{x}$. Therefore, the relation between $V$ and $Q$ is

\begin{align} \label{eq:q} Q(\vec{x},\vec{u}) = V \circ \vec{D} (\vec{x},\vec{u}). \end{align} (4)

Both value functions induce a greedy policy $\vec{h}:X \rightarrow U$ that, at state $\vec{x}$, selects the input that transitions the system to the highest-valued reachable state:

\begin{align} {{\vec h}^Q}(\vec x) = \mathop {argmax}\limits_{\vec u \in U} Q(\vec x,\vec u). \end{align} (5)

A greedy policy uses the learned value function to produce trajectories. We learn the state-value function, $V$, because its approximation can be constructed to define a Lyapunov candidate function, and in tandem with the right policy it can help assess system stability. For discrete action MDPs, (5) is a brute force search over the available samples. When the action space is continuous, (5) becomes an optimization problem over the unknown function $\vec{D}$. We consider analytical properties of $Q(\vec{x},\vec{u})$ for a fixed state $\vec{x}$ and known $V$, while having only knowledge of the structure of the transition function $\vec{D}$. The key insight we exploit is that the existence of a maximum of the action-value function $Q(\vec{x},\vec{u})$, as a function of the input $\vec{u}$, depends only on the learned parametrization of the state-value function $V$.
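For the discrete case, the greedy policy (5) reduces to the brute-force search sketched below; the value estimate V and the generative model D are assumed to be available as callables and are placeholders for this illustration.

```python
import numpy as np

def greedy_discrete(x, action_samples, V, D):
    """Brute-force greedy policy (5) over a finite action set: evaluate
    Q(x, u) = V(D(x, u)) for every sampled input and keep the best one.
    The cost grows with the number of samples, which is exponential in d_u
    for a grid; the policies of Section III-B avoid this search."""
    values = [V(D(x, u)) for u in action_samples]
    return action_samples[int(np.argmax(values))]
```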

Approximate value iteration algorithms with linear map approximators require basis vectors. Given the state constraint minimization, we choose quadratic basis functions

\begin{align} {{F}_i}(\vec{x}) = \|\vec{a}_i^{\rm T}\vec{x} \|^2,\;\;\; i=1,\cdots,d_g, \label{eq:features} \end{align} (6)

so that the state-value function approximation, $V$, is a Lyapunov candidate function. Consequently, $V$ is

\begin{align} \label{eq:value} V(\vec{x}) = \sum_{i=1}^{d_g} {\theta}_i {{F}_i}(\vec{x}) = ({A}\vec{x} )^{\rm T}{\Theta} ({A}\vec{x}) = \vec{x}^{\rm T} \Lambda \vec{x}, \end{align} (7)

for a diagonal matrix ${\Theta} = \text{diag}(\vec{\theta}) = \text{diag}(\theta_1,\theta_2, \cdots,\theta_{d_g})$ and a symmetric matrix ${\Lambda}$. Let us assume that ${\Lambda}$ has full rank. Approximate value iteration learns the parametrization ${\Theta}$ using linear regression. Let $\Gamma = - \Lambda$. Note that if ${\Theta}$ is negative definite, $\Lambda$ is as well, while $\Gamma$ is positive definite, and vice versa. Let us also assume that, when $\Gamma > 0$, the system drift is bounded by $\vec{x}$ with respect to the $\Gamma$-norm, $\vec{f}(\vec{x})^{\rm T}\Gamma \vec{f}(\vec{x}) \leq {\vec{x}}^{\rm T}\Gamma \vec{x}$. This characterizes a system drift conducive to the task. We empirically demonstrate its sufficiency in the robotic systems we consider.
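The approximation (6)-(7) is straightforward to compute; the sketch below illustrates it on a hypothetical two-dimensional example, with the rows of A being the vectors $\vec{a}_i^{\rm T}$ and theta the learned parametrization.

```python
import numpy as np

def features(x, A):
    """Quadratic basis functions (6): F_i(x) = ||a_i^T x||^2, with a_i^T the rows of A."""
    return (A @ x) ** 2

def value(x, theta, A):
    """State-value approximation (7): V(x) = theta^T F(x) = x^T Lambda x."""
    return theta @ features(x, A)

# Hypothetical 2-D example: with all theta_i < 0, Lambda = A^T diag(theta) A
# is negative definite, so -V is a Lyapunov candidate.
A = np.eye(2)
theta = np.array([-0.5, -2.0])
print(value(np.array([1.0, 1.0]), theta, A))   # -2.5
```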

To summarize the system assumptions used in the remainder of the article:

1) The system is controllable and the equilibrium is reachable. In particular, we use

\begin{align} \label{eq:controllable} \exists i,1 \leq i \leq d_u,\text{ such that } \vec{f}(\vec{x})^{\rm T}\Gamma{\vec{g}_i}(\vec{x}) \ne 0, \end{align} (8)

and that $\vec{g}(\vec{x})$ is regular outside of the origin,

\begin{align} \label{eq:regular} \vec{g}(\vec{x})^{\rm T}\Gamma \vec{g}(\vec{x}) > 0, \quad \vec{x} \in X \setminus \{\vec{0} \}. \end{align} (9)

2) The input is defined on a closed interval around the origin,

\begin{align} \label{eq:input} \vec{0} \in U. \end{align} (10)

3) The drift is bounded,

\begin{align} \label{eq:drift} \vec{f}(\vec{x})^{\rm T}\Gamma \vec{f}(\vec{x}) \leq \vec{x}^{\rm T}\Gamma \vec{x}, \text{ when } \Gamma > 0. \end{align} (11)

Table Ⅰ presents a summary of the key symbols.

TABLE Ⅰ
SUMMARY OF KEY SYMBOLS AND NOTATION
B. Policy Approximation

This section looks into an efficient and consistent policy approximation for (5) that leads the system (1) to a goal state in the origin. Here, we learn the action-value function $Q$ on the axes, and assume a known estimate of the state-value function approximation $V$. For the policy to lead the system to the origin from an arbitrary state, the origin must be asymptotically stable. The negative of the state-value function $V$ can serve as a Lyapunov function, and the value function $V$ then needs to be increasing in time. That only holds true when the policy approximation makes an improvement, i.e., the policy needs to transition the system to a state of a higher value ($V(\vec{x}_{n+1}) > V(\vec{x}_n)$). To ensure the temporal increase of $V$, the idea is to formulate conditions on the system dynamics and value function $V$ for which $Q$, considered as a function only of the input, is concave and has a maximum. In this work, we limit the conditions to a quadratic form $Q$. Once we establish the maximum's existence, we approximate it by finding maximums on the axes and combining them. Fig. 2 illustrates this idea. To reduce the dimensionality of the optimization problem, we propose a divide and conquer approach. Instead of solving one multivariate optimization, we solve $d_u$ univariate optimizations on the axes to find the highest-valued point on each axis, $u_i$. The composition of the axes' action selections is the selection vector $\vec{u}=[u_1,\cdots,u_{d_u}]^{\rm T}$. This section develops the policy approximation following these steps:

Fig. 2. Example of a two-dimensional input and a quadratic value function. ${{\vec{u}}^*}$ is the optimal input; $\vec{u}$ is the one selected.

1) Show that $Q$ is a quadratic form and has a maximum (Proposition 1).

2) Define admissible policies that ensure the equilibrium's asymptotic stability (Theorem 1).

3) Find a sampling-based method for calculating consistent, admissible policies in ${\rm O}(d_u)$ time with no knowledge of the dynamics (Theorem 2).

Since the greedy policy (5) depends on the action-value function $Q$, Proposition 1 gives the connection between the value function (7) and the corresponding action-value function $Q$.

Proposition 1. The action-value function $Q(\vec{x},\vec{u})$ (4) of the MDP (2) with state-value function $V$ (7) is a quadratic function of the input $\vec{u}$ for all states $\vec{x} \in X$. When ${\Theta}$ is negative definite, the action-value function $Q$ is concave and has a maximum.

Proof. Evaluating $Q(\vec{x},\vec{u})$ for an arbitrary state $\vec{x}$, we get

\begin{align*} Q(\vec{x},\vec{u}) &= V(\vec{D}(\vec{x},\vec{u})) = V(\vec{f}(\vec{x}) + \vec{g}(\vec{x})\vec{u} )=\\ &\quad(\vec{f}(\vec{x}) + \vec{g}(\vec{x})\vec{u} )^{\rm T}\Lambda(\vec{f}(\vec{x}) + \vec{g}(\vec{x})\vec{u} ). \end{align*}

Thus, $Q$ is a quadratic function of the action $\vec{u}$ at any state $\vec{x}$. To show that $Q$ has a maximum, we inspect $Q$'s Hessian,

$$ HQ(\vec{x},\vec{u}) = \begin{bmatrix}\frac{\partial^2Q(\vec{x},\vec{u})}{\partial u_1\partial u_1} & \cdots &\frac{\partial^2Q(\vec{x},\vec{u})}{\partial u_1\partial u_{d_u}}\\ \vdots & \ddots & \vdots\\ \frac{\partial^2Q(\vec{x},\vec{u})}{\partial u_{d_u}\partial u_1} & \cdots &\frac{\partial^2Q(\vec{x},\vec{u})}{\partial u_{d_u}\partial u_{d_u}} \end{bmatrix} = 2\vec{g}(\vec{x})^{\rm T}\Lambda{\vec{g}}(\vec{x}) $$

The Hessian is negative definite because $\vec{g}(\vec{x})$ is regular for all states $\vec{x}$ and $\Theta < 0$, which means that $\Lambda < 0$ as well. Therefore, the function is concave, with a maximum.

The state-value parametrization ${\Theta}$ is fixed for the entire state space. Thus, Proposition 1 guarantees that when the parametrization ${\Theta}$ is negative definite, the action-value function $Q$ has a single maximum. Next, we show that the right policy can ensure the progression to the goal, but we first define the acceptable policies.

Definition 1. A policy approximation $ \vec{\hat{u}} = \vec{\hat{ h}}^{Q}(\vec{x})$ is admissible if it transitions the system to a state with a higher value when one exists, i.e., when the following holds for the policy's value gain at state $\vec{x}$, $\Delta Q(\vec{x},\vec{\hat{u}}) = Q(\vec{x},\vec{\hat{u}}) - V(\vec{x})$:

1) $\Delta Q(\vec{x},\vec{\hat{u}}) > 0,\text{ for } \vec{x} \in X \setminus \{ \vec{0} \} $;

2) $\Delta Q(\vec{x},\vec{\hat{u}}) = 0,\text{ for } \vec{x} = \vec{0}. $

Theorem 1 shows that an admissible policy is sufficient for the system to reach the goal.

Theorem 1. Let $\vec{\hat{u}} = \vec{\hat{{ h}}}^{Q}(\vec{x})$ be an admissible policy approximation. When $\Lambda < 0$ and the drift is bounded with (11), the system (1) with value function (7) progresses to an asymptotically stable equilibrium under the policy $\vec{\hat{{ h}}}^{Q}$.

Proof. Consider $W(\vec{x}) = -V(\vec{x})=\vec{x}^{\rm T} \Gamma \vec{x}.$ $W$ is a Lyapunov candidate function because $\Gamma > 0$.

To show asymptotic stability, $W$ needs to be monotonically decreasing in time, $W(\vec{x}_{n+1}) \leq W(\vec{x}_n)$, with equality holding only when the system is in the equilibrium, $\vec{x}_{n} = \vec{0}$. Directly from the definition of an admissible policy, for a state $\vec{x}_n \ne \vec{0}$, $ W(\vec{x}_{n+1}) - W(\vec{x}_n) = -Q(\vec{x}_n, \vec{\hat{{h}}}^{Q}(\vec{x}_n)) + V(\vec{x}_n) = V(\vec{x}_n) - Q(\vec{x}_n,\vec{\hat{u}}) < {0}. $ When $\vec{x}_n = \vec{0}$, (11) implies $\vec{x}_{n+1} = \vec{f}(\vec{0}) = \vec{0}$, and hence $W(\vec{x}_{n+1}) = 0.$

Theorem 1 gives the problem formulation conditions under which the system transitions to the goal state. Now, we move to finding sample-based admissible policies by finding maximums of $Q$ in a direction parallel to an axis and passing through a point. Because $Q$ has quadratic form, its restriction to a line is a quadratic function of one variable. We use Lagrange interpolation to find the coefficients of $Q$ on a line, and find the maximum in closed form. We first introduce the notation for $Q$'s restriction in an axial direction and its samples along that direction.

Definition 2. The axial restriction of $Q$ passing through a point $\vec{p}$ is the univariate function $Q^{(\vec{p})}_{\vec{x},i}(w) = Q(\vec{x},{\vec{p}} + w\vec{e}_i)$.

If $\vec{q_i} = [Q^{(\vec{p})}_{\vec{x},i}(w_{i1}) \; Q^{(\vec{p})}_{\vec{x},i}(w_{i2}) \; Q^{(\vec{p})}_{\vec{x},i}(w_{i3})]^{\rm T}$ are three samples of $Q^{(\vec{p})}_{\vec{x},i}(w)$ obtained at points $[w_{i1} \; w_{i2} \; w_{i3}]$, then $Q(\vec{x},\vec{p} + w\vec{e}_i)$ is maximized at

\begin{align} \hat{w}_i &= \min ( \max (\hat{w^*}_i,w^l_i),w^u_i) ,\notag\\ \hat{w}^*_i &= \frac{\vec{q_i}^{\rm T} \cdot ([w_{i2}^2 \quad w_{i3} ^2 \quad w_{i1} ^2] - [w_{i3}^2 \quad w_{i1}^2 \quad w_{i2}^2])^{\rm T}}{2 \vec{q_i}^{\rm T} \cdot ([w_{i2} \quad w_{i3} \quad w_{i1}] - [w_{i3} \quad w_{i1} \quad w_{i2}]) ^{\rm T}}, \end{align} (12)

on the interval $w^l_i \leq w \leq w^u_i$. Equation (12) comes directly from Lagrange interpolation of a univariate second-order polynomial to find the coefficients of the quadratic function, and then equating the derivative to zero to find its maximum. In the stochastic case, linear regression yields the coefficients instead of Lagrange interpolation.

A motivation for this approach is that finding a maximum in a single direction is computationally efficient and consistent. A single-component policy is calculated in constant time. In addition, the input selection on an axis calculated with (12) is consistent, i.e., it does not depend on the sample points $w_{ij}$ available to calculate it. This is a direct consequence of a quadratic function being uniquely determined by any three points. It means that a policy based on (12) produces the same result regardless of the input samples used, which is important in practice where samples are often hard to obtain.
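The closed form (12), together with the clipping to the admissible interval, is only a few lines of code. The sketch below is an illustrative implementation that assumes the interpolated restriction is concave (which Proposition 1 guarantees when ${\Theta}$ is negative definite); the degenerate, near-linear case falls back to the best observed sample.

```python
import numpy as np

def axis_max(q, w, w_lo, w_hi):
    """Consistent per-axis maximizer (12): Lagrange-interpolate the quadratic
    restriction of Q from three samples q taken at abscissae w, take the vertex
    in closed form, and clip it to the admissible interval [w_lo, w_hi]."""
    w1, w2, w3 = w
    num = q @ np.array([w2**2 - w3**2, w3**2 - w1**2, w1**2 - w2**2])
    den = 2.0 * (q @ np.array([w2 - w3, w3 - w1, w1 - w2]))
    if np.isclose(den, 0.0):            # (near-)linear restriction: no interior vertex,
        w_star = w[int(np.argmax(q))]   # fall back to the best observed sample
    else:
        w_star = num / den
    return float(np.clip(w_star, w_lo, w_hi))

# Samples of the concave quadratic Q(w) = -(w - 1)^2 at w = -1, 0, 2 recover
# its maximizer w = 1 regardless of where the samples were taken.
print(axis_max(np.array([-4.0, -1.0, -1.0]), np.array([-1.0, 0.0, 2.0]), -3.0, 3.0))  # 1.0
```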

Lemma 1 shows single-component policy characteristics, including that a single-component policy is stable on an interval around zero. Later, we integrate the single-component policies together into admissible policies.

Lemma 1. A single-input policy approximation (12), for an input component $i$, $1 \leq i \leq d_u$, has the following characteristics:

1) There is an input around zero that does not decrease the system's state value upon transition, i.e., $\exists \hat{w}_0 \in [w^i_l, w^i_u]$ such that $Q^{(\vec{p})}_{\vec{x},i}(\hat{w}_0) \geq Q(\vec{x},{\vec{p}})$;

2) $Q^{(\vec{0})}_{\vec{x},i}(\hat{w}_i) - V(\vec{x}) \geq 0$,when $\vec{x} \ne \vec{0}$;

3) $Q(\vec{0},\hat{w}_i \vec{e}_i) - V(\vec{0}) = 0$.

The proof for Lemma 1 is in Appendix A.

We give three consistent and admissible policies as examples. First, the Manhattan policy finds a point that maximizes $Q$'s restriction on the first axis, then iteratively finds maximums in the directions parallel to the subsequent axes, passing through the points that maximized the previous axes. The second policy approximation, Convex Sum, is a convex combination of the maximums found independently on each axis. Unlike the Manhattan policy, which works serially, the Convex Sum policy parallelizes well. Third, Axial Sum is the maximum of the Convex Sum policy approximation and the nonconvex axial combination. This policy is also parallelizable. All three policies scale linearly with the input dimensionality, ${\rm O}(d_u)$. Next, we show that they are admissible.

Theorem 2. The system (2) with value function (7), bounded drift (11), and a negative definite ${\Theta}$, starting at an arbitrary state $\vec{x} \in X$ and with inputs on a set $U$ (10), progresses to the equilibrium in the origin under any of the following policies:

1) Manhattan policy

\begin{align} \vec{h}_m^Q: \left\{ \begin{array}{l} \hat{w}_1 = \mathop {argmax}\limits_{w_l^1 \le w \le w_u^1} Q^{(\vec{0})}_{\vec{x},1}(w), \\ \hat{w}_n = \mathop {argmax}\limits_{w_l^n \le w \le w_u^n} Q^{(\vec{\hat{u}}_{n-1})}_{\vec{x},n}(w), \quad n \in [2,\cdots,d_u], \\ \vec{\hat{u}}_n = \sum\limits_{i=1}^{n} \hat{w}_i {\vec{e}_i}. \end{array} \right. \end{align} (13)

2) Convex Sum

\begin{align} \vec{h}_c^{Q}: \;\vec{\hat{u}} = \sum_{i=1}^{d_u} \lambda_i \vec{e}_i \mathop {argmax}\limits_{w_l^i \le w \le w_u^i} Q^{(\vec{0})}_{\vec{x},i}(w), \;\;\; \sum_{i=1}^{d_u} \lambda_i = 1. \end{align} (14)

3) Axial Sum

\begin{align} \vec{h}_s^Q: \;\vec{\hat{u}} = \left\{\begin{array}{ll} \vec{h}_c^Q(\vec{x}), & Q(\vec{x},\vec{h}_c^Q(\vec{x})) \geq Q(\vec{x},\vec{h}_n^Q(\vec{x})), \\ \vec{h}_n^Q(\vec{x}), & \mbox{otherwise}, \end{array}\right. \end{align} (15)

where

$$ \vec h_n^Q(\vec x) = \sum\limits_{i = 1}^{{d_u}} {{{\vec e}_i}} \mathop {argmax}\limits_{w_l^i \le w \le w_u^i} Q_{\vec x,i}^{(\vec 0)}(w). $$

The proof for Theorem 2 is in Appendix B.

A consideration in reinforcement learning applied to robotics and other physical systems is balancing exploitation and exploration[32]. Exploitation ensures the safety of the system when the policy is sufficiently good, but yields no learning. Exploration forces the agent to perform suboptimal steps, and the most often used $\epsilon$-greedy policy performs a random action with probability $\epsilon$. Although the random action can lead to knowledge discovery and policy improvement, it also poses a risk to the system. The policies presented here fit well in the online RL paradigm because they allow safe exploration. Given that they are not optimal, they produce new knowledge, but because of their admissibility and consistency, their input of choice is safe for the physical system. For systems with independent inputs, the Axial Sum policy is optimal (see Appendix C).
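To make the three policies concrete, the sketch below composes them from the per-axis maximizer axis_max() given earlier. It assumes access to a generative model D(x, u) (or stored transition samples), the current state-value estimate V, and per-axis input bounds; the three probe inputs per axis are arbitrary because, by the consistency of (12), their choice does not affect the result.

```python
import numpy as np

def _axis_argmax(x, p, i, D, V, bounds, probes=(-1.0, 0.0, 1.0)):
    """Maximize the axial restriction Q^(p)_{x,i}(w) = V(D(x, p + w e_i))
    from three probe inputs, using the closed form (12) via axis_max()."""
    e_i = np.zeros(len(bounds)); e_i[i] = 1.0
    w = np.array(probes)
    q = np.array([V(D(x, p + wj * e_i)) for wj in w])
    return axis_max(q, w, bounds[i][0], bounds[i][1])

def manhattan_policy(x, D, V, bounds):
    """Manhattan policy (13): maximize along the axes in turn, each search
    passing through the partial input selected so far."""
    u = np.zeros(len(bounds))
    for i in range(len(bounds)):
        u[i] = _axis_argmax(x, u, i, D, V, bounds)
    return u

def convex_sum_policy(x, D, V, bounds, lam=None):
    """Convex Sum policy (14): convex combination of the maximizers found
    independently on each axis, all restrictions passing through the origin."""
    d_u = len(bounds)
    lam = np.full(d_u, 1.0 / d_u) if lam is None else np.asarray(lam)
    return np.array([lam[i] * _axis_argmax(x, np.zeros(d_u), i, D, V, bounds)
                     for i in range(d_u)])

def axial_sum_policy(x, D, V, bounds):
    """Axial Sum policy (15): the better of the Convex Sum input and the plain
    (nonconvex) sum of the independent per-axis maximizers."""
    d_u = len(bounds)
    u_n = np.array([_axis_argmax(x, np.zeros(d_u), i, D, V, bounds) for i in range(d_u)])
    u_c = convex_sum_policy(x, D, V, bounds)
    return u_c if V(D(x, u_c)) >= V(D(x, u_n)) else u_n
```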

C. Continuous Action Fitted Value Iteration (CAFVI)

We introduced an admissible, consistent, and efficient decision making method for learning the action-value function $Q$ locally, at a fixed state $\vec{x}$ and a fixed learning iteration (when ${\Theta}$ is fixed), without knowing the system dynamics. Now, the decision making policies are integrated into an FVI framework[3, 5] to produce a reinforcement learning agent for continuous state and action MDPs tailored to control-affine nonlinear systems. The algorithm learns the parametrization $\vec{\theta}$ of the state-value function approximation, working much like approximate value iteration[5], but the action selection uses a sampling-based policy approximation on the action-value function $Q$. Algorithm 1 shows an outline of the proposed Continuous Action Fitted Value Iteration (CAFVI). It first initializes $\vec{\theta}$ with a zero vector. Then, it iteratively estimates $Q$ function values and uses them to make a new estimate of $\vec{\theta}$. First, we randomly select a state $\vec{x}_{l_s}$, sampled uniformly from the state space, and observe its reward. Line 6 collects the dynamics samples. Because we need three data points for Lagrangian interpolation of a quadratic function, three input samples per input dimension are selected. We also obtain, either through a simulator or an observation, the resulting state $\vec{x}'_{ij}$ when $\vec{u}_{ij}$ is applied to $\vec{x}_{l_s}$. Line 7 estimates the action-value function locally, for $\vec{x}_{l_s}$ and $\vec{u}_{ij}$, using the current ${\vec{\theta}_l}$ value. Next, the recommended action $\vec{\hat{u}}$ is calculated. Looking up the available samples or using a simulator, the system makes the transition from $\vec{x}_{l_s}$ using action $\vec{\hat{u}}$. The algorithm then makes a new estimate of $V(\vec{x}_{l_s})$. After $n_s$ states are processed, Line 12 finds a new $\vec{\theta}$ that minimizes the least squares error for the new state-value function estimates $v_{l_s}$. The process repeats until either $\vec{\theta}$ converges or a maximum number of iterations is reached.

Algorithm 1. Continuous Action Fitted Value Iteration (CAFVI)

Input. $X$, $U$, discount factor $\gamma$, basis function vector $\vec{F}$

Output. ${\vec{\theta}}$

1) $\vec{\theta}_0,\vec{\theta}_1 \leftarrow$ zero vector

2) $l \leftarrow 1$

3) while ($l \le max\_iterations$) and $\|\vec{\theta}_l - \vec{\theta}_{l-1}\| \ge \epsilon$ do

4)      for {$l_s=1,\cdots,n_s$} do

5)          sample state $\vec{x}_{l_s}$ and observe its reward $\rho_{l_s}$

6)         $\{\vec{x}_{l_s},\vec{u}_{ij},\vec{x}'_{ij} \,|\, i=1,\cdots,d_u,\ j=1,2,3\}$ {obtain system dynamics samples}

7)          for all $i$, $j$ do $q_{ij} \leftarrow \vec{\theta}^{\rm T}_l \vec{F}(\vec{x}'_{ij})$ end for {estimate action-value function}

8)         $\vec{\hat{u}} \leftarrow $ calculated with (12)

9)         obtain $\{\vec{x}_{l_s},\vec{\hat{u}},\vec{x'}_{l_s},\rho_{l_s} \}$

10)          $v_{l_s} = \rho_{l_s} + \gamma {\vec{\theta}}^{\rm T}_l{\vec{F}}(\vec{x}'_{ls})$ {state-value function new estimate}

11)      end for

12)     ${{\vec \theta }_{l + 1}} \leftarrow \mathop {argmin}\limits_{\vec \theta } \sum\nolimits_{{l_s} = 1}^{{n_s}} {{{({v_{{l_s}}} - {{\vec \theta }^{\rm{T}}}\vec F({{\vec x}_{ls}}))}^2}} $

13)      $l \leftarrow l+1$

14) end while

15) return ${\vec{\theta}_{l}}$

The novelties of Algorithm 1 are the continuous input spaces and the joint work with both state-value and action-value functions (Lines 6-8), while FVI works with discrete, finite input sets and with one of the two functions[3], but not both. Although the outcome of the action-value function learning (Line 8) is independent of the input samples, the state-value function learning (Line 12) depends on the state samples collected in Line 5, just like in discrete action FVI[5].
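A compact sketch of Algorithm 1 is given below, built on the Convex Sum policy from the previous sketch (any admissible policy can be substituted). The state sampler, reward, generative model, and basis vector are user-supplied placeholders; the least-squares step corresponds to Line 12.

```python
import numpy as np

def cafvi(sample_state, reward, D, basis, bounds,
          gamma=0.9, n_states=100, max_iterations=300, tol=1e-6):
    """Sketch of Algorithm 1 (CAFVI). sample_state() draws a state uniformly,
    reward(x) is the observed state reward, D(x, u) is a generative model of the
    unknown dynamics, basis(x) is the feature vector F(x), and bounds are the
    per-axis input limits. Returns the learned parametrization theta."""
    theta = np.zeros(len(basis(sample_state())))
    for _ in range(max_iterations):
        V = lambda x, th=theta: th @ basis(x)              # value estimate for this sweep
        Phi, v = [], []
        for _ in range(n_states):
            x = sample_state()                             # Line 5
            u = convex_sum_policy(x, D, V, bounds)         # Lines 6-8: admissible action
            x_next = D(x, u)                               # Line 9: observed transition
            v.append(reward(x) + gamma * theta @ basis(x_next))   # Line 10
            Phi.append(basis(x))
        theta_new, *_ = np.linalg.lstsq(np.array(Phi), np.array(v), rcond=None)  # Line 12
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```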

D. Discussion

Considering a Constraint-Balancing Task, we proposed quadratic feature vectors and determined sufficient conditions under which the admissible policies presented in Section Ⅲ-B transition the system to the goal state while obeying the task requirements. Finally, we presented a learning algorithm that learns the parametrization. Several points need to be discussed: convergence of the CAFVI algorithm, the use of quadratic basis functions, and verification of the conditions from Section Ⅲ-A. The full conditions under which FVI with discrete actions converges are still an active research topic[3]. It is known that it converges when the system dynamics is a contraction[3]. A detailed analysis of the error bounds for FVI algorithms with finite[33] and continuous[24] actions finds that the FVI error bounds scale with the difference between the basis functional space and the inherent dynamics of the MDP. The system's dynamics and reward functions determine the MDP's dynamics. We choose quadratic basis functions because of the nature of the problem we need to solve and for stability. However, the basis functions must fit reasonably well to the true objective function (3), determined by the system dynamics and the reward; otherwise CAFVI diverges.

The goal of this article is to present an efficient toolset for solving Constraint-Balancing Tasks on a control-affine system with unknown dynamics. Using quadratic basis functions, Algorithm 1 learns the parametrization $\vec{\theta}$. Successful learning that converges to a $\vec{\theta}$ with all negative components produces a controller, based on the Section Ⅲ-B policies, that is safe for a physical system and completes the task.

In Section Ⅲ-A, we introduced sufficient conditions for successful learning. The conditions are sufficient but not necessary, so the learning could succeed under laxer conditions. Done in simulation prior to controlling a physical system, the learning can be applied even when we are uncertain whether the system satisfies the criteria. When the learning fails to converge to all negative parameters, the controller is not viable. Thus, a viable controller is possible under laxer conditions verifiable through learning; so the toolset can be safely and easily attempted first, before more computationally intensive methods are applied. It can also be used to quickly develop an initial value function, to be refined later with another method.

Ⅳ. RESULTS

This section evaluates the proposed methodology. We first verify the policy approximations' quality and computational efficiency on known functions in Section Ⅳ-A, and then we showcase the method's learning capabilities in two case studies: a quadrotor with a suspended payload (Section Ⅳ-B) and a multi-agent rendezvous task (Section Ⅳ-C).

In all evaluations, the Convex Sum policy was calculated using equal convex coefficients $\lambda_i = d_u^{-1}$. Discrete and HOOT[27] policies are used for comparison. The discrete policy uses an equidistant grid with 13 values per dimension. HOOT uses three hierarchical levels, each covering one tenth of the input size per dimension and maintaining the same number of inputs at each level. All computation was performed using Matlab on a single-core Intel Core i7 system with 8 GB of RAM, running the Linux operating system.
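For reference, the discrete baseline's action set can be constructed as sketched below; it follows the equidistant-grid description above (13 values per axis) and makes explicit the exponential growth in the number of candidate inputs that the proposed policies avoid.

```python
import numpy as np
from itertools import product

def discrete_action_grid(bounds, n=13):
    """Equidistant action grid used by the discrete greedy baseline:
    n values per input axis, all combinations, i.e. n**d_u candidate inputs."""
    axes = [np.linspace(lo, hi, n) for lo, hi in bounds]
    return np.array(list(product(*axes)))

# For a 3-D input with -3 <= u_i <= 3 (the cargo delivery limits), this
# yields 13**3 = 2197 candidate inputs per decision.
grid = discrete_action_grid([(-3.0, 3.0)] * 3)
```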

A. Policy Approximation Evaluation

In Section Ⅲ-B, we proposed three policy approximations and showed their admissibility. To empirically verify the findings, we examine their behavior on known quadratic functions of two variables, elliptical paraboloids with a maximum. Table Ⅱ depicts the maximum and minimum values of $\Delta Q(\vec{x},\vec{h}^{Q}(\vec{x}))$ as $Q$ ranges over the class of concave elliptical paraboloids. Since $\Delta Q$ is always positive for all three policies, the empirical results confirm our finding from Theorem 2 that the policies are admissible. We also see from $\min \Delta w$ that in some cases Manhattan and Axial Sum make optimal choices, which is expected as well. The maximum distance from the optimal input column ($\max \Delta w$) shows that the distance from the optimal input is bounded.

TABLE Ⅱ
SUMMARY OF POLICY APPROXIMATION PERFORMANCE. MINIMUM AND MAXIMUM OF THE VALUE GAIN ($\Delta Q$) AND THE DISTANCE FROM THE OPTIMAL INPUT ($\Delta w$)

To further evaluate the policies' quality, we measure the gain ratio between the policy's gain and the maximum gain on the action-value function ($\vec{u}^*$ is the optimal input):

$$ g_{\vec{h}^{Q}}(\vec{x}) = \frac{Q(\vec{x},\vec{h}^{Q}(\vec{x})) - Q(\vec{x},\vec{0})}{Q(\vec{x},{\vec{u}^*}) - Q(\vec{x},\vec{0})}. $$

Non-admissible policies have a negative or zero gain ratio for some states, while the gain ratio for admissible policies is strictly positive. A gain ratio of one signifies that the policy $\vec{h}^{Q}$ is optimal, while a gain ratio of zero means that the selected input transitions the system to an equivalent state from the value function perspective. The isoclines of the elliptic paraboloids $Q(\vec{x},[w_1 \; w_2]^{\rm T}) = a w_1^2 + b w_1 w_2 + c w_2^2 + d w_1 + e w_2 + f$ are ellipses, and the approximation error depends on the rotational angle of the ellipse's axes and on its eccentricity. Thus, a policy's quality is assessed as a function of these two parameters: the rotational angle $\alpha$ and the range of the parameter $c$, while the parameters $a$, $d$, $e$, and $f$ are fixed. Parameter $b$ is calculated such that $b=(a-c)\tan 2\alpha$. The eccentricity is depicted in Fig. 3 (a), with zero eccentricity representing a circle and an eccentricity of one representing the ellipse degenerating into a parabola. The empty areas in the heat maps are areas where the function is either a hyperbolic paraboloid or a plane, rather than an elliptic paraboloid, and has no maximum. Fig. 3 displays the heat maps of the gain ratios for the Manhattan (Fig. 3 (b)), Axial Sum (Fig. 3 (c)), and Convex Sum (Fig. 3 (d)) policies. All policies have a strictly positive gain ratio, which gives additional empirical evidence to support the finding in Theorem 2. Manhattan and Axial Sum perform similarly, with the best results for near-circular paraboloids, degrading as the eccentricity increases. In contrast, the Convex Sum policy performs best for highly elongated elliptical paraboloids.
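The following sketch reproduces the structure of this test family; the default values of $a$, $d$, $e$, and $f$ are placeholders rather than the values used in the evaluation.

```python
import numpy as np

def elliptic_paraboloid(alpha, c, a=-1.0, d=0.0, e=0.0, f=0.0):
    """Test function Q(w1, w2) = a w1^2 + b w1 w2 + c w2^2 + d w1 + e w2 + f
    with b = (a - c) tan(2 alpha); alpha rotates the semi-axes and c controls
    the eccentricity. Defaults are illustrative placeholders."""
    b = (a - c) * np.tan(2.0 * alpha)
    return lambda w: a * w[0]**2 + b * w[0] * w[1] + c * w[1]**2 + d * w[0] + e * w[1] + f
```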

Fig. 3. Eccentricity of the quadratic functions (a) related to the policy approximation gain ratio (b)-(d) as a function of the quadratic coefficient $c$ and the rotation of the semi-axes.

Lastly, we consider the computational efficiency of the three policies and compare the running time of a single decision with that of the discrete and HOOT[27] policies. Fig. 4 depicts the computational time for each of the policies as a function of the input dimensionality.

Fig. 4. Policy approximation computational time per input dimensionality. Comparison of discrete, HOOT, Manhattan, Axial Sum, and Convex Sum policies. The $y$-axis is logarithmic.

The computational time of both the discrete and HOOT policies grows exponentially with the dimensionality, while the three policies based on the axial maximums (Manhattan, Axial Sum, and Convex Sum) are linear in the input dimensionality, although Manhattan is slightly slower.

B. Cargo Delivery Task

This section applies the proposed methods to the aerial cargo delivery task[29]. This task is defined for a UAV carrying a suspended load, and seeks an acceleration of the UAV's body that transports the joint UAV-load system to a goal state with minimal residual oscillations. We show that the system and its MDP satisfy the conditions of Theorem 1, and assess the methods by examining the learning quality, the resulting trajectory characteristics, and the implementation on a physical system. We compare our approach to discrete AVI[29] and HOOT[27], and show that the methods presented here solve the task with more precision.

To apply the motion planner to the cargo delivery task for a holonomic UAV carrying a suspended load, we use the following definition of a swing-free trajectory.

Definition 3. A trajectory of duration $t_0$ is said to be with minimal residual oscillations if for a given constant $\epsilon > 0$ there is a time $0 \le t_1 \le t_0$ such that for all $t \ge t_1$ the load displacement is bounded by $\epsilon$ ($\rho(t) < \epsilon$).

The MDP state space is the position of the center of mass of the UAV, $\vec{p}=[x$ $y$ $z]^{\rm T}$, its linear velocities $\vec{v}=[\dot{x}$ $\dot{y}$ $\dot{z}]^{\rm T}$, the angular position $\vec{\eta} = [\psi$ $\phi]^{\rm T}$ of the suspended load in polar coordinates originating at the quadrotor's center of mass, with the zenith belonging to the axis perpendicular to Earth, and its angular velocities $\dot{\vec{\eta}} = [\dot{\psi}$ $\dot{\phi}]^{\rm T}$. The actuator is the acceleration of the quadrotor's body, $\vec{u}=[u_x$ $u_y$ $u_z]^{\rm T}$. For the system's generative model, we use a simplified version of the quadrotor-load model described in [29], which satisfies (1). $\vec{G}$ is the gravitational force vector.

\begin{align} \label{eq:dyn1} \begin{split} \vec{v} &= {\vec{v}_0} + \bigtriangleup t \vec{u},\quad{\vec{p}} = {\vec{p}_0} + \bigtriangleup t {\vec{v}_0} + \frac{\bigtriangleup t^2}{2}\vec{u}\\ {\dot{\vec{\eta}}} &= {\dot{\vec{\eta}}_0} + \bigtriangleup t {\ddot{\vec{\eta}}},\quad \vec{\eta} = \vec{\eta}_0 + \bigtriangleup t {\dot{\vec{\eta}}_0} + \frac{\bigtriangleup t^2 }{2}{\ddot{\vec{\eta}}} \\ {\ddot{\vec{\eta}}} &= \begin{bmatrix}\sin \theta_0 \sin \phi_0 & - \cos\phi_0 & L^{-1}\cos \theta_0 \sin \phi_0 \\ -\cos \theta_0 \cos \phi_0 & 0 & L^{-1}\cos \phi_0 \sin \theta_0 \end{bmatrix} (\vec{u}-{\vec{G}}) \end{split} \end{align} (16)

The system (16) satisfies (1). The reward function penalizes the distance from the goal state, the load displacement, and the negative $z$ coordinate. Lastly, the agent is rewarded when it reaches the equilibrium.
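A sketch of the generative model (16) is shown below. The load-angle names follow (16); the gravity sign convention, the 50 Hz time step, and the 62 cm cable length (the latter two taken from the experimental setup later in this section) are assumptions of this illustration.

```python
import numpy as np

def quad_load_step(p, v, eta, eta_dot, u, dt=0.02, L=0.62, grav=9.81):
    """One step of the simplified quadrotor-load generative model (16).
    p, v: quadrotor position and velocity; eta = (theta, phi): load angles
    as named in (16); u: commanded body acceleration."""
    theta, phi = eta
    G = np.array([0.0, 0.0, -grav])              # gravity vector (sign convention assumed)
    M = np.array([[np.sin(theta) * np.sin(phi), -np.cos(phi), np.cos(theta) * np.sin(phi) / L],
                  [-np.cos(theta) * np.cos(phi), 0.0,          np.cos(phi) * np.sin(theta) / L]])
    eta_ddot = M @ (u - G)
    p_new = p + dt * v + 0.5 * dt**2 * u
    v_new = v + dt * u
    eta_new = eta + dt * eta_dot + 0.5 * dt**2 * eta_ddot
    eta_dot_new = eta_dot + dt * eta_ddot
    return p_new, v_new, eta_new, eta_dot_new
```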

The value function is approximated as a linear combination of quadratic forms of state subspaces, $V(\vec{x}) = {\vec{\theta}}^{\rm T} \vec{F}(\vec{x})$, with $\vec{F}(\vec{x}) = [\| \vec{p}\|^2 \; \|\vec{v}\|^2 \; \| \vec{\eta} \|^2 \; \| \dot{\vec{\eta}} \|^2]^{\rm T}$ and $\vec{\theta} \in {\bf R}^{4}$. This approximation satisfies the form (7), and because the learning produces a $\vec{\theta}$ with all negative components, all conditions for Theorem 1 are satisfied, including the drift bound (11).

The time-to-learn is presented in Fig. 5 (a). The axial maximum policies perform an order of magnitude faster than the discrete and HOOT policies. To assess learning with Algorithm 1 using the Manhattan, Axial Sum, and Convex Sum policies, we compare to learning using the greedy discrete policy and HOOT. Fig. 5 (b) shows the learning curve over the number of iterations. After 300 iterations all policies converge to a stable value; all converge to the same value except discrete learning, which converges to a lower value.

Fig. 5. Learning results for Manhattan, Axial Sum, and Convex Sum, compared to discrete greedy and HOOT policies, averaged over 100 trials. Learning curves for Manhattan and Axial Sum are similar to Convex Sum and are omitted from (b) for better visibility.

Finally, inspection of the learned parametrization vectors confirms that all the components are negative, meeting all the criteria needed for Theorem 1. This means that the equilibrium is asymptotically stable for admissible policies, and we can generate trajectories of arbitrary length.

Next, we plan trajectories using the learned parametrizations over the 100 trials for the three proposed policies and compare them to the discrete and HOOT policies. We consider the cargo delivery task complete when $\| \vec{p} \| \leq 0.010$ m, $\| \vec{v} \| \leq 0.025 \text{ m/s}$, $\| \vec{\eta} \| \leq 1^{\circ}$, and $\| \dot{\vec{\eta}} \| \leq 5 ^{\circ}/$s. This is a stricter terminal set than the one previously used in [29]. The input limits are $-3 \leq u_i \leq 3$, for $i\in \{1,2,3\}$. The discrete and HOOT policies use the same setup described in Section Ⅳ-A. The planning occurs at 50 Hz. We compare the performance and trajectory characteristics of trajectories originating 3 meters from the goal state. Table Ⅲ presents the results of the comparison. Manhattan, Axial Sum, and HOOT produce very similar trajectories, while Convex Sum generates slightly longer trajectories, but with the best load displacement characteristics. This is because Convex Sum takes a different approach and selects smaller inputs, resulting in smoother trajectories. The Convex Sum method plans the 9-second trajectory in 0.14 s, over 5 times faster than the discrete planning and over 3 times faster than HOOT. Finally, 30% of the discrete trajectories are never able to complete the task. This is because the terminal set is too small for the discretization; in other words, the discretized policy is not admissible. Examining the simulated trajectories in Fig. 6 reveals that Convex Sum indeed selects a smaller input, resulting in a smoother trajectory (Fig. 6 (a)) and less swing (Fig. 6 (b)). HOOT, Manhattan, and Axial Sum produce virtually identical trajectories, while the discrete trajectory has considerable jerk, absent from the other trajectories.

TABLE Ⅲ
SUMMARY OF TRAJECTORY CHARACTERISTICS OVER 100 TRIALS. MEANS ($\mu$) AND STANDARD DEVIATIONS ($\sigma$) OF TIME TO REACH THE GOAL, FINAL DISTANCE TO GOAL, FINAL SWING, MAXIMUM SWING, AND TIME TO COMPUTE THE TRAJECTORY. BEST RESULTS ARE HIGHLIGHTED.

Fig. 6. Comparison of simulated cargo delivery task trajectories created with Convex Sum to trajectories created with discrete greedy and HOOT policies.

Lastly, we experimentally compare the learned policies. The experiments were performed on AscTec Hummingbird quadrotors carrying a 62-centimeter suspended load weighing 45 grams. The quadrotor and load positions were tracked via a Vicon motion capture system at 100 Hz. Experimentally, HOOT and Axial Sum resulted in similar trajectories, while Manhattan's trajectory exhibited the most deviation from the planned trajectory (Fig. 7). The Convex Sum trajectory is the smoothest. Table Ⅳ quantifies the maximum load swing and the power required to produce the load's motion from the experimental data. The Convex Sum policy generates experimental trajectories with the best load swing performance, and with load motion that requires close to three times less energy to generate.

Fig. 7. Comparison of experimental cargo delivery task trajectories created with Convex Sum versus trajectories created with discrete greedy and HOOT policies. Trajectories for Manhattan and Axial Sum are similar to Convex Sum and are omitted for better visibility.

TABLE Ⅳ
SUMMARY OF EXPERIMENTAL TRAJECTORY CHARACTERISTICS. MAXIMUM SWING AND ENERGY NEEDED TO PRODUCE LOAD OSCILLATIONS. BEST RESULTS ARE HIGHLIGHTED.
C. Rendezvous Task

The rendezvous cargo delivery task is a multi-agent variant of the time-sensitive cargo delivery task. It requires a UAV carrying a suspended load to rendezvous in a swing-free fashion with a ground-bound robot to hand over the cargo. The cargo might be a patient airlifted to a hospital and then taken by a moving ground robot for delivery to an operating room for surgery. The rendezvous location and time are not known a priori, and the two heterogeneous agents must plan jointly to coordinate their speeds and positions. The two robots have no knowledge of the dynamics or of each other's constraints. The task requires minimizing the distance between the load's and the ground robot's locations, the load swing, and the agents' velocities, while completing the task as fast as possible.

The quadrotor with the suspended load is modeled as in Section Ⅳ-B, while a rigid body constrained to two DOF in a plane models the ground-based robot. The joint state space is a 16-dimensional vector: the quadrotor's 10-dimensional state space (Section Ⅳ-B) and the ground robot's position-velocity space. The input is a 5-dimensional acceleration applied to the quadrotor's and the ground robot's centers of mass. The ground robot's maximum acceleration is lower than the quadrotor's.

Applying Algorithm 1 with the Convex Sum policy, the system learns a state-value function parametrization ${\Theta}$ that is negative definite. Fig. 8 shows both robots after two seconds. The comparison of simulated trajectories created with the Convex Sum and HOOT policies is depicted in Fig. 9. Convex Sum finds an 8.54-second trajectory that solves the task in 0.12 s. The HOOT policy fails to find a suitable trajectory before reaching the maximum trajectory duration, destabilizes the system, and terminates after 101.44 s. The discrete policy yields results similar to HOOT. This is because the input needed to solve the task is smaller than the resolution of HOOT's setup, and the system begins to oscillate. The rendezvous point produced with the Convex Sum policy is between the robots' initial positions, closer to the slower robot, as expected (Fig. 9 (a)). The quadrotor's load swing is minimal (Fig. 9 (b)). The absolute accumulated reward collected while performing the task grows smoothly and steadily, while the accumulated reward along the HOOT trajectory remains significantly lower (Fig. 9 (c)). The rendezvous simulation shows that the proposed methods are able to solve tasks that the previous methods cannot, because the Convex Sum policy is admissible.

Fig. 8. Cargo-bearing UAV and a ground-based robot rendezvous after 2 s.

Fig. 9. Comparison of simulated rendezvous task trajectories created with Convex Sum to trajectories created with discrete greedy and HOOT policies. Black solid: Convex Sum ground; Gray solid: Convex Sum aerial; Black dashed: HOOT ground; Gray dashed: HOOT aerial.
Ⅴ. CONCLUSIONS

Control of high-dimensional systems with continuous actions is a rapidly developing topic of research. In this paper, we proposed a method for learning control of nonlinear motion systems through the combined learning of state-value and action-value functions. Negative definite quadratic state-value functions imply quadratic, concave action-value functions. That allowed us to approximate the policy as a combination of the action-value function's maximums on the axes, which we found through interpolation between observed samples. These policies are admissible, consistent, and efficient. Lastly, we showed that a quadratic, negative definite state-value function, in conjunction with an admissible policy, is a sufficient condition for the system to progress to the goal while minimizing the given constraints.

The verification on known functions confirmed the policies' admissibility. A quadrotor carrying a suspended load assessed the method's applicability to a physical system and a practical problem, and provided a comparison to two other methods, demonstrating the higher precision of the proposed method as well. The rendezvous task tested the method in a higher-dimensional input space for a multi-agent system and showed that it finds a solution where the other two methods do not. The results confirm that the proposed method outperforms the current state of the art in speed by an order of magnitude, while the experimental data revealed that it produces trajectories with better characteristics.

In all, we presented a solid first step toward an optimal control framework for unknown control-affine systems performing Constraint-Balancing Tasks. Despite the method's restrictive conditions, the results demonstrated high accuracy and fast learning times on practical applications. In future work, the methodology can be extended to stochastic MDPs.

APPENDIX A PROOF FOR LEMMA 1

Proof. First, to show that $\exists \hat{w}_0 \in [w^i_l,w^i_u]$ such that $Q^{(\vec{p})}_{\vec{x},i}(\hat{w}_0) \geq Q(\vec{x},{\vec{p}})$, we pick $\hat{w}_0 = 0$, and directly from the definition we get $Q^{(\vec{p})}_{\vec{x},i}(0) = Q(\vec{x},{\vec{p}})$. As a consequence,

\begin{align} \label{eq:gezero} Q^{(\vec{p})}_{\vec{x},i}(0) \leq Q^{(\vec{p})}_{\vec{x},i}{(\hat{w}_i)}. \end{align} (A1)

Second, to show that $Q^{(\vec{0})}_{\vec{x},i}(\hat{w}_i) - V(\vec{x}) \geq 0$,

\begin{align*} Q^{(\vec{0})}_{\vec{x},i}(\hat{w}_i) &\geq Q^{(\vec{0})}_{\vec{x},i}(0) && \text{from (A1)}\\ & =\vec{f}(\vec{x})^{\rm T}\Lambda \vec{f}(\vec{x}) \geq \vec{x}^{\rm T}\Lambda \vec{x} && \text{due to (11)} \\ & = V(\vec{x}). \end{align*}

Third, we show $Q(\vec{0},\hat{w}_i \vec{e}_i) - V(\vec{0}) = 0$. Since the origin is the equilibrium, the dynamics satisfies $\vec{D}(\vec{0},\hat{w}_i {\vec{e}_i} ) = \vec{0}$. Let us evaluate the dynamics at $\hat{w}_i {\vec{e}_i}$ when $\vec{x} = \vec{0}$,

\begin{align*} \vec{D}(\vec{0},\hat{w}_i {\vec{e}_i} ) &= \vec{f}(\vec{0}) + \vec{g}(\vec{0}) \hat{w}_i {\vec{e}_i} &\\ & = \vec{f}(\vec{0}) & \text{ because of (9)} \\ & = \vec{0} & \text{ because of (11)} \\ \end{align*}

Thus, $Q(\vec{0},\hat{w}_i \vec{e}_i) - V(\vec{0}) = 0$.

APPENDIX B PROOF FOR THEOREM 2

Proof. In all three cases, it is sufficient to show that the policy approximations are admissible.

Manhattan policy: To show that the policy approximation (13) is admissible for $\vec{x} \ne \vec{0}$, we use induction on $n$, $1 \leq n \leq d_u$, with the induction hypothesis

\begin{align} &\Delta Q(\vec{x},\vec{\hat{u}}_{n}) \geq 0,\text{ where } \vec{\hat{u}}_{n} = \sum_{i=1}^{n}\hat{w}_i{\vec{e}_i},\text{ and} \notag\\ &\quad \Delta Q(\vec{x},\vec{\hat{u}}_{n}) = 0 \Leftrightarrow \notag\\ &\quad \vec{f}(\vec{x})^{\rm T} \Lambda {\vec{g}}_i(\vec{x}) = \vec{0},\forall i \leq n, \vec{f}(\vec{x})^{\rm T} \Lambda \vec{f}(\vec{x}) = \vec{x}^{\rm T} \Lambda \vec{x}. \end{align} (A2)

First note that at iteration $1 < n \leq d_u$,

\begin{align*} &\vec{D}(\vec{x} ,\vec{\hat{u}}_{n-1} + w {\vec{e}_n} ) = \vec{f}(\vec{x}) + \vec{g}(\vec{x}) \left(\vec{\hat{u}}_{n-1} + w{\vec{e}_n} \right) \\ &=\quad \vec{f}(\vec{x}) + \vec{g}(\vec{x}) \vec{\hat{u}}_{n-1} + \vec{g}(\vec{x}) w{\vec{e}_n} = \vec{f_n}(\vec{x} ) + {\vec{g}_{n}}(\vec{x})w, \end{align*}

denoting $\vec{f_n}(\vec{x} ) = \vec{f}(\vec{x}) + \vec{g}(\vec{x}) \vec{\hat{u}}_{n-1},$ and

\begin{align*} &Q(\vec{x} ,\vec{u}_{n}) = ({\vec{f}_n}(\vec{x}) + {\vec{g}_{n}}(\vec{x})w)^{\rm T}\Lambda ({\vec{f}_n}(\vec{x} ) + {\vec{g}_{n}}(\vec{x})w) \\ &\quad ={\vec{g}_{n}}(\vec{x})^{\rm T} \Lambda {\vec{g}_{n}}(\vec{x}) w^2 + 2 {\vec{f}_n}(\vec{x} )^{\rm T} \Lambda {\vec{g}_{n}}(\vec{x}) w \\ &\quad + \vec{f_n}(\vec{x} )^{\rm T} \Lambda {\vec{f}_n}(\vec{x} )=p_n w^2+ q_n w + r_n ,\; \; p_n,q_n ,r_n \in {\bf R}. \end{align*} (A3)

Because $\Lambda < 0$, $Q(\vec{x} ,\vec{u}_{n})$ is a quadratic function of one variable with a maximum at

\begin{align} \hat{w}^*_n = -\frac{\vec{g_n}(\vec{x})^{\rm T}\Lambda \vec{f_n}(\vec{x})}{\vec{g_n}(\vec{x})^{\rm T}\Lambda{\vec{g_n}}(\vec{x})}. \end{align} (A4)
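For concreteness, the closed-form maximizer (A4) can be compared against a brute-force search over one axis. The snippet below does this on the same kind of illustrative toy system used above; all numeric values are assumptions for the example, not quantities from the paper.

```python
import numpy as np

# Check (A4): for Lambda < 0, Q(x, u_{n-1} + w*e_n) is a concave 1-D quadratic in w,
# so its unconstrained maximizer has the closed form below. Toy values only.
Lam = -np.eye(2)
f = lambda x: 0.85 * x
G = np.array([[1.0, 0.3],
              [0.2, 1.0]])
Q = lambda x, u: (f(x) + G @ u) @ Lam @ (f(x) + G @ u)

x, u_prev, n = np.array([1.3, -0.7]), np.array([0.4, 0.0]), 1
f_n = f(x) + G @ u_prev                                   # drift with earlier axes fixed
g_n = G[:, n]
w_closed = -(g_n @ Lam @ f_n) / (g_n @ Lam @ g_n)         # eq. (A4)

ws = np.linspace(-5.0, 5.0, 20001)
w_grid = ws[np.argmax([Q(x, u_prev + w * np.eye(2)[n]) for w in ws])]
print(w_closed, w_grid)                                   # agree up to the grid resolution
```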

For the base case $n=1$, using Lemma 1,

\begin{align*} \Delta Q(\vec{x},\vec{\hat{u}}_1) &= Q(\vec{x} ,\hat{w}_1{\vec{e}_1}) - V(\vec{x}) \\ &\geq Q(\vec{x} ,\vec{0}) - V(\vec{x}) =\vec{f}(\vec{x})^{\rm T} \Lambda \vec{f}(\vec{x}) - \vec{x}^{\rm T} \Lambda \vec{x}\\ & > 0,\text{ when } \vec{f}(\vec{x})^{\rm T} \Lambda \vec{f}(\vec{x}) > \vec{x}^{\rm T} \Lambda \vec{x}. \end{align*} (A5)

Given that $\vec{\hat{u}}_{1} \ne \vec{0} \Leftrightarrow \Delta Q(\vec{x},\vec{\hat{u}}_{1}) > \Delta Q(\vec{x},\vec{0})$, and assuming $\vec{f}(\vec{x})^{\rm T} \Lambda \vec{f}(\vec{x}) = \vec{x}^{\rm T} \Lambda \vec{x}$, we examine the case $\vec{\hat{u}}_{1} = \vec{0}$. From (A4),

\begin{align} \label{eq:n12} \hat{w}_1 = -\frac{\vec{g_1}(\vec{x})^{\rm T}\Lambda \vec{f}(\vec{x})}{\vec{g_1}(\vec{x})^{\rm T}\Lambda{\vec{g_1}}(\vec{x})} = 0 \Leftrightarrow \vec{g_1}(\vec{x})^{\rm T}\Lambda \vec{f}(\vec{x}) = 0. \end{align} (A6)

So, the induction hypothesis (A2) holds for $n=1$. Assuming that (A2) holds for $1,\cdots,n-1$, and using Lemma 1,

\begin{align*} \Delta Q(\vec{x},\vec{\hat{u}}_{n}) &= Q(\vec{x} ,\vec{\hat{u}}_{n-1} + \hat{w}_{n} {\vec{e}_n}) - V(\vec{x}) \\ & \geq Q(\vec{x} ,\vec{\hat{u}}_{n-1} + \vec{0}) - V(\vec{x}) \\ & =\Delta Q(\vec{x},\vec{\hat{u}}_{n-1}) \quad\quad\quad \text{ from ind. hyp. (A2)}\\ & >0 \quad\quad\quad\quad\quad\quad \text{ when } \vec{f}(\vec{x})^{\rm T} \Lambda \vec{f}(\vec{x}) > \vec{x}^{\rm T} \Lambda \vec{x}. \end{align*}

Similarly, assuming $\vec{f}(\vec{x})^{\rm T} \Lambda \vec{f}(\vec{x}) = \vec{x}^{\rm T} \Lambda \vec{x}$,

\begin{align*} &\Delta Q(\vec{x},\vec{\hat{u}}_{n}) = 0 \Leftrightarrow \\ &\hat{w}_{n} =-\frac{\vec{g_n}(\vec{x})^{\rm T}\Lambda \vec{f_n}(\vec{x})}{\vec{g_n}(\vec{x})^{\rm T}\Lambda{\vec{g_n}}(\vec{x})} = 0, \text{ and } \Delta Q(\vec{x},\vec{\hat{u}}_{n-1}) = 0. \end{align*}

Since $\Delta Q(\vec{x},\vec{\hat{u}}_{n-1}) = 0 \Leftrightarrow \vec{\hat{u}}_{n-1} = \vec{0}$, it follows that $\vec{f}_n(\vec{x}) = \vec{f}(\vec{x}) + \vec{g}(\vec{x})\vec{\hat{u}}_{n-1} = \vec{f}(\vec{x})$, and

\begin{align*} &\Delta Q(\vec{x},\vec{\hat{u}}_{n}) = 0 \Leftrightarrow \\ &\vec{g_n}(\vec{x})^{\rm T}\Lambda \vec{f}(\vec{x}) = 0, \text{ and } \Delta Q(\vec{x},\vec{\hat{u}}_{n-1})=0 \Leftrightarrow \\ &\vec{g_i}(\vec{x})^{\rm T}\Lambda \vec{f}(\vec{x}) = 0,\text{ for } 1\leq i \leq n. \end{align*}

For $n=d_u$, the policy gain $\Delta Q(\vec{x},\vec{\hat{u}}_{d_u}) = 0 \Leftrightarrow \vec{f}(\vec{x})^{\rm T} \Lambda \vec{f}(\vec{x}) = \vec{x}^{\rm T} \Lambda \vec{x}$ and $\vec{g_i}(\vec{x})^{\rm T}\Lambda \vec{f}(\vec{x}) = 0$ for $1\leq i \leq d_u$. But that contradicts the controllability assumption (8); thus $\Delta Q(\vec{x},\vec{\hat{u}}_{d_u}) > 0$ when $\vec{x} \ne \vec{0}$.

When $\vec{x}=\vec{0}$, we get directly from Lemma 1 that $\Delta Q(\vec{0},\vec{\hat{u}}_{d_u}) = 0$. This completes the proof that the Manhattan policy (13) is admissible, and therefore the equilibrium is asymptotically stable.
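The sequential construction used in this proof can be mirrored in a few lines of code: each axis is maximized in turn via (A4), clipped to its bounds, and fixed before moving to the next axis. The sketch below does this on an illustrative toy system (its dynamics, bounds, and $\Lambda$ are assumptions for the example; the actual Manhattan policy is the one defined in (13)) and checks that the gain $\Delta Q$ is strictly positive at sampled nonzero states.

```python
import numpy as np

# Sequential per-axis maximization in the spirit of the Manhattan argument above.
# All numeric choices are illustrative, not taken from the paper.
Lam = -np.eye(2)
f = lambda x: 0.85 * x
G = np.array([[1.0, 0.3],
              [0.2, 1.0]])
V = lambda x: x @ Lam @ x
Q = lambda x, u: (f(x) + G @ u) @ Lam @ (f(x) + G @ u)
w_l, w_u = -1.0, 1.0

def manhattan_like_policy(x):
    u_hat = np.zeros(2)
    for n in range(2):                                   # fix axes one at a time
        f_n = f(x) + G @ u_hat                           # drift with axes < n already fixed
        g_n = G[:, n]
        w_star = -(g_n @ Lam @ f_n) / (g_n @ Lam @ g_n)  # eq. (A4)
        u_hat[n] = np.clip(w_star, w_l, w_u)             # concave in w, so clipping gives the
    return u_hat                                         # box-constrained maximizer

rng = np.random.default_rng(1)
for _ in range(1000):
    x = rng.uniform(-2.0, 2.0, size=2)
    assert Q(x, manhattan_like_policy(x)) - V(x) > 0     # Delta Q > 0 away from the origin
print("positive gain at all sampled states")
```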

Convex Sum (14): Following the same reasoning as for the first step of (A5) and (A6), we get that for all $1 \leq n \leq d_u$, $\Delta Q(\vec{x},\hat{w}_{n}\vec{e}_{n}) \geq 0$, where $\hat{w}_{n} = \arg\max_{w^n_l \le w \le w^n_u} Q^{(\vec{0})}_{\vec{x},n}(w)$, and the equality holds only when

\begin{align} \label{eq:allzeros} \begin{split} \Delta &Q(\vec{x},\hat{w}_{n}\vec{e}_{n}) = 0 \\ &\Leftrightarrow \vec{f}(\vec{x})^{\rm T} \Lambda \vec{g}_n(\vec{x}) = 0,\ \vec{f}(\vec{x})^{\rm T} \Lambda \vec{f}(\vec{x}) = \vec{x}^{\rm T} \Lambda \vec{x}. \\ \end{split} \end{align} (A7)

To simplify the notation, let $Q_n = \Delta Q(\vec{x},\hat{w}_{n}\vec{e}_{n})$, $n=1,\cdots,d_u$, and $Q_0 = 0$. Without loss of generality, assume that $Q_0 \leq Q_1 \leq \cdots \leq Q_{d_u}$. Equality throughout holds only when (A7) holds for all $n=1,\cdots,d_u$, which contradicts (8). Thus, there must be at least one $1 \leq n_0 \leq d_u$ such that $Q_{n_0 - 1} < Q_{n_0}$, and consequently $0 < Q_{d_u}$.

Lastly, we need to show that the combined input $\vec{\hat{u}}$ calculated with (14) is admissible, i.e., $\Delta Q(\vec{x},\vec{\hat{u}}) > 0$. It suffices to show that $\vec{\hat{u}}$ is inside the ellipsoid $\check{Q}_0 = \{\vec{u}\,|\,\Delta Q(\vec{x},\vec{u}) \geq Q_0 \}$. Similarly, $Q_1,\cdots,Q_{d_u}$ define a set of concentric ellipsoids $\check{Q}_i = \{\vec{u}\,|\,\Delta Q(\vec{x},\vec{u}) \geq Q_i \}$, $i=1,\cdots,d_u$. Since $\check{Q}_0 \supseteq \check{Q}_1 \supseteq \cdots \supseteq \check{Q}_{d_u}$ and $\hat{w}_i\vec{e}_i \in \check{Q}_i$ for all $i$, each $\hat{w}_i\vec{e}_i$ belongs to $\check{Q}_0$. Because the ellipsoid $\check{Q}_0$ is convex, the convex combination (14) of points inside it belongs to it as well. Since at least one ellipsoid is a proper subset of $\check{Q}_0$, the combined input attains a strictly positive gain, which completes the asymptotic stability proof.
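The ellipsoid argument lends itself to a direct numerical illustration: each single-axis maximizer lies in the superlevel set $\{\vec{u}\,|\,Q(\vec{x},\vec{u}) \geq V(\vec{x})\}$, which is convex because $Q(\vec{x},\cdot)$ is concave, so any convex combination of these maximizers stays inside it. The sketch below uses equal weights purely for illustration; the actual weights of the Convex Sum policy are those defined in (14), and the toy system is the same assumed example as above, not the paper's model.

```python
import numpy as np

# Convexity check: axis-wise maximizers are inside the ellipsoid {u : Q(x,u) >= V(x)},
# hence so is any convex combination of them. Equal weights are an illustrative choice.
Lam = -np.eye(2)
f = lambda x: 0.85 * x
G = np.array([[1.0, 0.3],
              [0.2, 1.0]])
V = lambda x: x @ Lam @ x
Q = lambda x, u: (f(x) + G @ u) @ Lam @ (f(x) + G @ u)
w_l, w_u = -1.0, 1.0

def axis_maximizer(x, i):
    g_i = G[:, i]
    w = -(g_i @ Lam @ f(x)) / (g_i @ Lam @ g_i)          # (A4) with the other inputs at zero
    return np.clip(w, w_l, w_u) * np.eye(2)[i]

rng = np.random.default_rng(2)
for _ in range(1000):
    x = rng.uniform(-2.0, 2.0, size=2)
    pts = [axis_maximizer(x, i) for i in range(2)]
    assert all(Q(x, p) >= V(x) - 1e-12 for p in pts)     # each point lies in the ellipsoid
    u_hat = sum(pts) / len(pts)                          # a convex combination of the points
    assert Q(x, u_hat) >= V(x) - 1e-12                   # the combination stays inside it
print("convex combinations of the axis maximizers keep Delta Q >= 0")
```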

Axial Sum policy approximation (15): It is admissible because (14) is admissible. Formally, $\Delta Q(\vec{x},\vec{h}_s^{Q}(\vec{x})) \ge \Delta Q(\vec{x},\vec{h}_c^{Q}(\vec{x})) \ge 0$.

APPENDIX C OPTIMALITY CONDITIONS

Proposition 1. When $\vec{g}(\vec{x})$ is an independent input matrix, $A = I$, and the state-value function parameterization $\Theta$ is negative definite, the Axial Sum policy (15) is optimal with respect to the state-value function (7).

Proof. The optimal input ${\vec{u}^*}$ is a solution to $\frac{\partial Q(\vec{x},\vec{u})}{\partial u_i}= 0$, and $\vec{\hat{u}}$ is a solution to $\frac{{\rm d}Q^{(\vec{0})}_{\vec{x},i}(w)}{{\rm d}w} = 0$ at state $\vec{x}$ with respect to the state-value function (7). To show that the Axial Sum policy is optimal, i.e., ${\vec{u}^*} = \vec{\hat{u}}$, it is enough to show that $\frac{\partial Q(\vec{x},\vec{u})}{\partial u_i}= \frac{{\rm d}Q^{(\vec{0})}_{\vec{x},i}(w)}{{\rm d}w}$. This is the case when $Q$ has the form $Q(\vec{x},\vec{u}) = \sum_{i=1}^{d_x} (p_{x_i}u_i^2 + q_{x_i}u_i+r_{x_i})$ for some $p_{x_i},q_{x_i},r_{x_i}\in {\bf R}$ that depend on the current state $\vec{x}$. Expanding the action-value function with the diagonal parameterization $\Theta$, $Q(\vec{x},\vec{u}) =(\vec{f}(\vec{x}) + \vec{g}(\vec{x})\vec{u})^{\rm T}{\Theta}(\vec{f}(\vec{x}) + \vec{g}(\vec{x})\vec{u}) = \sum_{i=1}^{d_x} \theta_i \left( \sum_{j=1}^{d_u}g_{ij}(\vec{x})u_j+ f_i(\vec{x})\right)^2.$ Since there is a single nonzero element $j_i$ in row $i$ of matrix $\vec{g}$, $Q(\vec{x},\vec{u}) = \sum_{i=1}^{d_x} \theta_i \left(g_{ij_i}(\vec{x})u_{j_i}+ f_i(\vec{x})\right)^2 = \sum_{i=1}^{d_x} \left(\theta_i g_{ij_i}^2(\vec{x})u_{j_i}^2 + 2\theta_i f_i(\vec{x}) g_{ij_i}(\vec{x})u_{j_i} + \theta_i f_i^2(\vec{x}) \right).$ After rearranging, $Q(\vec{x},\vec{u}) = \sum_{i=1}^{d_x} (p_{x_i}u_i^2 + q_{x_i}u_i+r_{x_i})$.
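A small numerical check of this separability argument: when each row of $\vec{g}$ has a single nonzero entry and the parameterization is diagonal negative definite, the per-axis maximizers coincide with the joint maximizer obtained by setting the gradient of $Q$ with respect to $\vec{u}$ to zero. The matrices below are illustrative assumptions, not values from the paper, and input bounds are ignored for simplicity.

```python
import numpy as np

# Independent-input case: one nonzero per row of g and a diagonal negative-definite
# Theta make Q(x, .) separable, so axis-wise optimization recovers the joint optimum.
Theta = np.diag([-1.0, -2.0, -0.5])
G = np.array([[1.2, 0.0],
              [0.0, 0.7],
              [0.0, 0.5]])                    # a single nonzero element in every row
f = lambda x: 0.85 * x
Q = lambda x, u: (f(x) + G @ u) @ Theta @ (f(x) + G @ u)

x = np.array([1.0, -0.6, 0.4])

# Joint maximizer: grad_u Q = 2 G^T Theta (f + G u) = 0.
u_joint = -np.linalg.solve(G.T @ Theta @ G, G.T @ Theta @ f(x))

# Axial maximizers: each input optimized on its own with the others held at zero.
u_axial = np.array([-(G[:, j] @ Theta @ f(x)) / (G[:, j] @ Theta @ G[:, j])
                    for j in range(G.shape[1])])

print(np.allclose(u_joint, u_axial))          # True: the axial input matches the optimum
print(Q(x, u_joint), Q(x, u_axial))
```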

ACKNOWLEDGEMENTS

The authors would like to thank Ivana Palunko for the animation software, and Patricio Cruz for assisting with the experiments.
