A journal of IEEE and CAA that publishes high-quality papers in English on original theoretical and experimental research and development in all areas of automation
Volume 10, Issue 9, September 2023

IEEE/CAA Journal of Automatica Sinica

• JCR Impact Factor: 11.8, Top 4% (SCI Q1)
• CiteScore: 17.6, Top 3% (Q1)
• Google Scholar h5-index: 77, Top 5
Citation: D. Wang, J. Y. Wang, M. M. Zhao, P. Xin, and J. F. Qiao, “Adaptive multi-step evaluation design with stability guarantee for discrete-time optimal learning control,” IEEE/CAA J. Autom. Sinica, vol. 10, no. 9, pp. 1797–1809, Sept. 2023. doi: 10.1109/JAS.2023.123684

Adaptive Multi-Step Evaluation Design With Stability Guarantee for Discrete-Time Optimal Learning Control

doi: 10.1109/JAS.2023.123684
Funds:  This work was supported in part by the National Key Research and Development Program of China (2021ZD0112302), the National Natural Science Foundation of China (62222301, 61890930-5, 62021003), and the Beijing Natural Science Foundation (JQ19013)
Abstract: This paper develops a novel integrated multi-step heuristic dynamic programming (MsHDP) algorithm for solving optimal control problems. It is shown that, when initialized with the zero cost function, MsHDP converges to the optimal solution of the Hamilton-Jacobi-Bellman (HJB) equation. The stability of the system under the control policies generated by MsHDP is then analyzed, and a general stability criterion is designed to determine the admissibility of the current control policy; this criterion applies not only to traditional value iteration and policy iteration but also to MsHDP. Based on the convergence results and the stability criterion, the integrated MsHDP algorithm, which makes use of immature control policies, is developed to greatly accelerate learning. The scheme is implemented within an actor-critic structure, where neural networks serve as the parametric architecture for evaluating and improving the iterative policy. Finally, two simulation examples demonstrate that the learning effectiveness of the integrated MsHDP scheme surpasses that of other fixed or integrated methods.
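To make the multi-step evaluation idea concrete, the sketch below specializes it to a linear-quadratic problem, where the cost function is quadratic, V_j(x) = x'P_j x, and the N-step evaluation of the greedy policy admits a closed-form update; N = 1 recovers ordinary value iteration, while larger N performs more evaluation steps per policy improvement. This is a minimal illustration under assumed dynamics and parameters, not the paper's neural-network actor-critic implementation; the matrices, step counts, and function names (`multi_step_evaluation`, `ms_hdp`) are illustrative.

```python
# Minimal sketch of N-step policy evaluation in heuristic dynamic programming,
# specialized to a linear-quadratic problem so the critic V_j(x) = x' P_j x
# admits a closed-form update. N = 1 recovers ordinary value iteration; larger
# N performs more evaluation steps per policy improvement. All quantities here
# (A, B, Q, R, the step counts) are illustrative assumptions.
import numpy as np

A = np.array([[1.0, 0.1],
              [0.0, 1.0]])          # toy dynamics x_{k+1} = A x_k + B u_k
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)                       # stage cost U(x, u) = x'Qx + u'Ru
R = np.array([[1.0]])

def greedy_gain(P):
    """Policy improvement: mu(x) = -K x minimizes U(x, u) + V(Ax + Bu)."""
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def multi_step_evaluation(P, N):
    """N-step evaluation of the greedy policy under the current critic P."""
    K = greedy_gain(P)
    Ac = A - B @ K                  # closed-loop dynamics under mu_j
    Uc = Q + K.T @ R @ K            # stage cost along the closed loop
    P_next = np.zeros_like(P)
    Phi = np.eye(A.shape[0])        # Phi = Ac^i
    for _ in range(N):
        P_next += Phi.T @ Uc @ Phi  # accumulate N stage costs
        Phi = Ac @ Phi
    return P_next + Phi.T @ P @ Phi # bootstrap with the old critic at step N

def ms_hdp(N, tol=1e-10, max_iter=5000):
    P = np.zeros_like(Q)            # zero initial cost function
    for j in range(max_iter):
        P_next = multi_step_evaluation(P, N)
        if np.max(np.abs(P_next - P)) < tol:
            return P_next, j + 1
        P = P_next
    return P, max_iter

for N in (1, 3, 10):
    P_star, iters = ms_hdp(N)
    print(f"N = {N:2d}: {iters:4d} iterations, trace(P) = {np.trace(P_star):.6f}")
```

Under these assumptions all three runs converge to the same quadratic cost, with larger N typically needing fewer outer iterations, which mirrors the acceleration argument made for the integrated scheme.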

     



Figures (13) / Tables (2)

    Article Metrics

Article views: 536; PDF downloads: 70

    Highlights

• Starting from the zero initial cost function, the convergence and monotonicity properties of the multi-step heuristic dynamic programming (MsHDP) framework are investigated in order to solve the discrete-time optimal learning control problem.
• The traditional stability criterion is extended: a novel stability condition is designed to determine system stability under the current policy. This condition applies not only to traditional value iteration and policy iteration but also to MsHDP, which clearly reflects the connections and differences among these methods (a generic check in this spirit is sketched after this list).
• Based on the stability and convergence analyses, a novel integrated MsHDP algorithm is developed that accelerates the entire learning process without requiring an initial admissible policy.
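As a rough illustration of the second highlight, the sketch below applies a generic Lyapunov-difference test to the greedy policy induced by a quadratic critic on the same toy linear-quadratic system as above: the policy is accepted only if the critic decreases by at least the stage cost along the closed loop on sampled states. The system, sample set, tolerance, and function names (`greedy_gain`, `is_admissible`) are assumptions for illustration; this is not the specific criterion derived in the paper.

```python
# Generic Lyapunov-difference admissibility test of the kind used in
# adaptive-critic stability analyses: accept the greedy policy induced by the
# critic P only if the critic decreases by at least the stage cost along the
# closed loop on every sampled state. Illustrative only; not the paper's
# specific criterion.
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # same assumed toy system as above
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.array([[1.0]])

def greedy_gain(P):
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def is_admissible(P, samples, tol=1e-8):
    """Check V(x_next) - V(x) <= -U(x, u) for the greedy policy of critic P."""
    K = greedy_gain(P)
    for x in samples:
        u = -K @ x
        x_next = A @ x + B @ u
        decrease = float(x_next.T @ P @ x_next - x.T @ P @ x)
        stage = float(x.T @ Q @ x + u.T @ R @ u)
        if decrease > -stage + tol:
            return False
    return True

rng = np.random.default_rng(0)
samples = [rng.standard_normal((2, 1)) for _ in range(200)]

P = np.zeros((2, 2))                      # zero initial critic
print("zero critic admissible:", is_admissible(P, samples))

for _ in range(500):                      # plain one-step value iteration
    K = greedy_gain(P)
    Ac = A - B @ K
    P = Q + K.T @ R @ K + Ac.T @ P @ Ac
print("converged critic admissible:", is_admissible(P, samples))
```

In this toy setting the zero initial critic fails the test while a converged critic passes it, which is the role such a criterion plays when deciding whether an immature iterative policy can already be trusted to stabilize the plant.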
