A journal of IEEE and CAA that publishes high-quality papers in English on original theoretical and experimental research and development in all areas of automation.
Volume 9 Issue 3
Mar.  2022

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 7.847, Top 10% (SCI Q1)
    CiteScore: 13.0, Top 5% (Q1)
    Google Scholar h5-index: 64, TOP 7
Citation: Majid Mazouchi, Subramanya Nageshrao and Hamidreza Modares, "Conflict-Aware Safe Reinforcement Learning: A Meta-Cognitive Learning Framework," IEEE/CAA J. Autom. Sinica, vol. 9, no. 3, pp. 466-481, Mar. 2022. doi: 10.1109/JAS.2021.1004353

Conflict-Aware Safe Reinforcement Learning: A Meta-Cognitive Learning Framework

doi: 10.1109/JAS.2021.1004353
Funds: This work was supported by the Ford Motor Company-Michigan State University Alliance.
  • In this paper, a data-driven conflict-aware safe reinforcement learning (CAS-RL) algorithm is presented for control of autonomous systems. Existing safe RL results with pre-defined performance functions and safe sets can only provide safety and performance guarantees for a single environment or circumstance. By contrast, the presented CAS-RL algorithm provides safety and performance guarantees across a variety of circumstances that the system might encounter. This is achieved by utilizing a bilevel learning control architecture: a higher meta-cognitive layer leverages a data-driven receding-horizon attentional controller (RHAC) to adapt the relative attention paid to the system's different safety and performance requirements, and a lower-layer RL controller designs control actuation signals for the system. The presented RHAC makes its meta decisions based on the reaction curve of the lower-layer RL controller using a meta-model or knowledge. More specifically, it leverages a prediction meta-model (PMM) which spans the space of all future meta trajectories using a given finite number of past meta trajectories. The RHAC adapts the system's aspiration towards performance metrics (e.g., performance weights) as well as safety boundaries to resolve conflicts that arise as mission scenarios develop. This guarantees safety and feasibility (i.e., performance boundedness) of the lower-layer RL-based control solution. It is shown that the interplay between the RHAC and the lower-layer RL controller is a bilevel optimization problem in which the leader (RHAC) operates at a lower rate than the follower (RL-based controller), and that its solution guarantees feasibility and safety of the control solution. The effectiveness of the proposed framework is verified through a simulation example.
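
The idea that a finite number of past trajectories can span the space of future trajectories can be illustrated with a Hankel matrix of recorded data, in the spirit of the fundamental lemma of Willems et al. The sketch below is only an illustration under stated assumptions: the small linear system, the horizon lengths, and the helper names `hankel`, `simulate`, and `predict` are hypothetical, and the data are ordinary input/output signals rather than the meta trajectories used by the paper's PMM.

```python
import numpy as np

def hankel(w, depth):
    """Stack every length-`depth` window of the signal w (T x m) as one column."""
    n_cols = w.shape[0] - depth + 1
    return np.column_stack([w[i:i + depth].reshape(-1) for i in range(n_cols)])

def simulate(A, B, C, x0, u):
    """Simulate x+ = A x + B u, y = C x and return the output sequence."""
    x, ys = x0, []
    for uk in u:
        ys.append(C @ x)
        x = A @ x + B @ uk
    return np.array(ys)

def predict(u_past, y_past, u_new, y_ini, depth):
    """Predict the remaining outputs of a new trajectory from past data only."""
    p = y_past.shape[1]
    t_ini = len(y_ini)
    U = hankel(u_past, depth)
    Y = hankel(y_past, depth)
    Y_ini, Y_fut = Y[:t_ini * p], Y[t_ini * p:]
    # Find a combination g of recorded windows that matches the new inputs
    # and the first t_ini measured outputs, then read off the future outputs.
    lhs = np.vstack([U, Y_ini])
    rhs = np.concatenate([u_new.reshape(-1), y_ini.reshape(-1)])
    g, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return (Y_fut @ g).reshape(-1, p)

# Hypothetical second-order system, used only to generate data (assumption).
A = np.array([[0.9, 0.2], [0.0, 0.7]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])

rng = np.random.default_rng(0)
T, depth, t_ini = 60, 8, 2

# One recorded past trajectory with a random (persistently exciting) input.
u_past = rng.standard_normal((T, 1))
y_past = simulate(A, B, C, np.zeros(2), u_past)

# A new trajectory from a different initial state; the model is not used
# by `predict`, only to produce the ground truth for comparison.
u_new = rng.standard_normal((depth, 1))
y_true = simulate(A, B, C, np.array([1.0, -1.0]), u_new)
y_pred = predict(u_past, y_past, u_new, y_true[:t_ini], depth)

print(np.max(np.abs(y_pred - y_true[t_ini:])))
```

For noise-free linear data the printed prediction error should be at the level of floating-point round-off, since the new trajectory lies in the column span of the stacked Hankel matrices.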

     


    Figures(10)  / Tables(2)


    Highlights

    • The proposed algorithm provides safety and performance guarantees across a variety of circumstances that the system might encounter
    • A bilevel learning control architecture is utilized
    • A higher meta-cognitive layer leverages a data-driven receding-horizon attentional controller to adapt the relative attention paid to the system's different safety and performance requirements
    • A lower-layer RL controller designs control actuation signals for the system
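
The two-rate leader-follower structure named in the highlights can be sketched with a toy example. This is a deliberately simplified stand-in, not the paper's algorithm: the follower is a scalar LQR controller rather than an RL agent, the leader uses a model-based feasibility check rather than a data-driven RHAC, and the system parameters, the candidate weights, and the names `lqr_gain` and `leader` are all hypothetical. The leader runs once every `leader_period` steps and raises the performance weight `q` only as far as an input bound `u_max` allows.

```python
def lqr_gain(a, b, q, r, iters=500):
    """Scalar discrete-time LQR gain obtained by iterating the Riccati equation."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return a * b * p / (r + b * b * p)

# Hypothetical scalar system x+ = a x + b u with input bound |u| <= u_max.
a, b, r = 0.95, 0.5, 1.0
u_max = 1.0
q_candidates = [0.01, 0.1, 1.0, 10.0]  # candidate performance weights
leader_period = 5                       # the leader acts every 5 follower steps

def leader(x):
    """Pick the largest weight whose follower action respects the input bound."""
    feasible = [q for q in q_candidates
                if abs(lqr_gain(a, b, q, r) * x) <= u_max]
    # Fall back to the least aggressive weight if nothing is feasible.
    return max(feasible) if feasible else min(q_candidates)

x = 5.0
history = []  # entries are (step, weight, state, input)
for t in range(40):
    if t % leader_period == 0:
        # Slow layer: trade off performance aspiration against the bound.
        q = leader(x)
        k = lqr_gain(a, b, q, r)
    # Fast layer: apply the follower's control law at every step.
    u = -k * x
    history.append((t, q, x, u))
    x = a * x + b * u

for t, q, xt, u in history[::leader_period]:
    print(t, q, round(xt, 3), round(u, 3))
```

In this toy setting the weight starts low because a large state would demand an input beyond the bound, and it increases as the state shrinks, which mirrors the idea of adapting aspiration levels as circumstances change.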
