IEEE/CAA Journal of Automatica Sinica, Volume 13, Issue 4
Citation: W. Ma, Y. Guo, Q.-L. Han, W. Zhou, X. Zhu, J. Xiong, S. Wen, and Y. Xiang, IEEE/CAA J. Autom. Sinica, vol. 13, no. 4, pp. 776–795, Apr. 2026. doi: 10.1109/JAS.2026.125993
[1] Z. Wang, S. Cai, G. Chen, A. Liu, X. Ma, Y. Liang, and Team CraftJarvis, “Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents,” in Proc. 37th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2023, Art. no. 1480.
[2] X. Li, S. Wang, S. Zeng, Y. Wu, and Y. Yang, “A survey on LLM-based multi-agent systems: Workflow, infrastructure, and challenges,” Vicinagearth, vol. 1, no. 1, Art. no. 9, Oct. 2024. doi: 10.1007/s44336-024-00009-2
[3] J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative agents: Interactive simulacra of human behavior,” in Proc. 36th Annu. ACM Symp. on User Interface Software and Technology, San Francisco, USA, 2023, Art. no. 2.
[4] X. Zhu, Y. Chen, H. Tian, C. Tao, W. Su, C. Yang, et al., “Ghost in the Minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory,” arXiv preprint arXiv: 2305.17144, 2023.
[5] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, “Voyager: An open-ended embodied agent with large language models,” Trans. Mach. Learn. Res., Mar. 2024.
[6] C. Qu, S. Dai, X. Wei, H. Cai, S. Wang, D. Yin, J. Xu, and J.-R. Wen, “Tool learning with large language models: A survey,” Front. Comput. Sci., vol. 19, no. 8, Art. no. 198343, Jan. 2025. doi: 10.1007/s11704-024-40678-2
[7] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” in Proc. 37th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2023, Art. no. 2997.
[8] C. Qian, W. Liu, H. Liu, N. Chen, Y. Dang, J. Li, et al., “ChatDev: Communicative agents for software development,” in Proc. 62nd Annu. Meet. of the Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 15174−15186.
[9] Z. Li, W. Wu, Y. Guo, J. Sun, and Q.-L. Han, “Embodied multi-agent systems: A review,” IEEE/CAA J. Autom. Sinica, vol. 12, no. 6, pp. 1095–1116, Jun. 2025. doi: 10.1109/JAS.2025.125552
[10] W. Zhou, X. Zhu, Q.-L. Han, L. Li, X. Chen, S. Wen, and Y. Xiang, “The security of using large language models: A survey with emphasis on ChatGPT,” IEEE/CAA J. Autom. Sinica, vol. 12, no. 1, pp. 1–26, Jan. 2025.
[11] J. Lin, X. Zeng, J. Zhu, S. Wang, J. Shun, J. Wu, and D. Zhou, “Plan and budget: Effective and efficient test-time scaling on reasoning large language models,” arXiv preprint arXiv: 2505.16122, 2025.
[12] Q. Zhang, F. Lyu, Z. Sun, L. Wang, W. Zhang, W. Hua, et al., “A survey on test-time scaling in large language models: What, how, where, and how well?” arXiv preprint arXiv: 2503.24235, 2025.
[13] Y. Zhong, Z. Zhang, X. Song, H. Hu, C. Jin, B. Wu, et al., “StreamRL: Scalable, heterogeneous, and elastic RL for LLMs with disaggregated stream generation,” arXiv preprint arXiv: 2504.15930, 2025.
[14] G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu, “HybridFlow: A flexible and efficient RLHF framework,” in Proc. Twentieth European Conf. Computer Systems, Rotterdam, Netherlands, 2025, pp. 1279−1297.
[15] G. Sheng, Y. Tong, B. Wan, W. Zhang, C. Jia, X. Wu, et al., “Laminar: A scalable asynchronous RL post-training framework,” arXiv preprint arXiv: 2510.12633, 2025.
[16] W. Wang, S. Xiong, G. Chen, W. Gao, S. Guo, Y. He, et al., “Reinforcement learning optimization for large-scale learning: An efficient and user-friendly scaling library,” arXiv preprint arXiv: 2506.06122, 2025.
[17] R. Kamoi, Y. Zhang, N. Zhang, J. Han, and R. Zhang, “When can LLMs actually correct their own mistakes? A critical survey of self-correction of LLMs,” Trans. Assoc. Comput. Linguist., vol. 12, pp. 1417–1440, Nov. 2024. doi: 10.1162/tacl_a_00713
[18] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv: 1707.06347, 2017.
[19] M. Luo, J. Yao, R. Liaw, E. Liang, and I. Stoica, “IMPACT: Importance weighted asynchronous architectures with clipped target networks,” in Proc. 8th Int. Conf. Learning Representations, Addis Ababa, Ethiopia, 2020.
[20] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, et al., “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,” arXiv preprint arXiv: 2402.03300, 2024.
[21] Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, et al., “DAPO: An open-source LLM reinforcement learning system at scale,” arXiv preprint arXiv: 2503.14476, 2025.
[22] J. Liu, J. Obando-Ceron, H. Lu, Y. He, W. Wang, W. Su, B. Zheng, P. S. Castro, A. Courville, and L. Pan, “Asymmetric proximal policy optimization: Mini-critics boost LLM reasoning,” arXiv preprint arXiv: 2510.01656, 2025.
[23] H. Wang, S. Hao, H. Dong, S. Zhang, Y. Bao, Z. Yang, and Y. Wu, “Offline reinforcement learning for LLM multi-step reasoning,” in Proc. Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 2025, pp. 8881−8893.
[24] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” in Proc. 37th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2023, Art. no. 2338.
[25] M. G. Azar, Z. D. Guo, B. Piot, R. Munos, M. Rowland, M. Valko, and D. Calandriello, “A general theoretical paradigm to understand learning from human preferences,” in Proc. 27th Int. Conf. Artificial Intelligence and Statistics, Valencia, Spain, 2024, pp. 4447−4455.
[26] K. Ethayarajh, W. Xu, N. Muennighoff, D. Jurafsky, and D. Kiela, “Model alignment as prospect theoretic optimization,” in Proc. 41st Int. Conf. Machine Learning, Vienna, Austria, 2024, pp. 12634−12651.
[27] H. Yuan, Z. Yuan, C. Tan, W. Wang, S. Huang, and F. Huang, “RRHF: Rank responses to align language models with human feedback,” in Proc. 37th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2023, Art. no. 482.
[28] Y. Zhao, R. Joshi, T. Liu, M. Khalman, M. Saleh, and P. J. Liu, “SLiC-HF: Sequence likelihood calibration with human feedback,” arXiv preprint arXiv: 2305.10425, 2023.
[29] J. Hong, N. Lee, and J. Thorne, “ORPO: Monolithic preference optimization without reference model,” in Proc. Conf. Empirical Methods in Natural Language Processing, Miami, USA, 2024, pp. 11170−11189.
[30] Y. Meng, M. Xia, and D. Chen, “SimPO: Simple preference optimization with a reference-free reward,” in Proc. 38th Int. Conf. Neural Information Processing Systems, Vancouver, Canada, 2024, Art. no. 3946.
[31] Y. Xie, K. Kawaguchi, Y. Zhao, J. X. Zhao, M.-Y. Kan, J. He, and M. Xie, “Self-evaluation guided beam search for reasoning,” in Proc. 37th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2023, Art. no. 1802.
[32] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, et al., “Chain-of-thought prompting elicits reasoning in large language models,” in Proc. 36th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2022, Art. no. 1800.
[33] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” in Proc. Eleventh Int. Conf. Learning Representations, Kigali, Rwanda, 2023.
[34] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, et al., “SELF-REFINE: Iterative refinement with self-feedback,” in Proc. 37th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2023, Art. no. 2019.
[35] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” in Proc. 37th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2023, Art. no. 517.
[36] R. Coulom, “Efficient selectivity and backup operators in Monte-Carlo tree search,” in Proc. 5th Int. Conf. Computers and Games, Turin, Italy, 2006, pp. 72−83.
[37] D. Zhang, S. Zhoubian, Z. Hu, Y. Yue, Y. Dong, and J. Tang, “ReST-MCTS*: LLM self-training via process reward guided tree search,” in Proc. 38th Int. Conf. Neural Information Processing Systems, Vancouver, Canada, 2024, Art. no. 2066.
[38] M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, et al., “Graph of thoughts: Solving elaborate problems with large language models,” in Proc. 38th AAAI Conf. Artificial Intelligence, Vancouver, Canada, 2024, Art. no. 1972.
[39] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv: 2110.14168, 2021.
[40] A. Ni, S. Iyer, D. Radev, V. Stoyanov, W.-T. Yih, S. I. Wang, and X. V. Lin, “LEVER: Learning to verify language-to-code generation with execution,” in Proc. 40th Int. Conf. Machine Learning, Honolulu, USA, 2023, Art. no. 1086.
[41] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe, “Let’s verify step by step,” in Proc. Twelfth Int. Conf. Learning Representations, Vienna, Austria, 2024.
[42] L. Zhang, A. Hosseini, H. Bansal, M. Kazemi, A. Kumar, and R. Agarwal, “Generative verifiers: Reward modeling as next-token prediction,” in Proc. Thirteenth Int. Conf. Learning Representations, Singapore, Singapore, 2025.
[43] Y. Weng, M. Zhu, F. Xia, B. Li, S. He, S. Liu, B. Sun, K. Liu, and J. Zhao, “Large language models are better reasoners with self-verification,” in Proc. Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, Singapore, 2023, pp. 2550−2575.
[44] N. Singhi, H. Bansal, A. Hosseini, A. Grover, K.-W. Chang, M. Rohrbach, and A. Rohrbach, “When to solve, when to verify: Compute-optimal problem solving and generative verification for LLM reasoning,” in Proc. Second Conf. Language Modeling, Montreal, Canada, 2025.
[45] T. Schuster, A. Fisch, J. Gupta, M. Dehghani, D. Bahri, V. Q. Tran, Y. Tay, and D. Metzler, “Confident adaptive language modeling,” in Proc. 36th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2022, Art. no. 1269.
[46] M. Yue, J. Zhao, M. Zhang, L. Du, and Z. Yao, “Large language model cascades with mixture of thought representations for cost-efficient reasoning,” in Proc. Twelfth Int. Conf. Learning Representations, Vienna, Austria, 2024.
[47] X. Ning, Z. Lin, Z. Zhou, Z. Wang, H. Yang, and Y. Wang, “Skeleton-of-thought: Prompting LLMs for efficient parallel generation,” in Proc. Twelfth Int. Conf. Learning Representations, Vienna, Austria, 2024.
[48] Z. Yao, R. Y. Aminabadi, O. Ruwase, S. Rajbhandari, X. Wu, A. A. Awan, et al., “DeepSpeed-Chat: Easy, fast and affordable RLHF training of ChatGPT-like models at all scales,” arXiv preprint arXiv: 2308.01320, 2023.
[49] L. von Werra, Y. Belkada, L. Tunstall, E. Beeching, T. Thrush, N. Lambert, S. Huang, K. Rasul, and Q. Gallouedec. (2020). TRL: Transformer reinforcement learning [Online]. Available: https://github.com/huggingface/trl
[50] Y. Zhao, J. Huang, J. Hu, X. Wang, Y. Mao, D. Zhang, et al., “SWIFT: A scalable lightweight infrastructure for fine-tuning,” in Proc. 39th AAAI Conf. Artificial Intelligence, Philadelphia, USA, 2025, Art. no. 3485.
[51] Z. Zhu, C. Xie, X. Lv, and slime Contributors. (2025). slime: An LLM post-training framework for RL scaling [Online]. Available: https://github.com/THUDM/slime
[52] Z. Wang, K. Wang, Q. Wang, P. Zhang, L. Li, Z. Yang, et al., “RAGEN: Understanding self-evolution in LLM agents via multi-turn reinforcement learning,” arXiv preprint arXiv: 2504.20073, 2025.
[53] Y. Zhong, Z. Zhang, B. Wu, S. Liu, Y. Chen, C. Wan, et al., “Optimizing RLHF training for large language models with stage fusion,” in Proc. 22nd USENIX Symp. on Networked Systems Design and Implementation, Philadelphia, USA, 2025, Art. no. 26.
[54] Z. Wang, T. Zhou, L. Liu, A. Li, J. Hu, D. Yang, et al., “DistFlow: A fully distributed RL framework for scalable and efficient LLM post-training,” arXiv preprint arXiv: 2507.13833, 2025.
[55] J. Hu, X. Wu, W. Shen, J. K. Liu, Z. Zhu, W. Wang, et al., “OpenRLHF: An easy-to-use, scalable and high-performance RLHF framework,” arXiv preprint arXiv: 2405.11143, 2024.
[56] NVIDIA. (2025). NeMo RL: A scalable and efficient post-training library [Online]. Available: https://github.com/NVIDIA-NeMo/RL
[57] S. Cao, S. Hegde, D. Li, T. Griggs, S. Liu, E. Tang, et al. (2025). SkyRL-v0: Train real-world long-horizon agents via reinforcement learning [Online]. Available: https://github.com/NovaSky-AI/SkyRL
[58] W. Fu, J. Gao, X. Shen, C. Zhu, Z. Mei, C. He, et al., “AReaL: A large-scale asynchronous reinforcement learning system for language reasoning,” arXiv preprint arXiv: 2505.24298, 2025.
[59] J. Wang, S. Wang, Y. Cui, X. Chen, C. Wang, X. Zhang, et al., “SeamlessFlow: A trainer agent isolation RL framework achieving bubble-free pipelines via tag scheduling,” arXiv preprint arXiv: 2508.11553, 2025.
[60] Z. Han, A. You, H. Wang, K. Luo, G. Yang, W. Shi, et al., “AsyncFlow: An asynchronous streaming RL framework for efficient LLM post-training,” arXiv preprint arXiv: 2507.01663, 2025.
[61] C. Yu, S. Lu, C. Zhuang, D. Wang, Q. Wu, Z. Li, et al., “AWorld: Orchestrating the training recipe for agentic AI,” arXiv preprint arXiv: 2508.20404, 2025.
[62] W. Brown. (2025). Verifiers: Environments for LLM reinforcement learning [Online]. Available: https://github.com/PrimeIntellect-ai/verifiers
[63] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with PagedAttention,” in Proc. 29th Symp. on Operating Systems Principles, Koblenz, Germany, 2023, pp. 611−626.
[64] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, “SWE-bench: Can language models resolve real-world GitHub issues?” in Proc. Twelfth Int. Conf. Learning Representations, Vienna, Austria, 2024.
[65] Z. Deng, W. Ma, Q.-L. Han, W. Zhou, X. Zhu, S. Wen, and Y. Xiang, “Exploring DeepSeek: A survey on advances, applications, challenges and future directions,” IEEE/CAA J. Autom. Sinica, vol. 12, no. 5, pp. 872–893, May 2025. doi: 10.1109/JAS.2025.125498
[66] Z. Xi, J. Huang, C. Liao, B. Huang, H. Guo, J. Liu, et al., “AgentGym-RL: Training LLM agents for long-horizon decision making through multi-turn reinforcement learning,” arXiv preprint arXiv: 2509.08755, 2025.
[67] K. Team, A. Du, B. Gao, B. Xing, C. Jiang, C. Chen, et al., “Kimi k1.5: Scaling reinforcement learning with LLMs,” arXiv preprint arXiv: 2501.12599, 2025.
[68] Meituan LongCat Team, B. Wang, Bayan, B. Xiao, B. Zhang, B. Rong, et al., “LongCat-Flash-Omni technical report,” arXiv preprint arXiv: 2511.00279, 2025.
[69] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, “ReAct: Synergizing reasoning and acting in language models,” in Proc. Eleventh Int. Conf. Learning Representations, Kigali, Rwanda, 2023.
[70] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” in Proc. 37th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2023, Art. no. 377.
[71] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, “Self-RAG: Learning to retrieve, generate, and critique through self-reflection,” in Proc. Twelfth Int. Conf. Learning Representations, Vienna, Austria, 2024.
[72] Q. Tang, Z. Deng, H. Lin, X. Han, Q. Liang, B. Cao, and L. Sun, “ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases,” arXiv preprint arXiv: 2306.05301, 2023.
[73] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, et al., “ToolLLM: Facilitating large language models to master 16000+ real-world APIs,” in Proc. Twelfth Int. Conf. Learning Representations, Vienna, Austria, 2024.
[74] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large language model connected with massive APIs,” in Proc. 38th Int. Conf. Neural Information Processing Systems, Vancouver, Canada, 2024, Art. no. 4020.
[75] P. Lu, B. Peng, H. Cheng, M. Galley, K.-W. Chang, Y. N. Wu, S.-C. Zhu, and J. Gao, “Chameleon: Plug-and-play compositional reasoning with large language models,” in Proc. 37th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2023, Art. no. 1882.
[76] S. Gao, J. Dwivedi-Yu, P. Yu, X. E. Tan, R. Pasunuru, O. Golovneva, K. Sinha, A. Celikyilmaz, A. Bosselut, and T. Wang, “Efficient tool use with chain-of-abstraction reasoning,” in Proc. 31st Int. Conf. Computational Linguistics, Abu Dhabi, UAE, 2025, pp. 2727−2743.
[77] S. Chen, Y. Wang, Y.-F. Wu, Q. Chen, Z. Xu, W. Luo, K. Zhang, and L. Zhang, “Advancing tool-augmented large language models: Integrating insights from errors in inference trees,” in Proc. 38th Int. Conf. Neural Information Processing Systems, Vancouver, Canada, 2024, Art. no. 3382.
[78] Y. Ruan, H. Dong, A. Wang, S. Pitis, Y. Zhou, J. Ba, Y. Dubois, C. J. Maddison, and T. Hashimoto, “Identifying the risks of LM agents with an LM-emulated sandbox,” in Proc. Twelfth Int. Conf. Learning Representations, Vienna, Austria, 2024.
[79] S. Yuan, K. Song, J. Chen, X. Tan, Y. Shen, K. Ren, D. Li, and D. Yang, “EASYTOOL: Enhancing LLM-based agents with concise tool instruction,” in Proc. Conf. Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, Albuquerque, USA, 2025, pp. 951−972.
[80] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang, “HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face,” in Proc. 37th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2023, Art. no. 1657.
[81] Z. Zhao, W. S. Lee, and D. Hsu, “Large language models as commonsense knowledge for large-scale task planning,” in Proc. 37th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2023, Art. no. 1387.
[82] A. Zhou, K. Yan, M. Shlapentokh-Rothman, H. Wang, and Y.-X. Wang, “Language agent tree search unifies reasoning, acting, and planning in language models,” in Proc. 41st Int. Conf. Machine Learning, Vienna, Austria, 2024, Art. no. 2572.
[83] B. Liu, Y. Jiang, X. Zhang, Q. Liu, S. Zhang, J. Biswas, and P. Stone, “LLM+P: Empowering large language models with optimal planning proficiency,” arXiv preprint arXiv: 2304.11477, 2023.
[84] L. E. Erdogan, N. Lee, S. Kim, S. Moon, H. Furuta, G. Anumanchipalli, K. Keutzer, and A. Gholami, “Plan-and-Act: Improving planning of agents for long-horizon tasks,” in Proc. 42nd Int. Conf. Machine Learning, Vancouver, Canada, 2025.
[85] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,” in Proc. 39th Int. Conf. Machine Learning, Baltimore, USA, 2022, pp. 9118−9147.
[86] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, et al., “Inner monologue: Embodied reasoning through planning with language models,” in Proc. 6th Conf. Robot Learning, Auckland, New Zealand, 2023, pp. 1769−1782.
[87] C. H. Song, B. M. Sadler, J. Wu, W.-L. Chao, C. Washington, and Y. Su, “LLM-Planner: Few-shot grounded planning for embodied agents with large language models,” in Proc. IEEE/CVF Int. Conf. Computer Vision, Paris, France, 2023, pp. 2998−3009.
[88] S. Qiao, R. Fang, N. Zhang, Y. Zhu, X. Chen, S. Deng, Y. Jiang, P. Xie, F. Huang, and H. Chen, “Agent planning with world knowledge model,” in Proc. 38th Int. Conf. Neural Information Processing Systems, Vancouver, Canada, 2024, Art. no. 3646.
[89] A. Maharana, D.-H. Lee, S. Tulyakov, M. Bansal, F. Barbieri, and Y. Fang, “Evaluating very long-term conversational memory of LLM agents,” in Proc. 62nd Annu. Meet. of the Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 13851−13870.
[90] W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang, “MemoryBank: Enhancing large language models with long-term memory,” in Proc. 38th AAAI Conf. Artificial Intelligence, Vancouver, Canada, 2024, pp. 19724−19731.
[91] W. Wang, L. Dong, H. Cheng, X. Liu, X. Yan, J. Gao, and F. Wei, “Augmenting language models with long-term memory,” in Proc. 37th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2023, Art. no. 3259.
[92] C. Packer, S. Wooders, K. Lin, V. Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez, “MemGPT: Towards LLMs as operating systems,” arXiv preprint arXiv: 2310.08560, 2023.
[93] K.-H. Lee, X. Chen, H. Furuta, J. Canny, and I. Fischer, “A human-inspired reading agent with gist memory of very long contexts,” in Proc. 41st Int. Conf. Machine Learning, Vienna, Austria, 2024, Art. no. 1054.
[94] W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, and Y. Zhang, “A-MEM: Agentic memory for LLM agents,” in Proc. 39th Int. Conf. Neural Information Processing Systems, San Diego, USA, 2025.
[95] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, et al., “Retrieval-augmented generation for knowledge-intensive NLP tasks,” in Proc. 34th Int. Conf. Neural Information Processing Systems, Vancouver, Canada, 2020, Art. no. 793.
[96] G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, and E. Grave, “Atlas: Few-shot learning with retrieval augmented language models,” J. Mach. Learn. Res., vol. 24, no. 251, pp. 1–43, Jul. 2023.
[97] B. Peng, Y. Zhu, Y. Liu, X. Bo, H. Shi, C. Hong, Y. Zhang, and S. Tang, “Graph retrieval-augmented generation: A survey,” ACM Trans. Inform. Syst., vol. 44, no. 2, Art. no. 35, Feb. 2026.
[98] D.-Y. Zhou, X. Xue, Q. Ma, C. Guo, L.-Z. Cui, Y.-L. Tian, J. Yang, and F.-Y. Wang, “Federated experiments: Generative causal inference powered by LLM-based agents simulation and RAG-based domain docking,” IEEE/CAA J. Autom. Sinica, vol. 12, no. 7, pp. 1301–1304, Jul. 2025. doi: 10.1109/JAS.2024.124671
[99] H. Zhu, J. Lu, Y. Lou, and X. Yang, “Distributed saturated impulsive quasi-consensus for leader-follower multi-agent systems: An open topology framework,” IEEE/CAA J. Autom. Sinica, vol. 12, no. 9, pp. 1941–1943, Sep. 2025. doi: 10.1109/JAS.2024.124743
[100] Y. Li, Y. Zhang, X. Li, and C. Sun, “Regional multi-agent cooperative reinforcement learning for city-level traffic grid signal control,” IEEE/CAA J. Autom. Sinica, vol. 11, no. 9, pp. 1987–1998, Sep. 2024. doi: 10.1109/JAS.2024.124365
[101] G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem, “CAMEL: Communicative agents for ‘mind’ exploration of large language model society,” in Proc. 37th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2023, Art. no. 2264.
[102] W. Chen, Y. Su, J. Zuo, C. Yang, C. Yuan, C.-M. Chan, et al., “AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors,” in Proc. Twelfth Int. Conf. Learning Representations, Vienna, Austria, 2024.
[103] Z. Wang, S. Mao, W. Wu, T. Ge, F. Wei, and H. Ji, “Unleashing the emergent cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration,” in Proc. Conf. North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Mexico City, Mexico, 2024, pp. 257−279.
[104] S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, J. Wang, et al., “MetaGPT: Meta programming for a multi-agent collaborative framework,” in Proc. Twelfth Int. Conf. Learning Representations, Vienna, Austria, 2024.
[105] L. H. Klein, N. Potamitis, R. Aydin, R. West, C. Gulcehre, and A. Arora, “Fleet of agents: Coordinated problem solving with large language models,” in Proc. 42nd Int. Conf. Machine Learning, Vancouver, Canada, 2025.
[106] C.-M. Chan, W. Chen, Y. Su, J. Yu, W. Xue, S. Zhang, J. Fu, and Z. Liu, “ChatEval: Towards better LLM-based evaluators through multi-agent debate,” in Proc. Twelfth Int. Conf. Learning Representations, Vienna, Austria, 2024.
[107] T. Liang, Z. He, W. Jiao, X. Wang, Y. Wang, R. Wang, Y. Yang, S. Shi, and Z. Tu, “Encouraging divergent thinking in large language models through multi-agent debate,” in Proc. Conf. Empirical Methods in Natural Language Processing, Miami, USA, 2024, pp. 17889−17904.
[108] CrewAI Inc. (2025). CrewAI [Online]. Available: https://github.com/crewAIInc/crewAI
[109] D. Chen, H. Zhang, H. Wang, Y. Huo, Y. Li, and J. Wang, “GameGPT: Multi-agent collaborative framework for game development,” arXiv preprint arXiv: 2310.08067, 2023.
[110] Microsoft. (2023). AutoGen [Online]. Available: https://github.com/microsoft/autogen. Accessed on: Dec. 1, 2025.
[111] OpenAI. (2024). Swarm [Online]. Available: https://github.com/openai/swarm. Accessed on: Dec. 1, 2025.
[112] L. Foundation. (2025). Agentstack [Online]. Available: https://github.com/i-am-bee/agentstack. Accessed on: Dec. 1, 2025.
[113] Amazon. (2025). Strands agents SDK for Python [Online]. Available: https://github.com/strands-agents/sdk-python. Accessed on: Dec. 1, 2025.
[114] LangChain AI. (2023). LangChain: The platform for reliable agents [Online]. Available: https://github.com/langchain-ai/langchain. Accessed on: Dec. 1, 2025.
[115] Microsoft. (2025). Semantic Kernel [Online]. Available: https://github.com/microsoft/semantic-kernel. Accessed on: Dec. 1, 2025.
[116] LlamaIndex. (2025). LlamaIndex [Online]. Available: https://www.llamaindex.ai/. Accessed on: Dec. 1, 2025.
[117] LangChain AI. (2024). LangGraph: Build resilient language agents as graphs [Online]. Available: https://github.com/langchain-ai/langgraph. Accessed on: Dec. 1, 2025.
[118] LangFlow AI. (2024). LangFlow [Online]. Available: https://github.com/langflow-ai/langflow. Accessed on: Dec. 1, 2025.
[119] J. Xu, Q. Sun, Q.-L. Han, and Y. Tang, “When embodied AI meets industry 5.0: Human-centered smart manufacturing,” IEEE/CAA J. Autom. Sinica, vol. 12, no. 3, pp. 485–501, Mar. 2025. doi: 10.1109/JAS.2025.125327
[120] T. Shen, J. Sun, S. Kong, Y. Wang, J. Li, X. Li, and F.-Y. Wang, “The journey/DAO/TAO of embodied intelligence: From large models to foundation intelligence and parallel intelligence,” IEEE/CAA J. Autom. Sinica, vol. 11, no. 6, pp. 1313–1316, Jun. 2024. doi: 10.1109/JAS.2024.124407
[121] H. Tan, Y. Feng, X. Mao, S. Huang, G. Liu, Z. Hao, H. Su, and J. Zhu, “AnyPos: Automated task-agnostic actions for bimanual manipulation,” arXiv preprint arXiv: 2507.12768, 2025.
[122] Y. Qin, W. Yang, B. Huang, K. Van Wyk, H. Su, X. Wang, Y.-W. Chao, and D. Fox, “AnyTeleop: A general vision-based dexterous robot arm-hand teleoperation system,” in Proc. Robotics: Science and Systems XIX, Daegu, Republic of Korea, 2023.
[123] Y. Feng, H. Tan, X. Mao, C. Xiang, G. Liu, S. Huang, H. Su, and J. Zhu, “Vidar: Embodied video diffusion model for generalist manipulation,” arXiv preprint arXiv: 2507.12898, 2025.
[124] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, et al., “OpenVLA: An open-source vision-language-action model,” in Proc. 8th Conf. Robot Learning, Munich, Germany, 2024, pp. 2679−2713.
[125] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, et al., “π0: A vision-language-action flow model for general robot control,” arXiv preprint arXiv: 2410.24164, 2024.
[126] J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, et al., “GR00T N1: An open foundation model for generalist humanoid robots,” arXiv preprint arXiv: 2503.14734, 2025.
[127] R. Cadene, S. Alibert, A. Soare, Q. Gallouedec, A. Zouitine, S. Palma, et al. (2024). LeRobot: State-of-the-art machine learning for real-world robotics in PyTorch [Online]. Available: https://github.com/huggingface/lerobot
[128] C. Yu, Y. Wang, Z. Guo, H. Lin, S. Xu, H. Zang, et al., “RLinf: Flexible and efficient large-scale reinforcement learning via macro-to-micro flow transformation,” arXiv preprint arXiv: 2509.15965, 2025.
[129] B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone, “LIBERO: Benchmarking knowledge transfer for lifelong robot learning,” in Proc. 37th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2023, Art. no. 1939.
[130] A. Allshire, H. Choi, J. Zhang, D. McAllister, A. Zhang, C. M. Kim, T. Darrell, P. Abbeel, J. Malik, and A. Kanazawa, “Visual imitation enables contextual humanoid control,” arXiv preprint arXiv: 2505.03729, 2025.
[131] C. Wang, Z. Wang, W. Sheng, Q. Liu, and H. Dong, “Fuzzy domain adaptation via variational inference for evolving concept drift,” IEEE Trans. Fuzzy Syst., vol. 33, no. 7, pp. 2361–2375, Jul. 2025. doi: 10.1109/TFUZZ.2025.3567089
[132] C. Wang, Z. Wang, Q. Liu, H. Dong, W. Liu, and X. Liu, “A comprehensive survey on domain adaptation for intelligent fault diagnosis,” Knowl.-Based Syst., vol. 37, Art. no. 114109, 2025.
[133] S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, et al., “WebArena: A realistic web environment for building autonomous agents,” in Proc. Twelfth Int. Conf. Learning Representations, Vienna, Austria, 2024.