Citation: L. Xiong, H. Wang, X. Chen, L. Sheng, Y. Xiong, J. Liu, Y. Xiao, H. Chen, Q.-L. Han, and Y. Tang, “DeepSeek: Paradigm shifts and technical evolution in large AI models,” IEEE/CAA J. Autom. Sinica, vol. 12, no. 5, pp. 841–858, May 2025. doi: 10.1109/JAS.2025.125495
[1] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi et al., “DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning,” arXiv preprint arXiv: 2501.12948, 2025.
[2] E. Gibney, “China’s cheap, open AI model DeepSeek thrills scientists,” Nature, vol. 638, no. 8049, pp. 13–14, 2025. doi: 10.1038/d41586-025-00229-6
[3] C. Metz, “What to know about DeepSeek and how it is upending A.I.,” The New York Times, https://www.nytimes.com/2025/01/27/technology/what-is-deepseek-china-ai.html (accessed on 01/27/2025).
[4] A. Picchi, “What is DeepSeek, and why is it causing Nvidia and other stocks to slump?” CBS News, https://www.cbsnews.com/news/what-is-deepseek-ai-china-stock-nvidia-nvda-asml/ (accessed on 01/28/2025).
[5] T. Wu, S. He, J. Liu, S. Sun, K. Liu, Q.-L. Han, and Y. Tang, “A brief overview of ChatGPT: The history, status quo and potential future development,” IEEE/CAA Journal of Automatica Sinica, vol. 10, no. 5, pp. 1122–1136, 2023. doi: 10.1109/JAS.2023.123618
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv: 1810.04805, 2018.
[7] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
[8] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in Proc. Int. Conf. on Machine Learning (ICML). PMLR, 2021, pp. 8748–8763.
[9] J. Li, T. Tang, W. X. Zhao, J.-Y. Nie, and J.-R. Wen, “Pre-trained language models for text generation: A survey,” ACM Computing Surveys, vol. 56, no. 9, pp. 1–39, 2024.
[10] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heintz, and D. Roth, “Recent advances in natural language processing via large pre-trained language models: A survey,” ACM Computing Surveys, vol. 56, no. 2, pp. 1–40, 2023.
[11] N. Karanikolas, E. Manga, N. Samaridi, E. Tousidou, and M. Vassilakopoulos, “Large language models versus natural language understanding and generation,” in Proc. 27th Pan-Hellenic Conf. on Progress in Computing and Informatics, pp. 278–290, 2023.
[12] R. Sarikaya, G. E. Hinton, and A. Deoras, “Application of deep belief networks for natural language understanding,” IEEE/ACM Trans. Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 778–784, 2014. doi: 10.1109/TASLP.2014.2303296
[13] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “PaLM: Scaling language modeling with pathways,” Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.
[14] M. Nadeem, S. S. Sohail, L. Javed, F. Anwer, A. K. J. Saudagar, and K. Muhammad, “Vision-enabled large language and deep learning models for image-based emotion recognition,” Cognitive Computation, vol. 16, no. 5, pp. 2566–2579, 2024. doi: 10.1007/s12559-024-10281-5
[15] K. Bayoudh, R. Knani, F. Hamdaoui, and A. Mtibaa, “A survey on deep multimodal learning for computer vision: Advances, trends, applications, and datasets,” The Visual Computer, vol. 38, no. 8, pp. 2939–2970, 2022. doi: 10.1007/s00371-021-02166-7
[16] R. Archana and P. S. E. Jeevaraj, “Deep learning models for digital image processing: A review,” Artificial Intelligence Review, vol. 57, no. 1, p. 11, 2024. doi: 10.1007/s10462-023-10631-z
[17] “AI Action Summit (10 and 11 February 2025),” https://onu.delegfrance.org/ai-action-summit-10-and-11-february-2025.
[18] Y. Tan, D. Min, Y. Li, W. Li, N. Hu, Y. Chen, and G. Qi, “Can ChatGPT replace traditional KBQA models? An in-depth analysis of the question answering performance of the GPT LLM family,” in Proc. Int. Semantic Web Conf., pp. 348–367, 2023.
[19] M. M. Lucas, J. Yang, J. K. Pomeroy, and C. C. Yang, “Reasoning with large language models for medical question answering,” Journal of the American Medical Informatics Association, vol. 31, no. 9, pp. 1964–1975, 2024. doi: 10.1093/jamia/ocae131
[20] L. Liao, G. H. Yang, and C. Shah, “Proactive conversational agents in the post-ChatGPT world,” in Proc. 46th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 3452–3455, 2023.
[21] D. Choi, S. Lee, S.-I. Kim, K. Lee, H. J. Yoo, S. Lee, and H. Hong, “Unlock life with a Chat(GPT): Integrating conversational AI with large language models into everyday lives of autistic individuals,” in Proc. 2024 CHI Conf. on Human Factors in Computing Systems, pp. 1–17, 2024.
[22] A. Casheekar, A. Lahiri, K. Rath, K. S. Prabhakar, and K. Srinivasan, “A contemporary review on chatbots, AI-powered virtual conversational agents, ChatGPT: Applications, open challenges and future research directions,” Computer Science Review, vol. 52, p. 100632, 2024. doi: 10.1016/j.cosrev.2024.100632
[23] Y. Dong, X. Jiang, Z. Jin, and G. Li, “Self-collaboration code generation via ChatGPT,” ACM Trans. Software Engineering and Methodology, vol. 33, no. 7, pp. 1–38, 2024.
[24] J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation,” in Advances in Neural Information Processing Systems, vol. 36, 2023, pp. 21558–21572.
[25] Gemini Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini: A family of highly capable multimodal models,” arXiv preprint arXiv: 2312.11805, 2023.
[26] “Introducing Claude 2.1,” https://www.anthropic.com/news/claude-2-1.
[27] Meta AI, “Introducing Llama 3.1: Our most capable models to date,” Meta AI Blog, 2024.
[28] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., “Mistral 7B,” arXiv preprint arXiv: 2310.06825, 2023.
[29] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He, “A comprehensive survey on transfer learning,” Proceedings of the IEEE, vol. 109, no. 1, pp. 43–76, 2020.
[30] P. Xu, X. Zhu, and D. A. Clifton, “Multimodal learning with transformers: A survey,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 45, no. 10, pp. 12113–12132, 2023. doi: 10.1109/TPAMI.2023.3275156
[31] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 technical report,” arXiv preprint arXiv: 2303.08774, 2023.
[32] “DALL·E 2,” https://openai.com/dall-e-2/.
[33] X. Chen, Z. Wu, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, and C. Ruan, “Janus-Pro: Unified multimodal understanding and generation with data and model scaling,” arXiv preprint arXiv: 2501.17811, 2025.
[34] “Introducing OpenAI o1,” https://openai.com/o1/.
[35] P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui, “Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations,” in Proc. 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 9426–9439.
[36] “Grok 3 beta — the age of reasoning agents,” https://x.ai/blog/grok-3.
[37] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., “DeepSeek-V3 technical report,” arXiv preprint arXiv: 2412.19437, 2024.
[38] X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu et al., “DeepSeek LLM: Scaling open-source language models with longtermism,” arXiv preprint arXiv: 2401.02954, 2024.
[39] D. Dai, C. Deng, C. Zhao, R. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y. Wu et al., “DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models,” arXiv preprint arXiv: 2401.06066, 2024.
[40] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu et al., “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,” arXiv preprint arXiv: 2402.03300, 2024.
[41] A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo et al., “DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model,” arXiv preprint arXiv: 2405.04434, 2024.
[42] P. Blunsom, “Hidden Markov models,” Lecture notes, August, vol. 15, no. 18-19, p. 48, 2004.
[43] T. Brants, A. Popat, P. Xu, F. J. Och, and J. Dean, “Large language models in machine translation,” in Proc. 2007 Joint Conf. on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007, pp. 858–867.
[44] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proc. Advances in Neural Information Processing Systems (NIPS), vol. 26, 2013.
[45] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in Proc. 2014 Conf. on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[46] A. Sherstinsky, “Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network,” Physica D: Nonlinear Phenomena, vol. 404, p. 132306, 2020. doi: 10.1016/j.physd.2019.132306
[47] S. Meshram and M. Anand Kumar, “Long short-term memory network for learning sentences similarity using deep contextual embeddings,” Int. Journal of Information Technology, vol. 13, no. 4, pp. 1633–1641, 2021. doi: 10.1007/s41870-021-00686-y
[48] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv: 1907.11692, 2019.
[49] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter,” arXiv preprint arXiv: 1910.01108, 2019.
[50] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, “BioBERT: A pre-trained biomedical language representation model for biomedical text mining,” Bioinformatics, vol. 36, no. 4, pp. 1234–1240, 2020. doi: 10.1093/bioinformatics/btz682
[51] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “ALBERT: A lite BERT for self-supervised learning of language representations,” arXiv preprint arXiv: 1909.11942, 2019.
[52] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma et al., “A general language assistant as a laboratory for alignment,” arXiv preprint arXiv: 2112.00861, 2021.
[53] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, J. Liu, X. Chen, Y. Zhao, Y. Lu et al., “ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation,” arXiv preprint arXiv: 2107.02137, 2021.
[54] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja et al., “Multitask prompted training enables zero-shot task generalization,” arXiv preprint arXiv: 2110.08207, 2021.
[55] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[56] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv: 2107.03374, 2021.
[57] Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, L. Shen, Z. Wang, A. Wang, Y. Li et al., “CodeGeeX: A pre-trained model for code generation with multilingual benchmarking on HumanEval-X,” in Proc. 29th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining, 2023, pp. 5673–5684.
[58] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv: 2302.13971, 2023.
[59] S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff et al., “Pythia: A suite for analyzing large language models across training and scaling,” in Proc. Int. Conf. on Machine Learning. PMLR, 2023, pp. 2397–2430.
[60] Gemini Team, P. Georgiev, V. I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv preprint arXiv: 2403.05530, 2024.
[61] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.
[62] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” in Proc. Advances in Neural Information Processing Systems (NIPS), vol. 33, 2020, pp. 1877–1901.
[63] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young et al., “Scaling language models: Methods, analysis & insights from training Gopher,” arXiv preprint arXiv: 2112.11446, 2021.
[64] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé et al., “BLOOM: A 176B-parameter open-access multilingual language model,” arXiv preprint arXiv: 2211.05100, 2022.
[65] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark et al., “Training compute-optimal large language models,” arXiv preprint arXiv: 2203.15556, 2022.
[66] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat et al., “GLaM: Efficient scaling of language models with mixture-of-experts,” in Proc. Int. Conf. on Machine Learning. PMLR, 2022, pp. 5547–5569.
[67] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du et al., “LaMDA: Language models for dialog applications,” arXiv preprint arXiv: 2201.08239, 2022.
[68] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti et al., “Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model,” arXiv preprint arXiv: 2201.11990, 2022.
[69] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin et al., “OPT: Open pre-trained transformer language models,” arXiv preprint arXiv: 2205.01068, 2022.
[70] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, and G. Mann, “BloombergGPT: A large language model for finance,” arXiv preprint arXiv: 2303.17564, 2023.
[71] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The Llama 3 herd of models,” arXiv preprint arXiv: 2407.21783, 2024.
[72] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv: 2309.16609, 2023.
[73] Team GLM, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin, D. Zhang, D. Rojas, G. Feng, H. Zhao et al., “ChatGLM: A family of large language models from GLM-130B to GLM-4 all tools,” arXiv preprint arXiv: 2406.12793, 2024.
[74] Kimi Team, A. Du, B. Gao, B. Xing, C. Jiang, C. Chen, C. Li, C. Xiao, C. Du, C. Liao et al., “Kimi k1.5: Scaling reinforcement learning with LLMs,” arXiv preprint arXiv: 2501.12599, 2025.
[75] Gemma Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love et al., “Gemma: Open models based on Gemini research and technology,” arXiv preprint arXiv: 2403.08295, 2024.
[76] “Grok-2 beta release,” https://x.ai/blog/grok-2.
[77] R. Vavekanand and K. Sam, “Llama 3.1: An in-depth analysis of the next-generation large language model,” 2024.
[78] Y. Wang, H. Wu, J. Zhang, Z. Gao, J. Wang, P. S. Yu, and M. Long, “PredRNN: A recurrent neural network for spatiotemporal predictive learning,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 45, no. 2, pp. 2208–2225, 2022.
[79] S. Liu, I. Ni’mah, V. Menkovski, D. C. Mocanu, and M. Pechenizkiy, “Efficient and effective training of sparse recurrent neural networks,” Neural Computing and Applications, vol. 33, pp. 9625–9636, 2021. doi: 10.1007/s00521-021-05727-y
[80] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[81] J. Wang, “A tutorial on LLM reasoning: Relevant methods behind ChatGPT o1,” arXiv preprint arXiv: 2502.10867, 2025.
[82] E. Zelikman, Y. Wu, J. Mu, and N. Goodman, “STaR: Bootstrapping reasoning with reasoning,” Advances in Neural Information Processing Systems, vol. 35, pp. 15476–15488, 2022.
[83] X. Feng, Z. Wan, M. Wen, S. M. McAleer, Y. Wen, W. Zhang, and J. Wang, “AlphaZero-like tree-search can guide large language model decoding and training,” arXiv preprint arXiv: 2309.17179, 2023.
[84] L. Luo, Y. Liu, R. Liu, S. Phatale, H. Lara, Y. Li, L. Shu, Y. Zhu, L. Meng, J. Sun et al., “Improve mathematical reasoning in language models by automated process supervision,” arXiv preprint arXiv: 2406.06592, 2024.
[85] Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J.-G. Lou, and W. Chen, “Making large language models better reasoners with step-aware verifier,” arXiv preprint arXiv: 2206.02336, 2022.
[86] Y. Wu, Z. Sun, S. Li, S. Welleck, and Y. Yang, “An empirical analysis of compute-optimal inference for problem-solving with language models,” arXiv preprint, 2024.
[87] C. Snell, J. Lee, K. Xu, and A. Kumar, “Scaling LLM test-time compute optimally can be more effective than scaling model parameters,” arXiv preprint arXiv: 2408.03314, 2024.
[88] R. Bellman, “Dynamic programming and stochastic control processes,” Information and Control, vol. 1, no. 3, pp. 228–239, 1958. doi: 10.1016/S0019-9958(58)80003-0
[89] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
[90] F. Christianos, G. Papoudakis, M. Zimmer, T. Coste, Z. Wu, J. Chen, K. Khandelwal, J. Doran, X. Feng, J. Liu et al., “Pangu-Agent: A fine-tunable generalist agent with structured reasoning,” arXiv preprint arXiv: 2312.14878, 2023.
[91] Qwen Team, “Qwen2.5: A party of foundation models,” September 2024. [Online]. Available: https://qwenlm.github.io/blog/qwen2.5/
[92] Z. Wu, X. Chen, Z. Pan, X. Liu, W. Liu, D. Dai, H. Gao, Y. Ma, C. Wu, B. Wang et al., “DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding,” arXiv preprint arXiv: 2412.10302, 2024.
[93] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[94] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” in Proc. IEEE/CVF Int. Conf. on Computer Vision, 2023, pp. 11975–11986.
[95] C. Wu, X. Chen, Z. Wu, Y. Ma, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, C. Ruan et al., “Janus: Decoupling visual encoding for unified multimodal understanding and generation,” arXiv preprint arXiv: 2410.13848, 2024.
[96] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “RoFormer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024. doi: 10.1016/j.neucom.2023.127063
[97] J. Cai and Y. Chen, “MHA-Net: Multipath hybrid attention network for building footprint extraction from high-resolution remote sensing imagery,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 14, pp. 5807–5817, 2021. doi: 10.1109/JSTARS.2021.3084805
[98] D. Xiao, Q. Meng, S. Li, and X. Yuan, “Improving transformers with dynamically composable multi-head attention,” arXiv preprint arXiv: 2405.08553, 2024.
[99] L. Shi, H. Zhang, Y. Yao, Z. Li, and H. Zhao, “Keep the cost down: A review on methods to optimize LLM’s KV-cache consumption,” arXiv preprint arXiv: 2407.18003, 2024.
[100] N. Shazeer, “Fast transformer decoding: One write-head is all you need,” arXiv preprint arXiv: 1911.02150, 2019.
[101] W. Brandon, M. Mishra, A. Nrusimha, R. Panda, and J. Ragan-Kelley, “Reducing transformer key-value cache size with cross-layer attention,” Advances in Neural Information Processing Systems, vol. 37, pp. 86927–86957, 2025.
[102] J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai, “GQA: Training generalized multi-query transformer models from multi-head checkpoints,” arXiv preprint arXiv: 2305.13245, 2023.
[103] V. Joshi, P. Laddha, S. Sinha, O. J. Omer, and S. Subramoney, “QCQA: Quality and capacity-aware grouped query attention,” arXiv preprint arXiv: 2406.10247, 2024.
[104] Z. Zhang, Y. Lin, Z. Liu, P. Li, M. Sun, and J. Zhou, “MoEfication: Transformer feed-forward layers are mixtures of experts,” arXiv preprint arXiv: 2110.01786, 2021.
[105] L. Wang, H. Gao, C. Zhao, X. Sun, and D. Dai, “Auxiliary-loss-free load balancing strategy for mixture-of-experts,” arXiv preprint arXiv: 2408.15664, 2024.
[106] S. Shi, X. Pan, Q. Wang, C. Liu, X. Ren, Z. Hu, Y. Yang, B. Li, and X. Chu, “ScheMoE: An extensible mixture-of-experts distributed training system with task scheduling,” in Proc. 19th European Conf. on Computer Systems, pp. 236–249, 2024.
[107] M. Qorib, G. Moon, and H. T. Ng, “Are decoder-only language models better than encoder-only language models in understanding word meaning?” in Findings of the Association for Computational Linguistics: ACL 2024, pp. 16339–16347, 2024.
[108] W. Xu, W. Hu, F. Wu, and S. Sengamedu, “DeTiME: Diffusion-enhanced topic modeling using encoder-decoder based LLM,” arXiv preprint arXiv: 2310.15296, 2023.
[109] H. You, Y. Fu, Z. Wang, A. Yazdanbakhsh, and Y. C. Lin, “When linear attention meets autoregressive decoding: Towards more effective and efficient linearized large language models,” arXiv preprint arXiv: 2406.07368, 2024.
[110] L. Guo, Y. Fang, F. Chen, P. Liu, and S. Xu, “Large language models with adaptive token fusion: A novel approach to reducing hallucinations and improving inference efficiency,” Authorea, pp. 1–10, 2024.
[111] K. Yue, B.-C. Chen, J. Geiping, H. Li, T. Goldstein, and S.-N. Lim, “Object recognition as next token prediction,” in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition, pp. 16645–16656, 2024.
[112] F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve, “Better & faster large language models via multi-token prediction,” arXiv preprint arXiv: 2404.19737, 2024.
[113] G. Bachmann and V. Nagarajan, “The pitfalls of next-token prediction,” arXiv preprint arXiv: 2403.06963, 2024.
[114] S. Tuli, C.-H. Lin, Y.-C. Hsu, N. K. Jha, Y. Shen, and H. Jin, “DynaMo: Accelerating language model inference with dynamic multi-token sampling,” arXiv preprint arXiv: 2405.00888, 2024.
[115] J. Zhang, Z. Zhang, S. Han, and S. Lu, “Proximal policy optimization via enhanced exploration efficiency,” Information Sciences, vol. 609, pp. 750–765, 2022. doi: 10.1016/j.ins.2022.07.111
[116] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv: 1707.06347, 2017.
[117] Y. Wang, H. He, and X. Tan, “Truly proximal policy optimization,” in Proc. 35th Uncertainty in Artificial Intelligence Conf. PMLR, 2020, pp. 113–122.
[118] Y. Gu, Y. Cheng, C. P. Chen, and X. Wang, “Proximal policy optimization with policy feedback,” IEEE Trans. Systems, Man, and Cybernetics: Systems, vol. 52, no. 7, pp. 4600–4610, 2021.
[119] S. S. Ramesh, Y. Hu, I. Chaimalas, V. Mehta, P. G. Sessa, H. B. Ammar, and I. Bogunovic, “Group robust preference optimization in reward-free RLHF,” arXiv preprint arXiv: 2405.20304, 2024.
[120] G. Cui, L. Yuan, Z. Wang, H. Wang, W. Li, B. He, Y. Fan, T. Yu, Q. Xu, W. Chen et al., “Process reinforcement through implicit rewards,” arXiv preprint arXiv: 2502.01456, 2025.
[121] S. Sane, “Hybrid group relative policy optimization: A multi-sample approach to enhancing policy optimization,” arXiv preprint arXiv: 2502.01652, 2025.
[122] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv preprint arXiv: 2001.08361, 2020.
[123] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “GPT3.int8(): 8-bit matrix multiplication for transformers at scale,” Advances in Neural Information Processing Systems, vol. 35, pp. 30318–30332, 2022.
[124] B. Noune, P. Jones, D. Justus, D. Masters, and C. Luschi, “8-bit numerical formats for deep neural networks,” arXiv preprint arXiv: 2206.02915, 2022.
[125] H. Peng, K. Wu, Y. Wei, G. Zhao, Y. Yang, Z. Liu, Y. Xiong, Z. Yang, B. Ni, J. Hu et al., “FP8-LM: Training FP8 large language models,” arXiv preprint arXiv: 2310.18313, 2023.
[126] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “GPTQ: Accurate post-training quantization for generative pre-trained transformers,” arXiv preprint arXiv: 2210.17323, 2022.
[127] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “SmoothQuant: Accurate and efficient post-training quantization for large language models,” in Proc. Int. Conf. on Machine Learning. PMLR, 2023, pp. 38087–38099.
[128] M. Fishman, B. Chmiel, R. Banner, and D. Soudry, “Scaling FP8 training to trillion-token LLMs,” arXiv preprint arXiv: 2409.12517, 2024.
[129] B. He, L. Noci, D. Paliotta, I. Schlag, and T. Hofmann, “Understanding and minimising outlier features in transformer training,” in Proc. 38th Annual Conf. on Neural Information Processing Systems, 2024.
[130] M. Sun, X. Chen, J. Z. Kolter, and Z. Liu, “Massive activations in large language models,” arXiv preprint arXiv: 2402.17762, 2024.
[131] C. Zhao, L. Zhao, J. Li, and Z. Xu, “DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling,” https://github.com/deepseek-ai/DeepGEMM, 2025.
[132] J. Li, “FlashMLA: Efficient MLA decoding kernel,” https://github.com/deepseek-ai/FlashMLA, 2025.
[133] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen, “GShard: Scaling giant models with conditional computation and automatic sharding,” arXiv preprint arXiv: 2006.16668, 2020.
[134] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” arXiv preprint arXiv: 1701.06538, 2017.
[135] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022.
[136] P. Qi, X. Wan, G. Huang, and M. Lin, “Zero bubble pipeline parallelism,” arXiv preprint arXiv: 2401.10241, 2023.
[137] C. Zhao, S. Zhou, L. Zhang, C. Deng, Z. Xu, Y. Liu, K. Yu, J. Li, and L. Zhao, “DeepEP: an efficient expert-parallel communication library,” https://github.com/deepseek-ai/DeepEP, 2025.
[138] S. Hao, S. Sukhbaatar, D. Su, X. Li, Z. Hu, J. Weston, and Y. Tian, “Training large language models to reason in a continuous latent space,” arXiv preprint arXiv: 2412.06769, 2024.
[139] F. Jiang, Z. Xu, Y. Li, L. Niu, Z. Xiang, B. Li, B. Y. Lin, and R. Poovendran, “SafeChain: Safety of language models with long chain-of-thought reasoning capabilities,” arXiv preprint arXiv: 2502.12025, 2025.
[140] Z. Ying, D. Zhang, Z. Jing, Y. Xiao, Q. Zou, A. Liu, S. Liang, X. Zhang, X. Liu, and D. Tao, “Reasoning-augmented conversation for multi-turn jailbreak attacks on large language models,” arXiv preprint arXiv: 2502.11054, 2025.
[141] N. Jones, “OpenAI’s ‘deep research’ tool: Is it useful for scientists?” Nature, 2025.
[142] “Introducing Operator,” https://openai.com/index/introducing-operator/.