Citation: L. Xiong, H. Wang, X. Chen, L. Sheng, Y. Xiong, J. Liu, Y. Xiao, H. Chen, Q.-L. Han, and Y. Tang, “DeepSeek: Paradigm shifts and technical evolution in large AI models,” IEEE/CAA J. Autom. Sinica, vol. 12, no. 5, pp. 841–858, May 2025. doi: 10.1109/JAS.2025.125495
[1] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi et al., “DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning,” arXiv preprint arXiv: 2501.12948, 2025.
[2] E. Gibney, “China’s cheap, open AI model DeepSeek thrills scientists,” Nature, vol. 638, no. 8049, pp. 13–14, 2025. doi: 10.1038/d41586-025-00229-6
[3] C. Metz, “What to know about DeepSeek and how it is upending A.I.,” The New York Times, https://www.nytimes.com/2025/01/27/technology/what-is-deepseek-china-ai.html (accessed on 01/27/2025).
[4] A. Picchi, “What is DeepSeek, and why is it causing Nvidia and other stocks to slump?” CBS News, https://www.cbsnews.com/news/what-is-deepseek-ai-china-stock-nvidia-nvda-asml/ (accessed on 01/28/2025).
[5] T. Wu, S. He, J. Liu, S. Sun, K. Liu, Q.-L. Han, and Y. Tang, “A brief overview of ChatGPT: The history, status quo and potential future development,” IEEE/CAA Journal of Automatica Sinica, vol. 10, no. 5, pp. 1122–1136, 2023. doi: 10.1109/JAS.2023.123618
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv: 1810.04805, 2018.
[7] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
[8] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in Proc. Int. Conf. on Machine Learning (ICML). PMLR, 2021, pp. 8748–8763.
[9] J. Li, T. Tang, W. X. Zhao, J.-Y. Nie, and J.-R. Wen, “Pre-trained language models for text generation: A survey,” ACM Computing Surveys, vol. 56, no. 9, pp. 1–39, 2024.
[10] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heintz, and D. Roth, “Recent advances in natural language processing via large pre-trained language models: A survey,” ACM Computing Surveys, vol. 56, no. 2, pp. 1–40, 2023.
[11] N. Karanikolas, E. Manga, N. Samaridi, E. Tousidou, and M. Vassilakopoulos, “Large language models versus natural language understanding and generation,” in Proc. 27th Pan-Hellenic Conf. on Progress in Computing and Informatics, pp. 278–290, 2023.
[12] R. Sarikaya, G. E. Hinton, and A. Deoras, “Application of deep belief networks for natural language understanding,” IEEE/ACM Trans. Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 778–784, 2014. doi: 10.1109/TASLP.2014.2303296
[13] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “PaLM: Scaling language modeling with pathways,” Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.
[14] M. Nadeem, S. S. Sohail, L. Javed, F. Anwer, A. K. J. Saudagar, and K. Muhammad, “Vision-enabled large language and deep learning models for image-based emotion recognition,” Cognitive Computation, vol. 16, no. 5, pp. 2566–2579, 2024. doi: 10.1007/s12559-024-10281-5
[15] K. Bayoudh, R. Knani, F. Hamdaoui, and A. Mtibaa, “A survey on deep multimodal learning for computer vision: Advances, trends, applications, and datasets,” The Visual Computer, vol. 38, no. 8, pp. 2939–2970, 2022. doi: 10.1007/s00371-021-02166-7
[16] R. Archana and P. S. E. Jeevaraj, “Deep learning models for digital image processing: A review,” Artificial Intelligence Review, vol. 57, no. 1, p. 11, 2024. doi: 10.1007/s10462-023-10631-z
[17] “AI Action Summit (10 and 11 February 2025),” https://onu.delegfrance.org/ai-action-summit-10-and-11-february-2025.
[18] Y. Tan, D. Min, Y. Li, W. Li, N. Hu, Y. Chen, and G. Qi, “Can ChatGPT replace traditional KBQA models? An in-depth analysis of the question answering performance of the GPT LLM family,” in Proc. Int. Semantic Web Conf., pp. 348–367, 2023.
[19] M. M. Lucas, J. Yang, J. K. Pomeroy, and C. C. Yang, “Reasoning with large language models for medical question answering,” Journal of the American Medical Informatics Association, vol. 31, no. 9, pp. 1964–1975, 2024. doi: 10.1093/jamia/ocae131
[20] L. Liao, G. H. Yang, and C. Shah, “Proactive conversational agents in the post-ChatGPT world,” in Proc. 46th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 3452–3455, 2023.
[21] D. Choi, S. Lee, S.-I. Kim, K. Lee, H. J. Yoo, S. Lee, and H. Hong, “Unlock life with a Chat(GPT): Integrating conversational AI with large language models into everyday lives of autistic individuals,” in Proc. 2024 CHI Conf. on Human Factors in Computing Systems, pp. 1–17, 2024.
[22] A. Casheekar, A. Lahiri, K. Rath, K. S. Prabhakar, and K. Srinivasan, “A contemporary review on chatbots, AI-powered virtual conversational agents, ChatGPT: Applications, open challenges and future research directions,” Computer Science Review, vol. 52, p. 100632, 2024. doi: 10.1016/j.cosrev.2024.100632
[23] Y. Dong, X. Jiang, Z. Jin, and G. Li, “Self-collaboration code generation via ChatGPT,” ACM Trans. Software Engineering and Methodology, vol. 33, no. 7, pp. 1–38, 2024.
[24] J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation,” in Advances in Neural Information Processing Systems, vol. 36, 2023, pp. 21558–21572.
[25] Gemini Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., “Gemini: A family of highly capable multimodal models,” arXiv preprint arXiv: 2312.11805, 2023.
[26] “Introducing Claude 2.1,” https://www.anthropic.com/news/claude-2-1.
[27] Meta AI, “Introducing Llama 3.1: Our most capable models to date,” Meta AI Blog, 2024.
[28] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., “Mistral 7B,” arXiv preprint arXiv: 2310.06825, 2023.
[29] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He, “A comprehensive survey on transfer learning,” Proceedings of the IEEE, vol. 109, no. 1, pp. 43–76, 2020.
[30] P. Xu, X. Zhu, and D. A. Clifton, “Multimodal learning with transformers: A survey,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 45, no. 10, pp. 12113–12132, 2023. doi: 10.1109/TPAMI.2023.3275156
[31] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 technical report,” arXiv preprint arXiv: 2303.08774, 2023.
[32] “DALL·E 2,” https://openai.com/dall-e-2/.
[33] X. Chen, Z. Wu, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, and C. Ruan, “Janus-Pro: Unified multimodal understanding and generation with data and model scaling,” arXiv preprint arXiv: 2501.17811, 2025.
[34] “Introducing OpenAI o1,” https://openai.com/o1/.
[35] P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui, “Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations,” in Proc. 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 9426–9439.
[36] “Grok 3 beta — the age of reasoning agents,” https://x.ai/blog/grok-3.
[37] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., “DeepSeek-V3 technical report,” arXiv preprint arXiv: 2412.19437, 2024.
[38] X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu et al., “DeepSeek LLM: Scaling open-source language models with longtermism,” arXiv preprint arXiv: 2401.02954, 2024.
[39] D. Dai, C. Deng, C. Zhao, R. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y. Wu et al., “DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models,” arXiv preprint arXiv: 2401.06066, 2024.
[40] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu et al., “DeepSeekMath: Pushing the limits of mathematical reasoning in open language models,” arXiv preprint arXiv: 2402.03300, 2024.
[41] A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo et al., “DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model,” arXiv preprint arXiv: 2405.04434, 2024.
[42] P. Blunsom, “Hidden Markov models,” Lecture notes, August, vol. 15, no. 18-19, p. 48, 2004.
[43] T. Brants, A. Popat, P. Xu, F. J. Och, and J. Dean, “Large language models in machine translation,” in Proc. 2007 Joint Conf. on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007, pp. 858–867.
[44] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proc. Advances in Neural Information Processing Systems (NIPS), vol. 26, 2013.
[45] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in Proc. 2014 Conf. on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[46] A. Sherstinsky, “Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network,” Physica D: Nonlinear Phenomena, vol. 404, p. 132306, 2020. doi: 10.1016/j.physd.2019.132306
[47] S. Meshram and M. Anand Kumar, “Long short-term memory network for learning sentences similarity using deep contextual embeddings,” Int. Journal of Information Technology, vol. 13, no. 4, pp. 1633–1641, 2021. doi: 10.1007/s41870-021-00686-y
[48] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv: 1907.11692, 2019.
[49] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter,” arXiv preprint arXiv: 1910.01108, 2019.
[50] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, “BioBERT: A pre-trained biomedical language representation model for biomedical text mining,” Bioinformatics, vol. 36, no. 4, pp. 1234–1240, 2020. doi: 10.1093/bioinformatics/btz682
[51] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “ALBERT: A lite BERT for self-supervised learning of language representations,” arXiv preprint arXiv: 1909.11942, 2019.
[52] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma et al., “A general language assistant as a laboratory for alignment,” arXiv preprint arXiv: 2112.00861, 2021.
[53] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, J. Liu, X. Chen, Y. Zhao, Y. Lu et al., “ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation,” arXiv preprint arXiv: 2107.02137, 2021.
[54] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja et al., “Multitask prompted training enables zero-shot task generalization,” arXiv preprint arXiv: 2110.08207, 2021.
[55] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[56] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv: 2107.03374, 2021.
[57] Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, L. Shen, Z. Wang, A. Wang, Y. Li et al., “CodeGeeX: A pre-trained model for code generation with multilingual benchmarking on HumanEval-X,” in Proc. 29th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining, 2023, pp. 5673–5684.
[58] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv: 2302.13971, 2023.
[59] S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff et al., “Pythia: A suite for analyzing large language models across training and scaling,” in Proc. Int. Conf. on Machine Learning. PMLR, 2023, pp. 2397–2430.
[60] Gemini Team, P. Georgiev, V. I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv preprint arXiv: 2403.05530, 2024.
[61] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.
[62] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” in Proc. Advances in Neural Information Processing Systems (NIPS), vol. 33, 2020, pp. 1877–1901.
[63] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young et al., “Scaling language models: Methods, analysis & insights from training Gopher,” arXiv preprint arXiv: 2112.11446, 2021.
[64] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé et al., “BLOOM: A 176B-parameter open-access multilingual language model,” arXiv preprint arXiv: 2211.05100, 2022.
[65] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark et al., “Training compute-optimal large language models,” arXiv preprint arXiv: 2203.15556, 2022.
[66] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat et al., “GLaM: Efficient scaling of language models with mixture-of-experts,” in Proc. Int. Conf. on Machine Learning. PMLR, 2022, pp. 5547–5569.
[67] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du et al., “LaMDA: Language models for dialog applications,” arXiv preprint arXiv: 2201.08239, 2022.
[68] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti et al., “Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model,” arXiv preprint arXiv: 2201.11990, 2022.
[69] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin et al., “OPT: Open pre-trained transformer language models,” arXiv preprint arXiv: 2205.01068, 2022.
[70] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, and G. Mann, “BloombergGPT: A large language model for finance,” arXiv preprint arXiv: 2303.17564, 2023.
[71] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The Llama 3 herd of models,” arXiv preprint arXiv: 2407.21783, 2024.
[72] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv: 2309.16609, 2023.
[73] Team GLM, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin, D. Zhang, D. Rojas, G. Feng, H. Zhao et al., “ChatGLM: A family of large language models from GLM-130B to GLM-4 all tools,” arXiv preprint arXiv: 2406.12793, 2024.
[74] Kimi Team, A. Du, B. Gao, B. Xing, C. Jiang, C. Chen, C. Li, C. Xiao, C. Du, C. Liao et al., “Kimi k1.5: Scaling reinforcement learning with LLMs,” arXiv preprint arXiv: 2501.12599, 2025.
[75] Gemma Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love et al., “Gemma: Open models based on Gemini research and technology,” arXiv preprint arXiv: 2403.08295, 2024.
[76] “Grok-2 beta release,” https://x.ai/blog/grok-2.
[77] R. Vavekanand and K. Sam, “Llama 3.1: An in-depth analysis of the next-generation large language model,” 2024.
[78] Y. Wang, H. Wu, J. Zhang, Z. Gao, J. Wang, P. S. Yu, and M. Long, “PredRNN: A recurrent neural network for spatiotemporal predictive learning,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 45, no. 2, pp. 2208–2225, 2022.
[79] S. Liu, I. Ni’mah, V. Menkovski, D. C. Mocanu, and M. Pechenizkiy, “Efficient and effective training of sparse recurrent neural networks,” Neural Computing and Applications, vol. 33, pp. 9625–9636, 2021. doi: 10.1007/s00521-021-05727-y
[80] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[81] J. Wang, “A tutorial on LLM reasoning: Relevant methods behind ChatGPT o1,” arXiv preprint arXiv: 2502.10867, 2025.
[82] E. Zelikman, Y. Wu, J. Mu, and N. Goodman, “STaR: Bootstrapping reasoning with reasoning,” Advances in Neural Information Processing Systems, vol. 35, pp. 15476–15488, 2022.
[83] X. Feng, Z. Wan, M. Wen, S. M. McAleer, Y. Wen, W. Zhang, and J. Wang, “AlphaZero-like tree-search can guide large language model decoding and training,” arXiv preprint arXiv: 2309.17179, 2023.
[84] L. Luo, Y. Liu, R. Liu, S. Phatale, H. Lara, Y. Li, L. Shu, Y. Zhu, L. Meng, J. Sun et al., “Improve mathematical reasoning in language models by automated process supervision,” arXiv preprint arXiv: 2406.06592, 2024.
[85] Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J.-G. Lou, and W. Chen, “Making large language models better reasoners with step-aware verifier,” arXiv preprint arXiv: 2206.02336, 2022.
[86] Y. Wu, Z. Sun, S. Li, S. Welleck, and Y. Yang, “An empirical analysis of compute-optimal inference for problem-solving with language models,” arXiv preprint, 2024.
[87] C. Snell, J. Lee, K. Xu, and A. Kumar, “Scaling LLM test-time compute optimally can be more effective than scaling model parameters,” arXiv preprint arXiv: 2408.03314, 2024.
[88] R. Bellman, “Dynamic programming and stochastic control processes,” Information and Control, vol. 1, no. 3, pp. 228–239, 1958. doi: 10.1016/S0019-9958(58)80003-0
[89] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
[90] F. Christianos, G. Papoudakis, M. Zimmer, T. Coste, Z. Wu, J. Chen, K. Khandelwal, J. Doran, X. Feng, J. Liu et al., “Pangu-Agent: A fine-tunable generalist agent with structured reasoning,” arXiv preprint arXiv: 2312.14878, 2023.
[91] Qwen Team, “Qwen2.5: A party of foundation models,” September 2024. [Online]. Available: https://qwenlm.github.io/blog/qwen2.5/
[92] Z. Wu, X. Chen, Z. Pan, X. Liu, W. Liu, D. Dai, H. Gao, Y. Ma, C. Wu, B. Wang et al., “DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding,” arXiv preprint arXiv: 2412.10302, 2024.
[93] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[94] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” in Proc. IEEE/CVF Int. Conf. on Computer Vision, 2023, pp. 11975–11986.
[95] C. Wu, X. Chen, Z. Wu, Y. Ma, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, C. Ruan et al., “Janus: Decoupling visual encoding for unified multimodal understanding and generation,” arXiv preprint arXiv: 2410.13848, 2024.
[96] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “RoFormer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024. doi: 10.1016/j.neucom.2023.127063
[97] J. Cai and Y. Chen, “MHA-Net: Multipath hybrid attention network for building footprint extraction from high-resolution remote sensing imagery,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 14, pp. 5807–5817, 2021. doi: 10.1109/JSTARS.2021.3084805
[98] D. Xiao, Q. Meng, S. Li, and X. Yuan, “Improving transformers with dynamically composable multi-head attention,” arXiv preprint arXiv: 2405.08553, 2024.
[99] L. Shi, H. Zhang, Y. Yao, Z. Li, and H. Zhao, “Keep the cost down: A review on methods to optimize LLM’s KV-cache consumption,” arXiv preprint arXiv: 2407.18003, 2024.
[100] N. Shazeer, “Fast transformer decoding: One write-head is all you need,” arXiv preprint arXiv: 1911.02150, 2019.
[101] W. Brandon, M. Mishra, A. Nrusimha, R. Panda, and J. Ragan-Kelley, “Reducing transformer key-value cache size with cross-layer attention,” Advances in Neural Information Processing Systems, vol. 37, pp. 86927–86957, 2025.
[102] J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai, “GQA: Training generalized multi-query transformer models from multi-head checkpoints,” arXiv preprint arXiv: 2305.13245, 2023.
[103] V. Joshi, P. Laddha, S. Sinha, O. J. Omer, and S. Subramoney, “QCQA: Quality and capacity-aware grouped query attention,” arXiv preprint arXiv: 2406.10247, 2024.
[104] Z. Zhang, Y. Lin, Z. Liu, P. Li, M. Sun, and J. Zhou, “MoEfication: Transformer feed-forward layers are mixtures of experts,” arXiv preprint arXiv: 2110.01786, 2021.
[105] L. Wang, H. Gao, C. Zhao, X. Sun, and D. Dai, “Auxiliary-loss-free load balancing strategy for mixture-of-experts,” arXiv preprint arXiv: 2408.15664, 2024.
[106] S. Shi, X. Pan, Q. Wang, C. Liu, X. Ren, Z. Hu, Y. Yang, B. Li, and X. Chu, “ScheMoE: An extensible mixture-of-experts distributed training system with task scheduling,” in Proc. 19th European Conf. on Computer Systems, pp. 236–249, 2024.
[107] M. Qorib, G. Moon, and H. T. Ng, “Are decoder-only language models better than encoder-only language models in understanding word meaning?” in Findings of the Association for Computational Linguistics: ACL 2024, pp. 16339–16347, 2024.
[108] W. Xu, W. Hu, F. Wu, and S. Sengamedu, “DeTiME: Diffusion-enhanced topic modeling using encoder-decoder based LLM,” arXiv preprint arXiv: 2310.15296, 2023.
[109] H. You, Y. Fu, Z. Wang, A. Yazdanbakhsh, and Y. C. Lin, “When linear attention meets autoregressive decoding: Towards more effective and efficient linearized large language models,” arXiv preprint arXiv: 2406.07368, 2024.
[110] L. Guo, Y. Fang, F. Chen, P. Liu, and S. Xu, “Large language models with adaptive token fusion: A novel approach to reducing hallucinations and improving inference efficiency,” Authorea, pp. 1–10, 2024.
[111] K. Yue, B.-C. Chen, J. Geiping, H. Li, T. Goldstein, and S.-N. Lim, “Object recognition as next token prediction,” in Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition, pp. 16645–16656, 2024.
[112] F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve, “Better & faster large language models via multi-token prediction,” arXiv preprint arXiv: 2404.19737, 2024.
[113] G. Bachmann and V. Nagarajan, “The pitfalls of next-token prediction,” arXiv preprint arXiv: 2403.06963, 2024.
[114] S. Tuli, C.-H. Lin, Y.-C. Hsu, N. K. Jha, Y. Shen, and H. Jin, “DynaMo: Accelerating language model inference with dynamic multi-token sampling,” arXiv preprint arXiv: 2405.00888, 2024.
[115] J. Zhang, Z. Zhang, S. Han, and S. Lu, “Proximal policy optimization via enhanced exploration efficiency,” Information Sciences, vol. 609, pp. 750–765, 2022. doi: 10.1016/j.ins.2022.07.111
[116] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv: 1707.06347, 2017.
[117] Y. Wang, H. He, and X. Tan, “Truly proximal policy optimization,” in Proc. 35th Uncertainty in Artificial Intelligence Conf. PMLR, 2020, pp. 113–122.
[118] Y. Gu, Y. Cheng, C. P. Chen, and X. Wang, “Proximal policy optimization with policy feedback,” IEEE Trans. Systems, Man, and Cybernetics: Systems, vol. 52, no. 7, pp. 4600–4610, 2021.
[119] S. S. Ramesh, Y. Hu, I. Chaimalas, V. Mehta, P. G. Sessa, H. B. Ammar, and I. Bogunovic, “Group robust preference optimization in reward-free RLHF,” arXiv preprint arXiv: 2405.20304, 2024.
[120] G. Cui, L. Yuan, Z. Wang, H. Wang, W. Li, B. He, Y. Fan, T. Yu, Q. Xu, W. Chen et al., “Process reinforcement through implicit rewards,” arXiv preprint arXiv: 2502.01456, 2025.
[121] S. Sane, “Hybrid group relative policy optimization: A multi-sample approach to enhancing policy optimization,” arXiv preprint arXiv: 2502.01652, 2025.
[122] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv preprint arXiv: 2001.08361, 2020.
[123] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “GPT3.int8(): 8-bit matrix multiplication for transformers at scale,” Advances in Neural Information Processing Systems, vol. 35, pp. 30318–30332, 2022.
[124] B. Noune, P. Jones, D. Justus, D. Masters, and C. Luschi, “8-bit numerical formats for deep neural networks,” arXiv preprint arXiv: 2206.02915, 2022.
[125] H. Peng, K. Wu, Y. Wei, G. Zhao, Y. Yang, Z. Liu, Y. Xiong, Z. Yang, B. Ni, J. Hu et al., “FP8-LM: Training FP8 large language models,” arXiv preprint arXiv: 2310.18313, 2023.
[126] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “GPTQ: Accurate post-training quantization for generative pre-trained transformers,” arXiv preprint arXiv: 2210.17323, 2022.
[127] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “SmoothQuant: Accurate and efficient post-training quantization for large language models,” in Proc. Int. Conf. on Machine Learning. PMLR, 2023, pp. 38087–38099.
[128] M. Fishman, B. Chmiel, R. Banner, and D. Soudry, “Scaling FP8 training to trillion-token LLMs,” arXiv preprint arXiv: 2409.12517, 2024.
[129] B. He, L. Noci, D. Paliotta, I. Schlag, and T. Hofmann, “Understanding and minimising outlier features in transformer training,” in Proc. 38th Annual Conf. on Neural Information Processing Systems, 2024.
[130] M. Sun, X. Chen, J. Z. Kolter, and Z. Liu, “Massive activations in large language models,” arXiv preprint arXiv: 2402.17762, 2024.
[131] C. Zhao, L. Zhao, J. Li, and Z. Xu, “DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling,” https://github.com/deepseek-ai/DeepGEMM, 2025.
[132] J. Li, “FlashMLA: Efficient MLA decoding kernel,” https://github.com/deepseek-ai/FlashMLA, 2025.
[133] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen, “GShard: Scaling giant models with conditional computation and automatic sharding,” arXiv preprint arXiv: 2006.16668, 2020.
[134] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” arXiv preprint arXiv: 1701.06538, 2017.
[135] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022.
[136] P. Qi, X. Wan, G. Huang, and M. Lin, “Zero bubble pipeline parallelism,” arXiv preprint arXiv: 2401.10241, 2023.
[137] C. Zhao, S. Zhou, L. Zhang, C. Deng, Z. Xu, Y. Liu, K. Yu, J. Li, and L. Zhao, “DeepEP: an efficient expert-parallel communication library,” https://github.com/deepseek-ai/DeepEP, 2025.
[138] S. Hao, S. Sukhbaatar, D. Su, X. Li, Z. Hu, J. Weston, and Y. Tian, “Training large language models to reason in a continuous latent space,” arXiv preprint arXiv: 2412.06769, 2024.
[139] F. Jiang, Z. Xu, Y. Li, L. Niu, Z. Xiang, B. Li, B. Y. Lin, and R. Poovendran, “SafeChain: Safety of language models with long chain-of-thought reasoning capabilities,” arXiv preprint arXiv: 2502.12025, 2025.
[140] Z. Ying, D. Zhang, Z. Jing, Y. Xiao, Q. Zou, A. Liu, S. Liang, X. Zhang, X. Liu, and D. Tao, “Reasoning-augmented conversation for multi-turn jailbreak attacks on large language models,” arXiv preprint arXiv: 2502.11054, 2025.
[141] N. Jones, “OpenAI’s ‘deep research’ tool: Is it useful for scientists?” Nature, 2025.
[142] “Introducing Operator,” https://openai.com/index/introducing-operator/.