A journal of IEEE and CAA that publishes high-quality papers in English on original theoretical/experimental research and development in all areas of automation.
Volume 13, Issue 1, Jan. 2026

IEEE/CAA Journal of Automatica Sinica

L. Chen, L. Wang, L. Jin, and J. Wang, “LMAdam: Enhancing Adam via linear multistep discretization,” IEEE/CAA J. Autom. Sinica, vol. 13, no. 1, pp. 161–169, Jan. 2026. doi: 10.1109/JAS.2025.125834

LMAdam: Enhancing Adam via Linear Multistep Discretization

doi: 10.1109/JAS.2025.125834
Funds: This work was supported in part by the National Natural Science Foundation of China (62506148 and 62476115), the Fundamental Research Funds for the Central Universities (lzujbky-2025-pd05 and lzujbky-2025-ytB01), the Research Grants Council of the Hong Kong Special Administrative Region of China (AoE/E-407/24-N and C1013-24G), the Postdoctoral Fellowship Program (Grade C) of China Postdoctoral Science Foundation (GZC20251039), and the Supercomputing Center of Lanzhou University.
  • In this paper, we propose a learning algorithm termed linear multistep adaptive moment (LMAdam) to enhance the adaptive moment (Adam) algorithm for machine learning. Viewing Adam as a single-step discretization of its continuous counterpart, we develop LMAdam based on a linear multistep discretization scheme. We design a feedforward neural network to learn the coefficients of the multistep terms with ensured consistency, and we select the coefficients to ensure zero stability of the multistep terms. Extensive experiments on benchmark datasets, covering the training of various deep neural networks in three applications, demonstrate the superiority of LMAdam. A minimal illustrative sketch of this discretization view follows the abstract.
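To make the single-step versus multistep contrast concrete, below is a minimal NumPy sketch (an illustration only, not the paper's implementation): adam_step is the standard Adam update, which advances from the latest iterate alone, while multistep_adam_step is a hypothetical two-step variant whose fixed coefficients (1.5, -0.5) are assumed purely for illustration; LMAdam instead learns such coefficients with a feedforward network under consistency and zero-stability constraints.

import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam update: a single-step discretization advancing from the latest iterate only."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

def multistep_adam_step(theta_hist, m, v, grad, t, coeffs=(1.5, -0.5),
                        lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Hypothetical two-step variant: the new iterate combines the two most recent
    iterates before subtracting the Adam step. The fixed coefficients here are
    illustrative; LMAdam learns them subject to stability constraints."""
    assert abs(sum(coeffs) - 1.0) < 1e-12, "consistency: coefficients must sum to 1"
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / (np.sqrt(v_hat) + eps)
    theta_new = coeffs[0] * theta_hist[-1] + coeffs[1] * theta_hist[-2] - step
    return theta_new, m, v

A caller keeps the two most recent iterates in theta_hist and appends each new one; with coeffs=(1.0, 0.0) the multistep update reduces to standard Adam.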

     

