IEEE/CAA Journal of Automatica Sinica
Citation: X. L. Yi, S. J. Zhang, T. Yang, T. Y. Chai, and K. H. Johansson, “A primal-dual SGD algorithm for distributed nonconvex optimization,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 5, pp. 812–833, May 2022. doi: 10.1109/JAS.2022.105554
[1] 
J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng, “Large scale distributed deep networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1223–1231.

[2] 
H. B. McMahan, E. Moore, D. Ramage, et al., “Communication-efficient learning of deep networks from decentralized data,” in Proc. Int. Conf. Artificial Intelligence and Statistics, 2017, pp. 1273–1282.

[3] 
T. Tatarenko and B. Touri, “Nonconvex distributed optimization,” IEEE Trans. Automatic Control, vol. 62, no. 8, pp. 3744–3757, 2017. doi: 10.1109/TAC.2017.2648041

[4] 
M. Hong, D. Hajinezhad, and M.-M. Zhao, “Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks,” in Proc. Int. Conf. Machine Learning, 2017, pp. 1529–1538.

[5] 
A. Daneshmand, G. Scutari, and V. Kungurtsev, “Second-order guarantees of gradient algorithms over networks,” in Proc. Annual Allerton Conf. Communication, Control, and Computing, 2018, pp. 359–365.

[6] 
H. Sun and M. Hong, “Distributed nonconvex first-order optimization and information processing: Lower complexity bounds and rate optimal algorithms,” arXiv preprint arXiv:1804.02729, 2018.

[7] 
D. Hajinezhad and M. Hong, “Perturbed proximal primal-dual algorithm for nonconvex nonsmooth optimization,” Mathematical Programming, vol. 176, no. 1, pp. 207–245, 2019.

[8] 
H. Sun and M. Hong, “Distributed nonconvex first-order optimization and information processing: Lower complexity bounds and rate optimal algorithms,” IEEE Trans. Signal Processing, vol. 67, no. 22, pp. 5912–5928, 2019. doi: 10.1109/TSP.2019.2943230

[9] 
H.-T. Wai, J. Lafond, A. Scaglione, and E. Moulines, “Decentralized Frank-Wolfe algorithm for convex and nonconvex problems,” IEEE Trans. Automatic Control, vol. 62, no. 11, pp. 5522–5537, 2017. doi: 10.1109/TAC.2017.2685559

[10] 
J. Langford, L. Li, and T. Zhang, “Sparse online learning via truncated gradient,” Journal of Machine Learning Research, vol. 10, no. 3, pp. 777–801, 2009.

[11] 
B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild!: A lock-free approach to parallelizing stochastic gradient descent,” in Advances in Neural Information Processing Systems, 2011, pp. 693–701.

[12] 
C. M. De Sa, C. Zhang, K. Olukotun, and C. Ré, “Taming the wild: A unified analysis of Hogwild-style algorithms,” in Advances in Neural Information Processing Systems, 2015, pp. 2674–2682.

[13] 
X. Lian, Y. Huang, Y. Li, and J. Liu, “Asynchronous parallel stochastic gradient for nonconvex optimization,” in Advances in Neural Information Processing Systems, 2015, pp. 2737–2745.

[14] 
X. Lian, H. Zhang, C.-J. Hsieh, Y. Huang, and J. Liu, “A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order,” in Advances in Neural Information Processing Systems, 2016, pp. 3054–3062.

[15] 
Z. Zhou, P. Mertikopoulos, N. Bambos, P. Glynn, Y. Ye, L.-J. Li, and F.-F. Li, “Distributed asynchronous optimization with unbounded delays: How slow can you go?” in Proc. Int. Conf. Machine Learning, 2018, pp. 5970–5979.

[16] 
J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “SignSGD: Compressed optimisation for nonconvex problems,” in Proc. Int. Conf. Machine Learning, 2018, pp. 560–569.

[17] 
P. Jiang and G. Agrawal, “A linear speedup analysis of distributed deep learning with sparse and quantized communication,” in Advances in Neural Information Processing Systems, 2018, pp. 2525–2536.

[18] 
A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, “FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization,” in Proc. Int. Conf. Artificial Intelligence and Statistics, 2020, pp. 2021–2031.

[19] 
D. Basu, D. Data, C. Karakus, and S. Diggavi, “Qsparse-local-SGD: Distributed SGD with quantization, sparsification and local computations,” in Advances in Neural Information Processing Systems, 2019, pp. 14668–14679.

[20] 
J. Wang and G. Joshi, “Adaptive communication strategies to achieve the best error-runtime tradeoff in local-update SGD,” in Proc. Conf. Machine Learning and Systems, 2019, pp. 212–229.

[21] 
H. Yu, S. Yang, and S. Zhu, “Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning,” in Proc. AAAI Conf. Artificial Intelligence, 2019, pp. 5693–5700.

[22] 
F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe, “Trading redundancy for communication: Speeding up distributed SGD for nonconvex optimization,” in Proc. Int. Conf. Machine Learning, 2019, pp. 2545–2554.

[23] 
H. Yu, R. Jin, and S. Yang, “On the linear speedup analysis of communication efficient momentum SGD for distributed nonconvex optimization,” in Proc. Int. Conf. Machine Learning, 2019, pp. 7184–7193.

[24] 
F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe, “Local SGD with periodic averaging: Tighter analysis and adaptive synchronization,” in Advances in Neural Information Processing Systems, 2019, pp. 11080–11092.

[25] 
H. Yu and R. Jin, “On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic nonconvex optimization,” in Proc. Int. Conf. Machine Learning, 2019, pp. 7174–7183.

[26] 
Z. Jiang, A. Balu, C. Hegde, and S. Sarkar, “Collaborative deep learning in fixed topology networks,” in Advances in Neural Information Processing Systems, 2017, pp. 5904–5914.

[27] 
X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in Advances in Neural Information Processing Systems, 2017, pp. 5330–5340.

[28] 
J. George, T. Yang, H. Bai, and P. Gurram, “Distributed stochastic gradient method for nonconvex problems with applications in supervised learning,” in Proc. IEEE Conf. Decision and Control, 2019, pp. 5538–5543.

[29] 
X. Lian, W. Zhang, C. Zhang, and J. Liu, “Asynchronous decentralized parallel stochastic gradient descent,” in Proc. Int. Conf. Machine Learning, 2018, pp. 3043–3052.

[30] 
M. Assran, N. Loizou, N. Ballas, and M. Rabbat, “Stochastic gradient push for distributed deep learning,” in Proc. Int. Conf. Machine Learning, 2019, pp. 344–353.

[31] 
H. Tang, S. Gan, C. Zhang, T. Zhang, and J. Liu, “Communication compression for decentralized training,” in Advances in Neural Information Processing Systems, 2018, pp. 7652–7662.

[32] 
A. Reisizadeh, H. Taheri, A. Mokhtari, H. Hassani, and R. Pedarsani, “Robust and communication-efficient collaborative learning,” in Advances in Neural Information Processing Systems, 2019, pp. 8386–8397.

[33] 
H. Taheri, A. Mokhtari, H. Hassani, and R. Pedarsani, “Quantized decentralized stochastic learning over directed graphs,” in Proc. Int. Conf. Machine Learning, 2020, pp. 9324–9333.

[34] 
N. Singh, D. Data, J. George, and S. Diggavi, “SQuARM-SGD: Communication-efficient momentum SGD for decentralized optimization,” IEEE Journal on Selected Areas in Information Theory, vol. 2, no. 3, pp. 954–969, 2021. doi: 10.1109/JSAIT.2021.3103920

[35] 
J. Wang and G. Joshi, “Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms,” Journal of Machine Learning Research, vol. 22, no. 213, pp. 1–50, 2021.

[36] 
H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, “D^{2}: Decentralized training over decentralized data,” in Proc. Int. Conf. Machine Learning, 2018, pp. 4848–4856.

[37] 
S. Lu, X. Zhang, H. Sun, and M. Hong, “GNSD: A gradient-tracking based nonconvex stochastic algorithm for decentralized optimization,” in Proc. IEEE Data Science Workshop, 2019, pp. 315–321.

[38] 
J. Zhang and K. You, “Decentralized stochastic gradient tracking for empirical risk minimization,” arXiv preprint arXiv:1909.02712, 2019.

[39] 
S. U. Stich, “Local SGD converges fast and communicates little,” in Proc. Int. Conf. Learning Representations, 2019.

[40] 
A. Koloskova, S. Stich, and M. Jaggi, “Decentralized stochastic optimization and gossip algorithms with compressed communication,” in Proc. Int. Conf. Machine Learning, 2019, pp. 3478–3487.

[41] 
S. Pu, A. Olshevsky, and I. C. Paschalidis, “A sharp estimate on the transient time of distributed stochastic gradient descent,” IEEE Trans. Automatic Control, 2021. doi: 10.1109/TAC.2021.3126253

[42] 
M. Rabbat, “Multi-agent mirror descent for decentralized stochastic optimization,” in Proc. Int. Workshop on Computational Advances in Multi-Sensor Adaptive Processing, 2015, pp. 517–520.

[43] 
G. Lan, S. Lee, and Y. Zhou, “Communication-efficient algorithms for decentralized and stochastic optimization,” Mathematical Programming, vol. 180, no. 1, pp. 237–284, 2020.

[44] 
D. Yuan, Y. Hong, D. W. Ho, and G. Jiang, “Optimal distributed stochastic mirror descent for strongly convex optimization,” Automatica, vol. 90, pp. 196–203, 2018. doi: 10.1016/j.automatica.2017.12.053

[45] 
D. Jakovetic, D. Bajovic, A. K. Sahu, and S. Kar, “Convergence rates for distributed stochastic optimization over random networks,” in Proc. IEEE Conf. Decision and Control, 2018, pp. 4238–4245.

[46] 
A. Fallah, M. Gurbuzbalaban, A. Ozdaglar, U. Simsekli, and L. Zhu, “Robust distributed accelerated stochastic gradient methods for multi-agent networks,” arXiv preprint arXiv:1910.08701, 2019.

[47] 
S. Pu and A. Garcia, “Swarming for faster convergence in stochastic optimization,” SIAM Journal on Control and Optimization, vol. 56, no. 4, pp. 2997–3020, 2018. doi: 10.1137/17M1111085

[48] 
S. Pu and A. Nedić, “A distributed stochastic gradient tracking method,” in Proc. IEEE Conf. Decision and Control, 2018, pp. 963–968.

[49] 
R. Xin, A. K. Sahu, U. A. Khan, and S. Kar, “Distributed stochastic optimization with gradient tracking over strongly-connected networks,” in Proc. IEEE Conf. Decision and Control, 2019, pp. 8353–8358.

[50] 
M. Mesbahi and M. Egerstedt, Graph Theoretic Methods in Multiagent Networks. Princeton University Press, 2010.

[51] 
A. Rakhlin, O. Shamir, and K. Sridharan, “Making gradient descent optimal for strongly convex stochastic optimization,” in Proc. Int. Conf. Machine Learning, 2012, pp. 1571–1578.

[52] 
H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition,” in Proc. Joint European Conf. Machine Learning and Knowledge Discovery in Databases, 2016, pp. 795–811.

[53] 
H. Zhang and L. Cheng, “Restricted strong convexity and its applications to convergence analysis of gradient-type methods in convex optimization,” Optimization Letters, vol. 9, no. 5, pp. 961–979, 2015. doi: 10.1007/s11590-014-0795-x

[54] 
Z. Li and J. Li, “A simple proximal stochastic gradient method for nonsmooth nonconvex optimization,” in Advances in Neural Information Processing Systems, 2018, pp. 5569–5579.

[55] 
M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for the linear quadratic regulator,” in Proc. Int. Conf. Machine Learning, 2018, pp. 1467–1476.

[56] 
W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015. doi: 10.1137/14096668X

[57] 
A. Nedić, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017. doi: 10.1137/16M1084316

[58] 
H. Li and Z. Lin, “Revisiting EXTRA for smooth distributed optimization,” SIAM Journal on Optimization, vol. 30, no. 3, pp. 1795–1821, 2020. doi: 10.1137/18M122902X

[59] 
Y. LeCun, C. Cortes, and C. Burges, “MNIST handwritten digit database,” 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist

[60] 
L. Bottou, “Stochastic gradient descent tricks,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 421–436.

[61] 
Y. Nesterov, Lectures on Convex Optimization, 2nd ed. Springer Int. Publishing, 2018.

[62] 
Y. Tang, J. Zhang, and N. Li, “Distributed zero-order algorithms for nonconvex multiagent optimization,” IEEE Trans. Control of Network Systems, vol. 8, no. 1, pp. 269–281, 2020.

[63] 
X. Yi, L. Yao, T. Yang, J. George, and K. H. Johansson, “Distributed optimization for second-order multi-agent systems with dynamic event-triggered communication,” in Proc. IEEE Conf. Decision and Control, 2018, pp. 3397–3402.
