IEEE/CAA Journal of Automatica Sinica
Citation: X. L. Yi, S. J. Zhang, T. Yang, T. Y. Chai, and K. H. Johansson, “A primal-dual SGD algorithm for distributed nonconvex optimization,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 5, pp. 812–833, May 2022. doi: 10.1109/JAS.2022.105554
The distributed nonconvex optimization problem of minimizing a global cost function, formed by a sum of n local cost functions, using only local information exchange is considered. This problem is an important component of many machine learning techniques with data parallelism, such as deep learning and federated learning. We propose a distributed primal-dual stochastic gradient descent (SGD) algorithm, suitable for arbitrarily connected communication networks and any smooth (possibly nonconvex) cost functions. We show that the proposed algorithm achieves the linear speedup convergence rate O(1/√(nT)) for general nonconvex cost functions and the linear speedup convergence rate O(1/(nT)) when the global cost function satisfies the Polyak-Łojasiewicz (P-Ł) condition, where T is the total number of iterations. We also show that the output of the proposed algorithm with constant parameters linearly converges to a neighborhood of a global optimum. We demonstrate through numerical experiments the efficiency of our algorithm in comparison with the baseline centralized SGD and recently proposed distributed SGD algorithms.
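The flavor of a distributed primal-dual SGD iteration can be illustrated with a minimal simulation. The sketch below is a generic primal-dual scheme on a toy quadratic problem, not the paper's exact algorithm: the ring topology, the step size eta, the penalty gains alpha and beta, and the gradient-noise level sigma are all illustrative assumptions. Each node updates a primal variable with its own noisy local gradient plus consensus-penalty terms, and a dual variable that accumulates the local disagreement; both updates use only neighbor information (through the graph Laplacian row of that node).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: node i holds f_i(x) = 0.5 * (x - b_i)^2, so the minimizer of
# the global cost (1/n) * sum_i f_i is the mean of the b_i.
n = 4
b = np.array([1.0, 2.0, 3.0, 6.0])
x_star = b.mean()

# Ring communication graph: Laplacian L = D - A, each node has two neighbors.
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1.0
L = np.diag(A.sum(axis=1)) - A

# One primal and one dual variable per node.
x = np.zeros(n)
v = np.zeros(n)

eta, alpha, beta = 0.05, 1.0, 1.0  # step size and penalty gains (illustrative)
sigma = 0.1                        # stochastic-gradient noise level

for _ in range(5000):
    g = (x - b) + sigma * rng.standard_normal(n)   # noisy local gradients
    Lx = L @ x                                     # disagreement with neighbors
    x_next = x - eta * (alpha * Lx + beta * v + g) # primal descent step
    v = v + eta * beta * Lx                        # dual ascent on consensus
    x = x_next

print("node iterates:", x, " global optimum:", x_star)
```

Since v is initialized at zero and only ever moves in the range of L, its components sum to zero at every iteration; at a noiseless fixed point this forces exact consensus at the global minimizer, while with a constant step size and stochastic gradients the iterates settle in a neighborhood of it, mirroring the constant-parameter behavior described above.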