IEEE/CAA Journal of Automatica Sinica
Citation: J. Hou, X. Zeng, G. Wang, J. Sun, and J. Chen, “Distributed momentum-based Frank-Wolfe algorithm for stochastic optimization,” IEEE/CAA J. Autom. Sinica, vol. 10, no. 3, pp. 685–699, Mar. 2023. doi: 10.1109/JAS.2022.105923
This paper considers distributed stochastic optimization, in which a number of agents cooperate to optimize a global objective function through local computations and information exchanges with neighbors over a network. Stochastic optimization problems are usually tackled by variants of projected stochastic gradient descent. However, projecting a point onto a feasible set is often expensive. The Frank-Wolfe (FW) method has well-documented merits in handling convex constraints, but existing stochastic FW algorithms have mostly been developed for centralized settings. In this context, the present work puts forth a distributed stochastic Frank-Wolfe solver by judiciously combining Nesterov’s momentum and gradient tracking techniques for stochastic convex and nonconvex optimization over networks. It is shown that the convergence rate of the proposed algorithm is O(k^{-1/2}) for convex optimization and O(1/log₂(k)) for nonconvex optimization. The efficacy of the algorithm is demonstrated by numerical simulations against a number of competing alternatives.
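To make the abstract’s algorithmic ingredients concrete, below is a minimal NumPy sketch of a distributed stochastic Frank-Wolfe loop that combines momentum-averaged gradient estimates with gradient tracking and a consensus (mixing) step. Everything here is an illustrative assumption rather than the paper’s method: the constraint set is taken to be an ℓ1-ball with a closed-form linear minimization oracle, `W` is a doubly stochastic mixing matrix, and the `alpha`/`gamma` schedules are generic choices, not the schedules analyzed by the authors.

```python
import numpy as np

def lmo_l1_ball(g, radius=1.0):
    """Linear minimization oracle over the l1-ball:
    argmin_{||s||_1 <= radius} <g, s> puts all mass on the
    largest-magnitude coordinate of g."""
    s = np.zeros_like(g)
    i = np.argmax(np.abs(g))
    s[i] = -radius * np.sign(g[i])
    return s

def distributed_momentum_fw(stoch_grads, W, d, num_iters=500, radius=1.0):
    """Illustrative distributed stochastic FW loop (not the authors' exact
    recursion). stoch_grads[i](x) returns a stochastic gradient of agent i's
    local loss at x; W is a doubly stochastic mixing matrix of the network."""
    n = len(stoch_grads)
    X = np.zeros((n, d))        # local iterates, initialized inside the l1-ball
    V = np.zeros((n, d))        # momentum-averaged stochastic gradients
    V_prev = np.zeros((n, d))
    Y = np.zeros((n, d))        # gradient-tracking estimates of the global gradient
    for k in range(num_iters):
        gamma = 2.0 / (k + 2)   # classical FW step size (illustrative)
        alpha = 1.0 / (k + 1)   # momentum weight (illustrative)
        G = np.stack([stoch_grads[i](X[i]) for i in range(n)])
        V = (1.0 - alpha) * V + alpha * G   # momentum smoothing of noisy gradients
        Y = W @ Y + V - V_prev              # gradient tracking across neighbors
        V_prev = V.copy()
        S = np.stack([lmo_l1_ball(Y[i], radius) for i in range(n)])
        X_mix = W @ X                       # consensus (mixing) step
        X = X_mix + gamma * (S - X_mix)     # projection-free update stays feasible
    return X.mean(axis=0)

# Tiny usage example: n agents, each holding a noisy quadratic 0.5*||x - b_i||^2.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 4, 10
    B = rng.normal(size=(n, d))
    W = np.full((n, n), 1.0 / n)  # complete graph with uniform weights
    grads = [lambda x, b=B[i]: (x - b) + 0.1 * rng.normal(size=d) for i in range(n)]
    x_hat = distributed_momentum_fw(grads, W, d)
    print("solution l1-norm:", np.abs(x_hat).sum())  # remains within the unit l1-ball
```

The appeal of the FW step shows up in `lmo_l1_ball`: each update is a closed-form, one-coordinate solution of a linear subproblem instead of a potentially expensive projection, and only that oracle would need replacing to handle a different convex constraint set.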