A journal of IEEE and CAA, publishing high-quality papers in English on original theoretical/experimental research and development in all areas of automation
Volume 9, Issue 10
Oct. 2022

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 7.847, Top 10% (SCI Q1)
    CiteScore: 13.0, Top 5% (Q1)
    Google Scholar h5-index: 64, TOP 7
Citation: Z. N. Li, Y. J. Li, B. Y. Tan, S. X. Ding, and S. L. Xie, “Structured sparse coding with the group log-regularizer for key frame extraction,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 10, pp. 1818–1830, Oct. 2022. doi: 10.1109/JAS.2022.105602

Structured Sparse Coding With the Group Log-regularizer for Key Frame Extraction

doi: 10.1109/JAS.2022.105602
Funds: This work was supported in part by the National Natural Science Foundation of China (61903090, 61727810, 62073086, 62076077, 61803096, U191140003), the Guangzhou Science and Technology Program Project (202002030289), and the Japan Society for the Promotion of Science (JSPS) KAKENHI (18K18083).
  • Key frame extraction based on sparse coding reduces the redundancy among consecutive frames and represents an entire video concisely. However, developing a key frame extraction algorithm that automatically selects a small number of frames while keeping the reconstruction error low remains a challenge. In this paper, we propose a novel structured sparse-coding-based model for key frame extraction, in which a nonconvex group log-regularizer promotes strong sparsity while keeping the reconstruction error low. To extract key frames automatically, a decomposition scheme is designed to separate the sparse coefficient matrix by rows. Under the nonconvex group log-regularizer, each row becomes either entirely zero or nonzero, which yields a structured sparse coefficient matrix. To handle the nonconvexity introduced by the log-regularizer, the difference of convex algorithm (DCA) is employed to decompose the log-regularizer into the difference of two convex functions related to the ℓ1 norm, so that the solution can be obtained directly through the proximal operator. The result is an efficient structured sparse coding algorithm with the group log-regularizer that automatically extracts a few frames directly from the video to represent the entire video with a low reconstruction error. Experimental results demonstrate that the proposed algorithm extracts more accurate key frames from most SumMe videos than state-of-the-art methods. Furthermore, on the VSUMM dataset, the proposed algorithm achieves a higher compression, with an increase of nearly 18% over sparse modeling representative selection (SMRS) and 8% over SC-det.
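
As a worked illustration of the DCA step described in the abstract, the following shows one common way to split a group log penalty into a difference of two convex functions. The specific form of the regularizer and the symbols C, c^i, and ε are assumptions made for illustration; this page does not state the exact formulation used in the paper.

```latex
% Assumed group log-regularizer over the rows c^i of the coefficient matrix C
R(C) = \sum_{i} \log\!\left(1 + \frac{\|c^i\|_2}{\epsilon}\right), \qquad \epsilon > 0.

% DC split: R = g - h with both g and h convex
g(C) = \frac{1}{\epsilon}\sum_{i} \|c^i\|_2, \qquad
h(C) = \sum_{i} \left[ \frac{\|c^i\|_2}{\epsilon}
       - \log\!\left(1 + \frac{\|c^i\|_2}{\epsilon}\right) \right].

% h is convex because \phi(t) = t/\epsilon - \log(1 + t/\epsilon) satisfies
\phi'(t) = \frac{t}{\epsilon(\epsilon + t)} \ge 0, \qquad
\phi''(t) = \frac{1}{(\epsilon + t)^2} > 0 \quad (t \ge 0).
```

Under this assumed form, g is a scaled ℓ1 norm of the row magnitudes, whose proximal operator is row-wise soft thresholding. Each DCA iteration would linearize h at the current iterate and solve the remaining convex problem.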

     



    Figures (10) / Tables (3)


    Highlights

    • We propose a novel structured sparse-coding-based model for key frame extraction, in which a nonconvex group log-regularizer promotes strong sparsity while keeping the reconstruction error low.
    • To extract key frames automatically, a decomposition scheme is designed to separate the sparse coefficient matrix by rows.
    • The difference of convex algorithm (DCA) is employed to decompose the log-regularizer into the difference of two convex functions related to the ℓ1 norm. The resulting structured sparse coding algorithm with the group log-regularizer automatically extracts a few frames directly from the video to represent the entire video with a low reconstruction error.
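
The highlights above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes a self-representation model 0.5·‖X − XC‖² + λ·Σ log(1 + ‖c^i‖₂/ε), where the columns of X are frame features and c^i are the rows of C, and it uses a single proximal-gradient step per DCA iteration. All function names, the parameters `lam` and `eps`, and the model form itself are hypothetical choices made for this sketch.

```python
import numpy as np

def row_soft_threshold(V, tau):
    """Proximal operator of tau * (sum of row l2 norms): shrink each row toward zero."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * V

def objective(X, C, lam, eps):
    """Assumed objective: reconstruction error plus group log penalty on rows of C."""
    residual = X - X @ C
    row_norms = np.linalg.norm(C, axis=1)
    return 0.5 * np.sum(residual ** 2) + lam * np.sum(np.log1p(row_norms / eps))

def group_log_sparse_coding(X, lam=0.5, eps=0.1, n_iter=200):
    """Proximal DCA sketch: linearize the concave part, then take one prox step."""
    n = X.shape[1]
    G = X.T @ X
    L = np.linalg.norm(G, 2)  # Lipschitz constant of the smooth term's gradient
    C = np.zeros((n, n))
    for _ in range(n_iter):
        norms = np.linalg.norm(C, axis=1, keepdims=True)
        # Gradient of h(C) = sum_i [ ||c^i||/eps - log(1 + ||c^i||/eps) ]
        W = C / (eps * (eps + norms))
        grad = G @ C - G  # gradient of 0.5 * ||X - XC||_F^2
        C = row_soft_threshold(C - (grad - lam * W) / L, lam / (eps * L))
    return C

def select_key_frames(C, tol=1e-8):
    """Frames whose rows of C are nonzero are taken as key frames."""
    return np.flatnonzero(np.linalg.norm(C, axis=1) > tol)
```

A possible usage, with random features standing in for real frame descriptors: `C = group_log_sparse_coding(X)` followed by `select_key_frames(C)` returns the indices of the frames kept. Larger `lam` or smaller `eps` should leave fewer nonzero rows, though the values that work on real video features would need tuning.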
