A journal of IEEE and CAA that publishes high-quality papers in English on original theoretical/experimental research and development in all areas of automation.
Volume 9, Issue 2, Feb. 2022

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 11.8, Top 4% (SCI Q1)
  • CiteScore: 17.6, Top 3% (Q1)
  • Google Scholar h5-index: 77, Top 5
Citation: P. L. Huang, J. W. Han, N. Liu, J. Ren, and D. W. Zhang, “Scribble-supervised video object segmentation,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 2, pp. 339–353, Feb. 2022. doi: 10.1109/JAS.2021.1004210

Scribble-Supervised Video Object Segmentation

doi: 10.1109/JAS.2021.1004210
Funds: This work was supported in part by the National Key R&D Program of China (2017YFB0502904) and the National Natural Science Foundation of China (61876140).
Abstract: Recently, video object segmentation has received great attention in the computer vision community. Most existing methods rely heavily on pixel-wise human annotations, which are expensive and time-consuming to obtain. To tackle this problem, we make an early attempt to achieve video object segmentation with scribble-level supervision, which greatly reduces the human labor required to collect annotations. However, conventional network architectures and learning objective functions do not work well in this scenario, because the supervision information is highly sparse and incomplete. To address this issue, this paper introduces two novel elements for learning the video object segmentation model. The first is the scribble attention module, which captures more accurate context information and learns an effective attention map to enhance the contrast between foreground and background. The second is the scribble-supervised loss, which optimizes the unlabeled pixels and dynamically corrects inaccurately segmented areas during the training stage. To evaluate the proposed method, we conduct experiments on two video object segmentation benchmark datasets: YouTube-VOS and DAVIS-2017. We first generate scribble annotations from the original per-pixel annotations, then train our model and compare its test performance with baseline models and other existing works. Extensive experiments demonstrate that the proposed method works effectively and approaches the performance of methods that require dense per-pixel annotations.
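
As a concrete illustration of the first element, the sketch below shows one plausible, non-local-style form of a scribble attention block in PyTorch, in which keys and values are restricted to scribble-annotated pixels so that the labeled locations guide context aggregation. This is a minimal sketch under stated assumptions, not the authors' implementation; the class name, tensor shapes, and masking scheme are all illustrative.

```python
# Hypothetical scribble attention block (PyTorch). Assumptions: keys/values
# are taken only at scribble pixels, and each frame has at least one scribble
# pixel at feature resolution (otherwise a softmax row would be all -inf).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScribbleAttention(nn.Module):
    def __init__(self, in_ch: int, mid_ch: int = None):
        super().__init__()
        mid_ch = mid_ch or in_ch // 2
        self.query = nn.Conv2d(in_ch, mid_ch, 1)   # queries at every pixel
        self.key = nn.Conv2d(in_ch, mid_ch, 1)     # keys at scribble pixels
        self.value = nn.Conv2d(in_ch, in_ch, 1)    # values at scribble pixels
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, feat: torch.Tensor, scribble_mask: torch.Tensor):
        # feat: (B, C, H, W); scribble_mask: (B, 1, H, W), 1 on scribbles.
        B, C, H, W = feat.shape
        q = self.query(feat).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.key(feat).flatten(2)                     # (B, C', HW)
        v = self.value(feat).flatten(2).transpose(1, 2)   # (B, HW, C)
        logits = torch.bmm(q, k) / (k.size(1) ** 0.5)     # (B, HW, HW)
        # Mask out non-scribble positions so every pixel attends only to
        # scribble-annotated locations.
        mask = scribble_mask.flatten(2) == 0              # (B, 1, HW)
        attn = F.softmax(logits.masked_fill(mask, float("-inf")), dim=-1)
        ctx = torch.bmm(attn, v).transpose(1, 2).reshape(B, C, H, W)
        return feat + self.gamma * ctx  # residual context enhancement
```

In practice the binary scribble map would be down-sampled to the feature resolution before the forward pass, and the zero-initialized residual weight lets training start from the plain backbone features.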

     






    Highlights

    • An early attempt to achieve video object segmentation with scribble-level supervision, which greatly reduces the human labor required to collect manual annotations
    • The proposed scribble attention module captures more accurate context information and learns an effective attention map to enhance the contrast between foreground and background
    • The proposed scribble-supervised loss optimizes the unlabeled pixels and dynamically corrects inaccurately segmented areas during the training stage (see the loss sketch after this list)
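
To make the last highlight concrete, the following is a hedged sketch of one plausible scribble-supervised objective: partial cross-entropy on the sparsely labeled scribble pixels plus a confidence-gated pseudo-label term on unlabeled pixels, recomputed at every step so that earlier mistakes can be revised. The function name, threshold, and weighting are invented for illustration; the paper's actual loss may differ.

```python
# Hedged sketch of a scribble-supervised loss: partial cross-entropy on the
# labeled scribble pixels plus a confidence-gated pseudo-label term on the
# unlabeled pixels. conf_thresh and unlabeled_weight are assumed values.
import torch
import torch.nn.functional as F

def scribble_supervised_loss(logits, scribble_labels, ignore_index=255,
                             conf_thresh=0.9, unlabeled_weight=0.5):
    # logits: (B, K, H, W); scribble_labels: (B, H, W) long tensor, with
    # ignore_index marking every pixel the scribbles do not cover.
    labeled = F.cross_entropy(logits, scribble_labels,
                              ignore_index=ignore_index)
    # Pseudo-labels are recomputed from the current prediction at every
    # step, so confident regions grow and earlier mistakes get corrected.
    with torch.no_grad():
        conf, pseudo = F.softmax(logits, dim=1).max(dim=1)  # (B, H, W)
        keep = (scribble_labels == ignore_index) & (conf > conf_thresh)
        pseudo = torch.where(keep, pseudo,
                             torch.full_like(pseudo, ignore_index))
    if keep.any():  # guard: cross_entropy is NaN if every target is ignored
        unlabeled = F.cross_entropy(logits, pseudo,
                                    ignore_index=ignore_index)
    else:
        unlabeled = logits.new_zeros(())
    return labeled + unlabeled_weight * unlabeled
```

This dynamic correction is only one plausible reading of the abstract; the key property is that unlabeled pixels receive a training signal without any dense ground truth.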
