A journal of the IEEE and the CAA, publishing high-quality papers in English on original theoretical and experimental research and development in all areas of automation
Volume 11, Issue 12, Dec. 2024

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 15.3, Top 1 (SCI Q1)
  • CiteScore: 23.5, Top 2% (Q1)
  • Google Scholar h5-index: 77, Top 5
Citation: M. Wang, J. Zhang, J. Ma, and X. Guo, “Cas-FNE: Cascaded face normal estimation,” IEEE/CAA J. Autom. Sinica, vol. 11, no. 12, pp. 2423–2434, Dec. 2024. doi: 10.1109/JAS.2024.124899

Cas-FNE: Cascaded Face Normal Estimation

doi: 10.1109/JAS.2024.124899
Funds:  This work was supported by the National Natural Science Foundation of China (62072327)
  • Capturing high-fidelity normals from single face images plays a core role in numerous computer vision and graphics applications. Though significant progress has been made in recent years, effectively and efficiently exploiting normal priors remains challenging. Most existing approaches depend on intricate network architectures and complex calculations for in-the-wild face images. To overcome this issue, we propose a simple yet effective cascaded neural network, called Cas-FNE, which progressively boosts the quality of predicted normals with marginal model parameters and computational cost. Thanks to its progressive refinement mechanism, it also mitigates the imbalance between training data and real-world face images, boosting the generalization ability of the model. Specifically, in the training phase, our model relies solely on a small amount of labeled data, and each earlier prediction serves as guidance for the subsequent refinement. In addition, our shared-parameter cascaded block employs a recurrent mechanism, allowing it to be applied multiple times without increasing the number of network parameters. Quantitative and qualitative evaluations on benchmark datasets show that Cas-FNE faithfully maintains facial details and outperforms state-of-the-art methods. The code is available at https://github.com/AutoHDR/CasFNE.git.
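The shared-parameter, recurrent cascade described in the abstract is the core mechanism. Below is a minimal PyTorch sketch of that idea: a single block that refines the previous normal prediction conditioned on the input image, unrolled for several stages with one set of weights. The layer sizes, the residual update, the all-facing-camera initialization, and the names CascadedBlock and CasFNESketch are all illustrative assumptions, not the paper's actual architecture; see the official code at https://github.com/AutoHDR/CasFNE.git for the real model.

```python
import torch
import torch.nn as nn


class CascadedBlock(nn.Module):
    """One shared refinement block (hypothetical layer sizes, not the paper's)."""

    def __init__(self, feat: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 3, feat, 3, padding=1),  # RGB image + previous normal map
            nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat, 3, 3, padding=1),      # residual update to the normal
        )

    def forward(self, image: torch.Tensor, normal: torch.Tensor) -> torch.Tensor:
        # the earlier prediction conditions (guides) the current refinement pass
        refined = normal + self.net(torch.cat([image, normal], dim=1))
        # project back to unit-length surface normals
        return refined / refined.norm(dim=1, keepdim=True).clamp_min(1e-6)


class CasFNESketch(nn.Module):
    """Recurrently applies one shared-weight block for several refinement stages."""

    def __init__(self, stages: int = 4):
        super().__init__()
        self.block = CascadedBlock()  # a single set of weights, reused every stage
        self.stages = stages

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # start from a trivial guess: every normal pointing at the camera
        normal = torch.zeros_like(image)
        normal[:, 2] = 1.0
        for _ in range(self.stages):  # more stages -> more compute, same params
            normal = self.block(image, normal)
        return normal


model = CasFNESketch(stages=4)
out = model(torch.rand(1, 3, 128, 128))  # a dummy face crop
print(out.shape)                         # torch.Size([1, 3, 128, 128])
```

Because the block's weights are shared across stages, deeper unrolling trades extra compute for better normals without growing the model, which is the property the abstract credits for the marginal parameter and computational cost.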

     



    Highlights

    • Cas-FNE produces face normals through refinement across multiple cascaded stages, as illustrated by the sketch below
    • The estimated normals enhance the recovered geometry, ensuring that the reconstructed 3D face carries high-fidelity details
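To make the first highlight concrete, here is a tiny demonstration, using a hypothetical two-convolution stand-in block rather than the paper's network, that unrolling more refinement stages through one shared block leaves the parameter count untouched:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the shared cascaded block (6 input channels:
# the RGB image stacked with the current normal estimate).
shared_block = nn.Sequential(
    nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 3, 3, padding=1)
)

n_params = sum(p.numel() for p in shared_block.parameters())
image = torch.rand(1, 3, 64, 64)
normal = torch.rand(1, 3, 64, 64)
for stage in range(4):  # four refinement passes through the same weights
    normal = normal + shared_block(torch.cat([image, normal], dim=1))
print(n_params)  # unchanged whether the cascade is unrolled 1 stage or 10
```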
