A journal of IEEE and CAA, publishing high-quality papers in English on original theoretical/experimental research and development in all areas of automation
Volume 9 Issue 2
Feb.  2022

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 11.8, Top 4% (SCI Q1)
  • CiteScore: 17.6, Top 3% (Q1)
  • Google Scholar h5-index: 77, Top 5
Citation: H. J. Hu, H. S. Wang, Z. Liu, and W. D. Chen, “Domain-invariant similarity activation map contrastive learning for retrieval-based long-term visual localization,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 2, pp. 313–328, Feb. 2022. doi: 10.1109/JAS.2021.1003907

Domain-Invariant Similarity Activation Map Contrastive Learning for Retrieval-Based Long-Term Visual Localization

doi: 10.1109/JAS.2021.1003907
Abstract
Visual localization is a crucial component in mobile robot and autonomous driving applications. Image retrieval is an efficient and effective technique for image-based localization. Because of drastic variations in environmental conditions, e.g., illumination changes, retrieval-based visual localization is severely affected and becomes a challenging problem. In this work, a general architecture is first formulated probabilistically to extract domain-invariant features through multi-domain image translation. Then, a novel gradient-weighted similarity activation mapping (Grad-SAM) loss is incorporated for finer localization with high accuracy. We also propose a new adaptive triplet loss to boost contrastive learning of the embedding in a self-supervised manner. The final coarse-to-fine image retrieval pipeline is implemented as the sequential combination of models with and without the Grad-SAM loss. Extensive experiments validate the effectiveness of the proposed approach on the CMU-Seasons dataset. The strong generalization ability of our approach is verified on the RobotCar dataset using models pre-trained on the urban parts of the CMU-Seasons dataset. Our approach performs on par with, or even outperforms, state-of-the-art image-based localization baselines at medium and high precision, especially in challenging environments with illumination variance, vegetation, and night-time images. Moreover, real-site experiments validate the efficiency and effectiveness of the coarse-to-fine strategy for localization.
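For intuition, the coarse-to-fine retrieval described above can be read as a shortlist-then-re-rank procedure: a coarse model shortlists candidate database images with global descriptors, and a finer model (presumably the one trained with the Grad-SAM loss) re-ranks only that shortlist. The minimal NumPy sketch below assumes L2-normalized descriptors compared by cosine similarity; the function names, descriptor dimensions, and shortlist size are illustrative assumptions, not the paper's implementation.

import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale descriptors to unit length so a dot product equals cosine similarity.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def coarse_to_fine_retrieve(q_coarse, db_coarse, q_fine, db_fine, top_k=10):
    # Stage 1 (coarse): shortlist the top_k database images by cosine similarity.
    candidates = np.argsort(-(db_coarse @ q_coarse))[:top_k]
    # Stage 2 (fine): re-rank only the shortlist with the finer descriptors.
    best = np.argmax(db_fine[candidates] @ q_fine)
    return candidates[best]

# Toy usage with random descriptors standing in for network embeddings.
rng = np.random.default_rng(0)
db_coarse = l2_normalize(rng.standard_normal((1000, 256)))
db_fine = l2_normalize(rng.standard_normal((1000, 512)))
q_coarse = l2_normalize(rng.standard_normal(256))
q_fine = l2_normalize(rng.standard_normal(512))
print(coarse_to_fine_retrieve(q_coarse, db_coarse, q_fine, db_fine))

Because the fine model only scores the top_k shortlisted images, the second stage adds little overhead while improving localization precision, which is the efficiency argument behind the two-stage design.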

     

  • [1]
    T. Sattler, B. Leibe, and L. Kobbelt, “Efficient & effective prioritized matching for large-scale image-based localization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 9, pp. 1744–1756, Sept. 2017.
    [2]
    B. Bescos, J. M. Fácil, J. Civera, and J. Neira, “DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes,” IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 4076–4083, Oct. 2018.
    [3]
    L. P. Wang and H. Wei, “Avoiding non-Manhattan obstacles based on projection of spatial corners in indoor environment,” IEEE/CAA J. Autom. Sinica, vol. 7, no. 4, pp. 1190–1200, Jul. 2020.
    [4]
    R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD: CNN architecture for weakly supervised place recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 6, pp. 1437–1451, Jun. 2018.
    [5]
    S. Lowry, N. Sünderhauf, P. Newman, J. J. Leonard, D. Cox, P. Corke, and M. J. Milford, “Visual place recognition: A survey,” IEEE Trans. Robot., vol. 32, no. 1, pp. 1–19, Feb. 2016.
    [6]
    A. Torii, R. Arandjelovic, J. Sivic, M. Okutomi, and T. Pajdla, “4/7 place recognition by view synthesis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 2, pp. 257–271, Feb. 2018.
    [7]
    T. Sattler, Q. J. Zhou, M. Pollefeys, and L. Leal-Taixé, “Understanding the limitations of CNN-based absolute camera pose regression,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019, pp. 3297–3307.
    [8]
    P. E. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk, “From coarse to fine: Robust hierarchical localization at large scale,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019, pp. 12708–12717.
    [9]
    T. Sattler, W. Maddern, C. Toft, et al., “Benchmarking 6DOF outdoor visual localization in changing conditions,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 8601–8610.
    [10]
    Y. F. Ma, Z. Y. Wang, H. Yang, and L. Yang, “Artificial intelligence applications in the development of autonomous vehicles: A survey,” IEEE/CAA J. Autom. Sinica, vol. 7, no. 2, pp. 315–329, Mar. 2020.
    [11]
    D. Doan, Y. Latif, T. J. Chin, Y. Liu, T. T. Do, and I. Reid, “Scalable place recognition under appearance change for autonomous driving,” in Proc. IEEE/CVF Int. Conf. Computer Vision, Seoul, Korea (South), 2019, pp. 9318–9327.
    [12]
    P. Yin, L. Y. Xu, X. Q. Li, C. Yin, Y. L. Li, R. A. Srivatsan, L. Li, J. M. Ji, and Y. Q. He, “A multi-domain feature learning method for visual place recognition,” in Proc. IEEE Int. Conf. Robotics and Automation, Montreal, QC, Canada, 2019, pp. 319–324.
    [13]
    Z. T. Chen, A. Jacobson, N. Sünderhauf, B. Upcroft, L. Q. Liu, C. H. Shen, I. Reid, and M. Milford, “Deep learning features at scale for visual place recognition,” in Proc. IEEE Int. Conf. Robotics and Automation, Singapore, 2017, pp. 3223–3230.
    [14]
    J. Wang, Y. Song, T. Leung, C. Rosenberg, J. B. Wang, J. Philbin, B. Chen, and Y. Wu, “Learning fine-grained image similarity with deep ranking,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014, pp. 1386–1393.
    [15]
    P. Wohlhart and V. Lepetit, “Learning descriptors for object recognition and 3D pose estimation,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Boston, MA, USA, 2015, pp. 3109–3118.
    [16]
    J. W. Lu, J. L. Hu, and Y. P. Tan, “Discriminative deep metric learning for face and kinship verification,” IEEE Trans. Image Process., vol. 26, no. 9, pp. 4269–4282, Sept. 2017.
    [17]
    S. Lowry and M. J. Milford, “Supervised and unsupervised linear learning techniques for visual place recognition in changing environments,” IEEE Trans. Robot., vol. 32, no. 3, pp. 600–613, Jun. 2016.
    [18]
    A. Gordo, J. Almazán, J. Revaud, and D. Larlus, “End-to-end learning of deep visual representations for image retrieval,” Int. J. Comput. Vis., vol. 124, no. 2, pp. 237–254, Sept. 2017.
    [19]
    F. Radenović, G. Tolias, and O. Chum, “Fine-tuning CNN image retrieval with no human annotation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 7, pp. 1655–1668, Jul. 2019.
    [20]
    B. L. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 2921–2929.
    [21]
    R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in Proc. IEEE Int. Conf. Computer Vision, Venice, Italy, 2017, pp. 618–626.
    [22]
    A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, “Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks,” in Proc. IEEE Winter Conf. Applications of Computer Vision, Lake Tahoe, NV, USA, 2018, pp. 839–847.
    [23]
    H. J. Kim, E. Dunn, and J. M. Frahm, “Learned contextual feature reweighting for image geo-localization,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 3251–3260.
    [24]
    Z. T. Chen, L. Q. Liu, I. Sa, Z. Y. Ge, and M. Chli, “Learning context flexible attention model for long-term visual place recognition,” IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 4015–4022, Oct. 2018.
    [25]
    H. Y. Gu, H. S. Wang, F. Xu, Z. Liu, and W. D. Chen, “Active fault detection of soft manipulator in visual servoing,” IEEE Trans. Ind. Electron., 2020. DOI: 10.1109/TIE.2020.3028813
    [26]
    L. J. Han, H. S. Wang, Z. Liu, W. D. Chen, and X. F. Zhang, “Vision-based cutting control of deformable objects with surface tracking,” IEEE/ASME Trans. Mechatron., 2020. DOI: 10.1109/TMECH.2020.3029114
    [27]
    D. Gálvez-López and J. D. Tardos, “Bags of binary words for fast place recognition in image sequences,” IEEE Trans. Robot., vol. 28, no. 5, pp. 1188–1197, Oct. 2012.
    [28]
    W. Shi, P. X. Liu, and M. H. Zheng, “A mixed-depth visual rendering method for bleeding simulation,” IEEE/CAA J. Autom. Sinica, vol. 6, no. 4, pp. 917–925, Jul. 2019.
    [29]
    Z. Y. Liu and H. Qiao, “GNCCP-graduated NonConvexity and concavity procedure,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 6, pp. 1258–1267, Jun. 2014.
    [30]
    H. Qiao, Y. L. Li, T. Tang, and P. Wang, “Introducing memory and association mechanism into a biologically inspired visual model,” IEEE Trans. Cybernet., vol. 44, no. 9, pp. 1485–1496, Sept. 2014.
    [31]
    R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras,” IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, Oct. 2017.
    [32]
    H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in Proc. IEEE Computer Society Conf. Computer Vision & Pattern Recognition, San Francisco, CA, USA, 2010, pp. 3304–3311.
    [33]
    M. J. Milford and G. F. Wyeth, “SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights,” in Proc. IEEE Int. Conf. Robotics and Automation, Saint Paul, MN, USA, 2012, pp. 1643–1649.
    [34]
    S. M. Siam and H. Zhang, “Fast-SeqSLAM: A fast appearance based place recognition algorithm,” in Proc. IEEE Int. Conf. Robotics and Automation, Singapore, 2017, pp. 5702–5708.
    [35]
    Y. Xing, C. Lv, L. Chen, H. J. Wang, H. Wang, D. P. Cao, E. Velenis, and F. Y. Wang, “Advances in vision-based lane detection: Algorithms, integration, assessment, and perspectives on ACP-based parallel vision,” IEEE/CAA J. Autom. Sinica, vol. 5, no. 3, pp. 645–661, May 2018.
    [36]
    T. Jenicek and O. Chum, “No fear of the dark: Image retrieval under varying illumination conditions,” in Proc. IEEE/CVF Int. Conf. Computer Vision, Seoul, Korea (South), 2019, pp. 9696–9703.
    [37]
    P. Isola, J. Y. Zhu, T. H. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 5967–5976.
    [38]
    A. Anoosheh, E. Agustsson, R. Timofte, and L. Van Gool, “ComboGAN: Unrestrained scalability for image domain translation,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 2018, pp. 896–903.
    [39]
    M. Y. Liu, T. Breuel, and J. Kautz, “Unsupervised image-to-image translation networks,” in Proc. 31st Int. Conf. Neural Information Processing Systems, Red Hook, NY, United States, 2017, pp. 700–708.
    [40]
    X. Huang, M. Y. Liu, S. Belongie, and J. Kautz, “Multimodal unsupervised image-to-image translation,” in Proc. European Conf. Computer Vision, Munich, Germany, 2018, pp. 179–196.
    [41]
    I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. 27th Int. Conf. Neural Information Processing Systems, Cambridge, MA, United States, 2014, pp. 2672–2680.
    [42]
    A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv: 1511.06434, 2015.
    [43]
    H. Porav, W. Maddern, and P. Newman, “Adversarial training for adverse conditions: Robust metric localisation using appearance transfer,” in Proc. IEEE Int. Conf. Robotics and Automation, Brisbane, QLD, Australia, 2018, pp. 1011–1018.
    [44]
    J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. IEEE Int. Conf. Computer Vision, Venice, Italy, 2017, pp. 2242–2251.
    [45]
    A. Anoosheh, T. Sattler, R. Timofte, M. Pollefeys, and L. Van Gool, “Night-to-day image translation for retrieval-based localization,” in Proc. IEEE Int. Conf. Robotics and Automation, Montreal, QC, Canada, 2019, pp. 5958–5964.
    [46]
    T. Naseer, G. L. Oliveira, T. Brox, and W. Burgard, “Semantics-aware visual localization under challenging perceptual conditions,” in Proc. IEEE Int. Conf. Robotics and Automation, Singapore, 2017, pp. 2614–2620.
    [47]
    E. Stenborg, C. Toft, and L. Hammarstrand, “Long-term visual localization using semantically segmented images,” in Proc. IEEE Int. Conf. Robotics and Automation, Brisbane, QLD, Australia, 2018, pp. 6484–6490.
    [48]
    N. Piasco, D. Sidibé, V. Gouet-Brunet, and C. Demonceaux, “Learning scene geometry for visual localization in challenging conditions,” in Proc. IEEE Int. Conf. Robotics and Automation, Montreal, QC, Canada, 2019, pp. 9094–9100.
    [49]
    N. Piasco, D. Sidibé, V. Gouet-Brunet, and C. Demonceaux, “Improving image description with auxiliary modality for visual localization in challenging conditions,” Int. J. Comput. Vis., 2021, DOI: 10.1007/s11263-020-01363-6
    [50]
    X. H. Wang and H. B. Duan, “Hierarchical visual attention model for saliency detection inspired by avian visual pathways,” IEEE/CAA J. Autom. Sinica, vol. 6, no. 2, pp. 540–552, Mar. 2019.
    [51]
    Z. Xin, Y. H. Cai, T. Lu, X. X. Xing, S. J. Cai, J. P. Zhang, Y. P. Yang, and Y. Q. Wang, “Localizing discriminative visual landmarks for place recognition,” in Proc. IEEE Int. Conf. Robotics and Automation, Montreal, QC, Canada, 2019, pp. 5979–5985.
    [52]
    X. Luo, D. X. Wang, M. C. Zhou, and H. Q. Yuan, “Latent factor-based recommenders relying on extended stochastic gradient descent algorithms,” IEEE Trans. Syst. Man Cybernet.:Syst., vol. 51, no. 2, pp. 916–926, 2019. doi: 10.1109/TSMC.2018.2884191
    [53]
    D. Wu, Q. He, X. Luo, M. S. Shang, Y. He, and G. Y. Wang, “A posterior-neighborhood-regularized latent factor model for highly accurate web service QoS prediction,” IEEE Trans. Serv. Comput., 2019. DOI: 10.1109/TSC.2019.2961895
    [54]
    D. Wu, X. Luo, M. S. Shang, Y. He, G. Y. Wang, and M. C. Zhou, “A deep latent factor model for high-dimensional and sparse matrices in recommender systems,” IEEE Trans. Syst. Man Cybernet.: Syst., 2019. DOI: 10.1109/TSMC.2019.2931393
    [55]
    R. Gong, W. Li, Y. H. Chen, and L. Van Gool, “DLOW: Domain flow for adaptation and generalization,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Long Beach, CA, 2019, pp. 2472–2481.
    [56]
    A. Achille and S. Soatto, “Emergence of invariance and disentanglement in deep representations,” in Proc. Information Theory and Applications Workshop, San Diego, CA, USA, 2018, pp. 1–9.
    [57]
    H. Kim and A. Mnih, “Disentangling by factorising,” in Proc. 35th Int. Conf. Machine Learning, Stockholm, Sweden, 2018, pp. 2649–2658.
    [58]
    A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, “Adversarial autoencoders,” arXiv preprint arXiv: 1511.05644, 2015.
    [59]
    M. Mathieu, J. B. Zhao, P. Sprechmann, A. Ramesh, and Y. LeCun, “Disentangling factors of variation in deep representations using adversarial training,” in Proc. 30th Int. Conf. Neural Information Processing Systems, Red Hook, NY, United States, 2016, pp. 5047–5055.
    [60]
    X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets,” in Proc. 30th Int. Conf. Neural Information Processing Systems, Red Hook, NY, United States, 2016, pp. 2180–2188.
    [61]
    C. Donahue, Z. C. Lipton, A. Balsubramani, and J. McAuley, “Semantically decomposing the latent spaces of generative adversarial networks,” arXiv preprint arXiv: 1705.07904, 2018.
    [62]
    M. Lopez-Antequera, R. Gomez-Ojeda, N. Petkov, and J. Gonzalez-Jimenez, “Appearance-invariant place recognition by discriminatively training a convolutional neural network,” Pattern Recogn. Lett., vol. 92, pp. 89–95, Jun. 2017.
    [63]
    H. J. Hu, H. S. Wang, Z. Liu, C. G. Yang, W. D. Chen, and L. Xie, “Retrieval-based localization based on domain-invariant feature learning under changing environments,” in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems, Macau, China, 2019, pp. 3684–3689.
    [64]
    L. Tang, Y. Wang, Q. H. Luo, X. Q. Ding, and R. Xiong, “Adversarial feature disentanglement for place recognition across changing appearance,” in Proc. IEEE Int. Conf. Robotics and Automation, Paris, France, 2020, pp. 1301–1307.
    [65]
    S. Hausler, A. Jacobson, and M. Milford, “Filter early, match late: Improving network-based visual place recognition,” in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems, Macau, China, 2019, pp. 3268–3275.
    [66]
    S. Garg, N. Suenderhauf, and M. Milford, “Don’t look back: Robustifying place categorization for viewpoint- and condition-invariant place recognition,” in Proc. IEEE Int. Conf. Robotics and Automation, Brisbane, QLD, Australia, 2018, pp. 3645–3652.
    [67]
    H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese, “Deep metric learning via lifted structured feature embedding,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 4004–4012.
    [68]
    E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell, “Distance metric learning, with application to clustering with side-information,” in Proc. 15th Int. Conf. Neural Information Processing Systems, Cambridge, MA, United States, 2002, pp. 521–528.
    [69]
    K. Q. Weinberger and L. K. Saul, “Fast solvers and efficient implementations for distance metric learning,” in Proc. 25th Int. Conf. Machine Learning, Helsinki, Finland, 2008, pp. 1160–1167.
    [70]
    R. R. Varior, M. Haloi, and G. Wang, “Gated siamese convolutional neural network architecture for human re-identification,” in Proc. European Conf. Computer Vision, Amsterdam, The Netherlands, 2016, pp. 791–808.
    [71]
    V. Balntas, S. D. Li, and V. Prisacariu, “RelocNet: Continuous metric learning relocalisation using neural nets,” in Proc. European Conf. Computer Vision, Munich, Germany, 2018, pp. 782–799.
    [72]
    E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in Proc. Int. Workshop on Similarity-Based Pattern Recognition, Copenhagen, Denmark, 2015, pp. 84–92.
    [73]
    B. G. V. Kumar, G. Carneiro, and I. Reid, “Learning local image descriptors with deep siamese and triplet convolutional networks by minimizing global loss functions,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 5385–5394.
    [74]
    M. Cummins and P. Newman, “Fab-map: Probabilistic localization and mapping in the space of appearance,” Int. J. Robot. Res., vol. 27, no. 6, pp. 647–665, Jun. 2008.
    [75]
    H. Badino, D. Huber, and T. Kanade, “Visual topometric localization,” in Proc. IEEE Intelligent Vehicles Symp., 2011.
    [76]
    W. Maddern, G. Pascoe, C. Linegar, and P. Newman, “1 year, 1000 km: The Oxford RobotCar dataset,” Int. J. Robot. Res., vol. 36, no. 1, pp. 3–15, Jan. 2017.


    Highlights

    • A domain-invariant feature learning framework is proposed with a feature consistency loss
    • A new gradient-weighted similarity activation map is proposed for high-accuracy retrieval
    • A novel self-supervised contrastive learning scheme is proposed with an adaptive triplet loss (see the sketch after this list)
    • Our results are on par with state-of-the-art baselines using an efficient two-stage pipeline
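The adaptive triplet idea in the third highlight can be sketched roughly as a triplet margin that grows when the negative is hard (i.e., close to the anchor). The PyTorch snippet below is only an illustrative guess at such an adaptation rule; the exact margin schedule, hyper-parameters, and mining strategy used in the paper are not reproduced here.

import torch
import torch.nn.functional as F

def adaptive_triplet_loss(anchor, positive, negative, base_margin=0.1, scale=0.5):
    # anchor, positive, negative: (B, D) L2-normalized embeddings.
    d_pos = 1.0 - F.cosine_similarity(anchor, positive)  # distance to the positive
    d_neg = 1.0 - F.cosine_similarity(anchor, negative)  # distance to the negative
    # Hypothetical adaptation: the closer the negative, the larger the margin.
    margin = base_margin + scale * torch.clamp(1.0 - d_neg, min=0.0).detach()
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage with random unit-norm embeddings.
a = F.normalize(torch.randn(8, 128), dim=1)
p = F.normalize(torch.randn(8, 128), dim=1)
n = F.normalize(torch.randn(8, 128), dim=1)
print(adaptive_triplet_loss(a, p, n).item())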
