Citation: L. F. Tang, Y. X. Deng, Y. Ma, J. Huang, and J. Y. Ma, "SuperFusion: A versatile image registration and fusion network with semantic awareness," IEEE/CAA J. Autom. Sinica, vol. 9, no. 12, pp. 2121-2137, Dec. 2022. doi: 10.1109/JAS.2022.106082

SuperFusion: A Versatile Image Registration and Fusion Network with Semantic Awareness

doi: 10.1109/JAS.2022.106082
Funds: Supported by the National Natural Science Foundation of China (62276192, 62075169, 62061160370) and the Key Research and Development Program of Hubei Province (2020BAB113)

Abstract: Image fusion aims to integrate complementary information in source images to synthesize a fused image comprehensively characterizing the imaging scene. However, existing image fusion algorithms are only applicable to strictly aligned source images and cause severe artifacts in the fusion results when the input images have slight shifts or deformations. In addition, the fusion results typically only exhibit good visual quality but neglect the semantic requirements of high-level vision tasks. This study incorporates image registration, image fusion, and the semantic requirements of high-level vision tasks into a single framework and proposes a novel image registration and fusion method, named SuperFusion. Specifically, we design a registration network to estimate bidirectional deformation fields to rectify geometric distortions of input images under the supervision of both photometric and end-point constraints. The registration and fusion are combined in a symmetric scheme, in which mutual promotion can be achieved by optimizing the naive fusion loss and is further enhanced by the mono-modal consistency constraint on the symmetric fusion outputs. In addition, the image fusion network is equipped with a global spatial attention mechanism to achieve adaptive feature integration. Moreover, a semantic constraint based on the pre-trained segmentation model and the Lovasz-Softmax loss is deployed to guide the fusion network to focus more on the semantic requirements of high-level vision tasks. Extensive experiments on image registration, image fusion, and semantic segmentation tasks demonstrate the superiority of our SuperFusion compared to state-of-the-art alternatives. The source code and pre-trained model are publicly available at https://github.com/Linfeng-Tang/SuperFusion.

     

A single-type device usually fails to offer a comprehensive description of the entire imaging scenario due to the limitations of the shooting environment or hardware [1]. To assist humans and machines in comprehensively perceiving the imaging scene, image fusion, an essential image enhancement technique, extracts complementary and useful information from multiple source images and then synthesizes a fused image [2], [3], [4], [5]. Among all image fusion tasks, infrared and visible image fusion has been extensively investigated and has promising prospects due to the unique characteristics of these two modalities. More specifically, the visible camera yields digital images with abundant texture detail by capturing the reflected light from the surface of objects. However, the visible camera fails to effectively capture significant objects (e.g., pedestrians and vehicles) under complex circumstances such as night, occlusion, camouflage, and smoke. Fortunately, the infrared sensor generates images by collecting thermal radiation information from objects, which can defend against interference from extreme environments. Therefore, infrared images are capable of presenting significant targets with high contrast but usually fail to characterize the texture information effectively. The complementary nature of visible and infrared images encourages researchers to integrate useful information from multi-modal images into a single fused image to facilitate human perception and subsequent visual applications, such as object detection, object tracking, semantic segmentation, security surveillance, and military reconnaissance [2], [6], [7].

    In the past decades, numerous infrared and visible image fusion algorithms that focus on boosting visual quality have been proposed. Typically, these fusion approaches can be divided into traditional fusion frameworks and deep learning-based frameworks. More specifically, traditional fusion frameworks involve broadly five categories, i.e., multi-scale transformation (MST)-based [8], [9], [10], sparsity representation (SR)-based [11], subspace-based [12], saliency-based [13], and optimization-based [14], [15] fusion methods. Similarly, the deep learning-based frameworks can be classified into four categories, i.e., Autoencoder (AE)-based [16], [17], [18], Convolutional Neural Network (CNN)-based [19], [20], [21], [22], Generative Adversarial Network (GAN)-based [23], [24], [25], [26], and Transformer-based [27], [28], [29] methods. In particular, with the rapid boom of deep learning techniques, deep learning-based algorithms dominate the development of image fusion.

    Although the existing infrared and visible image fusion methods have achieved satisfactory results, they seldom take into account the following two inevitable issues in the real-world scenario:

1) The existing fusion algorithms, both traditional and deep learning-based ones, are sensitive to misalignment in source images. Infrared and visible images generally suffer from varying degrees of misalignment in practice due to differences in imaging principles. Once there are offsets or deformations in source images, the fusion results inevitably suffer from artifacts, as presented in Fig. 1. The latest work, i.e., UMF-CMGR [30], attempts to eliminate the effect of minor deformations in the source images on the fusion results by implementing multi-modal image registration through transferring visible images into the infrared domain. However, its registration performance is limited, as shown in Fig. 1, because the supervision of registration is built on the quality of the transfer. In contrast, we directly construct photometric constraints between registered images and their ground truth. Moreover, for better robustness to noise, we further employ an end-point loss to constrain the estimated deformation field in a bidirectional flow estimation scheme. In this way, image fusion can be promoted by satisfactory registration results. In turn, it is possible for the fusion to benefit the registration during joint training, because the fusion loss, which requires general similarity between the fusion results and the input images, especially in salient structures, implicitly forces the inputs to be similar, i.e., aligned. To further enhance this positive feedback from fusion, we propose a symmetric joint registration and fusion scheme, in which the two fused images are expected to be identical if the registration is accurate. Consequently, we simply employ a mono-modal similarity measurement to constrain the symmetric fusion outputs to fulfill the mutual promotion between registration and fusion.

Figure  1.  An illustration of infrared and visible image registration, fusion, and segmentation. The first row shows the source images and fused images, and the second row shows the segmentation results. The existing methods can satisfy only one of the registration and semantic requirements and thus fail to facilitate high-level vision tasks in the presence of misalignment in the source images. In contrast, the proposed method can eliminate the impact of misalignment and effectively boost the performance of high-level vision tasks.

2) The existing fusion approaches barely consider how to facilitate high-level vision tasks. Most fusion approaches focus only on the visual quality of fused images while ignoring the contribution of fusion results to high-level vision tasks, e.g., semantic segmentation and object detection. Some researchers have noticed this drawback and proposed practical solutions, such as SeAFusion [22] and TarDAL [25]. However, as preliminary explorations, the aforementioned schemes only design simple loss functions to guide the training of fusion networks, which leaves room for further improvement. From Fig. 1, it can be observed that the segmentation model fails to yield accurate segmentation results from the fused images generated by SeAFusion and TarDAL. In particular, while focusing on increasing the contrast of fused objects to facilitate automatic machine detection, TarDAL neglects the overall visual quality of fused images. To address the limitations of the prior works, we design a more elaborate semantic constraint to enforce the fusion network to preserve semantic information as much as possible during the fusion process to facilitate subsequent high-level vision tasks.

    Furthermore, we introduce a global spatial attention module (GSAM) to improve the perceptual effect of fusion results. GSAM can adaptively estimate appropriate weights for infrared and visible features by perceiving global structural information from four directions. It is worth emphasizing that we take registration, fusion, and semantic requirements into a single framework to meet the need of practical fusion. As shown in Fig. 1, the proposed method is capable of obtaining satisfactory fusion results and segmentation results even if there is a severe misalignment in source images.

    To be concrete, this work has the following contributions:

    We model image registration, image fusion, and high-level semantic requirements uniformly into a single framework. To the best of our knowledge, this is the first practical image fusion method that adequately considers the prerequisites of image fusion (i.e., image registration) and the subsequent applications of image fusion.

    We design a symmetric bi-directional image registration module to effectively perform multi-modal image alignment. Specifically, the property of symmetry allows our approach to achieve the mutual promotion of image fusion and image registration.

    A semantic constraint based on semantic segmentation is introduced to prompt the fusion network to respond to the demands of high-level vision tasks. Moreover, a global spatial attention module is embedded into the fusion network to achieve adaptive feature integration.

    Extensive experiments demonstrate the superiority of our method compared to state-of-the-art alternatives. In particular, our approach can accomplish unaligned image fusion while facilitating performance improvement for high-level vision tasks.

    The rest of the paper is organized as follows. Section Ⅱ summarizes some relevant research to the proposed method. Section Ⅲ provides a detailed discussion of our SuperFusion. In Section Ⅳ, we present some qualitative and quantitative results on the image registration, image fusion, and semantic segmentation tasks, as well as perform the ablation study to verify the effectiveness of specific designs. Some concluding remarks are presented in Section Ⅴ.

    Image fusion, cross-modal image registration, and recurrent neural network are three of the most relevant techniques to our work. We review some representative researches to introduce their developments in this section.

    In recent years, image fusion, especially typical image fusion aimed at improving image perception, continues to attract increasing attention. Typical image fusion methods involve the following four categories.

    1) AE-based Image Fusion Methods: Deep learning dominates the image fusion field owing to its robust feature extraction and information integration capabilities. The autoencoder (AE)-based image fusion scheme is a vital branch of applying deep learning to image fusion, which employs the encoder and decoder to implement two key processes in image fusion, i.e., feature extraction and image reconstruction. Then, feature fusion is usually accomplished using hand-crafted fusion strategies, such as element-wise average, element-wise addition, and element-wise weighting. DenseFuse [16] is the pioneer of AE-based fusion approaches, which introduces dense blocks in the encoder to achieve effective feature extraction and information utilization. Subsequently, Li et al. further introduced nest connection and attention mechanism to improve the fusion performance [31], [32]. In addition, Xu et al. [33], [34] and Zhao et al. [35] promoted the interpretability of AE-based fusion methods via devising multiple encoders to extract specific features. It is worth noting that the hand-crafted fusion strategies hinder the further improvement of AE-based image fusion approaches. For this purpose, Xu et al. designed a classification saliency-based fusion rule to selectively integrate the features extracted by the encoder from infrared and visible images [17].

    2) CNN-based Image Fusion Methods: As another branch of deep learning-based methods, the convolutional neural network (CNN)-based approach relies on elaborate network structures and loss functions to implement feature extraction, fusion, and image reconstruction in an end-to-end manner. Zhao et al. devised a multi-realm image fusion method using CNN-based realm-specific and realm-general feature representation learning [36]. Incorporating the characteristics of infrared and visible image fusion, Tang et al. designed an illumination-aware loss to guide the fusion model to implement round-the-clock image fusion according to the lighting conditions [37]. The focus of the above methods is on the design of the learning paradigms or loss functions. In terms of the network structure, RXDNFuse develops an aggregated residual dense network to achieve more effective feature extraction and information preservation [38]. Moreover, Liu et al. proposed a flexible fusion method that can search effective architecture dynamically according to various modality principles and fusion mechanisms to avoid target blurring and texture detail loss [39]. It is instructive to note that different image fusion tasks usually have certain commonalities and complementarities, which enables the possibility of solving multiple image fusion tasks in a unified framework. Zhang et al. defined the image fusion problem as proportional maintenance of gradient and intensity. They first manually assigned appropriate weights for sub-losses according to the characteristics of different fusion missions [40], and further developed an adaptive decision block to dynamically assign weights for sub-loss terms to cope with the complex scenario [41]. In addition, Xu et al. developed a unified model for multi-fusion missions, which can achieve cross-fertilization between different fusion tasks [42], [43]. Similarly, researchers further developed a great number of general image fusion models (e.g., IFCNN [44], SGFusion [45]) to fully take advantage of their convenient potential.

    3) Transformer-based Image Fusion Methods: It is worth mentioning that the receptive field of the convolutional neural network is limited, which cannot effectively exploit the global context in the source images to assist information aggregation. Therefore, researchers attempted to leverage Transformer to model long-range dependencies to sufficiently merge complementary information from input images. Fu et al. first attempted to leverage a pure Transformer architecture for the image fusion tasks [46]. They developed a Patch Pyramid Transformer (PPT) to capture local correlation information of neighboring pixels and non-local information from the entire image. Considering the advantages of CNNs in modeling local dependencies, most Transformer-based fusion methods (e.g., CGTF [29], SwinFusion [27], and YDTR [28]) first utilize CNN to extract shallow features and then employ Transformer to mine global interactions to facilitate adequate information integration. In addition, some works such as IFT [47] and DNDT [48] leverage the intrinsic of Transformer, i.e., the self-attention and cross-attention mechanism, to design novel fusion rules, thus taking advantage of the global interaction during the feature fusion phase.

    4) GAN-based Image Fusion Methods: Considering the lack of ground truth as a reference for infrared and visible image fusion, researchers also introduced generative adversarial mechanisms to impose stronger constraints for fusion networks from the perspective of probability distributions. FusionGAN first models the image fusion task as the adversarial game between a generator and a discriminator, where the discriminator forces the generator to retain more texture details from visible images [23], [49]. However, a single discriminator is prone to modal imbalance. Thus, subsequent works, such as DDcGAN [50], AttentionFGAN [51], SDDGAN [52], and GAN-FM [53] design dual discriminators to maintain modal balance.

Although the above typical approaches can generate satisfactory fusion results, they ignore some real-world challenges, such as unaligned image fusion and the requirements of high-level vision tasks. Fortunately, some studies provide preliminary explorations of the aforementioned challenges. Wang et al. are the first to realize that slight misalignment in source images may introduce severe ghosts in fusion results [30]. Thus, they proposed a cross-modality perceptual style transfer module to transfer visible images into the infrared domain. Then, a multi-level refinement registration is deployed to align infrared images to pseudo-infrared images. Finally, aligned infrared images and visible images are fed into the dual-path interaction fusion module to complete the complementary information integration. Their method (i.e., UMF-CMGR) can mitigate the effects of minor deformations in input images. However, the fusion results synthesized by UMF-CMGR still suffer from artifacts when the misalignment of the source images is severe. In particular, UMF-CMGR only considers unaligned image fusion but ignores the semantic requirements of subsequent applications.

    Of course, there are some works (e.g., SeAFusion [22] and TarDAL [25]) that take into account the demands of high-level vision tasks and design semantic constraints to guide the fusion network to inject more semantic information into the fused images. More specifically, they interface a high-level model following the fusion network to measure semantic information contained in fusion results, and losses of the high-level model are utilized to guide the training of the fusion network through gradient back-propagation. SeAFusion and TarDAL perform semantic measurements using the semantic segmentation model and the target detection model, respectively. In addition, Liu et al. injected the semantic-driven idea into the GAN-based framework and employed a segmentation network as the discriminator to synthesize more meaningful fusion results from the perspective of the segmentation task [54]. They also adopted image fusion as an auxiliary task to assist the semantic segmentation task, where image fusion is deployed as an additional regularization for feature learning [55]. The above semantic-driven approaches can improve the performance of high-level tasks on fusion results yet fail to implement unaligned image fusion. Therefore, these methods still have limitations in practical applications. For this purpose, we adequately consider the prerequisites and application requirements of image fusion in this work and incorporate image registration, image fusion, and semantic requirements of high-level vision tasks into a single framework.

Compared to mono-modal image registration, cross-modal image registration algorithms, including both traditional and deep learning ones, develop slowly due to the severe appearance variance across different modalities [56]. Traditional approaches began by presupposing a transformation model, e.g., affine transformation [57] and free-form deformation [58], and then obtained the model parameters by minimizing metrics that measure the similarity between the target image and the moved source image, e.g., the normalized correlation coefficient (NCC) [59] and mutual information (MI) [60]. However, those traditional dense matching methods are sensitive to noise. To tackle this problem, some cross-modal sparse features, including MFP [61], DASC [62], and RIFT [63], focus on extracting interest points and encoding the local information into features, which significantly improves the robustness of matching. Nevertheless, those sparse features cannot handle free-form deformation. Therefore, dense matching methods, i.e., flow estimation, still attract many researchers.

Recently, VoxelMorph [64] introduced a deep neural network to estimate the cross-modal flow without supervision, employing only NCC as the loss function. Nemar [65] proposes an unsupervised joint translation and registration framework, in which the supervision is constructed on the transfer domain. CrossRAFT [66] generalizes a single-modal flow estimator to cross-modal tasks with data augmentation and knowledge distillation. However, none of those unsupervised methods really faces the open problem in cross-modal image registration: how to build supervision under the severe appearance variance across modalities. By contrast, supervised methods can address the problem in an end-to-end manner. Wang et al. [67] extended a traditional matching method with neural networks in a supervised manner. The registration module of UMF-CMGR [30] is only supervised by similarity constraints between registered infrared images and transferred images, which is sensitive to noise. To tackle those problems, we introduce both supervised flow and photometric constraints for robust registration training.

Recurrent neural networks (RNNs), which specialize in modeling long-range dependencies and are easy to train, have made a splash in the field of natural language processing [68]. When employed in computer vision, two sequential RNN layers can guarantee that information is efficiently propagated across the whole feature map, thus fully integrating the global contexts. Bell et al. [69] first designed a two-round four-directional RNN architecture to exploit global context information to improve the performance of small object detection. Given the feature h_{i,j} at pixel (i, j), one round of data translations in the four-directional RNN can be formulated as:

\begin{equation} h_{i,j}=\max(\alpha_{dir}\, h_{i,j-1}+h_{i,j},\ 0), \end{equation} (1)

where \alpha_{dir} indicates the weight parameter in the recurrent translation layer for each direction. Specifically, all weights are first initialized as an identity matrix and then adaptively updated during the training process. Moreover, the procedure of the two-round four-directional RNN is summarized in Fig. 2. In the first round, the RNN is utilized to collect neighboring interactions for each pixel of the input features. Then, the RNN in the second round further captures non-local contexts to generate feature maps with global perception. It is worth mentioning that the direction-aware and global information aggregation properties of the two-round four-directional RNN give it a significant advantage in shadow detection and raindrop detection. Thus, Hu et al. [70] and Wang et al. [71] introduced the two-round four-directional RNN to the shadow removal and single-image deraining tasks, respectively. Recently, Xu et al. [34] introduced the four-directional spatial variant RNN to the image fusion task, where it is deployed to model the internal structure of source images in an explicit manner. In this work, the four-directional RNN is incorporated into the global spatial attention module to adaptively allocate fusion weights for infrared and visible features.
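To make the recurrent translation of (1) concrete, the following PyTorch sketch implements one round of four-directional propagation in the spirit of Bell et al. [69]. The per-direction 1x1 recurrent weight (identity-initialized) and the channel-wise concatenation of the four directional outputs are implementation assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FourDirectionalRNN(nn.Module):
    """One round of four-directional recurrent translation, a sketch of Eq. (1)."""
    def __init__(self, channels: int):
        super().__init__()
        # One recurrent weight alpha_dir per direction, realized as a 1x1 convolution
        self.alpha = nn.ModuleList([nn.Conv2d(channels, channels, 1, bias=False)
                                    for _ in range(4)])
        for conv in self.alpha:  # identity initialization, as described in the text
            nn.init.eye_(conv.weight.view(channels, channels))

    @staticmethod
    def _scan(x, alpha, dim, reverse):
        # h[t] = max(alpha(h[t-1]) + h[t], 0), scanning along spatial dimension `dim`
        order = list(range(x.size(dim)))
        if reverse:
            order = order[::-1]
        h_prev, outs = None, [None] * x.size(dim)
        for t in order:
            h_t = x.select(dim, t)
            if h_prev is not None:
                h_t = F.relu(alpha(h_prev.unsqueeze(dim)).squeeze(dim) + h_t)
            outs[t] = h_t
            h_prev = h_t
        return torch.stack(outs, dim=dim)

    def forward(self, x):
        # Left-to-right, right-to-left, top-to-bottom, bottom-to-top passes
        dirs = [(3, False), (3, True), (2, False), (2, True)]
        feats = [self._scan(x, a, d, r) for a, (d, r) in zip(self.alpha, dirs)]
        return torch.cat(feats, dim=1)  # concatenate the four directional responses
```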

Figure  2.  Illustration of how the two-round four-directional RNN architecture integrates global contextual information in two stages. In the first round, four-directional recurrent convolutions are employed to collect horizontal and vertical neighborhood information for each position in the input feature maps. In the second round, the contextual information from the entire feature maps is gathered by repeating the previous operations.

Given a strictly registered pair of visible image I_{vi}\in\mathbb{R}^{H \times W \times 3} and infrared image I_{ir}\in\mathbb{R}^{H \times W \times 1}, image fusion aims to integrate complementary information of both into a single fused image I_f\in\mathbb{R}^{H \times W \times 3}. However, it is impossible to directly capture aligned images for fusion because of the different extrinsic and intrinsic parameters of the cameras. Additionally, cross-modality binocular photography may suffer from shakes, latency, and special noise. In particular, the infrared camera may be severely interfered with by internal temperature and external hot airflow. Thereby, it is urgent to consider the misalignment between input infrared and visible images in real-world fusion. Taking the visible image I_{vi} and the moved infrared image I_{ir}' as inputs, our method first yields the registered infrared image I_{ir}^{reg} with a registration network \mathcal{N}_R, as illustrated in Fig. 3. More specifically, a novel dense matcher (DM) is devised to estimate the infrared-to-visible deformation field \phi_{ir\rightarrow vi}\in\mathbb{R}^{H \times W \times 2}, which is formulated as:

\begin{equation} \phi_{ir\rightarrow vi}=DM(I_{vi}, I_{ir}'). \end{equation} (2)
Figure  3.  The overall framework of the proposed SuperFusion for cross-modal image registration and fusion. I_{ir}' and I_{vi}' indicate the moved infrared and visible images, respectively; \phi_{ir\rightarrow vi} and \phi_{vi\rightarrow ir} denote the deformation field from the moved infrared image to the fixed visible image and the deformation field from the moved visible image to the fixed infrared image, respectively; \mathcal{R} indicates the re-sampling operation; I_{ir}^{reg} and I_{vi}^{reg} represent the registered infrared and visible images, respectively; y_{seg}^1 and y_{seg}^2 denote the segmentation results.

In particular, each element \phi_{ir\rightarrow vi}[i,j]=(\Delta x, \Delta y)\in\mathbb{R}^2 represents the deformation offset for the corresponding pixel of the unaligned infrared image. The registered infrared image can be achieved by re-sampling the unaligned infrared image with the estimated deformation field, which can be expressed as:

\begin{equation} I_{ir}^{reg}=\mathcal{R}(I_{ir}', \phi_{ir\rightarrow vi}), \end{equation} (3)

where \mathcal{R} indicates the re-sampler. The re-sampling process can be briefly described as:

\begin{equation} I_{ir}^{reg}[i,j]=I_{ir}'\left[i+\phi_{ir\rightarrow vi}(i,j,1),\ j+\phi_{ir\rightarrow vi}(i,j,2)\right]. \end{equation} (4)
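As a concrete illustration of the re-sampler \mathcal{R} in (3) and (4), the sketch below warps an image with a dense deformation field using bilinear interpolation via torch.nn.functional.grid_sample. The (N, H, W, 2) flow layout with horizontal offsets in the first channel, and the border padding mode, are assumptions.

```python
import torch
import torch.nn.functional as F

def resample(image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp `image` (N, C, H, W) with a dense deformation field `flow` (N, H, W, 2),
    whose last dimension holds (horizontal, vertical) offsets in pixels."""
    _, _, h, w = image.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=image.device, dtype=image.dtype),
        torch.arange(w, device=image.device, dtype=image.dtype),
        indexing="ij")
    # Shift the base sampling grid by the flow, then normalize to [-1, 1]
    gx = 2.0 * (xs.unsqueeze(0) + flow[..., 0]) / max(w - 1, 1) - 1.0
    gy = 2.0 * (ys.unsqueeze(0) + flow[..., 1]) / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)
    return F.grid_sample(image, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```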

After that, the registered infrared image I_{ir}^{reg} and the visible image I_{vi} are fed into the fusion network \mathcal{N}_F to synthesize the fused image I_f^1 as:

\begin{equation} I_f^1=\mathcal{N}_F(I_{vi}, I_{ir}^{reg}). \end{equation} (5)

    It is instructive to note that we design a global spatial attention module (GSAM) in the fusion network to implement adaptive feature fusion. GSAM leverages two-round four-directional RNN that fully exploits the global context in features and assigns appropriate fusion weights for features, thus integrating more meaningful information into the fusion results.

Finally, the fusion result is fed into the segmentation network \mathcal{N}_S, whose output is the predicted class probability y_{seg}^1, to implement the semantic information measurement, which is formulated as:

\begin{equation} y_{seg}^1=\mathcal{N}_S(I_f^1). \end{equation} (6)

It is worth noting that infrared and visible image registration is a challenging mission due to the vast modal variance. Fortunately, two symmetric fused images could eliminate the modal variance and thus provide appropriate pixel-level supervision for multi-modal image registration. In particular, the fused image is a public domain of infrared and visible images, which not only eliminates modal variance but also contains more information. Therefore, we further develop a symmetric image registration and fusion framework to obtain symmetric fused images, as presented in Fig. 3. More specifically, the infrared image I_{ir} and the moved visible image I_{vi}' are used as the inputs for the other branch. The specific registration and fusion processes are shown in (7) and (8):

\begin{equation} \phi_{vi\rightarrow ir}=DM(I_{ir}, I_{vi}'),\quad I_{vi}^{reg}=\mathcal{R}(I_{vi}', \phi_{vi\rightarrow ir}), \end{equation} (7)
\begin{equation} I_f^2=\mathcal{N}_F(I_{vi}^{reg}, I_{ir}). \end{equation} (8)

Moreover, the fusion result I_f^2 is also fed into the segmentation network to perform the semantic measurement, which is presented as:

\begin{equation} y_{seg}^2=\mathcal{N}_S(I_f^2). \end{equation} (9)

    It is notable that the color visible image is first converted into the YCbCr space from the RGB space. Then, the Y (luminance) channel of the visible image and the grayscale infrared image are fed into the fusion model. The output of the fusion network is the Y channel of the fused image, which is mapped back to the RGB space along with Cb and Cr (chrominance) channels of the visible images to obtain the color fused image.
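The color handling described above can be sketched as follows. The full-range BT.601 conversion coefficients and the wrapper function names are assumptions for illustration; the paper only specifies that the Y channel of the visible image is fused and then recombined with the Cb and Cr channels.

```python
import torch

def rgb_to_ycbcr(rgb: torch.Tensor) -> torch.Tensor:
    """RGB (N, 3, H, W) in [0, 1] -> YCbCr with full-range BT.601 coefficients."""
    r, g, b = rgb[:, 0:1], rgb[:, 1:2], rgb[:, 2:3]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 0.5
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 0.5
    return torch.cat((y, cb, cr), dim=1)

def ycbcr_to_rgb(ycbcr: torch.Tensor) -> torch.Tensor:
    y, cb, cr = ycbcr[:, 0:1], ycbcr[:, 1:2] - 0.5, ycbcr[:, 2:3] - 0.5
    r = y + 1.402 * cr
    g = y - 0.344136 * cb - 0.714136 * cr
    b = y + 1.772 * cb
    return torch.cat((r, g, b), dim=1).clamp(0.0, 1.0)

def fuse_color(fusion_net, vis_rgb: torch.Tensor, ir_gray: torch.Tensor) -> torch.Tensor:
    """Fuse the visible Y channel with the infrared image, then restore color."""
    ycbcr = rgb_to_ycbcr(vis_rgb)
    y_fused = fusion_net(ycbcr[:, 0:1], ir_gray)   # the fusion network operates on Y only
    return ycbcr_to_rgb(torch.cat((y_fused, ycbcr[:, 1:3]), dim=1))
```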

Our SuperFusion incorporates image registration, image fusion, and semantic requirements into a unified framework. In order to implement each component more effectively, we elaborately design loss functions to guide the training of the related networks.

1) Loss Function of Image Registration: Photometric error and end-point error are the most common losses for deformation field estimation, and both can benefit the performance. While the end-point loss forces the estimator to regress the flow in smooth areas, the photometric loss improves the accuracy of both quantitative and qualitative results in textured areas. In the case of cross-modality image registration, there are always severe appearance variances that require flow supervision, for example, scattered lights in visible images and lightened people in infrared images. However, the discriminative information provided by these peculiar areas confuses the flow estimator and enlarges the loss, so the flow estimator is forced to focus on those areas and neglect the common areas that are rich in texture. Therefore, to keep the precision in textured areas, it is also necessary to supervise the flow learning with photometric constraints.

Specifically, given a pair of aligned infrared and visible images (I_{ir}, I_{vi}), we conduct a synthetic transform \phi_{gt} on I_{ir} to generate a misaligned infrared image I_{ir}'. To register I_{ir}' to I_{vi}, we estimate the flow \phi_{ir\rightarrow vi} with the dense matcher and apply it to I_{ir}' to generate the registered infrared image I_{ir}^{reg}, as shown in (2) and (3). Then, the photometric loss is constructed between I_{ir}^{reg} and its ground truth I_{ir} as:

\begin{equation} \mathcal{L}_{PH}(I_{ir}^{reg}, I_{ir})=\frac{1}{HW}\sum\limits_{i,j}^{H,W}\left\| I_{ir}^{reg}(i,j)-I_{ir}(i,j)\right\|_1, \end{equation} (10)

where \left\| \cdot \right\|_1 denotes the l_1 -norm.

To construct the end-point loss, we have to invert \phi_{gt}. However, obtaining the inverse flow requires a complex re-sampling strategy [72]. Moreover, the re-sampling strategy cannot handle all situations and might introduce noise according to the Nyquist sampling theorem. Therefore, instead of computing the inverse flow of the ground truth, we believe a dense matcher should be able to estimate bidirectional flows if the input features are adaptive to modalities, and we force the dense matcher to estimate the inverse flow \phi_{vi\rightarrow ir}. Then the end-point loss can be constructed between \phi_{vi\rightarrow ir} and its ground truth \phi_{gt} as:

\begin{equation} \mathcal{L}_{EP}(\phi_{vi\rightarrow ir}, \phi_{gt})=\frac{1}{HW}\sum\limits_{i,j}^{H,W}\left\| \phi_{vi\rightarrow ir}(i,j)-\phi_{gt}(i,j)\right\|_2, \end{equation} (11)

where \left\| \cdot \right\|_2 denotes the l_2 -norm. Moreover, to better constrain \phi_{vi\rightarrow ir}, we warp I_{ir} with \phi_{vi\rightarrow ir} to generate \hat{I}_{ir}, and then construct a photometric loss \mathcal{L}_{PH}(\hat{I}_{ir}, I_{ir}') as in (10).

After registration, the registered image pair I_{ir}^{reg} and I_{vi} would be fed into the fusion network. However, if I_{ir}^{reg} is not registered well, the fusion loss (detailed in the subsequent subsection) that is constructed between I_{ir}^{reg} and I_f^1 would be hard to optimize, and the fusion module would prefer to minimize the loss between I_{vi} and I_f^1. Consequently, the loss would converge to an unpredictable situation and mutual promotion would not be achieved. That is one reason why we propose a symmetric architecture as shown in Fig. 3, which yields the symmetric outputs I_{vi}^{reg}, \phi_{ir\rightarrow vi}, and \hat{I}_{vi}, together with the losses \mathcal{L}_{PH}(I_{vi}^{reg}, I_{vi}), \mathcal{L}_{EP}(\phi_{ir\rightarrow vi}, \phi_{gt}), and \mathcal{L}_{PH}(\hat{I}_{vi}, I_{vi}') as mentioned above. In this symmetric scheme, the optimization of registration and fusion would be balanced.

    To sum up, the final photometric loss and end-point loss are formulated in (12) and (13), respectively:

\begin{equation} \begin{aligned} \mathcal{L}_{PH} &= \mathcal{L}_{PH}(I_{ir}^{reg}, I_{ir})+\mathcal{L}_{PH}(\hat{I}_{ir}, I_{ir}') \\ &+ \mathcal{L}_{PH}(I_{vi}^{reg}, I_{vi})+\mathcal{L}_{PH}(\hat{I}_{vi}, I_{vi}'), \end{aligned} \end{equation} (12)
\begin{equation} \mathcal{L}_{EP}=\mathcal{L}_{EP}(\phi_{vi\rightarrow ir}, \phi_{gt})+\mathcal{L}_{EP}(\phi_{ir\rightarrow vi}, \phi_{gt}). \end{equation} (13)

Furthermore, our symmetric branches output two fusion results I_f^1 and I_f^2, which are considered to be the same if the registration is well finished. Therefore, we develop a consistency constraint loss \mathcal{L}_{CC} as:

\begin{equation} \mathcal{L}_{CC}=\frac{1}{HW}\sum\limits_{i,j}^{H,W}\left\| I_f^1(i,j)-I_f^2(i,j)\right\|_1. \end{equation} (14)

Such a constraint is believed to make the fusion and registration promote each other, which is another reason why the symmetric scheme is designed.

    Finally, the full objective function of the image registration network is a weighted sum of the photometric loss, end-point loss, and consistency constraint loss, which is defined as:

\begin{equation} \mathcal{L}_{Reg}=\mathcal{L}_{PH}+\alpha_1 \cdot \mathcal{L}_{EP}+\alpha_2 \cdot \mathcal{L}_{CC}, \end{equation} (15)

where \alpha_1 and \alpha_2 are hyper-parameters to trade off each component of the registration loss.
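The registration objective in (10)-(15) can be summarized in the short sketch below. The dictionary keys and the use of a plain mean over all elements (rather than an explicit 1/HW normalization per channel) are illustrative assumptions; the default \alpha_1 and \alpha_2 follow the values reported later in the implementation details.

```python
import torch

def photometric_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # L_PH in Eq. (10): mean absolute (l1) error between a registered image
    # and its aligned ground truth
    return torch.mean(torch.abs(pred - target))

def end_point_loss(flow_pred: torch.Tensor, flow_gt: torch.Tensor) -> torch.Tensor:
    # L_EP in Eq. (11): mean l2 distance between predicted and ground-truth offsets;
    # flows are assumed to store the two offset channels in the last dimension
    return torch.mean(torch.norm(flow_pred - flow_gt, p=2, dim=-1))

def consistency_loss(fused_1: torch.Tensor, fused_2: torch.Tensor) -> torch.Tensor:
    # L_CC in Eq. (14): the two symmetric fusion outputs should coincide
    return torch.mean(torch.abs(fused_1 - fused_2))

def registration_loss(t: dict, alpha1: float = 0.1, alpha2: float = 0.1) -> torch.Tensor:
    """L_Reg in Eq. (15); `t` gathers the tensors of both branches (keys illustrative)."""
    l_ph = (photometric_loss(t["ir_reg"], t["ir"])              # L_PH(I_ir^reg, I_ir)
            + photometric_loss(t["ir_rewarped"], t["ir_moved"])
            + photometric_loss(t["vi_reg"], t["vi"])            # L_PH(I_vi^reg, I_vi)
            + photometric_loss(t["vi_rewarped"], t["vi_moved"]))
    l_ep = (end_point_loss(t["flow_vi2ir"], t["flow_gt"])
            + end_point_loss(t["flow_ir2vi"], t["flow_gt"]))
    l_cc = consistency_loss(t["fused_1"], t["fused_2"])
    return l_ph + alpha1 * l_ep + alpha2 * l_cc
```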

2) Loss Function of Image Fusion: The fusion model is expected to preserve the structures and abundant texture details of the source images. Thus, we design the SSIM loss \mathcal{L}_{SSIM} and the texture loss \mathcal{L}_{Text} to guide the fusion network \mathcal{N}_F to achieve the above purposes. The SSIM loss is presented as:

\begin{equation} \begin{aligned} \mathcal{L}_{SSIM} = 4 &- \left(SSIM(I_f^1, I_{vi})+SSIM(I_f^1, I_{ir}^{reg})\right) \\ &- \left(SSIM(I_f^2, I_{vi}^{reg})+SSIM(I_f^2, I_{ir})\right), \end{aligned} \end{equation} (16)

where SSIM(\cdot, \cdot) indicates the structural similarity measurement, which measures image distortion from three perspectives, i.e., luminance, contrast, and structure [73].

    The texture details of an image can be characterized by its gradient. Therefore, we calculate the error between the gradient of the fused image and the maximum gradient aggregation of the source image to construct the texture loss as:

\begin{equation} \begin{aligned} \mathcal{L}_{Text} &= \frac{1}{HW}\left\| \left|\nabla I_f^1\right| - \max\left(\left|\nabla I_{vi}\right|, \left|\nabla I_{ir}^{reg}\right|\right)\right\|_1 \\ & + \frac{1}{HW}\left\| \left|\nabla I_f^2\right| - \max\left(\left|\nabla I_{vi}^{reg}\right|, \left|\nabla I_{ir}\right|\right)\right\|_1, \end{aligned} \end{equation} (17)

    where \nabla denotes the Sobel gradient operator, which could measure the gradient of an image; \left| \cdot \right| refers to the absolute operation, and \max(\cdot) indicates the element-wise maximum aggregation operator.

    It is worth noting that the fused image is also expected to integrate intensity information in the source images, especially the significant targets in the infrared image. Therefore, we devise an intensity maximization loss \mathcal{L}_{Int} to guide the fusion network to adaptively integrate intensity information of source images:

    \begin{equation} \begin{aligned} \mathcal{L}_{Int} &= \frac{1}{HW}\left\| I_f^1 - \max(I_{vi}, I_{ir}^{reg})\right\|_1 \\ & + \frac{1}{HW}\left\| I_f^2 - \max(I_{vi}^{reg}, I_{ir})\right\|_1. \end{aligned} \end{equation} (18)
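A minimal sketch of the texture loss in (17) and the intensity loss in (18) for a single branch is given below, assuming single-channel inputs; the Sobel-based gradient aggregation |g_x| + |g_y| is one common choice and may differ from the authors' exact operator.

```python
import torch
import torch.nn.functional as F

def sobel_gradient(img: torch.Tensor) -> torch.Tensor:
    """Absolute Sobel gradient of a single-channel image (N, 1, H, W)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device, dtype=img.dtype).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx, padding=1)
    gy = F.conv2d(img, ky, padding=1)
    return gx.abs() + gy.abs()  # |gx| + |gy|; the exact aggregation is an assumption

def texture_loss(fused, vis, ir):
    # One branch of Eq. (17): fused gradient vs. maximum gradient of the sources
    target = torch.maximum(sobel_gradient(vis), sobel_gradient(ir))
    return torch.mean(torch.abs(sobel_gradient(fused) - target))

def intensity_loss(fused, vis, ir):
    # One branch of Eq. (18): fused intensity vs. element-wise maximum of the sources
    return torch.mean(torch.abs(fused - torch.maximum(vis, ir)))
```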

    The final fusion loss \mathcal{L}_{Fus} employed to guide the training of our fusion model can be summarized as the weighted sum of the above three sub-losses, which is formulated as:

    \begin{equation} \mathcal{L}_{Fus} = \mathcal{L}_{Text} + \beta_1 \cdot \mathcal{L}_{SSIM} + \beta_2 \cdot \mathcal{L}_{Int}, \end{equation} (19)

    where \beta_1 and \beta_2 are employed to control the trade-off between different losses.

    It is worth noting that in the loss functions of the fusion network, I_{vi} and I_f specifically refer to the Y channel of the visible images and fused images, respectively, and I_{ir} denotes the grayscale infrared images.

    3) Semantic-aware Loss Function: We introduce a semantic loss like SeAFusion [22] to prompt the fusion network to adequately consider the requirements of high-level vision tasks. However, SeAFusion employs only the simplest cross-entropy loss to model the semantic requirements, which potentially neglects the category imbalance issue. Thus, we introduce the Lovasz-Softmax loss [74] to calculate the error between the predicted results and ground truth.

    Let y\in \mathbb{R}^{H \times W \times C} denote the network output probability, \hat{y} \in \mathbb{R}^{H \times W \times C} denote one-hot predicted label, and y^* \in \mathbb{R}^{H \times W \times C} denote the corresponding ground truth, where C represents the number of classes. Intersection over union (IoU) is the primary metric to measure the performance of segmentation. The IoU loss for the c -th class between the predicted label \hat {y}_c\in\{0, 1\}^{H\times W} and the ground truth y^*_c\in\{0, 1\}^{H \times W} , can be formulated as:

    \begin{equation} \Delta(\hat{y}_c, y_c^*) = 1-\frac{|y^*_c \cap \hat {y}_c |}{|y^*_c\cup \hat {y}_c|}. \end{equation} (20)

It is promising to directly minimize the IoU loss to achieve better performance, but the IoU is not a differentiable function with respect to the network output probability of the c -th class y_c\in(0, 1)^{H\times W} . Therefore, the Lovasz-Softmax loss, a differentiable surrogate of the IoU loss, is proposed in [74]. To be specific, an error function err(y_c) of the c -th class prediction is first formulated as:

    \begin{equation} err(y_c, y_c^*, i)=\left\{ \begin{aligned} 1-y_c(i), &\ {\text{if}}\ \hat{y}_c(i) \cdot y_c^*(i)==1, \\ y_c(i), &\ {\text{otherwise}}, \end{aligned} \right. \end{equation} (21)

    where i is the index of a pixel. Then the Lovasz-Softmax loss can be described as:

    \begin{equation} \mathcal{L}_{LS}(y, y^*) = \frac{1}{|\mathcal{C}|}\sum\limits_{c\in\mathcal{C}}\overline{\Delta}err(y_c, y_c^*), \end{equation} (22)

    where \mathcal{C} is the set of classes. Specifically, the definition of \overline{\Delta}err(y_c, y_c^*) is provided as follows:

    \begin{equation} \overline{\Delta}err(y_c, y_c^*) = \sum\limits_{i\in{H\times W}}err(y_c, y_c^*, i)\cdot G(y_c, y_c^*, i), \end{equation} (23)

where G(y_c, y_c^*, i) = \Delta(S_c(i), y^*)-\Delta(S_{c}(i-1), y^*) and S_c(i) is the set of segmented pixels of the sorted y_c . Specifically, y_c is sorted such that y_c(0)\geq y_c(1)\geq\cdots\geq y_c(i)\geq\cdots\geq y_c(H\times W) , and then S_c(i)=\{\hat{y}_c(0), \hat{y}_c(1), \cdots, \hat{y}_c(i)\} . More details about the implementation of the Lovasz-Softmax loss can be found in the original paper [74]. Therefore, the semantic-aware loss \mathcal{L}_{Sea} utilized in our work is formulated as:

    \begin{equation} \mathcal{L}_{Sea} = \mathcal{L}_{LS}(y_{seg}^1, y^*) + \mathcal{L}_{LS}(y_{seg}^2, y^*). \end{equation} (24)
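For reference, the Lovasz-Softmax loss of (21)-(23) can be computed per image as in the sketch below, which follows the publicly available reference implementation of [74]; skipping classes absent from the ground truth corresponds to its "present-classes" variant and is an assumption here.

```python
import torch

def lovasz_grad(gt_sorted: torch.Tensor) -> torch.Tensor:
    """Weights G in Eq. (23): gradient of the Lovasz extension of the Jaccard loss."""
    p = gt_sorted.numel()
    gts = gt_sorted.sum()
    intersection = gts - gt_sorted.cumsum(0)
    union = gts + (1.0 - gt_sorted).cumsum(0)
    jaccard = 1.0 - intersection / union
    if p > 1:
        jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard

def lovasz_softmax(probs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """L_LS in Eq. (22) for one image: probs (C, H, W) after softmax, labels (H, W)."""
    losses = []
    for c in range(probs.size(0)):
        fg = (labels == c).float().view(-1)          # ground-truth mask y*_c
        if fg.sum() == 0:
            continue                                  # skip classes absent from the image
        errors = (fg - probs[c].view(-1)).abs()       # pixel-wise errors, cf. Eq. (21)
        errors_sorted, perm = torch.sort(errors, descending=True)
        losses.append(torch.dot(errors_sorted, lovasz_grad(fg[perm])))
    return torch.stack(losses).mean() if losses else probs.sum() * 0.0
```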

    Finally, the full objective function of our SuperFusion is defined as:

    \begin{equation} \mathcal{L}_{Total} = \lambda_1 \cdot\mathcal{L}_{Reg} + \lambda_2 \cdot \mathcal{L}_{Fus} + \lambda_3 \cdot \mathcal{L}_{Sea}, \end{equation} (25)

    where \lambda_1 , \lambda_2 , and \lambda_3 are the hyper-parameters for balancing each component.

1) Architecture of Dense Matcher: The dense matcher, consisting of a pyramid feature extractor and iterative flow estimators, is the key module of the registration network. Firstly, weight-unsharing networks, which contain 4 convolutional layers with output channels [8, 8, 16, 16] , instance normalization, and the Leaky Rectified Linear Unit (Leaky-ReLU), extract full-scale modal-adaptive features \mathcal{F}^0\in\mathbb{R}^{H \times W \times 16} for the input images. Then, \mathcal{F}^0 successively passes through three weight-sharing sub-modules with downsampling to extract \mathcal{F}^1\in\mathbb{R}^{H/2 \times W/2 \times 32} , \mathcal{F}^2\in\mathbb{R}^{H/4 \times W/4 \times 64} , and \mathcal{F}^3\in\mathbb{R}^{H/8 \times W/8 \times 128} . Each sub-module consists of 3 convolutional layers. The output of each layer is normalized by instance normalization and activated by Leaky-ReLU, except for the last one, and the downsampling is implemented with a stride of 2 in the second layer. Subsequently, the features are collected and fed into the flow estimators.

Assume the moved infrared image I_{ir}^\prime and the visible image I_{vi} are the source and target images, respectively. In the i -th flow estimator, we first warp the source feature \mathcal{F}^i_{ir'} with the last estimated deformation field and denote the result by Z^i_{ir'} . Note that, as shown in Fig. 4, the initial flow imported into the 3 -rd flow estimator is 0. Then we can compute the local correlation volume Corr\in\mathbb{R}^{H/2^i\times W/2^i\times 7 \times 7} of Z^i_{ir'} and \mathcal{F}^i_{vi} as:

    \begin{equation} Corr_{ir'\rightarrow vi}^i(j, k, m, n) = \mathcal{F}^i_{vi}(j, k)^T Z^i_{ir'}(j+m, k+n), \end{equation} (26)
    Figure  4.  The architecture of dense matcher, which consists of a pyramid feature extractor and iterative flow estimators. Flows are estimated in three scales iteratively and summed up.

where j, k are the indexes of all pixels at the i -th scale, and m, n\in\{-12, -8, -4, 0, 4, 8, 12\} are the shifts. Next, Corr is reshaped into a feature-like shape and concatenated with Z^i_{ir'} . The following 4 convolutional layers estimate the residual deformations \Delta\phi^i_{{ir'}\rightarrow {vi}}(Corr_{ir'\rightarrow vi}^i , Z^i_{ir'}) with the concatenation as the input. Each convolutional layer is followed by batch normalization and Leaky-ReLU, except for the last one. The numbers of output channels of the first three layers are obtained by floor division of the number of input channels. Finally, the two-channel deformation field at the i -th scale is represented as:

    \begin{equation} \phi^i_{{ir'}\rightarrow {vi}} = (\phi^{i+1}_{{ir'}\rightarrow {vi}}+\Delta\phi^i_{{ir'}\rightarrow {vi}})\uparrow2, \end{equation} (27)

    where \uparrow2 denotes the bilinear upsampling by a factor of two. Note that the first several layers of the pyramid feature extractor are weight-unsharing so that the modal variance can be eliminated in \mathcal{F}^0 . Moreover, it is easy to obtain the inverse flow by inverting the order of inputs of flow estimator layers. The bidirectional estimation can force the \mathcal{F}^0 to be free of modality-specific information.
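The local correlation volume of (26) can be sketched as below, where the warped source feature plays the role of Z^i_{ir'} and the target feature the role of \mathcal{F}^i_{vi}; the (N, 7, 7, H, W) output layout and zero padding at the borders are assumptions.

```python
import torch
import torch.nn.functional as F

def local_correlation(feat_target: torch.Tensor, feat_source_warped: torch.Tensor,
                      shifts=(-12, -8, -4, 0, 4, 8, 12)) -> torch.Tensor:
    """Local correlation volume of Eq. (26): per-pixel dot products between the target
    feature at (j, k) and the warped source feature at (j+m, k+n). Output: (N, 7, 7, H, W)."""
    _, _, h, w = feat_source_warped.shape
    pad = max(abs(s) for s in shifts)
    src = F.pad(feat_source_warped, (pad, pad, pad, pad))  # zero padding at the borders
    rows = []
    for m in shifts:
        cols = []
        for n in shifts:
            shifted = src[:, :, pad + m: pad + m + h, pad + n: pad + n + w]
            cols.append((feat_target * shifted).sum(dim=1))  # dot product over channels
        rows.append(torch.stack(cols, dim=1))
    return torch.stack(rows, dim=1)
```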

    2) Architecture of Image Fusion Network: The architecture of our fusion network \mathcal{N}_F is presented in Fig. 5. The siamese feature extraction modules \mathcal{M}_{E}^{ir} and \mathcal{M}_{E}^{vi} are firstly deployed to extract infrared features \mathcal{F}_{ir}^{reg} and visible features \mathcal{F}_{vi} from source images, respectively:

    \begin{equation} \{\mathcal{F}_{ir}^{reg}, \mathcal{F}_{vi} \} = \{\mathcal{M}_{E}^{ir}(I_{ir}^{reg}), \mathcal{M}_{E}^{vi}(I_{vi})\}. \end{equation} (28)
    Figure  5.  Architecture of the fusion network \mathcal{N}_F . Conv( c, k ) denotes a convolutional layer with c output channels and kernel size of k\times k ; GSAM indicates the global spatial attention module.

    Specifically, \mathcal{M}_{E}^{ir} and \mathcal{M}_{E}^{vi} consist of four convolutional layers followed by the Leaky-ReLU activation function. The kernel size of all convolutional layers is 3 \times 3 , and the stride is 1 .
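A sketch of one feature-extraction branch (\mathcal{M}_{E}^{ir} or \mathcal{M}_{E}^{vi}) following the description above is shown below; the channel widths and the Leaky-ReLU negative slope are assumptions, since they are not specified here.

```python
import torch.nn as nn

def make_extractor(in_channels: int = 1, channels=(16, 32, 64, 128)) -> nn.Sequential:
    """One feature-extraction branch: four 3x3 convolutions with stride 1,
    each followed by Leaky-ReLU, as described above."""
    layers, prev = [], in_channels
    for ch in channels:
        layers += [nn.Conv2d(prev, ch, kernel_size=3, stride=1, padding=1),
                   nn.LeakyReLU(0.2, inplace=True)]  # 0.2 slope is an assumption
        prev = ch
    return nn.Sequential(*layers)
```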

    After that, in order to achieve adaptive feature integration, infrared and visible features are cascaded in the channel dimension and then fed into the global spatial attention module (GSAM) to obtain the attention map \mathcal{A}_{ir}^{reg} (i.e., fusion weight) of infrared features. This process can be formulated as:

    \begin{equation} \mathcal{A}_{ir}^{reg} = GSAM(concat(\mathcal{F}_{ir}^{reg}, \mathcal{F}_{vi})), \end{equation} (29)

where concat(\cdot) denotes the concatenation operation in the channel dimension. Moreover, the schematic illustration of GSAM is shown in Fig. 6. On the one hand, the concatenated features first pass through three sequential convolutional layers to obtain the shared attention weights for the four-directional context features. The first two convolutional layers are followed by the Leaky-ReLU activation function, and the third convolutional layer leverages the Sigmoid function to normalize the attention weights. On the other hand, the concatenated features are utilized as inputs to the four-directional RNN after passing through a convolutional layer. Then, the contextual features yielded by the RNN are multiplied with the attention weights and concatenated in the channel dimension. The refined features are fed into a convolutional layer to reduce the channel dimension. Next, the second round of the four-directional RNN is deployed to extract the contextual features with global perception, which are multiplied by the shared attention weights. Finally, two cascaded convolutional layers are employed to calculate the attention maps (i.e., fusion weights) for infrared features from the refined global features. The first and second layers use Leaky-ReLU and Sigmoid as the activation functions, respectively. The kernel size of all convolutional layers in GSAM is 3 \times 3 . It is also instructive to note that the fusion weights for infrared features are obtained by jointly considering the importance of infrared and visible features.

    Figure  6.  The schematic illustration of the global spatial attention module (GSAM). The global attention is calculated by adapting a spatial RNN to aggregate the spatial context in four directions.

    Considering that infrared and visible features are complementary, we define the fusion weight of visible features as 1 - \mathcal{A}_{ir}^{reg} . Then, the features can be adaptively integrated as:

    \begin{equation} \mathcal{F}_f = \mathcal{A}_{ir}^{reg} \odot \mathcal{F}_{ir}^{reg} + (1 - \mathcal{A}_{ir}^{reg}) \odot \mathcal{F}_{vi}, \end{equation} (30)

    where \odot stands for the element-wise multiplication operation.
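The complementary weighting of (30) amounts to a convex combination of the two feature maps, as in this short sketch (the function name is illustrative):

```python
import torch

def fuse_features(att_ir: torch.Tensor, feat_ir: torch.Tensor,
                  feat_vi: torch.Tensor) -> torch.Tensor:
    # Eq. (30): the infrared attention map weights the infrared features,
    # and its complement (1 - A) weights the visible features
    return att_ir * feat_ir + (1.0 - att_ir) * feat_vi
```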

    Finally, the fused features \mathcal{F}_f are fed into the image reconstruction module \mathcal{M}_{R} to yield the fused image:

    \begin{equation} I_f^1 = \mathcal{M}_{R}(\mathcal{F}_f). \end{equation} (31)

    The image reconstruction module \mathcal{M}_R consists of four sequential convolutional layers, where the kernel size of each layer is 3 \times 3 , and the stride is set to 1 . All layers are followed by the Leaky-ReLU activation function, except for the last layer, which employs Tanh as the activation function.

In this section, we compare SuperFusion with several state-of-the-art algorithms on the image registration, image fusion, and semantic segmentation tasks through quantitative and qualitative comparisons. Firstly, we provide some implementation details and experimental configurations. Subsequently, we present qualitative and quantitative results compared to state-of-the-art alternatives. Finally, we conduct ablation studies to demonstrate the effectiveness of the specific designs.

1) Implementation Details: Our framework is implemented in PyTorch [75] on an NVIDIA TITAN RTX GPU and a 2.60 GHz Intel(R) Xeon(R) Platinum 8171M CPU. It takes 1200 epochs to train our network with the Adam optimizer and a batch of 8 pairs of 256\times 256 infrared and visible images. The learning rate of the optimizer is initialized to 0.001 and then decayed linearly after the 600-th epoch. All images are normalized to [0, 1] before being fed into the networks. Moreover, the hyper-parameters that control the trade-off of each sub-loss term are empirically set as \alpha_1 = 0.1 , \alpha_2 = 0.1 , \beta_1 = 0.3 , \beta_2 = 0.7 , \lambda_1 = 20 , \lambda_2 = 100 , and \lambda_3 = 20 .
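The reported optimization schedule can be sketched as follows; the linear decay to zero at the final epoch and the model/loader interfaces (a model returning \mathcal{L}_{Total} directly) are assumptions.

```python
import torch

def train(model, loader, epochs=1200, decay_start=600, base_lr=1e-3):
    """Adam with lr 0.001, kept constant for 600 epochs and decayed linearly afterwards."""
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lr_lambda=lambda e: 1.0 if e < decay_start
        else max(0.0, (epochs - e) / (epochs - decay_start)))
    for epoch in range(epochs):
        for vis, ir in loader:            # batches of 8 pairs of 256x256 crops in [0, 1]
            loss = model(vis, ir)         # assumed to return the total loss L_Total
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```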

2) Experimental Configurations: The registration performance of our SuperFusion is compared with five state-of-the-art methods including DASC [62], RIFT [63], GLU-Net [76], UMF-CMGR [30], and CrossRAFT [66] on the MSRS [37] and RoadScene [43] datasets. The comparison is performed under two protocols: the first applies synthetic random affine and elastic transforms to the infrared images, and the second applies them to the visible images. The rigid transform consists of a rotation of [-10, 10] degrees and a translation of [-10, 10] pixels. The elastic transform is obtained by blurring a two-channel [-1, 1] noise map with six Gaussian filters whose sigma is 15 and whose kernel size is 45\times45 . All methods estimate the transform from the infrared image to the visible one under both protocols, and the registration performance is measured by the re-projection error and the end-point error.
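A possible way to generate such a synthetic elastic deformation field is sketched below. Applying the six Gaussian filters sequentially, the re-normalization step, and the final pixel-magnitude scaling are assumptions, since the paper does not detail them.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size: int = 45, sigma: float = 15.0) -> torch.Tensor:
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2.0
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    return torch.outer(g, g)                      # separable 2-D Gaussian

def elastic_flow(h: int, w: int, magnitude: float = 10.0) -> torch.Tensor:
    """Two-channel noise in [-1, 1], smoothed by six Gaussian filters (sigma 15, 45x45),
    then scaled to pixel offsets; returns a (1, H, W, 2) deformation field."""
    noise = torch.rand(1, 2, h, w) * 2.0 - 1.0
    kernel = gaussian_kernel().view(1, 1, 45, 45).repeat(2, 1, 1, 1)
    for _ in range(6):                            # sequential filtering (assumption)
        noise = F.conv2d(noise, kernel, padding=22, groups=2)
    # Re-normalize the heavily smoothed field before scaling to pixel offsets (assumption)
    noise = noise / noise.abs().amax(dim=(2, 3), keepdim=True).clamp(min=1e-8)
    return magnitude * noise.permute(0, 2, 3, 1)
```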

    For image fusion, we also select five state-of-the-art methods for comparison, including RFN-Nest [32], SeAFusion [22], UMF-CMGR [30], U2Fusion [43], and TarDAL [25]. Six metrics involving mutual information (MI), visual information fidelity (VIF), structural similarity index measure (SSIM), feature mutual information based on discrete cosine transform ( FMI_{dct} ), Q_{abf} , and N_{abf} , are selected to quantitatively evaluate fusion performance. MI, FMI_{dct} , and Q_{abf} measure the amount of pixel information, feature information, and edge information transferred from source images to the fusion results, respectively. VIF assesses the information fidelity of fusion results from the perspective of human visual perception. SSIM reflects the similarity between the fused images and the source images from various perspectives, i.e., brightness, contrast, and structure. N_{abf} reflects the artifacts introduced into the fusion results during the fusion process. A higher value of the metrics mentioned above indicates better performance, except for N_{abf} . It is worth noting that UMF-CMGR [30] and our method utilize their own registration module to register the source images, while other comparison methods deploy the state-of-the-art multi-modal registration algorithms, i.e., CrossRAFT [66], to pre-process the misaligned source images. All comparison algorithms are publicly available, and all parameters are consistent with the original papers.

1) Quantitative Comparison: Registration performance is quantified by the reprojection error (RE), which measures the distance between a landmark reprojected by the estimated deformation and its ground truth, and the end-point error (EPE), which is the root-mean-square error of the deformation field. Note that the EPE in Protocol 1 is hard to compute because it requires the inversion of synthetic transforms. As shown in Table Ⅰ, the individually trained registration network of SuperFusion (without fusion) performs well, which demonstrates the superiority of our loss formulation and network design for registration. Our full SuperFusion achieves the best scores on all protocols and metrics, which proves that the fusion and \mathcal{L}_{CC} can improve the registration in our framework. Although CrossRAFT, which is based on complex data augmentations, shows generalization on infrared and visible images as claimed in its paper [66], it still cannot handle challenging scenes in the MSRS dataset, which results in relatively large performance margins compared with our method. The relatively weak results of UMF-CMGR [30], which also adopts a joint registration and fusion paradigm, can be attributed to the noisy translation mentioned before. Moreover, RIFT and DASC are sparse feature methods, so they might not adapt to the elastic deformations in our experiments. GLU-Net, the state-of-the-art method for mono-modal image registration, cannot exhibit generalization on cross-modal tasks, which demonstrates that cross-modal-specific studies deserve more attention.

Table  Ⅰ.  QUANTITATIVE REGISTRATION PERFORMANCE ON MSRS AND ROADSCENE. MEAN REPROJECTION ERROR (RE) AND END-POINT ERROR (EPE) ARE REPORTED AS RE/EPE. THE BEST AND THE SECOND SCORES ARE HIGHLIGHTED IN BOLD AND ITALIC, RESPECTIVELY. *: OUT OF SCOPE; –: EPE NOT COMPUTED IN PROTOCOL 1

RE/EPE | RIFT [63] | DASC [62] | UMF-CMGR [30] | GLU-Net [76] | CrossRAFT [66] | SuperFusion | SuperFusion (Reg)
RS^1 | 24.4/– | 321/– | 20.6/– | 37.5/– | 7.50/– | 7.43/– | 7.58/–
RS^2 | 67.6/* | 99.9/* | 15.7/163 | 20.7/445 | 4.25/14.0 | 3.86/9.13 | 4.01/10.4
MSRS^1 | 56.2/– | 390/– | 56.2/– | 43.8/– | 8.35/– | 7.1/– | 7.21/–
MSRS^2 | 84.7/* | 185/* | 17.5/253 | 25.1/636 | 8.54/39.7 | 7.09/15.2 | 7.29/18.0

2) Qualitative Comparison: The qualitative registration performance shown in Fig. 7 is generally consistent with the quantitative one. Our SuperFusion exhibits the least visible error between the registered gradients and their ground truth in all scenes, especially in the highlighted regions. By contrast, the chaotic deformation fields estimated by DASC lead to weird qualitative results. Although CrossRAFT seems competitive with our method in the quantitative comparison, it still shows obvious flaws in the MSRS dataset, even in some salient regions, for example, well-bounded humans and objects, which probably reveals the limitation of data augmentation. It is worth mentioning that RIFT obtains overall decent results with the post homography matrix estimation, which are even better than those of CrossRAFT, except in Row 3. It is such extreme registration failures, as shown in Row 3, that significantly degrade the quantitative indexes of RIFT. Additionally, the disappointing qualitative performance of GLU-Net and UMF-CMGR matches the corresponding quantitative indexes. The disappointing performance of UMF-CMGR also imposes negative effects on the subsequent fusion experiments.

Figure  7.  Qualitative registration performance of DASC, RIFT, GLU-Net, UMF-CMGR, CrossRAFT, and our SuperFusion. The first four rows of images are from the MSRS dataset, and the last two are from the RoadScene dataset. The purple textures are the gradients of the registered infrared images and the backgrounds are the corresponding ground truths. The discriminative regions that demonstrate the superiority of our method are highlighted in boxes. Note that the gradients of the second column of images are from the warped images, i.e., the misaligned infrared images.

1) Qualitative Comparison: The fusion results of different fusion algorithms on the MSRS and RoadScene datasets are presented in Fig. 8. From the results, we can find that even the state-of-the-art registration algorithm fails to achieve strict registration of the misaligned source images, so that severe artifacts (e.g., around pedestrians, shrubs, and bicycles) appear in the fusion results. In addition, UMF-CMGR fails to deal with severe deformation or parallax, although it jointly models the image registration and image fusion problems. RFN-Nest not only weakens the salient targets from infrared images but also introduces thermal radiation interference into the background region (e.g., the red boxes in the first and third rows). Similar phenomena also occur in UMF-CMGR and U2Fusion, with the wall in the second row being the most obvious example. Although TarDAL generates fusion results with the highest contrast, the object detection-driven fusion model focuses only on the significant objects in scenes. In particular, TarDAL sharpens the salient targets but ignores the background textures, which is not beneficial for a full understanding of the imaging scene. Although another semantic-driven approach, i.e., SeAFusion, is effective in maintaining salient targets and preserving abundant texture details, it suffers from severe artifacts, which are highlighted by the green box. From the fusion results, we can clearly observe that our SuperFusion is able to effectively maintain significant targets in infrared images while retaining clear scene detail from visible images. In particular, our fusion results do not present severe artifacts in the green box areas, which indicates that our method can effectively mitigate the effects caused by misalignment in source images. We attribute these advantages to two aspects. On the one hand, we jointly model the image registration and fusion tasks, which allows our method to be robust to misalignment in source images. On the other hand, we design a fusion layer based on the global spatial attention module to achieve adaptive feature fusion, which allows it to effectively perceive and integrate significant targets and texture detail information in source images.

    Figure  8.  Qualitative comparison results of SuperFusion with five state-of-the-art infrared and visible image fusion methods on the MSRS and RoadScene datasets. All methods employ the built-in registration module (e.g., UMF-CMGR [30] and our SuperFusion) or CrossRAFT [66] to register the source images.

2) Quantitative Comparison: Quantitative comparison results of our SuperFusion and other state-of-the-art fusion approaches on the MSRS dataset are provided in Fig. 9. One can notice that our method achieves the best results on the MI, SSIM, FMI_{dct} , and Q_{abf} metrics. The optimal results on these indicators mean that our method transfers the most pixel information, structural information, feature information, and edge information from the source images to the fused images. In addition, our method trails SeAFusion only by a narrow margin in the VIF metric, which indicates that our fusion results are well consistent with the perception of the human visual system. Although our method does not show advantages on the N_{abf} metric, this is justifiable. Specifically, RFN-Nest and UMF-CMGR merge information from all source images indiscriminately during the fusion process, which causes the fused images not only to be disturbed by irrelevant information but also to have weakened gradients. This phenomenon can be corroborated by the Q_{abf} metric and the qualitative results. N_{abf} determines artifacts by comparing the gradients of the source images and the fused images, so RFN-Nest and UMF-CMGR have an advantage in N_{abf} . It is worth mentioning that our method can effectively eliminate ghosts caused by misalignment, so our method outperforms the remaining approaches in the N_{abf} metric.

    Figure 9. Quantitative comparison results of SuperFusion with five state-of-the-art alternatives on 361 image pairs from the MSRS dataset. A point (x, y) on a curve denotes that 100x percent of the image pairs have metric values no greater than y.
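
    To make the evaluation protocol concrete, the following sketch computes mutual information (MI) for each image pair and builds the cumulative curve described in the caption of Fig. 9. It is a minimal illustration (the image lists fused_imgs, ir_imgs, and vi_imgs are hypothetical), not the exact evaluation code used in our experiments.

```python
import numpy as np
import matplotlib.pyplot as plt

def mutual_information(a, b, bins=256):
    """MI (in bits) between two grayscale images via their joint histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)          # marginal of a
    py = pxy.sum(axis=0, keepdims=True)          # marginal of b
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

def cumulative_curve(values):
    """Sort per-pair metric values so that a fraction x of pairs is <= y."""
    y = np.sort(np.asarray(values, dtype=float))
    x = np.arange(1, len(y) + 1) / len(y)
    return x, y

# Hypothetical per-pair evaluation; the fusion MI is commonly taken as the
# sum of MI(fused, infrared) and MI(fused, visible).
# mi = [mutual_information(f, ir) + mutual_information(f, vi)
#       for f, ir, vi in zip(fused_imgs, ir_imgs, vi_imgs)]
# x, y = cumulative_curve(mi)
# plt.plot(x, y, label="SuperFusion")
# plt.xlabel("fraction of image pairs"); plt.ylabel("MI"); plt.legend(); plt.show()
```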

    Fig. 10 shows the quantitative comparison results of different fusion methods on the RoadScene dataset. We can find that SuperFusion achieves the best results on the MI, VIF, SSIM, and N_{abf} metrics. In particular, the performance of SuperFusion on N_{abf} is in line with our expectations, which indicates that the proposed method reduces artifacts by registering the input images. Moreover, our method is comparable to other state-of-the-art methods on the Q_{abf} metric, which means that our method transfers as much edge information as possible to the fusion results. Although our method lags behind UMF-CMGR and U2Fusion on the FMI_{dct} metric, it is still able to transfer sufficient feature information into the fused image. In conclusion, both the qualitative and quantitative results demonstrate the superior performance of the proposed method.

    Figure  10.  Quantitative comparison results of SuperFusion with five state-of-the-art alternatives on 25 image pairs from the RoadScene dataset.

    As mentioned previously, our method takes into account the requirements of real-world applications and devises a semantic constraint to prompt the fusion network to integrate as much semantic information as possible. In this section, we compare different fusion algorithms on the semantic segmentation task. For a fair comparison, we employ the segmentation model provided by SeAFusion [22] to perform semantic segmentation on both source images and fused images.

    1) Qualitative Comparison: The segmentation results are visualized in Fig. 11. As we can see, the segmentation model is only able to segment part of the objects from the source images due to their lack of a comprehensive description of the imaging scenes. The fused images generated by some fusion algorithms (RFN-Nest, UMF-CMGR, and U2Fusion) also fail to assist the segmentation model in sufficiently perceiving all objects in the scenarios, since their networks are trained without explicit semantic guidance. Although TarDAL uses the object detection task to guide the training of its fusion network, the semantic information provided by object detection drives the fusion network to focus only on the high-contrast regions of conspicuous objects. Therefore, the fusion results synthesized by TarDAL also do not provide comprehensive semantic descriptions for the segmentation model. Moreover, even though SeAFusion can provide more sufficient semantic information than other methods for the segmentation task, misalignment usually misleads the segmentation model. Fortunately, the registration network of our method alleviates the effect of misalignment. Furthermore, the imposed semantic constraint instructs the fusion network to integrate as much semantic information as possible, facilitating the segmentation model to comprehensively perceive the imaging scenes. Thus, the segmentation model segments the most objects in our fusion results, such as the pedestrians in each scene.

    Figure 11. Segmentation results for source images and fused images from the MSRS dataset. Every two rows correspond to one scene; from top to bottom, the scenes are 01234N, 01368N, and 01502D. The fused image denotes the fusion result generated by our SuperFusion, and the pre-trained segmentation model is provided by SeAFusion [22].

    2) Quantitative Comparison: We also perform a quantitative comparison to objectively analyze the effects of different fusion approaches on the segmentation model. As shown in Table Ⅱ, SuperFusion achieves the highest intersection over union (IoU) on most objects and the best mean IoU (mIoU). SeAFusion leverages semantic constraints to guide the training of its fusion model, and thus achieves suboptimal results on most objects. It is worth mentioning that fusion results synthesized from source images without strict registration tend to degrade the segmentation performance. In contrast, our method eliminates the effect of misalignment by pre-registering the input images, thus allowing the segmentation model to maintain better performance.

    Table Ⅱ. SEGMENTATION PERFORMANCE (IOU) OF VISIBLE, INFRARED, AND FUSED IMAGES ON THE MSRS DATASET. BOLD INDICATES THE BEST RESULT AND ITALIC INDICATES THE SECOND-BEST RESULT. THE PRE-TRAINED SEGMENTATION MODEL IS PROVIDED BY SEAFUSION [22]
    Background Car Person Bike Curve Car Stop Guardrail Color Cone Bump mIoU
    Visible 97.92 86.80 39.97 70.50 53.33 71.84 85.9 65.44 79.2 72.32
    Infrared 94.52 50.10 41.53 16.32 13.92 12.54 0 11.34 18.48 28.75
    RFN-Nest 98.14 87.70 66.23 68.50 52.24 71.15 83.58 58.89 65.48 72.43
    SeAFusion 98.34 89.10 67.12 71.47 56.36 73.40 83.99 64.10 75.44 75.48
    UMF-CMGR 97.51 84.40 52.49 64.84 39.67 67.71 74.97 52.38 54.45 65.38
    U2Fusion 97.92 85.20 64.05 66.60 43.67 65.92 84.17 57.58 59.68 69.42
    TarDAL 96.46 71.00 53.08 46.49 7.86 28.62 59.88 49.59 14.70 47.52
    SuperFusion 98.49 90.00 70.82 72.00 62.37 74.23 84.55 65.48 79.13 77.42
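
    For reference, the per-class IoU and mIoU reported in Table Ⅱ follow the standard definition sketched below; the 9-class label convention and variable names are illustrative assumptions rather than the exact evaluation script of the segmentation model.

```python
import numpy as np

def per_class_iou(pred, gt, num_classes=9):
    """Per-class IoU between predicted and ground-truth label maps.

    pred, gt: integer arrays of identical shape containing class indices.
    Classes absent from both maps yield NaN so they can be skipped by the mean.
    """
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious[c] = inter / union
    return ious

# For a whole test set, intersections and unions are usually accumulated over
# all images before the division; mIoU is then the mean of the per-class IoUs.
# miou = np.nanmean(per_class_iou(pred_map, gt_map))
```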

    We conduct a complexity evaluation to analyze the operational efficiency of different algorithms from two perspectives, i.e., the number of parameters and the running time. The computational efficiency comparison results of different image registration and fusion approaches are reported in Table Ⅲ. From the results, we can find that our method has a significant efficiency advantage in the image registration task, i.e., the fewest parameters and the lowest running time. We attribute this advantage to the fact that our dense matcher employs a three-layer pyramid structure to extract features and estimate the deformation field progressively. It is worth noting that UMF-CMGR consumes considerable time on style transformation despite having a very lightweight registration network. Benefiting from GPU acceleration, all deep learning-based methods have a significant advantage in running time over traditional methods. Moreover, it can be seen from the results that our fusion network has the fewest parameters. However, since our method deploys the two-round four-directional RNN-based GSAM for adaptive feature fusion, it slightly lags behind TarDAL and SeAFusion in running time. Nonetheless, our method can still meet the real-time requirements of practical applications. It is important to note that TarDAL sacrifices the precision of its network parameters and data to pursue higher operational efficiency. In conclusion, our SuperFusion offers superior operational efficiency, which allows it to be deployed in real-world applications.

    Table  Ⅲ.  COMPUTATIONAL EFFICIENCY COMPARISON WITH STATE-OF-THE-ART IMAGE REGISTRATION AND FUSION METHODS
    Registration DASC RIFT UMF-CMGR GLU-Net CrossRAFT SuperFusion
    Parameters (M) - - 11.5429 13.5905 39.7634 1.9624
    Time (s) MSRS 46.6238 4.4573 0.3893 0.0426 0.0857 0.0294
    Time (s) RoadScene 23.8478 3.6980 0.4524 0.0351 0.0634 0.0327
    Fusion RFN-Nest SeAFusion UMF-CMGR U2Fusion TarDAL SuperFusion
    Parameters (M) 7.5242 0.1669 0.6293 0.6592 0.2966 0.1387
    Time (s) MSRS 0.1564 0.0371 0.0428 0.1444 0.0195 0.0765
    Time (s) RoadScene 0.0810 0.0191 0.3520 0.7418 0.0103 0.0339
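
    The parameter counts and running times in Table Ⅲ can be measured with standard PyTorch [75] utilities; the snippet below is a generic sketch (the input resolution, warm-up schedule, and the example fusion inputs are illustrative assumptions, not our benchmarking script).

```python
import time
import torch

def count_parameters(model):
    """Trainable parameters in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def average_runtime(model, inputs, runs=50, warmup=5):
    """Mean forward time (seconds per call) with GPU warm-up and synchronization."""
    model.eval()
    for _ in range(warmup):
        model(*inputs)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(*inputs)
    torch.cuda.synchronize()
    return (time.time() - start) / runs

# Hypothetical usage for a fusion network taking infrared and visible tensors:
# ir = torch.randn(1, 1, 480, 640, device="cuda")
# vi = torch.randn(1, 1, 480, 640, device="cuda")
# print(count_parameters(fusion_net), average_runtime(fusion_net.cuda(), [ir, vi]))
```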

    1) Ablation on Registration Components: In our method, several losses and designs are customized for the registration network, including the symmetric paradigm (Sym.), the end-point loss ( \mathcal{L}_{EP} ), the photometric loss ( \mathcal{L}_{PH} ), and joint training with the fusion network (Fusion). To explore their effectiveness for registration, we conduct ablation studies on the RoadScene dataset, which is less time-consuming. The results are recorded in Table Ⅳ. Our full formulation achieves the smallest error, followed by the registration network trained without the fusion network and its related losses, including \mathcal{L}_{CC} (w/o Fusion), which demonstrates that image fusion can bridge the gap between the input modalities and thereby facilitate cross-modal registration. Moreover, while discarding \mathcal{L}_{PH} only leads to a minor degradation, the registration network cannot yield satisfactory performance without \mathcal{L}_{EP} . This is because, although \mathcal{L}_{PH} improves precision in textured regions as expected, it is \mathcal{L}_{EP} that forces the network to regress deformation fields in areas with severe modal variances, which are critical to cross-modal registration. Additionally, since we force the network to estimate bidirectional deformation fields, the symmetric paradigm provides end-point supervision in each direction, which explains the weak performance when only the top half of the registration training scheme shown in Fig. 3 is used (w/o Sym.).

    Table  Ⅳ.  ABLATION STUDY ON COMPONENTS OF REGISTRATION MODULE
    RE/EPE w/o Sym. w/o \mathcal{L}_{EP} w/o \mathcal{L}_{PH} w/o Fusion Ours
    RS^1 7.81/ 8.20/ 7.60/ 7.58/ 7.43/
    RS^2 4.32/14.6 4.81/18.9 4.10/12.2 4.01/10.4 3.86/9.13
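
    To make the two supervision signals concrete, the sketch below computes an end-point error between an estimated and a reference deformation field, and a photometric (L1) loss on the image warped by that field. The bilinear warping routine is a generic flow-warping implementation that only illustrates the idea behind \mathcal{L}_{EP} and \mathcal{L}_{PH} ; it is not the exact loss formulation of our registration network.

```python
import torch
import torch.nn.functional as F

def end_point_error(flow_pred, flow_gt):
    """Mean Euclidean distance between two deformation fields of shape (B, 2, H, W)."""
    return torch.norm(flow_pred - flow_gt, dim=1).mean()

def warp(img, flow):
    """Bilinearly warp img (B, C, H, W) by a pixel-offset field flow (B, 2, H, W)."""
    _, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W)
    coords = base + flow                                        # absolute sampling positions
    x_norm = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0           # normalize to [-1, 1]
    y_norm = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((x_norm, y_norm), dim=-1)                # (B, H, W, 2)
    return F.grid_sample(img, grid, align_corners=True)

def photometric_loss(moving, fixed, flow):
    """L1 distance between the warped moving image and the fixed image."""
    return (warp(moving, flow) - fixed).abs().mean()
```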

    2) Ablation on Fusion Component: We integrate a global spatial attention module (GSAM) into the fusion network to implement adaptive feature fusion. In particular, the four-directional RNN in GSAM can perceive structural information from multiple directions to generate more reasonable fusion weights for feature maps. Thus, we also perform an ablation study to verify the effectiveness of GSAM. As shown in Table Ⅴ, the fusion results without GSAM exhibit a significant degradation in all metrics except N_{abf} . Moreover, as shown in Fig. 12, the visualized results also illustrate that the fusion network fails to effectively integrate the significant objects in the source images after discarding GSAM. In contrast, SuperFusion with GSAM can adaptively integrate significant information from source images into the fused image.
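
    As a rough illustration of the directional recurrent attention described above, the simplified sketch below scans a feature map with a shared GRU along four directions and converts the aggregated context into per-pixel fusion weights. It should not be read as the exact two-round GSAM architecture of SuperFusion; the module name, channel sizes, and the weighted-sum fusion rule in the usage comment are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FourDirectionalContext(nn.Module):
    """Aggregate spatial context by scanning a feature map with a GRU along
    four directions (left-to-right, right-to-left, top-down, bottom-up)."""

    def __init__(self, channels):
        super().__init__()
        self.gru = nn.GRU(channels, channels, batch_first=True)
        self.attn = nn.Conv2d(4 * channels, 1, kernel_size=1)

    def _scan_rows(self, x, reverse=False):
        # Treat every row of x (B, C, H, W) as a sequence of length W.
        b, c, h, w = x.shape
        seq = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        if reverse:
            seq = seq.flip(1)
        out, _ = self.gru(seq)
        if reverse:
            out = out.flip(1)
        return out.reshape(b, h, w, c).permute(0, 3, 1, 2)

    def forward(self, x):
        xt = x.transpose(2, 3)                      # column scans reuse the row scan
        ctx = torch.cat([self._scan_rows(x),
                         self._scan_rows(x, reverse=True),
                         self._scan_rows(xt).transpose(2, 3),
                         self._scan_rows(xt, reverse=True).transpose(2, 3)], dim=1)
        return torch.sigmoid(self.attn(ctx))        # per-pixel weight in (0, 1)

# Hypothetical adaptive fusion of infrared and visible feature maps:
# w = FourDirectionalContext(64)(feat_ir)
# fused_feat = w * feat_ir + (1 - w) * feat_vi
```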

    Table  Ⅴ.  ABLATION STUDY ON THE GLOBAL SPATIAL ATTENTION MODULE (GSAM)
    MI VIF SSIM FMI_{dct} Q_{abf} N_{abf}
    MSRS w/o GSAM 3.086 0.8415 0.6970 0.2238 0.5394 0.070
    Ours 4.2398 0.9758 0.6978 0.3430 0.6360 0.0839
    RoadScene w/o GSAM 2.8741 0.7489 0.6903 0.2173 0.3354 0.0479
    Ours 3.8925 0.8327 0.7059 0.2957 0.4985 0.0403
    Figure  12.  Visualized results of ablation study on the global spatial attention module in the fusion network.

    3) Ablation on Semantic Constraint: As mentioned above, we introduce the semantic constraint to prompt the fusion network to integrate more semantic information. Thus, we also perform an ablation study on the semantic constraint to verify its effectiveness. We only report the IoU for several important categories, i.e., background, car, person, and bike, as well as the mIoU over all categories. As shown in Table Ⅵ, the IoU of every category degrades after removing the semantic constraint. It is worth noting that even without the semantic constraint, the segmentation model still performs better on the fusion results than on the source images, which is attributed to effective image registration and sufficient information integration during the fusion process. Furthermore, the semantic constraint guides our fusion network to pay more attention to semantic information during the fusion process, thus allowing the fused images to contribute to the performance improvement of the segmentation model.

    Table Ⅵ. ABLATION STUDY ON THE SEMANTIC CONSTRAINT
    Background Car Person Bike mIoU
    w/o \mathcal{L}_{Sea} 98.44 89.4 70.3 70.79 76.64
    Ours 98.49 89.73 70.82 71.99 77.42
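
    To illustrate how such a semantic constraint can be attached to the fusion network, the sketch below feeds the fused image into a frozen pre-trained segmentation model and back-propagates a segmentation loss into the fusion network only. Cross-entropy is used here as a stand-in for the Lovasz-Softmax surrogate of [74], and the variable names and weighting factor are illustrative assumptions rather than our exact loss.

```python
import torch.nn.functional as F

def semantic_constraint(fused, labels, seg_model):
    """Segmentation-driven loss on the fused image.

    fused:     (B, C, H, W) output of the fusion network.
    labels:    (B, H, W) ground-truth class indices (long tensor).
    seg_model: pre-trained segmentation network whose weights stay frozen,
               so the gradient only updates the fusion network.
    """
    for p in seg_model.parameters():
        p.requires_grad_(False)
    logits = seg_model(fused)                      # (B, num_classes, H, W)
    # Cross-entropy as a stand-in for the Lovasz-Softmax surrogate of [74].
    return F.cross_entropy(logits, labels)

# total_loss = fusion_loss + lambda_sem * semantic_constraint(fused, labels, seg_net)
```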

    In this paper, we propose a versatile framework, termed SuperFusion, that jointly considers image registration, image fusion, and the requirements of high-level vision tasks. It significantly extends the scope of image fusion in practical applications. SuperFusion consists of three components: an image registration network, a fusion network, and a semantic segmentation network. Firstly, the registration network is devised to estimate bidirectional deformation fields so that both photometric and end-point losses can be conveniently employed to improve registration precision. Moreover, a symmetric joint registration and fusion scheme is developed to balance the bias toward input modalities and further promote registration with a similarity constraint in the fused domain. Secondly, a global spatial attention mechanism, which emphasizes the significant areas and targets in source images, is employed to achieve adaptive feature integration and to serve the preceding registration as well as the subsequent segmentation. Thirdly, we design a semantic constraint based on the Lovasz-Softmax loss to prompt the fusion network to generate more reasonable results, which facilitates the perception of both machines and humans. In conclusion, we are the first to integrate image registration, fusion, and semantic segmentation into a single framework and to achieve the mutual promotion of image fusion and image registration. Extensive experiments demonstrate that each module in our framework achieves state-of-the-art performance.

  • Recommended by Associate Editor Qing-Long Han. (Linfeng Tang and Yuxin Deng contributed equally to this work.)
  • [1]
    X. Zhang, "Deep learning-based multi-focus image fusion: A survey and a comparative study, " IEEE Trans. Pattern Anal. Mach. Intell. , vol. 44, no. 9, pp. 4819–4838, 2022.
    [2]
    J. Ma, Y. Ma, and C. Li, "Infrared and visible image fusion methods and applications: A survey, " Inf. Fusion, vol. 45, pp. 153–178, 2019. doi: 10.1016/j.inffus.2018.02.004
    [3]
    X. Zhang, P. Ye, and G. Xiao, "Vifb: A visible and infrared image fusion benchmark, " in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2020, pp. 104–105.
    [4]
    H. Zhang, H. Xu, X. Tian, J. Jiang, and J. Ma, "Image fusion meets deep learning: A survey and perspective, " Inf. Fusion, vol. 76, pp. 323–336, 2021. doi: 10.1016/j.inffus.2021.06.008
    [5]
    L. Tang, H. Zhang, H. Xu, and J. Ma, "Deep learning-based image fusion: A survey, " J. Image Graph. , 2022.
    [6]
    Q. Ha, K. Watanabe, T. Karasawa, Y. Ushiku, and T. Harada, "Mfnet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes, " in Proc. IEEE Int. Conf. Intell. Rob. Syst., 2017, pp. 5108–5115.
    [7]
    X. Zhang, P. Ye, H. Leung, K. Gong, and G. Xiao, "Object fusion tracking based on visible and infrared images: A comprehensive review, " Inf. Fusion, vol. 63, pp. 166–187, 2020. doi: 10.1016/j.inffus.2020.05.002
    [8]
    M. Yin, J. Pang, Y. Wei, and P. Duan, "Image fusion algorithm based on nonsubsampled dual-tree complex contourlet transform and compressive sensing pulse coupled neural network, " J. Comput. Aided Des. Comput. Graph. , no. 3, pp. 411–419, 2016.
    [9]
    J. Chen, X. Li, L. Luo, X. Mei, and J. Ma, "Infrared and visible image fusion based on target-enhanced multiscale transform decomposition, " Inf. Sci. , vol. 508, pp. 64–78, 2020. doi: 10.1016/j.ins.2019.08.066
    [10]
    H. Li, X.-J. Wu, and J. Kittler, "Mdlatlrr: A novel decomposition method for infrared and visible image fusion," IEEE Trans. Image Process., vol. 29, pp. 4733–4746, 2020. doi: 10.1109/TIP.2020.2975984
    [11]
    Y. Liu, X. Chen, R. K. Ward, and Z. J. Wang, "Image fusion with convolutional sparse representation, " IEEE Signal Process. Letters, vol. 23, no. 12, pp. 1882–1886, 2016. doi: 10.1109/LSP.2016.2618776
    [12]
    Z. Fu, X. Wang, J. Xu, N. Zhou, and Y. Zhao, "Infrared and visible images fusion based on rpca and nsct, " Infrared Phys. Technol. , vol. 77, pp. 114–123, 2016. doi: 10.1016/j.infrared.2016.05.012
    [13]
    J. Ma, Z. Zhou, B. Wang, and H. Zong, "Infrared and visible image fusion based on visual saliency map and weighted least square optimization, " Infrared Phys. Technol. , vol. 82, pp. 8–17, 2017. doi: 10.1016/j.infrared.2017.02.005
    [14]
    J. Ma, C. Chen, C. Li, and J. Huang, "Infrared and visible image fusion via gradient transfer and total variation minimization, " Inf. Fusion, vol. 31, pp. 100–109, 2016. doi: 10.1016/j.inffus.2016.02.001
    [15]
    W. Zhao, H. Lu, and D. Wang, "Multisensor image fusion and enhancement in spectral total variation domain, " IEEE Trans. Multimed. , vol. 20, no. 4, pp. 866–879, 2017.
    [16]
    H. Li and X. -J. Wu, "Densefuse: A fusion approach to infrared and visible images, " IEEE Trans. Image Process. , vol. 28, no. 5, pp. 2614–2623, 2019. doi: 10.1109/TIP.2018.2887342
    [17]
    H. Xu, H. Zhang, and J. Ma, "Classification saliency-based rule for visible and infrared image fusion, " IEEE Trans. Comput. Imaging, vol. 7, pp. 824–836, 2021. doi: 10.1109/TCI.2021.3100986
    [18]
    J. Liu, X. Fan, J. Jiang, R. Liu, and Z. Luo, "Learning a deep multi-scale feature ensemble and an edge-attention guidance for image fusion, " IEEE Trans. Circuits Syst. Video Technol. , vol. 32, no. 1, pp. 105–119, 2022. doi: 10.1109/TCSVT.2021.3056725
    [19]
    F. Zhao, W. Zhao, L. Yao, and Y. Liu, "Self-supervised feature adaption for infrared and visible image fusion, " Inf. Fusion, vol. 76, pp. 189–203, 2021. doi: 10.1016/j.inffus.2021.06.002
    [20]
    J. Ma, L. Tang, M. Xu, H. Zhang, and G. Xiao, "Stdfusionnet: An infrared and visible image fusion network based on salient target detection, " IEEE Trans. Instrum. Meas. , vol. 70, pp. 1–13, 2021.
    [21]
    Y. Liu, Y. Shi, F. Mu, J. Cheng, C. Li, and X. Chen, "Multimodal mri volumetric data fusion with convolutional neural networks, " IEEE Trans. Instrum. Meas. , vol. 71, pp. 1–15, 2022.
    [22]
    L. Tang, J. Yuan, and J. Ma, "Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network, " Inf. Fusion, vol. 82, pp. 28–42, 2022. doi: 10.1016/j.inffus.2021.12.004
    [23]
    J. Ma, W. Yu, P. Liang, C. Li, and J. Jiang, "Fusiongan: A generative adversarial network for infrared and visible image fusion, " Inf. Fusion, vol. 48, pp. 11–26, 2019. doi: 10.1016/j.inffus.2018.09.004
    [24]
    J. Ma, H. Zhang, Z. Shao, P. Liang, and H. Xu, "Ganmcc: A generative adversarial network with multiclassification constraints for infrared and visible image fusion, " IEEE Trans. Instrum. Meas. , vol. 70, pp. 1–14, 2021.
    [25]
    J. Liu, X. Fan, Z. Huang, G. Wu, R. Liu, W. Zhong, and Z. Luo, "Targetaware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection, " in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 5802–5811.
    [26]
    Y. Yang, J. Liu, S. Huang, W. Wan, W. Wen, and J. Guan, "Infrared and visible image fusion via texture conditional generative adversarial network, " IEEE Trans. Circuits Syst. Video Technol. , vol. 31, no. 12, pp. 4771–4783, 2021. doi: 10.1109/TCSVT.2021.3054584
    [27]
    J. Ma, L. Tang, F. Fan, J. Huang, X. Mei, and Y. Ma, "Swinfusion: Cross-domain long-range learning for general image fusion via swin transformer, " IEEE/CAA J. Autom. Sinica, vol. 9, no. 7, pp. 1200–1217, 2022. doi: 10.1109/JAS.2022.105686
    [28]
    W. Tang, F. He, and Y. Liu, "Ydtr: infrared and visible image fusion via y-shape dynamic transformer, " IEEE Trans. Multimed. , 2022.
    [29]
    J. Li, J. Zhu, C. Li, X. Chen, and B. Yang, "Cgtf: Convolution-guided transformer for infrared and visible image fusion, " IEEE Trans. Instrum. Meas. , vol. 71, p. 5012314, 2022.
    [30]
    D. Wang, J. Liu, X. Fan, and R. Liu, "Unsupervised misaligned infrared and visible image fusion via cross-modality image generation and registration," in Int. Joint Conf. Artif. Intell., 2022.
    [31]
    H. Li, X. -J. Wu, and T. Durrani, "Nestfuse: An infrared and visible image fusion architecture based on nest connection and spatial/channel attention models, " IEEE Trans. Instrum. Meas. , vol. 69, no. 12, pp. 9645–9656, 2020. doi: 10.1109/TIM.2020.3005230
    [32]
    H. Li, X. -J. Wu, and J. Kittler, "Rfn-nest: An end-to-end residual fusion network for infrared and visible images, " Inf. Fusion, vol. 73, pp. 72–86, 2021. doi: 10.1016/j.inffus.2021.02.023
    [33]
    H. Xu, X. Wang, and J. Ma, "Drf: Disentangled representation for visible and infrared image fusion, " IEEE Trans. Instrum. Meas. , vol. 70, p. 5006713, 2021.
    [34]
    M. Xu, L. Tang, H. Zhang, and J. Ma, "Infrared and visible image fusion via parallel scene and texture learning, " Pattern Recognit. , vol. 132, p. 108929, 2022. doi: 10.1016/j.patcog.2022.108929
    [35]
    Z. Zhao, S. Xu, C. Zhang, J. Liu, J. Zhang, and P. Li, "Didfuse: deep image decomposition for infrared and visible image fusion, " in Int. Joint Conf. Artif. Intell., 2020, pp. 970–976.
    [36]
    F. Zhao and W. Zhao, "Learning specific and general realm feature representations for image fusion, " IEEE Trans. Multimed. , vol. 23, pp. 2745–2756, 2020.
    [37]
    L. Tang, J. Yuan, H. Zhang, X. Jiang, and J. Ma, "Piafusion: A progressive infrared and visible image fusion network based on illumination aware, " Inf. Fusion, vol. 83, pp. 79–92, 2022.
    [38]
    Y. Long, H. Jia, Y. Zhong, Y. Jiang, and Y. Jia, "Rxdnfuse: A aggregated residual dense network for infrared and visible image fusion, " Inf. Fusion, vol. 69, pp. 128–141, 2021. doi: 10.1016/j.inffus.2020.11.009
    [39]
    R. Liu, Z. Liu, J. Liu, and X. Fan, "Searching a hierarchically aggregated fusion architecture for fast multi-modality image fusion, " in Proc. ACM Int. Conf. Multimed., 2021, pp. 1600–1608.
    [40]
    H. Zhang, H. Xu, Y. Xiao, X. Guo, and J. Ma, "Rethinking the image fusion: A fast unified image fusion network based on proportional maintenance of gradient and intensity," in Proc. AAAI Conf. Artif. Intell., 2020, pp. 12797–12804.
    [41]
    H. Zhang and J. Ma, "Sdnet: A versatile squeeze-and-decomposition network for real-time image fusion, " Int. J. Comput. Vis. , vol. 129, no. 10, pp. 2761–2785, 2021. doi: 10.1007/s11263-021-01501-8
    [42]
    H. Xu, J. Ma, Z. Le, J. Jiang, and X. Guo, "Fusiondn: A unified densely connected network for image fusion," in Proc. AAAI Conf. Artif. Intell., 2020, pp. 12484–12491.
    [43]
    H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling, "U2fusion: A unified unsupervised image fusion network, " IEEE Trans. Pattern Anal. Mach. Intell. , vol. 44, no. 1, pp. 502–518, 2022. doi: 10.1109/TPAMI.2020.3012548
    [44]
    Y. Zhang, Y. Liu, P. Sun, H. Yan, X. Zhao, and L. Zhang, "Ifcnn: A general image fusion framework based on convolutional neural network, " Inf. Fusion, vol. 54, pp. 99–118, 2020. doi: 10.1016/j.inffus.2019.07.011
    [45]
    J. Liu, R. Dian, S. Li, and H. Liu, "Sgfusion: A saliency guided deeplearning framework for pixel-level image fusion, " Inf. Fusion, 2022.
    [46]
    Y. Fu, T. Xu, X. Wu, and J. Kittler, "Ppt fusion: Pyramid patch transformer for a case study in image fusion," arXiv, 2021.
    [47]
    V. VS, J. M. J. Valanarasu, P. Oza, and V. M. Patel, "Image fusion transformer, " arXiv, 2021.
    [48]
    H. Zhao and R. Nie, "Dndt: Infrared and visible image fusion via densenet and dual-transformer, " in Proc. Int. Conf. Inf. Technol. Biomed. Eng., 2021, pp. 71–75.
    [49]
    J. Ma, P. Liang, W. Yu, C. Chen, X. Guo, J. Wu, and J. Jiang, "Infrared and visible image fusion via detail preserving adversarial learning, " Inf. Fusion, vol. 54, pp. 85–98, 2020. doi: 10.1016/j.inffus.2019.07.005
    [50]
    J. Ma, H. Xu, J. Jiang, X. Mei, and X. -P. Zhang, "Ddcgan: A dualdiscriminator conditional generative adversarial network for multiresolution image fusion, " IEEE Trans. Image Process. , vol. 29, pp. 4980–4995, 2020. doi: 10.1109/TIP.2020.2977573
    [51]
    J. Li, H. Huo, C. Li, R. Wang, and Q. Feng, "Attentionfgan: Infrared and visible image fusion using attention-based generative adversarial networks, " IEEE Trans. Multimed. , vol. 23, pp. 1383–1396, 2020.
    [52]
    H. Zhou, W. Wu, Y. Zhang, J. Ma, and H. Ling, "Semantic-supervised infrared and visible image fusion via a dual-discriminator generative adversarial network, " IEEE Trans. Multimed. , 2021.
    [53]
    H. Zhang, J. Yuan, X. Tian, and J. Ma, "Gan-fm: Infrared and visible image fusion using gan with full-scale skip connection and dual markovian discriminators, " IEEE Trans. Comput. Imaging, vol. 7, pp. 1134–1147, 2021. doi: 10.1109/TCI.2021.3119954
    [54]
    Y. Liu, Y. Shi, F. Mu, J. Cheng, and X. Chen, "Glioma segmentationoriented multi-modal mr image fusion with adversarial learning, " IEEE/CAA J. Autom. Sinica, vol. 9, no. 8, pp. 1528–1531, 2022. doi: 10.1109/JAS.2022.105770
    [55]
    Y. Liu, F. Mu, Y. Shi, and X. Chen, "Sf-net: A multi-task model for brain tumor segmentation in multimodal mri via image fusion, " IEEE Signal Process. Letters, vol. 29, pp. 1799–1803, 2022. doi: 10.1109/LSP.2022.3198594
    [56]
    H. Xu, J. Ma, J. Yuan, Z. Le, and W. Liu, "Rfnet: Unsupervised network for mutually reinforcing multi-modal image registration and fusion," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2022, pp. 19679–19688.
    [57]
    J. Zhang and A. Rangarajan, "Affine image registration using a new information metric, " in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2004, pp. 1–8.
    [58]
    D. Rueckert, L. Sonoda, C. Hayes, D. Hill, M. Leach, and D. Hawkes, "Nonrigid registration using free-form deformations: application to breast mr images, " IEEE Trans. Med. Imaging, vol. 18, no. 8, pp. 712–721, 1999. doi: 10.1109/42.796284
    [59]
    J.-C. Yoo and T. H. Han, "Fast normalized cross-correlation," Circuits Syst. Signal Process., vol. 28, no. 6, pp. 819–843, 2009. doi: 10.1007/s00034-009-9130-7
    [60]
    F. Maes, D. Vandermeulen, and P. Suetens, "Medical image registration using mutual information, " Proc. IEEE, vol. 91, no. 10, pp. 1699–1722, 2003. doi: 10.1109/JPROC.2003.817864
    [61]
    C. Aguilera, F. Barrera, F. Lumbreras, A. D. Sappa, and R. Toledo, "Multispectral image feature points," Sensors, vol. 12, no. 9, pp. 12661–12672, 2012. doi: 10.3390/s120912661
    [62]
    S. Kim, D. Min, B. Ham, M. N. Do, and K. Sohn, "Dasc: Robust dense descriptor for multi-modal and multi-spectral correspondence estimation, " IEEE Trans. Pattern Anal. Mach. Intell. , vol. 39, no. 9, pp. 1712–1729, 2016.
    [63]
    J. Li, Q. Hu, and M. Ai, "Rift: Multi-modal image matching based on radiation-variation insensitive feature transform, " IEEE Trans. Image Process. , vol. 29, pp. 3296–3310, 2019.
    [64]
    G. Balakrishnan, A. Zhao, M. R. Sabuncu, J. Guttag, and A. V. Dalca, "Voxelmorph: A learning framework for deformable medical image registration," IEEE Trans. Med. Imaging, vol. 38, no. 8, pp. 1788–1800, 2019.
    [65]
    M. Arar, Y. Ginger, D. Danon, A. H. Bermano, and D. Cohen-Or, "Unsupervised multi-modal image registration via geometry preserving image-to-image translation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 13410–13419.
    [66]
    S. Zhou, W. Tan, and B. Yan, "Promoting single-modal optical flow network for diverse cross-modal flow estimation, " in Proc. AAAI Conf. Artif. Intell., 2022, pp. 3562–3570.
    [67]
    S. Wang, D. Quan, X. Liang, M. Ning, Y. Guo, and L. Jiao, "A deep learning framework for remote sensing image registration, " ISPRS J. Photogramm. Remote Sens. , vol. 145, pp. 148–164, 2018. doi: 10.1016/j.isprsjprs.2017.12.012
    [68]
    K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using rnn encoder-decoder for statistical machine translation, " arXiv, 2014.
    [69]
    S. Bell, C. L. Zitnick, K. Bala, and R. Girshick, "Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks, " in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2874–2883.
    [70]
    X. Hu, C. -W. Fu, L. Zhu, J. Qin, and P. -A. Heng, "Direction-aware spatial context features for shadow detection and removal, " IEEE Trans. Pattern Anal. Mach. Intell. , vol. 42, no. 11, pp. 2795–2808, 2019.
    [71]
    T. Wang, X. Yang, K. Xu, S. Chen, Q. Zhang, and R. W. Lau, "Spatial attentive single-image deraining with a high quality real rain dataset," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 12270–12279.
    [72]
    S. Meister, J. Hur, and S. Roth, "Unflow: Unsupervised learning of optical flow with a bidirectional census loss, " in Proc. AAAI Conf. Artif. Intell., 2018, pp. 7251–7259.
    [73]
    Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity, " IEEE Trans. Image Process. , vol. 13, no. 4, pp. 600–612, 2004. doi: 10.1109/TIP.2003.819861
    [74]
    M. Berman, A. R. Triki, and M. B. Blaschko, "The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks, " in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4413–4421.
    [75]
    A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "Pytorch: An imperative style, high-performance deep learning library, " in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 8026–8037.
    [76]
    P. Truong, M. Danelljan, and R. Timofte, "Glu-net: Global-local universal network for dense flow and correspondences, " in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 6258–6268.