A journal of IEEE and CAA, publishing high-quality papers in English on original theoretical/experimental research and development in all areas of automation.

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 15.3, Top 1 (SCI Q1)
  • CiteScore: 23.5, Top 2% (Q1)
  • Google Scholar h5-index: 77, Top 5
Citation: Y. Yuan, G. Yang, J. Wang, H. Zhang, H. Shan, F.-Y. Wang, and J. Zhang, “Dissecting and mitigating semantic discrepancy in stable diffusion for image-to-image translation,” IEEE/CAA J. Autom. Sinica, vol. 12, no. 4, pp. 1–14, Apr. 2025. doi: 10.1109/JAS.2024.124800

Dissecting and Mitigating Semantic Discrepancy in Stable Diffusion for Image-to-Image Translation

doi: 10.1109/JAS.2024.124800
Funds: This work was supported in part by the National Natural Science Foundation of China (62176059). The work of James Z. Wang was supported by The Pennsylvania State University.
  • Finding suitable initial noise that retains the original image’s information is crucial for image-to-image (I2I) translation using text-to-image (T2I) diffusion models. A common approach is to add random noise directly to the original image, as in SDEdit. However, we have observed that this can result in “semantic discrepancy” issues, wherein T2I diffusion models misinterpret the semantic relationships and generate content not present in the original image. We identify that the noise introduced by SDEdit disrupts the semantic integrity of the image, leading to unintended associations between unrelated regions after U-Net upsampling. Building on the widely used latent diffusion model, Stable Diffusion, we propose a training-free, plug-and-play method to alleviate semantic discrepancy and enhance the fidelity of the translated image. By leveraging the deterministic nature of denoising diffusion implicit model (DDIM) inversion, we correct the erroneous features and correlations from the original generative process with accurate ones from DDIM inversion. This approach alleviates semantic discrepancy and surpasses recent DDIM-inversion-based methods such as PnP with fewer priors, achieving an 11.2× speedup in experiments conducted on the COCO, ImageNet, and ImageNet-R datasets across multiple I2I translation tasks. The code is available at https://github.com/Sherlockyyf/Semantic_Discrepancy.
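    For illustration, the deterministic DDIM inversion the abstract relies on can be written as a short recursion. The sketch below is a minimal, assumed PyTorch implementation, not the authors' released code: eps_model (a noise predictor such as the Stable Diffusion U-Net), alphas_cumprod (the cumulative noise schedule), and cond (the text embedding) are placeholder names introduced here for the example.

        import torch

        @torch.no_grad()
        def ddim_invert(z0, eps_model, alphas_cumprod, cond, num_steps=50):
            # Illustrative sketch (assumed interfaces, not the paper's code):
            # deterministically map a clean latent z0 toward a noisy latent z_T
            # by running the DDIM update in reverse (low noise -> high noise).
            T = alphas_cumprod.shape[0]
            timesteps = torch.linspace(0, T - 1, num_steps).long()
            z = z0.clone()
            for i in range(num_steps - 1):
                t, t_next = timesteps[i], timesteps[i + 1]
                a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
                eps = eps_model(z, t, cond)  # predicted noise at the current step
                z0_pred = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # implied clean latent
                z = a_next.sqrt() * z0_pred + (1 - a_next).sqrt() * eps  # deterministic re-noising
            return z

    Because the recursion is deterministic, denoising from the returned latent approximately retraces the original image; this is what allows the accurate inversion-side features and correlations to stand in for the erroneous ones described in the abstract.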

     

References
    [1]
    I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. 28th Int. Conf. Neural Information Processing Systems, Montreal, Canada, 2014, pp. 2672–2680.
    [2]
    Y. Chen, Y. Lv, and F.-Y. Wang, “Traffic flow imputation using parallel data and generative adversarial networks,” IEEE Trans. Intell. Transp. Syst., vol. 21, no. 4, pp. 1624–1630, Apr. 2020. doi: 10.1109/TITS.2019.2910295
    [3]
    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Proc. 34th Int. Conf. Neural Information Processing Systems, Vancouver, Canada, 2020, pp. 574.
    [4]
    P. Dhariwal and A. Nichol, “Diffusion models beat GANs on image synthesis,” in Proc. 35th Int. Conf. Neural Information Processing Systems, 2021, pp. 672.
    [5]
    C. Wang, T. Chen, Z. Chen, Z. Huang, T. Jiang, Q. Wang, and H. Shan, “FLDM-VTON: Faithful latent diffusion model for virtual try-on,” in Proc. Thirty-Third Int. Joint Conf. Artificial Intelligence, Jeju, South Korea, 2024, pp. 1362–1370.
    [6]
    Q. Gao, Z. Li, J. Zhang, Y. Zhang, and H. Shan, “CoreDiff: Contextual error-modulated generalized diffusion model for low-dose CT denoising and generalization,” IEEE Trans. Med. Imaging, vol. 43, no. 2, pp. 745–759, Feb. 2024. doi: 10.1109/TMI.2023.3320812
    [7]
    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, New Orleans, USA, 2022, pp. 10674–10685.
    [8]
    A. Q. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. Mcgrew, I. Sutskever, and M. Chen, “GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models,” in Proc. 39th Int. Conf. Machine Learning, Baltimore, USA, 2022, pp. 16784–16804.
    [9]
    C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. Gontijo-Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, “Photorealistic text-to-image diffusion models with deep language understanding,” in Proc. 36th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2022, pp. 2643.
    [10]
    J. Yu, Y. Xu, J. Y. Koh, T. Luong, G. Baid, Z. Wang, V. Vasudevan, A. Ku, Y. Yang, B. K. Ayan, B. Hutchinson, W. Han, Z. Parekh, X. Li, H. Zhang, J. Baldridge, and Y. Wu, “Scaling autoregressive models for content-rich text-to-image generation,” Trans. Mach. Learn. Res., vol. 2022, pp. 1–53, Nov. 2022.
    [11]
    A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in Proc. 38th Int. Conf. Machine Learning, 2021, pp. 8821–8831.
    [12]
    A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with CLIP latents,” arXiv preprint arXiv: 2204.06125, 2022.
    [13]
    J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, W. Manassra, P. Dhariwal, C. Chu, and Y. Jiao, “Improving image generation with better captions,” 2023. [Online]. Available: https://cdn.openai.com/papers/dall-e-3.pdf.
    [14]
    K. Wang, C. Gou, N. Zheng, J. M. Rehg, and F.-Y. Wang, “Parallel vision for perception and understanding of complex scenes: Methods, framework, and perspectives,” Artif. Intell. Rev., vol. 48, no. 3, pp. 299–329, Oct. 2017. doi: 10.1007/s10462-017-9569-z
    [15]
    H. Zhang, G. Luo, Y. Li, and F.-Y. Wang, “Parallel vision for intelligent transportation systems in metaverse: Challenges, solutions, and potential applications,” IEEE Trans. Syst. Man Cybern. Syst., vol. 53, no. 6, pp. 3400–3413, Jun. 2023. doi: 10.1109/TSMC.2022.3228314
    [16]
    H. Zhang, Y. Tian, K. Wang, W. Zhang, and F.-Y. Wang, “Mask SSD: An effective single-stage approach to object instance segmentation,” IEEE Trans. Image Process., vol. 29, pp. 2078–2093, 2020. doi: 10.1109/TIP.2019.2947806
    [17]
    B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, and M. Irani, “Imagic: Text-based real image editing with diffusion models,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 6007–6017.
    [18]
    T. Brooks, A. Holynski, and A. A. Efros, “InstructPix2Pix: Learning to follow image editing instructions,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 18392–18402.
    [19]
    X. L. Li, J. Thickstun, I. Gulrajani, P. Liang, and T. B. Hashimoto, “Diffusion-LM improves controllable text generation,” in Proc. 36th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2022, pp. 313.
    [20]
    S. Ge, S. Nah, G. Liu, T. Poon, A. Tao, B. Catanzaro, D. Jacobs, J.-B. Huang, M.-Y. Liu, and Y. Balaji, “Preserve your own correlation: A noise prior for video diffusion models,” in Proc. IEEE/CVF Int. Conf. Computer Vision, Paris, France, 2023, pp. 22873–22884.
    [21]
    A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and L. Van Gool, “RePaint: Inpainting using denoising diffusion probabilistic models,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, New Orleans, USA, 2022, pp. 11451–11461.
    [22]
    L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proc. IEEE/CVF Int. Conf. Computer Vision, Paris, France, 2023, pp. 3813–3824.
    [23]
    J. Z. Wu, Y. Ge, X. Wang, S. W. Lei, Y. Gu, Y. Shi, W. Hsu, Y. Shan, X. Qie, and M. Z. Shou, “Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation,” in Proc. IEEE/CVF Int. Conf. Computer Vision, Paris, France, 2023, pp. 7589–7599.
    [24]
    C. Meng, Y. He, Y. Song, J. Song, J. Wu, J.-Y. Zhu, and S. Ermon, “SDEdit: Guided image synthesis and editing with stochastic differential equations,” in Proc. Tenth Int. Conf. Learning Representations, 2022.
    [25]
    J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” in Proc. 9th Int. Conf. Learning Representations, 2021.
    [26]
    S. Witteveen and M. Andrews, “Investigating prompt engineering in diffusion models,” arXiv preprint arXiv: 2211.15462, 2022.
    [27]
    N. Tumanyan, M. Geyer, S. Bagon, and T. Dekel, “Plug-and-play diffusion features for text-driven image-to-image translation,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 1921–1930.
    [28]
    M. Kwon, J. Jeong, and Y. Uh, “Diffusion models already have a semantic latent space,” in Proc. Eleventh Int. Conf. Learning Representations, Kigali, Rwanda, 2023.
    [29]
    R. Mokady, A. Hertz, K. Aberman, Y. Pritch, and D. Cohen-Or, “Null-text inversion for editing real images using guided diffusion models,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 6038–6047.
    [30]
    D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proc. 2nd Int. Conf. Learning Representations, Banff, Canada, 2014.
    [31]
    D. Rezende and S. Mohamed, “Variational inference with normalizing flows,” in Proc. 32nd Int. Conf. Machine Learning, Lille, France, 2015, pp. 1530–1538.
    [32]
    L. Dinh, D. Krueger, and Y. Bengio, “NICE: Non-linear independent components estimation,” in Proc. 3rd Int. Conf. Learning Representations, San Diego, USA, 2015.
    [33]
    D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1×1 convolutions,” in Proc. 32nd Int. Conf. Neural Information Processing Systems, Montréal, Canada, 2018, pp. 10236–10245.
    [34]
    Z. Huang, S. Chen, J. Zhang, and H. Shan, “AgeFlow: Conditional age progression and regression with normalizing flows,” in Proc. Thirtieth Int. Joint Conf. Artificial Intelligence, Montreal, Canada, 2021, pp. 743–750.
    [35]
    P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Honolulu, USA, 2017, pp. 5967–5976.
    [36]
    J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. IEEE Int. Conf. Computer Vision, Venice, Italy, 2017, pp. 2242–2251.
    [37]
    Z. Yi, H. Zhang, P. Tan, and M. Gong, “DualGAN: Unsupervised dual learning for image-to-image translation,” in Proc. IEEE Int. Conf. Computer Vision, Venice, Italy, 2017, pp. 2868–2876.
    [38]
    Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018, pp. 8789–8797.
    [39]
    A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer, “GANimation: Anatomically-aware facial animation from a single image,” in Proc. 15th European Conf. Computer Vision, Munich, Germany, 2018, pp. 835–851.
    [40]
    Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, “AttGAN: Facial attribute editing by only changing what you want,” IEEE Trans. Image Process., vol. 28, no. 11, pp. 5464–5478, Nov. 2019. doi: 10.1109/TIP.2019.2916751
    [41]
    T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, “Semantic image synthesis with spatially-adaptive normalization,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Long Beach, USA, 2019, pp. 2332–2341.
    [42]
    K. Zhang, Y. Su, X. Guo, L. Qi, and Z. Zhao, “MU-GAN: Facial attribute editing based on multi-attention mechanism,” IEEE/CAA J. Autom. Sinica, vol. 8, no. 9, pp. 1614–1626, Sep. 2021. doi: 10.1109/JAS.2020.1003390
    [43]
    M. Liu, Y. Ding, M. Xia, X. Liu, E. Ding, W. Zuo, and S. Wen, “STGAN: A unified selective transfer network for arbitrary image attribute editing,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Long Beach, USA, 2019, pp. 3668–3677.
    [44]
    E. Richardson, Y. Alaluf, O. Patashnik, Y. Nitzan, Y. Azar, S. Shapiro, and D. Cohen-Or, “Encoding in style: A StyleGAN encoder for image-to-image translation,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Nashville, USA, 2021, pp. 2287–2296.
    [45]
    H. Lin, Y. Liu, S. Li, and X. Qu, “How generative adversarial networks promote the development of intelligent transportation systems: A survey,” IEEE/CAA J. Autom. Sinica, vol. 10, no. 9, pp. 1781–1796, Sep. 2023. doi: 10.1109/JAS.2023.123744
    [46]
    K. Wang, C. Gou, Y. Duan, Y. Lin, X. Zheng, and F.-Y. Wang, “Generative adversarial networks: Introduction and outlook,” IEEE/CAA J. Autom. Sinica, vol. 4, no. 4, pp. 588–598, Oct. 2017. doi: 10.1109/JAS.2017.7510583
    [47]
    Y. Yuan, S. Ma, and J. Zhang, “VR-FAM: Variance-reduced encoder with nonlinear transformation for facial attribute manipulation,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Singapore, Singapore, 2022, pp. 1755–1759.
    [48]
    Y. Yuan, S. Ma, H. Shan, and J. Zhang, “DO-FAM: Disentangled non-linear latent navigation for facial attribute manipulation,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 2023, pp. 1–5.
    [49]
    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proc. 38th Int. Conf. Machine Learning, 2021, pp. 8748–8763.
    [50]
    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Proc. 34th Int. Conf. Neural Information Processing Systems, Vancouver, Canada, 2020, pp. 159.
    [51]
    G. Couairon, J. Verbeek, H. Schwenk, and M. Cord, “DiffEdit: Diffusion-based semantic image editing with mask guidance,” in Proc. Eleventh Int. Conf. Learning Representations, Kigali, Rwanda, 2023.
    [52]
    B. Wallace, A. Gokul, and N. Naik, “EDICT: Exact diffusion inversion via coupled transformations,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 22532–22541.
    [53]
    L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using real NVP,” in Proc. 5th Int. Conf. Learning Representations, Toulon, France, 2017.
    [54]
    A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or, “Prompt-to-prompt image editing with cross-attention control,” in Proc. Eleventh Int. Conf. Learning Representations, Kigali, Rwanda, 2023.
    [55]
    B. Liu, C. Wang, T. Cao, K. Jia, and J. Huang, “Towards understanding cross and self-attention in stable diffusion for text-guided image editing,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Seattle, USA, 2024, pp. 7817–7826.
    [56]
    N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, “DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 22500–22510.
    [57]
    R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Personalizing text-to-image generation using textual inversion,” in Proc. Eleventh Int. Conf. Learning Representations, Kigali, Rwanda, 2023.
    [58]
    H. Ye, J. Zhang, S. Liu, X. Han, and W. Yang, “IP-adapter: Text compatible image prompt adapter for text-to-image diffusion models,” arXiv preprint arXiv: 2308.06721, 2023.
    [59]
    A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in Proc. 31st Int. Conf. Neural Information Processing Systems, Long Beach, USA, 2017, pp. 6309–6318.
    [60]
    D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo, D. Song, J. Steinhardt, and J. Gilmer, “The many faces of robustness: A critical analysis of out-of-distribution generalization,” in Proc. IEEE/CVF Int. Conf. Computer Vision, Montreal, Canada, 2021, pp. 8320–8329.
    [61]
    O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Proc. 18th Int. Conf. Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 2015, pp. 234–241.
    [62]
    A. Maćkiewicz and W. Ratajczak, “Principal components analysis (PCA),” Comput. Geosci., vol. 19, no. 3, pp. 303–342, Mar. 1993.
    [63]
    C. Si, Z. Huang, Y. Jiang, and Z. Liu, “FreeU: Free lunch in diffusion U-net,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Seattle, USA, 2024, pp. 4733–4743.
    [64]
    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. 13th European Conf. Computer Vision, Zurich, Switzerland, 2014, pp. 740–755.
    [65]
    O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015. doi: 10.1007/s11263-015-0816-y
    [66]
    J. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6, pp. 679–698, Nov. 1986. doi: 10.1109/TPAMI.1986.4767851
    [67]
    Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in Proc. Thirty-Seventh Asilomar Conf. Signals, Systems & Computers, Pacific Grove, USA, 2003, pp. 1398–1402.
    [68]
    S. Xie and Z. Tu, “Holistically-nested edge detection,” in Proc. IEEE Int. Conf. Computer Vision, Santiago, Chile, 2015, pp. 1395–1403.

