A journal of IEEE and CAA, publishing high-quality papers in English on original theoretical/experimental research and development in all areas of automation.

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 15.3, Top 1 (SCI Q1)
  • CiteScore: 23.5, Top 2% (Q1)
  • Google Scholar h5-index: 77, Top 5
Citation: Y. Yuan, G. Yang, J. Wang, H. Zhang, H. Shan, F.-Y. Wang, and J. Zhang, “Dissecting and mitigating semantic discrepancy in stable diffusion for image-to-image translation,” IEEE/CAA J. Autom. Sinica, vol. 12, no. 4, pp. 1–14, Apr. 2025. doi: 10.1109/JAS.2024.124800

Dissecting and Mitigating Semantic Discrepancy in Stable Diffusion for Image-to-Image Translation

doi: 10.1109/JAS.2024.124800
Funds: This work was supported in part by the National Natural Science Foundation of China (62176059). The work of James Z. Wang was supported by The Pennsylvania State University.
  • Finding suitable initial noise that retains the original image’s information is crucial for image-to-image (I2I) translation using text-to-image (T2I) diffusion models. A common approach is to add random noise directly to the original image, as in SDEdit. However, we have observed that this can result in “semantic discrepancy” issues, wherein T2I diffusion models misinterpret the semantic relationships and generate content not present in the original image. We identify that the noise introduced by SDEdit disrupts the semantic integrity of the image, leading to unintended associations between unrelated regions after U-Net upsampling. Building on the widely used latent diffusion model, Stable Diffusion, we propose a training-free, plug-and-play method to alleviate semantic discrepancy and enhance the fidelity of the translated image. By leveraging the deterministic nature of denoising diffusion implicit model (DDIM) inversion, we correct the erroneous features and correlations from the original generative process with accurate ones from DDIM inversion. This approach alleviates semantic discrepancy and surpasses recent DDIM-inversion-based methods such as PnP with fewer priors, achieving an 11.2× speedup in experiments conducted on the COCO, ImageNet, and ImageNet-R datasets across multiple I2I translation tasks. The code is available at https://github.com/Sherlockyyf/Semantic_Discrepancy.
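    For illustration, the deterministic DDIM inversion the abstract relies on can be written as a short recursion. The sketch below is a minimal, assumed PyTorch implementation, not the authors' released code: eps_model (a noise predictor such as the Stable Diffusion U-Net), alphas_cumprod (the cumulative noise schedule), and cond (the text embedding) are placeholder names introduced here for the example.

        import torch

        @torch.no_grad()
        def ddim_invert(z0, eps_model, alphas_cumprod, cond, num_steps=50):
            # Illustrative sketch (assumed interfaces, not the paper's code):
            # deterministically map a clean latent z0 toward a noisy latent z_T
            # by running the DDIM update in reverse (low noise -> high noise).
            T = alphas_cumprod.shape[0]
            timesteps = torch.linspace(0, T - 1, num_steps).long()
            z = z0.clone()
            for i in range(num_steps - 1):
                t, t_next = timesteps[i], timesteps[i + 1]
                a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
                eps = eps_model(z, t, cond)  # predicted noise at the current step
                z0_pred = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # implied clean latent
                z = a_next.sqrt() * z0_pred + (1 - a_next).sqrt() * eps  # deterministic re-noising
            return z

    Because the recursion is deterministic, denoising from the returned latent approximately retraces the original image; this is what allows the accurate inversion-side features and correlations to stand in for the erroneous ones described in the abstract.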

     

References
    [1]
    I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. 28th Int. Conf. Neural Information Processing Systems, Montreal, Canada, 2014, pp. 2672–2680.
    [2]
    Y. Chen, Y. Lv, and F.-Y. Wang, “Traffic flow imputation using parallel data and generative adversarial networks,” IEEE Trans. Intell. Transp. Syst., vol. 21, no. 4, pp. 1624–1630, Apr. 2020. doi: 10.1109/TITS.2019.2910295
    [3]
    J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Proc. 34th Int. Conf. Neural Information Processing Systems, Vancouver, Canada, 2020, pp. 574.
    [4]
    P. Dhariwal and A. Nichol, “Diffusion models beat GANs on image synthesis,” in Proc. 35th Int. Conf. Neural Information Processing Systems, 2021, pp. 672.
    [5]
    C. Wang, T. Chen, Z. Chen, Z. Huang, T. Jiang, Q. Wang, and H. Shan, “FLDM-VTON: Faithful latent diffusion model for virtual try-on,” in Proc. Thirty-Third Int. Joint Conf. Artificial Intelligence, Jeju, South Korea, 2024, pp. 1362–1370.
    [6]
    Q. Gao, Z. Li, J. Zhang, Y. Zhang, and H. Shan, “CoreDiff: Contextual error-modulated generalized diffusion model for low-dose CT denoising and generalization,” IEEE Trans. Med. Imaging, vol. 43, no. 2, pp. 745–759, Feb. 2024. doi: 10.1109/TMI.2023.3320812
    [7]
    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, New Orleans, USA, 2022, pp. 10674–10685.
    [8]
    A. Q. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. Mcgrew, I. Sutskever, and M. Chen, “GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models,” in Proc. 39th Int. Conf. Machine Learning, Baltimore, USA, 2022, pp. 16784–16804.
    [9]
    C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. Gontijo-Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, “Photorealistic text-to-image diffusion models with deep language understanding,” in Proc. 36th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2022, pp. 2643.
    [10]
    J. Yu, Y. Xu, J. Y. Koh, T. Luong, G. Baid, Z. Wang, V. Vasudevan, A. Ku, Y. Yang, B. K. Ayan, B. Hutchinson, W. Han, Z. Parekh, X. Li, H. Zhang, J. Baldridge, and Y. Wu, “Scaling autoregressive models for content-rich text-to-image generation,” Trans. Mach. Learn. Res., vol. 2022, pp. 1–53, Nov. 2022.
    [11]
    A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in Proc. 38th Int. Conf. Machine Learning, 2021, pp. 8821–8831.
    [12]
    A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with CLIP latents,” arXiv preprint arXiv: 2204.06125, 2022.
    [13]
    J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, W. Manassra, P. Dhariwal, C. Chu, and Y. Jiao, “Improving image generation with better captions,” 2023. [Online]. Available: https://cdn.openai.com/papers/dall-e-3.pdf.
    [14]
    K. Wang, C. Gou, N. Zheng, J. M. Rehg, and F.-Y. Wang, “Parallel vision for perception and understanding of complex scenes: Methods, framework, and perspectives,” Artif. Intell. Rev., vol. 48, no. 3, pp. 299–329, Oct. 2017. doi: 10.1007/s10462-017-9569-z
    [15]
    H. Zhang, G. Luo, Y. Li, and F.-Y. Wang, “Parallel vision for intelligent transportation systems in metaverse: Challenges, solutions, and potential applications,” IEEE Trans. Syst. Man Cybern. Syst., vol. 53, no. 6, pp. 3400–3413, Jun. 2023. doi: 10.1109/TSMC.2022.3228314
    [16]
    H. Zhang, Y. Tian, K. Wang, W. Zhang, and F.-Y. Wang, “Mask SSD: An effective single-stage approach to object instance segmentation,” IEEE Trans. Image Process., vol. 29, pp. 2078–2093, 2020. doi: 10.1109/TIP.2019.2947806
    [17]
    B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, and M. Irani, “Imagic: Text-based real image editing with diffusion models,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 6007–6017.
    [18]
    T. Brooks, A. Holynski, and A. A. Efros, “InstructPix2Pix: Learning to follow image editing instructions,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 18392–18402.
    [19]
    X. L. Li, J. Thickstun, I. Gulrajani, P. Liang, and T. B. Hashimoto, “Diffusion-LM improves controllable text generation,” in Proc. 36th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2022, pp. 313.
    [20]
    S. Ge, S. Nah, G. Liu, T. Poon, A. Tao, B. Catanzaro, D. Jacobs, J.-B. Huang, M.-Y. Liu, and Y. Balaji, “Preserve your own correlation: A noise prior for video diffusion models,” in Proc. IEEE/CVF Int. Conf. Computer Vision, Paris, France, 2023, pp. 22873–22884.
    [21]
    A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and L. Van Gool, “RePaint: Inpainting using denoising diffusion probabilistic models,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, New Orleans, USA, 2022, pp. 11451–11461.
    [22]
    L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proc. IEEE/CVF Int. Conf. Computer Vision, Paris, France, 2023, pp. 3813–3824.
    [23]
    J. Z. Wu, Y. Ge, X. Wang, S. W. Lei, Y. Gu, Y. Shi, W. Hsu, Y. Shan, X. Qie, and M. Z. Shou, “Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation,” in Proc. IEEE/CVF Int. Conf. Computer Vision, Paris, France, 2023, pp. 7589–7599.
    [24]
    C. Meng, Y. He, Y. Song, J. Song, J. Wu, J.-Y. Zhu, and S. Ermon, “SDEdit: Guided image synthesis and editing with stochastic differential equations,” in Proc. Tenth Int. Conf. Learning Representations, 2022.
    [25]
    J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” in Proc. 9th Int. Conf. Learning Representations, 2021.
    [26]
    S. Witteveen and M. Andrews, “Investigating prompt engineering in diffusion models,” arXiv preprint arXiv: 2211.15462, 2022.
    [27]
    N. Tumanyan, M. Geyer, S. Bagon, and T. Dekel, “Plug-and-play diffusion features for text-driven image-to-image translation,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 1921–1930.
    [28]
    M. Kwon, J. Jeong, and Y. Uh, “Diffusion models already have a semantic latent space,” in Proc. Eleventh Int. Conf. Learning Representations, Kigali, Rwanda, 2023.
    [29]
    R. Mokady, A. Hertz, K. Aberman, Y. Pritch, and D. Cohen-Or, “Null-text inversion for editing real images using guided diffusion models,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 6038–6047.
    [30]
    D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proc. 2nd Int. Conf. Learning Representations, Banff, Canada, 2014.
    [31]
    D. Rezende and S. Mohamed, “Variational inference with normalizing flows,” in Proc. 32nd Int. Conf. Machine Learning, Lille, France, 2015, pp. 1530–1538.
    [32]
    L. Dinh, D. Krueger, and Y. Bengio, “NICE: Non-linear independent components estimation,” in Proc. 3rd Int. Conf. Learning Representations, San Diego, USA, 2015.
    [33]
    D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1×1 convolutions,” in Proc. 32nd Int. Conf. Neural Information Processing Systems, Montréal, Canada, 2018, pp. 10236–10245.
    [34]
    Z. Huang, S. Chen, J. Zhang, and H. Shan, “AgeFlow: Conditional age progression and regression with normalizing flows,” in Proc. Thirtieth Int. Joint Conf. Artificial Intelligence, Montreal, Canada, 2021, pp. 743–750.
    [35]
    P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Honolulu, USA, 2017, pp. 5967–5976.
    [36]
    J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. IEEE Int. Conf. Computer Vision, Venice, Italy, 2017, pp. 2242–2251.
    [37]
    Z. Yi, H. Zhang, P. Tan, and M. Gong, “DualGAN: Unsupervised dual learning for image-to-image translation,” in Proc. IEEE Int. Conf. Computer Vision, Venice, Italy, 2017, pp. 2868–2876.
    [38]
    Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018, pp. 8789–8797.
    [39]
    A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer, “GANimation: Anatomically-aware facial animation from a single image,” in Proc. 15th European Conf. Computer Vision, Munich, Germany, 2018, pp. 835–851.
    [40]
    Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, “AttGAN: Facial attribute editing by only changing what you want,” IEEE Trans. Image Process., vol. 28, no. 11, pp. 5464–5478, Nov. 2019. doi: 10.1109/TIP.2019.2916751
    [41]
    T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, “Semantic image synthesis with spatially-adaptive normalization,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Long Beach, USA, 2019, pp. 2332–2341.
    [42]
    K. Zhang, Y. Su, X. Guo, L. Qi, and Z. Zhao, “MU-GAN: Facial attribute editing based on multi-attention mechanism,” IEEE/CAA J. Autom. Sinica, vol. 8, no. 9, pp. 1614–1626, Sep. 2021. doi: 10.1109/JAS.2020.1003390
    [43]
    M. Liu, Y. Ding, M. Xia, X. Liu, E. Ding, W. Zuo, and S. Wen, “STGAN: A unified selective transfer network for arbitrary image attribute editing,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Long Beach, USA, 2019, pp. 3668–3677.
    [44]
    E. Richardson, Y. Alaluf, O. Patashnik, Y. Nitzan, Y. Azar, S. Shapiro, and D. Cohen-Or, “Encoding in style: A StyleGAN encoder for image-to-image translation,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Nashville, USA, 2021, pp. 2287–2296.
    [45]
    H. Lin, Y. Liu, S. Li, and X. Qu, “How generative adversarial networks promote the development of intelligent transportation systems: A survey,” IEEE/CAA J. Autom. Sinica, vol. 10, no. 9, pp. 1781–1796, Sep. 2023. doi: 10.1109/JAS.2023.123744
    [46]
    K. Wang, C. Gou, Y. Duan, Y. Lin, X. Zheng, and F.-Y. Wang, “Generative adversarial networks: Introduction and outlook,” IEEE/CAA J. Autom. Sinica, vol. 4, no. 4, pp. 588–598, Oct. 2017. doi: 10.1109/JAS.2017.7510583
    [47]
    Y. Yuan, S. Ma, and J. Zhang, “VR-FAM: Variance-reduced encoder with nonlinear transformation for facial attribute manipulation,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Singapore, Singapore, 2022, pp. 1755–1759.
    [48]
    Y. Yuan, S. Ma, H. Shan, and J. Zhang, “DO-FAM: Disentangled non-linear latent navigation for facial attribute manipulation,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 2023, pp. 1–5.
    [49]
    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proc. 38th Int. Conf. Machine Learning, 2021, pp. 8748–8763.
    [50]
    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Proc. 34th Int. Conf. Neural Information Processing Systems, Vancouver, Canada, 2020, pp. 159.
    [51]
    G. Couairon, J. Verbeek, H. Schwenk, and M. Cord, “DiffEdit: Diffusion-based semantic image editing with mask guidance,” in Proc. Eleventh Int. Conf. Learning Representations, Kigali, Rwanda, 2023.
    [52]
    B. Wallace, A. Gokul, and N. Naik, “EDICT: Exact diffusion inversion via coupled transformations,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 22532–22541.
    [53]
    L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using real NVP,” in Proc. 5th Int. Conf. Learning Representations, Toulon, France, 2017.
    [54]
    A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or, “Prompt-to-prompt image editing with cross-attention control,” in Proc. Eleventh Int. Conf. Learning Representations, Kigali, Rwanda, 2023.
    [55]
    B. Liu, C. Wang, T. Cao, K. Jia, and J. Huang, “Towards understanding cross and self-attention in stable diffusion for text-guided image editing,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Seattle, USA, 2024, pp. 7817–7826.
    [56]
    N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, “DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 22500–22510.
    [57]
    R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Personalizing text-to-image generation using textual inversion,” in Proc. Eleventh Int. Conf. Learning Representations, Kigali, Rwanda, 2023.
    [58]
    H. Ye, J. Zhang, S. Liu, X. Han, and W. Yang, “IP-adapter: Text compatible image prompt adapter for text-to-image diffusion models,” arXiv preprint arXiv: 2308.06721, 2023.
    [59]
    A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in Proc. 31st Int. Conf. Neural Information Processing Systems, Long Beach, USA, 2017, pp. 6309–6318.
    [60]
    D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo, D. Song, J. Steinhardt, and J. Gilmer, “The many faces of robustness: A critical analysis of out-of-distribution generalization,” in Proc. IEEE/CVF Int. Conf. Computer Vision, Montreal, Canada, 2021, pp. 8320–8329.
    [61]
    O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Proc. 18th Int. Conf. Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 2015, pp. 234–241.
    [62]
    A. Maćkiewicz and W. Ratajczak, “Principal components analysis (PCA),” Comput. Geosci., vol. 19, no. 3, pp. 303–342, Mar. 1993.
    [63]
    C. Si, Z. Huang, Y. Jiang, and Z. Liu, “FreeU: Free lunch in diffusion U-net,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Seattle, USA, 2024, pp. 4733–4743.
    [64]
    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. 13th European Conf. Computer Vision, Zurich, Switzerland, 2014, pp. 740–755.
    [65]
    O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015. doi: 10.1007/s11263-015-0816-y
    [66]
    J. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6, pp. 679–698, Nov. 1986. doi: 10.1109/TPAMI.1986.4767851
    [67]
    Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in Proc. Thirty-Seventh Asilomar Conf. Signals, Systems & Computers, Pacific Grove, USA, 2003, pp. 1398–1402.
    [68]
    S. Xie and Z. Tu, “Holistically-nested edge detection,” in Proc. IEEE Int. Conf. Computer Vision, Santiago, Chile, 2015, pp. 1395–1403.

