Citation: | Y. Yuan, G. Yang, J. Wang, H. Zhang, H. Shan, F.-Y. Wang, and J. Zhang, “Dissecting and mitigating semantic discrepancy in stable diffusion for image-to-image translation,” IEEE/CAA J. Autom. Sinica, vol. 12, no. 4, pp. 1–14, Apr. 2025. doi: 10.1109/JAS.2024.124800 |
[1] |
I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. 28th Int. Conf. Neural Information Processing Systems, Montreal, Canada, 2014, pp. 2672–2680.
|
[2] |
Y. Chen, Y. Lv, and F.-Y. Wang, “Traffic flow imputation using parallel data and generative adversarial networks,” IEEE Trans. Intell. Transp. Syst., vol. 21, no. 4, pp. 1624–1630, Apr. 2020. doi: 10.1109/TITS.2019.2910295
|
[3] |
J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Proc. 34th Int. Conf. Neural Information Processing Systems, Vancouver, Canada, 2020, pp. 574.
|
[4] |
P. Dhariwal and A. Nichol, “Diffusion models beat GANs on image synthesis,” in Proc. 35th Int. Conf. Neural Information Processing Systems, 2021, pp. 672.
|
[5] |
C. Wang, T. Chen, Z. Chen, Z. Huang, T. Jiang, Q. Wang, and H. Shan, “FLDM-VTON: Faithful latent diffusion model for virtual try-on,” in Proc. Thirty-Third Int. Joint Conf. Artificial Intelligence, Jeju, South Korea, 2024, pp. 1362–1370.
|
[6] |
Q. Gao, Z. Li, J. Zhang, Y. Zhang, and H. Shan, “CoreDiff: Contextual error-modulated generalized diffusion model for low-dose CT denoising and generalization,” IEEE Trans. Med. Imaging, vol. 43, no. 2, pp. 745–759, Feb. 2024. doi: 10.1109/TMI.2023.3320812
|
[7] |
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, New Orleans, USA, 2022, pp. 10674–10685.
|
[8] |
A. Q. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. Mcgrew, I. Sutskever, and M. Chen, “GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models,” in Proc. 39th Int. Conf. Machine Learning, Baltimore, USA, 2022, pp. 16784–16804.
|
[9] |
C. Saharia, W. Chan, S. Saxena, L. Lit, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. Gontijo-Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, “Photorealistic text-to-image diffusion models with deep language understanding,” in Proc. 36th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2022, pp. 2643.
|
[10] |
J. Yu, Y. Xu, J. Y. Koh, T. Luong, G. Baid, Z. Wang, V. Vasudevan, A. Ku, Y. Yang, B. K. Ayan, B. Hutchinson, W. Han, Z. Parekh, X. Li, H. Zhang, J. Baldridge, and Y. Wu, “Scaling autoregressive models for content-rich text-to-image generation,” Trans. Mach. Learn. Res., vol. 2022, pp. 1–53, Nov. 2022.
|
[11] |
A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in Proc. 38th Int. Conf. Machine Learning, 2021, pp. 8821–8831.
|
[12] |
A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with CLIP latents,” arXiv preprint arXiv: 2204.06125, 2022.
|
[13] |
J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, W. Manassra, P. Dhariwal, C. Chu, and Y. Jiao, “Improving image generation with better captions,” 2023. [Online]. Available: https://cdn.openai.com/papers/dall-e-3.pdf.
|
[14] |
K. Wang, C. Gou, N. Zheng, J. M. Rehg, and F.-Y. Wang, “Parallel vision for perception and understanding of complex scenes: Methods, framework, and perspectives,” Artif. Intell. Rev., vol. 48, no. 3, pp. 299–329, Oct. 2017. doi: 10.1007/s10462-017-9569-z
|
[15] |
H. Zhang, G. Luo, Y. Li, and F.-Y. Wang, “Parallel vision for intelligent transportation systems in metaverse: Challenges, solutions, and potential applications,” IEEE Trans. Syst. Man Cybern. Syst., vol. 53, no. 6, pp. 3400–3413, Jun. 2023. doi: 10.1109/TSMC.2022.3228314
|
[16] |
H. Zhang, Y. Tian, K. Wang, W. Zhang, and F.-Y. Wang, “Mask SSD: An effective single-stage approach to object instance segmentation,” IEEE Trans. Image Process., vol. 29, pp. 2078–2093, 2020. doi: 10.1109/TIP.2019.2947806
|
[17] |
B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, and M. Irani, “Imagic: Text-based real image editing with diffusion models,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 6007–6017.
|
[18] |
T. Brooks, A. Holynski, and A. A. Efros, “InstructPix2Pix: Learning to follow image editing instructions,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 18392–18402.
|
[19] |
X. L. Li, J. Thickstun, I. Gulrajani, P. Liang, and T. B. Hashimoto, “Diffusion-LM improves controllable text generation,” Proc. 36th Int. Conf. Neural Information Processing Systems, New Orleans, USA, 2022, pp. 313.
|
[20] |
S. Ge, S. Nah, G. Liu, T. Poon, A. Tao, B. Catanzaro, D. Jacobs, J.-B. Huang, M.-Y. Liu, and Y. Balaji, “Preserve your own correlation: A noise prior for video diffusion models,” in Proc. IEEE/CVF Int. Conf. Computer Vision, Paris, France, 2023, pp. 22873–22884.
|
[21] |
A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and L. Van Gool, “RePaint: Inpainting using denoising diffusion probabilistic models,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, New Orleans, USA, 2022, pp. 11451–11461.
|
[22] |
L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proc. IEEE/CVF Int. Conf. Computer Vision, Paris, France, 2023, pp. 3813–3824.
|
[23] |
J. Z. Wu, Y. Ge, X. Wang, S. W. Lei, Y. Gu, Y. Shi, W. Hsu, Y. Shan, X. Qie, and M. Z. Shou, “Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation,” in Proc. IEEE/CVF Int. Conf. Computer Vision, Paris, France, 2023, pp. 7589–7599.
|
[24] |
C. Meng, Y. He, Y. Song, J. Song, J. Wu, J.-Y. Zhu, and S. Ermon, “SDEdit: Guided image synthesis and editing with stochastic differential equations,” in Proc. Tenth Int. Conf. Learning Representations, 2021.
|
[25] |
J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” in Proc. 9th Int. Conf. Learning Representations, 2020.
|
[26] |
S. Witteveen and M. Andrews, “Investigating prompt engineering in diffusion models,” arXiv preprint arXiv: 2211.15462, 2022.
|
[27] |
N. Tumanyan, M. Geyer, S. Bagon, and T. Dekel, “Plug-and-play diffusion features for text-driven image-to-image translation,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 1921–1930.
|
[28] |
M. Kwon, J. Jeong, and Y. Uh, “Diffusion models already have a semantic latent space,” in Proc. Eleventh Int. Conf. Learning Representations, Kigali, Rwanda, 2023.
|
[29] |
R. Mokady, A. Hertz, K. Aberman, Y. Pritch, and D. Cohen-Or, “Null-text inversion for editing real images using guided diffusion models,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 6038–6047.
|
[30] |
D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proc. 2nd Int. Conf. Learning Representations, Banff, Canada, 2014.
|
[31] |
D. Rezende and S. Mohamed, “Variational inference with normalizing flows,” in Proc. 32nd Int. Conf. Machine Learning, Lille, France, 2015, pp. 1530–1538.
|
[32] |
L. Dinh, D. Krueger, and Y. Bengio, “NICE: Non-linear independent components estimation,” in Proc. 3rd Int. Conf. Learning Representations, San Diego, USA, 2015.
|
[33] |
D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1×1 convolutions,” in Proc. 32nd Int. Conf. Neural Information Processing Systems, Montréal, Canada, 2018, pp. 10236–10245.
|
[34] |
Z. Huang, S. Chen, J. Zhang, and H. Shan, “AgeFlow: Conditional age progression and regression with normalizing flows,” in Proc. Thirtieth Int. Joint Conf. Artificial Intelligence, Montreal, Canada, 2021, pp. 743–750.
|
[35] |
P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Honolulu, USA, 2017, pp. 5967–5976.
|
[36] |
J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. IEEE Int. Conf. Computer Vision, Venice, Italy, 2017, pp. 2242–2251.
|
[37] |
Z. Yi, H. Zhang, P. Tan, and M. Gong, “DualGAN: Unsupervised dual learning for image-to-image translation,” in Proc. IEEE Int. Conf. Computer Vision, Venice, Italy, 2017, pp. 2868–2876.
|
[38] |
Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018, pp. 8789–8797.
|
[39] |
A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer, “GANimation: Anatomically-aware facial animation from a single image,” in Proc. 15th European Conf. Computer Vision, Munich, Germany, 2018, pp. 835–851.
|
[40] |
Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, “AttGAN: Facial attribute editing by only changing what you want,” IEEE Trans. Image Process., vol. 28, no. 11, pp. 5464–5478, Nov. 2019. doi: 10.1109/TIP.2019.2916751
|
[41] |
T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, “Semantic image synthesis with spatially-adaptive normalization,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Long Beach, USA, 2019, pp. 2332–2341.
|
[42] |
K. Zhang, Y. Su, X. Guo, L. Qi, and Z. Zhao, “MU-GAN: Facial attribute editing based on multi-attention mechanism,” IEEE/CAA J. Autom. Sinica, vol. 8, no. 9, pp. 1614–1626, Sep. 2021. doi: 10.1109/JAS.2020.1003390
|
[43] |
M. Liu, Y. Ding, M. Xia, X. Liu, E. Ding, W. Zuo, and S. Wen, “STGAN: A unified selective transfer network for arbitrary image attribute editing,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Long Beach, USA, 2019, pp. 3668–3677.
|
[44] |
E. Richardson, Y. Alaluf, O. Patashnik, Y. Nitzan, Y. Azar, S. Shapiro, and D. Cohen-Or, “Encoding in style: A StyleGAN encoder for image-to-image translation,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Nashville, USA, 2021, pp. 2287–2296.
|
[45] |
H. Lin, Y. Liu, S. Li, and X. Qu, “How generative adversarial networks promote the development of intelligent transportation systems: A survey,” IEEE/CAA J. Autom. Sinica, vol. 10, no. 9, pp. 1781–1796, Sep. 2023. doi: 10.1109/JAS.2023.123744
|
[46] |
K. Wang, C. Gou, Y. Duan, Y. Lin, X. Zheng, and F.-Y. Wang, “Generative adversarial networks: Introduction and outlook,” IEEE/CAA J. Autom. Sinica, vol. 4, no. 4, pp. 588–598, Oct. 2017. doi: 10.1109/JAS.2017.7510583
|
[47] |
Y. Yuan, S. Ma, and J. Zhang, “VR-FAM: Variance-reduced encoder with nonlinear transformation for facial attribute manipulation,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Singapore, Singapore, 2022, pp. 1755–1759.
|
[48] |
Y. Yuan, S. Ma, H. Shan, and J. Zhang, “DO-FAM: Disentangled non-linear latent navigation for facial attribute manipulation,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 2023, pp. 1–5.
|
[49] |
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proc. 38th Int. Conf. Machine Learning, 2021, pp. 8748–8763.
|
[50] |
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Proc. 34th Int. Conf. Neural Information Processing Systems, Vancouver, Canada, 2020, pp. 159.
|
[51] |
G. Couairon, J. Verbeek, H. Schwenk, and M. Cord, “Diffedit: Diffusion-based semantic image editing with mask guidance,” in Proc. the Eleventh Int. Conf. Learning Representations, Kigali, Rwanda, 2023.
|
[52] |
B. Wallace, A. Gokul, and N. Naik, “EDICT: Exact diffusion inversion via coupled transformations,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 22532–22541.
|
[53] |
L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using real NVP,” in Proc. 5th Int. Conf. Learning Representations, Toulon, France, 2017.
|
[54] |
A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or, “Prompt-to-prompt image editing with cross-attention control,” in Proc. Eleventh Int. Conf. Learning Representations, Kigali, Rwanda, 2023.
|
[55] |
B. Liu, C. Wang, T. Cao, K. Jia, and J. Huang, “Towards understanding cross and self-attention in stable diffusion for text-guided image editing,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Seattle, USA, 2024, pp. 7817–7826.
|
[56] |
N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, “DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 22500–22510.
|
[57] |
R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Personalizing text-to-image generation using textual inversion,” in Proc. Eleventh Int. Conf. Learning Representations, Kigali, Rwanda, 2023.
|
[58] |
H. Ye, J. Zhang, S. Liu, X. Han, and W. Yang, “IP-adapter: Text compatible image prompt adapter for text-to-image diffusion models,” arXiv preprint arXiv: 2308.06721, 2023.
|
[59] |
A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in Proc. 31st Int. Conf. Neural Information Processing Systems, Long Beach, USA, 2017, pp. 6309–6318.
|
[60] |
D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo, D. Song, J. Steinhardt, and J. Gilmer, “The many faces of robustness: A critical analysis of out-of-distribution generalization,” in Proc. IEEE/CVF Int. Conf. Computer Vision, Montreal, Canada, 2021, pp. 8320–8329.
|
[61] |
O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Proc. 18th Int. Conf. Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 2015, pp. 234–241.
|
[62] |
A. Maćkiewicz and W. Ratajczak, “Principal components analysis (PCA),” Computers & Geosciences, vol. 19, no. 3, pp. 303–342, Mar. 1993.
|
[63] |
C. Si, Z. Huang, Y. Jiang, and Z. Liu, “FreeU: Free lunch in diffusion U-net,” in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Seattle, USA, 2023, pp. 4733–4743.
|
[64] |
T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. 13th European Conf. Computer Vision, Zurich, Switzerland, 2014, pp. 740–755.
|
[65] |
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015. doi: 10.1007/s11263-015-0816-y
|
[66] |
J. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6, pp. 679–698, Nov. 1986. doi: 10.1109/TPAMI.1986.4767851
|
[67] |
Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in Proc. Thirty-Seventh Asilomar Conf. Signals, Systems & Computers, Pacific Grove, USA, 2003, pp. 1398–1402.
|
[68] |
S. Xie and Z. Tu, “Holistically-nested edge detection,” in Proc. IEEE Int. Conf. Computer Vision, Santiago, Chile, 2015, pp. 1395–1403.
|