 Volume 11
							Issue 2
								
						 Volume 11
							Issue 2 
						IEEE/CAA Journal of Automatica Sinica
| Citation: | Y. Zhu, Q. Kong, J. Shi, S. Liu, X. Ye, J.-C. Wang, H. Shan, and J. Zhang, “End-to-end paired ambisonic-binaural audio rendering,” IEEE/CAA J. Autom. Sinica, vol. 11, no. 2, pp. 502–513, Feb. 2024. doi: 10.1109/JAS.2023.123969 | 
Binaural rendering is of great interest to virtual reality and immersive media. Although humans can naturally use their two ears to perceive the spatial information contained in sounds, it is a challenging task for machines to achieve binaural rendering since the description of a sound field often requires multiple channels and even the metadata of the sound sources. In addition, the perceived sound varies from person to person even in the same sound field. Previous methods generally rely on individual-dependent head-related transferred function (HRTF) datasets and optimization algorithms that act on HRTFs. In practical applications, there are two major drawbacks to existing methods. The first is a high personalization cost, as traditional methods achieve personalized needs by measuring HRTFs. The second is insufficient accuracy because the optimization goal of traditional methods is to retain another part of information that is more important in perception at the cost of discarding a part of the information. Therefore, it is desirable to develop novel techniques to achieve personalization and accuracy at a low cost. To this end, we focus on the binaural rendering of ambisonic and propose 1) channel-shared encoder and channel-compared attention integrated into neural networks and 2) a loss function quantifying interaural level differences to deal with spatial information. To verify the proposed method, we collect and release the first paired ambisonic-binaural dataset and introduce three metrics to evaluate the content information and spatial information accuracy of the end-to-end methods. Extensive experimental results on the collected dataset demonstrate the superior performance of the proposed method and the shortcomings of previous methods.
 
	                | [1] | H. Møller, “Fundamentals of binaural technology,” Applied Acoustics, vol. 36, no. 3–4, pp. 171–218, 1992. doi:  10.1016/0003-682X(92)90046-U | 
| [2] | A. Mckeag and D. S. Mcgrath, “Sound field format to binaural decoder with head tracking,” J. Audio Engineering Society, p. 4302, 1996. | 
| [3] | D. R. Begault and L. J. Trejo, “3-D sound for virtual reality and multimedia,” National Aeronautics and Space Administration, Tech. Rep., 2000. | 
| [4] | O. Puomio, J. Pätynen, and T. Lokki, “Optimization of virtual loudspeakers for spatial room acoustics reproduction with headphones,” Applied Sciences, vol. 7, no. 12, p. 1282, 2017. doi:  10.3390/app7121282 | 
| [5] | O. Kirkeby and P. A. Nelson, “Digital filter design for inversion problems in sound reproduction,” J. Audio Engineering Society, vol. 47, no. 7/8, pp. 583–595, 1999. | 
| [6] | C. Schörkhuber, M. Zaunschirm, and R. Höldrich, “Binaural rendering of ambisonic signals via magnitude least squares,” in Proc. the Deutsche German Acoustical Society, vol. 44, 2018, pp. 339–342. | 
| [7] | M. Zaunschirm, C. Schörkhuber, and R. Höldrich, “Binaural rendering of ambisonic signals by head-related impulse response time alignment and a diffuseness constraint,” The J. Acoustical Society of America, vol. 143, no. 6, pp. 3616–3627, 2018. doi:  10.1121/1.5040489 | 
| [8] | Z. Ben-Hur, D. L. Alon, R. Mehra, and B. Rafaely, “Binaural reproduction based on bilateral ambisonics and ear-aligned HRTFs,” IEEE/ACM Trans. Audio,Speech,and Language Processing, vol. 29, pp. 901–913, 2021. doi:  10.1109/TASLP.2021.3055038 | 
| [9] | I. Engel, D. Goodman, and L. Picinali, “Improving binaural rendering with bilateral ambisonics and MagLS,” in Proc. Annu. German Conf. Acoustics, Vienna, Austria, 2021, pp. 1608–1611. | 
| [10] | R. Sridhar, J. G. Tylka, and E. Choueiri, “A database of head-related transfer functions and morphological measurements,” J. Audio Engineering Society, vol. 2, pp. 851–855, 2017. | 
| [11] | L. Rayleigh, “On our perception of sound direction,” The London,Edinburgh,and Dublin Philosophical Magazine and Journal of Science, vol. 13, no. 74, pp. 214–232, 1907. doi:  10.1080/14786440709463595 | 
| [12] | M. A. Gerzon, “Periphony: With-height sound reproduction,” J. Audio Engineering Society, vol. 21, no. 1, pp. 2–10, 1973. | 
| [13] | Y. Jia, M. T. Ramanovich, T. Remez, and R. Pomerantz, “Translatotron 2: High-quality direct speech-to-speech translation with voice preservation,” in Proc. 39th Int. Conf.  Machine Learning, vol. 162, 2022, pp. 10120–10134. | 
| [14] | S. O. Arık, J. Chen, K. Peng, W. Ping, and Y. Zhou, “Neural voice cloning with a few samples,” in Advances in Neural Information Processing Systems, vol. 31, 2018, pp. 10040–10050. | 
| [15] | X. Li and Z. Wang, “A HMM-based mandarin chinese singing voice synthesis system,” IEEE/CAA J. Autom. Sinica, vol. 3, no. 2, pp. 192–202, 2016. doi:  10.1109/JAS.2016.7451107 | 
| [16] | T. Sun, C. Wang, H. Dong, Y. Zhou, and C. Guan, “A novel parameter-optimized recurrent attention network for pipeline leakage detection,” IEEE/CAA J. Autom. Sinica, vol. 10, no. 4, pp. 1064–1076, 2023. doi:  10.1109/JAS.2023.123180 | 
| [17] | R. Jiao, K. Peng, and J. Dong, “Remaining useful life prediction for a roller in a hot strip mill based on deep recurrent neural networks,” IEEE/CAA J. Autom. Sinica, vol. 8, no. 7, pp. 1345–1354, 2021. doi:  10.1109/JAS.2021.1004051 | 
| [18] | M. Shang and X. Hong, “Recurrent conformer for WiFi activity recognition,” IEEE/CAA J. Autom. Sinica, vol. 10, no. 6, pp. 1491–1493, 2023. | 
| [19] | I. D. Gebru, D. Marković, A. Richard, S. Krenn, G. A. Butler, F. De la Torre, and Y. Sheikh, “Implicit HRTF modeling using temporal convolutional networks,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2021, pp. 3385–3389. | 
| [20] | Y. Wang, Y. Zhang, Z. Duan, and M. Bocko, “Predicting global head-related transfer functions from scanned head geometry using deep learning and compact representations,” arXiv preprint arXiv: 2207.14352, 2022. | 
| [21] | P. Siripornpitak, I. Engel, I. Squires, S. J. Cooper, and L. Picinali, “Spatial up-sampling of HRTF sets using generative adversarial networks: A pilot study,” Frontiers in Signal Processing, vol. 2, pp. 1–10, 2022. | 
| [22] | G. W. Lee, H. K. Kim, C. J. Chun, and K. M. Jeon, “Personalized HRTF estimation based on one-to-many neural network architecture,” in Audio Engineering Society Convention 153, 2022. | 
| [23] | C. Wang, Z. Wang, L. Ma, H. Dong, and W. Sheng, “Subdomain-alignment data augmentation for pipeline fault diagnosis: An adversarial self-attention network,” IEEE Trans. Industrial Informatics, pp. 1–11, 2023. | 
| [24] | C. Wang, H. Dong, Z. Wang, L. Ma, ang W. Sheng, “A novel contrastive adversarial network for minor-class data augmentation: Applications to pipeline fault diagnosis,” Knowledge-Based Systems, vol. 271, p. 110516, 2023. doi:  10.1016/j.knosys.2023.110516 | 
| [25] | E. Sejdić, I. Djurović, and J. Jiang, “Time-frequency feature representation using energy concentration: An overview of recent advances,” Digital Signal Processing, vol. 19, no. 1, pp. 153–183, 2009. doi:  10.1016/j.dsp.2007.12.004 | 
| [26] | Q. Kong, Y. Cao, H. Liu, K. Choi, and Y. Wang, “Decoupling magnitude and phase estimation with deep resunet for music source separation.” in Proc. Int. Society for Music Information Retrieval Conf., 2021. | 
| [27] | A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32, 2019, pp. 8024–8035. | 
| [28] | Y. Luo and J. Yu, “Music source separation with band-split RNN,” arXiv preprint arXiv: 2209.15174, 2022. | 
| [29] | A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017. | 
| [30] | C. Armstrong, L. Thresh, D. Murphy, and G. Kearney, “A perceptual evaluation of individual and non-individual HRTFs: A case study of the SADIE II database,” Applied Sciences, vol. 8, no. 11, p. 2029, 2018. doi:  10.3390/app8112029 | 
| [31] | D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learning Representations, 2015. | 
| [32] | P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: Training imagenet in 1 hour,” arXiv preprint arXiv: 1706.02677, 2018. | 
| [33] | E. Vincent, R. Gribonval, and C. Fevotte, “Performance measurement in blind audio source separation,” IEEE Trans. Audio,Speech,and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006. doi:  10.1109/TSA.2005.858005 | 
| [34] | Z. Ren, K. Qian, Z. Zhang, V. Pandit, A. Baird, and B. Schuller, “Deep scalogram representations for acoustic scene classification,” IEEE/CAA J. Autom. Sinica, vol. 5, no. 3, pp. 662–669, 2018. doi:  10.1109/JAS.2018.7511066 | 
| [35] | Y. Zhang, D. Li, X. Shi, D. He, K. Song, X. Wang, H. Qin, and H. Li, “KBNet: Kernel basis network for image restoration,” arXiv preprint arXiv: 2303.02881, 2023. | 
| [36] | S. Zhao and B. Ma, “MossFormer: Pushing the performance limit of monaural speech separation using gated single-head Transformer with convolution-augmented joint self-attentions,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2023. | 
| [37] | G. W. Lee and H. K. Kim, “Personalized HRTF modeling based on deep neural network using anthropometric measurements and images of the ear,” Applied Sciences, vol. 8, no. 11, p. 2180, 2018. doi:  10.3390/app8112180 | 
| [38] | R. Miccini and S. Spagnol, “HRTF individualization using deep learning,” in Proc. IEEE Conf. Virtual Reality and 3D User Interfaces Abstracts and Workshops, 2020, pp. 390–395. |