A journal of the IEEE and the CAA that publishes high-quality papers in English on original theoretical and experimental research and development in all areas of automation
Volume 11 Issue 2
Feb.  2024

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 11.8, Top 4% (SCI Q1)
    CiteScore: 17.6, Top 3% (Q1)
    Google Scholar h5-index: 77, TOP 5
Citation: Y. Zhu, Q. Kong, J. Shi, S. Liu, X. Ye, J.-C. Wang, H. Shan, and J. Zhang, “End-to-end paired ambisonic-binaural audio rendering,” IEEE/CAA J. Autom. Sinica, vol. 11, no. 2, pp. 502–513, Feb. 2024. doi: 10.1109/JAS.2023.123969

End-to-End Paired Ambisonic-Binaural Audio Rendering

doi: 10.1109/JAS.2023.123969
Funds:  This work was supported in part by the National Natural Science Foundation of China (62176059, 62101136)
  • Abstract: Binaural rendering is of great interest to virtual reality and immersive media. Although humans naturally use their two ears to perceive the spatial information contained in sounds, binaural rendering is a challenging task for machines, since describing a sound field often requires multiple channels and even metadata about the sound sources. In addition, the perceived sound varies from person to person even in the same sound field. Previous methods generally rely on individual-specific head-related transfer function (HRTF) datasets and on optimization algorithms that act on the HRTFs. In practice, these methods have two major drawbacks. The first is a high personalization cost, since they achieve personalization by measuring HRTFs. The second is insufficient accuracy, since their optimization discards part of the information in order to retain the part that matters more to perception. It is therefore desirable to develop techniques that achieve both personalization and accuracy at low cost. To this end, we focus on the binaural rendering of ambisonics and propose 1) a channel-shared encoder and channel-compared attention integrated into neural networks and 2) a loss function that quantifies interaural level differences to handle spatial information. To verify the proposed method, we collect and release the first paired ambisonic-binaural dataset and introduce three metrics to evaluate the content accuracy and spatial accuracy of end-to-end methods. Extensive experiments on the collected dataset demonstrate the superior performance of the proposed method and the shortcomings of previous methods.
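As a purely illustrative sketch (not the authors' released code), the following PyTorch snippet shows one plausible way to define a Fourier-based interaural level difference (ILD) loss of the kind described above: the ILD is computed per time-frequency bin from STFT magnitudes of the left and right channels, and the loss is the L1 distance between predicted and reference ILD maps. The function names, STFT parameters, and the eps term are assumptions; the paper's exact definition may differ.

    import torch

    def stft_mag(x, n_fft=1024, hop=256):
        # Magnitude STFT of a batch of mono signals, shape (batch, samples).
        window = torch.hann_window(n_fft, device=x.device)
        spec = torch.stft(x, n_fft=n_fft, hop_length=hop,
                          window=window, return_complex=True)
        return spec.abs()

    def ild(binaural, eps=1e-8):
        # Interaural level difference in dB per time-frequency bin.
        # binaural: (batch, 2, samples); channel 0 = left ear, channel 1 = right ear.
        left = stft_mag(binaural[:, 0])
        right = stft_mag(binaural[:, 1])
        return 20.0 * torch.log10((left + eps) / (right + eps))

    def ild_loss(pred, target):
        # L1 distance between predicted and reference ILD maps.
        return torch.mean(torch.abs(ild(pred) - ild(target)))

    # Hypothetical usage with random tensors standing in for binaural audio.
    pred = torch.randn(4, 2, 16000)
    target = torch.randn(4, 2, 16000)
    print(ild_loss(pred, target).item())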

     




    Highlights

    • Our method directly outputs binaural audio and does not rely on HRTFs
    • We view different channels of ambisonics as different perspectives of the same sound field
    • We propose a channel-shared encoder and a channel-compared decoder to learn spatial features (a sketch follows this list)
    • We provide a Fourier-based definition of interaural level difference to quantify the loss of spatial information
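The sketch below is a minimal, hypothetical illustration of the two architectural ideas named above, not the authors' network: a single encoder whose weights are shared across all ambisonic channels, followed by multi-head attention whose tokens are the channels, so that channel features are compared directly. All layer sizes, kernel sizes, and tensor shapes are assumptions for illustration only.

    import torch
    import torch.nn as nn

    class ChannelSharedEncoder(nn.Module):
        # The same 1-D convolutional stack is reused for every ambisonic channel.
        def __init__(self, feat_dim=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(1, feat_dim, kernel_size=7, padding=3), nn.ReLU(),
                nn.Conv1d(feat_dim, feat_dim, kernel_size=7, padding=3), nn.ReLU(),
            )

        def forward(self, ambisonic):
            # ambisonic: (batch, channels, samples), e.g. 4 channels for first-order ambisonics.
            b, c, t = ambisonic.shape
            feats = self.net(ambisonic.reshape(b * c, 1, t))  # fold channels into the batch
            return feats.reshape(b, c, feats.shape[1], feats.shape[2])  # (b, c, feat, frames)

    class ChannelComparedAttention(nn.Module):
        # Attention across channels: each ambisonic channel is one token.
        def __init__(self, feat_dim=64, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)

        def forward(self, feats):
            b, c, d, f = feats.shape
            x = feats.permute(0, 3, 1, 2).reshape(b * f, c, d)  # one token per channel
            out, _ = self.attn(x, x, x)
            return out.reshape(b, f, c, d).permute(0, 2, 3, 1)  # back to (b, c, feat, frames)

    # Hypothetical usage on random first-order ambisonic input (W, Y, Z, X channels).
    foa = torch.randn(2, 4, 8000)
    feats = ChannelSharedEncoder()(foa)
    print(ChannelComparedAttention()(feats).shape)  # torch.Size([2, 4, 64, 8000])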
