A HMM-based Mandarin Chinese Singing Voice Synthesis System

Xian Li; Zengfu Wang

Volume 3 Issue 2

Apr. 2016

IEEE/CAA Journal of Automatica Sinica

Xian Li and Zengfu Wang, "A HMM-based Mandarin Chinese Singing Voice Synthesis System," IEEE/CAA J. of Autom. Sinica, vol. 3, no. 2, pp. 192-202, 2016.

Xian Li¹,
Zengfu Wang²

1. Department of Automation, University of Science and Technology of China;
2. Institute of Intelligent Machines, Chinese Academy of Sciences

Abstract

We propose a Mandarin Chinese singing voice synthesis system based on the hidden Markov model (HMM) speech synthesis technique. A Mandarin Chinese singing voice corpus is recorded, and musical contextual features are designed for training. The F0 and the spectrum of the singing voice are modeled simultaneously with context-dependent HMMs. Singing voice raises a new problem: the F0 data are always sparse because of the large number of contexts, e.g., the tempo and pitch of each note, the key, and the time signature. As a result, contexts that rarely appear in the training data cannot be modeled well. To address this problem, the difference between the F0 of the singing voice and that of the musical score (DF0) is modeled by a single Viterbi training. To overcome over-smoothing of the generated F0 contour, a syllable-level F0 model based on the discrete cosine transform (DCT) is applied, and the F0 contour is generated by integrating the two levels of statistical models. The experimental results demonstrate that the proposed system outperforms the baseline system in both objective and subjective evaluations. The proposed system generates a more natural F0 contour, and the syllable-level F0 model makes the singing voice more expressive.
Keywords

- Singing voice synthesis
- melisma
- discrete cosine transform (DCT)
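
The two F0 ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state whether DF0 is computed in Hz or in the log domain, how unvoiced frames are handled, or how many DCT coefficients are kept per syllable, so the log-domain difference, the `None` marker for unvoiced frames, and the orthonormal DCT-II below are all assumptions. The function names (`midi_to_hz`, `df0_contour`, `dct2`, `idct2`) are hypothetical and chosen only for this illustration.

```python
import math


def midi_to_hz(note):
    """Convert a MIDI note number to Hz (assumes A4 = note 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)


def df0_contour(sung_f0_hz, score_midi):
    """Per-frame difference between sung F0 and score F0.

    Assumption: the difference is taken in the log domain, and frames
    with f0 <= 0 are treated as unvoiced and returned as None.
    """
    out = []
    for f0, note in zip(sung_f0_hz, score_midi):
        if f0 <= 0:
            out.append(None)
        else:
            out.append(math.log(f0) - math.log(midi_to_hz(note)))
    return out


def dct2(x, n_coef):
    """First n_coef coefficients of the orthonormal DCT-II of x."""
    n = len(x)
    coefs = []
    for k in range(n_coef):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coefs.append(scale * s)
    return coefs


def idct2(coefs, n):
    """Rebuild an n-frame contour from (possibly truncated) DCT-II coefficients."""
    out = []
    for t in range(n):
        v = coefs[0] * math.sqrt(1.0 / n)
        for k in range(1, len(coefs)):
            v += coefs[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (t + 0.5) * k / n)
        out.append(v)
    return out


if __name__ == "__main__":
    # Hypothetical five-frame syllable sung slightly around A4 (MIDI 69).
    sung = [438.0, 441.0, 445.0, 442.0, 439.0]
    score = [69] * 5
    df0 = df0_contour(sung, score)
    # Keep three coefficients as a compact syllable-level description.
    coefs = dct2(df0, 3)
    smooth = idct2(coefs, len(df0))
    print(coefs)
    print(smooth)
```

Modeling the difference from the score rather than the raw F0 means the statistical model only has to capture how a singer deviates from the written note, which is the motivation the abstract gives for DF0. Truncating the DCT keeps only the coarse shape of each syllable's contour; the number of retained coefficients is a free choice in this sketch.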
