Analysis of quantity degrees by synthesis

Meelis Mihkla

DOI: https://doi.org/10.54013/kk756a2

Keywords: quantity degrees, analysis by synthesis, speech synthesis, machine learning

The article is the first attempt to provide a fresh view of the quantity degree as a phonological category by using synthesis and machine learning. The results of the analysis are compared with recorded natural speech and with synthetic speech produced by different methods of synthesis. Self-training speech synthesizers based on hidden Markov models (HMM), deep neural networks (DNN) and recurrent neural networks (RNN) were used. The input for machine learning consisted of written Estonian texts (ca 1000 sentences) and recordings of different informants reading those texts aloud. The self-training TTS systems did not contain any Estonian-specific module or analyser. To check the precision of pronunciation, the output speech of each synthesizer was subjected to a perception test. According to the results, the general precision of pronunciation reached up to 80.6% for synthetic speech, vs. 94.4% for natural speech. A second evaluation analysed nine acoustic parameters. The comparison of synthetic and natural speech confirmed that the main acoustic cue differentiating quantity degrees is the durational ratio between syllable onsets. Among the additional characteristics, the position of the pitch peak in the stressed syllable and the difference between the pitch maxima proved to be significant parameters in both synthetic and natural speech. Synthetic speech displayed less intra-degree variability than natural speech, especially for the difference between pitch maxima and for the duration ratios of syllable onsets. While the different synthesis methods resulted in relatively small differences in pronunciation precision (77.8–81.5%), the relevant acoustic parameters varied considerably more from method to method. This indicates that, although the duration ratio of syllable onsets remains the main characteristic feature of a quantity degree, machine learning algorithms enjoy relative freedom in choosing among the additional characteristics to achieve a good enough audio result. The prevalent cause of error in the pronunciation of words with different quantity degrees was a wrong duration ratio between syllable onsets, which applied to all three quantity degrees. The remaining error-causing parameters affected only single quantity degrees. In conclusion, although the present study did not reveal any new aspects of Estonian quantity, it serves well as a pilot study indicating that analysis by synthesis is a promising method for testing various phonological categories and the phonological representation of speech.
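As an illustration only, the three acoustic parameters named above (the duration ratio, the position of the pitch peak in the stressed syllable, and the difference between pitch maxima) could be computed from segment measurements roughly as in the following Python sketch. The function names, the input layout and the units (milliseconds, hertz) are assumptions made for this example; the abstract does not specify how the parameters were actually measured or scaled.

```python
# Hypothetical helpers: names, units and input layout are assumptions,
# not the measurement procedure used in the study.

def duration_ratio(first_ms: float, second_ms: float) -> float:
    """Ratio of the first measured duration to the second one."""
    if second_ms <= 0:
        raise ValueError("second duration must be positive")
    return first_ms / second_ms


def pitch_peak_position(times_ms, f0_hz, start_ms: float, end_ms: float) -> float:
    """Relative position (0..1) of the F0 maximum inside the stressed syllable."""
    if end_ms <= start_ms:
        raise ValueError("syllable end must come after its start")
    # Index of the highest pitch sample in the syllable.
    peak_index = max(range(len(f0_hz)), key=lambda i: f0_hz[i])
    return (times_ms[peak_index] - start_ms) / (end_ms - start_ms)


def pitch_max_difference(f0_first_hz, f0_second_hz) -> float:
    """Difference between the pitch maxima of two syllables, in Hz (assumed unit)."""
    return max(f0_first_hz) - max(f0_second_hz)


if __name__ == "__main__":
    # Made-up values for a single disyllabic word, for demonstration only.
    print(duration_ratio(180.0, 90.0))
    print(pitch_peak_position([0, 50, 100], [100, 120, 110], 0.0, 200.0))
    print(pitch_max_difference([100, 120, 110], [90, 95]))
```

The same three values can be computed for a natural and a synthetic rendering of the same word and then compared per quantity degree.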

Meelis Mihkla (b. 1955), PhD, Institute of the Estonian Language, Senior Researcher (Roosikrantsi 6, 10119 Tallinn), meelis.mihkla@eki.ee
