BTW I was really impressed by the results of F5-TTS. The thing I liked best was the "Tagged" TTS, where you can specify a tag to use different tones of your own voice, like
{Angry}What have you done?
{Surprised}Me, I did nothing?
{Shouting}Who else do you think I'm talking to?
{Sad}Why are you always shouting at me?
I wonder if this would also work for "Character" tags, like
{Susan}How was your day?
{Peter}I had a great day.
That would open up great new ways of having audiobooks read by cloned voices: switching between characters with the same voice, as real narrators often do.
This feature also greatly interests me, although I'm looking for a system that would let me slightly alter the pronunciation of individual words. Is anyone aware of such a system?
Especially with TTS in a language other than English (but also with English), the pronunciation of certain words is sometimes jarringly wrong. Until TTS systems can compensate for this themselves, it would be great if humans could use such tags to hint the system toward a better pronunciation. Even if you can't specify the exact correction and the TTS just generates a 'different' sound, that could help.
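For what it's worth, splitting such a script into per-character segments is the easy part. Below is a minimal sketch, assuming the `{Tag}text` line format from the examples above; `parse_tagged_script` and the "Narrator" fallback are made up for illustration and are not part of F5-TTS. Each (tag, text) pair would then be handed to whatever TTS call maps the tag to a reference voice or tone.

```python
import re
from typing import List, Tuple

# Hypothetical helper, not an F5-TTS API: matches lines like "{Susan}How was your day?"
TAG_LINE = re.compile(r"^\{(?P<tag>[^{}]+)\}(?P<text>.*)$")


def parse_tagged_script(script: str, default_tag: str = "Narrator") -> List[Tuple[str, str]]:
    """Split a tagged script into (tag, text) segments, one per non-empty line."""
    segments: List[Tuple[str, str]] = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = TAG_LINE.match(line)
        if match:
            segments.append((match.group("tag").strip(), match.group("text").strip()))
        else:
            # Untagged lines fall back to an assumed default speaker.
            segments.append((default_tag, line))
    return segments


script = """{Susan}How was your day?
{Peter}I had a great day."""

for tag, text in parse_tagged_script(script):
    print(f"{tag}: {text}")
```

Whether a model keeps the two characters distinct then depends on the reference audio supplied for each tag, not on the parsing.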
Are you not looking for SSML with its IPA phoneme tag? I think you might be. It’s part of all your standard OS TTS, including espeak-ng on Linux. It's also in the Google Cloud, Azure, Watson, and Amazon Polly voices.
Features like artificial breathing, slightly different pronunciation and other "features" are only available in commercial systems... unfortunately I don't remember the name or the video I saw about these, because I'm not interested in non-FOSS stuff for my personal projects.
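To make that concrete, here is a rough sketch of building such markup with the Python standard library. The `<phoneme alphabet="ipa" ph="...">` element comes from the W3C SSML specification; the `ssml_with_ipa` helper and the sample sentence are only illustrative, the bare `<speak>` root is a simplification, and how well a given engine honors the tag varies, so check its documentation.

```python
import xml.etree.ElementTree as ET


def ssml_with_ipa(before: str, word: str, ipa: str, after: str) -> str:
    """Wrap one word in an SSML <phoneme> element carrying an IPA transcription.

    Illustrative helper only; a full SSML document may also need version and
    namespace attributes on <speak>, depending on the engine.
    """
    speak = ET.Element("speak")
    speak.text = before
    phoneme = ET.SubElement(speak, "phoneme", {"alphabet": "ipa", "ph": ipa})
    phoneme.text = word   # fallback text for engines that ignore the tag
    phoneme.tail = after  # text following the tagged word
    return ET.tostring(speak, encoding="unicode")


ssml = ssml_with_ipa("I say ", "tomato", "təˈmɑːtəʊ", ", you say tomato.")
print(ssml)
```

The resulting string is what you would pass to the engine or cloud API in place of plain text.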
IMHO this should work (in English or Chinese).
Here I show how it sounds with different tags (in this case emotions, not characters): https://youtu.be/ASFoTNpkM8o?t=27
Good-quality, easy-to-use open TTS models are hard to find. SpeechT5, while a bit old, made it relatively easy to clone voices using the Transformers library.
I've also found a couple of the ESPNet TTS models are decent. I've exported those models to ONNX to make them easier to use.
For what it's worth, here is a list of models that cover what I've worked on in the "Open models" TTS space.
From a quick try, the results aren't good. It sounds bland, and the spoken text doesn't exactly match the text I typed. I didn't try voice cloning, though.
Why is good TTS so expensive and why are there no good open source options? Is it just from the need for high quality training data? I don't imagine these models are more expensive to run compared to SOTA LLMs, yet they cost so much more.
From what I'm seeing, most of the open source TTS models are trained on the same few voices, mostly at 16 kHz, mostly from LibriVox books I think.
ElevenLabs is most likely trained on stolen audiobooks: they published a few YouTube videos in Polish, now taken down, of AI renditions of famous Polish audiobook narrators. This was all before they became popular and, I think, before their voice cloning models were publicly available.
That probably explains a lot. I've tried listening to some of those audiobooks - very hit and miss, mostly miss. Definitely amateur hour and mostly bad quality.
I had pretty good results with coqui-tts and a VITS model that I trained myself, first with an open dataset and later with one I extracted from audiobooks / EPUBs, which I therefore can't publish (German).
The dataset and video tutorials are all available and linked on (also in English):
a few weeks ago i used piper to create an acceptable translation of a book. i didn't listen to it all, but the result sounded better than anything i was able to listen to before. good enough to listen to a book if a human-read one is not available. just a few years ago, this was not the case.
in other words, while FOSS TTS lags behind commercial options, it does get better and i expect within a few years it will produce results that are at least as good as the commercial options today if not fully caught up.
Of all the TTS APIs I have tried, I like the OpenAI voices best. I haven't considered things like ElevenLabs because I find them ridiculously expensive.
I love voice to voice interfaces, but only when they sound natural to my ears, and the current pricing for good ones is prohibitive for a huge number of use cases.
well, i was comparing it to the free tools available a few years ago, and against that, this example is a marked improvement. it's the first that i could actually bear to listen to over a longer period of time. i expect just another few years and this will actually be good.
A commercially available, high-quality training dataset is the key. Open source libraries don't get the luxury of working with voice actors to record voices.
That's the first thing I thought of! I wonder how widely these are used. Are there any sources or data points indicating that this Common Voice data is being used, and if so, where and how? I think I may have contributed to it a few times years ago. Nice to see it's still going; it would be better to know it's being used.
Yeah, all these seem hyper-focused on "voice cloning", so on Replicate, VoiceCraft doesn't even let you try normal TTS unless you provide a reference voice, so I noped out.
I’ve had great luck so far with GPT-SoVITS. With a custom trained Japanese model and clean reference audio the quality is outstanding. It is quite finicky to set up and use though.