A CC-BY Open-Source TTS Model with Voice Cloning

sandreas · on Nov 9, 2024

BTW I was really impressed by the results of F5-TTS. The thing I liked best was the "Tagged" TTS, where you can specify a tag to use different tones of your own voice, like

  {Angry}What have you done?
  {Surprised}Me, I did nothing?
  {Shouting}Who else do you think I'm talking to?
  {Sad}Why are you always shouting at me?

I wonder if this would also work for "Character" tags, like

  {Susan}How was your day?
  {Peter}I had a great day.

That would open great new ways of having audiobooks read by cloned voices, switching between characters with the same voice, as real narrators often do.
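A minimal sketch of how such tagged lines could be split into (tag, text) pairs, so that each segment can be synthesized with a matching reference clip. This is not F5-TTS's own parser; the function name `parse_tagged_script`, the `REFERENCE_CLIPS` mapping, and the file paths are all hypothetical and only illustrate the idea.

```python
import re
from typing import List, Tuple

# A line such as "{Susan}How was your day?" has a tag in braces, then the text.
TAG_LINE = re.compile(r"^\s*\{(?P<tag>[^{}]+)\}(?P<text>.*)$")


def parse_tagged_script(script: str, default_tag: str = "Default") -> List[Tuple[str, str]]:
    """Split a tagged script into (tag, text) segments, one per non-empty line."""
    segments = []
    for line in script.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = TAG_LINE.match(line)
        if match:
            tag = match.group("tag").strip()
            text = match.group("text").strip()
        else:
            # Untagged lines fall back to a default voice or tone.
            tag, text = default_tag, line.strip()
        segments.append((tag, text))
    return segments


# Hypothetical mapping from tag to a reference recording of that voice or tone.
REFERENCE_CLIPS = {"Susan": "refs/susan.wav", "Peter": "refs/peter.wav"}

if __name__ == "__main__":
    script = """
    {Susan}How was your day?
    {Peter}I had a great day.
    """
    for tag, text in parse_tagged_script(script):
        clip = REFERENCE_CLIPS.get(tag, "refs/default.wav")
        # A real pipeline would pass `text` and `clip` to the TTS model here.
        print(f"{tag} ({clip}): {text}")
```

The same splitting works for emotion tags and character tags alike; only the mapping from tag to reference audio changes.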

throwaway89201 · on Nov 9, 2024

This feature also greatly interests me, although I'm looking for a system that would allow slightly altering the pronunciation of individual words. Is anyone aware of such a system?

Especially with TTS in a language other than English (but also with English), the pronunciation of certain words is sometimes jarringly wrong. Until TTS systems can compensate for this themselves, it would be great if it were possible for humans to use such tags to hint the system to pronounce better. Even if you can't specify the exact correction, but the TTS would just generate a 'different' sound, that could help.

willwade · on Nov 10, 2024

Are you not looking for SSML with the phoneme tag (IPA)? I think you might be. It’s part of all your standard OS TTS, including espeak-ng on Linux. Also in Google Cloud, Azure, Watson, and Amazon Polly voices.
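As a rough illustration, SSML lets you wrap a word in a `phoneme` element with `alphabet="ipa"` and a `ph` attribute holding the intended pronunciation. The sketch below builds such a document in Python; the helper name `build_ssml`, the override dictionary, and the IPA transcription are assumptions made for the example, and how faithfully a given engine honours the element is something to check in that engine's documentation.

```python
import re
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text: str, ipa_overrides: dict, lang: str = "en-US") -> str:
    """Wrap words found in ipa_overrides (lowercase keys) in SSML phoneme elements."""
    parts = []
    # Splitting on a captured group keeps both the words and the separators.
    for token in re.split(r"(\w+)", text):
        ipa = ipa_overrides.get(token.lower())
        if ipa:
            parts.append(
                f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(token)}</phoneme>'
            )
        else:
            parts.append(escape(token))  # escape &, <, > in ordinary text
    body = "".join(parts)
    return f'<speak version="1.1" xmlns="{SSML_NS}" xml:lang="{lang}">{body}</speak>'


if __name__ == "__main__":
    # Illustrative transcription only; adjust to the pronunciation you want.
    print(build_ssml("I say tomato.", {"tomato": "təˈmɑːtoʊ"}))
```

The resulting string would then be passed to whichever engine or cloud API you use in place of plain text.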

sandreas · on Nov 10, 2024

I didn't know it existed... Thank you very much

sandreas · on Nov 9, 2024

Features like artificial breathing, slightly different pronunciation and other "features" are only available in commercial systems... unfortunately I don't remember the name or the video I saw about these, because I'm not interested in non-FOSS stuff for my personal projects.

thorsten-voice · on Nov 10, 2024

IMHO this should work (in English or Chinese). Here I show how it sounds with different tags (in this case emotions, not characters): https://youtu.be/ASFoTNpkM8o?t=27

Here's how it's done: https://youtu.be/ASFoTNpkM8o?t=992

sandreas · on Nov 10, 2024

Hey, thorsten-voice himself. Thank you for your contribution to the community. I'm a happy follower of your content.

Can't wait for F5-TTS to support the German language. Do you know whether this is planned in the near future?

thorsten-voice · on Nov 12, 2024

You're very welcome. On the F5 GitHub repo there is an active discussion (I'm involved too) on supporting other languages, including German: https://github.com/SWivid/F5-TTS/issues/87#issuecomment-2418...

dmezzetti · on Nov 9, 2024

Good quality and easy-to-use open TTS models are hard to find. SpeechT5, while a bit old, made it relatively easy to clone voices using the Transformers library.

I've also found a couple of the ESPNet TTS models are decent. I've exported those models to ONNX to make them easier to use.

For what it's worth, here is a list of models that cover what I've worked on in the "Open models" TTS space.

https://huggingface.co/collections/NeuML/text-to-speech-tts-...

asaddhamani · on Nov 9, 2024

From a quick try results aren't good. Sounds bland, and the text I type isn't exactly equal to the text that is spoken. Didn't try with voice cloning though.

Why is good TTS so expensive and why are there no good open source options? Is it just from the need for high quality training data? I don't imagine these models are more expensive to run compared to SOTA LLMs, yet they cost so much more.

miki123211 · on Nov 9, 2024

From what I'm seeing, most of the open source TTS models are trained on the same few voices, mostly at 16 kHz, mostly from LibriVox books I think.

Eleven Labs is most likely trained on stolen audiobooks. They published a few YouTube videos in Polish, now taken down, with AI renditions of famous Polish audiobook narrators. This was all before they became popular, and before their voice cloning models were publicly available, I think.

generalizations · on Nov 9, 2024

> mostly from Librivox books

That probably explains a lot. I've tried listening to some of those audiobooks - very hit and miss, mostly miss. Definitely amateur hour and mostly bad quality.

sandreas · on Nov 9, 2024

I had pretty good results with coqui-tts and a VITS model that I trained myself, first on an open dataset and later on one I extracted from audiobooks / epub and therefore can't publish (German).

The dataset and video tutorials are all available and linked here (also in English):

https://www.thorsten-voice.de/en/motivation-vision/

thorsten-voice · on Nov 10, 2024

Thanks for mentioning my Thorsten-Voice project, dear sandreas :)

sandreas · on Nov 10, 2024

You're very welcome.

em-bee · on Nov 9, 2024

a few weeks ago i used piper to create an acceptable translation of a book. i didn't listen to it all, but the result sounded better than anything i was able to listen to before. good enough to listen to a book if a human read one is not available. just a few years ago, this was not the case.

in other words, while FOSS TTS lags behind commercial options, it does get better and i expect within a few years it will produce results that are at least as good as the commercial options today if not fully caught up.

asaddhamani · on Nov 9, 2024

Piper seems roughly equivalent to old-school TTS outputs that sound flat and jumpy, as with the concatenative approach. Listen to this first example I tried:

https://rhasspy.github.io/piper-samples/samples/en/en_GB/ala...

Of all the TTS APIs I have tried, I like OpenAI voices the best. Haven't considered things like elevenlabs because I find them ridiculously expensive.

I love voice to voice interfaces, but only when they sound natural to my ears, and the current pricing for good ones is prohibitive for a huge number of use cases.

em-bee · on Nov 9, 2024

well, i was comparing it to the free tools available a few years ago, and against that, this example is a marked improvement. it's the first that i could actually bear to listen to over a longer period of time. i expect just another few years and this will actually be good.

modeless · on Nov 9, 2024

There are a lot of options. StyleTTS2 is pretty good, XTTSv2 is pretty good, the new E2 TTS and F5 TTS also seem decent.

amrrs · on Nov 9, 2024

A commercially available, high-quality training dataset is the key. Open source libraries don't get the luxury of working with voice actors to record voices.

Aeolun · on Nov 9, 2024

Would it be hard to create such a training dataset? Seems like you’d just need a lot of people to say a bunch of stuff for you?

wahnfrieden · on Nov 9, 2024

needs a crowdsourced model

huggingmouth · on Nov 9, 2024

Ideally, Mozilla would step up here given their mission statement, but they won't, probably because their CEO needs another bonus.

IshKebab · on Nov 9, 2024

Yeah there's no chance Mozilla would do anything like this:

https://commonvoice.mozilla.org/

mgkimsal · on Nov 9, 2024

That's the first thing I thought of! I wonder how widely used these are. Are there any sources or data points indicating that this Common Voice data is being used, and if so, where/how? I think I may have contributed to this a few times years ago. Nice to see it's still going, would be better to know it's being used.

willwade · on Nov 10, 2024

It was used quite a bit for speech-to-text, but for TTS it’s not that great.

Aeolun · on Nov 10, 2024

It costs a million dollars a year to host 32k hours of audio?

sjnair96 · on Nov 9, 2024

Have you tried VoiceCraft?

asaddhamani · on Nov 9, 2024

Yeah, all these seem hyper-focused on "voice cloning". On Replicate, VoiceCraft doesn't even let you try normal TTS unless you provide a reference voice, so I noped out.

DrPhish · on Nov 9, 2024

I’ve had great luck so far with GPT-SoVITS. With a custom trained Japanese model and clean reference audio the quality is outstanding. It is quite finicky to set up and use though.

https://github.com/RVC-Boss/GPT-SoVITS

xrd · on Nov 9, 2024

I have been having fun with this as well:

https://github.com/neonbjb/tortoise-tts

It supports voice cloning, but I am indeed having trouble getting the Docker container working, and the command line docs are not perfect:

https://github.com/neonbjb/tortoise-tts/blob/1e061bc6752f05b...