A CC-By Open-Source TTS Model with Voice Cloning (huggingface.co)
131 points by amrrs on Nov 9, 2024 | 31 comments


BTW I was really impressed by the results of F5-TTS. The thing I liked best was the "Tagged" TTS, where you can specify a tag to use different tones of your own voice, like

  {Angry}What have you done?
  {Surprised}Me, I did nothing?
  {Shouting}Who else do you think I'm talking to?
  {Sad}Why are you always shouting at me?
I wonder if this would also work for "Character" tags, like

  {Susan}How was your day?
  {Peter}I had a great day.
That would open up great new ways of having audiobooks read by cloned voices - switching between characters with the same voice, as real narrators often do.
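A minimal sketch of how such a tagged script could be split into segments before synthesis. This is not F5-TTS's own parser; the function name, the regular expression, and the "default" fallback tag are assumptions made for illustration. Each resulting tag would then be mapped to a reference clip (an emotion or a character) by whatever TTS tool is used.

```python
import re

# Matches a leading {Tag} followed by the text to speak (assumed syntax).
TAG_LINE = re.compile(r"^\s*\{(?P<tag>[^{}]+)\}\s*(?P<text>.*)$")


def parse_tagged_script(script, default_tag="default"):
    """Split a tagged script into (tag, text) pairs, one per non-empty line.

    Lines without a {Tag} prefix fall back to `default_tag` (hypothetical name).
    """
    segments = []
    for line in script.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = TAG_LINE.match(line)
        if match:
            segments.append((match.group("tag").strip(), match.group("text").strip()))
        else:
            segments.append((default_tag, line.strip()))
    return segments


script = """
{Susan}How was your day?
{Peter}I had a great day.
"""
segments = parse_tagged_script(script)
for tag, text in segments:
    print(f"{tag}: {text}")
```

Because the parser only looks at the tag name, emotion tags and character tags are handled identically; whether the result sounds convincing depends on the reference audio behind each tag.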


This feature also greatly interests me, although I'm looking for a system that would let me slightly alter the pronunciation of individual words. Is anyone aware of such a system?

Especially with TTS in a language other than English (but also with English), the pronunciation of certain words is sometimes jarringly wrong. Until TTS systems can compensate for this themselves, it would be great if humans could use such tags to hint at a better pronunciation. Even if you couldn't specify the exact correction and the TTS just generated a 'different' sound, that could help.


Are you not looking for SSML with IPA phoneme tags? I think you might be. It's part of all your standard OS TTS engines, including espeak-ng on Linux. It's also in Google Cloud, Azure, Watson, and Amazon Polly voices.
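For illustration, a small sketch that builds an SSML snippet using the standard `<phoneme alphabet="ipa" ph="...">` element. The helper name `ssml_with_ipa` and the example sentence are assumptions; how faithfully a given engine honors the tag varies, so treat this as a starting point rather than a guaranteed fix.

```python
from xml.sax.saxutils import escape, quoteattr


def ssml_with_ipa(before, word, ipa, after=""):
    """Return an SSML string in which `word` carries an explicit IPA pronunciation.

    `ssml_with_ipa` is a hypothetical helper, not part of any TTS library.
    """
    # quoteattr adds the surrounding quotes and escapes the attribute value.
    phoneme = f"<phoneme alphabet=\"ipa\" ph={quoteattr(ipa)}>{escape(word)}</phoneme>"
    return f"<speak>{escape(before)}{phoneme}{escape(after)}</speak>"


ssml = ssml_with_ipa("You say ", "tomato", "təˈmeɪtoʊ", ".")
print(ssml)
```

The resulting string can be passed to any engine that accepts SSML input in place of plain text.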


I didn't know it existed... Thank you very much


Features like artificial breathing, slightly different pronunciation and other "features" are only available in commercial systems... unfortunately I don't remember the name or the video I saw about these, because I'm not interested in non-FOSS stuff for my personal projects.


IMHO this should work (in English or Chinese). Here I show how it sounds with different tags (in this case emotions and not characters): https://youtu.be/ASFoTNpkM8o?t=27

Here's how it's done: https://youtu.be/ASFoTNpkM8o?t=992


Hey, thorsten-voice himself. Thank you for your contribution to the community. I'm a happy follower of your content.

Can't wait for F5-TTS to support the German language. Do you know whether this is planned in the near future?


You're very welcome. There is an active discussion on the F5 GitHub repo (I'm involved too) about supporting other languages, including German: https://github.com/SWivid/F5-TTS/issues/87#issuecomment-2418...


Good-quality, easy-to-use open TTS models are hard to find. SpeechT5, while a bit old, made it relatively easy to clone voices using the Transformers library.

I've also found a couple of the ESPNet TTS models are decent. I've exported those models to ONNX to make them easier to use.

For what it's worth, here is a list of models that cover what I've worked on in the "Open models" TTS space.

https://huggingface.co/collections/NeuML/text-to-speech-tts-...


From a quick try results aren't good. Sounds bland, and the text I type isn't exactly equal to the text that is spoken. Didn't try with voice cloning though.

Why is good TTS so expensive and why are there no good open source options? Is it just from the need for high quality training data? I don't imagine these models are more expensive to run compared to SOTA LLMs, yet they cost so much more.


From what I'm seeing, most of the open source TTS models are trained on the same few voices, mostly at 16 kHz, mostly from LibriVox books I think.

Eleven Labs is most likely trained on stolen audiobooks. They published a few YouTube videos in Polish, now taken down, of AI renditions of famous Polish audiobook narrators. This was all before they became popular, and before their voice cloning models were publicly available, I think.


> mostly from Librivox books

That probably explains a lot. I've tried listening to some of those audiobooks - very hit and miss, mostly miss. Definitely amateur hour and mostly bad quality.


I had pretty good results with coqui-tts and a VITS model that I trained myself, first with an open dataset and later with one I extracted from audiobooks / epub and therefore can't publish (German).

The dataset and video tutorials are all available and linked here (also in English):

https://www.thorsten-voice.de/en/motivation-vision/


Thanks for mentioning my Thorsten-Voice project, dear sandreas :)


You're very welcome.


a few weeks ago i used piper to create an acceptable translation of a book. i didn't listen to it all, but the result sounded better than anything i was able to listen to before. good enough to listen to a book if a human-read one is not available. just a few years ago, this was not the case.

in other words, while FOSS TTS lags behind commercial options, it does get better and i expect within a few years it will produce results that are at least as good as the commercial options today if not fully caught up.


Piper seems roughly equivalent to old-school concatenative TTS output that sounds flat and jumpy. Listen to this first example I tried:

https://rhasspy.github.io/piper-samples/samples/en/en_GB/ala...

Of all the TTS APIs I have tried, I like the OpenAI voices best. I haven't considered things like ElevenLabs because I find them ridiculously expensive.

I love voice to voice interfaces, but only when they sound natural to my ears, and the current pricing for good ones is prohibitive for a huge number of use cases.


well, i was comparing it to the free tools available a few years ago, and against that, this example is a marked improvement. it's the first that i could actually bear to listen to over a longer period of time. i expect just another few years and this will actually be good.


There are a lot of options. StyleTTS2 is pretty good, XTTSv2 is pretty good, the new E2 TTS and F5 TTS also seem decent.


A commercial-grade, high-quality training dataset is the key. Open source libraries don't get the luxury of working with voice actors to record voices.


Would it be hard to create such a training dataset? Seems like you’d just need a lot of people to say a bunch of stuff for you?


needs a crowdsourced model


Ideally, Mozilla would step up here given their mission statement, but they won't, probably because their CEO needs another bonus.


Yeah there's no chance Mozilla would do anything like this:

https://commonvoice.mozilla.org/


That's the first thing I thought of! I wonder how widely used it is. Are there any sources or data points indicating that this Common Voice data is being used, and if so, where/how? I think I may have contributed to it a few times years ago. Nice to see it's still going; it would be better to know it's being used.


It was used quite a bit for speech-to-text, but for TTS it's not that great.


It costs a million dollars a year to host 32k hours of audio?
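For a sense of scale, here is a rough back-of-envelope calculation of the raw size of 32k hours of audio. The format (uncompressed 16 kHz, 16-bit mono PCM) is an assumption, since the thread doesn't state the actual storage format, and the function name is made up for illustration. Only raw storage is estimated; bandwidth, tooling, and staff costs are not modeled.

```python
def audio_storage_tb(hours, sample_rate_hz=16_000, bytes_per_sample=2, channels=1):
    """Uncompressed PCM size in terabytes (1 TB = 10**12 bytes).

    Defaults are assumptions: 16 kHz, 16-bit, mono.
    """
    total_bytes = hours * 3600 * sample_rate_hz * bytes_per_sample * channels
    return total_bytes / 1e12


size_tb = audio_storage_tb(32_000)
print(f"{size_tb:.2f} TB")  # prints "3.69 TB"
```

Under these assumptions the raw audio fits in a few terabytes, and compressed formats would shrink it further.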


Have you tried VoiceCraft?


Yeah, all of these seem hyper-focused on "voice cloning". On Replicate, VoiceCraft doesn't even let you try normal TTS unless you provide a reference voice, so I noped out.


I’ve had great luck so far with GPT-SoVITS. With a custom trained Japanese model and clean reference audio the quality is outstanding. It is quite finicky to set up and use though.

https://github.com/RVC-Boss/GPT-SoVITS


I have been having fun with this as well:

https://github.com/neonbjb/tortoise-tts

It supports voice cloning, but I am indeed having trouble getting the Docker container working, and the command line docs are not perfect:

https://github.com/neonbjb/tortoise-tts/blob/1e061bc6752f05b...



