The human voice is a carrier of emotion; it helps co-regulate our nervous system, and it is extremely rich in signals. It is known in modern trauma therapy, for example, that people who are emotionally disconnected or in a state of shock have less "prosody" in their voice: it becomes more monotonous.
In my opinion this tech is bad - and the more time we spend listening to artificial voices, the more I would bet it has a dysregulating effect on the listener's nervous system.
There is also an unhealthy trend on YouTube where creators actually voice their content, but they speak really fast and cut all the pauses. It's really stressful to listen to in my experience, and I believe also unhealthy for listeners in the long run.
It's no wonder that some creators who are just chill in their videos sometimes attract a wide audience and become almost a father-like figure - they could talk about anything - because younger people nowadays are starving for this co-regulation effect.
Like, I'm watching a certain "Dwayne" and I don't need to agree with everything he says... but the delivery is so calm and grounded, with none of that speeding-up / pause-cutting nonsense, that it genuinely helps me as I recover from trauma. It calms me down.
It's kinda unfortunate that just as modern trauma models are gaining ground on YouTube - all about the vagus nerve, fight/flight/freeze, the concept of capacity in the nervous system - you also have an increasing assault from this really dysregulating content...
I guess all I can say is that, more than ever, you have to be really aware of what you consume.
This comment is packed with speculation, and lofty predictions based on speculation.
I grew up in a family of fast talkers, at least 2 generations - predating YouTube by decades. Nature or nurture? Who knows?! Family events are lively, and we've traded stories about when people occasionally ask us to slow down.
I find listening to slow speakers a little annoying because of the lower information density per unit time. What's more important: knowing how the story ends, or the subtle inflections and dramatic pauses?
> What's more important: knowing how the story ends, or the subtle inflections and dramatic pauses?
The inflections and pauses are the story. People are often disappointed when a good story ends.
If the goal of your story is just to densely transmit information then maybe you should just print bullet points on cards and mail them instead? It would save you the time wasted traveling to events.
> If the goal of your story is just to densely transmit information then maybe you should just print bullet points on cards and mail them instead?
This is exactly the reason why I prefer emails to meetings! Half[1] the meetings I attend can be replaced by emails, preferably with bullet points as you said.
Edit: as a child comment has pointed out, we may be talking past each other: for stories those things are important, though not crucial. For professional communication, I want as little subtlety as possible.
1. Perhaps more. Meetings are huge time sinks. Few people are effective at presiding over them: unactionable rambling, repetition, demanding that people who don't need to be there attend. Interestingly, the higher-ups who invariably demand this "face time" use similar arguments to yours.
Not to interrupt a good argument too much, but you and the other poster may be equivocating between two types of communication: functional communication ("Mary, I need you to write a program...") and pleasure communication.
I agree with you that meetings are mostly a waste of time, and at the same time most people do not invest much time in communicating effectively. Because of the latter, meetings become necessary to extract information or get questions answered in a timely way, because back-and-forth exchanges occur painfully slowly over email.
Email could eliminate most meetings if people bothered to invest time in anticipating questions and preempting them. Providing clarity and insight rather than vagueness. But they don't, thus we have meetings.
99% of the time the person asking for the meeting can't even be bothered to write an agenda or give you any opportunity to prepare. At best you get vague subjects like "discuss stuff".
I'm even starting to see this laziness in search results: answers to obscure questions buried in 15-minute YouTube videos that turn out to be screen captures with middle-school-AV-club-quality title sequences, because that's easier than writing.
But my original remarks were in regards to story telling.
Yeah, the sped-up and micro-edited content is hard to miss once you start spotting it. I've definitely stopped watching certain channels just because of how grating it is.
> In my opinion this tech is bad - and the more time we spend listening to artificial voices, the more I would bet it has a dysregulating effect on the listener's nervous system
Is there any evidence of this?
I feel like I myself have become 'dysregulated' at maintaining rapport in face-to-face conversations over the years. As in, I now feel I don't know what to do with my gaze during a face-to-face conversation, especially with new acquaintances. I don't feel particularly introverted, just awkward and unsure what to do, when it was fairly effortless a few years back.
> It's no wonder that some creators who are just chill in their videos sometimes attract a wide audience and become almost a father-like figure - they could talk about anything - because younger people nowadays are starving for this co-regulation effect.
Then perhaps this is exactly what text2speech packages will optimize for ...
This kinda explains why I've often felt a bit stressed listening to podcasts with the trim-silence option turned on or at significantly higher speeds.
It also makes a great case for me to use Audm more which has real people reading news (usually longform) aloud. Often it's even the journalist who wrote the piece.
It's necessary for some things, though. For example, I made an app for myself where I have words/sentences spoken by text-to-speech - something like 12,000 of them, each at both a normal and a slow speaking speed. That would be very difficult to do with natural voices, as it would require a human to read all of them, not to mention how expensive that would be, when I was able to do it for free with AI.
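The two-speed batch described above can be sketched in a few lines. This is a minimal sketch assuming the offline pyttsx3 library; the rate values and file-naming scheme are my own illustration, not the original app's:

```python
def build_jobs(words, rates=(170, 100)):
    """Return (text, rate_wpm, filename) tuples: one file per word per speed."""
    jobs = []
    for word in words:
        for rate in rates:
            label = "normal" if rate == max(rates) else "slow"
            jobs.append((word, rate, f"{word}_{label}.wav"))
    return jobs

def synthesize(jobs):
    """Render each job to a WAV file with an offline TTS engine."""
    import pyttsx3  # assumed library: pip install pyttsx3
    engine = pyttsx3.init()
    for text, rate, filename in jobs:
        engine.setProperty("rate", rate)   # speaking rate in words per minute
        engine.save_to_file(text, filename)
    engine.runAndWait()                    # flush the queued utterances
```

Generating the slow variants is just a second pass with a different rate property, so 24,000 files becomes a loop rather than a recording session.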
There are some things where AI voices absolutely ruin it, but it's not always a requirement for "emotions" to be felt in the speech we're listening to.
Your mention of YouTube fast cuts matches my reaction too, but at least in those cases the videos are usually filmed around the same time by a human.
Outside of the generated glitching in the sound here, my main complaint is that sentence umpteen sounds the same as sentence one. When we speak naturally, our intonation and cadence shift over time and with the subject matter. A single sentence here sounds okay-ish, but several sentences in a row sound like they're generated discretely (which I assume they technically are), and all the cohesion is gone.
Without TTS systems a lot of online content would be completely inaccessible to blind people. Systems that sound more authentic than Microsoft Sam are a big win for everyone.
I haven't thought about it in that light before. For me, I just want the artificial to be free of any artifice.
I'm okay with robots always looking robotic and synthesized voices always sounding synthetic.
I have no problem with robots becoming more human-like in their dexterity and locomotion, prefer that artificial voices be intelligible. But apart from "look what we can do" I see no need for either to ever try to pass as human.
That's a really interesting view and I'll have to look into it a bit more. I used to be a professional sound recordist and was really fascinated by the craft that actors and voiceover artists put into their speech. The intonation, emotion and pace are really important aspects of their work. I also learned that recording the human voice with all of its range and subtleties was harder than I thought.
I agree, the jump cuts in a lot of videos can be exhausting.
Interesting point you raise. I enjoy listening to Sovietwomble on Twitch, he just speaks relaxed like a radio host (he often mentions this as his inspiration) and he verbalizes what he is doing or thinking.
Does anyone know how the business model can work for such a product?
I would expect that anyone working on scripts with voice-overs professionally would want to use their favorite video/audio editor. That means from a user perspective, an "AI Voice VST/AAX plugin" is strictly superior to whatever cloud GUI anybody builds. (EDIT: Also, running AI as a SaaS means murf.ai needs to pay for pricey datacenter GPUs. Any user-downloadable software will have much lower operating costs.)
And the big elephant in the room with speech AI is that it's so easy to copy the tech. Just like Stable Diffusion did with images, TTS developers just train on public audio from the internet, so there is no dataset moat. And arXiv is full of papers that produce pretty good results if implemented correctly. And NVIDIA has a collection of freely downloadable TTS models with good/usable quality. To me, it seems like only a matter of time until someone builds a high-quality open source TTS VST plugin, and then all those SaaS offerings are basically worthless.
In effect, what I'm asking is: What is the competitive moat here? How can murf.ai defend against a motivated high school kid with $100k in EC2 credits?
For the segment of mom-and-pop stores who need an explainer video or Facebook ad made in Canva and don't want to pay someone to record, they want ease of use, realism, and editability/speed.
My friend who runs a Shopify store asked for this. They are not going to fiddle with VST plugins or local/cloud GPUs.
Aren't they better off hiring cheap on Fiverr for someone else to do the entire video? The traditional reason against this was that you'd want your narrator to sound like a native speaker. But if AI fixes that, is there any downside to outsourcing video voice-overs to cheap labor countries?
How is that better? The AI should be cheaper and less hassle (no creating a job posting, reviewing freelancers, negotiating), with less risk of poor quality, reworks, and disputes - and yes, accent is a big one.
The ideal TTS product for such a person would be something like: sign up and pay > choose voice > paste text > download audio
I am not even joking when I say that the most likely killer use for this will be YouTube/TikTok voice-overs for non-native American English speakers. There is a lot of great content on YouTube, for example, that could easily be monetized in the "rich" Western countries but can be difficult to follow due to the different ways people speak the same language.
This assumes that we eventually get over the uncanny valley that we are all sensitive to when it comes to voices.
Yes, but are non-native TikTok- and YouTube-ers a demographic that would pay a monthly SaaS rent? Or would they rather go with an Open Source solution? All of them are using OBS (GPL2 I think) already ;)
> How can murf.ai defend against a motivated high school kid with $100k in EC2 credits?
Think I read somewhere you can retrain tacotron II on a new voice for something like $6 on google colab, been wanting to try it with the ScotRail voice recording dump they did a while back (just because) but haven’t gotten around to it yet.
Executing a business is hard. TTS has significant processing time involved. Different users need different interfaces: Grandma needs a very helpful web interface, a tech org wants an API.
I don't understand why text to speech approaches are so common. It's really hard to specify exactly what you want with text.
It seems to me like speech-to-speech would be much better: start with your best attempt to produce the audio yourself, with the emotion, rhythm and timing you want. Then let the AI do the "last mile" transformation, taking your voice and making it sound like someone else, like how neural style transfer can change a picture to another style.
My guess would be that text-to-speech scales very well for arbitrary data, for e.g. automatic audiobook generation, speech-to-speech does not.
But yeah, fully agreed, for individual projects speech-to-speech appears to be a better idea, much more data to work with in there. Otherwise it will be a Vocaloid-like experience, where you have to tinker with the intonation of individual words.
I'm late to this, but IIRC this is kind of the tech that LTT has started using for Spanish audio. I don't know any of the nitty-gritty details, but I think they feed the English track as well as the script into the AI and get a much more natural-sounding translation out of it. For sections that don't come out "right" you can help it along by re-training just that section, etc.
The audio samples here feel lower quality to me than others I have seen from competitors, but it's been a while since I looked into text-to-speech, so I'm unable to quickly post an example of a competitor's less glitchy samples. EDIT: Here is just one example; I recall others but am unable to find them:
[1] suggests that Murf.ai received $1.5 million in seed funding in 2021, has 12 employees, and made only $78k in revenue in 2022. Even if we price their developers at only $100k annually including all benefits, that suggests they are close to being bankrupt cough close to raising the next funding round.
If you have any coding knowledge, you can get similar-quality voices for much cheaper from Azure, Google, AWS and IBM Watson. Azure gives you 500k characters for free per month, and then it's $16 per one million characters, paid per character. If you're using this to generate voice overs / videos, these rates are so low that you can basically forget about them existing.
You have to use the API, but if that's fine with you, it's definitely worth it.
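For a sense of scale at the rates quoted above (500k free characters per month, then $16 per million), a quick back-of-the-envelope calculation:

```python
def monthly_cost(characters, free_chars=500_000, rate_per_million=16.0):
    """Estimate monthly TTS cost at the per-character rates quoted above.

    The figures are the ones cited in this thread; check current pricing
    before relying on them.
    """
    billable = max(0, characters - free_chars)
    return billable / 1_000_000 * rate_per_million

# A month of short video scripts (100 videos x ~2,000 chars = 200k chars)
# stays entirely inside the free tier.
# An entire ~3-million-character audiobook costs about $40 after the free tier.
```

At these rates, per-character billing is effectively negligible for voice-over work; only book-length projects reach tens of dollars.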
You don't need coding to use Azure. Look up "Azure Audio Content Creation", which is a web-based application hosted by Microsoft that can be used to generate audio using a GUI.
There is a rather annoying process to set up an account, with resource groups etc. required, but it does give you 500k free characters. Or you can just abuse the free demo applet on the website without signing in (you may need to clear cookies and reload once in a while) and record the audio that comes out.
I had very good luck with some of the Azure voices to create a YouTube video. My favourite right now is Sara (US English), because in testing she sounds the most emotionally natural.
Interestingly, if you choose a voice from another language, and ask it to speak English, sometimes it will replicate a non-native accent, which I found somewhat amusing
I'd be curious to know how this differs from Vocode.ai, which has been around for over a year now, and has voices from Sir Mix-A-Lot to Bender from Futurama.
Cool platform! Thinking about using it for a game project I'm working on – at least for temp V/O. I was noticing that the player for voice acting doesn't play reliably on the search page – you have to go to the profile first. Might be a quick fix.
My company has hired some people for TTS voice training on Upwork. About 90% of the voice actors resented the implication that someone else could make their voice say things they disagree with. But some of them also found the idea of becoming digitally immortal very attractive.
The same way some people like to put up a marble statue of their heroic deeds, others like to record themselves for the internet. In my opinion, both types of people want to avoid being forgotten and surely if you become a famous TTS voice, you'll have a Wikipedia entry...
AIs are never going to get the tone and emotional context right. TTS from Google and Samsung running locally on a phone is already listenable so I don't think minorly better AI is going to eat the audiobook market if it hasn't already.
Most actors have no ear for what the context requires. They randomize their performance until they hit something the director likes. So whether AIs will be able to replace actors or directors are two different questions.
Yea, that’s a pretty good analogy. A singing dog is impressive but not able to replace human singers. From what I’ve seen so far, AI tools create technically impressive but generic and derivative works, and on their own, can’t do what a human artist does in terms of understanding the context of what they are requested to do.
It’s possible that doesn’t matter to most people, and the art world will have to realise that mass-produced schlock is all the public really wants. We’ll see.
There's also the possibility they'll get better, possibly much better than humans. Given how much they've improved recently, that's a very very big possibility.
This "art is just a job we need done" take comes right after a comment about how voice acting is a uniquely human thing that AIs will never be able to do, and I'm finding the disconnect interesting.
Well, if AI is worse at art than us now, it means that we currently have a quality metric, otherwise "worse" means nothing. For an AI to get better than humans, it means that the humans are now the group that's worse at art.
I am impressed with the technical level of the various AI tools, absolutely. I just think they are learning the surface of art - reproduction of reality and stylisation thereof - but not the point of art, which is orthogonal to the technical skill of an artist.
I've heard these voices a few times on youtube recently.
I close those videos within seconds of recognising that the voice is synthetic.
I'm not sure why my reaction is so strongly negative (I don't have this for GPT or SD). My first thought was "Infinite free generation means infinite A/B testing, and I don't want to be part of that", but that should exclude those other AI also.
I'd love to use this for my procedural music experiments. I wonder if the EULA has any issue with that. It'd be awesome if the pitch and tempo were mapped to musical pitch and tempo, i.e. pitch A440 Hz and 60 BPM. I just tested text like "one, two, three, four" and it looks like you could manually map it to pitch and BPM in a DAW.
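Mapping spoken output onto musical pitch and tempo is mostly ratio arithmetic, which a DAW's pitch-shift and time-stretch tools consume directly. A small sketch - the A440 and 60 BPM reference values come from the comment above; the function names are my own:

```python
import math

def pitch_shift_factor(target_hz, source_hz):
    """Playback-rate multiple that moves source_hz to target_hz."""
    return target_hz / source_hz

def semitones_between(f1, f2):
    """Signed distance from f1 to f2 in equal-tempered semitones."""
    return 12 * math.log2(f2 / f1)

def stretch_to_bpm(target_bpm, source_bpm=60):
    """Time-stretch ratio to fit speech recorded at source_bpm to a new click."""
    return source_bpm / target_bpm
```

So doubling a frequency is exactly +12 semitones, and fitting a 60 BPM count-in to a 120 BPM track means playing it back in half the time.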
YMMV, but I recently picked up Synthesizer V with the Natalie voice—I think it's pretty incredible for a singing synth. You could potentially procgen out a source file (IIRC, it's plaintext), have Synthesizer V render it, and thereby skip the autotune/beat matching.
I really liked the way they've implemented their user interface and interactions, and the overall user experience too to a large extent, though I wish the actual TTS felt faster and more responsive.
As for its core functionality, sounded good enough for my modest needs.
I tried it with several paragraphs of text. The many options offered—various voices with adjustable pitch, speed, and pauses, and customizable pronunciations for specific words and names—are attractive, and I can imagine a lot of potential uses.
Like other voice synthesis software, though, it does not seem able to adjust the pauses and intonation to indicate emphasis and contrast the way a skilled human narrator does. I wonder if that will be coming as the AI becomes more meaning-aware.
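Manual control over pauses and emphasis is what SSML markup is for, and most cloud TTS engines accept some subset of it (tag support varies by engine, and whether this product exposes SSML is unknown to me). A minimal sketch that builds such markup:

```python
def ssml(parts):
    """Build a minimal SSML document from (text, options) pairs.

    Supported options in this toy: 'emphasis' (bool) and 'pause_ms' (int).
    Real engines support many more prosody controls.
    """
    body = []
    for text, opts in parts:
        if opts.get("emphasis"):
            text = f'<emphasis level="strong">{text}</emphasis>'
        body.append(text)
        if "pause_ms" in opts:
            body.append(f'<break time="{opts["pause_ms"]}ms"/>')
    return "<speak>" + " ".join(body) + "</speak>"

# Mark a contrastive stress by hand, since the engine won't infer it:
markup = ssml([
    ("It was not the money,", {"pause_ms": 400}),
    ("it was the principle", {"emphasis": True}),
])
```

That kind of hand-tagging is exactly what a meaning-aware model would need to do automatically to match a skilled narrator.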
This is trending into the uncanny valley where instead of sounding like a really good TTS system it sounds like an absent-minded half-illiterate cretin reading a script. Not sure that's a step in the right direction.
Isn't it? You have to enter the uncanny valley before you cross it. The fact that this is in it means that it no longer triggers our "this isn't a person" response, and instead triggers our "this is a lazy person" one.
I already use TTS to listen to e-pulp fiction books, which is something a lot of people wouldn't put up with, but I could easily listen to books narrated by any of those sample voices, especially if it switched between distinct styles for each character without changing the voice. But that would still probably require human work to tag the dialogue with the right characters.
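The dialogue-tagging step could start from a naive split on quotation marks, with a human (or a larger text model) assigning speakers afterwards. A toy sketch of my own; real prose with nested quotes, single quotes, or unattributed lines would need far more:

```python
import re

def split_dialogue(paragraph):
    """Split text into ('narration' | 'dialogue', span) pairs on double quotes."""
    spans = []
    # re.split with a capture group keeps the quoted text at odd indices
    for i, chunk in enumerate(re.split(r'"([^"]*)"', paragraph)):
        if not chunk.strip():
            continue
        kind = "dialogue" if i % 2 else "narration"
        spans.append((kind, chunk.strip()))
    return spans
```

Each dialogue span could then be routed to a per-character speaking style while narration keeps the base voice.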
You perhaps joke, but I suspect the combination of large text models to work out the subtext, and still fairly simple TTS models to render audio with a variety of emotional tones, is going to be very powerful in the future.
The music accompanying most of the samples is so loud you can barely hear the voices. This makes it difficult to get a good idea of how life-like the voices sound.