The human voice is a carrier of emotion; it helps co-regulate our nervous system, and it is extremely rich in signals. It is known in modern trauma therapy, for example, that people who are emotionally disconnected or in a state of shock have less "prosody" in their voice: it becomes more monotonous.
In my opinion this tech is bad - and the more time we spend listening to artificial voices, the more I would bet it has a dysregulating effect on the listener's nervous system.
There is also an unhealthy trend on YouTube where creators actually voice their content, but they speak really fast and cut all the pauses. It's really stressful to listen to in my experience, and I believe also unhealthy for listeners in the long run.
It's no wonder that some creators who are just chill in their videos sometimes attract a wide audience and become almost a father-like figure - they could talk about anything - because younger people nowadays are starving for this co-regulation effect.
Like, I'm watching a certain "Dwayne" and I don't need to agree with everything he says... but the delivery is so calm and grounded, with none of that speeding-up / pause-cutting nonsense, that it genuinely helps me as I recover from trauma. It calms me down.
It's kinda unfortunate that just as modern trauma models are gaining ground on YouTube - all about the vagus nerve, fight/flight/freeze, the concept of capacity in the nervous system - you also have an increasing assault from this really dysregulating content...
I guess all I can say is that, more than ever, you have to be really aware of what you consume.
This comment is packed with speculation, and lofty predictions based on speculation.
I grew up in a family of fast talkers, at least 2 generations - predating YouTube by decades. Nature or nurture? Who knows?! Family events are lively, and we've traded stories about when people occasionally ask us to slow down.
I find listening to slow speakers a little annoying because of the lower information density per unit time. What's more important: knowing how the story ends, or the subtle inflections and dramatic pauses?
> What's more important: knowing how the story ends, or the subtle inflections and dramatic pauses?
The inflections and pauses are the story. People are often disappointed when a good story ends.
If the goal of your story is just to densely transmit information then maybe you should just print bullet points on cards and mail them instead? It would save you the time wasted traveling to events.
> If the goal of your story is just to densely transmit information then maybe you should just print bullet points on cards and mail them instead?
This is exactly the reason why I prefer emails to meetings! Half[1] the meetings I attend can be replaced by emails, preferably with bullet points as you said.
Edit: as a child comment has pointed out, we may be talking past each other: for stories those things are important, though not crucial. For professional communication, I want as little subtlety as possible.
1. Perhaps more. Meetings are huge time sinks. Few people are effective at presiding over them: unactionable rambling, repetition, demanding that people who don't need to be there attend. Interestingly, the higher-ups who invariably demand this "face time" use similar arguments to yours.
Not to interrupt a good argument too much, but you and the other poster may be equivocating between two types of communication: functional communication ("Mary, I need you to write a program...") and pleasure communication.
I agree with you that meetings are mostly a waste of time, and at the same time most people do not invest much time in communicating effectively. Because of the latter, meetings become necessary to extract information or get questions answered in a timely way, because back-and-forth exchanges occur painfully slowly over email.
Email could eliminate most meetings if people bothered to invest time in anticipating questions and preempting them. Providing clarity and insight rather than vagueness. But they don't, thus we have meetings.
99% of the time the person asking for the meeting can't even be bothered to write an agenda or give you any opportunity to prepare. At best you get vague subjects like "discuss stuff".
I'm even starting to see this laziness in search results: answers to obscure questions buried in 15-minute YouTube videos that turn out to be screen captures with middle-school-AV-club-quality title sequences, because that's easier than writing.
But my original remarks were in regards to story telling.
Yeah, the sped-up and micro-edited content is hard to miss once you start spotting it. I've definitely stopped watching certain channels just because of how grating it is.
> In my opinion this tech is bad - and the more time we spend listening to artificial voices, the more I would bet it has a dysregulating effect on the listener's nervous system
Is there any evidence of this?
I feel like I myself have become 'dysregulated' at maintaining rapport in face-to-face conversations over the years. As in, I now feel I don't know what to do with my gaze during a face-to-face conversation, especially with new acquaintances. I don't feel particularly introverted, just awkward and unsure what to do, when it was fairly effortless a few years back.
> It's no wonder that some creators who are just chill in their videos sometimes attract a wide audience and become almost a father-like figure - they could talk about anything - because younger people nowadays are starving for this co-regulation effect.
Then perhaps this is exactly what text2speech packages will optimize for ...
This kinda explains why I've often felt a bit stressed listening to podcasts with the trim-silence option turned on or at significantly higher speeds.
It also makes a great case for me to use Audm more which has real people reading news (usually longform) aloud. Often it's even the journalist who wrote the piece.
It's necessary for some things, though. For example, I made an app for myself where I have words/sentences spoken by text-to-speech - something like 12,000 of them, each at both a normal and a slow speaking speed. That would be very difficult to do with natural voices, as it would require a human to read all of them, not to mention how expensive that would be, when I was able to do it for free with AI.
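The two-speed batch described above can be sketched in a few lines. This is a minimal sketch assuming the offline pyttsx3 library; the rate values and file-naming scheme are my own illustration, not the original app's:

```python
def build_jobs(words, rates=(170, 100)):
    """Return (text, rate_wpm, filename) tuples: one file per word per speed."""
    jobs = []
    for word in words:
        for rate in rates:
            label = "normal" if rate == max(rates) else "slow"
            jobs.append((word, rate, f"{word}_{label}.wav"))
    return jobs

def synthesize(jobs):
    """Render each job to a WAV file with an offline TTS engine."""
    import pyttsx3  # assumed library: pip install pyttsx3
    engine = pyttsx3.init()
    for text, rate, filename in jobs:
        engine.setProperty("rate", rate)   # speaking rate in words per minute
        engine.save_to_file(text, filename)
    engine.runAndWait()                    # flush the queued utterances
```

Generating the slow variants is just a second pass with a different rate property, so 24,000 files becomes a loop rather than a recording session.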
There are some things where AI voices absolutely ruin it, but it's not always a requirement for "emotions" to be felt in the speech we're listening to.
Your mention of YouTube fast cuts matches my reaction too, but at least in those cases the videos are usually filmed around the same time by a human.
Outside of the generated glitching in the sound here, my main complaint is that sentence umpteen sounds the same as sentence one. When we speak naturally, our intonation and cadence shift over time and with the subject matter. A single sentence here sounds okay-ish, but several sentences in a row sound like they're generated discretely (which I assume they technically are), and all the cohesion is gone.
Without TTS systems a lot of online content would be completely inaccessible to blind people. Systems that sound more authentic than Microsoft Sam are a big win for everyone.
I haven't thought about it in that light before. For me, I just want the artificial to be free of any artifice.
I'm okay with robots always looking robotic and synthesized voices always sounding synthetic.
I have no problem with robots becoming more human-like in their dexterity and locomotion, prefer that artificial voices be intelligible. But apart from "look what we can do" I see no need for either to ever try to pass as human.
That's a really interesting view and I'll have to look into it a bit more. I used to be a professional sound recordist and was really fascinated by the craft that actors and voiceover artists put into their speech. The intonation, emotion and pace are really important aspects of their work. I also learned that recording the human voice with all of its range and subtleties was harder than I thought.
I agree, the jump cuts in a lot of videos can be exhausting.
Interesting point you raise. I enjoy listening to Sovietwomble on Twitch, he just speaks relaxed like a radio host (he often mentions this as his inspiration) and he verbalizes what he is doing or thinking.
Does anyone know how the business model can work for such a product?
I would expect that anyone working on scripts with voice-overs professionally would want to use their favorite video/audio editor. That means from a user perspective, an "AI Voice VST/AAX plugin" is strictly superior to whatever cloud GUI anybody builds. (EDIT: Also, running AI as a SaaS means murf.ai needs to pay for pricey datacenter GPUs. Any user-downloadable software will have much lower operating costs.)
And the big elephant in the room with speech AI is that it's so easy to copy the tech. Just like Stable Diffusion did with images, TTS developers just train on public audio from the internet, so there is no dataset moat. And arXiv is full of papers that produce pretty good results if implemented correctly. And NVIDIA has a collection of freely downloadable TTS models with good/usable quality. To me, it seems like only a matter of time until someone builds a high-quality open source TTS VST plugin, and then all those SaaS offerings are basically worthless.
In effect, what I'm asking is: What is the competitive moat here? How can murf.ai defend against a motivated high school kid with $100k in EC2 credits?
For the segment of mom-and-pop stores who need an explainer video or Facebook ad made in Canva and don't want to pay someone to record, they want ease of use, realism, and editability/speed.
My friend who runs a Shopify store asked for this. They are not going to fiddle with VST plugins or local/cloud GPUs.
Aren't they better off hiring cheap on Fiverr for someone else to do the entire video? The traditional reason against this was that you'd want your narrator to sound like a native speaker. But if AI fixes that, is there any downside to outsourcing video voice-overs to cheap labor countries?
How is that better? The AI should be cheaper and less hassle (no creating a job posting, reviewing freelancers, negotiating), with less risk of poor quality, reworks, and disputes - and yes, accent is a big one.
The ideal TTS product for such a person would be something like: sign up and pay > choose voice > paste text > download audio
I am not even joking when I say that the most likely killer use for this will be YouTube/TikTok voice-overs for non-native American English speakers. There is a lot of great content on YouTube, for example, that could easily be monetized in the "rich" Western countries but can be difficult to follow due to the different ways people speak the same language.
This assumes that we eventually get over the uncanny valley that we are all sensitive to when it comes to voices.
Yes, but are non-native TikTok- and YouTube-ers a demographic that would pay a monthly SaaS rent? Or would they rather go with an Open Source solution? All of them are using OBS (GPL2 I think) already ;)
> How can murf.ai defend against a motivated high school kid with $100k in EC2 credits?
Think I read somewhere you can retrain tacotron II on a new voice for something like $6 on google colab, been wanting to try it with the ScotRail voice recording dump they did a while back (just because) but haven’t gotten around to it yet.
Executing a business is hard. TTS has significant processing time involved. Different users need different interfaces: Grandma needs a very helpful web interface, a tech org wants an API.
I don't understand why text to speech approaches are so common. It's really hard to specify exactly what you want with text.
It seems to me like speech-to-speech would be much better: start with your best attempt to produce the audio yourself, with the emotion, rhythm and timing you want. Then let the AI do the "last mile" transformation, taking your voice and making it sound like someone else, like how neural style transfer can change a picture to another style.
My guess would be that text-to-speech scales very well for arbitrary data, for e.g. automatic audiobook generation, speech-to-speech does not.
But yeah, fully agreed, for individual projects speech-to-speech appears to be a better idea, much more data to work with in there. Otherwise it will be a Vocaloid-like experience, where you have to tinker with the intonation of individual words.
I'm late to this, but IIRC this is kind of the tech that LTT has started using for Spanish audio. I don't know any of the nitty-gritty details, but I think they feed the English track as well as the script into the AI and get a much more natural-sounding translation out of it. For sections that don't come out "right" you can help it along by re-training just that section, etc.
The audio samples here feel lower quality to me than others I have seen from competitors, but it's been a while since I looked into text-to-speech, so I'm unable to quickly post an example of a competitor's less glitchy samples. EDIT: Here is just one example; I recall others but am unable to find them:
[1] suggests that Murf.ai received $1.5 million in seed funding in 2021, has 12 employees, and made only $78k in revenue in 2022. Even if we price their developers at only $100k annually including all benefits, that suggests they are close to being bankrupt cough close to raising the next funding round.
If you have any coding knowledge, you can get similar-quality voices for much cheaper from Azure, Google, AWS and IBM Watson. Azure gives you 500k characters for free per month, and then it's $16 per one million characters, paid per character. If you're using this to generate voice overs / videos, these rates are so low that you can basically forget about them existing.
You have to use the API, but if that's fine with you, it's definitely worth it.
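For a sense of scale at the rates quoted above (500k free characters per month, then $16 per million), a quick back-of-the-envelope calculation:

```python
def monthly_cost(characters, free_chars=500_000, rate_per_million=16.0):
    """Estimate monthly TTS cost at the per-character rates quoted above.

    The figures are the ones cited in this thread; check current pricing
    before relying on them.
    """
    billable = max(0, characters - free_chars)
    return billable / 1_000_000 * rate_per_million

# A month of short video scripts (100 videos x ~2,000 chars = 200k chars)
# stays entirely inside the free tier.
# An entire ~3-million-character audiobook costs about $40 after the free tier.
```

At these rates, per-character billing is effectively negligible for voice-over work; only book-length projects reach tens of dollars.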
You don't need coding to use Azure. Look up "Azure Audio Content Creation", which is a web-based application hosted by Microsoft that can be used to generate audio using a GUI.
There is a rather annoying process to set up an account, with resource groups etc. required, but it does give you 500k free characters. Or you can just abuse the free demo applet on the website without signing in (you may need to clear cookies and reload once in a while) and record the audio that comes out.
I had very good luck with some of the Azure voices to create a YouTube video. My favourite right now is Sara (US English), because in testing she sounds the most emotionally natural.
Interestingly, if you choose a voice from another language, and ask it to speak English, sometimes it will replicate a non-native accent, which I found somewhat amusing
I'd be curious to know how this differs from Vocode.ai, which has been around for over a year now, and has voices from Sir Mix-A-Lot to Bender from Futurama.
Cool platform! Thinking about using it for a game project I'm working on – at least for temp V/O. I was noticing that the player for voice acting doesn't play reliably on the search page – you have to go to the profile first. Might be a quick fix.
My company has hired some people for TTS voice training on Upwork. About 90% of the voice actors resented the implication that someone else could make their voice say things they disagree with. But some of them also found the idea of becoming digitally immortal very attractive.
The same way some people like to put up a marble statue of their heroic deeds, others like to record themselves for the internet. In my opinion, both types of people want to avoid being forgotten and surely if you become a famous TTS voice, you'll have a Wikipedia entry...
AIs are never going to get the tone and emotional context right. TTS from Google and Samsung running locally on a phone is already listenable so I don't think minorly better AI is going to eat the audiobook market if it hasn't already.
Most actors have no ear for what the context requires. They randomize their performance until they hit something the director likes. So whether AIs will be able to replace actors or directors are two different questions.
Yea, that’s a pretty good analogy. A singing dog is impressive but not able to replace human singers. From what I’ve seen so far, AI tools create technically impressive but generic and derivative works, and on their own, can’t do what a human artist does in terms of understanding the context of what they are requested to do.
It’s possible that doesn’t matter to most people, and the art world will have to realise that mass-produced schlock is all the public really wants. We’ll see.
There's also the possibility they'll get better, possibly much better than humans. Given how much they've improved recently, that's a very very big possibility.
This "art is just a job we need done" take comes right after a comment about how voice acting is a uniquely human thing that AIs will never be able to do, and I'm finding the disconnect interesting.
Well, if AI is worse at art than us now, it means that we currently have a quality metric, otherwise "worse" means nothing. For an AI to get better than humans, it means that the humans are now the group that's worse at art.
I am impressed with the technical level of the various AI tools, absolutely. I just think they are learning the surface of art - reproduction of reality and stylisation thereof - but not the point of art, which is orthogonal to the technical skill of an artist.
I've heard these voices a few times on youtube recently.
I close those videos within seconds of recognising that the voice is synthetic.
I'm not sure why my reaction is so strongly negative (I don't have this for GPT or SD). My first thought was "Infinite free generation means infinite A/B testing, and I don't want to be part of that", but that should exclude those other AI also.
I'd love to use this for my procedural music experiments. I wonder if the EULA has any issue with that. It'd be awesome if the pitch and tempo were mapped to musical pitch and tempo, i.e. pitch A440 Hz and 60 BPM. I just tested text like "one, two, three, four" and it looks like you could manually map it to pitch and BPM in a DAW.
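Mapping spoken output onto musical pitch and tempo is mostly ratio arithmetic, which a DAW's pitch-shift and time-stretch tools consume directly. A small sketch - the A440 and 60 BPM reference values come from the comment above; the function names are my own:

```python
import math

def pitch_shift_factor(target_hz, source_hz):
    """Playback-rate multiple that moves source_hz to target_hz."""
    return target_hz / source_hz

def semitones_between(f1, f2):
    """Signed distance from f1 to f2 in equal-tempered semitones."""
    return 12 * math.log2(f2 / f1)

def stretch_to_bpm(target_bpm, source_bpm=60):
    """Time-stretch ratio to fit speech recorded at source_bpm to a new click."""
    return source_bpm / target_bpm
```

So doubling a frequency is exactly +12 semitones, and fitting a 60 BPM count-in to a 120 BPM track means playing it back in half the time.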
YMMV, but I recently picked up Synthesizer V with the Natalie voice—I think it's pretty incredible for a singing synth. You could potentially procgen out a source file (IIRC, it's plaintext), have Synthesizer V render it, and thereby skip the autotune/beat matching.
I really liked the way they've implemented their user interface and interactions, and the overall user experience too to a large extent, though I wish the actual TTS felt faster and more responsive.
As for its core functionality, sounded good enough for my modest needs.
I tried it with several paragraphs of text. The many options offered—various voices with adjustable pitch, speed, and pauses, and customizable pronunciations for specific words and names—are attractive, and I can imagine a lot of potential uses.
Like other voice synthesis software, though, it does not seem able to adjust the pauses and intonation to indicate emphasis and contrast the way a skilled human narrator does. I wonder if that will be coming as the AI becomes more meaning-aware.
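Manual control over pauses and emphasis is what SSML markup is for, and most cloud TTS engines accept some subset of it (tag support varies by engine, and whether this product exposes SSML is unknown to me). A minimal sketch that builds such markup:

```python
def ssml(parts):
    """Build a minimal SSML document from (text, options) pairs.

    Supported options in this toy: 'emphasis' (bool) and 'pause_ms' (int).
    Real engines support many more prosody controls.
    """
    body = []
    for text, opts in parts:
        if opts.get("emphasis"):
            text = f'<emphasis level="strong">{text}</emphasis>'
        body.append(text)
        if "pause_ms" in opts:
            body.append(f'<break time="{opts["pause_ms"]}ms"/>')
    return "<speak>" + " ".join(body) + "</speak>"

# Mark a contrastive stress by hand, since the engine won't infer it:
markup = ssml([
    ("It was not the money,", {"pause_ms": 400}),
    ("it was the principle", {"emphasis": True}),
])
```

That kind of hand-tagging is exactly what a meaning-aware model would need to do automatically to match a skilled narrator.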
This is trending into the uncanny valley where instead of sounding like a really good TTS system it sounds like an absent-minded half-illiterate cretin reading a script. Not sure that's a step in the right direction.
Isn't it? You have to enter the uncanny valley before you cross it. The fact that this is in it means that it no longer triggers our "this isn't a person" response, and instead triggers our "this is a lazy person" one.
I already use TTS to listen to e-pulp fiction books, which is something a lot of people wouldn't put up with, but I could easily listen to books narrated by any of those sample voices, especially if it switched between distinct styles for each character without changing the voice. But that would still probably require human work to tag the dialogue with the right characters.
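The dialogue-tagging step could start from a naive split on quotation marks, with a human (or a larger text model) assigning speakers afterwards. A toy sketch of my own; real prose with nested quotes, single quotes, or unattributed lines would need far more:

```python
import re

def split_dialogue(paragraph):
    """Split text into ('narration' | 'dialogue', span) pairs on double quotes."""
    spans = []
    # re.split with a capture group keeps the quoted text at odd indices
    for i, chunk in enumerate(re.split(r'"([^"]*)"', paragraph)):
        if not chunk.strip():
            continue
        kind = "dialogue" if i % 2 else "narration"
        spans.append((kind, chunk.strip()))
    return spans
```

Each dialogue span could then be routed to a per-character speaking style while narration keeps the base voice.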
You perhaps joke, but I suspect the combination of large text models to work out the subtext, and still fairly simple TTS models to render audio with a variety of emotional tones, is going to be very powerful in the future.
The music accompanying most of the samples is so loud you can barely hear the voices. This makes it difficult to get a good idea of how life-like the voices sound.