Hacker News
Building voice agents with Nvidia open models (daily.co)
126 points by kwindla 24 days ago | 20 comments


I've been using festival under Linux.

https://manpages.ubuntu.com/manpages/trusty/man1/festival.1....

But it is quite old now and pre-dates the DL/AI era.

Does anybody know of a good modern replacement that I can "apt install"?


I used piper with a model I found online. It's _a lot_ better than festival, afaik. I'm not sure you can apt install it, though.

    echo "hello" | piper --model ~/.local/share/piper/en_US-lessac-medium.onnx --output_file - | aplay
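That one-liner generalizes into a small shell function; a sketch assuming the same model path and that `piper` and `aplay` are on PATH:

```shell
# Wrap the piper pipeline in a reusable function (same model path as above).
say() {
    echo "$*" | piper \
        --model "$HOME/.local/share/piper/en_US-lessac-medium.onnx" \
        --output_file - | aplay
}
# usage: say "hello world"
```

Swap `aplay` for `--output_file out.wav` if you'd rather keep the audio than play it.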


You can in fact apt install piper.


That's a different piper.

    piper - GTK application to configure gaming devices


^piper-tts exists.
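If the comment is right about the package name, the two can be told apart before installing; a sketch (availability depends on your distro release):

```shell
# "piper" is the GTK gaming-device configurator; the TTS engine is
# packaged separately as "piper-tts" (per the comment above).
apt-cache show piper-tts   # check it exists in your release first
sudo apt install piper-tts
```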

Do any of the top models let you pause and think while speaking? I have to speak non-stop to the Gemini assistant and ChatGPT, which makes voice mode feel very unnatural. Probably especially for non-English speakers; I sometimes need extra time to translate my thoughts into English.


Have you tried talking to ChatGPT in your native tongue? I was blown away by my mother speaking her native tongue to ChatGPT and having it respond in that language. (It's ever so slightly not a mainstream one.)


Even in my own language I can't talk without any pauses.


These have gotten good enough to really make command-by-voice interactions pleasant. I'd love to try this with Cursor - just use it fully with voice.


<pedantic>Voice recognition identifies who you are; speech recognition identifies what you say.</pedantic>

Example:

Voice recognition: arrrrrrgh! (Oh, I know that guy. He always gets irritated when someone uses the terms "speech recognition" and "voice recognition" incorrectly.)

Speech Recognition: "Why can't you guys keep it straight? It is as simple as knowing the difference between hypothesis and theory."


This is perfect for me. I just started working on the voice related stuff for my agent framework and this will be of real use. Thanks.


There's also the excellent, likewise open-source unmute.sh, which alas is also Nvidia-only at this point. https://unmute.sh/


The game show is pretty good. Have a feeling this project will consume all my attention this week, thanks for the tip.


Can't wait for this to land in MacWhisper. I like the idea of the streaming dictation especially when dictating long prompts to Claude Code.


It supports Turing T4, but not Ampere…


Any ideas on how to add Ampere support? I have a use case in mind that I would love to try on my 3090 rig


Magpie-TTS needs a kernel compiled for Ampere, but it appears to be closed source. It was compiled for the 2018 T4 and for 2025 consumer cards, but not for the 2020-2024 consumer cards.


I actually forked the repo, modified the Dockerfile and build/run scripts to target Ampere, and the whole setup runs seamlessly on my 3090: Magpie runs fine in under 3 GB of memory, ~2 GB for the Nemotron STT, and ~18 GB for Nemotron Nano 30b. Latencies are great and the turn detection works really well!

I'm going to use this setup as the base for a language-learning app for my gf :)
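For anyone attempting the same rebuild: the usual fix for missing-kernel errors on Ampere is including the sm_86 compute capability when the CUDA kernels are compiled. A hypothetical sketch; the variable, build-arg, and image names here are illustrative, not the project's actual ones:

```shell
# Include Turing (7.5, T4) and Ampere (8.6, e.g. RTX 3090) when building CUDA kernels.
export TORCH_CUDA_ARCH_LIST="7.5;8.6"
docker build --build-arg TORCH_CUDA_ARCH_LIST="$TORCH_CUDA_ARCH_LIST" \
    -t voice-agent-ampere .
```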


I got your fork working (also on a 3090). I was not impressed with the latency or the recommended LLM’s quality.

Make sure you're using the nemotron-speech ASR model. I added support for Spanish via the Canary models, but those have like 10x the latency: ~160 ms with nemotron-speech vs ~1.5 s with Canary.

For the LLM I'm currently using Mistral-Small-3.2-24B-Instruct instead of Nemotron 3, and it works well for my use case.



