Hacker News
Building voice agents with Nvidia open models (daily.co)
126 points by kwindla 24 days ago | 20 comments


I've been using festival under Linux.

https://manpages.ubuntu.com/manpages/trusty/man1/festival.1....

But it is quite old now and pre-dates the DL/AI era.

Does anybody know of a good modern replacement that I can "apt install"?


I used piper with a model I found online. It's _a lot_ better than festival, afaik. I'm not sure you can apt install it, though.

    echo "hello" | piper --model ~/.local/share/piper/en_US-lessac-medium.onnx --output_file - | aplay
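That one-liner generalizes into a small shell function; a sketch assuming the same model path and that `piper` and `aplay` are on PATH:

```shell
# Wrap the piper pipeline in a reusable function (same model path as above).
say() {
    echo "$*" | piper \
        --model "$HOME/.local/share/piper/en_US-lessac-medium.onnx" \
        --output_file - | aplay
}
# usage: say "hello world"
```

Swap `aplay` for `--output_file out.wav` if you'd rather keep the audio than play it.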


You can in fact apt install piper.


That's a different piper.

    piper - GTK application to configure gaming devices


^piper-tts exists.
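If the comment is right about the package name, the two can be told apart before installing; a sketch (availability depends on your distro release):

```shell
# "piper" is the GTK gaming-device configurator; the TTS engine is
# packaged separately as "piper-tts" (per the comment above).
apt-cache show piper-tts   # check it exists in your release first
sudo apt install piper-tts
```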

Do any of the top models let you pause and think while speaking? I have to speak non-stop to the Gemini assistant and ChatGPT, which makes voice mode feel very unnatural. Probably especially for non-English speakers; I sometimes need extra time to translate my thoughts into English.


Have you tried talking to ChatGPT in your native tongue? I was blown away by my mother speaking her native tongue to ChatGPT and having it respond in that language. (It's ever so slightly not a mainstream one.)


Even in my own language I can't talk without any pauses.


These have gotten good enough to really make command-by-voice interactions pleasant. I'd love to try this with Cursor - just use it fully with voice.


<pedantic>Voice recognition identifies who you are; speech recognition identifies what you say.</pedantic>

Example:

Voice recognition: arrrrrrgh! (Oh, I know that guy. He always gets irritated when someone uses the terms "speech recognition" and "voice recognition" incorrectly.)

Speech Recognition: "Why can't you guys keep it straight? It is as simple as knowing the difference between hypothesis and theory."


This is perfect for me. I just started working on the voice related stuff for my agent framework and this will be of real use. Thanks.


There's also the excellent, likewise open-source unmute.sh, which alas is also Nvidia-only at this point. https://unmute.sh/


The game show is pretty good. Have a feeling this project will consume all my attention this week, thanks for the tip.


Can't wait for this to land in MacWhisper. I like the idea of the streaming dictation especially when dictating long prompts to Claude Code.


It supports Turing T4, but not Ampere…


Any ideas on how to add Ampere support? I have a use case in mind that I would love to try on my 3090 rig


Magpie-TTS needs a kernel compiled for Ampere, but it appears to be closed source. It was compiled for the 2018 T4 and for 2025 consumer cards, but not for the 2020-2024 consumer cards.


I actually forked the repo, modified the Dockerfile and build/run scripts to target Ampere, and the whole setup runs seamlessly on my 3090: Magpie runs fine in under 3 GB of memory, ~2 GB for the Nemotron STT, and ~18 GB for Nemotron Nano 30b. Latencies are great and the turn detection works really well!

I'm going to use this setup as the base for a language-learning app for my gf :)
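For anyone attempting the same rebuild: the usual fix for missing-kernel errors on Ampere is including the sm_86 compute capability when the CUDA kernels are compiled. A hypothetical sketch; the variable, build-arg, and image names here are illustrative, not the project's actual ones:

```shell
# Include Turing (7.5, T4) and Ampere (8.6, e.g. RTX 3090) when building CUDA kernels.
export TORCH_CUDA_ARCH_LIST="7.5;8.6"
docker build --build-arg TORCH_CUDA_ARCH_LIST="$TORCH_CUDA_ARCH_LIST" \
    -t voice-agent-ampere .
```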


I got your fork working (also on a 3090). I was not impressed with the latency or the recommended LLM’s quality.

Make sure you're using the nemotron-speech ASR model. I added support for Spanish via the Canary models, but those have like 10x the latency: ~160 ms with nemotron-speech vs ~1.5 s with Canary.

For the LLM I'm currently using Mistral-Small-3.2-24B-Instruct instead of Nemotron 3, and it works well for my use case.



