Numen: Voice Control for Handsfree Computing (numenvoice.com)
100 points by memorable on Feb 18, 2023 | 42 comments


It's worth mentioning Talon[0] here, which is also a system for offline voice control, with great Python-based scripting (it also supports eye tracking, though I haven't used that myself).

Using your computer or programming with it works like a charm, and some interesting and impressive projects have been built on top of it, like Cursorless[1].

There's a great Strange Loop talk[2] demonstrating Talon and the current state of voice coding, which is how I discovered it (hint: it's much better than you'd expect, and straightforward to learn at that).

[0]: https://talonvoice.com/

[1]: https://github.com/cursorless-dev/cursorless

[2]: https://youtu.be/YKuRkGkf5HU

Disclaimer: not affiliated, just a happy occasional user


Wow, eye tracking is not something I'd thought of... and now I want it.

I wonder if we could replace the mouse with eye tracking? I wouldn't expect it to be accurate enough, though, given the micro-movements eyes make, and their generally erratic motion... but I'd love to be wrong.


Eye tracking is useful if you can, or want to, sit in front of a desk. I'm concerned about the lack of diversity in eye-tracking manufacturers. Tobii is the only commercial brand I'm aware of, or that Talon supports, and initial setup requires Windows (I don't know if recalibration also requires Windows).

I haven't used eye tracking, but I'd imagine commands could be given in the short time that an on-screen element is focused... while the rest of the time the cursor jumps around erratically.


Talon's eye tracking functions as a mouse replacement. Is there a specific demo you'd like to see? I can record one.


I've been researching eye tracking for my own project for the past year. I have a Tobii eye tracker, which is probably the best eye-tracking device for consumers currently (or the only one, really). It's much more accurate than trying to repurpose a webcam.

So the problem with eye tracking is what's called the "Midas touch" problem: everything you look at is potentially a target. If you were to simply connect your mouse pointer to your gaze, for example, any hover effect on a web page would be activated just by glancing at it. [1]

Additionally, our eyes are constantly making small movements called saccades [2]. If you track eye movement perfectly, the target will wobble all over the screen like mad. The ways to alleviate this are expanding the target visually so that the small movements are contained within a "bubble", or delaying the target slightly so the movements can be smoothed out, which naturally introduces inaccuracy and latency. [3] There are efforts to predict the eye's movements to give the user the impression of lower latency, but it's an imperfect solution.
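For illustration, here is a minimal sketch of those two mitigation strategies: an exponential smoothing filter (which trades latency for stability) combined with a "bubble" radius that absorbs saccade jitter. The constants and class shape are illustrative, not from any particular tracker's SDK.

    import math

    SMOOTHING = 0.15  # lower = smoother cursor but more lag
    BUBBLE_PX = 40    # gaze jitter inside this radius doesn't move the target

    class GazeFilter:
        """Stabilize raw gaze samples into a usable pointer position."""

        def __init__(self):
            self.x = self.y = None

        def update(self, raw_x, raw_y):
            if self.x is None:                 # first sample: snap to it
                self.x, self.y = raw_x, raw_y
            elif math.hypot(raw_x - self.x, raw_y - self.y) < BUBBLE_PX:
                pass                           # absorb micro-movements in the bubble
            else:                              # smooth larger, deliberate movements
                self.x += SMOOTHING * (raw_x - self.x)
                self.y += SMOOTHING * (raw_y - self.y)
            return self.x, self.y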

Another issue is gaze activation. Computers can't read our minds, so systems that require one to stare fixedly at an object in order to activate an interface are common. The problem with this is both the delay and the effort required: you can easily get a headache from trying to fixate your eyes on a target. Eye tracking in VR and AR has similar problems.

There are other forms of activation - if you open your iPhone's accessibility menu in the settings, you'll see a bunch of options including head nods, facial gestures, eye blinks and more. [4]

The future of eye tracking is definitely multimodal. A specific gaze target combined with a gesture or hotword is the way humans naturally interact with each other (you look at a person, get confirmation through eye contact or a nod, and then speak or gesture). What's amazing is the amount of redundant effort being made in this area. Some of this has been known for a decade or more. There are tons of research papers and thousands of patents covering the topic in great detail. There is very little that hasn't already been solved. (A rough sketch of the multimodal pattern follows the references below.)

1. https://uxdesign.cc/the-midas-touch-effect-the-most-unknown-...

2. https://en.m.wikipedia.org/wiki/Saccade

3. https://help.tobii.com/hc/en-us/articles/210245345-How-to-se...

4. https://support.apple.com/accessibility
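As promised, a sketch of the multimodal pattern: gaze only selects the target, and a separate signal (hotword, nod, blink) performs the activation, sidestepping both the Midas touch and the fixation-headache problems. The dwell threshold and method names here are hypothetical.

    import time

    DWELL_S = 0.25  # gaze must settle on an element this long to count as selected

    class MultimodalSelector:
        """Gaze chooses the target; a hotword/gesture event activates it."""

        def __init__(self):
            self.target = None
            self.since = 0.0

        def on_gaze(self, element):
            # Called continuously with whatever element the (smoothed) gaze is on.
            if element is not self.target:
                self.target = element
                self.since = time.monotonic()

        def on_activate(self):
            # Called when the hotword/nod/blink fires; returns the element to act on.
            if self.target and time.monotonic() - self.since >= DWELL_S:
                return self.target
            return None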


That Strange Loop talk inspired me to explore a lot of things, including my methodology for writing command phrases that are phonetically distinct and succinct.

Glad to hear Talon is still around! Their Slack has grown, and they really seem like they have a product now.


The Talon demonstration from the last link was inspiring, but it works in the exact opposite fashion from what I would have imagined. The code-development examples are command-based, with a command to enter phrase mode. With technology such as tree-sitter and IntelliJ, I'd have expected that parsing the syntax tree of the current language for completions would let development occur almost entirely in phrase mode, with only a few commands for handling unknown inputs such as new variable names.

I'm curious if anyone has ever tried implementing the latter, or compared the two approaches. I'm sure there would be many obstacles I haven't considered.


Fixed commands are fast, precise, and predictable.

Assuming you mean speaking in natural language: that's slower to say, and likely less precise and predictable if you want to be able to just say "anything" and have a result.

You need a command system either way. If you want to express some precise intention, you need to understand what the command system will do.

There is a combined "mixed mode" system I've been testing in the talon beta where you can use both phrases and commands without switching modes.


Was hoping for a comparison to Talon. Talon is incredible. I'm particularly interested to see if any project emerges focused on augmenting the keyboard, as opposed to replacing it, in a programming context.


You might be interested in Cursorless's experimental keyboard mode: https://www.cursorless.org/docs/user/experimental/keyboard/m...


I go back as far as Windows 7, which had "Speech Recognition". Before that, in the XP days, I dabbled with offline Dragon and the like.

Point is, this problem has bugged me for years:

" I need a dictation software to read me back what it understood and typed". ALL the software either assume you are looking at the screen and like the win7 (scratch that) I don't want that.

Let me say "I was walking and running beside the train." <pause> "I was walking and besides the train" would be the response, so I would say "scratch that" and repeat it, or ask for help, and so on.

Why isn't such a system there?

Think of it as a person doing the typing: you dictate a line, they read back what you said, okay, next. Otherwise: "fix that like this".
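For what it's worth, the loop you're describing is simple to sketch: transcribe a phrase, echo it back with offline text-to-speech, and let "scratch that" discard the last line. pyttsx3 is a real offline TTS library; transcribe_phrase() is a placeholder for any offline recognizer (Vosk, etc.).

    import pyttsx3  # offline text-to-speech

    def transcribe_phrase() -> str:
        """Placeholder: block until the speaker pauses, return the recognized text."""
        raise NotImplementedError

    def dictate() -> str:
        tts = pyttsx3.init()
        lines = []
        while True:
            text = transcribe_phrase()
            if text == "stop dictation":
                break
            if text == "scratch that":
                if lines:
                    lines.pop()          # discard the last line
                tts.say("scratched")
            else:
                lines.append(text)
                tts.say(text)            # read back what was understood and typed
            tts.runAndWait()
        return " ".join(lines)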


It seems SAPI might be removed from the latest versions of Windows. It was pretty simple to use from VB6 in pure dictation mode, or you could even load a dictionary of listen words, though that gave even more false positives. Is anyone aware of any replacements for offline dictation with a custom dictionary?


Last time I checked, Talon's models were very bad at recognizing my voice. Does it support better models now, for example OpenAI's Whisper?


Depending on when that was: in 2018 the free model was the macOS speech engine, in 2019 it was a fast but relatively weak model, and as of late 2021 it's a much stronger model. I'm currently working on the next model series with a lot more resources than I had before.

It's also worth saying that if you only tried things out briefly, there are a handful of reasons recognition may have seemed worse. Talon uses a strict command system by default, because that improves precision and speed for trained users, but the tradeoff there is it's more confusing for people who haven't learned it yet.

For example, Talon isn't in "dictation mode" by default, so you need to switch to that if you're trying to write email-like text and don't want to prefix your phrases with a command like "say".

The timeout system may also be confusing at first. When you pause, Talon assumes you were done speaking and tries to run whatever you said. You can mitigate this by speaking faster or increasing the timeout.
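For anyone curious how a pause timeout like this behaves, here's a rough sketch of the idea (not Talon's actual implementation): audio is buffered until the input stays quiet for longer than the timeout, and the buffer is then handed to the recognizer as one utterance. The threshold and timing constants are illustrative.

    import struct

    TIMEOUT_S = 0.3   # pause length that ends an utterance
    CHUNK_S = 0.05    # duration of each incoming 16-bit PCM chunk
    QUIET_RMS = 500   # energy below this counts as silence

    def rms(chunk: bytes) -> float:
        samples = struct.unpack(f"<{len(chunk) // 2}h", chunk)
        return (sum(s * s for s in samples) / max(len(samples), 1)) ** 0.5

    def utterances(chunks):
        """Yield one byte buffer per utterance from a stream of PCM chunks."""
        buf, quiet = b"", 0.0
        for chunk in chunks:
            if rms(chunk) < QUIET_RMS:
                quiet += CHUNK_S
                if buf and quiet >= TIMEOUT_S:
                    yield buf            # pause exceeded: run what was said
                    buf = b""
            else:
                quiet = 0.0
                buf += chunk
        if buf:
            yield buf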

The default commands (like the alphabet) may also just not be very good for some accents, and that will be the case for any speech engine - you will likely need to change some commands if they're hard to enunciate in your accent.

I recommend joining the slack [1] and asking there if you want more specific feedback. I definitely want to support many accents and even have some users testing Talon with other spoken languages.

[1] https://talonvoice.com/chat


The creator of Talon has tested the Whisper models extensively[0].

[0]: https://twitter.com/lunixbochs/status/1574848899897884672


I don't know what type of speech each dataset represents, but the Talon results are extremely impressive... I assume it wasn't trained on at least some subset of this data (depending on the train/test split)?


A handful of the datasets I tested are fully held out (I have reason to believe none of the models have trained on them), and Talon was trained on none of the dev or test data of any of the datasets in question.

Due to Whisper's weakly supervised training on a large amount of automatically scraped data and its reliance on a bigger language model, it's far more likely that Whisper had seen some of the test data before.


I tried this and the speech recognition is really poor.


The Talon model is fairly accurate, but it can be confusing for new users to use the command system correctly. I posted a sibling reply about this, but the most common reason for Talon users to complain about the recognition is that they are in the strict "command mode" and say things that aren't actually commands.

If you encounter what feels like poor recognition in Talon, I recommend enabling Save Recordings and zipping+sharing some examples on the Slack and asking for advice.

The current command set is definitely harder to learn than a system designed for chat/email where "what you say is what you get", but it's much more powerful for tasks like programming once you learn it.

I'm dubious about what kind of general command accuracy Numen is able to get with the Vosk models, as Vosk, to my understanding, is designed more for natural language than for commands.


Okay, I read some of the docs, but I didn't find much info on writing code, mostly on editing what's already present.


This is so needed.

All of big tech's use of voice has so far required internet access and is creepy. Google's is appalling in that it keeps changing, so things that did work stop working.

What voice control needed was for humans to adjust a little to make the computer's job easier. E.g. "Computer", "file save" is much more efficient all round than sending audio off to the borg for an AI to try to work out what it means.


I like this a lot. It's built upon Vosk [0], an open-source voice recognition toolkit. I must try it for some of my own projects! (A minimal usage sketch follows the link below.)

[0] https://alphacephei.com/vosk/
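As promised, a minimal sketch of Vosk's Python API, transcribing a 16 kHz mono WAV file; the model directory name is a placeholder for one downloaded from the Vosk site. Passing a JSON list of allowed phrases as a third KaldiRecognizer argument constrains recognition to a command-style grammar.

    import json
    import wave

    from vosk import Model, KaldiRecognizer

    model = Model("vosk-model-small-en-us-0.15")  # placeholder: any downloaded model
    wf = wave.open("speech.wav", "rb")            # assumed 16-bit mono PCM

    rec = KaldiRecognizer(model, wf.getframerate())
    # For commands, constrain the vocabulary instead:
    # rec = KaldiRecognizer(model, wf.getframerate(),
    #                       json.dumps(["scribe", "troll cap", "[unk]"]))

    while True:
        data = wf.readframes(4000)
        if not data:
            break
        if rec.AcceptWaveform(data):
            print(json.loads(rec.Result())["text"])

    print(json.loads(rec.FinalResult())["text"])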


I'm only interested if you have to activate it by saying "Hello... Numen..." à la Seinfeld.


Hello, Jerry...


Dunno what the video is, but it's broken on Firefox mobile at least.


I only found this thread today but that's good to know.

Here's the original screencast on a PeerTube instance: https://diode.zone/w/7ZjccgJ5EJCsES3x3yrkpQ

There are also videos of my phone and Pi: https://peertube.tv/w/uzMMQ5nbmsHMkDGGVcS1ZB and https://peertube.tv/w/miurmjVygd6C71EfPk19QU


Same here - but I bet it’s the HN hug


Doesn’t work on ipad safari either…


Broken in Safari on iPhone as well


Impressive, I'm looking forward to seeing more of this project. Did you draw inspiration from Talon? There are a lot of similarities when it comes to the voice commands.


Thank you! I only found this thread today. I was inspired by Talon, but all I really wanted was some phrases that worked everywhere, to use normal tools like vim and a tiling window manager, and for it all to be FOSS. I think Talon's more about app-specific phrases and taking over the role of your text editor and window manager.


Interesting project for providing better accessibility!

Reminded me a bit of those scenes in Blade Runner where Deckard asks the computer to zoom in on a certain area and enhance the image :D


> Deckard

It does. Bloody awesome. I'm re-watching this video trying to understand some of the shorthand being used. There's "bang" for exclamation mark; "cap drum" (?) for `cd`. I can't figure out what words he uses to invoke `git clone` at 1:27 but it's incredibly futuristic. I wish my daily driver wasn't a Mac these days =(


It looks like they have a word (or several) for each letter of the alphabet. So `cd` is "change drum", and `git clone` is "guest ice traps space cap look [Ctrl+Right, autocomplete]", where you can read the commands off the first letters of each word.

Edit: the default 'phrases' are here: https://git.sr.ht/~geb/numen/tree/master/item/phrases


I find it counterintuitive, like having to memorize new constants that a programmer defines... But having this with English words like "next line" or "page down" or "page up": game changer.


Interesting. Why would someone use handsfree computing? It's slow; I would rather type.


Because they might be physically impaired in some way. Or have severe Repetitive Strain Injury (RSI).


In addition to the other points above, I have found that in some workflows, using Talon and Cursorless, I am faster than I was with just a keyboard and mouse. I have had some bad RSI flare-ups over the past few years, and they have been a lifesaver getting me through my undergrad and most of my master's. Even in between flare-ups, I found using them to edit a LaTeX document much more convenient, being able to reference paragraphs/sentences/words and then change them or move them around.


TBF, you need working hands to use hands-on computers.

Plus, the offline part could make a good starting point for a DIY personal assistant.

That said, their "getting started" sounds...esoteric.

>There normally isn't any output but you should be able to type "hey" by saying "hoof eve yank" and transcribe a sentence after saying "scribe". You can terminate it by pressing Ctrl+c or saying "troll cap".
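The mechanics are just a word-per-letter lookup. A tiny sketch, seeded only with the three mappings that quote confirms ("hoof eve yank" types "hey"); the full mapping lives in the phrases file linked elsewhere in the thread.

    # Only these three entries are confirmed by the quote above; the rest of
    # Numen's alphabet is in its phrases file.
    ALPHABET = {"hoof": "h", "eve": "e", "yank": "y"}

    def spell(spoken: str) -> str:
        return "".join(ALPHABET[w] for w in spoken.split() if w in ALPHABET)

    print(spell("hoof eve yank"))  # -> "hey"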


This looks much better than any voice control I've seen so far. I wonder if it requires a tiling setup, or whether you can integrate it with other tiling window managers.


I only found this thread today, but it works everywhere a keyboard would, so in any desktop environment or even the virtual consoles. I also have a Raspberry Pi which I use to install operating systems and the like: https://peertube.tv/w/miurmjVygd6C71EfPk19QU


Interesting... I just broke my arm, so this is potentially useful for me. The words you use will take some getting used to, though.



