It is inference latency most of the time. These VLA models take in an image + state + text and spit out a set of joint angle deltas.
Depending on the model being used, we may get a single set of joint angle deltas or a short sequence of them. To complete a task, the robot has to capture images from its cameras, read the current joint angles, and send those to the model along with the task text to get the joint angle changes to apply. Once the joint angles are updated, we need to check whether the task is complete (this can also come from the model). We run this loop until the task is done.
Combine this with the motion planning that has to happen to make sure the joint angles we get back are safe and don't cause collisions with the surroundings, and the result is overall slowness.
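A minimal sketch of that perceive → infer → act loop. The `camera`, `robot`, `model`, and `planner` interfaces here are hypothetical stand-ins, not any specific VLA or motion-planning API:

```python
import numpy as np

def run_task(task_text, model, camera, robot, planner, max_steps=500):
    """Run the perceive -> infer -> act loop until the model reports the task is done."""
    for _ in range(max_steps):
        image = camera.capture()                  # current RGB frame(s)
        state = np.asarray(robot.joint_angles())  # current joint positions

        # One inference call per control step: image + state + task text in,
        # a (possibly multi-step) chunk of joint angle deltas and a done flag out.
        # This call is where most of the latency comes from.
        deltas, done = model.predict(image=image, state=state, instruction=task_text)
        if done:
            return True

        for delta in deltas:
            target = state + np.asarray(delta)
            # Motion planning / collision checking adds further latency on top of inference.
            if planner.is_safe(state, target):
                robot.move_to(target)
                state = target
    return False
```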
I have been working with LLMs and VLMs to automate browser-based workflows, among other things, for the last couple of years. Given how good the vision models have gotten lately, the perception problem is solved to a level where it opens up a lot of possibilities. Manipulation is not generally solved yet, but there is a lot of activity in the field and there are promising approaches (OpenVLA, π0). Given these, I'm trying to build an affordable robot that can help with household chores using language and vision models. The idea is to ship capable enough hardware that can do a few things really well with the currently available models, and keep upgrading the AI stack as manipulation models get better over time.
For now it still uses OpenAI for embedding generation by default; we are updating that in the next couple of releases so a local model can generate the embeddings before they are written to a vector db.
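As a rough illustration of what that swap looks like (this is not LLMStack's actual implementation; it just uses sentence-transformers and Chroma as stand-ins for a local embedding model and vector db):

```python
# pip install sentence-transformers chromadb
from sentence_transformers import SentenceTransformer
import chromadb

# Local embedding model instead of the OpenAI embeddings API.
model = SentenceTransformer("all-MiniLM-L6-v2")

client = chromadb.Client()
collection = client.create_collection("docs")

docs = [
    "LLMStack is a no-code platform for building LLM apps.",
    "Promptly is the hosted version of LLMStack.",
]

# Embed locally, then write vectors + documents to the vector db.
embeddings = model.encode(docs).tolist()
collection.add(
    ids=[str(i) for i in range(len(docs))],
    documents=docs,
    embeddings=embeddings,
)

# Queries are embedded locally as well.
query_vec = model.encode(["What is LLMStack?"]).tolist()
results = collection.query(query_embeddings=query_vec, n_results=2)
print(results["documents"])
```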
Disclosure: I'm the maintainer of the LLMStack project.
If anyone is looking to try it out quickly without a local installation, we added the Llama-8B model to the Promptly playground. Please check it out at https://trypromptly.com/playground.
We can get a lot done with a vector db + RAG before having to fine-tune or build custom models. There are a lot of techniques to improve RAG performance; I captured a few of them a while back at https://llmstack.ai/blog/retrieval-augmented-generation.
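A bare-bones version of that retrieve-then-generate flow, assuming some `retriever.search(query, k)` that returns text chunks from the vector db (a hypothetical interface, not the LLMStack API; the model name is just an example):

```python
from openai import OpenAI

client = OpenAI()

def answer_with_rag(question, retriever, k=4):
    # Pull the top-k most similar chunks from the vector db.
    chunks = retriever.search(question, k=k)
    context = "\n\n".join(chunks)

    # Ground the answer in retrieved context instead of fine-tuning the model.
    prompt = (
        "Answer the question using only the context below. "
        "If the context is not enough, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```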
We have recently added support for querying data from SingleStore to our agent framework, LLMStack (https://github.com/trypromptly/LLMStack). Out-of-the-box performance when prompting with just the table schemas is pretty good with GPT-4.
The more domain-specific knowledge the queries need, the harder it gets in general. We've had good success `teaching` the model concepts related to the dataset; giving it example questions and queries greatly improved performance.
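In practice that comes down to packing the schema, a few notes on the dataset's domain concepts, and example question/SQL pairs into the prompt. A sketch of that idea (not LLMStack's actual prompt or code; the table, notes, and examples are made up for illustration):

```python
from openai import OpenAI

client = OpenAI()

SCHEMA = """
CREATE TABLE orders (
  id BIGINT,
  customer_id BIGINT,
  status VARCHAR(32),   -- 'placed', 'shipped', 'returned'
  total_cents INT,
  created_at DATETIME
);
"""

# Domain concepts and example question -> query pairs "teach" the model
# how this dataset is actually used.
NOTES = "Revenue means the sum of total_cents for orders that were not returned."
EXAMPLES = """
Q: How many orders were shipped last week?
SQL: SELECT COUNT(*) FROM orders
     WHERE status = 'shipped' AND created_at >= DATE_SUB(NOW(), INTERVAL 7 DAY);
"""

def question_to_sql(question):
    prompt = (
        "You write SingleStore SQL.\n"
        f"Schema:\n{SCHEMA}\n"
        f"Domain notes: {NOTES}\n"
        f"Examples:\n{EXAMPLES}\n"
        f"Q: {question}\nSQL:"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

print(question_to_sql("What was revenue in the last 30 days?"))
```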