Cradle: Empowering Foundation Agents Towards General Computer Control

eamag · on July 12, 2024

I'm looking through the code, does it mean that the authors wrote all basic skills themselves and let LLM choose from them? So this approach can't be generalised, can it?

https://github.com/BAAI-Agents/Cradle/pull/44/files#diff-3f3...

Fripplebubby · on July 12, 2024

They start with some basic skills but the model builds the more complex skills. The specific skills that they start with depends on which task, for example RDR2 they start with only how to turn, move forward, shoot, and select an item from the item equip wheel and then a couple composite skills that were not _necessary_ but saved a lot of time - for example, "follow npc", since the model could follow an npc without it but it required a huge amount of back and forth. However, a lot of skills were actually learned by the model from the in-game tutorial in RDR2 explaining how to do things!

> So this approach can't be generalised, can it?

The paper definitely demonstrates that the _approach_ can be generalized because they use the same approach across a variety of different environments and tasks, but you can also see that they did have to specialize the prompts, tools (like how the incoming screenshot was decorated with object detection / segmentation, stuff like that), and set of skills for each environment.

finnh · on July 12, 2024

Right ... the output isn't really "keyboard & mouse operations" as stated in the abstract. The output is a sequence of atomic commands ("shoot", "select_weapon") or composite commands ("follow") that are then translated into mouse actions by a human-written layer.

Similarly, when I'm typing this post my output isn't "keyboard interrupts" nor "TCP frames". Those things do happen, but only by the grace of others' work.

Still I like the basic framework idea & I suppose future work looks at pushing the provided layer lower & lower - so the system can learn those atomic commands, rather than having humans provide them - and pushing the logic higher & higher via AI-generated composite commands.

fordacious · on July 12, 2024

Many games do happen to be set up this way wrt their input systems if they happen to leverage Steam input. Player actions in game are defined as abstract verbs with concrete input device binding being left to the player / underlying system. Valve is trying to nudge games towards being platonic objects isolated from specific systems (so they'll be readily compatible steam deck, vr, pc etc...). Translating these verbs in an automated way with an agent is an interesting side effect.

miohtama · on July 12, 2024

The problem is not triggering the action, the problem is being at the right place at the right time and looking into the right direction.

huevosabio · on July 12, 2024

I think they seed it with some skills and then can generate more? The website allures to "skill generation" so arguably they are doing some level of code generation.

Art9681 · on July 12, 2024

Fantastic. This is why efforts to defeat web scrapers will ultimately prove futile unless the human/computer interfaces require constant biometric authentication. I imagine in some dark timeline, content will not be displayed unless the finger touching the trackpad is a human finger. Or the keyboard keys wont register unless they detect a fingerprint or other bio signatures. Same thing with online multiplayer games. Only approved controllers that have some future tech that constantly polls that fingerprint pressing the buttons to ensure it is a human.

Perhaps something like the eye tracking tech in modern vehicles to ensure you're paying attention if the lane assist is turned on.

Of course, that would be awful. But what other recourse is there?

squigz · on July 12, 2024

> unless the human/computer interfaces require constant biometric authentication

Other than the obvious path of getting over this fear of bots and whatnot, I see 2 options forward in this regard: 1) government ID verification on all major platforms 2) end-to-end verification of all software being ran, and refusal to run other programs if unsigned code is present; I'm pretty sure there's plenty of efforts in this area already

polotics · on July 12, 2024

Sure, end-to-end verification of software being run could be a thing, but what then prevents the LLM-assisted setup from using an analog video feed to a separate computer, and analog input back on keyboard and mouse? Do you also expect we'd sign-off on location and proximity -tracking for all computing devices in existence?

doctorpangloss · on July 12, 2024

Every time you access a Cloudflare WAF protected service from an iPhone’s cell connection, you are living the situation you describe.

squigz · on July 12, 2024

I mean... I wouldn't have expected us to 'sign off' on many of the privacy-eroding shit we've been given, and yet...

To be clear, I agree that this is a losing battle, but I don't think that's going to stop some interested parties in pushing systems like what we're talking about.

dsign · on July 12, 2024

> What other resources is there?

Charge money to the content consumer? Of course, that will be unacceptable for companies that make money from ads…they will prefer the dystopia. It will also be bad for us humans when our income dwindles because the machines take our jobs…which this paper shows it’s just a matter of time.

throwaway4aday · on July 12, 2024

If the bot isn't abusing the service and is operating at a speed roughly equivalent to a human and your intent is to distribute the service freely or the bot is paying a fee for access then what's the problem? Anything I can think of would also be abuse of the terms of service or code of conduct and falls in the same category as regular meat people doing the same things.

userbinator · on July 12, 2024

Drink verification can.

gs17 · on July 12, 2024

I imagine in that same timeline, none of the issues have actually been solved but there are a lot more homeless people with missing fingers.

add-sub-mul-div · on July 12, 2024

> Of course, that would be awful. But what other recourse is there?

Rewatching Jurassic Park?

Fripplebubby · on July 12, 2024

Very cool work. Note that this was completed using a vanilla ChatGPT-4o model, all the magic dust is prompting, the dataflow between stages (info gathering, self-reflection, task inference, skill curation, action planning), and some tooling like added object detection / bounding boxes / icon detection.

Also, neither here nor there but I enjoyed the discussion in the paper about how the model had a surprisingly low performance on sending an email in Outlook because while it well-understood the task and how to send an email, Outlook's UI still managed to confuse it - can relate.

lucianbr · on July 12, 2024

What happened to Robotic Process Automation? Wasn't that supposed to be this?

eMPee584 · on July 13, 2024

Since the release of the "CogAgent" visual language model by scientists from Tsinghua University end of 2023, more and more general GUI agents have been showing up, including those by MS. Check out https://scholar.google.de/scholar?cites=11749002511260467707 f.e. or current publications by these authors..

bckr · on July 12, 2024

Robotic process automation is when you automate a process a business is already doing.

Models like this will be useful for RPA.

ilaksh · on July 12, 2024

I think I saw a framework designed to make RPA easier for LLMs by identifying all of the HI elements or fields with a number and allowing them to do entry by reference to the number or something. Can't remember what it was called.

cs702 · on July 12, 2024

Wow, this looks amazing.

The authors have developed Cradle, a multimodal-LLM-powered agent framework with six modules: Information Gathering, Self-Reflection, Task Inference, Skill Curation, Action Planning, and Memory. Once Cradle has processed high-level instructions, its inputs are sequences of computer screenshots. Its output is executable code for low-level keyboard and mouse control, enabling Cradle to interact with any software and complete long-horizon complex tasks without relying on any built-in APIs:

       Oversimplified Big-Picture Diagram

                 +------------+
                 |   Cradle   |    executable code
  screenshots -> |(high-level | -> for controlling
                 |  planning) |    keyboard & mouse
                 +------------+

The authors' experiments show what to me looks like impressive generalization and performance across software applications, successfully operating daily software like Chrome and Outlook, and across commercial video games: It is able to follow the main storyline and complete 40-minute-long missions in Red Dead Redemption 2, create a city of a thousand people in Cities: Skylines, farm and harvest parsnips in Stardew Valley, and trade and bargain to make a profit in Dealer's Life 2.

There are of course many caveats -- the technology is still in its infant stage -- but still, I'm impressed at how quickly things are progressing.

We sure live in interesting times!

sitkack · on July 12, 2024

Hold on to your scrip, two more papers down the line. What a time

No one is prepared.

yawnxyz · on July 12, 2024

weird nitpick — they keep mentioning LMM, but do they mean LLMs?

Krastan · on July 12, 2024

LMM is a large multimodal model. So it does more than just language, in this case interacting with UI, in others using voice and video

dinkblam · on July 12, 2024

i am waiting for version 2.0 "Enclave"