This is so good. I've been using it with Claude Code with great success.
I just leave an instruction in CLAUDE.md to validate changes with Playwright. It automatically starts a dev server (wrote a little MCP server to do that), navigates to the page with the changes it just made, and validates that its changes worked. If there is anything unexpected, it self-corrects.
It's like working with a really great mid-level engineer.
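For anyone wondering what that validation step amounts to, here is a rough sketch of the same idea as a plain Python script rather than agent-driven MCP tool calls. It is not my actual setup: `validate_page`, `missing_expectations`, and the localhost URL are hypothetical names for illustration, and it assumes the Playwright Python package (with `Locator.aria_snapshot()`) and a dev server that is already running.

```python
def missing_expectations(snapshot: str, expected: list[str]) -> list[str]:
    """Return the expected strings that do not appear in the snapshot text."""
    return [text for text in expected if text not in snapshot]


def validate_page(url: str, expected: list[str]) -> list[str]:
    """Load `url` in headless Chromium and report which expected strings are missing."""
    # Imported lazily so the pure helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url)
            # Text form of the accessibility tree for the whole page body.
            snapshot = page.locator("body").aria_snapshot()
        finally:
            browser.close()
    return missing_expectations(snapshot, expected)


if __name__ == "__main__":
    # Hypothetical dev-server URL and expected UI text.
    missing = validate_page("http://localhost:3000/settings", ["Save changes"])
    print("OK" if not missing else f"Missing from page: {missing}")
```

An empty result means everything expected showed up on the page; anything else is what the agent would treat as "unexpected" and try to fix.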
Interesting use-case. Can you give an example of a prompt you use that triggers this tool? Are you validating UI changes (button color), navigation, or something more complex?
+1 for claude code being amazing, and especially +1 for the cost. I've spent $500 this week, $.10 - $1 at a time, fixing bugs and adding features. It took a while to get used to not @ tagging all of the files and realizing it just "figures it out" (using tokens to do so of course!)
I burned through $25 in just 3 hours. Claude Code will be great when they can get the cost down. If the cost were 1/10th of that I'd be using it all the time, but roughly $10/hour is too much.
>I burned through $25 in just 3 hours. Claude Code will be great when they can get the cost down. If the cost were 1/10th of that I'd be using it all the time, but roughly $10/hour is too much.
I've been trying to figure this out, and I don't think it's malicious; it's just a matter of incentives. Anthropic devs are certainly not paying retail prices for Claude usage, so their benchmark (or just intuition) for efficiency is probably very different from the average user's. Without that hard constraint, the incentive just isn't there for them to squeeze out a few more pennies, and it ends up way more expensive than stuff like Cline or Cursor.
It uses ariaSnapshot, which is an accessible representation of the DOM used by screen readers and accessibility validation tools as well as playwright testing.
However, even with that, it will quickly exhaust the model context if you navigate to something like Gmail. I just verified this with cursor.
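If you want to check how big the snapshot gets for a given page, something like the sketch below gives a ballpark. It assumes the Playwright Python package and uses a crude 4-characters-per-token rule of thumb rather than a real tokenizer, so treat the number as an order-of-magnitude estimate; `estimate_tokens` and `snapshot_token_estimate` are made-up names.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb (assumption), not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate token count, rounding up."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def snapshot_token_estimate(url: str) -> int:
    """Capture the page's ARIA snapshot and estimate its size in tokens."""
    # Imported lazily so estimate_tokens works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url)
            snapshot = page.locator("body").aria_snapshot()
        finally:
            browser.close()
    return estimate_tokens(snapshot)


if __name__ == "__main__":
    print(snapshot_token_estimate("https://example.com"))
```

Keep in mind that in an agent loop a fresh snapshot typically comes back after each navigation or action, so the per-page cost adds up over a session.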
I've been playing around with a much better textual representation of the page that's much more compact:
I agree -- I hacked up a CDP-driven MCP so that Claude can drive your own browser instance, and I think that's more in the spirit of how MCP is supposed to work (where it's driving your tools under supervision, rather than spinning up its own context)
I’m going to see if I can use this in combination with our JIRA MCP to read a bug ticket’s “steps to reproduce” and see if it can translate those steps into browser actions that actually reproduce the bug.
I don’t understand the hate against MCP. It is truly exciting to see the Cambrian explosion of “connectors” coming out.
This is going to be the “App Store” for models in a way that OpenAI’s custom GPTs never were.
Watermarking and synthesizing text for hosts and clients, private RAG over Slack: MCP implementations would disperse LLMs to Local Data Source A, Local Data Source B, and Remote Server C.
I don't know Playwright, but how is this different from Puppeteer?
The issue I'm noticing with Puppeteer is that it doesn't always come up with the right JavaScript on the first try for a simple task, such as accepting a cookie consent banner.
Playwright is a bit of an evolution of Puppeteer. Mostly the same API, extends the API a bit (I tend to prefer its abstractions over Puppeteer), and designed to work with multiple browsers. It came from many of the same developers as Puppeteer.
Does Playwright work with multiple browsers? I get the impression it can work with multiple engines, but they're just custom wrappers and not the full/original browsers.
I've been using Playwright for years to test Safari/FF/Chromium-based engines. The Playwright team compiles every single browser at each new release.
It's great, no worries. Aside from very minor things like mobile Safari bugs (which you can't test on macOS Safari either; you need a real device or BrowserStack), it's perfect.
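As a rough illustration of what testing across engines looks like, here is a minimal sketch assuming the Playwright Python package with all three engines installed via `playwright install`. Note that these are the engine builds Playwright downloads, not the branded Safari or Firefox apps. `page_titles` is a made-up helper name.

```python
ENGINES = ("chromium", "firefox", "webkit")


def page_titles(url: str, engines: tuple[str, ...] = ENGINES) -> dict[str, str]:
    """Open `url` in each requested engine and return the page title per engine."""
    unknown = [name for name in engines if name not in ENGINES]
    if unknown:
        raise ValueError(f"unknown engine(s): {unknown}")

    # Imported lazily so the argument check above runs without Playwright installed.
    from playwright.sync_api import sync_playwright

    titles: dict[str, str] = {}
    with sync_playwright() as p:
        for name in engines:
            # p.chromium, p.firefox and p.webkit are the three browser types.
            browser = getattr(p, name).launch()
            try:
                page = browser.new_page()
                page.goto(url)
                titles[name] = page.title()
            finally:
                browser.close()
    return titles


if __name__ == "__main__":
    print(page_titles("https://example.com"))
```

The same script body runs unchanged against all three engines, which is the main practical difference from a Chromium-first tool.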
I think the use cases are slightly different between the two. The Playwright MCP depends on the MCP client (like Claude Desktop or Cursor) to provide the intelligence, while browser-use can "think" by itself. Plus it seems that unless you use the vision mode, you are more or less restricted to the accessibility tree, which may not be present or well populated depending on the website you're using. This also means that it won't really work as well with stuff like Cursor/Windsurf, since they don't really process images from MCPs right now.
I'm more in the camp of using claude computer-use/openai cua. I think they work better for most things, especially if you don't interact with hidden/obscured elements.
If you're interested in comparing these different services, you can try HyperPilot by Hyperbrowser at https://pilot.hyperbrowser.ai .
Disclaimer: I worked on Hyperpilot so I might be a bit biased.
So instead of specifying explicit selectors, etc, you just use a prompt?
(like "Go to eBay.com, search for Playstation 5, and click on the first result that isn't a promoted listing")
Yes, exactly. It defaults to using the Chrome accessibility tree but it can also be run so it uses Claude's vision feature against screenshots instead.
What a time to be alive.