Look at Browser Use.
They self-reported 89% on WebVoyager. On the hard tasks of a real benchmark, they score 8.1%. That's not a performance drop. That's a different product from the one being advertised.
To be fair, this isn't just a Browser Use problem. Look at the drop-off for every agent as tasks get harder:
Operator goes from 83% easy → 43% hard. That's a 40-point cliff.
Claude Computer Use: 90% easy → 32% hard. 58-point drop.
Browser Use: 55% easy → 8% hard. Just falls off a cliff entirely.
TinyFish: 97.5% easy → 81.9% hard. A 15.6-point drop.
The gap between easy and hard is where you see whether a system actually works or is just good at simple tasks. Every other agent loses half its ability or more when tasks get complex. We lose 15.6 points.
That's the difference between "cool demo" and "I can actually ship this."
To use WebMCP, you need:
- Chrome: Version 146.0.7672.0 or higher, which means an upcoming release.
- Flags: The "WebMCP for testing" flag must be enabled (a quick availability check is sketched below).
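If you want to confirm the flag actually took effect, a quick check from the DevTools console might look like the following. This assumes the API surfaces as navigator.modelContext, as in the public explainer; the property name in the flag-gated Chrome build may differ.

```ts
// Assumption: the flag exposes the API as navigator.modelContext.
// If the property is missing, re-check chrome://flags ("WebMCP for testing")
// and the browser version.
if ("modelContext" in navigator) {
  console.log("WebMCP API detected.");
} else {
  console.warn("WebMCP API not found; is the flag enabled?");
}
```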
What is WebMCP?
WebMCP is a proposed web standard that exposes structured tools for AI agents on existing websites. This would replace "screen-scraping" with robust, high-performance page interaction and knowledge retrieval.
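To make this concrete, here is a rough sketch of what exposing a tool might look like from a page's own JavaScript. This is an assumption-heavy illustration: the navigator.modelContext / provideContext / inputSchema / execute names follow my reading of the public proposal and may not match what ships behind the Chrome flag, and the search_orders tool and /api/orders endpoint are made up for the example.

```ts
// Hypothetical sketch only: names below follow the public WebMCP explainer
// (navigator.modelContext with a tools array) and may change before the
// spec, or the flag-gated Chrome build, stabilizes.
type WebMCPTool = {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>;
  execute: (input: Record<string, unknown>) => Promise<string>;
};

declare global {
  interface Navigator {
    modelContext?: {
      provideContext(context: { tools: WebMCPTool[] }): void;
    };
  }
}

if (navigator.modelContext) {
  navigator.modelContext.provideContext({
    tools: [
      {
        name: "search_orders", // hypothetical tool for this example
        description: "Search the signed-in user's orders by keyword.",
        inputSchema: {
          type: "object",
          properties: { query: { type: "string" } },
          required: ["query"],
        },
        // The agent calls this directly instead of scraping the DOM
        // or driving the UI pixel by pixel.
        execute: async (input) => {
          const query = String(input.query ?? "");
          const res = await fetch(`/api/orders?q=${encodeURIComponent(query)}`);
          return JSON.stringify(await res.json());
        },
      },
    ],
  });
}

export {};
```

The point of the structured-tool shape is that the agent gets a typed, documented entry point into the site's existing functionality rather than inferring it from rendered pixels.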
News: The Agentic AI Foundation (AAIF) is a directed fund under the Linux Foundation, co-founded by Anthropic, Block, and OpenAI, with support from Google, Microsoft, AWS, Cloudflare, and Bloomberg. The AAIF aims to ensure agentic AI evolves transparently, collaboratively, and in the public interest through strategic investment, community building, and shared development of open standards.
This is another WebMCP use case: testing features during vibe-coding.
One prompt, 60 seconds from writing requirements to validating the implementation.
Check out this tutorial: https://screen.studio/share/y9b9Fmnc
I've been thinking about how we assess software engineering skills, and I'm curious about others' thoughts on using open source contributions as a primary metric.
My hypothesis is that real-world collaboration and communication skills, as demonstrated through open source work, are more indicative of a developer's capabilities than typical coding quizzes.
(I tried OtherBranch's sample coding problem mentioned in their post, https://www.otherbranch.com/practice-coding-problem, and that experience led me to this view.)
With the rise of AI tools, I believe the ability to use them effectively to enhance one's contributions is becoming increasingly valuable.
For those who hire or work with other developers:
1. How much weight do you give to a candidate's open source contributions?
2. Do you find that strong open source contributors tend to be better collaborators?
3. How do you balance assessing technical skills vs. communication/collaboration abilities?
I'm working on a platform to facilitate this kind of assessment, with personalized LLM support, so I'd love to hear about your experiences and thoughts!