Honestly, the problem with these claims is how anecdotal they are: how can someone reproduce this? I love it when labs go beyond traditional benchies like MMLU and friends, but these kinds of statements don't help much either, unless it's a proper controlled study!
In a sense it's better than a benchmark: it's a practical, real-world, highly quantifiable improvement, assuming there are no quality regressions and the result passes all test cases. I have been experimenting with this workflow across a variety of computational domains and have achieved consistent results with both Opus and GPT. My coworkers have independently used Opus for optimization suggestions on services in prod, and they've led to much better performance (3x in some cases).
A more empirical test would be good for everyone (e.g. on equal hardware, give each agent the goal of implementing an algorithm and making it as fast as possible, then quantify the relative speed improvements of the versions that pass all test cases).
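Something along these lines is what I have in mind, as a rough sketch (the file names, workload, and test command are all invented for illustration):

    # Hypothetical harness: time each agent's implementation on the same workload
    # and only count speedups for versions that pass the shared test suite.
    import os
    import subprocess
    import time

    CANDIDATES = ["baseline.py", "agent_a.py", "agent_b.py"]  # invented file names
    TEST_CMD = ["pytest", "tests/"]  # same correctness suite for every candidate

    def median_wall_time(script, runs=5):
        times = []
        for _ in range(runs):
            start = time.perf_counter()
            subprocess.run(["python", script, "fixed_input.bin"], check=True)
            times.append(time.perf_counter() - start)
        return sorted(times)[len(times) // 2]

    results = {}
    for script in CANDIDATES:
        env = {**os.environ, "CANDIDATE": script}
        passed = subprocess.run(TEST_CMD, env=env).returncode == 0
        results[script] = median_wall_time(script) if passed else None

    baseline = results["baseline.py"]
    for script, t in results.items():
        if t is None:
            print(script, "failed the test suite, disqualified")
        elif baseline is None:
            print(script, f"{t:.3f}s (no passing baseline to compare against)")
        else:
            print(script, f"{baseline / t:.2f}x vs baseline")

The only hard part is agreeing on the workload and hardware up front; the rest is bookkeeping.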
The tension here is that what customers need to reproduce is this result on their own problem. To measure this you need extensive evals on private data.
OpenAI simply won’t share the data you need to reproduce this in the way you’d hope for an academic paper.
Oh, come on, if they do well on benchmarks people question how applicable they are in reality. If they do well in reality people complain that it's not a reproducible benchmark...
Iran tends to lie about these things while Israel usually tells the truth, at least after running an investigation. It's pretty simple: one is a dictatorship without free media, and the other one isn't. It's easy to lie when you can tell the newspapers what to write, and it's much harder when they're doing their job. You want an example? Khamenei. Iran says he's safe and wasn't hurt. Israel says he's dead. Let's see.
> while Israel usually tells the truth, at least after running an investigation
You can't be serious about that statement. At best it reflects overwhelming naivete about how governments (let alone those engaged in war) work. At worst, it's a deliberate attempt at misinformation.
I've been using GLM 4.7 alongside Opus 4.5 and I can't believe how bad GLM is in comparison. Seriously.
I spent 20 minutes yesterday trying to get GLM 4.7 to understand that a simple modal on a web page (vanilla JS and HTML!) wasn't displaying when a certain button was clicked. I hooked it up to Chrome MCP in Open Code as well.
It constantly told me that it fixed the problem. In frustration, I opened Claude Code and just typed "Why won't the button with ID 'edit' work???!"
It fixed the problem in one shot. This isn't even a hard problem (and I could have just fixed it myself but I guess sunk cost fallacy).
I've used a bunch of the SOTA models (via my work's Windsurf subscription) for HTML/CSS/JS stuff over the past few months. Mind you, I am not a web developer; these are just internal and personal projects.
My experience is that all of the models seem to do a decent job of writing a whole application from scratch, up to a certain point of complexity. But as soon as you ask them for non-trivial modifications and bugfixes, they _usually_ go deep into rationalized rabbit holes that lead nowhere.
I burned through a lot of credits to try them all and Gemini tended to work the best for the things I was doing. But as always, YMMV.
Amazingly, just yesterday, I had Opus 4.5 crap itself extensively on a fairly simple problem -- it was trying to override a column with an aggregation function while also using that column in a group-by, without referring to the original column by its fully qualified name prefixed with the table -- and in typical Claude fashion it assembled an entire abstraction layer to try and hide the problem under, before finally giving up, deleting the column, and smugly informing me I didn't need it anyway.
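For concreteness, here's a stripped-down, hypothetical reconstruction of the kind of query I mean (table and column names are invented, and exactly how the unqualified name resolves in the GROUP BY depends on the engine, which is the whole trap):

    # Invented schema, just to show the alias-vs-column ambiguity and the fix.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE orders (customer TEXT, amount INTEGER);
        INSERT INTO orders VALUES ('a', 10), ('a', 20), ('b', 5);
    """)

    # Ambiguous: 'amount' is now both the original column and the alias of SUM(amount).
    ambiguous = """
        SELECT customer, SUM(amount) AS amount
        FROM orders
        GROUP BY customer, amount
    """

    # Unambiguous: the table-qualified name can only mean the original column.
    qualified = """
        SELECT customer, SUM(amount) AS amount
        FROM orders
        GROUP BY customer, orders.amount
    """

    try:
        print(con.execute(ambiguous).fetchall())
    except sqlite3.Error as e:
        print("ambiguous form rejected:", e)

    print(con.execute(qualified).fetchall())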
That evening, for kicks, I brought the problem to GLM 4.7 Flash (Flash!) and it one-shot the right solution.
It's not apples to apples, because when it comes down to it LLMs are statistical token extruders, and it's a lot easier to extrude the likely tokens from an isolated query than from a whole workspace that's already been messed up somewhat by said LLM. That, and data is not the plural of anecdote. But still, I'm easily amused, and this amused me. (I haven't otherwise pushed GLM 4.7 much and I don't have a strong opinion about it.)
But seriously, given the pattern Claude seems to exhibit over and over of knitting ever-larger carpets to sweep errors under instead of identifying and addressing root causes, I'm curious what the codebases of people who use it a lot look like.
This has been my consistent experience with every model prior to Opus 4.5, and every single open model I've given a go.
Hopefully we will get there in another 6 months when Opus is distilled into new open models, but I've always been shocked at some of the claims around open models, given that I've been entirely unable to replicate them.
Hell, even Opus 4.5 shits the bed with semi-regularity in my usage on anything that's not completely greenfield, once I give it tasks beyond some unseen complexity boundary.
I use Opus 4.5 for planning, and when I reach my usage limits I fall back to GLM 4.7 just for implementing the plan. It still struggles, even though I configure GLM 4.7 as both the small model and the heavy model in Claude Code.
Couple that with the LoRA and in about 3 seconds you can generate completely personalized images.
The speed alone is a big factor, but if you put the model side by side with seedream and nanobanana and other models, it's definitely in the top 5, and that's a killer combo imho.
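If you want to try the base-model-plus-LoRA combo yourself, it looks roughly like this with Hugging Face diffusers -- the base checkpoint and LoRA path below are placeholders I picked for the sketch, not necessarily the model being discussed here:

    # Rough sketch with placeholder names; any diffusers-compatible base model that
    # supports few-step sampling plus a personalization LoRA follows the same pattern.
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/sdxl-turbo",  # placeholder fast base model
        torch_dtype=torch.float16,
    ).to("cuda")

    # A LoRA fine-tuned on your own subject is what makes the output personalized.
    pipe.load_lora_weights("path/to/my-subject-lora")  # hypothetical local path

    # Few-step sampling is where the seconds-per-image speed comes from.
    image = pipe(
        "a portrait photo of my subject hiking at sunrise",
        num_inference_steps=4,
        guidance_scale=0.0,
    ).images[0]
    image.save("personalized.png")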
I don't know anything about paying for these services, and as a beginner, I worry about running up a huge bill. Do they let you set a limit on how much you pay? I see their pricing examples, but I've never tried one of these.
I have 459 titles on my IMDB watchlist and only a tiny percentage of them (if any) is available on Netflix, but this is anecdotal and might have something to do with where I live.
Netflix outside of the US is a very different experience.
In the US, it's mostly their own productions and older content they explicitly acquired, but elsewhere, especially in markets that don't have a local HBO or Disney streaming service, they have incredible backlogs.
I remember finding basically everything I could wish for on there when traveling in SE Asia almost a decade ago, compared to a still decent offering in Western Europe, and mostly cobwebs in the US.
It's actually an interesting example, because unlike Warp, which tries to be a CLI with AI, Claude defaults to the AI (unless you prefix with an exclamation mark). Maybe it says more about me, but I now find myself asking Claude to write for me even relatively short sed/awk invocations that would have been faster to type by hand. The uncharitable interpretation is that I'm lazy, but the charitable one I tell myself is that I don't want to context-switch and prefer to keep my working memory on the higher-level problem.
In any case, Claude Code is not really a CLI, but rather a conversational interface.
Claude Code is a TUI (the "T" is for "text"), not a CLI (the "CL" is for "command line"). The very point of CC is that you can replace a command line with human-readable text.
You may think that's pedantic but it really isn't. Half-decent TUIs are much closer to GUIs than they are to CLIs because they're interactive and don't suffer from discoverability issues like most CLIs do. The only similarity they have with CLIs is that they both run in a terminal emulator.
"htop" is a TUI, "ps" is a CLI. They can both accomplish most of the same things but the user experience is completely different. With htop you're clicking on columns to sort the live-updating process list, while with "ps" you're reading the manual pages to find the right flags to sort the columns, wrapping it in a "watch" command to get it to update periodically, and piping into "head" to get the top N results (or looking for a ps flag to do the same).
But that's not how it's typically used: it's predominantly used in TUI mode, so the popularity of CC doesn't tell us anything about the popularity of the CLI.