Hacker News | amrrs's comments

Honestly, the problem with these is how anecdotal they are: how can someone reproduce this? I love when labs go beyond traditional benchies like MMLU and friends, but these kinds of statements don't help much either, unless it's a proper controlled study!


In a sense it's better than a benchmark: it's a practical, real-world, highly quantifiable improvement, assuming there are no quality regressions and the code passes all test cases. I have been experimenting with this workflow across a variety of computational domains and have achieved consistent results with both Opus and GPT. My coworkers have independently used Opus for optimization suggestions on services in prod, and they've led to much better performance (3x in some cases).

A more empirical test would be good for everyone (i.e. on equal hardware, give each agent the goal to implement an algorithm and make it as fast as possible, then quantify relative speed improvements that pass all test cases).
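The protocol sketched above (equal hardware, a correctness gate, then relative timing) can be mocked up in a few lines. This is a hypothetical harness; `baseline_sort` and `agent_sort` are stand-in implementations invented for illustration, not anything the labs actually run:

```python
import timeit

def baseline_sort(xs):
    # Reference implementation the agent is asked to beat:
    # a deliberately slow insertion sort.
    out = list(xs)
    for i in range(1, len(out)):
        j = i
        while j > 0 and out[j - 1] > out[j]:
            out[j - 1], out[j] = out[j], out[j - 1]
            j -= 1
    return out

def agent_sort(xs):
    # Stand-in for an agent-optimized submission.
    return sorted(xs)

TEST_CASES = [[], [1], [3, 1, 2], list(range(500, 0, -1))]

def passes_all(fn):
    # Correctness gate: a speedup only counts if every test case passes.
    return all(fn(case) == sorted(case) for case in TEST_CASES)

def relative_speedup(candidate, reference, data, repeats=3):
    # Best-of-N timing for both implementations on the same input,
    # reported as a ratio (2.0 means "twice as fast as the reference").
    t_ref = min(timeit.repeat(lambda: reference(data), number=10, repeat=repeats))
    t_new = min(timeit.repeat(lambda: candidate(data), number=10, repeat=repeats))
    return t_ref / t_new

if passes_all(agent_sort):
    ratio = relative_speedup(agent_sort, baseline_sort, list(range(500, 0, -1)))
    print(f"speedup: {ratio:.1f}x")
```

The point is that the quality gate runs before the stopwatch, so "faster but wrong" submissions score nothing.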


Yeah but like what if they're sorta embellishing it or just lying? That's the issue with it not being reproducible.


The tension here is that what customers need to reproduce is this result on their own problem. To measure this you need extensive evals on private data.

OpenAI simply won’t share the data you need to reproduce this in the way you’d hope for an academic paper.

It’s an engineering result, not a scientific one.


Oh, come on, if they do well on benchmarks people question how applicable they are in reality. If they do well in reality people complain that it's not a reproducible benchmark...


That's easily explained by those being two different people with two different opinions?


And together they make one single community that's effectively NEVER happy.


is that str.replace(g,t) ?


No. I am actually too highly regarded for such a measly single-dimensional game.


Why are you religiously defending Israel?

And you already bet this story is a fabrication as well.

This is exactly who the media takes advantage of, not the one who waits for an investigation and acts rationally.

Going by your recent comments, I'd bet you're just an Israeli propagandist. Would you be happy with that assessment?


Iran tends to lie about these things while Israel usually tells the truth, at least after running an investigation. It's pretty simple: one is a dictatorship without free media, and the other one isn't. It's easy to lie when you can tell the newspapers what to write, and much harder when they're doing their job. You want an example? Khamenei. Iran says he's safe and wasn't hurt. Israel says he's dead. Let's see.


> while Israel usually says the truth, at least after running an investigation

You can't be serious about that statement. At best it reflects overwhelming naivete about how governments (let alone those engaged in war) work. At worst, it's a deliberate attempt at misinformation.


haha


Have you tried the new GLM 4.7?


I've been using GLM 4.7 alongside Opus 4.5 and I can't believe how bad it is. Seriously.

I spent 20 minutes yesterday trying to get GLM 4.7 to understand that a simple modal on a web page (vanilla JS and HTML!) wasn't displaying when a certain button was clicked. I hooked it up to Chrome MCP in Open Code as well.

It constantly told me that it fixed the problem. In frustration, I opened Claude Code and just typed "Why won't the button with ID 'edit' work???!"

It fixed the problem in one shot. This isn't even a hard problem (and I could have just fixed it myself but I guess sunk cost fallacy).


I've used a bunch of the SOTA models (via my work's Windsurf subscription) for HTML/CSS/JS stuff over the past few months. Mind you, I am not a web developer, these are just internal and personal projects.

My experience is that all of the models seem to do a decent job of writing a whole application from scratch, up to a certain point of complexity. But as soon as you ask them for non-trivial modifications and bugfixes, they _usually_ go deep into rationalized rabbit holes to nowhere.

I burned through a lot of credits to try them all and Gemini tended to work the best for the things I was doing. But as always, YMMV.


Exactly the same feedback


Amazingly, just yesterday, I had Opus 4.5 crap itself extensively on a fairly simple problem -- it was trying to override a column with an aggregation function while also using it in a group-by without referring to the original column by its full qualified name prefixed with the table -- and in typical Claude fashion it assembled an entire abstraction layer to try and hide the problem under, before finally giving up, deleting the column, and smugly informing me I didn't need it anyway.

That evening, for kicks, I brought the problem to GLM 4.7 Flash (Flash!) and it one-shot the right solution.

It's not apples to apples, because when it comes down to it LLMs are statistical token extruders, and it's a lot easier to extrude the likely tokens from an isolated query than from a whole workspace that's already been messed up somewhat by said LLM. That, and data is not the plural of anecdote. But still, I'm easily amused, and this amused me. (I haven't otherwise pushed GLM 4.7 much and I don't have a strong opinion about it.)

But seriously, given the consistent pattern Claude seems to exhibit over and over of knitting ever larger carpets to sweep errors under, instead of identifying and addressing root causes, I'm curious what the codebases of people who use it a lot look like.
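The failure mode described above can be reduced to a tiny repro. This is a hypothetical sqlite3 sketch with invented table and column names, not the actual query from the anecdote:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('east', 10), ('east', 5), ('west', 5);
""")

# The SELECT list overrides "amount" with an aggregate of the same name.
# In some engines a bare "GROUP BY amount" can then resolve to the alias
# (an aggregate, which is illegal in GROUP BY); qualifying it as
# sales.amount pins the grouping to the original column.
rows = conn.execute("""
    SELECT sales.amount, SUM(sales.amount) AS amount
    FROM sales
    GROUP BY sales.amount
    ORDER BY sales.amount
""").fetchall()
print(rows)  # [(5, 10), (10, 10)]
```

The fix is a two-token change (the table qualifier), which is exactly the kind of root cause that gets buried under an abstraction layer instead.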


> I can't believe how bad it is

This has been my consistent experience with every model prior to Opus 4.5, and every single open model I've given a go.

Hopefully we will get there in another 6 months when Opus is distilled into new open models, but I've always been shocked at some of the claims around open models, when I've been entirely unable to replicate them.

Hell, even Opus 4.5 shits the bed with semi-regularity on anything that's not completely greenfield for my usage, once I'm giving it tasks beyond some unseen complexity boundary.


Yes I did; it's not on par with Opus 4.5.

I use Opus 4.5 for planning; when I reach my usage limits I fall back to GLM 4.7 just for implementing the plan. It still struggles, even though I configure GLM 4.7 as both the smaller model and the heavier model in Claude Code.


Thanks for sharing your repo, looks super cool. I'm planning to try it out. Is it based on MLX or just HF transformers?


Thank you, just transformers.


Unethical conduct is negotiating with Zuck? :D Jokes aside, it must be something serious enough for them to part ways with a co-founder.



On fal, it often takes less than a second.

https://fal.ai/models/fal-ai/z-image/turbo/api

Couple that with a LoRA and in about 3 seconds you can generate completely personalized images.

The speed alone is a big factor, but if you put the model side by side with Seedream, Nano Banana, and other models, it's definitely in the top 5, and that's a killer combo imho.
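For anyone who wants to script this rather than use the playground, here's a minimal sketch using the fal_client Python package. The model id comes from the linked page, but the `loras` argument shape and the result structure are assumptions borrowed from other fal image models and may differ for Z-Image, so the network call is guarded behind a credentials check:

```python
import os

MODEL_ID = "fal-ai/z-image/turbo"  # id taken from the linked fal page

def build_arguments(prompt, lora_url=None, lora_scale=1.0):
    # Assemble the request payload; "loras" mirrors the shape other fal
    # image models use and is an assumption for this particular model.
    args = {"prompt": prompt}
    if lora_url:
        args["loras"] = [{"path": lora_url, "scale": lora_scale}]
    return args

args = build_arguments(
    "a watercolor fox",
    lora_url="https://example.com/my-style.safetensors",  # hypothetical LoRA
)

if os.environ.get("FAL_KEY"):  # only hit the API when credentials exist
    import fal_client
    result = fal_client.subscribe(MODEL_ID, arguments=args)
    # Assumed result shape; check the model's API tab for the real schema.
    print(result["images"][0]["url"])
```

With a personal LoRA baked into the payload, each call returns a personalized image, which is the ~3-second combo described above.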


I don't know anything about paying for these services, and as a beginner, I worry about running up a huge bill. Do they let you set a limit on how much you pay? I see their pricing examples, but I've never tried one of these.

https://fal.ai/pricing


It works with prepaid credits, so there should be no risk. Minimum credit amount is $10, though.


This. You can also run most (if not all) of the models that fal.ai hosts directly from the playground tab, including Z-Image Turbo.

https://fal.ai/models/fal-ai/z-image/turbo


For images I like Runware: https://runware.ai/. Super cheap and super fast; they also support LoRAs and you can upload your own models.

And it works with credits.


Why the downvote? Are they a scam?


Honestly speaking, Netflix has a good catalog, much more comparable to Hollywood's now. Take the latest Frankenstein, for example.

Don't look only at the series. They acquire good titles and also produce some good ones of their own.


I have 459 titles on my IMDb watchlist and only a tiny percentage of it is available on Netflix (if at all), but this is anecdotal and might have something to do with where I live.


Netflix outside of the US is a very different experience.

In the US, it's mostly their own productions and older content they explicitly acquired, but elsewhere, especially in markets that don't have a local HBO or Disney streaming service, they have incredible backlogs.

I remember finding basically everything I could wish for on there when traveling in SE Asia almost a decade ago, compared to a still decent offering in Western Europe, and mostly cobwebs in the US.


459!? It must take a while to check your list…


After checking 20 titles and getting no results you can notice the pattern.


If anything, Claude Code's success disproved this.


It's actually an interesting example, because unlike Warp that tries to be a CLI with AI, Claude defaults to the AI (unless you prefix with an exclamation mark). Maybe it says more about me, but I now find myself asking Claude to write for me even relatively short sed/awk invocations that would have been faster to type by hand. The uncharitable interpretation is that I'm lazy, but the charitable one I tell myself is that I don't want to context-switch and prefer to keep my working memory at the higher level problem.

In any case, Claude Code is not really a CLI, but rather a conversational interface.


Claude Code is a TUI ("text user interface"), not a CLI ("command line interface"). The very point of CC is that you can replace command lines with human-readable text.


Let's not be overly reductive, Claude Code is a TUI with a CLI for all input including slash commands.


You may think that's pedantic but it really isn't. Half-decent TUIs are much closer to GUIs than they are to CLIs because they're interactive and don't suffer from discoverability issues like most CLIs do. The only similarity they have with CLIs is that they both run in a terminal emulator.

"htop" is a TUI, "ps" is a CLI. They can both accomplish most of the same things but the user experience is completely different. With htop you're clicking on columns to sort the live-updating process list, while with "ps" you're reading the manual pages to find the right flags to sort the columns, wrapping it in a "watch" command to get it to update periodically, and piping into "head" to get the top N results (or looking for a ps flag to do the same).


Claude Code is a Terminal User Interface, not a Command Line Interface.


Well, it is if you just run

claude -p "Question goes here"

As that will print the answer only and exit.


But that's not how it's typically used; it's predominantly used in TUI mode, so the popularity of CC doesn't tell us anything about the popularity of CLIs.


