
bgirard · 2026-02-24T03:55:47 1771905347

It's very useful to understand what you're struggling with even if it's not curable. It explains your symptoms and your experience, and helps you understand what you're going through. Understanding that you're suffering from something incurable is also helpful in not looking for other ineffective methods to cure a mysterious illness.

bgirard · 2026-02-20T17:19:10 1771607950

> SpaceX has deorbiting assets on top of depreciating ones

The deorbiting part is redundant. Their satellites are just that, a depreciating asset. Their lifetime seems to be 5 to 7 years. The important claim is whether the total cost, including the launch, can be recouped over that lifetime or not.

bgirard · 2026-02-13T15:53:11 1770997991

Are there benchmarks if we allow the LLM to practice and study the game?

raincole · 2026-02-13T16:57:24 1771001844

You can make one, the balatro bench is open source. But I'm quite sure it'd be crazily expensive for a hobby project. At the end of the day, LLMs can't actually 'practice and learn.'

bgirard · 2026-02-13T20:48:54 1771015734

I've gotten pretty good results by prompting "What did you struggle on? Please update the instructions in <PROMPT/SKILL>" and "Here's your conversation <PASTE>, please see what you struggled with and update <PROMPT/SKILL>".

It's hit or miss, but I've been able to have it self improve on prompts. It can spot mistakes and retain things that didn't work. Similar to how I learned games like Balatro. Playing Balatro blind, you wouldn't know which jokers are coming and have synergy together, or that X strategy is hard to pull off, or that you can retain a card to block it from appearing in shops.

If the LLM can self discover that, and build prompt files that gradually allow it to win at the highest stake, that's an interesting result. And I'd love to know which models do best at that.
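A rough sketch of what that reflect-and-update loop could look like in code. Everything here is hypothetical: `play` stands in for whatever runs a game with the current instructions and returns a transcript plus a score, and `llm` is any callable that takes a prompt string and returns text. Neither is a real API from the Balatro bench or any provider.

```python
from typing import Callable, List, Tuple

# Prompt wording adapted from the comment above; the template itself is an assumption.
REFLECT_TEMPLATE = (
    "Here's your conversation:\n{transcript}\n"
    "Please see what you struggled with and update these instructions:\n{instructions}"
)


def refine_instructions(
    instructions: str,
    play: Callable[[str], Tuple[str, float]],  # hypothetical: returns (transcript, score)
    llm: Callable[[str], str],                 # hypothetical: prompt in, revised instructions out
    rounds: int = 3,
) -> Tuple[float, str]:
    """Play, reflect, rewrite the instructions, repeat. Returns the best (score, instructions)."""
    history: List[Tuple[float, str]] = []
    for _ in range(rounds):
        transcript, score = play(instructions)
        history.append((score, instructions))
        prompt = REFLECT_TEMPLATE.format(transcript=transcript, instructions=instructions)
        instructions = llm(prompt)
    # Score the final revision too, so it can compete with the earlier ones.
    _, score = play(instructions)
    history.append((score, instructions))
    # It's hit or miss, so keep the best-scoring version rather than the latest.
    return max(history, key=lambda pair: pair[0])
```

Keeping the whole history matters because a revision can make things worse; the loop returns the best-scoring instructions seen, not the most recent ones.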

bgirard · 2026-02-10T20:10:55 1770754255

Your experience sounds exactly like mine. My son is very autistic as well. I've had to cut off friends with families, either because they didn't understand meltdowns and were incredibly judgy, blaming my parenting for his ASD meltdowns, or because my autistic son was a "bad influence". God forbid their (later diagnosed) kid have some exposure to a child with different neurodiversities.

That's not even going into my traumatic health care experience getting my son help when he needed it.

So now I have all the hardships of raising a family, and I'm restricted to friendships within the small ND-accepting community of my area. So my support network is incredibly small and I barely get any support. It sucks.

Reading the responses to your story that are nitpicking it over your daycare experience is a perfect representation of the problems that families face.

bgirard · 2026-02-10T15:51:29 1770738689

That's a good question. As someone bootstrapping a few projects on Vercel, this post has me looking over at the pricing sheet more closely.

bgirard · 2026-02-09T06:25:42 1770618342

This to me sounds a lot like the SpaceX conversation:

- Ohh look it can [write a small function / do a small rocket hop] but it can't [write a compiler / get to orbit]!

- Ohh look it can [write a toy compiler / get to orbit] but it can't [compile linux / be reusable]

- Ohh look it can [compile linux / get reusable orbital rocket] but it can't [build a compiler that rivals GCC / turn the rockets around fast enough]

- <Denial despite the insane rate of progress>

There's no reason to keep building this compiler just to prove this point. But I bet it would catch up real fast to GCC with a fraction of the resources if it was guided by a few compiler engineers in the loop.

We're going to see a lot of disruption come from AI assisted development.

jeffreygoesto · 2026-02-09T07:23:03 1770621783

All these people that built GCC and evolved the language did not have the end result in their training set. They invented it. They extrapolated from earlier experiences and knowledge. LLMs only ever accidentally stumble into "between unknown manifolds" when the temperature is high enough; they interpolate with noise (in so many senses). The people building GCC together did not only solve a technical problem. They solved a social one, agreeing on what they wanted to build, for what and why. LLMs are merely copying these decisions.

bgirard · 2026-02-09T08:09:36 1770624576

That's true and I fully agree. I don't think LLMs' progress in writing a toy C compiler diminishes the achievements that the GCC project did.

But also we've just witnessed LLMs go from being a glorified line auto-complete tool to writing a C compiler in ~3 years. And I think that's something. It's also worth noting how we keep moving the goalposts.

direwolf20 · 2026-02-09T11:34:42 1770636882

GP: "it didn't write a C compiler, it copied other compilers. Writing one from scratch is a lot harder."

You: "but look! It wrote a C compiler!"

bwfan123 · 2026-02-09T17:37:01 1770658621

The pattern-matching rote student is acing the class. No surprises here. There is no need to understand the subject from first principles to ace tests. The majority of high-school and college kids know this.

yourapostasy · 2026-02-09T14:26:15 1770647175

> LLMs are merely copying these decisions.

This I strongly suspect is the crux of the boundaries of their current usefulness. Without accompanying legibility/visibility into the lineage of those decisions, LLMs will be unable to copy the reasoning behind the "why", missing out on a pile of context that I'm guessing is necessary (just like with people) to come up to speed on the decision flow going forward as the mathematical space for the gradient descent to traverse gets both bigger and more complex.

We're already seeing glimmers of this as the frontier labs are reporting that explaining the "why" behind prompts is getting better results in a non-trivial number of cases.

I wonder whether we're barely scratching the surface of just how powerful natural language is.

itsyonas · 2026-02-09T07:48:16 1770623296

All right, but perhaps they should also list the grand promises they made and failed to deliver on. They said they would have fully self-driving cars by 2016. They said they would land on Mars in 2018, yet almost a decade has passed since then. They said they would have Tesla's fully self-driving robo-taxis by 2020 and human-to-human telepathy via Neuralink brain implants by 2025–2027.

> - <Denial despite the insane rate of progress>

Sure, but not by what was actually promised. There may also be fundamental limitations to what the current architecture of LLMs can achieve. The vast majority of LLMs are still based on Transformers, which were introduced almost a decade ago. If you look at the history of AI, it wouldn't be the first time that a roadblock stalled progress for decades.

> But I bet it would catch up real fast to GCC with a fraction of the resources if it was guided by a few compiler engineers in the loop.

Okay, so at that point, we would have proved that AI can replicate an existing software project using hundreds of thousands of dollars of computing power and probably millions of dollars in human labour costs from highly skilled domain experts.

jopsen · 2026-02-09T21:22:20 1770672140

There's an argument to be made that replicating existing software is extremely useful.

Most of the time when you're writing a compiler for a new language, you'll be doing things that have been done before.

Because most of the concepts in your language are brought along from somewhere else.

That said: I'd always want a compiler and language designs to be well considered. Ideally, the authors have some proofs of soundness in their heads.

Perhaps LLMs will make formal verification more feasible (from a cost perspective), and then our minds about what reliable software is might change.

raincole · 2026-02-09T06:43:32 1770619412

> the insane rate of progress

Yeah but the speed of progress can never catch the speed of a moving goalpost!

wrxd · 2026-02-09T07:02:12 1770620532

What about the hype? If you claim your LLM generated compiler is functionally on par with GCC I’d expect it to match your claim.

I still won’t use it until it also matches all the non-functional requirements, but you’re free to go and recompile all the software you use with it.

friendzis · 2026-02-09T06:55:06 1770620106

> Yeah but the speed of progress can never catch the speed of a moving goalpost!

How do you like those coast-to-coast self drives since the end of 2017?

samultio · 2026-02-09T07:58:41 1770623921

Training data only teaches it how to reach the goalpost, not how to overtake it.

codethief · 2026-02-09T11:46:11 1770637571

Are we sure about that? I mean, we have seen that LLMs are able to generalize to some degree. So I don't see a reason why you couldn't put an agent in a loop with a profiler and have it try to optimize the code. Will it come up with entirely novel ideas? Unlikely. Could it potentially combine existing ideas in interesting, novel ways that would lead to CCC outperforming GCC? I think so. Will it get stuck along the way? Almost certainly.

andriamanitra · 2026-02-09T12:44:43 1770641083

Would you want it to? The further the goal posts are the more progress we are making, and that's good, no? Trying to make it into a religious debate between believers and non-believers is silly. Neither side can predict the future, and, even if they could, winning the debate is not worth anything!

What is interesting is what we can do with LLMs today and what we would like them to be able to do tomorrow, so we can keep developing them in a good direction. Whether or not you (or I) believe it can do that thing tomorrow is thoroughly uninteresting.

gjulianm · 2026-02-09T11:31:58 1770636718

The goalpost is not moving. The issue is that AI generates code that kinda looks ok but usually has deep issues, especially the more complex the code is. And that's not really being improved.

Ygg2 · 2026-02-09T06:44:26 1770619466

You can be wrong on every step of your approximation and still be right in the aggregate. E.g. order of magnitude estimate, where every step is wrong but mistakes cancel out.

Human crews on Mars is just as far fetched as it ever was. Maybe even farther due to Starlink trying to achieve Kessler syndrome by 2050.

forty · 2026-02-09T06:59:25 1770620365

There are two questions which can be asked for both. The first one is "can these techs achieve their goals?" which is what you seem to be debating. The other question is "is a successful outcome of these techs desirable at all?". One is making us pollute space faster than ever, as if we did not fuck the rest enough. The other will make a few very rich people even richer and probably everyone else poorer.

Interesting that people call this "progress" :)

littlestymaar · 2026-02-09T07:03:00 1770620580

> This to me sounds a lot like the SpaceX conversation

The problem is that it is absolutely indiscernible from the Theranos conversation as well…

If Anthropic stopped making lies about the current capability of their models (like “it compiles the Linux kernel” here, but it's far from the first time they do that), maybe neutral people would give them the benefit of the doubt.

For one grifter that happens to succeed at delivering his grandiose promises (Elon), how many grifters will fail?

gordonhart · 2026-02-09T13:54:39 1770645279

The difference I see is that, after "get to orbit", the goalposts for SpaceX are things that have never been done before, whereas for LLMs the goalposts are all things that skilled humans have been able to do for decades.

benreesman · 2026-02-09T07:42:03 1770622923

AI assist in software engineering is unambiguously demonstrated to some degree at this point: the "no LLM output in my project" stance is cope.

But "reliable, durable, scalable outcomes in adversarial real-world scenarios" is not convincingly demonstrated in public, the asterisks are load bearing as GPT 5.2 Pro would say.

That game is still on, and AI assist beyond FIM is still premature for safety critical or generally outcome critical applications: i.e. you can do it if it doesn't have to work.

I've got a horse in this race which is formal methods as the methodology and AI assist as the thing that makes it economically viable. My stuff is north of demonstrated in the small and south of proven in the large, it's still a bet.

But I like the stock. The no free lunch thing here is that AI can turn specifications into code if the specification is already so precise that it is code.

The irreducible heavy lift is that someone has to prompt it, and if the input is vibes the output will be vibes. If the input is zero sorry rigor... you've just moved the cost around.

The modern software industry is an expensive exercise in "how do we capture all the value and redirect it from expert computer scientists to some arbitrary financier".

You can't. Not at less than the cost of the experts if the outcomes are non-negotiable.

a1o · 2026-02-09T12:20:55 1770639655

What is FIM ?

delaminator · 2026-02-09T10:32:57 1770633177

In 1908 the Model T could do 45mph.

In 1935 the Auburn 851 S/C Speedster hit 100mph

In 1955 the Mercedes-Benz 300 SL Gullwing did 161mph

In 2025 the Yangwang U9 Xtreme hit 308mph

progress is a decaying exponential - Tsiolkovsky's tyranny

CleaveIt2Beaver · 2026-02-09T14:48:34 1770648514

And all these improvements past 1935 have been rendered irrelevant to the daily driver by safety regulations (I'll limit this claim to most of the continental US to avoid straying beyond my experience.)

a1o · 2026-02-09T12:23:20 1770639800

These specific points look like a line if you plot
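A quick way to check that is a least-squares fit over the four (year, top speed) pairs quoted above. This is only a sketch using those four numbers as given; it says nothing about cars outside that list.

```python
# (year, mph) pairs exactly as quoted in the parent comment.
points = [(1908, 45), (1935, 100), (1955, 161), (2025, 308)]


def linear_fit(pts):
    """Ordinary least squares for y = slope * x + intercept, plus Pearson's r."""
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    syy = sum((y - mean_y) ** 2 for _, y in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


slope, intercept, r = linear_fit(points)
print(f"slope = {slope:.2f} mph/year, r = {r:.3f}")
```

A correlation coefficient near 1 means a straight line describes these four points well, which is the opposite of a decaying curve.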

bgirard · 2026-02-06T21:41:08 1770414068

I like how the author shared the prompt + conversation transcripts. I wish OAI / Anthropic would do that when they share content demos.

bgirard · 2026-02-05T19:58:31 1770321511

Doesn't feel like a useful data point without more context. For some hard bugs I'd be thrilled to wait 30 minutes for a fix, for a trivial CSS fix not so much. I've spent weeks+ of my career fixing single bugs. Context is everything.

smith7018 · 2026-02-05T20:21:10 1770322870

Sure, but I've never experienced a 20 minute wait with CC before. It was an architectural question but it would have taken a couple minutes with a definitive answer on 4.5.

sejje · 2026-02-06T19:11:11 1770405071

> I've spent weeks+ of my career fixing single bugs.

Same, same. It's not a useful data point at all.

bug: llm alignment

timeframe to fix : probably never

bgirard · 2026-02-05T19:49:40 1770320980

> Using the develop web game skill and preselected, generic follow-up prompts like "fix the bug" or "improve the game", GPT‑5.3-Codex iterated on the games autonomously over millions of tokens.

I wish they would share the full conversation, token counts and more. I'd like to have a better sense of how they normalize these comparisons across versions. Is this a 3-prompt 10m token game? A 30-prompt 100m token game? Are both models using similar prompts/token counts?

I vibe coded a small factorio web clone [1] that got pretty far using the models from last summer. I'd love to compare against this.

[1] https://factory-gpt.vercel.app/

veb · 2026-02-05T19:54:07 1770321247

I just wanted to say that's a pretty cool demo! I hadn't realised people were using it for things like this.

bgirard · 2026-02-05T20:03:00 1770321780

Thank you. There's a demo save to get the full feel of it quickly. There's also a 2D-ASCII and 3D render you can hotswap between. The 3D models are generated with Meshy. The entire game is 'AI slop'. I intentionally did no code reviews to see where that would get me. Some prompts were very specific but other prompts were just 'add a research of your choice'.

This was built using old versions of Codex, Gemini and Claude. I'll probably work on it more soon to try the latest models.

gspetr · 2026-02-06T03:46:54 1770349614

Any estimates on how much it cost you? In terms of total real world time, money, and time spent by the agents.

bgirard · 2026-02-06T05:38:12 1770356292

About ~$300: $200 for Claude Max subscription, $20 for Vercel, $20 for Codex, $20 for Meshy.

I think these days the $200 Max subscription wouldn't be needed. I bet with these latest models you can make do with mixing two $20/mo subscriptions.

Real time was 2 weeks of watching the agents while watching TV and playing games, waiting for limit resets, etc... Very little dedicated focused time.

bgirard · 2026-02-03T17:24:32 1770139472

The switching cost is so low that I find it's easier and better value to have two $20/mo subscriptions from different providers than a $200/mo subscription with the frontier model of the month. Reliability and model diversity are a bonus.

davedx · 2026-02-04T12:21:31 1770207691

Yes that's exactly what I have too.