It's very useful to understand what you're suffering from even if it's not curable. It explains your symptoms and your experience, and helps you understand what you're going through. Knowing that you're suffering from something incurable also helps you stop looking for other, ineffective methods to cure a mysterious illness.
> SpaceX has deorbiting assets on top of depreciating ones
The deorbiting part is redundant. Their satellites are just that, a depreciating asset. Their lifetime seems to be 5 to 7 years. The important question is whether the total cost, including the launch, can be recouped over that lifetime or not.
You can make one, the balatro bench is open source. But I'm quite sure it'd be crazily expensive for a hobby project. At the end of the day, LLMs can't actually 'practice and learn.'
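To make that question concrete, here's a rough break-even sketch. Every dollar figure below is a made-up placeholder, not SpaceX data; only the 5 to 7 year lifetime comes from the comment above, and the function names are just for illustration.

```python
def break_even_years(build_cost, launch_cost, yearly_net_revenue):
    """Years of operation needed to pay back build plus launch cost."""
    return (build_cost + launch_cost) / yearly_net_revenue


def recoups_cost(build_cost, launch_cost, yearly_net_revenue, lifetime_years):
    """True if net revenue over the satellite's lifetime covers its total cost."""
    return break_even_years(build_cost, launch_cost, yearly_net_revenue) <= lifetime_years


# Hypothetical per-satellite figures in dollars (placeholders only).
BUILD, LAUNCH, NET_PER_YEAR = 500_000, 500_000, 250_000

for lifetime in (5, 7):
    print(lifetime, recoups_cost(BUILD, LAUNCH, NET_PER_YEAR, lifetime))
```

The point is only that the deorbit date enters the math as `lifetime_years`, the same way depreciation would for any other asset.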
I've gotten pretty good results by prompting "What did you struggle on? Please update the instructions in <PROMPT/SKILL>" and "Here's your conversation <PASTE>, please see what you struggled with and update <PROMPT/SKILL>".
It's hit or miss, but I've been able to have it self improve on prompts. It can spot mistakes and retain things that didn't work. Similar to how I learned games like Balatro. Playing Balatro blind, you wouldn't know which jokers are coming and have synergy together, or that X strategy is hard to pull off, or that you can retain a card to block it from appearing in shops.
If the LLM can self discover that, and build prompt files that gradually allow it to win at the highest stake, that's an interesting result. And I'd love to know which models do best at that.
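The play-then-reflect loop described above could be sketched roughly like this. `play` and `reflect` are hypothetical stand-ins for model calls (nothing here talks to a real API), and the toy versions exist only so the loop runs end to end.

```python
from typing import Callable, List, Tuple


def self_improve(
    instructions: str,
    play: Callable[[str], Tuple[float, str]],
    reflect: Callable[[str, str], str],
    rounds: int = 3,
) -> Tuple[str, List[float]]:
    """Alternate between playing with the current instructions and revising them.

    play(instructions) -> (score, transcript)
    reflect(instructions, transcript) -> revised instructions
    """
    scores = []
    for _ in range(rounds):
        score, transcript = play(instructions)
        scores.append(score)
        # "What did you struggle with? Please update the instructions."
        instructions = reflect(instructions, transcript)
    return instructions, scores


# Toy stand-ins: every lesson already in the prompt is worth one point.
def fake_play(instructions: str) -> Tuple[float, str]:
    lessons = instructions.count("\n- ")
    return float(lessons), f"run with {lessons} lessons"


def fake_reflect(instructions: str, transcript: str) -> str:
    return instructions + "\n- lesson from: " + transcript


final, history = self_improve("Play Balatro.", fake_play, fake_reflect)
print(history)
```

Comparing models would then come down to comparing their `history` curves over the same number of rounds.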
Your experience sounds exactly like mine. My son is very autistic as well. I've had to cut off friends with families, either because they didn't understand meltdowns and were incredibly judgy, blaming my parenting for his ASD meltdowns, or because my autistic son was a "bad influence". God forbid their (later diagnosed) kid have some exposure to a child with different neurodiversities.
That's not even going into my traumatic health care experience getting my son help when he needed it.
So now I have all the hardships of raising a family, and I'm restricted to friendships within the small ND-accepting community in my area. So my support network is incredibly small and I barely get any support. It sucks.
Reading the responses to your story that are nitpicking it over your daycare experience is a perfect representation of the problems that families face.
This to me sounds a lot like the SpaceX conversation:
- Ohh look it can [write a small function / do a small rocket hop] but it can't [write a compiler / get to orbit]!
- Ohh look it can [write a toy compiler / get to orbit] but it can't [compile linux / be reusable]
- Ohh look it can [compile linux / get reusable orbital rocket] but it can't [build a compiler that rivals GCC / turn the rockets around fast enough]
- <Denial despite the insane rate of progress>
There's no reason to keep building this compiler just to prove this point. But I bet it would catch up real fast to GCC with a fraction of the resources if it was guided by a few compiler engineers in the loop.
We're going to see a lot of disruption come from AI assisted development.
All the people who built GCC and evolved the language did not have the end result in their training set. They invented it. They extrapolated from earlier experience and knowledge. LLMs only ever accidentally stumble "between unknown manifolds" when the temperature is high enough; they interpolate with noise (in so many senses). The people building GCC together did not only solve a technical problem. They solved a social one: agreeing on what they wanted to build, for what, and why. LLMs are merely copying these decisions.
That's true and I fully agree. I don't think LLMs' progress in writing a toy C compiler diminishes what the GCC project achieved.
But we've also just witnessed LLMs go from being a glorified line auto-complete tool to writing a C compiler in ~3 years. I think that's something. It's also worth noting how we keep moving the goalposts.
The pattern matching rote-student is acing the class. No surprises here.
There is no need to understand the subject from first principles to ace tests.
The majority of high-school and college kids know this.
This, I strongly suspect, is the crux of the boundaries of their current usefulness. Without accompanying legibility/visibility into the lineage of those decisions, LLMs will be unable to copy the reasoning behind the "why", missing out on a pile of context that I'm guessing is necessary (just like with people) to come up to speed on the decision flow going forward, as the mathematical space for the gradient descent to traverse gets both bigger and more complex.
We're already seeing glimmers of this as the frontier labs are reporting that explaining the "why" behind prompts is getting better results in a non-trivial number of cases.
I wonder whether we're barely scratching the surface of just how powerful natural language is.
All right, but perhaps they should also list the grand promises they made and failed to deliver on. They said they would have fully self-driving cars by 2016. They said they would land on Mars in 2018, yet almost a decade has passed since then. They said they would have Tesla's fully self-driving robo-taxis by 2020 and human-to-human telepathy via Neuralink brain implants by 2025–2027.
> - <Denial despite the insane rate of progress>
Sure, but not by what was actually promised. There may also be fundamental limitations to what the current architecture of LLMs can achieve. The vast majority of LLMs are still based on Transformers, which were introduced almost a decade ago. If you look at the history of AI, it wouldn't be the first time that a roadblock stalled progress for decades.
> But I bet it would catch up real fast to GCC with a fraction of the resources if it was guided by a few compiler engineers in the loop.
Okay, so at that point, we would have proved that AI can replicate an existing software project using hundreds of thousands of dollars of computing power and probably millions of dollars in human labour costs from highly skilled domain experts.
Are we sure about that? I mean, we have seen that LLMs are able to generalize to some degree. So I don't see a reason why you couldn't put an agent in a loop with a profiler and have it try to optimize the code. Will it come up with entirely novel ideas? Unlikely. Could it potentially combine existing ideas in interesting, novel ways that would lead to CCC outperforming GCC? I think so. Will it get stuck along the way? Almost certainly.
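A minimal sketch of that kind of loop, assuming the agent's proposals arrive as plain functions: keep a candidate only if it still passes the checks and measures cheaper. The `FAKE_COST` table is a stand-in for real profiler output (in practice you'd time each candidate), and all names here are hypothetical.

```python
from typing import Callable, Iterable, Tuple

Impl = Callable[[int], int]


def optimize_loop(
    baseline: Impl,
    proposals: Iterable[Impl],
    is_correct: Callable[[Impl], bool],
    cost: Callable[[Impl], float],
) -> Tuple[Impl, float]:
    """Greedy loop: accept a proposal only if it is correct and strictly cheaper."""
    best, best_cost = baseline, cost(baseline)
    for candidate in proposals:
        if not is_correct(candidate):
            continue  # wrong or "stuck" proposals are simply discarded
        candidate_cost = cost(candidate)
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
    return best, best_cost


def slow_sum(n: int) -> int:
    return sum(range(n + 1))


def fast_sum(n: int) -> int:
    return n * (n + 1) // 2  # a well-known closed form, not a novel idea


def wrong_sum(n: int) -> int:
    return n * n  # cheap but incorrect


def is_correct(impl: Impl) -> bool:
    return all(impl(n) == slow_sum(n) for n in range(50))


# Stand-in for profiler measurements; lower is better.
FAKE_COST = {slow_sum: 100.0, wrong_sum: 1.0, fast_sum: 2.0}

best, best_cost = optimize_loop(
    slow_sum, [wrong_sum, fast_sum], is_correct, FAKE_COST.__getitem__
)
print(best.__name__, best_cost)
```

Note that the cheapest proposal gets rejected for being wrong, which is the part a test suite has to carry.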
Would you want it to? The further the goal posts are the more progress we are making, and that's good, no? Trying to make it into a religious debate between believers and non-believers is silly. Neither side can predict the future, and, even if they could, winning the debate is not worth anything!
What is interesting is what we can do with LLMs today and what we would like them to be able to do tomorrow, so we can keep developing them in a good direction. Whether or not you (or I) believe it can do that thing tomorrow is thoroughly uninteresting.
The goalpost is not moving. The issue is that AI generates code that kinda looks ok but usually has deep issues, especially the more complex the code is. And that's not really being improved.
You can be wrong on every step of your approximation and still be right in the aggregate. E.g. order of magnitude estimate, where every step is wrong but mistakes cancel out.
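A toy illustration of that, with invented numbers: each guess below is off by a factor of two, alternating high and low, yet the product matches exactly. Real estimates rarely cancel this neatly; the factors are chosen so that they do.

```python
import math

# Hypothetical "true" factors of some quantity, and a rough guess for each.
true_factors = [4, 50, 8, 300]
guesses = [8, 25, 16, 150]  # every guess is wrong by exactly 2x

true_total = math.prod(true_factors)
guess_total = math.prod(guesses)

# Per-step error ratio (guess / truth).
step_errors = [g / t for g, t in zip(guesses, true_factors)]

print(step_errors)
print(true_total, guess_total)
```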
Human crews on Mars is just as far fetched as it ever was. Maybe even farther due to Starlink trying to achieve Kessler syndrome by 2050.
There are two questions that can be asked of both. The first is "can these technologies achieve their goals?", which is what you seem to be debating. The other is "is a successful outcome of these technologies desirable at all?". One is making us pollute space faster than ever, as if we hadn't fucked up the rest enough. The other will make a few very rich people even richer and probably everyone else poorer.
> This to me sounds a lot like the SpaceX conversation
The problem is that it is absolutely indiscernible from the Theranos conversation as well…
If Anthropic stopped lying about the current capabilities of their models (like “it compiles the Linux kernel” here, and it's far from the first time they've done that), maybe neutral people would give them the benefit of the doubt.
For every grifter who happens to deliver on his grandiose promises (Elon), how many grifters will fail?
The difference I see is that, after "get to orbit", the goalposts for SpaceX are things that have never been done before, whereas for LLMs the goalposts are all things that skilled humans have been able to do for decades.
AI assist in software engineering is unambiguously demonstrated to some degree at this point: the "no LLM output in my project" stance is cope.
But "reliable, durable, scalable outcomes in adversarial real-world scenarios" is not convincingly demonstrated in public, the asterisks are load bearing as GPT 5.2 Pro would say.
That game is still on, and AI assist beyond FIM is still premature for safety critical or generally outcome critical applications: i.e. you can do it if it doesn't have to work.
I've got a horse in this race which is formal methods as the methodology and AI assist as the thing that makes it economically viable. My stuff is north of demonstrated in the small and south of proven in the large, it's still a bet.
But I like the stock. The no free lunch thing here is that AI can turn specifications into code if the specification is already so precise that it is code.
The irreducible heavy lift is that someone has to prompt it, and if the input is vibes the output will be vibes. If the input is zero sorry rigor... you've just moved the cost around.
The modern software industry is an expensive exercise in "how do we capture all the value and redirect it from expert computer scientists to some arbitrary financier".
You can't. Not at less than the cost of the experts if the outcomes are non-negotiable.
And all these improvements past 1935 have been rendered irrelevant to the daily driver by safety regulations (I'll limit this claim to most of the continental US to avoid straying beyond my experience.)
Doesn't feel like a useful data point without more context. For some hard bugs I'd be thrilled to wait 30 minutes for a fix, for a trivial CSS fix not so much. I've spent weeks+ of my career fixing single bugs. Context is everything.
Sure, but I've never experienced a 20 minute wait with CC before. It was an architectural question but it would have taken a couple minutes with a definitive answer on 4.5.
> Using the develop web game skill and preselected, generic follow-up prompts like "fix the bug" or "improve the game", GPT‑5.3-Codex iterated on the games autonomously over millions of tokens.
I wish they would share the full conversation, token counts and more. I'd like to have a better sense of how they normalize these comparisons across versions. Is this a 3-prompt 10m token game? A 30-prompt 100m token game? Are both models using similar prompts/token counts?
I vibe coded a small Factorio web clone [1] that got pretty far using the models from last summer. I'd love to compare against this.
Thank you. There's a demo save to get the full feel of it quickly. There's also a 2D-ASCII and 3D render you can hotswap between. The 3D models are generated with Meshy. The entire game is 'AI slop'. I intentionally did no code reviews to see where that would get me. Some prompts were very specific but other prompts were just 'add a research of your choice'.
This was built using old versions of Codex, Gemini and Claude. I'll probably work on it more soon to try the latest models.
The switching cost is so low that I find it's easier and better value to have two $20/mo subscriptions from different providers than a $200/mo subscription with the frontier model of the month. Reliability and model diversity are a bonus.