The hyperscalers do not want us running models at the edge and they will spend infinite amounts of circular fake money to ensure hardware remains prohibitively expensive forever.
Oh, it gets worse than that. The money driving all of this OpenAI spending was borrowed from Japanese banks at cheap interest rates (by SoftBank, for the Stargate project), and the Japanese banks can lend it because of deposits from Japanese people and companies, with the collateral being stocks whose value is inflated by people investing their hard-earned money into the markets.
So in a way they are using real, hard-earned money to fund all of this; they are using your money to basically attack you behind your back.
Well, take cartel money, for example; it depends on the definition of "hard earned", but I don't quite imagine the Japanese yakuza depositing into banks or the stock market. I'm not sure, but I imagine they use something more like gold or cash.
Maybe you can argue that the yakuza earn their money the hard way, but in my opinion they are doing illegal things while staying nominally within the law, and what they do is closer to extortion.
Ironically, what AI has done is, in a sense, also a kind of extortion.
One is just legal (barely; I am not even sure how or why), the other isn't. That was the distinction I meant to highlight when I said "hard-earned money".
> And when that happens people STILL won’t be able to afford the hardware.
Of course they will - if that happens all these AI token providers won't have a use for all that hardware they bought. You'll be buying used H100s and H200s off eBay for pennies on the dollar.
Then those datacenters will barely need any new GPUs, so the companies making them will be desperate to get gamers to buy cards and set very competitive prices.
> and they will spend infinite amounts of circular fake money to ensure hardware remains prohibitively expensive forever.
That's ridiculous, "infinite money" isn't a thing. They will spend as much as they can not because they want to keep local solutions out, but because it enables them to provide cheaper services and capture more of the market. We all eventually benefit from that.
> That's ridiculous, "infinite money" isn't a thing.
My reading of GP is that he was being sarcastic - "infinite amounts of circular fake money" is probably a reference to these circular deals going on.
If A hands B an investment of $100, and then B hands A $100 to purchase hardware, A's equity in B is, on paper, $100, plus A has revenue of $100 (from B), which gives A total assets of $200.
Obviously it has to be shuffled more thoroughly, but that's the basic idea that I thought GP was referring to.
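A toy version of that bookkeeping, with made-up $100 figures just to show the double counting (this is an illustration of the shape of the argument, not a claim about any specific deal):

    # A invests $100 in B; B immediately spends the same $100 on A's hardware.
    a = {"cash": 100, "equity_in_b": 0, "revenue": 0}

    a["cash"] -= 100; a["equity_in_b"] += 100   # A's cash becomes a stake in B
    a["cash"] += 100; a["revenue"] += 100       # B pays it straight back for hardware

    print(a["cash"] + a["equity_in_b"], a["revenue"])
    # -> 200 100: $200 of assets and $100 of revenue on paper,
    #    even though no new money entered the loop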
We moved from the mainframe era to desktops and smaller servers because computers got fast enough to do what we needed them to do locally. Centralized computing resources are still vastly more powerful than what's under your desk or in a laptop, but it doesn't matter because people generally don't need that much power for their daily tasks.
The problem with AI is that it's not obvious what the upper limit of capability demand might be. And until we get there, if we ever do, there will always be demand for the more capable models that run on centralized computing resources. Even if at some point I'm able to run a model on my local desktop that's equivalent to current Claude Opus, if what Anthropic is offering as a service is significantly better in a way that matters to my use case, I will still want to use the SaaS one.
> Even if at some point I'm able to run a model on my local desktop that's equivalent to current Claude Opus, if what Anthropic is offering as a service is significantly better in a way that matters to my use case, I will still want to use the SaaS one.
Only if it's competitively priced. You wouldn't want to use the SaaS if the breakeven in investment on local instances is a matter of months.
Right now people are shelling out for Claude Code and similar because for $200/m they can consume $10k/m of tokens. If you were actually paying $10k/m, then it makes sense to splurge $20k-$30k on a local instance.
The underlying advantage of local inference is that you're repurposing your existing hardware for free. You don't need your token spend to pay a share of the capex cost for datacenters that are large enough to draw gigawatts in power, you can just pay for your own energy use. Even though the raw energy cost per operation will probably be higher for local inference, the overall savings in hardware costs can still be quite real.
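A rough breakeven sketch using the figures above (the hardware price, power draw, and electricity rate are guesses on my part, so treat the output as an order-of-magnitude estimate only):

    # Order-of-magnitude breakeven for local inference vs. paying list price
    local_hw_cost = 25_000            # one-time, middle of the $20k-$30k range
    tokens_at_list_price = 10_000     # USD/month the same usage would cost via API
    subscription = 200                # USD/month subsidized plan
    power_kw, hours, rate = 1.0, 730, 0.15   # ~1 kW rig, hours per month, USD/kWh

    energy = power_kw * hours * rate  # roughly $110/month of electricity
    print(f"vs list price:   breakeven in ~{local_hw_cost / (tokens_at_list_price - energy):.1f} months")
    print(f"vs $200/mo plan: breakeven in ~{local_hw_cost / (subscription - energy):.0f} months")

Under these assumptions the rig pays for itself in about two and a half months against list price, and takes decades against the subsidized subscription, which is exactly why the calculus hinges on whether the subsidized pricing lasts.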
I don't think we are there yet. Models running in data centers will still be noticeably better, because efficiency gains also let providers build and run better models.
Not many people today would settle for models comparable to what was SOTA two years ago.
To run models locally with results as good as the models running in data centers, we need both better efficiency and for AI improvement to hit a wall.
Neither of those conditions seems likely to become true in the near future.
As I understand this advancement, this doesn't let you run bigger models, it lets you maintain more chat context. So Anthropic and OpenAI won't need as much hardware running inference to serve their users, but it doesn't do much to make bigger models work on smaller hardware.
Though I'm not an expert, maybe my understanding of the memory allocation is wrong.
Seems to me if the model and the KV cache are competing for the same pool of memory, then massively compressing the cache necessarily means more RAM available for (if it fits) a larger model, no?
Yes, but the context is a comparatively small part of how much memory is used when running it locally for a single user, vs when running it on a server for public... serving.
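A back-of-the-envelope comparison of weight memory vs. KV cache memory, assuming a made-up 70B dense fp16 model with grouped-query attention (the shape and numbers are illustrative, not any specific model):

    # Rough memory split for a hypothetical 70B fp16 model with GQA
    layers, kv_heads, head_dim = 80, 8, 128
    bytes_per_elem, params = 2, 70e9

    weights_gib = params * bytes_per_elem / 2**30

    def kv_cache_gib(seq_len, batch=1):
        # keys + values, per layer, per KV head
        return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 2**30

    print(f"weights:               {weights_gib:6.1f} GiB")
    print(f"KV, 32k ctx, batch 1:  {kv_cache_gib(32_768):6.1f} GiB")      # single local user
    print(f"KV, 32k ctx, batch 64: {kv_cache_gib(32_768, 64):6.1f} GiB")  # server-style batching

For a single local user the cache is a modest slice of the total (~10 GiB vs ~130 GiB of weights here), while a provider batching many long contexts sees the cache dominate, which is why cache compression matters so much more on their side.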
AI is not cheap to run no matter where it is running. The price we get charged today for AI is a loss-leader. The actual cost is much higher, so much higher that the average paying user today would balk at what it actually costs to run. These AI companies are trying to get people hooked on their product, to get it integrated into every business and workflow that they can, then start raising prices.
Even if you live somewhere where it does, that is not remotely "almost free", and in lots of places the payback period is more in the range of 10-15 years even with subsidies.
But what if small models become "good enough", such that for most intents and purposes they do the job?
There are some people here and on r/localllama whom I've seen run small models, sometimes several of them, to solve and iterate quickly, with a larger model plugged in to fix anything remaining.
Larger SOTA models would still see some demand, but I don't think nearly as much as people assume. We all still feel that different models are good for different tasks, and a good recommendation is to benchmark different models for your own use cases; sometimes there are small models that are good within your particular domain and worth having in your toolset.
Because the true goal is AGI, not just nice little tools that solve subsets of problems. The first company to achieve human-level intelligence will be able to self-improve at such a rate as to create a gigantic moat.
There’s no evidence that the current architectures will reach AGI levels.
Of course OpenAI wants you to think they will rule the world but if we’ve reached the plateau of LLM capabilities regardless of the amount of compute we throw at them then local models will soon be good enough.
> The first company which can achieve human level intelligence will just be able to...
They say prostitution is the oldest industry of all. We know how to achieve human-level intelligence quite well. The outstanding challenge is figuring out how to produce an energy efficient human-level intelligence.
There's no particular reason to assume a human level AI would be able to improve itself any better than the thousands of human level humans that designed it.
Sure, but: that single AI with the intelligence of a top-tier engineer or scientist will have immediate access to all human knowledge. Plus, what do you think happens the moment it optimizes itself to run in 2, 4, 8, 16, etc. parallel instances?
Well, A) "top tier engineer/scientist" is a significant step above a generic human, B) the human engineers/scientists also have immediate access to the same database, C) the humans have been optimizing it for even longer, so what makes us think the AI can optimize itself by even a couple percent?
For example, if the number of AIs you can run per petaflop started to scale with the cube root of researcher-years, then even if your researcher AIs are quite fast and you can double your density in a couple years, hitting 5x will take a decade and hitting 10x will approach half a century.
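A toy simulation of that compounding under the cube-root assumption (the starting numbers are made up, tuned only so density roughly doubles in the first couple of years as described; the point is the shape of the curve, not the exact figures):

    # density of researcher-AIs per petaflop ~ (cumulative researcher-years)^(1/3);
    # the researchers you can run, and hence yearly progress, scale with that density
    researcher_years = 1.0    # arbitrary baseline of cumulative effort
    rate_at_density_1 = 3.5   # researcher-years produced per year at density 1

    year, targets = 0, [5, 10]
    while targets:
        density = researcher_years ** (1 / 3)
        if density >= targets[0]:
            print(f"~{targets.pop(0)}x density after about {year} years")
            continue
        researcher_years += rate_at_density_1 * density
        year += 1
    # prints: 5x after ~12 years, 10x after ~44 years

Even though progress compounds (more density means more researcher-AIs), the cube-root returns flatten the curve: doubling comes quickly at first, then each further multiple takes dramatically longer.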
MoE feels a lot more like engineering to me. You're routing around the problem rather than actually solving it. The real math gains are things like quantization schemes that change how information is actually represented. Whether that distinction matters long term probably will depend on whether we hit a capability wall first or an efficiency ceiling first.
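For what "changing how information is represented" looks like in the simplest case, here is a minimal per-row symmetric int8 quantization sketch (a generic illustration, not any particular published scheme):

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 8)).astype(np.float32)       # toy weight matrix

    scale = np.abs(w).max(axis=1, keepdims=True) / 127   # one scale per row
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    w_hat = w_q.astype(np.float32) * scale               # dequantize for use

    print("bytes:", w.nbytes, "->", w_q.nbytes)          # 4x smaller, plus a few scales
    print("max abs error:", float(np.abs(w - w_hat).max()))

The weights shrink fourfold at the cost of a small, bounded reconstruction error, which is the representational trade-off the comment is pointing at, as opposed to MoE-style routing around the compute.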