Models aren't just the big bags of floats you imagine them to be. Those bags are there, but there's a whole layer of runtimes, caches, timers, load balancers, classifiers/sanitizers, etc. around them, all of which have tunable parameters that affect the user-perceptible output.
It's still engineering. Even magic alien tech from outer space would end up with an interface layer to manage it :).
ETA: reminds me of biology, too. In life, it turns out that the simpler some functional component looks, the more stupidly overcomplicated it is when you look at it under a microscope.
There's this[1]. Model providers have a strong incentive to switch (a part of) their inference fleet to quantized models during peak loads. From a systems perspective, it's just another lever. Better to have slightly nerfed models than complete downtime.
Anybody with more than five years in the tech industry has seen this done in all domains time and again. What evidence do you have that AI is different? That's the extraordinary claim in this case...
Real world usage suggests otherwise. It's been a known trend for a while. Anthropic even confirmed as much ~6 months ago but said it was a "bug" - one that somehow just keeps happening 4-6 months after a model is released.
Real world usage is unlikely to give you the large sample sizes needed to reliably detect the differences between models. Standard error scales as the inverse square root of sample size, so even a difference as large as 10 percentage points would require hundreds of samples.
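A back-of-envelope check on that (my framing: a simple two-sided z-test on two pass rates at 95%, not anyone's published methodology):

```python
import math

def n_per_model(p1, p2, z=1.96):
    """Smallest per-model sample size at which the gap between two
    pass rates exceeds z standard errors of the difference."""
    n = 1
    while abs(p1 - p2) <= z * math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n):
        n += 1
    return n

# Even a 10-point gap (60% vs 50%) needs a couple of hundred samples
# per model before it clears the noise.
print(n_per_model(0.60, 0.50))  # 189
```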
https://marginlab.ai/trackers/claude-code/ tries to track Claude Opus performance on SWE-Bench-Pro, but since they only sample 50 tasks per day, the confidence intervals are very wide. (This was submitted 2 months ago https://news.ycombinator.com/item?id=46810282 when they "detected" a statistically significant deviation, but that was because they used the first day's measurement as the baseline, so at some point they had enough samples to notice that this was significantly different from the long-term average. It seems like they have fixed this error by now.)
It's hard to trust public, high-profile benchmarks, because any change to a specific model (Opus 4.5 in this case) can be rejected if it regresses on SWE-Bench-Pro - so everything that gets released will perform well on this benchmark.
Any other benchmark at that sample size would have similarly huge error bars. Unless Anthropic makes a model that works 100% of the time or writes a bug that brings it all the way to zero, it's going to work sometimes and fail sometimes, and anyone who thinks they can spot small changes in how often it works without running an astonishingly large number of tests is fooling themselves with measurement noise.
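For a sense of scale, here are the binomial error bars at a 50-task sample (my own arithmetic, not any tracker's methodology):

```python
import math

# Half-width of a rough 95% confidence interval for a pass rate p
# measured on n tasks.
def ci_halfwidth(p, n, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

# At n = 50 and p = 0.56 the error bars are about +/-14 points --
# several times larger than the small swings being argued over.
print(round(ci_halfwidth(0.56, 50), 3))  # 0.138
```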
They do. I'm currently seeing a degradation in Opus 4.6 on tasks it could do without trouble a few months back. Obviously I'm a sample of n=1, but I'm also convinced a new model is around the corner and they preemptively nerf their current model so people notice the "improvement".
Well, I don't see 4.5 on there ... so I'm not sure what you're trying to say.
And today is a 53% pass rate vs. a baseline 56% pass rate. That's a huge difference. If we recall what Anthropic originally promised a "max 5" user https://github.com/anthropics/claude-code/issues/16157#issue... -- which they've since removed from their site...
50-200 prompts. That's an extra 1-6 "wrong solutions" per 5 hours ... and you have to get a lot of wrong answers to arrive at a wrong solution.
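Spelling out that arithmetic (using the 56% baseline vs. 53% figures from upthread):

```python
# A 3-point drop in pass rate over a 5-hour window of 50-200 prompts
# translates to very few additional failed attempts.
extra_failures = {n: n * (0.56 - 0.53) for n in (50, 200)}
print(extra_failures)  # about 1.5 extra at 50 prompts, 6 at 200
```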
I think the conspiracy theories are silly, but equally I think pretending these black boxes are completely stable once they're released is incorrect as well.
very cool, i did something similar but turning the doom frame running on a server into ascii (with colour) and then a small shim to give inputs via subdomains
Which is yet another chore. And it doesn't add any security. A certificate that expired yesterday proves I am who I am just as much as it did yesterday. As long as the validity length is shorter than how long it would take somebody to work out the private key from the public key, it is fine.
No, they're not useless at all. The point of shortening certificate periods is that companies complain when they have to put customers on revocation lists, because their customers need ~2 years to update a certificate. If CRLs were useless, nobody would complain about being put on them. If you follow the revocation tickets in ca-compliance bugzilla, this is the norm—not the exception. Nobody wants to revoke certificates because it will break all of their customers. Shortening the validity period means that CAs and users are more prepared for revocation events.
... what are the revocation tickets about then? how is it even a question whether to put a cert on the CRL? either the customer wants to or the key has been compromised? (in which case the customer should also want to have it revoked ASAP, no?)
Usually, technical details. Think: a cert issued with a validity of exactly 1000 days to the second when the rules say the validity should be less than 1000 days. Or, a cert where the state name field contains its abbreviation rather than the full name. The WebPKI community is rather strict about this: if it doesn't follow the rules, it's an invalid cert, and it MUST be revoked. No "good enough" or "no real harm done, we'll revoke it in three weeks when convenient".
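To make the off-by-one concrete (a sketch with made-up dates; the inclusive-counting rule is, as I understand it, the real one from the Baseline Requirements):

```python
from datetime import datetime, timedelta, timezone

# The Baseline Requirements count validity *inclusively*: the period
# runs from notBefore through notAfter, so it equals
# (notAfter - notBefore) + 1 second. Naive issuance tooling that sets
# notAfter = notBefore + limit therefore overshoots by one second.
LIMIT = timedelta(days=1000)  # hypothetical limit from the example above

not_before = datetime(2024, 1, 1, tzinfo=timezone.utc)
not_after = not_before + LIMIT  # looks correct, but...

validity = (not_after - not_before) + timedelta(seconds=1)
print(validity > LIMIT)  # True: mis-issued, must be revoked
```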
> either the customer wants to or the key has been compromised
The CA wants to revoke, because not doing so risks them being removed from the root trust stores. The customer doesn't want to revoke, because to them the renewal process is a massive inconvenience and there's no real risk of compromise.
This results in CAs being very hesitant to revoke because major enterprise / government customers are threatening to sue and/or leave if they revoke on the required timeline. This in turn shows the WebPKI community that CAs are fundamentally unable to deal with mass revocation events, which means they can't trust that CAs will be able to handle a genuinely harmful compromise properly.
By forcing an industry-wide short cert validity you are forcing large organizations to also automate their cert renewal, which means they no longer pose a threat during mass revocation events. No use threatening your current CA when all of its competitors will treat you exactly the same...
From my experience the biggest complaints/howlings are when the signing key is compromised; e.g., your cert is valid and fine, but the authority screwed up and so they had to revoke all certs signed with their key because that leaked.
Sure, happy to. The average revocation ticket is something like https://bugzilla.mozilla.org/show_bug.cgi?id=1892419 or https://bugzilla.mozilla.org/show_bug.cgi?id=1624527. The CA shipped some kind of bug leading to noncompliance with baseline requirements. This could be anything from e.g. not validating the email address properly, inappropriately using a third-party resolver to fetch DNS names, or including some kind of extra flag set that they weren't supposed to have set. The CA doesn't want to revoke these certificates, because that would cause customers to complain:
> In response to this incident of mistaken issuance, the verification targets are all government units and government agency websites. We have assessed that the cause of this mis-issuance does not involve a key issue, but only a certificate field issue, which will not affect the customer's information security. In addition, in accordance with the administrative efficiency of government agencies, from notification to the start of processing, it requires agency supervisors at all levels. Signing and approval, and some public agencies need to find information vendors for processing, so it is difficult to complete the replacement within 5 days. Therefore, the certificate is postponed and revoked within a time limit so that the certificates of all websites can be updated smoothly.

> [...]

> In this project we plan to initially issue new certificates using the same keys for users to install, and then revoke the old certificates. As these are official government websites, and considering the pressure from government agencies and public opinion, we cannot immediately revoke all certificates without compromising security. Doing so would quickly become news, and we would face further censure from government authorities.
The browsers want them to revoke the certificates immediately, because they rely on CAs to agree to the written requirements of the policy. If you issue certificates, you must validate them in precisely this way, and you must generate certificates with precisely these requirements. The CAs agree in their policies to revoke within 24 hours (serious) or 120 hours (less serious) any certificates issued that violate policy.
And yet when push comes to shove, certificates don't actually get revoked. Everybody has critical clients who pay them $$$$$ and no CAs actually want to make those clients mad. Browsers very rarely revoke certificates themselves, and realistically their only lever is to trust or distrust a CA—they need to rely on the CA to be truthful and manage their own certificates properly. They don't know exactly all of which certificates would be subject to an incident, they don't want to punish CAs for disclosing info publicly, etc. So instead, they push for systematic changes that will make it easier for CAs to revoke certificates in the future, including ACME ARI and shorter certificate lifetimes.
Yes, everyone in the WebPKI community is pushing for shorter validity lifetime. But as you can see in the parent thread here ("Which is yet another chore. And it doesn’t add any security"), everybody is mad that browsers are pushing for shorter certificate lifetimes.
Right. It's the same debate about how long authorization cookies or tokens should last. At one point in time--only one--authentication was performed in a provable enough manner that the certificate was issued. After that--it could be seconds, hours, days, years, or never--that assumption could become invalid.
Or that someone asked to renew it, one of their four bosses didn't sign off on the appropriate form, the only person who can take that form to whoever does the certs is on vacation, the person issuing certs needs all four of his bosses to sign off, and one of those bosses has been DOGE-ed and not yet replaced.
expired letsencrypt cert on a raspberrypi at home smells of not paying attention... with governments, there are many, many points of failure.
The whole point of these shorter certificate durations is to force companies to put in automation that doesn't require 14 layers of paperwork. Some companies will be stubborn, and will thus be locked in an eternal cycle of renew->get paperwork started for renew. Most will adapt.
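The automation in question is mundane. As a sketch of the policy short lifetimes force on you (certbot, for instance, renews by default once a 90-day cert is inside its last 30 days; the proportional one-third rule below is my illustration, not a real API):

```python
from datetime import datetime, timedelta, timezone

# Renew once less than a third of the certificate's lifetime remains.
# For a 90-day cert that means renewing in the final 30 days, leaving
# plenty of slack for failed attempts before anything breaks.
def should_renew(not_before, not_after, now):
    lifetime = not_after - not_before
    return (not_after - now) < lifetime / 3

nb = datetime(2026, 1, 1, tzinfo=timezone.utc)
na = nb + timedelta(days=90)
print(should_renew(nb, na, now=nb + timedelta(days=70)))  # True: 20 days left
```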
It's the government... they have 30 different services just in that department, made by 30 different companies with 30 different support companies, two of those don't exist anymore, 3 have been bought by cisco, two by google, 2 services are behind some old palo alto web proxy that's centrally managed by some other department, one service is written in cobol, one requires the cert to be on a usb flash drive and another on a memory stick.
It's cheaper to pay someone just to take care of the certs (unless their bosses and procurement and accounting messes up) than to fix all that.
I've seen government stuff, i wouldn't touch it with a 5m pole.
I don't see how any of that is the CA's problem. As far as I'm concerned, the CAs and browser vendors are entirely in the right to go "Here's the new rules. Adapt. Or don't, we don't care."
Well, they didn't, and you have to click through "I understand" (or whatever) to see the contents from servers with expired certs. Usually you need files from them and not vice-versa, so as far as they're concerned, it's your problem now.
I guess it depends on the country. Where I live they’d be on the hook in somehow safely providing me with the files if they were involved in me fulfilling some kind of legal obligation to them, and I’d be off the hook if they refused.
I am curious how long the approval process in some large corp or the military would be for either of those options...
Hand over our private keys to a third party or run this binary written by some volunteers in some basements who will not sign a support contract with us...
I've worked with large "enterprises" that refuse to use the easy-to-automate certificate services, including AWS Certificate Manager. They would rather continue to procure certificates through a third party, email around keys, etc. They somehow believe these archaic practices are more secure.
Isn't that why certificates expire, and the expiry window is getting shorter and shorter? To keep up with the length of time it takes someone to crack a private key?
No, it has nothing to do with the time to crack encryption. It's to protect against two things: organizations that still rely on manual processes (short lifetimes make those increasingly infeasible, effectively requiring automated renewal) and excessively large revocation lists (because you don't need to serve revocation data for a now-expired certificate).
No. The sister comment gave the correct answer. It is because nobody checks revocation lists. I promise you there’s nobody out there who can factor a private key out of your certificate in 10, 40, 1000, or even 10,000 days.
I thought I remembered someone breaking one recently, but (unless I've found a different recent arxiv page) seems like it was done using keys that share a common prime factor. Oops!
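That's the "Mining your Ps and Qs" class of attack: no factoring required, because two RSA moduli that share a prime give it up to a plain gcd. A toy-sized sketch (real moduli are ~2048-bit, but the math is identical):

```python
import math

p, q1, q2 = 101, 103, 107   # toy primes standing in for ~1024-bit ones
n1 = p * q1                 # first key's public modulus
n2 = p * q2                 # second key's modulus, accidentally sharing p

shared = math.gcd(n1, n2)   # fast even at real key sizes
print(shared, n1 // shared, n2 // shared)  # 101 103 107: both keys factored
```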
It's also a "how much exposure do people have if the private key is compromised?"
Yes, it's to make it so that a dedicated effort to break the key has it rotated before someone can impersonate it... it's also a question of how big the historical data window is that an attacker has i̶f̶ when someone cracks the key.
There was a startup that did this in the mid 2010s named Magic, but was just via SMS. I used it a few times to get random things done, and it was really useful when it was cheap, then it became mega expensive.
There are two separate things DoorDash seems to be doing: "Tasks" in the physical world (taking photos of inventory on shelves, closing Waymo doors), and then some separate app for training AI models.
As for Magic, they were an SMS-based virtual assistant. They still exist today. They went downhill. https://getmagic.com/
Why?… The experiment.yaml shows that it is calling h100/200 explicitly, it’s pretty common for humans to say “number bigger more gooder” for anything… Lie and reverse the values and see what happens. I would put money on a rabbit hole of complaining about it being misconfigured.
It means holding the actual stocks in the underlying index, as opposed to synthetic replication, which aims to achieve returns matching the index via derivatives or other techniques.
It's physical in the sense that literal means not literal nowadays.
ETF and index arb traders use the term physical to describe securities that require full margin. Example: Sell stocks, buy index futures (and reverse) is the classic EFP equity trade. To be clear, futures are highly leveraged, thus do not require full margin.
I have tried most of the major open source models now and they all feel okay, but i’d prefer Sonnet or something any day over them. Not even close in capability for general tasks in my experience.
Harvester is just KubeVirt with some UI atop it, the same as Red Hat Virtualization. Works fine if you're hosting datacenters or whatever; I haven't seen it be suitable in smaller manufacturing environments.
The country is perfectly capable of having its own rotten morals, and outsourcing of all blame to Israel is just excusing the mistakes of American leadership.
> outsourcing of all blame to Israel is just excusing the mistakes of American leadership.
Isn't it really the other way around? Israel is literally outsourcing its war and its war crimes to the US military (the strike on the girls' school was not Israeli but American).
Sure, Israel is getting some bombardment, but the lion's share of retaliatory strikes are being borne by American allies, almost all of whom have now lost trust in the US, and are now being forced to buy stockpiles from the EU and even Ukraine because the Americans came unprepared.
It's telling when the tightly controlled media in those countries lets billionaire magnates openly criticize their country's relationship with the US (not Israel), on national print.
If your friend plans on killing a hooker because he likes the idea of snuff porn, and you pitch in to kill her coworker and her boss so there are no witnesses and so your friend doesn't get hurt and because okay maybe you also enjoy snuff porn when you're in the right mood, then it's a joint venture and you share culpability. Trying to divy up fault (45% or 55%?) is kind of besides the point. Trying to decide who's ultimately responsible (0% or 100%?) is both besides the point and violates every ethical principle we have.
Trump sent half of the US's fleet to the Persian Gulf to mount a war on Iran, in part to distract us from the Epstein Files, in part because he thinks he's a czar who we'll title "The Great" for his territorial expansion since the Nobel committee vetoed "The Merciful". Rubio said that we knew Israel was going to attack Iran, and we would have stepped in to defend Israel from the counterattack, so we decided to just attack Iran ourselves. Hegseth ("Deus Vult" Hegseth) and the general staff apparently are disseminating the view internally that "Trump was anointed by Jesus to light the signal fire in Iran to cause Armageddon and mark his return to Earth", and they are His blessed instruments.