> “Doesn’t the NSA lie to break our encryption?” No, the NSA has never intentionally jeopardized US national security with a non-NOBUS backdoor, and there is no way for ML-KEM and ML-DSA to hide a NOBUS backdoor.
The most concrete issue for me, as highlighted by djb, is that when the NSA insists against hybrids, vendors like telecommunications companies will handwrite poor implementations of ML-KEM to save memory/CPU time etc. for their constrained hardware that will have stacks of timing side channels for the NSA to break. Meanwhile X25519 has standard implementations that don't have such issues already deployed, which the NSA presumably cannot break (without spending $millions per key with a hypothetical quantum attack, a lot more expensive than side channels).
> The most concrete issue for me, as highlighted by djb, is that when the NSA insists against hybrids
The fact that only NSA does that and they really have no convincing arguments seems like the biggest reason why the wider internet should only roll out hybrids. Then possibly wait decades for everything to mature and then reconsider plain modes of operation.
Thus succeeding at making the telecommunications vendors used for Top Secret US national security data less secure, the obvious goal of the US National Security Agency, and the only reason they wouldn't use the better cryptography designed by Dr. Bernstein. /s
Truly, truly can't understand why anyone finds this line of reasoning plausible. (Before anyone yells Dual_EC_DRBG, that was a NOBUS backdoor, which is an argument against the NSA promoting mathematically broken cryptography, if anything.)
Timing side channels don't matter to ephemeral ML-KEM key exchanges, by the way. It's really hard to implement ML-KEM wrong. It's way easier to implement ECDH wrong, and remember that in this hypothetical you need to compare to P-256, not X25519, because US regulation compliance is the premise.
(I also think these days P-256 is fine, but that is a different argument.)
I genuinely do not understand how someone working in the capacity that you do, for things that matter universally for people, can contend that an organization who is intentionally engaging in NOBUS backdoors can be remotely trusted at all.
That is insanely irresponsible and genuinely concerning. I don't care if they have a magical ring that defies all laws of physics and assuredly prevents any adversary stealing the backdoor. If an organization is implementing _ANY_ backdoor, they are an adversary from a security perspective and their guidance should be treated as such.
The world just doesn’t work in such a binary way. Forming a mental model of an entity’s incentives, goals, capabilities, and dysfunctions will serve you much better than making two buckets for trusted parties and adversaries.
As you are someone building cryptographic libraries used by people all over the world, which includes those who might be seen as "enemies" by the organization in question, this is not a gradient — it's quite binary in nature.
Maybe your motives are benevolent, but you're arguing two things:
1) We can broadly trust the US government
2) We should adopt new encryption partly designed and funded by the US government, and get rid of the battle tested encryption that they seem not to be able to break
Forgive me for being somewhat suspicious of your motives here
> Thus succeeding at making the telecommunications vendors used for Top Secret US national security data less secure, the obvious goal of the US National Security Agency
NSA still has the secret Suite A system for their most sensitive information. If they think that is better than the current public algorithms, and their goal is for telecommunications vendors to have better encryption, then why don't they publish those algorithms so telcos could use them?
> Truly, truly can't understand why anyone finds this line of reasoning plausible. (Before anyone yells Dual_EC_DRBG, that was a NOBUS backdoor, which is an argument against the NSA promoting mathematically broken cryptography, if anything.)
The NSA weakened DES against brute-force attack by reducing the key size (while making it stronger against differential cryptanalysis, though).
The thing that sets this effort apart from DES and Clipper is that USG actually has skin in the game. Neither DES nor Clipper was ever intended or approved to protect classified information.
These are algorithms that NSA will use in real systems to protect information up to the TOP SECRET codeword level through programs such as CNSA 2.0[1] and CSfC.
> Thus succeeding at making the telecommunications vendors used for Top Secret US national security data less secure, the obvious goal of the US National Security Agency, and the only reason they wouldn't use the better cryptography designed by Dr. Bernstein. /s
I guess the NSA thinks they're the only ones who can target such a side channel, unlike, say, a foreign government, which doesn't have access to the US Internet backbone, doesn't have mathematicians or programmers as good (in the NSA's opinion), etc.
> Timing side channels don't matter to ephemeral ML-KEM key exchanges, by the way. It's really hard to implement ML-KEM wrong. It's way easier to implement ECDH wrong, and remember that in this hypothetical you need to compare to P-256, not X25519, because US regulation compliance is the premise.
Except for KyberSlash (I was surprised when I looked at the bug's code, it's written very optimistically wrt what the compiler would produce...)
So do you think vendors will write good code within the deadlines between now and... 2029? I wouldn't bet my state secrets on that...
That's a timing side-channel, irrelevant to ephemeral key exchanges, and tbh if that's the worst that went wrong in a year and a half, I am very hopeful indeed.
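For anyone who hasn't looked at the KyberSlash code: as I understand it, the problem was an integer division by q applied to secret-derived values, and division can take a data-dependent number of cycles on some CPUs. Here is a minimal sketch of the message-decoding step in both shapes. The constants 1665, 80635 and 28 are what I recall from the reference implementation's fix, so treat them as an assumption; the function names are mine. Python obviously isn't constant time, so this only checks that the two forms compute the same bit, not that either one is safe.

```python
Q = 3329  # ML-KEM / Kyber modulus


def decode_bit_division(t: int) -> int:
    """Pre-fix shape: round 2t/Q to the nearest integer using a division by Q."""
    return (((t << 1) + Q // 2) // Q) & 1


def decode_bit_mulshift(t: int) -> int:
    """Division-free shape: multiply by floor(2**28 / Q), then shift right.

    The constants are assumed from memory of the reference fix.
    """
    t <<= 1
    t += 1665
    t *= 80635  # floor(2**28 / 3329)
    t >>= 28
    return t & 1


# Exhaustively compare both forms over every coefficient value in [0, Q).
mismatches = [
    t for t in range(Q) if decode_bit_division(t) != decode_bit_mulshift(t)
]
print("mismatching inputs:", len(mismatches))
```

The point is only that the division is avoidable with a multiply and a shift, which compilers have no reason to turn into a variable-time instruction.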
This is interestingly very similar to domain fronting, except in this case the server doesn't need to work around it because it will still see the correct SNI.
Do DPI servers in your experience only check the first SNI packet for a given connection?
Thanks! Yes, the DPI systems I've tested against only look at the first ClientHello in a connection. They don't do full TCP reassembly. The fake packet arrives first (eBPF fires synchronously before the app sends data), DPI records that SNI, and the real ClientHello passes through unchecked.
More sophisticated DPI (like China's GFW) does reassembly and would likely catch this. But for simpler stateless DPI, it works.
Good analogy with domain fronting. The key difference is exactly what you said: the server sees the real SNI, so no server-side cooperation needed.
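To make the "only the first ClientHello" behaviour concrete, here is a toy model of that kind of DPI box. Everything in it is hypothetical (`Packet`, `FirstHelloDPI`, the hostnames and addresses); real middleboxes parse actual TLS records, and this sketch says nothing about how the decoy is kept from reaching the server. It only shows why a verdict cached per flow lets the second ClientHello through.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple

Flow = Tuple[str, int, str, int]  # (src ip, src port, dst ip, dst port)


@dataclass(frozen=True)
class Packet:
    flow: Flow
    sni: Optional[str]  # SNI if the packet is a ClientHello, else None


class FirstHelloDPI:
    """Toy DPI: decides per flow from the first SNI it sees, no reassembly."""

    def __init__(self, blocked: Set[str]) -> None:
        self.blocked = blocked
        self.verdict: Dict[Flow, bool] = {}

    def allow(self, pkt: Packet) -> bool:
        if pkt.flow not in self.verdict:
            if pkt.sni is None:
                return True  # nothing to inspect yet
            # Cache the decision; later ClientHellos are never re-inspected.
            self.verdict[pkt.flow] = pkt.sni not in self.blocked
        return self.verdict[pkt.flow]


flow = ("10.0.0.2", 50000, "203.0.113.7", 443)
decoy = Packet(flow, "allowed.example")
real = Packet(flow, "blocked.example")

# Decoy first: the DPI records the harmless SNI and waves the real one through.
dpi = FirstHelloDPI({"blocked.example"})
with_decoy = [dpi.allow(decoy), dpi.allow(real)]

# No decoy: the real ClientHello is the first one seen and gets blocked.
dpi_plain = FirstHelloDPI({"blocked.example"})
without_decoy = [dpi_plain.allow(real)]

print(with_decoy, without_decoy)
```

A DPI that reassembles the stream or re-inspects every ClientHello would not cache the verdict this way, which matches the caveat about more sophisticated systems.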
Very cool. The horsle demo made me think, how hard would it be to add a virtual memory address (or a non-8086 RAND instruction) that returns a random byte (that would allow it to pick a random value and get a standard wordle working in principle)
I see CSS random() is only supported by Safari, I wonder if there's some side channel that would work in Chrome specifically? (I guess timing the user input would work)
In my experience C++ abstractions give the optimizer a harder job and thus it generates worse code. In this case, different code is emitted by clang if you write a C version[0] versus C++ original[1].
Usually abstraction like this means that the compiler has to emit generic code which is then harder to flow through constraints and emit the same final assembly since it's less similar to the "canonical" version of the code that wouldn't use a magic `==` (in this case) or std::vector methods or something else like that.
To back up the other commenter - it's not the same. https://godbolt.org/z/r6e443x1c shows that if you write imperfect C++ clang is perfectly capable of optimizing it.
Very bizarre. Clang pretty readily sees that it can use SIMD instructions and really optimizes this, while GCC is reluctant to use them. I've even seen strange output where GCC emits SIMD instructions for the first loop and then falls back on regular x86 compares for the rest.
Edit: Actually, it looks like for large enough array sizes, it flips. At 256 elements, gcc ends up emitting simd instructions while clang does pure x86. So strange.
Writing a microbenchmark is an academic exercise. You end up benchmarking in isolation, which only tells you whether your function is faster in that exact scenario. Something which is faster in isolation in a microbenchmark can be slower when put in a real workload, because vectorising is likely to have way more of an impact than anything else. Similarly, if you parallelise it, you introduce a whole new category of ways to compare.
This isn't a microbenchmark. In fact, I haven't even bothered to benchmark it (perhaps the non-simd version actually is faster?)
This is purely me looking at the emitted assembly and being surprised at when the compilers decide to deploy it and not deploy it. It may be the case that the SIMD instructions are in fact slower even though they should theoretically end up faster.
Both compilers are simply using heuristics to determine when it's fruitful to deploy SIMD instructions.
I see yeah that makes sense. I wanted to highlight that "magic" will, on average, give the optimizer a harder time. Explicit offset loops like that are generally avoided in many C++ styles in favor of iterators.
It still emits a cmp/jmp where arithmetic would be fine, though, which is the difference highlighted in the article and the examples in this thread. It's nice that it simplifies down to assembly, but the assembly is somewhat questionable (especially that `xor eax, eax` branch target on the other side).
I think you could argue there is already some effort to do type safety at the ISA register level, with e.g. shadow stack or control flow integrity. Isn't that very similar to this, except targeting program state rather than external memory?
I mean, if the stacks grew upwards, that alone would nip 90% of buffer overflow attacks in the bud. Moving the return address from the activation frame into a separate stack would help as well, but I understand that having an activation frame to be a single piece of data (a current continuation's closure, essentially) can be quite convenient.
The PL/I stack growing up rather than down reduced potential impact of stack overflows in Multics (and PL/I already had better memory safety, with bounded strings, etc.) TFA's author would probably have appreciated the segmented memory architecture as well.
There is no reason why the C/C++ stack can't grow up rather than down. On paged hardware, both the stack and heap could (and probably should) grow up. "C's stack should grow up", one might say.
x86-64 call instruction decrements the stack pointer to push the return address. x86-64 push instructions decrement the stack pointer. The push instructions are easy to work around because most compilers already just push the entire stack frame at once and then do offset accesses, but the call instruction would be kind of annoying.
ARM does not suffer from that problem due to the usage of link registers and generic pre/post-modify. RISC-V is probably also safe, but I have not looked specifically.
> [x86] call instruction would be kind of annoying
I wonder what the best way to do it (on current x86) would be. The stupid simple way might be to adjust SP before the call instruction, and that seems to me like something that would be relatively efficient (simple addition instruction, issued very early).
Some architectures had CALL that was just "STR [SP], IP" without anything else, and it was up to the called procedure to adjust the stack pointer further to allocate for its local variables and the return slot for further calls. The RET instruction would still normally take an immediate (just as e.g. x86/x64's RET does) and additionally adjust the stack pointer by its value (either before or after loading the return address from the tip of the stack).
Linux on PA-RISC also has an upward-growing stack (AFAIK, it's the only architecture Linux has ever had an upward-growing stack on; it's certainly the only currently-supported one).
Both this and parent comment about PA-RISC are very interesting.
As noted, stack growing up doesn't prevent all stack overflows, but it makes it less trivially easy to overwrite a return address. Bounded strings also made it less trivially easy to create string buffer overflows.
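A toy picture of why direction matters, assuming the usual layout where a function's return address is pushed before its locals and buffer writes proceed toward higher addresses. The frame layout and the helper names (`make_frame`, `overrun`, `ret_intact`) are made up for illustration. It deliberately does not model the case where a callee writes into its caller's buffer, which is one of the overflows an upward-growing stack does not stop.

```python
def make_frame(grows_down: bool, buf_len: int = 8, slack: int = 4):
    """Return (memory, buf_start); list index stands in for the address.

    Downward stack: the return address is pushed first, so it sits at a
    higher address than the local buffer allocated after it.
    Upward stack: the return address sits below the buffer.
    """
    if grows_down:
        mem = ["buf"] * buf_len + ["RET"] + ["other"] * slack
        buf_start = 0
    else:
        mem = ["RET"] + ["buf"] * buf_len + ["free"] * slack
        buf_start = 1
    return mem, buf_start


def overrun(mem, buf_start: int, n_written: int):
    """Write n_written cells starting at the buffer, toward higher addresses."""
    out = list(mem)
    for addr in range(buf_start, min(buf_start + n_written, len(out))):
        out[addr] = "XX"
    return out


def ret_intact(grows_down: bool, n_written: int) -> bool:
    mem, start = make_frame(grows_down)
    return "RET" in overrun(mem, start, n_written)


# Writing one cell past an 8-cell buffer:
print("down, 9 cells written, RET intact:", ret_intact(True, 9))
print("up,   9 cells written, RET intact:", ret_intact(False, 9))
```

In the downward layout the first cell past the buffer is the function's own return address; in the upward layout the same overrun lands in space above the frame instead.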
In ARMv4/v5 (non-thumb-mode) stack is purely a convention that hardware does not enforce. Nobody forces you to use r13 as the stack pointer or to make the stack descending. You can prototype your approach trivially with small changes to gcc and linux kernel. As this is a standard architectural feature, qemu and the like will support emulating this. And it would run fine on real hardware too. I'd read the paper you publish based on this.
For modern systems, stack buffer overflow bugs haven't been great to exploit for a while. You need at least a stack cookie leak, and on Apple Silicon the return addresses are MACed, so overwriting them is a fool's errand (2^-16 chance of success).
Most exploitable memory corruption bugs are heap buffer overflows.
Being able to cat directories like that doesn't surprise me as much as the contents being readable. Is there not a bunch of binary garbage in between the filenames?
I came here to suggest the same! It's incredibly handy and I use it all the time at work: there's a process that runs for a very long time and I can't be sure ahead of time if the output it generates is going to be useful or not, but if it's useful I want to capture it. I usually just pipe it into `less` and then examine the contents once it's done running, and if needed I will use `s` to save it to a file.
(I suppose I could `tee`, but then I would always dump to a file even if it ends up being useless output.)
2. Replace libc.so with a fake library that has the right version symbol with a version script
e.g. version.map:
    GLIBC_2.29 {
      global:
        *;
    };
With an empty fake_libc.c
`gcc -shared -fPIC -Wl,--version-script=version.map,-soname,libc.so.6 -o libc.so.6 fake_libc.c`
3. Hope that you can still point the symbols back to the real libc (either by writing a giant pile of dlsym C code, or some other way, I'm unclear on this part)
Ideally glibc would stop checking the version if it's not actually marked as needed by any symbol, not sure why it doesn't (technically it's the same thing normally, so performance?).
So you can do e.g. `patchelf --remove-needed-version libm.so.6 GLIBC_2.29 ./mybinary` instead of replacing glibc wholesale (step 2 and 3) and assuming all of used glibc by the executable is ABI compatible this will just work (it's worked for a small binary for me, YMMV).
> When you get into lower power, anything lower than Steam Deck, I think you’ll find that there’s an Arm chip that maybe is competitive with x86 offerings in that segment.
At which point does this pay off the emulation overhead? Fex has a lot of work to do to bridge two ISAs while going through the black box of compiler output of assembly, right?
afaia emulators like Fex are within 30 to 70% of native performance. On the fringes worse or better. But overall emulation seems totally fine.
Plus emulator technology in general could be used for binary optimization rather than strict mappings, opening up space for more optimization.