Hacker News | btdmaster's comments

> “Doesn’t the NSA lie to break our encryption?” No, the NSA has never intentionally jeopardized US national security with a non-NOBUS backdoor, and there is no way for ML-KEM and ML-DSA to hide a NOBUS backdoor.

The most concrete issue for me, as highlighted by djb, is that when the NSA insists against hybrids, vendors like telecommunications companies will handwrite poor implementations of ML-KEM to save memory/CPU time etc. for their constrained hardware that will have stacks of timing side channels for the NSA to break. Meanwhile X25519 has standard implementations that don't have such issues already deployed, which the NSA presumably cannot break (without spending $millions per key with a hypothetical quantum attack, a lot more expensive than side channels).


> The most concrete issue for me, as highlighted by djb, is that when the NSA insists against hybrids

The fact that only the NSA does that, and that they really have no convincing arguments, seems like the biggest reason why the wider internet should roll out only hybrids. Then possibly wait decades for everything to mature before reconsidering plain modes of operation.


Thus succeeding at making the telecommunications vendors used for Top Secret US national security data less secure, the obvious goal of the US National Security Agency, and the only reason they wouldn't use the better cryptography designed by Dr. Bernstein. /s

Truly, truly can't understand why anyone finds this line of reasoning plausible. (Before anyone yells Dual_EC_DRBG, that was a NOBUS backdoor, which is an argument against the NSA promoting mathematically broken cryptography, if anything.)

Timing side channels don't matter to ephemeral ML-KEM key exchanges, by the way. It's really hard to implement ML-KEM wrong. It's way easier to implement ECDH wrong, and remember that in this hypothetical you need to compare to P-256, not X25519, because US regulation compliance is the premise.

(I also think these days P-256 is fine, but that is a different argument.)


I genuinely do not understand how someone working in the capacity that you do, on things that matter universally for people, can contend that an organization that intentionally engages in NOBUS backdoors can be remotely trusted at all.

That is insanely irresponsible and genuinely concerning. I don't care if they have a magical ring that defies all laws of physics and assuredly prevents any adversary stealing the backdoor. If an organization is implementing _ANY_ backdoor, they are an adversary from a security perspective and their guidance should be treated as such.


The world just doesn’t work in such a binary way. Forming a mental model of an entity’s incentives, goals, capabilities, and dysfunctions will serve you much better than making two buckets for trusted parties and adversaries.

As you are someone building cryptographic libraries used by people all over the world, which includes those who might be seen as "enemies" by the organization in question, this is not a gradient — it's quite binary in nature.

Maybe your motives are benevolent, but you're arguing two things:

1) We can broadly trust the US government

2) We should adopt new encryption partly designed and funded by the US government, and get rid of the battle-tested encryption that they seem not to be able to break

Forgive me for being somewhat suspicious of your motives here


[We can broadly trust the US government] not to promote broken encryption to its own agencies.

> Thus succeeding at making the telecommunications vendors used for Top Secret US national security data less secure, the obvious goal of the US National Security Agency

NSA still has the secret Suite A algorithms for their most sensitive information. If they think those are better than the current public algorithms, and their goal is for telecommunications vendors to have better encryption, then why don't they publish them so telcos could use them?

> Truly, truly can't understand why anyone finds this line of reasoning plausible. (Before anyone yells Dual_EC_DRBG, that was a NOBUS backdoor, which is an argument against the NSA promoting mathematically broken cryptography, if anything.)

The NSA weakened DES against brute-force attack by reducing the key size (though it also made it stronger against differential cryptanalysis).

https://en.wikipedia.org/wiki/Data_Encryption_Standard#NSA's...

Also NSA put a broken cipher in the Clipper Chip (beside all the other vulnerabilities).


The thing that sets this effort apart from DES and Clipper is that USG actually has skin in the game. Neither DES nor Clipper was ever intended or approved to protect classified information.

These are algorithms that NSA will use in real systems to protect information up to the TOP SECRET codeword level, through programs such as CNSA 2.0[1] and CSfC[2].

[1] https://media.defense.gov/2025/May/30/2003728741/-1/-1/0/CSA...

[2] https://www.nsa.gov/Resources/Commercial-Solutions-for-Class...


> Thus succeeding at making the telecommunications vendors used for Top Secret US national security data less secure, the obvious goal of the US National Security Agency, and the only reason they wouldn't use the better cryptography designed by Dr. Bernstein. /s

I guess the NSA thinks they're the only ones who could target such a side channel, unlike, say, a foreign government, which (in the NSA's opinion) doesn't have access to the US Internet backbone, doesn't have mathematicians or programmers as good as theirs, etc.

> Timing side channels don't matter to ephemeral ML-KEM key exchanges, by the way. It's really hard to implement ML-KEM wrong. It's way easier to implement ECDH wrong, and remember that in this hypothetical you need to compare to P-256, not X25519, because US regulation compliance is the premise.

Except for KyberSlash (I was surprised when I looked at the bug's code; it's written very optimistically with regard to what the compiler will produce...)

So do you think vendors will write good code within the deadlines between now and... 2029? I wouldn't bet my state secrets on that...
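To make the bug class concrete: KyberSlash was, at heart, a `/` whose numerator depends on a secret coefficient. Below is a sketch (function names mine, simplified from the reference Compress with d=4; the 1665/80635 constants follow the style of the published fixes, not the actual patched source) contrasting the variable-time division with the multiply-and-shift replacement:

```cpp
#include <cassert>
#include <cstdint>

constexpr int32_t KYBER_Q = 3329;

// Variable-time on many CPUs: '/' compiles to a division instruction whose
// latency can depend on the (secret) numerator, the KyberSlash pattern.
uint8_t compress_div(int32_t t) {
    return (uint8_t)((((t << 4) + KYBER_Q / 2) / KYBER_Q) & 15);
}

// Constant-time replacement in the style of the shipped fixes: a Barrett-ish
// multiply-and-shift (80635 is roughly 2^28 / KYBER_Q; the 1665 bias
// compensates the truncation so both functions agree for all t in [0, KYBER_Q)).
uint8_t compress_mul(int32_t t) {
    uint64_t n = (uint64_t)((t << 4) + 1665);
    return (uint8_t)(((n * 80635) >> 28) & 15);
}
```

Whether `/` is actually variable-time depends on the target; the point is that nothing in the source stops the compiler from emitting one.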


> KyberSlash

That's a timing side-channel, irrelevant to ephemeral key exchanges, and tbh if that's the worst that went wrong in a year and a half, I am very hopeful indeed.


That's so cool.

This is interestingly very similar to domain fronting, except in this case the server doesn't need to work around it because it will still see the correct SNI.

Do DPI servers in your experience only check the first SNI packet for a given connection?


Thanks! Yes, the DPI systems I've tested against only look at the first ClientHello in a connection. They don't do full TCP reassembly. The fake packet arrives first (eBPF fires synchronously before the app sends data), DPI records that SNI, and the real ClientHello passes through unchecked.

More sophisticated DPI (like China's GFW) does reassembly and would likely catch this. But for simpler stateless DPI, it works.

Good analogy with domain fronting. The key difference is exactly what you said: the server sees the real SNI, so no server-side cooperation needed.


Very cool. The horsle demo made me think: how hard would it be to add a virtual memory address (or a non-8086 RAND instruction) that returns a random byte? That would allow it to pick a random value and get a standard Wordle working, in principle.

I see CSS random() is only supported by Safari, I wonder if there's some side channel that would work in Chrome specifically? (I guess timing the user input would work)


It's really easy, I was considering adding it.

The easiest way is to make an @property that's animated at ridiculous speeds that can be sampled to get (sort of) random bits.


Or use a cycle timer and run a PRNG from it.

Or wait for us to launch random() :-) (It's in development, available if you enable a flag)


In my experience C++ abstractions give the optimizer a harder job, and thus it generates worse code. In this case, clang emits different code for a C version[0] than for the C++ original[1].

Usually abstraction like this means the compiler has to emit generic code first, which is then harder to propagate constraints through, so it's less likely to reach the same final assembly as the "canonical" version of the code that doesn't use a magic `==` (in this case) or std::vector methods or the like.

[0] https://godbolt.org/z/vso7xbh61

[1] https://godbolt.org/z/MjcEKd9Tr


To back up the other commenter - it's not the same. https://godbolt.org/z/r6e443x1c shows that if you write imperfect C++ clang is perfectly capable of optimizing it.


What's strange is I'm finding that gcc really struggles to correctly optimize this.

This was my function

    bool all_zero(const std::array<int, 42>& array) {
        for (auto v : array) {
            if (v != 0)
                return false;
        }
        return true;
    }
clang emits basically the same thing yours does. But gcc ends up really struggling to vectorize this for large array sizes.

Here's gcc for 42 elements:

https://godbolt.org/z/sjz7xd8Gs

and here's clang for 42 elements:

https://godbolt.org/z/frvbhrnEK

Very bizarre. Clang pretty readily sees that it can use SIMD instructions and optimizes this heavily, while GCC is reluctant to use them. I've even seen strange output where GCC emits SIMD instructions for the first loop and then falls back on regular x86 compares for the rest.

Edit: Actually, it looks like it flips for large enough array sizes. At 256 elements, gcc ends up emitting SIMD instructions while clang does plain x86. So strange.


Writing a microbenchmark is an academic exercise. You end up benchmarking in isolation, which only tells you whether your function is faster in that exact scenario. Something that is faster in isolation in a microbenchmark can be slower when put in a real workload, because vectorising is likely to have way more of an impact than anything else. Similarly, if you parallelise it, you introduce a whole new category of ways to compare.


This isn't a microbenchmark. In fact, I haven't even bothered to benchmark it (perhaps the non-simd version actually is faster?)

This is purely me looking at the emitted assembly and being surprised at when the compilers decide to deploy it and not deploy it. It may be the case that the SIMD instructions are in fact slower even though they should theoretically end up faster.

Both compilers are simply using heuristics to determine when it's fruitful to deploy SIMD instructions.


I've had to coerce gcc into emitting SIMD code by using int instead of bool. Also, the early return may be putting it off.


Doing both of those things does seem to help: https://godbolt.org/z/1vv7cK4bE

GCC trunk seems to like using `bool` so we may eventually be able to retire the hack of using `int`.
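For the curious, the int-plus-no-early-return shape being discussed would look something like this (my reconstruction, not the exact code upthread; whether GCC actually vectorizes it still varies by version and flags):

```cpp
#include <array>
#include <cassert>

// Accumulate into an int rather than returning bool early, leaving a
// branch-free reduction that GCC's vectorizer handles more readily
// (anecdotally; the heuristics differ across versions).
bool all_zero(const std::array<int, 42>& a) {
    int acc = 0;
    for (auto v : a) acc |= v;   // no data-dependent branch in the loop body
    return acc == 0;
}
```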


I see, yeah, that makes sense. I wanted to highlight that "magic" will, on average, give the optimizer a harder time. Explicit offset loops like that are generally avoided in many C++ styles in favor of iterators.


Even at a higher level of abstraction, the compiler seems to pull through: https://godbolt.org/z/1nvE34YTe
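The godbolt link isn't reproduced here, but one plausible version at that level of abstraction, assuming std::all_of with a lambda, would be:

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Same predicate expressed through the standard algorithm, leaving it to
// the compiler to flatten the abstraction back down.
bool all_zero(const std::array<int, 42>& a) {
    return std::all_of(a.begin(), a.end(), [](int v) { return v == 0; });
}
```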


It still emits a cmp/jmp where plain arithmetic would do, though, which is the difference highlighted in the article and in the examples in this thread. It's nice that it simplifies down to assembly, but the assembly is somewhat questionable (especially that xor eax, eax branch target on the other side).


Except that the C++ version doesn't need to be like that.

Abstractions are welcome when they don't matter; when they do matter, there are other ways to write the code that are still standard C++.


I think you could argue there is already some effort to do type safety at the ISA register level, with e.g. shadow stack or control flow integrity. Isn't that very similar to this, except targeting program state rather than external memory?


Tagged memory was a thing, and is a thing again on some ARM machines. Check out Google Pixel 9.


I mean, if the stacks grew upwards, that alone would nip 90% of buffer overflow attacks in the bud. Moving the return address from the activation frame into a separate stack would help as well, but I understand that having an activation frame to be a single piece of data (a current continuation's closure, essentially) can be quite convenient.


The PL/I stack growing up rather than down reduced the potential impact of stack overflows in Multics (and PL/I already had better memory safety, with bounded strings, etc.). TFA's author would probably have appreciated the segmented memory architecture as well.

There is no reason why the C/C++ stack can't grow up rather than down. On paged hardware, both the stack and heap could (and probably should) grow up. "C's stack should grow up", one might say.


> There is no reason why the C/C++ stack can't grow up rather than down.

Historical accident. Imagine if PDP-7/PDP-11 easily allowed for the following memory layout:

    FFFF +---------------+
         |     text      |  X
         +---------------+
         |    rodata     |  R
         +---------------+
         |  data + bss   |  RW
         +---------------+
         |     heap      |
         |      ||       |  RW
         |      \/       |
         +---------------+
         |  empty space  |  unmapped
         +---------------+
         |      /\       |
         |      ||       |  RW
         |     stack     |
    0000 +---------------+
Things could have turned out very differently than they have. Oh well.


Nice diagram. I might put read-only pages on both sides of 0 though to mitigate null pointer effects.


Is there anything stopping us from doing this today on modern hardware? Why do we grow the stack down?


x86-64 call instruction decrements the stack pointer to push the return address. x86-64 push instructions decrement the stack pointer. The push instructions are easy to work around because most compilers already just push the entire stack frame at once and then do offset accesses, but the call instruction would be kind of annoying.

ARM does not suffer from that problem due to the usage of link registers and generic pre/post-modify. RISC-V is probably also safe, but I have not looked specifically.
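As an aside, you can probe which way your ABI grows the stack with a quick comparison of locals across frames. This is illustrative only: comparing addresses of objects in different frames is only meaningful on a flat-stack ABI, and `noinline` is a GCC/Clang extension.

```cpp
#include <cassert>
#include <cstdint>

// On mainstream x86-64/ARM ABIs the deeper frame's local sits at a
// *lower* address, i.e. the stack grows down.
__attribute__((noinline)) bool stack_grows_down(uintptr_t outer_local) {
    char inner;                              // lives in the deeper call frame
    return (uintptr_t)&inner < outer_local;
}
```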


> [x86] call instruction would be kind of annoying

I wonder what the best way to do it (on current x86) would be. The stupid simple way might be to adjust SP before the call instruction, and that seems to me like something that would be relatively efficient (simple addition instruction, issued very early).


Some architectures had CALL that was just "STR [SP], IP" without anything else, and it was up to the called procedure to adjust the stack pointer further to allocate for its local variables and the return slot for further calls. The RET instruction would still normally take an immediate (just as e.g. x86/x64's RET does) and additionally adjust the stack pointer by its value (either before or after loading the return address from the tip of the stack).


Nothing stops you from having upward growing stacks in RISC-V, for example, as there are no dedicated stack instructions.

Instead of

  addi sp, sp, -16
  sd a0, 0(sp)
  sd a1, 8(sp)
Do:

  addi sp, sp, 16
  sd a0, -8(sp)
  sd a1, -16(sp)


HP-UX on PA-RISC had an upward-growing stack. In practice, various exploits were developed which adapted to the changed direction of the stack.

One source from a few mins of searching: https://phrack.org/issues/58/11


Linux on PA-RISC also has an upward-growing stack (AFAIK, it's the only architecture Linux has ever had an upward-growing stack on; it's certainly the only currently-supported one).


Both this and parent comment about PA-RISC are very interesting.

As noted, stack growing up doesn't prevent all stack overflows, but it makes it less trivially easy to overwrite a return address. Bounded strings also made it less trivially easy to create string buffer overflows.


Yeah, my assumption is that all the PA-RISC operating systems did, but I only know about HP-UX for certain.


In ARMv4/v5 (non-Thumb mode), the stack is purely a convention that hardware does not enforce. Nobody forces you to use r13 as the stack pointer or to make the stack descending. You can prototype your approach trivially with small changes to gcc and the Linux kernel. As this is a standard architectural feature, qemu and the like will support emulating this. And it would run fine on real hardware too. I'd read the paper you publish based on this.


For modern systems, stack buffer overflow bugs haven't been great to exploit for a while. You need at least a stack cookie leak, and on Apple Silicon the return addresses are MACed, so overwriting them is a fool's errand (2^-16 chance of success).

Most exploitable memory corruption bugs are heap buffer overflows.


It’s still fairly easy to exploit buffer overflows if the stack grows upward


Everything really is a file: if you do `cat /` you'll get back the internal representation of the directory entries in / (analogous to ls)

And they still had coredumps at the time if you press ctrl-\


Being able to cat directories like that doesn't surprise me as much as the contents being readable. Is there not a bunch of binary garbage in between the filenames?


I remember `cat` on directories working on Unixen much newer than v4. Not sure if it ever was the case on Linux tho.
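On current Linux it definitely doesn't work: read(2) on a directory file descriptor is refused with EISDIR, which a small Linux-specific probe (function name mine, for illustration) demonstrates:

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Opening a directory read-only is allowed, but the read itself fails
// with EISDIR, which is why `cat /` dumps nothing rather than raw dirents.
bool read_dir_fails_eisdir(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;
    char buf[64];
    ssize_t n = read(fd, buf, sizeof buf);
    bool refused = (n == -1 && errno == EISDIR);
    close(fd);
    return refused;
}
```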


You can also press `s` to save data from a pipe to a file rather than manually copy pasting.


I came here to suggest the same! It's incredibly handy and I use it all the time at work: there's a process that runs for a very long time and I can't be sure ahead of time if the output it generates is going to be useful or not, but if it's useful I want to capture it. I usually just pipe it into `less` and then examine the contents once it's done running, and if needed I will use `s` to save it to a file.

(I suppose I could `tee`, but then I would always dump to a file even if it ends up being useless output.)


Yes, you can do this. Thanks for mentioning it; I was interested and checked how you would go about it.

1. Delete the shared symbol versioning, as per https://stackoverflow.com/a/73388939:

    patchelf --clear-symbol-version exp mybinary

2. Replace libc.so with a fake library that has the right version symbol, using a version script, e.g. version.map:

    GLIBC_2.29 { global: *; };

and an empty fake_libc.c:

    gcc -shared -fPIC -Wl,--version-script=version.map,-soname,libc.so.6 -o libc.so.6 fake_libc.c

3. Hope that you can still point the symbols back to the real libc (either by writing a giant pile of dlsym C code, or some other way; I'm unclear on this part)

Ideally glibc would stop checking the version if it's not actually marked as needed by any symbol, not sure why it doesn't (technically it's the same thing normally, so performance?).


Ah you can use https://github.com/NixOS/patchelf/pull/564

So you can do e.g. `patchelf --remove-needed-version libm.so.6 GLIBC_2.29 ./mybinary` instead of replacing glibc wholesale (steps 2 and 3), and assuming all of the glibc functionality used by the executable is ABI compatible, this will just work (it's worked for a small binary for me, YMMV).


> When you get into lower power, anything lower than Steam Deck, I think you’ll find that there’s an Arm chip that maybe is competitive with x86 offerings in that segment.

At which point does this pay off the emulation overhead? Fex has a lot of work to do bridging two ISAs while working from the black box of compiler-emitted assembly, right?


afaia, emulators like Fex are within 30 to 70% of native performance, worse or better on the fringes. But overall emulation seems totally fine. Plus, emulator technology in general could be used for binary optimization rather than strict mapping, opening up space for more optimization.


See also "Parse, don't validate (2019)" [0]

[0] https://news.ycombinator.com/item?id=41031585

