Hacker News
Smallest transformer that can add two 10-digit numbers (github.com/anadim)
67 points by ks2048 3 hours ago | 14 comments



I made a blog post on my submission (currently the top handwritten one at 36 parameters): https://alexlitzenberger.com/blog/building_a_minimal_transfo...

Not sure how well this fits the rules, but I saw someone on Twitter claim 28 params: https://gist.github.com/SeuperHakkerJa/da3050739bea97aabd86e...

> In short: if you can swap in a different set of weights and use the exact same inference code for a different task, your setup is legitimate. If the inference code is inseparable from the algorithm, it's not.

I wonder why they don't just write the inference code themselves, so that by design the focus stays on the model.


Would it make sense to embed such a single-purpose network with fixed weights within an LLM before pre-training?

You can do that in a single matmul of course.

So can you take an arbitrary transformer and somehow turn it into a compact set of low-power fast gates by some algorithm?

I think you're misunderstanding the joke.

Yes, the joke is:

    [A B]
times

    [1]
    [1]
is

    [A+B]
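A quick sketch of that joke in NumPy (my own illustration, not code from the repo): a single 1x2 "weight matrix" times a fixed ones vector sums its two entries in one matmul, no transformer required.

```python
import numpy as np

# The joke in one matmul: [A B] @ [[1], [1]] = [[A + B]]
A, B = 1234567890, 9876543210
W = np.array([[A, B]], dtype=np.int64)       # shape (1, 2): the "weights"
ones = np.array([[1], [1]], dtype=np.int64)  # shape (2, 1): fixed input
result = W @ ones                            # shape (1, 1)
print(result[0, 0])  # 11111111100
```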

From context, then, I infer that a transformer is not composed of matrix multiplications, because otherwise it would simply be one that adds two 10-digit numbers.

A transformer tokenizes its input, then does a bunch of matmuls and ReLUs arranged in a certain way. It doesn't get to see the raw number (just like you don't when you look at 1+1: you need your visual cortex etc. first).
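A minimal sketch of that point (my own, with a made-up digit vocabulary and random embedding table, not the repo's setup): by the time any matmul runs, the "number" is just a sequence of token ids mapped to embedding vectors.

```python
import numpy as np

# Hypothetical tokenizer: one token id per digit character.
vocab = {str(d): d for d in range(10)}
tokens = [vocab[c] for c in "1234567890"]  # the model's actual input

# Random embedding table, 10 tokens x 4 dims; the network only ever
# operates on these vectors, never on the integer 1234567890 itself.
rng = np.random.default_rng(0)
embed = rng.normal(size=(10, 4))
x = embed[tokens]
print(tokens)    # [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
print(x.shape)   # (10, 4)
```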

So hand-coded weights can do it with 36 params versus 311 for trained weights - did anyone try the former architecture, but starting from random weights and learning?

For one, the specific 36-parameter version is impossible without float64, so you might guess the corollary: it is not exactly amenable to being found by gradient descent. I think the question of how you can structure transformers, and neural nets in general, so that they can both represent things like this very parsimoniously and remain amenable to learning by gradient descent is an interesting one.
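A small illustration of why float64 matters here (my own sketch, not tied to the repo's construction): float32 carries a 24-bit significand, roughly 7 decimal digits, so a 10-digit integer can't even be represented exactly, while float64's 53-bit significand is exact for all integers up to 2**53.

```python
import numpy as np

n = 9_999_999_999  # a 10-digit number

# float32 rounds it to the nearest representable value (spacing is
# 1024 at this magnitude), while float64 holds it exactly.
print(int(np.float32(n)))  # 10000000000 (rounded)
print(int(np.float64(n)))  # 9999999999 (exact)
```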

>=99% accuracy wtf?!?

I was initially excited until I saw that, because it would have revealed some sort of required local minimum capacity; the further revelation that this was all vibe coded with no arXiv paper makes me feel I should save my attn for another article.


Now wrap it all in an Electron app!


