
Every time someone writes some really carefully micro-optimized piece of code like this, I worry that the implementation won't be shared with the whole world.

This code only makes people's lives better if many languages and frameworks that translate Latin-1 to UTF-8 are updated to use this new, faster implementation.

If this took 3 days to write and benchmark, then to save 3 days of human time, we probably need to get this into the hands of hundreds of millions of people, saving each person a few hundred microseconds.
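
(Rough arithmetic, mine, using those numbers: 3 days is about 259,200 seconds, and at ~500 µs saved per use that works out to 259,200 / 0.0005 ≈ 518 million conversions before the authoring time is paid back, hence "hundreds of millions".)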



The author is a French Canadian academic at Université du Québec à Montréal. He is one of the more famous figures in computer science in all of Canada, with over 5000 citations (which is stretching the meaning of famous, but still). This is not closed-source work optimizing some company product; this is research for publication on his blog or in computer science journals.


He’s one of the most famous computer scientists in general!

The audience for wicked-clever, low/no-branch, cache-aware SIMD sorcery is admittedly not everyone, but if you end up with that kind of problem, this is a go-to!


> I worry that the implementation won't be shared with the whole world.

Considering the author also created https://github.com/simdutf/simdutf, it's likely used or will be used in Node.js, amongst other things. Is that good enough?


> This code only makes people's lives better if many languages and frameworks that translate Latin-1 to UTF-8 are updated to use this new, faster implementation.

Except CPUs evolve, and what was once a fast way of doing things may no longer be very fast. And with ASM you have no compiler to generate better-targeted instructions.

I've seen many instances where significant performance was gained by swapping out an old hand-written ASM routine for a plain-language version.

If you ever add some optimized ASM to your code, do a performance check at startup or similar, and have the plain-language version as a fallback.
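
Something like this, roughly (a toy sketch assuming x86 GCC/Clang; the function names are made up, and the "fast" version here just stands in for a real optimized kernel such as the one in the article):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Portable fallback: Latin-1 bytes 0x00-0x7F map to one UTF-8 byte,
       0x80-0xFF map to two bytes. Returns the number of bytes written. */
    static size_t latin1_to_utf8_plain(const uint8_t *in, size_t len, uint8_t *out) {
        size_t n = 0;
        for (size_t i = 0; i < len; i++) {
            uint8_t c = in[i];
            if (c < 0x80) {
                out[n++] = c;
            } else {
                out[n++] = 0xC0 | (c >> 6);
                out[n++] = 0x80 | (c & 0x3F);
            }
        }
        return n;
    }

    /* Stand-in for a hand-optimized kernel (imagine AVX-512 intrinsics here). */
    static size_t latin1_to_utf8_fast(const uint8_t *in, size_t len, uint8_t *out) {
        return latin1_to_utf8_plain(in, len, out);
    }

    typedef size_t (*convert_fn)(const uint8_t *, size_t, uint8_t *);
    static convert_fn latin1_to_utf8;

    /* Run once at startup: use the optimized routine only if the CPU has the
       required extension (GCC/Clang builtin; the exact feature to test depends
       on the kernel). */
    static void pick_implementation(void) {
        __builtin_cpu_init();
        latin1_to_utf8 = __builtin_cpu_supports("avx512bw")
                             ? latin1_to_utf8_fast
                             : latin1_to_utf8_plain;
    }

    int main(void) {
        const uint8_t latin1[] = "caf\xe9";  /* "café" in Latin-1 */
        uint8_t utf8[2 * sizeof latin1];
        pick_implementation();
        size_t n = latin1_to_utf8(latin1, strlen((const char *)latin1), utf8);
        fwrite(utf8, 1, n, stdout);
        putchar('\n');
        return 0;
    }

The selection step could also time both versions on a small buffer at startup instead of only keying off CPU feature flags.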


It is written with intrinsics, not ASM.

Compilers understand intrinsics and can optimize around them, and CPUs evolve improved SIMD instruction sets at a snail's pace.
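
For the first point, a tiny made-up illustration: intrinsics go through the optimizer like ordinary code, so with -O2 both GCC and Clang will typically fold this whole chain into a single vector constant, and they will likewise inline, unroll, and reschedule loops built from intrinsics.

    #include <immintrin.h>

    __m128i three_vector(void) {
        __m128i a = _mm_set1_epi32(1);
        __m128i b = _mm_set1_epi32(2);
        return _mm_add_epi32(a, b);  /* usually folded at -O2 to the constant {3,3,3,3} */
    }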

Intel doesn't even really support AVX-512 yet for consumer hardware, and maybe never will, so this code is mostly only good for very modern AMD.


I'm talking about which instructions and idioms are optimal. AFAIK, with intrinsics the compiler won't completely change what you've written.

Back in the day, REP MOVSB was the fastest way to copy bytes; then the Pentium came along and rolling your own loop was better. Then CPUs improved and REP MOVSB was suddenly better again[1], for those CPUs. And then it changed again...

Similar story for other idioms where implementation details on CPUs change. Compilers can respond and target your exact CPU.

[1]: https://github.com/golang/go/issues/14630 (notice how one commenter reports that the same patch that gives OP a 1.6x boost gives them a 5x degradation)
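
To make that concrete, here is a toy sketch (mine; x86-64 GCC/Clang inline-asm syntax assumed): the plain loop leaves the copy idiom to the compiler, which can retarget it per CPU via -march=..., while the hand-written version is pinned to REP MOVSB no matter what CPU it ends up running on.

    #include <stddef.h>

    /* The compiler may lower this to a libc memcpy call, SIMD moves, or
       rep movsb, depending on the target it is tuned for. */
    void copy_plain(char *dst, const char *src, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }

    /* This is always REP MOVSB, whatever the CPU. */
    void copy_rep_movsb(char *dst, const char *src, size_t n) {
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
    }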


What do you mean "optimize around them"? Do you have a godbolt/codegen example of suboptimal intrinsic calls being optimized?


You should also worry about how other people's time is wasted when you miss important details and then comment about easily assuaged worries.

Quoting the article: "I use GCC 11 on an Ice Lake server. My source code is available.", linking to https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...

From the README at the top-level:

> Unless otherwise stated, I make no copyright claim on this code: you may consider it to be in the public domain.

> Don't bother forking this code: just steal it.


Are you also worried about my hobby vegetable garden being a waste of time?

I'm sure I could get my tomato fix at the farmers market.


Is AVX-512 broadly available and error-free, with no stalls, slowdowns, or other side effects? For a long time it felt like a niche Intel thing.


In terms of being broadly available, most of AVX-512 (ER, PF, 4FMAPS, and 4VNNIW haven't been available on any new hardware since 2017) is available on basically any Intel CPU manufactured since 2020, as well as on all AMD Zen 4 (2022 and on) CPUs.

https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512

I can't speak to being error free or other issues but it should at the very least be present on any modern desktop, laptop, or server x86 CPU you could buy today.

Edit: I forgot to mention that Intel's Alder Lake CPUs only have partial support, presumably due to some issue with the E-cores. I'd guess Intel will get their shit together eventually wrt this now that AMD is shipping all their hardware with this instruction set.


Intel seems to be going for market segmentation, with AVX-512 only available on their server CPUs. The option to enable AVX-512 has been removed from Alder Lake CPUs since 2022, and there is no AVX-512 on Raptor Lake.

AMD also keeps making and selling Zen 3 and Zen 2 chips as lower-cost products, and those do not have AVX-512.


With AVX10, Intel will make the instructions available again on all segments. SIMD register width will vary between cores, but the instructions will be there.


I don't think it was intentional market segmentation, just poor planning: the whole heterogeneous-cores strategy seems to have been thrown together in a hurry, and they didn't have time to add AVX-512 to their Atom cores in an area-efficient way (so as not to negate the point of having E-cores).


>most of AVX-512 is available on basically any Intel CPU manufactured since 2020

That's incorrect. On the consumer CPU side, Intel introduced AVX-512 for one generation in 2021 (Rocket Lake), but then removed AVX-512 from the subsequent Alder Lake via BIOS updates, and fused it off in later revisions. It's also absent from the current Raptor Lake. So actually it's only available on Intel's server-grade CPUs.

>Edit: I forgot to mention that Intel's Alder Lake CPUs only have partial support, presumably due to some issue with the E-cores.

No, this wiki page is outdated.


The latest Intel architecture (Sapphire Rapids) supports it without downclocking. AMD Zen 4 also supports it, although their implementation is double pumped; I'm not sure what the real-world performance impact of that is.


There is a lot of confusion about this "double pumped" thing.

All that this means is that Zen 4 uses the same execution units both for 256-bit operations and for 512-bit operations. This means that the throughput in instructions per cycle for 512-bit operations is half of that for 256-bit operations, but the throughput in bytes per cycle is the same.
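
(To put illustrative numbers on that: four 256-bit register-register instructions per cycle become two 512-bit instructions per cycle when double pumped, and 4 × 32 bytes = 2 × 64 bytes = 128 bytes of register data either way, so the data rate is unchanged even though the instruction rate halves.)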

However the 512-bit operations need fewer resources for instruction fetching and decoding and for micro-operation storing and dispatching, so in most cases using 512-bit instructions on Zen 4 provides a big speed-up.

Even if Zen 4 is "double pumped", its 256-bit throughput is higher than that of Sapphire Rapids, so after dividing by two, for most instructions it has exactly the same 512-bit throughput as Sapphire Rapids, i.e. two 512-bit register-register instructions per cycle.

The only exceptions are that Sapphire Rapids (with the exception of the cheap SKUs) can do 2 FMA instructions per cycle, while Zen 4 can do only 1 FMA + 1 FADD instructions per cycle, and that Sapphire Rapids has a double throughput for loads and stores from the L1 cache memory. There are also a few 512-bit instructions where Zen 4 has better throughput or latency than Sapphire Rapids, e.g. some of the shuffles.


It's unlikely that this makes anyone's life better. It is more of a curiosity, and maybe a teachable example of how to do SIMD. I would venture a guess that there are very few workloads that require this conversion for more than a few KB, and over time, as the world migrates to Unicode, it will become less and less common.


It is mostly educational code. Once you learn AVX-512, you can get boosts in many areas.


Not every human action is required to move the GDP line upwards.



