Benchmarks for concurrent hash map implementations in Go

tl2do · 2026-02-23T22:27:45 1771885665

I ran benchmarks comparing xsync.Map's memory allocation against orcaman/concurrent-map.

Pure overwrite workload (pre-allocated values): xsync.Map: 24 B/op 1 alloc/op 31.89 ns/op orcaman/concurrent-map: 0 B/op 0 alloc/op 70.72 ns/op

Real-world mixed (80% overwrites, 20% new): xsync.Map: 57 B/op 2 allocs/op 218.1 ns/op orcaman/concurrent-map: 63 B/op 3 allocs/op 283.1 ns/op

Go maps reuse memory on overwrites, which is why orcaman achieves 0 B/op for pure updates. xsync's custom bucket structure allocates 24 B/op per write even when overwriting existing keys.

At 1M writes/second with 90% overwrites: xsync allocates ~27 MB/s, orcaman ~6 MB/s. The trade is 24 bytes/op for 2x speed under contention. Whether this matters depends on whether your bottleneck is CPU or memory allocation.

Benchmark code: standard Go testing framework, 8 workers, 100k keys.

puzpuzpuz-hn · 2026-02-24T07:58:35 1771919915

Allocation rates comparison is included. If your application writes into the map most of the time, you should go with plain map + RWMutex (or orcaman/concurrent-map). But if, for instance, you're using the map as a cache, read operations will dominate and having better read scalability becomes important. As an example, Otter cache library uses a modified variant of xsync.Map, not a plain map + RWMutex.

hinkley · 2026-02-24T01:07:22 1771895242

How does reuse avoid false sharing between cores? Since this is concurrent hashmap we are talking about.

tl2do · 2026-02-24T04:06:32 1771905992

I focused on B/op because it was the only apparent weakness I saw. My “reuse” note was about allocation behavior, not false sharing. We’re talking about different concerns.

hinkley · 2026-02-24T06:48:28 1771915708

Allocation behavior when one core deletes and another adds and they reuse the same memory allocation is what I thought you meant.

Is that what you meant? Because if it is then you now have potential for the problem I described.

withinboredom · 2026-02-23T21:29:28 1771882168

Looks good! There's an important thing missing from the benchmarks though:

- cpu usage under concurrency: many of these spin-lock or use atomics, which can use up to 100% cpu time just spinning.

- latency under concurrency: atomics cause cache-line bouncing which kills latency, especially p99 latency

puzpuzpuz-hn · 2026-02-24T07:42:03 1771918923

Yup, that's a valid point. I'll consider adding these metrics.

withinboredom · 2026-02-25T12:10:18 1772021418

Am I reading the benchmark code that uses the same prefix for all string keys? This would be pathological for any trie-based implementation.

vanderZwan · 2026-02-23T20:50:54 1771879854

I don't write Go but respect to the author for trying to list trade-off considerations for each of the implementations tested, and not just proclaim their library the overal winner.

puzpuzpuz-hn · 2026-02-24T07:52:46 1771919566

Thanks. There are downsides in each approach, e.g. if you care about minimal allocation rate, you should go with plain map + RWMutex. So yeah, no silver bullet.

vanderZwan · 2026-02-25T10:04:48 1772013888

There almost never is. The fact that you acknowledge it and give context only would make me more confident in trying out your library, or any of the other listed (if I wrote Go code, that is).

eatonphil · 2026-02-23T20:56:05 1771880165

Will we also eventually get a generic sync.Map?

darkr · 2026-02-24T00:45:31 1771893931

It’d be nice to have in stdlib, but it’s pretty trivial to write a generic wrapper for it

jeffbee · 2026-02-23T21:32:17 1771882337

Almost certainly, since the internal HashTrieMap is already generic. But for now this author's package stands in nicely.

puzpuzpuz-hn · 2026-02-24T07:46:57 1771919217

Would be great to see that - there are multiple GH issues for that. But so far, I'm not convinced that Google prioritizes community requests over its own needs.

candiddevmike · 2026-02-23T22:37:56 1771886276

Idk why but I tend to shy away from non std libs that use unsafe (like xsync). I'm sure the code is fine, but I'd rather take the performance hit I guess.

puzpuzpuz-hn · 2026-02-24T07:51:43 1771919503

Unsafe usage in the recent xsync versions is very limited (runtime.cheaprand only). On the other hand, your point is valid and it'd be great to see standard library improvements.

mappu · 2026-02-23T23:52:32 1771890752

A few release cycles back, Swiss Maps became popular (i think, particular thanks to CockroachDB) as a replacement for standard Go map[K]V.

Later, Go's stdlib map implementation was updated to use Swiss Maps internally and everyone benefited.

Do you think the xsync.Map could be considered for upstreaming? Especially if it outperforms sync.Map at all the same use cases.

puzpuzpuz-hn · 2026-02-24T07:48:46 1771919326

There are multiple GH issues around better sync.Map. Among other alternatives, xsync.Map is also mentioned. But Golang core team doesn't seem interested in sync.Map (or a generic variant of it) improvements.

kgeist · 2026-02-24T01:28:16 1771896496

Orcaman is a very straightforward implementation (just sharded RW locks and backing maps), but it limits the number of shards to a fixed 32. I wonder what the benchmarks would look like if the shard count were increased to 64, 128, etc.

puzpuzpuz-hn · 2026-02-24T07:49:54 1771919394

My box is 12c/24t only, so it won't make any difference. But on a beefy box, it may improve performance in high cardinality key scenarios.

nasretdinov · 2026-02-24T11:29:50 1771932590

It potentially still might make a difference due to reduced contention: if we have more shards the chances of two or more goroutines hitting the same shard would be lower. In my mind the only downside to having more shards is the upfront cost, so it might slow down the smallest example only

kgeist · 2026-02-24T16:04:31 1771949071

The upfront cost isn't that big: a Go map+RW lock is probably a few hundred bytes. Allocating them costs far below 1 ms.

2026-02-24T05:33:04 1771911184

[dead]

puzpuzpuz-hn · 2026-02-24T07:45:57 1771919157

Allocation rates are also compared. Long story short, vanilla map + RWMutex (or a sharded variant of it like orcaman/concurrent-map) is the way to go if you want to minimize allocations. On the other hand, if reads dominate your workload, using one of custom concurrent maps may be a good idea.