One of the questions asked but not answered is: "why is it more/less expensive to use threads vs. coroutines?"
This is my attempt at an answer (it comes from working on a database engine, where you also run into "why is it not a good idea to let the OS handle things for you?"):
- It's partially historical...
- ...and changing it could break things in funny ways (you can bet some code somehow depends on these numbers!)
- The stack size is a GLOBAL choice for all the apps running, so the OS is blind to what would be optimal (and even if you can tweak it, the OS CAN'T optimize for your rare case)
- It must work across many workloads
- MOST code runs as Process -> small number of threads with CPU-intensive tasks (i.e. "I pay $$$$$ for a Threadripper and my app is equally slow!")
And probably the #1:
- The main case is 1 UI thread + a small N of background threads, so it's better to pre-allocate for "bigger" capacity
---
Then, "why are coroutines and similar better?" (It's easier to see when you implement coroutines yourself, but let's ignore that and use the same mental framework as above.)
- Even a "general-purpose" language has a more defined role ("Go is for networking apps"), and this leaks, strongly, into whether the language is biased towards coroutines/other or threads (note how Rust can't pick a side, because its role is as broad as a full OS)
- Langs that pick coroutines/other over threads are optimized for high concurrency, because it's expected that developers WILL create significant libraries/frameworks for the smaller subset of high-concurrency apps (like web servers)
- Langs that pick threads over coroutines/other are more general-purpose (Rust, Java, ...) or need to run faster for CPU-intensive/parallel code, and there threads are in fact faster than coroutines.
Coroutines are cooperative scheduling, and come with all of its drawbacks.
They're really really good for high performance proxies though, since they make a very small number of cheap decisions for each request, and then punt the bit shuffling to the OS (which uses threads). These proxies are ubiquitous and well regarded (nginx, haproxy, linkerd, etc) so they take up maybe more mind share than they should.
I don't think the 8KB calculation is quite right - there are allocations in the kernel for every thread as well, and I don't think those will be accounted for by /usr/bin/time.
Indeed. Making the program wait (using a barrier) after all the threads have started, I see a substantially higher memory usage than reported by time. I only used 100k threads. For me, time reports 1018376kB, but looking at /proc/meminfo I see a difference of 5231092kB - so about 52kB per thread. That's a far cry from 8MB, but also not 8kB.
/proc/slabinfo shows some drastic differences:
`slabtop -o | head -n 15` with 100k threads running:

```
Active / Total Objects (% used)    : 10134567 / 10280497 (98.6%)
Active / Total Slabs (% used)      : 337756 / 337756 (100.0%)
Active / Total Caches (% used)     : 132 / 175 (75.4%)
Active / Total Size (% used)       : 3598333.95K / 3674904.09K (97.9%)
Minimum / Average / Maximum Object : 0.01K / 0.36K / 11.81K

   OBJS  ACTIVE  USE  OBJ SIZE  SLABS  OBJ/SLAB  CACHE SIZE  NAME
3419013 3418579  99%    0.10K   87667        39    350668K   buffer_head
1485372 1484947  99%    0.19K   35366        42    282928K   dentry
 635299  635038  99%    0.68K   13517        47    432544K   proc_inode_cache
 575792  575708  99%    1.14K   20564        28    658048K   ext4_inode_cache
 514794  514703  99%    0.04K    5047       102     20188K   extent_status
 503140  502198  99%    0.18K   11435        44     91480K   vm_area_struct
 501738  500514  99%    0.04K    4919       102     19676K   vma_lock
 381304  378068  99%    0.57K   13618        28    217888K   radix_tree_node
```
without:

```
Active / Total Objects (% used)    : 7235070 / 7678039 (94.2%)
Active / Total Slabs (% used)      : 187813 / 187813 (100.0%)
Active / Total Caches (% used)     : 132 / 175 (75.4%)
Active / Total Size (% used)       : 1751469.45K / 1851083.31K (94.6%)
Minimum / Average / Maximum Object : 0.01K / 0.24K / 11.81K

   OBJS  ACTIVE  USE  OBJ SIZE  SLABS  OBJ/SLAB  CACHE SIZE  NAME
3419013 3418579  99%    0.10K   87667        39    350668K   buffer_head
 898296  887716  98%    0.19K   21388        42    171104K   dentry
 575792  575708  99%    1.14K   20564        28    658048K   ext4_inode_cache
 514794  514703  99%    0.04K    5047       102     20188K   extent_status
 381304  376987  98%    0.57K   13618        28    217888K   radix_tree_node
 164352  164352 100%    0.06K    2568        64     10272K   kmalloc-rcl-64
 157360   53925  34%    0.07K    2810        56     11240K   vmap_area
 139638  109093  78%    0.04K    1369       102      5476K   vma_lock
```
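The barrier measurement described above can be sketched in a few lines. This is a rough, Linux-only illustration in Python (it reads `/proc/self/status`, and the thread count and absolute numbers are arbitrary; Python threads carry interpreter overhead, so the per-thread figure will differ from the C case):

```python
import re
import threading

def rss_kb():
    # Resident set size of this process, from /proc (Linux-specific).
    with open("/proc/self/status") as f:
        return int(re.search(r"VmRSS:\s*(\d+) kB", f.read()).group(1))

N = 200
start = threading.Barrier(N + 1)   # signals that all threads are alive
done = threading.Barrier(N + 1)    # parks them while we measure

def parked():
    start.wait()
    done.wait()

before = rss_kb()
threads = [threading.Thread(target=parked) for _ in range(N)]
for t in threads:
    t.start()
start.wait()                       # all N threads alive and parked
after = rss_kb()

per_thread_kb = (after - before) / N
print(f"~{per_thread_kb:.1f} kB resident per thread")

done.wait()                        # release the parked threads
for t in threads:
    t.join()
```

Note this only captures the process-visible resident cost; as the parent comment points out, kernel-side slab allocations per thread are not visible here.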
Don't most applications that use threads also use thread pools because thread creation is expensive?
I guess these comparisons are nice for getting a feeling of how expensive operations are. But does it have much real-world relevance?
In practice, my beef with coroutines is that with unlimited concurrency you run into resource limits once the system gets some load (file descriptors, open connections, ... whatever limited resource the coroutine uses) or someone starts a DoS attack. This is something no one ever tests, and something that appears randomly in production depending on the data fed to the system.
With many developers using them as the default solution when things get slow, that creates brittle systems. How is everyone else dealing with things like this?
Yes, but thread pools kind of suck for workloads with multiple I/O operations per "run", because they end up with either coloring problems, blocking inefficiently, or doing extra context switching. Also, most developers set the pool size to the hardware parallelization limit, which worsens either the context switching cost or the ability to be smart about synchronization.
At the bare minimum, you want your userspace scheduling to include synchronization tools in addition to a threading abstraction - you have more information available than the kernel about which threads are blocking which others, so you can be smart about starting threads you know are unblocked and not leaving hardware idle while there’s work to be done.
Regarding load, a lot of people use infrastructure tools to limit that these days before the load hits their application - without that, your gripe is not actually a solvable problem in the application itself. You’ll incur an inherent amount of I/O and scheduling thrashing for each ingress event which is completely unavoidable without dropping requests - even with fixed concurrency in your application code you’ll get this. However, as seen in the blog, the problem is not actually with internal application concurrency at all - Golang can handle 1mm active goroutines no sweat.
The problem instead is that your hardware simply can't perform HTTPS (or, god forbid, JSON) deserialization fast enough to handle the throughput getting thrown at you, which is distinct from the number of active requests. If you are gonna make a very long running call to some other service and otherwise do nothing in that goroutine while doing so, you can actually handle absurd levels of concurrency, easily in the hundreds and probably thousands or tens of thousands. But you still can't handle 8 virtual cores trying to deserialize 100 JSON bodies every 10ms.
Well, ordinarily, you just know your specs and your requirements, and install rate limiting and load balancing. Just standard availability & resilience. With proper resource constraints and observability in place, your system should be fine in most cases.
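One application-side piece of the "proper resource constraints" mentioned above is simply bounding in-flight concurrency. A minimal sketch in Python, using asyncio.Semaphore as the limiter (the limit and task count are arbitrary, and the sleep stands in for real I/O):

```python
import asyncio

LIMIT = 10  # arbitrary cap on in-flight work

async def main():
    sem = asyncio.Semaphore(LIMIT)
    in_flight = 0
    peak = 0

    async def handle(i):
        nonlocal in_flight, peak
        async with sem:                    # blocks when LIMIT tasks are active
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.001)     # stand-in for the real I/O
            in_flight -= 1

    # 300 "requests" arrive at once, but never more than LIMIT run together.
    await asyncio.gather(*(handle(i) for i in range(300)))
    return peak

peak = asyncio.run(main())
print("peak in-flight:", peak)
```

The same shape works for any limited resource (file descriptors, connections): acquire before using it, so overload turns into queueing instead of exhaustion.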
I think it’s standard practice to plan for failure and simulate adversary access patterns. On the other hand you say “no one ever tests” stuff like that. I suppose it’s not that standard after all.
A significant difference is that goroutine stacks can grow, but also shrink, thus releasing memory after a temporary stack growth. Threads won’t do that.
Note that lightweight threads (even in a CSP-style concurrency model) can be even lighter and faster. F#'s Hopac library seems to meet such expectations with this simple program:

```
#r "nuget:hopac"
open Hopac
open Hopac.Infixes
open Hopac.Extensions

seq { 1..1000000 }
|> Seq.Con.iterJobIgnore (fun _ -> timeOutMillis 1000)
|> run
```

Runs in 1.824s real, 17.635s user and 0.088s system time in under 186000KB, hence around 200 bytes or less per thread.
10M threads sleeping 10 seconds taxes this machine to:
22s real, 4m user, 0.506s system @ 140 bytes/thread.
I feel like this glossed over how when entering something like epoll, Go can optimize switching to another goroutine. Wouldn't the epoll block until the calling kernel thread had a file descriptor ready?
Go can multiplex goroutines into a single syscall invocation. Two goroutines waiting on distinct file descriptors will use a single epoll (or similar) syscall. However not all syscalls can be multiplexed this way.
IMO this partially undermines the "Go doesn't have colored functions" claim. 1k goroutines calling read is fine, because read is multiplexed, but 1k goroutines calling stat will be painful, because you get 1k kernel threads. Go's "red functions" are those which require a dedicated kernel thread.
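The same split shows up in other event-loop runtimes. As an analogy (Python's asyncio rather than Go, and the paths here are arbitrary): socket reads are multiplexed on the loop's selector, but a call like os.stat has no non-blocking form, so the runtime must hand it to a real thread - one per concurrent call:

```python
import asyncio
import os

async def stat_many(paths):
    loop = asyncio.get_running_loop()
    # os.stat can't be multiplexed on epoll, so each concurrent call is
    # pushed to the default thread pool and occupies a real OS thread,
    # much like Go parking a non-pollable syscall on a kernel thread.
    return await asyncio.gather(
        *(loop.run_in_executor(None, os.stat, p) for p in paths)
    )

results = asyncio.run(stat_many(["."] * 8))
print(len(results), "stat results")
```

With 1k concurrent stat calls, the pool (or, in Go's case, the set of kernel threads) becomes the bottleneck regardless of how cheap the green threads are.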
I wish this was stated more. A lot of people new to Go have this air of mystery surrounding what coroutines and goroutines are. They'll do something and it'll be fast! Then they'll do something else and feel like they are running hardware from the 90s. Not everyone has the know-how to strace their applications and think through why something at that level is happening.
I think it just undermines "colorless functions are good", not that Go's functions are colored - they aren't. The problem is that "colors" (i.e. "this is async") are helpful to the caller: they let them know which context they should be in when they perform that work.
It sounds like blocking calls switch to a system stack and return the Go stack to the executor pool, but I don’t have source links to back up that claim.
epoll_wait takes a timeout, which if 0 means it will return immediately even if no events are pending.
You wouldn't necessarily need to make use of this feature to implement something like Go's scheduler. Jumping to the semantics of epoll skips most of the interesting bits about how Goroutines work. The source code will always beat a blog post. If you're curious, you could work backwards from here: https://go.dev/src/runtime/netpoll_epoll.go Note that interface does indeed expose the timeout argument to the higher-level scheduling code; see line 92.
I am aware of epoll_wait and friends (select/poll) accepting timeouts, and how that plugs into a general event loop like in libuv.
However if memory serves, these calls are reasonably expensive (on the order of microseconds, and _worse_ if readiness requires putting the current thread to sleep and then waking it up again). However https://www.ardanlabs.com/blog/2018/08/scheduling-in-go-part... says goroutines are closer to 200ns.
However the same source above does a decent job of explaining how epoll is being called "out of band". I am guessing when they talk about context switching costs, they are doing some sort of amortization of these system call costs. Since the actual OS threads try to remain CPU-affine (thread-per-core, LMAX disruptor etc. etc.), a lot of the OS scheduling costs can be skipped.
Thank you for the pointer to the source code. It gives me a starting point.
The key is that epoll _isn't_ blocking: epoll_create returns immediately with a file descriptor. Go can use it to issue some sort of i/o operation and rather than blocking the kernel thread, it swaps it to another goroutine that needs work done. When epoll notifies the go scheduler the task is complete via that file descriptor, it will then resume the original goroutine.
So, I'm not familiar with the go event loop specifically, but I have written a few event loops myself.
You can think of an event loop as multiplexing across multiple sources of events: for example a local ready queue, a remote queue (for tasks coming from other threads), an io-queue, a timer queue. Epoll is used for the io-queue. If the local ready queue is empty (i.e. there are no runnable coroutines), the event loop will block in the epoll call, possibly until the next timer expiration if any; remote queue operations will wake up the poll via something like an eventfd. If the local queue is non-empty, epoll is called with a 0 timeout every N local queue pops (actual scheduling logic and priority is application specific of course).
Polling epoll with a 0 timeout is a bit wasteful; for high performance applications you can use io_uring so checking for a non-empty queue is very cheap, and/or use some network bypass library that makes polling cheap. There is also old-school readiness notification via signals, but I doubt that still sees any use.
As you suggest, you could also have a dedicated thread per event loop just running poll (for example a sibling hyperthread), but it seems wasteful.
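The loop described above (block in poll when idle, zero-timeout poll when local work is pending) can be sketched minimally. This is a toy illustration in Python using the `selectors` module (epoll on Linux) with a pipe as the I/O source and plain callbacks standing in for coroutines:

```python
import os
import selectors
from collections import deque

sel = selectors.DefaultSelector()   # wraps epoll on Linux
ready = deque()                     # local ready queue of runnable tasks
results = []

r, w = os.pipe()
# The registered callback is the "coroutine" resumed on I/O readiness.
sel.register(r, selectors.EVENT_READ,
             data=lambda: results.append(os.read(r, 16)))

ready.append(lambda: results.append(b"local-task"))
os.write(w, b"io-event")            # pending I/O event

while len(results) < 2:
    # Block in poll only when there is no local work; otherwise poll with
    # a zero timeout so runnable tasks are not delayed by the kernel.
    timeout = 0 if ready else None
    for key, _ in sel.select(timeout):
        key.data()                  # dispatch the I/O completion
    while ready:
        ready.popleft()()           # run local tasks

sel.close()
os.close(r)
os.close(w)
print(results)
```

A real loop would also fold in timers (by computing the poll timeout from the next expiry) and a remote queue woken via an eventfd, as described above.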
I think this is pretty well covered in the article:
> Now you might think that to make a new thread I need 8 megabytes of RAM free. But thanks to the magic of virtual memory and overcommit, that’s not necessarily the case. The right way to think about this is that the OS, let’s assume 64bit, is going to allocate you your own private range but this doesn’t really have to be backed by anything. There are alot of 8mb blocks in a 64bit address space. However there is some book keeping overhead as the kernel tracks it’s IOUs.
This is to say that just because you allocate 8MB doesn't mean you create the backing pages and page them in.
The article mentions Nodejs libuv which makes everything async but not multithreaded. I want to combine the benefits of kernel threads with coroutines or goroutines/green threads/lightweight threads. (If anybody knows anything specifically about fibers, I'd appreciate you telling me that because I'm not familiar with them.)
I have a lightweight thread scheduler https://github.com/samsquire/preemptible-thread which is a 1:M:N scheduler (1 scheduler thread, M kernel threads, N lightweight threads) with the lightweight threads being multiplexed on the kernel threads.
I am working on a multithreaded architecture which I currently call 3 tier multithreaded architecture for lack of a better name. It combines request parallelism with IO and CPU parallelism and intra request parallelism.
We split kernel threads into three groups: app threads, which run lightweight threads; IO threads (liburing/epoll); and a traditional CPU threadpool with work stealing.
* There should be low latency for handling IO, because the liburing or epoll loop is dedicated to serving that request and then dispatching it as a CPU-heavy task or an app-level task.
* The IO threads have buffers that other threads can write to in order to queue up data for sockets.
* The IO threads are basically listeners for IO events; they either spawn a lightweight thread to run on the app threads or submit a CPU task.
* The app threads coordinate parallelism WITHIN a request: they spawn CPU tasks and IO tasks.
* You can also spawn app threads to do work outside the context of a request, such as server tasks or scheduled work.
* You never put a while (true) in a CPU task and only use cooperatively preemptable loops in the lightweight app threads.
* The IO threads do very little work.
* Nobody should block the app threads because that would prevent other requests from completing.
* Lock-free ring buffers are used by CPU threads and lightweight threads to communicate.
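The dispatch path of this three-tier design can be sketched with standard queues. This is a simplified illustration (blocking queues instead of lock-free ring buffers, one app thread, and made-up payloads; names like `io_events` and `cpu_task` are mine, not from the project):

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

io_events = queue.Queue()   # fed by the IO (epoll/io_uring) threads
results = queue.Queue()
cpu_pool = ThreadPoolExecutor(max_workers=4)   # the CPU thread group

def cpu_task(payload):
    return sum(payload)     # stand-in for CPU-heavy work

def app_thread():
    # App threads coordinate work within a request: they take IO events
    # and dispatch CPU tasks, never blocking on long computations.
    while True:
        event = io_events.get()
        if event is None:   # shutdown sentinel
            break
        fut = cpu_pool.submit(cpu_task, event)
        fut.add_done_callback(lambda f: results.put(f.result()))

t = threading.Thread(target=app_thread)
t.start()
# Pretend the IO thread observed three requests.
for req in ([1, 2], [3, 4], [5, 6]):
    io_events.put(req)
io_events.put(None)
t.join()
cpu_pool.shutdown()         # waits for CPU tasks (and their callbacks)
print(sorted(results.queue))
```

The real design replaces the blocking queues with lock-free ring buffers and runs many app threads, but the flow - IO event in, CPU task out, result back via a queue - is the same.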
> I want to combine the benefits of kernel threads with coroutines or goroutines/green threads/lightweight threads. (If anybody knows anything specifically about fibers, I'd appreciate you telling me that because I'm not familiar with them.)
Fibers mean a couple different things these days. The basics are usually things like cooperative scheduling, different context switching and pre-emption semantics (using special synchronization libraries) that make it so you can have huge numbers of active thread-like (fibrous) things with less overhead. They also usually support a tree-like structure that enables things like cancellation. Sometimes they re-use pthreads between fibers under the hood (when one fiber completes, the next fiber gets the old pthread) to amortize thread creation and destruction, or sometimes they pack multiple fibers into a single thread for further amortization and to reduce overhead.
Your
```
while (this.parent.isScheduled(this)) {
```
in the thread function itself confuses me a bit, as typically you'd expect the code to just run when scheduled and not run otherwise - are you keeping all your lightweight threads running/scheduled by the OS, but bit-flipping the printing from off to on based on whether your scheduler determines it to be active? If so, to get the actual performance benefits, you need to keep the lightweight threads off the kernel scheduler itself.
Thanks so much for your reply and explanation of fibers. I still don't get it yet but I'll do some more investigation and ask ChatGPT.
That code should only run when the thread is scheduled; the guard is to prevent the lightweight thread from blocking the kernel worker's loop over lightweight threads.
The lightweight threads run in worker threads and the scheduler is a different thread to that.
You would have to remove a lightweight thread from kernel_thread.lightweight_threads to deschedule it completely. Presumably you'd do this if you were waiting for an event.
I assume all my lightweight threads are ready to run and all want to run. But they need to relinquish the CPU when preempted by the scheduler.
In the C code of preemptible-thread, the user_function() of a lightweight thread is the body of the lightweight thread; it is called by one of the kernel threads assigned to the lightweight thread runtime. The kernel thread loops over each lightweight thread assigned to it.
The scheduler thread does NOT run lightweight threads. It's just used to generate signals on a time interval. It sets flags to interrupt loops.
In other words, a lightweight thread will only run until it is cooperatively preempted by the scheduler thread. I've been criticised for describing it as preemption but it's a cooperative preemption.
while (this.parent.isScheduled(this)) is used to cancel the execution of that lightweight thread's execution of user_function when the scheduler toggles its scheduled & preempted flag. It's the equivalent of while (true) but in a lightweight thread context. If I used while (true), the lightweight thread would never be descheduled for another lightweight thread to run.
It's a way of stopping the current thread's while loop.
There is one scheduler kernel thread that does the following on each time slice (ideally a few microseconds):

```
for kernel_thread in kernel_threads:
    for lightweight_thread in kernel_thread.lightweight_threads:
        lightweight_thread.preempt()
```
There are as many kernel worker threads as you have hardware threads. Each one runs lightweight threads one by one until the scheduler preempts the current one, then moves on to the next. The worker round-robins its lightweight threads:

```
while (running) {
    for (lightweight_thread in lightweight_threads) {
        lightweight_thread.markScheduled(true)
        // from here to the closing brace is the lightweight thread's body
        while (lightweight_thread.isScheduled(this)) { // while I'm not preempted
            registerLoop(0, 0, 10000000);
            for (int loopVar = initialVar(0, 0); getValue(0) < getLimit(0); loopVar = increment(0)) {
                Math.sqrt(getValue(0));
            }
        }
    }
}
```
I've inlined the lightweight thread's body in this code above.
The scheduler thread loops over the lightweight threads and preempts all currently running ones, while each kernel worker thread loops over its own lightweight threads, running each one while it is not preempted.
Cancellation is possible in this design. I haven't thought about removing lightweight threads from the runqueue, it's assumed every lightweight thread wants to run as much as possible. The scheduler can be enhanced to provide this.
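The flag-based cooperative preemption described in this design can be sketched compactly. A toy illustration (one worker, three lightweight threads, a threading.Event per flag standing in for the signal-set flags, and arbitrary slice lengths):

```python
import threading
import time

N = 3
scheduled = [threading.Event() for _ in range(N)]  # per-LWT "scheduled" flag
progress = [0] * N
stop = threading.Event()

def worker():
    # One kernel worker thread round-robins its lightweight threads.
    while not stop.is_set():
        for i in range(N):
            scheduled[i].set()
            # The lightweight thread's "body": run until preempted.
            while scheduled[i].is_set() and not stop.is_set():
                progress[i] += 1

def scheduler():
    # The scheduler thread runs no lightweight threads; it only clears
    # flags on each time slice, cooperatively preempting the bodies.
    while not stop.is_set():
        time.sleep(0.001)
        for ev in scheduled:
            ev.clear()

w = threading.Thread(target=worker)
s = threading.Thread(target=scheduler)
w.start()
s.start()
time.sleep(0.2)
stop.set()
w.join()
s.join()
print(progress)   # every lightweight thread made progress
```

After 0.2s, all three counters have advanced, showing the round-robin interleaving: no lightweight thread can monopolize the worker, because the scheduler clears its flag each slice.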
Running a separate scheduling thread seems very wasteful and non scalable. What is your preemption condition? Can't you just increment a counter in isScheduled(this) and synchronously invoke the scheduler every N or so checks?
The preemption condition is literally time. Just need to set a flag in threads after so much time.
There is only one scheduling thread.
That could work, but it would slow down the while loops of the lightweight threads. Still, it would work.
The design deliberately avoids putting if statements inside loops, for performance reasons.
If you have 128 core machine then 1 thread for scheduling might be worthwhile rather than slow down 128 cores worth of work with an if statement.
But is there a way to run a timer in the kernel or something so I don't need to waste a thread? It would presumably be asleep most of the time anyway.
This doesn't even mention tokio. Goroutines are green threads; Rust threads are OS-level threads. This is a cool comparison, but it should definitely include Rust's green-thread model like tokio for comparison.
I did some non-scientific testing with this last week and, at least for my problem (brute-forcing RC4 keys with a relatively small number of long-lived threads), threads and goroutines were approximately equal in terms of performance, with goroutines very slightly faster (around 5% or so).
I have another test I'm running with the same workload, but with more dynamically created threads (n keys dispatched to a pool of threads/goroutines until the key space is exhausted) but I haven't finished the threaded version for comparison yet.
Wouldn't you have exactly 1 thread per CPU for this sort of embarrassingly parallel, CPU-intensive brute-force computation? You should context switch approximately never, which makes coroutine vs. thread moot.
The point was more to see what the overhead is for CPU-bound tasks that do need to switch, not that this is the best approach for this particular task. I just happened to have the threaded version available, so I thought it made a good point of comparison.
> For example if one goroutine is writing to a channel and one goroutine is reading to a channel, it’s possible go can literally run the reader and the writer on the same thread and take a fast path where the writer goroutine calls send and that triggers the current thread to switchto the reader goroutine.
> ancient legacy registers/flags
Right, the problem is the kernel scheduler has less context on the relationships between threads, less context on application/thread state, and solves a bunch of problems you probably don’t care about, while juggling a bunch of other stuff besides scheduling.
The magic of user space scheduling is not that scary, it’s packing more concurrency into less OS threads, using synchronization to be smart about waking them up/not even trying, reusing threads… basically just avoiding expensive OS thread lifecycle stuff and overhead.
There are problems with go’s scheduling though, compared to other similar userspace scheduling implementations. One is context - any call site should be able to either statically resolve a context or grab one from some special spot of the underlying thread without explicit plumbing IMO. If they hadn’t allowed the terrible anti pattern of stuffing anything you want in context this would be more tractable to fix: it could just encode a tiny amount of info like a single int/enum for cancellation and OS-signaling state.
Secondly, Golang nudges you towards channels too much. While write to channel->wait until blocked->schedule channel reader is a nobrainer this is very often an unnecessary use of the second go routine and also an anti pattern (either because starting a goroutine to wait for an error event then scheduled it on error seems overhead free but isn’t, or because you’ve created a huge spaghetti code mess of an application and this is easier than winding up the call stack). In practice you can, drum roll, call the code living on the other side of the channel directly. The channel is just a gross way to manage control flow outside the stack.
Also since goroutines are annoyingly syntactically integrated with the language itself, you can’t do typical semantics of like joining the “threadlike thing”, cancelling it directly (fuck you, context) passing it around and plumbing it. This eliminates certain application-level scheduling improvements like telling a particular goroutine execution to start working on some new chunk of logic as soon as it finishes its current thing, and encourages wastefully firing and forgetting routines instead, while requiring you to do unnecessarily explicit bookkeeping with things like waitgroups.
It’s not terrible - some of these aren’t Golang problems per se, just the language nudging you into problematic patterns - but Fiber implementations often improve on these problems so I prefer them.