One of the questions asked but not answered is: "why is it more/less expensive to use threads vs. coroutines?"
This is my attempt at an answer (it comes from working on a database engine, where you also run into "why is it not a good idea to let the OS handle things for you?"):
- It's partially historical...
- ...and changing it could break things in funny ways (you can bet some code somehow depends on these numbers!)
- The stack size is a GLOBAL choice for all the apps running, so the OS is blind to what would be optimal (and even if you can tweak it, the OS CAN'T optimize for your rare case)
- It must work across many workloads
- MOST code runs as Process -> small number of threads with CPU-intensive tasks (i.e. "I pay $$$$$ for a Threadripper and my app is equally slow!")
And probably the #1:
- The main case is 1 UI thread + a small N of background threads, so it's better to pre-allocate for "bigger" capacity
---
Then, "why are coroutines and similar better?" (It's easier to see when you implement coroutines yourself, but let's ignore that and use the same mental framework as above.)
- Even a "general-purpose" language has a more defined role ("Go is for networking apps"), and this leaks, strongly, into whether the language is biased towards coroutines/other or threads (note how Rust can't pick a side, because its role is as broad as a full OS)
- Langs that pick coroutines/other over threads are optimized for high concurrency, because it's expected that developers WILL create significant libraries/frameworks for the smaller subset of high-concurrency apps (like web servers)
- Langs that pick threads over coroutines/other are more general-purpose (Rust, Java, ...) or need to run faster for CPU-intensive/parallel code, and there threads are in fact faster than coroutines.
Coroutines are cooperative scheduling, and come with all of its drawbacks.
They're really really good for high performance proxies though, since they make a very small number of cheap decisions for each request, and then punt the bit shuffling to the OS (which uses threads). These proxies are ubiquitous and well regarded (nginx, haproxy, linkerd, etc) so they take up maybe more mind share than they should.
I don't think the 8KB calculation is quite right - there are allocations in the kernel for every thread as well, and I don't think those will be accounted for by /usr/bin/time.
Indeed. Making the program wait (using a barrier) after all the threads have started, I see a substantially higher memory usage than reported by time. I only used 100k threads. For me, time reports 1018376kB, but looking at /proc/meminfo I see a difference of 5231092kB - so about 52kB per thread. That's a far cry from 8MB, but also not 8kB.
/proc/slabinfo shows some drastic differences:
`slabtop -o | head -n 15` with 100k threads running:

```
Active / Total Objects (% used)    : 10134567 / 10280497 (98.6%)
Active / Total Slabs (% used)      : 337756 / 337756 (100.0%)
Active / Total Caches (% used)     : 132 / 175 (75.4%)
Active / Total Size (% used)       : 3598333.95K / 3674904.09K (97.9%)
Minimum / Average / Maximum Object : 0.01K / 0.36K / 11.81K

   OBJS  ACTIVE  USE  OBJ SIZE  SLABS  OBJ/SLAB  CACHE SIZE  NAME
3419013 3418579  99%    0.10K   87667        39    350668K   buffer_head
1485372 1484947  99%    0.19K   35366        42    282928K   dentry
 635299  635038  99%    0.68K   13517        47    432544K   proc_inode_cache
 575792  575708  99%    1.14K   20564        28    658048K   ext4_inode_cache
 514794  514703  99%    0.04K    5047       102     20188K   extent_status
 503140  502198  99%    0.18K   11435        44     91480K   vm_area_struct
 501738  500514  99%    0.04K    4919       102     19676K   vma_lock
 381304  378068  99%    0.57K   13618        28    217888K   radix_tree_node
```
without:

```
Active / Total Objects (% used)    : 7235070 / 7678039 (94.2%)
Active / Total Slabs (% used)      : 187813 / 187813 (100.0%)
Active / Total Caches (% used)     : 132 / 175 (75.4%)
Active / Total Size (% used)       : 1751469.45K / 1851083.31K (94.6%)
Minimum / Average / Maximum Object : 0.01K / 0.24K / 11.81K

   OBJS  ACTIVE  USE  OBJ SIZE  SLABS  OBJ/SLAB  CACHE SIZE  NAME
3419013 3418579  99%    0.10K   87667        39    350668K   buffer_head
 898296  887716  98%    0.19K   21388        42    171104K   dentry
 575792  575708  99%    1.14K   20564        28    658048K   ext4_inode_cache
 514794  514703  99%    0.04K    5047       102     20188K   extent_status
 381304  376987  98%    0.57K   13618        28    217888K   radix_tree_node
 164352  164352 100%    0.06K    2568        64     10272K   kmalloc-rcl-64
 157360   53925  34%    0.07K    2810        56     11240K   vmap_area
 139638  109093  78%    0.04K    1369       102      5476K   vma_lock
```
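The barrier measurement described above can be sketched in a few lines. This is a rough, Linux-only illustration in Python (it reads `/proc/self/status`, and the thread count and absolute numbers are arbitrary; Python threads carry interpreter overhead, so the per-thread figure will differ from the C case):

```python
import re
import threading

def rss_kb():
    # Resident set size of this process, from /proc (Linux-specific).
    with open("/proc/self/status") as f:
        return int(re.search(r"VmRSS:\s*(\d+) kB", f.read()).group(1))

N = 200
start = threading.Barrier(N + 1)   # signals that all threads are alive
done = threading.Barrier(N + 1)    # parks them while we measure

def parked():
    start.wait()
    done.wait()

before = rss_kb()
threads = [threading.Thread(target=parked) for _ in range(N)]
for t in threads:
    t.start()
start.wait()                       # all N threads alive and parked
after = rss_kb()

per_thread_kb = (after - before) / N
print(f"~{per_thread_kb:.1f} kB resident per thread")

done.wait()                        # release the parked threads
for t in threads:
    t.join()
```

Note this only captures the process-visible resident cost; as the parent comment points out, kernel-side slab allocations per thread are not visible here.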
Don't most applications that use threads also use thread pools because thread creation is expensive?
I guess these comparisons are nice for getting a feeling of how expensive operations are. But does it have much real-world relevance?
In practice, my beef with coroutines is that with unlimited concurrency you run into resource limits once the system gets some load (file descriptors, open connections, ... whatever limited resource the coroutine uses) or someone starts a DoS attack. This is something no one ever tests, and something that appears randomly in production depending on the data fed to the system.
With many developers using them as the default solution when things get slow, that creates brittle systems. How is everyone else dealing with things like this?
Yes, but thread pools kind of suck for workloads with multiple I/O operations per "run", because they end up with either coloring problems, blocking inefficiently, or doing extra context switching. Also, most developers set the pool size to the hardware parallelization limit, which worsens either the context switching cost or the ability to be smart about synchronization.
At the bare minimum, you want your userspace scheduling to include synchronization tools in addition to a threading abstraction - you have more information available than the kernel about which threads are blocking which others, so you can be smart about starting threads you know are unblocked and not leaving hardware idle while there’s work to be done.
Regarding load, a lot of people use infrastructure tools to limit that these days before the load hits their application - without that, your gripe is not actually a solvable problem in the application itself. You’ll incur an inherent amount of I/O and scheduling thrashing for each ingress event which is completely unavoidable without dropping requests - even with fixed concurrency in your application code you’ll get this. However, as seen in the blog, the problem is not actually with internal application concurrency at all - Golang can handle 1mm active goroutines no sweat.
The problem instead is that your hardware simply can't perform HTTPS (or, god forbid, JSON) deserialization fast enough to handle the throughput getting thrown at you, which is distinct from the number of active requests. If you are gonna make a very long running call to some other service and otherwise do nothing in that goroutine while doing so, you can actually handle absurd levels of concurrency, easily in the hundreds and probably thousands or tens of thousands. But you still can't handle 8 virtual cores trying to deserialize 100 JSON bodies every 10ms.
Well, ordinarily, you just know your specs and your requirements, and install rate limiting and load balancing. Just standard availability & resilience. With proper resource constraints and observability in place, your system should be fine in most cases.
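One application-side piece of the "proper resource constraints" mentioned above is simply bounding in-flight concurrency. A minimal sketch in Python, using asyncio.Semaphore as the limiter (the limit and task count are arbitrary, and the sleep stands in for real I/O):

```python
import asyncio

LIMIT = 10  # arbitrary cap on in-flight work

async def main():
    sem = asyncio.Semaphore(LIMIT)
    in_flight = 0
    peak = 0

    async def handle(i):
        nonlocal in_flight, peak
        async with sem:                    # blocks when LIMIT tasks are active
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.001)     # stand-in for the real I/O
            in_flight -= 1

    # 300 "requests" arrive at once, but never more than LIMIT run together.
    await asyncio.gather(*(handle(i) for i in range(300)))
    return peak

peak = asyncio.run(main())
print("peak in-flight:", peak)
```

The same shape works for any limited resource (file descriptors, connections): acquire before using it, so overload turns into queueing instead of exhaustion.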
I think it’s standard practice to plan for failure and simulate adversary access patterns. On the other hand you say “no one ever tests” stuff like that. I suppose it’s not that standard after all.
A significant difference is that goroutine stacks can grow, but also shrink, thus releasing memory after a temporary stack growth. Threads won’t do that.
Note that lightweight threads (even in a CSP-style concurrency model) can be even lighter and faster. F#'s Hopac library seems to meet such expectations with this simple program:

```
#r "nuget:hopac"
open Hopac
open Hopac.Infixes
open Hopac.Extensions

seq { 1..1000000 }
|> Seq.Con.iterJobIgnore (fun _ -> timeOutMillis 1000)
|> run
```

Runs in 1.824s real, 17.635s user and 0.088s system time in under 186000KB, hence around 200 bytes or less per thread.
10M threads sleeping 10 seconds taxes this machine to:
22s real, 4m user, 0.506s system @ 140 bytes/thread.
I feel like this glossed over how when entering something like epoll, Go can optimize switching to another goroutine. Wouldn't the epoll block until the calling kernel thread had a file descriptor ready?
Go can multiplex goroutines into a single syscall invocation. Two goroutines waiting on distinct file descriptors will use a single epoll (or similar) syscall. However not all syscalls can be multiplexed this way.
IMO this partially undermines the "Go doesn't have colored functions" claim. 1k goroutines calling read is fine, because read is multiplexed, but 1k goroutines calling stat will be painful, because you get 1k kernel threads. Go's "red functions" are those which require a dedicated kernel thread.
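The same split shows up in other event-loop runtimes. As an analogy (Python's asyncio rather than Go, and the paths here are arbitrary): socket reads are multiplexed on the loop's selector, but a call like os.stat has no non-blocking form, so the runtime must hand it to a real thread - one per concurrent call:

```python
import asyncio
import os

async def stat_many(paths):
    loop = asyncio.get_running_loop()
    # os.stat can't be multiplexed on epoll, so each concurrent call is
    # pushed to the default thread pool and occupies a real OS thread,
    # much like Go parking a non-pollable syscall on a kernel thread.
    return await asyncio.gather(
        *(loop.run_in_executor(None, os.stat, p) for p in paths)
    )

results = asyncio.run(stat_many(["."] * 8))
print(len(results), "stat results")
```

With 1k concurrent stat calls, the pool (or, in Go's case, the set of kernel threads) becomes the bottleneck regardless of how cheap the green threads are.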
I wish this was stated more. A lot of people new to Go have this air of mystery surrounding what coroutines and goroutines are. They'll do something and it'll be fast! Then they'll do something else and feel like they are running hardware from the 90s. Not everyone has the know-how to strace their applications and think through why something at that level is happening.
I think it just undermines "colorless functions are good", not that Go's functions are colored - they aren't. The problem is that "colors" (i.e. "this is async") are helpful to the caller: they let them know which context they should be in when they perform that work.
It sounds like blocking calls switch to a system stack and return the Go stack to the executor pool, but I don’t have source links to back up that claim.
epoll_wait takes a timeout, which if 0 means it will return immediately even if no events are pending.
You wouldn't necessarily need to make use of this feature to implement something like Go's scheduler. Jumping to the semantics of epoll skips most of the interesting bits about how Goroutines work. The source code will always beat a blog post. If you're curious, you could work backwards from here: https://go.dev/src/runtime/netpoll_epoll.go Note that interface does indeed expose the timeout argument to the higher-level scheduling code; see line 92.
I am aware of epoll_wait and friends (select/poll) accepting timeouts, and how that plugs into a general event loop like in libuv.
However if memory serves, these calls are reasonably expensive (on the order of microseconds, and _worse_ if readiness requires putting the current thread to sleep and then waking it up again). However https://www.ardanlabs.com/blog/2018/08/scheduling-in-go-part... says goroutines are closer to 200ns.
However the same source above does a decent job of explaining how epoll is being called "out of band". I am guessing when they talk about context switching costs, they are doing some sort of amortization of these system call costs. Since the actual OS threads try to remain CPU-affine (thread-per-core, LMAX disruptor etc. etc.), a lot of the OS scheduling costs can be skipped.
Thank you for the pointer to the source code. It gives me a starting point.
The key is that epoll _isn't_ blocking: epoll_create returns immediately with a file descriptor. Go can use it to issue some sort of i/o operation and rather than blocking the kernel thread, it swaps it to another goroutine that needs work done. When epoll notifies the go scheduler the task is complete via that file descriptor, it will then resume the original goroutine.
So, I'm not familiar with the go event loop specifically, but I have written a few event loops myself.
You can think of an event loop as multiplexing across multiple sources of events: for example a local ready queue, a remote queue (for tasks coming from other threads), an io-queue, a timer queue. Epoll is used for the io-queue. If the local ready queue is empty (i.e. there are no runnable coroutines), the event loop will block in the epoll call, possibly until the next timer expiration if any; remote queue operations will wake up the poll via something like an eventfd. If the local queue is non-empty, epoll is called with a 0 timeout every N local queue pops (actual scheduling logic and priority is application specific of course).
Polling epoll with a 0 timeout is a bit wasteful; for high performance applications you can use io_uring so checking for a non-empty queue is very cheap, and/or use some network bypass library that makes polling cheap. There is also old-school readiness notification via signals, but I doubt that still sees any use.
As you suggest, you could also have a dedicated thread per event loop just running poll (for example a sibling hyperthread), but it seems wasteful.
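The loop described above (block in poll when idle, zero-timeout poll when local work is pending) can be sketched minimally. This is a toy illustration in Python using the `selectors` module (epoll on Linux) with a pipe as the I/O source and plain callbacks standing in for coroutines:

```python
import os
import selectors
from collections import deque

sel = selectors.DefaultSelector()   # wraps epoll on Linux
ready = deque()                     # local ready queue of runnable tasks
results = []

r, w = os.pipe()
# The registered callback is the "coroutine" resumed on I/O readiness.
sel.register(r, selectors.EVENT_READ,
             data=lambda: results.append(os.read(r, 16)))

ready.append(lambda: results.append(b"local-task"))
os.write(w, b"io-event")            # pending I/O event

while len(results) < 2:
    # Block in poll only when there is no local work; otherwise poll with
    # a zero timeout so runnable tasks are not delayed by the kernel.
    timeout = 0 if ready else None
    for key, _ in sel.select(timeout):
        key.data()                  # dispatch the I/O completion
    while ready:
        ready.popleft()()           # run local tasks

sel.close()
os.close(r)
os.close(w)
print(results)
```

A real loop would also fold in timers (by computing the poll timeout from the next expiry) and a remote queue woken via an eventfd, as described above.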
I think this is pretty well covered in the article:
> Now you might think that to make a new thread I need 8 megabytes of RAM free. But thanks to the magic of virtual memory and overcommit, that’s not necessarily the case. The right way to think about this is that the OS, let’s assume 64bit, is going to allocate you your own private range but this doesn’t really have to be backed by anything. There are alot of 8mb blocks in a 64bit address space. However there is some book keeping overhead as the kernel tracks it’s IOUs.
This is to say that just because you allocate 8MB doesn't mean you create the backing pages and page them in.
The article mentions Nodejs libuv which makes everything async but not multithreaded. I want to combine the benefits of kernel threads with coroutines or goroutines/green threads/lightweight threads. (If anybody knows anything specifically about fibers, I'd appreciate you telling me that because I'm not familiar with them.)
I have a lightweight thread scheduler https://github.com/samsquire/preemptible-thread which is a 1:M:N scheduler (1 scheduler thread, M kernel threads, N lightweight threads) with the lightweight threads being multiplexed on the kernel threads.
I am working on a multithreaded architecture which I currently call 3 tier multithreaded architecture for lack of a better name. It combines request parallelism with IO and CPU parallelism and intra request parallelism.
We split kernel threads into three groups: app threads, which run lightweight threads; IO threads (liburing/epoll); and a traditional CPU threadpool with work stealing.
* There should be low latency for handling IO, because the liburing or epoll loop is dedicated to serving that request and then dispatching it as a CPU-heavy task or an app-level task.
* The IO threads have buffers that other threads can write to in order to queue up data for sockets.
* The IO threads are basically listeners for IO events; they either spawn a lightweight thread to run on the app threads or submit a CPU task.
* The app threads coordinate parallelism WITHIN a request: they spawn CPU tasks and IO tasks.
* You can also spawn app threads to do work outside the context of a request, such as server tasks or scheduled work.
* You never put a while (true) in a CPU task and only use cooperatively preemptable loops in the lightweight app threads.
* The IO threads do very little work.
* Nobody should block the app threads because that would prevent other requests from completing.
* Lock-free ring buffers are used by CPU threads and lightweight threads to communicate.
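The dispatch path of this three-tier design can be sketched with standard queues. This is a simplified illustration (blocking queues instead of lock-free ring buffers, one app thread, and made-up payloads; names like `io_events` and `cpu_task` are mine, not from the project):

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

io_events = queue.Queue()   # fed by the IO (epoll/io_uring) threads
results = queue.Queue()
cpu_pool = ThreadPoolExecutor(max_workers=4)   # the CPU thread group

def cpu_task(payload):
    return sum(payload)     # stand-in for CPU-heavy work

def app_thread():
    # App threads coordinate work within a request: they take IO events
    # and dispatch CPU tasks, never blocking on long computations.
    while True:
        event = io_events.get()
        if event is None:   # shutdown sentinel
            break
        fut = cpu_pool.submit(cpu_task, event)
        fut.add_done_callback(lambda f: results.put(f.result()))

t = threading.Thread(target=app_thread)
t.start()
# Pretend the IO thread observed three requests.
for req in ([1, 2], [3, 4], [5, 6]):
    io_events.put(req)
io_events.put(None)
t.join()
cpu_pool.shutdown()         # waits for CPU tasks (and their callbacks)
print(sorted(results.queue))
```

The real design replaces the blocking queues with lock-free ring buffers and runs many app threads, but the flow - IO event in, CPU task out, result back via a queue - is the same.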
> I want to combine the benefits of kernel threads with coroutines or goroutines/green threads/lightweight threads. (If anybody knows anything specifically about fibers, I'd appreciate you telling me that because I'm not familiar with them.)
Fibers mean a couple different things these days. The basics are usually things like cooperative scheduling, different context switching and pre-emption semantics (using special synchronization libraries) that make it so you can have huge numbers of active thread-like (fibrous) things with less overhead. They also usually support a tree-like structure that enables things like cancellation. Sometimes they re-use pthreads between fibers under the hood (when one fiber completes, the next fiber gets the old pthread) to amortize thread creation and destruction, or sometimes they pack multiple fibers into a single thread for further amortization and to reduce overhead.
Your
```
while (this.parent.isScheduled(this)) {
```
in the thread function itself confuses me a bit, as typically you'd expect the code to just run when scheduled and not run otherwise - are you keeping all your lightweight threads running/scheduled by the OS, but bit-flipping the printing from off to on based on whether your scheduler determines it to be active? If so, to get the actual performance benefits, you need to keep the lightweight threads off the kernel scheduler itself.
Thanks so much for your reply and explanation of fibers. I still don't get it yet but I'll do some more investigation and ask ChatGPT.
That code should only run when the thread is scheduled; the guard is to prevent the lightweight thread from blocking the kernel worker's loop over lightweight threads.
The lightweight threads run in worker threads and the scheduler is a different thread to that.
You would have to remove a lightweight thread from kernel_thread.lightweight_threads to deschedule it completely. Presumably you'd do this if you were waiting for an event.
I assume all my lightweight threads are ready to run and all want to run. But they need to relinquish the CPU when preempted by the scheduler.
In the C code of preemptible-thread, the user_function() of a lightweight thread is the body of the lightweight thread; it is called by one of the kernel threads assigned to the lightweight thread runtime. The kernel thread loops over each lightweight thread assigned to it.
The scheduler thread does NOT run lightweight threads. It's just used to generate signals on a time interval. It sets flags to interrupt loops.
In other words, a lightweight thread will only run until it is cooperatively preempted by the scheduler thread. I've been criticised for describing it as preemption but it's a cooperative preemption.
while (this.parent.isScheduled(this)) is used to cancel the execution of that lightweight thread's execution of user_function when the scheduler toggles its scheduled & preempted flag. It's the equivalent of while (true) but in a lightweight thread context. If I used while (true), the lightweight thread would never be descheduled for another lightweight thread to run.
It's a way of stopping the current thread's while loop.
There is one scheduler kernel thread that does the following on each time slice (ideally a few microseconds):

```
for kernel_thread in kernel_threads:
    for lightweight_thread in kernel_thread.lightweight_threads:
        lightweight_thread.preempt()
```
There are as many kernel worker threads as you have hardware threads. Each one runs lightweight threads one by one until the scheduler preempts the current one, then moves on to the next. The worker round-robins its lightweight threads:

```
while (running) {
    for (lightweight_thread in lightweight_threads) {
        lightweight_thread.markScheduled(true)
        // from here to the closing brace is the lightweight thread's body
        while (lightweight_thread.isScheduled(this)) { // while I'm not preempted
            registerLoop(0, 0, 10000000);
            for (int loopVar = initialVar(0, 0); getValue(0) < getLimit(0); loopVar = increment(0)) {
                Math.sqrt(getValue(0));
            }
        }
    }
}
```
I've inlined the lightweight thread's body in this code above.
The scheduler thread loops over the lightweight threads and preempts all currently running ones, while each kernel worker thread loops over its own lightweight threads, running each one while it is not preempted.
Cancellation is possible in this design. I haven't thought about removing lightweight threads from the runqueue, it's assumed every lightweight thread wants to run as much as possible. The scheduler can be enhanced to provide this.
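The flag-based cooperative preemption described in this design can be sketched compactly. A toy illustration (one worker, three lightweight threads, a threading.Event per flag standing in for the signal-set flags, and arbitrary slice lengths):

```python
import threading
import time

N = 3
scheduled = [threading.Event() for _ in range(N)]  # per-LWT "scheduled" flag
progress = [0] * N
stop = threading.Event()

def worker():
    # One kernel worker thread round-robins its lightweight threads.
    while not stop.is_set():
        for i in range(N):
            scheduled[i].set()
            # The lightweight thread's "body": run until preempted.
            while scheduled[i].is_set() and not stop.is_set():
                progress[i] += 1

def scheduler():
    # The scheduler thread runs no lightweight threads; it only clears
    # flags on each time slice, cooperatively preempting the bodies.
    while not stop.is_set():
        time.sleep(0.001)
        for ev in scheduled:
            ev.clear()

w = threading.Thread(target=worker)
s = threading.Thread(target=scheduler)
w.start()
s.start()
time.sleep(0.2)
stop.set()
w.join()
s.join()
print(progress)   # every lightweight thread made progress
```

After 0.2s, all three counters have advanced, showing the round-robin interleaving: no lightweight thread can monopolize the worker, because the scheduler clears its flag each slice.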
Running a separate scheduling thread seems very wasteful and non scalable. What is your preemption condition? Can't you just increment a counter in isScheduled(this) and synchronously invoke the scheduler every N or so checks?
The preemption condition is literally time. Just need to set a flag in threads after so much time.
There is only one scheduling thread.
That could work, but it would slow down the while loops of the lightweight threads. Still, it would work.
The design deliberately avoids putting if statements inside loops, for performance reasons.
If you have 128 core machine then 1 thread for scheduling might be worthwhile rather than slow down 128 cores worth of work with an if statement.
But is there a way to run a timer in the kernel or something so I don't need to waste a thread? It would presumably be asleep most of the time anyway.
This doesn't even mention tokio. Goroutines are green threads; Rust threads are OS-level threads. This is a cool comparison, but it should definitely include Rust's green-thread model like tokio for comparison.
I did some non-scientific testing with this last week and, at least for my problem (brute-forcing RC4 keys with a relatively small number of long-lived threads), threads and goroutines were approximately equal in terms of performance, with goroutines very slightly faster (around 5% or so).
I have another test I'm running with the same workload, but with more dynamically created threads (n keys dispatched to a pool of threads/goroutines until the key space is exhausted) but I haven't finished the threaded version for comparison yet.
Wouldn't you have exactly 1 thread per CPU for this sort of embarrassingly parallel, CPU-intensive brute-force computation? You should context switch approximately never, which makes coroutine vs. thread moot.
The point was more to see what the overhead is for CPU-bound tasks that do need to switch, not that this is the best approach for this particular task. I just happened to have the threaded version available, so I thought it made a good point of comparison.
> For example if one goroutine is writing to a channel and one goroutine is reading to a channel, it’s possible go can literally run the reader and the writer on the same thread and take a fast path where the writer goroutine calls send and that triggers the current thread to switchto the reader goroutine.
> ancient legacy registers/flags
Right, the problem is the kernel scheduler has less context on the relationships between threads, less context on application/thread state, and solves a bunch of problems you probably don’t care about, while juggling a bunch of other stuff besides scheduling.
The magic of user space scheduling is not that scary, it’s packing more concurrency into less OS threads, using synchronization to be smart about waking them up/not even trying, reusing threads… basically just avoiding expensive OS thread lifecycle stuff and overhead.
There are problems with go’s scheduling though, compared to other similar userspace scheduling implementations. One is context - any call site should be able to either statically resolve a context or grab one from some special spot of the underlying thread without explicit plumbing IMO. If they hadn’t allowed the terrible anti pattern of stuffing anything you want in context this would be more tractable to fix: it could just encode a tiny amount of info like a single int/enum for cancellation and OS-signaling state.
Secondly, Golang nudges you towards channels too much. While write to channel->wait until blocked->schedule channel reader is a nobrainer this is very often an unnecessary use of the second go routine and also an anti pattern (either because starting a goroutine to wait for an error event then scheduled it on error seems overhead free but isn’t, or because you’ve created a huge spaghetti code mess of an application and this is easier than winding up the call stack). In practice you can, drum roll, call the code living on the other side of the channel directly. The channel is just a gross way to manage control flow outside the stack.
Also since goroutines are annoyingly syntactically integrated with the language itself, you can’t do typical semantics of like joining the “threadlike thing”, cancelling it directly (fuck you, context) passing it around and plumbing it. This eliminates certain application-level scheduling improvements like telling a particular goroutine execution to start working on some new chunk of logic as soon as it finishes its current thing, and encourages wastefully firing and forgetting routines instead, while requiring you to do unnecessarily explicit bookkeeping with things like waitgroups.
It’s not terrible - some of these aren’t Golang problems per se, just the language nudging you into problematic patterns - but Fiber implementations often improve on these problems so I prefer them.