I wrote a deep dive on how modern reverse proxies handle thousands of concurrent requests.
The article walks through what actually happens inside a proxy when multiple requests arrive at the same time: connection handling, request queues, event loops vs thread pools, and how different implementations approach the problem.
It also compares design patterns used in systems like Apache Traffic Server, HAProxy, and Envoy, and explains why these architectural choices matter for latency, throughput, and failure behavior.
The goal was to zoom in on what happens between the client and backend when traffic spikes.
Would love feedback from folks running proxies at scale.
DNS keeps showing up in outage postmortems, but what's often missing is discussion about recovery, not just prevention.
In this post, I break down common DNS failure patterns (TTL propagation, resolver overload, control plane dependency loops) and why recovery can deadlock when your tooling itself depends on DNS.
I'd love to hear how others design around this:
Do you use DNS-independent fallbacks?
Static seed lists?
Separate control plane resolution?
Aggressive caching vs short TTLs?
Curious what patterns have worked (or failed) for folks in real systems.
For me, and also at the place I retired from, the optimal solution was an instance of Unbound [1] on every node: keeping a local cache, retrying edge resolvers intelligently, preferring the fastest-responding edge resolvers, a cap on min-ttl for both resource records and infrastructure, pre-caching, etc. I've done that at home, and when others talk about a DNS outage I have to go out of my way to see or replicate it, usually by forcing a flush of the cache.
Most Linux distributions have a build of Unbound. I point edge DNS recursive resolvers to the root servers rather than leaking internal systems requests to Cloudflare or Google. Unbound can also be configured to not forward internal names or to point requests for internal names to specific upstream servers.
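A minimal sketch of what that setup can look like in unbound.conf. The paths, addresses, and the corp.example zone are placeholders, and the values are illustrative, not recommendations:

```
server:
    interface: 127.0.0.1
    access-control: 127.0.0.0/8 allow
    # resolve from the root servers instead of leaking queries to a public resolver
    root-hints: "/usr/share/dns/root.hints"
    # refresh popular records before their TTL expires (pre-caching)
    prefetch: yes
    # floor on cached TTLs; trades strict TTL semantics for resilience
    cache-min-ttl: 60

# keep internal names internal: send them to specific upstream servers
forward-zone:
    name: "corp.example."
    forward-addr: 10.0.0.53
```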
Nice. Running Unbound locally with intelligent upstream selection and caching definitely reduces blast radius from edge resolver outages.
I haven't tried Unbound, but I'm curious: how do you handle recovery behavior when the failure isn't just recursive resolver unavailability? Scenarios like stale IPs after control plane failover, long-lived gRPC connections that never re-resolve, or bootstrap loops where the system that needs to reconfigure DNS itself depends on DNS?
In my experience, local recursive resolvers solve availability pretty well, but recovery semantics still depend heavily on client behavior and connection lifecycle management.
Do you rely on aggressive re-resolution policies at the application layer? Or force connection churn after TTL expiry?
Would love to understand how you think about resolver-level resilience vs application-level recovery.
We did not have to do this, but in that scenario I would have automation reach out to Unbound and drop the cache for that particular zone or subdomain. A script could force fetching the new records for any given zone to rebuild the cache.
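A sketch of that automation, assuming `unbound-control` has been set up (via `unbound-control-setup`) and `dig` is available; the zone and record names below are placeholders:

```python
# Drop Unbound's cache for one zone after a failover event, then rebuild it
# by forcing fresh lookups through the local resolver.
import subprocess

def flush_zone(zone, run=subprocess.run):
    """Drop all cached records under `zone` from the local Unbound instance."""
    cmd = ["unbound-control", "flush_zone", zone]
    run(cmd, check=True)
    return cmd

def rewarm(names, run=subprocess.run):
    """Force fresh lookups through the local resolver to rebuild the cache."""
    cmds = [["dig", "+short", "@127.0.0.1", name] for name in names]
    for cmd in cmds:
        run(cmd, check=True)
    return cmds
```

The `run` parameter is injectable so the command construction can be exercised without a live Unbound instance.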
> Or force connection churn after TTL expiry?
The TTL can be kept low and Unbound told to hold the last known IP after expiry, accepting that this breaks an RFC and that apps may hold onto the wrong IP for too long before Unbound requests the record from upstream again and gets the new IP. There is no one right answer. Whoever is the architect for the environment in question has to decide on the methods they believe will be more resilient and then test failure conditions during chaos testing. Anywhere there is a gap in resilience should be part of monitoring and automation when the bad behavior cannot be eliminated through app/infra configuration.
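For reference, the "hold the last known IP" behavior maps to Unbound's serve-expired directives (serving stale data was later sanctioned by RFC 8767); values here are illustrative:

```
server:
    # answer from cache even after the TTL has expired...
    serve-expired: yes
    # ...for at most this long past expiry, while refreshing from upstream
    serve-expired-ttl: 3600
```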
> how you think about resolver-level resilience vs application-level recovery
Well, sadly, the people managing or architecting the infrastructure may not have any input into how the applications manage DNS. Ideally both groups would meet and discuss options if this is a greenfield deployment. If not, the second-best option would be to discuss the platform behavior with a subject matter expert, plus an operations manager who can summarize past DNS failures, root cause analyses, and restoration methods, to determine what behavior should be configured into the stack. Here again there is no one right answer. As a group they will have to decide at which layer DNS retries occur most aggressively and how much input automation will have at the app and infra layers.
The overall priority should be to ensure that past DNS issues, the known-knowns, are designed out of the system. That leaves only the unknown-unknowns to be dealt with reactively, possibly first with automation and then with an operations or SRE team.
Take a look through the Unbound configuration directives [2] to see some of the options available.
Resolver-level resilience is often manageable centrally. The harder part is application-level recovery, especially in larger orgs where DNS behavior spans multiple teams.
Even with low TTLs or cache-flush automation, apps may resolve once at startup, hold long-lived gRPC/TCP connections, or, even worse, ignore TTL semantics entirely.
So infra assumes "DNS healed," but the app never re-resolves.
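The fix at the application layer usually looks like periodic re-resolution plus connection recycling. A sketch under assumed names (`DnsWatcher` and the injectable `resolver` are illustrative, not a real client library's API):

```python
# Poll DNS for the target and report when the answer set changes, so the
# caller can drain old connections and dial the new addresses.
import socket

def resolve(host, port):
    """Current answer set for host:port via the system resolver."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return frozenset(info[4][0] for info in infos)

class DnsWatcher:
    def __init__(self, host, port, resolver=resolve):
        self.host, self.port, self.resolver = host, port, resolver
        self.last = frozenset()

    def poll(self):
        """Returns the new address list if it changed, else None.
        Call this on a timer; when it returns addresses, recycle
        connections to the old set."""
        try:
            addrs = self.resolver(self.host, self.port)
        except OSError:
            return None  # resolver down: keep the last known answer
        if addrs and addrs != self.last:
            self.last = addrs
            return sorted(addrs)
        return None
```

This is exactly the piece long-lived gRPC/TCP clients tend to be missing: without some equivalent of `poll()` driving connection churn, "DNS healed" never reaches the app.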
Something along those lines, among other options. At least I tried to get teams to use those and then rely on Unbound's DNS cache and retry schemes. systemd also has its own resolver cache, which can be disabled or told to use a local instance of Unbound. Windows servers require Group Policy and registry modifications to change their behavior.
One of my pet peeves is when groups do not manage domain/search correctly and do not use FQDNs in application configuration, resulting in 3x or 4x (or more) the number of DNS requests, which also amplifies every DNS problem/outage. That really grinds my gears.
And of course if the Linux system uses glibc, editing /etc/gai.conf to prefer IPv4 or IPv6 depending on what is primarily used inside the data-center makes a big difference.
DNS works well in certain scenarios (multi-region failover, coarse traffic steering), but I've seen it misapplied in cases where faster failure detection or finer-grained routing was required.
This piece tries to outline where DNS fits well architecturally and where it starts to show limits.
For client-side LB, wouldn't moving active health checks out into a dedicated service create more reliability issues, with one more service to worry about? Are there any examples of this approach being used in the industry?
IME you end up with both: something like discrete client, LB, and controller. You can't rely on any one component to "turn itself off". A client or LB can easily get into a "wedged" state where it's unable to take itself out of consideration for traffic. For example, I've had silly incidents caused by BGP routes staying up, memory errors/pressure preventing new health check results from being parsed, the filesystem going read-only, SKB pressure interfering with pipes, and of course the classic difference between a dedicated health check endpoint and actual traffic. In all those examples, the failure prevents the client or LB from removing itself from the traffic path.
An external controller is able to safely remove traffic from one of the other failed components. In addition the client can still do local traffic analysis, or use in band signaling, to identify anomalous end points and remove itself or them from the traffic path.
Good active probes are actually a pretty meaningful traffic load. It was a HUGE problem for flat virtual network models like Heroku's a decade ago. This is exacerbated when you have more clients and more endpoints.
As a reference, this distributed model is what AWS moved to 15 years ago. And if you look at any of the high-throughput cloud services or CDNs, they'll have a similar model.
One thing to add for passive health checking and client-side load balancing is that throughput and dilution of signal really matter.
There are obviously plenty of low/sparse call volume services where passive health checks would take forever to get signal, or where signal is collected so infrequently it's meaningless. And even with decent RPS, say 1M RPS distributed between 1000 caller replicas and 1000 callee replicas, any one caller-callee pair is only seeing 1 RPS. Depending on your noise threshold, a centralized active health check can respond much faster.
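The arithmetic in that example can be made concrete. A back-of-envelope sketch (the failure threshold of 5 is an illustrative assumption, and it assumes traffic is spread uniformly):

```python
# How long until one caller accumulates enough passive signal about one callee?
def pair_rps(total_rps, callers, callees):
    """Average request rate for a single caller->callee pair."""
    return total_rps / (callers * callees)

def seconds_to_signal(rate, failures_needed):
    """Expected seconds for one caller to observe `failures_needed` failures
    from a fully-down callee at `rate` requests per second."""
    return failures_needed / rate

rate = pair_rps(1_000_000, 1000, 1000)   # the 1M RPS / 1000x1000 example: 1.0 rps per pair
print(seconds_to_signal(rate, 5))        # 5 failures at 1 rps takes ~5 seconds
```

A centralized checker probing each callee directly doesn't suffer this dilution, since its probe rate is independent of the caller/callee fan-out.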
There are some ways to improve signal in the latter case using subsetting and aggregating/reporting controllers, but that all comes with added complexity.
From a dataplane perspective, it does mean your healthchecks are running from a different location than your proxy. So there are risks where routability is impacted for proxy -> dest but not for healthchecker -> dest.
For general reliability, you can create partitions of checkers and use quorum across partitions to determine what the health state is for a given dest. This also enables centralized monitoring to detect systemic issues with bad healthcheck configuration changes (i.e. are healthchecks failing because the service is unhealthy or because of a bad healthchecker?)
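The partition-quorum idea above can be sketched in a few lines; the vote structure and thresholds here are illustrative assumptions, not any specific system's design:

```python
# Quorum over independent health-checker partitions: a destination only flips
# state when a majority of partitions agree, separating "the service is down"
# from "one checker partition is broken or misconfigured".
from collections import Counter

def quorum_health(votes):
    """votes: partition id -> that partition's view (True = healthy).
    Returns 'healthy', 'unhealthy', or 'no-quorum'."""
    tally = Counter(votes.values())
    needed = len(votes) // 2 + 1
    if tally[True] >= needed:
        return "healthy"
    if tally[False] >= needed:
        return "unhealthy"
    return "no-quorum"  # split view: alert on checker health, don't flap the dest
```

The `no-quorum` case is the interesting one operationally: it's the signal to investigate the checkers themselves rather than evict the destination.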
In industry, I personally know AWS has one or two health-check-as-a-service systems that they use internally for LBs and DNS. Uber runs its own health-check-as-a-service system, which it integrates with its managed proxy fleet as well as p2p discovery. IIRC Meta also has a system like this for at least some things? But maybe I'm misremembering.
> when a single endpoint in a service begins having high latency
Yes, have seen this first hand. Tracking the latency per endpoint in a sliding window helped in some way, but it created other problems for low qps services.
Agree - sliding window error rates plus client-side circuit breakers (with half-open probes and ramp-up) work really well in practice, and the recovery-speed point is especially important.
The only nuance I was trying to call out is what happens at very large scale. These mechanisms operate per client instance, so each client needs a few failures before it trips its breaker and then runs its own probes and ramp-up. That's perfectly reasonable locally, but when you have hundreds or thousands of clients, the aggregate "learning traffic" can still be noticeable. Each client might only send a little bad traffic before reacting, but multiplied across the fleet it can still add up. Similarly, recovery can still produce smaller synchronized ramps as many clients independently notice improvement around the same time.
So I tend to think of client-side circuit breakers as necessary but not always sufficient at scale. They're great for fast local containment and tail-latency protection, but they work best when paired with some shared signal (LB, mesh control plane, or similar) that can dampen the aggregate effect and smooth recovery globally.
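For concreteness, here is a minimal sketch of the per-client mechanism being discussed: a sliding error window plus a half-open probe state. Thresholds and state names are illustrative, and real implementations also limit how many half-open probes run concurrently:

```python
import time

class CircuitBreaker:
    """closed -> open when the recent error rate is too high,
    open -> half-open after a cooldown, half-open -> closed on a good probe."""

    def __init__(self, window=20, max_error_rate=0.5, cooldown=5.0, clock=time.monotonic):
        self.window, self.max_error_rate = window, max_error_rate
        self.cooldown, self.clock = cooldown, clock
        self.results = []        # sliding window of recent outcomes
        self.opened_at = 0.0
        self.state = "closed"

    def allow(self):
        """May this client send a request right now?"""
        if self.state == "open" and self.clock() - self.opened_at >= self.cooldown:
            self.state = "half-open"     # let a probe through
        return self.state != "open"

    def record(self, ok):
        """Report the outcome of a request."""
        if self.state == "half-open":
            if ok:                       # probe succeeded: close and reset
                self.state, self.results = "closed", []
            else:                        # probe failed: back to open
                self.state, self.opened_at = "open", self.clock()
            return
        self.results = (self.results + [ok])[-self.window:]
        failures = self.results.count(False)
        if len(self.results) >= self.window and failures / len(self.results) > self.max_error_rate:
            self.state, self.opened_at = "open", self.clock()
```

Note that everything here is per client instance, which is exactly the scaling concern above: each replica must fill its own window before tripping.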
The idea is attractive (especially for draining), but once you try to map arbitrary inbound client connections onto backend-initiated "reverse" pipes, you end up needing standardized semantics for multiplexing, backpressure, failure recovery, identity propagation, and streaming! So, you're no longer just standardizing "reverse HTTP", you’re standardizing a full proxy transport + control plane. In practice, the ecosystem standardized draining/health via readiness + LB control-plane APIs and (for HTTP/2/3) graceful shutdown signals, which solves the draining problem without flipping the fundamental accept/connect roles.
I wrote this after seeing cases where instances were technically “up” but clearly not serving traffic correctly.
The article explores how client-side and server-side load balancing differ in failure detection speed, consistency, and operational complexity.
I’d love input from people who’ve operated service meshes, Envoy/HAProxy setups, or large distributed fleets — particularly around edge cases and scaling tradeoffs.
I don't think you really need sub-millisecond detection to get sub-millisecond service latency. You mainly need to send backup requests, where appropriate, to backup channels, when the main request didn't respond promptly, and your program needs to be ready for the high probability that the original request wins this race anyway. It's more than fine that Client A and Client B have differing opinions about the health of the channel to Server C at a given time, because there really isn't any such thing as the atomic health of Server C anyway. The health of the channel consists of the client, the server, and the network, and the health of AC may or may not impact the channel BC. It's risky to let clients advertise their opinions about backend health to other clients, because that leads to the event where a bad client shoots down a server, or many servers, for every client.
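The backup-request pattern above can be sketched with a thread pool; `primary`, `backup`, and `hedge_delay` are caller-supplied assumptions, not a specific RPC library's API:

```python
# Hedged ("backup") request: fire the primary, and if it hasn't answered
# within hedge_delay seconds, also fire the backup; take whichever result
# arrives first.
import concurrent.futures as cf

def hedged(primary, backup, hedge_delay):
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(primary)
        done, _ = cf.wait([first], timeout=hedge_delay)
        if done:
            return first.result()       # common case: primary wins outright
        second = pool.submit(backup)
        done, _ = cf.wait([first, second], return_when=cf.FIRST_COMPLETED)
        return done.pop().result()      # whichever answered first
```

One caveat with this minimal version: leaving the `with` block waits for the losing call to finish before returning, so a production version would cancel or detach the straggler rather than block on it.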
Modern LBs, like HAProxy, support both active & passive health checks (and others, like agent checks where the app itself can adjust the load balancing behavior). This means that your "client scenario" covering passive checks can be done server side too.
Also, in HAProxy (that's the one I know), server-side health checks can run at millisecond intervals. I can't remember the minimum; I think it's 100ms, so theoretically you could fail a server within 200-300ms, instead of the 15 seconds in your post.
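A hedged haproxy.cfg sketch of that fast-check setup (backend name, addresses, and numbers are illustrative; `fall`/`rise` need tuning to avoid flapping):

```
backend app
    option httpchk GET /health
    # probe every 100ms: mark down after 2 consecutive misses, up after 3 passes
    default-server inter 100ms fall 2 rise 3
    server app1 10.0.0.11:8080 check
    server app2 10.0.0.12:8080 check
```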
> theoretically you could fail a server within 200-300ms, instead of the 15 seconds in your post.
You need to be careful here, though, because the server might just be a little sluggish. If it's doing something like garbage collection, your responses might take a couple hundred milliseconds temporarily. A blip of latency could take your server out of rotation. That increases load on your other servers and could cause a cascading failure.
If you don't need sub-second reactions to failures, don't worry too much about it.
Thanks for writing something that's accessible to someone who's only used Nginx server-side load balancing and didn't know client-side load balancing existed at higher scale.