The biggest thing to watch out for with this approach is that you will inevitably have some failure or bug that 10x's, 100x's, or 1000x's the rate of dead messages and overloads your DLQ database. You need a circuit breaker or rate limit on it.
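As an illustration only (the class and method names here are hypothetical, not from any particular library), the rate limit can be as simple as a per-window counter in front of DLQ writes; when it trips, you pause consumption and alert instead of flooding the DLQ:

    // Hypothetical guard that caps how many records may be dead-lettered per time window.
    final class DlqRateLimiter {
        private final long maxPerWindow;
        private final long windowMillis;
        private long windowStart = System.currentTimeMillis();
        private long count = 0;

        DlqRateLimiter(long maxPerWindow, long windowMillis) {
            this.maxPerWindow = maxPerWindow;
            this.windowMillis = windowMillis;
        }

        // Returns true if another DLQ write is allowed; false means "trip the breaker".
        synchronized boolean tryAcquire() {
            long now = System.currentTimeMillis();
            if (now - windowStart >= windowMillis) {  // new window: reset the counter
                windowStart = now;
                count = 0;
            }
            return ++count <= maxPerWindow;
        }
    }

The caller checks it before every DLQ write, e.g. if tryAcquire() returns false, pause the consumer and page someone rather than keep dumping.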
I worked on an app that sent an internal email with a stack trace whenever an unhandled exception occurred. It worked great until the day an OOM in a tight loop on a box in Asia sent a few hundred emails per second and saturated the company WAN backbone and the whole team's mailboxes. Good times.
The idea behind a DLQ is that messages eventually get retried (with some backoff), and if they fail enough times, they stay there. You need monitoring to observe the messages that can't escape the DLQ. Ideally, nothing should ever stay in the DLQ, and if something does, it's something that should be fixed.
If you are reading from Kafka (for example), can't do anything with a message (broken JSON, say), and can't put it into a DLQ, you have no other option but to skip it or stop on it, no?
Your place of last resort with Kafka is simply to replay the message back to the same Kafka topic, since you know it's up. In a simple single-consumer setup, just put a retry count on the message and increment it; that also gives you something to hook monitoring/alerting/etc. onto. Multiple consumers? Put an enqueue-source tag on the message and only process the messages tagged for you. This won't scale to infinity, but it scales really, really far for really, really cheap.
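Roughly, with the plain Java Kafka client that can look like the sketch below (the header names, the retry cap, and alertAndPark are made up for the example):

    import java.nio.charset.StandardCharsets;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.header.Header;

    final class Requeuer {
        // Re-enqueue a failed record onto the SAME topic with a bumped retry count
        // and a tag identifying which consumer put it back.
        static void requeue(KafkaProducer<String, String> producer,
                            ConsumerRecord<String, String> rec,
                            String consumerName) {
            Header h = rec.headers().lastHeader("retry-count");
            int retries = (h == null)
                ? 0
                : Integer.parseInt(new String(h.value(), StandardCharsets.UTF_8));

            if (retries >= 5) {   // arbitrary cap: stop looping and get a human involved
                alertAndPark(rec);
                return;
            }

            ProducerRecord<String, String> out =
                new ProducerRecord<>(rec.topic(), rec.key(), rec.value());
            out.headers().add("retry-count",
                Integer.toString(retries + 1).getBytes(StandardCharsets.UTF_8));
            out.headers().add("enqueue-source", consumerName.getBytes(StandardCharsets.UTF_8));
            producer.send(out);
        }

        static void alertAndPark(ConsumerRecord<String, String> rec) {
            // hypothetical: emit a metric / page someone / write to long-term storage
        }
    }

In the multi-consumer case, each consumer would also skip records whose enqueue-source header names a different consumer.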
Generally yes, but if you use e.g. the parallel consumer, you can potentially keep processing in that partition to avoid head-of-line blocking. There are some downsides to having a very old unprocessed record: the consumer group's offset can't advance past that record, and the consumer instead keeps track of the individual offsets it has completed beyond it, so you don't want to be in that state indefinitely, but you hope your DLQ eventually succeeds.
But if your DLQ is overloaded, you probably want to slow down or stop, since sending a large fraction of your traffic to the DLQ is counterproductive. E.g. if you are sending 100% of messages to the DLQ due to a bug, you should stop processing, fix the bug, and then resume from your normal queue.
Sorry, but what's stopping the DLQ from being a different topic on that Kafka cluster? I get that the consumer(s) might be dead, preventing them from moving the message to the DLQ topic, but if that's the case then no messages are being consumed at all.
If the problem is that the consumers themselves cannot write to the DLQ, then that feels like either Kafka is dying (no more writes allowed) or the consumers have been misconfigured.
Edit: In fact, there seems to be a self-inflicted problem being created here: having the DLQ on a different system, whether it be another instance of Kafka, or Postgres, or what have you, is really just creating another point of failure.
> Edit: In fact, there seems to be a self-inflicted problem being created here: having the DLQ on a different system, whether it be another instance of Kafka, or Postgres, or what have you, is really just creating another point of failure.
There's a balance. Do you want your Kafka cluster provisioned for double your normal event intake rate just in case the worst-case failure to produce elsewhere causes 100% of events to get DLQ'd? (Since now you've doubled your writes to the shared cluster, which could itself cause failures to produce to the original topic.)
In that sort of system, failing to produce to the original topic is probably what you want to avoid most. If your retention period isn't shorter than your time to recover from an incident like that, then priority 1 is often "make sure the events are recorded so they can be processed later."
IMO a good architecture here cleanly separates transient failures (don't DLQ; retry with backoff, don't advance the consumer group) from "permanently cannot process" failures (DLQ only these), unlike in the linked article. That greatly reduces the odds of "everything is being DLQ'd!" causing cascading failures by overloading seldom-stressed parts of the system. It also makes it much easier to keep your DLQ in one place, and you can solve some of the visibility problems from the article with a consumer that puts summary info elsewhere or such. There's still a chance for a bug that results in everything being wrongly rejected, but it makes you potentially much more robust against transient downstream deps having a high blast radius. (One nasty case: if different messages have wildly different sets of downstream deps, do you want some of them blocking all the others? IMO they should be partitioned in a way so that you can still move forward on the others; see the sketch below.)
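As a sketch of that split (the exception types, topic name, and interface here are hypothetical, not from the article): only failures known to be permanent get produced to the DLQ; everything else is surfaced to the caller to back off and retry without advancing the consumer group.

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    final class FailureRouter {
        private final KafkaProducer<String, String> producer;

        FailureRouter(KafkaProducer<String, String> producer) {
            this.producer = producer;
        }

        // Returns true when the record is finished (processed or dead-lettered),
        // false when the caller should back off and retry it later.
        boolean handle(ConsumerRecord<String, String> rec, Processor processor) {
            try {
                processor.process(rec);
                return true;
            } catch (MalformedPayloadException permanent) {
                // Permanently unprocessable: the only case that goes to the DLQ.
                producer.send(new ProducerRecord<>("orders.dlq", rec.key(), rec.value()));
                return true;
            } catch (DownstreamUnavailableException transientFailure) {
                // Transient: do NOT DLQ and do NOT advance the consumer group.
                return false;
            }
        }

        interface Processor { void process(ConsumerRecord<String, String> rec); }

        // Stand-ins for "the message itself is bad" vs "a dependency is down".
        static final class MalformedPayloadException extends RuntimeException {}
        static final class DownstreamUnavailableException extends RuntimeException {}
    }

The partitioning idea in the parenthetical would then amount to keying messages so that ones sharing the same downstream dependencies land in the same partitions, so a stuck dependency only blocks its own partitions.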
I think you're right to mention that an overused DLQ potentially cripples the whole event broker, but I don't think having a second system that could fall over for the same reason AND a host of other reasons is a good plan. FTR, I think doubling Kafka's provisioned capacity is a simpler, easier, cheaper, and more reliable approach.
BUT you are 100% right to point to what I think is the proper solution, and that is to treat the DLQ with some respect, not as a bit bucket where things get dumped because the wind isn't blowing in the right direction.
I am the author of this article. Thank you for reading and for the insight; I second your opinion about DLQ flooding. We have the following strategy configured in our consumers to avoid DLQ flooding:
ExponentialBackOffWithMaxRetries backOff = new ExponentialBackOffWithMaxRetries(3);
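For context, a backoff like that is typically plugged into Spring Kafka's DefaultErrorHandler together with a DeadLetterPublishingRecoverer, roughly as below (the bean method and intervals are illustrative, not a verbatim copy of our config):

    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.kafka.core.KafkaTemplate;
    import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
    import org.springframework.kafka.listener.DefaultErrorHandler;
    import org.springframework.kafka.support.ExponentialBackOffWithMaxRetries;

    @Configuration
    class ConsumerRetryConfig {

        // Retry a failed record up to 3 times with exponential backoff; only after
        // the retries are exhausted is the record published to the dead-letter topic.
        @Bean
        DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> kafkaTemplate) {
            ExponentialBackOffWithMaxRetries backOff = new ExponentialBackOffWithMaxRetries(3);
            backOff.setInitialInterval(1_000L);   // 1s, then 2s, 4s, capped below
            backOff.setMultiplier(2.0);
            backOff.setMaxInterval(10_000L);
            return new DefaultErrorHandler(new DeadLetterPublishingRecoverer(kafkaTemplate), backOff);
        }
    }

The error handler is attached to the listener container factory (setCommonErrorHandler), so the DLQ only ever sees records that exhausted their retries.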
This is explicitly called out in the blog post in the trade-offs section.
I was one of the engineers who helped make the decisions around this migration. There is no one size fits all. We believed in that thinking originally, but after observing how things played out, decided to make different trade-offs.
To me it sounds like so: "We realized that we were not running microservice architecture, but rather a distributed monolith, so it made sense to make it a regular monolith". It's a decision I would wholeheartedly agree with.
I don't think you read the post carefully enough: they were not running a distributed monolith, and every service was using different dependencies (versions of them).
This meant that it was costly to maintain and caused a lot of confusion, especially with internal dependencies (shared libraries): this is the trade-off they did not like and wanted to move away from.
They moved away from this in multiple steps, first one of those being making it a "distributed monolith" (as per your implied definition) by putting services in a monorepo and then making them use the same dependency versions (before finally making them a single service too).
I think the blog post is confusing in this regard. For example, it explicitly states:
> We no longer had to deploy 140+ services for a change to one of the shared libraries.
Taken in isolation, that is a strong indicator that they were indeed running a distributed monolith.
However, the blog post earlier on said that different microservices were using different versions of the library. If that was actually true, then they would never have to deploy all 140+ of their services in response to a single change in their shared library.
Shared telemetry library: you realize you're missing an important metric to operationalize your services. You now need to deploy all 140 to get the benefit.
Your runtime version is out of date / end of life. You now need to update and deploy all 140 (or at least all the ones that use the same tech stack).
No matter how you slice it, there are always dependencies across all services because there are standards in the environment in which they operate, and there are always going to be situations where you have to redeploy everything or large swaths of things.
Microservices aren’t a panacea. They just let you delay the inevitable but there is gonna be a point where you’re forced to comply with a standard somewhere that changes in a way that services must be updated. A lot of teams use shared libraries for this functionality.
These are great examples. I'll add one more: object names and metadata definitions. Figuring out what the official name for something is across systems, where to define the source of truth, and who maintains it.
Why do all services need to understand all these objects though? A service should as far as possible care about its own things and treat other services' objects as opaque.
... otherwise you'd have to do something silly like update every service every time that library changed.
As you mention, it said early on that they were using different versions for each service:
> Eventually, all of them were using different versions of these shared libraries.
I believe the need to deploy 140+ services came out of wanting to fix this by using the latest version of the deps everywhere, and to then stay on top of it so it does not deteriorate in the same way (and possibly when they had things like a security fix).
If a change requires cascading changes in almost every other service then yes, you're running a distributed monolith and have achieved zero separation of services. Doesn't matter if each "service" has a different stack if they are so tightly coupled that a change in one necessitates a change in all. This is literally the entire point of micro-services. To reduce the amount of communication and coordination needed among teams. When your team releases "micro-services" which break everything else, it's a failure and hint of a distributed monolith pretending to be micro-services.
As I said, they mention having a problem where each service depended on different versions of internal shared libraries. That indicates they did not need to update all at once:
> When pressed for time, engineers would only include the updated versions of these libraries on a single destination’s codebase.
> Over time, the versions of these shared libraries began to diverge across the different destination codebases.
> ...
> Eventually, all of them were using different versions of these shared libraries.
The blog post says that they had a microservice architecture, then introduced some common libraries which broke the assumptions of compatibility across versions, forcing mass updates if a common dependency was updated. This is when they realized that they were no longer running a microservice architecture, and fused everything into a proper monolith. I see no contradiction.
Which is sort of fine, in my book. Update to the latest version of dependencies opportunistically, when you introduce other changes and roll your nodes anyway. Because you have well-defined, robust interfaces between the microservices, such they don't break when a dependency far down the stack changes, right?
Totally agree. For what it's worth, based on the limited information in the article, I actually do think it was the right decision to pull all of the per-destination services back into one. The shared library problem can go both ways, after all: maybe the solution is to remove the library so your microservices are fully independent, or maybe they really should have never been independent in the first place and the solution is to put them back together.
I don't think either extreme of "every line of code in the company is deployed as one service" or "every function is an independent FaaS" really works in practice, it's all about finding the right balance, which is domain-specific every time.
FWIW, I think it was a great write up. It's clear to me what the rationale was and had good justification. Based on the people responding to all of my comments, it is clear people didn't actually read it and are opining without appropriate context.
Having seen similar patterns play out at other companies, I'm curious about the organizational dynamics involved. Was there a larger dev team at the time you adopted microservices? Was there thinking involved like "we have 10 teams, each of which will have strong, ongoing ownership of ~14 services"?
Because from my perspective that's where microservices can especially break down: attrition or layoffs resulting in service ownership needing to be consolidated between fewer teams, which now spend an unforeseen amount of their time on per-service maintenance overhead. (For example, updating your runtime across all services becomes a massive chore, one that is doable when each team owns a certain number of services, but a morale-killer as soon as some threshold is crossed.)
I think this was either a `number of TBW (terabytes written)` or `% of space used` issue, since both drives got removed by the OS within a 9-hour delta and with the same usage (a couple of months, ~500GB), because they had been in a mirror since the beginning. If it were a sensor issue, SMART data should have shown this. Not saying that a sensor issue doesn't exist, but I doubt that this was my problem (mine don't have a preinstalled heatsink; I prefer to use my own).
I now use a 3-way mirror and am mixing brands.
One very nice thing: the Samsung 990 Pro 4TB has the exact same capacity (down to the byte) as the WD_BLACK SN850X 4TB, so they can be replaced without any issues. This was rarely the case with SSDs and HDDs, and probably other NVMe drives. Looks like they learned.
> It's a much better answer to hook up everything on Ethernet that you possibly can than it is to follow the more traveled route of more channels and more congestion with mesh Wi-Fi.
Certainly this is the brute-force way to do it and can work if you can run enough UTP everywhere. As a counterexample, I went all-in on WiFi and have 5 access points with dedicated backhauls. This is in SF too, so neighbors are right up against us. I have ~60 devices on the WiFi and have no issues, with fast roaming handoff, low jitter, and ~500Mbit up/down. I built this on UniFi, but I suspect Eero PoE gear could get you pretty close too, given how well even their mesh backhaul gear performs.
I'm not super familiar with SF construction materials but I wonder if that plays a part in it too? If your neighbors are separated by concrete walls then you're probably getting less interference from them than you'd think and your mesh might actually work better(?)... but what do I know since I'm no networking engineer.
Old Victorians in SF will sometimes have lath and plaster walls (the 'wet wall' that drywall replaced). Lath and plaster walls often have chicken wire in them that degrades WiFi more than regular drywall will.
It's way less about device count, and more about AP density - especially in RF challenging environments.
I pretty much just deploy WiFi as a "line of sight" technology these days in a major city. Wherever you use the WiFi, you need to be able to visually see the AP. Run them in low-power mode so they become effectively single-room access points.
Obviously for IoT 2.4GHz stuff sitting in closets or whatever it's still fine, but with 6GHz becoming standard, the "AP in every room" model is becoming more and more relevant.
A smart home will definitely run those numbers up. I have about 60 WiFi devices and another 45 Zigbee devices and I'm only about halfway done with the house.
This seems pretty bad from the headline, but there's no evidence of any in-the-wild exploits, or even that a feasible real-world exploit exists here. Some other domino(s) have to fall before it allows RCE. For instance, browser-based exploits are blocked by SELinux restrictions on dlopen from the downloads path.
Magnetic hard drives are 100X cheaper per GB than when S3 launched, and about 3X cheaper than in 2016 when the price last dropped. Magnetic prices have actually ticked up recently due to supply chain issues, but HAMR is expected to cause a significant drop (50-75% per GB) in magnetic storage prices as it rolls out over the next few years. SSDs are ~$120/TB and magnetic drives are ~$18/TB. This hasn't changed much in the last 2 years.
Is it bullshitting to perform nearly perfect language-to-language translation, or to generate photorealistic depictions from text quite reliably? Or to reliably perform named entity extraction, or any of the other millions of real-world tasks LLMs already perform quite well?
Picking another task like translation, which doesn't really require any knowledge outside of language processing, is not a particularly good way to convince me that LLMs are doing anything other than language processing. Additionally, "near perfect" is overselling it a bit, in my experience, given that they still struggle with idioms and cultural expressions.
Image generation is a bit better, except it's still not really aware of what the picture is, either. It's aware of how images are described by others, not of the truth of the generated image. It makes pictures of dragons quite well, but if you ask it for a contour map of a region, is it going to represent it accurately? It's not concerned with truth; it's concerned with truthiness, the appearance of truth. We know when that distinction is important. It doesn't.
> For e-commerce workloads, the performance benefit of write-back mode isn’t worth the data integrity risk. Our customers depend on transactional consistency, and write-through mode ensures every write operation is safely committed to our replicated Ceph storage before the application considers it complete.
Unless the writer always blindly overwrites entire files at once (never read-then-write), consistency requires consistent reads AND writes. Even then, potential ordering issues creep in. It would be really interesting to hear how they deal with it.
They mention it as a block device, and the diagram makes it look like there's one reader. If so, this seems like it has the same function as the page cache in RAM, just saving reads, and looks a lot like https://discord.com/blog/how-discord-supercharges-network-di... (which mentions dm-cache too).
If so, safe enough, though if they're going to do that, why stop at 512MB? The big win of Flash would be that you could go much bigger.
None of the things you cited are “miracle weight loss drugs.” They are things people did to lose weight. These are the first class of drugs that actually cause people to lose weight.
Amphetamine is actually a very effective weight loss drug, and that's sort of orthogonal to the fact that it's a stimulant. Stimulants in general can cause an acute reduction in appetite and temporary weight loss. This tends to stabilise with tolerance, however. As someone with obesity and ADHD, that was my experience with methylphenidate treatment. And I used to think the weight loss effects of amphetamine were analogous, until recently.
Amphetamine and methylphenidate (MPH) have very different ways of acting as stimulants. MPH is an inhibitor of the dopamine transporter (DAT) and the norepinephrine transporter (NET). These cross-membrane proteins essentially "suck up" the dopamine or norepinephrine after neurotransmission, thus regulating the effect. MPH inhibits this process, increasing the effect. This is called a norepinephrine-dopamine reuptake inhibitor (NDRI). Cocaine also works like this, as does the antidepressant Wellbutrin (bupropion).
Amphetamine, on the other hand, is a bit more complicated. It interacts with DAT/NET as well, but as a substrate, actually passing through them into the neuron. Inside the neuron, it has a complex series of interactions with TAAR1, VMAT2, and ion concentrations, causing signaling cascades that lead to DAT reversal. Essentially, enzymes are activated that modify DAT in such a way that it pumps dopamine out of the neuron instead of sucking it up. How that happens is very complicated and beyond the scope of this comment, but amphetamine's activity at TAAR1 is an important contributor. As such, amphetamine is a norepinephrine-dopamine releasing agent (NDRA). Methamphetamine, MDMA, and cathinone (from khat) also work like this.
Anyway, recently I was reading about TAAR1 and learned something new: besides being an internal receptor in monoaminergic neurons, TAAR1 is also expressed in the pancreas, the duodenum, the stomach, and the intestines, and in these tissues TAAR1 activation increases the release of GLP-1, PYY, and insulin, as well as slowing gastric emptying.
So in essence, there may be some pharmacological overlap between Ozempic and amphetamine. (I'm still looking for data on how significantly amphetamine reaches TAAR1 in these tissues, so it's unclear how relevant this is. But amphetamine is known to diffuse across cellular membranes, so it's likely there is an effect.)
Also interesting: amphetamine was recently approved as a treatment for binge eating disorder. Not only because it causes weight loss, but because it improves functioning in the prefrontal cortex (crucial to its efficacy in ADHD), which is apparently implicated in the neuropsychological aspects of BED as well.
There is a mixed picture on this. I see a lot of reports of it causing bingeing in the evenings despite no prior issues.
The issue is that therapeutic doses are not the multi-day bender of a speed freak who forgoes sleep to keep their blood concentration permanently high. Instead it's a medicated window of 6-12 hours, with a third or more of the waking hours remaining for rebound effects to unleash stimulation-seeking demons that run wilder than ever.
The American Tobacco Company marketed cigarettes for women's weight control in the 1920s. Lucky Strike's "Reach for a Lucky" campaign was a big example of this as well, although they marketed it as an appetite suppressant rather than simply a miracle weight loss cure.
Keeping in theme, Ozempic specifically has already been marketed off-label as an appetite suppressant, rather than a pure weight loss drug. That's a more modern construct in its brief history.