This is a brilliant and ambitious project. However, the last time I tried it, a few months ago, it seemed to drop a lot of traffic once the throughput became anything non-trivial (100+ requests per second across 50ish microservices).
Maybe it's a question of UX, more likely it's a question of user error (and therefore partly UX). In any case, I can't wait for it to get better!
My original view was that this would be used for development and staging environments - is the idea that this can be used in a production environment as well?
Istio delivers encrypted packets across services/pods, but inside the pod the Envoy proxy does the decryption and puts the packet back on the interface unencrypted, to be consumed by the application.
In Kubernetes with the Calico CNI, pod networking is done using a veth pair. One interface stays in the host namespace (hooked to a host bridge, or routed directly in Calico's case) and the other is moved into the pod's network namespace.
Mizu is deployed as a DaemonSet (one pod per node), and with enough privilege you can "tcpdump" the pod interface (the one where the Envoy proxy puts the packets back unencrypted)
(similar to "ip netns exec")
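For anyone who wants to try the same idea by hand, here is a rough Python sketch that builds an `nsenter` + `tcpdump` command line for a process inside the pod. To be clear about assumptions: this is not how Mizu itself is implemented, the helper name `netns_tcpdump_argv` is made up, the PID has to be one you looked up yourself on the node, and `eth0` is only the conventional name of the pod-side interface.

```python
import shlex


def netns_tcpdump_argv(pid, interface="eth0", capture_filter=None):
    """Build an argv that runs tcpdump inside the network namespace of `pid`.

    Hypothetical helper for illustration; running the result needs root on the node.
    """
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    argv = [
        "nsenter", "--target", str(pid), "--net",  # enter only the net namespace
        "tcpdump", "-i", interface, "-nn",         # no name or port resolution
    ]
    if capture_filter:
        # tcpdump accepts the filter expression as trailing arguments
        argv += shlex.split(capture_filter)
    return argv
```

Something like `subprocess.run(netns_tcpdump_argv(12345, capture_filter="tcp port 8080"))` would then run the capture, where 12345 stands in for a real PID from the pod.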
It sounds like it's not adding anything on the path, that's just another process sniffing the interface. If that process doesn't read fast enough, I assume it will miss packets but not slow things down.
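A toy model of that assumption, in Python: the traffic path never waits on the observer, and a bounded buffer drops packets when the observer falls behind. The `PassiveTap` class and `simulate` function are hypothetical names, and this is only a sketch of drop-instead-of-block behaviour, not a description of how Mizu or the kernel actually buffers captures.

```python
from collections import deque


class PassiveTap:
    """Hypothetical bounded capture buffer: drops when full, never blocks."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()
        self.dropped = 0

    def offer(self, packet):
        # Called on the traffic path: constant time, never waits for the reader.
        if len(self.buffer) >= self.capacity:
            self.dropped += 1  # the sniffer misses this packet
            return False
        self.buffer.append(packet)
        return True

    def read(self, max_packets):
        # Called by the (possibly slow) sniffer process.
        out = []
        while self.buffer and len(out) < max_packets:
            out.append(self.buffer.popleft())
        return out


def simulate(total_packets, capacity, reads_per_drain):
    """One packet arrives per tick; the sniffer drains every fourth tick."""
    tap = PassiveTap(capacity)
    delivered = 0
    captured = 0
    for tick in range(total_packets):
        delivered += 1  # the application always receives its packet
        tap.offer(tick)
        if tick % 4 == 3:
            captured += len(tap.read(reads_per_drain))
    captured += len(tap.read(capacity))  # final drain of whatever is left
    return delivered, captured, tap.dropped
```

With a reader that keeps up, nothing is lost; with a reader that drains half as fast as packets arrive, the application still sees every packet and only the capture has holes.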
That hasn't really been my experience, I have found an abundance of documentation (sometimes getting lost down the rabbit hole). But I have kept clear of the corporate buzzword solutions, that might be why.
In full production deployment, sure, but in full production deployment everything is "enshrouded by support tooling" - or had better be, if you want to be able to recognize and respond to incidents in a timely fashion. But if that's not a requirement, there are easier options - my Kubernetes home lab, for example, I set up in about 20 minutes with k3s [1], and it's been ticking happily along for over a year at this point with zero administrative burden. Granted, it'll be a PITA if that ever breaks, but it's a lab and I don't have to care more about downtime than I want to; at worst I'll tear it down, follow my notes from last time to stand it up again, and reapply my manifests out of source control.
Meanwhile, as a developer, I find that k8s does simplify a great deal with regard to deploying services and managing them once deployed - this, in turn, makes life easier for our infra folks, who no longer have to spend so much time splitting focus to provide direct dev support for deployment issues. Which is after all one of the reasons to run Kubernetes at all! It's built to simplify things, but not to simplify everything; there is a deal of irreducible complexity involved in the domain k8s addresses, and because it is indeed irreducible, that complexity can at best be managed and abstracted, not removed. That's what k8s intends to do, and in my experience it does that quite well - not without tradeoffs, true enough, but when in our line of work is that ever not the case?
[1] I mention this in particular because I used to take a view much like you do here, until an HN commenter pointed out in response to my own such fussing that it didn't have to be as complicated as I had understood it to be, and recommended trying k3s in that wise. It's been a year and I no longer remember who that was, but they were absolutely correct regardless!
If I can summarize your points in short, I'm getting that your view of Kubernetes is that it abstracts complexity that is inherent to our problem domain in an attempt to simplify it as best it can. In particular that is true around deployments.
Point taken, and it's a valid one. I'll throw some more of my perspective at you here (from a guy who mainly uses AWS managed services).
Let's talk scaling. Scaling was a solved problem before Kubernetes, and still is (see autoscaling groups). Imagine an autoscaling group whose instances launch and run one docker container -- your service (maybe also nginx, if you want). Maybe you even went the extra mile and baked your app's docker container into an AMI so you don't take a runtime dependency on Docker Hub or similar.
In an autoscaling group, you define where instances can launch and what type you want. You can get fancy about capacity by leveraging weighted capacities to increase instance type diversity. You set the amount of capacity you want, and it goes. ASG instances register with a target group behind a load balancer, and you're off to the races. Oh and by the way: if any of those things break, it's Amazon's problem, not mine (could be good or bad; I personally think it's good).
AWS has scaling figured out. Busy neighbors? Can't say that I have been impacted by that recently. Run out of disk? The impact of that is limited to one node that can be respun independently of everything else.
Kubernetes scaling is a similar story with a different interface... except now you have to worry about how services stack up on nodes. For example, suppose you have a node of size "4X" and a service of size "3X" (X can be whatever limiting factor you want - CPU or memory). You now have to deal with the potential issue of 1X being wasted on nodes that the 3X service is deployed to. Maybe there's a 1X service that can slot in there. Maybe the 1X service is not scaled as high as the 3X service, so some nodes just sit there with wasted utilization. You can deploy the 3X service to a correctly sized node group, but that's another thing to manage. You could go the route of oversubscribing resources, which would save money but it's another failure mode for your infrastructure.
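To put numbers on the 4X/3X example, here is a small Python sketch. It is a toy first-fit model with a single resource dimension, not how the real Kubernetes scheduler places pods, and `stranded_capacity` is a made-up name.

```python
def stranded_capacity(node_size, big_size, big_replicas, small_size=0, small_replicas=0):
    """Return (nodes_used, wasted_units, unplaced_small_pods).

    Toy model: nodes are filled with as many big pods as fit, then small
    pods are slotted into whatever space is left on those same nodes.
    """
    if big_size > node_size:
        raise ValueError("big pod does not fit on a node")
    per_node = node_size // big_size
    free = []  # leftover capacity on each node after placing big pods
    remaining = big_replicas
    while remaining > 0:
        placed = min(per_node, remaining)
        free.append(node_size - placed * big_size)
        remaining -= placed
    to_place = small_replicas
    if small_size > 0:
        for i in range(len(free)):
            # First-fit: keep filling this node's gap while a small pod fits.
            while to_place > 0 and free[i] >= small_size:
                free[i] -= small_size
                to_place -= 1
    return len(free), sum(free), to_place
```

With X = 1, ten replicas of the 3X service on 4X nodes strand 10 units out of 40. Four replicas of a 1X service bring that down to 6, and you only reach zero waste if the 1X service happens to scale to ten replicas as well.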
If you run Kubernetes on AWS, it's probably using autoscaling groups anyway. And then on top it's doing its own orchestration. So to run Kubernetes on AWS, you have to understand both autoscaling AND how Kubernetes works.
How about security? The wisdom of the day is: don't colocate sensitive services on the same Kubernetes nodes due to container escape concerns (and probably some others). So now you're looking at running an entirely separate Kubernetes cluster, or at least segregating the sensitive service to a different set of nodes. That's more management that is simply a non-issue running a VM-focused workload.
How about load balancing? AWS can manage your load balancer and get you an SSL certificate that renews automatically. In Kubernetes, you can use the managed stuff or roll your own load balancing. If you elect to roll your own, you get the same functionality -- except if it breaks, it's your problem. If you roll with the managed stuff, you now have to understand not only how AWS load balancing works, but also how Kubernetes interacts with it. If the Kube interface breaks, it's probably your problem.
I could continue here, but I think you get the idea.
The central issue of Kubernetes is that you take on more problems than you otherwise have to versus using managed services. Amazon has entire teams working SPECIFICALLY on autoscaling, load balancing, SSL cert provisioning, etc. Any individual or team who thinks they can do better (in addition to their other job duties) probably has a bit of an ego issue.
Sure, that all checks out, and it's worth clarifying that I've never been all that close to k8s production ops on EKS or elsewhere - my experience with that has always been primarily as a developer, and my perspective is necessarily informed by that, so it's not easy for me to speak in detail on the tradeoffs you describe.
That said, the largest production infrastructures I've worked with have all been hosted in EKS, so I'm not entirely without relevant experience, and that experience does give me still to think that k8s delivers value despite the added overhead and complexity you correctly describe.
The example that comes to mind is a multi-day downtime we took, a few years and one job ago. I don't recall the details in full at this point, but my interpretation of (what I overheard of) what our TAMs were saying is that we had scaled at a rate and to a point where their infra had failed to keep up - in any case, they and we were both clear that the problem was not of our making. Either way, though, none of our EKS-hosted infra was able to meaningfully operate, with nodes going unhealthy almost as soon as they came up.
But despite the EKS downtime lasting as long as it did, we only took a few hours' worth of revenue impact, because we were able quickly to bring up our critical-path services on GKE, cut over our LBs, and get back up and running with just enough functionality to do business. It wasn't perfect, and our CX folks in particular had a somewhat rough go of it, but we got back in business as quickly as we did because we hadn't gone all-in on AWS tools, and because one of the best infra teams I've ever had the privilege of working with had put real thought and planning into "never happen" DR cases like "what if EKS breaks?"
I've worked with some amazing teams, including the one I'm with now. Sure, I don't expect us to be as good as Amazon is at the things Amazon specializes in, and I certainly don't imagine that most teams are too likely to run into the kind of mishap I just described. But most of the stuff you're describing is almost table stakes these days - Amazon, Google, Azure, Tencent, no doubt a dozen others, hell, even DigitalOcean offers most of those capabilities these days, although they're newer enough at it that I wouldn't go straight there for a significant prod workload - I tried a side project on their managed k8s a couple years back and had a lot of trouble, and while I'm sure they're a lot better now, I'd still be a little nervous.
(They're a lot cheaper, though! Or were, last time I looked - cheap enough I didn't think twice about paying out of pocket, which I super would have to weigh carefully with one of the big players. And if my stuff is k8s native rather than platform native, it's a lot easier to think about starting cheap with DO and migrating to AWS when it's time to join the big leagues, right?)
Maybe AWS stays good forever, or maybe it doesn't. For a side project, I'd probably be fine with just using their own-brand offerings. But for a real business? I've never been a founder, and I don't suppose I can really imagine I know. But I do think that, at the point where I'd accrued enough runway to start thinking hard about the future - planning in detail for a horizon of not a month, not a year, but two or three or five years out - I think that'd be the point where I would want very much to prioritize making sure I could keep the doors open to customers even if AWS did one day catch fire, because I've seen that happen before. Sure, if everyone else is AWS-bound too, I'm probably not in worse shape for the outage than they are. But why settle for "not worse" when I can shoot for better? Why be born and die with AWS if, at the price of a little more work early on, I can buy myself a meaningfully longer runway by avoiding big-boy infra costs until I know I need big-boy infra?
I don't really know, I suppose. As I've been at pains to point out in all these conversations today, it's always a tradeoff no matter what you do, and the trick is all in getting the most upside out of whatever choices you make - especially the complicated ones. If I'd been more hands-on with AWS and knew more of what you know, maybe I'd hold an opinion much like yours. But I don't, and this is the thinking that my judgment and experience lead me to.
I think I'm not full of it, but who knows? Maybe I'll go scare up a seed round and we'll find out in the only way that really counts whether or not I'm talking nonsense. :) But in the meantime, kicking it around seems a fine way to pass an afternoon when I'm too ill to work.
Thanks for the detailed reply! It's been fun engaging with you.
You have really hit the nail on the head about risk, and what it means to go "all-in" on a particular provider.
There's a very relevant case to be made for electing to remain provider agnostic if your business is sensitive to outages. There is a balance between avoiding administrative overhead and mitigating risk.
I like to think about it like buying insurance (I'd put security posture in the same bucket, too). Pursuing a cloud-agnostic design has some cost per month. A cloud vendor outage has some chance of happening per month, and if it does occur it will cost you $X. The business needs to figure out when:
cost(cloud-agnostic design) < P(vendor outage) * $X
Figuring that out isn't trivial, especially because a large part of cost(cloud-agnostic design) will be in employee hours. At best you probably get a ballpark estimate.
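The insurance comparison can be written down as a few lines of Python. This is only a sketch: the function names are made up, every number you feed it is a guess, and it assumes the cloud-agnostic design avoids the whole outage cost, which is optimistic (a failover usually still costs you some hours).

```python
def expected_monthly_outage_cost(p_outage_per_month, outage_cost):
    """Expected loss per month from a vendor outage: probability times $X."""
    return p_outage_per_month * outage_cost


def agnostic_design_pays_off(agnostic_cost_per_month, p_outage_per_month, outage_cost):
    """True when the monthly 'premium' is below the expected monthly loss."""
    return agnostic_cost_per_month < expected_monthly_outage_cost(
        p_outage_per_month, outage_cost
    )


def break_even_outage_cost(agnostic_cost_per_month, p_outage_per_month):
    """Outage cost $X above which the cloud-agnostic design is worth paying for."""
    if p_outage_per_month <= 0:
        raise ValueError("outage probability must be positive")
    return agnostic_cost_per_month / p_outage_per_month
```

For example, with a purely hypothetical $5,000 per month of extra engineering time and a 1% monthly chance of a serious outage, the design only pays for itself if an outage would cost you more than about $500,000.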
Is it necessary? I've operated in Kubernetes environments for several years and not had a visualizer like this and survived just fine. If anything, it's neat you can do this in k8s relatively easily and "for free" on an existing cluster. The only complexity this calls out is that of microservice architecture done to an extreme, but not that of Kubernetes.
At many of my clients, when an app breaks on Kubernetes, it's generally on the admins to show that it wasn't the cluster's fault before the app team takes ownership of the issue. Tools that help very clearly show that traffic made it to the app, and what the app responded with, are super helpful.
I'm assuming that they didn't spend time authoring this for no good reason. It got 2700 stars on GitHub, which isn't nothing.
> The only complexity this calls out is that of microservice architecture done to an extreme, but not that of Kubernetes.
There are better approaches to dealing with tracking requests through a microservice architecture than a ham-fisted packet capture.
By default, the way networking is configured in Kubernetes is exceptionally complex. It certainly stands to reason that a packet capture may be useful at some point.
* log of all traffic
* service graph
* automatic swagger doc generation
THANK YOU!!