Hacker News | maplebed's comments

There are a number of articles on this topic coming from the perspective of martial arts rather than music. Start at http://codekata.com and you’ll find good articles. (Note - I am not associated with codekata; just find the idea neat.)


And of course (irony warning), the first line on codekata.com says "How do you get to be a great musician?"... sorry, just poking fun.


For the curious, the first failure I saw was at 13:15UTC and the last was 14:59UTC.


13:14 UTC through 15:54 UTC here.


Yup! It's hard! All the things you point out are right on.

We don't have the visualizations for histograms yet (though you can chart specific percentiles), but for the reasons you mention, Honeycomb is perfectly suited to give you that kind of data. I can't say we'll get that out the door soon, but it's one of my most-wanted pet features, so as soon as I can convince myself it's actually more important than the mountain of other things that need to get done, you'll get your histograms and your time-over-time comparisons.

I've been advocating for a heat map style presentation of histograms for a long time, but I hadn't considered the difficulty that creates when trying to show time over time. That's an interesting one to noodle on.

Thanks for articulating well the value and reasons for difficulty in implementing histograms!

(bias alert - I work on Honeycomb)


Are you a plant? It must just be coincidence that the second post in the series is titled "measuring capacity." :) https://honeycomb.io/blog/2017/01/instrumentation-measuring-...

(bias alert - I work on Honeycomb)


we are all learning from the same folks ahead of us it seems :)

I agree with other comments, though: the devil is in the details of how to actually set up these "golden signals" so that they are useful and don't just drown everyone in packet-level nonsense.


By creating events that contain both the duration of the request and whether it succeeded, you can create graphs that show you the detail you need. Unless you include those data together at the beginning, it will be impossible to tease them apart later on. Combining them into one graph will likely conceal the difference in the two cases, as you describe, unless you feed them into a system that can natively tease them apart as easily as show them together (such as http://honeycomb.io). So it seems like the disagreement is more about visualization than collection (the section of the blog in which that quote appears).
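A minimal sketch of what "include those data together" means, with hypothetical field names and a simulated handler (this is not Honeycomb's actual API):

```python
import json
import random
import time

def handle_request():
    """Simulate handling a request; returns (succeeded, duration_ms)."""
    start = time.monotonic()
    succeeded = random.random() > 0.1  # pretend ~10% of requests fail
    duration_ms = (time.monotonic() - start) * 1000
    return succeeded, duration_ms

# Emit one event per request, carrying BOTH fields together so they
# can be split apart or combined at query time later.
succeeded, duration_ms = handle_request()
event = {
    "endpoint": "/api/widgets",  # illustrative field names
    "duration_ms": duration_ms,
    "status": "success" if succeeded else "error",
}
print(json.dumps(event))
```

Once the event carries both fields, "latency of errors only" is a query-time filter rather than a collection-time decision.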

The originally quoted advice, to show "the duration it took to serve a response to a request, also labelled by successes or errors" remains good advice, so long as the visualization of that data makes clear the separation.

I absolutely agree that careful consideration is required when choosing what to put on dashboards to avoid confusion. That seems to be a separate issue.

(bias alert - I work on Honeycomb, and care deeply about collecting data in a way that lets you pull away the irrelevant data to illuminate the real problems.)


In practice, if your system is complicated and you have to look at the visualization, you are already in trouble. For anything complicated, you need exactly the inputs you describe, but everything has to be processed by another layer that can give you higher-level ideas.

This is a place where I think you guys could beat what other 3rd party monitoring tools are doing. I work with some of your guest bloggers, and I work on a subsystem with its own dashboard: about 50 charts. To make onboarding new teammates a sensible experience, we need both a layer of alerts on top of the charts and a set of rules of thumb (which would be programmed in, if the alerting system were good enough) that put the alerts together into realistic failure cases: if X and Y triggered but Z didn't, then chances are this piece is the culprit.
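Those rules of thumb could be encoded as trivially as boolean combinations over alert states. A toy sketch (alert names and culprits are invented):

```python
def likely_culprit(alerts):
    """Map a combination of fired alerts to a suspected subsystem.

    `alerts` is a dict of alert name -> whether it is currently firing.
    The rules below are illustrative, not real ones.
    """
    if alerts["X"] and alerts["Y"] and not alerts["Z"]:
        return "cache layer"
    if alerts["Z"] and not alerts["X"]:
        return "upstream dependency"
    return None  # no rule matched; a human has to look at the charts

# Hypothetical alert states, e.g. pulled from an alerting API.
print(likely_curlprit := likely_culprit({"X": True, "Y": True, "Z": False}))
```

Even a hand-maintained table like this captures the tribal knowledge that otherwise only exists in senior teammates' heads.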

There are also opportunities in visualizations that aren't chart-based. We used to have something like that for another complex system at another employer, but that's expensive, custom work unless you join forces with something that understands where all your services are, knows all the ingress and egress rules, and thus could automatically generate a picture of your system, along with understanding the instrumentation. So leave that until you merge with SkylinerHQ or something.

That said, I think you guys are heading towards a good, marketable product as it is. Fixing the annoying statsd/splunk divide of older monitoring would probably be enough to make us buy it already.


> Combining them into one graph will likely conceal the difference in the two cases, as you describe

Indeed. The first order issue is locating the problem though.

If you don't spot which of your microservices is the culprit due to only looking at successful latency, you're not going to get to the stage of comparing successful vs failed latency (and in practice, the increased error ratio combined with increased overall latency should tip you off).

> unless you feed them into a system that can natively tease them apart as easily as show them together

And the user actually thinks to perform that additional analysis.

> So it seems like the disagreement is more about visualization than collection

What I've seen happen is that the collection leads to the visualisation, which subsequently leads to prolonged outages due to misunderstanding.

Thus I suggest removing the risk of the issue on the visualisation end, by eliminating the problem at the collection stage. This is particularly important when the people doing the visualisation aren't the same people writing the collection code, and thus don't know if the people creating the dashboards will all be sufficiently operationally sophisticated.

> to show "the duration it took to serve a response to a request, also labelled by successes or errors" remains good advice, so long as the visualisation of that data makes clear the separation.

It's a little messier than that. Depending exactly on how the data is collected, such a split could make some analyses more difficult or impossible. For example, I need the overall latency increase in order to see whether this server is entirely responsible for the overall latency increase I see one level up in the stack, or whether there's some other or additional problem that needs explaining. There's no equivalent math for success or failure.

Put another way, the math on the overall works the way your intuition thinks it does. The split out version is more subtle.
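To make the subtlety concrete with a toy calculation (all numbers invented): the overall mean latency can be recovered from the split means, but only as a count-weighted combination, so you have to carry the counts along; naively averaging the two split means gives nonsense.

```python
# Success/failure split for one service (illustrative numbers).
n_ok, mean_ok = 990, 50.0     # 990 successes averaging 50 ms
n_err, mean_err = 10, 2000.0  # 10 errors averaging 2000 ms

# Overall mean: recoverable, but only via the count-weighted mean.
overall = (n_ok * mean_ok + n_err * mean_err) / (n_ok + n_err)
print(overall)  # 69.5 -- comparable to the layer above in the stack

# A naive unweighted average of the split means is wildly wrong.
naive = (mean_ok + mean_err) / 2
print(naive)    # 1025.0
```

The overall figure composes directly with the same figure one level up; the split figures only compose if the weights travel with them, which is the subtlety being pointed at.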

>(bias alert - I work on Honeycomb, and care deeply about collecting data in a way that lets you pull away the irrelevant data to illuminate the real problems.)

I work on Prometheus which is a metrics system. Honeycomb seems to be based on event logs. There's logic to removing the success/failure split for duration metrics as I suggest, but it'd be insanity to remove it for event logs. So in your case it is purely a visualisation problem, whereas for us losing granularity at the collection stage is an option (and sometimes required on cardinality grounds).

The terminology the article uses (incrementing a counter at an instrumentation point) led me to believe we were discussing only metrics.

The way I would see things is that you'd use a metrics-based system like Prometheus to locate and understand the general problem and which subsystems are involved, and then start using log-based tools like Honeycomb as you dig further in to see which exact requests are at fault. They're complementary tools with different tradeoffs.
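The granularity tradeoff can be sketched in a few lines (data invented): an event log keeps every request, while a metrics system pre-aggregates at the collection stage into a count and a sum per label set, bounding cardinality but discarding the individual requests.

```python
# Per-request event logs keep every field of every request...
events = [
    {"path": "/a", "status": 200, "duration_ms": 12.0},
    {"path": "/a", "status": 500, "duration_ms": 950.0},
    {"path": "/b", "status": 200, "duration_ms": 30.0},
]

# ...while a metrics pipeline collapses them to (count, sum) per
# label set at collection time.
metrics = {}
for e in events:
    key = (e["path"], e["status"])  # the label set
    count, total = metrics.get(key, (0, 0.0))
    metrics[key] = (count + 1, total + e["duration_ms"])

print(metrics)
# The exact 950 ms request is still findable in `events`, but the
# metric retains only (count=1, sum=950.0) for that label set.
```

This is why "drop the success/failure split at collection" is a real option for metrics (cardinality) but would be senseless for event logs, where the split costs nothing.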

I've written about this in more depth at http://thenewstack.io/classes-container-monitoring/


Nice to see that 36 years later they're writing the same thing. https://www.nytimes.com/2017/01/21/us/san-francisco-children... "San Francisco Asks: Where Have All the Children Gone?"


Once you have your report card, don't forget to revoke access. https://github.com/settings/applications


Good point. I can see a lot of people are worried this is getting write access. Thanks for providing a link and a reminder.


This immediately suggests the ability to provide permissions only for a specific transaction at a time.

In fact, I believe that that is essentially how Vault manages security.


[nit] Daniel-san


Though there's no answer in the video, the classic demonstration of TMS (Transcranial Magnetic Stimulation) you'll find in an intro to brain imaging class is to trigger sections of the brain containing motor neurons. This creates involuntary movement in the muscles of your body, rather than any kind of information you would process and willfully act upon. Given the mention of placing the TMS paddle on the opposite hemisphere of the brain from the subject's arm, it's plausible that they are actually just aiming at the motor neurons.


If the TMS causing a twitch is existing textbook knowledge, then are they doing anything new at all? Reading some kind of signal from a trained person's brain is also already commonplace.


As a tech worker who is more focused on the back end than the visible product, I have found a very low correlation between my belief in the product and my satisfaction with the job. I find that my job is pretty much the same regardless of the actual product. This experience is likely different for people more directly involved in the actual product.

I get much higher signal to indicate whether I will enjoy a job from my coworkers and the culture of the company. If I am working with smart motivated people, I will be happy no matter what I'm working on. If I am in an environment where people are continually innovating and pushing the boundaries of the status quo in the field, I will be happy with my job.

This awareness leads to a very different type of job search. It's easy to start a job search by thinking of a product you like and then trying to see if you can work for that company. It's harder to think of a culture you want to join and then look for that.

I thankfully have not been employed by a company whose product I actively despise; I'm sure that would have a disastrous effect on my job satisfaction.

