Hacker News | maplebed's comments

There are a number of articles on this topic coming from the perspective of martial arts rather than music. Start at http://codekata.com and you’ll find good articles. (Note - I am not associated with codekata; just find the idea neat.)


And of course (irony warning), the first line on codekata.com says "How do you get to be a great musician?"... sorry, just poking fun.


For the curious, the first failure I saw was at 13:15UTC and the last was 14:59UTC.


13:14 UTC through 15:54 UTC here.


Yup! It's hard! All the things you point out are right on.

We don't have the visualizations for histograms yet (though you can chart specific percentiles), but for the reasons you mention, Honeycomb is perfectly suited to give you that kind of data. I can't say we'll get that out the door soon, but it's one of my most-wanted pet features, so as soon as I can convince myself it's actually more important than the mountain of other things that need to get done, you'll get your histograms and your time-over-time comparisons.

I've been advocating for a heat map style presentation of histograms for a long time, but I hadn't considered the difficulty that creates when trying to show time over time. That's an interesting one to noodle on.

Thanks for articulating well the value and reasons for difficulty in implementing histograms!

(bias alert - I work on Honeycomb)


Are you a plant? It must just be coincidence that the second post in the series is titled "measuring capacity." :) https://honeycomb.io/blog/2017/01/instrumentation-measuring-...

(bias alert - I work on Honeycomb)


we are all learning from the same folks ahead of us it seems :)

I agree with other comments, though: the devil is in the details of how to actually set up these "golden signals" so that they are useful and don't just drown everyone in packet-level nonsense.


By creating events that contain both the duration of the request and whether it succeeded, you can create graphs that show you the detail you need. Unless you include those data together at the beginning, it will be impossible to tease them apart later on. Combining them into one graph will likely conceal the difference in the two cases, as you describe, unless you feed them into a system that can natively tease them apart as easily as show them together (such as http://honeycomb.io). So it seems like the disagreement is more about visualization than collection (the section of the blog in which that quote appears).
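A minimal sketch of what "include those data together" means, with hypothetical field names and a simulated handler (this is not Honeycomb's actual API):

```python
import json
import random
import time

def handle_request():
    """Simulate handling a request; returns (succeeded, duration_ms)."""
    start = time.monotonic()
    succeeded = random.random() > 0.1  # pretend ~10% of requests fail
    duration_ms = (time.monotonic() - start) * 1000
    return succeeded, duration_ms

# Emit one event per request, carrying BOTH fields together so they
# can be split apart or combined at query time later.
succeeded, duration_ms = handle_request()
event = {
    "endpoint": "/api/widgets",  # illustrative field names
    "duration_ms": duration_ms,
    "status": "success" if succeeded else "error",
}
print(json.dumps(event))
```

Once the event carries both fields, "latency of errors only" is a query-time filter rather than a collection-time decision.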

The originally quoted advice, to show "the duration it took to serve a response to a request, also labelled by successes or errors" remains good advice, so long as the visualization of that data makes clear the separation.

I absolutely agree that careful consideration is required when choosing what to put on dashboards to avoid confusion. That seems to be a separate issue.

(bias alert - I work on Honeycomb, and care deeply about collecting data in a way that lets you pull away the irrelevant data to illuminate the real problems.)


In practice, if your system is complicated and you have to look at the visualization, you are already in trouble. For anything complicated, you need exactly the inputs you describe, but everything has to be processed by another layer that can give you higher-level ideas.

This is a place where I think you guys could beat what other 3rd party monitoring tools are doing. I work with some of your guest bloggers, and I work on a subsystem with its own dashboard: about 50 charts. To make onboarding new teammates a sensible experience, we need both a layer of alerts on top of the charts and a set of rules of thumb (which would be programmed in, if the alerting system were good enough) that put the alerts together into realistic failure cases: if X and Y triggered but Z didn't, then chances are this piece is the culprit.
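Those rules of thumb could be encoded as trivially as boolean combinations over alert states. A toy sketch (alert names and culprits are invented):

```python
def likely_culprit(alerts):
    """Map a combination of fired alerts to a suspected subsystem.

    `alerts` is a dict of alert name -> whether it is currently firing.
    The rules below are illustrative, not real ones.
    """
    if alerts["X"] and alerts["Y"] and not alerts["Z"]:
        return "cache layer"
    if alerts["Z"] and not alerts["X"]:
        return "upstream dependency"
    return None  # no rule matched; a human has to look at the charts

# Hypothetical alert states, e.g. pulled from an alerting API.
print(likely_curlprit := likely_culprit({"X": True, "Y": True, "Z": False}))
```

Even a hand-maintained table like this captures the tribal knowledge that otherwise only exists in senior teammates' heads.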

There are also opportunities in visualizations that aren't chart-based. We used to have something like that for another complex system at another employer, but that's expensive, custom work unless you join forces with something that understands where all your services are, knows all the ingress and egress rules, and thus could automatically generate a picture of your system, along with understanding the instrumentation. So leave that until you merge with SkylinerHQ or something.

That said, I think you guys are heading towards a good, marketable product as it is. Fixing the annoying statsd/splunk divide of older monitoring would probably be enough to make us buy it already.


> Combining them into one graph will likely conceal the difference in the two cases, as you describe

Indeed. The first order issue is locating the problem though.

If you don't spot which of your microservices is the culprit due to only looking at successful latency, you're not going to get to the stage of comparing successful vs failed latency (and in practice, the increased error ratio combined with increased overall latency should tip you off).

> unless you feed them into a system that can natively tease them apart as easily as show them together

And the user actually thinks to perform that additional analysis.

> So it seems like the disagreement is more about visualization than collection

What I've seen happen is that the collection leads to the visualisation, which subsequently leads to prolonged outages due to misunderstanding.

Thus I suggest removing the risk of the issue on the visualisation end, by eliminating the problem at the collection stage. This is particularly important when the people doing the visualisation aren't the same people writing the collection code, and thus don't know if the people creating the dashboards will all be sufficiently operationally sophisticated.

> to show "the duration it took to serve a response to a request, also labelled by successes or errors" remains good advice, so long as the visualisation of that data makes clear the separation.

It's a little messier than that. Depending exactly on how the data is collected, such a split could make some analyses more difficult or impossible. For example, I need the overall latency increase in order to see whether this server is entirely responsible for the overall latency increase I see one level up in the stack, or whether there's some other or additional problem that needs explaining. There's no equivalent math for success or failure.

Put another way, the math on the overall works the way your intuition thinks it does. The split out version is more subtle.
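To make the subtlety concrete with a toy calculation (all numbers invented): the overall mean latency can be recovered from the split means, but only as a count-weighted combination, so you have to carry the counts along; naively averaging the two split means gives nonsense.

```python
# Success/failure split for one service (illustrative numbers).
n_ok, mean_ok = 990, 50.0     # 990 successes averaging 50 ms
n_err, mean_err = 10, 2000.0  # 10 errors averaging 2000 ms

# Overall mean: recoverable, but only via the count-weighted mean.
overall = (n_ok * mean_ok + n_err * mean_err) / (n_ok + n_err)
print(overall)  # 69.5 -- comparable to the layer above in the stack

# A naive unweighted average of the split means is wildly wrong.
naive = (mean_ok + mean_err) / 2
print(naive)    # 1025.0
```

The overall figure composes directly with the same figure one level up; the split figures only compose if the weights travel with them, which is the subtlety being pointed at.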

>(bias alert - I work on Honeycomb, and care deeply about collecting data in a way that lets you pull away the irrelevant data to illuminate the real problems.)

I work on Prometheus which is a metrics system. Honeycomb seems to be based on event logs. There's logic to removing the success/failure split for duration metrics as I suggest, but it'd be insanity to remove it for event logs. So in your case it is purely a visualisation problem, whereas for us losing granularity at the collection stage is an option (and sometimes required on cardinality grounds).

The terminology the article uses (incrementing a counter at an instrumentation point) led me to believe we were discussing only metrics.

The way I would see things is that you'd use a metrics-based system like Prometheus to locate and understand the general problem and which subsystems are involved, and then start using log-based tools like Honeycomb as you dig further in to see which exact requests are at fault. They're complementary tools with different tradeoffs.
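The granularity tradeoff can be sketched in a few lines (data invented): an event log keeps every request, while a metrics system pre-aggregates at the collection stage into a count and a sum per label set, bounding cardinality but discarding the individual requests.

```python
# Per-request event logs keep every field of every request...
events = [
    {"path": "/a", "status": 200, "duration_ms": 12.0},
    {"path": "/a", "status": 500, "duration_ms": 950.0},
    {"path": "/b", "status": 200, "duration_ms": 30.0},
]

# ...while a metrics pipeline collapses them to (count, sum) per
# label set at collection time.
metrics = {}
for e in events:
    key = (e["path"], e["status"])  # the label set
    count, total = metrics.get(key, (0, 0.0))
    metrics[key] = (count + 1, total + e["duration_ms"])

print(metrics)
# The exact 950 ms request is still findable in `events`, but the
# metric retains only (count=1, sum=950.0) for that label set.
```

This is why "drop the success/failure split at collection" is a real option for metrics (cardinality) but would be senseless for event logs, where the split costs nothing.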

I've written about this in more depth at http://thenewstack.io/classes-container-monitoring/


Nice to see that 36 years later they're writing the same thing. https://www.nytimes.com/2017/01/21/us/san-francisco-children... "San Francisco Asks: Where Have All the Children Gone?"


Once you have your report card, don't forget to revoke access. https://github.com/settings/applications


Good point. I can see a lot of people are worried this is getting write access. Thanks for providing a link and a reminder.


This immediately suggests the ability to provide permissions only for a specific transaction at a time.

In fact, I believe that that is essentially how Vault manages security.


[nit] Daniel-san


Though there's no answer in the video, the classic demonstration of TMS (Transcranial Magnetic Stimulation) you'll find in an intro to brain imaging class is to trigger sections of the brain containing motor neurons. This creates involuntary movement in the muscles of your body, rather than any kind of information you would process and willfully act upon. Given the mention of placing the TMS paddle on the opposite hemisphere of the brain from the subject's arm, it's plausible that they are actually just aiming at the motor neurons.


If the TMS causing a twitch is existing textbook knowledge, then are they doing anything new at all? Reading some kind of signal from a trained person's brain is also already commonplace.


As a tech worker who is more focused on the back end than the visible product, I have found a very low correlation between my belief in the product and my satisfaction with the job. I find that my job is pretty much the same regardless of the actual product. This experience is likely different for people more directly involved in the actual product.

I get much higher signal to indicate whether I will enjoy a job from my coworkers and the culture of the company. If I am working with smart motivated people, I will be happy no matter what I'm working on. If I am in an environment where people are continually innovating and pushing the boundaries of the status quo in the field, I will be happy with my job.

This awareness leads to a very different type of job search. It's easy to start a job search by thinking of a product you like and then trying to see if you can work for that company. It's harder to think of a culture you want to join and then look for that.

I thankfully have not been employed by a company whose product I actively despise; I'm sure that would have a disastrous effect on my job satisfaction.

