On a related note, the first few pages of my statistics cookbook [1] contain visualizations of several distributions with varying parameterization, both P[MD]Fs and CDFs. I found this quite helpful in getting an intuitive understanding of a function's "behavior."
It looks interesting (especially the diagram), but it's a pain to read due to low contrast (vide http://contrastrebellion.com/). I read it only after manually changing #666666 to #000000.
The "reader" function in Firefox works pretty well for that. I also found the contrast too low, especially on an older laptop. Switching to FF reader mode made a big difference.
There is an important point that the article makes, although only implicitly: if you have some data and want to know what its probability distribution is, then hopefully you know enough good things about where the data came from to know, even without looking at the data, what the probability distribution must be. The article gave such ways to know.
A biggie point: In practice this way of knowing is not only powerful but, really, nearly the only little platform you have to stand on to know how your data is distributed.
Here is an example one step beyond the article: You have a Web site, and users arrive. Okay, each user has their own complicated life: maybe they use the Internet only in the morning, or only in the evening, have nearly a fixed list of sites they go to, only get to your site from links at other sites they visit regularly, etc. That is, each user can have wildly complicated, unique personal behaviors on the Internet.
Still, the arrivals at your site will be as in a Poisson process, that is, the number of arrivals in the next minute will have the Poisson distribution and the time until the next arrival will have the exponential distribution. Why? A classic result called the renewal theorem. There is a careful proof in the second volume (the difficult one to read) of W. Feller's book on probability.
So, the arrivals at your Web site from user #1, Joe, form some complicated, unknowable stochastic arrival process. Fine. Joe has a complicated life. User #2, Mary, also has a complicated life but has essentially nothing to do with Joe (Joe is a nerd, and Mary is
pretty!). So, Mary acts independently of Joe. Similarly for users #3, 4, ..., 2 billion. Then the arrivals at your Web site are the sum of those 2 billion complicated, unique, independent arrival processes with unknowable details. Then, with a few more meager assumptions, presto, bingo, the renewal theorem says that the arrivals at your site form a Poisson arrival process.
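A toy simulation of that superposition (every number and name here is made up just for illustration): give each of a few thousand simulated users an idiosyncratic, decidedly non-Poisson visiting rhythm, pool all their arrival times, and the pooled counts per unit-time window come out with mean roughly equal to variance -- the signature of a Poisson process.

```python
import random

random.seed(7)

N_USERS = 2000     # many independent users (a stand-in for the 2 billion)
HORIZON = 1000.0   # total simulated time, arbitrary units

def user_arrivals():
    """One user's idiosyncratic, decidedly non-Poisson renewal stream."""
    mean_gap = random.uniform(500, 5000)   # this user's personal rhythm
    t = random.uniform(0.0, HORIZON)       # random phase for the first visit
    times = []
    while t < HORIZON:
        times.append(t)
        # heavy-tailed, user-specific gaps between visits
        t += random.lognormvariate(0.0, 1.0) * mean_gap
    return times

# Superpose all the independent per-user streams into one arrival stream.
arrivals = sorted(t for _ in range(N_USERS) for t in user_arrivals())

# For a Poisson process, the count per unit window has mean == variance.
counts = [0] * int(HORIZON)
for t in arrivals:
    counts[int(t)] += 1
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)
print(f"per-window mean {mean:.2f}, variance {var:.2f}, ratio {var / mean:.2f}")
```

Each individual stream is nothing like Poisson, but the pooled stream's variance-to-mean ratio lands near 1, as the renewal theorem predicts.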
There's a terrific chapter on the Poisson process in E. Cinlar's introduction to stochastic processes. Terrific. Some of what you can say, knowing that you have a Poisson process, is amazing. All with no or meager attention to the data and, instead, from knowing you have a Poisson process, e.g., from the renewal theorem from a sum of many independent arrival processes.
Bigger lesson: The renewal theorem is true in the limit of a sum of many independent arrival processes. So, it is a limit theorem. More generally, many of the crown jewels of probability are limit theorems that say what happens in the big picture when it is a limit of some kind of smaller things we have nearly no ability to understand. So, astoundingly, such limit theorems show that the effects of some universe of detail, maybe even big data, just wash out. Often very powerful stuff. A big part of a good course in probability is the full collection of classic limit theorems -- astounding, powerful stuff in there. Wait until you discover martingales -- totally mind blowing that any such powerful things could be true, but they are!
Final lesson: It's also possible to take from the article, and from much of introductory statistics, an implicit lesson that is wrong and even dangerous: that, given some data, right away, ASAP, do not pass GO, do not collect $200, you should rush to find the probability distribution. Well, if you can find the distribution via something like the Poisson-process argument outlined above, terrific. But usually you can't do that. Instead you just have the data, just the darned data. Maybe even big data. Then, sure, you can get a histogram and look at it. Okay, no harm done so far. But then, maybe, from that implicit but dangerous lesson, you feel an urge, a need, a compulsion, a strong drive to find the distribution of that data, go through some huge list of increasingly obscure well-known probability distributions looking for a fit, etc. Mostly, don't do that.
Yes, there is a probability distribution, but usually in practice, especially when you are given data without any additional information that would let you conclude something like Poisson above, you don't have much chance, beyond just that histogram, of finding or approximating the distribution in any way that stands to be useful. Mostly, just get the histogram and stop there.
Next, all the above holds for one-dimensional data, that is, single numbers. But if your data comes in pairs of numbers, say, points in a plane, or, for some positive integer n, in n-tuples, then your prospects of finding the distribution are much, much worse. Indeed, even getting a histogram is much less promising. For n > 2, histograms are already tough to see or work with.
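A quick way to see why histograms degrade with dimension is to count occupied cells (a toy sketch, all numbers arbitrary): with 10 bins per axis, the number of cells grows as 10^n, so a fixed sample spreads thinner and thinner and most cells end up empty.

```python
import random

random.seed(1)
n_points = 10_000
bins_per_axis = 10

for n_dims in (1, 2, 3, 5):
    cells = bins_per_axis ** n_dims
    # Bin each point by its integer cell coordinates; count distinct occupied cells.
    occupied = {
        tuple(int(random.random() * bins_per_axis) for _ in range(n_dims))
        for _ in range(n_points)
    }
    print(f"n={n_dims}: {cells:>7} cells, {len(occupied):>5} occupied "
          f"({100 * len(occupied) / cells:.1f}% nonempty)")
```

At n = 1 every bin is full; by n = 5 a sample of 10,000 points can touch at most a tenth of the 100,000 cells, so the "histogram" is nearly all zeros.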
But, fear not: The field of applied probability and statistics is just awash in techniques where you don't need anything like precise data on distributions!
Succinct version of this lesson: Yes, the probability distribution exists, but commonly you can't really find it and commonly you don't need to find it.
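One example of such a distribution-free technique is the percentile bootstrap: a confidence interval for the mean built only by resampling the data itself, with no named distribution anywhere (a minimal sketch; the data here is synthetic and deliberately messy).

```python
import random

random.seed(0)

# Synthetic data with a messy, unnamed distribution -- we never try to fit it.
data = [random.lognormvariate(0.0, 1.0) + random.choice([0.0, 0.0, 5.0])
        for _ in range(500)]

def bootstrap_ci(xs, stat, n_resamples=2000, alpha=0.05):
    """Percentile bootstrap: resample with replacement; no distribution assumed."""
    stats = sorted(stat(random.choices(xs, k=len(xs)))
                   for _ in range(n_resamples))
    return (stats[int(n_resamples * alpha / 2)],
            stats[int(n_resamples * (1 - alpha / 2))])

sample_mean = sum(data) / len(data)
lo, hi = bootstrap_ci(data, lambda xs: sum(xs) / len(xs))
print(f"sample mean {sample_mean:.2f}, 95% bootstrap CI ({lo:.2f}, {hi:.2f})")
```

The same `bootstrap_ci` works for the median, a quantile, or any other statistic -- still without ever naming the distribution.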
Wikipedia also has a chart: https://en.wikipedia.org/wiki/Relationships_among_probabilit...