I did 3 years of PhD work in computer vision, then dropped out of the PhD program to work in finance, eventually found my way back to a career in deep learning for image processing and NLP and some other smaller stats problems in causal inference.
My undergrad degree was a very advanced pure math curriculum as well, so I had already done multiple years worth of linear algebra, measure theory and measure theoretic probability theory even before the PhD work.
When it comes to being effective at applying machine learning or statistics in a company that builds products, there is absolutely no value whatsoever, emphatically none, not even in terms of mathematical thinking, formalism, or the ability to grok research publications, associated with measure theory, measure-theoretic probability, formal derivations of common ML algorithms or optimization problems, theoretical convergence results, etc. None.
The single most critical thing you need is skepticism: the default assumption that your algorithm is not working until proven otherwise. After that, you need a strong understanding of all the complex failure cases that are possible, which includes many things that business people will not think of, from multicollinearity to mode collapse to unsound reasoning from p-values to overfitting to mishandled missing data, and so on.
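To make one of those failure cases concrete, here is a hypothetical toy sketch (not anyone's production code) of why multicollinearity bites: when two features are near-duplicates, individual coefficient estimates become wildly unstable across resamples even though predictions look fine, so any "which feature matters?" story told from them is unsound.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_coefs(correlated: bool, n: int = 200) -> np.ndarray:
    """Simulate one dataset and return the OLS coefficients."""
    x1 = rng.normal(size=n)
    # Either an independent second feature, or a near-copy of x1.
    x2 = x1 + rng.normal(scale=0.01, size=n) if correlated else rng.normal(size=n)
    y = x1 + x2 + rng.normal(scale=0.5, size=n)
    X = np.column_stack([x1, x2])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

# Spread of the first coefficient over repeated simulated datasets.
spread_ok = np.std([fit_coefs(False)[0] for _ in range(200)])
spread_bad = np.std([fit_coefs(True)[0] for _ in range(200)])
# The collinear design yields coefficient estimates that are orders of
# magnitude noisier, even though both models predict y about equally well.
```

The loss value alone gives no hint of this; only knowing the failure mode does.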
If you can grok basic linear algebra and algorithms, can assemble modern machine learning library components efficiently, and have good judgment about statistical fallacies and unsound statistical reasoning, then it does not matter what other credentials you have, period.
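As a hedged sketch of what "assembling library components efficiently" means in practice (assuming scikit-learn here, which is my own illustrative choice, not something named above): the skill is in wiring the pieces together soundly, e.g. putting preprocessing inside the pipeline so cross-validation cannot leak test-fold statistics into training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
# Synthetic label driven by the first two features plus noise.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Scaling lives inside the pipeline, so each CV fold fits the scaler on
# its own training split only -- a classic subtle leakage failure avoided.
model = make_pipeline(StandardScaler(), LogisticRegression())
mean_acc = cross_val_score(model, X, y, cv=5).mean()
```

None of this requires a derivation; it requires knowing where the traps are.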
In fact, I have worked with very decorated PhD-level ML researchers whose programming skills were so poor that it was nearly impossible to incorporate their work into actual products. I've also worked with decorated PhD-level ML researchers who did not understand basic things about general statistics outside the scope of loss-function optimization, such as MCMC sampling, or cases where reasoning about a model's goodness of fit needs to holistically consider residual analysis, outlier analysis, posterior predictive checks, and plausible effect sizes from literature reviews. They argued and argued that purely optimizing log-loss (with appropriate controls for overfitting) should always produce the best model, which is just naive.
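A minimal illustration of why the "just optimize the loss" view is naive (a made-up example, assuming numpy): a model can look acceptable by its training objective while residual analysis reveals it is badly mis-specified. Here a straight line is fit to quadratic data; the leftover structure lives in the residuals, where no amount of staring at the loss value would show it.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 400)
y = x**2 + rng.normal(scale=0.3, size=x.size)  # true signal is quadratic

# Least-squares linear fit: the pure "optimize the loss" view of modeling.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# A goodness-of-fit check the loss-only view skips: residuals should look
# like unstructured noise. Correlating them against a simple transform of
# the input exposes the quadratic signal the linear model missed entirely.
structure = np.corrcoef(x**2, residuals)[0, 1]
```

The fitted loss is the best achievable for a line, yet the model is still wrong, and only the residual check says so.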
The people saying these things had PhDs in top programs, many publications and conference presentations, and usually considerable software engineering skills.
Credentials in ML really don't mean anything. What matters is work experience and knowing how to pragmatically analyze statistical problems in the service of product development, and academic training is just not an important part of that.