Totally agreed on the snake oil. But you can't deny there are definitely strong signals. For example, NNs can detect the sexual orientation of males with 91% accuracy from only five images.
Also, not saying 91% is good enough for any government or business to commit to.
Have you read the actual paper? That figure is at best inflated, and their methodology is riddled with confounds, including but not limited to the role of social networking profile pictures as dating signals, intentional signaling such as grooming and makeup, and factors like the socio-economic status and locale of the people being classified, which can be more predictive of declared sexual orientation than anything resulting from physiological features.
Oh right, it's probably worth noting that since there are considerable reasons in many parts of the world to hide one's sexual orientation, and this study's design only conditions on sexual orientation reported on social media, the results are intrinsically skewed just by being drawn from a population of out gay people.
Also, bear in mind that if we take Wikipedia's reported rate of homosexuality in the general human population, for which 9% would be... pretty generous (the "Demographics of sexual orientation" article lists several statistics and I can't find a world aggregate, but e.g. San Francisco is 15%), a null classifier that always guesses "straight" would be just as accurate. If the population rate were lower, it would be even more accurate.
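To make that concrete, a minimal sketch (the prevalence figures below are assumptions for illustration, not numbers from the paper):

    # Accuracy of a null classifier that always guesses "straight",
    # at an assumed base rate of gay men in the scored population.
    base_rate = 0.09                                          # assumed prevalence (illustrative)
    null_accuracy = 1 - base_rate
    print(f"always-straight accuracy: {null_accuracy:.0%}")   # 91%
    print(f"at an assumed 5% prevalence: {1 - 0.05:.0%}")     # 95%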
However, among the 100 of 585 individuals with the highest probability of being gay according to the classifier, 47 were gay. In other words, the classifier provided for a nearly seven-fold improvement in precision over a random draw (47/7 = 6.71). The precision could be further increased by narrowing the targeted subsample. Among 30 males with the highest probability of being gay, 23 were gay, an eleven-fold improvement in precision over a random draw (23/2.1 = 11). Finally, among the top 10 individuals with the highest probability of being gay, 9 were indeed gay: a thirteen-fold improvement in precision over a random draw.
Yes, and this further demonstrates how ridiculous the reporting on this result was. Their sample population was tiny and skewed, and this paragraph is a great example of how you can make your results look better to lay readers by reducing N (which in statistical terms should reduce your confidence in the result, because the likelihood of spurious accuracy from random factors increases) and then choosing whatever metric sounds best (here, by only talking about precision, i.e. avoiding false positives, with no mention of the false negatives).
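To put some numbers on the small-N problem, here's a rough sketch using a Wilson score interval for the top-k precisions quoted above (the helper is my own, not anything from the paper):

    import math

    def wilson_ci(successes, n, z=1.96):
        """Approximate 95% Wilson score interval for a binomial proportion."""
        p = successes / n
        denom = 1 + z**2 / n
        center = (p + z**2 / (2 * n)) / denom
        half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
        return center - half, center + half

    # Top-k precisions reported in the quoted paragraph:
    for k, hits in [(100, 47), (30, 23), (10, 9)]:
        lo, hi = wilson_ci(hits, k)
        print(f"top {k:3d}: {hits}/{k} = {hits/k:.0%}, 95% CI ~ [{lo:.0%}, {hi:.0%}]")

The headline 9-out-of-10 figure comes with an interval spanning roughly 60% to 98%, i.e. the number gets less trustworthy exactly as it gets more impressive.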
But I do have to give you some credit for providing a case study in why scientific literacy is super hard, and doubly so in a context where researchers are strongly incentivized to make their results sound as convincingly meaningful as possible.
How do you explain the accuracy going up with more images per person, then?
Also, it wasn't all 90/10. It was all pairwise:
Among men, the classification accuracy equaled AUC = .81 when provided with one image per person. This means that in 81% of randomly selected pairs—composed of one gay and one heterosexual man—gay men were correctly ranked as more likely to be gay. The accuracy grew significantly with the number of images available per person, reaching 91% for five images. The accuracy was somewhat lower for women, ranging from 71% (one image) to 83% (five images per person).
Why do you think that defining accuracy in relative terms works in favor of this model? This pairwise relative measure should give you less confidence that the model generalizes, because now we don't even know the model's confidence margins within those pairs, just that they were ordered correctly. This further supports my claim that the way they're measuring results is designed to make them appear more significant than is justified.
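To spell out what that pairwise figure does and doesn't measure, a toy sketch (the scores below are made up):

    # AUC as pairwise ranking: the fraction of (gay, straight) pairs in which
    # the individual labeled gay received the higher score. Scores are invented.
    gay_scores      = [0.62, 0.55, 0.51, 0.48]
    straight_scores = [0.60, 0.47, 0.44, 0.40, 0.35]

    wins = sum(g > s for g in gay_scores for s in straight_scores)
    ties = sum(g == s for g in gay_scores for s in straight_scores)
    auc = (wins + 0.5 * ties) / (len(gay_scores) * len(straight_scores))
    print(f"pairwise AUC = {auc:.2f}")

    # Note what this says nothing about: the margin between the two scores in a
    # pair, or where a usable decision threshold would sit.

A model can order every pair correctly by a hair and still be useless the moment you need an absolute yes/no.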
Explaining the model becoming more "accurate" by this measure is pretty easy. The model is working with an extremely small and skewed dataset for this sort of thing, and has overfit to tendencies in that dataset. Given the kinds of numbers we're working with and that measure, a jump from 81% to 91% "accuracy" does not seem particularly significant, especially given that, again, the classifier fails to meet even the baseline accuracy needed under a more realistic measurement to beat a null hypothesis, and that baseline would probably need to be even higher to reflect the lower statistical power of this standard of accuracy.
In any real-world application, this classifier would need to make a judgement in situ based on some threshold of confidence. From that perspective, this metric is worse than useless: not only does it fail to demonstrate that the result is even as significant as the (again, not meeting the base rate) thresholds described in the summary, this methodological smoke and mirrors has seemingly convinced you after reading it more thoroughly. I imagine this is similar to the process by which these systems are sold to investors.
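As a rough illustration of the threshold problem (the sensitivity, specificity and prevalence below are assumptions, not figures from the paper):

    # Positive predictive value at a low base rate: even with generous assumed
    # error rates, most of the people flagged as gay would be straight.
    sensitivity = 0.90   # assumed true positive rate
    specificity = 0.90   # assumed true negative rate
    prevalence  = 0.07   # assumed base rate

    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    ppv = tp / (tp + fp)
    print(f"P(gay | flagged) = {ppv:.0%}")   # roughly 40%: most flags are wrong

That's the kind of arithmetic a threshold-based deployment would actually face, and it never appears in the summary.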
However, there's still a valid chance that the classifier keys on attributes presented in profile images rather than on facial features. (E.g., we're pretty good at inferring social status or role from medieval portraits based on such attributes without knowing anything about phrenology or physiognomy.)
https://www.semanticscholar.org/paper/Deep-Neural-Networks-C...