Peroni writes in his top-level comment, "Give the candidate a realistic technical challenge to complete in a realistic timeframe, in an environment that will be indicative of the environment they would expect to work in should they succeed in getting the job." And of course that is saying "Give the candidate a work-sample test." That's a very well-validated hiring procedure,[1] one that every company ought to use for essentially every job. In most parts of the world, you can add incremental validity to the hiring process by also testing the job candidate's general cognitive ability (a.k.a. IQ). In the United States, you have to take careful legal steps to be able to add a general cognitive ability test to your hiring process, but you would have to take the SAME legal steps to make a diploma or degree a requirement for hiring (a little-known fact about the key Supreme Court case on the issue). Anything else you do in hiring has less impact on gaining successful workers than work-sample tests and general cognitive ability tests.
The challenge here, in my experience on both sides of this, is that a developer's deep domain knowledge of a large or large-ish app is essential to a 'work-sample' environment. And that's virtually impossible to duplicate even in the longest interview time frame.
Making decisions about managing technical debt, adding architecturally-significant changes, balancing good OOP with responsiveness, knowing the difference between future-proofing and conscientious coding -- all of those are both crucial (in many cases the most crucial) to day-to-day work, and also so highly context-specific that those decision-making traits are nearly impossible to identify during a technical exercise.
So for coding challenges, that leaves short-term tactical/analytic/algorithmic exercises, which in (anecdotally) 95% of cases cannot begin to approach a 'work-sample' environment. I've yet to encounter a technical challenge that would tell me much more about a candidate than basically how fluent they are with their tools, how well they know syntax and some general design principles, and what, for instance, their TDD (or lack thereof) workflow is like. Probably some insight into line-level analytic and algorithmic ability.
All of that is helpful, but -- Trust Me Here!! -- can also be very deceiving. The same coders that can knock those challenges out of the park can also be highly-proficient Debt Machines, all the more destructive because of their special genius for cranking out architecturally suspect code at a breathtaking rate.
To get into a real 'work-context' flow of a large app requires weeks, sometimes months, and only then can you get full perspective on how a given coder is going to contribute to your team on an ongoing basis. To get a feel for what that will look like in an interview, I've found I have to pretty much rely on the candidate's past projects, and informal conversations around larger architectural and OO principles.
I think your point is well made that what you can sample in a work-sample test is not the full set of long-term skills that benefit a for-profit company. That's why the predictive validity of work-sample tests is only about .50 across a wide range of industries. But the key point is that EVERY other hiring procedure, except for general cognitive ability tests, has lower validity, so a company is throwing away a lot of opportunity to hire good workers if it doesn't use a combination of work-sample testing and cognitive ability testing for all of its hiring. Your sound analysis can be turned around to using interviews as a hiring procedure--which is much more commonplace than using work-sample testing as a hiring procedure--to make the correct point that an applicant who looks good in an interview may not be a "team player" once hired. Any hiring procedure is a sample of applicant behavior, not fully representative of how the worker will behave on the job after being hired. But work-sample tests get much closer to what the worker will do on the job long-term than any other procedure besides general cognitive ability tests. Because work-sample tests and cognitive ability tests each have incremental validity when added to the other, it's best to use both in combination to get a hiring procedure with somewhat more than .50 validity in finding good workers.
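To make the "incremental validity" arithmetic concrete, here is a rough sketch using the standard formula for the multiple correlation of a criterion with two predictors. Only the .50 work-sample figure comes from the comment above; the cognitive-ability validity and the correlation between the two tests are made-up illustrative values, not results from any study, and `combined_validity` is just a hypothetical helper name.

```python
import math


def combined_validity(r1: float, r2: float, r12: float) -> float:
    """Multiple correlation R of job performance with two predictors.

    r1, r2: each predictor's own validity (correlation with performance).
    r12:    correlation between the two predictors (must not be +/-1).
    """
    r_squared = (r1**2 + r2**2 - 2 * r1 * r2 * r12) / (1 - r12**2)
    return math.sqrt(r_squared)


work_sample = 0.50  # figure quoted in the comment above
cognitive = 0.50    # ASSUMED for illustration only
overlap = 0.40      # ASSUMED correlation between the two tests

# Validity of the two tests used together, under the assumptions above.
print(round(combined_validity(work_sample, cognitive, overlap), 2))

# The less the two tests overlap, the more the second one adds.
print(round(combined_validity(work_sample, cognitive, 0.0), 2))
```

Under those assumed inputs the combination comes out above .50, which is the sense in which each test adds validity to the other. The gain shrinks as the two tests become more correlated with each other.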
That's a very well-researched comment! There are a few things probably worth noting. Roth, Bobko and McFarland have been pretty active in this topic for the past decade. They've found the validity coefficient cited by Schmidt and Hunter in 1998 is likely an over-estimate due to relying on research conducted when there were less rigorous statistical and methodological best-practices.
The validity coefficient provided by Roth and Bobko is likely more accurate. That isn't to diminish work-sample tests, which are still valuable, but they aren't the cure-all we'd like them to be. They do continue to show promise in reduced adverse impact though, which is great (note: the full article is behind a paywall - what is the HN-approved method of sharing the information?):
That is for gender. With regard to ethnicity, the evidence isn't quite as optimistic yet. Like other predictors, including cognitive ability tests, if they are showing notable adverse impact you may be in trouble despite their validity.
It's a problem with lots of predictors, though the scope of the problem varies. There is work being done all the time, even in the most reliably stalwart predictor, the cognitive ability test:
Anyone interested in the great "diversity-validity dilemma" can check out this link for more information, though there's always progress. It's a great article.
For my money I endorse integrity tests as a part of the solution. Decent validity, including incremental validity over cognitive ability due to a low correlation between the two, and small sub-group differences.
Having said all that, I imagine the efficacy of work samples is moderated by the type of work, and I'd have to believe they are more amenable to demonstrations of technical skill like coding (I don't know of any references for this now, but I'll look later). Coding-related jobs would be nice because it would be possible to blindly judge on the output as well, and in programming-type jobs it would be much easier and more cost-efficient to test large numbers of applicants than it would for many other jobs. Cost and ease of large-scale administration are the big problems with work samples, so overcoming those would be gravy. I don't know how subgroup differences are impacted though.
[1] https://news.ycombinator.com/item?id=5227923 (this earlier Hacker News comment gives full references for the statements in this comment)