> Imagine a process where the only criteria are technical soundness and novelty, and as long as minimal standards are met, it's a "go". Call it the "ArXiv + quality check" model.
One possible issue is that researchers usually need to justify their research to somebody who's not in their field. Conferences are one way to do this. So are citation counts. Both are highly imperfect, but outsiders typically want some signal that doesn't require being an expert in a person's chosen field. The "Arxiv + quality check" model doesn't seem to provide this.
> I suspect the bigger problem that CS has is a large percentage of poor-quality work that couldn't be replicated.
As a sort of ML researcher for several years, I agree.
As a fellow ML researcher, I want to add that the lack of code accompanying publications makes the problem worse. $BIGGROUP gets a paper published whose core contribution is a library, and yet six months after the conference they still haven't released the code, effectively claiming credit for something unverifiable.
I guess this may differ depending on your specific field, but in NLP it has really changed for the better over the last few years.
I don't have data, but from subjective experience, 5-6 years ago most papers at major NLP conferences didn't have an associated code repository. Now the overwhelming majority do.
There are still many other problems; a big one, for example, is the reporting of spurious improvements that can vanish if you get a less lucky random seed. But at least including code is now common practice.
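To make the seed issue concrete, here is a minimal sketch of reporting results over several seeds instead of a single run; `train_and_eval` is a hypothetical stand-in for a full training pipeline, not anyone's actual setup:

```python
# Minimal sketch: train_and_eval() is a hypothetical stand-in for a full
# training + evaluation run; replace it with your actual pipeline.
import random
import statistics

def train_and_eval(seed: int) -> float:
    """Pretend dev-set F1 with some seed-dependent noise."""
    rng = random.Random(seed)
    return 0.80 + rng.gauss(0, 0.01)

seeds = [0, 1, 2, 3, 4]
scores = [train_and_eval(s) for s in seeds]
print(f"F1 over {len(seeds)} seeds: "
      f"{statistics.mean(scores):.3f} +/- {statistics.stdev(scores):.3f}")
# If a claimed improvement over the baseline is smaller than this spread,
# it may just be a lucky seed.
```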
Back when I did a stint at something NLP-ish for my master's, one of the problems seemed to be that, apart from the lack of code, the data was also often non-public and specific to the study. That made it impossible to compare different algorithms even on the basis of the results reported in the publications themselves, because the testing methodology was all over the place and the datasets used to evaluate the various algorithms might all have been different. You couldn't really make much of the reported results even if you believed the authors reported honestly and had their methodology more or less straight.
I suppose the situation regarding common datasets might vary between subfields and NLP tasks, so maybe I just saw a weird corner of it.
Of course the code was also nowhere to be seen.
Availability of code would of course be even more important, both for replicability and for general verifiability, and also because it would let you run the comparison on any number of datasets yourself.
Glad to hear that code availability has been improving.
> There are still many other problems; a big one, for example, is the reporting of spurious improvements that can vanish if you get a less lucky random seed.
Considering that a lot of NLP is at least somewhat based on machine learning, don't people do cross-validation or something?
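Something like this is what I mean; a toy scikit-learn sketch, where the texts, labels, and classifier are made-up placeholders rather than anything from a real paper:

```python
# Toy k-fold cross-validation sketch with scikit-learn; data and model
# are placeholders, only the pattern matters.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = ["great movie", "terrible film", "loved it", "hated it",
         "not bad at all", "worst plot ever", "brilliant acting", "dull and boring"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(clf, texts, labels, cv=4)  # 4-fold stratified CV
print(f"accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```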
> You do a paper showing that problem X can be solved slightly better by downloading and training on a billion tweets.
That's true. Sometimes you might try to tweak the algorithm itself rather than the data, though, or experiment with different kinds of preprocessing, and in those cases it would be helpful to be able to run the experiments on shared datasets.
My limited experience is from around the time deep learning was only about to become a big thing, so things may well be different now. Maybe nowadays you just throw more tweets and GPUs at the problem.