Right. A shocking amount of public data is distributed this way.
Also, we routinely talk to developers who complain about the difficulty of consuming data snapshots from partners: parsing them, trying to understand how they have changed since last time, etc.
With high-value datasets, people frequently build an API to combat these problems. But it's hard to design a good API, and even if you succeed, it has to be secured, documented, scaled, and maintained indefinitely.
It'd be entirely up to the distributor of the data, so perhaps the answer to your question is "all of the above".
For example, (1) our command-line tools use URL-like paths, which implies "use this hostname" (to copy-paste into a terminal), and (2) we have some in-browser visualisations like http://splore.noms.io/?db=http://demo.noms.io/cli-tour, which imply more of a "click here" type UI.
Hmm, if you want people to be able to link to Noms datasets on the web, maybe you should switch to using URLs to name the datasets, instead of a two-part identifier with a URL separated from a dataset name by a "::"? Darcs and Git seem to get by more or less with URLs and relative URLs; do you think that could work for Noms too?
The super REST harmonious way to do this would be to define a new media-type for Noms databases with a smallish document that links to the component parts. Like torrent files, but using URLs (maybe relative URLs) instead of SHA1 hashes for the components, maybe?
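Concretely, I'm imagining something roughly like the sketch below; every field name in it is invented, it's just to show the torrent-like shape with relative URLs instead of content hashes.

    import json

    # Purely illustrative "descriptor" document for a database: a hypothetical
    # media type plus relative URLs to the pieces, instead of SHA1 hashes.
    descriptor = {
        "format": "application/x-example-noms-db",  # made-up media type
        "root": "chunks/root",                      # relative URL to the root chunk
        "datasets": {
            "cli-tour": "datasets/cli-tour",        # relative URLs to named datasets
        },
    }

    print(json.dumps(descriptor, indent=2))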
This is a good point. We never thought of these strings as URLs, but there are places where it would be nice to use them that only want URLs (the href attribute, for example).
The way we have it now is nice in that any valid URL can be used to locate a database. I am loath to restrict that.
The hash portion of a URL (the fragment) is not transmitted to the server by browsers, so it wouldn't help for the cases of pasting the string into a URL bar or putting it in a hyperlink.
If the resource you're linking to is a database (or, to speak more strictly, if its only representation is a resource of a noms-database media type), rather than an HTML page or something, can't the browser be configured to pass it off to a Noms implementation, complete with the dataset identifier within? I mean, that's what people do with page numbers in PDF files, right?
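For what it's worth, the fragment is still visible to whatever client-side tool ends up handling the URL, even though the server never sees it. A minimal sketch of that idea in Python, with an invented dataset name and no claim that Noms works this way today:

    from urllib.parse import urlsplit

    # Hypothetical scheme: the dataset name rides in the URL fragment.
    # Browsers don't send the fragment to the server, but a client-side
    # handler receives the whole string and can split it itself.
    url = "http://demo.noms.io/cli-tour#some-dataset"  # dataset name is made up

    parts = urlsplit(url)
    database = parts._replace(fragment="").geturl()  # http://demo.noms.io/cli-tour
    dataset = parts.fragment                         # some-dataset

    print(database, dataset)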
Not only public data. On my main project I'm testing systems that mostly crunch data from various sources, and yes, most of it arrives in CSV format; we process it only slightly (some filtering, aggregation, translation) and spit other CSVs out. I was amazed that the company had not bothered to create a more... civilized (?) solution for internal data processing - but I guess that since it works, there's no drive to change it on a whim.
CSV for shoving files around (or TSV or whatever similar thing) is great because it generally just works. I can throw it into virtually any language or system, open and read it myself, grep it, check it with any platform. I can often get away with just looking at the files and absolutely nothing else, though a data dictionary is hugely appreciated.
I don't need to make sure I've got Postgres 9.5 set up with a particular user account & configs set for the password, or start ES (but not version Y, because of a feature change) on port Z, etc. I don't need to make sure the two branches I'm looking at don't overlap or try to write to the same database. Keeping multiple results and comparing their output is easy, as they're just files to be moved. Small tasks that read a file and spit out another can be checkpointed just by having them check whether the file they expect to create already exists.
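To make that checkpointing trick concrete, here's roughly the pattern; the file names and the filtering step are made up for illustration.

    import os

    # Checkpoint by file existence: if the output is already there, the step
    # was done on a previous run, so skip it.
    def filter_rows(src: str, dst: str) -> None:
        if os.path.exists(dst):
            return
        tmp = dst + ".tmp"
        with open(src) as fin, open(tmp, "w") as fout:
            for line in fin:
                if not line.startswith("#"):  # whatever the real filtering is
                    fout.write(line)
        os.rename(tmp, dst)  # only "commit" the checkpoint once it's complete

    filter_rows("input.csv", "filtered.csv")

Writing to a temp file and renaming at the end means a crash mid-way doesn't leave a half-written file that looks like a finished checkpoint.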
I'm hugely in favour of CSV for external data too. Sure, provide other options as well, but I love that the "get all the data" command can be as simple as a curl command. I don't want to read your API docs and build something custom that tries to grab everything, I don't want to iterate over 2M pages, I don't want to deal with timeouts, rate limits, etc. Just give me a URL with a compressed CSV file.
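And consuming it really is a few lines in basically any language. A rough Python equivalent of the curl-and-gunzip version, with a placeholder URL:

    import csv
    import gzip
    import io
    import urllib.request

    # Placeholder URL; the point is just that "fetch, decompress, parse"
    # is the whole pipeline.
    url = "https://example.org/exports/full-dump.csv.gz"

    with urllib.request.urlopen(url) as resp:
        data = io.BytesIO(resp.read())

    with gzip.open(data, mode="rt", newline="") as f:
        for row in csv.reader(f):
            print(row)   # header row; from here it's ordinary CSV handling
            break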
All the problems that come along with it, for me, are related to poor data management, which I doubt a format change would fix.
Maybe CSV isn't the best internally, but in a vast number of cases it's nearly the best, and it gives you a lot of flexibility. My general advice would be to start with CSV unless you've got a good reason not to, and then try to move to a different line-based thing (jsonl, messagepack?). It is highly unlikely to be the biggest problem you have with your data, and the time spent putting it into a more "sane" format is often (in my experience) better spent on QA and analysis of the data itself.
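Part of why that later move is cheap is that both are one-record-per-line, so the surrounding tooling barely changes. A throwaway sketch with made-up rows:

    import csv
    import json

    rows = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]  # made-up data

    # Same records as CSV...
    with open("out.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "name"])
        writer.writeheader()
        writer.writerows(rows)

    # ...and as JSON Lines: one JSON object per line.
    with open("out.jsonl", "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")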
I'd say the current problem is that lots of data is available only as Excel files, PDFs, or APIs pointing to a possibly constantly changing data store.
For example, if you browse the UC Irvine ML datasets
https://archive.ics.uci.edu/ml/index.html
you'll find that many are in CSV format.
If you do a search on data.gov
http://catalog.data.gov/dataset#sec-res_format
you'll see that CSV is about as popular as JSON.
Also, the World Health Organization publishes data this way:
http://www.who.int/tb/country/data/download/en/
Also, many of the datasets at Kaggle are in CSV format.
https://www.kaggle.com/datasets
And this isn't that surprising: it's human readable, it gets the job done, and zipping gives decent compression.
I'm not sure who you think the target market for this would be, but I'm sure that if it's an efficient local format, you could probably get the ML crowd on board.