
CSV for shoving files around (or TSV or whatever similar thing) is great because it generally just works. I can throw it into virtually any language or system, open and read it myself, grep it, check it with any platform. I can often get away with just looking at the files and absolutely nothing else, though a data dictionary is hugely appreciated.

I don't need to make sure I've got Postgres 9.5 set up with a particular user account and password config, start ES (but not version Y because of a feature change) on port Z, etc. I don't need to make sure the two branches I'm looking at don't overlap or try to write to the same database. Keeping multiple results and comparing their output is easy, as they're just files to be moved. Small tasks that read a file and spit out another can be checkpointed just by having them check whether the file they expect to create already exists.
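To make the checkpointing idea concrete, here's a rough sketch in Python. The names run_step and transform are made up for illustration, and the write-to-a-temp-file-then-rename step is my own addition so a crashed run doesn't leave a half-written file that looks like a finished checkpoint.

```python
import csv
import os
import tempfile


def run_step(src_path, dst_path, transform):
    """Run one file-to-file step, skipping it if the output already exists.

    run_step and transform are hypothetical names for this sketch.
    Returns True if the step ran, False if the checkpoint was already there.
    """
    if os.path.exists(dst_path):
        return False  # checkpoint hit: nothing to do

    # Write to a temp file in the same directory, then rename it into place,
    # so an interrupted run never leaves a partial file at dst_path.
    dst_dir = os.path.dirname(os.path.abspath(dst_path))
    fd, tmp_path = tempfile.mkstemp(dir=dst_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as out, open(src_path, newline="") as src:
            writer = csv.writer(out)
            for row in csv.reader(src):
                writer.writerow(transform(row))
        os.replace(tmp_path, dst_path)
    except BaseException:
        os.remove(tmp_path)
        raise
    return True
```

Re-running the whole pipeline then only redoes the steps whose output files are missing, and deleting a file is how you force a step to run again.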

I'm hugely in favour of CSV for external data too. Sure, provide other options as well, but I love that the "get all the data" command can be as simple as a curl command. I don't want to read your API docs and build something custom that tries to grab everything, I don't want to iterate over 2M pages, I don't want to deal with timeouts, rate limits, etc. Just give me a URL with a compressed CSV file.
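For what it's worth, consuming that kind of download takes very little code. A minimal sketch, assuming the URL serves a gzip-compressed, UTF-8 CSV with a header row; iter_gzip_csv and fetch_rows are names I made up, and any URL you pass in is your own:

```python
import csv
import gzip
import urllib.request


def iter_gzip_csv(binary_stream, encoding="utf-8"):
    """Yield each row as a dict from a gzip-compressed CSV byte stream."""
    # gzip.open accepts an already-open binary file object as well as a path.
    with gzip.open(binary_stream, mode="rt", encoding=encoding, newline="") as text:
        yield from csv.DictReader(text)


def fetch_rows(url):
    """Stream rows from a URL assumed to serve a .csv.gz file."""
    with urllib.request.urlopen(url) as resp:
        yield from iter_gzip_csv(resp)
```

Rows are decompressed and parsed as they arrive, so nothing here needs the whole file in memory, and there's no pagination or rate limiting to think about.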

All the problems that come along with it, for me, are related to poor data management which I doubt a format change would fix.

Maybe CSV isn't the best internally, but for a vast number of cases it's nearly the best and gives you a lot of flexibility. My general advice would be to start with CSV unless you've got a good reason not to, and only then try to move to a different line-based thing (JSONL, MessagePack?). It is highly unlikely to be the biggest problem you have with your data, and the time spent putting it into a more "sane" format is often (in my experience) better spent on QA and analysis of the data itself.
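If you do end up moving to JSONL later, the conversion is small. A sketch, assuming the CSV has a header row; csv_to_jsonl is a made-up name, and every value stays a string because CSV carries no type information:

```python
import csv
import json


def csv_to_jsonl(src, dst):
    """Copy rows from a CSV text stream to a JSON Lines text stream.

    src and dst are open text file objects; returns the number of rows written.
    """
    count = 0
    for row in csv.DictReader(src):
        # One JSON object per line; values remain strings, as in the CSV.
        dst.write(json.dumps(row, ensure_ascii=False) + "\n")
        count += 1
    return count
```

The output is still one record per line, so grep, head and friends keep working on it.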

I'd say the current problem is that lots of data is available only in Excel files, PDFs, or APIs pointing to a possibly constantly changing data store.


