
Like a SQLite DB? Actually, why don't we just transfer stuff as SQLite DBs. Single file, built-in schema, you can index.

I mean, HDF is super-general and stuff, but it looks like SQLite would solve all the trouble with CSVs.



I'm currently implementing an "individual-scale data-warehouse" service (i.e. "Hadoop without the Hadoop part"), and I'm weighing the choices of "a tarball of CSVs", an SQLite file, and an Apache Avro file as input-side wire formats. (And now HDF5 as well; didn't know about that one.)

I'm still leaning toward "a tarball of CSVs", though:

1. it's very easy to let different devs write a bunch of single-purpose Extract tools, each in whatever language is best for the job (e.g. Python, if the only available API-client library for some external source is written in Python), to scrape some particular dimension out of that source. You can write out CSV data in pretty much any language, even a bash script! Even if the language doesn't have a CSV library, a "trivial" CSV dump can be accomplished by just calling printf(3) with a CSV template string. (Trivial = you know all your stringly-typed data is of constrained formats, such that it doesn't require any quoting. CSV files are trivial more often than you'd think!)
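As a sketch of what "trivial" means here, a bash Extract tool can emit CSV with nothing but printf (the emit_row helper is hypothetical; it's only safe when fields are known to contain no commas, quotes, or newlines, so no quoting is needed):

```shell
# Hypothetical helper: emit one CSV row from positional args.
# Only safe for "trivial" CSV: fields must not contain
# commas, double quotes, or newlines.
emit_row() {
  printf '%s,%s,%s\n' "$1" "$2" "$3"
}

emit_row 42 widget 19.99
emit_row 43 gadget 5.00
```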

2. Presuming your CSV file is column-ordered to have any primary key(s) first, and that it has no embedded header line, you can "reduce" the output of a bunch of Extract jobs (i.e. merge-sort + dedupe them into one CSV dataset) just by feeding all the input files to sort(1) with the -u and -n switches. `sort -u -n -t ','` basically behaves as a very simple streaming CSV parser, while also being amazingly well-optimized at chewing through on-disk files in parallel. sort(1) is to (local, on-disk) CSV data what LevelDB's compaction algorithm is to key-value pair data: a solid primitive that scales with your dataset size.
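Concretely, the reduce step over two Extract outputs might look like this (file names are hypothetical; note that under GNU sort, -u deduplicates on the comparison key, which with -n is the leading numeric field):

```shell
# Two hypothetical Extract outputs: primary key first, no header line.
printf '2,beta\n1,alpha\n' > extract_a.csv
printf '3,gamma\n1,alpha\n' > extract_b.csv

# Merge-sort + dedupe into one snapshot. With -n, GNU sort compares
# (and, combined with -u, dedupes) on the leading numeric key.
sort -u -n -t ',' extract_a.csv extract_b.csv > snapshot.csv

cat snapshot.csv
```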

3. Once you've got two sorted+deduped CSV "snapshot" files (sorted in the same collation order, which comm(1) requires of its inputs), you can create a differential snapshot from them just by calling:

    comm -1 -3 "$old_csv" "$new_csv" > diff.csv
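A worked run of that diff step, with hypothetical old/new snapshots:

```shell
# Hypothetical snapshots, already sorted + deduped.
printf '1,alpha\n2,beta\n' > old.csv
printf '1,alpha\n2,beta\n3,gamma\n' > new.csv

# -1 suppresses lines unique to old.csv, -3 suppresses common lines,
# leaving only the rows that first appear in the new snapshot.
comm -1 -3 old.csv new.csv > diff.csv

cat diff.csv
```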
And then, getting an SQL data-migration file out of it (at least for an SQL DB that has something like Postgres's COPY statement, which can read CSV directly) is as simple as:

    cat pre_ddl.sql copy_stmt_begin.sql header.csv diff.csv copy_stmt_end.sql post_ddl.sql > migration.sql
You can then throw that file right into, say, Google Cloud SQL's "import" command.
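To make the concatenation recipe concrete, here's one hypothetical set of fragment contents, assuming a psql-style COPY ... FROM STDIN block (psql expects the inline CSV data to be terminated by a line containing only `\.`; the table and column names are made up for illustration):

```shell
# Hypothetical fragment files for the cat-based migration recipe.
printf 'BEGIN;\n' > pre_ddl.sql
printf 'COPY events (id, name) FROM STDIN WITH (FORMAT csv, HEADER true);\n' > copy_stmt_begin.sql
printf 'id,name\n' > header.csv
printf '3,gamma\n' > diff.csv
printf '%s\n' '\.' > copy_stmt_end.sql   # psql's end-of-data marker
printf 'COMMIT;\n' > post_ddl.sql

cat pre_ddl.sql copy_stmt_begin.sql header.csv diff.csv copy_stmt_end.sql post_ddl.sql > migration.sql
```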

That being said, the other formats are nice for 1. keeping data like numbers in more compact binary forms, 2. being able to sort and de-dup the data slightly more cheaply, without having to parse anything at point-of-sort. (Though this matters less than you'd think; sort(1)'s minimal parser is very fast, and SQLite/Avro can't get any big access-time wins since the data is neither pre-sorted nor column-oriented.)

But in exchange for this, you lose the ability to cheaply merge working datasets together. You can't just concatenate them—you have to ask your storage-format library to serialize data from one of your data files, and then ask the library to parse and import said data into your other data file. Frequently, the overhead of a complete deep parse of the data is the thing we're trying to avoid the expense of in the first place! (Otherwise, why use an ETL pipeline at all, when you could just have your operational data sources do batched SQL inserts directly to your data warehouse?)



