The Data Engineering team at my old job used it (in concert with notebooks), and it resulted in some of the worst code I've ever seen and the most inappropriate use of resources:
A 9-node Databricks cluster to push 200 GB of JSON into an Elasticsearch cluster. The process consisted of:
* close to 5 notebooks.
* things getting serialised to S3 at every possible opportunity.
* a hand-rolled JSON serialisation method that would string-concat all the parts together: "but it only took me 2 minutes to write, what's the problem?" (see the sketch after this list)
* hand-rolled logging functions
* zero appropriate dependency management; packages were installed globally, never updated, etc.
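To illustrate the serialisation point: string-concatenating JSON breaks the moment a value contains a quote or a newline, while the stdlib does the same job correctly for free. A minimal sketch, with a hypothetical document shape:

```python
import json

# What the hand-rolled approach amounts to: fragile string concatenation that
# breaks as soon as a value contains a quote, a newline, or non-ASCII text.
def handrolled(doc):
    return '{"id": "' + doc["id"] + '", "body": "' + doc["body"] + '"}'

# The stdlib does the same job correctly, with escaping and type handling built in.
def serialise(doc):
    return json.dumps(doc, ensure_ascii=False)

doc = {"id": "42", "body": 'He said "hello"\nand left'}
print(handrolled(doc))             # not valid JSON: unescaped quotes and raw newline
print(json.loads(serialise(doc)))  # round-trips cleanly
```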
Nothing about that workflow inherently needed Spark, which was the most egregious part. The whole thing could have been done as a Python app with some joblib/multiprocessing thrown in and run as a single container (see the sketch below).
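For context, here is a minimal sketch of that alternative, assuming the input is newline-delimited JSON files on local disk rather than whatever landed in S3. The host, index name, and paths are placeholders, and joblib would slot in where the multiprocessing pool is; this is an illustration, not the team's actual pipeline.

```python
import json
from multiprocessing import Pool
from pathlib import Path

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

ES_HOST = "http://localhost:9200"   # placeholder
INDEX = "events"                    # placeholder
DATA_DIR = Path("/data/json")       # placeholder: one NDJSON file per chunk of input

def index_file(path: str) -> int:
    """Bulk-index a single newline-delimited JSON file; each worker gets its own client."""
    es = Elasticsearch(ES_HOST)
    with open(path, encoding="utf-8") as fh:
        actions = (
            {"_index": INDEX, "_source": json.loads(line)}
            for line in fh
            if line.strip()
        )
        indexed, _ = bulk(es, actions, chunk_size=1000)
    return indexed

if __name__ == "__main__":
    files = [str(p) for p in DATA_DIR.glob("*.json")]
    with Pool(processes=8) as pool:   # roughly one worker per core
        counts = pool.map(index_file, files)
    print(f"indexed {sum(counts)} documents from {len(files)} files")
```

One container running something like this would have covered 200 GB of JSON without a cluster, which is the whole point of the complaint above.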
Spark was the worst when I used it: unhelpful error messages and failure scenarios, inscrutable stack traces. The thing felt like the worst kind of black box, and figuring out why a node timed out during a step or shuffle was soul-crushing.