The Data Engineering team at my old job used it (in concert with notebooks), and it resulted in some of the worst code I've ever seen and the most inappropriate use of resources:
A 9-node Databricks cluster to push 200 GB of JSON into an Elasticsearch cluster. The process consisted of:
* close to 5 notebooks.
* things getting serialised to S3 at every possible opportunity.
* a hand-rolled JSON serialisation method that would string-concat all the parts together: "but it only took me 2 minutes to write, what's the problem?" (see the sketch after this list)
* hand-rolled logging functions
* zero appropriate dependency management; packages were installed globally, never updated, etc.
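To illustrate the serialisation point: string-concatenating JSON breaks the moment a value contains a quote or a newline, while the stdlib does the same job correctly for free. A minimal sketch, with a hypothetical document shape:

```python
import json

# What the hand-rolled approach amounts to: fragile string concatenation that
# breaks as soon as a value contains a quote, a newline, or non-ASCII text.
def handrolled(doc):
    return '{"id": "' + doc["id"] + '", "body": "' + doc["body"] + '"}'

# The stdlib does the same job correctly, with escaping and type handling built in.
def serialise(doc):
    return json.dumps(doc, ensure_ascii=False)

doc = {"id": "42", "body": 'He said "hello"\nand left'}
print(handrolled(doc))             # not valid JSON: unescaped quotes and raw newline
print(json.loads(serialise(doc)))  # round-trips cleanly
```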
Nothing about that workflow inherently needed Spark, which was the most egregious part. The whole thing could have been done as a Python app with some joblib/multiprocessing thrown in and run as a single container (see the sketch below).
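For context, here is a minimal sketch of that alternative, assuming the input is newline-delimited JSON files on local disk rather than whatever landed in S3. The host, index name, and paths are placeholders, and joblib would slot in where the multiprocessing pool is; this is an illustration, not the team's actual pipeline.

```python
import json
from multiprocessing import Pool
from pathlib import Path

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

ES_HOST = "http://localhost:9200"   # placeholder
INDEX = "events"                    # placeholder
DATA_DIR = Path("/data/json")       # placeholder: one NDJSON file per chunk of input

def index_file(path: str) -> int:
    """Bulk-index a single newline-delimited JSON file; each worker gets its own client."""
    es = Elasticsearch(ES_HOST)
    with open(path, encoding="utf-8") as fh:
        actions = (
            {"_index": INDEX, "_source": json.loads(line)}
            for line in fh
            if line.strip()
        )
        indexed, _ = bulk(es, actions, chunk_size=1000)
    return indexed

if __name__ == "__main__":
    files = [str(p) for p in DATA_DIR.glob("*.json")]
    with Pool(processes=8) as pool:   # roughly one worker per core
        counts = pool.map(index_file, files)
    print(f"indexed {sum(counts)} documents from {len(files)} files")
```

One container running something like this would have covered 200 GB of JSON without a cluster, which is the whole point of the complaint above.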
Spark was the worst when I used it: unhelpful error messages and failure scenarios, inscrutable stack traces. The thing felt like the worst kind of black box, and figuring out why a node timed out during a step or shuffle was soul-crushing.