Although we decided to start using Scala specifically because PySpark was not as...

sandGorgon · on April 20, 2018

I will stay away from veering into a statically typed vs dynamically typed conversation here ;)

But I'm very excited about pyspark 2.3 UDF bringing grouped map . It will be interesting to hear your views on that https://databricks.com/blog/2017/10/30/introducing-vectorize...

RBerenguel · on April 20, 2018

Only checked the implementation of the "Arrow UDFs" recently, because I'm interested in the Arrow interaction (for curiosity), so still don't have a strong opinion. My main concern is that a lot of the PySpark systems are playing around how to interact and speed up the systems while still staying on top of the Scala base.

I'd recommend Dask (haven't tried it much but from all I've seen is top-notch) to anyone who wants Python all the way down (at least until you hit the C at the bottom) ;)

sandGorgon · on April 20, 2018

well we run a hundred machine cluster on Dataproc for doing our stuff. Dask is still not battle-tested, cloud ready (or available) and is generally harder to work with than pyspark.

In general, I will stay happily in the spark world using pyspark rather than go to Dask right now.