
Definitely mirrors my experience too. The vast majority of Spark jobs were easily ported to SQL/dbt, and the remaining ones are in PySpark. I used to use a lot of Scala Spark in backend data processing in 2016, but now it's almost down to zero.

Scala is a real impediment to making data processing accessible to the general public in your company. The order of preference now at my company is:

1. SQL 2. PySpark 3. Java Spark 4. Scala Spark

E.g., Shopify found that 70% of their PySpark could be converted to plain SQL: https://shopify.engineering/build-production-grade-workflow-...



I've rolled out Scala-based Spark interfaces to non-programmers in Databricks notebooks, so it's definitely possible, but only if you stick to the basic language features.

Here's a more detailed PySpark vs Scala comparison in case folks are interested: https://mungingdata.com/apache-spark/python-pyspark-scala-wh...

I think Scala Spark (using 10% of the language features) is the better technical decision, because it provides huge benefits like fat JARs, shading, and better text editor support, but it's the worse overall choice for most organizations because people are generally terrified of Scala.

They'd rather do nothing than write Scala code. I can empathize with their position.


Even when Scala is used more or less like python?


> scala is real big impediment to making data processing accessible to general public in your company

Ding ding ding! Presto/Athena is becoming huge in the BI ecosystem. We don't really use Spark for ad-hoc BI anymore; we use it for data science and large, repetitive workloads.



