Posted to dev@spark.apache.org by Enrico Minack <in...@enrico.minack.dev> on 2023/12/01 15:23:02 UTC

10x to 100x faster df.groupby().applyInPandas()

Hi devs,

I am looking for a PySpark dev who is interested in a 10x to 100x 
speed-up of df.groupby().applyInPandas() for small groups.
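
For context, here is a minimal sketch of the kind of call meant here. The column names (id, v), the toy data and the helper names subtract_mean and run_on_spark are made up for illustration and are not taken from the PoC; "small groups" is read as many groups with only a few rows each.

```python
import pandas as pd


def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # Invoked once per group; pdf holds all rows of that group.
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())


def run_on_spark():
    # Hypothetical driver function. Needs a PySpark installation, so the
    # import is kept local and the pandas function above works without it.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    # Assumed toy data: several groups, each with very few rows.
    df = spark.createDataFrame(
        [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (3, 8.0)],
        ["id", "v"],
    )
    # applyInPandas calls subtract_mean per group and needs the output schema.
    return df.groupby("id").applyInPandas(
        subtract_mean, schema="id long, v double"
    )
```

The per-group function is plain pandas, so it can be tried on a single group without a Spark session.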

A PoC and benchmark can be found at 
https://github.com/apache/spark/pull/37360#issuecomment-1228293766.

I suppose the same approach could be taken to improve the performance of 
vectorized UDFs (for small groups): 
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.pandas_udf.html

Happy to turn this into a proper pull request if someone volunteers to 
review this.

Cheers,
Enrico

