Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2021/10/23 02:13:00 UTC

[jira] [Updated] (SPARK-37100) Pandas groupby UDFs would benefit from automatically redistributing data on the groupby key in order to prevent network issues running udf

     [ https://issues.apache.org/jira/browse/SPARK-37100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-37100:
---------------------------------
    Fix Version/s:     (was: 3.2.1)

> Pandas groupby UDFs would benefit from automatically redistributing data on the groupby key in order to prevent network issues running udf
> ------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37100
>                 URL: https://issues.apache.org/jira/browse/SPARK-37100
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.1.2
>            Reporter: Richard Williamson
>            Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When running pandas UDF groupby steps with high cardinality (hundreds of thousands of unique groups or more), jobs either fail outright or see a high number of task failures due to network errors on larger clusters (100+ nodes). The code below is not the exact code that caused the problem, but it should be closely representative:
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.functions import rand
> from fancyimpute import IterativeSVD
> import numpy as np
> import pandas as pd
> 
> df = spark.range(0, 100000).withColumn('v', rand())
> @pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
> def solver(pdf):
>     # The imputed frame is discarded here; the input group is returned unchanged.
>     pd.DataFrame(data=IterativeSVD(verbose=False).fit_transform(pdf.to_numpy()))
>     return pdf
> 
> df.groupby('id').apply(solver).count()
>  
> Calling df.repartition('id') before the groupby is required to fix it. Can this be made to happen automatically without any adverse impacts?
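
For illustration, the workaround described in the report might be wrapped as in the sketch below. This is not code from the ticket: the helper name apply_grouped_with_repartition, the identity function, and demo are hypothetical, and the sketch assumes Spark 3.x, where applyInPandas is available as the replacement for the GROUPED_MAP pandas_udf form. It only chains existing DataFrame methods, and it does not show that the explicit repartition is free of cost, which is the open question in the report.

```python
def apply_grouped_with_repartition(df, key, func, schema):
    """Redistribute `df` on `key`, then run `func` once per group.

    `df` is expected to be a pyspark.sql.DataFrame. The helper name is
    hypothetical; it only chains existing DataFrame methods.
    """
    # Explicit redistribution on the grouping key (the reporter's workaround).
    partitioned = df.repartition(key)
    # applyInPandas passes each group to `func` as a pandas DataFrame.
    return partitioned.groupby(key).applyInPandas(func, schema=schema)


def identity(pdf):
    # Stand-in for the real per-group computation; returns the group unchanged.
    return pdf


def demo():
    # Imported lazily so the helper above can be read without a Spark install.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import rand

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df = spark.range(0, 1000).withColumn("v", rand())
    out = apply_grouped_with_repartition(df, "id", identity, df.schema)
    return out.count()
```

Calling demo() on a machine with pyspark, pandas, and pyarrow installed runs the repartitioned grouped apply on a small local dataset; on a real cluster the per-group function would be substituted for identity.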


