Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2021/10/23 02:13:00 UTC
[jira] [Updated] (SPARK-37100) Pandas groupby UDFs would benefit from automatically redistributing data on the groupby key in order to prevent network issues running udf
[ https://issues.apache.org/jira/browse/SPARK-37100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-37100:
---------------------------------
Fix Version/s: (was: 3.2.1)
> Pandas groupby UDFs would benefit from automatically redistributing data on the groupby key in order to prevent network issues running udf
> ------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-37100
> URL: https://issues.apache.org/jira/browse/SPARK-37100
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.1.2
> Reporter: Richard Williamson
> Priority: Minor
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> When running high-cardinality Pandas UDF groupby steps (100,000s+ of unique groups), jobs will either fail or see a high number of task failures due to network errors on larger clusters (100+ nodes). This was not the specific code causing the issues, but it should be close to representative:
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.functions import rand
> from fancyimpute import IterativeSVD
> import numpy as np
> import pandas as pd
>
> df = spark.range(0, 100000).withColumn('v', rand())
> @pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
> def solver(pdf):
>     # Impute the group's values with IterativeSVD and return a frame
>     # whose columns match the declared output schema.
>     imputed = IterativeSVD(verbose=False).fit_transform(pdf.to_numpy())
>     return pd.DataFrame(data=imputed, columns=pdf.columns)
>
> df.groupby('id').apply(solver).count()
>
> Calling df.repartition('id') before the groupby is required to fix it. Can we make this repartition happen automatically, without any adverse impacts?
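>
> A minimal sketch of the workaround, assuming the df and solver UDF defined above: the explicit repartition shuffles all rows for each key onto a single partition before the grouped-map UDF runs, instead of relying on the shuffle that apply triggers on its own.
>
> # Pre-shuffle rows by the groupby key, then apply the same UDF.
> df.repartition('id').groupby('id').apply(solver).count()
>
> On Spark 3.x the non-deprecated form of the same pattern is groupby('id').applyInPandas(func, schema), and the explicit repartition('id') can be placed in front of it in exactly the same way.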
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org