Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2021/12/15 06:49:00 UTC

[jira] [Resolved] (SPARK-32294) GroupedData Pandas UDF 2Gb limit

     [ https://issues.apache.org/jira/browse/SPARK-32294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-32294.
----------------------------------
    Fix Version/s: 3.3.0
       Resolution: Cannot Reproduce

This is fixed in the master branch. I can't reproduce it anymore:

{code}
>>> df = spark.range(1024 * 1024 * 1024 * 1).selectExpr("1 as a", "1 as b", "1 as c").coalesce(1)  # More than 2GB to hit ARROW-4890.
>>> df.groupby("a").applyInPandas(lambda pdf: pdf, schema=df.schema).count()
1073741824
{code}
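For context, `spark.sql.execution.arrow.maxRecordsPerBatch` caps the Arrow batch size for scalar pandas UDFs, while `applyInPandas` materializes each whole group as a single pandas DataFrame, which is why one very large group could previously hit ARROW-4890. A minimal sketch contrasting the two, reusing the {{df}} from the reproduction above (the UDF and batch size are illustrative):

{code}
from pyspark.sql.functions import pandas_udf
import pandas as pd

# Scalar pandas UDF: input arrives in Arrow batches whose size is capped by
# spark.sql.execution.arrow.maxRecordsPerBatch (default 10000), so no single
# Arrow payload grows with the size of the whole partition.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 10000)

@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

df.select(plus_one("b")).count()

# Grouped map: each group is converted into a single pandas DataFrame, so the
# payload scales with the group size regardless of maxRecordsPerBatch.
df.groupby("a").applyInPandas(lambda pdf: pdf, schema=df.schema).count()
{code}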

> GroupedData Pandas UDF 2Gb limit
> --------------------------------
>
>                 Key: SPARK-32294
>                 URL: https://issues.apache.org/jira/browse/SPARK-32294
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Ruslan Dautkhanov
>            Priority: Major
>             Fix For: 3.3.0
>
>
> `spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for GroupedData: the whole group is passed to the Pandas UDF at once, which can hit various 2 GB limitations on the Arrow side (and, in current versions of Arrow, also a 2 GB limitation on the Netty allocator side) - https://issues.apache.org/jira/browse/ARROW-4890
> It would be great to consider feeding GroupedData into a pandas UDF in batches to solve this issue.
> cc [~hyukjin.kwon] 
>  
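Until grouped data is fed to the UDF in batches, one possible user-side workaround is to salt very large groups into smaller sub-groups before `applyInPandas`; this only works when the per-group computation can be applied to each sub-group independently. A hedged sketch (the bucket count and `salt` column are illustrative, not from the original report):

{code}
from pyspark.sql import functions as F

# Split each logical group into smaller sub-groups so that no single group
# payload has to exceed Arrow's 2 GB limit. Only valid when the UDF result
# does not depend on seeing the whole group at once.
num_buckets = 16  # illustrative
salted = df.withColumn("salt", F.monotonically_increasing_id() % num_buckets)

result = (salted.groupby("a", "salt")
                .applyInPandas(lambda pdf: pdf.drop(columns=["salt"]), schema=df.schema))
{code}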



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
