Posted to issues@spark.apache.org by "Ruslan Dautkhanov (Jira)" <ji...@apache.org> on 2020/07/13 18:28:00 UTC

[jira] [Created] (SPARK-32294) GroupedData Pandas UDF 2Gb limit

Ruslan Dautkhanov created SPARK-32294:
-----------------------------------------

             Summary: GroupedData Pandas UDF 2Gb limit
                 Key: SPARK-32294
                 URL: https://issues.apache.org/jira/browse/SPARK-32294
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.0.0, 3.1.0
            Reporter: Ruslan Dautkhanov


`spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for GroupedData: the whole group is passed to the Pandas UDF at once, which can hit various 2 GB limitations on the Arrow side (and, in current versions of Arrow, a 2 GB limitation on the Netty allocator side as well) - https://issues.apache.org/jira/browse/ARROW-4890
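
For illustration, a minimal PySpark sketch of the behavior (the dataset, the key skew, and the UDF body are made up; `applyInPandas` is the Spark 3.0+ grouped-map API):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# maxRecordsPerBatch caps Arrow batch sizes for conversions such as
# toPandas(), but a grouped-map Pandas UDF still receives each whole
# group as a single pandas.DataFrame, no matter how large the group is.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")

# Hypothetical skewed dataset: 4 keys, ~125M rows each.
df = spark.range(0, 500_000_000).withColumn("key", col("id") % 4)

def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds ALL rows of one key (~125M here), regardless of
    # maxRecordsPerBatch -- this is where the ~2 GB Arrow/Netty
    # buffer limits from ARROW-4890 can be hit.
    return pd.DataFrame({"key": [pdf["key"].iloc[0]], "n": [len(pdf)]})

result = df.groupBy("key").applyInPandas(summarize, schema="key long, n long")
```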

It would be great to consider feeding GroupedData into a Pandas UDF in batches to solve this issue. 
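
Until then, one possible interim workaround is to salt each key into sub-groups so that no single pandas.DataFrame handed to the UDF exceeds the Arrow limit. This is a sketch only, reusing `df` and `summarize` from the snippet above; `NUM_SALTS` is an illustrative parameter, and the approach only applies when the per-group logic can be decomposed into partial results that are recombined afterwards:

```python
from pyspark.sql.functions import col

# Split each key into NUM_SALTS sub-groups, apply the UDF per
# (key, salt) sub-group, then re-aggregate the partials per key.
NUM_SALTS = 64
salted = df.withColumn("salt", col("id") % NUM_SALTS)
partial = (salted.groupBy("key", "salt")
                 .applyInPandas(summarize, schema="key long, n long"))
final = partial.groupBy("key").sum("n")
```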

cc [~hyukjin.kwon] 
