Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2020/07/14 01:30:00 UTC

[jira] [Commented] (SPARK-32294) GroupedData Pandas UDF 2Gb limit

    [ https://issues.apache.org/jira/browse/SPARK-32294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157065#comment-17157065 ] 

Hyukjin Kwon commented on SPARK-32294:
--------------------------------------

Thanks for filing the issue, [~Tagar].

> GroupedData Pandas UDF 2Gb limit
> --------------------------------
>
>                 Key: SPARK-32294
>                 URL: https://issues.apache.org/jira/browse/SPARK-32294
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Ruslan Dautkhanov
>            Priority: Major
>
> `spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for GroupedData: the whole group is passed to the Pandas UDF at once, which can hit various 2 GB limitations on the Arrow side (and, in current versions of Arrow, a 2 GB limitation on the Netty allocator side as well) - https://issues.apache.org/jira/browse/ARROW-4890 
> It would be great to consider feeding GroupedData into the Pandas UDF in batches to solve this issue. 
> cc [~hyukjin.kwon] 
>  
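
For reference, a minimal PySpark sketch of the behaviour described in the quoted report: with a grouped-map Pandas UDF (applyInPandas, available since Spark 3.0), each group arrives in the UDF as a single pandas DataFrame regardless of spark.sql.execution.arrow.maxRecordsPerBatch, so a sufficiently large group can run into Arrow's 2 GB limits. The column name "g", the row counts, and the helper name rows_received are illustrative only, not taken from the issue.

    from pyspark.sql import SparkSession
    import pandas as pd

    spark = (
        SparkSession.builder
        # Limits Arrow batch sizes for scalar Pandas UDFs, but (per this issue)
        # does not split the data handed to a grouped-map UDF.
        .config("spark.sql.execution.arrow.maxRecordsPerBatch", "1000")
        .getOrCreate()
    )

    # 100,000 rows split into 4 groups of ~25,000 rows each.
    df = spark.range(100000).selectExpr("id % 4 as g", "id")

    def rows_received(pdf: pd.DataFrame) -> pd.DataFrame:
        # Each group arrives as one pandas DataFrame, so len(pdf) is the full
        # group size, not a maxRecordsPerBatch-sized chunk.
        return pd.DataFrame({"g": [pdf["g"].iloc[0]],
                             "rows_received": [len(pdf)]})

    df.groupBy("g") \
      .applyInPandas(rows_received, schema="g long, rows_received long") \
      .show()
    # Each group reports ~25,000 rows even though maxRecordsPerBatch is 1000.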



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org