Posted to issues@spark.apache.org by "Ruslan Dautkhanov (Jira)" <ji...@apache.org> on 2020/07/13 18:28:00 UTC
[jira] [Created] (SPARK-32294) GroupedData Pandas UDF 2Gb limit
Ruslan Dautkhanov created SPARK-32294:
-----------------------------------------
Summary: GroupedData Pandas UDF 2Gb limit
Key: SPARK-32294
URL: https://issues.apache.org/jira/browse/SPARK-32294
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.0.0, 3.1.0
Reporter: Ruslan Dautkhanov
`spark.sql.execution.arrow.maxRecordsPerBatch` is not respected for GroupedData: the whole group is passed to the Pandas UDF at once, which can hit various 2 GB limitations on the Arrow side (and, in current versions of Arrow, a 2 GB limitation on the Netty allocator side as well) - https://issues.apache.org/jira/browse/ARROW-4890
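A rough back-of-the-envelope sketch of why a single oversized batch is a problem: Arrow's variable-length types (e.g. strings) index their data buffer with signed 32-bit offsets, so one column's buffer in a single record batch is capped near 2 GB. The helper below is purely illustrative (not a Spark or Arrow API) and assumes a string column with a given average value size.

```python
# Illustrative only: Arrow string columns use signed 32-bit offsets,
# so a single column's data buffer in one record batch cannot exceed
# 2**31 - 1 bytes. If a whole group is shipped as one batch, a large
# group can overflow this cap.
INT32_MAX = 2**31 - 1

def fits_in_one_arrow_batch(num_rows, avg_value_bytes):
    """Rough check: can `num_rows` string values averaging
    `avg_value_bytes` each fit in a single Arrow string buffer?"""
    return num_rows * avg_value_bytes <= INT32_MAX

# A 20M-row group of ~200-byte strings is ~4 GB of data,
# which overflows the 32-bit offset range:
fits_in_one_arrow_batch(20_000_000, 200)  # -> False
```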
It would be great to consider feeding GroupedData into the Pandas UDF in batches to solve this issue.
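A minimal pandas-only sketch of the batching idea, with hypothetical names (`apply_in_batches` is not a Spark API): slice the group into `maxRecordsPerBatch`-sized chunks, apply the UDF per chunk, and concatenate. This only works for row-wise UDFs that do not need to see the entire group at once (group-level aggregates would still need the whole group).

```python
import pandas as pd

def apply_in_batches(group, udf, max_records_per_batch=10_000):
    """Hypothetical sketch: apply `udf` to a group in slices of at most
    `max_records_per_batch` rows instead of one giant batch, then
    concatenate the per-batch results."""
    pieces = [
        udf(group.iloc[start:start + max_records_per_batch])
        for start in range(0, len(group), max_records_per_batch)
    ]
    return pd.concat(pieces, ignore_index=True)

# Example: a row-wise transformation applied in 2-row batches.
df = pd.DataFrame({"v": [1, 2, 3, 4, 5]})
out = apply_in_batches(df, lambda pdf: pdf.assign(v=pdf["v"] * 10),
                       max_records_per_batch=2)
# out["v"] is [10, 20, 30, 40, 50]
```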
cc [~hyukjin.kwon]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)