Posted to issues@spark.apache.org by "Bryan Cutler (JIRA)" <ji...@apache.org> on 2018/01/29 18:02:00 UTC
[jira] [Created] (SPARK-23258) Should not split Arrow record batches based on row count
Bryan Cutler created SPARK-23258:
------------------------------------
Summary: Should not split Arrow record batches based on row count
Key: SPARK-23258
URL: https://issues.apache.org/jira/browse/SPARK-23258
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.3.0
Reporter: Bryan Cutler
Currently, when executing a scalar {{pandas_udf}} or using {{toPandas()}}, the Arrow record batches are split once the record count reaches a maximum value, configured with "spark.sql.execution.arrow.maxRecordsPerBatch". This is not ideal because the row count alone does not account for the number of columns: with a wide schema, a batch within the row limit can still be very large and cause OOMs. An alternative approach could be to look at the size of the Arrow buffers being used and cap each batch at a certain byte size.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org