You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Bryan Cutler (JIRA)" <ji...@apache.org> on 2018/08/29 21:09:00 UTC

[jira] [Commented] (SPARK-25274) Improve `toPandas` with Arrow by sending out-of-order record batches

    [ https://issues.apache.org/jira/browse/SPARK-25274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596860#comment-16596860 ] 

Bryan Cutler commented on SPARK-25274:
--------------------------------------

This is a followup to SPARK-23030 that is now possible since Arrow stream format is being used

> Improve `toPandas` with Arrow by sending out-of-order record batches
> --------------------------------------------------------------------
>
>                 Key: SPARK-25274
>                 URL: https://issues.apache.org/jira/browse/SPARK-25274
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, SQL
>    Affects Versions: 2.4.0
>            Reporter: Bryan Cutler
>            Priority: Major
>
> When executing {{toPandas}} with Arrow enabled, partitions that arrive in the JVM out-of-order must be buffered before they can be send to Python. This causes an excess of memory to be used in the driver JVM and increases the time it takes to complete because data must sit in the JVM waiting for preceding partitions to come in.
> This can be improved by sending out-of-order partitions to Python as soon as they arrive in the JVM, followed by a list of indices so that Python can assemble the data in the correct order. This way, data is not buffered at the JVM and there is no waiting on particular partitions so performance will be increased.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org