Posted to issues@spark.apache.org by "Xiangrui Meng (JIRA)" <ji...@apache.org> on 2019/04/25 17:45:00 UTC

[jira] [Created] (SPARK-27569) Pandas UDF prefetches Arrow batches in the queue while executing the current batch

Xiangrui Meng created SPARK-27569:
-------------------------------------

             Summary: Pandas UDF prefetches Arrow batches in the queue while executing the current batch
                 Key: SPARK-27569
                 URL: https://issues.apache.org/jira/browse/SPARK-27569
             Project: Spark
          Issue Type: Story
          Components: PySpark
    Affects Versions: 3.0.0
            Reporter: Xiangrui Meng


The current Pandas UDF implementation fetches the next batch only after it finishes executing the current batch. On the JVM side, writing the next batch to the socket blocks until the Python side fetches it. We can prefetch the next batch on the Python side to enable data pipelining. In theory, this can achieve up to a 2x speedup on a workload where I/O and compute are balanced. We saw ~1.5x on a real workload.
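A minimal sketch of the idea, not Spark's actual implementation: a background thread reads ahead from the batch iterator into a bounded queue while the main thread runs the UDF on the current batch. The names `prefetch` and `depth` are hypothetical and chosen for illustration; `batches` stands in for whatever iterator yields deserialized Arrow batches from the socket. If reading a batch and computing on it each take time t, the serial loop costs about 2t per batch and the pipelined loop about t, which is where the theoretical 2x comes from.

```python
import queue
import threading


class _End:
    """Marker the reader thread enqueues when the source ends or fails."""

    def __init__(self, error=None):
        self.error = error


def prefetch(batches, depth=1):
    """Yield items from `batches`, reading up to `depth` items ahead.

    `prefetch` and `depth` are illustrative names, not a PySpark API.
    """
    buf = queue.Queue(maxsize=depth)

    def reader():
        try:
            for batch in batches:
                buf.put(batch)  # blocks while the queue is full
        except BaseException as exc:
            buf.put(_End(exc))  # hand the failure to the consumer
        else:
            buf.put(_End())

    # Daemon thread, so an abandoned generator does not block interpreter exit.
    threading.Thread(target=reader, daemon=True).start()

    while True:
        item = buf.get()
        if isinstance(item, _End):
            if item.error is not None:
                raise item.error
            return
        yield item
```

A UDF loop would then iterate over `prefetch(batches)` instead of `batches`. The bounded queue keeps memory use limited to `depth` extra batches, and order is preserved because there is a single reader thread.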


