Posted to issues@spark.apache.org by "Xiangrui Meng (JIRA)" <ji...@apache.org> on 2019/04/25 17:45:00 UTC
[jira] [Created] (SPARK-27569) Pandas UDF prefetches Arrow batches in the queue while executing the current batch
Xiangrui Meng created SPARK-27569:
-------------------------------------
Summary: Pandas UDF prefetches Arrow batches in the queue while executing the current batch
Key: SPARK-27569
URL: https://issues.apache.org/jira/browse/SPARK-27569
Project: Spark
Issue Type: Story
Components: PySpark
Affects Versions: 3.0.0
Reporter: Xiangrui Meng
The current Pandas UDF implementation fetches the next batch only after it has finished executing the current batch. On the JVM side, writing the next batch to the socket blocks until the Python side fetches it. We can prefetch the next batch on the Python side to enable data pipelining. Theoretically, this can achieve a 2x speedup on a workload where I/O and compute are balanced, because fetching and computing then overlap fully. We saw ~1.5x on a real workload.
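
A minimal sketch of the prefetching idea, not Spark's actual implementation: a background thread reads ahead from the batch iterator into a bounded queue while the caller processes the current batch. The names `prefetch`, `depth`, and `_End` are hypothetical and chosen for illustration only; any iterator of batches can stand in for the Arrow stream.

```python
import queue
import threading


class _End:
    """Marker placed on the queue when the source is exhausted or fails."""

    def __init__(self, error=None):
        self.error = error


def prefetch(batches, depth=1):
    """Yield items from `batches`, reading up to `depth` items ahead.

    Illustrative helper (an assumption, not a Spark API): a daemon thread
    pulls from the source iterator so the next batch can be fetched while
    the consumer is still working on the current one.
    """
    buf = queue.Queue(maxsize=depth)  # bounded, so read-ahead stays limited

    def producer():
        try:
            for batch in batches:
                buf.put(batch)  # blocks while the queue is full
        except Exception as exc:
            buf.put(_End(exc))  # forward the failure to the consumer
        else:
            buf.put(_End())

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buf.get()
        if isinstance(item, _End):
            if item.error is not None:
                raise item.error
            return
        yield item
```

Wrapping the batch iterator, as in `for batch in prefetch(batch_iter): ...`, leaves the consuming loop unchanged. With `depth=1`, at most one batch waits in the queue while the producer holds one more, which bounds the extra memory. If the consumer stops early, the daemon thread simply stays blocked until the process exits; a production version would need explicit cancellation.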