Posted to issues@spark.apache.org by "Weichen Xu (JIRA)" <ji...@apache.org> on 2019/05/29 03:41:00 UTC

[jira] [Updated] (SPARK-27870) Flush each batch for pandas UDF (for improving pandas UDFs pipeline)

     [ https://issues.apache.org/jira/browse/SPARK-27870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weichen Xu updated SPARK-27870:
-------------------------------
    Description: 
Flush each batch for pandas UDF.

This could improve performance when multiple pandas UDF plans are pipelined.

When each batch is flushed promptly, downstream pandas UDFs can start working as soon as possible, and the pipelining helps hide the downstream UDFs' computation time. For example:

When the first UDF starts computing on batch-3, the second pipelined UDF can start computing on batch-2, and the third pipelined UDF can start computing on batch-1.

If we do not flush each batch promptly, the downstream UDFs' pipeline will lag too far behind, which may increase the total processing time.
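
The effect can be illustrated with a small timing model. This is only a sketch under simplifying assumptions, not Spark's implementation: every stage handles one batch at a time, each batch costs a fixed amount of time per stage, and a stage's output becomes visible downstream only when the buffer holding it is flushed. The function name `pipeline_finish_time` and the parameter `flush_every` are hypothetical and exist only for this illustration.

```python
def pipeline_finish_time(num_batches, stage_costs, flush_every):
    """Time at which the last stage finishes the last batch.

    num_batches -- number of batches pushed through the pipeline
    stage_costs -- per-batch compute time of each pipelined UDF, in order
    flush_every -- a stage flushes its output after this many batches
                   (1 means "flush each batch")
    """
    # available[i]: time at which batch i becomes visible to the current stage.
    available = [0.0] * num_batches
    for cost in stage_costs:
        done = []
        clock = 0.0
        for i in range(num_batches):
            # A stage starts a batch once it is free and the batch has arrived.
            clock = max(clock, available[i]) + cost
            done.append(clock)
        # Batch i reaches the next stage only when the last batch of its
        # flush group has been computed and the buffer is flushed.
        available = []
        for i in range(num_batches):
            group_end = min((i // flush_every + 1) * flush_every, num_batches) - 1
            available.append(done[group_end])
    return max(available)


if __name__ == "__main__":
    costs = [1.0, 1.0, 1.0]  # three pipelined UDFs, one time unit per batch
    for flush_every in (1, 3, 6):
        total = pipeline_finish_time(6, costs, flush_every)
        print(f"flush every {flush_every} batch(es): finished at t={total}")
```

In this model, with six batches and three UDFs of equal cost, flushing each batch finishes at t=8, flushing every three batches at t=12, and flushing only once at the end at t=18, which is the same as running the three UDFs one after another with no overlap.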

 

  was:
Flush each batch for python UDF.

This could improve performance when multiple python UDF plans are pipelined.

When each batch is flushed promptly, downstream python UDFs can start working as soon as possible, and the pipelining helps hide the downstream UDFs' computation time. For example:

When the first UDF starts computing on batch-3, the second pipelined UDF can start computing on batch-2, and the third pipelined UDF can start computing on batch-1.

If we do not flush each batch promptly, the downstream UDFs' pipeline will lag too far behind, which may increase the total processing time.

 


> Flush each batch for pandas UDF (for improving pandas UDFs pipeline)
> --------------------------------------------------------------------
>
>                 Key: SPARK-27870
>                 URL: https://issues.apache.org/jira/browse/SPARK-27870
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Weichen Xu
>            Priority: Major
>
> Flush each batch for pandas UDF.
> This could improve performance when multiple pandas UDF plans are pipelined.
> When each batch is flushed promptly, downstream pandas UDFs can start working as soon as possible, and the pipelining helps hide the downstream UDFs' computation time. For example:
> When the first UDF starts computing on batch-3, the second pipelined UDF can start computing on batch-2, and the third pipelined UDF can start computing on batch-1.
> If we do not flush each batch promptly, the downstream UDFs' pipeline will lag too far behind, which may increase the total processing time.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
