
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/30 02:11:59 UTC

[GitHub] [spark] WeichenXu123 edited a comment on issue #24734: [SPARK-27870][SQL][PySpark] Flush each batch for pandas UDF (for improving pandas UDFs pipeline)

URL: https://github.com/apache/spark/pull/24734#issuecomment-497169812
 
 
   @BryanCutler 
   > Are you thinking of the case where an entire batch does not completely fill the write buffer?
   
   The case occurs at the beginning of execution.
   Suppose the batch size is small and the write buffer can hold 100 batches. Then the first 100 batches cannot be pipelined when running the UDFs. For example, suppose there are 3 pandas UDFs (UDF1, UDF2, UDF3) pipelined:
   
   With my PR, it will be:
   ```
   run UDF1 on batch-1
   parallelly run UDF1 on batch-2, UDF2 on batch-1
   parallelly run UDF1 on batch-3, UDF2 on batch-2, UDF3 on batch-1
   parallelly run UDF1 on batch-4, UDF2 on batch-3, UDF3 on batch-2
   parallelly run UDF1 on batch-5, UDF2 on batch-4, UDF3 on batch-3
   ...
   ```
   On master, it will be (suppose the write buffer can hold 100 batches):
   ```
   run UDF1 on batch-1
   run UDF1 on batch-2
   ...
   run UDF1 on batch-100
   parallelly run UDF1 on batch-101, UDF2 on batch-1
   parallelly run UDF1 on batch-102, UDF2 on batch-2
   ...
   parallelly run UDF1 on batch-200, UDF2 on batch-100
   parallelly run UDF1 on batch-201, UDF2 on batch-101, UDF3 on batch-1
   parallelly run UDF1 on batch-202, UDF2 on batch-102, UDF3 on batch-2
   ...
   ```
   
   We can see that on master, each downstream UDF (UDF2 and UDF3) lags 100 batches behind the UDF before it, so UDF3 does not start until UDF1 reaches batch-201. For the first 200 batches, the UDFs are not fully pipelined and performance is poor.
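
The two schedules above can be reproduced with a toy model. This is only a sketch under simplifying assumptions, not how Spark is implemented: every UDF takes exactly one time step per batch, and a UDF's output reaches the next UDF only when its write buffer is flushed, which here means after every `batches_per_flush` batches. The names `simulate` and `batches_per_flush` are made up for this illustration.

```python
from collections import defaultdict


def simulate(num_udfs, num_batches, batches_per_flush):
    """Toy model: return {time step: [(udf, batch), ...]} for a chain of UDFs."""
    # available[b] is the step after which batch b has reached the current UDF.
    # All input batches are assumed to be ready before step 1.
    available = [0] * num_batches
    timeline = defaultdict(list)
    for udf in range(1, num_udfs + 1):
        finished = []
        step = 0
        for b in range(num_batches):
            # Batches are processed in order, one per step, once they arrive.
            step = max(step, available[b]) + 1
            finished.append(step)
            timeline[step].append((udf, b + 1))
        # Output is visible downstream only once the buffer is flushed,
        # i.e. when the last batch of its flush group has been written.
        available = []
        for b in range(num_batches):
            group_end = min((b // batches_per_flush + 1) * batches_per_flush,
                            num_batches) - 1
            available.append(finished[group_end])
    return timeline


pr = simulate(num_udfs=3, num_batches=300, batches_per_flush=1)
master = simulate(num_udfs=3, num_batches=300, batches_per_flush=100)
print(pr[3])        # work done in step 3 when flushing every batch
print(master[201])  # first step in which UDF3 runs when flushing every 100
```

In this model, flushing every batch lets UDF3 start at step 3, while flushing every 100 batches delays it to step 201, which matches the two listings above.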
   
   This case matters in some machine learning scenarios. For real-time prediction, we need the batch size to be small, and the UDF runs code on a GPU that takes considerable time, so it is worth parallelizing the UDF computation as much as possible.
   
   Thanks for your review!
   
   
   
