Posted to issues@spark.apache.org by "Sidhavratha Kumar (JIRA)" <ji...@apache.org> on 2018/07/01 02:16:00 UTC

[jira] [Created] (SPARK-24707) Enable spark-kafka-streaming to maintain min buffer using async thread to avoid blocking kafka poll

Sidhavratha Kumar created SPARK-24707:
-----------------------------------------

             Summary: Enable spark-kafka-streaming to maintain min buffer using async thread to avoid blocking kafka poll
                 Key: SPARK-24707
                 URL: https://issues.apache.org/jira/browse/SPARK-24707
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 2.4.0
            Reporter: Sidhavratha Kumar


Currently the Spark Kafka RDD blocks on the Kafka consumer poll. In a Spark-Kafka-streaming job in particular, this poll duration is added to the batch processing time, which results in
 * increased batch processing time (on top of the time actually taken to process records)
 * unpredictable batch processing time, since it varies with poll time.

If we poll Kafka in a background thread and maintain a buffer for each partition, poll time will not be added to batch processing time. This makes processing time more predictable, since it then depends only on the time taken to process each record, not on the extra time taken to poll records from the source.

For example, we are facing issues where a Kafka poll sometimes takes ~30 sec and sometimes returns within a second. With backpressure enabled, this slows our job down considerably. It also makes it difficult to scale our processing or to calculate the resources required for a future increase in records.

Even for jobs that do not see varying Kafka poll times, maintaining a buffer for each partition provides a performance improvement, since each batch can then concentrate purely on processing records.
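A minimal sketch of this idea, with a simulated record source standing in for the real KafkaConsumer (the class and all names below are hypothetical illustrations, not Spark or Kafka API): a daemon thread keeps a per-partition buffer topped up in the background, so the processing side only drains already-fetched records and never blocks on a poll.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical per-partition prefetch buffer: a background thread keeps
// polling the source so the processing thread never blocks on poll().
public class PrefetchBuffer {
    private final BlockingQueue<Integer> buffer;
    private final Thread poller;
    private volatile boolean running = true;

    public PrefetchBuffer(int minBuffer) {
        this.buffer = new ArrayBlockingQueue<>(minBuffer);
        this.poller = new Thread(() -> {
            int next = 0; // simulated record offsets; a real poller would call consumer.poll(timeout)
            while (running) {
                try {
                    buffer.put(next++); // blocks when the buffer is already full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        poller.setDaemon(true);
        poller.start();
    }

    // The batch drains already-buffered records; no poll latency is added.
    public List<Integer> nextBatch(int n) throws InterruptedException {
        List<Integer> batch = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            batch.add(buffer.take());
        }
        return batch;
    }

    public void stop() {
        running = false;
        poller.interrupt();
    }

    public static void main(String[] args) throws InterruptedException {
        PrefetchBuffer pb = new PrefetchBuffer(500);
        List<Integer> first = pb.nextBatch(100);
        System.out.println("batch size = " + first.size() + ", first record = " + first.get(0));
        pb.stop();
    }
}
```

On a rebalance or failure the buffer would simply be cleared and refilled, matching the behaviour described below.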

Example:
Let's consider:
 * each Kafka poll takes 2 sec on average
 * batch duration is 10 sec
 * processing 100 records takes 10 sec
 ## Spark job starts
 ## Batch-1 (100 records) (buffer = 0) (processing time = 10 sec + 2 sec) => 12 sec processing time
 ## Batch-2 (100 records) (buffer = 200) (processing time = 10 sec) => 10 sec processing time
 ## Batch-3 (100 records) (buffer = 100) (processing time = 10 sec) => 10 sec processing time
 ## Batch-4 (100 records) (buffer = 0) (processing time = 10 sec + 2 sec) => 12 sec processing time

If we poll asynchronously and always maintain 500 records per partition, only Batch-1 takes 12 sec. All subsequent batches complete in 10 sec (unless a rebalance or failure happens, in which case the buffer is cleared and the next batch again takes 12 sec).
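The batch timings above follow from one rule: the 2 sec poll cost is paid only by a batch that starts with an empty buffer. A short calculation (the variable names are illustrative) reproduces the numbers:

```java
// Reproduce the per-batch timings from the example: a batch pays the poll
// cost only when its partition buffer is empty at batch start.
public class BatchTiming {
    public static void main(String[] args) {
        int processSec = 10; // time to process 100 records
        int pollSec = 2;     // average blocking poll duration
        int[] bufferBefore = {0, 200, 100, 0}; // buffered records before each batch

        int total = 0;
        for (int i = 0; i < bufferBefore.length; i++) {
            int t = processSec + (bufferBefore[i] == 0 ? pollSec : 0);
            total += t;
            System.out.println("Batch-" + (i + 1) + ": " + t + " sec");
        }
        System.out.println("total = " + total + " sec"); // 12 + 10 + 10 + 12 = 44
    }
}
```

With the async buffer kept at 500 records, every `bufferBefore` entry after Batch-1 stays above zero, so the steady-state batch time is a flat 10 sec.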



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
