Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2018/07/01 03:10:00 UTC

[jira] [Assigned] (SPARK-24707) Enable spark-kafka-streaming to maintain min buffer using async thread to avoid blocking kafka poll

     [ https://issues.apache.org/jira/browse/SPARK-24707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-24707:
------------------------------------

    Assignee: Apache Spark

> Enable spark-kafka-streaming to maintain min buffer using async thread to avoid blocking kafka poll
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24707
>                 URL: https://issues.apache.org/jira/browse/SPARK-24707
>             Project: Spark
>          Issue Type: Improvement
>          Components: DStreams
>    Affects Versions: 2.4.0
>            Reporter: Sidhavratha Kumar
>            Assignee: Apache Spark
>            Priority: Major
>         Attachments: 40_partition_topic_without_buffer.pdf
>
>
> Currently the Spark Kafka RDD blocks on the Kafka consumer poll. Especially in a Spark-Kafka streaming job, this poll duration is added to the batch processing time, which results in:
>  * increased batch processing time, on top of the time taken to process the records themselves
>  * unpredictable batch processing time, since it varies with the poll time
> If we poll Kafka in a background thread and maintain a buffer for each partition, the poll time is no longer added to the batch processing time, which makes processing time more predictable (driven by the time taken to process each record rather than the extra time taken to poll records from the source).
> For example, we are facing issues where the Kafka poll sometimes takes ~30 seconds and sometimes returns within a second. With backpressure enabled, this slows our job down considerably. It also makes it difficult to scale our processing or estimate resource requirements for a future increase in records.
> Even for users who do not see varying Kafka poll times, maintaining a buffer for each partition would still improve performance, since each batch can then concentrate purely on processing records.
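The background-poll-with-buffer idea above could be sketched in JVM terms roughly as follows. This is a hypothetical illustration, not Spark's or Kafka's actual API: the class name and the Supplier standing in for KafkaConsumer.poll are assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch of the proposed async prefetch: one background thread
// keeps a bounded per-partition buffer topped up, so a batch only drains the
// buffer and never blocks on the (simulated) poll itself.
class PartitionPrefetcher {
    private final BlockingQueue<String> buffer;
    private final Thread poller;

    PartitionPrefetcher(int minBuffer, Supplier<List<String>> source) {
        this.buffer = new ArrayBlockingQueue<>(minBuffer);
        this.poller = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    // put() blocks when the buffer is full, so the source is
                    // only polled when there is room: natural backpressure.
                    for (String record : source.get()) {
                        buffer.put(record);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop requested
            }
        });
        this.poller.setDaemon(true);
        this.poller.start();
    }

    /** Drain up to n buffered records without ever touching the source. */
    List<String> take(int n) throws InterruptedException {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            String record = buffer.poll(100, TimeUnit.MILLISECONDS);
            if (record == null) break; // buffer ran dry
            out.add(record);
        }
        return out;
    }

    void stop() {
        poller.interrupt();
    }
}
```

Because the batch thread only calls take(), a slow upstream poll shows up as an emptier buffer rather than as blocked batch time, which is exactly the behavior the issue asks for.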
> Ex:
>  Let's consider:
>  * each Kafka poll takes 2 seconds on average
>  * the batch duration is 10 seconds
>  * processing 100 records takes 10 seconds
>  * each Kafka poll returns 300 records
>  1. Spark job starts
>  2. Batch-1 (100 records) (buffer = 0) (processing time = 10 sec + 2 sec) => 12 sec processing time
>  3. Batch-2 (100 records) (buffer = 200) (processing time = 10 sec) => 10 sec processing time
>  4. Batch-3 (100 records) (buffer = 100) (processing time = 10 sec) => 10 sec processing time
>  5. Batch-4 (100 records) (buffer = 0) (processing time = 10 sec + 2 sec) => 12 sec processing time
> If we poll asynchronously and always maintain 500 records for each partition, only Batch-1 will take 12 seconds. After that, every batch completes in 10 seconds (unless a rebalance or failure occurs, in which case the buffer is cleared and the next batch again takes 12 seconds).
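The arithmetic in the example above can be checked with a short illustrative function (the function name and signature are made up for this sketch): a batch pays the poll cost only when its partition buffer cannot cover the batch.

```java
// Reproduces the worked example: 2 s poll, 10 s processing, 300 records
// per poll, 100 records per batch. These numbers come from the example
// above; the function itself is only an illustration.
class BatchTiming {
    static int[] batchTimes(int numBatches, int perPoll, int perBatch,
                            int pollSec, int procSec) {
        int[] times = new int[numBatches];
        int buffer = 0;
        for (int i = 0; i < numBatches; i++) {
            int pollCost = 0;
            if (buffer < perBatch) {   // buffer cannot cover the batch: block on poll
                buffer += perPoll;
                pollCost = pollSec;
            }
            buffer -= perBatch;        // the batch consumes its records
            times[i] = procSec + pollCost;
        }
        return times;
    }
}
```

With the example's numbers, batchTimes(4, 300, 100, 2, 10) yields {12, 10, 10, 12}, matching Batch-1 through Batch-4 above.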



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org