Posted to issues@spark.apache.org by "Gaurav Shah (JIRA)" <ji...@apache.org> on 2017/01/27 05:08:24 UTC

[jira] [Commented] (SPARK-19304) Kinesis checkpoint recovery is 10x slow

    [ https://issues.apache.org/jira/browse/SPARK-19304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15841009#comment-15841009 ] 

Gaurav Shah commented on SPARK-19304:
-------------------------------------

There are two issues in `KinesisSequenceRangeIterator.getNext`

* When `internalIterator` is exhausted, `getNext` calls `getRecords`, which in turn makes two API calls: one to get a shard iterator and another to get the records. The extra call can be avoided by storing the next iterator returned from `getRecordsAndNextKinesisIterator` (see the first sketch after the example below).

* The second issue is more involved: each checkpointed block results in a separate API call. Consider one shard and 10 checkpointed blocks. For each block (or range) we invoke `new KinesisSequenceRangeIterator`, which makes its own `getRecords` call. That call can return far more records than the range actually needs, and the surplus is thrown away. For the second range we again call `new KinesisSequenceRangeIterator`, even though its records may already have been returned by the first API call (see the second sketch after the example below).

Example:
ranges [1-10], [11-20], [21-30], [31-40]. The first `new KinesisSequenceRangeIterator` fetches records [1-30] but discards everything after 10. The next `new KinesisSequenceRangeIterator` fetches records [11-40] again but uses only [11-20].
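A minimal sketch of the first fix, assuming the AWS SDK for Java Kinesis client. The class name `CachingShardReader` and its shape are hypothetical, not the actual Spark code; the point is only that the iterator returned in each `GetRecords` response can be cached, so refilling the internal iterator costs one API call instead of two:

{code:scala}
import scala.collection.JavaConverters._
import com.amazonaws.services.kinesis.AmazonKinesis
import com.amazonaws.services.kinesis.model.{GetRecordsRequest, GetShardIteratorRequest, Record, ShardIteratorType}

// Hypothetical reader that caches the shard iterator between GetRecords calls.
class CachingShardReader(client: AmazonKinesis,
                         streamName: String,
                         shardId: String,
                         startSeqNum: String,
                         limit: Int) {

  // GetShardIterator is called exactly once, for the starting position.
  private var shardIterator: String = {
    val req = new GetShardIteratorRequest()
      .withStreamName(streamName)
      .withShardId(shardId)
      .withShardIteratorType(ShardIteratorType.AT_SEQUENCE_NUMBER)
      .withStartingSequenceNumber(startSeqNum)
    client.getShardIterator(req).getShardIterator
  }

  // Each refill is a single GetRecords call; the iterator for the next refill
  // comes from the response instead of another GetShardIterator round trip.
  def nextBatch(): Seq[Record] = {
    if (shardIterator == null) return Seq.empty // shard closed or fully read
    val result = client.getRecords(
      new GetRecordsRequest().withShardIterator(shardIterator).withLimit(limit))
    shardIterator = result.getNextShardIterator
    result.getRecords.asScala.toList
  }
}
{code}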
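And a sketch of the second fix, reusing the hypothetical `CachingShardReader` above: walk the shard once and slice the buffered records into the checkpointed ranges, instead of issuing a fresh `getRecords` pass per range and discarding the overlap. `SequenceRange` here is a stand-in for the connector's own range class:

{code:scala}
import com.amazonaws.services.kinesis.model.Record

// Stand-in for the connector's own sequence-number range class.
case class SequenceRange(fromSeqNum: String, toSeqNum: String)

// Walk the shard once and serve every range from the same buffered read.
def readRanges(reader: CachingShardReader,
               ranges: Seq[SequenceRange]): Map[SequenceRange, Seq[Record]] = {
  // Kinesis sequence numbers are numeric strings, increasing within a shard.
  def seq(r: Record): BigInt = BigInt(r.getSequenceNumber)
  val lastNeeded = ranges.map(r => BigInt(r.toSeqNum)).max

  // Simplified: a real reader would distinguish an empty poll from a drained
  // shard; here an empty batch is treated as end-of-shard.
  val buffered = Iterator.continually(reader.nextBatch())
    .takeWhile(_.nonEmpty)
    .flatten
    .takeWhile(rec => seq(rec) <= lastNeeded)
    .toVector

  // One buffered pass covers all ranges; nothing is re-fetched or discarded.
  ranges.map { r =>
    r -> buffered.filter(rec =>
      seq(rec) >= BigInt(r.fromSeqNum) && seq(rec) <= BigInt(r.toSeqNum))
  }.toMap
}
{code}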

> Kinesis checkpoint recovery is 10x slow
> ---------------------------------------
>
>                 Key: SPARK-19304
>                 URL: https://issues.apache.org/jira/browse/SPARK-19304
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.0
>         Environment: using S3 for checkpoints; 1 executor with 19g memory and 3 cores
>            Reporter: Gaurav Shah
>              Labels: kinesis
>
> Application runs fine initially, running 1-hour batches with processing time under 30 minutes on average. Suppose that for some reason the application crashes and we try to restart from the checkpoint: processing now takes forever and does not move forward. We tried the same thing at a batch interval of 1 minute; processing runs fine and a batch finishes in about 1.2 minutes, but when we recover from the checkpoint each batch takes about 15 minutes. After the recovery the batches again process at normal speed.
> I suspect the KinesisBackedBlockRDD used for recovery is causing the slowdown.
> Stackoverflow post with more details: http://stackoverflow.com/questions/38390567/spark-streaming-checkpoint-recovery-is-very-very-slow


