You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Abhishek Anand <ab...@gmail.com> on 2016/02/01 11:31:00 UTC

Re: Repartition taking place for all previous windows even after checkpointing

Any insights on this ?


On Fri, Jan 29, 2016 at 1:08 PM, Abhishek Anand <ab...@gmail.com>
wrote:

> Hi All,
>
> Can someone help me with the following doubts regarding checkpointing :
>
> My code flow is something like follows ->
>
> 1) create direct stream from kafka
> 2) repartition kafka stream
> 3)  mapToPair followed by reduceByKey
> 4)  filter
> 5)  reduceByKeyAndWindow without the inverse function
> 6)  write to cassandra
>
> Now when I restart my application from checkpoint, I see repartition and
> other steps being called for the previous windows which takes longer and
> delays my aggregations.
>
> My understanding  was that once data checkpointing is done it should not
> re-read from kafka and use the saved RDDs but guess I am wrong.
>
> Is there a way to avoid the repartition or any workaround for this.
>
> Spark Version is 1.4.0
>
> Cheers !!
> Abhi
>