Posted to user@spark.apache.org by Abhishek Anand <ab...@gmail.com> on 2016/01/29 08:38:19 UTC

Repartition taking place for all previous windows even after checkpointing

Hi All,

Can someone help me with the following doubts regarding checkpointing:

My code flow is roughly as follows (a rough code sketch is included after the list) ->

1) create direct stream from Kafka
2) repartition the Kafka stream
3) mapToPair followed by reduceByKey
4) filter
5) reduceByKeyAndWindow without the inverse function
6) write to Cassandra
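
In code, the six steps above look roughly like the sketch below. This is only an
illustration, not my actual job: the topic name, broker address, partition count,
batch/window durations, and the print() at the end are placeholders, and the real
step 6 writes each window to Cassandra (e.g. via the spark-cassandra-connector)
instead of printing.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;

public class PipelineSketch {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("pipeline-sketch");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));
    jssc.checkpoint("/tmp/checkpoint");  // placeholder checkpoint directory

    Map<String, String> kafkaParams = new HashMap<>();
    kafkaParams.put("metadata.broker.list", "broker:9092");       // placeholder broker
    Set<String> topics = new HashSet<>(Arrays.asList("events"));  // placeholder topic

    // 1) create direct stream from Kafka
    JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, topics);

    // 2) repartition, 3) mapToPair + reduceByKey, 4) filter,
    // 5) reduceByKeyAndWindow without an inverse function
    JavaPairDStream<String, Long> windowed = stream
        .repartition(32)
        .mapToPair(record -> new Tuple2<String, Long>(record._2(), 1L))
        .reduceByKey((a, b) -> a + b)
        .filter(t -> t._2() > 1)
        .reduceByKeyAndWindow((a, b) -> a + b,
            Durations.minutes(10), Durations.seconds(30));

    // 6) output step; the real job writes each window to Cassandra here
    windowed.print();

    jssc.start();
    jssc.awaitTermination();
  }
}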

Now, when I restart my application from the checkpoint, I see repartition and
the other steps being executed again for the previous windows, which takes a
long time and delays my aggregations.

My understanding was that once data checkpointing is done, Spark should not
re-read the data from Kafka but should use the saved RDDs instead; I guess I
am wrong.
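
For reference, my restart-from-checkpoint setup follows the standard Spark 1.4
Java pattern; a minimal sketch is below. The checkpoint directory and the
buildPipeline() helper are placeholders, not my actual code. The point is only
that the DStream graph is defined inside the factory, so it is built fresh only
when no checkpoint exists.

import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;

public class CheckpointRecoverySketch {
  private static final String CHECKPOINT_DIR = "hdfs:///tmp/checkpoint";  // placeholder

  public static void main(String[] args) {
    JavaStreamingContextFactory factory = new JavaStreamingContextFactory() {
      @Override
      public JavaStreamingContext create() {
        // Only called when no checkpoint is found in CHECKPOINT_DIR:
        // build the context and the whole Kafka -> ... -> Cassandra pipeline here.
        JavaStreamingContext jssc = buildPipeline();  // placeholder helper
        jssc.checkpoint(CHECKPOINT_DIR);
        return jssc;
      }
    };

    // On restart this recovers the context, including in-flight window state,
    // from the checkpoint instead of calling the factory.
    JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(CHECKPOINT_DIR, factory);
    jssc.start();
    jssc.awaitTermination();
  }

  private static JavaStreamingContext buildPipeline() {
    // Placeholder for the pipeline sketched above.
    throw new UnsupportedOperationException("illustrative sketch only");
  }
}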

Is there a way to avoid the repartition, or is there any workaround for this?

The Spark version is 1.4.0.

Cheers !!
Abhi

Re: Repartition taking place for all previous windows even after checkpointing

Posted by Abhishek Anand <ab...@gmail.com>.
Any insights on this?


On Fri, Jan 29, 2016 at 1:08 PM, Abhishek Anand <ab...@gmail.com>
wrote:

> Hi All,
>
> Can someone help me with the following doubts regarding checkpointing:
>
> My code flow is roughly as follows ->
>
> 1) create direct stream from Kafka
> 2) repartition the Kafka stream
> 3) mapToPair followed by reduceByKey
> 4) filter
> 5) reduceByKeyAndWindow without the inverse function
> 6) write to Cassandra
>
> Now, when I restart my application from the checkpoint, I see repartition and
> the other steps being executed again for the previous windows, which takes a
> long time and delays my aggregations.
>
> My understanding was that once data checkpointing is done, Spark should not
> re-read the data from Kafka but should use the saved RDDs instead; I guess I
> am wrong.
>
> Is there a way to avoid the repartition, or is there any workaround for this?
>
> The Spark version is 1.4.0.
>
> Cheers !!
> Abhi
>