Posted to user@spark.apache.org by Stephen Durfey <sj...@gmail.com> on 2015/08/13 23:53:02 UTC

Retrieving offsets from previous spark streaming checkpoint

When deploying a Spark Streaming application, I want to be able to retrieve
the latest Kafka offsets that were processed by the pipeline and create
my Kafka direct streams from those offsets. Because the checkpoint
directory isn't guaranteed to be compatible between job deployments, I
don't want to re-use the checkpoint directory from the previous job
deployment. I also don't want to have to re-process everything in my Kafka
queues. Is there any way to retrieve this information from the checkpoint
directory, or has anyone else solved this problem already?

* I apologize if this is a duplicate message. I didn't see it go through
earlier today, and I didn't see it in the archive.

Re: Retrieving offsets from previous spark streaming checkpoint

Posted by Cody Koeninger <co...@koeninger.org>.
Access the offsets using HasOffsetRanges, save them in your datastore, and
provide them as the fromOffsets argument when starting the stream.

See https://github.com/koeninger/kafka-exactly-once
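
For reference, a rough sketch of that pattern against the Spark 1.3+ direct
stream API (Scala). The saveOffsets/loadOffsets helpers are hypothetical
placeholders for whatever datastore you choose; the linked repo has a
complete, transactional version.

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object OffsetsOutsideCheckpoint {

  // Hypothetical stubs: persist/load offsets in whatever store you already
  // use (ZooKeeper, HBase, an RDBMS, ...). Not part of the Spark API.
  def saveOffsets(ranges: Array[OffsetRange]): Unit = ???
  def loadOffsets(): Map[TopicAndPartition, Long] = ???

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-stream-from-saved-offsets")
    val ssc  = new StreamingContext(conf, Seconds(5))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // On (re)deploy, start from the offsets the previous run saved,
    // instead of relying on the old checkpoint directory.
    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, (String, String)](
      ssc,
      kafkaParams,
      loadOffsets(),
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

    stream.foreachRDD { rdd =>
      // RDDs produced by the direct stream implement HasOffsetRanges.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // ... do your processing here ...

      // Record the offsets after (or ideally atomically with) your output.
      saveOffsets(ranges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

If the datastore is empty (first deploy), fall back to the topic-set overload
of createDirectStream and let auto.offset.reset in kafkaParams decide where
to begin.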
