Posted to user@spark.apache.org by Abhishek Anand <ab...@gmail.com> on 2016/01/11 10:09:40 UTC

Getting kafka offsets at beginning of spark streaming application

Hi,

Is there a way to fetch the offsets from which Spark Streaming starts
reading from Kafka when my application starts?

What I am trying to do is create an initial RDD covering the range from the
offsets at a particular time (passed as input on the command line) up to
the offsets from which my Spark Streaming job starts.

Eg -

Partition 0 -> 1000 to (offset at which my spark streaming starts)

Thanks !!

Re: Getting kafka offsets at beginning of spark streaming application

Posted by Cody Koeninger <co...@koeninger.org>.
You can use HasOffsetRanges to get the offsets from the RDD; see
http://spark.apache.org/docs/latest/streaming-kafka-integration.html

Although if you're already saving the offsets to a DB, why not just use
that as the starting point of your application?
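
A minimal sketch of the first suggestion, written in plain Python so that it
runs without a Spark cluster. The OffsetRange tuple below is only a stand-in
that mirrors the fields of Spark's OffsetRange (topic, partition, fromOffset,
untilOffset); FirstBatchOffsets, record and the "events" topic are
hypothetical names. In a real job the list of ranges would come from each
batch RDD via HasOffsetRanges, as described in the linked guide.

```python
from collections import namedtuple

# Stand-in for Spark's OffsetRange: same four fields, but not the real class.
OffsetRange = namedtuple("OffsetRange", "topic partition fromOffset untilOffset")


class FirstBatchOffsets:
    """Remembers the starting offset of every partition in the first batch only."""

    def __init__(self):
        self.start = None  # {(topic, partition): fromOffset}, set exactly once

    def record(self, offset_ranges):
        # Later batches are ignored: only the first one shows where the stream began.
        if self.start is None:
            self.start = {(r.topic, r.partition): r.fromOffset for r in offset_ranges}
        return self.start


tracker = FirstBatchOffsets()
tracker.record([OffsetRange("events", 0, 1500, 1600)])  # first batch
tracker.record([OffsetRange("events", 0, 1600, 1700)])  # second batch, ignored
print(tracker.start)
```

Calling record from the per-batch callback would leave tracker.start holding
the offsets at which the stream actually started, which is the value asked
about in this thread.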

On Mon, Jan 11, 2016 at 11:00 AM, kundan kumar <ii...@gmail.com>
wrote:

> Hi Cody,
>
> My use case is as follows:
>
> My application dies at time X and I write the offsets to a DB.
>
> When my application restarts at time Y (a few minutes later), Spark
> Streaming reads the latest offsets using the createDirectStream method.
> I want to get the exact offsets picked up by createDirectStream at the
> beginning of the batch. I need this to create an initialRDD.
>
> Please let me know if anything is unclear.
>
> Thanks !!!

Re: Getting kafka offsets at beginning of spark streaming application

Posted by kundan kumar <ii...@gmail.com>.
Hi Cody,

My use case is as follows:

My application dies at time X and I write the offsets to a DB.

When my application restarts at time Y (a few minutes later), Spark
Streaming reads the latest offsets using the createDirectStream method.
I want to get the exact offsets picked up by createDirectStream at the
beginning of the batch. I need this to create an initialRDD.
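
The gap described above can be sketched as follows. This is an illustration
in plain Python, not Spark code: gap_ranges, saved, started and the "events"
topic are hypothetical names, and the sketch assumes that a partition with no
saved offset has nothing to backfill. The resulting (topic, partition, from,
until) tuples are the kind of ranges one would hand to a batch read of Kafka
to build the initial RDD.

```python
def gap_ranges(saved, started):
    """Return (topic, partition, from_offset, until_offset) tuples for the gap.

    saved:   {(topic, partition): offset stored in the DB when the app died}
    started: {(topic, partition): offset at which the new stream began}
    """
    ranges = []
    for key in sorted(started):
        topic, partition = key
        until_offset = started[key]
        # Assumption: a partition with no saved offset has no gap to backfill.
        from_offset = saved.get(key, until_offset)
        if from_offset < until_offset:
            ranges.append((topic, partition, from_offset, until_offset))
    return ranges


saved = {("events", 0): 1000, ("events", 1): 2000}
started = {("events", 0): 1500, ("events", 1): 2000}
print(gap_ranges(saved, started))
```

Partition 1 is skipped here because its saved and starting offsets are equal,
so only partition 0 produces a range.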

Please let me know if anything is unclear.

Thanks !!!


On Mon, Jan 11, 2016 at 8:54 PM, Cody Koeninger <co...@koeninger.org> wrote:

> I'm not 100% sure what you're asking.
>
> If you're asking whether it's possible to start a stream at a particular
> set of offsets, yes: one of the createDirectStream methods takes a map
> from topic-partition to starting offset.
>
> If you're asking whether it's possible to query Kafka for the offset
> corresponding to a particular time, yes, but the granularity of that API
> is very poor, because it's based on filesystem timestamps. You're better
> off keeping an index of time to offset on your own.

Re: Getting kafka offsets at beginning of spark streaming application

Posted by Cody Koeninger <co...@koeninger.org>.
I'm not 100% sure what you're asking.

If you're asking whether it's possible to start a stream at a particular
set of offsets, yes: one of the createDirectStream methods takes a map
from topic-partition to starting offset.

If you're asking whether it's possible to query Kafka for the offset
corresponding to a particular time, yes, but the granularity of that API
is very poor, because it's based on filesystem timestamps. You're better
off keeping an index of time to offset on your own.
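
One possible shape for such a self-maintained index, as a rough sketch in
plain Python. TimeOffsetIndex, record and offset_at are hypothetical names,
timestamps are assumed to be epoch seconds recorded in non-decreasing order,
and one index per partition is assumed. Nothing here is part of Spark or
Kafka.

```python
import bisect


class TimeOffsetIndex:
    """Per-partition index from timestamp to offset, kept by the application."""

    def __init__(self):
        self._times = []    # ascending timestamps (assumed epoch seconds)
        self._offsets = []  # offset observed at the matching timestamp

    def record(self, timestamp, offset):
        # Assumes entries arrive in non-decreasing time order.
        self._times.append(timestamp)
        self._offsets.append(offset)

    def offset_at(self, timestamp):
        """Latest recorded offset at or before `timestamp`, or None if none exists."""
        i = bisect.bisect_right(self._times, timestamp)
        return self._offsets[i - 1] if i else None


index = TimeOffsetIndex()
index.record(100, 1000)
index.record(160, 1250)
index.record(220, 1480)
print(index.offset_at(200))
```

Looking up a time passed on the command line then gives a per-partition
starting offset, which could be placed in the map that createDirectStream
accepts.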

On Mon, Jan 11, 2016 at 3:09 AM, Abhishek Anand <ab...@gmail.com>
wrote:

> Hi,
>
> Is there a way to fetch the offsets from which Spark Streaming starts
> reading from Kafka when my application starts?
>
> What I am trying to do is create an initial RDD covering the range from the
> offsets at a particular time (passed as input on the command line) up to
> the offsets from which my Spark Streaming job starts.
>
> Eg -
>
> Partition 0 -> 1000 to (offset at which my spark streaming starts)
>
> Thanks !!
>
>
>