Posted to user@storm.apache.org by Milind Vaidya <ka...@gmail.com> on 2016/05/04 18:31:47 UTC

Getting Kafka Offset in Storm Bolt

Is there any way I can know which Kafka offset corresponds to the current
tuple I am processing in a bolt?

Use case: I need to batch events from Kafka, persist them to a local file
and eventually upload it to S3. To manage failure cases, I need to know the
Kafka offset for each message, so that it can be persisted to Zookeeper and
used when writing / uploading the file.
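
For illustration, this is roughly what the bolt-side access looks like once
the spout is configured to emit the offset next to the payload (see the
scheme discussion later in the thread). A minimal sketch, assuming Storm 1.x
package names (0.9.x uses backtype.storm.*); the field names "str",
"partition" and "offset" are assumptions, not defaults:

import org.apache.storm.tuple.Tuple;

public final class OffsetAccess {

    private OffsetAccess() {}

    // Pull the Kafka coordinates out of a tuple emitted by a metadata-aware
    // scheme; the field names are whatever the configured scheme declares.
    public static String describe(Tuple tuple) {
        String payload = tuple.getStringByField("str");
        int partition = tuple.getIntegerByField("partition");
        long offset = tuple.getLongByField("offset");
        return String.format("partition=%d offset=%d payload=%s",
                partition, offset, payload);
    }
}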

Re: Getting Kafka Offset in Storm Bolt

Posted by Milind Vaidya <ka...@gmail.com>.
Well, I will have a look into it.


I know it somewhat goes against Storm's streaming model to use it to batch
data. But S3 does not support appending individual messages to an object, so
the data needs to be batched, persisted locally and then bulk uploaded.

I am just trying to explore whether it is possible, since we have a pretty
stable Kafka-Storm setup.
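
For what it is worth, the bulk-upload step itself is small with the AWS SDK
for Java. A hedged sketch (the bucket name and key prefix are placeholders):

import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class BatchUploader {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // S3 objects are immutable, so each locally rolled batch file becomes
    // its own object instead of an append to an existing one.
    public void upload(File batchFile) {
        s3.putObject("my-events-bucket",
                "storm-batches/" + batchFile.getName(), batchFile);
    }
}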


Re: Getting Kafka Offset in Storm Bolt

Posted by saiprasad mishra <sa...@gmail.com>.
I think you are looking for MessageMetadataSchemeAsMultiScheme, which can be
set on the Kafka spout config; I believe it was added in 1.0.0.
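
For illustration, a minimal sketch of that wiring against the storm-kafka
1.x API (class and method names are from that module and may differ slightly
between versions; the ZooKeeper string, topic, zkRoot and spout id are
placeholders):

import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.List;

import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.MessageMetadataScheme;
import org.apache.storm.kafka.MessageMetadataSchemeAsMultiScheme;
import org.apache.storm.kafka.Partition;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// A scheme that emits the partition and offset next to the message payload,
// so every downstream bolt can read them straight off the tuple.
public class OffsetAwareScheme implements MessageMetadataScheme {

    @Override
    public Iterable<List<Object>> deserializeMessageWithMetadata(
            ByteBuffer message, Partition partition, long offset) {
        List<Object> tuple =
                new Values(StringScheme.deserializeString(message),
                        partition.partition, offset);
        return Collections.singletonList(tuple);
    }

    @Override
    public List<Object> deserialize(ByteBuffer message) {
        // Fallback used on code paths that have no metadata to hand over.
        return new Values(StringScheme.deserializeString(message));
    }

    @Override
    public Fields getOutputFields() {
        return new Fields("str", "partition", "offset");
    }

    // How it plugs into the spout config.
    public static KafkaSpout buildSpout() {
        BrokerHosts hosts = new ZkHosts("zkhost1:2181");
        SpoutConfig cfg =
                new SpoutConfig(hosts, "events", "/kafka-spout", "s3-batcher");
        cfg.scheme = new MessageMetadataSchemeAsMultiScheme(new OffsetAwareScheme());
        return new KafkaSpout(cfg);
    }
}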
Regards
Sai


Re: Getting Kafka Offset in Storm Bolt

Posted by Milind Vaidya <ka...@gmail.com>.
That was "the" thing in mind. I guess I should give it a try and then see
how it performs and see how convenient it is, can't just speculate that.


Re: Getting Kafka Offset in Storm Bolt

Posted by Nathan Leung <nc...@gmail.com>.
You don't have to batch the whole tuple in supervisor memory; the data is
already in Kafka. Just keep the tuple ID and write the message to the file.
When you close the file, ack all of the tuple IDs.
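
A rough sketch of that pattern, assuming Storm 1.x package names (0.9.x uses
backtype.storm.*); the output path, input field name and rotation threshold
are made up:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class FileBatchingBolt extends BaseRichBolt {

    private static final int BATCH_SIZE = 10_000;  // hypothetical threshold

    private OutputCollector collector;
    private List<Tuple> pending;   // written to the file but not yet acked
    private BufferedWriter writer;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.pending = new ArrayList<>();
        openNewFile();
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            writer.write(tuple.getStringByField("str"));
            writer.newLine();
            pending.add(tuple);          // only the reference; data stays in Kafka
            if (pending.size() >= BATCH_SIZE) {
                rotate();
            }
        } catch (IOException e) {
            collector.fail(tuple);       // let the Kafka spout replay it
        }
    }

    private void rotate() throws IOException {
        writer.close();                  // file is now the hand-off point for the uploader
        for (Tuple t : pending) {
            collector.ack(t);            // only now tell the spout these offsets are safe
        }
        pending.clear();
        openNewFile();
    }

    private void openNewFile() {
        try {
            writer = new BufferedWriter(
                    new FileWriter("/tmp/batch-" + System.currentTimeMillis() + ".log"));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: nothing emitted downstream
    }
}

One caveat: tuples held back like this must be acked before
topology.message.timeout.secs expires, otherwise the spout starts replaying
them, so the rotation threshold has to be sized with that timeout in mind.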

Re: Getting Kafka Offset in Storm Bolt

Posted by Milind Vaidya <ka...@gmail.com>.
I also considered that. The approach was as follows:

1. Can the existing Storm-Kafka setup be leveraged?
2. Is there any "proven" open source framework for the same?

Spark looks like the next "best" option, since it keeps the paradigm the same.

We also considered
Secor (https://github.com/pinterest/secor/blob/master/DESIGN.md);
Streamx (https://github.com/qubole/streamx) looks promising too,
with Secor looking the more promising of the two.






Re: Getting Kafka Offset in Storm Bolt

Posted by Steven Lewis <St...@walmart.com>.
It sounds like you want to use Spark / Spark Streaming to do that kind of batching output.


Re: Getting Kafka Offset in Storm Bolt

Posted by Milind Vaidya <ka...@gmail.com>.
Yeah. We have some microbatching in place for other topologies. This one is a
little ambitious, in the sense that each message is 1-2 KB in size, so
grouping them into a reasonably sized chunk is necessary, say 500 KB to 1 GB
(this is just my guess; I am not sure what S3 supports or what is optimal).
Once that chunk is uploaded, all of the messages can be acked. But isn't that
overkill? I guess Storm is not even meant to support that kind of use case.


Re: Getting Kafka Offset in Storm Bolt

Posted by Nathan Leung <nc...@gmail.com>.
You can micro-batch the Kafka contents into a file that's replicated (e.g. on
HDFS) and then ack all of the input tuples after the file has been closed.
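
For the HDFS variant, the storm-hdfs module already handles the write and
rotate part; a sketch of its documented builder API (the namenode URL, path,
delimiter and sizes are placeholders):

import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.format.FileNameFormat;
import org.apache.storm.hdfs.bolt.format.RecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;
import org.apache.storm.hdfs.bolt.sync.SyncPolicy;

public class HdfsBatching {

    public static HdfsBolt buildHdfsBolt() {
        RecordFormat format = new DelimitedRecordFormat().withFieldDelimiter("|");
        SyncPolicy syncPolicy = new CountSyncPolicy(1000);       // hsync every 1000 tuples
        FileRotationPolicy rotationPolicy =
                new FileSizeRotationPolicy(512.0f, Units.MB);    // roll files at ~512 MB
        FileNameFormat fileNameFormat =
                new DefaultFileNameFormat().withPath("/storm/kafka-batches/");

        return new HdfsBolt()
                .withFsUrl("hdfs://namenode.example.com:8020")
                .withFileNameFormat(fileNameFormat)
                .withRecordFormat(format)
                .withRotationPolicy(rotationPolicy)
                .withSyncPolicy(syncPolicy);
    }
}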


Re: Getting Kafka Offset in Storm Bolt

Posted by Milind Vaidya <ka...@gmail.com>.
In case of a failure to upload a file, or disk corruption leading to loss of
a file, we have only the current offset in the Kafka spout but no record of
which offsets were in the lost file and need to be replayed. So these offsets
can be stored externally in ZooKeeper and used to account for lost data. For
them to be saved in ZooKeeper, they need to be available in a bolt.
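
A hedged sketch of that ZooKeeper bookkeeping using Apache Curator (the znode
layout and connect string are made up for illustration):

import java.nio.charset.StandardCharsets;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class OffsetBookkeeper implements AutoCloseable {

    private final CuratorFramework client;

    public OffsetBookkeeper(String zkConnect) {
        this.client = CuratorFrameworkFactory.newClient(
                zkConnect, new ExponentialBackoffRetry(1000, 3));
        this.client.start();
    }

    // Record the offset range contained in a local batch file before it is
    // uploaded, so a lost file can be mapped back to offsets to replay.
    public void recordBatch(String topic, int partition, long firstOffset,
                            long lastOffset, String fileName) throws Exception {
        String path = String.format("/s3-batcher/%s/%d/%s", topic, partition, fileName);
        byte[] payload = (firstOffset + "-" + lastOffset).getBytes(StandardCharsets.UTF_8);
        client.create().creatingParentsIfNeeded().forPath(path, payload);
    }

    // Drop the marker once the file has been uploaded to S3 successfully.
    public void markUploaded(String topic, int partition, String fileName) throws Exception {
        client.delete().forPath(String.format("/s3-batcher/%s/%d/%s", topic, partition, fileName));
    }

    @Override
    public void close() {
        client.close();
    }
}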


Re: Getting Kafka Offset in Storm Bolt

Posted by Nathan Leung <nc...@gmail.com>.
Why not just ack the tuple once it's been written to a file? If your topology
fails, the data will be re-read from Kafka; the Kafka spout already does this
for you. Uploading the files to S3 is then the responsibility of another job,
for example a storm topology that monitors the output folder.

Monitoring the data from Kafka all the way out to S3 seems unnecessary.
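
One way to realize the "another job monitors the output folder" part without
Storm at all is a plain java.nio.file WatchService. A sketch (the directory
is a placeholder and BatchUploader is the hypothetical uploader sketched
earlier in the thread):

import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class OutputFolderWatcher {

    public static void main(String[] args) throws Exception {
        Path outputDir = Paths.get("/var/storm/batches");  // placeholder folder
        WatchService watcher = FileSystems.getDefault().newWatchService();
        outputDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);

        BatchUploader uploader = new BatchUploader();
        while (true) {
            WatchKey key = watcher.take();                 // block until something appears
            for (WatchEvent<?> event : key.pollEvents()) {
                // ENTRY_CREATE fires as soon as the file exists, possibly before
                // the writer has closed it; a real job would watch for a rename
                // or ".done" marker written after the file is closed.
                Path created = outputDir.resolve((Path) event.context());
                uploader.upload(created.toFile());
            }
            key.reset();
        }
    }
}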


Re: Getting Kafka Offset in Storm Bolt

Posted by Milind Vaidya <ka...@gmail.com>.
It does not matter, in the sense that I am ready to upgrade if this is on the
roadmap.

Nonetheless:

kafka_2.9.2-0.8.1.1, apache-storm-0.9.4





Re: Getting Kafka Offset in Storm Bolt

Posted by Abhishek Agarwal <ab...@gmail.com>.
Which version of storm-kafka are you using?



-- 
Regards,
Abhishek Agarwal

Re: Getting Kafka Offset in Storm Bolt

Posted by Milind Vaidya <ka...@gmail.com>.
Anybody? Anything about this?
