Posted to users@kafka.apache.org by R S <my...@gmail.com> on 2012/04/13 10:01:38 UTC

Can hadoop-consumer be time based instead of offset based

Hi,

I looked at hadoop-consumer, which fetches data directly from the Kafka
broker. From what I understand, it works on a min and max offset, and the
map tasks complete once they reach the maximum offset for a given topic.

In our use case we would not know the max offset beforehand. Instead, we
want the map tasks to keep reading data from a min offset and roll over
every 30 minutes. At the 30th minute we would again generate the offsets
to be used for the next run.

Any suggestions would be helpful.

Regards,
rks

Re: Can hadoop-consumer be time based instead of offset based

Posted by Jun Rao <ju...@gmail.com>.
Currently, as you iterate over the messages returned by SimpleConsumer, you
also get the offset of the next message. In the map task, you can just run
for 30 minutes and save that next offset for the next run.
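A minimal sketch of that time-bounded loop, in Python for brevity. The
(next_offset, payload) iterator stands in for whatever the real
SimpleConsumer fetch returns, and all the names here are hypothetical, not
part of any Kafka API:

```python
import time

def consume_for(messages, start_offset, duration_secs, clock=time.monotonic):
    """Consume (next_offset, payload) pairs for a bounded wall-clock window.

    Returns the payloads consumed plus the offset to persist, so the next
    run can resume exactly where this one stopped.
    """
    deadline = clock() + duration_secs
    next_offset = start_offset
    consumed = []
    for next_off, payload in messages:
        if clock() >= deadline:
            break  # 30-minute window is up; stop without reading further
        consumed.append(payload)
        next_offset = next_off  # offset of the message after this one
    return consumed, next_offset
```

The key point is that the stopping condition is wall-clock time, not a
pre-computed max offset, and the saved `next_offset` replaces the offset
file the stock hadoop-consumer would have written.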

Thanks,

Jun


Re: Can hadoop-consumer be time based instead of offset based

Posted by Neha Narkhede <ne...@gmail.com>.
> we want map to keep reading data from a min offset and roll over every 30
> mins . At 30th min we would again generate the offsets which would be used
> for the next run.

Using the max offset would avoid deserializing the data. You could use a
timestamp too, but for that you would need to include a timestamp in your
Kafka message and then deserialize the data in the map task.
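A rough sketch of that timestamp-based variant, assuming each message
payload embeds its own timestamp. The JSON envelope and the "ts" field
are made up for illustration; Kafka itself attaches no timestamp to
messages here, which is exactly why the map task has to deserialize each
record to find it:

```python
import json

def consume_until(messages, start_offset, window_end_ts):
    """Consume (next_offset, payload) pairs until a message's embedded
    timestamp reaches the end of the time window.

    Returns the decoded records plus the offset to persist for the next
    run. Contrast with the max-offset approach, which can stop without
    ever decoding a payload.
    """
    next_offset = start_offset
    consumed = []
    for next_off, payload in messages:
        record = json.loads(payload)  # the deserialization cost Neha notes
        if record["ts"] >= window_end_ts:
            break  # first message past the window: stop before it
        consumed.append(record)
        next_offset = next_off
    return consumed, next_offset
```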

Thanks,
Neha