Posted to users@kafka.apache.org by James Cheng <jc...@tivo.com> on 2015/07/21 03:24:07 UTC

Consuming from Kafka but don't need to save offsets

Hi,

I have a web service that serves up some data that it obtains from a kafka topic. When the process starts up, it wants to load the entire kafka topic into memory, and serve the data up from an in-memory hashtable. The data in the topic has primary keys and is log compacted, and so the total dataset will be small enough to fit in memory. My web service will only start serving up data when the entire topic is loaded. (And for that, https://issues.apache.org/jira/browse/KAFKA-1977 would be super useful).
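Until KAFKA-1977 lands, one way to approximate "the entire topic is loaded" is to snapshot each partition's end offset at startup and keep consuming until the consumer's position reaches that snapshot for every partition. A minimal, broker-free sketch of just that check (the partition and offset numbers are made up for illustration):

```python
def caught_up(positions, end_offsets):
    """True once every partition has been consumed up to the end offset
    snapshotted at startup. Both args map partition -> offset, where a
    'position' is the offset of the next record the consumer would read."""
    return all(positions.get(p, 0) >= end for p, end in end_offsets.items())

# Hypothetical end-offset snapshot taken when the process starts:
end_offsets = {0: 100, 1: 250}

positions = {0: 100, 1: 200}   # partition 1 is still catching up
print(caught_up(positions, end_offsets))  # False

positions = {0: 100, 1: 250}   # both partitions reached the snapshot
print(caught_up(positions, end_offsets))  # True: safe to start serving
```

Note that with a live producer the log keeps growing past the snapshot; the snapshot only bounds the initial bootstrap.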

I am only storing this data in memory. In the event of process death or restart, my in-memory state is gone, and so I will always want to rebuild it by again consuming the topic from the earliest offset. I will never need to checkpoint my offsets.
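The rebuild step is just a left-to-right replay of the topic: since log compaction retains at least the latest record per key, applying records in offset order with last-write-wins (and treating a null value as a delete tombstone) reproduces the current state. A broker-free sketch of that fold, with made-up records:

```python
def rebuild(records):
    """Replay (key, value) records in offset order into a hashtable.
    A value of None is a compaction tombstone and deletes the key."""
    table = {}
    for key, value in records:
        if value is None:
            table.pop(key, None)   # tombstone: drop the key if present
        else:
            table[key] = value     # last write for a key wins
    return table

# Hypothetical topic contents after partial compaction:
records = [("a", 1), ("b", 2), ("a", 3), ("b", None)]
print(rebuild(records))  # {'a': 3}
```

This is also why no offset checkpointing is needed: the topic itself is the durable state, and the hashtable is a disposable materialized view of it.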

Also, I will have N instances of this application, each one needing to consume the entire topic. This is how I plan to do horizontal scaling of my web service.

I would like to use the high level consumer, so that I don't need to manually discover which broker is the leader, and so that I don't have to handle leader rebalancing.

A couple questions:
1) Does this use case make sense? Is this pattern used by anyone else? I like it because it makes my web service completely stateless.
2) In order to make each instance consume all partitions of the topic, I need each consumer group id to be unique to that process. So I was thinking of just using a UUID or something similar. What is the "cost" of creating a new consumer group id? If I am creating a new one every time I start my application, would I be cluttering up zookeeper or the __consumer_offsets topic? Note there will only ever be N instances of my application running. Since I will never need to checkpoint my offsets, does that affect my question about "cluttering up" zookeeper/kafka? Are old consumer groups ever cleaned up out of zookeeper or the __consumer_offsets topic?
3) Are the stored offsets used for any other reason, aside from at startup of a new consumer? Are offsets used after rebalancing when partition leaders change due to broker failure? I know that offsets can be used for Burrow-like monitoring.
4) Since I don't need support for checkpointing, another option is to use the SimpleConsumer. The sample code at https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example looks fairly comprehensive. It handles discovery of the partition leader, and handles leader rebalancing. Are there any other situations that I should be aware of before relying on that sample code?
5) Will any of this change when the new consumer comes out? Will the SimpleConsumer still exist when the new consumer comes out?
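For question 2, generating a throwaway group id per process start is a one-liner; the "webservice" prefix below is illustrative, but keeping some stable prefix makes these ephemeral groups easy to recognize (and manually clean up) later:

```python
import uuid

def fresh_group_id(prefix="webservice"):
    """Build a group id unique to this process start, so each instance
    is alone in its consumer group and is assigned every partition."""
    return "{}-{}".format(prefix, uuid.uuid4())

a, b = fresh_group_id(), fresh_group_id()
print(a != b)  # True: every start gets its own group
```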

Thanks,
-James


Re: Consuming from Kafka but don't need to save offsets

Posted by tao xiao <xi...@gmail.com>.
James,

You can reference the Confluent schema registry implementation:
http://docs.confluent.io/1.0/schema-registry/docs/index.html

It does something similar to what you described: a REST front end that
serves data from a compacted topic, and HA is also provided in the solution.

On Tue, 21 Jul 2015 at 09:25 James Cheng <jc...@tivo.com> wrote:
