Posted to users@kafka.apache.org by Jim <ji...@gmail.com> on 2014/08/01 17:05:34 UTC

Re: Most common kafka client consumer implementations?

Thanks Guozhang,

I was looking for actual real-world workflows. I realize you can commit
after each message, but if you're using ZK for offsets, for instance, you'll
put too much write load on the nodes and crush your throughput. So I was
interested in batching strategies people have used that balance high
throughput with fully committed events.


On Thu, Jul 31, 2014 at 8:16 AM, Guozhang Wang <wa...@gmail.com> wrote:

> Hi Jim,
>
> Whether to use the high level or the simple consumer depends on your use
> case. If you need to manually manage partition assignments among your
> consumers, or you need to commit your offsets somewhere other than ZK, or
> you do not want auto rebalancing of consumers upon failures, etc., you
> would use the simple consumer; otherwise you would use the high level
> consumer.
>
> From your description of pulling a batch of messages, it seems you are
> currently using the simple consumer. Supposing you are using the high
> level consumer, to achieve at-least-once you can basically do something
> like:
>
> message = consumer.iter.next()
> process(message)
> consumer.commit()
>
> which is effectively the same as option 2 for using a simple consumer. Of
> course, doing so has the heavy overhead of one commit per message; you can
> also do option 1, at the cost of duplicates, which is tolerable for
> at-least-once.
>
> Guozhang
>
>
> On Wed, Jul 30, 2014 at 8:25 PM, Jim <ji...@gmail.com> wrote:
>
> > Curious on a couple of questions...
> >
> > Are most people (are you?) using the simple consumer vs. the high level
> > consumer in production?
> >
> > What is the common processing paradigm for maintaining a full pipeline
> > for kafka consumers for at-least-once messaging? E.g. you pull a batch
> > of 1000 messages and:
> >
> > option 1.
> > you wait for the slowest worker to finish working on its message; when
> > you get back 1000 acks internally, you commit your offset and pull
> > another batch
> >
> > option 2.
> > you feed your workers n msgs at a time in sequence and move your offset
> > up as you work through your batch
> >
> > option 3.
> > you maintain a full stream of 1000 messages ideally, and as you get acks
> > back from your workers you see if you can move your offset up in the
> > stream to pull n more messages to fill up your pipeline, so you're not
> > blocked by the slowest consumer (probability wise)
> >
> > any good docs or articles on the subject would be great, thanks!
> >
>
>
>
> --
> -- Guozhang
>
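
Guozhang's per-message commit loop and the batch-commit alternative he contrasts it with can be sketched side by side. This is illustrative only: `FakeConsumer`, `poll`, and `commit` below are hypothetical stand-ins, not any real Kafka client API.

```python
# Illustrative sketch only: FakeConsumer is a hypothetical stand-in for a
# Kafka consumer; real client APIs differ.
class FakeConsumer:
    def __init__(self, messages):
        self.messages = list(messages)
        self.position = 0       # next offset to hand out
        self.committed = 0      # durable offset (e.g. stored in ZK)
        self.commit_count = 0   # how many offset writes we paid for

    def poll(self, max_records):
        batch = self.messages[self.position:self.position + max_records]
        self.position += len(batch)
        return batch

    def commit(self):
        self.committed = self.position
        self.commit_count += 1

def consume_per_message(consumer, process):
    # At-least-once with one commit per message: at most one duplicate on
    # restart, but one offset write per message (heavy load on ZK).
    while True:
        batch = consumer.poll(1)
        if not batch:
            break
        process(batch[0])
        consumer.commit()

def consume_per_batch(consumer, process, batch_size=1000):
    # At-least-once with one commit per batch: far fewer offset writes,
    # but a crash mid-batch replays up to batch_size messages.
    while True:
        batch = consumer.poll(batch_size)
        if not batch:
            break
        for message in batch:
            process(message)
        consumer.commit()
```

For 10 messages, the per-message variant pays 10 offset writes while the batch variant with `batch_size=4` pays 3, which is the throughput trade-off discussed in the thread.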

Re: Most common kafka client consumer implementations?

Posted by Jim <ji...@gmail.com>.
Neha, I believe option 2 is more like the high level consumer, no? The high
level consumer doesn't really guarantee processing of the message. E.g. I
take 1000 messages, feed them off to a bunch of threads, and "hope and
pray" they get processed. Option 3 is something I haven't seen in the kafka
client world yet.

Option 3 would be:

I pull message set [1,2,3,4,5] and fire those off to 5 workers. Those
workers need to ensure successful processing of each message, and you can't
move your offset up until you know all the offsets you're zooming past have
been properly consumed. The high level consumer appears to be a hope & pray
strategy.
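
A minimal sketch of that option-3 bookkeeping (hypothetical code, not any real client API): workers ack offsets out of order, and the committable offset only advances over the lowest contiguous run of acked offsets.

```python
# Illustrative ack tracker for "option 3" (assumed design, not a real
# Kafka client): commit may only advance past contiguously acked offsets.
class AckTracker:
    def __init__(self, start_offset):
        self.committable = start_offset   # everything below this is acked
        self.acked = set()                # acked offsets above the watermark

    def ack(self, offset):
        self.acked.add(offset)
        # Advance the watermark over any now-contiguous acked offsets.
        while self.committable in self.acked:
            self.acked.remove(self.committable)
            self.committable += 1
```

With message set [1,2,3,4,5], acking 3, 4, and 5 first leaves the committable offset at 1; once 1 and 2 come back it jumps to 6, and you can pull more messages without being blocked by any single slow worker.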





On Wed, Aug 13, 2014 at 9:31 AM, Neha Narkhede <ne...@gmail.com>
wrote:

> Option 1 would take a throughput hit, as you are trying to commit one
> message at a time. Option 2 is pretty widely used at LinkedIn, and I am
> pretty sure at several other places as well. Option 3 is essentially what
> the high level consumer already does under the covers: it prefetches data
> in batches from the server to provide high throughput.

Re: Most common kafka client consumer implementations?

Posted by Neha Narkhede <ne...@gmail.com>.
Option 1 would take a throughput hit, as you are trying to commit one
message at a time. Option 2 is pretty widely used at LinkedIn, and I am
pretty sure at several other places as well. Option 3 is essentially what
the high level consumer already does under the covers: it prefetches data
in batches from the server to provide high throughput.
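
The prefetching Neha describes can be sketched roughly as follows (illustrative only; the real high level consumer is more involved): a fetcher thread pulls batches ahead of the processing thread, with a bounded queue providing backpressure. `fetch_batch` is a hypothetical stand-in for a network fetch.

```python
import queue
import threading

def start_prefetcher(fetch_batch, out_queue, num_batches):
    # fetch_batch() stands in for a network fetch returning a list of
    # messages; the bounded queue limits how far ahead we fetch.
    def run():
        for _ in range(num_batches):
            out_queue.put(fetch_batch())   # blocks when the queue is full
        out_queue.put(None)                # sentinel: no more data
    t = threading.Thread(target=run, daemon=True)
    t.start()
    return t

def iterate(out_queue):
    # The consumer iterator just drains prefetched chunks one message at
    # a time, so processing rarely waits on the network.
    while True:
        batch = out_queue.get()
        if batch is None:
            return
        for message in batch:
            yield message
```

The bounded queue is the key design point: it decouples fetch latency from processing while capping memory use.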


On Wed, Aug 13, 2014 at 2:20 AM, Anand Nalya <an...@gmail.com> wrote:


Re: Most common kafka client consumer implementations?

Posted by Anand Nalya <an...@gmail.com>.
Hi Jim,

In one of our applications, we implemented option #1:

messageList = getNext(1000)
process(messageList)
commit()

In case of failure, this resulted in duplicate processing for at most 1000
records per partition.

Regards,
Anand
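
The failure window Anand describes can be simulated end to end (illustrative toy model with assumed numbers, not real consumer code): process in batches, commit only after each full batch, crash mid-batch, then resume from the last committed offset and count duplicates.

```python
# Toy simulation of batch-commit (option 1) failure semantics: a crash
# before commit replays the uncommitted part of the current batch.
def run_with_crash(total, batch_size, crash_at):
    committed = 0
    processed = []                      # every record-processing event
    done = 0
    # First run: crashes after processing `crash_at` records.
    for start in range(0, total, batch_size):
        batch = range(start, min(start + batch_size, total))
        for record in batch:
            if done == crash_at:
                break                   # simulated crash before commit
            processed.append(record)
            done += 1
        else:
            committed = start + len(batch)   # commit only after full batch
            continue
        break
    # Restart: resume from the last committed offset and finish.
    for start in range(committed, total, batch_size):
        batch = range(start, min(start + batch_size, total))
        processed.extend(batch)
        committed = start + len(batch)
    return len(processed) - len(set(processed))   # duplicate count
```

With batches of 10 and a crash after 25 records, the 5 records processed from the uncommitted third batch run again on restart; duplicates are always bounded by the batch size per partition, matching Anand's "at most 1000" observation.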

