Posted to users@kafka.apache.org by Tom Dearman <to...@gmail.com> on 2016/07/06 15:42:28 UTC

Monitoring offset lag

I recently had a problem in production which I believe was a manifestation of KAFKA-2978 (a topic partition is sometimes not consumed after rebalancing of the consumer group).  This is fixed in 0.9.0.1, and we will upgrade our client soon.  However, it made me realise that I didn’t have any monitoring set up for this.  The only metric I can find is kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+), which, if I understand correctly, is the maximum lag across the partitions consumed by that particular client.
1. If I had been monitoring this while my consumer was suffering from the KAFKA-2978 issue, would I actually have been alerted?  Since the consumer would have thought it was consuming correctly, would it not simply have kept the metric looking healthy?
2. There is another way to see offset lag: running /usr/bin/kafka-consumer-groups --new-consumer --bootstrap-server 10.10.1.61:9092 --describe --group consumer_group_name and parsing the response.  Is it safe or advisable to do this?  I like the fact that it tells me each partition’s lag, although it too is unavailable if no consumer from the group is currently consuming.
3. Is there a better way of doing this?
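
For reference, a minimal sketch of option 2: piping the describe output through awk and flagging any partition over a lag threshold.  This assumes the 0.9/0.10 column layout shown later in this thread (GROUP, TOPIC, PARTITION, CURRENT-OFFSET, LOG-END-OFFSET, LAG, OWNER); the threshold and group name are placeholders, and rows whose lag is reported as "unknown" are skipped:

#!/bin/bash
# Flag any partition of a consumer group whose lag exceeds MAX_LAG.
# Assumes the 0.9/0.10 output format of kafka-consumer-groups shown
# elsewhere in this thread; MAX_LAG and GROUP are placeholders.
MAX_LAG=1000
GROUP=consumer_group_name

/usr/bin/kafka-consumer-groups --new-consumer \
    --bootstrap-server 10.10.1.61:9092 \
    --describe --group "$GROUP" |
awk -v max="$MAX_LAG" '
    NR > 1 && $6 != "unknown" && $6 + 0 > max {
        printf "lag alert: topic=%s partition=%s lag=%s\n", $2, $3, $6
    }'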

Re: Monitoring offset lag

Posted by Todd Palino <tp...@gmail.com>.
For “first partition”, I was speaking specifically of your example - Burrow
doesn’t care about partition 0 vs. any other partition. Looking at that
output from the groups tool, it looks like there are a lot of partitions
with no committed offsets. There’s even one partition with a committed
offset past the log end offset, which is concerning. My guess here is that
after you started Burrow, there were no offset commits until after that
message was written to partition 1. After that, there was an offset commit
which allowed Burrow to discover the consumer group.

One of the things I want to do is to have Burrow bootstrap the
__consumer_offsets topic from the oldest offsets, which should avoid some
confusion like this. However, there are a couple of things with higher
priority for me personally first.
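
A quick way to see this discovery behaviour for yourself is to poll the consumer list endpoint mentioned elsewhere in this thread before and after an offset commit. A sketch, assuming Burrow’s v2 HTTP API on the port used in this thread:

curl -s http://localhost:8100/v2/kafka/betwave/consumer
# Until Burrow has observed at least one offset commit from the group,
# the "consumers" array in the response will be empty. That does not
# mean the group is unknown to the brokers, only that no commit has
# been seen since Burrow started.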

-Todd


On Fri, Jul 8, 2016 at 9:22 AM, Tom Dearman <to...@gmail.com> wrote:

> Sorry, I should say only partition 1 had something at first, then zero:
>
> Toms-iMac:betwave-server tomdearman$ /Users/tomdearman/software/kafka_2.11-0.10.0.0/bin/kafka-consumer-groups.sh --new-consumer --bootstrap-server localhost:9092 --describe --group voidbridge-oneworks-dummy
> GROUP                          TOPIC                          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             OWNER
> voidbridge-oneworks-dummy      integration-oneworks-dummy     2          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     7          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     12         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     17         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     4          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     9          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     14         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     19         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     1          3               3               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     6          0               0               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     11         0               0               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     16         0               0               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     3          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     8          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     13         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     18         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     0          10              0               -10             integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     5          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     10         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     15         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
> Toms-iMac:betwave-server tomdearman$ /Users/tomdearman/software/kafka_2.11-0.10.0.0/bin/kafka-consumer-groups.sh --new-consumer --bootstrap-server localhost:9092 --describe --group voidbridge-oneworks-dummy
> GROUP                          TOPIC                          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             OWNER
> voidbridge-oneworks-dummy      integration-oneworks-dummy     2          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     7          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     12         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     17         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     4          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     9          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     14         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     19         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     1          3               3               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     6          0               0               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     11         0               0               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     16         0               0               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     3          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     8          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     13         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     18         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     0          1               1               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     5          0               0               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     10         0               0               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
> voidbridge-oneworks-dummy      integration-oneworks-dummy     15         0               0               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
>
> > On 8 Jul 2016, at 17:20, Tom Dearman <to...@gmail.com> wrote:
> >
> > When you say ‘for the first partition’ do you literally mean partition
> zero, or you mean any partition.  It is true that when I had only 1 user
> there were only messages on partition 15 but the second user happened to go
> to partition zero.  Is it the case that partition zero must have a consumer
> commit?
> >
> >> On 8 Jul 2016, at 17:16, Todd Palino <tp...@gmail.com> wrote:
> >>
> >> If you open up an issue on the project, I'd be happy to dig into this in
> >> more detail if needed. Excluding the ZK offset checking, Burrow doesn't
> >> enumerate consumer groups - it learns about them from offset commits. It
> >> sounds like maybe your consumer had not committed offsets for the first
> >> partition (at least not after Burrow was started).
> >>
> >> -Todd
> >>
> >> On Friday, July 8, 2016, Tom Dearman <to...@gmail.com> wrote:
> >>
> >>> Todd,
> >>>
> >>> Thanks for that I am taking a look.
> >>>
> >>> Is there a bug whereby if you only have a couple of messages on a
> topic,
> >>> both with the same key, that burrow doesn’t return correct info.  I was
> >>> finding that http://localhost:8100/v2/kafka/betwave/consumer <
> >>> http://localhost:8100/v2/kafka/betwave/consumer> was returning a
> message
> >>> with empty consumers until I put on another message with a different
> key,
> >>> i.e. a minimum of 2 partitions with something in them.  I know this is
> not
> >>> very like production, but on my local this I was only testing with one
> user
> >>> so get just one partition filled.
> >>>
> >>> Tom
> >>>> On 6 Jul 2016, at 18:08, Todd Palino <tpalino@gmail.com
> <javascript:;>>
> >>> wrote:
> >>>>
> >>>> Yeah, I've written dissertations at this point on why MaxLag is
> flawed.
> >>> We
> >>>> also used to use the offset checker tool, and later something similar
> >>> that
> >>>> was a little easier to slot into our monitoring systems. Problems with
> >>> all
> >>>> of these is why I wrote Burrow (https://github.com/linkedin/Burrow)
> >>>>
> >>>> For more details, you can also check out my blog post on the release:
> >>>>
> >>>
> https://engineering.linkedin.com/apache-kafka/burrow-kafka-consumer-monitoring-reinvented
> >>>>
> >>>> -Todd
> >>>>
> >>>> On Wednesday, July 6, 2016, Tom Dearman <tom.dearman@gmail.com
> >>> <javascript:;>> wrote:
> >>>>
> >>>>> I recently had a problem on my production which I believe was a
> >>>>> manifestation of the issue kafka-2978 (Topic partition is not
> sometimes
> >>>>> consumed after rebalancing of consumer group), this is fixed in
> 0.9.0.1
> >>> and
> >>>>> we will upgrade our client soon.  However, it made me realise that I
> >>> didn’t
> >>>>> have any monitoring set up on this.  The only thing I can find as a
> >>> metric
> >>>>> is the
> >>>>>
> >>>
> kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+),
> >>>>> which, if I understand correctly, is the max lag of any partition
> that
> >>> that
> >>>>> particular consumer is consuming.
> >>>>> 1. If I had been monitoring this, and if my consumer was suffering
> from
> >>>>> the issue in kafka-2978, would I actually have been alerted, i.e.
> since
> >>> the
> >>>>> consumer would think it is consuming correctly would it not have
> updated
> >>>>> the metric.
> >>>>> 2. There is another way to see offset lag using the command
> >>>>> /usr/bin/kafka-consumer-groups --new-consumer --bootstrap-server
> >>>>> 10.10.1.61:9092 --describe —group consumer_group_name and parsing
> the
> >>>>> response.  Is it safe or advisable to do this?  I like the fact that
> it
> >>>>> tells me each partition lag, although it is also not available if no
> >>>>> consumer from the group is currently consuming.
> >>>>> 3. Is there a better way of doing this?
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> *Todd Palino*
> >>>> Staff Site Reliability Engineer
> >>>> Data Infrastructure Streaming
> >>>>
> >>>>
> >>>>
> >>>> linkedin.com/in/toddpalino
> >>>
> >>>
> >>
> >> --
> >> *Todd Palino*
> >> Staff Site Reliability Engineer
> >> Data Infrastructure Streaming
> >>
> >>
> >>
> >> linkedin.com/in/toddpalino
> >
>
>


-- 
*Todd Palino*
Staff Site Reliability Engineer
Data Infrastructure Streaming



linkedin.com/in/toddpalino

Re: Monitoring offset lag

Posted by Tom Dearman <to...@gmail.com>.
Sorry, I should have said: only partition 1 had something at first, then partition 0:

Toms-iMac:betwave-server tomdearman$ /Users/tomdearman/software/kafka_2.11-0.10.0.0/bin/kafka-consumer-groups.sh --new-consumer --bootstrap-server localhost:9092 --describe --group voidbridge-oneworks-dummy
GROUP                          TOPIC                          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             OWNER
voidbridge-oneworks-dummy      integration-oneworks-dummy     2          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     7          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     12         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     17         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     4          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     9          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     14         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     19         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     1          3               3               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     6          0               0               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     11         0               0               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     16         0               0               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     3          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     8          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     13         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     18         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     0          10              0               -10             integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     5          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     10         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     15         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
Toms-iMac:betwave-server tomdearman$ /Users/tomdearman/software/kafka_2.11-0.10.0.0/bin/kafka-consumer-groups.sh --new-consumer --bootstrap-server localhost:9092 --describe --group voidbridge-oneworks-dummy
GROUP                          TOPIC                          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             OWNER
voidbridge-oneworks-dummy      integration-oneworks-dummy     2          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     7          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     12         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     17         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     4          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     9          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     14         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     19         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     1          3               3               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     6          0               0               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     11         0               0               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     16         0               0               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     3          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     8          unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     13         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     18         unknown         0               unknown         integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     0          1               1               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     5          0               0               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     10         0               0               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
voidbridge-oneworks-dummy      integration-oneworks-dummy     15         0               0               0               integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113

> On 8 Jul 2016, at 17:20, Tom Dearman <to...@gmail.com> wrote:
> 
> When you say ‘for the first partition’ do you literally mean partition zero, or you mean any partition.  It is true that when I had only 1 user there were only messages on partition 15 but the second user happened to go to partition zero.  Is it the case that partition zero must have a consumer commit?
> 
>> On 8 Jul 2016, at 17:16, Todd Palino <tp...@gmail.com> wrote:
>> 
>> If you open up an issue on the project, I'd be happy to dig into this in
>> more detail if needed. Excluding the ZK offset checking, Burrow doesn't
>> enumerate consumer groups - it learns about them from offset commits. It
>> sounds like maybe your consumer had not committed offsets for the first
>> partition (at least not after Burrow was started).
>> 
>> -Todd
>> 
>> On Friday, July 8, 2016, Tom Dearman <to...@gmail.com> wrote:
>> 
>>> Todd,
>>> 
>>> Thanks for that I am taking a look.
>>> 
>>> Is there a bug whereby if you only have a couple of messages on a topic,
>>> both with the same key, that burrow doesn’t return correct info.  I was
>>> finding that http://localhost:8100/v2/kafka/betwave/consumer <
>>> http://localhost:8100/v2/kafka/betwave/consumer> was returning a message
>>> with empty consumers until I put on another message with a different key,
>>> i.e. a minimum of 2 partitions with something in them.  I know this is not
>>> very like production, but on my local this I was only testing with one user
>>> so get just one partition filled.
>>> 
>>> Tom
>>>> On 6 Jul 2016, at 18:08, Todd Palino <tpalino@gmail.com <javascript:;>>
>>> wrote:
>>>> 
>>>> Yeah, I've written dissertations at this point on why MaxLag is flawed.
>>> We
>>>> also used to use the offset checker tool, and later something similar
>>> that
>>>> was a little easier to slot into our monitoring systems. Problems with
>>> all
>>>> of these is why I wrote Burrow (https://github.com/linkedin/Burrow)
>>>> 
>>>> For more details, you can also check out my blog post on the release:
>>>> 
>>> https://engineering.linkedin.com/apache-kafka/burrow-kafka-consumer-monitoring-reinvented
>>>> 
>>>> -Todd
>>>> 
>>>> On Wednesday, July 6, 2016, Tom Dearman <tom.dearman@gmail.com
>>> <javascript:;>> wrote:
>>>> 
>>>>> I recently had a problem on my production which I believe was a
>>>>> manifestation of the issue kafka-2978 (Topic partition is not sometimes
>>>>> consumed after rebalancing of consumer group), this is fixed in 0.9.0.1
>>> and
>>>>> we will upgrade our client soon.  However, it made me realise that I
>>> didn’t
>>>>> have any monitoring set up on this.  The only thing I can find as a
>>> metric
>>>>> is the
>>>>> 
>>> kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+),
>>>>> which, if I understand correctly, is the max lag of any partition that
>>> that
>>>>> particular consumer is consuming.
>>>>> 1. If I had been monitoring this, and if my consumer was suffering from
>>>>> the issue in kafka-2978, would I actually have been alerted, i.e. since
>>> the
>>>>> consumer would think it is consuming correctly would it not have updated
>>>>> the metric.
>>>>> 2. There is another way to see offset lag using the command
>>>>> /usr/bin/kafka-consumer-groups --new-consumer --bootstrap-server
>>>>> 10.10.1.61:9092 --describe —group consumer_group_name and parsing the
>>>>> response.  Is it safe or advisable to do this?  I like the fact that it
>>>>> tells me each partition lag, although it is also not available if no
>>>>> consumer from the group is currently consuming.
>>>>> 3. Is there a better way of doing this?
>>>> 
>>>> 
>>>> 
>>>> --
>>>> *Todd Palino*
>>>> Staff Site Reliability Engineer
>>>> Data Infrastructure Streaming
>>>> 
>>>> 
>>>> 
>>>> linkedin.com/in/toddpalino
>>> 
>>> 
>> 
>> -- 
>> *Todd Palino*
>> Staff Site Reliability Engineer
>> Data Infrastructure Streaming
>> 
>> 
>> 
>> linkedin.com/in/toddpalino
> 


Re: Monitoring offset lag

Posted by Tom Dearman <to...@gmail.com>.
When you say ‘for the first partition’, do you literally mean partition zero, or do you mean any partition?  It is true that when I had only 1 user there were only messages on partition 15, but the second user happened to go to partition zero.  Is it the case that partition zero must have a consumer commit?

> On 8 Jul 2016, at 17:16, Todd Palino <tp...@gmail.com> wrote:
> 
> If you open up an issue on the project, I'd be happy to dig into this in
> more detail if needed. Excluding the ZK offset checking, Burrow doesn't
> enumerate consumer groups - it learns about them from offset commits. It
> sounds like maybe your consumer had not committed offsets for the first
> partition (at least not after Burrow was started).
> 
> -Todd
> 
> On Friday, July 8, 2016, Tom Dearman <to...@gmail.com> wrote:
> 
>> Todd,
>> 
>> Thanks for that I am taking a look.
>> 
>> Is there a bug whereby if you only have a couple of messages on a topic,
>> both with the same key, that burrow doesn’t return correct info.  I was
>> finding that http://localhost:8100/v2/kafka/betwave/consumer <
>> http://localhost:8100/v2/kafka/betwave/consumer> was returning a message
>> with empty consumers until I put on another message with a different key,
>> i.e. a minimum of 2 partitions with something in them.  I know this is not
>> very like production, but on my local this I was only testing with one user
>> so get just one partition filled.
>> 
>> Tom
>>> On 6 Jul 2016, at 18:08, Todd Palino <tpalino@gmail.com <javascript:;>>
>> wrote:
>>> 
>>> Yeah, I've written dissertations at this point on why MaxLag is flawed.
>> We
>>> also used to use the offset checker tool, and later something similar
>> that
>>> was a little easier to slot into our monitoring systems. Problems with
>> all
>>> of these is why I wrote Burrow (https://github.com/linkedin/Burrow)
>>> 
>>> For more details, you can also check out my blog post on the release:
>>> 
>> https://engineering.linkedin.com/apache-kafka/burrow-kafka-consumer-monitoring-reinvented
>>> 
>>> -Todd
>>> 
>>> On Wednesday, July 6, 2016, Tom Dearman <tom.dearman@gmail.com
>> <javascript:;>> wrote:
>>> 
>>>> I recently had a problem on my production which I believe was a
>>>> manifestation of the issue kafka-2978 (Topic partition is not sometimes
>>>> consumed after rebalancing of consumer group), this is fixed in 0.9.0.1
>> and
>>>> we will upgrade our client soon.  However, it made me realise that I
>> didn’t
>>>> have any monitoring set up on this.  The only thing I can find as a
>> metric
>>>> is the
>>>> 
>> kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+),
>>>> which, if I understand correctly, is the max lag of any partition that
>> that
>>>> particular consumer is consuming.
>>>> 1. If I had been monitoring this, and if my consumer was suffering from
>>>> the issue in kafka-2978, would I actually have been alerted, i.e. since
>> the
>>>> consumer would think it is consuming correctly would it not have updated
>>>> the metric.
>>>> 2. There is another way to see offset lag using the command
>>>> /usr/bin/kafka-consumer-groups --new-consumer --bootstrap-server
>>>> 10.10.1.61:9092 --describe —group consumer_group_name and parsing the
>>>> response.  Is it safe or advisable to do this?  I like the fact that it
>>>> tells me each partition lag, although it is also not available if no
>>>> consumer from the group is currently consuming.
>>>> 3. Is there a better way of doing this?
>>> 
>>> 
>>> 
>>> --
>>> *Todd Palino*
>>> Staff Site Reliability Engineer
>>> Data Infrastructure Streaming
>>> 
>>> 
>>> 
>>> linkedin.com/in/toddpalino
>> 
>> 
> 
> -- 
> *Todd Palino*
> Staff Site Reliability Engineer
> Data Infrastructure Streaming
> 
> 
> 
> linkedin.com/in/toddpalino


Re: Monitoring offset lag

Posted by Todd Palino <tp...@gmail.com>.
If you open up an issue on the project, I'd be happy to dig into this in
more detail if needed. Excluding the ZK offset checking, Burrow doesn't
enumerate consumer groups - it learns about them from offset commits. It
sounds like maybe your consumer had not committed offsets for the first
partition (at least not after Burrow was started).

-Todd

On Friday, July 8, 2016, Tom Dearman <to...@gmail.com> wrote:

> Todd,
>
> Thanks for that I am taking a look.
>
> Is there a bug whereby if you only have a couple of messages on a topic,
> both with the same key, that burrow doesn’t return correct info.  I was
> finding that http://localhost:8100/v2/kafka/betwave/consumer <
> http://localhost:8100/v2/kafka/betwave/consumer> was returning a message
> with empty consumers until I put on another message with a different key,
> i.e. a minimum of 2 partitions with something in them.  I know this is not
> very like production, but on my local this I was only testing with one user
> so get just one partition filled.
>
> Tom
> > On 6 Jul 2016, at 18:08, Todd Palino <tpalino@gmail.com <javascript:;>>
> wrote:
> >
> > Yeah, I've written dissertations at this point on why MaxLag is flawed.
> We
> > also used to use the offset checker tool, and later something similar
> that
> > was a little easier to slot into our monitoring systems. Problems with
> all
> > of these is why I wrote Burrow (https://github.com/linkedin/Burrow)
> >
> > For more details, you can also check out my blog post on the release:
> >
> https://engineering.linkedin.com/apache-kafka/burrow-kafka-consumer-monitoring-reinvented
> >
> > -Todd
> >
> > On Wednesday, July 6, 2016, Tom Dearman <tom.dearman@gmail.com
> <javascript:;>> wrote:
> >
> >> I recently had a problem on my production which I believe was a
> >> manifestation of the issue kafka-2978 (Topic partition is not sometimes
> >> consumed after rebalancing of consumer group), this is fixed in 0.9.0.1
> and
> >> we will upgrade our client soon.  However, it made me realise that I
> didn’t
> >> have any monitoring set up on this.  The only thing I can find as a
> metric
> >> is the
> >>
> kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+),
> >> which, if I understand correctly, is the max lag of any partition that
> that
> >> particular consumer is consuming.
> >> 1. If I had been monitoring this, and if my consumer was suffering from
> >> the issue in kafka-2978, would I actually have been alerted, i.e. since
> the
> >> consumer would think it is consuming correctly would it not have updated
> >> the metric.
> >> 2. There is another way to see offset lag using the command
> >> /usr/bin/kafka-consumer-groups --new-consumer --bootstrap-server
> >> 10.10.1.61:9092 --describe —group consumer_group_name and parsing the
> >> response.  Is it safe or advisable to do this?  I like the fact that it
> >> tells me each partition lag, although it is also not available if no
> >> consumer from the group is currently consuming.
> >> 3. Is there a better way of doing this?
> >
> >
> >
> > --
> > *Todd Palino*
> > Staff Site Reliability Engineer
> > Data Infrastructure Streaming
> >
> >
> >
> > linkedin.com/in/toddpalino
>
>

-- 
*Todd Palino*
Staff Site Reliability Engineer
Data Infrastructure Streaming



linkedin.com/in/toddpalino

Re: Monitoring offset lag

Posted by Tom Dearman <to...@gmail.com>.
I should mention that this was using the web server to check status.

> On 8 Jul 2016, at 16:56, Tom Dearman <to...@gmail.com> wrote:
> 
> Todd,
> 
> Thanks for that I am taking a look.
> 
> Is there a bug whereby if you only have a couple of messages on a topic, both with the same key, that burrow doesn’t return correct info.  I was finding that http://localhost:8100/v2/kafka/betwave/consumer <http://localhost:8100/v2/kafka/betwave/consumer> was returning a message with empty consumers until I put on another message with a different key, i.e. a minimum of 2 partitions with something in them.  I know this is not very like production, but on my local this I was only testing with one user so get just one partition filled.
> 
> Tom
>> On 6 Jul 2016, at 18:08, Todd Palino <tpalino@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Yeah, I've written dissertations at this point on why MaxLag is flawed. We
>> also used to use the offset checker tool, and later something similar that
>> was a little easier to slot into our monitoring systems. Problems with all
>> of these is why I wrote Burrow (https://github.com/linkedin/Burrow <https://github.com/linkedin/Burrow>)
>> 
>> For more details, you can also check out my blog post on the release:
>> https://engineering.linkedin.com/apache-kafka/burrow-kafka-consumer-monitoring-reinvented <https://engineering.linkedin.com/apache-kafka/burrow-kafka-consumer-monitoring-reinvented>
>> 
>> -Todd
>> 
>> On Wednesday, July 6, 2016, Tom Dearman <to...@gmail.com> wrote:
>> 
>>> I recently had a problem on my production which I believe was a
>>> manifestation of the issue kafka-2978 (Topic partition is not sometimes
>>> consumed after rebalancing of consumer group), this is fixed in 0.9.0.1 and
>>> we will upgrade our client soon.  However, it made me realise that I didn’t
>>> have any monitoring set up on this.  The only thing I can find as a metric
>>> is the
>>> kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+),
>>> which, if I understand correctly, is the max lag of any partition that that
>>> particular consumer is consuming.
>>> 1. If I had been monitoring this, and if my consumer was suffering from
>>> the issue in kafka-2978, would I actually have been alerted, i.e. since the
>>> consumer would think it is consuming correctly would it not have updated
>>> the metric.
>>> 2. There is another way to see offset lag using the command
>>> /usr/bin/kafka-consumer-groups --new-consumer --bootstrap-server
>>> 10.10.1.61:9092 --describe —group consumer_group_name and parsing the
>>> response.  Is it safe or advisable to do this?  I like the fact that it
>>> tells me each partition lag, although it is also not available if no
>>> consumer from the group is currently consuming.
>>> 3. Is there a better way of doing this?
>> 
>> 
>> 
>> -- 
>> *Todd Palino*
>> Staff Site Reliability Engineer
>> Data Infrastructure Streaming
>> 
>> 
>> 
>> linkedin.com/in/toddpalino
> 


Re: Monitoring offset lag

Posted by Tom Dearman <to...@gmail.com>.
Todd,

Thanks for that, I am taking a look.

Is there a bug whereby, if you only have a couple of messages on a topic, both with the same key, Burrow doesn’t return correct info?  I was finding that http://localhost:8100/v2/kafka/betwave/consumer was returning a message with an empty consumers list until I put on another message with a different key, i.e. until a minimum of 2 partitions had something in them.  I know this is not much like production, but locally I was only testing with one user, so only one partition was filled.
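
For anyone trying to reproduce this, keyed test messages can be sent from the console producer so that more than one partition receives data. A sketch, assuming the 0.10.x tool options and the topic name from this thread:

/Users/tomdearman/software/kafka_2.11-0.10.0.0/bin/kafka-console-producer.sh \
    --broker-list localhost:9092 \
    --topic integration-oneworks-dummy \
    --property parse.key=true \
    --property key.separator=:
# Then type lines such as "user1:hello" and "user2:hello"; the two keys
# will most likely hash to different partitions.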

Tom
> On 6 Jul 2016, at 18:08, Todd Palino <tp...@gmail.com> wrote:
> 
> Yeah, I've written dissertations at this point on why MaxLag is flawed. We
> also used to use the offset checker tool, and later something similar that
> was a little easier to slot into our monitoring systems. Problems with all
> of these is why I wrote Burrow (https://github.com/linkedin/Burrow)
> 
> For more details, you can also check out my blog post on the release:
> https://engineering.linkedin.com/apache-kafka/burrow-kafka-consumer-monitoring-reinvented
> 
> -Todd
> 
> On Wednesday, July 6, 2016, Tom Dearman <to...@gmail.com> wrote:
> 
>> I recently had a problem on my production which I believe was a
>> manifestation of the issue kafka-2978 (Topic partition is not sometimes
>> consumed after rebalancing of consumer group), this is fixed in 0.9.0.1 and
>> we will upgrade our client soon.  However, it made me realise that I didn’t
>> have any monitoring set up on this.  The only thing I can find as a metric
>> is the
>> kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+),
>> which, if I understand correctly, is the max lag of any partition that that
>> particular consumer is consuming.
>> 1. If I had been monitoring this, and if my consumer was suffering from
>> the issue in kafka-2978, would I actually have been alerted, i.e. since the
>> consumer would think it is consuming correctly would it not have updated
>> the metric.
>> 2. There is another way to see offset lag using the command
>> /usr/bin/kafka-consumer-groups --new-consumer --bootstrap-server
>> 10.10.1.61:9092 --describe —group consumer_group_name and parsing the
>> response.  Is it safe or advisable to do this?  I like the fact that it
>> tells me each partition lag, although it is also not available if no
>> consumer from the group is currently consuming.
>> 3. Is there a better way of doing this?
> 
> 
> 
> -- 
> *Todd Palino*
> Staff Site Reliability Engineer
> Data Infrastructure Streaming
> 
> 
> 
> linkedin.com/in/toddpalino


Re: Monitoring offset lag

Posted by Todd Palino <tp...@gmail.com>.
Yeah, I've written dissertations at this point on why MaxLag is flawed. We
also used to use the offset checker tool, and later something similar that
was a little easier to slot into our monitoring systems. Problems with all
of these are why I wrote Burrow (https://github.com/linkedin/Burrow).

For more details, you can also check out my blog post on the release:
https://engineering.linkedin.com/apache-kafka/burrow-kafka-consumer-monitoring-reinvented
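
For monitoring, the per-group status endpoint can be polled and turned into an alert. A sketch, assuming Burrow’s v2 HTTP API paths, jq being installed, and the cluster and group names used elsewhere in this thread:

# Ask Burrow for its evaluated status of one consumer group and alert
# on anything other than OK (Burrow also reports e.g. WARN, ERR, STALL).
STATUS=$(curl -s http://localhost:8100/v2/kafka/betwave/consumer/voidbridge-oneworks-dummy/status \
    | jq -r '.status.status')
[ "$STATUS" = "OK" ] || echo "consumer group status: $STATUS"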

-Todd

On Wednesday, July 6, 2016, Tom Dearman <to...@gmail.com> wrote:

> I recently had a problem on my production which I believe was a
> manifestation of the issue kafka-2978 (Topic partition is not sometimes
> consumed after rebalancing of consumer group), this is fixed in 0.9.0.1 and
> we will upgrade our client soon.  However, it made me realise that I didn’t
> have any monitoring set up on this.  The only thing I can find as a metric
> is the
> kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+),
> which, if I understand correctly, is the max lag of any partition that that
> particular consumer is consuming.
> 1. If I had been monitoring this, and if my consumer was suffering from
> the issue in kafka-2978, would I actually have been alerted, i.e. since the
> consumer would think it is consuming correctly would it not have updated
> the metric.
> 2. There is another way to see offset lag using the command
> /usr/bin/kafka-consumer-groups --new-consumer --bootstrap-server
> 10.10.1.61:9092 --describe —group consumer_group_name and parsing the
> response.  Is it safe or advisable to do this?  I like the fact that it
> tells me each partition lag, although it is also not available if no
> consumer from the group is currently consuming.
> 3. Is there a better way of doing this?



-- 
*Todd Palino*
Staff Site Reliability Engineer
Data Infrastructure Streaming



linkedin.com/in/toddpalino

Re: Monitoring offset lag

Posted by Marko Bonaći <ma...@sematext.com>.
Hi Tom,
if you need a commercially proven lag monitoring solution (and all other
Kafka and ZK metrics), take a look at our SPM.
Hope you don't mind me plugging this one in :)

Marko Bonaći
Monitoring | Alerting | Anomaly Detection | Centralized Log Management
Solr & Elasticsearch Support
Sematext <http://sematext.com/> | Contact
<http://sematext.com/about/contact.html>

On Wed, Jul 6, 2016 at 5:42 PM, Tom Dearman <to...@gmail.com> wrote:

> I recently had a problem on my production which I believe was a
> manifestation of the issue kafka-2978 (Topic partition is not sometimes
> consumed after rebalancing of consumer group), this is fixed in 0.9.0.1 and
> we will upgrade our client soon.  However, it made me realise that I didn’t
> have any monitoring set up on this.  The only thing I can find as a metric
> is the
> kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+),
> which, if I understand correctly, is the max lag of any partition that that
> particular consumer is consuming.
> 1. If I had been monitoring this, and if my consumer was suffering from
> the issue in kafka-2978, would I actually have been alerted, i.e. since the
> consumer would think it is consuming correctly would it not have updated
> the metric.
> 2. There is another way to see offset lag using the command
> /usr/bin/kafka-consumer-groups --new-consumer --bootstrap-server
> 10.10.1.61:9092 --describe —group consumer_group_name and parsing the
> response.  Is it safe or advisable to do this?  I like the fact that it
> tells me each partition lag, although it is also not available if no
> consumer from the group is currently consuming.
> 3. Is there a better way of doing this?