Posted to users@kafka.apache.org by Tom Dearman <to...@gmail.com> on 2016/07/06 15:42:28 UTC
Monitoring offset lag
I recently had a problem in production which I believe was a manifestation of KAFKA-2978 (“Topic partition is not sometimes consumed after rebalancing of consumer group”). This is fixed in 0.9.0.1, and we will upgrade our client soon. However, it made me realise that I didn’t have any monitoring set up for this. The only metric I can find is kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+), which, if I understand correctly, is the max lag across the partitions that that particular consumer is consuming.
1. If I had been monitoring this, and my consumer was suffering from the issue in KAFKA-2978, would I actually have been alerted? Since the consumer would think it was consuming correctly, would it not have left the metric unchanged?
2. There is another way to see offset lag: run /usr/bin/kafka-consumer-groups --new-consumer --bootstrap-server 10.10.1.61:9092 --describe --group consumer_group_name and parse the response. Is it safe or advisable to do this? I like that it tells me the lag for each partition, although it is also not available if no consumer from the group is currently consuming.
3. Is there a better way of doing this?
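[Editor's note: a rough sketch of the parsing approach in (2), not from the thread. The column order below matches the 0.10-era output shown later in this thread and is an assumption; it is not stable across Kafka versions.]

```python
# Hypothetical sketch: turn `kafka-consumer-groups --describe` output into
# per-partition lag numbers. Assumed column order (per this thread's output):
# GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG OWNER
# This layout varies between Kafka versions, which is part of why parsing
# the tool's output is brittle.

def parse_describe_output(text):
    """Map (topic, partition) -> lag; None where the tool printed 'unknown'."""
    lags = {}
    for line in text.splitlines():
        cols = line.split()
        # Skip the header and any malformed lines: a data row has at least
        # seven columns and an integer in the PARTITION position.
        if len(cols) >= 7 and cols[2].isdigit():
            lag = cols[5]
            lags[(cols[1], int(cols[2]))] = None if lag == "unknown" else int(lag)
    return lags
```

Partitions left at None (no committed offset) or with persistently growing lag could then feed an alert, though this shares the limitation noted above: the tool only reports while the group is actively consuming.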
Re: Monitoring offset lag
Posted by Todd Palino <tp...@gmail.com>.
For “first partition”, I was speaking specifically of your example - Burrow
doesn’t care about partition 0 vs. any other partition. Looking at that
output from the groups tool, it looks like there are a lot of partitions
with no committed offsets. There’s even one partition with a committed
offset past the log-end offset, which is concerning. My guess here is that
after you started Burrow, there were no offset commits until after that
message was written to partition 1. After that, there was an offset commit,
which allowed Burrow to discover the consumer group.
One of the things I want to do is have Burrow bootstrap the
__consumer_offsets topic from the oldest offsets, which should avoid
confusion like this. However, there are a couple of things with higher
priority for me personally first.
-Todd
--
*Todd Palino*
Staff Site Reliability Engineer
Data Infrastructure Streaming
linkedin.com/in/toddpalino
Re: Monitoring offset lag
Posted by Tom Dearman <to...@gmail.com>.
Sorry, I should say only partition 1 had something at first, and then partition 0:
Toms-iMac:betwave-server tomdearman$ /Users/tomdearman/software/kafka_2.11-0.10.0.0/bin/kafka-consumer-groups.sh --new-consumer --bootstrap-server localhost:9092 --describe --group voidbridge-oneworks-dummy
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG OWNER
voidbridge-oneworks-dummy integration-oneworks-dummy 2 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 7 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 12 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 17 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 4 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 9 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 14 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 19 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 1 3 3 0 integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 6 0 0 0 integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 11 0 0 0 integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 16 0 0 0 integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 3 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 8 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 13 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 18 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 0 10 0 -10 integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 5 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 10 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 15 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
Toms-iMac:betwave-server tomdearman$ /Users/tomdearman/software/kafka_2.11-0.10.0.0/bin/kafka-consumer-groups.sh --new-consumer --bootstrap-server localhost:9092 --describe --group voidbridge-oneworks-dummy
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG OWNER
voidbridge-oneworks-dummy integration-oneworks-dummy 2 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 7 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 12 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 17 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-3_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 4 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 9 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 14 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 19 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-5_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 1 3 3 0 integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 6 0 0 0 integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 11 0 0 0 integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 16 0 0 0 integration-oneworks-dummy-voidbridge-oneworks-dummy-2_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 3 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 8 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 13 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 18 unknown 0 unknown integration-oneworks-dummy-voidbridge-oneworks-dummy-4_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 0 1 1 0 integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 5 0 0 0 integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 10 0 0 0 integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
voidbridge-oneworks-dummy integration-oneworks-dummy 15 0 0 0 integration-oneworks-dummy-voidbridge-oneworks-dummy-1_/10.100.0.113
Re: Monitoring offset lag
Posted by Tom Dearman <to...@gmail.com>.
When you say ‘for the first partition’, do you literally mean partition zero, or do you mean any partition? It is true that when I had only one user there were only messages on partition 15, but the second user happened to go to partition zero. Is it the case that partition zero must have a consumer commit?
Re: Monitoring offset lag
Posted by Todd Palino <tp...@gmail.com>.
If you open up an issue on the project, I'd be happy to dig into this in
more detail if needed. Excluding the ZK offset checking, Burrow doesn't
enumerate consumer groups - it learns about them from offset commits. It
sounds like maybe your consumer had not committed offsets for the first
partition (at least not after Burrow was started).
-Todd
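[Editor's note: for completeness, a hedged sketch of consuming Burrow's HTTP status response. The /v2/kafka/<cluster>/consumer/<group>/status path and the JSON shape below follow Burrow's documented v2 API of the time, but treat both as assumptions to verify against your Burrow version; the sample JSON is a trimmed illustration, not captured output.]

```python
import json

# Sketch, not a canonical client: decide whether a consumer group is healthy
# from a Burrow v2 status response body.

def group_is_ok(status_body):
    body = json.loads(status_body)
    # Burrow sets "error": true on request failures, and otherwise reports an
    # overall status string (OK / WARN / ERR) for the group.
    return (not body.get("error", True)
            and body.get("status", {}).get("status") == "OK")

# Trimmed illustration of a healthy response (cluster/group names from this
# thread; the field layout is an assumption).
sample = json.dumps({
    "error": False,
    "status": {"cluster": "betwave",
               "group": "voidbridge-oneworks-dummy",
               "status": "OK"},
})
```

Because Burrow evaluates committed offsets itself, polling an endpoint like this sidesteps both MaxLag's blind spots and the need to parse CLI output.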
--
*Todd Palino*
Staff Site Reliability Engineer
Data Infrastructure Streaming
linkedin.com/in/toddpalino
Re: Monitoring offset lag
Posted by Tom Dearman <to...@gmail.com>.
I should mention I was using the web server to check status.
> On 8 Jul 2016, at 16:56, Tom Dearman <to...@gmail.com> wrote:
>
> Todd,
>
> Thanks for that I am taking a look.
>
> Is there a bug whereby, if you only have a couple of messages on a topic, both with the same key, burrow doesn’t return correct info? I was finding that http://localhost:8100/v2/kafka/betwave/consumer was returning a message with empty consumers until I put on another message with a different key, i.e. a minimum of 2 partitions with something in them. I know this is not very like production, but locally I was only testing with one user, so only one partition was filled.
>
> Tom
>> On 6 Jul 2016, at 18:08, Todd Palino <tpalino@gmail.com <ma...@gmail.com>> wrote:
>>
>> Yeah, I've written dissertations at this point on why MaxLag is flawed. We
>> also used to use the offset checker tool, and later something similar that
>> was a little easier to slot into our monitoring systems. Problems with all
>> of these is why I wrote Burrow (https://github.com/linkedin/Burrow)
>>
>> For more details, you can also check out my blog post on the release:
>> https://engineering.linkedin.com/apache-kafka/burrow-kafka-consumer-monitoring-reinvented
>>
>> -Todd
>>
>> On Wednesday, July 6, 2016, Tom Dearman <to...@gmail.com> wrote:
>>
>>> I recently had a problem in production which I believe was a
>>> manifestation of the issue kafka-2978 (Topic partition is not sometimes
>>> consumed after rebalancing of consumer group); this is fixed in 0.9.0.1 and
>>> we will upgrade our client soon. However, it made me realise that I didn’t
>>> have any monitoring set up for this. The only thing I can find as a metric
>>> is the
>>> kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+),
>>> which, if I understand correctly, is the max lag of any partition that that
>>> particular consumer is consuming.
>>> 1. If I had been monitoring this, and if my consumer was suffering from
>>> the issue in kafka-2978, would I actually have been alerted? I.e., since the
>>> consumer would think it was consuming correctly, would it simply not have
>>> updated the metric?
>>> 2. There is another way to see offset lag, using the command
>>> /usr/bin/kafka-consumer-groups --new-consumer --bootstrap-server
>>> 10.10.1.61:9092 --describe --group consumer_group_name and parsing the
>>> response. Is it safe or advisable to do this? I like the fact that it
>>> tells me each partition's lag, although it is also not available if no
>>> consumer from the group is currently consuming.
>>> 3. Is there a better way of doing this?
>>
>>
>>
>> --
>> *Todd Palino*
>> Staff Site Reliability Engineer
>> Data Infrastructure Streaming
>>
>>
>>
>> linkedin.com/in/toddpalino
>
Re: Monitoring offset lag
Posted by Tom Dearman <to...@gmail.com>.
Todd,
Thanks for that I am taking a look.
Is there a bug whereby, if you only have a couple of messages on a topic, both with the same key, Burrow doesn’t return correct info? I was finding that http://localhost:8100/v2/kafka/betwave/consumer was returning a message with empty consumers until I put on another message with a different key, i.e. a minimum of 2 partitions with something in them. I know this is not very like production, but on my local machine I was only testing with one user, so only one partition was filled.
Tom
> On 6 Jul 2016, at 18:08, Todd Palino <tp...@gmail.com> wrote:
>
> Yeah, I've written dissertations at this point on why MaxLag is flawed. We
> also used to use the offset checker tool, and later something similar that
> was a little easier to slot into our monitoring systems. Problems with all
> of these are why I wrote Burrow (https://github.com/linkedin/Burrow)
>
> For more details, you can also check out my blog post on the release:
> https://engineering.linkedin.com/apache-kafka/burrow-kafka-consumer-monitoring-reinvented
>
> -Todd
>
> On Wednesday, July 6, 2016, Tom Dearman <to...@gmail.com> wrote:
>
>> I recently had a problem in production which I believe was a
>> manifestation of the issue kafka-2978 (Topic partition is not sometimes
>> consumed after rebalancing of consumer group); this is fixed in 0.9.0.1 and
>> we will upgrade our client soon. However, it made me realise that I didn’t
>> have any monitoring set up for this. The only thing I can find as a metric
>> is the
>> kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+),
>> which, if I understand correctly, is the max lag of any partition that that
>> particular consumer is consuming.
>> 1. If I had been monitoring this, and if my consumer was suffering from
>> the issue in kafka-2978, would I actually have been alerted? I.e., since the
>> consumer would think it was consuming correctly, would it simply not have
>> updated the metric?
>> 2. There is another way to see offset lag, using the command
>> /usr/bin/kafka-consumer-groups --new-consumer --bootstrap-server
>> 10.10.1.61:9092 --describe --group consumer_group_name and parsing the
>> response. Is it safe or advisable to do this? I like the fact that it
>> tells me each partition's lag, although it is also not available if no
>> consumer from the group is currently consuming.
>> 3. Is there a better way of doing this?
>
>
>
> --
> *Todd Palino*
> Staff Site Reliability Engineer
> Data Infrastructure Streaming
>
>
>
> linkedin.com/in/toddpalino
Re: Monitoring offset lag
Posted by Todd Palino <tp...@gmail.com>.
Yeah, I've written dissertations at this point on why MaxLag is flawed. We
also used to use the offset checker tool, and later something similar that
was a little easier to slot into our monitoring systems. Problems with all
of these are why I wrote Burrow (https://github.com/linkedin/Burrow)
For more details, you can also check out my blog post on the release:
https://engineering.linkedin.com/apache-kafka/burrow-kafka-consumer-monitoring-reinvented
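Since Burrow is the recommendation here, a minimal sketch of how one might interpret the JSON its consumer-status endpoint returns. This is a hedged example: the payload shape (top-level "status" object, "partitions" list, per-partition "status" string) is an assumption based on Burrow's v2 HTTP API and should be verified against the Burrow wiki for the version you deploy; the group name is made up.

```python
def evaluate_burrow_status(payload):
    """Return (overall_status, bad_partitions) from a decoded Burrow
    consumer-status JSON payload.  bad_partitions lists every partition
    whose status is not "OK".  Field names are assumptions from Burrow's
    v2 HTTP API -- verify against your Burrow version before alerting."""
    status = payload.get("status", {})
    bad = [(p.get("topic"), p.get("partition"), p.get("status"))
           for p in status.get("partitions") or []
           if p.get("status") != "OK"]
    return status.get("status", "UNKNOWN"), bad

# Fetching the payload (URL mirrors the endpoint style used earlier in the
# thread; host, port, cluster, and group name are deployment-specific):
#
#   import json, urllib.request
#   url = "http://localhost:8100/v2/kafka/betwave/consumer/my_group/status"
#   with urllib.request.urlopen(url) as r:
#       payload = json.load(r)
#   overall, bad = evaluate_burrow_status(payload)
```

Alerting on an evaluated status rather than a raw lag number is the design point the blog post argues for: a stalled consumer can look fine on a lag gauge that simply stops being updated.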
-Todd
On Wednesday, July 6, 2016, Tom Dearman <to...@gmail.com> wrote:
> I recently had a problem in production which I believe was a
> manifestation of the issue kafka-2978 (Topic partition is not sometimes
> consumed after rebalancing of consumer group); this is fixed in 0.9.0.1 and
> we will upgrade our client soon. However, it made me realise that I didn’t
> have any monitoring set up for this. The only thing I can find as a metric
> is the
> kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+),
> which, if I understand correctly, is the max lag of any partition that that
> particular consumer is consuming.
> 1. If I had been monitoring this, and if my consumer was suffering from
> the issue in kafka-2978, would I actually have been alerted? I.e., since the
> consumer would think it was consuming correctly, would it simply not have
> updated the metric?
> 2. There is another way to see offset lag, using the command
> /usr/bin/kafka-consumer-groups --new-consumer --bootstrap-server
> 10.10.1.61:9092 --describe --group consumer_group_name and parsing the
> response. Is it safe or advisable to do this? I like the fact that it
> tells me each partition's lag, although it is also not available if no
> consumer from the group is currently consuming.
> 3. Is there a better way of doing this?
--
*Todd Palino*
Staff Site Reliability Engineer
Data Infrastructure Streaming
linkedin.com/in/toddpalino
Re: Monitoring offset lag
Posted by Marko Bonaći <ma...@sematext.com>.
Hi Tom,
if you need a commercially proven lag monitoring solution (along with all
the other Kafka and ZK metrics), take a look at our SPM.
Hope you don't mind me plugging this one in :)
[image: Inline image 1]
Marko Bonaći
Monitoring | Alerting | Anomaly Detection | Centralized Log Management
Solr & Elasticsearch Support
Sematext <http://sematext.com/> | Contact
<http://sematext.com/about/contact.html>
On Wed, Jul 6, 2016 at 5:42 PM, Tom Dearman <to...@gmail.com> wrote:
> I recently had a problem in production which I believe was a
> manifestation of the issue kafka-2978 (Topic partition is not sometimes
> consumed after rebalancing of consumer group); this is fixed in 0.9.0.1 and
> we will upgrade our client soon. However, it made me realise that I didn’t
> have any monitoring set up for this. The only thing I can find as a metric
> is the
> kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+),
> which, if I understand correctly, is the max lag of any partition that that
> particular consumer is consuming.
> 1. If I had been monitoring this, and if my consumer was suffering from
> the issue in kafka-2978, would I actually have been alerted? I.e., since the
> consumer would think it was consuming correctly, would it simply not have
> updated the metric?
> 2. There is another way to see offset lag, using the command
> /usr/bin/kafka-consumer-groups --new-consumer --bootstrap-server
> 10.10.1.61:9092 --describe --group consumer_group_name and parsing the
> response. Is it safe or advisable to do this? I like the fact that it
> tells me each partition's lag, although it is also not available if no
> consumer from the group is currently consuming.
> 3. Is there a better way of doing this?
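On question 2 in the message above, parsing the --describe output is workable but brittle across versions. A hedged sketch follows: the comma-separated column order (GROUP, TOPIC, PARTITION, CURRENT OFFSET, LOG END OFFSET, LAG, OWNER) is what the 0.9-era tool printed, and newer Kafka releases reformat it, so check a sample of your own output first.

```python
def parse_consumer_group_lag(output):
    """Parse `kafka-consumer-groups --describe` output into a
    {(topic, partition): lag} dict.  The column order (GROUP, TOPIC,
    PARTITION, CURRENT OFFSET, LOG END OFFSET, LAG, OWNER) is assumed
    from the 0.9-era tool; it changes between Kafka versions."""
    lag = {}
    for line in output.splitlines():
        # Normalise comma- or whitespace-separated rows into fields.
        fields = [f.strip() for f in line.replace(",", " ").split()]
        if len(fields) < 6 or not fields[2].isdigit():
            continue  # skip the header row and blank/malformed lines
        topic, partition = fields[1], int(fields[2])
        try:
            lag[(topic, partition)] = int(fields[5])
        except ValueError:
            continue  # lag can be non-numeric when no offset is committed
    return lag
```

Note the last guard: partitions with no committed offset show an unknown lag, which is exactly the blind spot the thread discusses, so a monitor built on this output should treat a missing partition as suspicious rather than as zero lag.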