Posted to users@kafka.apache.org by Abhimanyu Nagrath <ab...@gmail.com> on 2017/05/28 11:59:36 UTC

Queries regarding Kafka monitoring tool Burrow.

Hi,

I am using Burrow to monitor Kafka consumer lag and I have the following queries:

1. On hitting the API /v2/kafka/local/consumer/group1/lag I am not able to see
details for all the topics in that group, and the response contains "complete":
false. What does this mean? Below is the JSON result of the query.
{
    "error": false,
    "message": "consumer group status returned",
    "status": {
        "cluster": "local",
        "group": "group1",
        "status": "OK",
        "complete": false,
        "partitions": [
            {
                "topic": "topic1",
                "partition": 1,
                "status": "OK",
                "start": {
                    "offset": 144,
                    "timestamp": 1494566913489,
                    "lag": 0,
                    "max_offset": 144
                },
                "end": {
                    "offset": 144,
                    "timestamp": 1494566999000,
                    "lag": 0,
                    "max_offset": 144
                }
            }
        ],
        "partition_count": 17,
        "maxlag": null,
        "totallag": 0
    },
    "request": {
        "url": "/v2/kafka/local/consumer/group1/lag",
        "host": "",
        "cluster": "local",
        "group": "group1",
        "topic": ""
    }
}


2. Since Burrow returns JSON, are there any visualization tools that can be
used to monitor the end results?

3. The output of the consumer group describe command and Burrow's group lag
endpoint are different: Burrow's result is somewhat delayed compared to what I
get when running the describe command against the Kafka broker, and the
numbers do not match.



Below is my Burrow configuration:


[general]
logdir=log
logconfig=/root/go/src/github.com/linkedin/Burrow/config/logging.cfg
pidfile=burrow.pid
client-id=burrow-lagchecker
group-blacklist=^(console-consumer-|python-kafka-consumer-).*$
#group-whitelist=^(my-important-consumer).*$

[zookeeper]
hostname=<zookeeper ip>
port=2181
timeout=6
lock-path=/burrow/notifier

[kafka "local"]
broker=<Kafka Ip>
broker-port=9092
zookeeper=<zookeeper ip>
zookeeper-port=2181
zookeeper-path=/
offsets-topic=__consumer_offsets

#[storm "local"]
#zookeeper=zkhost01.example.com
#zookeeper-port=2181
#zookeeper-path=/kafka-cluster/stormconsumers

[tickers]
broker-offsets=20

[lagcheck]
intervals=10
expire-group=604800

[notify]
interval=10

[httpserver]
server=on
port=8000
; Alternatively, use listen (cannot be specified when port is)
; listen=host:port
; listen=host2:port2

[smtp]
server=mailserver.example.com
port=25
from=burrow-noreply@example.com
template=config/default-email.tmpl

[emailnotifier "bofh@example.com"]
group=local,critical-consumer-group
group=local,other-consumer-group
interval=60

[notify]
interval=10

[httpnotifier]
url=http://notification.server.example.com:9000/v1/alert
interval=60
extra=app=burrow
extra=tier=STG
template-post=config/default-http-post.tmpl
template-delete=config/default-http-delete.tmpl
timeout=5
keepalive=30

So can you please let me know what I am missing and how to fix these issues?
Any help would be appreciated.



Regards,
Abhimanyu

Re: Queries regarding Kafka monitoring tool Burrow.

Posted by Todd Palino <tp...@gmail.com>.
The lag numbers are never going to be exactly the same as what the CLI tool
returns, as the broker is queried on an interval for the offset at the end
of each partition. As far as crashing goes, I’d be interested to hear about
specifics as we run it (obviously) and don’t have that problem. It could be
environmental differences or other problems that we’re not running into. I
regret that I’m not able to be as active on the GitHub issues as I would
like to be, but I do try and get through them (and others will answer
questions as well).

Abhimanyu, to answer your questions…

1 - When complete is false, this means that Burrow does not yet have enough
information on the consumer group to work. There’s a configurable number of
intervals, or offset commits, that Burrow requires for each partition. So
if your consumer group commits offsets every minute, it will take 10
minutes for there to be enough offset commits (for the default interval
config of 10). If you commit offsets every 10 minutes, it will take 1 hour
and 40 minutes.
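As a rough back-of-the-envelope helper (purely illustrative, not part of
Burrow; the 10 here is the intervals=10 value from the [lagcheck] section of
your config):

# Time until Burrow has seen enough offset commits for a partition,
# assuming the consumer commits at a fixed interval.
def minutes_until_complete(intervals, commit_interval_minutes):
    return intervals * commit_interval_minutes

print(minutes_until_complete(10, 1))   # commits every minute  -> 10 minutes
print(minutes_until_complete(10, 10))  # commits every 10 min  -> 100 minutes (1h40m)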

2 - We haven’t included any visualization tools with Burrow directly, but
some other people have been working on add-ons. Check out the associated
projects - https://github.com/linkedin/Burrow/wiki/Associated-Projects

For our own use, we actually collect the metrics from Burrow (partition
counts for the error stats, total lag) into our internal metrics system
that does graphing. I know some others are doing that as well.
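For illustration, a minimal sketch of that kind of collection (Python standard
library only; the port comes from the [httpserver] section above, the host and
group names are placeholders, and the field names are the ones in the lag
response shown earlier):

import json
import urllib.request

BURROW = "http://localhost:8000"  # [httpserver] port=8000 in the config above

def consumer_lag_metrics(cluster, group):
    # Fetch the consumer group lag status from Burrow's HTTP endpoint.
    url = "%s/v2/kafka/%s/consumer/%s/lag" % (BURROW, cluster, group)
    with urllib.request.urlopen(url, timeout=10) as resp:
        status = json.load(resp)["status"]

    # Count partitions per evaluated status (OK, WARN, ERR, ...)
    by_status = {}
    for p in status["partitions"]:
        by_status[p["status"]] = by_status.get(p["status"], 0) + 1

    return {
        "group_status": status["status"],
        "complete": status["complete"],
        "totallag": status["totallag"],
        "partition_count": status["partition_count"],
        "partitions_by_status": by_status,
    }

print(consumer_lag_metrics("local", "group1"))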

3 - As noted above, because of the way Burrow gets the broker log end
offsets, the numbers won’t match the CLI exactly. In addition, Burrow
currently only calculates lag for a given partition when the consumer
commits an offset. The reasoning is that the lag numbers were never really
meant to be exposed in the original design - we wanted to create an
overall status for each consumer group, not specific lag metrics. We used
to do the latter, using the CLI tool and some wrappers around it, similar
to what Ian has described with remora, and found it significantly lacking.
Specifically, we had too many lag numbers to deal with for thousands of
topics and tens of thousands of partitions over many consumers, and no good
way to define thresholds.

-Todd







-- 
Todd Palino
Senior Staff Engineer, Site Reliability
Data Infrastructure Streaming



linkedin.com/in/toddpalino

Re: Queries regarding Kafka monitoring tool Burrow.

Posted by Ian Duffy <ia...@ianduffy.ie>.
Hey Abhimanyu,

Not directly answering your questions, but we used Burrow at my current company
in the past and had a horrible time with it. It would crash daily, and its lag
metrics were very different from what the kafka-consumer-groups describe
command returned, as you noted.

My co-worker ended up building our own solution that basically just wraps
around the command line tools. https://github.com/zalando-incubator/remora

> 2. Since Burrow returns JSON, are there any visualization tools that can be
> used to monitor the end results?

We have a monitoring solution (https://github.com/zalando/zmon) that polls the
HTTP endpoint every 60 seconds and places the data into KairosDB. That gives us
a time-series DB we can query directly from Grafana. It should be possible to
throw a simple poller script together that does this for you.
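Something along these lines would do it (a rough sketch; push_to_tsdb is just a
stand-in for whatever backend you write to, and burrow-host is a placeholder):

import json
import time
import urllib.request

LAG_URL = "http://burrow-host:8000/v2/kafka/local/consumer/group1/lag"

def push_to_tsdb(metric, value, ts_ms):
    # Stand-in for your time-series backend's write call; here it just prints.
    print(metric, value, ts_ms)

while True:
    # Poll Burrow and forward the total lag gauge once a minute.
    with urllib.request.urlopen(LAG_URL, timeout=10) as resp:
        status = json.load(resp)["status"]
    push_to_tsdb("burrow.totallag", status["totallag"], int(time.time() * 1000))
    time.sleep(60)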

> 3. The output of the consumer group describe command and Burrow's group lag
> endpoint are different: Burrow's result is somewhat delayed compared to what I
> get when running the describe command against the Kafka broker, and the
> numbers do not match.

They use different lag calculation methods;
https://github.com/linkedin/Burrow/wiki/Consumer-Lag-Evaluation-Rules
describes it.
