You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@samza.apache.org by Jeremiah Adams <JA...@helixeducation.com> on 2018/11/26 20:31:07 UTC

Alerting and Monitoring Samza Checkpointing?

We are replacing a node.js app that consumed topics on a Kafka cluster with Samza jobs. We use kafka-offsets to trigger alerts based on message lag. e.g., message lag is greater than 10, wake up support persons.


Samza doesn't use the same mechanism for offset storage and the tools for examining a topic's checkpoint aren't readily useful for application consumption.


Can some of you share your approaches to monitoring and alerting on consumer lag?


Regards.


Jeremiah Adams
Software Engineer
www.helixeducation.com<http://www.helixeducation.com/>
Blog<http://www.helixeducation.com/blog/> | Twitter<https://twitter.com/HelixEducation> | Facebook<https://www.facebook.com/HelixEducation> | LinkedIn<http://www.linkedin.com/company/3609946>

Re: Alerting and Monitoring Samza Checkpointing?

Posted by Jeremiah Adams <JA...@helixeducation.com>.

Thanks for the information Tom and Jagadish.



Jeremiah Adams
Software Engineer
www.helixeducation.com
Blog | Twitter | Facebook | LinkedIn

________________________________________
From: Jagadish Venkatraman <ja...@gmail.com>
Sent: Tuesday, November 27, 2018 4:43 PM
To: dev@samza.apache.org
Subject: Re: Alerting and Monitoring Samza Checkpointing?

Hi Jeremiah,

+1 to what Tom said. Samza currently does not rely on Kafka consumer's
checkpointing behavior and
exposes its own notion of a "lag". This is reported as a per-partition
metric under KafkaSystemConsumerMetrics#messagesBehindHighWatermark
<https://url.emailprotection.link/?aZyQRg2CGut2qgyHrdHxA3r2wRZBhFBnHgQFe8bv7-el-kky1mI0NNorGT67_4hGvk9kwfq-xqe3AhS5fbDhgMvvevjSSie9YcK39B_XuWMa1wwlETzquMJOzILPI_fs8kXNOSL8pFBntugBWc-9uZMmpGC-UEQpMb9CvPN_cxKSus8rQ50_dgENdDG7FYPqW7BSwuNTl7DzgPysJqxE0ow~~>
.
My first recommendation would be to setup alerts on this metric and monitor
it.

To use *Burrow* for lag monitoring, you can extend the *KafkaSystemConsumer*
to implement the
CheckpointListener
<https://url.emailprotection.link/?a5r3dp0BEHNX0237G8v6KFo8_GB1EZ1iyx01zxvfwK5Rm6d4ZFP_sbdggncCCDswySzk_Y1fqB5Ud0NAEk1IuUhjU20WoKjGH5uErkSWavdUPur-7LHWIppwLwsNRbGWmTvYjsmOVPtSptgIn0w625ZGv_jYlGxBoTiXzqhFLDOQ~>
interface. This interface allows you to intercept Samza's checkpointing
sequence and
plug-in your own logic. An example implementation of the CheckpointListener
could instantiate
a Kafka consumer and periodically commit Samza's checkpointed offsets to
it.

Please let me know if you have any questions.

Best,
Jagadish





On Tue, Nov 27, 2018 at 10:50 AM Jeremiah Adams <JA...@helixeducation.com>
wrote:

>
> I am referring to the "Lag" that can exist when a Consumer Offset is
> significantly less than the Log Size. This difference is Lag and is often
> symptomatic of a problem - processing has stopped or being overwhelmed etc.
>
> Our Legacy Node.js system uses Consumer Groups, (same as say older
> Spark).  To get the Offset, we can use kafka-consumer-groups.sh tool to get
> the offset. For Ops related work we use Kafkamon for these to get a UI up
> for Ops folks.
>
> Our newer stuff uses Samza and I see zero Consumer Groups. Instead I see
> checkpoint topics (example:
> __samza_checkpoint_ver_1_for_generic-delivery_1). I can consume this topic
> and get the current offset by partition, but I don't have the log size, so
> cannot compute the lag. All I can do is see these numbers increment but
> know clue how behind my process is.
>
> I just took Linkedin's Burrow (https://url.emailprotection.link/?aZyQRg2CGut2qgyHrdHxA3nJh1kgchfH4Ntw9gzf6uyPMb70s0sXRMiq-yoBXqdnS8SqL9elFNrlOevL0tB3-NQ~~) for a
> test drive locally, hoping it would solve my problem due to it looking at
> the internal consumers. However, I have the same problem - can't get data
> on a consumer group that doesn't exist.
>
>
>
>
>
> Jeremiah Adams
> Software Engineer
> https://url.emailprotection.link/?ahfhEufaAWbezBrUFPG98ZJcterGfIerU3ZwsA3Gv_C0~
> Blog | Twitter | Facebook | LinkedIn
>
> ________________________________________
> From: Tom Davis <to...@recursivedream.com>
> Sent: Monday, November 26, 2018 6:59 PM
> To: dev@samza.apache.org
> Subject: Re: Alerting and Monitoring Samza Checkpointing?
>
> Have you looked into KafkaSystemConsumerMetrics? Is the meaning of "lag"
> there different from what you mean?
>
>
> Jeremiah Adams <JA...@helixeducation.com> writes:
>
> > We are replacing a node.js app that consumed topics on a Kafka cluster
> with
> > Samza jobs. We use kafka-offsets to trigger alerts based on message lag.
> e.g.,
> > message lag is greater than 10, wake up support persons.
> >
> >
> > Samza doesn't use the same mechanism for offset storage and the tools for
> > examining a topic's checkpoint aren't readily useful for application
> > consumption.
> >
> >
> > Can some of you share your approaches to monitoring and alerting on
> consumer
> > lag?
> >
> >
> > Regards.
> >
> >
> > Jeremiah Adams
> > Software Engineer
> >
> https://url.emailprotection.link/?ahfhEufaAWbezBrUFPG98ZJcterGfIerU3ZwsA3Gv_C0~
> <
> https://url.emailprotection.link/?a49H2rNGIIBtQOw6md8OcHp-qKE3Xn2gNiZ3dlqAeSDA~
> >
> > Blog<
> https://url.emailprotection.link/?a49H2rNGIIBtQOw6md8OcHgFEZu-KYuiu8doY66NWwmmyWxz7kC-27Yfnbdgd2wyh5gjXUa6LMT_NRXsj1g1VVg~~>
> |
> > Twitter<
> https://url.emailprotection.link/?a0Q7ct5_6cOdbJ86kpWB0zx6RbtgugTVC7lU_W7za50jLdZQGpLgVlR1V06zckSaM5oOKb6QBo46Qp9xt0Tt7Aw~~>
> |
> > Facebook<
> https://url.emailprotection.link/?aAmyAO_nS_C1aDgBLeKyGTu0tksTt1_mn2PcS8KJXNJPM04iRHKgX96qGgENV-dMSER5wl8zDVRr3RsS0OmcF9A~~>
> |
> > LinkedIn<
> https://url.emailprotection.link/?aanlcNI-cN74Gdz-TD332xAl6lHu7TRNICWoHUFjYf-KlBjrCGHoYR65b3rl-OyW10nWFv6hwYvUSoVHL4b3vGA~~
> >
>


--
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University

Re: Alerting and Monitoring Samza Checkpointing?

Posted by Jagadish Venkatraman <ja...@gmail.com>.

Hi Jeremiah,

+1 to what Tom said. Samza currently does not rely on Kafka consumer's
checkpointing behavior and
exposes its own notion of a "lag". This is reported as a per-partition
metric under KafkaSystemConsumerMetrics#messagesBehindHighWatermark
<https://github.com/apache/samza/blob/master/samza-kafka/src/main/scala/org/apache/samza/system/kafka/KafkaSystemConsumerMetrics.scala#L45>
.
My first recommendation would be to setup alerts on this metric and monitor
it.

To use *Burrow* for lag monitoring, you can extend the *KafkaSystemConsumer*
to implement the
CheckpointListener
<https://samza.apache.org/learn/documentation/0.13/api/javadocs/org/apache/samza/checkpoint/CheckpointListener.html>
interface. This interface allows you to intercept Samza's checkpointing
sequence and
plug-in your own logic. An example implementation of the CheckpointListener
could instantiate
a Kafka consumer and periodically commit Samza's checkpointed offsets to
it.

Please let me know if you have any questions.

Best,
Jagadish





On Tue, Nov 27, 2018 at 10:50 AM Jeremiah Adams <JA...@helixeducation.com>
wrote:

>
> I am referring to the "Lag" that can exist when a Consumer Offset is
> significantly less than the Log Size. This difference is Lag and is often
> symptomatic of a problem - processing has stopped or being overwhelmed etc.
>
> Our Legacy Node.js system uses Consumer Groups, (same as say older
> Spark).  To get the Offset, we can use kafka-consumer-groups.sh tool to get
> the offset. For Ops related work we use Kafkamon for these to get a UI up
> for Ops folks.
>
> Our newer stuff uses Samza and I see zero Consumer Groups. Instead I see
> checkpoint topics (example:
> __samza_checkpoint_ver_1_for_generic-delivery_1). I can consume this topic
> and get the current offset by partition, but I don't have the log size, so
> cannot compute the lag. All I can do is see these numbers increment but
> know clue how behind my process is.
>
> I just took Linkedin's Burrow (https://github.com/linkedin/Burrow) for a
> test drive locally, hoping it would solve my problem due to it looking at
> the internal consumers. However, I have the same problem - can't get data
> on a consumer group that doesn't exist.
>
>
>
>
>
> Jeremiah Adams
> Software Engineer
> www.helixeducation.com
> Blog | Twitter | Facebook | LinkedIn
>
> ________________________________________
> From: Tom Davis <to...@recursivedream.com>
> Sent: Monday, November 26, 2018 6:59 PM
> To: dev@samza.apache.org
> Subject: Re: Alerting and Monitoring Samza Checkpointing?
>
> Have you looked into KafkaSystemConsumerMetrics? Is the meaning of "lag"
> there different from what you mean?
>
>
> Jeremiah Adams <JA...@helixeducation.com> writes:
>
> > We are replacing a node.js app that consumed topics on a Kafka cluster
> with
> > Samza jobs. We use kafka-offsets to trigger alerts based on message lag.
> e.g.,
> > message lag is greater than 10, wake up support persons.
> >
> >
> > Samza doesn't use the same mechanism for offset storage and the tools for
> > examining a topic's checkpoint aren't readily useful for application
> > consumption.
> >
> >
> > Can some of you share your approaches to monitoring and alerting on
> consumer
> > lag?
> >
> >
> > Regards.
> >
> >
> > Jeremiah Adams
> > Software Engineer
> >
> https://url.emailprotection.link/?ahfhEufaAWbezBrUFPG98ZJcterGfIerU3ZwsA3Gv_C0~
> <
> https://url.emailprotection.link/?a49H2rNGIIBtQOw6md8OcHp-qKE3Xn2gNiZ3dlqAeSDA~
> >
> > Blog<
> https://url.emailprotection.link/?a49H2rNGIIBtQOw6md8OcHgFEZu-KYuiu8doY66NWwmmyWxz7kC-27Yfnbdgd2wyh5gjXUa6LMT_NRXsj1g1VVg~~>
> |
> > Twitter<
> https://url.emailprotection.link/?a0Q7ct5_6cOdbJ86kpWB0zx6RbtgugTVC7lU_W7za50jLdZQGpLgVlR1V06zckSaM5oOKb6QBo46Qp9xt0Tt7Aw~~>
> |
> > Facebook<
> https://url.emailprotection.link/?aAmyAO_nS_C1aDgBLeKyGTu0tksTt1_mn2PcS8KJXNJPM04iRHKgX96qGgENV-dMSER5wl8zDVRr3RsS0OmcF9A~~>
> |
> > LinkedIn<
> https://url.emailprotection.link/?aanlcNI-cN74Gdz-TD332xAl6lHu7TRNICWoHUFjYf-KlBjrCGHoYR65b3rl-OyW10nWFv6hwYvUSoVHL4b3vGA~~
> >
>


-- 
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University

Re: Alerting and Monitoring Samza Checkpointing?

Posted by Jeremiah Adams <JA...@helixeducation.com>.

I am referring to the "Lag" that can exist when a Consumer Offset is significantly less than the Log Size. This difference is Lag and is often symptomatic of a problem - processing has stopped or being overwhelmed etc.

Our Legacy Node.js system uses Consumer Groups, (same as say older Spark).  To get the Offset, we can use kafka-consumer-groups.sh tool to get the offset. For Ops related work we use Kafkamon for these to get a UI up for Ops folks. 

Our newer stuff uses Samza and I see zero Consumer Groups. Instead I see checkpoint topics (example: __samza_checkpoint_ver_1_for_generic-delivery_1). I can consume this topic and get the current offset by partition, but I don't have the log size, so cannot compute the lag. All I can do is see these numbers increment but know clue how behind my process is.

I just took Linkedin's Burrow (https://github.com/linkedin/Burrow) for a test drive locally, hoping it would solve my problem due to it looking at the internal consumers. However, I have the same problem - can't get data on a consumer group that doesn't exist.

Jeremiah Adams
Software Engineer
www.helixeducation.com
Blog | Twitter | Facebook | LinkedIn

________________________________________
From: Tom Davis <to...@recursivedream.com>
Sent: Monday, November 26, 2018 6:59 PM
To: dev@samza.apache.org
Subject: Re: Alerting and Monitoring Samza Checkpointing?

Have you looked into KafkaSystemConsumerMetrics? Is the meaning of "lag"
there different from what you mean?

Jeremiah Adams <JA...@helixeducation.com> writes:

> We are replacing a node.js app that consumed topics on a Kafka cluster with
> Samza jobs. We use kafka-offsets to trigger alerts based on message lag. e.g.,
> message lag is greater than 10, wake up support persons.
>
>
> Samza doesn't use the same mechanism for offset storage and the tools for
> examining a topic's checkpoint aren't readily useful for application
> consumption.
>
>
> Can some of you share your approaches to monitoring and alerting on consumer
> lag?
>
>
> Regards.
>
>
> Jeremiah Adams
> Software Engineer
> https://url.emailprotection.link/?ahfhEufaAWbezBrUFPG98ZJcterGfIerU3ZwsA3Gv_C0~<https://url.emailprotection.link/?a49H2rNGIIBtQOw6md8OcHp-qKE3Xn2gNiZ3dlqAeSDA~>
> Blog<https://url.emailprotection.link/?a49H2rNGIIBtQOw6md8OcHgFEZu-KYuiu8doY66NWwmmyWxz7kC-27Yfnbdgd2wyh5gjXUa6LMT_NRXsj1g1VVg~~> |
> Twitter<https://url.emailprotection.link/?a0Q7ct5_6cOdbJ86kpWB0zx6RbtgugTVC7lU_W7za50jLdZQGpLgVlR1V06zckSaM5oOKb6QBo46Qp9xt0Tt7Aw~~> |
> Facebook<https://url.emailprotection.link/?aAmyAO_nS_C1aDgBLeKyGTu0tksTt1_mn2PcS8KJXNJPM04iRHKgX96qGgENV-dMSER5wl8zDVRr3RsS0OmcF9A~~> |
> LinkedIn<https://url.emailprotection.link/?aanlcNI-cN74Gdz-TD332xAl6lHu7TRNICWoHUFjYf-KlBjrCGHoYR65b3rl-OyW10nWFv6hwYvUSoVHL4b3vGA~~>

Re: Alerting and Monitoring Samza Checkpointing?

Posted by Tom Davis <to...@recursivedream.com>.

Have you looked into KafkaSystemConsumerMetrics? Is the meaning of "lag"
there different from what you mean?


Jeremiah Adams <JA...@helixeducation.com> writes:

> We are replacing a node.js app that consumed topics on a Kafka cluster with 
> Samza jobs. We use kafka-offsets to trigger alerts based on message lag. e.g., 
> message lag is greater than 10, wake up support persons.
>
>
> Samza doesn't use the same mechanism for offset storage and the tools for 
> examining a topic's checkpoint aren't readily useful for application 
> consumption.
>
>
> Can some of you share your approaches to monitoring and alerting on consumer 
> lag?
>
>
> Regards.
>
>
> Jeremiah Adams
> Software Engineer
> www.helixeducation.com<http://www.helixeducation.com/>
> Blog<http://www.helixeducation.com/blog/> | 
> Twitter<https://twitter.com/HelixEducation> | 
> Facebook<https://www.facebook.com/HelixEducation> | 
> LinkedIn<http://www.linkedin.com/company/3609946>