Posted to dev@kafka.apache.org by Steven Wu <st...@gmail.com> on 2015/01/28 00:20:55 UTC

[DISCUSSION] generate explicit error/failing metrics

To illustrate my point, I will use the "allTopicsOwnedPartitionsCount" gauge
from ZookeeperConsumerConnector as an example. It captures the number of
partitions of a topic that have been assigned an owner in the consumer group.
Let's say that I have a topic with 9 partitions. This metric should
normally report the value 9, so I can set up an alert
if allTopicsOwnedPartitionsCount < 9.

Here are the drawbacks of this kind of metric.
1) If our metrics reporting/aggregation system has data loss and causes the
value to be reported as zero, we can't really distinguish whether it is a
real error or just data loss, so we can get false alarms from data loss.
2) If we change the number of partitions (e.g. from 9 to 18), we need to
remember to change the alert rule to "allTopicsOwnedPartitionsCount < 18".
This kind of coupling is a maintenance nightmare.

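A more explicit metric is "NoOwnerPartitionsCount". It should normally be
zero. If it is not zero, we should be alerted. This way, we won't get
false alarms from data loss, because a data point lost as zero looks like
the healthy state, and the rule does not depend on the partition count.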
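
The difference between the two alert rules can be sketched as follows. This
is only an illustration, not Kafka code: the function names are hypothetical,
and it assumes the metrics pipeline reports a lost data point as 0.

```python
# Sketch: an "expected value" alert versus an explicit error-count alert.
# Assumption: a lost data point is reported as 0 by the metrics pipeline.


def owned_count_alert(owned_partitions: int, expected_partitions: int) -> bool:
    """Fire when fewer partitions are owned than the rule expects."""
    return owned_partitions < expected_partitions


def no_owner_alert(no_owner_partitions: int) -> bool:
    """Fire when any partition has no owner."""
    return no_owner_partitions != 0


LOST_DATA_POINT = 0  # what the pipeline reports when a sample is lost

# Healthy consumer group, but the sample was lost in the pipeline.
false_alarm = owned_count_alert(LOST_DATA_POINT, expected_partitions=9)
quiet_on_loss = not no_owner_alert(LOST_DATA_POINT)

# Topic grown from 9 to 18 partitions, 3 unowned, rule still expects 9.
missed_by_stale_rule = not owned_count_alert(18 - 3, expected_partitions=9)
caught_by_explicit = no_owner_alert(3)

print(false_alarm, quiet_on_loss, missed_by_stale_rule, caught_by_explicit)
```

The trade-off is that a lost sample can also hide a real error from the
explicit rule, so it removes false alarms rather than all blind spots.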

We don't have to change/fix this particular example since a new consumer is
being worked on. But in the new consumer, please consider more explicit
error signals.

Thanks,
Steven

Re: [DISCUSSION] generate explicit error/failing metrics

Posted by Guozhang Wang <wa...@gmail.com>.

I think this is more of a tooling issue that the new consumer may not
directly resolve. On the other hand, the ConsumerOffsetChecker tool
returns, for each partition, the consumer's current offset as well as the
log end offset. If some partitions are not owned by any consumer, their
"owner" field will be null. We can easily augment this tool so that it
exposes alert metrics when this happens.
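
A rough sketch of such an augmentation is below. The PartitionStatus record
and its field names are hypothetical stand-ins for the per-partition
information described above; they are not the tool's actual output format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PartitionStatus:
    """Hypothetical per-partition row: offsets plus the owner, if any."""
    topic: str
    partition: int
    consumer_offset: int
    log_end_offset: int
    owner: Optional[str]  # None when no consumer owns the partition


def no_owner_partitions_count(rows: List[PartitionStatus]) -> int:
    """Count the partitions whose owner field is null."""
    return sum(1 for row in rows if row.owner is None)


rows = [
    PartitionStatus("events", 0, 120, 125, "consumer-a"),
    PartitionStatus("events", 1, 300, 300, "consumer-b"),
    PartitionStatus("events", 2, 40, 90, None),
]

unowned = no_owner_partitions_count(rows)
print(unowned)  # an alert would fire whenever this is not zero
```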

On Tue, Jan 27, 2015 at 10:26 PM, Steven Wu <st...@gmail.com> wrote:

> Joel, thanks for the clarifications.
>
> Maybe I misunderstood the intent of that metric. Yes, we are looking to
> alert if some partitions aren't owned by any consumer in the group
> (just in case this ever happens).
>
> Yes, the MaxLag mbeans only apply to partitions owned by a consumer.
>
> looking forward to the new consumer :)



-- 
-- Guozhang

Re: [DISCUSSION] generate explicit error/failing metrics

Posted by Steven Wu <st...@gmail.com>.

Joel, thanks for the clarifications.

Maybe I misunderstood the intent of that metric. Yes, we are looking to
alert if some partitions aren't owned by any consumer in the group
(just in case this ever happens).

Yes, the MaxLag mbeans only apply to partitions owned by a consumer.

looking forward to the new consumer :)

On Tue, Jan 27, 2015 at 8:55 PM, Joel Koshy <jj...@gmail.com> wrote:

> I'm not sure if I'm misunderstanding the suggestion, but this metric
> was never intended for alerts. Some metrics are more for informational
> purposes than for setting up alerts. In fact it is possible for some
> consumers to have zero owned partitions if there are fewer partitions
> than consumers in the group.
>
> I think you are looking for some mechanism to determine whether a
> particular partition is not owned by any instance in the group.
> That is a bit difficult to do directly in the current high-level
> consumer. Instead, you can monitor the consumer lag using the
> consumer offset checker, which is not ideal since it is not
> integrated into the consumer. The consumer does have lag mbeans, but
> those apply only to partitions that are owned. This concern can be
> addressed with the new consumer.

Re: [DISCUSSION] generate explicit error/failing metrics

Posted by Joel Koshy <jj...@gmail.com>.

I'm not sure if I'm misunderstanding the suggestion, but this metric
was never intended for alerts. Some metrics are more for informational
purposes than for setting up alerts. In fact it is possible for some
consumers to have zero owned partitions if there are fewer partitions
than consumers in the group.
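
As a small illustration of that last point, here is a simplified even-split
assignment. It is a sketch only, not the consumer's actual rebalance code,
and the consumer ids are made up.

```python
from typing import Dict, List


def assign_partitions(num_partitions: int,
                      consumers: List[str]) -> Dict[str, List[int]]:
    """Split partitions 0..n-1 across consumers in contiguous ranges."""
    ordered = sorted(consumers)
    per_consumer, extra = divmod(num_partitions, len(ordered))
    assignment: Dict[str, List[int]] = {}
    start = 0
    for index, consumer in enumerate(ordered):
        # The first `extra` consumers take one additional partition.
        count = per_consumer + (1 if index < extra else 0)
        assignment[consumer] = list(range(start, start + count))
        start += count
    return assignment


# 3 partitions shared by 5 consumers: two consumers own nothing.
assignment = assign_partitions(3, ["c1", "c2", "c3", "c4", "c5"])
idle = [consumer for consumer, parts in assignment.items() if not parts]
print(assignment)
print(idle)
```

Here every partition has an owner, yet two consumers legitimately report
zero owned partitions.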

I think you are looking for some mechanism to determine whether a
particular partition is not owned by any instance in the group.
That is a bit difficult to do directly in the current high-level
consumer. Instead, you can monitor the consumer lag using the
consumer offset checker, which is not ideal since it is not
integrated into the consumer. The consumer does have lag mbeans, but
those apply only to partitions that are owned. This concern can be
addressed with the new consumer.
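
To make the gap concrete, here is a sketch of lag computed as log end offset
minus consumer offset, once over owned partitions only and once over all
partitions. The data layout and the numbers are invented for illustration.

```python
from typing import Dict, Optional, Tuple

# partition -> (consumer_offset, log_end_offset, owner or None)
Status = Dict[int, Tuple[int, int, Optional[str]]]


def max_lag(status: Status, owned_only: bool) -> int:
    """Largest lag, optionally ignoring partitions without an owner."""
    lags = [
        log_end - offset
        for offset, log_end, owner in status.values()
        if owner is not None or not owned_only
    ]
    return max(lags, default=0)


status: Status = {
    0: (100, 105, "c1"),
    1: (200, 200, "c2"),
    2: (50, 900, None),  # unowned partition that keeps falling behind
}

owned_view = max_lag(status, owned_only=True)
full_view = max_lag(status, owned_only=False)
print(owned_view, full_view)
```

The owned-only view stays small while the unowned partition falls further
behind, which is why an external check over all partitions is needed.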

On Tue, Jan 27, 2015 at 03:20:55PM -0800, Steven Wu wrote:
> To illustrate my point, I will use the "allTopicsOwnedPartitionsCount" gauge
> from ZookeeperConsumerConnector as an example. It captures the number of
> partitions of a topic that have been assigned an owner in the consumer group.
> Let's say that I have a topic with 9 partitions. This metric should
> normally report the value 9, so I can set up an alert
> if allTopicsOwnedPartitionsCount < 9.
>
> Here are the drawbacks of this kind of metric.
> 1) If our metrics reporting/aggregation system has data loss and causes the
> value to be reported as zero, we can't really distinguish whether it is a
> real error or just data loss, so we can get false alarms from data loss.
> 2) If we change the number of partitions (e.g. from 9 to 18), we need to
> remember to change the alert rule to "allTopicsOwnedPartitionsCount < 18".
> This kind of coupling is a maintenance nightmare.
>
> A more explicit metric is "NoOwnerPartitionsCount". It should normally be
> zero. If it is not zero, we should be alerted. This way, we won't get
> false alarms from data loss.
>
> We don't have to change/fix this particular example since a new consumer is
> being worked on. But in the new consumer, please consider more explicit
> error signals.
> 
> Thanks,
> Steven