Posted to jira@kafka.apache.org by "Per Steffensen (JIRA)" <ji...@apache.org> on 2018/01/30 14:57:00 UTC

[jira] [Created] (KAFKA-6505) Add simple raw "offset-commit-failures", "offset-commits" and "offset-commit-successes" count metric

Per Steffensen created KAFKA-6505:
-------------------------------------

             Summary: Add simple raw "offset-commit-failures", "offset-commits" and "offset-commit-successes" count metric
                 Key: KAFKA-6505
                 URL: https://issues.apache.org/jira/browse/KAFKA-6505
             Project: Kafka
          Issue Type: Improvement
            Reporter: Per Steffensen


MBean "kafka.connect:type=connector-task-metrics,connector=<connector-name>,task=x" has several attributes. Most of them seem to be avg/max/pct over the entire lifetime of the process. They are not very useful when monitoring a system, where you typically want to see when there have been problems and whether there are problems right now.

E.g. I would like to show an administrator when offset-commits have been failing (e.g. timing out), including whether they are failing right now. It is really hard to do that properly using only the attribute "offset-commit-failure-percentage". You can expose a number telling how much the percentage has changed between two consecutive polls of the metric: if it went up, we saw offset-commit failures, and if it went down (or is stable at 0), we saw offset-commit successes. That only works as long as the system has not been running for so long that a single failing offset-commit no longer changes the percentage noticeably. But it is really odd to do it this way.
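
To illustrate why the percentage-delta workaround stops working on a long-running process, here is a minimal sketch of the arithmetic. It assumes the attribute is a lifetime ratio of failed to attempted commits expressed as a percentage; the class and method names are made up for illustration and are not part of Kafka.

```java
public class LifetimePercentageSketch {
    // Lifetime failure percentage as a monitoring client would read it
    // (assumption: failures / attempts * 100 over the whole process lifetime).
    public static double failurePercentage(long failures, long attempts) {
        return attempts == 0 ? 0.0 : 100.0 * failures / attempts;
    }

    // How far one additional failed commit moves the lifetime percentage.
    public static double changeFromOneFailure(long failures, long attempts) {
        return failurePercentage(failures + 1, attempts + 1)
                - failurePercentage(failures, attempts);
    }

    public static void main(String[] args) {
        // Number of successful commits already recorded before the failure.
        long[] priorAttempts = {10, 1_000, 1_000_000};
        for (long attempts : priorAttempts) {
            System.out.printf(
                    "after %d clean commits, one failure moves the percentage by %.6f points%n",
                    attempts, changeFromOneFailure(0, attempts));
        }
    }
}
```

The more commits the process has already made, the smaller the change caused by one new failure, so a poller comparing two consecutive readings eventually cannot tell a failure from rounding noise.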

*I would like to see an attribute "offset-commit-failures" that simply counts how many offset-commits have failed, as an ever-increasing number. Maybe also attributes "offset-commits" and "offset-commit-successes". Then I can take the delta between the last two metric-polls to show how many offset-commit-attempts have failed "very recently". Let this ticket be about that particular added attribute (or the three added attributes).*
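
A rough sketch of what the polling side could do with such a counter. It assumes the value of the proposed "offset-commit-failures" attribute has already been read (e.g. over JMX) and is passed in as a long; the class name and the reset handling are assumptions for illustration, not an existing Kafka API.

```java
import java.util.OptionalLong;

public class CounterDeltaSketch {
    // Last value seen for the counter; -1 means no poll has happened yet.
    private long previous = -1;

    // Returns how much the counter grew since the previous poll.
    // The first poll has nothing to compare against, so it returns empty.
    // If the counter went backwards (assumed to mean the process restarted
    // and the counter started again from 0), the current value is the delta.
    public OptionalLong update(long current) {
        long prev = previous;
        previous = current;
        if (prev < 0) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(current >= prev ? current - prev : current);
    }

    public static void main(String[] args) {
        CounterDeltaSketch failures = new CounterDeltaSketch();
        // Hypothetical values from four consecutive polls of the counter.
        long[] polled = {3, 3, 5, 1};
        for (long value : polled) {
            System.out.println("polled " + value + ", delta " + failures.update(value));
        }
    }
}
```

A non-zero delta means offset-commits failed between the last two polls, regardless of how long the process has been running.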



Just a note on metrics IMHO (should probably be posted somewhere else):

In general, consider getting rid of stuff like avg, max, pct over the entire lifetime of the process - current state is what interests people, especially when it comes to failure-related metrics (failure-pct over the lifetime of the process is not very useful). And people will continuously be polling and storing the metrics, so we will have a history of "current state" somewhere else (e.g. in Prometheus). Just give us the raw counts. Modern monitoring tools can do all the avg, max, pct for you based on a time-series of metrics-poll-results - and they can do it for periods of your choice (e.g. average over the last minute or 5 minutes) - have a look at Prometheus PromQL (e.g. used through Grafana). Just expose the raw numbers and let the average/max/min/pct calculation be done on the collection/presentation side. Only do "advanced" stuff for cases that are very interesting and where it cannot be done based on simple raw numbers (e.g. percentiles), and consider whether doing it for fairly short intervals is better than for the entire lifetime of the process.
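
As a sketch of the kind of calculation the collection side can do once it has stored raw counts, the following computes a failure percentage over a trailing window of your choice from a time-series of poll results. The Sample type, its field names and the sample values are hypothetical; counter resets are not handled, and samples are assumed to be sorted by time.

```java
import java.util.List;

public class WindowedFailureRateSketch {
    // One stored poll result: a timestamp plus two ever-increasing counters.
    public static final class Sample {
        public final long epochSeconds;
        public final long commits;
        public final long failures;

        public Sample(long epochSeconds, long commits, long failures) {
            this.epochSeconds = epochSeconds;
            this.commits = commits;
            this.failures = failures;
        }
    }

    // Failure percentage over the trailing window, taken as the difference
    // between the newest sample and the oldest sample still inside the window.
    // Returns NaN when there are fewer than two samples or no commits in the window.
    public static double failurePercentage(List<Sample> samples, long windowSeconds) {
        if (samples.size() < 2) {
            return Double.NaN;
        }
        Sample newest = samples.get(samples.size() - 1);
        Sample oldest = newest;
        for (Sample s : samples) {
            if (s.epochSeconds >= newest.epochSeconds - windowSeconds) {
                oldest = s;
                break;
            }
        }
        long commits = newest.commits - oldest.commits;
        long failures = newest.failures - oldest.failures;
        return commits <= 0 ? Double.NaN : 100.0 * failures / commits;
    }

    public static void main(String[] args) {
        // Hypothetical polls taken one minute apart.
        List<Sample> samples = List.of(
                new Sample(0, 100, 0),
                new Sample(60, 110, 0),
                new Sample(120, 120, 5),
                new Sample(180, 130, 5));
        System.out.println("last 1 minute:  " + failurePercentage(samples, 60));
        System.out.println("last 2 minutes: " + failurePercentage(samples, 120));
    }
}
```

The same stored counts answer both questions, which is the point: the window length is chosen when reading the data, not when the metric is exposed.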



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)