Posted to jira@kafka.apache.org by "jsancio (via GitHub)" <gi...@apache.org> on 2023/02/07 18:08:18 UTC

[GitHub] [kafka] jsancio commented on a diff in pull request #13207: KAFKA-14664; Fix inaccurate raft idle ratio metric

jsancio commented on code in PR #13207:
URL: https://github.com/apache/kafka/pull/13207#discussion_r1098994575


##########
raft/src/main/java/org/apache/kafka/raft/internals/KafkaRaftMetrics.java:
##########
@@ -133,26 +131,27 @@ public KafkaRaftMetrics(Metrics metrics, String metricGrpPrefix, QuorumState sta
                 "The average number of records appended per sec as the leader of the raft quorum."),
                 new Rate(TimeUnit.SECONDS, new WindowedSum()));
 
-        this.pollIdleSensor = metrics.sensor("poll-idle-ratio");
-        this.pollIdleSensor.add(metrics.metricName("poll-idle-ratio-avg",
+        this.pollDurationSensor = metrics.sensor("poll-idle-ratio");
+        this.pollDurationSensor.add(metrics.metricName(

Review Comment:
   Minor, but I would add a newline before `metrics.metricName`.



##########
raft/src/main/java/org/apache/kafka/raft/internals/KafkaRaftMetrics.java:
##########
@@ -133,26 +131,27 @@ public KafkaRaftMetrics(Metrics metrics, String metricGrpPrefix, QuorumState sta
                 "The average number of records appended per sec as the leader of the raft quorum."),
                 new Rate(TimeUnit.SECONDS, new WindowedSum()));
 
-        this.pollIdleSensor = metrics.sensor("poll-idle-ratio");
-        this.pollIdleSensor.add(metrics.metricName("poll-idle-ratio-avg",
+        this.pollDurationSensor = metrics.sensor("poll-idle-ratio");
+        this.pollDurationSensor.add(metrics.metricName(
+                "poll-idle-ratio-avg",
                 metricGroupName,
-                "The average fraction of time the client's poll() is idle as opposed to waiting for the user code to process records."),
-                new Avg());
+                "The ratio of time the Raft IO thread is idle as opposed to " +
+                    "doing work (e.g. handling requests or replicating from the leader)"
+            ),
+            new TimeRatio(1.0)
+        );
     }
 
     public void updatePollStart(long currentTimeMs) {
-        if (pollEndMs.isPresent() && pollStartMs.isPresent()) {
-            long pollTimeMs = Math.max(pollEndMs.getAsLong() - pollStartMs.getAsLong(), 0L);
-            long totalTimeMs = Math.max(currentTimeMs - pollStartMs.getAsLong(), 1L);
-            this.pollIdleSensor.record(pollTimeMs / (double) totalTimeMs, currentTimeMs);
-        }
-
         this.pollStartMs = OptionalLong.of(currentTimeMs);
-        this.pollEndMs = OptionalLong.empty();
     }
 
     public void updatePollEnd(long currentTimeMs) {
-        this.pollEndMs = OptionalLong.of(currentTimeMs);
+        if (pollStartMs.isPresent()) {
+            long pollDurationMs = Math.max(currentTimeMs - pollStartMs.getAsLong(), 0L);

Review Comment:
   Instead of taking the max, should we throw `IllegalArgumentException` or `IllegalStateException` if the difference is negative?
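
   For illustration, a minimal sketch of the stricter variant. `PollTimer` is a hypothetical stand-in and not the real `KafkaRaftMetrics`; the exception type and message are assumptions, and which of the two exceptions fits better is still open:

```java
import java.util.OptionalLong;

// Hypothetical stand-in for the poll timing in KafkaRaftMetrics (not the PR's code).
final class PollTimer {
    private OptionalLong pollStartMs = OptionalLong.empty();

    void updatePollStart(long currentTimeMs) {
        this.pollStartMs = OptionalLong.of(currentTimeMs);
    }

    // Returns the poll duration, or empty if no poll start has been recorded yet.
    OptionalLong updatePollEnd(long currentTimeMs) {
        if (!pollStartMs.isPresent()) {
            return OptionalLong.empty();
        }
        long startMs = pollStartMs.getAsLong();
        if (currentTimeMs < startMs) {
            // Fail loudly instead of clamping a negative duration to zero.
            throw new IllegalArgumentException(
                "Poll end time " + currentTimeMs + " is before poll start time " + startMs);
        }
        return OptionalLong.of(currentTimeMs - startMs);
    }

    public static void main(String[] args) {
        PollTimer timer = new PollTimer();
        timer.updatePollStart(100L);
        System.out.println(timer.updatePollEnd(150L).getAsLong());
    }
}
```

   The trade-off in this sketch is that a caller passing timestamps out of order fails fast rather than silently recording a zero duration.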



##########
raft/src/test/java/org/apache/kafka/raft/internals/KafkaRaftMetricsTest.java:
##########
@@ -190,25 +190,48 @@ public void shouldRecordNumUnknownVoterConnections() throws IOException {
     }
 
     @Test
-    public void shouldRecordPollIdleRatio() throws IOException {
+    public void shouldRecordPollIdleRatio() {
         QuorumState state = buildQuorumState(Collections.singleton(localId));
         state.initialize(new OffsetAndEpoch(0L, 0));
         raftMetrics = new KafkaRaftMetrics(metrics, "raft", state);
 
+        // First recording is discarded (in order to align the interval of measurement)
+        raftMetrics.updatePollStart(time.milliseconds());
+        raftMetrics.updatePollEnd(time.milliseconds());
+
+        // Idle for 100ms
         raftMetrics.updatePollStart(time.milliseconds());
         time.sleep(100L);
         raftMetrics.updatePollEnd(time.milliseconds());
-        time.sleep(900L);
+
+        // Busy for 100ms
+        time.sleep(100L);
+
+        // Idle for 200ms
         raftMetrics.updatePollStart(time.milliseconds());
+        time.sleep(200L);
+        raftMetrics.updatePollEnd(time.milliseconds());
 
-        assertEquals(0.1, getMetric(metrics, "poll-idle-ratio-avg").metricValue());
+        assertEquals(0.75, getMetric(metrics, "poll-idle-ratio-avg").metricValue());
 
+        // Busy for 100ms
         time.sleep(100L);
+
+        // Idle for 75ms
+        raftMetrics.updatePollStart(time.milliseconds());
+        time.sleep(75L);
         raftMetrics.updatePollEnd(time.milliseconds());
-        time.sleep(100L);
+
+        // Idle for 25ms
         raftMetrics.updatePollStart(time.milliseconds());
+        time.sleep(25L);
+        raftMetrics.updatePollEnd(time.milliseconds());
+
+        // Idle for 0ms
+        raftMetrics.updatePollStart(time.milliseconds());
+        raftMetrics.updatePollEnd(time.milliseconds());
 
-        assertEquals(0.3, getMetric(metrics, "poll-idle-ratio-avg").metricValue());
+        assertEquals(0.5, getMetric(metrics, "poll-idle-ratio-avg").metricValue());

Review Comment:
   Should we add a test for measuring the metric in between `updatePollStart` and `updatePollEnd`?
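
   A rough sketch of the scenario, using a simplified model rather than the real classes. `SimpleTimeRatio` is hypothetical: its `measure()` mirrors the diff without the `MetricConfig` argument, while the accumulating branch of `record()` is an assumption, since that part is not visible in the quoted hunk. Under that assumption, a measurement taken mid-poll only covers time up to the last recorded poll end:

```java
// Simplified model for illustration only; not the Kafka classes.
final class MidPollMeasureExample {

    // Hypothetical stand-in for TimeRatio.
    static final class SimpleTimeRatio {
        private long intervalStartTimestampMs = -1;
        private long lastRecordedTimestampMs = -1;
        private double totalRecordedDurationMs = 0;
        private final double defaultRatio;

        SimpleTimeRatio(double defaultRatio) {
            this.defaultRatio = defaultRatio;
        }

        // Mirrors measure() from the diff.
        double measure() {
            if (lastRecordedTimestampMs < 0) {
                return defaultRatio;
            }
            double intervalDurationMs =
                Math.max(lastRecordedTimestampMs - intervalStartTimestampMs, 0);
            final double ratio;
            if (intervalDurationMs == 0) {
                ratio = defaultRatio;
            } else if (totalRecordedDurationMs > intervalDurationMs) {
                ratio = 1.0;
            } else {
                ratio = totalRecordedDurationMs / intervalDurationMs;
            }
            intervalStartTimestampMs = lastRecordedTimestampMs;
            totalRecordedDurationMs = 0;
            return ratio;
        }

        void record(double value, long currentTimestampMs) {
            if (intervalStartTimestampMs < 0) {
                // The first value only marks the interval start (as in the diff).
                intervalStartTimestampMs = currentTimestampMs;
            } else {
                // ASSUMPTION: this branch is not shown in the quoted hunk.
                totalRecordedDurationMs += value;
                lastRecordedTimestampMs = currentTimestampMs;
            }
        }
    }

    // Returns {ratio measured mid-poll, ratio measured after that poll ends}.
    static double[] run() {
        SimpleTimeRatio ratio = new SimpleTimeRatio(1.0);
        ratio.record(0, 0);      // t=0: first recording, discarded
        ratio.record(100, 100);  // idle poll from t=0 to t=100
        // Busy from t=100 to t=200. The next poll starts at t=200 and the
        // metric is read at t=250, before that poll has ended.
        double midPoll = ratio.measure();
        ratio.record(100, 300);  // the poll ends at t=300 after 100ms idle
        double afterPoll = ratio.measure();
        return new double[] {midPoll, afterPoll};
    }

    public static void main(String[] args) {
        double[] result = run();
        System.out.println(result[0] + " " + result[1]); // prints "1.0 0.5"
    }
}
```

   In this model the mid-poll read reports 1.0 (only the finished 100ms idle poll is in the interval), and the read after the poll ends reports 0.5 (100ms idle over the 200ms since the previous poll end).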



##########
raft/src/main/java/org/apache/kafka/raft/internals/TimeRatio.java:
##########
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.kafka.raft.internals;
+
+import org.apache.kafka.common.metrics.MeasurableStat;
+import org.apache.kafka.common.metrics.MetricConfig;
+
+/**
+ * Maintains an approximate ratio of the duration of a specific event
+ * over all time. For example, this can be used to compute the ratio of
+ * time that a thread is busy or idle. The value is approximate since the
+ * measurement and recording intervals may not be aligned.
+ *
+ * Note that the duration of the event is assumed to be small relative to
+ * the interval of measurement.
+ *
+ */
+public class TimeRatio implements MeasurableStat {
+    private long intervalStartTimestampMs = -1;
+    private long lastRecordedTimestampMs = -1;
+    private double totalRecordedDurationMs = 0;
+
+    private final double defaultRatio;
+
+    public TimeRatio(double defaultRatio) {
+        this.defaultRatio = defaultRatio;

Review Comment:
   Should this check that `defaultRatio` is between `0.0` and `1.0`?
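
   Something along these lines, as a sketch only. `RatioValidation.requireRatio` is a hypothetical helper; the name, exception type, and message are illustrative and not part of the PR:

```java
// Hypothetical helper; not part of the PR.
final class RatioValidation {

    // Returns the value unchanged if it lies in [0.0, 1.0], otherwise throws.
    static double requireRatio(double defaultRatio) {
        // Written as a negated range check so that NaN is rejected as well.
        if (!(defaultRatio >= 0.0 && defaultRatio <= 1.0)) {
            throw new IllegalArgumentException(
                "Invalid ratio: value " + defaultRatio + " is not between 0 and 1.");
        }
        return defaultRatio;
    }

    public static void main(String[] args) {
        System.out.println(requireRatio(1.0));
    }
}
```

   The constructor could then assign `this.defaultRatio = requireRatio(defaultRatio)`, or inline the same check.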



##########
raft/src/test/java/org/apache/kafka/raft/internals/TimeRatioTest.java:
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.kafka.raft.internals;
+
+import org.apache.kafka.common.metrics.MetricConfig;
+import org.apache.kafka.common.utils.MockTime;
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+
+class TimeRatioTest {
+
+    @Test
+    public void testRatio() {
+        MetricConfig config = new MetricConfig();
+        MockTime time = new MockTime();
+        TimeRatio ratio = new TimeRatio(1.0);
+
+        ratio.record(config, 0.0, time.milliseconds());
+        time.sleep(10);
+        ratio.record(config, 10, time.milliseconds());
+        time.sleep(10);
+        ratio.record(config, 0, time.milliseconds());
+        assertEquals(0.5, ratio.measure(config, time.milliseconds()));
+
+        time.sleep(10);
+        ratio.record(config, 10, time.milliseconds());
+        time.sleep(40);
+        ratio.record(config, 0, time.milliseconds());
+        assertEquals(0.2, ratio.measure(config, time.milliseconds()));
+    }
+
+}

Review Comment:
   Missing newline at the end of the file.



##########
raft/src/main/java/org/apache/kafka/raft/internals/TimeRatio.java:
##########
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.kafka.raft.internals;
+
+import org.apache.kafka.common.metrics.MeasurableStat;
+import org.apache.kafka.common.metrics.MetricConfig;
+
+/**
+ * Maintains an approximate ratio of the duration of a specific event
+ * over all time. For example, this can be used to compute the ratio of
+ * time that a thread is busy or idle. The value is approximate since the
+ * measurement and recording intervals may not be aligned.
+ *
+ * Note that the duration of the event is assumed to be small relative to
+ * the interval of measurement.
+ *
+ */
+public class TimeRatio implements MeasurableStat {
+    private long intervalStartTimestampMs = -1;
+    private long lastRecordedTimestampMs = -1;
+    private double totalRecordedDurationMs = 0;
+
+    private final double defaultRatio;
+
+    public TimeRatio(double defaultRatio) {
+        this.defaultRatio = defaultRatio;
+    }
+
+    @Override
+    public double measure(MetricConfig config, long currentTimestampMs) {
+        if (lastRecordedTimestampMs < 0) {
+            // Return the default value if no recordings have been captured.
+            return defaultRatio;
+        } else {
+            // We measure the ratio over the
+            double intervalDurationMs = Math.max(lastRecordedTimestampMs - intervalStartTimestampMs, 0);
+            final double ratio;
+            if (intervalDurationMs == 0) {
+                ratio = defaultRatio;
+            } else if (totalRecordedDurationMs > intervalDurationMs) {
+                ratio = 1.0;
+            } else {
+                ratio = totalRecordedDurationMs / intervalDurationMs;
+            }
+
+            // The next interval begins at the
+            intervalStartTimestampMs = lastRecordedTimestampMs;
+            totalRecordedDurationMs = 0;
+            return ratio;
+        }
+    }
+
+    @Override
+    public void record(MetricConfig config, double value, long currentTimestampMs) {
+        if (intervalStartTimestampMs < 0) {
+            // Discard the initial value since the value occurred prior to the interval start
+            intervalStartTimestampMs = currentTimestampMs;

Review Comment:
   Got it. To remove this restriction, we would have to change the `Sensor` API to allow setting the interval start time without recording a `value`.


