Posted to commits@kafka.apache.org by ju...@apache.org on 2013/09/05 06:59:39 UTC

svn commit: r1520209 - /kafka/site/08/ops.html

Author: junrao
Date: Thu Sep  5 04:59:38 2013
New Revision: 1520209

URL: http://svn.apache.org/r1520209
Log:
list important jmx beans in 0.8 doc

Modified:
    kafka/site/08/ops.html

Modified: kafka/site/08/ops.html
URL: http://svn.apache.org/viewvc/kafka/site/08/ops.html?rev=1520209&r1=1520208&r2=1520209&view=diff
==============================================================================
--- kafka/site/08/ops.html (original)
+++ kafka/site/08/ops.html Thu Sep  5 04:59:38 2013
@@ -171,24 +171,122 @@ Kafka uses Yammer Metrics for metrics re
 The easiest way to see the available metrics is to fire up jconsole and point it at a running kafka client or server; this will allow browsing all metrics with JMX.
 <p>
 We do graphing and alerting on the following metrics:
-<ul>
-	<li>The rate of data in and out of the cluster and the number of messages written
-	<li>The log flush rate and the time taken to flush the log
-	<li>The number of partitions that have replicas that are down or have fallen behind and are underreplicated.
-	<li>Is the controller active? Answer had better be yes.
-	<li>Unclean leader elections. This shouldn't happen.
-	<li>Number of partitions each node is the leader for.
-	<li>Leader elections: we track each time this happens and how long it took
-	<li>Any changes to the ISR
-	<li>The lag in messages per partition in the follower. If a broker is restarted, these metrics tell you how quickly the followers are catching up.
-	<li>The number of produce requests waiting on replication to report back
-	<li>The number of fetch requests waiting on data to arrive
-	<li>Avg and 99th percentile time for each request for waiting in queue, local processing, and waiting on other servers
-	<li>The raw rate of incoming fetch and produce requests
-	<li>GC time and other stats
-	<li>Various server stats such as CPU utilization, I/O service time, etc.
-	<li>On the client side, the message/byte rate (global and per topic), request rate/size/time. On the consumer side, max lag in messages among all partitions and min fetch request rate. For a consumer to keep up, max lag needs to be less than a threshold and min fetch rate needs to be larger than 0.
-</ul>
+<table class="data-table">
+<tbody><tr>
+      <th>Description</th>
+      <th>MBean name</th>
+      <th>Normal value</th>
+    </tr>
+    <tr>
+      <td>Message in rate</td>
+      <td>"kafka.server":name="AllTopicsMessagesInPerSec",type="BrokerTopicMetrics"</td>
+      <td></td>
+    </tr>
+    <tr>
+      <td>Byte in rate</td>
+      <td>"kafka.server":name="AllTopicsBytesInPerSec",type="BrokerTopicMetrics"</td>
+      <td></td>
+    </tr>
+    <tr>
+      <td>Request rate</td>
+      <td>"kafka.network":name="{Produce|Fetch-Consumer|Fetch-Follower}-RequestsPerSec",type="RequestMetrics"</td>
+      <td></td>
+    </tr>
+    <tr>
+      <td>Byte out rate</td>
+      <td>"kafka.server":name="AllTopicsBytesOutPerSec",type="BrokerTopicMetrics"</td>
+      <td></td>
+    </tr>
+    <tr>
+      <td>Log flush rate and time</td>
+      <td>"kafka.log":name="LogFlushRateAndTimeMs",type="LogFlushStats"</td>
+      <td></td>
+    </tr>
+    <tr>
+      <td>Number of under-replicated partitions (|ISR| &lt; |all replicas|)</td>
+      <td>"kafka.server":name="UnderReplicatedPartitions",type="ReplicaManager"</td>
+      <td>0</td>
+    </tr>
+    <tr>
+      <td>Is controller active on broker</td>
+      <td>"kafka.controller":name="ActiveControllerCount",type="KafkaController"</td>
+      <td>only one broker in the cluster should have 1</td>
+    </tr>
+    <tr>
+      <td>Leader election rate</td>
+      <td>"kafka.controller":name="LeaderElectionRateAndTimeMs",type="ControllerStats"</td>
+      <td>non-zero when there are broker failures</td>
+    </tr>
+    <tr>
+      <td>Unclean leader election rate</td>
+      <td>"kafka.controller":name="UncleanLeaderElectionsPerSec",type="ControllerStats"</td>
+      <td>0</td>
+    </tr>
+    <tr>
+      <td>Partition counts</td>
+      <td>"kafka.server":name="PartitionCount",type="ReplicaManager"</td>
+      <td>mostly even across brokers</td>
+    </tr>
+    <tr>
+      <td>Leader replica counts</td>
+      <td>"kafka.server":name="LeaderCount",type="ReplicaManager"</td>
+      <td>mostly even across brokers</td>
+    </tr>
+    <tr>
+      <td>ISR expansion rate</td>
+      <td>"kafka.server":name="ISRExpandsPerSec",type="ReplicaManager"</td>
+      <td>non-zero only during broker startup</td>
+    </tr>
+    <tr>
+      <td>ISR shrink rate</td>
+      <td>"kafka.server":name="ISRShrinksPerSec",type="ReplicaManager"</td>
+      <td>0</td>
+    </tr>
+    <tr>
+      <td>Max lag in messages between the follower replicas and the leader replicas</td>
+      <td>"kafka.server":name="([-.\w]+)-MaxLag",type="ReplicaFetcherManager"</td>
+      <td>&lt; replica.lag.max.messages</td>
+    </tr>
+    <tr>
+      <td>Requests waiting in the producer purgatory</td>
+      <td>"kafka.server":name="PurgatorySize",type="ProducerRequestPurgatory"</td>
+      <td>non-zero if ack=-1 is used</td>
+    </tr>
+    <tr>
+      <td>Requests waiting in the fetch purgatory</td>
+      <td>"kafka.server":name="PurgatorySize",type="FetchRequestPurgatory"</td>
+      <td>size depends on fetch.wait.max.ms in the consumer</td>
+    </tr>
+    <tr>
+      <td>Request total time</td>
+      <td>"kafka.network":name="{Produce|Fetch-Consumer|Fetch-Follower}-TotalTimeMs",type="RequestMetrics"</td>
+      <td>broken into queue, local, remote and response send time</td>
+    </tr>
+    <tr>
+      <td>Time the request waits in the request queue</td>
+      <td>"kafka.network":name="{Produce|Fetch-Consumer|Fetch-Follower}-QueueTimeMs",type="RequestMetrics"</td>
+      <td></td>
+    </tr>
+    <tr>
+      <td>Time the request is processed at the leader</td>
+      <td>"kafka.network":name="{Produce|Fetch-Consumer|Fetch-Follower}-LocalTimeMs",type="RequestMetrics"</td>
+      <td></td>
+    </tr>
+    <tr>
+      <td>Time the request waits for the follower</td>
+      <td>"kafka.network":name="{Produce|Fetch-Consumer|Fetch-Follower}-RemoteTimeMs",type="RequestMetrics"</td>
+      <td>non-zero for produce requests when ack=-1</td>
+    </tr>
+    <tr>
+      <td>Time to send the response</td>
+      <td>"kafka.network":name="{Produce|Fetch-Consumer|Fetch-Follower}-ResponseSendTimeMs",type="RequestMetrics"</td>
+      <td></td>
+    </tr>
+</tbody></table>
+
+We recommend monitoring GC time and other stats, as well as various server stats such as CPU utilization, I/O service time, etc.
+
+On the client side, we recommend monitoring the message/byte rate (global and per topic) and the request rate/size/time. On the consumer side, we recommend monitoring the max lag in messages among all partitions and the min fetch request rate. For a consumer to keep up, max lag needs to be less than a threshold and min fetch rate needs to be larger than 0.
 
 <h4>Audit</h4>
 The final alerting we do is on the correctness of the data delivery. We audit that every message that is sent is consumed by all consumers and measure the lag for this to occur. For important topics we alert if a certain completeness is not achieved in a certain time period. The details of this are discussed in KAFKA-260.
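
The MBean names listed in the table of this change can also be read programmatically rather than through jconsole. The following is a minimal sketch, not part of the commit: the class name KafkaJmxSketch and its helper methods are hypothetical, the service URL assumes the JVM's default RMI-based JMX connector, and the attribute name "Value" is an assumption about how gauge-style metrics are exposed (rate metrics would expose different attributes).

```java
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper class for illustration only.
final class KafkaJmxSketch {

    // Builds a name in the quoted form shown in the table, e.g.
    // "kafka.server":name="UnderReplicatedPartitions",type="ReplicaManager"
    static ObjectName mbeanName(String domain, String type, String name) {
        try {
            return new ObjectName(ObjectName.quote(domain)
                    + ":name=" + ObjectName.quote(name)
                    + ",type=" + ObjectName.quote(type));
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Connects to a JVM with remote JMX enabled and reads one attribute.
    // hostPort is a placeholder such as "localhost:9999".
    static Object readAttribute(String hostPort, ObjectName bean, String attribute)
            throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + hostPort + "/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            return conn.getAttribute(bean, attribute);
        }
    }

    public static void main(String[] args) throws Exception {
        ObjectName bean =
                mbeanName("kafka.server", "ReplicaManager", "UnderReplicatedPartitions");
        System.out.println(bean);
        if (args.length == 1) {
            // "Value" is an assumed attribute name for gauge-style metrics.
            System.out.println(readAttribute(args[0], bean, "Value"));
        }
    }
}
```

Run without arguments, the sketch only prints the constructed name. Passing a host:port argument attempts a remote read, which requires that the broker JVM was started with remote JMX enabled.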