Posted to commits@flink.apache.org by uc...@apache.org on 2017/01/24 09:56:06 UTC

[4/5] flink git commit: [FLINK-5446] [docs] Rework system-metrics section

[FLINK-5446] [docs] Rework system-metrics section


Project: http://git-wip-us.apache.org/repos/asf/flink/repo
Commit: http://git-wip-us.apache.org/repos/asf/flink/commit/db160401
Tree: http://git-wip-us.apache.org/repos/asf/flink/tree/db160401
Diff: http://git-wip-us.apache.org/repos/asf/flink/diff/db160401

Branch: refs/heads/release-1.2
Commit: db160401dd5f79f0d358e547b82dbd8c575dbef3
Parents: a62ffa6
Author: zentol <ch...@apache.org>
Authored: Fri Jan 13 12:18:34 2017 +0100
Committer: Ufuk Celebi <uc...@apache.org>
Committed: Tue Jan 24 10:53:10 2017 +0100

----------------------------------------------------------------------
 docs/monitoring/metrics.md | 315 ++++++++++++++++++++++++++++++++--------
 1 file changed, 255 insertions(+), 60 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/flink/blob/db160401/docs/monitoring/metrics.md
----------------------------------------------------------------------
diff --git a/docs/monitoring/metrics.md b/docs/monitoring/metrics.md
index 2103cfd..6dbc705 100644
--- a/docs/monitoring/metrics.md
+++ b/docs/monitoring/metrics.md
@@ -429,119 +429,270 @@ metrics.reporter.stsd.port: 8125
 
 ## System metrics
 
-Flink exposes the following system metrics:
+By default, Flink gathers several metrics that provide deep insight into its current state.
+This section is a reference for all of these metrics.
 
+The tables below generally feature 4 columns:
+
+* The "Scope" column describes which scope format is used to generate the system scope.
+  For example, if the cell contains "Operator" then the scope format for "metrics.scope.operator" is used.
+  If the cell contains multiple values separated by a slash, the metrics are reported once for each
+  entity, for example for both JobManagers and TaskManagers.
+
+* The (optional) "Infix" column describes which infix is appended to the system scope.
+
+* The "Metrics" column lists the names of all metrics that are registered for the given scope and infix.
+
+* The "Description" column provides information as to what a given metric is measuring.
+
+Note that all dots in the infix/metric name columns are still subject to the "metrics.delimiter" setting.
+
+Thus, to infer the metric identifier:
+
+1. Take the scope format based on the "Scope" column.
+2. Append the value in the "Infix" column, if present, accounting for the "metrics.delimiter" setting.
+3. Append the metric name.
+
+#### CPU:
 <table class="table table-bordered">
   <thead>
     <tr>
       <th class="text-left" style="width: 20%">Scope</th>
-      <th class="text-left">Metrics</th>
-      <th class="text-left">Description</th>
+      <th class="text-left" style="width: 25%">Infix</th>
+      <th class="text-left" style="width: 23%">Metrics</th>
+      <th class="text-left" style="width: 32%">Description</th>
     </tr>
   </thead>
   <tbody>
     <tr>
-      <th rowspan="1"><strong>JobManager</strong></th>
-      <td></td>
-      <td></td>
-    </tr>
-    <tr>
-      <th rowspan="2"><strong>TaskManager.Status</strong></th>
-      <td>Network.AvailableMemorySegments</td>
-      <td>The number of unused memory segments.</td>
-    </tr>
-    <tr>
-      <td>Network.TotalMemorySegments</td>
-      <td>The number of allocated memory segments.</td>
-    </tr>
-    <tr>
-      <th rowspan="19"><strong>TaskManager.Status.JVM</strong></th>
-      <td>ClassLoader.ClassesLoaded</td>
-      <td>The total number of classes loaded since the start of the JVM.</td>
-    </tr>
-    <tr>
-      <td>ClassLoader.ClassesUnloaded</td>
-      <td>The total number of classes unloaded since the start of the JVM.</td>
-    </tr>
-    <tr>
-      <td>GargabeCollector.&lt;garbageCollector&gt;.Count</td>
-      <td>The total number of collections that have occurred.</td>
+      <th rowspan="2"><strong>Job-/TaskManager</strong></th>
+      <td rowspan="2">Status.JVM.CPU</td>
+      <td>Load</td>
+      <td>The recent CPU usage of the JVM.</td>
     </tr>
     <tr>
-      <td>GargabeCollector.&lt;garbageCollector&gt;.Time</td>
-      <td>The total time spent performing garbage collection.</td>
+      <td>Time</td>
+      <td>The CPU time used by the JVM.</td>
     </tr>
-    <tr>
+  </tbody>
+</table>
+
+#### Memory:
+<table class="table table-bordered">
+  <thead>
+    <tr>
+      <th class="text-left" style="width: 20%">Scope</th>
+      <th class="text-left" style="width: 25%">Infix</th>
+      <th class="text-left" style="width: 23%">Metrics</th>
+      <th class="text-left" style="width: 32%">Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th rowspan="12"><strong>Job-/TaskManager</strong></th>
+      <td rowspan="12">Status.JVM.Memory</td>
-      <td>Memory.Heap.Used</td>
+      <td>Heap.Used</td>
       <td>The amount of heap memory currently used.</td>
     </tr>
     <tr>
-      <td>Memory.Heap.Committed</td>
+      <td>Heap.Committed</td>
       <td>The amount of heap memory guaranteed to be available to the JVM.</td>
     </tr>
     <tr>
-      <td>Memory.Heap.Max</td>
+      <td>Heap.Max</td>
       <td>The maximum amount of heap memory that can be used for memory management.</td>
     </tr>
     <tr>
-      <td>Memory.NonHeap.Used</td>
+      <td>NonHeap.Used</td>
       <td>The amount of non-heap memory currently used.</td>
     </tr>
     <tr>
-      <td>Memory.NonHeap.Committed</td>
+      <td>NonHeap.Committed</td>
       <td>The amount of non-heap memory guaranteed to be available to the JVM.</td>
     </tr>
     <tr>
-      <td>Memory.NonHeap.Max</td>
+      <td>NonHeap.Max</td>
       <td>The maximum amount of non-heap memory that can be used for memory management.</td>
     </tr>
     <tr>
-      <td>Memory.Direct.Count</td>
+      <td>Direct.Count</td>
       <td>The number of buffers in the direct buffer pool.</td>
     </tr>
     <tr>
-      <td>Memory.Direct.MemoryUsed</td>
+      <td>Direct.MemoryUsed</td>
       <td>The amount of memory used by the JVM for the direct buffer pool.</td>
     </tr>
     <tr>
-      <td>Memory.Direct.TotalCapacity</td>
+      <td>Direct.TotalCapacity</td>
       <td>The total capacity of all buffers in the direct buffer pool.</td>
     </tr>
     <tr>
-      <td>Memory.Mapped.Count</td>
+      <td>Mapped.Count</td>
       <td>The number of buffers in the mapped buffer pool.</td>
     </tr>
     <tr>
-      <td>Memory.Mapped.MemoryUsed</td>
+      <td>Mapped.MemoryUsed</td>
       <td>The amount of memory used by the JVM for the mapped buffer pool.</td>
     </tr>
     <tr>
-      <td>Memory.Mapped.TotalCapacity</td>
+      <td>Mapped.TotalCapacity</td>
-      <td>The number of buffers in the mapped buffer pool.</td>
+      <td>The total capacity of all buffers in the mapped buffer pool.</td>
+    </tr>
+  </tbody>
+</table>
+
+#### Threads:
+<table class="table table-bordered">
+  <thead>
+    <tr>
+      <th class="text-left" style="width: 20%">Scope</th>
+      <th class="text-left" style="width: 25%">Infix</th>
+      <th class="text-left" style="width: 23%">Metrics</th>
+      <th class="text-left" style="width: 32%">Description</th>
     </tr>
+  </thead>
+  <tbody>
     <tr>
+      <th rowspan="1"><strong>Job-/TaskManager</strong></th>
+      <td rowspan="1">Status.JVM.Threads</td>
-      <td>Threads.Count</td>
+      <td>Count</td>
       <td>The total number of live threads.</td>
     </tr>
+  </tbody>
+</table>
+
+#### GarbageCollection:
+<table class="table table-bordered">
+  <thead>
     <tr>
-      <td>CPU.Load</td>
-      <td>The recent CPU usage of the JVM.</td>
+      <th class="text-left" style="width: 20%">Scope</th>
+      <th class="text-left" style="width: 25%">Infix</th>
+      <th class="text-left" style="width: 23%">Metrics</th>
+      <th class="text-left" style="width: 32%">Description</th>
     </tr>
+  </thead>
+  <tbody>
     <tr>
-      <td>CPU.Time</td>
-      <td>The CPU time used by the JVM.</td>
+      <th rowspan="2"><strong>Job-/TaskManager</strong></th>
+      <td rowspan="2">Status.JVM.GarbageCollector</td>
+      <td>&lt;GarbageCollector&gt;.Count</td>
+      <td>The total number of collections that have occurred.</td>
     </tr>
     <tr>
-      <th rowspan="1"><strong>Job</strong></th>
-      <td></td>
-      <td></td>
+      <td>&lt;GarbageCollector&gt;.Time</td>
+      <td>The total time spent performing garbage collection.</td>
     </tr>
+  </tbody>
+</table>
+
+#### ClassLoader:
+<table class="table table-bordered">
+  <thead>
     <tr>
-      <th rowspan="7"><strong>Task</strong></th>
-      <td>currentLowWatermark</td>
-      <td>The lowest watermark a task has received.</td>
+      <th class="text-left" style="width: 20%">Scope</th>
+      <th class="text-left" style="width: 25%">Infix</th>
+      <th class="text-left" style="width: 23%">Metrics</th>
+      <th class="text-left" style="width: 32%">Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th rowspan="2"><strong>Job-/TaskManager</strong></th>
+      <td rowspan="2">Status.JVM.ClassLoader</td>
+      <td>ClassesLoaded</td>
+      <td>The total number of classes loaded since the start of the JVM.</td>
+    </tr>
+    <tr>
+      <td>ClassesUnloaded</td>
+      <td>The total number of classes unloaded since the start of the JVM.</td>
     </tr>
+  </tbody>
+</table>
+
+#### Network:
+<table class="table table-bordered">
+  <thead>
     <tr>
+      <th class="text-left" style="width: 20%">Scope</th>
+      <th class="text-left" style="width: 25%">Infix</th>
+      <th class="text-left" style="width: 25%">Metrics</th>
+      <th class="text-left" style="width: 30%">Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th rowspan="2"><strong>TaskManager</strong></th>
+      <td rowspan="2">Status.Network</td>
+      <td>AvailableMemorySegments</td>
+      <td>The number of unused memory segments.</td>
+    </tr>
+    <tr>
+      <td>TotalMemorySegments</td>
+      <td>The number of allocated memory segments.</td>
+    </tr>
+    <tr>
+      <th rowspan="4"><strong>Task</strong></th>
+      <td rowspan="4">buffers</td>
+      <td>inputQueueLength</td>
+      <td>The number of queued input buffers.</td>
+    </tr>
+    <tr>
+      <td>outputQueueLength</td>
+      <td>The number of queued output buffers.</td>
+    </tr>
+    <tr>
+      <td>inPoolUsage</td>
+      <td>An estimate of the input buffer usage.</td>
+    </tr>
+    <tr>
+      <td>outPoolUsage</td>
+      <td>An estimate of the output buffer usage.</td>
+    </tr>
+  </tbody>
+</table>
+
+#### Cluster:
+<table class="table table-bordered">
+  <thead>
+    <tr>
+      <th class="text-left" style="width: 20%">Scope</th>
+      <th class="text-left" style="width: 30%">Metrics</th>
+      <th class="text-left" style="width: 50%">Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th rowspan="4"><strong>JobManager</strong></th>
+      <td>numRegisteredTaskManagers</td>
+      <td>The number of registered taskmanagers.</td>
+    </tr>
+    <tr>
+      <td>numRunningJobs</td>
+      <td>The number of running jobs.</td>
+    </tr>
+    <tr>
+      <td>taskSlotsAvailable</td>
+      <td>The number of available task slots.</td>
+    </tr>
+    <tr>
+      <td>taskSlotsTotal</td>
+      <td>The total number of task slots.</td>
+    </tr>
+  </tbody>
+</table>
+
+#### Checkpointing:
+<table class="table table-bordered">
+  <thead>
+    <tr>
+      <th class="text-left" style="width: 20%">Scope</th>
+      <th class="text-left" style="width: 30%">Metrics</th>
+      <th class="text-left" style="width: 50%">Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th rowspan="3"><strong>Job (only available on JobManager)</strong></th>
       <td>lastCheckpointDuration</td>
       <td>The time it took to complete the last checkpoint.</td>
     </tr>
@@ -550,37 +701,81 @@ Flink exposes the following system metrics:
       <td>The total size of the last checkpoint.</td>
     </tr>
     <tr>
-      <td>restartingTime</td>
-      <td>The time it took to restart the job.</td>
+      <td>lastCheckpointExternalPath</td>
+      <td>The path where the last checkpoint was stored.</td>
+    </tr>
+    <tr>
+      <th rowspan="1"><strong>Task</strong></th>
+      <td>checkpointAlignmentTime</td>
+      <td>The time in nanoseconds that the last barrier alignment took to complete, or how long the current alignment has taken so far.</td>
+    </tr>
+  </tbody>
+</table>
+
+#### IO:
+<table class="table table-bordered">
+  <thead>
+    <tr>
+      <th class="text-left" style="width: 20%">Scope</th>
+      <th class="text-left" style="width: 30%">Metrics</th>
+      <th class="text-left" style="width: 50%">Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th rowspan="7"><strong>Task</strong></th>
+      <td>currentLowWatermark</td>
+      <td>The lowest watermark this task has received.</td>
     </tr>
     <tr>
       <td>numBytesInLocal</td>
       <td>The total number of bytes this task has read from a local source.</td>
     </tr>
     <tr>
+      <td>numBytesInLocalPerSecond</td>
+      <td>The number of bytes this task reads from a local source per second.</td>
+    </tr>
+    <tr>
       <td>numBytesInRemote</td>
       <td>The total number of bytes this task has read from a remote source.</td>
     </tr>
     <tr>
+      <td>numBytesInRemotePerSecond</td>
+      <td>The number of bytes this task reads from a remote source per second.</td>
+    </tr>
+    <tr>
       <td>numBytesOut</td>
       <td>The total number of bytes this task has emitted.</td>
     </tr>
     <tr>
-      <th rowspan="4"><strong>Operator</strong></th>
+      <td>numBytesOutPerSecond</td>
+      <td>The number of bytes this task emits per second.</td>
+    </tr>
+    <tr>
+      <th rowspan="4"><strong>Task/Operator</strong></th>
       <td>numRecordsIn</td>
-      <td>The total number of records this operator has received.</td>
+      <td>The total number of records this operator/task has received.</td>
+    </tr>
+    <tr>
+      <td>numRecordsInPerSecond</td>
+      <td>The number of records this operator/task receives per second.</td>
     </tr>
     <tr>
       <td>numRecordsOut</td>
-      <td>The total number of records this operator has emitted.</td>
+      <td>The total number of records this operator/task has emitted.</td>
     </tr>
     <tr>
-      <td>numSplitsProcessed</td>
-      <td>The total number of InputSplits this data source has processed (if the operator is a data source).</td>
+      <td>numRecordsOutPerSecond</td>
+      <td>The number of records this operator/task emits per second.</td>
     </tr>
     <tr>
+      <th rowspan="2"><strong>Operator</strong></th>
       <td>latency</td>
-      <td>A latency gauge reporting the latency distribution from the different sources.</td>
+      <td>The latency distributions from all incoming sources.</td>
+    </tr>
+    <tr>
+      <td>numSplitsProcessed</td>
+      <td>The total number of InputSplits this data source has processed (if the operator is a data source).</td>
     </tr>
   </tbody>
 </table>
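
The "System metrics" text in the diff above describes inferring a metric identifier in three steps: take the scope, append the infix if there is one, then append the metric name, with dots subject to the "metrics.delimiter" setting. The following is a minimal sketch of that rule only. The class `MetricIdentifierSketch`, its `identifier` method, and the example scope strings are hypothetical and not part of Flink; the sketch also assumes the scope string passed in has already been resolved from its scope format.

```java
import java.util.StringJoiner;

// Hypothetical helper, not part of Flink: it only mirrors the three steps from the docs.
final class MetricIdentifierSketch {

    /**
     * Builds an identifier from an already-resolved system scope, an optional infix
     * and a metric name, replacing dots in the infix and name with the delimiter.
     */
    static String identifier(String systemScope, String infix, String metricName, String delimiter) {
        StringJoiner joiner = new StringJoiner(delimiter);
        // Step 1: start from the system scope (assumed to be resolved by the caller).
        joiner.add(systemScope);
        // Step 2: append the infix, if present, honoring the delimiter.
        if (infix != null && !infix.isEmpty()) {
            joiner.add(infix.replace(".", delimiter));
        }
        // Step 3: append the metric name, also honoring the delimiter.
        joiner.add(metricName.replace(".", delimiter));
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Example scope values are made up for illustration.
        System.out.println(identifier("host-1.taskmanager.tm-42", "Status.JVM.Memory", "Heap.Used", "."));
        System.out.println(identifier("host-1_taskmanager_tm-42", "Status.JVM.Memory", "Heap.Used", "_"));
        System.out.println(identifier("host-1.jobmanager", null, "numRunningJobs", "."));
    }
}
```

With "." as the delimiter the infix and metric name pass through unchanged; with any other delimiter every dot in them is replaced, which is what the note about "metrics.delimiter" implies for the table entries.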