You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@flink.apache.org by nk...@apache.org on 2019/07/23 15:48:21 UTC

[flink-web] 05/05: Rebuild website

This is an automated email from the ASF dual-hosted git repository.

nkruber pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/flink-web.git

commit 606b5df34df5a3cc2ba216368de810a0c37b8247
Author: Nico Kruber <ni...@ververica.com>
AuthorDate: Tue Jul 23 17:47:00 2019 +0200

    Rebuild website
---
 content/2019/07/23/flink-network-stack-2.html      | 579 +++++++++++++++++++++
 content/blog/feed.xml                              | 360 +++++++++++++
 content/blog/index.html                            |  36 +-
 content/blog/page2/index.html                      |  38 +-
 content/blog/page3/index.html                      |  38 +-
 content/blog/page4/index.html                      |  38 +-
 content/blog/page5/index.html                      |  40 +-
 content/blog/page6/index.html                      |  40 +-
 content/blog/page7/index.html                      |  40 +-
 content/blog/page8/index.html                      |  40 +-
 content/blog/page9/index.html                      |  25 +
 content/css/flink.css                              |   5 +
 .../back_pressure_sampling_high.png                | Bin 0 -> 77546 bytes
 content/index.html                                 |   6 +-
 content/roadmap.html                               |   4 +-
 content/zh/community.html                          |   6 +
 content/zh/index.html                              |   6 +-
 17 files changed, 1177 insertions(+), 124 deletions(-)

diff --git a/content/2019/07/23/flink-network-stack-2.html b/content/2019/07/23/flink-network-stack-2.html
new file mode 100644
index 0000000..3601198
--- /dev/null
+++ b/content/2019/07/23/flink-network-stack-2.html
@@ -0,0 +1,579 @@
+<!DOCTYPE html>
+<html lang="en">
+  <head>
+    <meta charset="utf-8">
+    <meta http-equiv="X-UA-Compatible" content="IE=edge">
+    <meta name="viewport" content="width=device-width, initial-scale=1">
+    <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
+    <title>Apache Flink: Flink Network Stack Vol. 2: Monitoring, Metrics, and that Backpressure Thing</title>
+    <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon">
+    <link rel="icon" href="/favicon.ico" type="image/x-icon">
+
+    <!-- Bootstrap -->
+    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.4.1/css/bootstrap.min.css">
+    <link rel="stylesheet" href="/css/flink.css">
+    <link rel="stylesheet" href="/css/syntax.css">
+
+    <!-- Blog RSS feed -->
+    <link href="/blog/feed.xml" rel="alternate" type="application/rss+xml" title="Apache Flink Blog: RSS feed" />
+
+    <!-- jQuery (necessary for Bootstrap's JavaScript plugins) -->
+    <!-- We need to load Jquery in the header for custom google analytics event tracking-->
+    <script src="/js/jquery.min.js"></script>
+
+    <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
+    <!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
+    <!--[if lt IE 9]>
+      <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
+      <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
+    <![endif]-->
+  </head>
+  <body>  
+    
+
+    <!-- Main content. -->
+    <div class="container">
+    <div class="row">
+
+      
+     <div id="sidebar" class="col-sm-3">
+        
+
+<!-- Top navbar. -->
+    <nav class="navbar navbar-default">
+        <!-- The logo. -->
+        <div class="navbar-header">
+          <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+            <span class="icon-bar"></span>
+          </button>
+          <div class="navbar-logo">
+            <a href="/">
+              <img alt="Apache Flink" src="/img/flink-header-logo.svg" width="147px" height="73px">
+            </a>
+          </div>
+        </div><!-- /.navbar-header -->
+
+        <!-- The navigation links. -->
+        <div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
+          <ul class="nav navbar-nav navbar-main">
+
+            <!-- First menu section explains visitors what Flink is -->
+
+            <!-- What is Stream Processing? -->
+            <!--
+            <li><a href="/streamprocessing1.html">What is Stream Processing?</a></li>
+            -->
+
+            <!-- What is Flink? -->
+            <li><a href="/flink-architecture.html">What is Apache Flink?</a></li>
+
+            
+            <ul class="nav navbar-nav navbar-subnav">
+              <li >
+                  <a href="/flink-architecture.html">Architecture</a>
+              </li>
+              <li >
+                  <a href="/flink-applications.html">Applications</a>
+              </li>
+              <li >
+                  <a href="/flink-operations.html">Operations</a>
+              </li>
+            </ul>
+            
+
+            <!-- Use cases -->
+            <li><a href="/usecases.html">Use Cases</a></li>
+
+            <!-- Powered by -->
+            <li><a href="/poweredby.html">Powered By</a></li>
+
+            <!-- FAQ -->
+            <li><a href="/faq.html">FAQ</a></li>
+
+            &nbsp;
+            <!-- Second menu section aims to support Flink users -->
+
+            <!-- Downloads -->
+            <li><a href="/downloads.html">Downloads</a></li>
+
+            <!-- Quickstart -->
+            <li>
+              <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.8/quickstart/setup_quickstart.html" target="_blank">Tutorials <small><span class="glyphicon glyphicon-new-window"></span></small></a>
+            </li>
+
+            <!-- Documentation -->
+            <li class="dropdown">
+              <a class="dropdown-toggle" data-toggle="dropdown" href="#">Documentation<span class="caret"></span></a>
+              <ul class="dropdown-menu">
+                <li><a href="https://ci.apache.org/projects/flink/flink-docs-release-1.8" target="_blank">1.8 (Latest stable release) <small><span class="glyphicon glyphicon-new-window"></span></small></a></li>
+                <li><a href="https://ci.apache.org/projects/flink/flink-docs-master" target="_blank">1.9 (Snapshot) <small><span class="glyphicon glyphicon-new-window"></span></small></a></li>
+              </ul>
+            </li>
+
+            <!-- getting help -->
+            <li><a href="/gettinghelp.html">Getting Help</a></li>
+
+            <!-- Blog -->
+            <li><a href="/blog/"><b>Flink Blog</b></a></li>
+
+            &nbsp;
+
+            <!-- Third menu section aim to support community and contributors -->
+
+            <!-- Community -->
+            <li><a href="/community.html">Community &amp; Project Info</a></li>
+
+            <!-- Roadmap -->
+            <li><a href="/roadmap.html">Roadmap</a></li>
+
+            <!-- Contribute -->
+            <li><a href="/contributing/how-to-contribute.html">How to Contribute</a></li>
+            
+
+            <!-- GitHub -->
+            <li>
+              <a href="https://github.com/apache/flink" target="_blank">Flink on GitHub <small><span class="glyphicon glyphicon-new-window"></span></small></a>
+            </li>
+
+            &nbsp;
+
+            <!-- Language Switcher -->
+            <li>
+              
+                
+                  <a href="/zh/2019/07/23/flink-network-stack-2.html">中文版</a>
+                
+              
+            </li>
+
+          </ul>
+
+          <ul class="nav navbar-nav navbar-bottom">
+          <hr />
+
+            <!-- Twitter -->
+            <li><a href="https://twitter.com/apacheflink" target="_blank">@ApacheFlink <small><span class="glyphicon glyphicon-new-window"></span></small></a></li>
+
+            <!-- Visualizer -->
+            <li class=" hidden-md hidden-sm"><a href="/visualizer/" target="_blank">Plan Visualizer <small><span class="glyphicon glyphicon-new-window"></span></small></a></li>
+
+          </ul>
+        </div><!-- /.navbar-collapse -->
+    </nav>
+
+      </div>
+      <div class="col-sm-9">
+      <div class="row-fluid">
+  <div class="col-sm-12">
+    <div class="row">
+      <h1>Flink Network Stack Vol. 2: Monitoring, Metrics, and that Backpressure Thing</h1>
+
+      <article>
+        <p>23 Jul 2019 Nico Kruber  &amp; Piotr Nowojski </p>
+
+<style type="text/css">
+.tg  {border-collapse:collapse;border-spacing:0;}
+.tg td{padding:10px 10px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
+.tg th{padding:10px 10px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;background-color:#eff0f1;}
+.tg .tg-wide{padding:10px 30px;}
+.tg .tg-top{vertical-align:top}
+.tg .tg-topcenter{text-align:center;vertical-align:top}
+.tg .tg-center{text-align:center;vertical-align:center}
+</style>
+
+<p>In a <a href="/2019/06/05/flink-network-stack.html">previous blog post</a>, we presented how Flink’s network stack works from the high-level abstractions to the low-level details. This second blog post in the series of network stack posts extends on this knowledge and discusses monitoring network-related metrics to identify effects such as backpressure or bottlenecks in throughput and latency. Although this post briefly covers what to do with backpressure, the topic of tuning the netw [...]
+
+<div class="page-toc">
+<ul id="markdown-toc">
+  <li><a href="#monitoring" id="markdown-toc-monitoring">Monitoring</a>    <ul>
+      <li><a href="#backpressure-monitor" id="markdown-toc-backpressure-monitor">Backpressure Monitor</a></li>
+    </ul>
+  </li>
+  <li><a href="#network-metrics" id="markdown-toc-network-metrics">Network Metrics</a>    <ul>
+      <li><a href="#backpressure" id="markdown-toc-backpressure">Backpressure</a></li>
+      <li><a href="#resource-usage--throughput" id="markdown-toc-resource-usage--throughput">Resource Usage / Throughput</a></li>
+      <li><a href="#latency-tracking" id="markdown-toc-latency-tracking">Latency Tracking</a></li>
+    </ul>
+  </li>
+  <li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
+</ul>
+
+</div>
+
+<h2 id="monitoring">Monitoring</h2>
+
+<p>Probably the most important part of network monitoring is <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/back_pressure.html">monitoring backpressure</a>, a situation where a system is receiving data at a higher rate than it can process¹. Such behaviour will result in the sender being backpressured and may be caused by two things:</p>
+
+<ul>
+  <li>
+    <p>The receiver is slow.<br />
+This can happen because the receiver is backpressured itself, is unable to keep processing at the same rate as the sender, or is temporarily blocked by garbage collection, lack of system resources, or I/O.</p>
+  </li>
+  <li>
+    <p>The network channel is slow.<br />
+  Even though in such case the receiver is not (directly) involved, we call the sender backpressured due to a potential oversubscription on network bandwidth shared by all subtasks running on the same machine. Beware that, in addition to Flink’s network stack, there may be more network users, such as sources and sinks, distributed file systems (checkpointing, network-attached storage), logging, and metrics. A previous <a href="https://www.ververica.com/blog/how-to-size-your-apache-flink- [...]
+  </li>
+</ul>
+
+<p><sup>1</sup> In case you are unfamiliar with backpressure and how it interacts with Flink, we recommend reading through <a href="https://www.ververica.com/blog/how-flink-handles-backpressure">this blog post on backpressure</a> from 2015.</p>
+
+<p><br />
+If backpressure occurs, it will bubble upstream and eventually reach your sources and slow them down. This is not a bad thing per-se and merely states that you lack resources for the current load. However, you may want to improve your job so that it can cope with higher loads without using more resources. In order to do so, you need to find (1) where (at which task/operator) the bottleneck is and (2) what is causing it. Flink offers two mechanisms for identifying where the bottleneck is:</p>
+
+<ul>
+  <li>directly via Flink’s web UI and its backpressure monitor, or</li>
+  <li>indirectly through some of the network metrics.</li>
+</ul>
+
+<p>Flink’s web UI is likely the first entry point for a quick troubleshooting but has some disadvantages that we will explain below. On the other hand, Flink’s network metrics are better suited for continuous monitoring and reasoning about the exact nature of the bottleneck causing backpressure. We will cover both in the sections below. In both cases, you need to identify the origin of backpressure from the sources to the sinks. Your starting point for the current and future investigatio [...]
+
+<h3 id="backpressure-monitor">Backpressure Monitor</h3>
+
+<p>The <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/back_pressure.html">backpressure monitor</a> is only exposed via Flink’s web UI². Since it’s an active component that is only triggered on request, it is currently not available via metrics. The backpressure monitor samples the running tasks’ threads on all TaskManagers via <code>Thread.getStackTrace()</code> and computes the number of samples where tasks were blocked on a buffer request. These tasks w [...]
+
+<ul>
+  <li><span style="color:green">OK</span> for <code>ratio ≤ 0.10</code>,</li>
+  <li><span style="color:orange">LOW</span> for <code>0.10 &lt; Ratio ≤ 0.5</code>, and</li>
+  <li><span style="color:red">HIGH</span> for <code>0.5 &lt; Ratio ≤ 1</code>.</li>
+</ul>
+
+<p>Although you can tune things like the refresh-interval, the number of samples, or the delay between samples, normally, you would not need to touch these since the defaults already give good-enough results.</p>
+
+<center>
+<img src="/img/blog/2019-07-23-network-stack-2/back_pressure_sampling_high.png" width="600px" alt="Backpressure sampling:high" />
+</center>
+
+<p><sup>2</sup> You may also access the backpressure monitor via the REST API: <code>/jobs/:jobid/vertices/:vertexid/backpressure</code></p>
+
+<p><br />
+The backpressure monitor can help you find where (at which task/operator) backpressure originates from. However, it does not support you in further reasoning about the causes of it. Additionally, for larger jobs or higher parallelism, the backpressure monitor becomes too crowded to use and may also take some time to gather all information from all TaskManagers. Please also note that sampling may affect your running job’s performance.</p>
+
+<h2 id="network-metrics">Network Metrics</h2>
+
+<p><a href="https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/metrics.html#network">Network</a> and <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/metrics.html#io">task I/O</a> metrics are more lightweight than the backpressure monitor and are continuously published for each running job. We can leverage those and get even more insights, not only for backpressure monitoring. The most relevant metrics for users are:</p>
+
+<ul>
+  <li>
+    <p><strong><span style="color:orange">up to Flink 1.8:</span></strong> <code>outPoolUsage</code>, <code>inPoolUsage</code><br />
+An estimate on the ratio of buffers used vs. buffers available in the respective local buffer pools.
+While interpreting <code>inPoolUsage</code> in Flink 1.5 - 1.8 with credit-based flow control, please note that this only relates to floating buffers (exclusive buffers are not part of the pool).</p>
+  </li>
+  <li>
+    <p><strong><span style="color:green">Flink 1.9 and above:</span></strong> <code>outPoolUsage</code>, <code>inPoolUsage</code>, <code>floatingBuffersUsage</code>, <code>exclusiveBuffersUsage</code><br />
+An estimate on the ratio of buffers used vs. buffers available in the respective local buffer pools.
+Starting with Flink 1.9, <code>inPoolUsage</code> is the sum of <code>floatingBuffersUsage</code> and <code>exclusiveBuffersUsage</code>.</p>
+  </li>
+  <li>
+    <p><code>numRecordsOut</code>, <code>numRecordsIn</code><br />
+Each metric comes with two scopes: one scoped to the operator and one scoped to the subtask. For network monitoring, the subtask-scoped metric is relevant and shows the total number of records it has sent/received. You may need to further look into these figures to extract the number of records within a certain time span or use the equivalent <code>…PerSecond</code> metrics.</p>
+  </li>
+  <li>
+    <p><code>numBytesOut</code>, <code>numBytesInLocal</code>, <code>numBytesInRemote</code><br />
+The total number of bytes this subtask has emitted or read from a local/remote source. These are also available as meters via <code>…PerSecond</code> metrics.</p>
+  </li>
+  <li>
+    <p><code>numBuffersOut</code>, <code>numBuffersInLocal</code>, <code>numBuffersInRemote</code><br />
+Similar to <code>numBytes…</code> but counting the number of network buffers.</p>
+  </li>
+</ul>
+
+<div class="alert alert-warning">
+  <p><span class="label label-warning" style="display: inline-block"><span class="glyphicon glyphicon-warning-sign" aria-hidden="true"></span> Warning</span>
+For the sake of completeness and since they have been used in the past, we will briefly look at the <code>outputQueueLength</code> and <code>inputQueueLength</code> metrics. These are somewhat similar to the <code>[out,in]PoolUsage</code> metrics but show the number of buffers sitting in a sender subtask’s output queues and in a receiver subtask’s input queues, respectively. Reasoning about absolute numbers of buffers, however, is difficult and there is also a special subtlety with local [...]
+
+  <p>Overall, <strong>we discourage the use of</strong> <code>outputQueueLength</code> <strong>and</strong> <code>inputQueueLength</code> because their interpretation highly depends on the current parallelism of the operator and the configured numbers of exclusive and floating buffers. Instead, we recommend using the various <code>*PoolUsage</code> metrics which even reveal more detailed insight.</p>
+</div>
+
+<div class="alert alert-info">
+  <p><span class="label label-info" style="display: inline-block"><span class="glyphicon glyphicon-info-sign" aria-hidden="true"></span> Note</span>
+ If you reason about buffer usage, please keep the following in mind:</p>
+
+  <ul>
+    <li>Any outgoing channel which has been used at least once will always occupy one buffer (since Flink 1.5).
+      <ul>
+        <li><strong><span style="color:orange">up to Flink 1.8:</span></strong> This buffer (even if empty!) was always counted as a backlog of 1 and thus receivers tried to reserve a floating buffer for it.</li>
+        <li><strong><span style="color:green">Flink 1.9 and above:</span></strong> A buffer is only counted in the backlog if it is ready for consumption, i.e. it is full or was flushed (see FLINK-11082)</li>
+      </ul>
+    </li>
+    <li>The receiver will only release a received buffer after deserialising the last record in it.</li>
+  </ul>
+</div>
+
+<p>The following sections make use of and combine these metrics to reason about backpressure and resource usage / efficiency with respect to throughput. A separate section will detail latency related metrics.</p>
+
+<h3 id="backpressure">Backpressure</h3>
+
+<p>Backpressure may be indicated by two different sets of metrics: (local) buffer pool usages as well as input/output queue lengths. They provide a different level of granularity but, unfortunately, none of these are exhaustive and there is room for interpretation. Because of the inherent problems with interpreting these queue lengths we will focus on the usage of input and output pools below which also provides more detail.</p>
+
+<ul>
+  <li>
+    <p><strong>If a subtask’s</strong> <code>outPoolUsage</code> <strong>is 100%</strong>, it is backpressured. Whether the subtask is already blocking or still writing records into network buffers depends on how full the buffers are, that the <code>RecordWriters</code> are currently writing into.<br />
+<span class="glyphicon glyphicon-warning-sign" aria-hidden="true" style="color:orange;"></span> This is different to what the backpressure monitor is showing!</p>
+  </li>
+  <li>
+    <p>An <code>inPoolUsage</code> of 100% means that all floating buffers are assigned to channels and eventually backpressure will be exercised upstream. These floating buffers are in either of the following conditions: they are reserved for future use on a channel due to an exclusive buffer being utilised (remote input channels always try to maintain <code>#exclusive buffers</code> credits), they are reserved for a sender’s backlog and wait for data, they may contain data and are enqu [...]
+  </li>
+  <li>
+    <p><strong><span style="color:orange">up to Flink 1.8:</span></strong> Due to <a href="https://issues.apache.org/jira/browse/FLINK-11082">FLINK-11082</a>, an <code>inPoolUsage</code> of 100% is quite common even in normal situations.</p>
+  </li>
+  <li>
+    <p><strong><span style="color:green">Flink 1.9 and above:</span></strong> If <code>inPoolUsage</code> is constantly around 100%, this is a strong indicator for exercising backpressure upstream.</p>
+  </li>
+</ul>
+
+<p>The following table summarises all combinations and their interpretation. Bear in mind, though, that backpressure may be minor or temporary (no need to look into it), on particular channels only, or caused by other JVM processes on a particular TaskManager, such as GC, synchronisation, I/O, resource shortage, instead of a specific subtask.</p>
+
+<center>
+<table class="tg">
+  <tr>
+    <th></th>
+    <th class="tg-center"><code>outPoolUsage</code> low</th>
+    <th class="tg-center"><code>outPoolUsage</code> high</th>
+  </tr>
+  <tr>
+    <th class="tg-top"><code>inPoolUsage</code> low</th>
+    <td class="tg-topcenter">
+      <span class="glyphicon glyphicon-ok-sign" aria-hidden="true" style="color:green;font-size:1.5em;"></span></td>
+    <td class="tg-topcenter">
+      <span class="glyphicon glyphicon-warning-sign" aria-hidden="true" style="color:orange;font-size:1.5em;"></span><br />
+      (backpressured, temporary situation: upstream is not backpressured yet or not anymore)</td>
+  </tr>
+  <tr>
+    <th class="tg-top" rowspan="2">
+      <code>inPoolUsage</code> high<br />
+      (<strong><span style="color:green">Flink 1.9+</span></strong>)</th>
+    <td class="tg-topcenter">
+      if all upstream tasks’<code>outPoolUsage</code> are low: <span class="glyphicon glyphicon-warning-sign" aria-hidden="true" style="color:orange;font-size:1.5em;"></span><br />
+      (may eventually cause backpressure)</td>
+    <td class="tg-topcenter" rowspan="2">
+      <span class="glyphicon glyphicon-remove-sign" aria-hidden="true" style="color:red;font-size:1.5em;"></span><br />
+      (backpressured by downstream task(s) or network, probably forwarding backpressure upstream)</td>
+  </tr>
+  <tr>
+    <td class="tg-topcenter">if any upstream task’s<code>outPoolUsage</code> is high: <span class="glyphicon glyphicon-remove-sign" aria-hidden="true" style="color:red;font-size:1.5em;"></span><br />
+      (may exercise backpressure upstream and may be the source of backpressure)</td>
+  </tr>
+</table>
+</center>
+
+<p><br />
+We may even reason more about the cause of backpressure by looking at the network metrics of the subtasks of two consecutive tasks:</p>
+
+<ul>
+  <li>If all subtasks of the receiver task have low <code>inPoolUsage</code> values and any upstream subtask’s <code>outPoolUsage</code> is high, then there may be a network bottleneck causing backpressure.
+Since network is a shared resource among all subtasks of a TaskManager, this may not directly originate from this subtask, but rather from various concurrent operations, e.g. checkpoints, other streams, external connections, or other TaskManagers/processes on the same machine.</li>
+</ul>
+
+<p>Backpressure can also be caused by all parallel instances of a task or by a single task instance. The first usually happens because the task is performing some time consuming operation that applies to all input partitions. The latter is usually the result of some kind of skew, either data skew or resource availability/allocation skew. In either case, you can find some hints on how to handle such situations in the <a href="#span-classlabel-label-info-styledisplay-inline-blockspan-class [...]
+
+<div class="alert alert-info">
+  <h3 class="no_toc" id="span-classglyphicon-glyphicon-info-sign-aria-hiddentruespan-flink-19-and-above"><span class="glyphicon glyphicon-info-sign" aria-hidden="true"></span> Flink 1.9 and above</h3>
+
+  <ul>
+    <li>If <code>floatingBuffersUsage</code> is not 100%, it is unlikely that there is backpressure. If it is 100% and any upstream task is backpressured, it suggests that this input is exercising backpressure on either a single, some or all input channels. To differentiate between those three situations you can use <code>exclusiveBuffersUsage</code>:
+      <ul>
+        <li>Assuming that <code>floatingBuffersUsage</code> is around 100%, the higher the <code>exclusiveBuffersUsage</code> the more input channels are backpressured. In an extreme case of <code>exclusiveBuffersUsage</code> being close to 100%, it means that all channels are backpressured.</li>
+      </ul>
+    </li>
+  </ul>
+
+  <p><br />
+The relation between <code>exclusiveBuffersUsage</code>, <code>floatingBuffersUsage</code>, and the upstream tasks’ <code>outPoolUsage</code> is summarised in the following table and extends on the table above with <code>inPoolUsage = floatingBuffersUsage + exclusiveBuffersUsage</code>:</p>
+
+  <center>
+<table class="tg">
+  <tr>
+    <th></th>
+    <th><code>exclusiveBuffersUsage</code> low</th>
+    <th><code>exclusiveBuffersUsage</code> high</th>
+  </tr>
+  <tr>
+    <th class="tg-top" style="min-width:33%;">
+      <code>floatingBuffersUsage</code> low +<br />
+      <em>all</em> upstream <code>outPoolUsage</code> low</th>
+    <td class="tg-center"><span class="glyphicon glyphicon-ok-sign" aria-hidden="true" style="color:green;font-size:1.5em;"></span></td>
+    <td class="tg-center">-<sup>3</sup></td>
+  </tr>
+  <tr>
+    <th class="tg-top" style="min-width:33%;">
+      <code>floatingBuffersUsage</code> low +<br />
+      <em>any</em> upstream <code>outPoolUsage</code> high</th>
+    <td class="tg-center">
+      <span class="glyphicon glyphicon-remove-sign" aria-hidden="true" style="color:red;font-size:1.5em;"></span><br />
+      (potential network bottleneck)</td>
+    <td class="tg-center">-<sup>3</sup></td>
+  </tr>
+  <tr>
+    <th class="tg-top" style="min-width:33%;">
+      <code>floatingBuffersUsage</code> high +<br />
+      <em>all</em> upstream <code>outPoolUsage</code> low</th>
+    <td class="tg-center">
+      <span class="glyphicon glyphicon-warning-sign" aria-hidden="true" style="color:orange;font-size:1.5em;"></span><br />
+      (backpressure eventually appears on only some of the input channels)</td>
+    <td class="tg-center">
+      <span class="glyphicon glyphicon-warning-sign" aria-hidden="true" style="color:orange;font-size:1.5em;"></span><br />
+      (backpressure eventually appears on most or all of the input channels)</td>
+  </tr>
+  <tr>
+    <th class="tg-top" style="min-width:33%;">
+      <code>floatingBuffersUsage</code> high +<br />
+      any upstream <code>outPoolUsage</code> high</th>
+    <td class="tg-center">
+      <span class="glyphicon glyphicon-remove-sign" aria-hidden="true" style="color:red;font-size:1.5em;"></span><br />
+      (backpressure on only some of the input channels)</td>
+    <td class="tg-center">
+      <span class="glyphicon glyphicon-remove-sign" aria-hidden="true" style="color:red;font-size:1.5em;"></span><br />
+      (backpressure on most or all of the input channels)</td>
+  </tr>
+</table>
+</center>
+
+  <p><sup>3</sup> this should not happen</p>
+
+</div>
+
+<h3 id="resource-usage--throughput">Resource Usage / Throughput</h3>
+
+<p>Besides the obvious use of each individual metric mentioned above, there are also a few combinations providing useful insight into what is happening in the network stack:</p>
+
+<ul>
+  <li>
+    <p>Low throughput with frequent <code>outPoolUsage</code> values around 100% but low <code>inPoolUsage</code> on all receivers is an indicator that the round-trip-time of our credit-notification (depends on your network’s latency) is too high for the default number of exclusive buffers to make use of your bandwidth. Consider increasing the <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/config.html#taskmanager-network-memory-buffers-per-channel">buffers-per-c [...]
+  </li>
+  <li>
+    <p>Combining <code>numRecordsOut</code> and <code>numBytesOut</code> helps identifying average serialised record sizes which supports you in capacity planning for peak scenarios.</p>
+  </li>
+  <li>
+    <p>If you want to reason about buffer fill rates and the influence of the output flusher, you may combine <code>numBytesInRemote</code> with <code>numBuffersInRemote</code>. When tuning for throughput (and not latency!), low buffer fill rates may indicate reduced network efficiency. In such cases, consider increasing the buffer timeout.
+Please note that, as of Flink 1.8 and 1.9, <code>numBuffersOut</code> only increases for buffers getting full or for an event cutting off a buffer (e.g. a checkpoint barrier) and may lag behind. Please also note that reasoning about buffer fill rates on local channels is unnecessary since buffering is an optimisation technique for remote channels with limited effect on local channels.</p>
+  </li>
+  <li>
+    <p>You may also separate local from remote traffic using numBytesInLocal and numBytesInRemote but in most cases this is unnecessary.</p>
+  </li>
+</ul>
+
+<div class="alert alert-info">
+  <h3 class="no_toc" id="span-classglyphicon-glyphicon-info-sign-aria-hiddentruespan-what-to-do-with-backpressure"><span class="glyphicon glyphicon-info-sign" aria-hidden="true"></span> What to do with Backpressure?</h3>
+
+  <p>Assuming that you identified where the source of backpressure — a bottleneck — is located, the next step is to analyse why this is happening. Below, we list some potential causes of backpressure from the more basic to the more complex ones. We recommend to check the basic causes first, before diving deeper on the more complex ones and potentially drawing false conclusions.</p>
+
+  <p>Please also recall that backpressure might be temporary and the result of a load spike, checkpointing, or a job restart with a data backlog waiting to be processed. If backpressure is temporary, you should simply ignore it. Alternatively, keep in mind that the process of analysing and solving the issue can be affected by the intermittent nature of your bottleneck. Having said that, here are a couple of things to check.</p>
+
+  <h4 id="system-resources">System Resources</h4>
+
+  <p>Firstly, you should check the incriminated machines’ basic resource usage like CPU, network, or disk I/O. If some resource is fully or heavily utilised you can do one of the following:</p>
+
+  <ol>
+    <li>Try to optimise your code. Code profilers are helpful in this case.</li>
+    <li>Tune Flink for that specific resource.</li>
+    <li>Scale out by increasing the parallelism and/or increasing the number of machines in the cluster.</li>
+  </ol>
+
+  <h4 id="garbage-collection">Garbage Collection</h4>
+
+  <p>Oftentimes, performance issues arise from long GC pauses. You can verify whether you are in such a situation by either printing debug GC logs (via -<code>XX:+PrintGCDetails</code>) or by using some memory/GC profilers. Since dealing with GC issues is highly application-dependent and independent of Flink, we will not go into details here (<a href="https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/index.html">Oracle’s Garbage Collection Tuning Guide</a> or <a href="ht [...]
+
+  <h4 id="cputhread-bottleneck">CPU/Thread Bottleneck</h4>
+
+  <p>Sometimes a CPU bottleneck might not be visible at first glance if one or a couple of threads are causing the CPU bottleneck while the CPU usage of the overall machine remains relatively low. For instance, a single CPU-bottlenecked thread on a 48-core machine would result in only 2% CPU use. Consider using code profilers for this as they can identify hot threads by showing each threads’ CPU usage, for example.</p>
+
+  <h4 id="thread-contention">Thread Contention</h4>
+
+  <p>Similarly to the CPU/thread bottleneck issue above, a subtask may be bottlenecked due to high thread contention on shared resources. Again, CPU profilers are your best friend here! Consider looking for synchronisation overhead / lock contention in user code — although adding synchronisation in user code should be avoided and may even be dangerous! Also consider investigating shared system resources. The default JVM’s SSL implementation, for example, can become contented around the s [...]
+
+  <h4 id="load-imbalance">Load Imbalance</h4>
+
+  <p>If your bottleneck is caused by data skew, you can try to remove it or mitigate its impact by changing the data partitioning to separate heavy keys or by implementing local/pre-aggregation.</p>
+
+  <p><br />
+This list is far from exhaustive. Generally, in order to reduce a bottleneck and thus backpressure, first analyse where it is happening and then find out why. The best place to start reasoning about the “why” is by checking what resources are fully utilised.</p>
+</div>
+
+<h3 id="latency-tracking">Latency Tracking</h3>
+
+<p>Tracking latencies at the various locations they may occur is a topic of its own. In this section, we will focus on the time records wait inside Flink’s network stack — including the system’s network connections. In low throughput scenarios, these latencies are influenced directly by the output flusher via the buffer timeout parameter or indirectly by any application code latencies. When processing a record takes longer than expected or when (multiple) timers fire at the same time — a [...]
+
+<p>Flink offers some support for <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/metrics.html#latency-tracking">tracking the latency</a> of records passing through the system (outside of user code). However, this is disabled by default (see below why!) and must be enabled by setting a latency tracking interval either in Flink’s <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/config.html#metrics-latency-interval">configuration via < [...]
+
+<ul>
+  <li><code>single</code>: one histogram for each operator subtask</li>
+  <li><code>operator</code> (default): one histogram for each combination of source task and operator subtask</li>
+  <li><code>subtask</code>: one histogram for each combination of source subtask and operator subtask (quadratic in the parallelism!)</li>
+</ul>
+
+<p>These metrics are collected through special “latency markers”: each source subtask will periodically emit a special record containing the timestamp of its creation. The latency markers then flow alongside normal records while not overtaking them on the wire or inside a buffer queue. However, <em>a latency marker does not enter application logic</em> and is overtaking records there. Latency markers therefore only measure the waiting time between the user code and not a full “end-to-end [...]
+
+<p>Since <code>LatencyMarkers</code> sit in network buffers just like normal records, they will also wait for the buffer to be full or flushed due to buffer timeouts. When a channel is on high load, there is no added latency by the network buffering data. However, as soon as one channel is under low load, records and latency markers will experience an expected average delay of at most <code>buffer_timeout / 2</code>. This delay will add to each network connection towards a subtask and sh [...]
+
+<p>By looking at the exposed latency tracking metrics for each subtask, for example at the 95th percentile, you should nevertheless be able to identify subtasks which are adding substantially to the overall source-to-sink latency and continue with optimising there.</p>
+
+<div class="alert alert-info">
+  <p><span class="label label-info" style="display: inline-block"><span class="glyphicon glyphicon-info-sign" aria-hidden="true"></span> Note</span>
+Flink’s latency markers assume that the clocks on all machines in the cluster are in sync. We recommend setting up an automated clock synchronisation service (like NTP) to avoid false latency results.</p>
+</div>
+
+<div class="alert alert-warning">
+  <p><span class="label label-warning" style="display: inline-block"><span class="glyphicon glyphicon-warning-sign" aria-hidden="true"></span> Warning</span>
+Enabling latency metrics can significantly impact the performance of the cluster (in particular for <code>subtask</code> granularity) due to the sheer amount of metrics being added as well as the use of histograms which are quite expensive to maintain. It is highly recommended to only use them for debugging purposes.</p>
+</div>
+
+<h2 id="conclusion">Conclusion</h2>
+
+<p>In the previous sections we discussed how to monitor Flink’s network stack which primarily involves identifying backpressure: where it occurs, where it originates from, and (potentially) why it occurs. This can be executed in two ways: for simple cases and debugging sessions by using the backpressure monitor; for continuous monitoring, more in-depth analysis, and less runtime overhead by using Flink’s task and network stack metrics. Backpressure can be caused by the network layer itse [...]
+
+<p>Stay tuned for the third blog post in the series of network stack posts that will focus on tuning techniques and anti-patterns to avoid.</p>
+
+
+      </article>
+    </div>
+
+    <div class="row">
+      <div id="disqus_thread"></div>
+      <script type="text/javascript">
+        /* * * CONFIGURATION VARIABLES: EDIT BEFORE PASTING INTO YOUR WEBPAGE * * */
+        var disqus_shortname = 'stratosphere-eu'; // required: replace example with your forum shortname
+
+        /* * * DON'T EDIT BELOW THIS LINE * * */
+        (function() {
+            var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
+            dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
+             (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
+        })();
+      </script>
+    </div>
+  </div>
+</div>
+      </div>
+    </div>
+
+    <hr />
+
+    <div class="row">
+      <div class="footer text-center col-sm-12">
+        <p>Copyright © 2014-2019 <a href="http://apache.org">The Apache Software Foundation</a>. All Rights Reserved.</p>
+        <p>Apache Flink, Flink®, Apache®, the squirrel logo, and the Apache feather logo are either registered trademarks or trademarks of The Apache Software Foundation.</p>
+        <p><a href="/privacy-policy.html">Privacy Policy</a> &middot; <a href="/blog/feed.xml">RSS feed</a></p>
+      </div>
+    </div>
+    </div><!-- /.container -->
+
+    <!-- Include all compiled plugins (below), or include individual files as needed -->
+    <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.4/js/bootstrap.min.js"></script>
+    <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery.matchHeight/0.7.0/jquery.matchHeight-min.js"></script>
+    <script src="/js/codetabs.js"></script>
+    <script src="/js/stickysidebar.js"></script>
+
+    <!-- Google Analytics -->
+    <script>
+      (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
+      (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
+      m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
+      })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
+
+      ga('create', 'UA-52545728-1', 'auto');
+      ga('send', 'pageview');
+    </script>
+  </body>
+</html>
diff --git a/content/blog/feed.xml b/content/blog/feed.xml
index ebfe803..cc8dfe3 100644
--- a/content/blog/feed.xml
+++ b/content/blog/feed.xml
@@ -7,6 +7,366 @@
 <atom:link href="https://flink.apache.org/blog/feed.xml" rel="self" type="application/rss+xml" />
 
 <item>
+<title>Flink Network Stack Vol. 2: Monitoring, Metrics, and that Backpressure Thing</title>
+<description>&lt;style type=&quot;text/css&quot;&gt;
+.tg  {border-collapse:collapse;border-spacing:0;}
+.tg td{padding:10px 10px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
+.tg th{padding:10px 10px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;background-color:#eff0f1;}
+.tg .tg-wide{padding:10px 30px;}
+.tg .tg-top{vertical-align:top}
+.tg .tg-topcenter{text-align:center;vertical-align:top}
+.tg .tg-center{text-align:center;vertical-align:center}
+&lt;/style&gt;
+
+&lt;p&gt;In a &lt;a href=&quot;/2019/06/05/flink-network-stack.html&quot;&gt;previous blog post&lt;/a&gt;, we presented how Flink’s network stack works from the high-level abstractions to the low-level details. This second blog post in the series of network stack posts extends on this knowledge and discusses monitoring network-related metrics to identify effects such as backpressure or bottlenecks in throughput and latency. Although this post briefly covers what to do with backpressure,  [...]
+
+&lt;div class=&quot;page-toc&quot;&gt;
+&lt;ul id=&quot;markdown-toc&quot;&gt;
+  &lt;li&gt;&lt;a href=&quot;#monitoring&quot; id=&quot;markdown-toc-monitoring&quot;&gt;Monitoring&lt;/a&gt;    &lt;ul&gt;
+      &lt;li&gt;&lt;a href=&quot;#backpressure-monitor&quot; id=&quot;markdown-toc-backpressure-monitor&quot;&gt;Backpressure Monitor&lt;/a&gt;&lt;/li&gt;
+    &lt;/ul&gt;
+  &lt;/li&gt;
+  &lt;li&gt;&lt;a href=&quot;#network-metrics&quot; id=&quot;markdown-toc-network-metrics&quot;&gt;Network Metrics&lt;/a&gt;    &lt;ul&gt;
+      &lt;li&gt;&lt;a href=&quot;#backpressure&quot; id=&quot;markdown-toc-backpressure&quot;&gt;Backpressure&lt;/a&gt;&lt;/li&gt;
+      &lt;li&gt;&lt;a href=&quot;#resource-usage--throughput&quot; id=&quot;markdown-toc-resource-usage--throughput&quot;&gt;Resource Usage / Throughput&lt;/a&gt;&lt;/li&gt;
+      &lt;li&gt;&lt;a href=&quot;#latency-tracking&quot; id=&quot;markdown-toc-latency-tracking&quot;&gt;Latency Tracking&lt;/a&gt;&lt;/li&gt;
+    &lt;/ul&gt;
+  &lt;/li&gt;
+  &lt;li&gt;&lt;a href=&quot;#conclusion&quot; id=&quot;markdown-toc-conclusion&quot;&gt;Conclusion&lt;/a&gt;&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;/div&gt;
+
+&lt;h2 id=&quot;monitoring&quot;&gt;Monitoring&lt;/h2&gt;
+
+&lt;p&gt;Probably the most important part of network monitoring is &lt;a href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/back_pressure.html&quot;&gt;monitoring backpressure&lt;/a&gt;, a situation where a system is receiving data at a higher rate than it can process¹. Such behaviour will result in the sender being backpressured and may be caused by two things:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;
+    &lt;p&gt;The receiver is slow.&lt;br /&gt;
+This can happen because the receiver is backpressured itself, is unable to keep processing at the same rate as the sender, or is temporarily blocked by garbage collection, lack of system resources, or I/O.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;The network channel is slow.&lt;br /&gt;
+  Even though in such case the receiver is not (directly) involved, we call the sender backpressured due to a potential oversubscription on network bandwidth shared by all subtasks running on the same machine. Beware that, in addition to Flink’s network stack, there may be more network users, such as sources and sinks, distributed file systems (checkpointing, network-attached storage), logging, and metrics. A previous &lt;a href=&quot;https://www.ververica.com/blog/how-to-size-your-apach [...]
+  &lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;&lt;sup&gt;1&lt;/sup&gt; In case you are unfamiliar with backpressure and how it interacts with Flink, we recommend reading through &lt;a href=&quot;https://www.ververica.com/blog/how-flink-handles-backpressure&quot;&gt;this blog post on backpressure&lt;/a&gt; from 2015.&lt;/p&gt;
+
+&lt;p&gt;&lt;br /&gt;
+If backpressure occurs, it will bubble upstream and eventually reach your sources and slow them down. This is not a bad thing per-se and merely states that you lack resources for the current load. However, you may want to improve your job so that it can cope with higher loads without using more resources. In order to do so, you need to find (1) where (at which task/operator) the bottleneck is and (2) what is causing it. Flink offers two mechanisms for identifying where the bottleneck is: [...]
+
+&lt;ul&gt;
+  &lt;li&gt;directly via Flink’s web UI and its backpressure monitor, or&lt;/li&gt;
+  &lt;li&gt;indirectly through some of the network metrics.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;Flink’s web UI is likely the first entry point for a quick troubleshooting but has some disadvantages that we will explain below. On the other hand, Flink’s network metrics are better suited for continuous monitoring and reasoning about the exact nature of the bottleneck causing backpressure. We will cover both in the sections below. In both cases, you need to identify the origin of backpressure from the sources to the sinks. Your starting point for the current and future invest [...]
+
+&lt;h3 id=&quot;backpressure-monitor&quot;&gt;Backpressure Monitor&lt;/h3&gt;
+
+&lt;p&gt;The &lt;a href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/back_pressure.html&quot;&gt;backpressure monitor&lt;/a&gt; is only exposed via Flink’s web UI². Since it’s an active component that is only triggered on request, it is currently not available via metrics. The backpressure monitor samples the running tasks’ threads on all TaskManagers via &lt;code&gt;Thread.getStackTrace()&lt;/code&gt; and computes the number of samples where tasks were bl [...]
+
+&lt;ul&gt;
+  &lt;li&gt;&lt;span style=&quot;color:green&quot;&gt;OK&lt;/span&gt; for &lt;code&gt;ratio ≤ 0.10&lt;/code&gt;,&lt;/li&gt;
+  &lt;li&gt;&lt;span style=&quot;color:orange&quot;&gt;LOW&lt;/span&gt; for &lt;code&gt;0.10 &amp;lt; Ratio ≤ 0.5&lt;/code&gt;, and&lt;/li&gt;
+  &lt;li&gt;&lt;span style=&quot;color:red&quot;&gt;HIGH&lt;/span&gt; for &lt;code&gt;0.5 &amp;lt; Ratio ≤ 1&lt;/code&gt;.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;Although you can tune things like the refresh-interval, the number of samples, or the delay between samples, normally, you would not need to touch these since the defaults already give good-enough results.&lt;/p&gt;
+
+&lt;center&gt;
+&lt;img src=&quot;/img/blog/2019-07-23-network-stack-2/back_pressure_sampling_high.png&quot; width=&quot;600px&quot; alt=&quot;Backpressure sampling:high&quot; /&gt;
+&lt;/center&gt;
+
+&lt;p&gt;&lt;sup&gt;2&lt;/sup&gt; You may also access the backpressure monitor via the REST API: &lt;code&gt;/jobs/:jobid/vertices/:vertexid/backpressure&lt;/code&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;br /&gt;
+The backpressure monitor can help you find where (at which task/operator) backpressure originates from. However, it does not support you in further reasoning about the causes of it. Additionally, for larger jobs or higher parallelism, the backpressure monitor becomes too crowded to use and may also take some time to gather all information from all TaskManagers. Please also note that sampling may affect your running job’s performance.&lt;/p&gt;
+
+&lt;h2 id=&quot;network-metrics&quot;&gt;Network Metrics&lt;/h2&gt;
+
+&lt;p&gt;&lt;a href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/metrics.html#network&quot;&gt;Network&lt;/a&gt; and &lt;a href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/metrics.html#io&quot;&gt;task I/O&lt;/a&gt; metrics are more lightweight than the backpressure monitor and are continuously published for each running job. We can leverage those and get even more insights, not only for backpressure monitoring. The most re [...]
+
+&lt;ul&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;strong&gt;&lt;span style=&quot;color:orange&quot;&gt;up to Flink 1.8:&lt;/span&gt;&lt;/strong&gt; &lt;code&gt;outPoolUsage&lt;/code&gt;, &lt;code&gt;inPoolUsage&lt;/code&gt;&lt;br /&gt;
+An estimate on the ratio of buffers used vs. buffers available in the respective local buffer pools.
+While interpreting &lt;code&gt;inPoolUsage&lt;/code&gt; in Flink 1.5 - 1.8 with credit-based flow control, please note that this only relates to floating buffers (exclusive buffers are not part of the pool).&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;strong&gt;&lt;span style=&quot;color:green&quot;&gt;Flink 1.9 and above:&lt;/span&gt;&lt;/strong&gt; &lt;code&gt;outPoolUsage&lt;/code&gt;, &lt;code&gt;inPoolUsage&lt;/code&gt;, &lt;code&gt;floatingBuffersUsage&lt;/code&gt;, &lt;code&gt;exclusiveBuffersUsage&lt;/code&gt;&lt;br /&gt;
+An estimate on the ratio of buffers used vs. buffers available in the respective local buffer pools.
+Starting with Flink 1.9, &lt;code&gt;inPoolUsage&lt;/code&gt; is the sum of &lt;code&gt;floatingBuffersUsage&lt;/code&gt; and &lt;code&gt;exclusiveBuffersUsage&lt;/code&gt;.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;code&gt;numRecordsOut&lt;/code&gt;, &lt;code&gt;numRecordsIn&lt;/code&gt;&lt;br /&gt;
+Each metric comes with two scopes: one scoped to the operator and one scoped to the subtask. For network monitoring, the subtask-scoped metric is relevant and shows the total number of records it has sent/received. You may need to further look into these figures to extract the number of records within a certain time span or use the equivalent &lt;code&gt;…PerSecond&lt;/code&gt; metrics.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;code&gt;numBytesOut&lt;/code&gt;, &lt;code&gt;numBytesInLocal&lt;/code&gt;, &lt;code&gt;numBytesInRemote&lt;/code&gt;&lt;br /&gt;
+The total number of bytes this subtask has emitted or read from a local/remote source. These are also available as meters via &lt;code&gt;…PerSecond&lt;/code&gt; metrics.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;code&gt;numBuffersOut&lt;/code&gt;, &lt;code&gt;numBuffersInLocal&lt;/code&gt;, &lt;code&gt;numBuffersInRemote&lt;/code&gt;&lt;br /&gt;
+Similar to &lt;code&gt;numBytes…&lt;/code&gt; but counting the number of network buffers.&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;div class=&quot;alert alert-warning&quot;&gt;
+  &lt;p&gt;&lt;span class=&quot;label label-warning&quot; style=&quot;display: inline-block&quot;&gt;&lt;span class=&quot;glyphicon glyphicon-warning-sign&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt; Warning&lt;/span&gt;
+For the sake of completeness and since they have been used in the past, we will briefly look at the &lt;code&gt;outputQueueLength&lt;/code&gt; and &lt;code&gt;inputQueueLength&lt;/code&gt; metrics. These are somewhat similar to the &lt;code&gt;[out,in]PoolUsage&lt;/code&gt; metrics but show the number of buffers sitting in a sender subtask’s output queues and in a receiver subtask’s input queues, respectively. Reasoning about absolute numbers of buffers, however, is difficult and there i [...]
+
+  &lt;p&gt;Overall, &lt;strong&gt;we discourage the use of&lt;/strong&gt; &lt;code&gt;outputQueueLength&lt;/code&gt; &lt;strong&gt;and&lt;/strong&gt; &lt;code&gt;inputQueueLength&lt;/code&gt; because their interpretation highly depends on the current parallelism of the operator and the configured numbers of exclusive and floating buffers. Instead, we recommend using the various &lt;code&gt;*PoolUsage&lt;/code&gt; metrics which even reveal more detailed insight.&lt;/p&gt;
+&lt;/div&gt;
+
+&lt;div class=&quot;alert alert-info&quot;&gt;
+  &lt;p&gt;&lt;span class=&quot;label label-info&quot; style=&quot;display: inline-block&quot;&gt;&lt;span class=&quot;glyphicon glyphicon-info-sign&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt; Note&lt;/span&gt;
+ If you reason about buffer usage, please keep the following in mind:&lt;/p&gt;
+
+  &lt;ul&gt;
+    &lt;li&gt;Any outgoing channel which has been used at least once will always occupy one buffer (since Flink 1.5).
+      &lt;ul&gt;
+        &lt;li&gt;&lt;strong&gt;&lt;span style=&quot;color:orange&quot;&gt;up to Flink 1.8:&lt;/span&gt;&lt;/strong&gt; This buffer (even if empty!) was always counted as a backlog of 1 and thus receivers tried to reserve a floating buffer for it.&lt;/li&gt;
+        &lt;li&gt;&lt;strong&gt;&lt;span style=&quot;color:green&quot;&gt;Flink 1.9 and above:&lt;/span&gt;&lt;/strong&gt; A buffer is only counted in the backlog if it is ready for consumption, i.e. it is full or was flushed (see FLINK-11082)&lt;/li&gt;
+      &lt;/ul&gt;
+    &lt;/li&gt;
+    &lt;li&gt;The receiver will only release a received buffer after deserialising the last record in it.&lt;/li&gt;
+  &lt;/ul&gt;
+&lt;/div&gt;
+
+&lt;p&gt;The following sections make use of and combine these metrics to reason about backpressure and resource usage / efficiency with respect to throughput. A separate section will detail latency related metrics.&lt;/p&gt;
+
+&lt;h3 id=&quot;backpressure&quot;&gt;Backpressure&lt;/h3&gt;
+
+&lt;p&gt;Backpressure may be indicated by two different sets of metrics: (local) buffer pool usages as well as input/output queue lengths. They provide a different level of granularity but, unfortunately, none of these are exhaustive and there is room for interpretation. Because of the inherent problems with interpreting these queue lengths we will focus on the usage of input and output pools below which also provides more detail.&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;strong&gt;If a subtask’s&lt;/strong&gt; &lt;code&gt;outPoolUsage&lt;/code&gt; &lt;strong&gt;is 100%&lt;/strong&gt;, it is backpressured. Whether the subtask is already blocking or still writing records into network buffers depends on how full the buffers are, that the &lt;code&gt;RecordWriters&lt;/code&gt; are currently writing into.&lt;br /&gt;
+&lt;span class=&quot;glyphicon glyphicon-warning-sign&quot; aria-hidden=&quot;true&quot; style=&quot;color:orange;&quot;&gt;&lt;/span&gt; This is different to what the backpressure monitor is showing!&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;An &lt;code&gt;inPoolUsage&lt;/code&gt; of 100% means that all floating buffers are assigned to channels and eventually backpressure will be exercised upstream. These floating buffers are in either of the following conditions: they are reserved for future use on a channel due to an exclusive buffer being utilised (remote input channels always try to maintain &lt;code&gt;#exclusive buffers&lt;/code&gt; credits), they are reserved for a sender’s backlog and wait for data, they [...]
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;strong&gt;&lt;span style=&quot;color:orange&quot;&gt;up to Flink 1.8:&lt;/span&gt;&lt;/strong&gt; Due to &lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-11082&quot;&gt;FLINK-11082&lt;/a&gt;, an &lt;code&gt;inPoolUsage&lt;/code&gt; of 100% is quite common even in normal situations.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;strong&gt;&lt;span style=&quot;color:green&quot;&gt;Flink 1.9 and above:&lt;/span&gt;&lt;/strong&gt; If &lt;code&gt;inPoolUsage&lt;/code&gt; is constantly around 100%, this is a strong indicator for exercising backpressure upstream.&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;The following table summarises all combinations and their interpretation. Bear in mind, though, that backpressure may be minor or temporary (no need to look into it), on particular channels only, or caused by other JVM processes on a particular TaskManager, such as GC, synchronisation, I/O, resource shortage, instead of a specific subtask.&lt;/p&gt;
+
+&lt;center&gt;
+&lt;table class=&quot;tg&quot;&gt;
+  &lt;tr&gt;
+    &lt;th&gt;&lt;/th&gt;
+    &lt;th class=&quot;tg-center&quot;&gt;&lt;code&gt;outPoolUsage&lt;/code&gt; low&lt;/th&gt;
+    &lt;th class=&quot;tg-center&quot;&gt;&lt;code&gt;outPoolUsage&lt;/code&gt; high&lt;/th&gt;
+  &lt;/tr&gt;
+  &lt;tr&gt;
+    &lt;th class=&quot;tg-top&quot;&gt;&lt;code&gt;inPoolUsage&lt;/code&gt; low&lt;/th&gt;
+    &lt;td class=&quot;tg-topcenter&quot;&gt;
+      &lt;span class=&quot;glyphicon glyphicon-ok-sign&quot; aria-hidden=&quot;true&quot; style=&quot;color:green;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;/td&gt;
+    &lt;td class=&quot;tg-topcenter&quot;&gt;
+      &lt;span class=&quot;glyphicon glyphicon-warning-sign&quot; aria-hidden=&quot;true&quot; style=&quot;color:orange;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
+      (backpressured, temporary situation: upstream is not backpressured yet or not anymore)&lt;/td&gt;
+  &lt;/tr&gt;
+  &lt;tr&gt;
+    &lt;th class=&quot;tg-top&quot; rowspan=&quot;2&quot;&gt;
+      &lt;code&gt;inPoolUsage&lt;/code&gt; high&lt;br /&gt;
+      (&lt;strong&gt;&lt;span style=&quot;color:green&quot;&gt;Flink 1.9+&lt;/span&gt;&lt;/strong&gt;)&lt;/th&gt;
+    &lt;td class=&quot;tg-topcenter&quot;&gt;
+      if all upstream tasks’&lt;code&gt;outPoolUsage&lt;/code&gt; are low: &lt;span class=&quot;glyphicon glyphicon-warning-sign&quot; aria-hidden=&quot;true&quot; style=&quot;color:orange;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
+      (may eventually cause backpressure)&lt;/td&gt;
+    &lt;td class=&quot;tg-topcenter&quot; rowspan=&quot;2&quot;&gt;
+      &lt;span class=&quot;glyphicon glyphicon-remove-sign&quot; aria-hidden=&quot;true&quot; style=&quot;color:red;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
+      (backpressured by downstream task(s) or network, probably forwarding backpressure upstream)&lt;/td&gt;
+  &lt;/tr&gt;
+  &lt;tr&gt;
+    &lt;td class=&quot;tg-topcenter&quot;&gt;if any upstream task’s&lt;code&gt;outPoolUsage&lt;/code&gt; is high: &lt;span class=&quot;glyphicon glyphicon-remove-sign&quot; aria-hidden=&quot;true&quot; style=&quot;color:red;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
+      (may exercise backpressure upstream and may be the source of backpressure)&lt;/td&gt;
+  &lt;/tr&gt;
+&lt;/table&gt;
+&lt;/center&gt;
+
+&lt;p&gt;&lt;br /&gt;
+We may even reason more about the cause of backpressure by looking at the network metrics of the subtasks of two consecutive tasks:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;If all subtasks of the receiver task have low &lt;code&gt;inPoolUsage&lt;/code&gt; values and any upstream subtask’s &lt;code&gt;outPoolUsage&lt;/code&gt; is high, then there may be a network bottleneck causing backpressure.
+Since network is a shared resource among all subtasks of a TaskManager, this may not directly originate from this subtask, but rather from various concurrent operations, e.g. checkpoints, other streams, external connections, or other TaskManagers/processes on the same machine.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;Backpressure can also be caused by all parallel instances of a task or by a single task instance. The first usually happens because the task is performing some time consuming operation that applies to all input partitions. The latter is usually the result of some kind of skew, either data skew or resource availability/allocation skew. In either case, you can find some hints on how to handle such situations in the &lt;a href=&quot;#span-classlabel-label-info-styledisplay-inline-b [...]
+
+&lt;div class=&quot;alert alert-info&quot;&gt;
+  &lt;h3 class=&quot;no_toc&quot; id=&quot;span-classglyphicon-glyphicon-info-sign-aria-hiddentruespan-flink-19-and-above&quot;&gt;&lt;span class=&quot;glyphicon glyphicon-info-sign&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt; Flink 1.9 and above&lt;/h3&gt;
+
+  &lt;ul&gt;
+    &lt;li&gt;If &lt;code&gt;floatingBuffersUsage&lt;/code&gt; is not 100%, it is unlikely that there is backpressure. If it is 100% and any upstream task is backpressured, it suggests that this input is exercising backpressure on either a single, some or all input channels. To differentiate between those three situations you can use &lt;code&gt;exclusiveBuffersUsage&lt;/code&gt;:
+      &lt;ul&gt;
+        &lt;li&gt;Assuming that &lt;code&gt;floatingBuffersUsage&lt;/code&gt; is around 100%, the higher the &lt;code&gt;exclusiveBuffersUsage&lt;/code&gt; the more input channels are backpressured. In an extreme case of &lt;code&gt;exclusiveBuffersUsage&lt;/code&gt; being close to 100%, it means that all channels are backpressured.&lt;/li&gt;
+      &lt;/ul&gt;
+    &lt;/li&gt;
+  &lt;/ul&gt;
+
+  &lt;p&gt;&lt;br /&gt;
+The relation between &lt;code&gt;exclusiveBuffersUsage&lt;/code&gt;, &lt;code&gt;floatingBuffersUsage&lt;/code&gt;, and the upstream tasks’ &lt;code&gt;outPoolUsage&lt;/code&gt; is summarised in the following table and extends on the table above with &lt;code&gt;inPoolUsage = floatingBuffersUsage + exclusiveBuffersUsage&lt;/code&gt;:&lt;/p&gt;
+
+  &lt;center&gt;
+&lt;table class=&quot;tg&quot;&gt;
+  &lt;tr&gt;
+    &lt;th&gt;&lt;/th&gt;
+    &lt;th&gt;&lt;code&gt;exclusiveBuffersUsage&lt;/code&gt; low&lt;/th&gt;
+    &lt;th&gt;&lt;code&gt;exclusiveBuffersUsage&lt;/code&gt; high&lt;/th&gt;
+  &lt;/tr&gt;
+  &lt;tr&gt;
+    &lt;th class=&quot;tg-top&quot; style=&quot;min-width:33%;&quot;&gt;
+      &lt;code&gt;floatingBuffersUsage&lt;/code&gt; low +&lt;br /&gt;
+      &lt;em&gt;all&lt;/em&gt; upstream &lt;code&gt;outPoolUsage&lt;/code&gt; low&lt;/th&gt;
+    &lt;td class=&quot;tg-center&quot;&gt;&lt;span class=&quot;glyphicon glyphicon-ok-sign&quot; aria-hidden=&quot;true&quot; style=&quot;color:green;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;/td&gt;
+    &lt;td class=&quot;tg-center&quot;&gt;-&lt;sup&gt;3&lt;/sup&gt;&lt;/td&gt;
+  &lt;/tr&gt;
+  &lt;tr&gt;
+    &lt;th class=&quot;tg-top&quot; style=&quot;min-width:33%;&quot;&gt;
+      &lt;code&gt;floatingBuffersUsage&lt;/code&gt; low +&lt;br /&gt;
+      &lt;em&gt;any&lt;/em&gt; upstream &lt;code&gt;outPoolUsage&lt;/code&gt; high&lt;/th&gt;
+    &lt;td class=&quot;tg-center&quot;&gt;
+      &lt;span class=&quot;glyphicon glyphicon-remove-sign&quot; aria-hidden=&quot;true&quot; style=&quot;color:red;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
+      (potential network bottleneck)&lt;/td&gt;
+    &lt;td class=&quot;tg-center&quot;&gt;-&lt;sup&gt;3&lt;/sup&gt;&lt;/td&gt;
+  &lt;/tr&gt;
+  &lt;tr&gt;
+    &lt;th class=&quot;tg-top&quot; style=&quot;min-width:33%;&quot;&gt;
+      &lt;code&gt;floatingBuffersUsage&lt;/code&gt; high +&lt;br /&gt;
+      &lt;em&gt;all&lt;/em&gt; upstream &lt;code&gt;outPoolUsage&lt;/code&gt; low&lt;/th&gt;
+    &lt;td class=&quot;tg-center&quot;&gt;
+      &lt;span class=&quot;glyphicon glyphicon-warning-sign&quot; aria-hidden=&quot;true&quot; style=&quot;color:orange;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
+      (backpressure eventually appears on only some of the input channels)&lt;/td&gt;
+    &lt;td class=&quot;tg-center&quot;&gt;
+      &lt;span class=&quot;glyphicon glyphicon-warning-sign&quot; aria-hidden=&quot;true&quot; style=&quot;color:orange;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
+      (backpressure eventually appears on most or all of the input channels)&lt;/td&gt;
+  &lt;/tr&gt;
+  &lt;tr&gt;
+    &lt;th class=&quot;tg-top&quot; style=&quot;min-width:33%;&quot;&gt;
+      &lt;code&gt;floatingBuffersUsage&lt;/code&gt; high +&lt;br /&gt;
+      any upstream &lt;code&gt;outPoolUsage&lt;/code&gt; high&lt;/th&gt;
+    &lt;td class=&quot;tg-center&quot;&gt;
+      &lt;span class=&quot;glyphicon glyphicon-remove-sign&quot; aria-hidden=&quot;true&quot; style=&quot;color:red;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
+      (backpressure on only some of the input channels)&lt;/td&gt;
+    &lt;td class=&quot;tg-center&quot;&gt;
+      &lt;span class=&quot;glyphicon glyphicon-remove-sign&quot; aria-hidden=&quot;true&quot; style=&quot;color:red;font-size:1.5em;&quot;&gt;&lt;/span&gt;&lt;br /&gt;
+      (backpressure on most or all of the input channels)&lt;/td&gt;
+  &lt;/tr&gt;
+&lt;/table&gt;
+&lt;/center&gt;
+
+  &lt;p&gt;&lt;sup&gt;3&lt;/sup&gt; this should not happen&lt;/p&gt;
+
+&lt;/div&gt;
+
+&lt;h3 id=&quot;resource-usage--throughput&quot;&gt;Resource Usage / Throughput&lt;/h3&gt;
+
+&lt;p&gt;Besides the obvious use of each individual metric mentioned above, there are also a few combinations providing useful insight into what is happening in the network stack:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;
+    &lt;p&gt;Low throughput with frequent &lt;code&gt;outPoolUsage&lt;/code&gt; values around 100% but low &lt;code&gt;inPoolUsage&lt;/code&gt; on all receivers is an indicator that the round-trip-time of our credit-notification (depends on your network’s latency) is too high for the default number of exclusive buffers to make use of your bandwidth. Consider increasing the &lt;a href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/config.html#taskmanager-network-mem [...]
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Combining &lt;code&gt;numRecordsOut&lt;/code&gt; and &lt;code&gt;numBytesOut&lt;/code&gt; helps identifying average serialised record sizes which supports you in capacity planning for peak scenarios.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;If you want to reason about buffer fill rates and the influence of the output flusher, you may combine &lt;code&gt;numBytesInRemote&lt;/code&gt; with &lt;code&gt;numBuffersInRemote&lt;/code&gt;. When tuning for throughput (and not latency!), low buffer fill rates may indicate reduced network efficiency. In such cases, consider increasing the buffer timeout.
+Please note that, as of Flink 1.8 and 1.9, &lt;code&gt;numBuffersOut&lt;/code&gt; only increases for buffers getting full or for an event cutting off a buffer (e.g. a checkpoint barrier) and may lag behind. Please also note that reasoning about buffer fill rates on local channels is unnecessary since buffering is an optimisation technique for remote channels with limited effect on local channels.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;You may also separate local from remote traffic using numBytesInLocal and numBytesInRemote but in most cases this is unnecessary.&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;div class=&quot;alert alert-info&quot;&gt;
+  &lt;h3 class=&quot;no_toc&quot; id=&quot;span-classglyphicon-glyphicon-info-sign-aria-hiddentruespan-what-to-do-with-backpressure&quot;&gt;&lt;span class=&quot;glyphicon glyphicon-info-sign&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt; What to do with Backpressure?&lt;/h3&gt;
+
+  &lt;p&gt;Assuming that you identified where the source of backpressure — a bottleneck — is located, the next step is to analyse why this is happening. Below, we list some potential causes of backpressure from the more basic to the more complex ones. We recommend to check the basic causes first, before diving deeper on the more complex ones and potentially drawing false conclusions.&lt;/p&gt;
+
+  &lt;p&gt;Please also recall that backpressure might be temporary and the result of a load spike, checkpointing, or a job restart with a data backlog waiting to be processed. If backpressure is temporary, you should simply ignore it. Alternatively, keep in mind that the process of analysing and solving the issue can be affected by the intermittent nature of your bottleneck. Having said that, here are a couple of things to check.&lt;/p&gt;
+
+  &lt;h4 id=&quot;system-resources&quot;&gt;System Resources&lt;/h4&gt;
+
+  &lt;p&gt;Firstly, you should check the incriminated machines’ basic resource usage like CPU, network, or disk I/O. If some resource is fully or heavily utilised you can do one of the following:&lt;/p&gt;
+
+  &lt;ol&gt;
+    &lt;li&gt;Try to optimise your code. Code profilers are helpful in this case.&lt;/li&gt;
+    &lt;li&gt;Tune Flink for that specific resource.&lt;/li&gt;
+    &lt;li&gt;Scale out by increasing the parallelism and/or increasing the number of machines in the cluster.&lt;/li&gt;
+  &lt;/ol&gt;
+
+  &lt;h4 id=&quot;garbage-collection&quot;&gt;Garbage Collection&lt;/h4&gt;
+
+  &lt;p&gt;Oftentimes, performance issues arise from long GC pauses. You can verify whether you are in such a situation by either printing debug GC logs (via -&lt;code&gt;XX:+PrintGCDetails&lt;/code&gt;) or by using some memory/GC profilers. Since dealing with GC issues is highly application-dependent and independent of Flink, we will not go into details here (&lt;a href=&quot;https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/index.html&quot;&gt;Oracle’s Garbage Collecti [...]
+
+  &lt;h4 id=&quot;cputhread-bottleneck&quot;&gt;CPU/Thread Bottleneck&lt;/h4&gt;
+
+  &lt;p&gt;Sometimes a CPU bottleneck might not be visible at first glance if one or a couple of threads are causing the CPU bottleneck while the CPU usage of the overall machine remains relatively low. For instance, a single CPU-bottlenecked thread on a 48-core machine would result in only 2% CPU use. Consider using code profilers for this as they can identify hot threads by showing each threads’ CPU usage, for example.&lt;/p&gt;
+
+  &lt;h4 id=&quot;thread-contention&quot;&gt;Thread Contention&lt;/h4&gt;
+
+  &lt;p&gt;Similarly to the CPU/thread bottleneck issue above, a subtask may be bottlenecked due to high thread contention on shared resources. Again, CPU profilers are your best friend here! Consider looking for synchronisation overhead / lock contention in user code — although adding synchronisation in user code should be avoided and may even be dangerous! Also consider investigating shared system resources. The default JVM’s SSL implementation, for example, can become contented around [...]
+
+  &lt;h4 id=&quot;load-imbalance&quot;&gt;Load Imbalance&lt;/h4&gt;
+
+  &lt;p&gt;If your bottleneck is caused by data skew, you can try to remove it or mitigate its impact by changing the data partitioning to separate heavy keys or by implementing local/pre-aggregation.&lt;/p&gt;
+
+  &lt;p&gt;&lt;br /&gt;
+This list is far from exhaustive. Generally, in order to reduce a bottleneck and thus backpressure, first analyse where it is happening and then find out why. The best place to start reasoning about the “why” is by checking what resources are fully utilised.&lt;/p&gt;
+&lt;/div&gt;
+
+&lt;h3 id=&quot;latency-tracking&quot;&gt;Latency Tracking&lt;/h3&gt;
+
+&lt;p&gt;Tracking latencies at the various locations they may occur is a topic of its own. In this section, we will focus on the time records wait inside Flink’s network stack — including the system’s network connections. In low throughput scenarios, these latencies are influenced directly by the output flusher via the buffer timeout parameter or indirectly by any application code latencies. When processing a record takes longer than expected or when (multiple) timers fire at the same ti [...]
+
+&lt;p&gt;Flink offers some support for &lt;a href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/metrics.html#latency-tracking&quot;&gt;tracking the latency&lt;/a&gt; of records passing through the system (outside of user code). However, this is disabled by default (see below why!) and must be enabled by setting a latency tracking interval either in Flink’s &lt;a href=&quot;https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/config.html#metrics-l [...]
+
+&lt;ul&gt;
+  &lt;li&gt;&lt;code&gt;single&lt;/code&gt;: one histogram for each operator subtask&lt;/li&gt;
+  &lt;li&gt;&lt;code&gt;operator&lt;/code&gt; (default): one histogram for each combination of source task and operator subtask&lt;/li&gt;
+  &lt;li&gt;&lt;code&gt;subtask&lt;/code&gt;: one histogram for each combination of source subtask and operator subtask (quadratic in the parallelism!)&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;These metrics are collected through special “latency markers”: each source subtask will periodically emit a special record containing the timestamp of its creation. The latency markers then flow alongside normal records while not overtaking them on the wire or inside a buffer queue. However, &lt;em&gt;a latency marker does not enter application logic&lt;/em&gt; and is overtaking records there. Latency markers therefore only measure the waiting time between the user code and not  [...]
+
+&lt;p&gt;Since &lt;code&gt;LatencyMarkers&lt;/code&gt; sit in network buffers just like normal records, they will also wait for the buffer to be full or flushed due to buffer timeouts. When a channel is on high load, there is no added latency by the network buffering data. However, as soon as one channel is under low load, records and latency markers will experience an expected average delay of at most &lt;code&gt;buffer_timeout / 2&lt;/code&gt;. This delay will add to each network conne [...]
+
+&lt;p&gt;By looking at the exposed latency tracking metrics for each subtask, for example at the 95th percentile, you should nevertheless be able to identify subtasks which are adding substantially to the overall source-to-sink latency and continue with optimising there.&lt;/p&gt;
+
+&lt;div class=&quot;alert alert-info&quot;&gt;
+  &lt;p&gt;&lt;span class=&quot;label label-info&quot; style=&quot;display: inline-block&quot;&gt;&lt;span class=&quot;glyphicon glyphicon-info-sign&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt; Note&lt;/span&gt;
+Flink’s latency markers assume that the clocks on all machines in the cluster are in sync. We recommend setting up an automated clock synchronisation service (like NTP) to avoid false latency results.&lt;/p&gt;
+&lt;/div&gt;
+
+&lt;div class=&quot;alert alert-warning&quot;&gt;
+  &lt;p&gt;&lt;span class=&quot;label label-warning&quot; style=&quot;display: inline-block&quot;&gt;&lt;span class=&quot;glyphicon glyphicon-warning-sign&quot; aria-hidden=&quot;true&quot;&gt;&lt;/span&gt; Warning&lt;/span&gt;
+Enabling latency metrics can significantly impact the performance of the cluster (in particular for &lt;code&gt;subtask&lt;/code&gt; granularity) due to the sheer amount of metrics being added as well as the use of histograms which are quite expensive to maintain. It is highly recommended to only use them for debugging purposes.&lt;/p&gt;
+&lt;/div&gt;
+
+&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;
+
+&lt;p&gt;In the previous sections we discussed how to monitor Flink’s network stack which primarily involves identifying backpressure: where it occurs, where it originates from, and (potentially) why it occurs. This can be executed in two ways: for simple cases and debugging sessions by using the backpressure monitor; for continuous monitoring, more in-depth analysis, and less runtime overhead by using Flink’s task and network stack metrics. Backpressure can be caused by the network laye [...]
+
+&lt;p&gt;Stay tuned for the third blog post in the series of network stack posts that will focus on tuning techniques and anti-patterns to avoid.&lt;/p&gt;
+
+</description>
+<pubDate>Tue, 23 Jul 2019 17:30:00 +0200</pubDate>
+<link>https://flink.apache.org/2019/07/23/flink-network-stack-2.html</link>
+<guid isPermaLink="true">/2019/07/23/flink-network-stack-2.html</guid>
+</item>
+
+<item>
 <title>Apache Flink 1.8.1 Released</title>
 <description>&lt;p&gt;The Apache Flink community released the first bugfix version of the Apache Flink 1.8 series.&lt;/p&gt;
 
diff --git a/content/blog/index.html b/content/blog/index.html
index 71c9647..1a5e52b 100644
--- a/content/blog/index.html
+++ b/content/blog/index.html
@@ -162,6 +162,19 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></h2>
+
+      <p>23 Jul 2019
+       Nico Kruber  &amp; Piotr Nowojski </p>
+
+      <p>In a previous blog post, we presented how Flink’s network stack works from the high-level abstractions to the low-level details. This second  post discusses monitoring network-related metrics to identify backpressure or bottlenecks in throughput and latency.</p>
+
+      <p><a href="/2019/07/23/flink-network-stack-2.html">Continue reading &raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 Released</a></h2>
 
       <p>02 Jul 2019
@@ -288,19 +301,6 @@ for more details.</p>
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a href="/news/2019/03/06/ffsf-preview.html">What to expect from Flink Forward San Francisco 2019</a></h2>
-
-      <p>06 Mar 2019
-       Fabian Hueske (<a href="https://twitter.com/fhueske">@fhueske</a>)</p>
-
-      <p>The third annual Flink Forward conference in San Francisco is just a few weeks away. Let's see what Flink Forward SF 2019 has in store for the Apache Flink and stream processing communities. This post covers some of its highlights!</p>
-
-      <p><a href="/news/2019/03/06/ffsf-preview.html">Continue reading &raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -333,6 +333,16 @@ for more details.</p>
 
     <ul id="markdown-toc">
       
+      <li><a href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 Released</a></li>
 
       
diff --git a/content/blog/page2/index.html b/content/blog/page2/index.html
index e639939..084a7f7 100644
--- a/content/blog/page2/index.html
+++ b/content/blog/page2/index.html
@@ -162,6 +162,19 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a href="/news/2019/03/06/ffsf-preview.html">What to expect from Flink Forward San Francisco 2019</a></h2>
+
+      <p>06 Mar 2019
+       Fabian Hueske (<a href="https://twitter.com/fhueske">@fhueske</a>)</p>
+
+      <p>The third annual Flink Forward conference in San Francisco is just a few weeks away. Let's see what Flink Forward SF 2019 has in store for the Apache Flink and stream processing communities. This post covers some of its highlights!</p>
+
+      <p><a href="/news/2019/03/06/ffsf-preview.html">Continue reading &raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a href="/news/2019/02/25/monitoring-best-practices.html">Monitoring Apache Flink Applications 101</a></h2>
 
       <p>25 Feb 2019
@@ -294,21 +307,6 @@ Please check the <a href="https://issues.apache.org/jira/secure/ReleaseNote.jspa
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a href="/news/2018/10/29/release-1.5.5.html">Apache Flink 1.5.5 Released</a></h2>
-
-      <p>29 Oct 2018
-      </p>
-
-      <p><p>The Apache Flink community released the fifth bugfix version of the Apache Flink 1.5 series.</p>
-
-</p>
-
-      <p><a href="/news/2018/10/29/release-1.5.5.html">Continue reading &raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -341,6 +339,16 @@ Please check the <a href="https://issues.apache.org/jira/secure/ReleaseNote.jspa
 
     <ul id="markdown-toc">
       
+      <li><a href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 Released</a></li>
 
       
diff --git a/content/blog/page3/index.html b/content/blog/page3/index.html
index a81abee..dab64bf 100644
--- a/content/blog/page3/index.html
+++ b/content/blog/page3/index.html
@@ -162,6 +162,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a href="/news/2018/10/29/release-1.5.5.html">Apache Flink 1.5.5 Released</a></h2>
+
+      <p>29 Oct 2018
+      </p>
+
+      <p><p>The Apache Flink community released the fifth bugfix version of the Apache Flink 1.5 series.</p>
+
+</p>
+
+      <p><a href="/news/2018/10/29/release-1.5.5.html">Continue reading &raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a href="/news/2018/09/20/release-1.6.1.html">Apache Flink 1.6.1 Released</a></h2>
 
       <p>20 Sep 2018
@@ -296,19 +311,6 @@
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a href="/features/2018/03/01/end-to-end-exactly-once-apache-flink.html">An Overview of End-to-End Exactly-Once Processing in Apache Flink (with Apache Kafka, too!)</a></h2>
-
-      <p>01 Mar 2018
-       Piotr Nowojski (<a href="https://twitter.com/PiotrNowojski">@PiotrNowojski</a>) &amp; Mike Winters (<a href="https://twitter.com/wints">@wints</a>)</p>
-
-      <p>Flink 1.4.0 introduced a new feature that makes it possible to build end-to-end exactly-once applications with Flink and data sources and sinks that support transactions.</p>
-
-      <p><a href="/features/2018/03/01/end-to-end-exactly-once-apache-flink.html">Continue reading &raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -341,6 +343,16 @@
 
     <ul id="markdown-toc">
       
+      <li><a href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 Released</a></li>
 
       
diff --git a/content/blog/page4/index.html b/content/blog/page4/index.html
index bd6aa7b..e67e9f0 100644
--- a/content/blog/page4/index.html
+++ b/content/blog/page4/index.html
@@ -162,6 +162,19 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a href="/features/2018/03/01/end-to-end-exactly-once-apache-flink.html">An Overview of End-to-End Exactly-Once Processing in Apache Flink (with Apache Kafka, too!)</a></h2>
+
+      <p>01 Mar 2018
+       Piotr Nowojski (<a href="https://twitter.com/PiotrNowojski">@PiotrNowojski</a>) &amp; Mike Winters (<a href="https://twitter.com/wints">@wints</a>)</p>
+
+      <p>Flink 1.4.0 introduced a new feature that makes it possible to build end-to-end exactly-once applications with Flink and data sources and sinks that support transactions.</p>
+
+      <p><a href="/features/2018/03/01/end-to-end-exactly-once-apache-flink.html">Continue reading &raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a href="/news/2018/02/15/release-1.4.1.html">Apache Flink 1.4.1 Released</a></h2>
 
       <p>15 Feb 2018
@@ -295,21 +308,6 @@ what’s coming in Flink 1.4.0 as well as a preview of what the Flink community
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a href="/news/2017/05/16/official-docker-image.html">Introducing Docker Images for Apache Flink</a></h2>
-
-      <p>16 May 2017 by Patrick Lucas (Data Artisans) and Ismaël Mejía (Talend) (<a href="https://twitter.com/">@iemejia</a>)
-      </p>
-
-      <p><p>For some time, the Apache Flink community has provided scripts to build a Docker image to run Flink. Now, starting with version 1.2.1, Flink will have a <a href="https://hub.docker.com/r/_/flink/">Docker image</a> on the Docker Hub. This image is maintained by the Flink community and curated by the <a href="https://github.com/docker-library/official-images">Docker</a> team to ensure it meets the quality standards for container images of the Docker community.</p>
-
-</p>
-
-      <p><a href="/news/2017/05/16/official-docker-image.html">Continue reading &raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -342,6 +340,16 @@ what’s coming in Flink 1.4.0 as well as a preview of what the Flink community
 
     <ul id="markdown-toc">
       
+      <li><a href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 Released</a></li>
 
       
diff --git a/content/blog/page5/index.html b/content/blog/page5/index.html
index 1243990..8463dfb 100644
--- a/content/blog/page5/index.html
+++ b/content/blog/page5/index.html
@@ -162,6 +162,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a href="/news/2017/05/16/official-docker-image.html">Introducing Docker Images for Apache Flink</a></h2>
+
+      <p>16 May 2017 by Patrick Lucas (Data Artisans) and Ismaël Mejía (Talend) (<a href="https://twitter.com/">@iemejia</a>)
+      </p>
+
+      <p><p>For some time, the Apache Flink community has provided scripts to build a Docker image to run Flink. Now, starting with version 1.2.1, Flink will have a <a href="https://hub.docker.com/r/_/flink/">Docker image</a> on the Docker Hub. This image is maintained by the Flink community and curated by the <a href="https://github.com/docker-library/official-images">Docker</a> team to ensure it meets the quality standards for container images of the Docker community.</p>
+
+</p>
+
+      <p><a href="/news/2017/05/16/official-docker-image.html">Continue reading &raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a href="/news/2017/04/26/release-1.2.1.html">Apache Flink 1.2.1 Released</a></h2>
 
       <p>26 Apr 2017
@@ -289,21 +304,6 @@
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a href="/news/2016/08/24/ff16-keynotes-panels.html">Flink Forward 2016: Announcing Schedule, Keynotes, and Panel Discussion</a></h2>
-
-      <p>24 Aug 2016
-      </p>
-
-      <p><p>An update for the Flink community: the <a href="http://flink-forward.org/kb_day/day-1/">Flink Forward 2016 schedule</a> is now available online. This year's event will include 2 days of talks from stream processing experts at Google, MapR, Alibaba, Netflix, Cloudera, and more. Following the talks is a full day of hands-on Flink training.</p>
-
-</p>
-
-      <p><a href="/news/2016/08/24/ff16-keynotes-panels.html">Continue reading &raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -336,6 +336,16 @@
 
     <ul id="markdown-toc">
       
+      <li><a href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 Released</a></li>
 
       
diff --git a/content/blog/page6/index.html b/content/blog/page6/index.html
index 2a78432..7d4d19a 100644
--- a/content/blog/page6/index.html
+++ b/content/blog/page6/index.html
@@ -162,6 +162,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a href="/news/2016/08/24/ff16-keynotes-panels.html">Flink Forward 2016: Announcing Schedule, Keynotes, and Panel Discussion</a></h2>
+
+      <p>24 Aug 2016
+      </p>
+
+      <p><p>An update for the Flink community: the <a href="http://flink-forward.org/kb_day/day-1/">Flink Forward 2016 schedule</a> is now available online. This year's event will include 2 days of talks from stream processing experts at Google, MapR, Alibaba, Netflix, Cloudera, and more. Following the talks is a full day of hands-on Flink training.</p>
+
+</p>
+
+      <p><a href="/news/2016/08/24/ff16-keynotes-panels.html">Continue reading &raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a href="/news/2016/08/11/release-1.1.1.html">Flink 1.1.1 Released</a></h2>
 
       <p>11 Aug 2016
@@ -293,21 +308,6 @@
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a href="/news/2016/02/11/release-0.10.2.html">Flink 0.10.2 Released</a></h2>
-
-      <p>11 Feb 2016
-      </p>
-
-      <p><p>Today, the Flink community released Flink version <strong>0.10.2</strong>, the second bugfix release of the 0.10 series.</p>
-
-</p>
-
-      <p><a href="/news/2016/02/11/release-0.10.2.html">Continue reading &raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -340,6 +340,16 @@
 
     <ul id="markdown-toc">
       
+      <li><a href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 Released</a></li>
 
       
diff --git a/content/blog/page7/index.html b/content/blog/page7/index.html
index cff4bad..bc908cb 100644
--- a/content/blog/page7/index.html
+++ b/content/blog/page7/index.html
@@ -162,6 +162,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a href="/news/2016/02/11/release-0.10.2.html">Flink 0.10.2 Released</a></h2>
+
+      <p>11 Feb 2016
+      </p>
+
+      <p><p>Today, the Flink community released Flink version <strong>0.10.2</strong>, the second bugfix release of the 0.10 series.</p>
+
+</p>
+
+      <p><a href="/news/2016/02/11/release-0.10.2.html">Continue reading &raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a href="/news/2015/12/18/a-year-in-review.html">Flink 2015: A year in review, and a lookout to 2016</a></h2>
 
       <p>18 Dec 2015 by Robert Metzger (<a href="https://twitter.com/">@rmetzger_</a>)
@@ -297,21 +312,6 @@ vertex-centric or gather-sum-apply to Flink dataflows.</p>
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a href="/news/2015/06/24/announcing-apache-flink-0.9.0-release.html">Announcing Apache Flink 0.9.0</a></h2>
-
-      <p>24 Jun 2015
-      </p>
-
-      <p><p>The Apache Flink community is pleased to announce the availability of the 0.9.0 release. The release is the result of many months of hard work within the Flink community. It contains many new features and improvements which were previewed in the 0.9.0-milestone1 release and have been polished since then. This is the largest Flink release so far.</p>
-
-</p>
-
-      <p><a href="/news/2015/06/24/announcing-apache-flink-0.9.0-release.html">Continue reading &raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -344,6 +344,16 @@ vertex-centric or gather-sum-apply to Flink dataflows.</p>
 
     <ul id="markdown-toc">
       
+      <li><a href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 Released</a></li>
 
       
diff --git a/content/blog/page8/index.html b/content/blog/page8/index.html
index 4956e66..88f15bc 100644
--- a/content/blog/page8/index.html
+++ b/content/blog/page8/index.html
@@ -162,6 +162,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a href="/news/2015/06/24/announcing-apache-flink-0.9.0-release.html">Announcing Apache Flink 0.9.0</a></h2>
+
+      <p>24 Jun 2015
+      </p>
+
+      <p><p>The Apache Flink community is pleased to announce the availability of the 0.9.0 release. The release is the result of many months of hard work within the Flink community. It contains many new features and improvements which were previewed in the 0.9.0-milestone1 release and have been polished since then. This is the largest Flink release so far.</p>
+
+</p>
+
+      <p><a href="/news/2015/06/24/announcing-apache-flink-0.9.0-release.html">Continue reading &raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a href="/news/2015/05/14/Community-update-April.html">April 2015 in the Flink community</a></h2>
 
       <p>14 May 2015 by Kostas Tzoumas (<a href="https://twitter.com/">@kostas_tzoumas</a>)
@@ -303,21 +318,6 @@ and offers a new API including definition of flexible windows.</p>
 
     <hr>
     
-    <article>
-      <h2 class="blog-title"><a href="/news/2015/01/06/december-in-flink.html">December 2014 in the Flink community</a></h2>
-
-      <p>06 Jan 2015
-      </p>
-
-      <p><p>This is the first blog post of a “newsletter” like series where we give a summary of the monthly activity in the Flink community. As the Flink project grows, this can serve as a “tl;dr” for people that are not following the Flink dev and user mailing lists, or those that are simply overwhelmed by the traffic.</p>
-
-</p>
-
-      <p><a href="/news/2015/01/06/december-in-flink.html">Continue reading &raquo;</a></p>
-    </article>
-
-    <hr>
-    
 
     <!-- Pagination links -->
     
@@ -350,6 +350,16 @@ and offers a new API including definition of flexible windows.</p>
 
     <ul id="markdown-toc">
       
+      <li><a href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 Released</a></li>
 
       
diff --git a/content/blog/page9/index.html b/content/blog/page9/index.html
index a3851f8..7772c90 100644
--- a/content/blog/page9/index.html
+++ b/content/blog/page9/index.html
@@ -162,6 +162,21 @@
     <!-- Blog posts -->
     
     <article>
+      <h2 class="blog-title"><a href="/news/2015/01/06/december-in-flink.html">December 2014 in the Flink community</a></h2>
+
+      <p>06 Jan 2015
+      </p>
+
+      <p><p>This is the first blog post of a “newsletter” like series where we give a summary of the monthly activity in the Flink community. As the Flink project grows, this can serve as a “tl;dr” for people that are not following the Flink dev and user mailing lists, or those that are simply overwhelmed by the traffic.</p>
+
+</p>
+
+      <p><a href="/news/2015/01/06/december-in-flink.html">Continue reading &raquo;</a></p>
+    </article>
+
+    <hr>
+    
+    <article>
       <h2 class="blog-title"><a href="/news/2014/11/18/hadoop-compatibility.html">Hadoop Compatibility in Flink</a></h2>
 
       <p>18 Nov 2014 by Fabian Hüske (<a href="https://twitter.com/">@fhueske</a>)
@@ -271,6 +286,16 @@ academic and open source project that Flink originates from.</p>
 
     <ul id="markdown-toc">
       
+      <li><a href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></li>
+
+      
+        
+      
+    
+      
+      
+
+      
       <li><a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 Released</a></li>
 
       
diff --git a/content/css/flink.css b/content/css/flink.css
index 9f13341..5c0f621 100755
--- a/content/css/flink.css
+++ b/content/css/flink.css
@@ -87,6 +87,11 @@ h1, h2, h3, h4, h5, h6 {
     margin-top: -60px;
 }
 
+/* fix conflict with bootstrap's alert */
+.alert h4 {
+	margin-top: -60px;
+}
+
 h1 {
 	font-size: 160%;
 }
diff --git a/content/img/blog/2019-07-23-network-stack-2/back_pressure_sampling_high.png b/content/img/blog/2019-07-23-network-stack-2/back_pressure_sampling_high.png
new file mode 100644
index 0000000..15372fd
Binary files /dev/null and b/content/img/blog/2019-07-23-network-stack-2/back_pressure_sampling_high.png differ
diff --git a/content/index.html b/content/index.html
index d8e8537..48f71fc 100644
--- a/content/index.html
+++ b/content/index.html
@@ -462,6 +462,9 @@
 
   <dl>
       
+        <dt> <a href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></dt>
+        <dd>In a previous blog post, we presented how Flink’s network stack works from the high-level abstractions to the low-level details. This second  post discusses monitoring network-related metrics to identify backpressure or bottlenecks in throughput and latency.</dd>
+      
         <dt> <a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 Released</a></dt>
         <dd><p>The Apache Flink community released the first bugfix version of the Apache Flink 1.8 series.</p>
 
@@ -475,9 +478,6 @@
       
         <dt> <a href="/2019/05/19/state-ttl.html">State TTL in Flink 1.8.0: How to Automatically Cleanup Application State in Apache Flink</a></dt>
         <dd>A common requirement for many stateful streaming applications is to automatically cleanup application state for effective management of your state size, or to control how long the application state can be accessed. State TTL enables application state cleanup and efficient state size management in Apache Flink</dd>
-      
-        <dt> <a href="/2019/05/14/temporal-tables.html">Flux capacitor, huh? Temporal Tables and Joins in Streaming SQL</a></dt>
-        <dd>Apache Flink natively supports temporal table joins since the 1.7 release for straightforward temporal data handling. In this blog post, we provide an overview of how this new concept can be leveraged for effective point-in-time analysis in streaming scenarios.</dd>
     
   </dl>
 
diff --git a/content/roadmap.html b/content/roadmap.html
index aed804b..780a53b 100644
--- a/content/roadmap.html
+++ b/content/roadmap.html
@@ -180,7 +180,7 @@ under the License.
 
 <div class="page-toc">
 <ul id="markdown-toc">
-  <li><a href="#analytics-applications-an-the-roles-of-datastream-dataset-and-table-api" id="markdown-toc-analytics-applications-an-the-roles-of-datastream-dataset-and-table-api">Analytics, Applications, an the roles of DataStream, DataSet, and Table API</a></li>
+  <li><a href="#analytics-applications-and-the-roles-of-datastream-dataset-and-table-api" id="markdown-toc-analytics-applications-and-the-roles-of-datastream-dataset-and-table-api">Analytics, Applications, and the roles of DataStream, DataSet, and Table API</a></li>
   <li><a href="#batch-and-streaming-unification" id="markdown-toc-batch-and-streaming-unification">Batch and Streaming Unification</a></li>
   <li><a href="#fast-batch-bounded-streams" id="markdown-toc-fast-batch-bounded-streams">Fast Batch (Bounded Streams)</a></li>
   <li><a href="#stream-processing-use-cases" id="markdown-toc-stream-processing-use-cases">Stream Processing Use Cases</a></li>
@@ -202,7 +202,7 @@ there is consensus that they will happen and what they will roughly look like fo
 
 <p><strong>Last Update:</strong> 2019-05-08</p>
 
-<h1 id="analytics-applications-an-the-roles-of-datastream-dataset-and-table-api">Analytics, Applications, an the roles of DataStream, DataSet, and Table API</h1>
+<h1 id="analytics-applications-and-the-roles-of-datastream-dataset-and-table-api">Analytics, Applications, and the roles of DataStream, DataSet, and Table API</h1>
 
 <p>Flink views stream processing as a <a href="/flink-architecture.html">unifying paradigm for data processing</a>
 (batch and real-time) and event-driven applications. The APIs are evolving to reflect that view:</p>
diff --git a/content/zh/community.html b/content/zh/community.html
index 19133f5..e6136e1 100644
--- a/content/zh/community.html
+++ b/content/zh/community.html
@@ -580,6 +580,12 @@
     <td class="text-center">shaoxuan</td>
   </tr>
   <tr>
+    <td class="text-center"><img src="https://avatars3.githubusercontent.com/u/12387855?s=50" class="committer-avatar" /></td>
+    <td class="text-center">Zhijiang Wang</td>
+    <td class="text-center">Committer</td>
+    <td class="text-center">zhijiang</td>
+  </tr>
+  <tr>
     <td class="text-center"><img src="https://avatars1.githubusercontent.com/u/1826769?s=50" class="committer-avatar" /></td>
     <td class="text-center">Daniel Warneke</td>
     <td class="text-center">PMC, Committer</td>
diff --git a/content/zh/index.html b/content/zh/index.html
index b0be24d..05ee8d0 100644
--- a/content/zh/index.html
+++ b/content/zh/index.html
@@ -460,6 +460,9 @@
 
   <dl>
       
+        <dt> <a href="/2019/07/23/flink-network-stack-2.html">Flink Network Stack Vol. 2: Monitoring, Metrics, and that Backpressure Thing</a></dt>
+        <dd>In a previous blog post, we presented how Flink’s network stack works from the high-level abstractions to the low-level details. This second  post discusses monitoring network-related metrics to identify backpressure or bottlenecks in throughput and latency.</dd>
+      
         <dt> <a href="/news/2019/07/02/release-1.8.1.html">Apache Flink 1.8.1 Released</a></dt>
         <dd><p>The Apache Flink community released the first bugfix version of the Apache Flink 1.8 series.</p>
 
@@ -473,9 +476,6 @@
       
         <dt> <a href="/2019/05/19/state-ttl.html">State TTL in Flink 1.8.0: How to Automatically Cleanup Application State in Apache Flink</a></dt>
         <dd>A common requirement for many stateful streaming applications is to automatically cleanup application state for effective management of your state size, or to control how long the application state can be accessed. State TTL enables application state cleanup and efficient state size management in Apache Flink</dd>
-      
-        <dt> <a href="/2019/05/14/temporal-tables.html">Flux capacitor, huh? Temporal Tables and Joins in Streaming SQL</a></dt>
-        <dd>Apache Flink natively supports temporal table joins since the 1.7 release for straightforward temporal data handling. In this blog post, we provide an overview of how this new concept can be leveraged for effective point-in-time analysis in streaming scenarios.</dd>
     
   </dl>