Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2020/07/29 07:47:47 UTC

[GitHub] [flink] klion26 commented on a change in pull request #9648: [FLINK-13872] [docs-zh] Translate Operations Playground to Chinese

klion26 commented on a change in pull request #9648:
URL: https://github.com/apache/flink/pull/9648#discussion_r461339990



##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -23,80 +23,70 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-There are many ways to deploy and operate Apache Flink in various environments. Regardless of this
-variety, the fundamental building blocks of a Flink Cluster remain the same, and similar
-operational principles apply.
+Apache Flink 可以以多种方式在不同的环境中部署,抛开这种多样性而言,Flink 集群的基本构建方式和操作原则仍然是相同的。
 
-In this playground, you will learn how to manage and run Flink Jobs. You will see how to deploy and 
-monitor an application, experience how Flink recovers from Job failure, and perform everyday 
-operational tasks like upgrades and rescaling.
+在这篇文章里,你将会学习如何管理和运行 Flink 任务,了解如何部署和监控应用程序、Flink如何从失败作业中进行恢复,同时你还会学习如何执行一些日常操作任务,如升级和扩容。
 
 {% if site.version contains "SNAPSHOT" %}
 <p style="border-radius: 5px; padding: 5px" class="bg-danger">
   <b>
-  NOTE: The Apache Flink Docker images used for this playground are only available for
-  released versions of Apache Flink.
+  注意:本文中使用的 Apache Flink Docker 镜像仅适用于 Apache Flink 发行版。
   </b><br>
-  Since you are currently looking at the latest SNAPSHOT
-  version of the documentation, all version references below will not work.
-  Please switch the documentation to the latest released version via the release picker which you
-  find on the left side below the menu.
+  由于你目前正在浏览快照版的文档,因此下文中引用的分支可能已经不存在了,请先通过左侧菜单下方的版本选择器切换到发行版文档再查看。
 </p>
 {% endif %}
 
 * This will be replaced by the TOC
 {:toc}
 
-## Anatomy of this Playground
+## 场景说明

Review comment:
      For translated headings, it is recommended to add an `<a>` tag; see the [wiki](https://cwiki.apache.org/confluence/display/FLINK/Flink+Translation+Specifications) for details.
  The same applies to all the other headings in this document.
  
  A small trick for adding the anchors: open the English version of the document, check what the URL of the corresponding heading link is, and copy the characters from that URL into the `name` attribute of the `<a>` tag.
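  
  As a minimal sketch (assuming the English page's URL fragment for this heading is `anatomy-of-this-playground`, which is what the existing cross-references in this file already use), the anchored heading would look like:
  
  ```
  <a name="anatomy-of-this-playground"></a>
  ## 场景说明
  ```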

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -252,135 +237,123 @@ curl localhost:8081/jobs
 </div>
 </div>
 
-The JobID is assigned to a Job upon submission and is needed to perform actions on the Job via the 
-CLI or REST API.
+一旦 Job 提交,Flink 会默认为其生成一个 JobID,后续对该 Job 的
+所有操作(无论是通过 CLI 还是 REST API)都需要带上 JobID。
 
-### Observing Failure & Recovery
+### Job 失败与恢复
 
-Flink provides exactly-once processing guarantees under (partial) failure. In this playground you 
-can observe and - to some extent - verify this behavior. 
+在 Job (部分)失败的情况下,Flink 对事件处理依然能够提供精确一次的保障,
+在本节中你将会观察到并能够在某种程度上验证这种行为。 
 
-#### Step 1: Observing the Output
+#### Step 1: 观察输出
 
-As described [above](#anatomy-of-this-playground), the events in this playground are generate such 
-that each window  contains exactly one thousand records. So, in order to verify that Flink 
-successfully recovers from a TaskManager failure without data loss or duplication you can tail the 
-output topic and check that - after recovery - all windows are present and the count is correct.
+如[前文](#场景说明)所述,事件以特定速率生成,刚好使得每个统计窗口都包含确切的1000条记录。

Review comment:
       ```suggestion
   如[前文](#场景说明)所述,事件以特定速率生成,刚好使得每个统计窗口都包含确切的 1000 条记录。
   ```

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -23,80 +23,70 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-There are many ways to deploy and operate Apache Flink in various environments. Regardless of this
-variety, the fundamental building blocks of a Flink Cluster remain the same, and similar
-operational principles apply.
+Apache Flink 可以以多种方式在不同的环境中部署,抛开这种多样性而言,Flink 集群的基本构建方式和操作原则仍然是相同的。
 
-In this playground, you will learn how to manage and run Flink Jobs. You will see how to deploy and 
-monitor an application, experience how Flink recovers from Job failure, and perform everyday 
-operational tasks like upgrades and rescaling.
+在这篇文章里,你将会学习如何管理和运行 Flink 任务,了解如何部署和监控应用程序、Flink如何从失败作业中进行恢复,同时你还会学习如何执行一些日常操作任务,如升级和扩容。
 
 {% if site.version contains "SNAPSHOT" %}
 <p style="border-radius: 5px; padding: 5px" class="bg-danger">
   <b>
-  NOTE: The Apache Flink Docker images used for this playground are only available for
-  released versions of Apache Flink.
+  注意:本文中使用的 Apache Flink Docker 镜像仅适用于 Apache Flink 发行版。
   </b><br>
-  Since you are currently looking at the latest SNAPSHOT
-  version of the documentation, all version references below will not work.
-  Please switch the documentation to the latest released version via the release picker which you
-  find on the left side below the menu.
+  由于你目前正在浏览快照版的文档,因此下文中引用的分支可能已经不存在了,请先通过左侧菜单下方的版本选择器切换到发行版文档再查看。
 </p>
 {% endif %}
 
 * This will be replaced by the TOC
 {:toc}
 
-## Anatomy of this Playground
+## 场景说明
 
-This playground consists of a long living
-[Flink Session Cluster]({{ site.baseurl }}/concepts/glossary.html#flink-session-cluster) and a Kafka
-Cluster.
+这篇文章中的所有操作都是基于如下两个集群进行的: 
+[Flink Session Cluster]({%link concepts/glossary.zh.md %}#flink-session-cluster) 以及一个 Kafka 集群,
+我们会在下文带领大家一起搭建这两个集群。
 
-A Flink Cluster always consists of a 
-[JobManager]({{ site.baseurl }}/concepts/glossary.html#flink-jobmanager) and one or more 
-[Flink TaskManagers]({{ site.baseurl }}/concepts/glossary.html#flink-taskmanager). The JobManager 
-is responsible for handling [Job]({{ site.baseurl }}/concepts/glossary.html#flink-job) submissions, 
-the supervision of Jobs as well as resource management. The Flink TaskManagers are the worker 
-processes and are responsible for the execution of the actual 
-[Tasks]({{ site.baseurl }}/concepts/glossary.html#task) which make up a Flink Job. In this 
-playground you will start with a single TaskManager, but scale out to more TaskManagers later. 
-Additionally, this playground comes with a dedicated *client* container, which we use to submit the 
-Flink Job initially and to perform various operational tasks later on. The *client* container is not
-needed by the Flink Cluster itself but only included for ease of use.
+一个Flink集群总是包含一个 

Review comment:
       ```suggestion
   一个 Flink 集群总是包含一个 
   ```

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -252,135 +237,123 @@ curl localhost:8081/jobs
 </div>
 </div>
 
-The JobID is assigned to a Job upon submission and is needed to perform actions on the Job via the 
-CLI or REST API.
+一旦 Job 提交,Flink 会默认为其生成一个 JobID,后续对该 Job 的
+所有操作(无论是通过 CLI 还是 REST API)都需要带上 JobID。
 
-### Observing Failure & Recovery
+### Job 失败与恢复
 
-Flink provides exactly-once processing guarantees under (partial) failure. In this playground you 
-can observe and - to some extent - verify this behavior. 
+在 Job (部分)失败的情况下,Flink 对事件处理依然能够提供精确一次的保障,
+在本节中你将会观察到并能够在某种程度上验证这种行为。 
 
-#### Step 1: Observing the Output
+#### Step 1: 观察输出
 
-As described [above](#anatomy-of-this-playground), the events in this playground are generate such 
-that each window  contains exactly one thousand records. So, in order to verify that Flink 
-successfully recovers from a TaskManager failure without data loss or duplication you can tail the 
-output topic and check that - after recovery - all windows are present and the count is correct.
+如[前文](#场景说明)所述,事件以特定速率生成,刚好使得每个统计窗口都包含确切的1000条记录。

Review comment:
      Once `<a>` tags have been added to all the headings, the reference here can be left unchanged.

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -252,135 +237,123 @@ curl localhost:8081/jobs
 </div>
 </div>
 
-The JobID is assigned to a Job upon submission and is needed to perform actions on the Job via the 
-CLI or REST API.
+一旦 Job 提交,Flink 会默认为其生成一个 JobID,后续对该 Job 的
+所有操作(无论是通过 CLI 还是 REST API)都需要带上 JobID。
 
-### Observing Failure & Recovery
+### Job 失败与恢复
 
-Flink provides exactly-once processing guarantees under (partial) failure. In this playground you 
-can observe and - to some extent - verify this behavior. 
+在 Job (部分)失败的情况下,Flink 对事件处理依然能够提供精确一次的保障,
+在本节中你将会观察到并能够在某种程度上验证这种行为。 
 
-#### Step 1: Observing the Output
+#### Step 1: 观察输出
 
-As described [above](#anatomy-of-this-playground), the events in this playground are generate such 
-that each window  contains exactly one thousand records. So, in order to verify that Flink 
-successfully recovers from a TaskManager failure without data loss or duplication you can tail the 
-output topic and check that - after recovery - all windows are present and the count is correct.
+如[前文](#场景说明)所述,事件以特定速率生成,刚好使得每个统计窗口都包含确切的1000条记录。
+因此,你可以实时查看 output topic 的输出,确定失败恢复后所有的窗口依然输出正确的统计数字,
+以此来验证 Flink 在 TaskManager 失败时能够成功恢复,而且不丢失数据、不产生数据重复。
 
-For this, start reading from the *output* topic and leave this command running until after 
-recovery (Step 3).
+为此,通过控制台命令消费 *output* topic,保持消费直到 Job 从失败中恢复 (Step 3)。
 
 {% highlight bash%}
 docker-compose exec kafka kafka-console-consumer.sh \
   --bootstrap-server localhost:9092 --topic output
 {% endhighlight %}
 
-#### Step 2: Introducing a Fault
+#### Step 2: 模拟失败
 
-In order to simulate a partial failure you can kill a TaskManager. In a production setup, this 
-could correspond to a loss of the TaskManager process, the TaskManager machine or simply a transient 
-exception being thrown from the framework or user code (e.g. due to the temporary unavailability of 
-an external resource).   
+为了模拟部分失败故障,你可以 kill 掉一个 TaskManager,这种失败行为在生产环境中就相当于 
+TaskManager 进程挂掉、TaskManager 机器宕机或者从框架或用户代码中抛出的一个临时异常(例如,由于外部资源暂时不可用)而导致的失败。   
 
 {% highlight bash%}
 docker-compose kill taskmanager
 {% endhighlight %}
 
-After a few seconds, the JobManager will notice the loss of the TaskManager, cancel the affected Job, and 
-immediately resubmit it for recovery.
-When the Job gets restarted, its tasks remain in the `SCHEDULED` state, which is indicated by the 
-purple colored squares (see screenshot below).
+几秒钟后,JobManager 就会感知到 TaskManager 已失联,接下来它会
+取消 Job 运行并且立即重新提交该 Job 以进行恢复。
+当 Job 重启后,所有的任务都会处于 `SCHEDULED` 状态,如以下截图中紫色方格所示:
 
-<img src="{{ site.baseurl }}/fig/playground-webui-failure.png" alt="Playground Flink WebUI" 
+<img src="{%link fig/playground-webui-failure.png %}" alt="Playground Flink WebUI" 
 class="offset" width="100%" />
 
 <p style="border-radius: 5px; padding: 5px" class="bg-info">
-  <b>Note</b>: Even though the tasks of the job are in SCHEDULED state and not RUNNING yet, the overall 
-  status of a Job is shown as RUNNING.
+  <b>注意</b>:虽然 Job 的所有任务都处于 SCHEDULED 状态,但整个 Job 的状态却显示为 RUNNING。
 </p>
 
-At this point, the tasks of the Job cannot move from the `SCHEDULED` state to `RUNNING` because there
-are no resources (TaskSlots provided by TaskManagers) to the run the tasks.
-Until a new TaskManager becomes available, the Job will go through a cycle of cancellations and resubmissions.
+此时,由于 TaskManager 提供的 TaskSlots 资源不够用,Job 的所有任务都不能成功转为 
+`RUNNING` 状态,直到有新的 TaskManager 可用。在此之前,该 Job 将经历一个取消和重新提交
+不断循环的过程。
 
-In the meantime, the data generator keeps pushing `ClickEvent`s into the *input* topic. This is 
-similar to a real production setup where data is produced while the Job to process it is down.
+与此同时,数据生成器 (data generator) 一直不断地往 *input* topic 中生成 `ClickEvent` 事件,在生产环境中也经常出现这种 Job 挂掉但源头还在不断产生数据的情况。
 
-#### Step 3: Recovery
+#### Step 3: 失败恢复
 
-Once you restart the TaskManager, it reconnects to the JobManager.
+一旦 TaskManager 重启成功,它将会重新连接到 JobManager。
 
 {% highlight bash%}
 docker-compose up -d taskmanager
 {% endhighlight %}
 
-When the JobManager is notified about the new TaskManager, it schedules the tasks of the 
-recovering Job to the newly available TaskSlots. Upon restart, the tasks recover their state from
-the last successful [checkpoint]({{ site.baseurl }}/learn-flink/fault_tolerance.html) that was taken
-before the failure and switch to the `RUNNING` state.
+当 TaskManager 注册成功后,JobManager 就会将处于 `SCHEDULED` 状态的所有任务调度到该 TaskManager 
+的可用 TaskSlots 中运行,此时所有的任务将会从失败前最近一次成功的 
+[checkpoint]({%link learn-flink/fault_tolerance.zh.md %}) 进行恢复,
+一旦恢复成功,它们的状态将转变为 `RUNNING`。
 
-The Job will quickly process the full backlog of input events (accumulated during the outage) 
-from Kafka and produce output at a much higher rate (> 24 records/minute) until it reaches 
-the head of the stream. In the *output* you will see that all keys (`page`s) are present for all time 
-windows and that every count is exactly one thousand. Since we are using the 
-[FlinkKafkaProducer]({{ site.baseurl }}/dev/connectors/kafka.html#kafka-producers-and-fault-tolerance)
-in its "at-least-once" mode, there is a chance that you will see some duplicate output records.
+接下来该 Job 将快速处理 Kafka input 事件的全部积压(在 Job 中断期间累积的数据),
+并以更快的速度(>24条记录/分钟)产生输出,直到它追上 kafka 的 lag 延迟为止。
+此时观察 *output* topic 输出,
+你会看到在每一个时间窗口中都有按 `page` 进行分组的记录,而且计数刚好是1000。
+由于我们使用的是 [FlinkKafkaProducer]({%link dev/connectors/kafka.zh.md %}#kafka-producers-and-fault-tolerance) "至少一次"模式,因此你可能会看到一些记录重复输出多次。
 
 <p style="border-radius: 5px; padding: 5px" class="bg-info">
-  <b>Note</b>: Most production setups rely on a resource manager (Kubernetes, Yarn, Mesos) to
-  automatically restart failed processes.
+  <b>注意</b>:在大部分生产环境中都需要一个资源管理器 (Kubernetes、Yarn,、Mesos)对
+  失败的 Job 进行自动重启。
 </p>
 
-### Upgrading & Rescaling a Job
+### Job 升级与扩容
 
-Upgrading a Flink Job always involves two steps: First, the Flink Job is gracefully stopped with a
-[Savepoint]({{ site.baseurl }}/ops/state/savepoints.html). A Savepoint is a consistent snapshot of 
-the complete application state at a well-defined, globally consistent point in time (similar to a 
-checkpoint). Second, the upgraded Flink Job is started from the Savepoint. In this context "upgrade" 
-can mean different things including the following:
+升级 Flink 作业一般都需要两步:第一,使用 [Savepoint]({%link ops/state/savepoints.zh.md %}) 优雅地停止 Flink Job。
+Savepoint 是整个应用程序状态的一次快照(类似于 checkpoint ),该快照是在一个明确定义的、全局一致的时间点生成的。第二,从 Savepoint 恢复启动待升级的 Flink Job。
+在此,“升级”包含如下几种含义:
 
-* An upgrade to the configuration (incl. the parallelism of the Job)
-* An upgrade to the topology of the Job (added/removed Operators)
-* An upgrade to the user-defined functions of the Job
+* 配置升级(比如 Job 并行度修改)
+* Job 拓扑升级(比如添加或者删除算子)
+* Job 的用户自定义函数升级
 
-Before starting with the upgrade you might want to start tailing the *output* topic, in order to 
-observe that no data is lost or corrupted in the course the upgrade. 
+在开始升级之前,你可能需要实时查看 *Output* topic 输出,
+以便观察在升级过程中没有数据丢失或损坏。
 
 {% highlight bash%}
 docker-compose exec kafka kafka-console-consumer.sh \
   --bootstrap-server localhost:9092 --topic output
 {% endhighlight %}
 
-#### Step 1: Stopping the Job
+#### Step 1: 停止 Job
 
-To gracefully stop the Job, you need to use the "stop" command of either the CLI or the REST API. 
-For this you will need the JobID of the Job, which you can obtain by 
-[listing all running Jobs](#listing-running-jobs) or from the WebUI. With the JobID you can proceed 
-to stopping the Job:
+要优雅停止 Job,需要使用 JobID 通过 CLI 或 REST API 调用 “stop” 命令。
+JobID 可以通过[获取所有运行中的 Job](#获取所有运行中的-job) 接口或 Flink WebUI 界面获取,拿到 JobID 后就可以继续停止作业了:

Review comment:
      Please update this link as well after adding the `<a>` tags.
  The same goes for all the other links throughout the document.
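  
  As an illustrative sketch (assuming the target heading gets `<a name="listing-running-jobs"></a>`, mirroring the English page's URL fragment), the link here would then become:
  
  ```
  [获取所有运行中的 Job](#listing-running-jobs)
  ```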

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -517,44 +487,42 @@ docker-compose run --no-deps client curl -X POST -H "Expect:" \
 
 {% endhighlight %}
 
-**Request**
+**请求**
 {% highlight bash %}
-# Submitting the Job
+# 提交 Job
 curl -X POST http://localhost:8081/jars/<jar-id>/run \
   -d '{"parallelism": 3, "programArgs": "--bootstrap.servers kafka:9092 --checkpointing --event-time", "savepointPath": "<savepoint-path>"}'
 {% endhighlight %}
-**Expected Response (pretty-printed**
+**预期响应 (结果已格式化)**
 {% highlight json %}
 {
   "jobid": "<job-id>"
 }
 {% endhighlight %}
 </div>
 </div>
-Now, the Job has been resubmitted, but it will not start as there are not enough TaskSlots to
-execute it with the increased parallelism (2 available, 3 needed). With
+现在 Job 已重新提交,但由于我们提高了并行度所以导致 TaskSlots 不够用(1个 TaskSlot 可用,总共需要3个),最终 Job 会重启失败。通过如下命令:
 {% highlight bash %}
 docker-compose scale taskmanager=2
 {% endhighlight %}
-you can add a second TaskManager with two TaskSlots to the Flink Cluster, which will automatically register with the 
-JobManager. Shortly after adding the TaskManager the Job should start running again.
+你可以向 Flink 集群添加第二个 TaskManager(为 Flink 集群提供2个 TaskSlots 资源),
+它会自动向 JobManager 注册,TaskManager 注册完成后,Job 会再次处于 "RUNNING" 状态。
 
-Once the Job is "RUNNING" again, you will see in the *output* Topic that no data was lost during 
-rescaling: all windows are present with a count of exactly one thousand.
+一旦 Job 再次运行起来,从 *output* Topic 的输出中你会看到在扩容期间数据依然没有丢失:
+所有窗口的计数都正好是1000。

Review comment:
       ```suggestion
   所有窗口的计数都正好是 1000。
   ```

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -252,135 +237,123 @@ curl localhost:8081/jobs
 </div>
 </div>
 
-The JobID is assigned to a Job upon submission and is needed to perform actions on the Job via the 
-CLI or REST API.
+一旦 Job 提交,Flink 会默认为其生成一个 JobID,后续对该 Job 的
+所有操作(无论是通过 CLI 还是 REST API)都需要带上 JobID。
 
-### Observing Failure & Recovery
+### Job 失败与恢复
 
-Flink provides exactly-once processing guarantees under (partial) failure. In this playground you 
-can observe and - to some extent - verify this behavior. 
+在 Job (部分)失败的情况下,Flink 对事件处理依然能够提供精确一次的保障,
+在本节中你将会观察到并能够在某种程度上验证这种行为。 
 
-#### Step 1: Observing the Output
+#### Step 1: 观察输出
 
-As described [above](#anatomy-of-this-playground), the events in this playground are generate such 
-that each window  contains exactly one thousand records. So, in order to verify that Flink 
-successfully recovers from a TaskManager failure without data loss or duplication you can tail the 
-output topic and check that - after recovery - all windows are present and the count is correct.
+如[前文](#场景说明)所述,事件以特定速率生成,刚好使得每个统计窗口都包含确切的1000条记录。
+因此,你可以实时查看 output topic 的输出,确定失败恢复后所有的窗口依然输出正确的统计数字,
+以此来验证 Flink 在 TaskManager 失败时能够成功恢复,而且不丢失数据、不产生数据重复。
 
-For this, start reading from the *output* topic and leave this command running until after 
-recovery (Step 3).
+为此,通过控制台命令消费 *output* topic,保持消费直到 Job 从失败中恢复 (Step 3)。
 
 {% highlight bash%}
 docker-compose exec kafka kafka-console-consumer.sh \
   --bootstrap-server localhost:9092 --topic output
 {% endhighlight %}
 
-#### Step 2: Introducing a Fault
+#### Step 2: 模拟失败
 
-In order to simulate a partial failure you can kill a TaskManager. In a production setup, this 
-could correspond to a loss of the TaskManager process, the TaskManager machine or simply a transient 
-exception being thrown from the framework or user code (e.g. due to the temporary unavailability of 
-an external resource).   
+为了模拟部分失败故障,你可以 kill 掉一个 TaskManager,这种失败行为在生产环境中就相当于 
+TaskManager 进程挂掉、TaskManager 机器宕机或者从框架或用户代码中抛出的一个临时异常(例如,由于外部资源暂时不可用)而导致的失败。   
 
 {% highlight bash%}
 docker-compose kill taskmanager
 {% endhighlight %}
 
-After a few seconds, the JobManager will notice the loss of the TaskManager, cancel the affected Job, and 
-immediately resubmit it for recovery.
-When the Job gets restarted, its tasks remain in the `SCHEDULED` state, which is indicated by the 
-purple colored squares (see screenshot below).
+几秒钟后,JobManager 就会感知到 TaskManager 已失联,接下来它会
+取消 Job 运行并且立即重新提交该 Job 以进行恢复。
+当 Job 重启后,所有的任务都会处于 `SCHEDULED` 状态,如以下截图中紫色方格所示:
 
-<img src="{{ site.baseurl }}/fig/playground-webui-failure.png" alt="Playground Flink WebUI" 
+<img src="{%link fig/playground-webui-failure.png %}" alt="Playground Flink WebUI" 
 class="offset" width="100%" />
 
 <p style="border-radius: 5px; padding: 5px" class="bg-info">
-  <b>Note</b>: Even though the tasks of the job are in SCHEDULED state and not RUNNING yet, the overall 
-  status of a Job is shown as RUNNING.
+  <b>注意</b>:虽然 Job 的所有任务都处于 SCHEDULED 状态,但整个 Job 的状态却显示为 RUNNING。
 </p>
 
-At this point, the tasks of the Job cannot move from the `SCHEDULED` state to `RUNNING` because there
-are no resources (TaskSlots provided by TaskManagers) to the run the tasks.
-Until a new TaskManager becomes available, the Job will go through a cycle of cancellations and resubmissions.
+此时,由于 TaskManager 提供的 TaskSlots 资源不够用,Job 的所有任务都不能成功转为 
+`RUNNING` 状态,直到有新的 TaskManager 可用。在此之前,该 Job 将经历一个取消和重新提交
+不断循环的过程。
 
-In the meantime, the data generator keeps pushing `ClickEvent`s into the *input* topic. This is 
-similar to a real production setup where data is produced while the Job to process it is down.
+与此同时,数据生成器 (data generator) 一直不断地往 *input* topic 中生成 `ClickEvent` 事件,在生产环境中也经常出现这种 Job 挂掉但源头还在不断产生数据的情况。
 
-#### Step 3: Recovery
+#### Step 3: 失败恢复
 
-Once you restart the TaskManager, it reconnects to the JobManager.
+一旦 TaskManager 重启成功,它将会重新连接到 JobManager。
 
 {% highlight bash%}
 docker-compose up -d taskmanager
 {% endhighlight %}
 
-When the JobManager is notified about the new TaskManager, it schedules the tasks of the 
-recovering Job to the newly available TaskSlots. Upon restart, the tasks recover their state from
-the last successful [checkpoint]({{ site.baseurl }}/learn-flink/fault_tolerance.html) that was taken
-before the failure and switch to the `RUNNING` state.
+当 TaskManager 注册成功后,JobManager 就会将处于 `SCHEDULED` 状态的所有任务调度到该 TaskManager 
+的可用 TaskSlots 中运行,此时所有的任务将会从失败前最近一次成功的 
+[checkpoint]({%link learn-flink/fault_tolerance.zh.md %}) 进行恢复,
+一旦恢复成功,它们的状态将转变为 `RUNNING`。
 
-The Job will quickly process the full backlog of input events (accumulated during the outage) 
-from Kafka and produce output at a much higher rate (> 24 records/minute) until it reaches 
-the head of the stream. In the *output* you will see that all keys (`page`s) are present for all time 
-windows and that every count is exactly one thousand. Since we are using the 
-[FlinkKafkaProducer]({{ site.baseurl }}/dev/connectors/kafka.html#kafka-producers-and-fault-tolerance)
-in its "at-least-once" mode, there is a chance that you will see some duplicate output records.
+接下来该 Job 将快速处理 Kafka input 事件的全部积压(在 Job 中断期间累积的数据),
+并以更快的速度(>24条记录/分钟)产生输出,直到它追上 kafka 的 lag 延迟为止。

Review comment:
       ```suggestion
   并以更快的速度(>24 条记录/分钟)产生输出,直到它追上 kafka 的 lag 延迟为止。
   ```

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -794,35 +761,22 @@ curl localhost:8081/jobs/<jod-id>
 }
 {% endhighlight %}
 
-Please consult the [REST API reference]({{ site.baseurl }}/monitoring/rest_api.html#api)
-for a complete list of possible queries including how to query metrics of different scopes (e.g. 
-TaskManager metrics);
+请查阅 [REST API 参考]({%link monitoring/rest_api.zh.md %}#api),该参考上有完整的指标查询接口信息,包括如何查询不同种类的指标(例如 TaskManager 指标)。
 
 {%  top %}
 
-## Variants
+## 延伸拓展
 
-You might have noticed that the *Click Event Count* application was always started with `--checkpointing` 
-and `--event-time` program arguments. By omitting these in the command of the *client* container in the 
-`docker-compose.yaml`, you can change the behavior of the Job.
+你可能已经注意到了,*Click Event Count* 这个 Job 在启动时总是会带上 `--checkpointing` 和 `--event-time` 两个参数,
+如果我们去除这两个参数,那么 Job 的行为也会随之改变。
 
-* `--checkpointing` enables [checkpoint]({{ site.baseurl }}/learn-flink/fault_tolerance.html), 
-which is Flink's fault-tolerance mechanism. If you run without it and go through 
-[failure and recovery](#observing-failure--recovery), you should will see that data is actually 
-lost.
+* `--checkpointing` 参数开启了 [checkpoint]({%link learn-flink/fault_tolerance.zh.md %}) 配置,savepoint 是 Flink 容错机制的重要保证。
+如果你没有开启 checkpoint,那么在 
+[Job 失败与恢复](#job-失败与恢复)这一节中,你将会看到数据丢失现象发生。
 
-* `--event-time` enables [event time semantics]({{ site.baseurl }}/dev/event_time.html) for your 
-Job. When disabled, the Job will assign events to windows based on the wall-clock time instead of 
-the timestamp of the `ClickEvent`. Consequently, the number of events per window will not be exactly
-one thousand anymore. 
+* `--event-time` 参数开启了 Job 的 [事件时间]({%link dev/event_time.zh.md %}) 机制,该机制会使用 `ClickEvent` 自带的时间戳进行统计。
+如果不指定该参数,Flink 将结合当前机器时间使用事件处理时间进行统计。如此一来,每个窗口计数将不再是准确的1000了。 
 
-The *Click Event Count* application also has another option, turned off by default, that you can 
-enable to explore the behavior of this job under backpressure. You can add this option in the 
-command of the *client* container in `docker-compose.yaml`.
+*Click Event Count* 这个 Job 还有另外一个选项,该选项默认是关闭的,你可以在 *client* 容器的 `docker-compose.yaml` 文件中添加该选项从而观察该 Job 在反压下的表现,该选项描述如下:
 
-* `--backpressure` adds an additional operator into the middle of the job that causes severe backpressure 
-during even-numbered minutes (e.g., during 10:12, but not during 10:13). This can be observed by 
-inspecting various [network metrics]({{ site.baseurl }}/monitoring/metrics.html#default-shuffle-service) 
-such as `outputQueueLength` and `outPoolUsage`, and/or by using the 
-[backpressure monitoring]({{ site.baseurl }}/monitoring/back_pressure.html#monitoring-back-pressure) 
-available in the WebUI.
+* `--backpressure` 将一个额外算子添加到 Job 中,该算子会在偶数分钟内产生严重的反压(比如:10:12期间,而10:13期间不会)。这种现象可以通过多种[网络指标]({%link monitoring/metrics.zh.md %}#default-shuffle-service)观察到,比如:`outputQueueLength` 和 `outPoolUsage` 指标,通过 WebUI 上的[反压监控]({%link monitoring/back_pressure.zh.md %}#monitoring-back-pressure)也可以观察到。

Review comment:
       ```suggestion
   * `--backpressure` 将一个额外算子添加到 Job 中,该算子会在偶数分钟内产生严重的反压(比如:10:12 期间,而10:13 期间不会)。这种现象可以通过多种[网络指标]({%link monitoring/metrics.zh.md %}#default-shuffle-service)观察到,比如:`outputQueueLength` 和 `outPoolUsage` 指标,通过 WebUI 上的[反压监控]({%link monitoring/back_pressure.zh.md %}#monitoring-back-pressure)也可以观察到。
   ```

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -252,135 +237,123 @@ curl localhost:8081/jobs
 </div>
 </div>
 
-The JobID is assigned to a Job upon submission and is needed to perform actions on the Job via the 
-CLI or REST API.
+一旦 Job 提交,Flink 会默认为其生成一个 JobID,后续对该 Job 的
+所有操作(无论是通过 CLI 还是 REST API)都需要带上 JobID。
 
-### Observing Failure & Recovery
+### Job 失败与恢复
 
-Flink provides exactly-once processing guarantees under (partial) failure. In this playground you 
-can observe and - to some extent - verify this behavior. 
+在 Job (部分)失败的情况下,Flink 对事件处理依然能够提供精确一次的保障,
+在本节中你将会观察到并能够在某种程度上验证这种行为。 
 
-#### Step 1: Observing the Output
+#### Step 1: 观察输出
 
-As described [above](#anatomy-of-this-playground), the events in this playground are generate such 
-that each window  contains exactly one thousand records. So, in order to verify that Flink 
-successfully recovers from a TaskManager failure without data loss or duplication you can tail the 
-output topic and check that - after recovery - all windows are present and the count is correct.
+如[前文](#场景说明)所述,事件以特定速率生成,刚好使得每个统计窗口都包含确切的1000条记录。
+因此,你可以实时查看 output topic 的输出,确定失败恢复后所有的窗口依然输出正确的统计数字,
+以此来验证 Flink 在 TaskManager 失败时能够成功恢复,而且不丢失数据、不产生数据重复。
 
-For this, start reading from the *output* topic and leave this command running until after 
-recovery (Step 3).
+为此,通过控制台命令消费 *output* topic,保持消费直到 Job 从失败中恢复 (Step 3)。
 
 {% highlight bash%}
 docker-compose exec kafka kafka-console-consumer.sh \
   --bootstrap-server localhost:9092 --topic output
 {% endhighlight %}
 
-#### Step 2: Introducing a Fault
+#### Step 2: 模拟失败
 
-In order to simulate a partial failure you can kill a TaskManager. In a production setup, this 
-could correspond to a loss of the TaskManager process, the TaskManager machine or simply a transient 
-exception being thrown from the framework or user code (e.g. due to the temporary unavailability of 
-an external resource).   
+为了模拟部分失败故障,你可以 kill 掉一个 TaskManager,这种失败行为在生产环境中就相当于 
+TaskManager 进程挂掉、TaskManager 机器宕机或者从框架或用户代码中抛出的一个临时异常(例如,由于外部资源暂时不可用)而导致的失败。   
 
 {% highlight bash%}
 docker-compose kill taskmanager
 {% endhighlight %}
 
-After a few seconds, the JobManager will notice the loss of the TaskManager, cancel the affected Job, and 
-immediately resubmit it for recovery.
-When the Job gets restarted, its tasks remain in the `SCHEDULED` state, which is indicated by the 
-purple colored squares (see screenshot below).
+几秒钟后,JobManager 就会感知到 TaskManager 已失联,接下来它会
+取消 Job 运行并且立即重新提交该 Job 以进行恢复。
+当 Job 重启后,所有的任务都会处于 `SCHEDULED` 状态,如以下截图中紫色方格所示:
 
-<img src="{{ site.baseurl }}/fig/playground-webui-failure.png" alt="Playground Flink WebUI" 
+<img src="{%link fig/playground-webui-failure.png %}" alt="Playground Flink WebUI" 
 class="offset" width="100%" />
 
 <p style="border-radius: 5px; padding: 5px" class="bg-info">
-  <b>Note</b>: Even though the tasks of the job are in SCHEDULED state and not RUNNING yet, the overall 
-  status of a Job is shown as RUNNING.
+  <b>注意</b>:虽然 Job 的所有任务都处于 SCHEDULED 状态,但整个 Job 的状态却显示为 RUNNING。
 </p>
 
-At this point, the tasks of the Job cannot move from the `SCHEDULED` state to `RUNNING` because there
-are no resources (TaskSlots provided by TaskManagers) to the run the tasks.
-Until a new TaskManager becomes available, the Job will go through a cycle of cancellations and resubmissions.
+此时,由于 TaskManager 提供的 TaskSlots 资源不够用,Job 的所有任务都不能成功转为 
+`RUNNING` 状态,直到有新的 TaskManager 可用。在此之前,该 Job 将经历一个取消和重新提交
+不断循环的过程。
 
-In the meantime, the data generator keeps pushing `ClickEvent`s into the *input* topic. This is 
-similar to a real production setup where data is produced while the Job to process it is down.
+与此同时,数据生成器 (data generator) 一直不断地往 *input* topic 中生成 `ClickEvent` 事件,在生产环境中也经常出现这种 Job 挂掉但源头还在不断产生数据的情况。
 
-#### Step 3: Recovery
+#### Step 3: 失败恢复
 
-Once you restart the TaskManager, it reconnects to the JobManager.
+一旦 TaskManager 重启成功,它将会重新连接到 JobManager。
 
 {% highlight bash%}
 docker-compose up -d taskmanager
 {% endhighlight %}
 
-When the JobManager is notified about the new TaskManager, it schedules the tasks of the 
-recovering Job to the newly available TaskSlots. Upon restart, the tasks recover their state from
-the last successful [checkpoint]({{ site.baseurl }}/learn-flink/fault_tolerance.html) that was taken
-before the failure and switch to the `RUNNING` state.
+当 TaskManager 注册成功后,JobManager 就会将处于 `SCHEDULED` 状态的所有任务调度到该 TaskManager 
+的可用 TaskSlots 中运行,此时所有的任务将会从失败前最近一次成功的 
+[checkpoint]({%link learn-flink/fault_tolerance.zh.md %}) 进行恢复,
+一旦恢复成功,它们的状态将转变为 `RUNNING`。
 
-The Job will quickly process the full backlog of input events (accumulated during the outage) 
-from Kafka and produce output at a much higher rate (> 24 records/minute) until it reaches 
-the head of the stream. In the *output* you will see that all keys (`page`s) are present for all time 
-windows and that every count is exactly one thousand. Since we are using the 
-[FlinkKafkaProducer]({{ site.baseurl }}/dev/connectors/kafka.html#kafka-producers-and-fault-tolerance)
-in its "at-least-once" mode, there is a chance that you will see some duplicate output records.
+接下来该 Job 将快速处理 Kafka input 事件的全部积压(在 Job 中断期间累积的数据),
+并以更快的速度(>24条记录/分钟)产生输出,直到它追上 kafka 的 lag 延迟为止。
+此时观察 *output* topic 输出,
+你会看到在每一个时间窗口中都有按 `page` 进行分组的记录,而且计数刚好是1000。

Review comment:
       ```suggestion
   你会看到在每一个时间窗口中都有按 `page` 进行分组的记录,而且计数刚好是 1000。
   ```

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -460,55 +432,53 @@ docker-compose run --no-deps client curl -X POST -H "Expect:" \
 
 {% endhighlight %}
 
-**Request**
+**请求**
 {% highlight bash %}
-# Submitting the Job
+# 提交 Job
 curl -X POST http://localhost:8081/jars/<jar-id>/run \
   -d '{"programArgs": "--bootstrap.servers kafka:9092 --checkpointing --event-time", "savepointPath": "<savepoint-path>"}'
 {% endhighlight %}
-**Expected Response (pretty-printed)**
+**预期响应 (结果已格式化)**
 {% highlight json %}
 {
   "jobid": "<job-id>"
 }
 {% endhighlight %}
 </div>
 </div>
+ 
+一旦该 Job 再次处于 `RUNNING` 状态,你将从 *output* Topic 中看到数据在快速输出,
+因为刚启动的 Job 正在处理停止期间积压的大量数据。另外,你还会看到在升级期间
+没有产生任何数据丢失:所有窗口都在输出1000。

Review comment:
       ```suggestion
   没有产生任何数据丢失:所有窗口都在输出 1000。
   ```

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -252,135 +239,124 @@ curl localhost:8081/jobs
 </div>
 </div>
 
-The JobID is assigned to a Job upon submission and is needed to perform actions on the Job via the 
-CLI or REST API.
+一旦 Job 提交,Flink 会默认为其生成一个 JobID,后续对该 Job 的
+所有操作(无论是通过 CLI 还是 REST API)都需要带上 JobID。
+<a name="observing-failure--recovery"></a>
 
-### Observing Failure & Recovery
+### Job 失败与恢复
 
-Flink provides exactly-once processing guarantees under (partial) failure. In this playground you 
-can observe and - to some extent - verify this behavior. 
+在 Job (部分)失败的情况下,Flink 对事件处理依然能够提供精确一次的保障,
+在本节中你将会观察到并能够在某种程度上验证这种行为。 
 
-#### Step 1: Observing the Output
+#### Step 1: 观察输出
 
-As described [above](#anatomy-of-this-playground), the events in this playground are generate such 
-that each window  contains exactly one thousand records. So, in order to verify that Flink 
-successfully recovers from a TaskManager failure without data loss or duplication you can tail the 
-output topic and check that - after recovery - all windows are present and the count is correct.
+如[前文](#anatomy-of-this-playground)所述,事件以特定速率生成,刚好使得每个统计窗口都包含确切的 1000 条记录。
+因此,你可以实时查看 output topic 的输出,确定失败恢复后所有的窗口依然输出正确的统计数字,
+以此来验证 Flink 在 TaskManager 失败时能够成功恢复,而且不丢失数据、不产生数据重复。
 
-For this, start reading from the *output* topic and leave this command running until after 
-recovery (Step 3).
+为此,通过控制台命令消费 *output* topic,保持消费直到 Job 从失败中恢复 (Step 3)。
 
 {% highlight bash%}
 docker-compose exec kafka kafka-console-consumer.sh \
   --bootstrap-server localhost:9092 --topic output
 {% endhighlight %}
 
-#### Step 2: Introducing a Fault
+#### Step 2: 模拟失败
 
-In order to simulate a partial failure you can kill a TaskManager. In a production setup, this 
-could correspond to a loss of the TaskManager process, the TaskManager machine or simply a transient 
-exception being thrown from the framework or user code (e.g. due to the temporary unavailability of 
-an external resource).   
+为了模拟部分失败故障,你可以 kill 掉一个 TaskManager,这种失败行为在生产环境中就相当于 
+TaskManager 进程挂掉、TaskManager 机器宕机或者从框架或用户代码中抛出的一个临时异常(例如,由于外部资源暂时不可用)而导致的失败。   
 
 {% highlight bash%}
 docker-compose kill taskmanager
 {% endhighlight %}
 
-After a few seconds, the JobManager will notice the loss of the TaskManager, cancel the affected Job, and 
-immediately resubmit it for recovery.
-When the Job gets restarted, its tasks remain in the `SCHEDULED` state, which is indicated by the 
-purple colored squares (see screenshot below).
+几秒钟后,JobManager 就会感知到 TaskManager 已失联,接下来它会
+取消 Job 运行并且立即重新提交该 Job 以进行恢复。
+当 Job 重启后,所有的任务都会处于 `SCHEDULED` 状态,如以下截图中紫色方格所示:
 
-<img src="{{ site.baseurl }}/fig/playground-webui-failure.png" alt="Playground Flink WebUI" 
+<img src="{%link fig/playground-webui-failure.png %}" alt="Playground Flink WebUI" 
 class="offset" width="100%" />
 
 <p style="border-radius: 5px; padding: 5px" class="bg-info">
-  <b>Note</b>: Even though the tasks of the job are in SCHEDULED state and not RUNNING yet, the overall 
-  status of a Job is shown as RUNNING.
+  <b>注意</b>:虽然 Job 的所有任务都处于 SCHEDULED 状态,但整个 Job 的状态却显示为 RUNNING。
 </p>
 
-At this point, the tasks of the Job cannot move from the `SCHEDULED` state to `RUNNING` because there
-are no resources (TaskSlots provided by TaskManagers) to the run the tasks.
-Until a new TaskManager becomes available, the Job will go through a cycle of cancellations and resubmissions.
+此时,由于 TaskManager 提供的 TaskSlots 资源不够用,Job 的所有任务都不能成功转为 
+`RUNNING` 状态,直到有新的 TaskManager 可用。在此之前,该 Job 将经历一个取消和重新提交
+不断循环的过程。
 
-In the meantime, the data generator keeps pushing `ClickEvent`s into the *input* topic. This is 
-similar to a real production setup where data is produced while the Job to process it is down.
+与此同时,数据生成器 (data generator) 一直不断地往 *input* topic 中生成 `ClickEvent` 事件,在生产环境中也经常出现这种 Job 挂掉但源头还在不断产生数据的情况。
 
-#### Step 3: Recovery
+#### Step 3: 失败恢复
 
-Once you restart the TaskManager, it reconnects to the JobManager.
+一旦 TaskManager 重启成功,它将会重新连接到 JobManager。
 
 {% highlight bash%}
 docker-compose up -d taskmanager
 {% endhighlight %}
 
-When the JobManager is notified about the new TaskManager, it schedules the tasks of the 
-recovering Job to the newly available TaskSlots. Upon restart, the tasks recover their state from
-the last successful [checkpoint]({{ site.baseurl }}/learn-flink/fault_tolerance.html) that was taken
-before the failure and switch to the `RUNNING` state.
+当 TaskManager 注册成功后,JobManager 就会将处于 `SCHEDULED` 状态的所有任务调度到该 TaskManager 
+的可用 TaskSlots 中运行,此时所有的任务将会从失败前最近一次成功的 
+[checkpoint]({%link learn-flink/fault_tolerance.zh.md %}) 进行恢复,
+一旦恢复成功,它们的状态将转变为 `RUNNING`。
 
-The Job will quickly process the full backlog of input events (accumulated during the outage) 
-from Kafka and produce output at a much higher rate (> 24 records/minute) until it reaches 
-the head of the stream. In the *output* you will see that all keys (`page`s) are present for all time 
-windows and that every count is exactly one thousand. Since we are using the 
-[FlinkKafkaProducer]({{ site.baseurl }}/dev/connectors/kafka.html#kafka-producers-and-fault-tolerance)
-in its "at-least-once" mode, there is a chance that you will see some duplicate output records.
+接下来该 Job 将快速处理 Kafka input 事件的全部积压(在 Job 中断期间累积的数据),
+并以更快的速度(>24 条记录/分钟)产生输出,直到它追上 kafka 的 lag 延迟为止。
+此时观察 *output* topic 输出,
+你会看到在每一个时间窗口中都有按 `page` 进行分组的记录,而且计数刚好是 1000。
+由于我们使用的是 [FlinkKafkaProducer]({%link dev/connectors/kafka.zh.md %}#kafka-producers-and-fault-tolerance) "至少一次"模式,因此你可能会看到一些记录重复输出多次。
 
 <p style="border-radius: 5px; padding: 5px" class="bg-info">
-  <b>Note</b>: Most production setups rely on a resource manager (Kubernetes, Yarn, Mesos) to
-  automatically restart failed processes.
+  <b>注意</b>:在大部分生产环境中都需要一个资源管理器 (Kubernetes、Yarn,、Mesos)对
+  失败的 Job 进行自动重启。
 </p>
 
-### Upgrading & Rescaling a Job
+### Job 升级与扩容

Review comment:
      It is recommended to add an `<a>` tag here as well.

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -794,35 +764,22 @@ curl localhost:8081/jobs/<jod-id>
 }
 {% endhighlight %}
 
-Please consult the [REST API reference]({{ site.baseurl }}/monitoring/rest_api.html#api)
-for a complete list of possible queries including how to query metrics of different scopes (e.g. 
-TaskManager metrics);
+请查阅 [REST API 参考]({%link monitoring/rest_api.zh.md %}#api),该参考上有完整的指标查询接口信息,包括如何查询不同种类的指标(例如 TaskManager 指标)。
 
 {%  top %}
 
-## Variants
+## 延伸拓展
 
-You might have noticed that the *Click Event Count* application was always started with `--checkpointing` 
-and `--event-time` program arguments. By omitting these in the command of the *client* container in the 
-`docker-compose.yaml`, you can change the behavior of the Job.
+你可能已经注意到了,*Click Event Count* 这个 Job 在启动时总是会带上 `--checkpointing` 和 `--event-time` 两个参数,
+如果我们去除这两个参数,那么 Job 的行为也会随之改变。
 
-* `--checkpointing` enables [checkpoint]({{ site.baseurl }}/learn-flink/fault_tolerance.html), 
-which is Flink's fault-tolerance mechanism. If you run without it and go through 
-[failure and recovery](#observing-failure--recovery), you should will see that data is actually 
-lost.
+* `--checkpointing` 参数开启了 [checkpoint]({%link learn-flink/fault_tolerance.zh.md %}) 配置,savepoint 是 Flink 容错机制的重要保证。

Review comment:
       `savepoint 是 Flink 容错机制的重要保证` -> `checkpoint 是 Flink 容错机制的重要保证`

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -794,35 +764,22 @@ curl localhost:8081/jobs/<jod-id>
 }
 {% endhighlight %}
 
-Please consult the [REST API reference]({{ site.baseurl }}/monitoring/rest_api.html#api)
-for a complete list of possible queries including how to query metrics of different scopes (e.g. 
-TaskManager metrics);
+请查阅 [REST API 参考]({%link monitoring/rest_api.zh.md %}#api),该参考上有完整的指标查询接口信息,包括如何查询不同种类的指标(例如 TaskManager 指标)。
 
 {%  top %}
 
-## Variants
+## 延伸拓展

Review comment:
      It is also recommended to add the `<a>` tag for this heading.

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -517,44 +490,42 @@ docker-compose run --no-deps client curl -X POST -H "Expect:" \
 
 {% endhighlight %}
 
-**Request**
+**请求**
 {% highlight bash %}
-# Submitting the Job
+# 提交 Job
 curl -X POST http://localhost:8081/jars/<jar-id>/run \
   -d '{"parallelism": 3, "programArgs": "--bootstrap.servers kafka:9092 --checkpointing --event-time", "savepointPath": "<savepoint-path>"}'
 {% endhighlight %}
-**Expected Response (pretty-printed**
+**预期响应 (结果已格式化)**
 {% highlight json %}
 {
   "jobid": "<job-id>"
 }
 {% endhighlight %}
 </div>
 </div>
-Now, the Job has been resubmitted, but it will not start as there are not enough TaskSlots to
-execute it with the increased parallelism (2 available, 3 needed). With
+现在 Job 已重新提交,但由于我们提高了并行度所以导致 TaskSlots 不够用(1 个 TaskSlot 可用,总共需要 3 个),最终 Job 会重启失败。通过如下命令:
 {% highlight bash %}
 docker-compose scale taskmanager=2
 {% endhighlight %}
-you can add a second TaskManager with two TaskSlots to the Flink Cluster, which will automatically register with the 
-JobManager. Shortly after adding the TaskManager the Job should start running again.
+你可以向 Flink 集群添加第二个 TaskManager(为 Flink 集群提供 2 个 TaskSlots 资源),
+它会自动向 JobManager 注册,TaskManager 注册完成后,Job 会再次处于 "RUNNING" 状态。
 
-Once the Job is "RUNNING" again, you will see in the *output* Topic that no data was lost during 
-rescaling: all windows are present with a count of exactly one thousand.
+一旦 Job 再次运行起来,从 *output* Topic 的输出中你会看到在扩容期间数据依然没有丢失:
+所有窗口的计数都正好是 1000。
 
-### Querying the Metrics of a Job
+### 查询 Job 指标

Review comment:
      It is recommended to add an `<a>` tag here as well.

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -460,55 +435,53 @@ docker-compose run --no-deps client curl -X POST -H "Expect:" \
 
 {% endhighlight %}
 
-**Request**
+**请求**
 {% highlight bash %}
-# Submitting the Job
+# 提交 Job
 curl -X POST http://localhost:8081/jars/<jar-id>/run \
   -d '{"programArgs": "--bootstrap.servers kafka:9092 --checkpointing --event-time", "savepointPath": "<savepoint-path>"}'
 {% endhighlight %}
-**Expected Response (pretty-printed)**
+**预期响应 (结果已格式化)**
 {% highlight json %}
 {
   "jobid": "<job-id>"
 }
 {% endhighlight %}
 </div>
 </div>
+ 
+一旦该 Job 再次处于 `RUNNING` 状态,你将从 *output* Topic 中看到数据在快速输出,
+因为刚启动的 Job 正在处理停止期间积压的大量数据。另外,你还会看到在升级期间
+没有产生任何数据丢失:所有窗口都在输出 1000。
 
-Once the Job is `RUNNING` again, you will see in the *output* Topic that records are produced at a 
-higher rate while the Job is processing the backlog accumulated during the outage. Additionally, 
-you will see that no data was lost during the upgrade: all windows are present with a count of 
-exactly one thousand. 
-
-#### Step 2b: Restart Job with a Different Parallelism (Rescaling)
+#### Step 2b: 重启 Job (修改并行度)

Review comment:
      It is recommended to add an `<a>` tag here as well.

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -423,35 +399,34 @@ curl -X POST localhost:8081/jobs/<job-id>/stop -d '{"drain": false}'
 </div>
 </div>
 
-#### Step 2a: Restart Job without Changes
+#### Step 2a: 重启 Job (不作任何变更)

Review comment:
      It is recommended to add an `<a>` tag here as well.

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -252,135 +239,124 @@ curl localhost:8081/jobs
 </div>
 </div>
 
-The JobID is assigned to a Job upon submission and is needed to perform actions on the Job via the 
-CLI or REST API.
+一旦 Job 提交,Flink 会默认为其生成一个 JobID,后续对该 Job 的
+所有操作(无论是通过 CLI 还是 REST API)都需要带上 JobID。
+<a name="observing-failure--recovery"></a>
 
-### Observing Failure & Recovery
+### Job 失败与恢复
 
-Flink provides exactly-once processing guarantees under (partial) failure. In this playground you 
-can observe and - to some extent - verify this behavior. 
+在 Job (部分)失败的情况下,Flink 对事件处理依然能够提供精确一次的保障,
+在本节中你将会观察到并能够在某种程度上验证这种行为。 
 
-#### Step 1: Observing the Output
+#### Step 1: 观察输出
 
-As described [above](#anatomy-of-this-playground), the events in this playground are generate such 
-that each window  contains exactly one thousand records. So, in order to verify that Flink 
-successfully recovers from a TaskManager failure without data loss or duplication you can tail the 
-output topic and check that - after recovery - all windows are present and the count is correct.
+如[前文](#anatomy-of-this-playground)所述,事件以特定速率生成,刚好使得每个统计窗口都包含确切的 1000 条记录。
+因此,你可以实时查看 output topic 的输出,确定失败恢复后所有的窗口依然输出正确的统计数字,
+以此来验证 Flink 在 TaskManager 失败时能够成功恢复,而且不丢失数据、不产生数据重复。
 
-For this, start reading from the *output* topic and leave this command running until after 
-recovery (Step 3).
+为此,通过控制台命令消费 *output* topic,保持消费直到 Job 从失败中恢复 (Step 3)。
 
 {% highlight bash%}
 docker-compose exec kafka kafka-console-consumer.sh \
   --bootstrap-server localhost:9092 --topic output
 {% endhighlight %}
 
-#### Step 2: Introducing a Fault
+#### Step 2: 模拟失败
 
-In order to simulate a partial failure you can kill a TaskManager. In a production setup, this 
-could correspond to a loss of the TaskManager process, the TaskManager machine or simply a transient 
-exception being thrown from the framework or user code (e.g. due to the temporary unavailability of 
-an external resource).   
+为了模拟部分失败故障,你可以 kill 掉一个 TaskManager,这种失败行为在生产环境中就相当于 
+TaskManager 进程挂掉、TaskManager 机器宕机或者从框架或用户代码中抛出的一个临时异常(例如,由于外部资源暂时不可用)而导致的失败。   
 
 {% highlight bash%}
 docker-compose kill taskmanager
 {% endhighlight %}
 
-After a few seconds, the JobManager will notice the loss of the TaskManager, cancel the affected Job, and 
-immediately resubmit it for recovery.
-When the Job gets restarted, its tasks remain in the `SCHEDULED` state, which is indicated by the 
-purple colored squares (see screenshot below).
+几秒钟后,JobManager 就会感知到 TaskManager 已失联,接下来它会
+取消 Job 运行并且立即重新提交该 Job 以进行恢复。
+当 Job 重启后,所有的任务都会处于 `SCHEDULED` 状态,如以下截图中紫色方格所示:
 
-<img src="{{ site.baseurl }}/fig/playground-webui-failure.png" alt="Playground Flink WebUI" 
+<img src="{%link fig/playground-webui-failure.png %}" alt="Playground Flink WebUI" 
 class="offset" width="100%" />
 
 <p style="border-radius: 5px; padding: 5px" class="bg-info">
-  <b>Note</b>: Even though the tasks of the job are in SCHEDULED state and not RUNNING yet, the overall 
-  status of a Job is shown as RUNNING.
+  <b>注意</b>:虽然 Job 的所有任务都处于 SCHEDULED 状态,但整个 Job 的状态却显示为 RUNNING。
 </p>
 
-At this point, the tasks of the Job cannot move from the `SCHEDULED` state to `RUNNING` because there
-are no resources (TaskSlots provided by TaskManagers) to the run the tasks.
-Until a new TaskManager becomes available, the Job will go through a cycle of cancellations and resubmissions.
+此时,由于 TaskManager 提供的 TaskSlots 资源不够用,Job 的所有任务都不能成功转为 
+`RUNNING` 状态,直到有新的 TaskManager 可用。在此之前,该 Job 将经历一个取消和重新提交
+不断循环的过程。
 
-In the meantime, the data generator keeps pushing `ClickEvent`s into the *input* topic. This is 
-similar to a real production setup where data is produced while the Job to process it is down.
+与此同时,数据生成器 (data generator) 一直不断地往 *input* topic 中生成 `ClickEvent` 事件,在生产环境中也经常出现这种 Job 挂掉但源头还在不断产生数据的情况。
 
-#### Step 3: Recovery
+#### Step 3: 失败恢复
 
-Once you restart the TaskManager, it reconnects to the JobManager.
+一旦 TaskManager 重启成功,它将会重新连接到 JobManager。
 
 {% highlight bash%}
 docker-compose up -d taskmanager
 {% endhighlight %}
 
-When the JobManager is notified about the new TaskManager, it schedules the tasks of the 
-recovering Job to the newly available TaskSlots. Upon restart, the tasks recover their state from
-the last successful [checkpoint]({{ site.baseurl }}/learn-flink/fault_tolerance.html) that was taken
-before the failure and switch to the `RUNNING` state.
+当 TaskManager 注册成功后,JobManager 就会将处于 `SCHEDULED` 状态的所有任务调度到该 TaskManager 
+的可用 TaskSlots 中运行,此时所有的任务将会从失败前最近一次成功的 
+[checkpoint]({%link learn-flink/fault_tolerance.zh.md %}) 进行恢复,
+一旦恢复成功,它们的状态将转变为 `RUNNING`。
 
-The Job will quickly process the full backlog of input events (accumulated during the outage) 
-from Kafka and produce output at a much higher rate (> 24 records/minute) until it reaches 
-the head of the stream. In the *output* you will see that all keys (`page`s) are present for all time 
-windows and that every count is exactly one thousand. Since we are using the 
-[FlinkKafkaProducer]({{ site.baseurl }}/dev/connectors/kafka.html#kafka-producers-and-fault-tolerance)
-in its "at-least-once" mode, there is a chance that you will see some duplicate output records.
+接下来该 Job 将快速处理 Kafka input 事件的全部积压(在 Job 中断期间累积的数据),
+并以更快的速度(>24 条记录/分钟)产生输出,直到它追上 kafka 的 lag 延迟为止。
+此时观察 *output* topic 输出,
+你会看到在每一个时间窗口中都有按 `page` 进行分组的记录,而且计数刚好是 1000。
+由于我们使用的是 [FlinkKafkaProducer]({%link dev/connectors/kafka.zh.md %}#kafka-producers-and-fault-tolerance) "至少一次"模式,因此你可能会看到一些记录重复输出多次。
 
 <p style="border-radius: 5px; padding: 5px" class="bg-info">
-  <b>Note</b>: Most production setups rely on a resource manager (Kubernetes, Yarn, Mesos) to
-  automatically restart failed processes.
+  <b>注意</b>:在大部分生产环境中都需要一个资源管理器 (Kubernetes、Yarn,、Mesos)对
+  失败的 Job 进行自动重启。
 </p>
 
-### Upgrading & Rescaling a Job
+### Job 升级与扩容
 
-Upgrading a Flink Job always involves two steps: First, the Flink Job is gracefully stopped with a
-[Savepoint]({{ site.baseurl }}/ops/state/savepoints.html). A Savepoint is a consistent snapshot of 
-the complete application state at a well-defined, globally consistent point in time (similar to a 
-checkpoint). Second, the upgraded Flink Job is started from the Savepoint. In this context "upgrade" 
-can mean different things including the following:
+升级 Flink 作业一般都需要两步:第一,使用 [Savepoint]({%link ops/state/savepoints.zh.md %}) 优雅地停止 Flink Job。
+Savepoint 是整个应用程序状态的一次快照(类似于 checkpoint ),该快照是在一个明确定义的、全局一致的时间点生成的。第二,从 Savepoint 恢复启动待升级的 Flink Job。
+在此,“升级”包含如下几种含义:
 
-* An upgrade to the configuration (incl. the parallelism of the Job)
-* An upgrade to the topology of the Job (added/removed Operators)
-* An upgrade to the user-defined functions of the Job
+* 配置升级(比如 Job 并行度修改)
+* Job 拓扑升级(比如添加或者删除算子)
+* Job 的用户自定义函数升级
 
-Before starting with the upgrade you might want to start tailing the *output* topic, in order to 
-observe that no data is lost or corrupted in the course the upgrade. 
+在开始升级之前,你可能需要实时查看 *Output* topic 输出,
+以便观察在升级过程中没有数据丢失或损坏。
 
 {% highlight bash%}
 docker-compose exec kafka kafka-console-consumer.sh \
   --bootstrap-server localhost:9092 --topic output
 {% endhighlight %}
 
-#### Step 1: Stopping the Job
+#### Step 1: 停止 Job

Review comment:
      It is recommended to add an `<a>` tag here as well.

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -252,135 +239,124 @@ curl localhost:8081/jobs
 </div>
 </div>
 
-The JobID is assigned to a Job upon submission and is needed to perform actions on the Job via the 
-CLI or REST API.
+一旦 Job 提交,Flink 会默认为其生成一个 JobID,后续对该 Job 的
+所有操作(无论是通过 CLI 还是 REST API)都需要带上 JobID。
+<a name="observing-failure--recovery"></a>
 
-### Observing Failure & Recovery
+### Job 失败与恢复
 
-Flink provides exactly-once processing guarantees under (partial) failure. In this playground you 
-can observe and - to some extent - verify this behavior. 
+在 Job (部分)失败的情况下,Flink 对事件处理依然能够提供精确一次的保障,
+在本节中你将会观察到并能够在某种程度上验证这种行为。 
 
-#### Step 1: Observing the Output
+#### Step 1: 观察输出
 
-As described [above](#anatomy-of-this-playground), the events in this playground are generate such 
-that each window  contains exactly one thousand records. So, in order to verify that Flink 
-successfully recovers from a TaskManager failure without data loss or duplication you can tail the 
-output topic and check that - after recovery - all windows are present and the count is correct.
+如[前文](#anatomy-of-this-playground)所述,事件以特定速率生成,刚好使得每个统计窗口都包含确切的 1000 条记录。
+因此,你可以实时查看 output topic 的输出,确定失败恢复后所有的窗口依然输出正确的统计数字,
+以此来验证 Flink 在 TaskManager 失败时能够成功恢复,而且不丢失数据、不产生数据重复。
 
-For this, start reading from the *output* topic and leave this command running until after 
-recovery (Step 3).
+为此,通过控制台命令消费 *output* topic,保持消费直到 Job 从失败中恢复 (Step 3)。
 
 {% highlight bash%}
 docker-compose exec kafka kafka-console-consumer.sh \
   --bootstrap-server localhost:9092 --topic output
 {% endhighlight %}
 
-#### Step 2: Introducing a Fault
+#### Step 2: 模拟失败
 
-In order to simulate a partial failure you can kill a TaskManager. In a production setup, this 
-could correspond to a loss of the TaskManager process, the TaskManager machine or simply a transient 
-exception being thrown from the framework or user code (e.g. due to the temporary unavailability of 
-an external resource).   
+为了模拟部分失败故障,你可以 kill 掉一个 TaskManager,这种失败行为在生产环境中就相当于 
+TaskManager 进程挂掉、TaskManager 机器宕机或者从框架或用户代码中抛出的一个临时异常(例如,由于外部资源暂时不可用)而导致的失败。   
 
 {% highlight bash%}
 docker-compose kill taskmanager
 {% endhighlight %}
 
-After a few seconds, the JobManager will notice the loss of the TaskManager, cancel the affected Job, and 
-immediately resubmit it for recovery.
-When the Job gets restarted, its tasks remain in the `SCHEDULED` state, which is indicated by the 
-purple colored squares (see screenshot below).
+几秒钟后,JobManager 就会感知到 TaskManager 已失联,接下来它会
+取消 Job 运行并且立即重新提交该 Job 以进行恢复。
+当 Job 重启后,所有的任务都会处于 `SCHEDULED` 状态,如以下截图中紫色方格所示:
 
-<img src="{{ site.baseurl }}/fig/playground-webui-failure.png" alt="Playground Flink WebUI" 
+<img src="{%link fig/playground-webui-failure.png %}" alt="Playground Flink WebUI" 
 class="offset" width="100%" />
 
 <p style="border-radius: 5px; padding: 5px" class="bg-info">
-  <b>Note</b>: Even though the tasks of the job are in SCHEDULED state and not RUNNING yet, the overall 
-  status of a Job is shown as RUNNING.
+  <b>注意</b>:虽然 Job 的所有任务都处于 SCHEDULED 状态,但整个 Job 的状态却显示为 RUNNING。
 </p>
 
-At this point, the tasks of the Job cannot move from the `SCHEDULED` state to `RUNNING` because there
-are no resources (TaskSlots provided by TaskManagers) to the run the tasks.
-Until a new TaskManager becomes available, the Job will go through a cycle of cancellations and resubmissions.
+此时,由于 TaskManager 提供的 TaskSlots 资源不够用,Job 的所有任务都不能成功转为 
+`RUNNING` 状态,直到有新的 TaskManager 可用。在此之前,该 Job 将经历一个取消和重新提交
+不断循环的过程。
 
-In the meantime, the data generator keeps pushing `ClickEvent`s into the *input* topic. This is 
-similar to a real production setup where data is produced while the Job to process it is down.
+与此同时,数据生成器 (data generator) 一直不断地往 *input* topic 中生成 `ClickEvent` 事件,在生产环境中也经常出现这种 Job 挂掉但源头还在不断产生数据的情况。
 
-#### Step 3: Recovery
+#### Step 3: 失败恢复

Review comment:
      It is recommended to add an `<a>` tag here as well.

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -252,135 +239,124 @@ curl localhost:8081/jobs
 </div>
 </div>
 
-The JobID is assigned to a Job upon submission and is needed to perform actions on the Job via the 
-CLI or REST API.
+一旦 Job 提交,Flink 会默认为其生成一个 JobID,后续对该 Job 的
+所有操作(无论是通过 CLI 还是 REST API)都需要带上 JobID。
+<a name="observing-failure--recovery"></a>
 
-### Observing Failure & Recovery
+### Job 失败与恢复

Review comment:
      It is recommended to add an `<a>` tag here as well.

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -252,135 +239,124 @@ curl localhost:8081/jobs
 </div>
 </div>
 
-The JobID is assigned to a Job upon submission and is needed to perform actions on the Job via the 
-CLI or REST API.
+一旦 Job 提交,Flink 会默认为其生成一个 JobID,后续对该 Job 的
+所有操作(无论是通过 CLI 还是 REST API)都需要带上 JobID。
+<a name="observing-failure--recovery"></a>
 
-### Observing Failure & Recovery
+### Job 失败与恢复
 
-Flink provides exactly-once processing guarantees under (partial) failure. In this playground you 
-can observe and - to some extent - verify this behavior. 
+在 Job (部分)失败的情况下,Flink 对事件处理依然能够提供精确一次的保障,
+在本节中你将会观察到并能够在某种程度上验证这种行为。 
 
-#### Step 1: Observing the Output
+#### Step 1: 观察输出
 
-As described [above](#anatomy-of-this-playground), the events in this playground are generate such 
-that each window  contains exactly one thousand records. So, in order to verify that Flink 
-successfully recovers from a TaskManager failure without data loss or duplication you can tail the 
-output topic and check that - after recovery - all windows are present and the count is correct.
+如[前文](#anatomy-of-this-playground)所述,事件以特定速率生成,刚好使得每个统计窗口都包含确切的 1000 条记录。
+因此,你可以实时查看 output topic 的输出,确定失败恢复后所有的窗口依然输出正确的统计数字,
+以此来验证 Flink 在 TaskManager 失败时能够成功恢复,而且不丢失数据、不产生数据重复。
 
-For this, start reading from the *output* topic and leave this command running until after 
-recovery (Step 3).
+为此,通过控制台命令消费 *output* topic,保持消费直到 Job 从失败中恢复 (Step 3)。
 
 {% highlight bash%}
 docker-compose exec kafka kafka-console-consumer.sh \
   --bootstrap-server localhost:9092 --topic output
 {% endhighlight %}
 
-#### Step 2: Introducing a Fault
+#### Step 2: 模拟失败

Review comment:
      It is recommended to add an `<a>` tag here as well.

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -252,135 +239,124 @@ curl localhost:8081/jobs
 </div>
 </div>
 
-The JobID is assigned to a Job upon submission and is needed to perform actions on the Job via the 
-CLI or REST API.
+一旦 Job 提交,Flink 会默认为其生成一个 JobID,后续对该 Job 的
+所有操作(无论是通过 CLI 还是 REST API)都需要带上 JobID。
+<a name="observing-failure--recovery"></a>
 
-### Observing Failure & Recovery
+### Job 失败与恢复
 
-Flink provides exactly-once processing guarantees under (partial) failure. In this playground you 
-can observe and - to some extent - verify this behavior. 
+在 Job (部分)失败的情况下,Flink 对事件处理依然能够提供精确一次的保障,
+在本节中你将会观察到并能够在某种程度上验证这种行为。 
 
-#### Step 1: Observing the Output
+#### Step 1: 观察输出

Review comment:
       It's suggested to add an `<a>` tag here as well.

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -209,22 +197,21 @@ docker-compose exec kafka kafka-console-consumer.sh \
 
 {%  top %}
 
-## Time to Play!
+## 核心特性探索
 
-Now that you learned how to interact with Flink and the Docker containers, let's have a look at 
-some common operational tasks that you can try out on our playground.
-All of these tasks are independent of each other, i.e. you can perform them in any order. 
-Most tasks can be executed via the [CLI](#flink-cli) and the [REST API](#flink-rest-api).
+到目前为止,你已经学习了如何与 Flink 以及 Docker 容器进行交互,现在让我们看一些常用的操作命令。
+本节中的各部分命令不需要按任何特定的顺序执行,这些命令大部分都可以通过 [CLI](#flink-cli) 或 [RESTAPI](#flink-rest-api) 执行。
+<a name="listing-running-jobs"></a>
 
-### Listing Running Jobs
+### 获取所有运行中的 Job

Review comment:
       It's suggested to add an `<a>` tag here as well.
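       Relatedly, a sketch of the CLI variant of the listing described in this section (assuming the client container ships the Flink CLI, as elsewhere in this playground):

   ```bash
   # List running and scheduled jobs with the Flink CLI
   docker-compose run --no-deps client flink list
   ```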

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -209,22 +197,21 @@ docker-compose exec kafka kafka-console-consumer.sh \
 
 {%  top %}
 
-## Time to Play!
+## 核心特性探索

Review comment:
       It's suggested to add an `<a>` tag here as well.

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -120,83 +111,80 @@ operations-playground_taskmanager_1            /docker-entrypoint.sh task ...
 operations-playground_zookeeper_1              /bin/sh -c /usr/sbin/sshd  ...   Up       2181/tcp, 22/tcp, 2888/tcp, 3888/tcp
 {% endhighlight %}
 
-This indicates that the client container has successfully submitted the Flink Job (`Exit 0`) and all 
-cluster components as well as the data generator are running (`Up`).
+从上面的信息可以看出 client 容器已成功提交了 Flink Job (`Exit 0`),
+同时包含数据生成器在内的所有集群组件都处于运行中状态 (`Up`)。
 
-You can stop the playground environment by calling:
+你可以执行如下命令停止 docker 环境:
 
 {% highlight bash %}
 docker-compose down -v
 {% endhighlight %}
 
-## Entering the Playground
+## 环境讲解
 
-There are many things you can try and check out in this playground. In the following two sections we 
-will show you how to interact with the Flink Cluster and demonstrate some of Flink's key features.
+在这个搭建好的环境中你可以尝试和验证很多事情,在下面的两个部分中我们将向你展示如何与 Flink 集群进行交互以及演示并讲解 Flink 的一些核心特性。
 
-### Flink WebUI
+### Flink WebUI 界面
 
-The most natural starting point to observe your Flink Cluster is the WebUI exposed under 
-[http://localhost:8081](http://localhost:8081). If everything went well, you'll see that the cluster initially consists of 
-one TaskManager and executes a Job called *Click Event Count*.
+观察Flink集群首先想到的就是 Flink WebUI 界面:打开浏览器并访问 
+[http://localhost:8081](http://localhost:8081),如果一切正常,你将会在界面上看到一个 TaskManager 
+和一个处于 "RUNNING" 状态的名为 *Click Event Count* 的 Job。
 
-<img src="{{ site.baseurl }}/fig/playground-webui.png" alt="Playground Flink WebUI"
+<img src="{%link fig/playground-webui.png %}" alt="Playground Flink WebUI"
 class="offset" width="100%" />
 
-The Flink WebUI contains a lot of useful and interesting information about your Flink Cluster and 
-its Jobs (JobGraph, Metrics, Checkpointing Statistics, TaskManager Status,...). 
+Flink WebUI 界面包含许多关于 Flink 集群以及运行在其上的 Jobs 的有用信息,比如:JobGraph、Metrics、Checkpointing Statistics、TaskManager Status 等等。 
 
-### Logs
+### 日志

Review comment:
       It's suggested to add an `<a>` tag here as well.
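       As a usage note for the 日志 section above, a sketch of tailing the logs in this setup (the service names `jobmanager` and `taskmanager` are assumed from the compose file):

   ```bash
   # Follow the JobManager logs
   docker-compose logs -f jobmanager
   # Follow a TaskManager's logs
   docker-compose logs -f taskmanager
   ```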

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -120,83 +111,80 @@ operations-playground_taskmanager_1            /docker-entrypoint.sh task ...
 operations-playground_zookeeper_1              /bin/sh -c /usr/sbin/sshd  ...   Up       2181/tcp, 22/tcp, 2888/tcp, 3888/tcp
 {% endhighlight %}
 
-This indicates that the client container has successfully submitted the Flink Job (`Exit 0`) and all 
-cluster components as well as the data generator are running (`Up`).
+从上面的信息可以看出 client 容器已成功提交了 Flink Job (`Exit 0`),
+同时包含数据生成器在内的所有集群组件都处于运行中状态 (`Up`)。
 
-You can stop the playground environment by calling:
+你可以执行如下命令停止 docker 环境:
 
 {% highlight bash %}
 docker-compose down -v
 {% endhighlight %}
 
-## Entering the Playground
+## 环境讲解
 
-There are many things you can try and check out in this playground. In the following two sections we 
-will show you how to interact with the Flink Cluster and demonstrate some of Flink's key features.
+在这个搭建好的环境中你可以尝试和验证很多事情,在下面的两个部分中我们将向你展示如何与 Flink 集群进行交互以及演示并讲解 Flink 的一些核心特性。
 
-### Flink WebUI
+### Flink WebUI 界面

Review comment:
       It's suggested to add an `<a>` tag here as well.

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -23,80 +23,71 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-There are many ways to deploy and operate Apache Flink in various environments. Regardless of this
-variety, the fundamental building blocks of a Flink Cluster remain the same, and similar
-operational principles apply.
+Apache Flink 可以以多种方式在不同的环境中部署,抛开这种多样性而言,Flink 集群的基本构建方式和操作原则仍然是相同的。
 
-In this playground, you will learn how to manage and run Flink Jobs. You will see how to deploy and 
-monitor an application, experience how Flink recovers from Job failure, and perform everyday 
-operational tasks like upgrades and rescaling.
+在这篇文章里,你将会学习如何管理和运行 Flink 任务,了解如何部署和监控应用程序、Flink 如何从失败作业中进行恢复,同时你还会学习如何执行一些日常操作任务,如升级和扩容。
 
 {% if site.version contains "SNAPSHOT" %}
 <p style="border-radius: 5px; padding: 5px" class="bg-danger">
   <b>
-  NOTE: The Apache Flink Docker images used for this playground are only available for
-  released versions of Apache Flink.
+  注意:本文中使用的 Apache Flink Docker 镜像仅适用于 Apache Flink 发行版。
   </b><br>
-  Since you are currently looking at the latest SNAPSHOT
-  version of the documentation, all version references below will not work.
-  Please switch the documentation to the latest released version via the release picker which you
-  find on the left side below the menu.
+  由于你目前正在浏览快照版的文档,因此下文中引用的分支可能已经不存在了,请先通过左侧菜单下方的版本选择器切换到发行版文档再查看。
 </p>
 {% endif %}
 
 * This will be replaced by the TOC
 {:toc}
+<a name="anatomy-of-this-playground"></a>
 
-## Anatomy of this Playground
+## 场景说明
 
-This playground consists of a long living
-[Flink Session Cluster]({{ site.baseurl }}/concepts/glossary.html#flink-session-cluster) and a Kafka
-Cluster.
+这篇文章中的所有操作都是基于如下两个集群进行的: 
+[Flink Session Cluster]({%link concepts/glossary.zh.md %}#flink-session-cluster) 以及一个 Kafka 集群,
+我们会在下文带领大家一起搭建这两个集群。
 
-A Flink Cluster always consists of a 
-[JobManager]({{ site.baseurl }}/concepts/glossary.html#flink-jobmanager) and one or more 
-[Flink TaskManagers]({{ site.baseurl }}/concepts/glossary.html#flink-taskmanager). The JobManager 
-is responsible for handling [Job]({{ site.baseurl }}/concepts/glossary.html#flink-job) submissions, 
-the supervision of Jobs as well as resource management. The Flink TaskManagers are the worker 
-processes and are responsible for the execution of the actual 
-[Tasks]({{ site.baseurl }}/concepts/glossary.html#task) which make up a Flink Job. In this 
-playground you will start with a single TaskManager, but scale out to more TaskManagers later. 
-Additionally, this playground comes with a dedicated *client* container, which we use to submit the 
-Flink Job initially and to perform various operational tasks later on. The *client* container is not
-needed by the Flink Cluster itself but only included for ease of use.
+一个 Flink 集群总是包含一个 
+[JobManager]({%link concepts/glossary.zh.md %}#flink-jobmanager) 以及一个或多个 
+[Flink TaskManager]({%link concepts/glossary.zh.md %}#flink-taskmanager)。JobManager 
+负责处理 [Job]({%link concepts/glossary.zh.md %}#flink-job) 提交、
+Job 监控以及资源管理。Flink TaskManager 运行 worker 进程,
+负责实际任务 
+[Tasks]({%link concepts/glossary.zh.md %}#task) 的执行,而这些任务共同组成了一个 Flink Job。 在这篇文章中,
+我们会先运行一个 TaskManager,接下来会扩容到多个 TaskManager。 
+另外,这里我们会专门使用一个 *client* 容器来提交 Flink Job,
+后续还会使用该容器执行一些操作任务。需要注意的是,Flink 集群的运行并不需要依赖 *client* 容器,
+我们这里引入只是为了使用方便。
 
-The Kafka Cluster consists of a Zookeeper server and a Kafka Broker.
+这里的 Kafka 集群由一个 Zookeeper 服务端和一个 Kafka Broker 组成。
 
-<img src="{{ site.baseurl }}/fig/flink-docker-playground.svg" alt="Flink Docker Playground"
+<img src="{%link fig/flink-docker-playground.svg %}" alt="Flink Docker Playground"
 class="offset" width="80%" />
 
-When the playground is started a Flink Job called *Flink Event Count* will be submitted to the 
-JobManager. Additionally, two Kafka Topics *input* and *output* are created.
+一开始,我们会往 JobManager 提交一个名为 *Flink 事件计数* 的 Job,此外,我们还创建了两个 Kafka Topic:*input* 和 *output*。
 
-<img src="{{ site.baseurl }}/fig/click-event-count-example.svg" alt="Click Event Count Example"
+<img src="{%link fig/click-event-count-example.svg %}" alt="Click Event Count Example"
 class="offset" width="80%" />
 
-The Job consumes `ClickEvent`s from the *input* topic, each with a `timestamp` and a `page`. The 
-events are then keyed by `page` and counted in 15 second
-[windows]({{ site.baseurl }}/dev/stream/operators/windows.html). The results are written to the 
-*output* topic. 
+该 Job 负责从 *input* topic 消费点击事件 `ClickEvent`,每个点击事件都包含一个 `timestamp` 和一个 `page` 属性。
+这些事件将按照 `page` 属性进行分组,然后按照每 15s 窗口 [windows]({%link dev/stream/operators/windows.zh.md %}) 进行统计,
+最终结果输出到 *output* topic 中。
 
-There are six different pages and we generate 1000 click events per page and 15 seconds. Hence, the 
-output of the Flink job should show 1000 views per page and window.
+总共有 6 种不同的 page 属性,针对特定 page,我们会按照每 15s 产生 1000 个点击事件的速率生成数据。
+因此,针对特定 page,该 Flink job 应该能在每个窗口中输出 1000 个该 page 的点击数据。
 
 {% top %}
 
-## Starting the Playground
+## 环境搭建

Review comment:
       It's suggested to add an `<a>` tag here as well.

##########
File path: docs/try-flink/flink-operations-playground.zh.md
##########
@@ -120,83 +111,80 @@ operations-playground_taskmanager_1            /docker-entrypoint.sh task ...
 operations-playground_zookeeper_1              /bin/sh -c /usr/sbin/sshd  ...   Up       2181/tcp, 22/tcp, 2888/tcp, 3888/tcp
 {% endhighlight %}
 
-This indicates that the client container has successfully submitted the Flink Job (`Exit 0`) and all 
-cluster components as well as the data generator are running (`Up`).
+从上面的信息可以看出 client 容器已成功提交了 Flink Job (`Exit 0`),
+同时包含数据生成器在内的所有集群组件都处于运行中状态 (`Up`)。
 
-You can stop the playground environment by calling:
+你可以执行如下命令停止 docker 环境:
 
 {% highlight bash %}
 docker-compose down -v
 {% endhighlight %}
 
-## Entering the Playground
+## 环境讲解

Review comment:
       It's suggested to add an `<a>` tag here as well.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org