Posted to commits@flink.apache.org by ja...@apache.org on 2020/06/17 07:57:27 UTC

[flink] 01/02: [FLINK-17269][docs-zh] Translate new Training Overview to Chinese

This is an automated email from the ASF dual-hosted git repository.

jark pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/flink.git

commit 5d7ea5d793af3401b773a754d2b06ef448c0564d
Author: Yichao Yang <10...@qq.com>
AuthorDate: Sat May 30 19:31:45 2020 +0800

    [FLINK-17269][docs-zh] Translate new Training Overview to Chinese
    
    This closes #12311
---
 docs/concepts/index.zh.md    |  61 +++---------------
 docs/learn-flink/index.zh.md | 148 ++++++++++++-------------------------------
 2 files changed, 50 insertions(+), 159 deletions(-)

diff --git a/docs/concepts/index.zh.md b/docs/concepts/index.zh.md
index 54f7dfb..4169f6d 100644
--- a/docs/concepts/index.zh.md
+++ b/docs/concepts/index.zh.md
@@ -26,65 +26,24 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-The [Hands-on Training]({% link learn-flink/index.zh.md %}) explains the basic concepts
-of stateful and timely stream processing that underlie Flink's APIs, and provides examples of how
-these mechanisms are used in applications. Stateful stream processing is introduced in the context
-of [Data Pipelines & ETL]({% link learn-flink/etl.zh.md %}#stateful-transformations)
-and is further developed in the section on [Fault Tolerance]({% link learn-flink/fault_tolerance.zh.md %}). Timely stream processing is introduced in the section on
-[Streaming Analytics]({% link learn-flink/streaming_analytics.zh.md %}).
+The [Hands-on Training]({% link learn-flink/index.zh.md %}) section introduces the basic concepts of the stateful and timely stream processing that underlies Flink's APIs, and gives examples of how these mechanisms are used in Flink applications. The [Data Pipelines & ETL]({% link learn-flink/etl.zh.md %}#stateful-transformations) section introduces stateful stream processing, which is covered in more depth in the [Fault Tolerance]({% link learn-flink/fault_tolerance.zh.md %}) section, and the [Streaming Analytics]({% link learn-flink/streaming_analytics.zh.md %}) section introduces timely stream processing.
 
-This _Concepts in Depth_ section provides a deeper understanding of how Flink's architecture and runtime 
-implement these concepts.
+This chapter takes a deeper look at how Flink's distributed runtime architecture implements these concepts.
 
-## Flink's APIs
+## Flink's APIs
 
-Flink offers different levels of abstraction for developing streaming/batch applications.
+Flink offers different levels of abstraction for developing streaming/batch applications.
 
 <img src="{{ site.baseurl }}/fig/levels_of_abstraction.svg" alt="Programming levels of abstraction" class="offset" width="80%" />
 
-  - The lowest level abstraction simply offers **stateful and timely stream processing**. It is
-    embedded into the [DataStream API]({{ site.baseurl}}{% link
-    dev/datastream_api.zh.md %}) via the [Process Function]({{ site.baseurl }}{%
-    link dev/stream/operators/process_function.zh.md %}). It allows users to freely
-    process events from one or more streams, and provides consistent, fault tolerant
-    *state*. In addition, users can register event time and processing time
-    callbacks, allowing programs to realize sophisticated computations.
+  - The lowest-level abstraction in the Flink API is **stateful and timely stream processing**. It is implemented by the [Process Function]({{ site.baseurl }}{% link dev/stream/operators/process_function.zh.md %}), which the framework embeds into the [DataStream API]({{ site.baseurl}}{% link dev/datastream_api.zh.md %}) for us to use (see the sketch after this list). It allows users to freely process events (data) from one or more streams in their applications, and provides consistent, fault-tolerant *state*. In addition, users can register event-time and processing-time callbacks at this level of abstraction, allowing programs to realize sophisticated computations.
 
-  - In practice, many applications do not need the low-level
-    abstractions described above, and can instead program against the **Core APIs**: the
-    [DataStream API]({% link dev/datastream_api.zh.md %})
-    (bounded/unbounded streams) and the [DataSet API]({% link
-    dev/batch/index.zh.md %}) (bounded data sets). These fluent APIs offer the
-    common building blocks for data processing, like various forms of
-    user-specified transformations, joins, aggregations, windows, state, etc.
-    Data types processed in these APIs are represented as classes in the
-    respective programming languages.
+  - The second level of abstraction in the Flink API is the **Core APIs**. In practice, many applications do not need the lowest-level abstraction described above, and can instead program against the **Core APIs**: the [DataStream API]({% link dev/datastream_api.zh.md %}) (for bounded/unbounded stream scenarios) and the [DataSet API]({% link dev/batch/index.zh.md %}) (for bounded data set scenarios). These fluent APIs provide common building blocks for data processing, such as various forms of user-specified transformations, joins, aggregations, windows, and state operations. The data types processed at this level of the API have corresponding classes in each programming language.
 
-    The low level *Process Function* integrates with the *DataStream API*,
-    making it possible to use the lower-level abstraction on an as-needed basis. 
-    The *DataSet API* offers additional primitives on bounded data sets,
-    like loops/iterations.
+    The integration of low-level abstractions like the *Process Function* with the *DataStream API* makes it possible for users to drop down to the lower-level abstraction when their needs require it. The *DataSet API* additionally offers some primitives, such as loop/iteration operations.
 
-  - The **Table API** is a declarative DSL centered around *tables*, which may
-    be dynamically changing tables (when representing streams).  The [Table
-    API]({% link dev/table/index.zh.md %}) follows the
-    (extended) relational model: Tables have a schema attached (similar to
-    tables in relational databases) and the API offers comparable operations,
-    such as select, project, join, group-by, aggregate, etc.  Table API
-    programs declaratively define *what logical operation should be done*
-    rather than specifying exactly *how the code for the operation looks*.
-    Though the Table API is extensible by various types of user-defined
-    functions, it is less expressive than the *Core APIs*, and more concise to
-    use (less code to write).  In addition, Table API programs also go through
-    an optimizer that applies optimization rules before execution.
+  - The third level of abstraction in the Flink API is the **Table API**. The **Table API** is a declarative DSL centered around tables (Table); in a streaming scenario, for example, a table can represent a dynamically changing table. The [Table API]({% link dev/table/index.zh.md %}) follows the (extended) relational model: tables have a schema attached (similar to schemas in relational databases), and the Table API offers comparable relational operations, such as select, project, join, group-by, and aggregate. Table API programs declaratively define *what logical operation should be done*, rather than specifying exactly *what code the program should execute*. Although the Table API is concise to use and can be extended with various kinds of user-defined functions, it is less expressive than the Core APIs. In addition, before execution, Table API programs pass through an optimizer that applies optimization rules to the expressions the user has written.
 
-    One can seamlessly convert between tables and *DataStream*/*DataSet*,
-    allowing programs to mix the *Table API* with the *DataStream* and
-    *DataSet* APIs.
+    Tables and *DataStream*/*DataSet* can be converted back and forth seamlessly, and Flink allows users to mix the *Table API* with the *DataStream*/*DataSet* APIs when writing an application.
 
-  - The highest level abstraction offered by Flink is **SQL**. This abstraction
-    is similar to the *Table API* both in semantics and expressiveness, but
-    represents programs as SQL query expressions.  The [SQL]({{ site.baseurl
-    }}{% link dev/table/index.zh.md %}#sql) abstraction closely interacts with the
-    Table API, and SQL queries can be executed over tables defined in the
-    *Table API*.
+  - The topmost abstraction in the Flink API is **SQL**. This level of abstraction is similar to the *Table API* in both semantics and expressiveness, but represents programs as SQL query expressions. The [SQL]({{ site.baseurl}}{% link dev/table/index.zh.md %}#sql) abstraction interacts very closely with the Table API abstraction, and SQL queries can be executed over the tables defined in the *Table API*.
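
To make the lowest abstraction level concrete, here is a minimal, hypothetical sketch of a keyed Process Function embedded in the DataStream API, as referenced in the first bullet above. The sensor readings, the 40.0 threshold, and the job name are invented for illustration; the sketch assumes the Java DataStream API.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ProcessFunctionSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // hypothetical stream of (sensorId, temperature) readings
        env.fromElements(Tuple2.of("sensor-1", 35.0), Tuple2.of("sensor-1", 42.0))
           .keyBy(value -> value.f0)
           .process(new KeyedProcessFunction<String, Tuple2<String, Double>, String>() {
               @Override
               public void processElement(Tuple2<String, Double> reading,
                                          Context ctx,
                                          Collector<String> out) {
                   // free-form per-event logic; fault-tolerant state and
                   // event-time/processing-time timers are also available
                   // here via getRuntimeContext() and ctx.timerService()
                   if (reading.f1 > 40.0) {
                       out.collect(reading.f0 + " exceeded 40.0");
                   }
               }
           })
           .print();

        env.execute("process-function-sketch");
    }
}
```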
diff --git a/docs/learn-flink/index.zh.md b/docs/learn-flink/index.zh.md
index 012ea0a..75be8be 100644
--- a/docs/learn-flink/index.zh.md
+++ b/docs/learn-flink/index.zh.md
@@ -1,8 +1,8 @@
 ---
-title: "Learn Flink: Hands-on Training"
-nav-id: training
+title: Hands-on Training
+nav-id: learn-flink
 nav-pos: 2
-nav-title: '<i class="fa fa-hand-paper-o title appetizer" aria-hidden="true"></i> Hands-on Training'
+nav-title: '<i class="fa fa-hand-paper-o title appetizer" aria-hidden="true"></i> Hands-on Training'
 nav-parent_id: root
 nav-show_overview: true
 ---
@@ -28,158 +28,90 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-## Goals and Scope of this Training
+## Goals and Scope of this Training
 
-This training presents an introduction to Apache Flink that includes just enough to get you started
-writing scalable streaming ETL, analytics, and event-driven applications, while leaving out a lot of
-(ultimately important) details. The focus is on providing straightforward introductions to Flink's
-APIs for managing state and time, with the expectation that having mastered these fundamentals,
-you'll be much better equipped to pick up the rest of what you need to know from the more detailed
-reference documentation. The links at the end of each section will lead you to where you
-can learn more.
+This training presents an introduction to the basic concepts of Apache Flink. Although it leaves out many important details, mastering this material is enough to let you implement scalable, parallel ETL, data analytics, and event-driven streaming applications. The focus is on state management and time in the Flink APIs; having mastered these fundamentals, you'll be much better equipped to pick up what else you need to know from the more detailed reference documentation. Links at the end of each section will guide you to where you can learn more.
 
-Specifically, you will learn:
+Specifically, in this chapter you will learn:
 
-- how to implement streaming data processing pipelines
-- how and why Flink manages state
-- how to use event time to consistently compute accurate analytics
-- how to build event-driven applications on continuous streams
-- how Flink is able to provide fault-tolerant, stateful stream processing with exactly-once semantics
+- how to implement streaming data processing pipelines
+- how and why Flink manages state
+- how to use event time to consistently compute accurate analytics
+- how to build event-driven applications on continuous streams
+- how Flink provides fault-tolerant, stateful stream processing with exactly-once semantics
 
-This training focuses on four critical concepts: continuous processing of streaming data, event
-time, stateful stream processing, and state snapshots. This page introduces these concepts.
+This training focuses on four critical concepts: continuous processing of streaming data, event time, stateful stream processing, and state snapshots. These basic concepts are introduced below.
 
-{% info Note %} Accompanying this training is a set of hands-on exercises that will
-guide you through learning how to work with the concepts being presented. A link to the relevant
-exercise is provided at the end of each section.
+{% info Note %} Each section of this training comes with hands-on exercises that guide you through using the concepts it presents, and a link to the code for the relevant exercise is provided at the end of each section.
 
 {% top %}
 
-## Stream Processing
+## Stream Processing
 
-Streams are data's natural habitat. Whether it is events from web servers, trades from a stock
-exchange, or sensor readings from a machine on a factory floor, data is created as part of a stream.
-But when you analyze data, you can either organize your processing around _bounded_ or _unbounded_
-streams, and which of these paradigms you choose has profound consequences.
+In the natural world, data is produced as streams. Whether it is event data from web servers, trades from a stock exchange, or sensor readings from machines on a factory floor, the data is streaming. But when you analyze data, you can organize your processing around either the _bounded_ or the _unbounded_ stream model, and which of these models you choose has profound consequences for how your program executes and processes the data.
 
 <img src="{{ site.baseurl }}/fig/bounded-unbounded.png" alt="Bounded and unbounded streams" class="offset" width="90%" />
 
-**Batch processing** is the paradigm at work when you process a bounded data stream. In this mode of
-operation you can choose to ingest the entire dataset before producing any results, which means that
-it is possible, for example, to sort the data, compute global statistics, or produce a final report
-that summarizes all of the input.
+**Batch processing** is the paradigm for processing bounded data streams. In this mode, you can choose to ingest the entire dataset before producing any results, which means that you can sort, compute statistics over, or summarize the whole dataset before emitting a result.
 
-**Stream processing**, on the other hand, involves unbounded data streams. Conceptually, at least,
-the input may never end, and so you are forced to continuously process the data as it arrives. 
+**Stream processing**, by contrast, involves unbounded data streams. Conceptually, at least, the input may never end, and so the program must continuously process the data as it arrives.
 
-In Flink, applications are composed of **streaming dataflows** that may be transformed by
-user-defined **operators**. These dataflows form directed graphs that start with one or more
-**sources**, and end in one or more **sinks**.
+In Flink, applications are composed of **streaming dataflows** that are transformed by user-defined **operators**. These streaming dataflows form directed graphs that start with one or more **sources** and end in one or more **sinks**.
 
 <img src="{{ site.baseurl }}/fig/program_dataflow.svg" alt="A DataStream program, and its dataflow." class="offset" width="80%" />
 
-Often there is a one-to-one correspondence between the transformations in the program and the
-operators in the dataflow. Sometimes, however, one transformation may consist of multiple operators.
+Often there is a one-to-one correspondence between the transformations in the program code and the operators in the dataflow. Sometimes, however, one transformation may consist of multiple operators, as shown in the figure above.
 
-An application may consume real-time data from streaming sources such as message queues or
-distributed logs, like Apache Kafka or Kinesis. But flink can also consume bounded, historic data
-from a variety of data sources. Similarly, the streams of results being produced by a Flink
-application can be sent to a wide variety of systems that can be connected as sinks.
+A Flink application may consume real-time data from streaming sources such as message queues or distributed logs (for example, Apache Kafka or Kinesis), and it can also consume bounded, historic data from a variety of sources. Similarly, the streams of results produced by a Flink application can be sent to a wide variety of sinks.
 
 <img src="{{ site.baseurl }}/fig/flink-application-sources-sinks.png" alt="Flink application with sources and sinks" class="offset" width="90%" />
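
As a concrete (if contrived) illustration of the source → operators → sink shape described above, the sketch below uses a socket source and a print sink so that it stays self-contained; the host and port are placeholders, and a production job would plug in a Kafka or Kinesis connector instead.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DataflowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // source: an unbounded stream of text lines from a (placeholder) local socket
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // operator: a user-defined transformation
        DataStream<String> shouted = lines.map(line -> line.toUpperCase());

        // sink: print to stdout; file, Kafka, or database sinks attach the same way
        shouted.print();

        env.execute("source-transform-sink-sketch");
    }
}
```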
 
-### Parallel Dataflows
+### Parallel Dataflows
 
-Programs in Flink are inherently parallel and distributed. During execution, a
-*stream* has one or more **stream partitions**, and each *operator* has one or
-more **operator subtasks**. The operator subtasks are independent of one
-another, and execute in different threads and possibly on different machines or
-containers.
+Flink programs are inherently parallel and distributed. During execution, a stream has one or more **stream partitions**, and each operator has one or more **operator subtasks**. The subtasks are independent of one another, and run in separate threads, possibly on different machines or containers.
 
-The number of operator subtasks is the **parallelism** of that particular
-operator.
-Different operators of the same program may have different levels of
-parallelism.
+The number of operator subtasks is the **parallelism** of that operator. Different operators within the same program may have different levels of parallelism.
 
 <img src="{{ site.baseurl }}/fig/parallel_dataflow.svg" alt="A parallel dataflow" class="offset" width="80%" />
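
Parallelism can be set for the job as a whole and overridden per operator. The fragment below is a hypothetical sketch (the host, port, and parallelism values are made up) of how these settings compose:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2); // default parallelism for the operators in this job

        env.socketTextStream("localhost", 9999)          // the socket source is non-parallel
           .map(String::toUpperCase).setParallelism(4)   // this map runs as 4 parallel subtasks
           .print().setParallelism(1);                   // the sink runs as a single subtask

        env.execute("parallelism-sketch");
    }
}
```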
 
-Streams can transport data between two operators in a *one-to-one* (or
-*forwarding*) pattern, or in a *redistributing* pattern:
-
-  - **One-to-one** streams (for example between the *Source* and the *map()*
-    operators in the figure above) preserve the partitioning and ordering of
-    the elements. That means that subtask[1] of the *map()* operator will see
-    the same elements in the same order as they were produced by subtask[1] of
-    the *Source* operator.
-
-  - **Redistributing** streams (as between *map()* and *keyBy/window* above, as
-    well as between *keyBy/window* and *Sink*) change the partitioning of
-    streams. Each *operator subtask* sends data to different target subtasks,
-    depending on the selected transformation. Examples are *keyBy()* (which
-    re-partitions by hashing the key), *broadcast()*, or *rebalance()* (which
-    re-partitions randomly). In a *redistributing* exchange the ordering among
-    the elements is only preserved within each pair of sending and receiving
-    subtasks (for example, subtask[1] of *map()* and subtask[2] of
-    *keyBy/window*). So, for example, the redistribution between the keyBy/window and
-    the Sink operators shown above introduces non-determinism regarding the 
-    order in which the aggregated results for different keys arrive at the Sink.
+Flink operators can transport data between each other in a *one-to-one* (*forwarding*) pattern or in a *redistributing* pattern:
+
+  - **One-to-one** streams (for example, between the *Source* and *map()* operators in the figure above) preserve the partitioning and ordering of the elements. This means that the data received by subtask[1] of the *map()* operator, and its order, are exactly the same as the data produced by subtask[1] of the *Source* operator: data in a given partition flows into the same partition of the downstream operator.
+
+  - **Redistributing** streams (for example, between *map()* and *keyBy/window* in the figure above, as well as between *keyBy/window* and *Sink*) change the partition that data belongs to. Depending on the *transformation* you choose in your program, each *operator subtask* sends data to different target subtasks. Examples of transformations and their corresponding redistribution patterns include *keyBy()* (which re-partitions by hashing the key), *broadcast()* (broadcasting), and *rebalance()* (which re-partitions randomly); a sketch of these follows below. When data is *redistributed*, the ordering among the elements is only preserved within each pair of sending and receiving subtasks (for example, the elements from subtask[1] of *map()* arrive at subtask[2] of *keyBy/window* in order). So, in the redistribution between the *keyBy/window* and *Sink* operators shown above, the order in which the aggregated results for different keys arrive at the Sink is non-deterministic.
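
The sketch referenced in the bullet above shows the forwarding pattern next to the three redistributing transformations named there; the input elements are invented for illustration.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RedistributionSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> words = env.fromElements("a", "b", "a", "c");

        // one-to-one (forwarding): map preserves the partitioning and order of its input
        DataStream<String> marked = words.map(w -> w + "!");

        // redistributing: keyBy re-partitions by the hash of the key,
        // so equal keys always reach the same downstream subtask
        marked.keyBy(w -> w).print();

        // redistributing: rebalance spreads elements evenly across downstream subtasks
        marked.rebalance().print();

        // redistributing: broadcast sends every element to every downstream subtask
        marked.broadcast().print();

        env.execute("redistribution-sketch");
    }
}
```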
 
 {% top %}
 
-## Timely Stream Processing
+## Timely Stream Processing
 
-For most streaming applications it is very valuable to be able re-process historic data with the
-same code that is used to process live data -- and to produce deterministic, consistent results,
-regardless.
+For most streaming applications it is very valuable to be able to re-process historic data with the same code that is used to process live data, and to produce deterministic, consistent results regardless.
 
-It can also be crucial to pay attention to the order in which events occurred, rather than the order
-in which they are delivered for processing, and to be able to reason about when a set of events is
-(or should be) complete. For example, consider the set of events involved in an e-commerce
-transaction, or financial trade.
+When processing streaming data, we usually care more about the order in which events actually occurred than about the order in which they are delivered and processed, because this lets us reason about when a set of events is, or should be, complete. Consider, for example, the set of events involved in an e-commerce transaction or a financial trade.
 
-These requirements for timely stream processing can be met by using event time timestamps that are
-recorded in the data stream, rather than using the clocks of the machines processing the data.
+To support timely stream processing scenarios like these, we usually use the event-time timestamps recorded in the data stream, rather than the clocks of the machines processing the data.
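
One common way to use the event-time timestamps recorded in the stream is to assign them together with watermarks. The sketch below assumes the `WatermarkStrategy` API introduced in Flink 1.11; the `(key, timestampMillis)` tuple stream and the five-second out-of-orderness bound are invented for illustration.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EventTimeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // hypothetical events carrying their own event-time timestamps (epoch millis)
        DataStream<Tuple2<String, Long>> events =
            env.fromElements(Tuple2.of("order-1", 1_000L), Tuple2.of("order-2", 2_000L));

        // use the timestamp recorded in the event, not the machine clock,
        // tolerating events that arrive up to 5 seconds out of order
        events.assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((event, previous) -> event.f1))
              .print();

        env.execute("event-time-sketch");
    }
}
```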
 
 {% top %}
 
-## Stateful Stream Processing
+## Stateful Stream Processing
 
-Flink's operations can be stateful. This means that how one event is handled can depend on the
-accumulated effect of all the events that came before it. State may be used for something simple,
-such as counting events per minute to display on a dashboard, or for something more complex, such as
-computing features for a fraud detection model.
+Operators in Flink can be stateful. This means that how one event is handled can depend on the accumulated effect of all the events that came before it. State in Flink can be used not only for simple scenarios (such as counting the per-minute data shown on a dashboard), but also for complex ones (such as training a fraud-detection model).
 
-A Flink application is run in parallel on a distributed cluster. The various parallel instances of a
-given operator will execute independently, in separate threads, and in general will be running on
-different machines.
+A Flink application runs in parallel on a distributed cluster, where the parallel instances of a given operator execute independently in separate threads, and in general run on different machines.
 
-The set of parallel instances of a stateful operator is effectively a sharded key-value store. Each
-parallel instance is responsible for handling events for a specific group of keys, and the state for
-those keys is kept locally.
+The set of parallel instances of a stateful operator usually stores its state sharded by key. Each parallel operator instance is responsible for handling the events for a specific group of keys, and the state for those keys is kept locally.
 
-The diagram below shows a job running with a parallelism of two across the first three operators in
-the job graph, terminating in a sink that has a parallelism of one. The third operator is stateful,
-and you can see that a fully-connected network shuffle is occurring between the second and third
-operators. This is being done to partition the stream by some key, so that all of the events that
-need to be processed together, will be.
+In the Flink job shown below, the first three operators run with a parallelism of 2, and the final sink operator runs with a parallelism of 1. The third operator is stateful, and you can see that the second and third operators are fully connected: they distribute data to each other over the network. A Flink program is typically written this way in order to partition the stream by some key, so that the events that need to be processed together are brought together and then processed as a group.
 
 <img src="{{ site.baseurl }}/fig/parallel-job.png" alt="State is sharded" class="offset" width="65%" />
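
A minimal sketch of this kind of keyed state, assuming a per-key event counter is representative: each parallel instance of the flatMap below keeps state only for the keys routed to it. The class name and state name are invented.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// usage (hypothetical): stream.keyBy(s -> s).flatMap(new PerKeyCounter())
public class PerKeyCounter extends RichFlatMapFunction<String, String> {
    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
            new ValueStateDescriptor<>("count", Types.LONG));
    }

    @Override
    public void flatMap(String key, Collector<String> out) throws Exception {
        Long current = count.value();  // reads the state of the current key only
        long updated = (current == null) ? 1L : current + 1L;
        count.update(updated);
        out.collect(key + " seen " + updated + " times");
    }
}
```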
 
-State is always accessed locally, which helps Flink applications achieve high throughput and
-low-latency. You can choose to keep state on the JVM heap, or if it is too large, in efficiently
-organized on-disk data structures. 
+All state access in a Flink application is local, which helps it achieve high throughput and low latency. Normally, a Flink application keeps its state on the JVM heap, but if the state is too large, you can also choose to keep it in efficiently organized structures on fast disks.
 
 <img src="{{ site.baseurl }}/fig/local-state.png" alt="State is local" class="offset" width="90%" />
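
Where that state lives is chosen through a state backend. A hedged sketch using the backends of this era: `FsStateBackend` keeps working state on the JVM heap, while `RocksDBStateBackend` (from the separate `flink-statebackend-rocksdb` dependency) keeps working state in on-disk data structures, for state too large for the heap. The checkpoint path is a placeholder.

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // working state on the JVM heap, snapshots written to a (placeholder) path
        env.setStateBackend(new FsStateBackend("file:///tmp/checkpoints"));

        // or: working state in RocksDB's on-disk structures, for very large state
        env.setStateBackend(new RocksDBStateBackend("file:///tmp/checkpoints"));
    }
}
```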
 
 {% top %}
 
-## Fault Tolerance via State Snapshots
+## Fault Tolerance via State Snapshots
 
-Flink is able to provide fault-tolerant, exactly-once semantics through a combination of state
-snapshots and stream replay. These snapshots capture the entire state of the distributed pipeline,
-recording offsets into the input queues as well as the state throughout the job graph that has
-resulted from having ingested the data up to that point. When a failure occurs, the sources are
-rewound, the state is restored, and processing is resumed. As depicted above, these state snapshots
-are captured asynchronously, without impeding the ongoing processing.
+Through the combination of state snapshots and stream replay, Flink is able to provide fault-tolerant, exactly-once semantics. When taken, these state snapshots capture and store the entire state of the distributed pipeline: they record the offsets up to which data has been consumed from the sources, and they record and store the state throughout the job graph that resulted from ingesting the data up to those offsets. When a failure occurs, the Flink job restores the last stored state and resets the sources to resume consuming from the offsets recorded in that state. Moreover, state snapshots are taken and stored asynchronously, without blocking the ongoing data processing.
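
These snapshots are Flink's checkpoints, and they must be enabled explicitly. A minimal sketch (the 10-second interval is arbitrary):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // take an asynchronous snapshot of the whole pipeline every 10 seconds with
        // exactly-once guarantees; on failure, Flink restores the latest snapshot
        // and rewinds the sources to the offsets recorded in it
        env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);
    }
}
```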
 
 {% top %}