Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/03/01 05:19:46 UTC

[GitHub] [flink] MrWhiteSike commented on a change in pull request #18655: [FLINK-25799] [docs] Translate table/filesystem.md page into Chinese.

MrWhiteSike commented on a change in pull request #18655:
URL: https://github.com/apache/flink/pull/18655#discussion_r816456481



##########
File path: docs/content.zh/docs/connectors/table/filesystem.md
##########
@@ -190,209 +194,217 @@ CREATE TABLE MyUserTableWithFilepath (
 )
 ```
 
+<a name="streaming-sink"></a>
+
 ## Streaming Sink
 
-The file system connector supports streaming writes, based on Flink's [FileSystem]({{< ref "docs/connectors/datastream/filesystem" >}}),
-to write records to file. Row-encoded Formats are CSV and JSON. Bulk-encoded Formats are Parquet, ORC and Avro.
+文件系统连接器支持流式写入,它基于 Flink 的 [文件系统]({{< ref "docs/connectors/datastream/filesystem" >}}) 将记录写入文件。Row-encoded Format 包括 CSV 和 JSON;Bulk-encoded Format 包括 Parquet、ORC 和 Avro。
 
-You can write SQL directly, insert the stream data into the non-partitioned table.
-If it is a partitioned table, you can configure partition related operations. See [Partition Commit](filesystem.html#partition-commit) for details.
+可以直接编写 SQL,将流数据插入到非分区表。
+如果是分区表,可以配置分区操作相关的属性。请参考[分区提交](#partition-commit)了解更多详情。
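+
+下面是一个简化的写法示意(并非本页原文的示例;表名 `kafka_source_table`、`fs_sink_table` 以及各字段均为假设的名称),展示如何把流数据写入一个分区的文件系统表:
+
+```sql
+-- 假设的文件系统 sink 表,按天 (dt) 和小时 (`hour`) 分区
+CREATE TABLE fs_sink_table (
+  user_id STRING,
+  order_amount DOUBLE,
+  dt STRING,
+  `hour` STRING
+) PARTITIONED BY (dt, `hour`) WITH (
+  'connector' = 'filesystem',
+  'path' = 'file:///tmp/fs_sink_table',
+  'format' = 'json'
+);
+
+-- 将流数据插入分区表,kafka_source_table 为假设的流式 source 表
+INSERT INTO fs_sink_table
+SELECT user_id, order_amount,
+       DATE_FORMAT(log_ts, 'yyyy-MM-dd'), DATE_FORMAT(log_ts, 'HH')
+FROM kafka_source_table;
+```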
 
-### Rolling Policy
+<a name="rolling-policy"></a>
 
-Data within the partition directories are split into part files. Each partition will contain at least one part file for
-each subtask of the sink that has received data for that partition. The in-progress part file will be closed and additional
-part file will be created according to the configurable rolling policy. The policy rolls part files based on size,
-a timeout that specifies the maximum duration for which a file can be open.
+### 滚动策略
+
+分区目录下的数据被分割到 part 文件中。对于每个分区,sink 的每个接收到了该分区数据的 subtask 都至少会为该分区生成一个 part 文件。根据可配置的滚动策略,当前 in-progress part 文件将被关闭,并生成新的 part 文件。该策略基于文件大小以及文件可以处于打开状态的最大时长,来滚动 part 文件。
 
 <table class="table table-bordered">
   <thead>
     <tr>
-        <th class="text-left" style="width: 20%">Key</th>
-        <th class="text-left" style="width: 15%">Default</th>
-        <th class="text-left" style="width: 10%">Type</th>
-        <th class="text-left" style="width: 55%">Description</th>
+        <th class="text-left" style="width: 20%">键</th>
+        <th class="text-left" style="width: 15%">默认值</th>
+        <th class="text-left" style="width: 10%">类型</th>
+        <th class="text-left" style="width: 55%">描述</th>
     </tr>
   </thead>
   <tbody>
     <tr>
         <td><h5>sink.rolling-policy.file-size</h5></td>
         <td style="word-wrap: break-word;">128MB</td>
         <td>MemorySize</td>
-        <td>The maximum part file size before rolling.</td>
+        <td> 滚动前,part 文件最大大小。</td>
     </tr>
     <tr>
         <td><h5>sink.rolling-policy.rollover-interval</h5></td>
         <td style="word-wrap: break-word;">30 min</td>
         <td>Duration</td>
-        <td>The maximum time duration a part file can stay open before rolling (by default 30 min to avoid to many small files).
-        The frequency at which this is checked is controlled by the 'sink.rolling-policy.check-interval' option.</td>
+        <td> 滚动前,part 文件处于打开状态的最大时长(默认值30分钟,以避免产生大量小文件)。
+        检查频率是由 'sink.rolling-policy.check-interval' 属性控制的。</td>
     </tr>
     <tr>
         <td><h5>sink.rolling-policy.check-interval</h5></td>
         <td style="word-wrap: break-word;">1 min</td>
         <td>Duration</td>
-        <td>The interval for checking time based rolling policies. This controls the frequency to check whether a part file should rollover based on 'sink.rolling-policy.rollover-interval'.</td>
+        <td> 基于时间的滚动策略的检查间隔。该属性控制了根据 'sink.rolling-policy.rollover-interval' 检查 part 文件是否应该被滚动的频率。</td>
     </tr>
   </tbody>
 </table>
 
-**NOTE:** For bulk formats (parquet, orc, avro), the rolling policy in combination with the checkpoint interval(pending files
-become finished on the next checkpoint) control the size and number of these parts.
+**注意:** 对于 bulk formats 数据 (parquet、orc、avro),滚动策略结合 checkpoint 间隔(pending 状态的文件会在下一个 checkpoint 时变为 finished 状态)共同控制了 part 文件的大小和个数。
+
+**注意:** 对于 row formats 数据 (csv、json),如果不想等待很久才能在文件系统中观察到数据,可以在连接器属性中设置 `sink.rolling-policy.file-size` 或 `sink.rolling-policy.rollover-interval`,并同时在 flink-conf.yaml 中设置 `execution.checkpointing.interval` 属性。
+对于其他 formats (avro、orc),可以只设置 flink-conf.yaml 中的 `execution.checkpointing.interval` 属性。
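+
+作为一个简单的配置示意(表名、路径均为假设,并非本页原文的示例),滚动策略相关属性可以在建表时的 WITH 子句中设置:
+
+```sql
+-- 假设的表:通过调小滚动阈值,让 part 文件更快地被滚动
+CREATE TABLE rolling_sink_table (
+  user_id STRING,
+  order_amount DOUBLE
+) WITH (
+  'connector' = 'filesystem',
+  'path' = 'file:///tmp/rolling_sink_table',
+  'format' = 'json',
+  'sink.rolling-policy.file-size' = '64MB',
+  'sink.rolling-policy.rollover-interval' = '10 min',
+  'sink.rolling-policy.check-interval' = '1 min'
+);
+-- 同时可在 flink-conf.yaml 中设置 execution.checkpointing.interval(例如 1min),
+-- 以控制 checkpoint 间隔。
+```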
 
-**NOTE:** For row formats (csv, json), you can set the parameter `sink.rolling-policy.file-size` or `sink.rolling-policy.rollover-interval` in the connector properties and parameter `execution.checkpointing.interval` in flink-conf.yaml together
-if you don't want to wait a long period before observe the data exists in file system. For other formats (avro, orc), you can just set parameter `execution.checkpointing.interval` in flink-conf.yaml.
+<a name="file-compaction"></a>
 
-### File Compaction
+### 文件合并
 
-The file sink supports file compactions, which allows applications to have smaller checkpoint intervals without generating a large number of files.
+file sink 支持文件合并,允许应用程序使用较小的 checkpoint 间隔而不产生大量小文件。
 
 <table class="table table-bordered">
   <thead>
     <tr>
-        <th class="text-left" style="width: 20%">Key</th>
-        <th class="text-left" style="width: 15%">Default</th>
-        <th class="text-left" style="width: 10%">Type</th>
-        <th class="text-left" style="width: 55%">Description</th>
+        <th class="text-left" style="width: 20%">键</th>
+        <th class="text-left" style="width: 15%">默认值</th>
+        <th class="text-left" style="width: 10%">类型</th>
+        <th class="text-left" style="width: 55%">描述</th>
     </tr>
   </thead>
   <tbody>
     <tr>
         <td><h5>auto-compaction</h5></td>
         <td style="word-wrap: break-word;">false</td>
         <td>Boolean</td>
-        <td>Whether to enable automatic compaction in streaming sink or not. The data will be written to temporary files. After the checkpoint is completed, the temporary files generated by a checkpoint will be compacted. The temporary files are invisible before compaction.</td>
+        <td> 在流式 sink 中是否开启自动合并功能。数据首先会被写入临时文件。当 checkpoint 完成后,该检查点产生的临时文件会被合并。这些临时文件在合并前不可见。</td>
     </tr>
     <tr>
         <td><h5>compaction.file-size</h5></td>
-        <td style="word-wrap: break-word;">(none)</td>
+        <td style="word-wrap: break-word;">(无)</td>
         <td>MemorySize</td>
-        <td>The compaction target file size, the default value is the rolling file size.</td>
+        <td> 合并目标文件大小,默认值为滚动文件大小。</td>
     </tr>
   </tbody>
 </table>
 
-If enabled, file compaction will merge multiple small files into larger files based on the target file size.
-When running file compaction in production, please be aware that:
-- Only files in a single checkpoint are compacted, that is, at least the same number of files as the number of checkpoints is generated.
-- The file before merging is invisible, so the visibility of the file may be: checkpoint interval + compaction time.
-- If the compaction takes too long, it will backpressure the job and extend the time period of checkpoint.
+如果启用文件合并功能,会根据目标文件大小,将多个小文件合并成大文件。
+在生产环境中使用文件合并功能时,需要注意:
+- 仅会合并单个 checkpoint 内的文件,也就是说,生成的文件个数至少与 checkpoint 个数相同。
+- 合并前的文件是不可见的,因此文件的可见时间可能是:checkpoint 间隔时长 + 合并时长。
+- 如果合并时间过长,将会对作业产生反压,并延长 checkpoint 所需的时间。
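+
+下面是一个开启自动合并的简单配置示意(表名、路径均为假设,并非本页原文的示例):
+
+```sql
+-- 假设的表:开启自动合并,小文件会被合并至约 128MB 的目标大小
+CREATE TABLE compacted_sink_table (
+  user_id STRING,
+  order_amount DOUBLE
+) WITH (
+  'connector' = 'filesystem',
+  'path' = 'file:///tmp/compacted_sink_table',
+  'format' = 'parquet',
+  'auto-compaction' = 'true',
+  'compaction.file-size' = '128MB'
+);
+```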
+
+<a name="partition-commit"></a>
+
+### 分区提交
 
-### Partition Commit
+向分区中写完数据之后,通常需要通知下游应用。例如,在 Hive metastore 中新增分区,或者在目录下生成 `_SUCCESS` 文件。文件系统 sink 提供了分区提交功能,允许配置自定义的提交策略。具体的分区提交行为是基于 `triggers` 和 `policies` 的组合。
 
-After writing a partition, it is often necessary to notify downstream applications. For example, add the partition to a Hive metastore or writing a `_SUCCESS` file in the directory. The file system sink contains a partition commit feature that allows configuring custom policies. Commit actions are based on a combination of `triggers` and `policies`.
+- Trigger:分区提交时机,可以基于从分区中提取的时间对应的 watermark,或者基于处理时间。
+- Policy:分区提交策略。内置策略包括生成 `_SUCCESS` 文件和向 Hive metastore 提交分区,也可以实现自定义策略,例如触发 Hive 的 analysis 以生成统计信息、合并小文件等。
 
-- Trigger: The timing of the commit of the partition can be determined by the watermark with the time extracted from the partition, or by processing time.
-- Policy: How to commit a partition, built-in policies support for the commit of success files and metastore, you can also implement your own policies, such as triggering hive's analysis to generate statistics, or merging small files, etc.
+**注意:** 分区提交仅在动态分区插入模式下才有效。
 
-**NOTE:** Partition Commit only works in dynamic partition inserting.
+<a name="partition-commit-trigger"></a>
 
-#### Partition commit trigger
+#### 分区提交触发器
 
-To define when to commit a partition, providing partition commit trigger:
+通过配置分区提交触发器,来决定何时提交分区:
 
 <table class="table table-bordered">
   <thead>
     <tr>
-        <th class="text-left" style="width: 20%">Key</th>
-        <th class="text-left" style="width: 15%">Default</th>
-        <th class="text-left" style="width: 10%">Type</th>
-        <th class="text-left" style="width: 55%">Description</th>
+        <th class="text-left" style="width: 20%">键</th>
+        <th class="text-left" style="width: 15%">默认值</th>
+        <th class="text-left" style="width: 10%">类型</th>
+        <th class="text-left" style="width: 55%">描述</th>
     </tr>
   </thead>
   <tbody>
     <tr>
         <td><h5>sink.partition-commit.trigger</h5></td>
         <td style="word-wrap: break-word;">process-time</td>
         <td>String</td>
-        <td>Trigger type for partition commit: 'process-time': based on the time of the machine, it neither requires partition time extraction nor watermark generation. Commit partition once the 'current system time' passes 'partition creation system time' plus 'delay'. 'partition-time': based on the time that extracted from partition values, it requires watermark generation. Commit partition once the 'watermark' passes 'time extracted from partition values' plus 'delay'.</td>
+        <td> 分区提交触发器类型:
+        'process-time':基于机器时间,既不需要分区时间提取器,也不需要 watermark 生成器。一旦 "当前系统时间" 超过了 "分区创建系统时间" 与 'sink.partition-commit.delay' 之和,就立即提交分区。
+        'partition-time':基于从分区字段值中提取的时间,需要生成 watermark。一旦 watermark 超过了 "从分区字段值中提取的时间" 与 'sink.partition-commit.delay' 之和,就立即提交分区。</td>
     </tr>
     <tr>
         <td><h5>sink.partition-commit.delay</h5></td>
         <td style="word-wrap: break-word;">0 s</td>
         <td>Duration</td>
-        <td>The partition will not commit until the delay time. If it is a daily partition, should be '1 d', if it is a hourly partition, should be '1 h'.</td>
+        <td> 在该延迟时间之前,分区不会被提交。如果是按天分区,应设置为 '1 d';如果是按小时分区,应设置为 '1 h'。</td>
     </tr>
     <tr>
         <td><h5>sink.partition-commit.watermark-time-zone</h5></td>
         <td style="word-wrap: break-word;">UTC</td>
         <td>String</td>
-        <td>The time zone to parse the long watermark value to TIMESTAMP value, the parsed watermark timestamp is used to compare with partition time to decide the partition should commit or not. This option is only take effect when `sink.partition-commit.trigger` is set to 'partition-time'. If this option is not configured correctly, e.g. source rowtime is defined on TIMESTAMP_LTZ column, but this config is not configured, then users may see the partition committed after a few hours. The default value is 'UTC', which means the watermark is defined on TIMESTAMP column or not defined. If the watermark is defined on TIMESTAMP_LTZ column, the time zone of watermark is the session time zone. The option value is either a full name such as 'America/Los_Angeles', or a custom timezone id such as 'GMT-08:00'.</td>
+        <td> 将 Long 类型的 watermark 值解析为 TIMESTAMP 类型时所采用的时区,解析得到的 watermark 时间戳会与分区时间进行比较,以判断分区是否应该被提交。该属性仅当 `sink.partition-commit.trigger` 被设置为 'partition-time' 时生效。如果该属性配置得不正确,例如 source rowtime 定义在 TIMESTAMP_LTZ 类型的列上,但没有配置该属性,那么用户可能在若干小时之后才能看到分区被提交。默认值为 'UTC',意味着 watermark 定义在 TIMESTAMP 类型的列上或者没有定义 watermark。如果 watermark 定义在 TIMESTAMP_LTZ 类型的列上,则 watermark 的时区是会话时区(session time zone)。该属性的值要么是完整的时区名,比如 'America/Los_Angeles',要么是自定义的时区 id,比如 'GMT-08:00'。</td>
     </tr>    
   </tbody>
 </table>
 
-There are two types of trigger:
-- The first is partition processing time. It neither requires partition time extraction nor watermark
-  generation. The trigger of partition commit according to partition creation time and current system time. This trigger
-  is more universal, but not so precise. For example, data delay or failover will lead to premature partition commit.
-- The second is the trigger of partition commit according to the time that extracted from partition values and watermark.
-  This requires that your job has watermark generation, and the partition is divided according to time, such as
-  hourly partition or daily partition.
+分区提交触发器有两种类型:
+- 第一种是基于分区的处理时间。它既不需要提取分区时间,也不需要生成 watermark,而是根据分区创建时间和当前系统时间来触发分区提交。
+  这种触发器更具普遍性,但不够精确。例如,数据延迟或 failover 将导致分区被过早提交。
+- 第二种是根据从分区字段值中提取的时间以及 watermark 来触发分区提交。
+  这要求作业具有 watermark 生成,并且分区是按照时间来划分的,例如按小时或按天分区。
 
-If you want to let downstream see the partition as soon as possible, no matter whether its data is complete or not:
-- 'sink.partition-commit.trigger'='process-time' (Default value)
-- 'sink.partition-commit.delay'='0s' (Default value)
-  Once there is data in the partition, it will immediately commit. Note: the partition may be committed multiple times.
+如果想让下游尽快感知到分区,不管分区数据是否完整:
+- 'sink.partition-commit.trigger'='process-time' (默认值)
+- 'sink.partition-commit.delay'='0s' (默认值)
+  一旦分区中有数据写入,就会立即提交分区。注意:该分区可能会被提交多次(配置示意见下文)。
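+
+下面的示意显式地写出了上述两个默认属性,并配合上文提到的内置 success-file 提交策略(表名、字段、路径均为假设,并非本页原文的示例):
+
+```sql
+-- 假设的表:使用 process-time 触发器,分区中一有数据就提交
+CREATE TABLE committed_partition_table (
+  user_id STRING,
+  order_amount DOUBLE,
+  dt STRING
+) PARTITIONED BY (dt) WITH (
+  'connector' = 'filesystem',
+  'path' = 'file:///tmp/committed_partition_table',
+  'format' = 'json',
+  'sink.partition-commit.trigger' = 'process-time',
+  'sink.partition-commit.delay' = '0s',
+  'sink.partition-commit.policy.kind' = 'success-file'
+);
+```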
 
-If you want to let downstream see the partition only when its data is complete, and your job has watermark generation, and you can extract the time from partition values:
+如果想让下游只有在分区数据完整时才感知到分区,并且 job 中有 watermark 生成,也能从分区字段的值中提取到时间:

Review comment:
       Sorry, I don't think it's necessary to add 'you' in Chinese!



