Posted to commits@flink.apache.org by ma...@apache.org on 2022/03/18 22:41:08 UTC

[flink] branch master updated: [FLINK-25797][Docs] Translate datastream/formats/parquet.md page into Chinese. This closes #18646

This is an automated email from the ASF dual-hosted git repository.

martijnvisser pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/flink.git


The following commit(s) were added to refs/heads/master by this push:
     new 0f4b17c  [FLINK-25797][Docs] Translate datastream/formats/parquet.md page into Chinese. This closes #18646
0f4b17c is described below

commit 0f4b17c534ca092b290115e00f17138312ccad15
Author: wangzhiwu <ch...@cainiao.com>
AuthorDate: Mon Feb 7 21:43:56 2022 +0800

    [FLINK-25797][Docs] Translate datastream/formats/parquet.md page into Chinese. This closes #18646
    
    Co-authored-by: Jing Ge <ge...@gmail.com>
    Co-authored-by: wangzhiwubigdata <ch...@cainiao.com>
    Co-authored-by: Zhiwu Wang <28...@qq.com>
---
 .../docs/connectors/datastream/formats/parquet.md  | 221 ++++++++++-----------
 1 file changed, 106 insertions(+), 115 deletions(-)

diff --git a/docs/content.zh/docs/connectors/datastream/formats/parquet.md b/docs/content.zh/docs/connectors/datastream/formats/parquet.md
index 87c1a1b..16b0249 100644
--- a/docs/content.zh/docs/connectors/datastream/formats/parquet.md
+++ b/docs/content.zh/docs/connectors/datastream/formats/parquet.md
@@ -26,21 +26,22 @@ under the License.
 -->
 
 
+<a name="parquet-format"></a>
+
 # Parquet format
 
-Flink supports reading [Parquet](https://parquet.apache.org/) files,
-producing {{< javadoc file="org/apache/flink/table/data/RowData.html" name="Flink RowData">}} and producing [Avro](https://avro.apache.org/) records.
-To use the format you need to add the `flink-parquet` dependency to your project:
+Flink supports reading [Parquet](https://parquet.apache.org/) files and producing {{< javadoc file="org/apache/flink/table/data/RowData.html" name="Flink RowData">}} and [Avro](https://avro.apache.org/) records.
+To use the Parquet format, you need to add the `flink-parquet` dependency to your project:
 
 ```xml
 <dependency>
-	<groupId>org.apache.flink</groupId>
-	<artifactId>flink-parquet</artifactId>
-	<version>{{< version >}}</version>
+    <groupId>org.apache.flink</groupId>
+    <artifactId>flink-parquet</artifactId>
+    <version>{{< version >}}</version>
 </dependency>
 ```
 
-To read Avro records, you will need to add the `parquet-avro` dependency:
+To use the Avro format, you need to add the `parquet-avro` dependency to your project:
 
 ```xml
 <dependency>
@@ -61,83 +62,78 @@ To read Avro records, you will need to add the `parquet-avro` dependency:
 </dependency>
 ```
 
-This format is compatible with the new Source that can be used in both batch and streaming execution modes.
-Thus, you can use this format for two kinds of data:
+This format is compatible with the new Source and can be used in both batch and streaming execution modes.
+Therefore, you can use this format to process the following two kinds of data:
 
-- Bounded data: lists all files and reads them all.
-- Unbounded data: monitors a directory for new files that appear.
+- Bounded data: lists all files and reads them all.
+- Unbounded data: monitors the directory for new files as they appear.
 
 {{< hint info >}}
-When you start a File Source it is configured for bounded data by default.
-To configure the File Source for unbounded data, you must additionally call
-`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
+When you start a File Source, it is configured for bounded data by default.
+To configure the File Source for unbounded data, you must additionally call
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
 {{< /hint >}}
 
 **Vectorized reader**
 
 ```java
-
 // Parquet rows are decoded in batches
 FileSource.forBulkFileFormat(BulkFormat,Path...)
-
 // Monitor the Paths to read data as unbounded data
 FileSource.forBulkFileFormat(BulkFormat,Path...)
-        .monitorContinuously(Duration.ofMillis(5L))
-        .build();
-
+        .monitorContinuously(Duration.ofMillis(5L))
+        .build();
 ```
 
 **Avro Parquet reader**
 
 ```java
-
 // Parquet rows are decoded in batches
 FileSource.forRecordStreamFormat(StreamFormat,Path...)
-
 // Monitor the Paths to read data as unbounded data
 FileSource.forRecordStreamFormat(StreamFormat,Path...)
         .monitorContinuously(Duration.ofMillis(5L))
         .build();
-
-
 ```
 
 {{< hint info >}}
-Following examples are all configured for bounded data.
-To configure the File Source for unbounded data, you must additionally call
-`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
+The following examples are all configured for bounded data.
+To configure the File Source for unbounded data, you must additionally call
+`AbstractFileSource.AbstractFileSourceBuilder.monitorContinuously(Duration)`.
 {{< /hint >}}
 
+<a name="flink-rowdata"></a>
+
 ## Flink RowData
 
-In this example, you will create a DataStream containing Parquet records as Flink RowDatas. The schema is projected to read only the specified fields ("f7", "f4" and "f99").  
-Flink will read records in batches of 500 records. The first boolean parameter specifies that timestamp columns will be interpreted as UTC.
-The second boolean instructs the application that the projected Parquet fields names are case-sensitive.
-There is no watermark strategy defined as records do not contain event timestamps.
+In this example, you will create a DataStream containing Parquet records as Flink RowData. The schema is projected to read only the specified fields ("f7", "f4" and "f99").
+Flink will read records in batches of 500 records. The first boolean parameter specifies whether timestamp columns should be interpreted as UTC.
+The second boolean parameter specifies whether the projected Parquet field names should be matched case-sensitively.
+No watermark strategy is defined, as the records do not contain event timestamps.
 
 ```java
 final LogicalType[] fieldTypes =
-  new LogicalType[] {
-  new DoubleType(), new IntType(), new VarCharType()
-  };
+        new LogicalType[] {
+            new DoubleType(), new IntType(), new VarCharType()
+        };
 
 final ParquetColumnarRowInputFormat<FileSourceSplit> format =
-  new ParquetColumnarRowInputFormat<>(
-  new Configuration(),
-  RowType.of(fieldTypes, new String[] {"f7", "f4", "f99"}),
-  500,
-  false,
-  true);
+        new ParquetColumnarRowInputFormat<>(
+                new Configuration(),
+                RowType.of(fieldTypes, new String[] {"f7", "f4", "f99"}),
+                500,
+                false,
+                true);
 final FileSource<RowData> source =
-  FileSource.forBulkFileFormat(format,  /* Flink Path */)
-  .build();
+        FileSource.forBulkFileFormat(format, /* Flink Path */)
+                .build();
 final DataStream<RowData> stream =
-  env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
+        env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
 ```
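+
+If the source should instead monitor the input path continuously (see the hint above), the builder call can be added as sketched below. This reuses the `format` defined above; the 5 ms interval mirrors the earlier snippet and is illustrative only:
+
+```java
+// Hypothetical variant of the source above that keeps watching for new files.
+final FileSource<RowData> unboundedSource =
+        FileSource.forBulkFileFormat(format, /* Flink Path */)
+                .monitorContinuously(Duration.ofMillis(5L))
+                .build();
+```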
 
 ## Avro Records
 
-Flink supports producing three types of Avro records by reading Parquet files:
+Flink supports producing three types of Avro records by reading Parquet files:
 
 - [Generic record](https://avro.apache.org/docs/1.10.0/api/java/index.html)
 - [Specific record](https://avro.apache.org/docs/1.10.0/api/java/index.html)
@@ -145,62 +141,62 @@ Flink supports producing three types of Avro records by reading Parquet files:
 
 ### Generic record
 
-Avro schemas are defined using JSON. You can get more information about Avro schemas and types from the [Avro specification](https://avro.apache.org/docs/1.10.0/spec.html).
-This example uses an Avro schema example similar to the one described in the [official Avro tutorial](https://avro.apache.org/docs/1.10.0/gettingstartedjava.html):
+Avro schemas are defined using JSON. You can get more information about Avro schemas and types from the [Avro specification](https://avro.apache.org/docs/1.10.0/spec.html).
+This example uses an Avro schema similar to the one described in the [official Avro tutorial](https://avro.apache.org/docs/1.10.0/gettingstartedjava.html):
 
 ```json lines
 {"namespace": "example.avro",
- "type": "record",
- "name": "User",
- "fields": [
-     {"name": "name", "type": "string"},
-     {"name": "favoriteNumber",  "type": ["int", "null"]},
-     {"name": "favoriteColor", "type": ["string", "null"]}
- ]
+  "type": "record",
+  "name": "User",
+  "fields": [
+    {"name": "name", "type": "string"},
+    {"name": "favoriteNumber",  "type": ["int", "null"]},
+    {"name": "favoriteColor", "type": ["string", "null"]}
+  ]
 }
 ```
+
+This schema defines a record representing a user with three fields: name, favoriteNumber, and favoriteColor. You can find more details in the
+[record specification](https://avro.apache.org/docs/1.10.0/spec.html#schema_record) for how to define an Avro schema.
 
-This schema defines a record representing a user with three fields: name, favoriteNumber, and favoriteColor. You can find more details at [record specification](https://avro.apache.org/docs/1.10.0/spec.html#schema_record) for how to define an Avro schema.
-
-In the following example, you will create a DataStream containing Parquet records as Avro Generic records.
-It will parse the Avro schema based on the JSON string. There are many other ways to parse a schema, e.g. from java.io.File or java.io.InputStream. Please refer to [Avro Schema](https://avro.apache.org/docs/1.10.0/api/java/org/apache/avro/Schema.html) for details.
-After that, you will create an `AvroParquetRecordFormat` via `AvroParquetReaders` for Avro Generic records.
+In the following example, you will create a DataStream containing Parquet records as Avro Generic records.
+The Avro schema is parsed from a JSON string. There are many other ways to parse a schema, e.g. from java.io.File or java.io.InputStream; a sketch follows the example below.
+Please refer to [Avro Schema](https://avro.apache.org/docs/1.10.0/api/java/org/apache/avro/Schema.html) for details.
+After that, you will create an `AvroParquetRecordFormat` via `AvroParquetReaders` for Avro Generic records.
 
 ```java
-// parsing avro schema
+// parse the Avro schema
 final Schema schema =
         new Schema.Parser()
-            .parse(
-                    "{\"type\": \"record\", "
-                        + "\"name\": \"User\", "
-                        + "\"fields\": [\n"
-                        + "        {\"name\": \"name\", \"type\": \"string\" },\n"
-                        + "        {\"name\": \"favoriteNumber\",  \"type\": [\"int\", \"null\"] },\n"
-                        + "        {\"name\": \"favoriteColor\", \"type\": [\"string\", \"null\"] }\n"
-                        + "    ]\n"
-                        + "    }");
+                .parse(
+                        "{\"type\": \"record\", "
+                                + "\"name\": \"User\", "
+                                + "\"fields\": [\n"
+                                + "        {\"name\": \"name\", \"type\": \"string\" },\n"
+                                + "        {\"name\": \"favoriteNumber\",  \"type\": [\"int\", \"null\"] },\n"
+                                + "        {\"name\": \"favoriteColor\", \"type\": [\"string\", \"null\"] }\n"
+                                + "    ]\n"
+                                + "    }");
 
 final FileSource<GenericRecord> source =
         FileSource.forRecordStreamFormat(
-                AvroParquetReaders.forGenericRecord(schema), /* Flink Path */)
+                AvroParquetReaders.forGenericRecord(schema), /* Flink Path */)
         .build();
 
 final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
         env.enableCheckpointing(10L);
-        
+
 final DataStream<GenericRecord> stream =
         env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
 ```
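+
+As mentioned above, the schema does not have to be an inline JSON string. A minimal sketch of parsing the same schema from a file instead, where `user.avsc` is a hypothetical path used only for illustration:
+
+```java
+// Parse the Avro schema from an .avsc file on the local filesystem.
+final Schema schema = new Schema.Parser().parse(new File("user.avsc"));
+```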
 
 ### Specific record
 
-Based on the previously defined schema, you can generate classes by leveraging Avro code generation.
-Once the classes have been generated, there is no need to use the schema directly in your programs.
-You can either use `avro-tools.jar` to generate code manually or you could use the Avro Maven plugin to perform
-code generation on any .avsc files present in the configured source directory. Please refer to
-[Avro Getting Started](https://avro.apache.org/docs/1.10.0/gettingstartedjava.html) for more information.
+Based on the previously defined schema, you can generate classes by leveraging Avro code generation.
+Once the classes have been generated, there is no need to use the schema directly in your programs.
+You can either use `avro-tools.jar` to generate code manually, or you can use the Avro Maven plugin to perform code generation on any .avsc files present in the configured source directory.
+Please refer to [Avro Getting Started](https://avro.apache.org/docs/1.10.0/gettingstartedjava.html) for more information.
 
-The following example uses the example schema [testdata.avsc](https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/test/resources/avro/testdata.avsc):
+The following example uses the example schema [testdata.avsc](https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/test/resources/avro/testdata.avsc):
 
 ```json lines
 [
@@ -218,17 +214,17 @@ The following example uses the example schema [testdata.avsc](https://github.com
 ]
 ```
 
-You will use the Avro Maven plugin to generate the `Address` Java class:
+You will use the Avro Maven plugin to generate the `Address` Java class:
 
 ```java
 @org.apache.avro.specific.AvroGenerated
 public class Address extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
-    // generated code...
+    // generated code...
 }
 ```
 
-You will create an `AvroParquetRecordFormat` via `AvroParquetReaders` for Avro Specific record
-and then create a DataStream containing Parquet records as Avro Specific records.
+You will create an `AvroParquetRecordFormat` via `AvroParquetReaders` for Avro Specific records
+and then create a DataStream containing Parquet records as Avro Specific records.
 
 ```java
 final FileSource<GenericRecord> source =
@@ -245,12 +241,11 @@ final DataStream<GenericRecord> stream =
 
 ### Reflect record
 
-Beyond Avro Generic and Specific record that requires a predefined Avro schema,
-Flink also supports creating a DataStream from Parquet files based on existing Java POJO classes.
-In this case, Avro will use Java reflection to generate schemas and protocols for these POJO classes.
-Java types are mapped to Avro schemas, please refer to the [Avro reflect](https://avro.apache.org/docs/1.10.0/api/java/index.html) documentation for more details.
+Beyond Avro Generic and Specific records, which require a predefined Avro schema, Flink also supports creating a DataStream from Parquet files based on existing Java POJO classes.
+In this case, Avro will use Java reflection to generate schemas and protocols for these POJO classes.
+Please refer to the [Avro reflect](https://avro.apache.org/docs/1.10.0/api/java/index.html) documentation for more details on how Java types are mapped to Avro schemas.
 
-This example uses a simple Java POJO class [Datum](https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/avro/Datum.java):
+This example uses a simple Java POJO class [Datum](https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/avro/Datum.java):
 
 ```java
 public class Datum implements Serializable {
@@ -287,8 +282,8 @@ public class Datum implements Serializable {
 }
 ```
 
-You will create an `AvroParquetRecordFormat` via `AvroParquetReaders` for Avro Reflect record
-and then create a DataStream containing Parquet records as Avro Reflect records.
+You will create an `AvroParquetRecordFormat` via `AvroParquetReaders` for Avro Reflect records
+and then create a DataStream containing Parquet records as Avro Reflect records.
 
 ```java
 final FileSource<GenericRecord> source =
@@ -303,14 +298,12 @@ final DataStream<GenericRecord> stream =
         env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source");
 ```
 
-#### Prerequisite for Parquet files
+### Prerequisite for Parquet files
 
-In order to support reading Avro reflect records, the Parquet file must contain specific meta information.
-The Avro schema used for creating the Parquet data must contain a `namespace`,
-which will be used by the program to identify the concrete Java class for the reflection process.
+In order to support reading Avro Reflect records, the Parquet file must contain specific meta information. The Avro schema used for creating the Parquet data must contain a `namespace`,
+which will be used by the program to identify the concrete Java class for the reflection process.
 
-The following example shows the `User` schema used previously. But this time it contains a namespace
-pointing to the location(in this case the package), where the `User` class for the reflection could be found.
+The following example shows the `User` schema used previously, but this time it contains a namespace pointing to the location (in this case the package) where the `User` class for the reflection can be found.
 
 ```java
 // avro schema with namespace
@@ -324,10 +317,9 @@ final String schema =
                         + "        {\"name\": \"favoriteColor\", \"type\": [\"string\", \"null\"] }\n"
                         + "    ]\n"
                         + "    }";
-
 ```
 
-Parquet files created with this schema will contain meta information like:
+Parquet files created with this schema will contain meta information like:
 
 ```text
 creator:        parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94)
@@ -349,45 +341,44 @@ favoriteColor:   BINARY UNCOMPRESSED DO:0 FPO:92 SZ:55/55/1.00 VC:3 ENC:RLE,PLAI
 
 ```
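+
+For reference, such a file could have been produced with `AvroParquetWriter` from the `parquet-avro` dependency. This is a minimal, hypothetical sketch that reuses the `schema` string above; the output path is illustrative:
+
+```java
+final Schema avroSchema = new Schema.Parser().parse(schema);
+
+final GenericRecord user = new GenericData.Record(avroSchema);
+user.put("name", "Peter");
+user.put("favoriteNumber", 7);
+user.put("favoriteColor", "red");
+
+// "/tmp/users.parquet" is a hypothetical output path used only for illustration.
+try (ParquetWriter<GenericRecord> writer =
+        AvroParquetWriter.<GenericRecord>builder(
+                new org.apache.hadoop.fs.Path("/tmp/users.parquet"))
+                .withSchema(avroSchema)
+                .build()) {
+    // The writer embeds the Avro schema, including the namespace, in the file footer.
+    writer.write(user);
+}
+```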
 
-With the `User` class defined in the package org.apache.flink.formats.parquet.avro:
+With the `User` class defined in the package `org.apache.flink.formats.parquet.avro`:
 
 ```java
 public class User {
-        private String name;
-        private Integer favoriteNumber;
-        private String favoriteColor;
+    private String name;
+    private Integer favoriteNumber;
+    private String favoriteColor;
 
-        public User() {}
+    public User() {}
 
-        public User(String name, Integer favoriteNumber, String favoriteColor) {
-            this.name = name;
-            this.favoriteNumber = favoriteNumber;
-            this.favoriteColor = favoriteColor;
-        }
+    public User(String name, Integer favoriteNumber, String favoriteColor) {
+        this.name = name;
+        this.favoriteNumber = favoriteNumber;
+        this.favoriteColor = favoriteColor;
+    }
 
-        public String getName() {
-            return name;
-        }
+    public String getName() {
+        return name;
+    }
 
-        public Integer getFavoriteNumber() {
-            return favoriteNumber;
-        }
+    public Integer getFavoriteNumber() {
+        return favoriteNumber;
+    }
 
-        public String getFavoriteColor() {
-            return favoriteColor;
-        }
+    public String getFavoriteColor() {
+        return favoriteColor;
     }
+}
 
 ```
 
-you can write the following program to read Avro Reflect records of User type from parquet files:
+you can write the following program to read Avro Reflect records of the User type from Parquet files:
 
 ```java
 final FileSource<GenericRecord> source =
         FileSource.forRecordStreamFormat(
         AvroParquetReaders.forReflectRecord(User.class), /* Flink Path */)
         .build();
-
 final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
         env.enableCheckpointing(10L);