Posted to commits@seatunnel.apache.org by ki...@apache.org on 2022/08/04 07:08:18 UTC

[incubator-seatunnel] branch dev updated: optimize file sink doc (#2363)

This is an automated email from the ASF dual-hosted git repository.

kirs pushed a commit to branch dev
in repository https://gitbox.apache.org/repos/asf/incubator-seatunnel.git


The following commit(s) were added to refs/heads/dev by this push:
     new 04b586f2b optimize file sink doc (#2363)
04b586f2b is described below

commit 04b586f2b81648ac9cc88b70a09f979280555606
Author: Eric <ga...@gmail.com>
AuthorDate: Thu Aug 4 15:08:13 2022 +0800

    optimize file sink doc (#2363)
---
 docs/en/connector-v2/sink/File.mdx     | 266 ---------------------------------
 docs/en/connector-v2/sink/HdfsFile.md  | 141 +++++++++++++++++
 docs/en/connector-v2/sink/LocalFile.md | 139 +++++++++++++++++
 3 files changed, 280 insertions(+), 266 deletions(-)

diff --git a/docs/en/connector-v2/sink/File.mdx b/docs/en/connector-v2/sink/File.mdx
deleted file mode 100644
index 6497ef554..000000000
--- a/docs/en/connector-v2/sink/File.mdx
+++ /dev/null
@@ -1,266 +0,0 @@
-import Tabs from '@theme/Tabs';
-import TabItem from '@theme/TabItem';
-
-# File
-
-## Description
-
-Output data to a local, HDFS, or S3 file.
-
-## Options
-
-<Tabs
-    groupId="engine-type"
-    defaultValue="LocalFile"
-    values={[
-        {label: 'LocalFile', value: 'LocalFile'},
-        {label: 'HdfsFile', value: 'HdfsFile'},
-    ]}>
-    <TabItem value="LocalFile">
-
-| name                              | type   | required | default value                                                 |
-| --------------------------------- | ------ | -------- | ------------------------------------------------------------- |
-| path                              | string | yes      | -                                                             |
-| file_name_expression              | string | no       | "${transactionId}"                                            |
-| file_format                       | string | no       | "text"                                                        |
-| filename_time_format              | string | no       | "yyyy.MM.dd"                                                  |
-| field_delimiter                   | string | no       | '\001'                                                        |
-| row_delimiter                     | string | no       | "\n"                                                          |
-| partition_by                      | array  | no       | -                                                             |
-| partition_dir_expression          | string | no       | "\${k0}=\${v0}\/\${k1}=\${v1}\/...\/\${kn}=\${vn}\/"          |
-| is_partition_field_write_in_file  | boolean| no       | false                                                         |
-| sink_columns                      | array  | no       | When this parameter is empty, all fields are sink columns     |
-| is_enable_transaction             | boolean| no       | true                                                          |
-| save_mode                         | string | no       | "error"                                                       |
-
-### path [string]
-
-The target directory path is required. An HDFS file path starts with `hdfs://`, and a local file path starts with `file://`.
-
-### file_name_expression [string]
-
-`file_name_expression` describes the expression used to generate the names of the files created under `path`. You can use the variables `${now}` or `${uuid}` in `file_name_expression`, like `test_${uuid}_${now}`.
-`${now}` represents the current time, and its format can be defined by the option `filename_time_format`.
-
-Please note that if `is_enable_transaction` is `true`, `${transactionId}_` will be automatically prepended to the file name.
-
-### file_format [string]
-
-The following file types are supported:
-
-`text` `csv` `parquet`
-
-Please note that the final file name ends with the file format's suffix; the suffix of a text file is `txt`.
-
-### filename_time_format [string]
-
-When `file_name_expression` contains a time variable such as `xxxx-${now}`, `filename_time_format` specifies the time format used in the file name, and the default value is `yyyy.MM.dd`. Commonly used time formats are listed as follows:
-
-| Symbol | Description        |
-| ------ | ------------------ |
-| y      | Year               |
-| M      | Month              |
-| d      | Day of month       |
-| H      | Hour in day (0-23) |
-| m      | Minute in hour     |
-| s      | Second in minute   |
-
-See [Java SimpleDateFormat](https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) for detailed time format syntax.
-
-### field_delimiter [string]
-
-The separator between columns in a row of data.
-
-### row_delimiter [string]
-
-The separator between rows in a file.
-
-### partition_by [array]
-
-Partition the data based on the selected fields.
-
-### partition_dir_expression [string]
-
-If the `partition_by` is specified, we will generate the corresponding partition directory based on the partition information, and the final file will be placed in the partition directory.
-
-Default `partition_dir_expression` is `${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/`. `k0` is the first partition field and `v0` is the value of the first partition field.
-
-### is_partition_field_write_in_file [boolean]
-
-If `is_partition_field_write_in_file` is `true`, the partition fields and their values will be written into the data file.
-
-For example, if you want to write a Hive data file, its value should be `false`.
-
-### sink_columns [array]
-
-The columns that need to be written to the file; the default value is all of the columns obtained from the `Transform` or `Source`.
-The order of the fields determines the order in which the file is actually written.
-
-### is_enable_transaction [boolean]
-
-If `is_enable_transaction` is `true`, we will ensure that data is not lost or duplicated when it is written to the target directory.
-
-Please note that if `is_enable_transaction` is `true`, `${transactionId}_` will be automatically prepended to the file name.
-
-Only `true` is supported at the moment.
-
-### save_mode [string]
-
-Storage mode; currently supports `overwrite`, `append`, `ignore` and `error`. For the specific meaning of each mode, see [save-modes](https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes).
-
-A streaming job does not support `overwrite`.
-
-</TabItem>
-<TabItem value="HdfsFile">
-
-In order to use this connector, you must ensure that your Spark/Flink cluster has already integrated Hadoop. The tested Hadoop version is 2.x.
-
-| name                              | type   | required | default value                                                 |
-| --------------------------------- | ------ | -------- | ------------------------------------------------------------- |
-| path                              | string | yes      | -                                                             |
-| file_name_expression              | string | no       | "${transactionId}"                                            |
-| file_format                       | string | no       | "text"                                                        |
-| filename_time_format              | string | no       | "yyyy.MM.dd"                                                  |
-| field_delimiter                   | string | no       | '\001'                                                        |
-| row_delimiter                     | string | no       | "\n"                                                          |
-| partition_by                      | array  | no       | -                                                             |
-| partition_dir_expression          | string | no       | "\${k0}=\${v0}\/\${k1}=\${v1}\/...\/\${kn}=\${vn}\/"          |
-| is_partition_field_write_in_file  | boolean| no       | false                                                         |
-| sink_columns                      | array  | no       | When this parameter is empty, all fields are sink columns     |
-| is_enable_transaction             | boolean| no       | true                                                          |
-| save_mode                         | string | no       | "error"                                                       |
-
-### path [string]
-
-The target directory path is required. An HDFS file path starts with `hdfs://`, and a local file path starts with `file://`.
-
-### file_name_expression [string]
-
-`file_name_expression` describes the expression used to generate the names of the files created under `path`. You can use the variables `${now}` or `${uuid}` in `file_name_expression`, like `test_${uuid}_${now}`.
-`${now}` represents the current time, and its format can be defined by the option `filename_time_format`.
-
-Please note that if `is_enable_transaction` is `true`, `${transactionId}_` will be automatically prepended to the file name.
-
-### file_format [string]
-
-The following file types are supported:
-
-`text` `csv` `parquet`
-
-Please note that the final file name ends with the file format's suffix; the suffix of a text file is `txt`.
-
-### filename_time_format [string]
-
-When `file_name_expression` contains a time variable such as `xxxx-${now}`, `filename_time_format` specifies the time format used in the file name, and the default value is `yyyy.MM.dd`. Commonly used time formats are listed as follows:
-
-| Symbol | Description        |
-| ------ | ------------------ |
-| y      | Year               |
-| M      | Month              |
-| d      | Day of month       |
-| H      | Hour in day (0-23) |
-| m      | Minute in hour     |
-| s      | Second in minute   |
-
-See [Java SimpleDateFormat](https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) for detailed time format syntax.
-
-### field_delimiter [string]
-
-The separator between columns in a row of data.
-
-### row_delimiter [string]
-
-The separator between rows in a file.
-
-### partition_by [array]
-
-Partition the data based on the selected fields.
-
-### partition_dir_expression [string]
-
-If the `partition_by` is specified, we will generate the corresponding partition directory based on the partition information, and the final file will be placed in the partition directory.
-
-Default `partition_dir_expression` is `${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/`. `k0` is the first partition field and `v0` is the value of the first partition field.
-
-### is_partition_field_write_in_file [boolean]
-
-If `is_partition_field_write_in_file` is `true`, the partition fields and their values will be written into the data file.
-
-For example, if you want to write a Hive data file, its value should be `false`.
-
-### sink_columns [array]
-
-The columns that need to be written to the file; the default value is all of the columns obtained from the `Transform` or `Source`.
-The order of the fields determines the order in which the file is actually written.
-
-### is_enable_transaction [boolean]
-
-If `is_enable_transaction` is `true`, we will ensure that data is not lost or duplicated when it is written to the target directory.
-
-Please note that if `is_enable_transaction` is `true`, `${transactionId}_` will be automatically prepended to the file name.
-
-Only `true` is supported at the moment.
-
-### save_mode [string]
-
-Storage mode; currently supports `overwrite`, `append`, `ignore` and `error`. For the specific meaning of each mode, see [save-modes](https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes).
-
-A streaming job does not support `overwrite`.
-</TabItem>
-</Tabs>
-
-## Example
-
-<Tabs
-    groupId="engine-type"
-    defaultValue="LocalFile"
-    values={[
-        {label: 'LocalFile', value: 'LocalFile'},
-        {label: 'HdfsFile', value: 'HdfsFile'},
-    ]}>
-<TabItem value="LocalFile">
-
-```bash
-
-LocalFile {
-    path="file:///tmp/hive/warehouse/test2"
-    field_delimiter="\t"
-    row_delimiter="\n"
-    partition_by=["age"]
-    partition_dir_expression="${k0}=${v0}"
-    is_partition_field_write_in_file=true
-    file_name_expression="${transactionId}_${now}"
-    file_format="text"
-    sink_columns=["name","age"]
-    filename_time_format="yyyy.MM.dd"
-    is_enable_transaction=true
-    save_mode="error"
-}
-
-```
-
-</TabItem>
-
-<TabItem value="HdfsFile">
-
-```bash
-
-HdfsFile {
-    path="file:///tmp/hive/warehouse/test2"
-    field_delimiter="\t"
-    row_delimiter="\n"
-    partition_by=["age"]
-    partition_dir_expression="${k0}=${v0}"
-    is_partition_field_write_in_file=true
-    file_name_expression="${transactionId}_${now}"
-    file_format="text"
-    sink_columns=["name","age"]
-    filename_time_format="yyyy.MM.dd"
-    is_enable_transaction=true
-    save_mode="error"
-}
-
-```
-
-</TabItem>
-</Tabs>
diff --git a/docs/en/connector-v2/sink/HdfsFile.md b/docs/en/connector-v2/sink/HdfsFile.md
new file mode 100644
index 000000000..36d9f6b35
--- /dev/null
+++ b/docs/en/connector-v2/sink/HdfsFile.md
@@ -0,0 +1,141 @@
+# HdfsFile
+
+## Description
+
+Output data to an HDFS file. Both bounded and unbounded jobs are supported.
+
+## Options
+
+In order to use this connector, you must ensure that your Spark/Flink cluster has already integrated Hadoop. The tested Hadoop version is 2.x.
+
+| name                              | type   | required | default value                                                 |
+| --------------------------------- | ------ | -------- | ------------------------------------------------------------- |
+| path                              | string | yes      | -                                                             |
+| file_name_expression              | string | no       | "${transactionId}"                                            |
+| file_format                       | string | no       | "text"                                                        |
+| filename_time_format              | string | no       | "yyyy.MM.dd"                                                  |
+| field_delimiter                   | string | no       | '\001'                                                        |
+| row_delimiter                     | string | no       | "\n"                                                          |
+| partition_by                      | array  | no       | -                                                             |
+| partition_dir_expression          | string | no       | "\${k0}=\${v0}\/\${k1}=\${v1}\/...\/\${kn}=\${vn}\/"          |
+| is_partition_field_write_in_file  | boolean| no       | false                                                         |
+| sink_columns                      | array  | no       | When this parameter is empty, all fields are sink columns     |
+| is_enable_transaction             | boolean| no       | true                                                          |
+| save_mode                         | string | no       | "error"                                                       |
+
+### path [string]
+
+The target directory path is required. An HDFS file path starts with `hdfs://`.
+
+### file_name_expression [string]
+
+`file_name_expression` describes the expression used to generate the names of the files created under `path`. You can use the variables `${now}` or `${uuid}` in `file_name_expression`, like `test_${uuid}_${now}`.
+`${now}` represents the current time, and its format can be defined by the option `filename_time_format`.
+
+Please note that if `is_enable_transaction` is `true`, `${transactionId}_` will be automatically prepended to the file name.
+
+### file_format [string]
+
+The following file types are supported:
+
+`text` `csv` `parquet`
+
+Please note that the final file name ends with the file format's suffix; the suffix of a text file is `txt`.
+
+### filename_time_format [string]
+
+When `file_name_expression` contains a time variable such as `xxxx-${now}`, `filename_time_format` specifies the time format used in the file name, and the default value is `yyyy.MM.dd`. Commonly used time formats are listed as follows:
+
+| Symbol | Description        |
+| ------ | ------------------ |
+| y      | Year               |
+| M      | Month              |
+| d      | Day of month       |
+| H      | Hour in day (0-23) |
+| m      | Minute in hour     |
+| s      | Second in minute   |
+
+See [Java SimpleDateFormat](https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) for detailed time format syntax.
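+
+As an illustrative sketch (the `path` value and file prefix here are hypothetical), with the settings below a file created on August 4th, 2022 would get `2022.08.04` as the value of `${now}`, so the generated file name would contain `test-2022.08.04` (with `${transactionId}_` prepended, since transactions are enabled by default):
+
+```bash
+
+HdfsFile {
+    # hypothetical target directory
+    path="hdfs://mycluster/tmp/test"
+    file_name_expression="test-${now}"
+    filename_time_format="yyyy.MM.dd"
+    file_format="text"
+}
+
+```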
+
+### field_delimiter [string]
+
+The separator between columns in a row of data. Only needed by the `text` and `csv` file formats.
+
+### row_delimiter [string]
+
+The separator between rows in a file. Only needed by the `text` and `csv` file formats.
+
+### partition_by [array]
+
+Partition the data based on the selected fields.
+
+### partition_dir_expression [string]
+
+If the `partition_by` is specified, we will generate the corresponding partition directory based on the partition information, and the final file will be placed in the partition directory.
+
+Default `partition_dir_expression` is `${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/`. `k0` is the first partition field and `v0` is the value of the first partition field.
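+
+For example, with the sketch below (the field names, values and `path` are hypothetical), a row whose `age` is `20` and whose `sex` is `male` would be placed under the directory `hdfs://mycluster/tmp/test/age=20/sex=male/`:
+
+```bash
+
+HdfsFile {
+    # hypothetical target directory
+    path="hdfs://mycluster/tmp/test"
+    partition_by=["age","sex"]
+    partition_dir_expression="${k0}=${v0}/${k1}=${v1}/"
+}
+
+```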
+
+### is_partition_field_write_in_file [boolean]
+
+If `is_partition_field_write_in_file` is `true`, the partition fields and their values will be written into the data file.
+
+For example, if you want to write a Hive data file, its value should be `false`.
+
+### sink_columns [array]
+
+The columns that need to be written to the file; the default value is all of the columns obtained from the `Transform` or `Source`.
+The order of the fields determines the order in which the file is actually written.
+
+### is_enable_transaction [boolean]
+
+If `is_enable_transaction` is `true`, we will ensure that data is not lost or duplicated when it is written to the target directory.
+
+Please note that if `is_enable_transaction` is `true`, `${transactionId}_` will be automatically prepended to the file name.
+
+Only `true` is supported at the moment.
+
+### save_mode [string]
+
+Storage mode; currently only `overwrite` is supported, which means the old file is deleted when a new file has the same name.
+
+If `is_enable_transaction` is `true`, we basically will not encounter duplicate file names, because the transaction id is added to each file name.
+
+For the general meaning of save modes, see [save-modes](https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes).
+
+## Example
+
+```bash
+
+HdfsFile {
+    path="hdfs://mycluster/tmp/hive/warehouse/test2"
+    field_delimiter="\t"
+    row_delimiter="\n"
+    partition_by=["age"]
+    partition_dir_expression="${k0}=${v0}"
+    is_partition_field_write_in_file=true
+    file_name_expression="${transactionId}_${now}"
+    file_format="text"
+    sink_columns=["name","age"]
+    filename_time_format="yyyy.MM.dd"
+    is_enable_transaction=true
+}
+
+```
+
+For parquet file format
+
+```bash
+
+HdfsFile {
+    path="hdfs://mycluster/tmp/hive/warehouse/test2"
+    partition_by=["age"]
+    partition_dir_expression="${k0}=${v0}"
+    is_partition_field_write_in_file=true
+    file_name_expression="${transactionId}_${now}"
+    file_format="parquet"
+    sink_columns=["name","age"]
+    filename_time_format="yyyy.MM.dd"
+    is_enable_transaction=true
+}
+
+```
diff --git a/docs/en/connector-v2/sink/LocalFile.md b/docs/en/connector-v2/sink/LocalFile.md
new file mode 100644
index 000000000..ffb0efb8c
--- /dev/null
+++ b/docs/en/connector-v2/sink/LocalFile.md
@@ -0,0 +1,139 @@
+# LocalFile
+
+## Description
+
+Output data to a local file. Both bounded and unbounded jobs are supported.
+
+## Options
+
+| name                              | type   | required | default value                                                 |
+| --------------------------------- | ------ | -------- | ------------------------------------------------------------- |
+| path                              | string | yes      | -                                                             |
+| file_name_expression              | string | no       | "${transactionId}"                                            |
+| file_format                       | string | no       | "text"                                                        |
+| filename_time_format              | string | no       | "yyyy.MM.dd"                                                  |
+| field_delimiter                   | string | no       | '\001'                                                        |
+| row_delimiter                     | string | no       | "\n"                                                          |
+| partition_by                      | array  | no       | -                                                             |
+| partition_dir_expression          | string | no       | "\${k0}=\${v0}\/\${k1}=\${v1}\/...\/\${kn}=\${vn}\/"          |
+| is_partition_field_write_in_file  | boolean| no       | false                                                         |
+| sink_columns                      | array  | no       | When this parameter is empty, all fields are sink columns     |
+| is_enable_transaction             | boolean| no       | true                                                          |
+| save_mode                         | string | no       | "error"                                                       |
+
+### path [string]
+
+The target directory path is required. A local file path starts with `file://`.
+
+### file_name_expression [string]
+
+`file_name_expression` describes the expression used to generate the names of the files created under `path`. You can use the variables `${now}` or `${uuid}` in `file_name_expression`, like `test_${uuid}_${now}`.
+`${now}` represents the current time, and its format can be defined by the option `filename_time_format`.
+
+Please note that if `is_enable_transaction` is `true`, `${transactionId}_` will be automatically prepended to the file name.
+
+### file_format [string]
+
+The following file types are supported:
+
+`text` `csv` `parquet`
+
+Please note that the final file name ends with the file format's suffix; the suffix of a text file is `txt`.
+
+### filename_time_format [string]
+
+When `file_name_expression` contains a time variable such as `xxxx-${now}`, `filename_time_format` specifies the time format used in the file name, and the default value is `yyyy.MM.dd`. Commonly used time formats are listed as follows:
+
+| Symbol | Description        |
+| ------ | ------------------ |
+| y      | Year               |
+| M      | Month              |
+| d      | Day of month       |
+| H      | Hour in day (0-23) |
+| m      | Minute in hour     |
+| s      | Second in minute   |
+
+See [Java SimpleDateFormat](https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) for detailed time format syntax.
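+
+As an illustrative sketch (the `path` value and file prefix here are hypothetical), with the settings below a file created on August 4th, 2022 would get `2022.08.04` as the value of `${now}`, so the generated file name would contain `test-2022.08.04` (with `${transactionId}_` prepended, since transactions are enabled by default):
+
+```bash
+
+LocalFile {
+    # hypothetical target directory
+    path="file:///tmp/test"
+    file_name_expression="test-${now}"
+    filename_time_format="yyyy.MM.dd"
+    file_format="text"
+}
+
+```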
+
+### field_delimiter [string]
+
+The separator between columns in a row of data. Only needed by the `text` and `csv` file formats.
+
+### row_delimiter [string]
+
+The separator between rows in a file. Only needed by the `text` and `csv` file formats.
+
+### partition_by [array]
+
+Partition the data based on the selected fields.
+
+### partition_dir_expression [string]
+
+If the `partition_by` is specified, we will generate the corresponding partition directory based on the partition information, and the final file will be placed in the partition directory.
+
+Default `partition_dir_expression` is `${k0}=${v0}/${k1}=${v1}/.../${kn}=${vn}/`. `k0` is the first partition field and `v0` is the value of the first partition field.
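+
+For example, with the sketch below (the field names, values and `path` are hypothetical), a row whose `age` is `20` and whose `sex` is `male` would be placed under the directory `file:///tmp/test/age=20/sex=male/`:
+
+```bash
+
+LocalFile {
+    # hypothetical target directory
+    path="file:///tmp/test"
+    partition_by=["age","sex"]
+    partition_dir_expression="${k0}=${v0}/${k1}=${v1}/"
+}
+
+```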
+
+### is_partition_field_write_in_file [boolean]
+
+If `is_partition_field_write_in_file` is `true`, the partition fields and their values will be written into the data file.
+
+For example, if you want to write a Hive data file, its value should be `false`.
+
+### sink_columns [array]
+
+The columns that need to be written to the file; the default value is all of the columns obtained from the `Transform` or `Source`.
+The order of the fields determines the order in which the file is actually written.
+
+### is_enable_transaction [boolean]
+
+If `is_enable_transaction` is `true`, we will ensure that data is not lost or duplicated when it is written to the target directory.
+
+Please note that if `is_enable_transaction` is `true`, `${transactionId}_` will be automatically prepended to the file name.
+
+Only `true` is supported at the moment.
+
+### save_mode [string]
+
+Storage mode; currently only `overwrite` is supported, which means the old file is deleted when a new file has the same name.
+
+If `is_enable_transaction` is `true`, we basically will not encounter duplicate file names, because the transaction id is added to each file name.
+
+For the general meaning of save modes, see [save-modes](https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes).
+
+## Example
+
+```bash
+
+LocalFile {
+    path="file:///tmp/hive/warehouse/test2"
+    field_delimiter="\t"
+    row_delimiter="\n"
+    partition_by=["age"]
+    partition_dir_expression="${k0}=${v0}"
+    is_partition_field_write_in_file=true
+    file_name_expression="${transactionId}_${now}"
+    file_format="text"
+    sink_columns=["name","age"]
+    filename_time_format="yyyy.MM.dd"
+    is_enable_transaction=true
+}
+
+```
+
+For parquet file format
+
+```bash
+
+LocalFile {
+    path="file:///tmp/hive/warehouse/test2"
+    partition_by=["age"]
+    partition_dir_expression="${k0}=${v0}"
+    is_partition_field_write_in_file=true
+    file_name_expression="${transactionId}_${now}"
+    file_format="parquet"
+    sink_columns=["name","age"]
+    filename_time_format="yyyy.MM.dd"
+    is_enable_transaction=true
+}
+
+```