Posted to commits@inlong.apache.org by do...@apache.org on 2023/03/22 10:09:29 UTC

[inlong-website] branch master updated: [INLONG-725][Doc] Dirty data archiving options description for Doris connector (#726)

This is an automated email from the ASF dual-hosted git repository.

dockerzhang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/inlong-website.git


The following commit(s) were added to refs/heads/master by this push:
     new 36c33e2f02 [INLONG-725][Doc] Dirty data archiving options description for Doris connector (#726)
36c33e2f02 is described below

commit 36c33e2f0238337dba388df31fd4628a4e3104a4
Author: Liao Rui <li...@users.noreply.github.com>
AuthorDate: Wed Mar 22 18:09:21 2023 +0800

    [INLONG-725][Doc] Dirty data archiving options description for Doris connector (#726)
    
    Co-authored-by: ryanrliao <ry...@tencent.com>
---
 docs/data_node/load_node/doris.md                  | 25 ++++++++++++++++++----
 .../current/data_node/load_node/doris.md           | 17 ++++++++++++++-
 2 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/docs/data_node/load_node/doris.md b/docs/data_node/load_node/doris.md
index f6f9b505c9..95505ace3e 100644
--- a/docs/data_node/load_node/doris.md
+++ b/docs/data_node/load_node/doris.md
@@ -63,7 +63,7 @@ mysql> select * from cdc_mysql_source;
 +----+----------+----+
 3 rows in set (0.07 sec)
 ```
-- For Multi-sink: Create tables `user_db.user_id_name`、`user_db.user_id_name` in the MySQL database. The command is as follows:
+- For Multi-sink: Create tables `user_db.user_id_name`、`user_db.user_id_score` in the MySQL database. The command is as follows:
 ```sql
 [root@fe001 ~]# mysql -u root -h localhost -P 3306 -p123456
 mysql> use user_db;
@@ -296,14 +296,31 @@ TODO: It will be supported in the future.
 | sink.batch.size                   | optional     | 10000             | int     | Maximum number of lines in a single write BE                                                                                                                                                                                                                                                                                                                                                                            |
 | sink.max-retries                  | optional     | 1                 | int     | Number of retries after writing BE failed                                                                                                                                                                                                                                                                                                                                                                               |
 | sink.batch.interval               | optional     | 10s               | string  | The flush interval, after which the asynchronous thread will write the data in the cache to BE. The default value is 10 second, and the time units are ms, s, min, h, and d. Set to 0 to turn off periodic writing.                                                                                                                                                                                                     |
-| sink.properties.*                 | optional     | (none)            | string  | The stream load parameters.<br /> <br /> eg:<br /> sink.properties.column_separator' = ','<br /> <br />  Setting 'sink.properties.escape_delimiters' = 'true' if you want to use a control char as a separator, so that such as '\\x01' will translate to binary 0x01<br /><br />  Support JSON format import, you need to enable both 'sink.properties.format' ='json' and 'sink.properties.strip_outer_array' ='true' |
-| sink.enable-delete                | optional     | true              | boolean | Whether to enable deletion. This option requires Doris table to enable batch delete function (0.15+ version is enabled by default), and only supports Uniq model.                                                                                                                                                                                                                                                       |
+| sink.properties.*                 | optional     | (none)            | string  | The stream load parameters.<br /> <br /> e.g.:<br /> 'sink.properties.column_separator' = ','<br /> <br />  Set 'sink.properties.escape_delimiters' = 'true' if you want to use a control character as a separator, so that e.g. '\\x01' will be translated to the binary 0x01<br /><br />  To import data in `JSON` format, enable both 'sink.properties.format' = 'json' and 'sink.properties.strip_outer_array' = 'true' [...]
+| sink.enable-delete                | optional     | true              | boolean | Whether to enable deletion. This option requires the Doris table to enable the batch delete function (enabled by default since version 0.15), and only supports the Unique model.                                                                                                                                                                                                                                                       |
 | sink.multiple.enable              | optional   | false             | boolean  | Determine whether to support multiple sink writing, default is `false`. when `sink.multiple.enable` is `true`, need `sink.multiple.format`、`sink.multiple.database-pattern`、`sink.multiple.table-pattern` be correctly set.        |
 | sink.multiple.format              | optional   | (none)            | string   | The format of multiple sink, it represents the real format of the raw binary data. can be `canal-json` or `debezium-json` at present. See [kafka -- Dynamic Topic Extraction](https://github.com/apache/inlong-website/blob/master/docs/data_node/load_node/kafka.md) for more details.  |
 | sink.multiple.database-pattern    | optional   | (none)            | string   | Extract database name from the raw binary data, this is only used in the multiple sink writing scenario.                 | 
 | sink.multiple.table-pattern       | optional   | (none)            | string   | Extract table name from the raw binary data, this is only used in the multiple sink writing scenario.                           |
| sink.multiple.ignore-single-table-errors | optional | true         | boolean  | Whether to ignore single-table errors in the multiple sink writing scenario. When it is `true`, the sink continues when one table throws an exception, and only the failing table's sink is stopped. When it is `false`, the whole sink stops when one table throws an exception.     |
-| inlong.metric.labels | optional | (none) | String | Inlong metric label, format of value is groupId=`{groupId}`&streamId=`{streamId}`&nodeId=`{nodeId}`. |
+| inlong.metric.labels | optional | (none) | string | Inlong metric label, format of value is groupId=`{groupId}`&streamId=`{streamId}`&nodeId=`{nodeId}`. |
+| sink.multiple.schema-update.policy | optional | (none) | string | If the sink data fields do not match the Doris table, for example the table does not exist or string data exceeds the length limit, the Doris server will throw an exception.<br /><br /> When this option is `THROW_WITH_STOP`, the exception is thrown up to the Flink framework, which will restart the task automatically and try to resume it.<br /><br /> When this option is `STOP_PARTIAL`, the Doris connector stops writing into this table, other tables [...]
+| dirty.ignore | optional | (none)| boolean | When writing data into a Doris table, errors may be thrown by the Doris server because the table does not exist or data exceeds the length limit. <br /><br /> When this option is `true` and the `dirty.side-output.*` properties are configured correctly, dirty data can be written to Amazon S3 or Tencent Cloud COS storage. Dirty data metrics will also be collected automatically. <br /><br /> When this option is `false`, only dirty data metrics will be collected, but dirty dat [...]
+| dirty.side-output.enable | optional | (none)| boolean | When this option is `true` and the other S3 or COS options are configured correctly, dirty data archiving works. When `false`, dirty data archiving does not work. |
+| dirty.side-output.connector | optional | (none)| string | Only `s3` and `log` are supported now.<br /><br /> When `log`, the Doris connector only logs the dirty data and does not archive it.<br /><br /> When `s3`, the Doris connector can write dirty data to S3 or COS. |
+| dirty.side-output.s3.bucket | optional | (none)| string | The bucket name of S3 or COS |
+| dirty.side-output.s3.endpoint | optional | (none)| string | The endpoint of S3 or COS |
+| dirty.side-output.s3.key | optional | (none)| string | The key of S3 or COS |
+| dirty.side-output.s3.region | optional | (none)| string | The region of S3 or COS |
+| dirty.side-output.line-delimiter | optional | (none)| string | The line delimiter of dirty data |
+| dirty.side-output.field-delimiter | optional | (none)| string | The field delimiter of dirty data |
+| dirty.side-output.s3.secret-key-id | optional | (none)| string | The secret key of S3 or COS |
+| dirty.side-output.s3.access-key-id | optional | (none)| string | The access key of S3 or COS |
+| dirty.side-output.format | optional | (none)| string | The format of dirty data archiving; supports `json` or `csv` |
+| dirty.side-output.log-tag | optional | (none)| string | The log tag of dirty data. The Doris connector uses tags to distinguish which Doris database and table the dirty data belongs to. |
+| dirty.identifier | optional | (none)| string | The file name of the dirty data written to S3 or COS. |
+| dirty.side-output.labels | optional | (none)| string | Every dirty data line contains label and business data fields. The label comes first, followed by the business data. |
+
 
 ## Data Type Mapping
 
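The dirty-data options above work together as a set. As a rough illustration only (not part of this commit's diff), a Flink SQL Doris sink DDL combining them might look like the sketch below; the connector name `doris-inlong`, the table schema, and all endpoint/credential values are placeholder assumptions, not values taken from the documentation.

```sql
-- Hypothetical sketch: Doris sink with dirty-data archiving to S3/COS.
-- All names, endpoints, and credentials below are placeholders.
CREATE TABLE `doris_load_node` (
    `id` INT,
    `name` STRING,
    PRIMARY KEY (`id`) NOT ENFORCED
) WITH (
    'connector' = 'doris-inlong',               -- assumed connector identifier
    'fenodes' = 'localhost:8030',
    'table.identifier' = 'user_db.user_id_name',
    'username' = 'root',
    'password' = '000000',
    -- stream load parameters forwarded via sink.properties.*
    'sink.properties.format' = 'json',
    'sink.properties.strip_outer_array' = 'true',
    -- archive dirty rows instead of failing the whole job
    'dirty.ignore' = 'true',
    'dirty.side-output.enable' = 'true',
    'dirty.side-output.connector' = 's3',
    'dirty.side-output.format' = 'csv',
    'dirty.side-output.s3.bucket' = 'example-bucket',
    'dirty.side-output.s3.endpoint' = 's3.example-region.amazonaws.com',
    'dirty.side-output.s3.region' = 'example-region',
    'dirty.side-output.s3.access-key-id' = 'EXAMPLE_ACCESS_KEY',
    'dirty.side-output.s3.secret-key-id' = 'EXAMPLE_SECRET_KEY',
    'dirty.identifier' = 'doris-dirty-data'
);
```

With `dirty.side-output.connector' = 'log'` instead of `s3`, the same setup would only log dirty rows rather than archive them.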
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/load_node/doris.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/load_node/doris.md
index d7d3da3367..38f5b9ac39 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/load_node/doris.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/data_node/load_node/doris.md
@@ -294,7 +294,7 @@ TODO: 将在未来支持此功能。
 | sink.batch.size                   | 可选   | 10000             | int      | 单次写 BE 的最大行数                                                                                                                                                                                                                                                                                           |
 | sink.max-retries                  | 可选   | 1                 | int      | 写 BE 失败之后的重试次数                                                                                                                                                                                                                                                                                         |
 | sink.batch.interval               | 可选   | 10s               | string   | Flush 间隔时间,超过该时间后异步线程将缓存中数据写入 BE。 默认值为10秒,支持时间单位 ms、s、min、h和d。设置为0表示关闭定期写入。                                                                                                                                                                                                                            |
-| sink.properties.*                 | 可选   | (none)            | string   | Stream load 的导入参数<br /><br />例如:<br />'sink.properties.column_separator' = ', '<br />定义列分隔符<br /><br />'sink.properties.escape_delimiters' = 'true'<br />特殊字符作为分隔符,'\\x01'会被转换为二进制的0x01<br /><br /> 'sink.properties.format' = 'json'<br />'sink.properties.strip_outer_array' = 'true' <br />JSON格式导入 |
+| sink.properties.*                 | 可选   | (none)            | string   | Stream load 的导入参数<br /><br />例如:<br />'sink.properties.column_separator' = ', '<br />定义列分隔符<br /><br />'sink.properties.escape_delimiters' = 'true'<br />特殊字符作为分隔符,'\\x01' 会被转换为二进制的 0x01 <br /><br /> 'sink.properties.format' = 'json'<br />'sink.properties.strip_outer_array' = 'true' <br />JSON 格式导入<br /><br /> 'sink.properties.format' = 'csv'<br />CSV 格式导入 |
 | sink.enable-delete                | 可选   | true              | boolean  | 是否启用删除。此选项需要 Doris 表开启批量删除功能(0.15+版本默认开启),只支持 Uniq 模型。                                                                                                                                                                                                                                                 |
 | sink.multiple.enable              | 可选   | false             | boolean  | 是否支持 Doris 多表写入。 `sink.multiple.enable` 为 `true` 时,需要 `sink.multiple.format` 、 `sink.multiple.database-pattern` 、 `sink.multiple.table-pattern` 分别设置正确的值。        |
@@ -303,6 +303,21 @@ TODO: 将在未来支持此功能。
 | sink.multiple.table-pattern       | 可选   | (none)            | string   | 多表写入时,从源端二进制数据中按照 `sink.multiple.table-pattern` 指定名称提取写入的表名。 `sink.multiple.enable` 为true时有效。                         |
 | sink.multiple.ignore-single-table-errors | 可选 | true         | boolean  | 多表写入时,是否忽略某个表写入失败。为 `true` 时,如果某个表写入异常,则不写入该表数据,其他表的数据正常写入。为 `false` 时,如果某个表写入异常,则所有表均停止写入。     |
 | inlong.metric.labels | 可选 | (none) | String | inlong metric 的标签值,该值的构成为groupId=`{groupId}`&streamId=`{streamId}`&nodeId=`{nodeId}`。|
+| sink.multiple.schema-update.policy | 可选 | (none) | string | 往 Doris 表同步数据时,如果 Doris 表不存在或字段长度超过限制,Doris 服务器会抛出异常。<br /><br /> 当该属性设置为 `THROW_WITH_STOP` ,异常会向上抛给 Flink 框架。Flink 框架会自动重启任务,尝试恢复。<br /><br /> 当该属性设置为 `STOP_PARTIAL` 时,Doris connector 会忽略该表的写入,新数据不再往该表写入,其它表则正常同步。<br /><br /> 当该属性设置为 `LOG_WITH_IGNORE` 时,异常会打印到日志中,不会向上抛出。后续新数据到来时,继续尝试往该表写入。 |
+| dirty.ignore | 可选 | (none)| boolean | 往 Doris 表同步数据时,如果遇到错误和异常,通过该变量可以控制是否忽略脏数据。如果设置为 `false` ,则忽略脏数据,不归档。如果为 `true` ,则根据其它的 `dirty.side-output.*` 的配置决定如何归档数据。 |
+| dirty.side-output.connector | 可选 | (none)| string | 支持 `s3` 和 `log` 两种配置。当配置为 `log` 时,仅打印日志,不归档数据。当配置为 `s3` 时,可以将数据归档到亚马逊S3或腾讯云COS存储。 |
+| dirty.side-output.s3.bucket | 可选 | (none)| string | S3 或 COS 的桶名称 |
+| dirty.side-output.s3.endpoint | 可选 | (none)| string | S3 或 COS 的 endpoint 地址 |
+| dirty.side-output.s3.key | 可选 | (none)| string | S3 或 COS 的 key  |
+| dirty.side-output.s3.region | 可选 | (none)| string | S3 或 COS 的区域 |
+| dirty.side-output.line-delimiter | 可选 | (none)| string | 脏数据的行分隔符 |
+| dirty.side-output.field-delimiter | 可选 | (none)| string | 脏数据的字段分隔符 |
+| dirty.side-output.s3.secret-key-id | 可选 | (none)| string | S3 或 COS 的 secret key |
+| dirty.side-output.s3.access-key-id | 可选 | (none)| string | S3 或 COS 的 access key |
+| dirty.side-output.format | 可选 | (none)| string | 脏数据归档的格式,支持 `json` 和 `csv` |
+| dirty.side-output.log-tag | 可选 | (none)| string | 脏数据的 tag 。通过该变量区分每条脏数据归属于 Doris 的哪个库表。 |
+| dirty.identifier | 可选 | (none)| string | 归档后的文件名 |
+| dirty.side-output.labels | 可选 | (none)| string | 归档后的每条数据包括标签和业务数据两部分。标签在前面,业务数据在后面。 |
 
 ## 数据类型映射