Posted to commits@doris.apache.org by ji...@apache.org on 2022/05/27 04:55:50 UTC

[incubator-doris] branch master updated: [doc]Add export sql help documentation (#9797)

This is an automated email from the ASF dual-hosted git repository.

jiafengzheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-doris.git


The following commit(s) were added to refs/heads/master by this push:
     new 80e9c3395a [doc]Add export sql help documentation (#9797)
80e9c3395a is described below

commit 80e9c3395a1280fefc9ec0b9412521a0857d2e76
Author: jiafeng.zhang <zh...@gmail.com>
AuthorDate: Fri May 27 12:55:45 2022 +0800

    [doc]Add export sql help documentation (#9797)
    
    Add export sql help documentation
---
 docs/.vuepress/sidebar/en.js                       |   3 +-
 docs/.vuepress/sidebar/zh-CN.js                    |   3 +-
 .../Manipulation/EXPORT.md                         | 223 +++++++++++++++++++++
 .../Manipulation/EXPORT.md                         | 222 ++++++++++++++++++++
 4 files changed, 449 insertions(+), 2 deletions(-)

diff --git a/docs/.vuepress/sidebar/en.js b/docs/.vuepress/sidebar/en.js
index 54fc65d747..48b27573cd 100644
--- a/docs/.vuepress/sidebar/en.js
+++ b/docs/.vuepress/sidebar/en.js
@@ -706,7 +706,8 @@ module.exports = [
                   "DELETE",
                   "INSERT",
                   "UPDATE",
-                  "SELECT"
+                  "SELECT",
+                  "EXPORT"
                 ],
               },
               "OUTFILE"             
diff --git a/docs/.vuepress/sidebar/zh-CN.js b/docs/.vuepress/sidebar/zh-CN.js
index 4823046a19..d52f9480b0 100644
--- a/docs/.vuepress/sidebar/zh-CN.js
+++ b/docs/.vuepress/sidebar/zh-CN.js
@@ -706,7 +706,8 @@ module.exports = [
                   "DELETE",
                   "INSERT",
                   "UPDATE",
-                  "SELECT"
+                  "SELECT",
+                  "EXPORT"
                 ],
               },
               "OUTFILE"             
diff --git a/docs/en/sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/EXPORT.md b/docs/en/sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/EXPORT.md
new file mode 100644
index 0000000000..46c68574d6
--- /dev/null
+++ b/docs/en/sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/EXPORT.md
@@ -0,0 +1,223 @@
+---
+{
+    "title": "EXPORT",
+    "language": "en"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## EXPORT
+
+### Name
+
+EXPORT
+
+### Description
+
+This statement is used to export the data of the specified table to the specified location.
+
+This is an asynchronous operation that returns as soon as the task is submitted successfully. After submission, you can use the [SHOW EXPORT](../../Show-Statements/SHOW-EXPORT) command to view the progress.
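+
+For example, a minimal progress check might look like the following sketch (the label "mylabel" is a hypothetical placeholder for the label assigned at submission):
+
+```sql
+SHOW EXPORT WHERE LABEL = "mylabel";
+```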
+
+```sql
+EXPORT TABLE table_name
+[PARTITION (p1[,p2])]
+TO export_path
+[opt_properties]
+WITH BROKER
+[broker_properties];
+```
+
+Explanation:
+
+- `table_name`
+
+  The name of the table to export. Only data from Doris local tables can be exported.
+
+- `partition`
+
+  Optionally export only the specified partitions of the table.
+
+- `export_path`
+
+  The export path. It must be a directory.
+
+- `opt_properties`
+
+  Used to specify some export parameters.
+
+  ```sql
+  [PROPERTIES ("key"="value", ...)]
+  ```
+
+  The following parameters can be specified:
+
+  - `column_separator`: Specifies the exported column separator. The default is \t. Only a single byte is supported.
+  - `line_delimiter`: Specifies the exported line delimiter. The default is \n. Only a single byte is supported.
+  - `exec_mem_limit`: The upper limit of memory usage for the export on a single BE node. The default is 2GB; the unit is bytes.
+  - `timeout`: The timeout of the export job. The default is 2 hours; the unit is seconds.
+  - `tablet_num_per_task`: The maximum number of tablets each subtask can be allocated to scan.
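+
+  For instance, a property clause that tunes the timeout and the subtask size might look like the following sketch (the values are illustrative assumptions, not recommendations):
+
+  ```sql
+  PROPERTIES (
+      "timeout" = "7200",            -- 2 hours, expressed in seconds
+      "tablet_num_per_task" = "10"   -- each subtask scans at most 10 tablets
+  )
+  ```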
+
+- `WITH BROKER`
+
+  The export function needs to write data to the remote storage through the Broker process. Here you need to define the connection information for the Broker to use.
+
+  ```sql
+  WITH BROKER hdfs|s3 ("key"="value"[,...])
+  ```
+
+  1. If the export is to Amazon S3, you need to provide the following properties:
+
+  ```
+  fs.s3a.access.key: Amazon S3 access key
+  fs.s3a.secret.key: Amazon S3 secret key
+  fs.s3a.endpoint: Amazon S3 endpoint
+  ```
+
+  2. If you use the S3 protocol to connect directly to the remote storage, you need to specify the following properties:
+
+  ```
+  (
+      "AWS_ENDPOINT" = "",
+      "AWS_ACCESS_KEY" = "",
+      "AWS_SECRET_KEY" = "",
+      "AWS_REGION" = ""
+  )
+  ```
+
+### Example
+
+1. Export all data in the test table to HDFS
+
+```sql
+EXPORT TABLE test TO "hdfs://hdfs_host:port/a/b/c" 
+WITH BROKER "broker_name" 
+(
+  "username"="xxx",
+  "password"="yyy"
+);
+```
+
+2. Export partitions p1 and p2 of the testTbl table to HDFS
+
+```sql
+EXPORT TABLE testTbl PARTITION (p1,p2) TO "hdfs://hdfs_host:port/a/b/c" 
+WITH BROKER "broker_name" 
+(
+  "username"="xxx",
+  "password"="yyy"
+);
+```
+
+3. Export all data in the testTbl table to HDFS, with "," as the column separator, and specify a label
+
+```sql
+EXPORT TABLE testTbl TO "hdfs://hdfs_host:port/a/b/c" 
+PROPERTIES ("label" = "mylabel", "column_separator"=",") 
+WITH BROKER "broker_name" 
+(
+  "username"="xxx",
+  "password"="yyy"
+);
+```
+
+4. Export the rows where k1 = 1 in the testTbl table to HDFS.
+
+```sql
+EXPORT TABLE testTbl TO "hdfs://hdfs_host:port/a/b/c" WHERE k1=1 
+WITH BROKER "broker_name" 
+(
+  "username"="xxx",
+  "password"="yyy"
+);
+```
+
+5. Export all data in the testTbl table to the local file system.
+
+```sql
+EXPORT TABLE testTbl TO "file:///home/data/a";
+```
+
+6. Export all data in the testTbl table to HDFS, using the invisible character "\x07" as both the column and line separator.
+
+```sql
+EXPORT TABLE testTbl TO "hdfs://hdfs_host:port/a/b/c" 
+PROPERTIES (
+  "column_separator"="\\x07", 
+  "line_delimiter" = "\\x07"
+) 
+WITH BROKER "broker_name" 
+(
+  "username"="xxx", 
+  "password"="yyy"
+);
+```
+
+7. Export the k1, v1 columns of the testTbl table to the local file system.
+
+```sql
+EXPORT TABLE testTbl TO "file:///home/data/a" PROPERTIES ("columns" = "k1,v1");
+```
+
+8. Export all data in the testTbl table to S3, using the invisible character "\x07" as both the column and line separator.
+
+```sql
+EXPORT TABLE testTbl TO "hdfs://hdfs_host:port/a/b/c" 
+PROPERTIES (
+  "column_separator"="\\x07", 
+  "line_delimiter" = "\\x07"
+) WITH s3 (
+  "AWS_ENDPOINT" = "xxxxx",
+  "AWS_ACCESS_KEY" = "xxxxx",
+  "AWS_SECRET_KEY"="xxxx",
+  "AWS_REGION" = "xxxxx"
+);
+```
+
+### Keywords
+
+    EXPORT
+
+### Best Practice
+
+#### Splitting of subtasks
+
+An Export job is split into multiple subtasks (query plans) for execution. The number of query plans depends on how many tablets there are in total and how many tablets can be allocated to a single query plan.
+
+Because the query plans are executed serially, the execution time of the job can be reduced by letting one query plan handle more tablets.
+
+However, if a query plan fails (for example, the RPC call to the Broker fails, or the remote storage is unstable), handling too many tablets raises the retry cost of that query plan.
+
+Therefore, the number of query plans and the number of tablets each query plan scans need to be arranged reasonably, balancing execution time against the execution success rate.
+
+It is generally recommended that the amount of data scanned by one query plan stay within 3 to 5 GB.
+
+#### Memory limit
+
+Usually, the query plan of an Export job has only two parts, scan and export, and does not involve computation logic that requires much memory. So the default memory limit of 2GB usually suffices.
+
+However, in some scenarios, for example when a single query plan needs to scan too many tablets on the same BE, or when the tablets have too many data versions, memory may be insufficient. In that case, set a larger memory limit, such as 4GB or 8GB, through the `exec_mem_limit` parameter.
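+
+As a sketch, raising the limit to 8GB for one job might look like this (the path, broker name, and credentials are placeholders carried over from the examples above):
+
+```sql
+EXPORT TABLE testTbl TO "hdfs://hdfs_host:port/a/b/c"
+PROPERTIES ("exec_mem_limit" = "8589934592")  -- 8GB, in bytes
+WITH BROKER "broker_name"
+(
+  "username"="xxx",
+  "password"="yyy"
+);
+```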
+
+#### Precautions
+
+- Exporting a large amount of data in one job is not recommended. The recommended maximum export volume for one Export job is several tens of GB. An overly large export results in more junk files and a higher retry cost. If the table data volume is too large, export by partition.
+- If the Export job fails, the `__doris_export_tmp_xxx` temporary directory generated in the remote storage, as well as any files already generated, will not be deleted; the user needs to delete them manually.
+- If the Export job succeeds, the `__doris_export_tmp_xxx` directory generated in the remote storage may or may not be cleared, depending on the file system semantics of the remote storage. For example, in S3 object storage, after the last file in a directory is moved away by the rename operation, the directory is deleted as well. If the directory is not cleared, the user can clear it manually.
+- The Export job exports only the data of the base table, not the data of materialized views.
+- The Export job scans data and occupies I/O resources, which may affect the query latency of the system.
+- The maximum number of Export jobs running simultaneously in a cluster is 5. Jobs submitted beyond that limit are queued.
diff --git a/docs/zh-CN/sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/EXPORT.md b/docs/zh-CN/sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/EXPORT.md
new file mode 100644
index 0000000000..29fb0eb148
--- /dev/null
+++ b/docs/zh-CN/sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/EXPORT.md
@@ -0,0 +1,222 @@
+---
+{
+    "title": "EXPORT",
+    "language": "zh-CN"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## EXPORT
+
+### Name
+
+EXPORT
+
+### Description
+
+This statement is used to export the data of the specified table to the specified location.
+
+This is an asynchronous operation that returns as soon as the task is submitted successfully. After submission, you can use the [SHOW EXPORT](../../Show-Statements/SHOW-EXPORT) command to view the progress.
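+
+For example, a minimal progress check might look like the following sketch (the label "mylabel" is a hypothetical placeholder for the label assigned at submission):
+
+```sql
+SHOW EXPORT WHERE LABEL = "mylabel";
+```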
+
+```sql
+EXPORT TABLE table_name
+[PARTITION (p1[,p2])]
+TO export_path
+[opt_properties]
+WITH BROKER
+[broker_properties];
+```
+
+Explanation:
+
+- `table_name`
+
+  The name of the table to export. Only data from Doris local tables can be exported.
+
+- `partition`
+
+  Optionally export only the specified partitions of the table.
+
+- `export_path`
+
+  The export path. It must be a directory.
+
+- `opt_properties`
+
+  Used to specify some export parameters.
+
+  ```sql
+  [PROPERTIES ("key"="value", ...)]
+  ```
+
+  The following parameters can be specified:
+
+  - `column_separator`: Specifies the exported column separator. The default is \t. Only a single byte is supported.
+  - `line_delimiter`: Specifies the exported line delimiter. The default is \n. Only a single byte is supported.
+  - `exec_mem_limit`: The upper limit of memory usage for the export on a single BE node. The default is 2GB; the unit is bytes.
+  - `timeout`: The timeout of the export job. The default is 2 hours; the unit is seconds.
+  - `tablet_num_per_task`: The maximum number of tablets each subtask can be allocated to scan.
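+
+  For instance, a property clause that tunes the timeout and the subtask size might look like the following sketch (the values are illustrative assumptions, not recommendations):
+
+  ```sql
+  PROPERTIES (
+      "timeout" = "7200",            -- 2 hours, expressed in seconds
+      "tablet_num_per_task" = "10"   -- each subtask scans at most 10 tablets
+  )
+  ```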
+
+- `WITH BROKER`
+
+  The export function needs to write data to the remote storage through the Broker process. Here you need to define the connection information for the Broker to use.
+
+  ```sql
+  WITH BROKER hdfs|s3 ("key"="value"[,...])
+  ```
+
+  1. If the export is to Amazon S3, you need to provide the following properties:
+
+  ```
+  fs.s3a.access.key: Amazon S3 access key
+  fs.s3a.secret.key: Amazon S3 secret key
+  fs.s3a.endpoint: Amazon S3 endpoint
+  ```
+
+  2. If you use the S3 protocol to connect directly to the remote storage, you need to specify the following properties:
+
+  ```
+  (
+      "AWS_ENDPOINT" = "",
+      "AWS_ACCESS_KEY" = "",
+      "AWS_SECRET_KEY" = "",
+      "AWS_REGION" = ""
+  )
+  ```
+
+### Example
+
+1. Export all data in the test table to HDFS
+
+```sql
+EXPORT TABLE test TO "hdfs://hdfs_host:port/a/b/c" 
+WITH BROKER "broker_name" 
+(
+  "username"="xxx",
+  "password"="yyy"
+);
+```
+
+2. Export partitions p1 and p2 of the testTbl table to HDFS
+
+```sql
+EXPORT TABLE testTbl PARTITION (p1,p2) TO "hdfs://hdfs_host:port/a/b/c" 
+WITH BROKER "broker_name" 
+(
+  "username"="xxx",
+  "password"="yyy"
+);
+```
+
+3. Export all data in the testTbl table to HDFS, with "," as the column separator, and specify a label
+
+```sql
+EXPORT TABLE testTbl TO "hdfs://hdfs_host:port/a/b/c" 
+PROPERTIES ("label" = "mylabel", "column_separator"=",") 
+WITH BROKER "broker_name" 
+(
+  "username"="xxx",
+  "password"="yyy"
+);
+```
+
+4. Export the rows where k1 = 1 in the testTbl table to HDFS.
+
+```sql
+EXPORT TABLE testTbl TO "hdfs://hdfs_host:port/a/b/c" WHERE k1=1 
+WITH BROKER "broker_name" 
+(
+  "username"="xxx",
+  "password"="yyy"
+);
+```
+
+5. Export all data in the testTbl table to the local file system.
+
+```sql
+EXPORT TABLE testTbl TO "file:///home/data/a";
+```
+
+6. Export all data in the testTbl table to HDFS, using the invisible character "\x07" as both the column and line separator.
+
+```sql
+EXPORT TABLE testTbl TO "hdfs://hdfs_host:port/a/b/c" 
+PROPERTIES (
+  "column_separator"="\\x07", 
+  "line_delimiter" = "\\x07"
+) 
+WITH BROKER "broker_name" 
+(
+  "username"="xxx", 
+  "password"="yyy"
+);
+```
+
+7. Export the k1, v1 columns of the testTbl table to the local file system.
+
+```sql
+EXPORT TABLE testTbl TO "file:///home/data/a" PROPERTIES ("columns" = "k1,v1");
+```
+
+8. Export all data in the testTbl table to S3, using the invisible character "\x07" as both the column and line separator.
+
+```sql
+EXPORT TABLE testTbl TO "hdfs://hdfs_host:port/a/b/c" 
+PROPERTIES (
+  "column_separator"="\\x07", 
+  "line_delimiter" = "\\x07"
+) WITH s3 (
+  "AWS_ENDPOINT" = "xxxxx",
+  "AWS_ACCESS_KEY" = "xxxxx",
+  "AWS_SECRET_KEY"="xxxx",
+  "AWS_REGION" = "xxxxx"
+);
+```
+
+### Keywords
+
+    EXPORT
+
+### Best Practice
+
+#### Splitting of subtasks
+
+An Export job is split into multiple subtasks (query plans) for execution. The number of query plans depends on how many tablets there are in total and how many tablets can be allocated to a single query plan.
+
+Because the query plans are executed serially, the execution time of the job can be reduced by letting one query plan handle more tablets.
+
+However, if a query plan fails (for example, the RPC call to the Broker fails, or the remote storage is unstable), handling too many tablets raises the retry cost of that query plan.
+
+Therefore, the number of query plans and the number of tablets each query plan scans need to be arranged reasonably, balancing execution time against the execution success rate.
+
+It is generally recommended that the amount of data scanned by one query plan stay within 3 to 5 GB.
+
+#### Memory limit
+
+Usually, the query plan of an Export job has only two parts, scan and export, and does not involve computation logic that requires much memory. So the default memory limit of 2GB usually suffices.
+
+However, in some scenarios, for example when a single query plan needs to scan too many tablets on the same BE, or when the tablets have too many data versions, memory may be insufficient. In that case, set a larger memory limit, such as 4GB or 8GB, through the `exec_mem_limit` parameter.
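+
+As a sketch, raising the limit to 8GB for one job might look like this (the path, broker name, and credentials are placeholders carried over from the examples above):
+
+```sql
+EXPORT TABLE testTbl TO "hdfs://hdfs_host:port/a/b/c"
+PROPERTIES ("exec_mem_limit" = "8589934592")  -- 8GB, in bytes
+WITH BROKER "broker_name"
+(
+  "username"="xxx",
+  "password"="yyy"
+);
+```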
+
+#### Precautions
+
+- Exporting a large amount of data in one job is not recommended. The recommended maximum export volume for one Export job is several tens of GB. An overly large export results in more junk files and a higher retry cost. If the table data volume is too large, export by partition.
+- If the Export job fails, the `__doris_export_tmp_xxx` temporary directory generated in the remote storage, as well as any files already generated, will not be deleted; the user needs to delete them manually.
+- If the Export job succeeds, the `__doris_export_tmp_xxx` directory generated in the remote storage may or may not be cleared, depending on the file system semantics of the remote storage. For example, in S3 object storage, after the last file in a directory is moved away by the rename operation, the directory is deleted as well. If the directory is not cleared, the user can clear it manually.
+- The Export job exports only the data of the base table, not the data of materialized views.
+- The Export job scans data and occupies I/O resources, which may affect the query latency of the system.
+- The maximum number of Export jobs running simultaneously in a cluster is 5. Jobs submitted beyond that limit are queued.


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org