You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by ji...@apache.org on 2022/06/15 08:43:15 UTC

[incubator-doris] branch master updated: [doc] add tpc-h benchmark (#10150)

This is an automated email from the ASF dual-hosted git repository.

jiafengzheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-doris.git


The following commit(s) were added to refs/heads/master by this push:
     new 4005b34a52 [doc] add tpc-h benchmark (#10150)
4005b34a52 is described below

commit 4005b34a52b58be9b1e4f4f2cec29a4b97426f50
Author: jiafeng.zhang <zh...@gmail.com>
AuthorDate: Wed Jun 15 16:43:10 2022 +0800

    [doc] add tpc-h benchmark (#10150)
    
    [doc] add tpc-h benchmark
---
 .../public/images/image-20220614114351241.png      | Bin 0 -> 92508 bytes
 docs/.vuepress/sidebar/en/docs.js                  |   3 +-
 docs/.vuepress/sidebar/zh-CN/docs.js               |   3 +-
 docs/en/docs/benchmark/tpch.md                     | 229 ++++++++++++++++++++
 docs/zh-CN/docs/benchmark/tpch.md                  | 232 +++++++++++++++++++++
 5 files changed, 465 insertions(+), 2 deletions(-)

diff --git a/docs/.vuepress/public/images/image-20220614114351241.png b/docs/.vuepress/public/images/image-20220614114351241.png
new file mode 100644
index 0000000000..fe64d53ef3
Binary files /dev/null and b/docs/.vuepress/public/images/image-20220614114351241.png differ
diff --git a/docs/.vuepress/sidebar/en/docs.js b/docs/.vuepress/sidebar/en/docs.js
index f77fd24c74..f907f2f782 100644
--- a/docs/.vuepress/sidebar/en/docs.js
+++ b/docs/.vuepress/sidebar/en/docs.js
@@ -1001,7 +1001,8 @@ module.exports = [
     directoryPath: "benchmark/",
     initialOpenGroupIndex: -1,
     children: [
-      "ssb"
+      "ssb",
+      "tpch"
     ],
   }
 ];
diff --git a/docs/.vuepress/sidebar/zh-CN/docs.js b/docs/.vuepress/sidebar/zh-CN/docs.js
index fea0e6bcaa..57e90e33a2 100644
--- a/docs/.vuepress/sidebar/zh-CN/docs.js
+++ b/docs/.vuepress/sidebar/zh-CN/docs.js
@@ -1001,7 +1001,8 @@ module.exports = [
     directoryPath: "benchmark/",
     initialOpenGroupIndex: -1,
     children: [
-      "ssb"
+      "ssb",
+      "tpch"
     ],
   }
 ];
diff --git a/docs/en/docs/benchmark/tpch.md b/docs/en/docs/benchmark/tpch.md
new file mode 100644
index 0000000000..2fcd7d4094
--- /dev/null
+++ b/docs/en/docs/benchmark/tpch.md
@@ -0,0 +1,229 @@
+---
+{
+    "title": "TPC-H Benchmark",
+    "language": "en"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# TPC-H benchmark
+
+TPC-H is a Decision Support Benchmark consisting of a set of business-oriented ad hoc queries and concurrent data modifications. The data that queries and populates the database has broad industry relevance. This benchmark demonstrates a decision support system that examines large amounts of data, executes highly complex queries, and answers critical business questions. The performance metric reported by TPC-H is called the TPC-H Hourly Compound Query Performance Metric (QphH@Size) and r [...]
+
+This document mainly introduces the performance of Doris on the TPC-H test set.
+
+> Note 1: Standard test sets including TPC-H are usually far from actual business scenarios, and some tests will perform parameter tuning for the test set. Therefore, the test results of the standard test set can only reflect the performance of the database in specific scenarios. Users are advised to conduct further testing with actual business data.
+>
+> Note 2: The operations covered in this document are tested on CentOS 7.x.
+
+On 22 queries on the TPC-H standard test dataset, we tested the upcoming Doris 1.1 version and Doris 0.15.0 RC04 version side by side, and the overall performance improved by 3-4 times. In individual scenarios, it can achieve a ten-fold improvement.
+
+![image-20220614114351241](/images/image-20220614114351241.png)
+
+## 1. Hardware Environment
+
+| Hardware           | Configuration Instructions                                   |
+| -------- | ------------------------------------ |
+| number of machines | 4 Alibaba Cloud hosts (1 FE, 3 BE) |
+| CPU      | Intel Xeon(Cascade Lake) Platinum 8269CY  16C  (2.5 GHz/3.2 GHz) |
+| Memory | 64G                                  |
+| Network | 5Gbps                              |
+| Disk   | ESSD cloud hard disk  |
+
+## 2. Software Environment
+
+- Doris deploys 3BE 1FE;
+- Kernel version: Linux version 5.4.0-96-generic (buildd@lgw01-amd64-051)
+- OS version: CentOS 7.8
+- Doris software version: Apache Doris 1.1, Apache Doris 0.15.0 RC04
+- JDK: openjdk version "11.0.14" 2022-01-18
+
+## 3. Test Data Volume
+
+The entire test simulation generates 100G of data and is imported into Doris 0.15.0 RC04 and Doris 1.1 versions for testing. The following is the relevant description of the table and the amount of data.
+
+| TPC-H Table Name | Rows        | data size  | remark |
+| :--------------- | :---------- | ---------- | :----- |
+| REGION           | 5           | 400KB      |        |
+| NATION           | 25          | 7.714 KB   |        |
+| SUPPLIER         | 100 million | 85.528 MB  |        |
+| PART             | 20 million  | 752.330 MB |        |
+| PARTSUPP         | 80 million  | 4.375 GB   |        |
+| CUSTOMER         | 15 million  | 1.317 GB   |        |
+| ORDERS           | 1.5 billion | 6.301 GB   |        |
+| LINEITEM         | 6 billion   | 20.882 GB  |        |
+
+## 4. Test SQL
+
+TPCH 22 test query statements : [TPCH-Query-SQL](https://github.com/apache/incubator-doris/tree/master/tools/tpch-tools/queries)
+
+Notice:
+
+The following four parameters in the above SQL do not exist in 0.15.0 RC04. When executed in 0.15.0 RC04, remove them:
+
+```
+1. enable_vectorized_engine=true,
+2. batch_size=4096,
+3. disable_join_reorder=false
+4. enable_projection=true
+```
+
+## 5. Test Result
+
+Here we use the upcoming Doris-1.1 version and Doris-0.15.0 RC04 version for comparative testing. The test results are as follows:
+
+| Query     | Doris-1.1(s) | 0.15.0 RC04(s) |
+| --------- | ------------ | -------------- |
+| Q1        | 3.75         | 28.63          |
+| Q2        | 4.22         | 7.88           |
+| Q3        | 2.64         | 9.39           |
+| Q4        | 1.5          | 9.3            |
+| Q5        | 2.15         | 4.11           |
+| Q6        | 0.19         | 0.43           |
+| Q7        | 1.04         | 1.61           |
+| Q8        | 1.75         | 50.35          |
+| Q9        | 7.94         | 16.34          |
+| Q10       | 1.41         | 5.21           |
+| Q11       | 0.35         | 1.72           |
+| Q12       | 0.57         | 5.39           |
+| Q13       | 8.15         | 20.88          |
+| Q14       | 0.3          |                |
+| Q15       | 0.66         | 1.86           |
+| Q16       | 0.79         | 1.32           |
+| Q17       | 1.51         | 26.67          |
+| Q18       | 3.364        | 11.77          |
+| Q19       | 0.829        | 1.71           |
+| Q20       | 2.77         | 5.2            |
+| Q21       | 4.47         | 10.34          |
+| Q22       | 0.9          | 3.22           |
+| **total** | **51.253**   | **223.33**     |
+
+- **Result description**
+  - The data set corresponding to the test results is scale 100, about 600 million.
+  - The test environment is configured to be commonly used by users, including 4 cloud servers, 16-core 64G SSD, and 1 FE and 3 BE deployment.
+  - Use common user configuration tests to reduce user selection and evaluation costs, but will not consume so many hardware resources during the entire test process.
+  - The test results are averaged over 3 executions. And the data has been fully compacted (if the data is tested immediately after the data is imported, the query delay may be higher than the test result, and the speed of compaction is being continuously optimized, and will be significantly reduced in the future).
+  - 0.15 RC04 Q14 execution failed in TPC-H test, unable to complete query.
+
+## 6. Environmental Preparation
+
+Please refer to the [official document](../install/install-deploy.md) to install and deploy Doris to obtain a normal running Doris cluster (at least 1 FE 1 BE, 1 FE 3 BE is recommended).
+
+## 7. Data Preparation
+
+### 7.1 Download and install the TPC-H data generation tool
+
+Execute the following script to download and compile the [tpch-tools](https://github.com/apache/incubator-doris/tree/master/tools/tpch-tools) tool.
+
+```shell
+sh build-tpch-dbgen.sh
+```
+
+After successful installation, the `dbgen` binary will be generated in the `TPC-H_Tools_v3.0.0/` directory.
+
+### 7.2 Generate TPC-H test set
+
+Execute the following script to generate the TPC-H dataset:
+
+```shell
+sh gen-tpch-data.sh
+```
+
+> Note 1: View script help via `sh gen-tpch-data.sh -h`.
+>
+> Note 2: The data will be generated in the `tpch-data/` directory with the suffix `.tbl`. The total file size is about 100GB. The generation time may vary from a few minutes to an hour.
+>
+> Note 3: The standard test data set of 100G is generated by default
+
+### 7.3 Create Table
+
+#### 7.3.1 Prepare the `doris-cluster.conf` file
+
+Before calling the import script, you need to write the FE's ip port and other information in the `doris-cluster.conf` file.
+
+File location and `load-tpch-data.sh` level.
+
+The contents of the file include FE's ip, HTTP port, user name, password and the DB name of the data to be imported:
+
+```shell
+# Any of FE host
+export FE_HOST='127.0.0.1'
+# http_port in fe.conf
+export FE_HTTP_PORT=8030
+# query_port in fe.conf
+export FE_QUERY_PORT=9030
+# Doris username
+export USER='root'
+# Doris password
+export PASSWORD=''
+# The database where TPC-H tables located
+export DB='tpch1'
+```
+
+#### 7.3.2 Execute the following script to generate and create the TPC-H table
+
+```shell
+sh create-tpch-tables.sh
+```
+Or copy the table creation statement in [create-tpch-tables.sql](https://github.com/apache/incubator-doris/blob/master/tools/tpch-tools/create-tpch-tables.sql), Execute in Doris.
+
+
+### 7.4 导入数据
+
+通过下面的命令执行数据导入:
+
+```shell
+sh ./load-tpch-data.sh
+```
+
+### 7.5 Check Imported Data
+
+Execute the following SQL statement to check the imported data volume is consistent with the above data volume.
+
+```sql
+select count(*)  from  lineitem;
+select count(*)  from  orders;
+select count(*)  from  partsupp;
+select count(*)  from  part;
+select count(*)  from  customer;
+select count(*)  from  supplier;
+select count(*)  from  nation;
+select count(*)  from  region;
+select count(*)  from  revenue0;
+```
+
+### 7.6 Query Test
+
+Execute the above test SQL or execute the following command
+
+```
+./run-tpch-queries.sh
+```
+
+>Notice:
+>
+>1. At present, the query optimizer and statistics functions of Doris are not perfect, so we rewrite some queries in TPC-H to adapt to the execution framework of Doris, but it does not affect the correctness of the results
+>
+>2. Doris' new query optimizer will be released in subsequent versions
+>3. Set `set mem_exec_limit=8G` before executing the query
+
+
+
diff --git a/docs/zh-CN/docs/benchmark/tpch.md b/docs/zh-CN/docs/benchmark/tpch.md
new file mode 100644
index 0000000000..2c68c86433
--- /dev/null
+++ b/docs/zh-CN/docs/benchmark/tpch.md
@@ -0,0 +1,232 @@
+---
+{
+    "title": "TPC-H benchmark",
+    "language": "zh-CN"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# TPC-H benchmark
+
+TPC-H是一个决策支持基准(Decision Support Benchmark),它由一套面向业务的特别查询和并发数据修改组成。查询和填充数据库的数据具有广泛的行业相关性。这个基准测试演示了检查大量数据、执行高度复杂的查询并回答关键业务问题的决策支持系统。TPC-H报告的性能指标称为TPC-H每小时复合查询性能指标(QphH@Size),反映了系统处理查询能力的多个方面。这些方面包括执行查询时所选择的数据库大小,由单个流提交查询时的查询处理能力,以及由多个并发用户提交查询时的查询吞吐量。
+
+本文档主要介绍 Doris 在 TPC-H 测试集上的性能表现。
+
+> 注1:包括 TPC-H 在内的标准测试集通常和实际业务场景差距较大,并且部分测试会针对测试集进行参数调优。所以标准测试集的测试结果仅能反映数据库在特定场景下的性能表现。建议用户使用实际业务数据进行进一步的测试。
+>
+> 注2:本文档涉及的操作都在 CentOS 7.x 上进行测试。
+
+在 TPC-H 标准测试数据集上的 22 个查询上,我们对即将发布的 Doris 1.1 版本和 Doris 0.15.0 RC04 版本进行了对别测试,整体性能提升了 3-4 倍。个别场景下达到十几倍的提升。
+
+![image-20220614114351241](/images/image-20220614114351241.png)
+
+## 1. 硬件环境
+
+| 硬件     | 配置说明                                                     |
+| -------- | ------------------------------------ |
+| 机器数量 | 4 台腾讯云主机(1个FE,3个BE)       |
+| CPU      | Intel Xeon(Cascade Lake) Platinum 8269CY  16核  (2.5 GHz/3.2 GHz) |
+| 内存     | 64G                                  |
+| 网络带宽  | 5Gbps                              |
+| 磁盘     | ESSD云硬盘                      |
+
+## 2. 软件环境
+
+- Doris部署 3BE 1FE;
+- 内核版本:Linux version 5.4.0-96-generic (buildd@lgw01-amd64-051)
+- 操作系统版本:CentOS 7.8
+- Doris 软件版本:Apache Doris 1.1 、Apache Doris 0.15.0 RC04
+- JDK:openjdk version "11.0.14" 2022-01-18
+
+## 3. 测试数据量
+
+整个测试模拟生成100G的数据分别导入到 Doris 0.15.0 RC04 和 Doris 1.1 版本进行测试,下面是表的相关说明及数据量。
+
+| TPC-H表名 | 行数   | 导入后大小 | 备注         |
+| :-------- | :----- | ---------- | :----------- |
+| REGION    | 5      | 400KB      | 区域表       |
+| NATION    | 25     | 7.714 KB   | 国家表       |
+| SUPPLIER  | 100万  | 85.528 MB  | 供应商表     |
+| PART      | 2000万 | 752.330 MB | 零部件表     |
+| PARTSUPP  | 8000万 | 4.375 GB   | 零部件供应表 |
+| CUSTOMER  | 1500万 | 1.317 GB   | 客户表       |
+| ORDERS    | 1.5亿  | 6.301 GB   | 订单表       |
+| LINEITEM  | 6亿    | 20.882 GB  | 订单明细表   |
+
+## 4. 测试SQL
+
+TPCH 22个测试查询语句 : [TPCH-Query-SQL](https://github.com/apache/incubator-doris/tree/master/tools/tpch-tools/queries)
+
+**注意:**
+
+以上SQL中的以下四个参数在0.15.0 RC04中不存在,在0.15.0 RC04中执行的时候,去掉:
+
+```
+1. enable_vectorized_engine=true,
+2. batch_size=4096,
+3. disable_join_reorder=false
+4. enable_projection=true
+```
+
+
+
+## 5. 测试结果
+
+这里我们使用即将发布的 Doris-1.1版本和 Doris-0.15.0 RC04 版本进行对比测试,测试结果如下:
+
+| Query     | Doris-1.1(s) | 0.15.0 RC04(s) |
+| --------- | ------------ | -------------- |
+| Q1        | 3.75         | 28.63          |
+| Q2        | 4.22         | 7.88           |
+| Q3        | 2.64         | 9.39           |
+| Q4        | 1.5          | 9.3            |
+| Q5        | 2.15         | 4.11           |
+| Q6        | 0.19         | 0.43           |
+| Q7        | 1.04         | 1.61           |
+| Q8        | 1.75         | 50.35          |
+| Q9        | 7.94         | 16.34          |
+| Q10       | 1.41         | 5.21           |
+| Q11       | 0.35         | 1.72           |
+| Q12       | 0.57         | 5.39           |
+| Q13       | 8.15         | 20.88          |
+| Q14       | 0.3          |                |
+| Q15       | 0.66         | 1.86           |
+| Q16       | 0.79         | 1.32           |
+| Q17       | 1.51         | 26.67          |
+| Q18       | 3.364        | 11.77          |
+| Q19       | 0.829        | 1.71           |
+| Q20       | 2.77         | 5.2            |
+| Q21       | 4.47         | 10.34          |
+| Q22       | 0.9          | 3.22           |
+| **total** | **51.253**   | **223.33**     |
+
+**结果说明**
+
+- 测试结果对应的数据集为scale 100, 约6亿条。
+- 测试环境配置为用户常用配置,云服务器4台,16核 64G SSD,1 FE 3 BE 部署。
+- 选用用户常见配置测试以降低用户选型评估成本,但整个测试过程中不会消耗如此多的硬件资源。
+- 测试结果为3次执行取平均值。并且数据经过充分的 compaction(如果在刚导入数据后立刻测试,则查询延迟可能高于本测试结果,compaction的速度正在持续优化中,未来会显著降低)。
+- 0.15 RC04 在 TPC-H 测试中 Q14 执行失败,无法完成查询。
+
+## 6. 环境准备
+
+请先参照 [官方文档](../install/install-deploy.md) 进行 Doris 的安装部署,以获得一个正常运行中的 Doris 集群(至少包含 1 FE 1 BE,推荐 1 FE 3 BE)。
+
+## 7. 数据准备
+
+### 7.1 下载安装 TPC-H 数据生成工具
+
+执行以下脚本下载并编译  [tpch-tools](https://github.com/apache/incubator-doris/tree/master/tools/tpch-tools)  工具。
+
+```shell
+sh build-tpch-dbgen.sh
+```
+
+安装成功后,将在 `TPC-H_Tools_v3.0.0/` 目录下生成 `dbgen` 二进制文件。
+
+### 7.2 生成 TPC-H 测试集
+
+执行以下脚本生成 TPC-H 数据集:
+
+```shell
+sh gen-tpch-data.sh
+```
+
+> 注1:通过 `sh gen-tpch-data.sh -h` 查看脚本帮助。
+>
+> 注2:数据会以 `.tbl` 为后缀生成在  `tpch-data/` 目录下。文件总大小约100GB。生成时间可能在数分钟到1小时不等。
+>
+> 注3:默认生成 100G 的标准测试数据集
+
+### 7.3 建表
+
+#### 7.3.1 准备 `doris-cluster.conf` 文件
+
+在调用导入脚本前,需要将 FE 的 ip 端口等信息写在 `doris-cluster.conf` 文件中。
+
+文件位置和 `load-tpch-data.sh` 平级。
+
+文件内容包括 FE 的 ip,HTTP 端口,用户名,密码以及待导入数据的 DB 名称:
+
+```shell
+# Any of FE host
+export FE_HOST='127.0.0.1'
+# http_port in fe.conf
+export FE_HTTP_PORT=8030
+# query_port in fe.conf
+export FE_QUERY_PORT=9030
+# Doris username
+export USER='root'
+# Doris password
+export PASSWORD=''
+# The database where TPC-H tables located
+export DB='tpch1'
+```
+
+#### 7.3.2 执行以下脚本生成创建 TPC-H 表
+
+```shell
+sh create-tpch-tables.sh
+```
+或者复制 [create-tpch-tables.sql](https://github.com/apache/incubator-doris/blob/master/tools/tpch-tools/create-tpch-tables.sql) 中的建表语句,在 Doris 中执行。
+
+
+### 7.4 导入数据
+
+通过下面的命令执行数据导入:
+
+```shell
+sh ./load-tpch-data.sh
+```
+
+### 7.5 检查导入数据
+
+执行下面的 SQL 语句检查导入的数据量上 上面的数据量是一致。
+
+```sql
+select count(*)  from  lineitem;
+select count(*)  from  orders;
+select count(*)  from  partsupp;
+select count(*)  from  part;
+select count(*)  from  customer;
+select count(*)  from  supplier;
+select count(*)  from  nation;
+select count(*)  from  region;
+select count(*)  from  revenue0;
+```
+
+### 7.6 查询测试
+
+执行上面的测试 SQL 或者 执行下面的命令
+
+```
+./run-tpch-queries.sh
+```
+
+>注意:
+>
+>1. 目前Doris的查询优化器和统计信息功能还不完善,所以我们在TPC-H中重写了一些查询以适应Doris的执行框架,但不影响结果的正确性
+>
+>2. Doris 新的查询优化器将在后续的版本中发布
+>3. 执行查询之前设置 `set mem_exec_limit=8G`
+
+
+


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org