You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@kyuubi.apache.org by ch...@apache.org on 2021/10/12 07:54:21 UTC
[incubator-kyuubi] branch master updated: [KYUUBI #1217] [DOC]
Z-order by and order by performance test
This is an automated email from the ASF dual-hosted git repository.
chengpan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-kyuubi.git
The following commit(s) were added to refs/heads/master by this push:
new cb886e9 [KYUUBI #1217] [DOC] Z-order by and order by performance test
cb886e9 is described below
commit cb886e9a1d406cac7a6d7bfc3f38d28462ae52ec
Author: xiongyinke <10...@qq.com>
AuthorDate: Tue Oct 12 15:54:11 2021 +0800
[KYUUBI #1217] [DOC] Z-order by and order by performance test
<!--
Thanks for sending a pull request!
Here are some tips for you:
1. If this is your first time, please read our contributor guidelines: https://kyuubi.readthedocs.io/en/latest/community/contributions.html
2. If the PR is related to an issue in https://github.com/apache/incubator-kyuubi/issues, add '[KYUUBI #XXXX]' in your PR title, e.g., '[KYUUBI #XXXX] Your PR title ...'.
3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][KYUUBI #XXXX] Your PR title ...'.
-->
### _Why are the changes needed?_
<!--
Please clarify why the changes are needed. For instance,
1. If you add a feature, you can talk about the use case of it.
2. If you fix a bug, you can clarify why it is a bug.
-->
### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible
- [ ] Add screenshots for manual tests if appropriate
- [ ] [Run test](https://kyuubi.readthedocs.io/en/latest/develop_tools/testing.html#running-tests) locally before make a pull request
Closes #1217 from hzxiongyinke/zorder-by_and_order-by_performance_test.
Closes #1217
c0232c68 [xiongyinke] format z-order-benchmark.md
a7d71111 [xiongyinke] update zorder benchmark data
3bf5f81b [xiongyinke] update benchmark result secondary headlines and fix z-order test result;
f5c9dfb5 [hzxiongyinke] Merge pull request #3 from apache/master
6f1892be [hzxiongyinke] Merge pull request #1 from apache/master
Lead-authored-by: xiongyinke <10...@qq.com>
Co-authored-by: hzxiongyinke <75...@users.noreply.github.com>
Signed-off-by: Cheng Pan <ch...@apache.org>
---
docs/sql/z-order-benchmark.md | 55 ++++++++++++++++++++++++++++++++++++-------
1 file changed, 46 insertions(+), 9 deletions(-)
diff --git a/docs/sql/z-order-benchmark.md b/docs/sql/z-order-benchmark.md
index 968b518..d60cb7c 100644
--- a/docs/sql/z-order-benchmark.md
+++ b/docs/sql/z-order-benchmark.md
@@ -23,7 +23,7 @@
</div>
-# z-order benchmark
+# Z-order Benchmark
Z-order is a technique that allows you to map multidimensional data to a single dimension. We did a performance test
@@ -50,6 +50,8 @@ spark.sql(s"drop database if exists $dbName cascade")
spark.sql(s"create database if not exists $dbName")
spark.sql(s"use $dbName")
spark.sql(s"create table $connRandomParquet (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"create table $connOrderbyOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"create table $connOrderby (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
spark.sql(s"create table $connZorderOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
spark.sql(s"create table $connZorder (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
spark.sql(s"show tables").show(false)
@@ -67,15 +69,17 @@ def randomConnRecord(r: Random) = ConnRecord(
dst_ip = randomIPv4(r), dst_port = randomPort(r))
```
-Step3: do optimize with z-order only ip, sort column: src_ip, dst_ip and shuffle partition just as file numbers .
+Step3: do optimize with z-order only ip and do optimize with order by only ip, sort column: src_ip, dst_ip and shuffle partition just as file numbers .
```
+INSERT overwrite table conn_order_only_ip select src_ip, src_port, dst_ip, dst_port from conn_random_parquet order by src_ip, dst_ip;
OPTIMIZE conn_zorder_only_ip ZORDER BY src_ip, dst_ip;
```
-Step4: do optimize with z-order only ip, sort column: src_ip, dst_ip and shuffle partition just as file numbers .
+Step4: do optimize with z-order and do optimize with order by , sort column: src_ip, src_port, dst_ip, dst_port and shuffle partition just as file numbers .
```
+INSERT overwrite table conn_order select src_ip, src_port, dst_ip, dst_port from conn_random_parquet order by src_ip, src_port, dst_ip, dst_port;
OPTIMIZE conn_zorder ZORDER BY src_ip, src_port, dst_ip, dst_port;
```
@@ -111,6 +115,8 @@ spark.sql(s"drop database if exists $dbName cascade")
spark.sql(s"create database if not exists $dbName")
spark.sql(s"use $dbName")
spark.sql(s"create table $connRandomParquet (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"create table $connOrderbyOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"create table $connOrderby (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
spark.sql(s"create table $connZorderOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
spark.sql(s"create table $connZorder (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
spark.sql(s"show tables").show(false)
@@ -150,7 +156,7 @@ spark.stop()
```
-Optimize Sql:
+Z-order Optimize statement:
```sql
@@ -161,9 +167,16 @@ OPTIMIZE conn_zorder_only_ip ZORDER BY src_ip, dst_ip;
OPTIMIZE zorder_test.conn_zorder ZORDER BY src_ip, src_port, dst_ip, dst_port;
```
+ORDER BY statement:
+```
+INSERT overwrite table conn_order_only_ip select src_ip, src_port, dst_ip, dst_port from conn_random_parquet order by src_ip, dst_ip;
+
+INSERT overwrite table conn_order select src_ip, src_port, dst_ip, dst_port from conn_random_parquet order by src_ip, src_port, dst_ip, dst_port;
+
+```
-Query Sql :
+Query statement:
```sql
@@ -177,7 +190,31 @@ select count(*) from conn_zorder where src_ip like '157%' and dst_ip like '216.%
```
-# benchmark result
+## Benchmark result
+
+We have done two performance tests: one is to compare the efficiency of Z-order Optimize and Order by Sort,
+and the other is to query based on the optimized Z-order by data and Random data
+
+### Efficiency of Z-order Optimize and Order-by Sort
+
+**10 billion data and 1000 files and Query resource:200 core 600G memory**
+
+z-order by or order by only ip
+
+| Table | row count | optimize time |
+| ------------------- | -------------- | ------------------ |
+| conn_order_only_ip | 10,000,000,000 | 1591.99 s |
+| conn_zorder_only_ip | 10,000,000,000 | 8371.405 s |
+
+z-order by or order by all columns
+
+| Table | row count | optimize time |
+| ------------------- | -------------- | ------------------ |
+| conn_order | 10,000,000,000 | 1515.298 s |
+| conn_zorder | 10,000,000,000 | 11057.194 s |
+
+### Z-order by benchmark result
+
by querying the tables before and after optimization, we find that
**10 billion data and 200 files and Query resource:200 core 600G memory**
@@ -190,13 +227,13 @@ by querying the tables before and after optimization, we find that
-**10 billion data and 2000 files and Query resource:200 core 600G memory**
+**10 billion data and 1000 files and Query resource:200 core 600G memory**
| Table | Average File Size | Scan row count | Average query time | row count Skipping ratio |
| ------------------- | ----------------- | -------------- | ------------------ | ------------------------ |
| conn_random_parquet | 234.8 M | 10,000,000,000 | 27.031 s | 0.0% |
-| conn_zorder_only_ip | 173.9 M | 43,170,600 | 2.668 s | 99.568% |
-| conn_zorder | 174.0 M | 54,841,302 | 3.207 s | 99.451% |
+| conn_zorder_only_ip | 173.9 M | 53,499,068 | 3.120 s | 99.465% |
+| conn_zorder | 174.0 M | 35,910,500 | 3.103 s | 99.640% |