You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@kyuubi.apache.org by ch...@apache.org on 2021/10/12 07:54:21 UTC

[incubator-kyuubi] branch master updated: [KYUUBI #1217] [DOC] Z-order by and order by performance test

This is an automated email from the ASF dual-hosted git repository.

chengpan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-kyuubi.git


The following commit(s) were added to refs/heads/master by this push:
     new cb886e9  [KYUUBI #1217] [DOC] Z-order by and order by performance test
cb886e9 is described below

commit cb886e9a1d406cac7a6d7bfc3f38d28462ae52ec
Author: xiongyinke <10...@qq.com>
AuthorDate: Tue Oct 12 15:54:11 2021 +0800

    [KYUUBI #1217] [DOC] Z-order by and order by performance test
    
    <!--
    Thanks for sending a pull request!
    
    Here are some tips for you:
      1. If this is your first time, please read our contributor guidelines: https://kyuubi.readthedocs.io/en/latest/community/contributions.html
      2. If the PR is related to an issue in https://github.com/apache/incubator-kyuubi/issues, add '[KYUUBI #XXXX]' in your PR title, e.g., '[KYUUBI #XXXX] Your PR title ...'.
      3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][KYUUBI #XXXX] Your PR title ...'.
    -->
    
    ### _Why are the changes needed?_
    <!--
    Please clarify why the changes are needed. For instance,
      1. If you add a feature, you can talk about the use case of it.
      2. If you fix a bug, you can clarify why it is a bug.
    -->
    
    ### _How was this patch tested?_
    - [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible
    
    - [ ] Add screenshots for manual tests if appropriate
    
    - [ ] [Run test](https://kyuubi.readthedocs.io/en/latest/develop_tools/testing.html#running-tests) locally before make a pull request
    
    Closes #1217 from hzxiongyinke/zorder-by_and_order-by_performance_test.
    
    Closes #1217
    
    c0232c68 [xiongyinke] format z-order-benchmark.md
    a7d71111 [xiongyinke] update  zorder benchmark data
    3bf5f81b [xiongyinke] update benchmark result secondary headlines and fix z-order test result;
    f5c9dfb5 [hzxiongyinke] Merge pull request #3 from apache/master
    6f1892be [hzxiongyinke] Merge pull request #1 from apache/master
    
    Lead-authored-by: xiongyinke <10...@qq.com>
    Co-authored-by: hzxiongyinke <75...@users.noreply.github.com>
    Signed-off-by: Cheng Pan <ch...@apache.org>
---
 docs/sql/z-order-benchmark.md | 55 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 46 insertions(+), 9 deletions(-)

diff --git a/docs/sql/z-order-benchmark.md b/docs/sql/z-order-benchmark.md
index 968b518..d60cb7c 100644
--- a/docs/sql/z-order-benchmark.md
+++ b/docs/sql/z-order-benchmark.md
@@ -23,7 +23,7 @@
 
 </div>
 
-# z-order benchmark
+# Z-order Benchmark
 
 Z-order is a technique that allows you to map multidimensional data to a single dimension. We did a performance test
 
@@ -50,6 +50,8 @@ spark.sql(s"drop database if exists $dbName cascade")
 spark.sql(s"create database if not exists $dbName")
 spark.sql(s"use $dbName")
 spark.sql(s"create table $connRandomParquet (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"create table $connOrderbyOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"create table $connOrderby (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
 spark.sql(s"create table $connZorderOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
 spark.sql(s"create table $connZorder (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
 spark.sql(s"show tables").show(false)
@@ -67,15 +69,17 @@ def randomConnRecord(r: Random) = ConnRecord(
   dst_ip = randomIPv4(r), dst_port = randomPort(r))
 ```
 
-Step3: do optimize with z-order only ip, sort column: src_ip, dst_ip and shuffle partition just as file numbers .
+Step3: do optimize with z-order only ip and do optimize with order by only ip, sort column: src_ip, dst_ip and shuffle partition just as file numbers .
 
 ```
+INSERT overwrite table conn_order_only_ip select src_ip, src_port, dst_ip, dst_port from conn_random_parquet order by src_ip, dst_ip;
 OPTIMIZE conn_zorder_only_ip ZORDER BY src_ip, dst_ip;
 ```
 
-Step4: do optimize with z-order only ip, sort column: src_ip, dst_ip and shuffle partition just as file numbers .
+Step4: do optimize with z-order and do optimize with order by , sort column: src_ip, src_port, dst_ip, dst_port and shuffle partition just as file numbers .
 
 ```
+INSERT overwrite table conn_order select src_ip, src_port, dst_ip, dst_port from conn_random_parquet order by src_ip, src_port, dst_ip, dst_port;
 OPTIMIZE conn_zorder ZORDER BY src_ip, src_port, dst_ip, dst_port;
 ```
 
@@ -111,6 +115,8 @@ spark.sql(s"drop database if exists $dbName cascade")
 spark.sql(s"create database if not exists $dbName")
 spark.sql(s"use $dbName")
 spark.sql(s"create table $connRandomParquet (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"create table $connOrderbyOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"create table $connOrderby (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
 spark.sql(s"create table $connZorderOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
 spark.sql(s"create table $connZorder (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
 spark.sql(s"show tables").show(false)
@@ -150,7 +156,7 @@ spark.stop()
 
 ```
 
-Optimize Sql:
+Z-order Optimize statement:
 
 ```sql
 
@@ -161,9 +167,16 @@ OPTIMIZE conn_zorder_only_ip ZORDER BY src_ip, dst_ip;
 OPTIMIZE zorder_test.conn_zorder ZORDER BY src_ip, src_port, dst_ip, dst_port;
 ```
 
+ORDER BY statement:
 
+```
+INSERT overwrite table conn_order_only_ip select src_ip, src_port, dst_ip, dst_port from conn_random_parquet order by src_ip, dst_ip;
+
+INSERT overwrite table conn_order select src_ip, src_port, dst_ip, dst_port from conn_random_parquet order by src_ip, src_port, dst_ip, dst_port;
+
+```
 
-Query Sql : 
+Query statement:
 
 ```sql
 
@@ -177,7 +190,31 @@ select count(*) from conn_zorder where src_ip like '157%' and dst_ip like '216.%
 ```
 
 
-# benchmark result
+## Benchmark result
+
+We have done two performance tests: one is to compare the efficiency of  Z-order Optimize and Order by Sort, 
+and the other is to query based on the optimized Z-order by data and Random data
+
+### Efficiency of Z-order Optimize and Order-by Sort
+
+**10 billion data and 1000 files and Query resource:200 core 600G memory**
+
+z-order by or order by only ip
+
+| Table               | row count      | optimize  time     |
+| ------------------- | -------------- | ------------------ |
+| conn_order_only_ip  | 10,000,000,000 | 1591.99 s          |
+| conn_zorder_only_ip | 10,000,000,000 | 8371.405 s         |
+
+z-order by or order by all columns
+
+| Table               | row count      | optimize  time     |
+| ------------------- | -------------- | ------------------ |
+| conn_order          | 10,000,000,000 | 1515.298 s         |
+| conn_zorder         | 10,000,000,000 | 11057.194 s        |
+
+### Z-order by benchmark result
+
 by querying the tables before and after optimization, we find that
 
 **10 billion data and 200 files and Query resource:200 core 600G memory**
@@ -190,13 +227,13 @@ by querying the tables before and after optimization, we find that
 
 
 
-**10 billion data and 2000 files and Query resource:200 core 600G memory**
+**10 billion data and 1000 files and Query resource:200 core 600G memory**
 
 | Table               | Average File Size | Scan row count | Average query time | row count Skipping ratio |
 | ------------------- | ----------------- | -------------- | ------------------ | ------------------------ |
 | conn_random_parquet | 234.8 M           | 10,000,000,000 | 27.031 s           | 0.0%                     |
-| conn_zorder_only_ip | 173.9 M           | 43,170,600     | 2.668 s            | 99.568%                  |
-| conn_zorder         | 174.0 M           | 54,841,302     | 3.207 s            | 99.451%                  |
+| conn_zorder_only_ip | 173.9 M           | 53,499,068     | 3.120 s            | 99.465%                  |
+| conn_zorder         | 174.0 M           | 35,910,500     | 3.103 s            | 99.640%                  |