Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/28 06:24:40 UTC

[GitHub] [iceberg] linfey90 commented on pull request #5824: Spark: support hilbert curve when rewrite

linfey90 commented on PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#issuecomment-1366402542

   Hilbert clusters data better than Z-order. Here is a simple performance test.
   1: Prepare a Parquet table with one hundred million rows and 11 columns,
   including two columns named c1 and c2 whose values range from 0 to 1000000
   (per the fields.c1.max and fields.c2.max settings below).
   The Flink SQL used to generate the data looks like this:
   CREATE TABLE default_catalog.default_database.dg (
       c1 INT,
       c2 BIGINT,
       c3 VARCHAR,
       c4 VARCHAR,
       c5 TINYINT,
       c6 SMALLINT,
       c7 FLOAT,
       c8 DOUBLE,
       c9 CHAR,
       c10 BOOLEAN,
       c11 AS LOCALTIMESTAMP
   ) WITH (
       'connector' = 'datagen',
       'fields.c3.length' = '10',
       'fields.c4.length' = '10',
       'fields.c1.min' = '0',
       'fields.c1.max' = '1000000',
       'fields.c2.min' = '0',
       'fields.c2.max' = '1000000',
       'rows-per-second' = '30000',
       'number-of-rows' = '100000000'
   );
   2: Create two tables, test_zorder and test_hilbert, and copy the generated data into each.
   3: Rewrite each table sorted on c1 and c2, using Z-order for test_zorder and the Hilbert curve for test_hilbert.
   4: Write code to count the skipped files, then run SELECT COUNT queries with the conditions below.
   | Query condition | Table | Files skipped | Total files | Files skipped (%) | Query time |
   | ------------- |:-------------:| -----:| -----:| -----:| -----:|
   | c1 < 500000 and c2 < 500000 | hilbert | 97 | 171 | 56.73% | 1.018s |
   | c1 < 500000 and c2 < 500000 | zorder | 82 | 180 | 45.56% | 1.353s |
   | c1 > 500000 and c2 > 500000 | hilbert | 28 | 171 | 16.37% | 3.337s |
   | c1 > 500000 and c2 > 500000 | zorder | 18 | 180 | 10.00% | 3.37s |
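
The percentage column in the table follows directly from the skipped and total file counts. A minimal sketch that recomputes it, using only the counts reported above (the names `results` and `skip_percentage` are illustrative, not part of the PR):

```python
# Skipped and total data file counts, copied from the table above.
results = {
    ("c1 < 500000 and c2 < 500000", "hilbert"): (97, 171),
    ("c1 < 500000 and c2 < 500000", "zorder"): (82, 180),
    ("c1 > 500000 and c2 > 500000", "hilbert"): (28, 171),
    ("c1 > 500000 and c2 > 500000", "zorder"): (18, 180),
}


def skip_percentage(skipped, total):
    """Return the share of data files that were skipped, in percent."""
    return 100.0 * skipped / total


# Round to two decimals so the values can be compared with the table.
percentages = {
    key: round(skip_percentage(skipped, total), 2)
    for key, (skipped, total) in results.items()
}

for (condition, table), pct in percentages.items():
    print(f"{condition:30} {table:8} {pct:6.2f}%")
```

For both query conditions the Hilbert table skips a larger share of its files than the Z-order table.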
   
   Note: The query time depends on the cluster environment and is for reference only, but the file skip counts are stable.
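
One intuition for why a Hilbert sort might cluster two columns more tightly than Z-order: consecutive positions on a Hilbert curve are always neighbouring grid cells, whereas a Z-order curve makes long jumps at quadrant boundaries. The sketch below illustrates this on a small 2D grid. It uses the textbook index-to-coordinate mappings, not the implementation from this PR or Iceberg's Z-order code, and all function names are hypothetical.

```python
def hilbert_d2xy(order, d):
    """Map position d on a Hilbert curve to (x, y) on a 2^order x 2^order grid."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the current sub-square so the curve stays continuous.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def zorder_d2xy(order, d):
    """Map position d on a Z-order curve to (x, y) by de-interleaving its bits."""
    x = y = 0
    for bit in range(order):
        x |= ((d >> (2 * bit)) & 1) << bit
        y |= ((d >> (2 * bit + 1)) & 1) << bit
    return x, y


def step_lengths(curve, order):
    """Manhattan distance between each pair of consecutive cells on the curve."""
    cells = [curve(order, d) for d in range(4 ** order)]
    return [
        abs(x1 - x0) + abs(y1 - y0)
        for (x0, y0), (x1, y1) in zip(cells, cells[1:])
    ]


order = 3  # 8 x 8 grid
hilbert_steps = step_lengths(hilbert_d2xy, order)
zorder_steps = step_lengths(zorder_d2xy, order)
print("longest Hilbert step:", max(hilbert_steps))
print("longest Z-order step:", max(zorder_steps))
```

Rows that are adjacent in sort order end up in the same file, so shorter steps suggest tighter per-file min/max ranges on c1 and c2. This is only an illustration of the curves' geometry; the file skip counts in the table are the actual measurement.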
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: issues-help@iceberg.apache.org