Posted to reviews@impala.apache.org by "Norbert Luksa (Code Review)" <ge...@cloudera.org> on 2020/02/03 10:16:08 UTC

[Impala-ASF-CR] IMPALA-8755: Backend support for Z-ordering

Norbert Luksa has uploaded a new patch set (#22). ( http://gerrit.cloudera.org:8080/14080 )

Change subject: IMPALA-8755: Backend support for Z-ordering
......................................................................

IMPALA-8755: Backend support for Z-ordering

This change depends on gerrit.cloudera.org/#/c/13955/
(Frontend support for Z-ordering)

The commit adds a comparator based on Z-ordering. For details, see:
https://en.wikipedia.org/wiki/Z-order_curve

Instead of calculating the Z-values of the rows, the comparator
looks for the most significant dimension (column) and compares
the values of that column only. The most significant dimension
is the one where the compared values differ in the highest bit.
The algorithm requires all values to have the same binary
representation, therefore the values are converted into either
uint32_t, uint64_t or uint128_t, whichever is the smallest that
fits all the data. Comparing smaller types with bigger ones would
make the bigger type much more dominant, therefore the bits of
the smaller types are shifted up.
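
A minimal sketch of the comparison described above, assuming all
columns have already been converted to uint32_t. The names
CompareZOrder, LessMsb and WidenShifted are hypothetical and are not
taken from the patch; the tie-break (the earlier column wins when two
columns differ in the same bit) is also an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// True if the highest set bit of x is strictly lower than that of y.
inline bool LessMsb(uint32_t x, uint32_t y) { return x < y && x < (x ^ y); }

// Widens an unsigned value to U and shifts it into the top bits, so a
// narrow column is not dominated by wider ones (assumed helper).
template <typename U, typename T>
U WidenShifted(T v) {
  static_assert(sizeof(U) >= sizeof(T), "target type must not be narrower");
  return static_cast<U>(static_cast<U>(v) << (8 * (sizeof(U) - sizeof(T))));
}

// Compares two rows in Z-order without building their Z-values.
// Returns -1, 0 or 1. Both rows are assumed to have the same length.
int CompareZOrder(const std::vector<uint32_t>& lhs,
                  const std::vector<uint32_t>& rhs) {
  if (lhs.empty()) return 0;
  std::size_t msd = 0;                  // most significant dimension so far
  uint32_t msd_xor = lhs[0] ^ rhs[0];   // differing bits in that dimension
  for (std::size_t i = 1; i < lhs.size(); ++i) {
    const uint32_t x = lhs[i] ^ rhs[i];
    if (LessMsb(msd_xor, x)) {          // column i differs in a higher bit
      msd = i;
      msd_xor = x;
    }
  }
  if (lhs[msd] < rhs[msd]) return -1;
  return lhs[msd] > rhs[msd] ? 1 : 0;
}
```

For example, for rows {1, 0} and {0, 2} the second column differs in
a higher bit than the first, so only the second column is compared
and the first row sorts first.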

All primitive types (including string and floating point types)
are supported.
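
The conversion itself is not shown in this message. As an
illustration only, the sketch below uses commonly known
order-preserving mappings to uint32_t; the function names and the
choice of a 4-byte string prefix are assumptions, not the patch's
actual code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Signed integer: flipping the sign bit makes negatives sort first.
uint32_t FromInt32(int32_t v) {
  return static_cast<uint32_t>(v) ^ 0x80000000u;
}

// IEEE-754 float: invert all bits of negatives, flip only the sign
// bit of non-negatives. NaN handling is left out of this sketch.
uint32_t FromFloat(float v) {
  uint32_t bits;
  std::memcpy(&bits, &v, sizeof(bits));
  return (bits & 0x80000000u) ? ~bits : (bits ^ 0x80000000u);
}

// String: pack the first four bytes big-endian, padding with zeros.
// Strings that share a 4-byte prefix map to the same value.
uint32_t FromString(const std::string& s) {
  uint32_t out = 0;
  for (std::size_t i = 0; i < 4; ++i) {
    out <<= 8;
    if (i < s.size()) out |= static_cast<unsigned char>(s[i]);
  }
  return out;
}
```

With these mappings, unsigned comparison of the results matches the
natural order of the inputs, up to the truncation of long strings.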

Testing:
 * Added unit tests.
 * Ran manual tests, comparing 4-column values with 4-bit
   integers, for all possible combinations. Checked the result by
   calculating the Z-value for each comparison.
 * Tested performance on various data, getting great results for
   selective queries. An example: used the TPCH dataset's
   lineitem table with scale 25, where the sorting columns are
   l_partkey and l_suppkey, in that order. Ran selective queries
   over the value range of the two columns, for both lexical and
   Z-ordering, and compared the percentage of filtered pages and
   row groups. While queries with filters on the first column
   showed almost no difference, queries on the second column
   favoured Z-ordering:
   Ordering | Column | Filtered pages % | Filtered row groups %
   Lex.     | 1st    | ~99%             | ~90%
   Z-ord.   | 1st    | ~99%             | ~89%
   Lex.     | 2nd    | ~25%             | 0%
   Z-ord.   | 2nd    | ~97%             | 0%
   The only drawback is the sorting itself, which takes ~4 times
   longer than lexical sorting (e.g. sorting the dataset above
   took 14m for lexical and 55m for Z-ordering).
   Note, however, that this is a one-time cost: sorting only
   happens once, when writing the data.
   Also, lexical ordering is supported by codegen, while codegen
   is not implemented for Z-ordering yet.
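
The exhaustive check against Z-values mentioned in the testing notes
could look roughly like the sketch below. It is a standalone
illustration, not the test from the patch: all names are
hypothetical, and the interleaving order (column 0 is the most
significant within each bit level) is an assumption.

```cpp
#include <array>
#include <cstdint>

constexpr int kCols = 4;  // four columns, as in the manual test
using Row = std::array<uint32_t, kCols>;

// True if the highest set bit of x is strictly lower than that of y.
inline bool LessMsb(uint32_t x, uint32_t y) { return x < y && x < (x ^ y); }

// Z-order comparison that never builds the Z-value.
int CompareZOrder(const Row& lhs, const Row& rhs) {
  int msd = 0;
  uint32_t msd_xor = lhs[0] ^ rhs[0];
  for (int i = 1; i < kCols; ++i) {
    const uint32_t x = lhs[i] ^ rhs[i];
    if (LessMsb(msd_xor, x)) { msd = i; msd_xor = x; }
  }
  if (lhs[msd] < rhs[msd]) return -1;
  return lhs[msd] > rhs[msd] ? 1 : 0;
}

// Reference Z-value: interleaves the bits of all columns, highest bit first.
uint32_t ZValue(const Row& row, int bits) {
  uint32_t z = 0;
  for (int bit = bits - 1; bit >= 0; --bit) {
    for (int col = 0; col < kCols; ++col) {
      z = (z << 1) | ((row[col] >> bit) & 1u);
    }
  }
  return z;
}

// Unpacks an integer code into kCols values of `bits` bits each.
Row Decode(uint32_t code, int bits) {
  Row row{};
  const uint32_t mask = (1u << bits) - 1u;
  for (int col = 0; col < kCols; ++col) {
    row[col] = (code >> (col * bits)) & mask;
  }
  return row;
}

// Compares every pair of rows both ways and counts disagreements.
// `bits` must be at most 7 so that kCols * bits fits in 32 bits.
uint64_t CountMismatches(int bits) {
  const uint32_t n = 1u << (kCols * bits);
  uint64_t mismatches = 0;
  for (uint32_t a = 0; a < n; ++a) {
    const Row ra = Decode(a, bits);
    const uint32_t za = ZValue(ra, bits);
    for (uint32_t b = 0; b < n; ++b) {
      const Row rb = Decode(b, bits);
      const uint32_t zb = ZValue(rb, bits);
      const int expected = za < zb ? -1 : (za > zb ? 1 : 0);
      if (CompareZOrder(ra, rb) != expected) ++mismatches;
    }
  }
  return mismatches;
}
```

CountMismatches(4) corresponds to the 4-bit case and performs about
4.3 billion comparisons, so smaller widths are handier for a quick
check.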

Change-Id: I0200748ce3e65ebc5d3530f794c0f80aa335a2ab
---
M be/src/exec/exchange-node.cc
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/partial-sort-node.cc
M be/src/exec/partial-sort-node.h
M be/src/exec/sort-node.cc
M be/src/exec/sort-node.h
M be/src/exec/topn-node.cc
M be/src/runtime/data-stream-test.cc
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M be/src/util/CMakeLists.txt
A be/src/util/tuple-row-compare-test.cc
M be/src/util/tuple-row-compare.cc
M be/src/util/tuple-row-compare.h
M fe/src/main/java/org/apache/impala/analysis/TableDef.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
18 files changed, 1,128 insertions(+), 95 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/80/14080/22
-- 
To view, visit http://gerrit.cloudera.org:8080/14080
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0200748ce3e65ebc5d3530f794c0f80aa335a2ab
Gerrit-Change-Number: 14080
Gerrit-PatchSet: 22
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Anonymous Coward (520)
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>