Posted to reviews@kudu.apache.org by "Grant Henke (Code Review)" <ge...@cloudera.org> on 2019/06/13 16:25:16 UTC

[kudu-CR](branch-1.10.x) KUDU-2846: optimize predicate evaluation for primitives

Grant Henke has uploaded this change for review. ( http://gerrit.cloudera.org:8080/13635 )


Change subject: KUDU-2846: optimize predicate evaluation for primitives
......................................................................

KUDU-2846: optimize predicate evaluation for primitives

This change switches primitive columns to an optimized, unrolled-by-8
predicate evaluation.
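
For readers unfamiliar with the technique, here is a minimal sketch of
what an unrolled-by-8 range predicate can look like. It is not the code
from column_predicate.cc: the function name EvalRangeUnrolled8, the
LSB-first selection bitmap layout, and the assumption that null rows have
already been cleared in the bitmap are all illustrative choices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not Kudu's actual implementation.
// Evaluates `lo <= v < hi` over `n` values and ANDs the result into `sel`,
// a selection bitmap where bit (i % 8) of sel[i / 8] stands for row i.
// Assumption: rows that are NULL were already cleared in `sel` by the caller.
template <typename T>
void EvalRangeUnrolled8(const T* vals, std::size_t n, T lo, T hi,
                        std::uint8_t* sel) {
  assert(vals != nullptr && sel != nullptr);
  std::size_t i = 0;
  // Main loop: 8 rows per iteration, producing one full bitmap byte.
  for (; i + 8 <= n; i += 8) {
    const T* v = vals + i;
    // Bitwise & (not &&) avoids a short-circuit branch per comparison.
    auto hit = [&](unsigned k) -> unsigned {
      return static_cast<unsigned>((v[k] >= lo) & (v[k] < hi)) << k;
    };
    const unsigned bits = hit(0) | hit(1) | hit(2) | hit(3) |
                          hit(4) | hit(5) | hit(6) | hit(7);
    sel[i / 8] &= static_cast<std::uint8_t>(bits);
  }
  // Tail: fewer than 8 rows remain; clear only the bits of failing rows.
  for (; i < n; ++i) {
    const unsigned pass =
        static_cast<unsigned>((vals[i] >= lo) & (vals[i] < hi));
    sel[i / 8] &= static_cast<std::uint8_t>(~((pass ^ 1u) << (i % 8)));
  }
}
```

Writing one whole byte per iteration replaces eight per-row bitmap updates
with a single store, and the straight-line body leaves the compiler free to
schedule or vectorize the comparisons.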

Performance is improved by up to 7.2x depending on the particular
predicate, type, and nullability (average around 4.8x). Branches are
reduced by about 6.5x and branch-misses by about 22x.

Hand-coded SIMD might improve on this a little, but it is likely not
worth the effort.

perf-stat before:
 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':
      73905.379627      task-clock (msec)         #    0.997 CPUs utilized
   272,810,081,028      cycles                    #    3.691 GHz
   938,488,388,743      instructions              #    3.44  insn per cycle
   148,052,698,322      branches                  # 2003.274 M/sec
       882,311,138      branch-misses             #    0.60% of all branches

perf-stat after:
 Performance counter stats for 'build/latest/bin/column_predicate-test --gtest_filter=*Bench*':
      15354.077654      task-clock (msec)         #    0.992 CPUs utilized
    56,850,629,856      cycles                    #    3.703 GHz
   181,599,095,960      instructions              #    3.19  insn per cycle
    22,496,453,160      branches                  # 1465.178 M/sec
        38,662,626      branch-misses             #    0.17% of all branches
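
As a cross-check, the aggregate ratios can be recomputed from the two
perf-stat runs above. The sketch below only divides the "before" counters
by the "after" counters; the struct and function names are made up for
illustration. It gives roughly 4.8x for task-clock, 6.6x for branches, and
22.8x for branch-misses.

```cpp
#include <cstdio>

// Counter values copied from the perf-stat output in this message.
struct PerfCounters {
  double task_clock_msec;
  double cycles;
  double instructions;
  double branches;
  double branch_misses;
};

constexpr PerfCounters kBefore{73905.379627, 272810081028.0, 938488388743.0,
                               148052698322.0, 882311138.0};
constexpr PerfCounters kAfter{15354.077654, 56850629856.0, 181599095960.0,
                              22496453160.0, 38662626.0};

// Improvement factor: how many times smaller the "after" value is.
constexpr double Ratio(double before, double after) { return before / after; }

// Prints one improvement factor per counter.
void PrintRatios() {
  std::printf("task-clock:    %.2fx\n",
              Ratio(kBefore.task_clock_msec, kAfter.task_clock_msec));
  std::printf("cycles:        %.2fx\n", Ratio(kBefore.cycles, kAfter.cycles));
  std::printf("instructions:  %.2fx\n",
              Ratio(kBefore.instructions, kAfter.instructions));
  std::printf("branches:      %.2fx\n",
              Ratio(kBefore.branches, kAfter.branches));
  std::printf("branch-misses: %.2fx\n",
              Ratio(kBefore.branch_misses, kAfter.branch_misses));
}
```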

Detailed results before:
  int8   NOT NULL   (c = 0) 632.1M evals/sec	4.44 cycles/eval
  int8   NULL       (c = 0) 515.6M evals/sec	5.48 cycles/eval
  int8   NOT NULL   (c >= 0) 630.8M evals/sec	4.45 cycles/eval
  int8   NULL       (c >= 0) 426.8M evals/sec	6.64 cycles/eval
  int8   NOT NULL   (c >= 0 AND c < 2) 632.6M evals/sec	4.44 cycles/eval
  int8   NULL       (c >= 0 AND c < 2) 384.7M evals/sec	7.38 cycles/eval
  int16  NOT NULL   (c = 0) 644.4M evals/sec	4.34 cycles/eval
  int16  NULL       (c = 0) 524.6M evals/sec	5.37 cycles/eval
  int16  NOT NULL   (c >= 0) 638.4M evals/sec	4.37 cycles/eval
  int16  NULL       (c >= 0) 458.8M evals/sec	6.17 cycles/eval
  int16  NOT NULL   (c >= 0 AND c < 2) 635.3M evals/sec	4.40 cycles/eval
  int16  NULL       (c >= 0 AND c < 2) 335.1M evals/sec	8.50 cycles/eval
  int32  NOT NULL   (c = 0) 645.2M evals/sec	4.34 cycles/eval
  int32  NULL       (c = 0) 492.6M evals/sec	5.77 cycles/eval
  int32  NOT NULL   (c >= 0) 608.6M evals/sec	4.64 cycles/eval
  int32  NULL       (c >= 0) 440.7M evals/sec	6.48 cycles/eval
  int32  NOT NULL   (c >= 0 AND c < 2) 637.8M evals/sec	4.43 cycles/eval
  int32  NULL       (c >= 0 AND c < 2) 348.0M evals/sec	8.22 cycles/eval
  int64  NOT NULL   (c = 0) 642.7M evals/sec	4.36 cycles/eval
  int64  NULL       (c = 0) 505.3M evals/sec	5.60 cycles/eval
  int64  NOT NULL   (c >= 0) 643.5M evals/sec	4.34 cycles/eval
  int64  NULL       (c >= 0) 472.8M evals/sec	6.00 cycles/eval
  int64  NOT NULL   (c >= 0 AND c < 2) 634.2M evals/sec	4.43 cycles/eval
  int64  NULL       (c >= 0 AND c < 2) 396.7M evals/sec	7.21 cycles/eval
  float  NOT NULL   (c = 0) 604.6M evals/sec	4.63 cycles/eval
  float  NULL       (c = 0) 406.7M evals/sec	7.05 cycles/eval
  float  NOT NULL   (c >= 0) 545.3M evals/sec	5.20 cycles/eval
  float  NULL       (c >= 0) 384.4M evals/sec	7.39 cycles/eval
  float  NOT NULL   (c >= 0 AND c < 2) 583.2M evals/sec	4.80 cycles/eval
  float  NULL       (c >= 0 AND c < 2) 312.2M evals/sec	9.12 cycles/eval
  double NOT NULL   (c = 0) 614.0M evals/sec	4.56 cycles/eval
  double NULL       (c = 0) 471.5M evals/sec	5.99 cycles/eval
  double NOT NULL   (c >= 0) 623.0M evals/sec	4.48 cycles/eval
  double NULL       (c >= 0) 379.9M evals/sec	7.47 cycles/eval
  double NOT NULL   (c >= 0 AND c < 2) 599.5M evals/sec	4.67 cycles/eval
  double NULL       (c >= 0 AND c < 2) 415.2M evals/sec	6.82 cycles/eval

Detailed results after:
  int8   NOT NULL   (c = 0) 3660.3M evals/sec	0.76 cycles/eval
  int8   NULL       (c = 0) 3657.1M evals/sec	0.76 cycles/eval
  int8   NOT NULL   (c >= 0) 3712.0M evals/sec	0.75 cycles/eval
  int8   NULL       (c >= 0) 3618.9M evals/sec	0.78 cycles/eval
  int8   NOT NULL   (c >= 0 AND c < 2) 1661.9M evals/sec	1.73 cycles/eval
  int8   NULL       (c >= 0 AND c < 2) 1663.4M evals/sec	1.77 cycles/eval
  int16  NOT NULL   (c = 0) 3781.4M evals/sec	0.73 cycles/eval
  int16  NULL       (c = 0) 3738.3M evals/sec	0.74 cycles/eval
  int16  NOT NULL   (c >= 0) 3672.9M evals/sec	0.76 cycles/eval
  int16  NULL       (c >= 0) 3767.4M evals/sec	0.75 cycles/eval
  int16  NOT NULL   (c >= 0 AND c < 2) 1654.3M evals/sec	1.77 cycles/eval
  int16  NULL       (c >= 0 AND c < 2) 1651.6M evals/sec	1.72 cycles/eval
  int32  NOT NULL   (c = 0) 2925.1M evals/sec	0.97 cycles/eval
  int32  NULL       (c = 0) 2844.4M evals/sec	0.97 cycles/eval
  int32  NOT NULL   (c >= 0) 2942.7M evals/sec	0.95 cycles/eval
  int32  NULL       (c >= 0) 2900.8M evals/sec	0.98 cycles/eval
  int32  NOT NULL   (c >= 0 AND c < 2) 1641.1M evals/sec	1.73 cycles/eval
  int32  NULL       (c >= 0 AND c < 2) 1638.8M evals/sec	1.75 cycles/eval
  int64  NOT NULL   (c = 0) 3878.6M evals/sec	0.71 cycles/eval
  int64  NULL       (c = 0) 3763.9M evals/sec	0.76 cycles/eval
  int64  NOT NULL   (c >= 0) 2784.4M evals/sec	1.01 cycles/eval
  int64  NULL       (c >= 0) 2782.6M evals/sec	1.01 cycles/eval
  int64  NOT NULL   (c >= 0 AND c < 2) 1671.4M evals/sec	1.71 cycles/eval
  int64  NULL       (c >= 0 AND c < 2) 1741.5M evals/sec	1.64 cycles/eval
  float  NOT NULL   (c = 0) 3940.8M evals/sec	0.72 cycles/eval
  float  NULL       (c = 0) 3820.9M evals/sec	0.72 cycles/eval
  float  NOT NULL   (c >= 0) 4571.4M evals/sec	0.60 cycles/eval
  float  NULL       (c >= 0) 4741.3M evals/sec	0.58 cycles/eval
  float  NOT NULL   (c >= 0 AND c < 2) 1318.0M evals/sec	2.18 cycles/eval
  float  NULL       (c >= 0 AND c < 2) 1262.3M evals/sec	2.28 cycles/eval
  double NOT NULL   (c = 0) 2813.4M evals/sec	1.01 cycles/eval
  double NULL       (c = 0) 2664.6M evals/sec	1.06 cycles/eval
  double NOT NULL   (c >= 0) 3620.8M evals/sec	0.77 cycles/eval
  double NULL       (c >= 0) 3657.2M evals/sec	0.76 cycles/eval
  double NOT NULL   (c >= 0 AND c < 2) 1248.8M evals/sec	2.30 cycles/eval
  double NULL       (c >= 0 AND c < 2) 1253.7M evals/sec	2.28 cycles/eval

Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
Reviewed-on: http://gerrit.cloudera.org:8080/13591
Tested-by: Kudu Jenkins
Reviewed-by: Andrew Wong <aw...@cloudera.com>
(cherry picked from commit 349aeaab33d33ba1ed323a6a4ff1bd6eee971d85)
---
M src/kudu/common/CMakeLists.txt
M src/kudu/common/column_predicate-test.cc
M src/kudu/common/column_predicate.cc
3 files changed, 147 insertions(+), 13 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/35/13635/1
-- 
To view, visit http://gerrit.cloudera.org:8080/13635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.10.x
Gerrit-MessageType: newchange
Gerrit-Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
Gerrit-Change-Number: 13635
Gerrit-PatchSet: 1
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR](branch-1.10.x) KUDU-2846: optimize predicate evaluation for primitives

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/13635 )

Change subject: KUDU-2846: optimize predicate evaluation for primitives
......................................................................

Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
Reviewed-on: http://gerrit.cloudera.org:8080/13591
Tested-by: Kudu Jenkins
Reviewed-by: Andrew Wong <aw...@cloudera.com>
(cherry picked from commit 349aeaab33d33ba1ed323a6a4ff1bd6eee971d85)
Reviewed-on: http://gerrit.cloudera.org:8080/13635
---
M src/kudu/common/CMakeLists.txt
M src/kudu/common/column_predicate-test.cc
M src/kudu/common/column_predicate.cc
3 files changed, 147 insertions(+), 13 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Andrew Wong: Looks good to me, approved


Gerrit-Project: kudu
Gerrit-Branch: branch-1.10.x
Gerrit-MessageType: merged
Gerrit-Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
Gerrit-Change-Number: 13635
Gerrit-PatchSet: 2
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR](branch-1.10.x) KUDU-2846: optimize predicate evaluation for primitives

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/13635 )

Change subject: KUDU-2846: optimize predicate evaluation for primitives
......................................................................


Patch Set 1: Code-Review+2



Gerrit-Project: kudu
Gerrit-Branch: branch-1.10.x
Gerrit-MessageType: comment
Gerrit-Change-Id: I9dd062961a3cd2c892997d6aba12684e603628a1
Gerrit-Change-Number: 13635
Gerrit-PatchSet: 1
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 13 Jun 2019 18:09:57 +0000
Gerrit-HasComments: No