Posted to reviews@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2020/04/01 22:29:00 UTC

[kudu-CR] columnar serialization: use AVX2 for int32 and int64 copying

Hello Andrew Wong, Grant Henke,

I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/15634

to review the following change.


Change subject: columnar_serialization: use AVX2 for int32 and int64 copying
......................................................................

columnar_serialization: use AVX2 for int32 and int64 copying

This uses the AVX2 "gather" instructions to do the copying of selected
int32s and int64s. The following improvements were observed:

Int32:
  Converting 10_int32_non_null to PB (method columnar) row select rate 1: 0.8829691 cycles/cell -> 0.8386091 cycles/cell
  Converting 10_int32_non_null to PB (method columnar) row select rate 0.8: 1.86863074 cycles/cell -> 1.61456746 cycles/cell
  Converting 10_int32_non_null to PB (method columnar) row select rate 0.5: 2.3829623 cycles/cell -> 2.05157198 cycles/cell
  Converting 10_int32_non_null to PB (method columnar) row select rate 0.2: 4.15909214 cycles/cell -> 3.82449024 cycles/cell
  Converting 10_int32_0pct_null to PB (method columnar) row select rate 1: 1.04652828 cycles/cell -> 1.01822806 cycles/cell
  Converting 10_int32_0pct_null to PB (method columnar) row select rate 0.8: 2.10860372 cycles/cell -> 1.85333702 cycles/cell
  Converting 10_int32_0pct_null to PB (method columnar) row select rate 0.5: 2.75141002 cycles/cell -> 2.39638206 cycles/cell
  Converting 10_int32_0pct_null to PB (method columnar) row select rate 0.2: 4.6968821 cycles/cell -> 4.40193506 cycles/cell
  Converting 10_int32_10pct_null to PB (method columnar) row select rate 1: 1.31809924 cycles/cell -> 1.31851512 cycles/cell
  Converting 10_int32_10pct_null to PB (method columnar) row select rate 0.8: 2.36648378 cycles/cell -> 2.12030662 cycles/cell
  Converting 10_int32_10pct_null to PB (method columnar) row select rate 0.5: 2.98480266 cycles/cell -> 2.7476185 cycles/cell
  Converting 10_int32_10pct_null to PB (method columnar) row select rate 0.2: 5.0439634 cycles/cell -> 4.5842071 cycles/cell

Int64:
  Converting 10_int64_non_null to PB (method columnar) row select rate 1: 1.32330358 cycles/cell -> 1.24855148 cycles/cell
  Converting 10_int64_non_null to PB (method columnar) row select rate 0.8: 2.04848734 cycles/cell -> 2.12979712 cycles/cell
  Converting 10_int64_non_null to PB (method columnar) row select rate 0.5: 2.50150968 cycles/cell -> 2.5724664 cycles/cell
  Converting 10_int64_non_null to PB (method columnar) row select rate 0.2: 4.4513395 cycles/cell -> 4.35936382 cycles/cell
  Converting 10_int64_0pct_null to PB (method columnar) row select rate 1: 1.5080423 cycles/cell -> 1.51448434 cycles/cell
  Converting 10_int64_0pct_null to PB (method columnar) row select rate 0.8: 2.34286302 cycles/cell -> 2.26529584 cycles/cell
  Converting 10_int64_0pct_null to PB (method columnar) row select rate 0.5: 2.99375316 cycles/cell -> 2.7263687 cycles/cell
  Converting 10_int64_0pct_null to PB (method columnar) row select rate 0.2: 5.01722324 cycles/cell -> 4.71793008 cycles/cell
  Converting 10_int64_10pct_null to PB (method columnar) row select rate 1: 1.7227708 cycles/cell -> 1.67661726 cycles/cell
  Converting 10_int64_10pct_null to PB (method columnar) row select rate 0.8: 2.68160422 cycles/cell -> 2.50480846 cycles/cell
  Converting 10_int64_10pct_null to PB (method columnar) row select rate 0.5: 3.29833934 cycles/cell -> 3.05940708 cycles/cell
  Converting 10_int64_10pct_null to PB (method columnar) row select rate 0.2: 5.42127834 cycles/cell -> 4.99359244 cycles/cell

In the few places that the above indicates a regression, I looped that
same test case and found that the "after" was indeed either
indistinguishable or slightly faster. The test results just have a
little bit of noise.

Change-Id: I6c9a536b78a524e8178f5d4a0d2dea04deedbd78
---
M src/kudu/common/columnar_serialization.cc
1 file changed, 68 insertions(+), 0 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/15634/1
-- 
To view, visit http://gerrit.cloudera.org:8080/15634
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I6c9a536b78a524e8178f5d4a0d2dea04deedbd78
Gerrit-Change-Number: 15634
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <an...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>

[kudu-CR] columnar serialization: use AVX2 for int32 and int64 copying

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/15634 )

Change subject: columnar_serialization: use AVX2 for int32 and int64 copying
......................................................................


Patch Set 1:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc
File src/kudu/common/columnar_serialization.cc:

http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@323
PS1, Line 323: static bool has_avx2 = base::CPU().has_avx2();
> warning: 'has_avx2' is a static definition in anonymous namespace; static i
Done


http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@323
PS1, Line 323: static bool
> static const bool
Done


http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@328
PS1, Line 328: type_size
> nit: rename this sizeof_type too?
Done


http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@328
PS1, Line 328: nt type_size
> Could you add a comment explaining what exactly type_size is in this context?
Done (added at the main entry point)


http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@330
PS1, Line 330:     const uint16_t* __restrict__ sel_rows,
> warning: parameter 'sel_rows' is unused [misc-unused-parameters]
Done


http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@331
PS1, Line 331:     int n_sel_rows,
> warning: parameter 'n_sel_rows' is unused [misc-unused-parameters]
Done


http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@332
PS1, Line 332:     const uint8_t* __restrict__ src_buf,
> warning: parameter 'src_buf' is unused [misc-unused-parameters]
Done


http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@333
PS1, Line 333:     uint8_t* __restrict__ dst_buf) {
> warning: parameter 'dst_buf' is unused [misc-unused-parameters]
Done


http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@340
PS1, Line 340: #if __x86_64__ && (defined(__clang__) || (defined(__GNUC__) && __GNUC__ >= 5)
> Alternatively could use following technique that detects whether compiler s
GCC 4 supports AVX2, but it doesn't know how to expose AVX2 intrinsics on a per-function basis according to the function's target attribute.
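
For illustration only, here is a minimal sketch of the per-function pattern that the guard on line 340 protects. It is not the code from the patch: the `AddOne8*` functions and the `SKETCH_AVX2_TARGET` macro are hypothetical names, and `__builtin_cpu_supports` (a GCC/Clang builtin) stands in for the patch's `base::CPU().has_avx2()` check. Only the function carrying the target attribute is compiled with AVX2 enabled, so the rest of the file still builds for a baseline x86_64 CPU.

```cpp
#include <cstdint>

// Same shape as the guard quoted above: x86_64 plus a compiler that can
// enable AVX2 intrinsics per function via the target attribute.
#if defined(__x86_64__) && (defined(__clang__) || (defined(__GNUC__) && __GNUC__ >= 5))
#include <immintrin.h>
#define SKETCH_AVX2_TARGET 1
#else
#define SKETCH_AVX2_TARGET 0
#endif

// Portable fallback: dst[i] = src[i] + 1 for eight int32 values.
static void AddOne8Scalar(const int32_t* src, int32_t* dst) {
  for (int i = 0; i < 8; ++i) dst[i] = src[i] + 1;
}

#if SKETCH_AVX2_TARGET
// Only this function is compiled with AVX2 enabled; no -mavx2 flag is
// needed for the translation unit as a whole.
__attribute__((target("avx2")))
static void AddOne8Avx2(const int32_t* src, int32_t* dst) {
  __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src));
  v = _mm256_add_epi32(v, _mm256_set1_epi32(1));
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst), v);
}
#endif

// Entry point: pick the AVX2 variant only if the running CPU supports it.
void AddOne8(const int32_t* src, int32_t* dst) {
#if SKETCH_AVX2_TARGET
  // Assumption: the builtin replaces the patch's base::CPU().has_avx2().
  static const bool has_avx2 = __builtin_cpu_supports("avx2");
  if (has_avx2) {
    AddOne8Avx2(src, dst);
    return;
  }
#endif
  AddOne8Scalar(src, dst);
}
```

Callers only ever see `AddOne8`; on older compilers or non-x86 targets the preprocessor removes the AVX2 variant and the scalar path is used unconditionally.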


http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@343
PS1, Line 343: 4
> For sake of readability, could declare a static constexpr variable size_of_
Done


http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@349
PS1, Line 349:   int iters = n_sel_rows / 8;
> It'd be good to have a variable that derives 8 which is basically the numbe
Done


http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@351
PS1, Line 351: __m256i indexes = _mm256_cvtepu16_epi32(*reinterpret_cast<const __m128i*>(sel_rows));
> Why not load 16 indexes into a 256-bit variable instead of 8?
We can only fit 8 int32s into the 256-bit vector on the next line, so loading 16 indexes wouldn't help us at all. Also, the indexes have to be zero-extended to 32 bits because the gather instruction doesn't support 16-bit indexes.
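
To make the widths concrete, here is a small sketch of just the index-widening step. The function names and the `SKETCH_AVX2` macro are hypothetical, `__builtin_cpu_supports` is an assumed stand-in for the patch's CPU check, and an explicit unaligned load is used where the patch dereferences a `__m128i` pointer. Eight 16-bit indexes fill 128 bits; zero-extending each one to 32 bits yields exactly 256 bits, which is one full AVX2 register.

```cpp
#include <cstdint>

#if defined(__x86_64__) && (defined(__clang__) || (defined(__GNUC__) && __GNUC__ >= 5))
#include <immintrin.h>
#define SKETCH_AVX2 1
#else
#define SKETCH_AVX2 0
#endif

#if SKETCH_AVX2
__attribute__((target("avx2")))
static void WidenIndexes8Avx2(const uint16_t* sel_rows, int32_t* out) {
  // 8 x 16-bit indexes = 128 bits, i.e. one XMM register.
  __m128i packed = _mm_loadu_si128(reinterpret_cast<const __m128i*>(sel_rows));
  // Zero-extend each index to 32 bits: 8 x 32 = 256 bits, one YMM register.
  __m256i widened = _mm256_cvtepu16_epi32(packed);
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), widened);
}
#endif

// Widens exactly eight row indexes; falls back to a scalar loop when AVX2
// is unavailable at compile time or at run time.
void WidenIndexes8(const uint16_t* sel_rows, int32_t* out) {
#if SKETCH_AVX2
  if (__builtin_cpu_supports("avx2")) {
    WidenIndexes8Avx2(sel_rows, out);
    return;
  }
#endif
  for (int i = 0; i < 8; ++i) out[i] = sel_rows[i];
}
```

Because the extension is unsigned, an index such as 65535 stays 65535 rather than becoming -1, which matters once the widened values are used as gather offsets.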


http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@349
PS1, Line 349:   int iters = n_sel_rows / 8;
             :   while (iters--) {
             :     __m256i indexes = _mm256_cvtepu16_epi32(*reinterpret_cast<const __m128i*>(sel_rows));
             :     __m256i elems = _mm256_i32gather_epi32(src_buf, indexes, sizeof(int32_t));
             :     _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst_buf), elems);
             :     dst_buf += 8 * sizeof(int32_t);
             :     sel_rows += 8;
             :   }
> +1.
Done




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6c9a536b78a524e8178f5d4a0d2dea04deedbd78
Gerrit-Change-Number: 15634
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <an...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 02 Apr 2020 20:17:46 +0000
Gerrit-HasComments: Yes

[kudu-CR] columnar serialization: use AVX2 for int32 and int64 copying

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/15634 )

Change subject: columnar_serialization: use AVX2 for int32 and int64 copying
......................................................................

columnar_serialization: use AVX2 for int32 and int64 copying

Change-Id: I6c9a536b78a524e8178f5d4a0d2dea04deedbd78
Reviewed-on: http://gerrit.cloudera.org:8080/15634
Tested-by: Todd Lipcon <to...@apache.org>
Reviewed-by: Andrew Wong <aw...@cloudera.com>
---
M src/kudu/common/columnar_serialization.cc
1 file changed, 93 insertions(+), 9 deletions(-)

Approvals:
  Todd Lipcon: Verified
  Andrew Wong: Looks good to me, approved


Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I6c9a536b78a524e8178f5d4a0d2dea04deedbd78
Gerrit-Change-Number: 15634
Gerrit-PatchSet: 4
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <an...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] columnar serialization: use AVX2 for int32 and int64 copying

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Bankim Bhavsar has posted comments on this change. ( http://gerrit.cloudera.org:8080/15634 )

Change subject: columnar_serialization: use AVX2 for int32 and int64 copying
......................................................................


Patch Set 2:

> Patch Set 2: Verified-1
> 
> Build Failed 
> 
> http://jenkins.kudu.apache.org/job/kudu-gerrit/21281/ : FAILURE

GitHub was down. Will need to rebase/retrigger the Jenkins verification job.



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6c9a536b78a524e8178f5d4a0d2dea04deedbd78
Gerrit-Change-Number: 15634
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <an...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 03 Apr 2020 18:45:12 +0000
Gerrit-HasComments: No

[kudu-CR] columnar serialization: use AVX2 for int32 and int64 copying

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Bankim Bhavsar has posted comments on this change. ( http://gerrit.cloudera.org:8080/15634 )

Change subject: columnar_serialization: use AVX2 for int32 and int64 copying
......................................................................


Patch Set 1:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc
File src/kudu/common/columnar_serialization.cc:

http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@323
PS1, Line 323: static bool
static const bool


http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@328
PS1, Line 328: nt type_size
Could you add a comment explaining what exactly type_size is in this context?


http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@340
PS1, Line 340: #if __x86_64__ && (defined(__clang__) || (defined(__GNUC__) && __GNUC__ >= 5)
Alternatively, could use the following technique that detects whether the compiler supports AVX2:
https://github.com/apache/kudu/blob/master/src/kudu/util/CMakeLists.txt#L259


http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@343
PS1, Line 343: 4
For the sake of readability, could declare a static constexpr variable size_of_type to be 4 (i.e. sizeof(int32_t)).
Same for below.


http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@349
PS1, Line 349:   int iters = n_sel_rows / 8;
It'd be good to have a named variable from which the 8 is derived, since it is basically the number of sel_rows processed in a single iteration below.


Same for below.
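
One possible shape for this, as a hedged sketch rather than what the patch ended up with: derive the per-iteration element count from the vector width and the element size, so that 8 (int32) and 4 (int64) are no longer magic numbers. `kVectorBytes`, `ElemsPerIter`, `LoopPlan`, and `PlanLoop` are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Width of one AVX2 (YMM) register in bytes.
constexpr size_t kVectorBytes = 32;

// Number of selected rows handled by one vector iteration for element type T.
template <typename T>
constexpr int ElemsPerIter() {
  return static_cast<int>(kVectorBytes / sizeof(T));
}

static_assert(ElemsPerIter<int32_t>() == 8, "8 x int32 fill a 256-bit vector");
static_assert(ElemsPerIter<int64_t>() == 4, "4 x int64 fill a 256-bit vector");

// How a selection of n_sel_rows splits into full vector iterations plus a
// scalar tail of leftover rows.
struct LoopPlan {
  int iters;
  int tail;
};

template <typename T>
constexpr LoopPlan PlanLoop(int n_sel_rows) {
  return LoopPlan{n_sel_rows / ElemsPerIter<T>(), n_sel_rows % ElemsPerIter<T>()};
}
```

For example, 19 selected int32 rows give two full iterations and a tail of three rows, while 19 selected int64 rows give four iterations and the same tail of three.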


http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@351
PS1, Line 351: __m256i indexes = _mm256_cvtepu16_epi32(*reinterpret_cast<const __m128i*>(sel_rows));
Why not load 16 indexes into a 256-bit variable instead of 8?


http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@349
PS1, Line 349:   int iters = n_sel_rows / 8;
             :   while (iters--) {
             :     __m256i indexes = _mm256_cvtepu16_epi32(*reinterpret_cast<const __m128i*>(sel_rows));
             :     __m256i elems = _mm256_i32gather_epi32(src_buf, indexes, sizeof(int32_t));
             :     _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst_buf), elems);
             :     dst_buf += 8 * sizeof(int32_t);
             :     sel_rows += 8;
             :   }
> I found this difficult to grok without looking at Intel docs. Mind adding a
+1.




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6c9a536b78a524e8178f5d4a0d2dea04deedbd78
Gerrit-Change-Number: 15634
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <an...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Thu, 02 Apr 2020 01:52:35 +0000
Gerrit-HasComments: Yes

[kudu-CR] columnar serialization: use AVX2 for int32 and int64 copying

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15634 )

Change subject: columnar_serialization: use AVX2 for int32 and int64 copying
......................................................................


Patch Set 3: Code-Review+2



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6c9a536b78a524e8178f5d4a0d2dea04deedbd78
Gerrit-Change-Number: 15634
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <an...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 06 Apr 2020 17:42:23 +0000
Gerrit-HasComments: No

[kudu-CR] columnar serialization: use AVX2 for int32 and int64 copying

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has removed Kudu Jenkins from this change.  ( http://gerrit.cloudera.org:8080/15634 )

Change subject: columnar_serialization: use AVX2 for int32 and int64 copying
......................................................................


Removed reviewer Kudu Jenkins with the following votes:

* Verified-1 by Kudu Jenkins (120)

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteReviewer
Gerrit-Change-Id: I6c9a536b78a524e8178f5d4a0d2dea04deedbd78
Gerrit-Change-Number: 15634
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <an...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] columnar serialization: use AVX2 for int32 and int64 copying

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15634 )

Change subject: columnar_serialization: use AVX2 for int32 and int64 copying
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc
File src/kudu/common/columnar_serialization.cc:

http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@328
PS1, Line 328: type_size
nit: rename this sizeof_type too?


http://gerrit.cloudera.org:8080/#/c/15634/1/src/kudu/common/columnar_serialization.cc@349
PS1, Line 349:   int iters = n_sel_rows / 8;
             :   while (iters--) {
             :     __m256i indexes = _mm256_cvtepu16_epi32(*reinterpret_cast<const __m128i*>(sel_rows));
             :     __m256i elems = _mm256_i32gather_epi32(src_buf, indexes, sizeof(int32_t));
             :     _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst_buf), elems);
             :     dst_buf += 8 * sizeof(int32_t);
             :     sel_rows += 8;
             :   }
I found this difficult to grok without looking at Intel docs. Mind adding a high-level explanation of what we're doing? e.g. 

"Iterate over 'sel_rows' 128 bits at a time, first converting our selected row indexes to packed 32-bit integers, and then gathering the values pointed to by the packed indexes into 'dst_buf' 256 bits at a time."

which I think applies to both instantiations.
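
As a sketch of how that explanation maps onto the quoted loop, here is a commented, self-contained version of the int32 case only. It is an illustration under assumptions, not the merged code: the function names and the `SKETCH_AVX2` macro are hypothetical, `__builtin_cpu_supports` stands in for the patch's CPU check, and the scalar tail for leftover rows is my addition, since the quoted snippet does not show how the remainder is handled.

```cpp
#include <cstdint>
#include <cstring>

#if defined(__x86_64__) && (defined(__clang__) || (defined(__GNUC__) && __GNUC__ >= 5))
#include <immintrin.h>
#define SKETCH_AVX2 1
#else
#define SKETCH_AVX2 0
#endif

// Scalar copy of the selected int32 cells; also used for the tail.
static void CopySelectedInt32Scalar(const uint16_t* sel_rows, int n_sel_rows,
                                    const uint8_t* src_buf, uint8_t* dst_buf) {
  for (int i = 0; i < n_sel_rows; ++i) {
    std::memcpy(dst_buf + i * sizeof(int32_t),
                src_buf + sel_rows[i] * sizeof(int32_t), sizeof(int32_t));
  }
}

#if SKETCH_AVX2
__attribute__((target("avx2")))
static void CopySelectedInt32Avx2(const uint16_t* sel_rows, int n_sel_rows,
                                  const uint8_t* src_buf, uint8_t* dst_buf) {
  constexpr int kSizeofType = sizeof(int32_t);
  constexpr int kElemsPerIter = sizeof(__m256i) / kSizeofType;  // 8
  int iters = n_sel_rows / kElemsPerIter;
  while (iters--) {
    // Read 128 bits of 'sel_rows' (eight 16-bit row indexes) and zero-extend
    // them to eight packed 32-bit integers.
    __m256i indexes = _mm256_cvtepu16_epi32(
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(sel_rows)));
    // Gather src[indexes[lane]] for each lane; the scale is the element size.
    __m256i elems = _mm256_i32gather_epi32(
        reinterpret_cast<const int*>(src_buf), indexes, kSizeofType);
    // Write the eight gathered values (256 bits) contiguously to 'dst_buf'.
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst_buf), elems);
    dst_buf += kElemsPerIter * kSizeofType;
    sel_rows += kElemsPerIter;
  }
  // Assumed tail handling: copy the remaining (< 8) rows one at a time.
  CopySelectedInt32Scalar(sel_rows, n_sel_rows % kElemsPerIter, src_buf, dst_buf);
}
#endif

void CopySelectedInt32(const uint16_t* sel_rows, int n_sel_rows,
                       const uint8_t* src_buf, uint8_t* dst_buf) {
#if SKETCH_AVX2
  if (__builtin_cpu_supports("avx2")) {
    CopySelectedInt32Avx2(sel_rows, n_sel_rows, src_buf, dst_buf);
    return;
  }
#endif
  CopySelectedInt32Scalar(sel_rows, n_sel_rows, src_buf, dst_buf);
}
```

Both paths should produce identical output for any selection vector, so the scalar function doubles as a reference when checking the gather path.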




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6c9a536b78a524e8178f5d4a0d2dea04deedbd78
Gerrit-Change-Number: 15634
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <an...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Thu, 02 Apr 2020 00:31:41 +0000
Gerrit-HasComments: Yes

[kudu-CR] columnar serialization: use AVX2 for int32 and int64 copying

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Andrew Wong, Kudu Jenkins, Andrew Wong, Grant Henke, Bankim Bhavsar, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/15634

to look at the new patch set (#3).

Change subject: columnar_serialization: use AVX2 for int32 and int64 copying
......................................................................

columnar_serialization: use AVX2 for int32 and int64 copying

Change-Id: I6c9a536b78a524e8178f5d4a0d2dea04deedbd78
---
M src/kudu/common/columnar_serialization.cc
1 file changed, 93 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/15634/3

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6c9a536b78a524e8178f5d4a0d2dea04deedbd78
Gerrit-Change-Number: 15634
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <an...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] columnar serialization: use AVX2 for int32 and int64 copying

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Andrew Wong, Kudu Jenkins, Andrew Wong, Grant Henke, Bankim Bhavsar, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/15634

to look at the new patch set (#2).

Change subject: columnar_serialization: use AVX2 for int32 and int64 copying
......................................................................

columnar_serialization: use AVX2 for int32 and int64 copying

Change-Id: I6c9a536b78a524e8178f5d4a0d2dea04deedbd78
---
M src/kudu/common/columnar_serialization.cc
1 file changed, 92 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/34/15634/2

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6c9a536b78a524e8178f5d4a0d2dea04deedbd78
Gerrit-Change-Number: 15634
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <an...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] columnar serialization: use AVX2 for int32 and int64 copying

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/15634 )

Change subject: columnar_serialization: use AVX2 for int32 and int64 copying
......................................................................


Patch Set 3: Verified+1

The release build failed while downloading numpy due to a network error. Since the previous build passed (only a lint issue), I'm overriding.



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6c9a536b78a524e8178f5d4a0d2dea04deedbd78
Gerrit-Change-Number: 15634
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <an...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 06 Apr 2020 16:36:27 +0000
Gerrit-HasComments: No