You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@impala.apache.org by "Jim Apple (Code Review)" <ge...@cloudera.org> on 2016/06/08 16:02:32 UTC

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Jim Apple has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/3338

Change subject: Use AVX2 operations to speedup Bloom filter insert by 10-50%.
......................................................................

Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Impala supports machines that do not have AVX2 instructions, so the
fast insert path is only conditionally enabled. This condition check
does not hurt Insert()'s speed.

I also needed to make a minor change to the order of libraries in
LD_LIBRARY_PATH when using the toolchain.

Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
---
M be/src/benchmarks/bloom-filter-benchmark.cc
M be/src/util/bloom-filter-test.cc
M be/src/util/bloom-filter.h
M be/src/util/cpu-info.cc
M be/src/util/cpu-info.h
M bin/impala-config.sh
6 files changed, 66 insertions(+), 20 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/38/3338/1
-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 6:

(5 comments)

I have changed the layout of the BFs so that they are the same no matter what instruction set is supported in the writer.

There are some performance changes from this, but they are mostly a wash.

http://gerrit.cloudera.org:8080/#/c/3338/5/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

PS5, Line 158: 
> typo: constants
Done


Line 187:   const uint32_t bucket_idx = HashUtil::Rehash32to32(hash) & directory_mask_;
> If the datalayout is impacted, then I agree having a member makes sense.  T
I have tried to clarify this in the comments and I have changed the code to use the same layout whether AVX2 instructions are enabled or not.


PS5, Line 188:  (CpuInfo::IsSup
> can you double check that we aren't double dispatching?  I.e. is this a dir
I have checked that we are not double-dispatching. However, we are not inlining the AVX functions inside the functions without target("avx2"). I will fix this in a later patch by following the strategy in sse-util.h.


Line 246: 
> Is there some reason we need to zero them here but not in the Find() implem
No; that was simply an omission


Line 258: 
> The logic  to compute the mask is identical to above, right? Can we make it
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 6
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 5:

My feeling is:

pre-SSE 4.1: as long as it works, we're good, nobody will be expecting great performance on those machines
SSE 4.1: we would need to think very carefully about regressing it

-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filter insert by 10-50%.
......................................................................


Patch Set 4:

> It looks like it hit this on one of the builds:
 > 
 > 11:45:59 /tmp/cc4RuPOa.s: Assembler messages:
 > 11:45:59 /tmp/cc4RuPOa.s:19102: Error: no such instruction:
 > `vpsrlvq -208(%rbp),%ymm0,%ymm0'
 > 11:45:59 /tmp/cc4RuPOa.s:19115: Error: operand type mismatch for
 > `vpand'
 > 11:45:59 /tmp/cc4RuPOa.s:19128: Error: no such instruction:
 > `vpsllvq -336(%rbp),%ymm0,%ymm0'
 > 11:45:59 /tmp/cc4RuPOa.s:19148: Error: operand type mismatch for
 > `vpor'
 > 
 > That's an avx2 instruction, so I guess on some systems it's trying
 > to use some assembler that doesn't support AVX2?

I filed https://jira.cloudera.com/browse/RELENG-734 to ask for newer linkers on all build machines.

Also, I spent some time trying to recreate a query that speeds up either end-to-end or in the build time of a hash join, but did not succeed in finding one.

-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 10: Code-Review+2

(1 comment)

Carry +2

http://gerrit.cloudera.org:8080/#/c/3338/8/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 109:   static const int LOG_BUCKET_WORD_BITS = 5;
> Maybe just remove this TODO, since I don't think it's actionable right now 
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 10
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Mostafa Mokhtar (Code Review)" <ge...@cloudera.org>.
Mostafa Mokhtar has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3338/7/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 176:   if (CpuInfo::IsSupported(CpuInfo::AVX2)) {
> I could call it in the constructor, but given how cheap it is to check, do 
It will save a couple of instructions and ideally the boolean will be read from a register. 
Apart from the cost, it is good practice to check the condition once and cache opposed to checking for each row.


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/3338

to look at the new patch set (#5).

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................

Use AVX2 operations to speedup Bloom filters by 10-100%.

As a reminder, our Bloom filters use so-called "block" Bloom filters,
in which each Bloom filter is actually a set of tiny Bloom fitlers,
each the size of a cache line.

The big idea is to make the tiny Bloom filters that make up a large
Bloom filter the size of AVX2 registers (256 bits) rather than cache
lines (512 bits). This enables a number of SIMD optimizations, and the
resulting AVX2 code does not need loops or conditionals.

Impala supports machines that do not have AVX2 instructions, so the
fast path is only conditionally enabled. When AVX2 is not available,
the operations do not change - the tiny Bloom filters are still 512
bits. Checking whether AVX2 instructions are available does not seem
to hurt operation speed, perhaps because the branch becomes very easy
to predict.

Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
---
M be/src/benchmarks/bloom-filter-benchmark.cc
M be/src/util/bloom-filter-test.cc
M be/src/util/bloom-filter.cc
M be/src/util/bloom-filter.h
M be/src/util/cpu-info.cc
M be/src/util/cpu-info.h
6 files changed, 189 insertions(+), 64 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/38/3338/5
-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3338/7/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 176:   if (CpuInfo::IsSupported(CpuInfo::AVX2)) {
> No, wasn't referring to a static const, I expect the compiler to hoist the 
That codegen improvement isn't actually vapourware, I have a draft patch that's nearly done: https://gerrit.cloudera.org/#/c/3401/3


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has uploaded a new patch set (#3).

Change subject: Use AVX2 operations to speedup Bloom filter insert by 10-50%.
......................................................................

Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Impala supports machines that do not have AVX2 instructions, so the
fast insert path is only conditionally enabled. This condition check
does not hurt Insert()'s speed.

I also needed to make a minor change to the order of libraries in
LD_LIBRARY_PATH when using the toolchain.

Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
---
M be/src/benchmarks/bloom-filter-benchmark.cc
M be/src/util/bloom-filter-test.cc
M be/src/util/bloom-filter.h
M be/src/util/cpu-info.cc
M be/src/util/cpu-info.h
M bin/impala-config.sh
6 files changed, 77 insertions(+), 18 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/38/3338/3
-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 3
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Mostafa Mokhtar (Code Review)" <ge...@cloudera.org>.
Mostafa Mokhtar has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3338/7/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 176:   if (CpuInfo::IsSupported(CpuInfo::AVX2)) {
> "const" doesn't make something a compile time constant. It just means that 
No, wasn't referring to a static const, I expect the compiler to hoist the bool in a register, also I expect the CMP instruction to be cheaper than MOV+TESTL. 
My preference is to get the code as efficient as possible and not wait for code-gen as code-gen work tends to take a lot of time.


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 7:

PS7: 15% increase in lookup speed by using constants, rather than lea + aligned load.

-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3338/7/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 176:   if (CpuInfo::IsSupported(CpuInfo::AVX2)) {
> So TESTL is cheaper than cmp eax, 1?
"const" doesn't make something a compile time constant. It just means that once initialized, it won't change (unless const is cast away).  Maybe you are thinking of "static const"., but this field is not a compile time constant. It depends on what CPU we're running on which is only known at runtime.  It is a code-gen time constant, however, which is why I said it can later be optimized via code-gen.


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/3338

to look at the new patch set (#10).

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................

Use AVX2 operations to speedup Bloom filters by 10-100%.

As a reminder, our Bloom filters use so-called "block" Bloom filters,
in which each Bloom filter is actually a set of tiny Bloom fitlers,
each the size of a cache line.

The big idea is to make the tiny Bloom filters that make up a large
Bloom filter the size of AVX2 registers (256 bits) rather than cache
lines (512 bits). This enables a number of SIMD optimizations, and the
resulting AVX2 code does not need loops or conditionals.

Impala supports machines that do not have AVX2 instructions, so the
fast path is only conditionally enabled. Checking whether AVX2
instructions are available does not seem to hurt operation speed,
perhaps because the branch becomes very easy to predict.

Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
---
M be/src/benchmarks/bloom-filter-benchmark.cc
M be/src/util/bloom-filter-test.cc
M be/src/util/bloom-filter.cc
M be/src/util/bloom-filter.h
M be/src/util/cpu-info.cc
M be/src/util/cpu-info.h
6 files changed, 342 insertions(+), 126 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/38/3338/10
-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 10
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 5:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/3338/5/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

PS5, Line 158: constats
typo: constants


Line 187:   if (CpuInfo::IsSupported(CpuInfo::AVX2)) {
Since this affects the data layout rather than switching between two equivalent functions, can we make this a property of the class, e.g. 'use_avx2_'? I could maybe be persuaded otherwise, but this makes the state of the bloom filter more self-contained.


Line 246:   _mm256_zeroupper();
Is there some reason we need to zero them here but not in the Find() implementation below?


Line 258:   const __m256i mask = _mm256_sllv_epi32(ones, hash_data);
The logic  to compute the mask is identical to above, right? Can we make it a common function.


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/3338/5/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 187:   if (CpuInfo::IsSupported(CpuInfo::AVX2)) {
> Since this affects the data layout rather than switching between two equiva
If the datalayout is impacted, then I agree having a member makes sense.  This also needs to be called out clearly in comments (currently, it's kind of implicit in the comments).


PS5, Line 188: BucketInsertAVX2
can you double check that we aren't double dispatching?  I.e. is this a direct call or do we get a call to a  "jmp *addr" here due to the target attribute?  (The compiler wouldn't know we've already done the dispatch ourselves and I'm not sure it takes into account whether there is a single implementation).


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has uploaded a new patch set (#2).

Change subject: Use AVX2 operations to speedup Bloom filter insert by 10-50%.
......................................................................

Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Impala supports machines that do not have AVX2 instructions, so the
fast insert path is only conditionally enabled. This condition check
does not hurt Insert()'s speed.

I also needed to make a minor change to the order of libraries in
LD_LIBRARY_PATH when using the toolchain.

Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
---
M be/src/benchmarks/bloom-filter-benchmark.cc
M be/src/util/bloom-filter-test.cc
M be/src/util/bloom-filter.h
M be/src/util/cpu-info.cc
M be/src/util/cpu-info.h
M bin/impala-config.sh
6 files changed, 66 insertions(+), 20 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/38/3338/2
-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Michael Brown (Code Review)" <ge...@cloudera.org>.
Michael Brown has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3338/10//COMMIT_MSG
Commit Message:

PS10, Line 10: fitlers,
typo: "fitlers"


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 10
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 8: Code-Review+1

Carry Tim's +1

-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 8
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 7:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/3338/7/be/src/util/bloom-filter-test.cc
File be/src/util/bloom-filter-test.cc:

PS7, Line 41: instructionshalf
> nit: missing space before 'half'
Done


http://gerrit.cloudera.org:8080/#/c/3338/7/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 109:   // TODO: Use Bits::Log2Ceiling64(numeric_limits<BucketWord>::digits) once we enable
> This should be enabled now, so if you feel like trying, we could just fix i
Unfortunately, it isn't constexpr yet, and it's in a Google library.


Line 176:   if (CpuInfo::IsSupported(CpuInfo::AVX2)) {
> Can you call CpuInfo::IsSupported once at Init or BloomFilter() and use a c
I could call it in the constructor, but given how cheap it is to check, do you think it will save any cycles?

https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/util/cpu-info.h#L61


PS7, Line 218: BLOOM_HASH_CONSTANTS
> Maybe just undefine at the bottom of the header? It'll be easier to spot. O
added IMPALA_, moved undef to bottom of file


Line 244: inline bool BloomFilter::BucketFindAVX2(
> nit: could cut down on vertical whitespace in this function.
Slimmed down whitespace here and elsewhere.


PS7, Line 250: is
> nit: 'if' not 'is'
Done


Line 269:       return false;
> Don't need to fix, but I wonder if it would be more efficient to always che
It was slower last time I tried.


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Mostafa Mokhtar (Code Review)" <ge...@cloudera.org>.
Mostafa Mokhtar has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3338/7/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 176:   if (CpuInfo::IsSupported(CpuInfo::AVX2)) {
> There should be no cost difference.  IsSupported is just a bit-test which i
So TESTL is cheaper than cmp eax, 1?
Why would the const boolean not get read from the register? can't the compiler optimize that?


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3338/7/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 176:   if (CpuInfo::IsSupported(CpuInfo::AVX2)) {
> I am interested in closing up this debate.
I think it's unlikely to make a measurable difference reading a word from a fixed address in the data segment and applying  a bitmask versus reading a byte at a fixed offset from the bloom filter pointer.

My preference is to keep it as-is so I don't have to revert the change in the codegen patch, but either is fine to get the patch committed.


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/3338

to look at the new patch set (#7).

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................

Use AVX2 operations to speedup Bloom filters by 10-100%.

As a reminder, our Bloom filters use so-called "block" Bloom filters,
in which each Bloom filter is actually a set of tiny Bloom fitlers,
each the size of a cache line.

The big idea is to make the tiny Bloom filters that make up a large
Bloom filter the size of AVX2 registers (256 bits) rather than cache
lines (512 bits). This enables a number of SIMD optimizations, and the
resulting AVX2 code does not need loops or conditionals.

Impala supports machines that do not have AVX2 instructions, so the
fast path is only conditionally enabled. Checking whether AVX2
instructions are available does not seem to hurt operation speed,
perhaps because the branch becomes very easy to predict.

Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
---
M be/src/benchmarks/bloom-filter-benchmark.cc
M be/src/util/bloom-filter-test.cc
M be/src/util/bloom-filter.cc
M be/src/util/bloom-filter.h
M be/src/util/cpu-info.cc
M be/src/util/cpu-info.h
6 files changed, 354 insertions(+), 126 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/38/3338/7
-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Mostafa Mokhtar (Code Review)" <ge...@cloudera.org>.
Mostafa Mokhtar has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/3338/7/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 176:   if (CpuInfo::IsSupported(CpuInfo::AVX2)) {
> That codegen improvement isn't actually vapourware, I have a draft patch th
Code-gen aside.
As far as I know it is not recommend to re-check conditions that don't change over the lifetime of an object. 

And if use_avx2 is defined as below it should be read from the registers in BucketFindAVX2 and BucketInsertAVX2.


private:
  /// log_directory_space_ is the log (base 2) of the number of buckets in the directory.
  const int log_num_buckets_;

  const bool use_avx2; 

  /// directory_mask_ is (1 << log_num_buckets_) - 1. It is precomputed for
  /// efficiency reasons.
  const uint32_t directory_mask_;


Line 254:   const bool result = _mm256_testc_si256(bucket, mask);
Did you check that AVX2 provides speedup even for highly selective filters?
As BucketFind can return earlier at first BUCKET_WORD.


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/3338

to look at the new patch set (#6).

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................

Use AVX2 operations to speedup Bloom filters by 10-100%.

As a reminder, our Bloom filters use so-called "block" Bloom filters,
in which each Bloom filter is actually a set of tiny Bloom fitlers,
each the size of a cache line.

The big idea is to make the tiny Bloom filters that make up a large
Bloom filter the size of AVX2 registers (256 bits) rather than cache
lines (512 bits). This enables a number of SIMD optimizations, and the
resulting AVX2 code does not need loops or conditionals.

Impala supports machines that do not have AVX2 instructions, so the
fast path is only conditionally enabled. Checking whether AVX2
instructions are available does not seem to hurt operation speed,
perhaps because the branch becomes very easy to predict.

Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
---
M be/src/benchmarks/bloom-filter-benchmark.cc
M be/src/util/bloom-filter-test.cc
M be/src/util/bloom-filter.cc
M be/src/util/bloom-filter.h
M be/src/util/cpu-info.cc
M be/src/util/cpu-info.h
6 files changed, 358 insertions(+), 126 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/38/3338/6
-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 6
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filter insert by 10-50%.
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3338/4/bin/impala-config.sh
File bin/impala-config.sh:

Line 395:   LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:${IMPALA_TOOLCHAIN_GCC_LIB}"
Rolled my change back; a clean build worked fine on my machine. Might have been something screwey with my dev env.


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filter insert by 10-50%.
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3338/2/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

PS2, Line 184: reinterpret_cast<__m256i*>(
This cast violates C's strict aliasing rules since we end up accessing the same memory through different non-char data types. For better or worse we compile with -fno-strict-aliasing, so this won't cause problems, but it would be good to  document this assumption in a comment.


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/3338/7/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 176:   if (CpuInfo::IsSupported(CpuInfo::AVX2)) {
> Code-gen aside.
I am interested in closing up this debate.

If I use use_avx2_, and this is not as efficient as the codegen approach since the replacement for CpuInfo:IsSupported(CpuInfo::AVX2) is now further away from the actual usage of that info in Find() and Insert(), then we can change back to using CpuInfo::IsSupported in Find and Insert once the codegen patch is in.

IOW, I propose to do what Mostafa is suggesting for now and change it back later if it helps performance after the codegen patch is in.

Dan, Tim, does that work for you?


Line 254:   const bool result = _mm256_testc_si256(bucket, mask);
> Did you check that AVX2 provides speedup even for highly selective filters?
Yes. Those are the 'absent' lines in the benchmark file.


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 8: Code-Review+2

(3 comments)

Looks good, upgrading to +2 since it's all in the utils folder.

http://gerrit.cloudera.org:8080/#/c/3338/8/be/src/util/bloom-filter-test.cc
File be/src/util/bloom-filter-test.cc:

Line 33: uint64_t MakeRand() {
Don't need to fix but we we could avoid having to work around rand's limitations by using a different rng. E.g. be/src/runtime/sorter.cc uses mersenne twister.


http://gerrit.cloudera.org:8080/#/c/3338/8/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 109:   // TODO: Use Bits::Log2Ceiling64(numeric_limits<BucketWord>::digits) once we enable
Maybe just remove this TODO, since I don't think it's actionable right now or all that important.


Line 143:       __attribute__((__target__("avx2")));
I spent a little time looking at how __attribute__((__target__ is handled in gcc and clang just to understand what happens here, since the gcc docs are unclear on how __attribute__((__target__ is handled.

gcc can do function multi-versioning (https://gcc.gnu.org/wiki/FunctionMultiVersioning) if you have multiple declarations of the function with different attributes that are visible at the function callsite. If you do this, then the function call turns into an indirect call via a link table that's set up during dynamic linking.

We avoid that here, since there's only a single version of the function declared. When compiling a release build, I only see a direct call in the assembly, so that is good.

In the LLVM IR I also see a direct function call:

  tail call void @_ZN6impala11BloomFilter16BucketInsertAVX2Ejj(%"class.impala::BloomFilter"* nonnull %this, i32 %8, i32 %hash)

So I think this pattern is good and doing what we're expecting.


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 8
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filter insert by 10-50%.
......................................................................


Patch Set 4:

It looks like it hit this on one of the builds:

11:45:59 /tmp/cc4RuPOa.s: Assembler messages:
11:45:59 /tmp/cc4RuPOa.s:19102: Error: no such instruction: `vpsrlvq -208(%rbp),%ymm0,%ymm0'
11:45:59 /tmp/cc4RuPOa.s:19115: Error: operand type mismatch for `vpand'
11:45:59 /tmp/cc4RuPOa.s:19128: Error: no such instruction: `vpsllvq -336(%rbp),%ymm0,%ymm0'
11:45:59 /tmp/cc4RuPOa.s:19148: Error: operand type mismatch for `vpor'

That's an avx2 instruction, so I guess on some systems it's trying to use some assembler that doesn't support AVX2?

-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Internal Jenkins (Code Review)" <ge...@cloudera.org>.
Internal Jenkins has submitted this change and it was merged.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Use AVX2 operations to speedup Bloom filters by 10-100%.

As a reminder, our Bloom filters use so-called "block" Bloom filters,
in which each Bloom filter is actually a set of tiny Bloom fitlers,
each the size of a cache line.

The big idea is to make the tiny Bloom filters that make up a large
Bloom filter the size of AVX2 registers (256 bits) rather than cache
lines (512 bits). This enables a number of SIMD optimizations, and the
resulting AVX2 code does not need loops or conditionals.

Impala supports machines that do not have AVX2 instructions, so the
fast path is only conditionally enabled. Checking whether AVX2
instructions are available does not seem to hurt operation speed,
perhaps because the branch becomes very easy to predict.

Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Reviewed-on: http://gerrit.cloudera.org:8080/3338
Reviewed-by: Jim Apple <jb...@cloudera.com>
Tested-by: Internal Jenkins
---
M be/src/benchmarks/bloom-filter-benchmark.cc
M be/src/util/bloom-filter-test.cc
M be/src/util/bloom-filter.cc
M be/src/util/bloom-filter.h
M be/src/util/cpu-info.cc
M be/src/util/cpu-info.h
6 files changed, 342 insertions(+), 126 deletions(-)

Approvals:
  Jim Apple: Looks good to me, approved
  Internal Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 11
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3338/7/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 176:   if (CpuInfo::IsSupported(CpuInfo::AVX2)) {
> I think it's unlikely to make a measurable difference reading a word from a
Mostafa, your proposed code change will not improve anything. A field in an object won't magically be saved in a register across function calls (at least not without complete inlining and a lot of static alias analysis by the compiler -- and that would work equally well with the current code, like Tim says, anyway).

So, leave the code as is.


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 5:

> How much of a regression will pre-SSE4.1 incur?  Regressing that
 > case in favor of making SSE4.1 capable

I think this got cut off.

In answer to your question, it's tough to answer without actually testing ona variety of hardware. I can test on my machine (which is a Broadwell), but I think it will very sensitive to the actual microarchitectural details, so the result I get on my machine could be, I think, much more or much less regression (percentage wise) than older machines would get.

For instance, the "ports" that the multiplication instruction uses on newer machines are different. I have seen this change the relative performance of two different approaches to integer arithmetic on an even smaller time frame than 2008-to-2016.

 > For #2, what is the rough perf impact for pre-SSE4.1 and SSE4.1
 > before and after that change?  #2 is probably the right choice, but
 > would be good to have some rough estimates to understand the
 > implications.

I guess this is the same comment as the above that got cut off.

I can do some testing on my machine if you would find it informative, even given the caveats above. Let me know.

-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filter insert by 10-50%.
......................................................................


Patch Set 2:

(3 comments)

Nice! Did you see any difference in end-to-end query time? I suspect the improvement may be partially masked by the interpreted hash evaluation thta's done at the same time (There's a "TODO: codegen expr evaluation and hashing" in exec/partitioned-hash-join-node-ir.cc)

http://gerrit.cloudera.org:8080/#/c/3338/2/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 129:   // A faster SIMD version of Insert()
/// instead of //

Also combine the declaration and definition comments up here, I thnk - they're semi-redundant.


Line 154:     const uint32_t bucket_idx = HashUtil::Rehash32to32(hash) & directory_mask_;
Factor this out into InsertGeneric()


http://gerrit.cloudera.org:8080/#/c/3338/2/bin/impala-config.sh
File bin/impala-config.sh:

Line 395:   LD_LIBRARY_PATH="${IMPALA_TOOLCHAIN_GCC_LIB}:${LD_LIBRARY_PATH}"
This seems fine, but it may be worth running a private impala packaging build, just in case this breaks something on a particular OS.


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 5:

> My feeling is:
 > 
 > pre-SSE 4.1: as long as it works, we're good, nobody will be
 > expecting great performance on those machines
 > SSE 4.1: we would need to think very carefully about regressing it

I suspect this change would make SSE4.1 machines faster.

-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Internal Jenkins (Code Review)" <ge...@cloudera.org>.
Internal Jenkins has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 10: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 10
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 5:

> > It just occurred to me that this could give incorrect results
 > > running on a mixed cluster of avx2/non-avx2 machines.
 > >
 > > Would it make sense to just use the avx2-optimised layout for the
 > > non-avx2 case?
 > 
 > Good point, Tim!
 > 
 > Using the same layout is certainly possible. Using the same hash
 > functions, however, would slow down the non-avx2 code.
 > 
 > The reason is that, between PS4 and PS5, I stated using the vpmulld
 > instruction to rehash the 32-bit value by multiplying it by 8
 > different odd 32-bit constants and taking the top 5 bits of each.
 > In the serial code, I multiply by two different 64-bit constants
 > using Rehash32to64, add other 64-bit constants, then take the top
 > 32-bits of each of each. Switching to eight 32-bit multiplications
 > would be a good bit slower, I suspect.
 > 
 > This could be alleviated using pmulld, which can perform 4 32-bit
 > multiplications with one instruction, but that was added in SSE4.1.
 > 
 > I see two options:
 > 
 > 1. Leave some performance on the table with this commit by moving
 > back to PS4.
 > 
 > 2. Take a regression for pre-sse4.1 machines (ended in 2008ish for
 > Intel, 2012ish for AMD, if I'm reading correctly) and a bigger
 > speedup for more modern machines.
 > 
 > I have another change I've already started testing that increases
 > the gap between #1 and #2 by another 50-100%.
 > 
 > Tim, Dan: what do you think is the right choice?

How much of a regression will pre-SSE4.1 incur?  Regressing that case in favor of making SSE4.1 capable 

 > > It just occurred to me that this could give incorrect results
 > > running on a mixed cluster of avx2/non-avx2 machines.
 > >
 > > Would it make sense to just use the avx2-optimised layout for the
 > > non-avx2 case?
 > 
 > Good point, Tim!
 > 
 > Using the same layout is certainly possible. Using the same hash
 > functions, however, would slow down the non-avx2 code.
 > 
 > The reason is that, between PS4 and PS5, I stated using the vpmulld
 > instruction to rehash the 32-bit value by multiplying it by 8
 > different odd 32-bit constants and taking the top 5 bits of each.
 > In the serial code, I multiply by two different 64-bit constants
 > using Rehash32to64, add other 64-bit constants, then take the top
 > 32-bits of each of each. Switching to eight 32-bit multiplications
 > would be a good bit slower, I suspect.
 > 
 > This could be alleviated using pmulld, which can perform 4 32-bit
 > multiplications with one instruction, but that was added in SSE4.1.
 > 
 > I see two options:
 > 
 > 1. Leave some performance on the table with this commit by moving
 > back to PS4.
 > 
 > 2. Take a regression for pre-sse4.1 machines (ended in 2008ish for
 > Intel, 2012ish for AMD, if I'm reading correctly) and a bigger
 > speedup for more modern machines.
 > 
 > I have another change I've already started testing that increases
 > the gap between #1 and #2 by another 50-100%.
 > 
 > Tim, Dan: what do you think is the right choice?

For #2, what is the rough perf impact for pre-SSE4.1 and SSE4.1 before and after that change?  #2 is probably the right choice, but would be good to have some rough estimates to understand the implications.

-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filter insert by 10-50%.
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3338/4/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 160:   if (CpuInfo::IsSupported(CpuInfo::AVX2)) {
> We should standardize on one way to dispatch based on CPUs.  According to t
For testing, I think our more verbose way is easier.


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filter insert by 10-50%.
......................................................................


Patch Set 4:

It looks like if we set the -B option to gcc (like we do for USE_GOLD_LINKER), it should pick up the as from the toolchain binutils, so maybe that's something we can look to do?

/opt/Impala-Toolchain/gcc-4.9.2/bin/gcc -B /opt/Impala-Toolchain/binutils-2.26/bin/ --print-search-dirs 

Based on the microbenchmark, this looks promising, but it's probably just masked by slowness in the interpreted hashing.

-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 5:

> It just occurred to me that this could give incorrect results
 > running on a mixed cluster of avx2/non-avx2 machines.
 > 
 > Would it make sense to just use the avx2-optimised layout for the
 > non-avx2 case?

Good point, Tim!

Using the same layout is certainly possible. Using the same hash functions, however, would slow down the non-avx2 code.

The reason is that, between PS4 and PS5, I stated using the vpmulld instruction to rehash the 32-bit value by multiplying it by 8 different odd 32-bit constants and taking the top 5 bits of each. In the serial code, I multiply by two different 64-bit constants using Rehash32to64, add other 64-bit constants, then take the top 32-bits of each of each. Switching to eight 32-bit multiplications would be a good bit slower, I suspect.

This could be alleviated using pmulld, which can perform 4 32-bit multiplications with one instruction, but that was added in SSE4.1.

I see two options:

1. Leave some performance on the table with this commit by moving back to PS4.

2. Take a regression for pre-sse4.1 machines (ended in 2008ish for Intel, 2012ish for AMD, if I'm reading correctly) and a bigger speedup for more modern machines.

I have another change I've already started testing that increases the gap between #1 and #2 by another 50-100%. 

Tim, Dan: what do you think is the right choice?

-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 5:

It just occurred to me that this could give incorrect results running on a mixed cluster of avx2/non-avx2 machines.

Would it make sense to just use the avx2-optimised layout for the non-avx2 case?

-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3338/7/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 176:   if (CpuInfo::IsSupported(CpuInfo::AVX2)) {
> It will save a couple of instructions and ideally the boolean will be read 
There should be no cost difference.  IsSupported is just a bit-test which is a single instruction (TESTL).  And making it a const bool in the class won't allow it to be in a register.

In the medium term we can get rid of these using code-gen anyway.


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filter insert by 10-50%.
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3338/2/be/src/util/bloom-filter-test.cc
File be/src/util/bloom-filter-test.cc:

Line 48: TEST(BloomFilter, Insert) {
Can we add some test coverage with AVX2 disabled?

I believe there's support in CpuInfo to explicitly disable flags.

It's also not 100% clear to me whether we'll get reliable test coverage for the AVX2 path on Jenkins machines, since I've seen before that some of them have AVX2 disabled. I'm not too concerned, but it would be good to rerun the tests locally before merging in case the merge job doesn't run with AVX2.


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filter insert by 10-50%.
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3338/4/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 160:   if (CpuInfo::IsSupported(CpuInfo::AVX2)) {
We should standardize on one way to dispatch based on CPUs.  According to this page: https://gcc.gnu.org/wiki/FunctionMultiVersioning, GCC also has built in support for this if you define two versions of BucketInsert(), one with "default" target and one with "avx2" target.  It doesn't sound like it patches the code directly, though, so still incurs a runtime cost of dispatching (sounds like they are using "jmp *ptr"). 

Have you looked at that to see which way would be better?


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 7: Code-Review+1

(8 comments)

http://gerrit.cloudera.org:8080/#/c/3338/7/be/src/util/bloom-filter-test.cc
File be/src/util/bloom-filter-test.cc:

PS7, Line 41: instructionshalf
nit: missing space before 'half'


Line 44: void BfInsert(BloomFilter& bf, uint32_t h) {
Nice, I like this approach.


http://gerrit.cloudera.org:8080/#/c/3338/7/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 109:   // TODO: Use Bits::Log2Ceiling64(numeric_limits<BucketWord>::digits) once we enable
This should be enabled now, so if you feel like trying, we could just fix it now.


Line 176:   if (CpuInfo::IsSupported(CpuInfo::AVX2)) {
> Can you call CpuInfo::IsSupported once at Init or BloomFilter() and use a c
We're going to optimise out CpuInfo::IsSupported() in codegen with a uniform approach once we've got a better constant-replacement infrastructure (IMPALA-3637)


PS7, Line 218: BLOOM_HASH_CONSTANTS
Maybe just undefine at the bottom of the header? It'll be easier to spot. Or make the macro name more unique and don't worry about undefining it, e.g. IMPALA_BLOOM_HASH_CONSTANTS


Line 244: inline bool BloomFilter::BucketFindAVX2(
nit: could cut down on vertical whitespace in this function.


PS7, Line 250: is
nit: 'if' not 'is'


Line 269:       return false;
Don't need to fix, but I wonder if it would be more efficient to always check every word and move the branch outside the loop. That requires more bit manipulation but might be offset by reduced (unpredictable) branches.

I guess it doesn't really matter since this is the slow path now.


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 8:

This has passed a private packaging build.

-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 8
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Mostafa Mokhtar (Code Review)" <ge...@cloudera.org>.
Mostafa Mokhtar has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3338/7/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 176:   if (CpuInfo::IsSupported(CpuInfo::AVX2)) {
Can you call CpuInfo::IsSupported once at Init or BloomFilter() and use a const bool instead?


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 7
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filter insert by 10-50%.
......................................................................


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/3338/2/be/src/util/bloom-filter-test.cc
File be/src/util/bloom-filter-test.cc:

Line 48: TEST(BloomFilter, Insert) {
> Can we add some test coverage with AVX2 disabled?
Done


http://gerrit.cloudera.org:8080/#/c/3338/2/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 129:   // A faster SIMD version of Insert()
> /// instead of //
Done


Line 154:     const uint32_t bucket_idx = HashUtil::Rehash32to32(hash) & directory_mask_;
> Factor this out into InsertGeneric()
Called them Insert, InsertBucket, InsertBucketAVX2. Not married to the names.


PS2, Line 184: reinterpret_cast<__m256i*>(
> This cast violates C's strict aliasing rules since we end up accessing the 
Done


http://gerrit.cloudera.org:8080/#/c/3338/2/bin/impala-config.sh
File bin/impala-config.sh:

Line 395:   LD_LIBRARY_PATH="${IMPALA_TOOLCHAIN_GCC_LIB}:${LD_LIBRARY_PATH}"
> This seems fine, but it may be worth running a private impala packaging bui
Packaging build failed with no useful error message.

    00:15:24.447 make[1]: *** [be/src/service/CMakeFiles/impalad.dir/rule] Error 2

I'll work on fixing it, but everything else is ready for re-review.


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filter insert by 10-50%.
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3338/4/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 160:   if (CpuInfo::IsSupported(CpuInfo::AVX2)) {
> For testing, I think our more verbose way is easier.
Okay, sounds good. It also makes it clearer where we have the runtime dispatching.


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/3338

to look at the new patch set (#8).

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................

Use AVX2 operations to speedup Bloom filters by 10-100%.

As a reminder, our Bloom filters use so-called "block" Bloom filters,
in which each Bloom filter is actually a set of tiny Bloom fitlers,
each the size of a cache line.

The big idea is to make the tiny Bloom filters that make up a large
Bloom filter the size of AVX2 registers (256 bits) rather than cache
lines (512 bits). This enables a number of SIMD optimizations, and the
resulting AVX2 code does not need loops or conditionals.

Impala supports machines that do not have AVX2 instructions, so the
fast path is only conditionally enabled. Checking whether AVX2
instructions are available does not seem to hurt operation speed,
perhaps because the branch becomes very easy to predict.

Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
---
M be/src/benchmarks/bloom-filter-benchmark.cc
M be/src/util/bloom-filter-test.cc
M be/src/util/bloom-filter.cc
M be/src/util/bloom-filter.h
M be/src/util/cpu-info.cc
M be/src/util/cpu-info.h
6 files changed, 344 insertions(+), 126 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/38/3338/8
-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 8
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filter insert by 10-50%.
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3338/4/be/src/util/bloom-filter.h
File be/src/util/bloom-filter.h:

Line 160:   if (CpuInfo::IsSupported(CpuInfo::AVX2)) {
> We should standardize on one way to dispatch based on CPUs.  According to t
I guess one advantage to making it explicit like you have is that we could replace these with constants in codegen.


-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filter insert by 10-50%.
......................................................................


Patch Set 4: Code-Review+1

Looks good once we confirm we can build it ok.

-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has uploaded a new patch set (#4).

Change subject: Use AVX2 operations to speedup Bloom filter insert by 10-50%.
......................................................................

Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Impala supports machines that do not have AVX2 instructions, so the
fast insert path is only conditionally enabled. This condition check
does not hurt Insert()'s speed.

I also needed to make a minor change to the order of libraries in
LD_LIBRARY_PATH when using the toolchain.

Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
---
M be/src/benchmarks/bloom-filter-benchmark.cc
M be/src/util/bloom-filter-test.cc
M be/src/util/bloom-filter.h
M be/src/util/cpu-info.cc
M be/src/util/cpu-info.h
5 files changed, 76 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/38/3338/4
-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filters by 10-100%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filters by 10-100%.
......................................................................


Patch Set 5:

In PS5, I changed the way the AVX2 operations work in Insert(), and added AVX2 operations in Find(). The speedups now sometimes more than 2x.

-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) Use AVX2 operations to speedup Bloom filter insert by 10-50%.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change.

Change subject: Use AVX2 operations to speedup Bloom filter insert by 10-50%.
......................................................................


Patch Set 4:

> Looks good once we confirm we can build it ok.

OK. Blocked on https://gerrit.cloudera.org/#/c/3352/, which in turn is blocked on https://gerrit.cloudera.org/#/c/3353/1, which is blocked on not committing infra changed on a Friday afternoon :-D

-- 
To view, visit http://gerrit.cloudera.org:8080/3338
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6fef4f6652876f8fd7e3f0e41431702380418c98
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No