Posted to dev@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@ip-10-146-233-104.ec2.internal> on 2016/01/20 22:16:30 UTC
[kudu-CR] codegen: force inlining of utility functions in JIT code
Hello Dan Burkert, Adar Dembo,
I'd like you to do a code review. Please visit
http://gerrit.cloudera.org:8080/1845
to review the following change.
Change subject: codegen: force inlining of utility functions in JIT code
......................................................................
codegen: force inlining of utility functions in JIT code
In investigating why some benchmarks got slower with the switch to C++11, I
looked at some of the LLVM-generated code. The C++11 switch also moved us from
LLVM 3.4 to LLVM 3.7, so I thought we might have accidentally broken
optimizations.
It turns out we didn't accidentally disable optimizations; rather, LLVM's
inlining heuristics must have changed. Looking at the generated code, I noticed
that we were no longer inlining the small utility functions from precompiled.cc,
and were instead generating indirect calls to them.
I figured out that adding the 'always_inline' attribute to these functions
fixes the inlining and gives a pretty big performance improvement.
Before:
I0120 12:57:22.352180 25902 full_stack-insert-scan-test.cc:403] Time spent empty projection, 0 col: real 0.418s user 0.016s sys 0.048s
I0120 12:57:22.889552 25902 full_stack-insert-scan-test.cc:403] Time spent key scan, 1 col: real 0.479s user 0.020s sys 0.008s
I0120 12:57:24.727413 25902 full_stack-insert-scan-test.cc:403] Time spent full schema scan, 10 col: real 1.770s user 0.092s sys 0.016s
I0120 12:57:25.588093 25902 full_stack-insert-scan-test.cc:403] Time spent String projection, 1 col: real 0.788s user 0.064s sys 0.000s
I0120 12:57:26.288879 25902 full_stack-insert-scan-test.cc:403] Time spent Int32 projection, 4 col: real 0.639s user 0.028s sys 0.000s
I0120 12:57:27.042276 25902 full_stack-insert-scan-test.cc:403] Time spent Int64 projection, 4 col: real 0.693s user 0.028s sys 0.008s
After:
I0120 12:56:01.236634 25319 full_stack-insert-scan-test.cc:310] Doing test scans on table of 10000000 rows.
I0120 12:56:01.965637 25319 full_stack-insert-scan-test.cc:403] Time spent empty projection, 0 col: real 0.366s user 0.032s sys 0.004s
I0120 12:56:02.491664 25319 full_stack-insert-scan-test.cc:403] Time spent key scan, 1 col: real 0.459s user 0.024s sys 0.004s
I0120 12:56:03.834020 25319 full_stack-insert-scan-test.cc:403] Time spent full schema scan, 10 col: real 1.250s user 0.096s sys 0.008s
I0120 12:56:04.681128 25319 full_stack-insert-scan-test.cc:403] Time spent String projection, 1 col: real 0.780s user 0.060s sys 0.000s
I0120 12:56:05.294754 25319 full_stack-insert-scan-test.cc:403] Time spent Int32 projection, 4 col: real 0.547s user 0.028s sys 0.000s
I0120 12:56:05.988930 25319 full_stack-insert-scan-test.cc:403] Time spent Int64 projection, 4 col: real 0.633s user 0.036s sys 0.000s
Comparing the machine code before and after for the projection with 4 int64s
shows the effect of inlining:
LLVM 3.4 (prior to our C++11 upgrade):
movq (%rsi), %rax
movq 8(%rsi), %rcx
movq 192(%rax), %rdx
movq (%rdx), %rdx
movq 40(%rdi), %rsi
movq %rsi, (%rdx,%rcx,8)
movq 192(%rax), %rdx
movq 8(%rdx), %rdx
movq 48(%rdi), %rsi
movq %rsi, (%rdx,%rcx,8)
movq 192(%rax), %rdx
movq 16(%rdx), %rdx
movq 56(%rdi), %rsi
movq %rsi, (%rdx,%rcx,8)
movq 192(%rax), %rax
movq 24(%rax), %rax
movq 64(%rdi), %rdx
movq %rdx, (%rax,%rcx,8)
movb $1, %al
ret
After the upgrade to LLVM 3.7:
pushq %r15
pushq %r14
pushq %rbx
movq %rsi, %r14
movq %rdi, %rbx
leaq 40(%rbx), %rdi
movabsq $140426221318144, %r15 <-- address of the 'PrecompiledCopyCell' function
xorl %edx, %edx
callq *%r15 <-- indirect call for each cell (#1)
leaq 48(%rbx), %rdi
movl $1, %edx
movq %r14, %rsi
callq *%r15 <-- #2
leaq 56(%rbx), %rdi
movl $2, %edx
movq %r14, %rsi
callq *%r15 <-- #3
addq $64, %rbx
movl $3, %edx
movq %rbx, %rdi
movq %r14, %rsi
callq *%r15 <-- #4
movb $1, %al
popq %rbx
popq %r14
popq %r15
retq
After this patch:
movq (%rsi), %rax
movq 8(%rsi), %rcx
movq 192(%rax), %rdx
movq (%rdx), %rdx
movq 40(%rdi), %rsi
movq %rsi, (%rdx,%rcx,8)
movq 192(%rax), %rdx
movq 8(%rdx), %rdx
movq 48(%rdi), %rsi
movq %rsi, (%rdx,%rcx,8)
movq 192(%rax), %rdx
movq 16(%rdx), %rdx
movq 56(%rdi), %rsi
movq %rsi, (%rdx,%rcx,8)
movq 192(%rax), %rax
movq 24(%rax), %rax
movq 64(%rdi), %rdx
movq %rdx, (%rax,%rcx,8)
movb $1, %al
retq
It looks like the 'after' code still has some redundant loads that might be
preventable, but this should restore our previous performance.
Change-Id: I49c0eec360818c160db25bac7a91b5d00fff57ba
---
M src/kudu/codegen/precompiled.cc
1 file changed, 11 insertions(+), 4 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/1845/1
--
To view, visit http://gerrit.cloudera.org:8080/1845
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: newchange
Gerrit-Change-Id: I49c0eec360818c160db25bac7a91b5d00fff57ba
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>