Posted to dev@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@ip-10-146-233-104.ec2.internal> on 2016/01/20 22:16:30 UTC
[kudu-CR] codegen: force inlining of utility functions in JIT code
Hello Dan Burkert, Adar Dembo,
I'd like you to do a code review. Please visit
http://gerrit.cloudera.org:8080/1845
to review the following change.
Change subject: codegen: force inlining of utility functions in JIT code
......................................................................
codegen: force inlining of utility functions in JIT code
In investigating why some benchmarks got slower with the switch to C++11, I
looked at some of the LLVM-generated code. The C++11 switch also moved us from
LLVM 3.4 to LLVM 3.7, so I thought we might have accidentally broken
optimizations.
It turns out we didn't accidentally disable optimizations; rather, LLVM's
inlining heuristics must have changed. Looking at the generated code, I noticed
that we were no longer inlining the small utility functions from precompiled.cc,
and were instead generating indirect calls to them.
I figured out that adding the 'always_inline' attribute to these functions
fixes the inlining and gives a pretty big performance improvement.
Before:
I0120 12:57:22.352180 25902 full_stack-insert-scan-test.cc:403] Time spent empty projection, 0 col: real 0.418s user 0.016s sys 0.048s
I0120 12:57:22.889552 25902 full_stack-insert-scan-test.cc:403] Time spent key scan, 1 col: real 0.479s user 0.020s sys 0.008s
I0120 12:57:24.727413 25902 full_stack-insert-scan-test.cc:403] Time spent full schema scan, 10 col: real 1.770s user 0.092s sys 0.016s
I0120 12:57:25.588093 25902 full_stack-insert-scan-test.cc:403] Time spent String projection, 1 col: real 0.788s user 0.064s sys 0.000s
I0120 12:57:26.288879 25902 full_stack-insert-scan-test.cc:403] Time spent Int32 projection, 4 col: real 0.639s user 0.028s sys 0.000s
I0120 12:57:27.042276 25902 full_stack-insert-scan-test.cc:403] Time spent Int64 projection, 4 col: real 0.693s user 0.028s sys 0.008s
After:
I0120 12:56:01.236634 25319 full_stack-insert-scan-test.cc:310] Doing test scans on table of 10000000 rows.
I0120 12:56:01.965637 25319 full_stack-insert-scan-test.cc:403] Time spent empty projection, 0 col: real 0.366s user 0.032s sys 0.004s
I0120 12:56:02.491664 25319 full_stack-insert-scan-test.cc:403] Time spent key scan, 1 col: real 0.459s user 0.024s sys 0.004s
I0120 12:56:03.834020 25319 full_stack-insert-scan-test.cc:403] Time spent full schema scan, 10 col: real 1.250s user 0.096s sys 0.008s
I0120 12:56:04.681128 25319 full_stack-insert-scan-test.cc:403] Time spent String projection, 1 col: real 0.780s user 0.060s sys 0.000s
I0120 12:56:05.294754 25319 full_stack-insert-scan-test.cc:403] Time spent Int32 projection, 4 col: real 0.547s user 0.028s sys 0.000s
I0120 12:56:05.988930 25319 full_stack-insert-scan-test.cc:403] Time spent Int64 projection, 4 col: real 0.633s user 0.036s sys 0.000s
Comparing the machine code before and after for the projection with 4 int64s
shows the effect of inlining:
LLVM 3.4 (prior to our C++11 upgrade):
movq (%rsi), %rax
movq 8(%rsi), %rcx
movq 192(%rax), %rdx
movq (%rdx), %rdx
movq 40(%rdi), %rsi
movq %rsi, (%rdx,%rcx,8)
movq 192(%rax), %rdx
movq 8(%rdx), %rdx
movq 48(%rdi), %rsi
movq %rsi, (%rdx,%rcx,8)
movq 192(%rax), %rdx
movq 16(%rdx), %rdx
movq 56(%rdi), %rsi
movq %rsi, (%rdx,%rcx,8)
movq 192(%rax), %rax
movq 24(%rax), %rax
movq 64(%rdi), %rdx
movq %rdx, (%rax,%rcx,8)
movb $1, %al
ret
After the upgrade to LLVM 3.7:
pushq %r15
pushq %r14
pushq %rbx
movq %rsi, %r14
movq %rdi, %rbx
leaq 40(%rbx), %rdi
movabsq $140426221318144, %r15 <-- address of the 'PrecompiledCopyCell' function
xorl %edx, %edx
callq *%r15 <-- indirect call for each cell (#1)
leaq 48(%rbx), %rdi
movl $1, %edx
movq %r14, %rsi
callq *%r15 <-- #2
leaq 56(%rbx), %rdi
movl $2, %edx
movq %r14, %rsi
callq *%r15 <-- #3
addq $64, %rbx
movl $3, %edx
movq %rbx, %rdi
movq %r14, %rsi
callq *%r15 <-- #4
movb $1, %al
popq %rbx
popq %r14
popq %r15
retq
After this patch:
movq (%rsi), %rax
movq 8(%rsi), %rcx
movq 192(%rax), %rdx
movq (%rdx), %rdx
movq 40(%rdi), %rsi
movq %rsi, (%rdx,%rcx,8)
movq 192(%rax), %rdx
movq 8(%rdx), %rdx
movq 48(%rdi), %rsi
movq %rsi, (%rdx,%rcx,8)
movq 192(%rax), %rdx
movq 16(%rdx), %rdx
movq 56(%rdi), %rsi
movq %rsi, (%rdx,%rcx,8)
movq 192(%rax), %rax
movq 24(%rax), %rax
movq 64(%rdi), %rdx
movq %rdx, (%rax,%rcx,8)
movb $1, %al
retq
It looks like the 'after' code still has some redundant loads that might be
preventable, but this should restore our previous performance.
Change-Id: I49c0eec360818c160db25bac7a91b5d00fff57ba
---
M src/kudu/codegen/precompiled.cc
1 file changed, 11 insertions(+), 4 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/1845/1
--
To view, visit http://gerrit.cloudera.org:8080/1845
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: newchange
Gerrit-Change-Id: I49c0eec360818c160db25bac7a91b5d00fff57ba
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>