You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2018/01/23 22:09:15 UTC

[kudu-CR] Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation

Hello Dan Burkert,

I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/9108

to review the following change.


Change subject: Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation
......................................................................

Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation

This bumps to a new gperftools/tcmalloc version which ought to be a bit
faster. It also enables sized-deallocation support on compilers that
support it.

The upgrade is relatively straigthforward but for a few items:
- for whatever reason, we developed a conflict between our own
  base::SpinLock from gutil and tcmalloc's. This conflict has always
  existed, but apparently tcmalloc's implementation changed sizes
  slightly, meaning that now the linker complains about the duplicate
  symbol definition. I created a simple find-replace patch to change
  tcmalloc to use a 'tcmalloc::' namespace instead of 'base::'.
- our previous patches are all now included in tcmalloc upstream and no
  longer necessary.

Sized deallocation is a C++14 feature (which can be selectively enabled
on C++11) which adds a new 'operator delete(void*, size_t)'. In the
common case where we delete a Foo*, the compiler already knows
sizeof(Foo) and thus can pass this second argument to 'operator delete'.
This goes through an optimized code path inside tcmalloc -- namely, it
can skip the expensive lookup which otherwise would be required to map
the pointer back to its size class.

Unfortuantely, we can't always rely on the system allocator to provide
sized deallocation, and thus we can't use this in the exported variants
of our code (i.e. the client library). So, this patch only enables it
when building the Kudu binaries and not the client library.

I ran a quick performance check using full_stack-insert-scan-test twice
each before and after:

Before:
 Performance counter stats for 'build/latest/bin/full_stack-insert-scan-test --gtest_filter=*MRSOnlyStressTest* --inserts_per_client=200000 --concurrent_inserts=10 --rows_per_batch=1 --skip_scans':

     322623.135418      task-clock (msec)         #    4.964 CPUs utilized
        11,751,640      context-switches          #    0.036 M/sec
         3,619,010      cpu-migrations            #    0.011 M/sec
           117,706      page-faults               #    0.365 K/sec
 1,057,656,296,542      cycles                    #    3.278 GHz
   695,668,728,662      instructions              #    0.66  insn per cycle
   127,565,245,327      branches                  #  395.400 M/sec
     1,480,987,998      branch-misses             #    1.16% of all branches

      64.996506196 seconds time elapsed

     330206.351604      task-clock (msec)         #    5.330 CPUs utilized
        12,518,370      context-switches          #    0.038 M/sec
         3,695,790      cpu-migrations            #    0.011 M/sec
           113,580      page-faults               #    0.344 K/sec
 1,059,061,436,907      cycles                    #    3.207 GHz
   697,782,051,928      instructions              #    0.66  insn per cycle
   127,949,774,596      branches                  #  387.484 M/sec
     1,371,967,917      branch-misses             #    1.07% of all branches

      61.952217532 seconds time elapsed

After:

     310788.334202      task-clock (msec)         #    5.315 CPUs utilized
        12,486,276      context-switches          #    0.040 M/sec
         3,690,660      cpu-migrations            #    0.012 M/sec
           113,988      page-faults               #    0.367 K/sec
 1,027,033,932,375      cycles                    #    3.305 GHz
   663,508,168,985      instructions              #    0.65  insn per cycle
   125,864,966,052      branches                  #  404.986 M/sec
     1,442,683,495      branch-misses             #    1.15% of all branches

      58.479033228 seconds time elapsed

 Performance counter stats for 'build/latest/bin/full_stack-insert-scan-test.263 --gtest_filter=*MRSOnlyStressTest* --inserts_per_client=200000 --concurrent_inserts=10 --rows_per_batch=1 --skip_scans':

     317505.548908      task-clock (msec)         #    5.323 CPUs utilized
        12,461,003      context-switches          #    0.039 M/sec
         3,688,491      cpu-migrations            #    0.012 M/sec
           113,952      page-faults               #    0.359 K/sec
 1,029,595,155,391      cycles                    #    3.243 GHz
   663,514,269,656      instructions              #    0.64  insn per cycle
   125,865,138,975      branches                  #  396.419 M/sec
     1,419,717,877      branch-misses             #    1.13% of all branches

      59.642651558 seconds time elapsed

Seems like a small improvement in cycle count, wall clock, etc.

Change-Id: I3b76c9d71b29d58f0b6dade3fc33b32d0c3de23b
---
M CMakeLists.txt
M thirdparty/download-thirdparty.sh
D thirdparty/patches/gperftools-Change-default-TCMALLOC_TRANSFER_NUM_OBJ-to-40.patch
A thirdparty/patches/gperftools-Replace-namespace-base-with-namespace-tcmalloc.patch
D thirdparty/patches/gperftools-hook-mi_force_unlock-on-OSX-instead-of-pthread_atfork.patch
D thirdparty/patches/gperftools-issue-827-add_get_default_zone_to_osx_libc_override.patch
M thirdparty/vars.sh
7 files changed, 1,696 insertions(+), 159 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/08/9108/1
-- 
To view, visit http://gerrit.cloudera.org:8080/9108
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I3b76c9d71b29d58f0b6dade3fc33b32d0c3de23b
Gerrit-Change-Number: 9108
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>

[kudu-CR] Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Dan Burkert, Kudu Jenkins, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/9108

to look at the new patch set (#2).

Change subject: Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation
......................................................................

Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation

This bumps to a new gperftools/tcmalloc version which ought to be a bit
faster. It also enables sized-deallocation support on compilers that
support it.

The upgrade is relatively straigthforward but for a few items:
- for whatever reason, we developed a conflict between our own
  base::SpinLock from gutil and tcmalloc's. This conflict has always
  existed, but apparently tcmalloc's implementation changed sizes
  slightly, meaning that now the linker complains about the duplicate
  symbol definition. I created a simple find-replace patch to change
  tcmalloc to use a 'tcmalloc::' namespace instead of 'base::'.
- our previous patches are all now included in tcmalloc upstream and no
  longer necessary.
- for unknown reasons, gperftools now fails to build with TSAN
  instrumentation: it shows duplicate symbols between the tcmalloc
  'operator new' and clang_rt's 'operator new'. We never actually needed
  to use gperftools in TSAN builds anyway, so this just removes the
  instrumented build and properly ifdefs out a few of the places where
  we'd included gperftools headers unconditionally.

Sized deallocation is a C++14 feature (which can be selectively enabled
on C++11) which adds a new 'operator delete(void*, size_t)'. In the
common case where we delete a Foo*, the compiler already knows
sizeof(Foo) and thus can pass this second argument to 'operator delete'.
This goes through an optimized code path inside tcmalloc -- namely, it
can skip the expensive lookup which otherwise would be required to map
the pointer back to its size class.

Unfortuantely, we can't always rely on the system allocator to provide
sized deallocation, and thus we can't use this in the exported variants
of our code (i.e. the client library). So, this patch only enables it
when building the Kudu binaries and not the client library.

I ran a quick performance check using full_stack-insert-scan-test twice
each before and after:

Before:
 Performance counter stats for 'build/latest/bin/full_stack-insert-scan-test --gtest_filter=*MRSOnlyStressTest* --inserts_per_client=200000 --concurrent_inserts=10 --rows_per_batch=1 --skip_scans':

     322623.135418      task-clock (msec)         #    4.964 CPUs utilized
        11,751,640      context-switches          #    0.036 M/sec
         3,619,010      cpu-migrations            #    0.011 M/sec
           117,706      page-faults               #    0.365 K/sec
 1,057,656,296,542      cycles                    #    3.278 GHz
   695,668,728,662      instructions              #    0.66  insn per cycle
   127,565,245,327      branches                  #  395.400 M/sec
     1,480,987,998      branch-misses             #    1.16% of all branches

      64.996506196 seconds time elapsed

     330206.351604      task-clock (msec)         #    5.330 CPUs utilized
        12,518,370      context-switches          #    0.038 M/sec
         3,695,790      cpu-migrations            #    0.011 M/sec
           113,580      page-faults               #    0.344 K/sec
 1,059,061,436,907      cycles                    #    3.207 GHz
   697,782,051,928      instructions              #    0.66  insn per cycle
   127,949,774,596      branches                  #  387.484 M/sec
     1,371,967,917      branch-misses             #    1.07% of all branches

      61.952217532 seconds time elapsed

After:

     310788.334202      task-clock (msec)         #    5.315 CPUs utilized
        12,486,276      context-switches          #    0.040 M/sec
         3,690,660      cpu-migrations            #    0.012 M/sec
           113,988      page-faults               #    0.367 K/sec
 1,027,033,932,375      cycles                    #    3.305 GHz
   663,508,168,985      instructions              #    0.65  insn per cycle
   125,864,966,052      branches                  #  404.986 M/sec
     1,442,683,495      branch-misses             #    1.15% of all branches

      58.479033228 seconds time elapsed

 Performance counter stats for 'build/latest/bin/full_stack-insert-scan-test.263 --gtest_filter=*MRSOnlyStressTest* --inserts_per_client=200000 --concurrent_inserts=10 --rows_per_batch=1 --skip_scans':

     317505.548908      task-clock (msec)         #    5.323 CPUs utilized
        12,461,003      context-switches          #    0.039 M/sec
         3,688,491      cpu-migrations            #    0.012 M/sec
           113,952      page-faults               #    0.359 K/sec
 1,029,595,155,391      cycles                    #    3.243 GHz
   663,514,269,656      instructions              #    0.64  insn per cycle
   125,865,138,975      branches                  #  396.419 M/sec
     1,419,717,877      branch-misses             #    1.13% of all branches

      59.642651558 seconds time elapsed

Seems like a small improvement in cycle count, wall clock, etc.

Change-Id: I3b76c9d71b29d58f0b6dade3fc33b32d0c3de23b
---
M CMakeLists.txt
M src/kudu/server/default_path_handlers.cc
M src/kudu/server/pprof_path_handlers.cc
M src/kudu/server/tcmalloc_metrics.cc
M src/kudu/util/mem_tracker.cc
M src/kudu/util/process_memory.cc
M thirdparty/build-thirdparty.sh
M thirdparty/download-thirdparty.sh
D thirdparty/patches/gperftools-Change-default-TCMALLOC_TRANSFER_NUM_OBJ-to-40.patch
A thirdparty/patches/gperftools-Replace-namespace-base-with-namespace-tcmalloc.patch
D thirdparty/patches/gperftools-hook-mi_force_unlock-on-OSX-instead-of-pthread_atfork.patch
D thirdparty/patches/gperftools-issue-827-add_get_default_zone_to_osx_libc_override.patch
M thirdparty/vars.sh
13 files changed, 1,704 insertions(+), 165 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/08/9108/2
-- 
To view, visit http://gerrit.cloudera.org:8080/9108
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3b76c9d71b29d58f0b6dade3fc33b32d0c3de23b
Gerrit-Change-Number: 9108
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/9108 )

Change subject: Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9108/1/CMakeLists.txt
File CMakeLists.txt:

http://gerrit.cloudera.org:8080/#/c/9108/1/CMakeLists.txt@610
PS1, Line 610: Exported variants should conform to the C++11 ABI
Technically they conform to the pre-C++11 ABI, no? Well, at least all of the exported ABI symbols must adhere to the C++03 symbols; not sure about "hidden" symbols.



-- 
To view, visit http://gerrit.cloudera.org:8080/9108
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3b76c9d71b29d58f0b6dade3fc33b32d0c3de23b
Gerrit-Change-Number: 9108
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Tue, 23 Jan 2018 23:16:26 +0000
Gerrit-HasComments: Yes

[kudu-CR] Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/9108 )

Change subject: Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation
......................................................................

Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation

This bumps to a new gperftools/tcmalloc version which ought to be a bit
faster. It also enables sized-deallocation support on compilers that
support it.

The upgrade is relatively straigthforward but for a few items:
- for whatever reason, we developed a conflict between our own
  base::SpinLock from gutil and tcmalloc's. This conflict has always
  existed, but apparently tcmalloc's implementation changed sizes
  slightly, meaning that now the linker complains about the duplicate
  symbol definition. I created a simple find-replace patch to change
  tcmalloc to use a 'tcmalloc::' namespace instead of 'base::'.
- our previous patches are all now included in tcmalloc upstream and no
  longer necessary.
- for unknown reasons, gperftools now fails to build with TSAN
  instrumentation: it shows duplicate symbols between the tcmalloc
  'operator new' and clang_rt's 'operator new'. We never actually needed
  to use gperftools in TSAN builds anyway, so this just removes the
  instrumented build and properly ifdefs out a few of the places where
  we'd included gperftools headers unconditionally.

Sized deallocation is a C++14 feature (which can be selectively enabled
on C++11) which adds a new 'operator delete(void*, size_t)'. In the
common case where we delete a Foo*, the compiler already knows
sizeof(Foo) and thus can pass this second argument to 'operator delete'.
This goes through an optimized code path inside tcmalloc -- namely, it
can skip the expensive lookup which otherwise would be required to map
the pointer back to its size class.

Unfortuantely, we can't always rely on the system allocator to provide
sized deallocation, and thus we can't use this in the exported variants
of our code (i.e. the client library). So, this patch only enables it
when building the Kudu binaries and not the client library.

I ran a quick performance check using full_stack-insert-scan-test twice
each before and after:

Before:
 Performance counter stats for 'build/latest/bin/full_stack-insert-scan-test --gtest_filter=*MRSOnlyStressTest* --inserts_per_client=200000 --concurrent_inserts=10 --rows_per_batch=1 --skip_scans':

     322623.135418      task-clock (msec)         #    4.964 CPUs utilized
        11,751,640      context-switches          #    0.036 M/sec
         3,619,010      cpu-migrations            #    0.011 M/sec
           117,706      page-faults               #    0.365 K/sec
 1,057,656,296,542      cycles                    #    3.278 GHz
   695,668,728,662      instructions              #    0.66  insn per cycle
   127,565,245,327      branches                  #  395.400 M/sec
     1,480,987,998      branch-misses             #    1.16% of all branches

      64.996506196 seconds time elapsed

     330206.351604      task-clock (msec)         #    5.330 CPUs utilized
        12,518,370      context-switches          #    0.038 M/sec
         3,695,790      cpu-migrations            #    0.011 M/sec
           113,580      page-faults               #    0.344 K/sec
 1,059,061,436,907      cycles                    #    3.207 GHz
   697,782,051,928      instructions              #    0.66  insn per cycle
   127,949,774,596      branches                  #  387.484 M/sec
     1,371,967,917      branch-misses             #    1.07% of all branches

      61.952217532 seconds time elapsed

After:

     310788.334202      task-clock (msec)         #    5.315 CPUs utilized
        12,486,276      context-switches          #    0.040 M/sec
         3,690,660      cpu-migrations            #    0.012 M/sec
           113,988      page-faults               #    0.367 K/sec
 1,027,033,932,375      cycles                    #    3.305 GHz
   663,508,168,985      instructions              #    0.65  insn per cycle
   125,864,966,052      branches                  #  404.986 M/sec
     1,442,683,495      branch-misses             #    1.15% of all branches

      58.479033228 seconds time elapsed

 Performance counter stats for 'build/latest/bin/full_stack-insert-scan-test.263 --gtest_filter=*MRSOnlyStressTest* --inserts_per_client=200000 --concurrent_inserts=10 --rows_per_batch=1 --skip_scans':

     317505.548908      task-clock (msec)         #    5.323 CPUs utilized
        12,461,003      context-switches          #    0.039 M/sec
         3,688,491      cpu-migrations            #    0.012 M/sec
           113,952      page-faults               #    0.359 K/sec
 1,029,595,155,391      cycles                    #    3.243 GHz
   663,514,269,656      instructions              #    0.64  insn per cycle
   125,865,138,975      branches                  #  396.419 M/sec
     1,419,717,877      branch-misses             #    1.13% of all branches

      59.642651558 seconds time elapsed

Seems like a small improvement in cycle count, wall clock, etc.

Change-Id: I3b76c9d71b29d58f0b6dade3fc33b32d0c3de23b
Reviewed-on: http://gerrit.cloudera.org:8080/9108
Tested-by: Kudu Jenkins
Reviewed-by: Dan Burkert <da...@cloudera.com>
---
M CMakeLists.txt
M src/kudu/server/default_path_handlers.cc
M src/kudu/server/pprof_path_handlers.cc
M src/kudu/server/tcmalloc_metrics.cc
M src/kudu/util/mem_tracker.cc
M src/kudu/util/process_memory.cc
M thirdparty/build-thirdparty.sh
M thirdparty/download-thirdparty.sh
D thirdparty/patches/gperftools-Change-default-TCMALLOC_TRANSFER_NUM_OBJ-to-40.patch
A thirdparty/patches/gperftools-Replace-namespace-base-with-namespace-tcmalloc.patch
D thirdparty/patches/gperftools-hook-mi_force_unlock-on-OSX-instead-of-pthread_atfork.patch
D thirdparty/patches/gperftools-issue-827-add_get_default_zone_to_osx_libc_override.patch
M thirdparty/vars.sh
13 files changed, 1,704 insertions(+), 165 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Dan Burkert: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/9108
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I3b76c9d71b29d58f0b6dade3fc33b32d0c3de23b
Gerrit-Change-Number: 9108
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change. ( http://gerrit.cloudera.org:8080/9108 )

Change subject: Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation
......................................................................


Patch Set 2: Code-Review+2

LGTM.  Tests pass on macos 10.12.


-- 
To view, visit http://gerrit.cloudera.org:8080/9108
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3b76c9d71b29d58f0b6dade3fc33b32d0c3de23b
Gerrit-Change-Number: 9108
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 24 Jan 2018 18:30:31 +0000
Gerrit-HasComments: No

[kudu-CR] Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/9108 )

Change subject: Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9108/1/CMakeLists.txt
File CMakeLists.txt:

http://gerrit.cloudera.org:8080/#/c/9108/1/CMakeLists.txt@610
PS1, Line 610: Exported variants should conform to the C++11 ABI
> Technically they conform to the pre-C++11 ABI, no? Well, at least all of th
good point. I thought you were on vacation, though :P



-- 
To view, visit http://gerrit.cloudera.org:8080/9108
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3b76c9d71b29d58f0b6dade3fc33b32d0c3de23b
Gerrit-Change-Number: 9108
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 24 Jan 2018 00:17:44 +0000
Gerrit-HasComments: Yes