You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2017/11/01 20:02:58 UTC

[kudu-CR] compaction policy: out-of-line inessential fields in RowSetInfo

Hello zhen.zhang, Adar Dembo,

I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/8446

to review the following change.


Change subject: compaction_policy: out-of-line inessential fields in RowSetInfo
......................................................................

compaction_policy: out-of-line inessential fields in RowSetInfo

Based on analysis using toplev.py from pmu-tools[1] I found that the
compaction algorithm was memory-bandwidth bound for the benchmarks in
compaction_policy-test.

The core algorithms of the compaction policy only access a subset of the
fields in the object, so an easy fix to reduce memory bandwidth usage is
to out-of-line the infrequently accessed fields to a separate object. In
particular, there were two strings in RowSetInfo that are used rarely,
and sizeof(std::string) is 32 bytes. So moving them out of line reduces
sizeof(RowSetInfo) quite a bit.

Summary of improvements:
  - 12% improvement in wall time
  - L1 cache: 36% reduction in misses
  - L2 cache: 66% reduction in misses
  - L3 cache: 59% reduction in misses

Before:

 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*TinyOver*' (10 runs):

        55,364,354      mem_load_retired.l1_miss                                      ( +-  0.83% )  (66.24%)
     1,201,317,760      mem_load_retired.l1_hit                                       ( +-  0.32% )  (66.71%)
        44,942,323      mem_load_retired.l2_miss                                      ( +-  1.02% )  (67.08%)
        10,873,016      mem_load_retired.l2_hit                                       ( +-  3.48% )  (67.31%)
            42,525      mem_load_retired.l3_miss                                      ( +- 10.29% )  (67.00%)
        44,167,947      mem_load_retired.l3_hit                                       ( +-  1.49% )  (66.29%)

       0.406484613 seconds time elapsed                                          ( +-  1.22% )

After:

 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*TinyOver*' (10 runs):

        35,557,941      mem_load_retired.l1_miss                                      ( +-  0.69% )  (66.58%)
     1,262,199,262      mem_load_retired.l1_hit                                       ( +-  0.13% )  (66.58%)
        15,424,777      mem_load_retired.l2_miss                                      ( +-  4.25% )  (66.58%)
        18,029,572      mem_load_retired.l2_hit                                       ( +-  3.31% )  (67.04%)
            17,235      mem_load_retired.l3_miss                                      ( +-  7.93% )  (67.18%)
        16,544,671      mem_load_retired.l3_hit                                       ( +-  4.02% )  (66.73%)

       0.359402346 seconds time elapsed                                          ( +-  0.28% )

As this is the third in a set of patches for optimizations that were
concentrating on the "TinyOverlap" test case, I also verified before/after of
the "YCSB" compaction test case. This also showed about a 14% improvement
overall from this patch series.

ycsb at start:
 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*Ycsb*' (10 runs):

        795.924340      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.53% )
                27      context-switches          #    0.034 K/sec                    ( +- 60.21% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
             2,672      page-faults               #    0.003 M/sec                    ( +-  1.36% )
     2,998,040,105      cycles                    #    3.767 GHz                      ( +-  0.35% )
     7,457,801,099      instructions              #    2.49  insn per cycle           ( +-  0.00% )
     1,212,318,353      branches                  # 1523.158 M/sec                    ( +-  0.00% )
         9,741,331      branch-misses             #    0.80% of all branches          ( +-  0.03% )

       0.797407974 seconds time elapsed                                          ( +-  0.58% )

ycsb at end:

 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*Ycsb*' (10 runs):

        686.208936      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.48% )
                 9      context-switches          #    0.013 K/sec                    ( +- 54.78% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
             2,470      page-faults               #    0.004 M/sec                    ( +-  1.21% )
     2,567,417,451      cycles                    #    3.741 GHz                      ( +-  0.22% )
     7,271,965,987      instructions              #    2.83  insn per cycle           ( +-  0.00% )
     1,211,559,802      branches                  # 1765.584 M/sec                    ( +-  0.00% )
         9,759,994      branch-misses             #    0.81% of all branches          ( +-  0.03% )

       0.686853500 seconds time elapsed                                          ( +-  0.49% )

[1] https://github.com/andikleen/pmu-tools/wiki/toplev-manual

Change-Id: I148b8c976577cbe0020ed887f3bc4c7a2cb2c3ae
---
M src/kudu/tablet/rowset_info.cc
M src/kudu/tablet/rowset_info.h
2 files changed, 32 insertions(+), 23 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/46/8446/1
-- 
To view, visit http://gerrit.cloudera.org:8080/8446
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I148b8c976577cbe0020ed887f3bc4c7a2cb2c3ae
Gerrit-Change-Number: 8446
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: zhen.zhang <zh...@xiaomi.com>

[kudu-CR] compaction policy: out-of-line inessential fields in RowSetInfo

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/8446 )

Change subject: compaction_policy: out-of-line inessential fields in RowSetInfo
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8446/1/src/kudu/tablet/rowset_info.h
File src/kudu/tablet/rowset_info.h:

http://gerrit.cloudera.org:8080/#/c/8446/1/src/kudu/tablet/rowset_info.h@103
PS1, Line 103:     // True if the RowSet has known bounds.
             :     // MemRowSets in particular do not.
This comment belongs above has_bounds.



-- 
To view, visit http://gerrit.cloudera.org:8080/8446
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I148b8c976577cbe0020ed887f3bc4c7a2cb2c3ae
Gerrit-Change-Number: 8446
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: zhen.zhang <zh...@xiaomi.com>
Gerrit-Comment-Date: Wed, 01 Nov 2017 20:53:09 +0000
Gerrit-HasComments: Yes

[kudu-CR] compaction policy: out-of-line inessential fields in RowSetInfo

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8446 )

Change subject: compaction_policy: out-of-line inessential fields in RowSetInfo
......................................................................


Patch Set 1:

I wanted to try out the "efficiency sanitizer" so I ran the tests prior to this fix with -fsanitize=efficiency-cache-frag and got the following results:
 
==8130==   # 0: offset = 0,	 size = 8,	 count = 20000,	 type = %"class.kudu::tablet::RowSet"*
==8130==   # 1: offset = 8,	 size = 4,	 count = 39998,	 type = i32
==8130==   # 2: offset = 12,	 size = 4,	 count = 107698980,	 type = i32
==8130==   # 3: offset = 16,	 size = 1,	 count = 39998,	 type = i8
==8130==   # 4: offset = 24,	 size = 32,	 count = 210001,	 type = %"class.std::__cxx11::basic_string" = type { %"struct.std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider", i64, %union.anon }
==8130==   # 5: offset = 56,	 size = 32,	 count = 210001,	 type = %"class.std::__cxx11::basic_string" = type { %"struct.std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider", i64, %union.anon }
==8130==   # 6: offset = 88,	 size = 8,	 count = 300940510,	 type = double
==8130==   # 7: offset = 96,	 size = 8,	 count = 200930507,	 type = double
==8130==   # 8: offset = 104,	 size = 8,	 count = 89700226,	 type = double

From this it seems like the byte size int32 is also infrequently accessed and might be a candidate for out-of-lining.


-- 
To view, visit http://gerrit.cloudera.org:8080/8446
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I148b8c976577cbe0020ed887f3bc4c7a2cb2c3ae
Gerrit-Change-Number: 8446
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: zhen.zhang <zh...@xiaomi.com>
Gerrit-Comment-Date: Wed, 01 Nov 2017 21:37:46 +0000
Gerrit-HasComments: No

[kudu-CR] compaction policy: out-of-line inessential fields in RowSetInfo

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/8446 )

Change subject: compaction_policy: out-of-line inessential fields in RowSetInfo
......................................................................


Patch Set 3: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/8446
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I148b8c976577cbe0020ed887f3bc4c7a2cb2c3ae
Gerrit-Change-Number: 8446
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: zhen.zhang <zh...@xiaomi.com>
Gerrit-Comment-Date: Wed, 01 Nov 2017 23:31:30 +0000
Gerrit-HasComments: No

[kudu-CR] compaction policy: out-of-line inessential fields in RowSetInfo

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8446 )

Change subject: compaction_policy: out-of-line inessential fields in RowSetInfo
......................................................................

compaction_policy: out-of-line inessential fields in RowSetInfo

Based on analysis using toplev.py from pmu-tools[1] I found that the
compaction algorithm was memory-bandwidth bound for the benchmarks in
compaction_policy-test.

The core algorithms of the compaction policy only access a subset of
the fields in the object, meaning that some fields are accessed far
more frequently than others. This can also be illustrated by building
with -fsanitize=cache-efficiency-frag which generates the following
report for the RowSetInfo struct:

==8130==   # 0: offset = 0,	 size = 8,	 count = 20000,	 type = %"class.kudu::tablet::RowSet"*
==8130==   # 1: offset = 8,	 size = 4,	 count = 39998,	 type = i32
==8130==   # 2: offset = 12,	 size = 4,	 count = 107698980,	 type = i32
==8130==   # 3: offset = 16,	 size = 1,	 count = 39998,	 type = i8
==8130==   # 4: offset = 24,	 size = 32,	 count = 210001,	 type = %"class.std::__cxx11::basic_string" = type { %"struct.std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider", i64, %union.anon }
==8130==   # 5: offset = 56,	 size = 32,	 count = 210001,	 type = %"class.std::__cxx11::basic_string" = type { %"struct.std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider", i64, %union.anon }
==8130==   # 6: offset = 88,	 size = 8,	 count = 300940510,	 type = double
==8130==   # 7: offset = 96,	 size = 8,	 count = 200930507,	 type = double
==8130==   # 8: offset = 104,	 size = 8,	 count = 89700226,	 type = double

Here we can see that the second i32 (size_mb) as well as the three
doubles (cdf_min, cdf_max, and density) are accessed orders of
magnitude more than the other fields.

An easy fix for this is to out-of-line the other fields into an ancillary
struct elsewhere on the heap, and store just a pointer to the struct.
This ensures that the frequently-accessed fields fit into a smaller
memory footprint and thus involve far fewer memory accesses.

I benchmarked using both the 'TinyOverlap' case and the 'Ycsb' case,
measuring CPU cache misses at each level along with total runtime.

TinyOverlap case
------------------

Before:
 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*TinyOver*' (10 runs):

        55,773,441      mem_load_retired.l1_miss                                      ( +-  0.91% )  (66.20%)
     1,200,071,033      mem_load_retired.l1_hit                                       ( +-  0.19% )  (66.51%)
        44,324,227      mem_load_retired.l2_miss                                      ( +-  1.21% )  (67.02%)
        11,525,953      mem_load_retired.l2_hit                                       ( +-  3.75% )  (67.34%)
            42,298      mem_load_retired.l3_miss                                      ( +-  8.56% )  (67.07%)
        44,461,423      mem_load_retired.l3_hit                                       ( +-  1.16% )  (66.44%)

       0.399781701 seconds time elapsed                                          ( +-  0.84% )
After:
 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*TinyOver*' (10 runs):

        24,864,656      mem_load_retired.l1_miss                                      ( +-  1.37% )  (66.51%)
     1,261,166,288      mem_load_retired.l1_hit                                       ( +-  0.13% )  (66.73%)
         5,694,166      mem_load_retired.l2_miss                                      ( +-  4.37% )  (66.75%)
        18,221,204      mem_load_retired.l2_hit                                       ( +-  1.13% )  (66.97%)
            28,574      mem_load_retired.l3_miss                                      ( +- 16.39% )  (67.10%)
         5,829,783      mem_load_retired.l3_hit                                       ( +-  5.19% )  (66.63%)

       0.360860837 seconds time elapsed                                          ( +-  0.35% )

Summary of improvements:
  - ~10% improvement in wall time
  - L1 cache: 45% reduction in misses
  - L2 cache: 93% reduction in misses
  - L3 cache: 68% reduction in misses

YCSB test case
---------------

Before:
 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*Ycsb*' (10 runs):

        97,654,120      mem_load_retired.l1_miss                                      ( +-  1.13% )  (66.41%)
     2,205,803,278      mem_load_retired.l1_hit                                       ( +-  0.13% )  (66.57%)
        80,632,313      mem_load_retired.l2_miss                                      ( +-  1.68% )  (66.73%)
        16,742,677      mem_load_retired.l2_hit                                       ( +-  3.77% )  (66.99%)
            60,874      mem_load_retired.l3_miss                                      ( +-  5.64% )  (67.02%)
        82,209,502      mem_load_retired.l3_hit                                       ( +-  1.08% )  (66.61%)

       0.747246686 seconds time elapsed                                          ( +-  0.49% )

After:
 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*Ycsb*' (10 runs):

        39,568,838      mem_load_retired.l1_miss                                      ( +-  0.94% )  (66.49%)
     2,328,627,974      mem_load_retired.l1_hit                                       ( +-  0.12% )  (66.67%)
         6,673,404      mem_load_retired.l2_miss                                      ( +-  3.58% )  (66.76%)
        31,882,992      mem_load_retired.l2_hit                                       ( +-  1.11% )  (66.95%)
            59,963      mem_load_retired.l3_miss                                      ( +- 10.28% )  (66.95%)
         6,378,458      mem_load_retired.l3_hit                                       ( +-  3.19% )  (66.58%)

       0.676610184 seconds time elapsed                                          ( +-  0.44% )

Summary:
  - ~10% improvement in wall time
  - L1 cache: 59% reduction in misses
  - L2 cache: 92% reduction in misses
  - L3 cache: 1% reduction (apparently this workload always fit well
    in L3)

As this is the third in a set of patches for optimizations that were
concentrating on the "TinyOverlap" test case, I also verified
before/after of the "YCSB" compaction test case since the beginning of
series.  This also showed about a 14% improvement:

ycsb at start:
 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*Ycsb*' (10 runs):

        795.924340      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.53% )
                27      context-switches          #    0.034 K/sec                    ( +- 60.21% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
             2,672      page-faults               #    0.003 M/sec                    ( +-  1.36% )
     2,998,040,105      cycles                    #    3.767 GHz                      ( +-  0.35% )
     7,457,801,099      instructions              #    2.49  insn per cycle           ( +-  0.00% )
     1,212,318,353      branches                  # 1523.158 M/sec                    ( +-  0.00% )
         9,741,331      branch-misses             #    0.80% of all branches          ( +-  0.03% )

       0.797407974 seconds time elapsed                                          ( +-  0.58% )

ycsb at end:
 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*Ycsb*' (10 runs):

        681.047998      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.52% )
                 2      context-switches          #    0.003 K/sec                    ( +- 46.55% )
                 0      cpu-migrations            #    0.000 K/sec
             2,474      page-faults               #    0.004 M/sec                    ( +-  1.22% )
     2,581,500,560      cycles                    #    3.790 GHz                      ( +-  0.42% )
     7,267,792,155      instructions              #    2.82  insn per cycle           ( +-  0.00% )
     1,210,494,552      branches                  # 1777.400 M/sec                    ( +-  0.00% )
         9,790,631      branch-misses             #    0.81% of all branches          ( +-  0.02% )

       0.681425636 seconds time elapsed                                          ( +-  0.52% )

[1] https://github.com/andikleen/pmu-tools/wiki/toplev-manual

Change-Id: I148b8c976577cbe0020ed887f3bc4c7a2cb2c3ae
Reviewed-on: http://gerrit.cloudera.org:8080/8446
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo <ad...@cloudera.com>
---
M src/kudu/tablet/rowset_info.cc
M src/kudu/tablet/rowset_info.h
M src/kudu/tablet/rowset_tree.cc
3 files changed, 46 insertions(+), 30 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Adar Dembo: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/8446
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I148b8c976577cbe0020ed887f3bc4c7a2cb2c3ae
Gerrit-Change-Number: 8446
Gerrit-PatchSet: 4
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: zhen.zhang <zh...@xiaomi.com>

[kudu-CR] compaction policy: out-of-line inessential fields in RowSetInfo

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Dan Burkert, Kudu Jenkins, zhen.zhang, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8446

to look at the new patch set (#3).

Change subject: compaction_policy: out-of-line inessential fields in RowSetInfo
......................................................................

compaction_policy: out-of-line inessential fields in RowSetInfo

Based on analysis using toplev.py from pmu-tools[1] I found that the
compaction algorithm was memory-bandwidth bound for the benchmarks in
compaction_policy-test.

The core algorithms of the compaction policy only access a subset of
the fields in the object, meaning that some fields are accessed far
more frequently than others. This can also be illustrated by building
with -fsanitize=cache-efficiency-frag which generates the following
report for the RowSetInfo struct:

==8130==   # 0: offset = 0,	 size = 8,	 count = 20000,	 type = %"class.kudu::tablet::RowSet"*
==8130==   # 1: offset = 8,	 size = 4,	 count = 39998,	 type = i32
==8130==   # 2: offset = 12,	 size = 4,	 count = 107698980,	 type = i32
==8130==   # 3: offset = 16,	 size = 1,	 count = 39998,	 type = i8
==8130==   # 4: offset = 24,	 size = 32,	 count = 210001,	 type = %"class.std::__cxx11::basic_string" = type { %"struct.std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider", i64, %union.anon }
==8130==   # 5: offset = 56,	 size = 32,	 count = 210001,	 type = %"class.std::__cxx11::basic_string" = type { %"struct.std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider", i64, %union.anon }
==8130==   # 6: offset = 88,	 size = 8,	 count = 300940510,	 type = double
==8130==   # 7: offset = 96,	 size = 8,	 count = 200930507,	 type = double
==8130==   # 8: offset = 104,	 size = 8,	 count = 89700226,	 type = double

Here we can see that the second i32 (size_mb) as well as the three
doubles (cdf_min, cdf_max, and density) are accessed orders of
magnitude more than the other fields.

An easy fix for this is to out-of-line the other fields into an ancillary
struct elsewhere on the heap, and store just a pointer to the struct.
This ensures that the frequently-accessed fields fit into a smaller
memory footprint and thus involve far fewer memory accesses.

I benchmarked using both the 'TinyOverlap' case and the 'Ycsb' case,
measuring CPU cache misses at each level along with total runtime.

TinyOverlap case
------------------

Before:
 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*TinyOver*' (10 runs):

        55,773,441      mem_load_retired.l1_miss                                      ( +-  0.91% )  (66.20%)
     1,200,071,033      mem_load_retired.l1_hit                                       ( +-  0.19% )  (66.51%)
        44,324,227      mem_load_retired.l2_miss                                      ( +-  1.21% )  (67.02%)
        11,525,953      mem_load_retired.l2_hit                                       ( +-  3.75% )  (67.34%)
            42,298      mem_load_retired.l3_miss                                      ( +-  8.56% )  (67.07%)
        44,461,423      mem_load_retired.l3_hit                                       ( +-  1.16% )  (66.44%)

       0.399781701 seconds time elapsed                                          ( +-  0.84% )
After:
 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*TinyOver*' (10 runs):

        24,864,656      mem_load_retired.l1_miss                                      ( +-  1.37% )  (66.51%)
     1,261,166,288      mem_load_retired.l1_hit                                       ( +-  0.13% )  (66.73%)
         5,694,166      mem_load_retired.l2_miss                                      ( +-  4.37% )  (66.75%)
        18,221,204      mem_load_retired.l2_hit                                       ( +-  1.13% )  (66.97%)
            28,574      mem_load_retired.l3_miss                                      ( +- 16.39% )  (67.10%)
         5,829,783      mem_load_retired.l3_hit                                       ( +-  5.19% )  (66.63%)

       0.360860837 seconds time elapsed                                          ( +-  0.35% )

Summary of improvements:
  - ~10% improvement in wall time
  - L1 cache: 45% reduction in misses
  - L2 cache: 93% reduction in misses
  - L3 cache: 68% reduction in misses

YCSB test case
---------------

Before:
 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*Ycsb*' (10 runs):

        97,654,120      mem_load_retired.l1_miss                                      ( +-  1.13% )  (66.41%)
     2,205,803,278      mem_load_retired.l1_hit                                       ( +-  0.13% )  (66.57%)
        80,632,313      mem_load_retired.l2_miss                                      ( +-  1.68% )  (66.73%)
        16,742,677      mem_load_retired.l2_hit                                       ( +-  3.77% )  (66.99%)
            60,874      mem_load_retired.l3_miss                                      ( +-  5.64% )  (67.02%)
        82,209,502      mem_load_retired.l3_hit                                       ( +-  1.08% )  (66.61%)

       0.747246686 seconds time elapsed                                          ( +-  0.49% )

After:
 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*Ycsb*' (10 runs):

        39,568,838      mem_load_retired.l1_miss                                      ( +-  0.94% )  (66.49%)
     2,328,627,974      mem_load_retired.l1_hit                                       ( +-  0.12% )  (66.67%)
         6,673,404      mem_load_retired.l2_miss                                      ( +-  3.58% )  (66.76%)
        31,882,992      mem_load_retired.l2_hit                                       ( +-  1.11% )  (66.95%)
            59,963      mem_load_retired.l3_miss                                      ( +- 10.28% )  (66.95%)
         6,378,458      mem_load_retired.l3_hit                                       ( +-  3.19% )  (66.58%)

       0.676610184 seconds time elapsed                                          ( +-  0.44% )

Summary:
  - ~10% improvement in wall time
  - L1 cache: 59% reduction in misses
  - L2 cache: 92% reduction in misses
  - L3 cache: 1% reduction (apparently this workload always fit well
    in L3)

As this is the third in a set of patches for optimizations that were
concentrating on the "TinyOverlap" test case, I also verified
before/after of the "YCSB" compaction test case since the beginning of
series.  This also showed about a 14% improvement:

ycsb at start:
 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*Ycsb*' (10 runs):

        795.924340      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.53% )
                27      context-switches          #    0.034 K/sec                    ( +- 60.21% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
             2,672      page-faults               #    0.003 M/sec                    ( +-  1.36% )
     2,998,040,105      cycles                    #    3.767 GHz                      ( +-  0.35% )
     7,457,801,099      instructions              #    2.49  insn per cycle           ( +-  0.00% )
     1,212,318,353      branches                  # 1523.158 M/sec                    ( +-  0.00% )
         9,741,331      branch-misses             #    0.80% of all branches          ( +-  0.03% )

       0.797407974 seconds time elapsed                                          ( +-  0.58% )

ycsb at end:
 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*Ycsb*' (10 runs):

        681.047998      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.52% )
                 2      context-switches          #    0.003 K/sec                    ( +- 46.55% )
                 0      cpu-migrations            #    0.000 K/sec
             2,474      page-faults               #    0.004 M/sec                    ( +-  1.22% )
     2,581,500,560      cycles                    #    3.790 GHz                      ( +-  0.42% )
     7,267,792,155      instructions              #    2.82  insn per cycle           ( +-  0.00% )
     1,210,494,552      branches                  # 1777.400 M/sec                    ( +-  0.00% )
         9,790,631      branch-misses             #    0.81% of all branches          ( +-  0.02% )

       0.681425636 seconds time elapsed                                          ( +-  0.52% )

[1] https://github.com/andikleen/pmu-tools/wiki/toplev-manual

Change-Id: I148b8c976577cbe0020ed887f3bc4c7a2cb2c3ae
---
M src/kudu/tablet/rowset_info.cc
M src/kudu/tablet/rowset_info.h
M src/kudu/tablet/rowset_tree.cc
3 files changed, 46 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/46/8446/3
-- 
To view, visit http://gerrit.cloudera.org:8080/8446
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I148b8c976577cbe0020ed887f3bc4c7a2cb2c3ae
Gerrit-Change-Number: 8446
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: zhen.zhang <zh...@xiaomi.com>

[kudu-CR] compaction policy: out-of-line inessential fields in RowSetInfo

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Dan Burkert, Kudu Jenkins, zhen.zhang, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8446

to look at the new patch set (#2).

Change subject: compaction_policy: out-of-line inessential fields in RowSetInfo
......................................................................

compaction_policy: out-of-line inessential fields in RowSetInfo

Based on analysis using toplev.py from pmu-tools[1] I found that the
compaction algorithm was memory-bandwidth bound for the benchmarks in
compaction_policy-test.

The core algorithms of the compaction policy only access a subset of
the fields in the object, meaning that some fields are accessed far
more frequently than others. This can also be illustrated by building
with -fsanitize=cache-efficiency-frag which generates the following
report for the RowSetInfo struct:

==8130==   # 0: offset = 0,	 size = 8,	 count = 20000,	 type = %"class.kudu::tablet::RowSet"*
==8130==   # 1: offset = 8,	 size = 4,	 count = 39998,	 type = i32
==8130==   # 2: offset = 12,	 size = 4,	 count = 107698980,	 type = i32
==8130==   # 3: offset = 16,	 size = 1,	 count = 39998,	 type = i8
==8130==   # 4: offset = 24,	 size = 32,	 count = 210001,	 type = %"class.std::__cxx11::basic_string" = type { %"struct.std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider", i64, %union.anon }
==8130==   # 5: offset = 56,	 size = 32,	 count = 210001,	 type = %"class.std::__cxx11::basic_string" = type { %"struct.std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider", i64, %union.anon }
==8130==   # 6: offset = 88,	 size = 8,	 count = 300940510,	 type = double
==8130==   # 7: offset = 96,	 size = 8,	 count = 200930507,	 type = double
==8130==   # 8: offset = 104,	 size = 8,	 count = 89700226,	 type = double

Here we can see that the second i32 (size_mb) as well as the three
doubles (cdf_min, cdf_max, and density) are accessed orders of
magnitude more than the other fields.

An easy fix for this is to out-of-line the other fields into an ancillary
struct elsewhere on the heap, and store just a pointer to the struct.
This ensures that the frequently-accessed fields fit into a smaller
memory footprint and thus involve far fewer memory accesses.

I benchmarked using both the 'TinyOverlap' case and the 'Ycsb' case,
measuring CPU cache misses at each level along with total runtime.

TinyOverlap case
------------------

Before:
 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*TinyOver*' (10 runs):

        55,773,441      mem_load_retired.l1_miss                                      ( +-  0.91% )  (66.20%)
     1,200,071,033      mem_load_retired.l1_hit                                       ( +-  0.19% )  (66.51%)
        44,324,227      mem_load_retired.l2_miss                                      ( +-  1.21% )  (67.02%)
        11,525,953      mem_load_retired.l2_hit                                       ( +-  3.75% )  (67.34%)
            42,298      mem_load_retired.l3_miss                                      ( +-  8.56% )  (67.07%)
        44,461,423      mem_load_retired.l3_hit                                       ( +-  1.16% )  (66.44%)

       0.399781701 seconds time elapsed                                          ( +-  0.84% )
After:
 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*TinyOver*' (10 runs):

        24,864,656      mem_load_retired.l1_miss                                      ( +-  1.37% )  (66.51%)
     1,261,166,288      mem_load_retired.l1_hit                                       ( +-  0.13% )  (66.73%)
         5,694,166      mem_load_retired.l2_miss                                      ( +-  4.37% )  (66.75%)
        18,221,204      mem_load_retired.l2_hit                                       ( +-  1.13% )  (66.97%)
            28,574      mem_load_retired.l3_miss                                      ( +- 16.39% )  (67.10%)
         5,829,783      mem_load_retired.l3_hit                                       ( +-  5.19% )  (66.63%)

       0.360860837 seconds time elapsed                                          ( +-  0.35% )

Summary of improvements:
  - ~10% improvement in wall time
  - L1 cache: 45% reduction in misses
  - L2 cache: 93% reduction in misses
  - L3 cache: 68% reduction in misses

YCSB test case
---------------

Before:
 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*Ycsb*' (10 runs):

        97,654,120      mem_load_retired.l1_miss                                      ( +-  1.13% )  (66.41%)
     2,205,803,278      mem_load_retired.l1_hit                                       ( +-  0.13% )  (66.57%)
        80,632,313      mem_load_retired.l2_miss                                      ( +-  1.68% )  (66.73%)
        16,742,677      mem_load_retired.l2_hit                                       ( +-  3.77% )  (66.99%)
            60,874      mem_load_retired.l3_miss                                      ( +-  5.64% )  (67.02%)
        82,209,502      mem_load_retired.l3_hit                                       ( +-  1.08% )  (66.61%)

       0.747246686 seconds time elapsed                                          ( +-  0.49% )

After:
 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*Ycsb*' (10 runs):

        39,568,838      mem_load_retired.l1_miss                                      ( +-  0.94% )  (66.49%)
     2,328,627,974      mem_load_retired.l1_hit                                       ( +-  0.12% )  (66.67%)
         6,673,404      mem_load_retired.l2_miss                                      ( +-  3.58% )  (66.76%)
        31,882,992      mem_load_retired.l2_hit                                       ( +-  1.11% )  (66.95%)
            59,963      mem_load_retired.l3_miss                                      ( +- 10.28% )  (66.95%)
         6,378,458      mem_load_retired.l3_hit                                       ( +-  3.19% )  (66.58%)

       0.676610184 seconds time elapsed                                          ( +-  0.44% )

Summary:
  - ~10% improvement in wall time
  - L1 cache: 59% reduction in misses
  - L2 cache: 92% reduction in misses
  - L3 cache: 1% reduction (apparently this workload always fit well
    in L3)

As this is the third in a set of patches for optimizations that were
concentrating on the "TinyOverlap" test case, I also verified
before/after of the "YCSB" compaction test case since the beginning of
series.  This also showed about a 14% improvement:

ycsb at start:
 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*Ycsb*' (10 runs):

        795.924340      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.53% )
                27      context-switches          #    0.034 K/sec                    ( +- 60.21% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
             2,672      page-faults               #    0.003 M/sec                    ( +-  1.36% )
     2,998,040,105      cycles                    #    3.767 GHz                      ( +-  0.35% )
     7,457,801,099      instructions              #    2.49  insn per cycle           ( +-  0.00% )
     1,212,318,353      branches                  # 1523.158 M/sec                    ( +-  0.00% )
         9,741,331      branch-misses             #    0.80% of all branches          ( +-  0.03% )

       0.797407974 seconds time elapsed                                          ( +-  0.58% )

ycsb at end:
 Performance counter stats for 'build/latest/bin/compaction_policy-test --gtest_filter=*Ycsb*' (10 runs):

        681.047998      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.52% )
                 2      context-switches          #    0.003 K/sec                    ( +- 46.55% )
                 0      cpu-migrations            #    0.000 K/sec
             2,474      page-faults               #    0.004 M/sec                    ( +-  1.22% )
     2,581,500,560      cycles                    #    3.790 GHz                      ( +-  0.42% )
     7,267,792,155      instructions              #    2.82  insn per cycle           ( +-  0.00% )
     1,210,494,552      branches                  # 1777.400 M/sec                    ( +-  0.00% )
         9,790,631      branch-misses             #    0.81% of all branches          ( +-  0.02% )

       0.681425636 seconds time elapsed                                          ( +-  0.52% )

[1] https://github.com/andikleen/pmu-tools/wiki/toplev-manual

Change-Id: I148b8c976577cbe0020ed887f3bc4c7a2cb2c3ae
---
M src/kudu/tablet/rowset_info.cc
M src/kudu/tablet/rowset_info.h
M src/kudu/tablet/rowset_tree.cc
3 files changed, 46 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/46/8446/2
-- 
To view, visit http://gerrit.cloudera.org:8080/8446
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I148b8c976577cbe0020ed887f3bc4c7a2cb2c3ae
Gerrit-Change-Number: 8446
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: zhen.zhang <zh...@xiaomi.com>

[kudu-CR] compaction policy: out-of-line inessential fields in RowSetInfo

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8446 )

Change subject: compaction_policy: out-of-line inessential fields in RowSetInfo
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8446/1/src/kudu/tablet/rowset_info.h
File src/kudu/tablet/rowset_info.h:

http://gerrit.cloudera.org:8080/#/c/8446/1/src/kudu/tablet/rowset_info.h@109
PS1, Line 109:   const std::shared_ptr<ExtraData> extra_;
> Why not unique_ptr, which is only 8 bytes and no atomic ops?
I tried that first, but we actually copy RowSetInfo objects at some point in the algorithm so that we end up with two separate arrays (one sorted by min_key and one by max_key). Despite doubling memory footprint it results in a more sequential access pattern so I think it's a win to do those copies (though I didn't try changing to pointers there)



-- 
To view, visit http://gerrit.cloudera.org:8080/8446
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I148b8c976577cbe0020ed887f3bc4c7a2cb2c3ae
Gerrit-Change-Number: 8446
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: zhen.zhang <zh...@xiaomi.com>
Gerrit-Comment-Date: Wed, 01 Nov 2017 21:21:50 +0000
Gerrit-HasComments: Yes

[kudu-CR] compaction policy: out-of-line inessential fields in RowSetInfo

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/8446 )

Change subject: compaction_policy: out-of-line inessential fields in RowSetInfo
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8446/2/src/kudu/tablet/rowset_info.cc
File src/kudu/tablet/rowset_info.cc:

http://gerrit.cloudera.org:8080/#/c/8446/2/src/kudu/tablet/rowset_info.cc@266
PS2, Line 266: size_bytes()
Nit: use extra_->size_bytes here, to avoid having to jump back and forth between the header and this file.

I figured it's OK since the args to GetBounds() are specified directly instead of by calling accessors.



-- 
To view, visit http://gerrit.cloudera.org:8080/8446
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I148b8c976577cbe0020ed887f3bc4c7a2cb2c3ae
Gerrit-Change-Number: 8446
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: zhen.zhang <zh...@xiaomi.com>
Gerrit-Comment-Date: Wed, 01 Nov 2017 22:27:30 +0000
Gerrit-HasComments: Yes

[kudu-CR] compaction policy: out-of-line inessential fields in RowSetInfo

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8446 )

Change subject: compaction_policy: out-of-line inessential fields in RowSetInfo
......................................................................


Patch Set 1:

(1 comment)

This might be a fun benchmark to play with the new EfficiencySanitizer tool in llvm as well

http://gerrit.cloudera.org:8080/#/c/8446/1/src/kudu/tablet/rowset_info.h
File src/kudu/tablet/rowset_info.h:

http://gerrit.cloudera.org:8080/#/c/8446/1/src/kudu/tablet/rowset_info.h@109
PS1, Line 109:   const std::shared_ptr<ExtraData> extra_;
perhaps we could use RefCounted here instead of shared_ptr, since  iirc shared_ptr is 16 bytes whereas RefCounted is only 8 (and we'd  avoid any atomic ops)



-- 
To view, visit http://gerrit.cloudera.org:8080/8446
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I148b8c976577cbe0020ed887f3bc4c7a2cb2c3ae
Gerrit-Change-Number: 8446
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: zhen.zhang <zh...@xiaomi.com>
Gerrit-Comment-Date: Wed, 01 Nov 2017 20:07:40 +0000
Gerrit-HasComments: Yes

[kudu-CR] compaction policy: out-of-line inessential fields in RowSetInfo

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change. ( http://gerrit.cloudera.org:8080/8446 )

Change subject: compaction_policy: out-of-line inessential fields in RowSetInfo
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8446/1/src/kudu/tablet/rowset_info.h
File src/kudu/tablet/rowset_info.h:

http://gerrit.cloudera.org:8080/#/c/8446/1/src/kudu/tablet/rowset_info.h@109
PS1, Line 109:   const std::shared_ptr<ExtraData> extra_;
> perhaps we could use RefCounted here instead of shared_ptr, since  iirc sha
Why not unique_ptr, which is only 8 bytes and no atomic ops?



-- 
To view, visit http://gerrit.cloudera.org:8080/8446
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I148b8c976577cbe0020ed887f3bc4c7a2cb2c3ae
Gerrit-Change-Number: 8446
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: zhen.zhang <zh...@xiaomi.com>
Gerrit-Comment-Date: Wed, 01 Nov 2017 21:08:18 +0000
Gerrit-HasComments: Yes