Posted to reviews@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2017/04/06 01:40:05 UTC

[kudu-CR] WIP: KUDU-680. bloomfile: switch to a threadlocal cache

Hello David Ribeiro Alves,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/6569

to review the following change.

Change subject: WIP: KUDU-680. bloomfile: switch to a threadlocal cache
......................................................................

WIP: KUDU-680. bloomfile: switch to a threadlocal cache

Prior to this patch, every BloomFileReader instance keeps a per-CPU
vector of IndexTreeIterators to try to avoid construction/destruction
costs. However, this has significant overhead. Rough math:

1TB of data / (32MB/rowset) = 32k rowsets

32k rowsets * 48 cores * (64 bytes per padded_spinlock + 100+ bytes per
                          IndexTreeIterator)
   = approximately 246MB of RAM

This doesn't even include the BlockCache entries which end up pinned by
these readers in cold data that was last read long ago, an issue which
is probably the root cause of KUDU-680.

This patch introduces a ThreadLocalCache utility, which is meant for
keeping a very small (4-entry, currently) per-thread cache for use cases
like this. With the new batched Apply() path for writes, we can expect
consecutive operations to hit the same rowset repeatedly, which means
we're likely to get a good hit rate even with such a small cache.

In addition to being a more memory-efficient way of avoiding
IndexTreeIterator costs, this patch also avoids a trip to the central
BlockCache in the case that subsequent bloom queries fall into the same
exact BloomFilter within the BloomFile by keeping the most recent block
cached in the same structure.
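
To illustrate the approach, here is a simplified sketch of such a cache,
reconstructed from the interface fragments discussed in the review comments
below. It is not the actual contents of src/kudu/util/threadlocal_cache.h;
the fixed uint64_t key type and the GetInstance() helper are assumptions made
for the example.

  // Sketch only -- illustrative, not the real Kudu implementation.
  #include <array>
  #include <cstdint>
  #include <memory>
  #include <utility>

  // Base class so a single generic cache can hold heterogeneous entry types.
  struct ThreadLocalCacheable {
    virtual ~ThreadLocalCacheable() = default;
  };

  class ThreadLocalCache {
   public:
    // Deliberately tiny: with batched Apply(), consecutive operations tend to
    // hit the same rowset, so a few slots already yield a good hit rate.
    static constexpr int kItemCapacity = 4;

    static ThreadLocalCache* GetInstance() {
      thread_local ThreadLocalCache cache;  // one private cache per thread
      return &cache;
    }

    template <class T>
    T* Lookup(uint64_t key) {
      // Scan starting from the slot that hit last time; repeated lookups of
      // the same key then succeed on the first probe.
      for (int i = 0; i < kItemCapacity; i++, last_hit_++) {
        EntryPair& e = entries_[last_hit_ % kItemCapacity];
        if (e.second && e.first == key) {
          return static_cast<T*>(e.second.get());
        }
      }
      return nullptr;
    }

    template <class T, class... Args>
    T* InsertNew(uint64_t key, Args&&... args) {
      // Simple round-robin eviction: overwrite the next slot regardless of age.
      EntryPair& e = entries_[next_slot_++ % kItemCapacity];
      e.first = key;
      e.second = std::make_unique<T>(std::forward<Args>(args)...);
      return static_cast<T*>(e.second.get());
    }

   private:
    using EntryPair = std::pair<uint64_t, std::unique_ptr<ThreadLocalCacheable>>;
    std::array<EntryPair, kItemCapacity> entries_;
    uint8_t last_hit_ = 0;   // slot of the most recent hit (scan start point)
    uint8_t next_slot_ = 0;  // next slot to overwrite on insert
  };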

WIP: should run YCSB as a second end-to-end benchmark
WIP: need to add docs, etc, to code

As a microbenchmark, I used mt-bloomfile-test:
  Before:
   Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):

        16632.878265 task-clock                #    7.212 CPUs utilized            ( +-  0.68% )
             318,630 context-switches          #    0.019 M/sec                    ( +-  1.70% )
                  43 cpu-migrations            #    0.003 K/sec                    ( +- 15.50% )
               1,563 page-faults               #    0.094 K/sec                    ( +-  0.06% )
      47,956,020,453 cycles                    #    2.883 GHz                      ( +-  0.66% )
     <not supported> stalled-cycles-frontend
     <not supported> stalled-cycles-backend
      17,189,022,099 instructions              #    0.36  insns per cycle          ( +-  0.26% )
       3,691,965,566 branches                  #  221.968 M/sec                    ( +-  0.29% )
          31,842,288 branch-misses             #    0.86% of all branches          ( +-  0.33% )

         2.306319236 seconds time elapsed                                          ( +-  0.59% )

  After:
   Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):
        11573.863720 task-clock                #    7.167 CPUs utilized            ( +-  0.91% )
             213,283 context-switches          #    0.018 M/sec                    ( +-  2.31% )
                  29 cpu-migrations            #    0.003 K/sec                    ( +- 10.75% )
               1,566 page-faults               #    0.135 K/sec                    ( +-  0.06% )
      33,399,534,315 cycles                    #    2.886 GHz                      ( +-  0.90% )
     <not supported> stalled-cycles-frontend
     <not supported> stalled-cycles-backend
      13,043,928,426 instructions              #    0.39  insns per cycle          ( +-  0.45% )
       2,722,013,296 branches                  #  235.186 M/sec                    ( +-  0.58% )
          27,412,912 branch-misses             #    1.01% of all branches          ( +-  0.79% )

         1.614814095 seconds time elapsed                                          ( +-  0.86% )

  (43% savings on cycles)

As an end-to-end benchmark, I used tpch_real_world SF=300 with a few local
patches:
- local patch to enable hash-partitioned inserts, so that it's not a
  purely sequential write, and we have contention on the same tablets.
- maintenance manager starts flushing at 60% of the memory limit, but
  only throttles writes at 80% of the memory limit
- maintenance manager wakes up the scheduler thread immediately when
  there is a free worker, to keep the MM workers fully occupied
- reserve 50% of the MM worker threads for flushes at all times

These are various works-in-progress that will show up on gerrit in
coming days. Without these patches, I found that the performance was
fairly comparable because the writer was throttled so heavily by memory
limiting that very few RowSets accumulated and bloom lookups were not a
bottleneck.

The results of this benchmark were:
- Wall time reduced from 2549s to 2113s (20% better)
- CPU time reduced from 77895s to 57399s (35% better)
- Block cache usage stayed under the configured limit with the patch,
  but went above it without the patch.
- Block cache contention was not the bottleneck with the patch (~10x
  reduction in block cache lookups per second)

Various graphs from the benchmark runs are posted here:
  https://docs.google.com/document/d/1rwt3aShl_e95E9rYUPriPkTfsHVOM2sFOVTmwXeAlI4/edit?usp=sharing

Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
---
M src/kudu/cfile/block_pointer.h
M src/kudu/cfile/bloomfile.cc
M src/kudu/cfile/bloomfile.h
M src/kudu/cfile/mt-bloomfile-test.cc
M src/kudu/util/CMakeLists.txt
M src/kudu/util/bloom_filter.h
A src/kudu/util/threadlocal_cache.cc
A src/kudu/util/threadlocal_cache.h
8 files changed, 171 insertions(+), 74 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/69/6569/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>

[kudu-CR] KUDU-680. bloomfile: switch to a threadlocal cache

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: KUDU-680. bloomfile: switch to a threadlocal cache
......................................................................


Patch Set 5: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-680. bloomfile: switch to a threadlocal cache

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6569

to look at the new patch set (#5).

Change subject: KUDU-680. bloomfile: switch to a threadlocal cache
......................................................................

KUDU-680. bloomfile: switch to a threadlocal cache

Prior to this patch, every BloomFileReader instance keeps a per-CPU
vector of IndexTreeIterators to try to avoid construction/destruction
costs. However, this has significant overhead. Rough math:

1TB of data / (32MB/rowset) = 32k rowsets

32k rowsets * 48 cores * (64 bytes per padded_spinlock + 100+ bytes per
                          IndexTreeIterator)
   = approximately 246MB of RAM

This doesn't even include the BlockCache entries which end up pinned by
these readers in cold data that was last read long ago, an issue which
is probably the root cause of KUDU-680.

This patch introduces a ThreadLocalCache utility, which is meant for
keeping a very small (4-entry, currently) per-thread object cache for
use cases like this. With the new batched Apply() path for writes, we
can expect consecutive operations to hit the same rowset repeatedly,
which means we're likely to get a good hit rate even with such a small
cache.

In addition to being a more memory-efficient way of avoiding
IndexTreeIterator costs, this patch also avoids a trip to the central
BlockCache in the case that subsequent bloom queries fall into the same
exact BloomFilter within the BloomFile by keeping the most recent block
cached in the same structure.

As a microbenchmark, I used mt-bloomfile-test:
  Before:
   Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):

        16632.878265 task-clock                #    7.212 CPUs utilized            ( +-  0.68% )
             318,630 context-switches          #    0.019 M/sec                    ( +-  1.70% )
                  43 cpu-migrations            #    0.003 K/sec                    ( +- 15.50% )
               1,563 page-faults               #    0.094 K/sec                    ( +-  0.06% )
      47,956,020,453 cycles                    #    2.883 GHz                      ( +-  0.66% )
     <not supported> stalled-cycles-frontend
     <not supported> stalled-cycles-backend
      17,189,022,099 instructions              #    0.36  insns per cycle          ( +-  0.26% )
       3,691,965,566 branches                  #  221.968 M/sec                    ( +-  0.29% )
          31,842,288 branch-misses             #    0.86% of all branches          ( +-  0.33% )

         2.306319236 seconds time elapsed                                          ( +-  0.59% )

  After:
   Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):
        11314.801368 task-clock                #    7.230 CPUs utilized            ( +-  0.74% )
             213,126 context-switches          #    0.019 M/sec                    ( +-  2.04% )
                  27 cpu-migrations            #    0.002 K/sec                    ( +- 13.33% )
               1,547 page-faults               #    0.137 K/sec                    ( +-  0.05% )
      32,714,275,234 cycles                    #    2.891 GHz                      ( +-  0.72% )
     <not supported> stalled-cycles-frontend
     <not supported> stalled-cycles-backend
      13,071,571,872 instructions              #    0.40  insns per cycle          ( +-  0.35% )
       2,719,869,660 branches                  #  240.382 M/sec                    ( +-  0.49% )
          27,950,304 branch-misses             #    1.03% of all branches          ( +-  0.39% )

  (46% savings on cycles)

As an end-to-end benchmark, I used tpch_real_world SF=300 with a few local
patches:
- local patch to enable hash-partitioned inserts, so that it's not a
  purely sequential write, and we have contention on the same tablets.
- maintenance manager starts flushing at 60% of the memory limit, but
  only throttles writes at 80% of the memory limit
- maintenance manager wakes up the scheduler thread immediately when
  there is a free worker, to keep the MM workers fully occupied
- reserve 50% of the MM worker threads for flushes at all times

These are various works-in-progress that will show up on gerrit in
coming days. Without these patches, I found that the performance was
fairly comparable because the writer was throttled so heavily by memory
limiting that very few RowSets accumulated and bloom lookups were not a
bottleneck.

The results of this benchmark were:
- Wall time reduced from 2549s to 2113s (20% better)
- CPU time reduced from 77895s to 57399s (35% better)
- Block cache usage stayed under the configured limit with the patch,
  but went above it without the patch.
- Block cache contention was not the bottleneck with the patch (~10x
  reduction in block cache lookups per second)

As another end-to-end benchmark, I used YCSB, also using the same local
patches mentioned above so that it wasn't memory-throttling-bound. In
the YCSB "load" phase, the insertions are fully uniform-random, so we
don't expect to see performance benefits, but we do expect to see the
memory usage stay more consistent, and to not see any serious perf
regression. In practice, the results were as predicted, detailed in the
document linked below.

Various graphs from the benchmark runs are posted here:
  https://docs.google.com/document/d/1rwt3aShl_e95E9rYUPriPkTfsHVOM2sFOVTmwXeAlI4/edit?usp=sharing

Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
---
M src/kudu/cfile/block_pointer.h
M src/kudu/cfile/bloomfile.cc
M src/kudu/cfile/bloomfile.h
M src/kudu/cfile/index_btree.h
M src/kudu/cfile/mt-bloomfile-test.cc
M src/kudu/util/bloom_filter.h
M src/kudu/util/mt-threadlocal-test.cc
A src/kudu/util/threadlocal_cache.h
8 files changed, 231 insertions(+), 84 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/69/6569/5
-- 
To view, visit http://gerrit.cloudera.org:8080/6569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-680. bloomfile: switch to a threadlocal cache

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-680. bloomfile: switch to a threadlocal cache
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6569/4/src/kudu/util/threadlocal_cache.h
File src/kudu/util/threadlocal_cache.h:

PS4, Line 30: // This only ensures that the ThreadLocalCache can properly destruct
            : // entries when the thread exits.
> Ah, didn't realize the goal was to share a cache like that.
OK, let me take a whack at making it work that way. Hopefully my template skills are up to snuff.


-- 
To view, visit http://gerrit.cloudera.org:8080/6569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-680. bloomfile: switch to a threadlocal cache

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-680. bloomfile: switch to a threadlocal cache
......................................................................


Patch Set 5: Verified+1

Test failure was an unrelated flake that Alexey just fixed.

-- 
To view, visit http://gerrit.cloudera.org:8080/6569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-680. bloomfile: switch to a threadlocal cache

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: KUDU-680. bloomfile: switch to a threadlocal cache
......................................................................


Patch Set 4:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/6569/4/src/kudu/cfile/bloomfile.cc
File src/kudu/cfile/bloomfile.cc:

Line 175
I imagine mem_consumption_ was here to handle the change in memory consumption in InitOnce(). But, now there are no calls to memory_footprint_excluding_reader(), which means the memory consumption of BloomFileReader isn't being tracked by anyone. Seems like it should be, in CFileSet perhaps?


Line 267:   // If we didn't hit in the cache, make a new cache entry and instantiate a reader.
Based on how you're using the ThreadLocalCache, I think a single LookupOrInsertNew() that returns a pair (object + was it created or looked up) would be cleaner than separate Lookup() and InsertNew() methods. It'd also reduce the number of modulo calculations from 5 to 4 (one from each iteration in Lookup() and one from InsertNew()).
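
As a rough sketch of the suggested shape (using the illustrative names from
the cache sketch earlier in this thread, not Kudu's actual signatures):

  // Illustrative only: returns the cached object plus whether it was created.
  template <class T, class... Args>
  std::pair<T*, bool> LookupOrInsertNew(uint64_t key, Args&&... args) {
    if (T* hit = Lookup<T>(key)) {
      return {hit, /*created=*/false};
    }
    return {InsertNew<T>(key, std::forward<Args>(args)...), /*created=*/true};
  }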


http://gerrit.cloudera.org:8080/#/c/6569/4/src/kudu/util/mt-threadlocal-test.cc
File src/kudu/util/mt-threadlocal-test.cc:

Line 329: TEST_F(ThreadLocalTest, TestThreadLocalCache) {
What about a multi-threaded test to verify that it's really a thread-local cache? That is, if you insert with the same key across threads, you don't wind up with the same item.


http://gerrit.cloudera.org:8080/#/c/6569/4/src/kudu/util/threadlocal_cache.h
File src/kudu/util/threadlocal_cache.h:

PS4, Line 30: // This only ensures that the ThreadLocalCache can properly destruct
            : // entries when the thread exits.
I don't think I understand this. Suppose ThreadLocalCache were templatized on T (the type of element from Lookup() and InsertNew()). Wouldn't that also guarantee that T's destructor runs on every element when the ThreadLocalCache is destroyed?


Line 53:   static constexpr int kItemCapacity = 4;
It must also be <= sizeof last_hit_.


Line 66:     for (int i = 0; i < kItemCapacity; i++, last_hit_++) {
Would it be more performant to:
1. Only modify last_hit_ on L69.
2. Calculate the index on L67 as (last_hit_ + i) % kItemCapacity.

This has the added clarity advantage of mutating last_hit_ only on a real hit.
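
In other words (sketch only, using the names from the illustrative cache
sketch earlier in this thread):

  // Sketch of the suggested loop: last_hit_ is only written back on a real hit.
  for (int i = 0; i < kItemCapacity; i++) {
    int slot = (last_hit_ + i) % kItemCapacity;
    EntryPair& e = entries_[slot];
    if (e.second && e.first == key) {
      last_hit_ = slot;  // record the hit location for the next lookup
      return static_cast<T*>(e.second.get());
    }
  }
  return nullptr;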


PS4, Line 67: %
As per L52, didn't you intend to use bit shifting here?


PS4, Line 80: %
Not bit shifting?


Line 103:   using EntryPair = std::pair<Key, std::unique_ptr<ThreadLocalCacheable>>;
Would it be possible to avoid allocations entirely? Perhaps by templatizing the cache on T, and declaring the array as being of type T, or pair<Key, T>, or pair<Key, optional<T>> (if you don't want to force T to have some sort of default initialization).
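
A sketch of the storage layout being suggested (illustrative only; whether to
use std::optional or boost::optional is an assumption here, not something
settled in this review):

  #include <array>
  #include <optional>
  #include <utility>

  // Entries live inline in the array, so inserting never touches the heap.
  template <class Key, class T, int kCapacity = 4>
  class InlineCacheStorage {
   public:
    template <class... Args>
    T* EmplaceAt(int slot, const Key& key, Args&&... args) {
      entries_[slot].first = key;
      entries_[slot].second.emplace(std::forward<Args>(args)...);
      return &*entries_[slot].second;
    }

    T* Get(int slot, const Key& key) {
      auto& e = entries_[slot];
      return (e.second && e.first == key) ? &*e.second : nullptr;
    }

   private:
    // optional<> avoids requiring T to be default-constructible.
    std::array<std::pair<Key, std::optional<T>>, kCapacity> entries_{};
  };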


-- 
To view, visit http://gerrit.cloudera.org:8080/6569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] WIP: KUDU-680. bloomfile: switch to a threadlocal cache

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6569

to look at the new patch set (#2).

Change subject: WIP: KUDU-680. bloomfile: switch to a threadlocal cache
......................................................................

WIP: KUDU-680. bloomfile: switch to a threadlocal cache

Prior to this patch, every BloomFileReader instance keeps a per-CPU
vector of IndexTreeIterators to try to avoid construction/destruction
costs. However, this has significant overhead. Rough math:

1TB of data / (32MB/rowset) = 32k rowsets

32k rowsets * 48 cores * (64 bytes per padded_spinlock + 100+ bytes per
                          IndexTreeIterator)
   = approximately 246MB of RAM

This doesn't even include the BlockCache entries which end up pinned by
these readers in cold data that was last read long ago, an issue which
is probably the root cause of KUDU-680.

This patch introduces a ThreadLocalCache utility, which is meant for
keeping a very small (4-entry, currently) per-thread cache for use cases
like this. With the new batched Apply() path for writes, we can expect
consecutive operations to hit the same rowset repeatedly, which means
we're likely to get a good hit rate even with such a small cache.

In addition to being a more memory-efficient way of avoiding
IndexTreeIterator costs, this patch also avoids a trip to the central
BlockCache in the case that subsequent bloom queries fall into the same
exact BloomFilter within the BloomFile by keeping the most recent block
cached in the same structure.

WIP: should run YCSB as a second end-to-end benchmark

As a microbenchmark, I used mt-bloomfile-test:
  Before:
   Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):

        16632.878265 task-clock                #    7.212 CPUs utilized            ( +-  0.68% )
             318,630 context-switches          #    0.019 M/sec                    ( +-  1.70% )
                  43 cpu-migrations            #    0.003 K/sec                    ( +- 15.50% )
               1,563 page-faults               #    0.094 K/sec                    ( +-  0.06% )
      47,956,020,453 cycles                    #    2.883 GHz                      ( +-  0.66% )
     <not supported> stalled-cycles-frontend
     <not supported> stalled-cycles-backend
      17,189,022,099 instructions              #    0.36  insns per cycle          ( +-  0.26% )
       3,691,965,566 branches                  #  221.968 M/sec                    ( +-  0.29% )
          31,842,288 branch-misses             #    0.86% of all branches          ( +-  0.33% )

         2.306319236 seconds time elapsed                                          ( +-  0.59% )

  After:
   Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):
        11573.863720 task-clock                #    7.167 CPUs utilized            ( +-  0.91% )
             213,283 context-switches          #    0.018 M/sec                    ( +-  2.31% )
                  29 cpu-migrations            #    0.003 K/sec                    ( +- 10.75% )
               1,566 page-faults               #    0.135 K/sec                    ( +-  0.06% )
      33,399,534,315 cycles                    #    2.886 GHz                      ( +-  0.90% )
     <not supported> stalled-cycles-frontend
     <not supported> stalled-cycles-backend
      13,043,928,426 instructions              #    0.39  insns per cycle          ( +-  0.45% )
       2,722,013,296 branches                  #  235.186 M/sec                    ( +-  0.58% )
          27,412,912 branch-misses             #    1.01% of all branches          ( +-  0.79% )

         1.614814095 seconds time elapsed                                          ( +-  0.86% )

  (43% savings on cycles)

As an end-to-end benchmark, I used tpch_real_world SF=300 with a few local
patches:
- local patch to enable hash-partitioned inserts, so that it's not a
  purely sequential write, and we have contention on the same tablets.
- maintenance manager starts flushing at 60% of the memory limit, but
  only throttles writes at 80% of the memory limit
- maintenance manager wakes up the scheduler thread immediately when
  there is a free worker, to keep the MM workers fully occupied
- reserve 50% of the MM worker threads for flushes at all times

These are various works-in-progress that will show up on gerrit in
coming days. Without these patches, I found that the performance was
fairly comparable because the writer was throttled so heavily by memory
limiting that very few RowSets accumulated and bloom lookups were not a
bottleneck.

The results of this benchmark were:
- Wall time reduced from 2549s to 2113s (20% better)
- CPU time reduced from 77895s to 57399s (35% better)
- Block cache usage stayed under the configured limit with the patch,
  but went above it without the patch.
- Block cache contention was not the bottleneck with the patch (~10x
  reduction in block cache lookups per second)

Various graphs from the benchmark runs are posted here:
  https://docs.google.com/document/d/1rwt3aShl_e95E9rYUPriPkTfsHVOM2sFOVTmwXeAlI4/edit?usp=sharing

Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
---
M src/kudu/cfile/block_pointer.h
M src/kudu/cfile/bloomfile.cc
M src/kudu/cfile/bloomfile.h
M src/kudu/cfile/index_btree.h
M src/kudu/cfile/mt-bloomfile-test.cc
M src/kudu/util/CMakeLists.txt
M src/kudu/util/bloom_filter.h
M src/kudu/util/mt-threadlocal-test.cc
A src/kudu/util/threadlocal_cache.cc
A src/kudu/util/threadlocal_cache.h
10 files changed, 252 insertions(+), 86 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/69/6569/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins

[kudu-CR] KUDU-680. bloomfile: switch to a threadlocal cache

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-680. bloomfile: switch to a threadlocal cache
......................................................................


Patch Set 4:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/6569/4/src/kudu/util/threadlocal_cache.h
File src/kudu/util/threadlocal_cache.h:

PS4, Line 30: // This only ensures that the ThreadLocalCache can properly destruct
            : // entries when the thread exits.
> I don't think I understand this. Suppose ThreadLocalCache were templatized 
yea, but I'd like to also extend this cache to other use cases without having to make a separate ThreadLocalCache for each type of cached object.

ie we can have a constant-sized cache which includes some entries which are BloomFileReader and others which are BlockCache entries, etc.

If you think that use case isn't that interesting, I could make the whole thing templatized and have N separate caches for N different types. I don't really have a strong opinion.


Line 53:   static constexpr int kItemCapacity = 4;
> It must also be <= sizeof last_hit_.
you mean < (1 << (sizeof(last_hit_) * 8)) right?
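
For illustration, that constraint (plus the power-of-two assumption behind the
modulo-as-mask discussion below) could be checked at compile time along these
lines -- a sketch, assuming uint8_t slot counters as in the illustrative cache
sketch earlier in this thread:

  static_assert(kItemCapacity < (1 << (sizeof(uint8_t) * 8)),
                "capacity must fit in the uint8_t slot counters");
  static_assert((kItemCapacity & (kItemCapacity - 1)) == 0,
                "capacity must be a power of two so '%' compiles to a mask "
                "and the wrapping counters stay consistent");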


Line 66:     for (int i = 0; i < kItemCapacity; i++, last_hit_++) {
> Would it be more performant to:
actually went back and forth on this, not sure which is more performant, but it's probably not a hot spot either way. Will change to that way if you think it's more clear.


PS4, Line 67: %
> As per L52, didn't you intend to use bit shifting here?
% with a constant power of two will turn into bit-shifting after optimization, and I think % is more clear than & (kItemCapacity - 1)
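
For illustration, the two forms are equivalent for unsigned values when
kItemCapacity is a constant power of two, and the optimizer reduces the
former to the latter:

  uint8_t slot_mod  = last_hit_ % kItemCapacity;        // as written
  uint8_t slot_mask = last_hit_ & (kItemCapacity - 1);  // what the compiler emits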


Line 103:   using EntryPair = std::pair<Key, std::unique_ptr<ThreadLocalCacheable>>;
> Would it be possible to avoid allocations entirely? Perhaps by templatizing
Same as above -- we'd have to make this class templatized on the type, vs being generic.


-- 
To view, visit http://gerrit.cloudera.org:8080/6569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-680. bloomfile: switch to a threadlocal cache

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-680. bloomfile: switch to a threadlocal cache
......................................................................


Patch Set 4:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/6569/4/src/kudu/cfile/bloomfile.cc
File src/kudu/cfile/bloomfile.cc:

Line 175
> I imagine mem_consumption_ was here to handle the change in memory consumpt
put this stuff back.


Line 199:                                  ReaderOptions options)
> warning: parameter 'options' is unused [misc-unused-parameters]
Done


Line 267:   // If we didn't hit in the cache, make a new cache entry and instantiate a reader.
> Based on how you're using the ThreadLocalCache, I think a single LookupOrIn
I'm not sure that's great, because then we have to always pass the constructor parameters into LookupOrInsertNew (I moved the IndexTreeIterator constructor parameters into BloomCacheItem to avoid an extra indirection). Plus I'm not sure I follow how it reduces a modulo, since the InsertNew function uses next_slot_, which is different from last_hit_.


Line 301:     bci->cur_block_pointer = std::move(bblk_ptr);
> warning: passing result of std::move() as a const reference argument; no mo
Done


http://gerrit.cloudera.org:8080/#/c/6569/4/src/kudu/util/mt-threadlocal-test.cc
File src/kudu/util/mt-threadlocal-test.cc:

Line 329: TEST_F(ThreadLocalTest, TestThreadLocalCache) {
> What about a multi-threaded test to verify that it's really a thread-local 
meh, I'm gonna skip this one, because we've already tested ThreadLocal above, and the cache's usage of threadlocals is not particularly interesting.


-- 
To view, visit http://gerrit.cloudera.org:8080/6569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-680. bloomfile: switch to a threadlocal cache

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has submitted this change and it was merged.

Change subject: KUDU-680. bloomfile: switch to a threadlocal cache
......................................................................


KUDU-680. bloomfile: switch to a threadlocal cache

Prior to this patch, every BloomFileReader instance keeps a per-CPU
vector of IndexTreeIterators to try to avoid construction/destruction
costs. However, this has significant overhead. Rough math:

1TB of data / (32MB/rowset) = 32k rowsets

32k rowsets * 48 cores * (64 bytes per padded_spinlock + 100+ bytes per
                          IndexTreeIterator)
   = approximately 246MB of RAM

This doesn't even include the BlockCache entries which end up pinned by
these readers in cold data that was last read long ago, an issue which
is probably the root cause of KUDU-680.

This patch introduces a ThreadLocalCache utility, which is meant for
keeping a very small (4-entry, currently) per-thread object cache for
use cases like this. With the new batched Apply() path for writes, we
can expect consecutive operations to hit the same rowset repeatedly,
which means we're likely to get a good hit rate even with such a small
cache.

In addition to being a more memory-efficient way of avoiding
IndexTreeIterator costs, this patch also avoids a trip to the central
BlockCache in the case that subsequent bloom queries fall into the same
exact BloomFilter within the BloomFile by keeping the most recent block
cached in the same structure.

As a microbenchmark, I used mt-bloomfile-test:
  Before:
   Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):

        16632.878265 task-clock                #    7.212 CPUs utilized            ( +-  0.68% )
             318,630 context-switches          #    0.019 M/sec                    ( +-  1.70% )
                  43 cpu-migrations            #    0.003 K/sec                    ( +- 15.50% )
               1,563 page-faults               #    0.094 K/sec                    ( +-  0.06% )
      47,956,020,453 cycles                    #    2.883 GHz                      ( +-  0.66% )
     <not supported> stalled-cycles-frontend
     <not supported> stalled-cycles-backend
      17,189,022,099 instructions              #    0.36  insns per cycle          ( +-  0.26% )
       3,691,965,566 branches                  #  221.968 M/sec                    ( +-  0.29% )
          31,842,288 branch-misses             #    0.86% of all branches          ( +-  0.33% )

         2.306319236 seconds time elapsed                                          ( +-  0.59% )

  After:
   Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):
        11314.801368 task-clock                #    7.230 CPUs utilized            ( +-  0.74% )
             213,126 context-switches          #    0.019 M/sec                    ( +-  2.04% )
                  27 cpu-migrations            #    0.002 K/sec                    ( +- 13.33% )
               1,547 page-faults               #    0.137 K/sec                    ( +-  0.05% )
      32,714,275,234 cycles                    #    2.891 GHz                      ( +-  0.72% )
     <not supported> stalled-cycles-frontend
     <not supported> stalled-cycles-backend
      13,071,571,872 instructions              #    0.40  insns per cycle          ( +-  0.35% )
       2,719,869,660 branches                  #  240.382 M/sec                    ( +-  0.49% )
          27,950,304 branch-misses             #    1.03% of all branches          ( +-  0.39% )

  (46% savings on cycles)

As an end-to-end benchmark, I used tpch_real_world SF=300 with a few local
patches:
- local patch to enable hash-partitioned inserts, so that it's not a
  purely sequential write, and we have contention on the same tablets.
- maintenance manager starts flushing at 60% of the memory limit, but
  only throttles writes at 80% of the memory limit
- maintenance manager wakes up the scheduler thread immediately when
  there is a free worker, to keep the MM workers fully occupied
- reserve 50% of the MM worker threads for flushes at all times

These are various works-in-progress that will show up on gerrit in
coming days. Without these patches, I found that the performance was
fairly comparable because the writer was throttled so heavily by memory
limiting that very few RowSets accumulated and bloom lookups were not a
bottleneck.

The results of this benchmark were:
- Wall time reduced from 2549s to 2113s (20% better)
- CPU time reduced from 77895s to 57399s (35% better)
- Block cache usage stayed under the configured limit with the patch,
  but went above it without the patch.
- Block cache contention was not the bottleneck with the patch (~10x
  reduction in block cache lookups per second)

As another end-to-end benchmark, I used YCSB, also using the same local
patches mentioned above so that it wasn't memory-throttling-bound. In
the YCSB "load" phase, the insertions are fully uniform-random, so we
don't expect to see performance benefits, but we do expect to see the
memory usage stay more consistent, and to not see any serious perf
regression. In practice, the results were as predicted, detailed in the
document linked below.

Various graphs from the benchmark runs are posted here:
  https://docs.google.com/document/d/1rwt3aShl_e95E9rYUPriPkTfsHVOM2sFOVTmwXeAlI4/edit?usp=sharing

Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
Reviewed-on: http://gerrit.cloudera.org:8080/6569
Reviewed-by: Adar Dembo <ad...@cloudera.com>
Tested-by: Todd Lipcon <to...@apache.org>
---
M src/kudu/cfile/block_pointer.h
M src/kudu/cfile/bloomfile.cc
M src/kudu/cfile/bloomfile.h
M src/kudu/cfile/index_btree.h
M src/kudu/cfile/mt-bloomfile-test.cc
M src/kudu/util/bloom_filter.h
M src/kudu/util/mt-threadlocal-test.cc
A src/kudu/util/threadlocal_cache.h
8 files changed, 231 insertions(+), 84 deletions(-)

Approvals:
  Adar Dembo: Looks good to me, approved
  Todd Lipcon: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/6569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-680. bloomfile: switch to a threadlocal cache

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6569

to look at the new patch set (#4).

Change subject: KUDU-680. bloomfile: switch to a threadlocal cache
......................................................................

KUDU-680. bloomfile: switch to a threadlocal cache

Prior to this patch, every BloomFileReader instance keeps a per-CPU
vector of IndexTreeIterators to try to avoid construction/destruction
costs. However, this has significant overhead. Rough math:

1TB of data / (32MB/rowset) = 32k rowsets

32k rowsets * 48 cores * (64 bytes per padded_spinlock + 100+ bytes per
                          IndexTreeIterator)
   = approximately 246MB of RAM

This doesn't even include the BlockCache entries which end up pinned by
these readers in cold data that was last read long ago, an issue which
is probably the root cause of KUDU-680.

This patch introduces a ThreadLocalCache utility, which is meant for
keeping a very small (4-entry, currently) per-thread object cache for
use cases like this. With the new batched Apply() path for writes, we
can expect consecutive operations to hit the same rowset repeatedly,
which means we're likely to get a good hit rate even with such a small
cache.

In addition to being a more memory-efficient way of avoiding
IndexTreeIterator costs, this patch also avoids a trip to the central
BlockCache in the case that subsequent bloom queries fall into the same
exact BloomFilter within the BloomFile by keeping the most recent block
cached in the same structure.

As a microbenchmark, I used mt-bloomfile-test:
  Before:
   Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):

        16632.878265 task-clock                #    7.212 CPUs utilized            ( +-  0.68% )
             318,630 context-switches          #    0.019 M/sec                    ( +-  1.70% )
                  43 cpu-migrations            #    0.003 K/sec                    ( +- 15.50% )
               1,563 page-faults               #    0.094 K/sec                    ( +-  0.06% )
      47,956,020,453 cycles                    #    2.883 GHz                      ( +-  0.66% )
     <not supported> stalled-cycles-frontend
     <not supported> stalled-cycles-backend
      17,189,022,099 instructions              #    0.36  insns per cycle          ( +-  0.26% )
       3,691,965,566 branches                  #  221.968 M/sec                    ( +-  0.29% )
          31,842,288 branch-misses             #    0.86% of all branches          ( +-  0.33% )

         2.306319236 seconds time elapsed                                          ( +-  0.59% )

  After:
   Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):
        11573.863720 task-clock                #    7.167 CPUs utilized            ( +-  0.91% )
             213,283 context-switches          #    0.018 M/sec                    ( +-  2.31% )
                  29 cpu-migrations            #    0.003 K/sec                    ( +- 10.75% )
               1,566 page-faults               #    0.135 K/sec                    ( +-  0.06% )
      33,399,534,315 cycles                    #    2.886 GHz                      ( +-  0.90% )
     <not supported> stalled-cycles-frontend
     <not supported> stalled-cycles-backend
      13,043,928,426 instructions              #    0.39  insns per cycle          ( +-  0.45% )
       2,722,013,296 branches                  #  235.186 M/sec                    ( +-  0.58% )
          27,412,912 branch-misses             #    1.01% of all branches          ( +-  0.79% )

         1.614814095 seconds time elapsed                                          ( +-  0.86% )

  (43% savings on cycles)

As an end-to-end benchmark, I used tpch_real_world SF=300 with a few local
patches:
- local patch to enable hash-partitioned inserts, so that it's not a
  purely sequential write, and we have contention on the same tablets.
- maintenance manager starts flushing at 60% of the memory limit, but
  only throttles writes at 80% of the memory limit
- maintenance manager wakes up the scheduler thread immediately when
  there is a free worker, to keep the MM workers fully occupied
- reserve 50% of the MM worker threads for flushes at all times

These are various works-in-progress that will show up on gerrit in
coming days. Without these patches, I found that the performance was
fairly comparable because the writer was throttled so heavily by memory
limiting that very few RowSets accumulated and bloom lookups were not a
bottleneck.

The results of this benchmark were:
- Wall time reduced from 2549s to 2113s (20% better)
- CPU time reduced from 77895s to 57399s (35% better)
- Block cache usage stayed under the configured limit with the patch,
  but went above it without the patch.
- Block cache contention was not the bottleneck with the patch (~10x
  reduction in block cache lookups per second)

As another end-to-end benchmark, I used YCSB, also using the same local
patches mentioned above so that it wasn't memory-throttling-bound. In
the YCSB "load" phase, the insertions are fully uniform-random, so we
don't expect to see performance benefits, but we do expect to see the
memory usage stay more consistent, and to not see any serious perf
regression. In practice, the results were as predicted, detailed in the
document linked below.

Various graphs from the benchmark runs are posted here:
  https://docs.google.com/document/d/1rwt3aShl_e95E9rYUPriPkTfsHVOM2sFOVTmwXeAlI4/edit?usp=sharing

Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
---
M src/kudu/cfile/block_pointer.h
M src/kudu/cfile/bloomfile.cc
M src/kudu/cfile/bloomfile.h
M src/kudu/cfile/index_btree.h
M src/kudu/cfile/mt-bloomfile-test.cc
M src/kudu/util/CMakeLists.txt
M src/kudu/util/bloom_filter.h
M src/kudu/util/mt-threadlocal-test.cc
A src/kudu/util/threadlocal_cache.cc
A src/kudu/util/threadlocal_cache.h
10 files changed, 265 insertions(+), 86 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/69/6569/4
-- 
To view, visit http://gerrit.cloudera.org:8080/6569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot

[kudu-CR] WIP: KUDU-680. bloomfile: switch to a threadlocal cache

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6569

to look at the new patch set (#3).

Change subject: WIP: KUDU-680. bloomfile: switch to a threadlocal cache
......................................................................

WIP: KUDU-680. bloomfile: switch to a threadlocal cache

Prior to this patch, every BloomFileReader instance keeps a per-CPU
vector of IndexTreeIterators to try to avoid construction/destruction
costs. However, this has significant overhead. Rough math:

1TB of data / (32MB/rowset) = 32k rowsets

32k rowsets * 48 cores * (64 bytes per padded_spinlock + 100+ bytes per
                          IndexTreeIterator)
   = approximately 246MB of RAM

This doesn't even include the BlockCache entries which end up pinned by
these readers in cold data that was last read long ago, an issue which
is probably the root cause of KUDU-680.

This patch introduces a ThreadLocalCache utility, which is meant for
keeping a very small (4-entry, currently) per-thread cache for use cases
like this. With the new batched Apply() path for writes, we can expect
consecutive operations to hit the same rowset repeatedly, which means
we're likely to get a good hit rate even with such a small cache.

In addition to being a more memory-efficient way of avoiding
IndexTreeIterator costs, this patch also avoids a trip to the central
BlockCache in the case that subsequent bloom queries fall into the same
exact BloomFilter within the BloomFile by keeping the most recent block
cached in the same structure.

WIP: should run YCSB as a second end-to-end benchmark

As a microbenchmark, I used mt-bloomfile-test:
  Before:
   Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):

        16632.878265 task-clock                #    7.212 CPUs utilized            ( +-  0.68% )
             318,630 context-switches          #    0.019 M/sec                    ( +-  1.70% )
                  43 cpu-migrations            #    0.003 K/sec                    ( +- 15.50% )
               1,563 page-faults               #    0.094 K/sec                    ( +-  0.06% )
      47,956,020,453 cycles                    #    2.883 GHz                      ( +-  0.66% )
     <not supported> stalled-cycles-frontend
     <not supported> stalled-cycles-backend
      17,189,022,099 instructions              #    0.36  insns per cycle          ( +-  0.26% )
       3,691,965,566 branches                  #  221.968 M/sec                    ( +-  0.29% )
          31,842,288 branch-misses             #    0.86% of all branches          ( +-  0.33% )

         2.306319236 seconds time elapsed                                          ( +-  0.59% )

  After:
   Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):
        11573.863720 task-clock                #    7.167 CPUs utilized            ( +-  0.91% )
             213,283 context-switches          #    0.018 M/sec                    ( +-  2.31% )
                  29 cpu-migrations            #    0.003 K/sec                    ( +- 10.75% )
               1,566 page-faults               #    0.135 K/sec                    ( +-  0.06% )
      33,399,534,315 cycles                    #    2.886 GHz                      ( +-  0.90% )
     <not supported> stalled-cycles-frontend
     <not supported> stalled-cycles-backend
      13,043,928,426 instructions              #    0.39  insns per cycle          ( +-  0.45% )
       2,722,013,296 branches                  #  235.186 M/sec                    ( +-  0.58% )
          27,412,912 branch-misses             #    1.01% of all branches          ( +-  0.79% )

         1.614814095 seconds time elapsed                                          ( +-  0.86% )

  (43% savings on cycles)

As an end-to-end benchmark, I used tpch_real_world SF=300 with a few local
patches:
- local patch to enable hash-partitioned inserts, so that it's not a
  purely sequential write, and we have contention on the same tablets.
- maintenance manager starts flushing at 60% of the memory limit, but
  only throttles writes at 80% of the memory limit
- maintenance manager wakes up the scheduler thread immediately when
  there is a free worker, to keep the MM workers fully occupied
- reserve 50% of the MM worker threads for flushes at all times

These are various works-in-progress that will show up on gerrit in
coming days. Without these patches, I found that the performance was
fairly comparable because the writer was throttled so heavily by memory
limiting that very few RowSets accumulated and bloom lookups were not a
bottleneck.

The results of this benchmark were:
- Wall time reduced from 2549s to 2113s (20% better)
- CPU time reduced from 77895s to 57399s (35% better)
- Block cache usage stayed under the configured limit with the patch,
  but went above it without the patch.
- Block cache contention was not the bottleneck with the patch (~10x
  reduction in block cache lookups per second)

Various graphs from the benchmark runs are posted here:
  https://docs.google.com/document/d/1rwt3aShl_e95E9rYUPriPkTfsHVOM2sFOVTmwXeAlI4/edit?usp=sharing

Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
---
M src/kudu/cfile/block_pointer.h
M src/kudu/cfile/bloomfile.cc
M src/kudu/cfile/bloomfile.h
M src/kudu/cfile/index_btree.h
M src/kudu/cfile/mt-bloomfile-test.cc
M src/kudu/util/CMakeLists.txt
M src/kudu/util/bloom_filter.h
M src/kudu/util/mt-threadlocal-test.cc
A src/kudu/util/threadlocal_cache.cc
A src/kudu/util/threadlocal_cache.h
10 files changed, 265 insertions(+), 86 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/69/6569/3
-- 
To view, visit http://gerrit.cloudera.org:8080/6569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot

[kudu-CR] KUDU-680. bloomfile: switch to a threadlocal cache

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: KUDU-680. bloomfile: switch to a threadlocal cache
......................................................................


Patch Set 4:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6569/4/src/kudu/util/threadlocal_cache.h
File src/kudu/util/threadlocal_cache.h:

PS4, Line 30: // This only ensures that the ThreadLocalCache can properly destruct
            : // entries when the thread exits.
> yea, but I'd like to also extend this cache to other use cases without havi
Ah, didn't realize the goal was to share a cache like that.

Do you think a shared cache would remain performant as more object types were added? Let's take your example of a single cache used for both bloom file readers and block cache entries. Assuming there's one "hot" object for each type and a split access pattern, the best we can hope for is "ping-ponging" between the two hot objects, right?

Plus, as more object types are added, kItemCapacity will need to grow, at which point there'd be no discernible difference between multiple 4-item caches and one n-item cache, right?

Am I looking at this the right way? To me this suggests that a multi-cache approach would be net better.


Line 53:   static constexpr int kItemCapacity = 4;
> you mean < (1 << (sizeof(last_hit_) * 8)) right?
Yes, my bad.


Line 66:     for (int i = 0; i < kItemCapacity; i++, last_hit_++) {
> actually went back and forth on this, not sure which is more performant, bu
Yeah, I would definitely find it more clear if last_hit_ is only modified during a real hit.


PS4, Line 67: %
> % with a constant power of two will turn into bit-shifting after optimizati
Oh ok, didn't realize you were referring to the compiler when you wrote "modulo can be implemented with bit-shifting".


-- 
To view, visit http://gerrit.cloudera.org:8080/6569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes