Posted to commits@kudu.apache.org by to...@apache.org on 2017/04/12 06:47:26 UTC

kudu git commit: KUDU-680. bloomfile: switch to a threadlocal cache

Repository: kudu
Updated Branches:
  refs/heads/master 4649cd8af -> 9adbd90d1


KUDU-680. bloomfile: switch to a threadlocal cache

Prior to this patch, every BloomFileReader instance kept a per-CPU
vector of IndexTreeIterators to avoid construction/destruction costs.
However, this had significant memory overhead. Rough math:

1TB of data / (32MB/rowset) = 32k rowsets

32k rowsets * 48 cores * (64 bytes per padded_spinlock + 100+ bytes per
                          IndexTreeIterator)
   = approximately 246MB of RAM

This doesn't even include the BlockCache entries that end up pinned by
these readers for cold data last read long ago, an issue which is
probably the root cause of KUDU-680.

This patch introduces a ThreadLocalCache utility, meant for keeping a
very small (currently 4-entry) per-thread object cache for use cases
like this. With the new batched Apply() path for writes, consecutive
operations are likely to hit the same rowset repeatedly, so we can
expect a good hit rate even with such a small cache.
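
To make the access pattern concrete, here is a minimal sketch of the
lookup-or-insert idiom the new utility supports (based on the API in
the threadlocal_cache.h diff below; GetOrCreateBloomState() is a
hypothetical wrapper for illustration, not part of the patch):

  using BloomTLC = ThreadLocalCache<uint64_t, BloomCacheItem>;

  BloomCacheItem* GetOrCreateBloomState(uint64_t nonce, CFileReader* r) {
    BloomTLC* tlc = BloomTLC::GetInstance();  // per-thread, lazily created
    BloomCacheItem* item = tlc->Lookup(nonce);
    if (!item) {
      // Cache miss: construct in place, evicting one of the
      // kItemCapacity (4) slots if the cache is already full.
      // Trailing arguments are forwarded to BloomCacheItem's constructor.
      item = tlc->EmplaceNew(nonce, r);
    }
    return item;
  }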

In addition to being a more memory-efficient way of avoiding
IndexTreeIterator construction costs, this patch also avoids a trip to
the central BlockCache when consecutive bloom queries fall into the
exact same BloomFilter within the BloomFile: the most recently read
block is kept cached in the same per-thread structure.
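
Condensed, the fast path in CheckKeyPresent() now looks like this (a
sketch of the hunk in bloomfile.cc below, with the seek and error
handling elided):

  BlockPointer bblk_ptr = index_iter->GetCurrentBlockPointer();
  if (!bci->cur_block_pointer.Equals(bblk_ptr)) {
    // Different block than last time: consult the central BlockCache,
    // parse the block header, and re-hydrate the BloomFilter wrapper.
    BlockHandle dblk_data;
    RETURN_NOT_OK(reader_->ReadBlock(bblk_ptr, CFileReader::CACHE_BLOCK,
                                     &dblk_data));
    BloomBlockHeaderPB hdr;
    Slice bloom_data;
    RETURN_NOT_OK(ParseBlockHeader(dblk_data.data(), &hdr, &bloom_data));
    bci->cur_bloom = BloomFilter(bloom_data, hdr.num_hash_functions());
    bci->cur_block_pointer = bblk_ptr;
    bci->cur_block_handle = std::move(dblk_data);
  }
  // Same block as last time: no BlockCache lookup at all.
  *maybe_present = bci->cur_bloom.MayContainKey(probe);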

As a microbenchmark, I used mt-bloomfile-test:
  Before:
   Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):

        16632.878265 task-clock                #    7.212 CPUs utilized            ( +-  0.68% )
             318,630 context-switches          #    0.019 M/sec                    ( +-  1.70% )
                  43 cpu-migrations            #    0.003 K/sec                    ( +- 15.50% )
               1,563 page-faults               #    0.094 K/sec                    ( +-  0.06% )
      47,956,020,453 cycles                    #    2.883 GHz                      ( +-  0.66% )
     <not supported> stalled-cycles-frontend
     <not supported> stalled-cycles-backend
      17,189,022,099 instructions              #    0.36  insns per cycle          ( +-  0.26% )
       3,691,965,566 branches                  #  221.968 M/sec                    ( +-  0.29% )
          31,842,288 branch-misses             #    0.86% of all branches          ( +-  0.33% )

         2.306319236 seconds time elapsed                                          ( +-  0.59% )

  After:
   Performance counter stats for 'build/latest/bin/mt-bloomfile-test' (10 runs):

        11314.801368 task-clock                #    7.230 CPUs utilized            ( +-  0.74% )
             213,126 context-switches          #    0.019 M/sec                    ( +-  2.04% )
                  27 cpu-migrations            #    0.002 K/sec                    ( +- 13.33% )
               1,547 page-faults               #    0.137 K/sec                    ( +-  0.05% )
      32,714,275,234 cycles                    #    2.891 GHz                      ( +-  0.72% )
     <not supported> stalled-cycles-frontend
     <not supported> stalled-cycles-backend
      13,071,571,872 instructions              #    0.40  insns per cycle          ( +-  0.35% )
       2,719,869,660 branches                  #  240.382 M/sec                    ( +-  0.49% )
          27,950,304 branch-misses             #    1.03% of all branches          ( +-  0.39% )

  (~32% fewer cycles; equivalently, the old code used ~46% more)

As an end-to-end benchmark, I used tpch_real_world SF=300 with a few local
patches:
- local patch to enable hash-partitioned inserts, so that it's not a
  purely sequential write, and we have contention on the same tablets.
- maintenance manager starts flushing at 60% of the memory limit, but
  only throttles writes at 80% of the memory limit
- maintenance manager wakes up the scheduler thread immediately when
  there is a free worker, to keep the MM workers fully occupied
- reserve 50% of the MM worker threads for flushes at all times

These are various works-in-progress that will show up on gerrit in the
coming days. Without these patches, performance was fairly comparable
with and without this change: the writer was throttled so heavily by
memory limiting that very few RowSets accumulated, so bloom lookups
were not a bottleneck.

The results of this benchmark were:
- Wall time reduced from 2549s to 2113s (20% better)
- CPU time reduced from 77895s to 57399s (35% better)
- Block cache usage stayed under the configured limit with the patch,
  but went above it without the patch.
- Block cache contention was not the bottleneck with the patch (~10x
  reduction in block cache lookups per second)

As another end-to-end benchmark, I used YCSB, with the same local
patches mentioned above so that it wasn't memory-throttling-bound. In
the YCSB "load" phase, the insertions are fully uniform-random, so we
don't expect to see performance benefits; we do, however, expect memory
usage to stay more consistent and no serious perf regression. In
practice, the results were as predicted, as detailed in the document
linked below.

Various graphs from the benchmark runs are posted here:
  https://docs.google.com/document/d/1rwt3aShl_e95E9rYUPriPkTfsHVOM2sFOVTmwXeAlI4/edit?usp=sharing

Change-Id: Id8efd7f52eb376de2a9c445458827721806d9da8
Reviewed-on: http://gerrit.cloudera.org:8080/6569
Reviewed-by: Adar Dembo <ad...@cloudera.com>
Tested-by: Todd Lipcon <to...@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/9adbd90d
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/9adbd90d
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/9adbd90d

Branch: refs/heads/master
Commit: 9adbd90d170a0a525e8aef05939ba211ca6c3a41
Parents: 4649cd8
Author: Todd Lipcon <to...@apache.org>
Authored: Wed Apr 5 15:36:08 2017 -0700
Committer: Todd Lipcon <to...@apache.org>
Committed: Wed Apr 12 02:57:20 2017 +0000

----------------------------------------------------------------------
 src/kudu/cfile/block_pointer.h       |   4 +
 src/kudu/cfile/bloomfile.cc          | 159 +++++++++++++++---------------
 src/kudu/cfile/bloomfile.h           |  14 ++-
 src/kudu/cfile/index_btree.h         |   4 +
 src/kudu/cfile/mt-bloomfile-test.cc  |   2 -
 src/kudu/util/bloom_filter.h         |   1 +
 src/kudu/util/mt-threadlocal-test.cc |  27 +++++
 src/kudu/util/threadlocal_cache.h    | 110 +++++++++++++++++++++
 8 files changed, 234 insertions(+), 87 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/9adbd90d/src/kudu/cfile/block_pointer.h
----------------------------------------------------------------------
diff --git a/src/kudu/cfile/block_pointer.h b/src/kudu/cfile/block_pointer.h
index 670c9da..3e29f17 100644
--- a/src/kudu/cfile/block_pointer.h
+++ b/src/kudu/cfile/block_pointer.h
@@ -83,6 +83,10 @@ class BlockPointer {
     return size_;
   }
 
+  bool Equals(const BlockPointer& other) const {
+    return offset_ == other.offset_ && size_ == other.size_;
+  }
+
  private:
   uint64_t offset_;
   uint32_t size_;

http://git-wip-us.apache.org/repos/asf/kudu/blob/9adbd90d/src/kudu/cfile/bloomfile.cc
----------------------------------------------------------------------
diff --git a/src/kudu/cfile/bloomfile.cc b/src/kudu/cfile/bloomfile.cc
index 061ad78..23c143c 100644
--- a/src/kudu/cfile/bloomfile.cc
+++ b/src/kudu/cfile/bloomfile.cc
@@ -22,12 +22,14 @@
 
 #include "kudu/cfile/bloomfile.h"
 #include "kudu/cfile/cfile_writer.h"
+#include "kudu/gutil/atomicops.h"
 #include "kudu/gutil/stringprintf.h"
 #include "kudu/gutil/sysinfo.h"
 #include "kudu/util/coding.h"
 #include "kudu/util/hexdump.h"
 #include "kudu/util/malloc.h"
 #include "kudu/util/pb_util.h"
+#include "kudu/util/threadlocal_cache.h"
 
 DECLARE_bool(cfile_lazy_open);
 
@@ -40,6 +42,36 @@ using fs::ReadableBlock;
 using fs::ScopedWritableBlockCloser;
 using fs::WritableBlock;
 
+// Generator for BloomFileReader::instance_nonce_.
+static Atomic64 g_next_nonce = 0;
+
+namespace {
+
+// Frequently, a thread processing a batch of operations will consult the same BloomFile
+// many times in a row. So, we keep a thread-local cache of the state for recently-accessed
+// BloomFileReaders so that we can avoid doing repetitive work.
+class BloomCacheItem {
+ public:
+  explicit BloomCacheItem(CFileReader* reader)
+      : index_iter(reader, reader->validx_root()),
+        cur_block_pointer(0, 0) {
+  }
+
+  // The IndexTreeIterator used to seek the BloomFileReader last time it was accessed.
+  IndexTreeIterator index_iter;
+
+  // The block pointer to the specific block we read last time we used this bloom reader.
+  BlockPointer cur_block_pointer;
+  // The block handle and parsed BloomFilter corresponding to cur_block_pointer.
+  BlockHandle cur_block_handle;
+  BloomFilter cur_bloom;
+
+ private:
+  DISALLOW_COPY_AND_ASSIGN(BloomCacheItem);
+};
+using BloomCacheTLC = ThreadLocalCache<uint64_t, BloomCacheItem>;
+} // anonymous namespace
+
 ////////////////////////////////////////////////////////////
 // Writer
 ////////////////////////////////////////////////////////////
@@ -171,9 +203,11 @@ Status BloomFileReader::OpenNoInit(unique_ptr<ReadableBlock> block,
 
 BloomFileReader::BloomFileReader(unique_ptr<CFileReader> reader,
                                  ReaderOptions options)
-  : reader_(std::move(reader)),
-    mem_consumption_(std::move(options.parent_mem_tracker),
-                     memory_footprint_excluding_reader()) {
+
+    : instance_nonce_(base::subtle::NoBarrier_AtomicIncrement(&g_next_nonce, 1)),
+      reader_(std::move(reader)),
+      mem_consumption_(std::move(options.parent_mem_tracker),
+                       memory_footprint_excluding_reader()) {
 }
 
 Status BloomFileReader::Init() {
@@ -194,22 +228,6 @@ Status BloomFileReader::InitOnce() {
     return Status::Corruption("bloom file missing value index",
                               reader_->ToString());
   }
-
-  BlockPointer validx_root = reader_->validx_root();
-
-  // Ugly hack: create a per-cpu iterator.
-  // Instead this should be threadlocal, or allow us to just
-  // stack-allocate these things more smartly!
-  int n_cpus = base::MaxCPUIndex() + 1;
-  for (int i = 0; i < n_cpus; i++) {
-    index_iters_.emplace_back(
-      IndexTreeIterator::Create(reader_.get(), validx_root));
-  }
-  iter_locks_.reset(new padded_spinlock[n_cpus]);
-
-  // The memory footprint has changed.
-  mem_consumption_.Reset(memory_footprint_excluding_reader());
-
   return Status::OK();
 }
 
@@ -246,71 +264,58 @@ Status BloomFileReader::CheckKeyPresent(const BloomKeyProbe &probe,
                                         bool *maybe_present) {
   DCHECK(init_once_.initted());
 
-#if defined(__linux__)
-  int cpu = sched_getcpu();
-#else
-  // Use just one lock if on OS X.
-  int cpu = 0;
-#endif
-  BlockPointer bblk_ptr;
-  {
-    std::unique_lock<simple_spinlock> lock;
-    while (true) {
-      std::unique_lock<simple_spinlock> l(iter_locks_[cpu], std::try_to_lock);
-      if (l.owns_lock()) {
-        lock.swap(l);
-        break;
-      }
-      cpu = (cpu + 1) % index_iters_.size();
-    }
-
-    cfile::IndexTreeIterator *index_iter = index_iters_[cpu].get();
-
-    Status s = index_iter->SeekAtOrBefore(probe.key());
-    if (PREDICT_FALSE(s.IsNotFound())) {
-      // Seek to before the first entry in the file.
-      *maybe_present = false;
-      return Status::OK();
-    }
-    RETURN_NOT_OK(s);
-
-    // Successfully found the pointer to the bloom block. Read it.
-    bblk_ptr = index_iter->GetCurrentBlockPointer();
+  // Since we frequently will access the same BloomFile many times in a row
+  // when processing a batch of operations, we put our state in a small thread-local
+  // cache, keyed by the BloomFileReader's nonce. We use this nonce rather than
+  // the BlockID because it's possible that a BloomFile could be closed and
+  // re-opened, in which case we don't want to use our previous cache entry,
+  // which now points to a destructed CFileReader.
+  auto* tlc = BloomCacheTLC::GetInstance();
+  BloomCacheItem* bci = tlc->Lookup(instance_nonce_);
+  // If we didn't hit in the cache, make a new cache entry and instantiate a reader.
+  if (!bci) {
+    bci = tlc->EmplaceNew(instance_nonce_, reader_.get());
+  }
+  DCHECK_EQ(reader_.get(), bci->index_iter.cfile_reader())
+      << "Cached index reader does not match expected instance";
+
+  IndexTreeIterator* index_iter = &bci->index_iter;
+  Status s = index_iter->SeekAtOrBefore(probe.key());
+  if (PREDICT_FALSE(s.IsNotFound())) {
+    // Seek to before the first entry in the file.
+    *maybe_present = false;
+    return Status::OK();
+  }
+  RETURN_NOT_OK(s);
+
+  // Successfully found the pointer to the bloom block.
+  BlockPointer bblk_ptr = index_iter->GetCurrentBlockPointer();
+
+  // If the previous lookup from this bloom on this thread seeked to a different
+  // block in the BloomFile, we need to read the correct block and re-hydrate the
+  // BloomFilter instance.
+  if (!bci->cur_block_pointer.Equals(bblk_ptr)) {
+    BlockHandle dblk_data;
+    RETURN_NOT_OK(reader_->ReadBlock(bblk_ptr, CFileReader::CACHE_BLOCK, &dblk_data));
+
+    // Parse the header in the block.
+    BloomBlockHeaderPB hdr;
+    Slice bloom_data;
+    RETURN_NOT_OK(ParseBlockHeader(dblk_data.data(), &hdr, &bloom_data));
+
+    // Save the data back into our threadlocal cache.
+    bci->cur_bloom = BloomFilter(bloom_data, hdr.num_hash_functions());
+    bci->cur_block_pointer = bblk_ptr;
+    bci->cur_block_handle = std::move(dblk_data);
   }
-
-  BlockHandle dblk_data;
-  RETURN_NOT_OK(reader_->ReadBlock(bblk_ptr, CFileReader::CACHE_BLOCK, &dblk_data));
-
-  // Parse the header in the block.
-  BloomBlockHeaderPB hdr;
-  Slice bloom_data;
-  RETURN_NOT_OK(ParseBlockHeader(dblk_data.data(), &hdr, &bloom_data));
 
   // Actually check the bloom filter.
-  BloomFilter bf(bloom_data, hdr.num_hash_functions());
-  *maybe_present = bf.MayContainKey(probe);
+  *maybe_present = bci->cur_bloom.MayContainKey(probe);
   return Status::OK();
 }
 
 size_t BloomFileReader::memory_footprint_excluding_reader() const {
-  size_t size = kudu_malloc_usable_size(this);
-
-  size += init_once_.memory_footprint_excluding_this();
-
-  // TODO: Track the iterators' memory footprint? May change with every seek;
-  // not clear if it's worth doing.
-  if (!index_iters_.empty()) {
-    size += kudu_malloc_usable_size(index_iters_.data());
-    for (int i = 0; i < index_iters_.size(); i++) {
-      size += kudu_malloc_usable_size(index_iters_[i].get());
-    }
-  }
-
-  if (iter_locks_) {
-    size += kudu_malloc_usable_size(iter_locks_.get());
-  }
-
-  return size;
+  return kudu_malloc_usable_size(this) + init_once_.memory_footprint_excluding_this();
 }
 
 } // namespace cfile

http://git-wip-us.apache.org/repos/asf/kudu/blob/9adbd90d/src/kudu/cfile/bloomfile.h
----------------------------------------------------------------------
diff --git a/src/kudu/cfile/bloomfile.h b/src/kudu/cfile/bloomfile.h
index d01942c..c88f0cc 100644
--- a/src/kudu/cfile/bloomfile.h
+++ b/src/kudu/cfile/bloomfile.h
@@ -124,15 +124,13 @@ class BloomFileReader {
   // excluding the CFileReader, which is tracked independently.
   size_t memory_footprint_excluding_reader() const;
 
-  std::unique_ptr<CFileReader> reader_;
+  // Sequence number for the instance, generated from a global counter.
+  // Used for a ThreadLocalCache key.
+  // TODO(todd): if we want to conserve a bit of memory we could try to
+  // collapse this into the init_once_ object or some-such.
+  const uint64_t instance_nonce_;
 
-  // TODO: temporary workaround for the fact that
-  // the index tree iterator is a member of the Reader object.
-  // We need a big per-thread object which gets passed around so as
-  // to avoid this... Instead we'll use a per-CPU iterator as a
-  // lame hack.
-  std::vector<std::unique_ptr<cfile::IndexTreeIterator>> index_iters_;
-  gscoped_ptr<padded_spinlock[]> iter_locks_;
+  std::unique_ptr<CFileReader> reader_;
 
   KuduOnceDynamic init_once_;
 

http://git-wip-us.apache.org/repos/asf/kudu/blob/9adbd90d/src/kudu/cfile/index_btree.h
----------------------------------------------------------------------
diff --git a/src/kudu/cfile/index_btree.h b/src/kudu/cfile/index_btree.h
index ecd4c9b..c0d0f62 100644
--- a/src/kudu/cfile/index_btree.h
+++ b/src/kudu/cfile/index_btree.h
@@ -86,6 +86,10 @@ class IndexTreeIterator {
     const CFileReader *reader,
     const BlockPointer &idx_root);
 
+  const CFileReader* cfile_reader() const {
+    return reader_;
+  }
+
  private:
   IndexBlockIterator *BottomIter();
   IndexBlockReader *BottomReader();

http://git-wip-us.apache.org/repos/asf/kudu/blob/9adbd90d/src/kudu/cfile/mt-bloomfile-test.cc
----------------------------------------------------------------------
diff --git a/src/kudu/cfile/mt-bloomfile-test.cc b/src/kudu/cfile/mt-bloomfile-test.cc
index aafbd70..91b03d3 100644
--- a/src/kudu/cfile/mt-bloomfile-test.cc
+++ b/src/kudu/cfile/mt-bloomfile-test.cc
@@ -29,7 +29,6 @@ namespace cfile {
 class MTBloomFileTest : public BloomFileTestBase {
 };
 
-#ifdef NDEBUG
 TEST_F(MTBloomFileTest, Benchmark) {
   ASSERT_NO_FATAL_FAILURE(WriteTestBloomFile());
   ASSERT_OK(OpenBloomFile());
@@ -47,7 +46,6 @@ TEST_F(MTBloomFileTest, Benchmark) {
     t->Join();
   }
 }
-#endif
 
 } // namespace cfile
 } // namespace kudu

http://git-wip-us.apache.org/repos/asf/kudu/blob/9adbd90d/src/kudu/util/bloom_filter.h
----------------------------------------------------------------------
diff --git a/src/kudu/util/bloom_filter.h b/src/kudu/util/bloom_filter.h
index fb12022..cbf5250 100644
--- a/src/kudu/util/bloom_filter.h
+++ b/src/kudu/util/bloom_filter.h
@@ -169,6 +169,7 @@ class BloomFilterBuilder {
 // Wrapper around a byte array for reading it as a bloom filter.
 class BloomFilter {
  public:
+  BloomFilter() : bitmap_(nullptr) {}
   BloomFilter(const Slice &data, size_t n_hashes);
 
   // Return true if the filter may contain the given key.

http://git-wip-us.apache.org/repos/asf/kudu/blob/9adbd90d/src/kudu/util/mt-threadlocal-test.cc
----------------------------------------------------------------------
diff --git a/src/kudu/util/mt-threadlocal-test.cc b/src/kudu/util/mt-threadlocal-test.cc
index 1d4d6b3..2390b82 100644
--- a/src/kudu/util/mt-threadlocal-test.cc
+++ b/src/kudu/util/mt-threadlocal-test.cc
@@ -17,6 +17,7 @@
 
 #include <glog/logging.h>
 #include <mutex>
+#include <string>
 #include <unordered_set>
 
 #include "kudu/gutil/macros.h"
@@ -30,7 +31,9 @@
 #include "kudu/util/test_util.h"
 #include "kudu/util/thread.h"
 #include "kudu/util/threadlocal.h"
+#include "kudu/util/threadlocal_cache.h"
 
+using std::string;
 using std::unordered_set;
 using std::vector;
 using strings::Substitute;
@@ -318,5 +321,29 @@ TEST_F(ThreadLocalTest, TestTLSMember) {
   }
 }
 
+TEST_F(ThreadLocalTest, TestThreadLocalCache) {
+  using TLC = ThreadLocalCache<int, string>;
+  TLC* tlc = TLC::GetInstance();
+
+  // Lookup in an empty cache should return nullptr.
+  ASSERT_EQ(nullptr, tlc->Lookup(0));
+
+  // Insert more items than the cache capacity.
+  const int kLastItem = TLC::kItemCapacity * 2;
+  for (int i = 1; i <= kLastItem ; i++) {
+    auto* item = tlc->EmplaceNew(i);
+    ASSERT_NE(nullptr, item);
+    *item = Substitute("item $0", i);
+  }
+
+  // Looking up the most recent items should return them.
+  string* item = tlc->Lookup(kLastItem);
+  ASSERT_NE(nullptr, item);
+  EXPECT_EQ(*item, Substitute("item $0", kLastItem));
+
+  // Looking up evicted items should return nullptr.
+  ASSERT_EQ(nullptr, tlc->Lookup(1));
+}
+
 } // namespace threadlocal
 } // namespace kudu

http://git-wip-us.apache.org/repos/asf/kudu/blob/9adbd90d/src/kudu/util/threadlocal_cache.h
----------------------------------------------------------------------
diff --git a/src/kudu/util/threadlocal_cache.h b/src/kudu/util/threadlocal_cache.h
new file mode 100644
index 0000000..e9ab3c2
--- /dev/null
+++ b/src/kudu/util/threadlocal_cache.h
@@ -0,0 +1,110 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "kudu/util/threadlocal.h"
+
+#include <boost/optional/optional.hpp>
+#include <array>
+#include <memory>
+#include <utility>
+
+namespace kudu {
+
+// A small thread-local cache for arbitrary objects.
+//
+// This can be used as a contention-free "lookaside" type cache for frequently-accessed
+// objects to avoid having to go to a less-efficient centralized cache.
+//
+// 'Key' must be copyable, and comparable using operator==().
+// 'T' has no particular requirements.
+template<class Key, class T>
+class ThreadLocalCache {
+ public:
+  // The number of entries in the cache.
+  // NOTE: this should always be a power of two for good performance, so that the
+  // compiler can optimize the modulo operations into bit-mask operations.
+  static constexpr int kItemCapacity = 4;
+
+  // Look up a key in the cache. Returns either the existing entry with this key,
+  // or nullptr if no entry matched.
+  T* Lookup(const Key& key) {
+    // Our cache is so small that a linear search is likely to be more efficient than
+    // any kind of actual hashing. We always start the search at wherever we most
+    // recently found a hit.
+    for (int i = 0; i < kItemCapacity; i++) {
+      int idx = (last_hit_ + i) % kItemCapacity;
+      auto& p = cache_[idx];
+      if (p.first == key) {
+        last_hit_ = idx;
+        return p.second.get_ptr();
+      }
+    }
+    return nullptr;
+  }
+
+  // Insert a new entry into the cache. If the cache is full (as it usually is in the
+  // steady state), this replaces one of the existing entries. The 'args' are forwarded
+  // to T's constructor.
+  //
+  // NOTE: entries returned by a previous call to Lookup() may possibly be invalidated
+  // by this function.
+  template<typename ... Args>
+  T* EmplaceNew(const Key& key, Args&&... args) {
+    auto& p = cache_[next_slot_++ % kItemCapacity];
+    p.second.emplace(std::forward<Args>(args)...);
+    p.first = key;
+    return p.second.get_ptr();
+  }
+
+  // Get the cache instance for this thread, creating it if it has not yet been
+  // created.
+  //
+  // The instance is automatically deleted and any cached items destructed when the
+  // thread exits.
+  static ThreadLocalCache* GetInstance() {
+    INIT_STATIC_THREAD_LOCAL(ThreadLocalCache, tl_instance_);
+    return tl_instance_;
+  }
+
+ private:
+  using EntryPair = std::pair<Key, boost::optional<T>>;
+  std::array<EntryPair, kItemCapacity> cache_;
+
+  // The next slot that we will write into. We always modulo this by the capacity
+  // before use.
+  uint8_t next_slot_ = 0;
+  // The slot where we last got a cache hit, so we can start our search at the same
+  // spot, optimizing for the case of repeated lookups of the same hot element.
+  uint8_t last_hit_ = 0;
+
+  static_assert(kItemCapacity <= 1 << (sizeof(next_slot_) * 8),
+                "next_slot_ must be large enough for capacity");
+  static_assert(kItemCapacity <= 1 << (sizeof(last_hit_) * 8),
+                "last_hit_ must be large enough for capacity");
+
+  DECLARE_STATIC_THREAD_LOCAL(ThreadLocalCache, tl_instance_);
+};
+
+// Define the thread-local storage for the ThreadLocalCache template.
+// We can't use DEFINE_STATIC_THREAD_LOCAL here because the commas in the
+// template arguments confuse the C preprocessor.
+template<class K, class T>
+__thread ThreadLocalCache<K,T>* ThreadLocalCache<K,T>::tl_instance_;
+
+} // namespace kudu