Posted to issues@hbase.apache.org by "Ben Manes (JIRA)" <ji...@apache.org> on 2018/12/06 04:36:00 UTC

[jira] [Commented] (HBASE-15560) TinyLFU-based BlockCache

    [ https://issues.apache.org/jira/browse/HBASE-15560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710972#comment-16710972 ] 

Ben Manes commented on HBASE-15560:
-----------------------------------

Sorry that this dropped off my radar. I can summarize a few things stated before, with minor updates.
h5. Current (SLRU)
 * Design:
 ** Reads perform a ConcurrentHashMap lookup, increment a global AtomicLong counter to record the access time, and mark the block's type as frequent
 ** Writes perform a ConcurrentHashMap update and notify a thread if the cache overflows
 ** The eviction thread wakes up every 10s or when notified, performs an O(n lg n) sort, and evicts from the recency/frequency segments down to the watermarks (see the sketch after this list)
 * Benefits
 ** Provides scan resistance and captures simple frequency workloads. Is optimal for Zipf.
 ** Has minimal latencies at low/modest concurrency, as it does very little work on the requesting threads
 * Costs
 ** At high concurrency, the AtomicLong would be a synchronization bottleneck (~10M op/sec if I recall correctly). This probably does not matter, because disk I/O, network I/O, etc. result in only modest thrashing on this counter.
 ** No back-pressure on writes if the cache cannot evict fast enough. However, the I/O involved may make this moot. 
 ** Expected lower hit rates in real-world traces, based on the variety of workloads we have examined (non-HBase, various frequency/recency mixtures)
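
As a rough illustration of the read/eviction flow described in the design bullets above (the class and names are hypothetical and greatly simplified, not the actual LruBlockCache code):

{code:java}
import java.util.Comparator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, simplified sketch of the SLRU access pattern described above.
class SlruLikeCache<K, V> {
  static final class Entry<V> {
    final V value;
    volatile long accessTime;   // logical time taken from the global counter
    volatile boolean frequent;  // set once the block is re-referenced
    Entry(V value) { this.value = value; }
  }

  private final ConcurrentHashMap<K, Entry<V>> map = new ConcurrentHashMap<>();
  private final AtomicLong ticker = new AtomicLong();  // single point of contention
  private final Object evictionSignal = new Object();
  private final int capacity;

  SlruLikeCache(int capacity) { this.capacity = capacity; }

  V get(K key) {
    Entry<V> e = map.get(key);                // O(1) hash table lookup
    if (e == null) { return null; }
    e.accessTime = ticker.incrementAndGet();  // global AtomicLong increment
    e.frequent = true;                        // promote to the frequent segment
    return e.value;
  }

  void put(K key, V value) {
    Entry<V> e = new Entry<>(value);
    e.accessTime = ticker.incrementAndGet();
    map.put(key, e);
    if (map.size() > capacity) {
      synchronized (evictionSignal) { evictionSignal.notify(); }  // wake the evictor
    }
  }

  // Runs on the background thread (every 10s or when notified): sort all entries,
  // O(n lg n), and evict the least valuable until back under the watermark.
  void evictionPass() {
    int overflow = Math.max(0, map.size() - capacity);
    map.entrySet().stream()
        .sorted(Comparator
            .comparing((Map.Entry<K, Entry<V>> x) -> x.getValue().frequent)
            .thenComparingLong(x -> x.getValue().accessTime))
        .limit(overflow)
        .forEach(x -> map.remove(x.getKey()));
  }
}
{code}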

h5. Caffeine (Proposed, TinyLFU)
 * Design:
 ** Reads perform a ConcurrentHashMap lookup, hash to a ring buffer (a growable array of buffers), and try to add the item (up to 3 times, possibly rehashing). If the ring buffer is full or a state-machine flag is set, the reader tryLocks to schedule a drain task on an executor.
 ** Writes perform a ConcurrentHashMap update, add to a ring buffer (blocking if full), update a state-machine flag, and tryLock to schedule a drain task on an executor
 ** The executor (default: ForkJoinPool.commonPool()) drains the ring buffers, replays the events on the eviction policy, and evicts if the cache has overflowed (a simplified sketch of this buffering scheme follows this list)
 * Benefits
 ** Allows a higher degree of read concurrency by not having a single point of contention (striped ring buffers)
 ** Offers back-pressure on writes if the eviction thread cannot keep up (writers are descheduled by taking the eviction lock when the buffer is full)
 ** Spreads out small chunks of O(1) work
 ** Allows more advanced policies / data-structures (TinyLFU, Hierarchical TimerWheel) => higher hit rates & more features
 * Costs
 ** Slightly higher penalties on read / write (no free lunch)
 ** Is more biased towards frequency (a negative for a recency-skewed workload)
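
A loose sketch of the buffer-and-drain scheme described in the design bullets above (names and sizes are invented; Caffeine's real internals use lossy striped ring buffers and a drain-status state machine, and writers help with the drain rather than blocking on a queue):

{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.locks.ReentrantLock;

// Loose sketch of buffering reads/writes and replaying them against the policy.
class BufferedPolicy<K> {
  private final ArrayBlockingQueue<K>[] readBuffers;  // one stripe per hash bucket
  private final ArrayBlockingQueue<K> writeBuffer = new ArrayBlockingQueue<>(128);
  private final ReentrantLock evictionLock = new ReentrantLock();

  @SuppressWarnings("unchecked")
  BufferedPolicy(int stripes) {
    readBuffers = new ArrayBlockingQueue[stripes];
    for (int i = 0; i < stripes; i++) {
      readBuffers[i] = new ArrayBlockingQueue<>(16);
    }
  }

  void recordRead(K key) {
    int stripe = (int) (Thread.currentThread().getId() % readBuffers.length);
    boolean added = readBuffers[stripe].offer(key);  // lossy: drop if the stripe is full
    if (!added) {
      scheduleDrain();
    }
  }

  void recordWrite(K key) throws InterruptedException {
    writeBuffer.put(key);  // blocks when full => back-pressure on writers
    scheduleDrain();
  }

  // tryLock so only one caller at a time asks the executor to drain.
  private void scheduleDrain() {
    if (evictionLock.tryLock()) {
      try {
        ForkJoinPool.commonPool().execute(this::drainBuffers);
      } finally {
        evictionLock.unlock();
      }
    }
  }

  private void drainBuffers() {
    evictionLock.lock();
    try {
      // Replay buffered events on the eviction policy, then evict if over capacity.
      for (ArrayBlockingQueue<K> buffer : readBuffers) {
        for (K key; (key = buffer.poll()) != null; ) { onAccess(key); }
      }
      for (K key; (key = writeBuffer.poll()) != null; ) { onWrite(key); }
      evictIfNeeded();
    } finally {
      evictionLock.unlock();
    }
  }

  void onAccess(K key) { /* bump the frequency sketch, reorder the LRU queues */ }
  void onWrite(K key) { /* admit or reject via TinyLFU, update the queues */ }
  void evictIfNeeded() { /* evict while over capacity */ }
}
{code}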

h5. Synopsis

The SLRU is the cheapest (latency) and closest to optimal (hit rate) for synthetic Zipf testing. It was designed with those considerations in mind. Any other solution will trade higher latency for better hit rates and system behavior. The question is then whether the latency difference is small enough (effectively noise) and whether the higher hit rate improves overall performance. *This can only be definitively answered by someone willing to canary an instance in a live environment.* My belief, from analyzing hit rates and their impact on other applications, is that there will be a benefit.
h5. TinyLFU improvements

We have been exploring ways to improve TinyLFU-based policies in adversarial workloads (recency-biased). In those cases work is brought in, operated on repeatedly, and then never touched again. A good example is a transaction log or a distributed compilation cache (with a local cache). In those workloads frequency is a negative signal: by the time the score is high enough for retention, the item is no longer worth retaining.

We have been working on adaptive schemes by sampling the workload and adjusting based on its characteristics ([paper|https://drive.google.com/open?id=1CT2ASkfuG9qVya9Sn8ZUCZjrFSSyjRA_]). Both a naive hill climber and a statistics-based model correct the policy to the optimal hit rate. I hope to try [adaptive moment estimation|https://arxiv.org/abs/1412.6980], an advanced hill climber, which I believe will be the most robust and inexpensive mechanism (as proven by the ML community). This work will allow the cache to offer the best hit rate regardless of workload, which no other policy has been able to do so far.
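
As a toy illustration of the naive hill climber idea (the sample size, step, and bounds are invented for the example; this is not Caffeine's actual adaptive sizing code):

{code:java}
// Toy hill climber that resizes the admission window (recency region) versus
// the main (frequency) region by sampling the hit rate. Numbers are made up.
class WindowHillClimber {
  private static final int SAMPLE_SIZE = 10_000;  // requests per adaptation period
  private static final double STEP = 0.05;        // fraction of capacity to shift

  private int requests;
  private int hits;
  private double previousHitRate;
  private double direction = +1.0;       // +1 grows the window, -1 shrinks it
  private double windowFraction = 0.01;  // start with a 1% recency window

  void record(boolean hit) {
    requests++;
    if (hit) { hits++; }
    if (requests >= SAMPLE_SIZE) {
      adapt();
    }
  }

  private void adapt() {
    double hitRate = (double) hits / requests;
    // If the last move hurt the hit rate, reverse direction; otherwise keep going.
    if (hitRate < previousHitRate) {
      direction = -direction;
    }
    windowFraction = Math.max(0.0, Math.min(0.8, windowFraction + direction * STEP));
    previousHitRate = hitRate;
    requests = 0;
    hits = 0;
  }

  double windowFraction() { return windowFraction; }
}
{code}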
h5. Next Steps

I don't think there is anything meaningful that I can offer to this ticket. If this were to go in, it would require either a leap of faith (making it an option) or someone in the community proving the benefit. Without an environment or a trace, we can't do more than discuss minor details from synthetic testing.

> TinyLFU-based BlockCache
> ------------------------
>
>                 Key: HBASE-15560
>                 URL: https://issues.apache.org/jira/browse/HBASE-15560
>             Project: HBase
>          Issue Type: Improvement
>          Components: BlockCache
>    Affects Versions: 2.0.0
>            Reporter: Ben Manes
>            Assignee: Ben Manes
>            Priority: Major
>         Attachments: HBASE-15560.patch, HBASE-15560.patch, HBASE-15560.patch, HBASE-15560.patch, HBASE-15560.patch, HBASE-15560.patch, HBASE-15560.patch, bc.hit.count, bc.miss.count, branch-1.tinylfu.txt, gets, run_ycsb_c.sh, run_ycsb_loading.sh, tinylfu.patch
>
>
> LruBlockCache uses the Segmented LRU (SLRU) policy to capture frequency and recency of the working set. It achieves concurrency by using an O( n ) background thread to prioritize the entries and evict. Accessing an entry is O(1) by a hash table lookup, recording its logical access time, and setting a frequency flag. A write is performed in O(1) time by updating the hash table and triggering an async eviction thread. This provides ideal concurrency and minimizes the latencies by penalizing the thread instead of the caller. However the policy does not age the frequencies and may not be resilient to various workload patterns.
> W-TinyLFU ([research paper|http://arxiv.org/pdf/1512.00727.pdf]) records the frequency in a counting sketch, ages periodically by halving the counters, and orders entries by SLRU. An entry is discarded by comparing the frequency of the new arrival (candidate) to the SLRU's victim, and keeping the one with the highest frequency. This allows the operations to be performed in O(1) time and, through the use of a compact sketch, a much larger history is retained beyond the current working set. In a variety of real world traces the policy had [near optimal hit rates|https://github.com/ben-manes/caffeine/wiki/Efficiency].
> Concurrency is achieved by buffering and replaying the operations, similar to a write-ahead log. Reads are recorded into a striped ring buffer and writes into a queue. The operations are applied in batches under a try-lock by an asynchronous thread, thereby tracking the usage pattern without incurring high latencies ([benchmarks|https://github.com/ben-manes/caffeine/wiki/Benchmarks#server-class]).
> In YCSB benchmarks the results were inconclusive. For a large cache (99% hit rates) the two caches have near identical throughput and latencies with LruBlockCache narrowly winning. At medium and small caches, TinyLFU had a 1-4% hit rate improvement and therefore lower latencies. The lackluster result is because a synthetic Zipfian distribution is used, for which SLRU performs optimally. In a more varied, real-world workload we'd expect to see improvements by being able to make smarter predictions.
> The provided patch implements BlockCache using the [Caffeine|https://github.com/ben-manes/caffeine] caching library (see HighScalability [article|http://highscalability.com/blog/2016/1/25/design-of-a-modern-cache.html]).
> Edward Bortnikov and Eshcar Hillel have graciously provided guidance for evaluating this patch ([github branch|https://github.com/ben-manes/hbase/tree/tinylfu]).
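
To make the admission scheme in the quoted description concrete, here is a hypothetical, much-simplified frequency sketch with periodic aging (Caffeine's actual FrequencySketch packs 4-bit counters and is considerably more optimized):

{code:java}
// Hypothetical, simplified count-min style sketch for TinyLFU admission; not
// Caffeine's actual 4-bit FrequencySketch.
class TinyLfuAdmittor {
  private static final int DEPTH = 4;
  private final int[][] table;      // DEPTH rows of counters
  private final int mask;
  private final int sampleSize;     // halve all counters after this many increments
  private int additions;

  TinyLfuAdmittor(int counters, int sampleSize) {
    int size = Math.max(2, Integer.highestOneBit(counters - 1) << 1);  // power of two
    this.table = new int[DEPTH][size];
    this.mask = size - 1;
    this.sampleSize = sampleSize;
  }

  void recordAccess(Object key) {
    int hash = spread(key.hashCode());
    for (int row = 0; row < DEPTH; row++) {
      table[row][index(hash, row)]++;
    }
    if (++additions >= sampleSize) {
      age();  // periodically halve the counters so stale history fades away
    }
  }

  int frequency(Object key) {
    int hash = spread(key.hashCode());
    int min = Integer.MAX_VALUE;
    for (int row = 0; row < DEPTH; row++) {
      min = Math.min(min, table[row][index(hash, row)]);
    }
    return min;
  }

  // Admit the candidate only if it is estimated to be used more often than the
  // SLRU's victim; otherwise the new arrival is discarded.
  boolean admit(Object candidate, Object victim) {
    return frequency(candidate) > frequency(victim);
  }

  private void age() {
    for (int[] row : table) {
      for (int i = 0; i < row.length; i++) {
        row[i] >>>= 1;
      }
    }
    additions = 0;
  }

  private int index(int hash, int row) {
    return (hash * (row * 0x9E3779B1 + 1)) & mask;
  }

  private static int spread(int x) {
    x ^= (x >>> 16);
    return x * 0x45d9f3b;
  }
}
{code}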


