Posted to dev@pulsar.apache.org by Matteo Merli <mm...@apache.org> on 2022/07/21 18:37:18 UTC

[VOTE] PIP-174: Provide new implementation for broker dispatch cache

## Motivation

The current implementation of the read cache in the Pulsar broker has largely
remained unchanged for a long time, except for a few minor tweaks.

While the implementation is stable and reasonably efficient for typical workloads,
the overhead of managing cache evictions in a broker that is running many topics
can be pretty high, both in extra CPU utilization and in the pressure put on the JVM
garbage collector, which has to track an increased number of medium-lived objects.

The goal is to provide an alternative implementation that can adapt better to
a wider variety of operating conditions.

### Current implementation details

The broker cache is implemented as part of the `ManagedLedger` component,
which sits in the Pulsar broker and provides a higher level of abstraction
on top of BookKeeper.

Each topic (and managed-ledger) has its own private cache space. This cache is
implemented as a `ConcurrentSkipList` sorted map that maps `(ledgerId, entryId) ->
payload`. The payload is a `ByteBuf` reference that can either be a slice of a
`ByteBuf` that we got when reading from a socket, or it can be a copied buffer.
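
For illustration, here is a simplified, hypothetical sketch of the kind of per-topic
structure described above (the class and field names are illustrative and are not the
actual `ManagedLedger` cache classes):

```java
import io.netty.buffer.ByteBuf;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical, simplified sketch of a per-topic entry cache keyed by (ledgerId, entryId).
public class PerTopicEntryCacheSketch {

    // Sorted map ordered by ledgerId, then entryId, mirroring the skip-list layout described above.
    private final ConcurrentSkipListMap<PositionKey, ByteBuf> entries = new ConcurrentSkipListMap<>();

    record PositionKey(long ledgerId, long entryId) implements Comparable<PositionKey> {
        @Override
        public int compareTo(PositionKey o) {
            int c = Long.compare(ledgerId, o.ledgerId);
            return c != 0 ? c : Long.compare(entryId, o.entryId);
        }
    }

    public void insert(long ledgerId, long entryId, ByteBuf payload) {
        // The payload may be a retained slice of the network buffer or a copied buffer.
        entries.put(new PositionKey(ledgerId, entryId), payload);
    }

    public ByteBuf get(long ledgerId, long entryId) {
        return entries.get(new PositionKey(ledgerId, entryId));
    }
}
```

Every cached entry here implies several Java objects (a key, a map node, and a `ByteBuf`
reference), which is exactly what creates the GC pressure discussed below.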

Each topic cache is allowed to use up to the full broker max cache size before an
eviction is triggered. The total cache size is effectively a resource shared across
all the topics, where a topic can take a larger portion of it if it "asks for more".

When eviction happens, we need to do an expensive ranking of all the caches in the
broker and evict from each of them in proportion to its currently used space.

The bigger problem is the `ConcurrentSkipList` and the `ByteBuf` objects that need
to be tracked. The skip list is essentially a tree-like structure that has to maintain
a Java object for each entry in the cache, and we also potentially hold a huge number
of `ByteBuf` objects.

A cache workload is typically the worst-case scenario for any garbage collector
implementation because it involves creating objects, storing them for some amount of
time, and then throwing them away. During that time, the GC will already have tenured
these objects and copied them into an "old generation" space, and sometime later a
costly compaction of that memory has to be performed.

To mitigate the effect of the cache workload on the GC, we are very aggressive in
purging the cache by triggering time-based eviction. By putting a max TTL on the
elements in the cache, we avoid keeping objects around long enough for them to become
a problem for the GC.

The flip side is that we artificially reduce the cache capacity to a very short time
frame, which reduces the cache's usefulness.

The other problem is the CPU cost of performing these frequent evictions, which can
be very high when there are tens of thousands of topics in a broker.


## Proposed changes

Instead of dealing with individual caches for each topic, let's adopt a model where
there is a single cache space for the broker.

This cache is broken into N segments that act as a circular buffer. Whenever a
segment is full, we start writing into the next one, and when we reach the last one,
we wrap around and recycle the first segment.

This model has been working very well for the BookKeeper `ReadCache`:
https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/ldb/ReadCache.java

Eviction becomes a completely trivial operation: buffers are simply rotated and
overwritten. We don't need to do any per-topic work or keep track of utilization.
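
As an illustration of the rotation idea only (a hypothetical sketch, not the BookKeeper
`ReadCache` code nor the proposed implementation; the class name and the entry-count
limit are made up), the core logic could look like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the rotation logic: N segments used as a circular buffer,
// where "eviction" is just advancing to the next segment and clearing it before reuse.
public class SegmentRotationSketch<K, V> {

    private final List<Map<K, V>> segments = new ArrayList<>();
    private final int maxEntriesPerSegment;
    private int currentIdx = 0;

    public SegmentRotationSketch(int numSegments, int maxEntriesPerSegment) {
        for (int i = 0; i < numSegments; i++) {
            segments.add(new ConcurrentHashMap<>());
        }
        this.maxEntriesPerSegment = maxEntriesPerSegment;
    }

    public synchronized void put(K key, V value) {
        if (segments.get(currentIdx).size() >= maxEntriesPerSegment) {
            // Current segment is full: wrap around to the next one and recycle it.
            currentIdx = (currentIdx + 1) % segments.size();
            segments.get(currentIdx).clear();
        }
        segments.get(currentIdx).put(key, value);
    }

    public V get(K key) {
        // An entry stays readable until the segment that holds it gets recycled.
        for (Map<K, V> segment : segments) {
            V value = segment.get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }
}
```

Because recycling a segment only means clearing it (or, in the copy mode described below,
resetting a write offset), the eviction cost no longer grows with the number of topics in
the broker.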

Today, there are two ways of configuring the cache: one that copies data into the
cache, and another that just uses reference counting on the original buffers to avoid
payload copies.

### Memory copies into the cache

Each segment is composed of a buffer, an offset, and a hashmap that maps
`(ledgerId, entryId) -> offset`.

The advantage of this approach is that entries are copied into the cache buffer
(in direct memory), and we don't need to keep any long-lived Java objects around.
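
Below is a minimal, hypothetical sketch of what such a segment could look like using
Netty buffers; the class, record, and method names are illustrative and are not taken
from the actual implementation:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of one "copy" segment: a single direct buffer, a write offset,
// and a map from (ledgerId, entryId) to the offset/length of the copied entry.
public class CopySegmentSketch {

    record Key(long ledgerId, long entryId) { }
    record Slot(int offset, int length) { }

    private final ByteBuf data;
    private final AtomicInteger writeOffset = new AtomicInteger();
    private final Map<Key, Slot> index = new ConcurrentHashMap<>();

    public CopySegmentSketch(int segmentSize) {
        // All cached payloads live in this one direct buffer; no per-entry heap objects for the data.
        this.data = PooledByteBufAllocator.DEFAULT.directBuffer(segmentSize, segmentSize);
    }

    /** Returns false when the segment is full and the cache should rotate to the next segment. */
    public boolean insert(long ledgerId, long entryId, ByteBuf entry) {
        int length = entry.readableBytes();
        int offset = writeOffset.getAndAdd(length);
        if (offset + length > data.capacity()) {
            return false; // no room left; the segment stays "full" until it is recycled
        }
        // Copy the payload into the segment's direct buffer.
        data.setBytes(offset, entry, entry.readerIndex(), length);
        index.put(new Key(ledgerId, entryId), new Slot(offset, length));
        return true;
    }

    public ByteBuf get(long ledgerId, long entryId) {
        Slot slot = index.get(new Key(ledgerId, entryId));
        // Hand out a retained slice over the shared segment buffer.
        return slot != null ? data.retainedSlice(slot.offset(), slot.length()) : null;
    }

    public void clear() {
        // Rotation: forget the index and start writing from the beginning again.
        index.clear();
        writeOffset.set(0);
    }
}
```

Since all payload bytes live in a single direct buffer, the per-entry heap footprint is
limited to the small index entries.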

### Keeping reference-counted buffers in the cache

Each segment in the cache will contain a map `(ledgerId, entryId) -> ByteBuf`.
Buffers will have an increased reference count that keeps the data alive as long
as the buffer is in the cache, and that reference will be released when the cache
segment is rotated.

The advantage is that we avoid any memory copy when inserting into or reading from
the cache. The disadvantage is that we will hold references to all the `ByteBuf`
objects that are in the cache.
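
As a rough, hypothetical sketch of this mode (names are illustrative, not the actual
implementation), a segment could retain the original buffers on insert and release them
all when the segment is recycled:

```java
import io.netty.buffer.ByteBuf;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of one "reference-counting" segment: no payload copies,
// just retained references to the original buffers, released when the segment rotates.
public class RefCountedSegmentSketch {

    record Key(long ledgerId, long entryId) { }

    private final Map<Key, ByteBuf> entries = new ConcurrentHashMap<>();
    private final AtomicLong usedBytes = new AtomicLong();
    private final long maxBytes;

    public RefCountedSegmentSketch(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Returns false when the segment is "full" and the cache should rotate. */
    public boolean insert(long ledgerId, long entryId, ByteBuf entry) {
        long size = entry.readableBytes();
        if (usedBytes.addAndGet(size) > maxBytes) {
            usedBytes.addAndGet(-size);
            return false;
        }
        // Bump the refcount so the original buffer stays alive while it sits in the cache.
        entries.put(new Key(ledgerId, entryId), entry.retain());
        return true;
    }

    public ByteBuf get(long ledgerId, long entryId) {
        ByteBuf entry = entries.get(new Key(ledgerId, entryId));
        // Give the caller its own retained reference; the cache keeps its own.
        return entry != null ? entry.retainedDuplicate() : null;
    }

    public void clear() {
        // Rotation: drop the cache's reference on every buffer in the segment.
        entries.values().forEach(ByteBuf::release);
        entries.clear();
        usedBytes.set(0);
    }
}
```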

### API changes

No user-facing API changes are required.

### New configuration options

The existing cache implementation will not be removed at this point. Users will
be able to configure the old implementation in `broker.conf`.

This option will be helpful in case performance regressions are seen for some use
cases with the new cache implementation.
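
For illustration, the fallback could be exposed as a single switch in `broker.conf`;
the property name below is purely hypothetical, since this PIP does not define it:

```properties
# Hypothetical property name, for illustration only:
# select which cache implementation the broker should use.
managedLedgerCacheImplementation=legacy
```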



--
Matteo Merli
<mm...@apache.org>

Re: [VOTE] PIP-174: Provide new implementation for broker dispatch cache

Posted by PengHui Li <co...@gmail.com>.
+1

Penghui Li

Re: [VOTE] PIP-174: Provide new implementation for broker dispatch cache

Posted by Michael Marshall <mm...@apache.org>.
+1 - thanks for a great write up Matteo, it looks like it'll be a
great improvement!

(Sorry for casting a late vote, but I figure it's worth affirming the
PIP even if we've already closed the vote.)

Thanks,
Michael

Re: [VOTE] PIP-174: Provide new implementation for broker dispatch cache

Posted by Matteo Merli <ma...@gmail.com>.
Closing this vote with 3 +1s and no -1s:

+1s:
 * Matteo
 * PengHui
 * Dave

Thanks,
Matteo

--
Matteo Merli
<ma...@gmail.com>

Re: [VOTE] PIP-174: Provide new implementation for broker dispatch cache

Posted by Dave Fisher <wa...@comcast.net>.
Sorry I’m late to this discussion.

I think that the motivation is correct. There is really quite a bit of activity around this issue. Let's make an extra effort to spend extra time with the commits to confirm the performance improvements.

Let’s particularly pay attention to threading.

+1

Regards,
Dave

Sent from my iPhone
