Posted to commits@cassandra.apache.org by "Stefania Alborghetti (Jira)" <ji...@apache.org> on 2020/04/01 16:15:00 UTC

[jira] [Comment Edited] (CASSANDRA-15229) BufferPool Regression

    [ https://issues.apache.org/jira/browse/CASSANDRA-15229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072895#comment-17072895 ] 

Stefania Alborghetti edited comment on CASSANDRA-15229 at 4/1/20, 4:14 PM:
---------------------------------------------------------------------------

We hit this buffer pool regression in our DSE fork a while ago. Because our chunk cache became much larger when it replaced the OS page cache, off-heap memory grew significantly beyond the configured limits. This was partly due to some leaks, but fragmentation in the current design of the buffer pool was a big part of it.

This is how we solved it:

 - a bump-the-pointer slab approach for the transient pool, not too dissimilar from the current implementation (a minimal sketch follows this list). We then exploit our thread-per-core architecture: core threads each get a dedicated slab, while other threads share a global slab.

 - a bitmap-based slab approach for the permanent pool, which is only used by the chunk cache (also sketched below). Each of these slabs issues buffers of a single size only; one bit is flipped in the bitmap for each buffer issued. When multiple buffers are requested, the slab tries to issue consecutive addresses, but this is not guaranteed, since we want to avoid memory fragmentation. We keep global lists of these slabs, sorted by buffer size, where each size is a power of two. Slabs are taken out of these lists when they are full and put back into circulation when they have space available. The lists are global, but core threads get a thread-local stash of buffers, i.e. they request multiple buffers at a time to reduce contention on the global lists.
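
Roughly, the transient slab is a bump-the-pointer arena. The sketch below is only illustrative (these are not our actual class or method names) and assumes buffers are sliced off one large direct allocation and the whole slab is recycled in one step:

{code:java}
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical bump-the-pointer slab: buffers are carved out of a single large
// direct allocation by advancing an offset; the whole slab is reset at once
// when its transient buffers are no longer in use.
final class BumpPointerSlab
{
    private final ByteBuffer slab;                        // backing region, e.g. 1 MiB
    private final AtomicInteger offset = new AtomicInteger(0);

    BumpPointerSlab(int capacity)
    {
        this.slab = ByteBuffer.allocateDirect(capacity);
    }

    // Returns a slice of the requested size, or null when the slab is exhausted
    // and the caller must fall back to another slab.
    ByteBuffer allocate(int size)
    {
        while (true)
        {
            int current = offset.get();
            int next = current + size;
            if (next > slab.capacity())
                return null;
            if (offset.compareAndSet(current, next))
            {
                ByteBuffer slice = slab.duplicate();
                slice.position(current);
                slice.limit(next);
                return slice.slice();
            }
        }
    }

    // Recycles the whole slab in one step.
    void reset()
    {
        offset.set(0);
    }
}
{code}

With thread-per-core, each core thread would own one such slab, so hot-path allocations do not even contend on the CAS; only the slab shared by the remaining threads does.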

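For the permanent pool, a minimal sketch of the bitmap idea follows, assuming 64 buffers per slab so a single long can serve as the bitmap (again, the names and the 64-buffer count are illustrative, not the actual implementation):

{code:java}
import java.nio.ByteBuffer;

// Hypothetical bitmap slab: every buffer it issues has the same power-of-two
// size; bit i of the bitmap marks buffer i as in use.
final class BitmapSlab
{
    private final ByteBuffer region;   // bufferSize * 64 bytes of direct memory
    private final int bufferSize;      // a power of two, e.g. 64 KiB
    private long bitmap = 0L;          // one bit per buffer, 0 = free

    BitmapSlab(int bufferSize)
    {
        this.bufferSize = bufferSize;
        this.region = ByteBuffer.allocateDirect(bufferSize * 64);
    }

    // Issues one buffer, or null if the slab is full; a full slab is what gets
    // taken out of the global per-size list.
    synchronized ByteBuffer allocate()
    {
        int index = Long.numberOfTrailingZeros(~bitmap);   // first free bit
        if (index == 64)
            return null;
        bitmap |= 1L << index;
        int start = index * bufferSize;
        ByteBuffer slice = region.duplicate();
        slice.position(start);
        slice.limit(start + bufferSize);
        return slice.slice();
    }

    // Frees buffer i; a previously full slab goes back into circulation.
    synchronized void free(int index)
    {
        bitmap &= ~(1L << index);
    }

    synchronized boolean isFull()
    {
        return bitmap == -1L;
    }
}
{code}

The global structure is then just one list of such slabs per power-of-two buffer size, with core threads grabbing several buffers at once into a thread-local stash to keep contention on those lists low.
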
We changed the chunk cache to always store buffers of the same size. If we need to read chunks of a different size, we store an array of buffers in the cache entry and request multiple buffers at the same time. If we get consecutive addresses, we optimize for this case by building a single byte buffer over the first address. We also optimized the chunk cache to store memory addresses rather than byte buffers, which significantly reduced heap usage; the byte buffers are materialized on the fly (sketched below).
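
To make the address-based entries concrete, here is a rough sketch of the shape of a cache entry; ChunkEntry, Materializer and attach() are made-up names standing in for whatever address-to-buffer mechanism the pool exposes (in practice an Unsafe-based utility, not a public JDK API):

{code:java}
import java.nio.ByteBuffer;

// Illustrative cache entry that keeps only the raw addresses of the pool
// buffers backing one chunk; ByteBuffers are materialized on demand.
final class ChunkEntry
{
    interface Materializer
    {
        // Placeholder for the pool's address-to-ByteBuffer mechanism.
        ByteBuffer attach(long address, int length);
    }

    private final long[] addresses;   // one fixed-size pool buffer per element
    private final int bufferSize;

    ChunkEntry(long[] addresses, int bufferSize)
    {
        this.addresses = addresses;
        this.bufferSize = bufferSize;
    }

    // True when the pool happened to hand out consecutive addresses.
    boolean isContiguous()
    {
        for (int i = 1; i < addresses.length; i++)
            if (addresses[i] != addresses[i - 1] + bufferSize)
                return false;
        return true;
    }

    // A contiguous chunk becomes a single buffer over the first address;
    // otherwise one buffer is materialized per stored address.
    ByteBuffer[] materialize(Materializer pool)
    {
        if (isContiguous())
            return new ByteBuffer[]{ pool.attach(addresses[0], addresses.length * bufferSize) };
        ByteBuffer[] buffers = new ByteBuffer[addresses.length];
        for (int i = 0; i < addresses.length; i++)
            buffers[i] = pool.attach(addresses[i], bufferSize);
        return buffers;
    }
}
{code}

Storing long addresses instead of ByteBuffer objects is what cut the on-heap footprint of the cache.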

For the permanent case, we chose to constrain the size of the buffers in the cache so that memory in the pool could be fully used. This may or may not be what people prefer; our choice was driven by the large size of the cache, 20+ GB. An approach that allows some memory fragmentation may be sufficient for smaller cache sizes.

Please let me know if there is interest in porting this solution to 4.0 or 4.x. I can share the code if needed.







> BufferPool Regression
> ---------------------
>
>                 Key: CASSANDRA-15229
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15229
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Caching
>            Reporter: Benedict Elliott Smith
>            Assignee: ZhaoYang
>            Priority: Normal
>             Fix For: 4.0, 4.0-beta
>
>
> The BufferPool was never intended to be used for a {{ChunkCache}}, and we need to either change our behaviour to handle uncorrelated lifetimes or use something else.  This is particularly important with the default chunk size for compressed sstables being reduced.  If we address the problem, we should also utilise the BufferPool for native transport connections like we do for internode messaging, and reduce the number of pooling solutions we employ.
> Probably the best thing to do is to improve BufferPool’s behaviour when used for things with uncorrelated lifetimes, which essentially boils down to tracking those chunks that have not been freed and re-circulating them when we run out of completely free blocks.  We should probably also permit instantiating separate {{BufferPool}}, so that we can insulate internode messaging from the {{ChunkCache}}, or at least have separate memory bounds for each, and only share fully-freed chunks.
> With these improvements we can also safely increase the {{BufferPool}} chunk size to 128KiB or 256KiB, to guarantee we can fit compressed pages and reduce the amount of global coordination and per-allocation overhead.  We don’t need 1KiB granularity for allocations, nor 16 byte granularity for tiny allocations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org