Posted to issues@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2021/12/15 17:30:42 UTC

[GitHub] [bookkeeper] mauricebarnum opened a new issue #2943: BP-44: Support for writing ledger entries with O_DIRECT

mauricebarnum opened a new issue #2943:
URL: https://github.com/apache/bookkeeper/issues/2943


   **BP**
   
   > Follow the instructions at http://bookkeeper.apache.org/community/bookkeeper_proposals/ to create a proposal.
   
   This is the master ticket for tracking BP-44 :
   
   Add support for writing ledger entries while bypassing the file system's buffering.  On supported systems (currently Linux and macOS), files are opened with O_DIRECT and page-aligned buffers are maintained by the BookKeeper application.  Access to the operating system's I/O API is via a thin JNI binding.
   
   <!-- add a proposal PR link below -->
   Proposal PR - https://github.com/apache/bookkeeper/pull/2932


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [bookkeeper] hangc0276 commented on issue #2943: BP-47: Support for writing ledger entries with O_DIRECT

Posted by GitBox <gi...@apache.org>.
hangc0276 commented on issue #2943:
URL: https://github.com/apache/bookkeeper/issues/2943#issuecomment-1086446902


   @eolivelli @dlg99 @mauricebarnum  I have added more details about this proposal. Please take a look, thanks a lot.





[GitHub] [bookkeeper] hangc0276 commented on issue #2943: BP-47: Support for writing ledger entries with O_DIRECT

Posted by GitBox <gi...@apache.org>.
hangc0276 commented on issue #2943:
URL: https://github.com/apache/bookkeeper/issues/2943#issuecomment-1086446251


   ### Motivation
   #### Ledger read/write logic
   When the BookKeeper server receives a write entry request, it writes the entry into the memory table, which is a bookie-level cache. Once the memory table is full, the bookie sorts its contents and flushes them into the operating system's PageCache, which buffers the data a second time. The data reaches disk only when a PageCache flush is triggered.
   
   When the BookKeeper server receives a read entry request, it checks whether the memory table or the read cache contains the target entry. If both caches miss, it queries RocksDB for the entry's location in the entry log file and then reads the target entry from that file. After reading the entry, the read cache pre-reads more entries from the entry log file so that subsequent reads keep a high cache hit rate. From the operating system's perspective, a read from a specific position in a log file first checks whether the target data is already in PageCache. On a PageCache hit, the data is returned directly from PageCache. Otherwise, the OS reads the block from the log file and caches it in PageCache. The block may contain multiple entries located near the target read position.
   
   
   #### Drawbacks
   For ledger writes, this design limits write throughput for the following reasons.
   1. The memtable and the OS PageCache double-buffer entry data, which consumes more memory.
   2. The PageCache flush mechanism is controlled by the kernel and is hard to tune from the application, even though flush behavior is very important for I/O-intensive applications.
   3. The number of kernel sync threads is limited by the number of disks, which works poorly for a RAID array composed of multiple disks.
   
   For ledger reads, this design also limits read throughput for the following reasons.
   1. When reading entry data from a log file, the OS prefetches data and stores it in PageCache. When many Pulsar topics fetch historical cold data, reads hit many log files at the same time and a large amount of data is prefetched into PageCache. Because an entry log file interleaves many ledgers (entries are sorted and written into the same file), the prefetched data may not belong to the target ledger, which wastes memory and reduces the PageCache hit rate.
   2. Once the OS has prefetched a lot of data into PageCache, eviction also becomes a problem. The default PageCache eviction policy is LRU, and the application cannot change it short of recompiling the kernel. We cannot control which entries are evicted or when they are evicted.
   
   
   ### Proposal
   Based on the above issues, we introduce optional support for bypassing the operating system PageCache on supported systems (currently Linux and macOS) by using the open(2) (https://man7.org/linux/man-pages/man2/open.2.html) flag O_DIRECT. fallocate(2) (https://man7.org/linux/man-pages/man2/fallocate.2.html) will be used, if available, to request that the filesystem allocate the required space before data is written.
   
   The implementation uses JNI to do direct I/O to files via POSIX syscalls. fallocate is used when running on Linux; otherwise it is skipped (at the cost of more filesystem operations during writing).
   
   There are two calls to write, writeAt and writeDelimited. I expect writeAt to be used for the entry log headers, while entries will go through writeDelimited. In both cases, the calls may return before the syscalls occur. #flush() needs to be called to ensure the data is actually written.
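   
   The "returns before the syscall, durable only after flush" behavior can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class name `DeferredWriterSketch`, the method signatures, and the 4-byte length prefix are hypothetical, and the sketch uses an ordinary buffered `FileChannel` rather than O_DIRECT through JNI. It is not the proposed implementation, and it omits writeAt.
   
   ```java
   import java.io.IOException;
   import java.nio.ByteBuffer;
   import java.nio.channels.FileChannel;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.nio.file.StandardOpenOption;
   
   /** Hypothetical sketch: entries are staged in memory and reach the file only on flush(). */
   public final class DeferredWriterSketch implements AutoCloseable {
       private final FileChannel channel;
       private final ByteBuffer pending;   // staged bytes not yet handed to the OS
       private long flushedBytes = 0;      // bytes already written to the file
   
       public DeferredWriterSketch(Path path, int bufferSize) throws IOException {
           this.channel = FileChannel.open(path, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
           this.pending = ByteBuffer.allocate(bufferSize);
       }
   
       /** Stages a 4-byte length plus the payload; returns the file offset the entry will occupy. */
       public long writeDelimited(byte[] entry) throws IOException {
           int needed = Integer.BYTES + entry.length;
           if (needed > pending.capacity()) {
               throw new IllegalArgumentException("entry larger than buffer");
           }
           if (needed > pending.remaining()) {
               flush(); // make room; this is the only implicit write
           }
           long offset = flushedBytes + pending.position();
           pending.putInt(entry.length).put(entry);
           return offset; // no write syscall has happened for this entry yet
       }
   
       /** Writes all staged bytes to the file and forces them to the storage device. */
       public void flush() throws IOException {
           pending.flip();
           while (pending.hasRemaining()) {
               flushedBytes += channel.write(pending, flushedBytes);
           }
           pending.clear();
           channel.force(false);
       }
   
       @Override
       public void close() throws IOException {
           flush();
           channel.close();
       }
   
       public static void main(String[] args) throws IOException {
           Path path = Files.createTempFile("entrylog-sketch", ".log");
           try (DeferredWriterSketch writer = new DeferredWriterSketch(path, 4096)) {
               writer.writeDelimited(new byte[] {1, 2, 3, 4, 5});
               System.out.println("before flush: " + Files.size(path)); // 0
               writer.flush();
               System.out.println("after flush: " + Files.size(path));  // 9
           } finally {
               Files.deleteIfExists(path);
           }
       }
   }
   ```
   
   The point of the sketch is only the contract: a caller that needs durability must call flush() and cannot rely on writeDelimited returning.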
   
   The entry log format isn't much changed from what is used by the existing entrylogger. The biggest difference is the padding. Direct I/O must be written in aligned blocks. The size of the alignment varies by machine configuration, but 4K is a safe bet on most. As it is unlikely that entry data will land exactly on the alignment boundary, we need to add padding to writes. The existing entry logger has been changed to take this padding into account. When read as a signed int/long/byte the padding will always parse to a negative value, which distinguishes it from valid entry data (the entry size will always be positive) and also from preallocated space (which is always 0).
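   
   As a rough illustration of the padding arithmetic described above, the following sketch rounds a write position up to the next alignment boundary and fills the gap with a padding byte. The class and method names are hypothetical, and the padding byte `0xF0` is an assumption chosen only because its high bit is set, so any run of it parses as a negative signed value. The 4K alignment is the "safe bet" mentioned above, not a detected value.
   
   ```java
   import java.nio.ByteBuffer;
   
   /** Hypothetical sketch of alignment padding; not the proposed implementation. */
   public final class AlignmentSketch {
       static final int ALIGNMENT = 4096;            // assumed block size
       static final byte PADDING_BYTE = (byte) 0xF0; // assumed value; high bit set
   
       /** Smallest multiple of alignment that is >= pos. */
       static long nextAlignment(long pos, int alignment) {
           return (pos + alignment - 1) / alignment * alignment;
       }
   
       /** Number of padding bytes needed to reach the next boundary (0 if already aligned). */
       static int paddingFor(long pos, int alignment) {
           return (int) (nextAlignment(pos, alignment) - pos);
       }
   
       /** Fills the buffer with padding bytes from its current position up to the boundary. */
       static void pad(ByteBuffer buf, int alignment) {
           int n = paddingFor(buf.position(), alignment);
           for (int i = 0; i < n; i++) {
               buf.put(PADDING_BYTE);
           }
       }
   
       public static void main(String[] args) {
           ByteBuffer buf = ByteBuffer.allocate(2 * ALIGNMENT);
           buf.putInt(5).put(new byte[] {1, 2, 3, 4, 5}); // 4-byte size + 5-byte entry = 9 bytes
           pad(buf, ALIGNMENT);
           System.out.println(buf.position());    // 4096
           System.out.println(buf.getInt(9) < 0); // true: padding reads as a negative size
       }
   }
   ```
   
   A reader scanning the log can therefore tell the three cases apart by the sign of the size field: positive for an entry, negative for padding, zero for preallocated space.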
   
   Another difference in the format is that the header is now 4K rather than 1K. Again, this is to allow aligned writes. No changes are necessary to allow the existing entry logger to deal with the header change, as we create a dummy entry in the extra header space that the existing entry logger already knows to ignore.
   
   We have designed a writeBuffer pool to hold the write entries and flush them to disk when the buffer is full. For entry reading, each entry log file has a reader to deal with reading. The reader is managed by a cache backed with an eviction policy. Each read has a specific size read buffer to hold read data.
   
   To enable this, set dbStorage_directIOEntryLogger=true in the configuration.
   
   ### Changes
   1. Add bookkeeper-slogger module to provide support for structured logging with a pluggable logging backend. Provide an implementation using SLF4J.
   2. Add native-io package to provide JNI bindings to operating system I/O api.
   3. Introduce an entry logger interface to support multiple entry logger implementations. Currently there are two: a PageCache based implementation and a direct-io based implementation.
   4. Add a direct-io based implementation, DirectEntryLogger, which is enabled by the flag `dbStorage_directIOEntryLogger`.
   5. Refactor garbage collection and compaction to allow the entry logger to control which files are available to be garbage collected.
   
   ### Implementation
   For parts 1, 2, 3 and 5, we will push individual PRs. For part 4, we plan to split the work into two PRs, one for writing and another for reading.
   
   ### Compatibility, Deprecation, and Migration Plan
   We just modified the read and write logic of the entry log file, and didn't modify the organization of it.
   
   So no compatibility concerns at this moment.
   
   
   ### Test Plan
   We will add tests for the following modules.
   1. BookKeeper-slogger 
   2. Native-io 
   3. Direct-io based implementation DirectEntryLogger
   4. Garbage collection and compaction based on DirectEntryLogger
   
   ### Others
   I’m doing performance testing for the direct-io based implementation.
   
   





[GitHub] [bookkeeper] Vanlightly commented on issue #2943: BP-44: Support for writing ledger entries with O_DIRECT

Posted by GitBox <gi...@apache.org>.
Vanlightly commented on issue #2943:
URL: https://github.com/apache/bookkeeper/issues/2943#issuecomment-995024701


   This will need to be BP-47. BP-44 - 46 are not yet merged but do exist.





[GitHub] [bookkeeper] hangc0276 commented on issue #2943: BP-47: Support for writing ledger entries with O_DIRECT

Posted by GitBox <gi...@apache.org>.
hangc0276 commented on issue #2943:
URL: https://github.com/apache/bookkeeper/issues/2943#issuecomment-1032143505


   @mauricebarnum Would you please give more details about the motivation and design for this proposal? 

