Posted to issues@kudu.apache.org by "Yingchun Lai (Jira)" <ji...@apache.org> on 2022/07/06 16:00:00 UTC

[jira] [Comment Edited] (KUDU-3371) Use RocksDB to store LBM metadata

    [ https://issues.apache.org/jira/browse/KUDU-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17559031#comment-17559031 ] 

Yingchun Lai edited comment on KUDU-3371 at 7/6/22 3:59 PM:
------------------------------------------------------------

I have submitted a merge request on Gerrit [1], but it is too large to be reviewer-friendly, so I will split it into several smaller merge requests:
 # Refactor LogBlockManager into a base class and add LogfBlockManager, which extends it. LogfBlockManager is the Log Block Manager that stores containers' metadata in append-only files, as is done today.
 # Refactor LogBlockContainer into a base class and add LogfBlockContainer, which extends it. LogfBlockContainer is the Log Block Container that stores its metadata in an append-only file, as is done today.
 # Introduce RocksDB as a third-party library.
 # Add LogrBlockContainer, which stores container metadata in RocksDB, and LogrBlockManager to manage LogrBlockContainer instances. Add related unit tests.
 # Refactor to support batch operations on blocks.
 # Use the existing benchmarks to show the effect.
 # Add some metrics. (TODO, not included in [1])
 # Add more kudu tools to operate on the RocksDB metadata. (TODO, not included in [1])
 # Further tune the RocksDB options. (TODO, not included in [1])

 

1. https://gerrit.cloudera.org/c/18569/



> Use RocksDB to store LBM metadata
> ---------------------------------
>
>                 Key: KUDU-3371
>                 URL: https://issues.apache.org/jira/browse/KUDU-3371
>             Project: Kudu
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Yingchun Lai
>            Priority: Major
>
> h1. Motivation
> The current LBM container uses separate .data and .metadata files. The .data file stores the actual user data, and hole punching can be used to reclaim its disk space. The .metadata file is written in append-only mode as a sequence of serialized protobuf messages. Each message is a BlockRecordPB:
>  
> {code:java}
> message BlockRecordPB {
>   required BlockIdPB block_id = 1;  // int64
>   required BlockRecordType op_type = 2;  // CREATE or DELETE
>   required uint64 timestamp_us = 3;
>   optional int64 offset = 4; // Required for CREATE.
>   optional int64 length = 5; // Required for CREATE.
> } {code}
> That means each record is of type CREATE or DELETE. A deleted block is therefore represented by two records in the metadata: one CREATE and one DELETE.
> The current LBM metadata storage mechanism has several weak points:
> h2. 1. Disk space amplification
> The ratio of live blocks in the metadata may be very low. In the worst case only 1 block is alive (assuming the runtime compaction threshold has not been reached) and all the other thousands of blocks are dead (i.e. each has a CREATE-DELETE pair).
> So the disk space amplification can be very serious.
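
As a rough illustration of the worst case described above, the amplification can be estimated as the number of records kept on disk divided by the number of records actually needed (one CREATE per live block). The function name and the counts below are hypothetical and chosen only for the arithmetic; they are not measurements from Kudu.

```python
def space_amplification(total_records: int, live_blocks: int) -> float:
    """Records stored on disk per record that is still needed."""
    if live_blocks <= 0:
        raise ValueError("need at least one live block to compute a ratio")
    return total_records / live_blocks


# Hypothetical container: 1000 blocks were created and 999 of them deleted.
created = 1000
deleted = 999

# Every block has one CREATE record; every dead block adds one DELETE record.
total_records = created + deleted
live_blocks = created - deleted

amplification = space_amplification(total_records, live_blocks)
print(f"{total_records} records on disk for {live_blocks} live block(s): "
      f"{amplification:.0f}x")
```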
> h2. 2. Long bootstrap time
> During the Kudu server bootstrap stage, all the metadata files have to be replayed to find the live blocks. In the worst case, we may replay thousands of block records only to find that very few blocks are alive.
> This wastes a lot of time in almost all cases: Kudu clusters in production typically run for several months without a bootstrap, so the LBM metadata may become very sparse (mostly dead records).
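
The replay cost can be sketched as follows. This is a minimal model, not Kudu's implementation: the `BlockRecord` class mirrors the fields of BlockRecordPB from the description, while `OpType`, `replay`, and the sample log are hypothetical names and data introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, Optional


class OpType(Enum):
    CREATE = 1
    DELETE = 2


@dataclass(frozen=True)
class BlockRecord:
    """Simplified stand-in for BlockRecordPB."""
    block_id: int
    op_type: OpType
    timestamp_us: int
    offset: Optional[int] = None   # set for CREATE records only
    length: Optional[int] = None   # set for CREATE records only


def replay(records: Iterable[BlockRecord]) -> Dict[int, BlockRecord]:
    """Scan the append-only log once and return the blocks still alive."""
    live: Dict[int, BlockRecord] = {}
    for rec in records:
        if rec.op_type is OpType.CREATE:
            live[rec.block_id] = rec
        else:
            # A DELETE cancels the earlier CREATE for the same block.
            live.pop(rec.block_id, None)
    return live


# Hypothetical log: 1000 blocks created, all but block 42 deleted afterwards.
log = []
for i in range(1000):
    log.append(BlockRecord(i, OpType.CREATE, 2 * i, offset=i * 4096, length=4096))
    if i != 42:
        log.append(BlockRecord(i, OpType.DELETE, 2 * i + 1))

live = replay(log)
print(f"replayed {len(log)} records, found {len(live)} live block(s)")
```

Every record has to be read even though almost all of them cancel out, which is why the replay time grows with the history of the container rather than with the number of live blocks.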
> h2. 3. Metadata compaction
> To mitigate the issues above, the LBM has a metadata compaction mechanism that runs both at runtime and during bootstrap.
> Runtime compaction locks the container and is synchronous.
> Bootstrap compaction is synchronous too, and may make the bootstrap take longer.
> h1. Optimization by using RocksDB
> h2. Storage design
>  * RocksDB instance: one RocksDB instance per data directory.
>  * Key: <container_id>.<block_id>
>  * Value: the same as before, i.e. the serialized protobuf string, stored only for CREATE entries.
>  * Put/Delete: put the value into RocksDB when a block is created, and delete it from RocksDB when the block is deleted.
>  * Scan: happens only during the bootstrap stage, to retrieve all blocks.
>  * DeleteRange: happens only when a container is invalidated.
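
The storage design above can be sketched with a small in-memory ordered map. This is only a stand-in to show how the key layout supports the four operations; `OrderedKV`, `block_key`, and `container_range` are hypothetical helpers, not the RocksDB API, and the container ids and values are made up.

```python
import bisect
from typing import Dict, Iterator, List, Tuple


class OrderedKV:
    """In-memory stand-in for an ordered key-value store (NOT RocksDB)."""

    def __init__(self) -> None:
        self._keys: List[bytes] = []        # kept sorted
        self._vals: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def delete(self, key: bytes) -> None:
        if key in self._vals:
            del self._vals[key]
            del self._keys[bisect.bisect_left(self._keys, key)]

    def delete_range(self, begin: bytes, end: bytes) -> None:
        """Remove every key in the half-open range [begin, end)."""
        lo = bisect.bisect_left(self._keys, begin)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            del self._vals[key]
        del self._keys[lo:hi]

    def scan(self) -> Iterator[Tuple[bytes, bytes]]:
        for key in self._keys:
            yield key, self._vals[key]


def block_key(container_id: str, block_id: int) -> bytes:
    """Key layout from the design: <container_id>.<block_id>."""
    return f"{container_id}.{block_id}".encode()


def container_range(container_id: str) -> Tuple[bytes, bytes]:
    """Half-open key range covering every block of one container."""
    prefix = f"{container_id}.".encode()
    # '/' is the byte right after '.', so [prefix, end) holds exactly
    # the keys that start with "<container_id>.".
    return prefix, prefix[:-1] + b"/"


db = OrderedKV()
db.put(block_key("c1", 1), b"create-record-1")    # block created
db.put(block_key("c1", 2), b"create-record-2")
db.put(block_key("c10", 7), b"create-record-7")
db.delete(block_key("c1", 1))                     # block deleted
live_after_delete = [k for k, _ in db.scan()]     # bootstrap-style scan

db.delete_range(*container_range("c1"))           # container invalidated
live_after_invalidate = [k for k, _ in db.scan()]
```

Because the container id is the key prefix, invalidating a container is a single range delete that leaves other containers (here `c10`) untouched, and a scan only ever sees live blocks.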
> h2. Advantages
>  # Disk space amplification: the problem still exists, but RocksDB can be tuned to reach a balance. I believe that in most cases RocksDB does better than the append-only file.
>  # Bootstrap time: since only valid blocks are left in RocksDB, bootstrap may be much faster than before.
>  # Metadata compaction: this work can be left to RocksDB, though tuning is needed.
> h2. Test & benchmark
> I have recently been working on using RocksDB to store LBM container metadata. Most of the work is finished, and I ran some benchmarks. They show that block read/write/delete performance in the fs module is similar to or slightly worse than the old implementation, while the bootstrap time may be reduced several-fold.
> I am not sure whether it is worth continuing this work. Does anybody know whether this topic has been discussed before?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)