You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by "Guozhang Wang (Jira)" <ji...@apache.org> on 2021/09/22 16:40:00 UTC

[jira] [Comment Edited] (KAFKA-13239) Use RocksDB.ingestExternalFile for restoration

    [ https://issues.apache.org/jira/browse/KAFKA-13239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17418705#comment-17418705 ] 

Guozhang Wang edited comment on KAFKA-13239 at 9/22/21, 4:39 PM:
-----------------------------------------------------------------

{quote} What does this mean/how do memtables factor into this?{quote}

This is for the {{allowBlockingFlush}} call specifically. The {{ingestExternalFile}} function can be called in the middle of reads/writes where the mem-table was not empty, and when ingesting the sst files, if there's overlap with the mem-tables blocking calls to flush the memtable may be triggered (? I am not 100% sure why that's the case, worth digging deeper); and if that is not allowed via the setter the ingestion call would fail. For our case, the mem-tables "should" be empty during restoration, but no one knows what rocksDB would be doing upon opening really, so just to be on the safer side while I was browsing through the interface I think it may be better to just set it.

{quote} Either way, I think I'm not convinced that it will be so simple to avoid issues due to compaction/write stalls. {quote}

Yeah I agree. My understanding is that {{RocksDB.ingestExternalFile}} itself is not blocking on compactions, and hence background compactions may still be pretty hot behind the scene even when it returns. I also feel that it's better to first do the restoration thread first (https://issues.apache.org/jira/browse/KAFKA-10199), and then consider this one.


was (Author: guozhang):
{quote} What does this mean/how do memtables factor into this?{quote}

This is for the {{allowBlockingFlush}} call specifically. The {{ingestExternalFile}} function can be called in the middle of reads/writes where the mem-table was not empty, and when ingesting the sst files, if there's overlap with the mem-tables blocking calls to flush the memtable may be triggered (? I am not 100% sure why that's the case, worth digging deeper); and if that is not allowed via the setter the ingestion call would fail. For our case, the mem-tables "should" be empty during restoration, but no one knows what rocksDB would be doing upon opening really, so just to be on the safer side while I was browsing through the interface I think it may be better to just set it.

{quote} Either way, I think I'm not convinced that it will be so simple to avoid issues due to compaction/write stalls. {quote}

Yeah I agree. My understanding is that {{RocksDB.ingestExternalFile}} itself is not blocking on compactions, and hence background compactions may still be pretty hot behind the scene even when it returns. I also feel that it's better to first do the restoration thread first, and then consider this one.

> Use RocksDB.ingestExternalFile for restoration
> ----------------------------------------------
>
>                 Key: KAFKA-13239
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13239
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Guozhang Wang
>            Priority: Major
>
> Now that we are in newer version of RocksDB, we can consider using the new
> {code}
> ingestExternalFile(final ColumnFamilyHandle columnFamilyHandle,
>       final List<String> filePathList,
>       final IngestExternalFileOptions ingestExternalFileOptions)
> {code}
> for restoring changelog into state stores. More specifically:
> 1) Use larger default batch size in restore consumer polling behavior so that each poll would return more records as possible.
> 2) For a single batch of records returned from a restore consumer poll call, first write them as a single SST File using the {{SstFileWriter}}. The existing {{DBOptions}} could be used to construct the {{EnvOptions} and {{Options}} for the writter.
> Do not yet ingest the written file to the db yet within each iteration
> 3) At the end of the restoration, call {{RocksDB.ingestExternalFile}} given all the written files' path as the parameter. The {{IngestExternalFileOptions}} would be specifically configured to allow key range overlapping with mem-table.
> 4) A specific note is that after the call in 3), heavy compaction may be executed by RocksDB in the background and before it cools down, starting normal processing immediately which would try to {{put}} new records into the store may see high stalls. To work around it we would consider using {{RocksDB.compactRange()}} which would block until the compaction is completed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)