You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by "John Roesler (Jira)" <ji...@apache.org> on 2020/01/18 04:36:00 UTC
[jira] [Commented] (KAFKA-9450) Decouple inner state flushing from committing with EOS

    [ https://issues.apache.org/jira/browse/KAFKA-9450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018501#comment-17018501 ] 

John Roesler commented on KAFKA-9450:
-------------------------------------

This reminds me of an idea I proposed a while back, and haven't been able to let go of. I think it's buried in a Jira ticket somewhere.

One practical complexity with flushing is to guarantee that persistent stores actually do a fs sync call before we write the checkpoint file, which in turn is to guarantee that in a crash-recovery scenario, the offset in the recovered checkpoint file is always equal to or before the state of the recovered store.

The same goal could be accomplished without any filesystem intricacies if we store the offset in the same store as the data. Think: either a reserved key, or a separate column family. This would allow the underlying store to ensure the order of data updates with respect to changelog offset updates on its own (using its internal translog or whatever).

Anyway, I bring this up right now because offhand, I can't think of any reason we'd actually need to flush bytes stores at all if we did things that way.

> Decouple inner state flushing from committing with EOS
> ------------------------------------------------------
>
>                 Key: KAFKA-9450
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9450
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Sophie Blee-Goldman
>            Priority: Major
>
> When EOS is turned on, the commit interval is set quite low (100ms) and all the store layers are flushed during a commit. This is necessary for forwarding records in the cache to the changelog, but unfortunately also forces rocksdb to flush the current memtable before it's full. The result is a large number of small writes to disk, losing the benefits of batching, and a large number of very small L0 files that are likely to slow compaction.
> Since we have to delete the stores to recreate from scratch anyways during an unclean shutdown with EOS, we may as well skip flushing the innermost StateStore during a commit and only do so during a graceful shutdown, before a rebalance, etc. This is currently blocked on a refactoring of the state store layers to allow decoupling the flush of the caching layer from the actual state store.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)