Posted to dev@kafka.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/01/31 17:34:51 UTC

[jira] (KAFKA-4317) RocksDB checkpoint files lost on kill -9

    [ https://issues.apache.org/jira/browse/KAFKA-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847179#comment-15847179 ] 

ASF GitHub Bot commented on KAFKA-4317:
---------------------------------------

GitHub user dguy opened a pull request:

    https://github.com/apache/kafka/pull/2471

    KAFKA-4317: Checkpoint State Stores on commit/flush

    Currently the checkpoint file is deleted at state store initialization and is only written again during a clean shutdown. This can result in significant delays during restarts, as the entire store has to be reloaded from the changelog.
    We can mitigate this by checkpointing the offsets frequently. The checkpointing happens only during the commit phase, i.e., after we have manually flushed the store and the producer, so we guarantee that the checkpointed offsets are never greater than what has been flushed.
    In the event of a hard failure we can recover by reading the checkpoints and consuming from the stored offsets.
    The checkpoint interval can be controlled by the config `statestore.checkpoint.interval.ms`; setting it to a value <= 0 effectively turns checkpointing off. The interval is only a guide, in that the minimum checkpoint time is always the commit interval (as we need to checkpoint at commit to guarantee consistency).
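The gist of the write path is an atomic rewrite of the checkpoint file once the store and producer have been flushed. Below is a minimal sketch of that idea in plain Java; the class name `OffsetCheckpointWriter` and the one-pair-per-line file format are illustrative assumptions, not the PR's actual implementation.

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.util.Map;

    // Hypothetical helper: persists changelog offsets by writing a temp file and
    // renaming it over the previous checkpoint, so a kill -9 leaves either the
    // old or the new checkpoint intact. Illustrative only.
    public class OffsetCheckpointWriter {

        private final Path checkpointFile;

        public OffsetCheckpointWriter(final Path checkpointFile) {
            this.checkpointFile = checkpointFile;
        }

        // Called during the commit phase, after the store and producer have been
        // flushed, so the recorded offsets never run ahead of the flushed data.
        public void write(final Map<String, Long> offsetsByPartition) throws IOException {
            final Path tmp = checkpointFile.resolveSibling(checkpointFile.getFileName() + ".tmp");
            try (BufferedWriter out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
                for (final Map.Entry<String, Long> entry : offsetsByPartition.entrySet()) {
                    // one "topic-partition offset" pair per line
                    out.write(entry.getKey() + " " + entry.getValue());
                    out.newLine();
                }
            }
            try {
                // prefer an atomic rename where the file system supports it
                Files.move(tmp, checkpointFile, StandardCopyOption.ATOMIC_MOVE);
            } catch (IOException e) {
                // fall back to a plain replace on platforms without atomic move
                Files.move(tmp, checkpointFile, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

The interval gating described above would then simply skip this write whenever less than `statestore.checkpoint.interval.ms` has elapsed since the last checkpoint, while the write itself only ever happens as part of a commit.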

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dguy/kafka kafka-4317

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/2471.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2471
    
----
commit 6743dc63293e2d0fca57dcb7d1a0ace5237837b0
Author: Damian Guy <da...@gmail.com>
Date:   2017-01-31T13:37:00Z

    checkpoint statestores

----


> RocksDB checkpoint files lost on kill -9
> ----------------------------------------
>
>                 Key: KAFKA-4317
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4317
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>    Affects Versions: 0.10.0.1
>            Reporter: Greg Fodor
>            Assignee: Damian Guy
>            Priority: Critical
>              Labels: architecture, user-experience
>
> Right now, the checkpoint files for logged RocksDB stores are written during a graceful shutdown and removed upon restoration. Unfortunately, this means that when the process is forcibly killed, the checkpoint files are not there, so all RocksDB stores are rematerialized from scratch on the next launch.
> In a way this is good, because it simulates bootstrapping a new node (for example, it's a good way to see how much I/O is used to rematerialize the stores); however, it leads to longer recovery times when a non-graceful shutdown occurs and we want to get the job up and running again.
> There seem to be two possible things to consider:
> - Simply do not remove checkpoint files on restoration. This way, a kill -9 will result only in repeating the restoration of the data generated in the source topics since the last graceful shutdown.
> - Continually update the checkpoint files (perhaps on commit) -- this would result in the least amount of restart overhead/latency, but the additional complexity may not be worth it (see the sketch below).
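For the second option (and for what the pull request above does), restart-time recovery amounts to reading whatever checkpoint survived and seeking the changelog consumer to it, falling back to a full rebuild when no checkpoint exists. A minimal sketch follows, assuming the hypothetical file format from the writer sketch above and the standard consumer API (`assign`, `seek`, `seekToBeginning`); the class `CheckpointedRestorer` is an illustrative name, not code from the PR.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.kafka.clients.consumer.Consumer;
    import org.apache.kafka.common.TopicPartition;

    // Hypothetical recovery step: if a checkpoint survives a kill -9, seek the
    // restore consumer to the checkpointed offset instead of replaying the
    // changelog from the beginning. Illustrative only.
    public class CheckpointedRestorer {

        public static void seekForRestore(final Consumer<byte[], byte[]> restoreConsumer,
                                          final TopicPartition changelogPartition,
                                          final Path checkpointFile) throws IOException {
            final Map<String, Long> checkpointed = new HashMap<>();
            if (Files.exists(checkpointFile)) {
                for (final String line : Files.readAllLines(checkpointFile, StandardCharsets.UTF_8)) {
                    // each line holds a "topic-partition offset" pair
                    final String[] parts = line.trim().split(" ");
                    if (parts.length == 2) {
                        checkpointed.put(parts[0], Long.parseLong(parts[1]));
                    }
                }
            }
            restoreConsumer.assign(Collections.singletonList(changelogPartition));
            final Long offset = checkpointed.get(changelogPartition.toString());
            if (offset != null) {
                // resume restoration from the checkpointed position
                restoreConsumer.seek(changelogPartition, offset);
            } else {
                // no checkpoint: fall back to a full rebuild from the start of the changelog
                restoreConsumer.seekToBeginning(Collections.singletonList(changelogPartition));
            }
        }
    }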



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)