Posted to jira@kafka.apache.org by "José Armando García Sancio (Jira)" <ji...@apache.org> on 2023/02/09 23:51:00 UTC
[jira] [Updated] (KAFKA-14703) Don't resign when failing to replay uncommitted records

     [ https://issues.apache.org/jira/browse/KAFKA-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

José Armando García Sancio updated KAFKA-14703:
-----------------------------------------------
    Description: 
h1. Problem

The KRaft controller replays both committed and uncommitted records. Committed records are replayed by the inactive controller. Uncommitted records are replayed by the active controller.

When handling an RPC, the active controller generates a response and a list of uncommitted records. The active controller replays the uncommitted records before sending them to the KRaft layer for durability and replication. If the active controller encounters an error when replaying the uncommitted records, it calls the process exit fault handler.

Indirectly, the process exit fault handler resigns its KRaft leadership and closes all of the client connections.

Most clients retry the RPC when they disconnect from the remote endpoint. If the RPC's replay error is deterministic, then it is possible for the failure to propagate to all of the controllers as they become leaders. This handling may cause the controllers to become unavailable.
h1. Solution

We can prevent this failure from propagating to all of the controllers by changing how we handle errors when replaying uncommitted records. The active controller doesn't need to exit fatally if it fails to replay an uncommitted record. Instead, the active controller should fail the RPC with an UNKNOWN_ERROR and revert the in-memory state to the in-memory snapshot taken before the RPC was handled.
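
The proposed handling can be illustrated with a minimal, self-contained sketch. It is not the controller's actual code: InMemoryState, RpcResult, handleRpc and the "key=value" record format are all hypothetical stand-ins, and the copy-based snapshot only approximates what the Timeline data structures and SnapshotRegistry provide. The point is the control flow: snapshot, replay, and on a replay error revert and fail only that RPC instead of exiting.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: every name here is hypothetical, not a Kafka class or method.
final class ReplayRevertSketch {

    /** Hypothetical result codes; UNKNOWN_ERROR mirrors the error named in this ticket. */
    enum RpcResult { NONE, UNKNOWN_ERROR }

    /** Toy in-memory state with copy-based snapshots (a stand-in for the Timeline structures). */
    static final class InMemoryState {
        private Map<String, String> current = new HashMap<>();

        Map<String, String> snapshot() {
            return new HashMap<>(current);
        }

        void revertTo(Map<String, String> snapshot) {
            current = new HashMap<>(snapshot);
        }

        /** Applies one "key=value" record; throws if the record cannot be replayed. */
        void replay(String record) {
            String[] parts = record.split("=", 2);
            if (parts.length != 2) {
                throw new IllegalStateException("Cannot replay record: " + record);
            }
            current.put(parts[0], parts[1]);
        }

        String get(String key) {
            return current.get(key);
        }

        int size() {
            return current.size();
        }
    }

    /**
     * Replays the uncommitted records generated for one RPC. On a replay error the
     * state is reverted to the snapshot taken before the RPC and only this RPC fails;
     * nothing is handed to the (simulated) KRaft layer and the process keeps running.
     */
    static RpcResult handleRpc(InMemoryState state, List<String> uncommitted, List<String> raftLog) {
        Map<String, String> before = state.snapshot();
        try {
            for (String record : uncommitted) {
                state.replay(record);
            }
        } catch (RuntimeException e) {
            state.revertTo(before);
            return RpcResult.UNKNOWN_ERROR;
        }
        // Replay succeeded: send the records on for durability and replication.
        raftLog.addAll(uncommitted);
        return RpcResult.NONE;
    }

    public static void main(String[] args) {
        InMemoryState state = new InMemoryState();
        List<String> raftLog = new ArrayList<>();
        System.out.println(handleRpc(state, List.of("topic-a=3", "topic-b=1"), raftLog));
        // The second record is malformed, so the partial replay is rolled back.
        System.out.println(handleRpc(state, List.of("topic-c=2", "malformed"), raftLog));
        System.out.println("entries in state: " + state.size() + ", records in log: " + raftLog.size());
    }
}
```

In this sketch a client that retries the failing RPC gets UNKNOWN_ERROR again from the same leader, so a deterministic replay error stays confined to that RPC rather than moving from controller to controller.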
h1. Drawback

This solution doesn't work if the error is in the Timeline data structures themselves and the controller is unable to use SnapshotRegistry::revertToSnapshot to revert to the previous state.

  was:
h1. Problem

The KRaft controller replays both committed and uncommitted records. Committed records are replayed by the inactive controller. Uncommitted records are replayed by the active controller.

When handling an RPC the active controller generates a response and a list of uncommitted records. The active controller replays the uncommitted records before sending them to the KRaft layer for durability and replication. If the active controller encounters an error when replaying the uncommitted records, it calls the process exit fault handler.

Indirectly, the process exit fault handler resigns its KRaft leadership and closes all of the client connections.

Most clients retry the RPC when they disconnect from the remote endpoint. If the RPC's replay error is deterministic, then it is possible for the failure to propagate to all of the controllers as they become leaders. This handling may cause the controllers to become unavailable.
h1. Solution

We can prevent this failure from propagating to all of the controllers by changing how we handle errors when replaying uncommitted records. The active controller doesn't need to exit fatally if it fails to replay an uncommitted record. Instead, the active controller should fail the RPC with an UNKNOWN_ERROR and revert the in-memory state to the in-memory snapshot taken before the RPC was handled.


> Don't resign when failing to replay uncommitted records
> -------------------------------------------------------
>
>                 Key: KAFKA-14703
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14703
>             Project: Kafka
>          Issue Type: Improvement
>          Components: controller
>            Reporter: José Armando García Sancio
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)