Posted to jira@kafka.apache.org by "Guozhang Wang (Jira)" <ji...@apache.org> on 2020/09/15 17:11:00 UTC

[jira] [Created] (KAFKA-10485) Use a separate error code for replication related errors

Guozhang Wang created KAFKA-10485:
-------------------------------------

             Summary: Use a separate error code for replication related errors
                 Key: KAFKA-10485
                 URL: https://issues.apache.org/jira/browse/KAFKA-10485
             Project: Kafka
          Issue Type: Improvement
            Reporter: Guozhang Wang


Today, when a coordinator request involves an append to the internal topic (e.g. a commit / sync-group request sent to the group coordinator), we capture the following errors and translate them to COORDINATOR_NOT_AVAILABLE before returning to the client:

* UNKNOWN_TOPIC_OR_PARTITION
* NOT_ENOUGH_REPLICAS
* NOT_ENOUGH_REPLICAS_AFTER_APPEND
* REQUEST_TIMED_OUT (for txn coordinator)

Among those, the second and third cases are worth reconsidering, because COORDINATOR_NOT_AVAILABLE causes clients to re-discover the coordinator unnecessarily, with only a short backoff time. The fourth case is probably also worth revisiting: although the motivation for using COORDINATOR_NOT_AVAILABLE is to let the client retry, it still incurs unnecessary coordinator re-discovery.
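For illustration, the translation described above can be sketched roughly as follows. This is a simplified model, not the actual coordinator code: the AppendError enum, the CurrentTranslation class and the txnCoordinator flag are hypothetical names introduced only for this sketch. The ticket does not say what non-transactional coordinators do with REQUEST_TIMED_OUT, so the sketch passes it through unchanged purely as a placeholder.

```java
// Hypothetical, simplified stand-in for broker error codes (not Kafka's real Errors enum).
enum AppendError {
    NONE,
    UNKNOWN_TOPIC_OR_PARTITION,
    NOT_ENOUGH_REPLICAS,
    NOT_ENOUGH_REPLICAS_AFTER_APPEND,
    REQUEST_TIMED_OUT,
    COORDINATOR_NOT_AVAILABLE
}

// Sketch of the behaviour described in this ticket: several distinct append
// errors are collapsed into a single COORDINATOR_NOT_AVAILABLE for the client.
final class CurrentTranslation {
    private CurrentTranslation() { }

    static AppendError toClientError(AppendError appendError, boolean txnCoordinator) {
        switch (appendError) {
            case UNKNOWN_TOPIC_OR_PARTITION:
            case NOT_ENOUGH_REPLICAS:
            case NOT_ENOUGH_REPLICAS_AFTER_APPEND:
                return AppendError.COORDINATOR_NOT_AVAILABLE;
            case REQUEST_TIMED_OUT:
                // Per the ticket, this translation applies to the txn coordinator.
                // Assumption: other coordinators are modelled as a pass-through.
                return txnCoordinator ? AppendError.COORDINATOR_NOT_AVAILABLE : appendError;
            default:
                return appendError;
        }
    }
}
```

Because the client only ever sees COORDINATOR_NOT_AVAILABLE, it cannot tell a genuinely missing coordinator apart from a replication problem on other brokers.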

It would be better if, for 2)/3), clients did not re-discover the coordinator but simply retried with a longer backoff time, while exposing, through either a metric or warning logs, that some other brokers (not the coordinator) are unavailable and are blocking the operation. For 4), clients can just retry without re-discovery. Only for 1) does it make sense to let clients re-discover the coordinator.
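A minimal sketch of the client-side handling proposed above, assuming the client could distinguish the four underlying errors (for example via a separate error code, whose name is not decided here). UnderlyingError, ClientAction and ProposedHandling are hypothetical names used only for illustration and are not existing Kafka client APIs.

```java
// Hypothetical view of the four underlying errors listed in this ticket.
enum UnderlyingError {
    UNKNOWN_TOPIC_OR_PARTITION,        // case 1
    NOT_ENOUGH_REPLICAS,               // case 2
    NOT_ENOUGH_REPLICAS_AFTER_APPEND,  // case 3
    REQUEST_TIMED_OUT                  // case 4
}

// Hypothetical client reactions; 'warn' marks the cases that should be
// surfaced via a metric or a warning log.
enum ClientAction {
    REDISCOVER_COORDINATOR(false),
    RETRY_WITH_LONGER_BACKOFF(true),
    RETRY_WITHOUT_REDISCOVERY(false);

    final boolean warn;

    ClientAction(boolean warn) {
        this.warn = warn;
    }
}

final class ProposedHandling {
    private ProposedHandling() { }

    static ClientAction actionFor(UnderlyingError error) {
        switch (error) {
            case UNKNOWN_TOPIC_OR_PARTITION:
                // Only here does re-discovering the coordinator make sense.
                return ClientAction.REDISCOVER_COORDINATOR;
            case NOT_ENOUGH_REPLICAS:
            case NOT_ENOUGH_REPLICAS_AFTER_APPEND:
                // Other brokers, not the coordinator, are the problem:
                // keep the coordinator, back off longer, and warn.
                return ClientAction.RETRY_WITH_LONGER_BACKOFF;
            case REQUEST_TIMED_OUT:
                return ClientAction.RETRY_WITHOUT_REDISCOVERY;
            default:
                throw new IllegalArgumentException("Unexpected error: " + error);
        }
    }
}
```

For example, ProposedHandling.actionFor(UnderlyingError.NOT_ENOUGH_REPLICAS) yields RETRY_WITH_LONGER_BACKOFF with its warn flag set, so the client keeps its current coordinator and reports that replication is what is blocking the operation.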



--
This message was sent by Atlassian Jira
(v8.3.4#803005)