Posted to jira@kafka.apache.org by "Rajini Sivaram (Jira)" <ji...@apache.org> on 2020/10/27 11:30:00 UTC

[jira] [Commented] (KAFKA-7987) a broker's ZK session may die on transient auth failure

    [ https://issues.apache.org/jira/browse/KAFKA-7987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221328#comment-17221328 ] 

Rajini Sivaram commented on KAFKA-7987:
---------------------------------------

[~junrao] This is still an open issue for all versions of Kafka, right? I am looking into an authorizer issue where authorizer notifications were not processed for a long time. A heap dump shows that the authorizer's ZooKeeperClient is in the AUTH_FAILED state. ZK is Kerberos-enabled, and the logs show a couple of authentication failures caused by clock-skew errors, which look like the reason the authorizer's ZooKeeperClient got into this state. For the authorizer, we do need to schedule retries in this case. The issue doesn't seem to have affected the broker's other operations, presumably because we retry connections for the other ZooKeeperClient whenever there are requests. We should still apply the retry fix to the common ZooKeeperClient, right? Since the affected broker didn't pick up ACL updates made on other brokers, this is a critical security issue. But I wanted to check whether we have applied any fixes in this area in newer versions of Kafka. Thanks.
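
To make the "schedule retries" idea concrete, here is a minimal sketch of what that could look like. It is only an illustration under stated assumptions: the class AuthFailedRetrySketch, its State enum, the reinitialize callback and the fixed backoff are all hypothetical names and choices, not Kafka's ZooKeeperClient or the ZooKeeper client API. The callback stands in for "close the old session and try to create a new one".

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these names are Kafka or ZooKeeper APIs. */
final class AuthFailedRetrySketch {

    /** Simplified stand-in for the two client states discussed in this issue. */
    enum State { CONNECTED, AUTH_FAILED }

    // Assumed callback: tears down the old session, opens a new one, reports the outcome.
    private final Supplier<State> reinitialize;
    private final long retryBackoffMs;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "zk-auth-retry-sketch");
                t.setDaemon(true); // do not keep the JVM alive just for retries
                return t;
            });
    private final AtomicBoolean retrying = new AtomicBoolean(false);
    private final AtomicInteger attempts = new AtomicInteger();
    private final CompletableFuture<Integer> connected = new CompletableFuture<>();

    AuthFailedRetrySketch(Supplier<State> reinitialize, long retryBackoffMs) {
        this.reinitialize = reinitialize;
        this.retryBackoffMs = retryBackoffMs;
    }

    /**
     * To be called when the session watcher reports an auth failure.
     * Instead of ignoring the event, start a retry chain (at most one).
     * The returned future completes with the number of attempts it took.
     */
    CompletableFuture<Integer> onAuthFailed() {
        if (retrying.compareAndSet(false, true)) {
            scheduleRetry();
        }
        return connected;
    }

    private void scheduleRetry() {
        scheduler.schedule(this::retryOnce, retryBackoffMs, TimeUnit.MILLISECONDS);
    }

    private void retryOnce() {
        int attempt = attempts.incrementAndGet();
        State result;
        try {
            result = reinitialize.get();
        } catch (RuntimeException e) {
            result = State.AUTH_FAILED; // treat any failure as "still not authenticated"
        }
        if (result == State.CONNECTED) {
            connected.complete(attempt);
            scheduler.shutdown(); // nothing left to retry
        } else {
            scheduleRetry(); // transient failure: try again after the backoff
        }
    }
}
```

In this sketch the watcher callback would call onAuthFailed() when it sees the auth failure, so a transient KDC or clock-skew problem leads to repeated session re-creation attempts rather than a client that stays in AUTH_FAILED. A real fix would also need to handle shutdown and probably cap or grow the backoff.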


> a broker's ZK session may die on transient auth failure
> -------------------------------------------------------
>
>                 Key: KAFKA-7987
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7987
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Jun Rao
>            Priority: Major
>
> After a transient network issue, we saw the following log in a broker.
> {code:java}
> [23:37:02,102] ERROR SASL authentication with Zookeeper Quorum member failed: javax.security.sasl.SaslException: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7))]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state. (org.apache.zookeeper.ClientCnxn)
> [23:37:02,102] ERROR [ZooKeeperClient] Auth failed. (kafka.zookeeper.ZooKeeperClient)
> {code}
> The network issue prevented the broker from communicating with ZK. The broker's ZK session then expired, but the broker didn't know that yet since it couldn't establish a connection to ZK. When the network came back, the broker tried to establish a connection to ZK, but failed due to an auth failure (likely caused by a transient KDC issue). The current logic just ignores the auth failure without trying to create a new ZK session. The broker then stays permanently in a state where it is alive but not registered in ZK.
>  


