You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@zookeeper.apache.org by "Mohammad Arshad (Jira)" <ji...@apache.org> on 2021/06/09 12:33:00 UTC

[jira] [Assigned] (ZOOKEEPER-4275) Slowness in sasl login or subject.doAs() causes zk client to falsely assume that the server did not respond, closes connection and goes to unnecessary retries

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mohammad Arshad reassigned ZOOKEEPER-4275:
------------------------------------------

    Assignee: Ravi Kishore Valeti

> Slowness in sasl login or subject.doAs() causes zk client to falsely assume that the server did not respond, closes connection and goes to unnecessary retries
> --------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4275
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4275
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: java client
>    Affects Versions: 3.5.9
>            Reporter: Ravi Kishore Valeti
>            Assignee: Ravi Kishore Valeti
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.5.10, 3.8.0, 3.7.1, 3.6.4
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Zookeeper client does sasl auth (login and subject.doAs())as a preset before attempting a connection to server.
>  If there is a delay in sasl auth (possibly due to slow Kerberos communication), ZK client falsely assumes that the zk server did not respond and runs in to unnecessary multiple retries.
> Client configuration:
>  "zookeeper.session.timeout" = "3000"
>  "zookeeper.recovery.retry" = "1"
>  "zookeeper.recovery.retry.intervalmill" = "500"
> This configuration translates to  
> connect timeout as 1000ms
>  Read Timeout as 2000ms
> Example: There was a 3 second delay in logging in the user as seen from the logs below. The connection attempt was made later. However, zk client did not wait for server response but logged a timeout (3 seconds > 1 sec connect timeout), closed the connection and went to retries. Since there was a consistent delay at Kerberos master, we had seen this retries go as long as 10 mins causing requests to timeout/fail.
> Logs:
> 3/23/21 4:15:*32.389* AM jute.maxbuffer value is xxxxx Bytes
> 3/23/21 4:15:*35.395* AM Client successfully logged in.
> 3/23/21 4:15:35.396 AM TGT refresh sleeping until: Wed Mar 24 00:34:31 GMT 2021
> 3/23/21 4:15:35.396 AM TGT refresh thread started.
> 3/23/21 4:15:35.396 AM Client will use GSSAPI as SASL mechanism.
> 3/23/21 4:15:35.396 AM TGT expires:                  xxx Mar xx 04:15:35 GMT 2021
> 3/23/21 4:15:35.396 AM TGT valid starting at:        xxx Mar xx 04:15:35 GMT 2021
> 3/23/21 4:15:*35.397* AM *Opening socket connection* to server xxxxx:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
> 3/23/21 4:15:*35.397* AM *Client session timed out, have not heard from server in* *3008ms* for sessionid 0x0
> 3/23/21 4:15:35.397 AM Client session timed out, have not heard from server in 3008ms for sessionid 0x0, closing socket connection and attempting reconnect
> 3/23/21 4:15:35.498 AM TGT renewal thread has been interrupted and will exit.
> 3/23/21 4:15:38.503 AM Client successfully logged in.
> 3/23/21 4:15:38.503 AM TGT expires:                  xxx Mar xx 04:15:38 GMT 2021
> 3/23/21 4:15:38.503 AM Client will use GSSAPI as SASL mechanism.
> 3/23/21 4:15:38.503 AM TGT valid starting at:        xxx Mar xx 04:15:38 GMT 2021
> 3/23/21 4:15:38.503 AM TGT refresh thread started.
> 3/23/21 4:15:38.503 AM TGT refresh sleeping until: Wed Mar 24 00:10:10 GMT 2021
> 3/23/21 4:15:38.506 AM Opening socket connection to server xxxxx:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
> 3/23/21 4:15:38.506 AM Client session timed out, have not heard from server in 3009ms for sessionid 0x0, closing socket connection and attempting reconnect
> 3/23/21 4:15:38.506 AM Client session timed out, have not heard from server in 3009ms for sessionid 0x0
> 3/23/21 4:15:38.606 AM TGT renewal thread has been interrupted and will exit.
> 3/23/21 4:15:41.610 AM Client successfully logged in.
> 3/23/21 4:15:41.611 AM TGT refresh sleeping until: xxx Mar xx 23:42:03 GMT 2021
> 3/23/21 4:15:41.611 AM Client will use GSSAPI as SASL mechanism.
> 3/23/21 4:15:41.611 AM TGT valid starting at:        xxx Mar xx 04:15:41 GMT 2021
> 3/23/21 4:15:41.611 AM TGT expires:                  xxx Mar xx 04:15:41 GMT 2021
> 3/23/21 4:15:41.611 AM TGT refresh thread started.
> 3/23/21 4:15:41.612 AM Opening socket connection to server xxxxx:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
> 3/23/21 4:15:41.613 AM Client session timed out, have not heard from server in 3006ms for sessionid 0x0
> 3/23/21 4:15:41.613 AM Client session timed out, have not heard from server in 3006ms for sessionid 0x0, closing socket connection and attempting reconnect



--
This message was sent by Atlassian Jira
(v8.3.4#803005)