Posted to hdfs-issues@hadoop.apache.org by "Xing Lin (Jira)" <ji...@apache.org> on 2023/05/29 21:30:00 UTC

[jira] [Created] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy

Xing Lin created HDFS-17030:
-------------------------------

             Summary: Limit wait time for getHAServiceState in ObserverReaderProxy
                 Key: HDFS-17030
                 URL: https://issues.apache.org/jira/browse/HDFS-17030
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: hdfs
    Affects Versions: 3.4.0
            Reporter: Xing Lin


When HA is enabled and a standby NN is not responsive (either because it is down or because a heap dump is being taken), the client waits for either _socket_connection_timeout * socket_max_retries_on_connection_timeout_ or _rpcTimeOut_ before moving on to the next NN. This adds significant latency. For clusters at LinkedIn, we set rpcTimeOut to 120 seconds, so a request takes more than 2 minutes to complete whenever we take a heap dump on a standby. This has been causing user job failures.

The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy, so that we wait at most that long for an NN to respond with its HA state. Once the timeout expires, we move on to the next NN.
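
The idea can be illustrated with a minimal sketch. This is not the actual patch: HAStateProbe, its State enum, probe() and findFirst() are hypothetical names made up for this example, and each NN is modelled as a plain Callable standing in for a getHAServiceState() RPC. The probe runs on a separate thread and the caller waits for the result only up to a fixed timeout.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch; none of these names come from the Hadoop code base. */
class HAStateProbe {
  enum State { ACTIVE, STANDBY, OBSERVER }

  // Daemon threads, so a probe that never returns cannot keep the JVM alive.
  private final ExecutorService executor = Executors.newCachedThreadPool(r -> {
    Thread t = new Thread(r, "ha-state-probe");
    t.setDaemon(true);
    return t;
  });
  private final long timeoutMs;

  HAStateProbe(long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  /** Returns the reported state, or empty if the NN failed or did not answer in time. */
  Optional<State> probe(Callable<State> getHAServiceState) {
    Future<State> future = executor.submit(getHAServiceState);
    try {
      return Optional.ofNullable(future.get(timeoutMs, TimeUnit.MILLISECONDS));
    } catch (TimeoutException e) {
      future.cancel(true); // best effort: interrupts the probing thread
      return Optional.empty();
    } catch (ExecutionException e) {
      return Optional.empty(); // the call itself failed; treat the NN as unusable
    } catch (InterruptedException e) {
      future.cancel(true);
      Thread.currentThread().interrupt(); // preserve the caller's interrupt status
      return Optional.empty();
    }
  }

  /** Returns the index of the first NN reporting the wanted state, or -1 if none does. */
  int findFirst(List<Callable<State>> namenodes, State wanted) {
    for (int i = 0; i < namenodes.size(); i++) {
      Optional<State> state = probe(namenodes.get(i));
      if (state.isPresent() && state.get() == wanted) {
        return i;
      }
    }
    return -1;
  }
}
```

With this shape, an unresponsive NN costs the caller at most one probe timeout instead of the full rpcTimeOut. Note that Future.cancel(true) only interrupts the worker thread; whether a blocked RPC actually aborts on interrupt is an assumption that would need to be checked against the real client code.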

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
