Posted to hdfs-issues@hadoop.apache.org by "Xing Lin (Jira)" <ji...@apache.org> on 2023/05/29 21:30:00 UTC
[jira] [Created] (HDFS-17030) Limit wait time for getHAServiceState in ObserverReaderProxy
Xing Lin created HDFS-17030:
-------------------------------
Summary: Limit wait time for getHAServiceState in ObserverReaderProxy
Key: HDFS-17030
URL: https://issues.apache.org/jira/browse/HDFS-17030
Project: Hadoop HDFS
Issue Type: Improvement
Components: hdfs
Affects Versions: 3.4.0
Reporter: Xing Lin
When HA is enabled and a standby NN is not responsive (either because it is down or because a heap dump is being taken), we wait for either _socket_connection_timeout * socket_max_retries_on_connection_timeout_ or _rpcTimeOut_ before moving on to the next NN. This adds significant latency. For clusters at LinkedIn, we set rpcTimeOut to 120 seconds, so a request takes more than 2 minutes to complete when we take a heap dump at a standby. This has been causing user job failures.
The proposal is to add a timeout on getHAServiceState() calls in ObserverReaderProxy, so that we wait at most that long for an NN to respond with its HA state. Once the timeout passes, we move on to the next NN.
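A minimal sketch of the proposed behaviour, not the actual Hadoop change: each HA state probe runs on a separate thread and the caller waits on the resulting Future for a bounded time. The class HaStateProbe, the State enum, and the methods probe and firstWithState are hypothetical names used only for illustration; the Callable stands in for the real getHAServiceState() RPC, and the 200 ms timeout is an arbitrary example value.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HaStateProbe {
    /** Hypothetical stand-in for the HA states an NN can report. */
    public enum State { ACTIVE, STANDBY, OBSERVER }

    // Daemon threads, so a probe that is still stuck cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "ha-state-probe");
        t.setDaemon(true);
        return t;
    });

    /** Returns the reported state, or null if the probe failed or timed out. */
    public static State probe(Callable<State> getHAServiceState, long timeoutMs) {
        Future<State> pending = POOL.submit(getHAServiceState);
        try {
            // Wait at most timeoutMs instead of the full RPC timeout.
            return pending.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            pending.cancel(true); // interrupt the stuck probe and give up on this NN
            return null;
        } catch (ExecutionException e) {
            return null; // the probe itself threw, e.g. connection refused
        } catch (InterruptedException e) {
            pending.cancel(true);
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return null;
        }
    }

    /** Probes NNs in order and returns the name of the first one in the wanted state, or null. */
    public static String firstWithState(Map<String, Callable<State>> nameNodes,
                                        State wanted, long timeoutMs) {
        for (Map.Entry<String, Callable<State>> nn : nameNodes.entrySet()) {
            if (probe(nn.getValue(), timeoutMs) == wanted) {
                return nn.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Callable<State>> nameNodes = new LinkedHashMap<>();
        // nn1 simulates an unresponsive NN (e.g. a heap dump in progress).
        nameNodes.put("nn1", () -> { Thread.sleep(5000); return State.OBSERVER; });
        nameNodes.put("nn2", () -> State.OBSERVER);
        System.out.println(firstWithState(nameNodes, State.OBSERVER, 200));
    }
}
```

With this shape, an unresponsive NN costs the caller roughly one probe timeout rather than the full rpcTimeOut. A real implementation would also need to decide how the probe timeout is configured and what happens to RPCs that are cancelled mid-flight, which this sketch does not address.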
--
This message was sent by Atlassian Jira
(v8.20.10#820010)