Posted to dev@hbase.apache.org by "Andrew Kyle Purtell (Jira)" <ji...@apache.org> on 2022/06/14 17:54:00 UTC
[jira] [Resolved] (HBASE-6751) Too many retries, leading to a delay to read the HLog after a datanode failure
[ https://issues.apache.org/jira/browse/HBASE-6751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Kyle Purtell resolved HBASE-6751.
----------------------------------------
Resolution: Fixed
> Too many retries, leading to a delay to read the HLog after a datanode failure
> ------------------------------------------------------------------------------
>
> Key: HBASE-6751
> URL: https://issues.apache.org/jira/browse/HBASE-6751
> Project: HBase
> Issue Type: Improvement
> Components: regionserver
> Affects Versions: 2.0.0
> Reporter: Nicolas Liochon
> Priority: Major
>
> When reading an HLog, we need to go to the last block to get the file size.
> In HDFS 1.0.3, this leads to HDFS-3701 / HBASE-6401.
> In HDFS branch-2, this bug is fixed, but we have two other issues.
> 1) For simple cases, such as a single dead node, we don't see the effect of HDFS-3703, yet the default location order leads us to try to connect to a dead datanode when we should not. This has not been analysed yet; a specific JIRA will be created later.
> 2) If we are redirected to a wrong node, we experience a huge delay:
> The pseudo code in DFSInputStream#readBlockLength is:
> {noformat}
> for (DatanodeInfo datanode : locatedblock.getLocations()) {
>   try {
>     ClientDatanodeProtocol cdp = DFSUtil.createClientDatanodeProtocolProxy(
>         datanode, dfsClient.conf, dfsClient.getConf().socketTimeout,
>         dfsClient.getConf().connectToDnViaHostname, locatedblock);
>     return cdp.getReplicaVisibleLength(locatedblock.getBlock());
>   } catch (IOException ioe) {
>     // retry with the next located datanode
>   }
> }
> {noformat}
> However, with this code, the connection is created with a null RetryPolicy, which then defaults to 10 retries:
> {noformat}
> public static final String IPC_CLIENT_CONNECT_MAX_RETRIES_KEY = "ipc.client.connect.max.retries";
> public static final int IPC_CLIENT_CONNECT_MAX_RETRIES_DEFAULT = 10;
> {noformat}
> So if the first datanode is bad, we will try it 10 times before trying the second. In the context of HBASE-6738, the split task is cancelled before we have opened the file to split.
> By nature, this is likely a pure HDFS issue, but maybe it can be solved in HBase with the right configuration of "ipc.client.connect.max.retries".
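> A minimal sketch of that workaround, assuming the setting is picked up by the HDFS client HBase runs (the property name comes from the Hadoop constant quoted above; the value 1 is only illustrative), would be to add to hbase-site.xml:
> {noformat}
> <property>
>   <name>ipc.client.connect.max.retries</name>
>   <value>1</value>
> </property>
> {noformat}
> Note this is a client-wide IPC setting, so lowering it affects every IPC connection the process makes, not just reads of the HLog.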
> The ideal fix (in HDFS) would be to try the datanodes once each, and then loop 10 times.
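> A rough, self-contained sketch of that ordering (illustrative only: RetryOrderSketch, connectOnce, and the dead/live node names are hypothetical, with connectOnce standing in for the createClientDatanodeProtocolProxy / getReplicaVisibleLength pair above; none of this is HDFS API):

```java
import java.util.Arrays;
import java.util.List;

public class RetryOrderSketch {

    // Simulates one connection attempt; "dead" nodes always fail.
    static long connectOnce(String datanode) throws Exception {
        if (datanode.startsWith("dead")) {
            throw new Exception("connect to " + datanode + " failed");
        }
        return 42L; // stand-in for the visible replica length
    }

    // Try each datanode once per pass, then loop up to maxRetries passes,
    // instead of retrying the first node maxRetries times before moving on.
    static long readBlockLength(List<String> datanodes, int maxRetries)
            throws Exception {
        Exception last = null;
        for (int pass = 0; pass < maxRetries; pass++) {
            for (String dn : datanodes) {
                try {
                    return connectOnce(dn);
                } catch (Exception e) {
                    last = e; // remember, fall through to the next node
                }
            }
        }
        throw last; // every node failed on every pass
    }

    public static void main(String[] args) throws Exception {
        // First replica is on a dead node: with the per-pass order we reach
        // the healthy second node on the first pass, after one failed attempt.
        long len = readBlockLength(Arrays.asList("dead-dn1", "live-dn2"), 10);
        System.out.println(len);
    }
}
```

> With the current code, the same two-node layout would cost 10 failed attempts against dead-dn1 before live-dn2 is ever contacted.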
--
This message was sent by Atlassian Jira
(v8.20.7#820007)