Posted to issues@hbase.apache.org by "Andrew Kyle Purtell (Jira)" <ji...@apache.org> on 2022/07/01 21:36:00 UTC
[jira] [Closed] (HBASE-14458) AsyncRpcClient#createRpcChannel() should check and remove dead channel before creating new one to same server

     [ https://issues.apache.org/jira/browse/HBASE-14458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Kyle Purtell closed HBASE-14458.
---------------------------------------

> AsyncRpcClient#createRpcChannel() should check and remove dead channel before creating new one to same server
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-14458
>                 URL: https://issues.apache.org/jira/browse/HBASE-14458
>             Project: HBase
>          Issue Type: Bug
>          Components: IPC/RPC
>    Affects Versions: 1.2.0, 1.3.0, 1.1.3, 2.0.0
>            Reporter: Samir Ahmic
>            Assignee: Samir Ahmic
>            Priority: Critical
>             Fix For: 2.0.0
>
>         Attachments: HBASE-14458 (1).patch, HBASE-14458.patch, HBASE-14458.patch
>
>
> I noticed this issue while testing the master branch in distributed mode. Reproduction steps:
> 1. Write some data with hbase ltt.
> 2. While ltt is writing, execute $graceful_stop.sh --restart --reload [rs].
> 3. Wait until the script starts to reload regions onto the restarted server. At that moment ltt stops writing and eventually fails.
> After some digging I noticed that while ltt is working correctly there is a single connection per regionserver (lsof output for a single connection; 27109 is the ltt PID):
> {code}
> java      27109   hbase  143u    210579579      0t0        TCP hnode1:40423->hnode5:16020 (ESTABLISHED)
> {code}  
> and when, as in this example, the hnode5 server is restarted and the script starts to reload regions onto it, ltt starts creating thousands of new TCP connections to this server:
> {code}
> java      27109   hbase *623u              210674415      0t0        TCP hnode1:52948->hnode5:16020 (ESTABLISHED)
> java      27109   hbase *624u               210674416      0t0        TCP hnode1:52949->hnode5:16020 (ESTABLISHED)
> java      27109   hbase *625u               210674417      0t0        TCP hnode1:52950->hnode5:16020 (ESTABLISHED)
> java      27109   hbase *627u               210674419      0t0        TCP hnode1:52952->hnode5:16020 (ESTABLISHED)
> java      27109   hbase *628u               210674420      0t0        TCP hnode1:52953->hnode5:16020 (ESTABLISHED)
> java      27109   hbase *633u               210674425      0t0        TCP hnode1:52958->hnode5:16020 (ESTABLISHED)
> ...
> {code}
> So here is what happened, based on some additional logging and debugging:
> - AsyncRpcClient never detected that the regionserver was restarted, because its regions had been moved away, there were no read/write requests to this server, and no heartbeat mechanism is implemented.
> - Because of the above, the dead {code}AsyncRpcChannel{code} stayed in {code}PoolMap<Integer, AsyncRpcChannel> connections{code}.
> - When ltt detected that regions were moved back to hnode5, it tried to reconnect to hnode5, leading to this issue.
> I was able to resolve this issue by adding the following to AsyncRpcClient#createRpcChannel():
> {code}
> synchronized (connections) {
>       if (closed) {
>         throw new StoppedRpcClientException();
>       }
>       rpcChannel = connections.get(hashCode);
> +    if (rpcChannel != null && !rpcChannel.isAlive()) {
> +        LOG.debug("Removing dead channel from " + rpcChannel.address.toString());
> +        connections.remove(hashCode);
> +      }
>       if (rpcChannel == null || !rpcChannel.isAlive()) {
>         rpcChannel = new AsyncRpcChannel(this.bootstrap, this, ticket, serviceName, location);
>         connections.put(hashCode, rpcChannel);
> {code}
>  I will attach a patch after some more testing.
>  
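
The check-and-remove pattern described in the report can be sketched in isolation as follows. This is only a minimal illustration, not the HBase code or the attached patch: SketchChannel and SketchChannelCache are hypothetical stand-ins for AsyncRpcChannel and the client's connection cache, markDead() merely simulates a regionserver restart, and a plain HashMap is used in place of HBase's PoolMap, so the sketch shows the eviction logic only and does not reproduce the connection leak itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for AsyncRpcChannel; only the liveness flag is modeled.
final class SketchChannel {
    final String address;
    private volatile boolean alive = true;

    SketchChannel(String address) {
        this.address = address;
    }

    boolean isAlive() {
        return alive;
    }

    // Simulates the remote server going away (for example, a restart).
    void markDead() {
        alive = false;
    }
}

// Hypothetical stand-in for the client-side channel cache.
final class SketchChannelCache {
    private final Map<Integer, SketchChannel> connections = new HashMap<>();
    private int created = 0;

    SketchChannel getOrCreate(int key, String address) {
        synchronized (connections) {
            SketchChannel channel = connections.get(key);
            if (channel != null && !channel.isAlive()) {
                // Evict the stale entry first so it cannot be handed out again.
                connections.remove(key);
                channel = null;
            }
            if (channel == null) {
                // No usable channel is cached: create exactly one and cache it.
                channel = new SketchChannel(address);
                connections.put(key, channel);
                created++;
            }
            return channel;
        }
    }

    // Number of channels created over the lifetime of this cache.
    int createdCount() {
        synchronized (connections) {
            return created;
        }
    }

    // Number of channels currently cached.
    int size() {
        synchronized (connections) {
            return connections.size();
        }
    }
}
```

With this structure, repeated calls for the same key return the cached channel while it is alive. After markDead(), the next call evicts the dead entry and creates a single replacement, which later calls reuse.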



--
This message was sent by Atlassian Jira
(v8.20.10#820010)