You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hbase.apache.org by "Samir Ahmic (JIRA)" <ji...@apache.org> on 2015/09/21 21:20:05 UTC

[jira] [Created] (HBASE-14458) AsyncRpcClient#createRpcChannel() should check and remove dead channel before creating new one to same server

Samir Ahmic created HBASE-14458:
-----------------------------------

             Summary: AsyncRpcClient#createRpcChannel() should check and remove dead channel before creating new one to same server
                 Key: HBASE-14458
                 URL: https://issues.apache.org/jira/browse/HBASE-14458
             Project: HBase
          Issue Type: Bug
          Components: IPC/RPC
    Affects Versions: 2.0.0, 1.2.0, 1.3.0, 1.1.3
            Reporter: Samir Ahmic
            Assignee: Samir Ahmic
            Priority: Critical


I have notice this issue while testing master branch in distributed mode. Reproduction steps:
1. Write some data with hbase ltt 
2. While ltt is writing execute $graceful_stop.sh --restart --reload [rs] 
3. Wait until script start to reload regions to restarted server. In that moment ltt will stop writing and eventually fail. 

After some digging i have notice that while ltt is working correctly there is single connection per regionserver (lsof for single connection, 27109 is  ltt PID )
{code}
java      27109   hbase  143u    210579579      0t0        TCP hnode1:40423->hnode5:16020 (ESTABLISHED)
{code}  

and when in this example hnode5 server is restarted and script starts to reload regions on this server ltt start creating thousands of new tcp connections to this server:
{code}
java      27109   hbase *623u              210674415      0t0        TCP hnode1:52948->hnode5:16020 (ESTABLISHED)
java      27109   hbase *624u               210674416      0t0        TCP hnode1:52949->hnode5:16020 (ESTABLISHED)
java      27109   hbase *625u               210674417      0t0        TCP hnode1:52950->hnode5:16020 (ESTABLISHED)
java      27109   hbase *627u               210674419      0t0        TCP hnode1:52952->hnode5:16020 (ESTABLISHED)
java      27109   hbase *628u               210674420      0t0        TCP hnode1:52953->hnode5:16020 (ESTABLISHED)
java      27109   hbase *633u               210674425      0t0        TCP hnode1:52958->hnode5:16020 (ESTABLISHED)
...
{code}
So here is what happened based on some additional logging and debugging:
- AsyncRpcClient never detected that regionserver is restarted because regions were moved and there was no write/read requests to this server and  there is no some sort of heart-bit mechanism implemented
-  because of above dead {code}AsyncRpcChannel{code} stayed in {code}PoolMap<Integer, AsyncRpcChannel> connections{code}
- when ltt detected that regions are moved back to hnode5  it tried to reconnect to hnode5  leading this issue
I was able to resolve this issue by adding following to AsyncRpcClient#createRpcChannel():
{code}
synchronized (connections) {
      if (closed) {
        throw new StoppedRpcClientException();
      }
      rpcChannel = connections.get(hashCode);
+    if (rpcChannel != null && !rpcChannel.isAlive()) {
+        LOG.debug(Removing dead channel from "+ rpcChannel.address.toString());
+        connections.remove(hashCode);
+      }      

      if (rpcChannel == null || !rpcChannel.isAlive()) {
        rpcChannel = new AsyncRpcChannel(this.bootstrap, this, ticket, serviceName, location);
        connections.put(hashCode, rpcChannel);
{code}
 I will attach patch after some more testing.

 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)