Posted to dev@hbase.apache.org by "Samir Ahmic (JIRA)" <ji...@apache.org> on 2015/09/21 21:20:05 UTC
[jira] [Created] (HBASE-14458) AsyncRpcClient#createRpcChannel() should check and remove dead channel before creating new one to same server
Samir Ahmic created HBASE-14458:
-----------------------------------
Summary: AsyncRpcClient#createRpcChannel() should check and remove dead channel before creating new one to same server
Key: HBASE-14458
URL: https://issues.apache.org/jira/browse/HBASE-14458
Project: HBase
Issue Type: Bug
Components: IPC/RPC
Affects Versions: 2.0.0, 1.2.0, 1.3.0, 1.1.3
Reporter: Samir Ahmic
Assignee: Samir Ahmic
Priority: Critical
I have noticed this issue while testing the master branch in distributed mode. Reproduction steps:
1. Write some data with hbase ltt
2. While ltt is writing, execute $graceful_stop.sh --restart --reload [rs]
3. Wait until the script starts to reload regions onto the restarted server. At that moment ltt will stop writing and eventually fail.
After some digging I noticed that while ltt is working correctly there is a single connection per regionserver (lsof output for that single connection; 27109 is the ltt PID):
{code}
java 27109 hbase 143u 210579579 0t0 TCP hnode1:40423->hnode5:16020 (ESTABLISHED)
{code}
and when, in this example, the hnode5 server is restarted and the script starts to reload regions onto it, ltt starts creating thousands of new TCP connections to this server:
{code}
java 27109 hbase *623u 210674415 0t0 TCP hnode1:52948->hnode5:16020 (ESTABLISHED)
java 27109 hbase *624u 210674416 0t0 TCP hnode1:52949->hnode5:16020 (ESTABLISHED)
java 27109 hbase *625u 210674417 0t0 TCP hnode1:52950->hnode5:16020 (ESTABLISHED)
java 27109 hbase *627u 210674419 0t0 TCP hnode1:52952->hnode5:16020 (ESTABLISHED)
java 27109 hbase *628u 210674420 0t0 TCP hnode1:52953->hnode5:16020 (ESTABLISHED)
java 27109 hbase *633u 210674425 0t0 TCP hnode1:52958->hnode5:16020 (ESTABLISHED)
...
{code}
So here is what happened, based on some additional logging and debugging:
- AsyncRpcClient never detected that the regionserver was restarted, because the regions were moved away, there were no read/write requests to this server, and there is no heartbeat mechanism implemented
- because of the above, the dead {{AsyncRpcChannel}} stayed in {{PoolMap<Integer, AsyncRpcChannel> connections}}
- when ltt detected that regions were moved back to hnode5, it tried to reconnect to hnode5, leading to this issue
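The pattern behind the fix can be sketched generically (a minimal sketch with hypothetical names, not the actual HBase classes): before reusing a pooled channel, check whether its peer died silently and evict the dead entry, so the pool holds at most one live channel per server.

{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for AsyncRpcClient's connection pool and
// AsyncRpcChannel; only liveness matters for this sketch.
public class ChannelPool {
    public static class Channel {
        public volatile boolean alive = true;
    }

    private final Map<Integer, Channel> connections = new HashMap<>();

    public Channel getChannel(int hashCode) {
        synchronized (connections) {
            Channel ch = connections.get(hashCode);
            // Evict a channel whose peer died silently; without this step
            // the dead entry shadows the pool slot and callers keep
            // working against a broken channel.
            if (ch != null && !ch.alive) {
                connections.remove(hashCode);
                ch = null;
            }
            if (ch == null) {
                ch = new Channel();
                connections.put(hashCode, ch);
            }
            return ch;
        }
    }

    public int size() {
        synchronized (connections) {
            return connections.size();
        }
    }
}
{code}

After a channel is marked dead, the next lookup for the same key returns a fresh channel while the pool size stays at one entry per server.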
I was able to resolve this issue by adding the following to AsyncRpcClient#createRpcChannel():
{code}
 synchronized (connections) {
   if (closed) {
     throw new StoppedRpcClientException();
   }
   rpcChannel = connections.get(hashCode);
+  if (rpcChannel != null && !rpcChannel.isAlive()) {
+    LOG.debug("Removing dead channel from " + rpcChannel.address.toString());
+    connections.remove(hashCode);
+  }
   if (rpcChannel == null || !rpcChannel.isAlive()) {
     rpcChannel = new AsyncRpcChannel(this.bootstrap, this, ticket, serviceName, location);
     connections.put(hashCode, rpcChannel);
   }
 }
{code}
I will attach patch after some more testing.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)