Posted to issues@hbase.apache.org by "Andrew Kyle Purtell (Jira)" <ji...@apache.org> on 2022/07/01 21:36:00 UTC
[jira] [Closed] (HBASE-14458) AsyncRpcClient#createRpcChannel() should check and remove dead channel before creating new one to same server
[ https://issues.apache.org/jira/browse/HBASE-14458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Kyle Purtell closed HBASE-14458.
---------------------------------------
> AsyncRpcClient#createRpcChannel() should check and remove dead channel before creating new one to same server
> -------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-14458
> URL: https://issues.apache.org/jira/browse/HBASE-14458
> Project: HBase
> Issue Type: Bug
> Components: IPC/RPC
> Affects Versions: 1.2.0, 1.3.0, 1.1.3, 2.0.0
> Reporter: Samir Ahmic
> Assignee: Samir Ahmic
> Priority: Critical
> Fix For: 2.0.0
>
> Attachments: HBASE-14458 (1).patch, HBASE-14458.patch, HBASE-14458.patch
>
>
> I noticed this issue while testing the master branch in distributed mode. Reproduction steps:
> 1. Write some data with hbase ltt
> 2. While ltt is writing, execute $graceful_stop.sh --restart --reload [rs]
> 3. Wait until the script starts reloading regions onto the restarted server. At that moment ltt stops writing and eventually fails.
> After some digging I noticed that while ltt is working correctly there is a single connection per regionserver (lsof output for a single connection; 27109 is the ltt PID):
> {code}
> java 27109 hbase 143u 210579579 0t0 TCP hnode1:40423->hnode5:16020 (ESTABLISHED)
> {code}
> When the server (hnode5 in this example) is restarted and the script starts reloading regions onto it, ltt starts creating thousands of new TCP connections to that server:
> {code}
> java 27109 hbase *623u 210674415 0t0 TCP hnode1:52948->hnode5:16020 (ESTABLISHED)
> java 27109 hbase *624u 210674416 0t0 TCP hnode1:52949->hnode5:16020 (ESTABLISHED)
> java 27109 hbase *625u 210674417 0t0 TCP hnode1:52950->hnode5:16020 (ESTABLISHED)
> java 27109 hbase *627u 210674419 0t0 TCP hnode1:52952->hnode5:16020 (ESTABLISHED)
> java 27109 hbase *628u 210674420 0t0 TCP hnode1:52953->hnode5:16020 (ESTABLISHED)
> java 27109 hbase *633u 210674425 0t0 TCP hnode1:52958->hnode5:16020 (ESTABLISHED)
> ...
> {code}
> Here is what happened, based on some additional logging and debugging:
> - AsyncRpcClient never detected that the regionserver was restarted: its regions had been moved away, so there were no read/write requests to this server, and no heartbeat mechanism is implemented
> - because of the above, the dead {code}AsyncRpcChannel{code} stayed in {code}PoolMap<Integer, AsyncRpcChannel> connections{code}
> - when ltt detected that the regions had been moved back to hnode5, it tried to reconnect to hnode5, leading to this issue
> I was able to resolve this issue by adding the following to AsyncRpcClient#createRpcChannel():
> {code}
> synchronized (connections) {
> if (closed) {
> throw new StoppedRpcClientException();
> }
> rpcChannel = connections.get(hashCode);
> + if (rpcChannel != null && !rpcChannel.isAlive()) {
> + LOG.debug("Removing dead channel from " + rpcChannel.address.toString());
> + connections.remove(hashCode);
> + }
> if (rpcChannel == null || !rpcChannel.isAlive()) {
> rpcChannel = new AsyncRpcChannel(this.bootstrap, this, ticket, serviceName, location);
> connections.put(hashCode, rpcChannel);
> {code}
> I will attach a patch after some more testing.
>
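For readers without the HBase source at hand, the check-and-remove idea in the description can be sketched in isolation. This is a minimal illustration, not the attached patch: SketchChannel and SketchChannelPool are hypothetical stand-ins for AsyncRpcChannel and the client's connection map, a plain HashMap is used instead of PoolMap (so pooling behavior is not modeled), and no real network connection is opened.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for AsyncRpcChannel; it only tracks liveness. */
class SketchChannel {
    final String address;
    private volatile boolean alive = true;

    SketchChannel(String address) {
        this.address = address;
    }

    boolean isAlive() {
        return alive;
    }

    /** Simulates the remote server going away. */
    void close() {
        alive = false;
    }
}

/** Hypothetical stand-in for the client-side connection cache. */
class SketchChannelPool {
    private final Map<Integer, SketchChannel> connections = new HashMap<>();
    private int created = 0;

    /** Returns a live channel for the key, evicting a dead entry first. */
    SketchChannel getOrCreate(int key, String address) {
        synchronized (connections) {
            SketchChannel channel = connections.get(key);
            if (channel != null && !channel.isAlive()) {
                // Drop the stale entry so it cannot linger in the cache.
                connections.remove(key);
                channel = null;
            }
            if (channel == null) {
                channel = new SketchChannel(address);
                connections.put(key, channel);
                created++;
            }
            return channel;
        }
    }

    /** Number of entries currently cached. */
    int size() {
        synchronized (connections) {
            return connections.size();
        }
    }

    /** Total number of channels created so far. */
    int createdCount() {
        synchronized (connections) {
            return created;
        }
    }
}
```

In this sketch a second lookup for a live channel returns the same object, while a lookup after close() evicts the dead entry and creates exactly one replacement, so the cache never holds more than one entry per key.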
--
This message was sent by Atlassian Jira
(v8.20.10#820010)