Posted to issues@hbase.apache.org by "Zheng Hu (JIRA)" <ji...@apache.org> on 2019/05/08 12:04:00 UTC
[jira] [Created] (HBASE-22381) The write request won't refresh its HConnection's local meta cache once an RegionServer got stuck

Zheng Hu created HBASE-22381:
--------------------------------

             Summary: The write request won't refresh its HConnection's local meta cache once an RegionServer got stuck
                 Key: HBASE-22381
                 URL: https://issues.apache.org/jira/browse/HBASE-22381
             Project: HBase
          Issue Type: Bug
            Reporter: Zheng Hu
            Assignee: Zheng Hu


In a production environment (reported by [~xinxin fan] from Netease, HBase version 1.2.6), we found the following case:
1. A RegionServer got stuck.
2. All requests were write requests, and they threw an exception like this:
{code}
Caused by: java.net.SocketTimeoutException: 15000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.130.88.181:59049 remote=hbase699.hz.163.org/10.120.192.76:60020]
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
	at java.io.FilterInputStream.read(FilterInputStream.java:133)
	at java.io.FilterInputStream.read(FilterInputStream.java:133)
	at org.apache.hadoop.hbase.ipc.RpcClient$Connection$PingInputStream.read(RpcClient.java:558)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
	at java.io.DataInputStream.readInt(DataInputStream.java:387)
	at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1076)
	at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:727)
{code}
3. None of the write requests to the stuck RegionServer ever cleared their client's local meta cache, so the clients kept sending requests to the stuck server endlessly, which kept availability below 100% for a long time.

I checked the code and found the following in our AsyncRequestFutureImpl#receiveGlobalFailure:

{code}
  private void receiveGlobalFailure(
      //....
      updateCachedLocations(server, regionName, row,
        ClientExceptionsUtil.isMetaClearingException(t) ? null : t);
      //....
  }
{code}

The isMetaClearingException method does not take SocketTimeoutException into account:

{code}
  public static boolean isMetaClearingException(Throwable cur) {
    cur = findException(cur);

    if (cur == null) {
      return true;
    }
    return !isSpecialException(cur) || (cur instanceof RegionMovedException)
        || cur instanceof NotServingRegionException;
  }

  public static boolean isSpecialException(Throwable cur) {
    return (cur instanceof RegionMovedException || cur instanceof RegionOpeningException
        || cur instanceof RegionTooBusyException || cur instanceof RpcThrottlingException
        || cur instanceof MultiActionResultTooLarge || cur instanceof RetryImmediatelyException
        || cur instanceof CallQueueTooBigException || cur instanceof CallDroppedException
        || cur instanceof NotServingRegionException || cur instanceof RequestTooBigException);
  }
{code}
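
In the stack trace above, the SocketTimeoutException appears as a "Caused by" entry, so any check for it has to walk the cause chain. The following is only a minimal standalone sketch of such a check using plain JDK classes. The class TimeoutCauseCheck and the helper hasSocketTimeout are hypothetical names made up for illustration. They are not part of HBase and are not meant as the fix for this issue.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

public class TimeoutCauseCheck {

  // Hypothetical helper (not HBase API): returns true if t or any of its
  // causes is a java.net.SocketTimeoutException.
  static boolean hasSocketTimeout(Throwable t) {
    // Track visited throwables by identity so a cyclic cause chain cannot loop forever.
    Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
    for (Throwable cur = t; cur != null && seen.add(cur); cur = cur.getCause()) {
      if (cur instanceof SocketTimeoutException) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // A timeout wrapped in another exception, similar in shape to the trace above.
    Throwable wrapped = new IOException("call failed",
        new SocketTimeoutException("15000 millis timeout while waiting for channel to be ready for read"));
    System.out.println(hasSocketTimeout(wrapped));                      // prints "true"
    System.out.println(hasSocketTimeout(new IOException("unrelated"))); // prints "false"
  }
}
```

Whether a timeout found this way should lead to clearing the cached location is a separate decision that this sketch does not address.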


