Posted to issues@hbase.apache.org by "Zheng Hu (JIRA)" <ji...@apache.org> on 2019/05/08 12:08:00 UTC

[jira] [Updated] (HBASE-22381) The write request won't refresh its HConnection's local meta cache once a RegionServer gets stuck

     [ https://issues.apache.org/jira/browse/HBASE-22381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Hu updated HBASE-22381:
-----------------------------
    Description: 
In a production environment (provided by [~xinxin fan] from Netease, HBase version 1.2.6), we found the following case:
1. A RegionServer got stuck;
2. All requests were write requests, and they threw an exception like this:
{code}
Caused by: java.net.SocketTimeoutException: 15000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.130.88.181:59049 remote=hbase699.hz.163.org/10.120.192.76:60020]
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
	at java.io.FilterInputStream.read(FilterInputStream.java:133)
	at java.io.FilterInputStream.read(FilterInputStream.java:133)
	at org.apache.hadoop.hbase.ipc.RpcClient$Connection$PingInputStream.read(RpcClient.java:558)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
	at java.io.DataInputStream.readInt(DataInputStream.java:387)
	at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1076)
	at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:727)
{code}
3. Write requests to the stuck RegionServer never cleared the client's local meta cache, so they kept being sent to the stuck server endlessly, which kept availability below 100% for a long time (a minimal sketch of this failure mode follows below).
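
To make point 3 concrete, here is a minimal, self-contained sketch of the failure mode. All names are hypothetical; this is not the HBase client code, just an illustration of why a never-invalidated location cache keeps routing writes to the dead server.

{code}
import java.util.HashMap;
import java.util.Map;

// Minimal illustration of point 3 above -- hypothetical names, NOT HBase client code.
public class StaleLocationCacheDemo {
  // Per-connection cache: row key -> region server address.
  private final Map<String, String> locationCache = new HashMap<>();

  String locate(String row) {
    // Once a location is cached, it is returned forever unless something clears it.
    return locationCache.computeIfAbsent(row, this::lookupInMetaTable);
  }

  void write(String row) {
    String server = locate(row);
    try {
      sendRpc(server, row);
    } catch (RuntimeException timeout) {
      // If the timeout is not classified as "meta clearing", this entry survives
      // and every retry goes straight back to the stuck server:
      // locationCache.remove(row);   // <- the invalidation that never happens
      throw timeout;
    }
  }

  private String lookupInMetaTable(String row) {
    return "stuck-rs.example.org:60020";   // stale location pointing at the stuck RegionServer
  }

  private void sendRpc(String server, String row) {
    throw new RuntimeException("SocketTimeoutException: 15000 millis timeout");
  }
}
{code}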

I checked the code and found the following in AsyncRequestFutureImpl#receiveGlobalFailure: 

{code}
  private void receiveGlobalFailure(...) {
    // ...
    updateCachedLocations(server, regionName, row,
        ClientExceptionsUtil.isMetaClearingException(t) ? null : t);
    // ...
  }
{code}

isMetaClearingException does not take SocketTimeoutException into account, so the client keeps sending requests to the stuck server. 

{code}
  public static boolean isMetaClearingException(Throwable cur) {
    cur = findException(cur);

    if (cur == null) {
      return true;
    }
    return !isSpecialException(cur) || (cur instanceof RegionMovedException)
        || cur instanceof NotServingRegionException;
  }

  public static boolean isSpecialException(Throwable cur) {
    return (cur instanceof RegionMovedException || cur instanceof RegionOpeningException
        || cur instanceof RegionTooBusyException || cur instanceof RpcThrottlingException
        || cur instanceof MultiActionResultTooLarge || cur instanceof RetryImmediatelyException
        || cur instanceof CallQueueTooBigException || cur instanceof CallDroppedException
        || cur instanceof NotServingRegionException || cur instanceof RequestTooBigException);
  }
{code}

But I'm afraid that if we put SocketTimeoutException into the isSpecialException set, we will increase the pressure on the meta table: there are other cases where we may hit a SocketTimeoutException without any region having moved, and if we clear the cache in those cases, more requests will be directed to the meta table. 
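
For concreteness, the option weighed above would amount to roughly the following change. This is an illustrative sketch only, not a proposed patch, and it deliberately leaves open the classification question (whether a socket timeout should then clear the cached location) that this issue is about:

{code}
  // Illustrative sketch only (NOT a proposed patch): the change discussed above,
  // i.e. adding SocketTimeoutException to the "special" set. Whether a socket
  // timeout should then clear the cached location (and send more lookups to meta)
  // is exactly the trade-off described in this issue.
  public static boolean isSpecialException(Throwable cur) {
    return (cur instanceof RegionMovedException || cur instanceof RegionOpeningException
        || cur instanceof RegionTooBusyException || cur instanceof RpcThrottlingException
        || cur instanceof MultiActionResultTooLarge || cur instanceof RetryImmediatelyException
        || cur instanceof CallQueueTooBigException || cur instanceof CallDroppedException
        || cur instanceof NotServingRegionException || cur instanceof RequestTooBigException
        || cur instanceof java.net.SocketTimeoutException);   // hypothetical addition
  }
{code}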



> The write request won't refresh its HConnection's local meta cache once a RegionServer gets stuck
> -------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-22381
>                 URL: https://issues.apache.org/jira/browse/HBASE-22381
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Zheng Hu
>            Assignee: Zheng Hu
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)