Posted to issues@hbase.apache.org by "Zheng Hu (JIRA)" <ji...@apache.org> on 2019/05/08 12:04:00 UTC
[jira] [Created] (HBASE-22381) The write request won't refresh its
HConnection's local meta cache once an RegionServer got stuck
Zheng Hu created HBASE-22381:
--------------------------------
Summary: The write request won't refresh its HConnection's local meta cache once an RegionServer got stuck
Key: HBASE-22381
URL: https://issues.apache.org/jira/browse/HBASE-22381
Project: HBase
Issue Type: Bug
Reporter: Zheng Hu
Assignee: Zheng Hu
In a production environment (reported by [~xinxin fan] from Netease, HBase version 1.2.6), we found the following case:
1. A RegionServer got stuck;
2. All requests were write requests, and each threw an exception like this:
{code}
Caused by: java.net.SocketTimeoutException: 15000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.130.88.181:59049 remote=hbase699.hz.163.org/10.120.192.76:60020]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection$PingInputStream.read(RpcClient.java:558)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    at java.io.DataInputStream.readInt(DataInputStream.java:387)
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1076)
    at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:727)
{code}
3. Write requests to the stuck RegionServer never cleared the client's local meta cache, so the client kept sending requests to the stuck server endlessly, which kept availability below 100% for a long time.
I checked the code and found the following in AsyncRequestFutureImpl#receiveGlobalFailure:
{code}
private void receiveGlobalFailure(
    //....
    updateCachedLocations(server, regionName, row,
        ClientExceptionsUtil.isMetaClearingException(t) ? null : t);
    //....
}
{code}
isMetaClearingException does not take SocketTimeoutException into account:
{code}
public static boolean isMetaClearingException(Throwable cur) {
  cur = findException(cur);
  if (cur == null) {
    return true;
  }
  return !isSpecialException(cur) || (cur instanceof RegionMovedException)
      || cur instanceof NotServingRegionException;
}

public static boolean isSpecialException(Throwable cur) {
  return (cur instanceof RegionMovedException || cur instanceof RegionOpeningException
      || cur instanceof RegionTooBusyException || cur instanceof RpcThrottlingException
      || cur instanceof MultiActionResultTooLarge || cur instanceof RetryImmediatelyException
      || cur instanceof CallQueueTooBigException || cur instanceof CallDroppedException
      || cur instanceof NotServingRegionException || cur instanceof RequestTooBigException);
}
{code}
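To make the classification above easier to follow, here is a minimal, self-contained Java sketch of the two predicates and of the ternary used at the call site. It is not HBase code and rests on several assumptions. The nested exception classes are hypothetical stand-ins for a few of the HBase types, and only a subset of the "special" exceptions is modeled. The argument is assumed to be already unwrapped, because findException is not shown in this issue. The helper argumentForUpdate is an invented name that simply mirrors the expression passed to updateCachedLocations.

```java
// Illustrative model only; none of these classes are the real HBase ones.
class MetaClearingSketch {
    // Hypothetical stand-ins for a few of the exception types named in the issue.
    static class RegionMovedException extends Exception {}
    static class NotServingRegionException extends Exception {}
    static class RegionTooBusyException extends Exception {}
    static class CallQueueTooBigException extends Exception {}

    // Subset of the quoted isSpecialException check.
    static boolean isSpecialException(Throwable cur) {
        return cur instanceof RegionMovedException
            || cur instanceof RegionTooBusyException
            || cur instanceof CallQueueTooBigException
            || cur instanceof NotServingRegionException;
    }

    // Same shape as the quoted isMetaClearingException.
    // Assumption: 'cur' is already the result of findException, which is not modeled here.
    static boolean isMetaClearingException(Throwable cur) {
        if (cur == null) {
            return true;
        }
        return !isSpecialException(cur)
            || cur instanceof RegionMovedException
            || cur instanceof NotServingRegionException;
    }

    // Hypothetical helper mirroring the call-site ternary:
    // null is passed when the exception is meta-clearing, the exception itself otherwise.
    static Throwable argumentForUpdate(Throwable t) {
        return isMetaClearingException(t) ? null : t;
    }

    public static void main(String[] args) {
        Throwable[] samples = {
            new RegionMovedException(),
            new NotServingRegionException(),
            new RegionTooBusyException(),
            new CallQueueTooBigException()
        };
        for (Throwable t : samples) {
            System.out.println(t.getClass().getSimpleName()
                + " -> metaClearing=" + isMetaClearingException(t));
        }
    }
}
```

In this model, "server is overloaded" style exceptions such as RegionTooBusyException are passed through unchanged, while RegionMovedException and NotServingRegionException turn into a null argument. How a SocketTimeoutException is classified depends on what findException returns for it, which the quoted snippet does not show, so the sketch deliberately leaves that case out.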