Posted to user@hbase.apache.org by 李玉明 <li...@gmail.com> on 2014/07/10 10:19:34 UTC

hbase region servers refuse connection

Hi,

  The HBase version is 0.96. We are experiencing a problem where the HBase
region servers refuse connections. Could anyone please help? Thank you in
advance.

  Summary of the problem.

1. Several region servers stop serving requests; their Requests Per Second metric drops to 0.

2. The application client can't connect to the region server. Even a
simple Linux nc command has its connection refused.
    For example: nc 10.207.27.41 8420

3. Even after restarting the HBase cluster, the service does not recover.

4. Log snippet from the application client:

2014-07-10 16:03:51[htable-pool20-t13:2931892] - [INFO] #3541,
table=monitor-data, attempt=702/1 failed 117 ops, last exception:
org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This
server is in the failed servers list:
nz-cloudera1.xxx.com/10.207.27.41:8420 on
nz-cloudera1.xxx.com,8420,1404973262139, tracking started Thu Jul 10
15:16:19 CST 2014, retrying after 4034 ms, replay 117 ops.
2014-07-10 16:03:51[htable-pool27-t45:2931892] - [INFO] #5443,
table=monitor-data, attempt=694/1 failed 3 ops, last exception:
org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This
server is in the failed servers list:
nz-cloudera67.xxx.com/10.208.244.41:8420 on
nz-cloudera67.xxx.com,8420,1404973261815, tracking started Thu Jul 10
15:17:02 CST 2014, retrying after 4028 ms, replay 3 ops.

5. Log snippet from the region server: it runs periodicFlusher and
compactions over and over.

2014-07-10 15:26:36,508 INFO
org.apache.hadoop.hbase.regionserver.HStore: Completed major
compaction of 6 file(s) in t of
monitor-data,ig\x01RdB\xCD\x1CS\xAA\xF2\x00,1403772336393.73dba0c6346574f324d79e976db64def.
into 493864bb386042099e1ef6be1b9770b2(size=4.4 G), total size for
store is 4.4 G. This selection was in queue for 0sec, and took 5mins,
7sec to execute.
2014-07-10 15:26:36,508 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Completed
compaction: Request =
regionName=monitor-data,ig\x01RdB\xCD\x1CS\xAA\xF2\x00,1403772336393.73dba0c6346574f324d79e976db64def.,
storeName=t, fileCount=6, fileSize=4.4 G, priority=44,
time=1397522352382230; duration=5mins, 7sec
2014-07-10 15:26:36,509 INFO
org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction on t
in region monitor-data,\x12\xDB\x13?dB\xCD
S\xAA\xF2\x00,1403772319221.8a56fad579ebdefc2cb6622d715fe7a6.
2014-07-10 15:26:36,509 INFO
org.apache.hadoop.hbase.regionserver.HStore: Starting compaction of 5
file(s) in t of monitor-data,\x12\xDB\x13?dB\xCD
S\xAA\xF2\x00,1403772319221.8a56fad579ebdefc2cb6622d715fe7a6. into
tmpdir=hdfs://nz-cloudera-namenode.xxx.com:8020/hbase/data/default/monitor-data/8a56fad579ebdefc2cb6622d715fe7a6/.tmp,
totalSize=4.9 G
2014-07-10 15:32:17,359 INFO
org.apache.hadoop.hbase.regionserver.HStore: Completed major
compaction of 5 file(s) in t of monitor-data,\x12\xDB\x13?dB\xCD
S\xAA\xF2\x00,1403772319221.8a56fad579ebdefc2cb6622d715fe7a6. into
3377dcd054244e42825227dc93d94bcb(size=4.9 G), total size for store is
4.9 G. This selection was in queue for 0sec, and took 5mins, 40sec to
execute.
2014-07-10 15:32:17,359 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Completed
compaction: Request = regionName=monitor-data,\x12\xDB\x13?dB\xCD
S\xAA\xF2\x00,1403772319221.8a56fad579ebdefc2cb6622d715fe7a6.,
storeName=t, fileCount=5, fileSize=4.9 G, priority=45,
time=1397829741046386; duration=5mins, 40sec
2014-07-10 15:43:34,272 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer:
regionserver8420.periodicFlusher requesting flush for region
monitor-meta,\x18:\x9B\xFB,1403771550932.f581188d3dd9d6d136888821a374de55.
after a delay of 14966
2014-07-10 15:43:44,272 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer:
regionserver8420.periodicFlusher requesting flush for region
monitor-meta,\x18:\x9B\xFB,1403771550932.f581188d3dd9d6d136888821a374de55.
after a delay of 16416
2014-07-10 15:43:49,240 WARN
org.apache.hadoop.hbase.regionserver.wal.FSHLog: Couldn't find oldest
seqNum for the region we are about to flush:
[f581188d3dd9d6d136888821a374de55]
2014-07-10 15:43:49,629 INFO
org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher: Flushed,
sequenceid=63172, memsize=37.0 M, hasBloomFilter=true, into tmp file
hdfs://nz-cloudera-namenode.xxx.com:8020/hbase/data/default/monitor-meta/f581188d3dd9d6d136888821a374de55/.tmp/5d07d2aa3dba4e85ab48326846a90ba9
2014-07-10 15:43:49,639 INFO
org.apache.hadoop.hbase.regionserver.HStore: Added
hdfs://nz-cloudera-namenode.xxx.com:8020/hbase/data/default/monitor-meta/f581188d3dd9d6d136888821a374de55/t/5d07d2aa3dba4e85ab48326846a90ba9,
entries=158209, sequenceid=63172, filesize=3.4 M
2014-07-10 15:43:49,640 INFO
org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
of ~46.0 M/48218512, currentsize=0/0 for region
monitor-meta,\x18:\x9B\xFB,1403771550932.f581188d3dd9d6d136888821a374de55.
in 399ms, sequenceid=63172, compaction requested=false
2014-07-10 15:45:14,273 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer:
regionserver8420.periodicFlusher requesting flush for region
monitor-meta,,?\xBC\xA9,1403771552560.0c9e4bd771a895e2ebe2c146809c82ce.
after a delay of 21673
2014-07-10 15:45:24,273 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer:
regionserver8420.periodicFlusher requesting flush for region
monitor-meta,,?\xBC\xA9,1403771552560.0c9e4bd771a895e2ebe2c146809c82ce.
after a delay of 10915
2014-07-10 15:45:34,273 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer:
regionserver8420.periodicFlusher requesting flush for region
monitor-meta,,?\xBC\xA9,1403771552560.0c9e4bd771a895e2ebe2c146809c82ce.
after a delay of 17862
2014-07-10 15:45:35,948 WARN
org.apache.hadoop.hbase.regionserver.wal.FSHLog: Couldn't find oldest
seqNum for the region we are about to flush:
[0c9e4bd771a895e2ebe2c146809c82ce]
2014-07-10 15:45:36,142 INFO
org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher: Flushed,
sequenceid=63173, memsize=18.2 M, hasBloomFilter=true, into tmp file
hdfs://nz-cloudera-namenode.xxx.com:8020/hbase/data/default/monitor-meta/0c9e4bd771a895e2ebe2c146809c82ce/.tmp/0f3bba8b09ed41fbb21395fc9ae94e04
2014-07-10 15:45:36,152 INFO
org.apache.hadoop.hbase.regionserver.HStore: Added
hdfs://nz-cloudera-namenode.xxx.com:8020/hbase/data/default/monitor-meta/0c9e4bd771a895e2ebe2c146809c82ce/t/0f3bba8b09ed41fbb21395fc9ae94e04,
entries=78724, sequenceid=63173, filesize=1.7 M
2014-07-10 15:45:36,152 INFO
org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
of ~22.5 M/23614384, currentsize=0/0 for region
monitor-meta,,?\xBC\xA9,1403771552560.0c9e4bd771a895e2ebe2c146809c82ce.
in 204ms, sequenceid=63173, compaction requested=false

-- 
Vito Li

Re: hbase region servers refuse connection

Posted by vito <li...@gmail.com>.
We observed that one CPU core was at nearly 100%.

Here is also some log from the client:

2014-07-14 01:11:11[htable-pool20-t21776:224529968] - [INFO] #2242,
table=monitor-data, attempt=10/1 failed 79 ops, last exception:
java.net.SocketTimeoutException: Call to
hz01-noah-cloudera32.hz01.xxxx.com/10.211.135.19:8420 failed because
java.net.SocketTimeoutException: 60000 millis timeout while waiting for
channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/10.210.87.44:17831
remote=hz01-noah-cloudera32.hz01.xxxx.com/10.211.135.19:8420] on
hz01-noah-cloudera32.hz01.xxxx.com,8420,1405055027387, tracking started Mon
Jul 14 01:03:48 CST 2014, retrying after 2005 ms, replay 79 ops.
2014-07-14 01:11:11[htable-pool21-t22740:224529969] - [INFO] #2421,
table=monitor-data, attempt=10/1 failed 112 ops, last exception:
java.net.SocketTimeoutException: Call to
hz01-noah-cloudera32.hz01.xxxx.com/10.211.135.19:8420 failed because
java.net.SocketTimeoutException: 60000 millis timeout while waiting for
channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/10.210.87.44:17831
remote=hz01-noah-cloudera32.hz01.xxxx.com/10.211.135.19:8420] on
hz01-noah-cloudera32.hz01.xxxx.com,8420,1405055027387, tracking started Mon
Jul 14 01:03:35 CST 2014, retrying after 2007 ms, replay 112 ops.
2014-07-14 01:11:11[htable-pool19-t21540:224529974] - [INFO] #2241,
table=monitor-data, attempt=10/1 failed 82 ops, last exception:
java.net.SocketTimeoutException: Call to
hz01-noah-cloudera32.hz01.xxxx.com/10.211.135.19:8420 failed because
java.net.SocketTimeoutException: 60000 millis timeout while waiting for
channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/10.210.87.44:17831
remote=hz01-noah-cloudera32.hz01.xxxx.com/10.211.135.19:8420] on
hz01-noah-cloudera32.hz01.xxxx.com,8420,1405055027387, tracking started Mon
Jul 14 01:03:36 CST 2014, retrying after 2002 ms, replay 82 ops.
2014-07-14 01:11:11[htable-pool30-t23007:224529976] - [INFO] #5324,
table=monitor-data, attempt=10/1 failed 2 ops, last exception:
java.net.SocketTimeoutException: Call to
hz01-noah-cloudera32.hz01.xxxx.com/10.211.135.19:8420 failed because
java.net.SocketTimeoutException: 60000 millis timeout while waiting for
channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/10.210.87.44:17831
remote=hz01-noah-cloudera32.hz01.xxxx.com/10.211.135.19:8420] on
hz01-noah-cloudera32.hz01.xxxx.com,8420,1405055027387, tracking started Mon
Jul 14 01:03:09 CST 2014, retrying after 2015 ms, replay 2 ops.



--
View this message in context: http://apache-hbase.679495.n3.nabble.com/hbase-region-servers-refuse-connection-tp4061278p4061294.html
Sent from the HBase User mailing list archive at Nabble.com.

Re: hbase region servers refuse connection

Posted by vito <li...@gmail.com>.
Hi Qiang, 

   We're not sure whether HBASE-11306 affects 0.94. We have gone back to 0.94.




--
View this message in context: http://apache-hbase.679495.n3.nabble.com/hbase-region-servers-refuse-connection-tp4061278p4061364.html
Sent from the HBase User mailing list archive at Nabble.com.

Re: hbase region servers refuse connection

Posted by Qiang Tian <ti...@gmail.com>.
Hi YuMing,

It looks like 0.94 is immune to this problem: if a read returns 0 bytes, readAndProcess just returns.

readAndProcess:
        count = channelRead(channel, data);   // non-blocking read; may read 0 bytes

        if (data.remaining() == 0) {          // the expected bytes have all arrived
          dataLengthBuffer.clear();
          data.flip();
          if (headerRead) {
            processData(data.array());
            data = null;
            return count;
          }
          processHeader();
          headerRead = true;
          data = null;
          continue;
        }
        return count;   // partial read (possibly 0 bytes): give control back to the selector
      }
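
To make the contrast concrete, here is a minimal Java sketch (illustration
only, not HBase source; the class and method names are invented). The first
style returns to the selector when a non-blocking read yields 0 bytes; the
second keeps looping until the buffer fills, which is what can pin one core
at 100% and occupy a reader thread when a client stalls mid-request:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.ReadableByteChannel;

    /** Illustrative sketch only; not taken from the HBase source tree. */
    public final class ReadStyleSketch {

      /** Non-blocking style: read whatever is available and hand control back
       *  to the selector; a return value of 0 simply means "try again on the
       *  next OP_READ event". */
      static int readOnce(ReadableByteChannel ch, ByteBuffer buf) throws IOException {
        return ch.read(buf);
      }

      /** Blocking-IO style applied to a non-blocking channel: if the peer
       *  stalls mid-request, read() keeps returning 0, so this loop spins,
       *  burning a core and occupying the reader thread, which in turn can
       *  keep the listener from registering new connections. */
      static int readFully(ReadableByteChannel ch, ByteBuffer buf) throws IOException {
        int total = 0;
        while (buf.hasRemaining()) {
          int n = ch.read(buf);
          if (n < 0) {
            return total == 0 ? -1 : total;   // end of stream
          }
          total += n;                         // n == 0 just goes around again
        }
        return total;
      }
    }

As noted later in this thread, HBASE-11277 brings the server back to the
first, fully non-blocking style.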



On Tue, Jul 15, 2014 at 11:06 AM, Qiang Tian <ti...@gmail.com> wrote:

> Hi, below is more details.
>
> the read0 stacktrace you see means reader wants to read something from
> client RPC call. in Andrew's test it shows it is in reading RPC request
> data (reasonable. since other meta data size is quite small).  although
> client follows request-receive style, when multiple clients share the
> connection(the default case), the synchronization window when writing to
> the same channel is quite small. if those request data have big size, there
> might be a sudden rush to the transportation layer..might causing RPC
> server could not receive the data in time due to congestion control,
> without hbase11277, the reader get 0 byte read again and again...
>
> with hbase11277 the problem should be resolved - we get back to complete
> non-blocking IO, but it is still worth investigation non-shared connection
> under high workload(hbase11306).
>
>
>
> On Tue, Jul 15, 2014 at 10:44 AM, Rural Hunter <ru...@gmail.com>
> wrote:
>
>> No. My issue is definitely a server side issue. When the problem happens
>> I'm not able to telnet the data port of RS.
>>
>> On 2014/7/15 10:24, Qiang Tian wrote:
>>
>>  not sure if you could also try using different connections for your
>>> client
>>> program. see the related issue HBASE-11306. I guess sharing connection
>>>  is
>>> a contributing factor for this issue.
>>>
>>>
>>>
>>
>

Re: hbase region servers refuse connection

Posted by Rural Hunter <ru...@gmail.com>.
Just to update on my result: since HBASE-11277 was applied, I have not seen 
any connection problem for a week. Before, the connection problem occurred 
almost every day.

On 2014/7/17 16:58, vito wrote:
> Yes.  *Without* fix of hbase11277, we tried each client with separate
> connection(non-share), but it still does not work.
>
>
>
> --
> View this message in context: http://apache-hbase.679495.n3.nabble.com/hbase-region-servers-refuse-connection-tp4061278p4061578.html
> Sent from the HBase User mailing list archive at Nabble.com.
>


Re: hbase region servers refuse connection

Posted by vito <li...@gmail.com>.
Yes. *Without* the HBASE-11277 fix, we tried giving each client a separate
(non-shared) connection, but it still did not work.



--
View this message in context: http://apache-hbase.679495.n3.nabble.com/hbase-region-servers-refuse-connection-tp4061278p4061578.html
Sent from the HBase User mailing list archive at Nabble.com.

Re: hbase region servers refuse connection

Posted by Qiang Tian <ti...@gmail.com>.
Just to clarify: ignore the patch attached to HBASE-11306. There is no
conclusion yet on the behavior described in HBASE-11306; the shared
connection is just a suspect.
Thanks.



On Tue, Jul 15, 2014 at 1:52 PM, Rural Hunter <ru...@gmail.com> wrote:

> Hi Tian Qiang,
>
> Thanks for the detailed explaination. I have deployed the latest code of
> 0.96 branch with hbase-11277 applied. I will keep monitoring to see if
> there is still problem and the necessarity of hbase-11306.
>
> On 2014/7/15 11:06, Qiang Tian wrote:
>
>  Hi, below is more details.
>>
>> the read0 stacktrace you see means reader wants to read something from
>> client RPC call. in Andrew's test it shows it is in reading RPC request
>> data (reasonable. since other meta data size is quite small).  although
>> client follows request-receive style, when multiple clients share the
>> connection(the default case), the synchronization window when writing to
>> the same channel is quite small. if those request data have big size,
>> there
>> might be a sudden rush to the transportation layer..might causing RPC
>> server could not receive the data in time due to congestion control,
>> without hbase11277, the reader get 0 byte read again and again...
>>
>> with hbase11277 the problem should be resolved - we get back to complete
>> non-blocking IO, but it is still worth investigation non-shared connection
>> under high workload(hbase11306).
>>
>
>

Re: hbase region servers refuse connection

Posted by Rural Hunter <ru...@gmail.com>.
Hi Tian Qiang,

Thanks for the detailed explanation. I have deployed the latest code of the 
0.96 branch with HBASE-11277 applied. I will keep monitoring to see whether 
the problem persists and whether HBASE-11306 is needed.

On 2014/7/15 11:06, Qiang Tian wrote:
> Hi, below is more details.
>
> the read0 stacktrace you see means reader wants to read something from
> client RPC call. in Andrew's test it shows it is in reading RPC request
> data (reasonable. since other meta data size is quite small).  although
> client follows request-receive style, when multiple clients share the
> connection(the default case), the synchronization window when writing to
> the same channel is quite small. if those request data have big size, there
> might be a sudden rush to the transportation layer..might causing RPC
> server could not receive the data in time due to congestion control,
> without hbase11277, the reader get 0 byte read again and again...
>
> with hbase11277 the problem should be resolved - we get back to complete
> non-blocking IO, but it is still worth investigation non-shared connection
> under high workload(hbase11306).


Re: hbase region servers refuse connection

Posted by Qiang Tian <ti...@gmail.com>.
Hi, here are more details.

The read0 stack trace you see means the reader wants to read something from
a client RPC call; in Andrew's test it shows up while reading RPC request
data (reasonable, since the other metadata is quite small). Although the
client follows a request-response style, when multiple client threads share
the connection (the default case), the synchronization window when writing
to the same channel is quite small. If those requests are large, there can
be a sudden rush to the transport layer, and congestion control may keep the
RPC server from receiving the data in time; without HBASE-11277, the reader
then gets a 0-byte read again and again.

With HBASE-11277 the problem should be resolved - we get back to completely
non-blocking IO - but non-shared connections under high workload
(HBASE-11306) are still worth investigating.



On Tue, Jul 15, 2014 at 10:44 AM, Rural Hunter <ru...@gmail.com>
wrote:

> No. My issue is definitely a server side issue. When the problem happens
> I'm not able to telnet the data port of RS.
>
> On 2014/7/15 10:24, Qiang Tian wrote:
>
>  not sure if you could also try using different connections for your client
>> program. see the related issue HBASE-11306. I guess sharing connection  is
>> a contributing factor for this issue.
>>
>>
>>
>

Re: hbase region servers refuse connection

Posted by Rural Hunter <ru...@gmail.com>.
No. My issue is definitely a server side issue. When the problem happens 
I'm not able to telnet the data port of RS.

On 2014/7/15 10:24, Qiang Tian wrote:
> not sure if you could also try using different connections for your client
> program. see the related issue HBASE-11306. I guess sharing connection  is
> a contributing factor for this issue.
>
>


Re: hbase region servers refuse connection

Posted by Qiang Tian <ti...@gmail.com>.
Not sure if you could also try using different connections for your client
program; see the related issue HBASE-11306. I suspect connection sharing is
a contributing factor for this issue.
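
As a rough illustration of what "different connections" can look like, here
is a hedged sketch against the 0.96-era client API. It assumes
HConnectionManager.createConnection and HConnection#getTable are available in
the client version in use, and the row key, qualifier, and value are only
placeholders (the table name and column family "t" are taken from the logs
above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HConnection;
    import org.apache.hadoop.hbase.client.HConnectionManager;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SeparateConnectionSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // createConnection() gives this worker its own connection instead of
        // the process-wide shared one that HTable(conf, table) would reuse.
        HConnection connection = HConnectionManager.createConnection(conf);
        try {
          HTableInterface table = connection.getTable("monitor-data");
          try {
            Put put = new Put(Bytes.toBytes("example-row"));        // placeholder row key
            put.add(Bytes.toBytes("t"), Bytes.toBytes("q"),         // placeholder qualifier
                    Bytes.toBytes("example-value"));                // placeholder value
            table.put(put);
          } finally {
            table.close();
          }
        } finally {
          connection.close();   // one close() per createConnection() call
        }
      }
    }

As reported elsewhere in this thread, separate connections alone did not help
without the HBASE-11277 fix applied.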




On Mon, Jul 14, 2014 at 7:19 PM, Rural Hunter <ru...@gmail.com> wrote:

> Hi Tian Qiang,
>
> Good to hear that. It seems the jira is fixed. I will seek time to build
> the latest 0.96 branch and test it.
>
> On 2014/7/14 18:06, Qiang Tian wrote:
>
>> Hi YuMing, :)
>> yes. several iterations of jstack on the problem regionserver could help
>> identify the problem
>>
>> Rural,
>> you probably hit hbase11277(and probably YuMin as well) - the reader 14
>> loops again and again in
>> below stack(high cpu usage) and listener 12 is blocked and cannot
>> accept new connections.
>>
>>
>>
>>     1. Thread 12 (RpcServer.listener,port=60020):
>>     2.   State: BLOCKED
>>     3.   Blocked count: 123264191
>>     4.   Waited count: 0
>>     5.   Blocked on
>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader@77f87716
>>     6.   Blocked by 14 (RpcServer.reader=1,port=60020)
>>     7.   Stack:
>>     8.
>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.
>> registerChannel(RpcServer.java:598)
>>     9.
>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.doAccept(
>> RpcServer.java:755)
>>     10.
>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(
>> RpcServer.java:673)
>>     11. Thread 24 (RpcServer.responder):
>>
>>
>>
>>     1. Thread 14 (RpcServer.reader=1,port=60020):
>>     2.   State: RUNNABLE
>>     3.   Blocked count: 12510492
>>     4.   Waited count: 12826560
>>     5.   Stack:
>>     6.     sun.nio.ch.FileDispatcher.read0(Native Method)
>>     7.     sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>     8.     sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>>     9.     sun.nio.ch.IOUtil.read(IOUtil.java:224)
>>     10.     sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>>     11.
>>     org.apache.hadoop.hbase.ipc.RpcServer.channelIO(RpcServer.java:2438)
>>     12.
>>     org.apache.hadoop.hbase.ipc.RpcServer.channelRead(
>> RpcServer.java:2404)
>>     13.
>>     org.apache.hadoop.hbase.ipc.RpcServer$Connection.
>> readAndProcess(RpcServer.java:1498)
>>     14.
>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(
>> RpcServer.java:780)
>>     15.
>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.
>> doRunLoop(RpcServer.java:568)
>>     16.
>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(
>> RpcServer.java:543)
>>     17.
>>     java.util.concurrent.ThreadPoolExecutor.runWorker(
>> ThreadPoolExecutor.java:1146)
>>     18.
>>     java.util.concurrent.ThreadPoolExecutor$Worker.run(
>> ThreadPoolExecutor.java:615)
>>     19.     java.lang.Thread.run(Thread.java:701)
>>     20. Thread 13 (RpcServer.reader=0,port=60020):
>>     21.
>>
>>
>>
>>     1. 2014-07-10 14:13:49,614 WARN  [RpcServer.reader=7,port=60020]
>>
>>     ipc.RpcServer: RpcServer.listener,port=60020: count of bytes read: 0
>>     2. java.io.IOException: Connection reset by peer
>>     3.         at sun.nio.ch.FileDispatcher.read0(Native Method)
>>     4.         at sun.nio.ch.SocketDispatcher.
>> read(SocketDispatcher.java:39)
>>     5.         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>>     6.         at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>>     7.         at
>>     sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>>     8.         at
>>     org.apache.hadoop.hbase.ipc.RpcServer.channelRead(
>> RpcServer.java:2404)
>>     9.         at
>>     org.apache.hadoop.hbase.ipc.RpcServer$Connection.
>> readAndProcess(RpcServer.java:1425)
>>     10.         at
>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(
>> RpcServer.java:780)
>>     11.         at
>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.
>> doRunLoop(RpcServer.java:568)
>>     12.         at
>>     org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(
>> RpcServer.java:543)
>>     13.         at
>>     java.util.concurrent.ThreadPoolExecutor.runWorker(
>> ThreadPoolExecutor.java:1146)
>>     14.         at
>>     java.util.concurrent.ThreadPoolExecutor$Worker.run(
>> ThreadPoolExecutor.java:615)
>>     15.         at java.lang.Thread.run(Thread.java:701)
>>
>>
>>
>>
>

Re: hbase region servers refuse connection

Posted by Rural Hunter <ru...@gmail.com>.
Hi Tian Qiang,

Good to hear that. It seems the JIRA is fixed. I will find time to build 
the latest 0.96 branch and test it.

On 2014/7/14 18:06, Qiang Tian wrote:
> Hi YuMing, :)
> yes. several iterations of jstack on the problem regionserver could help
> identify the problem
>
> Rural,
> you probably hit hbase11277(and probably YuMin as well) - the reader 14
> loops again and again in
> below stack(high cpu usage) and listener 12 is blocked and cannot
> accept new connections.
>
>
>
>     1. Thread 12 (RpcServer.listener,port=60020):
>     2.   State: BLOCKED
>     3.   Blocked count: 123264191
>     4.   Waited count: 0
>     5.   Blocked on
>     org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader@77f87716
>     6.   Blocked by 14 (RpcServer.reader=1,port=60020)
>     7.   Stack:
>     8.
>     org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.registerChannel(RpcServer.java:598)
>     9.
>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.doAccept(RpcServer.java:755)
>     10.
>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:673)
>     11. Thread 24 (RpcServer.responder):
>
>
>
>     1. Thread 14 (RpcServer.reader=1,port=60020):
>     2.   State: RUNNABLE
>     3.   Blocked count: 12510492
>     4.   Waited count: 12826560
>     5.   Stack:
>     6.     sun.nio.ch.FileDispatcher.read0(Native Method)
>     7.     sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     8.     sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>     9.     sun.nio.ch.IOUtil.read(IOUtil.java:224)
>     10.     sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>     11.
>     org.apache.hadoop.hbase.ipc.RpcServer.channelIO(RpcServer.java:2438)
>     12.
>     org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2404)
>     13.
>     org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1498)
>     14.
>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:780)
>     15.
>     org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:568)
>     16.
>     org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:543)
>     17.
>     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>     18.
>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     19.     java.lang.Thread.run(Thread.java:701)
>     20. Thread 13 (RpcServer.reader=0,port=60020):
>     21.
>
>
>
>     1. 2014-07-10 14:13:49,614 WARN  [RpcServer.reader=7,port=60020]
>     ipc.RpcServer: RpcServer.listener,port=60020: count of bytes read: 0
>     2. java.io.IOException: Connection reset by peer
>     3.         at sun.nio.ch.FileDispatcher.read0(Native Method)
>     4.         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     5.         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
>     6.         at sun.nio.ch.IOUtil.read(IOUtil.java:224)
>     7.         at
>     sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
>     8.         at
>     org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2404)
>     9.         at
>     org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1425)
>     10.         at
>     org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:780)
>     11.         at
>     org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:568)
>     12.         at
>     org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:543)
>     13.         at
>     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>     14.         at
>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     15.         at java.lang.Thread.run(Thread.java:701)
>
>
>


Re: hbase region servers refuse connection

Posted by Qiang Tian <ti...@gmail.com>.
Hi YuMing, :)
Yes, several iterations of jstack on the problem region server could help
identify the problem.

Rural,
you probably hit HBASE-11277 (and probably YuMing did as well) - reader
thread 14 loops again and again in the stack below (high CPU usage), and
listener thread 12 is blocked and cannot accept new connections.



Thread 12 (RpcServer.listener,port=60020):
  State: BLOCKED
  Blocked count: 123264191
  Waited count: 0
  Blocked on org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader@77f87716
  Blocked by 14 (RpcServer.reader=1,port=60020)
  Stack:
    org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.registerChannel(RpcServer.java:598)
    org.apache.hadoop.hbase.ipc.RpcServer$Listener.doAccept(RpcServer.java:755)
    org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:673)
Thread 24 (RpcServer.responder):

Thread 14 (RpcServer.reader=1,port=60020):
  State: RUNNABLE
  Blocked count: 12510492
  Waited count: 12826560
  Stack:
    sun.nio.ch.FileDispatcher.read0(Native Method)
    sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
    sun.nio.ch.IOUtil.read(IOUtil.java:224)
    sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
    org.apache.hadoop.hbase.ipc.RpcServer.channelIO(RpcServer.java:2438)
    org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2404)
    org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1498)
    org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:780)
    org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:568)
    org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:543)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:701)
Thread 13 (RpcServer.reader=0,port=60020):

2014-07-10 14:13:49,614 WARN  [RpcServer.reader=7,port=60020] ipc.RpcServer: RpcServer.listener,port=60020: count of bytes read: 0
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251)
        at sun.nio.ch.IOUtil.read(IOUtil.java:224)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254)
        at org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2404)
        at org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1425)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:780)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:568)
        at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:543)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:701)



On Mon, Jul 14, 2014 at 9:24 AM, Rural Hunter <ru...@gmail.com> wrote:

> Yes. But you may want to check if there are many connections in SYN_RECV
> state when the problem happens.
>
>
> On 2014/7/14 4:18, vito wrote:
>
>> Hi Rural ,
>>
>>
>>   Do you mean the following action you have taken? Thanks a lot.
>>
>> "Anyway, I just changed these kernel settings:
>> net.core.somaxconn=1024 (original 128)
>> net.ipv4.tcp_synack_retries=2 (original 5) "
>>
>>
>>
>> --
>> View this message in context: http://apache-hbase.679495.n3.
>> nabble.com/hbase-region-servers-refuse-connection-tp4061278p4061293.html
>> Sent from the HBase User mailing list archive at Nabble.com.
>> .
>>
>>
>

Re: hbase region servers refuse connection

Posted by Rural Hunter <ru...@gmail.com>.
Yes. But you may want to check if there are many connections in SYN_RECV 
state when the problem happens.

On 2014/7/14 4:18, vito wrote:
> Hi Rural ,
>
>   Do you mean the following action you have taken? Thanks a lot.
>
> "Anyway, I just changed these kernel settings:
> net.core.somaxconn=1024 (original 128)
> net.ipv4.tcp_synack_retries=2 (original 5) "
>
>
>
> --
> View this message in context: http://apache-hbase.679495.n3.nabble.com/hbase-region-servers-refuse-connection-tp4061278p4061293.html
> Sent from the HBase User mailing list archive at Nabble.com.
> .
>


Re: hbase region servers refuse connection

Posted by Rural Hunter <ru...@gmail.com>.
Yes.

On 2014/7/14 4:18, vito wrote:
> Hi Rural ,
>
>   Do you mean the following action you have taken? Thanks a lot.
>
> "Anyway, I just changed these kernel settings:
> net.core.somaxconn=1024 (original 128)
> net.ipv4.tcp_synack_retries=2 (original 5) "
>
>
>
> --
> View this message in context: http://apache-hbase.679495.n3.nabble.com/hbase-region-servers-refuse-connection-tp4061278p4061293.html
> Sent from the HBase User mailing list archive at Nabble.com.
> .
>


Re: hbase region servers refuse connection

Posted by vito <li...@gmail.com>.
Hi Rural,

 Do you mean the following action you took? Thanks a lot.

"Anyway, I just changed these kernel settings: 
net.core.somaxconn=1024 (original 128) 
net.ipv4.tcp_synack_retries=2 (original 5) "



--
View this message in context: http://apache-hbase.679495.n3.nabble.com/hbase-region-servers-refuse-connection-tp4061278p4061293.html
Sent from the HBase User mailing list archive at Nabble.com.

Re: hbase region servers refuse connection

Posted by Rural Hunter <ru...@gmail.com>.
Maybe you can try my action:
http://apache-hbase.679495.n3.nabble.com/Region-server-not-accept-connections-intermittently-td4061107.html

Before I took the action, I saw the RS connection problem almost every day, 
and it usually lasted about half an hour. From my observation over the 2 days 
since the action was taken, the frequency of the problem has been reduced. 
The big improvement is that even when the problem happens, the RS recovers in 
less than 1 minute, compared with 30-60 minutes before. I guess there is 
still some room to tune the parameters further.

On 2014/7/12 17:39, 李玉明 wrote:
> Can any experts look at this?  Similar posts are :
>
> 1.      http://mail-archives.apache.org/mod_mbox/hbase-user/201406.mbox/%3C001001cf863e$c81a60e0$584f22a0$@com%3E
> 2.      https://issues.apache.org/jira/browse/HBASE-11306  "Client
> connection starvation issues under high load on Amazon EC2"
> 3.      https://issues.apache.org/jira/browse/HBASE-11277
>
>
> Regards,
> --
> Yu Ming
>


Re: hbase region servers refuse connection

Posted by 李玉明 <li...@gmail.com>.
Can any experts look at this? Similar posts are:

1.      http://mail-archives.apache.org/mod_mbox/hbase-user/201406.mbox/%3C001001cf863e$c81a60e0$584f22a0$@com%3E
2.      https://issues.apache.org/jira/browse/HBASE-11306  "Client
connection starvation issues under high load on Amazon EC2"
3.      https://issues.apache.org/jira/browse/HBASE-11277


Regards,
--
Yu Ming
