You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hbase.apache.org by Zhou Shuaifeng <zh...@huawei.com> on 2011/02/01 02:43:21 UTC

Re: IPC Server Responder out put error causing RegionServer down

Thank you very much.
I will check more about the cause according your suggestion. Can you explain
more about 'GC pause'?
Because the network wardship of my company, I'm not able to post large
amount of logs to pastebin(about 15KB logs in 30 minutes). 
But thanks all the same, if I have some way to show more logs, I will inform
you.

Zhou Shuaifeng(Frank)
HUAWEI TECHNOLOGIES CO.,LTD. 

----------------------------------------------------------------------------
---------------------------------------------------------
This e-mail and its attachments contain confidential information from
HUAWEI, which 
is intended only for the person or entity whose address is listed above. Any
use of the 
information contained herein in any way (including, but not limited to,
total or partial 
disclosure, reproduction, or dissemination) by persons other than the
intended 
recipient(s) is prohibited. If you receive this e-mail in error, please
notify the sender by 
phone or email immediately and delete it!

-----邮件原件-----
发件人: jdcryans@gmail.com [mailto:jdcryans@gmail.com] 代表 Jean-Daniel
Cryans
发送时间: 2011年2月1日 2:23
收件人: user@hbase.apache.org
抄送: yanlijun@huawei.com; wei.zhi@huawei.com; dev@hbase.apache.org;
zhoushuaifeng@huawei.com
主题: Re: IPC Server Responder out put error causing RegionServer down

(moving to the user mailing list, where it belongs)

My educated guess is that you had a GC pause that lasted for more than a
minute while a file was being written to. Even if the write wasn't
happening, your region server would have committed suicide anyways since it
was probably past it's lease timeout. In order to confirm that, we would
need to see the logs from a few minutes before the exception happened. Also,
please use pastebin.com (or similar) to store the logs.

Regarding the NPE, it's definitely not something that should happen but the
region server was already on it's way out.

J-D

2011/1/31 Zhou Shuaifeng <zh...@huawei.com>

> 2011-01-28 17:13:47,633
>
> java.net.*SocketTimeoutException*: Call to
c3s4.site/167.6.5.13:60020failed on
> *socket timeout exception*: java.net.SocketTimeoutException: 60000 millis
> timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected
local=/167.6.5.14:*50735*remote=c3s4.site/
> 167.6.5.13:60020]
>
> 2011-01-28 17:13:47,634 INFO
org.apache.hadoop.hbase.regionserver.*HRegionServer:
> STOPPED: Failed open of daughter*
>
>
>
> According this “60000 millis timeout”, I checked the log at 17:12:47,
> there is a null point exception and ClosedChannelException, is there any
> relation?:
>
>
>
> 2011-01-28 17:12:47,029 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> Responder, call getRegionInfo([B@3a3e8cfb) from 167.6.5.17:51035: output
> error
>
> 2011-01-28 17:12:47,029 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> Responder, call getRegionInfo([B@7a0a4ce5) from 167.6.5.14:50307: output
> error
>
> 2011-01-28 17:12:47,029 WARN org.apache.hadoop.ipc.HBaseServer: PRI IPC
> Server handler 2 on 60020 caught: java.nio.channels.ClosedChannelException
>
>                 at
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
>
>                 at
> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>
>                 at
>
org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1336)
>
>                 at
>
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServe
r.java:724)
>
>                 at
>
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java
:789)
>
>                 at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1080)
>
>
>
> 2011-01-28 17:12:47,029 WARN org.apache.hadoop.ipc.HBaseServer: PRI IPC
> Server handler 4 on 60020 caught: java.nio.channels.ClosedChannelException
>
>                 at
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
>
>                 at
> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>
>                 at
>
org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1336)
>
>                 at
>
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServe
r.java:724)
>
>                 at
>
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java
:789)
>
>                 at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1080)
>
>
>
> 2011-01-28 17:12:47,029 WARN org.apache.hadoop.ipc.HBaseServer: PRI IPC
> Server handler 5 on 60020 caught: java.nio.channels.ClosedChannelException
>
>                 at
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
>
>                 at
> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>
>                 at
>
org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1336)
>
>                 at
>
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServe
r.java:724)
>
>                 at
>
org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java
:789)
>
>                 at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1080)
>
>
>
> 2011-01-28 17:12:47,097 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil:
> regionserver:60020-0x22dcbbf01b40006 Retrieved 78 byte(s) of data from
znode
> /hbase/unassigned/1489005343;
> data=region=ufdr_c,201101258613813763326#174006,1295973611272,
> server=c3s1.site:60000, state=M_ZK_REGION_OFFLINE
>
> 2011-01-28 17:12:47,097 ERROR
> org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while
> processing event M_RS_OPEN_REGION
>
> java.lang.NullPointerException
>
>                 at
> org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
>
>                 at
>
org.apache.hadoop.hbase.executor.RegionTransitionData.fromBytes(RegionTransi
tionData.java:198)
>
>                 at
>
org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:672)
>
>                 at
> org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNodeOpening(ZKAssign.
java:552)
>
>                 at
> org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNodeOpening(ZKAssign.
java:545)
>
>                 at
>
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.transitionZoo
keeperOfflineToOpening(OpenRegionHandler.java:296)
>
>                 at
>
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenR
egionHandler.java:90)
>
>                 at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151)
>
>                 at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.
java:886)
>
>                 at
>
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:9
08)
>
>                 at java.lang.Thread.run(Thread.java:662)
>
>
>
>
>
> Zhou Shuaifeng(Frank)
> HUAWEI TECHNOLOGIES CO.,LTD. [image: huawei_logo]
>
>
>
>
>
----------------------------------------------------------------------------
---------------------------------------------------------
> This e-mail and its attachments contain confidential information from
> HUAWEI, which
> is intended only for the person or entity whose address is listed above.
> Any use of the
> information contained herein in any way (including, but not limited to,
> total or partial
> disclosure, reproduction, or dissemination) by persons other than the
> intended
> recipient(s) is prohibited. If you receive this e-mail in error, please
> notify the sender by
> phone or email immediately and delete it!
>
> *发件人**:* Zhou Shuaifeng [mailto:zhoushuaifeng@huawei.com]
> *发送时间:* 2011年1月31日 17:44
> *收件人:* dev@hbase.apache.org
> *抄送:* yanlijun@huawei.com; wei.zhi@huawei.com
> *主题:* IPC Server Responder out put error causing RegionServer down
>
>
>
> Hi,
>
>
>
> when I update my hbase from 0.20.6 to 0.90, some error occur:
>
> Below is logs on 167.6.5.13, because Meta table is served by this server,
> so other regionservers want to connect this server to getRegionInfo, but
got
> some error:
>
>
>
> 2011-01-28 17:14:18,291 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> Responder, call getRegionInfo([B@15e43aec) from 167.6.5.11:34887: output
> error
>
> 2011-01-28 17:14:18,291 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> Responder, call getRegionInfo([B@3ad8672e) from 167.6.5.14:50735: output
> error
>
>
>
> This caused the server on 11 and 14  socket timeout and shutdown:
>
>
>
> 2011-01-28 17:13:47,633 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=c3s5.site,60020,1296203449919, load=(requests=0, regions=2710,
> usedHeap=4622, maxHeap=8165): *Failed open of
daughter*ufdr_c,201101268613812388592#085317,1296205930314.cd36b19931bd2b2ad
cca176a88dbd6e5.
>
> java.net.*SocketTimeoutException*: Call to
c3s4.site/167.6.5.13:60020failed on
> *socket timeout exception*: java.net.SocketTimeoutException: 60000 millis
> timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected
local=/167.6.5.14:*50735*remote=c3s4.site/
> 167.6.5.13:60020]
>
> 2011-01-28 17:13:47,634 INFO
org.apache.hadoop.hbase.regionserver.*HRegionServer:
> STOPPED: Failed open of daughter*
>
> * *
>
> *Now, my question is what**’**s the cause of **“*output error” ?
>
>
>
> Before the “output error” warning, there is an EOFException, is there
any
> relationship with this error?
>
>
>
> 2011-01-28 17:13:43,413 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> listener on 60020: readAndProcess threw exception java.io.EOFException.
> Count of bytes read: 0
>
> java.io.EOFException
>
>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>
>        at java.io.DataInputStream.readUTF(DataInputStream.java:592)
>
>        at java.io.DataInputStream.readUTF(DataInputStream.java:547)
>
>        at
>
org.apache.hadoop.hbase.ipc.HBaseRPC$Invocation.readFields(HBaseRPC.java:122
)
>
>        at
>
org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.j
ava:975)
>
>        at
>
org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServe
r.java:943)
>
>        at
>
org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:519
)
>
>        at
>
org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java
:313)
>
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.
java:886)
>
>        at
>
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:9
08)
>
>        at java.lang.Thread.run(Thread.java:662)
>
> 2011-01-28 17:13:43,413 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to
open
> region: ufdr_c,201101258613814756901#232839,1295985296267.2025697489
>
> 2011-01-28 17:13:43,418 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil:
> regionserver:60020-0x22dcbbf01b40006 Retrieved 92 byte(s) of data from
znode
> /hbase/unassigned/1489005343;
> data=region=ufdr_c,201101258613813763326#174006,1295973611272,
> server=c3s4.site,60020,1296203450194, state=RS_ZK_REGION_OPENING
>
> 2011-01-28 17:13:43,420 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil:
> regionserver:60020-0x22dcbbf01b40006 Retrieved 92 byte(s) of data from
znode
> /hbase/unassigned/1497320702;
> data=region=ufdr_c,201101268613810041680#085457,1296052066413,
> server=c3s4.site,60020,1296203450194, state=RS_ZK_REGION_OPENING
>
> 2011-01-28 17:13:43,441 DEBUG
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher:
> regionserver:60020-0x22dcbbf01b40006 Received ZooKeeper Event,
> type=NodeDataChanged, state=SyncConnected,
path=/hbase/unassigned/1489005343
>
> 2011-01-28 17:13:43,441 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:60020-0x22dcbbf01b40006 Successfully transitioned node
> 1489005343 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
>
>
>
>
>
>
>
> Zhou Shuaifeng(Frank)
>
>