Posted to user@hbase.apache.org by Gaojinchao <ga...@huawei.com> on 2011/08/11 04:05:41 UTC

Root table couldn't be opened

In my cluster (version 0.90.3), the root table couldn't be opened when one region server crashed because of GC.

The logs show:

// Master assigned the root table to 82
2011-07-28 21:34:34,710 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 158-1-101-82,20020,1311885942386

//The host of 82 crashed; the master finished splitting its logs and went to reassign -ROOT- and .META.. But the region server process didn't exit, so the root verification passed.
 I think we shouldn't verify -ROOT- / .META. in shutdown-handler processing (see the sketch after the region server logs below).

2011-07-28 22:19:53,746 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=158-1-101-82,20020,1311885942386 to dead servers, submitted shutdown handler to be executed, root=true, meta=true
2011-07-28 22:25:10,085 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting completed in 316329 ms for hdfs://158.1.101.82:9000/hbase/.logs/158-1-101-82,20020,1311885942386
2011-07-28 22:26:54,790 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-101-82,20020,1311885942386]
2011-07-28 22:27:11,176 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table
java.net.SocketTimeoutException: Call to 158-1-101-82/158.1.101.82:20020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/158.1.101.82:57428 remote=158-1-101-82/158.1.101.82:20020]
         at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802)
         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
         at $Proxy6.getRegionInfo(Unknown Source)
         at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRegionLocation(CatalogTracker.java:426)
         at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:273)
         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:333)
         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366)
         at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:255)
         at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:237)
         at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:139)
         at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:88)
         at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/158.1.101.82:57428 remote=158-1-101-82/158.1.101.82:20020]
         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
         at java.io.FilterInputStream.read(FilterInputStream.java:116)
         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:299)
         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
         at java.io.DataInputStream.readInt(DataInputStream.java:370)
         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:539)
         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:477)
2011-07-28 22:28:30,577 DEBUG org.apache.hadoop.hbase.master.ServerManager: Server REPORT rejected; currently processing 158-1-101-82,20020,1311885942386 as dead server
2011-07-28 22:28:37,591 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x23171e103d10018 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
2011-07-28 22:28:37,704 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for .META.,,1.1028785192 so generated a random one; hri=.META.,,1.1028785192, src=, dest=158-1-101-202,20020,1311878322145; 2 (online=2, exclude=null) available servers
2011-07-28 22:28:37,704 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region .META.,,1.1028785192 to 158-1-101-202,20020,1311878322145
2011-07-28 22:28:37,733 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=158-1-101-82:20000, region=1028785192/.META.
2011-07-28 22:28:37,766 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=158-1-101-202,20020,1311878322145, region=1028785192/.META.



Region server logs:
2011-07-28 22:19:17,389 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for ufdr7,,1311890810267.c1b556627c511076bafbb1b802589cb6. because regionserver20020.cacheFlusher; priority=4, compaction queue size=1

// blocked for a long time.
2011-07-28 22:28:24,829 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 552455ms for sessionid 0x13171e103d7003e, closing socket connection and attempting reconnect
2011-07-28 22:28:24,829 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 552455ms for sessionid 0x23171e103d10020, closing socket connection and attempting reconnect
2011-07-28 22:28:25,186 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server /158.1.101.222:2181
2011-07-28 22:28:25,838 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server /158.1.101.82:2181
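
To make the race concrete, here is a minimal, self-contained Java sketch of the flow described above. All names are illustrative; this is not the actual 0.90.3 ServerShutdownHandler code, and only the verification RPC it stands in for (CatalogTracker.verifyRegionLocation in the stack trace above) is a real method.

import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;

public class RootReassignRaceSketch {

  // Stand-in for the master's dead-servers set.
  static final Set<String> deadServers = new ConcurrentSkipListSet<String>();

  // Stand-in for the verification RPC. In this incident the region server
  // came out of GC just as log splitting finished, so the call succeeded
  // even though the server was already being processed as dead.
  static boolean verifyRootLocation(String server) {
    return true;
  }

  // Simplified dead-server processing: verify first, reassign only on failure.
  static void processDeadServer(String server, boolean carryingRoot) {
    deadServers.add(server);
    // ... split the dead server's HLogs ...
    if (carryingRoot && verifyRootLocation(server)) {
      // The reported problem: verification passed against the GC-paused
      // server, so -ROOT- is never reassigned and becomes unreachable once
      // that server finally shuts itself down.
      System.out.println("-ROOT- verification passed; reassignment skipped");
      return;
    }
    System.out.println("Reassigning -ROOT- to another server");
  }

  public static void main(String[] args) {
    processDeadServer("158-1-101-82,20020,1311885942386", true);
  }
}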


re: Root table couldn't be opened

Posted by Gaojinchao <ga...@huawei.com>.
Yes, I think the shutdown handler should not verify a server that is already in the dead servers set (a sketch of this guard follows after the logs below).

2011-07-28 22:25:09,336 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Waiting for split writer threads to finish
2011-07-28 22:25:09,450 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Split writers finished
2011-07-28 22:25:09,602 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Closed path hdfs://158.1.101.82:9000/hbase/.META./1028785192/recovered.edits/0000000000000025786 (wrote 121 edits in 2567ms)
2011-07-28 22:25:09,860 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Closed path hdfs://158.1.101.82:9000/hbase/ufdr5/745550eb514e441f31ff26dbde8402ae/recovered.edits/0000000000000617740 (wrote 211642 edits in 141887ms)

//Log splitting finished and -ROOT- was about to be assigned first; at the same time, the region server came out of GC, so verifying root passed.
2011-07-28 22:25:10,085 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting completed in 316329 ms for hdfs://158.1.101.82:9000/hbase/.logs/158-1-101-82,20020,1311885942386

// The region server's report is rejected and the region server will shut itself down.
2011-07-28 22:28:30,577 DEBUG org.apache.hadoop.hbase.master.ServerManager: Server REPORT rejected; currently processing 158-1-101-82,20020,1311885942386 as dead server
2011-07-28 22:28:37,591 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x23171e103d10018 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
2011-07-28 22:28:37,704 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for .META.,,1.1028785192 so generated a random one; hri=.META.,,1.1028785192, src=, dest=158-1-101-202,20020,1311878322145; 2 (online=2, exclude=null) available servers
2011-07-28 22:28:37,704 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region .META.,,1.1028785192 to 158-1-101-202,20020,1311878322145
2011-07-28 22:28:37,733 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=158-1-101-82:20000, region=1028785192/.META.

All logs are in the attachment.
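
A sketch of the guard this would add, assuming the handler has access to the master's dead-servers set; the helper below is hypothetical, not HBase code.

import java.util.Set;

public class DeadServerGuardSketch {

  // Stand-in for the verification RPC; a GC-paused server may still answer,
  // as it did here.
  static boolean rpcVerify(String server) {
    return true;
  }

  // Proposed guard: if the recorded -ROOT- holder is already in the
  // dead-servers set, do not trust a successful RPC against it; reassign
  // unconditionally. Fall back to the RPC check only for servers the master
  // still considers live.
  static boolean shouldReassignRoot(String rootHolder, Set<String> deadServers) {
    if (rootHolder == null || deadServers.contains(rootHolder)) {
      return true;
    }
    return !rpcVerify(rootHolder);
  }
}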

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: August 16, 2011 12:33
To: user@hbase.apache.org
Subject: Re: Root table couldn't be opened

On Mon, Aug 15, 2011 at 9:23 PM, Gaojinchao <ga...@huawei.com> wrote:
> Why did the master replay its logs if it did not exit?

Sorry.  Which logs?

> The ZK session expired because of GC, but the region server didn't shut down.
>

Right, but it probably went down soon after it came out of GC, right?


> (I like how you noticed the log message that says 82 has root and meta)
>
> Added=158-1-101-82,20020,1311885942386 to dead servers, submitted shutdown handler to be executed, root=true, meta=true
> It said that 82 had root and meta; "root=true" shows the dead region server was carrying the root table.
>

So, you think there is a bug in our shutdown handler where we are not
doing -ROOT- processing properly?
St.Ack

Re: Root table couldn't be opened

Posted by Stack <st...@duboce.net>.
On Mon, Aug 15, 2011 at 9:23 PM, Gaojinchao <ga...@huawei.com> wrote:
> Why did the master replay its logs if it did not exit?

Sorry.  Which logs?

> The ZK session expired because of GC, but the region server didn't shut down.
>

Right, but it probably went down soon after it came out of GC, right?


> (I like how you noticed the log message that says 82 has root and meta)
>
> Added=158-1-101-82,20020,1311885942386 to dead servers, submitted shutdown handler to be executed, root=true, meta=true
> It said that 82 had root and meta; "root=true" shows the dead region server was carrying the root table.
>

So, you think there is a bug in our shutdown handler where we are not
doing -ROOT- processing properly?
St.Ack

Re: Root table couldn't be opened

Posted by Chris Richard <ch...@gmail.com>.
-----Original Message-----
From: Gaojinchao <ga...@huawei.com>
Date: Tue, 16 Aug 2011 04:23:58 
To: user@hbase.apache.org<us...@hbase.apache.org>
Reply-To: user@hbase.apache.org
Subject: re: Root table couldn't be opened

Why did the master replay its logs if it did not exit?
The ZK session expired because of GC, but the region server didn't shut down.

(I like how you noticed the log message that says 82 has root and meta)

Added=158-1-101-82,20020,1311885942386 to dead servers, submitted shutdown handler to be executed, root=true, meta=true
It said that 82 had root and meta; "root=true" shows the dead region server was carrying the root table.


-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: August 16, 2011 12:12
To: user@hbase.apache.org
Subject: Re: Root table couldn't be opened

On Wed, Aug 10, 2011 at 7:05 PM, Gaojinchao <ga...@huawei.com> wrote:
> In my cluster (version 0.90.3), the root table couldn't be opened when one region server crashed because of GC.
>
> The logs show:
>
> // Master assigned the root table to 82
> 2011-07-28 21:34:34,710 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 158-1-101-82,20020,1311885942386
>
> //The host of 82 crashed; the master finished splitting its logs and went to reassign -ROOT- and .META.. But the region server process didn't exit, so the root verification passed.
>  I think we shouldn't verify -ROOT- / .META. in shutdown-handler processing
>


82 did not exit?

Why did the master replay its logs if it did not exit?


> 2011-07-28 22:19:53,746 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=158-1-101-82,20020,1311885942386 to dead servers, submitted shutdown handler to be executed, root=true, meta=true


Isn't this the master handling 82 like it's been shut down?


> 2011-07-28 22:28:30,577 DEBUG org.apache.hadoop.hbase.master.ServerManager: Server REPORT rejected; currently processing 158-1-101-82,20020,1311885942386 as dead server

So, it looks like 82 tried to come in (after GC, I suppose) but we told
it to go away.

Why did we not notice that -ROOT- was on 82 and, as part of the
shutdown handling of 82, reassign it?  This is what you are saying
in your subsequent message (I like how you noticed the log message
that says 82 has root and meta).  I'm not sure why it did not
reassign root.  Either it's skipping something in the shutdown
handler, or the verify-location check for root has a bug in it: we
are not considering the fact that the current server held -ROOT-, so
if verification returns the current server as holding -ROOT-, we
should ignore it.

Good stuff Gao,
St.Ack

re: Root table couldn't be opened

Posted by Gaojinchao <ga...@huawei.com>.
Why did the master replay its logs if it did not exit?
The ZK session expired because of GC, but the region server didn't shut down.

(I like how you noticed the log message that says 82 has root and meta)

Added=158-1-101-82,20020,1311885942386 to dead servers, submitted shutdown handler to be executed, root=true, meta=true
It said that 82 had root and meta; "root=true" shows the dead region server was carrying the root table.


-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: August 16, 2011 12:12
To: user@hbase.apache.org
Subject: Re: Root table couldn't be opened

On Wed, Aug 10, 2011 at 7:05 PM, Gaojinchao <ga...@huawei.com> wrote:
> In my cluster (version 0.90.3), the root table couldn't be opened when one region server crashed because of GC.
>
> The logs show:
>
> // Master assigned the root table to 82
> 2011-07-28 21:34:34,710 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 158-1-101-82,20020,1311885942386
>
> //The host of 82 crashed; the master finished splitting its logs and went to reassign -ROOT- and .META.. But the region server process didn't exit, so the root verification passed.
>  I think we shouldn't verify -ROOT- / .META. in shutdown-handler processing
>


82 did not exit?

Why did the master replay its logs if it did not exit?


> 2011-07-28 22:19:53,746 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=158-1-101-82,20020,1311885942386 to dead servers, submitted shutdown handler to be executed, root=true, meta=true


Isn't this the master handling 82 like it's been shut down?


> 2011-07-28 22:28:30,577 DEBUG org.apache.hadoop.hbase.master.ServerManager: Server REPORT rejected; currently processing 158-1-101-82,20020,1311885942386 as dead server

So, it looks like 82 tried to come in (after GC, I suppose) but we told
it to go away.

Why did we not notice that -ROOT- was on 82 and, as part of the
shutdown handling of 82, reassign it?  This is what you are saying
in your subsequent message (I like how you noticed the log message
that says 82 has root and meta).  I'm not sure why it did not
reassign root.  Either it's skipping something in the shutdown
handler, or the verify-location check for root has a bug in it: we
are not considering the fact that the current server held -ROOT-, so
if verification returns the current server as holding -ROOT-, we
should ignore it.

Good stuff Gao,
St.Ack

Re: Root table couldn't be opened

Posted by Stack <st...@duboce.net>.
On Wed, Aug 10, 2011 at 7:05 PM, Gaojinchao <ga...@huawei.com> wrote:
> In my cluster (version 0.90.3), the root table couldn't be opened when one region server crashed because of GC.
>
> The logs show:
>
> // Master assigned the root table to 82
> 2011-07-28 21:34:34,710 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 158-1-101-82,20020,1311885942386
>
> //The host of 82 crashed; the master finished splitting its logs and went to reassign -ROOT- and .META.. But the region server process didn't exit, so the root verification passed.
>  I think we shouldn't verify -ROOT- / .META. in shutdown-handler processing
>


82 did not exit?

Why did the master replay its logs if it did not exit?


> 2011-07-28 22:19:53,746 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=158-1-101-82,20020,1311885942386 to dead servers, submitted shutdown handler to be executed, root=true, meta=true


Isn't this the master handling 82 like it's been shut down?


> 2011-07-28 22:28:30,577 DEBUG org.apache.hadoop.hbase.master.ServerManager: Server REPORT rejected; currently processing 158-1-101-82,20020,1311885942386 as dead server

So, it looks like 82 tried to come in (after GC, I suppose) but we told
it to go away.

Why did we not notice that -ROOT- was on 82 and, as part of the
shutdown handling of 82, reassign it?  This is what you are saying
in your subsequent message (I like how you noticed the log message
that says 82 has root and meta).  I'm not sure why it did not
reassign root.  Either it's skipping something in the shutdown
handler, or the verify-location check for root has a bug in it: we
are not considering the fact that the current server held -ROOT-, so
if verification returns the current server as holding -ROOT-, we
should ignore it (a sketch of that check follows below).

Good stuff Gao,
St.Ack
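
A sketch of that check, with illustrative names (this is not the real verify-and-assign code for -ROOT-):

public class StaleRootLocationSketch {

  // Treat a verification result that points back at the server whose
  // shutdown is being handled as stale, even if the RPC itself succeeded.
  static boolean rootLocationIsValid(String verifiedHolder,
                                     String deadServerBeingProcessed) {
    if (verifiedHolder == null) {
      return false; // no registered holder: must reassign
    }
    if (verifiedHolder.equals(deadServerBeingProcessed)) {
      // The "verified" holder is exactly the dead server (for example, it
      // answered while coming out of a GC pause), so ignore the verification
      // and reassign -ROOT-.
      return false;
    }
    return true;
  }
}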

Re: Root table couldn't be opened

Posted by Gaojinchao <ga...@huawei.com>.
Thanks for your reply. It seems hbase-4168 can't fix my case. In my case, there isn't any error in the master logs.
The master can reach the region server and get the region information.

The master logs said that this was a region server with root and meta, but it only assigned .META. and skipped -ROOT-.

//It said that this was a region server with root and meta.
2011-07-28 22:19:53,746 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=158-1-101-82,20020,1311885942386 to dead servers, submitted shutdown handler to be executed, root=true, meta=true

> 2011-07-28 22:19:53,746 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=158-1-101-82,20020,1311885942386 to dead servers, submitted shutdown handler to be executed, root=true, meta=true
> 2011-07-28 22:25:10,085 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting completed in 316329 ms for hdfs://158.1.101.82:9000/hbase/.logs/158-1-101-82,20020,1311885942386
> 2011-07-28 22:26:54,790 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-101-82,20020,1311885942386]
> 2011-07-28 22:27:11,176 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table
> java.net.SocketTimeoutException: Call to 158-1-101-82/158.1.101.82:20020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/158.1.101.82:57428 remote=158-1-101-82/158.1.101.82:20020]
>         at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>         at $Proxy6.getRegionInfo(Unknown Source)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRegionLocation(CatalogTracker.java:426)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:273)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:333)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366)
>         at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:255)
>         at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:237)
>         at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:139)
>         at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:88)
>         at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/158.1.101.82:57428 remote=158-1-101-82/158.1.101.82:20020]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>         at java.io.FilterInputStream.read(FilterInputStream.java:116)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:299)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>         at java.io.DataInputStream.readInt(DataInputStream.java:370)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:539)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:477)
> 2011-07-28 22:28:30,577 DEBUG org.apache.hadoop.hbase.master.ServerManager: Server REPORT rejected; currently processing 158-1-101-82,20020,1311885942386 as dead server

// It said that .META. was assigned, but there were no logs for -ROOT-. The exception in the logs above is from the CatalogJanitor, not the shutdown handler (see the sketch after the quoted logs below).

> 2011-07-28 22:28:37,591 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x23171e103d10018 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-07-28 22:28:37,704 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for .META.,,1.1028785192 so generated a random one; hri=.META.,,1.1028785192, src=, dest=158-1-101-202,20020,1311878322145; 2 (online=2, exclude=null) available servers
> 2011-07-28 22:28:37,704 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region .META.,,1.1028785192 to 158-1-101-202,20020,1311878322145
> 2011-07-28 22:28:37,733 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=158-1-101-82:20000, region=1028785192/.META.
> 2011-07-28 22:28:37,766 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=158-1-101-202,20020,1311878322145, region=1028785192/.META.
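
To make the distinction concrete, here is an illustrative sketch of the two independent master-side paths; only CatalogJanitor, MetaReader, and CatalogTracker from the stack trace are real class names, everything else is hypothetical.

public class MasterPathsSketch {

  // Path 1: the periodic chore (CatalogJanitor.chore -> MetaReader.fullScan
  // in the stack trace). Its timeout only shows the catalog was unreachable
  // at that moment; it is not the shutdown handler failing.
  static void catalogJanitorChore() {
    try {
      fullScanCatalog();
    } catch (Exception e) {
      System.out.println("Failed scan of catalog table: " + e);
    }
  }

  // Path 2: dead-server processing. In this incident .META. was reassigned
  // here, but no -ROOT- assignment was ever logged, consistent with the
  // -ROOT- verification having passed against the GC-paused server.
  static void handleDeadServer(boolean carryingRoot, boolean carryingMeta) {
    if (carryingRoot && rootLocationLooksValid()) {
      // -ROOT- is silently skipped: this is the gap being reported.
    }
    if (carryingMeta) {
      System.out.println("Assigning region .META. to another server");
    }
  }

  static void fullScanCatalog() throws Exception {
    throw new java.net.SocketTimeoutException("60000 millis timeout");
  }

  static boolean rootLocationLooksValid() {
    return true; // the paused server still answered the verification RPC
  }
}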

-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com]
Sent: August 11, 2011 10:16
To: user@hbase.apache.org
Cc: user@hbase.apache.org
Subject: Re: Root table couldn't be opened

I think this is hbase-4168 whose fix got committed today. 
We run 0.90.4 patched with hbase-4168. 

Please try the patch. 

On Aug 10, 2011, at 7:05 PM, Gaojinchao <ga...@huawei.com> wrote:

> In my cluster (version 0.90.3), the root table couldn't be opened when one region server crashed because of GC.
> 
> The logs show:
> 
> // Master assigned the root table to 82
> 2011-07-28 21:34:34,710 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 158-1-101-82,20020,1311885942386
> 
> //The host of 82 crashed; the master finished splitting its logs and went to reassign -ROOT- and .META.. But the region server process didn't exit, so the root verification passed.
> I think we shouldn't verify -ROOT- / .META. in shutdown-handler processing
> 
> 2011-07-28 22:19:53,746 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=158-1-101-82,20020,1311885942386 to dead servers, submitted shutdown handler to be executed, root=true, meta=true
> 2011-07-28 22:25:10,085 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting completed in 316329 ms for hdfs://158.1.101.82:9000/hbase/.logs/158-1-101-82,20020,1311885942386
> 2011-07-28 22:26:54,790 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-101-82,20020,1311885942386]
> 2011-07-28 22:27:11,176 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table
> java.net.SocketTimeoutException: Call to 158-1-101-82/158.1.101.82:20020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/158.1.101.82:57428 remote=158-1-101-82/158.1.101.82:20020]
>         at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>         at $Proxy6.getRegionInfo(Unknown Source)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRegionLocation(CatalogTracker.java:426)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:273)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:333)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366)
>         at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:255)
>         at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:237)
>         at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:139)
>         at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:88)
>         at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/158.1.101.82:57428 remote=158-1-101-82/158.1.101.82:20020]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>         at java.io.FilterInputStream.read(FilterInputStream.java:116)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:299)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>         at java.io.DataInputStream.readInt(DataInputStream.java:370)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:539)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:477)
> 2011-07-28 22:28:30,577 DEBUG org.apache.hadoop.hbase.master.ServerManager: Server REPORT rejected; currently processing 158-1-101-82,20020,1311885942386 as dead server
> 2011-07-28 22:28:37,591 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x23171e103d10018 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-07-28 22:28:37,704 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for .META.,,1.1028785192 so generated a random one; hri=.META.,,1.1028785192, src=, dest=158-1-101-202,20020,1311878322145; 2 (online=2, exclude=null) available servers
> 2011-07-28 22:28:37,704 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region .META.,,1.1028785192 to 158-1-101-202,20020,1311878322145
> 2011-07-28 22:28:37,733 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=158-1-101-82:20000, region=1028785192/.META.
> 2011-07-28 22:28:37,766 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=158-1-101-202,20020,1311878322145, region=1028785192/.META.
> 
> 
> 
> Region server logs:
> 2011-07-28 22:19:17,389 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for ufdr7,,1311890810267.c1b556627c511076bafbb1b802589cb6. because regionserver20020.cacheFlusher; priority=4, compaction queue size=1
> 
> // blocked for a long time.
> 2011-07-28 22:28:24,829 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 552455ms for sessionid 0x13171e103d7003e, closing socket connection and attempting reconnect
> 2011-07-28 22:28:24,829 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 552455ms for sessionid 0x23171e103d10020, closing socket connection and attempting reconnect
> 2011-07-28 22:28:25,186 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server /158.1.101.222:2181
> 2011-07-28 22:28:25,838 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server /158.1.101.82:2181
> 

Re: Root table couldn't be opened

Posted by Ted Yu <yu...@gmail.com>.
I think this is hbase-4168 whose fix got committed today. 
We run 0.90.4 patched with hbase-4168. 

Please try the patch. 

On Aug 10, 2011, at 7:05 PM, Gaojinchao <ga...@huawei.com> wrote:

> In my cluster (version 0.90.3), the root table couldn't be opened when one region server crashed because of GC.
> 
> The logs show:
> 
> // Master assigned the root table to 82
> 2011-07-28 21:34:34,710 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 158-1-101-82,20020,1311885942386
> 
> //The host of 82 crashed; the master finished splitting its logs and went to reassign -ROOT- and .META.. But the region server process didn't exit, so the root verification passed.
> I think we shouldn't verify -ROOT- / .META. in shutdown-handler processing
> 
> 2011-07-28 22:19:53,746 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=158-1-101-82,20020,1311885942386 to dead servers, submitted shutdown handler to be executed, root=true, meta=true
> 2011-07-28 22:25:10,085 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting completed in 316329 ms for hdfs://158.1.101.82:9000/hbase/.logs/158-1-101-82,20020,1311885942386
> 2011-07-28 22:26:54,790 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-101-82,20020,1311885942386]
> 2011-07-28 22:27:11,176 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table
> java.net.SocketTimeoutException: Call to 158-1-101-82/158.1.101.82:20020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/158.1.101.82:57428 remote=158-1-101-82/158.1.101.82:20020]
>         at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>         at $Proxy6.getRegionInfo(Unknown Source)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRegionLocation(CatalogTracker.java:426)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:273)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:333)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366)
>         at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:255)
>         at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:237)
>         at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:139)
>         at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:88)
>         at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/158.1.101.82:57428 remote=158-1-101-82/158.1.101.82:20020]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>         at java.io.FilterInputStream.read(FilterInputStream.java:116)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:299)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>         at java.io.DataInputStream.readInt(DataInputStream.java:370)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:539)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:477)
> 2011-07-28 22:28:30,577 DEBUG org.apache.hadoop.hbase.master.ServerManager: Server REPORT rejected; currently processing 158-1-101-82,20020,1311885942386 as dead server
> 2011-07-28 22:28:37,591 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x23171e103d10018 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-07-28 22:28:37,704 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for .META.,,1.1028785192 so generated a random one; hri=.META.,,1.1028785192, src=, dest=158-1-101-202,20020,1311878322145; 2 (online=2, exclude=null) available servers
> 2011-07-28 22:28:37,704 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region .META.,,1.1028785192 to 158-1-101-202,20020,1311878322145
> 2011-07-28 22:28:37,733 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=158-1-101-82:20000, region=1028785192/.META.
> 2011-07-28 22:28:37,766 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=158-1-101-202,20020,1311878322145, region=1028785192/.META.
> 
> 
> 
> Region server logs:
> 2011-07-28 22:19:17,389 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for ufdr7,,1311890810267.c1b556627c511076bafbb1b802589cb6. because regionserver20020.cacheFlusher; priority=4, compaction queue size=1
> 
> // blocked for a long time.
> 2011-07-28 22:28:24,829 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 552455ms for sessionid 0x13171e103d7003e, closing socket connection and attempting reconnect
> 2011-07-28 22:28:24,829 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 552455ms for sessionid 0x23171e103d10020, closing socket connection and attempting reconnect
> 2011-07-28 22:28:25,186 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server /158.1.101.222:2181
> 2011-07-28 22:28:25,838 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server /158.1.101.82:2181
>