Posted to user@hbase.apache.org by Bill Graham <bi...@gmail.com> on 2011/01/26 00:27:12 UTC

Region is not online: -ROOT-,,0

Hi,

A developer on our team created a table today and something failed and
we fell back into the dire scenario we were in earlier this week. When
I got on the scene, 2 of our 4 regionservers had crashed. When I brought them
back up, they wouldn't come online and the master was scrolling
messages like those in
https://issues.apache.org/jira/browse/HBASE-3406.

I'm running 0.90.0-rc1 and CDH3b2 with append enabled.

I shut down the entire cluster + zookeeper and restarted it. Now, I'm
getting two types of errors and the cluster won't come up:

- On one of the regionservers:
2011-01-25 15:12:00,287 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer:
NotServingRegionException; Region is not online: -ROOT-,,0

- And on the master this scrolls every few seconds. The log file
referenced is empty in HDFS.
2011-01-25 15:12:26,897 WARN org.apache.hadoop.hbase.util.FSUtils:
Waited 275444ms for lease recovery on
hdfs://mymaster.com:9000/hbase-app/hbase/.logs/hadoop-wkr-r14-n1.mydomain.com,60020,1295900457489/hadoop-wkr-r14-n1.mydomain.com%3A60020.1295907659592:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
failed to create file
/hbase-app/hbase/.logs/hadoop-wkr-r14-n1.mydomain.com,60020,1295900457489/hadoop-wkr-r14-n1.mydomain.com%3A60020.1295907659592
for DFSClient_hb_m_mymaster.com:60000_1295996847777 on client
10.14.98.90, because this file is already being created by NN_Recovery
on 10.10.220.15
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1093)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1181)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:422)
        at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)

Any suggestions for how to get the -ROOT- back? I can see it in HDFS.

thanks,
Bill
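
Editorial sketch (not part of the original message): the master warning
above names three things worth separating when reading it: the file whose
lease is contested, the client asking for it, and the holder that already
has it. The Python below is a minimal sketch that pulls those fields out of
the wrapped log text. The regular expression and the field names are my own
assumptions based only on the single message quoted in this thread, so
other Hadoop versions may word the message differently.

```python
import re

# Pattern for the AlreadyBeingCreatedException text as quoted in this thread.
# The group names are labels chosen for this sketch, not Hadoop terminology.
LEASE_CONFLICT = re.compile(
    r"failed to create file\s+(?P<path>\S+)\s+"
    r"for\s+(?P<requester>\S+)\s+on client\s+(?P<requester_ip>[\d.]+),\s+"
    r"because this file is already being created by\s+(?P<holder>\S+)\s+"
    r"on\s+(?P<holder_ip>[\d.]+)"
)


def parse_lease_conflict(message):
    """Return a dict of the fields in a lease-conflict message, or None."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(message.split())
    match = LEASE_CONFLICT.search(flat)
    return match.groupdict() if match else None


sample = """failed to create file
/hbase-app/hbase/.logs/hadoop-wkr-r14-n1.mydomain.com,60020,1295900457489/hadoop-wkr-r14-n1.mydomain.com%3A60020.1295907659592
for DFSClient_hb_m_mymaster.com:60000_1295996847777 on client
10.14.98.90, because this file is already being created by NN_Recovery
on 10.10.220.15"""

info = parse_lease_conflict(sample)
print(info["holder"], info["holder_ip"])
print(info["path"])
```

Here the requester is the master's DFS client and the holder is
NN_Recovery, which is the pairing the replies in this thread focus on.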

Re: Region is not online: -ROOT-,,0

Posted by Ryan Rawson <ry...@gmail.com>.

These jiras might be related:

https://issues.apache.org/jira/browse/HDFS-1520

https://issues.apache.org/jira/browse/HDFS-1554

I'm not sure they would help in this situation, since the client
'NN_Recovery' isn't a "real" client (i.e., an HBase regionserver).



On Tue, Jan 25, 2011 at 6:59 PM, Ryan Rawson <ry...@gmail.com> wrote:
> It's all about this line:
>
> "for DFSClient_hb_m_mymaster.com:60000_1295996847777 on client
> 10.14.98.90, because this file is already being created by NN_Recovery"
>
> I'm not really sure why that happens. I've seen it on my test
> clusters, and it holds up region redeployment, hence your
> problems.
>
> Perhaps someone familiar with the deep internals of append recovery
> can speak up...
>
> -ryan
>
>
> On Tue, Jan 25, 2011 at 4:02 PM, Bill Graham <bi...@gmail.com> wrote:
>> I'm still not sure how I got into this situation, but I've gotten
>> myself out of it and I'm up and running.
>>
>> The fix was to shut down the cluster and remove the .logs/ files from
>> HDFS. Then the master was able to start properly and a regionserver
>> was able to start up and serve the -ROOT- region.
>>
>> One theory as to the cause of this issue (twice now), is that I was
>> still getting bit by the issue of invalid hadoop maven jars in my
>> classpath (see https://issues.apache.org/jira/browse/HBASE-3436) on 2
>> of my 4 regionservers. I'll add more commentary around HBASE-3436 in
>> the JIRA.

Re: Region is not online: -ROOT-,,0

Posted by Ryan Rawson <ry...@gmail.com>.

It's all about this line:

"for DFSClient_hb_m_mymaster.com:60000_1295996847777 on client
10.14.98.90, because this file is already being created by NN_Recovery"

I'm not really sure why that happens. I've seen it on my test
clusters, and it holds up region redeployment, hence your
problems.

Perhaps someone familiar with the deep internals of append recovery
can speak up...

-ryan


On Tue, Jan 25, 2011 at 4:02 PM, Bill Graham <bi...@gmail.com> wrote:
> I'm still not sure how I got into this situation, but I've gotten
> myself out of it and I'm up and running.
>
> The fix was to shut down the cluster and remove the .logs/ files from
> HDFS. Then the master was able to start properly and a regionserver
> was able to start up and serve the -ROOT- region.
>
> One theory as to the cause of this issue (twice now), is that I was
> still getting bit by the issue of invalid hadoop maven jars in my
> classpath (see https://issues.apache.org/jira/browse/HBASE-3436) on 2
> of my 4 regionservers. I'll add more commentary around HBASE-3436 in
> the JIRA.

Re: Region is not online: -ROOT-,,0

Posted by Bill Graham <bi...@gmail.com>.

I'm still not sure how I got into this situation, but I've gotten
myself out of it and I'm up and running.

The fix was to shut down the cluster and remove the .logs/ files from
HDFS. Then the master was able to start properly and a regionserver
was able to start up and serve the -ROOT- region.

One theory as to the cause of this issue (twice now), is that I was
still getting bit by the issue of invalid hadoop maven jars in my
classpath (see https://issues.apache.org/jira/browse/HBASE-3436) on 2
of my 4 regionservers. I'll add more commentary around HBASE-3436 in
the JIRA.
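
Editorial sketch (not part of the original message): the files under
.logs/ are the regionservers' write-ahead logs, so deleting them discards
any edits that had not yet been flushed. A more cautious variant of the fix
described above is to move the directories aside rather than remove them.
That swap is my suggestion and not what was done in this thread. The Python
below only builds the `hadoop fs -mv` command lines and prints them as a
dry run. The .logs path comes from the master log quoted in this thread;
the backup directory name is hypothetical and is assumed to exist in HDFS
already.

```python
import posixpath


def move_aside_commands(log_root, backup_root, server_dirs):
    """Build one `hadoop fs -mv` argv list per regionserver log directory."""
    commands = []
    for name in server_dirs:
        src = posixpath.join(log_root, name)
        dst = posixpath.join(backup_root, name)
        commands.append(["hadoop", "fs", "-mv", src, dst])
    return commands


# The backup directory below is a made-up name for illustration only.
cmds = move_aside_commands(
    "/hbase-app/hbase/.logs",
    "/hbase-app/logs-set-aside",
    ["hadoop-wkr-r14-n1.mydomain.com,60020,1295900457489"],
)
for argv in cmds:
    print(" ".join(argv))  # dry run: print the command instead of running it
```

To actually run the commands, each argv list could be passed to
subprocess.run(argv, check=True), with the HBase cluster stopped first as
in the fix described above.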




Re: Region is not online: -ROOT-,,0

Posted by Bill Graham <bi...@gmail.com>.

Thanks for the comments. Attached is the log file from the master
after the restart. The last error message was repeated every second.

See comments below.

On Tue, Jan 25, 2011 at 7:20 PM, Stack <st...@duboce.net> wrote:
> On Tue, Jan 25, 2011 at 3:27 PM, Bill Graham <bi...@gmail.com> wrote:
>> Hi,
>>
>> A developer on our team created a table today and something failed and
>> we fell back into the dire scenario we were in earlier this week. When
>> I got on the scene, 2 of our 4 regionservers had crashed. When I brought them
>> back up, they wouldn't come online and the master was scrolling
>> messages like those in
>> https://issues.apache.org/jira/browse/HBASE-3406.
>>
>> I'm running 0.90.0-rc1 and CDH3b2 with append enabled.
>>
> Can you move to 0.90.0 release?

Will do. Was planning on doing this soon, but we'll prioritize this.

>
>
>> I shut down the entire cluster + zookeeper and restarted it. Now, I'm
>> getting two types of errors and the cluster won't come up:
>>
>> - On one of the regionservers:
>> 2011-01-25 15:12:00,287 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegionServer:
>> NotServingRegionException; Region is not online: -ROOT-,,0
>>
>
> Can I see master log around startup please?

See attached.

>
>
>> - And on the master this scrolls every few seconds. the log file
>> referenced is empty in HDFS.
>> 2011-01-25 15:12:26,897 WARN org.apache.hadoop.hbase.util.FSUtils:
>> Waited 275444ms for lease recovery on
>> hdfs://mymaster.com:9000/hbase-app/hbase/.logs/hadoop-wkr-r14-n1.mydomain.com,60020,1295900457489/hadoop-wkr-r14-n1.mydomain.com%3A60020.1295907659592:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
>> failed to create file
>> /hbase-app/hbase/.logs/hadoop-wkr-r14-n1.mydomain.com,60020,1295900457489/hadoop-wkr-r14-n1.mydomain.com%3A60020.1295907659592
>> for DFSClient_hb_m_mymaster.com:60000_1295996847777 on client
>> 10.14.98.90, because this file is already being created by NN_Recovery
>> on 10.10.220.15
>>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1093)
>>
>
>
> As Ryan says, this would seem to indicate the owning RegionServer is
> still up.  Is that the case?  Did the restart of the cluster for sure
> put down all RSs?

Yes, all RSs started up after the restart, just 2 wouldn't come online
and one of them was logging the errors about -ROOT-.

>
>
>>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1181)
>>        at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:422)
>>        at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
>>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
>>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
>>        at java.security.AccessController.doPrivileged(Native Method)
>>        at javax.security.auth.Subject.doAs(Subject.java:396)
>>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
>>
>> Any suggestions for how to get the -ROOT- back? I can see it in HDFS.
>>
>
>
> Root will come back once master moves past log file splitting.

Yes, once I removed all logs from HDFS, the master came up and -ROOT-
was found. The splitting was hung on a file, hence the infinite loop
with AlreadyBeingCreatedExceptions.

>
> St.Ack
>
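
Editorial sketch (not part of the original message): the hang described
above, where splitting retries the same file once a second and never moves
on, is the general shape of a retry loop with no deadline. The Python below
is a toy model of that shape and is not the HBase or HDFS code. The names
recover_lease, LeaseHeldError, and give_up_after are all invented for this
sketch, and the deadline is my addition to show one way a caller could turn
an endless wait into an error it can act on.

```python
import time


class LeaseHeldError(Exception):
    """Stand-in for AlreadyBeingCreatedException in this sketch."""


def recover_lease(try_append, retry_delay=1.0, give_up_after=None,
                  sleep=time.sleep, clock=time.monotonic):
    """Call try_append until it succeeds and return the attempt count.

    With give_up_after=None the loop never ends while the lease is held.
    With a deadline in seconds it raises TimeoutError instead.
    """
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        try:
            try_append()
            return attempts
        except LeaseHeldError:
            waited = clock() - start
            if give_up_after is not None and waited >= give_up_after:
                raise TimeoutError(
                    "lease still held after %d attempts" % attempts)
            sleep(retry_delay)


def make_flaky(failures):
    """Return a fake append that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def try_append():
        if state["left"] > 0:
            state["left"] -= 1
            raise LeaseHeldError()

    return try_append


# Two failures followed by a success; sleeping is stubbed out for the demo.
attempts = recover_lease(make_flaky(2), sleep=lambda seconds: None)
print(attempts)  # prints 3
```

If the lease is never released, as with the empty log file in this thread,
only the deadline (or removing the file) ends the loop.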

Re: Region is not online: -ROOT-,,0

Posted by Stack <st...@duboce.net>.

On Tue, Jan 25, 2011 at 3:27 PM, Bill Graham <bi...@gmail.com> wrote:
> Hi,
>
> A developer on our team created a table today and something failed and
> we fell back into the dire scenario we were in earlier this week. When
> I got on the scene, 2 of our 4 regionservers had crashed. When I brought them
> back up, they wouldn't come online and the master was scrolling
> messages like those in
> https://issues.apache.org/jira/browse/HBASE-3406.
>
> I'm running 0.90.0-rc1 and CDH3b2 with append enabled.
>
Can you move to 0.90.0 release?


> I shut down the entire cluster + zookeeper and restarted it. Now, I'm
> getting two types of errors and the cluster won't come up:
>
> - On one of the regionservers:
> 2011-01-25 15:12:00,287 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> NotServingRegionException; Region is not online: -ROOT-,,0
>

Can I see master log around startup please?


> - And on the master this scrolls every few seconds. the log file
> referenced is empty in HDFS.
> 2011-01-25 15:12:26,897 WARN org.apache.hadoop.hbase.util.FSUtils:
> Waited 275444ms for lease recovery on
> hdfs://mymaster.com:9000/hbase-app/hbase/.logs/hadoop-wkr-r14-n1.mydomain.com,60020,1295900457489/hadoop-wkr-r14-n1.mydomain.com%3A60020.1295907659592:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
> failed to create file
> /hbase-app/hbase/.logs/hadoop-wkr-r14-n1.mydomain.com,60020,1295900457489/hadoop-wkr-r14-n1.mydomain.com%3A60020.1295907659592
> for DFSClient_hb_m_mymaster.com:60000_1295996847777 on client
> 10.14.98.90, because this file is already being created by NN_Recovery
> on 10.10.220.15
>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1093)
>


As Ryan says, this would seem to indicate the owning RegionServer is
still up.  Is that the case?  Did the restart of the cluster for sure
put down all RSs?


>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1181)
>        at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:422)
>        at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
>
> Any suggestions for how to get the -ROOT- back? I can see it in HDFS.
>


Root will come back once master moves past log file splitting.

St.Ack