Posted to user@hbase.apache.org by Matthew LeMieux <md...@mlogiciels.com> on 2010/09/09 03:00:50 UTC

HBase crash, need help getting back up

My HBase cluster just crashed.   One of the Region servers stopped (I do not yet know why).  After restarting it, the cluster seemed a bit wobbly, so I decided to shut down everything and restart fresh.  I did so (including ZooKeeper and HDFS). 

Upon restart, I'm getting the following message in the Master's log file repeating continuously with the number of ms waited counting up.  

2010-09-09 00:54:58,406 WARN org.apache.hadoop.hbase.util.FSUtils: Waited 69188ms for lease recovery on hdfs://domU-12-31-39-18-12-05.compute-1.internal:9000/hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298 for DFSClient_hb_m_10.104.37.247:60000 on client 10.104.37.247 because current leaseholder is trying to recreate file.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1068)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1181)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:422)
        at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)


The region servers are waiting with this being the final message in their log file: 

2010-09-09 00:53:49,111 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 10.104.37.247:60000 that we are up

I've  been using this version for a little under a week without incident (http://people.apache.org/~jdcryans/hbase-0.89.20100830-candidate-1/ ).  

The HDFS comes from CDH3.  

Does anybody have any ideas on what I can do to get back up and running?

Thank you, 

Matthew


Re: HBase crash, need help getting back up

Posted by Todd Lipcon <to...@cloudera.com>.
I think the root issue you ran into is HBASE-2975, which I coincidentally
also found last night. The fix is committed and should be in our next
rc/release.

Thanks
-Todd

On Thu, Sep 9, 2010 at 10:24 AM, Matthew LeMieux <md...@mlogiciels.com> wrote:

> Replies below
>
> On Sep 8, 2010, at 10:00 PM, Stack wrote:
>
> > recovered.edits is the name of the file produced when wal logs are
> > split; one is made per region
> >
> > Where are you seeing that message?  Does it not have the full path to the
> > recovered.edits file?
> >
>
> In the master log file.  Full path was not there.
>
> > You are running w/ perms enabled on this cluster?
> >
>
> It was enabled and it has now been turned off.  Will that fix the problem
> of a file not being executable?  In any case that problem is intermittent.
>  It usually shows up only after a partial restart (i.e. a Region server goes
> down and I restart it), but does not show up after a complete restart of the
> whole cluster.
>
> > Why did the regionservers go down?
>
> I tracked the reason for the most recent "crash" down to "too many open
> files" for the user that runs hadoop.  Very odd situation: both the user
> running hbase and the user running hadoop were in the
> /etc/security/limits.conf file with a limit of 50000, but the change only
> worked for one of them.  hadoop's account reported 1024, and the hbase
> user's account reported 50000 to 'ulimit -n'.  I did three things before
> rebooting the machine, and I'm not sure which were needed to fix it:
>
>    *  I added "session required        pam_limits.so" to
> /etc/pam.d/common-session (pam_limits.so was already being referenced in
> several other files in /etc/pam.d, but was missing from this file)
>    *  gave hadoop a home directory that exists (by editing the /etc/passwd
> file)
>    *  I added "*                hard    nofile          50000" to the
> /etc/security/limits.conf file (in addition to the two lines for each user
> that were already there)
>
> (on Ubuntu Karmic, running CDH version: 0.20.2+320-1~karmic-cdh3b2)
>
> The CDH distribution doesn't appear to have the hadoop home directory
> situation figured out (they put it in a directory that gets deleted on
> reboots).  I change it routinely, but apparently missed this machine.
>
> This is likely to fix quite a few problems, but I think there is still a
> mystery to be solved.  I'll have to wait until it happens again to get a
> clean log of the event.
>
> FYI,
>
> Matthew
>
>
> > On Wed, Sep 8, 2010 at 9:54 PM, Matthew LeMieux <md...@mlogiciels.com>
> wrote:
> >> Well, it was short-lived; it only stayed up for a couple of hours, and
> all region servers crashed this time, not just one.
> >>
> >> Now, after restarting, I've got the master server complaining about not
> having executable permissions on "recovered.edits".  Where is this file?
> >>
> >>  Caused by: org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.security.AccessControlException: Permission denied:
> user=mlcamus, access=EXECUTE,
> inode="recovered.edits":mlcamus:supergroup:rw-r--r--
> >>
> >> The message has repeated for a half hour, with this showing up in one
> region server:
> >>
> >> 2010-09-09 04:52:34,887 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> NotServingRegionException; -ROOT-,,0
> >>
> >> I assume this will get better if I change permissions of some file...
> which one?
> >>
> >> -Matthew
> >>
> >>
> >> On Sep 8, 2010, at 6:21 PM, Matthew LeMieux wrote:
> >>
> >>> I tried moving that file to tmp.  It appears as though the master is no
> longer stuck, but clients are still not able to run queries.
> >>>
> >>> There aren't any messages passing by in the log files (just routine
> messages I see when the server isn't doing anything), but attempts to run
> queries resulted in NotServingRegionExceptions (e.g., count 'table').
> >>>
> >>> I tried enable 'table', and found that after this command there was a
> huge amount of activity in the log files, and I was able to run queries
> again.
> >>>
> >>> There was no previous call to disable 'table', but for some reason
> HBase wasn't bringing tables/regions online.
> >>>
> >>> I'm not sure what caused the problem or even if the actions I took will
> fix it again in the future, but I am back up and running for now.
> >>>
> >>> FYI,
> >>>
> >>> -Matthew
> >>>
> >>> On Sep 8, 2010, at 6:00 PM, Matthew LeMieux wrote:
> >>>
> >>>> My HBase cluster just crashed.   One of the Region servers stopped (I
> do not yet know why).  After restarting it, the cluster seemed a bit wobbly,
> so I decided to shut down everything and restart fresh.  I did so (including
> ZooKeeper and HDFS).
> >>>>
> >>>> Upon restart, I'm getting the following message in the Master's log
> file repeating continuously with the number of ms waited counting up.
> >>>>
> >>>> 2010-09-09 00:54:58,406 WARN org.apache.hadoop.hbase.util.FSUtils:
> Waited 69188ms for lease recovery on
> hdfs://domU-12-31-39-18-12-05.compute-1.internal:9000/hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/
> 10.215.59.191%3A60020.1283905909298:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
> failed to create file
> /hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/
> 10.215.59.191%3A60020.1283905909298 for DFSClient_hb_m_10.104.37.247:60000
> on client 10.104.37.247 because current leaseholder is trying to recreate
> file.
> >>>>       at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1068)
> >>>>       at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1181)
> >>>>       at
> org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:422)
> >>>>       at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
> >>>>       at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>>       at java.lang.reflect.Method.invoke(Method.java:597)
> >>>>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
> >>>>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
> >>>>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
> >>>>       at java.security.AccessController.doPrivileged(Native Method)
> >>>>       at javax.security.auth.Subject.doAs(Subject.java:396)
> >>>>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
> >>>>
> >>>>
> >>>> The region servers are waiting with this being the final message in
> their log file:
> >>>>
> >>>> 2010-09-09 00:53:49,111 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at
> 10.104.37.247:60000 that we are up
> >>>>
> >>>> I've  been using this version for a little under a week without
> incident (
> http://people.apache.org/~jdcryans/hbase-0.89.20100830-candidate-1/ ).
> >>>>
> >>>> The HDFS comes from CDH3.
> >>>>
> >>>> Does anybody have any ideas on what I can do to get back up and
> running?
> >>>>
> >>>> Thank you,
> >>>>
> >>>> Matthew
> >>>>
> >>>
> >>
> >>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: HBase crash, need help getting back up

Posted by Matthew LeMieux <md...@mlogiciels.com>.
Replies below

On Sep 8, 2010, at 10:00 PM, Stack wrote:

> recovered.edits is the name of the file produced when wal logs are
> split; one is made per region
> 
> Where are you seeing that message?  Does it not have the full path to the
> recovered.edits file?
> 

In the master log file.  Full path was not there. 

> You are running w/ perms enabled on this cluster?
> 

It was enabled and it has now been turned off.  Will that fix the problem of a file not being executable?  In any case that problem is intermittent.  It usually shows up only after a partial restart (i.e. a Region server goes down and I restart it), but does not show up after a complete restart of the whole cluster. 

> Why did the regionservers go down?

I tracked the reason for the most recent "crash" down to "too many open files" for the user that runs hadoop.  Very odd situation: both the user running hbase and the user running hadoop were in the /etc/security/limits.conf file with a limit of 50000, but the change only worked for one of them.  hadoop's account reported 1024, and the hbase user's account reported 50000 to 'ulimit -n'.  I did three things before rebooting the machine, and I'm not sure which were needed to fix it: 
   
    *  I added "session required        pam_limits.so" to /etc/pam.d/common-session (pam_limits.so was already being referenced in several other files in /etc/pam.d, but was missing from this file)
    *  gave hadoop a home directory that exists (by editing the /etc/passwd file)
    *  I added "*                hard    nofile          50000" to the /etc/security/limits.conf file (in addition to the two lines for each user that were already there)

(on Ubuntu Karmic, running CDH version: 0.20.2+320-1~karmic-cdh3b2)
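
For what it's worth, a quick way to confirm which limit a running daemon actually picked up is to ask the JVM itself.  A minimal sketch (the class name is mine; it relies on the Sun-specific UnixOperatingSystemMXBean being available):

    import java.lang.management.ManagementFactory;
    import com.sun.management.UnixOperatingSystemMXBean;

    public class FdLimitCheck {
        public static void main(String[] args) {
            // On a Sun JVM on Unix, the OS bean exposes the process's fd counts.
            UnixOperatingSystemMXBean os = (UnixOperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();
            System.out.println("max open files:  " + os.getMaxFileDescriptorCount());
            System.out.println("open files now:  " + os.getOpenFileDescriptorCount());
        }
    }

Run as the hadoop user, after a fresh login so PAM re-applies the limits, that should report 50000; if it still says 1024, the limits.conf change is not reaching that session.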

The CDH distribution doesn't appear to have the hadoop home directory situation figured out (they put it in a directory that gets deleted on reboots).  I change it routinely, but apparently missed this machine.  

This is likely to fix quite a few problems, but I think there is still a mystery to be solved.  I'll have to wait until it happens again to get a clean log of the event. 

FYI,

Matthew


> On Wed, Sep 8, 2010 at 9:54 PM, Matthew LeMieux <md...@mlogiciels.com> wrote:
>> Well, it was short-lived; it only stayed up for a couple of hours, and all region servers crashed this time, not just one.
>> 
>> Now, after restarting, I've got the master server complaining about not having executable permissions on "recovered.edits".  Where is this file?
>> 
>>  Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.security.AccessControlException: Permission denied: user=mlcamus, access=EXECUTE, inode="recovered.edits":mlcamus:supergroup:rw-r--r--
>> 
>> The message has repeated for a half hour, with this showing up in one region server:
>> 
>> 2010-09-09 04:52:34,887 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; -ROOT-,,0
>> 
>> I assume this will get better if I change permissions of some file... which one?
>> 
>> -Matthew
>> 
>> 
>> On Sep 8, 2010, at 6:21 PM, Matthew LeMieux wrote:
>> 
>>> I tried moving that file to tmp.  It appears as though the master is no longer stuck, but clients are still not able to run queries.
>>> 
>>> There aren't any messages passing by in the log files (just routine messages I see when the server isn't doing anything), but attempts to run queries resulted in NotServingRegionExceptions (e.g., count 'table').
>>> 
>>> I tried enable 'table', and found that after this command there was a huge amount of activity in the log files, and I was able to run queries again.
>>> 
>>> There was no previous call to disable 'table', but for some reason HBase wasn't bringing tables/regions online.
>>> 
>>> I'm not sure what caused the problem or even if the actions I took will fix it again in the future, but I am back up and running for now.
>>> 
>>> FYI,
>>> 
>>> -Matthew
>>> 
>>> On Sep 8, 2010, at 6:00 PM, Matthew LeMieux wrote:
>>> 
>>>> My HBase cluster just crashed.   One of the Region servers stopped (I do not yet know why).  After restarting it, the cluster seemed a bit wobbly, so I decided to shut down everything and restart fresh.  I did so (including ZooKeeper and HDFS).
>>>> 
>>>> Upon restart, I'm getting the following message in the Master's log file repeating continuously with the number of ms waited counting up.
>>>> 
>>>> 2010-09-09 00:54:58,406 WARN org.apache.hadoop.hbase.util.FSUtils: Waited 69188ms for lease recovery on hdfs://domU-12-31-39-18-12-05.compute-1.internal:9000/hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298 for DFSClient_hb_m_10.104.37.247:60000 on client 10.104.37.247 because current leaseholder is trying to recreate file.
>>>>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1068)
>>>>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1181)
>>>>       at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:422)
>>>>       at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>>>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>       at java.lang.reflect.Method.invoke(Method.java:597)
>>>>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
>>>>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
>>>>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
>>>>       at java.security.AccessController.doPrivileged(Native Method)
>>>>       at javax.security.auth.Subject.doAs(Subject.java:396)
>>>>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
>>>> 
>>>> 
>>>> The region servers are waiting with this being the final message in their log file:
>>>> 
>>>> 2010-09-09 00:53:49,111 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 10.104.37.247:60000 that we are up
>>>> 
>>>> I've  been using this version for a little under a week without incident (http://people.apache.org/~jdcryans/hbase-0.89.20100830-candidate-1/ ).
>>>> 
>>>> The HDFS comes from CDH3.
>>>> 
>>>> Does anybody have any ideas on what I can do to get back up and running?
>>>> 
>>>> Thank you,
>>>> 
>>>> Matthew
>>>> 
>>> 
>> 
>> 


Re: HBase crash, need help getting back up

Posted by Stack <st...@duboce.net>.
recovered.edits is the name of the file produced when wal logs are
split; one is made per region

Where are you seeing that message?  Does it not have the full path to the
recovered.edits file?

You are running w/ perms enabled on this cluster?

Why did the regionservers go down?
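
If perms turn out to be the culprit, you can inspect and loosen the flagged
inode directly through the FileSystem API.  A rough sketch (the class name is
mine; the path argument is whatever the AccessControlException reported):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class FixPerms {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path(args[0]);  // the inode from the exception message
            // Print what HDFS thinks the perms are; EXECUTE is what path
            // traversal needs, so rw-r--r-- would explain the denial.
            System.out.println(p + " -> " + fs.getFileStatus(p).getPermission());
            fs.setPermission(p, new FsPermission((short) 0755));
        }
    }

(From the command line, hadoop fs -chmod 755 <path> does the same thing.)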

St.Ack

On Wed, Sep 8, 2010 at 9:54 PM, Matthew LeMieux <md...@mlogiciels.com> wrote:
> Well, it was short-lived; it only stayed up for a couple of hours, and all region servers crashed this time, not just one.
>
> Now, after restarting, I've got the master server complaining about not having executable permissions on "recovered.edits".  Where is this file?
>
>  Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.security.AccessControlException: Permission denied: user=mlcamus, access=EXECUTE, inode="recovered.edits":mlcamus:supergroup:rw-r--r--
>
> The message has repeated for a half hour, with this showing up in one region server:
>
> 2010-09-09 04:52:34,887 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; -ROOT-,,0
>
> I assume this will get better if I change permissions of some file... which one?
>
> -Matthew
>
>
> On Sep 8, 2010, at 6:21 PM, Matthew LeMieux wrote:
>
>> I tried moving that file to tmp.  It appears as though the master is no longer stuck, but clients are still not able to run queries.
>>
>> There aren't any messages passing by in the log files (just routine messages I see when the server isn't doing anything), but attempts to run queries resulted in NotServingRegionExceptions (e.g., count 'table').
>>
>> I tried enable 'table', and found that after this command there was a huge amount of activity in the log files, and I was able to run queries again.
>>
>> There was no previous call to disable 'table', but for some reason HBase wasn't bringing tables/regions online.
>>
>> I'm not sure what caused the problem or even if the actions I took will fix it again in the future, but I am back up and running for now.
>>
>> FYI,
>>
>> -Matthew
>>
>> On Sep 8, 2010, at 6:00 PM, Matthew LeMieux wrote:
>>
>>> My HBase cluster just crashed.   One of the Region servers stopped (I do not yet know why).  After restarting it, the cluster seemed a bit wobbly, so I decided to shut down everything and restart fresh.  I did so (including ZooKeeper and HDFS).
>>>
>>> Upon restart, I'm getting the following message in the Master's log file repeating continuously with the number of ms waited counting up.
>>>
>>> 2010-09-09 00:54:58,406 WARN org.apache.hadoop.hbase.util.FSUtils: Waited 69188ms for lease recovery on hdfs://domU-12-31-39-18-12-05.compute-1.internal:9000/hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298 for DFSClient_hb_m_10.104.37.247:60000 on client 10.104.37.247 because current leaseholder is trying to recreate file.
>>>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1068)
>>>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1181)
>>>       at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:422)
>>>       at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>       at java.lang.reflect.Method.invoke(Method.java:597)
>>>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
>>>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
>>>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
>>>       at java.security.AccessController.doPrivileged(Native Method)
>>>       at javax.security.auth.Subject.doAs(Subject.java:396)
>>>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
>>>
>>>
>>> The region servers are waiting with this being the final message in their log file:
>>>
>>> 2010-09-09 00:53:49,111 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 10.104.37.247:60000 that we are up
>>>
>>> I've  been using this version for a little under a week without incident (http://people.apache.org/~jdcryans/hbase-0.89.20100830-candidate-1/ ).
>>>
>>> The HDFS comes from CDH3.
>>>
>>> Does anybody have any ideas on what I can do to get back up and running?
>>>
>>> Thank you,
>>>
>>> Matthew
>>>
>>
>
>

Re: HBase crash, need help getting back up

Posted by Matthew LeMieux <md...@mlogiciels.com>.
Well, it was short-lived; it only stayed up for a couple of hours, and all region servers crashed this time, not just one. 

Now, after restarting, I've got the master server complaining about not having executable permissions on "recovered.edits".  Where is this file?

 Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.security.AccessControlException: Permission denied: user=mlcamus, access=EXECUTE, inode="recovered.edits":mlcamus:supergroup:rw-r--r--

The message has repeated for a half hour, with this showing up in one region server: 

2010-09-09 04:52:34,887 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; -ROOT-,,0

I assume this will get better if I change permissions of some file... which one?

-Matthew


On Sep 8, 2010, at 6:21 PM, Matthew LeMieux wrote:

> I tried moving that file to tmp.  It appears as though the master is no longer stuck, but clients are still not able to run queries.  
> 
> There aren't any messages passing by in the log files (just routine messages I see when the server isn't doing anything), but attempts to run queries resulted in NotServingRegionExceptions (e.g., count 'table'). 
> 
> I tried enable 'table', and found that after this command there was a huge amount of activity in the log files, and I was able to run queries again.  
> 
> There was no previous call to disable 'table', but for some reason HBase wasn't bringing tables/regions online.  
> 
> I'm not sure what caused the problem or even if the actions I took will fix it again in the future, but I am back up and running for now.  
> 
> FYI,
> 
> -Matthew
> 
> On Sep 8, 2010, at 6:00 PM, Matthew LeMieux wrote:
> 
>> My HBase cluster just crashed.   One of the Region servers stopped (I do not yet know why).  After restarting it, the cluster seemed a bit wobbly, so I decided to shut down everything and restart fresh.  I did so (including ZooKeeper and HDFS). 
>> 
>> Upon restart, I'm getting the following message in the Master's log file repeating continuously with the number of ms waited counting up.  
>> 
>> 2010-09-09 00:54:58,406 WARN org.apache.hadoop.hbase.util.FSUtils: Waited 69188ms for lease recovery on hdfs://domU-12-31-39-18-12-05.compute-1.internal:9000/hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298 for DFSClient_hb_m_10.104.37.247:60000 on client 10.104.37.247 because current leaseholder is trying to recreate file.
>>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1068)
>>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1181)
>>       at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:422)
>>       at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>       at java.lang.reflect.Method.invoke(Method.java:597)
>>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
>>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
>>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
>>       at java.security.AccessController.doPrivileged(Native Method)
>>       at javax.security.auth.Subject.doAs(Subject.java:396)
>>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
>> 
>> 
>> The region servers are waiting with this being the final message in their log file: 
>> 
>> 2010-09-09 00:53:49,111 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 10.104.37.247:60000 that we are up
>> 
>> I've  been using this version for a little under a week without incident (http://people.apache.org/~jdcryans/hbase-0.89.20100830-candidate-1/ ).  
>> 
>> The HDFS comes from CDH3.  
>> 
>> Does anybody have any ideas on what I can do to get back up and running?
>> 
>> Thank you, 
>> 
>> Matthew
>> 
> 


Re: HBase crash, need help getting back up

Posted by Stack <st...@duboce.net>.
On Wed, Sep 8, 2010 at 6:21 PM, Matthew LeMieux <md...@mlogiciels.com> wrote:
> I tried moving that file to tmp.  It appears as though the master is no longer stuck, but clients are still not able to run queries.
>
> There aren't any messages passing by in the log files (just routine messages I see when the server isn't doing anything), but attempts to run queries resulted in NotServingRegionExceptions (e.g., count 'table').
>

Does it completely fail, or does it count for a while and then fail?

> I tried enable 'table', and found that after this command there was a huge amount of activity in the log files, and I was able to run queries again.
>
> There was no previous call to disable 'table', but for some reason HBase wasn't bringing tables/regions online.
>
> I'm not sure what caused the problem or even if the actions I took will fix it again in the future, but I am back up and running for now.
>
> FYI,
>

OK.

That it could not grab the lease is a problem.  You have the logs from
that time still?

St.Ack

Re: HBase crash, need help getting back up

Posted by Matthew LeMieux <md...@mlogiciels.com>.
I tried moving that file to tmp.  It appears as though the master is no longer stuck, but clients are still not able to run queries.  

There aren't any messages passing by in the log files (just routine messages I see when the server isn't doing anything), but attempts to run queries resulted in NotServingRegionExceptions (e.g., count 'table'). 

I tried enable 'table', and found that after this command there was a huge amount of activity in the log files, and I was able to run queries again.  

There was no previous call to disable 'table', but for some reason HBase wasn't bringing tables/regions online.  
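
In case anyone wants to script that nudge instead of typing it into the shell: enable is just HBaseAdmin under the covers.  A rough sketch against the client API (the class name is mine; the constructor style is from roughly this 0.89 vintage, so check it against your jar):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class ReEnableTable {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath, same as the shell does.
            HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
            String table = args[0];  // e.g. "table"
            if (!admin.isTableEnabled(table)) {
                admin.enableTable(table);  // asks the master to bring regions online
            }
        }
    }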

I'm not sure what caused the problem or even if the actions I took will fix it again in the future, but I am back up and running for now.  

FYI,

-Matthew

On Sep 8, 2010, at 6:00 PM, Matthew LeMieux wrote:

> My HBase cluster just crashed.   One of the Region servers stopped (I do not yet know why).  After restarting it, the cluster seemed a bit wobbly, so I decided to shut down everything and restart fresh.  I did so (including ZooKeeper and HDFS). 
> 
> Upon restart, I'm getting the following message in the Master's log file repeating continuously with the number of ms waited counting up.  
> 
> 2010-09-09 00:54:58,406 WARN org.apache.hadoop.hbase.util.FSUtils: Waited 69188ms for lease recovery on hdfs://domU-12-31-39-18-12-05.compute-1.internal:9000/hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298 for DFSClient_hb_m_10.104.37.247:60000 on client 10.104.37.247 because current leaseholder is trying to recreate file.
>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1068)
>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1181)
>        at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:422)
>        at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
> 
> 
> The region servers are waiting with this being the final message in their log file: 
> 
> 2010-09-09 00:53:49,111 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 10.104.37.247:60000 that we are up
> 
> I've  been using this version for a little under a week without incident (http://people.apache.org/~jdcryans/hbase-0.89.20100830-candidate-1/ ).  
> 
> The HDFS comes from CDH3.  
> 
> Does anybody have any ideas on what I can do to get back up and running?
> 
> Thank you, 
> 
> Matthew
> 


Re: HBase crash, need help getting back up

Posted by Stack <st...@duboce.net>.
On Wed, Sep 8, 2010 at 6:00 PM, Matthew LeMieux <md...@mlogiciels.com> wrote:
> 2010-09-09 00:54:58,406 WARN org.apache.hadoop.hbase.util.FSUtils: Waited 69188ms for lease recovery on hdfs://domU-12-31-39-18-12-05.compute-1.internal:9000/hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298 for DFSClient_hb_m_10.104.37.247:60000 on client 10.104.37.247 because current leaseholder is trying to recreate file.
>

This is the master trying to take over the lease on a file that a
regionserver had open so it can split its logs.  It never succeeds?
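
For background, the loop behind that WARN is roughly: open the dead
regionserver's log for append so the namenode starts lease recovery, and
retry while the AlreadyBeingCreatedException keeps coming back.  A
paraphrase, not the exact FSUtils code (the class name is mine):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LeaseNudge {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path log = new Path(args[0]);  // the .logs file named in the WARN
            while (true) {
                try {
                    // Opening for append asks the namenode to recover the lease;
                    // it throws (AlreadyBeingCreatedException, wrapped in a
                    // RemoteException) until the old holder's lease expires.
                    fs.append(log).close();
                    break;
                } catch (java.io.IOException e) {
                    Thread.sleep(1000);
                }
            }
        }
    }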

If you have CDH3b2 and the below noted version of hbase, dfs append
should be on by default.
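
One way to double-check what the master actually sees is to dump that
setting from whatever hdfs-site.xml is on its classpath.  A throwaway
sketch (the class name is mine):

    import org.apache.hadoop.conf.Configuration;

    public class AppendCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // hdfs-site.xml is not a default resource for a bare Configuration,
            // so pull it in explicitly before reading the flag.
            conf.addResource("hdfs-site.xml");
            System.out.println("dfs.support.append = "
                + conf.getBoolean("dfs.support.append", false));
        }
    }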

Was 10.104.37.247 down for sure?

What if you grep for that file in the namenode log?  What does it show?

>
> 2010-09-09 00:53:49,111 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 10.104.37.247:60000 that we are up
>

They are waiting for the master to come up.  It's busy splitting files
(or stuck splitting files, as per the above).


> I've  been using this version for a little under a week without incident (http://people.apache.org/~jdcryans/hbase-0.89.20100830-candidate-1/ ).
>

New candidate coming out soon.  Keep an eye out for it.  It fixes bugs,
though nothing in the area where you are currently suffering.

St.Ack

> The HDFS comes from CDH3.
>
> Does anybody have any ideas on what I can do to get back up and running?
>
> Thank you,
>
> Matthew
>
>