Posted to common-user@hadoop.apache.org by Malcolm Matalka <mm...@millennialmedia.com> on 2009/10/05 20:41:14 UTC

Recovering Corrupt FS Image on Amazon EBS

I am using Amazon EC2 with our HDFS on EBS volumes.  While we were running
a job today, our EBS volumes apparently died without warning.  You can see
the logfile is even cut off mid-line:

 

2009-10-05 13:37:00,321 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root,bin,daemon,sys,adm,disk,wheel     ip=/10.244.195.64 cmd=open src=/user/root/reach.intermediate/20090928.1day/part-00058      dst=null perm=null

2009-10-05 13:37:01,901 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root,bin,daemon,sys,adm,disk,wheel     ip=/10.242.15.15 cmd=open        src=/user/root/reach.intermedi

 

 

In the event of an error, we bring all the instances down.  I then tried
to rerun the job (bringing all the instances back up and then attaching
to EBS volumes) and the namenode will not come up.  The logfile gives
the error at the bottom.  What are my options here to recover the file
system?

 

Thanks,
Malcolm

 

 

/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ip-10-243-26-82/10.243.26.82
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.19.0
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008
************************************************************/
2009-10-05 14:20:02,120 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=50001
2009-10-05 14:20:02,150 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: ip-10-243-26-82.ec2.internal/10.243.26.82:50001
2009-10-05 14:20:02,154 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
2009-10-05 14:20:02,254 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.ganglia.GangliaContext
2009-10-05 14:20:02,417 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=root,root,bin,daemon,sys,adm,disk,wheel
2009-10-05 14:20:02,417 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2009-10-05 14:20:02,417 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2009-10-05 14:20:02,435 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.ganglia.GangliaContext
2009-10-05 14:20:02,436 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean
2009-10-05 14:20:02,751 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 23989
2009-10-05 14:20:06,859 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 0
2009-10-05 14:20:06,860 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 3800773 loaded in 4 seconds.

2009-10-05 14:20:07,451 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: ""
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
        at java.lang.Integer.parseInt(Integer.java:468)
        at java.lang.Short.parseShort(Short.java:120)
        at java.lang.Short.parseShort(Short.java:78)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readShort(FSEditLog.java:1261)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:556)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:973)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:793)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:352)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:290)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:208)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:194)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:859)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868)

2009-10-05 14:20:07,451 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ip-10-243-26-82/10.243.26.82

 


Re: Recovering Corrupt FS Image on Amazon EBS

Posted by Allen Wittenauer <aw...@linkedin.com>.
On 10/5/09 11:57 AM, "Malcolm Matalka" <mm...@millennialmedia.com> wrote:

> Sadly, I am not writing it to multiple directories.  I will be now. Do you
> have a link to information on best practices in this regard?

I know there are some references in my "Hadoop 24/7" ApacheCon presentation
from last year.  Does that count? ;)

http://wiki.apache.org/hadoop/NameNode is probably the best link on
NameNode configuration.  We should probably set up a real "best practices"
link rather than having info scattered around the site.
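
For anyone wanting to check their own setup, here is a rough Python sketch that reads a Hadoop site file and reports how many directories dfs.name.dir lists. The property name is the 0.19-era one; the sample paths and the helper name name_dirs are made up for illustration, and the XML is inlined so the sketch runs on its own.

```python
import xml.etree.ElementTree as ET


def name_dirs(site_xml_text):
    """Return the directories listed under dfs.name.dir (comma-separated)."""
    root = ET.fromstring(site_xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name", "").strip() == "dfs.name.dir":
            value = prop.findtext("value", "")
            return [d.strip() for d in value.split(",") if d.strip()]
    return []


# Hypothetical configuration: two directories on separate mounts.
SAMPLE = """<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/ebs1/dfs/name,/mnt/nfs/dfs/name</value>
  </property>
</configuration>"""

dirs = name_dirs(SAMPLE)
if len(dirs) < 2:
    print("warning: fsimage and edits are written to a single directory")
else:
    print("NameNode metadata is written to:", ", ".join(dirs))
```

Listing directories on different devices is the point; two directories on the same volume would not have helped in a case like this one.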

> The upside is, the jobs I was running were all expendable so I can
> afford to lose what was written out.  Removing the edits file should
> only impact data I was writing, correct?

Any sort of change recorded since the last checkpoint, not just the data
you were writing (permission changes, etc.).
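
Before removing or trimming anything, it is probably worth copying the whole name directory somewhere safe. A minimal Python sketch, assuming the usual layout of fsimage and edits under a current/ subdirectory; the function name backup_name_dir and all paths are hypothetical, and the demo runs against a throwaway temp directory rather than a real NameNode.

```python
import shutil
import tempfile
import time
from pathlib import Path


def backup_name_dir(name_dir, backup_root):
    """Copy the whole NameNode storage directory to a timestamped backup."""
    src = Path(name_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{src.name}-{stamp}"
    # copytree refuses to write into an existing destination,
    # so an earlier backup is never overwritten.
    shutil.copytree(src, dest)
    return dest


# Demonstration on a temp directory that mimics the assumed layout.
work = Path(tempfile.mkdtemp())
name_dir = work / "name"
(name_dir / "current").mkdir(parents=True)
(name_dir / "current" / "fsimage").write_bytes(b"image-bytes")
(name_dir / "current" / "edits").write_bytes(b"edits-bytes")

backup = backup_name_dir(name_dir, work / "backups")
print("backed up to", backup)
```

With a copy in hand, any experiment on the edits file can be undone.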

> 
> Thank you Allen

No problem.  Good luck! :)



RE: Recovering Corrupt FS Image on Amazon EBS

Posted by Malcolm Matalka <mm...@millennialmedia.com>.
Sadly, I am not writing it to multiple directories.  I will be now. Do you
have a link to information on best practices in this regard?

The upside is, the jobs I was running were all expendable so I can
afford to lose what was written out.  Removing the edits file should
only impact data I was writing, correct?

Thank you Allen




Re: Recovering Corrupt FS Image on Amazon EBS

Posted by Allen Wittenauer <aw...@linkedin.com>.


On 10/5/09 11:41 AM, "Malcolm Matalka" <mm...@millennialmedia.com> wrote:
> In the event of an error, we bring all the instances down.  I then tried
> to rerun the job (bringing all the instances back up and then attaching
> to EBS volumes) and the namenode will not come up.  The logfile gives
> the error at the bottom.  What are my options here to recover the file
> system?

Your edits file is corrupt.  You have some choices:

A) If you ran a secondary and ran it frequently, hacking the edits off at
the point of corruption will set the HDFS pretty close to the point of the
last run.

B) If you didn't run the secondary that often or you don't make that many
changes, you may just want to ignore the edits file and bring up the HDFS
without it.

C) Check your other directory--you -are- writing fsimage and edits to two
different dirs, right?  The other edits file may be healthier.

But I suspect you're looking at data loss. :(
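
For option C, a quick way to see whether the copies differ is to compare size and checksum of each edits file. This is only a sketch in Python: the current/edits layout is assumed, the helper names describe_edits and copies_differ are invented here, and the demo builds two fake name directories in a temp location instead of touching real ones.

```python
import hashlib
import tempfile
from pathlib import Path


def describe_edits(name_dirs):
    """Return (directory, size_in_bytes, sha256_hex) for each edits file found."""
    report = []
    for d in name_dirs:
        edits = Path(d) / "current" / "edits"
        if edits.is_file():
            data = edits.read_bytes()
            report.append((str(d), len(data), hashlib.sha256(data).hexdigest()))
    return report


def copies_differ(report):
    """True when at least two edits files have different contents."""
    return len({digest for _, _, digest in report}) > 1


# Demonstration: the second copy is cut short, as a damaged volume might leave it.
work = Path(tempfile.mkdtemp())
for name, payload in (("name1", b"0123456789"), ("name2", b"01234")):
    (work / name / "current").mkdir(parents=True)
    (work / name / "current" / "edits").write_bytes(payload)

report = describe_edits([work / "name1", work / "name2"])
for directory, size, digest in report:
    print(directory, size, digest[:12])
print("copies differ:", copies_differ(report))
```

A larger file is not automatically the healthy one, but a mismatch at least tells you there is a second candidate worth trying.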

> 2009-10-05 14:20:07,451 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: ""
>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>         at java.lang.Integer.parseInt(Integer.java:468)
>         at java.lang.Short.parseShort(Short.java:120)
>         at java.lang.Short.parseShort(Short.java:78)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readShort(FSEditLog.java:1261)
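
The trace shows Short.parseShort being handed an empty string from FSEditLog.readShort, which is what a damaged record would look like. The Python sketch below imitates that failure with a made-up record layout (a 2-byte big-endian length followed by that many bytes of decimal text). It is an illustration only, not Hadoop's actual edits format, and the function names are hypothetical. One plausible cause, assumed here, is a zero-filled region: the length reads as 0, the field is empty, and numeric parsing fails.

```python
import io
import struct


def read_text_field(stream):
    """Read a length-prefixed text field (assumed layout, for illustration)."""
    header = stream.read(2)
    if len(header) < 2:
        raise EOFError("record truncated before length prefix")
    (length,) = struct.unpack(">H", header)
    return stream.read(length).decode("utf-8")


def read_short(stream):
    """Parse the text field as an integer, like parsing a short from a string."""
    return int(read_text_field(stream))


# A healthy record: length 1, text "3".
healthy = io.BytesIO(struct.pack(">H", 1) + b"3")
print("healthy value:", read_short(healthy))

# A damaged record: two zero bytes give length 0, so the field is "".
damaged = io.BytesIO(b"\x00\x00")
try:
    read_short(damaged)
except ValueError as err:
    print("parse failed:", err)
```

The loader has no way to skip such a record, which is why the NameNode gives up at that point instead of replaying the rest of the file.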