Posted to common-user@hadoop.apache.org by Peter Falk <pe...@bugsoft.nu> on 2010/07/07 14:46:14 UTC

Please help! Corrupt fsimage?

Hi,

After a restart of our live cluster today, the name node fails to start with
the log message seen below. There is a big file called edits.new in the
"current" folder that seems to be the only one that has received changes
recently (no changes to the edits or the fsimage for over a month). Is that
normal?

The last change to the edits.new file was right before shutting down the
cluster. It seems like the shutdown was unable to store valid fsimage,
edits, and edits.new files. The secondary name node's image does not include
the edits.new file, only edits and fsimage, which are identical to the name
node's versions. So no help from there.

Would appreciate any help in understanding what could have gone wrong. The
shutdown seemed to complete just fine, without any error message. Is there
any way to recreate the image from the data, or any other way to save our
production data?

Sincerely,
Peter

2010-07-07 14:30:26,949 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000
2010-07-07 14:30:26,960 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
2010-07-07 14:30:27,019 DEBUG org.apache.hadoop.security.UserGroupInformation: Unix Login: hbase,hbase
2010-07-07 14:30:27,149 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:298)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:881)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
2010-07-07 14:30:27,150 INFO org.apache.hadoop.ipc.Server: Stopping server on 9000
2010-07-07 14:30:27,151 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:298)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:881)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

RE: Please help! Corrupt fsimage?

Posted by Michael Segel <mi...@hotmail.com>.
I know this is a little late in the game...

You could have forced the cluster out of safe mode and then used fsck to copy the bad blocks out to the file system (see the help on fsck).
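
For the record, a rough sketch of what I mean (stock 0.20 commands; check
the fsck help for the exact flags in your version, and point it at the
paths you care about):

    hadoop dfsadmin -safemode leave    # force the namenode out of safe mode
    hadoop fsck / -files -blocks       # survey which files are damaged
    hadoop fsck / -move                # salvage corrupt files into /lost+found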

While that might not have helped recover lost data, it would have gotten your cloud back.

I would also find out where most of the corruption occurred. It sounds like you may have a bad disk.

HTH

-Mike



Re: Please help! Corrupt fsimage?

Posted by Peter Falk <pe...@bugsoft.nu>.
FYI, just a small update. After starting the data nodes, the block reporting
ratio was only 68% and the name node never went out of safe mode.
Apparently, too many edits were lost. We have resorted to formatting the
cluster for now; we have backups of the most essential data and have started
restoring them.
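
For anyone following along, this is roughly how we watched it happen (stock
0.20 commands):

    hadoop dfsadmin -safemode get    # is the namenode still in safe mode?
    hadoop dfsadmin -report          # summary of live datanodes during startup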

Of course, this data loss is very disappointing. We have kept copies of the
datanode data, as well as the corrupt fsimage and edits. If anyone has any
idea of how to restore the data, either by better merging the edits or by
reconstructing the fsimage from the datanode data somehow, please let me
know!

Time to get some sleep now; it has been a long day...

Sincerely,
Peter


Re: Please help! Corrupt fsimage?

Posted by Peter Falk <pe...@bugsoft.nu>.
Thanks for the information, Alex and Jean-Daniel! We have finally been able
to get the namenode to start, after patching the source code according to
the attached patch. It is based on the HDFS-1002 patch, but modified and
extended to fix additional NPEs. It is made for hadoop 0.20.1.

There seemed to be some corrupt edits and/or some missing files in fsimage
that caused NPEs during startup and the merging of the edits into fsimage.
Hope the attached patch may be of some use for people in similar situations.
We have not run an fsck yet, waiting for a raw copy of the data node data to
complete first. Let's hope that not too much was lost...
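
In case it helps, this is roughly how we applied it and rebuilt (0.20.1
still uses the ant build; the patch file name here is just what we called
ours):

    cd hadoop-0.20.1
    patch -p0 < hdfs-1002-extended.patch   # our modified/extended HDFS-1002 patch
    ant jar                                # rebuild the hadoop core jar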

Sincerely,
Peter


Re: Please help! Corrupt fsimage?

Posted by Jean-Daniel Cryans <jd...@apache.org>.
What Alex said; also, having been through that issue myself, this really
looks like https://issues.apache.org/jira/browse/HDFS-1024.

J-D


Re: Please help! Corrupt fsimage?

Posted by Alex Loddengaard <al...@cloudera.com>.
Hi Peter,

The edits.new file is used while the edits and fsimage are pulled to the
secondarynamenode.  Here's the process:

1) SNN pulls edits and fsimage
2) NN starts writing edits to edits.new
3) SNN sends new fsimage to NN
4) NN replaces its fsimage with the SNN fsimage
5) NN replaces edits with edits.new

Certainly taking a different fsimage and trying to apply edits to it won't
work.  Your best bet might be to take the 3-day-old fsimage with an empty
edits and delete edits.new.  But before you do any of this, make sure you
completely back up every directory configured in dfs.name.dir and
dfs.checkpoint.dir.  What are the timestamps on the fsimage files in each
dfs.name.dir and dfs.checkpoint.dir?
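
Something like this, with your configured paths substituted (the paths
below are placeholders):

    # back everything up before touching any of it
    tar czf nn-backup.tar.gz /path/to/dfs.name.dir /path/to/dfs.checkpoint.dir
    # compare the timestamps of fsimage/edits in each "current" directory
    ls -l /path/to/dfs.name.dir/current /path/to/dfs.checkpoint.dir/current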

Do the namenode and secondarynamenode have enough disk space?  Have you
consulted the logs to learn why the SNN/NN didn't properly update the
fsimage and edits log?

Hope this helps.

Alex


Re: Please help! Corrupt fsimage?

Posted by Peter Falk <pe...@bugsoft.nu>.
Just a little update. We found a working fsimage that was just a couple of
days older than the corrupt one. We tried to replace the fsimage with the
working one, keeping the edits and edits.new files and hoping that the
latest edits would still be applied. However, when starting the namenode,
the following error message appears. Any thoughts, ideas, or hints on how
to continue? Edit the edits files somehow?

TIA,
Peter

2010-07-07 16:21:10,312 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 28372
2010-07-07 16:21:11,162 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 8
2010-07-07 16:21:11,164 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 3315887 loaded in 0 seconds.
2010-07-07 16:21:11,164 DEBUG org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 9: /hbase/.logs/miller,60020,1274447474064/hlog.dat.1274706452423 numblocks : 1
clientHolder  clientMachine
2010-07-07 16:21:11,164 DEBUG org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /hbase/.logs/miller,60020,1274447474064/hlog.dat.1274706452423 because it does not exist
2010-07-07 16:21:11,164 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1006)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:982)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:194)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:615)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

2010-07-07 16:21:11,165 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at fanta/192.168.10.53
************************************************************/

