Posted to common-user@hadoop.apache.org by Torsten Curdt <tc...@apache.org> on 2008/07/30 20:09:15 UTC

corrupted fsimage and edits

Just a bit of feedback here.

One of our hadoop 0.16.4 namenodes ran into a disk-full incident
today. No second backup namenode was in place. Both files, fsimage and
edits, seem to have gotten corrupted. After quite a bit of debugging
and fiddling with a hex editor we managed to resurrect the files and
continue with only minor loss.

Thankfully this only happened on a development cluster, not on
production. But isn't that something that should NEVER happen?

cheers
--
Torsten

Re: corrupted fsimage and edits

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.
You should also run a secondary name-node, which performs namespace checkpoints and shrinks the edits log file.
This is exactly the case where the checkpoint image comes in handy.
http://wiki.apache.org/hadoop/FAQ#7
In the recent release you can start the primary node directly from the secondary image.
In older releases you need to move some files around.
--Konstantin

Raghu Angadi wrote:
> Torsten Curdt wrote:
>>
>> On Jul 30, 2008, at 20:35, Raghu Angadi wrote:
>>
>>> You should always have more than one location (preferably on 
>>> different disks) for fsimage and editslog.
>>
>> On production we do frequent backups. Is there a mechanism inside 
>> hadoop now to do something like that? The "more than one location" 
>> bit sounds a little like that.
> 
> You can specify multiple directories for "dfs.name.dir", in which case 
> fsimage and editslog are written to multiple places. If one of these 
> goes bad, you can use the other one.
> 
> See http://wiki.apache.org/hadoop/FAQ#15
> 
> Raghu.
> 
>>> A few months back I proposed keeping a checksum for each record in 
>>> fsimage and editslog, so that the NameNode could recover transparently 
>>> from such corruption when more than one copy is available. It was 
>>> not prioritized since no such failures had been observed.
>>>
>>> You should certainly report these cases; that will help the feature 
>>> gain more traction.
>>
>> Will file a bug report tomorrow.
>>
>> cheers
>> -- 
>> Torsten
> 
> 

Re: corrupted fsimage and edits

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Torsten Curdt wrote:
> 
> On Jul 30, 2008, at 20:35, Raghu Angadi wrote:
> 
>> You should always have more than one location (preferably on different 
>> disks) for fsimage and editslog.
> 
> On production we do frequent backups. Is there a mechanism inside 
> hadoop now to do something like that? The "more than one location" 
> bit sounds a little like that.

You can specify multiple directories for "dfs.name.dir", in which case 
fsimage and editslog are written to multiple places. If one of these 
goes bad, you can use the other one.

See http://wiki.apache.org/hadoop/FAQ#15
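
To make the redundancy idea concrete, here is a minimal Python sketch. It is not NameNode code. It only assumes that "dfs.name.dir" is given as a comma-separated list of directories, and the function names and example paths are hypothetical. Every write goes to each directory, and a read falls back to the next directory when one copy is unavailable.

```python
import os


def parse_name_dirs(value):
    """Split a comma-separated directory list into clean paths."""
    return [p.strip() for p in value.split(",") if p.strip()]


def write_mirrored(directories, name, data):
    """Write the same bytes to `name` in every directory.

    Returns the directories where the write succeeded, so the caller
    can notice when a volume has gone bad (for example, disk full).
    """
    written = []
    for directory in directories:
        path = os.path.join(directory, name)
        try:
            with open(path, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())  # push the bytes to disk before moving on
            written.append(directory)
        except OSError:
            # This copy failed; the remaining directories may still be fine.
            continue
    return written


def read_first_available(directories, name):
    """Return the contents of `name` from the first directory that has it."""
    for directory in directories:
        try:
            with open(os.path.join(directory, name), "rb") as f:
                return f.read()
        except OSError:
            continue
    raise FileNotFoundError(name)


# Hypothetical value; real paths should sit on different physical disks.
name_dirs = parse_name_dirs("/disk1/dfs/name, /disk2/dfs/name")
```

Note that this only protects against losing a whole copy. It does not detect a copy that is present but damaged.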

Raghu.

>> A few months back I proposed keeping a checksum for each record in 
>> fsimage and editslog, so that the NameNode could recover transparently 
>> from such corruption when more than one copy is available. It was 
>> not prioritized since no such failures had been observed.
>>
>> You should certainly report these cases; that will help the feature 
>> gain more traction.
> 
> Will file a bug report tomorrow.
> 
> cheers
> -- 
> Torsten


Re: corrupted fsimage and edits

Posted by Torsten Curdt <tc...@apache.org>.
On Jul 30, 2008, at 20:35, Raghu Angadi wrote:

> You should always have more than one location (preferably on  
> different disks) for fsimage and editslog.

On production we do frequent backups. Is there a mechanism inside
hadoop now to do something like that? The "more than one location"
bit sounds a little like that.

> A few months back I proposed keeping a checksum for each record in
> fsimage and editslog, so that the NameNode could recover
> transparently from such corruption when more than one copy is
> available. It was not prioritized since no such failures had been
> observed.
>
> You should certainly report these cases; that will help the feature
> gain more traction.

Will file a bug report tomorrow.

cheers
--
Torsten

Re: corrupted fsimage and edits

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
You should always have more than one location (preferably on different 
disks) for fsimage and editslog.

A few months back I proposed keeping a checksum for each record in 
fsimage and editslog, so that the NameNode could recover transparently 
from such corruption when more than one copy is available. It was not 
prioritized since no such failures had been observed.

You should certainly report these cases; that will help the feature gain 
more traction.
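
As a rough illustration of the per-record checksum idea (a sketch only, not the actual proposal or any Hadoop on-disk format): each record is stored as a length, a CRC32, and the payload, and recovery takes each record from whichever copy still passes its checksum. The record layout and function names are assumptions made for this example, and the sketch assumes the damage hits payload bytes or the tail of the file rather than the length field.

```python
import struct
import zlib

HEADER = ">II"  # assumed layout: 4-byte length, then 4-byte CRC32
HEADER_SIZE = struct.calcsize(HEADER)


def encode_record(payload):
    """Prefix a payload with its length and CRC32."""
    return struct.pack(HEADER, len(payload), zlib.crc32(payload)) + payload


def decode_records(blob):
    """Return one entry per record: the payload, or None if it fails its checksum."""
    records = []
    offset = 0
    while offset + HEADER_SIZE <= len(blob):
        length, crc = struct.unpack_from(HEADER, blob, offset)
        start = offset + HEADER_SIZE
        payload = blob[start:start + length]
        intact = len(payload) == length and zlib.crc32(payload) == crc
        records.append(payload if intact else None)
        offset = start + length
    return records


def recover(copies):
    """Rebuild the record list, taking each record from any copy where it is intact."""
    decoded = [decode_records(c) for c in copies]
    count = max(len(d) for d in decoded)
    result = []
    for i in range(count):
        good = next(
            (d[i] for d in decoded if i < len(d) and d[i] is not None), None
        )
        if good is None:
            raise ValueError("record %d is corrupt in every copy" % i)
        result.append(good)
    return result
```

With two copies damaged in different places, `recover` can still return the full log, which is the transparent recovery described above. If the same record is bad in every copy, it raises an error instead of guessing.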

Raghu.

Torsten Curdt wrote:
> Just a bit of feedback here.
> 
> One of our hadoop 0.16.4 namenodes ran into a disk-full incident 
> today. No second backup namenode was in place. Both files, fsimage and 
> edits, seem to have gotten corrupted. After quite a bit of debugging and 
> fiddling with a hex editor we managed to resurrect the files and continue 
> with only minor loss.
> 
> Thankfully this only happened on a development cluster, not on 
> production. But isn't that something that should NEVER happen?
> 
> cheers
> -- 
> Torsten