Posted to common-user@hadoop.apache.org by "Edupuganti, Sandhya" <sa...@amazon.com> on 2011/02/11 20:53:01 UTC

Debugging and fixing Safemode

Our Namenode goes into Safemode after every restart. It reports a ratio of .98xxx, whereas it needs 0.999 to leave safe mode. So I'm guessing there must be one or two files that are under-replicated.

Is there any way I can find out which files are under-replicated, so that I can re-copy them if I still have them, or delete them?

I don't want to end up with the Namenode silently stuck in safemode next time, causing all our jobs to fail.

Any pointers will be greatly appreciated.

Many Thanks
Sandhya

Re: Debugging and fixing Safemode

Posted by Matthew Foley <ma...@yahoo-inc.com>.
Hi Sandhya,
The threshold for leaving safemode automatically is configurable; it defaults to 0.999, but you can change the parameter "dfs.namenode.safemode.threshold-pct" to a different floating-point number in your config (older releases such as 0.20 call it "dfs.safemode.threshold.pct").  It is set to almost 100% by default, on the theory that (a) if you didn't hit 100%, some of your datanodes didn't come up or suffered data loss, and (b) you might want to know about that before letting the cluster start writing and changing files.
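
To put the reported ratio in perspective, here is a minimal Python sketch of the threshold arithmetic. It is a simplified model, not the Namenode's actual code, and the block counts are hypothetical. It suggests that a ratio near .98 usually means far more than one or two files are affected, which fits the "a datanode didn't come up" explanation.

```python
import math
from decimal import Decimal


def blocks_still_needed(total_blocks: int, reported_blocks: int,
                        threshold: float = 0.999) -> int:
    """Return how many more blocks must be reported before
    reported_blocks / total_blocks reaches the threshold.

    Simplified model (assumption): the Namenode's own rounding may differ.
    Decimal is used so that 0.999 is not distorted by binary floating point.
    """
    required = math.ceil(Decimal(total_blocks) * Decimal(str(threshold)))
    return max(0, required - reported_blocks)


# Hypothetical cluster: one million blocks, 98% of them reported so far.
total = 1_000_000
reported = 980_000
print("ratio:", reported / total)
print("blocks still needed:", blocks_still_needed(total, reported))
```

With these made-up numbers the cluster is tens of thousands of blocks short of the threshold, so the first thing to check is whether every datanode has registered and sent its block report.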

When the cluster comes out of safe mode, it should automatically fix any under-replicated blocks; you don't need to take action to fix them yourself.  But any files that are damaged by loss of ALL replicas of a block will appear corrupted to applications.

You can run fsck (for example "hadoop fsck /") to identify problem files, and use its -move option to move corrupted files to /lost+found or its -delete option to delete them.
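
If the fsck report is long, a small script can pull out just the under-replicated files. The sketch below is Python and rests on an assumption: that fsck prints one line per affected block of the form "<path>:  Under replicated <block>. Target Replicas is N but found M replica(s)." The exact wording varies between Hadoop versions (and changes when -files is passed), so check it against your own output. The sample paths and block names are made up.

```python
import re
from typing import Dict, Iterable

# Assumed line shape (verify against your Hadoop version's fsck output):
#   /some/file:  Under replicated blk_123_1001. Target Replicas is 3 but found 1 replica(s).
UNDER_REPLICATED = re.compile(r"^(?P<path>/.*?):\s+Under replicated\s")


def under_replicated_files(fsck_lines: Iterable[str]) -> Dict[str, int]:
    """Map each file path to the number of under-replicated blocks reported for it."""
    counts: Dict[str, int] = {}
    for line in fsck_lines:
        match = UNDER_REPLICATED.match(line.strip())
        if match:
            path = match.group("path")
            counts[path] = counts.get(path, 0) + 1
    return counts


# Hypothetical fsck output, for illustration only.
sample = [
    "/user/demo/a.txt:  Under replicated blk_1_1001. Target Replicas is 3 but found 1 replica(s).",
    "/user/demo/a.txt:  Under replicated blk_2_1002. Target Replicas is 3 but found 2 replica(s).",
    "/user/demo/b.txt: CORRUPT block blk_3_1003",
    "....Status: HEALTHY",
]
result = under_replicated_files(sample)
for path, count in sorted(result.items()):
    print(path, count)
```

To use it on a real cluster, save the output of "hadoop fsck /" to a file and pass the open file object to under_replicated_files() in place of the sample list.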

Hope this helps,
--Matt



Re: Debugging and fixing Safemode

Posted by Mahadev Konar <ma...@apache.org>.
http://hadoop.apache.org/common/docs/current/commands_manual.html#fsck

Fsck should be of help.

thanks
mahadev
