Posted to user@hadoop.apache.org by Chathuri Wimalasena <ka...@gmail.com> on 2016/12/22 16:02:49 UTC

Safe mode on after restarting hadoop

Hi,

We have a Hadoop cluster with 10 data nodes. We had a disk failure on the
login node where the namenode and secondary namenode run, and we replaced
the failed disk. The failed disk did not affect the data; it only affected
the operating system. After replacing the failed disk and restarting the
Hadoop services, Hadoop stays in safe mode and does not let us run jobs.
The message below shows in the namenode UI.

Safe mode is ON. The reported blocks 391253 needs additional 412776 blocks
to reach the threshold 0.9990 of total blocks 804833. The number of live
datanodes 10 has reached the minimum number 0. Safe mode will be turned off
automatically once the thresholds have been reached.

I can see that all the data nodes are up and running. Also, when I check
for corrupt blocks, it reports 0.

hdfs fsck / -list-corruptfileblocks
Connecting to namenode via
http://ln02:50070/fsck?ugi=hadoop&listcorruptfileblocks=1&path=%2F
The filesystem under path '/' has 0 CORRUPT files

Any idea what's going on? I can forcefully leave safe mode, but I'm
worried that it might cause data corruption. Are there any safety steps I
should take before leaving safe mode forcefully?
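
For reference, the commands I have in mind for checking and leaving safe
mode are roughly these (standard HDFS CLI, so just a sketch):

hdfs dfsadmin -safemode get    # prints whether safe mode is ON or OFF
hdfs dfsadmin -safemode leave  # force the namenode out of safe mode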

Thanks,
Chathuri

Re: Safe mode on after restarting hadoop

Posted by Anu Engineer <ae...@hortonworks.com>.
Hi Chathuri,

This means that the NN has not heard about all the blocks it is supposed to hear about from the datanodes. Since all the datanodes are functional, here are some things to check.

1. Is there any volume loss on the data nodes? (A quick check is sketched below.)

2. You mentioned that you had a failure on the Namenode. Are you sure that the Namenode metadata was not affected in any way? For example, you might have accidentally copied an older snapshot of the Namenode metadata.
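
A rough way to run both checks (the log and metadata paths below are only examples; they depend on how your cluster is installed):

# block totals and per-datanode status as the namenode sees them
hdfs dfsadmin -report

# each datanode logs disk/volume failures; grep the datanode log on each node
grep -i "volume" /var/log/hadoop/hadoop-*-datanode-*.log

# for point 2, check that the fsimage/edits files under dfs.namenode.name.dir
# (hdfs-site.xml) are as recent as you expect and were not restored from an older copy
ls -l /path/to/dfs/name/current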

This warning should go away once all 10 data nodes have reported in.

Leaving safe mode by itself is not going to cause data corruption, but HDFS is trying to tell you about a problem, so it would be better to investigate it rather than just ignore it.
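
For reference, the threshold and minimum-datanode numbers in that safe mode message come from two hdfs-site.xml settings; a quick way to confirm what your cluster uses (stock CLI, usual defaults in the comments):

hdfs getconf -confKey dfs.namenode.safemode.threshold-pct   # default 0.999f
hdfs getconf -confKey dfs.namenode.safemode.min.datanodes   # default 0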

Thanks
Anu



Re: Safe mode on after restarting hadoop

Posted by Chathuri Wimalasena <ka...@gmail.com>.
It seems a lot of files in HDFS are in a corrupt state. Is there a way to
recover corrupt files?
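
In case it helps to narrow this down, the standard fsck flags for listing
exactly which files are affected are (just a sketch):

hdfs fsck / -list-corruptfileblocks     # files with corrupt or missing blocks
hdfs fsck / -files -blocks -locations   # per-file block and replica detail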

On Thu, Dec 22, 2016 at 1:39 PM, Mingliang Liu <li...@apache.org> wrote:

> Anu gave a good analysis. Another simple possibility is that the NN just
> takes time to process the block reports before leaving safe mode. You can
> monitor the safe mode report for progress, and check the NN log for more
> information.
>
> L

Re: Safe mode on after restarting hadoop

Posted by Mingliang Liu <li...@apache.org>.
Anu gave a good analysis. Another simple possibility is that the NN just takes time to process the block reports before leaving safe mode. You can monitor the safe mode report for progress, and check the NN log for more information.
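
A simple way to watch the progress from the command line (the log path below is only an example; it varies by install):

hdfs dfsadmin -safemode get                                          # ON / OFF; 'wait' blocks until safe mode turns off
grep "Safe mode" /var/log/hadoop/hadoop-*-namenode-*.log | tail -5   # status lines similar to the message in the UI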

L
