Posted to common-user@hadoop.apache.org by Robert J Berger <rb...@runa.com> on 2011/09/17 10:03:23 UTC

Any way to recover CORRUPT/MISSING blocks? (was: HELP NEEDED: What to do after crash and fsck says that 0.2% of blocks are missing. Namenode in safemode)

Just want to follow up, first, to thank QwertyM (aka Harsh Chouraria) for helping me out on the IRC channel. Well beyond the call of duty! It's people like Harsh who make the HBase/Hadoop community what it is, and one of the joys of working with this technology. And then one follow-on question on how to recover from CORRUPT blocks.

The main thing I learned (other than to be careful not to install packages on all the regionservers/slaves at one time, which can cause Out of Memory errors and crash all your Java processes) is this: if your namenode is stuck in safe mode, even though the namenode log says "Safe mode will be turned off automatically," and there is enough wrong with your HDFS filesystem (such as too many under-replicated blocks), it seems the namenode has to be taken out of safe mode before it can correct the problem.

I had convinced myself that the datanodes, by running their block verifications, were doing the work needed to get the namenode out of safe mode. I probably would have waited another few hours if Harsh hadn't helped me out and told me what probably everyone but me already knew:

hadoop dfsadmin -safemode leave
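For anyone hitting the same wall, the related dfsadmin subcommands are worth knowing too. A minimal sketch (the get/wait/leave subcommands are standard, but the exact output wording varies by Hadoop version):

```shell
# Check whether the namenode is currently in safe mode
hadoop dfsadmin -safemode get

# Block until the namenode leaves safe mode on its own
# (only useful if the reported-block ratio can still reach the threshold)
hadoop dfsadmin -safemode wait

# Force the namenode out of safe mode manually
hadoop dfsadmin -safemode leave
```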


CURRENT QUESTION ON CORRUPT BLOCKS:
------------------------------------------------------------------

After that the namenode did get all the under-replicated blocks replicated, but I ended up with about 200 blocks that fsck considered CORRUPT and/or MISSING. It looked like tables were being compacted when the outage occurred; otherwise I don't know why so many of the bad blocks are in old tables rather than in data being written at the time of the crash. The HDFS file timestamps also showed them as old.

I am not sure what the best thing to do now is, both to recover the CORRUPT/MISSING blocks and to get fsck to report the filesystem as healthy.

Is the best thing to just do:

hadoop fsck / -move

which will move what is left of the corrupt files into the HDFS /lost+found?

Is there any way to recover those blocks? 
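Before deciding between -move and -delete, it helps to see exactly which files own the bad blocks. A sketch, hedged because fsck option names vary across Hadoop versions (-list-corruptfileblocks only exists in newer releases; on older ones you have to grep the full report):

```shell
# Newer Hadoop versions: list the files with corrupt/missing blocks directly
hadoop fsck / -list-corruptfileblocks

# Older versions: dump the full report and filter for the damaged entries
hadoop fsck / -files -blocks -locations | grep -B1 "MISSING\|CORRUPT"

# Then either salvage what remains of the damaged files into /lost+found...
hadoop fsck / -move
# ...or delete them outright if they can be regenerated from elsewhere
hadoop fsck / -delete
```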

I may be able to get them from the backup/export of all our tables we did recently and I believe I can regenerate the rest. But it would be nice to know if there is a way to recover them if there was no other way.
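If it does come down to the backup, a table dumped with the stock HBase Export tool can be replayed with the matching Import MapReduce job. A sketch, where the table name and export path are hypothetical placeholders, not values from this thread:

```shell
# Replay a previous Export dump back into the (already existing) table.
# "mytable" and /backups/mytable are placeholders for your own names.
hbase org.apache.hadoop.hbase.mapreduce.Import mytable /backups/mytable
```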

Thanks in advance.
Rob
 
On Sep 16, 2011, at 12:50 AM, Robert J Berger wrote:

> Just had an HDFS/HBase instance where all the slave/regionserver processes crashed, but the namenode stayed up. I then did a proper shutdown of the namenode.
> 
> After bringing Hadoop back up, the namenode is stuck in safe mode. Fsck shows 235 corrupt/missing blocks out of 117280 blocks. All the slaves are logging "DataBlockScanner: Verification succeeded." As far as I can tell there are no errors on the datanodes.
> 
> Can I expect it to self-heal? Or do I need to do something to help it along? Any way to tell how long it will take to recover if I do have to just wait?
> 
> Other than the verification messages on the datanodes, the namenode fsck numbers are not changing and the namenode log continues to say:
> 
> The ratio of reported blocks 0.9980 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
> 
> The ratio has not changed for over an hour now.
> 
> If you happen to know the answer, please get back to me right away by email or on #hadoop IRC as I'm trying to figure it out now...
> 
> Thanks!

__________________
Robert J Berger - CTO
Runa Inc.
+1 408-838-8896
http://blog.ibd.com




Re: Any way to recover CORRUPT/MISSING blocks? (was: HELP NEEDED: What to do after crash and fsck says that 0.2% of blocks are missing. Namenode in safemode)

Posted by Tadas Makčinskas <ta...@bdc.lt>.

We are having an analogous situation here. Some of our servers went away for a while, and as we attached them back to the cluster it turned out that we have multiple missing/corrupt blocks and some mis-replicated blocks.

I still can't figure out how to restore the system to a normal working state. I can't find a clean way either to remove the corrupted files or to restore them. All of them are in the following folders:
   /user/<user>/.Trash
   /user/<user>/.staging

What steps would be advised to solve our issue?

thanks, Tadas
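Not an authoritative answer from the thread, but since .Trash holds files that were already deleted and .staging holds transient job files, one possible approach is simply to remove the damaged files rather than recover them. A sketch, assuming the data in those folders really is disposable:

```shell
# Both folders normally contain disposable data, so deleting the corrupt
# files is usually acceptable; -skipTrash keeps them out of .Trash again.
hadoop fs -rmr -skipTrash /user/<user>/.Trash
hadoop fs -rmr -skipTrash /user/<user>/.staging

# Re-run fsck afterwards to confirm the corrupt-block count dropped to zero
hadoop fsck /
```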

