Posted to common-user@hadoop.apache.org by Robert J Berger <rb...@runa.com> on 2011/09/17 10:03:23 UTC
Any way to recover CORRUPT/MISSING blocks? (was: HELP NEEDED: What to do after crash and fsck says that 0.2% of blocks are missing. Namenode in safemode)
Just want to follow up, first, to thank QwertyM aka Harsh Chouraria for helping me out on the IRC channel. Well beyond the call of duty! It's people like Harsh that make the HBase/Hadoop community what it is, and one of the joys of working with this technology. And then one follow-on question on how to recover from CORRUPT blocks.
The main thing I learnt (other than being careful not to install packages on all the regionservers/slaves at once, which can cause Out of Memory errors and crash all your Java processes) is this:
If your namenode is stuck in safe mode, and there is enough wrong with your HDFS filesystem (like too many under-replicated blocks), then even though the namenode log says "Safe mode will be turned off automatically," it seems the namenode has to be taken out of safe mode manually before it can correct the problem.
I hallucinated that the datanodes, by doing verifications, were doing the work to get the namenode out of safe mode. I probably would have waited another few hours if Harsh hadn't helped me out and told me what probably everyone but me knew:
hadoop dfsadmin -safemode leave
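For reference, safe mode can be inspected as well as exited from the command line. A small sketch (on newer Hadoop releases the equivalent command is `hdfs dfsadmin`):

```shell
# Check whether the namenode is currently in safe mode
hadoop dfsadmin -safemode get

# Force the namenode out of safe mode -- only do this once you
# understand why it is stuck (e.g. a few permanently missing blocks
# keep the reported-block ratio below the threshold forever)
hadoop dfsadmin -safemode leave

# Confirm safe mode is now off
hadoop dfsadmin -safemode get
```

Checking with `-safemode get` first makes it clear whether the cluster is still counting block reports or genuinely wedged.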
CURRENT QUESTION ON CORRUPT BLOCKS:
------------------------------------------------------------------
After that the namenode did get all the under-replicated blocks replicated, but I ended up with about 200 blocks that fsck considered CORRUPT and/or MISSING. It looked like tables were being compacted when the outage occurred; otherwise I don't know why a lot of the bad blocks are in old tables rather than in data being written at the time of the crash. The HDFS file timestamps also showed them as being old.
I am not sure what the best thing to do now is to recover the CORRUPT/MISSING blocks and get fsck to report everything as healthy.
Is the best thing to just do:
hadoop fsck -move
which will move what is left of the corrupt blocks into hdfs /lost+found?
Is there any way to recover those blocks?
I may be able to get them from the backup/export of all our tables we did recently and I believe I can regenerate the rest. But it would be nice to know if there is a way to recover them if there was no other way.
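Before moving anything with `fsck -move`, it can help to map the corrupt blocks back to file names, so you know exactly which tables to restore from the export. A sketch of the kind of post-processing involved; the fsck output lines and paths below are illustrative, not a real cluster's output:

```shell
# "hadoop fsck / -files -blocks" prints one line per damaged file.
# Sample of that output, with made-up paths:
fsck_out='/hbase/mytable/123/data/blk1: CORRUPT block blk_-416
/hbase/mytable/456/data/blk2: MISSING 1 blocks of total size 67108864 B
/hbase/mytable/123/data/blk1: CORRUPT block blk_-417'

# Extract the unique list of affected files, so each one can be
# restored from a backup/export or regenerated:
printf '%s\n' "$fsck_out" | grep -E 'CORRUPT|MISSING' | cut -d: -f1 | sort -u
```

With that list in hand, `hadoop fsck -move` (or `-delete`) only needs to be run once you know every affected file is recoverable from elsewhere.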
Thanks in advance.
Rob
On Sep 16, 2011, at 12:50 AM, Robert J Berger wrote:
> Just had an HDFS/HBase instance where all the slave/regionservers processes crashed, but the namenode stayed up. I did proper shutdown of the namenode
>
> After bringing Hadoop back up, the namenode is stuck in safe mode. Fsck shows 235 corrupt/missing blocks out of 117280 blocks. All the slaves are logging "DataBlockScanner: Verification succeeded". As far as I can tell there are no errors on the datanodes.
>
> Can I expect it to self-heal? Or do I need to do something to help it along? Any way to tell how long it will take to recover if I do have to just wait?
>
> Other than the verification messages on the datanodes, the namenode fsck numbers are not changing and the namenode log continues to say:
>
> The ratio of reported blocks 0.9980 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
>
> The ratio has not changed for over an hour now.
>
> If you happen to know the answer, please get back to me right away by email or on #hadoop IRC as I'm trying to figure it out now...
>
> Thanks!
> __________________
> Robert J Berger - CTO
> Runa Inc.
> +1 408-838-8896
> http://blog.ibd.com
>
>
>
__________________
Robert J Berger - CTO
Runa Inc.
+1 408-838-8896
http://blog.ibd.com
Re: Any way to recover CORRUPT/MISSING blocks? (was: HELP NEEDED: What to do after crash and fsck says that 0.2% of blocks are missing. Namenode in safemode)
Posted by Tadas Makčinskas <ta...@bdc.lt>.
We are having an analogous situation here. Some of our servers went away for a while. When we attached them back to the cluster, it turned out that as a result we have multiple missing/corrupt blocks and some mis-replicated blocks.
I still can't figure out how to restore the system to a normal working state: neither a clean way to remove the corrupted files, nor a way to restore them. All of them are in the following folders:
/user/<user>/.Trash
/user/<user>/.staging
What steps would you advise to solve our issue?
Thanks, Tadas
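One commonly suggested approach when the corrupt files live only under trash and staging directories (which hold data that was already slated for deletion or was transient job state) is simply to delete them and re-run fsck. A hedged sketch with an illustrative username; older Hadoop releases use `hadoop fs -rmr` instead of `-rm -r`, and `-skipTrash` may not exist on very old versions:

```shell
# The corrupt files are confined to trash/staging, so removing them
# loses nothing that was meant to be kept. -skipTrash avoids creating
# yet another copy in the trash directory.
hadoop fs -rm -r -skipTrash /user/someuser/.Trash
hadoop fs -rm -r -skipTrash /user/someuser/.staging

# Re-check overall filesystem health afterwards
hadoop fsck /
```

If fsck still reports corrupt blocks outside those directories, `hadoop fsck -move` or `-delete` would be the next thing to look at, as discussed earlier in the thread.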