Posted to user@hadoop.apache.org by Chathuri Wimalasena <ka...@gmail.com> on 2016/12/27 19:54:41 UTC

How to recover from CORRUPT HDFS state

Hi,

We have a Hadoop cluster with 3 login nodes and 10 data nodes, running
Hadoop 2.7.1 with HBase 0.94.23. Both Hadoop and HBase run on login node 2
(ln02). We have recently been facing a serious issue with this cluster:
a large number of files in HDFS are in a corrupt state. We are unable to
figure out what caused this mass corruption or how to recover from it. HDFS
holds 40 TB of data, and we are worried that we might have to rebuild the
cluster from scratch because of these errors. Our cluster had some file
system issues recently; below is the list of events that took place before
the corruption appeared.

   - Nov 30 - The SSD drives on the ln02 node died, which triggered a
   kernel panic and a reboot.
   - Dec 20 - The ln02 file system was set to read-only and both drives on
   ln02 died. The sys admin removed and reinstalled the SSD drives on ln02,
   rebooted, and the node came back up. One data node was also down the
   same day due to a disk failure.
   - Dec 21 - The same thing happened as on Dec 20, and ln02 was rebooted.
   The sys admin replaced the failed SSD with another SSD. Another data
   node was down the same day.

On Nov 30 and Dec 20, after the sys admin rebooted the node, I was able to
restart Hadoop and HBase without any issue, and everything worked as
expected. But on Dec 21, when I restarted Hadoop, it automatically switched
to safe mode, and the hdfs fsck command showed a lot of corrupt and missing
files. The output of fsck is below.
............................Status: CORRUPT
 Total size:    46454858557036 B (Total open files size: 1340 B)
 Total dirs:    43405
 Total files:   122028
 Total symlinks:                0 (Files currently being written: 10)
 Total blocks (validated):      804832 (avg. block size 57719944 B) (Total
open file blocks (not validated): 10)
  ********************************
  UNDER MIN REPL'D BLOCKS:      413578 (51.386875 %)
  dfs.namenode.replication.min: 1
  CORRUPT FILES:        18683
  MISSING BLOCKS:       413578
  MISSING SIZE:         26785603097998 B
  CORRUPT BLOCKS:       413578
  ********************************
 Minimally replicated blocks:   391254 (48.613125 %)
 Over-replicated blocks:        26548 (3.2985766 %)
 Under-replicated blocks:       286 (0.035535365 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     1.4916517
 Corrupt blocks:                413578
 Missing replicas:              572 (0.023681387 %)
 Number of data-nodes:          10
 Number of racks:               1
FSCK ended at Sat Dec 24 13:25:10 EST 2016 in 8378 milliseconds


The filesystem under path '/' is CORRUPT


The HDFS web UI shows the message below.

Safe mode is ON. The reported blocks 391254 needs additional 412774 blocks
to reach the threshold 0.9990 of total blocks 804832. The number of live
datanodes 10 has reached the minimum number 0. Safe mode will be turned off
automatically once the thresholds have been reached.
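For the archive: the commands below (standard HDFS 2.7.x CLI; the file path
is a made-up example, not one from this cluster) give a more targeted view
of the damage than a full fsck dump, and are a safe first step since they
only read metadata:

```shell
# List only the corrupt/missing files instead of the full fsck report.
hdfs fsck / -list-corruptfileblocks

# For a specific suspect file, show its blocks and where the replicas live
# (the path here is an illustrative example).
hdfs fsck /hbase/example-table -files -blocks -locations
```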

We have also seen intermittent input/output errors on some data nodes.

Has anyone experienced a situation like this before? Any ideas on how to
recover would be greatly appreciated.
Thanks,
Chathuri

Re: How to recover from CORRUPT HDFS state

Posted by Chathuri Wimalasena <ka...@gmail.com>.
Thanks, Krishna, for the suggestion. We rebooted the whole cluster and
started HDFS back up, but HDFS is still in safe mode and more than half of
the blocks are in a CORRUPT state.
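For the archive: the sequence below is not from the thread itself, but it
is the usual manual path out of this state on Hadoop 2.7.x once the missing
blocks are confirmed unrecoverable (the destructive step is commented out
on purpose):

```shell
# Safe mode will not clear on its own while so many blocks are missing;
# it must be left manually before any repair commands take effect.
hdfs dfsadmin -safemode leave

# Quarantine the damaged files into /lost+found rather than deleting them,
# so any partially readable blocks can still be salvaged.
hdfs fsck / -move

# Only after confirming nothing more is recoverable (e.g. from the failed
# datanode disks), permanently remove the corrupt files:
# hdfs fsck / -delete
```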


Re: How to recover from CORRUPT HDFS state

Posted by Krishna Kalyan <kr...@gmail.com>.
Hello Chathuri,
I have experienced this before. When a disk cannot keep up with the write
ops, it tries to protect itself by locking up and going read-only. (Quick
fix: restart the server. Long-term fix: tune HBase parameters such as
write volume and file size.)

I am not a sys admin, so I might be wrong.

(You should manually check the state of all disks in your cluster.)
Check /var/log/messages to understand under what circumstances your SSDs
failed.
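For anyone reading later: a sketch of that disk check (not from the thread;
assumes smartmontools is installed and the /dev/sd? glob matches your
devices):

```shell
# Grep the syslog for block-device I/O errors around the failure dates.
grep -iE 'i/o error|blk_update_request|sd[a-z].*(error|fail)' /var/log/messages

# SMART health summary for each disk (device names here are examples;
# adjust the glob to your hardware).
for dev in /dev/sd?; do
    smartctl -H "$dev"
done
```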

Krishna

