Posted to hdfs-dev@hadoop.apache.org by "Tao Jie (JIRA)" <ji...@apache.org> on 2017/04/10 06:57:41 UTC

[jira] [Created] (HDFS-11638) Support marking a datanode dead by DFSAdmin

Tao Jie created HDFS-11638:
------------------------------

             Summary: Support marking a datanode dead by DFSAdmin
                 Key: HDFS-11638
                 URL: https://issues.apache.org/jira/browse/HDFS-11638
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Tao Jie


We have encountered the following situation:
A kernel error occurred on one slave node, with a message like
{code}
Apr 1 08:48:05 xxhdn033 kernel: BUG: soft lockup - CPU#0 stuck for 67s! [java:19096]
Apr 1 08:48:05 xxhdn033 kernel: Modules linked in: bridge stp llc fuse autofs4 bonding ipv6 uinput iTCO_wdt iTCO_vendor_support microcode power_meter acpi_ipmi ipmi_si ipmi_msghandler sb_edac edac_core joydev i2c_i801 i2c_core lpc_ich mfd_core sg ses enclosure ixgbe dca ptp pps_core mdio ext4 jbd2 mbcache sd_mod crc_t10dif ahci megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
{code}
The datanode process was still alive and continued to send heartbeats to the namenode, but the node could not respond to any command, and reading or writing blocks on it failed. As a result, requests to HDFS became slower because of the many read/write timeouts.
We would like to work around this case by adding a dfsadmin command that forcibly marks such an abnormal datanode as dead until it is restarted. When this situation happens again, clients would then avoid accessing the faulty datanode.
Any thoughts?
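The liveness model implied above could be sketched as follows. This is NOT Hadoop code: the class `DatanodeRegistry`, its methods, and the re-registration rule are all illustrative assumptions about how an admin "force dead" flag might interact with heartbeat-based liveness, under the assumption that a datanode restart (re-registration) lifts the override.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch (not Hadoop code) of heartbeat liveness plus a
// hypothetical admin override. All names here are assumptions.
class DatanodeRegistry {
    // Roughly in the spirit of the HDFS default dead-node interval (~10 min).
    private static final long HEARTBEAT_EXPIRY_MS = 10 * 60_000L;

    private static class NodeState {
        long lastHeartbeatMs;
        boolean forcedDead;
    }

    private final Map<String, NodeState> nodes = new ConcurrentHashMap<>();

    /** Heartbeat from a datanode; a re-registration (restart) clears a forced-dead mark. */
    void heartbeat(String nodeId, long nowMs, boolean isReRegistration) {
        NodeState s = nodes.computeIfAbsent(nodeId, k -> new NodeState());
        s.lastHeartbeatMs = nowMs;
        if (isReRegistration) {
            s.forcedDead = false; // restart lifts the admin override
        }
    }

    /** Hypothetical dfsadmin operation: mark the node dead regardless of heartbeats. */
    void forceDead(String nodeId) {
        NodeState s = nodes.get(nodeId);
        if (s != null) {
            s.forcedDead = true;
        }
    }

    /** Alive only if it heartbeated recently AND is not force-marked dead. */
    boolean isAlive(String nodeId, long nowMs) {
        NodeState s = nodes.get(nodeId);
        return s != null
                && !s.forcedDead
                && nowMs - s.lastHeartbeatMs < HEARTBEAT_EXPIRY_MS;
    }
}
```

With such a flag, a node stuck in a soft lockup would keep heartbeating but could be excluded from block placement and client reads until its restart re-registers it; a command name like `hdfs dfsadmin -markDead <datanode>` is purely hypothetical here.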




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-help@hadoop.apache.org