Posted to hdfs-dev@hadoop.apache.org by "Wei-Chiu Chuang (JIRA)" <ji...@apache.org> on 2016/09/16 21:50:21 UTC

[jira] [Resolved] (HDFS-10777) DataNode should report&remove volume failures if DU cannot access files

     [ https://issues.apache.org/jira/browse/HDFS-10777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang resolved HDFS-10777.
------------------------------------
    Resolution: Invalid

Closing this jira as invalid. I'll file an improvement jira to add logging or a metric for when DataNode disks become flaky.

> DataNode should report&remove volume failures if DU cannot access files
> -----------------------------------------------------------------------
>
>                 Key: HDFS-10777
>                 URL: https://issues.apache.org/jira/browse/HDFS-10777
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.8.0
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>         Attachments: HDFS-10777.01.patch
>
>
> HADOOP-12973 refactored DU and made it pluggable. The refactoring has a side effect: if DU encounters an exception, the exception is caught, logged, and ignored, which essentially fixes HDFS-9908 (where runaway exceptions prevented DataNodes from handshaking with NameNodes).
> However, this "fix" is not good: if the disk is bad, the DataNode takes no immediate action beyond logging the exception. The existing {{FsDatasetSpi#checkDataDir}} has been reduced to blindly checking only a small number of directories. When a disk goes bad, often only a few files are bad initially, so checking a small number of directories makes it easy to overlook the degraded disk.
> I propose that, in addition to logging the exception, the DataNode should proactively verify that the files are inaccessible, remove the volume, and make the failure visible in JMX, so that administrators can spot it via monitoring systems.
> A different fix, based on HDFS-9908, is needed before Hadoop 2.8.0.
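
A minimal sketch of the proposed handling, for illustration only. All class and method names below are hypothetical stand-ins, not from the actual patch or the Hadoop codebase; the real implementation would hook into {{FsDatasetSpi}} and the DataNode's JMX metrics.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of the proposed behavior: when a DU scan throws,
 * probe the volume's accessibility and, if the probe fails, remove the
 * volume from service and surface the failure for monitoring.
 */
public class VolumeFailureSketch {
    private final List<String> failedVolumes = new ArrayList<>();

    /** Called when a DU scan of a volume throws an exception. */
    public void onDuException(File volumeDir, IOException cause) {
        // Instead of only logging the exception, proactively probe the volume.
        if (!isAccessible(volumeDir)) {
            // Record the volume as failed (stand-in for removing it from
            // service and updating a JMX-visible counter).
            failedVolumes.add(volumeDir.getPath());
        }
    }

    /** Cheap accessibility probe: can we stat, read, and list the directory? */
    private boolean isAccessible(File dir) {
        return dir.exists() && dir.canRead() && dir.listFiles() != null;
    }

    /** Exposed for monitoring (stand-in for a JMX metric). */
    public int getNumFailedVolumes() {
        return failedVolumes.size();
    }

    public static void main(String[] args) {
        VolumeFailureSketch dn = new VolumeFailureSketch();
        // Simulate a DU exception on a directory that does not exist.
        dn.onDuException(new File("/nonexistent/volume"),
                new IOException("du failed"));
        System.out.println(dn.getNumFailedVolumes()); // prints 1
    }
}
```

The key design point is that the accessibility probe runs only on the exception path, so healthy volumes pay no extra cost; the counter gives administrators a signal they can alert on, rather than a log line that is easy to miss.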



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-help@hadoop.apache.org