Posted to common-issues@hadoop.apache.org by "Kihwal Lee (JIRA)" <ji...@apache.org> on 2016/10/20 16:35:58 UTC

[jira] [Commented] (HADOOP-13738) DiskChecker should perform some disk IO

    [ https://issues.apache.org/jira/browse/HADOOP-13738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592298#comment-15592298 ] 

Kihwal Lee commented on HADOOP-13738:
-------------------------------------

The existing implementation is mainly for detecting a read-only file system (mkdir fails with EROFS) and unmounted storage (fails with EPERM).

We have seen cases where written data is lost after closing because delayed block allocation failed in the kernel. Since this failure is asynchronous to the file write/close, no user process received an error.  I think enabling {{syncOnClose}} will make such writes fail with {{EIO}}.  The write-sync test is more likely to detect this kind of condition, so I think this approach has merit.
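
For illustration only, a write-sync probe of the kind discussed above could look roughly like the sketch below. The class and method names ({{WriteSyncCheck}}, {{checkDirWithDiskIo}}) are hypothetical, and this is not the code from the attached patch. It assumes the calling process can create files in the directory being checked.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical helper, for illustration; not the actual DiskChecker code.
final class WriteSyncCheck {
    private WriteSyncCheck() {}

    /**
     * Writes one byte to a temporary file under dir and forces it to the
     * storage device. Any failure surfaces as an IOException.
     */
    static void checkDirWithDiskIo(File dir) throws IOException {
        // Fails with IOException if dir is missing or not writable.
        File probe = File.createTempFile("disk-check-", ".tmp", dir);
        try {
            try (FileOutputStream out = new FileOutputStream(probe)) {
                out.write(new byte[] {1});
                out.flush();
                // Ask the OS to push the data to the device before close,
                // so a deferred allocation or device error is reported to
                // this process instead of being lost after close.
                out.getFD().sync();
            }
        } finally {
            // Always remove the probe file, even when the check fails.
            Files.deleteIfExists(probe.toPath());
        }
    }
}
```

The explicit {{getFD().sync()}} before close is the important part: without it, the write may only reach the page cache and a later failure would not be seen by the caller.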

Another common disk failure mode involves read errors. Writes go through fine, but reading the data back can cause an unrecoverable error or hang. Unless the affected sector is used for file system metadata, no action will be taken at the file system level.  This is partly dealt with by adding the affected block to the volume scanner queue. The write-sync check will still catch many bad disks.

Is there any particular reason why it retries on FNFE ({{FileNotFoundException}})? When do you think that will happen?

> DiskChecker should perform some disk IO
> ---------------------------------------
>
>                 Key: HADOOP-13738
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13738
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Arpit Agarwal
>            Assignee: Arpit Agarwal
>         Attachments: HADOOP-13738.01.patch
>
>
> DiskChecker can fail to detect total disk/controller failures indefinitely. We have seen this in real clusters. DiskChecker performs simple permissions-based checks on directories which do not guarantee that any disk IO will be attempted.
> A simple improvement is to write some data and flush it to the disk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org