Posted to common-issues@hadoop.apache.org by "Arpit Agarwal (JIRA)" <ji...@apache.org> on 2018/07/02 20:37:00 UTC

[jira] [Comment Edited] (HADOOP-15493) DiskChecker should handle disk full situation

    [ https://issues.apache.org/jira/browse/HADOOP-15493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530413#comment-16530413 ] 

Arpit Agarwal edited comment on HADOOP-15493 at 7/2/18 8:36 PM:
----------------------------------------------------------------

{quote}I think we have to rely on the system to detect a failed controller/drive. Maybe we should just attempt to provoke the disk to go read-only. Have the DN periodically write a file to its storages every n-many mins – but take no action upon failure. Instead rely on the normal disk check to subsequently discover the disk is read-only.
{quote}
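
As a minimal illustration of the quoted proposal (not actual DataNode code; the executor, interval, and probe file name below are assumptions for the sketch): write a small file to each storage directory on a schedule, take no action on failure, and leave it to the normal disk check to discover that the disk has gone read-only or failed.
{code:java}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch only: a background task that periodically writes to each storage
 *  directory but takes no action itself when a write fails. */
public final class PeriodicWriteProbe {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public void start(List<File> storageDirs, long intervalMinutes) {
    scheduler.scheduleAtFixedRate(() -> {
      for (File dir : storageDirs) {
        File probe = new File(dir, ".probe");
        try (FileOutputStream out = new FileOutputStream(probe)) {
          out.write(1);
          out.getFD().sync();   // force the write down to the device
        } catch (IOException ignored) {
          // Take no action here: the regular disk check is expected to
          // subsequently discover that the disk is read-only or failed.
        } finally {
          probe.delete();
        }
      }
    }, intervalMinutes, intervalMinutes, TimeUnit.MINUTES);
  }
}
{code}
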
When you say 'we have to rely on the system', do you mean the OS?

We saw disk failures (and, more rarely, controller failures) go undetected indefinitely. Application requests would fail and trigger the disk checker, which always succeeded. We had customers hit data loss after multiple disk failures went undetected over a few days.

 
{quote}I don't think this disk-is-writable check should be in common.
{quote}
We can make the write check HDFS-internal. We still need a disk-full check if the write fails. Perhaps the safest option is a free-space threshold that avoids false positives (a full but healthy disk marked as failed) while allowing false negatives.
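
Purely as a sketch of that threshold idea (the class name, method name, and threshold value below are made up for illustration and are not from the patch), the heuristic could key off the volume's remaining usable space:
{code:java}
import java.io.File;

/** Illustrative only: decides whether a failed write is plausibly due to the
 *  disk being full rather than to a failed disk. Not part of HADOOP-15493. */
public final class DiskFullHeuristic {
  // Assumed value; a real check would likely make this configurable.
  static final long DISK_FULL_THRESHOLD_BYTES = 64L * 1024 * 1024;

  public static boolean isLikelyDiskFull(File dir) {
    // Allows false negatives (a bad disk that also happens to be full is not
    // flagged) while avoiding false positives (a merely full disk is not
    // marked as failed).
    return dir.getUsableSpace() < DISK_FULL_THRESHOLD_BYTES;
  }
}
{code}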


> DiskChecker should handle disk full situation
> ---------------------------------------------
>
>                 Key: HADOOP-15493
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15493
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Arpit Agarwal
>            Assignee: Arpit Agarwal
>            Priority: Critical
>         Attachments: HADOOP-15493.01.patch, HADOOP-15493.02.patch
>
>
> DiskChecker#checkDirWithDiskIo creates a file to verify that the disk is writable.
> However, the check should not fail when file creation fails because the disk is full. This avoids marking full disks as _failed_.
> Reported by [~kihwal] and [~daryn] in HADOOP-15450. 
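
For context, a rough plain-Java sketch of the behavior the description asks for (a stand-in, not the actual DiskChecker#checkDirWithDiskIo code; the probe file name, the nested exception type, and the zero-usable-space test are assumptions): the write probe still verifies that the directory is writable, but a failure on a volume that appears full is tolerated rather than reported as a disk error.
{code:java}
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public final class WritableDirCheckSketch {

  /** Stand-in for Hadoop's DiskChecker.DiskErrorException. */
  public static class DiskErrorException extends IOException {
    DiskErrorException(String msg, Throwable cause) { super(msg, cause); }
  }

  /** Verifies the directory is writable by creating, syncing, and deleting a
   *  small probe file. A write failure on a volume with no usable space left
   *  is tolerated so that a full disk is not marked as failed. */
  public static void checkDirWithDiskIo(File dir) throws DiskErrorException {
    File probe = new File(dir, ".disk-check-" + System.nanoTime());
    try (FileOutputStream out = new FileOutputStream(probe)) {
      out.write(new byte[4096]);
      out.getFD().sync();
    } catch (IOException e) {
      if (dir.getUsableSpace() == 0) {
        return; // disk is full, not failed: do not fail the check
      }
      throw new DiskErrorException("Directory is not writable: " + dir, e);
    } finally {
      probe.delete();
    }
  }
}
{code}
In practice the zero-space test would probably be the kind of configurable threshold discussed in the comment above, so that writes failing on a nearly full volume are also tolerated.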


