Posted to issues@ozone.apache.org by "Mark Gui (Jira)" <ji...@apache.org> on 2021/03/23 14:01:00 UTC

[jira] [Commented] (HDDS-4666) Handling disk issues in Datanodes

    [ https://issues.apache.org/jira/browse/HDDS-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307093#comment-17307093 ] 

Mark Gui commented on HDDS-4666:
--------------------------------

I'd like to raise a related problem here. Currently the MutableVolumeSet has a periodic disk checker that runs every 15 minutes (fixed, not configurable), but it does not handle failures on the write path. We may have to introduce on-demand disk checks that are triggered when an I/O failure is hit (due to bad disks, etc.).

By the way, HDFS has only a lazy, on-demand disk checker:

https://blog.cloudera.com/hdfs-datanode-scanners-and-disk-checker-explained/
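
To make the on-demand idea above concrete, here is a rough sketch of what a check triggered from the write path could look like. This is only an illustration under my own assumptions: OnDemandVolumeChecker, onWriteFailure, checkNow and probe are hypothetical names, not existing Ozone or HDFS APIs, and the probe (create, write and delete a small temp file) is just one possible health test.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch, not Ozone code: checks a volume when a write fails
// instead of waiting for the next periodic scan.
class OnDemandVolumeChecker {
  // Volumes that failed their most recent probe.
  private final Set<Path> failedVolumes = ConcurrentHashMap.newKeySet();

  // Single daemon thread so probes never block the write path or JVM exit.
  private final ExecutorService executor =
      Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "on-demand-volume-check");
        t.setDaemon(true);
        return t;
      });

  // Intended to be called by the write path after it catches an IOException.
  CompletableFuture<Boolean> onWriteFailure(Path volumeRoot) {
    return CompletableFuture.supplyAsync(() -> checkNow(volumeRoot), executor);
  }

  // Probes the volume synchronously and records the result.
  boolean checkNow(Path volumeRoot) {
    boolean healthy = probe(volumeRoot);
    if (healthy) {
      failedVolumes.remove(volumeRoot);
    } else {
      failedVolumes.add(volumeRoot);
    }
    return healthy;
  }

  boolean isFailed(Path volumeRoot) {
    return failedVolumes.contains(volumeRoot);
  }

  // Assumed health test: the root is a writable directory and a small
  // temp file can be created, written and deleted in it.
  static boolean probe(Path volumeRoot) {
    if (!Files.isDirectory(volumeRoot) || !Files.isWritable(volumeRoot)) {
      return false;
    }
    Path probeFile = null;
    try {
      probeFile = Files.createTempFile(volumeRoot, "disk-check-", ".tmp");
      Files.write(probeFile, new byte[] {1});
      return true;
    } catch (IOException e) {
      return false;
    } finally {
      if (probeFile != null) {
        try {
          Files.deleteIfExists(probeFile);
        } catch (IOException ignored) {
          // Cleanup is best effort; the probe result is already decided.
        }
      }
    }
  }
}
```

In a sketch like this the write path would only call onWriteFailure(volumeRoot) and move on; whatever consumes isFailed (for example, the code that picks a volume for new containers) would decide how to react. A real implementation would probably also want to throttle repeated checks on the same volume.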

> Handling disk issues in Datanodes
> ---------------------------------
>
>                 Key: HDDS-4666
>                 URL: https://issues.apache.org/jira/browse/HDDS-4666
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode, SCM
>            Reporter: Shashikant Banerjee
>            Assignee: Shashikant Banerjee
>            Priority: Major
>
> Currently, there is no notion of reserved space on datanodes as there is on HDFS datanodes. Also, a datanode low on disk capacity continues to participate in pipeline allocation and keeps receiving write requests; these requests fail and can end up in a retry loop in the client.
> Similarly, Ratis log disks are currently not accounted for by the disk checker. Once a Ratis disk gets full, existing pipelines cannot be closed, because closing a pipeline involves taking a Ratis snapshot, which will not succeed if the Ratis disk is full. New pipelines on such disks cannot function either; they will fail write requests and add to the client retry chain.
> In addition, nodes low on disk capacity should not be chosen as targets for container re-replication.
> The goal of this Jira is to address disk-related issues on datanodes holistically.


