Posted to issues@kudu.apache.org by "Bankim Bhavsar (Jira)" <ji...@apache.org> on 2019/11/06 22:11:00 UTC

[jira] [Comment Edited] (KUDU-2904) Master shouldn't allow master tablet operations after a disk failure

    [ https://issues.apache.org/jira/browse/KUDU-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968758#comment-16968758 ] 

Bankim Bhavsar edited comment on KUDU-2904 at 11/6/19 10:10 PM:
----------------------------------------------------------------

Fixed with commit [34efee128945fe1577499acf8eda31d39807ab3b|https://github.com/apache/kudu/commit/34efee128945fe1577499acf8eda31d39807ab3b]


was (Author: bankim):
Fixed with commit [https://github.com/apache/kudu/commit/34efee128945fe1577499acf8eda31d39807ab3b]

> Master shouldn't allow master tablet operations after a disk failure
> --------------------------------------------------------------------
>
>                 Key: KUDU-2904
>                 URL: https://issues.apache.org/jira/browse/KUDU-2904
>             Project: Kudu
>          Issue Type: Bug
>          Components: fs, master
>    Affects Versions: 1.11.0
>            Reporter: Adar Dembo
>            Assignee: Bankim Bhavsar
>            Priority: Critical
>              Labels: newbie
>             Fix For: 1.12.0
>
>
> The master doesn't register any FS error handlers. As a result, when a disk failure doesn't intrinsically crash the server (i.e. a failure of the disk backing one of several directories), the master tablet is not failed and may undergo additional maintenance manager (MM) ops. This is forbidden: the invariant is that a tablet with a failed disk should itself fail. In the master, perhaps the behavior should be more severe (i.e. perhaps the master should crash itself).
> This surfaced with a user report of multiple minor delta compactions on a master even after one of them had failed during a SyncDir() call on its superblock flush. The metadata was corrupt: the blocks added to the superblock by the compaction were marked as deleted in the log block manager (LBM). It's unclear whether the in-memory state of the superblock was corrupted by the failure and subsequent compactions, or whether the corruption was caused by something else. Either way, no operations should have been permitted following the initial failure.
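
For illustration only, here is a minimal C++ sketch of the invariant described above: a disk failure is reported to a registered handler, the handler fails the tablet, and a failed tablet refuses further maintenance ops. All names here (ErrorNotifier, Tablet, SetDiskFailureHandler, RunMaintenanceOp, and so on) are hypothetical stand-ins and are not Kudu's actual classes or APIs. The sketch does not reflect what the fixing commit does.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-ins for illustration; not Kudu's real types.
enum class TabletState { kRunning, kFailed };

// Dispatches disk-failure notifications to at most one registered handler.
class ErrorNotifier {
 public:
  using Handler = std::function<void(const std::string& dir)>;

  void SetDiskFailureHandler(Handler h) { handler_ = std::move(h); }

  // Returns true if a handler was registered and invoked. A false return
  // models the bug in the report: the failure goes unhandled.
  bool NotifyDiskFailure(const std::string& dir) const {
    if (!handler_) return false;
    handler_(dir);
    return true;
  }

 private:
  Handler handler_;
};

// A tablet that refuses maintenance ops once it has been marked failed.
class Tablet {
 public:
  TabletState state() const { return state_; }
  void MarkFailed() { state_ = TabletState::kFailed; }

  // Returns false without doing any work if the tablet has failed.
  bool RunMaintenanceOp() {
    if (state_ == TabletState::kFailed) return false;
    ++ops_run_;
    return true;
  }

  int ops_run() const { return ops_run_; }

 private:
  TabletState state_ = TabletState::kRunning;
  int ops_run_ = 0;
};
```

In this sketch, a server would register a handler such as `[&tablet](const std::string&) { tablet.MarkFailed(); }` at startup. A stricter handler, along the lines the description suggests for the master, could instead terminate the process.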


