Posted to issues@kudu.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/09/22 06:02:00 UTC

[jira] [Commented] (KUDU-3191) Fail tablet replicas that suffer from KUDU-2233 instead of crashing

    [ https://issues.apache.org/jira/browse/KUDU-3191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199832#comment-17199832 ] 

ASF subversion and git services commented on KUDU-3191:
-------------------------------------------------------

Commit fcceb8b1a20afff30e15b6248a56ab3e06b61e79 in kudu's branch refs/heads/master from Andrew Wong
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=fcceb8b ]

KUDU-3191: fail replicas when KUDU-2233 is detected

Despite the longstanding fixes that stop bad KUDU-2233 compactions,
users still see the results of already corrupted data, particularly when
upgrading to newer versions that may compact more aggressively than
older versions.

Rather than crashing when hitting a KUDU-2233 failure, this patch
updates the behavior to fail the replica. As with disk failures or
CFile checksum corruption, this triggers re-replication, and the
failed replica is evicted only if a healthy majority remains.

The hope is that fewer users will see this corruption cause problems, as
the corruption will henceforth not crash servers, and only tablets with
a majority corrupted will be unavailable.

Change-Id: I43570b961dfd5eb8518328121585255d32cf2ebb
Reviewed-on: http://gerrit.cloudera.org:8080/16471
Tested-by: Kudu Jenkins
Reviewed-by: Alexey Serbin <as...@cloudera.com>
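
The behavior described in the commit message above (fail the affected replica rather than crash the server, and evict only when a healthy majority exists) can be illustrated with a small model. This is a minimal sketch and not Kudu's actual code: `Replica`, `ReplicaState`, `OnCorruptionDetected`, `HasHealthyMajority`, and `CanEvict` are all hypothetical names invented for illustration, and the majority rule is assumed to be a strict majority of the tablet's replicas.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types for illustration only; these are not Kudu's classes.
enum class ReplicaState { kRunning, kFailed };

struct Replica {
  std::string server;
  ReplicaState state;
};

// Assumed handler: rather than aborting the whole server process on a broken
// invariant, mark only the affected replica as failed.
void OnCorruptionDetected(Replica& replica) {
  replica.state = ReplicaState::kFailed;
}

// True when strictly more than half of the replicas are still running.
bool HasHealthyMajority(const std::vector<Replica>& replicas) {
  std::size_t healthy = 0;
  for (const Replica& r : replicas) {
    if (r.state == ReplicaState::kRunning) ++healthy;
  }
  return healthy * 2 > replicas.size();
}

// A failed replica is eligible for eviction (and thus re-replication) only
// while a healthy majority of the tablet's replicas remains.
bool CanEvict(const std::vector<Replica>& replicas, std::size_t index) {
  return replicas[index].state == ReplicaState::kFailed &&
         HasHealthyMajority(replicas);
}
```

In this model, a three-replica tablet with one failed replica still has a healthy majority, so the failed replica can be evicted and replaced. With two failed replicas nothing is evicted and the tablet stays unavailable, while the servers hosting it keep running.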


> Fail tablet replicas that suffer from KUDU-2233 instead of crashing
> -------------------------------------------------------------------
>
>                 Key: KUDU-3191
>                 URL: https://issues.apache.org/jira/browse/KUDU-3191
>             Project: Kudu
>          Issue Type: Task
>          Components: compaction
>            Reporter: Andrew Wong
>            Assignee: Andrew Wong
>            Priority: Major
>
> KUDU-2233 results in persisted corruption that breaks an invariant, leading to a server crash. Recovering from this corruption is arduous, especially if multiple tablet replicas on a given server suffer from it: users typically start the server, see the crash, remove the affected replica manually via tooling, and restart, repeating until the server starts cleanly.
> Instead, we should consider treating this as we do CFile block-level corruption[1] and fail the tablet replica. At best, we recover from a non-corrupted replica. At worst, we end up with multiple corrupted replicas, which is still better than what we have today: multiple corrupted replicas plus unavailable servers that lead to excessive re-replication.
> [1] https://github.com/apache/kudu/commit/cf6927cb153f384afb649b664de1d4276bd6d83f



--
This message was sent by Atlassian Jira
(v8.3.4#803005)