
Posted to issues@kudu.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/12/02 16:06:00 UTC

[jira] [Commented] (KUDU-3341) Catalog Manager should stop retrying DeleteTablet when receive WRONG_SERVER_UUID error

    [ https://issues.apache.org/jira/browse/KUDU-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452480#comment-17452480 ] 

ASF subversion and git services commented on KUDU-3341:
-------------------------------------------------------

Commit 0222c3163129b1d6c1c37b216482aa64f921c415 in kudu's branch refs/heads/master from zhangyifan27
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=0222c31 ]

KUDU-3341: Stop retrying to DeleteTablet on wrong server

This patch improves catalog_manager's behavior when a DeleteTablet request
fails with a 'WRONG_SERVER_UUID' error. It's better to mark the RetryTask as
failed than to keep retrying and sending too many requests, because the
master would keep receiving the same error until the server with the wrong
UUID restarts with the correct one. At that point the tserver sends a full
tablet report, which triggers the deletion of outdated tablets.

This patch also adds a test that reproduces the scenario described in the JIRA.

Change-Id: Ieaa36086300bda7f958570c690b951dc090c342a
Reviewed-on: http://gerrit.cloudera.org:8080/18057
Tested-by: Kudu Jenkins
Reviewed-by: Andrew Wong <aw...@cloudera.com>
Reviewed-by: Attila Bukor <ab...@apache.org>
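
The retry policy described in the commit message can be sketched roughly as
follows. This is a minimal illustration, not the code from the patch: the
names DeleteResult, TaskState, TaskOutcome and RunDeleteTabletTask are
hypothetical, and a simple attempt budget stands in for the master's real
retry timeout.

```cpp
#include <cassert>
#include <functional>

// Hypothetical result codes for one DeleteTablet attempt (not Kudu's enums).
enum class DeleteResult { kOk, kRpcFailure, kWrongServerUuid };

// Hypothetical terminal states for the retry task.
enum class TaskState { kComplete, kFailed, kTimedOut };

struct TaskOutcome {
  TaskState state;
  int attempts;  // number of requests actually sent
};

// Sends the request until it succeeds, hits a non-retriable error, or the
// attempt budget is exhausted. The budget is an assumption of this sketch.
TaskOutcome RunDeleteTabletTask(const std::function<DeleteResult()>& send_delete,
                                int max_attempts) {
  assert(max_attempts > 0);
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    switch (send_delete()) {
      case DeleteResult::kOk:
        return {TaskState::kComplete, attempt};
      case DeleteResult::kWrongServerUuid:
        // Retrying cannot help: the server answers with the same error
        // until it is restarted with the expected UUID.
        return {TaskState::kFailed, attempt};
      case DeleteResult::kRpcFailure:
        break;  // possibly transient (e.g. server is down); try again
    }
  }
  return {TaskState::kTimedOut, max_attempts};
}
```

With this shape, a server that keeps answering WRONG_SERVER_UUID costs the
master one request per task instead of a full retry window, while plain RPC
failures are still retried.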


> Catalog Manager should stop retrying DeleteTablet when receive WRONG_SERVER_UUID error
> --------------------------------------------------------------------------------------
>
>                 Key: KUDU-3341
>                 URL: https://issues.apache.org/jira/browse/KUDU-3341
>             Project: Kudu
>          Issue Type: Improvement
>          Components: master
>            Reporter: YifanZhang
>            Assignee: YifanZhang
>            Priority: Minor
>
> Sometimes a tablet server is shut down because of detected disk failures and is later re-added to the cluster with all of its data cleared.
> Its replicas may be re-replicated after {{--follower_unavailable_considered_failed_sec}} seconds. The master then sends DeleteTablet RPCs to this tserver but receives either an RPC failure (the tserver was shut down) or a WRONG_SERVER_UUID error (the tserver started with a new UUID), and it keeps retrying to delete the tablets until {{--unresponsive_ts_rpc_timeout_ms}} (default 1 hour) elapses.
> Retrying is unnecessary when a WRONG_SERVER_UUID error is received, because the server UUID can only be corrected by restarting the tablet server. At that point a full tablet report is sent to the master, and any outdated replicas can finally be deleted.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)