Posted to issues@kudu.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2016/09/13 16:46:21 UTC

[jira] [Commented] (KUDU-1608) Catalog Manager DeleteTablet retry logic is broken

    [ https://issues.apache.org/jira/browse/KUDU-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15487720#comment-15487720 ] 

Todd Lipcon commented on KUDU-1608:
-----------------------------------

bq. Arguably this should only be done lazily when the dead tablets report in, since most of the time the tablet will be ejected due to failure (and will never be seen again).

This won't really work, since tablet servers only report tablets when something has changed. It's quite possible (actually common) for a tablet to get evicted because it fell too far behind, and in that case it's important for the master to send the DeleteTablet request.

I agree, though, that if the master sees that the remote server is actually down (e.g. connection refused), it should stop retrying, because when the remote server comes back up, it will report the tablet.
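
As a rough illustration of that distinction, here is a minimal Python sketch (not Kudu's actual C++ code). The enum names and the `next_step` function are hypothetical, invented for this example; the only assumption taken from the discussion is that an unreachable server will report its tablets when it restarts.

```python
from enum import Enum, auto


class DeleteOutcome(Enum):
    """Hypothetical outcomes of one DeleteTablet attempt."""
    OK = auto()
    SERVER_UNREACHABLE = auto()  # e.g. connection refused
    TRANSIENT_ERROR = auto()     # server is up but could not process the request


class NextStep(Enum):
    """Hypothetical actions the master could take next."""
    DONE = auto()
    RETRY = auto()
    WAIT_FOR_TABLET_REPORT = auto()


def next_step(outcome: DeleteOutcome) -> NextStep:
    """Decide what to do after one DeleteTablet attempt."""
    if outcome is DeleteOutcome.OK:
        return NextStep.DONE
    if outcome is DeleteOutcome.SERVER_UNREACHABLE:
        # The server is down: stop retrying and rely on the tablet
        # report it sends after it comes back up.
        return NextStep.WAIT_FOR_TABLET_REPORT
    # The server is alive (for example, a replica evicted for falling
    # behind), so nothing will trigger a new report: keep sending.
    return NextStep.RETRY
```

A live server that merely failed the request still gets retried, which covers the evicted-but-running case described above.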

> Catalog Manager DeleteTablet retry logic is broken
> --------------------------------------------------
>
>                 Key: KUDU-1608
>                 URL: https://issues.apache.org/jira/browse/KUDU-1608
>             Project: Kudu
>          Issue Type: Bug
>          Components: master
>            Reporter: Dan Burkert
>
> There are a couple of issues with the Catalog Manager's retry logic for DeleteTablet requests:
> 1. The retries loop indefinitely
> 2. The RPC response is checked against a whitelist of fatal errors, instead of a list of retriable errors.  Additionally, we are missing many fatal errors on this list such as WRONG_SERVER_UUID and UNKNOWN_ERROR.  I think we should instead only retry on errors which we know we can recover from.
> 3. The catalog manager aggressively sends out DeleteTablet requests to tablet servers when tablets are ejected from the group.  Arguably this should only be done lazily when the dead tablets report in, since most of the time the tablet will be ejected due to failure (and will never be seen again).
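
Points 1 and 2 above could be addressed together roughly as in the following Python sketch, which is illustrative only and is not the catalog manager's code. `RETRIABLE_ERRORS`, its two members, `MAX_ATTEMPTS`, and `send_with_retries` are all hypothetical; only WRONG_SERVER_UUID and UNKNOWN_ERROR come from the issue text, where they are named as fatal.

```python
# Hypothetical allowlist: retry only on errors known to be recoverable.
# Anything else (WRONG_SERVER_UUID, UNKNOWN_ERROR, ...) is treated as fatal.
RETRIABLE_ERRORS = frozenset({"SERVER_BUSY", "TIMED_OUT"})

# Hypothetical cap so that the retries cannot loop indefinitely.
MAX_ATTEMPTS = 5


def send_with_retries(send_delete, max_attempts=MAX_ATTEMPTS):
    """Call send_delete() until it succeeds, fails fatally, or the cap is hit.

    send_delete is a stand-in for the RPC: it returns None on success or an
    error-code string on failure. Returns (status, attempts_made).
    """
    for attempt in range(1, max_attempts + 1):
        error = send_delete()
        if error is None:
            return ("ok", attempt)
        if error not in RETRIABLE_ERRORS:
            # Not on the allowlist: give up immediately.
            return ("failed", attempt)
    return ("gave_up", max_attempts)
```

With an allowlist, an error code that nobody anticipated falls through to the fatal branch instead of being retried forever.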



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)