Posted to common-issues@hadoop.apache.org by "Xiao Chen (JIRA)" <ji...@apache.org> on 2017/09/08 05:43:00 UTC

[jira] [Commented] (HADOOP-14521) KMS client needs retry logic

    [ https://issues.apache.org/jira/browse/HADOOP-14521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16158148#comment-16158148 ] 

Xiao Chen commented on HADOOP-14521:
------------------------------------

Unfortunately, it turns out this change still introduced some incompatible behavior in the KMS client for some corner cases.

The reason is:
- After this change, the client uses {{RetryPolicies.failoverOnNetworkException}} with {{TryOnceThenFail}}.
- Before this change, the client always tried all servers, whatever the exception was.
- While our initial judgement of not retrying certain types of exceptions (e.g. AccessControlException) makes sense, a lot of wrapped IOExceptions are no longer retried either. This includes the EOFException seen in HADOOP-14841 and the IOException-wrapped GSSException met in HADOOP-14445.
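
To illustrate the difference described above, here is a minimal standalone sketch. It is a simplified model and not the actual Hadoop code: {{KmsRetrySketch}}, {{Provider}}, {{tryAllServers}}, {{failoverOnNetworkOnly}} and {{isNetworkFailure}} are hypothetical names, and the set of exceptions treated as network failures is an assumption made only for this example (the real policy classifies exceptions differently).

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.List;

// Simplified model of the two behaviors; NOT the actual Hadoop implementation.
final class KmsRetrySketch {

  /** Hypothetical stand-in for one KMS server: returns a value or throws. */
  interface Provider {
    String call() throws IOException;
  }

  /** Old behavior (modelled): on any IOException, move on to the next server. */
  static String tryAllServers(List<Provider> providers) throws IOException {
    IOException last = null;
    for (Provider p : providers) {
      try {
        return p.call();
      } catch (IOException e) {
        last = e; // remember the failure and try the next server
      }
    }
    throw last != null ? last : new IOException("no providers configured");
  }

  /** Assumed classification of network failures, for this sketch only. */
  static boolean isNetworkFailure(IOException e) {
    return e instanceof ConnectException
        || e instanceof NoRouteToHostException
        || e instanceof UnknownHostException;
  }

  /** New behavior (modelled): fail over on network failures only. */
  static String failoverOnNetworkOnly(List<Provider> providers) throws IOException {
    IOException last = null;
    for (Provider p : providers) {
      try {
        return p.call();
      } catch (IOException e) {
        if (!isNetworkFailure(e)) {
          throw e; // anything else is tried once, then fails
        }
        last = e;
      }
    }
    throw last != null ? last : new IOException("no providers configured");
  }

  public static void main(String[] args) throws IOException {
    // The first server fails with a wrapped, non-network IOException.
    Provider broken = () -> {
      throw new IOException("wrapped", new IllegalStateException("cause"));
    };
    Provider healthy = () -> "ok";
    List<Provider> servers = Arrays.asList(broken, healthy);

    System.out.println(tryAllServers(servers)); // reaches the healthy server
    try {
      failoverOnNetworkOnly(servers);
    } catch (IOException e) {
      System.out.println("failed without failover: " + e.getMessage());
    }
  }
}
```

In this model, a call that used to succeed on the second server now fails on the first one, which is the 'wrong-but-works' to 'wrong-and-breaks' shift.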

As a result, this fix turned HADOOP-14841 and HADOOP-14445 from 'wrong-but-works' issues into 'wrong-and-breaks' issues.

Given that the KMS has been mysterious and untamed at times, I'm not confident the above list is exhaustive.
I suggest we:
- mark this as incompatible
- either provide an addendum or file a follow-on JIRA to keep the existing behavior.

[~shahrs87], thoughts?

> KMS client needs retry logic
> ----------------------------
>
>                 Key: HADOOP-14521
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14521
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 2.6.0
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>             Fix For: 2.9.0, 3.0.0-beta1, 2.8.2
>
>         Attachments: HADOOP-14521.09.patch, HADOOP-14521-branch-2.8.002.patch, HADOOP-14521-branch-2.8.2.patch, HADOOP-14521-trunk-10.patch, HDFS-11804-branch-2.8.patch, HDFS-11804-trunk-1.patch, HDFS-11804-trunk-2.patch, HDFS-11804-trunk-3.patch, HDFS-11804-trunk-4.patch, HDFS-11804-trunk-5.patch, HDFS-11804-trunk-6.patch, HDFS-11804-trunk-7.patch, HDFS-11804-trunk-8.patch, HDFS-11804-trunk.patch
>
>
> The KMS client appears to have no retry logic – at all.  It's completely decoupled from the ipc retry logic.  This has major impacts if the KMS is unreachable for any reason, including but not limited to network connection issues, timeouts, and the +restart during an upgrade+.
> This has some major ramifications:
> # Jobs may fail to submit, although oozie resubmit logic should mask it
> # Non-oozie launchers may experience higher failure rates if they do not already have retry logic.
> # Tasks reading EZ files will fail, though this will probably be masked by framework reattempts
> # EZ file creation fails after creating a 0-length file – client receives EDEK in the create response, then fails when decrypting the EDEK
> # Bulk hadoop fs copies, and maybe distcp, will prematurely fail


