Posted to user-zh@flink.apache.org by xieyi <xi...@163.com> on 2022/02/11 07:14:00 UTC

Flink on YARN: AM attempt fails after the HDFS_DELEGATION_TOKEN has been removed


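Hello everyone,
I have a question.
Hadoop delegation tokens expire and are removed once they exceed their max lifetime (7 days by default). For long-running jobs, the YARN documentation describes three strategies for dealing with this: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/YarnApplicationSecurity.md#securing-long-lived-yarn-services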


I would like to know how Flink on YARN handles Hadoop delegation token expiry. The official documentation does not seem to explain this clearly.


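We have run into the following failure in production:
Flink 1.12 on YARN. The YARN NodeManagers are deployed in containers and occasionally crash and restart. Once a Flink job has been running for more than 7 days, if the NodeManager hosting that job's JM (AM) restarts, the AM makes a new attempt. The attempt picks up token 13770506, which was obtained when the job was submitted, but that token has already been removed from the NameNode. The attempt fails with the following error: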


Failing this attempt.Diagnostics: token (HDFS_DELEGATION_TOKEN token 1377**** for user***) can't be found in cache
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 1377****for user***) can't be found in cache
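
To make the timing behind this error concrete, here is a minimal sketch of the token lifetime arithmetic. It is a simplified model and not how the NameNode or Flink is implemented. It assumes the HDFS defaults of a 7-day max lifetime and a 24-hour renew interval; the function name token_usable and the submission date are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed HDFS defaults: 7-day max lifetime, 24-hour renew interval.
MAX_LIFETIME = timedelta(days=7)
RENEW_INTERVAL = timedelta(hours=24)


def token_usable(issued_at, last_renewed_at, now):
    """Return True if a delegation token would still be accepted at `now`.

    Simplified model: a renewal extends validity by one renew interval,
    but never past issued_at + MAX_LIFETIME.
    """
    hard_expiry = issued_at + MAX_LIFETIME
    soft_expiry = min(last_renewed_at + RENEW_INTERVAL, hard_expiry)
    return now < soft_expiry


# Hypothetical submission time, used only for illustration.
submitted = datetime(2022, 2, 1)

# Day 6: the token was renewed recently and is within its max lifetime.
early_restart = token_usable(
    submitted, submitted + timedelta(days=6), submitted + timedelta(days=6, hours=1)
)

# Day 8: renewals cannot help, the max lifetime has passed.
late_restart = token_usable(
    submitted, submitted + timedelta(days=6, hours=23), submitted + timedelta(days=8)
)

print(early_restart, late_restart)
```

Under this model, an AM reattempt that reuses the submission-time token after day 7 is rejected regardless of how regularly the token was renewed, which is consistent with the error above.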


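Questions:
After the Hadoop delegation token is removed, how does Flink on YARN refresh it? Does it generate a new token?
If a new token is generated, why does the AM attempt still pick up the removed token (13770506)?
Is this failure related to the containerized NodeManager deployment? Could it be that, after the NodeManager restarts, the files holding the keytab have been wiped?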