Posted to user@hadoop.apache.org by Paul Lam <pa...@gmail.com> on 2019/04/25 08:20:13 UTC

Re: Long running application failed to init containers due to authentication errors

Hi Billy,

Thanks for your information. We’ve figured it out. 

Our YARN cluster uses ViewFS backed by multiple HDFS clusters. The Flink applications use one of them, but not the first one (which is the one YARN uses underneath), hence the tokens don’t match.
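
A toy sketch of why a token for the wrong filesystem could surface as an authentication failure rather than “token expired”, assuming tokens are looked up by the service name of the target filesystem. This is not Hadoop code: `ns1`, `ns2`, and all function names are hypothetical, and the real lookup is more involved.

```python
# Toy model (not Hadoop code): credentials map a service name to a token.
# "ns1" and "ns2" are hypothetical names for two HDFS namespaces behind ViewFS.


class AuthError(Exception):
    """Raised when neither token nor Kerberos authentication is possible."""


def select_token(credentials, service):
    """Return the token whose service matches, or None if there is none."""
    return credentials.get(service)


def connect(credentials, service, has_kerberos_ticket=False):
    """Try token auth first, then Kerberos; fail if neither is usable."""
    token = select_token(credentials, service)
    if token is not None:
        return f"authenticated to {service} with token"
    if has_kerberos_ticket:
        return f"authenticated to {service} with kerberos"
    raise AuthError("client cannot authenticate via [token, kerberos]")


# Assumption: the launch context only carries a token for the namespace
# the application itself uses (ns2), not the one YARN uses underneath (ns1).
launch_context_credentials = {"ns2": "hdfs-delegation-token-for-ns2"}

print(connect(launch_context_credentials, "ns2"))
try:
    # Localization and log aggregation are assumed to target the other namespace.
    connect(launch_context_credentials, "ns1")
except AuthError as err:
    print(f"ns1: {err}")
```

In this model the token for `ns2` is still valid, so nothing ever reports it as expired; the lookup for `ns1` simply finds no token at all, and a container has no Kerberos ticket to fall back on.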

Best,
Paul Lam

> On Mar 24, 2019, at 06:00, Billy Watson <wi...@gmail.com> wrote:
> 
> So just a hunch because we’ve been dealing with something similar. When the failure occurs, has the resource manager also failed over just recently or in the previous 24 hours?
> 
> One thing to try: catch this exception and manually fail over to the new master/resource manager.
> 
> - Billy Watson
> 
> On Thu, Nov 29, 2018 at 21:16 Paul Lam <paullin3280@gmail.com> wrote:
> Hi,
> 
> I’m running Flink applications on YARN 2.6.0-cdh5.6.0 and have run into a problem. After running for a while (possibly longer than 7 days), the application might
> need to scale up or recover from a node failure, but it is unable to allocate new containers. All the incoming containers fail to localize resources
> and create log aggregation dirs for lack of credentials, so the Flink application never gets the requested containers. It seems that the credentials in the
> container launch context somehow disappear.
> 
> This looks very similar to FLINK-6376[1] and YARN-2704[2], but both of them should have been fixed. The Flink AM gets the HDFS delegation token from
> the client, puts it into the container launch context, and does not refresh it afterwards. But IMHO, if the token had expired, the exception should be “token expired”
> or “token not found in cache”, whereas what I get is “client cannot authenticate via [token, kerberos]”.
> 
> This happens seemingly at random, and I have been struggling with it for a couple of days. Any help would be greatly appreciated. Thanks a lot!
> 
> [1] https://issues.apache.org/jira/browse/FLINK-6376
> [2] https://issues.apache.org/jira/browse/YARN-2704
> 
> Best,
> Paul Lam
> 
> 
> -- 
> William Watson