Posted to user@flink.apache.org by Niels Basjes <Ni...@basjes.nl> on 2015/12/02 11:47:42 UTC

Flink job on secure Yarn fails after many hours

Hi,


We have a Kerberos secured Yarn cluster here and I'm experimenting
with Apache Flink on top of that.

A few days ago I started a very simple Flink application (just stream
the time as a String into HBase 10 times per second).

I (deliberately) asked our IT-ops guys to make my account have a max
ticket time of 5 minutes and a max renew time of 10 minutes (yes,
ridiculously low timeout values because I needed to validate this
https://issues.apache.org/jira/browse/FLINK-2977).
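
(For reference, the short lifetimes can be checked from the keytab roughly
like this; the keytab path, principal and realm are placeholders, not the
actual ones used here:

kinit -kt /path/to/nbasjes.keytab nbasjes@EXAMPLE.COM
klist

klist then prints the ticket's "Expires" and "renew until" timestamps, which
makes the 5/10 minute settings easy to verify.)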

This job was started with a keytab file, and after running for 31 hours
it suddenly failed with the exception you see below.

I had the same job running for almost 400 hours until that failed too
(I was too late to check the logfiles but I suspect the same problem).


So in that time span my tickets have expired and new tickets have been
obtained several hundred times.


The main error I see is that in the process of a ticket expiring and
being renewed I see this message:

     Not retrying because the invoked method is not idempotent, and
unable to determine whether it was invoked


Yarn on the cluster is 2.6.0 ( HDP 2.6.0.2.2.4.2-2 )

Flink is version 0.10.1


How do I fix this?
Is this a bug (in either Hadoop or Flink) or am I doing something wrong?
Would upgrading Yarn to 2.7.1 (i.e. HDP 2.3) fix this?


Niels Basjes



21:30:27,821 WARN  org.apache.hadoop.security.UserGroupInformation
          - PriviledgedActionException as:nbasjes (auth:SIMPLE)
cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
Invalid AMRMToken from appattempt_1443166961758_163901_000001
21:30:27,861 WARN  org.apache.hadoop.ipc.Client
          - Exception encountered while connecting to the server :
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
Invalid AMRMToken from appattempt_1443166961758_163901_000001
21:30:27,861 WARN  org.apache.hadoop.security.UserGroupInformation
          - PriviledgedActionException as:nbasjes (auth:SIMPLE)
cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
Invalid AMRMToken from appattempt_1443166961758_163901_000001
21:30:27,891 WARN  org.apache.hadoop.io.retry.RetryInvocationHandler
          - Exception while invoking class
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate.
Not retrying because the invoked method is not idempotent, and unable
to determine whether it was invoked
org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid
AMRMToken from appattempt_1443166961758_163901_000001
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
	at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
	at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy14.allocate(Unknown Source)
	at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245)
	at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
	at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
	at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
	at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
	at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
	at akka.dispatch.Mailbox.run(Mailbox.scala:221)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
Invalid AMRMToken from appattempt_1443166961758_163901_000001
	at org.apache.hadoop.ipc.Client.call(Client.java:1406)
	at org.apache.hadoop.ipc.Client.call(Client.java:1359)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
	at com.sun.proxy.$Proxy13.allocate(Unknown Source)
	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
	... 29 more
21:30:27,943 ERROR akka.actor.OneForOneStrategy
          - Invalid AMRMToken from
appattempt_1443166961758_163901_000001
org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid
AMRMToken from appattempt_1443166961758_163901_000001
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
	at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
	at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy14.allocate(Unknown Source)
	at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245)
	at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
	at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
	at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
	at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
	at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
	at akka.dispatch.Mailbox.run(Mailbox.scala:221)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
Invalid AMRMToken from appattempt_1443166961758_163901_000001
	at org.apache.hadoop.ipc.Client.call(Client.java:1406)
	at org.apache.hadoop.ipc.Client.call(Client.java:1359)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
	at com.sun.proxy.$Proxy13.allocate(Unknown Source)
	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
	... 29 more
21:30:28,075 INFO  org.apache.flink.yarn.YarnJobManager
          - Stopping JobManager
akka.tcp://flink@10.10.200.3:39527/user/jobmanager.
21:30:28,088 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Source: Custom Source -> Sink: Unnamed (1/1)
(db0d95c11c14505827e696eec7efab94) switched from RUNNING to CANCELING
21:30:28,113 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Source: Custom Source -> Sink: Unnamed (1/1)
(db0d95c11c14505827e696eec7efab94) switched from CANCELING to FAILED
21:30:28,184 INFO  org.apache.flink.runtime.blob.BlobServer
          - Stopped BLOB server at 0.0.0.0:41281
21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager
          - Actor akka://flink/user/jobmanager#403236912 terminated,
stopping process...
21:30:28,286 INFO
org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         -
Removing web root dir
/tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd


-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

unsubscribe

Posted by 范昂 <fa...@adxdata.com>.

Sent from my iPhone

> On 3 December 2015, at 1:41 AM, Maximilian Michels <mx...@apache.org> wrote:
> 
> Great. Here is the commit to try out:
> https://github.com/mxm/flink/commit/f49b9635bec703541f19cb8c615f302a07ea88b3
> 
> If you already have the Flink repository, check it out using
> 
> git fetch https://github.com/mxm/flink/
> f49b9635bec703541f19cb8c615f302a07ea88b3 && git checkout FETCH_HEAD
> 
> Alternatively, here's a direct download link to the sources with the
> fix included:
> https://github.com/mxm/flink/archive/f49b9635bec703541f19cb8c615f302a07ea88b3.zip
> 
> Thanks a lot,
> Max
> 
>> On Wed, Dec 2, 2015 at 5:44 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>> Sure, just give me the git repo url to build and I'll give it a try.
>> 
>> Niels
>> 
>>> On Wed, Dec 2, 2015 at 4:28 PM, Maximilian Michels <mx...@apache.org> wrote:
>>> 
>>> I mentioned that the exception gets thrown when requesting container
>>> status information. We need this to send a heartbeat to YARN but it is
>>> not very crucial if this fails once for the running job. Possibly, we
>>> could work around this problem by retrying N times in case of an
>>> exception.
>>> 
>>> Would it be possible for you to deploy a custom Flink 0.10.1 version
>>> we provide and test again?
>>> 
>>>> On Wed, Dec 2, 2015 at 4:03 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>>>> No, I was just asking.
>>>> No upgrade is possible for the next month or two.
>>>> 
>>>> This week is our busiest day of the year ...
>>>> Our shop is doing about 10 orders per second these days ...
>>>> 
>>>> So they won't upgrade until next January/February
>>>> 
>>>> Niels
>>>> 
>>>> On Wed, Dec 2, 2015 at 3:59 PM, Maximilian Michels <mx...@apache.org>
>>>> wrote:
>>>>> 
>>>>> Hi Niels,
>>>>> 
>>>>> You mentioned you have the option to update Hadoop and redeploy the
>>>>> job. Would be great if you could do that and let us know how it turns
>>>>> out.
>>>>> 
>>>>> Cheers,
>>>>> Max
>>>>> 
>>>>>> On Wed, Dec 2, 2015 at 3:45 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I posted the entire log from the first log line at the moment of
>>>>>> failure
>>>>>> to
>>>>>> the very end of the logfile.
>>>>>> This is all I have.
>>>>>> 
>>>>>> As far as I understand, the Kerberos and keytab mechanism in Hadoop
>>>>>> Yarn catches the "Invalid Token" and then (if a keytab is available)
>>>>>> gets a new Kerberos ticket (or TGT?).
>>>>>> When the new ticket has been obtained it retries the call that
>>>>>> previously
>>>>>> failed.
>>>>>> To me it seemed that this call can fail because of the invalid token,
>>>>>> yet it cannot be retried.
>>>>>> 
>>>>>> At this moment I'm thinking a bug in Hadoop.
>>>>>> 
>>>>>> Niels
>>>>>> 
>>>>>> On Wed, Dec 2, 2015 at 2:51 PM, Maximilian Michels <mx...@apache.org>
>>>>>> wrote:
>>>>>>> 
>>>>>>> Hi Niels,
>>>>>>> 
>>>>>>> Sorry for hear you experienced this exception. From a first glance,
>>>>>>> it
>>>>>>> looks like a bug in Hadoop to me.
>>>>>>> 
>>>>>>>> "Not retrying because the invoked method is not idempotent, and
>>>>>>>> unable
>>>>>>>> to determine whether it was invoked"
>>>>>>> 
>>>>>>> That is nothing to worry about. This is Hadoop's internal retry
>>>>>>> mechanism that re-attempts to do actions which previously failed if
>>>>>>> that's possible. Since the action is not idempotent (it cannot be
>>>>>>> executed again without risking to change the state of the execution)
>>>>>>> and it also doesn't track its execution states, it won't be retried
>>>>>>> again.
>>>>>>> 
>>>>>>> The main issue is this exception:
>>>>>>> 
>>>>>>>> org.apache.hadoop.security.token.SecretManager$InvalidToken:
>>>>>>>> Invalid
>>>>>>>> AMRMToken from >appattempt_1443166961758_163901_000001
>>>>>>> 
>>>>>>> From the stack trace it is clear that this exception occurs upon
>>>>>>> requesting container status information from the Resource Manager:
>>>>>>> 
>>>>>>>> at
>>>>>>>> 
>>>>>>>> 
>>>>>>>> org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
>>>>>>> 
>>>>>>> Are there any more exceptions in the log? Do you have the complete
>>>>>>> logs available and could you share them?
>>>>>>> 
>>>>>>> 
>>>>>>> Best regards,
>>>>>>> Max
>>>>>>> 
>>>>>>> On Wed, Dec 2, 2015 at 11:47 AM, Niels Basjes <Ni...@basjes.nl>
>>>>>>> wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> 
>>>>>>>> We have a Kerberos secured Yarn cluster here and I'm experimenting
>>>>>>>> with
>>>>>>>> Apache Flink on top of that.
>>>>>>>> 
>>>>>>>> A few days ago I started a very simple Flink application (just
>>>>>>>> stream
>>>>>>>> the
>>>>>>>> time as a String into HBase 10 times per second).
>>>>>>>> 
>>>>>>>> I (deliberately) asked our IT-ops guys to make my account have a
>>>>>>>> max
>>>>>>>> ticket
>>>>>>>> time of 5 minutes and a max renew time of 10 minutes (yes,
>>>>>>>> ridiculously
>>>>>>>> low
>>>>>>>> timeout values because I needed to validate this
>>>>>>>> https://issues.apache.org/jira/browse/FLINK-2977).
>>>>>>>> 
>>>>>>>> This job is started with a keytab file and after running for 31
>>>>>>>> hours
>>>>>>>> it
>>>>>>>> suddenly failed with the exception you see below.
>>>>>>>> 
>>>>>>>> I had the same job running for almost 400 hours until that failed
>>>>>>>> too
>>>>>>>> (I
>>>>>>>> was
>>>>>>>> too late to check the logfiles but I suspect the same problem).
>>>>>>>> 
>>>>>>>> 
>>>>>>>> So in that time span my tickets have expired and new tickets have
>>>>>>>> been
>>>>>>>> obtained several hundred times.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> The main error I see is that in the process of a ticket expiring
>>>>>>>> and
>>>>>>>> being
>>>>>>>> renewed I see this message:
>>>>>>>> 
>>>>>>>>     Not retrying because the invoked method is not idempotent,
>>>>>>>> and
>>>>>>>> unable
>>>>>>>> to determine whether it was invoked
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Yarn on the cluster is 2.6.0 ( HDP 2.6.0.2.2.4.2-2 )
>>>>>>>> 
>>>>>>>> Flink is version 0.10.1
>>>>>>>> 
>>>>>>>> 
>>>>>>>> How do I fix this?
>>>>>>>> Is this a bug (in either Hadoop or Flink) or am I doing something
>>>>>>>> wrong?
>>>>>>>> Would upgrading Yarn to 2.7.1  (i.e. HDP 2.3) fix this?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Niels Basjes
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Best regards / Met vriendelijke groeten,
>>>>>>>> 
>>>>>>>> Niels Basjes
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Best regards / Met vriendelijke groeten,
>>>>>> 
>>>>>> Niels Basjes
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Best regards / Met vriendelijke groeten,
>>>> 
>>>> Niels Basjes
>> 
>> 
>> 
>> 
>> --
>> Best regards / Met vriendelijke groeten,
>> 
>> Niels Basjes

Re: Flink job on secure Yarn fails after many hours

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi,

No, this issue is now gone for us.
The fixes in 1.2.0 ensured that we are now able to run jobs on our cluster
beyond the 7-day limit.
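
For anyone who finds this thread later: with Flink 1.2+ the keytab is handed
to Flink itself in flink-conf.yaml. A minimal sketch, with a placeholder
keytab path and principal:

security.kerberos.login.keytab: /path/to/nbasjes.keytab
security.kerberos.login.principal: nbasjes@EXAMPLE.COM
# optional: JAAS contexts that should use these credentials (e.g. ZooKeeper)
security.kerberos.login.contexts: Client

Flink then logs in from the keytab inside the cluster itself, which is what
lifts the delegation-token lifetime limit discussed further down this thread.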

Niels

On Wed, Apr 12, 2017 at 5:35 PM, Robert Metzger <rm...@apache.org> wrote:

> Niels, are you still facing this issue?
>
> As far as I understood it, the security changes in Flink 1.2.0 use a new
> Kerberos mechanism that allows infinite token renewal.
>
> On Thu, Mar 17, 2016 at 7:30 AM, Maximilian Michels <mx...@apache.org>
> wrote:
>
>> Hi Niels,
>>
>> Thanks for the feedback. As far as I know, Hadoop deliberately
>> defaults to the one week maximum life time of delegation tokens. Have
>> you tried increasing the maximum token life time or was that not an
>> option?
>>
>> I wonder why do you use a while loop? Would it be possible to use the
>> Yarn failover mechanism which starts a new ApplicationMaster and
>> resubmits the job?
>>
>> Thanks,
>> Max
>>
>>
>> On Thu, Mar 17, 2016 at 12:43 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>> > Hi,
>> >
>> > In my environment doing the "proxy" thing didn't work.
>> > With an token expire of 168 hours (1 week) the job consistently
>> terminates
>> > at exactly (within a margin of 10 seconds) 173.5 hours.
>> > So far we have not been able to solve this problem.
>> >
>> > Our teams now simply assume the thing fails once in a while and have an
>> > automatic restart feature (i.e. shell script with a while true loop).
>> > The best guess at a root cause is this
>> > https://issues.apache.org/jira/browse/HDFS-9276
>> >
>> > If you have a real solution or a reference to a related bug report to
>> this
>> > problem then please share!
>> >
>> > Niels Basjes
>> >
>> >
>> >
>> > On Thu, Mar 17, 2016 at 10:20 AM, Thomas Lamirault
>> > <th...@ericsson.com> wrote:
>> >>
>> >> Hi Max,
>> >>
>> >> I will try these workaround.
>> >> Thanks
>> >>
>> >> Thomas
>> >>
>> >> ________________________________________
>> >> From: Maximilian Michels [mxm@apache.org]
>> >> Sent: Tuesday, 15 March 2016 16:51
>> >> To: user@flink.apache.org
>> >> Cc: Niels Basjes
>> >> Subject: Re: Flink job on secure Yarn fails after many hours
>> >>
>> >> Hi Thomas,
>> >>
>> >> Nils (CC) and I found out that you need at least Hadoop version 2.6.1
>> >> to properly run Kerberos applications on Hadoop clusters. Versions
>> >> before that have critical bugs related to the internal security token
>> >> handling that may expire the token although it is still valid.
>> >>
>> >> That said, there is another limitation of Hadoop that the maximum
>> >> internal token life time is one week. To work around this limit, you
>> >> have two options:
>> >>
>> >> a) increasing the maximum token life time
>> >>
>> >> In yarn-site.xml:
>> >>
>> >> <property>
>> >>   <name>yarn.resourcemanager.delegation.token.max-lifetime</name>
>> >>   <value>9223372036854775807</value>
>> >> </property>
>> >>
>> >> In hdfs-site.xml
>> >>
>> >> <property>
>> >>   <name>dfs.namenode.delegation.token.max-lifetime</name>
>> >>   <value>9223372036854775807</value>
>> >> </property>
>> >>
>> >>
>> >> b) setup the Yarn ResourceManager as a proxy for the HDFS Namenode:
>> >>
>> >> From
>> >> http://www.cloudera.com/documentation/enterprise/5-3-x/
>> topics/cm_sg_yarn_long_jobs.html
>> >>
>> >> "You can work around this by configuring the ResourceManager as a
>> >> proxy user for the corresponding HDFS NameNode so that the
>> >> ResourceManager can request new tokens when the existing ones are past
>> >> their maximum lifetime."
>> >>
>> >> @Nils: Could you comment on what worked best for you?
>> >>
>> >> Best,
>> >> Max
>> >>
>> >>
>> >> On Mon, Mar 14, 2016 at 12:24 PM, Thomas Lamirault
>> >> <th...@ericsson.com> wrote:
>> >> >
>> >> > Hello everyone,
>> >> >
>> >> >
>> >> >
>> >> > We are facing the same problem now in our Flink applications, launched
>> >> > using YARN.
>> >> >
>> >> > Just want to know if there is any update about this exception ?
>> >> >
>> >> >
>> >> >
>> >> > Thanks
>> >> >
>> >> >
>> >> >
>> >> > Thomas
>> >> >
>> >> >
>> >> >
>> >> > ________________________________
>> >> >
>> >> > From: niels@basj.es [niels@basj.es] on behalf of Niels Basjes
>> >> > [Niels@basjes.nl]
>> >> > Sent: Friday, 4 December 2015 10:40
>> >> > To: user@flink.apache.org
>> >> > Subject: Re: Flink job on secure Yarn fails after many hours
>> >> >
>> >> > Hi Maximilian,
>> >> >
>> >> > I just downloaded the version from your google drive and used that to
>> >> > run my test topology that accesses HBase.
>> >> > I deliberately started it twice to double the chance to run into this
>> >> > situation.
>> >> >
>> >> > I'll keep you posted.
>> >> >
>> >> > Niels
>> >> >
>> >> >
>> >> > On Thu, Dec 3, 2015 at 11:44 AM, Maximilian Michels <mx...@apache.org>
>> >> > wrote:
>> >> >>
>> >> >> Hi Niels,
>> >> >>
>> >> >> Just got back from our CI. The build above would fail with a
>> >> >> Checkstyle error. I corrected that. Also I have built the binaries
>> for
>> >> >> your Hadoop version 2.6.0.
>> >> >>
>> >> >> Binaries:
>> >> >>
>> >> >>
>> >> >> https://github.com/mxm/flink/archive/kerberos-yarn-heartbeat
>> -fail-0.10.1.zip
>> >> >>
>> >> >> Thanks,
>> >> >> Max
>> >> >>
>> >> >> On Wed, Dec 2, 2015 at 6:52 PM, Maximilian Michels <0.0.0.0:41281
>> >> >> >>>> >> >> > 21:30:28,185 ERROR
>> >> >> >>>> >> >> > org.apache.flink.runtime.jobmanager.JobManager
>> >> >> >>>> >> >> > - Actor akka://flink/user/jobmanager#403236912
>> terminated,
>> >> >> >>>> >> >> > stopping
>> >> >> >>>> >> >> > process...
>> >> >> >>>> >> >> > 21:30:28,286 INFO
>> >> >> >>>> >> >> > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>> >> >> >>>> >> >> > - Removing web root dir
>> >> >> >>>> >> >> > /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd
>> >> >> >>>> >> >> >
>> >> >> >>>> >> >> >
>> >> >> >>>> >> >> > --
>> >> >> >>>> >> >> > Best regards / Met vriendelijke groeten,
>> >> >> >>>> >> >> >
>> >> >> >>>> >> >> > Niels Basjes
>> >> >> >>>> >> >
>> >> >> >>>> >> >
>> >> >> >>>> >> >
>> >> >> >>>> >> >
>> >> >> >>>> >> > --
>> >> >> >>>> >> > Best regards / Met vriendelijke groeten,
>> >> >> >>>> >> >
>> >> >> >>>> >> > Niels Basjes
>> >> >> >>>> >
>> >> >> >>>> >
>> >> >> >>>> >
>> >> >> >>>> >
>> >> >> >>>> > --
>> >> >> >>>> > Best regards / Met vriendelijke groeten,
>> >> >> >>>> >
>> >> >> >>>> > Niels Basjes
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> --
>> >> >> >>> Best regards / Met vriendelijke groeten,
>> >> >> >>>
>> >> >> >>> Niels Basjes
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Best regards / Met vriendelijke groeten,
>> >> >
>> >> > Niels Basjes
>> >
>> >
>> >
>> >
>> > --
>> > Best regards / Met vriendelijke groeten,
>> >
>> > Niels Basjes
>>
>
>


-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: Flink job on secure Yarn fails after many hours

Posted by Robert Metzger <rm...@apache.org>.
Niels, are you still facing this issue?

As far as I understood it, the security changes in Flink 1.2.0 use a new
Kerberos mechanism that allows infinite token renewal.

On Thu, Mar 17, 2016 at 7:30 AM, Maximilian Michels <mx...@apache.org> wrote:

> Hi Niels,
>
> Thanks for the feedback. As far as I know, Hadoop deliberately
> defaults to the one week maximum life time of delegation tokens. Have
> you tried increasing the maximum token life time or was that not an
> option?
>
> I wonder why do you use a while loop? Would it be possible to use the
> Yarn failover mechanism which starts a new ApplicationMaster and
> resubmits the job?
>
> Thanks,
> Max
>
>
> On Thu, Mar 17, 2016 at 12:43 PM, Niels Basjes <Ni...@basjes.nl> wrote:
> > Hi,
> >
> > In my environment doing the "proxy" thing didn't work.
> > With an token expire of 168 hours (1 week) the job consistently
> terminates
> > at exactly (within a margin of 10 seconds) 173.5 hours.
> > So far we have not been able to solve this problem.
> >
> > Our teams now simply assume the thing fails once in a while and have an
> > automatic restart feature (i.e. shell script with a while true loop).
> > The best guess at a root cause is this
> > https://issues.apache.org/jira/browse/HDFS-9276
> >
> > If you have a real solution or a reference to a related bug report to
> this
> > problem then please share!
> >
> > Niels Basjes
> >
> >
> >
> > On Thu, Mar 17, 2016 at 10:20 AM, Thomas Lamirault
> > <th...@ericsson.com> wrote:
> >>
> >> Hi Max,
> >>
> >> I will try these workaround.
> >> Thanks
> >>
> >> Thomas
> >>
> >> ________________________________________
> >> From: Maximilian Michels [mxm@apache.org]
> >> Sent: Tuesday, 15 March 2016 16:51
> >> To: user@flink.apache.org
> >> Cc: Niels Basjes
> >> Subject: Re: Flink job on secure Yarn fails after many hours
> >>
> >> Hi Thomas,
> >>
> >> Nils (CC) and I found out that you need at least Hadoop version 2.6.1
> >> to properly run Kerberos applications on Hadoop clusters. Versions
> >> before that have critical bugs related to the internal security token
> >> handling that may expire the token although it is still valid.
> >>
> >> That said, there is another limitation of Hadoop that the maximum
> >> internal token life time is one week. To work around this limit, you
> >> have two options:
> >>
> >> a) increasing the maximum token life time
> >>
> >> In yarn-site.xml:
> >>
> >> <property>
> >>   <name>yarn.resourcemanager.delegation.token.max-lifetime</name>
> >>   <value>9223372036854775807</value>
> >> </property>
> >>
> >> In hdfs-site.xml
> >>
> >> <property>
> >>   <name>dfs.namenode.delegation.token.max-lifetime</name>
> >>   <value>9223372036854775807</value>
> >> </property>
> >>
> >>
> >> b) setup the Yarn ResourceManager as a proxy for the HDFS Namenode:
> >>
> >> From
> >> http://www.cloudera.com/documentation/enterprise/5-3-
> x/topics/cm_sg_yarn_long_jobs.html
> >>
> >> "You can work around this by configuring the ResourceManager as a
> >> proxy user for the corresponding HDFS NameNode so that the
> >> ResourceManager can request new tokens when the existing ones are past
> >> their maximum lifetime."
> >>
> >> @Nils: Could you comment on what worked best for you?
> >>
> >> Best,
> >> Max
> >>
> >>
> >> On Mon, Mar 14, 2016 at 12:24 PM, Thomas Lamirault
> >> <th...@ericsson.com> wrote:
> >> >
> >> > Hello everyone,
> >> >
> >> >
> >> >
> >> > We are facing the same problem now in our Flink applications, launched
> >> > using YARN.
> >> >
> >> > Just want to know if there is any update about this exception ?
> >> >
> >> >
> >> >
> >> > Thanks
> >> >
> >> >
> >> >
> >> > Thomas
> >> >
> >> >
> >> >
> >> > ________________________________
> >> >
> >> > From: niels@basj.es [niels@basj.es] on behalf of Niels Basjes
> >> > [Niels@basjes.nl]
> >> > Sent: Friday, 4 December 2015 10:40
> >> > To: user@flink.apache.org
> >> > Subject: Re: Flink job on secure Yarn fails after many hours
> >> >
> >> > Hi Maximilian,
> >> >
> >> > I just downloaded the version from your google drive and used that to
> >> > run my test topology that accesses HBase.
> >> > I deliberately started it twice to double the chance to run into this
> >> > situation.
> >> >
> >> > I'll keep you posted.
> >> >
> >> > Niels
> >> >
> >> >
> >> > On Thu, Dec 3, 2015 at 11:44 AM, Maximilian Michels <mx...@apache.org>
> >> > wrote:
> >> >>
> >> >> Hi Niels,
> >> >>
> >> >> Just got back from our CI. The build above would fail with a
> >> >> Checkstyle error. I corrected that. Also I have built the binaries
> for
> >> >> your Hadoop version 2.6.0.
> >> >>
> >> >> Binaries:
> >> >>
> >> >>
> >> >> https://github.com/mxm/flink/archive/kerberos-yarn-
> heartbeat-fail-0.10.1.zip
> >> >>
> >> >> Thanks,
> >> >> Max
> >> >>
> >> >> On Wed, Dec 2, 2015 at 6:52 PM, Maximilian Michels <0.0.0.0:41281
> >> >> >>>> >> >> > 21:30:28,185 ERROR
> >> >> >>>> >> >> > org.apache.flink.runtime.jobmanager.JobManager
> >> >> >>>> >> >> > - Actor akka://flink/user/jobmanager#403236912
> terminated,
> >> >> >>>> >> >> > stopping
> >> >> >>>> >> >> > process...
> >> >> >>>> >> >> > 21:30:28,286 INFO
> >> >> >>>> >> >> > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
> >> >> >>>> >> >> > - Removing web root dir
> >> >> >>>> >> >> > /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd
> >> >> >>>> >> >> >
> >> >> >>>> >> >> >
> >> >> >>>> >> >> > --
> >> >> >>>> >> >> > Best regards / Met vriendelijke groeten,
> >> >> >>>> >> >> >
> >> >> >>>> >> >> > Niels Basjes
> >> >> >>>> >> >
> >> >> >>>> >> >
> >> >> >>>> >> >
> >> >> >>>> >> >
> >> >> >>>> >> > --
> >> >> >>>> >> > Best regards / Met vriendelijke groeten,
> >> >> >>>> >> >
> >> >> >>>> >> > Niels Basjes
> >> >> >>>> >
> >> >> >>>> >
> >> >> >>>> >
> >> >> >>>> >
> >> >> >>>> > --
> >> >> >>>> > Best regards / Met vriendelijke groeten,
> >> >> >>>> >
> >> >> >>>> > Niels Basjes
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>> --
> >> >> >>> Best regards / Met vriendelijke groeten,
> >> >> >>>
> >> >> >>> Niels Basjes
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Best regards / Met vriendelijke groeten,
> >> >
> >> > Niels Basjes
> >
> >
> >
> >
> > --
> > Best regards / Met vriendelijke groeten,
> >
> > Niels Basjes
>

Re: Flink job on secure Yarn fails after many hours

Posted by Maximilian Michels <mx...@apache.org>.
Hi Niels,

Thanks for the feedback. As far as I know, Hadoop deliberately
defaults to a one-week maximum lifetime for delegation tokens. Have
you tried increasing the maximum token lifetime, or was that not an
option?

I wonder why you use a while loop. Would it be possible to use the
Yarn failover mechanism which starts a new ApplicationMaster and
resubmits the job?
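
For reference, that route is configured roughly as follows (values are only
illustrative). In flink-conf.yaml, the number of ApplicationMaster attempts
Flink may use:

yarn.application-attempts: 10

and in yarn-site.xml, the upper bound YARN itself allows per application:

<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>10</value>
</property>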

Thanks,
Max


On Thu, Mar 17, 2016 at 12:43 PM, Niels Basjes <Ni...@basjes.nl> wrote:
> Hi,
>
> In my environment doing the "proxy" thing didn't work.
> With an token expire of 168 hours (1 week) the job consistently terminates
> at exactly (within a margin of 10 seconds) 173.5 hours.
> So far we have not been able to solve this problem.
>
> Our teams now simply assume the thing fails once in a while and have an
> automatic restart feature (i.e. shell script with a while true loop).
> The best guess at a root cause is this
> https://issues.apache.org/jira/browse/HDFS-9276
>
> If you have a real solution or a reference to a related bug report to this
> problem then please share!
>
> Niels Basjes
>
>
>
> On Thu, Mar 17, 2016 at 10:20 AM, Thomas Lamirault
> <th...@ericsson.com> wrote:
>>
>> Hi Max,
>>
>> I will try these workaround.
>> Thanks
>>
>> Thomas
>>
>> ________________________________________
>> From: Maximilian Michels [mxm@apache.org]
>> Sent: Tuesday, 15 March 2016 16:51
>> To: user@flink.apache.org
>> Cc: Niels Basjes
>> Subject: Re: Flink job on secure Yarn fails after many hours
>>
>> Hi Thomas,
>>
>> Nils (CC) and I found out that you need at least Hadoop version 2.6.1
>> to properly run Kerberos applications on Hadoop clusters. Versions
>> before that have critical bugs related to the internal security token
>> handling that may expire the token although it is still valid.
>>
>> That said, there is another limitation of Hadoop that the maximum
>> internal token life time is one week. To work around this limit, you
>> have two options:
>>
>> a) increasing the maximum token life time
>>
>> In yarn-site.xml:
>>
>> <property>
>>   <name>yarn.resourcemanager.delegation.token.max-lifetime</name>
>>   <value>9223372036854775807</value>
>> </property>
>>
>> In hdfs-site.xml
>>
>> <property>
>>   <name>dfs.namenode.delegation.token.max-lifetime</name>
>>   <value>9223372036854775807</value>
>> </property>
>>
>>
>> b) setup the Yarn ResourceManager as a proxy for the HDFS Namenode:
>>
>> From
>> http://www.cloudera.com/documentation/enterprise/5-3-x/topics/cm_sg_yarn_long_jobs.html
>>
>> "You can work around this by configuring the ResourceManager as a
>> proxy user for the corresponding HDFS NameNode so that the
>> ResourceManager can request new tokens when the existing ones are past
>> their maximum lifetime."
>>
>> @Nils: Could you comment on what worked best for you?
>>
>> Best,
>> Max
>>
>>
>> On Mon, Mar 14, 2016 at 12:24 PM, Thomas Lamirault
>> <th...@ericsson.com> wrote:
>> >
>> > Hello everyone,
>> >
>> >
>> >
>> > We are facing the same problem now in our Flink applications, launched
>> > using YARN.
>> >
>> > Just want to know if there is any update about this exception ?
>> >
>> >
>> >
>> > Thanks
>> >
>> >
>> >
>> > Thomas
>> >
>> >
>> >
>> > ________________________________
>> >
>> > From: niels@basj.es [niels@basj.es] on behalf of Niels Basjes
>> > [Niels@basjes.nl]
>> > Sent: Friday, 4 December 2015 10:40
>> > To: user@flink.apache.org
>> > Subject: Re: Flink job on secure Yarn fails after many hours
>> >
>> > Hi Maximilian,
>> >
>> > I just downloaded the version from your google drive and used that to
>> > run my test topology that accesses HBase.
>> > I deliberately started it twice to double the chance to run into this
>> > situation.
>> >
>> > I'll keep you posted.
>> >
>> > Niels
>> >
>> >
>> > On Thu, Dec 3, 2015 at 11:44 AM, Maximilian Michels <mx...@apache.org>
>> > wrote:
>> >>
>> >> Hi Niels,
>> >>
>> >> Just got back from our CI. The build above would fail with a
>> >> Checkstyle error. I corrected that. Also I have built the binaries for
>> >> your Hadoop version 2.6.0.
>> >>
>> >> Binaries:
>> >>
>> >>
>> >> https://github.com/mxm/flink/archive/kerberos-yarn-heartbeat-fail-0.10.1.zip
>> >>
>> >> Thanks,
>> >> Max
>> >>
>> >> On Wed, Dec 2, 2015 at 6:52 PM, Maximilian Michels <0.0.0.0:41281
>> >> >>>> >> >> > 21:30:28,185 ERROR
>> >> >>>> >> >> > org.apache.flink.runtime.jobmanager.JobManager
>> >> >>>> >> >> > - Actor akka://flink/user/jobmanager#403236912 terminated,
>> >> >>>> >> >> > stopping
>> >> >>>> >> >> > process...
>> >> >>>> >> >> > 21:30:28,286 INFO
>> >> >>>> >> >> > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>> >> >>>> >> >> > - Removing web root dir
>> >> >>>> >> >> > /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd
>> >> >>>> >> >> >
>> >> >>>> >> >> >
>> >> >>>> >> >> > --
>> >> >>>> >> >> > Best regards / Met vriendelijke groeten,
>> >> >>>> >> >> >
>> >> >>>> >> >> > Niels Basjes
>> >> >>>> >> >
>> >> >>>> >> >
>> >> >>>> >> >
>> >> >>>> >> >
>> >> >>>> >> > --
>> >> >>>> >> > Best regards / Met vriendelijke groeten,
>> >> >>>> >> >
>> >> >>>> >> > Niels Basjes
>> >> >>>> >
>> >> >>>> >
>> >> >>>> >
>> >> >>>> >
>> >> >>>> > --
>> >> >>>> > Best regards / Met vriendelijke groeten,
>> >> >>>> >
>> >> >>>> > Niels Basjes
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> --
>> >> >>> Best regards / Met vriendelijke groeten,
>> >> >>>
>> >> >>> Niels Basjes
>> >
>> >
>> >
>> >
>> > --
>> > Best regards / Met vriendelijke groeten,
>> >
>> > Niels Basjes
>
>
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes

Re: Flink job on secure Yarn fails after many hours

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi,

In my environment doing the "proxy" thing didn't work.
With a token expiry of 168 hours (1 week) the job consistently terminates
at exactly 173.5 hours (within a margin of 10 seconds).
So far we have not been able to solve this problem.

Our teams now simply assume the thing fails once in a while and have an
automatic restart feature (i.e. shell script with a while true loop).
The best guess at a root cause is this
https://issues.apache.org/jira/browse/HDFS-9276
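
For what it's worth, such a wrapper is essentially just the following sketch
(keytab, principal and jar paths are placeholders, not our actual setup):

#!/usr/bin/env bash
# Keep resubmitting the job whenever it terminates.
while true; do
  # Refresh the Kerberos credentials from the keytab before each attempt.
  kinit -kt /path/to/nbasjes.keytab nbasjes@EXAMPLE.COM
  # Submit the job to YARN; this call blocks until the job terminates.
  flink run -m yarn-cluster -yn 1 /path/to/my-streaming-job.jar
  echo "Job terminated, restarting in 60 seconds..." >&2
  sleep 60
done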

If you have a real solution or a reference to a related bug report to this
problem then please share!

Niels Basjes



On Thu, Mar 17, 2016 at 10:20 AM, Thomas Lamirault <
thomas.lamirault@ericsson.com> wrote:

> Hi Max,
>
> I will try these workaround.
> Thanks
>
> Thomas
>
> ________________________________________
> From: Maximilian Michels [mxm@apache.org]
> Sent: Tuesday, 15 March 2016 16:51
> To: user@flink.apache.org
> Cc: Niels Basjes
> Subject: Re: Flink job on secure Yarn fails after many hours
>
> Hi Thomas,
>
> Nils (CC) and I found out that you need at least Hadoop version 2.6.1
> to properly run Kerberos applications on Hadoop clusters. Versions
> before that have critical bugs related to the internal security token
> handling that may expire the token although it is still valid.
>
> That said, there is another limitation of Hadoop that the maximum
> internal token life time is one week. To work around this limit, you
> have two options:
>
> a) increasing the maximum token life time
>
> In yarn-site.xml:
>
> <property>
>   <name>yarn.resourcemanager.delegation.token.max-lifetime</name>
>   <value>9223372036854775807</value>
> </property>
>
> In hdfs-site.xml
>
> <property>
>   <name>dfs.namenode.delegation.token.max-lifetime</name>
>   <value>9223372036854775807</value>
> </property>
>
>
> b) setup the Yarn ResourceManager as a proxy for the HDFS Namenode:
>
> From
> http://www.cloudera.com/documentation/enterprise/5-3-x/topics/cm_sg_yarn_long_jobs.html
>
> "You can work around this by configuring the ResourceManager as a
> proxy user for the corresponding HDFS NameNode so that the
> ResourceManager can request new tokens when the existing ones are past
> their maximum lifetime."
>
> @Nils: Could you comment on what worked best for you?
>
> Best,
> Max
>
>
> On Mon, Mar 14, 2016 at 12:24 PM, Thomas Lamirault
> <th...@ericsson.com> wrote:
> >
> > Hello everyone,
> >
> >
> >
> > We are facing the same problem now in our Flink applications, launched
> using YARN.
> >
> > Just want to know if there is any update about this exception ?
> >
> >
> >
> > Thanks
> >
> >
> >
> > Thomas
> >
> >
> >
> > ________________________________
> >
> > From: niels@basj.es [niels@basj.es] on behalf of Niels Basjes [
> Niels@basjes.nl]
> > Sent: Friday, 4 December 2015 10:40
> > To: user@flink.apache.org
> > Subject: Re: Flink job on secure Yarn fails after many hours
> >
> > Hi Maximilian,
> >
> > I just downloaded the version from your google drive and used that to
> run my test topology that accesses HBase.
> > I deliberately started it twice to double the chance to run into this
> situation.
> >
> > I'll keep you posted.
> >
> > Niels
> >
> >
> > On Thu, Dec 3, 2015 at 11:44 AM, Maximilian Michels <mx...@apache.org>
> wrote:
> >>
> >> Hi Niels,
> >>
> >> Just got back from our CI. The build above would fail with a
> >> Checkstyle error. I corrected that. Also I have built the binaries for
> >> your Hadoop version 2.6.0.
> >>
> >> Binaries:
> >>
> >>
> https://github.com/mxm/flink/archive/kerberos-yarn-heartbeat-fail-0.10.1.zip
> >>
> >> Thanks,
> >> Max
> >>
> >> On Wed, Dec 2, 2015 at 6:52 PM, Maximilian Michels <0.0.0.0:41281
> >> >>>> >> >> > 21:30:28,185 ERROR
> org.apache.flink.runtime.jobmanager.JobManager
> >> >>>> >> >> > - Actor akka://flink/user/jobmanager#403236912 terminated,
> >> >>>> >> >> > stopping
> >> >>>> >> >> > process...
> >> >>>> >> >> > 21:30:28,286 INFO
> >> >>>> >> >> > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
> >> >>>> >> >> > - Removing web root dir
> >> >>>> >> >> > /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd
> >> >>>> >> >> >
> >> >>>> >> >> >
> >> >>>> >> >> > --
> >> >>>> >> >> > Best regards / Met vriendelijke groeten,
> >> >>>> >> >> >
> >> >>>> >> >> > Niels Basjes
> >> >>>> >> >
> >> >>>> >> >
> >> >>>> >> >
> >> >>>> >> >
> >> >>>> >> > --
> >> >>>> >> > Best regards / Met vriendelijke groeten,
> >> >>>> >> >
> >> >>>> >> > Niels Basjes
> >> >>>> >
> >> >>>> >
> >> >>>> >
> >> >>>> >
> >> >>>> > --
> >> >>>> > Best regards / Met vriendelijke groeten,
> >> >>>> >
> >> >>>> > Niels Basjes
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Best regards / Met vriendelijke groeten,
> >> >>>
> >> >>> Niels Basjes
> >
> >
> >
> >
> > --
> > Best regards / Met vriendelijke groeten,
> >
> > Niels Basjes
>



-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

RE: Flink job on secure Yarn fails after many hours

Posted by Thomas Lamirault <th...@ericsson.com>.
Hi Max,

I will try these workarounds.
Thanks

Thomas

________________________________________
From: Maximilian Michels [mxm@apache.org]
Sent: Tuesday, 15 March 2016 16:51
To: user@flink.apache.org
Cc: Niels Basjes
Subject: Re: Flink job on secure Yarn fails after many hours

Hi Thomas,

Nils (CC) and I found out that you need at least Hadoop version 2.6.1
to properly run Kerberos applications on Hadoop clusters. Versions
before that have critical bugs related to the internal security token
handling that may expire the token although it is still valid.

That said, there is another limitation of Hadoop that the maximum
internal token life time is one week. To work around this limit, you
have two options:

a) increasing the maximum token life time

In yarn-site.xml:

<property>
  <name>yarn.resourcemanager.delegation.token.max-lifetime</name>
  <value>9223372036854775807</value>
</property>

In hdfs-site.xml

<property>
  <name>dfs.namenode.delegation.token.max-lifetime</name>
  <value>9223372036854775807</value>
</property>


b) setup the Yarn ResourceManager as a proxy for the HDFS Namenode:

From http://www.cloudera.com/documentation/enterprise/5-3-x/topics/cm_sg_yarn_long_jobs.html

"You can work around this by configuring the ResourceManager as a
proxy user for the corresponding HDFS NameNode so that the
ResourceManager can request new tokens when the existing ones are past
their maximum lifetime."

@Nils: Could you comment on what worked best for you?

Best,
Max


On Mon, Mar 14, 2016 at 12:24 PM, Thomas Lamirault
<th...@ericsson.com> wrote:
>
> Hello everyone,
>
>
>
> We are facing the same problem now in our Flink applications, launched using YARN.
>
> Just want to know if there is any update about this exception ?
>
>
>
> Thanks
>
>
>
> Thomas
>
>
>
> ________________________________
>
> From: niels@basj.es [niels@basj.es] on behalf of Niels Basjes [Niels@basjes.nl]
> Sent: Friday, 4 December 2015 10:40
> To: user@flink.apache.org
> Subject: Re: Flink job on secure Yarn fails after many hours
>
> Hi Maximilian,
>
> I just downloaded the version from your google drive and used that to run my test topology that accesses HBase.
> I deliberately started it twice to double the chance to run into this situation.
>
> I'll keep you posted.
>
> Niels
>
>
> On Thu, Dec 3, 2015 at 11:44 AM, Maximilian Michels <mx...@apache.org> wrote:
>>
>> Hi Niels,
>>
>> Just got back from our CI. The build above would fail with a
>> Checkstyle error. I corrected that. Also I have built the binaries for
>> your Hadoop version 2.6.0.
>>
>> Binaries:
>>
>> https://github.com/mxm/flink/archive/kerberos-yarn-heartbeat-fail-0.10.1.zip
>>
>> Thanks,
>> Max
>>
>> On Wed, Dec 2, 2015 at 6:52 PM, Maximilian Michels <mx...@apache.org> wrote:
>> >>>> >> >> > 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager
>> >>>> >> >> > - Actor akka://flink/user/jobmanager#403236912 terminated,
>> >>>> >> >> > stopping
>> >>>> >> >> > process...
>> >>>> >> >> > 21:30:28,286 INFO
>> >>>> >> >> > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>> >>>> >> >> > - Removing web root dir
>> >>>> >> >> > /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd
>> >>>> >> >> >
>> >>>> >> >> >
>> >>>> >> >> > --
>> >>>> >> >> > Best regards / Met vriendelijke groeten,
>> >>>> >> >> >
>> >>>> >> >> > Niels Basjes
>> >>>> >> >
>> >>>> >> >
>> >>>> >> >
>> >>>> >> >
>> >>>> >> > --
>> >>>> >> > Best regards / Met vriendelijke groeten,
>> >>>> >> >
>> >>>> >> > Niels Basjes
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > --
>> >>>> > Best regards / Met vriendelijke groeten,
>> >>>> >
>> >>>> > Niels Basjes
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Best regards / Met vriendelijke groeten,
>> >>>
>> >>> Niels Basjes
>
>
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes

Re: Flink job on secure Yarn fails after many hours

Posted by Maximilian Michels <mx...@apache.org>.
Hi Thomas,

Niels (CC) and I found out that you need at least Hadoop version 2.6.1
to properly run Kerberos-secured applications on Hadoop clusters.
Versions before that have critical bugs in the internal security token
handling that may expire a token even though it is still valid.

That said, there is another limitation in Hadoop: the maximum internal
token lifetime is one week. To work around this limit, you have two
options:

a) increasing the maximum token lifetime

In yarn-site.xml:

<property>
  <name>yarn.resourcemanager.delegation.token.max-lifetime</name>
  <value>9223372036854775807</value>
</property>

In hdfs-site.xml:

<property>
  <name>dfs.namenode.delegation.token.max-lifetime</name>
  <value>9223372036854775807</value>
</property>


b) set up the Yarn ResourceManager as a proxy user for the HDFS NameNode:

From http://www.cloudera.com/documentation/enterprise/5-3-x/topics/cm_sg_yarn_long_jobs.html

"You can work around this by configuring the ResourceManager as a
proxy user for the corresponding HDFS NameNode so that the
ResourceManager can request new tokens when the existing ones are past
their maximum lifetime."
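
For illustration only, the proxy-user side of that setup is the usual
Hadoop proxyuser mechanism in core-site.xml on the NameNode. A minimal
sketch, assuming the ResourceManager runs as the "yarn" user; the
wildcard values are placeholders to be restricted on a real cluster,
and depending on the Hadoop version an additional ResourceManager-side
switch may also be needed:

<property>
  <name>hadoop.proxyuser.yarn.hosts</name>
  <!-- hosts from which the proxy user may act; narrow this down -->
  <value>*</value>
</property>

<property>
  <name>hadoop.proxyuser.yarn.groups</name>
  <!-- groups whose members may be impersonated; narrow this down -->
  <value>*</value>
</property>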

@Niels: Could you comment on what worked best for you?

Best,
Max


On Mon, Mar 14, 2016 at 12:24 PM, Thomas Lamirault
<th...@ericsson.com> wrote:
>
> Hello everyone,
>
>
>
> We are facing the same problem now in our Flink applications, launched using YARN.
>
> Just want to know if there is any update about this exception?
>
>
>
> Thanks
>
>
>
> Thomas
>
>
>
> ________________________________
>
> From: niels@basj.es [niels@basj.es] on behalf of Niels Basjes [Niels@basjes.nl]
> Sent: Friday, 4 December 2015 10:40
> To: user@flink.apache.org
> Subject: Re: Flink job on secure Yarn fails after many hours
>
> Hi Maximilian,
>
> I just downloaded the version from your google drive and used that to run my test topology that accesses HBase.
> I deliberately started it twice to double the chance to run into this situation.
>
> I'll keep you posted.
>
> Niels
>
>
> On Thu, Dec 3, 2015 at 11:44 AM, Maximilian Michels <mx...@apache.org> wrote:
>>
>> Hi Niels,
>>
>> Just got back from our CI. The build above would fail with a
>> Checkstyle error. I corrected that. Also I have built the binaries for
>> your Hadoop version 2.6.0.
>>
>> Binaries:
>>
>> https://github.com/mxm/flink/archive/kerberos-yarn-heartbeat-fail-0.10.1.zip
>>
>> Thanks,
>> Max
>>
>> On Wed, Dec 2, 2015 at 6:52 PM, Maximilian Michels <mx...@apache.org> wrote:
>> >>>> >> >> > 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager
>> >>>> >> >> > - Actor akka://flink/user/jobmanager#403236912 terminated,
>> >>>> >> >> > stopping
>> >>>> >> >> > process...
>> >>>> >> >> > 21:30:28,286 INFO
>> >>>> >> >> > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>> >>>> >> >> > - Removing web root dir
>> >>>> >> >> > /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd
>> >>>> >> >> >
>> >>>> >> >> >
>> >>>> >> >> > --
>> >>>> >> >> > Best regards / Met vriendelijke groeten,
>> >>>> >> >> >
>> >>>> >> >> > Niels Basjes
>> >>>> >> >
>> >>>> >> >
>> >>>> >> >
>> >>>> >> >
>> >>>> >> > --
>> >>>> >> > Best regards / Met vriendelijke groeten,
>> >>>> >> >
>> >>>> >> > Niels Basjes
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> > --
>> >>>> > Best regards / Met vriendelijke groeten,
>> >>>> >
>> >>>> > Niels Basjes
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Best regards / Met vriendelijke groeten,
>> >>>
>> >>> Niels Basjes
>
>
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes

RE:Flink job on secure Yarn fails after many hours

Posted by Thomas Lamirault <th...@ericsson.com>.
Hello everyone,



We are facing the same problem now in our Flink applications, launched using YARN.

Just want to know if there is any update about this exception?



Thanks



Thomas



________________________________

From: niels@basj.es [niels@basj.es] on behalf of Niels Basjes [Niels@basjes.nl]
Sent: Friday, 4 December 2015 10:40
To: user@flink.apache.org
Subject: Re: Flink job on secure Yarn fails after many hours

Hi Maximilian,

I just downloaded the version from your google drive and used that to run my test topology that accesses HBase.
I deliberately started it twice to double the chance to run into this situation.

I'll keep you posted.

Niels


On Thu, Dec 3, 2015 at 11:44 AM, Maximilian Michels <mx...@apache.org> wrote:
Hi Niels,

Just got back from our CI. The build above would fail with a
Checkstyle error. I corrected that. Also I have built the binaries for
your Hadoop version 2.6.0.

Binaries:

https://github.com/mxm/flink/archive/kerberos-yarn-heartbeat-fail-0.10.1.zip

Thanks,
Max

On Wed, Dec 2, 2015 at 6:52 PM, Maximilian Michels <mx...@apache.org> wrote:
>>>> >> >> > 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager
>>>> >> >> > - Actor akka://flink/user/jobmanager#403236912 terminated,
>>>> >> >> > stopping
>>>> >> >> > process...
>>>> >> >> > 21:30:28,286 INFO
>>>> >> >> > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>>>> >> >> > - Removing web root dir
>>>> >> >> > /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > --
>>>> >> >> > Best regards / Met vriendelijke groeten,
>>>> >> >> >
>>>> >> >> > Niels Basjes
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > --
>>>> >> > Best regards / Met vriendelijke groeten,
>>>> >> >
>>>> >> > Niels Basjes
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Best regards / Met vriendelijke groeten,
>>>> >
>>>> > Niels Basjes
>>>
>>>
>>>
>>>
>>> --
>>> Best regards / Met vriendelijke groeten,
>>>
>>> Niels Basjes



--
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: Flink job on secure Yarn fails after many hours

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi Maximilian,

I just downloaded the version from your google drive and used that to run
my test topology that accesses HBase.
I deliberately started it twice to double the chance to run into this
situation.

I'll keep you posted.

Niels


On Thu, Dec 3, 2015 at 11:44 AM, Maximilian Michels <mx...@apache.org> wrote:

> Hi Niels,
>
> Just got back from our CI. The build above would fail with a
> Checkstyle error. I corrected that. Also I have built the binaries for
> your Hadoop version 2.6.0.
>
> Binaries:
>
>
> https://drive.google.com/file/d/0BziY9U_qva1sZ1FVR3RWeVNrNzA/view?usp=sharing
>
> Source:
>
> https://github.com/mxm/flink/tree/kerberos-yarn-heartbeat-fail-0.10.1
>
> git fetch https://github.com/mxm/flink/ \
> kerberos-yarn-heartbeat-fail-0.10.1 && git checkout FETCH_HEAD
>
>
> https://github.com/mxm/flink/archive/kerberos-yarn-heartbeat-fail-0.10.1.zip
>
> Thanks,
> Max
>
> On Wed, Dec 2, 2015 at 6:52 PM, Maximilian Michels <mx...@apache.org> wrote:
> > I forgot you're using Flink 0.10.1. The above was for the master.
> >
> > So here's the commit for Flink 0.10.1:
> >
> https://github.com/mxm/flink/commit/a41f3866f4097586a7b2262093088861b62930cd
> >
> > git fetch https://github.com/mxm/flink/ \
> > a41f3866f4097586a7b2262093088861b62930cd && git checkout FETCH_HEAD
> >
> >
> https://github.com/mxm/flink/archive/a41f3866f4097586a7b2262093088861b62930cd.zip
> >
> > Thanks,
> > Max
> >
> > On Wed, Dec 2, 2015 at 6:39 PM, Maximilian Michels <mx...@apache.org>
> wrote:
> >> Great. Here is the commit to try out:
> >>
> https://github.com/mxm/flink/commit/f49b9635bec703541f19cb8c615f302a07ea88b3
> >>
> >> If you already have the Flink repository, check it out using
> >>
> >> git fetch https://github.com/mxm/flink/
> >> f49b9635bec703541f19cb8c615f302a07ea88b3 && git checkout FETCH_HEAD
> >>
> >> Alternatively, here's a direct download link to the sources with the
> >> fix included:
> >>
> https://github.com/mxm/flink/archive/f49b9635bec703541f19cb8c615f302a07ea88b3.zip
> >>
> >> Thanks a lot,
> >> Max
> >>
> >> On Wed, Dec 2, 2015 at 5:44 PM, Niels Basjes <Ni...@basjes.nl> wrote:
> >>> Sure, just give me the git repo url to build and I'll give it a try.
> >>>
> >>> Niels
> >>>
> >>> On Wed, Dec 2, 2015 at 4:28 PM, Maximilian Michels <mx...@apache.org>
> wrote:
> >>>>
> >>>> I mentioned that the exception gets thrown when requesting container
> >>>> status information. We need this to send a heartbeat to YARN but it is
> >>>> not very crucial if this fails once for the running job. Possibly, we
> >>>> could work around this problem by retrying N times in case of an
> >>>> exception.
> >>>>
> >>>> Would it be possible for you to deploy a custom Flink 0.10.1 version
> >>>> we provide and test again?
> >>>>
> >>>> On Wed, Dec 2, 2015 at 4:03 PM, Niels Basjes <Ni...@basjes.nl> wrote:
> >>>> > No, I was just asking.
> >>>> > No upgrade is possible for the next month or two.
> >>>> >
> >>>> > This week is our busiest of the year ...
> >>>> > Our shop is doing about 10 orders per second these days ...
> >>>> >
> >>>> > So they won't upgrade until next January/February
> >>>> >
> >>>> > Niels
> >>>> >
> >>>> > On Wed, Dec 2, 2015 at 3:59 PM, Maximilian Michels <mx...@apache.org>
> >>>> > wrote:
> >>>> >>
> >>>> >> Hi Niels,
> >>>> >>
> >>>> >> You mentioned you have the option to update Hadoop and redeploy the
> >>>> >> job. Would be great if you could do that and let us know how it
> turns
> >>>> >> out.
> >>>> >>
> >>>> >> Cheers,
> >>>> >> Max
> >>>> >>
> >>>> >> On Wed, Dec 2, 2015 at 3:45 PM, Niels Basjes <Ni...@basjes.nl>
> wrote:
> >>>> >> > Hi,
> >>>> >> >
> >>>> >> > I posted the entire log from the first log line at the moment of
> >>>> >> > failure
> >>>> >> > to
> >>>> >> > the very end of the logfile.
> >>>> >> > This is all I have.
> >>>> >> >
> >>>> >> > As far as I understand the Kerberos and Keytab mechanism in
> Hadoop
> >>>> >> > Yarn
> >>>> >> > is
> >>>> >> > that it catches the "Invalid Token" and then (if keytab) gets a
> new
> >>>> >> > Kerberos
> >>>> >> > ticket (or tgt?).
> >>>> >> > When the new ticket has been obtained it retries the call that
> >>>> >> > previously
> >>>> >> > failed.
> >>>> >> > To me it seemed that this call can fail over the invalid Token
> yet it
> >>>> >> > cannot
> >>>> >> > be retried.
> >>>> >> >
> >>>> >> > At this moment I'm thinking a bug in Hadoop.
> >>>> >> >
> >>>> >> > Niels
> >>>> >> >
> >>>> >> > On Wed, Dec 2, 2015 at 2:51 PM, Maximilian Michels <
> mxm@apache.org>
> >>>> >> > wrote:
> >>>> >> >>
> >>>> >> >> Hi Niels,
> >>>> >> >>
> >>>> >> >> Sorry to hear you experienced this exception. From a first
> glance,
> >>>> >> >> it
> >>>> >> >> looks like a bug in Hadoop to me.
> >>>> >> >>
> >>>> >> >> > "Not retrying because the invoked method is not idempotent,
> and
> >>>> >> >> > unable
> >>>> >> >> > to determine whether it was invoked"
> >>>> >> >>
> >>>> >> >> That is nothing to worry about. This is Hadoop's internal retry
> >>>> >> >> mechanism that re-attempts to do actions which previously
> failed if
> >>>> >> >> that's possible. Since the action is not idempotent (it cannot
> be
> >>>> >> >> executed again without risking to change the state of the
> execution)
> >>>> >> >> and it also doesn't track its execution states, it won't be
> retried
> >>>> >> >> again.
> >>>> >> >>
> >>>> >> >> The main issue is this exception:
> >>>> >> >>
> >>>> >> >> >org.apache.hadoop.security.token.SecretManager$InvalidToken:
> >>>> >> >> > Invalid
> >>>> >> >> > AMRMToken from >appattempt_1443166961758_163901_000001
> >>>> >> >>
> >>>> >> >> From the stack trace it is clear that this exception occurs upon
> >>>> >> >> requesting container status information from the Resource
> Manager:
> >>>> >> >>
> >>>> >> >> >at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
> >>>> >> >>
> >>>> >> >> Are there any more exceptions in the log? Do you have the
> complete
> >>>> >> >> logs available and could you share them?
> >>>> >> >>
> >>>> >> >>
> >>>> >> >> Best regards,
> >>>> >> >> Max
> >>>> >> >>
> >>>> >> >> On Wed, Dec 2, 2015 at 11:47 AM, Niels Basjes <Ni...@basjes.nl>
> >>>> >> >> wrote:
> >>>> >> >> > Hi,
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> > We have a Kerberos secured Yarn cluster here and I'm
> experimenting
> >>>> >> >> > with
> >>>> >> >> > Apache Flink on top of that.
> >>>> >> >> >
> >>>> >> >> > A few days ago I started a very simple Flink application (just
> >>>> >> >> > stream
> >>>> >> >> > the
> >>>> >> >> > time as a String into HBase 10 times per second).
> >>>> >> >> >
> >>>> >> >> > I (deliberately) asked our IT-ops guys to make my account
> have a
> >>>> >> >> > max
> >>>> >> >> > ticket
> >>>> >> >> > time of 5 minutes and a max renew time of 10 minutes (yes,
> >>>> >> >> > ridiculously
> >>>> >> >> > low
> >>>> >> >> > timeout values because I needed to validate this
> >>>> >> >> > https://issues.apache.org/jira/browse/FLINK-2977).
> >>>> >> >> >
> >>>> >> >> > This job is started with a keytab file and after running for
> 31
> >>>> >> >> > hours
> >>>> >> >> > it
> >>>> >> >> > suddenly failed with the exception you see below.
> >>>> >> >> >
> >>>> >> >> > I had the same job running for almost 400 hours until that
> failed
> >>>> >> >> > too
> >>>> >> >> > (I
> >>>> >> >> > was
> >>>> >> >> > too late to check the logfiles but I suspect the same
> problem).
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> > So in that time span my tickets have expired and new tickets
> have
> >>>> >> >> > been
> >>>> >> >> > obtained several hundred times.
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> > The main error I see is that in the process of a ticket
> expiring
> >>>> >> >> > and
> >>>> >> >> > being
> >>>> >> >> > renewed I see this message:
> >>>> >> >> >
> >>>> >> >> >      Not retrying because the invoked method is not
> idempotent,
> >>>> >> >> > and
> >>>> >> >> > unable
> >>>> >> >> > to determine whether it was invoked
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> > Yarn on the cluster is 2.6.0 ( HDP 2.6.0.2.2.4.2-2 )
> >>>> >> >> >
> >>>> >> >> > Flink is version 0.10.1
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> > How do I fix this?
> >>>> >> >> > Is this a bug (in either Hadoop or Flink) or am I doing
> something
> >>>> >> >> > wrong?
> >>>> >> >> > Would upgrading Yarn to 2.7.1  (i.e. HDP 2.3) fix this?
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> > Niels Basjes
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> > 21:30:27,821 WARN
> org.apache.hadoop.security.UserGroupInformation
> >>>> >> >> > - PriviledgedActionException as:nbasjes (auth:SIMPLE)
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> >>>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
> >>>> >> >> > 21:30:27,861 WARN  org.apache.hadoop.ipc.Client
> >>>> >> >> > - Exception encountered while connecting to the server :
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> >>>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
> >>>> >> >> > 21:30:27,861 WARN
> org.apache.hadoop.security.UserGroupInformation
> >>>> >> >> > - PriviledgedActionException as:nbasjes (auth:SIMPLE)
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> >>>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
> >>>> >> >> > 21:30:27,891 WARN
> >>>> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler
> >>>> >> >> > - Exception while invoking class
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate.
> >>>> >> >> > Not retrying because the invoked method is not idempotent, and
> >>>> >> >> > unable
> >>>> >> >> > to
> >>>> >> >> > determine whether it was invoked
> >>>> >> >> > org.apache.hadoop.security.token.SecretManager$InvalidToken:
> >>>> >> >> > Invalid
> >>>> >> >> > AMRMToken from appattempt_1443166961758_163901_000001
> >>>> >> >> >       at
> >>>> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> >>>> >> >> > Method)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> >>>> >> >> >       at
> >>>> >> >> >
> java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
> >>>> >> >> >       at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown
> >>>> >> >> > Source)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >>>> >> >> >       at java.lang.reflect.Method.invoke(Method.java:606)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> >>>> >> >> >       at com.sun.proxy.$Proxy14.allocate(Unknown Source)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
> >>>> >> >> >       at
> >>>> >> >> > scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
> >>>> >> >> >       at
> >>>> >> >> >
> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
> >>>> >> >> >       at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
> >>>> >> >> >       at
> akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> >>>> >> >> >       at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> >>>> >> >> >       at
> akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
> >>>> >> >> >       at akka.dispatch.Mailbox.run(Mailbox.scala:221)
> >>>> >> >> >       at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> >>>> >> >> > Caused by:
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> >>>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
> >>>> >> >> >       at org.apache.hadoop.ipc.Client.call(Client.java:1406)
> >>>> >> >> >       at org.apache.hadoop.ipc.Client.call(Client.java:1359)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> >>>> >> >> >       at com.sun.proxy.$Proxy13.allocate(Unknown Source)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> >>>> >> >> >       ... 29 more
> >>>> >> >> > 21:30:27,943 ERROR akka.actor.OneForOneStrategy
> >>>> >> >> > - Invalid AMRMToken from
> appattempt_1443166961758_163901_000001
> >>>> >> >> > org.apache.hadoop.security.token.SecretManager$InvalidToken:
> >>>> >> >> > Invalid
> >>>> >> >> > AMRMToken from appattempt_1443166961758_163901_000001
> >>>> >> >> >       at
> >>>> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> >>>> >> >> > Method)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> >>>> >> >> >       at
> >>>> >> >> >
> java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
> >>>> >> >> >       at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown
> >>>> >> >> > Source)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >>>> >> >> >       at java.lang.reflect.Method.invoke(Method.java:606)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> >>>> >> >> >       at com.sun.proxy.$Proxy14.allocate(Unknown Source)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
> >>>> >> >> >       at
> >>>> >> >> > scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
> >>>> >> >> >       at
> >>>> >> >> >
> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
> >>>> >> >> >       at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
> >>>> >> >> >       at
> akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> >>>> >> >> >       at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> >>>> >> >> >       at
> akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
> >>>> >> >> >       at akka.dispatch.Mailbox.run(Mailbox.scala:221)
> >>>> >> >> >       at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> >>>> >> >> > Caused by:
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> >>>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
> >>>> >> >> >       at org.apache.hadoop.ipc.Client.call(Client.java:1406)
> >>>> >> >> >       at org.apache.hadoop.ipc.Client.call(Client.java:1359)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> >>>> >> >> >       at com.sun.proxy.$Proxy13.allocate(Unknown Source)
> >>>> >> >> >       at
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> >
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> >>>> >> >> >       ... 29 more
> >>>> >> >> > 21:30:28,075 INFO  org.apache.flink.yarn.YarnJobManager
> >>>> >> >> > - Stopping JobManager
> >>>> >> >> > akka.tcp://flink@10.10.200.3:39527/user/jobmanager.
> >>>> >> >> > 21:30:28,088 INFO
> >>>> >> >> > org.apache.flink.runtime.executiongraph.ExecutionGraph
> >>>> >> >> > - Source: Custom Source -> Sink: Unnamed (1/1)
> >>>> >> >> > (db0d95c11c14505827e696eec7efab94) switched from RUNNING to
> >>>> >> >> > CANCELING
> >>>> >> >> > 21:30:28,113 INFO
> >>>> >> >> > org.apache.flink.runtime.executiongraph.ExecutionGraph
> >>>> >> >> > - Source: Custom Source -> Sink: Unnamed (1/1)
> >>>> >> >> > (db0d95c11c14505827e696eec7efab94) switched from CANCELING to
> >>>> >> >> > FAILED
> >>>> >> >> > 21:30:28,184 INFO  org.apache.flink.runtime.blob.BlobServer
> >>>> >> >> > - Stopped BLOB server at 0.0.0.0:41281
> >>>> >> >> > 21:30:28,185 ERROR
> org.apache.flink.runtime.jobmanager.JobManager
> >>>> >> >> > - Actor akka://flink/user/jobmanager#403236912 terminated,
> >>>> >> >> > stopping
> >>>> >> >> > process...
> >>>> >> >> > 21:30:28,286 INFO
> >>>> >> >> > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
> >>>> >> >> > - Removing web root dir
> >>>> >> >> > /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> > --
> >>>> >> >> > Best regards / Met vriendelijke groeten,
> >>>> >> >> >
> >>>> >> >> > Niels Basjes
> >>>> >> >
> >>>> >> >
> >>>> >> >
> >>>> >> >
> >>>> >> > --
> >>>> >> > Best regards / Met vriendelijke groeten,
> >>>> >> >
> >>>> >> > Niels Basjes
> >>>> >
> >>>> >
> >>>> >
> >>>> >
> >>>> > --
> >>>> > Best regards / Met vriendelijke groeten,
> >>>> >
> >>>> > Niels Basjes
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> Best regards / Met vriendelijke groeten,
> >>>
> >>> Niels Basjes
>



-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: Flink job on secure Yarn fails after many hours

Posted by Maximilian Michels <mx...@apache.org>.
Hi Niels,

Just got back from our CI. The build above would fail with a
Checkstyle error. I corrected that. Also I have built the binaries for
your Hadoop version 2.6.0.

Binaries:

https://drive.google.com/file/d/0BziY9U_qva1sZ1FVR3RWeVNrNzA/view?usp=sharing

Source:

https://github.com/mxm/flink/tree/kerberos-yarn-heartbeat-fail-0.10.1

git fetch https://github.com/mxm/flink/ \
kerberos-yarn-heartbeat-fail-0.10.1 && git checkout FETCH_HEAD

https://github.com/mxm/flink/archive/kerberos-yarn-heartbeat-fail-0.10.1.zip
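
For reference, a rough sketch of how to build matching binaries yourself
from the sources above (assuming a plain Apache Maven build; vendor
distributions may need additional profiles):

# from the root of the checked-out Flink sources
mvn clean install -DskipTests -Dhadoop.version=2.6.0

The resulting distribution should then end up under flink-dist/target
and match the cluster's Hadoop 2.6.0 client libraries.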

Thanks,
Max

On Wed, Dec 2, 2015 at 6:52 PM, Maximilian Michels <mx...@apache.org> wrote:
> I forgot you're using Flink 0.10.1. The above was for the master.
>
> So here's the commit for Flink 0.10.1:
> https://github.com/mxm/flink/commit/a41f3866f4097586a7b2262093088861b62930cd
>
> git fetch https://github.com/mxm/flink/ \
> a41f3866f4097586a7b2262093088861b62930cd && git checkout FETCH_HEAD
>
> https://github.com/mxm/flink/archive/a41f3866f4097586a7b2262093088861b62930cd.zip
>
> Thanks,
> Max
>
> On Wed, Dec 2, 2015 at 6:39 PM, Maximilian Michels <mx...@apache.org> wrote:
>> Great. Here is the commit to try out:
>> https://github.com/mxm/flink/commit/f49b9635bec703541f19cb8c615f302a07ea88b3
>>
>> If you already have the Flink repository, check it out using
>>
>> git fetch https://github.com/mxm/flink/
>> f49b9635bec703541f19cb8c615f302a07ea88b3 && git checkout FETCH_HEAD
>>
>> Alternatively, here's a direct download link to the sources with the
>> fix included:
>> https://github.com/mxm/flink/archive/f49b9635bec703541f19cb8c615f302a07ea88b3.zip
>>
>> Thanks a lot,
>> Max
>>
>> On Wed, Dec 2, 2015 at 5:44 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>>> Sure, just give me the git repo url to build and I'll give it a try.
>>>
>>> Niels
>>>
>>> On Wed, Dec 2, 2015 at 4:28 PM, Maximilian Michels <mx...@apache.org> wrote:
>>>>
>>>> I mentioned that the exception gets thrown when requesting container
>>>> status information. We need this to send a heartbeat to YARN but it is
>>>> not very crucial if this fails once for the running job. Possibly, we
>>>> could work around this problem by retrying N times in case of an
>>>> exception.
>>>>
>>>> Would it be possible for you to deploy a custom Flink 0.10.1 version
>>>> we provide and test again?
>>>>
>>>> On Wed, Dec 2, 2015 at 4:03 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>>>> > No, I was just asking.
>>>> > No upgrade is possible for the next month or two.
>>>> >
>>>> > This week is our busiest of the year ...
>>>> > Our shop is doing about 10 orders per second these days ...
>>>> >
>>>> > So they won't upgrade until next January/February
>>>> >
>>>> > Niels
>>>> >
>>>> > On Wed, Dec 2, 2015 at 3:59 PM, Maximilian Michels <mx...@apache.org>
>>>> > wrote:
>>>> >>
>>>> >> Hi Niels,
>>>> >>
>>>> >> You mentioned you have the option to update Hadoop and redeploy the
>>>> >> job. Would be great if you could do that and let us know how it turns
>>>> >> out.
>>>> >>
>>>> >> Cheers,
>>>> >> Max
>>>> >>
>>>> >> On Wed, Dec 2, 2015 at 3:45 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>>>> >> > Hi,
>>>> >> >
>>>> >> > I posted the entire log from the first log line at the moment of
>>>> >> > failure
>>>> >> > to
>>>> >> > the very end of the logfile.
>>>> >> > This is all I have.
>>>> >> >
>>>> >> > As far as I understand the Kerberos and Keytab mechanism in Hadoop
>>>> >> > Yarn
>>>> >> > is
>>>> >> > that it catches the "Invalid Token" and then (if keytab) gets a new
>>>> >> > Kerberos
>>>> >> > ticket (or tgt?).
>>>> >> > When the new ticket has been obtained it retries the call that
>>>> >> > previously
>>>> >> > failed.
>>>> >> > To me it seemed that this call can fail over the invalid Token yet it
>>>> >> > cannot
>>>> >> > be retried.
>>>> >> >
>>>> >> > At this moment I'm thinking a bug in Hadoop.
>>>> >> >
>>>> >> > Niels
>>>> >> >
>>>> >> > On Wed, Dec 2, 2015 at 2:51 PM, Maximilian Michels <mx...@apache.org>
>>>> >> > wrote:
>>>> >> >>
>>>> >> >> Hi Niels,
>>>> >> >>
>>>> >> >> Sorry to hear you experienced this exception. From a first glance,
>>>> >> >> it
>>>> >> >> looks like a bug in Hadoop to me.
>>>> >> >>
>>>> >> >> > "Not retrying because the invoked method is not idempotent, and
>>>> >> >> > unable
>>>> >> >> > to determine whether it was invoked"
>>>> >> >>
>>>> >> >> That is nothing to worry about. This is Hadoop's internal retry
>>>> >> >> mechanism that re-attempts to do actions which previously failed if
>>>> >> >> that's possible. Since the action is not idempotent (it cannot be
>>>> >> >> executed again without risking to change the state of the execution)
>>>> >> >> and it also doesn't track its execution states, it won't be retried
>>>> >> >> again.
>>>> >> >>
>>>> >> >> The main issue is this exception:
>>>> >> >>
>>>> >> >> >org.apache.hadoop.security.token.SecretManager$InvalidToken:
>>>> >> >> > Invalid
>>>> >> >> > AMRMToken from >appattempt_1443166961758_163901_000001
>>>> >> >>
>>>> >> >> From the stack trace it is clear that this exception occurs upon
>>>> >> >> requesting container status information from the Resource Manager:
>>>> >> >>
>>>> >> >> >at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
>>>> >> >>
>>>> >> >> Are there any more exceptions in the log? Do you have the complete
>>>> >> >> logs available and could you share them?
>>>> >> >>
>>>> >> >>
>>>> >> >> Best regards,
>>>> >> >> Max
>>>> >> >>
>>>> >> >> On Wed, Dec 2, 2015 at 11:47 AM, Niels Basjes <Ni...@basjes.nl>
>>>> >> >> wrote:
>>>> >> >> > Hi,
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > We have a Kerberos secured Yarn cluster here and I'm experimenting
>>>> >> >> > with
>>>> >> >> > Apache Flink on top of that.
>>>> >> >> >
>>>> >> >> > A few days ago I started a very simple Flink application (just
>>>> >> >> > stream
>>>> >> >> > the
>>>> >> >> > time as a String into HBase 10 times per second).
>>>> >> >> >
>>>> >> >> > I (deliberately) asked our IT-ops guys to make my account have a
>>>> >> >> > max
>>>> >> >> > ticket
>>>> >> >> > time of 5 minutes and a max renew time of 10 minutes (yes,
>>>> >> >> > ridiculously
>>>> >> >> > low
>>>> >> >> > timeout values because I needed to validate this
>>>> >> >> > https://issues.apache.org/jira/browse/FLINK-2977).
>>>> >> >> >
>>>> >> >> > This job is started with a keytab file and after running for 31
>>>> >> >> > hours
>>>> >> >> > it
>>>> >> >> > suddenly failed with the exception you see below.
>>>> >> >> >
>>>> >> >> > I had the same job running for almost 400 hours until that failed
>>>> >> >> > too
>>>> >> >> > (I
>>>> >> >> > was
>>>> >> >> > too late to check the logfiles but I suspect the same problem).
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > So in that time span my tickets have expired and new tickets have
>>>> >> >> > been
>>>> >> >> > obtained several hundred times.
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > The main error I see is that in the process of a ticket expiring
>>>> >> >> > and
>>>> >> >> > being
>>>> >> >> > renewed I see this message:
>>>> >> >> >
>>>> >> >> >      Not retrying because the invoked method is not idempotent,
>>>> >> >> > and
>>>> >> >> > unable
>>>> >> >> > to determine whether it was invoked
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > Yarn on the cluster is 2.6.0 ( HDP 2.6.0.2.2.4.2-2 )
>>>> >> >> >
>>>> >> >> > Flink is version 0.10.1
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > How do I fix this?
>>>> >> >> > Is this a bug (in either Hadoop or Flink) or am I doing something
>>>> >> >> > wrong?
>>>> >> >> > Would upgrading Yarn to 2.7.1  (i.e. HDP 2.3) fix this?
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > Niels Basjes
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > 21:30:27,821 WARN  org.apache.hadoop.security.UserGroupInformation
>>>> >> >> > - PriviledgedActionException as:nbasjes (auth:SIMPLE)
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>>>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
>>>> >> >> > 21:30:27,861 WARN  org.apache.hadoop.ipc.Client
>>>> >> >> > - Exception encountered while connecting to the server :
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>>>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
>>>> >> >> > 21:30:27,861 WARN  org.apache.hadoop.security.UserGroupInformation
>>>> >> >> > - PriviledgedActionException as:nbasjes (auth:SIMPLE)
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>>>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
>>>> >> >> > 21:30:27,891 WARN
>>>> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler
>>>> >> >> > - Exception while invoking class
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate.
>>>> >> >> > Not retrying because the invoked method is not idempotent, and
>>>> >> >> > unable
>>>> >> >> > to
>>>> >> >> > determine whether it was invoked
>>>> >> >> > org.apache.hadoop.security.token.SecretManager$InvalidToken:
>>>> >> >> > Invalid
>>>> >> >> > AMRMToken from appattempt_1443166961758_163901_000001
>>>> >> >> >       at
>>>> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>>> >> >> > Method)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>> >> >> >       at
>>>> >> >> > java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
>>>> >> >> >       at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown
>>>> >> >> > Source)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> >> >> >       at java.lang.reflect.Method.invoke(Method.java:606)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>>>> >> >> >       at com.sun.proxy.$Proxy14.allocate(Unknown Source)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
>>>> >> >> >       at
>>>> >> >> > scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>>> >> >> >       at
>>>> >> >> > scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>>> >> >> >       at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
>>>> >> >> >       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>>> >> >> >       at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>>>> >> >> >       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>>>> >> >> >       at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>>>> >> >> >       at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>> >> >> > Caused by:
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>>>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
>>>> >> >> >       at org.apache.hadoop.ipc.Client.call(Client.java:1406)
>>>> >> >> >       at org.apache.hadoop.ipc.Client.call(Client.java:1359)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>>>> >> >> >       at com.sun.proxy.$Proxy13.allocate(Unknown Source)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
>>>> >> >> >       ... 29 more
>>>> >> >> > 21:30:27,943 ERROR akka.actor.OneForOneStrategy
>>>> >> >> > - Invalid AMRMToken from appattempt_1443166961758_163901_000001
>>>> >> >> > org.apache.hadoop.security.token.SecretManager$InvalidToken:
>>>> >> >> > Invalid
>>>> >> >> > AMRMToken from appattempt_1443166961758_163901_000001
>>>> >> >> >       at
>>>> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>>> >> >> > Method)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>> >> >> >       at
>>>> >> >> > java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
>>>> >> >> >       at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown
>>>> >> >> > Source)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> >> >> >       at java.lang.reflect.Method.invoke(Method.java:606)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>>>> >> >> >       at com.sun.proxy.$Proxy14.allocate(Unknown Source)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
>>>> >> >> >       at
>>>> >> >> > scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>>> >> >> >       at
>>>> >> >> > scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>>> >> >> >       at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
>>>> >> >> >       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>>> >> >> >       at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>>>> >> >> >       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>>>> >> >> >       at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>>>> >> >> >       at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>> >> >> > Caused by:
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>>>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
>>>> >> >> >       at org.apache.hadoop.ipc.Client.call(Client.java:1406)
>>>> >> >> >       at org.apache.hadoop.ipc.Client.call(Client.java:1359)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>>>> >> >> >       at com.sun.proxy.$Proxy13.allocate(Unknown Source)
>>>> >> >> >       at
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
>>>> >> >> >       ... 29 more
>>>> >> >> > 21:30:28,075 INFO  org.apache.flink.yarn.YarnJobManager
>>>> >> >> > - Stopping JobManager
>>>> >> >> > akka.tcp://flink@10.10.200.3:39527/user/jobmanager.
>>>> >> >> > 21:30:28,088 INFO
>>>> >> >> > org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>> >> >> > - Source: Custom Source -> Sink: Unnamed (1/1)
>>>> >> >> > (db0d95c11c14505827e696eec7efab94) switched from RUNNING to
>>>> >> >> > CANCELING
>>>> >> >> > 21:30:28,113 INFO
>>>> >> >> > org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>> >> >> > - Source: Custom Source -> Sink: Unnamed (1/1)
>>>> >> >> > (db0d95c11c14505827e696eec7efab94) switched from CANCELING to
>>>> >> >> > FAILED
>>>> >> >> > 21:30:28,184 INFO  org.apache.flink.runtime.blob.BlobServer
>>>> >> >> > - Stopped BLOB server at 0.0.0.0:41281
>>>> >> >> > 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager
>>>> >> >> > - Actor akka://flink/user/jobmanager#403236912 terminated,
>>>> >> >> > stopping
>>>> >> >> > process...
>>>> >> >> > 21:30:28,286 INFO
>>>> >> >> > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>>>> >> >> > - Removing web root dir
>>>> >> >> > /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > --
>>>> >> >> > Best regards / Met vriendelijke groeten,
>>>> >> >> >
>>>> >> >> > Niels Basjes
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> > --
>>>> >> > Best regards / Met vriendelijke groeten,
>>>> >> >
>>>> >> > Niels Basjes
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Best regards / Met vriendelijke groeten,
>>>> >
>>>> > Niels Basjes
>>>
>>>
>>>
>>>
>>> --
>>> Best regards / Met vriendelijke groeten,
>>>
>>> Niels Basjes

Re: Flink job on secure Yarn fails after many hours

Posted by Maximilian Michels <mx...@apache.org>.
I forgot you're using Flink 0.10.1. The above was for the master.

So here's the commit for Flink 0.10.1:
https://github.com/mxm/flink/commit/a41f3866f4097586a7b2262093088861b62930cd

git fetch https://github.com/mxm/flink/ \
a41f3866f4097586a7b2262093088861b62930cd && git checkout FETCH_HEAD

https://github.com/mxm/flink/archive/a41f3866f4097586a7b2262093088861b62930cd.zip

Thanks,
Max

On Wed, Dec 2, 2015 at 6:39 PM, Maximilian Michels <mx...@apache.org> wrote:
> Great. Here is the commit to try out:
> https://github.com/mxm/flink/commit/f49b9635bec703541f19cb8c615f302a07ea88b3
>
> If you already have the Flink repository, check it out using
>
> git fetch https://github.com/mxm/flink/
> f49b9635bec703541f19cb8c615f302a07ea88b3 && git checkout FETCH_HEAD
>
> Alternatively, here's a direct download link to the sources with the
> fix included:
> https://github.com/mxm/flink/archive/f49b9635bec703541f19cb8c615f302a07ea88b3.zip
>
> Thanks a lot,
> Max
>
> On Wed, Dec 2, 2015 at 5:44 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>> Sure, just give me the git repo url to build and I'll give it a try.
>>
>> Niels
>>
>> On Wed, Dec 2, 2015 at 4:28 PM, Maximilian Michels <mx...@apache.org> wrote:
>>>
>>> I mentioned that the exception gets thrown when requesting container
>>> status information. We need this to send a heartbeat to YARN but it is
>>> not very crucial if this fails once for the running job. Possibly, we
>>> could work around this problem by retrying N times in case of an
>>> exception.
>>>
>>> Would it be possible for you to deploy a custom Flink 0.10.1 version
>>> we provide and test again?
>>>
>>> On Wed, Dec 2, 2015 at 4:03 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>>> > No, I was just asking.
>>> > No upgrade is possible for the next month or two.
>>> >
>>> > This week is our busiest of the year ...
>>> > Our shop is doing about 10 orders per second these days ...
>>> >
>>> > So they won't upgrade until next January/February
>>> >
>>> > Niels
>>> >
>>> > On Wed, Dec 2, 2015 at 3:59 PM, Maximilian Michels <mx...@apache.org>
>>> > wrote:
>>> >>
>>> >> Hi Niels,
>>> >>
>>> >> You mentioned you have the option to update Hadoop and redeploy the
>>> >> job. Would be great if you could do that and let us know how it turns
>>> >> out.
>>> >>
>>> >> Cheers,
>>> >> Max
>>> >>
>>> >> On Wed, Dec 2, 2015 at 3:45 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>>> >> > Hi,
>>> >> >
>>> >> > I posted the entire log from the first log line at the moment of
>>> >> > failure
>>> >> > to
>>> >> > the very end of the logfile.
>>> >> > This is all I have.
>>> >> >
>>> >> > As far as I understand the Kerberos and Keytab mechanism in Hadoop
>>> >> > Yarn
>>> >> > is
>>> >> > that it catches the "Invalid Token" and then (if keytab) gets a new
>>> >> > Kerberos
>>> >> > ticket (or tgt?).
>>> >> > When the new ticket has been obtained it retries the call that
>>> >> > previously
>>> >> > failed.
>>> >> > To me it seemed that this call can fail over the invalid Token yet it
>>> >> > cannot
>>> >> > be retried.
>>> >> >
>>> >> > At this moment I'm thinking a bug in Hadoop.
>>> >> >
>>> >> > Niels
>>> >> >
>>> >> > On Wed, Dec 2, 2015 at 2:51 PM, Maximilian Michels <mx...@apache.org>
>>> >> > wrote:
>>> >> >>
>>> >> >> Hi Niels,
>>> >> >>
>>> >> >> Sorry for hear you experienced this exception. From a first glance,
>>> >> >> it
>>> >> >> looks like a bug in Hadoop to me.
>>> >> >>
>>> >> >> > "Not retrying because the invoked method is not idempotent, and
>>> >> >> > unable
>>> >> >> > to determine whether it was invoked"
>>> >> >>
>>> >> >> That is nothing to worry about. This is Hadoop's internal retry
>>> >> >> mechanism that re-attempts to do actions which previously failed if
>>> >> >> that's possible. Since the action is not idempotent (it cannot be
>>> >> >> executed again without risking to change the state of the execution)
>>> >> >> and it also doesn't track its execution states, it won't be retried
>>> >> >> again.
>>> >> >>
>>> >> >> The main issue is this exception:
>>> >> >>
>>> >> >> >org.apache.hadoop.security.token.SecretManager$InvalidToken:
>>> >> >> > Invalid
>>> >> >> > AMRMToken from >appattempt_1443166961758_163901_000001
>>> >> >>
>>> >> >> From the stack trace it is clear that this exception occurs upon
>>> >> >> requesting container status information from the Resource Manager:
>>> >> >>
>>> >> >> >at
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
>>> >> >>
>>> >> >> Are there any more exceptions in the log? Do you have the complete
>>> >> >> logs available and could you share them?
>>> >> >>
>>> >> >>
>>> >> >> Best regards,
>>> >> >> Max
>>> >> >>
>>> >> >> On Wed, Dec 2, 2015 at 11:47 AM, Niels Basjes <Ni...@basjes.nl>
>>> >> >> wrote:
>>> >> >> > Hi,
>>> >> >> >
>>> >> >> >
>>> >> >> > We have a Kerberos secured Yarn cluster here and I'm experimenting
>>> >> >> > with
>>> >> >> > Apache Flink on top of that.
>>> >> >> >
>>> >> >> > A few days ago I started a very simple Flink application (just
>>> >> >> > stream
>>> >> >> > the
>>> >> >> > time as a String into HBase 10 times per second).
>>> >> >> >
>>> >> >> > I (deliberately) asked our IT-ops guys to make my account have a
>>> >> >> > max
>>> >> >> > ticket
>>> >> >> > time of 5 minutes and a max renew time of 10 minutes (yes,
>>> >> >> > ridiculously
>>> >> >> > low
>>> >> >> > timeout values because I needed to validate this
>>> >> >> > https://issues.apache.org/jira/browse/FLINK-2977).
>>> >> >> >
>>> >> >> > This job is started with a keytab file and after running for 31
>>> >> >> > hours
>>> >> >> > it
>>> >> >> > suddenly failed with the exception you see below.
>>> >> >> >
>>> >> >> > I had the same job running for almost 400 hours until that failed
>>> >> >> > too
>>> >> >> > (I
>>> >> >> > was
>>> >> >> > too late to check the logfiles but I suspect the same problem).
>>> >> >> >
>>> >> >> >
>>> >> >> > So in that time span my tickets have expired and new tickets have
>>> >> >> > been
>>> >> >> > obtained several hundred times.
>>> >> >> >
>>> >> >> >
>>> >> >> > The main error I see is that in the process of a ticket expiring
>>> >> >> > and
>>> >> >> > being
>>> >> >> > renewed I see this message:
>>> >> >> >
>>> >> >> >      Not retrying because the invoked method is not idempotent,
>>> >> >> > and
>>> >> >> > unable
>>> >> >> > to determine whether it was invoked
>>> >> >> >
>>> >> >> >
>>> >> >> > Yarn on the cluster is 2.6.0 ( HDP 2.6.0.2.2.4.2-2 )
>>> >> >> >
>>> >> >> > Flink is version 0.10.1
>>> >> >> >
>>> >> >> >
>>> >> >> > How do I fix this?
>>> >> >> > Is this a bug (in either Hadoop or Flink) or am I doing something
>>> >> >> > wrong?
>>> >> >> > Would upgrading Yarn to 2.7.1  (i.e. HDP 2.3) fix this?
>>> >> >> >
>>> >> >> >
>>> >> >> > Niels Basjes
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > 21:30:27,821 WARN  org.apache.hadoop.security.UserGroupInformation
>>> >> >> > - PriviledgedActionException as:nbasjes (auth:SIMPLE)
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
>>> >> >> > 21:30:27,861 WARN  org.apache.hadoop.ipc.Client
>>> >> >> > - Exception encountered while connecting to the server :
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
>>> >> >> > 21:30:27,861 WARN  org.apache.hadoop.security.UserGroupInformation
>>> >> >> > - PriviledgedActionException as:nbasjes (auth:SIMPLE)
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
>>> >> >> > 21:30:27,891 WARN
>>> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler
>>> >> >> > - Exception while invoking class
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate.
>>> >> >> > Not retrying because the invoked method is not idempotent, and
>>> >> >> > unable
>>> >> >> > to
>>> >> >> > determine whether it was invoked
>>> >> >> > org.apache.hadoop.security.token.SecretManager$InvalidToken:
>>> >> >> > Invalid
>>> >> >> > AMRMToken from appattempt_1443166961758_163901_000001
>>> >> >> >       at
>>> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>> >> >> > Method)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>> >> >> >       at
>>> >> >> > java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
>>> >> >> >       at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown
>>> >> >> > Source)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> >> >> >       at java.lang.reflect.Method.invoke(Method.java:606)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>>> >> >> >       at com.sun.proxy.$Proxy14.allocate(Unknown Source)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
>>> >> >> >       at
>>> >> >> > scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>> >> >> >       at
>>> >> >> > scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>> >> >> >       at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
>>> >> >> >       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>> >> >> >       at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>>> >> >> >       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>>> >> >> >       at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>>> >> >> >       at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>>> >> >> >       at
>>> >> >> >
>>> >> >> > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>> >> >> > Caused by:
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
>>> >> >> >       at org.apache.hadoop.ipc.Client.call(Client.java:1406)
>>> >> >> >       at org.apache.hadoop.ipc.Client.call(Client.java:1359)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>>> >> >> >       at com.sun.proxy.$Proxy13.allocate(Unknown Source)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
>>> >> >> >       ... 29 more
>>> >> >> > 21:30:27,943 ERROR akka.actor.OneForOneStrategy
>>> >> >> > - Invalid AMRMToken from appattempt_1443166961758_163901_000001
>>> >> >> > org.apache.hadoop.security.token.SecretManager$InvalidToken:
>>> >> >> > Invalid
>>> >> >> > AMRMToken from appattempt_1443166961758_163901_000001
>>> >> >> >       at
>>> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>> >> >> > Method)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>> >> >> >       at
>>> >> >> > java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
>>> >> >> >       at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown
>>> >> >> > Source)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> >> >> >       at java.lang.reflect.Method.invoke(Method.java:606)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>>> >> >> >       at com.sun.proxy.$Proxy14.allocate(Unknown Source)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
>>> >> >> >       at
>>> >> >> > scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>>> >> >> >       at
>>> >> >> > scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>>> >> >> >       at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
>>> >> >> >       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>> >> >> >       at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>>> >> >> >       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>>> >> >> >       at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>>> >> >> >       at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>>> >> >> >       at
>>> >> >> >
>>> >> >> > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>> >> >> > Caused by:
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
>>> >> >> >       at org.apache.hadoop.ipc.Client.call(Client.java:1406)
>>> >> >> >       at org.apache.hadoop.ipc.Client.call(Client.java:1359)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>>> >> >> >       at com.sun.proxy.$Proxy13.allocate(Unknown Source)
>>> >> >> >       at
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
>>> >> >> >       ... 29 more
>>> >> >> > 21:30:28,075 INFO  org.apache.flink.yarn.YarnJobManager
>>> >> >> > - Stopping JobManager
>>> >> >> > akka.tcp://flink@10.10.200.3:39527/user/jobmanager.
>>> >> >> > 21:30:28,088 INFO
>>> >> >> > org.apache.flink.runtime.executiongraph.ExecutionGraph
>>> >> >> > - Source: Custom Source -> Sink: Unnamed (1/1)
>>> >> >> > (db0d95c11c14505827e696eec7efab94) switched from RUNNING to
>>> >> >> > CANCELING
>>> >> >> > 21:30:28,113 INFO
>>> >> >> > org.apache.flink.runtime.executiongraph.ExecutionGraph
>>> >> >> > - Source: Custom Source -> Sink: Unnamed (1/1)
>>> >> >> > (db0d95c11c14505827e696eec7efab94) switched from CANCELING to
>>> >> >> > FAILED
>>> >> >> > 21:30:28,184 INFO  org.apache.flink.runtime.blob.BlobServer
>>> >> >> > - Stopped BLOB server at 0.0.0.0:41281
>>> >> >> > 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager
>>> >> >> > - Actor akka://flink/user/jobmanager#403236912 terminated,
>>> >> >> > stopping
>>> >> >> > process...
>>> >> >> > 21:30:28,286 INFO
>>> >> >> > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>>> >> >> > - Removing web root dir
>>> >> >> > /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd
>>> >> >> >
>>> >> >> >
>>> >> >> > --
>>> >> >> > Best regards / Met vriendelijke groeten,
>>> >> >> >
>>> >> >> > Niels Basjes
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> > --
>>> >> > Best regards / Met vriendelijke groeten,
>>> >> >
>>> >> > Niels Basjes
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Best regards / Met vriendelijke groeten,
>>> >
>>> > Niels Basjes
>>
>>
>>
>>
>> --
>> Best regards / Met vriendelijke groeten,
>>
>> Niels Basjes

Re: Flink job on secure Yarn fails after many hours

Posted by Maximilian Michels <mx...@apache.org>.
Great. Here is the commit to try out:
https://github.com/mxm/flink/commit/f49b9635bec703541f19cb8c615f302a07ea88b3

If you already have the Flink repository, check it out using

git fetch https://github.com/mxm/flink/ \
f49b9635bec703541f19cb8c615f302a07ea88b3 && git checkout FETCH_HEAD

Alternatively, here's a direct download link to the sources with the
fix included:
https://github.com/mxm/flink/archive/f49b9635bec703541f19cb8c615f302a07ea88b3.zip

Thanks a lot,
Max

Re: Flink job on secure Yarn fails after many hours

Posted by Niels Basjes <Ni...@basjes.nl>.
Sure, just give me the git repo url to build and I'll give it a try.

Niels

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: Flink job on secure Yarn fails after many hours

Posted by Maximilian Michels <mx...@apache.org>.
I mentioned that the exception gets thrown when requesting container
status information. We need this call to send a heartbeat to YARN, but it
is not critical for the running job if it fails once. Possibly, we could
work around this problem by retrying N times in case of an exception.
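
Just to illustrate what I mean, a minimal sketch of such a retry wrapper
could look like the following. This is purely hypothetical: the class name,
attempt count and backoff are made up, and it only assumes Hadoop's public
AMRMClient API, not the actual YarnJobManager code path.

    import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    final class HeartbeatWithRetrySketch {
        // Hypothetical helper: retry the AM->RM heartbeat a few times before
        // letting the exception bubble up and take down the JobManager actor.
        static AllocateResponse heartbeat(AMRMClient<ContainerRequest> rmClient,
                                          float progress) throws Exception {
            Exception last = null;
            for (int attempt = 1; attempt <= 3; attempt++) {
                try {
                    // allocate() doubles as the ApplicationMaster heartbeat to the RM
                    return rmClient.allocate(progress);
                } catch (Exception e) {
                    last = e;
                    Thread.sleep(1000L * attempt); // simple linear backoff
                }
            }
            throw last;
        }
    }

Whether a plain retry would actually help in your case is of course an open
question, since the AMRMToken itself appears to be rejected.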

Would it be possible for you to deploy a custom Flink 0.10.1 version
we provide and test again?

On Wed, Dec 2, 2015 at 4:03 PM, Niels Basjes <Ni...@basjes.nl> wrote:
> No, I was just asking.
> No upgrade is possible for the next month or two.
>
> This week is our busiest of the year ...
> Our shop is doing about 10 orders per second these days ...
>
> So they won't upgrade until next January/February
>
> Niels
>
> On Wed, Dec 2, 2015 at 3:59 PM, Maximilian Michels <mx...@apache.org> wrote:
>>
>> Hi Niels,
>>
>> You mentioned you have the option to update Hadoop and redeploy the
>> job. Would be great if you could do that and let us know how it turns
>> out.
>>
>> Cheers,
>> Max
>>
>> On Wed, Dec 2, 2015 at 3:45 PM, Niels Basjes <Ni...@basjes.nl> wrote:
>> > Hi,
>> >
>> > I posted the entire log from the first log line at the moment of failure
>> > to the very end of the logfile.
>> > This is all I have.
>> >
>> > As far as I understand it, the Kerberos and keytab mechanism in Hadoop
>> > Yarn catches the "Invalid Token" and then (if a keytab is available)
>> > obtains a new Kerberos ticket (or TGT?).
>> > When the new ticket has been obtained it retries the call that
>> > previously failed.
>> > To me it seemed that this call can fail due to the invalid token, yet it
>> > cannot be retried.
>> >
>> > At this moment I'm thinking this is a bug in Hadoop.
>> >
>> > Niels
>> >
>> > On Wed, Dec 2, 2015 at 2:51 PM, Maximilian Michels <mx...@apache.org>
>> > wrote:
>> >>
>> >> Hi Niels,
>> >>
>> >> Sorry to hear you experienced this exception. From a first glance, it
>> >> looks like a bug in Hadoop to me.
>> >>
>> >> > "Not retrying because the invoked method is not idempotent, and
>> >> > unable
>> >> > to determine whether it was invoked"
>> >>
>> >> That is nothing to worry about. This is Hadoop's internal retry
>> >> mechanism that re-attempts to do actions which previously failed if
>> >> that's possible. Since the action is not idempotent (it cannot be
>> >> executed again without risking to change the state of the execution)
>> >> and it also doesn't track its execution states, it won't be retried
>> >> again.
>> >>
>> >> The main issue is this exception:
>> >>
>> >> >org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid
>> >> > AMRMToken from >appattempt_1443166961758_163901_000001
>> >>
>> >> From the stack trace it is clear that this exception occurs upon
>> >> requesting container status information from the Resource Manager:
>> >>
>> >> >at
>> >> >
>> >> > org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
>> >>
>> >> Are there any more exceptions in the log? Do you have the complete
>> >> logs available and could you share them?
>> >>
>> >>
>> >> Best regards,
>> >> Max
>> >>
>> >
>> >
>> >
>> >
>> > --
>> > Best regards / Met vriendelijke groeten,
>> >
>> > Niels Basjes
>
>
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes

Re: Flink job on secure Yarn fails after many hours

Posted by Niels Basjes <Ni...@basjes.nl>.
No, I was just asking.
No upgrade is possible for the next month or two.

This week is our busiest of the year ...
Our shop is doing about 10 orders per second these days ...

So they won't upgrade until next January/February

Niels

On Wed, Dec 2, 2015 at 3:59 PM, Maximilian Michels <mx...@apache.org> wrote:

> Hi Niels,
>
> You mentioned you have the option to update Hadoop and redeploy the
> job. Would be great if you could do that and let us know how it turns
> out.
>
> Cheers,
> Max
>
> On Wed, Dec 2, 2015 at 3:45 PM, Niels Basjes <Ni...@basjes.nl> wrote:
> > Hi,
> >
> > I posted the entire log from the first log line at the moment of failure
> > to the very end of the logfile.
> > This is all I have.
> >
> > As far as I understand it, the Kerberos and keytab mechanism in Hadoop
> > Yarn catches the "Invalid Token" and then (if a keytab is available)
> > obtains a new Kerberos ticket (or TGT?).
> > When the new ticket has been obtained it retries the call that previously
> > failed.
> > To me it seemed that this call can fail due to the invalid token, yet it
> > cannot be retried.
> >
> > At this moment I'm thinking this is a bug in Hadoop.
> >
> > Niels
> >
> > On Wed, Dec 2, 2015 at 2:51 PM, Maximilian Michels <mx...@apache.org>
> wrote:
> >>
> >> Hi Niels,
> >>
> >> Sorry to hear you experienced this exception. From a first glance, it
> >> looks like a bug in Hadoop to me.
> >>
> >> > "Not retrying because the invoked method is not idempotent, and unable
> >> > to determine whether it was invoked"
> >>
> >> That is nothing to worry about. This is Hadoop's internal retry
> >> mechanism that re-attempts to do actions which previously failed if
> >> that's possible. Since the action is not idempotent (it cannot be
> >> executed again without risking to change the state of the execution)
> >> and it also doesn't track its execution states, it won't be retried
> >> again.
> >>
> >> The main issue is this exception:
> >>
> >> >org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid
> >> > AMRMToken from >appattempt_1443166961758_163901_000001
> >>
> >> From the stack trace it is clear that this exception occurs upon
> >> requesting container status information from the Resource Manager:
> >>
> >> >at
> >> >
> org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
> >>
> >> Are there any more exceptions in the log? Do you have the complete
> >> logs available and could you share them?
> >>
> >>
> >> Best regards,
> >> Max
> >>
> >
> >
> >
> >
> > --
> > Best regards / Met vriendelijke groeten,
> >
> > Niels Basjes
>



-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: Flink job on secure Yarn fails after many hours

Posted by Maximilian Michels <mx...@apache.org>.
Hi Niels,

You mentioned you have the option to update Hadoop and redeploy the
job. Would be great if you could do that and let us know how it turns
out.

Cheers,
Max

On Wed, Dec 2, 2015 at 3:45 PM, Niels Basjes <Ni...@basjes.nl> wrote:
> Hi,
>
> I posted the entire log from the first log line at the moment of failure to
> the very end of the logfile.
> This is all I have.
>
> As far as I understand it, the Kerberos and keytab mechanism in Hadoop Yarn
> catches the "Invalid Token" and then (if a keytab is available) obtains a new
> Kerberos ticket (or TGT?).
> When the new ticket has been obtained it retries the call that previously
> failed.
> To me it seemed that this call can fail due to the invalid token, yet it
> cannot be retried.
>
> At this moment I'm thinking this is a bug in Hadoop.
>
> Niels
>
> On Wed, Dec 2, 2015 at 2:51 PM, Maximilian Michels <mx...@apache.org> wrote:
>>
>> Hi Niels,
>>
>> Sorry to hear you experienced this exception. From a first glance, it
>> looks like a bug in Hadoop to me.
>>
>> > "Not retrying because the invoked method is not idempotent, and unable
>> > to determine whether it was invoked"
>>
>> That is nothing to worry about. This is Hadoop's internal retry
>> mechanism that re-attempts to do actions which previously failed if
>> that's possible. Since the action is not idempotent (it cannot be
>> executed again without risking to change the state of the execution)
>> and it also doesn't track its execution states, it won't be retried
>> again.
>>
>> The main issue is this exception:
>>
>> >org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid
>> > AMRMToken from >appattempt_1443166961758_163901_000001
>>
>> From the stack trace it is clear that this exception occurs upon
>> requesting container status information from the Resource Manager:
>>
>> >at
>> > org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
>>
>> Are there any more exceptions in the log? Do you have the complete
>> logs available and could you share them?
>>
>>
>> Best regards,
>> Max
>>
>
>
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes

Re: Flink job on secure Yarn fails after many hours

Posted by Niels Basjes <Ni...@basjes.nl>.
Hi,

I posted the entire log from the first log line at the moment of failure to
the very end of the logfile.
This is all I have.

As far as I understand it, the Kerberos and keytab mechanism in Hadoop Yarn
catches the "Invalid Token" and then (if a keytab is available) obtains a new
Kerberos ticket (or TGT?).
When the new ticket has been obtained it retries the call that previously
failed.
To me it seemed that this call can fail due to the invalid token, yet it
cannot be retried.
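
To make explicit which relogin step I mean, here is a minimal sketch against
Hadoop's UserGroupInformation API. The wrapper class around it is made up
purely for illustration; this is not code from Flink or from our job.

    import java.io.IOException;

    import org.apache.hadoop.security.UserGroupInformation;

    final class KeytabReloginSketch {
        // Illustrative only: if this process was logged in from a keytab and
        // the Kerberos TGT is expired (or close to expiring), obtain a fresh
        // one; otherwise this call is a no-op. Hadoop's RPC client performs an
        // equivalent relogin internally before it retries a call that failed
        // authentication.
        static void ensureFreshTicket() throws IOException {
            UserGroupInformation ugi = UserGroupInformation.getLoginUser();
            ugi.checkTGTAndReloginFromKeytab();
        }
    }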

At this moment I'm thinking this is a bug in Hadoop.

Niels

On Wed, Dec 2, 2015 at 2:51 PM, Maximilian Michels <mx...@apache.org> wrote:

> Hi Niels,
>
> Sorry to hear you experienced this exception. From a first glance, it
> looks like a bug in Hadoop to me.
>
> > "Not retrying because the invoked method is not idempotent, and unable
> to determine whether it was invoked"
>
> That is nothing to worry about. This is Hadoop's internal retry
> mechanism that re-attempts to do actions which previously failed if
> that's possible. Since the action is not idempotent (it cannot be
> executed again without risking to change the state of the execution)
> and it also doesn't track its execution states, it won't be retried
> again.
>
> The main issue is this exception:
>
> >org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid
> AMRMToken from >appattempt_1443166961758_163901_000001
>
> From the stack trace it is clear that this exception occurs upon
> requesting container status information from the Resource Manager:
>
> >at
> org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
>
> Are there any more exceptions in the log? Do you have the complete
> logs available and could you share them?
>
>
> Best regards,
> Max
>
> On Wed, Dec 2, 2015 at 11:47 AM, Niels Basjes <Ni...@basjes.nl> wrote:
> > Hi,
> >
> >
> > We have a Kerberos secured Yarn cluster here and I'm experimenting with
> > Apache Flink on top of that.
> >
> > A few days ago I started a very simple Flink application (just stream the
> > time as a String into HBase 10 times per second).
> >
> > I (deliberately) asked our IT-ops guys to make my account have a max
> ticket
> > time of 5 minutes and a max renew time of 10 minutes (yes, ridiculously
> low
> > timeout values because I needed to validate this
> > https://issues.apache.org/jira/browse/FLINK-2977).
> >
> > This job is started with a keytab file and after running for 31 hours it
> > suddenly failed with the exception you see below.
> >
> > I had the same job running for almost 400 hours until that failed too (I
> was
> > too late to check the logfiles but I suspect the same problem).
> >
> >
> > So in that time span my tickets have expired and new tickets have been
> > obtained several hundred times.
> >
> >
> > The main error I see is that in the process of a ticket expiring and
> being
> > renewed I see this message:
> >
> >      Not retrying because the invoked method is not idempotent, and
> unable
> > to determine whether it was invoked
> >
> >
> > Yarn on the cluster is 2.6.0 ( HDP 2.6.0.2.2.4.2-2 )
> >
> > Flink is version 0.10.1
> >
> >
> > How do I fix this?
> > Is this a bug (in either Hadoop or Flink) or am I doing something wrong?
> > Would upgrading Yarn to 2.7.1  (i.e. HDP 2.3) fix this?
> >
> >
> > Niels Basjes
> >
> >
> >
> > --
> > Best regards / Met vriendelijke groeten,
> >
> > Niels Basjes
>



-- 
Best regards / Met vriendelijke groeten,

Niels Basjes

Re: Flink job on secure Yarn fails after many hours

Posted by Maximilian Michels <mx...@apache.org>.
Hi Niels,

Sorry to hear you experienced this exception. At first glance, it
looks like a bug in Hadoop to me.

> "Not retrying because the invoked method is not idempotent, and unable to determine whether it was invoked"

That message itself is nothing to worry about. It comes from Hadoop's
internal retry mechanism, which re-attempts actions that previously
failed whenever that is safe. Since this particular action is not
idempotent (executing it a second time could change the state of the
execution) and Hadoop cannot tell whether the first invocation already
reached the server, it is not retried.
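
For illustration only, here is a minimal sketch of the decision that
retry layer makes. This is not Hadoop's actual code; the class and
method names are made up.

    import java.util.concurrent.Callable;

    public class RetrySketch {

        // Retry a call only when it is safe to run it again: either the call
        // is idempotent, or we know for certain the first attempt never
        // reached the server. After a generic RPC failure we usually cannot
        // tell, which is what the "unable to determine whether it was
        // invoked" part of the log message refers to.
        static <T> T callWithRetry(Callable<T> call, boolean idempotent, int maxRetries)
                throws Exception {
            int attempts = 0;
            while (true) {
                try {
                    return call.call();
                } catch (Exception e) {
                    boolean mayHaveExecuted = true;  // unknown after the failure
                    if (!idempotent && mayHaveExecuted) {
                        // "Not retrying because the invoked method is not idempotent ..."
                        throw e;
                    }
                    if (++attempts > maxRetries) {
                        throw e;
                    }
                }
            }
        }
    }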

The main issue is this exception:

> org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid AMRMToken from appattempt_1443166961758_163901_000001

From the stack trace it is clear that this exception occurs upon
requesting container status information from the Resource Manager:

> at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
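
To make concrete where this happens: the JobManager, acting as the YARN
ApplicationMaster, periodically calls AMRMClient.allocate() to heartbeat
to the ResourceManager, and that call is authenticated with the
AMRMToken. A minimal sketch of such a heartbeat loop (plain YARN client
API, not Flink's actual code, and only meaningful when run inside a YARN
application attempt) looks like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
    import org.apache.hadoop.yarn.client.api.AMRMClient;

    public class AmHeartbeatSketch {
        public static void main(String[] args) throws Exception {
            AMRMClient<AMRMClient.ContainerRequest> rmClient = AMRMClient.createAMRMClient();
            rmClient.init(new Configuration());
            rmClient.start();
            rmClient.registerApplicationMaster("localhost", 0, "");

            while (true) {
                // The ResourceManager rejects this call with
                // SecretManager$InvalidToken ("Invalid AMRMToken from appattempt_...")
                // when the token presented by the client no longer matches
                // the current token known to the RM.
                AllocateResponse response = rmClient.allocate(0.5f);
                Thread.sleep(1000);
            }
        }
    }

In Flink 0.10 the corresponding allocate() call is the one at
YarnJobManager.scala:259 shown in the trace above.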

Are there any more exceptions in the log? Do you have the complete
logs available and could you share them?


Best regards,
Max

On Wed, Dec 2, 2015 at 11:47 AM, Niels Basjes <Ni...@basjes.nl> wrote:
> Hi,
>
>
> We have a Kerberos secured Yarn cluster here and I'm experimenting with
> Apache Flink on top of that.
>
> A few days ago I started a very simple Flink application (just stream the
> time as a String into HBase 10 times per second).
>
> I (deliberately) asked our IT-ops guys to make my account have a max ticket
> time of 5 minutes and a max renew time of 10 minutes (yes, ridiculously low
> timeout values because I needed to validate this
> https://issues.apache.org/jira/browse/FLINK-2977).
>
> This job is started with a keytab file and after running for 31 hours it
> suddenly failed with the exception you see below.
>
> I had the same job running for almost 400 hours until that failed too (I was
> too late to check the logfiles but I suspect the same problem).
>
>
> So in that time span my tickets have expired and new tickets have been
> obtained several hundred times.
>
>
> The main error I see is that in the process of a ticket expiring and being
> renewed I see this message:
>
>      Not retrying because the invoked method is not idempotent, and unable
> to determine whether it was invoked
>
>
> Yarn on the cluster is 2.6.0 ( HDP 2.6.0.2.2.4.2-2 )
>
> Flink is version 0.10.1
>
>
> How do I fix this?
> Is this a bug (in either Hadoop or Flink) or am I doing something wrong?
> Would upgrading Yarn to 2.7.1  (i.e. HDP 2.3) fix this?
>
>
> Niels Basjes
>
>
>
> 21:30:27,821 WARN  org.apache.hadoop.security.UserGroupInformation
> - PriviledgedActionException as:nbasjes (auth:SIMPLE)
> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> Invalid AMRMToken from appattempt_1443166961758_163901_000001
> 21:30:27,861 WARN  org.apache.hadoop.ipc.Client
> - Exception encountered while connecting to the server :
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> Invalid AMRMToken from appattempt_1443166961758_163901_000001
> 21:30:27,861 WARN  org.apache.hadoop.security.UserGroupInformation
> - PriviledgedActionException as:nbasjes (auth:SIMPLE)
> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> Invalid AMRMToken from appattempt_1443166961758_163901_000001
> 21:30:27,891 WARN  org.apache.hadoop.io.retry.RetryInvocationHandler
> - Exception while invoking class
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate.
> Not retrying because the invoked method is not idempotent, and unable to
> determine whether it was invoked
> org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid
> AMRMToken from appattempt_1443166961758_163901_000001
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 	at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> 	at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> 	at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> 	at
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
> 	at
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
> 	at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> 	at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
> 	at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> 	at com.sun.proxy.$Proxy14.allocate(Unknown Source)
> 	at
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245)
> 	at
> org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
> 	at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> 	at
> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
> 	at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
> 	at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
> 	at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
> 	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
> 	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
> 	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> 	at
> org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
> 	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> 	at
> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
> 	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> 	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> 	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
> 	at akka.dispatch.Mailbox.run(Mailbox.scala:221)
> 	at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 	at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
> 	at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 	at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> Invalid AMRMToken from appattempt_1443166961758_163901_000001
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1406)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1359)
> 	at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> 	at com.sun.proxy.$Proxy13.allocate(Unknown Source)
> 	at
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> 	... 29 more
> 21:30:27,943 ERROR akka.actor.OneForOneStrategy
> - Invalid AMRMToken from appattempt_1443166961758_163901_000001
> org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid
> AMRMToken from appattempt_1443166961758_163901_000001
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 	at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> 	at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> 	at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> 	at
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
> 	at
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
> 	at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> 	at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
> 	at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> 	at com.sun.proxy.$Proxy14.allocate(Unknown Source)
> 	at
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245)
> 	at
> org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
> 	at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> 	at
> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
> 	at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
> 	at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
> 	at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
> 	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
> 	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
> 	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> 	at
> org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
> 	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> 	at
> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
> 	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> 	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> 	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
> 	at akka.dispatch.Mailbox.run(Mailbox.scala:221)
> 	at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 	at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
> 	at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 	at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> Invalid AMRMToken from appattempt_1443166961758_163901_000001
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1406)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1359)
> 	at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> 	at com.sun.proxy.$Proxy13.allocate(Unknown Source)
> 	at
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
> 	... 29 more
> 21:30:28,075 INFO  org.apache.flink.yarn.YarnJobManager
> - Stopping JobManager akka.tcp://flink@10.10.200.3:39527/user/jobmanager.
> 21:30:28,088 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
> - Source: Custom Source -> Sink: Unnamed (1/1)
> (db0d95c11c14505827e696eec7efab94) switched from RUNNING to CANCELING
> 21:30:28,113 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
> - Source: Custom Source -> Sink: Unnamed (1/1)
> (db0d95c11c14505827e696eec7efab94) switched from CANCELING to FAILED
> 21:30:28,184 INFO  org.apache.flink.runtime.blob.BlobServer
> - Stopped BLOB server at 0.0.0.0:41281
> 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager
> - Actor akka://flink/user/jobmanager#403236912 terminated, stopping
> process...
> 21:30:28,286 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
> - Removing web root dir /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes