Posted to user@flink.apache.org by Greg Finch <fi...@gmail.com> on 2018/08/31 18:01:06 UTC

akka.ask.timeout setting not honored

I'm having a problem with an Akka timeout when starting my cluster.  The error
is "Ask timed out after 10000 ms.".  I have changed the akka.ask.timeout
config setting to 300000 ms, but it still times out and fails after 10
seconds.  I confirmed that the config is properly set both by checking the
Job Manager configuration tab (it shows 300000 ms) and by logging the
output of AkkaUtils.getTimeout(configuration), which also shows 300000 ms.
It seems something is not honoring that configuration value.

I did find a different thread that discussed the fact that the
LocalStreamEnvironment will not honor this setting, but that is not my
case.  I am running on a cluster (AWS EMR) using the regular
StreamExecutionEnvironment.  This is Flink 1.5.2.

Any ideas?

~~~~~

2018-08-31 17:37:55 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl  - Received new token for : ip-10-213-139-66.ec2.internal:8041
2018-08-31 17:37:55 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl  - Received new token for : ip-10-213-136-25.ec2.internal:8041
2018-08-31 17:38:34 ERROR o.a.flink.runtime.rest.handler.job.JobExecutionResultHandler  - Implementation error: Unhandled exception.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-219618710]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
	at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
	at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
	at java.lang.Thread.run(Thread.java:748)
2018-08-31 17:38:41 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl  - Waiting for application to be successfully unregistered.
2018-08-31 17:38:41 INFO  o.a.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl  - Interrupted while waiting for queue
java.lang.InterruptedException: null
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
	at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:323)
2018-08-31 17:38:42 WARN  akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-81 - Association with remote system [akka.tcp://flink@ip-10-213-142-102.ec2.internal:42027] has failed, address is now gated for [50] ms. Reason: [Disassociated]

Re: akka.ask.timeout setting not honored

Posted by Greg Finch <fi...@gmail.com>.

Hi Gary,

Turns out, the configuration warning you mentioned was the key.  The
akka.ask.timeout setting requires a duration unit, but the web.timeout
setting expects a plain long.  So the change I made earlier would not have
applied, since `300s` could not be read as a long.  Since changing it to
`web.timeout: 300000`, I have not been able to reproduce the error;
everything starts successfully every time.  I do have debug logging turned
on for now.  If it happens again in the next couple of days, I will send
details with debug logs.
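
To illustrate the mismatch described above, here is a minimal, self-contained
Java sketch.  It is not Flink's actual parsing code: the class and method names
(`TimeoutParsingSketch`, `parseLongOrDefault`, `parseDurationMillis`) and the
accepted units are made up for illustration, and the fallback to a 10000 ms
default is an assumption that happens to match the 10000 ms in the stack trace.
It shows how a value like `300s` can be fine for a duration-style option yet be
rejected by an option that is read as a plain long.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; this is NOT Flink's configuration code.
final class TimeoutParsingSketch {

    // Assumed default, chosen to match the 10000 ms reported in the error.
    static final long DEFAULT_WEB_TIMEOUT_MS = 10_000L;

    /** Reads a value as a plain long (ms); returns the default if it cannot be parsed. */
    static long parseLongOrDefault(String raw, long defaultValue) {
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            // A value such as "300s" ends up here and is silently replaced.
            return defaultValue;
        }
    }

    /** Reads a value such as "300 s" or "300000 ms" as a duration and returns ms. */
    static long parseDurationMillis(String raw) {
        String s = raw.trim().toLowerCase(Locale.ROOT);
        int i = 0;
        while (i < s.length() && Character.isDigit(s.charAt(i))) {
            i++;
        }
        if (i == 0) {
            throw new IllegalArgumentException("no number in: " + raw);
        }
        long amount = Long.parseLong(s.substring(0, i));
        String unit = s.substring(i).trim();
        switch (unit) {
            case "ms":  return amount;
            case "s":   return TimeUnit.SECONDS.toMillis(amount);
            case "min": return TimeUnit.MINUTES.toMillis(amount);
            default:
                throw new IllegalArgumentException("missing or unknown unit in: " + raw);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseDurationMillis("300 s"));                        // 300000
        System.out.println(parseLongOrDefault("300s", DEFAULT_WEB_TIMEOUT_MS));   // 10000
        System.out.println(parseLongOrDefault("300000", DEFAULT_WEB_TIMEOUT_MS)); // 300000
    }
}
```

In configuration terms, this corresponds to giving akka.ask.timeout a value
with a unit and giving web.timeout a bare number of milliseconds, as in
`web.timeout: 300000`.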

Thanks again for your help!
Greg

On Fri, Aug 31, 2018 at 3:21 PM Gary Yao <ga...@data-artisans.com> wrote:

> Hi Greg,
>
> Unfortunately the environment information [1] is not logged. Can you set the
> log level for all Flink packages to DEBUG?
>
> Do you install Flink yourself on EMR, or do you use the pre-installed one?
> Can you show us the command with which you start the cluster/submit the job?
>
> I do not know if it is related but I found these warnings in your second
> log file:
>
>     2018-08-31 19:14:32 WARN  org.apache.flink.configuration.Configuration  - Configuration cannot evaluate value 300s as a long integer number
>     2018-08-31 19:14:32 WARN  org.apache.flink.configuration.Configuration  - Configuration cannot evaluate value 300s as a long integer number
>
> Best,
> Gary
>
> [1]
> https://github.com/apache/flink/blob/9ae5009b6a82248bfae99dac088c1f6e285aa70f/flink-runtime/src/main/java/org/apache/flink/runtime/util/EnvironmentInformation.java#L281
>

Re: akka.ask.timeout setting not honored

Posted by Gary Yao <ga...@data-artisans.com>.

Hi Greg,

Unfortunately the environment information [1] is not logged. Can you set the
log level for all Flink packages to DEBUG?

Do you install Flink yourself on EMR, or do you use the pre-installed one?
Can you show us the command with which you start the cluster/submit the job?

I do not know if it is related but I found these warnings in your second
log file:

    2018-08-31 19:14:32 WARN  org.apache.flink.configuration.Configuration  - Configuration cannot evaluate value 300s as a long integer number
    2018-08-31 19:14:32 WARN  org.apache.flink.configuration.Configuration  - Configuration cannot evaluate value 300s as a long integer number
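
A warning like the one above is easy to miss in a long jobmanager log.  As a
rough sketch, the following stand-alone Java snippet pulls the offending value
and the expected type out of such lines.  The class name, method name, and
regular expression are assumptions based only on the two log lines quoted here;
other Flink versions may word the message differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the pattern is derived only from the warning quoted in this thread.
final class ConfigWarningScanSketch {

    // Captures the rejected value and the type the configuration tried to read it as.
    private static final Pattern UNPARSABLE = Pattern.compile(
            "Configuration cannot evaluate value (\\S+) as a (.+?)\\s*$");

    /** Returns one entry per log line that reports an unparsable configuration value. */
    static List<String> findUnparsableValues(List<String> logLines) {
        List<String> hits = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = UNPARSABLE.matcher(line);
            if (m.find()) {
                hits.add(m.group(1) + " (expected: " + m.group(2) + ")");
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "2018-08-31 19:14:32 WARN  org.apache.flink.configuration.Configuration"
                        + "  - Configuration cannot evaluate value 300s as a long integer number",
                "2018-08-31 17:37:55 INFO  some unrelated line");
        // Prints: 300s (expected: long integer number)
        for (String hit : findUnparsableValues(log)) {
            System.out.println(hit);
        }
    }
}
```

The warning itself does not name the configuration key, so each reported value
still has to be matched against the configuration file by hand.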

Best,
Gary

[1]
https://github.com/apache/flink/blob/9ae5009b6a82248bfae99dac088c1f6e285aa70f/flink-runtime/src/main/java/org/apache/flink/runtime/util/EnvironmentInformation.java#L281

On Fri, Aug 31, 2018 at 9:18 PM, Greg Finch <fi...@gmail.com> wrote:

> Well ... that didn't take long.  The next time I tried, I got the Akka
> timeout again.  Attached are the logs from the last attempt.  They're very
> similar to the other logs I sent.
>

Re: akka.ask.timeout setting not honored

Posted by Greg Finch <fi...@gmail.com>.

Well ... that didn't take long.  The next time I tried, I got the Akka
timeout again.  Attached are the logs from the last attempt.  They're very
similar to the other logs I sent.

On Fri, Aug 31, 2018 at 2:04 PM Greg Finch <fi...@gmail.com> wrote:

> Thanks Gary.  Attached is the jobmanager log.  You are correct that this
> is running on YARN.  I changed web.timeout as you suggested - that seems to
> be working the few times I tested it.  This problem comes and goes though -
> sometimes it starts before it times out.  I'll keep the web.timeout setting
> and reply again if the problem comes up again.  Thanks again for your quick
> response!
>

Re: akka.ask.timeout setting not honored

Posted by Greg Finch <fi...@gmail.com>.

Thanks Gary.  Attached is the jobmanager log.  You are correct that this is
running on YARN.  I changed web.timeout as you suggested - that seems to be
working the few times I tested it.  This problem comes and goes though -
sometimes it starts before it times out.  I'll keep the web.timeout setting
and reply again if the problem comes up again.  Thanks again for your quick
response!

On Fri, Aug 31, 2018 at 1:38 PM Gary Yao <ga...@data-artisans.com> wrote:

> Hi Greg,
>
> Can you describe the steps to reproduce the problem, or can you attach the
> full jobmanager logs? Because JobExecutionResultHandler appears in your log, I
> assume that you are starting a job cluster on YARN. Without seeing the
> complete logs, I cannot be sure what exactly happens. For now, you can try
> setting the config option web.timeout to a higher value.
>
> Best,
> Gary
>

Re: akka.ask.timeout setting not honored

Posted by Gary Yao <ga...@data-artisans.com>.

Hi Greg,

Can you describe the steps to reproduce the problem, or can you attach the
full jobmanager logs? Because JobExecutionResultHandler appears in your log, I
assume that you are starting a job cluster on YARN. Without seeing the
complete logs, I cannot be sure what exactly happens. For now, you can try
setting the config option web.timeout to a higher value.

Best,
Gary

On Fri, Aug 31, 2018 at 8:01 PM, Greg Finch <fi...@gmail.com> wrote:

> I'm having a problem with an Akka timeout when starting my cluster.  The
> error is "Ask timed out after 10000 ms.".  I have changed the
> akka.ask.timeout config setting to 300000 ms, but it still times out and
> fails after 10 seconds.  I confirmed that the config is properly set both
> by checking the Job Manager configuration tab (it shows 300000 ms) and by
> logging the output of AkkaUtils.getTimeout(configuration), which also
> shows 300000 ms.  It seems something is not honoring that configuration
> value.
>
> I did find a different thread that discussed the fact that the
> LocalStreamEnvironment will not honor this setting, but that is not my
> case.  I am running on a cluster (AWS EMR) using the regular
> StreamExecutionEnvironment.  This is Flink 1.5.2.
>
> Any ideas?
>
> ~~~~~
>
> 2018-08-31 17:37:55 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl  - Received new token for : ip-10-213-139-66.ec2.internal:8041
> 2018-08-31 17:37:55 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl  - Received new token for : ip-10-213-136-25.ec2.internal:8041
> 2018-08-31 17:38:34 ERROR o.a.flink.runtime.rest.handler.job.JobExecutionResultHandler  - Implementation error: Unhandled exception.
> akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-219618710]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
> 	at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
> 	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
> 	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 	at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
> 	at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
> 	at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
> 	at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
> 	at java.lang.Thread.run(Thread.java:748)
> 2018-08-31 17:38:41 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl  - Waiting for application to be successfully unregistered.
> 2018-08-31 17:38:41 INFO  o.a.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl  - Interrupted while waiting for queue
> java.lang.InterruptedException: null
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
> 	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 	at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:323)
> 2018-08-31 17:38:42 WARN  akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-81 - Association with remote system [akka.tcp://flink@ip-10-213-142-102.ec2.internal:42027] has failed, address is now gated for [50] ms. Reason: [Disassociated]
>
>
>