Posted to user@flink.apache.org by Jason Brelloch <jb...@gmail.com> on 2017/04/20 16:42:54 UTC

Fetching metrics failed.

Hey all,

So we are doing some experimenting around large keyed state in Flink 1.2 on
a single task manager and we keep having our task manager killed by the job
manager after about 10 minutes due to this exception:

Fetching metrics failed.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244/user/MetricQueryService_0f7bba0b16b18e83b69c4a50e657bb1f]] after [10000 ms]

The task manager logs show nothing out of the ordinary, but the job manager
logs show this:

2017-04-19 20:56:52,230 Association with remote system [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2017-04-19 20:56:53,986 Fetching metrics failed.
2017-04-19 20:57:43,584 Association with remote system [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244]] Caused by: [Connection refused: flink-s-load-uscen-a-c001-n011/10.34.48.40:37244]
2017-04-19 20:57:49,517 Detected unreachable: [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244]
2017-04-19 20:57:49,517 Task manager akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244/user/taskmanager terminated.

The weird part is, we have not set up any metrics reporters or anything so
I am not really sure why the Job Manager is asking the task manager about
them.  Is there a way to disable these metrics requests, or does anyone
know what is causing them?

Thanks,
-- 
*Jason Brelloch* | Product Developer

Re: Fetching metrics failed.

Posted by Chesnay Schepler <ch...@apache.org>.
Hello,

the MetricQueryService is used by the webUI to fetch metrics from the
JobManager and all TaskManagers. It is only used when the webUI is
accessed.

Based on the logs you gave, the TaskManager wasn't killed by the
JobManager; instead, the JobManager only detected that the TaskManager
had shut down.

It is highly unlikely that the MetricQueryService is the cause of this; 
the exception you are seeing is due to the TaskManager being no longer 
reachable. Can't fetch metrics when the TaskManager isn't there anymore.
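
To make the ordering of events easier to see, here is a minimal Python sketch that parses the job manager log lines quoted in the original message and compares the timestamps of the first "Disassociated" entry and the "Fetching metrics failed." entry. The helper names (`parse`, `first_time`) are made up for this illustration and are not part of Flink; the log text is copied from the post above.

```python
import re
from datetime import datetime

# Job manager log lines as quoted in the original message (wrapped lines rejoined).
LOG = """\
2017-04-19 20:56:52,230 Association with remote system [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2017-04-19 20:56:53,986 Fetching metrics failed.
2017-04-19 20:57:43,584 Association with remote system [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244]] Caused by: [Connection refused: flink-s-load-uscen-a-c001-n011/10.34.48.40:37244]
2017-04-19 20:57:49,517 Detected unreachable: [akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244]
2017-04-19 20:57:49,517 Task manager akka.tcp://flink@flink-s-load-uscen-a-c001-n011:37244/user/taskmanager terminated.
"""

# Timestamp at the start of the line, then the message.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (.*)$")


def parse(log):
    """Return a list of (timestamp, message) tuples for lines that start with a timestamp."""
    events = []
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
            events.append((ts, m.group(2)))
    return events


def first_time(events, needle):
    """Return the timestamp of the first message containing `needle`, or None."""
    for ts, msg in events:
        if needle in msg:
            return ts
    return None


events = parse(LOG)
disassociated = first_time(events, "Disassociated")
metrics_failed = first_time(events, "Fetching metrics failed")
gap = (metrics_failed - disassociated).total_seconds()

print(f"Disassociated at   {disassociated}")
print(f"Metrics failed at  {metrics_failed}")
print(f"Gap: {gap:.3f} s")
```

In these logs the "Disassociated" entry comes first and the metrics failure is logged afterwards, which fits the reading that the failed metrics fetch is a symptom of the lost connection rather than its cause.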

How do you manage the Flink cluster? (YARN, etc.) Given that no exception
appears in the log, I would assume that the TaskManager JVM was killed
from the outside.
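
One possible way a JVM gets killed from the outside on a Linux host is the kernel's out-of-memory killer. That is only an assumption here, since nothing in the logs in this thread confirms it. As a rough sketch, the following Python function scans kernel log text for "Killed process" entries naming a `java` process. The function name `find_oom_kills` and the sample log lines are hypothetical, and the exact kernel message wording can differ between kernel versions.

```python
import re

# Matches kernel log entries such as "Killed process 4242 (java) ...".
# Assumption: the kernel uses this wording; it may vary by kernel version.
KILLED_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log_text, process_name="java"):
    """Return the PIDs of processes named `process_name` reported as killed."""
    pids = []
    for line in kernel_log_text.splitlines():
        m = KILLED_RE.search(line)
        if m and m.group(2) == process_name:
            pids.append(int(m.group(1)))
    return pids


# Hypothetical sample lines for illustration only; not taken from this thread.
SAMPLE = """\
[12345.678901] Out of memory: Kill process 4242 (java) score 912 or sacrifice child
[12345.678950] Killed process 4242 (java) total-vm:8388608kB, anon-rss:4194304kB, file-rss:0kB
"""

print(find_oom_kills(SAMPLE))
```

On a real machine the input text would come from the kernel log (for example the output of `dmesg`). If nothing matches, other external causes, such as a resource manager terminating the container, would still need to be checked.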

Regards,
Chesnay
