Posted to issues@flink.apache.org by "Aljoscha Krettek (JIRA)" <ji...@apache.org> on 2017/08/02 09:08:00 UTC
[jira] [Updated] (FLINK-7340) Taskmanager hung after temporary DNS outage
[ https://issues.apache.org/jira/browse/FLINK-7340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aljoscha Krettek updated FLINK-7340:
------------------------------------
Component/s: Distributed Coordination
> Taskmanager hung after temporary DNS outage
> -------------------------------------------
>
> Key: FLINK-7340
> URL: https://issues.apache.org/jira/browse/FLINK-7340
> Project: Flink
> Issue Type: Bug
> Components: Core, Distributed Coordination
> Affects Versions: 1.3.1
> Environment: Non-HA Flink running in Kubernetes.
> Reporter: Joshua Griffith
>
> After a Kubernetes node failure, several TaskManagers and the DNS system were automatically restarted. One TaskManager was unable to connect to the JobManager and continually logged the following errors:
> {quote}
> 2017-08-01 18:58:06.707 [flink-akka.actor.default-dispatcher-823] INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager (attempt 595, timeout: 30000 milliseconds)
> 2017-08-01 18:58:06.713 [flink-akka.actor.default-dispatcher-834] INFO Remoting flink-akka.remote.default-remote-dispatcher-240 - Quarantined address [akka.tcp://flink@jobmanager:6123] is still unreachable or has not been restarted. Keeping it quarantined.
> {quote}
> After exec'ing into the container, I was able to run {{telnet jobmanager 6123}} successfully, and {{dig jobmanager}} showed the correct IP address in DNS. I suspect that the TaskManager cached a bad IP address for the JobManager while the DNS system was restarting, and then kept using that cached address instead of respecting the 30s TTL and resolving a fresh one for the next request. It may be a good idea for the TaskManager to explicitly perform a DNS lookup after JobManager connection failures.
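
The suggestion above (look the hostname up again after each failed connection) could be sketched roughly as follows. This is only an illustration and not Flink's or Akka's actual code: the class name ReResolvingConnector, the Resolver and Connector interfaces, and both method names are hypothetical. The two security properties are standard JVM settings for the InetAddress cache, and they generally need to be set before the JVM performs its first lookup. The sketch also does not deal with the Akka quarantine state shown in the log, which may need separate handling.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.security.Security;

// Hypothetical helper, not part of Flink or Akka.
final class ReResolvingConnector {

    /** Hypothetical lookup hook so the resolver can be swapped out, e.g. in tests. */
    public interface Resolver {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    /** Hypothetical stand-in for the real connection or registration attempt. */
    public interface Connector {
        boolean tryConnect(InetSocketAddress address);
    }

    /** Default resolver: delegates to the JVM, which applies its own cache policy. */
    public static final Resolver SYSTEM_DNS = InetAddress::getByName;

    /**
     * Caps the JVM-wide DNS cache. Call this early, before any lookups happen,
     * because the JVM typically reads these properties only once.
     */
    public static void capJvmDnsCache(int positiveTtlSeconds) {
        // Successful lookups are cached for at most this many seconds.
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(positiveTtlSeconds));
        // A value of 0 means failed lookups are not cached at all.
        Security.setProperty("networkaddress.cache.negative.ttl", "0");
    }

    /**
     * Resolves the host again before every attempt instead of reusing one
     * resolved address. Returns the address that worked, or null if all
     * attempts failed. Back-off between attempts is omitted for brevity.
     */
    public static InetSocketAddress connectWithFreshLookup(
            String host, int port, int maxAttempts, Resolver resolver, Connector connector) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                InetSocketAddress address = new InetSocketAddress(resolver.resolve(host), port);
                if (connector.tryConnect(address)) {
                    return address;
                }
            } catch (UnknownHostException e) {
                // DNS itself may be temporarily unavailable; try again on the next attempt.
            }
        }
        return null;
    }
}
```

With ReResolvingConnector.SYSTEM_DNS as the resolver, a host name such as "jobmanager" would be looked up again on every attempt, so a stale address would survive for at most the configured cache TTL rather than for the lifetime of the process.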
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)