Posted to issues@flink.apache.org by "Stephan Ewen (JIRA)" <ji...@apache.org> on 2018/03/02 16:35:00 UTC

[jira] [Commented] (FLINK-8829) Flink in EMR(YARN) is down due to Akka communication issue

    [ https://issues.apache.org/jira/browse/FLINK-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383785#comment-16383785 ] 

Stephan Ewen commented on FLINK-8829:
-------------------------------------

One cause of this that we have seen in the past is a conflict between Akka's Netty and the Netty instances pulled in by Hadoop (through EMR). In some cases, that resulted in connections dying even though network connectivity was there.

In Flink 1.4.x, we shade and relocate Akka's Netty to ensure such conflicts don't happen any more.

You could try to do the following:
  - Upgrade to Flink's 1.4.x line.
  - Try to remove the Netty being pulled in via Hadoop. That is not super easy: you would need to use a Flink version built against the same Hadoop version that EMR runs (Flink should exclude or shade Hadoop's Netty) and prevent the Hadoop classpath from being added to the Flink classpath.
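
To check whether unshaded Netty classes are provided by more than one jar, a small script along these lines could help. This is only a sketch: the function names are made up for illustration, and the directories to scan (for example Flink's lib folder and wherever the Hadoop jars live on the EMR node) are assumptions you would need to adapt. It only looks at the unrelocated package names org.jboss.netty (Netty 3.x) and io.netty (Netty 4.x), so shaded copies under another package prefix are ignored.

```python
import zipfile
from collections import defaultdict
from pathlib import Path

# Unrelocated package prefixes for Netty 3.x (org.jboss.netty) and Netty 4.x (io.netty).
NETTY_PREFIXES = ("org/jboss/netty/", "io/netty/")


def find_netty_jars(directories):
    """Map each unshaded Netty package prefix to the jars containing its classes."""
    found = defaultdict(list)
    for directory in directories:
        for jar in sorted(Path(directory).rglob("*.jar")):
            try:
                with zipfile.ZipFile(jar) as archive:
                    names = archive.namelist()
            except zipfile.BadZipFile:
                continue  # skip files that are not valid archives
            for prefix in NETTY_PREFIXES:
                # Relocated (shaded) classes start with a different prefix
                # and therefore do not match here.
                if any(n.startswith(prefix) and n.endswith(".class") for n in names):
                    found[prefix].append(str(jar))
    return dict(found)


def conflicting_prefixes(found):
    """Keep only the prefixes that are provided by more than one jar."""
    return {prefix: jars for prefix, jars in found.items() if len(jars) > 1}


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pass the directories to scan as arguments.
    conflicts = conflicting_prefixes(find_netty_jars(sys.argv[1:]))
    for prefix, jars in conflicts.items():
        print(prefix)
        for jar in jars:
            print("   ", jar)
```

If the same prefix shows up in both a Flink jar and a Hadoop jar, that is a hint (not proof) that the two Netty copies may be clashing on the classpath.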

If you go with option one: 1.4.2 is coming out in a few days. 1.4.1 is fine, except for a classloading bug when using Kafka with a custom watermark generator.

> Flink in EMR(YARN) is down due to Akka communication issue
> ----------------------------------------------------------
>
>                 Key: FLINK-8829
>                 URL: https://issues.apache.org/jira/browse/FLINK-8829
>             Project: Flink
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.3.2
>            Reporter: Aleksandr Filichkin
>            Priority: Major
>
> Hi,
> We have been running a Flink 1.3.2 app in Amazon EMR with YARN. Every week our Flink job goes down due to:
> 2018-02-16 19:00:04,595 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@ip-10-97-34-209.tr-fr-nonprod.aws-int.thomsonreuters.com:42177] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@ip-10-97-34-209.tr-fr-nonprod.aws-int.thomsonreuters.com:42177]] Caused by: [Connection refused: ip-10-97-34-209.tr-fr-nonprod.aws-int.thomsonreuters.com/10.97.34.209:42177]
> 2018-02-16 19:00:05,593 WARN akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://flink@ip-10-97-34-209.tr-fr-nonprod.aws-int.thomsonreuters.com:42177]
> 2018-02-16 19:00:05,596 INFO org.apache.flink.runtime.client.JobSubmissionClientActor - Lost connection to JobManager akka.tcp://flink@ip-10-97-34-209.tr-fr-nonprod.aws-int.thomsonreuters.com:42177/user/jobmanager. Triggering connection timeout.
> Do you have any ideas how to troubleshoot it?
>  


