Posted to user@flink.apache.org by Bruno Aranda <ba...@apache.org> on 2019/03/26 18:21:43 UTC

1.7.2 requires several attempts to start in AWS EMR's Yarn

Hi,

I wrote recently about our problems with 1.7.2, for which we still haven't
found a solution; the cluster remains very unstable. I would now like to
point to a different problem that may be related somehow and that we don't
understand.

When we restart a Flink session in Yarn, it takes several attempts before
the container running the JM (JobManager) is stable. The following Gist
contains the logs from the 4 failed attempts before a 5th successful one:

https://gist.github.com/siliconcat/3f6b7869e4796151a6bf23ed5342f516

We fail to see why the JM fails. In the first case, I can see a SIGTERM
(signal 15), so I assume the cluster manager or something similar is
killing it, but I am not sure what happens in the other cases, or why the
manager would kill that container. We run 38 streaming jobs with the same
resources that we were using before with Flink 1.6 (where we ran in legacy
mode).

Thanks for any insights. We are losing a lot of hair with 1.7.2...

Cheers,

Bruno