Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2019/05/15 07:58:38 UTC

[GitHub] [flink] tillrohrmann commented on issue #8412: [FLINK-12111][tests] Harden AbstractTaskManagerProcessFailureRecoveryTest

tillrohrmann commented on issue #8412: [FLINK-12111][tests] Harden AbstractTaskManagerProcessFailureRecoveryTest
URL: https://github.com/apache/flink/pull/8412#issuecomment-492549029

After an offline discussion with @zentol, we concluded that the following is most likely happening:

Killing the first TM process `TM1` does not happen synchronously. It is therefore possible that `TM1` sees the proceed marker file and finishes its computation before it dies. Next, the reducer is started and consumes the data from `TM1`. Now `TM1` is killed and the network stack signals the connection loss. Since the reducer runs with a parallelism of `1` and is deployed on `TM2`, the `ExecutionGraph` can be restarted quickly without waiting for the heartbeat to time out (because the `rpcTimeout` is set to `100 s` and the cancel attempts won't fail). Restarting the `ExecutionGraph` results in some tasks being deployed to `TM1`, whose heartbeat has not yet timed out. Finally, the heartbeat of `TM1` times out and the job fails a second time.

The proposed solution is the following:

* Wait for `TM1` to be terminated before creating the proceed marker file. That way the mappers running on `TM1` should never complete.
* Set the number of allowed restarts to `2` in order to allow for two job restarts.
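
The first point could look roughly like the sketch below. It is not the actual test code: the class name `KillBeforeProceed`, the method `killThenCreateMarker`, and the timeout parameter are made up for illustration, and it uses only the JDK `Process` and `Files` APIs. The idea is simply to block until the killed process has exited before the marker file is written. The restart count from the second point is a job configuration setting and is not shown here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the Flink test: kill first, confirm, then signal.
final class KillBeforeProceed {

    private KillBeforeProceed() {}

    /**
     * Forcibly kills the given process, waits until it has actually exited,
     * and only then creates the marker file.
     *
     * @return true if the process exited within the timeout and the marker was created,
     *         false if the process was still alive after the timeout (no marker is written)
     */
    static boolean killThenCreateMarker(Process process, Path markerFile, long timeoutSeconds)
            throws IOException, InterruptedException {
        // destroyForcibly() only requests termination; it does not wait for it.
        process.destroyForcibly();

        // Block until the process is really gone, so it can never observe the marker.
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            return false;
        }

        // Only now is it safe to let the remaining workers proceed.
        Files.createFile(markerFile);
        return true;
    }
}
```

Returning `false` instead of writing the marker on a timeout keeps the ordering guarantee intact: a caller can fail the test outright rather than continue with a process that may still be alive.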
