Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2019/05/15 06:41:38 UTC

[GitHub] [flink] zentol commented on issue #8412: [FLINK-12111][tests] Harden AbstractTaskManagerProcessFailureRecoveryTest

zentol commented on issue #8412: [FLINK-12111][tests] Harden AbstractTaskManagerProcessFailureRecoveryTest
URL: https://github.com/apache/flink/pull/8412#issuecomment-492525060
 
 
   It weakens the test in some regards (weird timing issues of this kind are no longer covered) but strengthens it in others (BATCH mode is now _actually_ tested).
   
   I tried to reproduce the scenario you described by upping the heartbeat timeout, so that the network stack always fails first. This, however, didn't work: the restart was delayed because all tasks on the TM that timed out were stuck in the CANCELING state. Only once the heartbeat actually timed out could the restart proceed. As such, I'm no longer sure whether this scenario can actually occur.
   
   Upping the restart delay should work; I will implement that right away.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services