
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2019/07/10 09:35:04 UTC

[GitHub] [flink] StephanEwen commented on issue #9058: [FLINK-13166] Add support for batch slot requests to SlotPoolImpl

StephanEwen commented on issue #9058: [FLINK-13166] Add support for batch slot requests to SlotPoolImpl
URL: https://github.com/apache/flink/pull/9058#issuecomment-509988368

I agree with Till here. The logic is not yet perfect, but should be an improvement over the current state.

Under fine-grained recovery, the current state would lead to the failure of a task and its individual recovery, which re-triggers a request to the RM. That is good, but the downside is that it uses up recovery attempts. I think it is tricky for users to understand that we rely on failure/recovery to re-request resources. It makes retry attempts meaningless and leads users to debug jobs (because they see unexpected failures) when really nothing is wrong.

With this change, we no longer rely on failure/recovery, but we also do not re-trigger timed-out requests within a stage. A stage may therefore not use its resources optimally. The requests are issued again in the next stage.

As Till suggested, for 1.10 we should consider a different model. Requests from the SlotPool to the RM should not time out (unless there is an actual failure), and resources that appear at the RM should make it to the SlotPool. Letting the SlotPool periodically request resources seems like a workaround to me.
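
To make the model above concrete, here is a rough sketch of what I have in mind. It is not Flink code: `PendingRequestSketch`, `SketchResourceManager`, `requestSlot`, `slotAppeared`, and `fail` are all hypothetical names, and slots are plain strings. It only illustrates the idea that a request stays pending without a timeout, is fulfilled when a resource appears, and fails only on an actual failure.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch; not the actual SlotPool or ResourceManager API. */
public class PendingRequestSketch {

    /** Stand-in for the RM side: keeps requests pending until a slot appears. */
    static final class SketchResourceManager {
        private final Queue<CompletableFuture<String>> pending = new ArrayDeque<>();
        private final Queue<String> freeSlots = new ArrayDeque<>();

        /** Requests a slot. The returned future has no timeout attached. */
        synchronized CompletableFuture<String> requestSlot() {
            String slot = freeSlots.poll();
            if (slot != null) {
                return CompletableFuture.completedFuture(slot);
            }
            CompletableFuture<String> request = new CompletableFuture<>();
            pending.add(request);
            return request;
        }

        /** A slot appears (assumed: a worker registered); serve the oldest pending request. */
        synchronized void slotAppeared(String slotId) {
            CompletableFuture<String> request = pending.poll();
            if (request != null) {
                request.complete(slotId);
            } else {
                freeSlots.add(slotId);
            }
        }

        /** Only an actual failure fails the outstanding requests. */
        synchronized void fail(Throwable cause) {
            CompletableFuture<String> request;
            while ((request = pending.poll()) != null) {
                request.completeExceptionally(cause);
            }
        }
    }

    public static void main(String[] args) {
        SketchResourceManager rm = new SketchResourceManager();
        CompletableFuture<String> request = rm.requestSlot();
        System.out.println("done before slot appears: " + request.isDone());
        rm.slotAppeared("slot-1");
        System.out.println("fulfilled with: " + request.join());
    }
}
```

With something along these lines, the SlotPool would neither need to re-request periodically nor depend on task failure/recovery to get resources that show up late.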

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services