You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Zhu Zhu (Jira)" <ji...@apache.org> on 2019/11/08 08:16:00 UTC

[jira] [Comment Edited] (FLINK-14606) Simplify params of Execution#processFail

    [ https://issues.apache.org/jira/browse/FLINK-14606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969932#comment-16969932 ] 

Zhu Zhu edited comment on FLINK-14606 at 11/8/19 8:15 AM:
----------------------------------------------------------

Thanks [~chesnay] for sharing the thoughts. 
My bad that in the description I talked too much on why this change would work, but not enough for the motivation to do it. 
The motivation is actually to fix some potential issues. If the issues are not valid or there are other better ways to solve it, it's fine ignore the initial proposal.

Potential issues:
1. If a task deployment succeeded at TM, while the success response to JM is lost, it may not be canceled (isCallback==true in this case, see the [execution deploy logic|https://github.com/apache/flink/blob/b0a9afdd24fb70131b1e80d46d0ca101235a4a36/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L739])
2. From entry {{Execution#markFailed(Throwable)}}, the task is thought to not be deployed, thus isCallback==true and  {{sendCancelRpcCall}} will not be invoked. But we will still releasing partitions since releasePartitions==true, even though the task is not deployed and the partition then should not have been created. Would it matter if we set releasePartitions to be false in this case to reduce RPC calls?

For the {{fromSchedulerNg}} flag, agreed that we can keep it as is since the benefit to unify it is not obvious.





was (Author: zhuzh):
Thanks [~chesnay] for sharing the thoughts. 
My bad that in the description I talked too much on why this change would work, but not enough for the motivation to do it. 
The motivation is actually to fix some potential issues. If the issues are not valid or there are other better ways to solve it, it's fine ignore the initial proposal.

Potential issues:
1. tasks may not be canceled if a task deployment succeeded at TM, while the success response to JM is lost. (isCallback==true in this case, see the [execution deploy logic|https://github.com/apache/flink/blob/b0a9afdd24fb70131b1e80d46d0ca101235a4a36/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L739])
2. From entry {{Execution#markFailed(Throwable)}}, the task is thought to not be deployed, thus isCallback==true and  {{sendCancelRpcCall}} will not be invoked. But we will still releasing partitions since releasePartitions==true, even though the task is not deployed and the partition then should not have been created. Would it matter if we set releasePartitions to be false in this case to reduce RPC calls?

For the {{fromSchedulerNg}} flag, agreed that we can keep it as is since the benefit to unify it is not obvious.




> Simplify params of Execution#processFail
> ----------------------------------------
>
>                 Key: FLINK-14606
>                 URL: https://issues.apache.org/jira/browse/FLINK-14606
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Priority: Major
>             Fix For: 1.10.0
>
>
> The 3 params fromSchedulerNg/releasePartitions/isCallback of Execution#processFail are quite a mess while they seem to be correlated. 
> I'd propose to simplify the prams of processFail by using a {{isInternalError}} to replace those 3 params. {{isInternalError}} is true iff the failure is from TM(strictly speaking, notified from SchedulerBase). This also hardens the handling of cases that a task is successfully deployed but JM does not realize it(see #3 below).
> Here's why these 3 params can be simplified:
> 1. {{fromSchedulerNg}}, true iff the failure is from TM and isLegacyScheduling==false.
>     It's only used like this: {{if (!fromSchedulerNg && !isLegacyScheduling()))}}. So it's the same to use {{!isInternalFailure}} to replace it.
> 2. {{releasePartitions}}, true iff the failure is from TM.
>   Now the value is exactly the same as {{isInternalFailure}}, we can drop it and use {{isInternalFailure}} instead.
> 3. {{isCallback}}, true iff the failure is from TM or the task is not deployed.
>     It's only used like this: {{(!isCallback && (current == RUNNING || current == DEPLOYING))}}.
>     So using {{!isInternalFailure}} to replace it would be enough. It is a bit different for the case that a task deployment to a task manager fails, which set {{isCallback}} to true previously. However, it would be safer to signal a cancel call, in case the deployment is actually a success but the response is lost on network.
> cc [~GJL]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)