You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2021/01/27 12:44:46 UTC

[GitHub] [flink] XComp edited a comment on pull request #14686: [FLINK-6042][runtime] Adds exception history

XComp edited a comment on pull request #14686:
URL: https://github.com/apache/flink/pull/14686#issuecomment-768170545


   After talking to @chesnay about the exception history, I had to reiterate over the implementation as I realized that I tried to solve it using wrong assumptions. I try to summarize my conclusions. Let me know if I got something wrong. 
   
   Processing a state change runs in the main thread, i.e. at least changing the different execution states to `FAILED` does not compete with each other. If a failure happens, all executions of the same pipeline region change their state to `CANCELING`. In the current implementation, no failures are logged for these executions due to the way how [Execution.processFail](https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L1093) is implemented. 
   
   I need the failure information for the exception history. Hence, I have to start collecting the failure even when the `Execution` is in state `CANCELING` or `CANCELED`. That should be ok as this `failureInfo` is really only used for the exception history (i.e. in the `JobExceptionsHandler`).
   
   For the execution history, we have to differentiate between job failures and restartable failures: 
   * Task failures are grouped by a key that is based on the set of `Executions` that need to be restarted (using their `ExecutionAttemptId` as a reference).
   * ~Job failures are grouped based on the `ExecutionGraph`. There is no unique identifier for `ExecutionGraph`, yet. I'm thinking of extending `AccessExecutionGraph` to add such an identifier. The alternative would be to use the set of `Executions` like it's done for the restartable failures. The problem with that approach is that a restartable failure affecting all `ExecutionVertices` of the `ExecutionGraph` would overlap with a job failure.~ the global failures will be ignored for now
   
   The exception history is then passed along with the `ArchivedExecutionGraph` to the `JobExceptionsHandler` in something like a `ExecutionGraphInformation` as proposed by @tillrohrmann . `ExecutionGraphCache` and related classes have to adapted accordingly. The `JobExceptionsHandler` will do the translation into a REST response.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org