You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Panagiotis Garefalakis (Jira)" <ji...@apache.org> on 2024/03/22 05:10:00 UTC

[jira] [Updated] (FLINK-33121) Failed precondition in JobExceptionsHandler due to concurrent global failures

     [ https://issues.apache.org/jira/browse/FLINK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Panagiotis Garefalakis updated FLINK-33121:
-------------------------------------------
    Description: 
We make the assumption that Global Failures (with null Task name) may only be RootExceptions and and Local/Task exception may be part of concurrent exceptions List (see {{{}JobExceptionsHandler#createRootExceptionInfo{}}}) -- However, when the Adaptive scheduler is in a Restarting phase due to an existing failure (that is now the new Root) we can still, in rare occasions, capture new Global failures, violating this condition (with an assertion is thrown as part of {{{}assertLocalExceptionInfo{}}}).

A solution to this could be to ignore Global failures while being in a Restarting phase on the Adaptive scheduler.

This PR also fixes a smaller bug where we dont pass the [taskName|https://github.com/apache/flink/pull/23440/files#diff-0c8b850bbd267631fbe04bb44d8bb3c7e87c3c6aabae904fabdb758026f7fa76R104] properly.

Note: DefaultScheduler does not suffer from this issue as it treats failures directly as HistoryEntries (no conversion step)

  was:
{{JobExceptionsHandler#createRootExceptionInfo}} makes the assumption that *Global* Failures (with null Task name) may *only* be RootExceptions (jobs are considered in FAILED state when this happens and no further exceptions are captured) and *Local/Task* may be part of concurrent exceptions List *--* if this precondition is violated, an assertion is thrown as part of {{{}asserLocalExceptionInfo{}}}.

The issue lies within [convertFailures|[https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StateWithExecutionGraph.java#L422]] logic where we take the failureCollection pointer and convert it to a HistoryEntry.
In more detail, we are passing the first Failure and a pointer to the remaining failures collection as part of HistoryEntry creation — and then add the entry in the exception History.
In our specific scenario a Local Failure first comes in, we call convertFailures that creates a HistoryEntry and removes the LocalFailure from the collection while also passing a pointer to the empty failureCollection. Then a Global failure comes in (and before conversion), it is added to the failureCollection (that was empty) just before serving the requestJob that returns the List of History Entries.
This messes things up, as the LocalFailure now has a ConcurrentExceptionsCollection with a Global Failure that should never happen (causing the assertion).
A solution is to create a Copy of the failureCollection in the conversion instead of passing the pointer around (as I did in the updated PR)

This PR also fixes a smaller bug where we dont pass the [taskName|[https://github.com/apache/flink/pull/23440/files#diff-0c8b850bbd267631fbe04bb44d8bb3c7e87c3c6aabae904fabdb758026f7fa76R104]|https://github.com/apache/flink/pull/23440/files#diff-0c8b850bbd267631fbe04bb44d8bb3c7e87c3c6aabae904fabdb758026f7fa76R104] properly.

Note: DefaultScheduler does not suffer from this issue as it treats failures directly as HistoryEntries (no conversion step)


> Failed precondition in JobExceptionsHandler due to concurrent global failures
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-33121
>                 URL: https://issues.apache.org/jira/browse/FLINK-33121
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Panagiotis Garefalakis
>            Assignee: Panagiotis Garefalakis
>            Priority: Major
>              Labels: pull-request-available
>
> We make the assumption that Global Failures (with null Task name) may only be RootExceptions and and Local/Task exception may be part of concurrent exceptions List (see {{{}JobExceptionsHandler#createRootExceptionInfo{}}}) -- However, when the Adaptive scheduler is in a Restarting phase due to an existing failure (that is now the new Root) we can still, in rare occasions, capture new Global failures, violating this condition (with an assertion is thrown as part of {{{}assertLocalExceptionInfo{}}}).
> A solution to this could be to ignore Global failures while being in a Restarting phase on the Adaptive scheduler.
> This PR also fixes a smaller bug where we dont pass the [taskName|https://github.com/apache/flink/pull/23440/files#diff-0c8b850bbd267631fbe04bb44d8bb3c7e87c3c6aabae904fabdb758026f7fa76R104] properly.
> Note: DefaultScheduler does not suffer from this issue as it treats failures directly as HistoryEntries (no conversion step)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)