
Posted to commits@airflow.apache.org by "argibbs (via GitHub)" <gi...@apache.org> on 2023/09/13 13:32:22 UTC

[GitHub] [airflow] argibbs commented on issue #34339: Tasks being marked as failed even after running successfully

argibbs commented on issue #34339:
URL: https://github.com/apache/airflow/issues/34339#issuecomment-1717644586

As is always the way: after sitting on this problem for days before raising the issue, I have just noticed that we are getting occasional timeouts in the dag processor manager on the dags that are most frequently exhibiting the problem. (Why we're hitting the timeout is a separate problem, but baby steps)

After dropping down to a single scheduler, this was manifesting as the dags dropping out of the gui then reappearing, which is how I discovered it. (Aside: given how critical the dag processor manager's core loop is to airflow reliability, I feel like it gets nowhere near as much error reporting as it should do. Really, the GUI should be flagging up process timeouts).

When running with multiple schedulers, we never noticed this flickering in and out of existence in the GUI. Total guess, but maybe this was because there was always at least one of the schedulers which had recently processed the dag ok... 🤷

My working hypothesis is now that a scheduler would time out while processing the dags, and this would somehow cause all the active tasks in the affected dags to be blatted as failed. (Insert suitable jazz hand waving over the specifics.) I checked a few of the failures I've seen, and I do see timeouts in the processor at roughly the same time that the gantt shows the tasks being blatted as failed, despite them actually running ok.
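For anyone wanting to repeat that "roughly the same time" check less manually, here is a rough sketch of the comparison. It assumes you have already pulled two lists of timestamps out of your own logs and task history; the function name, the two-minute window, and the sample timestamps are all made up for illustration and are not taken from this issue.

```python
from datetime import datetime, timedelta


def failures_near_timeouts(timeouts, failures, window=timedelta(minutes=2)):
    """Pair each task failure with the nearest processor timeout within `window`.

    Failures with no timeout inside the window are left out of the result.
    """
    matches = []
    for failed_at in failures:
        # Keep only the timeouts close enough to this failure to be suspicious.
        nearby = [t for t in timeouts if abs(t - failed_at) <= window]
        if nearby:
            nearest = min(nearby, key=lambda t: abs(t - failed_at))
            matches.append((failed_at, nearest))
    return matches


# Hypothetical timestamps, purely for illustration.
timeouts = [datetime(2023, 9, 13, 10, 0, 5), datetime(2023, 9, 13, 11, 30, 0)]
failures = [datetime(2023, 9, 13, 10, 0, 50), datetime(2023, 9, 13, 12, 15, 0)]

matches = failures_near_timeouts(timeouts, failures)
for failed_at, timed_out_at in matches:
    print(f"task failed at {failed_at}, processor timed out at {timed_out_at}")
```

A match here only shows the two events were close in time. It says nothing about whether the timeout caused the failure, which is the part still covered by the hand waving.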

Anyhoo, I am now running two experiments:
1. Multiple schedulers + increased timeout.
2. Single scheduler + increased timeout.

Note:
Obviously, the third experiment is:
3. Single scheduler + default timeout.
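For reference, a sketch of what "increased timeout" could look like in practice. The comment above does not say which setting is being raised; this assumes it is `dag_file_processor_timeout` in the `[core]` section, and the value of 120 seconds is an arbitrary placeholder. The only thing the snippet relies on is Airflow's documented `AIRFLOW__{SECTION}__{KEY}` environment variable convention.

```python
def airflow_env_var(section: str, key: str) -> str:
    """Build the environment variable name Airflow reads for a config option."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# Assumption: this is the timeout being hit; it is not confirmed in the issue.
SECTION, KEY = "core", "dag_file_processor_timeout"
NEW_TIMEOUT_SECONDS = 120  # hypothetical value, tune for your own environment

export_line = f"export {airflow_env_var(SECTION, KEY)}={NEW_TIMEOUT_SECONDS}"
print(export_line)
```

The same option can be set in `airflow.cfg` instead. Either way, it would need to be applied to every scheduler for the multi-scheduler experiment to be a fair comparison.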

I have been running this (a single scheduler + default timeout) for several days in one env, and the problem seemed to have gone away (which is why I was suspicious of multiple schedulers). However, I have just checked the dag processor logs for that env, and it simply doesn't seem to have been experiencing timeouts, so it's possible I just picked a less contested box for that sole scheduler. Or maybe timeouts + a single scheduler is fine, and it's timeouts + multiple schedulers that's the problem. Or maybe I'm chasing a red herring.

I'll update if I find more.
