Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/06/21 19:41:32 UTC

[GitHub] [airflow] kcphila commented on issue #17507: Task processes killed with WARNING - Recorded pid does not match the current pid

kcphila commented on issue #17507:
URL: https://github.com/apache/airflow/issues/17507#issuecomment-1162252733

   Hi all,
   
   I am experiencing this on 2.3.2 with LocalExecutor (4 schedulers), Postgres, and Ubuntu 22.04. 
   
   This is, however, a clone of our staging environment, whose DAGs run fine on 2.1.4 and Ubuntu 16.04.  I'm also running on a much smaller and less powerful instance, which may be exacerbating race conditions.
   
   I did some investigation into the process state, and when this error leads to a failure, this is what I see in process executions:
   
   - The Scheduler task is the root of everything, as you'd expect (`airflow scheduler`)
   - `recorded_pid` is normally assigned the taskinstance pid (`ti.pid`), or the parent of the taskinstance pid (`psutil.Process(ti.pid).ppid()`) when RUN_AS_USER is set.  When the check fails, it consistently shows up as the worker (`worker -- LocalExecutor`), which is a persistent, long-lived process.
   - The child of the *recorded_pid* is the pid of the current process (as reported by `os.getpid()`), which is the airflow task supervisor. This (and everything below it) is one of the short-lived, task-specific processes.
   - The `current_pid` can be different things, but always appears to be the child of the task supervisor / current pid.  It must often be a fleeting process, as I can barely catch a record of it when I'm trying to fetch a snapshot.  Here are a couple that I have seen:
      - In some cases, I have seen this as the task runner's pid - `airflow tasks run [taskname]`
      - I have also seen this as the `airflow task su`; the tasks are RUN_AS_USER, so this is likely related.
   
   I came to wonder: since this error happens because (a) the final `recorded_pid` is not None and (b) `recorded_pid` != `current_pid`, it doesn't make much sense to ever be comparing against the Task Instance pid, since that's hanging around for a very long time, while the heartbeat function appears to be meant to identify when the current task runner is zombified or missing.
   
   As I've investigated further, I've found that when RUN_AS_USER tasks fail on this check, the `ti.pid` is almost invariably `None`, which means the `recorded_pid` comes in as `psutil.Process(None).ppid()`. Because psutil treats a `None` pid as the current process, that is the parent of the current process. I am currently under the impression that this was not intended, and that the error condition should only be tested when `ti.pid is not None`, instead of `recorded_pid is not None`.
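
To make the failure mode concrete, here is a minimal sketch of the check as described in this comment, next to the proposed guard. It is not the Airflow source: the function names, the `PARENT_OF` table, and all pids are hypothetical, and `ppid()` is only a stand-in for `psutil.Process(pid).ppid()` that imitates psutil's fallback to the current process when the pid is `None`.

```python
from typing import Optional

# Hypothetical process table (pid -> parent pid) mirroring the hierarchy
# described above. All pids are made up for illustration.
PARENT_OF = {
    100: 1,    # scheduler
    200: 100,  # worker -- LocalExecutor (long-lived)
    300: 200,  # task supervisor (the process running the heartbeat check)
    400: 300,  # child of the supervisor, reported as current_pid
    500: 400,  # task process that would normally record ti.pid
}
SUPERVISOR_PID = 300


def ppid(pid: Optional[int]) -> int:
    """Stand-in for psutil.Process(pid).ppid(); None means "this process"."""
    return PARENT_OF[SUPERVISOR_PID if pid is None else pid]


def mismatch_as_described(ti_pid: Optional[int], current_pid: int,
                          run_as_user: bool) -> bool:
    """The check as described: guarded by `recorded_pid is not None`."""
    recorded_pid = ppid(ti_pid) if run_as_user else ti_pid
    return recorded_pid is not None and recorded_pid != current_pid


def mismatch_proposed(ti_pid: Optional[int], current_pid: int,
                      run_as_user: bool) -> bool:
    """The proposed variant: skip the comparison while ti.pid is unset."""
    if ti_pid is None:
        return False
    recorded_pid = ppid(ti_pid) if run_as_user else ti_pid
    return recorded_pid != current_pid


# ti.pid not yet recorded, RUN_AS_USER set: recorded_pid becomes the worker (200).
print(mismatch_as_described(None, 400, run_as_user=True))  # True (spurious)
print(mismatch_proposed(None, 400, run_as_user=True))      # False
# Once ti.pid is recorded, both variants agree.
print(mismatch_proposed(500, 400, run_as_user=True))       # False
```

In this toy model the spurious mismatch only appears when `ti.pid` is `None` and RUN_AS_USER is set; a genuinely different recorded pid is still reported by both variants.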
   
   I'm testing this right now and it seems to work; if it holds up, I'll put in a PR.
   

