Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/10/06 12:34:07 UTC

[GitHub] [airflow] turbaszek opened a new issue #11302: Add job_id to DagRun table

turbaszek opened a new issue #11302:
URL: https://github.com/apache/airflow/issues/11302


   **Description**
   
   Currently, we have `run_type` in the DagRun table, but there's no way to determine which job created a DagRun (there is no 1-1 relation between DagRun and Job). Recording this would help with debugging, especially with Scheduler HA and BackfillJobs, because users could check which scheduler or backfill job triggered their task (right now the `job_id` in TaskInstance is always the id of the LocalTaskJob).
   
   **Use case / motivation**
   
   Debugging purposes and better data consistency.
   
   **Related Issues**
   
   https://github.com/apache/airflow/pull/8227
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #11302: Add job_id to DagRun table

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #11302:
URL: https://github.com/apache/airflow/issues/11302#issuecomment-704828116


   After a discussion with @turbaszek - I think it's super useful to have this information, as it lets us investigate the causes of problems and gives us more to work with - for example, we will know which scheduler created a DagRun. I would love to get it added even for 2.0 after merging the HA change.





[GitHub] [airflow] ashb commented on issue #11302: Add job_id to DagRun table

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #11302:
URL: https://github.com/apache/airflow/issues/11302#issuecomment-705084788


   Got it, that makes sense.
   
   There's code in the scheduler HA branch that will kill timed-out/non-heartbeating SchedulerJobs - that could be extended to BackfillJob too.
   
   I'm not sure what change would be needed to detect zombie tasks from backfilled jobs, but it's a good goal I agree.





[GitHub] [airflow] mik-laj commented on issue #11302: Add job_id to DagRun table and remove BackfillJob zombies

Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #11302:
URL: https://github.com/apache/airflow/issues/11302#issuecomment-705225535


   +1 from my side.





[GitHub] [airflow] turbaszek edited a comment on issue #11302: Add job_id to DagRun table and remove BackfillJob zombies

Posted by GitBox <gi...@apache.org>.
turbaszek edited a comment on issue #11302:
URL: https://github.com/apache/airflow/issues/11302#issuecomment-705088478


   I will try to tackle it once #10956 is merged 





[GitHub] [airflow] turbaszek commented on issue #11302: Add job_id to DagRun table

Posted by GitBox <gi...@apache.org>.
turbaszek commented on issue #11302:
URL: https://github.com/apache/airflow/issues/11302#issuecomment-705088478


   I will try to tackle it once #9630 is merged 





[GitHub] [airflow] turbaszek commented on issue #11302: Add job_id to DagRun table

Posted by GitBox <gi...@apache.org>.
turbaszek commented on issue #11302:
URL: https://github.com/apache/airflow/issues/11302#issuecomment-704237776


   @ashb @mik-laj I'm happy to hear what you think





[GitHub] [airflow] turbaszek commented on issue #11302: Add job_id to DagRun table

Posted by GitBox <gi...@apache.org>.
turbaszek commented on issue #11302:
URL: https://github.com/apache/airflow/issues/11302#issuecomment-704318109


   > Do you have a case where it's interesting to know what job created the DagRun? (I ask because I can't think of one immediately)
   
   I have an operator that triggers a backfill job and observes that process. If the "parent" task dies (SIGKILL), then there are zombie tasks (scheduled/none state) from the backfill job that are not cleaned up by anything. But that's probably an edge case + we use a custom implementation of BackfillJob (that's why I was able to fix it using `DagRun.conf` for storing job_id - a hack).
   
   However, I think that this information can be helpful in case of multiple schedulers, for example, to find that only one of them has some problems (no idea what problems).
   
   So, basically my suggestion is about adding non-crucial information that **may** help sometimes 😄 
   
   > I would probably suggest naming it `created_by_job_id` - it's a bit clearer what it stores than just `job_id`
   
   +1 to this





[GitHub] [airflow] mik-laj edited a comment on issue #11302: Add job_id to DagRun table and remove BackfillJob zombies

Posted by GitBox <gi...@apache.org>.
mik-laj edited a comment on issue #11302:
URL: https://github.com/apache/airflow/issues/11302#issuecomment-705225535


   +1 from my side. This can be used to tune the scheduler better.





[GitHub] [airflow] kaxil commented on issue #11302: Add job_id to DagRun table and remove BackfillJob zombies

Posted by GitBox <gi...@apache.org>.
kaxil commented on issue #11302:
URL: https://github.com/apache/airflow/issues/11302#issuecomment-705110826


   +1





[GitHub] [airflow] turbaszek commented on issue #11302: Add job_id to DagRun table and remove BackfillJob zombies

Posted by GitBox <gi...@apache.org>.
turbaszek commented on issue #11302:
URL: https://github.com/apache/airflow/issues/11302#issuecomment-749158977


   > @turbaszek Did we do all of this or just the "job id" part?
   
   Just the job id. I'm going to propose an AIP in a few days for redesigning backfill.





[GitHub] [airflow] turbaszek commented on issue #11302: Add job_id to DagRun table

Posted by GitBox <gi...@apache.org>.
turbaszek commented on issue #11302:
URL: https://github.com/apache/airflow/issues/11302#issuecomment-705062847


   Additionally, this may help us with making backfill runnable remotely.
   
   ## The problem
   
   Run:
   ```
   airflow dags backfill -v -s 2020-10-06 example_bash_operator
   # once there's a process running a single task, do the following:
   pkill -9 -f backfil
   ```
   This will result in a "zombie" DagRun and related task instances that will not be cleaned up by the scheduler (at least that's my understanding). Example:
   <img width="2188" alt="Screenshot 2020-10-07 at 18 34 47" src="https://user-images.githubusercontent.com/9528307/95360543-ce73a200-08cb-11eb-8116-552da95fb105.png">
   
   However, querying the job table we see:
   <img width="1687" alt="Screenshot 2020-10-07 at 18 39 07" src="https://user-images.githubusercontent.com/9528307/95360957-683b4f00-08cc-11eb-854b-d50eb3fe965e.png">
   
   So, the backfill job is still running according to Airflow state but that's not true as we killed the job 👎 
   
   ## Possible solution
   
   Link a specific job to the DagRun triggered by it (using the `job_id`) and then run a process that will kill the zombies.
   
   This can be done either by:
   - killing a DR (and related TIs) that is in an unfinished state (running, none, scheduled, queued) but the job that was running it is in error state
   - killing a DR (and related TIs) that is in an unfinished state but the job that was running it didn't heartbeat for the last few minutes (configurable)
   
   Cleaning of such zombies can be easily triggered by the scheduler. 
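
The two rules above can be sketched in plain Python. This is only an illustration under stated assumptions, not Airflow code: the `Job` and `DagRun` dataclasses, the `find_zombie_dag_runs` helper, the `"failed"` job state and the five-minute default timeout are all hypothetical stand-ins for the real models and configuration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Unfinished states as listed in the proposal (simplified to plain strings).
UNFINISHED_STATES = {"running", "none", "scheduled", "queued"}


@dataclass
class Job:
    """Hypothetical stand-in for a row in the job table."""
    id: int
    state: str
    latest_heartbeat: datetime


@dataclass
class DagRun:
    """Hypothetical stand-in for a DagRun linked to the job that created it."""
    run_id: str
    state: str
    created_by_job_id: Optional[int]  # None when no job created the run


def find_zombie_dag_runs(
    dag_runs: List[DagRun],
    jobs: List[Job],
    now: datetime,
    heartbeat_timeout: timedelta = timedelta(minutes=5),  # assumed default
) -> List[DagRun]:
    """Return unfinished DagRuns whose creating job failed or went silent."""
    jobs_by_id = {job.id: job for job in jobs}
    zombies = []
    for dag_run in dag_runs:
        if dag_run.state not in UNFINISHED_STATES:
            continue
        job = jobs_by_id.get(dag_run.created_by_job_id)
        if job is None:
            # No linked job (e.g. a manual trigger): nothing to compare against.
            continue
        job_failed = job.state == "failed"
        job_silent = now - job.latest_heartbeat > heartbeat_timeout
        if job_failed or job_silent:
            zombies.append(dag_run)
    return zombies


now = datetime(2020, 10, 7, 18, 40)
jobs = [
    Job(id=1, state="running", latest_heartbeat=now - timedelta(seconds=10)),
    # Still marked "running", but it has not sent a heartbeat for 30 minutes.
    Job(id=2, state="running", latest_heartbeat=now - timedelta(minutes=30)),
]
dag_runs = [
    DagRun("scheduled__a", "running", created_by_job_id=1),
    DagRun("backfill__b", "running", created_by_job_id=2),
    DagRun("manual__c", "running", created_by_job_id=None),
]
zombies = find_zombie_dag_runs(dag_runs, jobs, now)
print([dag_run.run_id for dag_run in zombies])
```

Only the run created by the silent job is flagged. Runs without a linked job are skipped because there is no job state to check, which is one reason the link would have to be optional.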
   
   I think this may bring us closer to triggering backfill via API / UI.
   
   WDYT? @ashb @kaxil @potiuk @mik-laj @dimberman 





[GitHub] [airflow] kaxil edited a comment on issue #11302: Add job_id to DagRun table and remove BackfillJob zombies

Posted by GitBox <gi...@apache.org>.
kaxil edited a comment on issue #11302:
URL: https://github.com/apache/airflow/issues/11302#issuecomment-705110826


   +1 to this change, thanks @turbaszek 





[GitHub] [airflow] ashb commented on issue #11302: Add job_id to DagRun table and remove BackfillJob zombies

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #11302:
URL: https://github.com/apache/airflow/issues/11302#issuecomment-749006204


   @turbaszek Did we do all of this or just the "job id" part?





[GitHub] [airflow] ashb commented on issue #11302: Add job_id to DagRun table

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #11302:
URL: https://github.com/apache/airflow/issues/11302#issuecomment-704291021


   Do you have a case where it's interesting to know what job created the DagRun? (I ask because I can't think of one immediately)
   
   Don't forget that in the case of trigger (via CLI or webserver) there is no job id to use.
   
   I would probably suggest naming it `created_by_job_id` - it's a bit clearer what it stores than just `job_id`
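
A minimal sketch of what such a column could look like, using an in-memory SQLite database. The two tables are trimmed-down, hypothetical versions made up for illustration and are not Airflow's actual schema; the point is only that `created_by_job_id` has to be nullable, since runs triggered via CLI or webserver have no job to point at.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified tables; only the nullable link column matters here.
conn.executescript(
    """
    CREATE TABLE job (
        id INTEGER PRIMARY KEY,
        job_type TEXT NOT NULL,
        hostname TEXT
    );
    CREATE TABLE dag_run (
        id INTEGER PRIMARY KEY,
        dag_id TEXT NOT NULL,
        run_type TEXT NOT NULL,
        created_by_job_id INTEGER REFERENCES job (id)  -- NULL for manual triggers
    );
    """
)

conn.executemany(
    "INSERT INTO job (id, job_type, hostname) VALUES (?, ?, ?)",
    [(1, "SchedulerJob", "scheduler-a"), (2, "BackfillJob", "worker-b")],
)
conn.executemany(
    "INSERT INTO dag_run (dag_id, run_type, created_by_job_id) VALUES (?, ?, ?)",
    [
        ("example_bash_operator", "scheduled", 1),
        ("example_bash_operator", "backfill", 2),
        ("example_bash_operator", "manual", None),  # triggered without a job
    ],
)

# LEFT JOIN keeps manually triggered runs, with NULLs in the job columns.
rows = conn.execute(
    """
    SELECT dr.dag_id, dr.run_type, j.job_type, j.hostname
    FROM dag_run AS dr
    LEFT JOIN job AS j ON j.id = dr.created_by_job_id
    ORDER BY dr.id
    """
).fetchall()

for row in rows:
    print(row)
```

With the link in place, "which scheduler created this DagRun" becomes a single join, and the manual run simply comes back with empty job columns.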

