Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/04/09 15:17:52 UTC

[GitHub] [airflow] john-jac opened a new issue #15306: Support Serialized DAGs on CLI Commands

john-jac opened a new issue #15306:
URL: https://github.com/apache/airflow/issues/15306


   **Description**
   
   CLI commands such as `backfill` and `list_dags` currently parse DAGs before executing. This introduces two primary issues: 1) the parse process can be time-consuming, and 2) it does not work when run on a web server that lacks access to the DAGs and/or their respective Python libraries or Airflow plugins.
   
   **Use case / motivation**
   
   An option to use serialized DAGs with the Airflow CLI would let users run commands from the information already available in the metadata database, which is more efficient than relying solely on parsing the source DAGs.
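
   To make the trade-off concrete, here is a minimal sketch of the two lookup strategies. It does not use Airflow's actual API: `DagSummary`, `load_by_parsing`, `load_from_store`, `get_dag`, and the `use_serialized` flag are all hypothetical names chosen for illustration, the `sources` mapping stands in for DAG files on disk, and the `store` mapping stands in for serialized rows in the metadata database.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DagSummary:
    """Hypothetical stand-in for the DAG metadata a CLI command needs."""
    dag_id: str
    task_ids: List[str]


def load_by_parsing(dag_id: str, sources: Dict[str, Callable[[], DagSummary]]) -> DagSummary:
    """Build every DAG from source, then return the requested one."""
    parsed = {}
    for build in sources.values():
        dag = build()  # stands in for importing and executing one DAG file
        parsed[dag.dag_id] = dag
    return parsed[dag_id]


def load_from_store(dag_id: str, store: Dict[str, str]) -> DagSummary:
    """Read a single pre-serialized JSON row; no DAG source code is executed."""
    row = json.loads(store[dag_id])
    return DagSummary(dag_id=row["dag_id"], task_ids=row["task_ids"])


def get_dag(dag_id, sources, store, use_serialized=False):
    """Hypothetical opt-in switch between the two lookup strategies."""
    if use_serialized:
        return load_from_store(dag_id, store)
    return load_by_parsing(dag_id, sources)
```

   In this sketch the serialized path touches only one stored row and needs neither the DAG files nor their imports, while the parsing path runs every builder before it can answer. Whether a given command can work from the serialized form alone is a separate question that the sketch does not address.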
   
   **Are you willing to submit a PR?**
   
   Yes
   
   **Related Issues**
   
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] jhtimmins commented on issue #15306: Support Serialized DAGs on CLI Commands

Posted by GitBox <gi...@apache.org>.
jhtimmins commented on issue #15306:
URL: https://github.com/apache/airflow/issues/15306#issuecomment-818397269


   @kaxil I'm not super familiar with DAG serialization, but I think this sounds reasonable. Are there any drawbacks to supporting this functionality?





[GitHub] [airflow] anitakar commented on issue #15306: Support Serialized DAGs on CLI Commands

Posted by GitBox <gi...@apache.org>.
anitakar commented on issue #15306:
URL: https://github.com/apache/airflow/issues/15306#issuecomment-848862444


   At least for Airflow 1.10.15, not using serialization leads to very inefficient task execution, in which the whole DagBag is parsed locally before a task is executed.
   Here is an excerpt from the code/stacktrace that supports my point:
   1. `airflow worker` command starts celery_executor (https://github.com/apache/airflow/blob/1.10.15/airflow/bin/cli.py#L1554)
   2. Then worker executes `airflow tasks run` as set by scheduler (https://github.com/apache/airflow/blob/1.10.15/airflow/bin/cli.py#L596)
   3. And in there get_dag is called without specifying store_serialized_dags which is false by default (https://github.com/apache/airflow/blob/1.10.15/airflow/bin/cli.py#L618)
   4. In there the whole dags directory is parsed locally on worker (https://github.com/apache/airflow/blob/5786dcdc392f7a2649f398353a0beebef01c428e/airflow/bin/cli.py#L164)
   
   It seems very inefficient to parse all dags before each task execution.
   
   I have committed a few fixes to dag serialization. I would be happy to fix at least the path for task execution within worker.
   
   @kaxil @potiuk WDYT?
   





[GitHub] [airflow] kaxil commented on issue #15306: Support Serialized DAGs on CLI Commands

Posted by GitBox <gi...@apache.org>.
kaxil commented on issue #15306:
URL: https://github.com/apache/airflow/issues/15306#issuecomment-819513114


   list_dags is fine -- backfill won't work on serialized dags.
   
   





[GitHub] [airflow] anitakar edited a comment on issue #15306: Support Serialized DAGs on CLI Commands

Posted by GitBox <gi...@apache.org>.
anitakar edited a comment on issue #15306:
URL: https://github.com/apache/airflow/issues/15306#issuecomment-848862444


   At least for Airflow 1.10.15, not using serialization leads to very inefficient task execution, in which the whole DagBag is parsed locally before a task is executed.
   Here is an excerpt from the code/stacktrace that supports my point:
   1. `airflow worker` command starts celery_executor (https://github.com/apache/airflow/blob/1.10.15/airflow/bin/cli.py#L1554)
   2. Then worker executes `airflow tasks run` as set by scheduler (https://github.com/apache/airflow/blob/1.10.15/airflow/bin/cli.py#L596)
   3. And in there get_dag is called without specifying store_serialized_dags which is false by default (https://github.com/apache/airflow/blob/1.10.15/airflow/bin/cli.py#L618)
   4. In there the whole dags directory is parsed locally on worker (https://github.com/apache/airflow/blob/1.10.15/airflow/bin/cli.py#L164)
   
   It seems very inefficient to parse all dags before each task execution.
   
   I have committed a few fixes to dag serialization. I would be happy to fix at least the path for task execution within worker.
   
   Sorry, my mistake. The dag is pickled: https://github.com/apache/airflow/blob/1.10.15/airflow/bin/cli.py#L622





[GitHub] [airflow] anitakar removed a comment on issue #15306: Support Serialized DAGs on CLI Commands

Posted by GitBox <gi...@apache.org>.
anitakar removed a comment on issue #15306:
URL: https://github.com/apache/airflow/issues/15306#issuecomment-848862444







[GitHub] [airflow] john-jac commented on issue #15306: Support Serialized DAGs on CLI Commands

Posted by GitBox <gi...@apache.org>.
john-jac commented on issue #15306:
URL: https://github.com/apache/airflow/issues/15306#issuecomment-836941555


   > list_dags is fine -- backfill won't work on serialized dags.
   
   @kaxil could you provide details as to why backfill can't run on serialized dags?





[GitHub] [airflow] uranusjr commented on issue #15306: Support Serialized DAGs on CLI Commands

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #15306:
URL: https://github.com/apache/airflow/issues/15306#issuecomment-819421484


   Would it make more sense to change them to always use the serialised DAGs instead, since all DAGs are now serialised in 2.0? Or are there things that only parsing the real DAG can do?

