You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/11/08 16:13:02 UTC

[GitHub] [airflow] collinmcnulty opened a new pull request #19471: Add cli command for 'airflow dags reserialize`

collinmcnulty opened a new pull request #19471:
URL: https://github.com/apache/airflow/pull/19471


   Dag serialization is currently out of the hands of the user. Whenever dag reserialization is required, I run this python script:
   
   ```
   from airflow.models.serialized_dag import SerializedDagModel
   from airflow.settings import Session
   session = Session()
   session.query(SerializedDagModel).delete()
   session.commit()
   ```
   
   This PR makes running that script as simple as `airflow dags reserialize`.
   
   closes: #19432 
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#issuecomment-988444636


   > Did I miss any important information?
   
   I think you missed the information that Airlfow 1.10 (not even 1.10.12 but any 1.10 version) reached end of life in June 2021 and will not receive any updates (not even critical fixes) any more. We do not produce any code that is targetted for 1.10 for about half a year already.
   
   See here for example: https://github.com/apache/airflow#version-life-cycle
   
   Please migrate to Airflow 2 as soon as possible
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish commented on pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
dstandish commented on pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#issuecomment-963583324


   there's a spelling_wordlist.txt you should be able to modify


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] EricGao888 edited a comment on pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
EricGao888 edited a comment on pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#issuecomment-987655152


   Our team are using airflow with Celery Executor. For example, we have a cluster of 1 header machine and 2 worker machines. We deploy webserver and scheduler on header and have the workers run on worker machines. If I run `airflow dags reserialize` on header machine, will the two workers get updated? Will there be any consistent issues? Should I run `airflow dags reserialize` on both header and workers? Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#issuecomment-963411445


   Doc build is failing. I think it would also be good to add this info to the `troubleshooting` section in the upcoming : `upgrading` page: https://github.com/apache/airflow/pull/19453


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil commented on pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
kaxil commented on pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#issuecomment-965642370


   https://github.com/apache/airflow/pull/19367  / https://github.com/apache/airflow/pull/18120 (WIP) will hopefully fix the underlying issue
   
   But a CLI command to show serialized representation, actually serialize from a dag file or clear -- all would be very handy


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk merged pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
potiuk merged pull request #19471:
URL: https://github.com/apache/airflow/pull/19471


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish edited a comment on pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
dstandish edited a comment on pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#issuecomment-963565510


   @collinmcnulty can you share some context re under what circumstances you find that this needs to be done?  perhaps there is a change we should make to the dag parsing process that makes it so we don't need to jump in there manually


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] collinmcnulty commented on pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
collinmcnulty commented on pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#issuecomment-963577774


   @dstandish  The biggest was part of the 2.2 upgrade where the Param change was supposed to be backwards compatible, but wasn't strictly. I've also seen cases where a Dag Import error didn't show up because there was a valid but outdated, serialized version of the dag. Not sure if the latter case is still present in main. I can check.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish commented on pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
dstandish commented on pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#issuecomment-963612604


   > @dstandish It will reserialize if it successfully parses. But if you break a DAG, then it will keep the old serialized version. See more discussion over longer term fixes in #19367
   
   thanks, will take a look over there.   but if the dag is broken, reserializing wouldn't work anyway would it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil commented on a change in pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
kaxil commented on a change in pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#discussion_r746890230



##########
File path: airflow/cli/cli_parser.py
##########
@@ -977,6 +977,17 @@ class GroupCommand(NamedTuple):
             ARG_SAVE_DAGRUN,
         ),
     ),
+    ActionCommand(
+        name='reserialize',
+        help="Reserialize all DAGs by parsing the DagBag files",
+        description=(
+            "Drop all serialized dags from the metadata DB. This will cause all DAGs to be reserialized "
+            "from the DagBag folder. This can be helpful if your serialized DAGs get out of sync with the "
+            "version of Airflow that you are running."
+        ),
+        func=lazy_load_command('airflow.cli.commands.dag_command.dag_reserialize'),

Review comment:
       (2) As the name is `reserialize` -- it will only reserialize if the scheduler is running. If the scheduler is not running for whatever reason, the users will have a wrong impression that `reserialize` does actually deserialize instead of just "clear"/delete




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#issuecomment-966081839


   > #19367 / #18120 (WIP) will hopefully fix the underlying issue
   > 
   > But a CLI command to show serialized representation, actually serialize from a dag file or clear -- all would be very handy
   
   Agree. That's a good idea to add more tooling here - not only reserializing but outputing serialized dags nicely formatted and coloured using pygments - that woudl be reallly nice. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish commented on pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
dstandish commented on pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#issuecomment-963585735


   > The biggest was part of the 2.2 upgrade where the Param change was supposed to be backwards compatible, but wasn't strictly. I've also seen cases where a Dag Import error didn't show up because there was a valid but outdated, serialized version of the dag. Not sure if the latter case is still present in main. I can check.
   
   interesting i wonder if perhaps it would be good to have it automatically reserialize at the end of an upgrade.  i'm not sure with what frequency this is done currently, but i would have assumed it would be done with each parsing interval like every 5 minutes or something 🤔
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr commented on pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
uranusjr commented on pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#issuecomment-987667614


   Workers don’t need DAG information, only the scheduler does.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] collinmcnulty commented on pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
collinmcnulty commented on pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#issuecomment-963625979


   It won't "work" but it will surface the Import error and will stop trying to schedule a broken dag


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] collinmcnulty commented on pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
collinmcnulty commented on pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#issuecomment-963579876


   Docs are failing because they do not like the word "reserialize".


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#issuecomment-981702261


   The PR is likely OK to be merged with just subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] EricGao888 commented on pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
EricGao888 commented on pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#issuecomment-988457294


   > 
   
   
   
   > > Did I miss any important information?
   > 
   > I think you missed the information that Airlfow 1.10 (not even 1.10.12 but any 1.10 version) reached end of life in June 2021 and will not receive any updates (not even critical fixes) any more. We do not produce any code that is targetted for 1.10 for about half a year already.
   > 
   > See here for example: https://github.com/apache/airflow#version-life-cycle
   > 
   > Please migrate to Airflow 2 as soon as possible
   
   Thanks for the reply. Will migrate to Airflow 2 ASAP.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] EricGao888 commented on pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
EricGao888 commented on pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#issuecomment-988441211


   > Workers don’t need DAG information, only the scheduler does.
   
   Thanks for the reply. Another question is will the python script `from airflow.models.serialized_dag import SerializedDagModel
   from airflow.settings import Session
   session = Session()
   session.query(SerializedDagModel).delete()
   session.commit()` work for airflow 1.10.12?
   ![image](https://user-images.githubusercontent.com/34905992/145137812-9cf47bec-9f39-4bbf-aa8b-c273c70ee3d0.png)
   I found the some info from https://airflow.apache.org/docs/apache-airflow/stable/dag-serialization.html
   It seems in airflow 1.10.12, the scheduler does not use Serialized DAGS. Does it mean the sync py script in https://github.com/apache/airflow/pull/19471 does not work for airflow 1.10.12? I did a quick test and didn’t get the dags synced in airflow 1.10.12. Did I miss any important information?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] EricGao888 commented on pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
EricGao888 commented on pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#issuecomment-987655152


   Our team are using airflow with Celery Executor. For example, we have a cluster of 1 header machine and 2 worker machines. We deploy webserver and scheduler on header and have the workers run on worker machines. If I run `airflow dags reserialize` on header machine, will the two workers get update? Will there be any consistent issues? Should I run `airflow dags reserialize` on both header and workers? Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] collinmcnulty commented on a change in pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
collinmcnulty commented on a change in pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#discussion_r745110416



##########
File path: airflow/cli/cli_parser.py
##########
@@ -977,6 +977,17 @@ class GroupCommand(NamedTuple):
             ARG_SAVE_DAGRUN,
         ),
     ),
+    ActionCommand(
+        name='reserialize',
+        help="Reserialize all DAGs by parsing the DagBag files",
+        description=(
+            "Drop all serialized dags from the metadata DB. This will cause all DAGs to be reserialized "
+            "from the DagBag folder. This can be helpful if your serialized DAGs get out of sync with the "
+            "version of Airflow that you are running."
+        ),
+        func=lazy_load_command('airflow.cli.commands.dag_command.dag_reserialize'),

Review comment:
       1. I thought I did nest it there? That was my intention. Am I misreading something?
   2. I was thinking that it would be better to keep all the dag parsing logs together, but that's not a very strongly held opinion.
   3. Will do.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] collinmcnulty commented on pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
collinmcnulty commented on pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#issuecomment-963601316


   @dstandish It will reserialize if it successfully parses. But if you break a DAG, then it will keep the old serialized version. See more discussion over longer term fixes in #19367 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dstandish commented on pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
dstandish commented on pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#issuecomment-963565510


   @collinmcnulty can you share some context re when you need to do this?  perhaps there is a change we should make to the dag parsing process that makes it so we don't need to jump in there manually


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#issuecomment-978414077


   will you rebase/continue that @collinmcnulty ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] collinmcnulty commented on pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
collinmcnulty commented on pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#issuecomment-981699885


   Yes I have rebased (back from vacation). I was not planning on adding any more features to this PR. Adding the display of the serialized dags with pygments is something I would have to study how to do. I was thinking it might be worthwhile to review this PR as is and open a separate issue to add the additional tooling. If you disagree, I can look at adding this tooling, but it will take me a little longer.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil commented on a change in pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
kaxil commented on a change in pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#discussion_r744991850



##########
File path: airflow/cli/cli_parser.py
##########
@@ -977,6 +977,17 @@ class GroupCommand(NamedTuple):
             ARG_SAVE_DAGRUN,
         ),
     ),
+    ActionCommand(
+        name='reserialize',
+        help="Reserialize all DAGs by parsing the DagBag files",
+        description=(
+            "Drop all serialized dags from the metadata DB. This will cause all DAGs to be reserialized "
+            "from the DagBag folder. This can be helpful if your serialized DAGs get out of sync with the "
+            "version of Airflow that you are running."
+        ),
+        func=lazy_load_command('airflow.cli.commands.dag_command.dag_reserialize'),

Review comment:
       I think this needs to be thought through.
   
   1) It should be nested under `airflow dags reserialize`
   2) For actual `reserialize` we should serialize after deleting too rather than waiting for dag parsing process _I think_
   3) Needs some unit tests




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr commented on a change in pull request #19471: Add cli command for 'airflow dags reserialize`

Posted by GitBox <gi...@apache.org>.
uranusjr commented on a change in pull request #19471:
URL: https://github.com/apache/airflow/pull/19471#discussion_r751042377



##########
File path: airflow/cli/commands/dag_command.py
##########
@@ -441,3 +442,10 @@ def dag_test(args, session=None):
             _display_dot_via_imgcat(dot_graph)
         if show_dagrun:
             print(dot_graph.source)
+
+
+@provide_session
+@cli_utils.action_logging
+def dag_reserialize(args, session=None):
+    session.query(SerializedDagModel).delete()
+    session.commit()

Review comment:
       ```suggestion
   @provide_session
   @cli_utils.action_logging
   def dag_reserialize(args, session=None):
       session.query(SerializedDagModel).delete(synchronize_session=False)
   ```
   
   Skip synchronisation to make the operation slightly faster. The session is going to end immediately after this so there’s no point to keep things in sync.
   
   The `provide_session` decorator calls `commit()` automatically, we don’t need to do that again.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org