You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/09/05 23:21:37 UTC

[GitHub] [airflow] mik-laj opened a new pull request, #26170: Add subdir parameter to dags reserialize command

mik-laj opened a new pull request, #26170:
URL: https://github.com/apache/airflow/pull/26170

   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of an existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   Hi. 
   Today I looked at this command and noticed a few problems:
   - The `DagBag.collect_dags` method is called two times. The first time through `DagBag.__init__`, and the second time explicitly in a command code.  This is not needed and causes performance degradation.
   -  `safe_mode` has been overridden to `false` in the command code for no reason, which means more files are processed than needed.
   - Parameter `subdir` is not supported, which is inconsistent with the rest of the commands that create the DagBag instance. Whenever a DagBag is created by any other commands, it is possible to set the `dag_folder` using the `subdir` parameter including `airflow dag list`, `airflow scheduler`, and others.
   
   I had a problem with choosing a PR title that would look good in a changelog, because we have 3 very related problems here, but I think adding a new CLI parameter is the most user-facing, The rest of the problems would not be (probably) noticed by anyone without looking at the code.
   
   CC: @collinmcnulty, @potiuk, @uranusjr, @sfc-gh-mkmak 
   
   Best regards,
   Kamil BreguĊ‚a
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code changes, an Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in a newsfragment file, named `{pr_number}.significant.rst` or `{issue_number}.significant.rst`, in [newsfragments](https://github.com/apache/airflow/tree/main/newsfragments).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on a diff in pull request #26170: Add subdir parameter to dags reserialize command

Posted by GitBox <gi...@apache.org>.
uranusjr commented on code in PR #26170:
URL: https://github.com/apache/airflow/pull/26170#discussion_r963443430


##########
airflow/cli/commands/dag_command.py:
##########
@@ -503,6 +503,5 @@ def dag_reserialize(args, session: Session = NEW_SESSION):
     session.query(SerializedDagModel).delete(synchronize_session=False)
 
     if not args.clear_only:
-        dagbag = DagBag()
-        dagbag.collect_dags(only_if_updated=False, safe_mode=False)

Review Comment:
   But the arguments here is different from the call in `__init__`.



##########
airflow/cli/commands/dag_command.py:
##########
@@ -503,6 +503,5 @@ def dag_reserialize(args, session: Session = NEW_SESSION):
     session.query(SerializedDagModel).delete(synchronize_session=False)
 
     if not args.clear_only:
-        dagbag = DagBag()
-        dagbag.collect_dags(only_if_updated=False, safe_mode=False)

Review Comment:
   But the arguments here are different from the call in `__init__`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] kaxil commented on a diff in pull request #26170: Add subdir parameter to dags reserialize command

Posted by GitBox <gi...@apache.org>.
kaxil commented on code in PR #26170:
URL: https://github.com/apache/airflow/pull/26170#discussion_r963469974


##########
airflow/cli/commands/dag_command.py:
##########
@@ -503,6 +503,5 @@ def dag_reserialize(args, session: Session = NEW_SESSION):
     session.query(SerializedDagModel).delete(synchronize_session=False)
 
     if not args.clear_only:
-        dagbag = DagBag()
-        dagbag.collect_dags(only_if_updated=False, safe_mode=False)

Review Comment:
   Since `DagBag().file_last_changed` will be empty, I _think_ it won't make a difference if `only_if_updated` is set to `True` or `False`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk merged pull request #26170: Add subdir parameter to dags reserialize command

Posted by GitBox <gi...@apache.org>.
potiuk merged PR #26170:
URL: https://github.com/apache/airflow/pull/26170


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on a diff in pull request #26170: Add subdir parameter to dags reserialize command

Posted by GitBox <gi...@apache.org>.
uranusjr commented on code in PR #26170:
URL: https://github.com/apache/airflow/pull/26170#discussion_r963334792


##########
airflow/cli/commands/dag_command.py:
##########
@@ -503,6 +503,5 @@ def dag_reserialize(args, session: Session = NEW_SESSION):
     session.query(SerializedDagModel).delete(synchronize_session=False)
 
     if not args.clear_only:
-        dagbag = DagBag()
-        dagbag.collect_dags(only_if_updated=False, safe_mode=False)

Review Comment:
   Does this line need to be kept?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] mik-laj commented on a diff in pull request #26170: Add subdir parameter to dags reserialize command

Posted by GitBox <gi...@apache.org>.
mik-laj commented on code in PR #26170:
URL: https://github.com/apache/airflow/pull/26170#discussion_r963430842


##########
airflow/cli/commands/dag_command.py:
##########
@@ -503,6 +503,5 @@ def dag_reserialize(args, session: Session = NEW_SESSION):
     session.query(SerializedDagModel).delete(synchronize_session=False)
 
     if not args.clear_only:
-        dagbag = DagBag()
-        dagbag.collect_dags(only_if_updated=False, safe_mode=False)

Review Comment:
   No, and I described in the PR description why it is not needed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] kaxil commented on pull request #26170: Add subdir parameter to dags reserialize command

Posted by GitBox <gi...@apache.org>.
kaxil commented on PR #26170:
URL: https://github.com/apache/airflow/pull/26170#issuecomment-1237889714

   Test failure @mik-laj :
   
   ```
   ___________________ TestCliTasks.test_run_get_serialized_dag ___________________
     
     self = <tests.cli.commands.test_task_command.TestCliTasks object at 0x7f2bcf86d890>
     mock_local_job = <MagicMock name='LocalTaskJob' id='139826142040912'>
     mock_get_dag_by_deserialization = <MagicMock name='get_dag_by_deserialization' id='139826142042192'>
     
         @mock.patch("airflow.cli.commands.task_command.get_dag_by_deserialization")
         @mock.patch("airflow.cli.commands.task_command.LocalTaskJob")
         def test_run_get_serialized_dag(self, mock_local_job, mock_get_dag_by_deserialization):
             """
             Test using serialized dag for local task_run
             """
             task_id = self.dag.task_ids[0]
             args = [
                 'tasks',
                 'run',
                 '--ignore-all-dependencies',
                 '--local',
                 self.dag_id,
                 task_id,
                 self.run_id,
             ]
     >       mock_get_dag_by_deserialization.return_value = SerializedDagModel.get(self.dag_id).dag
     E       AttributeError: 'NoneType' object has no attribute 'dag'
     
     tests/cli/commands/test_task_command.py:152: AttributeError
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] mik-laj commented on a diff in pull request #26170: Add subdir parameter to dags reserialize command

Posted by GitBox <gi...@apache.org>.
mik-laj commented on code in PR #26170:
URL: https://github.com/apache/airflow/pull/26170#discussion_r963479085


##########
airflow/cli/commands/dag_command.py:
##########
@@ -503,6 +503,5 @@ def dag_reserialize(args, session: Session = NEW_SESSION):
     session.query(SerializedDagModel).delete(synchronize_session=False)
 
     if not args.clear_only:
-        dagbag = DagBag()
-        dagbag.collect_dags(only_if_updated=False, safe_mode=False)

Review Comment:
   only_if_update = False means the dags will be read again if the file has changed.  In our case, we don't need to load the same files twice, so this parameter doesn't apply here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org