You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/03/25 22:33:43 UTC

[GitHub] [airflow] mik-laj opened a new pull request #7878: Load all jobs models at once

mik-laj opened a new pull request #7878: Load all jobs models at once
URL: https://github.com/apache/airflow/pull/7878
 
 
   We use model polymorphism in `airflow.jobs` classes, so we must load all classes each time. 
   https://github.com/apache/airflow/blob/master/airflow/jobs/base_job.py#L65-L73
   Otherwise, SQLAlchemy will not be able to execute the queries correctly.
   So, in order not to load a lot of code, we should divide these classes into smaller ones. I think it won't be a problem to separate the loops (BackfillLoop, SchedulerLoop, LocalTaskJob) from the entity model. However, I would like to do it later. For now I am proposing a temporary solution. Loading all classes every time.
   
   This problem does not appear in CI tests, because then we load all classes. It also does not occur in real situations, because the current coupling of all classes causes that these models are loaded anyway, but it blocks refactoring.This makes it difficult to eliminate coupling between classes. In particular, it prevents me from completing this change.
   https://github.com/apache/airflow/pull/7806
   The problem sometimes occurs when running individual tests.
   
   I'm using the following code to check the impact of this change on Airflow.
   ```python
   import sys
   from contextlib import contextmanager
   
   
   class TraceImportResult:
       def __init__(self):
           self.new_modules = None
   
       def print_new_modules(self):
           for i, new_module in enumerate(sorted(trace_resuslt.new_modules)):
               print(f"{i}. {new_module}")
   
   
   @contextmanager
   def trace_imports():
       before_modules = set(sys.modules.keys())
       result = TraceImportResult()
   
       try:
           yield result
       finally:
           after_modules = set(sys.modules.keys())
           before_count = len(before_modules)
           after_count = len(after_modules)
           diff_count = after_count - before_count
           result.new_modules = after_modules - before_modules
           print(f"Modules loaded - before: {before_count}")
           print(f"Modules loaded - after:  {after_count}")
           print(f"Modules loaded - diff:   {diff_count}")
   
   
   if __name__ == "__main__":
       # Example:
   
       with trace_imports() as trace_resuslt:
           import airflow  # noqa # pylint: disable=unused-import
   
       trace_resuslt.print_new_modules()
       with trace_imports() as trace_resuslt:
           import airflow.jobs  # noqa # pylint: disable=unused-import
   
       trace_resuslt.print_new_modules()
   ```
   
   We learn from this that adding these imports will cause that an additional 32 modules will be loaded each time the code `import airflow.jobs` is executed.
   ```
   0. airflow.executors
   1. airflow.executors.base_executor
   2. airflow.executors.executor_loader
   3. airflow.executors.local_executor
   4. airflow.executors.sequential_executor
   5. airflow.jobs
   6. airflow.jobs.backfill_job
   7. airflow.jobs.base_job
   8. airflow.jobs.local_task_job
   9. airflow.jobs.scheduler_job
   10. airflow.task
   11. airflow.task.task_runner
   12. airflow.task.task_runner.base_task_runner
   13. airflow.task.task_runner.standard_task_runner
   14. airflow.utils.asciiart
   15. airflow.utils.configuration
   16. airflow.utils.dag_processing
   17. airflow.utils.process_utils
   18. glob
   19. lockfile
   20. lockfile.linklockfile
   21. lockfile.pidlockfile
   22. multiprocessing.managers
   23. multiprocessing.pool
   24. psutil
   25. psutil._common
   26. psutil._compat
   27. psutil._psosx
   28. psutil._psposix
   29. psutil._psutil_osx
   30. psutil._psutil_posix
   31. setproctitle
   ```
   
   Without this change, loading individual modules will load the following amount of new modules.
   ``import airflow.jobs.base_job `` - 5
   ``import airflow.jobs.scheduler_job `` - 31
   ``import airflow.jobs.backfill_job`` - 11
   ``mport airflow.jobs.local_task_job`` - 24
   ---
   Issue link: WILL BE INSERTED BY [boring-cyborg](https://github.com/kaxil/boring-cyborg)
   
   Make sure to mark the boxes below before creating PR: [x]
   
   - [ ] Description above provides context of the change
   - [ ] Unit tests coverage for changes (not needed for documentation changes)
   - [ ] Commits follow "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)"
   - [ ] Relevant documentation is updated including usage instructions.
   - [ ] I will engage committers as explained in [Contribution Workflow Example](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#contribution-workflow-example).
   
   ---
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   Read the [Pull Request Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines) for more information.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] mik-laj commented on a change in pull request #7878: Load all job models at once

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #7878: Load all job models at once
URL: https://github.com/apache/airflow/pull/7878#discussion_r398396690
 
 

 ##########
 File path: airflow/jobs/__init__.py
 ##########
 @@ -16,3 +16,7 @@
 # specific language governing permissions and limitations
 # under the License.
 #
+import airflow.jobs.backfill_job  # noqa
+import airflow.jobs.base_job  # noqa
+import airflow.jobs.local_task_job  # noqa
+import airflow.jobs.scheduler_job  # noqa
 
 Review comment:
   What benefit do you see from moving this to a .flake8 file?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] turbaszek commented on a change in pull request #7878: Load all job models at once

Posted by GitBox <gi...@apache.org>.
turbaszek commented on a change in pull request #7878: Load all job models at once
URL: https://github.com/apache/airflow/pull/7878#discussion_r398369651
 
 

 ##########
 File path: airflow/jobs/__init__.py
 ##########
 @@ -16,3 +16,7 @@
 # specific language governing permissions and limitations
 # under the License.
 #
+import airflow.jobs.backfill_job  # noqa
+import airflow.jobs.base_job  # noqa
+import airflow.jobs.local_task_job  # noqa
+import airflow.jobs.scheduler_job  # noqa
 
 Review comment:
   Instead of `noqa` we can add this file to per-file-ignores in .flake8 . WDYT?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] mik-laj commented on a change in pull request #7878: Load all job models at once

Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #7878: Load all job models at once
URL: https://github.com/apache/airflow/pull/7878#discussion_r398395471
 
 

 ##########
 File path: airflow/jobs/__init__.py
 ##########
 @@ -16,3 +16,7 @@
 # specific language governing permissions and limitations
 # under the License.
 #
+import airflow.jobs.backfill_job  # noqa
+import airflow.jobs.base_job  # noqa
+import airflow.jobs.local_task_job  # noqa
+import airflow.jobs.scheduler_job  # noqa
 
 Review comment:
   I prefer to keep ignore close to where the problem is. Nobody looks into the configuration file. Ignoring files in the configuration file is helpful if you have generated or similar files. Then you do not want to modify the file because ... the file is generated, so its modification is difficult. WDYT?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] potiuk commented on a change in pull request #7878: Load all job models at once

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #7878: Load all job models at once
URL: https://github.com/apache/airflow/pull/7878#discussion_r398478361
 
 

 ##########
 File path: airflow/jobs/__init__.py
 ##########
 @@ -16,3 +16,7 @@
 # specific language governing permissions and limitations
 # under the License.
 #
+import airflow.jobs.backfill_job  # noqa
+import airflow.jobs.base_job  # noqa
+import airflow.jobs.local_task_job  # noqa
+import airflow.jobs.scheduler_job  # noqa
 
 Review comment:
   It's ok with noqa :)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] mik-laj merged pull request #7878: Load all job models at once

Posted by GitBox <gi...@apache.org>.
mik-laj merged pull request #7878: Load all job models at once
URL: https://github.com/apache/airflow/pull/7878
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [airflow] turbaszek commented on a change in pull request #7878: Load all job models at once

Posted by GitBox <gi...@apache.org>.
turbaszek commented on a change in pull request #7878: Load all job models at once
URL: https://github.com/apache/airflow/pull/7878#discussion_r398444364
 
 

 ##########
 File path: airflow/jobs/__init__.py
 ##########
 @@ -16,3 +16,7 @@
 # specific language governing permissions and limitations
 # under the License.
 #
+import airflow.jobs.backfill_job  # noqa
+import airflow.jobs.base_job  # noqa
+import airflow.jobs.local_task_job  # noqa
+import airflow.jobs.scheduler_job  # noqa
 
 Review comment:
   > What benefit do you see from moving this to a .flake8 file?
   
   Less code-obfuscating comments. We've already has files in this file. But no strong opinion on that.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services