You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/03/25 22:33:43 UTC
[GitHub] [airflow] mik-laj opened a new pull request #7878: Load all jobs
models at once
mik-laj opened a new pull request #7878: Load all jobs models at once
URL: https://github.com/apache/airflow/pull/7878
We use model polymorphism in `airflow.jobs` classes, so we must load all classes each time.
https://github.com/apache/airflow/blob/master/airflow/jobs/base_job.py#L65-L73
Otherwise, SQLAlchemy will not be able to execute the queries correctly.
So, in order not to load a lot of code, we should divide these classes into smaller ones. I think it won't be a problem to separate the loops (BackfillLoop, SchedulerLoop, LocalTaskJob) from the entity model. However, I would like to do it later. For now I am proposing a temporary solution. Loading all classes every time.
This problem does not appear in CI tests, because then we load all classes. It also does not occur in real situations, because the current coupling of all classes causes that these models are loaded anyway, but it blocks refactoring.This makes it difficult to eliminate coupling between classes. In particular, it prevents me from completing this change.
https://github.com/apache/airflow/pull/7806
The problem sometimes occurs when running individual tests.
I'm using the following code to check the impact of this change on Airflow.
```python
import sys
from contextlib import contextmanager
class TraceImportResult:
def __init__(self):
self.new_modules = None
def print_new_modules(self):
for i, new_module in enumerate(sorted(trace_resuslt.new_modules)):
print(f"{i}. {new_module}")
@contextmanager
def trace_imports():
before_modules = set(sys.modules.keys())
result = TraceImportResult()
try:
yield result
finally:
after_modules = set(sys.modules.keys())
before_count = len(before_modules)
after_count = len(after_modules)
diff_count = after_count - before_count
result.new_modules = after_modules - before_modules
print(f"Modules loaded - before: {before_count}")
print(f"Modules loaded - after: {after_count}")
print(f"Modules loaded - diff: {diff_count}")
if __name__ == "__main__":
# Example:
with trace_imports() as trace_resuslt:
import airflow # noqa # pylint: disable=unused-import
trace_resuslt.print_new_modules()
with trace_imports() as trace_resuslt:
import airflow.jobs # noqa # pylint: disable=unused-import
trace_resuslt.print_new_modules()
```
We learn from this that adding these imports will cause that an additional 32 modules will be loaded each time the code `import airflow.jobs` is executed.
```
0. airflow.executors
1. airflow.executors.base_executor
2. airflow.executors.executor_loader
3. airflow.executors.local_executor
4. airflow.executors.sequential_executor
5. airflow.jobs
6. airflow.jobs.backfill_job
7. airflow.jobs.base_job
8. airflow.jobs.local_task_job
9. airflow.jobs.scheduler_job
10. airflow.task
11. airflow.task.task_runner
12. airflow.task.task_runner.base_task_runner
13. airflow.task.task_runner.standard_task_runner
14. airflow.utils.asciiart
15. airflow.utils.configuration
16. airflow.utils.dag_processing
17. airflow.utils.process_utils
18. glob
19. lockfile
20. lockfile.linklockfile
21. lockfile.pidlockfile
22. multiprocessing.managers
23. multiprocessing.pool
24. psutil
25. psutil._common
26. psutil._compat
27. psutil._psosx
28. psutil._psposix
29. psutil._psutil_osx
30. psutil._psutil_posix
31. setproctitle
```
Without this change, loading individual modules will load the following amount of new modules.
``import airflow.jobs.base_job `` - 5
``import airflow.jobs.scheduler_job `` - 31
``import airflow.jobs.backfill_job`` - 11
``mport airflow.jobs.local_task_job`` - 24
---
Issue link: WILL BE INSERTED BY [boring-cyborg](https://github.com/kaxil/boring-cyborg)
Make sure to mark the boxes below before creating PR: [x]
- [ ] Description above provides context of the change
- [ ] Unit tests coverage for changes (not needed for documentation changes)
- [ ] Commits follow "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)"
- [ ] Relevant documentation is updated including usage instructions.
- [ ] I will engage committers as explained in [Contribution Workflow Example](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#contribution-workflow-example).
---
In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
Read the [Pull Request Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines) for more information.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
[GitHub] [airflow] mik-laj commented on a change in pull request #7878: Load
all job models at once
Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #7878: Load all job models at once
URL: https://github.com/apache/airflow/pull/7878#discussion_r398396690
##########
File path: airflow/jobs/__init__.py
##########
@@ -16,3 +16,7 @@
# specific language governing permissions and limitations
# under the License.
#
+import airflow.jobs.backfill_job # noqa
+import airflow.jobs.base_job # noqa
+import airflow.jobs.local_task_job # noqa
+import airflow.jobs.scheduler_job # noqa
Review comment:
What benefit do you see from moving this to a .flake8 file?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
[GitHub] [airflow] turbaszek commented on a change in pull request #7878:
Load all job models at once
Posted by GitBox <gi...@apache.org>.
turbaszek commented on a change in pull request #7878: Load all job models at once
URL: https://github.com/apache/airflow/pull/7878#discussion_r398369651
##########
File path: airflow/jobs/__init__.py
##########
@@ -16,3 +16,7 @@
# specific language governing permissions and limitations
# under the License.
#
+import airflow.jobs.backfill_job # noqa
+import airflow.jobs.base_job # noqa
+import airflow.jobs.local_task_job # noqa
+import airflow.jobs.scheduler_job # noqa
Review comment:
Instead of `noqa` we can add this file to per-file-ignores in .flake8 . WDYT?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
[GitHub] [airflow] mik-laj commented on a change in pull request #7878: Load
all job models at once
Posted by GitBox <gi...@apache.org>.
mik-laj commented on a change in pull request #7878: Load all job models at once
URL: https://github.com/apache/airflow/pull/7878#discussion_r398395471
##########
File path: airflow/jobs/__init__.py
##########
@@ -16,3 +16,7 @@
# specific language governing permissions and limitations
# under the License.
#
+import airflow.jobs.backfill_job # noqa
+import airflow.jobs.base_job # noqa
+import airflow.jobs.local_task_job # noqa
+import airflow.jobs.scheduler_job # noqa
Review comment:
I prefer to keep ignore close to where the problem is. Nobody looks into the configuration file. Ignoring files in the configuration file is helpful if you have generated or similar files. Then you do not want to modify the file because ... the file is generated, so its modification is difficult. WDYT?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
[GitHub] [airflow] potiuk commented on a change in pull request #7878: Load
all job models at once
Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #7878: Load all job models at once
URL: https://github.com/apache/airflow/pull/7878#discussion_r398478361
##########
File path: airflow/jobs/__init__.py
##########
@@ -16,3 +16,7 @@
# specific language governing permissions and limitations
# under the License.
#
+import airflow.jobs.backfill_job # noqa
+import airflow.jobs.base_job # noqa
+import airflow.jobs.local_task_job # noqa
+import airflow.jobs.scheduler_job # noqa
Review comment:
It's ok with noqa :)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
[GitHub] [airflow] mik-laj merged pull request #7878: Load all job models at
once
Posted by GitBox <gi...@apache.org>.
mik-laj merged pull request #7878: Load all job models at once
URL: https://github.com/apache/airflow/pull/7878
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
[GitHub] [airflow] turbaszek commented on a change in pull request #7878:
Load all job models at once
Posted by GitBox <gi...@apache.org>.
turbaszek commented on a change in pull request #7878: Load all job models at once
URL: https://github.com/apache/airflow/pull/7878#discussion_r398444364
##########
File path: airflow/jobs/__init__.py
##########
@@ -16,3 +16,7 @@
# specific language governing permissions and limitations
# under the License.
#
+import airflow.jobs.backfill_job # noqa
+import airflow.jobs.base_job # noqa
+import airflow.jobs.local_task_job # noqa
+import airflow.jobs.scheduler_job # noqa
Review comment:
> What benefit do you see from moving this to a .flake8 file?
Less code-obfuscating comments. We've already has files in this file. But no strong opinion on that.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services