Posted to commits@airflow.apache.org by "potiuk (via GitHub)" <gi...@apache.org> on 2023/09/12 23:44:02 UTC

[GitHub] [airflow] potiuk commented on a diff in pull request #33355: Add Python Virtualenv Operator Caching

potiuk commented on code in PR #33355:
URL: https://github.com/apache/airflow/pull/33355#discussion_r1323758553


##########
airflow/operators/python.py:
##########
@@ -606,7 +614,60 @@ def _prepare_venv(self, venv_path: Path) -> None:
             index_urls=self.index_urls,
         )
 
+    def _calculate_cache_hash(self) -> str:

Review Comment:
   > I wonder if this is that worthwhile… Instead of relying on the directory name, we could just dump the values into a JSON file inside the virtual environment directory, and read it out later for cache validation. That could help debugging if caching doesn’t work as expected too.
   
   I agree with @jens-scheffler-bosch's "write once, use many" statement.
   
   Validation is not nearly enough. What will you do if validation fails? Are we going to delete the venv? Recreate it? What happens if we are running in a Celery worker (multiple tasks) and each task is using "almost" the same environment? For example, someone submits a change to the DAG that bumps a dependency version and the new task is scheduled, but the same Celery worker is still running an old version of the same task from the previous version of the DAG.
   
   Should we delete the venv while the other task is running and recreate it? Should we fail the task? Or maybe wait until the other task completes (how?)
   
   I really think venvs should be "write once, use many" and immutable after they are created. This is precisely what @jens-scheffler-bosch addressed well: even if two tasks attempt to create an identical venv, the file lock ensures that only one of them creates it and the other uses it. But once a venv is created, you should not modify it - that is a different story, because you don't want to prevent parallel "read/use", you want to prevent parallel "create". You actually want multiple tasks to use the same venv in parallel once it is created - that's almost the whole point of it.
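To illustrate the "prevent parallel create, allow parallel use" idea, here is a minimal sketch of lock-guarded one-time creation. The names `get_or_create_venv` and `create_fn` are hypothetical and this is not Airflow's actual implementation - just the general pattern:

```python
import fcntl
from pathlib import Path


def get_or_create_venv(venv_path: Path, create_fn) -> Path:
    """Create the venv exactly once; concurrent creators serialize on a lock.

    Hypothetical sketch: `create_fn` stands in for whatever actually
    builds the environment (e.g. running `python -m venv` + pip install).
    """
    venv_path.parent.mkdir(parents=True, exist_ok=True)
    lock_file = venv_path.with_suffix(".lock")
    with open(lock_file, "w") as lf:
        # Exclusive lock: only one process runs the creation step.
        fcntl.flock(lf, fcntl.LOCK_EX)
        try:
            if not venv_path.exists():
                create_fn(venv_path)
        finally:
            fcntl.flock(lf, fcntl.LOCK_UN)
    # After creation, any number of tasks can use the venv in parallel.
    return venv_path
```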
   
   IMHO validation does not cut it, because there is no good action you can take once validation fails. We MUST have different venvs if their definitions differ.
   
   Interestingly, this is the very same thing `pre-commit` does. Pre-commit also calculates a hash of the venv based on the pre-commit definition. If you have two or more tasks with the same dependencies and Python version, they will reuse the existing venv - they are even reused between multiple branches you might work on (see ~/.cache/pre-commit) - and the venv gets fully recreated under a different hash when anything changes.
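A minimal sketch of what such a cache hash could look like - derive a stable key from everything that defines the venv, so any change in the definition yields a different directory name. The function name and inputs here are assumptions for illustration, not the signature from the PR:

```python
import hashlib
import json


def calculate_cache_hash(requirements: list[str], python_version: str) -> str:
    """Hypothetical sketch: stable cache key from the venv definition.

    Sorting the requirements and serializing with sort_keys makes the
    key order-independent, so logically identical definitions collide
    on the same venv while any real change produces a new one.
    """
    spec = json.dumps(
        {"requirements": sorted(requirements), "python": python_version},
        sort_keys=True,
    )
    return hashlib.sha256(spec.encode()).hexdigest()[:16]
```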
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org