You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/08/19 15:09:29 UTC

[GitHub] [airflow] barrywhart opened a new pull request, #25829: Fix templating for KubernetesPodOperator env_vars field

barrywhart opened a new pull request, #25829:
URL: https://github.com/apache/airflow/pull/25829

   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of an existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   related: #25827 (Fixes the main issue, but as mentioned in the issue, I'd like to also add templating support for `resources` and `tolerations`, possibly in this PR.)
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code changes, an Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in a newsfragment file, named `{pr_number}.significant.rst` or `{issue_number}.significant.rst`, in [newsfragments](https://github.com/apache/airflow/tree/main/newsfragments).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] barrywhart commented on a diff in pull request #25829: Fix templating for KubernetesPodOperator env_vars field

Posted by GitBox <gi...@apache.org>.
barrywhart commented on code in PR #25829:
URL: https://github.com/apache/airflow/pull/25829#discussion_r954028943


##########
airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py:
##########
@@ -409,6 +425,12 @@ def extract_xcom(self, pod: k8s.V1Pod):
             self.log.info("xcom result: \n%s", result)
             return json.loads(result)
 
+    def pre_execute(self, context):
+        if isinstance(self.env_vars, str):
+            self.env_vars = convert_env_vars(ast.literal_eval(self.env_vars.strip()))

Review Comment:
   We use this for data science model training and prediction. Re: XcomArg -- the environment variables are not coming from an upstream task. They're arbitrary runtime settings the data scientist wants to pass to the job. (Unless I'm misunderstanding how XcomArg works.)
   
   The number of variables is not known at DAG definition time, only at DAG submission time. See the automated test on the PR for a fairly realistic example using `dag_run.conf`.
   
   Re: the security concern, is it possible you are thinking of `eval`? My understanding is that `literal_eval` is safe. There are many uses of it elsewhere in Airflow. More info here: https://stackoverflow.com/questions/4710247/python-3-are-there-any-known-security-holes-in-ast-literal-evalnode-or-string



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] barrywhart commented on pull request #25829: Fix templating for KubernetesPodOperator env_vars field

Posted by GitBox <gi...@apache.org>.
barrywhart commented on PR #25829:
URL: https://github.com/apache/airflow/pull/25829#issuecomment-1238089254

   Can I please get a new review on this? The review thus far was incomplete and based on a misunderstanding about the safety of `ast.literal_eval()`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on a diff in pull request #25829: Fix templating for KubernetesPodOperator env_vars field

Posted by GitBox <gi...@apache.org>.
uranusjr commented on code in PR #25829:
URL: https://github.com/apache/airflow/pull/25829#discussion_r952109246


##########
airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py:
##########
@@ -409,6 +425,12 @@ def extract_xcom(self, pod: k8s.V1Pod):
             self.log.info("xcom result: \n%s", result)
             return json.loads(result)
 
+    def pre_execute(self, context):
+        if isinstance(self.env_vars, str):
+            self.env_vars = convert_env_vars(ast.literal_eval(self.env_vars.strip()))

Review Comment:
   What is the expected format of `self.env_vars` here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] barrywhart commented on a diff in pull request #25829: Fix templating for KubernetesPodOperator env_vars field

Posted by GitBox <gi...@apache.org>.
barrywhart commented on code in PR #25829:
URL: https://github.com/apache/airflow/pull/25829#discussion_r952640723


##########
airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py:
##########
@@ -409,6 +425,12 @@ def extract_xcom(self, pod: k8s.V1Pod):
             self.log.info("xcom result: \n%s", result)
             return json.loads(result)
 
+    def pre_execute(self, context):
+        if isinstance(self.env_vars, str):
+            self.env_vars = convert_env_vars(ast.literal_eval(self.env_vars.strip()))

Review Comment:
   At this point, assuming Jinja templating was used, it's a string rendered by Jinja. In order for this to work correctly with `convert_env_vars()`, it needs to be a Python dictionary literal. (While `convert_env_vars()` supports `list`, apparently that requires the list items to _already_ be instances of `V1EnvVar`.) If supporting `list` output from Jinja seems useful, I think that would be a straightforward addition to this PR.
   
   Example value:
   ```
   "{'foo1': 'bar1', 'foo2': 'bar2', 'run_id': 'manual__2016-01-01T01:00:00+01:00'}"
   ```
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] barrywhart closed pull request #25829: Fix templating for KubernetesPodOperator env_vars field

Posted by GitBox <gi...@apache.org>.
barrywhart closed pull request #25829: Fix templating for KubernetesPodOperator env_vars field
URL: https://github.com/apache/airflow/pull/25829


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25829: Fix templating for KubernetesPodOperator env_vars field

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25829:
URL: https://github.com/apache/airflow/pull/25829#issuecomment-1221150032

   I do not think this is a good solution (see comments in https://github.com/apache/airflow/discussions/25841) . I believe (@jedcunningham and @dimberman) - using k8s objects for env_vars only is a deliberate decision and you can still template names and values in it (templating proceses such objects recursively). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] barrywhart commented on a diff in pull request #25829: Fix templating for KubernetesPodOperator env_vars field

Posted by GitBox <gi...@apache.org>.
barrywhart commented on code in PR #25829:
URL: https://github.com/apache/airflow/pull/25829#discussion_r954085725


##########
airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py:
##########
@@ -409,6 +425,12 @@ def extract_xcom(self, pod: k8s.V1Pod):
             self.log.info("xcom result: \n%s", result)
             return json.loads(result)
 
+    def pre_execute(self, context):
+        if isinstance(self.env_vars, str):
+            self.env_vars = convert_env_vars(ast.literal_eval(self.env_vars.strip()))

Review Comment:
   A bit more about our use case: I'm an MLOps engineer. We provide a platform to make data science development and deployment easy for data scientists. Data scientists don't write DAGs or other Airflow code. Instead, every PR build for every data science repo runs a single, standard Airflow DAG that is highly flexible. To define environment variables required by the Kubernetes pod, data scientists update a simple list of key-value pairs in an `.env` file. The CI/CD job passes these variables as part of the `dag_run.conf` JSON structure.
   
   This is the architecture of the MLOps experiment pipeline:
   ![image](https://user-images.githubusercontent.com/1678585/186484507-23368f5f-3516-4d9d-8baf-60c226c623d9.png)
   
   Here are a couple of typical `.env` files from two different repos:
   
   1. Simple project
   ```
   SCRIPT=train.py
   GOOGLE_APPLICATION_CREDENTIALS="/root/.config/gcloud/application_default_credentials.json"
   ```
   
   2. Complex project
   ```
   PREDICTION_DATE=2022-05-18
   SCRIPT=evaluate.py
   MODEL_NAME=multiembedder
   
   RECOMMENDER_NAME=multiembedder_sku_copurchase
   RUN_ID=2022.08.11.10.57.07-beta
   #MULTIEMBEDDER_ID=large-model
   
   COPURCHASE_CANDIDATES_BY_CAT_ID=2022.08.02.13.40.13-beta
   COVIEW_CANDIDATES_BY_CAT_ID=2022.08.02.13.53.21-beta
   #RUN_ID=temp-1234
   #SCRIPT=monitor.py
   #STRATEGY=sku__copurchase
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] barrywhart commented on a diff in pull request #25829: Fix templating for KubernetesPodOperator env_vars field

Posted by GitBox <gi...@apache.org>.
barrywhart commented on code in PR #25829:
URL: https://github.com/apache/airflow/pull/25829#discussion_r950907137


##########
kubernetes_tests/test_kubernetes_pod_operator.py:
##########
@@ -702,6 +704,59 @@ def test_env_vars(self):
         ]
         assert self.expected_pod == actual_pod
 
+    @staticmethod
+    def propagate_run_id(conf, run_id):
+        """Jinja filter. Test DAG uses this via user_defined_filters."""
+        if run_id:
+            conf = deepcopy(conf)
+            # In a real DAG, this value might be provided by XCOM output of an
+            # upstream task.
+            conf["run_id"] = run_id
+        return conf
+
+    def test_env_vars_are_templatized(self):
+        # WHEN
+        task_id = "task" + self.get_current_task_name()
+        conf = {
+            task_id: {
+                # These become environment variables passed to the pod. There
+                # could be *any number* of variables.
+                "foo1": "bar1",
+                "foo2": "bar2",
+            },
+        }
+
+        # Templated environment variables from two sources:
+        # - dag_run.conf: Arbitrary number of variables
+        # - Jinja filter propagate_run_id: Could add an arbitrary number of
+        #   additional variables. In this case, it's 0 or 1 variables.
+        env_vars = f'{{{{ dag_run.conf["{task_id}"] | propagate_run_id(run_id) }}}}'

Review Comment:
   This is very similar to our real-world MLOps use case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on a diff in pull request #25829: Fix templating for KubernetesPodOperator env_vars field

Posted by GitBox <gi...@apache.org>.
uranusjr commented on code in PR #25829:
URL: https://github.com/apache/airflow/pull/25829#discussion_r953454171


##########
airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py:
##########
@@ -409,6 +425,12 @@ def extract_xcom(self, pod: k8s.V1Pod):
             self.log.info("xcom result: \n%s", result)
             return json.loads(result)
 
+    def pre_execute(self, context):
+        if isinstance(self.env_vars, str):
+            self.env_vars = convert_env_vars(ast.literal_eval(self.env_vars.strip()))

Review Comment:
   Can you provide a concrete example this is needed? I believe a more modern solution for this problem is to use XComArg. Using `literal_eval` like this is a security hazard and must be avoided.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] barrywhart commented on a diff in pull request #25829: Fix templating for KubernetesPodOperator env_vars field

Posted by GitBox <gi...@apache.org>.
barrywhart commented on code in PR #25829:
URL: https://github.com/apache/airflow/pull/25829#discussion_r954085725


##########
airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py:
##########
@@ -409,6 +425,12 @@ def extract_xcom(self, pod: k8s.V1Pod):
             self.log.info("xcom result: \n%s", result)
             return json.loads(result)
 
+    def pre_execute(self, context):
+        if isinstance(self.env_vars, str):
+            self.env_vars = convert_env_vars(ast.literal_eval(self.env_vars.strip()))

Review Comment:
   A bit more about our use case: I'm an MLOps engineer. We provide a platform to make data science development and deployment easy for data scientists. Data scientists don't write DAGs or other Airflow code. Instead, every PR build for every data science repo runs a single, standard Airflow DAG that is highly flexible. We call it the MLOps experiment pipeline. To define environment variables required by Kubernetes pod(s) in this workflow, data scientists update a simple list of key-value pairs in an `.env` file. The CI/CD job passes these variables as part of the `dag_run.conf` JSON structure.
   
   This is the architecture of the MLOps experiment pipeline:
   ![image](https://user-images.githubusercontent.com/1678585/186484507-23368f5f-3516-4d9d-8baf-60c226c623d9.png)
   
   Here are a couple of typical `.env` files from two different repos:
   
   1. Simple project
   ```
   SCRIPT=train.py
   GOOGLE_APPLICATION_CREDENTIALS="/root/.config/gcloud/application_default_credentials.json"
   ```
   
   2. Complex project
   ```
   PREDICTION_DATE=2022-05-18
   SCRIPT=evaluate.py
   MODEL_NAME=multiembedder
   
   RECOMMENDER_NAME=multiembedder_sku_copurchase
   RUN_ID=2022.08.11.10.57.07-beta
   #MULTIEMBEDDER_ID=large-model
   
   COPURCHASE_CANDIDATES_BY_CAT_ID=2022.08.02.13.40.13-beta
   COVIEW_CANDIDATES_BY_CAT_ID=2022.08.02.13.53.21-beta
   #RUN_ID=temp-1234
   #SCRIPT=monitor.py
   #STRATEGY=sku__copurchase
   ```
   
   Note that the exact same DAG supports both projects, one with 2 variables and one with around 10 variables. This flexibility is vital for our data science users.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #25829: Fix templating for KubernetesPodOperator env_vars field

Posted by GitBox <gi...@apache.org>.
potiuk commented on PR #25829:
URL: https://github.com/apache/airflow/pull/25829#issuecomment-1243055014

   You have a conflict to solve (that's first).
   
   While I agree the problem is important to solve, I think string and using literal_eval is just wrong.
   We are talking about passing dictionary of strings - and this dictionary of strings should be passed to EnvVars not string. Is it possible? Can you do it this way while still being templatised in your case?
   
   I think literal_eval has one big problem with user-input data - namely that you are never sure what type of object you will get - it can be string, list, dict etc - depending on the data passed by the user as input. And this is - I believe what @uranusjr refers to. Have you tests what happens if the user passes a Tuple or Dict of Dict of Strings? I think not. That's why literal evel is often a shortcut that should be avoided - even if it is used in a few places - I'd avoid it unless you want to handle a truly generic case that you cannot foresee.
   
   This case is DIFFERENT. you have a variable number of entries Dict[str, str] tp pass.  Is it possible to work out a scheme where you can only receive and validate that as an input from the user?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org