You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/11/12 21:22:06 UTC

[GitHub] [airflow] turbaszek opened a new pull request #12327: Use different deserialization method in XCom init_on_load

turbaszek opened a new pull request #12327:
URL: https://github.com/apache/airflow/pull/12327


   The init_on_load method used deserialize_value method which
   in case of custom XCom backends may perform requests to external
   services (for example downloading file from buckets).
   
   This is problematic because wherever we query XCom the resuest would be
   send (for example when listing XCom in webui). This PR proposes implementing
   orm_deserialize_value which allows overriding this behavior. By default
   we use BaseXCom.deserialize_value.
   
   closes: #12315
   
   I'm testing this with the following backend:
   ```py
   class GCSXComBackend(BaseXCom):
       PREFIX = "xcom_gs://"
       BUCKET_NAME = "airflow-xcom-backend"
   
       @staticmethod
       def serialize_value(value: Any):
           if isinstance(value, pd.DataFrame):
               hook = GCSHook()
               with NamedTemporaryFile("w+") as f:
                   object_name = "data_" + f.name.replace("/", "_")
                   value.to_csv(f.name)
                   f.flush()
                   hook.upload(
                       bucket_name=GCSXComBackend.BUCKET_NAME,
                       object_name=object_name,
                       filename=f.name
                   )
               value = GCSXComBackend.PREFIX + object_name
           return BaseXCom.serialize_value(value)
   
       @staticmethod
       def deserialize_value(result) -> Any:
           result = BaseXCom.deserialize_value(result)
           if isinstance(result, str) and result.startswith(GCSXComBackend.PREFIX):
               object_name = result.replace(GCSXComBackend.PREFIX, "")
               with GCSHook().provide_file(
                   bucket_name=GCSXComBackend.BUCKET_NAME,
                   object_name=object_name,
               ) as f:
                   f.flush()
                   result = pd.read_csv(f.name)
           return result
   ```
   
   And that's what I see in XCom table:
   <img width="1673" alt="Screenshot 2020-11-12 at 22 18 56" src="https://user-images.githubusercontent.com/9528307/98997809-6bfa6b00-2535-11eb-90b3-2b57985a4e66.png">
   
   In logs I can see that the data is uploaded and downloaded as expected.
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek commented on pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
turbaszek commented on pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#issuecomment-726684675


   > What was the error you got (in text please) when using this backend? I don't quite see how this would be used in your case.
   
   @ashb  as mentioned in #12315 the first problem I saw was loading of XCom model view in web ui:
   ```
     File "/usr/local/lib/python3.8/site-packages/flask_appbuilder/templates/appbuilder/general/widgets/base_list.html", line 23, in top-level template code
       {% block begin_loop_values %}
     File "/opt/airflow/airflow/www/templates/airflow/model_list.html", line 80, in block "begin_loop_values"
       {% elif item[value] != None %}
     File "/usr/local/lib/python3.8/site-packages/pandas/core/generic.py", line 1326, in __nonzero__
       raise ValueError(
   ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
   ```
   
   However, as described there it can be fixed by changing `!= None` to `is not none` in model template. Here's the result:
   
   <img width="1673" alt="Screenshot 2020-11-13 at 11 21 10" src="https://user-images.githubusercontent.com/9528307/99061788-58391e00-25a2-11eb-88b2-fe80bcdd226d.png">
   
   But the presence of dataframes in this table suggests that they are retrieved from GCS bucket (they are not stored in XCom table, where I store only reference to the bucket object). To show this view we do:
   ```
           xcomlist = (
               session.query(XCom)
               .filter(XCom.dag_id == dag_id, XCom.task_id == task_id, XCom.execution_date == dttm)
               .all()
           )
   ```
   Which queries XComs and calls `deserialize_value` when recreating XCom object from database in `init_on_load` method.
   
   What this PR propose is to have `orm_deserialize_value` for "get value from database to create XCom object" and  `deserialize_value` to "deserialize XCom object into other object". I'm happy to hear other possible options, however this one is simplest one that came to my mind.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil commented on a change in pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
kaxil commented on a change in pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#discussion_r522451025



##########
File path: airflow/models/xcom.py
##########
@@ -63,7 +63,7 @@ def init_on_load(self):
         i.e automatically deserialize Xcom value when loading from DB.
         """
         try:
-            self.value = XCom.deserialize_value(self)
+            self.value = XCom.orm_deserialize_value(self)

Review comment:
       Why not change this to `self.deserialize_value` value instead. This way it will use the `deserialize_value` from the CustomXCom Backend.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek commented on a change in pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
turbaszek commented on a change in pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#discussion_r523014626



##########
File path: airflow/models/xcom.py
##########
@@ -63,7 +63,7 @@ def init_on_load(self):
         i.e automatically deserialize Xcom value when loading from DB.
         """
         try:
-            self.value = XCom.deserialize_value(self)
+            self.value = XCom.orm_deserialize_value(self)

Review comment:
       Done in 18513c2




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#discussion_r524024672



##########
File path: docs/concepts.rst
##########
@@ -793,6 +793,12 @@ to a class that is subclass of :class:`~airflow.models.xcom.BaseXCom`. To alter
 deserialization mechanism the custom class should override ``serialize_value`` and ``deserialize_value``
 methods.
 
+It is also possible to override the ``orm_deserialize_value`` method which is used for deserialization when
+recreating ORM XCom object. This happens every time we query the XCom table, for example when we want to populate
+XCom list view in webserver. By default Airflow will use ``BaseXCom.orm_deserialize_value`` method which returns the
+value stored in Airflow database. In this way Airflow is avoiding unnecessary requests that may occur in custom
+``deserialize`` methods of custom XCom backend.

Review comment:
       ```suggestion
   value stored in Airflow database. If your XCom backend performs expensive operations, or has
   large values that aren't useful to show in such a view, override this method to provide an
   alternative representation.
   ```

##########
File path: docs/concepts.rst
##########
@@ -793,6 +793,12 @@ to a class that is subclass of :class:`~airflow.models.xcom.BaseXCom`. To alter
 deserialization mechanism the custom class should override ``serialize_value`` and ``deserialize_value``
 methods.
 
+It is also possible to override the ``orm_deserialize_value`` method which is used for deserialization when
+recreating ORM XCom object. This happens every time we query the XCom table, for example when we want to populate
+XCom list view in webserver. By default Airflow will use ``BaseXCom.orm_deserialize_value`` method which returns the
+value stored in Airflow database. In this way Airflow is avoiding unnecessary requests that may occur in custom
+``deserialize`` methods of custom XCom backend.

Review comment:
       Just this @turbaszek Then :+1:




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#issuecomment-726628485


   What was the error you got (in text please) when using this backend? I don't quite see how this would be used in your case.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek commented on pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
turbaszek commented on pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#issuecomment-726824591


   Thinking about it I lean towards hardcoding the `init_on_load` with `BaseXCom.deserialize_value` - in this way we will make sure that users do not break internal stuff. As an addition to that I would add note in docs that is is required to call `BaseXCom` methods in custom `serialize_value` and `deserialize_value` to make sure that the value we store in **xcom tabel** is json serialisable (somehow discouraging users to use pickle + db as storage or anything like that).
   
   What do you think @ashb @kaxil ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek merged pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
turbaszek merged pull request #12327:
URL: https://github.com/apache/airflow/pull/12327


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb edited a comment on pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
ashb edited a comment on pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#issuecomment-726698981


   > Which queries XComs and calls deserialize_value when recreating XCom object from database in init_on_load method. Definitely not something we want to do with 100s of big dataframes or other objects.
   
   Ahhh yeah, def not something we want. This wasn't clear to me from this PR. If I'm understanding the problem there are actually two issues we need to address:
   
   1. Looking at the Xcom list page would request a bunch of files from GCS bucket (in your example)
   2. Even if that was "cheap"/cached(/whatever), we don't want to display huge xcom payloads _anyway_.
   
   Have I summarised the problem correctly?
   
   How about something clearer in intent -- something like `get_xcom_summary`? (This name isn't great either.) From reading the PR/code it this PR it wasn't clear when this new feature would actually be used.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#issuecomment-727883027


   The PR needs to run all tests because it modifies core of Airflow! Please rebase it to latest master or ask committer to re-run it!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek commented on pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
turbaszek commented on pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#issuecomment-727854120


   @ashb is there anything else that should be addressed in this PR?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#issuecomment-726373656


   [The Workflow run](https://github.com/apache/airflow/actions/runs/360548690) is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek commented on a change in pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
turbaszek commented on a change in pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#discussion_r523415279



##########
File path: airflow/models/xcom.py
##########
@@ -235,30 +235,40 @@ def serialize_value(value: Any):
             return json.dumps(value).encode('UTF-8')
         except (ValueError, TypeError):
             log.error(
-                "Could not serialize the XCOM value into JSON. "
+                "Could not serialize the XCom value into JSON. "
                 "If you are using pickles instead of JSON "
-                "for XCOM, then you need to enable pickle "
-                "support for XCOM in your airflow config."
+                "for XCom, then you need to enable pickle "
+                "support for XCom in your airflow config."
             )
             raise
 
     @staticmethod
-    def deserialize_value(result) -> Any:
-        """Deserialize Xcom value from str or pickle object"""
+    def deserialize_value(result: "XCom") -> Any:
+        """Deserialize XCom value from str or pickle object"""
         enable_pickling = conf.getboolean('core', 'enable_xcom_pickling')
         if enable_pickling:
             return pickle.loads(result.value)
         try:
             return json.loads(result.value.decode('UTF-8'))
         except JSONDecodeError:
             log.error(
-                "Could not deserialize the XCOM value from JSON. "
+                "Could not deserialize the XCom value from JSON. "
                 "If you are using pickles instead of JSON "
-                "for XCOM, then you need to enable pickle "
-                "support for XCOM in your airflow config."
+                "for XCom, then you need to enable pickle "
+                "support for XCom in your airflow config."
             )
             raise
 
+    def orm_deserialize_value(self) -> Any:
+        """
+        Deserialize method which is used to reconstruct ORM XCom object.
+
+        This method should be overridden in custom XCom backends to avoid
+        unnecessary request or other resource consuming operations when
+        creating XCom orm model

Review comment:
       Good suggestion! Done in 529e911




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek edited a comment on pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
turbaszek edited a comment on pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#issuecomment-726684675


   > What was the error you got (in text please) when using this backend? I don't quite see how this would be used in your case.
   
   @ashb  as mentioned in #12315 the first problem I saw was loading of XCom model view in web ui:
   ```
     File "/usr/local/lib/python3.8/site-packages/flask_appbuilder/templates/appbuilder/general/widgets/base_list.html", line 23, in top-level template code
       {% block begin_loop_values %}
     File "/opt/airflow/airflow/www/templates/airflow/model_list.html", line 80, in block "begin_loop_values"
       {% elif item[value] != None %}
     File "/usr/local/lib/python3.8/site-packages/pandas/core/generic.py", line 1326, in __nonzero__
       raise ValueError(
   ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
   ```
   
   However, as described there it can be fixed by changing `!= None` to `is not none` in model template. Here's the result:
   
   <img width="1673" alt="Screenshot 2020-11-13 at 11 21 10" src="https://user-images.githubusercontent.com/9528307/99061788-58391e00-25a2-11eb-88b2-fe80bcdd226d.png">
   
   But the presence of dataframes in this table suggests that they are retrieved from GCS bucket (they are not stored in XCom table, where I store only reference to the bucket object). To show this view we do:
   ```
           xcomlist = (
               session.query(XCom)
               .filter(XCom.dag_id == dag_id, XCom.task_id == task_id, XCom.execution_date == dttm)
               .all()
           )
   ```
   Which queries XComs and calls `deserialize_value` when recreating XCom object from database in `init_on_load` method. Definitely not something we want to do with 100s of big dataframes or other objects. 
   
   What this PR propose is to have `orm_deserialize_value` for "get value from database to create XCom object" and  `deserialize_value` to "deserialize XCom object into other object". I'm happy to hear other possible options, however this one is simplest one that came to my mind.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#issuecomment-726364289


   The PR needs to run all tests because it modifies core of Airflow! Please rebase it to latest master or ask committer to re-run it!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek commented on a change in pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
turbaszek commented on a change in pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#discussion_r522474998



##########
File path: airflow/models/xcom.py
##########
@@ -63,7 +63,7 @@ def init_on_load(self):
         i.e automatically deserialize Xcom value when loading from DB.
         """
         try:
-            self.value = XCom.deserialize_value(self)
+            self.value = XCom.orm_deserialize_value(self)

Review comment:
       @kaxil this will not help as custom backends can perform any logic in this function and return any type of object (that was initial issues #12315 ). So doing a query on XCom may cause tenths of requests :D That's why I wanted to decouple it.
   
   So `orm_deserialize_value` is for "get value from database to create XCom object" when `deserialize_value` is to "deserialize XCom object into other object"




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek commented on pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
turbaszek commented on pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#issuecomment-726707793


   > Ahhh yeah, def not something we want. This wasn't clear to me from this PR. If I'm understanding the problem there are actually two issues we need to address:
   > 
   > 1. Looking at the Xcom list page would request a bunch of files from GCS bucket (in your example)
   > 2. Even if that was "cheap"/cached(/whatever), we don't want to display huge xcom payloads _anyway_.
   > 
   > Have I summarised the problem correctly?
   
   Yes, thanks for making it clearer!
   
   > How about something clearer in intent -- something like `get_xcom_summary`? 
   
   Summary suggest that we return something more than one value (for example summary in pandas), how about `get_raw_value`? 
   
   > From reading the PR/code it this PR it wasn't clear when this new feature would actually be used.
   
   To be honest I don't think overriding how to reconstruct ORM model is something that should be clear / suggested to users (although should be configurable to fully customise the XCom). I suppose that in 95% cases using the `BaseXCom` method should be ok and that's what we would use by default. 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil commented on a change in pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
kaxil commented on a change in pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#discussion_r522451025



##########
File path: airflow/models/xcom.py
##########
@@ -63,7 +63,7 @@ def init_on_load(self):
         i.e automatically deserialize Xcom value when loading from DB.
         """
         try:
-            self.value = XCom.deserialize_value(self)
+            self.value = XCom.orm_deserialize_value(self)

Review comment:
       Why not change this to `self.deserialize_value` value instead. This way it will use the `deserialize_value` from the CustomXCom Backend.
   
   ```suggestion
               self.value = self.deserialize_value(self)
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#discussion_r523412814



##########
File path: airflow/models/xcom.py
##########
@@ -235,30 +235,40 @@ def serialize_value(value: Any):
             return json.dumps(value).encode('UTF-8')
         except (ValueError, TypeError):
             log.error(
-                "Could not serialize the XCOM value into JSON. "
+                "Could not serialize the XCom value into JSON. "
                 "If you are using pickles instead of JSON "
-                "for XCOM, then you need to enable pickle "
-                "support for XCOM in your airflow config."
+                "for XCom, then you need to enable pickle "
+                "support for XCom in your airflow config."
             )
             raise
 
     @staticmethod
-    def deserialize_value(result) -> Any:
-        """Deserialize Xcom value from str or pickle object"""
+    def deserialize_value(result: "XCom") -> Any:
+        """Deserialize XCom value from str or pickle object"""
         enable_pickling = conf.getboolean('core', 'enable_xcom_pickling')
         if enable_pickling:
             return pickle.loads(result.value)
         try:
             return json.loads(result.value.decode('UTF-8'))
         except JSONDecodeError:
             log.error(
-                "Could not deserialize the XCOM value from JSON. "
+                "Could not deserialize the XCom value from JSON. "
                 "If you are using pickles instead of JSON "
-                "for XCOM, then you need to enable pickle "
-                "support for XCOM in your airflow config."
+                "for XCom, then you need to enable pickle "
+                "support for XCom in your airflow config."
             )
             raise
 
+    def orm_deserialize_value(self) -> Any:
+        """
+        Deserialize method which is used to reconstruct ORM XCom object.
+
+        This method should be overridden in custom XCom backends to avoid
+        unnecessary request or other resource consuming operations when
+        creating XCom orm model

Review comment:
       I think I'd like an indication of _when_ this might be needed in the docs please.
   
   Something like
   
   ```suggestion
           creating XCom orm model. This is is used when viewing XCom listing in the webserver, for example.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#discussion_r523023365



##########
File path: airflow/models/xcom.py
##########
@@ -63,7 +63,7 @@ def init_on_load(self):
         i.e automatically deserialize Xcom value when loading from DB.
         """
         try:
-            self.value = XCom.orm_deserialize_value(self)
+            self.value = self.orm_deserialize_value(self)

Review comment:
       Can this be `self.orm_deserialize_value()`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb edited a comment on pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
ashb edited a comment on pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#issuecomment-726628485


   What was the error you got (in text please) when using this backend? I don't quite see how this would be used in your case.
   
   Or can you show an update example backend using this new method?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil commented on a change in pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
kaxil commented on a change in pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#discussion_r522450084



##########
File path: airflow/models/xcom.py
##########
@@ -63,7 +63,7 @@ def init_on_load(self):
         i.e automatically deserialize Xcom value when loading from DB.
         """
         try:
-            self.value = XCom.deserialize_value(self)
+            self.value = XCom.orm_deserialize_value(self)

Review comment:
       why not just change this to `cls.deserialize_value(self)` ? and that way it should use the `deserialize_value` method from Custom backend 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#issuecomment-726698981


   > Which queries XComs and calls deserialize_value when recreating XCom object from database in init_on_load method. Definitely not something we want to do with 100s of big dataframes or other objects.
   
   Ahhh yeah, def not something we want. This wasn't clear to me from this PR. If I'm understanding the problem there are actually two issues we need to address:
   
   1. Looking at the Xcom list page would request a bunch of files from GCS bucket (in your example)
   2. Even if that was "cheap"/cached(/whatever), we don't want to display huge xcom payloads _anyway_.
   
   Have I summarised the problem correctly?
   
   How about something clearer in intent -- something like `get_xcom_summary`? From reading the PR/code it this PR it wasn't clear when this new feature would actually be used.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#discussion_r522879651



##########
File path: airflow/models/xcom.py
##########
@@ -63,7 +63,7 @@ def init_on_load(self):
         i.e automatically deserialize Xcom value when loading from DB.
         """
         try:
-            self.value = XCom.deserialize_value(self)
+            self.value = XCom.orm_deserialize_value(self)

Review comment:
       I think either way we should make this a `self.` call
   ```suggestion
               self.value = self.orm_deserialize_value()
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek commented on a change in pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
turbaszek commented on a change in pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#discussion_r523028845



##########
File path: airflow/models/xcom.py
##########
@@ -63,7 +63,7 @@ def init_on_load(self):
         i.e automatically deserialize Xcom value when loading from DB.
         """
         try:
-            self.value = XCom.orm_deserialize_value(self)
+            self.value = self.orm_deserialize_value(self)

Review comment:
       Yeah, make sense, I didn't get it earlier. Done in 5ad38dc




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek commented on a change in pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
turbaszek commented on a change in pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#discussion_r524063547



##########
File path: docs/concepts.rst
##########
@@ -793,6 +793,12 @@ to a class that is subclass of :class:`~airflow.models.xcom.BaseXCom`. To alter
 deserialization mechanism the custom class should override ``serialize_value`` and ``deserialize_value``
 methods.
 
+It is also possible to override the ``orm_deserialize_value`` method which is used for deserialization when
+recreating ORM XCom object. This happens every time we query the XCom table, for example when we want to populate
+XCom list view in webserver. By default Airflow will use ``BaseXCom.orm_deserialize_value`` method which returns the
+value stored in Airflow database. In this way Airflow is avoiding unnecessary requests that may occur in custom
+``deserialize`` methods of custom XCom backend.

Review comment:
       @ashb done in 32ce1c8, I also change the order of the sentences to preserve "view" context 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek edited a comment on pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
turbaszek edited a comment on pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#issuecomment-726684675


   > What was the error you got (in text please) when using this backend? I don't quite see how this would be used in your case.
   
   @ashb  as mentioned in #12315 the first problem I saw was loading of XCom model view in web ui:
   ```
     File "/usr/local/lib/python3.8/site-packages/flask_appbuilder/templates/appbuilder/general/widgets/base_list.html", line 23, in top-level template code
       {% block begin_loop_values %}
     File "/opt/airflow/airflow/www/templates/airflow/model_list.html", line 80, in block "begin_loop_values"
       {% elif item[value] != None %}
     File "/usr/local/lib/python3.8/site-packages/pandas/core/generic.py", line 1326, in __nonzero__
       raise ValueError(
   ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
   ```
   
   However, as described there it can be fixed by changing `!= None` to `is not none` in model template. However, this suggested that we are retrieving dataframes (which we should not!). After using `is not none` in list template:
   
   <img width="1673" alt="Screenshot 2020-11-13 at 11 21 10" src="https://user-images.githubusercontent.com/9528307/99061788-58391e00-25a2-11eb-88b2-fe80bcdd226d.png">
   
   But the presence of dataframes in this table suggests that they are retrieved from GCS bucket (they are not stored in XCom table, where I store only reference to the bucket object). To show this view we do:
   ```
           xcomlist = (
               session.query(XCom)
               .filter(XCom.dag_id == dag_id, XCom.task_id == task_id, XCom.execution_date == dttm)
               .all()
           )
   ```
   Which queries XComs and calls `deserialize_value` when recreating XCom object from database in `init_on_load` method. Definitely not something we want to do with 100s of big dataframes or other objects. 
   
   What this PR propose is to have `orm_deserialize_value` for "get value from database to create XCom object" and  `deserialize_value` to "deserialize XCom object into other object". I'm happy to hear other possible options, however this one is simplest one that came to my mind.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] turbaszek commented on a change in pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
turbaszek commented on a change in pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#discussion_r522474998



##########
File path: airflow/models/xcom.py
##########
@@ -63,7 +63,7 @@ def init_on_load(self):
         i.e automatically deserialize Xcom value when loading from DB.
         """
         try:
-            self.value = XCom.deserialize_value(self)
+            self.value = XCom.orm_deserialize_value(self)

Review comment:
       @kaxil this will not help as custom backends can perform any logic in this function and return any type of object (that was initial issues #12315 ). So doing a query on XCom may cause tenths of requests :D That's why I wanted to decouple it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #12327: Use different deserialization method in XCom init_on_load

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12327:
URL: https://github.com/apache/airflow/pull/12327#issuecomment-726373733


   [The Workflow run](https://github.com/apache/airflow/actions/runs/360549582) is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Backport packages$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org