You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/12/07 16:52:38 UTC

[GitHub] [airflow] PatrickfBraz opened a new issue, #28199: Large objects managed by custom XCom backend are being rendered and saved in rendered_task_instance_fields table

PatrickfBraz opened a new issue, #28199:
URL: https://github.com/apache/airflow/issues/28199

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### What happened
   
   In order to provide greater flexibility and ease for the implementation of DAGs and Tasks in our Airflow instance, we decided to implement our custom backend for XCom. In this way, we save in the database only the reference to objects that are serialized in pickle and saved in Google Cloud Storage (GCS).
   
   All the recommendations found in this documentation were followed, including the implementation of the orm_deserialize_value method to create a short and lightweight representation of the objects saved in the GCS.
   
   The custom backend implemented works perfectly and has been in production for a few months. Along with this, there has recently been a strong push on the team to implement the new DAGs using the TaskFlow API and slowly refactor the old DAGs. However, during the implementation of a new DAG which works with DataFrames from the pandas library we noticed that the execution presented errors not in the Task that generated the DataFrame but in the Tasks that consumed it and during the debug procedure we discovered that the problems were being caused because the DataFrames were being saved (or just trying to) into the rendered_task_instance_fields table.
   
   We believed that only arguments provided as a template were actually rendered and saved in this table, but apparently TaskFlow shares information between Tasks through templates (I don't know exactly how it works). Also, one would expect, as with the tab that renders the XComs, that the orm_deserialize_value method would be called, but that doesn't seem to be the case.
   
   Example os a sample code:
   ```python: 
   import pendulum
   import pandas as pd
   from airflow.decorators import dag, task
   
   now = pendulum.now()
   
   
   @dag(start_date=now, schedule="@daily", catchup=False)
   def sample():
       @task()
       def get_dataframe():
           return pd.read_csv("https://gist.githubusercontent.com/chriddyp/feaa84b34854e53fb72a/raw/dbba00aeafb981f0f50014030d1b6ad0399d957d/example-data.csv")
   
       @task()
       def load_dataframe(df: pd.DataFrame):
           print(df.head())
   
       load_dataframe(get_dataframe())
   
   
   sample()
   ```
   There is no problem executing the DAG:
   
   ![image](https://user-images.githubusercontent.com/25487679/206240137-6763a8b9-d4fa-4828-ad07-027a207a08a1.png)
   
   
   The following XCom information is rendered on UI:
   
   ![image](https://user-images.githubusercontent.com/25487679/206240537-1ed174c8-4a7b-4b97-9bb5-9fbd2fb2ad4d.png)
   
   Checking the XCom table on the local database used by airflow:
   
   ![image](https://user-images.githubusercontent.com/25487679/206240723-e506f85c-e926-4a50-82c2-5e520725c9bc.png)
   
   Everything ok so far. However, when checking the rendered template tab:
   
   ![image](https://user-images.githubusercontent.com/25487679/206234035-b78469ea-1b94-47a4-aba7-73d38505fff2.png)
   
   Confirming if this was really the object saved in the database:
   
   ![image](https://user-images.githubusercontent.com/25487679/206234195-ad423fa3-2374-46fc-828d-57a1363183f3.png)
   
   
   
   ### What you think should happen instead
   
   I expected that even if the object is deserialized, the method that returns the lightweight representation of the saved object is called. The created representation is enough for DEBUG purposes and doesn't burden the database with bulky objects.
   
   ### How to reproduce
   
   
   1.  Implement a new custom backend that accepts pandas DataFrame. [Here you can find an example](https://docs.astronomer.io/learn/custom-xcom-backends). NOTE: don't forget to implement the orm_deserialize_value to create a lighter representation of the Dataframe. 
   2. Reference your new backend on the core config. Ex: AIRFLOW__CORE__XCOM_BACKEND=path.to.custom.XComBackend
   3. Set up the local environment with the docker-compose file: https://airflow.apache.org/docs/apache-airflow/2.5.0/docker-compose.yaml
   4. Execute the sample code above.
   5. Look at the Xcoms and rendered template tabs.
   
   ### Operating System
   
   NAME="Ubuntu" VERSION="20.04.5 LTS (Focal Fossa)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 20.04.5 LTS" VERSION_ID="20.04" HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" VERSION_CODENAME=focal UBUNTU_CODENAME=focal
   
   ### Versions of Apache Airflow Providers
   
   In the production environment, we are using:
   
   apache-airflow-providers-postgres>=4.0.0
   apache-airflow-providers-apache-beam>=4.0.0
   apache-airflow-providers-cncf-kubernetes>=4.1.0
   apache-airflow-providers-datadog>=3.0.0
   apache-airflow-providers-google>=8.0.0
   apache-airflow-providers-http>=3.0.0
   apache-airflow-providers-microsoft-mssql>=3.0.0
   apache-airflow-providers-mongo>=3.0.0
   apache-airflow-providers-mysql>=3.0.0
   apache-airflow-providers-odbc>=3.0.0
   apache-airflow-providers-sftp>=3.0.0
   apache-airflow-providers-ssh>=3.0.0
   apache-airflow-providers-airbyte>=3.0.0
   apache-airflow-upgrade-check==1.4.0
   
   ### Deployment
   
   Other 3rd-party Helm chart
   
   ### Deployment details
   
   We manage a fork of the official Airflow chart which we customize for our use case. Deployment is done on a v1.21 Kubernetes cluster hosted on Google Kubernetes Engine (GKE)
   
   ### Anything else
   
   This occurrence is not a critical problem since using the Operators in the classic way already allows us to use the custom XCom without problems. However, TaskFlow presents a much more readable and friendly way of producing code. Since we want to democratize and facilitate the access and implementation of DAGs between different teams in the same Airflow instance, using TaskFlow will be of great use to us.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on issue #28199: Prevent large objects from being stored in RenderedTaskInstanceFields

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #28199:
URL: https://github.com/apache/airflow/issues/28199#issuecomment-1342739390

   Using `orm_deserialize` is mostly what I’m thinking, except if it’s not viable we could add a separate interface (I think we need some kind of backward compat measure so a new function may be needed anyway). Not displaying the args at all when a custom XCom backend is involved seems a bit unfriendly to the user.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on issue #28199: Prevent large objects from being stored in RenderedTaskInstanceFields

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #28199:
URL: https://github.com/apache/airflow/issues/28199#issuecomment-1342728724

   > Hmm, I’m not sure how it’d be actionable. Perhaps we need to add more hooks to the XCom backend interface for this.
   
   Maybe we should simply do not disply args * if they come from XCom and we have a custom backend? or use orm_deserialize in this case to display it? I guess it is possible to determine where the op_args/kwargs come from ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Prevent large objects from being stored in RenderedTaskInstanceFields [airflow]

Posted by "ephraimbuddy (via GitHub)" <gi...@apache.org>.
ephraimbuddy commented on issue #28199:
URL: https://github.com/apache/airflow/issues/28199#issuecomment-1976640568

   Hi @uranusjr @potiuk , I'm looking at this issue but it doesn't seem like using `orm_deserialize_value` would solve this or am I missing something? The code that causes the writing of this value is here: https://github.com/apache/airflow/blob/30f7b2abe6991fe6e565f17f7d0701e80ecba0d3/airflow/models/renderedtifields.py#L118-L120 and I'm thinking it should be solved there. thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on issue #28199: Large objects managed by custom XCom backend are being rendered and saved in rendered_task_instance_fields table

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #28199:
URL: https://github.com/apache/airflow/issues/28199#issuecomment-1342139451

   Hmm, I’m not sure how it’d be actionable. Perhaps we need to add more hooks to the XCom backend interface for this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] boring-cyborg[bot] commented on issue #28199: Large objects managed by custom XCom backend are being rendered and saved in rendered_task_instance_fields table

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #28199:
URL: https://github.com/apache/airflow/issues/28199#issuecomment-1341268755

   Thanks for opening your first issue here! Be sure to follow the issue template!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Prevent large objects from being stored in RenderedTaskInstanceFields [airflow]

Posted by "uranusjr (via GitHub)" <gi...@apache.org>.
uranusjr commented on issue #28199:
URL: https://github.com/apache/airflow/issues/28199#issuecomment-1977756610

   The context is that when we render the templates (which are saved to RTIF later), the rendering process also implicitly resolves XCom, causing big values to be loaded (and thus saved to RTIF). We do want the values to be loaded for execution, so one possible solution would be to render the templates twice if a custom XCom backend is detected (by checking if `orm_deserialize_value` is overriden), once normally for execution, once without loading the large values for storage. I think handling this in `serialize_template_field` might be more difficult; when we reach there, all templates have been rendered and we may miss a lot of context on whether we want a value to be abbrevated or not.
   
   But thinking of this again, maybe that is fine…? Custom XCom is just one special case (where the problem is more significant). It could be argued we should reduce _all_ large values anyway regardless of where they are from, and handling that in `serialize_template_field` would be reasonable.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] Joffreybvn commented on issue #28199: Prevent large objects from being stored in RenderedTaskInstanceFields

Posted by "Joffreybvn (via GitHub)" <gi...@apache.org>.
Joffreybvn commented on issue #28199:
URL: https://github.com/apache/airflow/issues/28199#issuecomment-1698582168

   Hello,
   
   I am having similar issue. **I'm whilling to make a PR**. Same story as @PatrickfBraz: giving a try to "full taskflow" and Airflow native features, instead of doing everything with KubernetesPodOperator / PythonOperator. Using Airflow 2.7.0.
   
   I have a custom XCom backend. My dag consist of: `HttpOperator -> SQLExecuteQueryOperator` to pull data from an API, passing it via XCom, and saving it into a database. I use a SQL template which pull XCom and render a big query:
   ```sql
   {%- set data = ti.xcom_pull(task_ids='pull_data_from_api') -%}
   {%- for entry in data -%}
   INSERT INTO "SCHEMA_NAME"."TABLE_NAME" VALUES ('{{entry|tojson}}');
   {% endfor %}
   ```
   Resulting in: *(This is an example with mockup data)*
   ![Screenshot 2023-08-29 164037](https://github.com/apache/airflow/assets/17571457/0a6879e2-083b-455f-9175-5d0c0cd4d6f2)
   The database I'm pushing to has limitation: I have to push data doing multiple queries. As a workaround, I could do an 'executemany' with `params` in a PythonOperator. But as soon as this chunk of code would turn into a custom reusable Operator, I will have a templated `parameters` which will anyway save all the data into the database.
   
   I also see the `AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK ` parameter. But rendered templates are great (especially for beginners), I don't want to disable them, I want to keep them visible in the UI.
   
   ---
   
   So, if I understand well this discussion, a PR to begin with could render the template two times:
   - One for the actual execution of the task, with real data - not stored in the db.
   - One to be stored in the db/ shown in the UI, based on orm_deserialize method
   ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] dilex42 commented on issue #28199: Prevent large objects from being stored in RenderedTaskInstanceFields

Posted by "dilex42 (via GitHub)" <gi...@apache.org>.
dilex42 commented on issue #28199:
URL: https://github.com/apache/airflow/issues/28199#issuecomment-1522233528

   Hi. I have the same problem and it's kinda frustrating as I think it completely ruins Taskflow idea of presenting DAGs as composition of tasks.
   As per your [Medium article](https://medium.com/apache-airflow/airflow-2-0-dag-authoring-redesigned-651edc397178) and issue #8059, Custom XCom was designed to allow users to pass large objects between tasks for cleaner code. And it works great when a task returns something. Problem arises later when you try to pass that returned value to another task. Then all of a sudden we face another attempt to write this large object to database for some reason. So in reality whole "pass objects from task to task" concept isn't working and that's why I completely agree that this issue should be classified as a bug.
   
   Next, I am wondering what possible solutions could there be:
   1. Is it necessary to deserialize XComs before task serialization in db? While I am not an avid user of `rendered template` tab, I understand why some would like to see `templates_dict` rendered. But what's the usage of rendered args and especially rendered xcoms? You can always check them in xcoms section or in you custom storage. And, I believe, that tasks that yet to be run have something like `XComArgs(DecoratedPython)...` in their rendered args. So can't it stay that way in others states?
   2. Perhaps you will be open to let users customize what is serializing in each task? Currently it's `['templates_dict', 'op_args', 'op_kwargs']` but maybe we can pass some parameter to change that behaviour and exclude args from certain tasks
   3. That being said, I agree that using `orm_deserialize` seems to be good idea as it gives necessary customisation opportunities and kinda why it was invented, if I understand correctly. Am I wrong in hoping that this change is only 1 line of code and should be pretty straightforward or there are a lot of places where this should be changed? If you can point me to specific code in question, I don't mind to try to make a PR.
   
   Also for people with same problem, I'm currently using a workaround of explicitly pulling desired XCom inside a task. It's not ideal but it's clean enough and works.
   
   Thanks for your time, and sorry if it sounds harsh, I'm just frustrated a little bit😞 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on issue #28199: Prevent large objects from being stored in RenderedTaskInstanceFields

Posted by "uranusjr (via GitHub)" <gi...@apache.org>.
uranusjr commented on issue #28199:
URL: https://github.com/apache/airflow/issues/28199#issuecomment-1698652727

   > * One for the actual execution of the task, with real data - not stored in the db
   > * One to be stored in the db/shown in the UI, based on orm_deserialize method + A mechanism to detect if it actually needs to be rendered two times (checking custom XCom backend)
   
   Instead of checking a custom backend, we can just check whether the custom backend actually re-implements `orm_deserialize_value` or not, and only render a second time if that’s the case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on issue #28199: Prevent large objects from being stored in RenderedTaskInstanceFields

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on issue #28199:
URL: https://github.com/apache/airflow/issues/28199#issuecomment-1523748803

   > Thanks for your time, and sorry if it sounds harsh, I'm just frustrated a little bitdisappointed
   
   Why not make a PR an try to fix it. This is why usually do. Usually when you create a PR and try to fix it you find a good way of doing it and by doing it and showing what you propose, it will be clear what you are proposing. discussing over proposed improvement is always a good idea.
   
   Airflow is created by almost 2.500 people - mostly those who felt frustrated with something and then implemented a fix or feature. This is how it works here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on issue #28199: Prevent large objects from being stored in RenderedTaskInstanceFields

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #28199:
URL: https://github.com/apache/airflow/issues/28199#issuecomment-1342748326

   Yeah, not very unfriendly. Using `orm_deserialize` is best and I think entirely in-line with the intended use of the method.
   
   I'd even say the current behaviour is a bug because clearly the orm_deserialize method was exactly going to handle those kind of cases according to description:
   
   > There is also an orm_deserialize_value method that is called **whenever the XCom objects are rendered for UI or reporting purposes**; if you have large or expensive-to-retrieve values in your XComs, you should override this method to avoid calling that code (and instead return a lighter, incomplete representation) so the UI remains responsive.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Prevent large objects from being stored in RenderedTaskInstanceFields [airflow]

Posted by "ephraimbuddy (via GitHub)" <gi...@apache.org>.
ephraimbuddy closed issue #28199: Prevent large objects from being stored in RenderedTaskInstanceFields
URL: https://github.com/apache/airflow/issues/28199


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org