Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/06/22 08:55:38 UTC

[GitHub] [airflow] ashb commented on pull request #16084: Added new pipeline example for the tutorial docs (Issue #11208)

ashb commented on pull request #16084:
URL: https://github.com/apache/airflow/pull/16084#issuecomment-865764114


   This example pipeline encodes a few anti-patterns that we don't want to encourage:
   
   Having one task make an HTTP request and write the response to a local file, and then having a second task pick up that file and process it, will not work for a number of reasons:
   
   
   1. If you re-run an old `insert_data` task, it's going to insert _new_ data, not the data that belonged to that run's date -- a re-run or backfill should reproduce the original result.
   1. This task is not idempotent -- every time you run it you will just get another copy of the rows inserted. We should use an UPSERT or some kind of "delete date range then insert" approach rather than a blind insert (see the sketch after this list).
   1. It won't work outside of the LocalExecutor -- with the Celery executor, the `get_data` and `insert_data` tasks could end up running on different nodes, and with the Kubernetes executor it _will not_ work, as the file in the container is thrown away when the task finishes and the pod is deleted.
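   
   Something along these lines would avoid both problems. This is a rough sketch, not code for this PR -- the `events` table, the `my_postgres` connection id, the API URL and the response shape are all placeholders:
   
   ```python
   # Rough sketch (Airflow 2.x TaskFlow API). Table, connection and URL names
   # are invented for illustration -- adapt to the tutorial's actual example.
   import requests

   from airflow.decorators import dag, task
   from airflow.operators.python import get_current_context
   from airflow.providers.postgres.hooks.postgres import PostgresHook
   from airflow.utils.dates import days_ago


   @dag(schedule_interval="@daily", start_date=days_ago(2), catchup=False)
   def idempotent_pipeline():
       @task
       def get_and_insert_data():
           # The run's logical date (YYYY-MM-DD), taken from the task context,
           # so a re-run of an old task instance asks for the *same* slice of
           # data again instead of whatever is newest (point 1).
           ds = get_current_context()["ds"]

           resp = requests.get("https://example.com/api/events", params={"date": ds})
           resp.raise_for_status()
           rows = resp.json()  # assumed: a list of dicts with a "payload" key

           hook = PostgresHook(postgres_conn_id="my_postgres")
           conn = hook.get_conn()
           with conn:  # commits on success, rolls back on error
               with conn.cursor() as cur:
                   # "Delete date range then insert": running this twice for
                   # the same date leaves exactly one copy of the rows, which
                   # makes the task idempotent (point 2).
                   cur.execute("DELETE FROM events WHERE event_date = %s", (ds,))
                   cur.executemany(
                       "INSERT INTO events (event_date, payload) VALUES (%s, %s)",
                       [(ds, r["payload"]) for r in rows],
                   )

       get_and_insert_data()


   example_dag = idempotent_pipeline()
   ```
   
   Merging the fetch and the load into one task is also the simplest fix for point 3: there is no local file handed between tasks, so it behaves the same on the Local, Celery and Kubernetes executors. If the tasks genuinely need to stay separate, the data has to travel via XCom or shared storage such as S3/GCS, never a container-local file.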
   
   @Sanchit112 Could you update your follow-on PR to take these into account?

