Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2019/11/25 07:40:25 UTC

[GitHub] [airflow] dstandish edited a comment on issue #6210: [AIRFLOW-5567] BaseReschedulePokeOperator

dstandish edited a comment on issue #6210: [AIRFLOW-5567] BaseReschedulePokeOperator
URL: https://github.com/apache/airflow/pull/6210#issuecomment-558030520
 
 
   > Having a state table will have a fundamental impact on the idempotency of the execution of the tasks.
   
   Using such a table would be optional, just as XCom is.  If you don't use it, nothing changes.
   
   > Why would the manual triggering of a dag introduce issues, the execution date will be equal to the moment that it was triggered. I think it should work as well.
   
   Because for scheduled runs execution_date is the run date minus one interval, while for a manually triggered run it is the trigger time itself.  So, suppose I want to persist state with XCom (which I do in many jobs), and I have a daily job running at midnight.  At the end of each run, we push some value to XCom.  At the start of the next run, we retrieve the last value and use it somehow.  Consider this case:
   * run 1: 12am D1
   * run 2: manually triggered at 8am (exec date is D1 8am; xcom retrieves from run 1)
   * run 3: 12am D2
   * run 4: 12am D3
   * run 5: 12am D4
   
   Outcome:
   * Run 3 will retrieve the XCom from run 1, because run 3's execution date is prior to run 2's execution date.
   * Run 4 retrieves run 2's XCom for the same reason.
   * Run 5 retrieves run 4's XCom (finally things are back in order); run 3's XCom is never retrieved by any job.
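
The ordering above can be reproduced with a small toy model. This is only a sketch and does not use Airflow itself: the `scheduled`, `manual`, and `previous_xcom_source` helpers are hypothetical, the calendar date standing in for "D1" is arbitrary, and the lookup rule (take the value from the already-finished run with the latest earlier execution date) is an assumption based on the behaviour described in this comment.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(days=1)
D1 = datetime(2019, 11, 1)  # arbitrary date standing in for "D1"


def scheduled(run_time):
    # Scheduled run: execution date is the run time minus one interval.
    return {"run_time": run_time, "execution_date": run_time - INTERVAL}


def manual(run_time):
    # Manually triggered run: execution date is the trigger time itself.
    return {"run_time": run_time, "execution_date": run_time}


runs = {
    1: scheduled(D1),
    2: manual(D1 + timedelta(hours=8)),
    3: scheduled(D1 + timedelta(days=1)),
    4: scheduled(D1 + timedelta(days=2)),
    5: scheduled(D1 + timedelta(days=3)),
}


def previous_xcom_source(run_id):
    """Return the id of the run whose value this run would pick up, or None."""
    me = runs[run_id]
    candidates = [
        (other["execution_date"], other_id)
        for other_id, other in runs.items()
        if other["run_time"] < me["run_time"]  # the other run already happened
        and other["execution_date"] < me["execution_date"]  # and sorts earlier
    ]
    # The latest earlier execution date wins.
    return max(candidates)[1] if candidates else None


sources = {run_id: previous_xcom_source(run_id) for run_id in runs}
print(sources)
```

In this model run 3 never shows up as a source for any later run, which is the gap described in the outcome list.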
   
   The schedule interval edge PR would resolve the execution date ordering problem.  But if XCom is cleared at the start of a task, it remains problematic as a mechanism for state persistence.
   
   > Since this will introduce such as a fundamental change to the way operators were intended, being idempotent, I think it would be great to first start an AIP on the topic, so we can have a clear and structured approach.
   
   An AIP sounds reasonable.  I am just a bit skeptical of the notion that this is some radical change; I would be shocked if stateful processes were not already an extremely common use pattern.  Here the goal would be to provide better support for them out of the box.  
   
   Airflow provides great support for a particular kind of "idempotent" task, but surely it doesn't say this is the only way we can use it! 
   
   Anyway, I have occasionally rambled on the dev list about these issues.  I am not sure what the best solution is.  I wish there could be a clearer and more generalized separation between the concepts of "run date" and "interval of interest", but I am not sure what that should look like.  But having a simple way to persist state would be of great immediate help to me, and to this PR, incidentally.
   
