You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@airflow.apache.org by "Sibabrata Pattanaik (spattana)" <sp...@cisco.com.INVALID> on 2020/09/07 05:04:50 UTC

Question on dynamic tasks in a DAG and wait_for_downstream

Hello Team,

Currently we are using airflow version - 1.10.10 to data ingest.

In our DAG, we create tasks dynamically based on data volume , i.e if data volume is high, number of parallel tasks increases and if the data volume is less number of parallel tasks reduces in the next run or vice versa.
As DAG execution instance use the same table to update, we use 'wait_for_downstream'  to True to maintain the data consistency and make sure next run should not happen if the previous run is in progress or failed.

In this scenario, we are seeing one issue i.e. If previous instances has less number of tasks then the current one because of dynamic task creation, then the current DAG is always in waiting state . As the current DAG  is waiting for the new task/s which are generated during the run but not exists in the previous DAG instance, but waiting for the same tasks to be in completion state in the previous DAG.  As soon as we manually mark those tasks as completed in the previous DAG instance, current DAG start running .

Let me know if you have any work around for this scenario.

Thanks
Sibabrata Pattanaik
--------------------------
spattana@cisco.com
VOIP 84260416
+91 80  44260416
--------------------------

Re: Question on dynamic tasks in a DAG and wait_for_downstream

Posted by Ry Walker <ry...@rywalker.com>.

Hi Sibarata -

I discussed your use case with some of our data engineers, we might have
some recommendations on how to better execute this - let us know if you
want to chat.

-Ry
Airflow Committer + Founder/CTO of Astronomer

On Mon, Sep 7, 2020 at 4:30 AM Sibabrata Pattanaik (spattana)
<sp...@cisco.com.invalid> wrote:

> Hello Team,
>
> Currently we are using airflow version - 1.10.10 to data ingest.
>
> In our DAG, we create tasks dynamically based on data volume , i.e if data
> volume is high, number of parallel tasks increases and if the data volume
> is less number of parallel tasks reduces in the next run or vice versa.
> As DAG execution instance use the same table to update, we use
> 'wait_for_downstream'  to True to maintain the data consistency and make
> sure next run should not happen if the previous run is in progress or
> failed.
>
> In this scenario, we are seeing one issue i.e. If previous instances has
> less number of tasks then the current one because of dynamic task creation,
> then the current DAG is always in waiting state . As the current DAG  is
> waiting for the new task/s which are generated during the run but not
> exists in the previous DAG instance, but waiting for the same tasks to be
> in completion state in the previous DAG.  As soon as we manually mark those
> tasks as completed in the previous DAG instance, current DAG start running .
>
> Let me know if you have any work around for this scenario.
>
> Thanks
> Sibabrata Pattanaik
> --------------------------
> spattana@cisco.com
> VOIP 84260416
> +91 80  44260416
> --------------------------
>
>