Posted to dev@airflow.apache.org by "Ferruzzi, Dennis" <fe...@amazon.com.INVALID> on 2022/04/27 15:30:57 UTC

task_id declarations

Hi folks, I'm hoping for a little history lesson.  I'm idly wondering if there is a way to make a fairly big change (for me), but want to understand the reason it is the way it is now, before I go and put much time into "fixing" it.

Every time I write a DAG it bugs me that we have to essentially name a task twice and I'm thinking of proposing/implementing the change.  For example:


    train_model = SageMakerTrainingOperator(
        task_id='train_model',
        config=TRAINING_CONFIG,
    )


I'd love to see the task_id default to the task's variable name.  It's exceedingly rare in my DAGs for those two values not to be identical, and from time to time I get caught out forgetting to set the task_id.  But maybe there is a reason it works this way, or maybe my personal experience is just too limited to see why this is a Bad Idea.

Re: task_id declarations

Posted by "Ferruzzi, Dennis" <fe...@amazon.com.INVALID>.
Yeah, alright, I guess that makes sense.   Thank you both for the replies.

Maybe it's just confirmation bias in action, but it seems like I almost always see DAGs written like

```
with DAG(dag_id="simple", start_date=datetime(2022, 4, 1)) as dag:
    task1 = Operator1(task_id='task1')
    task2 = Operator1(task_id='task2')
    task3 = Operator1(task_id='task3')

    task1 >> task2 >> task3
```

and it feels so repetitive.  Maybe I'm just trivializing a change that is both unnecessary and largely intrusive.

Thanks for the replies.

________________________________
From: Jed Cunningham <je...@apache.org>
Sent: Wednesday, April 27, 2022 1:45 PM
To: dev@airflow.apache.org
Subject: RE: [EXTERNAL]task_id declarations


Also keep in mind you don't even have to put tasks in a variable, for example:

```
with DAG(dag_id="simple", start_date=datetime(2022, 4, 1)) as dag:
    BashOperator(task_id="hello", bash_command="echo hello")
    BashOperator(task_id="world", bash_command="echo world")
```

Re: task_id declarations

Posted by Jed Cunningham <je...@apache.org>.
Also keep in mind you don't even have to put tasks in a variable, for
example:

```
with DAG(dag_id="simple", start_date=datetime(2022, 4, 1)) as dag:
    BashOperator(task_id="hello", bash_command="echo hello")
    BashOperator(task_id="world", bash_command="echo world")
```

Re: task_id declarations

Posted by Jarek Potiuk <ja...@potiuk.com>.
I think the original reason is how Python evaluates an assignment. At
the moment we create the task, the variable name is not known: the
task is first created as an object, and only then is the result
assigned to a variable. I think we have no super-reliable way (unless
there is some wild Python trickery) to find out which variable the
object is being assigned to, and my gut feeling is that there will be
cases that make any such attempt fail. Theoretically, you could parse
the AST of your Python DAG and find out which variable name is going
to be assigned. See this SO question:
https://stackoverflow.com/questions/18425225/getting-the-name-of-a-variable-as-a-string.
But I am afraid that finding the right assignment in the AST in the
general case (including nested frames, built-ins handling etc.) would
be either very error-prone or even impossible in some cases. It would
also certainly be much slower, because you would have to access and
traverse the AST of the Python DAG being executed right now, find the
proper assignment and get the name of the variable from there.
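
To illustrate the ordering described above, here is a minimal sketch. It
assumes a hypothetical `FakeOperator` stand-in and a hypothetical
`names_bound_to` helper, not Airflow's real operator classes. The
constructor looks at the caller's local names while it is still running
and finds that nothing refers to the new object yet.

```python
import inspect


def names_bound_to(obj, namespace):
    """Return the names in `namespace` whose value is exactly `obj`."""
    return [name for name, value in namespace.items() if value is obj]


class FakeOperator:
    """Hypothetical stand-in for an operator; not Airflow's BaseOperator."""

    def __init__(self, task_id=None):
        # Snapshot the caller's local variables while __init__ is still running.
        caller_locals = dict(inspect.currentframe().f_back.f_locals)
        self.names_seen_in_init = names_bound_to(self, caller_locals)
        self.task_id = task_id


def build():
    # The right-hand side runs to completion before `train_model` is bound.
    train_model = FakeOperator()
    names_after = names_bound_to(train_model, dict(locals()))
    return train_model.names_seen_in_init, names_after


during, after = build()
print(during)  # []
print(after)   # ['train_model']
```

The name only exists once the constructor has returned, which is why a
default would have to come from somewhere other than the object's own
initialisation.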

I am not sure if it is worth it but maybe someone would like to
prototype it and run it on many DAGs to see if this could be a viable
option?
J.
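
A rough prototype of the AST idea could look like the following. This is
only a sketch under stated assumptions: the DAG source is a made-up
string, `infer_task_ids` is a hypothetical helper, and "a callable whose
name ends in Operator" is an assumed heuristic for spotting tasks. It
uses only the standard-library `ast` module and does not import Airflow.

```python
import ast

# Hypothetical DAG file contents, handled as plain text so Airflow is not needed.
DAG_SOURCE = '''
train_model = SageMakerTrainingOperator(config=TRAINING_CONFIG)
BashOperator(bash_command="echo hello")
first, second = make_tasks()
tasks = [BashOperator(bash_command=f"echo {i}") for i in range(3)]
'''


def infer_task_ids(source):
    """Return (callable name, inferred task_id or None) for calls lacking task_id."""
    tree = ast.parse(source)

    # Pass 1: remember calls that are the whole right-hand side of `name = Call(...)`.
    assigned_name = {}
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
            and isinstance(node.value, ast.Call)
        ):
            assigned_name[id(node.value)] = node.targets[0].id

    # Pass 2: visit every call that looks like an operator (assumed naming
    # convention: the callable's name ends in "Operator") and has no task_id.
    found = []
    for node in ast.walk(tree):
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            continue
        if not node.func.id.endswith("Operator"):
            continue
        if any(keyword.arg == "task_id" for keyword in node.keywords):
            continue
        found.append((node.lineno, node.func.id, assigned_name.get(id(node))))

    found.sort(key=lambda item: item[0])  # report in source order
    return [(func, name) for _, func, name in found]


for func, name in infer_task_ids(DAG_SOURCE):
    print(func, "->", name)
```

Only the plain `name = Operator(...)` form yields a name here. The
unassigned call and the one inside the list comprehension come back as
`None`, and tasks built inside a helper such as `make_tasks()` are not
seen at all, which matches the worry about the general case.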
