Posted to dev@airflow.apache.org by Song Liu <so...@outlook.com> on 2018/05/10 07:58:47 UTC

Interesting things about how to know it's a DAG file

Hi,

I just created a custom DAG class named "MyPipeline" by extending the "DAG" class, but Airflow fails to identify the file as a DAG file.

After digging into the Airflow implementation, I found this in dag_processing.py:

```
# Heuristic that guesses whether a Python file contains an
# Airflow DAG definition.
might_contain_dag = True
if safe_mode and not zipfile.is_zipfile(file_path):
    with open(file_path, 'rb') as f:
        content = f.read()
        might_contain_dag = all(
            [s in content for s in (b'DAG', b'airflow')])
```

So a file is treated as a DAG file only if it contains both the keywords "DAG" and "airflow".
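
For anyone curious, the heuristic can be exercised on its own. Here is a rough standalone sketch of the same logic (the function wrapper and its name are mine, not Airflow's actual API):

```
import zipfile

def might_contain_dag(file_path, safe_mode=True):
    # Mirrors the snippet above: in safe mode, a non-zip file is
    # considered a DAG candidate only if its raw bytes contain both
    # b'DAG' and b'airflow'. Zip files, and any file when safe mode
    # is off, are always treated as candidates.
    if safe_mode and not zipfile.is_zipfile(file_path):
        with open(file_path, 'rb') as f:
            content = f.read()
        return all(s in content for s in (b'DAG', b'airflow'))
    return True
```

A file that only defines "MyPipeline", with neither "airflow" nor "DAG" anywhere in its text, fails this check, which matches the behaviour I am seeing.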

Is there a more scientific way to do this?

Thanks,
Song

Re: Interesting things about how to know it's a DAG file

Posted by Arthur Wiedmer <ar...@gmail.com>.
I think that the idea of a Dag Fetcher is a great one.

Manifests are a good idea, or indeed it could look to a specific airflow
DAG index to instantiate the lookup behaviour if it needs to be
programmatic.

We may want to offer a simple Dag Fetcher which follows the current
behaviour for backward compatibility, if we want to target 2.0 for the Dag
Fetcher implementation.
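
To make that concrete, here is a minimal sketch of what such an interface might look like, with a backward-compatible fetcher that reproduces the current keyword heuristic (all class and method names here are hypothetical, not an existing Airflow API):

```
from abc import ABC, abstractmethod

class DagFetcher(ABC):
    """Hypothetical interface: decides which files go to the DAG parser."""

    @abstractmethod
    def list_dag_files(self, file_paths):
        """Return the subset of file_paths that should be parsed for DAGs."""

class HeuristicDagFetcher(DagFetcher):
    """Backward-compatible fetcher: keeps today's keyword heuristic."""

    def list_dag_files(self, file_paths):
        selected = []
        for path in file_paths:
            with open(path, 'rb') as f:
                content = f.read()
            # Same check as the current scheduler: both keywords required.
            if all(s in content for s in (b'DAG', b'airflow')):
                selected.append(path)
        return selected
```

A manifest-based or index-based fetcher could then implement the same interface without any content sniffing.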

Best,
Arthur


On Thu, May 10, 2018 at 10:37 AM Gabriel Silk <gs...@dropbox.com.invalid>
wrote:


Re: Interesting things about how to know it's a DAG file

Posted by Gabriel Silk <gs...@dropbox.com.INVALID>.
What about a manifest file that names all the DAGs? Or a naming convention
for the DAG files themselves?

Alternatively, there could be a single entry point (i.e., index.py) from
which all the DAGs are instantiated. There's probably some complexity in
making that work with the multi-process scheduler model, but doesn't seem
insurmountable.
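
As a rough illustration of the manifest idea (the file name and format are invented for this example), the scheduler could check membership instead of sniffing file contents:

```
import json

# Hypothetical dags_manifest.json content: an explicit list of DAG files.
MANIFEST = json.loads("""
{
  "dags": [
    "dags/my_pipeline.py",
    "dags/daily_report.py"
  ]
}
""")

def is_dag_file(path, manifest=MANIFEST):
    # Membership in the manifest replaces the keyword heuristic entirely,
    # so a file defining only a MyPipeline subclass is still picked up.
    return path in manifest["dags"]
```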

On Thu, May 10, 2018 at 10:31 AM, Arthur Wiedmer <ar...@gmail.com>
wrote:


Re: Interesting things about how to know it's a DAG file

Posted by Arthur Wiedmer <ar...@gmail.com>.
Hi Song,

I agree that this is not ideal, but it is difficult to do otherwise without
parsing/executing the Python code.

Note that an import from airflow should be enough, or even just "DAG" in a comment. I
think we are open to other solutions, if anyone on the list has better
ideas.
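
In other words, a file that only instantiates a custom subclass can satisfy today's check with a single comment. A sketch (the module and file layout are invented; only the membership test at the end is what the scheduler actually does):

```
# Content of a hypothetical DAG file: the first comment line is the
# only place the literal strings "airflow" and "DAG" appear.
dag_file_content = b"""\
# airflow DAG  <- satisfies the safe-mode keyword check
from my_pipelines import MyPipeline

dag = MyPipeline(dag_id="my_pipeline")
"""

# The scheduler's safe-mode check reduces to this membership test:
passes_heuristic = all(
    s in dag_file_content for s in (b'DAG', b'airflow'))

# Without the comment, the same file contains neither keyword
# (note that b'DAG' is case-sensitive) and would be skipped.
without_comment = (b"from my_pipelines import MyPipeline\n\n"
                   b"dag = MyPipeline(dag_id='my_pipeline')\n")
fails_heuristic = not all(
    s in without_comment for s in (b'DAG', b'airflow'))
```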


Best,
Arthur



On Thu, May 10, 2018 at 12:59 AM Song Liu <so...@outlook.com> wrote:
