Posted to dev@airflow.apache.org by Mocheng Guo <gm...@gmail.com> on 2022/08/09 23:46:27 UTC

[Proposal] Creating DAG through the REST api

Hi Everyone,

I have an enhancement proposal for the REST API service. This is based on
the observation that Airflow users want to be able to access Airflow more
easily as a platform service.

The motivation comes from the following use cases:
1. Users like data scientists want to iterate over data quickly with
interactive feedback in minutes, e.g. managing data pipelines inside a
Jupyter notebook while executing them in a remote Airflow cluster.
2. Services targeting specific audiences can generate DAGs based on inputs
like user commands or external triggers, and they want to be able to submit
DAGs programmatically without manual intervention.

I believe supporting such use cases would improve Airflow's usability and
broaden its adoption. The existing DAG repo model brings considerable
overhead for these scenarios: a shared repo requires offline processes and
can be slow to roll out changes.

The proposal aims to provide an alternative where a DAG can be transmitted
online. Here are the key points:
1. A DAG is packaged individually so that it can be distributed over the
network. For example, a DAG may be a serialized binary or a zip file (see
the sketch after this list).
2. The Airflow REST API is the ideal place to talk with the external world.
The API would provide a generic interface to accept DAG artifacts and
should be extensible to support different artifact formats if needed.
3. DAG persistence needs to be implemented since such DAGs are not part of
the DAG repository.
4. DAGs submitted through the API behave the same as those defined in the
repo, i.e. users write them in the same syntax, and their scheduling,
execution, and web server UI behave the same way.
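
To make points 1 and 2 concrete, here is a rough sketch of how a client
might package and submit a DAG artifact. This is not an existing Airflow
API: the endpoint path, credentials, and file names below are invented
purely to illustrate the proposed interface.

    import io
    import zipfile

    import requests  # any HTTP client would do

    # Package the DAG file (plus any helper modules) into one zip so it can
    # travel over the network as a single artifact (point 1 above).
    artifact = io.BytesIO()
    with zipfile.ZipFile(artifact, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write("my_dag.py")  # the DAG definition itself

    artifact.seek(0)

    # Hypothetical endpoint: today's Airflow REST API has no such route.
    # It only illustrates the generic "DAG artifact" interface of point 2.
    resp = requests.post(
        "https://airflow.example.com/api/v1/dags/artifacts",
        files={"artifact": ("my_dag.zip", artifact, "application/zip")},
        auth=("api_user", "api_password"),  # or Kerberos / other pluggable auth
        timeout=30,
    )
    resp.raise_for_status()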

Since DAGs are written as code, running arbitrary code inside Airflow may
pose high security risks. Here are a few proposals to mitigate that risk:
1. Accept DAGs only from trusted parties. Airflow already supports
pluggable authentication modules where strong authentication such as
Kerberos can be used.
2. Execute DAG code as the API identity, i.e. a DAG created through the API
service will have run_as_user set to the API identity (see the sketch after
this list).
3. To enforce data access control on DAGs, the API identity should also be
used to access the data warehouse.
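
A minimal sketch of what point 2 could mean in practice, using Airflow's
existing run_as_user task attribute. The identity value below is
hypothetical and, under the proposal, would be injected by the API's auth
layer rather than hard-coded; run_as_user also relies on worker
impersonation (sudo) being configured.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Stand-in for the authenticated API identity (hypothetical value).
    API_IDENTITY = "analyst_svc"

    with DAG(
        dag_id="api_submitted_example",
        start_date=datetime(2022, 8, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args={"run_as_user": API_IDENTITY},  # tasks run as the submitter
    ) as dag:
        BashOperator(task_id="extract", bash_command="whoami && echo extracting")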

We shared a demo based on a prototype implementation at the summit; some
details are described in this ppt
<https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>.
We would love to get feedback and comments from the community about this
initiative.

thanks
Mocheng

Re: [Proposal] Creating DAG through the REST api

Posted by Jarek Potiuk <ja...@potiuk.com>.
The development experience of DAG authoring should for sure be improved
(and will be - this will be our focus in the coming months). But it makes
no sense at all IMHO to add a DAG authoring experience to the Airflow UI
when you have PyCharm, IntelliJ, VSCode, Vim, the GitHub UI and plenty of
other tools that allow you to edit Python code WAY WAY WAY more efficiently
than the Airflow UI ever could. It would make zero sense for us to
re-develop all those features in the Airflow UI.

For development you'd do much better by starting `airflow standalone` and
editing DAG files that are locally available (that will allow you to
immediately run the DAGs when you change them locally), or even using the
DebugExecutor:
https://airflow.apache.org/docs/apache-airflow/stable/executor/debug.html
to run the tasks (which does not need a running Airflow at all), or using
`airflow tasks test` or `airflow dags test` - neither of which even needs a
running Airflow installation, just a locally installed airflow package. All
of those are actually much better Python development environments.
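
For anyone less familiar with those commands, here is a minimal sketch of a
DAG file you can iterate on entirely locally (the dag_id and task are made
up). It assumes the file sits in your local DAGS_FOLDER and that the
metadata DB has been initialized with `airflow db init`.

    """A tiny throwaway DAG illustrating the local iteration loop.

    With just a locally installed airflow package you can run it without
    any scheduler or webserver, e.g.:

        airflow tasks test local_iteration_demo say_hello 2022-08-25
        airflow dags test local_iteration_demo 2022-08-25

    `airflow standalone` additionally gives you a local UI watching the
    same dags folder.
    """
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def _hello():
        print("hello from a locally iterated task")


    with DAG(
        dag_id="local_iteration_demo",
        start_date=datetime(2022, 8, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        PythonOperator(task_id="say_hello", python_callable=_hello)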

But if we ever adopt some declarative approach for DAGs, we might well
consider making them editable via the Airflow UI. That is much more viable
both from a security point of view and because any declarative format we
might want to use will be rather "airflow" or "workflow" specific, so there
will not be nearly as many better ways of editing it as we have today for
Python programs.
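
To make that more tangible, here is a rough, entirely hypothetical sketch -
the spec format, its field names and the build_dag helper are invented for
illustration - of how a declarative description could be turned into a DAG
server-side, with the executable code living only inside Airflow:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Only whitelisted, server-side operator classes can ever be used.
    ALLOWED_OPERATORS = {"bash": BashOperator}

    # What a user could submit over the API (e.g. as JSON or YAML) - data
    # only, no executable code. The format is invented for this example.
    declarative_spec = {
        "dag_id": "declared_pipeline",
        "start_date": "2022-08-01",
        "schedule": "@daily",
        "tasks": [
            {"id": "extract", "type": "bash", "command": "echo extract"},
            {"id": "load", "type": "bash", "command": "echo load",
             "after": ["extract"]},
        ],
    }

    def build_dag(spec: dict) -> DAG:
        """Turn a declarative spec into a DAG using whitelisted operators."""
        dag = DAG(
            dag_id=spec["dag_id"],
            start_date=datetime.fromisoformat(spec["start_date"]),
            schedule_interval=spec["schedule"],
            catchup=False,
        )
        tasks = {}
        for t in spec["tasks"]:
            # Kwarg mapping is simplified for the single whitelisted operator.
            op_cls = ALLOWED_OPERATORS[t["type"]]
            tasks[t["id"]] = op_cls(task_id=t["id"],
                                    bash_command=t["command"], dag=dag)
        for t in spec["tasks"]:
            for upstream in t.get("after", []):
                tasks[upstream] >> tasks[t["id"]]
        return dag

    dag = build_dag(declarative_spec)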

J.




On Thu, Aug 25, 2022 at 5:42 PM Nishant Sharma <ni...@gmail.com>
wrote:

> Hi,
> I have also felt the need at times for creating DAG's through the REST
> API. But I understand the security concerns associated with such
> implementation.
>
> If not submission through  REST API at least some sort of *development
> mode* in airflow interface to create and edit dag's for a user
> session. Not sure if this was brought up previously.
>
> Thanks,
> Nishant
>
> On Thu, Aug 25, 2022 at 6:32 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> Just in case - please watch the devlist for the announcement of the "SIG
>> multitenancy" group if it slips my mind.
>>
>> On Thu, Aug 25, 2022 at 1:31 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>>> Cool. I will make sure to include you ! I think this is something that
>>> will happen in September, The holiday period is not the best to organize it.
>>>
>>> On Thu, Aug 25, 2022 at 5:50 AM Mocheng Guo <gm...@gmail.com> wrote:
>>>
>>>> My use case needs automation and security: those are the two key
>>>> requirements and does not have to be REST API if there is another way that
>>>> DAGs could be submitted to a cloud storage securely. Sure I would
>>>> appreciate it if you could include me when organizing AIP-1 related
>>>> meetings. Kerberos is a ticket based system in which a ticket has a limited
>>>> lifetime. Using kerberos, a workload could be authenticated before
>>>> persistence so that Airflow uses its kerberos keytab to execute, which is
>>>> similar to the current implementation in worker, another possible scenarios
>>>> is a persisted workload needs to include a kerberos renewable TGT to be
>>>> used by Airflow worker, but this is more complex and I would be happy to
>>>> discuss more in meetings. I will draft a more detailed document for review.
>>>>
>>>> thanks
>>>> Mocheng
>>>>
>>>>
>>>> On Thu, Aug 18, 2022 at 1:19 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>
>>>>> None of those requirements are supported by Airflow. And opening REST
>>>>> API does not solve the authentication use case you mentioned.
>>>>>
>>>>> This is a completely new requirement you have - basically what you
>>>>> want is workflow identity and it should be rather independent from the way
>>>>> DAG is submitted. It would require to attach some kind of identity and
>>>>> signaturea and some way of making sure that the DAG has not been tampered
>>>>> with, in a way that the worker could use the identity when executing the
>>>>> workload and be sure that no-one else modified the DAG - including any of
>>>>> the files that the DAG uses. This is an interesting case but it has nothing
>>>>> to do with using or not the REST API. REST API alone will not give you the
>>>>> user identity guarantees that you need here. The distributed nature of
>>>>> Airflow basically requires such workflow identity has to be provided by
>>>>> cryptographic signatures and verifying the integrity of the DAG rather than
>>>>> basing it on REST API authentication.
>>>>>
>>>>> BTW. We do support already Kerberos authentication for some of our
>>>>> operators but identity is necessarily per instance executing the workload -
>>>>> not the user submitting the DAG.
>>>>>
>>>>> This could be one of the improvement proposals that could in the
>>>>> future become a sub-AIP or  AIP-1 (Improve Airflow Security). if you are
>>>>> interested in leading and proposing such an AIP i will be soon (a month or
>>>>> so) re-establishing #sig-multitenancy meetings (see AIP-1 for recordings
>>>>> and minutes of previous meetings). We already have AiP-43 and AIP-44
>>>>> approved there (and AIP-43 close to completion) and the next steps should
>>>>> be introducing fine graines security layer to executing the workloads.
>>>>> Adding workload identity might be part of it. If you would like to work on
>>>>> that - you are most welcome. It means to prepare and discuss proposals, get
>>>>> consensus of involved parties, leading it to a vote and finally
>>>>> implementing it.
>>>>>
>>>>> J
>>>>>
>>>>> czw., 18 sie 2022, 02:44 użytkownik Mocheng Guo <gm...@gmail.com>
>>>>> napisał:
>>>>>
>>>>>> >> Could you please elaborate why this would be a problem to use
>>>>>> those (really good for file pushing) APIs ?
>>>>>>
>>>>>> Submitting DAGs directly to cloud storage API does help for some part
>>>>>> of the use case requirement, but cloud storage does not provide the
>>>>>> security a data warehouse needs. A typical auth model supported in data
>>>>>> warehouse is Kerberos, and a data warehouse provides limited view to a
>>>>>> kerberos user with authorization rules. We need users to submit DAGs with
>>>>>> identities supported by the data warehouse, so that Apache Spark jobs will
>>>>>> be executed as the kerberos user who submits a DAG which in turns decide
>>>>>> what data can be processed, there may also be need to handle impersonation,
>>>>>> so there needs to be an additional layer to handle data warehouse auth e.g.
>>>>>> kerberos.
>>>>>>
>>>>>> Assuming dags are already inside the cloud storage, and I think
>>>>>> AIP-5/20 would work better than the current mono repo model if it could
>>>>>> support better flexibility and less latency, and I would be very interested
>>>>>> to be part of the design and implementation.
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 12, 2022 at 10:56 AM Jarek Potiuk <ja...@potiuk.com>
>>>>>> wrote:
>>>>>>
>>>>>>> First appreciate all for your valuable feedback. Airflow by design
>>>>>>> has to accept code, both Tomasz and Constance's examples let me think that
>>>>>>> the security judgement should be on the actual DAGs rather than how DAGs
>>>>>>> are accepted or a process itself. To expand a little bit more on another
>>>>>>> example, say another service provides an API which can be invoked by its
>>>>>>> clients the service validates user inputs e.g. SQL and generates Airflow
>>>>>>> DAGs which use the validated operators/macros. Those DAGs are safe to be
>>>>>>> pushed through the API. There are certainly cases that DAGs may not be
>>>>>>> safe, e.g the API service on public cloud with shared tenants with no
>>>>>>> knowledge how DAGs are generated, in such cases the API service can access
>>>>>>> control the identity or even reject all calls when considered unsafe.
>>>>>>> Please let me know if the example makes sense, and if there is a common
>>>>>>> interest, having an Airflow native write path would benefit the community
>>>>>>> instead of each building its own solution.
>>>>>>>
>>>>>>> > You seem to repeat more of the same. This is exactly what we want
>>>>>>> to avoid. IF you can push a code over API you can push Any Code. And
>>>>>>> precisely the "Access Control" you mentioned or rejecting the call when
>>>>>>> "considering code unsafe" those are the decisions we already deliberately
>>>>>>> decided we do not want Airflow REST API to make. Whether the code it's
>>>>>>> generated or not does not matter because Airflow has no idea whatsoever if
>>>>>>> it has been manipulated with, between the time it was generated and pushed.
>>>>>>> The only way Airflow can know that the code is not manipulated with is when
>>>>>>> it generates DAG code on its own based on a declarative input. The limit is
>>>>>>> to push declarative information only. You CANNOT push code via the REST
>>>>>>> API. This is out of the question. The case is closed.
>>>>>>>
>>>>>>> The middle loop usually happens on a Jupyter notebook, it needs to
>>>>>>> change data/features used by model frequently which in turn leads to
>>>>>>> Airflow DAG updates, do you mind elaborate how to automate the changes
>>>>>>> inside a notebook and programmatically submitting DAGs through git+CI/CD
>>>>>>> while giving user quick feedback? I understand git+ci/cd is technically
>>>>>>> possible but the overhead involved is a major reason users rejecting
>>>>>>> Airflow for other alternative solutions, e.g. git repo requires manual
>>>>>>> approval even if DAGs can be programmatically submitted, and CI/CD are slow
>>>>>>> offline processes with large repo.
>>>>>>>
>>>>>>> Case 2 is actually (if you attempt to read my article I posted
>>>>>>> above, it's written there) the case where shared volume could still be used
>>>>>>> and are bette. This why it's great that Airflow supports multiple DAG
>>>>>>> syncing solutions because your "middle" environment does not have to have
>>>>>>> git sync as it is not "production' (unless you want to mix development with
>>>>>>> testing that is, which is terrible, terrible idea).
>>>>>>>
>>>>>>> Your data science for middle ground does:
>>>>>>>
>>>>>>> a) cp my_dag.py "/my_midle_volume_shared_and_mounted_locally". - if
>>>>>>> you use shared volume of some sort (NFS/EFS etc.)
>>>>>>> b) aws s3 cp my_dag.py "s3://my-midle-testing-bucket/" - if your
>>>>>>> dags are on S3  and synced using s3-sync
>>>>>>> c) gsutil cp my_dag.py "gs://my-bucket" - if your dags are on GCS
>>>>>>> and synced using s3-sync
>>>>>>>
>>>>>>> Those are excellent "File push" apis. They do the job. I cannot
>>>>>>> imagine why the middle-loop person might have a problem with using them.
>>>>>>> All of that can also be  fully automated -  they all have nice Python and
>>>>>>> other language APIs so you can even make the IDE run those commands
>>>>>>> automatically on every save if you want.
>>>>>>>
>>>>>>> Could you please elaborate why this would be a problem to use those
>>>>>>> (really good for file pushing) APIs ?
>>>>>>>
>>>>>>> J.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Aug 12, 2022 at 6:20 PM Mocheng Guo <gm...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> First appreciate all for your valuable feedback. Airflow by design
>>>>>>>> has to accept code, both Tomasz and Constance's examples let me think that
>>>>>>>> the security judgement should be on the actual DAGs rather than how DAGs
>>>>>>>> are accepted or a process itself. To expand a little bit more on another
>>>>>>>> example, say another service provides an API which can be invoked by its
>>>>>>>> clients the service validates user inputs e.g. SQL and generates Airflow
>>>>>>>> DAGs which use the validated operators/macros. Those DAGs are safe to be
>>>>>>>> pushed through the API. There are certainly cases that DAGs may not be
>>>>>>>> safe, e.g the API service on public cloud with shared tenants with no
>>>>>>>> knowledge how DAGs are generated, in such cases the API service can access
>>>>>>>> control the identity or even reject all calls when considered unsafe.
>>>>>>>> Please let me know if the example makes sense, and if there is a common
>>>>>>>> interest, having an Airflow native write path would benefit the community
>>>>>>>> instead of each building its own solution.
>>>>>>>>
>>>>>>>> Hi Xiaodong/Jarek, for your suggestion let me elaborate on a use
>>>>>>>> case, here are three loops a data scientist is doing to develop a machine
>>>>>>>> learning model:
>>>>>>>> - inner loop: iterates on the model locally.
>>>>>>>> - middle loop: iterate the model on a remote cluster with
>>>>>>>> production data, say it's using Airflow DAGs behind the scenes.
>>>>>>>> - outer loop: done with iteration and publish the model on
>>>>>>>> production.
>>>>>>>> The middle loop usually happens on a Jupyter notebook, it needs to
>>>>>>>> change data/features used by model frequently which in turn leads to
>>>>>>>> Airflow DAG updates, do you mind elaborate how to automate the changes
>>>>>>>> inside a notebook and programmatically submitting DAGs through git+CI/CD
>>>>>>>> while giving user quick feedback? I understand git+ci/cd is technically
>>>>>>>> possible but the overhead involved is a major reason users rejecting
>>>>>>>> Airflow for other alternative solutions, e.g. git repo requires manual
>>>>>>>> approval even if DAGs can be programmatically submitted, and CI/CD are slow
>>>>>>>> offline processes with large repo.
>>>>>>>>
>>>>>>>> Such use case is pretty common for data scientists, and a better
>>>>>>>> **online** service model would help open up more possibilities for Airflow
>>>>>>>> and its users, as additional layers providing more values(like Constance
>>>>>>>> mentioned enable users with no engineering or airflow domain knowledge to
>>>>>>>> use Airflow) could be built on top of Airflow which remains as a lower
>>>>>>>> level orchestration engine.
>>>>>>>>
>>>>>>>> thanks
>>>>>>>> Mocheng
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Aug 11, 2022 at 10:46 PM Jarek Potiuk <ja...@potiuk.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I really like the Idea of Tomek.
>>>>>>>>>
>>>>>>>>> If we ever go (which is not unlikely) - some "standard"
>>>>>>>>> declarative way of describing DAGs, all my security, packaging concerns are
>>>>>>>>> gone - and submitting such declarative DAG via API is quite viable. Simply
>>>>>>>>> submitting a Python code this way is a no-go for me :). Such Declarative
>>>>>>>>> DAG could be just stored in the DB and scheduled and executed using only
>>>>>>>>> "declaration" from the DB - without ever touching the DAG "folder" and
>>>>>>>>> without allowing the user to submit any executable code this way. All the
>>>>>>>>> code to execute would already have to be in Airflow already in this case.
>>>>>>>>>
>>>>>>>>> And I very much agree also that this case can be solved with Git.
>>>>>>>>> I think we are generally undervaluing the role Git plays for DAG
>>>>>>>>> distribution of Airflow.
>>>>>>>>>
>>>>>>>>> I think when the user feels the need (I very much understand the
>>>>>>>>> need Constance) to submit the DAG via API,  rather than adding the option
>>>>>>>>> of submitting the DAG code via "Airflow REST API", we should simply answer
>>>>>>>>> this:
>>>>>>>>>
>>>>>>>>> *Use Git and git sync. Then "Git Push" then becomes the standard
>>>>>>>>> "API" you wanted to push the code.*
>>>>>>>>>
>>>>>>>>> This has all the flexibility you need, it has integration with
>>>>>>>>> Pull Request, CI workflows, keeps history etc.etc. When we tell people "Use
>>>>>>>>> Git" - we have ALL of that and more for free. Standing on the shoulders of
>>>>>>>>> giants.
>>>>>>>>> If we start thinking about integration of code push via our own
>>>>>>>>> API - we basically start the journey of rewriting Git as eventually we will
>>>>>>>>> have to support those cases. This makes absolutely no sense for me.
>>>>>>>>>
>>>>>>>>> I even start to think that we should make "git sync" a separate
>>>>>>>>> (and much more viable) option that is pretty much the "main recommendation"
>>>>>>>>> for Airflow. rather than "yet another option among shared folders and baked
>>>>>>>>> in DAGs" case.
>>>>>>>>>
>>>>>>>>> I recently even wrote my thoughts about it in this post: "Shared
>>>>>>>>> Volumes in Airflow - the good, the bad and the ugly":
>>>>>>>>> https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca
>>>>>>>>> which has much more details on why I think so.
>>>>>>>>>
>>>>>>>>> J.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Aug 11, 2022 at 8:43 PM Constance Martineau
>>>>>>>>> <co...@astronomer.io.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> I understand the security concerns, and generally agree, but as a
>>>>>>>>>> regular user I always wished we could upload DAG files via an API. It opens
>>>>>>>>>> the door to have an upload button, which would be nice. It would make
>>>>>>>>>> Airflow a lot more accessible to non-engineering types.
>>>>>>>>>>
>>>>>>>>>> I love the idea of implementing a manual review option in
>>>>>>>>>> conjunction with some sort of hook (similar to Airflow cluster policies)
>>>>>>>>>> would be a good middle ground. An administrator could use that hook to do
>>>>>>>>>> checks against DAGs or run security scanners, and decide whether or not to
>>>>>>>>>> implement a review requirement.
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 11, 2022 at 1:54 PM Tomasz Urbaszek <
>>>>>>>>>> turbaszek@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> In general I second what XD said. CI/CD feels better than
>>>>>>>>>>> sending DAG files over API and the security issues arising from accepting
>>>>>>>>>>> "any python file" are probably quite big.
>>>>>>>>>>>
>>>>>>>>>>> However, I think this proposal can be tightly related to
>>>>>>>>>>> "declarative DAGs". Instead of sending a DAG file, the user would send the
>>>>>>>>>>> DAG definition (operators, inputs, relations) in a predefined format
>>>>>>>>>>> that is not a code. This of course has some limitations like inability to
>>>>>>>>>>> define custom macros, callbacks on the fly but it may be a good compromise.
>>>>>>>>>>>
>>>>>>>>>>> Other thought - if we implement something like "DAG via API"
>>>>>>>>>>> then we should consider adding an option to review DAGs (approval queue
>>>>>>>>>>> etc) to reduce security issues that are mitigated by for example deploying
>>>>>>>>>>> DAGs from git (where we have code review, security scanners etc).
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Tomek
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 11 Aug 2022 at 17:50, Xiaodong Deng <xd...@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Mocheng,
>>>>>>>>>>>>
>>>>>>>>>>>> Please allow me to share a question first: so in your proposal,
>>>>>>>>>>>> the API in your plan is still accepting an Airflow DAG as the payload (just
>>>>>>>>>>>> binarized or compressed), right?
>>>>>>>>>>>>
>>>>>>>>>>>> If that's the case, I may not be fully convinced: the
>>>>>>>>>>>> objectives in your proposal is about automation & programmatically
>>>>>>>>>>>> submitting DAGs. These can already be achieved in an efficient way through
>>>>>>>>>>>> CI/CD practice + a centralized place to manage your DAGs (e.g. a Git Repo
>>>>>>>>>>>> to host the DAG files).
>>>>>>>>>>>>
>>>>>>>>>>>> As you are already aware, allowing this via API adds additional
>>>>>>>>>>>> security concern, and I would doubt if that "breaks even".
>>>>>>>>>>>>
>>>>>>>>>>>> Kindly let me know if I have missed anything or misunderstood
>>>>>>>>>>>> your proposal. Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> XD
>>>>>>>>>>>> ----------------------------------------------------------------
>>>>>>>>>>>> (This is not a contribution)
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gm...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have an enhancement proposal for the REST API service. This
>>>>>>>>>>>>> is based on the observations that Airflow users want to be able to access
>>>>>>>>>>>>> Airflow more easily as a platform service.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The motivation comes from the following use cases:
>>>>>>>>>>>>> 1. Users like data scientists want to iterate over data
>>>>>>>>>>>>> quickly with interactive feedback in minutes, e.g. managing data pipelines
>>>>>>>>>>>>> inside Jupyter Notebook while executing them in a remote airflow cluster.
>>>>>>>>>>>>> 2. Services targeting specific audiences can generate DAGs
>>>>>>>>>>>>> based on inputs like user command or external triggers, and they want to be
>>>>>>>>>>>>> able to submit DAGs programmatically without manual intervention.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I believe such use cases would help promote Airflow usability
>>>>>>>>>>>>> and gain more customer popularity. The existing DAG repo brings
>>>>>>>>>>>>> considerable overhead for such scenarios, a shared repo requires offline
>>>>>>>>>>>>> processes and can be slow to rollout.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The proposal aims to provide an alternative where a DAG can be
>>>>>>>>>>>>> transmitted online and here are some key points:
>>>>>>>>>>>>> 1. A DAG is packaged individually so that it can be
>>>>>>>>>>>>> distributable over the network. For example, a DAG may be a serialized
>>>>>>>>>>>>> binary or a zip file.
>>>>>>>>>>>>> 2. The Airflow REST API is the ideal place to talk with the
>>>>>>>>>>>>> external world. The API would provide a generic interface to accept DAG
>>>>>>>>>>>>> artifacts and should be extensible to support different artifact formats if
>>>>>>>>>>>>> needed.
>>>>>>>>>>>>> 3. DAG persistence needs to be implemented since they are not
>>>>>>>>>>>>> part of the DAG repository.
>>>>>>>>>>>>> 4. Same behavior for DAGs supported in API vs those defined in
>>>>>>>>>>>>> the repo, i.e. users write DAGs in the same syntax, and its scheduling,
>>>>>>>>>>>>> execution, and web server UI should behave the same way.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Since DAGs are written as code, running arbitrary code inside
>>>>>>>>>>>>> Airflow may pose high security risks. Here are a few proposals to stop the
>>>>>>>>>>>>> security breach:
>>>>>>>>>>>>> 1. Accept DAGs only from trusted parties. Airflow already
>>>>>>>>>>>>> supports pluggable authentication modules where strong authentication such
>>>>>>>>>>>>> as Kerberos can be used.
>>>>>>>>>>>>> 2. Execute DAG code as the API identity, i.e. A DAG created
>>>>>>>>>>>>> through the API service will have run_as_user set to be the API identity.
>>>>>>>>>>>>> 3. To enforce data access control on DAGs, the API identity
>>>>>>>>>>>>> should also be used to access the data warehouse.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We shared a demo based on a prototype implementation in the
>>>>>>>>>>>>> summit and some details are described in this ppt
>>>>>>>>>>>>> <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>,
>>>>>>>>>>>>> and would love to get feedback and comments from the community about this
>>>>>>>>>>>>> initiative.
>>>>>>>>>>>>>
>>>>>>>>>>>>> thanks
>>>>>>>>>>>>> Mocheng
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> Constance Martineau
>>>>>>>>>> Product Manager
>>>>>>>>>>
>>>>>>>>>> Email: constance@astronomer.io
>>>>>>>>>> Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> <https://www.astronomer.io/>
>>>>>>>>>>
>>>>>>>>>>

Re: [Proposal] Creating DAG through the REST api

Posted by Nishant Sharma <ni...@gmail.com>.
Hi,
I have also felt the need at times for creating DAGs through the REST API.
But I understand the security concerns associated with such an
implementation.

If not submission through the REST API, then at least some sort of
*development mode* in the Airflow interface to create and edit DAGs for a
user session. Not sure if this was brought up previously.

Thanks,
Nishant

On Thu, Aug 25, 2022 at 6:32 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> Just in case - please watch the devlist for the announcement of the "SIG
> multitenancy" group if it slips my mind.
>
> On Thu, Aug 25, 2022 at 1:31 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> Cool. I will make sure to include you ! I think this is something that
>> will happen in September, The holiday period is not the best to organize it.
>>
>> On Thu, Aug 25, 2022 at 5:50 AM Mocheng Guo <gm...@gmail.com> wrote:
>>
>>> My use case needs automation and security: those are the two key
>>> requirements and does not have to be REST API if there is another way that
>>> DAGs could be submitted to a cloud storage securely. Sure I would
>>> appreciate it if you could include me when organizing AIP-1 related
>>> meetings. Kerberos is a ticket based system in which a ticket has a limited
>>> lifetime. Using kerberos, a workload could be authenticated before
>>> persistence so that Airflow uses its kerberos keytab to execute, which is
>>> similar to the current implementation in worker, another possible scenarios
>>> is a persisted workload needs to include a kerberos renewable TGT to be
>>> used by Airflow worker, but this is more complex and I would be happy to
>>> discuss more in meetings. I will draft a more detailed document for review.
>>>
>>> thanks
>>> Mocheng
>>>
>>>
>>> On Thu, Aug 18, 2022 at 1:19 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>
>>>> None of those requirements are supported by Airflow. And opening REST
>>>> API does not solve the authentication use case you mentioned.
>>>>
>>>> This is a completely new requirement you have - basically what you want
>>>> is workflow identity and it should be rather independent from the way DAG
>>>> is submitted. It would require to attach some kind of identity and
>>>> signaturea and some way of making sure that the DAG has not been tampered
>>>> with, in a way that the worker could use the identity when executing the
>>>> workload and be sure that no-one else modified the DAG - including any of
>>>> the files that the DAG uses. This is an interesting case but it has nothing
>>>> to do with using or not the REST API. REST API alone will not give you the
>>>> user identity guarantees that you need here. The distributed nature of
>>>> Airflow basically requires such workflow identity has to be provided by
>>>> cryptographic signatures and verifying the integrity of the DAG rather than
>>>> basing it on REST API authentication.
>>>>
>>>> BTW. We do support already Kerberos authentication for some of our
>>>> operators but identity is necessarily per instance executing the workload -
>>>> not the user submitting the DAG.
>>>>
>>>> This could be one of the improvement proposals that could in the future
>>>> become a sub-AIP or  AIP-1 (Improve Airflow Security). if you are
>>>> interested in leading and proposing such an AIP i will be soon (a month or
>>>> so) re-establishing #sig-multitenancy meetings (see AIP-1 for recordings
>>>> and minutes of previous meetings). We already have AiP-43 and AIP-44
>>>> approved there (and AIP-43 close to completion) and the next steps should
>>>> be introducing fine graines security layer to executing the workloads.
>>>> Adding workload identity might be part of it. If you would like to work on
>>>> that - you are most welcome. It means to prepare and discuss proposals, get
>>>> consensus of involved parties, leading it to a vote and finally
>>>> implementing it.
>>>>
>>>> J
>>>>
>>>> czw., 18 sie 2022, 02:44 użytkownik Mocheng Guo <gm...@gmail.com>
>>>> napisał:
>>>>
>>>>> >> Could you please elaborate why this would be a problem to use those
>>>>> (really good for file pushing) APIs ?
>>>>>
>>>>> Submitting DAGs directly to cloud storage API does help for some part
>>>>> of the use case requirement, but cloud storage does not provide the
>>>>> security a data warehouse needs. A typical auth model supported in data
>>>>> warehouse is Kerberos, and a data warehouse provides limited view to a
>>>>> kerberos user with authorization rules. We need users to submit DAGs with
>>>>> identities supported by the data warehouse, so that Apache Spark jobs will
>>>>> be executed as the kerberos user who submits a DAG which in turns decide
>>>>> what data can be processed, there may also be need to handle impersonation,
>>>>> so there needs to be an additional layer to handle data warehouse auth e.g.
>>>>> kerberos.
>>>>>
>>>>> Assuming dags are already inside the cloud storage, and I think
>>>>> AIP-5/20 would work better than the current mono repo model if it could
>>>>> support better flexibility and less latency, and I would be very interested
>>>>> to be part of the design and implementation.
>>>>>
>>>>>
>>>>> On Fri, Aug 12, 2022 at 10:56 AM Jarek Potiuk <ja...@potiuk.com>
>>>>> wrote:
>>>>>
>>>>>> First appreciate all for your valuable feedback. Airflow by design
>>>>>> has to accept code, both Tomasz and Constance's examples let me think that
>>>>>> the security judgement should be on the actual DAGs rather than how DAGs
>>>>>> are accepted or a process itself. To expand a little bit more on another
>>>>>> example, say another service provides an API which can be invoked by its
>>>>>> clients the service validates user inputs e.g. SQL and generates Airflow
>>>>>> DAGs which use the validated operators/macros. Those DAGs are safe to be
>>>>>> pushed through the API. There are certainly cases that DAGs may not be
>>>>>> safe, e.g the API service on public cloud with shared tenants with no
>>>>>> knowledge how DAGs are generated, in such cases the API service can access
>>>>>> control the identity or even reject all calls when considered unsafe.
>>>>>> Please let me know if the example makes sense, and if there is a common
>>>>>> interest, having an Airflow native write path would benefit the community
>>>>>> instead of each building its own solution.
>>>>>>
>>>>>> > You seem to repeat more of the same. This is exactly what we want
>>>>>> to avoid. IF you can push a code over API you can push Any Code. And
>>>>>> precisely the "Access Control" you mentioned or rejecting the call when
>>>>>> "considering code unsafe" those are the decisions we already deliberately
>>>>>> decided we do not want Airflow REST API to make. Whether the code it's
>>>>>> generated or not does not matter because Airflow has no idea whatsoever if
>>>>>> it has been manipulated with, between the time it was generated and pushed.
>>>>>> The only way Airflow can know that the code is not manipulated with is when
>>>>>> it generates DAG code on its own based on a declarative input. The limit is
>>>>>> to push declarative information only. You CANNOT push code via the REST
>>>>>> API. This is out of the question. The case is closed.
>>>>>>
>>>>>> The middle loop usually happens on a Jupyter notebook, it needs to
>>>>>> change data/features used by model frequently which in turn leads to
>>>>>> Airflow DAG updates, do you mind elaborate how to automate the changes
>>>>>> inside a notebook and programmatically submitting DAGs through git+CI/CD
>>>>>> while giving user quick feedback? I understand git+ci/cd is technically
>>>>>> possible but the overhead involved is a major reason users rejecting
>>>>>> Airflow for other alternative solutions, e.g. git repo requires manual
>>>>>> approval even if DAGs can be programmatically submitted, and CI/CD are slow
>>>>>> offline processes with large repo.
>>>>>>
>>>>>> Case 2 is actually (if you attempt to read my article I posted above,
>>>>>> it's written there) the case where shared volume could still be used and
>>>>>> are bette. This why it's great that Airflow supports multiple DAG syncing
>>>>>> solutions because your "middle" environment does not have to have git sync
>>>>>> as it is not "production' (unless you want to mix development with testing
>>>>>> that is, which is terrible, terrible idea).
>>>>>>
>>>>>> Your data science for middle ground does:
>>>>>>
>>>>>> a) cp my_dag.py "/my_midle_volume_shared_and_mounted_locally". - if
>>>>>> you use shared volume of some sort (NFS/EFS etc.)
>>>>>> b) aws s3 cp my_dag.py "s3://my-midle-testing-bucket/" - if your dags
>>>>>> are on S3  and synced using s3-sync
>>>>>> c) gsutil cp my_dag.py "gs://my-bucket" - if your dags are on GCS and
>>>>>> synced using s3-sync
>>>>>>
>>>>>> Those are excellent "File push" apis. They do the job. I cannot
>>>>>> imagine why the middle-loop person might have a problem with using them.
>>>>>> All of that can also be  fully automated -  they all have nice Python and
>>>>>> other language APIs so you can even make the IDE run those commands
>>>>>> automatically on every save if you want.
>>>>>>
>>>>>> Could you please elaborate why this would be a problem to use those
>>>>>> (really good for file pushing) APIs ?
>>>>>>
>>>>>> J.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 12, 2022 at 6:20 PM Mocheng Guo <gm...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> First appreciate all for your valuable feedback. Airflow by design
>>>>>>> has to accept code, both Tomasz and Constance's examples let me think that
>>>>>>> the security judgement should be on the actual DAGs rather than how DAGs
>>>>>>> are accepted or a process itself. To expand a little bit more on another
>>>>>>> example, say another service provides an API which can be invoked by its
>>>>>>> clients the service validates user inputs e.g. SQL and generates Airflow
>>>>>>> DAGs which use the validated operators/macros. Those DAGs are safe to be
>>>>>>> pushed through the API. There are certainly cases that DAGs may not be
>>>>>>> safe, e.g the API service on public cloud with shared tenants with no
>>>>>>> knowledge how DAGs are generated, in such cases the API service can access
>>>>>>> control the identity or even reject all calls when considered unsafe.
>>>>>>> Please let me know if the example makes sense, and if there is a common
>>>>>>> interest, having an Airflow native write path would benefit the community
>>>>>>> instead of each building its own solution.
>>>>>>>
>>>>>>> Hi Xiaodong/Jarek, for your suggestion let me elaborate on a use
>>>>>>> case, here are three loops a data scientist is doing to develop a machine
>>>>>>> learning model:
>>>>>>> - inner loop: iterates on the model locally.
>>>>>>> - middle loop: iterate the model on a remote cluster with production
>>>>>>> data, say it's using Airflow DAGs behind the scenes.
>>>>>>> - outer loop: done with iteration and publish the model on
>>>>>>> production.
>>>>>>> The middle loop usually happens on a Jupyter notebook, it needs to
>>>>>>> change data/features used by model frequently which in turn leads to
>>>>>>> Airflow DAG updates, do you mind elaborate how to automate the changes
>>>>>>> inside a notebook and programmatically submitting DAGs through git+CI/CD
>>>>>>> while giving user quick feedback? I understand git+ci/cd is technically
>>>>>>> possible but the overhead involved is a major reason users rejecting
>>>>>>> Airflow for other alternative solutions, e.g. git repo requires manual
>>>>>>> approval even if DAGs can be programmatically submitted, and CI/CD are slow
>>>>>>> offline processes with large repo.
>>>>>>>
>>>>>>> Such use case is pretty common for data scientists, and a better
>>>>>>> **online** service model would help open up more possibilities for Airflow
>>>>>>> and its users, as additional layers providing more values(like Constance
>>>>>>> mentioned enable users with no engineering or airflow domain knowledge to
>>>>>>> use Airflow) could be built on top of Airflow which remains as a lower
>>>>>>> level orchestration engine.
>>>>>>>
>>>>>>> thanks
>>>>>>> Mocheng
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Aug 11, 2022 at 10:46 PM Jarek Potiuk <ja...@potiuk.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I really like the Idea of Tomek.
>>>>>>>>
>>>>>>>> If we ever go (which is not unlikely) - some "standard" declarative
>>>>>>>> way of describing DAGs, all my security, packaging concerns are gone - and
>>>>>>>> submitting such declarative DAG via API is quite viable. Simply submitting
>>>>>>>> a Python code this way is a no-go for me :). Such Declarative DAG could be
>>>>>>>> just stored in the DB and scheduled and executed using only "declaration"
>>>>>>>> from the DB - without ever touching the DAG "folder" and without allowing
>>>>>>>> the user to submit any executable code this way. All the code to execute
>>>>>>>> would already have to be in Airflow already in this case.
>>>>>>>>
>>>>>>>> And I very much agree also that this case can be solved with Git. I
>>>>>>>> think we are generally undervaluing the role Git plays for DAG distribution
>>>>>>>> of Airflow.
>>>>>>>>
>>>>>>>> I think when the user feels the need (I very much understand the
>>>>>>>> need Constance) to submit the DAG via API,  rather than adding the option
>>>>>>>> of submitting the DAG code via "Airflow REST API", we should simply answer
>>>>>>>> this:
>>>>>>>>
>>>>>>>> *Use Git and git sync. Then "Git Push" then becomes the standard
>>>>>>>> "API" you wanted to push the code.*
>>>>>>>>
>>>>>>>> This has all the flexibility you need, it has integration with Pull
>>>>>>>> Request, CI workflows, keeps history etc.etc. When we tell people "Use Git"
>>>>>>>> - we have ALL of that and more for free. Standing on the shoulders of
>>>>>>>> giants.
>>>>>>>> If we start thinking about integration of code push via our own API
>>>>>>>> - we basically start the journey of rewriting Git as eventually we will
>>>>>>>> have to support those cases. This makes absolutely no sense for me.
>>>>>>>>
>>>>>>>> I even start to think that we should make "git sync" a separate
>>>>>>>> (and much more viable) option that is pretty much the "main recommendation"
>>>>>>>> for Airflow. rather than "yet another option among shared folders and baked
>>>>>>>> in DAGs" case.
>>>>>>>>
>>>>>>>> I recently even wrote my thoughts about it in this post: "Shared
>>>>>>>> Volumes in Airflow - the good, the bad and the ugly":
>>>>>>>> https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca
>>>>>>>> which has much more details on why I think so.
>>>>>>>>
>>>>>>>> J.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Aug 11, 2022 at 8:43 PM Constance Martineau
>>>>>>>> <co...@astronomer.io.invalid> wrote:
>>>>>>>>
>>>>>>>>> I understand the security concerns, and generally agree, but as a
>>>>>>>>> regular user I always wished we could upload DAG files via an API. It opens
>>>>>>>>> the door to have an upload button, which would be nice. It would make
>>>>>>>>> Airflow a lot more accessible to non-engineering types.
>>>>>>>>>
>>>>>>>>> I love the idea of implementing a manual review option in
>>>>>>>>> conjunction with some sort of hook (similar to Airflow cluster policies)
>>>>>>>>> would be a good middle ground. An administrator could use that hook to do
>>>>>>>>> checks against DAGs or run security scanners, and decide whether or not to
>>>>>>>>> implement a review requirement.
>>>>>>>>>
>>>>>>>>> On Thu, Aug 11, 2022 at 1:54 PM Tomasz Urbaszek <
>>>>>>>>> turbaszek@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> In general I second what XD said. CI/CD feels better than sending
>>>>>>>>>> DAG files over API and the security issues arising from accepting "any
>>>>>>>>>> python file" are probably quite big.
>>>>>>>>>>
>>>>>>>>>> However, I think this proposal can be tightly related to
>>>>>>>>>> "declarative DAGs". Instead of sending a DAG file, the user would send the
>>>>>>>>>> DAG definition (operators, inputs, relations) in a predefined format
>>>>>>>>>> that is not a code. This of course has some limitations like inability to
>>>>>>>>>> define custom macros, callbacks on the fly but it may be a good compromise.
>>>>>>>>>>
>>>>>>>>>> Other thought - if we implement something like "DAG via API" then
>>>>>>>>>> we should consider adding an option to review DAGs (approval queue etc) to
>>>>>>>>>> reduce security issues that are mitigated by for example deploying DAGs
>>>>>>>>>> from git (where we have code review, security scanners etc).
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Tomek
>>>>>>>>>>
>>>>>>>>>> On Thu, 11 Aug 2022 at 17:50, Xiaodong Deng <xd...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Mocheng,
>>>>>>>>>>>
>>>>>>>>>>> Please allow me to share a question first: so in your proposal,
>>>>>>>>>>> the API in your plan is still accepting an Airflow DAG as the payload (just
>>>>>>>>>>> binarized or compressed), right?
>>>>>>>>>>>
>>>>>>>>>>> If that's the case, I may not be fully convinced: the objectives
>>>>>>>>>>> in your proposal is about automation & programmatically submitting DAGs.
>>>>>>>>>>> These can already be achieved in an efficient way through CI/CD practice +
>>>>>>>>>>> a centralized place to manage your DAGs (e.g. a Git Repo to host the DAG
>>>>>>>>>>> files).
>>>>>>>>>>>
>>>>>>>>>>> As you are already aware, allowing this via API adds additional
>>>>>>>>>>> security concern, and I would doubt if that "breaks even".
>>>>>>>>>>>
>>>>>>>>>>> Kindly let me know if I have missed anything or misunderstood
>>>>>>>>>>> your proposal. Thanks.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> XD
>>>>>>>>>>> ----------------------------------------------------------------
>>>>>>>>>>> (This is not a contribution)
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gm...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> I have an enhancement proposal for the REST API service. This
>>>>>>>>>>>> is based on the observations that Airflow users want to be able to access
>>>>>>>>>>>> Airflow more easily as a platform service.
>>>>>>>>>>>>
>>>>>>>>>>>> The motivation comes from the following use cases:
>>>>>>>>>>>> 1. Users like data scientists want to iterate over data quickly
>>>>>>>>>>>> with interactive feedback in minutes, e.g. managing data pipelines inside
>>>>>>>>>>>> Jupyter Notebook while executing them in a remote airflow cluster.
>>>>>>>>>>>> 2. Services targeting specific audiences can generate DAGs
>>>>>>>>>>>> based on inputs like user command or external triggers, and they want to be
>>>>>>>>>>>> able to submit DAGs programmatically without manual intervention.
>>>>>>>>>>>>
>>>>>>>>>>>> I believe such use cases would help promote Airflow usability
>>>>>>>>>>>> and gain more customer popularity. The existing DAG repo brings
>>>>>>>>>>>> considerable overhead for such scenarios, a shared repo requires offline
>>>>>>>>>>>> processes and can be slow to rollout.
>>>>>>>>>>>>
>>>>>>>>>>>> The proposal aims to provide an alternative where a DAG can be
>>>>>>>>>>>> transmitted online and here are some key points:
>>>>>>>>>>>> 1. A DAG is packaged individually so that it can be
>>>>>>>>>>>> distributable over the network. For example, a DAG may be a serialized
>>>>>>>>>>>> binary or a zip file.
>>>>>>>>>>>> 2. The Airflow REST API is the ideal place to talk with the
>>>>>>>>>>>> external world. The API would provide a generic interface to accept DAG
>>>>>>>>>>>> artifacts and should be extensible to support different artifact formats if
>>>>>>>>>>>> needed.
>>>>>>>>>>>> 3. DAG persistence needs to be implemented since they are not
>>>>>>>>>>>> part of the DAG repository.
>>>>>>>>>>>> 4. Same behavior for DAGs supported in API vs those defined in
>>>>>>>>>>>> the repo, i.e. users write DAGs in the same syntax, and its scheduling,
>>>>>>>>>>>> execution, and web server UI should behave the same way.
>>>>>>>>>>>>
>>>>>>>>>>>> Since DAGs are written as code, running arbitrary code inside
>>>>>>>>>>>> Airflow may pose high security risks. Here are a few proposals to stop the
>>>>>>>>>>>> security breach:
>>>>>>>>>>>> 1. Accept DAGs only from trusted parties. Airflow already
>>>>>>>>>>>> supports pluggable authentication modules where strong authentication such
>>>>>>>>>>>> as Kerberos can be used.
>>>>>>>>>>>> 2. Execute DAG code as the API identity, i.e. A DAG created
>>>>>>>>>>>> through the API service will have run_as_user set to be the API identity.
>>>>>>>>>>>> 3. To enforce data access control on DAGs, the API identity
>>>>>>>>>>>> should also be used to access the data warehouse.
>>>>>>>>>>>>
>>>>>>>>>>>> We shared a demo based on a prototype implementation in the
>>>>>>>>>>>> summit and some details are described in this ppt
>>>>>>>>>>>> <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>,
>>>>>>>>>>>> and would love to get feedback and comments from the community about this
>>>>>>>>>>>> initiative.
>>>>>>>>>>>>
>>>>>>>>>>>> thanks
>>>>>>>>>>>> Mocheng
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Constance Martineau
>>>>>>>>> Product Manager
>>>>>>>>>
>>>>>>>>> Email: constance@astronomer.io
>>>>>>>>> Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> <https://www.astronomer.io/>
>>>>>>>>>
>>>>>>>>>

Re: [Proposal] Creating DAG through the REST api

Posted by Jarek Potiuk <ja...@potiuk.com>.
Just in case - please watch the devlist for the announcement of the "SIG
multitenancy" group if it slips my mind.

On Thu, Aug 25, 2022 at 1:31 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> Cool. I will make sure to include you ! I think this is something that
> will happen in September, The holiday period is not the best to organize it.
>
> On Thu, Aug 25, 2022 at 5:50 AM Mocheng Guo <gm...@gmail.com> wrote:
>
>> My use case needs automation and security: those are the two key
>> requirements and does not have to be REST API if there is another way that
>> DAGs could be submitted to a cloud storage securely. Sure I would
>> appreciate it if you could include me when organizing AIP-1 related
>> meetings. Kerberos is a ticket based system in which a ticket has a limited
>> lifetime. Using kerberos, a workload could be authenticated before
>> persistence so that Airflow uses its kerberos keytab to execute, which is
>> similar to the current implementation in worker, another possible scenarios
>> is a persisted workload needs to include a kerberos renewable TGT to be
>> used by Airflow worker, but this is more complex and I would be happy to
>> discuss more in meetings. I will draft a more detailed document for review.
>>
>> thanks
>> Mocheng
>>
>>
>> On Thu, Aug 18, 2022 at 1:19 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>>> None of those requirements are supported by Airflow. And opening REST
>>> API does not solve the authentication use case you mentioned.
>>>
>>> This is a completely new requirement you have - basically what you want
>>> is workflow identity and it should be rather independent from the way DAG
>>> is submitted. It would require to attach some kind of identity and
>>> signaturea and some way of making sure that the DAG has not been tampered
>>> with, in a way that the worker could use the identity when executing the
>>> workload and be sure that no-one else modified the DAG - including any of
>>> the files that the DAG uses. This is an interesting case but it has nothing
>>> to do with using or not the REST API. REST API alone will not give you the
>>> user identity guarantees that you need here. The distributed nature of
>>> Airflow basically requires such workflow identity has to be provided by
>>> cryptographic signatures and verifying the integrity of the DAG rather than
>>> basing it on REST API authentication.
>>>
>>> BTW. We do support already Kerberos authentication for some of our
>>> operators but identity is necessarily per instance executing the workload -
>>> not the user submitting the DAG.
>>>
>>> This could be one of the improvement proposals that could in the future
>>> become a sub-AIP or  AIP-1 (Improve Airflow Security). if you are
>>> interested in leading and proposing such an AIP i will be soon (a month or
>>> so) re-establishing #sig-multitenancy meetings (see AIP-1 for recordings
>>> and minutes of previous meetings). We already have AiP-43 and AIP-44
>>> approved there (and AIP-43 close to completion) and the next steps should
>>> be introducing fine graines security layer to executing the workloads.
>>> Adding workload identity might be part of it. If you would like to work on
>>> that - you are most welcome. It means to prepare and discuss proposals, get
>>> consensus of involved parties, leading it to a vote and finally
>>> implementing it.
>>>
>>> J
>>>
>>> czw., 18 sie 2022, 02:44 użytkownik Mocheng Guo <gm...@gmail.com>
>>> napisał:
>>>
>>>> >> Could you please elaborate why this would be a problem to use those
>>>> (really good for file pushing) APIs ?
>>>>
>>>> Submitting DAGs directly to cloud storage API does help for some part
>>>> of the use case requirement, but cloud storage does not provide the
>>>> security a data warehouse needs. A typical auth model supported in data
>>>> warehouse is Kerberos, and a data warehouse provides limited view to a
>>>> kerberos user with authorization rules. We need users to submit DAGs with
>>>> identities supported by the data warehouse, so that Apache Spark jobs will
>>>> be executed as the kerberos user who submits a DAG which in turns decide
>>>> what data can be processed, there may also be need to handle impersonation,
>>>> so there needs to be an additional layer to handle data warehouse auth e.g.
>>>> kerberos.
>>>>
>>>> Assuming dags are already inside the cloud storage, and I think
>>>> AIP-5/20 would work better than the current mono repo model if it could
>>>> support better flexibility and less latency, and I would be very interested
>>>> to be part of the design and implementation.
>>>>
>>>>
>>>> On Fri, Aug 12, 2022 at 10:56 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>
>>>>> First appreciate all for your valuable feedback. Airflow by design has
>>>>> to accept code, both Tomasz and Constance's examples let me think that the
>>>>> security judgement should be on the actual DAGs rather than how DAGs are
>>>>> accepted or a process itself. To expand a little bit more on another
>>>>> example, say another service provides an API which can be invoked by its
>>>>> clients the service validates user inputs e.g. SQL and generates Airflow
>>>>> DAGs which use the validated operators/macros. Those DAGs are safe to be
>>>>> pushed through the API. There are certainly cases that DAGs may not be
>>>>> safe, e.g the API service on public cloud with shared tenants with no
>>>>> knowledge how DAGs are generated, in such cases the API service can access
>>>>> control the identity or even reject all calls when considered unsafe.
>>>>> Please let me know if the example makes sense, and if there is a common
>>>>> interest, having an Airflow native write path would benefit the community
>>>>> instead of each building its own solution.
>>>>>
>>>>> > You seem to repeat more of the same. This is exactly what we want to
>>>>> avoid. IF you can push a code over API you can push Any Code. And precisely
>>>>> the "Access Control" you mentioned or rejecting the call when "considering
>>>>> code unsafe" those are the decisions we already deliberately decided we do
>>>>> not want Airflow REST API to make. Whether the code it's generated or not
>>>>> does not matter because Airflow has no idea whatsoever if it has been
>>>>> manipulated with, between the time it was generated and pushed. The only
>>>>> way Airflow can know that the code is not manipulated with is when it
>>>>> generates DAG code on its own based on a declarative input. The limit is to
>>>>> push declarative information only. You CANNOT push code via the REST API.
>>>>> This is out of the question. The case is closed.
>>>>>
>>>>> The middle loop usually happens on a Jupyter notebook, it needs to
>>>>> change data/features used by model frequently which in turn leads to
>>>>> Airflow DAG updates, do you mind elaborate how to automate the changes
>>>>> inside a notebook and programmatically submitting DAGs through git+CI/CD
>>>>> while giving user quick feedback? I understand git+ci/cd is technically
>>>>> possible but the overhead involved is a major reason users rejecting
>>>>> Airflow for other alternative solutions, e.g. git repo requires manual
>>>>> approval even if DAGs can be programmatically submitted, and CI/CD are slow
>>>>> offline processes with large repo.
>>>>>
>>>>> Case 2 is actually (if you attempt to read my article I posted above,
>>>>> it's written there) the case where shared volume could still be used and
>>>>> are bette. This why it's great that Airflow supports multiple DAG syncing
>>>>> solutions because your "middle" environment does not have to have git sync
>>>>> as it is not "production' (unless you want to mix development with testing
>>>>> that is, which is terrible, terrible idea).
>>>>>
>>>>> Your data science for middle ground does:
>>>>>
>>>>> a) cp my_dag.py "/my_midle_volume_shared_and_mounted_locally". - if
>>>>> you use shared volume of some sort (NFS/EFS etc.)
>>>>> b) aws s3 cp my_dag.py "s3://my-midle-testing-bucket/" - if your dags
>>>>> are on S3  and synced using s3-sync
>>>>> c) gsutil cp my_dag.py "gs://my-bucket" - if your dags are on GCS and
>>>>> synced using s3-sync
>>>>>
>>>>> Those are excellent "File push" apis. They do the job. I cannot
>>>>> imagine why the middle-loop person might have a problem with using them.
>>>>> All of that can also be  fully automated -  they all have nice Python and
>>>>> other language APIs so you can even make the IDE run those commands
>>>>> automatically on every save if you want.
>>>>>
>>>>> Could you please elaborate why this would be a problem to use those
>>>>> (really good for file pushing) APIs ?
>>>>>
>>>>> J.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Aug 12, 2022 at 6:20 PM Mocheng Guo <gm...@gmail.com> wrote:
>>>>>
>>>>>> First, I appreciate all of your valuable feedback. Airflow by design
>>>>>> has to accept code; both Tomasz and Constance's examples make me think that
>>>>>> the security judgement should be on the actual DAGs rather than on how DAGs
>>>>>> are accepted or on the process itself. To expand a little bit more with another
>>>>>> example, say another service provides an API which can be invoked by its
>>>>>> clients; the service validates user inputs (e.g. SQL) and generates Airflow
>>>>>> DAGs which use the validated operators/macros. Those DAGs are safe to be
>>>>>> pushed through the API. There are certainly cases where DAGs may not be
>>>>>> safe, e.g. an API service on a public cloud with shared tenants and no
>>>>>> knowledge of how DAGs are generated; in such cases the API service can apply
>>>>>> access control to the identity or even reject all calls when considered unsafe.
>>>>>> Please let me know if the example makes sense; if there is common
>>>>>> interest, having an Airflow-native write path would benefit the community
>>>>>> instead of each party building its own solution.
>>>>>>
>>>>>> Hi Xiaodong/Jarek, regarding your suggestion let me elaborate on a use
>>>>>> case. Here are the three loops a data scientist goes through to develop a
>>>>>> machine learning model:
>>>>>> - inner loop: iterate on the model locally.
>>>>>> - middle loop: iterate on the model on a remote cluster with production
>>>>>> data, say using Airflow DAGs behind the scenes.
>>>>>> - outer loop: done with iteration, publish the model to production.
>>>>>> The middle loop usually happens in a Jupyter notebook. It needs to
>>>>>> change the data/features used by the model frequently, which in turn leads to
>>>>>> Airflow DAG updates. Do you mind elaborating on how to automate the changes
>>>>>> inside a notebook and programmatically submit DAGs through git+CI/CD
>>>>>> while giving the user quick feedback? I understand git+CI/CD is technically
>>>>>> possible, but the overhead involved is a major reason users reject
>>>>>> Airflow for alternative solutions, e.g. a git repo requires manual
>>>>>> approval even if DAGs can be programmatically submitted, and CI/CD is a slow
>>>>>> offline process with a large repo.
>>>>>>
>>>>>> Such a use case is pretty common for data scientists, and a better
>>>>>> **online** service model would help open up more possibilities for Airflow
>>>>>> and its users, as additional layers providing more value (like Constance
>>>>>> mentioned, enabling users with no engineering or Airflow domain knowledge to
>>>>>> use Airflow) could be built on top of Airflow, which remains a lower-level
>>>>>> orchestration engine.
>>>>>>
>>>>>> thanks
>>>>>> Mocheng
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 11, 2022 at 10:46 PM Jarek Potiuk <ja...@potiuk.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I really like the Idea of Tomek.
>>>>>>>
>>>>>>> If we ever go for (which is not unlikely) some "standard" declarative
>>>>>>> way of describing DAGs, all my security and packaging concerns are gone - and
>>>>>>> submitting such a declarative DAG via API is quite viable. Simply submitting
>>>>>>> Python code this way is a no-go for me :). Such a declarative DAG could be
>>>>>>> just stored in the DB and scheduled and executed using only the "declaration"
>>>>>>> from the DB - without ever touching the DAG "folder" and without allowing
>>>>>>> the user to submit any executable code this way. All the code to execute
>>>>>>> would already have to be in Airflow in this case.
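>>>>>>>
>>>>>>> Purely for illustration - a sketch of what such a declarative payload could
>>>>>>> look like if it only referenced operators already installed in Airflow; the
>>>>>>> endpoint and field names here are hypothetical, not an existing Airflow API:
>>>>>>>
>>>>>>> import requests
>>>>>>>
>>>>>>> declarative_dag = {
>>>>>>>     "dag_id": "example_declarative_dag",
>>>>>>>     "schedule": "@daily",
>>>>>>>     "tasks": [
>>>>>>>         {"task_id": "extract", "operator": "BashOperator",
>>>>>>>          "params": {"bash_command": "echo extract"}},
>>>>>>>         {"task_id": "load", "operator": "BashOperator",
>>>>>>>          "params": {"bash_command": "echo load"}},
>>>>>>>     ],
>>>>>>>     "dependencies": [["extract", "load"]],
>>>>>>> }
>>>>>>>
>>>>>>> # Hypothetical endpoint - Airflow does not expose this today.
>>>>>>> requests.post("https://airflow.example.com/api/v1/declarativeDags",
>>>>>>>               json=declarative_dag)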
>>>>>>>
>>>>>>> And I very much agree also that this case can be solved with Git. I
>>>>>>> think we are generally undervaluing the role Git plays for DAG distribution
>>>>>>> of Airflow.
>>>>>>>
>>>>>>> I think when the user feels the need (I very much understand the
>>>>>>> need Constance) to submit the DAG via API,  rather than adding the option
>>>>>>> of submitting the DAG code via "Airflow REST API", we should simply answer
>>>>>>> this:
>>>>>>>
>>>>>>> *Use Git and git sync. "Git push" then becomes the standard
>>>>>>> "API" you wanted for pushing the code.*
>>>>>>>
>>>>>>> This has all the flexibility you need, it has integration with Pull
>>>>>>> Request, CI workflows, keeps history etc.etc. When we tell people "Use Git"
>>>>>>> - we have ALL of that and more for free. Standing on the shoulders of
>>>>>>> giants.
>>>>>>> If we start thinking about integration of code push via our own API
>>>>>>> - we basically start the journey of rewriting Git as eventually we will
>>>>>>> have to support those cases. This makes absolutely no sense for me.
>>>>>>>
>>>>>>> I even start to think that we should make "git sync" a separate (and
>>>>>>> much more viable) option that is pretty much the "main recommendation" for
>>>>>>> Airflow, rather than the "yet another option among shared folders and baked-in
>>>>>>> DAGs" case.
>>>>>>>
>>>>>>> I recently even wrote my thoughts about it in this post: "Shared
>>>>>>> Volumes in Airflow - the good, the bad and the ugly":
>>>>>>> https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca
>>>>>>> which has much more details on why I think so.
>>>>>>>
>>>>>>> J.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Aug 11, 2022 at 8:43 PM Constance Martineau
>>>>>>> <co...@astronomer.io.invalid> wrote:
>>>>>>>
>>>>>>>> I understand the security concerns, and generally agree, but as a
>>>>>>>> regular user I always wished we could upload DAG files via an API. It opens
>>>>>>>> the door to have an upload button, which would be nice. It would make
>>>>>>>> Airflow a lot more accessible to non-engineering types.
>>>>>>>>
>>>>>>>> I love the idea of implementing a manual review option; in
>>>>>>>> conjunction with some sort of hook (similar to Airflow cluster policies), it
>>>>>>>> would be a good middle ground. An administrator could use that hook to do
>>>>>>>> checks against DAGs or run security scanners, and decide whether or not to
>>>>>>>> implement a review requirement.
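>>>>>>>>
>>>>>>>> As a rough sketch of the kind of automated check such a hook could run -
>>>>>>>> this reuses the existing cluster policy mechanism in
>>>>>>>> airflow_local_settings.py (assuming Airflow 2.x), and the specific rules
>>>>>>>> are only examples:
>>>>>>>>
>>>>>>>> from airflow.exceptions import AirflowClusterPolicyViolation
>>>>>>>>
>>>>>>>> def dag_policy(dag):
>>>>>>>>     # Reject DAGs that do not declare an explicit owner or a team tag,
>>>>>>>>     # regardless of how the DAG file reached the DAGs folder.
>>>>>>>>     if not dag.owner or dag.owner == "airflow":
>>>>>>>>         raise AirflowClusterPolicyViolation(
>>>>>>>>             f"DAG {dag.dag_id} must declare an explicit owner")
>>>>>>>>     if not any(tag.startswith("team:") for tag in (dag.tags or [])):
>>>>>>>>         raise AirflowClusterPolicyViolation(
>>>>>>>>             f"DAG {dag.dag_id} must carry a team:<name> tag")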
>>>>>>>>
>>>>>>>> On Thu, Aug 11, 2022 at 1:54 PM Tomasz Urbaszek <
>>>>>>>> turbaszek@apache.org> wrote:
>>>>>>>>
>>>>>>>>> In general I second what XD said. CI/CD feels better than sending
>>>>>>>>> DAG files over API and the security issues arising from accepting "any
>>>>>>>>> python file" are probably quite big.
>>>>>>>>>
>>>>>>>>> However, I think this proposal can be tightly related to
>>>>>>>>> "declarative DAGs". Instead of sending a DAG file, the user would send the
>>>>>>>>> DAG definition (operators, inputs, relations) in a predefined format
>>>>>>>>> that is not code. This of course has some limitations, like the inability to
>>>>>>>>> define custom macros or callbacks on the fly, but it may be a good compromise.
>>>>>>>>>
>>>>>>>>> Other thought - if we implement something like "DAG via API" then
>>>>>>>>> we should consider adding an option to review DAGs (approval queue etc) to
>>>>>>>>> reduce security issues that are mitigated by for example deploying DAGs
>>>>>>>>> from git (where we have code review, security scanners etc).
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Tomek
>>>>>>>>>
>>>>>>>>> On Thu, 11 Aug 2022 at 17:50, Xiaodong Deng <xd...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Mocheng,
>>>>>>>>>>
>>>>>>>>>> Please allow me to share a question first: so in your proposal,
>>>>>>>>>> the API in your plan is still accepting an Airflow DAG as the payload (just
>>>>>>>>>> binarized or compressed), right?
>>>>>>>>>>
>>>>>>>>>> If that's the case, I may not be fully convinced: the objectives
>>>>>>>>>> in your proposal are about automation & programmatically submitting DAGs.
>>>>>>>>>> These can already be achieved in an efficient way through CI/CD practice +
>>>>>>>>>> a centralized place to manage your DAGs (e.g. a Git Repo to host the DAG
>>>>>>>>>> files).
>>>>>>>>>>
>>>>>>>>>> As you are already aware, allowing this via API adds additional
>>>>>>>>>> security concerns, and I would doubt whether that "breaks even".
>>>>>>>>>>
>>>>>>>>>> Kindly let me know if I have missed anything or misunderstood
>>>>>>>>>> your proposal. Thanks.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> XD
>>>>>>>>>> ----------------------------------------------------------------
>>>>>>>>>> (This is not a contribution)
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gm...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>
>>>>>>>>>>> I have an enhancement proposal for the REST API service. This is
>>>>>>>>>>> based on the observations that Airflow users want to be able to access
>>>>>>>>>>> Airflow more easily as a platform service.
>>>>>>>>>>>
>>>>>>>>>>> The motivation comes from the following use cases:
>>>>>>>>>>> 1. Users like data scientists want to iterate over data quickly
>>>>>>>>>>> with interactive feedback in minutes, e.g. managing data pipelines inside
>>>>>>>>>>> Jupyter Notebook while executing them in a remote airflow cluster.
>>>>>>>>>>> 2. Services targeting specific audiences can generate DAGs based
>>>>>>>>>>> on inputs like user command or external triggers, and they want to be able
>>>>>>>>>>> to submit DAGs programmatically without manual intervention.
>>>>>>>>>>>
>>>>>>>>>>> I believe such use cases would help promote Airflow usability
>>>>>>>>>>> and gain more customer popularity. The existing DAG repo brings
>>>>>>>>>>> considerable overhead for such scenarios, a shared repo requires offline
>>>>>>>>>>> processes and can be slow to rollout.
>>>>>>>>>>>
>>>>>>>>>>> The proposal aims to provide an alternative where a DAG can be
>>>>>>>>>>> transmitted online and here are some key points:
>>>>>>>>>>> 1. A DAG is packaged individually so that it can be
>>>>>>>>>>> distributable over the network. For example, a DAG may be a serialized
>>>>>>>>>>> binary or a zip file.
>>>>>>>>>>> 2. The Airflow REST API is the ideal place to talk with the
>>>>>>>>>>> external world. The API would provide a generic interface to accept DAG
>>>>>>>>>>> artifacts and should be extensible to support different artifact formats if
>>>>>>>>>>> needed.
>>>>>>>>>>> 3. DAG persistence needs to be implemented since they are not
>>>>>>>>>>> part of the DAG repository.
>>>>>>>>>>> 4. Same behavior for DAGs supported in API vs those defined in
>>>>>>>>>>> the repo, i.e. users write DAGs in the same syntax, and its scheduling,
>>>>>>>>>>> execution, and web server UI should behave the same way.
>>>>>>>>>>>
>>>>>>>>>>> Since DAGs are written as code, running arbitrary code inside
>>>>>>>>>>> Airflow may pose high security risks. Here are a few proposals to stop the
>>>>>>>>>>> security breach:
>>>>>>>>>>> 1. Accept DAGs only from trusted parties. Airflow already
>>>>>>>>>>> supports pluggable authentication modules where strong authentication such
>>>>>>>>>>> as Kerberos can be used.
>>>>>>>>>>> 2. Execute DAG code as the API identity, i.e. A DAG created
>>>>>>>>>>> through the API service will have run_as_user set to be the API identity.
>>>>>>>>>>> 3. To enforce data access control on DAGs, the API identity
>>>>>>>>>>> should also be used to access the data warehouse.
>>>>>>>>>>>
>>>>>>>>>>> We shared a demo based on a prototype implementation in the
>>>>>>>>>>> summit and some details are described in this ppt
>>>>>>>>>>> <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>,
>>>>>>>>>>> and would love to get feedback and comments from the community about this
>>>>>>>>>>> initiative.
>>>>>>>>>>>
>>>>>>>>>>> thanks
>>>>>>>>>>> Mocheng
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Constance Martineau
>>>>>>>> Product Manager
>>>>>>>>
>>>>>>>> Email: constance@astronomer.io
>>>>>>>> Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
>>>>>>>>
>>>>>>>>
>>>>>>>> <https://www.astronomer.io/>
>>>>>>>>
>>>>>>>>

Re: [Proposal] Creating DAG through the REST api

Posted by Jarek Potiuk <ja...@potiuk.com>.
Cool. I will make sure to include you! I think this is something that will
happen in September; the holiday period is not the best time to organize it.

On Thu, Aug 25, 2022 at 5:50 AM Mocheng Guo <gm...@gmail.com> wrote:

> My use case needs automation and security: those are the two key
> requirements and does not have to be REST API if there is another way that
> DAGs could be submitted to a cloud storage securely. Sure I would
> appreciate it if you could include me when organizing AIP-1 related
> meetings. Kerberos is a ticket based system in which a ticket has a limited
> lifetime. Using kerberos, a workload could be authenticated before
> persistence so that Airflow uses its kerberos keytab to execute, which is
> similar to the current implementation in worker, another possible scenarios
> is a persisted workload needs to include a kerberos renewable TGT to be
> used by Airflow worker, but this is more complex and I would be happy to
> discuss more in meetings. I will draft a more detailed document for review.
>
> thanks
> Mocheng
>
>
> On Thu, Aug 18, 2022 at 1:19 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> None of those requirements are supported by Airflow. And opening REST API
>> does not solve the authentication use case you mentioned.
>>
>> This is a completely new requirement you have - basically what you want
>> is workflow identity and it should be rather independent from the way DAG
>> is submitted. It would require to attach some kind of identity and
>> signaturea and some way of making sure that the DAG has not been tampered
>> with, in a way that the worker could use the identity when executing the
>> workload and be sure that no-one else modified the DAG - including any of
>> the files that the DAG uses. This is an interesting case but it has nothing
>> to do with using or not the REST API. REST API alone will not give you the
>> user identity guarantees that you need here. The distributed nature of
>> Airflow basically requires such workflow identity has to be provided by
>> cryptographic signatures and verifying the integrity of the DAG rather than
>> basing it on REST API authentication.
>>
>> BTW. We do support already Kerberos authentication for some of our
>> operators but identity is necessarily per instance executing the workload -
>> not the user submitting the DAG.
>>
>> This could be one of the improvement proposals that could in the future
>> become a sub-AIP or  AIP-1 (Improve Airflow Security). if you are
>> interested in leading and proposing such an AIP i will be soon (a month or
>> so) re-establishing #sig-multitenancy meetings (see AIP-1 for recordings
>> and minutes of previous meetings). We already have AiP-43 and AIP-44
>> approved there (and AIP-43 close to completion) and the next steps should
>> be introducing fine graines security layer to executing the workloads.
>> Adding workload identity might be part of it. If you would like to work on
>> that - you are most welcome. It means to prepare and discuss proposals, get
>> consensus of involved parties, leading it to a vote and finally
>> implementing it.
>>
>> J
>>
>> czw., 18 sie 2022, 02:44 użytkownik Mocheng Guo <gm...@gmail.com>
>> napisał:
>>
>>> >> Could you please elaborate why this would be a problem to use those
>>> (really good for file pushing) APIs ?
>>>
>>> Submitting DAGs directly to a cloud storage API does help with some part of
>>> the use case requirement, but cloud storage does not provide the security a
>>> data warehouse needs. A typical auth model supported in a data warehouse is
>>> Kerberos, and a data warehouse provides a limited view to a Kerberos user
>>> with authorization rules. We need users to submit DAGs with identities
>>> supported by the data warehouse, so that Apache Spark jobs will be executed
>>> as the Kerberos user who submits a DAG, which in turn decides what data can
>>> be processed; there may also be a need to handle impersonation, so there
>>> needs to be an additional layer to handle data warehouse auth, e.g.
>>> Kerberos.
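>>>
>>> To make the requirement concrete - a sketch only, with the principal, keytab
>>> path and proxy user as placeholders - that extra layer would eventually have
>>> to make the Spark submission carry the submitter's identity, e.g. roughly:
>>>
>>> import subprocess
>>>
>>> # The proxy user would come from the identity attached to the submitted
>>> # DAG rather than being hard-coded like this.
>>> subprocess.run([
>>>     "spark-submit",
>>>     "--principal", "airflow/worker@EXAMPLE.COM",
>>>     "--keytab", "/etc/security/keytabs/airflow.keytab",
>>>     "--proxy-user", "alice",
>>>     "my_job.py",
>>> ], check=True)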
>>>
>>> Assuming dags are already inside the cloud storage, I think AIP-5/20
>>> would work better than the current mono-repo model if it could support
>>> better flexibility and lower latency, and I would be very interested in being
>>> part of the design and implementation.
>>>
>>>
>>> On Fri, Aug 12, 2022 at 10:56 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>
>>>> First appreciate all for your valuable feedback. Airflow by design has
>>>> to accept code, both Tomasz and Constance's examples let me think that the
>>>> security judgement should be on the actual DAGs rather than how DAGs are
>>>> accepted or a process itself. To expand a little bit more on another
>>>> example, say another service provides an API which can be invoked by its
>>>> clients the service validates user inputs e.g. SQL and generates Airflow
>>>> DAGs which use the validated operators/macros. Those DAGs are safe to be
>>>> pushed through the API. There are certainly cases that DAGs may not be
>>>> safe, e.g the API service on public cloud with shared tenants with no
>>>> knowledge how DAGs are generated, in such cases the API service can access
>>>> control the identity or even reject all calls when considered unsafe.
>>>> Please let me know if the example makes sense, and if there is a common
>>>> interest, having an Airflow native write path would benefit the community
>>>> instead of each building its own solution.
>>>>
>>>> > You seem to repeat more of the same. This is exactly what we want to
>>>> avoid. IF you can push a code over API you can push Any Code. And precisely
>>>> the "Access Control" you mentioned or rejecting the call when "considering
>>>> code unsafe" those are the decisions we already deliberately decided we do
>>>> not want Airflow REST API to make. Whether the code it's generated or not
>>>> does not matter because Airflow has no idea whatsoever if it has been
>>>> manipulated with, between the time it was generated and pushed. The only
>>>> way Airflow can know that the code is not manipulated with is when it
>>>> generates DAG code on its own based on a declarative input. The limit is to
>>>> push declarative information only. You CANNOT push code via the REST API.
>>>> This is out of the question. The case is closed.
>>>>
>>>> The middle loop usually happens on a Jupyter notebook, it needs to
>>>> change data/features used by model frequently which in turn leads to
>>>> Airflow DAG updates, do you mind elaborate how to automate the changes
>>>> inside a notebook and programmatically submitting DAGs through git+CI/CD
>>>> while giving user quick feedback? I understand git+ci/cd is technically
>>>> possible but the overhead involved is a major reason users rejecting
>>>> Airflow for other alternative solutions, e.g. git repo requires manual
>>>> approval even if DAGs can be programmatically submitted, and CI/CD are slow
>>>> offline processes with large repo.
>>>>
>>>> Case 2 is actually (if you attempt to read my article I posted above,
>>>> it's written there) the case where shared volume could still be used and
>>>> are bette. This why it's great that Airflow supports multiple DAG syncing
>>>> solutions because your "middle" environment does not have to have git sync
>>>> as it is not "production' (unless you want to mix development with testing
>>>> that is, which is terrible, terrible idea).
>>>>
>>>> Your data science for middle ground does:
>>>>
>>>> a) cp my_dag.py "/my_midle_volume_shared_and_mounted_locally". - if you
>>>> use shared volume of some sort (NFS/EFS etc.)
>>>> b) aws s3 cp my_dag.py "s3://my-midle-testing-bucket/" - if your dags
>>>> are on S3  and synced using s3-sync
>>>> c) gsutil cp my_dag.py "gs://my-bucket" - if your dags are on GCS and
>>>> synced using s3-sync
>>>>
>>>> Those are excellent "File push" apis. They do the job. I cannot imagine
>>>> why the middle-loop person might have a problem with using them. All of
>>>> that can also be  fully automated -  they all have nice Python and
>>>> other language APIs so you can even make the IDE run those commands
>>>> automatically on every save if you want.
>>>>
>>>> Could you please elaborate why this would be a problem to use those
>>>> (really good for file pushing) APIs ?
>>>>
>>>> J.
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Aug 12, 2022 at 6:20 PM Mocheng Guo <gm...@gmail.com> wrote:
>>>>
>>>>> First appreciate all for your valuable feedback. Airflow by design has
>>>>> to accept code, both Tomasz and Constance's examples let me think that the
>>>>> security judgement should be on the actual DAGs rather than how DAGs are
>>>>> accepted or a process itself. To expand a little bit more on another
>>>>> example, say another service provides an API which can be invoked by its
>>>>> clients the service validates user inputs e.g. SQL and generates Airflow
>>>>> DAGs which use the validated operators/macros. Those DAGs are safe to be
>>>>> pushed through the API. There are certainly cases that DAGs may not be
>>>>> safe, e.g the API service on public cloud with shared tenants with no
>>>>> knowledge how DAGs are generated, in such cases the API service can access
>>>>> control the identity or even reject all calls when considered unsafe.
>>>>> Please let me know if the example makes sense, and if there is a common
>>>>> interest, having an Airflow native write path would benefit the community
>>>>> instead of each building its own solution.
>>>>>
>>>>> Hi Xiaodong/Jarek, for your suggestion let me elaborate on a use case,
>>>>> here are three loops a data scientist is doing to develop a machine
>>>>> learning model:
>>>>> - inner loop: iterates on the model locally.
>>>>> - middle loop: iterate the model on a remote cluster with production
>>>>> data, say it's using Airflow DAGs behind the scenes.
>>>>> - outer loop: done with iteration and publish the model on production.
>>>>> The middle loop usually happens on a Jupyter notebook, it needs to
>>>>> change data/features used by model frequently which in turn leads to
>>>>> Airflow DAG updates, do you mind elaborate how to automate the changes
>>>>> inside a notebook and programmatically submitting DAGs through git+CI/CD
>>>>> while giving user quick feedback? I understand git+ci/cd is technically
>>>>> possible but the overhead involved is a major reason users rejecting
>>>>> Airflow for other alternative solutions, e.g. git repo requires manual
>>>>> approval even if DAGs can be programmatically submitted, and CI/CD are slow
>>>>> offline processes with large repo.
>>>>>
>>>>> Such use case is pretty common for data scientists, and a better
>>>>> **online** service model would help open up more possibilities for Airflow
>>>>> and its users, as additional layers providing more values(like Constance
>>>>> mentioned enable users with no engineering or airflow domain knowledge to
>>>>> use Airflow) could be built on top of Airflow which remains as a lower
>>>>> level orchestration engine.
>>>>>
>>>>> thanks
>>>>> Mocheng
>>>>>
>>>>>
>>>>> On Thu, Aug 11, 2022 at 10:46 PM Jarek Potiuk <ja...@potiuk.com>
>>>>> wrote:
>>>>>
>>>>>> I really like the Idea of Tomek.
>>>>>>
>>>>>> If we ever go (which is not unlikely) - some "standard" declarative
>>>>>> way of describing DAGs, all my security, packaging concerns are gone - and
>>>>>> submitting such declarative DAG via API is quite viable. Simply submitting
>>>>>> a Python code this way is a no-go for me :). Such Declarative DAG could be
>>>>>> just stored in the DB and scheduled and executed using only "declaration"
>>>>>> from the DB - without ever touching the DAG "folder" and without allowing
>>>>>> the user to submit any executable code this way. All the code to execute
>>>>>> would already have to be in Airflow already in this case.
>>>>>>
>>>>>> And I very much agree also that this case can be solved with Git. I
>>>>>> think we are generally undervaluing the role Git plays for DAG distribution
>>>>>> of Airflow.
>>>>>>
>>>>>> I think when the user feels the need (I very much understand the need
>>>>>> Constance) to submit the DAG via API,  rather than adding the option of
>>>>>> submitting the DAG code via "Airflow REST API", we should simply answer
>>>>>> this:
>>>>>>
>>>>>> *Use Git and git sync. Then "Git Push" then becomes the standard
>>>>>> "API" you wanted to push the code.*
>>>>>>
>>>>>> This has all the flexibility you need, it has integration with Pull
>>>>>> Request, CI workflows, keeps history etc.etc. When we tell people "Use Git"
>>>>>> - we have ALL of that and more for free. Standing on the shoulders of
>>>>>> giants.
>>>>>> If we start thinking about integration of code push via our own API -
>>>>>> we basically start the journey of rewriting Git as eventually we will have
>>>>>> to support those cases. This makes absolutely no sense for me.
>>>>>>
>>>>>> I even start to think that we should make "git sync" a separate (and
>>>>>> much more viable) option that is pretty much the "main recommendation" for
>>>>>> Airflow. rather than "yet another option among shared folders and baked in
>>>>>> DAGs" case.
>>>>>>
>>>>>> I recently even wrote my thoughts about it in this post: "Shared
>>>>>> Volumes in Airflow - the good, the bad and the ugly":
>>>>>> https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca
>>>>>> which has much more details on why I think so.
>>>>>>
>>>>>> J.
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 11, 2022 at 8:43 PM Constance Martineau
>>>>>> <co...@astronomer.io.invalid> wrote:
>>>>>>
>>>>>>> I understand the security concerns, and generally agree, but as a
>>>>>>> regular user I always wished we could upload DAG files via an API. It opens
>>>>>>> the door to have an upload button, which would be nice. It would make
>>>>>>> Airflow a lot more accessible to non-engineering types.
>>>>>>>
>>>>>>> I love the idea of implementing a manual review option in
>>>>>>> conjunction with some sort of hook (similar to Airflow cluster policies)
>>>>>>> would be a good middle ground. An administrator could use that hook to do
>>>>>>> checks against DAGs or run security scanners, and decide whether or not to
>>>>>>> implement a review requirement.
>>>>>>>
>>>>>>> On Thu, Aug 11, 2022 at 1:54 PM Tomasz Urbaszek <
>>>>>>> turbaszek@apache.org> wrote:
>>>>>>>
>>>>>>>> In general I second what XD said. CI/CD feels better than sending
>>>>>>>> DAG files over API and the security issues arising from accepting "any
>>>>>>>> python file" are probably quite big.
>>>>>>>>
>>>>>>>> However, I think this proposal can be tightly related to
>>>>>>>> "declarative DAGs". Instead of sending a DAG file, the user would send the
>>>>>>>> DAG definition (operators, inputs, relations) in a predefined format
>>>>>>>> that is not a code. This of course has some limitations like inability to
>>>>>>>> define custom macros, callbacks on the fly but it may be a good compromise.
>>>>>>>>
>>>>>>>> Other thought - if we implement something like "DAG via API" then
>>>>>>>> we should consider adding an option to review DAGs (approval queue etc) to
>>>>>>>> reduce security issues that are mitigated by for example deploying DAGs
>>>>>>>> from git (where we have code review, security scanners etc).
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Tomek
>>>>>>>>
>>>>>>>> On Thu, 11 Aug 2022 at 17:50, Xiaodong Deng <xd...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Mocheng,
>>>>>>>>>
>>>>>>>>> Please allow me to share a question first: so in your proposal,
>>>>>>>>> the API in your plan is still accepting an Airflow DAG as the payload (just
>>>>>>>>> binarized or compressed), right?
>>>>>>>>>
>>>>>>>>> If that's the case, I may not be fully convinced: the objectives
>>>>>>>>> in your proposal is about automation & programmatically submitting DAGs.
>>>>>>>>> These can already be achieved in an efficient way through CI/CD practice +
>>>>>>>>> a centralized place to manage your DAGs (e.g. a Git Repo to host the DAG
>>>>>>>>> files).
>>>>>>>>>
>>>>>>>>> As you are already aware, allowing this via API adds additional
>>>>>>>>> security concern, and I would doubt if that "breaks even".
>>>>>>>>>
>>>>>>>>> Kindly let me know if I have missed anything or misunderstood your
>>>>>>>>> proposal. Thanks.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> XD
>>>>>>>>> ----------------------------------------------------------------
>>>>>>>>> (This is not a contribution)
>>>>>>>>>
>>>>>>>>> On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gm...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Everyone,
>>>>>>>>>>
>>>>>>>>>> I have an enhancement proposal for the REST API service. This is
>>>>>>>>>> based on the observations that Airflow users want to be able to access
>>>>>>>>>> Airflow more easily as a platform service.
>>>>>>>>>>
>>>>>>>>>> The motivation comes from the following use cases:
>>>>>>>>>> 1. Users like data scientists want to iterate over data quickly
>>>>>>>>>> with interactive feedback in minutes, e.g. managing data pipelines inside
>>>>>>>>>> Jupyter Notebook while executing them in a remote airflow cluster.
>>>>>>>>>> 2. Services targeting specific audiences can generate DAGs based
>>>>>>>>>> on inputs like user command or external triggers, and they want to be able
>>>>>>>>>> to submit DAGs programmatically without manual intervention.
>>>>>>>>>>
>>>>>>>>>> I believe such use cases would help promote Airflow usability and
>>>>>>>>>> gain more customer popularity. The existing DAG repo brings considerable
>>>>>>>>>> overhead for such scenarios, a shared repo requires offline processes and
>>>>>>>>>> can be slow to rollout.
>>>>>>>>>>
>>>>>>>>>> The proposal aims to provide an alternative where a DAG can be
>>>>>>>>>> transmitted online and here are some key points:
>>>>>>>>>> 1. A DAG is packaged individually so that it can be distributable
>>>>>>>>>> over the network. For example, a DAG may be a serialized binary or a zip
>>>>>>>>>> file.
>>>>>>>>>> 2. The Airflow REST API is the ideal place to talk with the
>>>>>>>>>> external world. The API would provide a generic interface to accept DAG
>>>>>>>>>> artifacts and should be extensible to support different artifact formats if
>>>>>>>>>> needed.
>>>>>>>>>> 3. DAG persistence needs to be implemented since they are not
>>>>>>>>>> part of the DAG repository.
>>>>>>>>>> 4. Same behavior for DAGs supported in API vs those defined in
>>>>>>>>>> the repo, i.e. users write DAGs in the same syntax, and its scheduling,
>>>>>>>>>> execution, and web server UI should behave the same way.
>>>>>>>>>>
>>>>>>>>>> Since DAGs are written as code, running arbitrary code inside
>>>>>>>>>> Airflow may pose high security risks. Here are a few proposals to stop the
>>>>>>>>>> security breach:
>>>>>>>>>> 1. Accept DAGs only from trusted parties. Airflow already
>>>>>>>>>> supports pluggable authentication modules where strong authentication such
>>>>>>>>>> as Kerberos can be used.
>>>>>>>>>> 2. Execute DAG code as the API identity, i.e. A DAG created
>>>>>>>>>> through the API service will have run_as_user set to be the API identity.
>>>>>>>>>> 3. To enforce data access control on DAGs, the API identity
>>>>>>>>>> should also be used to access the data warehouse.
>>>>>>>>>>
>>>>>>>>>> We shared a demo based on a prototype implementation in the
>>>>>>>>>> summit and some details are described in this ppt
>>>>>>>>>> <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>,
>>>>>>>>>> and would love to get feedback and comments from the community about this
>>>>>>>>>> initiative.
>>>>>>>>>>
>>>>>>>>>> thanks
>>>>>>>>>> Mocheng
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Constance Martineau
>>>>>>> Product Manager
>>>>>>>
>>>>>>> Email: constance@astronomer.io
>>>>>>> Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
>>>>>>>
>>>>>>>
>>>>>>> <https://www.astronomer.io/>
>>>>>>>
>>>>>>>

Re: [Proposal] Creating DAG through the REST api

Posted by Mocheng Guo <gm...@gmail.com>.
My use case needs automation and security: those are the two key
requirements, and it does not have to be a REST API if there is another way that
DAGs could be submitted to cloud storage securely. Sure, I would
appreciate it if you could include me when organizing AIP-1 related
meetings. Kerberos is a ticket-based system in which a ticket has a limited
lifetime. Using Kerberos, a workload could be authenticated before
persistence so that Airflow uses its Kerberos keytab to execute, which is
similar to the current implementation in the worker. Another possible scenario
is that a persisted workload needs to include a renewable Kerberos TGT to be
used by the Airflow worker, but this is more complex and I would be happy to
discuss it more in meetings. I will draft a more detailed document for review.
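
As a rough sketch of the first scenario (the principal and keytab path are
placeholders), the worker-side step could obtain a ticket from its keytab
before executing the workload:

import subprocess

def acquire_ticket(principal="airflow@EXAMPLE.COM",
                   keytab="/etc/security/keytabs/airflow.keytab"):
    # Obtain (or refresh) a Kerberos TGT from the keytab so that the
    # workload executed afterwards can authenticate to the data warehouse.
    subprocess.run(["kinit", "-kt", keytab, principal], check=True)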

thanks
Mocheng


On Thu, Aug 18, 2022 at 1:19 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> None of those requirements are supported by Airflow. And opening REST API
> does not solve the authentication use case you mentioned.
>
> This is a completely new requirement you have - basically what you want is
> workflow identity and it should be rather independent from the way DAG is
> submitted. It would require to attach some kind of identity and signaturea
> and some way of making sure that the DAG has not been tampered with, in a
> way that the worker could use the identity when executing the workload and
> be sure that no-one else modified the DAG - including any of the files that
> the DAG uses. This is an interesting case but it has nothing to do with
> using or not the REST API. REST API alone will not give you the user
> identity guarantees that you need here. The distributed nature of Airflow
> basically requires such workflow identity has to be provided by
> cryptographic signatures and verifying the integrity of the DAG rather than
> basing it on REST API authentication.
>
> BTW. We do support already Kerberos authentication for some of our
> operators but identity is necessarily per instance executing the workload -
> not the user submitting the DAG.
>
> This could be one of the improvement proposals that could in the future
> become a sub-AIP or  AIP-1 (Improve Airflow Security). if you are
> interested in leading and proposing such an AIP i will be soon (a month or
> so) re-establishing #sig-multitenancy meetings (see AIP-1 for recordings
> and minutes of previous meetings). We already have AiP-43 and AIP-44
> approved there (and AIP-43 close to completion) and the next steps should
> be introducing fine graines security layer to executing the workloads.
> Adding workload identity might be part of it. If you would like to work on
> that - you are most welcome. It means to prepare and discuss proposals, get
> consensus of involved parties, leading it to a vote and finally
> implementing it.
>
> J
>
> czw., 18 sie 2022, 02:44 użytkownik Mocheng Guo <gm...@gmail.com>
> napisał:
>
>> >> Could you please elaborate why this would be a problem to use those
>> (really good for file pushing) APIs ?
>>
>> Submitting DAGs directly to cloud storage API does help for some part of
>> the use case requirement, but cloud storage does not provide the security a
>> data warehouse needs. A typical auth model supported in data warehouse is
>> Kerberos, and a data warehouse provides limited view to a kerberos user
>> with authorization rules. We need users to submit DAGs with identities
>> supported by the data warehouse, so that Apache Spark jobs will be executed
>> as the kerberos user who submits a DAG which in turns decide what data can
>> be processed, there may also be need to handle impersonation, so there
>> needs to be an additional layer to handle data warehouse auth e.g.
>> kerberos.
>>
>> Assuming dags are already inside the cloud storage, and I think AIP-5/20
>> would work better than the current mono repo model if it could support
>> better flexibility and less latency, and I would be very interested to be
>> part of the design and implementation.
>>
>>
>> On Fri, Aug 12, 2022 at 10:56 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>>> First appreciate all for your valuable feedback. Airflow by design has
>>> to accept code, both Tomasz and Constance's examples let me think that the
>>> security judgement should be on the actual DAGs rather than how DAGs are
>>> accepted or a process itself. To expand a little bit more on another
>>> example, say another service provides an API which can be invoked by its
>>> clients the service validates user inputs e.g. SQL and generates Airflow
>>> DAGs which use the validated operators/macros. Those DAGs are safe to be
>>> pushed through the API. There are certainly cases that DAGs may not be
>>> safe, e.g the API service on public cloud with shared tenants with no
>>> knowledge how DAGs are generated, in such cases the API service can access
>>> control the identity or even reject all calls when considered unsafe.
>>> Please let me know if the example makes sense, and if there is a common
>>> interest, having an Airflow native write path would benefit the community
>>> instead of each building its own solution.
>>>
>>> > You seem to repeat more of the same. This is exactly what we want to
>>> avoid. IF you can push a code over API you can push Any Code. And precisely
>>> the "Access Control" you mentioned or rejecting the call when "considering
>>> code unsafe" those are the decisions we already deliberately decided we do
>>> not want Airflow REST API to make. Whether the code it's generated or not
>>> does not matter because Airflow has no idea whatsoever if it has been
>>> manipulated with, between the time it was generated and pushed. The only
>>> way Airflow can know that the code is not manipulated with is when it
>>> generates DAG code on its own based on a declarative input. The limit is to
>>> push declarative information only. You CANNOT push code via the REST API.
>>> This is out of the question. The case is closed.
>>>
>>> The middle loop usually happens on a Jupyter notebook, it needs to
>>> change data/features used by model frequently which in turn leads to
>>> Airflow DAG updates, do you mind elaborate how to automate the changes
>>> inside a notebook and programmatically submitting DAGs through git+CI/CD
>>> while giving user quick feedback? I understand git+ci/cd is technically
>>> possible but the overhead involved is a major reason users rejecting
>>> Airflow for other alternative solutions, e.g. git repo requires manual
>>> approval even if DAGs can be programmatically submitted, and CI/CD are slow
>>> offline processes with large repo.
>>>
>>> Case 2 is actually (if you attempt to read my article I posted above,
>>> it's written there) the case where shared volume could still be used and
>>> are bette. This why it's great that Airflow supports multiple DAG syncing
>>> solutions because your "middle" environment does not have to have git sync
>>> as it is not "production' (unless you want to mix development with testing
>>> that is, which is terrible, terrible idea).
>>>
>>> Your data science for middle ground does:
>>>
>>> a) cp my_dag.py "/my_midle_volume_shared_and_mounted_locally". - if you
>>> use shared volume of some sort (NFS/EFS etc.)
>>> b) aws s3 cp my_dag.py "s3://my-midle-testing-bucket/" - if your dags
>>> are on S3  and synced using s3-sync
>>> c) gsutil cp my_dag.py "gs://my-bucket" - if your dags are on GCS and
>>> synced using s3-sync
>>>
>>> Those are excellent "File push" apis. They do the job. I cannot imagine
>>> why the middle-loop person might have a problem with using them. All of
>>> that can also be  fully automated -  they all have nice Python and
>>> other language APIs so you can even make the IDE run those commands
>>> automatically on every save if you want.
>>>
>>> Could you please elaborate why this would be a problem to use those
>>> (really good for file pushing) APIs ?
>>>
>>> J.
>>>
>>>
>>>
>>>
>>> On Fri, Aug 12, 2022 at 6:20 PM Mocheng Guo <gm...@gmail.com> wrote:
>>>
>>>> First appreciate all for your valuable feedback. Airflow by design has
>>>> to accept code, both Tomasz and Constance's examples let me think that the
>>>> security judgement should be on the actual DAGs rather than how DAGs are
>>>> accepted or a process itself. To expand a little bit more on another
>>>> example, say another service provides an API which can be invoked by its
>>>> clients the service validates user inputs e.g. SQL and generates Airflow
>>>> DAGs which use the validated operators/macros. Those DAGs are safe to be
>>>> pushed through the API. There are certainly cases that DAGs may not be
>>>> safe, e.g the API service on public cloud with shared tenants with no
>>>> knowledge how DAGs are generated, in such cases the API service can access
>>>> control the identity or even reject all calls when considered unsafe.
>>>> Please let me know if the example makes sense, and if there is a common
>>>> interest, having an Airflow native write path would benefit the community
>>>> instead of each building its own solution.
>>>>
>>>> Hi Xiaodong/Jarek, for your suggestion let me elaborate on a use case,
>>>> here are three loops a data scientist is doing to develop a machine
>>>> learning model:
>>>> - inner loop: iterates on the model locally.
>>>> - middle loop: iterate the model on a remote cluster with production
>>>> data, say it's using Airflow DAGs behind the scenes.
>>>> - outer loop: done with iteration and publish the model on production.
>>>> The middle loop usually happens on a Jupyter notebook, it needs to
>>>> change data/features used by model frequently which in turn leads to
>>>> Airflow DAG updates, do you mind elaborate how to automate the changes
>>>> inside a notebook and programmatically submitting DAGs through git+CI/CD
>>>> while giving user quick feedback? I understand git+ci/cd is technically
>>>> possible but the overhead involved is a major reason users rejecting
>>>> Airflow for other alternative solutions, e.g. git repo requires manual
>>>> approval even if DAGs can be programmatically submitted, and CI/CD are slow
>>>> offline processes with large repo.
>>>>
>>>> Such use case is pretty common for data scientists, and a better
>>>> **online** service model would help open up more possibilities for Airflow
>>>> and its users, as additional layers providing more values(like Constance
>>>> mentioned enable users with no engineering or airflow domain knowledge to
>>>> use Airflow) could be built on top of Airflow which remains as a lower
>>>> level orchestration engine.
>>>>
>>>> thanks
>>>> Mocheng
>>>>
>>>>
>>>> On Thu, Aug 11, 2022 at 10:46 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>
>>>>> I really like the Idea of Tomek.
>>>>>
>>>>> If we ever go (which is not unlikely) - some "standard" declarative
>>>>> way of describing DAGs, all my security, packaging concerns are gone - and
>>>>> submitting such declarative DAG via API is quite viable. Simply submitting
>>>>> a Python code this way is a no-go for me :). Such Declarative DAG could be
>>>>> just stored in the DB and scheduled and executed using only "declaration"
>>>>> from the DB - without ever touching the DAG "folder" and without allowing
>>>>> the user to submit any executable code this way. All the code to execute
>>>>> would already have to be in Airflow already in this case.
>>>>>
>>>>> And I very much agree also that this case can be solved with Git. I
>>>>> think we are generally undervaluing the role Git plays for DAG distribution
>>>>> of Airflow.
>>>>>
>>>>> I think when the user feels the need (I very much understand the need
>>>>> Constance) to submit the DAG via API,  rather than adding the option of
>>>>> submitting the DAG code via "Airflow REST API", we should simply answer
>>>>> this:
>>>>>
>>>>> *Use Git and git sync. Then "Git Push" then becomes the standard "API"
>>>>> you wanted to push the code.*
>>>>>
>>>>> This has all the flexibility you need, it has integration with Pull
>>>>> Request, CI workflows, keeps history etc.etc. When we tell people "Use Git"
>>>>> - we have ALL of that and more for free. Standing on the shoulders of
>>>>> giants.
>>>>> If we start thinking about integration of code push via our own API -
>>>>> we basically start the journey of rewriting Git as eventually we will have
>>>>> to support those cases. This makes absolutely no sense for me.
>>>>>
>>>>> I even start to think that we should make "git sync" a separate (and
>>>>> much more viable) option that is pretty much the "main recommendation" for
>>>>> Airflow. rather than "yet another option among shared folders and baked in
>>>>> DAGs" case.
>>>>>
>>>>> I recently even wrote my thoughts about it in this post: "Shared
>>>>> Volumes in Airflow - the good, the bad and the ugly":
>>>>> https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca
>>>>> which has much more details on why I think so.
>>>>>
>>>>> J.
>>>>>
>>>>>
>>>>> On Thu, Aug 11, 2022 at 8:43 PM Constance Martineau
>>>>> <co...@astronomer.io.invalid> wrote:
>>>>>
>>>>>> I understand the security concerns, and generally agree, but as a
>>>>>> regular user I always wished we could upload DAG files via an API. It opens
>>>>>> the door to have an upload button, which would be nice. It would make
>>>>>> Airflow a lot more accessible to non-engineering types.
>>>>>>
>>>>>> I love the idea of implementing a manual review option in conjunction
>>>>>> with some sort of hook (similar to Airflow cluster policies) would be a
>>>>>> good middle ground. An administrator could use that hook to do checks
>>>>>> against DAGs or run security scanners, and decide whether or not to
>>>>>> implement a review requirement.
>>>>>>
>>>>>> On Thu, Aug 11, 2022 at 1:54 PM Tomasz Urbaszek <tu...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> In general I second what XD said. CI/CD feels better than sending
>>>>>>> DAG files over API and the security issues arising from accepting "any
>>>>>>> python file" are probably quite big.
>>>>>>>
>>>>>>> However, I think this proposal can be tightly related to
>>>>>>> "declarative DAGs". Instead of sending a DAG file, the user would send the
>>>>>>> DAG definition (operators, inputs, relations) in a predefined format
>>>>>>> that is not a code. This of course has some limitations like inability to
>>>>>>> define custom macros, callbacks on the fly but it may be a good compromise.
>>>>>>>
>>>>>>> Other thought - if we implement something like "DAG via API" then we
>>>>>>> should consider adding an option to review DAGs (approval queue etc) to
>>>>>>> reduce security issues that are mitigated by for example deploying DAGs
>>>>>>> from git (where we have code review, security scanners etc).
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Tomek
>>>>>>>
>>>>>>> On Thu, 11 Aug 2022 at 17:50, Xiaodong Deng <xd...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Mocheng,
>>>>>>>>
>>>>>>>> Please allow me to share a question first: so in your proposal, the
>>>>>>>> API in your plan is still accepting an Airflow DAG as the payload (just
>>>>>>>> binarized or compressed), right?
>>>>>>>>
>>>>>>>> If that's the case, I may not be fully convinced: the objectives in
>>>>>>>> your proposal is about automation & programmatically submitting DAGs. These
>>>>>>>> can already be achieved in an efficient way through CI/CD practice + a
>>>>>>>> centralized place to manage your DAGs (e.g. a Git Repo to host the DAG
>>>>>>>> files).
>>>>>>>>
>>>>>>>> As you are already aware, allowing this via API adds additional
>>>>>>>> security concern, and I would doubt if that "breaks even".
>>>>>>>>
>>>>>>>> Kindly let me know if I have missed anything or misunderstood your
>>>>>>>> proposal. Thanks.
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> XD
>>>>>>>> ----------------------------------------------------------------
>>>>>>>> (This is not a contribution)
>>>>>>>>
>>>>>>>> On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gm...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Everyone,
>>>>>>>>>
>>>>>>>>> I have an enhancement proposal for the REST API service. This is
>>>>>>>>> based on the observations that Airflow users want to be able to access
>>>>>>>>> Airflow more easily as a platform service.
>>>>>>>>>
>>>>>>>>> The motivation comes from the following use cases:
>>>>>>>>> 1. Users like data scientists want to iterate over data quickly
>>>>>>>>> with interactive feedback in minutes, e.g. managing data pipelines inside
>>>>>>>>> Jupyter Notebook while executing them in a remote airflow cluster.
>>>>>>>>> 2. Services targeting specific audiences can generate DAGs based
>>>>>>>>> on inputs like user command or external triggers, and they want to be able
>>>>>>>>> to submit DAGs programmatically without manual intervention.
>>>>>>>>>
>>>>>>>>> I believe such use cases would help promote Airflow usability and
>>>>>>>>> gain more customer popularity. The existing DAG repo brings considerable
>>>>>>>>> overhead for such scenarios, a shared repo requires offline processes and
>>>>>>>>> can be slow to rollout.
>>>>>>>>>
>>>>>>>>> The proposal aims to provide an alternative where a DAG can be
>>>>>>>>> transmitted online and here are some key points:
>>>>>>>>> 1. A DAG is packaged individually so that it can be distributable
>>>>>>>>> over the network. For example, a DAG may be a serialized binary or a zip
>>>>>>>>> file.
>>>>>>>>> 2. The Airflow REST API is the ideal place to talk with the
>>>>>>>>> external world. The API would provide a generic interface to accept DAG
>>>>>>>>> artifacts and should be extensible to support different artifact formats if
>>>>>>>>> needed.
>>>>>>>>> 3. DAG persistence needs to be implemented since they are not part
>>>>>>>>> of the DAG repository.
>>>>>>>>> 4. Same behavior for DAGs supported in API vs those defined in the
>>>>>>>>> repo, i.e. users write DAGs in the same syntax, and its scheduling,
>>>>>>>>> execution, and web server UI should behave the same way.
>>>>>>>>>
>>>>>>>>> Since DAGs are written as code, running arbitrary code inside
>>>>>>>>> Airflow may pose high security risks. Here are a few proposals to stop the
>>>>>>>>> security breach:
>>>>>>>>> 1. Accept DAGs only from trusted parties. Airflow already supports
>>>>>>>>> pluggable authentication modules where strong authentication such as
>>>>>>>>> Kerberos can be used.
>>>>>>>>> 2. Execute DAG code as the API identity, i.e. A DAG created
>>>>>>>>> through the API service will have run_as_user set to be the API identity.
>>>>>>>>> 3. To enforce data access control on DAGs, the API identity should
>>>>>>>>> also be used to access the data warehouse.
>>>>>>>>>
>>>>>>>>> We shared a demo based on a prototype implementation in the summit
>>>>>>>>> and some details are described in this ppt
>>>>>>>>> <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>,
>>>>>>>>> and would love to get feedback and comments from the community about this
>>>>>>>>> initiative.
>>>>>>>>>
>>>>>>>>> thanks
>>>>>>>>> Mocheng
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Constance Martineau
>>>>>> Product Manager
>>>>>>
>>>>>> Email: constance@astronomer.io
>>>>>> Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
>>>>>>
>>>>>>
>>>>>> <https://www.astronomer.io/>
>>>>>>
>>>>>>

Re: [Proposal] Creating DAG through the REST api

Posted by Jarek Potiuk <ja...@potiuk.com>.
None of those requirements are supported by Airflow. And opening the REST API
does not solve the authentication use case you mentioned.

This is a completely new requirement you have - basically what you want is
workflow identity, and it should be rather independent of the way the DAG is
submitted. It would require attaching some kind of identity and signature,
and some way of making sure that the DAG has not been tampered with, in a
way that the worker could use the identity when executing the workload and
be sure that no-one else modified the DAG - including any of the files that
the DAG uses. This is an interesting case, but it has nothing to do with
using or not using the REST API. The REST API alone will not give you the user
identity guarantees that you need here. The distributed nature of Airflow
basically requires that such a workflow identity be provided by
cryptographic signatures and by verifying the integrity of the DAG rather than
by basing it on REST API authentication.
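
To illustrate the shape of such a scheme - a minimal sketch only, using an
Ed25519 signature from the "cryptography" package; key distribution and
deciding exactly what gets signed are the hard parts this skips:

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Submitter side: sign the DAG file contents with the author's private key.
private_key = Ed25519PrivateKey.generate()
dag_bytes = open("my_dag.py", "rb").read()
signature = private_key.sign(dag_bytes)

# Worker side: verify the signature against the trusted public key before
# executing anything on behalf of that identity.
public_key = private_key.public_key()
try:
    public_key.verify(signature, dag_bytes)
except InvalidSignature:
    raise RuntimeError("DAG was modified after it was signed")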

BTW, we do already support Kerberos authentication for some of our
operators, but the identity is necessarily per instance executing the workload -
not per user submitting the DAG.

This could be one of the improvement proposals that could in the future
become a sub-AIP of AIP-1 (Improve Airflow Security). If you are
interested in leading and proposing such an AIP, I will soon (in a month or
so) be re-establishing the #sig-multitenancy meetings (see AIP-1 for recordings
and minutes of previous meetings). We already have AIP-43 and AIP-44
approved there (and AIP-43 is close to completion), and the next steps should
be introducing a fine-grained security layer for executing the workloads.
Adding workload identity might be part of it. If you would like to work on
that - you are most welcome. It means preparing and discussing proposals, getting
consensus from the involved parties, leading it to a vote, and finally
implementing it.

J

czw., 18 sie 2022, 02:44 użytkownik Mocheng Guo <gm...@gmail.com> napisał:

> >> Could you please elaborate why this would be a problem to use those
> (really good for file pushing) APIs ?
>
> Submitting DAGs directly to cloud storage API does help for some part of
> the use case requirement, but cloud storage does not provide the security a
> data warehouse needs. A typical auth model supported in data warehouse is
> Kerberos, and a data warehouse provides limited view to a kerberos user
> with authorization rules. We need users to submit DAGs with identities
> supported by the data warehouse, so that Apache Spark jobs will be executed
> as the kerberos user who submits a DAG which in turns decide what data can
> be processed, there may also be need to handle impersonation, so there
> needs to be an additional layer to handle data warehouse auth e.g.
> kerberos.
>
> Assuming dags are already inside the cloud storage, and I think AIP-5/20
> would work better than the current mono repo model if it could support
> better flexibility and less latency, and I would be very interested to be
> part of the design and implementation.
>
>
> On Fri, Aug 12, 2022 at 10:56 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> First appreciate all for your valuable feedback. Airflow by design has to
>> accept code, both Tomasz and Constance's examples let me think that the
>> security judgement should be on the actual DAGs rather than how DAGs are
>> accepted or a process itself. To expand a little bit more on another
>> example, say another service provides an API which can be invoked by its
>> clients the service validates user inputs e.g. SQL and generates Airflow
>> DAGs which use the validated operators/macros. Those DAGs are safe to be
>> pushed through the API. There are certainly cases that DAGs may not be
>> safe, e.g the API service on public cloud with shared tenants with no
>> knowledge how DAGs are generated, in such cases the API service can access
>> control the identity or even reject all calls when considered unsafe.
>> Please let me know if the example makes sense, and if there is a common
>> interest, having an Airflow native write path would benefit the community
>> instead of each building its own solution.
>>
>> > You seem to repeat more of the same. This is exactly what we want to
>> avoid. IF you can push a code over API you can push Any Code. And precisely
>> the "Access Control" you mentioned or rejecting the call when "considering
>> code unsafe" those are the decisions we already deliberately decided we do
>> not want Airflow REST API to make. Whether the code it's generated or not
>> does not matter because Airflow has no idea whatsoever if it has been
>> manipulated with, between the time it was generated and pushed. The only
>> way Airflow can know that the code is not manipulated with is when it
>> generates DAG code on its own based on a declarative input. The limit is to
>> push declarative information only. You CANNOT push code via the REST API.
>> This is out of the question. The case is closed.
>>
>> The middle loop usually happens on a Jupyter notebook, it needs to change
>> data/features used by model frequently which in turn leads to Airflow DAG
>> updates, do you mind elaborate how to automate the changes inside a
>> notebook and programmatically submitting DAGs through git+CI/CD while
>> giving user quick feedback? I understand git+ci/cd is technically possible
>> but the overhead involved is a major reason users rejecting Airflow for
>> other alternative solutions, e.g. git repo requires manual approval even if
>> DAGs can be programmatically submitted, and CI/CD are slow offline
>> processes with large repo.
>>
>> Case 2 is actually (if you attempt to read my article I posted above,
>> it's written there) the case where shared volume could still be used and
>> are bette. This why it's great that Airflow supports multiple DAG syncing
>> solutions because your "middle" environment does not have to have git sync
>> as it is not "production' (unless you want to mix development with testing
>> that is, which is terrible, terrible idea).
>>
>> Your data science for middle ground does:
>>
>> a) cp my_dag.py "/my_midle_volume_shared_and_mounted_locally". - if you
>> use shared volume of some sort (NFS/EFS etc.)
>> b) aws s3 cp my_dag.py "s3://my-midle-testing-bucket/" - if your dags are
>> on S3  and synced using s3-sync
>> c) gsutil cp my_dag.py "gs://my-bucket" - if your dags are on GCS and
>> synced using s3-sync
>>
>> Those are excellent "File push" apis. They do the job. I cannot imagine
>> why the middle-loop person might have a problem with using them. All of
>> that can also be  fully automated -  they all have nice Python and
>> other language APIs so you can even make the IDE run those commands
>> automatically on every save if you want.
>>
>> Could you please elaborate why this would be a problem to use those
>> (really good for file pushing) APIs ?
>>
>> J.
>>
>>
>>
>>
>> On Fri, Aug 12, 2022 at 6:20 PM Mocheng Guo <gm...@gmail.com> wrote:
>>
>>> First appreciate all for your valuable feedback. Airflow by design has
>>> to accept code, both Tomasz and Constance's examples let me think that the
>>> security judgement should be on the actual DAGs rather than how DAGs are
>>> accepted or a process itself. To expand a little bit more on another
>>> example, say another service provides an API which can be invoked by its
>>> clients the service validates user inputs e.g. SQL and generates Airflow
>>> DAGs which use the validated operators/macros. Those DAGs are safe to be
>>> pushed through the API. There are certainly cases that DAGs may not be
>>> safe, e.g the API service on public cloud with shared tenants with no
>>> knowledge how DAGs are generated, in such cases the API service can access
>>> control the identity or even reject all calls when considered unsafe.
>>> Please let me know if the example makes sense, and if there is a common
>>> interest, having an Airflow native write path would benefit the community
>>> instead of each building its own solution.
>>>
>>> Hi Xiaodong/Jarek, for your suggestion let me elaborate on a use case,
>>> here are three loops a data scientist is doing to develop a machine
>>> learning model:
>>> - inner loop: iterates on the model locally.
>>> - middle loop: iterate the model on a remote cluster with production
>>> data, say it's using Airflow DAGs behind the scenes.
>>> - outer loop: done with iteration and publish the model on production.
>>> The middle loop usually happens on a Jupyter notebook, it needs to
>>> change data/features used by model frequently which in turn leads to
>>> Airflow DAG updates, do you mind elaborate how to automate the changes
>>> inside a notebook and programmatically submitting DAGs through git+CI/CD
>>> while giving user quick feedback? I understand git+ci/cd is technically
>>> possible but the overhead involved is a major reason users rejecting
>>> Airflow for other alternative solutions, e.g. git repo requires manual
>>> approval even if DAGs can be programmatically submitted, and CI/CD are slow
>>> offline processes with large repo.
>>>
>>> Such use case is pretty common for data scientists, and a better
>>> **online** service model would help open up more possibilities for Airflow
>>> and its users, as additional layers providing more values(like Constance
>>> mentioned enable users with no engineering or airflow domain knowledge to
>>> use Airflow) could be built on top of Airflow which remains as a lower
>>> level orchestration engine.
>>>
>>> thanks
>>> Mocheng
>>>
>>>
>>> On Thu, Aug 11, 2022 at 10:46 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>
>>>> I really like the Idea of Tomek.
>>>>
>>>> If we ever go (which is not unlikely) - some "standard" declarative way
>>>> of describing DAGs, all my security, packaging concerns are gone - and
>>>> submitting such declarative DAG via API is quite viable. Simply submitting
>>>> a Python code this way is a no-go for me :). Such Declarative DAG could be
>>>> just stored in the DB and scheduled and executed using only "declaration"
>>>> from the DB - without ever touching the DAG "folder" and without allowing
>>>> the user to submit any executable code this way. All the code to execute
>>>> would already have to be in Airflow already in this case.
>>>>
>>>> And I very much agree also that this case can be solved with Git. I
>>>> think we are generally undervaluing the role Git plays for DAG distribution
>>>> of Airflow.
>>>>
>>>> I think when the user feels the need (I very much understand the need
>>>> Constance) to submit the DAG via API,  rather than adding the option of
>>>> submitting the DAG code via "Airflow REST API", we should simply answer
>>>> this:
>>>>
>>>> *Use Git and git sync. Then "Git Push" then becomes the standard "API"
>>>> you wanted to push the code.*
>>>>
>>>> This has all the flexibility you need, it has integration with Pull
>>>> Request, CI workflows, keeps history etc.etc. When we tell people "Use Git"
>>>> - we have ALL of that and more for free. Standing on the shoulders of
>>>> giants.
>>>> If we start thinking about integration of code push via our own API -
>>>> we basically start the journey of rewriting Git as eventually we will have
>>>> to support those cases. This makes absolutely no sense for me.
>>>>
>>>> I even start to think that we should make "git sync" a separate (and
>>>> much more viable) option that is pretty much the "main recommendation" for
>>>> Airflow. rather than "yet another option among shared folders and baked in
>>>> DAGs" case.
>>>>
>>>> I recently even wrote my thoughts about it in this post: "Shared
>>>> Volumes in Airflow - the good, the bad and the ugly":
>>>> https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca
>>>> which has much more details on why I think so.
>>>>
>>>> J.
>>>>
>>>>
>>>> On Thu, Aug 11, 2022 at 8:43 PM Constance Martineau
>>>> <co...@astronomer.io.invalid> wrote:
>>>>
>>>>> I understand the security concerns, and generally agree, but as a
>>>>> regular user I always wished we could upload DAG files via an API. It opens
>>>>> the door to have an upload button, which would be nice. It would make
>>>>> Airflow a lot more accessible to non-engineering types.
>>>>>
>>>>> I love the idea of implementing a manual review option in conjunction
>>>>> with some sort of hook (similar to Airflow cluster policies) would be a
>>>>> good middle ground. An administrator could use that hook to do checks
>>>>> against DAGs or run security scanners, and decide whether or not to
>>>>> implement a review requirement.
>>>>>
>>>>> On Thu, Aug 11, 2022 at 1:54 PM Tomasz Urbaszek <tu...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> In general I second what XD said. CI/CD feels better than sending DAG
>>>>>> files over API and the security issues arising from accepting "any python
>>>>>> file" are probably quite big.
>>>>>>
>>>>>> However, I think this proposal can be tightly related to "declarative
>>>>>> DAGs". Instead of sending a DAG file, the user would send the DAG
>>>>>> definition (operators, inputs, relations) in a predefined format that is
>>>>>> not a code. This of course has some limitations like inability to define
>>>>>> custom macros, callbacks on the fly but it may be a good compromise.
>>>>>>
>>>>>> Other thought - if we implement something like "DAG via API" then we
>>>>>> should consider adding an option to review DAGs (approval queue etc) to
>>>>>> reduce security issues that are mitigated by for example deploying DAGs
>>>>>> from git (where we have code review, security scanners etc).
>>>>>>
>>>>>> Cheers,
>>>>>> Tomek
>>>>>>
>>>>>> On Thu, 11 Aug 2022 at 17:50, Xiaodong Deng <xd...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Mocheng,
>>>>>>>
>>>>>>> Please allow me to share a question first: so in your proposal, the
>>>>>>> API in your plan is still accepting an Airflow DAG as the payload (just
>>>>>>> binarized or compressed), right?
>>>>>>>
>>>>>>> If that's the case, I may not be fully convinced: the objectives in
>>>>>>> your proposal is about automation & programmatically submitting DAGs. These
>>>>>>> can already be achieved in an efficient way through CI/CD practice + a
>>>>>>> centralized place to manage your DAGs (e.g. a Git Repo to host the DAG
>>>>>>> files).
>>>>>>>
>>>>>>> As you are already aware, allowing this via API adds additional
>>>>>>> security concern, and I would doubt if that "breaks even".
>>>>>>>
>>>>>>> Kindly let me know if I have missed anything or misunderstood your
>>>>>>> proposal. Thanks.
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> XD
>>>>>>> ----------------------------------------------------------------
>>>>>>> (This is not a contribution)
>>>>>>>
>>>>>>> On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gm...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Everyone,
>>>>>>>>
>>>>>>>> I have an enhancement proposal for the REST API service. This is
>>>>>>>> based on the observations that Airflow users want to be able to access
>>>>>>>> Airflow more easily as a platform service.
>>>>>>>>
>>>>>>>> The motivation comes from the following use cases:
>>>>>>>> 1. Users like data scientists want to iterate over data quickly
>>>>>>>> with interactive feedback in minutes, e.g. managing data pipelines inside
>>>>>>>> Jupyter Notebook while executing them in a remote airflow cluster.
>>>>>>>> 2. Services targeting specific audiences can generate DAGs based on
>>>>>>>> inputs like user command or external triggers, and they want to be able to
>>>>>>>> submit DAGs programmatically without manual intervention.
>>>>>>>>
>>>>>>>> I believe such use cases would help promote Airflow usability and
>>>>>>>> gain more customer popularity. The existing DAG repo brings considerable
>>>>>>>> overhead for such scenarios, a shared repo requires offline processes and
>>>>>>>> can be slow to rollout.
>>>>>>>>
>>>>>>>> The proposal aims to provide an alternative where a DAG can be
>>>>>>>> transmitted online and here are some key points:
>>>>>>>> 1. A DAG is packaged individually so that it can be distributable
>>>>>>>> over the network. For example, a DAG may be a serialized binary or a zip
>>>>>>>> file.
>>>>>>>> 2. The Airflow REST API is the ideal place to talk with the
>>>>>>>> external world. The API would provide a generic interface to accept DAG
>>>>>>>> artifacts and should be extensible to support different artifact formats if
>>>>>>>> needed.
>>>>>>>> 3. DAG persistence needs to be implemented since they are not part
>>>>>>>> of the DAG repository.
>>>>>>>> 4. Same behavior for DAGs supported in API vs those defined in the
>>>>>>>> repo, i.e. users write DAGs in the same syntax, and its scheduling,
>>>>>>>> execution, and web server UI should behave the same way.
>>>>>>>>
>>>>>>>> Since DAGs are written as code, running arbitrary code inside
>>>>>>>> Airflow may pose high security risks. Here are a few proposals to stop the
>>>>>>>> security breach:
>>>>>>>> 1. Accept DAGs only from trusted parties. Airflow already supports
>>>>>>>> pluggable authentication modules where strong authentication such as
>>>>>>>> Kerberos can be used.
>>>>>>>> 2. Execute DAG code as the API identity, i.e. A DAG created through
>>>>>>>> the API service will have run_as_user set to be the API identity.
>>>>>>>> 3. To enforce data access control on DAGs, the API identity should
>>>>>>>> also be used to access the data warehouse.
>>>>>>>>
>>>>>>>> We shared a demo based on a prototype implementation in the summit
>>>>>>>> and some details are described in this ppt
>>>>>>>> <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>,
>>>>>>>> and would love to get feedback and comments from the community about this
>>>>>>>> initiative.
>>>>>>>>
>>>>>>>> thanks
>>>>>>>> Mocheng
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Constance Martineau
>>>>> Product Manager
>>>>>
>>>>> Email: constance@astronomer.io
>>>>> Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
>>>>>
>>>>>
>>>>> <https://www.astronomer.io/>
>>>>>
>>>>>

Re: [Proposal] Creating DAG through the REST api

Posted by Mocheng Guo <gm...@gmail.com>.
>> Could you please elaborate why this would be a problem to use those
>> (really good for file pushing) APIs ?

Submitting DAGs directly to a cloud storage API does help with part of the
use case, but cloud storage does not provide the security model a data
warehouse needs. A typical auth model supported in a data warehouse is
Kerberos, and the warehouse exposes only a limited view to a Kerberos user
based on authorization rules. We need users to submit DAGs with identities
the data warehouse recognizes, so that Apache Spark jobs are executed as
the Kerberos user who submitted the DAG, which in turn decides what data
can be processed. There may also be a need to handle impersonation, so an
additional layer is needed to handle data warehouse auth, e.g. Kerberos.
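
To make the identity requirement a bit more concrete, here is a minimal
sketch (the user name, dag_id and spark-submit command are placeholders I
made up) of a DAG whose task runs as the submitting identity via the
existing run_as_user argument; the actual Kerberos plumbing (keytabs,
ticket renewal, warehouse authorization) is assumed to live outside the
DAG:

# Minimal sketch: a DAG whose task runs as the identity that submitted it.
# "alice", the dag_id and the spark-submit command are placeholders;
# run_as_user is an existing BaseOperator argument, while the Kerberos
# setup (keytab, ticket renewal, warehouse authorization) is assumed to
# be handled elsewhere.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="feature_backfill_submitted_via_api",
    start_date=datetime(2022, 8, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_spark_job = BashOperator(
        task_id="run_spark_job",
        bash_command="spark-submit --principal alice@EXAMPLE.COM job.py",
        run_as_user="alice",  # the API identity that pushed this DAG
    )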

Assuming DAGs are already in cloud storage, I think AIP-5/20 would work
better than the current mono-repo model if it could offer more flexibility
and lower latency, and I would be very interested in being part of the
design and implementation.


On Fri, Aug 12, 2022 at 10:56 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> First appreciate all for your valuable feedback. Airflow by design has to
> accept code, both Tomasz and Constance's examples let me think that the
> security judgement should be on the actual DAGs rather than how DAGs are
> accepted or a process itself. To expand a little bit more on another
> example, say another service provides an API which can be invoked by its
> clients the service validates user inputs e.g. SQL and generates Airflow
> DAGs which use the validated operators/macros. Those DAGs are safe to be
> pushed through the API. There are certainly cases that DAGs may not be
> safe, e.g the API service on public cloud with shared tenants with no
> knowledge how DAGs are generated, in such cases the API service can access
> control the identity or even reject all calls when considered unsafe.
> Please let me know if the example makes sense, and if there is a common
> interest, having an Airflow native write path would benefit the community
> instead of each building its own solution.
>
> > You seem to repeat more of the same. This is exactly what we want to
> avoid. IF you can push a code over API you can push Any Code. And precisely
> the "Access Control" you mentioned or rejecting the call when "considering
> code unsafe" those are the decisions we already deliberately decided we do
> not want Airflow REST API to make. Whether the code it's generated or not
> does not matter because Airflow has no idea whatsoever if it has been
> manipulated with, between the time it was generated and pushed. The only
> way Airflow can know that the code is not manipulated with is when it
> generates DAG code on its own based on a declarative input. The limit is to
> push declarative information only. You CANNOT push code via the REST API.
> This is out of the question. The case is closed.
>
> The middle loop usually happens on a Jupyter notebook, it needs to change
> data/features used by model frequently which in turn leads to Airflow DAG
> updates, do you mind elaborate how to automate the changes inside a
> notebook and programmatically submitting DAGs through git+CI/CD while
> giving user quick feedback? I understand git+ci/cd is technically possible
> but the overhead involved is a major reason users rejecting Airflow for
> other alternative solutions, e.g. git repo requires manual approval even if
> DAGs can be programmatically submitted, and CI/CD are slow offline
> processes with large repo.
>
> Case 2 is actually (if you attempt to read my article I posted above, it's
> written there) the case where shared volume could still be used and are
> bette. This why it's great that Airflow supports multiple DAG syncing
> solutions because your "middle" environment does not have to have git sync
> as it is not "production' (unless you want to mix development with testing
> that is, which is terrible, terrible idea).
>
> Your data science for middle ground does:
>
> a) cp my_dag.py "/my_midle_volume_shared_and_mounted_locally". - if you
> use shared volume of some sort (NFS/EFS etc.)
> b) aws s3 cp my_dag.py "s3://my-midle-testing-bucket/" - if your dags are
> on S3  and synced using s3-sync
> c) gsutil cp my_dag.py "gs://my-bucket" - if your dags are on GCS and
> synced using s3-sync
>
> Those are excellent "File push" apis. They do the job. I cannot imagine
> why the middle-loop person might have a problem with using them. All of
> that can also be  fully automated -  they all have nice Python and
> other language APIs so you can even make the IDE run those commands
> automatically on every save if you want.
>
> Could you please elaborate why this would be a problem to use those
> (really good for file pushing) APIs ?
>
> J.
>
>
>
>
> On Fri, Aug 12, 2022 at 6:20 PM Mocheng Guo <gm...@gmail.com> wrote:
>
>> First appreciate all for your valuable feedback. Airflow by design has to
>> accept code, both Tomasz and Constance's examples let me think that the
>> security judgement should be on the actual DAGs rather than how DAGs are
>> accepted or a process itself. To expand a little bit more on another
>> example, say another service provides an API which can be invoked by its
>> clients the service validates user inputs e.g. SQL and generates Airflow
>> DAGs which use the validated operators/macros. Those DAGs are safe to be
>> pushed through the API. There are certainly cases that DAGs may not be
>> safe, e.g the API service on public cloud with shared tenants with no
>> knowledge how DAGs are generated, in such cases the API service can access
>> control the identity or even reject all calls when considered unsafe.
>> Please let me know if the example makes sense, and if there is a common
>> interest, having an Airflow native write path would benefit the community
>> instead of each building its own solution.
>>
>> Hi Xiaodong/Jarek, for your suggestion let me elaborate on a use case,
>> here are three loops a data scientist is doing to develop a machine
>> learning model:
>> - inner loop: iterates on the model locally.
>> - middle loop: iterate the model on a remote cluster with production
>> data, say it's using Airflow DAGs behind the scenes.
>> - outer loop: done with iteration and publish the model on production.
>> The middle loop usually happens on a Jupyter notebook, it needs to change
>> data/features used by model frequently which in turn leads to Airflow DAG
>> updates, do you mind elaborate how to automate the changes inside a
>> notebook and programmatically submitting DAGs through git+CI/CD while
>> giving user quick feedback? I understand git+ci/cd is technically possible
>> but the overhead involved is a major reason users rejecting Airflow for
>> other alternative solutions, e.g. git repo requires manual approval even if
>> DAGs can be programmatically submitted, and CI/CD are slow offline
>> processes with large repo.
>>
>> Such use case is pretty common for data scientists, and a better
>> **online** service model would help open up more possibilities for Airflow
>> and its users, as additional layers providing more values(like Constance
>> mentioned enable users with no engineering or airflow domain knowledge to
>> use Airflow) could be built on top of Airflow which remains as a lower
>> level orchestration engine.
>>
>> thanks
>> Mocheng
>>
>>
>> On Thu, Aug 11, 2022 at 10:46 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>>> I really like the Idea of Tomek.
>>>
>>> If we ever go (which is not unlikely) - some "standard" declarative way
>>> of describing DAGs, all my security, packaging concerns are gone - and
>>> submitting such declarative DAG via API is quite viable. Simply submitting
>>> a Python code this way is a no-go for me :). Such Declarative DAG could be
>>> just stored in the DB and scheduled and executed using only "declaration"
>>> from the DB - without ever touching the DAG "folder" and without allowing
>>> the user to submit any executable code this way. All the code to execute
>>> would already have to be in Airflow already in this case.
>>>
>>> And I very much agree also that this case can be solved with Git. I
>>> think we are generally undervaluing the role Git plays for DAG distribution
>>> of Airflow.
>>>
>>> I think when the user feels the need (I very much understand the need
>>> Constance) to submit the DAG via API,  rather than adding the option of
>>> submitting the DAG code via "Airflow REST API", we should simply answer
>>> this:
>>>
>>> *Use Git and git sync. Then "Git Push" then becomes the standard "API"
>>> you wanted to push the code.*
>>>
>>> This has all the flexibility you need, it has integration with Pull
>>> Request, CI workflows, keeps history etc.etc. When we tell people "Use Git"
>>> - we have ALL of that and more for free. Standing on the shoulders of
>>> giants.
>>> If we start thinking about integration of code push via our own API - we
>>> basically start the journey of rewriting Git as eventually we will have to
>>> support those cases. This makes absolutely no sense for me.
>>>
>>> I even start to think that we should make "git sync" a separate (and
>>> much more viable) option that is pretty much the "main recommendation" for
>>> Airflow. rather than "yet another option among shared folders and baked in
>>> DAGs" case.
>>>
>>> I recently even wrote my thoughts about it in this post: "Shared Volumes
>>> in Airflow - the good, the bad and the ugly":
>>> https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca
>>> which has much more details on why I think so.
>>>
>>> J.
>>>
>>>
>>> On Thu, Aug 11, 2022 at 8:43 PM Constance Martineau
>>> <co...@astronomer.io.invalid> wrote:
>>>
>>>> I understand the security concerns, and generally agree, but as a
>>>> regular user I always wished we could upload DAG files via an API. It opens
>>>> the door to have an upload button, which would be nice. It would make
>>>> Airflow a lot more accessible to non-engineering types.
>>>>
>>>> I love the idea of implementing a manual review option in conjunction
>>>> with some sort of hook (similar to Airflow cluster policies) would be a
>>>> good middle ground. An administrator could use that hook to do checks
>>>> against DAGs or run security scanners, and decide whether or not to
>>>> implement a review requirement.
>>>>
>>>> On Thu, Aug 11, 2022 at 1:54 PM Tomasz Urbaszek <tu...@apache.org>
>>>> wrote:
>>>>
>>>>> In general I second what XD said. CI/CD feels better than sending DAG
>>>>> files over API and the security issues arising from accepting "any python
>>>>> file" are probably quite big.
>>>>>
>>>>> However, I think this proposal can be tightly related to "declarative
>>>>> DAGs". Instead of sending a DAG file, the user would send the DAG
>>>>> definition (operators, inputs, relations) in a predefined format that is
>>>>> not a code. This of course has some limitations like inability to define
>>>>> custom macros, callbacks on the fly but it may be a good compromise.
>>>>>
>>>>> Other thought - if we implement something like "DAG via API" then we
>>>>> should consider adding an option to review DAGs (approval queue etc) to
>>>>> reduce security issues that are mitigated by for example deploying DAGs
>>>>> from git (where we have code review, security scanners etc).
>>>>>
>>>>> Cheers,
>>>>> Tomek
>>>>>
>>>>> On Thu, 11 Aug 2022 at 17:50, Xiaodong Deng <xd...@apache.org> wrote:
>>>>>
>>>>>> Hi Mocheng,
>>>>>>
>>>>>> Please allow me to share a question first: so in your proposal, the
>>>>>> API in your plan is still accepting an Airflow DAG as the payload (just
>>>>>> binarized or compressed), right?
>>>>>>
>>>>>> If that's the case, I may not be fully convinced: the objectives in
>>>>>> your proposal is about automation & programmatically submitting DAGs. These
>>>>>> can already be achieved in an efficient way through CI/CD practice + a
>>>>>> centralized place to manage your DAGs (e.g. a Git Repo to host the DAG
>>>>>> files).
>>>>>>
>>>>>> As you are already aware, allowing this via API adds additional
>>>>>> security concern, and I would doubt if that "breaks even".
>>>>>>
>>>>>> Kindly let me know if I have missed anything or misunderstood your
>>>>>> proposal. Thanks.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> XD
>>>>>> ----------------------------------------------------------------
>>>>>> (This is not a contribution)
>>>>>>
>>>>>> On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gm...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Everyone,
>>>>>>>
>>>>>>> I have an enhancement proposal for the REST API service. This is
>>>>>>> based on the observations that Airflow users want to be able to access
>>>>>>> Airflow more easily as a platform service.
>>>>>>>
>>>>>>> The motivation comes from the following use cases:
>>>>>>> 1. Users like data scientists want to iterate over data quickly with
>>>>>>> interactive feedback in minutes, e.g. managing data pipelines inside
>>>>>>> Jupyter Notebook while executing them in a remote airflow cluster.
>>>>>>> 2. Services targeting specific audiences can generate DAGs based on
>>>>>>> inputs like user command or external triggers, and they want to be able to
>>>>>>> submit DAGs programmatically without manual intervention.
>>>>>>>
>>>>>>> I believe such use cases would help promote Airflow usability and
>>>>>>> gain more customer popularity. The existing DAG repo brings considerable
>>>>>>> overhead for such scenarios, a shared repo requires offline processes and
>>>>>>> can be slow to rollout.
>>>>>>>
>>>>>>> The proposal aims to provide an alternative where a DAG can be
>>>>>>> transmitted online and here are some key points:
>>>>>>> 1. A DAG is packaged individually so that it can be distributable
>>>>>>> over the network. For example, a DAG may be a serialized binary or a zip
>>>>>>> file.
>>>>>>> 2. The Airflow REST API is the ideal place to talk with the external
>>>>>>> world. The API would provide a generic interface to accept DAG artifacts
>>>>>>> and should be extensible to support different artifact formats if needed.
>>>>>>> 3. DAG persistence needs to be implemented since they are not part
>>>>>>> of the DAG repository.
>>>>>>> 4. Same behavior for DAGs supported in API vs those defined in the
>>>>>>> repo, i.e. users write DAGs in the same syntax, and its scheduling,
>>>>>>> execution, and web server UI should behave the same way.
>>>>>>>
>>>>>>> Since DAGs are written as code, running arbitrary code inside
>>>>>>> Airflow may pose high security risks. Here are a few proposals to stop the
>>>>>>> security breach:
>>>>>>> 1. Accept DAGs only from trusted parties. Airflow already supports
>>>>>>> pluggable authentication modules where strong authentication such as
>>>>>>> Kerberos can be used.
>>>>>>> 2. Execute DAG code as the API identity, i.e. A DAG created through
>>>>>>> the API service will have run_as_user set to be the API identity.
>>>>>>> 3. To enforce data access control on DAGs, the API identity should
>>>>>>> also be used to access the data warehouse.
>>>>>>>
>>>>>>> We shared a demo based on a prototype implementation in the summit
>>>>>>> and some details are described in this ppt
>>>>>>> <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>,
>>>>>>> and would love to get feedback and comments from the community about this
>>>>>>> initiative.
>>>>>>>
>>>>>>> thanks
>>>>>>> Mocheng
>>>>>>>
>>>>>>
>>>>
>>>> --
>>>>
>>>> Constance Martineau
>>>> Product Manager
>>>>
>>>> Email: constance@astronomer.io
>>>> Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
>>>>
>>>>
>>>> <https://www.astronomer.io/>
>>>>
>>>>

Re: [Proposal] Creating DAG through the REST api

Posted by Jarek Potiuk <ja...@potiuk.com>.
First appreciate all for your valuable feedback. Airflow by design has to
accept code, both Tomasz and Constance's examples let me think that the
security judgement should be on the actual DAGs rather than how DAGs are
accepted or a process itself. To expand a little bit more on another
example, say another service provides an API which can be invoked by its
clients the service validates user inputs e.g. SQL and generates Airflow
DAGs which use the validated operators/macros. Those DAGs are safe to be
pushed through the API. There are certainly cases that DAGs may not be
safe, e.g the API service on public cloud with shared tenants with no
knowledge how DAGs are generated, in such cases the API service can access
control the identity or even reject all calls when considered unsafe.
Please let me know if the example makes sense, and if there is a common
interest, having an Airflow native write path would benefit the community
instead of each building its own solution.

> You seem to repeat more of the same. This is exactly what we want to
avoid. If you can push code over the API, you can push ANY code. And
precisely the "access control" you mentioned, or rejecting a call when the
code is "considered unsafe", are decisions we already deliberately decided
we do not want the Airflow REST API to make. Whether the code is generated
or not does not matter, because Airflow has no idea whatsoever whether it
has been tampered with between the time it was generated and the time it
was pushed. The only way Airflow can know the code has not been tampered
with is when it generates the DAG code on its own based on a declarative
input. The limit is to push declarative information only. You CANNOT push
code via the REST API. This is out of the question. The case is closed.

The middle loop usually happens on a Jupyter notebook, it needs to change
data/features used by model frequently which in turn leads to Airflow DAG
updates, do you mind elaborate how to automate the changes inside a
notebook and programmatically submitting DAGs through git+CI/CD while
giving user quick feedback? I understand git+ci/cd is technically possible
but the overhead involved is a major reason users rejecting Airflow for
other alternative solutions, e.g. git repo requires manual approval even if
DAGs can be programmatically submitted, and CI/CD are slow offline
processes with large repo.

Case 2 is actually (if you read the article I posted above, it's described
there) the case where a shared volume could still be used, and works
better. This is why it's great that Airflow supports multiple DAG syncing
solutions: your "middle" environment does not have to use git sync, as it
is not "production" (unless you want to mix development with testing,
which is a terrible, terrible idea).

Your data scientist in the middle loop does:

a) cp my_dag.py "/my_middle_volume_shared_and_mounted_locally" - if you use
a shared volume of some sort (NFS/EFS etc.)
b) aws s3 cp my_dag.py "s3://my-middle-testing-bucket/" - if your dags are
on S3 and synced using s3-sync
c) gsutil cp my_dag.py "gs://my-bucket" - if your dags are on GCS and
synced the same way

Those are excellent "file push" APIs. They do the job. I cannot imagine why
the middle-loop person might have a problem with using them. All of that
can also be fully automated - they all have nice Python and other language
APIs, so you can even make the IDE run those commands automatically on
every save if you want.
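
As a rough sketch of that kind of automation (assuming the S3 variant, with
boto3 already configured with credentials and a hypothetical bucket name),
a notebook cell or an on-save hook could do something like:

# Sketch: push a locally edited DAG file to the bucket that the "middle"
# Airflow environment syncs its DAG folder from. The bucket name and key
# prefix are hypothetical; boto3 credentials are assumed to be configured.
import os

import boto3

def push_dag(local_path: str, bucket: str = "my-middle-testing-bucket") -> None:
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, f"dags/{os.path.basename(local_path)}")

# e.g. from a Jupyter cell right after editing the file:
push_dag("my_dag.py")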

Could you please elaborate why this would be a problem to use those (really
good for file pushing) APIs ?

J.




On Fri, Aug 12, 2022 at 6:20 PM Mocheng Guo <gm...@gmail.com> wrote:

> First appreciate all for your valuable feedback. Airflow by design has to
> accept code, both Tomasz and Constance's examples let me think that the
> security judgement should be on the actual DAGs rather than how DAGs are
> accepted or a process itself. To expand a little bit more on another
> example, say another service provides an API which can be invoked by its
> clients the service validates user inputs e.g. SQL and generates Airflow
> DAGs which use the validated operators/macros. Those DAGs are safe to be
> pushed through the API. There are certainly cases that DAGs may not be
> safe, e.g the API service on public cloud with shared tenants with no
> knowledge how DAGs are generated, in such cases the API service can access
> control the identity or even reject all calls when considered unsafe.
> Please let me know if the example makes sense, and if there is a common
> interest, having an Airflow native write path would benefit the community
> instead of each building its own solution.
>
> Hi Xiaodong/Jarek, for your suggestion let me elaborate on a use case,
> here are three loops a data scientist is doing to develop a machine
> learning model:
> - inner loop: iterates on the model locally.
> - middle loop: iterate the model on a remote cluster with production data,
> say it's using Airflow DAGs behind the scenes.
> - outer loop: done with iteration and publish the model on production.
> The middle loop usually happens on a Jupyter notebook, it needs to change
> data/features used by model frequently which in turn leads to Airflow DAG
> updates, do you mind elaborate how to automate the changes inside a
> notebook and programmatically submitting DAGs through git+CI/CD while
> giving user quick feedback? I understand git+ci/cd is technically possible
> but the overhead involved is a major reason users rejecting Airflow for
> other alternative solutions, e.g. git repo requires manual approval even if
> DAGs can be programmatically submitted, and CI/CD are slow offline
> processes with large repo.
>
> Such use case is pretty common for data scientists, and a better
> **online** service model would help open up more possibilities for Airflow
> and its users, as additional layers providing more values(like Constance
> mentioned enable users with no engineering or airflow domain knowledge to
> use Airflow) could be built on top of Airflow which remains as a lower
> level orchestration engine.
>
> thanks
> Mocheng
>
>
> On Thu, Aug 11, 2022 at 10:46 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> I really like the Idea of Tomek.
>>
>> If we ever go (which is not unlikely) - some "standard" declarative way
>> of describing DAGs, all my security, packaging concerns are gone - and
>> submitting such declarative DAG via API is quite viable. Simply submitting
>> a Python code this way is a no-go for me :). Such Declarative DAG could be
>> just stored in the DB and scheduled and executed using only "declaration"
>> from the DB - without ever touching the DAG "folder" and without allowing
>> the user to submit any executable code this way. All the code to execute
>> would already have to be in Airflow already in this case.
>>
>> And I very much agree also that this case can be solved with Git. I think
>> we are generally undervaluing the role Git plays for DAG distribution of
>> Airflow.
>>
>> I think when the user feels the need (I very much understand the need
>> Constance) to submit the DAG via API,  rather than adding the option of
>> submitting the DAG code via "Airflow REST API", we should simply answer
>> this:
>>
>> *Use Git and git sync. Then "Git Push" then becomes the standard "API"
>> you wanted to push the code.*
>>
>> This has all the flexibility you need, it has integration with Pull
>> Request, CI workflows, keeps history etc.etc. When we tell people "Use Git"
>> - we have ALL of that and more for free. Standing on the shoulders of
>> giants.
>> If we start thinking about integration of code push via our own API - we
>> basically start the journey of rewriting Git as eventually we will have to
>> support those cases. This makes absolutely no sense for me.
>>
>> I even start to think that we should make "git sync" a separate (and much
>> more viable) option that is pretty much the "main recommendation" for
>> Airflow. rather than "yet another option among shared folders and baked in
>> DAGs" case.
>>
>> I recently even wrote my thoughts about it in this post: "Shared Volumes
>> in Airflow - the good, the bad and the ugly":
>> https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca
>> which has much more details on why I think so.
>>
>> J.
>>
>>
>> On Thu, Aug 11, 2022 at 8:43 PM Constance Martineau
>> <co...@astronomer.io.invalid> wrote:
>>
>>> I understand the security concerns, and generally agree, but as a
>>> regular user I always wished we could upload DAG files via an API. It opens
>>> the door to have an upload button, which would be nice. It would make
>>> Airflow a lot more accessible to non-engineering types.
>>>
>>> I love the idea of implementing a manual review option in conjunction
>>> with some sort of hook (similar to Airflow cluster policies) would be a
>>> good middle ground. An administrator could use that hook to do checks
>>> against DAGs or run security scanners, and decide whether or not to
>>> implement a review requirement.
>>>
>>> On Thu, Aug 11, 2022 at 1:54 PM Tomasz Urbaszek <tu...@apache.org>
>>> wrote:
>>>
>>>> In general I second what XD said. CI/CD feels better than sending DAG
>>>> files over API and the security issues arising from accepting "any python
>>>> file" are probably quite big.
>>>>
>>>> However, I think this proposal can be tightly related to "declarative
>>>> DAGs". Instead of sending a DAG file, the user would send the DAG
>>>> definition (operators, inputs, relations) in a predefined format that is
>>>> not a code. This of course has some limitations like inability to define
>>>> custom macros, callbacks on the fly but it may be a good compromise.
>>>>
>>>> Other thought - if we implement something like "DAG via API" then we
>>>> should consider adding an option to review DAGs (approval queue etc) to
>>>> reduce security issues that are mitigated by for example deploying DAGs
>>>> from git (where we have code review, security scanners etc).
>>>>
>>>> Cheers,
>>>> Tomek
>>>>
>>>> On Thu, 11 Aug 2022 at 17:50, Xiaodong Deng <xd...@apache.org> wrote:
>>>>
>>>>> Hi Mocheng,
>>>>>
>>>>> Please allow me to share a question first: so in your proposal, the
>>>>> API in your plan is still accepting an Airflow DAG as the payload (just
>>>>> binarized or compressed), right?
>>>>>
>>>>> If that's the case, I may not be fully convinced: the objectives in
>>>>> your proposal is about automation & programmatically submitting DAGs. These
>>>>> can already be achieved in an efficient way through CI/CD practice + a
>>>>> centralized place to manage your DAGs (e.g. a Git Repo to host the DAG
>>>>> files).
>>>>>
>>>>> As you are already aware, allowing this via API adds additional
>>>>> security concern, and I would doubt if that "breaks even".
>>>>>
>>>>> Kindly let me know if I have missed anything or misunderstood your
>>>>> proposal. Thanks.
>>>>>
>>>>>
>>>>> Regards,
>>>>> XD
>>>>> ----------------------------------------------------------------
>>>>> (This is not a contribution)
>>>>>
>>>>> On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gm...@gmail.com> wrote:
>>>>>
>>>>>> Hi Everyone,
>>>>>>
>>>>>> I have an enhancement proposal for the REST API service. This is
>>>>>> based on the observations that Airflow users want to be able to access
>>>>>> Airflow more easily as a platform service.
>>>>>>
>>>>>> The motivation comes from the following use cases:
>>>>>> 1. Users like data scientists want to iterate over data quickly with
>>>>>> interactive feedback in minutes, e.g. managing data pipelines inside
>>>>>> Jupyter Notebook while executing them in a remote airflow cluster.
>>>>>> 2. Services targeting specific audiences can generate DAGs based on
>>>>>> inputs like user command or external triggers, and they want to be able to
>>>>>> submit DAGs programmatically without manual intervention.
>>>>>>
>>>>>> I believe such use cases would help promote Airflow usability and
>>>>>> gain more customer popularity. The existing DAG repo brings considerable
>>>>>> overhead for such scenarios, a shared repo requires offline processes and
>>>>>> can be slow to rollout.
>>>>>>
>>>>>> The proposal aims to provide an alternative where a DAG can be
>>>>>> transmitted online and here are some key points:
>>>>>> 1. A DAG is packaged individually so that it can be distributable
>>>>>> over the network. For example, a DAG may be a serialized binary or a zip
>>>>>> file.
>>>>>> 2. The Airflow REST API is the ideal place to talk with the external
>>>>>> world. The API would provide a generic interface to accept DAG artifacts
>>>>>> and should be extensible to support different artifact formats if needed.
>>>>>> 3. DAG persistence needs to be implemented since they are not part of
>>>>>> the DAG repository.
>>>>>> 4. Same behavior for DAGs supported in API vs those defined in the
>>>>>> repo, i.e. users write DAGs in the same syntax, and its scheduling,
>>>>>> execution, and web server UI should behave the same way.
>>>>>>
>>>>>> Since DAGs are written as code, running arbitrary code inside Airflow
>>>>>> may pose high security risks. Here are a few proposals to stop the security
>>>>>> breach:
>>>>>> 1. Accept DAGs only from trusted parties. Airflow already supports
>>>>>> pluggable authentication modules where strong authentication such as
>>>>>> Kerberos can be used.
>>>>>> 2. Execute DAG code as the API identity, i.e. A DAG created through
>>>>>> the API service will have run_as_user set to be the API identity.
>>>>>> 3. To enforce data access control on DAGs, the API identity should
>>>>>> also be used to access the data warehouse.
>>>>>>
>>>>>> We shared a demo based on a prototype implementation in the summit
>>>>>> and some details are described in this ppt
>>>>>> <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>,
>>>>>> and would love to get feedback and comments from the community about this
>>>>>> initiative.
>>>>>>
>>>>>> thanks
>>>>>> Mocheng
>>>>>>
>>>>>
>>>
>>> --
>>>
>>> Constance Martineau
>>> Product Manager
>>>
>>> Email: constance@astronomer.io
>>> Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
>>>
>>>
>>> <https://www.astronomer.io/>
>>>
>>>

Re: [Proposal] Creating DAG through the REST api

Posted by Mocheng Guo <gm...@gmail.com>.
First, I appreciate everyone's valuable feedback. Airflow by design has to
accept code, and both Tomasz's and Constance's examples make me think that
the security judgement should be made on the actual DAGs rather than on
how DAGs are accepted or on the process itself. To expand a little more
with another example: say another service provides an API that can be
invoked by its clients; the service validates user inputs (e.g. SQL) and
generates Airflow DAGs which use the validated operators/macros. Those
DAGs are safe to be pushed through the API. There are certainly cases
where DAGs may not be safe, e.g. an API service on a public cloud with
shared tenants and no knowledge of how the DAGs are generated; in such
cases the API service can apply access control on the identity or even
reject all calls when they are considered unsafe. Please let me know if
the example makes sense. If there is common interest, having an
Airflow-native write path would benefit the community instead of everyone
building their own solution.
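
As a purely hypothetical illustration of that kind of service (nothing
here is an existing Airflow API - the template, the connection id and the
operator choice are my own assumptions), the service side could render a
DAG file that only uses a vetted operator with the validated SQL:

# Hypothetical service-side generator: after validating the user's SQL,
# render a DAG file that only uses a vetted operator, so no user-written
# Python is ever accepted - only the validated SQL string is interpolated.
DAG_TEMPLATE = '''
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="{dag_id}",
    start_date=datetime(2022, 8, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_query = PostgresOperator(
        task_id="run_query",
        postgres_conn_id="warehouse",
        sql="""{sql}""",
    )
'''

def generate_dag_file(dag_id: str, validated_sql: str) -> str:
    # validated_sql is assumed to have already passed the service's checks.
    return DAG_TEMPLATE.format(dag_id=dag_id, sql=validated_sql)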

Hi Xiaodong/Jarek, regarding your suggestion, let me elaborate on a use
case. Here are the three loops a data scientist goes through to develop a
machine learning model:
- inner loop: iterate on the model locally.
- middle loop: iterate on the model on a remote cluster with production
data, say using Airflow DAGs behind the scenes.
- outer loop: done iterating, publish the model to production.
The middle loop usually happens in a Jupyter notebook. It needs to change
the data/features used by the model frequently, which in turn leads to
Airflow DAG updates. Would you mind elaborating on how to automate those
changes inside a notebook and programmatically submit DAGs through
git+CI/CD while still giving the user quick feedback? I understand
git+CI/CD is technically possible, but the overhead involved is a major
reason users reject Airflow for alternative solutions, e.g. a git repo
requires manual approval even if DAGs can be submitted programmatically,
and CI/CD is a slow offline process with a large repo.

Such a use case is pretty common for data scientists, and a better
**online** service model would open up more possibilities for Airflow and
its users, as additional layers providing more value (like Constance
mentioned, enabling users with no engineering or Airflow domain knowledge
to use Airflow) could be built on top of Airflow, which remains a
lower-level orchestration engine.

thanks
Mocheng


On Thu, Aug 11, 2022 at 10:46 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> I really like the Idea of Tomek.
>
> If we ever go (which is not unlikely) - some "standard" declarative way of
> describing DAGs, all my security, packaging concerns are gone - and
> submitting such declarative DAG via API is quite viable. Simply submitting
> a Python code this way is a no-go for me :). Such Declarative DAG could be
> just stored in the DB and scheduled and executed using only "declaration"
> from the DB - without ever touching the DAG "folder" and without allowing
> the user to submit any executable code this way. All the code to execute
> would already have to be in Airflow already in this case.
>
> And I very much agree also that this case can be solved with Git. I think
> we are generally undervaluing the role Git plays for DAG distribution of
> Airflow.
>
> I think when the user feels the need (I very much understand the need
> Constance) to submit the DAG via API,  rather than adding the option of
> submitting the DAG code via "Airflow REST API", we should simply answer
> this:
>
> *Use Git and git sync. Then "Git Push" then becomes the standard "API" you
> wanted to push the code.*
>
> This has all the flexibility you need, it has integration with Pull
> Request, CI workflows, keeps history etc.etc. When we tell people "Use Git"
> - we have ALL of that and more for free. Standing on the shoulders of
> giants.
> If we start thinking about integration of code push via our own API - we
> basically start the journey of rewriting Git as eventually we will have to
> support those cases. This makes absolutely no sense for me.
>
> I even start to think that we should make "git sync" a separate (and much
> more viable) option that is pretty much the "main recommendation" for
> Airflow. rather than "yet another option among shared folders and baked in
> DAGs" case.
>
> I recently even wrote my thoughts about it in this post: "Shared Volumes
> in Airflow - the good, the bad and the ugly":
> https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca
> which has much more details on why I think so.
>
> J.
>
>
> On Thu, Aug 11, 2022 at 8:43 PM Constance Martineau
> <co...@astronomer.io.invalid> wrote:
>
>> I understand the security concerns, and generally agree, but as a regular
>> user I always wished we could upload DAG files via an API. It opens the
>> door to have an upload button, which would be nice. It would make Airflow a
>> lot more accessible to non-engineering types.
>>
>> I love the idea of implementing a manual review option in conjunction
>> with some sort of hook (similar to Airflow cluster policies) would be a
>> good middle ground. An administrator could use that hook to do checks
>> against DAGs or run security scanners, and decide whether or not to
>> implement a review requirement.
>>
>> On Thu, Aug 11, 2022 at 1:54 PM Tomasz Urbaszek <tu...@apache.org>
>> wrote:
>>
>>> In general I second what XD said. CI/CD feels better than sending DAG
>>> files over API and the security issues arising from accepting "any python
>>> file" are probably quite big.
>>>
>>> However, I think this proposal can be tightly related to "declarative
>>> DAGs". Instead of sending a DAG file, the user would send the DAG
>>> definition (operators, inputs, relations) in a predefined format that is
>>> not a code. This of course has some limitations like inability to define
>>> custom macros, callbacks on the fly but it may be a good compromise.
>>>
>>> Other thought - if we implement something like "DAG via API" then we
>>> should consider adding an option to review DAGs (approval queue etc) to
>>> reduce security issues that are mitigated by for example deploying DAGs
>>> from git (where we have code review, security scanners etc).
>>>
>>> Cheers,
>>> Tomek
>>>
>>> On Thu, 11 Aug 2022 at 17:50, Xiaodong Deng <xd...@apache.org> wrote:
>>>
>>>> Hi Mocheng,
>>>>
>>>> Please allow me to share a question first: so in your proposal, the API
>>>> in your plan is still accepting an Airflow DAG as the payload (just
>>>> binarized or compressed), right?
>>>>
>>>> If that's the case, I may not be fully convinced: the objectives in
>>>> your proposal is about automation & programmatically submitting DAGs. These
>>>> can already be achieved in an efficient way through CI/CD practice + a
>>>> centralized place to manage your DAGs (e.g. a Git Repo to host the DAG
>>>> files).
>>>>
>>>> As you are already aware, allowing this via API adds additional
>>>> security concern, and I would doubt if that "breaks even".
>>>>
>>>> Kindly let me know if I have missed anything or misunderstood your
>>>> proposal. Thanks.
>>>>
>>>>
>>>> Regards,
>>>> XD
>>>> ----------------------------------------------------------------
>>>> (This is not a contribution)
>>>>
>>>> On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gm...@gmail.com> wrote:
>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> I have an enhancement proposal for the REST API service. This is based
>>>>> on the observations that Airflow users want to be able to access Airflow
>>>>> more easily as a platform service.
>>>>>
>>>>> The motivation comes from the following use cases:
>>>>> 1. Users like data scientists want to iterate over data quickly with
>>>>> interactive feedback in minutes, e.g. managing data pipelines inside
>>>>> Jupyter Notebook while executing them in a remote airflow cluster.
>>>>> 2. Services targeting specific audiences can generate DAGs based on
>>>>> inputs like user command or external triggers, and they want to be able to
>>>>> submit DAGs programmatically without manual intervention.
>>>>>
>>>>> I believe such use cases would help promote Airflow usability and gain
>>>>> more customer popularity. The existing DAG repo brings considerable
>>>>> overhead for such scenarios, a shared repo requires offline processes and
>>>>> can be slow to rollout.
>>>>>
>>>>> The proposal aims to provide an alternative where a DAG can be
>>>>> transmitted online and here are some key points:
>>>>> 1. A DAG is packaged individually so that it can be distributable over
>>>>> the network. For example, a DAG may be a serialized binary or a zip file.
>>>>> 2. The Airflow REST API is the ideal place to talk with the external
>>>>> world. The API would provide a generic interface to accept DAG artifacts
>>>>> and should be extensible to support different artifact formats if needed.
>>>>> 3. DAG persistence needs to be implemented since they are not part of
>>>>> the DAG repository.
>>>>> 4. Same behavior for DAGs supported in API vs those defined in the
>>>>> repo, i.e. users write DAGs in the same syntax, and its scheduling,
>>>>> execution, and web server UI should behave the same way.
>>>>>
>>>>> Since DAGs are written as code, running arbitrary code inside Airflow
>>>>> may pose high security risks. Here are a few proposals to stop the security
>>>>> breach:
>>>>> 1. Accept DAGs only from trusted parties. Airflow already supports
>>>>> pluggable authentication modules where strong authentication such as
>>>>> Kerberos can be used.
>>>>> 2. Execute DAG code as the API identity, i.e. A DAG created through
>>>>> the API service will have run_as_user set to be the API identity.
>>>>> 3. To enforce data access control on DAGs, the API identity should
>>>>> also be used to access the data warehouse.
>>>>>
>>>>> We shared a demo based on a prototype implementation in the summit and
>>>>> some details are described in this ppt
>>>>> <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>,
>>>>> and would love to get feedback and comments from the community about this
>>>>> initiative.
>>>>>
>>>>> thanks
>>>>> Mocheng
>>>>>
>>>>
>>
>> --
>>
>> Constance Martineau
>> Product Manager
>>
>> Email: constance@astronomer.io
>> Time zone: US Eastern (EST UTC-5 / EDT UTC-4)
>>
>>
>> <https://www.astronomer.io/>
>>
>>

Re: [Proposal] Creating DAG through the REST api

Posted by Jarek Potiuk <ja...@potiuk.com>.
I really like Tomek's idea.

If we ever go for (which is not unlikely) some "standard" declarative way
of describing DAGs, all my security and packaging concerns are gone - and
submitting such a declarative DAG via the API is quite viable. Simply
submitting Python code this way is a no-go for me :). Such a declarative
DAG could just be stored in the DB and scheduled and executed using only
the "declaration" from the DB - without ever touching the DAG "folder" and
without allowing the user to submit any executable code this way. All the
code to execute would already have to be in Airflow in this case.
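
As a purely illustrative sketch of what that could look like (no such
format or loader exists in Airflow today - the payload shape, the loader
function and the choice of operator are all assumptions), a declaration
and the Airflow-side code that turns it into a DAG might be:

# Illustrative only: a hypothetical declarative payload and a loader that
# builds a DAG from it. The payload can only reference operators already
# installed in Airflow; a real implementation would also whitelist the
# allowed operators and validate their parameters.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

declaration = {
    "dag_id": "declared_example",
    "start_date": "2022-08-01",
    "tasks": [
        {"task_id": "extract", "bash_command": "echo extract"},
        {"task_id": "load", "bash_command": "echo load", "upstream": ["extract"]},
    ],
}

def build_dag(decl: dict) -> DAG:
    dag = DAG(
        dag_id=decl["dag_id"],
        start_date=datetime.fromisoformat(decl["start_date"]),
        schedule_interval=None,
        catchup=False,
    )
    tasks = {
        spec["task_id"]: BashOperator(
            task_id=spec["task_id"], bash_command=spec["bash_command"], dag=dag
        )
        for spec in decl["tasks"]
    }
    for spec in decl["tasks"]:
        for upstream_id in spec.get("upstream", []):
            tasks[upstream_id] >> tasks[spec["task_id"]]
    return dag

dag = build_dag(declaration)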

And I very much agree that this case can be solved with Git. I think we
are generally undervaluing the role Git plays in DAG distribution for
Airflow.

I think when a user feels the need (I very much understand that need,
Constance) to submit a DAG via an API, rather than adding the option of
submitting DAG code via the "Airflow REST API", we should simply answer
this:

*Use Git and git sync. "git push" then becomes the standard "API" you
wanted for pushing the code.*

This has all the flexibility you need: it integrates with pull requests
and CI workflows, keeps history, etc. When we tell people "use Git" - we
have ALL of that and more for free. Standing on the shoulders of giants.
If we start thinking about integrating code push via our own API - we
basically start the journey of rewriting Git, as eventually we will have
to support those cases. That makes absolutely no sense to me.

I even start to think that we should make "git sync" a separate (and much
more viable) option that is pretty much the "main recommendation" for
Airflow, rather than "yet another option" among shared folders and
baked-in DAGs.

I recently wrote up my thoughts about this in the post "Shared Volumes in
Airflow - the good, the bad and the ugly":
https://medium.com/apache-airflow/shared-volumes-in-airflow-the-good-the-bad-and-the-ugly-22e9f681afca
which goes into much more detail on why I think so.

J.



Re: [Proposal] Creating DAG through the REST api

Posted by Constance Martineau <co...@astronomer.io.INVALID>.
I understand the security concerns, and generally agree, but as a regular
user I have always wished we could upload DAG files via an API. It opens the
door to having an upload button, which would be nice, and it would make
Airflow a lot more accessible to non-engineering types.

I love that idea: implementing a manual review option in conjunction with
some sort of hook (similar to Airflow cluster policies) would be a good
middle ground. An administrator could use that hook to run checks against
DAGs or run security scanners, and decide whether or not to enforce a
review requirement.
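
For illustration only, here is a minimal sketch of the kind of hook I have
in mind, modelled on Airflow's existing cluster policies (the tag allow-list
is a made-up example; an API-submission hook could consult an approval queue
instead):

    # airflow_local_settings.py - a dag_policy that refuses unapproved DAGs.
    from airflow.exceptions import AirflowClusterPolicyViolation

    # Hypothetical allow-list of approved tags.
    APPROVED_TAGS = {"reviewed", "team-data-platform"}

    def dag_policy(dag):
        """Reject any DAG that does not carry an approved tag."""
        if not APPROVED_TAGS.intersection(dag.tags or []):
            raise AirflowClusterPolicyViolation(
                f"DAG {dag.dag_id} has not been approved for this deployment"
            )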


-- 

Constance Martineau
Product Manager

Email: constance@astronomer.io
Time zone: US Eastern (EST UTC-5 / EDT UTC-4)


<https://www.astronomer.io/>

Re: [Proposal] Creating DAG through the REST api

Posted by Tomasz Urbaszek <tu...@apache.org>.
In general I second what XD said. CI/CD feels better than sending DAG files
over an API, and the security issues arising from accepting "any Python
file" are probably quite big.

However, I think this proposal can be tightly related to "declarative
DAGs". Instead of sending a DAG file, the user would send the DAG
definition (operators, inputs, relations) in a predefined format that is
not code. This of course has some limitations, like the inability to define
custom macros or callbacks on the fly, but it may be a good compromise.
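
To make that concrete, here is a minimal sketch of what I mean (the payload
shape and the build_dag helper are invented for illustration; nothing like
this exists in Airflow today):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Data, not code: what a client might submit to the API.
    dag_spec = {
        "dag_id": "example_declarative_dag",
        "schedule_interval": "@daily",
        "tasks": [
            {"task_id": "extract", "bash_command": "echo extract"},
            {"task_id": "load", "bash_command": "echo load",
             "upstream": ["extract"]},
        ],
    }

    def build_dag(spec):
        """Turn a declarative spec into a DAG using only whitelisted fields."""
        dag = DAG(
            dag_id=spec["dag_id"],
            schedule_interval=spec["schedule_interval"],
            start_date=datetime(2022, 1, 1),
            catchup=False,
        )
        tasks = {
            t["task_id"]: BashOperator(
                task_id=t["task_id"], bash_command=t["bash_command"], dag=dag
            )
            for t in spec["tasks"]
        }
        for t in spec["tasks"]:
            for upstream_id in t.get("upstream", []):
                tasks[upstream_id] >> tasks[t["task_id"]]
        return dag

    # Expose the DAG so Airflow's DAG processor can discover it.
    globals()[dag_spec["dag_id"]] = build_dag(dag_spec)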

Another thought - if we implement something like "DAG via API", then we
should consider adding an option to review DAGs (an approval queue etc.) to
regain the safeguards we currently get by deploying DAGs from git (where we
have code review, security scanners etc.).

Cheers,
Tomek


Re: [Proposal] Creating DAG through the REST api

Posted by Xiaodong Deng <xd...@apache.org>.
Hi Mocheng,

Please allow me to share a question first: in your proposal, the API would
still accept an Airflow DAG as the payload (just binarized or compressed),
right?

If that's the case, I may not be fully convinced: the objectives in your
proposal are automation and programmatic DAG submission. These can already
be achieved in an efficient way through CI/CD practices plus a centralized
place to manage your DAGs (e.g. a Git repo to host the DAG files).

As you are already aware, allowing this via the API adds additional security
concerns, and I doubt that it "breaks even".

Kindly let me know if I have missed anything or misunderstood your
proposal. Thanks.


Regards,
XD
----------------------------------------------------------------
(This is not a contribution)


Re: [Proposal] Creating DAG through the REST api

Posted by Jarek Potiuk <ja...@potiuk.com>.
> For the security concern, is it true that "access DAG files" means
loading DAG code? If that's correct, the proposal would not introduce it
inside the API/web server: the DAG could be serialized in the API client,
and DAG code/files received through the API would be handled as an opaque
blob, though the blob needs to be persisted and the metadata inside the DB
needs to be updated. Task execution in the worker can be better isolated
with the current internal API initiative. I have missed the discussion in
AIP-5, so maybe you could help educate me here: what are the security
differences between DAGs pushed to the API vs DAGs pulled from remote
repositories in AIP-5?

Opening up submission via the API actually changes a lot. What you are
essentially proposing is for the Airflow API to allow a person who is
authenticated via the Airflow API/webserver to submit code that will be
executed elsewhere (DagProcessor/Worker). This is not possible today - not
because the authentication and webserver implementation prevents it, but
because when Airflow is deployed, there is no physical possibility to
submit such code. Simply put, the Airflow webserver has no way to change
the code that is executed, because it does not have that code mounted. The
code displayed in the webserver is a small subset of the code (just the
DAG) read from the DB, but it is never executed - it is not even the
serialized blob, it is the source code of the DAG file that the DAG came
from. The best you can do is modify the source code in the DB, but no code
in the DB ever gets executed (if your deployment is done well and you do
not allow pickling).

Effectively, this aspect - the responsibility and the security perimeter -
is in the hands of the people who do the deployment (i.e. users), not
application developers (people who submit the code to Airflow). Your
proposal completely changes that responsibility. Instead of the "users" who
are deploying it, this security is now in the hands of application
developers (i.e. people who commit code to Airflow). We deliberately
decided to take that responsibility off our shoulders and pass it to the
users. Your proposal is really an attempt to put the responsibility back on
our shoulders.

The webserver is the only component that is potentially available to users,
and it is a security gateway that has to be opened at least to the internal
"users" on the internal network. Opening it up to accept code that will, by
design, be executed is simply not a good idea IMHO. Only the worker and the
DAG file processor should ever have the possibility of executing
user-provided code. This is what we have now. With AIP-43, implemented in
2.3, even the scheduler does not have to have access to DAG files or
execute DAG code - you can have separate DagFileProcessors to do that.
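
As a side note - and assuming the DagCode helper that recent Airflow 2.x
versions expose - this is roughly all the webserver's "Code" view does with
DAG source: read it from the metadata DB and render it, never import or run
it:

    # Sketch only; requires a configured Airflow metadata DB.
    from airflow.models.dagcode import DagCode

    source = DagCode.get_code_by_fileloc("/opt/airflow/dags/my_pipeline.py")
    print(source)  # displayed as text, never executed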

> Besides security, one difference between AIP-5/AIP-20 and the API is that
the AIP-5/AIP-20 design is only about reading and does not manage DAG
creation inside Airflow. I understand if this is by design, to keep Airflow
out of the storage responsibility and instead rely on an external
service/process to manage and supply the DAG repo, but it brings extra
complexity; for example, this external service/process needs to understand
Airflow and prevent duplicate dag_ids.

Correct. This makes deployment more complex. And it's a deliberate design
decision. And it will not be changed, I am afraid - precisely for the
reasons described above.

> The API proposal could support this natively with access to the DB, and
could synchronously return status to the client. If this could alternatively
be included inside AIP-5, that would be great.

Once AIP-5/AIP-20 is implemented, you will be able to implement your own
API if you want to submit DAGs this way, no problem with that. While
Airflow components might pull the data, this should be completely decoupled
from the public API of Airflow. The public API serves a different purpose,
but there is nothing to prevent you from implementing your own API. In
fact, if you look closely, this is already happening and it is possible
even now in various deployments. Those are different APIs - deployment
specific - and while there is indeed no "synchronous" triggering of the
DAG, you can submit the code even now via various mechanisms:

* Git push (if the Airflow workers/DAG processors use GitSync)
* Copy files to a shared volume (if they use NFS/EFS etc.)
* Push files to S3/GCS (if S3/GCS-backed filesystems are used)

Those are APIs - not REST APIs, but still APIs - and all of them are
actually vastly superior to a REST API for sending a bunch of Python files
(because this is what they have been designed for). The only problem (and a
REST API does not solve it on its own) is the lack of synchronous waiting
for the DAG to become eligible to run. This happens asynchronously now -
and it is not something that can be changed easily even if you want to use
a REST API.
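
For illustration only (the bucket name and prefix are made up, and this
assumes a deployment whose DAGs folder is synced from S3), "submitting" a
DAG through one of those existing mechanisms is as simple as:

    import boto3

    def publish_dag(local_path: str,
                    bucket: str = "my-airflow-dags",
                    prefix: str = "dags/") -> None:
        """Upload a DAG file to the bucket the Airflow deployment syncs from."""
        s3 = boto3.client("s3")
        key = prefix + local_path.rsplit("/", 1)[-1]
        s3.upload_file(local_path, bucket, key)

    publish_dag("./my_pipeline.py")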

IMHO, what you are really looking for is a better-integrated way of
submitting a DAG and waiting for it to be ready to run. But you do not need
a REST API for submitting code at all for that. And actually, a naive
implementation of submitting the code via a REST API in the current
architecture will not make it "magically" return when the DAG is "ready" to
be run. There is a lot more happening in the scheduler to make a DAG ready,
and a REST API to submit the code alone does not solve that at all.

Maybe a better solution (and that could be part of the DAG fetcher AIP-5)
is to expose some async/WebSocket API where you could subscribe to be
notified when a DAG is ready to run? I think AIP-5 is very far from being
complete, so this might definitely become part of it. I encourage you to
propose it there if you think it might be a good idea. But just don't ask
us to take security responsibility for code submitted via Airflow's REST
API. This is not something we would like to do, I think (or at least it is
a responsibility we got rid of rather recently and deliberately).
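
Purely hypothetically - neither the endpoint nor the message format exists
in Airflow today - such a subscription could look something like this on
the client side:

    import asyncio
    import json

    import websockets  # third-party "websockets" package

    async def wait_until_ready(dag_id: str) -> None:
        uri = "wss://airflow.example.com/api/v1/dag-events"  # hypothetical
        async with websockets.connect(uri) as ws:
            await ws.send(json.dumps({"subscribe": dag_id}))
            async for raw in ws:
                event = json.loads(raw)
                if event.get("dag_id") == dag_id and event.get("state") == "ready":
                    return

    asyncio.run(wait_until_ready("my_pipeline"))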

J.



Re: [Proposal] Creating DAG through the REST api

Posted by Mocheng Guo <gm...@gmail.com>.
Hi Jarek, thanks a lot for the feedback. I understand security is a major
concern, and I would like to discuss it more here. AIP-5/AIP-20 share the
same goal of being able to ship DAGs individually, but there are some
differences, and I'd be happy to align them if that is possible.

For the security concern, is it true that "access DAG files" means loading
DAG code? If that's correct, the proposal would not introduce it inside the
API/web server: the DAG could be serialized in the API client, and DAG
code/files received through the API would be handled as an opaque blob,
though the blob needs to be persisted and the metadata inside the DB needs
to be updated. Task execution in the worker can be better isolated with the
current internal API initiative. I have missed the discussion in AIP-5, so
maybe you could help educate me here: what are the security differences
between DAGs pushed to the API vs DAGs pulled from remote repositories in
AIP-5?
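
To make the client side concrete, this is roughly what I have in mind -
note that the endpoint, payload shape and credentials below are
hypothetical, since no such upload endpoint exists in Airflow today:

    import requests

    with open("my_pipeline.zip", "rb") as f:
        resp = requests.post(
            "https://airflow.example.com/api/v1/dags/upload",  # hypothetical endpoint
            files={"artifact": ("my_pipeline.zip", f, "application/zip")},
            auth=("api_user", "api_password"),  # placeholder credentials
            timeout=30,
        )
    resp.raise_for_status()
    print(resp.json())  # e.g. the dag_id and import status reported back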

Besides security, one difference between AIP-5/AIP-20 and the API is that
the AIP-5/AIP-20 design is only about reading and does not manage DAG
creation inside Airflow. I understand if this is by design, to keep Airflow
out of the storage responsibility and instead rely on an external
service/process to manage and supply the DAG repo, but it brings extra
complexity; for example, this external service/process needs to understand
Airflow and prevent duplicate dag_ids. The API proposal could support this
natively with access to the DB, and could synchronously return status to
the client. If this could alternatively be included inside AIP-5, that
would be great.

thanks
Mocheng



Re: [Proposal] Creating DAG through the REST api

Posted by Jarek Potiuk <ja...@potiuk.com>.
This has been discussed several times, and I think you should rather take a
look at, and focus on, the proposals that are already there:

* https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-20+DAG+manifest
* https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-5+Remote+DAG+Fetcher

Both proposals are supposed to address various caveats connected with
trying to submit a Python DAG via the API.

* DAG manifest is a proposal on how to add metadata to "limit" a DAG and
know what kind of other resources are needed for it to run
* Remote DAG Fetcher, on the other hand, would allow a user to use various
mechanisms to "pull the data" (or, if you develop your fetcher in a way
that allows push, it would also let a push model work if we add some async
notifications there) - a rough sketch of what such a fetcher interface
could look like follows below.
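
Hypothetical sketch only - nothing like this interface exists in Airflow,
and the names are invented - but it shows the kind of pluggable fetcher
AIP-5 is about:

    import subprocess
    from abc import ABC, abstractmethod

    class BaseDagFetcher(ABC):
        """Pulls DAG artifacts from a remote source into the local DAGs folder."""

        def __init__(self, dags_folder: str):
            self.dags_folder = dags_folder

        @abstractmethod
        def fetch(self) -> None:
            """Download or refresh the DAG files in self.dags_folder."""

    class GitDagFetcher(BaseDagFetcher):
        """Example fetcher: refresh a local clone that holds the DAG files."""

        def __init__(self, dags_folder: str, branch: str = "main"):
            super().__init__(dags_folder)
            self.branch = branch

        def fetch(self) -> None:
            subprocess.run(
                ["git", "-C", self.dags_folder, "pull", "origin", self.branch],
                check=True,
            )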

I personally think using the REST API to submit DAGs is a bad idea because
it is against the current security model of Airflow, where the webserver
(and also the REST API) has not merely "READ ONLY" access to DAGs but
actually "NO ACCESS" to DAGs whatsoever.
Currently, the webserver (i.e. the API server) has no physical access to
any resources where executable DAG code could be changed or even read
directly. It only accesses the DB. Changing that is a huge change in the
security model. Actually, it goes backwards from the changes we first
implemented in Airflow 1.10 and made the only option in Airflow 2
(introducing DAG serialisation), where we specifically put a lot of effort
into removing the need for the webserver to access DAG files - and the
security model we chose was the main driver for that.

Making it possible to submit new executable code via Airflow's REST API
would significantly increase the dangers of exposing the API and make it an
order of magnitude more serious point of attack. Basically, you are
allowing anyone who has access to the API to submit executable code that
will be executed by the DAG file processor and worker.
Due to this, I don't think using the REST API for that is a good idea, and
for me this is a no-go.

However, both AIP-5 and AIP-20 (when discussed, approved and implemented)
should nicely address the user requirement you have without compromising
the security of the APIs - so I'd heartily recommend that you take a look
there and see if maybe you could take the lead in those discussions and
finalise them. Currently no one is actively working on those two AIPs, but
I think there are at least a few people who would like to be involved if
someone will lead this effort (myself included).

J.

