Posted to dev@airflow.apache.org by Kevin Lam <ke...@fathomhealth.co> on 2018/07/04 21:37:16 UTC

What information is passed around different components of Airflow?

Hi,

We run Apache Airflow as a set of k8s deployments inside a GKE cluster,
similar to the setup in Mumoshu's GitHub repo:
https://github.com/mumoshu/kube-airflow.

We are investigating securing our use of Airflow and are wondering about
some of Airflow's implementation details. Specifically, we run some tasks
where the workers have access to sensitive data. Some of that data can make
its way into the task logs. However, we want to make sure it isn't passed
around, e.g. to the scheduler/database/message queue, and if it is, that it
is encrypted in any network traffic (e.g. via mutual TLS).

- Does Airflow pass logs to the Postgres DB or to RabbitMQ?
- Is the information in Postgres mainly operational in nature?
- Is the information in RabbitMQ mainly operational in nature?
- What about the scheduler?
- Anything else we're missing?

Any ideas are appreciated!

Thanks in advance!

Re: What information is passed around different components of Airflow?

Posted by Maxime Beauchemin <ma...@gmail.com>.
The MQ (rabbit / redis / ...) gets the `airflow run {dag_id} {task_id}
{...}` command to execute, and as far as I remember the worker runs it
blindly. That's not ideal security-wise: if the MQ is compromised, there's
an open vector to the workers. Eventually it would be safer to send some
sort of payload (say JSON) and build the command on the worker side from
that payload. Regardless, you should limit network access to the MQ to the
cluster itself, and ideally secure Celery with SSL and signed messages.
More information on how to secure Celery is here:
http://docs.celeryproject.org/en/latest/userguide/security.html
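
For reference, here's a minimal sketch of what Celery-level message signing
could look like if you manage the Celery config yourself. The app name and
certificate paths below are placeholders, not anything Airflow ships; in
Airflow the Celery settings ultimately come from airflow.cfg, so adapt this
to your deployment:

    # celery_security.py - hypothetical config module, a sketch only.
    from celery import Celery

    app = Celery("airflow_celery_app", broker="amqps://rabbitmq:5671//")
    app.conf.update(
        # Only accept messages signed with the 'auth' serializer.
        accept_content=["auth"],
        task_serializer="auth",
        event_serializer="auth",
        # Key/cert used to sign outgoing messages; store of trusted certs.
        security_key="/etc/ssl/private/worker.key",
        security_certificate="/etc/ssl/certs/worker.pem",
        security_cert_store="/etc/ssl/certs/*.pem",
    )
    # Turns on signing/verification based on the settings above.
    app.setup_security()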

Max

Re: What information is passed around different components of Airflow?

Posted by James Meickle <jm...@quantopian.com.INVALID>.
Airflow logs are stored on the worker filesystem. When a worker starts, it
runs a subprocess that serves logs via Flask:
https://github.com/apache/incubator-airflow/blob/master/airflow/bin/cli.py#L985
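
Conceptually that subprocess is just a small Flask app reading files out of
the worker's local log directory. A rough sketch of the idea (the route and
paths here are illustrative, not the exact code at that link):

    # Sketch of a worker-side log server, not Airflow's actual serve_logs.
    import os
    from flask import Flask, abort, send_from_directory

    LOG_DIR = os.path.expanduser("~/airflow/logs")  # base_log_folder
    app = Flask(__name__)

    @app.route("/log/<path:filename>")
    def serve_log(filename):
        # Serve one task log file from the worker's local filesystem.
        if not os.path.isfile(os.path.join(LOG_DIR, filename)):
            abort(404)
        return send_from_directory(LOG_DIR, filename, mimetype="text/plain")

    if __name__ == "__main__":
        # The webserver pulls task logs from workers over plain HTTP on this
        # port, so treat that traffic as sensitive as well.
        app.run(host="0.0.0.0", port=8793)  # worker_log_server_port default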

If you use the remote logging feature, the logs are (instead? also?) stored
in S3.
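
If you go that route, remote logging is controlled by a few config options.
A sketch of enabling it via Airflow's environment-variable overrides (the
bucket name and connection id are placeholders, and the option names have
shifted a bit between Airflow versions):

    # Airflow reads AIRFLOW__<SECTION>__<KEY> env vars as config overrides.
    # These names match the 1.10-era [core] section; adjust per version.
    import os

    os.environ["AIRFLOW__CORE__REMOTE_LOGGING"] = "True"
    os.environ["AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER"] = "s3://my-airflow-logs/"
    os.environ["AIRFLOW__CORE__REMOTE_LOG_CONN_ID"] = "aws_default"
    # Ask S3 to encrypt the stored log objects server-side.
    os.environ["AIRFLOW__CORE__ENCRYPT_S3_LOGS"] = "True"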

Postgres stores most of what you see in the UI: task and DAG state, user
accounts, privileges (with RBAC), variables, connections, etc.
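
One related detail: connection passwords and variable values in the
metadata DB can be encrypted at rest when a fernet_key is set in
airflow.cfg; Airflow uses Fernet from the cryptography package for this. A
toy sketch of the scheme (not Airflow's internal code):

    # Fernet symmetric encryption, the scheme behind Airflow's fernet_key.
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()           # corresponds to the fernet_key setting
    f = Fernet(key)

    token = f.encrypt(b"my-db-password")  # ciphertext is what lands in Postgres
    print(token)                          # opaque without the key
    print(f.decrypt(token))               # b'my-db-password', decrypted on read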

I believe the RabbitMQ information is just task states and names, and that
the workers fetch most of what they need from the database. But if you
intercepted it you could manipulate which tasks are being run, so I'd still
treat it as sensitive.
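
Since the original question mentions mutual TLS: transport to the broker
can be encrypted at the Celery level, separately from message signing. A
sketch using Celery's broker_use_ssl option (the certificate paths are
placeholders for whatever your cluster issues):

    # Sketch: TLS (optionally mutual) between Celery clients/workers and
    # RabbitMQ.
    import ssl
    from celery import Celery

    app = Celery("airflow_celery_app", broker="amqps://rabbitmq:5671//")
    app.conf.broker_use_ssl = {
        "keyfile": "/etc/ssl/private/client.key",
        "certfile": "/etc/ssl/certs/client.pem",
        "ca_certs": "/etc/ssl/certs/ca.pem",
        "cert_reqs": ssl.CERT_REQUIRED,  # verify the broker's certificate
    }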
