Posted to dev@airflow.apache.org by Shoumitra Srivastava <sh...@gmail.com> on 2017/11/02 17:55:11 UTC

Airflow on ECS

Hi guys,

So far we have had a lot of success testing out Airflow and we are now
going for a full scale deployment. To that end, we are considering
dockerizing airflow and deploying it on one of our ECS clusters. We are
planning on separating out the web server and the scheduler to separate
tasks and using local executor with an RDS postgres and redis backend. Does
anyone else have any suggestions regarding the setup? Any design patterns
or good practises and gotchas would be welcome.

-Shoumitra

Re: Airflow on ECS

Posted by Shoumitra Srivastava <sh...@gmail.com>.
Hi guys,

Thank you so much for your thoughtful and well-articulated replies. This
has been invaluable in charting out the next steps for our deployment.
Michael, it seems we are headed towards a structure similar to the one you
outlined, since our loads are not very heavy as of now. The Kubernetes
executor looks promising and we will be monitoring its status. Daniel, I
have already signed up for the Meetup and hope to see you there as well!

-Shoumitra

On Mon, Nov 6, 2017 at 1:04 PM, Daniel Imberman <da...@gmail.com> wrote:

> Hi Shoumitra,
>
> One thing worth noting is that with the release of the kubernetes executor,
> we will be using resource versions + the Kubernetes API to take care of
> some of the current issues with crash handling (basically recreating state
> from what tasks have been run/are pending within the cluster). The
> kubernetes executor also offloads all tasks to individual pods so you will
> not need to worry about the resources of any tasks affecting the scheduler.
>
> If you're available (and in SF) on Dec. 4th, we will be discussing the PR
> at airbnb for the airflow meetup.
>
> Hope to see you there!
>
> https://www.meetup.com/Bay-Area-Apache-Airflow-Incubating-Meetup/events/244525050/

Re: Airflow on ECS

Posted by Daniel Imberman <da...@gmail.com>.
Hi Shoumitra,

One thing worth noting is that with the release of the Kubernetes executor,
we will be using resource versions + the Kubernetes API to take care of
some of the current issues with crash handling (basically recreating state
from which tasks have run or are pending within the cluster). The
Kubernetes executor also offloads all tasks to individual pods, so you will
not need to worry about the resources of any tasks affecting the scheduler.

If you're available (and in SF) on Dec. 4th, we will be discussing the PR
at Airbnb for the Airflow meetup.

Hope to see you there!

https://www.meetup.com/Bay-Area-Apache-Airflow-Incubating-Meetup/events/244525050/

On Mon, Nov 6, 2017 at 9:39 AM Michael Erdely <mj...@gmail.com> wrote:

> Hi Shoumitra,
>
> As others have mentioned, there are a lot of issues when using the local
> executor in prod. However, at OfferUp, we have had success in running
> Airflow dockerized on EC2.
>
> Our current setup is the following:
>
>    - Airflow 1.8.2 dockerized similar to Matthieu's Celery example at
>    https://github.com/puckel/docker-airflow
>    - Running scheduler, webserver, flower, and 5 workers on a c4.8xlarge
>    EC2 instance
>    - RDS hosted Postgres
>    - ElastiCache hosted Redis
>
> We are close to the limits of this setup and plan on redoing our
> configuration with terraform. Not sure if we'll keep the dockerized setup
> but it's been extremely helpful thus far.
>
> -Michael

Re: Airflow on ECS

Posted by Michael Erdely <mj...@gmail.com>.
Hi Shoumitra,

As others have mentioned, there are a lot of issues when using the
LocalExecutor in prod. However, at OfferUp, we have had success in running
Airflow dockerized on EC2.

Our current setup is the following:

   - Airflow 1.8.2 dockerized similar to Matthieu's Celery example at
   https://github.com/puckel/docker-airflow
   - Running scheduler, webserver, flower, and 5 workers on a c4.8xlarge
   EC2 instance
   - RDS hosted Postgres
   - ElastiCache hosted Redis

We are close to the limits of this setup and plan on redoing our
configuration with Terraform. We're not sure if we'll keep the dockerized
setup, but it's been extremely helpful thus far.

-Michael



On Thu, Nov 2, 2017 at 11:27 AM Marc Bollinger <ma...@lumoslabs.com> wrote:

> We're actively following the Airflow/Kubernetes integration
> <https://issues.apache.org/jira/browse/AIRFLOW-1314>, and are eventually
> going to move to both running everything on k8s and using
> KubernetesExecutors for many things, but we've deployed Airflow to ECS from
> day one. It works mostly fine, and we're using a tool we open-sourced
> called Broadside <https://github.com/lumoslabs/broadside> to simplify
> configuration and deployment. Our deploy is broken up into one scheduler,
> one Flower instance, a few web servers, and a number of workers, using
> CeleryExecutor backed by redis/Elasticache (and RDS postgres, as you're
> suggesting), all in ECS from the same private docker image.
>
> Tacking on to what Bolke is saying, it is also somewhat tricky in our
> experience to get deploys right in ECS with CeleryExecutors. Our first
> impulse was to bake the DAG directory/repo into the docker image and run an
> ECS deploy every time we added or updated DAGs, bouncing all of the
> components and killing the workers. Where we wound up is that our CI system
> still bakes the DAG directory into the images when we merge to master, but
> for a "short" deploy we only bounce the web server and scheduler--the
> worker containers all just execute `git pull` and pull down the new/updated
> DAGs. Others may have different approaches that work, I'm sure, possibly
> moving the DAG directory to a shared EFS mount.

Re: Airflow on ECS

Posted by Marc Bollinger <ma...@lumoslabs.com>.
We're actively following the Airflow/Kubernetes integration
<https://issues.apache.org/jira/browse/AIRFLOW-1314>, and are eventually
going to move to both running everything on k8s and using the
KubernetesExecutor for many things, but we've deployed Airflow to ECS from
day one. It works mostly fine, and we're using a tool we open-sourced
called Broadside <https://github.com/lumoslabs/broadside> to simplify
configuration and deployment. Our deploy is broken up into one scheduler,
one Flower instance, a few web servers, and a number of workers, using
CeleryExecutor backed by Redis on ElastiCache (and RDS Postgres, as you're
suggesting), all in ECS from the same private Docker image.

Tacking on to what Bolke is saying, it is also somewhat tricky in our
experience to get deploys right in ECS with the CeleryExecutor. Our first
impulse was to bake the DAG directory/repo into the Docker image and run an
ECS deploy every time we added or updated DAGs, bouncing all of the
components and killing the workers. Where we wound up is that our CI system
still bakes the DAG directory into the images when we merge to master, but
for a "short" deploy we only bounce the web server and scheduler--the
worker containers all just execute `git pull` and pull down the new/updated
DAGs. Others may have different approaches that work, I'm sure, possibly
moving the DAG directory to a shared EFS mount.
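
A minimal sketch of the "short" deploy step on a worker, assuming the DAG
directory inside the container is a git checkout. The function names and
the example path are hypothetical, and the thread does not say how the pull
is triggered on each worker, so that part is left out.

```python
import subprocess


def build_pull_command(dag_dir):
    """Return the git command that updates the checkout in dag_dir."""
    # `git -C <path>` runs git as if it had been started in <path>;
    # --ff-only refuses to create a merge commit on the worker.
    return ["git", "-C", dag_dir, "pull", "--ff-only"]


def pull_dags(dag_dir):
    """Run the pull once and report whether git exited successfully."""
    result = subprocess.run(build_pull_command(dag_dir))
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical path; use wherever the image keeps its DAGs.
    print(build_pull_command("/usr/local/airflow/dags"))
```

Using --ff-only is a design choice for this sketch: a worker that has
somehow diverged from master fails loudly instead of merging.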

On Thu, Nov 2, 2017 at 11:06 AM, Bolke de Bruin <bd...@gmail.com> wrote:

> Please remember that with the LocalExecutor your tasks run in
> process(group) with the scheduler. If you want to restart the scheduler, it
> will need to wait until all tasks have finished that are currently running.
> In addition, if your tasks are resource intensive (cpu, memory) this can also
> affect the scheduler. In 1.9.0 we are a little bit more robust in this
> respect, but guarding against OOM errors is very hard.
>
> Furthermore, the new logging framework in 1.9.0, will allow you to have
> logs centrally which might be convenient. However, documentation is not up
> to date so you will have to tune it yourself.
>
> My 2 cents,
>
> Bolke.

Re: Airflow on ECS

Posted by Bolke de Bruin <bd...@gmail.com>.
Please remember that with the LocalExecutor your tasks run in the same process (group) as the scheduler. If you want to restart the scheduler, it will need to wait until all currently running tasks have finished. In addition, if your tasks are resource intensive (CPU, memory), this can also affect the scheduler. In 1.9.0 we are a little bit more robust in this respect, but guarding against OOM errors is very hard.
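
To illustrate the process-group point, here is a generic POSIX sketch in
Python. It is not Airflow's own code: it only shows that a child process
started the default way stays in its parent's process group, which is why
signals sent to the group reach both. The function name is hypothetical.

```python
import os
import subprocess
import sys


def child_shares_process_group():
    """Spawn a child the default way and compare process group ids."""
    # The child prints its own process group id and exits.
    out = subprocess.check_output(
        [sys.executable, "-c", "import os; print(os.getpgrp())"]
    )
    return int(out.decode().strip()) == os.getpgrp()


if __name__ == "__main__":
    print("child in same process group:", child_shares_process_group())
```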

Furthermore, the new logging framework in 1.9.0 will allow you to collect logs centrally, which might be convenient. However, the documentation is not up to date, so you will have to tune it yourself.

My 2 cents,

Bolke.

> On 2 Nov 2017, at 18:55, Shoumitra Srivastava <sh...@gmail.com> wrote:
> 
> Hi guys,
> 
> So far we have had a lot of success testing out Airflow and we are now
> going for a full scale deployment. To that end, we are considering
> dockerizing airflow and deploying it on one of our ECS clusters. We are
> planning on separating out the web server and the scheduler to separate
> tasks and using local executor with an RDS postgres and redis backend. Does
> anyone else have any suggestions regarding the setup? Any design patterns
> or good practises and gotchas would be welcome.
> 
> -Shoumitra


Re: Airflow on ECS

Posted by Gerard Toonstra <gt...@gmail.com>.
Hey,

As Bolke said, with the LocalExecutor (LE) and tasks consuming variable
amounts of memory, you can run into memory issues on a container. I'd
reconsider running in a containerized environment at all, because with the
LE and the scheduler together, you need to set up a huge container for that
to work. You're probably better off on an EC2 instance for that.
With LE, you don't need Redis at all, because Redis serves as the back-end
(message broker) for the CeleryExecutor, not the LocalExecutor.
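
A minimal sketch of that difference, written as the environment variables
one might pass to each container. It assumes Airflow 1.x option names
(`executor` and `sql_alchemy_conn` under [core], `broker_url` under
[celery]) and the AIRFLOW__SECTION__KEY environment variable convention;
the helper name and the host names are hypothetical placeholders.

```python
def airflow_env(executor, sql_alchemy_conn, broker_url=None):
    """Build AIRFLOW__SECTION__KEY environment variables for a container."""
    # Both executors need the metadata database (Postgres on RDS here).
    env = {
        "AIRFLOW__CORE__EXECUTOR": executor,
        "AIRFLOW__CORE__SQL_ALCHEMY_CONN": sql_alchemy_conn,
    }
    # Only the CeleryExecutor needs a message broker such as Redis.
    if executor == "CeleryExecutor":
        if broker_url is None:
            raise ValueError("CeleryExecutor needs a broker URL")
        env["AIRFLOW__CELERY__BROKER_URL"] = broker_url
    return env


if __name__ == "__main__":
    db = "postgresql://airflow:secret@my-rds-host:5432/airflow"  # placeholder
    print(airflow_env("LocalExecutor", db))
    print(airflow_env("CeleryExecutor", db, "redis://my-redis-host:6379/0"))
```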

We used CeleryExecutor with Redis in a spike on ECS. Indeed, logging is the
biggest issue here. We used static IPs and hostnames for the containers
we started (which doesn't necessarily make them "cattle"). We closed it off
and used Splunk to get all logging output in a centralized location. I
didn't spend enough time to consider all the implications there, though:
the web UI is helpful for seeing the log output for a specific window, for
example, and through Splunk you actually lose that.

There were issues with memory usage and OOM: memory gets reserved by the
container, so if anything restarts or gets unstable, look at that first.
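
A back-of-the-envelope sketch of the sizing problem, assuming the scheduler
and its tasks share one container's memory reservation. The function name
and all numbers are hypothetical; measure real task memory before relying
on anything like this.

```python
def max_parallel_tasks(container_mb, scheduler_mb, per_task_mb):
    """Largest task count that fits beside the scheduler in one container."""
    available = container_mb - scheduler_mb
    if available <= 0 or per_task_mb <= 0:
        return 0
    return available // per_task_mb


if __name__ == "__main__":
    # 8 GiB reservation, 1 GiB for the scheduler, 512 MiB per task.
    print(max_parallel_tasks(8192, 1024, 512))  # 14
```

The result is only an upper bound to compare against the configured
parallelism; tasks with variable memory use can still push the container
past its reservation.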

To synchronize DAGs across all VMs, we experimented with EFS (works like
NFS), and the idea was to let CI deploy onto that as the single writer.

Rgds,

Gerard



On Thu, Nov 2, 2017 at 6:55 PM, Shoumitra Srivastava <shoumitra362@gmail.com> wrote:

> Hi guys,
>
> So far we have had a lot of success testing out Airflow and we are now
> going for a full scale deployment. To that end, we are considering
> dockerizing airflow and deploying it on one of our ECS clusters. We are
> planning on separating out the web server and the scheduler to separate
> tasks and using local executor with an RDS postgres and redis backend. Does
> anyone else have any suggestions regarding the setup? Any design patterns
> or good practises and gotchas would be welcome.
>
> -Shoumitra
>