Posted to dev@airflow.apache.org by Kyle Hamlin <ha...@gmail.com> on 2018/08/22 21:12:54 UTC

Will redeploying webserver and scheduler in Kubernetes cluster kill running tasks

I'm about to make the switch to Kubernetes with Airflow, but am wondering
what happens when my CI/CD pipeline redeploys the webserver and scheduler
while there are still long-running tasks (pods). My intuition is that since
the database holds all state, the tasks are in charge of updating their
own state, and the UI only renders what it sees in the database, this
is not much of a problem. To be sure, however, here are my questions:

Will task pods continue to run?
Can task pods continue to poll the external systems they are running tasks
on while being "headless"?
Can the task pods change/update state in the database while being
"headless"?
Will the UI/scheduler still be aware of the tasks (pods) once they are live
again?

Is there anything else that might cause issues when deploying while tasks
(pods) are running that I'm not thinking of here?

Kyle Hamlin

Re: Will redeploying webserver and scheduler in Kubernetes cluster kill running tasks

Posted by Daniel Imberman <da...@gmail.com>.
Also worth mentioning that when you restart the scheduler, it will use etcd
and Postgres to recreate state, so you won't end up re-launching or missing
tasks.


Re: Will redeploying webserver and scheduler in Kubernetes cluster kill running tasks

Posted by Eamon Keane <ea...@gmail.com>.
Great, I must give pgbouncer a try. Testing on GKE/Cloud SQL I quickly ran
into that limit. The next possible limit might be etcd, as pod creation is
expensive, so if there were a lot of short-lived pods you might run into
issues (e.g. the k8s API refusing connections), or so a Google SRE tells me.


Re: Will redeploying webserver and scheduler in Kubernetes cluster kill running tasks

Posted by Greg Neiheisel <gr...@astronomer.io>.
Yep, that should work fine. Pgbouncer is pretty configurable, so you can
play around with different settings for your environment. You can set
limits on the number of connections you want to the actual database and
point your AIRFLOW__CORE__SQL_ALCHEMY_CONN to the pgbouncer service. In my
experience, you can get away with a pretty low number of actual connections
to Postgres. Pgbouncer has some tools to observe the count of clients
(airflow processes), the number of actual connections to the database, and
the number of waiting clients. You should be able to tune your
max_connections to the point where you have little to no clients waiting,
while using a dramatically lower number of actual connections to Postgres.

That chart also deploys a sidecar to pgbouncer that exports the metrics for
Prometheus to scrape. Here's an example Grafana dashboard that we use to
keep an eye on things -
https://github.com/astronomerio/astronomer/blob/master/docker/vendor/grafana/include/pgbouncer-stats.json


-- 
*Greg Neiheisel* / CTO Astronomer.io

Re: Will redeploying webserver and scheduler in Kubernetes cluster kill running tasks

Posted by Eamon Keane <ea...@gmail.com>.
Interesting, Greg. Do you know if using pgbouncer would allow you to have
more than 100 running k8s executor tasks at one time if, e.g., there is a
100-connection limit on the GCP instance?


Re: Will redeploying webserver and scheduler in Kubernetes cluster kill running tasks

Posted by Greg Neiheisel <gr...@astronomer.io>.
Good point Eamon, maxing connections out is definitely something to look
out for. We recently added pgbouncer to our helm charts to pool connections
to the database for all the different airflow processes. Here's our chart
for reference -
https://github.com/astronomerio/helm.astronomer.io/tree/master/charts/airflow



-- 
*Greg Neiheisel* / CTO Astronomer.io

Re: Will redeploying webserver and scheduler in Kubernetes cluster kill running tasks

Posted by Kyle Hamlin <ha...@gmail.com>.
Thanks for your responses! Glad to hear that tasks can run independently if
something happens.



-- 
Kyle Hamlin

Re: Will redeploying webserver and scheduler in Kubernetes cluster kill running tasks

Posted by Eamon Keane <ea...@gmail.com>.
Adding to Greg's point, if you're using the k8s executor and for some
reason the k8s executor worker pod fails to launch within 120 seconds (e.g.
it is pending due to scaling up a new node), this counts as a task failure.
Also, if the k8s executor pod has already launched a pod operator but is
killed (e.g. manually or due to a node upgrade), the pod operator it launched
is not killed and runs to completion, so if you use retries, you need to
ensure idempotency. Per my understanding, the worker pods update the db, with
each requiring a separate connection; this can tax your connection budget
(100-300 for small Postgres instances on GCP or AWS).


Re: Will redeploying webserver and scheduler in Kubernetes cluster kill running tasks

Posted by Greg Neiheisel <gr...@astronomer.io>.
Hey Kyle, the task pods will continue to run even if you reboot the
scheduler and webserver, and the status does get updated in the airflow db,
which is great.

I know the scheduler subscribes to the Kubernetes watch API to get an event
stream of pods completing, and it keeps a checkpoint so it can resubscribe
when it comes back up.

I forget if the worker pods update the db or if the scheduler is doing
that, but it should work out.
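The checkpoint-and-resubscribe idea can be sketched in plain Python. This is a
simplified model of how a watcher can use Kubernetes resourceVersion numbers as
a checkpoint; it is not the actual scheduler code, and the event tuples are
invented for illustration:

```python
def consume_events(events, checkpoint=None):
    """Process pod events newer than `checkpoint`; return (processed, new checkpoint).

    Each event is a (resource_version, pod_status) tuple. The checkpoint is the
    highest resource_version already handled, so a restarted watcher can skip
    events it processed before going down.
    """
    processed = []
    for rv, pod_status in events:
        if checkpoint is not None and rv <= checkpoint:
            continue  # already handled before the restart
        processed.append(pod_status)
        checkpoint = rv
    return processed, checkpoint

# Before a restart: the watcher sees two completions and checkpoints rv=2.
stream = [(1, "pod-a Succeeded"), (2, "pod-b Succeeded")]
seen, ckpt = consume_events(stream)
# After the restart: resubscribing from the checkpoint handles only the new event.
stream.append((3, "pod-c Succeeded"))
seen, ckpt = consume_events(stream, ckpt)
```

In the real system the scheduler passes its saved resource_version back to the
watch API so Kubernetes replays only the events it missed while it was down.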


Re: Will redeploying webserver and scheduler in Kubernetes cluster kill running tasks

Posted by Kyle Hamlin <ha...@gmail.com>.
gentle bump



-- 
Kyle Hamlin