Posted to dev@airflow.apache.org by Amit Jain <aj...@gmail.com> on 2017/04/18 21:47:58 UTC

Best practices on Long running process over LB

Hi All,

We have a use case where we are building an Airflow DAG consisting of a
few tasks, and each task (an HttpOperator) calls a service running
behind an AWS Elastic Load Balancer (ELB).

Since these tasks are long-running processes, I'm getting a 504 GATEWAY
TIMEOUT HTTP status code, which results in an incorrect task status on
the Airflow side.

IMO, to solve this problem, we can choose between the following
approaches:

   - Make a call to the service; the service sends back an immediate
   response and processes the actual request in another thread/process. A
   monitoring thread would heartbeat the task status to a DB. On the
   Airflow side, immediately after each HttpOperator we would have a
   sensor that checks for the status change at a given poke interval.
   - Since we have around 1,500 tasks running per hour, use a service
   discovery system such as Apache ZooKeeper to pick the node running the
   service in round-robin fashion and connect to it directly, bypassing
   the ELB.
   - Note that AWS ELB caps the HTTP idle timeout at 1 hr, and my tasks
   take ~3 hr to complete, so no fix is possible on the AWS ELB side.
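The poke loop behind the first approach can be sketched
framework-agnostically; inside Airflow it would live in a custom
sensor's poke() method. The status names, intervals, and the idea of a
zero-argument status callable are illustrative assumptions, not part of
any existing service:

```python
import time


def wait_for_status(get_status, accept=("done",), fail=("failed",),
                    poke_interval=60, timeout=3 * 3600):
    """Poll `get_status` until it reports a terminal status.

    `get_status` is any zero-argument callable returning the job's
    current status string -- e.g. a thin wrapper around an HTTP call to
    the service's (hypothetical) /status/<token> endpoint.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in accept:
            return status          # job finished successfully
        if status in fail:
            raise RuntimeError("upstream job failed: %s" % status)
        time.sleep(poke_interval)  # the sensor's poke interval
    raise TimeoutError("job did not finish within %s seconds" % timeout)
```

With a ~3 hr job you'd pick a poke interval of a few minutes and a
timeout comfortably above 3 hr; the status source is injected as a
callable so the loop can be exercised without a real HTTP endpoint.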


Both approaches have cons. The first one makes us change the current
flow on each service's side, i.e. handle the request in async mode and
heartbeat the status of the executing process/thread at some interval,
hence the extra DB writes.

I'm interested to know how you are handling this problem, and in any
suggestions or improvements on the above approaches.


Thanks,
Amit

Re: Best practices on Long running process over LB

Posted by siddharth anand <sa...@apache.org>.
Another approach:
1. Airflow calls the webservice in a fire-and-forget fashion
2. The webservice updates a message bus/stream (e.g. SQS) with the result
3. An Airflow sensor pulls updates off SQS and processes them

This saves Airflow from polling your webservice, which would in turn
poll your DB. Additionally, it avoids coupling your Airflow instance to
the availability of your webservice and DB. Otherwise, you'd also need
to implement an efficient HTTP endpoint to return status for a
potentially long list of status_ids, and then manage that list of ids.

SQS is great. It's cheap to poll (and SQS supports long polling as
well) and doesn't couple Airflow to the uptime of your webservice and
DB. SQS also supports batch reads and is transactional.
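A rough sketch of one poke of such an SQS-draining sensor, using long
polling and a batch read. The queue URL and JSON message shape are
assumptions, and the boto3 SQS client is passed in (anything with the
same receive_message / delete_message_batch interface works, which also
makes the code testable against a stub):

```python
import json


def drain_job_results(sqs_client, queue_url, handle_result, batch_size=10):
    """One poll pass: read finished-job messages off SQS, hand each
    decoded body to `handle_result`, then delete the processed batch."""
    resp = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=batch_size,  # SQS batch read (max 10)
        WaitTimeSeconds=20,              # long polling: block up to 20 s
    )
    messages = resp.get("Messages", [])
    for msg in messages:
        handle_result(json.loads(msg["Body"]))
    if messages:
        # Delete only after successful processing -- SQS redelivers
        # anything not deleted before the visibility timeout expires.
        sqs_client.delete_message_batch(
            QueueUrl=queue_url,
            Entries=[{"Id": m["MessageId"],
                      "ReceiptHandle": m["ReceiptHandle"]}
                     for m in messages],
        )
    return len(messages)
```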

-s


Re: Best practices on Long running process over LB

Posted by Maxime Beauchemin <ma...@gmail.com>.
The proper way to do this is for your service to return a token (a
unique identifier for the long-running process) asynchronously
(immediately), and for Airflow to then call another endpoint, passing
this token, to check on the status.

Since this is Airflow and you have the luxury of a lot of predefined
sensors, you may just have to call a trigger endpoint async and, in the
next task, have a sensor look for the actual byproduct of that
service's process (say, if the process generates an S3 file, you'd have
an S3KeySensor right after the trigger task). The good thing with this
approach is that it's more "stateless" than the token-based approach
(it allows tasks to die without worrying about the token).
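On the sensor side, this trigger-then-sensor pattern reduces to an
existence check on the job's byproduct. A minimal sketch, assuming the
job writes a marker object to S3 when it finishes (the bucket and key
names are made up); Airflow's S3KeySensor performs essentially this
check on each poke:

```python
def byproduct_ready(s3_client, bucket, key):
    """Return True once the service's output object exists in S3.

    `s3_client` is a boto3 S3 client (or anything exposing the same
    head_object interface and an `exceptions.ClientError` attribute).
    """
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
        return True
    except s3_client.exceptions.ClientError:
        return False  # typically a 404: the job hasn't finished yet
```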

Max
