Posted to dev@airflow.apache.org by "Dubey, Somesh" <so...@nordstrom.com> on 2018/08/24 13:40:04 UTC

airflow 1.9 tasks randomly failing | k8 - hive

We are on Airflow 1.9 on Kubernetes. We have many Hive DAGs which we call via JDBC (through a REST endpoint), and tasks fail at random. One day one task fails, the next day another.
The pods are all healthy, there are no node evictions or other issues, and retries typically work. Even when a task fails, it completes successfully on the Hive side.
We do not get good logs of the failure; typically we only see partial SQL on stdout in the task log, with no error message. We have increased the log level to 5 on the JDBC connection.
This only happens in production.

We have not been able to reproduce the issue in dev.
The closest we have come to replicating this behavior in dev is to kill the PID (manually killing the airflow run --raw process) on the worker node.
In that case too, things run fine on the Hive side but the task fails in Airflow.
We get a similar task log with no error.
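
For what it's worth, the manual kill we do in dev amounts to roughly the
following (just a sketch; it assumes psutil is installed on the worker, and
normally we simply find the PID with ps and kill it by hand):

    # Find the raw task runner process on the worker and kill it, to mimic
    # whatever terminates it in production. Sketch only; assumes psutil.
    import signal
    import psutil

    for proc in psutil.process_iter(attrs=["pid", "cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if "airflow run" in cmdline and "--raw" in cmdline:
            print("killing", proc.info["pid"], cmdline)
            proc.send_signal(signal.SIGKILL)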

Have you seen this kind of behavior? Any help is much appreciated.

Thanks so much,
Somesh





Re: airflow 1.9 tasks randomly failing | k8 - hive

Posted by Emmanuel Brard <em...@getyourguide.com>.
Hello,

You can see the zombie tasks being killed in the scheduler logs (text file);
I think you need the 'INFO' log level though. The detection is based on the
time difference from the last heartbeat in the job table.
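
Roughly, the check boils down to something like this against the metadata
database (a sketch, not Airflow's actual code; the connection string and the
5-minute threshold are placeholders you would adapt):

    # Compare each running LocalTaskJob's latest_heartbeat in the "job" table
    # against a staleness threshold, which is what zombie detection keys on.
    from datetime import datetime, timedelta
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://airflow:***@airflow-db/airflow")  # placeholder DSN
    threshold = timedelta(minutes=5)  # placeholder threshold

    with engine.connect() as conn:
        rows = conn.execute(text(
            "SELECT id, dag_id, state, latest_heartbeat "
            "FROM job "
            "WHERE job_type = 'LocalTaskJob' AND state = 'running'"
        ))
        for row in rows:
            if datetime.utcnow() - row.latest_heartbeat > threshold:
                print("possible zombie:", row.id, row.dag_id, row.latest_heartbeat)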

Best,
E

On Fri, Aug 24, 2018 at 4:36 PM Dubey, Somesh <so...@nordstrom.com>
wrote:

> Thanks so much Emmanuel for the reply.
> How did Airflow classify those processes as zombies?
> Did any log say that?
> We are not able to get any good logs for the failing tasks.
>
> Thanks so much - you are potentially saving us hours of work and sleepless
> nights, as things are breaking in production for us.
> Somesh
>
> -----Original Message-----
> From: Emmanuel Brard [mailto:emmanuel.brard@getyourguide.com]
> Sent: Friday, August 24, 2018 6:48 AM
> To: dev@airflow.incubator.apache.org
> Subject: Re: airflow 1.9 tasks randomly failing | k8 - hive
>
> Hey,
>
> We have a similar setup with Airflow 1.9 on Kubernetes with the Celery
> executor.
>
> We saw some Airflow tasks being killed inside the container because of
> cgroup limits (set by Kubernetes and pushed to the docker daemon), but not
> Airflow itself (the airflow celery command), which ended up as zombies
> identified by the scheduler (which sets the task to failed). We decreased
> the number of celery slots on each worker, reducing memory pressure, and
> this stopped. It happened mostly with one DAG which had simple operators
> but was quite big, with subdags involved.
>
> Maybe you are facing the same issue?
>
> Bye,
> E
>
> On Fri, Aug 24, 2018 at 3:40 PM Dubey, Somesh <so...@nordstrom.com>
> wrote:
>
> > We are on Airflow 1.9 on Kubernetes. We have many Hive DAGs which we
> > call via JDBC (through a REST endpoint), and tasks fail at random. One
> > day one task fails, the next day another.
> > The pods are all healthy, there are no node evictions or other issues,
> > and retries typically work. Even when a task fails, it completes
> > successfully on the Hive side.
> > We do not get good logs of the failure; typically we only see partial
> > SQL on stdout in the task log, with no error message. We have increased
> > the log level to 5 on the JDBC connection.
> > This only happens in production.
> >
> > We have not been able to reproduce the issue in dev.
> > The closest we have come to replicating this behavior in dev is to kill
> > the PID (manually killing the airflow run --raw process) on the worker
> > node.
> > In that case too, things run fine on the Hive side but the task fails
> > in Airflow.
> > We get a similar task log with no error.
> >
> > Have you seen this kind of behavior? Any help is much appreciated.
> >
> > Thanks so much,
> > Somesh

RE: airflow 1.9 tasks randomly failing | k8 - hive

Posted by "Dubey, Somesh" <so...@nordstrom.com>.
Thanks so much Emmanuel for the reply. 
How did Airflow classify those processes as zombies?
Did any log say that?
We are not able to get any good logs for the failing tasks.

Thanks so much - you are potentially saving us hours of work and sleepless nights, as things are breaking in production for us.
Somesh

-----Original Message-----
From: Emmanuel Brard [mailto:emmanuel.brard@getyourguide.com] 
Sent: Friday, August 24, 2018 6:48 AM
To: dev@airflow.incubator.apache.org
Subject: Re: airflow 1.9 tasks randomly failing | k8 - hive

Hey,

We have a similar setup with Airflow 1.9 on Kubernetes with the Celery executor.

We saw some Airflow tasks being killed inside the container because of cgroup limits (set by Kubernetes and pushed to the docker daemon), but not Airflow itself (the airflow celery command), which ended up as zombies identified by the scheduler (which sets the task to failed). We decreased the number of celery slots on each worker, reducing memory pressure, and this stopped. It happened mostly with one DAG which had simple operators but was quite big, with subdags involved.

Maybe you are facing the same issue?

Bye,
E

On Fri, Aug 24, 2018 at 3:40 PM Dubey, Somesh <so...@nordstrom.com>
wrote:

> We are on Airflow 1.9 on Kubernetes. We have many Hive DAGs which we call
> via JDBC (through a REST endpoint), and tasks fail at random. One day one
> task fails, the next day another.
> The pods are all healthy, there are no node evictions or other issues, and
> retries typically work. Even when a task fails, it completes successfully
> on the Hive side.
> We do not get good logs of the failure; typically we only see partial SQL
> on stdout in the task log, with no error message. We have increased the
> log level to 5 on the JDBC connection.
> This only happens in production.
>
> We have not been able to reproduce the issue in dev.
> The closest we have come to replicating this behavior in dev is to kill
> the PID (manually killing the airflow run --raw process) on the worker node.
> In that case too, things run fine on the Hive side but the task fails in
> Airflow.
> We get a similar task log with no error.
>
> Have you seen this kind of behavior? Any help is much appreciated.
>
> Thanks so much,
> Somesh

Re: airflow 1.9 tasks randomly failing | k8 - hive

Posted by Emmanuel Brard <em...@getyourguide.com>.
Hey,

We have a similar setup with Airflow 1.9 on Kubernetes with the Celery
executor.

We saw some Airflow tasks being killed inside the container because of
cgroup limits (set by Kubernetes and pushed to the docker daemon), but not
Airflow itself (the airflow celery command), which ended up as zombies
identified by the scheduler (which sets the task to failed). We decreased
the number of celery slots on each worker, reducing memory pressure, and
this stopped. It happened mostly with one DAG which had simple operators
but was quite big, with subdags involved.
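
If you want to check whether your worker containers are getting close to
their memory limit, a quick look from inside the pod can help. This is only
a sketch and assumes cgroup v1 paths (what our Kubernetes nodes expose); the
slot count itself is, if I remember correctly, celeryd_concurrency under
[celery] in airflow.cfg on 1.9:

    # Read the container's cgroup v1 memory limit and current usage and
    # report how close the container is to its limit.
    def read_cgroup_value(path):
        with open(path) as f:
            return int(f.read().strip())

    limit = read_cgroup_value("/sys/fs/cgroup/memory/memory.limit_in_bytes")
    usage = read_cgroup_value("/sys/fs/cgroup/memory/memory.usage_in_bytes")
    print("memory usage: %d / %d bytes (%.1f%% of limit)"
          % (usage, limit, 100.0 * usage / limit))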

Maybe you are facing the same issue?

Bye,
E

On Fri, Aug 24, 2018 at 3:40 PM Dubey, Somesh <so...@nordstrom.com>
wrote:

> We are on Airflow 1.9 on Kubernetes. We have many Hive DAGs which we call
> via JDBC (through a REST endpoint), and tasks fail at random. One day one
> task fails, the next day another.
> The pods are all healthy, there are no node evictions or other issues, and
> retries typically work. Even when a task fails, it completes successfully
> on the Hive side.
> We do not get good logs of the failure; typically we only see partial SQL
> on stdout in the task log, with no error message. We have increased the
> log level to 5 on the JDBC connection.
> This only happens in production.
>
> We have not been able to reproduce the issue in dev.
> The closest we have come to replicating this behavior in dev is to kill
> the PID (manually killing the airflow run --raw process) on the worker node.
> In that case too, things run fine on the Hive side but the task fails in
> Airflow.
> We get a similar task log with no error.
>
> Have you seen this kind of behavior? Any help is much appreciated.
>
> Thanks so much,
> Somesh