Posted to dev@tez.apache.org by Rohini Palaniswamy <ro...@gmail.com> on 2015/01/22 00:01:11 UTC

Handling ATS downtime

Folks,
     In the middle of a big discussion on how to get delegation tokens from
ATS for Oozie jobs, another question came up: what is the behaviour of
running Tez jobs if ATS goes down? I haven't tried it out, but my guess is
that the job is going to fail. Or do we already do something to handle the
failure and still have the job complete successfully?

Regards,
Rohini

Re: Handling ATS downtime

Posted by Hitesh Shah <hi...@apache.org>.

Agreed. @Jonathan, can you file JIRAs for the two cases, with the related stack traces?

The domain handling might be the trickier issue, as we will need to disable ATS publishing if the domain cannot be created.

— Hitesh 


On Jan 21, 2015, at 4:03 PM, Jonathan Eagles <je...@gmail.com> wrote:

> I just checked this behavior in a secure cluster and if it fails to get a
> timeline server delegation token or fails to post the domain,  the job will
> fail. We should consider making these operations "best effort" as well.
> On Jan 21, 2015 5:33 PM, "Hitesh Shah" <hi...@apache.org> wrote:
> 
>> Actually at this time, the current impl just logs a WARN when there is a
>> failure pushing data to ATS. ATS is not treated as a critical entity as it
>> is not needed for job recovery.
>> 
>> — Hitesh
>> 
>> On Jan 21, 2015, at 3:01 PM, Rohini Palaniswamy <ro...@gmail.com>
>> wrote:
>> 
>>> Folks,
>>>    In the middle of big discussion on how to get delegation tokens from
>>> ATS for Oozie jobs, another question came up. What is the behaviour of
>>> running tez jobs if ATS goes down. Haven't tried it out, but my guess is
>>> the job is going to fail. Or do we do something now to handle the failure
>>> and still have the job complete successfully?
>>> 
>>> Regards,
>>> Rohini
>> 
>>

Re: Handling ATS downtime

Posted by Jonathan Eagles <je...@gmail.com>.

I just checked this behavior in a secure cluster: if the job fails to get a
timeline server delegation token or fails to post the domain, the job will
fail. We should consider making these operations "best effort" as well.

On Jan 21, 2015 5:33 PM, "Hitesh Shah" <hi...@apache.org> wrote:

> Actually at this time, the current impl just logs a WARN when there is a
> failure pushing data to ATS. ATS is not treated as a critical entity as it
> is not needed for job recovery.
>
> — Hitesh
>
> On Jan 21, 2015, at 3:01 PM, Rohini Palaniswamy <ro...@gmail.com>
> wrote:
>
> > Folks,
> >     In the middle of big discussion on how to get delegation tokens from
> > ATS for Oozie jobs, another question came up. What is the behaviour of
> > running tez jobs if ATS goes down. Haven't tried it out, but my guess is
> > the job is going to fail. Or do we do something now to handle the failure
> > and still have the job complete successfully?
> >
> > Regards,
> > Rohini
>
>

Re: Handling ATS downtime

Posted by Hitesh Shah <hi...@apache.org>.

Actually, at this time, the current implementation just logs a WARN when it fails to push data to ATS. ATS is not treated as a critical component because it is not needed for job recovery.

— Hitesh

On Jan 21, 2015, at 3:01 PM, Rohini Palaniswamy <ro...@gmail.com> wrote:

> Folks,
>     In the middle of big discussion on how to get delegation tokens from
> ATS for Oozie jobs, another question came up. What is the behaviour of
> running tez jobs if ATS goes down. Haven't tried it out, but my guess is
> the job is going to fail. Or do we do something now to handle the failure
> and still have the job complete successfully?
> 
> Regards,
> Rohini
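
The "best effort" behaviour discussed in this thread can be illustrated with a minimal sketch. It is not Tez's actual implementation: the class BestEffortTimelinePublisher and the TimelineSink interface are hypothetical stand-ins invented for illustration, not the real Tez or YARN timeline client API. The sketch assumes two things taken from the messages above: a failed push to ATS is only logged at WARN level, and a failed domain creation disables further ATS publishing rather than failing the job.

```java
import java.io.IOException;
import java.util.logging.Logger;

/** Hypothetical sketch of best-effort ATS publishing; not Tez's real code. */
class BestEffortTimelinePublisher {

    /** Hypothetical stand-in for a timeline (ATS) client. */
    interface TimelineSink {
        void createDomain(String domainId) throws IOException;
        void putEvent(String event) throws IOException;
    }

    private static final Logger LOG =
            Logger.getLogger(BestEffortTimelinePublisher.class.getName());

    private final TimelineSink sink;
    private boolean publishingEnabled = true;
    private int droppedEvents = 0;

    BestEffortTimelinePublisher(TimelineSink sink) {
        this.sink = sink;
    }

    /** Tries to create the domain; on failure, disables publishing instead of throwing. */
    void setupDomain(String domainId) {
        try {
            sink.createDomain(domainId);
        } catch (IOException e) {
            publishingEnabled = false;
            LOG.warning("Could not create timeline domain " + domainId
                    + "; disabling ATS publishing: " + e.getMessage());
        }
    }

    /** Publishes one event; returns true only if the sink accepted it. Never throws. */
    boolean publish(String event) {
        if (!publishingEnabled) {
            droppedEvents++;
            return false;
        }
        try {
            sink.putEvent(event);
            return true;
        } catch (IOException e) {
            // ATS is not needed for job recovery, so log and carry on.
            droppedEvents++;
            LOG.warning("Failed to push event to ATS: " + e.getMessage());
            return false;
        }
    }

    boolean isPublishingEnabled() {
        return publishingEnabled;
    }

    int getDroppedEvents() {
        return droppedEvents;
    }
}
```

In this sketch the caller never sees an exception from the timeline path, so an ATS outage costs history data but not the job. Whether dropped events should be buffered and retried is a separate design question that the thread does not settle.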