Posted to dev@slider.apache.org by "David.Serafini" <Da...@target.com> on 2017/09/27 23:48:40 UTC

slider job fails when resourcemanager restarts

My Slider jobs sometimes fail for no obvious reason.
One hypothesis is that this happens when the ResourceManager restarts (more precisely, when one of the two redundant RMs restarts).

Is this expected behavior?   

The jobs don't always fail completely. Sometimes YARN fails an attempt and starts another one, the job's containers all restart, and everything is fine. At other times, some of the running jobs have trouble and others don't. I haven't figured out a pattern yet.

Any insight would be appreciated.

-david



Re: slider job fails when resourcemanager restarts

Posted by Billie Rinaldi <bi...@gmail.com>.
You should be able to figure out the cause from the AM log. It sounds like
it could be SLIDER-1183. A complete fix for this issue also requires
YARN-5999. The SLIDER-1183 fix by itself should stop the app from being
killed, but the AM will remain in a broken state.
