Posted to user@hadoop.apache.org by Matt Cheah <mc...@palantir.com> on 2016/01/23 00:29:33 UTC

YARN queues become unusable and jobs are stuck in ACCEPTED state

Hi,

I've sporadically been seeing an issue when using Hadoop YARN. I'm using
Hadoop 2.5.0, CDH5.3.3.

When the stack is configured to use the Fair Scheduler, after the cluster
has been up and running jobs for some time, I notice that a newly submitted
job gets stuck in the ACCEPTED state, even though both the cluster and the
queue I'm submitting to have sufficient resources available to spawn an
application master container. Furthermore, all jobs submitted to that
queue get stuck in the ACCEPTED state. I can unblock job submission by
going into the allocation XML file, renaming the queue, and submitting jobs
to the renamed queue instead. However, only the queue's name has changed;
all of its other settings have been preserved.
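
To illustrate the workaround, the change in the allocation file looks
roughly like the sketch below. The queue names and the values shown are
placeholders made up for illustration, not the real configuration; only the
name attribute differs between the stuck queue and its replacement.

```xml
<?xml version="1.0"?>
<!-- Fair Scheduler allocation file (sketch; names and values are hypothetical) -->
<allocations>
  <!-- Previously: <queue name="etl"> ... jobs submitted here stayed in ACCEPTED -->
  <queue name="etl_renamed">
    <!-- Settings carried over unchanged from the old queue definition -->
    <minResources>4096 mb,2vcores</minResources>
    <weight>2.0</weight>
  </queue>
</allocations>
```

Jobs are then submitted to the new name (etl_renamed in this sketch) instead
of the old one.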

It is clearly untenable for me to have to change the queues I'm using
every so often. This appears to happen irrespective of the queue's settings,
e.g. its weight or its minimum resource share. The events leading up to
this occurrence are unpredictable, and I have no concrete way to reproduce
the issue. The logs don't show anything interesting either; the resource
manager just states that it schedules an attempt for the application
submitted to the bad queue, but the attempt's application master is never
allocated a container anywhere.

I have looked through the YARN bug tracker and couldn't find any similar
issues. I've also used jstack to inspect the Resource Manager process, but
nothing is obviously wrong there. I was wondering if anyone has encountered
a similar issue before. I apologize that the description is vague, but it's
the best way I can describe it.

Thanks,

-Matt Cheah