Posted to user@hadoop.apache.org by Gagan Brahmi <ga...@gmail.com> on 2016/05/01 19:52:14 UTC

Re: YARN queues become unusable and jobs are stuck in ACCEPTED state

Matt,

You may want to check the resource manager logs to see if you find any
errors or exceptions related to concurrent modifications for the
application.
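
As a rough starting point, a short Python sketch along these lines could
scan a log for such exceptions. The function name and the default search
string are assumptions for illustration, not anything from YARN itself;
adjust the search strings to whatever your logs actually contain.

```python
from typing import Iterable, List, Tuple

# Assumption: the exception shows up in the log under its standard Java
# class name. Add other strings (e.g. an application ID) as needed.
DEFAULT_NEEDLES = ("ConcurrentModificationException",)


def find_suspect_lines(
    lines: Iterable[str],
    needles: Tuple[str, ...] = DEFAULT_NEEDLES,
) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line containing any needle.

    Line numbers are 1-based; trailing whitespace is stripped from the text.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(needle in line for needle in needles):
            hits.append((number, line.rstrip()))
    return hits
```

To use it, open the resource manager log file and pass the file object
as `lines`; the location of that log depends on your installation.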


Regards,
Gagan Brahmi

On Fri, Apr 29, 2016 at 4:33 PM, Ray Chiang <rc...@cloudera.com> wrote:
> Just because you have sufficient resources doesn't mean another job should
> launch an AM.  You might want to check maxAMShare and
> queueMaxAMShareDefault.
>
> Given that you have sufficient resources, you could be running into
> YARN-3491.
>
> I don't know whether you have the option, but CDH 5.3.3 is pretty old at
> this point.  CDH 5.3.10/5.4.10/5.5.2 have the latest bug fixes.
>
> -Ray
>
> On Thu, Apr 28, 2016 at 12:03 PM, Matt Cheah <mc...@palantir.com> wrote:
>>
>> Hi,
>>
>> I've been sporadically seeing an issue when using Hadoop YARN. I'm using
>> Hadoop 2.5.0, CDH5.3.3.
>>
>> When I've configured the stack to use the fair scheduler protocol, after
>> some period of time of the cluster being alive and running jobs, I'm
>> noticing that when I submit a job, the job will be stuck in the ACCEPTED
>> state even though the cluster has sufficient resources to spawn an
>> application master container as well as the queue I'm submitting to having
>> sufficient resources available. Furthermore, all jobs submitted to that
>> queue will be stuck in the ACCEPTED state. I can unblock job submission by
>> going into the allocation XML file, renaming the queue, and submitting jobs
>> to that renamed queue instead. However the queue has only changed name, and
>> all of its other settings have been preserved.
>>
>> It is clearly untenable for me to have to change the queues that I'm using
>> sometimes. This appears to happen irrespective of the settings of the queue,
>> e.g. its weight or its minimum resource share. The events leading up to this
>> occurrence are strictly unpredictable and I have no concrete way to
>> reproduce the issue. The logs don't show anything interesting either; the
>> resource manager just states that it schedules an attempt for the
>> application submitted to the bad queue, but the attempt's application master
>> is never allocated to a container anywhere.
>>
>> I have looked around the YARN bug base and couldn't find any similar
>> issues. I've also used jstack to inspect the Resource Manager process, but
>> nothing is obviously wrong there. I was wondering if anyone has encountered
>> a similar issue before. I apologize that the description is vague, but it's
>> the best way I can describe it.
>>
>> Thanks,
>>
>> -Matt Cheah
>>
>>
>
