Posted to yarn-dev@hadoop.apache.org by "Naganarasimha G R (Naga)" <ga...@huawei.com> on 2014/09/01 16:46:45 UTC

Regarding a scenario where applications are being hung

Hi,

    I have a scenario in which applications hang indefinitely, and I wanted to validate whether it is a bug (if so, I will raise a JIRA).



Consider a cluster with 2 NMs, each with 8 GB of resource.

Two applications are launched in the default queue, and each AM takes 2 GB.

One AM is placed on each NM. Now each AM requests a single container of 7 GB memory.

As only 6 GB is available on each NM, both applications hang forever.
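
A quick sketch of the arithmetic behind this scenario, using the numbers above:

```python
# Two NMs of 8 GB each; each of the two AMs occupies 2 GB on its own node.
NODE_CAPACITY_GB = 8
AM_SIZE_GB = 2
REQUEST_GB = 7

free_per_node = NODE_CAPACITY_GB - AM_SIZE_GB  # 6 GB free on each NM
cluster_free = 2 * free_per_node               # 12 GB free in total

# The cluster as a whole has room (12 GB >= 7 GB), but no single node
# does (6 GB < 7 GB), so neither 7 GB request can ever be placed and
# both applications wait forever.
fits_in_cluster = cluster_free >= REQUEST_GB     # True
fits_on_some_node = free_per_node >= REQUEST_GB  # False
```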



Is this a bug ?



Regards,

Naga



Huawei Technologies Co., Ltd.
Phone:
Fax:
Mobile:  +91 9980040283
Email: naganarasimhagr@huawei.com<ma...@huawei.com>
Huawei Technologies Co., Ltd.
Bantian, Longgang District,Shenzhen 518129, P.R.China
http://www.huawei.com

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

Re: Regarding a scenario where applications are being hung

Posted by Steve Loughran <st...@hortonworks.com>.
All AMs are already free to implement their own policy of tracking
outstanding requests and reacting to them, today. I don't know of any that do.

https://issues.apache.org/jira/browse/YARN-624 covers "gang scheduling", in
which an AM can say "don't assign any containers until the requirements are
met". This addresses the more complex dining-philosophers problem in which:

AM1 requests 4 x 2 GB containers, gets back 3 and holds on to them, not
starting work until it has all four.
AM2 requests 2 x 4 GB containers, gets one and hangs around waiting for the
other.

A timeout would tell them both they have failed and prompt them to react,
which, unless they are clever, will probably just mean failing themselves.
Yet there are enough resources to satisfy both AMs sequentially.
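
The AM-side policy described above could be sketched as follows. This is a framework-agnostic illustration; the class and its methods are invented for this sketch and are not part of any YARN API.

```python
import time

class OutstandingRequestTracker:
    """Tracks container requests an AM has issued but not yet received,
    so the AM can react (e.g. release a partial allocation, or fail fast)
    instead of waiting forever. Illustrative sketch only, not a YARN API."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.pending = {}  # request_id -> time the request was issued

    def request_issued(self, request_id):
        self.pending[request_id] = self.clock()

    def container_allocated(self, request_id):
        # The request was satisfied; stop tracking it.
        self.pending.pop(request_id, None)

    def timed_out(self):
        """Return the ids of requests unfilled longer than the timeout."""
        now = self.clock()
        return [rid for rid, t in self.pending.items()
                if now - t > self.timeout_s]
```

An AM could call `timed_out()` on each allocate heartbeat and decide for itself whether to release the containers it is holding, retry, or fail the attempt.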


There's another possibility in cloud environments: something gets told that
more compute capacity is required, possibly triggering requests for more
VMs.



-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Regarding a scenario where applications are being hung

Posted by Wangda Tan <wh...@gmail.com>.
Hi Naga,
AFAIK, there's no such timeout. Since this is a new feature request, I'd
suggest creating a JIRA and moving the discussion there.

Thanks,
Wangda



RE: Regarding a scenario where applications are being hung

Posted by "Naganarasimha G R (Naga)" <ga...@huawei.com>.
Hi,

    "AM can take action if it doesn't receive any container for some time."

Can we have a generic timeout feature for all AMs on the YARN side, such that if no containers are assigned to an application for a defined period, then YARN can time out the application attempt?

The default could be 0, meaning the RM never times out the app attempt, and the user could set his own timeout when submitting the application.



Basically, we faced this issue in MR2 itself, and I was not able to find any such timeout in the MapReduce configuration.
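
The proposed semantics could look like this. The constant and function names are hypothetical, since no such timeout exists in YARN or the MapReduce configuration today:

```python
# Hypothetical setting illustrating the proposal; not a real YARN property.
DEFAULT_ALLOCATION_TIMEOUT_MS = 0  # 0 = RM never times out the attempt

def should_fail_attempt(timeout_ms, ms_without_allocation):
    """Return True if the RM should time out an app attempt that has
    received no containers for ms_without_allocation milliseconds."""
    if timeout_ms <= 0:  # 0 (the proposed default) disables the timeout
        return False
    return ms_without_allocation >= timeout_ms
```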

Regards,
Naga

Re: Regarding a scenario where applications are being hung

Posted by Wangda Tan <wh...@gmail.com>.
Hi Naga,
When trying to allocate a container, the behavior is as follows:
First the scheduler checks the capacity of the queue; in your case the queue
capacity is 8 GB * 2 = 16 GB, so a 7 GB container passes the check.
Then it checks whether there is enough space on a node; in your case there
is not, so the ResourceRequest is skipped.

There's no "timeout" for a ResourceRequest now; the AM can take action if it
doesn't receive any container for some time.
Preemption is another story: preemption is used to reclaim resources for an
under-satisfied queue from an over-satisfied queue. That's not your case either.
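
The two-step check described above can be sketched as follows, with the numbers from this thread. This is a deliberate simplification of the real scheduler logic, with invented names:

```python
def try_allocate(request_gb, queue_free_gb, node_free_gb):
    """Simplified sketch of the allocation path: the queue-capacity check
    runs first; only then does the node-level check decide whether the
    ResourceRequest can actually be placed somewhere."""
    if request_gb > queue_free_gb:
        return "rejected by queue check"
    for node, free in node_free_gb.items():
        if request_gb <= free:
            return f"allocated on {node}"
    # Passed the queue check, but no single node has room: the request
    # is skipped each scheduling cycle, with no timeout.
    return "skipped (no node has enough space)"

# The scenario from this thread: 12 GB free in the queue, 6 GB free per NM.
result = try_allocate(7, queue_free_gb=12, node_free_gb={"nm1": 6, "nm2": 6})
```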

Hope this helps,
Wangda




RE: Regarding a scenario where applications are being hung

Posted by "Naganarasimha G R (Naga)" <ga...@huawei.com>.
Hi Wangda,

              Yes, it is the case that each AM is requesting 7 GB per container. But can you describe why this is expected behavior?

From a user perspective, either the application request should not have been accepted, or after some time both apps should have been killed with a proper exception or log information, or one of the AM containers should have been preempted, etc.

Here none of these happen!


Regards,
Naga




Re: Regarding a scenario where applications are being hung

Posted by Wangda Tan <wh...@gmail.com>.
Hi Naga,
According to the scenario you described: if "each AM is requesting a
container of 7 GB memory" means a request for a single 7 GB container
(rather than 7 containers of 1 GB each), then this is expected behavior.
Please let me know if you have more questions.

Thanks,
Wangda

