Posted to common-user@hadoop.apache.org by Sadystio Ilmatunt <ur...@gmail.com> on 2016/03/03 02:26:02 UTC

Question about YARN NodeManager and ApplicationMaster failures

Hello,

I have some questions about NodeManager and ApplicationMaster failures.
What happens if the NodeManager running on the same node as the
ApplicationMaster fails?
Does the ApplicationMaster fail as well?

Also, how is an ApplicationMaster failure handled with respect to its
(child) containers?
Do these containers fail too?
If yes, is there a way to assign these containers to the new
ApplicationMaster instance that might come up on some other node?



Re: Question about YARN NodeManager and ApplicationMaster failures

Posted by Navina Ramesh <nr...@linkedin.com.INVALID>.
@Steve: Are there any existing applications whose AM has the code to
rebuild its state on restart? I am curious because we are currently
trying to improve Apache Samza's behavior on NM restarts. We
occasionally run into orphaned containers, and I am wondering whether we
are not handling shutdown/failure properly.

Navina

On Thu, Mar 3, 2016 at 5:34 AM, Junping Du <jd...@hortonworks.com> wrote:

> With proper configuration, containers (including the AM) can keep running
> when the NM fails. Please check YARN-1336 for work-preserving NM restart.
> For AM failure (restart), after YARN-1489 (Work-preserving
> ApplicationMaster restart), the containers will not get killed when the AM
> fails (within the maximum attempts). However, as Steve mentioned, each
> application has to figure out how to wire the new AM attempt to the
> existing containers and sync state (and most of them haven't done it yet).
> The ongoing work for MAPREDUCE-6608 is an example.
>
>
> Thanks,
>
> Junping


-- 
Navina R.

Re: Question about YARN NodeManager and ApplicationMaster failures

Posted by Junping Du <jd...@hortonworks.com>.
With proper configuration, containers (including the AM) can keep running when the NM fails. Please check YARN-1336 for work-preserving NM restart.
For AM failure (restart), after YARN-1489 (Work-preserving ApplicationMaster restart), the containers will not get killed when the AM fails (within the maximum attempts). However, as Steve mentioned, each application has to figure out how to wire the new AM attempt to the existing containers and sync state (and most of them haven't done it yet). The ongoing work for MAPREDUCE-6608 is an example.
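
To make the "wire the new AM attempt to the existing containers" step concrete, here is a minimal sketch in Python. It is a toy model, not Hadoop code: the function name, the checkpoint dictionary and the task names are all hypothetical. It assumes the new attempt somehow learns which containers survived (in YARN I believe that is what getContainersFromPreviousAttempts() on the AM registration response is for, when the application was submitted with keep-containers enabled) and that the old attempt persisted its container-to-task assignments somewhere the new attempt can read.

```python
def reconcile_after_am_restart(surviving_container_ids,
                               checkpointed_assignments,
                               wanted_tasks):
    """Rebuild a restarted AM's view of its containers (toy model).

    surviving_container_ids: container ids still running after the AM failed
        (assumed to be reported to the new attempt at registration).
    checkpointed_assignments: container id -> task name, as persisted by the
        previous attempt (hypothetical checkpoint, not a YARN feature).
    wanted_tasks: task names the application still needs running.
    """
    surviving = set(surviving_container_ids)
    wanted = set(wanted_tasks)

    # Tasks we can keep: the checkpoint names a container that is still alive.
    recovered = {
        task: cid
        for cid, task in checkpointed_assignments.items()
        if cid in surviving and task in wanted
    }

    # Tasks that lost their container and need a fresh container request.
    to_request = sorted(wanted - set(recovered))

    # Live containers the new attempt cannot map to a wanted task. If nobody
    # releases these, they keep running with no AM tracking them.
    unaccounted = sorted(surviving - set(recovered.values()))

    return recovered, to_request, unaccounted


if __name__ == "__main__":
    recovered, to_request, unaccounted = reconcile_after_am_restart(
        surviving_container_ids=["c2", "c3", "c9"],
        checkpointed_assignments={"c1": "t1", "c2": "t2", "c3": "t3"},
        wanted_tasks=["t1", "t2", "t3"],
    )
    print(recovered, to_request, unaccounted)
```

The third return value is the interesting one for the orphaned-container question: any surviving container the new attempt cannot account for has to be released explicitly, otherwise it lingers.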


Thanks,

Junping
________________________________________
From: Steve Loughran <st...@hortonworks.com>
Sent: Thursday, March 03, 2016 1:16 PM
To: yarn-dev@hadoop.apache.org
Subject: Re: Question about YARN NodeManager and ApplicationMaster failures

> On 3 Mar 2016, at 12:58, Dustin Cote <dc...@cloudera.com> wrote:
>
> -dev since this is more of a user question
>
> The NodeManager is the parent for the application master, so any containers
> (including application master containers) that are running where the failed
> NodeManager is located will die.  If an application master fails, then a
> new one is created up to your limit (set by
> yarn.resourcemanager.am.max-attempts).  The other containers associated
> with the application master are supposed to continue on and pick up the
> newly started application master.


Only if you tell YARN to keep containers over restart and the AM has the code to rebuild its state. Most AMs don't do this (MR, Tez, Spark, etc.), as the state is hard to preserve and rebuild.

See YARN-896 for all the details of things related to long-lived services.

You can also put a reset window on AM failures: YARN-611.

Oh, and there's work-preserving NM restart, but that's another topic...



Re: Question about YARN NodeManager and ApplicationMaster failures

Posted by Steve Loughran <st...@hortonworks.com>.
> On 3 Mar 2016, at 12:58, Dustin Cote <dc...@cloudera.com> wrote:
> 
> -dev since this is more of a user question
> 
> The NodeManager is the parent for the application master, so any containers
> (including application master containers) that are running where the failed
> NodeManager is located will die.  If an application master fails, then a
> new one is created up to your limit (set by
> yarn.resourcemanager.am.max-attempts).  The other containers associated
> with the application master are supposed to continue on and pick up the
> newly started application master.  


Only if you tell YARN to keep containers over restart and the AM has the code to rebuild its state. Most AMs don't do this (MR, Tez, Spark, etc.), as the state is hard to preserve and rebuild.

See YARN-896 for all the details of things related to long-lived services.

You can also put a reset window on AM failures: YARN-611.

Oh, and there's work-preserving NM restart, but that's another topic...
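
As an illustration of how a max-attempts limit and a reset window interact, here is a small Python sketch. It is a toy model under my own assumptions, not the ResourceManager's actual accounting: the class and parameter names are made up, and the only idea taken from the thread is that AM failures count against a limit and that a reset window lets old failures stop counting.

```python
from collections import deque


class AttemptFailureTracker:
    """Toy model of AM retry accounting with an optional reset window."""

    def __init__(self, max_attempts, validity_interval_ms=None):
        # max_attempts plays the role of yarn.resourcemanager.am.max-attempts.
        self.max_attempts = max_attempts
        # Hypothetical name for the reset window; None means failures
        # are never forgotten.
        self.validity_interval_ms = validity_interval_ms
        self._failures = deque()

    def record_failure(self, now_ms):
        """Record an AM failure; return True if a new attempt may start."""
        self._failures.append(now_ms)
        if self.validity_interval_ms is not None:
            # Forget failures that fell out of the window.
            cutoff = now_ms - self.validity_interval_ms
            while self._failures and self._failures[0] <= cutoff:
                self._failures.popleft()
        return len(self._failures) < self.max_attempts


if __name__ == "__main__":
    tracker = AttemptFailureTracker(max_attempts=2, validity_interval_ms=60_000)
    for t in (0, 100_000, 120_000):
        print(t, tracker.record_failure(t))
```

Without a window, a long-lived service eventually uses up its attempts no matter how far apart the failures are. With a window, only failures that cluster together end the application.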



Re: Question about YARN NodeManager and ApplicationMaster failures

Posted by Dustin Cote <dc...@cloudera.com>.
-dev since this is more of a user question

The NodeManager is the parent for the application master, so any containers
(including application master containers) that are running where the failed
NodeManager is located will die.  If an application master fails, then a
new one is created up to your limit (set by
yarn.resourcemanager.am.max-attempts).  The other containers associated
with the application master are supposed to continue on and pick up the
newly started application master.  The resource manager takes care of the
bookkeeping needed to make this happen.  I'd suggest you have a look at the
series of blogs here
<http://blog.cloudera.com/blog/2015/09/untangling-apache-hadoop-yarn-part-1/>
for a more in-depth look at the mechanics.

-Dustin




-- 
Dustin Cote
Customer Operations Engineer
<http://www.cloudera.com>