Posted to common-user@hadoop.apache.org by Matt Narrell <ma...@gmail.com> on 2015/04/23 18:41:41 UTC

YARN HA Active ResourceManager failover when machine is stopped

I’m using Hadoop 2.6.0 from HDP 2.2.4 installed via Ambari 2.0

I’m testing the YARN HA ResourceManager failover. If I STOP the active ResourceManager (shut the machine off), the standby ResourceManager is elected to active, but the NodeManagers do not register themselves with the newly elected active ResourceManager. If I restart the machine (but DO NOT resume the YARN services) the NodeManagers register with the newly elected ResourceManager and my jobs resume. I assume I have some bad configuration, as this produces a SPOF, and is not HA in the sense I’m expecting.

Thanks,
mn

Re: YARN HA Active ResourceManager failover when machine is stopped

Posted by Matt Narrell <ma...@gmail.com>.
Yes, it looks like we’re running up against YARN-2578.  That’s very unfortunate.

Thanks for everyone’s investigation and input.

mn

> On Apr 26, 2015, at 10:38 PM, Rohith Sharma K S <ro...@huawei.com> wrote:
> 
> Hi
>  
> I have seen this issue in my cluster, without HA configured, when the process was halted.  I assume your scenario has a similar issue when the active RM machine is shut down abruptly.  Maybe you can verify by taking a thread dump of the NM and comparing it with the JIRAs below.
>  
> The open JIRAs in the community regarding this problem are:
> https://issues.apache.org/jira/i#browse/YARN-1061 (Without HA)
> https://issues.apache.org/jira/i#browse/YARN-2578 (With HA)
>  
>  
> Thanks & Regards
> Rohith Sharma K S
>  
> From: Matt Narrell [mailto:matt.narrell@gmail.com] 
> Sent: 24 April 2015 23:28
> To: user@hadoop.apache.org
> Subject: Re: YARN HA Active ResourceManager failover when machine is stopped
>  
> Another observation: when the VMs are halted, it seems the NodeManagers do not consider this a scenario in which to round-robin among the configured ResourceManagers.  Is there some timeout I’ve missed that instructs the NodeManagers to round-robin when the machine is not responding (to distinguish that from a network blip)?
>  
> mn
>  
> On Apr 24, 2015, at 1:50 AM, Drake민영근 <drake.min@nexr.com> wrote:
>  
> Hi, Matt
>  
> The second log file looks like a NodeManager log, not the standby ResourceManager's.
>  
> Thanks.
> 
> Drake 민영근 Ph.D
> kt NexR
>  
> On Fri, Apr 24, 2015 at 11:39 AM, Matt Narrell <matt.narrell@gmail.com> wrote:
> Active ResourceManager:  http://pastebin.com/hE0ppmnb
> Standby ResourceManager: http://pastebin.com/DB8VjHqA
>  
> Oppressively chatty and not much valuable info contained therein.
>  
>  
> On Apr 23, 2015, at 4:25 PM, Vinod Kumar Vavilapalli <vinodkv@hortonworks.com> wrote:
>  
> I have run into this offline with someone else too but couldn't root-cause it.
>  
> Will you be able to share your active/standby ResourceManager logs via pastebin or something?
>  
> +Vinod
>  
> On Apr 23, 2015, at 9:41 AM, Matt Narrell <matt.narrell@gmail.com> wrote:
> 
> 
> I’m using Hadoop 2.6.0 from HDP 2.2.4 installed via Ambari 2.0
>  
> I’m testing the YARN HA ResourceManager failover. If I STOP the active ResourceManager (shut the machine off), the standby ResourceManager is elected to active, but the NodeManagers do not register themselves with the newly elected active ResourceManager. If I restart the machine (but DO NOT resume the YARN services) the NodeManagers register with the newly elected ResourceManager and my jobs resume. I assume I have some bad configuration, as this produces a SPOF, and is not HA in the sense I’m expecting.
>  
> Thanks,
> mn


RE: YARN HA Active ResourceManager failover when machine is stopped

Posted by Rohith Sharma K S <ro...@huawei.com>.
Hi

I have seen this issue in my cluster, without HA configured, when the process was halted.  I assume your scenario has a similar issue when the active RM machine is shut down abruptly.  Maybe you can verify by taking a thread dump of the NM and comparing it with the JIRAs below.

The open JIRAs in the community regarding this problem are:
https://issues.apache.org/jira/i#browse/YARN-1061 (Without HA)
https://issues.apache.org/jira/i#browse/YARN-2578 (With HA)


Thanks & Regards
Rohith Sharma K S
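
A minimal sketch of the thread-dump check suggested above, assuming a NodeManager thread dump has already been captured to a text file (for example with the JDK's jstack tool). The function name threads_in_frame and the choice of java.net.SocketInputStream.socketRead0 as the frame to search for are illustrative assumptions, not something taken from the JIRAs; the idea is only to list threads that appear to be sitting in a blocking socket read.

```python
from typing import List


def threads_in_frame(dump: str, frame: str) -> List[str]:
    """Return the names of threads whose stack trace mentions `frame`.

    Assumes the usual HotSpot thread-dump layout: each thread starts with a
    header line of the form  "thread name" ...  and its stack frames follow
    on indented lines beginning with "at ".
    """
    matches: List[str] = []
    current = None
    for line in dump.splitlines():
        if line.startswith('"'):
            # Header line: the thread name is the text between the first quotes.
            current = line.split('"')[1]
        elif current is not None and line.strip().startswith("at ") and frame in line:
            if current not in matches:
                matches.append(current)
    return matches
```

For example, threads_in_frame(open("nm.dump").read(), "java.net.SocketInputStream.socketRead0") would list the threads worth comparing against what the two JIRAs describe (nm.dump is a hypothetical file name).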

From: Matt Narrell [mailto:matt.narrell@gmail.com]
Sent: 24 April 2015 23:28
To: user@hadoop.apache.org
Subject: Re: YARN HA Active ResourceManager failover when machine is stopped

Another observation: when the VMs are halted, it seems the NodeManagers do not consider this a scenario in which to round-robin among the configured ResourceManagers.  Is there some timeout I’ve missed that instructs the NodeManagers to round-robin when the machine is not responding (to distinguish that from a network blip)?

mn

On Apr 24, 2015, at 1:50 AM, Drake민영근 <dr...@nexr.com> wrote:

Hi, Matt

The second log file looks like a NodeManager log, not the standby ResourceManager's.

Thanks.

Drake 민영근 Ph.D
kt NexR

On Fri, Apr 24, 2015 at 11:39 AM, Matt Narrell <ma...@gmail.com> wrote:
Active ResourceManager:  http://pastebin.com/hE0ppmnb
Standby ResourceManager: http://pastebin.com/DB8VjHqA

Oppressively chatty and not much valuable info contained therein.


On Apr 23, 2015, at 4:25 PM, Vinod Kumar Vavilapalli <vi...@hortonworks.com> wrote:

I have run into this offline with someone else too but couldn't root-cause it.

Will you be able to share your active/standby ResourceManager logs via pastebin or something?

+Vinod

On Apr 23, 2015, at 9:41 AM, Matt Narrell <ma...@gmail.com> wrote:


I’m using Hadoop 2.6.0 from HDP 2.2.4 installed via Ambari 2.0

I’m testing the YARN HA ResourceManager failover. If I STOP the active ResourceManager (shut the machine off), the standby ResourceManager is elected to active, but the NodeManagers do not register themselves with the newly elected active ResourceManager. If I restart the machine (but DO NOT resume the YARN services) the NodeManagers register with the newly elected ResourceManager and my jobs resume. I assume I have some bad configuration, as this produces a SPOF, and is not HA in the sense I’m expecting.

Thanks,
mn





Re: YARN HA Active ResourceManager failover when machine is stopped

Posted by Matt Narrell <ma...@gmail.com>.
Another observation: when the VMs are halted, it seems the NodeManagers do not consider this a scenario in which to round-robin among the configured ResourceManagers.  Is there some timeout I’ve missed that instructs the NodeManagers to round-robin when the machine is not responding (to distinguish that from a network blip)?

mn
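
The distinction raised here can be sketched in a few lines of Python. When only the process is stopped, the host's TCP stack typically refuses the connection at once; when the whole machine is off, nothing answers, so a client without a connect timeout can wait a long time before moving on. The sketch below illustrates round-robin with a per-attempt timeout; it is not YARN's actual failover code, and find_active, default_connect, the host names and port 8031 are all hypothetical.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

Endpoint = Tuple[str, int]


def default_connect(endpoint: Endpoint, timeout: float) -> None:
    """Open and immediately close a TCP connection; raises OSError on failure."""
    with socket.create_connection(endpoint, timeout=timeout):
        pass


def find_active(
    endpoints: Sequence[Endpoint],
    timeout: float = 5.0,
    max_rounds: int = 3,
    connect: Callable[[Endpoint, float], None] = default_connect,
) -> Optional[Endpoint]:
    """Try each endpoint in turn, for up to max_rounds passes over the list.

    Every failure (refused, timed out, no route to host) is an OSError and is
    treated the same way: move on to the next endpoint.
    """
    for _ in range(max_rounds):
        for endpoint in endpoints:
            try:
                connect(endpoint, timeout)
                return endpoint
            except OSError:
                continue
    return None


if __name__ == "__main__":
    # Hypothetical ResourceManager hosts; replace with real ones to try it.
    print(find_active([("rm1.example.com", 8031), ("rm2.example.com", 8031)]))
```

The connect parameter exists only so the loop can be exercised without a network; with the default, each attempt is bounded by timeout, which is the behaviour the question above is asking for.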

> On Apr 24, 2015, at 1:50 AM, Drake민영근 <dr...@nexr.com> wrote:
> 
> Hi, Matt
> 
> The second log file looks like a NodeManager log, not the standby ResourceManager's.
> 
> Thanks.
> 
> Drake 민영근 Ph.D
> kt NexR
> 
> On Fri, Apr 24, 2015 at 11:39 AM, Matt Narrell <matt.narrell@gmail.com> wrote:
> Active ResourceManager:  http://pastebin.com/hE0ppmnb
> Standby ResourceManager: http://pastebin.com/DB8VjHqA
> 
> Oppressively chatty and not much valuable info contained therein.
> 
> 
>> On Apr 23, 2015, at 4:25 PM, Vinod Kumar Vavilapalli <vinodkv@hortonworks.com> wrote:
>> 
>> I have run into this offline with someone else too but couldn't root-cause it.
>> 
>> Will you be able to share your active/standby ResourceManager logs via pastebin or something?
>> 
>> +Vinod
>> 
>> On Apr 23, 2015, at 9:41 AM, Matt Narrell <matt.narrell@gmail.com> wrote:
>> 
>>> I’m using Hadoop 2.6.0 from HDP 2.2.4 installed via Ambari 2.0
>>> 
>>> I’m testing the YARN HA ResourceManager failover. If I STOP the active ResourceManager (shut the machine off), the standby ResourceManager is elected to active, but the NodeManagers do not register themselves with the newly elected active ResourceManager. If I restart the machine (but DO NOT resume the YARN services) the NodeManagers register with the newly elected ResourceManager and my jobs resume. I assume I have some bad configuration, as this produces a SPOF, and is not HA in the sense I’m expecting.
>>> 
>>> Thanks,
>>> mn
>> 
> 
> 


Re: YARN HA Active ResourceManager failover when machine is stopped

Posted by Matt Narrell <ma...@gmail.com>.
Ah, yes.  Ok please see below:

Scenario one: Stop the active ResourceManager process (leaving the VM running)
Active ResourceManager:  https://gist.github.com/mnarrell/157c8e1b82d40541cd88
Standby ResourceManager: https://gist.github.com/mnarrell/b6ad01d2f4b900b42e6d

Scenario two: Shut down the VM ($ shutdown -h now)
Active ResourceManager:  https://gist.github.com/mnarrell/95b35cc8be0ed817cf1b
Standby ResourceManager: https://gist.github.com/mnarrell/68a778e0d0d213e1b2cf

Here is the yarn-site.xml:
https://gist.github.com/mnarrell/115a3eff03bbef947a57

We suspect this could be related to fencing.  We speculate that when the machine is shut down, the NodeManagers do not treat the NoRouteToHost exception as a failover situation.  We have a pretty vanilla YARN configuration, mostly Ambari defaults, and have compared it to the YARN ResourceManager HA documentation from Apache and Hortonworks.

mn
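
As a rough illustration of the speculation above, this Python sketch separates the failure modes a client might see when the ResourceManager host goes away. It is not how the NodeManager is implemented; the function names and category labels are made up for the example. The point is that a client which fails over only on "connection refused" would keep retrying a powered-off host.

```python
import errno
import socket


def classify_connect_error(exc: OSError) -> str:
    """Map a connection failure to a coarse, illustrative category."""
    if isinstance(exc, ConnectionRefusedError):
        # Host is up but nothing is listening: the process was stopped.
        return "process-down"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        # Nothing answered within the timeout: host off or packets dropped.
        return "host-silent"
    if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
        # The OS reports that no route to the host or network exists.
        return "no-route"
    return "other"


def should_fail_over(exc: OSError) -> bool:
    """Treat all three 'peer is gone' categories as reasons to try the next RM."""
    return classify_connect_error(exc) in {"process-down", "host-silent", "no-route"}
```

Whether a given client treats the "no-route" and "host-silent" cases as failover triggers, and how long it waits before deciding, is exactly what differs between the two scenarios listed above.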

> On Apr 24, 2015, at 1:50 AM, Drake민영근 <dr...@nexr.com> wrote:
> 
> Hi, Matt
> 
> The second log file looks like a NodeManager log, not the standby ResourceManager's.
> 
> Thanks.
> 
> Drake 민영근 Ph.D
> kt NexR
> 
> On Fri, Apr 24, 2015 at 11:39 AM, Matt Narrell <matt.narrell@gmail.com> wrote:
> Active ResourceManager:  http://pastebin.com/hE0ppmnb
> Standby ResourceManager: http://pastebin.com/DB8VjHqA
> 
> Oppressively chatty and not much valuable info contained therein.
> 
> 
>> On Apr 23, 2015, at 4:25 PM, Vinod Kumar Vavilapalli <vinodkv@hortonworks.com> wrote:
>> 
>> I have run into this offline with someone else too but couldn't root-cause it.
>> 
>> Will you be able to share your active/standby ResourceManager logs via pastebin or something?
>> 
>> +Vinod
>> 
>> On Apr 23, 2015, at 9:41 AM, Matt Narrell <matt.narrell@gmail.com> wrote:
>> 
>>> I’m using Hadoop 2.6.0 from HDP 2.2.4 installed via Ambari 2.0
>>> 
>>> I’m testing the YARN HA ResourceManager failover. If I STOP the active ResourceManager (shut the machine off), the standby ResourceManager is elected to active, but the NodeManagers do not register themselves with the newly elected active ResourceManager. If I restart the machine (but DO NOT resume the YARN services) the NodeManagers register with the newly elected ResourceManager and my jobs resume. I assume I have some bad configuration, as this produces a SPOF, and is not HA in the sense I’m expecting.
>>> 
>>> Thanks,
>>> mn
>> 
> 
> 


Re: YARN HA Active ResourceManager failover when machine is stopped

Posted by Matt Narrell <ma...@gmail.com>.
Another observation: when the VMs are halted, it seems the NodeManagers do not consider this a reason to round-robin among the configured ResourceManagers. Is there a timeout I’ve missed that tells the NodeManagers to round-robin when the machine is not responding (as opposed to a brief network blip)?
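
One way to see which HA, connect-retry and client-failover settings are actually present in a yarn-site.xml is to filter its properties by name prefix. The sketch below assumes the standard Hadoop <configuration>/<property> layout. The function name failover_settings and the sample values are illustrative and are not taken from the gist above. Whether any of these settings changes the halted-machine behaviour is not established in this thread.

```python
import xml.etree.ElementTree as ET

# Property-name prefixes that cover RM HA, RM connect retries and client
# failover in yarn-site.xml.
PREFIXES = (
    "yarn.resourcemanager.ha.",
    "yarn.resourcemanager.connect.",
    "yarn.client.failover-",
)


def failover_settings(yarn_site_xml):
    """Return {name: value} for properties whose name starts with PREFIXES."""
    root = ET.fromstring(yarn_site_xml)
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name.startswith(PREFIXES):
            found[name] = (prop.findtext("value") or "").strip()
    return found


# Illustrative input only; these values are made up for the example.
SAMPLE = """<configuration>
  <property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
  <property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm1,rm2</value></property>
  <property><name>yarn.resourcemanager.connect.retry-interval.ms</name><value>30000</value></property>
  <property><name>yarn.nodemanager.local-dirs</name><value>/tmp/nm</value></property>
</configuration>"""

if __name__ == "__main__":
    for name, value in sorted(failover_settings(SAMPLE).items()):
        print(name, "=", value)
```

Running the same function over the real file on a NodeManager host and on each ResourceManager host makes it easy to compare the effective values side by side.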

mn

> On Apr 24, 2015, at 1:50 AM, Drake민영근 <dr...@nexr.com> wrote:
> 
> Hi, Matt
> 
> The second log file looks like node manager's log, not the standby resource manager.
> 
> Thanks.
> 
> Drake 민영근 Ph.D
> kt NexR
> 
> On Fri, Apr 24, 2015 at 11:39 AM, Matt Narrell <matt.narrell@gmail.com <ma...@gmail.com>> wrote:
> Active ResourceManager:  http://pastebin.com/hE0ppmnb <http://pastebin.com/hE0ppmnb>
> Standby ResourceManager: http://pastebin.com/DB8VjHqA <http://pastebin.com/DB8VjHqA>
> 
> Oppressively chatty and not much valuable info contained therein.
> 
> 
>> On Apr 23, 2015, at 4:25 PM, Vinod Kumar Vavilapalli <vinodkv@hortonworks.com <ma...@hortonworks.com>> wrote:
>> 
>> I have run into this offline with someone else too but couldn't root-cause it.
>> 
>> Will you be able to share your active/standby ResourceManager logs via pastebin or something?
>> 
>> +Vinod
>> 
>> On Apr 23, 2015, at 9:41 AM, Matt Narrell <matt.narrell@gmail.com <ma...@gmail.com>> wrote:
>> 
>>> I’m using Hadoop 2.6.0 from HDP 2.2.4 installed via Ambari 2.0
>>> 
>>> I’m testing the YARN HA ResourceManager failover. If I STOP the active ResourceManager (shut the machine off), the standby ResourceManager is elected to active, but the NodeManagers do not register themselves with the newly elected active ResourceManager. If I restart the machine (but DO NOT resume the YARN services) the NodeManagers register with the newly elected ResourceManager and my jobs resume. I assume I have some bad configuration, as this produces a SPOF, and is not HA in the sense I’m expecting.
>>> 
>>> Thanks,
>>> mn
>> 
> 
> 


Re: YARN HA Active ResourceManager failover when machine is stopped

Posted by Drake민영근 <dr...@nexr.com>.
Hi, Matt

The second log file looks like a NodeManager log, not the standby
ResourceManager's.

Thanks.

Drake 민영근 Ph.D
kt NexR

On Fri, Apr 24, 2015 at 11:39 AM, Matt Narrell <ma...@gmail.com>
wrote:

> Active ResourceManager:  http://pastebin.com/hE0ppmnb
> Standby ResourceManager: http://pastebin.com/DB8VjHqA
>
> Oppressively chatty and not much valuable info contained therein.
>
>
> On Apr 23, 2015, at 4:25 PM, Vinod Kumar Vavilapalli <
> vinodkv@hortonworks.com> wrote:
>
>  I have run into this offline with someone else too but couldn't
> root-cause it.
>
>  Will you be able to share your active/standby ResourceManager logs via
> pastebin or something?
>
>  +Vinod
>
>  On Apr 23, 2015, at 9:41 AM, Matt Narrell <ma...@gmail.com> wrote:
>
>  I’m using Hadoop 2.6.0 from HDP 2.2.4 installed via Ambari 2.0
>
>  I’m testing the YARN HA ResourceManager failover. If I STOP the active
> ResourceManager (shut the machine off), the standby ResourceManager is
> elected to active, but the NodeManagers do not register themselves with the
> newly elected active ResourceManager. If I restart the machine (but DO NOT
> resume the YARN services) the NodeManagers register with the newly elected
> ResourceManager and my jobs resume. I assume I have some bad configuration,
> as this produces a SPOF, and is not HA in the sense I’m expecting.
>
>  Thanks,
> mn
>
>
>
>

Re: YARN HA Active ResourceManager failover when machine is stopped

Posted by Matt Narrell <ma...@gmail.com>.
Active ResourceManager:  http://pastebin.com/hE0ppmnb
Standby ResourceManager: http://pastebin.com/DB8VjHqA

Oppressively chatty and not much valuable info contained therein.


> On Apr 23, 2015, at 4:25 PM, Vinod Kumar Vavilapalli <vi...@hortonworks.com> wrote:
> 
> I have run into this offline with someone else too but couldn't root-cause it.
> 
> Will you be able to share your active/standby ResourceManager logs via pastebin or something?
> 
> +Vinod
> 
> On Apr 23, 2015, at 9:41 AM, Matt Narrell <matt.narrell@gmail.com <ma...@gmail.com>> wrote:
> 
>> I’m using Hadoop 2.6.0 from HDP 2.2.4 installed via Ambari 2.0
>> 
>> I’m testing the YARN HA ResourceManager failover. If I STOP the active ResourceManager (shut the machine off), the standby ResourceManager is elected to active, but the NodeManagers do not register themselves with the newly elected active ResourceManager. If I restart the machine (but DO NOT resume the YARN services) the NodeManagers register with the newly elected ResourceManager and my jobs resume. I assume I have some bad configuration, as this produces a SPOF, and is not HA in the sense I’m expecting.
>> 
>> Thanks,
>> mn
> 


Re: YARN HA Active ResourceManager failover when machine is stopped

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.
I have run into this offline with someone else too but couldn't root-cause it.

Will you be able to share your active/standby ResourceManager logs via pastebin or something?

+Vinod

On Apr 23, 2015, at 9:41 AM, Matt Narrell <ma...@gmail.com> wrote:

I’m using Hadoop 2.6.0 from HDP 2.2.4 installed via Ambari 2.0

I’m testing the YARN HA ResourceManager failover. If I STOP the active ResourceManager (shut the machine off), the standby ResourceManager is elected to active, but the NodeManagers do not register themselves with the newly elected active ResourceManager. If I restart the machine (but DO NOT resume the YARN services) the NodeManagers register with the newly elected ResourceManager and my jobs resume. I assume I have some bad configuration, as this produces a SPOF, and is not HA in the sense I’m expecting.

Thanks,
mn

