Posted to mapreduce-user@hadoop.apache.org by dlmarion <dl...@hotmail.com> on 2014/03/15 01:21:39 UTC

HA NN Failover question

I was doing some testing with HA NN today. I set up two NNs with active
failover (ZKFC) using sshfence. I tested that it's working on both NNs by
doing 'kill -9 <pid>' on the active NN. When I did this on the active node,
the standby would become the active and everything seemed to work. Next, I
logged onto the active NN and did a 'service network stop' to simulate a
NIC/network failure. The standby did not become the active in this scenario.
In fact, it remained in standby mode and complained in the log that it could
not communicate with (what was) the active NN. I was unable to find anything
relevant via searches on Google or in Jira. Does anyone have experience
successfully testing this? I'm hoping that it is just a configuration
problem.

 

FWIW, when the network was restarted on the active NN, it failed over almost
immediately.

 

Thanks,

 

Dave

Re: HA NN Failover question

Posted by Azuryy <az...@gmail.com>.

Which Hadoop version are you using?


Sent from my iPhone5s

> On March 15, 2014, at 9:29, dlmarion <dl...@hotmail.com> wrote:
> 
> Server 1: NN1 and ZKFC1
> Server 2: NN2 and ZKFC2
> Server 3: Journal1 and ZK1
> Server 4: Journal2 and ZK2
> Server 5: Journal3 and ZK3
> Server 6+: Datanode
>  
> All in the same rack. I would expect the ZKFC from the active name node server to lose its lock and the other ZKFC to tell the standby namenode that it should become active (I’m assuming that’s how it works).
>  
> - Dave
>  
> From: Juan Carlos [mailto:jucaf1@gmail.com] 
> Sent: Friday, March 14, 2014 9:12 PM
> To: user@hadoop.apache.org
> Subject: Re: HA NN Failover question
>  
> Hi Dave,
> How many ZooKeeper servers do you have, and where are they?
> 
> Juan Carlos Fernández Rodríguez
> 
> On 15/03/2014, at 01:21, dlmarion <dl...@hotmail.com> wrote:
> 
> I was doing some testing with HA NN today. I set up two NNs with active failover (ZKFC) using sshfence. I tested that it’s working on both NNs by doing ‘kill -9 <pid>’ on the active NN. When I did this on the active node, the standby would become the active and everything seemed to work. Next, I logged onto the active NN and did a ‘service network stop’ to simulate a NIC/network failure. The standby did not become the active in this scenario. In fact, it remained in standby mode and complained in the log that it could not communicate with (what was) the active NN. I was unable to find anything relevant via searches on Google or in Jira. Does anyone have experience successfully testing this? I’m hoping that it is just a configuration problem.
>  
> FWIW, when the network was restarted on the active NN, it failed over almost immediately.
>  
> Thanks,
>  
> Dave

RE: HA NN Failover question

Posted by dlmarion <dl...@hotmail.com>.

Server 1: NN1 and ZKFC1

Server 2: NN2 and ZKFC2

Server 3: Journal1 and ZK1

Server 4: Journal2 and ZK2

Server 5: Journal3 and ZK3

Server 6+: Datanode

 

All in the same rack. I would expect the ZKFC from the active name node server to lose its lock and the other ZKFC to tell the standby namenode that it should become active (I’m assuming that’s how it works).

 

- Dave
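
One explanation consistent with these symptoms, though nobody in the thread confirms it: before a standby NameNode is made active, the old active has to be fenced, and sshfence can only do that by logging in to the old active over SSH. With the network down on that host the SSH attempt cannot succeed, so the failover stalls until the host is reachable again. The Python sketch below is a simplified model of that ordering, not Hadoop's code; every function name in it is made up for illustration.

```python
from typing import Callable, List

# Hypothetical stand-ins for fencing methods. These are NOT Hadoop APIs.
FencingMethod = Callable[[bool], bool]


def sshfence(old_active_reachable: bool) -> bool:
    """Models sshfence: it must SSH into the old active, so it fails if that host is unreachable."""
    return old_active_reachable


def always_succeed(old_active_reachable: bool) -> bool:
    """Models a fallback such as shell(/bin/true), which reports success unconditionally."""
    return True


def try_failover(methods: List[FencingMethod], old_active_reachable: bool) -> str:
    """Return the standby's state after it wins the ZooKeeper election.

    Fencing methods are tried in order; the standby is promoted only if one succeeds.
    """
    for method in methods:
        if method(old_active_reachable):
            return "active"
    return "standby"


# 'kill -9' on the NameNode: the host is still reachable, so sshfence can run.
print(try_failover([sshfence], old_active_reachable=True))
# 'service network stop': SSH cannot reach the host, so fencing fails.
print(try_failover([sshfence], old_active_reachable=False))
# Same outage, with an always-succeeding method listed after sshfence.
print(try_failover([sshfence, always_succeed], old_active_reachable=False))
```

In hdfs-site.xml, dfs.ha.fencing.methods accepts a list of methods that are tried in order, and the HDFS HA documentation mentions shell(/bin/true) as a method that always reports success. Whether a fallback like that is acceptable depends on how much harm a still-running old active could do in your deployment.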

 

From: Juan Carlos [mailto:jucaf1@gmail.com] 
Sent: Friday, March 14, 2014 9:12 PM
To: user@hadoop.apache.org
Subject: Re: HA NN Failover question

 

Hi Dave,

How many ZooKeeper servers do you have, and where are they?


Juan Carlos Fernández Rodríguez


On 15/03/2014, at 01:21, dlmarion <dl...@hotmail.com> wrote:

I was doing some testing with HA NN today. I set up two NNs with active failover (ZKFC) using sshfence. I tested that it’s working on both NNs by doing ‘kill -9 <pid>’ on the active NN. When I did this on the active node, the standby would become the active and everything seemed to work. Next, I logged onto the active NN and did a ‘service network stop’ to simulate a NIC/network failure. The standby did not become the active in this scenario. In fact, it remained in standby mode and complained in the log that it could not communicate with (what was) the active NN. I was unable to find anything relevant via searches on Google or in Jira. Does anyone have experience successfully testing this? I’m hoping that it is just a configuration problem.

 

FWIW, when the network was restarted on the active NN, it failed over almost immediately.

 

Thanks,

 

Dave

Re: HA NN Failover question

Posted by Juan Carlos <ju...@gmail.com>.

Hi Dave,
How many ZooKeeper servers do you have, and where are they?

Juan Carlos Fernández Rodríguez

> On 15/03/2014, at 01:21, dlmarion <dl...@hotmail.com> wrote:
> 
> I was doing some testing with HA NN today. I set up two NNs with active failover (ZKFC) using sshfence. I tested that it’s working on both NNs by doing ‘kill -9 <pid>’ on the active NN. When I did this on the active node, the standby would become the active and everything seemed to work. Next, I logged onto the active NN and did a ‘service network stop’ to simulate a NIC/network failure. The standby did not become the active in this scenario. In fact, it remained in standby mode and complained in the log that it could not communicate with (what was) the active NN. I was unable to find anything relevant via searches on Google or in Jira. Does anyone have experience successfully testing this? I’m hoping that it is just a configuration problem.
>  
> FWIW, when the network was restarted on the active NN, it failed over almost immediately.
>  
> Thanks,
>  
> Dave
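
The number of ZooKeeper servers matters because an ensemble stays available only while a strict majority of its members is reachable. A minimal sketch of that arithmetic in Python (has_quorum is a made-up helper, not a ZooKeeper API):

```python
def has_quorum(ensemble_size: int, reachable: int) -> bool:
    """Return True if the reachable servers form a strict majority of the ensemble."""
    return reachable > ensemble_size // 2


# A three-server ensemble tolerates the loss of one member, but not two.
for reachable in (3, 2, 1):
    print(reachable, has_quorum(3, reachable))
```

With a three-server ensemble on hosts separate from the NameNodes, as in the layout Dave gives elsewhere in this thread, taking one NameNode host off the network should leave all three ZooKeeper servers able to reach each other, so quorum loss alone would probably not explain a stalled failover.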

RE: HA NN Failover question

Posted by dlmarion <dl...@hotmail.com>.

I don't think so. NN1 and ZKFC1 are on physically separate machines from
NN2 and ZKFC2.

From: Chris Mawata [mailto:chris.mawata@gmail.com] 
Sent: Friday, March 14, 2014 9:05 PM
To: user@hadoop.apache.org
Subject: Re: HA NN Failover question

Could you have also prevented the standby from communicating with ZooKeeper?

Chris

On Mar 14, 2014 8:22 PM, "dlmarion" <dl...@hotmail.com> wrote:

I was doing some testing with HA NN today. I set up two NNs with active
failover (ZKFC) using sshfence. I tested that it's working on both NNs by
doing 'kill -9 <pid>' on the active NN. When I did this on the active node,
the standby would become the active and everything seemed to work. Next, I
logged onto the active NN and did a 'service network stop' to simulate a
NIC/network failure. The standby did not become the active in this scenario.
In fact, it remained in standby mode and complained in the log that it could
not communicate with (what was) the active NN. I was unable to find anything
relevant via searches on Google or in Jira. Does anyone have experience
successfully testing this? I'm hoping that it is just a configuration
problem.

FWIW, when the network was restarted on the active NN, it failed over almost
immediately.

Thanks,

Dave

Re: HA NN Failover question

Posted by Chris Mawata <ch...@gmail.com>.

Could you have also prevented the standby from communicating with
ZooKeeper?
Chris
On Mar 14, 2014 8:22 PM, "dlmarion" <dl...@hotmail.com> wrote:

> I was doing some testing with HA NN today. I set up two NNs with active
> failover (ZKFC) using sshfence. I tested that it's working on both NNs by
> doing 'kill -9 <pid>' on the active NN. When I did this on the active node,
> the standby would become the active and everything seemed to work. Next, I
> logged onto the active NN and did a 'service network stop' to simulate a
> NIC/network failure. The standby did not become the active in this
> scenario. In fact, it remained in standby mode and complained in the log
> that it could not communicate with (what was) the active NN. I was unable
> to find anything relevant via searches on Google or in Jira. Does anyone have
> experience successfully testing this? I'm hoping that it is just a
> configuration problem.
>
>
>
> FWIW, when the network was restarted on the active NN, it failed over
> almost immediately.
>
>
>
> Thanks,
>
>
>
> Dave
>
