Posted to mapreduce-user@hadoop.apache.org by 谢良 <xi...@xiaomi.com> on 2012/10/15 05:01:58 UTC

Question about namenode HA

Hi Todd and other HA experts,

I have two questions:

1) Why is the ZKFC a separate process? What was the primary design consideration behind not integrating the ZKFC features into the NameNode itself?

2) If I deploy CDH4.1 (which includes the QJM feature), since QJM can fence writers, can I safely just configure it like this?
        <name>dfs.ha.fencing.methods</name>
        <value>shell(/bin/true)</value>

Thanks,
Liang

Re: Question about namenode HA

Posted by Steve Loughran <st...@hortonworks.com>.
On 15 October 2012 04:01, 谢良 <xi...@xiaomi.com> wrote:

> Hi Todd and other HA experts,
>
> I have two questions:
>
> 1) Why is the ZKFC a separate process? What was the primary design
> consideration behind not integrating the ZKFC features into the NameNode itself?
>

I don't know the answer to that, except that I do know you are usually best
off monitoring the state of a process externally, since liveness is defined as
"doing useful work for external callers".

>
> 2) If I deploy CDH4.1 (which includes the QJM feature), since QJM can fence
> writers, can I safely just configure it like this?
>         <name>dfs.ha.fencing.methods</name>
>         <value>shell(/bin/true)</value>
>
>
I would never encourage anyone to skip fencing, as it's the only way to
reliably ensure that the far end that appears to be offline really is offline.
Fencing PSUs aren't that expensive; talking to an iLO management port on
the server via a separate 100 Mbit/s management network is a less satisfactory
option.

The risk here is that both NNs can talk to the DNs; if they emit
conflicting operations, you are in trouble. That situation could
theoretically arise if both NNs were visible to the DNs, but not to each
other. ZK quorum logic should recognise and handle this, and I have faith
in the mathematicians behind the proof of ZK's operations, but I'd
encourage you to spend the extra amount for proper fencing; the extra
hardware is minimal compared to the value of the data in a cluster.

-steve

Re: Question about namenode HA

Posted by Chao Shi <st...@live.com>.
Can we get rid of ZK completely? Since the JNs are like a simplified version
of ZK, it should be possible to use them for election.

I think it's pretty easy:
- Each JN exposes its latest heartbeat information via RPC (the active NN
heartbeats the JNs every 1 second)
- The ZKFC decides whether the current active NN is alive by querying (a
quorum of) JNs

Once the active NN is down, the standby NN may compete to become active. A
random delay of a few seconds may help. Thanks to QJM, it will not be a
correctness issue anyway.
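
A minimal sketch of the quorum liveness check proposed above (the function names, the staleness threshold, and the heartbeat bookkeeping are all illustrative; none of this is an HDFS API):

```python
import time

# Hypothetical illustration of the idea above: each JN reports the
# timestamp of the last heartbeat it received from the active NN, and
# the failover controller declares the NN dead only when a quorum of
# JNs agree the heartbeat is stale.

HEARTBEAT_INTERVAL = 1.0   # active NN heartbeats the JNs every 1 second
STALE_AFTER = 5.0          # grace period before a heartbeat counts as stale

def is_active_nn_alive(jn_heartbeats, now=None):
    """jn_heartbeats: list of last-heartbeat timestamps, one per JN."""
    now = now if now is not None else time.time()
    fresh = sum(1 for t in jn_heartbeats if now - t <= STALE_AFTER)
    quorum = len(jn_heartbeats) // 2 + 1
    return fresh >= quorum

# With 3 JNs, 2 fresh reports form a quorum:
now = 100.0
print(is_active_nn_alive([99.5, 99.0, 10.0], now))  # True: 2 of 3 fresh
print(is_active_nn_alive([10.0, 11.0, 99.0], now))  # False: only 1 fresh
```

The quorum requirement is what makes this safe against a single slow or partitioned JN, the same property QJM itself relies on for writes.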

On Tue, Oct 16, 2012 at 9:24 AM, Todd Lipcon <to...@cloudera.com> wrote:

> Hi Liang,
>
> Answers inline below.
>
> On Sun, Oct 14, 2012 at 8:01 PM, 谢良 <xi...@xiaomi.com> wrote:
>
>> Hi Todd and other HA experts,
>>
>> I have two questions:
>>
>> 1) Why is the ZKFC a separate process? What was the primary design
>> consideration behind not integrating the ZKFC features into the NameNode itself?
>>
>>
> There are a few reasons for this design choice:
>
> 1)  Like Steve said, it's easier to monitor a process from another
> process, rather than self-monitor. Consider, for example, what happens if
> the NN somehow gets into a deadlock. The process may still be alive, and a
> ZooKeeper thread would keep running, even though it is not successfully
> handling any operations. The ZKFC running in a separate process
> periodically pings the local NN via RPC to ensure that the RPC server is
> still working properly, not deadlocked, etc.
>
> 2) If the NN process itself crashes (e.g. a segfault due to bad RAM), the ZKFC
> will notice it quite quickly, and delete its own ZooKeeper node. If the NN
> were holding its own ZK session, you would have to wait for the full ZK
> session timeout to expire. So the external ZKFC results in a faster
> failover time for certain classes of failure.
>
>
>> 2) If I deploy CDH4.1 (which includes the QJM feature), since QJM can fence
>> writers, can I safely just configure it like this?
>>         <name>dfs.ha.fencing.methods</name>
>>         <value>shell(/bin/true)</value>
>>
>>
> Yes, this is safe. The design of the QuorumJournalManager ensures that
> multiple conflicting writers cannot corrupt your namespace in any way. You
> might still consider using sshfence ahead of that, with a short configured
> timeout -- this provides "read fencing". Otherwise the old NN could
> theoretically serve stale reads for a few seconds before it noticed that it
> lost its ZK lease. But it's definitely not critical -- the old NN will
> eventually do some kind of write and abort itself. So, I'd recommend
> /bin/true as the last configured method in your fencing list with QJM.
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Question about namenode HA

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Liang,

Answers inline below.

On Sun, Oct 14, 2012 at 8:01 PM, 谢良 <xi...@xiaomi.com> wrote:

> Hi Todd and other HA experts,
>
> I have two questions:
>
> 1) Why is the ZKFC a separate process? What was the primary design
> consideration behind not integrating the ZKFC features into the NameNode itself?
>
>
There are a few reasons for this design choice:

1)  Like Steve said, it's easier to monitor a process from another process,
rather than self-monitor. Consider, for example, what happens if the NN
somehow gets into a deadlock. The process may still be alive, and a
ZooKeeper thread would keep running, even though it is not successfully
handling any operations. The ZKFC running in a separate process
periodically pings the local NN via RPC to ensure that the RPC server is
still working properly, not deadlocked, etc.

2) If the NN process itself crashes (e.g. a segfault due to bad RAM), the ZKFC
will notice it quite quickly, and delete its own ZooKeeper node. If the NN
were holding its own ZK session, you would have to wait for the full ZK
session timeout to expire. So the external ZKFC results in a faster
failover time for certain classes of failure.
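
The external-monitor pattern described above can be sketched as follows. The RPC health check and the ZooKeeper node deletion are stood in by plain callables, so this is only an illustration of the watchdog pattern, not the actual ZKFC implementation:

```python
import time

def watchdog(ping, on_failure, interval=1.0, failures_allowed=3):
    """Call ping() periodically; after `failures_allowed` consecutive
    failures, call on_failure() once and stop. In ZKFC terms, ping()
    stands in for the health-check RPC to the local NN, and on_failure()
    for giving up the ephemeral ZooKeeper node so the standby can take
    over. Loops until the first sustained failure."""
    misses = 0
    while True:
        try:
            ping()           # stands in for the health-check RPC
            misses = 0       # any success resets the failure count
        except Exception:
            misses += 1
            if misses >= failures_allowed:
                on_failure() # stands in for deleting the ZK node
                return
        time.sleep(interval)

# Usage: a ping that always fails triggers failover after three misses.
def failing_ping():
    raise RuntimeError("NN not responding")

events = []
watchdog(failing_ping, lambda: events.append("failover"), interval=0.01)
print(events)  # ['failover']
```

Because the watchdog is a separate process in the real design, it keeps making these checks even when the monitored NN is deadlocked, which is exactly the failure mode self-monitoring cannot catch.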


> 2) If I deploy CDH4.1 (which includes the QJM feature), since QJM can fence
> writers, can I safely just configure it like this?
>         <name>dfs.ha.fencing.methods</name>
>         <value>shell(/bin/true)</value>
>
>
Yes, this is safe. The design of the QuorumJournalManager ensures that
multiple conflicting writers cannot corrupt your namespace in any way. You
might still consider using sshfence ahead of that, with a short configured
timeout -- this provides "read fencing". Otherwise the old NN could
theoretically serve stale reads for a few seconds before it noticed that it
lost its ZK lease. But it's definitely not critical -- the old NN will
eventually do some kind of write and abort itself. So, I'd recommend
/bin/true as the last configured method in your fencing list with QJM.
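
Putting that recommendation together, the fencing section of hdfs-site.xml would look something like the following; the 30-second connect timeout is an illustrative value, not a prescription:

    <property>
      <name>dfs.ha.fencing.methods</name>
      <value>
        sshfence
        shell(/bin/true)
      </value>
    </property>
    <property>
      <name>dfs.ha.fencing.ssh.connect-timeout</name>
      <value>30000</value>
    </property>

The methods are tried in order, so sshfence provides the fast read-fencing described above, and shell(/bin/true) guarantees failover proceeds even when the old NN's host is unreachable, which QJM makes safe for writes.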

Thanks
-Todd

--
Todd Lipcon
Software Engineer, Cloudera
