Posted to hdfs-dev@hadoop.apache.org by Todd Lipcon <to...@cloudera.com> on 2012/09/19 21:53:08 UTC

Heads up: merge for QJM branch soon

Hi all,

Work has been progressing steadily for the last several months on the
HDFS-3077 (QuorumJournalManager) branch. The branch is now "feature
complete", and I believe ready for a merge soon now. Here is a brief
overview of some salient points:

* The QJM can be configured to act as a shared journal directory much
the same way that the BKJM can. It supports all of the same operations
as the other JMs in trunk, and fulfills all of the stated design goals
from the design document.
* Documentation has been added to the HA guide to explain how to set
up the QJM as a shared edits mechanism (a minimal config sketch
follows this list).
* Metrics have been added to both the NameNode (client side) and
JournalNode (server side). These metrics should be sufficient to
detect potential issues both in terms of availability and in operation
latency.
* Security is fully implemented with the normal mechanisms.
* The branch has been swept for findbugs, javac warnings, license
headers, etc, and should be "good to go" by those standards. It is
also fully up-to-date with trunk.
* The design doc on HDFS-3077 has been updated to reflect the final
state of the implementation.
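
For anyone who wants to try this out, configuration boils down to
pointing the shared edits directory at a quorum URI. The following is
a minimal sketch, with placeholder hostnames, paths, and journal ID;
see the HA guide in the branch for the full set of options:

  <!-- hdfs-site.xml: write shared edits to a quorum of three
       JournalNodes. Hosts and the "mycluster" journal ID below are
       placeholders. -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
  </property>
  <!-- Local directory where each JournalNode stores its copy of the
       edit log segments. -->
  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/data/1/dfs/jn</value>
  </property>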

As this is a critical part of NN metadata storage, I imagine people
will be interested to hear about the test status as well:

* Unit/functional test coverage is pretty high. As a rough measure,
there are ~2300 lines of test code (via sloccount, i.e. not counting
comments, etc.) vs 3300 lines of non-test code. The test coverage of
the new package is as follows: 92% coverage of client code, 86%
coverage of server code. The uncovered areas are mostly
assertion/sanity checks that never fail, and security code which can't
be tested by unit tests.
* In terms of state-space coverage, there is one randomized stress
test which injects faults randomly. This test has caught almost every
bug in the design or implementation, and alone covers ~80% of the
code. Since it is randomized, running it repeatedly covers more areas
of the state-space -- I've written a small MR job which runs it in
parallel on a Hadoop cluster, and have run it for many slot-years.
(Actually, this test case uncovered an unrelated Jetty bug that has
been hitting us for a couple years now!)
* In terms of actual cluster testing:
** We have done cluster testing of a small secure HA cluster using QJM
for shared edits. This covers the security code which cannot be
covered by the automated tests.
** We have been running a QJM-based HA setup on a 100-node test
cluster for several weeks, with no new issues in quite some time. This
cluster runs a mixed QA workload -- e.g. Hive benchmarks,
teragen/terasort, GridMix, etc.
** We have tested failover/failback in both small and large clusters
(a sketch of the manual failover exercise follows this list).
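
For anyone repeating the failover testing, the manual exercise is
roughly the following, assuming an HA pair with service IDs nn1 and
nn2 as in the HA guide (substitute your own service IDs):

  # Gracefully fail over from nn1 to nn2, confirm the roles swapped,
  # then fail back.
  hdfs haadmin -failover nn1 nn2
  hdfs haadmin -getServiceState nn1   # expect: standby
  hdfs haadmin -getServiceState nn2   # expect: active
  hdfs haadmin -failover nn2 nn1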

In terms of performance:

I collected the logs from the above-mentioned 100-node test cluster,
and looked at the "SyncTimes" reported by the periodic FSEditLog
statistics printout. The active NN in this cluster is configured to
write to two local disks, and write shared edits to a
QuorumJournalManager. The QJM is configured with three JournalNodes,
all in the same rack as the active NN, and one on the same machine as
the active NN. The average sync times are: QJM: 6.6ms, Local disk 1:
7.1ms, Local disk 2: 5.7ms. The maximum times seen in the log are:
QJM: 19.8ms, Local 1: 30.8ms, Local 2: 22.8ms. This shows that the
quorum behavior achieves the goal of smoothing out latencies: with
three JNs, each sync only waits for the fastest two acknowledgments,
so it can proceed even if one of the underlying disks is temporarily
slow.
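
If anyone wants to pull the same numbers out of their own NN logs,
something along these lines works. Treat it as a sketch: the exact
layout of the statistics line, and which journal each figure after
"SyncTimes(ms):" corresponds to, should be verified against your own
log first:

  # Average the first journal's value from each periodic
  # "SyncTimes(ms):" statistics line in the NN log.
  grep 'SyncTimes(ms):' hadoop-hdfs-namenode-*.log |
    awk '{ for (i = 1; i <= NF; i++)
             if ($i == "SyncTimes(ms):") { sum += $(i+1); n++ } }
         END { if (n) printf "avg sync: %.1f ms over %d samples\n", sum / n, n }'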

I also ran some manual tests by pausing one of the JNs in a loop:

  # Let the JN run for 100ms, then freeze it for 500ms, repeatedly.
  # $PID_OF_JN is the process ID of one JournalNode.
  while true; do
    sleep 0.1; kill -STOP $PID_OF_JN
    sleep 0.5; kill -CONT $PID_OF_JN
  done

This allows the JN to run only 100ms out of every 600ms. I applied a
client load to the NN and verified that the NN's operation latency was
unaffected even though one of the JNs was slowly falling behind.

In terms of risk assessment for the merge:

- This feature is entirely optional. The changes to existing code in
the branch are fairly minimal improvements to the support for
pluggable JournalManagers. We have been merging these changes into our
CDH nightly tree and performed all of our usual daily QA and
integration against these builds, so I feel confident that they do not
introduce any new risk.
- Even when the feature is enabled for shared edits, users will likely
continue to log edits to locally mounted FileJournalManagers in
addition to the shared QJM (see the config sketch after this list).
So, if there is any bug in the QJM, the system metadata is not at
risk.
- Overall, we have taken a conservative approach and favored
durability/correctness over availability. For example, we have many
extra sanity checks and assertions to check our assumptions, and if
any fail, we will abort rather than continue on with a risk of
data-loss.
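
Concretely, the second point above just means keeping local edits
directories configured alongside the shared quorum URI; the NN then
writes every batch of edits to all of them. A minimal sketch, with
placeholder paths:

  <!-- Local edits directories (FileJournalManager), kept in addition
       to dfs.namenode.shared.edits.dir pointing at the QJM. -->
  <property>
    <name>dfs.namenode.edits.dir</name>
    <value>file:///data/1/dfs/nn,file:///data/2/dfs/nn</value>
  </property>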

So, in summary, I think the branch is very nearly ready to be merged
into trunk. I will continue to perform stress tests, but at this point
I am not aware of any deficiencies which would be considered a
blocker. If anyone has any questions about the code, design, or tests,
please feel free to chime in on HDFS-3077 or the relevant subtasks.

Thanks
Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Heads up: merge for QJM branch soon

Posted by Todd Lipcon <to...@cloudera.com>.
On Thu, Sep 20, 2012 at 1:38 PM, Suresh Srinivas <su...@hortonworks.com> wrote:
> I need a week or so to go over the design and review the code changes. I
> will post my comments to the JIRA directly. Meanwhile, any updates made
> to the design document would help.

The latest rev on the design doc is fully up to date as of yesterday.

Thanks
-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Heads up: merge for QJM branch soon

Posted by Suresh Srinivas <su...@hortonworks.com>.
I need a week or so to go over the design and review the code changes. I
will post my comments to the JIRA directly. Meanwhile, any updates made to
the design document would help.

Regards,
Suresh

-- 
http://hortonworks.com/download/

Re: Heads up: merge for QJM branch soon

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Konstantin,

I'd be open to that. I know that we plan to ship it as a recommended
option for CDH, but if the community feels it's better compiled as a
separate module, either inside or outside of Hadoop, I'm open to
discussing that. You're absolutely correct that it is meant to
"stand alone" in its own HDFS package.

-Todd

On Wed, Sep 19, 2012 at 5:20 PM, Konstantin Shvachko
<sh...@gmail.com> wrote:
> Hi Todd,
>
> I was wondering if you have considered making QuorumJournal a
> separate project or subproject.
> Given that:
> - it is 6600 lines of code
> - the code is all new
> - it is well separated in its own package
> - it implements reliable journaling, which can have alternative
> approaches (say BookKeeper)
> 
> Taking all that into account, it seems this work by itself can and
> should form a project, as an alternative to merging it under the
> HDFS umbrella.
>
> Thanks,
> --Konstantin

-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Heads up: merge for QJM branch soon

Posted by Konstantin Shvachko <sh...@gmail.com>.
Hi Todd,

I was wondering if you have considered making QuorumJournal a
separate project or subproject.
Given that:
- it is 6600 lines of code
- the code is all new
- it is well separated in its own package
- it implements reliable journaling, which can have alternative
approaches (say BookKeeper)

Taking all that into account, it seems this work by itself can and
should form a project, as an alternative to merging it under the
HDFS umbrella.

Thanks,
--Konstantin

Re: Heads up: merge for QJM branch soon

Posted by Andrew Purtell <ap...@apache.org>.
We have been backporting Todd's HDFS-3077 branch changes on top of
branch-2 for a while now, and testing the result in small clusters
(5-10 nodes). Although we certainly have not matched the test coverage
Todd describes from their internal testing, we can add the datapoint
that the QJM, in black-box testing, functions as advertised and is
resilient to both single-JN and single-NN fault scenarios. I'd add the
caveat that support for running with security enabled was added only
recently, so our experience with that is limited, though successful.

-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)