Posted to user@hadoop.apache.org by Koert Kuipers <ko...@tresata.com> on 2013/10/12 01:51:50 UTC

high availability

i have been playing with high availability using journalnodes and 2 masters
both running namenode and hbase master.

when i kill the namenode and hbase-master processes on the active master,
the failover is perfect. hbase never stops and a running map-reduce job
keeps going. this is impressive!

however when instead of killing the processes i kill the entire active
master machine, the transition is less smooth and can take a long time,
at least it seems this way in the logs. this is because ssh fencing fails
but keeps trying. my fencing is configured as:

  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>
      sshfence
      shell(/bin/true)
    </value>
    <final>true</final>
  </property>

it is unclear to me if the transition in this case is also rapid but the
fencing takes long while the new namenode is already active, or if in this
period i am stuck without an active namenode. it is hard to accurately test
this in my setup.
is this supposed to take this long? is HDFS writable in this period? and is
hbase supposed to survive this long transition?

thanks! koert
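
A hedged note on the delay described above: the fencing methods are tried in the order listed, so sshfence has to give up on the unreachable machine before shell(/bin/true) gets its turn, and the SSH connect timeout bounds how long each attempt waits. A minimal sketch for hdfs-site.xml, assuming the stock property dfs.ha.fencing.ssh.connect-timeout (milliseconds, 30000 by default in hdfs-default.xml) is present in the release in use. The 5000 value is only an illustrative choice, not a recommendation from this thread:

```xml
<!-- Sketch only: keeps sshfence in place but shortens how long each
     attempt waits for a host that no longer answers.
     The 5000 ms value is an assumption chosen for illustration. -->
<property>
  <name>dfs.ha.fencing.ssh.connect-timeout</name>
  <value>5000</value>
</property>
```

Whether this shortens the overall failover enough depends on the setup, so it would need to be checked against the ZKFC logs.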

Re: high availability

Posted by Bertrand Dechoux <de...@gmail.com>.

http://blog.cloudera.com/blog/2012/10/quorum-based-journaling-in-cdh4-1/

Old version (4.1) but the principle is still the same.

*No requirement for custom fencing configuration* - fencing methods such as
STONITH <http://en.wikipedia.org/wiki/STONITH> require custom hardware;
instead, we should rely only on software methods.

Bertrand

PS: But then the only true validation is by testing it.

On Tue, Oct 15, 2013 at 10:59 PM, Jing Zhao <ji...@hortonworks.com> wrote:

> I think real fencing is not required if you are using
> QJM-based HA. If you are using ZKFC, a graceful fencing will first be
> triggered, in which ZKFC sends an RPC request to the original ANN to
> make it standby. If the graceful fencing fails, the configured fencing
> will be used. In the worst case, where your original ANN cannot
> transition to standby state, QJM still has built-in single-writer
> semantics (see https://issues.apache.org/jira/browse/HDFS-3862,
> https://issues.apache.org/jira/browse/HDFS-4915). Thus you can set the
> fence method to shell(/bin/true) (since in the current code the fence
> configuration is still required).

-- 
Bertrand Dechoux

Re: high availability

Posted by Jing Zhao <ji...@hortonworks.com>.

I think real fencing is not required if you are using
QJM-based HA. If you are using ZKFC, a graceful fencing will first be
triggered, in which ZKFC sends an RPC request to the original ANN to
make it standby. If the graceful fencing fails, the configured fencing
will be used. In the worst case, where your original ANN cannot
transition to standby state, QJM still has built-in single-writer
semantics (see https://issues.apache.org/jira/browse/HDFS-3862,
https://issues.apache.org/jira/browse/HDFS-4915). Thus you can set the
fence method to shell(/bin/true) (since in the current code the fence
configuration is still required).
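
The suggestion above, written out as a config fragment. This is a minimal sketch for a QJM-based setup, assuming the property sits in hdfs-site.xml as in the original post. It drops sshfence and leaves only the shell method, which always reports success, so protection against two writers then rests on the single-writer semantics of the journal nodes:

```xml
<property>
  <name>dfs.ha.fencing.methods</name>
  <!-- shell(/bin/true) always exits 0, so the standby is never
       blocked waiting on a fencing step that cannot complete -->
  <value>shell(/bin/true)</value>
</property>
```

As noted elsewhere in the thread, the behaviour is best confirmed by testing a full machine failure on the actual cluster.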

On Tue, Oct 15, 2013 at 12:11 PM, Koert Kuipers <ko...@tresata.com> wrote:
> Jing,
> thanks for your answer.
>
> if hbase with high availability is the desired goal, is it recommended to
> remove sshfence? we do not plan to use hdfs for anything else.
>
> i understood that the only downside of no fencing is that the old namenode
> could still be serving read requests. could this negatively impact hbase
> functionality, or worse, could it corrupt hbase somehow (not sure how that
> would be...)?
>
> thanks! koert
>
>
>
> On Tue, Oct 15, 2013 at 12:38 AM, Jing Zhao <ji...@hortonworks.com> wrote:
>>
>> "it is unclear to me if the transition in this case is also rapid but
>> the fencing takes long while the new namenode is already active, or if
>> in this period i am stuck without an active namenode."
>>
>> The standby->active transition will get stuck in this period, i.e.,
>> the NN can only become active after fencing the old active NN. During
>> this period the only NN is in standby state, which cannot handle
>> usual R/W operations and just throws StandbyException, so an hbase
>> region server may kill itself in some cases, I guess.
>>
>> I think you can remove sshfence from the configuration if you are
>> using QJM-based HA.

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: high availability

Posted by Jing Zhao <ji...@hortonworks.com>.

I think real fencing is not required if you're using QJM-based HA.
If you are using ZKFC, a graceful fencing is triggered first, in which
ZKFC sends an RPC request to the original ANN (active NameNode) to
make it standby. If the graceful fencing fails, the configured fencing
method is used. In the worst case, where your original ANN cannot
transition to standby state, QJM still has built-in single-writer
semantics (see https://issues.apache.org/jira/browse/HDFS-3862,
https://issues.apache.org/jira/browse/HDFS-4915). Thus you can set the
fence method to shell(/bin/true) (a fence method must still be
configured in the current code).
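
As a sketch only, not taken from a tested cluster: the fencing property
from the original post with sshfence dropped might look like this in
hdfs-site.xml. It assumes QJM-based HA, so that the JournalNodes enforce
the single writer; everything else in the HA configuration is assumed
unchanged.

  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>shell(/bin/true)</value>
    <final>true</final>
  </property>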

On Tue, Oct 15, 2013 at 12:11 PM, Koert Kuipers <ko...@tresata.com> wrote:
> Jing,
> thanks for your answer.
>
> if hbase with high availability is the desired goal, is it recommended to
> remove sshfence? we do not plan to use hdfs for anything else.
>
> i understood that the only downside of no fencing is that the old namenode
> could still be serving read requests. could this negatively impact hbase
> functionality, or worse, could it corrupt hbase somehow (not sure how that
> would be...)?
>
> thanks! koert
>
>
>
> On Tue, Oct 15, 2013 at 12:38 AM, Jing Zhao <ji...@hortonworks.com> wrote:
>>
>> "it is unclear to me if the transition in this case is also rapid but
>> the fencing takes long while the new namenode is already active, or if
>> in this period i am stuck without an active namenode."
>>
>> The standby->active transition gets stuck during this period, i.e.,
>> the NN can only become active after fencing the old active NN. Until
>> then the only running NN is in standby state, which cannot handle
>> usual R/W operations and just throws StandbyException, so an hbase
>> region server may kill itself in some cases, I guess.
>>
>> I think you can remove sshfence from the configuration if you are
>> using QJM-based HA.
>>
>> On Fri, Oct 11, 2013 at 4:51 PM, Koert Kuipers <ko...@tresata.com> wrote:
>> > i have been playing with high availability using journalnodes and 2 masters
>> > both running namenode and hbase master.
>> >
>> > when i kill the namenode and hbase-master processes on the active master,
>> > the failover is perfect. hbase never stops and a running map-reduce job
>> > keeps going. this is impressive!
>> >
>> > however when instead of killing the processes i kill the entire active master
>> > machine, the transition is less smooth and can take a long time, at least
>> > it seems this way in the logs. this is because ssh fencing fails but keeps
>> > trying. my fencing is configured as:
>> >
>> >  <property>
>> >     <name>dfs.ha.fencing.methods</name>
>> >     <value>
>> >       sshfence
>> >       shell(/bin/true)
>> >     </value>
>> >     <final>true</final>
>> >   </property>
>> >
>> > it is unclear to me if the transition in this case is also rapid but the
>> > fencing takes long while the new namenode is already active, or if in this
>> > period i am stuck without an active namenode. it is hard to accurately test
>> > this in my setup.
>> > is this supposed to take this long? is HDFS writable in this period? and is
>> > hbase supposed to survive this long transition?
>> >
>> > thanks! koert
>>
>
>


Re: high availability

Posted by Koert Kuipers <ko...@tresata.com>.

Jing,
thanks for your answer.

if hbase with high availability is the desired goal, is it recommended to
remove sshfence? we do not plan to use hdfs for anything else.

i understood that the only downside of no fencing is that the old namenode
could still be serving read requests. could this negatively impact hbase
functionality, or worse, could it corrupt hbase somehow (not sure how that
would be...)?

thanks! koert



On Tue, Oct 15, 2013 at 12:38 AM, Jing Zhao <ji...@hortonworks.com> wrote:

> "it is unclear to me if the transition in this case is also rapid but
> the fencing takes long while the new namenode is already active, or if
> in this period i am stuck without an active namenode."
>
> The standby->active transition gets stuck during this period, i.e.,
> the NN can only become active after fencing the old active NN. Until
> then the only running NN is in standby state, which cannot handle
> usual R/W operations and just throws StandbyException, so an hbase
> region server may kill itself in some cases, I guess.
>
> I think you can remove sshfence from the configuration if you are
> using QJM-based HA.
>
> On Fri, Oct 11, 2013 at 4:51 PM, Koert Kuipers <ko...@tresata.com> wrote:
> > i have been playing with high availability using journalnodes and 2 masters
> > both running namenode and hbase master.
> >
> > when i kill the namenode and hbase-master processes on the active master,
> > the failover is perfect. hbase never stops and a running map-reduce job
> > keeps going. this is impressive!
> >
> > however when instead of killing the processes i kill the entire active master
> > machine, the transition is less smooth and can take a long time, at least
> > it seems this way in the logs. this is because ssh fencing fails but keeps
> > trying. my fencing is configured as:
> >
> >  <property>
> >     <name>dfs.ha.fencing.methods</name>
> >     <value>
> >       sshfence
> >       shell(/bin/true)
> >     </value>
> >     <final>true</final>
> >   </property>
> >
> > it is unclear to me if the transition in this case is also rapid but the
> > fencing takes long while the new namenode is already active, or if in this
> > period i am stuck without an active namenode. it is hard to accurately test
> > this in my setup.
> > is this supposed to take this long? is HDFS writable in this period? and is
> > hbase supposed to survive this long transition?
> >
> > thanks! koert
>

Re: high availability

Posted by Jing Zhao <ji...@hortonworks.com>.

"it is unclear to me if the transition in this case is also rapid but
the fencing takes long while the new namenode is already active, or if
in this period i am stuck without an active namenode."

The standby->active transition gets stuck during this period, i.e.,
the NN can only become active after fencing the old active NN. Until
then the only running NN is in standby state, which cannot handle
usual R/W operations and just throws StandbyException, so an hbase
region server may kill itself in some cases, I guess.

I think you can remove sshfence from the configuration if you are
using QJM-based HA.
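
If sshfence has to stay for some reason, one option, sketched here as an
untested assumption, is to bound each ssh attempt with the connect
timeout that the HDFS HA documentation lists for sshfence. The value is
in milliseconds, and 5000 is an arbitrary example. Whether this shortens
the stall seen when the whole machine is down would need to be verified
by testing.

  <property>
    <name>dfs.ha.fencing.ssh.connect-timeout</name>
    <value>5000</value>
  </property>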

On Fri, Oct 11, 2013 at 4:51 PM, Koert Kuipers <ko...@tresata.com> wrote:
> i have been playing with high availability using journalnodes and 2 masters
> both running namenode and hbase master.
>
> when i kill the namenode and hbase-master processes on the active master,
> the failover is perfect. hbase never stops and a running map-reduce job
> keeps going. this is impressive!
>
> however when instead of killing the processes i kill the entire active master
> machine, the transition is less smooth and can take a long time, at least
> it seems this way in the logs. this is because ssh fencing fails but keeps
> trying. my fencing is configured as:
>
>  <property>
>     <name>dfs.ha.fencing.methods</name>
>     <value>
>       sshfence
>       shell(/bin/true)
>     </value>
>     <final>true</final>
>   </property>
>
> it is unclear to me if the transition in this case is also rapid but the
> fencing takes long while the new namenode is already active, or if in this
> period i am stuck without an active namenode. it is hard to accurately test
> this in my setup.
> is this supposed to take this long? is HDFS writable in this period? and is
> hbase supposed to survive this long transition?
>
> thanks! koert

