Posted to dev@accumulo.apache.org by Sean Busbey <se...@manvsbeard.com> on 2013/11/18 18:13:47 UTC

Review Request 15650: ACCUMULO-1794 adds hdfs failover to continuous integration test.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15650/
-----------------------------------------------------------

Review request for accumulo and Alex Moundalexis.


Bugs: ACCUMULO-1794
    https://issues.apache.org/jira/browse/ACCUMULO-1794


Repository: accumulo


Description
-------

ACCUMULO-1794 adds hdfs failover to continuous integration test.


Diffs
-----

  test/system/continuous/continuous-env.sh.example 830ae86b5bf2398a840b853423755f6dd65f2dc0 
  test/system/continuous/hdfs-agitator.pl PRE-CREATION 
  test/system/continuous/start-agitator.sh 52e5a4e82a4564fa624a71f73ad29fa20ba23246 
  test/system/continuous/stop-agitator.sh b853a55b12f8402606af52e0748ca50daf95ed7f 

Diff: https://reviews.apache.org/r/15650/diff/


Testing
-------

Ran the hdfs agitator on a CDH4 cluster configured for HA. It successfully caused the active namenode to fail over as it ran.


Thanks,

Sean Busbey


Re: Review Request 15650: ACCUMULO-1794 adds hdfs failover to continuous integration test.

Posted by Alex Moundalexis <al...@clouderagovt.com>.

> On Dec. 6, 2013, 4:54 p.m., Alex Moundalexis wrote:
> > Ship It!

Shell and perl updates check out.


- Alex


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15650/#review29881
-----------------------------------------------------------




Re: Review Request 15650: ACCUMULO-1794 adds hdfs failover to continuous integration test.

Posted by Alex Moundalexis <al...@clouderagovt.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15650/#review29881
-----------------------------------------------------------

Ship it!


Ship It!

- Alex Moundalexis




Re: Review Request 15650: ACCUMULO-1794 adds hdfs failover to continuous integration test.

Posted by Sean Busbey <se...@manvsbeard.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15650/
-----------------------------------------------------------

(Updated Dec. 5, 2013, 9:49 p.m.)


Review request for accumulo and Alex Moundalexis.


Changes
-------

updated review to include feedback.


Bugs: ACCUMULO-1794
    https://issues.apache.org/jira/browse/ACCUMULO-1794


Repository: accumulo


Description
-------

ACCUMULO-1794 adds hdfs failover to continuous integration test.


Diffs (updated)
-----

  test/system/continuous/continuous-env.sh.example 830ae86 
  test/system/continuous/hdfs-agitator.pl PRE-CREATION 
  test/system/continuous/start-agitator.sh 52e5a4e 
  test/system/continuous/stop-agitator.sh b853a55 

Diff: https://reviews.apache.org/r/15650/diff/


Testing
-------

Ran the hdfs agitator on a CDH4 cluster configured for HA. It successfully caused the active namenode to fail over as it ran.


Thanks,

Sean Busbey


Re: Review Request 15650: ACCUMULO-1794 adds hdfs failover to continuous integration test.

Posted by Sean Busbey <bu...@clouderagovt.com>.
On Wed, Nov 20, 2013 at 12:00 PM, Keith Turner <ke...@deenlo.com> wrote:

> On Mon, Nov 18, 2013 at 12:31 PM, Sean Busbey <busbey+ml@clouderagovt.com
> >wrote:
>
> > Hey Folks,
> >
> > I have a proposed agitator script for flexing Hadoop 2's HDFS HA
> failover.
> > I'm looking for other features of Hadoop 2 that would be relevant for
> > ACCUMULO-1794.
> >
> > One related question I'd like to get a read on: the JobTracker HA in
> CDH4.
> > As a quick recap, CDH4.2.0+ includes HA support for MRv1, including
> > automatic failover between jobtrackers. When a failover occurs, jobs have
> > to restart.
> >
> > I have a similar agitator script to the one for HDFS HA that works to
> fail
> > over the job tracker. JobTracker HA doesn't exist in any Apache Hadoop
> > release, so I'm not sure if I should include it in the general test suite
> > as it would only be applicable to clusters running CDH4+ (unless some
> other
> > distro adds an equivalent administrative command).
> >
> > Thoughts?
> >
>
> I do not see a problem w/ adding this to our test scripts as long as there
> is adequate documentation making it clear it's for CDH4. The continuous
> ingest tests do use MapReduce. Would it be better to put this script in
> CDH4?
>
>

Whether it gets included in Apache Accumulo or not, the script will be used
for integration tests on CDH.

The downside of relying on that is primarily that the script would only be
maintained with a CDH-specific build of Accumulo in mind.

-- 
Sean

Re: Review Request 15650: ACCUMULO-1794 adds hdfs failover to continuous integration test.

Posted by Keith Turner <ke...@deenlo.com>.
On Mon, Nov 18, 2013 at 12:31 PM, Sean Busbey <bu...@clouderagovt.com> wrote:

> Hey Folks,
>
> I have a proposed agitator script for flexing Hadoop 2's HDFS HA failover.
> I'm looking for other features of Hadoop 2 that would be relevant for
> ACCUMULO-1794.
>
> One related question I'd like to get a read on: the JobTracker HA in CDH4.
> As a quick recap, CDH4.2.0+ includes HA support for MRv1, including
> automatic failover between jobtrackers. When a failover occurs, jobs have
> to restart.
>
> I have a similar agitator script to the one for HDFS HA that works to fail
> over the job tracker. JobTracker HA doesn't exist in any Apache Hadoop
> release, so I'm not sure if I should include it in the general test suite
> as it would only be applicable to clusters running CDH4+ (unless some other
> distro adds an equivalent administrative command).
>
> Thoughts?
>

I do not see a problem w/ adding this to our test scripts as long as there
is adequate documentation making it clear it's for CDH4. The continuous
ingest tests do use MapReduce. Would it be better to put this script in
CDH4?



Re: Review Request 15650: ACCUMULO-1794 adds hdfs failover to continuous integration test.

Posted by Sean Busbey <bu...@clouderagovt.com>.
Hey Folks,

I have a proposed agitator script for flexing Hadoop 2's HDFS HA failover.
I'm looking for other features of Hadoop 2 that would be relevant for
ACCUMULO-1794.

One related question I'd like to get a read on: the JobTracker HA in CDH4.
As a quick recap, CDH4.2.0+ includes HA support for MRv1, including
automatic failover between jobtrackers. When a failover occurs, jobs have
to restart.

I have a similar agitator script to the one for HDFS HA that works to fail
over the job tracker. JobTracker HA doesn't exist in any Apache Hadoop
release, so I'm not sure if I should include it in the general test suite
as it would only be applicable to clusters running CDH4+ (unless some other
distro adds an equivalent administrative command).

Thoughts?




-- 
Sean

Re: Review Request 15650: ACCUMULO-1794 adds hdfs failover to continuous integration test.

Posted by Sean Busbey <se...@manvsbeard.com>.

> On Nov. 18, 2013, 6:33 p.m., Josh Elser wrote:
> > test/system/continuous/hdfs-agitator.pl, line 90
> > <https://reviews.apache.org/r/15650/diff/1/?file=388001#file388001line90>
> >
> >     It would be nice to default to running as the invoking user instead of forcing sudo. If I have discrete users set up for each role, I may not always want to have sudoers set up.
> >     
> >     Having the ability is definitely nice, though.

haadmin is only runnable as an HDFS super user, and AFAICT the continuous integration test runs as either the accumulo user or root (for its kill stuff to work on the other components).

If people run the agitator script as root, then the sudo is needed to allow the command to run. If they run the agitator as something other than root, then we need either a sudo to the accumulo user for the other agitator stuff or one here. Unless the accumulo user is in the hdfs superuser group. But I don't want to encourage people to add the accumulo user to the HDFS superuser group.


Maybe a docs update is needed too?

I think it'd be simpler to document "run the agitator as root because it's going to need access to multiple users". Or is it worth the overhead of properly breaking the testing out into users-per-role?


> On Nov. 18, 2013, 6:33 p.m., Josh Elser wrote:
> > test/system/continuous/hdfs-agitator.pl, line 120
> > <https://reviews.apache.org/r/15650/diff/1/?file=388001#file388001line120>
> >
> >     Maybe rename this from hdfs-agitator to ha-hdfs-agitator (or similar). agitator.pl also agitates datanodes so the name is a bit of a misnomer.

I'd rather pull the agitator.pl parts that mess with datanodes into this script. That way we'd get better failure testing for when data nodes and tablet servers are not the same set of machines, and we could flex more advanced HDFS failure conditions (like handling rack loss).

Sound good?


- Sean


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15650/#review29059
-----------------------------------------------------------




Re: Review Request 15650: ACCUMULO-1794 adds hdfs failover to continuous integration test.

Posted by Sean Busbey <se...@manvsbeard.com>.

> On Nov. 18, 2013, 6:33 p.m., Josh Elser wrote:
> > test/system/continuous/hdfs-agitator.pl, line 120
> > <https://reviews.apache.org/r/15650/diff/1/?file=388001#file388001line120>
> >
> >     Maybe rename this from hdfs-agitator to ha-hdfs-agitator (or similar). agitator.pl also agitates datanodes so the name is a bit of a misnomer.
> 
> Sean Busbey wrote:
>     I'd rather pull the agitator.pl parts that mess with datanodes into this script. That way we'd get better failure testing for when data nodes and tablet servers are not the same set of machines, and we could flex more advanced HDFS failure conditions (like handling rack loss).
>     
>     Sound good?
> 
> Josh Elser wrote:
>     Yup. Makes sense.

Filed as ACCUMULO-1971


- Sean


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15650/#review29059
-----------------------------------------------------------




Re: Review Request 15650: ACCUMULO-1794 adds hdfs failover to continuous integration test.

Posted by Sean Busbey <se...@manvsbeard.com>.

> On Nov. 18, 2013, 6:33 p.m., Josh Elser wrote:
> > test/system/continuous/hdfs-agitator.pl, line 90
> > <https://reviews.apache.org/r/15650/diff/1/?file=388001#file388001line90>
> >
> >     It would be nice to default to running as the invoking user instead of forcing sudo. If I have discrete users set up for each role, I may not always want to have sudoers set up.
> >     
> >     Having the ability is definitely nice, though.
> 
> Sean Busbey wrote:
>     haadmin is only runnable as an HDFS super user, and AFAICT the continuous integration test runs as either the accumulo user or root (for its kill stuff to work on the other components).
>     
>     If people run the agitator script as root, then the sudo is needed to allow the command to run. If they run the agitator as something other than root, then we need either a sudo to the accumulo user for the other agitator stuff or one here. Unless the accumulo user is in the hdfs superuser group. But I don't want to encourage people to add the accumulo user to the HDFS superuser group.
>     
>     
>     Maybe a docs update is needed too?
>     
>     I think it'd be simpler to document "run the agitator as root because it's going to need access to multiple users". Or is it worth the overhead of properly breaking the testing out into users-per-role?

Changed to default to running as the current user unless --sudo is given. If that fails, it makes an attempt to default to the sudo on the path.
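
One possible reading of that behavior can be sketched as follows. This is an illustration in Python, not the Perl in hdfs-agitator.pl and not the actual patch: `build_command`, its parameters, and the `hdfs` superuser name are all hypothetical, and treating "--sudo without a path" as "look sudo up on the PATH" is an assumption.

```python
import shutil


def build_command(args, sudo=None, superuser="hdfs", which=shutil.which):
    """Return the argv list for running an hdfs command.

    sudo: None  -> run as the invoking user (the default)
          ""    -> sudo requested without a path; look it up on the PATH
          other -> explicit path to a sudo binary
    """
    command = ["hdfs"] + list(args)
    if sudo is None:
        return command
    sudo_path = sudo or which("sudo")
    if sudo_path is None:
        raise RuntimeError("sudo requested but not found on the PATH")
    # -u runs the command as the given user instead of root.
    return [sudo_path, "-u", superuser] + command
```

With no sudo argument the command is left untouched, so a site that keeps discrete users per role never needs a sudoers entry for the agitator.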


- Sean


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15650/#review29059
-----------------------------------------------------------




Re: Review Request 15650: ACCUMULO-1794 adds hdfs failover to continuous integration test.

Posted by Josh Elser <jo...@gmail.com>.

> On Nov. 18, 2013, 6:33 p.m., Josh Elser wrote:
> > test/system/continuous/hdfs-agitator.pl, line 90
> > <https://reviews.apache.org/r/15650/diff/1/?file=388001#file388001line90>
> >
> >     It would be nice to default to running as the invoking user instead of forcing sudo. If I have discrete users set up for each role, I may not always want to have sudoers set up.
> >     
> >     Having the ability is definitely nice, though.
> 
> Sean Busbey wrote:
>     haadmin is only runnable as an HDFS super user, and AFAICT the continuous integration test runs as either the accumulo user or root (for its kill stuff to work on the other components).
>     
>     If people run the agitator script as root, then the sudo is needed to allow the command to run. If they run the agitator as something other than root, then we need either a sudo to the accumulo user for the other agitator stuff or one here. Unless the accumulo user is in the hdfs superuser group. But I don't want to encourage people to add the accumulo user to the HDFS superuser group.
>     
>     
>     Maybe a docs update is needed too?
>     
>     I think it'd be simpler to document "run the agitator as root because it's going to need access to multiple users". Or is it worth the overhead of properly breaking the testing out into users-per-role?
> 
> Sean Busbey wrote:
>     Changed to default to running as the current user unless --sudo is given. If that fails, it makes an attempt to default to the sudo on the path.

Awesome. Sounds good. Agreed that the docs likely need to be updated (or formally written).


> On Nov. 18, 2013, 6:33 p.m., Josh Elser wrote:
> > test/system/continuous/hdfs-agitator.pl, line 120
> > <https://reviews.apache.org/r/15650/diff/1/?file=388001#file388001line120>
> >
> >     Maybe rename this from hdfs-agitator to ha-hdfs-agitator (or similar). agitator.pl also agitates datanodes so the name is a bit of a misnomer.
> 
> Sean Busbey wrote:
>     I'd rather pull the agitator.pl parts that mess with datanodes into this script. That way we'd get better failure testing for when data nodes and tablet servers are not the same set of machines, and we could flex more advanced HDFS failure conditions (like handling rack loss).
>     
>     Sound good?

Yup. Makes sense.


- Josh


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15650/#review29059
-----------------------------------------------------------




Re: Review Request 15650: ACCUMULO-1794 adds hdfs failover to continuous integration test.

Posted by Josh Elser <jo...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15650/#review29059
-----------------------------------------------------------



test/system/continuous/hdfs-agitator.pl
<https://reviews.apache.org/r/15650/#comment56136>

    It would be nice to default to running as the invoking user instead of forcing sudo. If I have discrete users set up for each role, I may not always want to have sudoers set up.
    
    Having the ability is definitely nice, though.



test/system/continuous/hdfs-agitator.pl
<https://reviews.apache.org/r/15650/#comment56137>

    Maybe rename this from hdfs-agitator to ha-hdfs-agitator (or similar). agitator.pl also agitates datanodes so the name is a bit of a misnomer.


- Josh Elser




Re: Review Request 15650: ACCUMULO-1794 adds hdfs failover to continuous integration test.

Posted by Sean Busbey <se...@manvsbeard.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15650/#review29044
-----------------------------------------------------------



test/system/continuous/hdfs-agitator.pl
<https://reviews.apache.org/r/15650/#comment56128>

    I could add a check here that the `hdfs haadmin` command works.
    
    Worth it? Things should fail cleanly below.
    
    It would speed up troubleshooting should someone attempt to use this on a non-HA capable cluster.
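
    A rough sketch of what such a check could look like, written in Python rather than the Perl of the script under review. `haadmin_works` and the `run` parameter are hypothetical names, and probing with `-getServiceState` against a known NameNode service ID is an assumption about how the check would be done.

```python
import subprocess


def haadmin_works(service_id, run=subprocess.run):
    """Return True if `hdfs haadmin -getServiceState <service_id>` succeeds."""
    try:
        result = run(
            ["hdfs", "haadmin", "-getServiceState", service_id],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
    except OSError:
        # The hdfs binary is missing or not executable.
        return False
    # A non-zero exit status is treated as "haadmin is not usable here".
    return result.returncode == 0
```

    Calling this once at startup and exiting with a clear message would surface a non-HA cluster before the agitation loop begins.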


- Sean Busbey




Re: Review Request 15650: ACCUMULO-1794 adds hdfs failover to continuous integration test.

Posted by ke...@deenlo.com.

> On Nov. 20, 2013, 4:16 p.m., kturner wrote:
> > test/system/continuous/hdfs-agitator.pl, line 104
> > <https://reviews.apache.org/r/15650/diff/1/?file=388001#file388001line104>
> >
> >     What are the pros and cons of using this haadmin command vs killing namenode processes?
> 
> Sean Busbey wrote:
>     Pro haadmin:
>     
>     * The underlying HDFS instance may not be configured for automatic failover.
>     * The haadmin command doesn't require knowing where the NameNode processes are running within the cluster.
>     * The haadmin tool is a publicly exposed way of saying "do a failover", whereas finding the NameNode to kill will be a heuristic.
>     
>     Pro killing namenode:
>     
>     * If you specifically need to test what happens when it's the automatic failover process kicking in
>     
>     Note that I don't think the pro-killing pro is that strong of a pro. The haadmin command still needs to transition the active to standby and then the standby to active, so systems above HDFS are going to already encounter e.g. gaps in there being an active namenode.

I made the following comment on the dev list earlier because review board was not working. I suspect killing the processes would yield slightly more realistic test results, but it certainly makes our scripts more unwieldy. Maybe a better way to do this is to work towards moving hdfs agitation into hdfs itself.

Taking things a bit further, killing processes is not as effective in testing as actually killing machines (because it does not expose issues like unflushed data in OS caches).

On to another issue. Does the script ever kill all HA namenodes? Is this possible w/ haadmin?


- kturner


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15650/#review29167
-----------------------------------------------------------




Re: Review Request 15650: ACCUMULO-1794 adds hdfs failover to continuous integration test.

Posted by Sean Busbey <se...@manvsbeard.com>.

> On Nov. 20, 2013, 4:16 p.m., kturner wrote:
> > test/system/continuous/hdfs-agitator.pl, line 104
> > <https://reviews.apache.org/r/15650/diff/1/?file=388001#file388001line104>
> >
> >     What are the pros and cons of using this haadmin command vs killing namenode processes?
> 
> Sean Busbey wrote:
>     Pro haadmin:
>     
>     * The underlying HDFS instance may not be configured for automatic failover.
>     * The haadmin command doesn't require knowing where the NameNode processes are running within the cluster.
>     * The haadmin tool is a publicly exposed way of saying "do a failover", whereas finding the NameNode to kill will be a heuristic.
>     
>     Pro killing namenode:
>     
>     * If you specifically need to test what happens when it's the automatic failover process kicking in
>     
>     Note that I don't think the pro-killing pro is that strong of a pro. The haadmin command still needs to transition the active to standby and then the standby to active, so systems above HDFS are going to already encounter e.g. gaps in there being an active namenode.
> 
> kturner wrote:
>     I made the following comment on the dev list earlier because review board was not working.  I suspect killing the processes would yield slightly more realistic test results, but it certainly makes our scripts more unwieldy.  Maybe a better way to do this is to work towards moving hdfs agitation into hdfs itself.  
>     
>     Taking things a bit further, killing processes is not as effective in test as really killing machines (because it does not expose issues like unflushed data in OS caches).
>     
>     On to another issue.  Does the script ever kill all HA namenodes?  Is this possible w/ haadmin?

The kind of testing you're talking about generally happens in BigTop, rather than in individual components (e.g. HBase).

Haadmin doesn't have a command to take NameNodes offline, just to mark them as standby rather than active. I believe you could use haadmin to force all namenodes to standby mode, but I suspect that in a setup with automatic failover the failover controllers would cause one to become active again. I'll check to confirm this.
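
As a rough sketch of how that check could be run (this is not part of the patch under review), the snippet below builds one `hdfs haadmin -transitionToStandby` command per NameNode and counts how many NameNodes `hdfs haadmin -getServiceState` still reports as active. The service IDs `nn1` and `nn2` and both helper names are placeholders for illustration. Whether a cluster with automatic failover accepts the manual transition, or re-elects an active NameNode afterwards, is the open question above and is not settled by this code.

```python
import subprocess


def standby_all_commands(service_ids):
    """Build one 'haadmin -transitionToStandby' command per NameNode ID."""
    return [["hdfs", "haadmin", "-transitionToStandby", sid] for sid in service_ids]


def count_active(service_ids, run=subprocess.check_output):
    """Count the NameNodes that haadmin currently reports as active.

    'run' is injectable so the logic can be exercised without a cluster.
    """
    states = [
        run(["hdfs", "haadmin", "-getServiceState", sid]).decode().strip()
        for sid in service_ids
    ]
    return states.count("active")
```

On a real HA cluster one would run each command from `standby_all_commands(["nn1", "nn2"])` with `subprocess.check_call`, wait a while, and then see whether `count_active(["nn1", "nn2"])` is back to 1.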

Actually getting to the point of killing machines requires something external, e.g. the ability to talk to power managers or VMs. If we're looking for that level of fault testing, then I think we're better off deferring to BigTop and trying to improve both Accumulo's presence there and the use of e.g. Chaos Monkey or Gremlins.


- Sean


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15650/#review29167
-----------------------------------------------------------




Re: Review Request 15650: ACCUMULO-1794 adds hdfs failover to continuous integration test.

Posted by Keith Turner <ke...@deenlo.com>.
On Wed, Nov 20, 2013 at 11:22 AM, Sean Busbey <se...@manvsbeard.com> wrote:

>
>
> > On Nov. 20, 2013, 4:16 p.m., kturner wrote:
> > > test/system/continuous/hdfs-agitator.pl, line 104
> > > <https://reviews.apache.org/r/15650/diff/1/?file=388001#file388001line104>
> > >
> > >     What are the pros and cons of using this haadmin command vs killing namenode processes?
>
> Pro haadmin:
>
> * The underlying HDFS instance may not be configured for automatic
> failover.
> * The haadmin command doesn't require knowing where the NameNode processes
> are running within the cluster.
> * The haadmin tool is a publicly exposed way of saying "do a failover",
> whereas finding the NameNode to kill will be a heuristic.
>
> Pro killing namenode:
>
> * If you specifically need to test what happens when it's the automatic
> failover process kicking in
>
> Note that I don't think the pro-killing pro is that strong of a pro. The
> haadmin command still needs to transition the active to standby and then
> the standby to active, so systems above HDFS are going to already encounter
> e.g. gaps in there being an active namenode.
>
>
>
Review board is not working so responding here on the dev list.  I suspect
killing the processes would yield slightly more realistic test results, but
it certainly makes our scripts more unwieldy.  Maybe a better way to do
this is to work towards moving hdfs agitation into hdfs itself.

Taking things a bit further, killing processes is not as effective in test
as really killing machines (because it does not expose issues like
unflushed data in OS caches).

On to another issue.  Does the script ever kill all HA namenodes?  Is this
possible w/ haadmin?




> - Sean
>
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15650/#review29167
> -----------------------------------------------------------
>
>

Re: Review Request 15650: ACCUMULO-1794 adds hdfs failover to continuous integration test.

Posted by Sean Busbey <se...@manvsbeard.com>.

> On Nov. 20, 2013, 4:16 p.m., kturner wrote:
> > test/system/continuous/hdfs-agitator.pl, line 104
> > <https://reviews.apache.org/r/15650/diff/1/?file=388001#file388001line104>
> >
> >     What are the pros and cons of using this haadmin command vs killing namenode processes?

Pro haadmin:

* The underlying HDFS instance may not be configured for automatic failover.
* The haadmin command doesn't require knowing where the NameNode processes are running within the cluster.
* The haadmin tool is a publicly exposed way of saying "do a failover", whereas finding the NameNode to kill will be a heuristic.

Pro killing namenode:

* If you specifically need to test what happens when it's the automatic failover process kicking in

Note that I don't think the pro-killing pro is that strong of a pro. The haadmin command still needs to transition the active to standby and then the standby to active, so systems above HDFS are going to already encounter e.g. gaps in there being an active namenode.
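
To make the haadmin route concrete, here is a minimal sketch in Python (the patch itself is Perl, and this is not taken from it). It asks `hdfs haadmin -getServiceState` which NameNode is active and builds the `hdfs haadmin -failover <from> <to>` command that hands the active role to a standby. The service IDs `nn1` and `nn2` and the helper names are assumptions for illustration only.

```python
import subprocess


def get_state(service_id, run=subprocess.check_output):
    """Return the state ('active' or 'standby') haadmin reports for one NameNode."""
    out = run(["hdfs", "haadmin", "-getServiceState", service_id])
    return out.decode().strip()


def failover_command(service_ids, run=subprocess.check_output):
    """Build the haadmin command that moves the active role to a standby.

    service_ids are the NameNode IDs from the HA configuration; 'run' is
    injectable so the selection logic can be tried without a cluster.
    """
    states = {sid: get_state(sid, run) for sid in service_ids}
    active = [sid for sid, state in states.items() if state == "active"]
    standby = [sid for sid, state in states.items() if state == "standby"]
    if len(active) != 1 or not standby:
        raise RuntimeError("expected one active and a standby, got %r" % states)
    # -failover <from> <to>: the first ID gives up the active role, the second takes it.
    return ["hdfs", "haadmin", "-failover", active[0], standby[0]]
```

Nothing here needs to know which hosts run the NameNode processes; only the configured service IDs are used, which is the second point in the list above. On a real cluster the returned list would be passed to `subprocess.check_call`.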


- Sean


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15650/#review29167
-----------------------------------------------------------




Re: Review Request 15650: ACCUMULO-1794 adds hdfs failover to continuous integration test.

Posted by ke...@deenlo.com.

> On Nov. 20, 2013, 4:16 p.m., kturner wrote:
> > test/system/continuous/hdfs-agitator.pl, line 104
> > <https://reviews.apache.org/r/15650/diff/1/?file=388001#file388001line104>
> >
> >     What are the pros and cons of using this haadmin command vs killing namenode processes?
> 
> Sean Busbey wrote:
>     Pro haadmin:
>     
>     * The underlying HDFS instance may not be configured for automatic failover.
>     * The haadmin command doesn't require knowing where the NameNode processes are running within the cluster.
>     * The haadmin tool is a publicly exposed way of saying "do a failover", whereas finding the NameNode to kill will be a heuristic.
>     
>     Pro killing namenode:
>     
>     * If you specifically need to test what happens when it's the automatic failover process kicking in
>     
>     Note that I don't think the pro-killing pro is that strong of a pro. The haadmin command still needs to transition the active to standby and then the standby to active, so systems above HDFS are going to already encounter e.g. gaps in there being an active namenode.
> 
> kturner wrote:
>     I made the following comment on the dev list earlier because review board was not working.  I suspect killing the processes would yield slightly more realistic test results, but it certainly makes our scripts more unwieldy.  Maybe a better way to do this is to work towards moving hdfs agitation into hdfs itself.  
>     
>     Taking things a bit further, killing processes is not as effective in test as really killing machines (because it does not expose issues like unflushed data in OS caches).
>     
>     On to another issue.  Does the script ever kill all HA namenodes?  Is this possible w/ haadmin?
> 
> Sean Busbey wrote:
>     The kind of testing you're talking about generally happens in BigTop, rather than in individual components (e.g. HBase).
>     
>     Haadmin doesn't have a command to take NameNodes offline, just to mark them as standby rather than active. I believe you could use haadmin to force all namenodes to standby mode, but I suspect that in a setup with automatic failover the failover controllers would cause one to become active again. I'll check to confirm this.
>     
>     Actually getting to the point of killing machines requires something external, e.g. the ability to talk to power managers or VMs. If we're looking for that level of fault testing, then I think we're better off deferring to BigTop and trying to improve both Accumulo's presence there and the use of e.g. Chaos Monkey or Gremlins.

I agree killing machines is out of scope.  I brought it up to argue against killing processes, but did not complete the thought.

Regardless of how it's done, it would be nice to test Accumulo when there are temporarily no datanodes and/or no namenode.  That may be a separate issue if it needs more features than the scripts currently have.


- kturner


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15650/#review29167
-----------------------------------------------------------




Re: Review Request 15650: ACCUMULO-1794 adds hdfs failover to continuous integration test.

Posted by ke...@deenlo.com.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15650/#review29167
-----------------------------------------------------------



test/system/continuous/hdfs-agitator.pl
<https://reviews.apache.org/r/15650/#comment56303>

    What are the pros and cons of using this haadmin command vs killing namenode processes?


- kturner

