Posted to dev@zookeeper.apache.org by Andrei Savu <sa...@gmail.com> on 2012/07/02 12:15:23 UTC

Tool for interactive fault injection testing

Hi guys,

As part of my MSc project I have spent some time working on a fault
injection testing tool for Apache ZooKeeper, based on JBoss Byteman and
Apache Whirr.

You can find the code on GitHub:

https://github.com/andreisavu/zookeeper-tester

Do you think this could be a useful addition to contrib (in a version that's a
bit more generic)?

Thanks,

-- Andrei Savu / axemblr.com / Tools for Clouds

Re: Tool for interactive fault injection testing

Posted by Patrick Hunt <ph...@apache.org>.
Andrei, FYI I've passed this on to my ZK QA lead and asked him to
follow up with you. I'm hoping we can add this to our system testing
infrastructure.

Regards,

Patrick

On Tue, Jul 3, 2012 at 2:14 PM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
> Hi Andrei, Here are some thoughts about fault injection. Broadly speaking, I think we can classify faults as either server faults or channel faults. Server faults can mess with the internals of a server, and can cause it to crash, delay a response, or corrupt data. We have only limited support for tolerating data corruption, and I think that if you inject bit flips at various parts of the pipeline, we would be able to deal with them in only a few spots.
>
> For channel faults, we could consider introducing delays, dropping messages at random, and disconnecting at various points. Dropping messages at random may cause ZooKeeper to break, because it assumes in a few places that messages are delivered in order and without gaps. Introducing delays seems particularly interesting because we could test different ways of interleaving messages for leader election and Zab.
>
> -Flavio

Re: Tool for interactive fault injection testing

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
Hi Andrei, Here are some thoughts about fault injection. Broadly speaking, I think we can classify faults as either server faults or channel faults. Server faults can mess with the internals of a server, and can cause it to crash, delay a response, or corrupt data. We have only limited support for tolerating data corruption, and I think that if you inject bit flips at various parts of the pipeline, we would be able to deal with them in only a few spots.

For channel faults, we could consider introducing delays, dropping messages at random, and disconnecting at various points. Dropping messages at random may cause ZooKeeper to break, because it assumes in a few places that messages are delivered in order and without gaps. Introducing delays seems particularly interesting because we could test different ways of interleaving messages for leader election and Zab.
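
To make the in-order, no-gap assumption concrete, here is a minimal Java sketch. It is not ZooKeeper code: the class LossyChannelSketch and the methods deliver and firstGap are hypothetical names invented for illustration. It models a channel that drops messages at random and a receiver-side check that reports the first missing sequence number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative model only; names and behaviour are assumptions, not ZooKeeper internals.
final class LossyChannelSketch {

    // Sends sequence numbers 1..count and drops each one with
    // probability dropPercent / 100. Returns what was delivered, in order.
    static List<Integer> deliver(int count, int dropPercent, Random random) {
        List<Integer> delivered = new ArrayList<>();
        for (int seq = 1; seq <= count; seq++) {
            if (random.nextInt(100) >= dropPercent) {
                delivered.add(seq);
            }
        }
        return delivered;
    }

    // Returns the first sequence number in 1..count that the receiver
    // did not see where expected, or -1 if the stream has no gap.
    static int firstGap(List<Integer> delivered, int count) {
        int expected = 1;
        for (int seq : delivered) {
            if (seq != expected) {
                return expected;
            }
            expected++;
        }
        return expected <= count ? expected : -1;
    }
}
```

A receiver that assumes firstGap is always -1 has no recovery path once a random drop occurs, which is the kind of breakage described above. A delay fault, by contrast, would leave the delivered list complete and only change its timing.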

-Flavio    

On Jul 2, 2012, at 10:27 PM, Andrei Savu wrote:

> Thanks Flavio! This is the rule I'm using for the demo:
> 
> RULE NIO Server readPayload fails
> CLASS org.apache.zookeeper.server.NIOServerCnxn
> METHOD readPayload
> HELPER RandomHelper
> AT ENTRY
> IF nextInt(100) < 10
> DO throw new IOException("Injected by byteman");
> ENDRULE
> 
> 
> See:
> 
> https://github.com/andreisavu/zookeeper-tester/blob/master/src/main/resources/functions/install_byteman.sh
> 
> 
> ~10% of all the payload reads result in an exception being thrown. There is
> an increase in latency, but the cluster as a whole works as expected.
> 
> I am planning to do more of this if you think it's useful.
> 
> -- Andrei Savu

Re: Tool for interactive fault injection testing

Posted by Andrei Savu <sa...@gmail.com>.
Thanks Flavio! This is the rule I'm using for the demo:

RULE NIO Server readPayload fails
CLASS org.apache.zookeeper.server.NIOServerCnxn
METHOD readPayload
HELPER RandomHelper
AT ENTRY
IF nextInt(100) < 10
DO throw new IOException("Injected by byteman");
ENDRULE


See:

https://github.com/andreisavu/zookeeper-tester/blob/master/src/main/resources/functions/install_byteman.sh


~10% of all the payload reads result in an exception being thrown. There is
an increase in latency, but the cluster as a whole works as expected.
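
For readers unfamiliar with the HELPER line: the rule calls nextInt(100) on a helper object, and the condition nextInt(100) < 10 holds for roughly one call in ten. The actual RandomHelper lives in the zookeeper-tester repository and is not shown in this thread, so the Java class below is only a guess at its shape. The seeded constructor and the shouldInject method are additions made for illustration.

```java
import java.util.Random;

// Hypothetical stand-in for the RandomHelper named in the rule; the real
// class may differ. It is package-private here only to keep the sketch in a
// single file; a helper loaded by Byteman would likely need to be public.
class RandomHelper {
    private final Random random;

    RandomHelper() {
        this(new Random());
    }

    // Seeded variant (an assumption) so that a test run can be reproduced.
    RandomHelper(Random random) {
        this.random = random;
    }

    // Returns a uniformly distributed int in [0, bound).
    public int nextInt(int bound) {
        return random.nextInt(bound);
    }

    // Mirrors the rule condition: true for about `percent` calls out of 100.
    public boolean shouldInject(int percent) {
        return nextInt(100) < percent;
    }
}
```

Raising or lowering the constant in the IF clause changes the injected failure rate in the same way that changing the percent argument does here.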

I am planning to do more of this if you think it's useful.

-- Andrei Savu

On Mon, Jul 2, 2012 at 11:21 PM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:

> Sounds like great stuff, Andrei. Do you have a description of the faults
> you have injected that I can access?
>
> -Flavio

Re: Tool for interactive fault injection testing

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
Sounds like great stuff, Andrei. Do you have a description of the faults you have injected that I can access?

-Flavio

On Jul 2, 2012, at 10:14 PM, Andrei Savu wrote:

> I was unable to find any issues so far. It seems like ZooKeeper does a
> great job of handling network failures.
> 
> The tool deploys a ZooKeeper cluster on a cloud provider using Whirr,
> with Byteman [1] attached to the JVM.
> 
> Faults are injected by using Byteman rules. See this tutorial:
> https://community.jboss.org/wiki/FaultInjectionTestingWithByteman#what_is_fault_injection_testing
> 
> I am planning to improve the tool to have the ability to inject arbitrary
> rules through the web UI.
> 
> As a workload generator I am using a distributed queue implementation
> that handles ConnectionLoss by retrying the post (duplicates are acceptable
> when measuring the latency).
> 
> [1] http://www.jboss.org/byteman/
> 
> -- Andrei Savu

Re: Tool for interactive fault injection testing

Posted by Andrei Savu <sa...@gmail.com>.
I was unable to find any issues so far. It seems like ZooKeeper does a
great job of handling network failures.

The tool deploys a ZooKeeper cluster on a cloud provider using Whirr,
with Byteman [1] attached to the JVM.

Faults are injected by using Byteman rules. See this tutorial:
https://community.jboss.org/wiki/FaultInjectionTestingWithByteman#what_is_fault_injection_testing

I am planning to improve the tool to have the ability to inject arbitrary
rules through the web UI.

As a workload generator I am using a distributed queue implementation
that handles ConnectionLoss by retrying the post (duplicates are acceptable
when measuring the latency).
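
The retry behaviour described above can be sketched in Java roughly as follows. This is not the code from the repository: RetryOnConnectionLoss, ConnectionLossStandIn, and the maxAttempts and backoffMillis parameters are hypothetical. The stand-in exception takes the place of ZooKeeper's KeeperException.ConnectionLossException so that the sketch compiles without the ZooKeeper jar.

```java
import java.util.concurrent.Callable;

// Stand-in for KeeperException.ConnectionLossException (assumption made
// only to keep this sketch free of external dependencies).
class ConnectionLossStandIn extends Exception {
    ConnectionLossStandIn(String message) {
        super(message);
    }
}

final class RetryOnConnectionLoss {

    // Runs the operation until it succeeds or maxAttempts is used up.
    // Only connection loss is retried; any other exception propagates at once.
    // A retried post may have been applied already, so duplicates can occur.
    static <T> T call(Callable<T> operation, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        ConnectionLossStandIn last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (ConnectionLossStandIn e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Linear backoff between attempts.
                    Thread.sleep(backoffMillis * attempt);
                }
            }
        }
        throw last;
    }
}
```

Because connection loss does not say whether the server applied the request, a blind retry like this one can enqueue the same message twice, which is why the workload tolerates duplicates.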

[1] http://www.jboss.org/byteman/

-- Andrei Savu

On Mon, Jul 2, 2012 at 7:39 PM, Patrick Hunt <ph...@apache.org> wrote:

> Sounds interesting, but it's not clear to me from the provided docs
> what it does or what I am expected to do with it (canned tests, or a
> framework for me to use?). Have you been able to find any issues using
> this?
>
> Patrick

Re: Tool for interactive fault injection testing

Posted by Patrick Hunt <ph...@apache.org>.
Sounds interesting, but it's not clear to me from the provided docs
what it does or what I am expected to do with it (canned tests, or a
framework for me to use?). Have you been able to find any issues using
this?

Patrick

On Mon, Jul 2, 2012 at 3:15 AM, Andrei Savu <sa...@gmail.com> wrote:
> Hi guys,
>
> As part of my MSc project I have spent some time working on a fault
> injection testing tool for Apache ZooKeeper, based on JBoss Byteman and
> Apache Whirr.
>
> You can find the code on GitHub:
>
> https://github.com/andreisavu/zookeeper-tester
>
> Do you think this could be a useful addition to contrib (in a version that's a
> bit more generic)?
>
> Thanks,
>
> -- Andrei Savu / axemblr.com / Tools for Clouds