Posted to dev@kafka.apache.org by Viktor Somogyi <vi...@gmail.com> on 2017/09/01 12:07:00 UTC

Re: Fault Injection

Hi Colin,

I'd be interested in this.  I also think it would be valuable for the
community and would greatly increase the test coverage.
I saw you already have a PR; I'll review it as I have time :).

Viktor

On Tue, Aug 22, 2017 at 9:36 PM, Timothy Chen <tn...@gmail.com> wrote:

> Hi Colin,
>
> The Kibosh code is just a README for now.  Is it going to be published soon?
>
> Tim
>
> On Tue, Aug 22, 2017 at 11:44 AM, Colin McCabe <cm...@apache.org> wrote:
> > Hi all,
> >
> > I've been working on a fault injector for Apache Kafka.  The general
> > idea is to create faults such as network partitions or disk failures,
> > and see what happens in the cluster.  The fault injector can run as part
> > of a ducktape system test, or standalone.
> >
> > The fault injector has two processes: a coordinator, and an agent.  The
> > agent process is responsible for actually implementing the faults.  For
> > example, it might run iptables, send signals to processes, generate a
> > lot of load, or do something else to disrupt the computer it is running
> > on.  We run an agent process on each node where we would like to
> > potentially inject faults.  So it will run alongside the brokers,
> > ZooKeeper nodes, etc.
> >
> > The coordinator process is responsible for communicating with the agent
> > processes and for scheduling faults.  For example, the coordinator can
> > be instructed to create a fault immediately on several nodes.  Or it can
> > be instructed to create faults over time, based on a pseudorandom seed.
> > Both the coordinator and the agent expose a REST interface that accepts
> > objects serialized via JSON.
> >
> > I think two kinds of faults will be especially interesting: network
> > faults, and disk errors.  Simulating network faults in a Linux
> > environment is relatively straightforward using iptables.  Disk errors
> > are tougher to simulate, but I have written a FUSE filesystem to do
> > this.  The filesystem essentially simulates a bind mount in most cases,
> > but it can take a JSON specification telling it to inject certain
> > faults.  (Disk errors seem especially relevant to the ongoing work on
> > JBOD.)
> >
> > Although it's not a user-visible component, I think having a fault
> > injector will be valuable for Kafka users, because it will help us
> > stress test Kafka in more situations.  I'm going to post some patches in
> > a day or two; it would be great to get some feedback.  Check out
> > https://cwiki.apache.org/confluence/display/KAFKA/Fault+Injection
> >
> > best,
> > Colin
>
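
The seeded scheduling described in the quoted proposal (the coordinator
creating faults over time from a pseudorandom seed, with JSON objects sent
over REST) can be sketched roughly as below.  This is only an illustration
under assumptions, not the fault injector's actual code: the function name
build_fault_schedule, the JSON field names ("node", "type", "startMs"), and
the fault type strings are all hypothetical.

```python
import json
import random


def build_fault_schedule(seed, nodes, fault_types, count, max_delay_ms):
    """Return a reproducible list of fault specs derived from `seed`.

    All field names here are hypothetical; the real coordinator may use a
    different JSON layout.
    """
    rng = random.Random(seed)  # private generator, so global state is untouched
    schedule = []
    start_ms = 0
    for _ in range(count):
        # Each fault starts some pseudorandom delay after the previous one.
        start_ms += rng.randint(1, max_delay_ms)
        schedule.append({
            "node": rng.choice(nodes),
            "type": rng.choice(fault_types),
            "startMs": start_ms,
        })
    return schedule


if __name__ == "__main__":
    nodes = ["broker1", "broker2", "zk1"]
    faults = ["network_partition", "disk_error"]
    plan = build_fault_schedule(42, nodes, faults, count=3, max_delay_ms=5000)
    # The serialized form is what a REST client could send as a request body.
    print(json.dumps(plan, indent=2))
```

Using a private random.Random(seed) means the same seed always yields the
same schedule, so a failing test run can be replayed with identical faults.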

Re: Fault Injection

Posted by Colin McCabe <cm...@apache.org>.
Thanks, Viktor.  Also check out the fault injection umbrella JIRA, which has
more subtasks: https://issues.apache.org/jira/browse/KAFKA-5775

cheers,
Colin
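
As an illustration of the iptables-based network faults mentioned in the
original proposal, the sketch below builds the commands an agent could run
to cut a node off from a set of peers, and the matching commands to heal
the partition.  It is a guess at one possible approach, not the agent's
actual implementation: the helper name partition_commands is hypothetical,
and the commands are only constructed, not executed (running them requires
root).

```python
def partition_commands(peer_ips, undo=False):
    """Build iptables argument lists that drop traffic to and from peers.

    With undo=True the same rules are deleted (-D) instead of appended (-A).
    Hypothetical helper for illustration only.
    """
    action = "-D" if undo else "-A"
    commands = []
    for ip in peer_ips:
        # Drop packets arriving from the peer.
        commands.append(["iptables", action, "INPUT", "-s", ip, "-j", "DROP"])
        # Drop packets sent to the peer.
        commands.append(["iptables", action, "OUTPUT", "-d", ip, "-j", "DROP"])
    return commands


if __name__ == "__main__":
    for cmd in partition_commands(["10.0.0.2", "10.0.0.3"]):
        print(" ".join(cmd))
    for cmd in partition_commands(["10.0.0.2", "10.0.0.3"], undo=True):
        print(" ".join(cmd))
```

Each argument list could be passed to subprocess.run on the node under
test; keeping the undo form symmetric makes it easy to end the fault.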

