Posted to dev@kafka.apache.org by Gwen Shapira <gw...@confluent.io> on 2016/10/04 00:45:36 UTC

[DISCUSS] Fault injection tests for Kafka

Hi Team Kafka,

I was thinking of enhancing our system tests with some fault
injections. You know, drop random packets, partition some nodes,
delete disks, maybe play with system clocks. Fun stuff :)

The idea is to build the fault injection into the system tests
themselves, so that if someone reports a failure scenario, we can
recreate it as a system test and make sure we actually fixed it.
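
To make "create a similar scenario in our system tests" concrete, here is a minimal sketch of how a reported failure could be written down as data before any injection tooling is chosen. Everything in it (FaultKind, Fault, Scenario, the broker names) is hypothetical and not part of the existing Kafka system tests; the fault kinds simply mirror the list above.

```python
from dataclasses import dataclass, field
from enum import Enum


class FaultKind(Enum):
    """Fault types taken from the list above (names are hypothetical)."""
    PACKET_LOSS = "packet_loss"
    NETWORK_PARTITION = "network_partition"
    DISK_LOSS = "disk_loss"
    CLOCK_SKEW = "clock_skew"


@dataclass(frozen=True)
class Fault:
    kind: FaultKind
    targets: tuple       # node names; purely illustrative
    duration_s: float    # how long the fault stays active


@dataclass
class Scenario:
    """An ordered list of faults that reproduces one reported failure."""
    name: str
    faults: list = field(default_factory=list)

    def then(self, kind, targets, duration_s):
        # Append the next fault and return self so calls can be chained.
        self.faults.append(Fault(kind, tuple(targets), duration_s))
        return self

    def total_duration_s(self):
        # Faults are assumed to run one after another, not concurrently.
        return sum(f.duration_s for f in self.faults)
```

A reported outage could then be captured as, say, Scenario("zk-outage").then(FaultKind.NETWORK_PARTITION, ["broker-1"], 60), and the same description could be replayed by whichever injection mechanism we end up with.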

I've also seen suggestions of using Jepsen for fault injection, but
I'm not familiar with this framework.

What do you guys think? Write our own failure injection? Or write
Kafka tests in Jepsen?



-- 
Gwen Shapira
Product Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter | blog

Re: [DISCUSS] Fault injection tests for Kafka

Posted by radai <ra...@gmail.com>.
For "small" failures (local failures on a single node, such as socket
disconnections, disk read errors, or running out of memory), I've used
Byteman before: http://byteman.jboss.org/

On Tue, Oct 4, 2016 at 5:46 PM, Joel Koshy <jj...@gmail.com> wrote:

> Hi Gwen,
>
> > I've also seen suggestions of using Jepsen for fault injection, but
> > I'm not familiar with this framework.
> >
> > What do you guys think? Write our own failure injection? or write
> > Kafka tests in Jepsen?
> >
>
> This would definitely add a lot of value and save a lot on release
> validation overhead. I have heard of Jepsen (via the blog), but haven't
> used it. At LinkedIn, a couple of infra teams have been using Simoorg
> <https://github.com/linkedin/simoorg>, which, being Python-based, would
> perhaps be easier for system test writers to use than Clojure (which
> Jepsen is written in). The Ambry <https://github.com/linkedin/ambry>
> project at LinkedIn uses it extensively (and I think has added several
> more failure scenarios that don't seem to be reflected in the GitHub
> repo). Anyway, I think we should at least enumerate what we want to test
> and evaluate the alternatives before reinventing the wheel.
>
> Thanks,
>
> Joel
>

Re: [DISCUSS] Fault injection tests for Kafka

Posted by Gwen Shapira <gw...@confluent.io>.
YES!

One of my goals for fault injection in our system tests is that
whoever fixes an issue will also add a test to make sure it stays
fixed.

On Wed, Oct 5, 2016 at 11:33 AM, Tom Crayford <tc...@heroku.com> wrote:
> I did some stuff like this recently with simple calls to `tc` (the samples
> I used were in the README for https://github.com/tylertreat/comcast). The
> only notable bug I've found so far is that if you cut all the Kafka nodes
> off entirely from ZooKeeper for, say, 60 seconds and then reconnect them,
> the nodes don't crash and they report as healthy in JMX, but calls to fetch
> metadata from them time out entirely. That can be fixed with a rolling
> restart, but it doesn't sound ideal (especially in the face of cloud
> networks, where short-lived total network outages can and do happen).
> Should I file a Jira detailing that bug?




Re: [DISCUSS] Fault injection tests for Kafka

Posted by Tom Crayford <tc...@heroku.com>.
I did some stuff like this recently with simple calls to `tc` (the samples
I used were in the README for https://github.com/tylertreat/comcast). The
only notable bug I've found so far is that if you cut all the Kafka nodes
off entirely from ZooKeeper for, say, 60 seconds and then reconnect them,
the nodes don't crash and they report as healthy in JMX, but calls to fetch
metadata from them time out entirely. That can be fixed with a rolling
restart, but it doesn't sound ideal (especially in the face of cloud
networks, where short-lived total network outages can and do happen).
Should I file a Jira detailing that bug?
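
For anyone who wants to try a similar experiment, here is a rough sketch of the shape such a test helper could take. It is not the exact setup described above: the helper names, the default ZooKeeper port 2181, the eth0 device, and the use of iptables alongside `tc` are all assumptions for illustration. The functions only build command lines; actually running them (as root, on the right host) is left to whatever runner the test harness provides.

```python
import shlex
from contextlib import contextmanager


def zookeeper_partition_rules(zk_port=2181):
    """Return (inject, heal) iptables commands that drop outbound ZooKeeper traffic.

    Assumes ZooKeeper listens on zk_port and that this runs on the broker host.
    """
    rule = ["OUTPUT", "-p", "tcp", "--dport", str(zk_port), "-j", "DROP"]
    return ["iptables", "-A"] + rule, ["iptables", "-D"] + rule


def netem_loss_rules(device="eth0", loss_percent=100):
    """Return (inject, heal) tc/netem commands that drop outgoing packets.

    Caution: this affects ALL egress traffic on the device, including SSH.
    """
    add = ["tc", "qdisc", "add", "dev", device, "root",
           "netem", "loss", f"{loss_percent}%"]
    delete = ["tc", "qdisc", "del", "dev", device, "root", "netem"]
    return add, delete


@contextmanager
def injected_fault(run, inject, heal):
    """Apply a fault for the duration of the with-block, healing it even on error.

    `run` is any callable that executes an argv list on the target host.
    """
    run(inject)
    try:
        yield
    finally:
        run(heal)


def dry_run(argv):
    # Stand-in runner that prints the command instead of executing it.
    print(shlex.join(argv))
```

A test would wrap its body in something like "with injected_fault(dry_run, *zookeeper_partition_rules()):", sleep for about 60 seconds inside the block, and then check that metadata requests succeed once the block exits.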

On Wed, Oct 5, 2016 at 7:26 PM, Gwen Shapira <gw...@confluent.io> wrote:

> Yeah, totally agree on discussing what we want to test first and
> implement anything later :)
>
> It's just that whenever I have this discussion, Jepsen comes up, so I was
> curious what was driving the interest and whether the specific
> framework is important to the community.

Re: [DISCUSS] Fault injection tests for Kafka

Posted by Gwen Shapira <gw...@confluent.io>.
Yeah, totally agree on discussing what we want to test first and
implement anything later :)

It's just that whenever I have this discussion, Jepsen comes up, so I was
curious what was driving the interest and whether the specific
framework is important to the community.

On Tue, Oct 4, 2016 at 5:46 PM, Joel Koshy <jj...@gmail.com> wrote:
> Hi Gwen,
>
>> I've also seen suggestions of using Jepsen for fault injection, but
>> I'm not familiar with this framework.
>>
>> What do you guys think? Write our own failure injection? or write
>> Kafka tests in Jepsen?
>>
>
> This would definitely add a lot of value and save a lot on release
> validation overhead. I have heard of Jepsen (via the blog), but haven't
> used it. At LinkedIn, a couple of infra teams have been using Simoorg
> <https://github.com/linkedin/simoorg>, which, being Python-based, would
> perhaps be easier for system test writers to use than Clojure (which
> Jepsen is written in). The Ambry <https://github.com/linkedin/ambry>
> project at LinkedIn uses it extensively (and I think has added several
> more failure scenarios that don't seem to be reflected in the GitHub
> repo). Anyway, I think we should at least enumerate what we want to test
> and evaluate the alternatives before reinventing the wheel.
>
> Thanks,
>
> Joel




Re: [DISCUSS] Fault injection tests for Kafka

Posted by Joel Koshy <jj...@gmail.com>.
Hi Gwen,

> I've also seen suggestions of using Jepsen for fault injection, but
> I'm not familiar with this framework.
>
> What do you guys think? Write our own failure injection? or write
> Kafka tests in Jepsen?
>

This would definitely add a lot of value and save a lot on release
validation overhead. I have heard of Jepsen (via the blog), but haven't
used it. At LinkedIn, a couple of infra teams have been using Simoorg
<https://github.com/linkedin/simoorg>, which, being Python-based, would
perhaps be easier for system test writers to use than Clojure (which
Jepsen is written in). The Ambry <https://github.com/linkedin/ambry>
project at LinkedIn uses it extensively (and I think has added several
more failure scenarios that don't seem to be reflected in the GitHub
repo). Anyway, I think we should at least enumerate what we want to test
and evaluate the alternatives before reinventing the wheel.

Thanks,

Joel