You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@impala.apache.org by Henry Robinson <he...@apache.org> on 2016/09/02 23:51:42 UTC

IMPALA-2549: Debug rules -> injecting failures and exercising edge cases

Hi all -

I want to start to get some feedback on a skunkworks project I did recently.

Impala has long had 'debug options' which are small scripts that can inject
a limited class of failures into query execution, mostly at the exec node
scope (e.g. delaying Prepare(), returning an error during Open() etc).
These are useful, but have a number of limitations:

* Extending them is hard, because of the lack of regular language
* The set of available actions is limited to waiting forever, or
immediately cancelling.
* Actions can't be composed or sequenced, or executed with some
probability.
* Actions are only applicable to exec nodes, and generalising them is hard.
* Only one debug action may be applied per-query.

I have wanted to make improvements here for a long time, not least because
testing any RPC changes is hard, and am about done with the first cut that
addresses all of the above issues and more. I'm calling them 'debug rules',
and they have the following features:

   1. Regular rule structure so that the language may be extended without
   touching a parser.
   2. JSON-based serialisation so that rules may be human readable and
   (somewhat) writeable.
   3. Python builder classes to make complex rule construction easy.
   4. Extensible set of actions, which now includes setting a CLI flag at
   runtime, logging a message, failing with a probability, sleeping, and
   setting RPC timeouts.
   5. Actions may have arguments which may themselves be functions.
   Functions may access their environment in a limited way (so, for example,
   they may log their fragment ID).
   6. Actions may be applied at any point in Impalad, not just in exec
   nodes.
   7. Disabled completely in release builds.


The full commit is the HEAD here: https://github.com/
henryr/impala/tree/debug-rules. But I'd like some general feedback on the
'user experience', and with that in mind I'd like to point you to four
locations:

   1. Introductory documentation here: https://github.com/
   henryr/Impala/blob/debug-rules/be/src/runtime/debug-rules.h#L25
   <https://github.com/henryr/Impala/blob/debug-rules/be/src/runtime/debug-rules.h#L25>
   2. A set of tests for failure modes after IMPALA-1599, which
   parallelised fragment start-up: https://github.com/
   henryr/Impala/blob/debug-rules/tests/rpc/test_1599.py
   <https://github.com/henryr/Impala/blob/debug-rules/tests/rpc/test_1599.py>
   3. A randomized test that introduces random delays to a subset of phases
   in PHJ, PHA, scans and xchg operators: https://github.com/henryr/
   Impala/blob/debug-rules/tests/rpc/test_exec_node_injection.py
   <https://github.com/henryr/Impala/blob/debug-rules/tests/rpc/test_exec_node_injection.py>
   4. A port of the existing test_rpc_timeouts.py test (that confirms the
   RPC timeout mechanism that Juan wrote is working correctly) that doesn't
   need to be a custom cluster test - debug rules do all the parameter and
   flag setting: https://github.com/henryr/Impala/blob/debug-
   rules/tests/rpc/test_timeouts.py
   <https://github.com/henryr/Impala/blob/debug-rules/tests/rpc/test_timeouts.py>

I like the last one particularly as it shaves more than 60s off the
execution time of that test.

The changes themselves are very large (~2000 lines and growing), and aren't
remotely ready for review yet (you'll see half commented code all over the
place) - but I'd like feedback on whether this is useful, would help with
any testing scenarios you haven't been able to address yet, and if the
abstractions and test APIs seem about right.

Let me know what you think -

Henry

Re: IMPALA-2549: Debug rules -> injecting failures and exercising edge cases

Posted by Henry Robinson <he...@cloudera.com>.
Couple of answers inline (sorry for the delay in replying!)

On 6 September 2016 at 16:29, Juan <an...@gmail.com> wrote:

> Hey Henry,
>
> This is great. The rules are easy to read, easy to set, make failure
> injection much easier so we could add tests to verify Impala's behavior. I
> love that you can change the impala config dynamically without restart
> cluster. Not only it saves testing time, also adding flexibility to tests.
> I have a few questions:
> 1. could I set multiple debug rules for the same scope? for example, random
> rpc failures on multiple rpcs, like first transimit data RPC fails, then
> cancel fragment RPC fails as well.
>

You can set any number of rules, but a maximum of one on each tracepoint.
What you're asking for I think involves coordination between rules - that
would be possible to do quite easily, but wasn't in scope (i.e. someway to
say "this rule only takes effect if some other rule has already fired" -
easy to express that as a new action).



> 2. how easy to repeat a test failure? some actions allow inject failure
> randomly on fragments, If we could repeat the test with exactly the same
> failure injection sequence, we could probably figure out what's wrong
> faster.
>

That's something I want to make easy: random seeds should be logged, for
example, as should whether or not a rule was triggered. One goal is to make
it easy to write deterministic failure tests, so outside of rule sets
produced by randomized rule generation it should be very easy to see what's
going on.

Henry


>
> Thanks,
> Juan
>
> On Fri, Sep 2, 2016 at 4:51 PM, Henry Robinson <he...@apache.org> wrote:
>
> > Hi all -
> >
> > I want to start to get some feedback on a skunkworks project I did
> > recently.
> >
> > Impala has long had 'debug options' which are small scripts that can
> inject
> > a limited class of failures into query execution, mostly at the exec node
> > scope (e.g. delaying Prepare(), returning an error during Open() etc).
> > These are useful, but have a number of limitations:
> >
> > * Extending them is hard, because of the lack of regular language
> > * The set of available actions is limited to waiting forever, or
> > immediately cancelling.
> > * Actions can't be composed or sequenced, or executed with some
> > probability.
> > * Actions are only applicable to exec nodes, and generalising them is
> hard.
> > * Only one debug action may be applied per-query.
> >
> > I have wanted to make improvements here for a long time, not least
> because
> > testing any RPC changes is hard, and am about done with the first cut
> that
> > addresses all of the above issues and more. I'm calling them 'debug
> rules',
> > and they have the following features:
> >
> >    1. Regular rule structure so that the language may be extended without
> >    touching a parser.
> >    2. JSON-based serialisation so that rules may be human readable and
> >    (somewhat) writeable.
> >    3. Python builder classes to make complex rule construction easy.
> >    4. Extensible set of actions, which now includes setting a CLI flag at
> >    runtime, logging a message, failing with a probability, sleeping, and
> >    setting RPC timeouts.
> >    5. Actions may have arguments which may themselves be functions.
> >    Functions may access their environment in a limited way (so, for
> > example,
> >    they may log their fragment ID).
> >    6. Actions may be applied at any point in Impalad, not just in exec
> >    nodes.
> >    7. Disabled completely in release builds.
> >
> >
> > The full commit is the HEAD here: https://github.com/
> > henryr/impala/tree/debug-rules. But I'd like some general feedback on
> the
> > 'user experience', and with that in mind I'd like to point you to four
> > locations:
> >
> >    1. Introductory documentation here: https://github.com/
> >    henryr/Impala/blob/debug-rules/be/src/runtime/debug-rules.h#L25
> >    <https://github.com/henryr/Impala/blob/debug-rules/be/
> > src/runtime/debug-rules.h#L25>
> >    2. A set of tests for failure modes after IMPALA-1599, which
> >    parallelised fragment start-up: https://github.com/
> >    henryr/Impala/blob/debug-rules/tests/rpc/test_1599.py
> >    <https://github.com/henryr/Impala/blob/debug-rules/tests/
> > rpc/test_1599.py>
> >    3. A randomized test that introduces random delays to a subset of
> phases
> >    in PHJ, PHA, scans and xchg operators: https://github.com/henryr/
> >    Impala/blob/debug-rules/tests/rpc/test_exec_node_injection.py
> >    <https://github.com/henryr/Impala/blob/debug-rules/tests/
> > rpc/test_exec_node_injection.py>
> >    4. A port of the existing test_rpc_timeouts.py test (that confirms the
> >    RPC timeout mechanism that Juan wrote is working correctly) that
> doesn't
> >    need to be a custom cluster test - debug rules do all the parameter
> and
> >    flag setting: https://github.com/henryr/Impala/blob/debug-
> >    rules/tests/rpc/test_timeouts.py
> >    <https://github.com/henryr/Impala/blob/debug-rules/tests/
> > rpc/test_timeouts.py>
> >
> > I like the last one particularly as it shaves more than 60s off the
> > execution time of that test.
> >
> > The changes themselves are very large (~2000 lines and growing), and
> aren't
> > remotely ready for review yet (you'll see half commented code all over
> the
> > place) - but I'd like feedback on whether this is useful, would help with
> > any testing scenarios you haven't been able to address yet, and if the
> > abstractions and test APIs seem about right.
> >
> > Let me know what you think -
> >
> > Henry
> >
>



-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679

Re: IMPALA-2549: Debug rules -> injecting failures and exercising edge cases

Posted by Juan <an...@gmail.com>.
Hey Henry,

This is great. The rules are easy to read, easy to set, make failure
injection much easier so we could add tests to verify Impala's behavior. I
love that you can change the impala config dynamically without restart
cluster. Not only it saves testing time, also adding flexibility to tests.
I have a few questions:
1. could I set multiple debug rules for the same scope? for example, random
rpc failures on multiple rpcs, like first transimit data RPC fails, then
cancel fragment RPC fails as well.
2. how easy to repeat a test failure? some actions allow inject failure
randomly on fragments, If we could repeat the test with exactly the same
failure injection sequence, we could probably figure out what's wrong
faster.

Thanks,
Juan

On Fri, Sep 2, 2016 at 4:51 PM, Henry Robinson <he...@apache.org> wrote:

> Hi all -
>
> I want to start to get some feedback on a skunkworks project I did
> recently.
>
> Impala has long had 'debug options' which are small scripts that can inject
> a limited class of failures into query execution, mostly at the exec node
> scope (e.g. delaying Prepare(), returning an error during Open() etc).
> These are useful, but have a number of limitations:
>
> * Extending them is hard, because of the lack of regular language
> * The set of available actions is limited to waiting forever, or
> immediately cancelling.
> * Actions can't be composed or sequenced, or executed with some
> probability.
> * Actions are only applicable to exec nodes, and generalising them is hard.
> * Only one debug action may be applied per-query.
>
> I have wanted to make improvements here for a long time, not least because
> testing any RPC changes is hard, and am about done with the first cut that
> addresses all of the above issues and more. I'm calling them 'debug rules',
> and they have the following features:
>
>    1. Regular rule structure so that the language may be extended without
>    touching a parser.
>    2. JSON-based serialisation so that rules may be human readable and
>    (somewhat) writeable.
>    3. Python builder classes to make complex rule construction easy.
>    4. Extensible set of actions, which now includes setting a CLI flag at
>    runtime, logging a message, failing with a probability, sleeping, and
>    setting RPC timeouts.
>    5. Actions may have arguments which may themselves be functions.
>    Functions may access their environment in a limited way (so, for
> example,
>    they may log their fragment ID).
>    6. Actions may be applied at any point in Impalad, not just in exec
>    nodes.
>    7. Disabled completely in release builds.
>
>
> The full commit is the HEAD here: https://github.com/
> henryr/impala/tree/debug-rules. But I'd like some general feedback on the
> 'user experience', and with that in mind I'd like to point you to four
> locations:
>
>    1. Introductory documentation here: https://github.com/
>    henryr/Impala/blob/debug-rules/be/src/runtime/debug-rules.h#L25
>    <https://github.com/henryr/Impala/blob/debug-rules/be/
> src/runtime/debug-rules.h#L25>
>    2. A set of tests for failure modes after IMPALA-1599, which
>    parallelised fragment start-up: https://github.com/
>    henryr/Impala/blob/debug-rules/tests/rpc/test_1599.py
>    <https://github.com/henryr/Impala/blob/debug-rules/tests/
> rpc/test_1599.py>
>    3. A randomized test that introduces random delays to a subset of phases
>    in PHJ, PHA, scans and xchg operators: https://github.com/henryr/
>    Impala/blob/debug-rules/tests/rpc/test_exec_node_injection.py
>    <https://github.com/henryr/Impala/blob/debug-rules/tests/
> rpc/test_exec_node_injection.py>
>    4. A port of the existing test_rpc_timeouts.py test (that confirms the
>    RPC timeout mechanism that Juan wrote is working correctly) that doesn't
>    need to be a custom cluster test - debug rules do all the parameter and
>    flag setting: https://github.com/henryr/Impala/blob/debug-
>    rules/tests/rpc/test_timeouts.py
>    <https://github.com/henryr/Impala/blob/debug-rules/tests/
> rpc/test_timeouts.py>
>
> I like the last one particularly as it shaves more than 60s off the
> execution time of that test.
>
> The changes themselves are very large (~2000 lines and growing), and aren't
> remotely ready for review yet (you'll see half commented code all over the
> place) - but I'd like feedback on whether this is useful, would help with
> any testing scenarios you haven't been able to address yet, and if the
> abstractions and test APIs seem about right.
>
> Let me know what you think -
>
> Henry
>