Posted to dev@beam.apache.org by Stephen Sisk <si...@google.com.INVALID> on 2017/01/19 20:06:54 UTC

IO testing: failure scenarios

This is a discussion that I don't think affects any immediate decisions,
but that does inform how folks are writing unit tests, so I wanted to give
it its own thread.

Ismael mentioned:
"I am not sure that unit tests are enough to test distribution issues
because they are harder to simulate in particular if we add the fact that
we can have too many moving pieces. For example, imagine that we run a Beam
pipeline deployed via Spark on a YARN cluster (where some nodes can fail)
that reads from Kafka (with some slow partition) and writes to Cassandra
(with a partition that goes down). You see, this is a quite complex
combination of pieces (and possible issues), but it is not a totally
artificial scenario, in fact this is a common architecture, and this can
(at least in theory) be simulated with a cluster manager, but I don’t see
how can I easily reproduce this with a unit test."

I'd like to separate out two scenarios:
1. Testing for failures we know can occur
2. Testing for failures we don't realize can occur

For known failure scenarios (#1), we can definitely recreate them with unit
tests - as long as we focus on the code under test and how those failures
interact with it. In the case you describe, we can think
through how the failures would surface in the IO code and runner code and
write unit tests for that scenario. That way we don't need to worry about
the combinatorial explosion of kafka failures * spark failures * yarn
cluster failures * cassandra failures - we can just focus on the boundaries
between those. That is, which of these pieces directly interact, and how
can they surface failures to the other pieces? We then test each of those
individual failures on each particular component (and if useful, the
combination of failures within a particular piece.)
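
To make that concrete, here's a minimal sketch of what such a boundary
test could look like. It's plain JUnit with no real Beam or Cassandra
APIs - FlakyClient and RetryingWriter are hypothetical stand-ins for a
connector's client wrapper - and it just shows the shape of the idea:
simulate the failure at the boundary, then assert the IO code handles it.

import static org.junit.Assert.assertEquals;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.junit.Test;

public class BoundaryFailureTest {

  /** Fake client that fails a fixed number of times before succeeding. */
  static class FlakyClient {
    private int remainingFailures;
    final List<String> written = new ArrayList<>();

    FlakyClient(int failures) { this.remainingFailures = failures; }

    void write(String record) {
      if (remainingFailures > 0) {
        remainingFailures--;
        throw new RuntimeException("simulated transient connection loss");
      }
      written.add(record);
    }
  }

  /** Writer under test: retries each record a bounded number of times. */
  static class RetryingWriter {
    private final FlakyClient client;
    private final int maxAttempts;

    RetryingWriter(FlakyClient client, int maxAttempts) {
      this.client = client;
      this.maxAttempts = maxAttempts;
    }

    void write(String record) {
      RuntimeException last = null;
      for (int attempt = 0; attempt < maxAttempts; attempt++) {
        try {
          client.write(record);
          return;
        } catch (RuntimeException e) {
          last = e;
        }
      }
      throw last;
    }
  }

  @Test
  public void transientFailureIsRetriedWithoutDuplicates() {
    FlakyClient client = new FlakyClient(2); // fail twice, then recover
    RetryingWriter writer = new RetryingWriter(client, 3);

    writer.write("a");
    writer.write("b");

    // Every record lands exactly once despite the simulated failures.
    assertEquals(Arrays.asList("a", "b"), client.written);
  }
}

The point is that the test doesn't need Kafka, Spark, YARN or Cassandra
running - only a fake that surfaces the same failure the real dependency
would surface at that boundary.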

For example: a CassandraIO test that ensures that if a particular worker
running a BoundedReader/ParDo goes away, the IO still performs correctly. We
don't care whether that happens because of a spark failure or a YARN
failure - we just know the reader worker went away before committing work.
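
As a rough illustration - again with hypothetical fakes (InMemorySource
and Reader) rather than the real CassandraIO BoundedReader - such a test
could look something like this: a reader consumes part of its split, is
abandoned without committing, and the test asserts a replacement reader
still sees all the data.

import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.junit.Test;

public class ReaderWorkerLossTest {

  // Hypothetical in-memory source/reader pair; the real test would
  // exercise the connector's actual BoundedSource/BoundedReader.
  static class InMemorySource {
    final List<String> data;
    InMemorySource(List<String> data) { this.data = data; }
    Reader createReader() { return new Reader(data); }
  }

  static class Reader {
    private final List<String> data;
    private int pos = -1;
    Reader(List<String> data) { this.data = data; }
    boolean advance() { return ++pos < data.size(); }
    String getCurrent() { return data.get(pos); }
  }

  @Test
  public void abandonedReaderIsReplayedFromScratch() {
    InMemorySource source =
        new InMemorySource(Arrays.asList("a", "b", "c", "d"));

    // First "worker": reads part of the split, then disappears without
    // ever committing its progress (the cause - Spark failure, YARN
    // node loss - doesn't matter to this test).
    Reader lost = source.createReader();
    lost.advance();
    lost.advance(); // read "a" and "b", then the worker is gone

    // The runner reassigns the uncommitted split to a fresh reader.
    Reader retry = source.createReader();
    Set<String> seen = new HashSet<>();
    while (retry.advance()) {
      seen.add(retry.getCurrent());
    }

    // No records are lost: everything the failed worker never
    // committed is re-read by the replacement.
    assertEquals(new HashSet<>(Arrays.asList("a", "b", "c", "d")), seen);
  }
}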

However, I think you are getting at the value that chaos-monkey style
testing provides: showing failure scenarios that we don't realize can occur
(#2) - in that case, I do agree that having a full stack chaos-monkey test
can help. As you mentioned, that's a good thing to focus on down the line.
I would especially call out that those tests can make a lot of noise and
the failures are hard to investigate. I see them as valuable, but I would
want to consider implementing them after we have proven that we have good
tests for the failure scenarios we do know about. It has also proven useful
to turn the failures found by chaos-monkey testing into concrete unit tests
on the components affected.

S

Re: IO testing: failure scenarios

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Agreed: it's the distinction between predictable and unpredictable errors. It also covers the checkpoint case we discussed yesterday.

Thanks for bringing this point.

Regards
JB
