Posted to dev@beam.apache.org by Thomas Groh <tg...@google.com.INVALID> on 2017/04/03 16:14:53 UTC

Re: IO IT Patterns: Simplifying data loading

+1!

I really like this approach; it lets us test for consistency without having
to reimplement parts of the IO to actually load our data.

I'd also like to note a few things which I think will be required when we
want to expand this framework and style to also handle Unbounded IOs.
Obviously, unbounded Pipelines are more complicated, but I think we can
reuse most of what you've written.

The additional needs, from what I can tell, are promptness and termination.
We'll of course need timeouts associated with all of these tests, but we'll
also need some way to terminate the Pipeline(s) before the actual test
timeout once the test has succeeded (or failed), and some way to ensure that
assertions are run promptly.

Most of the surety around running assertions can be done with Pablo's move
to get PAssert away from aggregators by having the test pipelines read back
the appropriate signal. With this we can also successfully complete early,
similarly to how we expect to fail-early when assertions fail. This can
also handle termination, where (in the absence of failures) a successful
pipeline execution tears down its resources immediately.
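
To make the early-completion idea concrete, here is a minimal sketch in plain
Python. It is not Beam API: run_until_signal, poll_signal and cancel_pipeline
are hypothetical names, and the "SUCCESS"/"FAILURE" strings stand in for
whatever signal the test pipeline reads back.

```python
import time
from typing import Callable, Optional


def run_until_signal(
    poll_signal: Callable[[], Optional[str]],   # hypothetical: reads back the signal
    cancel_pipeline: Callable[[], None],        # hypothetical: tears down resources
    timeout_s: float,
    poll_interval_s: float = 0.01,
) -> str:
    """Poll until a terminal signal appears or the timeout elapses.

    Returns "SUCCESS", "FAILURE", or "TIMEOUT". The pipeline is always
    cancelled before returning, so resources are released right away.
    """
    deadline = time.monotonic() + timeout_s
    try:
        while time.monotonic() < deadline:
            signal = poll_signal()
            if signal in ("SUCCESS", "FAILURE"):
                # Finish early instead of waiting for the full test timeout.
                return signal
            time.sleep(poll_interval_s)
        return "TIMEOUT"
    finally:
        cancel_pipeline()
```

The try/finally is the design choice that matters here: success, failure and
timeout all go through the same teardown path.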

For most Runners, promptness can be obtained by ensuring that the elements
produced are close to "now", and windows all expire a meaningful amount of
time before the test would time out. Without significant system lag, most
runners will produce their outputs relatively promptly.
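
As a rough illustration of that promptness condition, the sketch below checks
that the fixed window holding the last element expires with some margin before
the test deadline. It assumes fixed (tumbling) windows aligned to zero and
negligible system lag; the function names and the margin parameter are
hypothetical, not part of any runner.

```python
def window_expiry(event_ts: float, window_size_s: float,
                  allowed_lateness_s: float = 0.0) -> float:
    """End of the fixed window containing event_ts, plus allowed lateness."""
    window_start = (event_ts // window_size_s) * window_size_s
    return window_start + window_size_s + allowed_lateness_s


def windows_expire_in_time(last_event_ts: float, test_deadline_ts: float,
                           window_size_s: float,
                           allowed_lateness_s: float = 0.0,
                           margin_s: float = 0.0) -> bool:
    """True if the last window expires at least margin_s before the deadline."""
    expiry = window_expiry(last_event_ts, window_size_s, allowed_lateness_s)
    return expiry + margin_s <= test_deadline_ts
```

For example, with 60-second windows and a last element at t=125, the window
closes at t=180, which leaves plenty of room before a deadline at t=600.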

Shutting down the producing pipeline can either be done when the test
completes, or when we know it's written all of the downstream inputs to the
data source.

On Wed, Mar 29, 2017 at 7:49 AM, Stephen Sisk <si...@google.com.invalid>
wrote:

> Hey Cham,
>
> Debugging is harder
> ================
> I definitely agree. As I said (and I think you still generally agree), I
> think the tradeoff is worth it. Looking at the data store in question can
> quickly narrow it down to one vs the other for a particular failure.
>
> Eventually consistent data stores
> ==========================
> I agree that this is a problem; however, I don't think this is a problem
> created by doing a writeThenRead test, because we have exactly the same
> problems for regular write tests (which are themselves writeThenRead, just
> with a native reader). I agree it exacerbates the "debugging is harder"
> question.
>
> I think we're in general agreement - folks testing eventually consistent
> data stores need to be careful, and consider what's best for them. This may
> not be the correct solution for them. I added a note to the testing doc to
> make sure to address this.
>
> S
>
> On Tue, Mar 28, 2017 at 10:27 PM Chamikara Jayalath <ch...@apache.org>
> wrote:
>
> > On Tue, Mar 28, 2017 at 3:00 AM Etienne Chauchot <ec...@gmail.com>
> > wrote:
> >
> > > Hi Stephen,
> > >
> > > I have some comments below:
> > >
> > >
> > > On 24/03/2017 at 00:26, Stephen Sisk wrote:
> > > > hi!
> > > >
> > > > I just opened a jira ticket that I wanted to make sure the mailing
> > > > list got a chance to see.
> > > >
> > > > The problem is that the current design pattern for doing data loading
> > > > in IO ITs (either writing a small program or using an external tool) is
> > > > complex, inefficient and requires extra steps like installing external
> > > > tools/probably using a VM. It also really doesn't scale well to the
> > > > larger data sizes we'd like to use for performance benchmarking.
> > > >
> > > > My proposal is that instead of trying to test read and write
> > > > separately, the test should be a "write, then read back what you just
> > > > wrote", all using the IO being tested.
> > > Sure, joining the read and write tests will let us write less often and
> > > thus be more efficient. Indeed, instead of writing once for all the
> > > read test runs and writing again at each write test run, we will only
> > > write at each read+write test run. We will also avoid needing a separate
> > > place to write to.
> > >
> >
> > I agree that this is beneficial from a test efficiency perspective but
> > there is a downside.
> >
> > I think a failure of this kind of write+read test could be quite hard to
> > debug, and it might even be hard to develop such a test to be non-flaky,
> > depending on the I/O. For example, for an eventually consistent file system
> > such as GCS, a failure of a write+read test could mean any one of the
> > following:
> >
> > * write failed
> > * read failed
> > * read was executed prior to write finishing and file system reaching a
> > consistent state.
> >
> > At first glance one might think that adding a barrier in the middle that
> > waits for reads to be consistent would solve that problem, but that will
> > not be the case if the data source serves requests using multiple replicas
> > which may be in inconsistent states (which is the case for GCS).
> >
> > Separate read and write tests with fixed input are much easier to
> > manage/debug.
> >
> > So I think we should be careful when converting I/O ITs to do read+write
> > and probably should only make this a recommendation for I/O ITs that
> > would not run into issues due to this.
> >
> > Just my 2 cents.
> >
> > Thanks,
> > Cham
> >
> >
> > > > To support scenarios like "I want to run my read test
> > > > repeatedly without re-writing the data", tests would add flags for
> > > > "skipCleanUp" and "useExistingData".
> > > But this makes an assumption about the order of test runs: the write
> > > test needs to have been run before the read test can happen. Maybe it is
> > > a little dangerous to make this assumption, no?
> > > >
> > > > I think we've all likely seen this type of solution when testing
> > > > storage layers in the past, and I've previously shied away from it in
> > > > this context, but I think now that I've seen some real ITs and thought
> > > > about scaling them, in this case it's the right solution.
> > > >
> > > > Please take a look at the jira if you have questions - there's a lot
> > > > more detail there.
> > > >
> > > > S
> > > >
> > > Etienne
> > >
> >
>
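
For readers skimming the thread, the "write, then read back what you just
wrote" proposal and its "skipCleanUp" / "useExistingData" flags can be
sketched roughly as below. This is only an illustration under stated
assumptions: InMemoryStore is a hypothetical stand-in for the real data store,
and write_then_read_test is not an existing Beam helper.

```python
from typing import Dict, List


class InMemoryStore:
    """Hypothetical stand-in for the data store behind the IO under test."""

    def __init__(self) -> None:
        self.tables: Dict[str, List[str]] = {}

    def write(self, table: str, rows: List[str]) -> None:
        self.tables[table] = list(rows)

    def read(self, table: str) -> List[str]:
        # Raises KeyError if nothing was written beforehand.
        return list(self.tables[table])

    def delete(self, table: str) -> None:
        self.tables.pop(table, None)


def write_then_read_test(store: InMemoryStore, table: str,
                         expected_rows: List[str],
                         use_existing_data: bool = False,
                         skip_clean_up: bool = False) -> bool:
    """Write the rows (unless reusing existing data), read them back, compare."""
    try:
        if not use_existing_data:
            store.write(table, expected_rows)
        actual = store.read(table)
        # Compare ignoring order, since readers need not preserve it.
        return sorted(actual) == sorted(expected_rows)
    finally:
        if not skip_clean_up:
            store.delete(table)
```

A first run with skip_clean_up=True leaves the data in place, so later runs
can pass use_existing_data=True and only exercise the read path. If no earlier
run wrote the data, the read fails, which is the ordering concern Etienne
raises above.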