You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Eugene Kirpichov <ki...@google.com.INVALID> on 2017/10/26 21:24:33 UTC

The trouble with DisplayData tests

Hello,

DisplayData is currently in an odd position:
- Dataflow is the only runner that supports it at all [1]
- Dataflow supports it only for primitive transforms, but discards it for
composite transforms [2](which is quite problematic for very complex but
commonly used composite transforms, such as TextIO.write() and
BigQueryIO.write())
- There are a lot of tests testing display data of composite transforms -
these tests are currently testing something that is a no-op in production;
and these tests often are a maintenance burden when refactoring a transform.

What should we do about this? I see two options:
- Publish guidance for library authors that Beam is not currently ready for
supporting HasDisplayData on composite transforms, and you shouldn't bother
implementing or testing it (though you can implement it on primitive
transforms of course - which is basically just Read.from(Source) and ParDo)
- Say that people should be implementing and testing this feature
nevertheless, and promise that Beam runners are going to make it "worth it"
(not be a no-op) in the foreseeable future.

[1]
https://lists.apache.org/thread.html/53b33f17d981054ed198af0351196907bf147b6acf16160dbf395b62@%3Cdev.beam.apache.org%3E
[2] https://issues.apache.org/jira/browse/BEAM-366

Re: The trouble with DisplayData tests

Posted by Lukasz Cwik <lc...@google.com.INVALID>.
I think we should keep the tests around and encourage people to use it.

With the work that Ken put in to get the Pipeline proto to passed to the
Dataflow workers means that these composites will be flowing through
Dataflow to the execution side. I could see that this will eventually show
up in the Dataflow UI as well.

On Mon, Oct 30, 2017 at 1:45 PM, Eugene Kirpichov <
kirpichov@google.com.invalid> wrote:

> On Fri, Oct 27, 2017 at 12:40 AM Kenneth Knowles <kl...@google.com.invalid>
> wrote:
>
> > I see where you are coming from. It is truly a marginal feature of Beam
> at
> > the moment, but really *really* useful in debugging, when a runner takes
> > advantage of it. More inline - FWIW it may seem like I'm defending the
> > tests, but I have been in the same situation and felt the same way. I'm
> > just trying to ground things in specifics.
> >
> > On Thu, Oct 26, 2017 at 2:24 PM, Eugene Kirpichov <
> > kirpichov@google.com.invalid> wrote:
> >
> > > Hello,
> > >
> > > DisplayData is currently in an odd position:
> > > - Dataflow is the only runner that supports it at all [1]
> > > - Dataflow supports it only for primitive transforms, but discards it
> for
> > > composite transforms [2](which is quite problematic for very complex
> but
> > > commonly used composite transforms, such as TextIO.write() and
> > > BigQueryIO.write())
> > >
> >
> > As of today, the Java and Python Dataflow runners actually transmit the
> > entire original pipeline, which has slots for display data on composites.
> > So the future in which this is actually used - at least for Dataflow -
> > might be very near.
> >
> >
> > - There are a lot of tests testing display data of composite transforms -
> > > these tests are currently testing something that is a no-op in
> > production;
> > > and these tests often are a maintenance burden when refactoring a
> > > transform.
> >
> >
> > Which composites' tests in particular? Are they legitimate tests?
> >
> Pretty much all of them.
> For example:
> https://github.com/apache/beam/blob/master/sdks/java/
> core/src/test/java/org/apache/beam/sdk/io/TextIOWriteTest.java#L585
>
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-
> <https://github.com/apache/beam/blob/master/sdks/java/io/
> google-cloud-platform/src/test/java/org/apache/beam/sdk/
> io/gcp/pubsub/PubsubIOTest.java#L98>
> platform/src/test/java/org/apache/beam/sdk/io/gcp/pubsub/
> PubsubIOTest.java#L98
> <https://github.com/apache/beam/blob/master/sdks/java/io/
> google-cloud-platform/src/test/java/org/apache/beam/sdk/
> io/gcp/pubsub/PubsubIOTest.java#L98>
>  and
> https://github.com/apache/beam/blob/master/sdks/java/io/
> google-cloud-platform/src/test/java/org/apache/beam/sdk/
> io/gcp/pubsub/PubsubIOTest.java#L222
>
>
> https://github.com/apache/beam/blob/master/sdks/java/
> core/src/test/java/org/apache/beam/sdk/transforms/
> FlatMapElementsTest.java#L169
>
>
> and so on. I think nearly all transforms in the SDK are composite, and have
> unit tests that test DisplayData of the transform itself, which is
> currently moot. Some have tests of "display data of primitive transforms"
> which is currently more useful in practice.
>
>
> >
> > What should we do about this? I see two options:
> > > - Publish guidance for library authors that Beam is not currently ready
> > for
> > > supporting HasDisplayData on composite transforms, and you shouldn't
> > bother
> > > implementing or testing it (though you can implement it on primitive
> > > transforms of course - which is basically just Read.from(Source) and
> > ParDo)
> > > - Say that people should be implementing and testing this feature
> > > nevertheless, and promise that Beam runners are going to make it "worth
> > it"
> > > (not be a no-op) in the foreseeable future.
> > >
> >
> > Option 3: Don't say anyone _should_ do any particular thing, but promise
> > that [some] Beam runners are [probably] going to make it "worth it"
> pretty
> > soon. I expect third-party transform authors are already not using this,
> > are they?
> >
> Most aren't, but some are [looking at sdks/java/io]:
> https://github.com/apache/beam/blob/master/sdks/java/io/
> hbase/src/test/java/org/apache/beam/sdk/io/hbase/HBaseIOTest.java#L328
>
> https://github.com/apache/beam/blob/master/sdks/java/io/
> tika/src/test/java/org/apache/beam/sdk/io/tika/TikaIOTest.java#L128
>
> https://github.com/apache/beam/blob/master/sdks/java/io/
> google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/datastore/
> DatastoreV1Test.java#L223
>
> I guess if Dataflow's (or another runner's) proper support of DisplayData
> on composite transforms is sufficiently near, or at least planned, then we
> should keep existing tests, and add more support and more tests once the
> support lands; but meanwhile not require or encourage new transform authors
> to implement it (but it's fine if they do). How does this sound?
>
>
> >
> > Kenn
> >
> >
> >
> > > [1]
> > > https://lists.apache.org/thread.html/53b33f17d981054ed198af03511969
> > > 07bf147b6acf16160dbf395b62@%3Cdev.beam.apache.org%3E
> > > [2] https://issues.apache.org/jira/browse/BEAM-366
> > >
> >
>

Re: The trouble with DisplayData tests

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
On Fri, Oct 27, 2017 at 12:40 AM Kenneth Knowles <kl...@google.com.invalid>
wrote:

> I see where you are coming from. It is truly a marginal feature of Beam at
> the moment, but really *really* useful in debugging, when a runner takes
> advantage of it. More inline - FWIW it may seem like I'm defending the
> tests, but I have been in the same situation and felt the same way. I'm
> just trying to ground things in specifics.
>
> On Thu, Oct 26, 2017 at 2:24 PM, Eugene Kirpichov <
> kirpichov@google.com.invalid> wrote:
>
> > Hello,
> >
> > DisplayData is currently in an odd position:
> > - Dataflow is the only runner that supports it at all [1]
> > - Dataflow supports it only for primitive transforms, but discards it for
> > composite transforms [2](which is quite problematic for very complex but
> > commonly used composite transforms, such as TextIO.write() and
> > BigQueryIO.write())
> >
>
> As of today, the Java and Python Dataflow runners actually transmit the
> entire original pipeline, which has slots for display data on composites.
> So the future in which this is actually used - at least for Dataflow -
> might be very near.
>
>
> - There are a lot of tests testing display data of composite transforms -
> > these tests are currently testing something that is a no-op in
> production;
> > and these tests often are a maintenance burden when refactoring a
> > transform.
>
>
> Which composites' tests in particular? Are they legitimate tests?
>
Pretty much all of them.
For example:
https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/io/TextIOWriteTest.java#L585

https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-
<https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIOTest.java#L98>
platform/src/test/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIOTest.java#L98
<https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIOTest.java#L98>
 and
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIOTest.java#L222


https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/FlatMapElementsTest.java#L169


and so on. I think nearly all transforms in the SDK are composite, and have
unit tests that test DisplayData of the transform itself, which is
currently moot. Some have tests of "display data of primitive transforms"
which is currently more useful in practice.


>
> What should we do about this? I see two options:
> > - Publish guidance for library authors that Beam is not currently ready
> for
> > supporting HasDisplayData on composite transforms, and you shouldn't
> bother
> > implementing or testing it (though you can implement it on primitive
> > transforms of course - which is basically just Read.from(Source) and
> ParDo)
> > - Say that people should be implementing and testing this feature
> > nevertheless, and promise that Beam runners are going to make it "worth
> it"
> > (not be a no-op) in the foreseeable future.
> >
>
> Option 3: Don't say anyone _should_ do any particular thing, but promise
> that [some] Beam runners are [probably] going to make it "worth it" pretty
> soon. I expect third-party transform authors are already not using this,
> are they?
>
Most aren't, but some are [looking at sdks/java/io]:
https://github.com/apache/beam/blob/master/sdks/java/io/hbase/src/test/java/org/apache/beam/sdk/io/hbase/HBaseIOTest.java#L328

https://github.com/apache/beam/blob/master/sdks/java/io/tika/src/test/java/org/apache/beam/sdk/io/tika/TikaIOTest.java#L128

https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/datastore/DatastoreV1Test.java#L223

I guess if Dataflow's (or another runner's) proper support of DisplayData
on composite transforms is sufficiently near, or at least planned, then we
should keep existing tests, and add more support and more tests once the
support lands; but meanwhile not require or encourage new transform authors
to implement it (but it's fine if they do). How does this sound?


>
> Kenn
>
>
>
> > [1]
> > https://lists.apache.org/thread.html/53b33f17d981054ed198af03511969
> > 07bf147b6acf16160dbf395b62@%3Cdev.beam.apache.org%3E
> > [2] https://issues.apache.org/jira/browse/BEAM-366
> >
>

Re: The trouble with DisplayData tests

Posted by Kenneth Knowles <kl...@google.com.INVALID>.
Hi Pablo,

On Wed, Nov 1, 2017 at 6:22 PM, Pablo Estrada <pa...@google.com.invalid>
wrote:

> >
> > As of today, the Java and Python Dataflow runners actually transmit the
> > entire original pipeline, which has slots for display data on composites.
> > So the future in which this is actually used - at least for Dataflow -
> > might be very near.
> >
>
> I am not 100% sure, but I believe this is not true.
>

You are right that it is not true at this moment, but it was for a bit. It
was implemented for Java in https://github.com/apache/beam/pull/3977 and
Python in https://github.com/apache/beam/pull/4010 but had to be rolled
back due to bugs in the interaction between Beam and the Dataflow service.
It will be rolled forward in another form very soon. After that, we do need
to flesh out the proto serialization of all the display data.

There's no reason that you ought to know, either. This wasn't announced
loudly because the feature is not really fully delivered, but the
intermediate step of getting the full pipeline available - including
composites and their display data - is very close. So I'm glad you think it
will be useful :-)

Kenn


> The workflow graph does not contain composites. It only contains
> primitives, and dataflow uses their names to infer composites (e.g.
> 'Do/DoFirst' and 'Do/DoSecond' are inferred to belong to the same composite
> because they share their first level translform).
> Can someone confirm whether this is the case?
> If so, the workflow graph representation for Dataflow would need to change
> to explicitly include display data for composites.
>
> Nonetheless, display data is very useful, so I'd think it's useful to make
> it available in other runners, and for composites..
>
> Best
> -P.
>
> >
> >
> > - There are a lot of tests testing display data of composite transforms -
> > > these tests are currently testing something that is a no-op in
> > production;
> > > and these tests often are a maintenance burden when refactoring a
> > > transform.
> >
> >
> > Which composites' tests in particular? Are they legitimate tests?
> >
> >
> > What should we do about this? I see two options:
> > > - Publish guidance for library authors that Beam is not currently ready
> > for
> > > supporting HasDisplayData on composite transforms, and you shouldn't
> > bother
> > > implementing or testing it (though you can implement it on primitive
> > > transforms of course - which is basically just Read.from(Source) and
> > ParDo)
> > > - Say that people should be implementing and testing this feature
> > > nevertheless, and promise that Beam runners are going to make it "worth
> > it"
> > > (not be a no-op) in the foreseeable future.
> > >
> >
> > Option 3: Don't say anyone _should_ do any particular thing, but promise
> > that [some] Beam runners are [probably] going to make it "worth it"
> pretty
> > soon. I expect third-party transform authors are already not using this,
> > are they?
> >
> > Kenn
> >
> >
> >
> > > [1]
> > > https://lists.apache.org/thread.html/53b33f17d981054ed198af03511969
> > > 07bf147b6acf16160dbf395b62@%3Cdev.beam.apache.org%3E
> > > [2] https://issues.apache.org/jira/browse/BEAM-366
> > >
> >
>

Re: The trouble with DisplayData tests

Posted by Pablo Estrada <pa...@google.com.INVALID>.
>
> As of today, the Java and Python Dataflow runners actually transmit the
> entire original pipeline, which has slots for display data on composites.
> So the future in which this is actually used - at least for Dataflow -
> might be very near.
>

I am not 100% sure, but I believe this is not true.
The workflow graph does not contain composites. It only contains
primitives, and dataflow uses their names to infer composites (e.g.
'Do/DoFirst' and 'Do/DoSecond' are inferred to belong to the same composite
because they share their first level translform).
Can someone confirm whether this is the case?
If so, the workflow graph representation for Dataflow would need to change
to explicitly include display data for composites.

Nonetheless, display data is very useful, so I'd think it's useful to make
it available in other runners, and for composites..

Best
-P.

>
>
> - There are a lot of tests testing display data of composite transforms -
> > these tests are currently testing something that is a no-op in
> production;
> > and these tests often are a maintenance burden when refactoring a
> > transform.
>
>
> Which composites' tests in particular? Are they legitimate tests?
>
>
> What should we do about this? I see two options:
> > - Publish guidance for library authors that Beam is not currently ready
> for
> > supporting HasDisplayData on composite transforms, and you shouldn't
> bother
> > implementing or testing it (though you can implement it on primitive
> > transforms of course - which is basically just Read.from(Source) and
> ParDo)
> > - Say that people should be implementing and testing this feature
> > nevertheless, and promise that Beam runners are going to make it "worth
> it"
> > (not be a no-op) in the foreseeable future.
> >
>
> Option 3: Don't say anyone _should_ do any particular thing, but promise
> that [some] Beam runners are [probably] going to make it "worth it" pretty
> soon. I expect third-party transform authors are already not using this,
> are they?
>
> Kenn
>
>
>
> > [1]
> > https://lists.apache.org/thread.html/53b33f17d981054ed198af03511969
> > 07bf147b6acf16160dbf395b62@%3Cdev.beam.apache.org%3E
> > [2] https://issues.apache.org/jira/browse/BEAM-366
> >
>

Re: The trouble with DisplayData tests

Posted by Kenneth Knowles <kl...@google.com.INVALID>.
I see where you are coming from. It is truly a marginal feature of Beam at
the moment, but really *really* useful in debugging, when a runner takes
advantage of it. More inline - FWIW it may seem like I'm defending the
tests, but I have been in the same situation and felt the same way. I'm
just trying to ground things in specifics.

On Thu, Oct 26, 2017 at 2:24 PM, Eugene Kirpichov <
kirpichov@google.com.invalid> wrote:

> Hello,
>
> DisplayData is currently in an odd position:
> - Dataflow is the only runner that supports it at all [1]
> - Dataflow supports it only for primitive transforms, but discards it for
> composite transforms [2](which is quite problematic for very complex but
> commonly used composite transforms, such as TextIO.write() and
> BigQueryIO.write())
>

As of today, the Java and Python Dataflow runners actually transmit the
entire original pipeline, which has slots for display data on composites.
So the future in which this is actually used - at least for Dataflow -
might be very near.


- There are a lot of tests testing display data of composite transforms -
> these tests are currently testing something that is a no-op in production;
> and these tests often are a maintenance burden when refactoring a
> transform.


Which composites' tests in particular? Are they legitimate tests?


What should we do about this? I see two options:
> - Publish guidance for library authors that Beam is not currently ready for
> supporting HasDisplayData on composite transforms, and you shouldn't bother
> implementing or testing it (though you can implement it on primitive
> transforms of course - which is basically just Read.from(Source) and ParDo)
> - Say that people should be implementing and testing this feature
> nevertheless, and promise that Beam runners are going to make it "worth it"
> (not be a no-op) in the foreseeable future.
>

Option 3: Don't say anyone _should_ do any particular thing, but promise
that [some] Beam runners are [probably] going to make it "worth it" pretty
soon. I expect third-party transform authors are already not using this,
are they?

Kenn



> [1]
> https://lists.apache.org/thread.html/53b33f17d981054ed198af03511969
> 07bf147b6acf16160dbf395b62@%3Cdev.beam.apache.org%3E
> [2] https://issues.apache.org/jira/browse/BEAM-366
>