You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@beam.apache.org by "Etienne Chauchot (JIRA)" <ji...@apache.org> on 2018/01/11 09:18:00 UTC

[jira] [Commented] (BEAM-3323) Create a generator of finite-but-unbounded PCollection's for integration testing

    [ https://issues.apache.org/jira/browse/BEAM-3323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321913#comment-16321913 ] 

Etienne Chauchot commented on BEAM-3323:
----------------------------------------

big +1 to that.
Some related comments:
- some runners like flink use bounded collections but still trigger streaming mode with a streaming=true pipelineOption
- others like spark have coded an equivalent of direct runner's Teststream (which is called CreateStream) that generate unbounded PCollections for tests in streaming mode.
So having a portable (between runners) generator of unbounded PCollections will allow us to have common (between runners) validatesRunner streaming tests. 

> Create a generator of finite-but-unbounded PCollection's for integration testing
> --------------------------------------------------------------------------------
>
>                 Key: BEAM-3323
>                 URL: https://issues.apache.org/jira/browse/BEAM-3323
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-core
>            Reporter: Eugene Kirpichov
>            Assignee: Kenneth Knowles
>
> Several IOs have features that exhibit nontrivial behavior when writing unbounded PCollection's - e.g. WriteFiles with windowed writes; BigQueryIO. We need to be able to write integration tests for these features.
> Currently we have two ways to generate an unbounded PCollection without reading from a real-world external streaming system such as pubsub or kafka:
> 1) TestStream, which only works in direct runner - sufficient for some tests but not all: definitely not sufficient for large-scale tests or for tests that need to interact with a real instance of the external system (e.g. BigQueryIO). It is also quite verbose to use.
> 2) GenerateSequence.from(0) without a .to(), which returns an infinite amount of data.
> GenerateSequence.from(a).to(b) returns a finite amount of data, but returns it as a bounded PCollection, and doesn't report the watermark.
> I think the right thing to do here, for now, is to make GenerateSequence.from(a).to(b) have an option (e.g. ".asUnbounded()", where it will return an unbounded PCollection, go through UnboundedSource (or potentially via SDF in runners that support it), and track the watermark properly (or via a configurable watermark fn).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)