You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@beam.apache.org by Carlos Alonso <ca...@mrcalonso.com> on 2018/05/29 20:05:04 UTC

Testing an updating side input on global window

Hi all!!

Basically that's what I'm trying to do. I'm building a pipeline that has a
refreshing, multimap, side input (BQ schemas) that then I apply to the main
stream of data (records that are ultimately saved to the corresponding BQ
table).

My job, although being of streaming nature, runs on the global window, and
I want to unit test that the side input refreshes and that the updates are
successfully applied.

I'm using scio and I can't seem to simulate that refreshing behaviour.
These are the relevant bits of the code:
https://gist.github.com/calonso/87d392a65079a66db75a78cb8d80ea98

The way I see understand it, the side collection is refreshed before
accessing it so when accessed, it already contains the final (updated)
snapshot of the schemas, is that true? In which case, how can I simulate
that synchronisation? I'm using processing times as I thought that could be
the way to go, but obviously something is wrong there.

Many thanks!!

Re: Testing an updating side input on global window

Posted by Kenneth Knowles <kl...@google.com>.
After the first value for a triggered side input, that side input is always
"ready" for the particular window, so there is no longer any
synchronization with the main input. You probably want to have tests with
the updated and the old value. I would expect that your steps 1-4 would
work with TestStream to do just what you want, actually. Then you should
also test with steps 1&3 merged.

Kenn

On Tue, May 29, 2018 at 2:44 PM Pablo Estrada <pa...@google.com> wrote:

> As far as I know, that behavior is not specified. I do not think that
> Dataflow streaming supports this sort of updating to side inputs, though
> I've added Slava who might have more to add.
>
> If updating side inputs is really not supported in Dataflow, you may be
> able to use a LoadingCache, like so:
> https://lists.apache.org/thread.html/%3CB1660EAB-AEC8-4635-8386-8353685DB19A@gameduell.de%3E
>
> Best
> -P.
>
> On Tue, May 29, 2018 at 2:36 PM Carlos Alonso <ca...@mrcalonso.com>
> wrote:
>
>> Hi Lukasz, many thanks for your responses.
>>
>> I'm actually using them but I think I'm not being able to synchronise the
>> following steps:
>> 1: The side input gets its first value (v1)
>> 2: The main stream gets that side input applied and finds that v1 value
>> 3: The side one gets updated (v2)
>> 4: The main stream gets the side input applied again and finds the v2
>> value (along with v1 as this is multimap)
>>
>> Regards
>>
>> On Tue, May 29, 2018 at 10:57 PM Lukasz Cwik <lc...@google.com> wrote:
>>
>>> Your best bet is to use TestStreams[1] as it is used to validate
>>> window/triggering behavior. Note that the transform requires special runner
>>> based execution and currently only works with the DirectRunner. All
>>> examples are marked with the JUnit category "UsesTestStream", for example
>>> [2].
>>>
>>> 1:
>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/TestStream.java
>>> 2:
>>> https://github.com/apache/beam/blob/0cbcf4ad1db7d820c5476d636f3a3d69062021a5/sdks/java/core/src/test/java/org/apache/beam/sdk/testing/TestStreamTest.java#L69
>>>
>>>
>>> On Tue, May 29, 2018 at 1:05 PM Carlos Alonso <ca...@mrcalonso.com>
>>> wrote:
>>>
>>>> Hi all!!
>>>>
>>>> Basically that's what I'm trying to do. I'm building a pipeline that
>>>> has a refreshing, multimap, side input (BQ schemas) that then I apply to
>>>> the main stream of data (records that are ultimately saved to the
>>>> corresponding BQ table).
>>>>
>>>> My job, although being of streaming nature, runs on the global window,
>>>> and I want to unit test that the side input refreshes and that the updates
>>>> are successfully applied.
>>>>
>>>> I'm using scio and I can't seem to simulate that refreshing behaviour.
>>>> These are the relevant bits of the code:
>>>> https://gist.github.com/calonso/87d392a65079a66db75a78cb8d80ea98
>>>>
>>>> The way I see understand it, the side collection is refreshed before
>>>> accessing it so when accessed, it already contains the final (updated)
>>>> snapshot of the schemas, is that true? In which case, how can I simulate
>>>> that synchronisation? I'm using processing times as I thought that could be
>>>> the way to go, but obviously something is wrong there.
>>>>
>>>> Many thanks!!
>>>>
>>> --
> Got feedback? go/pabloem-feedback
> <https://goto.google.com/pabloem-feedback>
>

Re: Testing an updating side input on global window

Posted by Pablo Estrada <pa...@google.com>.
As far as I know, that behavior is not specified. I do not think that
Dataflow streaming supports this sort of updating to side inputs, though
I've added Slava who might have more to add.

If updating side inputs is really not supported in Dataflow, you may be
able to use a LoadingCache, like so:
https://lists.apache.org/thread.html/%3CB1660EAB-AEC8-4635-8386-8353685DB19A@gameduell.de%3E

Best
-P.

On Tue, May 29, 2018 at 2:36 PM Carlos Alonso <ca...@mrcalonso.com> wrote:

> Hi Lukasz, many thanks for your responses.
>
> I'm actually using them but I think I'm not being able to synchronise the
> following steps:
> 1: The side input gets its first value (v1)
> 2: The main stream gets that side input applied and finds that v1 value
> 3: The side one gets updated (v2)
> 4: The main stream gets the side input applied again and finds the v2
> value (along with v1 as this is multimap)
>
> Regards
>
> On Tue, May 29, 2018 at 10:57 PM Lukasz Cwik <lc...@google.com> wrote:
>
>> Your best bet is to use TestStreams[1] as it is used to validate
>> window/triggering behavior. Note that the transform requires special runner
>> based execution and currently only works with the DirectRunner. All
>> examples are marked with the JUnit category "UsesTestStream", for example
>> [2].
>>
>> 1:
>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/TestStream.java
>> 2:
>> https://github.com/apache/beam/blob/0cbcf4ad1db7d820c5476d636f3a3d69062021a5/sdks/java/core/src/test/java/org/apache/beam/sdk/testing/TestStreamTest.java#L69
>>
>>
>> On Tue, May 29, 2018 at 1:05 PM Carlos Alonso <ca...@mrcalonso.com>
>> wrote:
>>
>>> Hi all!!
>>>
>>> Basically that's what I'm trying to do. I'm building a pipeline that has
>>> a refreshing, multimap, side input (BQ schemas) that then I apply to the
>>> main stream of data (records that are ultimately saved to the corresponding
>>> BQ table).
>>>
>>> My job, although being of streaming nature, runs on the global window,
>>> and I want to unit test that the side input refreshes and that the updates
>>> are successfully applied.
>>>
>>> I'm using scio and I can't seem to simulate that refreshing behaviour.
>>> These are the relevant bits of the code:
>>> https://gist.github.com/calonso/87d392a65079a66db75a78cb8d80ea98
>>>
>>> The way I see understand it, the side collection is refreshed before
>>> accessing it so when accessed, it already contains the final (updated)
>>> snapshot of the schemas, is that true? In which case, how can I simulate
>>> that synchronisation? I'm using processing times as I thought that could be
>>> the way to go, but obviously something is wrong there.
>>>
>>> Many thanks!!
>>>
>> --
Got feedback? go/pabloem-feedback

Re: Testing an updating side input on global window

Posted by Carlos Alonso <ca...@mrcalonso.com>.
Hi Lukasz, many thanks for your responses.

I'm actually using them but I think I'm not being able to synchronise the
following steps:
1: The side input gets its first value (v1)
2: The main stream gets that side input applied and finds that v1 value
3: The side one gets updated (v2)
4: The main stream gets the side input applied again and finds the v2 value
(along with v1 as this is multimap)

Regards

On Tue, May 29, 2018 at 10:57 PM Lukasz Cwik <lc...@google.com> wrote:

> Your best bet is to use TestStreams[1] as it is used to validate
> window/triggering behavior. Note that the transform requires special runner
> based execution and currently only works with the DirectRunner. All
> examples are marked with the JUnit category "UsesTestStream", for example
> [2].
>
> 1:
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/TestStream.java
> 2:
> https://github.com/apache/beam/blob/0cbcf4ad1db7d820c5476d636f3a3d69062021a5/sdks/java/core/src/test/java/org/apache/beam/sdk/testing/TestStreamTest.java#L69
>
>
> On Tue, May 29, 2018 at 1:05 PM Carlos Alonso <ca...@mrcalonso.com>
> wrote:
>
>> Hi all!!
>>
>> Basically that's what I'm trying to do. I'm building a pipeline that has
>> a refreshing, multimap, side input (BQ schemas) that then I apply to the
>> main stream of data (records that are ultimately saved to the corresponding
>> BQ table).
>>
>> My job, although being of streaming nature, runs on the global window,
>> and I want to unit test that the side input refreshes and that the updates
>> are successfully applied.
>>
>> I'm using scio and I can't seem to simulate that refreshing behaviour.
>> These are the relevant bits of the code:
>> https://gist.github.com/calonso/87d392a65079a66db75a78cb8d80ea98
>>
>> The way I see understand it, the side collection is refreshed before
>> accessing it so when accessed, it already contains the final (updated)
>> snapshot of the schemas, is that true? In which case, how can I simulate
>> that synchronisation? I'm using processing times as I thought that could be
>> the way to go, but obviously something is wrong there.
>>
>> Many thanks!!
>>
>

Re: Testing an updating side input on global window

Posted by Lukasz Cwik <lc...@google.com>.
Your best bet is to use TestStreams[1] as it is used to validate
window/triggering behavior. Note that the transform requires special runner
based execution and currently only works with the DirectRunner. All
examples are marked with the JUnit category "UsesTestStream", for example
[2].

1:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/TestStream.java
2:
https://github.com/apache/beam/blob/0cbcf4ad1db7d820c5476d636f3a3d69062021a5/sdks/java/core/src/test/java/org/apache/beam/sdk/testing/TestStreamTest.java#L69


On Tue, May 29, 2018 at 1:05 PM Carlos Alonso <ca...@mrcalonso.com> wrote:

> Hi all!!
>
> Basically that's what I'm trying to do. I'm building a pipeline that has a
> refreshing, multimap, side input (BQ schemas) that then I apply to the main
> stream of data (records that are ultimately saved to the corresponding BQ
> table).
>
> My job, although being of streaming nature, runs on the global window, and
> I want to unit test that the side input refreshes and that the updates are
> successfully applied.
>
> I'm using scio and I can't seem to simulate that refreshing behaviour.
> These are the relevant bits of the code:
> https://gist.github.com/calonso/87d392a65079a66db75a78cb8d80ea98
>
> The way I see understand it, the side collection is refreshed before
> accessing it so when accessed, it already contains the final (updated)
> snapshot of the schemas, is that true? In which case, how can I simulate
> that synchronisation? I'm using processing times as I thought that could be
> the way to go, but obviously something is wrong there.
>
> Many thanks!!
>