Posted to user@beam.apache.org by Tudor Plugaru <tu...@gorgias.com> on 2021/12/01 09:06:58 UTC

Unit testing stateful DoFn

Hi,
What is the best approach to unit testing a stateful DoFn? I've looked over
userstate_test.py in the Beam repo, but those examples do not really apply
to our case. In those tests, the DoFns under test return values from timer
callbacks, which does not really happen in practice.
I am more interested in testing whether a timer fired after the watermark
advanced, or what the state bag contains at a specific time.

It would also be really nice to have some documentation on testing and on
best practices for writing unit/integration tests for Beam pipelines.

Thanks,
Tudor

Re: Unit testing stateful DoFn

Posted by Tudor Plugaru <tu...@gorgias.com>.
Hi,
Hmm, OK, I will try this approach then.
Thanks for the suggestion.
Tudor

On Wed, Dec 1, 2021 at 7:47 PM Luke Cwik <lc...@google.com> wrote:

> The purpose of pipeline/transform-level testing is to verify outputs based
> on inputs, not to check the internal state of the transform(s).
>
> For the example that you linked, it would make sense to create a test with
> inputs that cause the timer to fire and clear state, followed by more
> inputs that produce output which would only be correct if the state had
> been cleared by the timer.
>
> For the timer value scenario, create the inputs that cause the specific
> scenario to happen, then add more inputs, chosen based on what makes
> setting the timer unique, so that the output produced would only be
> correct if the timer had that specific value.

Re: Unit testing stateful DoFn

Posted by Luke Cwik <lc...@google.com>.
The purpose of pipeline/transform-level testing is to verify outputs based
on inputs, not to check the internal state of the transform(s).

For the example that you linked, it would make sense to create a test with
inputs that cause the timer to fire and clear state, followed by more
inputs that produce output which would only be correct if the state had
been cleared by the timer.

For the timer value scenario, create the inputs that cause the specific
scenario to happen, then add more inputs, chosen based on what makes
setting the timer unique, so that the output produced would only be
correct if the timer had that specific value.
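
To make the two strategies above concrete, here is a minimal sketch in plain
Python. It is a toy model only and deliberately does not use the Beam API:
ToyBufferFn, run, the ttl parameter and the event tuples are all hypothetical
names invented for illustration. The toy buffers values per key, emits the
buffer size for each element, and sets an event-time timer ttl after the
first buffered element; the timer callback clears the buffer and emits
nothing. The tests never read the buffer or the timer. They only pick inputs
whose outputs differ depending on whether the timer fired, and when.

```python
from collections import defaultdict


class ToyBufferFn:
    """Toy stand-in for a keyed stateful DoFn. Plain Python, NOT the Beam API."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._buffers = defaultdict(list)  # per-key "bag state"; tests never read it
        self._timers = {}                  # key -> event-time firing timestamp

    def process(self, key, value, timestamp):
        if not self._buffers[key]:
            # The timer is set only when a key starts a fresh buffer.
            self._timers[key] = timestamp + self.ttl
        self._buffers[key].append(value)
        return key, len(self._buffers[key])

    def advance_watermark_to(self, watermark):
        # Fire every timer whose timestamp the watermark has reached. The
        # callback clears state and emits nothing, as in the real use case.
        for key, fire_at in list(self._timers.items()):
            if fire_at <= watermark:
                del self._timers[key]
                self._buffers[key].clear()


def run(events, ttl=10):
    """Feed ('element', key, value, ts) and ('watermark', ts) events in order
    and return only the emitted outputs."""
    fn = ToyBufferFn(ttl)
    outputs = []
    for event in events:
        if event[0] == 'watermark':
            fn.advance_watermark_to(event[1])
        else:
            _, key, value, ts = event
            outputs.append(fn.process(key, value, ts))
    return outputs


# Scenario 1: did the timer fire and clear state?
# The last count is 1 only if the buffer was cleared; otherwise it would be 3.
cleared = run([
    ('element', 'k', 'a', 1),
    ('element', 'k', 'b', 2),
    ('watermark', 20),
    ('element', 'k', 'c', 21),
])
assert cleared == [('k', 1), ('k', 2), ('k', 1)]

# Scenario 2: what value was the timer set to? Probe on both sides of the
# expected firing time (first element at ts=1, ttl=10, so firing time 11).
before = run([('element', 'k', 'a', 1), ('watermark', 10), ('element', 'k', 'b', 12)])
after = run([('element', 'k', 'a', 1), ('watermark', 11), ('element', 'k', 'b', 12)])
assert before == [('k', 1), ('k', 2)]  # timer has not fired at watermark 10
assert after == [('k', 1), ('k', 1)]   # timer has fired at watermark 11
```

In an actual Beam test the same event sequences would presumably be expressed
with TestStream (elements plus watermark advances) and the outputs checked at
the pipeline level, but the reasoning about which inputs to choose is the
same.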

On Wed, Dec 1, 2021 at 9:32 AM Tudor Plugaru <tu...@gorgias.com> wrote:

> I know about TestStream and I am using it, but, for example, I want to
> test that the timer callback is called once the watermark passes the time
> set in the timer. In this test [1], for example, I want to be able to have
> something like assert bag_state == None at the end of the test. Is this
> possible? Most of the tests in that module return specific values from
> timer callbacks and then assert that those values are returned, but in a
> real use case you don't necessarily return values from timer callbacks.
>
> Another use case is when the timer is set only in specific scenarios; how
> can I test what the timer value is?
>
> I hope what I am describing makes sense.
>
> [1]
> https://github.com/apache/beam/blob/8e217ea0d1f383ef5033ef507b14d01edf9c67e6/sdks/python/apache_beam/transforms/userstate_test.py#L487

Re: Unit testing stateful DoFn

Posted by Tudor Plugaru <tu...@gorgias.com>.
I know about TestStream and I am using it, but, for example, I want to test
that the timer callback is called once the watermark passes the time set in
the timer. In this test [1], for example, I want to be able to have
something like assert bag_state == None at the end of the test. Is this
possible? Most of the tests in that module return specific values from
timer callbacks and then assert that those values are returned, but in a
real use case you don't necessarily return values from timer callbacks.

Another use case is when the timer is set only in specific scenarios; how
can I test what the timer value is?

I hope what I am describing makes sense.

[1]
https://github.com/apache/beam/blob/8e217ea0d1f383ef5033ef507b14d01edf9c67e6/sdks/python/apache_beam/transforms/userstate_test.py#L487

On Wed, Dec 1, 2021 at 7:21 PM Luke Cwik <lc...@google.com> wrote:

> That should have been "TestStream [2, 3, 4]"

Re: Unit testing stateful DoFn

Posted by Luke Cwik <lc...@google.com>.
That should have been "TestStream [2, 3, 4]"

On Wed, Dec 1, 2021 at 9:20 AM Luke Cwik <lc...@google.com> wrote:

> The Apache Beam documentation [1] has some good information about
> testing, in particular on how you want to test the transforms/pipeline
> rather than the DoFn itself.
>
> For your use case, TestStream [1, 2, 3] is your best bet, combined with
> the above advice about transform/pipeline-level testing. TestStream
> simulates ingestion of data and lets you control watermark and
> processing-time advancement.
>
> 1: https://beam.apache.org/documentation/pipelines/test-your-pipeline/
> 2: https://beam.apache.org/blog/test-stream/
> 3:
> https://medium.com/@asitkovets/testing-in-apache-beam-part-2-stream-2a9950ba2bc7
> 4:
> https://github.com/apache/beam/blob/8e217ea0d1f383ef5033ef507b14d01edf9c67e6/sdks/python/apache_beam/transforms/deduplicate_test.py#L109

Re: Unit testing stateful DoFn

Posted by Luke Cwik <lc...@google.com>.
The Apache Beam documentation [1] has some good information about testing,
in particular on how you want to test the transforms/pipeline rather than
the DoFn itself.

For your use case, TestStream [1, 2, 3] is your best bet, combined with the
above advice about transform/pipeline-level testing. TestStream simulates
ingestion of data and lets you control watermark and processing-time
advancement.

1: https://beam.apache.org/documentation/pipelines/test-your-pipeline/
2: https://beam.apache.org/blog/test-stream/
3:
https://medium.com/@asitkovets/testing-in-apache-beam-part-2-stream-2a9950ba2bc7
4:
https://github.com/apache/beam/blob/8e217ea0d1f383ef5033ef507b14d01edf9c67e6/sdks/python/apache_beam/transforms/deduplicate_test.py#L109


On Wed, Dec 1, 2021 at 1:07 AM Tudor Plugaru <tu...@gorgias.com> wrote:

> Hi,
> What is the best approach to unit testing a stateful DoFn? I've looked
> over userstate_test.py in the Beam repo, but those examples do not really
> apply to our case. In those tests, the DoFns under test return values from
> timer callbacks, which does not really happen in practice.
> I am more interested in testing whether a timer fired after the watermark
> advanced, or what the state bag contains at a specific time.
>
> It would also be really nice to have some documentation on testing and on
> best practices for writing unit/integration tests for Beam pipelines.
>
> Thanks,
> Tudor
>