Posted to dev@beam.apache.org by Michał Walenia <mi...@polidea.com> on 2019/10/08 13:42:25 UTC

Resizing Beam IOITs

Hi all,
I'm working on resizing IO integration tests in Beam and I'd like to ask
for the community's opinion.

Right now each IO integration test has a set of four predetermined sizes
(1000, 100k, 1M and 100M elements).
For every size there is a precalculated hash used to check read correctness
(a sketch of this pattern follows below).
As it is now, measuring throughput in an IOIT is very costly - accessing
memory for each PCollection element increases the test's runtime many times
over, which distorts the runtime measurements.
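
For context, the read correctness check follows roughly this pattern - a
minimal sketch, assuming the HashingFn combiner from Beam's
sdks/java/io/common test utilities; the file path and expected hash are
placeholders:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.common.HashingFn;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.testing.PAssert;
    import org.apache.beam.sdk.transforms.Combine;
    import org.apache.beam.sdk.values.PCollection;

    public class ReadCorrectnessSketch {
      // Placeholder for the precalculated hash of the chosen dataset size.
      private static final String EXPECTED_HASH = "<precalculated hash>";

      public static void main(String[] args) {
        Pipeline pipeline =
            Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Read back the dataset written by the write half of the test.
        PCollection<String> lines =
            pipeline.apply("Read test data", TextIO.read().from("/tmp/ioit-data*"));

        // Combine all elements into a single order-independent hash and
        // assert it equals the precalculated value for this size.
        PCollection<String> hash =
            lines.apply("Hash contents",
                Combine.globally(new HashingFn()).withoutDefaults());
        PAssert.thatSingleton(hash).isEqualTo(EXPECTED_HASH);

        pipeline.run().waitUntilFinish();
      }
    }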

My proposed improvements change the test sizes, add dataset size reporting
to the test metrics (so that throughput can be calculated at the dashboard
level) and change the way test parameters are passed.
The changes are in a PR here <https://github.com/apache/beam/pull/9638>.
Tests were resized to about 1GB each.
Test configurations would be selected by a single string parameter in
pipeline options (e.g. "testConfigName=XML_1GB" instead of
"numberOfRecords=1000000") - a sketch of this idea follows below.

What do you think about this approach in general? Do you think that 1GB
test datasets are reasonable?
Thanks,

Michal

-- 

Michał Walenia
Polidea <https://www.polidea.com/> | Software Engineer

M: +48 791 432 002 <+48791432002>
E: michal.walenia@polidea.com

Unique Tech
Check out our projects! <https://www.polidea.com/our-work>

Re: Resizing Beam IOITs

Posted by Michał Walenia <mi...@polidea.com>.
Hi,
Thanks for the suggestion. I think it's reasonable to include a small
configuration for fast testing. I'll add such a config to the PR.

Have a good day,
Michal

Re: Resizing Beam IOITs

Posted by Chamikara Jayalath <ch...@google.com>.

Thanks, Michal. I think these tests currently fulfil two purposes:
(1) As end-to-end integration tests that confirm that connectors work with
a given runner.
(2) As large-scale performance tests for tracking performance and
triggering alerts.

It might be good to separate these two cases and run two integration
tests for each connector. For example:
(1) A version with a small input (say, 1KB - 1MB) that we run often,
potentially with every run of the post-commit test suite.
(2) A version with a large input (say, 10-100 GB, depending on the
connector) that is used for performance tracking and triggering alerts.
This version should be run less frequently (for example, once a day).
A sketch of how such paired configurations could look follows below.
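
For illustration (the names and record counts below are hypothetical, not
taken from the PR), both tiers could live in the named-configuration
catalog proposed earlier in the thread, one enum entry per connector and
size:

    // Hypothetical catalog pairing a small post-commit configuration
    // with a large daily performance configuration for one connector.
    public enum TestConfiguration {
      XML_1MB(10_000, "<hash for the 1MB dataset>"),     // every post-commit
      XML_1GB(10_000_000, "<hash for the 1GB dataset>"); // once a day

      public final long numberOfRecords;
      public final String expectedHash;

      TestConfiguration(long numberOfRecords, String expectedHash) {
        this.numberOfRecords = numberOfRecords;
        this.expectedHash = expectedHash;
      }
    }

    // Usage: TestConfiguration.valueOf(options.getTestConfigName())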

WDYT?

Thanks,
Cham

