Posted to issues@beam.apache.org by "Kamil Wasilewski (Jira)" <ji...@apache.org> on 2020/08/20 09:53:00 UTC

[jira] [Created] (BEAM-10774) GBK Python streaming load tests are too slow

Kamil Wasilewski created BEAM-10774:
---------------------------------------

             Summary: GBK Python streaming load tests are too slow
                 Key: BEAM-10774
                 URL: https://issues.apache.org/jira/browse/BEAM-10774
             Project: Beam
          Issue Type: Bug
          Components: testing
            Reporter: Kamil Wasilewski


The following GBK streaming test cases take too long on Dataflow:

 

1) 2GB of 10-byte records

2) 2GB of 100-byte records

4) fanout 4 times with 2GB 10-byte records total

5) fanout 8 times with 2GB 10-byte records total

 

Each of them needs at least 1 hour to execute, which is far too long for a single Jenkins job.

Job definition: [https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_LoadTests_GBK_Python.groovy]

Test pipeline: [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/load_tests/group_by_key_test.py]

These cases are probably too extreme. The first two involve grouping 20M unique keys, which is a stressful operation. One solution might be to rework the cases so that they are less demanding.
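
For a rough sense of scale, the record counts implied by the case descriptions can be estimated as in the sketch below. It assumes that "2GB" means 2 * 10^9 bytes, that the stated record size covers key and value together, and that each fanout branch groups the full input. None of these assumptions, nor the names used, are taken from the test pipeline itself. The figures are record counts, not unique-key counts.

```python
# Rough scale estimate for the listed cases.
# All names here are illustrative; they do not come from the Beam load test code.

GB = 10**9  # assumption: "2GB" means 2 * 10^9 bytes


def record_count(total_bytes: int, record_bytes: int) -> int:
    """Number of fixed-size records that fit in total_bytes."""
    return total_bytes // record_bytes


# (description, record size in bytes, fanout)
CASES = [
    ("2GB of 10-byte records", 10, 1),
    ("2GB of 100-byte records", 100, 1),
    ("fanout 4, 2GB of 10-byte records", 10, 4),
    ("fanout 8, 2GB of 10-byte records", 10, 8),
]


def summarize(cases=CASES, total_bytes=2 * GB):
    """Return (description, input records, records grouped across branches)."""
    rows = []
    for name, size, fanout in cases:
        n = record_count(total_bytes, size)
        # assumption: every fanout branch runs its own GBK over the full input
        rows.append((name, n, n * fanout))
    return rows


if __name__ == "__main__":
    for name, n, grouped in summarize():
        print(f"{name}: {n:,} input records, {grouped:,} grouped in total")
```

Under these assumptions, even the smallest case pushes tens of millions of records through a streaming GroupByKey, which is in line with the long runtimes reported above.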

Both the current production Dataflow runner and the new Dataflow Runner V2 were tested.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)