Posted to issues@beam.apache.org by "Kenneth Knowles (Jira)" <ji...@apache.org> on 2021/04/03 02:26:00 UTC

[jira] [Updated] (BEAM-10774) GBK Python streaming load tests are too slow

     [ https://issues.apache.org/jira/browse/BEAM-10774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kenneth Knowles updated BEAM-10774:
-----------------------------------
    Component/s: sdk-py-core

> GBK Python streaming load tests are too slow
> --------------------------------------------
>
>                 Key: BEAM-10774
>                 URL: https://issues.apache.org/jira/browse/BEAM-10774
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core, testing
>            Reporter: Kamil Wasilewski
>            Priority: P3
>
> The following GBK streaming test cases take too long on Dataflow:
>  
> 1) 2GB of 10-byte records
> 2) 2GB of 100-byte records
> 4) fanout 4 times with 2GB 10-byte records total
> 5) fanout 8 times with 2GB 10-byte records total
>  
> Each of them needs at least 1 hour to execute, which is far too long for a single Jenkins job.
> Job definition: [https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_LoadTests_GBK_Python.groovy]
> Test pipeline: [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/load_tests/group_by_key_test.py]
> These cases are probably too extreme. The first two involve grouping 20M unique keys, which is an expensive operation. One possible fix is to rework the cases so that they are less demanding.
> Both the current production Dataflow runner and the new Dataflow Runner V2 were tested.
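
To give a rough sense of scale for the first two cases, the record counts implied by the stated sizes can be worked out as below. This is only a back-of-the-envelope sketch: it assumes "2GB" means 2 * 10**9 bytes (the issue does not say whether GB or GiB is meant), and `record_count` is a hypothetical helper, not part of the Beam load test code. It counts records only; the number of unique keys depends on how the synthetic source is configured and is not derived here.

```python
# Assumption: "2GB" is taken as 2 * 10**9 bytes; the issue does not specify.
TOTAL_BYTES = 2 * 10**9


def record_count(total_bytes: int, record_size_bytes: int) -> int:
    """Return how many fixed-size records fit in total_bytes."""
    return total_bytes // record_size_bytes


# Record sizes taken from test cases 1 and 2 in the issue description.
for size in (10, 100):
    print(f"{size}-byte records: {record_count(TOTAL_BYTES, size):,}")
```

Under that assumption, the 10-byte case has ten times as many records to shuffle as the 100-byte case for the same total volume.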



--
This message was sent by Atlassian Jira
(v8.3.4#803005)