Posted to user@beam.apache.org by David Sánchez <da...@gmail.com> on 2021/03/24 14:51:58 UTC

Dataflow v2 runner scaling behaviour

Hi folks!

I'm testing the Dataflow v2 runner in a batch pipeline (Apache Beam Python 3.7 SDK 2.27.0) that reads many millions of rows from BigQuery and writes to PubSub and BigQuery, using the flag "--experiments=use_runner_v2".

The same job used to scale up immediately to over 50 workers, but in v2 it never scales beyond 5-6 workers, so it's much slower. I can see, however, that the total vCPU and memory usage is about half of what it was before, which is promising. Any clue as to why the scaling behaves differently?
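
For readers trying to reproduce the comparison, here is a minimal sketch of how the two runs might be launched with identical settings apart from the v2 flag. The helper name `build_dataflow_args` and the project, region and bucket values are hypothetical placeholders, not taken from the original job. `--max_num_workers` is a standard Dataflow option, but the thread does not say whether it was set.

```python
from typing import List, Optional


def build_dataflow_args(
    project: str,
    region: str,
    temp_location: str,
    use_runner_v2: bool = True,
    max_num_workers: Optional[int] = None,
) -> List[str]:
    """Build the command-line flags for a Dataflow batch job.

    Hypothetical helper for illustration only: it just assembles strings,
    so it runs without Apache Beam installed.
    """
    args = [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
    ]
    if use_runner_v2:
        # The flag quoted in the message above.
        args.append("--experiments=use_runner_v2")
    if max_num_workers is not None:
        # Assumption: an explicit cap, set only so both runs share the same limit.
        args.append(f"--max_num_workers={max_num_workers}")
    return args


# Placeholder values; substitute your own project, region and bucket.
v1_args = build_dataflow_args(
    "my-project", "us-central1", "gs://my-bucket/tmp",
    use_runner_v2=False, max_num_workers=50,
)
v2_args = build_dataflow_args(
    "my-project", "us-central1", "gs://my-bucket/tmp",
    use_runner_v2=True, max_num_workers=50,
)
```

If Beam is installed, either list can be passed to `PipelineOptions(args)` from `apache_beam.options.pipeline_options`. Pinning the same worker cap on both runs only rules out a differing limit as the cause; nothing in this thread confirms that a cap explains the difference.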

Many thanks

Re: Dataflow v2 runner scaling behaviour

Posted by David Sánchez <da...@gmail.com>.

Hi Pablo,

Did you find out anything? Any suggestion we can try?

Many thanks,
David

Re: Dataflow v2 runner scaling behaviour

Posted by David Sánchez <da...@gmail.com>.

Hi Pablo,

This is the input data we are testing

Elements added: 38,792,932
Estimated size: 3.14 GB
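
For a sense of scale, the two figures above imply a small average element size. The sketch below is only back-of-the-envelope arithmetic and assumes "GB" means 10^9 bytes, which the thread does not specify.

```python
# Figures reported in the message above.
elements = 38_792_932
estimated_size_gb = 3.14

# Assumption: decimal gigabytes (1 GB = 10**9 bytes).
BYTES_PER_GB = 10**9

avg_bytes_per_element = estimated_size_gb * BYTES_PER_GB / elements
print(f"Average element size: about {avg_bytes_per_element:.0f} bytes")
```

Under that assumption each element averages roughly 81 bytes, so the input consists of many small rows rather than a few large ones.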

Re: Dataflow v2 runner scaling behaviour

Posted by Pablo Estrada <pa...@google.com>.

Hi David,
Thanks for sharing. I've been investigating something like this recently. What's
the size of your data?
Best
-P.
