Posted to user@beam.apache.org by Mark Kelly <ma...@gmail.com> on 2020/07/14 13:50:06 UTC

WriteToBigQuery - performance issues?

We’re currently developing a streaming Dataflow pipeline using the latest version of the Python Beam SDK. 

The pipeline does a number of transformations/aggregations before attempting to write to BigQuery. We're peaking at ~250 elements/sec going into the WriteToBigQuery step; however, we're seeing very poor performance in the pipeline, needing to scale to a considerable number of workers, and often seeing the entire pipeline 'freeze', with throughput dropping to zero at all stages for ~30 minute periods.

The number of unacked messages keeps growing (so it looks like the pipeline can never catch up). The wall time on the WriteToBigQuery steps is considerably higher than that of the other stages in the pipeline.

If we run another version of the Dataflow job with the WriteToBigQuery step removed, performance is *dramatically* improved: system lag is minimal, and approximately one third of the vCPUs are required to keep on top of the incoming messages.

Are there known limitations with WriteToBigQuery in the Python SDK? We have had our quota raised by Google, so limits on streaming inserts shouldn’t be an issue.
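For scale, a quick back-of-envelope check in plain Python (the peak rate is from the message above; the average row size is an assumption for illustration) suggests the raw streaming-insert volume is small, consistent with quota not being the bottleneck:

```python
# Peak rate reported for the WriteToBigQuery step.
elements_per_sec = 250

# Assumed average row size in bytes; adjust for your actual schema.
row_bytes = 1_000

rows_per_day = elements_per_sec * 60 * 60 * 24
mb_per_sec = elements_per_sec * row_bytes / 1e6

print(rows_per_day)  # 21600000 rows/day
print(mb_per_sec)    # 0.25 MB/s of streaming-insert payload
```

At ~0.25 MB/s of payload, the BigQuery-side insert bandwidth is tiny, so any slowdown is more plausibly in the pipeline itself.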

Re: WriteToBigQuery - performance issues?

Posted by Mark Kelly <ma...@gmail.com>.
Thanks; however, in this case it looks like the issue may be elsewhere.
I’ve switched to SSD, and to instance types with a greater number of vCPUs,
and I’m still seeing the same behaviour:

A burst of throughput at the start, then all CPUs are maxed. Looking at the
instance monitoring, disk I/O looks minimal. There’s a spike at the start,
but after that we’re looking at ~500 KB/sec with the occasional blip.



-- 
Mark Kelly
Sent with Airmail

On 14 July 2020 at 15:28:13, Jeff Klukas (jklukas@mozilla.com) wrote:


Re: WriteToBigQuery - performance issues?

Posted by Jeff Klukas <jk...@mozilla.com>.
In particular, the GCE docs have a nice reference for how I/O throughput
depends on both vCPU count and disk type/size:

https://cloud.google.com/compute/docs/disks/performance#cpu_count_size

That should help you choose which configurations to test.
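The scaling on that page can be sketched numerically. The per-GB figures and the disk size below are assumptions for illustration only (take real, current numbers from the linked docs):

```python
# Assumed sustained write throughput per GB of persistent disk (MB/s per GB);
# illustrative figures only -- consult the linked GCE performance docs.
PD_STANDARD_MBPS_PER_GB = 0.12
PD_SSD_MBPS_PER_GB = 0.48

disk_gb = 400  # assumed Dataflow streaming worker disk size

standard_mbps = disk_gb * PD_STANDARD_MBPS_PER_GB
ssd_mbps = disk_gb * PD_SSD_MBPS_PER_GB

print(standard_mbps)  # ~48 MB/s on standard PD
print(ssd_mbps)       # ~192 MB/s on SSD PD
```

Note that the linked page also caps throughput by vCPU count, so a small worker can bottleneck on I/O regardless of disk size or type.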

On Tue, Jul 14, 2020 at 10:18 AM Mark Kelly <ma...@gmail.com> wrote:


Re: WriteToBigQuery - performance issues?

Posted by Mark Kelly <ma...@gmail.com>.
Having tested both with and without the Streaming Engine option, I’m not
seeing any difference in performance.

As it happens, I’m seeing more underlying gRPC errors when using the
streaming-engine option, so I have avoided it in the last few test runs
(although I’m not sure whether these errors are problematic).

I’ll definitely give the SSD option a shot.

Thanks.

-- 
Mark Kelly
Sent with Airmail

On 14 July 2020 at 15:12:46, Jeff Klukas (jklukas@mozilla.com) wrote:


Re: WriteToBigQuery - performance issues?

Posted by Jeff Klukas <jk...@mozilla.com>.
In my experience with writing to BQ via BigQueryIO in the Java SDK, the
bottleneck tends to be disk I/O. The BigQueryIO logic requires several
shuffles that cause checkpointing even in the case of streaming inserts,
which in the Dataflow case means writing to disk. I assume the Python logic
is similar, but don't know for sure.

If this is the case for you, you may see significantly improved performance
by provisioning SSDs for your workers or by opting to use the Dataflow
streaming engine.
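For reference, both suggestions map to Dataflow pipeline options in the Python SDK. This is a hypothetical invocation (MY_PROJECT, MY_ZONE, and my_pipeline are placeholders; verify the flag names against your Beam SDK version):

```shell
# Option 1: provision SSD persistent disks for the workers.
python -m my_pipeline \
  --runner=DataflowRunner \
  --project=MY_PROJECT \
  --region=us-central1 \
  --streaming \
  --worker_disk_type=compute.googleapis.com/projects/MY_PROJECT/zones/MY_ZONE/diskTypes/pd-ssd \
  --disk_size_gb=400

# Option 2: offload shuffle/checkpoint state to the Dataflow service instead.
python -m my_pipeline \
  --runner=DataflowRunner \
  --project=MY_PROJECT \
  --region=us-central1 \
  --streaming \
  --enable_streaming_engine
```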

On Tue, Jul 14, 2020 at 9:50 AM Mark Kelly <ma...@gmail.com> wrote:
