Posted to user@beam.apache.org by Jeff Klukas <jk...@mozilla.com> on 2019/01/22 20:28:22 UTC

Tuning BigQueryIO.Write

I'm attempting to deploy a fairly simple job on the Dataflow runner that
reads from PubSub and writes to BigQuery using file loads, but I have so
far not been able to tune it to keep up with the incoming data rate.

I have configured BigQueryIO.write to trigger loads every 5 minutes, and
I've let the job autoscale up to a max of 60 workers (which it has done).
I'm using dynamic destinations to hit 2 field-partitioned tables. Incoming
data per table is ~10k events/second, so every 5 minutes each table should
be ingesting on the order of 200k records of ~20 kB apiece.

We don't get many knobs to turn in BigQueryIO. I have tested numShards
between 10 and 1000, but haven't seen obvious differences in performance.
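
For concreteness, the write step looks roughly like the sketch below (the
project, dataset, table names, and the "doc_type" routing field are
placeholders rather than our actual code):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.ValueInSingleWindow;
import org.joda.time.Duration;

class BigQuerySink {
  static void addWriteStep(PCollection<TableRow> rows) {
    rows.apply("WriteToBigQuery", BigQueryIO.writeTableRows()
        // Batch loads from GCS rather than streaming inserts.
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        // Kick off one round of load jobs per destination every 5 minutes.
        .withTriggeringFrequency(Duration.standardMinutes(5))
        // The numShards knob mentioned above; required whenever a
        // triggering frequency is set on an unbounded input.
        .withNumFileShards(100)
        // Dynamic destinations: route each record to one of the two
        // pre-created, field-partitioned tables ("doc_type" and the
        // table names are made up for illustration).
        .to((ValueInSingleWindow<TableRow> input) -> {
          String table = "event".equals(input.getValue().get("doc_type"))
              ? "events_v1" : "errors_v1";
          return new TableDestination("my-project:my_dataset." + table, null);
        })
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
  }
}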

Potentially relevant: I see a high rate of warnings on the shuffler. They
consist mostly of LevelDB warnings about "Too many L0 files". There are
occasionally some other warnings relating to memory as well. Would using
larger workers potentially help?

Does anybody have experience with tuning BigQueryIO writing? It's quite a
complicated transform under the hood and it looks like there are several
steps of grouping and shuffling data that could be limiting throughput.

Re: Tuning BigQueryIO.Write

Posted by Jeff Klukas <jk...@mozilla.com>.
I opened a support case with Google and they helped me get the pipeline to
a state where it's able to keep up with the 10k events/second input. The
job was bottlenecked on disk I/O, so we switched the workers to use SSD and
added enough disk capacity to max out the potential disk throughput per
worker.
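
For anyone who hits the same wall: the change amounts to Dataflow
worker-pool options rather than anything on BigQueryIO itself. A minimal
sketch of the kind of settings involved follows; the disk size and the
autoscaling cap below are illustrative, not the exact values we settled on:

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

class WorkerDiskOptions {
  static DataflowPipelineOptions fromArgs(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);
    // SSD persistent disks for the workers; Dataflow fills in the empty
    // project/zone segments of the disk type resource name.
    options.setWorkerDiskType(
        "compute.googleapis.com/projects//zones//diskTypes/pd-ssd");
    // Enough disk per worker to reach the per-worker PD throughput cap
    // (the size here is illustrative).
    options.setDiskSizeGb(250);
    // Autoscaling cap, as in the original job.
    options.setMaxNumWorkers(60);
    return options;
  }
}

The same settings can also be passed on the command line as
--workerDiskType, --diskSizeGb, and --maxNumWorkers.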

On Wed, Jan 30, 2019 at 3:03 AM Kaymak, Tobias <to...@ricardo.ch>
wrote:

> Hi,
>
> I am currently playing around with BigQueryIO options, and I am not an
> expert on it, but 60 workers sounds like a lot to me (or an expensive
> computation) for 10k records/second hitting each of 2 tables.
> Could you maybe share the code of your pipeline?
>
> Cheers,
> Tobi
>
> On Tue, Jan 22, 2019 at 9:28 PM Jeff Klukas <jk...@mozilla.com> wrote:
>
>> I'm attempting to deploy a fairly simple job on the Dataflow runner that
>> reads from PubSub and writes to BigQuery using file loads, but I have so
>> far not been able to tune it to keep up with the incoming data rate.
>>
>> I have configured BigQueryIO.write to trigger loads every 5 minutes, and
>> I've let the job autoscale up to a max of 60 workers (which it has done).
>> I'm using dynamic destinations to hit 2 field-partitioned tables. Incoming
>> data per table is ~10k events/second, so every 5 minutes each table should
>> be ingesting on the order of 200k records of ~20 kB apiece.
>>
>> We don't get many knobs to turn in BigQueryIO. I have tested numShards
>> between 10 and 1000, but haven't seen obvious differences in performance.
>>
>> Potentially relevant: I see a high rate of warnings on the shuffler. They
>> consist mostly of LevelDB warnings about "Too many L0 files". There are
>> occasionally some other warnings relating to memory as well. Would using
>> larger workers potentially help?
>>
>> Does anybody have experience with tuning BigQueryIO writing? It's quite a
>> complicated transform under the hood and it looks like there are several
>> steps of grouping and shuffling data that could be limiting throughput.
>>
>
>
> --
> Tobias Kaymak
> Data Engineer
>
> tobias.kaymak@ricardo.ch
> www.ricardo.ch
>

Re: Tuning BigQueryIO.Write

Posted by "Kaymak, Tobias" <to...@ricardo.ch>.
Hi,

I am currently playing around with BigQueryIO options, and I am not an
expert on it, but 60 workers sounds like a lot to me (or an expensive
computation) for 10k records/second hitting each of 2 tables.
Could you maybe share the code of your pipeline?

Cheers,
Tobi

On Tue, Jan 22, 2019 at 9:28 PM Jeff Klukas <jk...@mozilla.com> wrote:

> I'm attempting to deploy a fairly simple job on the Dataflow runner that
> reads from PubSub and writes to BigQuery using file loads, but I have so
> far not been able to tune it to keep up with the incoming data rate.
>
> I have configured BigQueryIO.write to trigger loads every 5 minutes, and
> I've let the job autoscale up to a max of 60 workers (which it has done).
> I'm using dynamic destinations to hit 2 field-partitioned tables. Incoming
> data per table is ~10k events/second, so every 5 minutes each table should
> be ingesting on the order of 200k records of ~20 kB apiece.
>
> We don't get many knobs to turn in BigQueryIO. I have tested numShards
> between 10 and 1000, but haven't seen obvious differences in performance.
>
> Potentially relevant: I see a high rate of warnings on the shuffler. They
> consist mostly of LevelDB warnings about "Too many L0 files". There are
> occasionally some other warnings relating to memory as well. Would using
> larger workers potentially help?
>
> Does anybody have experience with tuning BigQueryIO writing? It's quite a
> complicated transform under the hood and it looks like there are several
> steps of grouping and shuffling data that could be limiting throughput.
>


-- 
Tobias Kaymak
Data Engineer

tobias.kaymak@ricardo.ch
www.ricardo.ch