Posted to user@beam.apache.org by Josh <jo...@gmail.com> on 2018/02/28 07:56:48 UTC

BigQueryIO streaming inserts - poor performance with multiple tables

Hi all,

We are using BigQueryIO.write() to stream data into BigQuery, and are
seeing very poor performance in terms of number of writes per second per
worker.

We are currently using *32* x *n1-standard-4* workers to stream ~15,000
writes/sec to BigQuery, and each worker sits at ~90% CPU utilisation. Strangely,
the number of workers and the worker CPU utilisation (~90%) remain constant even
when the input rate drops below 10,000 writes/sec. The job always keeps up with
the stream (no backlog).

I've seen BigQueryIO benchmarks that show ~20k writes/sec being achieved
on a single node when streaming data into a *single* BQ table... So my
theory is that writing to multiple tables is somehow causing the
performance issue. Our writes are spread (unevenly) across 200+ tables. The
job itself does very little processing, and looking at the Dataflow metrics,
pretty much all of the wall time is spent in the *StreamingWrite* step of
BigQueryIO. The Beam version is 2.2.0.

Our code looks like this:

stream.apply(BigQueryIO.<MyElement>write()
    .to(new ToDestination())
    .withFormatFunction(new FormatForBigQuery())
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

where ToDestination is a:

SerializableFunction<ValueInSingleWindow<MyElement>, TableDestination>

which returns a:

new TableDestination(tableName, "")

where tableName looks like "myproject:dataset.tablename$20180228"
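
For illustration, here is a minimal sketch of what ToDestination looks like (the
MyElement accessor and the exact date handling are simplified placeholders, not
our real code):

import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.ValueInSingleWindow;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

// Routes each element to a day-partitioned table based on the element's timestamp.
public class ToDestination
    implements SerializableFunction<ValueInSingleWindow<MyElement>, TableDestination> {

  private static final DateTimeFormatter DAY_FORMAT =
      DateTimeFormat.forPattern("yyyyMMdd").withZoneUTC();

  @Override
  public TableDestination apply(ValueInSingleWindow<MyElement> input) {
    MyElement element = input.getValue();
    // getTableBaseName() is a placeholder accessor on MyElement.
    String tableName = "myproject:dataset." + element.getTableBaseName()
        + "$" + DAY_FORMAT.print(input.getTimestamp());
    // Empty table description; tables are pre-created, hence CREATE_NEVER above.
    return new TableDestination(tableName, "");
  }
}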

Has anyone else seen this kind of poor performance when streaming
writes to multiple BQ tables? Is there anything here that sounds
wrong, or any optimisations we can make?

Thanks for any advice!

Josh

Re: BigQueryIO streaming inserts - poor performance with multiple tables

Posted by Carlos Alonso <ca...@mrcalonso.com>.
Could you please keep posting your findings here?

I'm very interested in this issue as well.

Thanks!

Re: BigQueryIO streaming inserts - poor performance with multiple tables

Posted by Josh <jo...@gmail.com>.
Hi Cham,

Thanks, I have emailed the dataflow-feedback email address with the details.

Best regards,
Josh

Re: BigQueryIO streaming inserts - poor performance with multiple tables

Posted by Chamikara Jayalath <ch...@google.com>.
This could be a DataflowRunner-specific issue. Would you mind reporting it,
with the corresponding Dataflow job IDs, to either the Dataflow Stack Overflow
channel [1] or dataflow-feedback@google.com?

I suspect Dataflow splits the writes to multiple tables across multiple workers,
which may be keeping all workers busy, but we would have to look at the job to
confirm.

Thanks,
Cham

[1] https://stackoverflow.com/questions/tagged/google-cloud-dataflow
