Posted to user@beam.apache.org by Julien Phalip <jp...@gmail.com> on 2021/11/24 04:48:15 UTC

Custom batching for BigQuery streaming inserts

Hi,

AFAIK there are two ways to control batching/sharding for BigQuery streaming
inserts: 1) a hard-coded batch size via the `--numStreamingKeys` pipeline
option, and 2) automatic sharding via `withAutoSharding()`.
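
For context, here is roughly how I understand those two mechanisms are used
today. This is just a sketch based on my reading of the SDK (the table name is
a placeholder), so please correct me if I've misread it:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ExistingKnobsSketch {

  public static void main(String[] args) {
    // 1) Fixed sharding: pass e.g. --numStreamingKeys=100 on the command line;
    //    the option is picked up through BigQueryOptions.
    BigQueryOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(BigQueryOptions.class);
    Pipeline pipeline = Pipeline.create(options);
    // ... build a PCollection<TableRow> and pass it to writeWithAutoSharding() ...
    pipeline.run();
  }

  // 2) Automatic sharding: let the runner decide how to shard streaming inserts.
  static void writeWithAutoSharding(PCollection<TableRow> rows) {
    rows.apply(
        "WriteToBigQuery",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")  // placeholder table spec
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withAutoSharding());
  }
}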

Instead, I'd like to do my own batching with my own GroupIntoBatches step.
More precisely, I'd like to batch rows by overall byte size rather than by
number of rows. The reason is that some individual rows can be very large,
so batching by row count alone could produce streaming insert requests whose
payloads exceed the size limit and get rejected by the BigQuery API.
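
To illustrate, here's roughly the kind of batching step I have in mind. This
is just a sketch: I'm assuming `GroupIntoBatches.ofByteSize` with a custom
weigher can be used this way, and the shard-key assignment and JSON-length
weigher are placeholders of my own:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.TypeDescriptors;

class ByteSizeBatchingSketch {

  // Groups rows into batches whose estimated total byte size stays under
  // maxBatchBytes, rather than capping batches by row count.
  static PCollection<KV<Integer, Iterable<TableRow>>> batchByByteSize(
      PCollection<TableRow> rows, int numShards, long maxBatchBytes) {
    return rows
        // GroupIntoBatches operates on KVs, so first assign an arbitrary shard key.
        .apply(
            "AssignShardKey",
            MapElements.into(
                    TypeDescriptors.kvs(
                        TypeDescriptors.integers(), TypeDescriptor.of(TableRow.class)))
                .via(row -> KV.of(Math.floorMod(row.hashCode(), numShards), row)))
        .setCoder(KvCoder.of(VarIntCoder.of(), TableRowJsonCoder.of()))
        // Close each batch once its estimated byte size reaches maxBatchBytes,
        // using the length of the row's JSON string as a rough weigher.
        .apply(
            "BatchByByteSize",
            GroupIntoBatches.<Integer, TableRow>ofByteSize(
                maxBatchBytes, row -> (long) row.toString().length()));
  }
}

The output of that step would be `KV<Integer, Iterable<TableRow>>` elements,
i.e. pre-batched rows.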

However, from looking at the Java SDK internals, I'm not sure that's
possible, as `BigQueryIO.write()` appears to accept only individual
`TableRow` elements (a `PCollection<TableRow>`). Ideally, I'd like to instead
provide an input of pre-batched rows in the form of `Iterable<TableRow>`.

In the Python SDK, this appears to be possible by setting
`with_batched_input=True` on `BigQueryWriteFn`.

Is what I'm trying to achieve possible with the Java SDK?

Thanks!

Julien