Posted to user@beam.apache.org by Chris Heath <ch...@networksgroup.com> on 2018/05/18 17:24:04 UTC

Doubled trigger frequency writing to BQ

Hi,

I am using BigQueryIO from Apache Beam 2.4.0 to write data to BQ
(Write.Method.FILE_LOADS). My volume will be fairly low -- roughly
200 to 2000 rows per second.

What I'm finding is that data often takes twice as long to be inserted into
BQ as the configured trigger frequency.  My code is something like this:

          .apply(BigQueryIO.writeTableRows()
            .to(XXX)
            .withSchema(XXX)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            .withTriggeringFrequency(Duration.standardMinutes(3))
            .withNumFileShards(1));

In my case, with a 3-minute triggering frequency, in the worst case, a row
that is computed immediately after a batch load runs would appear in BQ 6
minutes later (not 3 minutes, as I would have expected).

The behavior I'm seeing suggests that writing the file and running the BQ
load job are not sequential.  Is the BQ load job waiting for the file write
to complete before running?  It behaves as if there is a race condition: in
the worst case, a row is written to a file 3 minutes after it is computed,
but the BQ load job for that trigger firing has already run, so the row has
to wait another 3 minutes for the next one.  Sometimes, though (about 30%
of the time), I see the row in BQ after only 3 minutes.
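To make the race I suspect concrete, here is a toy model in plain Java (no
Beam dependencies; the two "stages" and their timing are my guesses about
what BigQueryIO might be doing internally, not its actual implementation).
It compares latency when the load job consumes the file written at the same
trigger firing versus when it only picks the file up one firing later:

```java
public class TriggerLatencyModel {
    // Triggering frequency in seconds (3 minutes), as in my pipeline.
    static final long T = 180;

    // Time at which a row arriving at 'arrival' (seconds) is written to a
    // file: the next trigger firing, i.e. the next multiple of T.
    static long fileWriteTime(long arrival) {
        return ((arrival / T) + 1) * T;
    }

    // Sequential model: the load job at that same firing consumes the
    // freshly written file, so latency is bounded by one period.
    static long latencySequential(long arrival) {
        return fileWriteTime(arrival) - arrival;
    }

    // Racy model: the load job at that firing has already run, so the file
    // waits a full extra period for the next firing.
    static long latencyRacy(long arrival) {
        return fileWriteTime(arrival) + T - arrival;
    }

    public static void main(String[] args) {
        long arrival = 1; // a row computed just after a trigger fires
        System.out.println("sequential worst case: "
            + latencySequential(arrival) + "s"); // 179s, ~3 minutes
        System.out.println("racy worst case:       "
            + latencyRacy(arrival) + "s");       // 359s, ~6 minutes
    }
}
```

The racy model reproduces exactly the ~2x triggering frequency I am
observing in the worst case, which is why I suspect the file write and the
load job are firing independently.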

What could be causing this?  Is there any way to make the worst-case
insertion time closer to the triggering frequency -- perhaps by writing the
files more often, e.g. every triggeringFrequency/10?

Thanks,
Chris Heath