Posted to dev@beam.apache.org by Siyuan Chen <sy...@google.com> on 2021/07/20 18:29:36 UTC

BigQueryIO SchemaUpdateOptions incompatible with temp tables?

Hi Dev,

I encountered a problem when trying to write data to BigQuery using FILE_LOADS
<https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1742>.
With FILE LOADS, input data is first written to temp files
<https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L280>
and then batch loaded
<https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L340>
to BigQuery. When temp tables are needed (to avoid too many files in a
single load job), the default write disposition is set to WRITE_TRUNCATE
<https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L223>
to allow retries. However, when SchemaUpdateOptions
<https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L2001>
were also set, the load job failed with the following error:

Schema update options should only be specified with WRITE_APPEND
disposition, or with WRITE_TRUNCATE disposition on a table partition.

I think it means that if WRITE_TRUNCATE is used, the partition of the table
to truncate should also be supplied (which kind of makes sense, as rows in a
partition share the same schema). I could not find code that appends a
partition decorator
<https://cloud.google.com/bigquery/docs/managing-partitioned-table-data#using_a_load_job>
to the temp tables. Does this sound like a missing piece in the BigQueryIO
implementation? Please let me know if I missed anything important.
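
For what it's worth, the decorator form described in the linked docs is just
"tableId$partitionId". Below is a minimal sketch of what appending it to a
temp table reference might look like; the class and method names here are
hypothetical illustrations, not taken from the Beam codebase:

```java
// Hypothetical sketch: append a BigQuery partition decorator to a table id
// so that a WRITE_TRUNCATE load job only truncates that one partition.
// Names are illustrative, not from the Beam source.
public class PartitionDecoratorSketch {

    // BigQuery partition decorators take the form "tableId$partitionId",
    // e.g. "dataset.table$20210720" for a daily partition.
    static String withPartitionDecorator(String tableId, String partitionId) {
        return tableId + "$" + partitionId;
    }

    public static void main(String[] args) {
        // → mydataset.temp_table_1$20210720
        System.out.println(withPartitionDecorator("mydataset.temp_table_1", "20210720"));
    }
}
```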

Thanks in advance!

--
Best regards,
Siyuan

Re: BigQueryIO SchemaUpdateOptions incompatible with temp tables?

Posted by Siyuan Chen <sy...@google.com>.
Thanks Ahmet, Cham and Steve! I opened a Jira issue for this:
https://issues.apache.org/jira/browse/BEAM-12482?filter=-2
--
Best regards,
Siyuan


On Tue, Jul 27, 2021 at 1:39 PM Steve Niemitz <sn...@apache.org> wrote:

> I think the same problem was recently fixed in Python [1]. It'd be great
> to fix this in Java too; we've hit this a bunch, but I've never had enough
> time to fix it.
>
> [1] https://github.com/apache/beam/pull/14113
>

Good to know about the Python fix. I will look into it and see if I can
find time to work on a fix for Java.


>
> On Tue, Jul 27, 2021 at 4:19 PM Chamikara Jayalath <ch...@google.com>
> wrote:
>
>> I don't have a lot of context regarding schema update options but this
>> does sound like a bug. Temp tables are only used for very large writes
>> (11TB or so last time I checked) so I wouldn't be surprised if not too many
>> users have run into this by using
>>
> Yeah there is also a limit on the number of files (10k) which might be
easier to hit.
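
To make that concrete, here is a rough sketch of when the fallback to temp
tables would kick in, using the limits mentioned in this thread (~10k files
per load job, ~11 TB). The constants and logic below are illustrative
assumptions, not copied from BatchLoads:

```java
// Rough sketch of the temp-table fallback condition, using the limits
// mentioned in this thread. Constants and logic are assumptions for
// illustration, not values copied from the Beam source.
public class TempTableTrigger {
    static final int MAX_FILES_PER_LOAD_JOB = 10_000;                 // ~10k files
    static final long MAX_BYTES_PER_LOAD_JOB =
            11L * 1024 * 1024 * 1024 * 1024;                          // ~11 TB

    // If one load job cannot cover all files/bytes, the write is split
    // across multiple load jobs into temp tables, which are then copied
    // into the destination table.
    static boolean needsTempTables(int numFiles, long totalBytes) {
        return numFiles > MAX_FILES_PER_LOAD_JOB
                || totalBytes > MAX_BYTES_PER_LOAD_JOB;
    }
}
```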

>> schema update options with very large tables.
>> Can you create a Jira ?
>>
>> Thanks,
>> Cham
>>

Re: BigQueryIO SchemaUpdateOptions incompatible with temp tables?

Posted by Steve Niemitz <sn...@apache.org>.
I think the same problem was recently fixed in Python [1]. It'd be great to
fix this in Java too; we've hit this a bunch, but I've never had enough time
to fix it.

[1] https://github.com/apache/beam/pull/14113

On Tue, Jul 27, 2021 at 4:19 PM Chamikara Jayalath <ch...@google.com>
wrote:

> I don't have a lot of context regarding schema update options but this
> does sound like a bug. Temp tables are only used for very large writes
> (11TB or so last time I checked) so I wouldn't be surprised if not too many
> users have run into this by using schema update options with very large
> tables.
> Can you create a Jira ?
>
> Thanks,
> Cham

Re: BigQueryIO SchemaUpdateOptions incompatible with temp tables?

Posted by Chamikara Jayalath <ch...@google.com>.
I don't have a lot of context regarding schema update options, but this does
sound like a bug. Temp tables are only used for very large writes (11 TB or
so, last time I checked), so I wouldn't be surprised if not too many users
have run into this by using schema update options with very large tables.
Can you create a Jira?

Thanks,
Cham
