Posted to user@beam.apache.org by Yohei Onishi <vi...@gmail.com> on 2019/08/08 05:55:37 UTC

BigQueryIO - insert retry policy in Apache Beam

Hi,

If you are familiar with the BigQuery insert retry policies in the Apache Beam
API (BigQueryIO), please help me understand the following behavior. I am using
the Dataflow runner.

   - How does a Dataflow job behave if I specify retryTransientErrors?
   - shouldRetry is given an error from BigQuery so that I can decide whether
   to retry. Where can I find the errors that BigQuery is expected to return?

*BigQuery insert retry policies*
https://beam.apache.org/releases/javadoc/2.1.0/org/apache/beam/sdk/io/gcp/bigquery/InsertRetryPolicy.html


   - alwaysRetry - Always retry all failures.
   - neverRetry - Never retry any failures.
   - retryTransientErrors - Retry all failures except for known persistent
   errors.
   - shouldRetry - Return true if this failure should be retried.

*Background*

   - When my Cloud Dataflow job inserted a very old timestamp (more than one
   year in the past) into BigQuery, I got the following error.
   - The retries did not stop, so I added retryTransientErrors to the
   BigQueryIO.Write step, and then the retries stopped (see the sketch after
   the error log below).

 jsonPayload: {
>   exception: "java.lang.RuntimeException: java.io.IOException: Insert failed:
>   [{"errors":[{"debugInfo":"","location":"","message":"Value 690000000 for
>   field timestamp_scanned of the destination table
>   fr-prd-datalake:rfid_raw.store_epc_transactions_cr_uqjp is outside the
>   allowed bounds. You can only stream to date range within 365 days in the
>   past and 183 days in the future relative to the current
>   date.","reason":"invalid"}],

After the first error, Dataflow tried to retry the insert, and it was always
rejected by BigQuery with the same error.
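For reference, here is roughly how I attached the policy to the write step (a
minimal sketch; the project, dataset, and table names are placeholders, and
"rows" stands for a PCollection<TableRow> built earlier in the pipeline):

    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;

    // Attach the retry policy to the streaming insert step.
    rows.apply("WriteToBigQuery",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")  // placeholder table
            // Stop retrying rows that BigQuery rejects with a persistent
            // error reason instead of retrying them forever.
            .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));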


I also posted the same question here
https://stackoverflow.com/questions/57403980/biqquery-insert-retry-policy-in-apache-beam

Yohei Onishi

Re: BigQueryIO - insert retry policy in Apache Beam

Posted by Lukasz Cwik <lc...@google.com>.
On Wed, Aug 7, 2019 at 10:55 PM Yohei Onishi <vi...@gmail.com> wrote:

> Hi,
>
> If you are familiar with the BigQuery insert retry policies in the Apache
> Beam API (BigQueryIO), please help me understand the following behavior. I
> am using the Dataflow runner.
>
>    - How does a Dataflow job behave if I specify retryTransientErrors?
>
>
All errors are considered transient except those for which BigQuery reports an
error reason of "invalid", "invalidQuery", or "notImplemented":
https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/InsertRetryPolicy.java#L44
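The check in the linked source is roughly the following (a paraphrased sketch
of that file, not a verbatim copy):

    import com.google.api.services.bigquery.model.ErrorProto;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Error reasons that retryTransientErrors treats as persistent.
    static final Set<String> PERSISTENT_ERRORS =
        new HashSet<>(Arrays.asList("invalid", "invalidQuery", "notImplemented"));

    // Retry unless some error attached to the failed insert is persistent.
    static boolean shouldRetry(List<ErrorProto> errors) {
      for (ErrorProto error : errors) {
        if (error.getReason() != null
            && PERSISTENT_ERRORS.contains(error.getReason())) {
          return false;  // known persistent error such as "invalid": give up
        }
      }
      return true;  // everything else is treated as transient and retried
    }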


>
>    - shouldRetry is given an error from BigQuery so that I can decide
>    whether to retry. Where can I find the errors that BigQuery is expected
>    to return?
>
You can't, since the errors are not visible to the caller:
https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/InsertRetryPolicy.java#L36
I'm not sure if this was done on purpose or whether Apache Beam should
expose the errors so users can write their own retry logic.
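In the meantime, one workaround is to avoid custom retry logic and instead
capture the rows that the policy gives up on. A minimal sketch, assuming
streaming inserts and a placeholder table name:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
    import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
    import org.apache.beam.sdk.values.PCollection;

    // "rows" stands for a PCollection<TableRow> built earlier in the pipeline.
    WriteResult result = rows.apply(
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")  // placeholder table
            .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));

    // Rows whose inserts are not retried come back here and can be logged
    // or routed to a dead-letter table.
    PCollection<TableRow> failedRows = result.getFailedInserts();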


> *BigQuery insert retry policies*
>
> https://beam.apache.org/releases/javadoc/2.1.0/org/apache/beam/sdk/io/gcp/bigquery/InsertRetryPolicy.html
>
>
>    - alwaysRetry - Always retry all failures.
>    - neverRetry - Never retry any failures.
>    - retryTransientErrors - Retry all failures except for known
>    persistent errors.
>    - shouldRetry - Return true if this failure should be retried.
>
> *Background*
>
>    - When my Cloud Dataflow job inserted a very old timestamp (more than
>    one year in the past) into BigQuery, I got the following error.
>    - The retries did not stop, so I added retryTransientErrors to the
>    BigQueryIO.Write step, and then the retries stopped.
>
>  jsonPayload: {
>>   exception: "java.lang.RuntimeException: java.io.IOException: Insert failed:
>>   [{"errors":[{"debugInfo":"","location":"","message":"Value 690000000 for
>>   field timestamp_scanned of the destination table
>>   fr-prd-datalake:rfid_raw.store_epc_transactions_cr_uqjp is outside the
>>   allowed bounds. You can only stream to date range within 365 days in the
>>   past and 183 days in the future relative to the current
>>   date.","reason":"invalid"}],
>
> After the first error, Dataflow tried to retry the insert, and it was always
> rejected by BigQuery with the same error.
>
>
> I also posted the same question here
> https://stackoverflow.com/questions/57403980/biqquery-insert-retry-policy-in-apache-beam
>
> Yohei Onishi
>