Posted to user@beam.apache.org by Josh <jo...@gmail.com> on 2017/04/10 18:11:36 UTC

How to skip processing on failure at BigQueryIO sink?

Hi,

I'm using BigQueryIO to write the output of an unbounded streaming job to
BigQuery.
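
For context, the write itself is wired up roughly like this (project, dataset,
table and schema names are placeholders):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;

PCollection<TableRow> rows = ...; // parsed from the unbounded source

rows.apply("WriteToBigQuery",
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.events")  // placeholder table
        .withSchema(eventSchema)             // TableSchema built elsewhere
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));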

When an element in the stream cannot be written to BigQuery, BigQueryIO seems
to have some default retry logic that retries the write a few times. However,
if the write keeps failing, the whole pipeline seems to halt.

How can I configure Beam so that if writing an element fails a few times, it
simply gives up on that element and moves on without affecting the rest of the
pipeline?

Thanks for any advice,
Josh

Re: How to skip processing on failure at BigQueryIO sink?

Posted by Josh <jo...@gmail.com>.
Thanks for the replies,
@Lukasz that sounds like a good option. It's just that it may be hard to catch
and filter out every case that would result in a 4xx error. I just want to
avoid the whole pipeline failing when a few elements in the stream are bad.

@Dan that sounds promising, I will keep an eye on BEAM-190. Do you have any
idea whether there will be an initial version of this to try out in the next
couple of weeks?

Re: How to skip processing on failure at BigQueryIO sink?

Posted by Dan Halperin <dh...@google.com>.
I believe this is BEAM-190, which is actually being worked on today.
However, it will probably not be ready in time for the first stable release.

https://issues.apache.org/jira/browse/BEAM-190
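
The shape being discussed there is a dead-letter output on the write: failed
rows come back as a PCollection instead of failing the pipeline. A minimal
sketch of how that might look once it lands (the class and method names below
are my guesses at the eventual API, so treat them as hypothetical until the
work is merged):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.values.PCollection;

WriteResult result = rows.apply(
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.events")
        .withSchema(eventSchema)
        // Retry only transient (5xx-style) failures; give up on permanent ones.
        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));

// Rows that BigQuery permanently rejected come back as a PCollection, so the
// pipeline keeps running and the bad rows can be logged or archived.
PCollection<TableRow> failedRows = result.getFailedInserts();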

Re: How to skip processing on failure at BigQueryIO sink?

Posted by Lukasz Cwik <lc...@google.com>.
Have you thought of fetching the schema upfront from BigQuery and filtering
out any non-conforming records in a preceding DoFn, instead of relying on
BigQuery telling you that the schema doesn't match?

Otherwise you are correct in believing that you will need to update
BigQueryIO to have the retry/error semantics that you want.
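
Roughly like this sketch (it checks field names only; a real version would
also validate types and required fields, and how you fetch the schema is up
to you):

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.transforms.DoFn;
import java.util.HashSet;
import java.util.Set;

// Drops rows with fields that don't appear in the table schema, so they
// never reach the BigQuery write. The schema is read once at pipeline
// construction time; only the serializable field-name set is kept.
public class FilterBySchemaFn extends DoFn<TableRow, TableRow> {

  private final Set<String> knownFields = new HashSet<>();

  public FilterBySchemaFn(TableSchema schema) {
    for (TableFieldSchema field : schema.getFields()) {
      knownFields.add(field.getName());
    }
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    TableRow row = c.element();
    if (knownFields.containsAll(row.keySet())) {
      c.output(row);
    }
    // else: drop (and ideally log) the row rather than letting the
    // write fail repeatedly downstream.
  }
}

Wired in just before the write, e.g.
rows.apply(ParDo.of(new FilterBySchemaFn(schema))). A side output (TupleTag)
could collect the dropped rows for inspection instead of discarding them.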

Re: How to skip processing on failure at BigQueryIO sink?

Posted by Josh <jo...@gmail.com>.
What I really want to do is configure BigQueryIO to log an error and skip the
write if it receives a 4xx response from BigQuery (e.g. the element does not
match the table schema). For other errors (e.g. 5xx), I want it to retry n
times with exponential backoff.

Is there any way to do this at the moment? Will I need to make some custom
changes to BigQueryIO?
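
If I do end up making custom changes, the decision logic I have in mind looks
roughly like this (a sketch only; insertRow is a hypothetical stand-in for
wherever the actual insert happens inside BigQueryIO):

import com.google.api.client.googleapis.json.GoogleJsonResponseException;
import com.google.api.services.bigquery.model.TableRow;
import java.io.IOException;

// Retries 5xx and plain I/O errors with exponential backoff; gives up
// immediately on 4xx, since retrying a bad row can never succeed.
static boolean insertWithBackoff(TableRow row, int maxAttempts)
    throws InterruptedException {
  long backoffMillis = 1000;
  for (int attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      insertRow(row); // hypothetical stand-in for the actual insert call
      return true;
    } catch (GoogleJsonResponseException e) {
      if (e.getStatusCode() >= 400 && e.getStatusCode() < 500) {
        // Permanent client error (e.g. schema mismatch): skip this row.
        return false;
      }
      // 5xx: fall through and retry.
    } catch (IOException e) {
      // Transient network error: fall through and retry.
    }
    Thread.sleep(backoffMillis);
    backoffMillis = Math.min(backoffMillis * 2, 60_000L);
  }
  return false; // exhausted retries; log and move on
}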