Posted to dev@beam.apache.org by Wout Scheepers <Wo...@vente-exclusive.com> on 2018/11/12 09:03:25 UTC

Bigquery streaming TableRow size limit

Hey all,

The TableRow size limit is 1 MB when streaming into BigQuery.
To prevent data loss, I’m going to implement a TableRow size check and add a fan-out that does a BigQuery load job when the size is above the limit.
Of course, this load job would be windowed.

I know it doesn’t make sense to stream data bigger than 1 MB, but since we’re using Pub/Sub and want to make sure no data loss happens whatsoever, I’ll need to implement it.

Is this functionality something any of you would like to see in BigQueryIO itself?
Or do you think my use case is too specific, so that implementing my solution around BigQueryIO will suffice?

Thanks for your thoughts,
Wout
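
A minimal sketch of the fan-out described above, assuming the Beam Java SDK: a multi-output DoFn tags each TableRow by its approximate serialized size, small rows go to a streaming-insert write and oversized rows to a windowed load-job write. The class name, tags, exact threshold, and write settings are illustrative assumptions, not part of BigQueryIO.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.coders.CoderException;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.util.CoderUtils;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;
import org.joda.time.Duration;

public class SizeFanOut {
  // Illustrative threshold; the real streaming-insert limit applies to the encoded request.
  private static final long MAX_STREAMING_BYTES = 1_000_000L;

  static final TupleTag<TableRow> SMALL = new TupleTag<TableRow>() {};
  static final TupleTag<TableRow> LARGE = new TupleTag<TableRow>() {};

  static class TagBySize extends DoFn<TableRow, TableRow> {
    @ProcessElement
    public void process(@Element TableRow row, MultiOutputReceiver out) throws CoderException {
      // Approximate the row's size by its JSON encoding, which is what the insertAll request carries.
      long size = CoderUtils.encodeToByteArray(TableRowJsonCoder.of(), row).length;
      out.get(size <= MAX_STREAMING_BYTES ? SMALL : LARGE).output(row);
    }
  }

  // Writes small rows via streaming inserts and oversized rows via periodic load jobs.
  // Assumes the destination table already exists (CREATE_NEVER), so no schema is needed here.
  public static void write(PCollection<TableRow> rows, String table) {
    PCollectionTuple tagged =
        rows.apply("TagBySize", ParDo.of(new TagBySize()).withOutputTags(SMALL, TupleTagList.of(LARGE)));

    tagged.get(SMALL).apply("StreamSmallRows",
        BigQueryIO.writeTableRows()
            .to(table)
            .withMethod(Method.STREAMING_INSERTS)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND)
            .withCreateDisposition(CreateDisposition.CREATE_NEVER));

    tagged.get(LARGE).apply("LoadLargeRows",
        BigQueryIO.writeTableRows()
            .to(table)
            .withMethod(Method.FILE_LOADS)
            // Required on unbounded input: how often a load job is issued, and into how many files.
            .withTriggeringFrequency(Duration.standardMinutes(10))
            .withNumFileShards(32)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND)
            .withCreateDisposition(CreateDisposition.CREATE_NEVER));
  }
}

On an unbounded Pub/Sub source the triggering frequency and shard count on the FILE_LOADS branch are tuning choices; roughly, each trigger issues a load job that counts against the daily quota, which is the trade-off discussed further down the thread.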



Re: Bigquery streaming TableRow size limit

Posted by Reuven Lax <re...@google.com>.
This sounds a bit more specific, so I wouldn't add this to BigQueryIO yet.


Re: Bigquery streaming TableRow size limit

Posted by Wout Scheepers <Wo...@vente-exclusive.com>.
Thanks for your thoughts.

Also, I’m doing something similar when streaming data into partitioned tables.
From [1]:
“When the data is streamed, data between 7 days in the past and 3 days in the future is placed in the streaming buffer, and then it is extracted to the corresponding partitions.”

I added a check to see whether the event time falls within these bounds. If not, a load job is triggered instead. This can happen when we replay old data.

Do you also think this would be worth adding to BigQueryIO?
If so, I’ll try to create a PR for both features.

Thanks,
Wout

[1] : https://cloud.google.com/bigquery/streaming-data-into-bigquery#streaming_into_partitioned_tables
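
A minimal sketch of the event-time check described above, assuming the Beam Java SDK and that each element carries a meaningful event-time timestamp. The 7-days-past / 3-days-future bounds come from the documentation quoted above; the class and tag names are illustrative.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.TupleTag;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class PartitionTimeBoundCheck {
  // Rows inside the streaming buffer's bounds versus rows (e.g. replayed
  // historical data) that have to go through a load job.
  static final TupleTag<TableRow> STREAMABLE = new TupleTag<TableRow>() {};
  static final TupleTag<TableRow> NEEDS_LOAD_JOB = new TupleTag<TableRow>() {};

  static class TagByEventTime extends DoFn<TableRow, TableRow> {
    @ProcessElement
    public void process(@Element TableRow row, @Timestamp Instant eventTime, MultiOutputReceiver out) {
      Instant now = Instant.now();
      // Streamed data between 7 days in the past and 3 days in the future is
      // accepted into the streaming buffer of a partitioned table; anything
      // outside that range is tagged for a load job instead.
      boolean withinBounds =
          !eventTime.isBefore(now.minus(Duration.standardDays(7)))
              && !eventTime.isAfter(now.plus(Duration.standardDays(3)));
      out.get(withinBounds ? STREAMABLE : NEEDS_LOAD_JOB).output(row);
    }
  }
}

Routing the two outputs to a streaming-insert write and a windowed FILE_LOADS write then works the same way as in the size-based sketch earlier in the thread.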





Re: Bigquery streaming TableRow size limit

Posted by Reuven Lax <re...@google.com>.
Generally I would agree, but the consequences of a mistake here are severe.
Not only will the Beam pipeline get stuck for 24 hours, but _anything_ else in
the user's GCP project that tries to load data into BigQuery will also fail
for the next 24 hours. Given the severity, I think it's best to make the
user opt into this behavior rather than do it magically.


Re: Bigquery streaming TableRow size limit

Posted by Lukasz Cwik <lc...@google.com>.
I would rather not have the builder method and run into the quota issue
than require the builder method and still run into quota issues.


Re: Bigquery streaming TableRow size limit

Posted by Reuven Lax <re...@google.com>.
I'm a bit worried about making this automatic, as it can have unexpected
side effects on the BigQuery load-job quota. This is a 24-hour quota, so if
it's accidentally exceeded, all load jobs for the project may be blocked for
the next 24 hours. However, if the user opts in (possibly via a builder
method), this seems like it could be automatic.

Reuven
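
For illustration only: one way to keep this strictly opt-in without a new builder method on BigQueryIO.Write is to wrap the fallback in a transform the user must apply explicitly. The class below is hypothetical, not part of BigQueryIO, and reuses the SizeFanOut sketch from earlier in the thread.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PDone;

// Hypothetical wrapper: the opt-in is the explicit choice to apply this
// transform instead of a plain streaming write, so load jobs (and their
// 24-hour quota) can never be consumed by surprise.
public class WriteWithLoadJobFallback extends PTransform<PCollection<TableRow>, PDone> {
  private final String table;

  public WriteWithLoadJobFallback(String table) {
    this.table = table;
  }

  @Override
  public PDone expand(PCollection<TableRow> rows) {
    // Delegates to the size-based fan-out sketched earlier in the thread.
    SizeFanOut.write(rows, table);
    return PDone.in(rows.getPipeline());
  }
}

A builder method on BigQueryIO.Write itself could express the same opt-in inside the connector, but that is exactly the change that would need a PR against BigQueryIO.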


Re: Bigquery streaming TableRow size limit

Posted by Lukasz Cwik <lc...@google.com>.
Having data ingestion work without needing to worry about how big the blobs
are would be nice if it were automatic for users.
