Posted to user@beam.apache.org by Julien Phalip <jp...@gmail.com> on 2022/09/28 20:53:31 UTC

Why is BigQueryIO.withMaxFileSize() not public?

Hi,

I'd like to control the size of files written to GCS when using
BigQueryIO's FILE_LOAD write method.

However, it looks like the withMaxFileSize method (
https://github.com/apache/beam/blob/948af30a5b665fe74b7052b673e95ff5f5fc426a/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L2597)
is not public.

Is that intentional? Is there a workaround to control the file size?

Thanks,

Julien

Re: Why is BigQueryIO.withMaxFileSize() not public?

Posted by Reuven Lax via user <us...@beam.apache.org>.
The default max file size is 4 TiB. BigQuery supports files up to 5 TiB, but
there may be some slop in our file-size estimation, which is why Beam set a
slightly lower limit. In any case, you won't be able to increase that value
by much, or BigQuery will reject the load job.

The default max bytes per partition could perhaps be increased. When the code
was written, BigQuery's limit was 12 TiB, but if it's now 15 TiB that would
be a reason to increase it.

BigQuery does not provide guarantees on scheduling load jobs (especially if
you don't have reserved slots). Some other ideas for how to improve things:
    - If you are running in streaming mode, consider increasing the
triggering duration so you generate load jobs less often.
    - By default, files are written out in JSON format. This is inefficient
and tends to create many more files. There is currently partial support for
writing files in the more efficient Avro format, but it requires you to call
withAvroWriter to pass in a function that converts your records into Avro.
    - I would also recommend trying the Storage Write API method. It does not
have the same scheduling issues that load jobs have. (A rough sketch of these
options follows below.)
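
To make the knobs concrete, here is a sketch of what this looks like with the
Java SDK. It is illustrative only: "rows", "schema", the table spec, and all
the numbers are placeholders made up for the example, not recommended values.

    // Assumes: PCollection<TableRow> rows; TableSchema schema; plus the usual
    // imports (org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO,
    // org.joda.time.Duration, com.google.api.services.bigquery.model.TableRow).
    rows.apply("WriteToBigQuery",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")
            .withSchema(schema)
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            // In streaming, trigger load jobs less often so fewer, larger jobs
            // are issued. A shard count is required with a triggering frequency.
            .withTriggeringFrequency(Duration.standardMinutes(10))
            .withNumFileShards(100)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    // To avoid load-job scheduling entirely, swap the method:
    //     .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
    // And to stage files as Avro rather than JSON under FILE_LOADS, supply a
    // writer via withAvroWriter(...) that converts your records.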

Reuven

On Thu, Sep 29, 2022 at 1:02 PM Julien Phalip <jp...@gmail.com> wrote:

> Hi all,
>
> Thanks for the replies.
>
> @Ahmed, you mentioned that one could hardcode another value
> for DEFAULT_MAX_FILE_SIZE. How may I do that from my own code?
>
> @Reuven, to give you more context on my use case: I'm running into an
> issue where a job that writes to BQ is taking an unexpectedly long time. It
> looks like things are slowing down on the BQ load job side of things. My
> theory is that the pipeline might generate too many BQ load job requests
> for BQ to handle in a timely manner. So I was thinking that this could be
> mitigated by increasing the file size, and therefore reducing the number of
> load job requests.
>
> That said, now that you've pointed at withMaxBytesPerPartition(), maybe
> that's what I should use instead? I see it defaults to 11 TiB, but perhaps
> I could try increasing it to something closer to BQ's limit (15 TiB)?
>
> Thanks,
>
> Julien
>
> On Thu, Sep 29, 2022 at 11:01 AM Ahmed Abualsaud via user <
> user@beam.apache.org> wrote:
>
>> That's right, if maxFileSize is made too small you may hit the default
>> maximum files per partition (10,000), in which case copy jobs will be
>> triggered. With that said though, BigQueryIO already has a public
>> withMaxBytesPerPartition() [1] method that controls the partition byte
>> size, which is arguably more influential in triggering this other codepath.
>>
>> [1]
>> https://github.com/apache/beam/blob/028c564b8ae1ba1ffa6aadb8212ec03555dd63b6/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L2623
>>
>> On Thu, Sep 29, 2022 at 12:24 PM Reuven Lax <re...@google.com> wrote:
>>
>>> It's not public because it was added for use in unit tests, and
>>> modifying this value can have very unexpected results (e.g. making it
>>> smaller can trigger a completely different codepath that is triggered when
>>> there are too many files, leading to unexpected cost increases in the
>>> pipeline).
>>>
>>> Out of curiosity, what is your use case for needing to control this file
>>> size?
>>>
>>> On Thu, Sep 29, 2022 at 8:01 AM Ahmed Abualsaud <
>>> ahmedabualsaud@google.com> wrote:
>>>
>>>> Hey Julien,
>>>>
>>>> I don't see a problem with exposing that method. That part of the code
>>>> was committed ~6 years ago, my guess is it wasn't requested to be public.
>>>>
>>>> One workaround is to hardcode another value for DEFAULT_MAX_FILE_SIZE [1].
>>>> Would this work temporarily? @Chamikara Jayalath <ch...@google.com>
>>>>  @Reuven Lax <re...@google.com> other thoughts?
>>>>
>>>> [1]
>>>> https://github.com/apache/beam/blob/17453e71a81ba774ab451ad141fc8c21ea8770c9/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L109
>>>>
>>>> Best,
>>>> Ahmed
>>>>
>>>> On Wed, Sep 28, 2022 at 4:55 PM Julien Phalip <jp...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'd like to control the size of files written to GCS when using
>>>>> BigQueryIO's FILE_LOAD write method.
>>>>>
>>>>> However, it looks like the withMaxFileSize method (
>>>>> https://github.com/apache/beam/blob/948af30a5b665fe74b7052b673e95ff5f5fc426a/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L2597)
>>>>> is not public.
>>>>>
>>>>> Is that intentional? Is there a workaround to control the file size?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Julien
>>>>>

Re: Why is BigQueryIO.withMaxFileSize() not public?

Posted by Julien Phalip <jp...@gmail.com>.
Hi all,

Thanks for the replies.

@Ahmed, you mentioned that one could hardcode another value
for DEFAULT_MAX_FILE_SIZE. How may I do that from my own code?

@Reuven, to give you more context on my use case: I'm running into an issue
where a job that writes to BQ is taking an unexpectedly long time. It looks
like things are slowing down on the BQ load job side of things. My theory
is that the pipeline might generate too many BQ load job requests for BQ to
handle in a timely manner. So I was thinking that this could be mitigated
by increasing the file size, and therefore reducing the number of load job
requests.

That said, now that you've pointed at withMaxBytesPerPartition(), maybe
that's what I should use instead? I see it defaults to 11 TiB, but perhaps I
could try increasing it to something closer to BQ's limit (15 TiB)?
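
Just to check that I'm reading the API correctly, this is roughly what I have
in mind (sketch only, not tested; the table spec, schema, and the 14 TiB value
are placeholders of mine, not anything from the docs):

    // Assumes: PCollection<TableRow> rows; TableSchema schema.
    rows.apply(
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")
            .withSchema(schema)
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            // Default cap is 11 TiB; a larger cap should mean fewer, bigger load jobs.
            .withMaxBytesPerPartition(14L * (1L << 40))); // 14 TiB, below BQ's 15 TiB limit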

Thanks,

Julien

On Thu, Sep 29, 2022 at 11:01 AM Ahmed Abualsaud via user <
user@beam.apache.org> wrote:

> That's right, if maxFileSize is made too small you may hit the default
> maximum files per partition (10,000), in which case copy jobs will be
> triggered. With that said though, BigQueryIO already has a public
> withMaxBytesPerPartition() [1] method that controls the partition byte
> size, which is arguably more influential in triggering this other codepath.
>
> [1]
> https://github.com/apache/beam/blob/028c564b8ae1ba1ffa6aadb8212ec03555dd63b6/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L2623
>
> On Thu, Sep 29, 2022 at 12:24 PM Reuven Lax <re...@google.com> wrote:
>
>> It's not public because it was added for use in unit tests, and modifying
>> this value can have very unexpected results (e.g. making it smaller can
>> trigger a completely different codepath that is triggered when there are
>> too many files, leading to unexpected cost increases in the pipeline).
>>
>> Out of curiosity, what is your use case for needing to control this file
>> size?
>>
>> On Thu, Sep 29, 2022 at 8:01 AM Ahmed Abualsaud <
>> ahmedabualsaud@google.com> wrote:
>>
>>> Hey Julien,
>>>
>>> I don't see a problem with exposing that method. That part of the code
>>> was committed ~6 years ago, my guess is it wasn't requested to be public.
>>>
>>> One workaround is to hardcode another value for DEFAULT_MAX_FILE_SIZE [1].
>>> Would this work temporarily? @Chamikara Jayalath <ch...@google.com> @Reuven
>>> Lax <re...@google.com> other thoughts?
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/17453e71a81ba774ab451ad141fc8c21ea8770c9/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L109
>>>
>>> Best,
>>> Ahmed
>>>
>>> On Wed, Sep 28, 2022 at 4:55 PM Julien Phalip <jp...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'd like to control the size of files written to GCS when using
>>>> BigQueryIO's FILE_LOAD write method.
>>>>
>>>> However, it looks like the withMaxFileSize method (
>>>> https://github.com/apache/beam/blob/948af30a5b665fe74b7052b673e95ff5f5fc426a/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L2597)
>>>> is not public.
>>>>
>>>> Is that intentional? Is there a workaround to control the file size?
>>>>
>>>> Thanks,
>>>>
>>>> Julien
>>>>

Re: Why is BigQueryIO.withMaxFileSize() not public?

Posted by Ahmed Abualsaud via user <us...@beam.apache.org>.
That's right, if maxFileSize is made too small you may hit the default
maximum files per partition (10,000), in which case copy jobs will be
triggered. With that said though, BigQueryIO already has a public
withMaxBytesPerPartition() [1] method that controls the partition byte
size, which is arguably more influential in triggering this other codepath.

[1]
https://github.com/apache/beam/blob/028c564b8ae1ba1ffa6aadb8212ec03555dd63b6/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L2623

On Thu, Sep 29, 2022 at 12:24 PM Reuven Lax <re...@google.com> wrote:

> It's not public because it was added for use in unit tests, and modifying
> this value can have very unexpected results (e.g. making it smaller can
> trigger a completely different codepath that is triggered when there are
> too many files, leading to unexpected cost increases in the pipeline).
>
> Out of curiosity, what is your use case for needing to control this file
> size?
>
> On Thu, Sep 29, 2022 at 8:01 AM Ahmed Abualsaud <ah...@google.com>
> wrote:
>
>> Hey Julien,
>>
>> I don't see a problem with exposing that method. That part of the code
>> was committed ~6 years ago, my guess is it wasn't requested to be public.
>>
>> One workaround is to hardcode another value for DEFAULT_MAX_FILE_SIZE [1].
>> Would this work temporarily? @Chamikara Jayalath <ch...@google.com> @Reuven
>> Lax <re...@google.com> other thoughts?
>>
>> [1]
>> https://github.com/apache/beam/blob/17453e71a81ba774ab451ad141fc8c21ea8770c9/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L109
>>
>> Best,
>> Ahmed
>>
>> On Wed, Sep 28, 2022 at 4:55 PM Julien Phalip <jp...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'd like to control the size of files written to GCS when using
>>> BigQueryIO's FILE_LOAD write method.
>>>
>>> However, it looks like the withMaxFileSize method (
>>> https://github.com/apache/beam/blob/948af30a5b665fe74b7052b673e95ff5f5fc426a/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L2597)
>>> is not public.
>>>
>>> Is that intentional? Is there a workaround to control the file size?
>>>
>>> Thanks,
>>>
>>> Julien
>>>

Re: Why is BigQueryIO.withMaxFileSize() not public?

Posted by Reuven Lax via user <us...@beam.apache.org>.
It's not public because it was added for use in unit tests, and modifying
this value can have very unexpected results (e.g. making it smaller can
trigger a completely different codepath that kicks in when there are too
many files, leading to unexpected cost increases in the pipeline).

Out of curiosity, what is your use case for needing to control this file
size?

On Thu, Sep 29, 2022 at 8:01 AM Ahmed Abualsaud <ah...@google.com>
wrote:

> Hey Julien,
>
> I don't see a problem with exposing that method. That part of the code was
> committed ~6 years ago, my guess is it wasn't requested to be public.
>
> One workaround is to hardcode another value for DEFAULT_MAX_FILE_SIZE [1].
> Would this work temporarily? @Chamikara Jayalath <ch...@google.com> @Reuven
> Lax <re...@google.com> other thoughts?
>
> [1]
> https://github.com/apache/beam/blob/17453e71a81ba774ab451ad141fc8c21ea8770c9/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L109
>
> Best,
> Ahmed
>
> On Wed, Sep 28, 2022 at 4:55 PM Julien Phalip <jp...@gmail.com> wrote:
>
>> Hi,
>>
>> I'd like to control the size of files written to GCS when using
>> BigQueryIO's FILE_LOAD write method.
>>
>> However, it looks like the withMaxFileSize method (
>> https://github.com/apache/beam/blob/948af30a5b665fe74b7052b673e95ff5f5fc426a/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L2597)
>> is not public.
>>
>> Is that intentional? Is there a workaround to control the file size?
>>
>> Thanks,
>>
>> Julien
>>

Re: Why is BigQueryIO.withMaxFileSize() not public?

Posted by Ahmed Abualsaud via user <us...@beam.apache.org>.
Hey Julien,

I don't see a problem with exposing that method. That part of the code was
committed ~6 years ago; my guess is that it simply was never requested to be
made public.

One workaround is to hardcode another value for DEFAULT_MAX_FILE_SIZE [1].
Would this work temporarily? @Chamikara Jayalath <ch...@google.com> @Reuven
Lax <re...@google.com> other thoughts?

[1]
https://github.com/apache/beam/blob/17453e71a81ba774ab451ad141fc8c21ea8770c9/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L109
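
To be clear, that constant lives inside Beam itself (it's not settable from
pipeline code), so the workaround would mean patching and building Beam
locally. Roughly (illustrative value only; the exact declaration in
BatchLoads.java may differ):

    // In a locally patched copy of BatchLoads.java -- not user pipeline code.
    // The shipped default is 4 TiB; BigQuery's per-file limit is 5 TiB, so
    // leave some headroom for file-size estimation slop.
    static final long DEFAULT_MAX_FILE_SIZE = 9L * (1L << 39); // 4.5 TiB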

Best,
Ahmed

On Wed, Sep 28, 2022 at 4:55 PM Julien Phalip <jp...@gmail.com> wrote:

> Hi,
>
> I'd like to control the size of files written to GCS when using
> BigQueryIO's FILE_LOAD write method.
>
> However, it looks like the withMaxFileSize method (
> https://github.com/apache/beam/blob/948af30a5b665fe74b7052b673e95ff5f5fc426a/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L2597)
> is not public.
>
> Is that intentional? Is there a workaround to control the file size?
>
> Thanks,
>
> Julien
>
