Posted to dev@beam.apache.org by Ahmed Abualsaud <ah...@google.com> on 2022/06/06 14:38:49 UTC
Standardizing output of WriteToBigQuery
Hey everyone,
I've written a design document to standardize Python SDK's WriteToBigQuery
output.
TLDR:
Beam I/O standards specify that the output of a WriteTo{IO} transform
should be fixed. However, the output of Python SDK’s WriteToBigQuery is
inconsistent, returning a dictionary of two or three different PCollections
depending on the write method:
Method: STREAMING_INSERTS
  Return object type: Dictionary of PCollections
    <https://github.com/apache/beam/blob/4dce7b8857f37608321253073745fe7611a48af9/sdks/python/apache_beam/io/gcp/bigquery.py#L2238-L2242>
  PCollections: FAILED_ROWS, FAILED_ROWS_WITH_ERRORS

Method: FILE_LOADS
  Return object type: Dictionary of PCollections
    <https://github.com/apache/beam/blob/4dce7b8857f37608321253073745fe7611a48af9/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L1219-L1223>
  PCollections: destination_load_jobid_pairs, destination_file_pairs,
    destination_copy_jobid_pairs
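To illustrate the inconsistency from the caller's side, here is a minimal sketch in plain Python. Dicts stand in for the dictionary of PCollections, and the key names simply follow the outputs listed above (the exact keys at runtime may differ); the point is that user code has to branch on the write method to know which keys exist.

```python
def failed_rows_or_jobs(method, outputs):
    """Pull the 'interesting' output from a WriteToBigQuery-style result.

    Illustrative only: `outputs` is a plain dict standing in for the
    dictionary of PCollections, keyed per the table above.
    """
    if method == "STREAMING_INSERTS":
        # Streaming inserts expose the rows that failed to insert.
        return outputs["FAILED_ROWS"]
    if method == "FILE_LOADS":
        # File loads expose (destination, load job id) pairs instead.
        return outputs["destination_load_jobid_pairs"]
    raise ValueError(f"unknown write method: {method}")
```

The branching above is exactly what a fixed output shape would let callers avoid.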
This design doc seeks to explore how to move forward in a way that is more
aligned with the Beam I/O standards we have set. The solution I’m leaning
towards is developing a WriteResult object (similar to that in the Java
SDK) that users can refer to for metadata about the write. This way, users
can expect a consistent, fixed output from their WriteToBigQuery calls.
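As a rough sketch of the idea (not the actual design, which is still up for discussion in the doc), such a WriteResult could present one fixed set of attributes regardless of the write method, with the fields that don't apply simply left unset. The attribute and key names below are hypothetical, mirroring the outputs listed above:

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class WriteResult:
    """Hypothetical fixed-shape result for WriteToBigQuery."""
    # Streaming-insert outputs (None when FILE_LOADS is used).
    failed_rows: Optional[Any] = None
    failed_rows_with_errors: Optional[Any] = None
    # File-load outputs (None when STREAMING_INSERTS is used).
    destination_load_jobid_pairs: Optional[Any] = None
    destination_file_pairs: Optional[Any] = None
    destination_copy_jobid_pairs: Optional[Any] = None


def as_write_result(method, outputs):
    """Wrap the method-dependent output dict in a single fixed shape."""
    if method == "STREAMING_INSERTS":
        return WriteResult(
            failed_rows=outputs.get("FAILED_ROWS"),
            failed_rows_with_errors=outputs.get("FAILED_ROWS_WITH_ERRORS"),
        )
    if method == "FILE_LOADS":
        return WriteResult(
            destination_load_jobid_pairs=outputs.get(
                "destination_load_jobid_pairs"),
            destination_file_pairs=outputs.get("destination_file_pairs"),
            destination_copy_jobid_pairs=outputs.get(
                "destination_copy_jobid_pairs"),
        )
    raise ValueError(f"unknown write method: {method}")
```

With something like this, result.failed_rows is always a valid attribute access, whichever method was used, which is the consistency the Java SDK's WriteResult already provides.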
This document is a work in progress and currently focuses more on how to
move forward than on the specific solution mentioned above, so please feel
free to add comments, suggestions, concerns, etc.
https://docs.google.com/document/d/1w151JYmC1hYSVeKau8nP62vrmvTkbOuEyBawKjWkc30/edit?usp=sharing
Thanks,
Ahmed
Re: Standardizing output of WriteToBigQuery
Posted by Pablo Estrada <pa...@google.com>.
Thanks for the proposal, Ahmed!
I made a couple comments.
Best
-P.