Posted to dev@beam.apache.org by Ahmed Abualsaud <ah...@google.com> on 2022/06/06 14:38:49 UTC
Standardizing output of WriteToBigQuery
Hey everyone,
I've written a design document to standardize Python SDK's WriteToBigQuery
output.
TLDR:
Beam I/O standards specify that the output of a WriteTo{IO} transform
should be fixed. However, the output of Python SDK’s WriteToBigQuery is
inconsistent, returning a dictionary of two or three different PCollections
depending on the write method:
Method: STREAMING_INSERTS
  Return object type: Dictionary of PCollections
    <https://github.com/apache/beam/blob/4dce7b8857f37608321253073745fe7611a48af9/sdks/python/apache_beam/io/gcp/bigquery.py#L2238-L2242>
  PCollections: FAILED_ROWS, FAILED_ROWS_WITH_ERRORS

Method: FILE_LOADS
  Return object type: Dictionary of PCollections
    <https://github.com/apache/beam/blob/4dce7b8857f37608321253073745fe7611a48af9/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L1219-L1223>
  PCollections: destination_load_jobid_pairs, destination_file_pairs,
    destination_copy_jobid_pairs
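To illustrate the inconsistency from the caller's side, here is a minimal sketch in plain Python. Dicts stand in for the dictionary of PCollections, and the key names simply follow the outputs listed above (the exact keys at runtime may differ); the point is that user code has to branch on the write method to know which keys exist.

```python
def failed_rows_or_jobs(method, outputs):
    """Pull the 'interesting' output from a WriteToBigQuery-style result.

    Illustrative only: `outputs` is a plain dict standing in for the
    dictionary of PCollections, keyed per the table above.
    """
    if method == "STREAMING_INSERTS":
        # Streaming inserts expose the rows that failed to insert.
        return outputs["FAILED_ROWS"]
    if method == "FILE_LOADS":
        # File loads expose (destination, load job id) pairs instead.
        return outputs["destination_load_jobid_pairs"]
    raise ValueError(f"unknown write method: {method}")
```

The branching above is exactly what a fixed output shape would let callers avoid.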
This design doc seeks to explore how to move forward in a way that is more
aligned with the Beam I/O standards we have set. The solution I’m leaning
towards is developing a WriteResult object (similar to that in the Java
SDK) that users can refer to for metadata about the write. This way, users
can expect a consistent, fixed output from their WriteToBigQuery calls.
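As a rough sketch of the idea (not the actual design, which is still up for discussion in the doc), such a WriteResult could present one fixed set of attributes regardless of the write method, with the fields that don't apply simply left unset. The attribute and key names below are hypothetical, mirroring the outputs listed above:

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class WriteResult:
    """Hypothetical fixed-shape result for WriteToBigQuery."""
    # Streaming-insert outputs (None when FILE_LOADS is used).
    failed_rows: Optional[Any] = None
    failed_rows_with_errors: Optional[Any] = None
    # File-load outputs (None when STREAMING_INSERTS is used).
    destination_load_jobid_pairs: Optional[Any] = None
    destination_file_pairs: Optional[Any] = None
    destination_copy_jobid_pairs: Optional[Any] = None


def as_write_result(method, outputs):
    """Wrap the method-dependent output dict in a single fixed shape."""
    if method == "STREAMING_INSERTS":
        return WriteResult(
            failed_rows=outputs.get("FAILED_ROWS"),
            failed_rows_with_errors=outputs.get("FAILED_ROWS_WITH_ERRORS"),
        )
    if method == "FILE_LOADS":
        return WriteResult(
            destination_load_jobid_pairs=outputs.get(
                "destination_load_jobid_pairs"),
            destination_file_pairs=outputs.get("destination_file_pairs"),
            destination_copy_jobid_pairs=outputs.get(
                "destination_copy_jobid_pairs"),
        )
    raise ValueError(f"unknown write method: {method}")
```

With something like this, result.failed_rows is always a valid attribute access, whichever method was used, which is the consistency the Java SDK's WriteResult already provides.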
This document is a work in progress and currently focuses more on how to
move forward than on the specific solution mentioned above, so please feel
free to add comments, suggestions, concerns, etc.
https://docs.google.com/document/d/1w151JYmC1hYSVeKau8nP62vrmvTkbOuEyBawKjWkc30/edit?usp=sharing
Thanks,
Ahmed
Re: Standardizing output of WriteToBigQuery
Posted by Pablo Estrada <pa...@google.com>.
Thanks for the proposal, Ahmed!
I made a couple comments.
Best
-P.