Posted to user@beam.apache.org by Rajnil Guha <ra...@gmail.com> on 2021/08/01 19:05:50 UTC

Writing Avro Files to Big Query using Python SDK and Dataflow Runner

Hi Beam Users,

Our pipeline reads Avro files from GCS into Dataflow and writes them into
BigQuery tables. I am using the WriteToBigQuery transform to write my
PCollection contents into BigQuery.
My Avro files contain about 150-200 fields. So far we have tested our
pipeline by spelling out the field information for all of these fields in a
TableSchema object inside the pipeline code, so every time the schema
changes or evolves we need to change our pipeline code.
I was wondering if there is any way to keep the BigQuery table schema
outside the pipeline code and load it into the pipeline from there, as that
would be much easier to maintain.
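
For example, roughly along these lines is what I have in mind (an untested
sketch; the bucket, file and table names below are made up):

import json

import apache_beam as beam
from apache_beam.io.gcp import bigquery_tools
from apache_beam.io.gcp.gcsio import GcsIO


def load_table_schema(gcs_path):
    # Read a schema kept as JSON ({"fields": [...]}) in a GCS file,
    # so the schema lives outside the pipeline code.
    with GcsIO().open(gcs_path) as f:
        schema_json = f.read().decode("utf-8")
    return bigquery_tools.parse_table_schema_from_json(schema_json)


with beam.Pipeline() as pipeline:
    table_schema = load_table_schema("gs://my-bucket/schemas/my_table.json")
    (pipeline
     | "ReadAvro" >> beam.io.ReadFromAvro("gs://my-bucket/input/*.avro")
     | "WriteToBQ" >> beam.io.WriteToBigQuery(
           "my-project:my_dataset.my_table",
           schema=table_schema,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))

The schema file would then be the only thing to update when the schema
changes, since it is read at pipeline-construction time.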

Note: We are using the Python SDK to write our pipelines and are running on
Dataflow.

Thanks & Regards
Rajnil Guha

Re: Writing Avro Files to Big Query using Python SDK and Dataflow Runner

Posted by Pavel Solomin <p....@gmail.com>.
> wrote a generic BigQuery reader or writer

I think I have seen an example here -
https://github.com/the-dagger/dataflow-dynamic-schema/blob/28b7d075c18d6364a67129e56652f452da67a2f6/src/main/java/com/google/cloud/pso/bigquery/BigQuerySchemaMutator.java#L38

This is in Java, but you could try to adapt it for the Python SDK. I don't
know whether that is possible; I use the Java SDK myself for all my stream
processing apps, including Beam apps.
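
Another angle, which is not what the linked Java example does: if the schema
only ever gains new fields, it might be enough to let the BigQuery load job
add the columns itself via schemaUpdateOptions. An untested sketch (the table
name is a placeholder):

import apache_beam as beam

# With the file-loads write method, BigQuery can be allowed to add new
# columns on its own, so the pipeline does not need the full schema.
write = beam.io.WriteToBigQuery(
    "my-project:my_dataset.my_table",  # placeholder table
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
    additional_bq_parameters={
        "schemaUpdateOptions": ["ALLOW_FIELD_ADDITION",
                                "ALLOW_FIELD_RELAXATION"],
    })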

Best Regards,
Pavel Solomin

Tel: +351 962 950 692 | Skype: pavel_solomin | Linkedin
<https://www.linkedin.com/in/pavelsolomin>

Re: Writing Avro Files to Big Query using Python SDK and Dataflow Runner

Posted by Luke Cwik <lc...@google.com>.
The issue is that the encoding passed between transforms needs to store
metadata about what was in each column when the data was read, since it is
passed around in the pipeline. Imagine that column X was a string, was then
deleted, and then re-added as a datetime. These kinds of schema evolutions
typically have business-specific rules about what to do.

I believe there was a user who wrote a custom coder that encoded this extra
information with each row and wrote a generic BigQuery reader or writer
(don't remember which) that could do something like what you want, with
limitations around schema evolution and at the performance cost of passing
the metadata around, but I don't believe this was contributed back to the
community.
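
Just to illustrate the rough shape of such a coder (a hypothetical sketch,
not the implementation I'm referring to; the names are made up):

import json

from apache_beam.coders import Coder


class RowWithSchemaCoder(Coder):
    """Encodes (schema_dict, row_dict) pairs so each row carries its schema."""

    def encode(self, value):
        schema, row = value
        # Serialize schema and row together; downstream transforms can then
        # inspect the schema that was in effect when the row was read.
        return json.dumps({"schema": schema, "row": row}).encode("utf-8")

    def decode(self, encoded):
        payload = json.loads(encoded.decode("utf-8"))
        return payload["schema"], payload["row"]

    def is_deterministic(self):
        # JSON key ordering is not guaranteed, so don't use this as a key
        # coder in GroupByKey.
        return False

The price is exactly the one mentioned above: every element now carries its
schema, which costs extra bytes and extra encode/decode work.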

Try searching through the dev[1]/user[2] e-mail archives.

1: https://lists.apache.org/list.html?dev@beam.apache.org:lte=99M
2: https://lists.apache.org/list.html?user@beam.apache.org:lte=99M
