Posted to user@beam.apache.org by Zdenko Hrcek <zd...@gmail.com> on 2019/09/03 21:23:28 UTC

Python WriteToBigQuery with FILE_LOAD & additional_bq_parameters not working

Greetings,

I am using Beam 2.15 and Python 2.7.
I am running a batch job that reads data from CSV and uploads it to
BigQuery. I like that, instead of streaming to BigQuery, I can use "file
load" to load the table all at once.

In my case, there are a few "bad" records in the input (it's geo data, and
during a manual upload BigQuery doesn't accept those as valid GEOGRAPHY
values); this is easily solved by setting the maximum number of bad
records. If I understand correctly, WriteToBigQuery supports
"additional_bq_parameters", but for some reason, when running the pipeline
on the Dataflow runner, those settings appear to be ignored.

I played with the example from the documentation
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
and put my code in this gist:
https://gist.github.com/zdenulo/99877307981b4d372df5a662d581a5df
The table should be created partitioned on a field and clustered, but when
running on Dataflow that doesn't happen. When I run on DirectRunner it
works as expected. Interestingly, when I add the maxBadRecords parameter to
additional_bq_parameters, DirectRunner complains that it doesn't recognize
that option.
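
For reference, the write step of my pipeline looks roughly like the sketch
below (table, schema and field names here are placeholders rather than my
real ones; the actual code is in the gist above):

    import apache_beam as beam

    additional_bq_parameters = {
        'timePartitioning': {'type': 'DAY', 'field': 'ts'},
        'clustering': {'fields': ['country']},
        # Adding 'maxBadRecords': 10 here is what DirectRunner rejects.
    }

    # 'rows' is the PCollection of dicts parsed from the CSV earlier in
    # the pipeline.
    rows | 'WriteToBQ' >> beam.io.WriteToBigQuery(
        table='my-project:my_dataset.my_table',
        schema='ts:TIMESTAMP,country:STRING,geo:GEOGRAPHY',
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
        additional_bq_parameters=additional_bq_parameters)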

This is the first time using this setup/combination so I'm just wondering
if I overlooked something. I would appreciate any help.

Best regards,
Zdenko


_______________________
 http://www.the-swamp.info

Re: Python WriteToBigQuery with FILE_LOAD & additional_bq_parameters not working

Posted by Zdenko Hrcek <zd...@gmail.com>.
Hello Pablo,
thanks for the explanation.

Best regards,
Zdenko
_______________________
 http://www.the-swamp.info



On Thu, Sep 5, 2019 at 8:10 PM Pablo Estrada <pa...@google.com> wrote:

> Hi Zdenko,
> sorry about the confusion. The reason behind this is that we have not yet
> fully changed the batch behavior of WriteToBigQuery, so to use
> BigQueryBatchFileLoads as the implementation of WriteToBigQuery, you need
> to pass 'use_beam_bq_sink' as an experiment to activate it.
> As you rightly figured out, you can use BigQueryBatchFileLoads directly.
> Best
> -P.
>
> On Thu, Sep 5, 2019 at 6:06 AM Zdenko Hrcek <zd...@gmail.com> wrote:
>
>> Thanks for the code sample,
>>
>> when I switched to bigquery_file_loads.BigQueryBatchFileLoads instead of
>> bigquery.WriteToBigQuery, it works OK now. Not sure why it doesn't work
>> with WriteToBigQuery, since it's using BigQueryBatchFileLoads under the
>> hood...
>>
>> Thanks for the help.
>> Zdenko
>> _______________________
>>  http://www.the-swamp.info
>>
>>
>>
>> On Wed, Sep 4, 2019 at 6:55 PM Chamikara Jayalath <ch...@google.com>
>> wrote:
>>
>>> +Pablo Estrada <pa...@google.com> who added this.
>>>
>>> I don't think we have tested this specific option, but I believe the
>>> additional BQ parameters option was added in a generic way to accept all
>>> additional parameters.
>>>
>>> Looking at the code, it seems like additional parameters do get passed
>>> through to load jobs:
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L427
>>>
>>> One thing you can try is running a BQ load job directly with the same
>>> set of data and options to see if the data gets loaded.
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Tue, Sep 3, 2019 at 2:24 PM Zdenko Hrcek <zd...@gmail.com> wrote:
>>>
>>>> Greetings,
>>>>
>>>> I am using Beam 2.15 and Python 2.7.
>>>> I am running a batch job that reads data from CSV and uploads it to
>>>> BigQuery. I like that, instead of streaming to BigQuery, I can use
>>>> "file load" to load the table all at once.
>>>>
>>>> In my case, there are a few "bad" records in the input (it's geo data,
>>>> and during a manual upload BigQuery doesn't accept those as valid
>>>> GEOGRAPHY values); this is easily solved by setting the maximum number
>>>> of bad records. If I understand correctly, WriteToBigQuery supports
>>>> "additional_bq_parameters", but for some reason, when running the
>>>> pipeline on the Dataflow runner, those settings appear to be ignored.
>>>>
>>>> I played with the example from the documentation
>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
>>>> and put my code in this gist:
>>>> https://gist.github.com/zdenulo/99877307981b4d372df5a662d581a5df
>>>> The table should be created partitioned on a field and clustered, but
>>>> when running on Dataflow that doesn't happen. When I run on DirectRunner
>>>> it works as expected. Interestingly, when I add the maxBadRecords
>>>> parameter to additional_bq_parameters, DirectRunner complains that it
>>>> doesn't recognize that option.
>>>>
>>>> This is the first time using this setup/combination so I'm just
>>>> wondering if I overlooked something. I would appreciate any help.
>>>>
>>>> Best regards,
>>>> Zdenko
>>>>
>>>>
>>>> _______________________
>>>>  http://www.the-swamp.info
>>>>
>>>>

Re: Python WriteToBigQuery with FILE_LOAD & additional_bq_parameters not working

Posted by Pablo Estrada <pa...@google.com>.
Hi Zdenko,
sorry about the confusion. The reason behind this is that we have not yet
fully changed the batch behavior of WriteToBigQuery, so to use
BigQueryBatchFileLoads as the implementation of WriteToBigQuery, you need
to pass 'use_beam_bq_sink' as an experiment to activate it.
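
For example, passing the flag through the pipeline options should turn it
on (the project and bucket values below are just placeholders):

    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        '--runner=DataflowRunner',
        '--project=my-project',                # placeholder
        '--temp_location=gs://my-bucket/tmp',  # placeholder
        '--experiments=use_beam_bq_sink',      # activates the new BQ sink
    ])
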
As you rightly figured out, you can use BigQueryBatchFileLoads directly.
Best
-P.

On Thu, Sep 5, 2019 at 6:06 AM Zdenko Hrcek <zd...@gmail.com> wrote:

> Thanks for the code sample,
>
> when I switched to bigquery_file_loads.BigQueryBatchFileLoads instead of
> bigquery.WriteToBigQuery, it works OK now. Not sure why it doesn't work
> with WriteToBigQuery, since it's using BigQueryBatchFileLoads under the
> hood...
>
> Thanks for the help.
> Zdenko
> _______________________
>  http://www.the-swamp.info
>
>
>
> On Wed, Sep 4, 2019 at 6:55 PM Chamikara Jayalath <ch...@google.com>
> wrote:
>
>> +Pablo Estrada <pa...@google.com> who added this.
>>
>> I don't think we have tested this specific option, but I believe the
>> additional BQ parameters option was added in a generic way to accept all
>> additional parameters.
>>
>> Looking at the code, it seems like additional parameters do get passed
>> through to load jobs:
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L427
>>
>> One thing you can try is running a BQ load job directly with the same
>> set of data and options to see if the data gets loaded.
>>
>> Thanks,
>> Cham
>>
>> On Tue, Sep 3, 2019 at 2:24 PM Zdenko Hrcek <zd...@gmail.com> wrote:
>>
>>> Greetings,
>>>
>>> I am using Beam 2.15 and Python 2.7.
>>> I am running a batch job that reads data from CSV and uploads it to
>>> BigQuery. I like that, instead of streaming to BigQuery, I can use
>>> "file load" to load the table all at once.
>>>
>>> In my case, there are a few "bad" records in the input (it's geo data,
>>> and during a manual upload BigQuery doesn't accept those as valid
>>> GEOGRAPHY values); this is easily solved by setting the maximum number
>>> of bad records. If I understand correctly, WriteToBigQuery supports
>>> "additional_bq_parameters", but for some reason, when running the
>>> pipeline on the Dataflow runner, those settings appear to be ignored.
>>>
>>> I played with the example from the documentation
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
>>> and put my code in this gist:
>>> https://gist.github.com/zdenulo/99877307981b4d372df5a662d581a5df
>>> The table should be created partitioned on a field and clustered, but
>>> when running on Dataflow that doesn't happen. When I run on DirectRunner
>>> it works as expected. Interestingly, when I add the maxBadRecords
>>> parameter to additional_bq_parameters, DirectRunner complains that it
>>> doesn't recognize that option.
>>>
>>> This is the first time using this setup/combination so I'm just
>>> wondering if I overlooked something. I would appreciate any help.
>>>
>>> Best regards,
>>> Zdenko
>>>
>>>
>>> _______________________
>>>  http://www.the-swamp.info
>>>
>>>

Re: Python WriteToBigQuery with FILE_LOAD & additional_bq_parameters not working

Posted by Zdenko Hrcek <zd...@gmail.com>.
Thanks for the code sample,

when I switched to bigquery_file_loads.BigQueryBatchFileLoads instead of
bigquery.WriteToBigQuery, it works OK now. Not sure why it doesn't work with
WriteToBigQuery, since it's using BigQueryBatchFileLoads under the hood...
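
In case it helps anyone else, the working version looks roughly like this
(table, bucket and field names below are placeholders, and the parameter
names are as I remember them from bigquery_file_loads.py, so please check
the source for the exact signature):

    from apache_beam.io.gcp import bigquery_file_loads

    rows | 'LoadToBQ' >> bigquery_file_loads.BigQueryBatchFileLoads(
        destination='my-project:my_dataset.my_table',
        custom_gcs_temp_location='gs://my-bucket/tmp',
        schema={'fields': [
            {'name': 'ts', 'type': 'TIMESTAMP'},
            {'name': 'country', 'type': 'STRING'},
            {'name': 'geo', 'type': 'GEOGRAPHY'},
        ]},
        additional_bq_parameters={
            'timePartitioning': {'type': 'DAY', 'field': 'ts'},
            'clustering': {'fields': ['country']},
        })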

Thanks for the help.
Zdenko
_______________________
 http://www.the-swamp.info



On Wed, Sep 4, 2019 at 6:55 PM Chamikara Jayalath <ch...@google.com>
wrote:

> +Pablo Estrada <pa...@google.com> who added this.
>
> I don't think we have tested this specific option, but I believe the
> additional BQ parameters option was added in a generic way to accept all
> additional parameters.
>
> Looking at the code, it seems like additional parameters do get passed
> through to load jobs:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L427
>
> One thing you can try is running a BQ load job directly with the same set
> of data and options to see if the data gets loaded.
>
> Thanks,
> Cham
>
> On Tue, Sep 3, 2019 at 2:24 PM Zdenko Hrcek <zd...@gmail.com> wrote:
>
>> Greetings,
>>
>> I am using Beam 2.15 and Python 2.7.
>> I am running a batch job that reads data from CSV and uploads it to
>> BigQuery. I like that, instead of streaming to BigQuery, I can use
>> "file load" to load the table all at once.
>>
>> In my case, there are a few "bad" records in the input (it's geo data,
>> and during a manual upload BigQuery doesn't accept those as valid
>> GEOGRAPHY values); this is easily solved by setting the maximum number
>> of bad records. If I understand correctly, WriteToBigQuery supports
>> "additional_bq_parameters", but for some reason, when running the
>> pipeline on the Dataflow runner, those settings appear to be ignored.
>>
>> I played with the example from the documentation
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
>> and put my code in this gist:
>> https://gist.github.com/zdenulo/99877307981b4d372df5a662d581a5df
>> The table should be created partitioned on a field and clustered, but
>> when running on Dataflow that doesn't happen. When I run on DirectRunner
>> it works as expected. Interestingly, when I add the maxBadRecords
>> parameter to additional_bq_parameters, DirectRunner complains that it
>> doesn't recognize that option.
>>
>> This is the first time using this setup/combination so I'm just wondering
>> if I overlooked something. I would appreciate any help.
>>
>> Best regards,
>> Zdenko
>>
>>
>> _______________________
>>  http://www.the-swamp.info
>>
>>

Re: Python WriteToBigQuery with FILE_LOAD & additional_bq_parameters not working

Posted by Chamikara Jayalath <ch...@google.com>.
+Pablo Estrada <pa...@google.com> who added this.

I don't think we have tested this specific option, but I believe the
additional BQ parameters option was added in a generic way to accept all
additional parameters.

Looking at the code, it seems like additional parameters do get passed
through to load jobs:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L427

One thing you can try is running a BQ load job directly with the same set
of data and options to see if the data gets loaded.
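
For example, with the google-cloud-bigquery client, something along these
lines would exercise the same load-job options outside of Beam (the bucket,
table and field names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client(project='my-project')  # placeholder project

    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.CSV
    job_config.skip_leading_rows = 1
    job_config.max_bad_records = 10
    job_config.time_partitioning = bigquery.TimePartitioning(field='ts')
    job_config.clustering_fields = ['country']
    job_config.schema = [
        bigquery.SchemaField('ts', 'TIMESTAMP'),
        bigquery.SchemaField('country', 'STRING'),
        bigquery.SchemaField('geo', 'GEOGRAPHY'),
    ]

    load_job = client.load_table_from_uri(
        'gs://my-bucket/input.csv',               # placeholder input file
        'my-project.my_dataset.my_table_manual',  # placeholder target table
        job_config=job_config)
    load_job.result()  # wait for completion and surface any errors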

Thanks,
Cham

On Tue, Sep 3, 2019 at 2:24 PM Zdenko Hrcek <zd...@gmail.com> wrote:

> Greetings,
>
> I am using Beam 2.15 and Python 2.7.
> I am running a batch job that reads data from CSV and uploads it to
> BigQuery. I like that, instead of streaming to BigQuery, I can use
> "file load" to load the table all at once.
>
> In my case, there are a few "bad" records in the input (it's geo data,
> and during a manual upload BigQuery doesn't accept those as valid
> GEOGRAPHY values); this is easily solved by setting the maximum number
> of bad records. If I understand correctly, WriteToBigQuery supports
> "additional_bq_parameters", but for some reason, when running the
> pipeline on the Dataflow runner, those settings appear to be ignored.
>
> I played with the example from the documentation
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
> and put my code in this gist:
> https://gist.github.com/zdenulo/99877307981b4d372df5a662d581a5df
> The table should be created partitioned on a field and clustered, but
> when running on Dataflow that doesn't happen. When I run on DirectRunner
> it works as expected. Interestingly, when I add the maxBadRecords
> parameter to additional_bq_parameters, DirectRunner complains that it
> doesn't recognize that option.
>
> This is the first time using this setup/combination so I'm just wondering
> if I overlooked something. I would appreciate any help.
>
> Best regards,
> Zdenko
>
>
> _______________________
>  http://www.the-swamp.info
>
>