Posted to user@beam.apache.org by Rajnil Guha <ra...@gmail.com> on 2022/02/19 12:06:13 UTC

Apache Beam BigQueryIO Exception

Hi Beam Users,

We have a Dataflow pipeline which reads and writes data from and into
BigQuery. The basic structure of the pipeline is as follows:

import apache_beam as beam

query = "<<select a bunch of columns from project_1.datasetId.tableId>>"

with beam.Pipeline(options=options) as p:
    read_bq_records = (
        p
        | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
            query=query,
            use_standard_sql=True,
        )
    )

    # <<do some transformations>>

    write_bq_records = (
        previous_pcol
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "project_2:datasetId.tableId",
            schema=some_schema,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )

Our pipeline fails with a message saying that it cannot create a temp
dataset in project_2 (the destination project), because the service account
we use to run Dataflow jobs does not have the "bigquery.datasets.create"
permission in project_2. We have tried another pipeline that reads data
from GCS and writes to BigQuery, and it works fine, so the issue seems to
be caused by reading data from BigQuery.
I am not exactly aware of the internal workings of BigQueryIO, but my
initial understanding is that it snapshots the source table into a temp
dataset/table in the destination project, from which it then writes the
data into the destination table.
It would be very helpful if someone familiar with this behaviour, or who
has faced a similar exception, could shed some light on how BigQueryIO
works.

Thanks & Regards
Rajnil

Re: Apache Beam BigQueryIO Exception

Posted by Rajnil Guha <ra...@gmail.com>.
Hi,

Thank you so much for your response. I tried specifying a temp dataset
using the temp_dataset parameter of ReadFromBigQuery, and it worked.
I was also looking at the BigQuerySource class but could not find any such
parameter for setting a temp dataset there, and my job fails with the same
permission error. Is there a way to do this using the BigQuerySource class?
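
For reference, here is a minimal sketch of the change that worked, assuming
the temp dataset already exists (the names project_2 and beam_temp are
placeholders):

from apache_beam.io.gcp.internal.clients.bigquery import DatasetReference

read_bq_records = (
    p
    | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
        query=query,
        use_standard_sql=True,
        # Reuse a pre-created dataset for the query's temporary table,
        # so the service account does not need bigquery.datasets.create.
        temp_dataset=DatasetReference(
            projectId="project_2",
            datasetId="beam_temp",
        ),
    )
)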

Also, could you please help me understand the difference between
ReadFromBigQuery and BigQuerySource in terms of how they work [1]? I
understand both are used to read from BigQuery.

[1]
https://beam.apache.org/releases/pydoc/2.36.0/apache_beam.io.gcp.bigquery.html#apache_beam.io.gcp.bigquery.ReadFromBigQuery

Thanks & Regards
Rajnil

On Tue, 22 Feb 2022 at 1:56 PM, Israel Herraiz <ih...@google.com> wrote:

> Yes, if you read from a query, BigQueryIO will write the results of the
> query to a temporary table in a temporary dataset.
>
> If you don't specify that temp dataset, Dataflow will try to create it.
>
> As in the Java SDK, if you don't want to give the service account
> permission to create datasets, you can instead specify an existing
> dataset to be used as the "temporary" dataset for the query results. See
> the temp_dataset option in ReadFromBigQuery:
> https://beam.apache.org/releases/pydoc/2.36.0/apache_beam.io.gcp.bigquery.html#apache_beam.io.gcp.bigquery.ReadFromBigQuery
>
> Dataflow will need permissions to be able to create tables inside that
> dataset.
>

Re: Apache Beam BigQueryIO Exception

Posted by Israel Herraiz <ih...@google.com>.
Yes, if you read from a query, BigQueryIO will write the results of the
query to a temporary table in a temporary dataset.

If you don't specify that temp dataset, Dataflow will try to create it.

As in the Java SDK, if you don't want to give the service account
permission to create datasets, you can instead specify an existing
dataset to be used as the "temporary" dataset for the query results. See
the temp_dataset option in ReadFromBigQuery:
https://beam.apache.org/releases/pydoc/2.36.0/apache_beam.io.gcp.bigquery.html#apache_beam.io.gcp.bigquery.ReadFromBigQuery

Dataflow will need permissions to be able to create tables inside that
dataset.
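
For example, here is a one-time setup sketch using the google-cloud-bigquery
client (the project and dataset names are placeholders; you could equally
create the dataset in the Cloud Console or with the bq CLI):

from google.cloud import bigquery

# Pre-create the dataset that ReadFromBigQuery will use for its
# temporary query-result tables.
client = bigquery.Client(project="project_2")
dataset = bigquery.Dataset("project_2.beam_temp")
dataset.location = "US"  # must match the location of the tables you query
client.create_dataset(dataset, exists_ok=True)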