Posted to users@hop.apache.org by Fabian Peters <po...@mercadu.de> on 2022/08/31 07:50:04 UTC

Workflow on GCP with BigQuery Output

Good morning!

I'm putting together my Dataflow deployment and am running into another problem I don't know how to deal with: I'm running a pipeline via Dataflow, which contains a "Workflow executor" transform. The workflow contains a number of pipelines that have their run configuration set to Beam-Direct. In principle, this works fine. (Yeah!)

However, in this setup a BigQuery Output fails with a "java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_sites_FOO_ID, reached max retries: 3, last failed job: null." I see the same when running just the pipeline (or any other with BigQuery Output) via Beam-Direct locally, which makes me think that the GCP credentials are not being picked up? Is there something I need to configure?

cheers

Fabian

P.S.: Logs from running locally with Beam-Direct:

2022/08/31 09:30:07 - sites - ERROR: Error starting the Beam pipeline
2022/08/31 09:30:07 - sites - ERROR: org.apache.hop.core.exception.HopException: 
2022/08/31 09:30:07 - sites - Error executing pipeline with runner Direct
2022/08/31 09:30:07 - sites - java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000, reached max retries: 3, last failed job: null.
2022/08/31 09:30:07 - sites - 
2022/08/31 09:30:07 - sites - 	at org.apache.hop.beam.engines.BeamPipelineEngine.executePipeline(BeamPipelineEngine.java:258)
2022/08/31 09:30:07 - sites - 	at org.apache.hop.beam.engines.BeamPipelineEngine.lambda$startThreads$0(BeamPipelineEngine.java:305)
2022/08/31 09:30:07 - sites - 	at java.base/java.lang.Thread.run(Thread.java:829)
2022/08/31 09:30:07 - sites - Caused by: org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000, reached max retries: 3, last failed job: null.
2022/08/31 09:30:07 - sites - 	at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:373)
2022/08/31 09:30:07 - sites - 	at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:341)
2022/08/31 09:30:07 - sites - 	at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:218)
2022/08/31 09:30:07 - sites - 	at org.apache.hop.beam.engines.BeamPipelineEngine.executePipeline(BeamPipelineEngine.java:246)
2022/08/31 09:30:07 - sites - 	... 2 more
2022/08/31 09:30:07 - sites - Caused by: java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000, reached max retries: 3, last failed job: null.
2022/08/31 09:30:07 - sites - 	at org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:199)
2022/08/31 09:30:07 - sites - 	at org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:152)
2022/08/31 09:30:07 - sites - 	at org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn.finishBundle(WriteTables.java:380)


Re: Workflow on GCP with BigQuery Output

Posted by Fabian Peters <po...@mercadu.de>.
Hi Hans,

Yes, changing the "Temp location" in the Beam-Direct Pipeline Run Configuration to the GCS URL was sufficient. Now it works, both when running via Beam-Direct locally and when running a pipeline with a "Workflow executor" transform on Dataflow.

cheers

Fabian

> On 31.08.2022 at 17:39, Hans Van Akelyen <ha...@gmail.com> wrote:
> 
> Thank you Israel for the support!
> 
> @Fabian, so the solution in your case was to change the Temp location in the Direct runner to a GCS path?
> 
> In our documentation about the runner config we state that it should be a GCS location for the Dataflow runner; for the Direct runner we do not state this explicitly. I'll add a note to the docs that when using non-local IOs such as BigQuery, a GCS path is required here.
> 
> I have also created a ticket to expose the gcpTempLocation, currently only the tempLocation is configurable via the UI.
> 
> Cheers,
> Hans
> 
> 
> 
> 
> On Wed, 31 Aug 2022 at 17:10, Israel Herraiz via users <users@hop.apache.org> wrote:
> I am searching in Google, and I cannot find any reference. 
> 
> I seem to remember that the stack trace will tell you something like "temp location is not in GCS".
> 
> In any case, that temp location depends on the method used to write to BigQuery. The default method in Beam is FILE_LOADS, which will create a BigQuery load job (https://cloud.google.com/bigquery/docs/batch-loading-data). Those jobs will read data from GCS.
> 
> For FILE_LOADS, Beam creates Avro files in the tempLocation of the pipeline, and uses the location of those files as an input parameter for the BQ job. So it has to be a location in GCS.
> 
> Now, tempLocation is used for more things. If you want to use a different tempLocation for the rest of the pipeline, you can use the option --gcpTempLocation in combination with --tempLocation. BigQueryIO will use gcpTempLocation if it is set, and it will fall back to tempLocation if gcpTempLocation is not set.
> 
> Bear also in mind that if you are using a different write method (e.g. STORAGE_WRITE_API), Beam will not generate files, so whether tempLocation is in GCS or not does not matter, and the data will be directly written to BigQuery (https://cloud.google.com/bigquery/docs/write-api-batch).
> 
> These are the write methods that can be used with Beam and BigQuery: https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html
> 
> Kind regards,
> Israel
> 
> 
> On Wed, 31 Aug 2022 at 16:50, Fabian Peters <post@mercadu.de> wrote:
> Hi Israel,
> 
> That was it, many thanks! I had it set to "${java.io.tmpdir}". Is the requirement to use a GCS location documented somewhere?
> 
> cheers
> 
> Fabian
> 
>> On 31.08.2022 at 11:25, Israel Herraiz via users <users@hop.apache.org> wrote:
>> 
>> What are the command line arguments that you are using for those direct runner pipelines? For instance, for BigQuery you will need to set --tempLocation to a GCS location for the BQ jobs to work.
>> 
>> 
>> On Wed, 31 Aug 2022 at 09:50, Fabian Peters <post@mercadu.de> wrote:
>> Good morning!
>> 
>> I'm putting together my Dataflow deployment and am running into another problem I don't know how to deal with: I'm running a pipeline via Dataflow, which contains a "Workflow executor" transform. The workflow contains a number of pipelines that have their run configuration set to Beam-Direct. In principle, this works fine. (Yeah!)
>> 
>> However, in this setup a BigQuery Output fails with a "java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_sites_FOO_ID, reached max retries: 3, last failed job: null." I see the same when running just the pipeline (or any other with BigQuery Output) via Beam-Direct locally, which makes me think that the GCP credentials are not being picked up? Is there something I need to configure?
>> 
>> cheers
>> 
>> Fabian
>> 
>> P.S.: Logs from running locally with Beam-Direct:
>> 
>> 2022/08/31 09:30:07 - sites - ERROR: Error starting the Beam pipeline
>> 2022/08/31 09:30:07 - sites - ERROR: org.apache.hop.core.exception.HopException: 
>> 2022/08/31 09:30:07 - sites - Error executing pipeline with runner Direct
>> 2022/08/31 09:30:07 - sites - java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000, reached max retries: 3, last failed job: null.
>> 2022/08/31 09:30:07 - sites - 
>> 2022/08/31 09:30:07 - sites -   at org.apache.hop.beam.engines.BeamPipelineEngine.executePipeline(BeamPipelineEngine.java:258)
>> 2022/08/31 09:30:07 - sites -   at org.apache.hop.beam.engines.BeamPipelineEngine.lambda$startThreads$0(BeamPipelineEngine.java:305)
>> 2022/08/31 09:30:07 - sites -   at java.base/java.lang.Thread.run(Thread.java:829)
>> 2022/08/31 09:30:07 - sites - Caused by: org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000, reached max retries: 3, last failed job: null.
>> 2022/08/31 09:30:07 - sites -   at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:373)
>> 2022/08/31 09:30:07 - sites -   at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:341)
>> 2022/08/31 09:30:07 - sites -   at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:218)
>> 2022/08/31 09:30:07 - sites -   at org.apache.hop.beam.engines.BeamPipelineEngine.executePipeline(BeamPipelineEngine.java:246)
>> 2022/08/31 09:30:07 - sites -   ... 2 more
>> 2022/08/31 09:30:07 - sites - Caused by: java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000, reached max retries: 3, last failed job: null.
>> 2022/08/31 09:30:07 - sites -   at org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:199)
>> 2022/08/31 09:30:07 - sites -   at org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:152)
>> 2022/08/31 09:30:07 - sites -   at org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn.finishBundle(WriteTables.java:380)
>> 
> 


Re: Workflow on GCP with BigQuery Output

Posted by Hans Van Akelyen <ha...@gmail.com>.
Thank you Israel for the support!

@Fabian, so the solution in your case was to change the Temp location in
the Direct runner to a GCS path?

In our documentation about the runner config we state that it should be a
GCS location for the Dataflow runner; for the Direct runner we do not state
this explicitly. I'll add a note to the docs that when using non-local IOs
such as BigQuery, a GCS path is required here.

I have also created a ticket to expose the gcpTempLocation, currently only
the tempLocation is configurable via the UI.

Cheers,
Hans




On Wed, 31 Aug 2022 at 17:10, Israel Herraiz via users <us...@hop.apache.org>
wrote:

> I am searching in Google, and I cannot find any reference.
>
> I seem to remember that the stack trace will tell you something like "temp
> location is not in GCS".
>
> In any case, that temp location depends on the method used to write to
> BigQuery. The default method in Beam is FILE_LOADS, which will create
> a BigQuery job (https://cloud.google.com/bigquery/docs/batch-loading-data).
> Those jobs will read data from GCS.
>
> For FILE_LOADS, Beam creates Avro files in the tempLocation of the
> pipeline, and uses the location of those files as an input parameter for
> the BQ job. So it has to be a location in GCS.
>
> Now, tempLocation is used for more things. If you want to use a different
> tempLocation for the rest of the pipeline, you can use the option
> --gcpTempLocation in combination with --tempLocation. BigQueryIO will use
> gcpTempLocation if it is set, and it will fall back to tempLocation if
> gcpTempLocation is not set.
>
> Bear also in mind that if you are using a different write method (e.g.
> STORAGE_WRITE_API), Beam will not generate files, so whether tempLocation
> is in GCS or not does not matter, and the data will be directly written to
> BigQuery (https://cloud.google.com/bigquery/docs/write-api-batch).
>
> These are the write methods that can be used with Beam and BigQuery:
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html
>
> Kind regards,
> Israel
>
>
> On Wed, 31 Aug 2022 at 16:50, Fabian Peters <po...@mercadu.de> wrote:
>
>> Hi Israel,
>>
>> That was it, many thanks! I had it set to "${java.io.tmpdir}". Is the
>> requirement to use a GCS location documented somewhere?
>>
>> cheers
>>
>> Fabian
>>
>> On 31.08.2022 at 11:25, Israel Herraiz via users <
>> users@hop.apache.org> wrote:
>>
>> What are the command line arguments that you are using for those direct
>> runner pipelines? For instance, for BigQuery you will need to set
>> --tempLocation to a GCS location for the BQ jobs to work.
>>
>>
>> On Wed, 31 Aug 2022 at 09:50, Fabian Peters <po...@mercadu.de> wrote:
>>
>>> Good morning!
>>>
>>> I'm putting together my Dataflow deployment and am running into another
>>> problem I don't know how to deal with: I'm running a pipeline via Dataflow,
>>> which contains a "Workflow executor" transform. The workflow contains a
>>> number of pipelines that have their run configuration set to Beam-Direct.
>>> In principle, this works fine. (Yeah!)
>>>
>>> However, in this setup a BigQuery Output fails with a
>>> "java.lang.RuntimeException: Failed to create job with prefix
>>> beam_bq_job_LOAD_sites_FOO_ID, reached max retries: 3, last failed job:
>>> null." I see the same when running just the pipeline (or any other with
>>> BigQuery Output) via Beam-Direct locally, which makes me think that the GCP
>>> credentials are not being picked up? Is there something I need to configure?
>>>
>>> cheers
>>>
>>> Fabian
>>>
>>> P.S.: Logs from running locally with Beam-Direct:
>>>
>>> 2022/08/31 09:30:07 - sites - ERROR: Error starting the Beam pipeline
>>> 2022/08/31 09:30:07 - sites - ERROR:
>>> org.apache.hop.core.exception.HopException:
>>> 2022/08/31 09:30:07 - sites - Error executing pipeline with runner Direct
>>> 2022/08/31 09:30:07 - sites - java.lang.RuntimeException: Failed to
>>> create job with prefix
>>> beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000,
>>> reached max retries: 3, last failed job: null.
>>> 2022/08/31 09:30:07 - sites -
>>> 2022/08/31 09:30:07 - sites -   at
>>> org.apache.hop.beam.engines.BeamPipelineEngine.executePipeline(BeamPipelineEngine.java:258)
>>> 2022/08/31 09:30:07 - sites -   at
>>> org.apache.hop.beam.engines.BeamPipelineEngine.lambda$startThreads$0(BeamPipelineEngine.java:305)
>>> 2022/08/31 09:30:07 - sites -   at
>>> java.base/java.lang.Thread.run(Thread.java:829)
>>> 2022/08/31 09:30:07 - sites - Caused by:
>>> org.apache.beam.sdk.Pipeline$PipelineExecutionException:
>>> java.lang.RuntimeException: Failed to create job with prefix
>>> beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000,
>>> reached max retries: 3, last failed job: null.
>>> 2022/08/31 09:30:07 - sites -   at
>>> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:373)
>>> 2022/08/31 09:30:07 - sites -   at
>>> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:341)
>>> 2022/08/31 09:30:07 - sites -   at
>>> org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:218)
>>> 2022/08/31 09:30:07 - sites -   at
>>> org.apache.hop.beam.engines.BeamPipelineEngine.executePipeline(BeamPipelineEngine.java:246)
>>> 2022/08/31 09:30:07 - sites -   ... 2 more
>>> 2022/08/31 09:30:07 - sites - Caused by: java.lang.RuntimeException:
>>> Failed to create job with prefix
>>> beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000,
>>> reached max retries: 3, last failed job: null.
>>> 2022/08/31 09:30:07 - sites -   at
>>> org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:199)
>>> 2022/08/31 09:30:07 - sites -   at
>>> org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:152)
>>> 2022/08/31 09:30:07 - sites -   at
>>> org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn.finishBundle(WriteTables.java:380)
>>>
>>>
>>

Re: Workflow on GCP with BigQuery Output

Posted by Israel Herraiz via users <us...@hop.apache.org>.
I am searching in Google, and I cannot find any reference.

I seem to remember that the stack trace will tell you something like "temp
location is not in GCS".

In any case, that temp location depends on the method used to write to
BigQuery. The default method in Beam is FILE_LOADS, which will create
a BigQuery job (https://cloud.google.com/bigquery/docs/batch-loading-data).
Those jobs will read data from GCS.

For FILE_LOADS, Beam creates Avro files in the tempLocation of the
pipeline, and uses the location of those files as an input parameter for
the BQ job. So it has to be a location in GCS.
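For illustration, a hypothetical launch command for such a pipeline (the jar
and main class are placeholders; --runner, --project and --tempLocation are
standard Beam pipeline options):

```shell
# Hypothetical sketch: running a Beam pipeline on the Direct runner.
# The jar and main class are placeholders, not from this thread.
# With FILE_LOADS, BigQueryIO stages Avro files under --tempLocation,
# so it must be a gs:// path that the BigQuery load job can read from.
java -cp my-pipeline.jar com.example.MyBeamPipeline \
  --runner=DirectRunner \
  --project=my-gcp-project \
  --tempLocation=gs://my-bucket/beam-tmp
```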

Now, tempLocation is used for more things. If you want to use a different
tempLocation for the rest of the pipeline, you can use the option
--gcpTempLocation in combination with --tempLocation. BigQueryIO will use
gcpTempLocation if it is set, and it will fall back to tempLocation if
gcpTempLocation is not set.
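That fallback can be illustrated with a tiny shell sketch (the variable
names are ours, not Beam's; only the precedence rule mirrors BigQueryIO's
documented behaviour):

```shell
# Toy model of BigQueryIO's temp-location resolution:
# gcpTempLocation wins when set; otherwise tempLocation is used.
TEMP_LOCATION="gs://my-bucket/tmp"   # --tempLocation (general staging)
GCP_TEMP_LOCATION=""                 # --gcpTempLocation (unset here)

# ${VAR:-fallback} expands to the fallback when VAR is empty or unset.
EFFECTIVE="${GCP_TEMP_LOCATION:-$TEMP_LOCATION}"
echo "BigQueryIO will stage files in: $EFFECTIVE"
```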

Bear also in mind that if you are using a different write method (e.g.
STORAGE_WRITE_API), Beam will not generate files, so whether tempLocation
is in GCS or not does not matter, and the data will be directly written to
BigQuery (https://cloud.google.com/bigquery/docs/write-api-batch).

These are the write methods that can be used with Beam and BigQuery:
https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html

Kind regards,
Israel


On Wed, 31 Aug 2022 at 16:50, Fabian Peters <po...@mercadu.de> wrote:

> Hi Israel,
>
> That was it, many thanks! I had it set to "${java.io.tmpdir}". Is the
> requirement to use a GCS location documented somewhere?
>
> cheers
>
> Fabian
>
> On 31.08.2022 at 11:25, Israel Herraiz via users <
> users@hop.apache.org> wrote:
>
> What are the command line arguments that you are using for those direct
> runner pipelines? For instance, for BigQuery you will need to set
> --tempLocation to a GCS location for the BQ jobs to work.
>
>
> On Wed, 31 Aug 2022 at 09:50, Fabian Peters <po...@mercadu.de> wrote:
>
>> Good morning!
>>
>> I'm putting together my Dataflow deployment and am running into another
>> problem I don't know how to deal with: I'm running a pipeline via Dataflow,
>> which contains a "Workflow executor" transform. The workflow contains a
>> number of pipelines that have their run configuration set to Beam-Direct.
>> In principle, this works fine. (Yeah!)
>>
>> However, in this setup a BigQuery Output fails with a
>> "java.lang.RuntimeException: Failed to create job with prefix
>> beam_bq_job_LOAD_sites_FOO_ID, reached max retries: 3, last failed job:
>> null." I see the same when running just the pipeline (or any other with
>> BigQuery Output) via Beam-Direct locally, which makes me think that the GCP
>> credentials are not being picked up? Is there something I need to configure?
>>
>> cheers
>>
>> Fabian
>>
>> P.S.: Logs from running locally with Beam-Direct:
>>
>> 2022/08/31 09:30:07 - sites - ERROR: Error starting the Beam pipeline
>> 2022/08/31 09:30:07 - sites - ERROR:
>> org.apache.hop.core.exception.HopException:
>> 2022/08/31 09:30:07 - sites - Error executing pipeline with runner Direct
>> 2022/08/31 09:30:07 - sites - java.lang.RuntimeException: Failed to
>> create job with prefix
>> beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000,
>> reached max retries: 3, last failed job: null.
>> 2022/08/31 09:30:07 - sites -
>> 2022/08/31 09:30:07 - sites -   at
>> org.apache.hop.beam.engines.BeamPipelineEngine.executePipeline(BeamPipelineEngine.java:258)
>> 2022/08/31 09:30:07 - sites -   at
>> org.apache.hop.beam.engines.BeamPipelineEngine.lambda$startThreads$0(BeamPipelineEngine.java:305)
>> 2022/08/31 09:30:07 - sites -   at
>> java.base/java.lang.Thread.run(Thread.java:829)
>> 2022/08/31 09:30:07 - sites - Caused by:
>> org.apache.beam.sdk.Pipeline$PipelineExecutionException:
>> java.lang.RuntimeException: Failed to create job with prefix
>> beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000,
>> reached max retries: 3, last failed job: null.
>> 2022/08/31 09:30:07 - sites -   at
>> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:373)
>> 2022/08/31 09:30:07 - sites -   at
>> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:341)
>> 2022/08/31 09:30:07 - sites -   at
>> org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:218)
>> 2022/08/31 09:30:07 - sites -   at
>> org.apache.hop.beam.engines.BeamPipelineEngine.executePipeline(BeamPipelineEngine.java:246)
>> 2022/08/31 09:30:07 - sites -   ... 2 more
>> 2022/08/31 09:30:07 - sites - Caused by: java.lang.RuntimeException:
>> Failed to create job with prefix
>> beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000,
>> reached max retries: 3, last failed job: null.
>> 2022/08/31 09:30:07 - sites -   at
>> org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:199)
>> 2022/08/31 09:30:07 - sites -   at
>> org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:152)
>> 2022/08/31 09:30:07 - sites -   at
>> org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn.finishBundle(WriteTables.java:380)
>>
>>
>

Re: Workflow on GCP with BigQuery Output

Posted by Fabian Peters <po...@mercadu.de>.
Hi Israel,

That was it, many thanks! I had it set to "${java.io.tmpdir}". Is the requirement to use a GCS location documented somewhere?

cheers

Fabian

> On 31.08.2022 at 11:25, Israel Herraiz via users <us...@hop.apache.org> wrote:
> 
> What are the command line arguments that you are using for those direct runner pipelines? For instance, for BigQuery you will need to set --tempLocation to a GCS location for the BQ jobs to work.
> 
> 
> On Wed, 31 Aug 2022 at 09:50, Fabian Peters <post@mercadu.de> wrote:
> Good morning!
> 
> I'm putting together my Dataflow deployment and am running into another problem I don't know how to deal with: I'm running a pipeline via Dataflow, which contains a "Workflow executor" transform. The workflow contains a number of pipelines that have their run configuration set to Beam-Direct. In principle, this works fine. (Yeah!)
> 
> However, in this setup a BigQuery Output fails with a "java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_sites_FOO_ID, reached max retries: 3, last failed job: null." I see the the same when running just the pipeline (or any other with BigQuery Output) via Beam-Direct locally, which makes me think that the GCP credentials are not being picked up? Is there something I need to configure?
> 
> cheers
> 
> Fabian
> 
> P.S.: Logs from running locally with Beam-Direct:
> 
> 2022/08/31 09:30:07 - sites - ERROR: Error starting the Beam pipeline
> 2022/08/31 09:30:07 - sites - ERROR: org.apache.hop.core.exception.HopException: 
> 2022/08/31 09:30:07 - sites - Error executing pipeline with runner Direct
> 2022/08/31 09:30:07 - sites - java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000, reached max retries: 3, last failed job: null.
> 2022/08/31 09:30:07 - sites - 
> 2022/08/31 09:30:07 - sites -   at org.apache.hop.beam.engines.BeamPipelineEngine.executePipeline(BeamPipelineEngine.java:258)
> 2022/08/31 09:30:07 - sites -   at org.apache.hop.beam.engines.BeamPipelineEngine.lambda$startThreads$0(BeamPipelineEngine.java:305)
> 2022/08/31 09:30:07 - sites -   at java.base/java.lang.Thread.run(Thread.java:829)
> 2022/08/31 09:30:07 - sites - Caused by: org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000, reached max retries: 3, last failed job: null.
> 2022/08/31 09:30:07 - sites -   at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:373)
> 2022/08/31 09:30:07 - sites -   at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:341)
> 2022/08/31 09:30:07 - sites -   at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:218)
> 2022/08/31 09:30:07 - sites -   at org.apache.hop.beam.engines.BeamPipelineEngine.executePipeline(BeamPipelineEngine.java:246)
> 2022/08/31 09:30:07 - sites -   ... 2 more
> 2022/08/31 09:30:07 - sites - Caused by: java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000, reached max retries: 3, last failed job: null.
> 2022/08/31 09:30:07 - sites -   at org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:199)
> 2022/08/31 09:30:07 - sites -   at org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:152)
> 2022/08/31 09:30:07 - sites -   at org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn.finishBundle(WriteTables.java:380)
> 


Re: Workflow on GCP with BigQuery Output

Posted by Israel Herraiz via users <us...@hop.apache.org>.
What are the command line arguments that you are using for those direct
runner pipelines? For instance, for BigQuery you will need to set
--tempLocation to a GCS location for the BQ jobs to work.


On Wed, 31 Aug 2022 at 09:50, Fabian Peters <po...@mercadu.de> wrote:

> Good morning!
>
> I'm putting together my Dataflow deployment and am running into another
> problem I don't know how to deal with: I'm running a pipeline via Dataflow,
> which contains a "Workflow executor" transform. The workflow contains a
> number of pipelines that have their run configuration set to Beam-Direct.
> In principle, this works fine. (Yeah!)
>
> However, in this setup a BigQuery Output fails with a
> "java.lang.RuntimeException: Failed to create job with prefix
> beam_bq_job_LOAD_sites_FOO_ID, reached max retries: 3, last failed job:
> null." I see the same when running just the pipeline (or any other with
> BigQuery Output) via Beam-Direct locally, which makes me think that the GCP
> credentials are not being picked up? Is there something I need to configure?
>
> cheers
>
> Fabian
>
> P.S.: Logs from running locally with Beam-Direct:
>
> 2022/08/31 09:30:07 - sites - ERROR: Error starting the Beam pipeline
> 2022/08/31 09:30:07 - sites - ERROR:
> org.apache.hop.core.exception.HopException:
> 2022/08/31 09:30:07 - sites - Error executing pipeline with runner Direct
> 2022/08/31 09:30:07 - sites - java.lang.RuntimeException: Failed to create
> job with prefix
> beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000,
> reached max retries: 3, last failed job: null.
> 2022/08/31 09:30:07 - sites -
> 2022/08/31 09:30:07 - sites -   at
> org.apache.hop.beam.engines.BeamPipelineEngine.executePipeline(BeamPipelineEngine.java:258)
> 2022/08/31 09:30:07 - sites -   at
> org.apache.hop.beam.engines.BeamPipelineEngine.lambda$startThreads$0(BeamPipelineEngine.java:305)
> 2022/08/31 09:30:07 - sites -   at
> java.base/java.lang.Thread.run(Thread.java:829)
> 2022/08/31 09:30:07 - sites - Caused by:
> org.apache.beam.sdk.Pipeline$PipelineExecutionException:
> java.lang.RuntimeException: Failed to create job with prefix
> beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000,
> reached max retries: 3, last failed job: null.
> 2022/08/31 09:30:07 - sites -   at
> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:373)
> 2022/08/31 09:30:07 - sites -   at
> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:341)
> 2022/08/31 09:30:07 - sites -   at
> org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:218)
> 2022/08/31 09:30:07 - sites -   at
> org.apache.hop.beam.engines.BeamPipelineEngine.executePipeline(BeamPipelineEngine.java:246)
> 2022/08/31 09:30:07 - sites -   ... 2 more
> 2022/08/31 09:30:07 - sites - Caused by: java.lang.RuntimeException:
> Failed to create job with prefix
> beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000,
> reached max retries: 3, last failed job: null.
> 2022/08/31 09:30:07 - sites -   at
> org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:199)
> 2022/08/31 09:30:07 - sites -   at
> org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:152)
> 2022/08/31 09:30:07 - sites -   at
> org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn.finishBundle(WriteTables.java:380)
>
>