Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/04 19:08:18 UTC

[GitHub] [beam] damccorm opened a new issue, #20748: GCP DataFlow not cleaning up GCP BigQuery temporary datasets

damccorm opened a new issue, #20748:
URL: https://github.com/apache/beam/issues/20748

   I'm running a number of GCP DataFlow jobs to transform some tables within GCP BigQuery, and they're creating a bunch of temporary datasets that are not deleted when the job completes successfully. I'm running the GCP DataFlow jobs by using Airflow / GCP Cloud Composer.
   
   The Composer environment's Airflow UI does not reveal anything. In GCP DataFlow, when I click on a job named $BATCH_JOB marked "Status: Succeeded" and "SDK version: 2.27.0", then on a step within that job and a stage within that step (?), open the Logs window, filter for "LogLevel: Error", and click on a log message, I get this:
   
    
   
    ```
    Error message from worker: Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 649, in do_work
        work_executor.execute()
      File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 226, in execute
        self._split_task)
      File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 234, in _perform_source_split_considering_api_limits
        desired_bundle_size)
      File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 271, in _perform_source_split
        for split in source.split(desired_bundle_size):
      File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery.py", line 796, in split
        schema, metadata_list = self._export_files(bq)
      File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery.py", line 881, in _export_files
        bq.wait_for_bq_job(job_ref)
      File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 525, in wait_for_bq_job
        job_reference.jobId, job.status.errorResult))
    RuntimeError: BigQuery job beam_bq_job_EXPORT_latestrow060a408d75f23074efbacd477228b4b30bc_68cc517f-f_436 failed.
    Error Result: <ErrorProto message: 'Not found: Table motorefi-analytics:temp_dataset_3a43c81c858e429f871d37802d7ac4f6.temp_table_3a43c81c858e429f871d37802d7ac4f6 was not found in location US' reason: 'notFound'>
    ```
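   As a stopgap (not something confirmed in this thread), the leftover `temp_dataset_*` datasets can be garbage-collected with the `google-cloud-bigquery` client. The `temp_dataset_` prefix matches the error above; the one-day age threshold and the project name are assumptions, so this is a sketch, not a fix:

   ```python
   from datetime import datetime, timedelta, timezone

   TEMP_PREFIX = "temp_dataset_"  # prefix used by the Beam export-based read

   def is_stale_temp_dataset(dataset_id, created, now, max_age=timedelta(days=1)):
       """True if a dataset looks like a leftover Beam export dataset."""
       return dataset_id.startswith(TEMP_PREFIX) and (now - created) > max_age

   def cleanup_temp_datasets(project):
       # google-cloud-bigquery is imported lazily so the helper above stays
       # usable without GCP credentials or the client library installed.
       from google.cloud import bigquery

       client = bigquery.Client(project=project)
       now = datetime.now(timezone.utc)
       for item in client.list_datasets():
           # list_datasets() returns lightweight items; fetch full metadata
           # to get the creation timestamp.
           dataset = client.get_dataset(item.reference)
           if is_stale_temp_dataset(dataset.dataset_id, dataset.created, now):
               client.delete_dataset(dataset.reference, delete_contents=True)
   ```

   The age threshold is there so the job doesn't delete a temp dataset a still-running pipeline is using.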
   
    
   
   I could provide the equivalent REST description of the batch job, but I'm not sure whether it would be helpful or whether it contains sensitive information.
   
    
   
   I'm not sure whether Beam v2.27.0 is affected by https://issues.apache.org/jira/browse/BEAM-6514 or https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/609, but I am using the Python 3.7 SDK v2.27.0 and not the Java SDK.
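   For what it's worth, the traceback shows the export-based read path (`_export_files`), and the Python SDK's `ReadFromBigQuery` also supports a Storage Read API mode that skips the export job, and with it the temporary dataset, entirely. A minimal sketch (the table name is a placeholder, and whether this mode fits a given pipeline is for the user to verify):

   ```python
   def read_table_direct(pipeline, table):
       # Imported inside the function so the sketch can be inspected without
       # apache_beam installed; running the pipeline of course requires it.
       from apache_beam.io.gcp.bigquery import ReadFromBigQuery

       return pipeline | "ReadTable" >> ReadFromBigQuery(
           table=table,  # placeholder, e.g. "my-project:my_dataset.my_table"
           # DIRECT_READ uses the BigQuery Storage Read API instead of
           # running an extract job into a temporary dataset.
           method=ReadFromBigQuery.Method.DIRECT_READ,
       )
   ```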
   
    
   
   Appreciate any help with this issue.
   
   Imported from Jira [BEAM-11905](https://issues.apache.org/jira/browse/BEAM-11905). Original Jira may contain additional context.
   Reported by: yingw787.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] GCP DataFlow not cleaning up GCP BigQuery temporary datasets [beam]

Posted by "djaneluz (via GitHub)" <gi...@apache.org>.
djaneluz commented on issue #20748:
URL: https://github.com/apache/beam/issues/20748#issuecomment-1948526540

   This is still happening; I'm using Beam version 2.52.0 and Airflow to trigger the pipeline.
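   One mitigation worth checking (an assumption on my part, not confirmed in this thread): newer Python SDKs let you pass your own `temp_dataset` to `ReadFromBigQuery`, so export leftovers land in a dataset you control and can clean on a schedule. Project, dataset, and table names below are placeholders:

   ```python
   def read_with_managed_temp_dataset(pipeline, table, temp_project, temp_dataset_id):
       # Lazy imports so this sketch is inspectable without Beam installed.
       from apache_beam.io.gcp.bigquery import ReadFromBigQuery
       from apache_beam.io.gcp.internal.clients.bigquery import DatasetReference

       return pipeline | "ReadTable" >> ReadFromBigQuery(
           table=table,  # placeholder, e.g. "my-project:my_dataset.my_table"
           # Route the export into a dataset you own, so any leftovers are
           # easy to find and garbage-collect on a schedule.
           temp_dataset=DatasetReference(
               projectId=temp_project, datasetId=temp_dataset_id),
       )
   ```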




Re: [I] GCP DataFlow not cleaning up GCP BigQuery temporary datasets [beam]

Posted by "akshatmahesh0110 (via GitHub)" <gi...@apache.org>.
akshatmahesh0110 commented on issue #20748:
URL: https://github.com/apache/beam/issues/20748#issuecomment-1905303675

   Hi, is this issue resolved?
   
   Thanks

