Posted to issues@beam.apache.org by "Kenneth Knowles (Jira)" <ji...@apache.org> on 2021/03/31 23:00:00 UTC

[jira] [Updated] (BEAM-11905) GCP DataFlow not cleaning up GCP BigQuery temporary datasets

     [ https://issues.apache.org/jira/browse/BEAM-11905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kenneth Knowles updated BEAM-11905:
-----------------------------------
    Labels:   (was: newbie)

> GCP DataFlow not cleaning up GCP BigQuery temporary datasets
> ------------------------------------------------------------
>
>                 Key: BEAM-11905
>                 URL: https://issues.apache.org/jira/browse/BEAM-11905
>             Project: Beam
>          Issue Type: Improvement
>          Components: beam-community
>    Affects Versions: 2.27.0
>         Environment: GCP DataFlow
>            Reporter: Ying Wang
>            Priority: P2
>
> I'm running a number of GCP DataFlow jobs to transform some tables within GCP BigQuery, and they're creating a bunch of temporary datasets that are not deleted when the job completes successfully. I'm running the GCP DataFlow jobs by using Airflow / GCP Cloud Composer.
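> For context, the pipelines are roughly of the following shape (every project, bucket, and table name below is a placeholder, not the real job); the ReadFromBigQuery query path is what appears to create the temporary dataset and table:
> ```python
> # Sketch of the pipeline shape only; all names are placeholders.
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
>
> options = PipelineOptions(
>     runner='DataflowRunner',
>     project='my-project',                # placeholder
>     region='us-central1',                # placeholder
>     temp_location='gs://my-bucket/tmp',  # placeholder
> )
>
> with beam.Pipeline(options=options) as p:
>     (p
>      | 'Read' >> beam.io.ReadFromBigQuery(
>            query='SELECT * FROM `my-project.my_dataset.my_table`',  # placeholder
>            use_standard_sql=True)
>      | 'Transform' >> beam.Map(lambda row: row)  # stand-in for the real transforms
>      | 'Write' >> beam.io.WriteToBigQuery(
>            'my-project:my_dataset.my_output_table',  # placeholder; assumed to exist
>            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
>            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
> ```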
> The Airflow UI in the Composer environment does not reveal anything. In the GCP DataFlow console, when I open a job named $BATCH_JOB marked "Status: Succeeded" and "SDK version: 2.27.0", drill into a step within that job and then a stage within that step, open the Logs panel, filter for "LogLevel: Error", and click on a log message, I get this:
>  
> ```
> Error message from worker: Traceback (most recent call last):
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 649, in do_work
>     work_executor.execute()
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 226, in execute
>     self._split_task)
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 234, in _perform_source_split_considering_api_limits
>     desired_bundle_size)
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 271, in _perform_source_split
>     for split in source.split(desired_bundle_size):
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery.py", line 796, in split
>     schema, metadata_list = self._export_files(bq)
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery.py", line 881, in _export_files
>     bq.wait_for_bq_job(job_ref)
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 525, in wait_for_bq_job
>     job_reference.jobId, job.status.errorResult))
> RuntimeError: BigQuery job beam_bq_job_EXPORT_latestrow060a408d75f23074efbacd477228b4b30bc_68cc517f-f_436 failed. Error Result: <ErrorProto message: 'Not found: Table motorefi-analytics:temp_dataset_3a43c81c858e429f871d37802d7ac4f6.temp_table_3a43c81c858e429f871d37802d7ac4f6 was not found in location US' reason: 'notFound'>
> ```
>  
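> For what it's worth, the leftover datasets can be removed manually with something along these lines (just a sketch; it assumes the google-cloud-bigquery client library and keys off the temp_dataset_ prefix visible in the error above), but ideally the job itself would clean these up:
> ```python
> # Manual cleanup sketch, not a fix: deletes datasets whose IDs start with the
> # temp_dataset_ prefix seen in the error. Assumes google-cloud-bigquery is
> # installed and application-default credentials are configured.
> from google.cloud import bigquery
>
> client = bigquery.Client(project='my-project')  # placeholder project
>
> for dataset in client.list_datasets():
>     if dataset.dataset_id.startswith('temp_dataset_'):
>         client.delete_dataset(
>             dataset.reference, delete_contents=True, not_found_ok=True)
>         print('Deleted', dataset.dataset_id)
> ```
> 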
> I would provide the equivalent REST representation of the batch job description, but I'm not sure whether it would be helpful or whether it contains sensitive information.
>  
> I'm not sure whether Beam v2.27.0 is affected by https://issues.apache.org/jira/browse/BEAM-6514 or https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/609, but I am using the Python 3.7 SDK v2.27.0 and not the Java SDK.
>  
> I'd appreciate any help with this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)