Posted to github@beam.apache.org by "ahmedabu98 (via GitHub)" <gi...@apache.org> on 2023/03/28 15:18:02 UTC

[GitHub] [beam] ahmedabu98 commented on a diff in pull request #26002: Moving UUID definition in Python BQ IO

ahmedabu98 commented on code in PR #26002:
URL: https://github.com/apache/beam/pull/26002#discussion_r1150741857


##########
sdks/python/apache_beam/io/gcp/bigquery.py:
##########
@@ -2512,7 +2511,7 @@ def file_path_to_remove(unused_elm):
                 method=self.method,
                 job_name=job_name,
                 step_name=step_name,
-                unique_id=unique_id,
+                unique_id=str(uuid.uuid4())[0:10],

Review Comment:
   It's critical that the `unique_id` here is equal to `unique_id` on line 2492. BigQuery Export read generally works in three steps: 
   1. [output to files in GCS](https://github.com/apache/beam/blob/91bfbac7a90af3b2de3f432f5f0589c527025270/sdks/python/apache_beam/io/gcp/bigquery.py#L851)
   2. read from those files
   3. [delete GCS files](https://github.com/apache/beam/blob/91bfbac7a90af3b2de3f432f5f0589c527025270/sdks/python/apache_beam/io/gcp/bigquery.py#L2671)
   
   The unique ID is used when naming those files, and the same ID is used to build the name of the directory that gets deleted after the read. It must be identical in both places, otherwise the cleanup step targets the wrong directory and the exported files are left behind.
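   
   For illustration, here is a minimal standalone sketch (not the actual pipeline code; the bucket path is made up) of how two independent `uuid.uuid4()` calls end up computing different GCS paths for export and cleanup:
   
   ```python
   import uuid
   
   # Hypothetical temp location; stands in for the real gcs_location option.
   gcs_location = 'gs://my-temp-bucket/bq_export'
   
   # The export step names its output directory with one ID...
   export_id = str(uuid.uuid4())[0:10]
   export_dir = '%s/%s' % (gcs_location, export_id)
   
   # ...but a cleanup step using a second, independently generated ID would
   # compute a different directory and never delete the exported files.
   cleanup_id = str(uuid.uuid4())[0:10]
   cleanup_dir = '%s/%s' % (gcs_location, cleanup_id)
   
   assert export_dir != cleanup_dir  # all but guaranteed; nothing gets cleaned up
   ```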



##########
sdks/python/apache_beam/io/gcp/bigquery.py:
##########
@@ -2674,7 +2672,7 @@ def expand(self, pcoll):
             bigquery_job_labels=self.bigquery_job_labels,
             job_name=job_name,
             step_name=step_name,
-            unique_id=unique_id,
+            unique_id=str(uuid.uuid4())[0:10],

Review Comment:
   Does doing it this way make a difference? I'm not familiar enough with this part; I'd defer to a Python SDK expert here.



##########
sdks/python/apache_beam/io/gcp/bigquery.py:
##########
@@ -2512,7 +2511,7 @@ def file_path_to_remove(unused_elm):
                 method=self.method,
                 job_name=job_name,
                 step_name=step_name,
-                unique_id=unique_id,
+                unique_id=str(uuid.uuid4())[0:10],

Review Comment:
   What you could do here is create the `unique_id` beforehand in a DoFn, wrap it in a `pvalue.AsSingleton`, and pass it to both the Read and Cleanup operations as a side input. See an example [here](https://github.com/apache/beam/blob/a4f55afa052b83237387698649b127570e1e5a63/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L1180) that does this.
   
   The relevant issue mentions that this is how the Java SDK does it: it [creates the value in a DoFn](https://github.com/apache/beam/blob/b3b184361100492f92fcc51ff82a0fcd962d5ee0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1094-L1106), then uses it as a side input for [read](https://github.com/apache/beam/blob/b3b184361100492f92fcc51ff82a0fcd962d5ee0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1150) and [cleanup](https://github.com/apache/beam/blob/b3b184361100492f92fcc51ff82a0fcd962d5ee0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1206).
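   
   A rough sketch of the pattern, with made-up stand-ins rather than the actual transforms in `bigquery.py` (the "read" is just a `FlatMap`, and `CleanUpExportedFiles` is a hypothetical DoFn):
   
   ```python
   import uuid
   
   import apache_beam as beam
   from apache_beam import pvalue
   
   
   class CleanUpExportedFiles(beam.DoFn):
     """Hypothetical cleanup DoFn; receives the unique ID as a side input."""
     def process(self, unused_element, unique_id):
       # Because both Read and Cleanup see the same unique_id value, the
       # directory named during export is the one targeted for deletion here.
       yield 'would delete gs://my-temp-bucket/bq_export/%s' % unique_id
   
   
   with beam.Pipeline() as p:
     # Generate the ID exactly once, at execution time, in its own step.
     unique_id_pcoll = (
         p
         | 'ImpulseForId' >> beam.Create([None])
         | 'GenerateId' >> beam.Map(lambda _: str(uuid.uuid4())[0:10]))
     unique_id_view = pvalue.AsSingleton(unique_id_pcoll)
   
     # Pass the same singleton view to both the read and the cleanup steps.
     read_results = (
         p
         | 'Impulse' >> beam.Create([None])
         | 'Read' >> beam.FlatMap(
             lambda _, unique_id: ['read files under %s' % unique_id],
             unique_id=unique_id_view))
     _ = read_results | 'CleanUp' >> beam.ParDo(
         CleanUpExportedFiles(), unique_id_view)
   ```
   
   Note this sketch doesn't gate cleanup on the read actually finishing; the real change would still rely on whatever ordering the existing code uses between those steps.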



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org