You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by "svetakvsundhar (via GitHub)" <gi...@apache.org> on 2023/04/13 21:15:46 UTC

[GitHub] [beam] svetakvsundhar opened a new pull request, #26270: Generating Unique ID within function so that it's unique for every run.

svetakvsundhar opened a new pull request, #26270:
URL: https://github.com/apache/beam/pull/26270

   Fixes issue #22733 by deferring unique ID creation to later function. 
   
   ------------------------
   
   Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
   
    - [ ] Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #<ISSUE NUMBER>` instead.
    - [ ] Update `CHANGES.md` with noteworthy changes.
    - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/get-started-contributing/#make-the-reviewers-job-easier).
   
   To check the build health, please visit [https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md)
   
   GitHub Actions Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   [![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
   [![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Go tests](https://github.com/apache/beam/workflows/Go%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Go+tests%22+branch%3Amaster+event%3Aschedule)
   
   See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] ahmedabu98 commented on a diff in pull request #26270: Generating Unique ID within function so that it's unique for every run.

Posted by "ahmedabu98 (via GitHub)" <gi...@apache.org>.
ahmedabu98 commented on code in PR #26270:
URL: https://github.com/apache/beam/pull/26270#discussion_r1171227353


##########
sdks/python/apache_beam/io/gcp/bigquery.py:
##########
@@ -2666,11 +2664,10 @@ def _expand_export(self, pcoll):
         GoogleCloudOptions).temp_location
     job_name = pcoll.pipeline.options.view_as(GoogleCloudOptions).job_name
     gcs_location_vp = self.gcs_location
-    unique_id = str(uuid.uuid4())[0:10]
 
     def file_path_to_remove(unused_elm):
       gcs_location = bigquery_export_destination_uri(
-          gcs_location_vp, temp_location, unique_id, True)
+          gcs_location_vp, temp_location, str(uuid.uuid4())[0:10], True)

Review Comment:
   P.S. I've confirmed that this change leads to temp files being left undeleted



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] svetakvsundhar commented on pull request #26270: Generating Unique ID within function so that it's unique for every run.

Posted by "svetakvsundhar (via GitHub)" <gi...@apache.org>.
svetakvsundhar commented on PR #26270:
URL: https://github.com/apache/beam/pull/26270#issuecomment-1513290711

   > Postcommits passed, did you do a manual inspection that the ID is now regenerated once you launch a couple of template jobs?
   
   Yep. 
   
   Results:
   
   beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP_284_2961f53aa7224fc3c787368d14c1c242_9a14ce5d0c9e4ab2852dabbbb2dd3862
   beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_STEP_254_2961f53aa7224fc3c787368d14c1c242_beecfd39733f4db3a9f8eb310d432fd0
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] svetakvsundhar commented on pull request #26270: Generating Unique ID within function so that it's unique for every run.

Posted by "svetakvsundhar (via GitHub)" <gi...@apache.org>.
svetakvsundhar commented on PR #26270:
URL: https://github.com/apache/beam/pull/26270#issuecomment-1507682699

   Run Python_Coverage PreCommit


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] tvalentyn merged pull request #26270: Generating Unique ID within function so that it's unique for every run.

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn merged PR #26270:
URL: https://github.com/apache/beam/pull/26270


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] svetakvsundhar commented on pull request #26270: Generating Unique ID within function so that it's unique for every run.

Posted by "svetakvsundhar (via GitHub)" <gi...@apache.org>.
svetakvsundhar commented on PR #26270:
URL: https://github.com/apache/beam/pull/26270#issuecomment-1507691120

   R: @AnandInguva 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] ahmedabu98 commented on a diff in pull request #26270: Generating Unique ID within function so that it's unique for every run.

Posted by "ahmedabu98 (via GitHub)" <gi...@apache.org>.
ahmedabu98 commented on code in PR #26270:
URL: https://github.com/apache/beam/pull/26270#discussion_r1171227353


##########
sdks/python/apache_beam/io/gcp/bigquery.py:
##########
@@ -2666,11 +2664,10 @@ def _expand_export(self, pcoll):
         GoogleCloudOptions).temp_location
     job_name = pcoll.pipeline.options.view_as(GoogleCloudOptions).job_name
     gcs_location_vp = self.gcs_location
-    unique_id = str(uuid.uuid4())[0:10]
 
     def file_path_to_remove(unused_elm):
       gcs_location = bigquery_export_destination_uri(
-          gcs_location_vp, temp_location, unique_id, True)
+          gcs_location_vp, temp_location, str(uuid.uuid4())[0:10], True)

Review Comment:
   P.S. I've confirmed that this change leads to temp files being left undeleted
   
   Steps:
   Use export read and specify a new folder for GCS temp location. After the read is over, check if the folder and its contents are still there.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] AnandInguva commented on pull request #26270: Generating Unique ID within function so that it's unique for every run.

Posted by "AnandInguva (via GitHub)" <gi...@apache.org>.
AnandInguva commented on PR #26270:
URL: https://github.com/apache/beam/pull/26270#issuecomment-1507690587

   
   FYI Run Python_Coverage PreCommit -> this one is failing already due to https://github.com/apache/beam/issues/26251. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] svetakvsundhar commented on pull request #26270: Generating Unique ID within function so that it's unique for every run.

Posted by "svetakvsundhar (via GitHub)" <gi...@apache.org>.
svetakvsundhar commented on PR #26270:
URL: https://github.com/apache/beam/pull/26270#issuecomment-1508762662

   R: @tvalentyn for committer review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Generating Unique ID within function so that it's unique for every run. [beam]

Posted by "DerRidda (via GitHub)" <gi...@apache.org>.
DerRidda commented on code in PR #26270:
URL: https://github.com/apache/beam/pull/26270#discussion_r1469709852


##########
sdks/python/apache_beam/io/gcp/bigquery.py:
##########
@@ -2666,11 +2664,10 @@ def _expand_export(self, pcoll):
         GoogleCloudOptions).temp_location
     job_name = pcoll.pipeline.options.view_as(GoogleCloudOptions).job_name
     gcs_location_vp = self.gcs_location
-    unique_id = str(uuid.uuid4())[0:10]
 
     def file_path_to_remove(unused_elm):
       gcs_location = bigquery_export_destination_uri(
-          gcs_location_vp, temp_location, unique_id, True)
+          gcs_location_vp, temp_location, str(uuid.uuid4())[0:10], True)

Review Comment:
   @ahmedabu98 That workaround is currently not possible to use with DIRECT_READ as neither the temp_dataset nor the temp_table can be changed at runtime, as these are static options and don't use value providers.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] github-actions[bot] commented on pull request #26270: Generating Unique ID within function so that it's unique for every run.

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #26270:
URL: https://github.com/apache/beam/pull/26270#issuecomment-1507690326

   Assigning reviewers. If you would like to opt out of this review, comment `assign to next reviewer`:
   
   R: @jrmccluskey for label python.
   R: @johnjcasey for label io.
   
   Available commands:
   - `stop reviewer notifications` - opt out of the automated review tooling
   - `remind me after tests pass` - tag the comment author after tests pass
   - `waiting on author` - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)
   
   The PR bot will only process comments in the main thread (not review comments).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] github-actions[bot] commented on pull request #26270: Generating Unique ID within function so that it's unique for every run.

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #26270:
URL: https://github.com/apache/beam/pull/26270#issuecomment-1507691763

   Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] tvalentyn commented on pull request #26270: Generating Unique ID within function so that it's unique for every run.

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on PR #26270:
URL: https://github.com/apache/beam/pull/26270#issuecomment-1509443718

   Run Python 3.8 PostCommit


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] tvalentyn commented on pull request #26270: Generating Unique ID within function so that it's unique for every run.

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on PR #26270:
URL: https://github.com/apache/beam/pull/26270#issuecomment-1512997857

   Postcommits passed, did you do a manual inspection that the ID is now regenerated once you launch a couple of template jobs?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] svetakvsundhar commented on pull request #26270: Generating Unique ID within function so that it's unique for every run.

Posted by "svetakvsundhar (via GitHub)" <gi...@apache.org>.
svetakvsundhar commented on PR #26270:
URL: https://github.com/apache/beam/pull/26270#issuecomment-1509873129

   > How was this tested ?
   
   I was relying on our current postcommit tests. I guess this would need to be tested in the templates repo for templates integration tests?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] tvalentyn commented on a diff in pull request #26270: Generating Unique ID within function so that it's unique for every run.

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on code in PR #26270:
URL: https://github.com/apache/beam/pull/26270#discussion_r1171241960


##########
sdks/python/apache_beam/io/gcp/bigquery.py:
##########
@@ -2666,11 +2664,10 @@ def _expand_export(self, pcoll):
         GoogleCloudOptions).temp_location
     job_name = pcoll.pipeline.options.view_as(GoogleCloudOptions).job_name
     gcs_location_vp = self.gcs_location
-    unique_id = str(uuid.uuid4())[0:10]
 
     def file_path_to_remove(unused_elm):
       gcs_location = bigquery_export_destination_uri(
-          gcs_location_vp, temp_location, unique_id, True)
+          gcs_location_vp, temp_location, str(uuid.uuid4())[0:10], True)

Review Comment:
   Thanks a lot @ahmedabu98 for the feedback and chiming in here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] ahmedabu98 commented on a diff in pull request #26270: Generating Unique ID within function so that it's unique for every run.

Posted by "ahmedabu98 (via GitHub)" <gi...@apache.org>.
ahmedabu98 commented on code in PR #26270:
URL: https://github.com/apache/beam/pull/26270#discussion_r1171221658


##########
sdks/python/apache_beam/io/gcp/bigquery.py:
##########
@@ -2666,11 +2664,10 @@ def _expand_export(self, pcoll):
         GoogleCloudOptions).temp_location
     job_name = pcoll.pipeline.options.view_as(GoogleCloudOptions).job_name
     gcs_location_vp = self.gcs_location
-    unique_id = str(uuid.uuid4())[0:10]
 
     def file_path_to_remove(unused_elm):
       gcs_location = bigquery_export_destination_uri(
-          gcs_location_vp, temp_location, unique_id, True)
+          gcs_location_vp, temp_location, str(uuid.uuid4())[0:10], True)

Review Comment:
   Some more thought needs to be put into this line here. What ends up happening is `file_path_to_remove` is randomly generated and does not actually point to any files. 
   
   Prior to this change, it used the same `unique_id`/`source_uuid` to create the filepath generated in the export read here: https://github.com/apache/beam/blob/fc7a240eb4e5f2653225c594c582749047664202/sdks/python/apache_beam/io/gcp/bigquery.py#L865-L866
   
   But now each step is creating its own unique id, so ie. its own filepath. We end up exporting and reading from one filepath, then attempting to delete another filepath.
   
   With large reads, this will lead to a lot of temp files on GCS being left undeleted. I recommend rolling back this PR until a better solution is found for this because it introduces a serious regression.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] tvalentyn commented on a diff in pull request #26270: Generating Unique ID within function so that it's unique for every run.

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on code in PR #26270:
URL: https://github.com/apache/beam/pull/26270#discussion_r1167351750


##########
sdks/python/apache_beam/io/gcp/bigquery.py:
##########
@@ -689,7 +689,6 @@ def __init__(
     self.query_priority = query_priority
     self._job_name = job_name or 'BQ_EXPORT_JOB'
     self._step_name = step_name
-    self._source_uuid = unique_id

Review Comment:
   yes, we should be able to remove that



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] tvalentyn commented on pull request #26270: Generating Unique ID within function so that it's unique for every run.

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on PR #26270:
URL: https://github.com/apache/beam/pull/26270#issuecomment-1509443114

   How was this tested ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] AnandInguva commented on a diff in pull request #26270: Generating Unique ID within function so that it's unique for every run.

Posted by "AnandInguva (via GitHub)" <gi...@apache.org>.
AnandInguva commented on code in PR #26270:
URL: https://github.com/apache/beam/pull/26270#discussion_r1166101150


##########
sdks/python/apache_beam/io/gcp/bigquery.py:
##########
@@ -689,7 +689,6 @@ def __init__(
     self.query_priority = query_priority
     self._job_name = job_name or 'BQ_EXPORT_JOB'
     self._step_name = step_name
-    self._source_uuid = unique_id

Review Comment:
   Do we want to remove the input param `unique_id`? Since this is an internal function, I guess it would be okay?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org