Posted to github@beam.apache.org by "svetakvsundhar (via GitHub)" <gi...@apache.org> on 2023/04/12 14:09:32 UTC

[GitHub] [beam] svetakvsundhar opened a new pull request, #26236: Pickling and Savemainsession Doc update

svetakvsundhar opened a new pull request, #26236:
URL: https://github.com/apache/beam/pull/26236

   Fixes #20228 by documenting pickling and the `--save_main_session` option in the Beam docs.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] svetakvsundhar commented on pull request #26236: Pickling and Savemainsession Doc update

Posted by "svetakvsundhar (via GitHub)" <gi...@apache.org>.
svetakvsundhar commented on PR #26236:
URL: https://github.com/apache/beam/pull/26236#issuecomment-1505517586

   Run Website PreCommit




[GitHub] [beam] github-actions[bot] commented on pull request #26236: Pickling and Savemainsession Doc update

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #26236:
URL: https://github.com/apache/beam/pull/26236#issuecomment-1505354669

   Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control




[GitHub] [beam] AnandInguva commented on a diff in pull request #26236: Pickling and Savemainsession Doc update

Posted by "AnandInguva (via GitHub)" <gi...@apache.org>.
AnandInguva commented on code in PR #26236:
URL: https://github.com/apache/beam/pull/26236#discussion_r1164397085


##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -141,3 +141,12 @@ However, it may be possible to pre-build the SDK containers and perform the depe
 Dataflow, see [Pre-building the python SDK custom container image with extra dependencies](https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild).
 
 **NOTE**: This feature is available only for the `Dataflow Runner v2`.
+
+## Pickling and Managing Main Session
+
+Pickling in the Python SDK is set up to pickle the state of the global namespace. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Dataflow job.
+Thus, one might encounter unexpected `NameErrors` when running a `DoFn` on Dataflow Runner. To resolve this, manage the main session by

Review Comment:
   ```suggestion
   Thus, one might encounter unexpected `NameError`s when running a `DoFn` on Dataflow Runner. To resolve this, manage the main session by
   ```





[GitHub] [beam] svetakvsundhar commented on a diff in pull request #26236: Pickling and Savemainsession Doc update

Posted by "svetakvsundhar (via GitHub)" <gi...@apache.org>.
svetakvsundhar commented on code in PR #26236:
URL: https://github.com/apache/beam/pull/26236#discussion_r1164468409


##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -141,3 +141,12 @@ However, it may be possible to pre-build the SDK containers and perform the depe
 Dataflow, see [Pre-building the python SDK custom container image with extra dependencies](https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild).
 
 **NOTE**: This feature is available only for the `Dataflow Runner v2`.
+
+## Pickling and Managing Main Session
+
+Pickling in the Python SDK is set up to pickle the state of the global namespace. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Dataflow job.
+Thus, one might encounter an unexpected `NameError` when running a `DoFn` on Dataflow Runner. To resolve this, manage the main session by
+simply setting `--save_main_session=True`. This will load the pickled state of the global namespace onto the Dataflow workers.
+For more information, see [Handling NameErrors](https://cloud.google.com/dataflow/docs/guides/common-errors#how-do-i-handle-nameerrors).
+
+**NOTE**: This strictly applies to the `Python SDK executing with the dill pickler on the Dataflow Runner`.

Review Comment:
   ah thanks for the catch! updating to make it clearer -- this doesn't apply on `DirectRunner`.
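
   The failure mode described in the diff above can be imitated without Beam. What follows is a minimal, stdlib-only sketch and not Beam's or dill's actual mechanism: it rebuilds a function against a namespace that lacks the main session's global import (standing in for a remote worker), then against one where that import has been restored. The names `circle_area` and `rebuild_with_globals` are hypothetical and exist only for this illustration.

```python
import builtins
import math  # a global import that lives in the "main session"
import types


def circle_area(radius):
    # `math` is a global name, so it is looked up when the function runs.
    return math.pi * radius ** 2


def rebuild_with_globals(fn, namespace):
    """Return a copy of fn whose global lookups go through `namespace`."""
    return types.FunctionType(
        fn.__code__, namespace, fn.__name__, fn.__defaults__, fn.__closure__)


# Stand-in for a worker that did not receive the main session:
# only builtins are available, so the global `math` cannot be resolved.
bare_namespace = {"__builtins__": builtins.__dict__}
on_bare_worker = rebuild_with_globals(circle_area, bare_namespace)
try:
    on_bare_worker(1.0)
    error_message = None
except NameError as exc:
    error_message = str(exc)

# Stand-in for a worker where the main session was restored:
# the global import is present again and the call succeeds.
restored_namespace = {"__builtins__": builtins.__dict__, "math": math}
on_restored_worker = rebuild_with_globals(circle_area, restored_namespace)
restored_result = on_restored_worker(1.0)

print(error_message)
print(restored_result)
```

   In a real pipeline the restoring step is what `--save_main_session` requests. Another common way to avoid the error is to import the module inside the function that uses it, so nothing depends on the main session's globals.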





[GitHub] [beam] damccorm merged pull request #26236: Pickling and Savemainsession Doc update

Posted by "damccorm (via GitHub)" <gi...@apache.org>.
damccorm merged PR #26236:
URL: https://github.com/apache/beam/pull/26236




[GitHub] [beam] svetakvsundhar commented on pull request #26236: Pickling and Savemainsession Doc update

Posted by "svetakvsundhar (via GitHub)" <gi...@apache.org>.
svetakvsundhar commented on PR #26236:
URL: https://github.com/apache/beam/pull/26236#issuecomment-1505518431

   Run Website PreCommit




[GitHub] [beam] damccorm commented on a diff in pull request #26236: Pickling and Savemainsession Doc update

Posted by "damccorm (via GitHub)" <gi...@apache.org>.
damccorm commented on code in PR #26236:
URL: https://github.com/apache/beam/pull/26236#discussion_r1164458088


##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -141,3 +141,12 @@ However, it may be possible to pre-build the SDK containers and perform the depe
 Dataflow, see [Pre-building the python SDK custom container image with extra dependencies](https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild).
 
 **NOTE**: This feature is available only for the `Dataflow Runner v2`.
+
+## Pickling and Managing Main Session
+
+Pickling in the Python SDK is set up to pickle the state of the global namespace. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Dataflow job.
+Thus, one might encounter an unexpected `NameError` when running a `DoFn` on Dataflow Runner. To resolve this, manage the main session by
+simply setting `--save_main_session=True`. This will load the pickled state of the global namespace onto the Dataflow workers.
+For more information, see [Handling NameErrors](https://cloud.google.com/dataflow/docs/guides/common-errors#how-do-i-handle-nameerrors).
+
+**NOTE**: This strictly applies to the `Python SDK executing with the dill pickler on the Dataflow Runner`.

Review Comment:
   Also, I'd make it clear that the dill pickler is the default.



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -141,3 +141,12 @@ However, it may be possible to pre-build the SDK containers and perform the depe
 Dataflow, see [Pre-building the python SDK custom container image with extra dependencies](https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild).
 
 **NOTE**: This feature is available only for the `Dataflow Runner v2`.
+
+## Pickling and Managing Main Session
+
+Pickling in the Python SDK is set up to pickle the state of the global namespace. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Dataflow job.
+Thus, one might encounter an unexpected `NameError` when running a `DoFn` on Dataflow Runner. To resolve this, manage the main session by
+simply setting `--save_main_session=True`. This will load the pickled state of the global namespace onto the Dataflow workers.
+For more information, see [Handling NameErrors](https://cloud.google.com/dataflow/docs/guides/common-errors#how-do-i-handle-nameerrors).
+
+**NOTE**: This strictly applies to the `Python SDK executing with the dill pickler on the Dataflow Runner`.

Review Comment:
   Doesn't this at least apply to all remote runners using portability (not just Dataflow)?



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -141,3 +141,12 @@ However, it may be possible to pre-build the SDK containers and perform the depe
 Dataflow, see [Pre-building the python SDK custom container image with extra dependencies](https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild).
 
 **NOTE**: This feature is available only for the `Dataflow Runner v2`.
+
+## Pickling and Managing Main Session

Review Comment:
   If we're going to use Dataflow specific language here, we should specifically call that out in the section heading. I think this applies to other remote runners though.



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -141,3 +141,12 @@ However, it may be possible to pre-build the SDK containers and perform the depe
 Dataflow, see [Pre-building the python SDK custom container image with extra dependencies](https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild).
 
 **NOTE**: This feature is available only for the `Dataflow Runner v2`.
+
+## Pickling and Managing Main Session
+
+Pickling in the Python SDK is set up to pickle the state of the global namespace. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Dataflow job.
+Thus, one might encounter an unexpected `NameError` when running a `DoFn` on Dataflow Runner. To resolve this, manage the main session by
+simply setting `--save_main_session=True`. This will load the pickled state of the global namespace onto the Dataflow workers.
+For more information, see [Handling NameErrors](https://cloud.google.com/dataflow/docs/guides/common-errors#how-do-i-handle-nameerrors).
+
+**NOTE**: This strictly applies to the `Python SDK executing with the dill pickler on the Dataflow Runner`.

Review Comment:
   For example, we use it in our flink portability tests - https://github.com/apache/beam/blob/326373715e0ca071d610a03e92626b1957253f81/runners/portability/test_flink_uber_jar.sh#L24





[GitHub] [beam] svetakvsundhar commented on pull request #26236: Pickling and Savemainsession Doc update

Posted by "svetakvsundhar (via GitHub)" <gi...@apache.org>.
svetakvsundhar commented on PR #26236:
URL: https://github.com/apache/beam/pull/26236#issuecomment-1505352219

   R: @Abacn 




[GitHub] [beam] AnandInguva commented on a diff in pull request #26236: Pickling and Savemainsession Doc update

Posted by "AnandInguva (via GitHub)" <gi...@apache.org>.
AnandInguva commented on code in PR #26236:
URL: https://github.com/apache/beam/pull/26236#discussion_r1164233753


##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -141,3 +141,13 @@ However, it may be possible to pre-build the SDK containers and perform the depe
 Dataflow, see [Pre-building the python SDK custom container image with extra dependencies](https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild).
 
 **NOTE**: This feature is available only for the `Dataflow Runner v2`.
+
+## Pickling and Managing Main Session

Review Comment:
   Note that this is true when the pickler is `dill`.





[GitHub] [beam] svetakvsundhar commented on pull request #26236: Pickling and Savemainsession Doc update

Posted by "svetakvsundhar (via GitHub)" <gi...@apache.org>.
svetakvsundhar commented on PR #26236:
URL: https://github.com/apache/beam/pull/26236#issuecomment-1505518137

   Run Website_Stage_GCS PreCommit

