Posted to github@beam.apache.org by "svetakvsundhar (via GitHub)" <gi...@apache.org> on 2023/04/18 19:03:32 UTC

[GitHub] [beam] svetakvsundhar opened a new pull request, #26331: Adding info on picklers to docs.

svetakvsundhar opened a new pull request, #26331:
URL: https://github.com/apache/beam/pull/26331

   Follow on PR to fix #20228.
   
   ------------------------
   
   Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
   
    - [ ] Mention the appropriate issue in your description (for example: `addresses #123`), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment `fixes #<ISSUE NUMBER>` instead.
    - [ ] Update `CHANGES.md` with noteworthy changes.
    - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/get-started-contributing/#make-the-reviewers-job-easier).
   
   To check the build health, please visit [https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md)
   
   GitHub Actions Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   [![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
   [![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Go tests](https://github.com/apache/beam/workflows/Go%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Go+tests%22+branch%3Amaster+event%3Aschedule)
   
   See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] tvalentyn commented on pull request #26331: Adding info on picklers to docs [follow-on]

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on PR #26331:
URL: https://github.com/apache/beam/pull/26331#issuecomment-1516274558

   Made some changes directly on the branch, PTAL if they look good to you and feel free to edit further. thanks.




[GitHub] [beam] tvalentyn commented on a diff in pull request #26331: Adding info on picklers to docs [follow-on]

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on code in PR #26331:
URL: https://github.com/apache/beam/pull/26331#discussion_r1172518937


##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -146,8 +146,12 @@ Dataflow, see [Pre-building the python SDK custom container image with extra dep
 
 Pickling in the Python SDK is set up to pickle the state of the global namespace. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Beam job.
 Thus, one might encounter an unexpected `NameError` when running a `DoFn` on any remote runner using portability. To resolve this, manage the main session by
-simply setting the main session. This will load the pickled state of the global namespace onto the Dataflow workers.
+simply setting the main session, `--save_main_session`. This will load the pickled state of the global namespace onto the Dataflow workers.

Review Comment:
   ```suggestion
   setting the `--save_main_session` pipeline option. This will load the pickled state of the global namespace onto the Dataflow workers.
   ```
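   To make the suggested wording concrete, the option is passed at launch time like any other pipeline option. This is a hypothetical invocation sketch: `my_pipeline.py`, `my-project`, and the region are placeholders, not names from this PR.
   
   ```shell
   # Hypothetical launch command; script name, project, and region are placeholders.
   python my_pipeline.py \
     --runner=DataflowRunner \
     --project=my-project \
     --region=us-central1 \
     --save_main_session
   ```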



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -146,8 +146,12 @@ Dataflow, see [Pre-building the python SDK custom container image with extra dep
 
 Pickling in the Python SDK is set up to pickle the state of the global namespace. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Beam job.
 Thus, one might encounter an unexpected `NameError` when running a `DoFn` on any remote runner using portability. To resolve this, manage the main session by
-simply setting the main session. This will load the pickled state of the global namespace onto the Dataflow workers.
+simply setting the main session, `--save_main_session`. This will load the pickled state of the global namespace onto the Dataflow workers.
 For example, see [Handling NameErrors](https://cloud.google.com/dataflow/docs/guides/common-errors#how-do-i-handle-nameerrors) to set the main session on the `DataflowRunner`.
 
+The dill pickler is the default pickler in the Python SDK.

Review Comment:
   ```suggestion
   ```



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -146,8 +146,12 @@ Dataflow, see [Pre-building the python SDK custom container image with extra dep
 
 Pickling in the Python SDK is set up to pickle the state of the global namespace. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Beam job.
 Thus, one might encounter an unexpected `NameError` when running a `DoFn` on any remote runner using portability. To resolve this, manage the main session by
-simply setting the main session. This will load the pickled state of the global namespace onto the Dataflow workers.
+simply setting the main session, `--save_main_session`. This will load the pickled state of the global namespace onto the Dataflow workers.
 For example, see [Handling NameErrors](https://cloud.google.com/dataflow/docs/guides/common-errors#how-do-i-handle-nameerrors) to set the main session on the `DataflowRunner`.
 
+The dill pickler is the default pickler in the Python SDK.
+
 **NOTE**: This applies to the Python SDK executing with the dill pickler on any remote runner using portability. Therefore, this issue will

Review Comment:
   ```suggestion
   **NOTE**: This applies to the Python SDK executing with the dill pickler on any remote runner. Therefore, this issue will
   ```
   
   (removing dev jargon which may be confusing to users)



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -146,8 +146,12 @@ Dataflow, see [Pre-building the python SDK custom container image with extra dep
 
 Pickling in the Python SDK is set up to pickle the state of the global namespace. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Beam job.
 Thus, one might encounter an unexpected `NameError` when running a `DoFn` on any remote runner using portability. To resolve this, manage the main session by

Review Comment:
   ```suggestion
   Thus, one might encounter an unexpected `NameError` when running a `DoFn` on any remote runner. To resolve this, supply the main session content with the pipeline by
   ```
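   To see why a missing main session surfaces as a `NameError`, here is a small stdlib-only sketch (no Beam code involved; `scale` and `FACTOR` are made-up names): the standard pickler serializes a top-level function by reference, so a global it uses is not captured, and a worker-like namespace without that global fails at call time.
   
   ```python
   import types
   
   # Simulated "main session": a global plus a DoFn-like function that reads it.
   FACTOR = 3
   
   def scale(x):
       return FACTOR * x  # looks up FACTOR by name at call time
   
   # Pickling a top-level function records only its module and name, not the
   # value of FACTOR. Rebuild the function body in a fresh namespace to mimic
   # a remote worker that unpickled `scale` without the main-session globals:
   worker_globals = {}
   worker_scale = types.FunctionType(scale.__code__, worker_globals, "scale")
   
   error = None
   try:
       worker_scale(2)
   except NameError as e:  # name 'FACTOR' is not defined
       error = e
   ```
   
   Supplying the main session content with the pipeline re-creates those globals on the worker, which is what makes the call succeed there.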



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -146,8 +146,12 @@ Dataflow, see [Pre-building the python SDK custom container image with extra dep
 
 Pickling in the Python SDK is set up to pickle the state of the global namespace. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Beam job.

Review Comment:
   ```suggestion
   When the Python SDK submits the pipeline for execution to a remote runner, the pipeline contents, such as transform user code, are serialized (or pickled) into bytecode using libraries that perform the serialization (also called picklers). The default pickler library used by Beam is `dill`. By default, global imports, functions, and variables defined in the main pipeline module are not saved during the serialization of a Beam job.
   ```





[GitHub] [beam] svetakvsundhar commented on pull request #26331: Adding info on picklers to docs [follow-on]

Posted by "svetakvsundhar (via GitHub)" <gi...@apache.org>.
svetakvsundhar commented on PR #26331:
URL: https://github.com/apache/beam/pull/26331#issuecomment-1516517765

   Thanks! Content LGTM but I edited for grammar, typos, and general flow.




[GitHub] [beam] svetakvsundhar commented on pull request #26331: Adding info on picklers to docs [follow-on]

Posted by "svetakvsundhar (via GitHub)" <gi...@apache.org>.
svetakvsundhar commented on PR #26331:
URL: https://github.com/apache/beam/pull/26331#issuecomment-1513698990

   R: @tvalentyn 




[GitHub] [beam] tvalentyn commented on a diff in pull request #26331: Adding info on picklers to docs [follow-on]

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on code in PR #26331:
URL: https://github.com/apache/beam/pull/26331#discussion_r1173811477


##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -134,20 +134,29 @@ If your pipeline uses non-Python packages (e.g. packages that require installati
 
 **Note:** Because custom commands execute after the dependencies for your workflow are installed (by `pip`), you should omit the PyPI package dependency from the pipeline's `requirements.txt` file and from the `install_requires` parameter in the `setuptools.setup()` call of your `setup.py` file.
 
-## Pre-building SDK container image
+## Pre-building SDK Container Image
 
 In pipeline execution modes where a Beam runner launches SDK workers in Docker containers, the additional pipeline dependencies (specified via `--requirements_file` and other runtime options) are installed into the containers at runtime. This can increase the worker startup time.
 However, it may be possible to pre-build the SDK containers and perform the dependency installation once before the workers start with `--prebuild_sdk_container_engine`. For instructions of how to use pre-building with Google Cloud
 Dataflow, see [Pre-building the python SDK custom container image with extra dependencies](https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild).
 
 **NOTE**: This feature is available only for the `Dataflow Runner v2`.
 
-## Pickling and Managing Main Session
+## Pickling and Managing the Main Session
 
-Pickling in the Python SDK is set up to pickle the state of the global namespace. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Beam job.
-Thus, one might encounter an unexpected `NameError` when running a `DoFn` on any remote runner using portability. To resolve this, manage the main session by
-simply setting the main session. This will load the pickled state of the global namespace onto the Dataflow workers.
+When the Python SDK submits the pipeline for execution to a remote runner, the pipeline contents, such as transform user code, are serialized (or pickled) into bytecode using
+libraries that perform the serialization (also called picklers). The default pickler library used by Beam is `dill`.
+To use the `cloudpickle` pickler, supply the `--pickle_library=cloudpickle` pipeline option.

Review Comment:
   ```suggestion
   To use the `cloudpickle` pickler, supply the `--pickle_library=cloudpickle` pipeline option. The `cloudpickle` support is currently [experimental](https://github.com/apache/beam/issues/21298). 
   ```
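   As a stdlib-only illustration of the difference (not Beam code; `make_adder` is a made-up example): the default by-reference approach cannot serialize a function defined inside another function, which is one of the cases a by-value pickler such as `cloudpickle` is designed to handle.
   
   ```python
   import pickle
   
   def make_adder(n):
       # A closure: defined inside another function, so it has no importable
       # module-level name for a by-reference pickler to record.
       def adder(x):
           return x + n
       return adder
   
   add_five = make_adder(5)
   
   try:
       pickle.dumps(add_five)
       pickled_by_reference = True
   except (pickle.PicklingError, AttributeError):
       # CPython raises "Can't pickle local object 'make_adder.<locals>.adder'"
       pickled_by_reference = False
   ```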



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -134,20 +134,29 @@ If your pipeline uses non-Python packages (e.g. packages that require installati
 
 **Note:** Because custom commands execute after the dependencies for your workflow are installed (by `pip`), you should omit the PyPI package dependency from the pipeline's `requirements.txt` file and from the `install_requires` parameter in the `setuptools.setup()` call of your `setup.py` file.
 
-## Pre-building SDK container image
+## Pre-building SDK Container Image
 
 In pipeline execution modes where a Beam runner launches SDK workers in Docker containers, the additional pipeline dependencies (specified via `--requirements_file` and other runtime options) are installed into the containers at runtime. This can increase the worker startup time.
 However, it may be possible to pre-build the SDK containers and perform the dependency installation once before the workers start with `--prebuild_sdk_container_engine`. For instructions of how to use pre-building with Google Cloud
 Dataflow, see [Pre-building the python SDK custom container image with extra dependencies](https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild).
 
 **NOTE**: This feature is available only for the `Dataflow Runner v2`.
 
-## Pickling and Managing Main Session
+## Pickling and Managing the Main Session
 
-Pickling in the Python SDK is set up to pickle the state of the global namespace. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Beam job.
-Thus, one might encounter an unexpected `NameError` when running a `DoFn` on any remote runner using portability. To resolve this, manage the main session by
-simply setting the main session. This will load the pickled state of the global namespace onto the Dataflow workers.
+When the Python SDK submits the pipeline for execution to a remote runner, the pipeline contents, such as transform user code, are serialized (or pickled) into bytecode using
+libraries that perform the serialization (also called picklers). The default pickler library used by Beam is `dill`.
+To use the `cloudpickle` pickler, supply the `--pickle_library=cloudpickle` pipeline option.

Review Comment:
   ```suggestion
   To use the `cloudpickle` pickler, supply the `--pickle_library=cloudpickle` pipeline option. The `cloudpickle` support is currently [experimental](https://github.com/apache/beam/issues/21298).
   ```





[GitHub] [beam] github-actions[bot] commented on pull request #26331: Adding info on picklers to docs [follow-on]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #26331:
URL: https://github.com/apache/beam/pull/26331#issuecomment-1513700596

   Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control




[GitHub] [beam] tvalentyn commented on pull request #26331: Adding info on picklers to docs [follow-on]

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn commented on PR #26331:
URL: https://github.com/apache/beam/pull/26331#issuecomment-1517976967

   website test suite succeeded: https://ci-beam.apache.org/job/beam_PreCommit_Website_Commit/11204/




[GitHub] [beam] tvalentyn merged pull request #26331: Adding info on picklers to docs [follow-on]

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.
tvalentyn merged PR #26331:
URL: https://github.com/apache/beam/pull/26331

