Posted to github@beam.apache.org by "damccorm (via GitHub)" <gi...@apache.org> on 2023/04/12 17:49:30 UTC

[GitHub] [beam] damccorm commented on a diff in pull request #26236: Pickling and Savemainsession Doc update

damccorm commented on code in PR #26236:
URL: https://github.com/apache/beam/pull/26236#discussion_r1164458088


##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -141,3 +141,12 @@ However, it may be possible to pre-build the SDK containers and perform the depe
 Dataflow, see [Pre-building the python SDK custom container image with extra dependencies](https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild).
 
 **NOTE**: This feature is available only for the `Dataflow Runner v2`.
+
+## Pickling and Managing Main Session
+
+Pickling in the Python SDK is set up to pickle the state of the global namespace. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Dataflow job.
+Thus, one might encounter an unexpected `NameError` when running a `DoFn` on Dataflow Runner. To resolve this, manage the main session by
+simply setting `--save_main_session=True`. This will load the pickled state of the global namespace onto the Dataflow workers.
+For more information, see [Handling NameErrors](https://cloud.google.com/dataflow/docs/guides/common-errors#how-do-i-handle-nameerrors).
+
+**NOTE**: This strictly applies to the `Python SDK executing with the dill pickler on the Dataflow Runner`.

Review Comment:
   Also, I'd make it clear that the dill pickler is the default.
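
A minimal sketch of the failure mode the proposed text describes (the `ExtractWordsFn` name and sample input are illustrative; `SetupOptions.save_main_session` is the option behind the `--save_main_session` flag):

```python
import re  # imported at module level, i.e. in the main session

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions


class ExtractWordsFn(beam.DoFn):
    def process(self, element):
        # On a remote worker, `re` is only defined if the main session was
        # pickled and restored; without --save_main_session this can raise
        # NameError: name 're' is not defined.
        return re.findall(r"[A-Za-z']+", element)


options = PipelineOptions()
# Equivalent to passing --save_main_session on the command line.
options.view_as(SetupOptions).save_main_session = True

with beam.Pipeline(options=options) as p:
    p | beam.Create(['a minimal example']) | beam.ParDo(ExtractWordsFn())
```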



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -141,3 +141,12 @@ However, it may be possible to pre-build the SDK containers and perform the depe
 Dataflow, see [Pre-building the python SDK custom container image with extra dependencies](https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild).
 
 **NOTE**: This feature is available only for the `Dataflow Runner v2`.
+
+## Pickling and Managing Main Session
+
+Pickling in the Python SDK is set up to pickle the state of the global namespace. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Dataflow job.
+Thus, one might encounter an unexpected `NameError` when running a `DoFn` on Dataflow Runner. To resolve this, manage the main session by
+simply setting `--save_main_session=True`. This will load the pickled state of the global namespace onto the Dataflow workers.
+For more information, see [Handling NameErrors](https://cloud.google.com/dataflow/docs/guides/common-errors#how-do-i-handle-nameerrors).
+
+**NOTE**: This strictly applies to the `Python SDK executing with the dill pickler on the Dataflow Runner`.

Review Comment:
   Doesn't this at least apply to all remote runners using portability (not just Dataflow)?



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -141,3 +141,12 @@ However, it may be possible to pre-build the SDK containers and perform the depe
 Dataflow, see [Pre-building the python SDK custom container image with extra dependencies](https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild).
 
 **NOTE**: This feature is available only for the `Dataflow Runner v2`.
+
+## Pickling and Managing Main Session

Review Comment:
   If we're going to use Dataflow-specific language here, we should specifically call that out in the section heading. I think this applies to other remote runners though.



##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -141,3 +141,12 @@ However, it may be possible to pre-build the SDK containers and perform the depe
 Dataflow, see [Pre-building the python SDK custom container image with extra dependencies](https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild).
 
 **NOTE**: This feature is available only for the `Dataflow Runner v2`.
+
+## Pickling and Managing Main Session
+
+Pickling in the Python SDK is set up to pickle the state of the global namespace. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Dataflow job.
+Thus, one might encounter an unexpected `NameError` when running a `DoFn` on Dataflow Runner. To resolve this, manage the main session by
+simply setting `--save_main_session=True`. This will load the pickled state of the global namespace onto the Dataflow workers.
+For more information, see [Handling NameErrors](https://cloud.google.com/dataflow/docs/guides/common-errors#how-do-i-handle-nameerrors).
+
+**NOTE**: This strictly applies to the `Python SDK executing with the dill pickler on the Dataflow Runner`.

Review Comment:
   For example, we use it in our Flink portability tests: https://github.com/apache/beam/blob/326373715e0ca071d610a03e92626b1957253f81/runners/portability/test_flink_uber_jar.sh#L24
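
A hedged sketch of that point: the same flag applies when targeting a portable runner such as Flink, not just Dataflow. The job-server endpoint below is an assumed local address, not a value from this thread:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical setup: a Flink job server is assumed to be listening locally.
options = PipelineOptions([
    '--runner=PortableRunner',
    '--job_endpoint=localhost:8099',
    '--environment_type=LOOPBACK',
    '--save_main_session',  # same option, non-Dataflow runner
])

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)
```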


