You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/06/23 21:55:49 UTC

[GitHub] [beam] TheNeuralBit opened a new pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

TheNeuralBit opened a new pull request #12067:
URL: https://github.com/apache/beam/pull/12067


   This PR creates a global cache for component ID assignments made in PipelineContext instances. See BEAM-10308 and BEAM-10143 for motivation.
   
   Post-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   Lang | SDK | Dataflow | Flink | Samza | Spark
   --- | --- | --- | --- | --- | ---
   Go | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/)
   Java | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/)
   Python | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python35_VR_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python35_VR_Flink/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/)
   XLang | --- | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/) | --- | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/)
   
   Pre-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   --- |Java | Python | Go | Website
   --- | --- | --- | --- | ---
   Non-portable | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/)
   Portable | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/) | --- | ---
   
   See [.test-infra/jenkins/README](https://github.com/apache/beam/blob/master/.test-infra/jenkins/README.md) for trigger phrase, status and link of all Jenkins jobs.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit commented on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-651426642


   Ok I think I'm finally done churning on this. Sorry for all the changes. What I have now stores the component ids in a `ComponentIdContext` object stored on the Pipeline. The same context is used in both `ExternalTransform.expand` and in `Pipeline.to_runner_api`, so component ids will be consistent across both of those functions, but everything we store in memory will be cleaned up with the Pipeline. PTAL


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit commented on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-652067737


   The only failure in Python precommit seems to be due to BEAM-10007. I think this is safe to merge after an LGTM


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit commented on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-652090261


   @lukecwik do you happen to have time to review this? Looks like cham is OOO
   
   I'd like to get this in before the release cut tomorrow since it fixes a significant issue with ExternalTransform.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit commented on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-649168647


   I have a POC working that solves the problem by fixing all of the component IDs in the external transforms in `to_runner_api()`. It feels pretty error-prone, but it doesn't maintain any global state, and keeps all of the special logic within ExternalTransform. I'll clean it up and push it here tomorrow morning.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit edited a comment on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit edited a comment on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-652067737


   The only failure in Python precommit seems to be a flake due to BEAM-10007. I think this is safe to merge after an LGTM


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit commented on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-654504237


   @kennknowles are you ok with merging this now?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit commented on a change in pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on a change in pull request #12067:
URL: https://github.com/apache/beam/pull/12067#discussion_r450442458



##########
File path: sdks/python/apache_beam/pipeline.py
##########
@@ -221,6 +221,14 @@ def __init__(self, runner=None, options=None, argv=None):
     # then the transform will have to be cloned with a new label.
     self.applied_labels = set()  # type: Set[str]
 
+    # Create a context for assigning IDs to components. Ensures that any
+    # components that receive an ID during pipeline construction (for example in
+    # ExternalTransform), will receive the same component ID when generating the
+    # full pipeline proto.
+    from apache_beam.runners import pipeline_context

Review comment:
       Done, thanks!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit commented on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-652002495


   Run XVR_Flink PostCommit


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit commented on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-652700056


   Run XVR_Flink PostCommit


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit commented on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-652715930


   Run Python2_PVR_Flink PreCommit


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] chamikaramj commented on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
chamikaramj commented on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-649888003


   Do you think we can add another unit test that shows how current state will result in an invalid proto during cross-language expansion ? That will help us understand the scope of the issue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] chamikaramj commented on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
chamikaramj commented on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-650456801


   I think that's fine. We should not be creating too many pipeline objects in parallel.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit commented on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-648451444


   R: @chamikaramj 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit commented on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-651988106


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] kennknowles commented on a change in pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
kennknowles commented on a change in pull request #12067:
URL: https://github.com/apache/beam/pull/12067#discussion_r448646153



##########
File path: sdks/python/apache_beam/pipeline.py
##########
@@ -221,6 +221,14 @@ def __init__(self, runner=None, options=None, argv=None):
     # then the transform will have to be cloned with a new label.
     self.applied_labels = set()  # type: Set[str]
 
+    # Create a context for assigning IDs to components. Ensures that any
+    # components that receive an ID during pipeline construction (for example in
+    # ExternalTransform), will receive the same component ID when generating the
+    # full pipeline proto.
+    from apache_beam.runners import pipeline_context

Review comment:
       There's not actually a circular dependency of classes. You could move this class in here and that would be fine.

##########
File path: sdks/python/apache_beam/runners/pipeline_context.py
##########
@@ -49,6 +50,32 @@
   from apache_beam.coders.coder_impl import IterableStateWriter
 
 
+class ComponentIdContext(object):

Review comment:
       There's a lot of overlap between the purpose of this and the purpose of the `PipelineContext` but I can see how they are different. I see that the pipeline only takes the context at `to_runner_api` time, and the context has tweaks like `use_fake_coders` and `default_environment`. So when using xlang expansion you can keep just the ids and throw away the contents that will be generated according to these tweaks later.

##########
File path: sdks/python/apache_beam/transforms/external.py
##########
@@ -286,7 +286,8 @@ def expand(self, pvalueish):
     pipeline = (
         next(iter(self._inputs.values())).pipeline
         if self._inputs else pvalueish.pipeline)
-    context = pipeline_context.PipelineContext()
+    context = pipeline_context.PipelineContext(
+        component_id_context=pipeline._component_id_context)

Review comment:
       Based on this use, maybe `_component_id_context` isn't private at all? And maybe it is a `component_id_map` or some such. It isn't "context" in the sense that it isn't "the stuff surrounding the thing you are interested in".




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] chamikaramj commented on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
chamikaramj commented on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-655155939


   Took a look and confirmed that x-lang KafkaIO works with this change. LGTM from me.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit commented on a change in pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on a change in pull request #12067:
URL: https://github.com/apache/beam/pull/12067#discussion_r450442205



##########
File path: sdks/python/apache_beam/transforms/external.py
##########
@@ -286,7 +286,8 @@ def expand(self, pvalueish):
     pipeline = (
         next(iter(self._inputs.values())).pipeline
         if self._inputs else pvalueish.pipeline)
-    context = pipeline_context.PipelineContext()
+    context = pipeline_context.PipelineContext(
+        component_id_context=pipeline._component_id_context)

Review comment:
       Good point, renamed it to `component_id_map` everywhere




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit commented on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-650452189


   Run XVR_Flink PostCommit


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit commented on a change in pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on a change in pull request #12067:
URL: https://github.com/apache/beam/pull/12067#discussion_r446452942



##########
File path: sdks/python/apache_beam/runners/pipeline_context.py
##########
@@ -49,6 +50,52 @@
   from apache_beam.coders.coder_impl import IterableStateWriter
 
 
+class _UniqueRefAssigner(object):
+  """Utility for assigning unique refs to proto messages for use in components.
+
+  Instances of _UniqueRefAssigner are global (scoped by the base string). That
+  way once a unique ref is assigned it will be used consistently across
+  PipelineContext instances.
+  """
+  _INSTANCES = {}  # type: Dict[Tuple[Pipeline, str], _UniqueRefAssigner]
+
+  def __init__(self, base):
+    self._base = base
+    self._counter = 0
+    self._obj_to_id = {}  # type: Dict[Any, str]
+
+  def get_or_assign(self, obj=None, label=None):
+    # type: (Optional[Any], Optional[str]) -> str
+
+    """Retrieve the unique ref for the given object.
+
+    Generates and assigns a unique ref if one hasn't been assigned yet. label
+    will be incorporated into the unique ref when assigning a new unique ref,
+    otherwise it is ignored."""
+    if obj not in self._obj_to_id:
+      self._obj_to_id[obj] = self._unique_ref(obj, label)
+
+    return self._obj_to_id[obj]
+
+  def _unique_ref(self, obj=None, label=None):
+    self._counter += 1
+    return "%s_%s_%d" % (self._base, label or type(obj).__name__, self._counter)
+
+  @classmethod
+  def get_instance(cls, pipeline, base):
+    # type: (Optional[Pipeline], str) -> _UniqueRefAssigner
+
+    """Return the _UniqueRefAssigner with the given base string.
+
+    Creates a new instance if one doesn't already exist for this base string."""
+    key = (id(pipeline), base)

Review comment:
       Partitioning the cached ID assignments by `id(pipeline)` is a sub-par and dangerous solution because `id` is only guaranteed to be unique among objects that have non-overlapping lifetimes. If we go with something like this approach, we need to at least tie these instances' lifetimes to the lifetime of the pipeline.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit commented on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-650454710


   I think something like this will fix the ExternalTransform bug, without breaking all the tests that rely on specific component id assignments.
   
   It will still grow without bound with each pipeline that's created in a process though.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit commented on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-651416156






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit merged pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit merged pull request #12067:
URL: https://github.com/apache/beam/pull/12067


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit commented on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-649898941


   I'll see if I can add a simple unit test tomorrow.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit commented on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-648451338


   One shortcoming of this approach is that component ids are stored in a _global_ cache. I think it would be better if the cache could be scoped to a pipeline, but that would be more difficult to implement. I don't _think_ the global cache would cause any issues except for an extreme case where thousands of pipelines are being launched from a single python process and we consume a huge amount of memory.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit commented on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-648889479


   CC: @lukecwik 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit commented on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-649046980


   I suppose another place where global state is ill-advised is when running tests, since we run all the tests in the same process and many of them create pipelines. 
   
   Python precommits are failing because of this. It looks like the failing tests are making assertions about specific values of component keys which seems brittle, for example:
   
   ```
   self = <apache_beam.runners.interactive.pipeline_instrument_test.PipelineInstrumentTest testMethod=test_cacheable_key_with_version_map>
   
       def test_cacheable_key_with_version_map(self):
         p = beam.Pipeline(interactive_runner.InteractiveRunner())
         # pylint: disable=range-builtin-not-iterating
         init_pcoll = p | 'Init Create' >> beam.Create(range(10))
       
         # It's normal that when executing, the pipeline object is a different
         # but equivalent instance from what user has built. The pipeline instrument
         # should be able to identify if the original instance has changed in an
         # interactive env while mutating the other instance for execution. The
         # version map can be used to figure out what the PCollection instances are
         # in the original instance and if the evaluation has changed since last
         # execution.
         p2 = beam.Pipeline(interactive_runner.InteractiveRunner())
         # pylint: disable=range-builtin-not-iterating
         init_pcoll_2 = p2 | 'Init Create' >> beam.Create(range(10))
         _, ctx = p2.to_runner_api(use_fake_coders=True, return_context=True)
       
         # The cacheable_key should use id(init_pcoll) as prefix even when
         # init_pcoll_2 is supplied as long as the version map is given.
         self.assertEqual(
             instr.cacheable_key(
                 init_pcoll_2,
                 instr.pcolls_to_pcoll_id(p2, ctx),
                 {'ref_PCollection_PCollection_8': str(id(init_pcoll))}),
   >         str(id(init_pcoll)) + '_ref_PCollection_PCollection_8')
   E     AssertionError: '140176476148624_ref_PCollection_PCollection_4539' != '140176476499024_ref_PCollection_PCollection_8'
   E     - 140176476148624_ref_PCollection_PCollection_4539
   E     ?          - ^^                               ^^^^
   E     + 140176476499024_ref_PCollection_PCollection_8
   E     ?           ^^^      
   ```
   
   Its probably easier to do the work to make sure cached component ids are scoped to an individual pipeline rather than fixing all of these tests (and a global cache shared across tests will be problematic anyway).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] TheNeuralBit commented on pull request #12067: [BEAM-10308] Make component ID assignments consistent across PipelineContext instances

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on pull request #12067:
URL: https://github.com/apache/beam/pull/12067#issuecomment-652002592


   Run XVR_Spark PostCommit


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org