You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/03/11 19:53:03 UTC

[GitHub] [beam] TheNeuralBit commented on a change in pull request #14176: [BEAM-10708] Clean up pipeline instrument

TheNeuralBit commented on a change in pull request #14176:
URL: https://github.com/apache/beam/pull/14176#discussion_r592668654



##########
File path: sdks/python/apache_beam/runners/interactive/caching/cacheable.py
##########
@@ -0,0 +1,78 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module for dataclasses to hold metadata of cacheable PCollections in the user
+code scope.
+
+For internal use only; no backwards-compatibility guarantees.
+"""
+
+# pytype: skip-file
+
+from __future__ import absolute_import
+
+from dataclasses import dataclass
+
+import apache_beam as beam
+from apache_beam.runners.interactive.utils import obfuscate
+
+
+@dataclass
+class Cacheable:
+  pcoll_id: str
+  var: str
+  version: str
+  pcoll: beam.pvalue.PCollection
+  producer_version: str
+
+  def __hash__(self):
+    return hash((
+        self.pcoll_id,
+        self.var,
+        self.version,
+        self.pcoll,
+        self.producer_version))

Review comment:
       It [looks like](https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass) you can add `@dataclass(frozen=True)` and it will generate a `__hash__` implementation for you. The trade off is that assigning to fields generates an exception. Would that work for you?

##########
File path: sdks/python/setup.py
##########
@@ -131,6 +131,10 @@ def get_version():
     # Avro 1.9.2 for python3 was broken. The issue was fixed in version 1.9.2.1
     'avro-python3>=1.8.1,!=1.9.2,<1.10.0',
     'crcmod>=1.7,<2.0',
+    # dataclasses backport for Python 3.6. No version boundary so that Python
+    # picks the correct one based on its version. For more details, see
+    # https://pypi.org/project/dataclasses/
+    'dataclasses',

Review comment:
       I don't see anything at https://pypi.org/project/dataclasses/ explaining why we don't need a version bound, could you clarify?
   
   It could make sense to add a `python_version<3.7`, so that if/when we drop python 3.6 support it's clear we can drop this.

##########
File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py
##########
@@ -885,10 +803,7 @@ def cacheables(pcolls_to_pcoll_id):
   cacheable_var_by_pcoll_id = {}
   for watching in ie.current_env().watching():
     for key, val in watching:
-      # TODO(BEAM-8288): cleanup the attribute check when py2 is not supported.
-      if hasattr(val, '__class__') and isinstance(val, beam.pvalue.PCollection):
-        cacheable = {}
-

Review comment:
       Do you still need this default value for cacheable in case its not set elsewhere?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org