You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/04/22 23:12:34 UTC

[GitHub] [beam] rohdesamuel opened a new pull request #11503: Beam 9692 gbk

rohdesamuel opened a new pull request #11503:
URL: https://github.com/apache/beam/pull/11503


   **Please** add a meaningful description for your change here
   
   ------------------------
   
   Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
   
    - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`).
    - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
    - [ ] Update `CHANGES.md` with noteworthy changes.
    - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
   
   Post-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
   --- | --- | --- | --- | --- | --- | --- | ---
   Go | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/)
   Java | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/)
   Python | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/) | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Python35_VR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python35_VR_Flink/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/)
   XLang | --- | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/)
   
   Pre-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   --- |Java | Python | Go | Website
   --- | --- | --- | --- | ---
   Non-portable | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/) 
   Portable | --- | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/) | --- | ---
   
   See [.test-infra/jenkins/README](https://github.com/apache/beam/blob/master/.test-infra/jenkins/README.md) for trigger phrase, status and link of all Jenkins jobs.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] rohdesamuel commented on pull request #11503: [BEAM-9692] Make GroupByKey into a primitive

Posted by GitBox <gi...@apache.org>.
rohdesamuel commented on pull request #11503:
URL: https://github.com/apache/beam/pull/11503#issuecomment-628266026


   Thanks for the comments, I moved the GroupByKeyOnly etc to the DirectRunner and cleaned up the PTransformOverride by not subclassing.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] robertwb commented on pull request #11503: Beam 9692 gbk

Posted by GitBox <gi...@apache.org>.
robertwb commented on pull request #11503:
URL: https://github.com/apache/beam/pull/11503#issuecomment-618601834


   For this one, we should make GroupByKey primitive, and move the composite implementation to an override in the BundleBasedDirectRunner. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] rohdesamuel commented on pull request #11503: [BEAM-9692] Make GroupByKey into a primitive

Posted by GitBox <gi...@apache.org>.
rohdesamuel commented on pull request #11503:
URL: https://github.com/apache/beam/pull/11503#issuecomment-624873948


   R: @robertwb 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] rohdesamuel commented on pull request #11503: [BEAM-9692] Make GroupByKey into a primitive

Posted by GitBox <gi...@apache.org>.
rohdesamuel commented on pull request #11503:
URL: https://github.com/apache/beam/pull/11503#issuecomment-628297341


   Had to add back the ReifyWindows. Taking it out broke some internal tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] rohdesamuel commented on a change in pull request #11503: [BEAM-9692] Make GroupByKey into a primitive

Posted by GitBox <gi...@apache.org>.
rohdesamuel commented on a change in pull request #11503:
URL: https://github.com/apache/beam/pull/11503#discussion_r424735424



##########
File path: sdks/python/apache_beam/runners/direct/direct_runner.py
##########
@@ -234,7 +234,89 @@ def get_replacement_transform(self, transform):
       from apache_beam.runners.direct.test_stream_impl import _ExpandableTestStream
       return _ExpandableTestStream(transform)
 
+  class GroupByKeyPTransformOverride(PTransformOverride):
+    """A ``PTransformOverride`` for ``GroupByKey``.
+
+    This replaces the Beam implementation as a primitive.
+    """
+    def matches(self, applied_ptransform):
+      # Imported here to avoid circular dependencies.
+      # pylint: disable=wrong-import-order, wrong-import-position
+      from apache_beam.transforms.core import GroupByKey
+      if (isinstance(applied_ptransform.transform, GroupByKey) and
+          not getattr(applied_ptransform.transform, 'override', False)):
+        self.input_type = applied_ptransform.inputs[0].element_type
+        return True
+      return False
+
+    def get_replacement_transform(self, ptransform):
+      # Imported here to avoid circular dependencies.
+      # pylint: disable=wrong-import-order, wrong-import-position
+      from apache_beam.transforms.core import GroupByKey
+
+      # Subclass from GroupByKey to inherit all the proper methods.
+      class GroupByKey(GroupByKey):
+        override = True
+
+        def expand(self, pcoll):
+          # Imported here to avoid circular dependencies.
+          # pylint: disable=wrong-import-order, wrong-import-position
+          from apache_beam.coders import typecoders
+          from apache_beam.typehints import trivial_inference
+
+          input_type = pcoll.element_type
+          if input_type is not None:
+            # Initialize type-hints used below to enforce type-checking and to
+            # pass downstream to further PTransforms.
+            key_type, value_type = trivial_inference.key_value_types(input_type)
+            # Enforce the input to a GBK has a KV element type.
+            pcoll.element_type = typehints.typehints.coerce_to_kv_type(
+                pcoll.element_type)
+            typecoders.registry.verify_deterministic(
+                typecoders.registry.get_coder(key_type),
+                'GroupByKey operation "%s"' % self.label)
+
+            reify_output_type = typehints.KV[
+                key_type, typehints.WindowedValue[value_type]]  # type: ignore[misc]
+            gbk_input_type = (
+                typehints.
+                KV[key_type,
+                   typehints.Iterable[
+                       typehints.WindowedValue[  # type: ignore[misc]
+                           value_type]]])
+            gbk_output_type = typehints.KV[key_type,
+                                           typehints.Iterable[value_type]]
+
+            # pylint: disable=bad-continuation
+            return (
+                pcoll
+                | 'ReifyWindows' >> (
+                    ParDo(self.ReifyWindows()).with_output_types(
+                        reify_output_type))
+                | 'GroupByKey' >> (
+                    _GroupByKeyOnly().with_input_types(
+                        reify_output_type).with_output_types(gbk_input_type))
+                | (
+                    'GroupByWindow' >>
+                    _GroupAlsoByWindow(pcoll.windowing).with_input_types(
+                        gbk_input_type).with_output_types(gbk_output_type)))
+          else:
+            # The input_type is None, run the default
+            return (
+                pcoll
+                | 'ReifyWindows' >> ParDo(self.ReifyWindows())

Review comment:
       Done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] rohdesamuel commented on a change in pull request #11503: [BEAM-9692] Make GroupByKey into a primitive

Posted by GitBox <gi...@apache.org>.
rohdesamuel commented on a change in pull request #11503:
URL: https://github.com/apache/beam/pull/11503#discussion_r424735647



##########
File path: sdks/python/apache_beam/runners/direct/direct_runner.py
##########
@@ -234,7 +234,89 @@ def get_replacement_transform(self, transform):
       from apache_beam.runners.direct.test_stream_impl import _ExpandableTestStream
       return _ExpandableTestStream(transform)
 
+  class GroupByKeyPTransformOverride(PTransformOverride):
+    """A ``PTransformOverride`` for ``GroupByKey``.
+
+    This replaces the Beam implementation as a primitive.
+    """
+    def matches(self, applied_ptransform):
+      # Imported here to avoid circular dependencies.
+      # pylint: disable=wrong-import-order, wrong-import-position
+      from apache_beam.transforms.core import GroupByKey
+      if (isinstance(applied_ptransform.transform, GroupByKey) and
+          not getattr(applied_ptransform.transform, 'override', False)):
+        self.input_type = applied_ptransform.inputs[0].element_type
+        return True
+      return False
+
+    def get_replacement_transform(self, ptransform):
+      # Imported here to avoid circular dependencies.
+      # pylint: disable=wrong-import-order, wrong-import-position
+      from apache_beam.transforms.core import GroupByKey
+
+      # Subclass from GroupByKey to inherit all the proper methods.

Review comment:
       Yep, I took out the subclass dependency and simplified the override




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] rohdesamuel commented on pull request #11503: [BEAM-9692] Make GroupByKey into a primitive

Posted by GitBox <gi...@apache.org>.
rohdesamuel commented on pull request #11503:
URL: https://github.com/apache/beam/pull/11503#issuecomment-628793076


   Don't submit yet, there are some breaking internal tests


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] robertwb merged pull request #11503: [BEAM-9692] Make GroupByKey into a primitive

Posted by GitBox <gi...@apache.org>.
robertwb merged pull request #11503:
URL: https://github.com/apache/beam/pull/11503


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] rohdesamuel commented on a change in pull request #11503: [BEAM-9692] Make GroupByKey into a primitive

Posted by GitBox <gi...@apache.org>.
rohdesamuel commented on a change in pull request #11503:
URL: https://github.com/apache/beam/pull/11503#discussion_r424753027



##########
File path: sdks/python/apache_beam/runners/dataflow/dataflow_runner_test.py
##########
@@ -404,10 +396,6 @@ def test_gbk_then_flatten_input_visitor(self):
     flat = (none_str_pc, none_int_pc) | beam.Flatten()
     _ = flat | beam.GroupByKey()
 
-    # This may change if type inference changes, but we assert it here
-    # to make sure the check below is not vacuous.

Review comment:
       I added it back in. The DataflowRunner should change the element type and we should test before and after.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] rohdesamuel commented on pull request #11503: [BEAM-9692] Make GroupByKey into a primitive

Posted by GitBox <gi...@apache.org>.
rohdesamuel commented on pull request #11503:
URL: https://github.com/apache/beam/pull/11503#issuecomment-631701353


   I added a condition to skip external transforms because we can assume that the external service provides correct transforms. Internal testing passing now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org