You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/10/02 00:51:46 UTC

[GitHub] [beam] robertwb commented on a change in pull request #12705: [BEAM-10720][WIP] Implement StringMethods

robertwb commented on a change in pull request #12705:
URL: https://github.com/apache/beam/pull/12705#discussion_r498571169



##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -905,7 +905,48 @@ def checked_callable_index(df):
             preserves_partition_by=partitionings.Singleton()))
 
 class _DeferredStringMethods(frame_base.DeferredBase):
-  pass
+  @frame_base.args_to_kwargs(pd.core.strings.StringMethods)
+  @frame_base.populate_defaults(pd.core.strings.StringMethods)
+  def cat(self, others, join, **kwargs):
+    if others is None:
+      # Concatenate series into a single String
+      requires = partitionings.Singleton()
+      func = lambda df: df.cat(join=join, **kwargs)
+      args = [self._expr]
+
+    elif isinstance(others, frame_base.DeferredBase) or \

Review comment:
       Use ()'s rather than backslashes for continuations. 

##########
File path: sdks/python/apache_beam/dataframe/transforms_test.py
##########
@@ -266,6 +266,17 @@ def check(actual):
           lambda x: {'res': 3 * x}, proxy)
       assert_that(res['res'], equal_to_series(three_series), 'CheckDictOut')
 
+  def test_cat(self):
+    # verify that cat works with a List[Series] sicne this is missing from doctests
+    df = pd.DataFrame({
+        'one': ['A', 'B', 'C'],
+        'two': ['BB', 'CC', 'A'],
+        'three': ['CCC', 'AA', 'B'],
+    })
+    self.run_scenario(df, lambda df: df.two.str.cat([df.three], join='outer'))
+    self.run_scenario(
+        df, lambda df: df.one.str.cat([df.two, df.three], join='outer'))
+

Review comment:
       I think this is because we're trying to partition the `StringMethods` object itself and send it across the stage/shuffle boundary, and that code only expects dataframes and series to be partition-able (expecting everything else to be a single-object scalar. 
   
   One route would be to add StringMethods to the list of things that are partitionable, but that feels a bit odd and won't work well with a columnar backend representation (e.g. how would we represent s.str[:] as an Arrow batch?) Better would be for these expressions to take the underlying series as an argument, and invoke `str` on it themselves as part of the function to be applied. This may take some refactoring of the elementwise operations. (It *may* be OK, as all the others get "fused away", similarly for DateTime, but we should probably add a check that we don't try to partition anything but frames).

##########
File path: sdks/python/apache_beam/dataframe/transforms_test.py
##########
@@ -266,6 +266,18 @@ def check(actual):
           lambda x: {'res': 3 * x}, proxy)
       assert_that(res['res'], equal_to_series(three_series), 'CheckDictOut')
 
+  def test_cat(self):
+    # verify that cat works with a List[Series] sicne this is

Review comment:
       since




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org