Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2021/06/14 19:38:18 UTC

[GitHub] [beam] TheNeuralBit commented on a change in pull request #14992: [BEAM-9547] Add implementation for first and last

TheNeuralBit commented on a change in pull request #14992:
URL: https://github.com/apache/beam/pull/14992#discussion_r651215931



##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -253,6 +253,36 @@ def fillna(self, value, method, axis, limit, **kwargs):
   backfill = _fillna_alias('backfill')
   pad = _fillna_alias('pad')
 
+  @frame_base.with_docs_from(pd.DataFrame)
+  def first(self, offset):
+    per_partition = expressions.ComputedExpression(
+        'first-per-partition',
+        lambda df: df.sort_index().first(offset=offset), [self._expr],
+        preserves_partition_by=partitionings.Arbitrary(),
+        requires_partition_by=partitionings.Arbitrary())
+    with expressions.allow_non_parallel_operations(True):
+      return frame_base.DeferredFrame.wrap(
+          expressions.ComputedExpression(
+              'first',
+              lambda df: df.sort_index().first(offset=offset), [per_partition],
+              preserves_partition_by=partitionings.Arbitrary(),
+              requires_partition_by=partitionings.Singleton()))

Review comment:
       Yep!

##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -3037,10 +3059,8 @@ def do_partition_apply(df):
   tail = frame_base.wont_implement_method(
       DataFrameGroupBy, 'tail', explanation=_PEEK_METHOD_EXPLANATION)
 
-  first = frame_base.wont_implement_method(
-      DataFrameGroupBy, 'first', reason='order-sensitive')
-  last = frame_base.wont_implement_method(
-      DataFrameGroupBy, 'last', reason='order-sensitive')
+  first = frame_base.not_implemented_method('first')

Review comment:
       `not_implemented_method` indicates a method that's not implemented only because we haven't gotten to it yet; it will `raise NotImplementedError(..)` if used. `wont_implement_method` is for operations that aren't implemented because of some structural issue (like being sensitive to order, or producing an output schema that we can't determine at construction time). The latter raises an error that points to documentation about that type of limitation (BEAM-12029 tracks the error messages; BEAM-11951 tracks the documentation, which is still in progress).
   
   "Won't implement" is a little strong, since in practice we may still implement some of those in the future. But the barrier for those is higher.

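To make the distinction concrete, here is a minimal, self-contained sketch. It does not use Beam's actual helpers: `WontImplementSketchError`, `not_implemented_method_sketch`, `wont_implement_method_sketch`, and `GroupBySketch` are hypothetical stand-ins, and the error message wording is invented for illustration.

```python
class WontImplementSketchError(NotImplementedError):
    """Hypothetical error for operations blocked by a structural limitation."""


def not_implemented_method_sketch(name):
    """Placeholder for a method that simply has not been written yet."""
    def method(self, *args, **kwargs):
        raise NotImplementedError(f"{name!r} is not implemented yet.")
    return method


def wont_implement_method_sketch(name, reason):
    """Placeholder for a method blocked by a structural issue."""
    def method(self, *args, **kwargs):
        raise WontImplementSketchError(
            f"{name!r} is not supported because it is {reason}.")
    return method


class GroupBySketch:
    # Not written yet: raises a plain NotImplementedError.
    first = not_implemented_method_sketch('first')
    # Structurally hard: raises a more specific error naming the limitation.
    tail = wont_implement_method_sketch('tail', reason='order-sensitive')
```

In this sketch both placeholders fail only when called, so a user sees the difference in the error type and message rather than at import time.
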
##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -3210,17 +3230,15 @@ class _DeferredGroupByCols(frame_base.DeferredFrame):
   diff = frame_base._elementwise_method('diff', base=DataFrameGroupBy)
   fillna = frame_base._elementwise_method('fillna', base=DataFrameGroupBy)
   filter = frame_base._elementwise_method('filter', base=DataFrameGroupBy)
-  first = frame_base.wont_implement_method(
-      DataFrameGroupBy, 'first', reason="order-sensitive")
+  first = frame_base._elementwise_method('first', base=DataFrameGroupBy)

Review comment:
       This is a weird quirk of our implementation. In pandas, when you groupby() a DataFrame, you can change the "axis" you want to group/aggregate across. The default is the intuitive axis="index", where each column is grouped/aggregated across all of the rows of the dataset.
   
   But users can also specify they want to groupby(axis="columns"), in which case each _row_ will be grouped/aggregated across the columns. This class, `_DeferredGroupByCols`, is just handling that `axis="columns"` case.
   
   Technically we can easily support most of these aggregations since they're just performing an operation on each element, but it's not clear this path actually gets much usage.
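
The difference between the two axes can be illustrated without pandas. The following is a rough pure-Python sketch, not Beam or pandas code: the table `rows`, the mapping `column_groups`, and both helper functions are hypothetical names chosen for this example.

```python
# A tiny table: each dict is one row, keyed by column name.
rows = [
    {'a': 1, 'b': 2, 'c': 3},
    {'a': 4, 'b': 5, 'c': 6},
]

# Hypothetical mapping from column name to group label,
# used only for the axis="columns" style of grouping.
column_groups = {'a': 'x', 'b': 'x', 'c': 'y'}


def sum_down_rows(rows):
    """axis="index" style: each column is aggregated across all rows."""
    totals = {}
    for row in rows:
        for col, value in row.items():
            totals[col] = totals.get(col, 0) + value
    return totals


def sum_across_columns(row, column_groups):
    """axis="columns" style: one row is aggregated across its grouped columns."""
    totals = {}
    for col, value in row.items():
        label = column_groups[col]
        totals[label] = totals.get(label, 0) + value
    return totals


by_index = sum_down_rows(rows)  # needs to see every row
by_columns = [sum_across_columns(r, column_groups) for r in rows]  # one row at a time
```

The `axis="columns"` style only ever looks at a single row, which is why it can be treated as an operation on each element independently.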

##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -253,6 +253,36 @@ def fillna(self, value, method, axis, limit, **kwargs):
   backfill = _fillna_alias('backfill')
   pad = _fillna_alias('pad')
 
+  @frame_base.with_docs_from(pd.DataFrame)
+  def first(self, offset):
+    per_partition = expressions.ComputedExpression(
+        'first-per-partition',
+        lambda df: df.sort_index().first(offset=offset), [self._expr],
+        preserves_partition_by=partitionings.Arbitrary(),

Review comment:
       This actually means it will preserve any partitioning; `preserves=Singleton()` would indicate it preserves no partitioning.
   
   In this case the operation doesn't modify the index at all, so the output should still be partitioned in the same way.
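
As a rough illustration of why leaving the index untouched keeps the partitioning intact, consider the pure-Python sketch below. It is not Beam code: the modulo-based `partition_by_index` is an assumed stand-in for partitioning by index, and all function names are hypothetical.

```python
NUM_PARTITIONS = 3


def partition_by_index(rows):
    """Assign (index, value) rows to partitions based on their index."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for index, value in rows:
        partitions[index % NUM_PARTITIONS].append((index, value))
    return partitions


def is_partitioned_by_index(partitions):
    """True if every row sits in the partition its index maps to."""
    return all(
        index % NUM_PARTITIONS == p
        for p, part in enumerate(partitions)
        for index, _ in part)


def keep_smallest_index(part):
    """Keeps a subset of rows and leaves each index untouched."""
    return sorted(part)[:1]


def shift_index(part):
    """Rewrites each index, so rows may no longer belong to their partition."""
    return [(index + 1, value) for index, value in part]


partitions = partition_by_index([(i, i * 10) for i in range(7)])
filtered = [keep_smallest_index(p) for p in partitions]
shifted = [shift_index(p) for p in partitions]
```

Here `is_partitioned_by_index(filtered)` still holds because the per-partition step only drops rows, while `is_partitioned_by_index(shifted)` does not, since the rewritten indexes map to different partitions.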



