You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/09/14 23:55:53 UTC

[GitHub] [beam] TheNeuralBit commented on a change in pull request #12812: [BEAM-10873] Stronger testing of dataframe partitioning declarations.

TheNeuralBit commented on a change in pull request #12812:
URL: https://github.com/apache/beam/pull/12812#discussion_r488296111



##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -551,9 +559,13 @@ def merge(
       return merged.reset_index(drop=True)
 
   @frame_base.args_to_kwargs(pd.DataFrame)
-  def nlargest(self, **kwargs):
-    if 'keep' in kwargs and kwargs['keep'] != 'all':
+  @frame_base.populate_defaults(pd.DataFrame)
+  def nlargest(self, keep, **kwargs):
+    if keep == 'any':
+      keep = 'first'

Review comment:
       `any` is not part of the pandas API. I assume this is an option you're adding to provide a way to get exactly `n` without promising to get the first or last since that's order-sensitive?
   
   That makes sense - but we should probably document the places where we diverge from pandas like this (could just be a TODO here and/or a jira for now).

##########
File path: sdks/python/apache_beam/dataframe/expressions.py
##########
@@ -45,6 +45,45 @@ def lookup(self, expr):  #  type: (Expression) -> Any
     return self._bindings[expr]
 
 
+class PartitioningSession(Session):
+  def evaluate(self, expr):

Review comment:
       ```suggestion
   class PartitioningSession(Session):
     """An extension of Session that ensures input dataframes are split into at least `len(df)` separate partitions and input in a random order. For testing only.
     """
     def evaluate(self, expr):
   ```
   
   Or something like that :)

##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -551,9 +559,13 @@ def merge(
       return merged.reset_index(drop=True)
 
   @frame_base.args_to_kwargs(pd.DataFrame)
-  def nlargest(self, **kwargs):
-    if 'keep' in kwargs and kwargs['keep'] != 'all':
+  @frame_base.populate_defaults(pd.DataFrame)
+  def nlargest(self, keep, **kwargs):
+    if keep == 'any':
+      keep = 'first'

Review comment:
       Also: I take it PartitioningSession revealed this problem with first/last? That's good to know, I was wondering about these options earlier but figured I was misunderstanding something.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org