Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/03/21 09:24:16 UTC

[GitHub] [spark] Yikun commented on a change in pull request #35840: [SPARK-38552][PYTHON] Implement `keep` parameter of `frame.nlargest/nsmallest` to decide how to resolve ties

Yikun commented on a change in pull request #35840:
URL: https://github.com/apache/spark/pull/35840#discussion_r830887319



##########
File path: python/pyspark/pandas/frame.py
##########
@@ -6846,7 +6866,16 @@ def _sort(
             (False, "last"): Column.desc_nulls_last,
         }
         by = [mapper[(asc, na_position)](scol) for scol, asc in zip(by, ascending)]
-        sdf = self._internal.resolved_copy.spark_frame.sort(*by, NATURAL_ORDER_COLUMN_NAME)
+
+        natural_order_scol = F.col(NATURAL_ORDER_COLUMN_NAME)
+
+        if keep == "last":
+            natural_order_scol = Column.desc(natural_order_scol)
+        elif keep != "first":

Review comment:
       `keep='all'` should raise `NotImplementedError` (pandas supports it, but this PR does not implement it yet);
   any other unsupported value should raise `ValueError`.
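   The error-type contract the reviewer describes could be sketched as a small validation helper (the name `validate_keep` and the messages are hypothetical; the PR's actual code may differ):

   ```python
   def validate_keep(keep: str) -> str:
       # pandas also accepts keep="all", but this PR does not implement it
       # yet, so per the review it should raise NotImplementedError rather
       # than being lumped in with arbitrary invalid values.
       if keep == "all":
           raise NotImplementedError("`keep`='all' is not implemented yet.")
       if keep not in ("first", "last"):
           raise ValueError(
               "`keep` must be either 'first' or 'last'; %r received." % keep
           )
       return keep
   ```

   This keeps the two failure modes distinct: `"all"` is a recognized-but-unimplemented option, while anything else is simply invalid input.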

##########
File path: python/pyspark/pandas/tests/test_dataframe.py
##########
@@ -1789,21 +1789,47 @@ def test_swapaxes(self):
 
     def test_nlargest(self):
         pdf = pd.DataFrame(
-            {"a": [1, 2, 3, 4, 5, None, 7], "b": [7, 6, 5, 4, 3, 2, 1]}, index=np.random.rand(7)
+            {"a": [1, 2, 3, 4, 5, None, 7], "b": [7, 6, 5, 4, 3, 2, 1], "c": [1, 1, 2, 2, 3, 3, 3]},
+            index=np.random.rand(7),
         )
         psdf = ps.from_pandas(pdf)
         self.assert_eq(psdf.nlargest(n=5, columns="a"), pdf.nlargest(5, columns="a"))
         self.assert_eq(psdf.nlargest(n=5, columns=["a", "b"]), pdf.nlargest(5, columns=["a", "b"]))
+        self.assert_eq(psdf.nlargest(n=5, columns=["c"]), pdf.nlargest(5, columns=["c"]))

Review comment:
       ```suggestion
           self.assert_eq(psdf.nlargest(5, columns=["c"]), pdf.nlargest(5, columns=["c"]))
   ```

   nit: it looks like there is some inconsistency in the preceding tests (some calls pass `n=5` as a keyword, others pass `5` positionally).

##########
File path: python/pyspark/pandas/frame.py
##########
@@ -7321,6 +7338,10 @@ def nlargest(self, n: int, columns: Union[Name, List[Name]]) -> "DataFrame":
             Number of rows to return.
         columns : label or list of labels
             Column label(s) to order by.
+        keep : {'first', 'last'}, default 'first'

Review comment:
       maybe also document that ``all`` is not supported yet
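   A possible numpydoc-style note along the lines the reviewer asks for (a sketch, not the PR's actual wording):

   ```text
   keep : {'first', 'last'}, default 'first'
       Determines how duplicate values are resolved:

       - ``first`` : prioritize the first occurrence(s)
       - ``last`` : prioritize the last occurrence(s)

       Note that pandas' ``'all'`` option is not supported yet and raises
       ``NotImplementedError``.
   ```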

##########
File path: python/pyspark/pandas/frame.py
##########
@@ -7438,8 +7498,42 @@ def nsmallest(self, n: int, columns: Union[Name, List[Name]]) -> "DataFrame":
         0  1.0   6
         1  2.0   7
         2  3.0   8
+
+        The examples below show how ties are resolved, which is decided by `keep`.
+
+        >>> tied_df = ps.DataFrame({'X': [1, 1, 2, 2, 3]}, index=['a', 'b', 'c', 'd', 'e'])

Review comment:
       it would be good to use a multi-column DataFrame here; otherwise we can't see any difference between first/last/default
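   To illustrate the reviewer's point, here is a sketch in plain pandas (whose `nlargest` the pyspark.pandas API mirrors) with a hypothetical two-column frame; the second column `Y` makes it visible which of the tied rows was kept:

   ```python
   import pandas as pd

   # 'X' has ties at 1 and 2; 'Y' distinguishes the tied rows.
   tied_df = pd.DataFrame(
       {"X": [1, 1, 2, 2, 3], "Y": [10, 20, 30, 40, 50]},
       index=["a", "b", "c", "d", "e"],
   )

   # n=2 forces a choice between the tied X=2 rows ('c' and 'd').
   first = tied_df.nlargest(2, columns="X", keep="first")  # keeps 'e', then 'c'
   last = tied_df.nlargest(2, columns="X", keep="last")    # keeps 'e', then 'd'
   ```

   With a single-column frame the tied rows differ only by index, so the `Y` values here are what make the `first`/`last` difference obvious at a glance.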

##########
File path: python/pyspark/pandas/frame.py
##########
@@ -7395,6 +7451,10 @@ def nsmallest(self, n: int, columns: Union[Name, List[Name]]) -> "DataFrame":
             Number of items to retrieve.
         columns : list or str
             Column name or names to order by.
+        keep : {'first', 'last'}, default 'first'

Review comment:
       ditto




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


