Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/09/05 08:34:57 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

zhengruifeng opened a new pull request, #37801:
URL: https://github.com/apache/spark/pull/37801

   ### What changes were proposed in this pull request?
   Implement `GroupBy.nth`
   
   
   ### Why are the changes needed?
   for API coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   yes, new API
   
   
   ### How was this patch tested?
   added UT




[GitHub] [spark] itholic commented on a diff in pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
itholic commented on code in PR #37801:
URL: https://github.com/apache/spark/pull/37801#discussion_r965455458


##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,95 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:
+        """
+        Take the nth row from each group.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        n : int
+            A single nth value for the row
+
+        Returns
+        -------
+        Series or DataFrame
+
+        Examples
+        --------
+        >>> df = ps.DataFrame({'A': [1, 1, 2, 1, 2],
+        ...                    'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
+        >>> g = df.groupby('A')
+        >>> g.nth(0)
+             B
+        A
+        1  NaN
+        2  3.0
+        >>> g.nth(1)
+             B
+        A
+        1  2.0
+        2  5.0
+        >>> g.nth(-1)
+             B
+        A
+        1  4.0
+        2  5.0
+
+        See Also
+        --------
+        pyspark.pandas.Series.groupby
+        pyspark.pandas.DataFrame.groupby
+        """
+        if not isinstance(n, int):
+            raise TypeError("Unsupported type %s" % type(n).__name__)

Review Comment:
   Yes, so we can:
   
   - Raise `TypeError` for types that are unsupported in pandas as well.
   - Raise `NotImplementedError` for types that are not yet implemented in pandas API on Spark but exist in pandas.





[GitHub] [spark] zhengruifeng commented on a diff in pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #37801:
URL: https://github.com/apache/spark/pull/37801#discussion_r965431276


##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,95 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:
+        """
+        Take the nth row from each group.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        n : int
+            A single nth value for the row
+
+        Returns
+        -------
+        Series or DataFrame
+
+        Examples
+        --------
+        >>> df = ps.DataFrame({'A': [1, 1, 2, 1, 2],
+        ...                    'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
+        >>> g = df.groupby('A')
+        >>> g.nth(0)
+             B
+        A
+        1  NaN
+        2  3.0
+        >>> g.nth(1)
+             B
+        A
+        1  2.0
+        2  5.0
+        >>> g.nth(-1)
+             B
+        A
+        1  4.0
+        2  5.0
+
+        See Also
+        --------
+        pyspark.pandas.Series.groupby
+        pyspark.pandas.DataFrame.groupby
+        """
+        if not isinstance(n, int):
+            raise TypeError("Unsupported type %s" % type(n).__name__)

Review Comment:
   > nit: Maybe rather raises `NotImplementedError` since we should support the other types in the future ?
   
   Let me take back my words. I think it should raise `NotImplementedError` for the types that pandas already supports but we cannot support right now.





[GitHub] [spark] zhengruifeng commented on a diff in pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #37801:
URL: https://github.com/apache/spark/pull/37801#discussion_r962859128


##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,89 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:
+        """
+        Take the nth row from each group.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        n : int
+            A single nth value for the row
+
+        Examples
+        --------
+
+        >>> df = ps.DataFrame({'A': [1, 1, 2, 1, 2],
+        ...                    'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
+        >>> g = df.groupby('A')
+        >>> g.nth(0)
+             B
+        A
+        1  NaN
+        2  3.0
+        >>> g.nth(1)
+             B
+        A
+        1  2.0
+        2  5.0
+        >>> g.nth(-1)
+             B
+        A
+        1  4.0
+        2  5.0
+
+        See Also
+        --------
+        pyspark.pandas.Series.groupby
+        pyspark.pandas.DataFrame.groupby
+        """
+        groupkey_names = [SPARK_INDEX_NAME_FORMAT(i) for i in range(len(self._groupkeys))]
+        internal, agg_columns, sdf = self._prepare_reduce(
+            groupkey_names=groupkey_names,
+            accepted_spark_types=None,
+            bool_to_numeric=False,
+        )
+        psdf: DataFrame = DataFrame(internal)
+
+        if len(psdf._internal.column_labels) > 0:
+            window1 = Window.partitionBy(*groupkey_names).orderBy(NATURAL_ORDER_COLUMN_NAME)
+            tmp_row_number_col = "__tmp_row_number_col__"
+            if n >= 0:
+                sdf = (
+                    psdf._internal.spark_frame.withColumn(
+                        tmp_row_number_col, F.row_number().over(window1)
+                    )
+                    .where(F.col(tmp_row_number_col) == n + 1)
+                    .drop(tmp_row_number_col)
+                )
+            else:
+                window2 = Window.partitionBy(*groupkey_names).rowsBetween(
+                    Window.unboundedPreceding, Window.unboundedFollowing
+                )
+                tmp_group_size_col = "__tmp_group_size_col__"
+                sdf = (
+                    psdf._internal.spark_frame.withColumn(
+                        tmp_group_size_col, F.count(F.lit(0)).over(window2)
+                    )
+                    .withColumn(tmp_row_number_col, F.row_number().over(window1))
+                    .where(F.col(tmp_row_number_col) == F.col(tmp_group_size_col) + 1 + n)
+                    .drop(tmp_group_size_col, tmp_row_number_col)
+                )
+        else:
+            sdf = sdf.select(*groupkey_names).distinct()

Review Comment:
   There seems to be a bug in pandas' `GroupBy.nth`: its returned index varies with `n`:
   ```
   In [23]: pdf
   Out[23]: 
      A    B  C      D
   0  1  3.1  a   True
   1  2  4.1  b  False
   2  1  4.1  b  False
   3  2  3.1  a   True
   
   In [24]: pdf.groupby(["A", "B", "C", "D"]).nth(0)
   Out[24]: 
   Empty DataFrame
   Columns: []
   Index: [(1, 3.1, a, True), (1, 4.1, b, False), (2, 3.1, a, True), (2, 4.1, b, False)]
   
   In [25]: pdf.groupby(["A", "B", "C", "D"]).nth(0).index
   Out[25]: 
   MultiIndex([(1, 3.1, 'a',  True),
               (1, 4.1, 'b', False),
               (2, 3.1, 'a',  True),
               (2, 4.1, 'b', False)],
              names=['A', 'B', 'C', 'D'])
   
   In [26]: pdf.groupby(["A", "B", "C", "D"]).nth(1)
   Out[26]: 
   Empty DataFrame
   Columns: []
   Index: []
   
   In [27]: pdf.groupby(["A", "B", "C", "D"]).nth(1).index
   Out[27]: MultiIndex([], names=['A', 'B', 'C', 'D'])
   
   In [28]: pdf.groupby(["A", "B", "C", "D"]).nth(-1)
   Out[28]: 
   Empty DataFrame
   Columns: []
   Index: [(1, 3.1, a, True), (1, 4.1, b, False), (2, 3.1, a, True), (2, 4.1, b, False)]
   
   In [29]: pdf.groupby(["A", "B", "C", "D"]).nth(-1).index
   Out[29]: 
   MultiIndex([(1, 3.1, 'a',  True),
               (1, 4.1, 'b', False),
               (2, 3.1, 'a',  True),
               (2, 4.1, 'b', False)],
              names=['A', 'B', 'C', 'D'])
   
   In [30]: pdf.groupby(["A", "B", "C", "D"]).nth(-2)
   Out[30]: 
   Empty DataFrame
   Columns: []
   Index: []
   
   In [31]: pdf.groupby(["A", "B", "C", "D"]).nth(-2).index
   Out[31]: MultiIndex([], names=['A', 'B', 'C', 'D'])
   ```
   
   while other functions' behavior in pandas and PS is like this:
   ```
   In [17]: pdf
   Out[17]: 
      A    B  C      D
   0  1  3.1  a   True
   1  2  4.1  b  False
   2  1  4.1  b  False
   3  2  3.1  a   True
   
   In [18]: pdf.groupby(["A", "B", "C", "D"]).max()
   Out[18]: 
   Empty DataFrame
   Columns: []
   Index: [(1, 3.1, a, True), (1, 4.1, b, False), (2, 3.1, a, True), (2, 4.1, b, False)]
   
   In [19]: pdf.groupby(["A", "B", "C", "D"]).mad()
   Out[19]: 
   Empty DataFrame
   Columns: []
   Index: [(1, 3.1, a, True), (1, 4.1, b, False), (2, 3.1, a, True), (2, 4.1, b, False)]
   
   In [20]: psdf.groupby(["A", "B", "C", "D"]).max()
   Out[20]: 
   Empty DataFrame
   Columns: []
   Index: [(1, 3.1, a, True), (2, 4.1, b, False), (1, 4.1, b, False), (2, 3.1, a, True)]
   
   In [21]: psdf.groupby(["A", "B", "C", "D"]).mad()
   Out[21]: 
   Empty DataFrame
   Columns: []
   Index: [(1, 3.1, a, True), (2, 4.1, b, False), (1, 4.1, b, False), (2, 3.1, a, True)]
   
   In [22]: 
   
   In [22]: psdf.groupby(["A", "B", "C", "D"]).nth(0)
   Out[22]: 
   Empty DataFrame
   Columns: []
   Index: [(1, 3.1, a, True), (2, 4.1, b, False), (1, 4.1, b, False), (2, 3.1, a, True)]
   ```





[GitHub] [spark] Yikun commented on a diff in pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
Yikun commented on code in PR #37801:
URL: https://github.com/apache/spark/pull/37801#discussion_r964975991


##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,102 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:
+        """
+        Take the nth row from each group.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        n : int
+            A single nth value for the row
+
+        Returns
+        -------
+        Series or DataFrame
+
+        Notes
+        -----
+        There is a behavior difference between pandas-on-Spark and pandas:
+
+        * when there is no aggregation column and `n` is not equal to 0 or -1,
+        the returned empty dataframe may have an index with a different length `__len__`.

Review Comment:
   ```suggestion
   ```



##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,102 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:
+        """
+        Take the nth row from each group.
+

Review Comment:
   ```
           .. note:: There is a behavior difference between pandas-on-Spark and pandas:
            when there is no aggregation column and `n` is not equal to 0 or -1,
            the returned empty dataframe may have an index with a different length `__len__`.
   ```





[GitHub] [spark] itholic commented on a diff in pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
itholic commented on code in PR #37801:
URL: https://github.com/apache/spark/pull/37801#discussion_r963416033


##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,95 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:
+        """
+        Take the nth row from each group.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        n : int
+            A single nth value for the row
+
+        Returns
+        -------
+        Series or DataFrame
+
+        Examples
+        --------
+        >>> df = ps.DataFrame({'A': [1, 1, 2, 1, 2],
+        ...                    'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
+        >>> g = df.groupby('A')
+        >>> g.nth(0)
+             B
+        A
+        1  NaN
+        2  3.0
+        >>> g.nth(1)
+             B
+        A
+        1  2.0
+        2  5.0
+        >>> g.nth(-1)
+             B
+        A
+        1  4.0
+        2  5.0
+
+        See Also
+        --------
+        pyspark.pandas.Series.groupby
+        pyspark.pandas.DataFrame.groupby
+        """
+        if not isinstance(n, int):
+            raise TypeError("Unsupported type %s" % type(n).__name__)

Review Comment:
   Got it.
   
   Btw, seems like the latest pandas (1.4.3) raises `TypeError` as below:
   
   ```python
   >>> g.nth("C")
   Traceback (most recent call last):
   ...
   TypeError: Invalid index <class 'str'>. Must be integer, list-like, slice or a tuple of integers and slices
   ```
   
   Can we follow the `TypeError` and its message from pandas, to give users more information?
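   
   As a rough sketch, following that message could look something like this (the wording is copied from the pandas traceback quoted above; the exact check is illustrative only, not the PR's final code):
   
   ```python
   # Hypothetical validation mirroring the pandas 1.4.x error message quoted above.
   if not isinstance(n, int):
       raise TypeError(
           f"Invalid index {type(n)}. Must be integer, list-like, slice "
           "or a tuple of integers and slices"
       )
   ```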



##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,95 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:
+        """
+        Take the nth row from each group.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        n : int
+            A single nth value for the row
+
+        Returns
+        -------
+        Series or DataFrame
+
+        Examples
+        --------
+        >>> df = ps.DataFrame({'A': [1, 1, 2, 1, 2],
+        ...                    'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
+        >>> g = df.groupby('A')
+        >>> g.nth(0)
+             B
+        A
+        1  NaN
+        2  3.0
+        >>> g.nth(1)
+             B
+        A
+        1  2.0
+        2  5.0
+        >>> g.nth(-1)
+             B
+        A
+        1  4.0
+        2  5.0
+
+        See Also
+        --------
+        pyspark.pandas.Series.groupby
+        pyspark.pandas.DataFrame.groupby
+        """
+        if not isinstance(n, int):
+            raise TypeError("Unsupported type %s" % type(n).__name__)

Review Comment:
   Got it.
   
   Btw, seems like the latest pandas (1.4.4) raises `TypeError` as below:
   
   ```python
   >>> g.nth("C")
   Traceback (most recent call last):
   ...
   TypeError: Invalid index <class 'str'>. Must be integer, list-like, slice or a tuple of integers and slices
   ```
   
   Can we follow the `TypeError` and its message from pandas, to give users more information?





[GitHub] [spark] zhengruifeng commented on a diff in pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #37801:
URL: https://github.com/apache/spark/pull/37801#discussion_r964331381


##########
python/pyspark/pandas/tests/test_groupby.py:
##########
@@ -1380,8 +1380,18 @@ def test_nth(self):
         for n in [0, 1, 2, 128, -1, -2, -128]:
             self._test_stat_func(lambda groupby_obj: groupby_obj.nth(n))
 
+        pdf, psdf = self.pdf, self.psdf
         with self.assertRaisesRegex(TypeError, "Unsupported type"):
-            self.psdf.groupby("B").nth("x")
+            psdf.groupby("B").nth("x")
+
+        # in case there is no aggregation columns, the results should be the same
+        if LooseVersion("1.4.0") <= LooseVersion(pd.__version__) < LooseVersion("1.5.0"):

Review Comment:
   With `< LooseVersion("1.5.0")`, I think only `1.4.x` will test this case.





[GitHub] [spark] zhengruifeng commented on pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #37801:
URL: https://github.com/apache/spark/pull/37801#issuecomment-1240370911

   Merged to master, thanks @itholic @Yikun for the reviews!




[GitHub] [spark] zhengruifeng commented on a diff in pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #37801:
URL: https://github.com/apache/spark/pull/37801#discussion_r963359134


##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,95 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:
+        """
+        Take the nth row from each group.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        n : int
+            A single nth value for the row
+
+        Returns
+        -------
+        Series or DataFrame
+
+        Examples
+        --------
+        >>> df = ps.DataFrame({'A': [1, 1, 2, 1, 2],
+        ...                    'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
+        >>> g = df.groupby('A')
+        >>> g.nth(0)
+             B
+        A
+        1  NaN
+        2  3.0
+        >>> g.nth(1)
+             B
+        A
+        1  2.0
+        2  5.0
+        >>> g.nth(-1)
+             B
+        A
+        1  4.0
+        2  5.0
+
+        See Also
+        --------
+        pyspark.pandas.Series.groupby
+        pyspark.pandas.DataFrame.groupby
+        """
+        if not isinstance(n, int):
+            raise TypeError("Unsupported type %s" % type(n).__name__)

Review Comment:
   Pandas raises a `TypeError` for an invalid `n`; see https://github.com/apache/spark/pull/37801#discussion_r962695251





[GitHub] [spark] zhengruifeng closed pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
zhengruifeng closed pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`
URL: https://github.com/apache/spark/pull/37801




[GitHub] [spark] itholic commented on a diff in pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
itholic commented on code in PR #37801:
URL: https://github.com/apache/spark/pull/37801#discussion_r964511083


##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,89 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:
+        """
+        Take the nth row from each group.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        n : int
+            A single nth value for the row
+
+        Examples
+        --------
+
+        >>> df = ps.DataFrame({'A': [1, 1, 2, 1, 2],
+        ...                    'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
+        >>> g = df.groupby('A')
+        >>> g.nth(0)
+             B
+        A
+        1  NaN
+        2  3.0
+        >>> g.nth(1)
+             B
+        A
+        1  2.0
+        2  5.0
+        >>> g.nth(-1)
+             B
+        A
+        1  4.0
+        2  5.0
+
+        See Also
+        --------
+        pyspark.pandas.Series.groupby
+        pyspark.pandas.DataFrame.groupby
+        """
+        groupkey_names = [SPARK_INDEX_NAME_FORMAT(i) for i in range(len(self._groupkeys))]
+        internal, agg_columns, sdf = self._prepare_reduce(
+            groupkey_names=groupkey_names,
+            accepted_spark_types=None,
+            bool_to_numeric=False,
+        )
+        psdf: DataFrame = DataFrame(internal)
+
+        if len(psdf._internal.column_labels) > 0:
+            window1 = Window.partitionBy(*groupkey_names).orderBy(NATURAL_ORDER_COLUMN_NAME)
+            tmp_row_number_col = "__tmp_row_number_col__"
+            if n >= 0:
+                sdf = (
+                    psdf._internal.spark_frame.withColumn(
+                        tmp_row_number_col, F.row_number().over(window1)
+                    )
+                    .where(F.col(tmp_row_number_col) == n + 1)
+                    .drop(tmp_row_number_col)
+                )
+            else:
+                window2 = Window.partitionBy(*groupkey_names).rowsBetween(
+                    Window.unboundedPreceding, Window.unboundedFollowing
+                )
+                tmp_group_size_col = "__tmp_group_size_col__"
+                sdf = (
+                    psdf._internal.spark_frame.withColumn(
+                        tmp_group_size_col, F.count(F.lit(0)).over(window2)
+                    )
+                    .withColumn(tmp_row_number_col, F.row_number().over(window1))
+                    .where(F.col(tmp_row_number_col) == F.col(tmp_group_size_col) + 1 + n)
+                    .drop(tmp_group_size_col, tmp_row_number_col)
+                )
+        else:
+            sdf = sdf.select(*groupkey_names).distinct()

Review Comment:
   Or we can post a question to the pandas community asking whether it's a bug or intended behavior, and add a comment with the link to that question if they reply "yes, it's a bug".





[GitHub] [spark] zhengruifeng commented on pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #37801:
URL: https://github.com/apache/spark/pull/37801#issuecomment-1239114822

   I just reported the index-related issue here: https://github.com/pandas-dev/pandas/issues/48434
   
   As to the issue here, I personally think it's not a big deal. What about just mentioning it in the docstring: `when there are no aggregation columns, the returned empty dataframe may have a different index than pandas.`
   
   @itholic @Yikun 




[GitHub] [spark] itholic commented on a diff in pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
itholic commented on code in PR #37801:
URL: https://github.com/apache/spark/pull/37801#discussion_r963156365


##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,95 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter

Review Comment:
   Maybe we want to create a ticket as a sub-task of SPARK-40327?



##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,89 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:
+        """
+        Take the nth row from each group.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        n : int
+            A single nth value for the row
+
+        Examples
+        --------
+
+        >>> df = ps.DataFrame({'A': [1, 1, 2, 1, 2],
+        ...                    'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
+        >>> g = df.groupby('A')
+        >>> g.nth(0)
+             B
+        A
+        1  NaN
+        2  3.0
+        >>> g.nth(1)
+             B
+        A
+        1  2.0
+        2  5.0
+        >>> g.nth(-1)
+             B
+        A
+        1  4.0
+        2  5.0
+
+        See Also
+        --------
+        pyspark.pandas.Series.groupby
+        pyspark.pandas.DataFrame.groupby
+        """
+        groupkey_names = [SPARK_INDEX_NAME_FORMAT(i) for i in range(len(self._groupkeys))]
+        internal, agg_columns, sdf = self._prepare_reduce(
+            groupkey_names=groupkey_names,
+            accepted_spark_types=None,
+            bool_to_numeric=False,
+        )
+        psdf: DataFrame = DataFrame(internal)
+
+        if len(psdf._internal.column_labels) > 0:
+            window1 = Window.partitionBy(*groupkey_names).orderBy(NATURAL_ORDER_COLUMN_NAME)
+            tmp_row_number_col = "__tmp_row_number_col__"
+            if n >= 0:
+                sdf = (
+                    psdf._internal.spark_frame.withColumn(
+                        tmp_row_number_col, F.row_number().over(window1)
+                    )
+                    .where(F.col(tmp_row_number_col) == n + 1)
+                    .drop(tmp_row_number_col)
+                )
+            else:
+                window2 = Window.partitionBy(*groupkey_names).rowsBetween(
+                    Window.unboundedPreceding, Window.unboundedFollowing
+                )
+                tmp_group_size_col = "__tmp_group_size_col__"
+                sdf = (
+                    psdf._internal.spark_frame.withColumn(
+                        tmp_group_size_col, F.count(F.lit(0)).over(window2)
+                    )
+                    .withColumn(tmp_row_number_col, F.row_number().over(window1))
+                    .where(F.col(tmp_row_number_col) == F.col(tmp_group_size_col) + 1 + n)
+                    .drop(tmp_group_size_col, tmp_row_number_col)
+                )
+        else:
+            sdf = sdf.select(*groupkey_names).distinct()

Review Comment:
   If there is a bug in pandas, maybe we should add a test by manually creating the expected result, rather than just skipping the test?
   
   e.g.
   https://github.com/apache/spark/blob/6d2ce128058b439094cd1dd54253372af6977e79/python/pyspark/pandas/tests/test_series.py#L1654-L1658
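   
   For illustration, such a test could look roughly like this, reusing the docstring example above (`psdf` is assumed to be the pandas-on-Spark frame built from it, and `assert_eq` the helper the test suite already uses):
   
   ```python
   # Hypothetical test asserting against a hand-built expected frame instead of
   # comparing with pandas, so a pandas-side bug does not leak into the assertion.
   import pandas as pd
   
   expected = pd.DataFrame({"B": [2.0, 5.0]}, index=pd.Index([1, 2], name="A"))
   self.assert_eq(psdf.groupby("A").nth(1).sort_index(), expected)
   ```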



##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,95 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:
+        """
+        Take the nth row from each group.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        n : int
+            A single nth value for the row
+
+        Returns
+        -------
+        Series or DataFrame
+
+        Examples
+        --------
+        >>> df = ps.DataFrame({'A': [1, 1, 2, 1, 2],
+        ...                    'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
+        >>> g = df.groupby('A')
+        >>> g.nth(0)
+             B
+        A
+        1  NaN
+        2  3.0
+        >>> g.nth(1)
+             B
+        A
+        1  2.0
+        2  5.0
+        >>> g.nth(-1)
+             B
+        A
+        1  4.0
+        2  5.0
+
+        See Also
+        --------
+        pyspark.pandas.Series.groupby
+        pyspark.pandas.DataFrame.groupby
+        """
+        if not isinstance(n, int):
+            raise TypeError("Unsupported type %s" % type(n).__name__)

Review Comment:
   nit: Maybe rather raise `NotImplementedError`, since we should support the other types in the future?






[GitHub] [spark] Yikun commented on a diff in pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
Yikun commented on code in PR #37801:
URL: https://github.com/apache/spark/pull/37801#discussion_r963659921


##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,95 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:
+        """
+        Take the nth row from each group.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        n : int
+            A single nth value for the row
+
+        Returns
+        -------
+        Series or DataFrame
+
+        Examples
+        --------
+        >>> df = ps.DataFrame({'A': [1, 1, 2, 1, 2],
+        ...                    'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
+        >>> g = df.groupby('A')
+        >>> g.nth(0)
+             B
+        A
+        1  NaN
+        2  3.0
+        >>> g.nth(1)
+             B
+        A
+        1  2.0
+        2  5.0
+        >>> g.nth(-1)
+             B
+        A
+        1  4.0
+        2  5.0
+
+        See Also
+        --------
+        pyspark.pandas.Series.groupby
+        pyspark.pandas.DataFrame.groupby
+        """
+        if not isinstance(n, int):
+            raise TypeError("Unsupported type %s" % type(n).__name__)

Review Comment:
   FYI, upgrade pandas to 1.4.4 https://github.com/apache/spark/pull/37810





[GitHub] [spark] itholic commented on pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
itholic commented on PR #37801:
URL: https://github.com/apache/spark/pull/37801#issuecomment-1239153912

   @zhengruifeng I'm fine with it, too.
   
   It would be great to have a `Notes` section and describe it there.





[GitHub] [spark] itholic commented on a diff in pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
itholic commented on code in PR #37801:
URL: https://github.com/apache/spark/pull/37801#discussion_r964315856


##########
python/pyspark/pandas/tests/test_groupby.py:
##########
@@ -1380,8 +1380,18 @@ def test_nth(self):
         for n in [0, 1, 2, 128, -1, -2, -128]:
             self._test_stat_func(lambda groupby_obj: groupby_obj.nth(n))
 
+        pdf, psdf = self.pdf, self.psdf
         with self.assertRaisesRegex(TypeError, "Unsupported type"):
-            self.psdf.groupby("B").nth("x")
+            psdf.groupby("B").nth("x")
+
+        # in case there is no aggregation columns, the results should be the same
+        if LooseVersion("1.4.0") <= LooseVersion(pd.__version__) < LooseVersion("1.5.0"):

Review Comment:
   qq: So, do we need the upper bound (1.5.0) here, since we're only going to support pandas 1.4.x for Apache Spark 3.4.0?





[GitHub] [spark] zhengruifeng commented on pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #37801:
URL: https://github.com/apache/spark/pull/37801#issuecomment-1236712431

   cc @itholic @HyukjinKwon @xinrong-meng 




[GitHub] [spark] itholic commented on a diff in pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
itholic commented on code in PR #37801:
URL: https://github.com/apache/spark/pull/37801#discussion_r964508929


##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,89 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:
+        """
+        Take the nth row from each group.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        n : int
+            A single nth value for the row
+
+        Examples
+        --------
+
+        >>> df = ps.DataFrame({'A': [1, 1, 2, 1, 2],
+        ...                    'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
+        >>> g = df.groupby('A')
+        >>> g.nth(0)
+             B
+        A
+        1  NaN
+        2  3.0
+        >>> g.nth(1)
+             B
+        A
+        1  2.0
+        2  5.0
+        >>> g.nth(-1)
+             B
+        A
+        1  4.0
+        2  5.0
+
+        See Also
+        --------
+        pyspark.pandas.Series.groupby
+        pyspark.pandas.DataFrame.groupby
+        """
+        groupkey_names = [SPARK_INDEX_NAME_FORMAT(i) for i in range(len(self._groupkeys))]
+        internal, agg_columns, sdf = self._prepare_reduce(
+            groupkey_names=groupkey_names,
+            accepted_spark_types=None,
+            bool_to_numeric=False,
+        )
+        psdf: DataFrame = DataFrame(internal)
+
+        if len(psdf._internal.column_labels) > 0:
+            window1 = Window.partitionBy(*groupkey_names).orderBy(NATURAL_ORDER_COLUMN_NAME)
+            tmp_row_number_col = "__tmp_row_number_col__"
+            if n >= 0:
+                sdf = (
+                    psdf._internal.spark_frame.withColumn(
+                        tmp_row_number_col, F.row_number().over(window1)
+                    )
+                    .where(F.col(tmp_row_number_col) == n + 1)
+                    .drop(tmp_row_number_col)
+                )
+            else:
+                window2 = Window.partitionBy(*groupkey_names).rowsBetween(
+                    Window.unboundedPreceding, Window.unboundedFollowing
+                )
+                tmp_group_size_col = "__tmp_group_size_col__"
+                sdf = (
+                    psdf._internal.spark_frame.withColumn(
+                        tmp_group_size_col, F.count(F.lit(0)).over(window2)
+                    )
+                    .withColumn(tmp_row_number_col, F.row_number().over(window1))
+                    .where(F.col(tmp_row_number_col) == F.col(tmp_group_size_col) + 1 + n)
+                    .drop(tmp_group_size_col, tmp_row_number_col)
+                )
+        else:
+            sdf = sdf.select(*groupkey_names).distinct()

Review Comment:
   Oh... I just noticed that we're following the pandas behavior even though there is a bug in pandas.
   
   When there is a bug in pandas, we usually do something like this:
   
   - we don't follow the behavior of pandas; we just assume it works properly and implement it.
   - add a comment to the test linking the related [pandas issue](https://github.com/pandas-dev/pandas/issues) from the pandas repository (`https://github.com/pandas-dev/pandas/issues/...`), as below:
     - https://github.com/apache/spark/blob/0830575100c1bcbc89d98a6859fe3b8d46ca2e6e/python/pyspark/pandas/tests/data_type_ops/test_num_ops.py#L510-L517
     - https://github.com/apache/spark/blob/0830575100c1bcbc89d98a6859fe3b8d46ca2e6e/python/pyspark/pandas/tests/data_type_ops/test_num_ops.py#L478-L483
   
   - If it's not clear that it's a bug (i.e., it has not been officially discussed as a bug in the pandas community), we can just follow the pandas behavior.





[GitHub] [spark] zhengruifeng commented on a diff in pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #37801:
URL: https://github.com/apache/spark/pull/37801#discussion_r963358263


##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,95 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter

Review Comment:
   Let's add it when we start to implement the parameters.





[GitHub] [spark] Yikun commented on pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
Yikun commented on PR #37801:
URL: https://github.com/apache/spark/pull/37801#issuecomment-1239150477

   @zhengruifeng Yep, I'm fine with it!




[GitHub] [spark] Yikun commented on a diff in pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
Yikun commented on code in PR #37801:
URL: https://github.com/apache/spark/pull/37801#discussion_r962662977


##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,89 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:
+        """
+        Take the nth row from each group.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        n : int
+            A single nth value for the row
+
+        Examples
+        --------
+
+        >>> df = ps.DataFrame({'A': [1, 1, 2, 1, 2],
+        ...                    'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
+        >>> g = df.groupby('A')
+        >>> g.nth(0)
+             B
+        A
+        1  NaN
+        2  3.0
+        >>> g.nth(1)
+             B
+        A
+        1  2.0
+        2  5.0
+        >>> g.nth(-1)
+             B
+        A
+        1  4.0
+        2  5.0
+
+        See Also
+        --------
+        pyspark.pandas.Series.groupby
+        pyspark.pandas.DataFrame.groupby
+        """
+        groupkey_names = [SPARK_INDEX_NAME_FORMAT(i) for i in range(len(self._groupkeys))]
+        internal, agg_columns, sdf = self._prepare_reduce(
+            groupkey_names=groupkey_names,
+            accepted_spark_types=None,
+            bool_to_numeric=False,
+        )
+        psdf: DataFrame = DataFrame(internal)
+
+        if len(psdf._internal.column_labels) > 0:
+            window1 = Window.partitionBy(*groupkey_names).orderBy(NATURAL_ORDER_COLUMN_NAME)
+            tmp_row_number_col = "__tmp_row_number_col__"

Review Comment:
   verify_temp_column_name
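   
   A minimal sketch of that suggestion, assuming `verify_temp_column_name` from `pyspark.pandas.utils` keeps its current behavior of checking that the chosen name does not collide with an existing column:
   
   ```python
   # Hypothetical use of the existing helper to validate the temporary column name
   # against the frame before adding it.
   from pyspark.pandas.utils import verify_temp_column_name
   
   tmp_row_number_col = verify_temp_column_name(
       psdf._internal.spark_frame, "__tmp_row_number_col__"
   )
   ```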



##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,89 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:
+        """
+        Take the nth row from each group.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        n : int
+            A single nth value for the row
+
+        Examples
+        --------
+
+        >>> df = ps.DataFrame({'A': [1, 1, 2, 1, 2],
+        ...                    'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
+        >>> g = df.groupby('A')
+        >>> g.nth(0)
+             B
+        A
+        1  NaN
+        2  3.0
+        >>> g.nth(1)
+             B
+        A
+        1  2.0
+        2  5.0
+        >>> g.nth(-1)
+             B
+        A
+        1  4.0
+        2  5.0
+
+        See Also
+        --------
+        pyspark.pandas.Series.groupby
+        pyspark.pandas.DataFrame.groupby
+        """
+        groupkey_names = [SPARK_INDEX_NAME_FORMAT(i) for i in range(len(self._groupkeys))]
+        internal, agg_columns, sdf = self._prepare_reduce(
+            groupkey_names=groupkey_names,
+            accepted_spark_types=None,
+            bool_to_numeric=False,
+        )
+        psdf: DataFrame = DataFrame(internal)
+
+        if len(psdf._internal.column_labels) > 0:
+            window1 = Window.partitionBy(*groupkey_names).orderBy(NATURAL_ORDER_COLUMN_NAME)
+            tmp_row_number_col = "__tmp_row_number_col__"
+            if n >= 0:
+                sdf = (
+                    psdf._internal.spark_frame.withColumn(
+                        tmp_row_number_col, F.row_number().over(window1)
+                    )
+                    .where(F.col(tmp_row_number_col) == n + 1)
+                    .drop(tmp_row_number_col)
+                )
+            else:
+                window2 = Window.partitionBy(*groupkey_names).rowsBetween(
+                    Window.unboundedPreceding, Window.unboundedFollowing
+                )
+                tmp_group_size_col = "__tmp_group_size_col__"

Review Comment:
   verify_temp_column_name



##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,89 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:
+        """
+        Take the nth row from each group.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        n : int
+            A single nth value for the row
+
+        Examples
+        --------
+
+        >>> df = ps.DataFrame({'A': [1, 1, 2, 1, 2],
+        ...                    'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
+        >>> g = df.groupby('A')
+        >>> g.nth(0)
+             B
+        A
+        1  NaN
+        2  3.0
+        >>> g.nth(1)
+             B
+        A
+        1  2.0
+        2  5.0
+        >>> g.nth(-1)
+             B
+        A
+        1  4.0
+        2  5.0
+
+        See Also
+        --------
+        pyspark.pandas.Series.groupby
+        pyspark.pandas.DataFrame.groupby
+        """
+        groupkey_names = [SPARK_INDEX_NAME_FORMAT(i) for i in range(len(self._groupkeys))]
+        internal, agg_columns, sdf = self._prepare_reduce(
+            groupkey_names=groupkey_names,
+            accepted_spark_types=None,
+            bool_to_numeric=False,
+        )
+        psdf: DataFrame = DataFrame(internal)
+
+        if len(psdf._internal.column_labels) > 0:
+            window1 = Window.partitionBy(*groupkey_names).orderBy(NATURAL_ORDER_COLUMN_NAME)
+            tmp_row_number_col = "__tmp_row_number_col__"
+            if n >= 0:

Review Comment:
   validate n with a friendly exception?
   
   ```python
   >>> g.nth('C')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/Users/yikun/venv/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 2304, in nth
       raise TypeError("n needs to be an int or a list/set/tuple of ints")
   TypeError: n needs to be an int or a list/set/tuple of ints
   ```
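
One way to surface a similarly friendly error at the top of `nth` is sketched below; the exact message is an assumption, not the PR's final wording:

```python
# Hypothetical validation; list/set/tuple and slice support is left to the TODO above.
if not isinstance(n, int):
    raise TypeError("n needs to be an int or a list/set/tuple of ints")
```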



##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,89 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:

Review Comment:
   https://github.com/apache/spark/blob/5a03f70aa1c6f709bee6c7d91e4e74bf38498b5a/python/pyspark/sql/functions.py#L3622
   
   Since 3.1, there is a `def nth_value` in Spark, but considering negative indexes and that we are going to support list and slice in the future, I think using `row_number` is the right choice here; just FYI in case you have another idea.



##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,89 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:
+        """
+        Take the nth row from each group.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        n : int
+            A single nth value for the row
+

Review Comment:
   ```suggestion
           Returns
           -------
   ```
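
If the suggestion is taken, the completed docstring section would presumably read something like this (the return type wording is an assumption, not the PR's final text):

```
        Returns
        -------
        Series or DataFrame
```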



##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,89 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:
+        """
+        Take the nth row from each group.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        n : int
+            A single nth value for the row
+
+        Examples
+        --------
+
+        >>> df = ps.DataFrame({'A': [1, 1, 2, 1, 2],
+        ...                    'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
+        >>> g = df.groupby('A')
+        >>> g.nth(0)
+             B
+        A
+        1  NaN
+        2  3.0
+        >>> g.nth(1)
+             B
+        A
+        1  2.0
+        2  5.0
+        >>> g.nth(-1)
+             B
+        A
+        1  4.0
+        2  5.0
+
+        See Also
+        --------
+        pyspark.pandas.Series.groupby
+        pyspark.pandas.DataFrame.groupby
+        """
+        groupkey_names = [SPARK_INDEX_NAME_FORMAT(i) for i in range(len(self._groupkeys))]
+        internal, agg_columns, sdf = self._prepare_reduce(
+            groupkey_names=groupkey_names,
+            accepted_spark_types=None,
+            bool_to_numeric=False,
+        )
+        psdf: DataFrame = DataFrame(internal)
+
+        if len(psdf._internal.column_labels) > 0:
+            window1 = Window.partitionBy(*groupkey_names).orderBy(NATURAL_ORDER_COLUMN_NAME)
+            tmp_row_number_col = "__tmp_row_number_col__"
+            if n >= 0:
+                sdf = (
+                    psdf._internal.spark_frame.withColumn(
+                        tmp_row_number_col, F.row_number().over(window1)
+                    )
+                    .where(F.col(tmp_row_number_col) == n + 1)
+                    .drop(tmp_row_number_col)
+                )
+            else:
+                window2 = Window.partitionBy(*groupkey_names).rowsBetween(
+                    Window.unboundedPreceding, Window.unboundedFollowing
+                )
+                tmp_group_size_col = "__tmp_group_size_col__"
+                sdf = (
+                    psdf._internal.spark_frame.withColumn(
+                        tmp_group_size_col, F.count(F.lit(0)).over(window2)
+                    )
+                    .withColumn(tmp_row_number_col, F.row_number().over(window1))
+                    .where(F.col(tmp_row_number_col) == F.col(tmp_group_size_col) + 1 + n)
+                    .drop(tmp_group_size_col, tmp_row_number_col)
+                )
+        else:
+            sdf = sdf.select(*groupkey_names).distinct()

Review Comment:
   Could we add a test to cover this branch? I'm a little fuzzy about it.
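
For context, one assumed way the `else` branch (no remaining aggregation columns) could be reached is sketched below; this is not a test from the PR, and the expected result was still under discussion:

```python
import pyspark.pandas as ps

# Grouping by the frame's only column leaves no aggregation columns, so only the
# distinct group keys would remain (assumed scenario, not taken from the PR's tests).
psdf = ps.DataFrame({"A": [1, 1, 2]})
psdf.groupby("A").nth(0)
```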



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #37801:
URL: https://github.com/apache/spark/pull/37801#discussion_r964768456


##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,89 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:

Review Comment:
   I guess we cannot apply `nth_value` for this purpose: it returns the `n-th` row within the partition for each input row, so we cannot use it to filter out the unnecessary rows.
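
A rough, standalone illustration of that point with made-up data (not from the PR): `nth_value` yields a value per input row rather than selecting rows.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, 2.0), (1, 4.0), (2, 3.0), (2, 5.0)], ["A", "B"])

w = Window.partitionBy("A").orderBy("B")

# nth_value produces a value for every row of the group (null until the 2nd row of the
# group is reached under the default running frame); it never drops rows, so a
# row_number()-based filter is still needed to keep exactly one row per group.
sdf.withColumn("second_b", F.nth_value("B", 2).over(w)).show()
```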



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #37801:
URL: https://github.com/apache/spark/pull/37801#discussion_r962864581


##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,89 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:
+        """
+        Take the nth row from each group.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        n : int
+            A single nth value for the row
+
+        Examples
+        --------
+
+        >>> df = ps.DataFrame({'A': [1, 1, 2, 1, 2],
+        ...                    'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
+        >>> g = df.groupby('A')
+        >>> g.nth(0)
+             B
+        A
+        1  NaN
+        2  3.0
+        >>> g.nth(1)
+             B
+        A
+        1  2.0
+        2  5.0
+        >>> g.nth(-1)
+             B
+        A
+        1  4.0
+        2  5.0
+
+        See Also
+        --------
+        pyspark.pandas.Series.groupby
+        pyspark.pandas.DataFrame.groupby
+        """
+        groupkey_names = [SPARK_INDEX_NAME_FORMAT(i) for i in range(len(self._groupkeys))]
+        internal, agg_columns, sdf = self._prepare_reduce(
+            groupkey_names=groupkey_names,
+            accepted_spark_types=None,
+            bool_to_numeric=False,
+        )
+        psdf: DataFrame = DataFrame(internal)
+
+        if len(psdf._internal.column_labels) > 0:
+            window1 = Window.partitionBy(*groupkey_names).orderBy(NATURAL_ORDER_COLUMN_NAME)
+            tmp_row_number_col = "__tmp_row_number_col__"
+            if n >= 0:
+                sdf = (
+                    psdf._internal.spark_frame.withColumn(
+                        tmp_row_number_col, F.row_number().over(window1)
+                    )
+                    .where(F.col(tmp_row_number_col) == n + 1)
+                    .drop(tmp_row_number_col)
+                )
+            else:
+                window2 = Window.partitionBy(*groupkey_names).rowsBetween(
+                    Window.unboundedPreceding, Window.unboundedFollowing
+                )
+                tmp_group_size_col = "__tmp_group_size_col__"
+                sdf = (
+                    psdf._internal.spark_frame.withColumn(
+                        tmp_group_size_col, F.count(F.lit(0)).over(window2)
+                    )
+                    .withColumn(tmp_row_number_col, F.row_number().over(window1))
+                    .where(F.col(tmp_row_number_col) == F.col(tmp_group_size_col) + 1 + n)
+                    .drop(tmp_group_size_col, tmp_row_number_col)
+                )
+        else:
+            sdf = sdf.select(*groupkey_names).distinct()

Review Comment:
   so I think we cannot add a test for it for now



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a diff in pull request #37801: [SPARK-40333][PS] Implement `GroupBy.nth`

Posted by GitBox <gi...@apache.org>.
itholic commented on code in PR #37801:
URL: https://github.com/apache/spark/pull/37801#discussion_r964508929


##########
python/pyspark/pandas/groupby.py:
##########
@@ -895,6 +895,89 @@ def sem(col: Column) -> Column:
             bool_to_numeric=True,
         )
 
+    # TODO: 1, 'n' accepts list and slice; 2, implement 'dropna' parameter
+    def nth(self, n: int) -> FrameLike:
+        """
+        Take the nth row from each group.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        n : int
+            A single nth value for the row
+
+        Examples
+        --------
+
+        >>> df = ps.DataFrame({'A': [1, 1, 2, 1, 2],
+        ...                    'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
+        >>> g = df.groupby('A')
+        >>> g.nth(0)
+             B
+        A
+        1  NaN
+        2  3.0
+        >>> g.nth(1)
+             B
+        A
+        1  2.0
+        2  5.0
+        >>> g.nth(-1)
+             B
+        A
+        1  4.0
+        2  5.0
+
+        See Also
+        --------
+        pyspark.pandas.Series.groupby
+        pyspark.pandas.DataFrame.groupby
+        """
+        groupkey_names = [SPARK_INDEX_NAME_FORMAT(i) for i in range(len(self._groupkeys))]
+        internal, agg_columns, sdf = self._prepare_reduce(
+            groupkey_names=groupkey_names,
+            accepted_spark_types=None,
+            bool_to_numeric=False,
+        )
+        psdf: DataFrame = DataFrame(internal)
+
+        if len(psdf._internal.column_labels) > 0:
+            window1 = Window.partitionBy(*groupkey_names).orderBy(NATURAL_ORDER_COLUMN_NAME)
+            tmp_row_number_col = "__tmp_row_number_col__"
+            if n >= 0:
+                sdf = (
+                    psdf._internal.spark_frame.withColumn(
+                        tmp_row_number_col, F.row_number().over(window1)
+                    )
+                    .where(F.col(tmp_row_number_col) == n + 1)
+                    .drop(tmp_row_number_col)
+                )
+            else:
+                window2 = Window.partitionBy(*groupkey_names).rowsBetween(
+                    Window.unboundedPreceding, Window.unboundedFollowing
+                )
+                tmp_group_size_col = "__tmp_group_size_col__"
+                sdf = (
+                    psdf._internal.spark_frame.withColumn(
+                        tmp_group_size_col, F.count(F.lit(0)).over(window2)
+                    )
+                    .withColumn(tmp_row_number_col, F.row_number().over(window1))
+                    .where(F.col(tmp_row_number_col) == F.col(tmp_group_size_col) + 1 + n)
+                    .drop(tmp_group_size_col, tmp_row_number_col)
+                )
+        else:
+            sdf = sdf.select(*groupkey_names).distinct()

Review Comment:
   Oh... I just noticed that we're following the pandas behavior even though there is a bug in pandas.
   
   When there is a bug in pandas, we usually do something like this:
   
   - We don't follow the pandas behavior; we implement the API assuming pandas works properly.
   - Comment a link to the related [pandas issue](https://github.com/pandas-dev/pandas/issues) from the pandas repository (`https://github.com/pandas-dev/pandas/issues/...`) on the test, as below:
     - https://github.com/apache/spark/blob/0830575100c1bcbc89d98a6859fe3b8d46ca2e6e/python/pyspark/pandas/tests/data_type_ops/test_num_ops.py#L510-L517
     - https://github.com/apache/spark/blob/0830575100c1bcbc89d98a6859fe3b8d46ca2e6e/python/pyspark/pandas/tests/data_type_ops/test_num_ops.py#L478-L483
   
   - If it's not clear that it's a bug (i.e., it hasn't been officially discussed as a bug in the pandas community), we can just follow the pandas behavior.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org