You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@spark.apache.org by gu...@apache.org on 2021/08/27 01:04:39 UTC

[spark] branch branch-3.2 updated (c25f1e4 -> 2dc15d9)

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git.


    from c25f1e4  [SPARK-36227][SQL][FOLLOWUP][3.2] Remove unused import in TimestampNTZType.scala
     new cb075b5  [SPARK-36345][SPARK-36367][INFRA][PYTHON] Disable tests failed by the incompatible behavior of pandas 1.3
     new f2f09e4  [SPARK-36369][PYTHON] Fix Index.union to follow pandas 1.3
     new 0fc8c39  [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
     new 31557d4  [SPARK-36387][PYTHON] Fix Series.astype from datetime to nullable string
     new 8829406  [SPARK-36368][PYTHON] Fix CategoricalOps.astype to follow pandas 1.3
     new 2dc15d9  [SPARK-36537][PYTHON] Revisit disabled tests for CategoricalDtype

The 6 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .github/workflows/build_and_test.yml               |   4 +-
 python/pyspark/pandas/categorical.py               |   3 +-
 python/pyspark/pandas/data_type_ops/base.py        |   2 +-
 .../pandas/data_type_ops/categorical_ops.py        |   4 +-
 .../pyspark/pandas/data_type_ops/datetime_ops.py   |  19 +-
 python/pyspark/pandas/groupby.py                   |  37 +--
 python/pyspark/pandas/indexes/base.py              |   4 +-
 .../tests/data_type_ops/test_categorical_ops.py    |   6 +-
 python/pyspark/pandas/tests/indexes/test_base.py   |  84 ++++---
 .../pyspark/pandas/tests/indexes/test_category.py  |  15 +-
 python/pyspark/pandas/tests/test_categorical.py    |  76 ++++++-
 python/pyspark/pandas/tests/test_expanding.py      |  74 ++++--
 .../test_ops_on_diff_frames_groupby_expanding.py   |  13 +-
 .../test_ops_on_diff_frames_groupby_rolling.py     |  14 +-
 python/pyspark/pandas/tests/test_rolling.py        |  75 +++++--
 python/pyspark/pandas/tests/test_series.py         |  18 +-
 python/pyspark/pandas/window.py                    | 249 +++++++++++----------
 17 files changed, 432 insertions(+), 265 deletions(-)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org

[spark] 06/06: [SPARK-36537][PYTHON] Revisit disabled tests for CategoricalDtype

Posted by gu...@apache.org.

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

commit 2dc15d9d8476da327c54577e3bbb261ad7923f2f
Author: itholic <ha...@databricks.com>
AuthorDate: Thu Aug 26 17:43:49 2021 +0900

    [SPARK-36537][PYTHON] Revisit disabled tests for CategoricalDtype
    
    This PR proposes to enable the tests, disabled since different behavior with pandas 1.3.
    
    - `inplace` argument for `CategoricalDtype` functions is deprecated from pandas 1.3, and seems they have bug. So we manually created the expected result and test them.
    - Fixed the `GroupBy.transform` since it doesn't work properly for `CategoricalDtype`.
    
    We should enable the tests as much as possible even if pandas has a bug.
    
    And we should follow the behavior of latest pandas.
    
    Yes, `GroupBy.transform` now follow the behavior of latest pandas.
    
    Unittests.
    
    Closes #33817 from itholic/SPARK-36537.
    
    Authored-by: itholic <ha...@databricks.com>
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
    (cherry picked from commit fe486185c4a3a05278b1f01884e2b95ed3ca31bc)
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
 python/pyspark/pandas/groupby.py                |   1 +
 python/pyspark/pandas/tests/test_categorical.py | 116 +++++++++++++-----------
 2 files changed, 63 insertions(+), 54 deletions(-)

diff --git a/python/pyspark/pandas/groupby.py b/python/pyspark/pandas/groupby.py
index c732dff..2815a6b 100644
--- a/python/pyspark/pandas/groupby.py
+++ b/python/pyspark/pandas/groupby.py
@@ -2264,6 +2264,7 @@ class GroupBy(Generic[FrameLike], metaclass=ABCMeta):
                 for c in psdf._internal.data_spark_column_names
                 if c not in groupkey_names
             ]
+
             return_schema = StructType([field.struct_field for field in data_fields])
 
             sdf = GroupBy._spark_group_map_apply(
diff --git a/python/pyspark/pandas/tests/test_categorical.py b/python/pyspark/pandas/tests/test_categorical.py
index 1fb0d58..e55c08c 100644
--- a/python/pyspark/pandas/tests/test_categorical.py
+++ b/python/pyspark/pandas/tests/test_categorical.py
@@ -74,10 +74,10 @@ class CategoricalTest(PandasOnSparkTestCase, TestUtils):
         pser.cat.categories = ["z", "y", "x"]
         psser.cat.categories = ["z", "y", "x"]
         if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
-        else:
-            self.assert_eq(pser, psser)
+            # Bug in pandas 1.3. dtype is not updated properly with `inplace` argument.
+            pser = pser.astype(CategoricalDtype(categories=["x", "y", "z"]))
+
+        self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         with self.assertRaises(ValueError):
@@ -96,10 +96,10 @@ class CategoricalTest(PandasOnSparkTestCase, TestUtils):
         pser.cat.add_categories(4, inplace=True)
         psser.cat.add_categories(4, inplace=True)
         if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
-        else:
-            self.assert_eq(pser, psser)
+            # Bug in pandas 1.3. dtype is not updated properly with `inplace` argument.
+            pser = pser.astype(CategoricalDtype(categories=[1, 2, 3, 4]))
+
+        self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         self.assertRaises(ValueError, lambda: psser.cat.add_categories(4))
@@ -124,10 +124,10 @@ class CategoricalTest(PandasOnSparkTestCase, TestUtils):
         pser.cat.remove_categories(2, inplace=True)
         psser.cat.remove_categories(2, inplace=True)
         if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
-        else:
-            self.assert_eq(pser, psser)
+            # Bug in pandas 1.3. dtype is not updated properly with `inplace` argument.
+            pser = pser.astype(CategoricalDtype(categories=[1, 3]))
+
+        self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         self.assertRaises(ValueError, lambda: psser.cat.remove_categories(4))
@@ -151,10 +151,10 @@ class CategoricalTest(PandasOnSparkTestCase, TestUtils):
         pser.cat.remove_unused_categories(inplace=True)
         psser.cat.remove_unused_categories(inplace=True)
         if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
-        else:
-            self.assert_eq(pser, psser)
+            # Bug in pandas 1.3. dtype is not updated properly with `inplace` argument.
+            pser = pser.astype(CategoricalDtype(categories=[1, 3]))
+
+        self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
     def test_reorder_categories(self):
@@ -180,20 +180,17 @@ class CategoricalTest(PandasOnSparkTestCase, TestUtils):
 
         pser.cat.reorder_categories([1, 2, 3], inplace=True)
         psser.cat.reorder_categories([1, 2, 3], inplace=True)
-        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
-        else:
-            self.assert_eq(pser, psser)
+
+        self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         pser.cat.reorder_categories([3, 2, 1], ordered=True, inplace=True)
         psser.cat.reorder_categories([3, 2, 1], ordered=True, inplace=True)
         if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
-        else:
-            self.assert_eq(pser, psser)
+            # Bug in pandas 1.3. dtype is not updated properly with `inplace` argument.
+            pser = pser.astype(CategoricalDtype(categories=[3, 2, 1], ordered=True))
+
+        self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         self.assertRaises(ValueError, lambda: psser.cat.reorder_categories([1, 2]))
@@ -214,10 +211,10 @@ class CategoricalTest(PandasOnSparkTestCase, TestUtils):
         pser.cat.as_ordered(inplace=True)
         psser.cat.as_ordered(inplace=True)
         if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
-        else:
-            self.assert_eq(pser, psser)
+            # Bug in pandas 1.3. dtype is not updated properly with `inplace` argument.
+            pser = pser.astype(CategoricalDtype(categories=[1, 2, 3], ordered=True))
+
+        self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         # as_unordered
@@ -225,6 +222,11 @@ class CategoricalTest(PandasOnSparkTestCase, TestUtils):
 
         pser.cat.as_unordered(inplace=True)
         psser.cat.as_unordered(inplace=True)
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # Bug in pandas 1.3. dtype is not updated properly with `inplace` argument.
+            pser = pser.astype(CategoricalDtype(categories=[1, 2, 3], ordered=False))
+            pdf.a = pser
+
         self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
@@ -445,13 +447,16 @@ class CategoricalTest(PandasOnSparkTestCase, TestUtils):
 
         dtype = CategoricalDtype(categories=["a", "b", "c", "d"])
 
-        def astype(x) -> ps.Series[dtype]:
+        # The behavior for CategoricalDtype is changed from pandas 1.3
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            ret_dtype = pdf.b.dtype
+        else:
+            ret_dtype = dtype
+
+        def astype(x) -> ps.Series[ret_dtype]:
             return x.astype(dtype)
 
-        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
-        elif LooseVersion(pd.__version__) >= LooseVersion("1.2"):
+        if LooseVersion(pd.__version__) >= LooseVersion("1.2"):
             self.assert_eq(
                 psdf.groupby("a").transform(astype).sort_values("b").reset_index(drop=True),
                 pdf.groupby("a").transform(astype).sort_values("b").reset_index(drop=True),
@@ -670,28 +675,30 @@ class CategoricalTest(PandasOnSparkTestCase, TestUtils):
         pser.cat.rename_categories({"a": "A", "c": "C"}, inplace=True)
         psser.cat.rename_categories({"a": "A", "c": "C"}, inplace=True)
         if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
-        else:
-            self.assert_eq(pser, psser)
+            # Bug in pandas 1.3. dtype is not updated properly with `inplace` argument.
+            pser = pser.astype(CategoricalDtype(categories=["C", "b", "d", "A"]))
+
+        self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         pser.cat.rename_categories(lambda x: x.upper(), inplace=True)
         psser.cat.rename_categories(lambda x: x.upper(), inplace=True)
         if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
-        else:
-            self.assert_eq(pser, psser)
+            # Bug in pandas 1.3. dtype is not updated properly with `inplace` argument.
+            pser = pser.astype(CategoricalDtype(categories=["C", "B", "D", "A"]))
+            pdf.b = pser
+
+        self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         pser.cat.rename_categories([0, 1, 3, 2], inplace=True)
         psser.cat.rename_categories([0, 1, 3, 2], inplace=True)
         if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
-        else:
-            self.assert_eq(pser, psser)
+            # Bug in pandas 1.3. dtype is not updated properly with `inplace` argument.
+            pser = pser.astype(CategoricalDtype(categories=[0, 1, 3, 2]))
+            pdf.b = pser
+
+        self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         self.assertRaisesRegex(
@@ -762,19 +769,20 @@ class CategoricalTest(PandasOnSparkTestCase, TestUtils):
             psser.cat.set_categories(["a", "c", "b", "o"], inplace=True, rename=True),
         )
         if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
-        else:
-            self.assert_eq(pser, psser)
+            # Bug in pandas 1.3. dtype is not updated properly with `inplace` argument.
+            pser = pser.astype(CategoricalDtype(categories=["a", "c", "b", "o"]))
+
+        self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         pser.cat.set_categories([2, 3, 1, 0], inplace=True, rename=False),
         psser.cat.set_categories([2, 3, 1, 0], inplace=True, rename=False),
         if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
-        else:
-            self.assert_eq(pser, psser)
+            # Bug in pandas 1.3. dtype is not updated properly with `inplace` argument.
+            pser = pser.astype(CategoricalDtype(categories=[2, 3, 1, 0]))
+            pdf.b = pser
+
+        self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         self.assertRaisesRegex(

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org

[spark] 04/06: [SPARK-36387][PYTHON] Fix Series.astype from datetime to nullable string

Posted by gu...@apache.org.

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

commit 31557d475958e18c7e26525d5ba6f7448bad800f
Author: itholic <ha...@databricks.com>
AuthorDate: Tue Aug 17 10:29:16 2021 -0700

    [SPARK-36387][PYTHON] Fix Series.astype from datetime to nullable string
    
    This PR proposes to fix `Series.astype` when converting datetime type to StringDtype, to match the behavior of pandas 1.3.
    
    In pandas < 1.3,
    ```python
    >>> pd.Series(["2020-10-27 00:00:01", None], name="datetime").astype("string")
    0    2020-10-27 00:00:01
    1                    NaT
    Name: datetime, dtype: string
    ```
    
    This is changed to
    
    ```python
    >>> pd.Series(["2020-10-27 00:00:01", None], name="datetime").astype("string")
    0    2020-10-27 00:00:01
    1                   <NA>
    Name: datetime, dtype: string
    ```
    
    in pandas >= 1.3, so we follow the behavior of latest pandas.
    
    Because pandas-on-Spark always follow the behavior of latest pandas.
    
    Yes, the behavior is changed to latest pandas when converting datetime to nullable string (StringDtype)
    
    Unittest passed
    
    Closes #33735 from itholic/SPARK-36387.
    
    Authored-by: itholic <ha...@databricks.com>
    Signed-off-by: Takuya UESHIN <ue...@databricks.com>
    (cherry picked from commit c0441bb7e83e83e3240bf7e2991de34b01a182f5)
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
 python/pyspark/pandas/data_type_ops/base.py         |  2 +-
 python/pyspark/pandas/data_type_ops/datetime_ops.py | 19 ++++---------------
 python/pyspark/pandas/tests/test_series.py          |  8 +++++---
 3 files changed, 10 insertions(+), 19 deletions(-)

diff --git a/python/pyspark/pandas/data_type_ops/base.py b/python/pyspark/pandas/data_type_ops/base.py
index c69715f..b4c8c3e 100644
--- a/python/pyspark/pandas/data_type_ops/base.py
+++ b/python/pyspark/pandas/data_type_ops/base.py
@@ -155,7 +155,7 @@ def _as_string_type(
     index_ops: IndexOpsLike, dtype: Union[str, type, Dtype], *, null_str: str = str(None)
 ) -> IndexOpsLike:
     """Cast `index_ops` to StringType Spark type, given `dtype` and `null_str`,
-    representing null Spark column.
+    representing null Spark column. Note that `null_str` is for non-extension dtypes only.
     """
     spark_type = StringType()
     if isinstance(dtype, extension_dtypes):
diff --git a/python/pyspark/pandas/data_type_ops/datetime_ops.py b/python/pyspark/pandas/data_type_ops/datetime_ops.py
index 071c22e..63d817b 100644
--- a/python/pyspark/pandas/data_type_ops/datetime_ops.py
+++ b/python/pyspark/pandas/data_type_ops/datetime_ops.py
@@ -23,7 +23,7 @@ import numpy as np
 import pandas as pd
 from pandas.api.types import CategoricalDtype
 
-from pyspark.sql import functions as F, Column
+from pyspark.sql import Column
 from pyspark.sql.types import BooleanType, LongType, StringType, TimestampType
 
 from pyspark.pandas._typing import Dtype, IndexOpsLike, SeriesOrIndex
@@ -33,10 +33,11 @@ from pyspark.pandas.data_type_ops.base import (
     _as_bool_type,
     _as_categorical_type,
     _as_other_type,
+    _as_string_type,
     _sanitize_list_like,
 )
 from pyspark.pandas.spark import functions as SF
-from pyspark.pandas.typedef import extension_dtypes, pandas_on_spark_type
+from pyspark.pandas.typedef import pandas_on_spark_type
 
 
 class DatetimeOps(DataTypeOps):
@@ -133,18 +134,6 @@ class DatetimeOps(DataTypeOps):
         elif isinstance(spark_type, BooleanType):
             return _as_bool_type(index_ops, dtype)
         elif isinstance(spark_type, StringType):
-            if isinstance(dtype, extension_dtypes):
-                # seems like a pandas' bug?
-                scol = F.when(index_ops.spark.column.isNull(), str(pd.NaT)).otherwise(
-                    index_ops.spark.column.cast(spark_type)
-                )
-            else:
-                null_str = str(pd.NaT)
-                casted = index_ops.spark.column.cast(spark_type)
-                scol = F.when(index_ops.spark.column.isNull(), null_str).otherwise(casted)
-            return index_ops._with_new_scol(
-                scol.alias(index_ops._internal.data_spark_column_names[0]),
-                field=index_ops._internal.data_fields[0].copy(dtype=dtype, spark_type=spark_type),
-            )
+            return _as_string_type(index_ops, dtype, null_str=str(pd.NaT))
         else:
             return _as_other_type(index_ops, dtype, spark_type)
diff --git a/python/pyspark/pandas/tests/test_series.py b/python/pyspark/pandas/tests/test_series.py
index d9ba3c76..58c87ed 100644
--- a/python/pyspark/pandas/tests/test_series.py
+++ b/python/pyspark/pandas/tests/test_series.py
@@ -1556,16 +1556,18 @@ class SeriesTest(PandasOnSparkTestCase, SQLTestUtils):
         if extension_object_dtypes_available:
             from pandas import StringDtype
 
+            # The behavior of casting datetime to nullable string is changed from pandas 1.3.
             if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-                # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-                pass
-            else:
                 self._check_extension(
                     psser.astype("M").astype("string"), pser.astype("M").astype("string")
                 )
                 self._check_extension(
                     psser.astype("M").astype(StringDtype()), pser.astype("M").astype(StringDtype())
                 )
+            else:
+                expected = ps.Series(["2020-10-27 00:00:01", None], name="x", dtype="string")
+                self._check_extension(psser.astype("M").astype("string"), expected)
+                self._check_extension(psser.astype("M").astype(StringDtype()), expected)
 
         with self.assertRaisesRegex(TypeError, "not understood"):
             psser.astype("int63")

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org

[spark] 05/06: [SPARK-36368][PYTHON] Fix CategoricalOps.astype to follow pandas 1.3

Posted by gu...@apache.org.

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

commit 8829406366674e3947eec2efa7d9622cdc680740
Author: itholic <ha...@databricks.com>
AuthorDate: Wed Aug 18 11:38:59 2021 -0700

    [SPARK-36368][PYTHON] Fix CategoricalOps.astype to follow pandas 1.3
    
    This PR proposes to fix the behavior of `astype` for `CategoricalDtype` to follow pandas 1.3.
    
    **Before:**
    ```python
    >>> pcat
    0    a
    1    b
    2    c
    dtype: category
    Categories (3, object): ['a', 'b', 'c']
    
    >>> pcat.astype(CategoricalDtype(["b", "c", "a"]))
    0    a
    1    b
    2    c
    dtype: category
    Categories (3, object): ['b', 'c', 'a']
    ```
    
    **After:**
    ```python
    >>> pcat
    0    a
    1    b
    2    c
    dtype: category
    Categories (3, object): ['a', 'b', 'c']
    
    >>> pcat.astype(CategoricalDtype(["b", "c", "a"]))
    0    a
    1    b
    2    c
    dtype: category
    Categories (3, object): ['a', 'b', 'c']  # CategoricalDtype is not updated if dtype is the same
    ```
    
    `CategoricalDtype` is treated as a same `dtype` if the unique values are the same.
    
    ```python
    >>> pcat1 = pser.astype(CategoricalDtype(["b", "c", "a"]))
    >>> pcat2 = pser.astype(CategoricalDtype(["a", "b", "c"]))
    >>> pcat1.dtype == pcat2.dtype
    True
    ```
    
    We should follow the latest pandas as much as possible.
    
    Yes, the behavior is changed as example in the PR description.
    
    Unittest
    
    Closes #33757 from itholic/SPARK-36368.
    
    Authored-by: itholic <ha...@databricks.com>
    Signed-off-by: Takuya UESHIN <ue...@databricks.com>
    (cherry picked from commit f2e593bcf1a1aa8dde9f73b77e4863ceed5a7e28)
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
 python/pyspark/pandas/categorical.py                     |  3 ++-
 python/pyspark/pandas/data_type_ops/categorical_ops.py   |  4 +++-
 .../pandas/tests/data_type_ops/test_categorical_ops.py   |  6 ++----
 python/pyspark/pandas/tests/indexes/test_category.py     | 16 +++++++---------
 python/pyspark/pandas/tests/test_categorical.py          | 16 +++++++---------
 5 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/python/pyspark/pandas/categorical.py b/python/pyspark/pandas/categorical.py
index 77a3cee..fa11228 100644
--- a/python/pyspark/pandas/categorical.py
+++ b/python/pyspark/pandas/categorical.py
@@ -22,6 +22,7 @@ from pandas.api.types import CategoricalDtype, is_dict_like, is_list_like
 
 from pyspark.pandas.internal import InternalField
 from pyspark.pandas.spark import functions as SF
+from pyspark.pandas.data_type_ops.categorical_ops import _to_cat
 from pyspark.sql import functions as F
 from pyspark.sql.types import StructField
 
@@ -735,7 +736,7 @@ class CategoricalAccessor(object):
                 return self._data.copy()
         else:
             dtype = CategoricalDtype(categories=new_categories, ordered=ordered)
-            psser = self._data.astype(dtype)
+            psser = _to_cat(self._data).astype(dtype)
 
             if inplace:
                 internal = self._data._psdf._internal.with_new_spark_column(
diff --git a/python/pyspark/pandas/data_type_ops/categorical_ops.py b/python/pyspark/pandas/data_type_ops/categorical_ops.py
index 73af82e..69625e2 100644
--- a/python/pyspark/pandas/data_type_ops/categorical_ops.py
+++ b/python/pyspark/pandas/data_type_ops/categorical_ops.py
@@ -58,7 +58,9 @@ class CategoricalOps(DataTypeOps):
     def astype(self, index_ops: IndexOpsLike, dtype: Union[str, type, Dtype]) -> IndexOpsLike:
         dtype, _ = pandas_on_spark_type(dtype)
 
-        if isinstance(dtype, CategoricalDtype) and cast(CategoricalDtype, dtype).categories is None:
+        if isinstance(dtype, CategoricalDtype) and (
+            (dtype.categories is None) or (index_ops.dtype == dtype)
+        ):
             return index_ops.copy()
 
         return _to_cat(index_ops).astype(dtype)
diff --git a/python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py b/python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py
index 11871ea..5e79eb3 100644
--- a/python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py
+++ b/python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py
@@ -192,13 +192,11 @@ class CategoricalOpsTest(PandasOnSparkTestCase, TestCasesUtils):
         self.assert_eq(pser.astype("category"), psser.astype("category"))
 
         cat_type = CategoricalDtype(categories=[3, 1, 2])
+        # CategoricalDtype is not updated if the dtype is same from pandas 1.3.
         if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
-        elif LooseVersion(pd.__version__) >= LooseVersion("1.2"):
             self.assert_eq(pser.astype(cat_type), psser.astype(cat_type))
         else:
-            self.assert_eq(pd.Series(data).astype(cat_type), psser.astype(cat_type))
+            self.assert_eq(psser.astype(cat_type), pser)
 
     def test_neg(self):
         self.assertRaises(TypeError, lambda: -self.psser)
diff --git a/python/pyspark/pandas/tests/indexes/test_category.py b/python/pyspark/pandas/tests/indexes/test_category.py
index f241918..8393b31 100644
--- a/python/pyspark/pandas/tests/indexes/test_category.py
+++ b/python/pyspark/pandas/tests/indexes/test_category.py
@@ -172,25 +172,23 @@ class CategoricalIndexTest(PandasOnSparkTestCase, TestUtils):
         )
 
         pcidx = pidx.astype(CategoricalDtype(["c", "a", "b"]))
-        kcidx = psidx.astype(CategoricalDtype(["c", "a", "b"]))
+        pscidx = psidx.astype(CategoricalDtype(["c", "a", "b"]))
 
-        self.assert_eq(kcidx.astype("category"), pcidx.astype("category"))
+        self.assert_eq(pscidx.astype("category"), pcidx.astype("category"))
 
+        # CategoricalDtype is not updated if the dtype is same from pandas 1.3.
         if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
-        elif LooseVersion(pd.__version__) >= LooseVersion("1.2"):
             self.assert_eq(
-                kcidx.astype(CategoricalDtype(["b", "c", "a"])),
+                pscidx.astype(CategoricalDtype(["b", "c", "a"])),
                 pcidx.astype(CategoricalDtype(["b", "c", "a"])),
             )
         else:
             self.assert_eq(
-                kcidx.astype(CategoricalDtype(["b", "c", "a"])),
-                pidx.astype(CategoricalDtype(["b", "c", "a"])),
+                pscidx.astype(CategoricalDtype(["b", "c", "a"])),
+                pcidx,
             )
 
-        self.assert_eq(kcidx.astype(str), pcidx.astype(str))
+        self.assert_eq(pscidx.astype(str), pcidx.astype(str))
 
     def test_factorize(self):
         pidx = pd.CategoricalIndex([1, 2, 3, None])
diff --git a/python/pyspark/pandas/tests/test_categorical.py b/python/pyspark/pandas/tests/test_categorical.py
index 1335d59..1fb0d58 100644
--- a/python/pyspark/pandas/tests/test_categorical.py
+++ b/python/pyspark/pandas/tests/test_categorical.py
@@ -239,25 +239,23 @@ class CategoricalTest(PandasOnSparkTestCase, TestUtils):
         )
 
         pcser = pser.astype(CategoricalDtype(["c", "a", "b"]))
-        kcser = psser.astype(CategoricalDtype(["c", "a", "b"]))
+        pscser = psser.astype(CategoricalDtype(["c", "a", "b"]))
 
-        self.assert_eq(kcser.astype("category"), pcser.astype("category"))
+        self.assert_eq(pscser.astype("category"), pcser.astype("category"))
 
+        # CategoricalDtype is not updated if the dtype is same from pandas 1.3.
         if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
-        elif LooseVersion(pd.__version__) >= LooseVersion("1.2"):
             self.assert_eq(
-                kcser.astype(CategoricalDtype(["b", "c", "a"])),
+                pscser.astype(CategoricalDtype(["b", "c", "a"])),
                 pcser.astype(CategoricalDtype(["b", "c", "a"])),
             )
         else:
             self.assert_eq(
-                kcser.astype(CategoricalDtype(["b", "c", "a"])),
-                pser.astype(CategoricalDtype(["b", "c", "a"])),
+                pscser.astype(CategoricalDtype(["b", "c", "a"])),
+                pcser,
             )
 
-        self.assert_eq(kcser.astype(str), pcser.astype(str))
+        self.assert_eq(pscser.astype(str), pcser.astype(str))
 
     def test_factorize(self):
         pser = pd.Series(["a", "b", "c", None], dtype=CategoricalDtype(["c", "a", "d", "b"]))

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org

[spark] 01/06: [SPARK-36345][SPARK-36367][INFRA][PYTHON] Disable tests failed by the incompatible behavior of pandas 1.3

Posted by gu...@apache.org.

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

commit cb075b5301e08b9d9b06f3d33a41b3d63d95378e
Author: Takuya UESHIN <ue...@databricks.com>
AuthorDate: Tue Aug 3 14:02:18 2021 +0900

    [SPARK-36345][SPARK-36367][INFRA][PYTHON] Disable tests failed by the incompatible behavior of pandas 1.3
    
    Disable tests failed by the incompatible behavior of pandas 1.3.
    
    Pandas 1.3 has been released.
    There are some behavior changes and we should follow it, but it's not ready yet.
    
    No.
    
    Disabled some tests related to the behavior change.
    
    Closes #33598 from ueshin/issues/SPARK-36367/disable_tests.
    
    Authored-by: Takuya UESHIN <ue...@databricks.com>
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
    (cherry picked from commit 8cb9cf39b6a1899175aeaefb2a85480f5a514aac)
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
 .github/workflows/build_and_test.yml               |  4 +-
 python/pyspark/pandas/groupby.py                   |  8 +++
 .../tests/data_type_ops/test_categorical_ops.py    |  6 +-
 python/pyspark/pandas/tests/indexes/test_base.py   | 76 +++++++++++---------
 .../pyspark/pandas/tests/indexes/test_category.py  |  5 +-
 python/pyspark/pandas/tests/test_categorical.py    | 82 ++++++++++++++++++----
 python/pyspark/pandas/tests/test_expanding.py      | 51 ++++++++------
 .../test_ops_on_diff_frames_groupby_expanding.py   | 13 ++--
 .../test_ops_on_diff_frames_groupby_rolling.py     | 14 ++--
 python/pyspark/pandas/tests/test_rolling.py        | 52 ++++++++------
 python/pyspark/pandas/tests/test_series.py         | 16 +++--
 11 files changed, 222 insertions(+), 105 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index b518f87..7fc99ef 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -149,7 +149,7 @@ jobs:
     name: "Build modules: ${{ matrix.modules }}"
     runs-on: ubuntu-20.04
     container:
-      image: dongjoon/apache-spark-github-action-image:20210602
+      image: dongjoon/apache-spark-github-action-image:20210730
     strategy:
       fail-fast: false
       matrix:
@@ -227,8 +227,6 @@ jobs:
     # Run the tests.
     - name: Run tests
       run: |
-        # TODO(SPARK-36345): Install mlflow>=1.0 and sklearn in Python 3.9 of the base image
-        python3.9 -m pip install 'mlflow>=1.0' sklearn
         export PATH=$PATH:$HOME/miniconda/bin
         ./dev/run-tests --parallelism 1 --modules "$MODULES_TO_TEST"
     - name: Upload test results to report
diff --git a/python/pyspark/pandas/groupby.py b/python/pyspark/pandas/groupby.py
index 70ece9c..faa1de6 100644
--- a/python/pyspark/pandas/groupby.py
+++ b/python/pyspark/pandas/groupby.py
@@ -20,6 +20,7 @@ A wrapper for GroupedData to behave similar to pandas GroupBy.
 """
 
 from abc import ABCMeta, abstractmethod
+import builtins
 import sys
 import inspect
 from collections import OrderedDict, namedtuple
@@ -43,6 +44,7 @@ from typing import (
     TYPE_CHECKING,
 )
 
+import numpy as np
 import pandas as pd
 from pandas.api.types import is_hashable, is_list_like
 
@@ -102,6 +104,12 @@ if TYPE_CHECKING:
 # to keep it the same as pandas
 NamedAgg = namedtuple("NamedAgg", ["column", "aggfunc"])
 
+_builtin_table = {
+    builtins.sum: np.sum,
+    builtins.max: np.max,
+    builtins.min: np.min,
+}  # type: Dict[Callable, Callable]
+
 
 class GroupBy(Generic[FrameLike], metaclass=ABCMeta):
     """
diff --git a/python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py b/python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py
index 6ac9073..11871ea 100644
--- a/python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py
+++ b/python/pyspark/pandas/tests/data_type_ops/test_categorical_ops.py
@@ -190,8 +190,12 @@ class CategoricalOpsTest(PandasOnSparkTestCase, TestCasesUtils):
         self.assert_eq(pser.astype(str), psser.astype(str))
         self.assert_eq(pser.astype(bool), psser.astype(bool))
         self.assert_eq(pser.astype("category"), psser.astype("category"))
+
         cat_type = CategoricalDtype(categories=[3, 1, 2])
-        if LooseVersion(pd.__version__) >= LooseVersion("1.2"):
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        elif LooseVersion(pd.__version__) >= LooseVersion("1.2"):
             self.assert_eq(pser.astype(cat_type), psser.astype(cat_type))
         else:
             self.assert_eq(pd.Series(data).astype(cat_type), psser.astype(cat_type))
diff --git a/python/pyspark/pandas/tests/indexes/test_base.py b/python/pyspark/pandas/tests/indexes/test_base.py
index 8238b67..39e22bd 100644
--- a/python/pyspark/pandas/tests/indexes/test_base.py
+++ b/python/pyspark/pandas/tests/indexes/test_base.py
@@ -1478,25 +1478,30 @@ class IndexesTest(PandasOnSparkTestCase, TestUtils):
             psidx2 = ps.from_pandas(pidx2)
 
             self.assert_eq(psidx1.union(psidx2), pidx1.union(pidx2))
-            self.assert_eq(psidx2.union(psidx1), pidx2.union(pidx1))
             self.assert_eq(
                 psidx1.union([3, 4, 3, 3, 5, 6]), pidx1.union([3, 4, 3, 4, 5, 6]), almost=True
             )
             self.assert_eq(
-                psidx2.union([1, 2, 3, 4, 3, 4, 3, 4]),
-                pidx2.union([1, 2, 3, 4, 3, 4, 3, 4]),
-                almost=True,
-            )
-            self.assert_eq(
                 psidx1.union(ps.Series([3, 4, 3, 3, 5, 6])),
                 pidx1.union(pd.Series([3, 4, 3, 4, 5, 6])),
                 almost=True,
             )
-            self.assert_eq(
-                psidx2.union(ps.Series([1, 2, 3, 4, 3, 4, 3, 4])),
-                pidx2.union(pd.Series([1, 2, 3, 4, 3, 4, 3, 4])),
-                almost=True,
-            )
+
+            if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+                # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+                pass
+            else:
+                self.assert_eq(psidx2.union(psidx1), pidx2.union(pidx1))
+                self.assert_eq(
+                    psidx2.union([1, 2, 3, 4, 3, 4, 3, 4]),
+                    pidx2.union([1, 2, 3, 4, 3, 4, 3, 4]),
+                    almost=True,
+                )
+                self.assert_eq(
+                    psidx2.union(ps.Series([1, 2, 3, 4, 3, 4, 3, 4])),
+                    pidx2.union(pd.Series([1, 2, 3, 4, 3, 4, 3, 4])),
+                    almost=True,
+                )
 
         # MultiIndex
         pmidx1 = pd.MultiIndex.from_tuples([("x", "a"), ("x", "b"), ("x", "a"), ("x", "b")])
@@ -1508,30 +1513,37 @@ class IndexesTest(PandasOnSparkTestCase, TestUtils):
         psmidx3 = ps.from_pandas(pmidx3)
         psmidx4 = ps.from_pandas(pmidx4)
 
-        self.assert_eq(psmidx1.union(psmidx2), pmidx1.union(pmidx2))
-        self.assert_eq(psmidx2.union(psmidx1), pmidx2.union(pmidx1))
-        self.assert_eq(psmidx3.union(psmidx4), pmidx3.union(pmidx4))
-        self.assert_eq(psmidx4.union(psmidx3), pmidx4.union(pmidx3))
-        self.assert_eq(
-            psmidx1.union([("x", "a"), ("x", "b"), ("x", "c"), ("x", "d")]),
-            pmidx1.union([("x", "a"), ("x", "b"), ("x", "c"), ("x", "d")]),
-        )
-        self.assert_eq(
-            psmidx2.union([("x", "a"), ("x", "b"), ("x", "a"), ("x", "b")]),
-            pmidx2.union([("x", "a"), ("x", "b"), ("x", "a"), ("x", "b")]),
-        )
-        self.assert_eq(
-            psmidx3.union([(1, 3), (1, 4), (1, 5), (1, 6)]),
-            pmidx3.union([(1, 3), (1, 4), (1, 5), (1, 6)]),
-        )
-        self.assert_eq(
-            psmidx4.union([(1, 1), (1, 2), (1, 3), (1, 4), (1, 3), (1, 4)]),
-            pmidx4.union([(1, 1), (1, 2), (1, 3), (1, 4), (1, 3), (1, 4)]),
-        )
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        else:
+            self.assert_eq(psmidx1.union(psmidx2), pmidx1.union(pmidx2))
+            self.assert_eq(psmidx2.union(psmidx1), pmidx2.union(pmidx1))
+            self.assert_eq(psmidx3.union(psmidx4), pmidx3.union(pmidx4))
+            self.assert_eq(psmidx4.union(psmidx3), pmidx4.union(pmidx3))
+            self.assert_eq(
+                psmidx1.union([("x", "a"), ("x", "b"), ("x", "c"), ("x", "d")]),
+                pmidx1.union([("x", "a"), ("x", "b"), ("x", "c"), ("x", "d")]),
+            )
+            self.assert_eq(
+                psmidx2.union([("x", "a"), ("x", "b"), ("x", "a"), ("x", "b")]),
+                pmidx2.union([("x", "a"), ("x", "b"), ("x", "a"), ("x", "b")]),
+            )
+            self.assert_eq(
+                psmidx3.union([(1, 3), (1, 4), (1, 5), (1, 6)]),
+                pmidx3.union([(1, 3), (1, 4), (1, 5), (1, 6)]),
+            )
+            self.assert_eq(
+                psmidx4.union([(1, 1), (1, 2), (1, 3), (1, 4), (1, 3), (1, 4)]),
+                pmidx4.union([(1, 1), (1, 2), (1, 3), (1, 4), (1, 3), (1, 4)]),
+            )
 
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
         # Testing if the result is correct after sort=False.
         # The `sort` argument is added in pandas 0.24.
-        if LooseVersion(pd.__version__) >= LooseVersion("0.24"):
+        elif LooseVersion(pd.__version__) >= LooseVersion("0.24"):
             self.assert_eq(
                 psmidx1.union(psmidx2, sort=False).sort_values(),
                 pmidx1.union(pmidx2, sort=False).sort_values(),
diff --git a/python/pyspark/pandas/tests/indexes/test_category.py b/python/pyspark/pandas/tests/indexes/test_category.py
index 37216bd..f241918 100644
--- a/python/pyspark/pandas/tests/indexes/test_category.py
+++ b/python/pyspark/pandas/tests/indexes/test_category.py
@@ -176,7 +176,10 @@ class CategoricalIndexTest(PandasOnSparkTestCase, TestUtils):
 
         self.assert_eq(kcidx.astype("category"), pcidx.astype("category"))
 
-        if LooseVersion(pd.__version__) >= LooseVersion("1.2"):
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        elif LooseVersion(pd.__version__) >= LooseVersion("1.2"):
             self.assert_eq(
                 kcidx.astype(CategoricalDtype(["b", "c", "a"])),
                 pcidx.astype(CategoricalDtype(["b", "c", "a"])),
diff --git a/python/pyspark/pandas/tests/test_categorical.py b/python/pyspark/pandas/tests/test_categorical.py
index 67cdf3c..1335d59 100644
--- a/python/pyspark/pandas/tests/test_categorical.py
+++ b/python/pyspark/pandas/tests/test_categorical.py
@@ -73,7 +73,11 @@ class CategoricalTest(PandasOnSparkTestCase, TestUtils):
 
         pser.cat.categories = ["z", "y", "x"]
         psser.cat.categories = ["z", "y", "x"]
-        self.assert_eq(pser, psser)
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        else:
+            self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         with self.assertRaises(ValueError):
@@ -91,7 +95,11 @@ class CategoricalTest(PandasOnSparkTestCase, TestUtils):
 
         pser.cat.add_categories(4, inplace=True)
         psser.cat.add_categories(4, inplace=True)
-        self.assert_eq(pser, psser)
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        else:
+            self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         self.assertRaises(ValueError, lambda: psser.cat.add_categories(4))
@@ -115,7 +123,11 @@ class CategoricalTest(PandasOnSparkTestCase, TestUtils):
 
         pser.cat.remove_categories(2, inplace=True)
         psser.cat.remove_categories(2, inplace=True)
-        self.assert_eq(pser, psser)
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        else:
+            self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         self.assertRaises(ValueError, lambda: psser.cat.remove_categories(4))
@@ -138,7 +150,11 @@ class CategoricalTest(PandasOnSparkTestCase, TestUtils):
 
         pser.cat.remove_unused_categories(inplace=True)
         psser.cat.remove_unused_categories(inplace=True)
-        self.assert_eq(pser, psser)
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        else:
+            self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
     def test_reorder_categories(self):
@@ -164,12 +180,20 @@ class CategoricalTest(PandasOnSparkTestCase, TestUtils):
 
         pser.cat.reorder_categories([1, 2, 3], inplace=True)
         psser.cat.reorder_categories([1, 2, 3], inplace=True)
-        self.assert_eq(pser, psser)
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        else:
+            self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         pser.cat.reorder_categories([3, 2, 1], ordered=True, inplace=True)
         psser.cat.reorder_categories([3, 2, 1], ordered=True, inplace=True)
-        self.assert_eq(pser, psser)
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        else:
+            self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         self.assertRaises(ValueError, lambda: psser.cat.reorder_categories([1, 2]))
@@ -189,7 +213,11 @@ class CategoricalTest(PandasOnSparkTestCase, TestUtils):
 
         pser.cat.as_ordered(inplace=True)
         psser.cat.as_ordered(inplace=True)
-        self.assert_eq(pser, psser)
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        else:
+            self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         # as_unordered
@@ -215,7 +243,10 @@ class CategoricalTest(PandasOnSparkTestCase, TestUtils):
 
         self.assert_eq(kcser.astype("category"), pcser.astype("category"))
 
-        if LooseVersion(pd.__version__) >= LooseVersion("1.2"):
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        elif LooseVersion(pd.__version__) >= LooseVersion("1.2"):
             self.assert_eq(
                 kcser.astype(CategoricalDtype(["b", "c", "a"])),
                 pcser.astype(CategoricalDtype(["b", "c", "a"])),
@@ -419,7 +450,10 @@ class CategoricalTest(PandasOnSparkTestCase, TestUtils):
         def astype(x) -> ps.Series[dtype]:
             return x.astype(dtype)
 
-        if LooseVersion(pd.__version__) >= LooseVersion("1.2"):
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        elif LooseVersion(pd.__version__) >= LooseVersion("1.2"):
             self.assert_eq(
                 psdf.groupby("a").transform(astype).sort_values("b").reset_index(drop=True),
                 pdf.groupby("a").transform(astype).sort_values("b").reset_index(drop=True),
@@ -637,17 +671,29 @@ class CategoricalTest(PandasOnSparkTestCase, TestUtils):
 
         pser.cat.rename_categories({"a": "A", "c": "C"}, inplace=True)
         psser.cat.rename_categories({"a": "A", "c": "C"}, inplace=True)
-        self.assert_eq(pser, psser)
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        else:
+            self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         pser.cat.rename_categories(lambda x: x.upper(), inplace=True)
         psser.cat.rename_categories(lambda x: x.upper(), inplace=True)
-        self.assert_eq(pser, psser)
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        else:
+            self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         pser.cat.rename_categories([0, 1, 3, 2], inplace=True)
         psser.cat.rename_categories([0, 1, 3, 2], inplace=True)
-        self.assert_eq(pser, psser)
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        else:
+            self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         self.assertRaisesRegex(
@@ -717,12 +763,20 @@ class CategoricalTest(PandasOnSparkTestCase, TestUtils):
             pser.cat.set_categories(["a", "c", "b", "o"], inplace=True, rename=True),
             psser.cat.set_categories(["a", "c", "b", "o"], inplace=True, rename=True),
         )
-        self.assert_eq(pser, psser)
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        else:
+            self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         pser.cat.set_categories([2, 3, 1, 0], inplace=True, rename=False),
         psser.cat.set_categories([2, 3, 1, 0], inplace=True, rename=False),
-        self.assert_eq(pser, psser)
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        else:
+            self.assert_eq(pser, psser)
         self.assert_eq(pdf, psdf)
 
         self.assertRaisesRegex(
diff --git a/python/pyspark/pandas/tests/test_expanding.py b/python/pyspark/pandas/tests/test_expanding.py
index 57b4e48..2cd5e52 100644
--- a/python/pyspark/pandas/tests/test_expanding.py
+++ b/python/pyspark/pandas/tests/test_expanding.py
@@ -145,18 +145,24 @@ class ExpandingTest(PandasOnSparkTestCase, TestUtils):
 
         pdf = pd.DataFrame({"a": [1.0, 2.0, 3.0, 2.0], "b": [4.0, 2.0, 3.0, 1.0]})
         psdf = ps.from_pandas(pdf)
-        self.assert_eq(
-            getattr(psdf.groupby(psdf.a).expanding(2), f)().sort_index(),
-            getattr(pdf.groupby(pdf.a).expanding(2), f)().sort_index(),
-        )
-        self.assert_eq(
-            getattr(psdf.groupby(psdf.a).expanding(2), f)().sum(),
-            getattr(pdf.groupby(pdf.a).expanding(2), f)().sum(),
-        )
-        self.assert_eq(
-            getattr(psdf.groupby(psdf.a + 1).expanding(2), f)().sort_index(),
-            getattr(pdf.groupby(pdf.a + 1).expanding(2), f)().sort_index(),
-        )
+
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        else:
+            self.assert_eq(
+                getattr(psdf.groupby(psdf.a).expanding(2), f)().sort_index(),
+                getattr(pdf.groupby(pdf.a).expanding(2), f)().sort_index(),
+            )
+            self.assert_eq(
+                getattr(psdf.groupby(psdf.a).expanding(2), f)().sum(),
+                getattr(pdf.groupby(pdf.a).expanding(2), f)().sum(),
+            )
+            self.assert_eq(
+                getattr(psdf.groupby(psdf.a + 1).expanding(2), f)().sort_index(),
+                getattr(pdf.groupby(pdf.a + 1).expanding(2), f)().sort_index(),
+            )
+
         self.assert_eq(
             getattr(psdf.b.groupby(psdf.a).expanding(2), f)().sort_index(),
             getattr(pdf.b.groupby(pdf.a).expanding(2), f)().sort_index(),
@@ -174,15 +180,20 @@ class ExpandingTest(PandasOnSparkTestCase, TestUtils):
         columns = pd.MultiIndex.from_tuples([("a", "x"), ("a", "y")])
         pdf.columns = columns
         psdf.columns = columns
-        self.assert_eq(
-            getattr(psdf.groupby(("a", "x")).expanding(2), f)().sort_index(),
-            getattr(pdf.groupby(("a", "x")).expanding(2), f)().sort_index(),
-        )
 
-        self.assert_eq(
-            getattr(psdf.groupby([("a", "x"), ("a", "y")]).expanding(2), f)().sort_index(),
-            getattr(pdf.groupby([("a", "x"), ("a", "y")]).expanding(2), f)().sort_index(),
-        )
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        else:
+            self.assert_eq(
+                getattr(psdf.groupby(("a", "x")).expanding(2), f)().sort_index(),
+                getattr(pdf.groupby(("a", "x")).expanding(2), f)().sort_index(),
+            )
+
+            self.assert_eq(
+                getattr(psdf.groupby([("a", "x"), ("a", "y")]).expanding(2), f)().sort_index(),
+                getattr(pdf.groupby([("a", "x"), ("a", "y")]).expanding(2), f)().sort_index(),
+            )
 
     def test_groupby_expanding_count(self):
         # The behaviour of ExpandingGroupby.count are different between pandas>=1.0.0 and lower,
diff --git a/python/pyspark/pandas/tests/test_ops_on_diff_frames_groupby_expanding.py b/python/pyspark/pandas/tests/test_ops_on_diff_frames_groupby_expanding.py
index c6a2852..223adea 100644
--- a/python/pyspark/pandas/tests/test_ops_on_diff_frames_groupby_expanding.py
+++ b/python/pyspark/pandas/tests/test_ops_on_diff_frames_groupby_expanding.py
@@ -52,10 +52,15 @@ class OpsOnDiffFramesGroupByExpandingTest(PandasOnSparkTestCase, TestUtils):
         psdf = ps.from_pandas(pdf)
         kkey = ps.from_pandas(pkey)
 
-        self.assert_eq(
-            getattr(psdf.groupby(kkey).expanding(2), f)().sort_index(),
-            getattr(pdf.groupby(pkey).expanding(2), f)().sort_index(),
-        )
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        else:
+            self.assert_eq(
+                getattr(psdf.groupby(kkey).expanding(2), f)().sort_index(),
+                getattr(pdf.groupby(pkey).expanding(2), f)().sort_index(),
+            )
+
         self.assert_eq(
             getattr(psdf.groupby(kkey)["b"].expanding(2), f)().sort_index(),
             getattr(pdf.groupby(pkey)["b"].expanding(2), f)().sort_index(),
diff --git a/python/pyspark/pandas/tests/test_ops_on_diff_frames_groupby_rolling.py b/python/pyspark/pandas/tests/test_ops_on_diff_frames_groupby_rolling.py
index 306a081..4f97769 100644
--- a/python/pyspark/pandas/tests/test_ops_on_diff_frames_groupby_rolling.py
+++ b/python/pyspark/pandas/tests/test_ops_on_diff_frames_groupby_rolling.py
@@ -14,6 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
+from distutils.version import LooseVersion
 
 import pandas as pd
 
@@ -49,10 +50,15 @@ class OpsOnDiffFramesGroupByRollingTest(PandasOnSparkTestCase, TestUtils):
         psdf = ps.from_pandas(pdf)
         kkey = ps.from_pandas(pkey)
 
-        self.assert_eq(
-            getattr(psdf.groupby(kkey).rolling(2), f)().sort_index(),
-            getattr(pdf.groupby(pkey).rolling(2), f)().sort_index(),
-        )
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        else:
+            self.assert_eq(
+                getattr(psdf.groupby(kkey).rolling(2), f)().sort_index(),
+                getattr(pdf.groupby(pkey).rolling(2), f)().sort_index(),
+            )
+
         self.assert_eq(
             getattr(psdf.groupby(kkey)["b"].rolling(2), f)().sort_index(),
             getattr(pdf.groupby(pkey)["b"].rolling(2), f)().sort_index(),
diff --git a/python/pyspark/pandas/tests/test_rolling.py b/python/pyspark/pandas/tests/test_rolling.py
index 92373d2..7409d69 100644
--- a/python/pyspark/pandas/tests/test_rolling.py
+++ b/python/pyspark/pandas/tests/test_rolling.py
@@ -14,6 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 #
+from distutils.version import LooseVersion
 
 import numpy as np
 import pandas as pd
@@ -110,18 +111,24 @@ class RollingTest(PandasOnSparkTestCase, TestUtils):
 
         pdf = pd.DataFrame({"a": [1.0, 2.0, 3.0, 2.0], "b": [4.0, 2.0, 3.0, 1.0]})
         psdf = ps.from_pandas(pdf)
-        self.assert_eq(
-            getattr(psdf.groupby(psdf.a).rolling(2), f)().sort_index(),
-            getattr(pdf.groupby(pdf.a).rolling(2), f)().sort_index(),
-        )
-        self.assert_eq(
-            getattr(psdf.groupby(psdf.a).rolling(2), f)().sum(),
-            getattr(pdf.groupby(pdf.a).rolling(2), f)().sum(),
-        )
-        self.assert_eq(
-            getattr(psdf.groupby(psdf.a + 1).rolling(2), f)().sort_index(),
-            getattr(pdf.groupby(pdf.a + 1).rolling(2), f)().sort_index(),
-        )
+
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        else:
+            self.assert_eq(
+                getattr(psdf.groupby(psdf.a).rolling(2), f)().sort_index(),
+                getattr(pdf.groupby(pdf.a).rolling(2), f)().sort_index(),
+            )
+            self.assert_eq(
+                getattr(psdf.groupby(psdf.a).rolling(2), f)().sum(),
+                getattr(pdf.groupby(pdf.a).rolling(2), f)().sum(),
+            )
+            self.assert_eq(
+                getattr(psdf.groupby(psdf.a + 1).rolling(2), f)().sort_index(),
+                getattr(pdf.groupby(pdf.a + 1).rolling(2), f)().sort_index(),
+            )
+
         self.assert_eq(
             getattr(psdf.b.groupby(psdf.a).rolling(2), f)().sort_index(),
             getattr(pdf.b.groupby(pdf.a).rolling(2), f)().sort_index(),
@@ -139,15 +146,20 @@ class RollingTest(PandasOnSparkTestCase, TestUtils):
         columns = pd.MultiIndex.from_tuples([("a", "x"), ("a", "y")])
         pdf.columns = columns
         psdf.columns = columns
-        self.assert_eq(
-            getattr(psdf.groupby(("a", "x")).rolling(2), f)().sort_index(),
-            getattr(pdf.groupby(("a", "x")).rolling(2), f)().sort_index(),
-        )
 
-        self.assert_eq(
-            getattr(psdf.groupby([("a", "x"), ("a", "y")]).rolling(2), f)().sort_index(),
-            getattr(pdf.groupby([("a", "x"), ("a", "y")]).rolling(2), f)().sort_index(),
-        )
+        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+            pass
+        else:
+            self.assert_eq(
+                getattr(psdf.groupby(("a", "x")).rolling(2), f)().sort_index(),
+                getattr(pdf.groupby(("a", "x")).rolling(2), f)().sort_index(),
+            )
+
+            self.assert_eq(
+                getattr(psdf.groupby([("a", "x"), ("a", "y")]).rolling(2), f)().sort_index(),
+                getattr(pdf.groupby([("a", "x"), ("a", "y")]).rolling(2), f)().sort_index(),
+            )
 
     def test_groupby_rolling_count(self):
         self._test_groupby_rolling_func("count")
diff --git a/python/pyspark/pandas/tests/test_series.py b/python/pyspark/pandas/tests/test_series.py
index b42d3cd..d9ba3c76 100644
--- a/python/pyspark/pandas/tests/test_series.py
+++ b/python/pyspark/pandas/tests/test_series.py
@@ -1556,12 +1556,16 @@ class SeriesTest(PandasOnSparkTestCase, SQLTestUtils):
         if extension_object_dtypes_available:
             from pandas import StringDtype
 
-            self._check_extension(
-                psser.astype("M").astype("string"), pser.astype("M").astype("string")
-            )
-            self._check_extension(
-                psser.astype("M").astype(StringDtype()), pser.astype("M").astype(StringDtype())
-            )
+            if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
+                # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
+                pass
+            else:
+                self._check_extension(
+                    psser.astype("M").astype("string"), pser.astype("M").astype("string")
+                )
+                self._check_extension(
+                    psser.astype("M").astype(StringDtype()), pser.astype("M").astype(StringDtype())
+                )
 
         with self.assertRaisesRegex(TypeError, "not understood"):
             psser.astype("int63")

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org

[spark] 02/06: [SPARK-36369][PYTHON] Fix Index.union to follow pandas 1.3

Posted by gu...@apache.org.

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

commit f2f09e4cdba815a3eeffdf0cdbf94dc9aa3c5634
Author: itholic <ha...@databricks.com>
AuthorDate: Mon Aug 9 11:10:01 2021 +0900

    [SPARK-36369][PYTHON] Fix Index.union to follow pandas 1.3
    
    This PR proposes fixing the `Index.union` to follow the behavior of pandas 1.3.
    
    Before:
    ```python
    >>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
    >>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
    >>> ps_idx1.union(ps_idx2)
    Int64Index([1, 1, 1, 1, 1, 2, 2], dtype='int64')
    ```
    
    After:
    ```python
    >>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
    >>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
    >>> ps_idx1.union(ps_idx2)
    Int64Index([1, 1, 1, 1, 1, 2, 2, 2, 2, 2], dtype='int64')
    ```
    
    This bug is fixed in https://github.com/pandas-dev/pandas/issues/36289.
    
    We should follow the behavior of pandas as much as possible.
    
    Yes, the result for some cases have duplicates values will change.
    
    Unit test.
    
    Closes #33634 from itholic/SPARK-36369.
    
    Authored-by: itholic <ha...@databricks.com>
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
    (cherry picked from commit a9f371c2470ce28251012dea7428ff9be80bf3e5)
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
 python/pyspark/pandas/indexes/base.py            |   4 +-
 python/pyspark/pandas/tests/indexes/test_base.py | 130 ++++++++++++-----------
 2 files changed, 68 insertions(+), 66 deletions(-)

diff --git a/python/pyspark/pandas/indexes/base.py b/python/pyspark/pandas/indexes/base.py
index a43a5d1..9d0d75a 100644
--- a/python/pyspark/pandas/indexes/base.py
+++ b/python/pyspark/pandas/indexes/base.py
@@ -2336,9 +2336,7 @@ class Index(IndexOpsMixin):
 
         sdf_self = self._internal.spark_frame.select(self._internal.index_spark_columns)
         sdf_other = other_idx._internal.spark_frame.select(other_idx._internal.index_spark_columns)
-        sdf = sdf_self.union(sdf_other.subtract(sdf_self))
-        if isinstance(self, MultiIndex):
-            sdf = sdf.drop_duplicates()
+        sdf = sdf_self.unionAll(sdf_other).exceptAll(sdf_self.intersectAll(sdf_other))
         if sort:
             sdf = sdf.sort(*self._internal.index_spark_column_names)
 
diff --git a/python/pyspark/pandas/tests/indexes/test_base.py b/python/pyspark/pandas/tests/indexes/test_base.py
index 39e22bd..605e3f8 100644
--- a/python/pyspark/pandas/tests/indexes/test_base.py
+++ b/python/pyspark/pandas/tests/indexes/test_base.py
@@ -1487,21 +1487,20 @@ class IndexesTest(PandasOnSparkTestCase, TestUtils):
                 almost=True,
             )
 
-            if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-                # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-                pass
-            else:
-                self.assert_eq(psidx2.union(psidx1), pidx2.union(pidx1))
-                self.assert_eq(
-                    psidx2.union([1, 2, 3, 4, 3, 4, 3, 4]),
-                    pidx2.union([1, 2, 3, 4, 3, 4, 3, 4]),
-                    almost=True,
-                )
-                self.assert_eq(
-                    psidx2.union(ps.Series([1, 2, 3, 4, 3, 4, 3, 4])),
-                    pidx2.union(pd.Series([1, 2, 3, 4, 3, 4, 3, 4])),
-                    almost=True,
-                )
+            # Manually create the expected result here since there is a bug in Index.union
+            # dropping duplicated values in pandas < 1.3.
+            expected = pd.Index([1, 2, 3, 3, 3, 4, 4, 4, 5, 6])
+            self.assert_eq(psidx2.union(psidx1), expected)
+            self.assert_eq(
+                psidx2.union([1, 2, 3, 4, 3, 4, 3, 4]),
+                expected,
+                almost=True,
+            )
+            self.assert_eq(
+                psidx2.union(ps.Series([1, 2, 3, 4, 3, 4, 3, 4])),
+                expected,
+                almost=True,
+            )
 
         # MultiIndex
         pmidx1 = pd.MultiIndex.from_tuples([("x", "a"), ("x", "b"), ("x", "a"), ("x", "b")])
@@ -1513,80 +1512,85 @@ class IndexesTest(PandasOnSparkTestCase, TestUtils):
         psmidx3 = ps.from_pandas(pmidx3)
         psmidx4 = ps.from_pandas(pmidx4)
 
-        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
-        else:
-            self.assert_eq(psmidx1.union(psmidx2), pmidx1.union(pmidx2))
-            self.assert_eq(psmidx2.union(psmidx1), pmidx2.union(pmidx1))
-            self.assert_eq(psmidx3.union(psmidx4), pmidx3.union(pmidx4))
-            self.assert_eq(psmidx4.union(psmidx3), pmidx4.union(pmidx3))
-            self.assert_eq(
-                psmidx1.union([("x", "a"), ("x", "b"), ("x", "c"), ("x", "d")]),
-                pmidx1.union([("x", "a"), ("x", "b"), ("x", "c"), ("x", "d")]),
-            )
-            self.assert_eq(
-                psmidx2.union([("x", "a"), ("x", "b"), ("x", "a"), ("x", "b")]),
-                pmidx2.union([("x", "a"), ("x", "b"), ("x", "a"), ("x", "b")]),
-            )
-            self.assert_eq(
-                psmidx3.union([(1, 3), (1, 4), (1, 5), (1, 6)]),
-                pmidx3.union([(1, 3), (1, 4), (1, 5), (1, 6)]),
-            )
-            self.assert_eq(
-                psmidx4.union([(1, 1), (1, 2), (1, 3), (1, 4), (1, 3), (1, 4)]),
-                pmidx4.union([(1, 1), (1, 2), (1, 3), (1, 4), (1, 3), (1, 4)]),
-            )
+        # Manually create the expected result here since there is a bug in MultiIndex.union
+        # dropping duplicated values in pandas < 1.3.
+        expected = pd.MultiIndex.from_tuples(
+            [("x", "a"), ("x", "a"), ("x", "b"), ("x", "b"), ("x", "c"), ("x", "d")]
+        )
+        self.assert_eq(psmidx1.union(psmidx2), expected)
+        self.assert_eq(psmidx2.union(psmidx1), expected)
+        self.assert_eq(
+            psmidx1.union([("x", "a"), ("x", "b"), ("x", "c"), ("x", "d")]),
+            expected,
+        )
+        self.assert_eq(
+            psmidx2.union([("x", "a"), ("x", "b"), ("x", "a"), ("x", "b")]),
+            expected,
+        )
+
+        expected = pd.MultiIndex.from_tuples(
+            [(1, 1), (1, 2), (1, 3), (1, 3), (1, 4), (1, 4), (1, 5), (1, 6)]
+        )
+        self.assert_eq(psmidx3.union(psmidx4), expected)
+        self.assert_eq(psmidx4.union(psmidx3), expected)
+        self.assert_eq(
+            psmidx3.union([(1, 3), (1, 4), (1, 5), (1, 6)]),
+            expected,
+        )
+        self.assert_eq(
+            psmidx4.union([(1, 1), (1, 2), (1, 3), (1, 4), (1, 3), (1, 4)]),
+            expected,
+        )
 
-        if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
         # Testing if the result is correct after sort=False.
         # The `sort` argument is added in pandas 0.24.
-        elif LooseVersion(pd.__version__) >= LooseVersion("0.24"):
+        if LooseVersion(pd.__version__) >= LooseVersion("0.24"):
+            # Manually create the expected result here since there is a bug in MultiIndex.union
+            # dropping duplicated values in pandas < 1.3.
+            expected = pd.MultiIndex.from_tuples(
+                [("x", "a"), ("x", "a"), ("x", "b"), ("x", "b"), ("x", "c"), ("x", "d")]
+            )
             self.assert_eq(
                 psmidx1.union(psmidx2, sort=False).sort_values(),
-                pmidx1.union(pmidx2, sort=False).sort_values(),
+                expected,
             )
             self.assert_eq(
                 psmidx2.union(psmidx1, sort=False).sort_values(),
-                pmidx2.union(pmidx1, sort=False).sort_values(),
-            )
-            self.assert_eq(
-                psmidx3.union(psmidx4, sort=False).sort_values(),
-                pmidx3.union(pmidx4, sort=False).sort_values(),
-            )
-            self.assert_eq(
-                psmidx4.union(psmidx3, sort=False).sort_values(),
-                pmidx4.union(pmidx3, sort=False).sort_values(),
+                expected,
             )
             self.assert_eq(
                 psmidx1.union(
                     [("x", "a"), ("x", "b"), ("x", "c"), ("x", "d")], sort=False
                 ).sort_values(),
-                pmidx1.union(
-                    [("x", "a"), ("x", "b"), ("x", "c"), ("x", "d")], sort=False
-                ).sort_values(),
+                expected,
             )
             self.assert_eq(
                 psmidx2.union(
                     [("x", "a"), ("x", "b"), ("x", "a"), ("x", "b")], sort=False
                 ).sort_values(),
-                pmidx2.union(
-                    [("x", "a"), ("x", "b"), ("x", "a"), ("x", "b")], sort=False
-                ).sort_values(),
+                expected,
+            )
+
+            expected = pd.MultiIndex.from_tuples(
+                [(1, 1), (1, 2), (1, 3), (1, 3), (1, 4), (1, 4), (1, 5), (1, 6)]
+            )
+            self.assert_eq(
+                psmidx3.union(psmidx4, sort=False).sort_values(),
+                expected,
+            )
+            self.assert_eq(
+                psmidx4.union(psmidx3, sort=False).sort_values(),
+                expected,
             )
             self.assert_eq(
                 psmidx3.union([(1, 3), (1, 4), (1, 5), (1, 6)], sort=False).sort_values(),
-                pmidx3.union([(1, 3), (1, 4), (1, 5), (1, 6)], sort=False).sort_values(),
+                expected,
             )
             self.assert_eq(
                 psmidx4.union(
                     [(1, 1), (1, 2), (1, 3), (1, 4), (1, 3), (1, 4)], sort=False
                 ).sort_values(),
-                pmidx4.union(
-                    [(1, 1), (1, 2), (1, 3), (1, 4), (1, 3), (1, 4)], sort=False
-                ).sort_values(),
+                expected,
             )
 
         self.assertRaises(NotImplementedError, lambda: psidx1.union(psmidx1))

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org

[spark] 03/06: [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3

Posted by gu...@apache.org.

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

commit 0fc8c393b4d96a2724aafff0aaf45dd90936541a
Author: itholic <ha...@databricks.com>
AuthorDate: Tue Aug 10 10:12:52 2021 +0900

    [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
    
    This PR proposes to fix `RollingGroupBy` and `ExpandingGroupBy` to follow latest pandas behavior.
    
    `RollingGroupBy` and `ExpandingGroupBy` no longer returns grouped-by column in values from pandas 1.3.
    
    Before:
    ```python
    >>> df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
    >>> df.groupby("A").rolling(2).sum()
           A    B
    A
    1 0  NaN  NaN
      1  2.0  1.0
    2 2  NaN  NaN
    3 3  NaN  NaN
    ```
    
    After:
    ```python
    >>> df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
    >>> df.groupby("A").rolling(2).sum()
           B
    A
    1 0  NaN
      1  1.0
    2 2  NaN
    3 3  NaN
    ```
    
    We should follow the behavior of pandas as much as possible.
    
    Yes, the result of `RollingGroupBy` and `ExpandingGroupBy` is changed as described above.
    
    Unit tests.
    
    Closes #33646 from itholic/SPARK-36388.
    
    Authored-by: itholic <ha...@databricks.com>
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
    (cherry picked from commit b8508f48760023d656aab86860b1ce3f1e769b8f)
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
 python/pyspark/pandas/groupby.py              |  28 +--
 python/pyspark/pandas/tests/test_expanding.py |  35 +++-
 python/pyspark/pandas/tests/test_rolling.py   |  35 +++-
 python/pyspark/pandas/window.py               | 249 +++++++++++++-------------
 4 files changed, 199 insertions(+), 148 deletions(-)

diff --git a/python/pyspark/pandas/groupby.py b/python/pyspark/pandas/groupby.py
index faa1de6..c732dff 100644
--- a/python/pyspark/pandas/groupby.py
+++ b/python/pyspark/pandas/groupby.py
@@ -125,7 +125,7 @@ class GroupBy(Generic[FrameLike], metaclass=ABCMeta):
         groupkeys: List[Series],
         as_index: bool,
         dropna: bool,
-        column_labels_to_exlcude: Set[Label],
+        column_labels_to_exclude: Set[Label],
         agg_columns_selected: bool,
         agg_columns: List[Series],
     ):
@@ -133,7 +133,7 @@ class GroupBy(Generic[FrameLike], metaclass=ABCMeta):
         self._groupkeys = groupkeys
         self._as_index = as_index
         self._dropna = dropna
-        self._column_labels_to_exlcude = column_labels_to_exlcude
+        self._column_labels_to_exclude = column_labels_to_exclude
         self._agg_columns_selected = agg_columns_selected
         self._agg_columns = agg_columns
 
@@ -1175,7 +1175,7 @@ class GroupBy(Generic[FrameLike], metaclass=ABCMeta):
             agg_columns = [
                 psdf._psser_for(label)
                 for label in psdf._internal.column_labels
-                if label not in self._column_labels_to_exlcude
+                if label not in self._column_labels_to_exclude
             ]
 
         psdf, groupkey_labels, groupkey_names = GroupBy._prepare_group_map_apply(
@@ -1372,7 +1372,7 @@ class GroupBy(Generic[FrameLike], metaclass=ABCMeta):
             agg_columns = [
                 psdf._psser_for(label)
                 for label in psdf._internal.column_labels
-                if label not in self._column_labels_to_exlcude
+                if label not in self._column_labels_to_exclude
             ]
 
         data_schema = (
@@ -1890,7 +1890,7 @@ class GroupBy(Generic[FrameLike], metaclass=ABCMeta):
             agg_columns = [
                 psdf._psser_for(label)
                 for label in psdf._internal.column_labels
-                if label not in self._column_labels_to_exlcude
+                if label not in self._column_labels_to_exclude
             ]
 
         psdf, groupkey_labels, _ = GroupBy._prepare_group_map_apply(
@@ -2708,17 +2708,17 @@ class DataFrameGroupBy(GroupBy[DataFrame]):
             (
                 psdf,
                 new_by_series,
-                column_labels_to_exlcude,
+                column_labels_to_exclude,
             ) = GroupBy._resolve_grouping_from_diff_dataframes(psdf, by)
         else:
             new_by_series = GroupBy._resolve_grouping(psdf, by)
-            column_labels_to_exlcude = set()
+            column_labels_to_exclude = set()
         return DataFrameGroupBy(
             psdf,
             new_by_series,
             as_index=as_index,
             dropna=dropna,
-            column_labels_to_exlcude=column_labels_to_exlcude,
+            column_labels_to_exclude=column_labels_to_exclude,
         )
 
     def __init__(
@@ -2727,20 +2727,20 @@ class DataFrameGroupBy(GroupBy[DataFrame]):
         by: List[Series],
         as_index: bool,
         dropna: bool,
-        column_labels_to_exlcude: Set[Label],
+        column_labels_to_exclude: Set[Label],
         agg_columns: List[Label] = None,
     ):
         agg_columns_selected = agg_columns is not None
         if agg_columns_selected:
             for label in agg_columns:
-                if label in column_labels_to_exlcude:
+                if label in column_labels_to_exclude:
                     raise KeyError(label)
         else:
             agg_columns = [
                 label
                 for label in psdf._internal.column_labels
                 if not any(label == key._column_label and key._psdf is psdf for key in by)
-                and label not in column_labels_to_exlcude
+                and label not in column_labels_to_exclude
             ]
 
         super().__init__(
@@ -2748,7 +2748,7 @@ class DataFrameGroupBy(GroupBy[DataFrame]):
             groupkeys=by,
             as_index=as_index,
             dropna=dropna,
-            column_labels_to_exlcude=column_labels_to_exlcude,
+            column_labels_to_exclude=column_labels_to_exclude,
             agg_columns_selected=agg_columns_selected,
             agg_columns=[psdf[label] for label in agg_columns],
         )
@@ -2788,7 +2788,7 @@ class DataFrameGroupBy(GroupBy[DataFrame]):
                 self._groupkeys,
                 as_index=self._as_index,
                 dropna=self._dropna,
-                column_labels_to_exlcude=self._column_labels_to_exlcude,
+                column_labels_to_exclude=self._column_labels_to_exclude,
                 agg_columns=item,
             )
 
@@ -2932,7 +2932,7 @@ class SeriesGroupBy(GroupBy[Series]):
             groupkeys=by,
             as_index=True,
             dropna=dropna,
-            column_labels_to_exlcude=set(),
+            column_labels_to_exclude=set(),
             agg_columns_selected=True,
             agg_columns=[psser],
         )
diff --git a/python/pyspark/pandas/tests/test_expanding.py b/python/pyspark/pandas/tests/test_expanding.py
index 2cd5e52..d52ccba 100644
--- a/python/pyspark/pandas/tests/test_expanding.py
+++ b/python/pyspark/pandas/tests/test_expanding.py
@@ -146,10 +146,8 @@ class ExpandingTest(PandasOnSparkTestCase, TestUtils):
         pdf = pd.DataFrame({"a": [1.0, 2.0, 3.0, 2.0], "b": [4.0, 2.0, 3.0, 1.0]})
         psdf = ps.from_pandas(pdf)
 
+        # The behavior of GroupBy.expanding is changed from pandas 1.3.
         if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
-        else:
             self.assert_eq(
                 getattr(psdf.groupby(psdf.a).expanding(2), f)().sort_index(),
                 getattr(pdf.groupby(pdf.a).expanding(2), f)().sort_index(),
@@ -162,6 +160,19 @@ class ExpandingTest(PandasOnSparkTestCase, TestUtils):
                 getattr(psdf.groupby(psdf.a + 1).expanding(2), f)().sort_index(),
                 getattr(pdf.groupby(pdf.a + 1).expanding(2), f)().sort_index(),
             )
+        else:
+            self.assert_eq(
+                getattr(psdf.groupby(psdf.a).expanding(2), f)().sort_index(),
+                getattr(pdf.groupby(pdf.a).expanding(2), f)().drop("a", axis=1).sort_index(),
+            )
+            self.assert_eq(
+                getattr(psdf.groupby(psdf.a).expanding(2), f)().sum(),
+                getattr(pdf.groupby(pdf.a).expanding(2), f)().sum().drop("a"),
+            )
+            self.assert_eq(
+                getattr(psdf.groupby(psdf.a + 1).expanding(2), f)().sort_index(),
+                getattr(pdf.groupby(pdf.a + 1).expanding(2), f)().drop("a", axis=1).sort_index(),
+            )
 
         self.assert_eq(
             getattr(psdf.b.groupby(psdf.a).expanding(2), f)().sort_index(),
@@ -181,10 +192,8 @@ class ExpandingTest(PandasOnSparkTestCase, TestUtils):
         pdf.columns = columns
         psdf.columns = columns
 
+        # The behavior of GroupBy.expanding is changed from pandas 1.3.
         if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
-        else:
             self.assert_eq(
                 getattr(psdf.groupby(("a", "x")).expanding(2), f)().sort_index(),
                 getattr(pdf.groupby(("a", "x")).expanding(2), f)().sort_index(),
@@ -194,6 +203,20 @@ class ExpandingTest(PandasOnSparkTestCase, TestUtils):
                 getattr(psdf.groupby([("a", "x"), ("a", "y")]).expanding(2), f)().sort_index(),
                 getattr(pdf.groupby([("a", "x"), ("a", "y")]).expanding(2), f)().sort_index(),
             )
+        else:
+            self.assert_eq(
+                getattr(psdf.groupby(("a", "x")).expanding(2), f)().sort_index(),
+                getattr(pdf.groupby(("a", "x")).expanding(2), f)()
+                .drop(("a", "x"), axis=1)
+                .sort_index(),
+            )
+
+            self.assert_eq(
+                getattr(psdf.groupby([("a", "x"), ("a", "y")]).expanding(2), f)().sort_index(),
+                getattr(pdf.groupby([("a", "x"), ("a", "y")]).expanding(2), f)()
+                .drop([("a", "x"), ("a", "y")], axis=1)
+                .sort_index(),
+            )
 
     def test_groupby_expanding_count(self):
         # The behaviour of ExpandingGroupby.count are different between pandas>=1.0.0 and lower,
diff --git a/python/pyspark/pandas/tests/test_rolling.py b/python/pyspark/pandas/tests/test_rolling.py
index 7409d69..3c9563c 100644
--- a/python/pyspark/pandas/tests/test_rolling.py
+++ b/python/pyspark/pandas/tests/test_rolling.py
@@ -112,10 +112,8 @@ class RollingTest(PandasOnSparkTestCase, TestUtils):
         pdf = pd.DataFrame({"a": [1.0, 2.0, 3.0, 2.0], "b": [4.0, 2.0, 3.0, 1.0]})
         psdf = ps.from_pandas(pdf)
 
+        # The behavior of GroupBy.rolling is changed from pandas 1.3.
         if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
-        else:
             self.assert_eq(
                 getattr(psdf.groupby(psdf.a).rolling(2), f)().sort_index(),
                 getattr(pdf.groupby(pdf.a).rolling(2), f)().sort_index(),
@@ -128,6 +126,19 @@ class RollingTest(PandasOnSparkTestCase, TestUtils):
                 getattr(psdf.groupby(psdf.a + 1).rolling(2), f)().sort_index(),
                 getattr(pdf.groupby(pdf.a + 1).rolling(2), f)().sort_index(),
             )
+        else:
+            self.assert_eq(
+                getattr(psdf.groupby(psdf.a).rolling(2), f)().sort_index(),
+                getattr(pdf.groupby(pdf.a).rolling(2), f)().drop("a", axis=1).sort_index(),
+            )
+            self.assert_eq(
+                getattr(psdf.groupby(psdf.a).rolling(2), f)().sum(),
+                getattr(pdf.groupby(pdf.a).rolling(2), f)().sum().drop("a"),
+            )
+            self.assert_eq(
+                getattr(psdf.groupby(psdf.a + 1).rolling(2), f)().sort_index(),
+                getattr(pdf.groupby(pdf.a + 1).rolling(2), f)().drop("a", axis=1).sort_index(),
+            )
 
         self.assert_eq(
             getattr(psdf.b.groupby(psdf.a).rolling(2), f)().sort_index(),
@@ -147,10 +158,8 @@ class RollingTest(PandasOnSparkTestCase, TestUtils):
         pdf.columns = columns
         psdf.columns = columns
 
+        # The behavior of GroupBy.rolling is changed from pandas 1.3.
         if LooseVersion(pd.__version__) >= LooseVersion("1.3"):
-            # TODO(SPARK-36367): Fix the behavior to follow pandas >= 1.3
-            pass
-        else:
             self.assert_eq(
                 getattr(psdf.groupby(("a", "x")).rolling(2), f)().sort_index(),
                 getattr(pdf.groupby(("a", "x")).rolling(2), f)().sort_index(),
@@ -160,6 +169,20 @@ class RollingTest(PandasOnSparkTestCase, TestUtils):
                 getattr(psdf.groupby([("a", "x"), ("a", "y")]).rolling(2), f)().sort_index(),
                 getattr(pdf.groupby([("a", "x"), ("a", "y")]).rolling(2), f)().sort_index(),
             )
+        else:
+            self.assert_eq(
+                getattr(psdf.groupby(("a", "x")).rolling(2), f)().sort_index(),
+                getattr(pdf.groupby(("a", "x")).rolling(2), f)()
+                .drop(("a", "x"), axis=1)
+                .sort_index(),
+            )
+
+            self.assert_eq(
+                getattr(psdf.groupby([("a", "x"), ("a", "y")]).rolling(2), f)().sort_index(),
+                getattr(pdf.groupby([("a", "x"), ("a", "y")]).rolling(2), f)()
+                .drop([("a", "x"), ("a", "y")], axis=1)
+                .sort_index(),
+            )
 
     def test_groupby_rolling_count(self):
         self._test_groupby_rolling_func("count")
diff --git a/python/pyspark/pandas/window.py b/python/pyspark/pandas/window.py
index 0d656c2..68d87fb 100644
--- a/python/pyspark/pandas/window.py
+++ b/python/pyspark/pandas/window.py
@@ -36,7 +36,7 @@ from pyspark.pandas.missing.window import (
 # For running doctests and reference resolution in PyCharm.
 from pyspark import pandas as ps  # noqa: F401
 from pyspark.pandas._typing import FrameLike
-from pyspark.pandas.groupby import GroupBy
+from pyspark.pandas.groupby import GroupBy, DataFrameGroupBy
 from pyspark.pandas.internal import NATURAL_ORDER_COLUMN_NAME, SPARK_INDEX_NAME_FORMAT
 from pyspark.pandas.spark import functions as SF
 from pyspark.pandas.utils import scol_for
@@ -706,10 +706,15 @@ class RollingGroupby(RollingLike[FrameLike]):
         if groupby._agg_columns_selected:
             agg_columns = groupby._agg_columns
         else:
+            # pandas doesn't keep the groupkey as a column from 1.3 for DataFrameGroupBy
+            column_labels_to_exclude = groupby._column_labels_to_exclude.copy()
+            if isinstance(groupby, DataFrameGroupBy):
+                for groupkey in groupby._groupkeys:  # type: ignore
+                    column_labels_to_exclude.add(groupkey._internal.column_labels[0])
             agg_columns = [
                 psdf._psser_for(label)
                 for label in psdf._internal.column_labels
-                if label not in groupby._column_labels_to_exlcude
+                if label not in column_labels_to_exclude
             ]
 
         applied = []
@@ -777,19 +782,19 @@ class RollingGroupby(RollingLike[FrameLike]):
 
         >>> df = ps.DataFrame({"A": s.to_numpy(), "B": s.to_numpy() ** 2})
         >>> df.groupby(df.A).rolling(2).count().sort_index()  # doctest: +NORMALIZE_WHITESPACE
-                A    B
+                B
         A
-        2 0   1.0  1.0
-          1   2.0  2.0
-        3 2   1.0  1.0
-          3   2.0  2.0
-          4   2.0  2.0
-        4 5   1.0  1.0
-          6   2.0  2.0
-          7   2.0  2.0
-          8   2.0  2.0
-        5 9   1.0  1.0
-          10  2.0  2.0
+        2 0   1.0
+          1   2.0
+        3 2   1.0
+          3   2.0
+          4   2.0
+        4 5   1.0
+          6   2.0
+          7   2.0
+          8   2.0
+        5 9   1.0
+          10  2.0
         """
         return super().count()
 
@@ -831,19 +836,19 @@ class RollingGroupby(RollingLike[FrameLike]):
 
         >>> df = ps.DataFrame({"A": s.to_numpy(), "B": s.to_numpy() ** 2})
         >>> df.groupby(df.A).rolling(2).sum().sort_index()  # doctest: +NORMALIZE_WHITESPACE
-                 A     B
+                 B
         A
-        2 0    NaN   NaN
-          1    4.0   8.0
-        3 2    NaN   NaN
-          3    6.0  18.0
-          4    6.0  18.0
-        4 5    NaN   NaN
-          6    8.0  32.0
-          7    8.0  32.0
-          8    8.0  32.0
-        5 9    NaN   NaN
-          10  10.0  50.0
+        2 0    NaN
+          1    8.0
+        3 2    NaN
+          3   18.0
+          4   18.0
+        4 5    NaN
+          6   32.0
+          7   32.0
+          8   32.0
+        5 9    NaN
+          10  50.0
         """
         return super().sum()
 
@@ -885,19 +890,19 @@ class RollingGroupby(RollingLike[FrameLike]):
 
         >>> df = ps.DataFrame({"A": s.to_numpy(), "B": s.to_numpy() ** 2})
         >>> df.groupby(df.A).rolling(2).min().sort_index()  # doctest: +NORMALIZE_WHITESPACE
-                A     B
+                 B
         A
-        2 0   NaN   NaN
-          1   2.0   4.0
-        3 2   NaN   NaN
-          3   3.0   9.0
-          4   3.0   9.0
-        4 5   NaN   NaN
-          6   4.0  16.0
-          7   4.0  16.0
-          8   4.0  16.0
-        5 9   NaN   NaN
-          10  5.0  25.0
+        2 0    NaN
+          1    4.0
+        3 2    NaN
+          3    9.0
+          4    9.0
+        4 5    NaN
+          6   16.0
+          7   16.0
+          8   16.0
+        5 9    NaN
+          10  25.0
         """
         return super().min()
 
@@ -939,19 +944,19 @@ class RollingGroupby(RollingLike[FrameLike]):
 
         >>> df = ps.DataFrame({"A": s.to_numpy(), "B": s.to_numpy() ** 2})
         >>> df.groupby(df.A).rolling(2).max().sort_index()  # doctest: +NORMALIZE_WHITESPACE
-                A     B
+                 B
         A
-        2 0   NaN   NaN
-          1   2.0   4.0
-        3 2   NaN   NaN
-          3   3.0   9.0
-          4   3.0   9.0
-        4 5   NaN   NaN
-          6   4.0  16.0
-          7   4.0  16.0
-          8   4.0  16.0
-        5 9   NaN   NaN
-          10  5.0  25.0
+        2 0    NaN
+          1    4.0
+        3 2    NaN
+          3    9.0
+          4    9.0
+        4 5    NaN
+          6   16.0
+          7   16.0
+          8   16.0
+        5 9    NaN
+          10  25.0
         """
         return super().max()
 
@@ -993,19 +998,19 @@ class RollingGroupby(RollingLike[FrameLike]):
 
         >>> df = ps.DataFrame({"A": s.to_numpy(), "B": s.to_numpy() ** 2})
         >>> df.groupby(df.A).rolling(2).mean().sort_index()  # doctest: +NORMALIZE_WHITESPACE
-                A     B
+                 B
         A
-        2 0   NaN   NaN
-          1   2.0   4.0
-        3 2   NaN   NaN
-          3   3.0   9.0
-          4   3.0   9.0
-        4 5   NaN   NaN
-          6   4.0  16.0
-          7   4.0  16.0
-          8   4.0  16.0
-        5 9   NaN   NaN
-          10  5.0  25.0
+        2 0    NaN
+          1    4.0
+        3 2    NaN
+          3    9.0
+          4    9.0
+        4 5    NaN
+          6   16.0
+          7   16.0
+          8   16.0
+        5 9    NaN
+          10  25.0
         """
         return super().mean()
 
@@ -1478,19 +1483,19 @@ class ExpandingGroupby(ExpandingLike[FrameLike]):
 
         >>> df = ps.DataFrame({"A": s.to_numpy(), "B": s.to_numpy() ** 2})
         >>> df.groupby(df.A).expanding(2).count().sort_index()  # doctest: +NORMALIZE_WHITESPACE
-                A    B
+                B
         A
-        2 0   NaN  NaN
-          1   2.0  2.0
-        3 2   NaN  NaN
-          3   2.0  2.0
-          4   3.0  3.0
-        4 5   NaN  NaN
-          6   2.0  2.0
-          7   3.0  3.0
-          8   4.0  4.0
-        5 9   NaN  NaN
-          10  2.0  2.0
+        2 0   NaN
+          1   2.0
+        3 2   NaN
+          3   2.0
+          4   3.0
+        4 5   NaN
+          6   2.0
+          7   3.0
+          8   4.0
+        5 9   NaN
+          10  2.0
         """
         return super().count()
 
@@ -1532,19 +1537,19 @@ class ExpandingGroupby(ExpandingLike[FrameLike]):
 
         >>> df = ps.DataFrame({"A": s.to_numpy(), "B": s.to_numpy() ** 2})
         >>> df.groupby(df.A).expanding(2).sum().sort_index()  # doctest: +NORMALIZE_WHITESPACE
-                 A     B
+                 B
         A
-        2 0    NaN   NaN
-          1    4.0   8.0
-        3 2    NaN   NaN
-          3    6.0  18.0
-          4    9.0  27.0
-        4 5    NaN   NaN
-          6    8.0  32.0
-          7   12.0  48.0
-          8   16.0  64.0
-        5 9    NaN   NaN
-          10  10.0  50.0
+        2 0    NaN
+          1    8.0
+        3 2    NaN
+          3   18.0
+          4   27.0
+        4 5    NaN
+          6   32.0
+          7   48.0
+          8   64.0
+        5 9    NaN
+          10  50.0
         """
         return super().sum()
 
@@ -1586,19 +1591,19 @@ class ExpandingGroupby(ExpandingLike[FrameLike]):
 
         >>> df = ps.DataFrame({"A": s.to_numpy(), "B": s.to_numpy() ** 2})
         >>> df.groupby(df.A).expanding(2).min().sort_index()  # doctest: +NORMALIZE_WHITESPACE
-                A     B
+                 B
         A
-        2 0   NaN   NaN
-          1   2.0   4.0
-        3 2   NaN   NaN
-          3   3.0   9.0
-          4   3.0   9.0
-        4 5   NaN   NaN
-          6   4.0  16.0
-          7   4.0  16.0
-          8   4.0  16.0
-        5 9   NaN   NaN
-          10  5.0  25.0
+        2 0    NaN
+          1    4.0
+        3 2    NaN
+          3    9.0
+          4    9.0
+        4 5    NaN
+          6   16.0
+          7   16.0
+          8   16.0
+        5 9    NaN
+          10  25.0
         """
         return super().min()
 
@@ -1639,19 +1644,19 @@ class ExpandingGroupby(ExpandingLike[FrameLike]):
 
         >>> df = ps.DataFrame({"A": s.to_numpy(), "B": s.to_numpy() ** 2})
         >>> df.groupby(df.A).expanding(2).max().sort_index()  # doctest: +NORMALIZE_WHITESPACE
-                A     B
+                 B
         A
-        2 0   NaN   NaN
-          1   2.0   4.0
-        3 2   NaN   NaN
-          3   3.0   9.0
-          4   3.0   9.0
-        4 5   NaN   NaN
-          6   4.0  16.0
-          7   4.0  16.0
-          8   4.0  16.0
-        5 9   NaN   NaN
-          10  5.0  25.0
+        2 0    NaN
+          1    4.0
+        3 2    NaN
+          3    9.0
+          4    9.0
+        4 5    NaN
+          6   16.0
+          7   16.0
+          8   16.0
+        5 9    NaN
+          10  25.0
         """
         return super().max()
 
@@ -1693,19 +1698,19 @@ class ExpandingGroupby(ExpandingLike[FrameLike]):
 
         >>> df = ps.DataFrame({"A": s.to_numpy(), "B": s.to_numpy() ** 2})
         >>> df.groupby(df.A).expanding(2).mean().sort_index()  # doctest: +NORMALIZE_WHITESPACE
-                A     B
+                 B
         A
-        2 0   NaN   NaN
-          1   2.0   4.0
-        3 2   NaN   NaN
-          3   3.0   9.0
-          4   3.0   9.0
-        4 5   NaN   NaN
-          6   4.0  16.0
-          7   4.0  16.0
-          8   4.0  16.0
-        5 9   NaN   NaN
-          10  5.0  25.0
+        2 0    NaN
+          1    4.0
+        3 2    NaN
+          3    9.0
+          4    9.0
+        4 5    NaN
+          6   16.0
+          7   16.0
+          8   16.0
+        5 9    NaN
+          10  25.0
         """
         return super().mean()
 

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org