Posted to commits@spark.apache.org by gu...@apache.org on 2022/07/19 00:34:48 UTC

[spark] branch master updated: [SPARK-39807][PYTHON][PS] Respect Series.concat sort parameter to follow 1.4.3 behavior

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new dcccbf4f9dd [SPARK-39807][PYTHON][PS] Respect Series.concat sort parameter to follow 1.4.3 behavior
dcccbf4f9dd is described below

commit dcccbf4f9ddd22dc59e6199a940625f677b23a81
Author: Yikun Jiang <yi...@gmail.com>
AuthorDate: Tue Jul 19 09:34:32 2022 +0900

    [SPARK-39807][PYTHON][PS] Respect Series.concat sort parameter to follow 1.4.3 behavior
    
    ### What changes were proposed in this pull request?
    
    Respect the Series.concat sort parameter in the `num_series == 1` case to follow pandas 1.4.3 behavior.
    
    ### Why are the changes needed?
    In https://github.com/apache/spark/pull/36711, we followed the pandas 1.4.2 behavior of respecting the Series.concat sort parameter, except in the `num_series == 1` case.
    
    [pandas 1.4.3](https://github.com/pandas-dev/pandas/releases/tag/v1.4.3) fixed https://github.com/pandas-dev/pandas/issues/47127, which also resolves the bug in the `num_series == 1` case, so this PR follows the pandas 1.4.3 behavior.
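    A minimal sketch of the behavior this aligns with, using plain pandas (>= 1.4.3); the frame and column names below are made up for illustration:

    ```python
    import pandas as pd

    # A DataFrame with deliberately unsorted columns, plus a single Series
    # among the objects passed to concat (the `num_series == 1` case).
    pdf = pd.DataFrame({"C": [1, 2], "A": [3, 4]})
    ser = pdf["A"].rename("B")

    # With sort=True, the non-concatenation (column) axis is sorted,
    # even when only one Series is involved.
    sorted_cols = list(pd.concat([pdf, ser], sort=True).columns)    # ["A", "B", "C"]
    unsorted_cols = list(pd.concat([pdf, ser], sort=False).columns)  # ["C", "A", "B"]
    ```

    After this change, `ps.concat` produces the same column order as `pd.concat` for these inputs.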
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, this case is already covered in the migration guide:
    https://github.com/apache/spark/blob/master/python/docs/source/migration_guide/pyspark_3.3_to_3.4.rst
    ```
    In Spark 3.4, the Series.concat sort parameter will be respected to follow pandas 1.4 behaviors.
    ```
    
    ### How was this patch tested?
    - CI passed
    - `test_concat_index_axis` passed with pandas 1.3.5, 1.4.2, and 1.4.3.
    
    Closes #37217 from Yikun/SPARK-39807.
    
    Authored-by: Yikun Jiang <yi...@gmail.com>
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
 python/pyspark/pandas/namespace.py            |  5 ++---
 python/pyspark/pandas/tests/test_namespace.py | 20 +++++++++++---------
 2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/python/pyspark/pandas/namespace.py b/python/pyspark/pandas/namespace.py
index 7691bf465e7..0f0dc606c52 100644
--- a/python/pyspark/pandas/namespace.py
+++ b/python/pyspark/pandas/namespace.py
@@ -2621,9 +2621,8 @@ def concat(
 
             assert len(merged_columns) > 0
 
-            # If sort is True, always sort when there are more than two Series,
-            # and if there is only one Series, never sort to follow pandas 1.4+ behavior.
-            if sort and num_series != 1:
+            # If sort is True, always sort
+            if sort:
                 # FIXME: better ordering
                 merged_columns = sorted(merged_columns, key=name_like_string)
 
diff --git a/python/pyspark/pandas/tests/test_namespace.py b/python/pyspark/pandas/tests/test_namespace.py
index 4db756c6e66..ac033f7828b 100644
--- a/python/pyspark/pandas/tests/test_namespace.py
+++ b/python/pyspark/pandas/tests/test_namespace.py
@@ -334,19 +334,21 @@ class NamespaceTest(PandasOnSparkTestCase, SQLTestUtils):
             ([psdf.reset_index(), psdf], [pdf.reset_index(), pdf]),
             ([psdf, psdf[["C", "A"]]], [pdf, pdf[["C", "A"]]]),
             ([psdf[["C", "A"]], psdf], [pdf[["C", "A"]], pdf]),
-            # only one Series
-            ([psdf, psdf["C"]], [pdf, pdf["C"]]),
-            ([psdf["C"], psdf], [pdf["C"], pdf]),
             # more than two Series
             ([psdf["C"], psdf, psdf["A"]], [pdf["C"], pdf, pdf["A"]]),
         ]
 
-        if LooseVersion(pd.__version__) >= LooseVersion("1.4"):
-            # more than two Series
-            psdfs, pdfs = ([psdf, psdf["C"], psdf["A"]], [pdf, pdf["C"], pdf["A"]])
-            for ignore_index, join, sort in itertools.product(ignore_indexes, joins, sorts):
-                # See also https://github.com/pandas-dev/pandas/issues/47127
-                if (join, sort) != ("outer", True):
+        # See also https://github.com/pandas-dev/pandas/issues/47127
+        if LooseVersion(pd.__version__) >= LooseVersion("1.4.3"):
+            series_objs = [
+                # more than two Series
+                ([psdf, psdf["C"], psdf["A"]], [pdf, pdf["C"], pdf["A"]]),
+                # only one Series
+                ([psdf, psdf["C"]], [pdf, pdf["C"]]),
+                ([psdf["C"], psdf], [pdf["C"], pdf]),
+            ]
+            for psdfs, pdfs in series_objs:
+                for ignore_index, join, sort in itertools.product(ignore_indexes, joins, sorts):
                     self.assert_eq(
                         ps.concat(psdfs, ignore_index=ignore_index, join=join, sort=sort),
                         pd.concat(pdfs, ignore_index=ignore_index, join=join, sort=sort),

