Posted to commits@spark.apache.org by gu...@apache.org on 2022/07/19 00:34:48 UTC
[spark] branch master updated: [SPARK-39807][PYTHON][PS] Respect Series.concat sort parameter to follow 1.4.3 behavior
This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new dcccbf4f9dd [SPARK-39807][PYTHON][PS] Respect Series.concat sort parameter to follow 1.4.3 behavior
dcccbf4f9dd is described below
commit dcccbf4f9ddd22dc59e6199a940625f677b23a81
Author: Yikun Jiang <yi...@gmail.com>
AuthorDate: Tue Jul 19 09:34:32 2022 +0900
[SPARK-39807][PYTHON][PS] Respect Series.concat sort parameter to follow 1.4.3 behavior
### What changes were proposed in this pull request?
Respect the Series.concat sort parameter when `num_series == 1` to follow the pandas 1.4.3 behavior.
### Why are the changes needed?
In https://github.com/apache/spark/pull/36711, we followed the pandas 1.4.2 behavior and respected the Series.concat sort parameter, except in the `num_series == 1` case.
[pandas 1.4.3](https://github.com/pandas-dev/pandas/releases/tag/v1.4.3) fixed https://github.com/pandas-dev/pandas/issues/47127, which also resolved the `num_series == 1` bug, so this PR updates pandas API on Spark to follow the pandas 1.4.3 behavior.
### Does this PR introduce _any_ user-facing change?
Yes, we already cover this case in:
https://github.com/apache/spark/blob/master/python/docs/source/migration_guide/pyspark_3.3_to_3.4.rst
```
In Spark 3.4, the Series.concat sort parameter will be respected to follow pandas 1.4 behaviors.
```
### How was this patch tested?
- CI passed
- test_concat_index_axis passed with pandas 1.3.5, 1.4.2, and 1.4.3.
Closes #37217 from Yikun/SPARK-39807.
Authored-by: Yikun Jiang <yi...@gmail.com>
Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
python/pyspark/pandas/namespace.py | 5 ++---
python/pyspark/pandas/tests/test_namespace.py | 20 +++++++++++---------
2 files changed, 13 insertions(+), 12 deletions(-)
diff --git a/python/pyspark/pandas/namespace.py b/python/pyspark/pandas/namespace.py
index 7691bf465e7..0f0dc606c52 100644
--- a/python/pyspark/pandas/namespace.py
+++ b/python/pyspark/pandas/namespace.py
@@ -2621,9 +2621,8 @@ def concat(
assert len(merged_columns) > 0
- # If sort is True, always sort when there are more than two Series,
- # and if there is only one Series, never sort to follow pandas 1.4+ behavior.
- if sort and num_series != 1:
+ # If sort is True, always sort
+ if sort:
# FIXME: better ordering
merged_columns = sorted(merged_columns, key=name_like_string)
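Stripped of the Spark internals, the simplified guard reduces to the following sketch (`order_columns` is a hypothetical stand-in for the surrounding `concat` code, and plain `str` stands in for `pyspark.pandas.utils.name_like_string`):

```python
# Sketch of the new logic: sort whenever sort=True, with no special case
# for a single Series (matching pandas 1.4.3).
def order_columns(merged_columns, sort):
    if sort:
        # name_like_string stringifies column labels (including tuples);
        # str is used here as a simple stand-in.
        merged_columns = sorted(merged_columns, key=str)
    return merged_columns

print(order_columns(["C", "A", "B"], sort=True))   # ['A', 'B', 'C']
print(order_columns(["C", "A", "B"], sort=False))  # ['C', 'A', 'B']
```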
diff --git a/python/pyspark/pandas/tests/test_namespace.py b/python/pyspark/pandas/tests/test_namespace.py
index 4db756c6e66..ac033f7828b 100644
--- a/python/pyspark/pandas/tests/test_namespace.py
+++ b/python/pyspark/pandas/tests/test_namespace.py
@@ -334,19 +334,21 @@ class NamespaceTest(PandasOnSparkTestCase, SQLTestUtils):
([psdf.reset_index(), psdf], [pdf.reset_index(), pdf]),
([psdf, psdf[["C", "A"]]], [pdf, pdf[["C", "A"]]]),
([psdf[["C", "A"]], psdf], [pdf[["C", "A"]], pdf]),
- # only one Series
- ([psdf, psdf["C"]], [pdf, pdf["C"]]),
- ([psdf["C"], psdf], [pdf["C"], pdf]),
# more than two Series
([psdf["C"], psdf, psdf["A"]], [pdf["C"], pdf, pdf["A"]]),
]
- if LooseVersion(pd.__version__) >= LooseVersion("1.4"):
- # more than two Series
- psdfs, pdfs = ([psdf, psdf["C"], psdf["A"]], [pdf, pdf["C"], pdf["A"]])
- for ignore_index, join, sort in itertools.product(ignore_indexes, joins, sorts):
- # See also https://github.com/pandas-dev/pandas/issues/47127
- if (join, sort) != ("outer", True):
+ # See also https://github.com/pandas-dev/pandas/issues/47127
+ if LooseVersion(pd.__version__) >= LooseVersion("1.4.3"):
+ series_objs = [
+ # more than two Series
+ ([psdf, psdf["C"], psdf["A"]], [pdf, pdf["C"], pdf["A"]]),
+ # only one Series
+ ([psdf, psdf["C"]], [pdf, pdf["C"]]),
+ ([psdf["C"], psdf], [pdf["C"], pdf]),
+ ]
+ for psdfs, pdfs in series_objs:
+ for ignore_index, join, sort in itertools.product(ignore_indexes, joins, sorts):
self.assert_eq(
ps.concat(psdfs, ignore_index=ignore_index, join=join, sort=sort),
pd.concat(pdfs, ignore_index=ignore_index, join=join, sort=sort),