Posted to commits@spark.apache.org by ru...@apache.org on 2022/09/05 06:57:05 UTC
[spark] branch master updated: [SPARK-40265][PS] Fix the inconsistent behavior for Index.intersection
This is an automated email from the ASF dual-hosted git repository.
ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 8482ec9e5d8 [SPARK-40265][PS] Fix the inconsistent behavior for Index.intersection
8482ec9e5d8 is described below
commit 8482ec9e5d832f89fa55d29cdde0f8005a062f17
Author: itholic <ha...@databricks.com>
AuthorDate: Mon Sep 5 14:56:37 2022 +0800
[SPARK-40265][PS] Fix the inconsistent behavior for Index.intersection
### What changes were proposed in this pull request?
This PR fixes the inconsistent behavior of `Index.intersection`, as shown below:
When `other` is a list of tuples, the behavior of pandas API on Spark differs from pandas.
- pandas API on Spark
```python
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
MultiIndex([], )
```
- pandas
```python
>>> pidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> pidx.intersection([(1, 2), (3, 4)]).sort_values()
Traceback (most recent call last):
...
ValueError: Names should be list-like for a MultiIndex
```
### Why are the changes needed?
To reach parity with pandas.
### Does this PR introduce _any_ user-facing change?
Yes, the behavior of `Index.intersection` is changed when `other` is a list of tuples:
- Before
```python
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
MultiIndex([], )
```
- After
```python
>>> psidx
Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
>>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
Traceback (most recent call last):
...
ValueError: Names should be list-like for a MultiIndex
```
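The guard behind this change can be approximated in plain Python. The sketch below is illustrative only and is not the pyspark implementation: `looks_like_multiindex` and `intersection_sketch` are hypothetical names, and the tuple check is an assumption that stands in for `isinstance(Index(other), MultiIndex)`, based on pandas building a `MultiIndex` from a non-empty list whose elements are all tuples.

```python
from typing import Any, Iterable, List


def looks_like_multiindex(items: List[Any]) -> bool:
    # Hypothetical stand-in for `isinstance(Index(other), MultiIndex)`:
    # assumes a non-empty list made up only of tuples becomes a MultiIndex.
    return len(items) > 0 and all(isinstance(item, tuple) for item in items)


def intersection_sketch(values: List[int], other: Iterable[Any]) -> List[int]:
    # Hypothetical model of the list-like branch of `Index.intersection`.
    items = list(other)
    if looks_like_multiindex(items):
        # Behavior after this change: raise instead of returning an empty MultiIndex.
        raise ValueError("Names should be list-like for a MultiIndex")
    other_set = set(items)
    # Sorted here only to make the result deterministic for the sketch.
    return sorted(v for v in values if v in other_set)


if __name__ == "__main__":
    print(intersection_sketch([1, 2, 3, 4], [2, 4, 6]))
    try:
        intersection_sketch([1, 2, 3, 4], [(1, 2), (3, 4)])
    except ValueError as exc:
        print(f"ValueError: {exc}")
```

Code that previously relied on getting an empty `MultiIndex` back for a list of tuples would now need to catch `ValueError`, or avoid passing tuples to a single-level index.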
### How was this patch tested?
Added a unit test.
Closes #37739 from itholic/SPARK-40265.
Authored-by: itholic <ha...@databricks.com>
Signed-off-by: Ruifeng Zheng <ru...@apache.org>
---
python/pyspark/pandas/indexes/base.py | 2 +-
python/pyspark/pandas/tests/indexes/test_base.py | 3 +++
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/python/pyspark/pandas/indexes/base.py b/python/pyspark/pandas/indexes/base.py
index facedb1dc91..5043325ccbb 100644
--- a/python/pyspark/pandas/indexes/base.py
+++ b/python/pyspark/pandas/indexes/base.py
@@ -2509,7 +2509,7 @@ class Index(IndexOpsMixin):
         elif is_list_like(other):
             other_idx = Index(other)
             if isinstance(other_idx, MultiIndex):
-                return other_idx.to_frame().head(0).index
+                raise ValueError("Names should be list-like for a MultiIndex")
             spark_frame_other = other_idx.to_frame()._to_spark()
             keep_name = True
         else:
diff --git a/python/pyspark/pandas/tests/indexes/test_base.py b/python/pyspark/pandas/tests/indexes/test_base.py
index 958314c5741..169a22571ec 100644
--- a/python/pyspark/pandas/tests/indexes/test_base.py
+++ b/python/pyspark/pandas/tests/indexes/test_base.py
@@ -1977,6 +1977,9 @@ class IndexesTest(ComparisonTestBase, TestUtils):
             psidx.intersection(ps.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}))
         with self.assertRaisesRegex(ValueError, "Index data must be 1-dimensional"):
             psmidx.intersection(ps.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}))
+        # other = list of tuple
+        with self.assertRaisesRegex(ValueError, "Names should be list-like for a MultiIndex"):
+            psidx.intersection([(1, 2), (3, 4)])
     def test_item(self):
         pidx = pd.Index([10])