Posted to commits@spark.apache.org by ru...@apache.org on 2022/09/05 06:57:05 UTC

[spark] branch master updated: [SPARK-40265][PS] Fix the inconsistent behavior for Index.intersection

This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 8482ec9e5d8 [SPARK-40265][PS] Fix the inconsistent behavior for Index.intersection
8482ec9e5d8 is described below

commit 8482ec9e5d832f89fa55d29cdde0f8005a062f17
Author: itholic <ha...@databricks.com>
AuthorDate: Mon Sep 5 14:56:37 2022 +0800

    [SPARK-40265][PS] Fix the inconsistent behavior for Index.intersection
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to fix the inconsistent behavior of the `Index.intersection` function, as shown below:
    
    When `other` is a list of tuples, the behavior of pandas API on Spark differs from that of pandas.
    
    - pandas API on Spark
    ```python
    >>> psidx
    Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
    >>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
    MultiIndex([], )
    ```
    
    - pandas
    ```python
    >>> pidx
    Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
    >>> pidx.intersection([(1, 2), (3, 4)]).sort_values()
    Traceback (most recent call last):
    ...
    ValueError: Names should be list-like for a MultiIndex
    ```
    
    ### Why are the changes needed?
    
    To reach parity with pandas.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, the behavior of `Index.intersection` is changed when `other` is a list of tuples:
    
    - Before
    ```python
    >>> psidx
    Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
    >>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
    MultiIndex([], )
    ```
    
    - After
    ```python
    >>> psidx
    Int64Index([1, 2, 3, 4], dtype='int64', name='Koalas')
    >>> psidx.intersection([(1, 2), (3, 4)]).sort_values()
    Traceback (most recent call last):
    ...
    ValueError: Names should be list-like for a MultiIndex
    ```
    
    ### How was this patch tested?
    
    Added a unit test.
    
    Closes #37739 from itholic/SPARK-40265.
    
    Authored-by: itholic <ha...@databricks.com>
    Signed-off-by: Ruifeng Zheng <ru...@apache.org>
---
 python/pyspark/pandas/indexes/base.py            | 2 +-
 python/pyspark/pandas/tests/indexes/test_base.py | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/indexes/base.py b/python/pyspark/pandas/indexes/base.py
index facedb1dc91..5043325ccbb 100644
--- a/python/pyspark/pandas/indexes/base.py
+++ b/python/pyspark/pandas/indexes/base.py
@@ -2509,7 +2509,7 @@ class Index(IndexOpsMixin):
         elif is_list_like(other):
             other_idx = Index(other)
             if isinstance(other_idx, MultiIndex):
-                return other_idx.to_frame().head(0).index
+                raise ValueError("Names should be list-like for a MultiIndex")
             spark_frame_other = other_idx.to_frame()._to_spark()
             keep_name = True
         else:
diff --git a/python/pyspark/pandas/tests/indexes/test_base.py b/python/pyspark/pandas/tests/indexes/test_base.py
index 958314c5741..169a22571ec 100644
--- a/python/pyspark/pandas/tests/indexes/test_base.py
+++ b/python/pyspark/pandas/tests/indexes/test_base.py
@@ -1977,6 +1977,9 @@ class IndexesTest(ComparisonTestBase, TestUtils):
             psidx.intersection(ps.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}))
         with self.assertRaisesRegex(ValueError, "Index data must be 1-dimensional"):
             psmidx.intersection(ps.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}))
+        # other = list of tuple
+        with self.assertRaisesRegex(ValueError, "Names should be list-like for a MultiIndex"):
+            psidx.intersection([(1, 2), (3, 4)])
 
     def test_item(self):
         pidx = pd.Index([10])

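The control flow of the patched branch can be sketched without Spark or pandas. This is only an illustration: `reject_tuple_list` is a hypothetical helper, not part of pyspark, and its tuple check is an assumed stand-in for the real code, which builds `Index(other)` and tests `isinstance(other_idx, MultiIndex)`.

```python
from typing import Any, Iterable, List


def reject_tuple_list(other: Iterable[Any]) -> List[Any]:
    """Hypothetical stand-in for the patched list-like branch."""
    items = list(other)
    # Assumption: a non-empty list whose elements are all tuples is the
    # input that Index(other) would turn into a MultiIndex.
    if items and all(isinstance(item, tuple) for item in items):
        # Same message as the one raised in the patch.
        raise ValueError("Names should be list-like for a MultiIndex")
    return items


try:
    reject_tuple_list([(1, 2), (3, 4)])
except ValueError as exc:
    print(exc)
```

A flat list such as `[1, 2, 3]` passes through unchanged, which corresponds to the path where the intersection goes on to be computed.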
