You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/09/27 07:48:23 UTC

[GitHub] [spark] itholic opened a new pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

itholic opened a new pull request #34113:
URL: https://github.com/apache/spark/pull/34113


   ### What changes were proposed in this pull request?
   This PR proposes implementing `MultiIndex.equal_levels`.
   
   ```python
   >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
   >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
   >>> psmidx1.equal_levels(psmidx2)
   True
   
   >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z"), ("a", "y")])
   >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "y"), ("b", "x"), ("c", "z"), ("c", "x")])
   >>> psmidx1.equal_levels(psmidx2)
   True
   ```
   
   This was originally proposed in https://github.com/databricks/koalas/pull/1789, and all reviews in origin PR has been resolved.
   
   ### Why are the changes needed?
   
   We should support the pandas API as much as possible for pandas-on-Spark module.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, the `MultiIndex.equal_levels` API is available.
   
   
   ### How was this patch tested?
   
   Unittests
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718151080



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       I think you should use `exceptAll` to preserve same values.

##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       This looks like we'll compare each value of multiindex individually. e.g.)
   
   ```python
   ps.MultiIndex.from_tuples([("a", "x"), ("b", "y")])
   ps.MultiIndex.from_tuples([("b", "x"), ("a", "y")])
   ```
   
   will be conisdered as same?

##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(
+                other_sdf.select(other_index_scol)
+            )
+            subtract_list.append(self_subtract_other)
+
+        unioned_subtracts = reduce(lambda x, y: x.union(y), subtract_list)
+        if len(unioned_subtracts.head(1)) == 0:
+            return True
+        else:
+            return False

Review comment:
       ```suggestion
           return len(unioned_subtracts.head(1)) == 0
   ```

##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       👌 

##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       👌 

##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       Can you elabourate the equality condition condition on this API? Seems like we can just leverage distinct values to compare?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929831534






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718114641



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Notes
+        -----
+        This API can be expensive since it has logic to sort and compare the values of
+        all levels of indices that belong to MultiIndex.
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        for nlevel in range(nlevels):

Review comment:
       Yeah, it might be possible to reduce the Spark job for each iteration, but I think at least one Spark job is still required for comparing each level of index values. Let me address it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718164009



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       I think we don't have to preserve the same values.
   
   It only compares unique value of each levels.
   
   For example,
   
   ```python
   >>> pmidx1 = pd.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("a", "y"), ("b", "x")])
   >>> pmidx2 = pd.MultiIndex.from_tuples([("b", "x"), ("a", "y")])
   >>> pmidx1.equal_levels(pmidx2)
   True
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r719124229



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       Yeah, the equality condition of this API is,
   
   **Is all of the unique(distinct) values of each level are the same ?**
   
   So I think applying the `subtract` for each level of index column is satisfy the equality condition.
   
   Because if there is at least one different unique value, it will not be filtered out when subtracting.
   
   I think it will be more expensive if we compare after applying `distinct` to each level since it requires sorting.

##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       Yeah, the equality condition of this API is,
   
   **Is all of the unique(distinct) values of each level are the same ?**
   
   So I think applying the `subtract` for each level of index column is satisfy the equality condition.
   
   Because if there is at least one different unique value, it will not be filtered out when subtracting.
   
   I think maybe it will be more expensive if we compare after applying `distinct` to each level since it requires sorting ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929851118


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48218/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929848648


   **[Test build #143706 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143706/testReport)** for PR 34113 at commit [`f19e365`](https://github.com/apache/spark/commit/f19e365c451bc921533e18fe82bfd5dc27fdcebe).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r719124229



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       Yeah, the equality condition of this API is,
   
   **Is all of the unique values of each level are the same ?**
   
   So I think applying the `subtract` for each level of index column is satisfy the equality condition.
   
   Because if there is at least one different unique value, it will not be filtered out when subtracting.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r719124229



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       Yeah, the equality condition of this API is,
   
   **Is all of the unique(distinct) values of each level are the same ?**
   
   So I think applying the `subtract` for each level of index column is satisfy the equality condition.
   
   Because if there is at least one different unique value, it will not be filtered out when subtracting.
   
   I think maybe it will be more expensive if we compare after applying `distinct` to each level since it requires sorting ?
   
   I'm afraid it will be more expensive because we have to sort before comparing, if we want to apply `distinct` to each level.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718164009



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       I think we don't need to preserve the same values.
   
   It only compares unique value of each levels.
   
   For example,
   
   ```python
   >>> pmidx1 = pd.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("a", "y"), ("b", "x")])
   >>> pmidx2 = pd.MultiIndex.from_tuples([("b", "x"), ("a", "y")])
   >>> pmidx1.equal_levels(pmidx2)
   True
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929848648


   **[Test build #143706 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143706/testReport)** for PR 34113 at commit [`f19e365`](https://github.com/apache/spark/commit/f19e365c451bc921533e18fe82bfd5dc27fdcebe).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929842512


   **[Test build #143702 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143702/testReport)** for PR 34113 at commit [`6306fb1`](https://github.com/apache/spark/commit/6306fb1c66233e7ec159def64eee0d51cdee026d).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929843984


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143702/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r719939464



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       👌 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718114641



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Notes
+        -----
+        This API can be expensive since it has logic to sort and compare the values of
+        all levels of indices that belong to MultiIndex.
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        for nlevel in range(nlevels):

Review comment:
       Yeah, I think it might be possible to reduce the job for each iteration, but at least one Spark job is required for comparing each level of index.

##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Notes
+        -----
+        This API can be expensive since it has logic to sort and compare the values of
+        all levels of indices that belong to MultiIndex.
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        for nlevel in range(nlevels):

Review comment:
       Yeah, I think it might be possible to reduce the job for each iteration, but at least one Spark job is required for comparing each level of index. Let me address it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-931873684


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48290/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929872763


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48218/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718114641



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Notes
+        -----
+        This API can be expensive since it has logic to sort and compare the values of
+        all levels of indices that belong to MultiIndex.
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        for nlevel in range(nlevels):

Review comment:
       Yeah, I think it might be possible to reduce the job for each iteration, but at least one Spark job is required for comparing each level of index.

##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Notes
+        -----
+        This API can be expensive since it has logic to sort and compare the values of
+        all levels of indices that belong to MultiIndex.
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        for nlevel in range(nlevels):

Review comment:
       Yeah, I think it might be possible to reduce the job for each iteration, but at least one Spark job is required for comparing each level of index. Let me address it.

##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Notes
+        -----
+        This API can be expensive since it has logic to sort and compare the values of
+        all levels of indices that belong to MultiIndex.
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        for nlevel in range(nlevels):

Review comment:
       Yeah, I think it might be possible to reduce the Spark job for each iteration, but at least one Spark job is required for comparing each level of index. Let me address it.

##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Notes
+        -----
+        This API can be expensive since it has logic to sort and compare the values of
+        all levels of indices that belong to MultiIndex.
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        for nlevel in range(nlevels):

Review comment:
       Yeah, I think it might be possible to reduce the Spark job for each iteration, but still at least one Spark job is required for comparing each level of index. Let me address it.

##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Notes
+        -----
+        This API can be expensive since it has logic to sort and compare the values of
+        all levels of indices that belong to MultiIndex.
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        for nlevel in range(nlevels):

Review comment:
       Yeah, I think it might be possible to reduce the Spark job for each iteration, but still at least one Spark job is required for comparing each level of index values. Let me address it.

##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Notes
+        -----
+        This API can be expensive since it has logic to sort and compare the values of
+        all levels of indices that belong to MultiIndex.
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        for nlevel in range(nlevels):

Review comment:
       Yeah, it might be possible to reduce the Spark job for each iteration, but I think at least one Spark job is still required for comparing each level of index values. Let me address it.

##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Notes
+        -----
+        This API can be expensive since it has logic to sort and compare the values of
+        all levels of indices that belong to MultiIndex.
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        for nlevel in range(nlevels):

Review comment:
       Yeah, it might be possible to reduce the Spark job for each iteration, but I think at least one Spark job is still required for comparing each level of index values. Let me try to address it.

##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Notes
+        -----
+        This API can be expensive since it has logic to sort and compare the values of
+        all levels of indices that belong to MultiIndex.
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        for nlevel in range(nlevels):

Review comment:
       Yeah, it might be possible to reduce the Spark job for each iteration, but I think at least one Spark job is still required for comparing each level of index values. Let me try to address it.

##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       Correct, and 
   
   ```python
   ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("a", "y"), ("b", "x")])
   ps.MultiIndex.from_tuples([("b", "x"), ("a", "y")])
   ```
   
   is also considered as same.

##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       Correct. 
   
   And 
   
   ```python
   ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("a", "y"), ("b", "x")])
   ps.MultiIndex.from_tuples([("b", "x"), ("a", "y")])
   ```
   
   is also considered as same.

##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       I think we don't have to preserve the same values.
   
   It only compares unique value of each levels.
   
   For example,
   
   ```python
   >>> pmidx1 = pd.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("a", "y"), ("b", "x")])
   >>> pmidx2 = pd.MultiIndex.from_tuples([("b", "x"), ("a", "y")])
   >>> pmidx1.equal_levels(pmidx2)
   True
   ```

##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       I think we don't need to preserve the same values.
   
   It only compares unique value of each levels.
   
   For example,
   
   ```python
   >>> pmidx1 = pd.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("a", "y"), ("b", "x")])
   >>> pmidx2 = pd.MultiIndex.from_tuples([("b", "x"), ("a", "y")])
   >>> pmidx1.equal_levels(pmidx2)
   True
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929831534


   **[Test build #143702 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143702/testReport)** for PR 34113 at commit [`6306fb1`](https://github.com/apache/spark/commit/6306fb1c66233e7ec159def64eee0d51cdee026d).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-927680025


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48154/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-927726536


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48154/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929906029


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48220/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929843984






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718114641



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Notes
+        -----
+        This API can be expensive since it has logic to sort and compare the values of
+        all levels of indices that belong to MultiIndex.
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        for nlevel in range(nlevels):

Review comment:
       Yeah, it might be possible to reduce the Spark job for each iteration, but I think at least one Spark job is still required for comparing each level of index values. Let me try to address it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-931860083


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48290/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-931844920


   **[Test build #143778 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143778/testReport)** for PR 34113 at commit [`188f9e7`](https://github.com/apache/spark/commit/188f9e7a53ec616e588c8aaf18d097ee4f667273).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-927648356


   **[Test build #143642 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143642/testReport)** for PR 34113 at commit [`0b79e4a`](https://github.com/apache/spark/commit/0b79e4aef9b8832accebd18d4d4a8135a6721fec).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `public class NettyLogger `
     * `class IndexNameTypeHolder(object):`
     * `        new_class = type(\"NameType\", (NameTypeHolder,), `
     * `        new_class = param.type if isinstance(param, np.dtype) else param`
     * `public final class AlwaysFalse extends Filter `
     * `public final class AlwaysTrue extends Filter `
     * `public final class And extends BinaryFilter `
     * `abstract class BinaryComparison extends Filter `
     * `abstract class BinaryFilter extends Filter `
     * `public final class EqualNullSafe extends BinaryComparison `
     * `public final class EqualTo extends BinaryComparison `
     * `public abstract class Filter implements Expression, Serializable `
     * `public final class GreaterThan extends BinaryComparison `
     * `public final class GreaterThanOrEqual extends BinaryComparison `
     * `public final class In extends Filter `
     * `public final class IsNotNull extends Filter `
     * `public final class IsNull extends Filter `
     * `public final class LessThan extends BinaryComparison `
     * `public final class LessThanOrEqual extends BinaryComparison `
     * `public final class Not extends Filter `
     * `public final class Or extends BinaryFilter `
     * `public final class StringContains extends StringPredicate `
     * `public final class StringEndsWith extends StringPredicate `
     * `abstract class StringPredicate extends Filter `
     * `public final class StringStartsWith extends StringPredicate `
     * `public class ColumnarBatch implements AutoCloseable `
     * `case class Sec(child: Expression)`
     * `case class Csc(child: Expression)`
     * `trait OperationHelper extends AliasHelper with PredicateHelper `
     * `class SQLOpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](`
     * `case class OptimizeSkewedJoin(`
     * `case class SkewJoinChildWrapper(plan: SparkPlan) extends LeafExecNode `
     * `case class SimpleCostEvaluator(forceOptimizeSkewedJoin: Boolean) extends CostEvaluator `
     * `case class WriterBucketSpec(`
     * `case class EnsureRequirements(`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-927726536


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48154/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929859580


   **[Test build #143706 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143706/testReport)** for PR 34113 at commit [`f19e365`](https://github.com/apache/spark/commit/f19e365c451bc921533e18fe82bfd5dc27fdcebe).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929845045


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143704/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718163223



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       Correct, and 
   
   ```python
   ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("a", "y"), ("b", "x")])
   ps.MultiIndex.from_tuples([("b", "x"), ("a", "y")])
   ```
   
   is also considered as same.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r719124229



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       Yeah, the equality condition of this API is,
   
   **Is all of the unique(distinct) values of each level are the same ?**
   
   So I think applying the `subtract` for each level of index column is satisfy the equality condition.
   
   Because if there is at least one different unique value, it will not be filtered out when subtracting.
   
   I'm afraid it will be more expensive because we have to sort before comparing, if we want to apply `distinct` to each level.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r719124229



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       Yeah, the equality condition of this API is,
   
   **Is all of the unique(distinct) values of each level are the same ?**
   
   So I think applying the `subtract` for each level of index column is enough to satisfy the equality condition.
   
   Because if there is at least one different unique value, it will not be filtered out when subtracting.
   
   I'm afraid it will be more expensive because we have to sort before comparing, if we want to apply `distinct` to each level.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-931864430


   **[Test build #143778 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143778/testReport)** for PR 34113 at commit [`188f9e7`](https://github.com/apache/spark/commit/188f9e7a53ec616e588c8aaf18d097ee4f667273).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

itholic commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-927745638


   cc @ueshin @HyukjinKwon @xinrong-databricks 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718151080



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       I think you should use `exceptAll` to preserve same values.

##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       This looks like we'll compare each value of multiindex individually. e.g.)
   
   ```python
   ps.MultiIndex.from_tuples([("a", "x"), ("b", "y")])
   ps.MultiIndex.from_tuples([("b", "x"), ("a", "y")])
   ```
   
   will be conisdered as same?

##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(
+                other_sdf.select(other_index_scol)
+            )
+            subtract_list.append(self_subtract_other)
+
+        unioned_subtracts = reduce(lambda x, y: x.union(y), subtract_list)
+        if len(unioned_subtracts.head(1)) == 0:
+            return True
+        else:
+            return False

Review comment:
       ```suggestion
           return len(unioned_subtracts.head(1)) == 0
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929833756


   **[Test build #143704 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143704/testReport)** for PR 34113 at commit [`1f51541`](https://github.com/apache/spark/commit/1f5154100f37a3ecf574917bc0c76c852bd4bf23).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929833756


   **[Test build #143704 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143704/testReport)** for PR 34113 at commit [`1f51541`](https://github.com/apache/spark/commit/1f5154100f37a3ecf574917bc0c76c852bd4bf23).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r716583441



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Notes
+        -----
+        This API can be expensive since it has logic to sort and compare the values of
+        all levels of indices that belong to MultiIndex.
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        for nlevel in range(nlevels):

Review comment:
       it will also trigger the spark job for each level. can we do it in single pass?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718169889



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       👌 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929870864






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-931844920


   **[Test build #143778 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143778/testReport)** for PR 34113 at commit [`188f9e7`](https://github.com/apache/spark/commit/188f9e7a53ec616e588c8aaf18d097ee4f667273).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r719124229



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       Yeah, the equality condition of this API is,
   
   **Is all of the unique(distinct) values of each level are the same ?**
   
   So I think applying the `subtract` for each level of index column is satisfy the equality condition.
   
   Because if there is at least one different unique value, it will not be filtered out when subtracting.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718114641



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Notes
+        -----
+        This API can be expensive since it has logic to sort and compare the values of
+        all levels of indices that belong to MultiIndex.
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        for nlevel in range(nlevels):

Review comment:
       Yeah, I think it might be possible to reduce the Spark job for each iteration, but at least one Spark job is required for comparing each level of index. Let me address it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929831534






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718114641



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Notes
+        -----
+        This API can be expensive since it has logic to sort and compare the values of
+        all levels of indices that belong to MultiIndex.
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        for nlevel in range(nlevels):

Review comment:
       Yeah, it might be possible to reduce the Spark job for each iteration, but I think at least one Spark job is still required for comparing each level of index values. Let me try to address it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-927726458


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48154/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929906029


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48220/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929844811


   **[Test build #143704 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143704/testReport)** for PR 34113 at commit [`1f51541`](https://github.com/apache/spark/commit/1f5154100f37a3ecf574917bc0c76c852bd4bf23).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-931902591


   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r716581142



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Notes
+        -----
+        This API can be expensive since it has logic to sort and compare the values of
+        all levels of indices that belong to MultiIndex.
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        for nlevel in range(nlevels):

Review comment:
       can we avoid sorting for each level? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-931886653


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48290/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718114641



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Notes
+        -----
+        This API can be expensive since it has logic to sort and compare the values of
+        all levels of indices that belong to MultiIndex.
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        for nlevel in range(nlevels):

Review comment:
       Yeah, I think it might be possible to reduce the Spark job for each iteration, but still at least one Spark job is required for comparing each level of index. Let me address it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718114641



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Notes
+        -----
+        This API can be expensive since it has logic to sort and compare the values of
+        all levels of indices that belong to MultiIndex.
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        for nlevel in range(nlevels):

Review comment:
       Yeah, I think it might be possible to reduce the Spark job for each iteration, but still at least one Spark job is required for comparing each level of index values. Let me address it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718170132



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       👌 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-927648563


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143642/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-931865814


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143778/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-927638906


   **[Test build #143642 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143642/testReport)** for PR 34113 at commit [`0b79e4a`](https://github.com/apache/spark/commit/0b79e4aef9b8832accebd18d4d4a8135a6721fec).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929872730


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48218/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929871970


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48220/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718170642



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       Can you elabourate the equality condition condition on this API? Seems like we can just leverage distinct values to compare?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-927638906


   **[Test build #143642 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143642/testReport)** for PR 34113 at commit [`0b79e4a`](https://github.com/apache/spark/commit/0b79e4aef9b8832accebd18d4d4a8135a6721fec).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929843984






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon closed pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

HyukjinKwon closed pull request #34113:
URL: https://github.com/apache/spark/pull/34113


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-927648563


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143642/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-931865814


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143778/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-931886653


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48290/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718163223



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(

Review comment:
       Correct. 
   
   And 
   
   ```python
   ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("a", "y"), ("b", "x")])
   ps.MultiIndex.from_tuples([("b", "x"), ("a", "y")])
   ```
   
   is also considered as same.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929894410


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48220/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929831534


   **[Test build #143702 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143702/testReport)** for PR 34113 at commit [`6306fb1`](https://github.com/apache/spark/commit/6306fb1c66233e7ec159def64eee0d51cdee026d).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929870864


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143706/
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r716583589



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same

Review comment:
       let;s add versionadded: 3.3.0




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929843984






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org