Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/09/29 02:56:32 UTC

[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels

itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718114641



##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
 
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same.
+
+        Notes
+        -----
+        This API can be expensive since it has logic to sort and compare the values
+        of all levels of the indices that belong to the MultiIndex.
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        for nlevel in range(nlevels):

Review comment:
       Yeah, I think it might be possible to reduce the work done per iteration, but at least one Spark job is required to compare each level of the index. Let me address it.

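For reference, the semantics under discussion can be sketched in pure Python. This is a hypothetical illustration only: the real pyspark.pandas implementation compares levels with Spark jobs, while this sketch operates on plain lists of tuples and exists just to show that `equal_levels` compares the set of distinct values at each level, ignoring row order.

```python
# Hypothetical pure-Python sketch of the level-comparison semantics of
# MultiIndex.equal_levels. The actual implementation under review runs
# at least one Spark job per level; this just models the logic locally.

def equal_levels(tuples1, tuples2):
    """Return True if both tuple lists have the same distinct values per level."""
    nlevels1 = len(tuples1[0]) if tuples1 else 0
    nlevels2 = len(tuples2[0]) if tuples2 else 0
    if nlevels1 != nlevels2:
        return False
    for level in range(nlevels1):
        # Order within a level does not matter, only the set of values.
        if {t[level] for t in tuples1} != {t[level] for t in tuples2}:
            return False
    return True

# Mirrors the docstring examples: reordered rows still share levels ...
print(equal_levels([("a", "x"), ("b", "y"), ("c", "z")],
                   [("b", "y"), ("a", "x"), ("c", "z")]))  # True
# ... but a differing value in any level breaks equality.
print(equal_levels([("a", "x"), ("b", "y"), ("c", "z")],
                   [("a", "x"), ("b", "y"), ("c", "j")]))  # False
```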




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


