Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/09/27 07:48:23 UTC
[GitHub] [spark] itholic opened a new pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
itholic opened a new pull request #34113:
URL: https://github.com/apache/spark/pull/34113
### What changes were proposed in this pull request?
This PR proposes implementing `MultiIndex.equal_levels`.
```python
>>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
>>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
>>> psmidx1.equal_levels(psmidx2)
True
>>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z"), ("a", "y")])
>>> psmidx2 = ps.MultiIndex.from_tuples([("a", "y"), ("b", "x"), ("c", "z"), ("c", "x")])
>>> psmidx1.equal_levels(psmidx2)
True
```
This was originally proposed in https://github.com/databricks/koalas/pull/1789, and all review comments in the original PR have been resolved.
### Why are the changes needed?
We should support the pandas API as much as possible in the pandas-on-Spark module.
### Does this PR introduce _any_ user-facing change?
Yes, the `MultiIndex.equal_levels` API is available.
### How was this patch tested?
Unit tests
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718151080
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
I think you should use `exceptAll` to preserve duplicate values.
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
This looks like we'll compare the values of each MultiIndex level individually, e.g.:
```python
ps.MultiIndex.from_tuples([("a", "x"), ("b", "y")])
ps.MultiIndex.from_tuples([("b", "x"), ("a", "y")])
```
will be considered the same?
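To make the question concrete, here is a minimal plain-Python sketch (not PySpark, and not the code in this PR) of what a per-level comparison sees for these two indexes. The helper name `levels_of` is hypothetical and only used for illustration; it assumes every tuple has the same length.

```python
def levels_of(tuples):
    """Return the sorted distinct values found at each tuple position."""
    # zip(*tuples) regroups the tuples into one column per level.
    return [sorted(set(column)) for column in zip(*tuples)]


midx1 = [("a", "x"), ("b", "y")]
midx2 = [("b", "x"), ("a", "y")]

# Per-level view: both indexes draw from the same values at each level.
same_levels = levels_of(midx1) == levels_of(midx2)

# Tuple view: the indexes do not contain the same entries.
same_tuples = set(midx1) == set(midx2)

print(same_levels, same_tuples)
```

Under a per-level reading the two indexes compare as equal even though no tuple is shared, which is the behavior being asked about.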
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
+ other_sdf.select(other_index_scol)
+ )
+ subtract_list.append(self_subtract_other)
+
+ unioned_subtracts = reduce(lambda x, y: x.union(y), subtract_list)
+ if len(unioned_subtracts.head(1)) == 0:
+ return True
+ else:
+ return False
Review comment:
```suggestion
return len(unioned_subtracts.head(1)) == 0
```
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
👌
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
Can you elaborate on the equality condition for this API? It seems like we can just compare the distinct values?
[GitHub] [spark] SparkQA removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929831534
[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718114641
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Notes
+ -----
+ This API can be expensive since it has logic to sort and compare the values of
+ all levels of indices that belong to MultiIndex.
+
+ Examples
+ --------
+ >>> from pyspark.pandas.config import set_option, reset_option
+ >>> set_option("compute.ops_on_diff_frames", True)
+
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+
+ >>> reset_option("compute.ops_on_diff_frames")
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ for nlevel in range(nlevels):
Review comment:
Yeah, it might be possible to reduce the number of Spark jobs per iteration, but I think at least one Spark job is still required to compare each level of index values. Let me address it.
[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718164009
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
I don't think we have to preserve duplicate values.
It only compares the unique values of each level.
For example,
```python
>>> pmidx1 = pd.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("a", "y"), ("b", "x")])
>>> pmidx2 = pd.MultiIndex.from_tuples([("b", "x"), ("a", "y")])
>>> pmidx1.equal_levels(pmidx2)
True
```
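As a rough illustration of the difference being discussed, Spark's `DataFrame.subtract` behaves like SQL `EXCEPT DISTINCT`, while `exceptAll` behaves like `EXCEPT ALL` and keeps duplicate rows. The sketch below models both in plain Python on the first level of the two indexes above; the function names `subtract_distinct` and `except_all` are hypothetical stand-ins, not Spark APIs.

```python
from collections import Counter


def subtract_distinct(left, right):
    """Model of EXCEPT DISTINCT: distinct values of left that are not in right."""
    return set(left) - set(right)


def except_all(left, right):
    """Model of EXCEPT ALL: multiset difference that keeps duplicate counts."""
    return list((Counter(left) - Counter(right)).elements())


# First-level values of pmidx1 and pmidx2 from the example above.
level0_self = ["a", "b", "a", "b"]
level0_other = ["b", "a"]

distinct_diff = subtract_distinct(level0_self, level0_other)
all_diff = sorted(except_all(level0_self, level0_other))

print(distinct_diff)  # nothing left: the distinct values match
print(all_diff)       # the extra "a" and "b" survive the multiset difference
```

With the multiset version the leftover duplicates would make the levels look different, which does not match the pandas result shown above.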
[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r719124229
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
Yeah, the equality condition of this API is:
**Are all of the unique (distinct) values of each level the same?**
So I think applying `subtract` to each level of the index columns satisfies the equality condition,
because if there is at least one differing unique value, it will not be filtered out by the subtraction.
I think it will be more expensive if we compare after applying `distinct` to each level, since that requires sorting.
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
Yeah, the equality condition of this API is:
**Are all of the unique (distinct) values of each level the same?**
So I think applying `subtract` to each level of the index columns satisfies the equality condition,
because if there is at least one differing unique value, it will not be filtered out by the subtraction.
I think it may be more expensive to compare after applying `distinct` to each level, since that requires sorting?
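The condition stated above can be sketched in plain Python as follows. This is only a model under stated assumptions (non-empty lists of equal-length tuples, a hypothetical function name `equal_levels_model`), not the PySpark code in this PR. It checks the difference in both directions, since a one-way difference on its own would not notice values that appear only in `other`.

```python
def equal_levels_model(self_tuples, other_tuples):
    """Hypothetical pure-Python model of the per-level equality condition."""
    nlevels = len(self_tuples[0])
    if nlevels != len(other_tuples[0]):
        return False

    for level in range(nlevels):
        self_values = {t[level] for t in self_tuples}
        other_values = {t[level] for t in other_tuples}
        # Any value left over after subtracting in either direction
        # means the distinct values of this level differ.
        if self_values - other_values or other_values - self_values:
            return False
    return True


midx1 = [("a", "x"), ("b", "y"), ("c", "z")]
midx2 = [("b", "y"), ("a", "x"), ("c", "z")]
midx3 = [("a", "x"), ("b", "y"), ("c", "j")]

print(equal_levels_model(midx1, midx2))
print(equal_levels_model(midx1, midx3))
```

The inputs are the same tuples used in the docstring examples of the diff, so the two printed results can be compared against the expected `True` and `False` there.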
[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929851118
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48218/
[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929848648
**[Test build #143706 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143706/testReport)** for PR 34113 at commit [`f19e365`](https://github.com/apache/spark/commit/f19e365c451bc921533e18fe82bfd5dc27fdcebe).
[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r719124229
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
Yeah, the equality condition of this API is:
**Are all of the unique values of each level the same?**
So I think applying `subtract` to each level of the index columns satisfies the equality condition,
because if there is at least one differing unique value, it will not be filtered out by the subtraction.
[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r719124229
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
Yeah, the equality condition of this API is:
**Are all of the unique (distinct) values of each level the same?**
So I think applying `subtract` to each level of the index columns satisfies the equality condition,
because if there is at least one differing unique value, it will not be filtered out by the subtraction.
I think it may be more expensive to compare after applying `distinct` to each level, since that requires sorting?
I'm afraid it will be more expensive, because we would have to sort before comparing if we applied `distinct` to each level.
[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718164009
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
I don't think we need to preserve duplicate values.
It only compares the unique values of each level.
For example,
```python
>>> pmidx1 = pd.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("a", "y"), ("b", "x")])
>>> pmidx2 = pd.MultiIndex.from_tuples([("b", "x"), ("a", "y")])
>>> pmidx1.equal_levels(pmidx2)
True
```
[GitHub] [spark] SparkQA removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929848648
**[Test build #143706 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143706/testReport)** for PR 34113 at commit [`f19e365`](https://github.com/apache/spark/commit/f19e365c451bc921533e18fe82bfd5dc27fdcebe).
[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929842512
**[Test build #143702 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143702/testReport)** for PR 34113 at commit [`6306fb1`](https://github.com/apache/spark/commit/6306fb1c66233e7ec159def64eee0d51cdee026d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929843984
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143702/
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r719939464
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
👌
[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718114641
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Notes
+ -----
+ This API can be expensive since it has logic to sort and compare the values of
+ all levels of indices that belong to MultiIndex.
+
+ Examples
+ --------
+ >>> from pyspark.pandas.config import set_option, reset_option
+ >>> set_option("compute.ops_on_diff_frames", True)
+
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+
+ >>> reset_option("compute.ops_on_diff_frames")
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ for nlevel in range(nlevels):
Review comment:
Yeah, I think it might be possible to reduce the number of jobs per iteration, but at least one Spark job is required to compare each index level. Let me address it.
[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-931873684
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48290/
[GitHub] [spark] AmplabJenkins commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929872763
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48218/
[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718114641
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Notes
+ -----
+ This API can be expensive since it has logic to sort and compare the values of
+ all levels of indices that belong to MultiIndex.
+
+ Examples
+ --------
+ >>> from pyspark.pandas.config import set_option, reset_option
+ >>> set_option("compute.ops_on_diff_frames", True)
+
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+
+ >>> reset_option("compute.ops_on_diff_frames")
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ for nlevel in range(nlevels):
Review comment:
Yeah, it might be possible to reduce the number of Spark jobs per iteration, but I think at least one Spark job is still required to compare the values of each index level. Let me try to address it.
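To make the job-count trade-off concrete, here is a plain-Python model, not the PR's code. The helper `distinct_per_level` is a hypothetical name; it gathers the distinct values of every level in one pass over the rows instead of one pass per level. In Spark, a single aggregation over all index columns would presumably play the same role, but that mapping is an assumption and is not taken from the patch.

```python
from typing import Hashable, Iterable, List, Set, Tuple

Row = Tuple[Hashable, ...]


def distinct_per_level(rows: Iterable[Row], nlevels: int) -> List[Set[Hashable]]:
    """Hypothetical helper: distinct values of every level, gathered in one pass."""
    seen: List[Set[Hashable]] = [set() for _ in range(nlevels)]
    for row in rows:
        # Each row contributes one value to every level's set.
        for level, value in enumerate(row):
            seen[level].add(value)
    return seen


# Index tuples taken from the docstring examples in the diff above.
left = [("a", "x"), ("b", "y"), ("c", "z")]
right = [("b", "y"), ("a", "x"), ("c", "z")]
other = [("a", "x"), ("b", "y"), ("c", "j")]

left_levels = distinct_per_level(left, 2)
match = left_levels == distinct_per_level(right, 2)
mismatch = left_levels == distinct_per_level(other, 2)
```

In this model `match` compares equal and `mismatch` does not, because the second level of `other` contains "j" instead of "z". Each input is scanned once regardless of how many levels it has.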
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
Correct. The following two indexes are also considered the same:
```python
ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("a", "y"), ("b", "x")])
ps.MultiIndex.from_tuples([("b", "x"), ("a", "y")])
```
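As a rough illustration of why indexes of different lengths can still compare as the same here, the sketch below models the two-way `subtract` check in plain Python. PySpark's `DataFrame.subtract` behaves like SQL `EXCEPT DISTINCT`, which is approximated with a set difference. The helper names `except_distinct` and `equal_levels_by_subtract` are hypothetical and do not appear in the PR.

```python
from typing import Hashable, List, Sequence, Set, Tuple

Row = Tuple[Hashable, ...]


def except_distinct(left: Sequence[Hashable], right: Sequence[Hashable]) -> Set[Hashable]:
    """Distinct values of `left` that never occur in `right`."""
    return set(left) - set(right)


def equal_levels_by_subtract(a: List[Row], b: List[Row]) -> bool:
    """Hypothetical model: no level may have a value missing on either side."""
    nlevels_a = len(a[0]) if a else 0
    nlevels_b = len(b[0]) if b else 0
    if nlevels_a != nlevels_b:
        return False
    for level in range(nlevels_a):
        col_a = [row[level] for row in a]
        col_b = [row[level] for row in b]
        # Both directions must be empty; one direction alone only proves a subset.
        if except_distinct(col_a, col_b) or except_distinct(col_b, col_a):
            return False
    return True


long_idx = [("a", "x"), ("b", "y"), ("a", "y"), ("b", "x")]
short_idx = [("b", "x"), ("a", "y")]
result_same = equal_levels_by_subtract(long_idx, short_idx)
result_diff = equal_levels_by_subtract(long_idx, [("b", "x"), ("a", "j")])
```

Duplicates and row order drop out because only distinct values survive the difference, so `result_same` holds even though the two inputs have four and two entries.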
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
I think we don't need to preserve the same values.
It only compares the unique values of each level.
For example,
```python
>>> pmidx1 = pd.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("a", "y"), ("b", "x")])
>>> pmidx2 = pd.MultiIndex.from_tuples([("b", "x"), ("a", "y")])
>>> pmidx1.equal_levels(pmidx2)
True
```
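The difference between "same levels" and "same index" can be sketched without pandas or Spark. The snippet below is a simplified model that assumes levels are just the sorted unique values at each tuple position (roughly what `from_tuples` produces); `levels_of` is a hypothetical helper, and the tuples are the ones from the PR description.

```python
from typing import Hashable, List, Tuple

Row = Tuple[Hashable, ...]


def levels_of(tuples: List[Row]) -> List[list]:
    """Sorted unique values at each tuple position (values must be orderable)."""
    if not tuples:
        return []
    # zip(*tuples) transposes the rows into one column per level.
    return [sorted(set(column)) for column in zip(*tuples)]


idx1 = [("a", "x"), ("b", "y"), ("c", "z"), ("a", "y")]
idx2 = [("a", "y"), ("b", "x"), ("c", "z"), ("c", "x")]

same_levels = levels_of(idx1) == levels_of(idx2)
same_entries = set(idx1) == set(idx2)
```

Here `same_levels` holds while `same_entries` does not: both inputs use the values a, b, c and x, y, z, but they pair them differently, which a level-only comparison does not look at.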
[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929831534
**[Test build #143702 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143702/testReport)** for PR 34113 at commit [`6306fb1`](https://github.com/apache/spark/commit/6306fb1c66233e7ec159def64eee0d51cdee026d).
[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-927680025
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48154/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-927726536
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48154/
[GitHub] [spark] AmplabJenkins commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929906029
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48220/
[GitHub] [spark] AmplabJenkins commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929843984
[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-931860083
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48290/
[GitHub] [spark] SparkQA removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-931844920
**[Test build #143778 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143778/testReport)** for PR 34113 at commit [`188f9e7`](https://github.com/apache/spark/commit/188f9e7a53ec616e588c8aaf18d097ee4f667273).
[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-927648356
**[Test build #143642 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143642/testReport)** for PR 34113 at commit [`0b79e4a`](https://github.com/apache/spark/commit/0b79e4aef9b8832accebd18d4d4a8135a6721fec).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
* `public class NettyLogger `
* `class IndexNameTypeHolder(object):`
* ` new_class = type("NameType", (NameTypeHolder,), `
* ` new_class = param.type if isinstance(param, np.dtype) else param`
* `public final class AlwaysFalse extends Filter `
* `public final class AlwaysTrue extends Filter `
* `public final class And extends BinaryFilter `
* `abstract class BinaryComparison extends Filter `
* `abstract class BinaryFilter extends Filter `
* `public final class EqualNullSafe extends BinaryComparison `
* `public final class EqualTo extends BinaryComparison `
* `public abstract class Filter implements Expression, Serializable `
* `public final class GreaterThan extends BinaryComparison `
* `public final class GreaterThanOrEqual extends BinaryComparison `
* `public final class In extends Filter `
* `public final class IsNotNull extends Filter `
* `public final class IsNull extends Filter `
* `public final class LessThan extends BinaryComparison `
* `public final class LessThanOrEqual extends BinaryComparison `
* `public final class Not extends Filter `
* `public final class Or extends BinaryFilter `
* `public final class StringContains extends StringPredicate `
* `public final class StringEndsWith extends StringPredicate `
* `abstract class StringPredicate extends Filter `
* `public final class StringStartsWith extends StringPredicate `
* `public class ColumnarBatch implements AutoCloseable `
* `case class Sec(child: Expression)`
* `case class Csc(child: Expression)`
* `trait OperationHelper extends AliasHelper with PredicateHelper `
* `class SQLOpenHashSet[@specialized(Long, Int, Double, Float) T: ClassTag](`
* `case class OptimizeSkewedJoin(`
* `case class SkewJoinChildWrapper(plan: SparkPlan) extends LeafExecNode `
* `case class SimpleCostEvaluator(forceOptimizeSkewedJoin: Boolean) extends CostEvaluator `
* `case class WriterBucketSpec(`
* `case class EnsureRequirements(`
[GitHub] [spark] AmplabJenkins commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-927726536
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48154/
[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929859580
**[Test build #143706 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143706/testReport)** for PR 34113 at commit [`f19e365`](https://github.com/apache/spark/commit/f19e365c451bc921533e18fe82bfd5dc27fdcebe).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929845045
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143704/
[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718163223
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
Correct, and
```python
ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("a", "y"), ("b", "x")])
ps.MultiIndex.from_tuples([("b", "x"), ("a", "y")])
```
are also considered the same.
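To make the semantics concrete, here is a minimal pure-Python sketch (no Spark required) that models a MultiIndex as a list of tuples and treats the "levels" as the sorted unique values at each position. The helper name `levels_of` is hypothetical and only approximates what pandas exposes as `MultiIndex.levels`.

```python
from typing import Hashable, List, Sequence, Tuple


def levels_of(tuples: Sequence[Tuple[Hashable, ...]]) -> List[list]:
    """Return the sorted unique values found at each position of the tuples."""
    # zip(*tuples) transposes the rows into one column per level.
    return [sorted(set(column)) for column in zip(*tuples)]


left = [("a", "x"), ("b", "y"), ("a", "y"), ("b", "x")]
right = [("b", "x"), ("a", "y")]

# Both inputs produce the same per-level unique values, even though
# they contain a different number of rows.
print(levels_of(left))   # [['a', 'b'], ['x', 'y']]
print(levels_of(right))  # [['a', 'b'], ['x', 'y']]
print(levels_of(left) == levels_of(right))  # True
```

Under this model, row order and row count do not matter; only the set of values seen at each level does.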
[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r719124229
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
Yeah, the equality condition of this API is:
**Are all of the unique (distinct) values of each level the same?**
So I think applying `subtract` to each level of the index columns satisfies the equality condition,
because if there is at least one differing unique value, it will not be filtered out by the subtraction.
I'm afraid that applying `distinct` to each level would be more expensive, because we would have to sort before comparing.
[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r719124229
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
Yeah, the equality condition of this API is:
**Are all of the unique (distinct) values of each level the same?**
So I think applying `subtract` to each level of the index columns is enough to satisfy the equality condition,
because if there is at least one differing unique value, it will not be filtered out by the subtraction.
I'm afraid that applying `distinct` to each level would be more expensive, because we would have to sort before comparing.
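The check described in this comment can be sketched in plain Python, with set difference standing in for Spark's `DataFrame.subtract` on a single column. This is an illustration of the idea rather than the PR's implementation: `equal_levels_sketch` is a hypothetical name, inputs are assumed to be non-empty lists of tuples, and the sketch only looks for values of `left` that are missing from `right`, mirroring the direction of the `subtract` call in the quoted diff.

```python
from typing import Hashable, Sequence, Tuple

Row = Tuple[Hashable, ...]


def equal_levels_sketch(left: Sequence[Row], right: Sequence[Row]) -> bool:
    """Model the per-level subtract check with Python sets."""
    # Indexes with a different number of levels can never match.
    if len(left[0]) != len(right[0]):
        return False

    leftovers = set()
    for level, (left_col, right_col) in enumerate(zip(zip(*left), zip(*right))):
        # Set difference plays the role of subtract: values of `left`
        # at this level that never occur in `right` at the same level.
        missing = set(left_col) - set(right_col)
        # Tag each value with its level so per-level results can be unioned.
        leftovers |= {(level, value) for value in missing}

    # An empty union means no level had an unmatched value.
    return not leftovers


idx1 = [("a", "x"), ("b", "y"), ("c", "z")]
print(equal_levels_sketch(idx1, [("b", "y"), ("a", "x"), ("c", "z")]))  # True
print(equal_levels_sketch(idx1, [("a", "x"), ("b", "y"), ("c", "j")]))  # False
```

Because the difference in this sketch is one-directional, it cannot see values that exist only in `right`; whether that matters depends on the intended semantics.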
[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-931864430
**[Test build #143778 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143778/testReport)** for PR 34113 at commit [`188f9e7`](https://github.com/apache/spark/commit/188f9e7a53ec616e588c8aaf18d097ee4f667273).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] itholic commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
itholic commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-927745638
cc @ueshin @HyukjinKwon @xinrong-databricks
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718151080
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
I think you should use `exceptAll` to preserve duplicate values.
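For context, PySpark's `DataFrame.subtract` behaves like SQL `EXCEPT DISTINCT`, while `DataFrame.exceptAll` behaves like `EXCEPT ALL` and keeps duplicates. The difference can be modelled without Spark by comparing a set with a `collections.Counter`. The helper names below are hypothetical and exist only for this illustration.

```python
from collections import Counter
from typing import Hashable, Iterable, List


def subtract_distinct(left: Iterable[Hashable], right: Iterable[Hashable]) -> List[Hashable]:
    """Model EXCEPT DISTINCT: duplicates are collapsed before comparing."""
    return sorted(set(left) - set(right))


def except_all(left: Iterable[Hashable], right: Iterable[Hashable]) -> List[Hashable]:
    """Model EXCEPT ALL: each right-hand row cancels one left-hand row."""
    remaining = Counter(left) - Counter(right)
    return sorted(remaining.elements())


left = ["a", "a", "b"]
right = ["a"]

print(subtract_distinct(left, right))  # ['b']
print(except_all(left, right))         # ['a', 'b']
```

Whether duplicates matter depends on the comparison: a check on the unique values of each level is unaffected by them.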
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
It looks like this compares the values of each MultiIndex level individually. E.g.,
```python
ps.MultiIndex.from_tuples([("a", "x"), ("b", "y")])
ps.MultiIndex.from_tuples([("b", "x"), ("a", "y")])
```
will be considered the same?
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
         )
         return cast(MultiIndex, DataFrame(internal).index)
+    def equal_levels(self, other: "MultiIndex") -> bool:
+        """
+        Return True if the levels of both MultiIndex objects are the same
+
+        Examples
+        --------
+        >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+        >>> psmidx1.equal_levels(psmidx2)
+        True
+
+        >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+        >>> psmidx1.equal_levels(psmidx2)
+        False
+        """
+        nlevels = self.nlevels
+        if nlevels != other.nlevels:
+            return False
+
+        self_sdf = self._internal.spark_frame
+        other_sdf = other._internal.spark_frame
+        subtract_list = []
+        for nlevel in range(nlevels):
+            self_index_scol = self._internal.index_spark_columns[nlevel]
+            other_index_scol = other._internal.index_spark_columns[nlevel]
+            self_subtract_other = self_sdf.select(self_index_scol).subtract(
+                other_sdf.select(other_index_scol)
+            )
+            subtract_list.append(self_subtract_other)
+
+        unioned_subtracts = reduce(lambda x, y: x.union(y), subtract_list)
+        if len(unioned_subtracts.head(1)) == 0:
+            return True
+        else:
+            return False
Review comment:
```suggestion
return len(unioned_subtracts.head(1)) == 0
```
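The suggestion relies on the fact that a comparison already evaluates to a `bool`, so the `if`/`else` wrapper adds nothing. A tiny standalone illustration follows, with a plain list standing in for the result of `head(1)`; both function names are hypothetical.

```python
from typing import List


def is_empty_verbose(rows: List[object]) -> bool:
    # Original shape: branch on the comparison, then return a literal.
    if len(rows) == 0:
        return True
    else:
        return False


def is_empty_direct(rows: List[object]) -> bool:
    # Suggested shape: return the comparison result itself.
    return len(rows) == 0


# The two forms agree for both an empty and a non-empty result.
for sample in ([], ["row"]):
    print(is_empty_verbose(sample), is_empty_direct(sample))
```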
[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929833756
**[Test build #143704 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143704/testReport)** for PR 34113 at commit [`1f51541`](https://github.com/apache/spark/commit/1f5154100f37a3ecf574917bc0c76c852bd4bf23).
[GitHub] [spark] SparkQA removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929833756
**[Test build #143704 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143704/testReport)** for PR 34113 at commit [`1f51541`](https://github.com/apache/spark/commit/1f5154100f37a3ecf574917bc0c76c852bd4bf23).
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r716583441
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Notes
+ -----
+ This API can be expensive since it has logic to sort and compare the values of
+ all levels of indices that belong to MultiIndex.
+
+ Examples
+ --------
+ >>> from pyspark.pandas.config import set_option, reset_option
+ >>> set_option("compute.ops_on_diff_frames", True)
+
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+
+ >>> reset_option("compute.ops_on_diff_frames")
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ for nlevel in range(nlevels):
Review comment:
It will also trigger a Spark job for each level. Can we do it in a single pass?
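One way to picture the single-pass idea is to build a lazy result per level, combine them, and trigger evaluation only once at the end. The sketch below uses Python generators as a loose analogy for Spark's lazy DataFrames. `ActionCounter`, `lazy_difference`, and the sample data are hypothetical, and counting `head` calls is only a stand-in for counting Spark jobs.

```python
from itertools import chain, islice
from typing import Hashable, Iterable, Iterator, List, Sequence


class ActionCounter:
    """Counts how many times a lazy pipeline is actually evaluated."""

    def __init__(self) -> None:
        self.count = 0

    def head(self, lazy_rows: Iterable[Hashable]) -> List[Hashable]:
        # Pulling the first element is the only point where work happens.
        self.count += 1
        return list(islice(lazy_rows, 1))


def lazy_difference(left: Sequence[Hashable], right: Sequence[Hashable]) -> Iterator[Hashable]:
    """Lazily yield values of `left` that are absent from `right`."""
    seen = set(right)  # runs only once iteration starts (generator body)
    for value in left:
        if value not in seen:
            yield value


levels_left = [["a", "b", "c"], ["x", "y", "z"]]
levels_right = [["b", "a", "c"], ["x", "y", "j"]]

# Per-level evaluation: one action for every level.
per_level = ActionCounter()
results = [
    per_level.head(lazy_difference(lhs, rhs)) == []
    for lhs, rhs in zip(levels_left, levels_right)
]

# Single pass: chain the lazy per-level differences, then act once.
single = ActionCounter()
combined = chain.from_iterable(
    lazy_difference(lhs, rhs) for lhs, rhs in zip(levels_left, levels_right)
)
all_equal = single.head(combined) == []

print(all(results), per_level.count)  # False 2
print(all_equal, single.count)        # False 1
```

Both strategies reach the same answer; the chained version simply defers all work to a single terminal call.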
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718169889
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
👌
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929870864
[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-931844920
**[Test build #143778 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143778/testReport)** for PR 34113 at commit [`188f9e7`](https://github.com/apache/spark/commit/188f9e7a53ec616e588c8aaf18d097ee4f667273).
[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r719124229
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
Yeah, the equality condition of this API is:
**Are all of the unique (distinct) values of each level the same?**
So I think applying `subtract` to each level of the index columns satisfies the equality condition,
because if there is at least one differing unique value, it will not be filtered out by the subtraction.
[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718114641
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Notes
+ -----
+ This API can be expensive since it has logic to sort and compare the values of
+ all levels of indices that belong to MultiIndex.
+
+ Examples
+ --------
+ >>> from pyspark.pandas.config import set_option, reset_option
+ >>> set_option("compute.ops_on_diff_frames", True)
+
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+
+ >>> reset_option("compute.ops_on_diff_frames")
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ for nlevel in range(nlevels):
Review comment:
Yeah, I think it might be possible to avoid triggering a Spark job in every iteration, but at least one Spark job is required to compare each level of the index. Let me address it.
[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929831534
--
[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718114641
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Notes
+ -----
+ This API can be expensive since it has logic to sort and compare the values of
+ all levels of indices that belong to MultiIndex.
+
+ Examples
+ --------
+ >>> from pyspark.pandas.config import set_option, reset_option
+ >>> set_option("compute.ops_on_diff_frames", True)
+
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+
+ >>> reset_option("compute.ops_on_diff_frames")
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ for nlevel in range(nlevels):
Review comment:
Yeah, it might be possible to reduce the number of Spark jobs per iteration, but I think at least one Spark job is still required to compare each level of index values. Let me try to address it.
--
[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-927726458
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48154/
--
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929906029
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48220/
--
[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929844811
**[Test build #143704 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143704/testReport)** for PR 34113 at commit [`1f51541`](https://github.com/apache/spark/commit/1f5154100f37a3ecf574917bc0c76c852bd4bf23).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
--
[GitHub] [spark] HyukjinKwon commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-931902591
Merged to master.
--
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r716581142
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Notes
+ -----
+ This API can be expensive since it has logic to sort and compare the values of
+ all levels of indices that belong to MultiIndex.
+
+ Examples
+ --------
+ >>> from pyspark.pandas.config import set_option, reset_option
+ >>> set_option("compute.ops_on_diff_frames", True)
+
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+
+ >>> reset_option("compute.ops_on_diff_frames")
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ for nlevel in range(nlevels):
Review comment:
Can we avoid sorting for each level?
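As an illustration of the question above, here is a minimal plain-Python sketch (not Spark, and not the code in this PR) of comparing levels without sorting. It assumes that "equal levels" means each level holds the same set of distinct values, which is how the examples in the docstring behave. The names `distinct_level_values` and `equal_levels_unsorted` are hypothetical helpers, and an index is modelled as a list of tuples.

```python
from typing import Hashable, Sequence, Tuple

Row = Tuple[Hashable, ...]


def distinct_level_values(rows: Sequence[Row], level: int) -> set:
    """Collect the distinct values found at one level (tuple position)."""
    return {row[level] for row in rows}


def equal_levels_unsorted(left: Sequence[Row], right: Sequence[Row]) -> bool:
    """Hypothetical helper: compare distinct values per level, with no sorting."""
    left_nlevels = len(left[0]) if left else 0
    right_nlevels = len(right[0]) if right else 0
    if left_nlevels != right_nlevels:
        return False
    # Set equality ignores both row order and duplicate values.
    return all(
        distinct_level_values(left, level) == distinct_level_values(right, level)
        for level in range(left_nlevels)
    )


idx1 = [("a", "x"), ("b", "y"), ("c", "z")]
idx2 = [("b", "y"), ("a", "x"), ("c", "z")]
idx3 = [("a", "x"), ("b", "y"), ("c", "j")]

print(equal_levels_unsorted(idx1, idx2))  # True
print(equal_levels_unsorted(idx1, idx3))  # False
```

Hash-based set comparison is only an analogy for what a distributed engine would do; the cost model in Spark differs, so this shows the semantics rather than the performance.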
--
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-931886653
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48290/
--
[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718114641
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Notes
+ -----
+ This API can be expensive since it has logic to sort and compare the values of
+ all levels of indices that belong to MultiIndex.
+
+ Examples
+ --------
+ >>> from pyspark.pandas.config import set_option, reset_option
+ >>> set_option("compute.ops_on_diff_frames", True)
+
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+
+ >>> reset_option("compute.ops_on_diff_frames")
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ for nlevel in range(nlevels):
Review comment:
Yeah, I think it might be possible to reduce the number of Spark jobs per iteration, but at least one Spark job is still required to compare each level of the index. Let me address it.
--
[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718114641
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Notes
+ -----
+ This API can be expensive since it has logic to sort and compare the values of
+ all levels of indices that belong to MultiIndex.
+
+ Examples
+ --------
+ >>> from pyspark.pandas.config import set_option, reset_option
+ >>> set_option("compute.ops_on_diff_frames", True)
+
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+
+ >>> reset_option("compute.ops_on_diff_frames")
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ for nlevel in range(nlevels):
Review comment:
Yeah, I think it might be possible to reduce the number of Spark jobs per iteration, but at least one Spark job is still required to compare each level of index values. Let me address it.
--
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718170132
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
👌
--
[GitHub] [spark] AmplabJenkins commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-927648563
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143642/
--
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-931865814
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143778/
--
[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-927638906
**[Test build #143642 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143642/testReport)** for PR 34113 at commit [`0b79e4a`](https://github.com/apache/spark/commit/0b79e4aef9b8832accebd18d4d4a8135a6721fec).
--
[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929872730
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48218/
--
[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929871970
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48220/
--
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718170642
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
Can you elaborate on the equality condition for this API? It seems like we can just compare the distinct values of each level?
--
[GitHub] [spark] SparkQA removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-927638906
**[Test build #143642 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143642/testReport)** for PR 34113 at commit [`0b79e4a`](https://github.com/apache/spark/commit/0b79e4aef9b8832accebd18d4d4a8135a6721fec).
--
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929843984
--
[GitHub] [spark] HyukjinKwon closed pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #34113:
URL: https://github.com/apache/spark/pull/34113
--
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-927648563
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143642/
--
[GitHub] [spark] AmplabJenkins commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-931865814
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143778/
--
[GitHub] [spark] AmplabJenkins commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-931886653
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/48290/
--
[GitHub] [spark] itholic commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r718163223
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,42 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
+
+ Examples
+ --------
+ >>> psmidx1 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "z")])
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("b", "y"), ("a", "x"), ("c", "z")])
+ >>> psmidx1.equal_levels(psmidx2)
+ True
+
+ >>> psmidx2 = ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("c", "j")])
+ >>> psmidx1.equal_levels(psmidx2)
+ False
+ """
+ nlevels = self.nlevels
+ if nlevels != other.nlevels:
+ return False
+
+ self_sdf = self._internal.spark_frame
+ other_sdf = other._internal.spark_frame
+ subtract_list = []
+ for nlevel in range(nlevels):
+ self_index_scol = self._internal.index_spark_columns[nlevel]
+ other_index_scol = other._internal.index_spark_columns[nlevel]
+ self_subtract_other = self_sdf.select(self_index_scol).subtract(
Review comment:
Correct.
And
```python
ps.MultiIndex.from_tuples([("a", "x"), ("b", "y"), ("a", "y"), ("b", "x")])
ps.MultiIndex.from_tuples([("b", "x"), ("a", "y")])
```
are also considered to have the same levels.
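To make the condition concrete, the following plain-Python sketch mirrors the two-way `subtract` idea from the diff above using set difference. PySpark's `DataFrame.subtract` behaves like SQL `EXCEPT DISTINCT`, so a set difference is a reasonable stand-in here; the helper names `subtract_distinct` and `level_mismatches` are hypothetical and this is not the PR's implementation.

```python
from typing import Hashable, Iterable, List, Sequence, Tuple

Row = Tuple[Hashable, ...]


def subtract_distinct(left: Iterable[Hashable], right: Iterable[Hashable]) -> set:
    """Distinct values present in `left` but absent from `right`."""
    return set(left) - set(right)


def level_mismatches(left: Sequence[Row], right: Sequence[Row]) -> List[Tuple[set, set]]:
    """Per level, return (only_in_left, only_in_right).

    Assumes both inputs are non-empty and have the same number of levels.
    """
    result = []
    for level in range(len(left[0])):
        left_values = [row[level] for row in left]
        right_values = [row[level] for row in right]
        result.append(
            (
                subtract_distinct(left_values, right_values),
                subtract_distinct(right_values, left_values),
            )
        )
    return result


four_rows = [("a", "x"), ("b", "y"), ("a", "y"), ("b", "x")]
two_rows = [("b", "x"), ("a", "y")]

mismatches = level_mismatches(four_rows, two_rows)
# The levels are equal when no level has leftovers on either side.
levels_equal = all(
    not only_left and not only_right for only_left, only_right in mismatches
)
print(levels_equal)  # True
```

The two inputs differ in row count and in which tuples they contain, yet each level holds the same distinct values (`a`, `b` and `x`, `y`), so both differences are empty at every level.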
--
[GitHub] [spark] SparkQA commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929894410
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48220/
--
[GitHub] [spark] SparkQA removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929831534
**[Test build #143702 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143702/testReport)** for PR 34113 at commit [`6306fb1`](https://github.com/apache/spark/commit/6306fb1c66233e7ec159def64eee0d51cdee026d).
--
[GitHub] [spark] AmplabJenkins commented on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929870864
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/143706/
[GitHub] [spark] HyukjinKwon commented on a change in pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #34113:
URL: https://github.com/apache/spark/pull/34113#discussion_r716583589
##########
File path: python/pyspark/pandas/indexes/multi.py
##########
@@ -1137,6 +1137,43 @@ def intersection(self, other: Union[DataFrame, Series, Index, List]) -> "MultiIn
)
return cast(MultiIndex, DataFrame(internal).index)
+ def equal_levels(self, other: "MultiIndex") -> bool:
+ """
+ Return True if the levels of both MultiIndex objects are the same
Review comment:
Let's add versionadded: 3.3.0
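For illustration, a minimal sketch of where the requested directive would sit in the docstring. `MultiIndexSketch` and `_level_sets` are hypothetical names and not part of pyspark; the set-based comparison is only an assumption about the semantics, inferred from the examples in the PR description, and is not the implementation in this PR.

```python
from typing import Hashable, Iterable, List, Set, Tuple


class MultiIndexSketch:
    """Hypothetical stand-in for a MultiIndex; not the pyspark class."""

    def __init__(self, tuples: Iterable[Tuple[Hashable, ...]]) -> None:
        self._tuples = list(tuples)

    def _level_sets(self) -> List[Set[Hashable]]:
        # zip(*rows) transposes the rows into one column per level;
        # each column is reduced to its set of unique values.
        return [set(column) for column in zip(*self._tuples)]

    def equal_levels(self, other: "MultiIndexSketch") -> bool:
        """
        Return True if the levels of both MultiIndex objects are the same.

        .. versionadded:: 3.3.0
        """
        # Assumption: levels match when both indexes have the same number
        # of levels and each level holds the same set of unique values.
        return self._level_sets() == other._level_sets()


midx1 = MultiIndexSketch([("a", "x"), ("b", "y"), ("c", "z"), ("a", "y")])
midx2 = MultiIndexSketch([("a", "y"), ("b", "x"), ("c", "z"), ("c", "x")])
midx3 = MultiIndexSketch([("a", "x"), ("b", "y"), ("d", "z")])
```

With this sketch, `midx1.equal_levels(midx2)` returns True, matching the second example in the PR description, while `midx1.equal_levels(midx3)` returns False because the first level differs.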
[GitHub] [spark] AmplabJenkins removed a comment on pull request #34113: [SPARK-36435][PYTHON] Implement MultIndex.equal_levels
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34113:
URL: https://github.com/apache/spark/pull/34113#issuecomment-929843984