Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/02/13 15:37:26 UTC
[GitHub] [spark] liangz1 opened a new pull request #27565: [SPARK-30791]
Dataframe add sameSemantics and sementicHash method
liangz1 opened a new pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565
### What changes were proposed in this pull request?
This PR adds two `@DeveloperApi` methods, `sameSemantics` and `semanticHash`, to the `Dataset[T]` class. Both simply expose lower-level query-plan methods through `Dataset[T]`.
### Why are the changes needed?
They are useful for checking whether two DataFrames are semantically the same when implementing DataFrame caching in Python, and for obtaining a hash that can serve as an ID for a DataFrame's query plan. Wrapping the lower-level APIs makes them easier to use.
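The caching use case could look roughly like the sketch below. It assumes only that a frame exposes `semanticHash()` and `sameSemantics(other)`; `SemanticCache` and `FakeFrame` are hypothetical names used for illustration and are not part of PySpark. The hash narrows the search, and `sameSemantics` confirms the match, because two different plans may share a hash.

```python
from collections import defaultdict


class SemanticCache:
    """Cache keyed by semanticHash() and confirmed with sameSemantics().

    Hypothetical helper for illustration; it is not part of PySpark.
    """

    def __init__(self):
        # hash -> list of (frame, cached value) pairs that share that hash
        self._buckets = defaultdict(list)

    def get(self, frame):
        # Equal hashes only narrow the search; sameSemantics() decides.
        for candidate, value in self._buckets.get(frame.semanticHash(), []):
            if candidate.sameSemantics(frame):
                return value
        return None

    def put(self, frame, value):
        self._buckets[frame.semanticHash()].append((frame, value))


class FakeFrame:
    """Stand-in exposing the two methods, so the sketch runs without Spark."""

    def __init__(self, plan):
        self.plan = plan

    def semanticHash(self):
        # Deliberately coarse hash so that different plans can collide.
        return len(self.plan)

    def sameSemantics(self, other):
        return self.plan == other.plan
```

With a real `DataFrame`, the same `get`/`put` calls would apply unchanged, since the cache relies only on those two methods.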
### Does this PR introduce any user-facing change?
```
scala> val df1 = Seq((1,2),(4,5)).toDF("col1", "col2")
df1: org.apache.spark.sql.DataFrame = [col1: int, col2: int]
scala> val df2 = Seq((1,2),(4,5)).toDF("col1", "col2")
df2: org.apache.spark.sql.DataFrame = [col1: int, col2: int]
scala> val df3 = Seq((0,2),(4,5)).toDF("col1", "col2")
df3: org.apache.spark.sql.DataFrame = [col1: int, col2: int]
scala> val df4 = Seq((0,2),(4,5)).toDF("col0", "col2")
df4: org.apache.spark.sql.DataFrame = [col0: int, col2: int]
scala> df1.semanticHash
res0: Int = 594427822
scala> df2.semanticHash
res1: Int = 594427822
scala> df1.sameSemantics(df2)
res2: Boolean = true
scala> df1.sameSemantics(df3)
res3: Boolean = false
scala> df3.semanticHash
res4: Int = -1592702048
scala> df4.semanticHash
res5: Int = -1592702048
scala> df4.sameSemantics(df3)
res6: Boolean = true
```
### How was this patch tested?
The underlying lower-level API `sameResult` is tested in `org.apache.spark.sql.catalyst.plans.SameResultSuite`. `semanticHash` just uses `hashCode`, so it might not need a dedicated test.
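To illustrate why `df3` and `df4` compare as equal in the session above while `df1` and `df3` do not, here is a toy model in plain Python. It is not Spark's implementation: `canonicalize`, `semantic_hash`, and `same_semantics` are hypothetical names, and a "plan" here is assumed to be nothing more than rows plus column names.

```python
def canonicalize(rows, column_names):
    """Replace user-facing column names with positional placeholders."""
    placeholders = tuple("none#%d" % i for i in range(len(column_names)))
    return (placeholders, tuple(rows))


def semantic_hash(rows, column_names):
    # Hash the canonical form, so cosmetic renames do not change the result.
    return hash(canonicalize(rows, column_names))


def same_semantics(left, right):
    # Compare canonical forms rather than the raw (rows, names) pairs.
    return canonicalize(*left) == canonicalize(*right)


df1 = ([(1, 2), (4, 5)], ["col1", "col2"])
df3 = ([(0, 2), (4, 5)], ["col1", "col2"])
df4 = ([(0, 2), (4, 5)], ["col0", "col2"])  # same data as df3, renamed column
```

In this model, renaming a column leaves both the comparison and the hash unchanged, while changing a value changes the canonical form.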
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586109523
**[Test build #118385 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118385/testReport)** for PR 27565 at commit [`284d7ad`](https://github.com/apache/spark/commit/284d7ad3de0a15a6b6aebf92c7b9e32349607048).
* This patch passes all tests.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586202370
**[Test build #118419 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118419/testReport)** for PR 27565 at commit [`d154d6b`](https://github.com/apache/spark/commit/d154d6bb991f6dbc284645db33b49206de61e56a).
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676503
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23248/
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586177603
Build finished. Test PASSed.
[GitHub] [spark] WeichenXu123 commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-587245610
Merge to master/branch-3.0
[GitHub] [spark] WeichenXu123 edited a comment on issue #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
WeichenXu123 edited a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586069714
@cloud-fan @HyukjinKwon Any comments?
[GitHub] [spark] SparkQA commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586942207
**[Test build #118571 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118571/testReport)** for PR 27565 at commit [`0e74940`](https://github.com/apache/spark/commit/0e749403b4bb127341849827cc95313752b6c715).
* This patch **fails to generate documentation**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586809253
**[Test build #118525 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118525/testReport)** for PR 27565 at commit [`af476cb`](https://github.com/apache/spark/commit/af476cbd42ec491e9a860be4fcd66ba8c49256a4).
[GitHub] [spark] liangz1 commented on a change in pull request #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379881773
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,59 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+        therefore return same results.
+
+        .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+            such as attribute names.
+
+        .. note::This API can compare both :class:`DataFrame`\\s very fast but can still return
+            `False` on the :class:`DataFrame` that return the same results, for instance, from
+            different plans. Such false negative semantic can be useful when caching as an example.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+        True
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+        False
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+        .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+            simplified by tolerating the cosmetic differences such as attribute names.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df2.withColumn("col1", df2.id * 2).semanticHash()
+        True
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df3.withColumn("col1", df3.id + 2).semanticHash()
+        False
Review comment:
More tests:
```
>>> df1=spark.range(100)
>>> df2=spark.range(100)
>>> df3=spark.range(100)
>>> df11=df1.withColumn("col1", df1.id +1)
>>> df21=df2.withColumn("col1", df2.id -1)
>>> df31=df3.withColumn("col1", df3.id *2)
>>> df32=df3.withColumn("col1", df3.id +2)
>>> df33=df3.withColumn("col1", df3.id /2)
>>> df34=df3.withColumn("col1", df3.id -2)
>>> df11.semanticHash()
1855039936
>>> df21.semanticHash()
1855039936
>>> df31.semanticHash()
-1719131362
>>> df32.semanticHash()
-1719131362
>>> df32.semanticHash()
-1719131362
>>> df34.semanticHash()
-706037631
```
[GitHub] [spark] liangz1 commented on a change in pull request #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379362550
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,50 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.sameSemantics(df2)
+        False
+        >>> df1.sameSemantics(df3)
+        False
+        >>> df1.sameSemantics(df4)
+        True
+        >>> df1.sameSemantics(df1)
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a `hashCode` for the calculation performed by the query plan of this Dataset.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.semanticHash() == df2.semanticHash()
+        False
+        >>> df1.semanticHash() == df3.semanticHash()
+        False
+        >>> df1.semanticHash() == df4.semanticHash()
+        True
Review comment:
Currently, this test would fail. I don't know why only the same object produces the same hash, while all other cases seem to produce different hashes. The Scala-side tests all pass.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585821610
Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586948594
Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586864075
**[Test build #118523 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118523/testReport)** for PR 27565 at commit [`65c5210`](https://github.com/apache/spark/commit/65c5210edeba283253986978c3fc5fde68129f71).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586864664
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118523/
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586177612
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23169/
[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585872211
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118373/
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379358950
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -3308,6 +3308,37 @@ class Dataset[T] private[sql](
     files.toSet.toArray
   }
+  /**
+   * Returns true when the query plan of the given Dataset will return the same results as this
+   * Dataset.
+   *
+   * Since its likely undecidable to generally determine if two given plans will produce the same
+   * results, it is okay for this function to return false, even if the results are actually
+   * the same. Such behavior will not affect correctness, only the application of performance
+   * enhancements like caching. However, it is not acceptable to return true if the results could
+   * possibly be different.
+   *
+   * This function performs a modified version of equality that is tolerant of cosmetic
+   * differences like attribute naming and or expression id differences.
+   *
+   * @since 3.1.0
+   */
+  @DeveloperApi
+  def sameSemantics(other: Dataset[T]): Boolean = {
+    queryExecution.analyzed.sameResult(other.queryExecution.analyzed)
+  }
+
+  /**
+   * Returns a `hashCode` for the calculation performed by the query plan of this Dataset. Unlike
+   * the standard `hashCode`, an attempt has been made to eliminate cosmetic differences.
Review comment:
I would write it as below:
```
Returns a `hashCode` of the logical query plan against this [[Dataset]].
@note Unlike the standard `hashCode`, the hash is calculated against the query plan
simplified by tolerating the cosmetic differences such as attribute names.
```
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586193899
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118417/
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586948594
Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586295857
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118419/
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379342918
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,45 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.sameSemantics(df2)
+        False
+        >>> df1.sameSemantics(df3)
+        False
+        >>> df1.sameSemantics(df4)
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s" % type(other))
Review comment:
Shall we also add a test that checks the error message? It could be added as below (I have not tested it myself, though):
```diff
diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py
index d738449799b..942cd4b4b0e 100644
--- a/python/pyspark/sql/tests/test_dataframe.py
+++ b/python/pyspark/sql/tests/test_dataframe.py
@@ -782,6 +782,11 @@ class DataFrameTests(ReusedSQLTestCase):
                     break
             self.assertEqual(df.take(8), result)
+    def test_same_semantics_error(self):
+        with QuietTest(self.sc):
+            with self.assertRaisesRegexp(ValueError, "should be of DataFrame.*int"):
+                self.spark.range(10).sameSemantics(1)
+
```
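The error-message check suggested above can also be tried without a Spark session. The sketch below is only an approximation: `FakeDataFrame` is a hypothetical stand-in that copies the `isinstance` check from the patch, and `run_tests` is a helper added for illustration. It uses `assertRaisesRegex`, the current spelling of the deprecated `assertRaisesRegexp` alias.

```python
import unittest


class FakeDataFrame:
    """Hypothetical stand-in that mirrors only the type check under review."""

    def sameSemantics(self, other):
        if not isinstance(other, FakeDataFrame):
            raise ValueError("other parameter should be of DataFrame; however, got %s"
                             % type(other))
        return True


class SameSemanticsErrorTest(unittest.TestCase):
    def test_same_semantics_error(self):
        # The message ends with the offending type, e.g. <class 'int'>.
        with self.assertRaisesRegex(ValueError, "should be of DataFrame.*int"):
            FakeDataFrame().sameSemantics(1)


def run_tests():
    """Run the test case programmatically and return the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SameSemanticsErrorTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests().wasSuccessful()` reports whether the regular expression matched the raised message.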
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586340929
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586947035
retest this please
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585839557
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679645
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118493/
Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586203267
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23176/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586215401
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679631
**[Test build #118489 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118489/testReport)** for PR 27565 at commit [`a1d4ba1`](https://github.com/apache/spark/commit/a1d4ba1f33c81435da84cbeee3c7e579e5dd8061).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379367247
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,50 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.1)
+ def sameSemantics(self, other):
+ """
+ Return true when the query plan of the given :class:`DataFrame` will return the same
+ results as this :class:`DataFrame`.
+
+ >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+ >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+ >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+ >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+ >>> df1.sameSemantics(df2)
+ False
+ >>> df1.sameSemantics(df3)
+ False
+ >>> df1.sameSemantics(df4)
+ True
+ >>> df1.sameSemantics(df1)
+ True
+ """
+ if not isinstance(other, DataFrame):
+ raise ValueError("other parameter should be of DataFrame; however, got %s"
+ % type(other))
+ return self._jdf.sameSemantics(other._jdf)
+
+ @since(3.1)
+ def semanticHash(self):
+ """
+ Returns a `hashCode` for the calculation performed by the query plan of this Dataset.
+
+ >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+ >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+ >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+ >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+ >>> df1.semanticHash() == df2.semanticHash()
+ False
+ >>> df1.semanticHash() == df3.semanticHash()
+ False
+ >>> df1.semanticHash() == df4.semanticHash()
+ True
Review comment:
I need to debug this too; I don't know the cause yet. One hypothesis, though:
One difference between the Python and Scala sides is that the Python side currently creates the `DataFrame` from an `RDD[Array[Byte]]`, as seen from the JVM.
On the Scala side, the plan holds a `LocalRelation`, which contains the actual data. This could affect the hash.
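To make the hypothesis concrete, here is a toy model in plain Python. It is not Catalyst and does not confirm the cause; `InlineRelation`, `OpaqueRelation`, and `toy_semantic_hash` are hypothetical names, loosely standing in for a leaf that carries its rows and a leaf that only references an external dataset.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InlineRelation:
    """Toy leaf that carries its rows (loosely modelled on the LocalRelation case)."""
    rows: Tuple[Tuple[int, ...], ...]


@dataclass(frozen=True)
class OpaqueRelation:
    """Toy leaf that only references an external dataset by identity
    (loosely modelled on the RDD-backed case)."""
    source_id: int


def toy_semantic_hash(leaf):
    # Frozen dataclasses hash their fields, so the hash follows whatever
    # the leaf stores: the rows themselves, or just the source identity.
    return hash(leaf)


data = ((1, 2), (4, 5))

inline_a = InlineRelation(data)
inline_b = InlineRelation(data)

# Two separately created sources get different ids even though the data is equal.
opaque_a = OpaqueRelation(source_id=1)
opaque_b = OpaqueRelation(source_id=2)

inline_equal = inline_a == inline_b    # compared by content
opaque_equal = opaque_a == opaque_b    # compared by identity of the source
```

Under this model, two frames built from equal literal data compare equal only when the leaf stores the data itself, which is the kind of Python/Scala discrepancy the hypothesis describes.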
----------------------------------------------------------------
[GitHub] [spark] liangz1 commented on a change in pull request #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379979541
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,59 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.1)
+ def sameSemantics(self, other):
+ """
+ Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+ therefore return same results.
+
+ .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+ such as attribute names.
+
+ .. note::This API can compare both :class:`DataFrame`\\s very fast but can still return
+ `False` on the :class:`DataFrame` that return the same results, for instance, from
+ different plans. Such false negative semantic can be useful when caching as an example.
+
+ >>> df1 = spark.range(100)
+ >>> df2 = spark.range(100)
+ >>> df3 = spark.range(100)
+ >>> df4 = spark.range(100)
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+ True
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+ False
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+ True
+ """
+ if not isinstance(other, DataFrame):
+ raise ValueError("other parameter should be of DataFrame; however, got %s"
+ % type(other))
+ return self._jdf.sameSemantics(other._jdf)
+
+ @since(3.1)
+ def semanticHash(self):
+ """
+ Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+ .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+ simplified by tolerating the cosmetic differences such as attribute names.
+
+ >>> df1 = spark.range(100)
+ >>> df2 = spark.range(100)
+ >>> df3 = spark.range(100)
+ >>> df4 = spark.range(100)
+ >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+ df2.withColumn("col1", df2.id * 2).semanticHash()
+ True
+ >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+ df3.withColumn("col1", df3.id + 2).semanticHash()
+ False
Review comment:
Thanks! Could you share the link to the PR or issue?
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586804198
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23278/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676449
**[Test build #118491 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118491/testReport)** for PR 27565 at commit [`ddba494`](https://github.com/apache/spark/commit/ddba494405ea7ce79e03a8e93f97cd64a3a2acfc).
----------------------------------------------------------------
[GitHub] [spark] liangz1 commented on a change in pull request #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379364408
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -3308,6 +3308,37 @@ class Dataset[T] private[sql](
files.toSet.toArray
}
+ /**
+ * Returns true when the query plan of the given Dataset will return the same results as this
+ * Dataset.
+ *
+ * Since its likely undecidable to generally determine if two given plans will produce the same
Review comment:
Sounds good to me. It's clearer and more concise.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586942266
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118571/
Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] liangz1 commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
liangz1 commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586950940
jenkins retest
----------------------------------------------------------------
[GitHub] [spark] liangz1 commented on a change in pull request #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
liangz1 commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379226694
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,22 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.0)
+ def sameSemantics(self, other):
+ """
+ Return true when the query plan of the given :class:`DataFrame` will return the same
+ results as this :class:`DataFrame`.
+ """
+ return self._jdf.sameSemantics(other)
Review comment:
It becomes the same error as the second one:
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/liang.zhang/work/repos/apache/spark/python/pyspark/sql/dataframe.py", line 2162, in sameSemantics
return self._jdf.sameSemantics(other._jdf)
File "/Users/liang.zhang/mypy3/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/Users/liang.zhang/work/repos/apache/spark/python/pyspark/sql/utils.py", line 98, in deco
return f(*a, **kw)
File "/Users/liang.zhang/mypy3/lib/python3.7/site-packages/py4j/protocol.py", line 332, in get_return_value
format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling o42.sameSemantics. Trace:
py4j.Py4JException: Method sameSemantics([class org.apache.spark.sql.Dataset]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
```
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586110262
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118385/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586203267
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23176/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586942266
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118571/
Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586804194
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586934216
**[Test build #118571 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118571/testReport)** for PR 27565 at commit [`0e74940`](https://github.com/apache/spark/commit/0e749403b4bb127341849827cc95313752b6c715).
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379963546
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,59 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.1)
+ def sameSemantics(self, other):
+ """
+ Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+ therefore return same results.
+
+ .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+ such as attribute names.
+
+ .. note::This API can compare both :class:`DataFrame`\\s very fast but can still return
Review comment:
nit: there should be a space after `note::`, i.e. `note::This` -> `note:: This`
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379359352
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,45 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.1)
+ def sameSemantics(self, other):
+ """
+ Return true when the query plan of the given :class:`DataFrame` will return the same
+ results as this :class:`DataFrame`.
+
+ >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+ >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+ >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+ >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+ >>> df1.sameSemantics(df2)
+ False
+ >>> df1.sameSemantics(df3)
+ False
+ >>> df1.sameSemantics(df4)
+ True
+ """
+ if not isinstance(other, DataFrame):
+ raise ValueError("other parameter should be of DataFrame; however, got %s" % type(other))
+ return self._jdf.sameSemantics(other._jdf)
+
+ @since(3.1)
+ def semanticHash(self):
+ """
+ Returns a `hashCode` for the calculation performed by the query plan of this Dataset.
Review comment:
```
Returns a hash code of the logical query plan against this :class:`DataFrame`.

.. note:: Unlike the standard hash code, the hash is calculated against the query plan
    simplified by tolerating the cosmetic differences such as attribute names.
```
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586674850
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23246/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379991667
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,58 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.1)
+ def sameSemantics(self, other):
+ """
+ Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+ therefore return same results.
+
+ .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+ such as attribute names.
+
+ .. note:: This API can compare both :class:`DataFrame`\\s very fast but can still return
+ `False` on the :class:`DataFrame` that return the same results, for instance, from
+ different plans. Such false negative semantic can be useful when caching as an example.
+
+ >>> df1 = spark.range(100)
+ >>> df2 = spark.range(100)
+ >>> df3 = spark.range(100)
+ >>> df4 = spark.range(100)
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+ True
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+ False
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+ True
+ """
Review comment:
nit:
```python
>>> df1 = spark.range(10)
>>> df2 = spark.range(10)
>>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
True
>>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id + 2))
False
>>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col0", df2.id * 2))
True
```
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379356166
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,45 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.1)
+ def sameSemantics(self, other):
+ """
Review comment:
The documentation seems to be mismatched with the Scala side. I would suggest:
```
Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
therefore return same results.
.. note:: The equality comparison here is simplified by tolerating the cosmetic differences
such as attribute names.
.. note:: This API can compare both :class:`DataFrame`\\s very fast but can still return `False` on
the :class:`DataFrame` that return the same results, for instance, from different plans. Such
false negative semantic can be useful when caching as an example.
```
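To make the two notes in this suggestion concrete, here is a toy sketch in plain Python. It is not Spark's implementation: the plan representation (a tuple of `(output_name, expression)` pairs) and the `canonicalize` and `same_semantics` helpers are assumptions made only for illustration. It shows a comparison that tolerates a cosmetic difference (the output name) and still returns a false negative for two expressions that would produce the same results.

```python
# Toy model only: a "plan" is a tuple of (output_name, expression) pairs.
# Neither this representation nor `canonicalize` is part of Spark.

def canonicalize(plan):
    """Drop output names so that only the expressions are compared."""
    return tuple(expr for _name, expr in plan)


def same_semantics(left, right):
    """Structural equality on the canonicalized plans."""
    return canonicalize(left) == canonicalize(right)


base = (("col1", ("*", "id", 2)),)
renamed = (("col0", ("*", "id", 2)),)        # same expression, different name
different = (("col1", ("+", "id", 2)),)      # different expression
equivalent = (("col1", ("+", "id", "id")),)  # same results, different plan

print(same_semantics(base, renamed))     # True
print(same_semantics(base, different))   # False
print(same_semantics(base, equivalent))  # False (a false negative)
```

The last case is the trade-off the second note describes: the check stays cheap because it never tries to prove that two differently written plans are equivalent.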
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379354632
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -3308,6 +3308,37 @@ class Dataset[T] private[sql](
files.toSet.toArray
}
+ /**
+ * Returns true when the query plan of the given Dataset will return the same results as this
+ * Dataset.
+ *
+ * Since its likely undecidable to generally determine if two given plans will produce the same
Review comment:
I would rewrite the doc as below if you guys think it's fine.
```
Returns `true` when the logical query plans inside both [[Dataset]]s are equal and
therefore return same results.
@note The equality comparison here is simplified by tolerating the cosmetic differences
such as attribute names.
@note This API can compare both [[Dataset]]s very fast but can still return `false` on
the [[Dataset]] that return the same results, for instance, from different plans. Such
false negative semantic can be useful when caching as an example.
```
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586295845
Build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-587090196
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] liangz1 commented on a change in pull request #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
liangz1 commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379226303
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,22 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.0)
+ def sameSemantics(self, other):
+ """
+ Return true when the query plan of the given :class:`DataFrame` will return the same
+ results as this :class:`DataFrame`.
+ """
+ return self._jdf.sameSemantics(other)
+
+ @since(3.0)
+ def semanticHash(self):
+ """
+
+ :return:
+ """
+ return self._jdf.semanticHash(None)
Review comment:
It shows the same error...
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on issue #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586192634
**[Test build #118417 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118417/testReport)** for PR 27565 at commit [`2cbce71`](https://github.com/apache/spark/commit/2cbce71bf1d59b2557b75535529198221d5c3f9d).
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586340172
**[Test build #118421 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118421/testReport)** for PR 27565 at commit [`1deb7a9`](https://github.com/apache/spark/commit/1deb7a9fd153324877057e849790644d0148e7ca).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379992810
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,58 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.1)
+ def sameSemantics(self, other):
+ """
+ Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+ therefore return same results.
+
+ .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+ such as attribute names.
+
+ .. note:: This API can compare both :class:`DataFrame`\\s very fast but can still return
+ `False` on the :class:`DataFrame` that return the same results, for instance, from
+ different plans. Such false negative semantic can be useful when caching as an example.
+
+ >>> df1 = spark.range(100)
+ >>> df2 = spark.range(100)
+ >>> df3 = spark.range(100)
+ >>> df4 = spark.range(100)
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+ True
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+ False
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+ True
+ """
+ if not isinstance(other, DataFrame):
+ raise ValueError("other parameter should be of DataFrame; however, got %s"
+ % type(other))
+ return self._jdf.sameSemantics(other._jdf)
+
+ @since(3.1)
+ def semanticHash(self):
+ """
+ Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+ .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+ simplified by tolerating the cosmetic differences such as attribute names.
+
+ >>> df1 = spark.range(100)
+ >>> df2 = spark.range(100)
+ >>> df3 = spark.range(100)
+ >>> df4 = spark.range(100)
+ >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+ df2.withColumn("col1", df2.id * 2).semanticHash()
+ True
+ >>> # df1.withColumn("col1", df1.id * 2).semanticHash() == \
+ df3.withColumn("col1", df3.id + 2).semanticHash() # False
+ >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+ df4.withColumn("col0", df4.id * 2).semanticHash()
+ True
Review comment:
nit:
I think you can just show a couple of skipped examples instead of comparisons, as @WeichenXu123 pointed out.
```python
>>> spark.range(10).selectExpr("id as col0").semanticHash() # doctest: +SKIP
1855039936
>>> spark.range(10).selectExpr("id as col1").semanticHash() # doctest: +SKIP
1855039936
```
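For readers unfamiliar with the directive, the sketch below uses only the standard-library `doctest` module to show what `# doctest: +SKIP` does: the example is displayed in the docstring but never executed, so its printed value is not checked. The docstring text and the `unstable_value` name are hypothetical and exist only for this demonstration.

```python
import doctest

# Hypothetical docstring: the second example calls an undefined name, so it
# would fail if doctest actually executed it.
DOCSTRING = """
>>> 1 + 1
2
>>> unstable_value()  # doctest: +SKIP
1855039936
"""

# Parse the examples out of the docstring and run them.
parser = doctest.DocTestParser()
test = parser.get_doctest(DOCSTRING, globs={}, name="skip_demo",
                          filename=None, lineno=0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)

# The skipped example is not executed, so nothing fails.
print("failed:", results.failed)
```

Skipping suits values such as hash codes, which are worth showing to the reader but which the test run should not depend on.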
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379229505
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,22 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.0)
+ def sameSemantics(self, other):
+ """
+ Return true when the query plan of the given :class:`DataFrame` will return the same
+ results as this :class:`DataFrame`.
+ """
+ return self._jdf.sameSemantics(other)
Review comment:
@WeichenXu123 is correct. It should be `self._jdf.sameSemantics(other._jdf)`.
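The reasoning behind this fix can be sketched without Spark: the Python `DataFrame` is a wrapper around a JVM-side object held in `_jdf`, and the JVM method expects that underlying object rather than the wrapper. The classes below (`FakeJavaDataset`, `WrappedFrame`) are stand-ins invented for this illustration. They are not PySpark or py4j types, and the error raised here is not the error reported in this thread.

```python
# Stand-in classes only; these are not PySpark or py4j types.

class FakeJavaDataset:
    """Plays the role of the JVM-side Dataset held in `_jdf`."""

    def __init__(self, plan):
        self.plan = plan

    def sameSemantics(self, other):
        # The "JVM" side only understands its own type.
        if not isinstance(other, FakeJavaDataset):
            raise TypeError("expected a JVM Dataset, got %s" % type(other).__name__)
        return self.plan == other.plan


class WrappedFrame:
    """Plays the role of the Python DataFrame wrapper."""

    def __init__(self, plan):
        self._jdf = FakeJavaDataset(plan)

    def sameSemantics(self, other):
        if not isinstance(other, WrappedFrame):
            raise ValueError("other parameter should be of DataFrame; however, got %s"
                             % type(other))
        # Unwrap before crossing the language boundary.
        return self._jdf.sameSemantics(other._jdf)


a = WrappedFrame("range(10)")
b = WrappedFrame("range(10)")
print(a.sameSemantics(b))  # True

try:
    a._jdf.sameSemantics(b)  # passing the wrapper itself, as in the first draft
except TypeError as exc:
    print("rejected:", exc)
```

Checking the argument type in the wrapper, as later revisions of this diff do, turns a confusing cross-language failure into a plain Python `ValueError`.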
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379992810
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,58 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.1)
+ def sameSemantics(self, other):
+ """
+ Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+ therefore return same results.
+
+ .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+ such as attribute names.
+
+ .. note:: This API can compare both :class:`DataFrame`\\s very fast but can still return
+ `False` on the :class:`DataFrame` that return the same results, for instance, from
+ different plans. Such false negative semantic can be useful when caching as an example.
+
+ >>> df1 = spark.range(100)
+ >>> df2 = spark.range(100)
+ >>> df3 = spark.range(100)
+ >>> df4 = spark.range(100)
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+ True
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+ False
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+ True
+ """
+ if not isinstance(other, DataFrame):
+ raise ValueError("other parameter should be of DataFrame; however, got %s"
+ % type(other))
+ return self._jdf.sameSemantics(other._jdf)
+
+ @since(3.1)
+ def semanticHash(self):
+ """
+ Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+ .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+ simplified by tolerating the cosmetic differences such as attribute names.
+
+ >>> df1 = spark.range(100)
+ >>> df2 = spark.range(100)
+ >>> df3 = spark.range(100)
+ >>> df4 = spark.range(100)
+ >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+ df2.withColumn("col1", df2.id * 2).semanticHash()
+ True
+ >>> # df1.withColumn("col1", df1.id * 2).semanticHash() == \
+ df3.withColumn("col1", df3.id + 2).semanticHash() # False
+ >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+ df4.withColumn("col0", df4.id * 2).semanticHash()
+ True
Review comment:
nit:
I think you can just show a couple of examples:
```python
>>> spark.range(10).selectExpr("id as col0").semanticHash() # doctest: +SKIP
1855039936
>>> spark.range(10).selectExpr("id as col1").semanticHash() # doctest: +SKIP
1855039936
```
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586809431
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23280/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379229670
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,22 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.0)
+ def sameSemantics(self, other):
+ """
+ Return true when the query plan of the given :class:`DataFrame` will return the same
+ results as this :class:`DataFrame`.
+ """
+ return self._jdf.sameSemantics(other)
+
+ @since(3.0)
+ def semanticHash(self):
+ """
Review comment:
The doc seems to be missing. Shall we add it with a doctest?
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679644
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] WeichenXu123 commented on a change in pull request #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379230709
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,22 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.0)
+ def sameSemantics(self, other):
+ """
+ Return true when the query plan of the given :class:`DataFrame` will return the same
+ results as this :class:`DataFrame`.
+ """
+ return self._jdf.sameSemantics(other)
Review comment:
Hmm, it should work. Did you build and test it correctly locally?
----------------------------------------------------------------
[GitHub] [spark] liangz1 commented on a change in pull request #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379881028
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,59 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.1)
+ def sameSemantics(self, other):
+ """
+ Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+ therefore return same results.
+
+ .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+ such as attribute names.
+
+ .. note::This API can compare both :class:`DataFrame`\\s very fast but can still return
+ `False` on the :class:`DataFrame` that return the same results, for instance, from
+ different plans. Such false negative semantic can be useful when caching as an example.
+
+ >>> df1 = spark.range(100)
+ >>> df2 = spark.range(100)
+ >>> df3 = spark.range(100)
+ >>> df4 = spark.range(100)
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+ True
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+ False
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+ True
+ """
+ if not isinstance(other, DataFrame):
+ raise ValueError("other parameter should be of DataFrame; however, got %s"
+ % type(other))
+ return self._jdf.sameSemantics(other._jdf)
+
+ @since(3.1)
+ def semanticHash(self):
+ """
+ Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+ .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+ simplified by tolerating the cosmetic differences such as attribute names.
+
+ >>> df1 = spark.range(100)
+ >>> df2 = spark.range(100)
+ >>> df3 = spark.range(100)
+ >>> df4 = spark.range(100)
+ >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+ df2.withColumn("col1", df2.id * 2).semanticHash()
+ True
+ >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+ df3.withColumn("col1", df3.id + 2).semanticHash()
+ False
Review comment:
```
Failed example:
df1.withColumn("col1", df1.id * 2).semanticHash() == df3.withColumn("col1", df3.id + 2).semanticHash()
Differences (ndiff with -expected +actual):
- False
+ True
```
Now we have another unexpected result. (Note that L2176 passed, which is expected.)
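One general point that a result like this illustrates: hash codes can collide, so two equal hash values do not prove that two plans are the same. The sketch below is an assumption-laden toy, not Spark code. `Plan`, `PlanCache`, and the deliberately weak `semantic_hash` are hypothetical, and it makes no claim about why the doctest above returned `True`. It shows a cache that uses the hash only to find candidates and confirms every hit with an equality check.

```python
# Hypothetical sketch: `Plan` and `PlanCache` are illustrative, not Spark APIs.

class Plan:
    def __init__(self, expr):
        self.expr = expr

    def semantic_hash(self):
        # Deliberately weak hash so that collisions are easy to trigger.
        return len(self.expr)

    def same_semantics(self, other):
        return self.expr == other.expr


class PlanCache:
    """Uses the hash to find candidates and equality to confirm a hit."""

    def __init__(self):
        self._buckets = {}

    def put(self, plan, value):
        self._buckets.setdefault(plan.semantic_hash(), []).append((plan, value))

    def get(self, plan):
        for candidate, value in self._buckets.get(plan.semantic_hash(), []):
            if candidate.same_semantics(plan):
                return value
        return None


times_two = Plan("id * 2")
plus_two = Plan("id + 2")

cache = PlanCache()
cache.put(times_two, "cached result")

print(times_two.semantic_hash() == plus_two.semantic_hash())  # True (collision)
print(cache.get(plus_two))                                    # None
print(cache.get(Plan("id * 2")))                              # cached result
```

Under this design a collision costs only an extra comparison and never returns the wrong cached entry.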
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676866
**[Test build #118493 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118493/testReport)** for PR 27565 at commit [`61f7ca1`](https://github.com/apache/spark/commit/61f7ca11af14d399d0e2512c51c2f37c4aa4a38f).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586948605
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23327/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586674848
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586177675
**[Test build #118412 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118412/testReport)** for PR 27565 at commit [`368e74a`](https://github.com/apache/spark/commit/368e74ade22b03882bf7e12e2fb2a2e0ac9387fd).
* This patch **fails Python style tests**.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] WeichenXu123 commented on a change in pull request #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379374685
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,50 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.sameSemantics(df2)
+        False
+        >>> df1.sameSemantics(df3)
+        False
+        >>> df1.sameSemantics(df4)
+        True
+        >>> df1.sameSemantics(df1)
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a `hashCode` for the calculation performed by the query plan of this Dataset.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.semanticHash() == df2.semanticHash()
+        False
+        >>> df1.semanticHash() == df3.semanticHash()
+        False
+        >>> df1.semanticHash() == df4.semanticHash()
+        True
Review comment:
Let's change the unit test. Don't test on DataFrames created from an in-memory list (LocalRelation); they are implemented differently in Scala and PySpark.
Our use case also does not care about the behavior of LocalRelation.
So I suggest adding a unit test on a DataFrame that is read from a data source and then filtered, i.e. a `spark.read` call followed by `where(...)`.
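A test along these lines might look like the sketch below. It assumes an active `SparkSession` passed in as `spark` and a hypothetical Parquet location containing an integer column `id`; the helper name `check_file_backed_semantics` is made up for illustration and is not part of this PR.

```python
def check_file_backed_semantics(spark, path):
    """Compare plans built on a file-backed relation instead of a LocalRelation.

    Assumptions: `spark` is an active SparkSession and `path` points at a
    readable Parquet dataset with an integer column named `id`.
    """
    # Two independently built but identical plans.
    df1 = spark.read.parquet(path).where("id > 1")
    df2 = spark.read.parquet(path).where("id > 1")
    # Same source, different filter predicate.
    df3 = spark.read.parquet(path).where("id > 2")
    return {
        "same_filter": df1.sameSemantics(df2),
        "different_filter": df1.sameSemantics(df3),
        "hash_match": df1.semanticHash() == df2.semanticHash(),
    }
```

With a real session one would expect `same_filter` and `hash_match` to be true and `different_filter` to be false, but that is worth confirming against the Spark version under test.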
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586214562
**[Test build #118421 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118421/testReport)** for PR 27565 at commit [`1deb7a9`](https://github.com/apache/spark/commit/1deb7a9fd153324877057e849790644d0148e7ca).
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on issue #27565: [WIP][SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586203075
@liangz1 it seems your branch is not synced with the current master. You should probably rebase onto the upstream master or merge it into your PR branch. It might be good to refer to "The Review Process" in https://spark.apache.org/contributing.html.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586069480
Build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-587090205
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118572/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586177694
Build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] WeichenXu123 commented on issue #27565: [SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586069714
@cloud-fan Any comments?
----------------------------------------------------------------
[GitHub] [spark] WeichenXu123 commented on a change in pull request #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379225118
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,22 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
+    @since(3.0)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+        """
+        return self._jdf.sameSemantics(other)
Review comment:
This should be `self._jdf.sameSemantics(other._jdf)`: the JVM method needs the underlying Java object, not the Python `DataFrame` wrapper.
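To illustrate why the argument has to be unwrapped, here is a minimal sketch with hypothetical stand-in classes. `FakeJavaDataFrame` and `PyDataFrame` are illustrative only; they are not py4j or PySpark classes, and the string "plans" are placeholders.

```python
class FakeJavaDataFrame:
    """Hypothetical stand-in for the JVM-side object held in `_jdf`."""

    def __init__(self, plan):
        self.plan = plan

    def sameSemantics(self, other):
        # The JVM side only accepts its own kind of object.
        if not isinstance(other, FakeJavaDataFrame):
            raise TypeError("expected a JVM DataFrame handle, got %s"
                            % type(other).__name__)
        return self.plan == other.plan


class PyDataFrame:
    """Hypothetical stand-in for the Python wrapper class."""

    def __init__(self, plan):
        self._jdf = FakeJavaDataFrame(plan)

    def sameSemantics(self, other):
        if not isinstance(other, PyDataFrame):
            raise ValueError("other parameter should be of DataFrame; however, got %s"
                             % type(other))
        # Unwrap `other` before delegating; passing the wrapper itself would fail.
        return self._jdf.sameSemantics(other._jdf)


a = PyDataFrame("scan(t) -> filter(x > 1)")
b = PyDataFrame("scan(t) -> filter(x > 1)")
print(a.sameSemantics(b))  # True
```

In this toy model, calling `a._jdf.sameSemantics(b)` with the wrapper raises `TypeError`, which mirrors the kind of failure the original line would have hit.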
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679648
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586864627
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586809431
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23280/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565: [SPARK-30791] Dataframe
add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585839043
**[Test build #118373 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118373/testReport)** for PR 27565 at commit [`103979a`](https://github.com/apache/spark/commit/103979a205318d52e49b44cbaceae9c2ca569e8b).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586193885
Build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586804194
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379229959
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,22 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
+    @since(3.0)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+        """
+        return self._jdf.sameSemantics(other)
Review comment:
Since the Python API side does not check the type, we could add an `isinstance` check that raises `ValueError`:
```python
if not isinstance(other, DataFrame):
    raise ValueError("other parameter should be of DataFrame; however, got %s" % type(other))
```
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379368127
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,50 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.sameSemantics(df2)
+        False
+        >>> df1.sameSemantics(df3)
+        False
+        >>> df1.sameSemantics(df4)
+        True
+        >>> df1.sameSemantics(df1)
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a `hashCode` for the calculation performed by the query plan of this Dataset.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.semanticHash() == df2.semanticHash()
+        False
+        >>> df1.semanticHash() == df3.semanticHash()
+        False
+        >>> df1.semanticHash() == df4.semanticHash()
+        True
Review comment:
If this is the case, we might have to document an explicit caveat that false positives are possible. For the same reason, we might also have to mark these APIs as `@Unstable` or `@Experimental`. cc @mengxr.
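One way to read this caveat for the caching use case: an integer hash can collide, so equal hashes only suggest that two plans match, and a cache should confirm with `sameSemantics` before reusing a result. A minimal sketch, assuming hypothetical names (`ToyPlan`, `PlanCache`) that are not part of Spark:

```python
class ToyPlan:
    """Hypothetical stand-in exposing the two methods discussed in this PR."""

    def __init__(self, description, forced_hash):
        self.description = description
        self._hash = forced_hash  # set by hand so a collision can be simulated

    def semanticHash(self):
        return self._hash

    def sameSemantics(self, other):
        return self.description == other.description


class PlanCache:
    """Buckets entries by semanticHash, then confirms with sameSemantics."""

    def __init__(self):
        self._buckets = {}

    def put(self, plan, value):
        self._buckets.setdefault(plan.semanticHash(), []).append((plan, value))

    def get(self, plan):
        for cached_plan, value in self._buckets.get(plan.semanticHash(), []):
            if cached_plan.sameSemantics(plan):  # guards against hash collisions
                return value
        return None


cache = PlanCache()
cache.put(ToyPlan("filter(x > 1)", forced_hash=42), "result-A")
# Same hash, different semantics: a collision that must not be served from cache.
print(cache.get(ToyPlan("filter(x > 2)", forced_hash=42)))  # None
print(cache.get(ToyPlan("filter(x > 1)", forced_hash=42)))  # result-A
```

The hash is used only to narrow the search; the equality check makes the final decision, so a collision costs a comparison rather than a wrong cache hit.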
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676502
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586215401
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586193885
Build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586931037
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23325/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586864653
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586864664
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118523/
Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586341296
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118421/
Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586202370
**[Test build #118419 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118419/testReport)** for PR 27565 at commit [`d154d6b`](https://github.com/apache/spark/commit/d154d6bb991f6dbc284645db33b49206de61e56a).
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586674793
**[Test build #118489 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118489/testReport)** for PR 27565 at commit [`a1d4ba1`](https://github.com/apache/spark/commit/a1d4ba1f33c81435da84cbeee3c7e579e5dd8061).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586215408
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23179/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565: [SPARK-30791] Dataframe
add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586068918
**[Test build #118385 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118385/testReport)** for PR 27565 at commit [`284d7ad`](https://github.com/apache/spark/commit/284d7ad3de0a15a6b6aebf92c7b9e32349607048).
----------------------------------------------------------------
[GitHub] [spark] liangz1 commented on a change in pull request #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379362763
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,50 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.sameSemantics(df2)
+        False
+        >>> df1.sameSemantics(df3)
+        False
+        >>> df1.sameSemantics(df4)
+        True
+        >>> df1.sameSemantics(df1)
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a `hashCode` for the calculation performed by the query plan of this Dataset.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.semanticHash() == df2.semanticHash()
+        False
+        >>> df1.semanticHash() == df3.semanticHash()
+        False
+        >>> df1.semanticHash() == df4.semanticHash()
+        True
Review comment:
Do you have any ideas? @HyukjinKwon
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586188200
Build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] WeichenXu123 commented on a change in pull request #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r378957996
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -3308,6 +3308,33 @@ class Dataset[T] private[sql](
files.toSet.toArray
}
+  /**
+   * Returns true when the query plan of the given Dataset will return the same results as this
+   * Dataset.
+   *
+   * Since it's likely undecidable to generally determine if two given plans will produce the same
+   * results, it is okay for this function to return false, even if the results are actually
+   * the same. Such behavior will not affect correctness, only the application of performance
+   * enhancements like caching. However, it is not acceptable to return true if the results could
+   * possibly be different.
+   *
+   * This function performs a modified version of equality that is tolerant of cosmetic
+   * differences like attribute naming and/or expression id differences.
+   *
+   * @since 3.0.0
+   */
+  @DeveloperApi
+  def sameSemantics(other: Dataset[T]): Boolean = {
Review comment:
Remove `@DeveloperApi`. This is now a user-facing API.
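
The contract in the Scaladoc above (returning false for equivalent plans is acceptable, returning true for different plans is not) is what makes these methods usable for caching. A minimal sketch of such a cache follows, assuming only the two method names from this PR, `semanticHash()` and `sameSemantics()`. `SemanticCache` and `FakeFrame` are hypothetical names used for illustration; `FakeFrame` is a stand-in so the example runs without Spark and does not reflect how a real DataFrame computes either value.

```python
from collections import defaultdict


class SemanticCache:
    """Hypothetical result cache keyed by a frame's semantic hash."""

    def __init__(self):
        # semantic hash -> list of (frame, result) pairs sharing that hash
        self._buckets = defaultdict(list)

    def get_or_compute(self, frame, compute):
        bucket = self._buckets[frame.semanticHash()]
        for cached_frame, result in bucket:
            # Equal hashes can collide, so confirm with sameSemantics
            # before reusing a cached result.
            if cached_frame.sameSemantics(frame):
                return result
        result = compute(frame)
        bucket.append((frame, result))
        return result


class FakeFrame:
    """Stand-in for a DataFrame (assumption): `plan` is an opaque plan key."""

    def __init__(self, plan):
        self.plan = plan

    def semanticHash(self):
        return hash(self.plan)

    def sameSemantics(self, other):
        return self.plan == other.plan


calls = []


def expensive(frame):
    # Record each real computation so cache hits are observable.
    calls.append(frame.plan)
    return len(calls)


cache = SemanticCache()
a = cache.get_or_compute(FakeFrame(("range", 100, "id * 2")), expensive)
b = cache.get_or_compute(FakeFrame(("range", 100, "id * 2")), expensive)
c = cache.get_or_compute(FakeFrame(("range", 100, "id + 2")), expensive)
```

In this sketch a false negative from `sameSemantics` only costs a recomputation, while a false positive would hand back the wrong cached result, which is why a hash match alone is not trusted.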
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586188200
Build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585872197
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586674850
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23246/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379977102
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,59 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+        therefore return the same results.
+
+        .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+            such as attribute names.
+
+        .. note:: This API can compare both :class:`DataFrame`\\s very fast but can still return
+            `False` on the :class:`DataFrame` that return the same results, for instance, from
+            different plans. Such false negative semantic can be useful when caching as an example.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+        True
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+        False
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+        .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+            simplified by tolerating the cosmetic differences such as attribute names.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df2.withColumn("col1", df2.id * 2).semanticHash()
+        True
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df3.withColumn("col1", df3.id + 2).semanticHash()
+        False
Review comment:
Okay, this seems to be an issue in Scala. I will open a PR.
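
As an illustration of what "tolerating the cosmetic differences such as attribute names" in the docstring above could mean, here is a toy model in plain Python. It is not Catalyst's canonicalization: the tuple-based expression trees and the helper names `canonicalize`, `semantic_hash` and `same_semantics` are assumptions made for this sketch, and the only normalization it performs is dropping alias names before comparing or hashing.

```python
def canonicalize(expr):
    """Return a copy of a toy expression tree with alias names removed."""
    if isinstance(expr, tuple):
        if expr[0] == "alias":
            # ("alias", name, child): the output name is cosmetic,
            # so keep the node but discard the name.
            _, _name, child = expr
            return ("alias", canonicalize(child))
        # Any other node: keep the operator, canonicalize the operands.
        return (expr[0],) + tuple(canonicalize(e) for e in expr[1:])
    # Leaves (column names, literals) are kept as-is in this toy model.
    return expr


def semantic_hash(expr):
    # Hash the canonical form so cosmetically different trees collide on purpose.
    return hash(canonicalize(expr))


def same_semantics(left, right):
    return canonicalize(left) == canonicalize(right)


# Toy counterparts of withColumn("col1", id * 2), withColumn("col0", id * 2)
# and withColumn("col1", id + 2).
times_two_col1 = ("alias", "col1", ("mul", ("attr", "id"), 2))
times_two_col0 = ("alias", "col0", ("mul", ("attr", "id"), 2))
plus_two_col1 = ("alias", "col1", ("add", ("attr", "id"), 2))
```

Under this model the two `id * 2` trees compare equal and hash equal despite their different output names, while the `id + 2` tree does not compare equal to either.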
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586215408
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23179/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586188210
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23174/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586931037
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23325/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585872211
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118373/
Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379229484
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -3308,6 +3308,31 @@ class Dataset[T] private[sql](
files.toSet.toArray
}
+  /**
+   * Returns true when the query plan of the given Dataset will return the same results as this
+   * Dataset.
+   *
+   * Since it's likely undecidable to generally determine if two given plans will produce the same
+   * results, it is okay for this function to return false, even if the results are actually
+   * the same. Such behavior will not affect correctness, only the application of performance
+   * enhancements like caching. However, it is not acceptable to return true if the results could
+   * possibly be different.
+   *
+   * This function performs a modified version of equality that is tolerant of cosmetic
+   * differences like attribute naming and/or expression id differences.
+   *
+   * @since 3.0.0
+   */
+  def sameSemantics(other: Dataset[T]): Boolean = {
+    queryExecution.analyzed.sameResult(other.queryExecution.analyzed)
+  }
+
+  /**
+   * Returns a `hashCode` for the calculation performed by the query plan of this Dataset. Unlike
+   * the standard `hashCode`, an attempt has been made to eliminate cosmetic differences.
+   */
Review comment:
It should have `@since` too.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-587090196
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676945
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679653
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586295845
Build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676946
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23249/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586804198
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23278/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679656
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118491/
Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586931023
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586193873
**[Test build #118417 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118417/testReport)** for PR 27565 at commit [`2cbce71`](https://github.com/apache/spark/commit/2cbce71bf1d59b2557b75535529198221d5c3f9d).
* This patch **fails Python style tests**.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] liangz1 commented on a change in pull request #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379349637
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,45 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.sameSemantics(df2)
+        False
+        >>> df1.sameSemantics(df3)
+        False
+        >>> df1.sameSemantics(df4)
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s" % type(other))
Review comment:
Thanks! I'll try it.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586948085
**[Test build #118572 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118572/testReport)** for PR 27565 at commit [`0e74940`](https://github.com/apache/spark/commit/0e749403b4bb127341849827cc95313752b6c715).
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586826118
I am okay otherwise. I will leave it to @WeichenXu123 and @mengxr.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585821610
Can one of the admins verify this patch?
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379228900
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -3308,6 +3308,31 @@ class Dataset[T] private[sql](
files.toSet.toArray
}
+  /**
+   * Returns true when the query plan of the given Dataset will return the same results as this
+   * Dataset.
+   *
+   * Since its likely undecidable to generally determine if two given plans will produce the same
+   * results, it is okay for this function to return false, even if the results are actually
+   * the same. Such behavior will not affect correctness, only the application of performance
+   * enhancements like caching. However, it is not acceptable to return true if the results could
+   * possibly be different.
+   *
+   * This function performs a modified version of equality that is tolerant of cosmetic
+   * differences like attribute naming and or expression id differences.
+   *
+   * @since 3.0.0
+   */
+  def sameSemantics(other: Dataset[T]): Boolean = {
+    queryExecution.analyzed.sameResult(other.queryExecution.analyzed)
+  }
+
+  /**
+   * Returns a `hashCode` for the calculation performed by the query plan of this Dataset. Unlike
+   * the standard `hashCode`, an attempt has been made to eliminate cosmetic differences.
+   */
+  def semanticHash: Int = queryExecution.analyzed.semanticHash()
Review comment:
I would make it a proper method, `semanticHash()`, so that the same API usage works in PySpark.
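For illustration only, here is a minimal Python sketch of how the two methods could be combined for the caching use case named in the PR description: `semanticHash()` selects a bucket, and `sameSemantics()` confirms the match, since two different plans may share a hash. `PlanStub` and `SemanticCache` are hypothetical stand-ins written for this sketch; they are not part of Spark, and no Spark session is involved.

```python
class PlanStub:
    """Hypothetical stand-in for a DataFrame; `plan` plays the role of a normalized query plan."""

    def __init__(self, plan):
        self.plan = plan

    def semanticHash(self):
        # Equal plans hash equally; unequal plans may still collide.
        return hash(self.plan)

    def sameSemantics(self, other):
        return self.plan == other.plan


class SemanticCache:
    """Cache keyed by semanticHash() and confirmed with sameSemantics()."""

    def __init__(self):
        self._buckets = {}  # hash value -> list of (frame, cached value)

    def get_or_compute(self, frame, compute):
        bucket = self._buckets.setdefault(frame.semanticHash(), [])
        for cached_frame, value in bucket:
            # The hash only narrows the search; equality is checked explicitly.
            if cached_frame.sameSemantics(frame):
                return value
        value = compute(frame)
        bucket.append((frame, value))
        return value


calls = []  # records which plans were actually computed


def expensive(frame):
    calls.append(frame.plan)
    return len(frame.plan)


cache = SemanticCache()
a = PlanStub(("scan", "t1"))
b = PlanStub(("scan", "t1"))  # same plan as `a`, different object
c = PlanStub(("scan", "t2"))

first = cache.get_or_compute(a, expensive)
second = cache.get_or_compute(b, expensive)  # served from the cache
third = cache.get_or_compute(c, expensive)
```

Because both methods are called with parentheses here, the same calling code would read identically against a Scala-backed object exposed through PySpark, which is the point of the review comment above.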
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585820740
Can one of the admins verify this patch?
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586193899
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118417/
Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on issue #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586068918
**[Test build #118385 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118385/testReport)** for PR 27565 at commit [`284d7ad`](https://github.com/apache/spark/commit/284d7ad3de0a15a6b6aebf92c7b9e32349607048).
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379342918
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,45 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.sameSemantics(df2)
+        False
+        >>> df1.sameSemantics(df3)
+        False
+        >>> df1.sameSemantics(df4)
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s" % type(other))
Review comment:
Shall we also add a test that checks the error message? It could be added as below (not tested):
```diff
diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py
index d738449799b..942cd4b4b0e 100644
--- a/python/pyspark/sql/tests/test_dataframe.py
+++ b/python/pyspark/sql/tests/test_dataframe.py
@@ -782,6 +782,11 @@ class DataFrameTests(ReusedSQLTestCase):
break
self.assertEqual(df.take(8), result)
+    def test_same_semantics_error(self):
+        with QuietTest(self.sc):
+            with self.assertRaisesRegexp(ValueError, "should be of DataFrame.*int"):
+                self.spark.range(10).sameSemantics(1)
+
```
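The same check can be tried without a Spark session. The sketch below is an assumption-laden stand-in, not the PySpark test itself: the `DataFrame` class here is a hypothetical stub that reproduces only the type check from the patch, and the test uses the standard library's `assertRaisesRegex`, the current name for the deprecated `assertRaisesRegexp` alias.

```python
import unittest


class DataFrame:
    """Hypothetical stub that reproduces only the argument check from the patch."""

    def sameSemantics(self, other):
        if not isinstance(other, DataFrame):
            raise ValueError(
                "other parameter should be of DataFrame; however, got %s" % type(other))
        return True  # placeholder; this stub does not compare query plans


class SameSemanticsErrorTest(unittest.TestCase):
    def test_same_semantics_error(self):
        # type(1) renders as "<class 'int'>", so the pattern's ".*int" tail matches.
        with self.assertRaisesRegex(ValueError, "should be of DataFrame.*int"):
            DataFrame().sameSemantics(1)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(SameSemanticsErrorTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The regular expression matches on both the fixed part of the message and the reported type, so the test would also fail if the message stopped naming the offending type.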
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679649
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118489/
Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586809430
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676866
**[Test build #118493 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118493/testReport)** for PR 27565 at commit [`61f7ca1`](https://github.com/apache/spark/commit/61f7ca11af14d399d0e2512c51c2f37c4aa4a38f).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586069483
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23142/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586948085
**[Test build #118572 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118572/testReport)** for PR 27565 at commit [`0e74940`](https://github.com/apache/spark/commit/0e749403b4bb127341849827cc95313752b6c715).
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on issue #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586176641
**[Test build #118412 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118412/testReport)** for PR 27565 at commit [`368e74a`](https://github.com/apache/spark/commit/368e74ade22b03882bf7e12e2fb2a2e0ac9387fd).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585839569
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23130/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586069480
Build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676945
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676946
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23249/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679653
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586674848
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586341296
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118421/
Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586295857
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118419/
Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585839569
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23130/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565: [SPARK-30791] Dataframe
add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585872019
**[Test build #118373 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118373/testReport)** for PR 27565 at commit [`103979a`](https://github.com/apache/spark/commit/103979a205318d52e49b44cbaceae9c2ca569e8b).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586942256
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] WeichenXu123 closed pull request #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
WeichenXu123 closed pull request #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586864653
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676502
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679644
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586864627
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-587090205
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118572/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379228610
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,22 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.0)
Review comment:
We are in the code freeze and the RC will start next week. Is there a reason to target Spark 3.0? cc @mengxr.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586110262
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118385/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] WeichenXu123 commented on a change in pull request #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379267923
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,22 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.0)
Review comment:
@HyukjinKwon We can mark this as a developer API.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586176641
**[Test build #118412 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118412/testReport)** for PR 27565 at commit [`368e74a`](https://github.com/apache/spark/commit/368e74ade22b03882bf7e12e2fb2a2e0ac9387fd).
----------------------------------------------------------------
[GitHub] [spark] liangz1 commented on a change in pull request #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379882384
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,59 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+        therefore return same results.
+
+        .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+            such as attribute names.
+
+        .. note:: This API can compare both :class:`DataFrame`\\s very fast but can still return
+            `False` on the :class:`DataFrame` that return the same results, for instance, from
+            different plans. Such false negative semantic can be useful when caching as an example.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+        True
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+        False
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+        .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+            simplified by tolerating the cosmetic differences such as attribute names.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df2.withColumn("col1", df2.id * 2).semanticHash()
+        True
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df3.withColumn("col1", df3.id + 2).semanticHash()
+        False
Review comment:
I see the same behavior for a DataFrame loaded with `spark.read.load()`: both expressions produce the same hash.
```
>>> df4 = spark.read.load(csv_file_path, format="csv", inferSchema="true", header="true")
>>> df4.schema
StructType(List(StructField(bool_col,BooleanType,true),StructField(float_col,DoubleType,true),StructField(double_col,DoubleType,true),StructField(int_col,IntegerType,true),StructField(long_col,IntegerType,true)))
>>> df4.withColumn("col1", df4.int_col * 2).semanticHash()
-1746346451
>>> df4.withColumn("col1", df4.int_col + 2).semanticHash()
-1746346451
```
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-587089410
**[Test build #118572 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118572/testReport)** for PR 27565 at commit [`0e74940`](https://github.com/apache/spark/commit/0e749403b4bb127341849827cc95313752b6c715).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586942256
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586809253
**[Test build #118525 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118525/testReport)** for PR 27565 at commit [`af476cb`](https://github.com/apache/spark/commit/af476cbd42ec491e9a860be4fcd66ba8c49256a4).
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586214562
**[Test build #118421 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118421/testReport)** for PR 27565 at commit [`1deb7a9`](https://github.com/apache/spark/commit/1deb7a9fd153324877057e849790644d0148e7ca).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586177704
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118412/
Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679645
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118493/
Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] WeichenXu123 commented on a change in pull request #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379969968
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,59 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+        therefore return same results.
+
+        .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+            such as attribute names.
+
+        .. note:: This API can compare both :class:`DataFrame`\\s very fast but can still return
+            `False` on the :class:`DataFrame` that return the same results, for instance, from
+            different plans. Such false negative semantic can be useful when caching as an example.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+        True
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+        False
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+        .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+            simplified by tolerating the cosmetic differences such as attribute names.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df2.withColumn("col1", df2.id * 2).semanticHash()
+        True
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df3.withColumn("col1", df3.id + 2).semanticHash()
+        False
Review comment:
@liangz1 I think you should check correctness via `df.sameSemantics`, not via the hash, because hashes can collide.
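To illustrate the point about collisions, here is a minimal sketch of a cache that uses the hash only to pick a bucket and then confirms the match with an equality check. The `SemanticCache` class and its `hash_of` / `same` parameters are hypothetical names made up for this example and are not part of the PR; plain strings with `len` as a deliberately weak hash stand in for DataFrames so the snippet runs without Spark.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Optional, Tuple


class SemanticCache:
    """Hypothetical cache: the hash selects a bucket, the equality check confirms the hit."""

    def __init__(self, hash_of: Callable[[Any], int], same: Callable[[Any, Any], bool]):
        self._hash_of = hash_of
        self._same = same
        self._buckets: Dict[int, List[Tuple[Any, Any]]] = defaultdict(list)

    def get(self, key: Any) -> Optional[Any]:
        # Equal hashes alone are not trusted; every candidate is confirmed with `same`.
        for cached_key, value in self._buckets.get(self._hash_of(key), []):
            if self._same(cached_key, key):
                return value
        return None

    def put(self, key: Any, value: Any) -> None:
        bucket = self._buckets[self._hash_of(key)]
        for i, (cached_key, _) in enumerate(bucket):
            if self._same(cached_key, key):
                bucket[i] = (cached_key, value)  # overwrite the existing entry
                return
        bucket.append((key, value))


# Stand-in keys: both strings have length 6, so they collide under `len`.
cache = SemanticCache(hash_of=len, same=lambda a, b: a == b)
cache.put("id * 2", "result A")
print(cache.get("id * 2"))  # result A
print(cache.get("id + 2"))  # None (same bucket, but the equality check rejects it)
```

With the methods proposed in this PR, one would presumably construct it as `SemanticCache(lambda df: df.semanticHash(), lambda a, b: a.sameSemantics(b))`; that wiring is an assumption and has not been run against Spark here.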
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586803973
**[Test build #118523 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118523/testReport)** for PR 27565 at commit [`65c5210`](https://github.com/apache/spark/commit/65c5210edeba283253986978c3fc5fde68129f71).
----------------------------------------------------------------
[GitHub] [spark] WeichenXu123 commented on a change in pull request #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379374685
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,50 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.sameSemantics(df2)
+        False
+        >>> df1.sameSemantics(df3)
+        False
+        >>> df1.sameSemantics(df4)
+        True
+        >>> df1.sameSemantics(df1)
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a `hashCode` for the calculation performed by the query plan of this Dataset.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.semanticHash() == df2.semanticHash()
+        False
+        >>> df1.semanticHash() == df3.semanticHash()
+        False
+        >>> df1.semanticHash() == df4.semanticHash()
+        True
Review comment:
Let's change the unit test. Don't test on DataFrames created from an in-memory list (LocalRelation); their implementation differs between Scala and PySpark.
Our use case also does not care about the behavior of LocalRelation.
So I suggest adding a unit test that starts from `spark.range(100)` and applies some equivalent transforms to it, to check whether `sameSemantics` returns True.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585839557
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586294634
**[Test build #118419 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118419/testReport)** for PR 27565 at commit [`d154d6b`](https://github.com/apache/spark/commit/d154d6bb991f6dbc284645db33b49206de61e56a).
* This patch **fails PySpark unit tests**.
* This patch **does not merge cleanly**.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586110260
Build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586203260
Build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] mengxr commented on a change in pull request #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
mengxr commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379272784
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,22 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.0)
Review comment:
We don't need it for 3.0.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586340929
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586803973
**[Test build #118523 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118523/testReport)** for PR 27565 at commit [`65c5210`](https://github.com/apache/spark/commit/65c5210edeba283253986978c3fc5fde68129f71).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586177603
Build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585872197
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586110260
Build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] WeichenXu123 commented on a change in pull request #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379225187
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,22 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.0)
+ def sameSemantics(self, other):
+ """
+ Return true when the query plan of the given :class:`DataFrame` will return the same
+ results as this :class:`DataFrame`.
+ """
+ return self._jdf.sameSemantics(other)
+
+ @since(3.0)
+ def semanticHash(self):
+ """
+
+ :return:
+ """
+ return self._jdf.semanticHash(None)
Review comment:
This should be `self._jdf.semanticHash()`; the method takes no arguments.
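For illustration only, here is a minimal pure-Python sketch of the wrapper pattern under discussion. `FakeJavaDataset` and `DataFrameSketch` are hypothetical stand-ins for the Py4J-backed `_jdf` object and the PySpark `DataFrame`; they are not Spark classes. The sketch assumes the JVM-side `semanticHash` takes no arguments and that `sameSemantics` expects the unwrapped `_jdf` of the other DataFrame.

```python
class FakeJavaDataset:
    """Hypothetical stand-in for the JVM Dataset reached through `_jdf`."""

    def __init__(self, plan):
        self.plan = plan

    def semanticHash(self):
        # Takes no arguments, so calling semanticHash(None) is an error.
        return hash(self.plan)

    def sameSemantics(self, other):
        # Expects another JVM-side dataset, not a Python wrapper.
        return self.plan == other.plan


class DataFrameSketch:
    """Hypothetical Python wrapper that delegates to `_jdf`."""

    def __init__(self, jdf):
        self._jdf = jdf

    def semanticHash(self):
        # Forward the call without extra arguments.
        return self._jdf.semanticHash()

    def sameSemantics(self, other):
        if not isinstance(other, DataFrameSketch):
            raise ValueError(
                "other parameter should be a DataFrameSketch; however, got %s"
                % type(other))
        # Unwrap `other` so the JVM-side object is what gets passed along.
        return self._jdf.sameSemantics(other._jdf)


left = DataFrameSketch(FakeJavaDataset(("range", 100)))
right = DataFrameSketch(FakeJavaDataset(("range", 100)))
print(left.sameSemantics(right))
print(left.semanticHash() == right.semanticHash())
```

In this stand-in, passing `None` to `semanticHash` raises a Python `TypeError`; in real PySpark the equivalent mistake would likely surface through Py4J instead.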
----------------------------------------------------------------
[GitHub] [spark] liangz1 commented on a change in pull request #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379977002
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,59 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.1)
+ def sameSemantics(self, other):
+ """
+ Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+ therefore return same results.
+
+ .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+ such as attribute names.
+
+ .. note::This API can compare both :class:`DataFrame`\\s very fast but can still return
+ `False` on the :class:`DataFrame` that return the same results, for instance, from
+ different plans. Such false negative semantic can be useful when caching as an example.
+
+ >>> df1 = spark.range(100)
+ >>> df2 = spark.range(100)
+ >>> df3 = spark.range(100)
+ >>> df4 = spark.range(100)
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+ True
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+ False
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+ True
+ """
+ if not isinstance(other, DataFrame):
+ raise ValueError("other parameter should be of DataFrame; however, got %s"
+ % type(other))
+ return self._jdf.sameSemantics(other._jdf)
+
+ @since(3.1)
+ def semanticHash(self):
+ """
+ Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+ .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+ simplified by tolerating the cosmetic differences such as attribute names.
+
+ >>> df1 = spark.range(100)
+ >>> df2 = spark.range(100)
+ >>> df3 = spark.range(100)
+ >>> df4 = spark.range(100)
+ >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+ df2.withColumn("col1", df2.id * 2).semanticHash()
+ True
+ >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+ df3.withColumn("col1", df3.id + 2).semanticHash()
+ False
Review comment:
I am concerned that the hash collision rate is too high, which could contradict people's expectations. Typically, if two messages (in this case, DataFrames) differ even subtly, their hashes should differ. Do you think we should add a warning in the docstring?
@HyukjinKwon @WeichenXu123 @mengxr
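The concern can be made concrete with a toy model. The sketch below is not Spark's canonicalization; it is a simplified, hypothetical stand-in in which a plan is a nested tuple and `canonicalize` replaces output column names with positional placeholders. Under that assumption, two plans that differ only in a column name hash identically, which is the behaviour a reader used to ordinary hash codes might not expect.

```python
def canonicalize(plan):
    """Replace output column names with positional placeholders (toy model)."""
    op, columns, child = plan
    normalized = tuple(
        ("col#%d" % i, expr) for i, (_name, expr) in enumerate(columns)
    )
    return (op, normalized, child)


def semantic_hash(plan):
    # Hash the canonical form, so cosmetic differences do not affect the result.
    return hash(canonicalize(plan))


def same_semantics(left, right):
    return canonicalize(left) == canonicalize(right)


def with_column(name, expr):
    # Hypothetical stand-in for spark.range(100).withColumn(name, expr).
    return ("project", ((name, expr),), ("range", 100))


times_two_col1 = with_column("col1", ("mul", "id", 2))
times_two_col0 = with_column("col0", ("mul", "id", 2))
plus_two_col1 = with_column("col1", ("add", "id", 2))

# Same expression, different column name: treated as the same plan.
print(semantic_hash(times_two_col1) == semantic_hash(times_two_col0))
# Different expression: the plans are not considered equal.
print(same_semantics(times_two_col1, plus_two_col1))
```

In this model the equal hashes for `col0` and `col1` are by design rather than accidental collisions, which is the distinction a docstring warning would need to spell out.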
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679656
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118491/
Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379229413
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
##########
@@ -1899,6 +1899,21 @@ class DatasetSuite extends QueryTest
val e = intercept[AnalysisException](spark.range(1).tail(-1))
e.getMessage.contains("tail expression must be equal to or greater than 0")
}
+
+ test("sameSemantics and semanticHash work") {
Review comment:
Shall we add a `SPARK-30791:` prefix to the test name, even though this is a feature API, per the "Pull Request" section at https://spark.apache.org/contributing.html?
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586177612
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23169/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679648
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586188210
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23174/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on issue #27565: [SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585839043
**[Test build #118373 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118373/testReport)** for PR 27565 at commit [`103979a`](https://github.com/apache/spark/commit/103979a205318d52e49b44cbaceae9c2ca569e8b).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586948605
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23327/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586934216
**[Test build #118571 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118571/testReport)** for PR 27565 at commit [`0e74940`](https://github.com/apache/spark/commit/0e749403b4bb127341849827cc95313752b6c715).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586177694
Build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586192634
**[Test build #118417 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118417/testReport)** for PR 27565 at commit [`2cbce71`](https://github.com/apache/spark/commit/2cbce71bf1d59b2557b75535529198221d5c3f9d).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586177704
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118412/
Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585820740
Can one of the admins verify this patch?
----------------------------------------------------------------
[GitHub] [spark] WeichenXu123 commented on issue #27565: [SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
WeichenXu123 commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585835273
ok to test
----------------------------------------------------------------
[GitHub] [spark] WeichenXu123 edited a comment on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
WeichenXu123 edited a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-587245610
Merged to master. Thanks!
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586674793
**[Test build #118489 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118489/testReport)** for PR 27565 at commit [`a1d4ba1`](https://github.com/apache/spark/commit/a1d4ba1f33c81435da84cbeee3c7e579e5dd8061).
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379991667
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,58 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.1)
+ def sameSemantics(self, other):
+ """
+ Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+ therefore return same results.
+
+ .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+ such as attribute names.
+
+ .. note:: This API can compare both :class:`DataFrame`\\s very fast but can still return
+ `False` on the :class:`DataFrame` that return the same results, for instance, from
+ different plans. Such false negative semantic can be useful when caching as an example.
+
+ >>> df1 = spark.range(100)
+ >>> df2 = spark.range(100)
+ >>> df3 = spark.range(100)
+ >>> df4 = spark.range(100)
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+ True
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+ False
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+ True
+ """
Review comment:
nit:
```python
>>> df1 = spark.range(100)
>>> df2 = spark.range(100)
>>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
True
>>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id + 2))
False
>>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col0", df2.id * 2))
True
```
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586069483
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23142/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379229633
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,22 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.0)
+ def sameSemantics(self, other):
+ """
+ Return true when the query plan of the given :class:`DataFrame` will return the same
+ results as this :class:`DataFrame`.
+ """
Review comment:
Can you add a doctest too?
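As a rough sketch of what such a doctest involves, the example below uses only the standard-library `doctest` module. `plans_match` is a hypothetical toy function standing in for `sameSemantics`, and `run_docstring` is a helper introduced here for illustration; neither is part of PySpark, and Spark's own doctests run against a live `spark` session rather than plain tuples.

```python
import doctest


def plans_match(left, right):
    """Return True when two toy plans are equal.

    >>> plans_match(("mul", "id", 2), ("mul", "id", 2))
    True
    >>> plans_match(("mul", "id", 2), ("add", "id", 2))
    False
    """
    return left == right


def run_docstring(func):
    """Run the examples in func's docstring; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    # Expose the function under its own name so the examples can call it.
    test = parser.get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)


result = run_docstring(plans_match)
print(result.failed, result.attempted)
```

The examples in the docstring double as documentation and as a regression test, which is why a doctest is a natural fit for a small API like this one.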
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679628
**[Test build #118491 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118491/testReport)** for PR 27565 at commit [`ddba494`](https://github.com/apache/spark/commit/ddba494405ea7ce79e03a8e93f97cd64a3a2acfc).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379982400
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,59 @@ def transform(self, func):
"should have been DataFrame." % type(result)
return result
+ @since(3.1)
+ def sameSemantics(self, other):
+ """
+ Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+ therefore return same results.
+
+ .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+ such as attribute names.
+
+ .. note::This API can compare both :class:`DataFrame`\\s very fast but can still return
+ `False` on the :class:`DataFrame` that return the same results, for instance, from
+ different plans. Such false negative semantic can be useful when caching as an example.
+
+ >>> df1 = spark.range(100)
+ >>> df2 = spark.range(100)
+ >>> df3 = spark.range(100)
+ >>> df4 = spark.range(100)
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+ True
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+ False
+ >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+ True
+ """
+ if not isinstance(other, DataFrame):
+ raise ValueError("other parameter should be of DataFrame; however, got %s"
+ % type(other))
+ return self._jdf.sameSemantics(other._jdf)
+
+ @since(3.1)
+ def semanticHash(self):
+ """
+ Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+ .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+ simplified by tolerating the cosmetic differences such as attribute names.
+
+ >>> df1 = spark.range(100)
+ >>> df2 = spark.range(100)
+ >>> df3 = spark.range(100)
+ >>> df4 = spark.range(100)
+ >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+ df2.withColumn("col1", df2.id * 2).semanticHash()
+ True
+ >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+ df3.withColumn("col1", df3.id + 2).semanticHash()
+ False
Review comment:
Sure, PR: https://github.com/apache/spark/pull/27601 and the JIRA: https://issues.apache.org/jira/browse/SPARK-30847
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679649
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118489/
Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586809430
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586864080
**[Test build #118525 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118525/testReport)** for PR 27565 at commit [`af476cb`](https://github.com/apache/spark/commit/af476cbd42ec491e9a860be4fcd66ba8c49256a4).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676449
**[Test build #118491 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118491/testReport)** for PR 27565 at commit [`ddba494`](https://github.com/apache/spark/commit/ddba494405ea7ce79e03a8e93f97cd64a3a2acfc).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676503
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23248/
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586864635
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118525/
----------------------------------------------------------------
[GitHub] [spark] liangz1 commented on a change in pull request #27565:
[WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379289223
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2153,6 +2153,22 @@ def transform(self, func):
             "should have been DataFrame." % type(result)
         return result
+    @since(3.0)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+        """
+        return self._jdf.sameSemantics(other)
Review comment:
The Scala test "sameSemantics and semanticHash work" passes when I run it in sbt. For the Python side, I uninstalled and reinstalled pyspark with `python setup.py install` in the virtualenv. Is that the right procedure?
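
One thing that may be worth checking alongside the install routine: the wrapper in the diff above forwards `other`, a Python `DataFrame`, straight to the JVM method, whereas the JVM side presumably expects the underlying Dataset handle (`other._jdf`). The sketch below illustrates that unwrapping step under that assumption. `FakeJavaDataset` and `ToyDataFrame` are hypothetical stand-ins so the example runs without Spark or Py4J; they do not reproduce the real classes.

```python
class FakeJavaDataset:
    """Hypothetical stand-in for the JVM Dataset handle stored in DataFrame._jdf."""

    def __init__(self, plan):
        self.plan = plan

    def sameSemantics(self, other):
        # Assumption: the JVM method only accepts another JVM Dataset handle.
        if not isinstance(other, FakeJavaDataset):
            raise TypeError(
                "expected a JVM Dataset handle, got %s" % type(other).__name__)
        return self.plan == other.plan


class ToyDataFrame:
    """Toy Python wrapper that holds a JVM handle in _jdf."""

    def __init__(self, jdf):
        self._jdf = jdf

    def sameSemantics(self, other):
        if not isinstance(other, ToyDataFrame):
            raise ValueError(
                "other parameter should be a DataFrame; got %s" % type(other))
        # Unwrap the Python object before crossing into the JVM.
        return self._jdf.sameSemantics(other._jdf)
```

In this toy setup, passing the wrapper itself to the handle raises `TypeError`, while passing `other._jdf` returns a boolean.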
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791]
Dataframe add sameSemantics and sementicHash method
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586203260
Build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27565:
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash'
methods in Dataset
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679630
**[Test build #118493 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118493/testReport)** for PR 27565 at commit [`61f7ca1`](https://github.com/apache/spark/commit/61f7ca11af14d399d0e2512c51c2f37c4aa4a38f).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586864635
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118525/
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565:
[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods
in Dataset
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586931023
Merged build finished. Test PASSed.
----------------------------------------------------------------