
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/02/13 15:37:26 UTC

[GitHub] [spark] liangz1 opened a new pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

liangz1 opened a new pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565
 
 
   ### What changes were proposed in this pull request?
   This PR adds two `@DeveloperApi` methods, `sameSemantics` and `semanticHash`, to the `Dataset[T]` class. Both simply expose existing lower-level methods.
   
   
   ### Why are the changes needed?
   These methods are useful for checking whether two DataFrames are the same when implementing DataFrame caching in Python, and for obtaining a hash-based ID for a DataFrame. Wrapping the lower-level APIs makes them easier to use.
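
The caching use case described above could look roughly like the sketch below. It is only an illustration under stated assumptions: `PlanCache` and `FakeFrame` are hypothetical names that do not appear in this PR, and `FakeFrame` is a pure-Python stand-in that merely exposes `semanticHash()` and `sameSemantics()` so the example runs without Spark. The hash selects a bucket, and `sameSemantics` confirms the match, since two different plans may share a hash code.

```python
class PlanCache:
    """Toy cache: bucket entries by semanticHash(), confirm with sameSemantics()."""

    def __init__(self):
        self._buckets = {}  # semantic hash -> list of (frame, cached value)

    def get(self, frame):
        # Only frames in the same hash bucket can possibly match.
        for cached_frame, value in self._buckets.get(frame.semanticHash(), []):
            if cached_frame.sameSemantics(frame):
                return value
        return None

    def put(self, frame, value):
        self._buckets.setdefault(frame.semanticHash(), []).append((frame, value))


class FakeFrame:
    """Hypothetical stand-in for a DataFrame; `plan` is any hashable description."""

    def __init__(self, plan):
        self.plan = plan

    def semanticHash(self):
        return hash(self.plan)

    def sameSemantics(self, other):
        return self.plan == other.plan


cache = PlanCache()
cache.put(FakeFrame(("range", 100)), "materialized-result")
hit = cache.get(FakeFrame(("range", 100)))   # equal plan, different object
miss = cache.get(FakeFrame(("range", 200)))  # different plan
```

Because `sameSemantics` is allowed to return `False` for plans that happen to produce equal results, a cache built this way can miss but should not return a wrong entry.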
   
   ### Does this PR introduce any user-facing change?
   ```
   scala> val df1 = Seq((1,2),(4,5)).toDF("col1", "col2")
   df1: org.apache.spark.sql.DataFrame = [col1: int, col2: int]
   
   scala> val df2 = Seq((1,2),(4,5)).toDF("col1", "col2")
   df2: org.apache.spark.sql.DataFrame = [col1: int, col2: int]
   
   scala> val df3 = Seq((0,2),(4,5)).toDF("col1", "col2")
   df3: org.apache.spark.sql.DataFrame = [col1: int, col2: int]
   
   scala> val df4 = Seq((0,2),(4,5)).toDF("col0", "col2")
   df4: org.apache.spark.sql.DataFrame = [col0: int, col2: int]
   
   scala> df1.semanticHash
   res0: Int = 594427822
   
   scala> df2.semanticHash
   res1: Int = 594427822
   
   scala> df1.sameSemantics(df2)
   res2: Boolean = true
   
   scala> df1.sameSemantics(df3)
   res3: Boolean = false
   
   scala> df3.semanticHash
   res4: Int = -1592702048
   
   scala> df4.semanticHash
   res5: Int = -1592702048
   
   scala> df4.sameSemantics(df3)
   res6: Boolean = true
   ```
   
   
   ### How was this patch tested?
   The underlying lower-level API `sameResult` is tested in `org.apache.spark.sql.catalyst.plans.SameResultSuite`. `semanticHash` just uses the `hashCode`, so a dedicated test may not be necessary.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586109523
 
 
   **[Test build #118385 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118385/testReport)** for PR 27565 at commit [`284d7ad`](https://github.com/apache/spark/commit/284d7ad3de0a15a6b6aebf92c7b9e32349607048).
    * This patch passes all tests.
    * This patch **does not merge cleanly**.
    * This patch adds no public classes.


[GitHub] [spark] SparkQA removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586202370
 
 
   **[Test build #118419 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118419/testReport)** for PR 27565 at commit [`d154d6b`](https://github.com/apache/spark/commit/d154d6bb991f6dbc284645db33b49206de61e56a).


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676503
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23248/
   Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586177603
 
 
   Build finished. Test PASSed.


[GitHub] [spark] WeichenXu123 commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

WeichenXu123 commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-587245610
 
 
   Merged to master and branch-3.0.


[GitHub] [spark] WeichenXu123 edited a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

WeichenXu123 edited a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586069714
 
 
   @cloud-fan @HyukjinKwon Any comments?


[GitHub] [spark] SparkQA commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586942207
 
 
   **[Test build #118571 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118571/testReport)** for PR 27565 at commit [`0e74940`](https://github.com/apache/spark/commit/0e749403b4bb127341849827cc95313752b6c715).
    * This patch **fails to generate documentation**.
    * This patch merges cleanly.
    * This patch adds no public classes.


[GitHub] [spark] SparkQA removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586809253
 
 
   **[Test build #118525 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118525/testReport)** for PR 27565 at commit [`af476cb`](https://github.com/apache/spark/commit/af476cbd42ec491e9a860be4fcd66ba8c49256a4).


[GitHub] [spark] liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379881773
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,59 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+        therefore return same results.
+
+        .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+            such as attribute names.
+
 +        .. note:: This API can compare both :class:`DataFrame`\\s very fast but can still return
+            `False` on the :class:`DataFrame` that return the same results, for instance, from
+            different plans. Such false negative semantic can be useful when caching as an example.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+        True
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+        False
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+        .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+            simplified by tolerating the cosmetic differences such as attribute names.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df2.withColumn("col1", df2.id * 2).semanticHash()
+        True
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df3.withColumn("col1", df3.id + 2).semanticHash()
+        False
 
 Review comment:
   More tests:
   ```
   >>> df1=spark.range(100)
   >>> df2=spark.range(100)
   >>> df3=spark.range(100)
   >>> df11=df1.withColumn("col1", df1.id +1)
   >>> df21=df2.withColumn("col1", df2.id -1)
   >>> df31=df3.withColumn("col1", df3.id *2)
   >>> df32=df3.withColumn("col1", df3.id +2)
   >>> df33=df3.withColumn("col1", df3.id /2)
   >>> df34=df3.withColumn("col1", df3.id -2)
   
   >>> df11.semanticHash()
   1855039936
   >>> df21.semanticHash()
   1855039936
   
   >>> df31.semanticHash()
   -1719131362
   >>> df32.semanticHash()
   -1719131362
   
   >>> df32.semanticHash()
   -1719131362
   >>> df34.semanticHash()
   -706037631
   ```


[GitHub] [spark] liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379362550
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,50 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.sameSemantics(df2)
+        False
+        >>> df1.sameSemantics(df3)
+        False
+        >>> df1.sameSemantics(df4)
+        True
+        >>> df1.sameSemantics(df1)
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a `hashCode` for the calculation performed by the query plan of this Dataset.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.semanticHash() == df2.semanticHash()
+        False
+        >>> df1.semanticHash() == df3.semanticHash()
+        False
+        >>> df1.semanticHash() == df4.semanticHash()
+        True
 
 Review comment:
   Currently, this test fails. I don't know why only the same object produces the same hash while every other case seems to produce a different one. The Scala-side tests all pass.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585821610
 
 
   Can one of the admins verify this patch?


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586948594
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] SparkQA commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586864075
 
 
   **[Test build #118523 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118523/testReport)** for PR 27565 at commit [`65c5210`](https://github.com/apache/spark/commit/65c5210edeba283253986978c3fc5fde68129f71).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586864664
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118523/
   Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586177612
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23169/
   Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585872211
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118373/
   Test FAILed.


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379358950
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
 ##########
 @@ -3308,6 +3308,37 @@ class Dataset[T] private[sql](
     files.toSet.toArray
   }
 
+  /**
+   * Returns true when the query plan of the given Dataset will return the same results as this
+   * Dataset.
+   *
 +   * Since it's likely undecidable to generally determine if two given plans will produce the same
+   * results, it is okay for this function to return false, even if the results are actually
+   * the same.  Such behavior will not affect correctness, only the application of performance
+   * enhancements like caching.  However, it is not acceptable to return true if the results could
+   * possibly be different.
+   *
+   * This function performs a modified version of equality that is tolerant of cosmetic
 +   * differences like attribute naming and/or expression id differences.
+   *
+   * @since 3.1.0
+   */
+  @DeveloperApi
+  def sameSemantics(other: Dataset[T]): Boolean = {
+    queryExecution.analyzed.sameResult(other.queryExecution.analyzed)
+  }
+
+  /**
+   * Returns a `hashCode` for the calculation performed by the query plan of this Dataset. Unlike
+   * the standard `hashCode`, an attempt has been made to eliminate cosmetic differences.
 
 Review comment:
   I would write it as below:
   
   ```
   Returns a `hashCode` of the logical query plan against this [[Dataset]].
   
   @note Unlike the standard `hashCode`, the hash is calculated against the query plan
   simplified by tolerating the cosmetic differences such as attribute names.
   ```
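
To illustrate what "tolerating the cosmetic differences such as attribute names" can mean, here is a toy model in plain Python. It is not Spark's implementation: the nested-tuple expression format and the helper names `canonicalize`, `same_semantics`, and `semantic_hash` are assumptions made up for this sketch. The idea is to strip alias names before comparing or hashing, so two expressions that differ only in their output column name are treated as equal.

```python
def canonicalize(node):
    """Strip alias names from a toy expression tree, recursing into children."""
    if not isinstance(node, tuple):
        return node  # leaf: a literal value or a column name
    kind, *args = node
    if kind == "alias":  # ("alias", name, child): the name is cosmetic
        _name, child = args
        return ("alias", canonicalize(child))
    return (kind, *[canonicalize(arg) for arg in args])


def same_semantics(left, right):
    """Compare two toy expressions after canonicalization."""
    return canonicalize(left) == canonicalize(right)


def semantic_hash(node):
    """Hash the canonical form, so cosmetic differences do not change the hash."""
    return hash(canonicalize(node))


# Toy counterparts of withColumn("col1", id * 2), withColumn("col0", id * 2)
# and withColumn("col1", id + 2).
times2_as_col1 = ("alias", "col1", ("mul", ("attr", "id"), 2))
times2_as_col0 = ("alias", "col0", ("mul", ("attr", "id"), 2))
plus2_as_col1 = ("alias", "col1", ("add", ("attr", "id"), 2))
```

In this model, `times2_as_col1` and `times2_as_col0` compare as equal and share a hash, while `plus2_as_col1` does not compare as equal to either of them.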


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586193899
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118417/
   Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586948594
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586295857
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118419/
   Test FAILed.


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379342918
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,45 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.sameSemantics(df2)
+        False
+        >>> df1.sameSemantics(df3)
+        False
+        >>> df1.sameSemantics(df4)
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s" % type(other))
 
 Review comment:
   Shall we also add a test that checks the error message? It could be added as below (I have not tested it myself, though):
   
   ```diff
   diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py
   index d738449799b..942cd4b4b0e 100644
   --- a/python/pyspark/sql/tests/test_dataframe.py
   +++ b/python/pyspark/sql/tests/test_dataframe.py
   @@ -782,6 +782,11 @@ class DataFrameTests(ReusedSQLTestCase):
                        break
                self.assertEqual(df.take(8), result)
   
   +    def test_same_semantics_error(self):
   +        with QuietTest(self.sc):
   +            with self.assertRaisesRegexp(ValueError, "should be of DataFrame.*int"):
   +                self.spark.range(10).sameSemantics(1)
   +
   ```
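
A minimal sketch of what the suggested test exercises, written so that it runs without a Spark session. The `DataFrame` class below is a hypothetical stand-in for `pyspark.sql.DataFrame`, and `error_message_for` is an illustrative helper; only the guard and its message are taken from the diff above. It suggests that the pattern `"should be of DataFrame.*int"` would match the message produced for an `int` argument under Python 3.

```python
import re


class DataFrame:
    """Hypothetical stand-in for pyspark.sql.DataFrame (assumption: no JVM involved)."""

    def sameSemantics(self, other):
        # Same guard and message as in the diff under review.
        if not isinstance(other, DataFrame):
            raise ValueError("other parameter should be of DataFrame; however, got %s"
                             % type(other))
        # The real method delegates to the JVM; this sketch only models the guard.
        return other is self


def error_message_for(value):
    """Return the ValueError message raised for `value`, or None if nothing is raised."""
    try:
        DataFrame().sameSemantics(value)
    except ValueError as e:
        return str(e)
    return None


message = error_message_for(1)
# The regular expression proposed in the review comment.
matches = re.search("should be of DataFrame.*int", message) is not None
```

Because the message embeds `type(other)`, the trailing `.*int` in the pattern is what ties the assertion to the offending argument type.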


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586340929
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] HyukjinKwon commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586947035
 
 
   retest this please


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585839557
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679645
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118493/
   Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586203267
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23176/
   Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586215401
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679631
 
 
   **[Test build #118489 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118489/testReport)** for PR 27565 at commit [`a1d4ba1`](https://github.com/apache/spark/commit/a1d4ba1f33c81435da84cbeee3c7e579e5dd8061).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379367247
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,50 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.sameSemantics(df2)
+        False
+        >>> df1.sameSemantics(df3)
+        False
+        >>> df1.sameSemantics(df4)
+        True
+        >>> df1.sameSemantics(df1)
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a `hashCode` for the calculation performed by the query plan of this Dataset.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.semanticHash() == df2.semanticHash()
+        False
+        >>> df1.semanticHash() == df3.semanticHash()
+        False
+        >>> df1.semanticHash() == df4.semanticHash()
+        True
 
 Review comment:
   I need to debug this too; I don't know the cause yet. I do have one hypothesis, though.
   
   One difference between the Python and Scala sides is that the Python side currently creates the `DataFrame` from an `RDD[Array[Byte]]`, as seen from the JVM.
   On the Scala side, the plan holds a `LocalRelation` that contains the actual data. This could affect the hash.
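
The hypothesis above can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions, not Spark's implementation: `LocalRelationModel`, `RddRelationModel`, and `from_python_rows` are hypothetical names, and the assumption that an RDD-backed plan node is identified by the RDD rather than by its contents is exactly the unverified part of the hypothesis.

```python
from dataclasses import dataclass
from itertools import count
from typing import Tuple

# Assumption: every RDD-backed relation gets a fresh identifier.
_next_rdd_id = count()


@dataclass(frozen=True)
class LocalRelationModel:
    """Toy model of a plan node that embeds the actual rows."""
    rows: Tuple[Tuple[int, ...], ...]

    def semantic_hash(self) -> int:
        # The hash depends only on the data held by the node.
        return hash(self.rows)


@dataclass(frozen=True)
class RddRelationModel:
    """Toy model of a plan node that only points at an opaque RDD."""
    rdd_id: int

    def semantic_hash(self) -> int:
        # The rows are not visible here, so the hash depends on the RDD identity.
        return hash(self.rdd_id)


def from_python_rows(rows) -> RddRelationModel:
    """Hypothetical helper: model a DataFrame built through an RDD (rows are ignored)."""
    return RddRelationModel(next(_next_rdd_id))


rows = ((1, 2), (4, 5))
local_a, local_b = LocalRelationModel(rows), LocalRelationModel(rows)
rdd_a, rdd_b = from_python_rows(rows), from_python_rows(rows)
```

In this model, two relations built from identical rows hash the same only in the `LocalRelationModel` case, which would be consistent with the Scala and Python doctests behaving differently.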


[GitHub] [spark] liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379979541
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,59 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+        therefore return same results.
+
+        .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+            such as attribute names.
+
+        .. note::This API can compare both :class:`DataFrame`\\s very fast but can still return
+            `False` on the :class:`DataFrame` that return the same results, for instance, from
+            different plans. Such false negative semantic can be useful when caching as an example.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+        True
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+        False
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+        .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+            simplified by tolerating the cosmetic differences such as attribute names.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df2.withColumn("col1", df2.id * 2).semanticHash()
+        True
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df3.withColumn("col1", df3.id + 2).semanticHash()
+        False
 
 Review comment:
   Thanks! Could you share the link to the PR or issue?


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586804198
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23278/
   Test PASSed.


[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676449
 
 
   **[Test build #118491 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118491/testReport)** for PR 27565 at commit [`ddba494`](https://github.com/apache/spark/commit/ddba494405ea7ce79e03a8e93f97cd64a3a2acfc).


[GitHub] [spark] liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379364408
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
 ##########
 @@ -3308,6 +3308,37 @@ class Dataset[T] private[sql](
     files.toSet.toArray
   }
 
+  /**
+   * Returns true when the query plan of the given Dataset will return the same results as this
+   * Dataset.
+   *
+   * Since its likely undecidable to generally determine if two given plans will produce the same
 
 Review comment:
   Sounds good to me. It's clearer and more concise.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586942266
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118571/
   Test FAILed.


[GitHub] [spark] liangz1 commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

liangz1 commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586950940
 
 
   jenkins retest


[GitHub] [spark] liangz1 commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

liangz1 commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379226694
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,22 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.0)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+        """
+        return self._jdf.sameSemantics(other)
 
 Review comment:
   It becomes the same error as the second one:
   ```
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/Users/liang.zhang/work/repos/apache/spark/python/pyspark/sql/dataframe.py", line 2162, in sameSemantics
       return self._jdf.sameSemantics(other._jdf)
     File "/Users/liang.zhang/mypy3/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
       answer, self.gateway_client, self.target_id, self.name)
     File "/Users/liang.zhang/work/repos/apache/spark/python/pyspark/sql/utils.py", line 98, in deco
       return f(*a, **kw)
     File "/Users/liang.zhang/mypy3/lib/python3.7/site-packages/py4j/protocol.py", line 332, in get_return_value
       format(target_id, ".", name, value))
   py4j.protocol.Py4JError: An error occurred while calling o42.sameSemantics. Trace:
   py4j.Py4JException: Method sameSemantics([class org.apache.spark.sql.Dataset]) does not exist
   	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
   	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
   	at py4j.Gateway.invoke(Gateway.java:274)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:748)
   ```


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586110262
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118385/
   Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586203267
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23176/
   Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586942266
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118571/
   Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586804194
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] SparkQA commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586934216
 
 
   **[Test build #118571 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118571/testReport)** for PR 27565 at commit [`0e74940`](https://github.com/apache/spark/commit/0e749403b4bb127341849827cc95313752b6c715).


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379963546
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,59 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+        therefore return same results.
+
+        .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+            such as attribute names.
+
+        .. note::This API can compare both :class:`DataFrame`\\s very fast but can still return
 
 Review comment:
   nit: there should be a space after `note::`, i.e. `note::This` -> `note:: This`


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379359352
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,45 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.sameSemantics(df2)
+        False
+        >>> df1.sameSemantics(df3)
+        False
+        >>> df1.sameSemantics(df4)
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s" % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a `hashCode` for the calculation performed by the query plan of this Dataset.
 
 Review comment:
   ```
   Returns a hash code of the logical query plan against this :class:`DataFrame`.
   
   .. note:: Unlike the standard hash code, the hash is calculated against the query plan
       simplified by tolerating the cosmetic differences such as attribute names.
   ```
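
To illustrate what "tolerating the cosmetic differences such as attribute names" could mean for a hash, here is a toy sketch in plain Python. It is not how Catalyst computes `semanticHash`; the tuple-based plan encoding and the names `canonicalize` and `toy_semantic_hash` are assumptions made up for this example.

```python
import hashlib


def canonicalize(plan):
    """Drop output column names (aliases) from a toy plan, keeping only its structure."""
    op, *args = plan
    if op == "alias":
        # ("alias", child, name): the name is cosmetic, so keep only the child.
        child, _name = args
        return canonicalize(child)
    return (op,) + tuple(canonicalize(a) if isinstance(a, tuple) else a for a in args)


def toy_semantic_hash(plan):
    """Hash the canonical form, so plans that differ only in alias names hash alike."""
    digest = hashlib.sha256(repr(canonicalize(plan)).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big", signed=True)


# Toy counterparts of selecting "id as col0" and "id as col1" from the same range.
plan_col0 = ("project", ("alias", ("column", "id"), "col0"), ("range", 10))
plan_col1 = ("project", ("alias", ("column", "id"), "col1"), ("range", 10))
plan_other = ("project", ("alias", ("add", ("column", "id"), 2), "col1"), ("range", 10))

print(toy_semantic_hash(plan_col0) == toy_semantic_hash(plan_col1))  # True
print(canonicalize(plan_col0) == canonicalize(plan_other))  # False
```

Hashing the canonical form rather than the raw plan is what makes the two renamed plans agree, which is the property the suggested note describes.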


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586674850
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23246/
   Test PASSed.


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379991667
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,58 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+        therefore return same results.
+
+        .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+            such as attribute names.
+
+        .. note:: This API can compare both :class:`DataFrame`\\s very fast but can still return
+            `False` on the :class:`DataFrame` that return the same results, for instance, from
+            different plans. Such false negative semantic can be useful when caching as an example.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+        True
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+        False
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+        True
+        """
 
 Review comment:
   nit:
   
   ```python
           >>> df1 = spark.range(10)
           >>> df2 = spark.range(10)
           >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
           True
           >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id + 2))
           False
           >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col0", df2.id * 2))
           True
   ```


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379356166
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,45 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
 
 Review comment:
   The documentation doesn't seem to match the Scala side. I would suggest:
   
   ```
   Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
   therefore return same results.
   
   .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
       such as attribute names.
   
   .. note::This API can compare both :class:`DataFrame`\\s very fast but can still return `False` on
       the :class:`DataFrame` that return the same results, for instance, from different plans. Such
       false negative semantic can be useful when caching as an example.
   ```


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379354632
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
 ##########
 @@ -3308,6 +3308,37 @@ class Dataset[T] private[sql](
     files.toSet.toArray
   }
 
+  /**
+   * Returns true when the query plan of the given Dataset will return the same results as this
+   * Dataset.
+   *
+   * Since its likely undecidable to generally determine if two given plans will produce the same
 
 Review comment:
   I would rewrite the doc as below if you guys think it's fine.
   
   ```
   Returns `true` when the logical query plans inside both [[Dataset]]s are equal and
   therefore return same results.
   
   @note The equality comparison here is simplified by tolerating the cosmetic differences
   such as attribute names.
   
   @note This API can compare both [[Dataset]]s very fast but can still return `false` on
   the [[Dataset]] that return the same results, for instance, from different plans. Such
   false negative semantic can be useful when caching as an example.
   ```
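
The false-negative behaviour described in the second note can be sketched with a toy model. This is plain Python, not Spark: the expression encoding and `toy_same_semantics` are hypothetical, and the thread does not say how Spark treats this particular pair of expressions.

```python
def evaluate(expr, row):
    """Evaluate a toy expression tree against a row (a dict of column values)."""
    op = expr[0]
    if op == "col":
        return row[expr[1]]
    if op == "lit":
        return expr[1]
    left, right = evaluate(expr[1], row), evaluate(expr[2], row)
    return left * right if op == "mul" else left + right


def toy_same_semantics(a, b):
    """Structural equality only: fast, but blind to algebraic equivalences."""
    return a == b


double_by_mul = ("mul", ("col", "id"), ("lit", 2))
double_by_add = ("add", ("col", "id"), ("col", "id"))

rows = [{"id": i} for i in range(10)]
same_results = all(evaluate(double_by_mul, r) == evaluate(double_by_add, r) for r in rows)
same_plan = toy_same_semantics(double_by_mul, double_by_add)
print(same_results, same_plan)  # True False
```

For a cache keyed on such a comparison, a false negative only costs a missed reuse; it never serves the wrong result, which is presumably why the note calls it useful for caching.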


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586295845
 
 
   Build finished. Test FAILed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-587090196
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] liangz1 commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

liangz1 commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379226303
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,22 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.0)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+        """
+        return self._jdf.sameSemantics(other)
+
+    @since(3.0)
+    def semanticHash(self):
+        """
+
+        :return:
+        """
+        return self._jdf.semanticHash(None)
 
 Review comment:
   It shows the same error...


[GitHub] [spark] SparkQA removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586192634
 
 
   **[Test build #118417 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118417/testReport)** for PR 27565 at commit [`2cbce71`](https://github.com/apache/spark/commit/2cbce71bf1d59b2557b75535529198221d5c3f9d).


[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586340172
 
 
   **[Test build #118421 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118421/testReport)** for PR 27565 at commit [`1deb7a9`](https://github.com/apache/spark/commit/1deb7a9fd153324877057e849790644d0148e7ca).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379992810
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,58 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+        therefore return same results.
+
+        .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+            such as attribute names.
+
+        .. note:: This API can compare both :class:`DataFrame`\\s very fast but can still return
+            `False` on the :class:`DataFrame` that return the same results, for instance, from
+            different plans. Such false negative semantic can be useful when caching as an example.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+        True
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+        False
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+        .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+            simplified by tolerating the cosmetic differences such as attribute names.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df2.withColumn("col1", df2.id * 2).semanticHash()
+        True
+        >>> # df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df3.withColumn("col1", df3.id + 2).semanticHash()  # False
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df4.withColumn("col0", df4.id * 2).semanticHash()
+        True
 
 Review comment:
   nit:
   
   I think you can just show a couple of skipped examples instead of comparisons as @WeichenXu123 pointed out.
   
   ```python
   >>> spark.range(10).selectExpr("id as col0").semanticHash()  # doctest: +SKIP
   1855039936
   >>> spark.range(10).selectExpr("id as col1").semanticHash()  # doctest: +SKIP
   1855039936
   ```
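
For readers unfamiliar with the `+SKIP` directive, the following standard-library sketch shows its effect: the skipped example is displayed in the docs but never executed. The name `unstable_value` is deliberately undefined and hypothetical; it stands in for a call whose output isn't stable enough to assert on.

```python
import doctest

DOCSTRING = """
>>> unstable_value()  # doctest: +SKIP
1855039936
>>> 1 + 1
2
"""

# Parse the docstring into a DocTest and run it with an empty namespace.
parser = doctest.DocTestParser()
test = parser.get_doctest(DOCSTRING, {}, "semantic_hash_example", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)

# If the skipped example had been executed, the undefined name would raise
# NameError and be counted as a failure.
print(results.failed)  # 0
```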


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379229505
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,22 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.0)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+        """
+        return self._jdf.sameSemantics(other)
 
 Review comment:
   @WeichenXu123 is correct. It should be `self._jdf.sameSemantics(other._jdf)`.
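
Why the unwrapping matters can be sketched without a JVM. Both classes below are hypothetical stand-ins written for this example (`FakeJavaDataFrame` imitates whatever object `_jdf` holds); only the `_jdf` attribute name and the `ValueError` message come from the diff in this thread.

```python
class FakeJavaDataFrame:
    """Hypothetical stand-in for the JVM-side object held in `_jdf`."""

    def __init__(self, plan):
        self.plan = plan

    def sameSemantics(self, other):
        # The JVM side only understands other JVM-side objects.
        if not isinstance(other, FakeJavaDataFrame):
            raise TypeError("expected a JVM-side DataFrame, got %s" % type(other).__name__)
        return self.plan == other.plan


class DataFrame:
    """Hypothetical Python wrapper that delegates to `_jdf`."""

    def __init__(self, jdf):
        self._jdf = jdf

    def sameSemantics(self, other):
        if not isinstance(other, DataFrame):
            raise ValueError("other parameter should be of DataFrame; however, got %s"
                             % type(other))
        # Unwrap `other` before handing it to the JVM-side object.
        return self._jdf.sameSemantics(other._jdf)


df1 = DataFrame(FakeJavaDataFrame("range(10)"))
df2 = DataFrame(FakeJavaDataFrame("range(10)"))
print(df1.sameSemantics(df2))  # True
```

Passing the Python wrapper straight through, as in `self._jdf.sameSemantics(other)`, hands the inner object a type it does not recognise, which is the mistake the comment above points out.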
   


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379992810
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,58 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+        therefore return same results.
+
+        .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+            such as attribute names.
+
+        .. note:: This API can compare both :class:`DataFrame`\\s very fast but can still return
+            `False` on the :class:`DataFrame` that return the same results, for instance, from
+            different plans. Such false negative semantic can be useful when caching as an example.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+        True
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+        False
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+        .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+            simplified by tolerating the cosmetic differences such as attribute names.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df2.withColumn("col1", df2.id * 2).semanticHash()
+        True
+        >>> # df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df3.withColumn("col1", df3.id + 2).semanticHash()  # False
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df4.withColumn("col0", df4.id * 2).semanticHash()
+        True
 
 Review comment:
   nit:
   
   I think you can just show a couple of examples:
   
   ```python
   >>> spark.range(10).selectExpr("id as col0").semanticHash()  # doctest: +SKIP
   1855039936
   >>> spark.range(10).selectExpr("id as col1").semanticHash()  # doctest: +SKIP
   1855039936
   ```


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586809431
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23280/
   Test PASSed.


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379229670
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,22 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.0)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+        """
+        return self._jdf.sameSemantics(other)
+
+    @since(3.0)
+    def semanticHash(self):
+        """
 
 Review comment:
   The doc seems to be missing. Shall we add it, along with a doctest?


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679644
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] WeichenXu123 commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

WeichenXu123 commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379230709
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,22 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.0)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+        """
+        return self._jdf.sameSemantics(other)
 
 Review comment:
   Hmm, it should work. Did you build and test it correctly locally?


[GitHub] [spark] liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379881028
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,59 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+        therefore return same results.
+
+        .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+            such as attribute names.
+
+        .. note::This API can compare both :class:`DataFrame`\\s very fast but can still return
+            `False` on the :class:`DataFrame` that return the same results, for instance, from
+            different plans. Such false negative semantic can be useful when caching as an example.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+        True
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+        False
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+        .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+            simplified by tolerating the cosmetic differences such as attribute names.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df2.withColumn("col1", df2.id * 2).semanticHash()
+        True
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df3.withColumn("col1", df3.id + 2).semanticHash()
+        False
 
 Review comment:
   ```
   Failed example:
       df1.withColumn("col1", df1.id * 2).semanticHash() ==             df3.withColumn("col1", df3.id + 2).semanticHash()
   Differences (ndiff with -expected +actual):
       - False
       + True
   ```
   Now we have another unexpected result: the doctest expects `False`, but the two `semanticHash()` values compare equal. (Note that L2176 passed, which is expected.)
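
To make the expectation in the failing doctest concrete, here is a minimal toy sketch of a hash that tolerates cosmetic differences such as output names. It is not Spark's implementation: the nested-tuple "plan" encoding and the names `canonicalize`, `same_semantics`, and `semantic_hash` are assumptions made up for illustration.

```python
# Toy model only: a "plan" is a nested tuple, e.g.
# ("alias", "col1", ("mul", ("attr", "id"), 2)).
# This encoding is hypothetical and unrelated to Spark's internal plan classes.

def canonicalize(plan):
    """Drop the cosmetic output name of an alias but keep the computation."""
    if isinstance(plan, tuple) and plan and plan[0] == "alias":
        _, _name, child = plan
        return ("alias", None, canonicalize(child))
    if isinstance(plan, tuple):
        return tuple(canonicalize(part) for part in plan)
    return plan


def same_semantics(left, right):
    """Compare two toy plans after removing cosmetic differences."""
    return canonicalize(left) == canonicalize(right)


def semantic_hash(plan):
    """Hash the canonical form, so plans with the same semantics hash equally."""
    return hash(canonicalize(plan))


times_two_col1 = ("alias", "col1", ("mul", ("attr", "id"), 2))
times_two_col0 = ("alias", "col0", ("mul", ("attr", "id"), 2))
plus_two_col1 = ("alias", "col1", ("add", ("attr", "id"), 2))
```

In this toy model, `times_two_col1` and `times_two_col0` compare as semantically equal and share a hash, while `plus_two_col1` does not compare equal to either. Note that differing hashes for differing plans are likely but never guaranteed by any hash function.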

[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676866
 
 
   **[Test build #118493 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118493/testReport)** for PR 27565 at commit [`61f7ca1`](https://github.com/apache/spark/commit/61f7ca11af14d399d0e2512c51c2f37c4aa4a38f).

[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586948605
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23327/
   Test PASSed.

[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586674848
 
 
   Merged build finished. Test PASSed.

[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586177675
 
 
   **[Test build #118412 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118412/testReport)** for PR 27565 at commit [`368e74a`](https://github.com/apache/spark/commit/368e74ade22b03882bf7e12e2fb2a2e0ac9387fd).
    * This patch **fails Python style tests**.
    * This patch **does not merge cleanly**.
    * This patch adds no public classes.

[GitHub] [spark] WeichenXu123 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

WeichenXu123 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379374685
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,50 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.sameSemantics(df2)
+        False
+        >>> df1.sameSemantics(df3)
+        False
+        >>> df1.sameSemantics(df4)
+        True
+        >>> df1.sameSemantics(df1)
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a `hashCode` for the calculation performed by the query plan of this Dataset.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.semanticHash() == df2.semanticHash()
+        False
+        >>> df1.semanticHash() == df3.semanticHash()
+        False
+        >>> df1.semanticHash() == df4.semanticHash()
+        True
 
 Review comment:
   Let's change the unit test. Don't test on DataFrames created from an in-memory list (LocalRelation); their implementation differs between Scala and PySpark.
   Our use case also does not care about the behavior of LocalRelation.
   So I suggest adding a unit test on: spark.df.read(...).where(...)

[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586214562
 
 
   **[Test build #118421 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118421/testReport)** for PR 27565 at commit [`1deb7a9`](https://github.com/apache/spark/commit/1deb7a9fd153324877057e849790644d0148e7ca).

[GitHub] [spark] HyukjinKwon commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586203075
 
 
   @liangz1 It seems your branch here is not synced with the current master. You should probably rebase or merge the upstream into your PR. It might be good to refer to "The Review Process" in https://spark.apache.org/contributing.html.

[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586069480
 
 
   Build finished. Test PASSed.

[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-587090205
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118572/
   Test PASSed.

[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586177694
 
 
   Build finished. Test FAILed.

[GitHub] [spark] WeichenXu123 commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

WeichenXu123 commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586069714
 
 
   @cloud-fan Any comments?

[GitHub] [spark] WeichenXu123 commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

WeichenXu123 commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379225118
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,22 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.0)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+        """
+        return self._jdf.sameSemantics(other)
 
 Review comment:
   should be `self._jdf.sameSemantics(other._jdf)`

[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679648
 
 
   Merged build finished. Test FAILed.

[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586864627
 
 
   Merged build finished. Test FAILed.

[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586809431
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23280/
   Test PASSed.

[GitHub] [spark] SparkQA commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585839043
 
 
   **[Test build #118373 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118373/testReport)** for PR 27565 at commit [`103979a`](https://github.com/apache/spark/commit/103979a205318d52e49b44cbaceae9c2ca569e8b).

[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586193885
 
 
   Build finished. Test FAILed.

[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586804194
 
 
   Merged build finished. Test PASSed.

[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379229959
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,22 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.0)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+        """
+        return self._jdf.sameSemantics(other)
 
 Review comment:
   Since the Python API side does not check the type, we could add a type check that raises `ValueError`.
   
   ```python
   if not isinstance(other, DataFrame):
       raise ValueError("other parameter should be of DataFrame; however, got %s" % type(other))
   ```
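
The two review suggestions on this method (unwrap `other._jdf` before delegating, and reject non-DataFrame arguments early) can be sketched together without a Spark session. This is only an illustration: `FakeJavaDataFrame` and `DataFrameWrapper` are hypothetical stand-ins, not PySpark classes, and the string "plan" is a placeholder for a real query plan.

```python
class FakeJavaDataFrame:
    """Hypothetical stand-in for the JVM-side object a wrapper holds in `_jdf`."""

    def __init__(self, plan):
        self.plan = plan

    def sameSemantics(self, other):
        # The JVM-side method expects another JVM-side object,
        # not the Python wrapper around it.
        if not isinstance(other, FakeJavaDataFrame):
            raise TypeError("expected a JVM-side object, got %s" % type(other))
        return self.plan == other.plan


class DataFrameWrapper:
    """Hypothetical Python-side wrapper, mirroring the shape of the method under review."""

    def __init__(self, jdf):
        self._jdf = jdf

    def sameSemantics(self, other):
        # Fail early with a clear message instead of a confusing JVM-side error.
        if not isinstance(other, DataFrameWrapper):
            raise ValueError("other parameter should be of DataFrame; however, got %s"
                             % type(other))
        # Pass the wrapped object, not the wrapper itself.
        return self._jdf.sameSemantics(other._jdf)


left = DataFrameWrapper(FakeJavaDataFrame("range(100) * 2"))
right = DataFrameWrapper(FakeJavaDataFrame("range(100) * 2"))
```

With these stand-ins, `left.sameSemantics(right)` delegates to the wrapped objects, while `left.sameSemantics("not a dataframe")` raises `ValueError` before anything reaches the wrapped object.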

[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379368127
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,50 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.sameSemantics(df2)
+        False
+        >>> df1.sameSemantics(df3)
+        False
+        >>> df1.sameSemantics(df4)
+        True
+        >>> df1.sameSemantics(df1)
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a `hashCode` for the calculation performed by the query plan of this Dataset.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.semanticHash() == df2.semanticHash()
+        False
+        >>> df1.semanticHash() == df3.semanticHash()
+        False
+        >>> df1.semanticHash() == df4.semanticHash()
+        True
 
 Review comment:
   If this is the case, we might have to leave an explicit caveat that false positives are possible. Also, we might have to mark those APIs as `@Unstable` or `@Experimental` for this reason. cc @mengxr.
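
For the caching use case mentioned in the PR description, one way a caller might guard against a hash match that does not reflect a real semantic match is to treat the hash only as a bucket key and confirm every hit with an explicit comparison. The sketch below is an assumption about how such a cache could look, not code from this PR: `PlanCache`, `hash_fn`, and `same_fn` are hypothetical names, and the two callables stand in for `semanticHash` and `sameSemantics`.

```python
class PlanCache:
    """Toy cache: bucket entries by hash, confirm hits with an equality check."""

    def __init__(self, hash_fn, same_fn):
        self._hash_fn = hash_fn    # stand-in for a semantic hash
        self._same_fn = same_fn    # stand-in for a semantic comparison
        self._buckets = {}

    def put(self, plan, value):
        self._buckets.setdefault(self._hash_fn(plan), []).append((plan, value))

    def get(self, plan):
        # An equal hash alone is not trusted; each candidate is re-checked.
        for cached_plan, value in self._buckets.get(self._hash_fn(plan), []):
            if self._same_fn(cached_plan, plan):
                return value
        return None


# A deliberately bad hash that always collides, to exercise the confirmation step.
cache = PlanCache(hash_fn=lambda plan: 0, same_fn=lambda a, b: a == b)
cache.put("plan-a", "cached result for plan-a")
```

Here `cache.get("plan-b")` lands in the same bucket as `"plan-a"` but is rejected by the comparison, so a colliding hash cannot return the wrong cached value. This protects only against hash collisions; it does not help if the comparison function itself reports two different plans as equal.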

[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676502
 
 
   Merged build finished. Test PASSed.

[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586215401
 
 
   Merged build finished. Test PASSed.

[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586193885
 
 
   Build finished. Test FAILed.

[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586931037
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23325/
   Test PASSed.

[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586864653
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586864664
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118523/
   Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586341296
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118421/
   Test FAILed.


[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586202370
 
 
   **[Test build #118419 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118419/testReport)** for PR 27565 at commit [`d154d6b`](https://github.com/apache/spark/commit/d154d6bb991f6dbc284645db33b49206de61e56a).


[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586674793
 
 
   **[Test build #118489 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118489/testReport)** for PR 27565 at commit [`a1d4ba1`](https://github.com/apache/spark/commit/a1d4ba1f33c81435da84cbeee3c7e579e5dd8061).


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586215408
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23179/
   Test PASSed.


[GitHub] [spark] SparkQA commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586068918
 
 
   **[Test build #118385 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118385/testReport)** for PR 27565 at commit [`284d7ad`](https://github.com/apache/spark/commit/284d7ad3de0a15a6b6aebf92c7b9e32349607048).


[GitHub] [spark] liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379362763
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,50 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.sameSemantics(df2)
+        False
+        >>> df1.sameSemantics(df3)
+        False
+        >>> df1.sameSemantics(df4)
+        True
+        >>> df1.sameSemantics(df1)
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a `hashCode` for the calculation performed by the query plan of this Dataset.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.semanticHash() == df2.semanticHash()
+        False
+        >>> df1.semanticHash() == df3.semanticHash()
+        False
+        >>> df1.semanticHash() == df4.semanticHash()
+        True
 
 Review comment:
   Do you have any ideas? @HyukjinKwon 


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586188200
 
 
   Build finished. Test PASSed.


[GitHub] [spark] WeichenXu123 commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

WeichenXu123 commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r378957996
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
 ##########
 @@ -3308,6 +3308,33 @@ class Dataset[T] private[sql](
     files.toSet.toArray
   }
 
+  /**
+   * Returns true when the query plan of the given Dataset will return the same results as this
+   * Dataset.
+   *
+   * Since it's likely undecidable to generally determine if two given plans will produce the same
+   * results, it is okay for this function to return false, even if the results are actually
+   * the same.  Such behavior will not affect correctness, only the application of performance
+   * enhancements like caching.  However, it is not acceptable to return true if the results could
+   * possibly be different.
+   *
+   * This function performs a modified version of equality that is tolerant of cosmetic
+   * differences like attribute naming and/or expression id differences.
+   *
+   * @since 3.0.0
+   */
+  @DeveloperApi
+  def sameSemantics(other: Dataset[T]): Boolean = {
 
 Review comment:
   Remove `@DeveloperApi`. This is a user-facing API now.
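
The doc comment above describes two properties: the comparison ignores cosmetic differences such as attribute names and expression ids, and it may conservatively return false for plans that happen to produce the same results. The following is a minimal sketch of that idea in plain Python. It is a toy model and not Spark's implementation: the tuple-based expression encoding and the helper names `canonical`, `same_semantics`, and `evaluate` are assumptions made purely for illustration.

```python
# Toy expressions are nested tuples (hypothetical encoding, not Spark's):
#   ("attr", name, expr_id) | ("lit", value) | ("mul", left, right) | ("add", left, right)

def canonical(expr, ordinals):
    """Rewrite an expression so that names and raw expression ids disappear."""
    kind = expr[0]
    if kind == "attr":
        _, _name, expr_id = expr
        # Each distinct attribute is replaced by the position of its first appearance.
        return ("attr", ordinals.setdefault(expr_id, len(ordinals)))
    if kind == "lit":
        return expr
    return (kind,) + tuple(canonical(child, ordinals) for child in expr[1:])


def same_semantics(a, b):
    """Structural equality after canonicalization; may return False for equivalent plans."""
    return canonical(a, {}) == canonical(b, {})


def evaluate(expr, row):
    """Evaluate a toy expression against a dict-shaped row."""
    kind = expr[0]
    if kind == "attr":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    left, right = (evaluate(child, row) for child in expr[1:])
    return left * right if kind == "mul" else left + right


doubled = ("mul", ("attr", "id", 1), ("lit", 2))
renamed = ("mul", ("attr", "x", 42), ("lit", 2))         # differs only in name and id
summed = ("add", ("attr", "id", 7), ("attr", "id", 7))   # same results, different shape

print(same_semantics(doubled, renamed))  # True
print(same_semantics(doubled, summed))   # False
```

In this sketch `doubled` and `summed` evaluate to the same value for every row, yet the comparison reports them as different. That is the acceptable false negative the doc comment mentions. The check never reports two structurally different computations as equal, which is the direction that would break correctness.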


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586188200
 
 
   Build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585872197
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586674850
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23246/
   Test PASSed.


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379977102
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,59 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+        therefore return the same results.
+
+        .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+            such as attribute names.
+
+        .. note:: This API can compare both :class:`DataFrame`\\s very fast but can still return
+            `False` on :class:`DataFrame`\\s that return the same results, for instance, from
+            different plans. Such false negatives are acceptable when, for example, caching.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+        True
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+        False
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+        .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+            simplified by tolerating the cosmetic differences such as attribute names.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df2.withColumn("col1", df2.id * 2).semanticHash()
+        True
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df3.withColumn("col1", df3.id + 2).semanticHash()
+        False
 
 Review comment:
   Okay, this seems to be an issue on the Scala side. I will open a PR.
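
The docstrings above mention caching as the motivating use. One way the two methods could fit together is sketched below: the hash picks a bucket, and the equality check confirms the match. `SemanticCache` and the tuple-shaped toy plans are hypothetical names invented for this illustration. Nothing here is part of the PR or of PySpark.

```python
from typing import Any, Callable, Dict, List, Tuple


class SemanticCache:
    """Cache keyed by a semantic hash and confirmed by a semantic equality check."""

    def __init__(self,
                 semantic_hash: Callable[[Any], int],
                 same_semantics: Callable[[Any, Any], bool]) -> None:
        self._hash = semantic_hash
        self._same = same_semantics
        self._buckets: Dict[int, List[Tuple[Any, Any]]] = {}

    def get_or_compute(self, plan: Any, compute: Callable[[Any], Any]) -> Any:
        bucket = self._buckets.setdefault(self._hash(plan), [])
        for cached_plan, value in bucket:
            # Equal hashes only suggest a match; the equality check decides.
            if self._same(cached_plan, plan):
                return value
        value = compute(plan)
        bucket.append((plan, value))
        return value


# Toy stand-in for a DataFrame: (operator, literal, output column name).
def toy_hash(plan):
    return hash(plan[:2])  # the output column name is treated as cosmetic


def toy_same(a, b):
    return a[:2] == b[:2]


calls = []


def expensive(plan):
    calls.append(plan)
    op, literal, _name = plan
    return [i * literal if op == "mul" else i + literal for i in range(5)]


cache = SemanticCache(toy_hash, toy_same)
first = cache.get_or_compute(("mul", 2, "col1"), expensive)
second = cache.get_or_compute(("mul", 2, "col0"), expensive)  # cosmetic difference only
third = cache.get_or_compute(("add", 2, "col1"), expensive)   # different computation

print(len(calls))  # 2
```

With real DataFrames, one would presumably pass `lambda df: df.semanticHash()` and `lambda a, b: a.sameSemantics(b)` as the two callables. A caller should keep in mind that a hit returns the value computed for the first plan, so cosmetic details such as column names may differ from those of the plan being looked up.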


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586215408
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23179/
   Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586188210
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23174/
   Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586931037
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23325/
   Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585872211
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118373/
   Test FAILed.


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379229484
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
 ##########
 @@ -3308,6 +3308,31 @@ class Dataset[T] private[sql](
     files.toSet.toArray
   }
 
+  /**
+   * Returns true when the query plan of the given Dataset will return the same results as this
+   * Dataset.
+   *
+   * Since it's likely undecidable to generally determine if two given plans will produce the same
+   * results, it is okay for this function to return false, even if the results are actually
+   * the same.  Such behavior will not affect correctness, only the application of performance
+   * enhancements like caching.  However, it is not acceptable to return true if the results could
+   * possibly be different.
+   *
+   * This function performs a modified version of equality that is tolerant of cosmetic
+   * differences like attribute naming and/or expression id differences.
+   *
+   * @since 3.0.0
+   */
+  def sameSemantics(other: Dataset[T]): Boolean = {
+    queryExecution.analyzed.sameResult(other.queryExecution.analyzed)
+  }
+
+  /**
+   * Returns a `hashCode` for the calculation performed by the query plan of this Dataset. Unlike
+   * the standard `hashCode`, an attempt has been made to eliminate cosmetic differences.
+   */
 
 Review comment:
   It should have `@since` too.
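
The `semanticHash` doc comment contrasts the method with the standard `hashCode`. A rough Python analogy of that contrast is sketched below, assuming a toy plan node whose ordinary equality and hash include attribute names and expression ids. The `Attr` and `Project` classes and the `canonicalize` helper are hypothetical and do not mirror Spark's classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attr:
    name: str
    expr_id: int


@dataclass(frozen=True)
class Project:
    """Toy plan node: output_attr = input_attr <op> literal, read from a named source."""
    source: str
    input_attr: Attr
    op: str
    literal: int
    output_attr: Attr


def canonicalize(plan: Project) -> tuple:
    ordinals: dict = {}

    def norm(attr: Attr) -> int:
        # The position of first appearance replaces both the name and the expression id.
        return ordinals.setdefault(attr.expr_id, len(ordinals))

    return (plan.source, norm(plan.input_attr), plan.op, plan.literal,
            norm(plan.output_attr))


def semantic_hash(plan: Project) -> int:
    return hash(canonicalize(plan))


def same_semantics(a: Project, b: Project) -> bool:
    return canonicalize(a) == canonicalize(b)


a = Project("range(100)", Attr("id", 0), "*", 2, Attr("col1", 1))
b = Project("range(100)", Attr("id", 5), "*", 2, Attr("col0", 6))
c = Project("range(100)", Attr("id", 8), "+", 2, Attr("col1", 9))

print(a == b)                                # False
print(semantic_hash(a) == semantic_hash(b))  # True
print(same_semantics(a, c))                  # False
```

The property a cache relies on is that `same_semantics(a, b)` implies equal semantic hashes, which holds here because both are derived from the same canonical form. Python salts string hashes per process, so the integers this toy produces are not stable across runs. The sketch makes no claim about the stability of the real `semanticHash` values.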


[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-587090196
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676945
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679653
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586295845
 
 
   Build finished. Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676946
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23249/
   Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586804198
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23278/
   Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679656
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118491/
   Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586931023
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586193873
 
 
   **[Test build #118417 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118417/testReport)** for PR 27565 at commit [`2cbce71`](https://github.com/apache/spark/commit/2cbce71bf1d59b2557b75535529198221d5c3f9d).
    * This patch **fails Python style tests**.
    * This patch **does not merge cleanly**.
    * This patch adds no public classes.


[GitHub] [spark] liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379349637
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,45 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.sameSemantics(df2)
+        False
+        >>> df1.sameSemantics(df3)
+        False
+        >>> df1.sameSemantics(df4)
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s" % type(other))
 
 Review comment:
   Thanks! I'll try it.


[GitHub] [spark] SparkQA commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586948085
 
 
   **[Test build #118572 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118572/testReport)** for PR 27565 at commit [`0e74940`](https://github.com/apache/spark/commit/0e749403b4bb127341849827cc95313752b6c715).


[GitHub] [spark] HyukjinKwon commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586826118
 
 
   I am okay otherwise. I will leave it to @WeichenXu123 and @mengxr 


[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585821610
 
 
   Can one of the admins verify this patch?


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379228900
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
 ##########
 @@ -3308,6 +3308,31 @@ class Dataset[T] private[sql](
     files.toSet.toArray
   }
 
+  /**
+   * Returns true when the query plan of the given Dataset will return the same results as this
+   * Dataset.
+   *
+   * Since it's likely undecidable to generally determine if two given plans will produce the same
+   * results, it is okay for this function to return false, even if the results are actually
+   * the same.  Such behavior will not affect correctness, only the application of performance
+   * enhancements like caching.  However, it is not acceptable to return true if the results could
+   * possibly be different.
+   *
+   * This function performs a modified version of equality that is tolerant of cosmetic
+   * differences like attribute naming and/or expression id differences.
+   *
+   * @since 3.0.0
+   */
+  def sameSemantics(other: Dataset[T]): Boolean = {
+    queryExecution.analyzed.sameResult(other.queryExecution.analyzed)
+  }
+
+  /**
+   * Returns a `hashCode` for the calculation performed by the query plan of this Dataset. Unlike
+   * the standard `hashCode`, an attempt has been made to eliminate cosmetic differences.
+   */
+  def semanticHash: Int = queryExecution.analyzed.semanticHash()
 
 Review comment:
   I would make it a proper function `semanticHash()` to allow the same API usage in PySpark.
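
For illustration only, here is a minimal Python sketch of how a caller might combine the two methods for caching: look up candidates by `semanticHash()` (called with parentheses, as suggested above) and then confirm with `sameSemantics()`, since two different plans can in principle share a hash. `FakePlan`, its `canonical` field, and `PlanCache` are hypothetical stand-ins so the snippet runs without Spark; none of them are part of this PR.

```python
class FakePlan:
    """Hypothetical stand-in for a DataFrame; not a Spark class."""

    def __init__(self, label, canonical):
        self.label = label          # cosmetic, ignored by the comparisons
        self.canonical = canonical  # assumed stand-in for a canonicalized plan

    def semanticHash(self):
        # Exposed as a method, so callers write plan.semanticHash().
        return hash(self.canonical)

    def sameSemantics(self, other):
        return self.canonical == other.canonical


class PlanCache:
    """Hypothetical cache: bucket by semantic hash, confirm by sameSemantics."""

    def __init__(self):
        self._buckets = {}

    def get_or_put(self, plan, compute):
        bucket = self._buckets.setdefault(plan.semanticHash(), [])
        for cached_plan, value in bucket:
            # The hash only narrows the search; sameSemantics decides.
            if cached_plan.sameSemantics(plan):
                return value
        value = compute(plan)
        bucket.append((plan, value))
        return value
```

A false `sameSemantics` result in this pattern only costs a recomputation, which matches the doc comment's point that false negatives affect performance rather than correctness.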


[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585820740
 
 
   Can one of the admins verify this patch?


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586193899
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118417/
   Test FAILed.


[GitHub] [spark] SparkQA removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586068918
 
 
   **[Test build #118385 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118385/testReport)** for PR 27565 at commit [`284d7ad`](https://github.com/apache/spark/commit/284d7ad3de0a15a6b6aebf92c7b9e32349607048).


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379342918
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,45 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.sameSemantics(df2)
+        False
+        >>> df1.sameSemantics(df3)
+        False
+        >>> df1.sameSemantics(df4)
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s" % type(other))
 
 Review comment:
   Shall we also add a test that checks the error message? It could be added as below (not tested):
   
   ```diff
   diff --git a/python/pyspark/sql/tests/test_dataframe.py b/python/pyspark/sql/tests/test_dataframe.py
   index d738449799b..942cd4b4b0e 100644
   --- a/python/pyspark/sql/tests/test_dataframe.py
   +++ b/python/pyspark/sql/tests/test_dataframe.py
   @@ -782,6 +782,11 @@ class DataFrameTests(ReusedSQLTestCase):
                        break
                self.assertEqual(df.take(8), result)
   
   +    def test_same_semantics_error(self):
   +        with QuietTest(self.sc):
   +            with self.assertRaisesRegexp(ValueError, "should be of DataFrame.*int"):
   +                self.spark.range(10).sameSemantics(1)
   +
   ```
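
A self-contained variant of the test idea above, runnable without Spark, might look like the sketch below. `StubDataFrame` is a hypothetical stand-in that only reproduces the type check from the patch, and `run_tests` is a helper introduced here for convenience. It uses `assertRaisesRegex`, because `assertRaisesRegexp` is a deprecated alias that was removed in Python 3.12.

```python
import unittest


class StubDataFrame:
    """Hypothetical stand-in; only mirrors the type check in the patch."""

    def sameSemantics(self, other):
        if not isinstance(other, StubDataFrame):
            raise ValueError(
                "other parameter should be of DataFrame; however, got %s" % type(other))
        # Placeholder result: the real plan comparison is not modeled here.
        return True


class SameSemanticsErrorTest(unittest.TestCase):
    def test_same_semantics_error(self):
        # type(1) renders as "<class 'int'>", so the pattern matches the message.
        with self.assertRaisesRegex(ValueError, "should be of DataFrame.*int"):
            StubDataFrame().sameSemantics(1)


def run_tests():
    """Run the test case above and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SameSemanticsErrorTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```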


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679649
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118489/
   Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586809430
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] SparkQA removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676866
 
 
   **[Test build #118493 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118493/testReport)** for PR 27565 at commit [`61f7ca1`](https://github.com/apache/spark/commit/61f7ca11af14d399d0e2512c51c2f37c4aa4a38f).


[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586069483
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23142/
   Test PASSed.


[GitHub] [spark] SparkQA removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586948085
 
 
   **[Test build #118572 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118572/testReport)** for PR 27565 at commit [`0e74940`](https://github.com/apache/spark/commit/0e749403b4bb127341849827cc95313752b6c715).


[GitHub] [spark] SparkQA removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586176641
 
 
   **[Test build #118412 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118412/testReport)** for PR 27565 at commit [`368e74a`](https://github.com/apache/spark/commit/368e74ade22b03882bf7e12e2fb2a2e0ac9387fd).


[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585839569
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23130/
   Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586069480
 
 
   Build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676945
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676946
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23249/
   Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679653
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586674848
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586341296
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118421/
   Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586295857
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118419/
   Test FAILed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585839569
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23130/
   Test PASSed.


[GitHub] [spark] SparkQA commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585872019
 
 
   **[Test build #118373 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118373/testReport)** for PR 27565 at commit [`103979a`](https://github.com/apache/spark/commit/103979a205318d52e49b44cbaceae9c2ca569e8b).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586942256
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] WeichenXu123 closed pull request #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

WeichenXu123 closed pull request #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565
 
 
   


[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586864653
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676502
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679644
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586864627
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-587090205
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118572/
   Test PASSed.


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379228610
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,22 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.0)
 
 Review comment:
   We are in the code freeze, and the RC will start next week. Is there a reason to target Spark 3.0? cc @mengxr.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586110262
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118385/
   Test PASSed.


[GitHub] [spark] WeichenXu123 commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

WeichenXu123 commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379267923
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,22 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.0)
 
 Review comment:
   @HyukjinKwon We can change this to a developer API.


[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586176641
 
 
   **[Test build #118412 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118412/testReport)** for PR 27565 at commit [`368e74a`](https://github.com/apache/spark/commit/368e74ade22b03882bf7e12e2fb2a2e0ac9387fd).


[GitHub] [spark] liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379882384
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,59 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+        therefore return same results.
+
+        .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+            such as attribute names.
+
+        .. note::This API can compare both :class:`DataFrame`\\s very fast but can still return
+            `False` on the :class:`DataFrame` that return the same results, for instance, from
+            different plans. Such false negative semantic can be useful when caching as an example.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+        True
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+        False
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+        .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+            simplified by tolerating the cosmetic differences such as attribute names.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df2.withColumn("col1", df2.id * 2).semanticHash()
+        True
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df3.withColumn("col1", df3.id + 2).semanticHash()
+        False
 
 Review comment:
   Same behavior for a DataFrame loaded with `spark.read.load()`: the two hashes below are identical even though the expressions differ.
   ```
   >>> df4 = spark.read.load(csv_file_path, format="csv", inferSchema="true", header="true")
   >>> df4.schema
   StructType(List(StructField(bool_col,BooleanType,true),StructField(float_col,DoubleType,true),StructField(double_col,DoubleType,true),StructField(int_col,IntegerType,true),StructField(long_col,IntegerType,true)))
   >>> df4.withColumn("col1", df4.int_col * 2).semanticHash()
   -1746346451
   >>> df4.withColumn("col1", df4.int_col + 2).semanticHash()
   -1746346451
   ```


[GitHub] [spark] SparkQA commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-587089410
 
 
   **[Test build #118572 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118572/testReport)** for PR 27565 at commit [`0e74940`](https://github.com/apache/spark/commit/0e749403b4bb127341849827cc95313752b6c715).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586942256
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586809253
 
 
   **[Test build #118525 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118525/testReport)** for PR 27565 at commit [`af476cb`](https://github.com/apache/spark/commit/af476cbd42ec491e9a860be4fcd66ba8c49256a4).


[GitHub] [spark] SparkQA removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586214562
 
 
   **[Test build #118421 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118421/testReport)** for PR 27565 at commit [`1deb7a9`](https://github.com/apache/spark/commit/1deb7a9fd153324877057e849790644d0148e7ca).


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586177704
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118412/
   Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679645
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118493/
   Test FAILed.


[GitHub] [spark] WeichenXu123 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

WeichenXu123 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379969968
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,59 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+        therefore return same results.
+
+        .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+            such as attribute names.
+
+        .. note::This API can compare both :class:`DataFrame`\\s very fast but can still return
+            `False` on the :class:`DataFrame` that return the same results, for instance, from
+            different plans. Such false negative semantic can be useful when caching as an example.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+        True
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+        False
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+        .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+            simplified by tolerating the cosmetic differences such as attribute names.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df2.withColumn("col1", df2.id * 2).semanticHash()
+        True
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df3.withColumn("col1", df3.id + 2).semanticHash()
+        False
 
 Review comment:
   @liangz1 I think you should check correctness via `df.sameSemantics`, not via the hash, because hashes can collide.
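
To illustrate the point about collisions, here is a minimal sketch of a cache lookup that uses the hash only to narrow down candidates and relies on `sameSemantics` for the final decision. `PlanCache` and `FakeFrame` are hypothetical names introduced for illustration; `FakeFrame` is a stand-in that only mimics the two methods under review, so the sketch runs without a Spark session and says nothing about how Spark computes the hash.

```python
from collections import defaultdict


class FakeFrame:
    """Hypothetical stand-in exposing the two methods discussed in this PR."""

    def __init__(self, plan, forced_hash=None):
        self.plan = plan
        self._forced_hash = forced_hash

    def semanticHash(self):
        # A forced hash lets us simulate a collision between different plans.
        if self._forced_hash is not None:
            return self._forced_hash
        return hash(self.plan)

    def sameSemantics(self, other):
        return self.plan == other.plan


class PlanCache:
    """Hypothetical cache: bucket by semanticHash, confirm with sameSemantics."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def put(self, df, value):
        self._buckets[df.semanticHash()].append((df, value))

    def get(self, df):
        # Equal hashes only nominate candidates; they do not prove equality.
        for candidate, value in self._buckets.get(df.semanticHash(), []):
            if candidate.sameSemantics(df):
                return value
        return None


times_two = FakeFrame("id * 2", forced_hash=42)
plus_two = FakeFrame("id + 2", forced_hash=42)  # deliberately colliding hash

cache = PlanCache()
cache.put(times_two, "cached result for id * 2")

hit = cache.get(FakeFrame("id * 2", forced_hash=42))
miss = cache.get(plus_two)
```

With a real `DataFrame`, the same pattern would apply: treat `semanticHash()` as a bucket key and call `sameSemantics` before reusing anything.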


[GitHub] [spark] SparkQA removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586803973
 
 
   **[Test build #118523 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118523/testReport)** for PR 27565 at commit [`65c5210`](https://github.com/apache/spark/commit/65c5210edeba283253986978c3fc5fde68129f71).


[GitHub] [spark] WeichenXu123 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

WeichenXu123 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379374685
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,50 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.sameSemantics(df2)
+        False
+        >>> df1.sameSemantics(df3)
+        False
+        >>> df1.sameSemantics(df4)
+        True
+        >>> df1.sameSemantics(df1)
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a `hashCode` for the calculation performed by the query plan of this Dataset.
+
+        >>> df1 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df2 = spark.createDataFrame([(1, 2),(4, 5)], ["col0", "col2"])
+        >>> df3 = spark.createDataFrame([(0, 2),(4, 5)], ["col1", "col2"])
+        >>> df4 = spark.createDataFrame([(1, 2),(4, 5)], ["col1", "col2"])
+        >>> df1.semanticHash() == df2.semanticHash()
+        False
+        >>> df1.semanticHash() == df3.semanticHash()
+        False
+        >>> df1.semanticHash() == df4.semanticHash()
+        True
 
 Review comment:
   Let's change the unit test. Don't test on DataFrames created from an in-memory list (`LocalRelation`); their implementation differs between Scala and PySpark.
   Our use case also does not depend on the behavior of `LocalRelation`.
   So I suggest adding a unit test on `spark.range(100)`: apply some equivalent transformations to it and check whether `sameSemantics` returns True.
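
To illustrate the kind of check suggested here without needing a Spark session, the following is a toy model in plain Python. `Plan`, `canonicalize`, `same_semantics` and `semantic_hash` are hypothetical names introduced for this sketch, not Spark APIs. It only mimics the idea that the comparison ignores cosmetic differences such as output column names while still distinguishing different expressions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Plan:
    """Toy logical plan: a source plus (output_name, expression) projections."""
    source: str
    projections: Tuple[Tuple[str, str], ...]


def canonicalize(plan: Plan) -> Tuple[str, Tuple[str, ...]]:
    # Drop the output names (cosmetic); keep the source and the expressions in order.
    return plan.source, tuple(expr for _name, expr in plan.projections)


def same_semantics(a: Plan, b: Plan) -> bool:
    # Two toy plans are "the same" when their canonical forms are equal.
    return canonicalize(a) == canonicalize(b)


def semantic_hash(plan: Plan) -> int:
    # Hash the canonical form, so cosmetically different plans hash alike.
    return hash(canonicalize(plan))


base = "range(100)"
p1 = Plan(base, (("col1", "id * 2"),))
p2 = Plan(base, (("col0", "id * 2"),))  # same expression, different column name
p3 = Plan(base, (("col1", "id + 2"),))  # same column name, different expression
```

In this model `same_semantics(p1, p2)` holds and `same_semantics(p1, p3)` does not. A real test would build the equivalent DataFrames from `spark.range(100)` instead.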


[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585839557
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586294634
 
 
   **[Test build #118419 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118419/testReport)** for PR 27565 at commit [`d154d6b`](https://github.com/apache/spark/commit/d154d6bb991f6dbc284645db33b49206de61e56a).
    * This patch **fails PySpark unit tests**.
    * This patch **does not merge cleanly**.
    * This patch adds no public classes.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586110260
 
 
   Build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586203260
 
 
   Build finished. Test PASSed.


[GitHub] [spark] mengxr commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

mengxr commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379272784
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,22 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.0)
 
 Review comment:
   We don't need it for 3.0.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586340929
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586803973
 
 
   **[Test build #118523 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118523/testReport)** for PR 27565 at commit [`65c5210`](https://github.com/apache/spark/commit/65c5210edeba283253986978c3fc5fde68129f71).


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586177603
 
 
   Build finished. Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585872197
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586110260
 
 
   Build finished. Test PASSed.


[GitHub] [spark] WeichenXu123 commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

WeichenXu123 commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379225187
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,22 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.0)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+        """
+        return self._jdf.sameSemantics(other)
+
+    @since(3.0)
+    def semanticHash(self):
+        """
+
+        :return:
+        """
+        return self._jdf.semanticHash(None)
 
 Review comment:
   should be `self._jdf.semanticHash()`


[GitHub] [spark] liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379977002
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,59 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+        therefore return same results.
+
+        .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+            such as attribute names.
+
+        .. note:: This API can compare both :class:`DataFrame`\\s very fast but can still return
+            `False` on the :class:`DataFrame` that return the same results, for instance, from
+            different plans. Such false negative semantic can be useful when caching as an example.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+        True
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+        False
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+        .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+            simplified by tolerating the cosmetic differences such as attribute names.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df2.withColumn("col1", df2.id * 2).semanticHash()
+        True
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df3.withColumn("col1", df3.id + 2).semanticHash()
+        False
 
 Review comment:
   I am concerned that the hash collision rate is too high, which could contradict people's expectations. Typically, if two messages (in this case, DataFrames) differ subtly, their hashes should differ. Do you think we should add a warning in the docstring?
   @HyukjinKwon @WeichenXu123 @mengxr 
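
One way a caching layer could stay correct despite the collision concern raised here is to use the hash only as a bucket key and confirm every hit with an equality check. This is a minimal sketch in plain Python. `SemanticCache` and its `key_hash`/`same` callables are hypothetical stand-ins for `semanticHash` and `sameSemantics`, not part of Spark or of this PR.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


class SemanticCache:
    """Cache that buckets by hash and confirms hits with an equality check."""

    def __init__(self, key_hash: Callable[[Any], int],
                 same: Callable[[Any, Any], bool]) -> None:
        self._key_hash = key_hash
        self._same = same
        self._buckets: Dict[int, List[Tuple[Any, Any]]] = {}

    def put(self, key: Any, value: Any) -> None:
        bucket = self._buckets.setdefault(self._key_hash(key), [])
        for i, (existing, _) in enumerate(bucket):
            if self._same(existing, key):
                bucket[i] = (existing, value)  # overwrite the equivalent entry
                return
        bucket.append((key, value))

    def get(self, key: Any) -> Optional[Any]:
        # A matching hash alone is not a hit; the equality check decides.
        for existing, value in self._buckets.get(self._key_hash(key), []):
            if self._same(existing, key):
                return value
        return None


# Deliberately weak hash: every key collides, yet lookups stay correct.
cache = SemanticCache(key_hash=lambda k: 0, same=lambda a, b: a == b)
cache.put("plan-a", "result-a")
cache.put("plan-b", "result-b")
```

With this shape, a high collision rate costs lookup time but not correctness, which may be enough to note in the docstring rather than warn against.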


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679656
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118491/
   Test FAILed.


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379229413
 
 

 ##########
 File path: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
 ##########
 @@ -1899,6 +1899,21 @@ class DatasetSuite extends QueryTest
     val e = intercept[AnalysisException](spark.range(1).tail(-1))
     e.getMessage.contains("tail expression must be equal to or greater than 0")
   }
+
+  test("sameSemantics and semanticHash work") {
 
 Review comment:
   Shall we add the `SPARK-30791:` prefix to the test name, even though this is a feature API? See the "Pull Request" section at https://spark.apache.org/contributing.html


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586177612
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23169/
   Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679648
 
 
   Merged build finished. Test FAILed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586188210
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23174/
   Test PASSed.


[GitHub] [spark] SparkQA removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585839043
 
 
   **[Test build #118373 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118373/testReport)** for PR 27565 at commit [`103979a`](https://github.com/apache/spark/commit/103979a205318d52e49b44cbaceae9c2ca569e8b).


[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586948605
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23327/
   Test PASSed.


[GitHub] [spark] SparkQA removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586934216
 
 
   **[Test build #118571 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118571/testReport)** for PR 27565 at commit [`0e74940`](https://github.com/apache/spark/commit/0e749403b4bb127341849827cc95313752b6c715).


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586177694
 
 
   Build finished. Test FAILed.


[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586192634
 
 
   **[Test build #118417 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118417/testReport)** for PR 27565 at commit [`2cbce71`](https://github.com/apache/spark/commit/2cbce71bf1d59b2557b75535529198221d5c3f9d).


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586177704
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118412/
   Test FAILed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585820740
 
 
   Can one of the admins verify this patch?


[GitHub] [spark] WeichenXu123 commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

WeichenXu123 commented on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-585835273
 
 
   ok to test


[GitHub] [spark] WeichenXu123 edited a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

WeichenXu123 edited a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-587245610
 
 
   Merge to master. Thanks!


[GitHub] [spark] SparkQA removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586674793
 
 
   **[Test build #118489 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118489/testReport)** for PR 27565 at commit [`a1d4ba1`](https://github.com/apache/spark/commit/a1d4ba1f33c81435da84cbeee3c7e579e5dd8061).


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379991667
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,58 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+        therefore return same results.
+
+        .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+            such as attribute names.
+
+        .. note:: This API can compare both :class:`DataFrame`\\s very fast but can still return
+            `False` on the :class:`DataFrame` that return the same results, for instance, from
+            different plans. Such false negative semantic can be useful when caching as an example.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+        True
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+        False
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+        True
+        """
 
 Review comment:
   nit:
   
   ```python
           >>> df1 = spark.range(100)
           >>> df2 = spark.range(100)
           >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
           True
           >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id + 2))
           False
           >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col0", df2.id * 2))
           True
   ```
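
The docstring above mentions caching as a use case. As a rough sketch of how the two methods might be combined, the hypothetical `SemanticCache` below buckets entries by `semanticHash()` and confirms each candidate with `sameSemantics()`, since hash codes can collide. `SemanticCache` and `FakePlan` are illustrative names only and are not part of PySpark; `FakePlan` stands in for a `DataFrame` so the sketch runs without a Spark session.

```python
class SemanticCache:
    """Toy cache keyed by a DataFrame-like object's semantic hash.

    Hypothetical helper, not part of PySpark. It only assumes that keys
    expose semanticHash() and sameSemantics(other), as added in this PR.
    """

    def __init__(self):
        # semantic hash -> list of (key, value) pairs sharing that hash
        self._buckets = {}

    def get(self, df):
        # Hashes may collide, so confirm every candidate with sameSemantics.
        for key, value in self._buckets.get(df.semanticHash(), []):
            if key.sameSemantics(df):
                return value
        return None

    def put(self, df, value):
        bucket = self._buckets.setdefault(df.semanticHash(), [])
        for i, (key, _) in enumerate(bucket):
            if key.sameSemantics(df):
                bucket[i] = (key, value)  # overwrite the existing entry
                return
        bucket.append((df, value))


class FakePlan:
    """Stand-in for a DataFrame so the sketch runs without Spark (assumption)."""

    def __init__(self, expr):
        self.expr = expr

    def semanticHash(self):
        return hash(self.expr)

    def sameSemantics(self, other):
        return self.expr == other.expr


cache = SemanticCache()
cache.put(FakePlan("id * 2"), "cached result")
hit = cache.get(FakePlan("id * 2"))   # semantically equal key
miss = cache.get(FakePlan("id + 2"))  # different expression
```

With real DataFrames, a `False` from `sameSemantics` on two plans that happen to produce the same rows would only cause a cache miss, which is the safe direction for a cache.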


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586069483
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23142/
   Test PASSed.


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379229633
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,22 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.0)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+        """
 
 Review comment:
   Can you add a doctest too?


[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679628
 
 
   **[Test build #118491 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118491/testReport)** for PR 27565 at commit [`ddba494`](https://github.com/apache/spark/commit/ddba494405ea7ce79e03a8e93f97cd64a3a2acfc).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379982400
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,59 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+        therefore return same results.
+
+        .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+            such as attribute names.
+
+        .. note::This API can compare both :class:`DataFrame`\\s very fast but can still return
+            `False` on the :class:`DataFrame` that return the same results, for instance, from
+            different plans. Such false negative semantic can be useful when caching as an example.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+        True
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+        False
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+        .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+            simplified by tolerating the cosmetic differences such as attribute names.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df2.withColumn("col1", df2.id * 2).semanticHash()
+        True
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df3.withColumn("col1", df3.id + 2).semanticHash()
+        False
 
 Review comment:
   Sure, PR: https://github.com/apache/spark/pull/27601 and the JIRA: https://issues.apache.org/jira/browse/SPARK-30847
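
The `semanticHash` note in the hunk above says the hash is computed against a plan simplified by tolerating cosmetic differences such as attribute names. The toy model below illustrates that idea only; it is not Spark's canonicalization. Plans are modelled as nested tuples, and `canonicalize`, `semantic_hash`, and `same_semantics` are hypothetical names introduced for this sketch.

```python
def canonicalize(node):
    """Drop alias names from a toy plan so only the structure remains.

    A plan is a nested tuple; ("alias", name, child) marks a named column.
    This is an illustrative assumption, not Spark's real plan representation.
    """
    if isinstance(node, tuple):
        if node and node[0] == "alias":
            _, _name, child = node
            return ("alias", canonicalize(child))  # the name is cosmetic
        return tuple(canonicalize(part) for part in node)
    return node


def semantic_hash(plan):
    # Hash the canonical form, so renamed columns hash identically.
    return hash(canonicalize(plan))


def same_semantics(a, b):
    # Compare canonical forms rather than the raw plans.
    return canonicalize(a) == canonicalize(b)


base = ("range", 100)
plan_col1 = ("project", base, ("alias", "col1", ("mul", "id", 2)))
plan_col0 = ("project", base, ("alias", "col0", ("mul", "id", 2)))
plan_plus = ("project", base, ("alias", "col1", ("add", "id", 2)))

renamed_equal = same_semantics(plan_col1, plan_col0)    # differs only by name
different_expr = same_semantics(plan_col1, plan_plus)   # mul vs add
```

This mirrors the doctest cases: renaming `col1` to `col0` leaves the canonical form unchanged, while swapping `* 2` for `+ 2` does not.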


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679649
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118489/
   Test FAILed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586809430
 
 
   Merged build finished. Test PASSed.


[GitHub] [spark] SparkQA commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586864080
 
 
   **[Test build #118525 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118525/testReport)** for PR 27565 at commit [`af476cb`](https://github.com/apache/spark/commit/af476cbd42ec491e9a860be4fcd66ba8c49256a4).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


[GitHub] [spark] SparkQA removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676449
 
 
   **[Test build #118491 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118491/testReport)** for PR 27565 at commit [`ddba494`](https://github.com/apache/spark/commit/ddba494405ea7ce79e03a8e93f97cd64a3a2acfc).


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676503
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23248/
   Test PASSed.


[GitHub] [spark] AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586864635
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118525/
   Test FAILed.


[GitHub] [spark] liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#discussion_r379289223
 
 

 ##########
 File path: python/pyspark/sql/dataframe.py
 ##########
 @@ -2153,6 +2153,22 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.0)
+    def sameSemantics(self, other):
+        """
+        Return true when the query plan of the given :class:`DataFrame` will return the same
+        results as this :class:`DataFrame`.
+        """
+        return self._jdf.sameSemantics(other)
 
 Review comment:
   I ran the Scala test "sameSemantics and semanticHash work" in sbt and it passed. I then uninstalled and reinstalled pyspark with `python setup.py install` in the virtualenv. Is that the correct routine?


[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791] Dataframe add sameSemantics and sementicHash method
URL: https://github.com/apache/spark/pull/27565#issuecomment-586203260
 
 
   Build finished. Test PASSed.


[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586679630
 
 
   **[Test build #118493 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118493/testReport)** for PR 27565 at commit [`61f7ca1`](https://github.com/apache/spark/commit/61f7ca11af14d399d0e2512c51c2f37c4aa4a38f).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586864635
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118525/
   Test FAILed.


[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on issue #27565: [SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586931023
 
 
   Merged build finished. Test PASSed.
